- Rename doc to "Density Alignment Test Results" (covers both modes)
- Add Offload mode test results (3.7K-64.9K tokens, all passed)
- Add Layer 5 GPU-only test results (threshold=0.9, density=6.24%)
- Enhance test script to support both GPU-only and Offload data formats
- Add batch testing commands for all data files
- Update CLAUDE.md index
Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)
Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
Implement three-phase KV chunking for sparse attention estimation:
1. softmax_compute_partial_stats: compute (m, l) per KV chunk
2. merge_softmax_stats: merge partial stats on host
3. softmax_normalize_and_block_sum: normalize with global stats
This allows computing sparse attention masks without storing full
raw attention scores in GPU memory, reducing peak memory usage
from O(q_len * k_full_len) to O(q_len * k_chunk_len).
Key changes:
- Add softmax_partial_stats_kernel with causal mask support
- Add softmax_normalize_block_sum_kernel with kv_offset parameter
- Add Python wrappers for new kernels
- Update test script to validate KV chunking alignment
- Add documentation for the new kernels
Test results show perfect alignment with xattn_estimate API:
- Density difference: 0.000000
- Mask difference: 0.0044%
Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)
Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
Match xattn_estimate internal logic by processing Q in chunks:
- Reduces peak memory for attn_scores tensor
- Enables testing 64K sequences without OOM
- All 5 test files pass (3.6K to 64K)
Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)
Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
Test that xattn_estimate produces the same results as manually calling:
- flat_group_gemm_fuse_reshape
- softmax_fuse_block_sum
- find_blocks_chunked
Uses real KV cache data from results/kvcache/ directory.
Verifies density calculation matches between high-level API and kernels.
Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)
Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>