Implement XAttention-based block selection for sparse attention:
- Use flat_group_gemm_fuse_reshape to compute Q@K^T attention scores
- Apply softmax_fuse_block_sum to aggregate into block-level attention scores (naive reference sketched below the list)
- Use find_blocks_chunked for threshold-based block selection
- Handle GQA by aggregating within KV head groups first
- Use majority voting (>50%) across heads instead of any() for better sparsity (selection sketch below)
- Align block_size with CPU offload block size (1024 tokens / stride = 128)
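A minimal PyTorch reference (not the repo's kernels) for what the two fused kernels are described as producing: plain Q@K^T scores, softmax, then a sum-pool into block-level tiles. The helper name, tensor shapes, and the 128-token block size are illustrative assumptions, and the strided reshape that the real GEMM kernel fuses in is not modeled here.

```python
import torch
import torch.nn.functional as F

def block_attention_scores(q, k, block_size=128):
    """q: [heads, q_len, dim], k: [heads, k_len, dim] -> [heads, q_blocks, k_blocks]."""
    scale = q.shape[-1] ** -0.5
    scores = torch.einsum("hqd,hkd->hqk", q, k) * scale   # Q@K^T (fused kernel: flat_group_gemm_fuse_reshape)
    probs = F.softmax(scores, dim=-1)                     # softmax over keys
    h, q_len, k_len = probs.shape
    # Sum attention mass into (block_size x block_size) tiles
    # (fused kernel: softmax_fuse_block_sum).
    probs = probs.view(h, q_len // block_size, block_size,
                       k_len // block_size, block_size)
    return probs.sum(dim=(2, 4))
```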
Test results show ~45% density at chunk 40 (down from 100% with any() aggregation).
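A sketch of the selection step under stated assumptions: `select_blocks`, `threshold`, and `group_size` are made-up names, not find_blocks_chunked's actual signature, and the within-group reduction is a guess. It only illustrates threshold-based selection plus the >50% vote that replaces any().

```python
import torch

def select_blocks(block_scores, group_size, threshold=0.9):
    """block_scores: [q_heads, q_blocks, k_blocks] -> [q_blocks, k_blocks] bool mask."""
    # Per head and query block: keep the highest-scoring key blocks until
    # `threshold` of the row's attention mass is covered (the block that
    # crosses the threshold is included).
    sorted_scores, order = block_scores.sort(dim=-1, descending=True)
    total = sorted_scores.sum(dim=-1, keepdim=True)
    csum = sorted_scores.cumsum(dim=-1) / total
    keep_sorted = (csum - sorted_scores / total) < threshold
    per_head = torch.zeros_like(block_scores).scatter_(-1, order, keep_sorted.float())

    qh, qb, kb = per_head.shape
    # GQA: first reduce the query heads that share one KV head (assumed reduction).
    grouped = per_head.view(qh // group_size, group_size, qb, kb).mean(dim=1) > 0.5
    # Majority vote (>50%) across KV heads instead of any(): a key block is
    # kept only if more than half of the heads selected it.
    return grouped.float().mean(dim=0) > 0.5
```

Swapping the last line for `grouped.any(dim=0)` reproduces the denser any() behavior this commit moves away from.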
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add test_xattn_kernels.py demonstrating the flat_group_gemm_fuse_reshape
  and softmax_fuse_block_sum Triton kernels with structured data (see the sketch after this list)
- Update testing.md with new test code style guidelines
- Update xattn.py and xattn_bsa.py with improvements
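The kind of structured-data check the new test could perform, shown only as an illustration: a planted "hot" key block should dominate a naive block-score reference, and in the actual test the same tensors would also be fed through the Triton kernels for comparison. All shapes and constants here are placeholders, not the test's real values.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
heads, q_len, k_len, dim, block = 4, 256, 1024, 64, 128
q = torch.randn(heads, q_len, dim)
k = torch.randn(heads, k_len, dim) * 0.01
# Plant a "hot" key block (index 3): keys aligned with the queries and scaled up.
k[:, 3 * block:4 * block] = q[:, :block] * 5.0

probs = F.softmax(torch.einsum("hqd,hkd->hqk", q, k) / dim ** 0.5, dim=-1)
block_scores = probs.view(heads, q_len // block, block, k_len // block, block).sum((2, 4))

assert block_scores.argmax(dim=-1).eq(3).all(), "hot key block should dominate"
```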
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>