Implement XAttention-based block selection for sparse attention: - Use flat_group_gemm_fuse_reshape to compute Q@K^T attention scores - Apply softmax_fuse_block_sum to aggregate into block-level attention - Use find_blocks_chunked for threshold-based block selection - Handle GQA by aggregating within KV head groups first - Use majority voting (>50%) across heads instead of any() for better sparsity - Align block_size with CPU offload block size (1024 tokens / stride = 128) Test results show ~45% density at chunk 40 (down from 100% with any() aggregation). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
15 KiB
15 KiB