nano-vllm/nanovllm/kvcache/sparse/xattn_bsa.py at 832b352afa965b118f784342b25b3ae6266ab3fb

Zijie Tian 832b352afa ✨ feat(xattn): implement select_blocks with majority voting aggregation

Implement XAttention-based block selection for sparse attention:
- Use flat_group_gemm_fuse_reshape to compute Q@K^T attention scores
- Apply softmax_fuse_block_sum to aggregate into block-level attention
- Use find_blocks_chunked for threshold-based block selection
- Handle GQA by aggregating within KV head groups first
- Use majority voting (>50%) across heads instead of any() for better sparsity
- Align block_size with CPU offload block size (1024 tokens / stride = 128)

Test results show ~45% density at chunk 40 (down from 100% with any() aggregation).
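
The difference between any() and majority-vote aggregation can be sketched in plain Python as below. This is an illustration under stated assumptions, not the code in xattn_bsa.py: the names `aggregate_heads` and `density` are hypothetical, masks are plain lists of booleans rather than tensors, and merging the query heads inside each KV head group with any() is an assumption (the commit only says GQA groups are aggregated first).

```python
def aggregate_heads(head_masks, num_kv_heads, mode="majority"):
    """Combine per-query-head block masks into one mask over KV blocks.

    head_masks: one list of booleans per query head, one entry per KV block.
    Hypothetical helper for illustration only.
    """
    num_heads = len(head_masks)
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be a multiple of num_kv_heads")
    group_size = num_heads // num_kv_heads
    num_blocks = len(head_masks[0])

    # Step 1 (GQA): merge the query heads that share one KV head.
    # Assumption: a block is kept for the group if any head in it selects it.
    kv_masks = []
    for g in range(num_kv_heads):
        group = head_masks[g * group_size:(g + 1) * group_size]
        kv_masks.append([any(h[b] for h in group) for b in range(num_blocks)])

    # Step 2: aggregate across KV heads.
    if mode == "any":
        # Keep a block if at least one KV head wants it.
        return [any(m[b] for m in kv_masks) for b in range(num_blocks)]
    if mode == "majority":
        # Keep a block only if strictly more than 50% of KV heads want it.
        return [2 * sum(m[b] for m in kv_masks) > num_kv_heads
                for b in range(num_blocks)]
    raise ValueError(f"unknown mode: {mode}")


def density(mask):
    """Fraction of blocks that are kept."""
    return sum(mask) / len(mask)


# Toy input: 6 query heads, 3 KV heads (group size 2), 4 KV blocks.
T, F = True, False
head_masks = [
    [T, F, F, F],
    [T, T, F, F],
    [T, F, F, F],
    [T, F, T, F],
    [T, F, F, F],
    [F, F, F, F],
]

any_mask = aggregate_heads(head_masks, num_kv_heads=3, mode="any")
vote_mask = aggregate_heads(head_masks, num_kv_heads=3, mode="majority")
print(any_mask, density(any_mask))    # [True, True, True, False] 0.75
print(vote_mask, density(vote_mask))  # [True, False, False, False] 0.25
```

With any(), a single head that selects a block forces that block to be loaded for every head, so the union tends toward full density as the number of heads grows. A strict majority threshold drops blocks that only a minority of heads select, which matches the direction of the density reduction reported above; the toy numbers here are made up and are not the test results.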

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-23 08:19:05 +08:00

15 KiB