Replace slow softmax_fuse_block_sum (block_size=4096) with an optimized hierarchical approach (estimate_block_size=1024):

- Add an estimate_block_size parameter to XAttentionBSAPolicy (default 1024)
- Rewrite select_blocks to use hierarchical aggregation:
  1. Fine-grained softmax with a small block size (15x faster kernel)
  2. Aggregate to CPU block level via reshape + sum
  3. Score + threshold selection (replaces mask + voting)

Performance improvement (CPU Offload mode):
- softmax_fuse_block_sum: 48% → 1% of total time (44x faster)
- 128K: XAttention now +2.4% faster than Full (was -59%)
- 64K: -3.8% (was -21%)
- 32K: -6.0% (was -14%)

Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
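The hierarchical aggregation described above can be sketched roughly as follows. This is an illustrative NumPy reimplementation, not the actual select_blocks code: the block sizes are scaled down for readability, and the threshold semantics (keep the smallest set of coarse blocks covering a fixed fraction of attention mass) are an assumption about how "score + threshold selection" works here.

```python
import numpy as np

def select_blocks(scores, fine_block=4, coarse_block=16, threshold=0.9):
    """Illustrative sketch of hierarchical block selection.

    scores: 1-D attention logits for one query row.
    Returns indices of the coarse blocks kept for sparse attention.
    """
    # Softmax over the row (numerically stable form).
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()

    # 1. Fine-grained block sums -- stands in for the output of the
    #    fast small-block softmax_fuse_block_sum kernel.
    fine = probs.reshape(-1, fine_block).sum(axis=1)

    # 2. Aggregate fine blocks up to the CPU (coarse) block level
    #    via reshape + sum.
    ratio = coarse_block // fine_block
    coarse = fine.reshape(-1, ratio).sum(axis=1)

    # 3. Score + threshold selection: smallest set of coarse blocks
    #    whose cumulative attention mass reaches `threshold`.
    order = np.argsort(coarse)[::-1]
    cum = np.cumsum(coarse[order])
    k = int(np.searchsorted(cum, threshold)) + 1
    return np.sort(order[:k])

# One dominant region: only its coarse block should be selected.
scores = np.zeros(64)
scores[:16] = 10.0
selected = select_blocks(scores)
```

With 64 tokens, fine_block=4, and coarse_block=16, step 2 collapses 16 fine sums into 4 coarse scores; the reshape + sum is what replaces running the expensive large-block kernel directly.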