- Implement compute_prefill() in XAttentionBSAPolicy for GPU-only mode
  - Uses xattn_estimate to compute the sparse block mask
  - Uses block_sparse_attn_func for efficient sparse attention
  - Handles GQA by expanding K/V heads
  - Falls back to flash_attn for a paged KV cache (prefix cache)
- Implement compute_decode() by delegating to FullAttentionPolicy
- Add --policy xattn option to bench.py

Verified: RULER 32k niah_single_1, 5/5 samples passed (100%)

Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
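A minimal sketch of the GQA handling mentioned above, assuming K/V tensors shaped [seq, num_heads, head_dim]; the function name and shapes here are illustrative assumptions, not the actual XAttentionBSAPolicy code:

```python
import torch

def expand_kv_heads(k: torch.Tensor, v: torch.Tensor, num_q_heads: int):
    """Repeat each KV head so the KV head count matches the query head
    count, turning a GQA layout into the MHA layout that a dense/sparse
    attention kernel expects. (Illustrative helper, not the real code.)

    k, v: [seq_len, num_kv_heads, head_dim]
    """
    num_kv_heads = k.shape[1]
    group_size = num_q_heads // num_kv_heads
    # repeat_interleave keeps consecutive query heads mapped to the same
    # original KV head, matching the usual GQA grouping.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    return k, v
```

With 2 KV heads and 8 query heads, each KV head is repeated 4 times, so query heads 0-3 attend over KV head 0 and query heads 4-7 over KV head 1.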