nano-vllm

Files

Zijie Tian 076656c9c2 ✨ feat: add GPU-only XAttention BSA sparse attention support

- Implement compute_prefill() in XAttentionBSAPolicy for GPU-only mode
  - Uses xattn_estimate to compute sparse block mask
  - Uses block_sparse_attn_func for efficient sparse attention
  - Handles GQA by expanding K/V heads
  - Falls back to flash_attn for paged KV cache (prefix cache)
- Implement compute_decode() by delegating to FullAttentionPolicy
- Add --policy xattn option to bench.py

Verified: RULER 32k niah_single_1 5/5 samples passed (100%)

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>

2026-01-27 05:19:24 +08:00
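
The commit message above says that compute_prefill() "handles GQA by expanding K/V heads". As a minimal sketch of what that expansion means (not the repository's actual code), the snippet below repeats each K/V head so that every query head has a matching K/V head. The function name `expand_kv_heads` and the list-based representation are hypothetical, chosen only for illustration. With PyTorch tensors shaped [seq_len, num_kv_heads, head_dim], the same layout could be produced with `torch.repeat_interleave(k, group, dim=1)`.

```python
def expand_kv_heads(kv_heads, num_q_heads):
    """Repeat each K/V head so there is one entry per query head.

    Hypothetical helper for illustration: `kv_heads` is a list with one
    item per K/V head (any per-head payload), and `num_q_heads` is the
    number of query heads sharing them under grouped-query attention.
    """
    num_kv_heads = len(kv_heads)
    if num_kv_heads == 0 or num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a positive multiple of num_kv_heads")
    group = num_q_heads // num_kv_heads  # query heads per K/V head
    # Consecutive query heads share one K/V head, so each head is
    # repeated `group` times in place (interleaved, not tiled).
    return [head for head in kv_heads for _ in range(group)]


kv = ["kv0", "kv1"]
expanded = expand_kv_heads(kv, num_q_heads=8)
# Query head i now reads expanded[i], which is K/V head i // 4.
```

Materializing the repeated heads costs extra memory but lets a kernel that expects equal query and K/V head counts run unchanged; this is a general trade-off of the technique, not a statement about how this repository implements it.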

policies

[feat] Added chunked prefill and kvcache offload mechanism.

2025-12-10 03:47:37 +08:00

sparse

✨ feat: add GPU-only XAttention BSA sparse attention support

2026-01-27 05:19:24 +08:00

__init__.py

✨ feat: integrate sparse policy architecture into GPU-only mode

2026-01-27 05:08:02 +08:00

base_manager.py

[feat] Added chunked prefill and kvcache offload mechanism.