nano-vllm

Files

Zijie Tian 2c2383c786 ⚡️ perf: optimize XAttention estimate with hierarchical block sum

Replace slow softmax_fuse_block_sum (block_size=4096) with optimized
hierarchical approach (estimate_block_size=1024):

- Add estimate_block_size parameter to XAttentionBSAPolicy (default 1024)
- Rewrite select_blocks to use hierarchical aggregation:
  1. Fine-grained softmax with small block size (15x faster kernel)
  2. Aggregate to CPU block level via reshape + sum
  3. Score + threshold selection (replaces mask + voting)

Performance improvement (CPU Offload mode):
- softmax_fuse_block_sum: 48% → 1% of total time (44x faster)
- 128K: XAttention now +2.4% faster than Full (was -59%)
- 64K: -3.8% (was -21%)
- 32K: -6.0% (was -14%)

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>

2026-01-28 06:47:13 +08:00

comm

[WIP] Added sgDMA operator for scatter kvcache communication.

2025-12-24 23:48:52 +08:00

debug

[refactor] Refactor the kvcache offload.

2026-01-04 19:37:03 +08:00

engine

📈 feat: add MemoryObserver for GPU-CPU communication tracking

2026-01-28 04:06:45 +08:00

kvcache

⚡️ perf: optimize XAttention estimate with hierarchical block sum