nano-vllm

Files

Zijie Tian 832b352afa ✨ feat(xattn): implement select_blocks with majority voting aggregation

Implement XAttention-based block selection for sparse attention:
- Use flat_group_gemm_fuse_reshape to compute Q@K^T attention scores
- Apply softmax_fuse_block_sum to aggregate into block-level attention
- Use find_blocks_chunked for threshold-based block selection
- Handle GQA by aggregating within KV head groups first
- Use majority voting (>50%) across heads instead of any() for better sparsity
- Align block_size with CPU offload block size (1024 tokens / stride = 128)

Test results show ~45% density at chunk 40 (down from 100% with any() aggregation).
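
A minimal sketch of the aggregation change described above, in plain Python rather than the repository's tensor code. The names `group_any`, `vote_any`, `vote_majority` and `density` are hypothetical and do not come from xattn_bsa.py. It also assumes that the within-group GQA step is a logical OR over the query heads sharing a KV head; the commit message does not say which reduction is used there.

```python
from typing import List

# mask[head][block] is True when that head wants to keep the KV block.
Mask = List[List[bool]]


def group_any(q_head_masks: Mask, num_kv_heads: int) -> Mask:
    """Collapse query-head masks to one mask per KV head (GQA).

    Assumption: a KV head keeps a block if any query head in its group does.
    """
    group_size = len(q_head_masks) // num_kv_heads
    kv_masks = []
    for kv in range(num_kv_heads):
        group = q_head_masks[kv * group_size:(kv + 1) * group_size]
        kv_masks.append([any(col) for col in zip(*group)])
    return kv_masks


def vote_any(kv_masks: Mask) -> List[bool]:
    """Keep a block if at least one KV head selected it."""
    return [any(col) for col in zip(*kv_masks)]


def vote_majority(kv_masks: Mask) -> List[bool]:
    """Keep a block only if strictly more than 50% of KV heads selected it."""
    n = len(kv_masks)
    return [2 * sum(col) > n for col in zip(*kv_masks)]


def density(block_mask: List[bool]) -> float:
    """Fraction of blocks kept."""
    return sum(block_mask) / len(block_mask)


# Toy input: 8 query heads, 4 KV heads (group size 2), 4 KV blocks.
T, F = True, False
q_masks = [
    [T, F, F, F], [T, T, F, F],   # KV head 0
    [T, F, F, F], [T, F, T, F],   # KV head 1
    [T, F, F, F], [T, F, F, F],   # KV head 2
    [T, F, F, T], [F, F, F, F],   # KV head 3
]
kv_masks = group_any(q_masks, num_kv_heads=4)
any_mask = vote_any(kv_masks)            # [True, True, True, True]
majority_mask = vote_majority(kv_masks)  # [True, False, False, False]
print(density(any_mask), density(majority_mask))  # 1.0 0.25
```

In this toy input every block is picked by at least one head, so `any()` keeps all of them, while the majority vote keeps only the block that most heads agree on. That is the same effect, on a small scale, as the density drop reported in the test results.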

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-23 08:19:05 +08:00

__init__.py

[WIP] Before refactoring the nanovllm sparse policy.

2026-01-19 22:34:44 +08:00

full_policy.py

♻️ refactor: move select_blocks from policy to attention layer

2026-01-23 05:21:28 +08:00

policy.py

♻️ refactor: move select_blocks from policy to attention layer

2026-01-23 05:21:28 +08:00

quest.py

[WIP] Before adding the Quest policy.

2026-01-07 02:32:30 +08:00

xattn_bsa.py

✨ feat(xattn): implement select_blocks with majority voting aggregation

2026-01-23 08:19:05 +08:00