Zijie Tian
ac1ccbceaa
feat: add XAttention sparse policy integration
Integrate the COMPASS XAttention algorithm into nano-vllm's CPU-offload
execution path. The offload path uses FlashAttention with native GQA
support.
New files:
- nanovllm/kvcache/sparse/utils.py: find_blocks_chunked() utility
- nanovllm/kvcache/sparse/kernels.py: Triton kernels for XAttention
- nanovllm/kvcache/sparse/xattn.py: XAttentionPolicy implementation
Modified:
- nanovllm/config.py: Add XATTN configuration parameters
- nanovllm/engine/model_runner.py: Support XATTN policy
- nanovllm/kvcache/sparse/__init__.py: Register XAttentionPolicy
- tests/test_ruler.py: Add --sparse-policy parameter
Test results (32k ruler):
- NIAH tasks: 12/12 (100%)
- QA/Recall tasks: 11/15 (73%)
- Overall: 23/27 (85%)
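For context, XAttention-style block selection scores each attention block by summing strided antidiagonals, then keeps the smallest set of blocks whose softmax mass crosses a threshold. The sketch below is illustrative only: function names, block size, stride, and threshold are assumptions, and it is NumPy rather than the Triton kernels added in nanovllm/kvcache/sparse/kernels.py.

```python
import numpy as np

def antidiagonal_block_scores(scores, block=8, stride=4):
    # scores: (q_len, k_len) raw attention scores for one head.
    # Score each (block x block) tile by summing every `stride`-th
    # antidiagonal -- a cheap proxy for the tile's attention mass.
    nq, nk = scores.shape[0] // block, scores.shape[1] // block
    out = np.zeros((nq, nk))
    for i in range(nq):
        for j in range(nk):
            tile = scores[i * block:(i + 1) * block,
                          j * block:(j + 1) * block]
            flipped = np.fliplr(tile)  # antidiagonals become diagonals
            out[i, j] = sum(
                np.trace(flipped, offset=d - (block - 1))
                for d in range(0, 2 * block - 1, stride)
            )
    return out

def select_blocks(block_scores, threshold=0.9):
    # Per block-row, keep the fewest blocks whose softmax-normalized
    # scores cumulatively reach `threshold`.
    keep = np.zeros_like(block_scores, dtype=bool)
    for r in range(block_scores.shape[0]):
        p = np.exp(block_scores[r] - block_scores[r].max())
        p /= p.sum()
        cum = 0.0
        for idx in np.argsort(-p):
            keep[r, idx] = True
            cum += p[idx]
            if cum >= threshold:
                break
    return keep
```

The returned boolean mask would drive which KV blocks find_blocks_chunked() fetches from CPU; the real policy operates on pooled query/key strips rather than full score matrices.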
Co-Authored-By: Claude <noreply@anthropic.com>
2026-01-14 10:04:46 +08:00