Zijie Tian
|
ac1ccbceaa
|
feat: add XAttention sparse policy integration

Integrate COMPASS XAttention algorithm into nano-vllm's CPU offload
execution path. Uses FlashAttention with native GQA support for
offload mode.

New files:
- nanovllm/kvcache/sparse/utils.py: find_blocks_chunked() utility
- nanovllm/kvcache/sparse/kernels.py: Triton kernels for XAttention
- nanovllm/kvcache/sparse/xattn.py: XAttentionPolicy implementation

Modified:
- nanovllm/config.py: Add XATTN configuration parameters
- nanovllm/engine/model_runner.py: Support XATTN policy
- nanovllm/kvcache/sparse/__init__.py: Register XAttentionPolicy
- tests/test_ruler.py: Add --sparse-policy parameter

Test results (RULER, 32k context):
- NIAH tasks: 12/12 (100%)
- QA/Recall tasks: 11/15 (73%)
- Overall: 23/27 (85%)

Co-Authored-By: Claude <noreply@anthropic.com>
|
2026-01-14 10:04:46 +08:00 |
|
Zijie Tian
|
ea4e904de0
|
[claudesquad] update from 'int-minference-1' on 08 Jan 26 23:22 CST
|
2026-01-08 23:22:38 +08:00 |
|
Zijie Tian
|
d8a87da1c3
|
[claudesquad] update from 'layer-prefill-1' on 08 Jan 26 03:36 CST
|
2026-01-08 03:36:39 +08:00 |
|
Zijie Tian
|
2a6e0a2c02
|
[feat] Added Quest Sparsity Policy.
|
2026-01-07 03:29:21 +08:00 |
|
Zijie Tian
|
c99a6f3d3f
|
[WIP] Before adding Quest policy.
|
2026-01-07 02:32:30 +08:00 |
|
Zijie Tian
|
0e691f2d85
|
[WIP] Move metadata to GPU.
|
2026-01-06 23:32:32 +08:00 |
|
Zijie Tian
|
690492e074
|
[WIP] Before refactoring policies.
|
2026-01-06 20:47:55 +08:00 |
|
Zijie Tian
|
054aaff403
|
[fix] Fixed needle test bug.
|
2026-01-05 18:34:09 +08:00 |
|
Zijie Tian
|
051f2295c9
|
[feat] Added sparse KV cache feature; needs verification.
|
2025-12-22 08:51:02 +08:00 |
|