Zijie Tian
|
ac1ccbceaa
|
feat: add XAttention sparse policy integration
Integrate COMPASS XAttention algorithm into nano-vllm's CPU offload
execution path. Uses FlashAttention with native GQA support for
offload mode.
New files:
- nanovllm/kvcache/sparse/utils.py: find_blocks_chunked() utility
- nanovllm/kvcache/sparse/kernels.py: Triton kernels for XAttention
- nanovllm/kvcache/sparse/xattn.py: XAttentionPolicy implementation
Modified:
- nanovllm/config.py: Add XATTN configuration parameters
- nanovllm/engine/model_runner.py: Support XATTN policy
- nanovllm/kvcache/sparse/__init__.py: Register XAttentionPolicy
- tests/test_ruler.py: Add --sparse-policy parameter
Test results (RULER, 32k context):
- NIAH tasks: 12/12 (100%)
- QA/Recall tasks: 11/15 (73%)
- Overall: 23/27 (85%)
Co-Authored-By: Claude <noreply@anthropic.com>
|
2026-01-14 10:04:46 +08:00 |
|
Zijie Tian
|
a6cc703d73
|
[tests] Added test_niah_standalone.py.
|
2026-01-12 00:16:37 +08:00 |
|
Zijie Tian
|
a8c9f0d837
|
[claudesquad] update from 'lw-offload-2' on 08 Jan 26 20:53 CST
|
2026-01-08 20:53:08 +08:00 |
|
Zijie Tian
|
d8a87da1c3
|
[claudesquad] update from 'layer-prefill-1' on 08 Jan 26 03:36 CST
|
2026-01-08 03:36:39 +08:00 |
|
Zijie Tian
|
2a6e0a2c02
|
[feat] Added Quest Sparsity Policy.
|
2026-01-07 03:29:21 +08:00 |
|
Zijie Tian
|
c99a6f3d3f
|
[WIP] Before add Quest policy.
|
2026-01-07 02:32:30 +08:00 |
|
Zijie Tian
|
054aaff403
|
[fix] Fixed needle test bug.
|
2026-01-05 18:34:09 +08:00 |
|
Zijie Tian
|
484d0de9f9
|
[feat] Added debug hook to offload_engine.py.
|
2025-12-31 19:44:39 +08:00 |
|
Zijie Tian
|
782437c486
|
[WIP] Remove num_prefetch_blocks variable.
|
2025-12-24 18:22:26 +08:00 |
|
Zijie Tian
|
051f2295c9
|
[feat] Added sparse KV cache feature; NEEDS VERIFICATION.
|
2025-12-22 08:51:02 +08:00 |
|
Zijie Tian
|
b8b6478506
|
[feat] Needs to be optimized with async prefetch.
|
2025-12-15 06:58:40 +08:00 |
|
Zijie Tian
|
babfa17354
|
[refactor] Translate into English; avoid Chinese due to Claude.
|
2025-12-11 00:30:24 +08:00 |
|
Zijie Tian
|
e85c2b4776
|
[fix] Fixed kvcache offload bugs.
|
2025-12-10 22:34:00 +08:00 |
|
Zijie Tian
|
190df5f70d
|
[refactor] Refactor current GPU and CPU block allocation strategy.
|
2025-12-10 21:23:31 +08:00 |
|
Zijie Tian
|
0a247ccb1b
|
[feat] Added num_gpu_blocks to limit GPU blocks.
|
2025-12-10 20:17:42 +08:00 |
|
Zijie Tian
|
0b6f19242d
|
[feat] Added chunked prefill and KV cache offload mechanism.
|
2025-12-10 03:47:37 +08:00 |
|
GeeeekExplorer
|
658520b788
|
warmup and allocate
|
2025-06-27 01:51:57 +08:00 |
|
GeeeekExplorer
|
fc778a4da9
|
better
|
2025-06-15 10:36:45 +08:00 |
|
cheunglei
|
53b3ef2e32
|
support tensor parallel
|
2025-06-15 01:31:24 +08:00 |
|
GeeeekExplorer
|
f16adb729e
|
refactor
|
2025-06-12 09:41:12 +08:00 |
|
GeeeekExplorer
|
386290d69e
|
refactor
|
2025-06-11 21:12:57 +08:00 |
|
GeeeekExplorer
|
b98e1ca305
|
fix
|
2025-06-10 21:25:54 +08:00 |
|
GeeeekExplorer
|
a5a4909e6a
|
init commit
|
2025-06-10 00:27:01 +08:00 |
|