Commit Graph

47 Commits

Author SHA1 Message Date
Zijie Tian
9f3ee9279e feat: add nanovllm.ops module with XAttention estimation kernels
Add ops module ported from tzj/minference branch containing:
- xattn.py: XAttention block importance estimation with Triton kernels
  - xattn_estimate(): standard estimation for sparse attention mask
  - xattn_estimate_chunked(): chunked prefill compatible version
  - flat_group_gemm_fuse_reshape(): fused stride reshape + GEMM kernel
  - softmax_fuse_block_sum(): online softmax + block-wise sum kernel
- chunked_attention.py: Flash attention with LSE output for chunk merging
- test_xattn_estimate_chunked.py: verification test (all seq_lens pass)

This prepares the foundation for AttentionPolicy refactoring where
XAttentionPolicy.estimate() will call these ops.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 06:00:42 +08:00
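The chunked_attention.py file above outputs LSE (log-sum-exp) alongside the attention result so that partial outputs from separate key chunks can be merged exactly. A minimal NumPy sketch of that merge rule, under the assumption that each chunk yields `(output, lse)` per query (names here are illustrative, not the actual nanovllm API):

```python
import numpy as np

def merge_attention_chunks(o1, lse1, o2, lse2):
    """Merge partial attention outputs over disjoint key chunks.

    o1, o2:  partial outputs, shape (num_q, head_dim)
    lse1, lse2: per-query log-sum-exp of attention scores, shape (num_q,)
    """
    lse = np.logaddexp(lse1, lse2)        # combined normalizer, in log space
    w1 = np.exp(lse1 - lse)[:, None]      # rescale weight for chunk 1
    w2 = np.exp(lse2 - lse)[:, None]      # rescale weight for chunk 2
    return w1 * o1 + w2 * o2, lse

def attn(q, k, v):
    """Reference softmax attention returning (output, lse)."""
    s = q @ k.T
    lse = np.log(np.exp(s).sum(-1))
    return np.exp(s - lse[:, None]) @ v, lse

# Attending to all keys at once equals merging two chunked results.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(16, 8))
v = rng.normal(size=(16, 8))
o_full, _ = attn(q, k, v)
o1, l1 = attn(q, k[:8], v[:8])
o2, l2 = attn(q, k[8:], v[8:])
o_merged, _ = merge_attention_chunks(o1, l1, o2, l2)
```

This identity is what makes chunked prefill exact rather than approximate: no matter how the keys are split, the merged result matches full attention.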
Zijie Tian
ac1ccbceaa feat: add XAttention sparse policy integration
Integrate COMPASS XAttention algorithm into nano-vllm's CPU offload
execution path. Uses FlashAttention with native GQA support for
offload mode.

New files:
- nanovllm/kvcache/sparse/utils.py: find_blocks_chunked() utility
- nanovllm/kvcache/sparse/kernels.py: Triton kernels for XAttention
- nanovllm/kvcache/sparse/xattn.py: XAttentionPolicy implementation

Modified:
- nanovllm/config.py: Add XATTN configuration parameters
- nanovllm/engine/model_runner.py: Support XATTN policy
- nanovllm/kvcache/sparse/__init__.py: Register XAttentionPolicy
- tests/test_ruler.py: Add --sparse-policy parameter

Test results (32k ruler):
- NIAH tasks: 12/12 (100%)
- QA/Recall tasks: 11/15 (73%)
- Overall: 23/27 (85%)

Co-Authored-By: Claude <noreply@anthropic.com>
2026-01-14 10:04:46 +08:00
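The find_blocks_chunked() utility in sparse/utils.py is not shown here; as a hedged sketch, block selection of this kind typically keeps the smallest set of KV blocks whose softmax-normalized importance mass reaches a threshold. The function name, signature, and threshold semantics below are illustrative assumptions, not the actual nanovllm implementation (which runs block-wise on GPU and handles causal masking):

```python
import numpy as np

def find_blocks(block_scores, threshold=0.9):
    """Hypothetical sketch: pick the fewest KV blocks whose cumulative
    softmax-normalized importance reaches `threshold`."""
    probs = np.exp(block_scores - block_scores.max())   # stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                     # most important first
    keep, mass = [], 0.0
    for idx in order:
        keep.append(int(idx))
        mass += probs[idx]
        if mass >= threshold:
            break
    return sorted(keep)

scores = np.array([5.0, 1.0, 4.0, 0.5, 3.0])
print(find_blocks(scores, threshold=0.9))   # → [0, 2, 4]
```

Higher thresholds trade speed for recall: more blocks are attended, so sparse attention stays closer to the dense result.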
Zijie Tian
86633004ca 📝 docs: add 64k memory analysis and test configuration updates
Add comprehensive memory analysis for 64k inference on Llama 3.1 8B:

New documentation:
- docs/64k_memory_analysis.md: GPU-only vs offload memory analysis,
  OOM root cause (memory fragmentation), RTX 3090 limitations,
  theoretical vs actual memory usage breakdown

Test configuration updates:
- tests/test_ruler.py: Add --num-kv-buffers parameter for ring buffer
  size tuning (default 4, can reduce to 1 for lower memory)
- Update default data_dir to ruler_64k
- Update default max_model_len to 65664 for 64k support

CLAUDE.md updates:
- Add 64k_memory_analysis.md to documentation index
- Document num_kv_buffers parameter in Configuration section
- Add 64k hardware requirements note to Model Limits

Key findings: 64k inference requires ~26GB (GPU-only) or ~23GB (offload)
due to memory fragmentation on 24GB GPUs, making A100 (40GB+) the
recommended hardware for 64k workloads.

Co-Authored-By: Claude <noreply@anthropic.com>
2026-01-14 07:02:09 +08:00
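The ~26GB figure above can be sanity-checked with back-of-envelope arithmetic from the published Llama 3.1 8B architecture (32 layers, 8 KV heads under GQA, head_dim 128, fp16). The gap between this estimate and the measured number is activations plus the allocator fragmentation the analysis identifies:

```python
# Back-of-envelope KV-cache sizing for Llama 3.1 8B at 64k context.
# Architecture constants are the published model config; anything beyond
# weights + KV cache (activations, fragmentation) is not modeled here.
layers, kv_heads, head_dim = 32, 8, 128       # GQA: 8 KV heads
bytes_per_elem = 2                            # fp16/bf16

# Factor of 2 = one K plus one V entry per token per layer.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
print(kv_bytes_per_token)                     # 131072 bytes = 128 KiB/token

seq_len = 64 * 1024
kv_cache_gib = kv_bytes_per_token * seq_len / 2**30
print(f"KV cache: {kv_cache_gib:.1f} GiB")    # KV cache: 8.0 GiB

weights_gib = 8e9 * bytes_per_elem / 2**30    # ~14.9 GiB of fp16 weights
print(f"Total: {weights_gib + kv_cache_gib:.1f} GiB")  # Total: 22.9 GiB
```

Roughly 23 GiB before activations and fragmentation already saturates a 24GB RTX 3090, which matches the A100 recommendation.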
Zijie Tian
cf168fd9b9 test: add comprehensive RULER benchmark test suite
- Add test_ruler.py supporting all 13 RULER tasks (NIAH, QA, CWE, FWE, VT)
- Implement RULER official evaluation metrics (string_match_all/part)
- Fix max_model_len to 32896 to prevent decode OOM on long inputs
- Add ruler_benchmark_report.md with full test results (92.1% accuracy)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-14 00:51:30 +08:00
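The RULER metrics named above are substring-containment checks. A rough Python paraphrase of their per-sample form (simplified: case-insensitive containment; the official implementation additionally averages over samples and scales to percent):

```python
def string_match_all(pred: str, refs: list[str]) -> float:
    """Fraction of reference strings found in the prediction.
    Per-sample paraphrase of RULER's string_match_all."""
    p = pred.lower()
    return sum(r.lower() in p for r in refs) / len(refs)

def string_match_part(pred: str, refs: list[str]) -> float:
    """Full credit if any reference string appears in the prediction.
    Per-sample paraphrase of RULER's string_match_part."""
    p = pred.lower()
    return float(any(r.lower() in p for r in refs))

print(string_match_all("The needles are 7421.", ["7421", "1111"]))   # → 0.5
print(string_match_part("The needles are 7421.", ["7421", "1111"]))  # → 1.0
```

Multi-needle tasks use the "all" variant (every needle must be recalled), while tasks with alternative valid answers use "part".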
Zijie Tian
1424e665e7 test: add parallel multi-GPU RULER NIAH test script
Add test_ruler_niah.sh for independent sample testing across multiple GPUs.
Each sample runs in a separate Python process to avoid state accumulation issues.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-12 21:08:27 +08:00
Zijie Tian
64971c8e8a Merge branch 'zijie/fix-dist-3': Fix distributed port conflict
- Auto port allocation with _find_free_port() in model_runner.py
- Resource management refactor with close() + context manager in llm_engine.py
- Add tests/test_port_conflict.py and tests/run_parallel_niah.sh
- Remove docs/torch_distributed_port_issue.md (issue fixed)
- Ignore tests/data/ directory

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-12 16:27:25 +08:00
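The _find_free_port() helper mentioned above follows, in the common pattern, a bind-to-port-0 probe; a sketch under that assumption (the actual model_runner.py code may differ, and a small race window remains between releasing the port and torch.distributed binding it):

```python
import socket

def _find_free_port() -> int:
    """Ask the OS for an ephemeral port by binding to port 0, read the
    assigned number, then release the socket for the caller to reuse."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))       # port 0 = "pick any free port"
        return s.getsockname()[1]

port = _find_free_port()
print(port)   # OS-assigned, varies per run
```

This avoids the hard-coded MASTER_PORT collisions that break concurrent multi-process test runs, which is exactly the scenario run_parallel_niah.sh exercises.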
Zijie Tian
de6f36bdb2 [docs] Added dist port issue. 2026-01-12 15:16:39 +08:00
Zijie Tian
a6cc703d73 [tests] Added test_niah_standalone.py. 2026-01-12 00:16:37 +08:00
Zijie Tian
1425510a2e [claudesquad] update from 'fix-bug-2' on 09 Jan 26 16:05 CST 2026-01-09 16:05:36 +08:00
Zijie Tian
ea4e904de0 [claudesquad] update from 'int-minference-1' on 08 Jan 26 23:22 CST 2026-01-08 23:22:38 +08:00
Zijie Tian
d8a87da1c3 [claudesquad] update from 'layer-prefill-1' on 08 Jan 26 03:36 CST 2026-01-08 03:36:39 +08:00
Zijie Tian
2a6e0a2c02 [feat] Added Quest Sparsity Policy. 2026-01-07 03:29:21 +08:00
Zijie Tian
0e691f2d85 [WIP] move metadata to GPU. 2026-01-06 23:32:32 +08:00
Zijie Tian
edb5273e34 [WIP] Added basic test for quest. 2026-01-06 22:30:31 +08:00
Zijie Tian
535f2037ab [WIP] Before fix bench_offload.py. 2026-01-06 18:41:08 +08:00
Zijie Tian
e554d5482b [refactor] Delete unnecessary tests and refactor the offload prefix cache. 2026-01-05 20:31:42 +08:00
Zijie Tian
d623043a3c [WIP] FIXED decode and prefill NEEDLE test. 2026-01-05 01:51:46 +08:00
Zijie Tian
e897380127 [test] Added test_align.py and Before change nanovllm attention. 2026-01-04 22:48:01 +08:00
Zijie Tian
24096431ed [refactor] refactor test_align.py. 2026-01-04 20:55:40 +08:00
Zijie Tian
00ed17c640 [feat] Added debug tools. 2026-01-03 22:36:40 +08:00
Zijie Tian
8c3418725b [refactor] Refactor needle test. 2026-01-03 19:19:37 +08:00
Zijie Tian
b3685c9190 [test] Added test_align.py 2026-01-03 18:55:58 +08:00
Zijie Tian
6927a75ac3 [refactor] refactor needle.py. 2026-01-03 18:33:48 +08:00
Zijie Tian
ff8b09cd35 [test] Added test_needle_ref.py. 2026-01-02 22:03:23 +08:00
Zijie Tian
74ee6d0895 [WIP] need to fix model to decode normally. 2026-01-01 05:18:27 +08:00
Zijie Tian
62b8a63314 [refactor] Refactor the test_chunked_prefill/decode. 2026-01-01 03:32:26 +08:00
Zijie Tian
965c8aff12 [WIP] need change flashattention to debug. 2026-01-01 00:58:22 +08:00
Zijie Tian
30462fe89a [WIP] Before fix needle. 2025-12-31 23:35:25 +08:00
Zijie Tian
ccd1b3d4ab [WIP] Before modify nanovllm CPU-GPU kvcache. 2025-12-31 22:41:07 +08:00
Zijie Tian
31e90a7268 [test] Added offload correct verify. 2025-12-31 20:59:53 +08:00
Zijie Tian
484d0de9f9 [feat] Added debug hook to offload_engine.py. 2025-12-31 19:44:39 +08:00
Zijie Tian
7af721c12c [WIP] Before modify to FlashInfer. 2025-12-30 01:11:13 +08:00
Zijie Tian
89f8020d38 [WIP] fixing attention compute error. 2025-12-30 00:31:48 +08:00
Zijie Tian
82ed34fc2d [opt] optimize nanovllm performance to be comparable with vllm. 2025-12-25 03:47:07 +08:00
Zijie Tian
16fcf8350b [WIP] replace merge attention with triton kernel. 2025-12-25 01:07:05 +08:00
Zijie Tian
cf5e7df093 [WIP] Added sgDMA operator for scatter kvcache communication. 2025-12-24 23:48:52 +08:00
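sgDMA (scatter-gather DMA) copies many non-contiguous KV blocks in a single operation instead of one transfer per block. A NumPy stand-in for the semantics only (shapes and names are illustrative; the real operator moves data between pinned CPU memory and the GPU):

```python
import numpy as np

# Illustrative cache shapes: (num_blocks, block_size, head_dim).
num_blocks, block_size, head_dim = 8, 16, 64
cpu_cache = np.arange(num_blocks * block_size * head_dim, dtype=np.float32)
cpu_cache = cpu_cache.reshape(num_blocks, block_size, head_dim)
gpu_cache = np.zeros_like(cpu_cache)

# Gather scattered source blocks, scatter into scattered destination slots,
# all in one call -- the access pattern an sgDMA kernel implements in hardware.
src_ids, dst_ids = [5, 1, 6], [0, 1, 2]
gpu_cache[dst_ids] = cpu_cache[src_ids]
```

Batching the block list into one descriptor is what amortizes per-transfer launch overhead when the KV cache lives on the CPU.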
Zijie Tian
6ec1b23982 [WIP] NEED to modify communication. 2025-12-24 21:57:51 +08:00
Zijie Tian
782437c486 [WIP] remove num_prefetch_blocks variable. 2025-12-24 18:22:26 +08:00
Zijie Tian
b264de903d [test] Added a simple test_prefill.py. 2025-12-23 00:26:25 +08:00
Zijie Tian
4dcef16c13 [WIP] NEED to refactor nanovllm mechanism. 2025-12-22 23:52:56 +08:00
Zijie Tian
051f2295c9 [feat] Added sparse KVcache feature, NEED VERIFY. 2025-12-22 08:51:02 +08:00
Zijie Tian
1081ab51ea [refactor] Refactor offload code to multi-chunk. 2025-12-15 01:13:58 +08:00
Zijie Tian
61edb8a344 [feat] Finished offload. Still need optimize performance. 2025-12-12 02:27:40 +08:00
Zijie Tian
babfa17354 [refactor] Translate into English, avoid Chinese due to Claude. 2025-12-11 00:30:24 +08:00
Zijie Tian
190df5f70d [refactor] Refactor current gpu and cpu block allocation strategy. 2025-12-10 21:23:31 +08:00
Zijie Tian
0a247ccb1b [feat] Added num_gpu_blocks limit gpu blocks. 2025-12-10 20:17:42 +08:00
Zijie Tian
0b6f19242d [feat] Added chunked prefill and kvcache offload mechanism. 2025-12-10 03:47:37 +08:00