Zijie Tian
f049971f84
✅ test: add hierarchical block sum estimation validation
...
Validate the hierarchical estimation approach for XAttention:
- Test 1: Math equivalence (diff = 0.0) between hierarchical and direct
- Test 2: Score + threshold selection strategy (replaces mask + voting)
- Test 3: Performance benchmark (41x speedup)
Uses pure torch + xattn kernels, independent of nanovllm framework.
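The equivalence in Test 1 can be sketched in pure torch with a plain reshape-based block sum standing in for the xattn kernels (`block_sum` is a hypothetical helper, not the real kernel): summing at a small block size and then aggregating those partial sums matches summing directly at the large block size.

```python
import torch

# Minimal sketch of hierarchical block-sum equivalence. `block_sum` is a
# stand-in for the real softmax_fuse_block_sum kernel.
def block_sum(scores: torch.Tensor, block: int) -> torch.Tensor:
    q, k = scores.shape
    return scores.reshape(q // block, block, k // block, block).sum(dim=(1, 3))

scores = torch.arange(64.0).reshape(8, 8)
direct = block_sum(scores, 4)                       # one-shot sum at block size 4
hierarchical = block_sum(block_sum(scores, 2), 2)   # block 2, then 2x2 aggregate
print(torch.equal(direct, hierarchical))            # → True (diff = 0.0 here)
```

Each large block's sum is just the sum of its small-block partial sums, so the two paths agree exactly when the addition order is exact (as with these small integer-valued floats).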
Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)
Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-28 06:24:35 +08:00
Zijie Tian
c90dc196b2
📝 docs: add estimate block_size performance analysis
...
Document the performance impact of block_size on softmax_fuse_block_sum:
- Current 4096 (reshaped 512) is the WORST point: 95ms
- Optimal 1024 (reshaped 128): 6ms - 15x faster
- Performance follows U-shaped curve
Add tests/bench_estimate_block_size.py for benchmarking and propose
hierarchical block sum approach for optimization.
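A CPU-only stand-in for the benchmark can be sketched as below; the function name, sizes, and timings are illustrative, while the real tests/bench_estimate_block_size.py measures the softmax_fuse_block_sum Triton kernel on GPU.

```python
import time
import torch

# Hypothetical micro-benchmark sketch: time a reshape-based block sum at
# several block sizes to probe for the U-shaped curve described above.
def bench_block_sum(seq: int, block: int, iters: int = 5) -> float:
    x = torch.randn(seq, seq)
    t0 = time.perf_counter()
    for _ in range(iters):
        x.reshape(seq // block, block, seq // block, block).sum(dim=(1, 3))
    return (time.perf_counter() - t0) / iters

for block in (128, 256, 512, 1024):
    print(f"block={block:5d}: {bench_block_sum(2048, block) * 1e3:.2f} ms")
```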
Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)
Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-28 06:24:28 +08:00
Zijie Tian
7c41032a2e
✨ feat: add configurable stride and chunk_size for XAttention BSA
...
- Add sparse_chunk_size config option (default: 16384)
- Pass stride, chunk_size, use_triton through factory function
- Add --sparse-stride CLI option to test_ruler.py
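The pass-through described above can be sketched as follows; `make_backend`, the `--sparse-chunk-size`/`--use-triton` flags, and the stride default are hypothetical, while `--sparse-stride` and the 16384 chunk-size default come from the commit.

```python
import argparse

# Hypothetical factory stand-in: collect stride, chunk_size, use_triton into
# the backend configuration, mirroring the pass-through described above.
def make_backend(stride: int, chunk_size: int, use_triton: bool) -> dict:
    return {"stride": stride, "chunk_size": chunk_size, "use_triton": use_triton}

parser = argparse.ArgumentParser()
parser.add_argument("--sparse-stride", type=int, default=8)           # illustrative default
parser.add_argument("--sparse-chunk-size", type=int, default=16384)   # default from commit
parser.add_argument("--use-triton", action="store_true")
args = parser.parse_args([])
backend = make_backend(args.sparse_stride, args.sparse_chunk_size, args.use_triton)
print(backend)
```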
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 10:37:04 +08:00
Zijie Tian
4f35526457
🔀 merge: integrate remote changes (exec-plan command, CUDA graph plan)
...
Resolve task_plan.md conflict by keeping remote version (CUDA Graph optimization plan).
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 09:43:06 +08:00
Zijie Tian
ca32ea6f93
[WIP] Before refactoring compute_chunked_prefill.
2026-01-23 03:36:12 +08:00
Zijie Tian
999858e82f
feat: add xattn kernels test and update testing rules
...
- Add test_xattn_kernels.py demonstrating flat_group_gemm_fuse_reshape
and softmax_fuse_block_sum Triton kernels with structured data
- Update testing.md with new test code style guidelines
- Update xattn.py and xattn_bsa.py with improvements
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 03:01:25 +08:00
Zijie Tian
a5307fb124
📝 docs: add CUDA Graph optimization plan for offload mode decode
...
- Update task_plan.md with 6-phase segmented graph implementation plan
- Add findings.md documenting 7 key discoveries about current implementation
- Add progress.md for tracking implementation progress
- Add test_chunk_attention_graph_reuse.py validating 2-graph reuse strategy
Key architecture decision: Split transformer layer into 3 segments:
- PRE-ATTENTION GRAPH: norm → qkv_proj → rotary (1 graph, reused)
- CHUNKED ATTENTION: H2D (eager) + flash_attn (2 graphs) + merge (eager)
- POST-ATTENTION GRAPH: o_proj → norm → FFN (1 graph, reused)
Total: 4 graphs serving all layers via copy_() tensor updates.
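The copy_()-based reuse idea behind "4 graphs serving all layers" can be sketched on CPU, with actual CUDA graph capture elided: a captured graph replays fixed ops on fixed buffers, so one graph can serve every layer if each layer's data is staged into static tensors with copy_() before replay.

```python
import torch

# CPU sketch of the static-buffer reuse pattern; replay() stands in for
# torch.cuda.CUDAGraph.replay(), which reruns the same ops on the same buffers.
static_in = torch.zeros(4)
static_out = torch.zeros(4)

def replay() -> None:
    # fixed computation on fixed buffers, every call
    static_out.copy_(static_in * 2)

for layer_input in (torch.ones(4), torch.full((4,), 3.0)):
    static_in.copy_(layer_input)  # per-layer update, no re-capture
    replay()                      # same "graph", new data
    print(static_out.tolist())
```

The plan above goes further by doing zero copies in the hot path where data can be pre-filled, but the staging pattern is the same.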
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 02:12:24 +08:00
Zijie Tian
bc92c1fdb8
feat: add xattn_estimate_chunked for chunked prefill support
...
- Add xattn_estimate_chunked function ported from COMPASS
- Support chunked prefill with q_start_pos parameter
- Ensure 100% consistency with standard xattn_estimate when
using matching chunk_size parameter
- Add test and documentation
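A toy illustration of why chunked estimation needs q_start_pos: with a causal mask, a query's global position determines which keys it may see, so each chunk must know its offset. `estimate` here is a hypothetical stand-in, not the real xattn_estimate.

```python
import torch

# Stand-in estimate with a causal mask driven by global positions.
def estimate(q: torch.Tensor, k: torch.Tensor, q_start_pos: int = 0) -> torch.Tensor:
    scores = q @ k.T
    qi = torch.arange(q.shape[0]) + q_start_pos   # global query indices
    ki = torch.arange(k.shape[0])                 # global key indices
    return scores.masked_fill(ki[None, :] > qi[:, None], 0.0)

q, k = torch.randn(8, 4), torch.randn(8, 4)
full = estimate(q, k)
chunked = torch.cat([estimate(q[s:s + 4], k, q_start_pos=s) for s in (0, 4)])
print(torch.allclose(full, chunked))  # → True: chunked matches one-shot
```

Dropping `q_start_pos=s` in the chunked call would mask the second chunk as if it started at position 0 and break the equivalence.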
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 01:13:17 +08:00
Zijie Tian
2866d4fd88
✨ feat: add chunk attention CUDA graph test for block sparse attention
...
Validates that pre-allocated CUDA graphs work for chunk-wise attention:
- Each (Q_chunk, K_chunk) pair has its own captured graph
- Zero copy_() during replay - all data pre-filled
- Uses nanovllm's flash_attn_with_lse and merge_attention_outputs
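The merge step can be sketched in pure torch (a stand-in for merge_attention_outputs, under the usual LSE-merge identity): each K chunk yields a partial output plus its log-sum-exp, and weighting the partials by softmax over their LSEs reconstructs full-softmax attention.

```python
import torch

# Reference attention that also returns the log-sum-exp of the scores,
# mimicking the (output, lse) pair produced per chunk.
def attn_with_lse(q, k, v):
    s = q @ k.T
    return torch.softmax(s, dim=-1) @ v, torch.logsumexp(s, dim=-1)

q = torch.randn(2, 8)
k, v = torch.randn(6, 8), torch.randn(6, 8)
o1, lse1 = attn_with_lse(q, k[:3], v[:3])   # chunk 1
o2, lse2 = attn_with_lse(q, k[3:], v[3:])   # chunk 2
w = torch.softmax(torch.stack([lse1, lse2]), dim=0)        # per-chunk weights
merged = (w.unsqueeze(-1) * torch.stack([o1, o2])).sum(0)  # LSE-weighted merge
full, _ = attn_with_lse(q, k, v)
print(torch.allclose(merged, full, atol=1e-5))  # → True
```

This works because each chunk's softmax denominator is exp(lse_i), so softmax over the LSEs recovers each chunk's share of the global denominator.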
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 00:57:05 +08:00
Zijie Tian
d21b40f48f
[test] Added test_cudagraph_memory.py.
2026-01-21 03:30:36 +08:00
Zijie Tian
1ab4676396
♻️ refactor: consolidate RULER test files and document root cause
...
- test_ruler.py: add --fresh-llm, --sample-indices, --json-output options
- test_ruler.py: consolidate test_ruler_single_sample.py, test_ruler_sequential.py, test_ruler_samples.py
- docs: update chunked offload issue with root cause (state leakage confirmed)
- docs: add single-sample test results showing 100% accuracy for niah_single_1
Deleted redundant test files:
- tests/test_ruler_single_sample.py
- tests/test_ruler_sequential.py
- tests/test_ruler_samples.py
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 23:41:17 +08:00
Zijie Tian
b1f292cf22
Merge branch 'tzj/minference' of ssh://git.zijie-tian.site:2222/zijie-tian/nano-vllm into tzj/minference
2026-01-20 02:16:39 +08:00
Zijie Tian
b5da802dff
[WIP] Before integrating the xattn operator.
2026-01-19 21:19:21 +08:00
Zijie Tian
50520a6c3c
[fix] Fixed request-to-request error.
2026-01-19 00:55:26 +08:00
Zijie Tian
e6e0dc5d7d
✨ feat: add comprehensive RULER benchmark testing
...
- Add test_ruler.py from tzj/vs_offload branch with 13 RULER tasks
- Add comprehensive documentation for RULER benchmark results
- Update CLAUDE.md with new documentation index entry
- Add architecture, debugging, optimization, and known issues guides
- Test 32K context with CPU offload: 92.3% accuracy across all tasks
- Parallel execution on 4 GPUs with detailed performance metrics
Benchmark results:
- 13 RULER tasks total (niah_single, multikey, multiquery, multivalue, qa, cwe, fwe, vt)
- 26 samples tested with 92.3% overall accuracy
- CPU offload stable at 32K context length
- Parallel GPU execution achieving 4x speedup
Key findings:
- Single needle tasks: 100% accuracy
- Multi-value and recall tasks: 100% accuracy
- Multi-query tasks: 50% accuracy (most challenging)
- QA tasks: 100% accuracy
- Total execution time: ~220 seconds (parallel)
2026-01-18 20:34:06 +08:00
Zijie Tian
2a6e0a2c02
[feat] Added Quest Sparsity Policy.
2026-01-07 03:29:21 +08:00
Zijie Tian
0e691f2d85
[WIP] move metadata to GPU.
2026-01-06 23:32:32 +08:00
Zijie Tian
edb5273e34
[WIP] Added basic test for quest.
2026-01-06 22:30:31 +08:00
Zijie Tian
535f2037ab
[WIP] Before fixing bench_offload.py.
2026-01-06 18:41:08 +08:00
Zijie Tian
e554d5482b
[refactor] Deleted unnecessary tests and refactored the offload prefix cache.
2026-01-05 20:31:42 +08:00
Zijie Tian
d623043a3c
[WIP] FIXED decode and prefill NEEDLE test.
2026-01-05 01:51:46 +08:00
Zijie Tian
e897380127
[test] Added test_align.py; before changing nanovllm attention.
2026-01-04 22:48:01 +08:00
Zijie Tian
24096431ed
[refactor] refactor test_align.py.
2026-01-04 20:55:40 +08:00
Zijie Tian
00ed17c640
[feat] Added debug tools.
2026-01-03 22:36:40 +08:00
Zijie Tian
8c3418725b
[refactor] Refactor needle test.
2026-01-03 19:19:37 +08:00
Zijie Tian
b3685c9190
[test] Added test_align.py
2026-01-03 18:55:58 +08:00
Zijie Tian
6927a75ac3
[refactor] refactor needle.py.
2026-01-03 18:33:48 +08:00
Zijie Tian
ff8b09cd35
[test] Added test_needle_ref.py.
2026-01-02 22:03:23 +08:00
Zijie Tian
74ee6d0895
[WIP] Need to fix the model to decode normally.
2026-01-01 05:18:27 +08:00
Zijie Tian
62b8a63314
[refactor] Refactor the test_chunked_prefill/decode.
2026-01-01 03:32:26 +08:00
Zijie Tian
965c8aff12
[WIP] Need to change flashattention for debugging.
2026-01-01 00:58:22 +08:00
Zijie Tian
30462fe89a
[WIP] Before fixing needle.
2025-12-31 23:35:25 +08:00
Zijie Tian
ccd1b3d4ab
[WIP] Before modifying nanovllm CPU-GPU kvcache.
2025-12-31 22:41:07 +08:00
Zijie Tian
31e90a7268
[test] Added offload correctness verification.
2025-12-31 20:59:53 +08:00
Zijie Tian
484d0de9f9
[feat] Added debug hook to offload_engine.py.
2025-12-31 19:44:39 +08:00
Zijie Tian
7af721c12c
[WIP] Before switching to FlashInfer.
2025-12-30 01:11:13 +08:00
Zijie Tian
89f8020d38
[WIP] fixing attention compute error.
2025-12-30 00:31:48 +08:00
Zijie Tian
82ed34fc2d
[opt] Optimized nanovllm performance to be comparable with vllm.
2025-12-25 03:47:07 +08:00
Zijie Tian
16fcf8350b
[WIP] Replace merge attention with a Triton kernel.
2025-12-25 01:07:05 +08:00
Zijie Tian
cf5e7df093
[WIP] Added sgDMA operator for scatter kvcache communication.
2025-12-24 23:48:52 +08:00
Zijie Tian
6ec1b23982
[WIP] NEED to modify communication.
2025-12-24 21:57:51 +08:00
Zijie Tian
782437c486
[WIP] Remove num_prefetch_blocks variable.
2025-12-24 18:22:26 +08:00
Zijie Tian
b264de903d
[test] Added a simple test_prefill.py.
2025-12-23 00:26:25 +08:00
Zijie Tian
4dcef16c13
[WIP] NEED to refactor nanovllm mechanism.
2025-12-22 23:52:56 +08:00
Zijie Tian
051f2295c9
[feat] Added sparse KVcache feature; NEEDS VERIFICATION.
2025-12-22 08:51:02 +08:00
Zijie Tian
1081ab51ea
[refactor] Refactor offload code to multi-chunk.
2025-12-15 01:13:58 +08:00
Zijie Tian
61edb8a344
[feat] Finished offload; still need to optimize performance.
2025-12-12 02:27:40 +08:00
Zijie Tian
babfa17354
[refactor] Translated into English; avoid Chinese due to Claude.
2025-12-11 00:30:24 +08:00
Zijie Tian
190df5f70d
[refactor] Refactor current gpu and cpu block allocation strategy.
2025-12-10 21:23:31 +08:00
Zijie Tian
0a247ccb1b
[feat] Added num_gpu_blocks to limit GPU blocks.
2025-12-10 20:17:42 +08:00