Zijie Tian
f049971f84
✅ test: add hierarchical block sum estimation validation
...
Validate the hierarchical estimation approach for XAttention:
- Test 1: Math equivalence (diff = 0.0) between hierarchical and direct
- Test 2: Score + threshold selection strategy (replaces mask + voting)
- Test 3: Performance benchmark (41x speedup)
Uses pure torch + xattn kernels, independent of nanovllm framework.
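The equivalence in Test 1 can be sketched in pure torch with a plain reshape-based block sum standing in for the xattn kernels (`block_sum` is a hypothetical helper, not the real kernel): summing at a small block size and then aggregating those partial sums matches summing directly at the large block size.

```python
import torch

# Minimal sketch of hierarchical block-sum equivalence. `block_sum` is a
# stand-in for the real softmax_fuse_block_sum kernel.
def block_sum(scores: torch.Tensor, block: int) -> torch.Tensor:
    q, k = scores.shape
    return scores.reshape(q // block, block, k // block, block).sum(dim=(1, 3))

scores = torch.arange(64.0).reshape(8, 8)
direct = block_sum(scores, 4)                       # one-shot sum at block size 4
hierarchical = block_sum(block_sum(scores, 2), 2)   # block 2, then 2x2 aggregate
print(torch.equal(direct, hierarchical))            # → True (diff = 0.0 here)
```

Each large block's sum is just the sum of its small-block partial sums, so the two paths agree exactly when the addition order is exact (as with these small integer-valued floats).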
Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)
Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-28 06:24:35 +08:00
Zijie Tian
c90dc196b2
📝 docs: add estimate block_size performance analysis
...
Document the performance impact of block_size on softmax_fuse_block_sum:
- Current 4096 (reshaped 512) is the WORST point: 95ms
- Optimal 1024 (reshaped 128): 6ms - 15x faster
- Performance follows U-shaped curve
Add tests/bench_estimate_block_size.py for benchmarking and propose
hierarchical block sum approach for optimization.
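A CPU-only stand-in for the benchmark can be sketched as below; the function name, sizes, and timings are illustrative, while the real tests/bench_estimate_block_size.py measures the softmax_fuse_block_sum Triton kernel on GPU.

```python
import time
import torch

# Hypothetical micro-benchmark sketch: time a reshape-based block sum at
# several block sizes to probe for the U-shaped curve described above.
def bench_block_sum(seq: int, block: int, iters: int = 5) -> float:
    x = torch.randn(seq, seq)
    t0 = time.perf_counter()
    for _ in range(iters):
        x.reshape(seq // block, block, seq // block, block).sum(dim=(1, 3))
    return (time.perf_counter() - t0) / iters

for block in (128, 256, 512, 1024):
    print(f"block={block:5d}: {bench_block_sum(2048, block) * 1e3:.2f} ms")
```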
Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)
Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-28 06:24:28 +08:00
Zijie Tian
7c41032a2e
✨ feat: add configurable stride and chunk_size for XAttention BSA
...
- Add sparse_chunk_size config option (default: 16384)
- Pass stride, chunk_size, use_triton through factory function
- Add --sparse-stride CLI option to test_ruler.py
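The pass-through described above can be sketched as follows; `make_backend`, the `--sparse-chunk-size`/`--use-triton` flags, and the stride default are hypothetical, while `--sparse-stride` and the 16384 chunk-size default come from the commit.

```python
import argparse

# Hypothetical factory stand-in: collect stride, chunk_size, use_triton into
# the backend configuration, mirroring the pass-through described above.
def make_backend(stride: int, chunk_size: int, use_triton: bool) -> dict:
    return {"stride": stride, "chunk_size": chunk_size, "use_triton": use_triton}

parser = argparse.ArgumentParser()
parser.add_argument("--sparse-stride", type=int, default=8)           # illustrative default
parser.add_argument("--sparse-chunk-size", type=int, default=16384)   # default from commit
parser.add_argument("--use-triton", action="store_true")
args = parser.parse_args([])
backend = make_backend(args.sparse_stride, args.sparse_chunk_size, args.use_triton)
print(backend)
```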
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 10:37:04 +08:00
Zijie Tian
4f35526457
🔀 merge: integrate remote changes (exec-plan command, CUDA graph plan)
...
Resolve task_plan.md conflict by keeping remote version (CUDA Graph optimization plan).
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 09:43:06 +08:00
Zijie Tian
ca32ea6f93
[WIP] Before refactoring compute_chunked_prefill.
2026-01-23 03:36:12 +08:00
Zijie Tian
999858e82f
feat: add xattn kernels test and update testing rules
...
- Add test_xattn_kernels.py demonstrating flat_group_gemm_fuse_reshape
and softmax_fuse_block_sum Triton kernels with structured data
- Update testing.md with new test code style guidelines
- Update xattn.py and xattn_bsa.py with improvements
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 03:01:25 +08:00
Zijie Tian
a5307fb124
📝 docs: add CUDA Graph optimization plan for offload mode decode
...
- Update task_plan.md with 6-phase segmented graph implementation plan
- Add findings.md documenting 7 key discoveries about current implementation
- Add progress.md for tracking implementation progress
- Add test_chunk_attention_graph_reuse.py validating 2-graph reuse strategy
Key architecture decision: Split transformer layer into 3 segments:
- PRE-ATTENTION GRAPH: norm → qkv_proj → rotary (1 graph, reused)
- CHUNKED ATTENTION: H2D (eager) + flash_attn (2 graphs) + merge (eager)
- POST-ATTENTION GRAPH: o_proj → norm → FFN (1 graph, reused)
Total: 4 graphs serving all layers via copy_() tensor updates.
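The copy_()-based reuse idea behind "4 graphs serving all layers" can be sketched on CPU, with actual CUDA graph capture elided: a captured graph replays fixed ops on fixed buffers, so one graph can serve every layer if each layer's data is staged into static tensors with copy_() before replay.

```python
import torch

# CPU sketch of the static-buffer reuse pattern; replay() stands in for
# torch.cuda.CUDAGraph.replay(), which reruns the same ops on the same buffers.
static_in = torch.zeros(4)
static_out = torch.zeros(4)

def replay() -> None:
    # fixed computation on fixed buffers, every call
    static_out.copy_(static_in * 2)

for layer_input in (torch.ones(4), torch.full((4,), 3.0)):
    static_in.copy_(layer_input)  # per-layer update, no re-capture
    replay()                      # same "graph", new data
    print(static_out.tolist())
```

The plan above goes further by doing zero copies in the hot path where data can be pre-filled, but the staging pattern is the same.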
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 02:12:24 +08:00
Zijie Tian
bc92c1fdb8
feat: add xattn_estimate_chunked for chunked prefill support
...
- Add xattn_estimate_chunked function ported from COMPASS
- Support chunked prefill with q_start_pos parameter
- Ensure 100% consistency with standard xattn_estimate when
using matching chunk_size parameter
- Add test and documentation
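A toy illustration of why chunked estimation needs q_start_pos: with a causal mask, a query's global position determines which keys it may see, so each chunk must know its offset. `estimate` here is a hypothetical stand-in, not the real xattn_estimate.

```python
import torch

# Stand-in estimate with a causal mask driven by global positions.
def estimate(q: torch.Tensor, k: torch.Tensor, q_start_pos: int = 0) -> torch.Tensor:
    scores = q @ k.T
    qi = torch.arange(q.shape[0]) + q_start_pos   # global query indices
    ki = torch.arange(k.shape[0])                 # global key indices
    return scores.masked_fill(ki[None, :] > qi[:, None], 0.0)

q, k = torch.randn(8, 4), torch.randn(8, 4)
full = estimate(q, k)
chunked = torch.cat([estimate(q[s:s + 4], k, q_start_pos=s) for s in (0, 4)])
print(torch.allclose(full, chunked))  # → True: chunked matches one-shot
```

Dropping `q_start_pos=s` in the chunked call would mask the second chunk as if it started at position 0 and break the equivalence.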
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 01:13:17 +08:00
Zijie Tian
2866d4fd88
✨ feat: add chunk attention CUDA graph test for block sparse attention
...
Validates that pre-allocated CUDA graphs work for chunk-wise attention:
- Each (Q_chunk, K_chunk) pair has its own captured graph
- Zero copy_() during replay - all data pre-filled
- Uses nanovllm's flash_attn_with_lse and merge_attention_outputs
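The merge step can be sketched in pure torch (a stand-in for merge_attention_outputs, under the usual LSE-merge identity): each K chunk yields a partial output plus its log-sum-exp, and weighting the partials by softmax over their LSEs reconstructs full-softmax attention.

```python
import torch

# Reference attention that also returns the log-sum-exp of the scores,
# mimicking the (output, lse) pair produced per chunk.
def attn_with_lse(q, k, v):
    s = q @ k.T
    return torch.softmax(s, dim=-1) @ v, torch.logsumexp(s, dim=-1)

q = torch.randn(2, 8)
k, v = torch.randn(6, 8), torch.randn(6, 8)
o1, lse1 = attn_with_lse(q, k[:3], v[:3])   # chunk 1
o2, lse2 = attn_with_lse(q, k[3:], v[3:])   # chunk 2
w = torch.softmax(torch.stack([lse1, lse2]), dim=0)        # per-chunk weights
merged = (w.unsqueeze(-1) * torch.stack([o1, o2])).sum(0)  # LSE-weighted merge
full, _ = attn_with_lse(q, k, v)
print(torch.allclose(merged, full, atol=1e-5))  # → True
```

This works because each chunk's softmax denominator is exp(lse_i), so softmax over the LSEs recovers each chunk's share of the global denominator.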
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 00:57:05 +08:00
Zijie Tian
d21b40f48f
[test] Added test_cudagraph_memory.py.
2026-01-21 03:30:36 +08:00
Zijie Tian
1ab4676396
♻️ refactor: consolidate RULER test files and document root cause
...
- test_ruler.py: add --fresh-llm, --sample-indices, --json-output options
- test_ruler.py: consolidate test_ruler_single_sample.py, test_ruler_sequential.py, test_ruler_samples.py
- docs: update chunked offload issue with root cause (state leakage confirmed)
- docs: add single-sample test results showing 100% accuracy for niah_single_1
Deleted redundant test files:
- tests/test_ruler_single_sample.py
- tests/test_ruler_sequential.py
- tests/test_ruler_samples.py
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 23:41:17 +08:00
Zijie Tian
b1f292cf22
Merge branch 'tzj/minference' of ssh://git.zijie-tian.site:2222/zijie-tian/nano-vllm into tzj/minference
2026-01-20 02:16:39 +08:00
Zijie Tian
b5da802dff
[WIP] Before integrating the xattn operator.
2026-01-19 21:19:21 +08:00
Zijie Tian
50520a6c3c
[fix] Fixed request-to-request error.
2026-01-19 00:55:26 +08:00
Zijie Tian
e6e0dc5d7d
✨ feat: add comprehensive RULER benchmark testing
...
- Add test_ruler.py from tzj/vs_offload branch with 13 RULER tasks
- Add comprehensive documentation for RULER benchmark results
- Update CLAUDE.md with new documentation index entry
- Add architecture, debugging, optimization, and known issues guides
- Test 32K context with CPU offload: 92.3% accuracy across all tasks
- Parallel execution on 4 GPUs with detailed performance metrics
Benchmark results:
- 13 RULER tasks total (niah_single, multikey, multiquery, multivalue, qa, cwe, fwe, vt)
- 26 samples tested with 92.3% overall accuracy
- CPU offload stable at 32K context length
- Parallel GPU execution achieving 4x speedup
Key findings:
- Single needle tasks: 100% accuracy
- Multi-value and recall tasks: 100% accuracy
- Multi-query tasks: 50% accuracy (most challenging)
- QA tasks: 100% accuracy
- Total execution time: ~220 seconds (parallel)
2026-01-18 20:34:06 +08:00
Zijie Tian
2a6e0a2c02
[feat] Added Quest Sparsity Policy.
2026-01-07 03:29:21 +08:00
Zijie Tian
0e691f2d85
[WIP] move metadata to GPU.
2026-01-06 23:32:32 +08:00
Zijie Tian
edb5273e34
[WIP] Added basic test for quest.
2026-01-06 22:30:31 +08:00
Zijie Tian
535f2037ab
[WIP] Before fixing bench_offload.py.
2026-01-06 18:41:08 +08:00
Zijie Tian
e554d5482b
[refactor] Deleted unnecessary tests and refactored the offload prefix cache.
2026-01-05 20:31:42 +08:00
Zijie Tian
d623043a3c
[WIP] FIXED decode and prefill NEEDLE test.
2026-01-05 01:51:46 +08:00
Zijie Tian
e897380127
[test] Added test_align.py; before changing nanovllm attention.
2026-01-04 22:48:01 +08:00
Zijie Tian
24096431ed
[refactor] refactor test_align.py.
2026-01-04 20:55:40 +08:00
Zijie Tian
00ed17c640
[feat] Added debug tools.
2026-01-03 22:36:40 +08:00
Zijie Tian
8c3418725b
[refactor] Refactor needle test.
2026-01-03 19:19:37 +08:00
Zijie Tian
b3685c9190
[test] Added test_align.py
2026-01-03 18:55:58 +08:00
Zijie Tian
6927a75ac3
[refactor] refactor needle.py.
2026-01-03 18:33:48 +08:00
Zijie Tian
ff8b09cd35
[test] Added test_needle_ref.py.
2026-01-02 22:03:23 +08:00
Zijie Tian
74ee6d0895
[WIP] Need to fix the model to decode normally.
2026-01-01 05:18:27 +08:00
Zijie Tian
62b8a63314
[refactor] Refactor the test_chunked_prefill/decode.
2026-01-01 03:32:26 +08:00
Zijie Tian
965c8aff12
[WIP] Need to change flashattention for debugging.
2026-01-01 00:58:22 +08:00
Zijie Tian
30462fe89a
[WIP] Before fixing needle.
2025-12-31 23:35:25 +08:00
Zijie Tian
ccd1b3d4ab
[WIP] Before modifying nanovllm CPU-GPU kvcache.
2025-12-31 22:41:07 +08:00
Zijie Tian
31e90a7268
[test] Added offload correctness verification.
2025-12-31 20:59:53 +08:00
Zijie Tian
484d0de9f9
[feat] Added debug hook to offload_engine.py.
2025-12-31 19:44:39 +08:00
Zijie Tian
7af721c12c
[WIP] Before switching to FlashInfer.
2025-12-30 01:11:13 +08:00
Zijie Tian
89f8020d38
[WIP] fixing attention compute error.
2025-12-30 00:31:48 +08:00
Zijie Tian
82ed34fc2d
[opt] Optimized nanovllm performance to be comparable with vllm.
2025-12-25 03:47:07 +08:00
Zijie Tian
16fcf8350b
[WIP] Replace merge attention with a Triton kernel.
2025-12-25 01:07:05 +08:00
Zijie Tian
cf5e7df093
[WIP] Added sgDMA operator for scatter kvcache communication.
2025-12-24 23:48:52 +08:00
Zijie Tian
6ec1b23982
[WIP] NEED to modify communication.
2025-12-24 21:57:51 +08:00
Zijie Tian
782437c486
[WIP] Remove num_prefetch_blocks variable.
2025-12-24 18:22:26 +08:00
Zijie Tian
b264de903d
[test] Added a simple test_prefill.py.
2025-12-23 00:26:25 +08:00
Zijie Tian
4dcef16c13
[WIP] NEED to refactor nanovllm mechanism.
2025-12-22 23:52:56 +08:00
Zijie Tian
051f2295c9
[feat] Added sparse KVcache feature; NEEDS VERIFICATION.
2025-12-22 08:51:02 +08:00
Zijie Tian
1081ab51ea
[refactor] Refactor offload code to multi-chunk.
2025-12-15 01:13:58 +08:00
Zijie Tian
61edb8a344
[feat] Finished offload; still need to optimize performance.
2025-12-12 02:27:40 +08:00
Zijie Tian
babfa17354
[refactor] Translated into English; avoid Chinese due to Claude.
2025-12-11 00:30:24 +08:00
Zijie Tian
190df5f70d
[refactor] Refactor current gpu and cpu block allocation strategy.
2025-12-10 21:23:31 +08:00
Zijie Tian
0a247ccb1b
[feat] Added num_gpu_blocks to limit GPU blocks.
2025-12-10 20:17:42 +08:00