nano-vllm

Author	SHA1	Message	Date
Zijie Tian	bc92c1fdb8	feat: add xattn_estimate_chunked for chunked prefill support - Add xattn_estimate_chunked function ported from COMPASS - Support chunked prefill with q_start_pos parameter - Ensure 100% consistency with standard xattn_estimate when using matching chunk_size parameter - Add test and documentation Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-22 01:13:17 +08:00
Zijie Tian	512e1e5401	🔧 chore: add Claude rules for agent result format and multi-GPU debugging - Add agent-result-format.md: standardize output formats for background agents - Add multi-gpu-debugging.md: guidelines for parallel GPU testing workflows - Update CLAUDE.md: add documentation index entry for chunked offload issue Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 23:41:08 +08:00
Zijie Tian	6180055ed8	📝 docs: add chunked attention solutions guide and update doc index Add comprehensive documentation analyzing the 32K chunked offload accuracy issues with proposed solutions covering LSE precision, ring buffer state management, and position encoding validation. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 04:48:20 +08:00
Zijie Tian	4cbd451af7	📝 docs: add BSA interface documentation and cleanup temp files - Add docs/block_sparse_attn_interface.md with BSA function signatures - Update CLAUDE.md documentation index - Remove obsolete DEBUG_SUMMARY.md and test_report_sparse_policy_refactor.md - Add notes.md to .gitignore Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 04:27:19 +08:00
Zijie Tian	e440c45e73	📝 docs: add XAttention algorithm guide based on COMPASS implementation - Create docs/xattention_algorithm_guide.md with detailed algorithm explanation - Stride reshape (inverse mode) for Q/K interleaved sampling - Triton kernels: flat_group_gemm_fuse_reshape, softmax_fuse_block_sum - Block selection via find_blocks_chunked with cumulative threshold - BSA (block_sparse_attn) dependency for sparse computation - Update docs/sparse_attention_guide.md XAttention section with accurate description - Add documentation index entry in CLAUDE.md Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 02:50:03 +08:00
Zijie Tian	07f5220f40	Merge branch 'tzj/minference' of ssh://git.zijie-tian.site:2222/zijie-tian/nano-vllm into tzj/minference	2026-01-20 02:27:10 +08:00
Zijie Tian	37aecd4d52	📝 docs: add SparsePolicy implementation guide and update rules - Create docs/sparse_policy_implementation_guide.md with comprehensive guide - Rewrite .claude/rules/sparse-policy.md with mandatory base class requirements - Add new doc reference to CLAUDE.md documentation index Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 02:25:46 +08:00
Zijie Tian	16fbcf9e4c	docs: add RULER 32K chunked offload issue documentation - Document accuracy degradation issue in 32K context with chunked offload - Add detailed hypothesis analysis and debugging approach - Include 4-slot ring buffer experiment results Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 02:16:21 +08:00
Zijie Tian	e5a17c832c	📝 docs: add SparsePolicy architecture documentation Add comprehensive documentation for the SparsePolicy abstraction: - SparsePolicy base class and abstract methods - FullAttentionPolicy prefill/decode flow - Ring buffer and cross-layer pipeline modes - Code conventions and testing guidelines Update CLAUDE.md documentation index with reference. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 01:36:09 +08:00
Zijie Tian	e6e0dc5d7d	✨ feat: add comprehensive RULER benchmark testing - Add test_ruler.py from tzj/vs_offload branch with 13 RULER tasks - Add comprehensive documentation for RULER benchmark results - Update CLAUDE.md with new documentation index entry - Add architecture, debugging, optimization, and known issues guides - Test 32K context with CPU offload: 92.3% accuracy across all tasks - Parallel execution on 4 GPUs with detailed performance metrics Benchmark results: - 13 RULER tasks total (niah_single, multikey, multiquery, multivalue, qa, cwe, fwe, vt) - 26 samples tested with 92.3% overall accuracy - CPU offload stable at 32K context length - Parallel GPU execution achieving 4x speedup Key findings: - Single needle tasks: 100% accuracy - Multi-value and recall tasks: 100% accuracy - Multi-query tasks: 50% accuracy (most challenging) - QA tasks: 100% accuracy - Total execution time: ~220 seconds (parallel)	2026-01-18 20:34:06 +08:00
Zijie Tian	8fd25d72d7	Merge perf_opt-1 and perf_opt-2 branches Combines two performance optimization features: - perf_opt-1: Cross-layer pipeline for decode (double-buffered layer cache) - perf_opt-2: Per-layer prefill buffer for async offload Both features are complementary and improve CPU offload performance. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-07 06:03:44 +08:00
Zijie Tian	ccf27d3a74	[claudesquad] update from 'perf_opt-1' on 07 Jan 26 05:58 CST	2026-01-07 05:58:23 +08:00
Zijie Tian	0ad86eb449	[claudesquad] update from 'perf_opt-2' on 07 Jan 26 05:58 CST	2026-01-07 05:58:10 +08:00
Zijie Tian	2fe50bab50	[claudesquad] update from 'debug_chunk-2' on 07 Jan 26 03:27 CST	2026-01-07 03:27:27 +08:00
Zijie Tian	f240903013	[docs] Add GPU mutex instructions for multi-instance debugging Add instructions for Claude instances to check GPU availability before running CUDA operations, preventing conflicts when multiple instances debug in parallel on a single GPU. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-07 01:42:59 +08:00
Zijie Tian	edb5273e34	[WIP] Added basic test for quest.	2026-01-06 22:30:31 +08:00
Zijie Tian	e554d5482b	[refactor] Delete unnecessary test and refactor the offload prefix cache.	2026-01-05 20:31:42 +08:00
Zijie Tian	054aaff403	[fix] Fixed needle test bug.	2026-01-05 18:34:09 +08:00
Zijie Tian	9b52d25866	[docs] Update CLAUDE.md.	2026-01-03 20:46:00 +08:00
Zijie Tian	bf4c63c7ec	[docs] Added Sparse Attn.	2025-12-29 19:56:54 +08:00
Zijie Tian	82ed34fc2d	[opt] Optimize nanovllm performance to be comparable with vllm.	2025-12-25 03:47:07 +08:00
Zijie Tian	16fcf8350b	[WIP] replace merge attention with triton kernel.	2025-12-25 01:07:05 +08:00
Zijie Tian	6ec1b23982	[WIP] NEED to modify communication.	2025-12-24 21:57:51 +08:00
Zijie Tian	782437c486	[WIP] Remove num_prefetch_blocks variable.	2025-12-24 18:22:26 +08:00
Zijie Tian	1907b625b6	[refactor] Remove legacy mode path.	2025-12-22 20:17:56 +08:00
Zijie Tian	08d83185ce	[fix] fix bench*.py.	2025-12-22 19:53:50 +08:00
Zijie Tian	8df0c7517b	[docs] refactor CLAUDE.md.	2025-12-15 21:43:33 +08:00
Zijie Tian	b8b6478506	[feat] Needs to be optimized with async prefetch.	2025-12-15 06:58:40 +08:00
Zijie Tian	1081ab51ea	[refactor] Refactor offload code to multi-chunk.	2025-12-15 01:13:58 +08:00
Zijie Tian	5949537faf	[docs] Start using CLAUDE rules.	2025-12-15 00:20:54 +08:00
Zijie Tian	a37f07943c	[docs] Update the CLAUDE.md.	2025-12-15 00:13:27 +08:00
Zijie Tian	761929390e	[bench] Added vllm vs nano-vllm bench.	2025-12-10 00:44:57 +08:00

32 Commits