Commit Graph

39 Commits

Author SHA1 Message Date
Zijie Tian
cfb188c34a docs: add chunked prefill analysis for ultra-long sequences
Add comprehensive analysis document covering:
- MLP activation memory bottlenecks with SwiGLU architecture
- Chunked MLP strategy (98% memory reduction)
- Chunked prefill for single layers (78% memory reduction)
- Streaming Chunked Prefill (optimal approach): GPU memory becomes constant
- Memory formulas and implementation guidance
- Theoretical maximum: 4M tokens on 24GB GPU (128× improvement)

Co-Authored-By: Claude <noreply@anthropic.com>
2026-01-16 10:38:02 +08:00
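The "98% memory reduction" claimed for the chunked MLP strategy can be sanity-checked with rough arithmetic. The sketch below assumes Llama-3.1-8B-style dimensions (SwiGLU intermediate size 14336, fp16, a 128k-token prefill, 2048-token chunks) — these constants are my assumptions, not taken from the commit:

```python
# Rough activation-memory estimate for a SwiGLU MLP during prefill.
# Dimensions are illustrative (Llama-3.1-8B-style), not from the repo.
BYTES = 2            # fp16
INTER = 14336        # SwiGLU intermediate size
SEQ = 128 * 1024     # 128k-token prefill
CHUNK = 2048         # chunked-MLP chunk size

def mlp_act_bytes(tokens: int) -> int:
    # gate_proj output + up_proj output + activated product,
    # each of shape [tokens, INTER] in fp16.
    return 3 * tokens * INTER * BYTES

full = mlp_act_bytes(SEQ)
chunked = mlp_act_bytes(CHUNK)
print(f"full prefill : {full / 2**30:.1f} GiB")
print(f"chunked      : {chunked / 2**30:.3f} GiB")
print(f"reduction    : {1 - chunked / full:.1%}")
```

With a 2048-token chunk out of 128k tokens, peak MLP activation memory drops by 1 − 2048/131072 ≈ 98.4%, consistent with the figure quoted above.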
Zijie Tian
2826a649de docs: add XAttention integration guide
Comprehensive documentation for XAttention sparse policy integration:
- Algorithm principles (chunked estimation + block sparse attention)
- COMPASS source code analysis
- Design decisions for CPU offload mode
- Implementation details (utils.py, kernels.py, xattn.py)
- Problem-solving (OOM, GQA, abstract method)
- Test validation results (RULER 32k benchmark)

Co-Authored-By: Claude <noreply@anthropic.com>
2026-01-14 10:16:21 +08:00
Zijie Tian
57f4e9c6e6 docs: reorganize documentation files
- Move notes.md to docs/development_notes.md
- Move Xattention_analysis.md to docs/xattention_analysis.md
- Delete DEBUG_SUMMARY.md (no longer needed)
- Update CLAUDE.md with documentation index entries

Co-Authored-By: Claude <noreply@anthropic.com>
2026-01-14 10:08:41 +08:00
Zijie Tian
8d6fde3b23 docs: add Block-Sparse-Attention library reference
Add comprehensive documentation for the MIT-Han-Lab Block-Sparse-Attention
library (3rdparty submodule, branch: tzj/minference).

The new document covers:
- Four sparse attention modes (dense, token/block streaming, block sparse)
- Hybrid mask support (different patterns per head)
- Complete API reference for all three functions
- Performance benchmarks (up to 3-4x speedup on A100)
- Integration considerations for nano-vllm

Co-Authored-By: Claude <noreply@anthropic.com>
2026-01-14 08:39:03 +08:00
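The four sparse modes all operate at block granularity. As a toy illustration of what "block sparse" means here — downsampling a token-level mask to a block-level mask — consider the following sketch. This is NOT the Block-Sparse-Attention library's API, just the general idea its kernels exploit:

```python
# Illustration of block-granularity sparsity: downsample a dense
# token-level attention mask to a [q_blocks, k_blocks] block mask.
BLOCK = 4  # toy block size (real kernels typically use 64 or 128)

def to_block_mask(dense, block=BLOCK):
    n_q, n_k = len(dense), len(dense[0])
    qb, kb = n_q // block, n_k // block
    # A block is kept if ANY token pair inside it attends.
    return [
        [
            any(dense[i][j]
                for i in range(qi * block, (qi + 1) * block)
                for j in range(ki * block, (ki + 1) * block))
            for ki in range(kb)
        ]
        for qi in range(qb)
    ]

# Causal 8x8 mask -> 2x2 block mask: the upper-right block is skipped.
causal = [[j <= i for j in range(8)] for i in range(8)]
print(to_block_mask(causal))  # [[True, False], [True, True]]
```

Skipping whole blocks (rather than individual tokens) is what lets the kernel stay coalesced and reach the 3-4x speedups mentioned above.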
Zijie Tian
86633004ca 📝 docs: add 64k memory analysis and test configuration updates
Add comprehensive memory analysis for 64k inference on Llama 3.1 8B:

New documentation:
- docs/64k_memory_analysis.md: GPU-only vs offload memory analysis,
  OOM root cause (memory fragmentation), RTX 3090 limitations,
  theoretical vs actual memory usage breakdown

Test configuration updates:
- tests/test_ruler.py: Add --num-kv-buffers parameter for ring buffer
  size tuning (default 4, can reduce to 1 for lower memory)
- Update default data_dir to ruler_64k
- Update default max_model_len to 65664 for 64k support

CLAUDE.md updates:
- Add 64k_memory_analysis.md to documentation index
- Document num_kv_buffers parameter in Configuration section
- Add 64k hardware requirements note to Model Limits

Key findings: 64k inference requires ~26GB (GPU-only) or ~23GB (offload)
due to memory fragmentation on 24GB GPUs, making A100 (40GB+) the
recommended hardware for 64k workloads.

Co-Authored-By: Claude <noreply@anthropic.com>
2026-01-14 07:02:09 +08:00
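The ~26GB finding can be cross-checked with back-of-envelope arithmetic. The model constants below are the published Llama-3.1-8B configuration (32 layers, 8 KV heads, head_dim 128); the breakdown itself is my illustrative estimate, not the document's:

```python
# Back-of-envelope GPU memory for 64k inference on Llama-3.1-8B (fp16).
GiB = 2**30
params = 8.03e9                  # ~8B parameters
weights = params * 2 / GiB       # fp16 weights

layers, kv_heads, head_dim, seq = 32, 8, 128, 64 * 1024
# K and V, 2 bytes per element, per layer per head per token.
kv_cache = 2 * 2 * layers * kv_heads * head_dim * seq / GiB

print(f"weights : {weights:.1f} GiB")
print(f"kv cache: {kv_cache:.1f} GiB")
print(f"total   : {weights + kv_cache:.1f} GiB (before activations)")
```

Weights plus KV cache alone land near 23 GiB; adding activation buffers and allocator fragmentation pushes the requirement past a 24GB card, consistent with the ~26GB GPU-only figure and the A100 recommendation.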
Zijie Tian
64971c8e8a Merge branch 'zijie/fix-dist-3': Fix distributed port conflict
- Auto port allocation with _find_free_port() in model_runner.py
- Resource management refactor with close() + context manager in llm_engine.py
- Add tests/test_port_conflict.py and tests/run_parallel_niah.sh
- Remove docs/torch_distributed_port_issue.md (issue fixed)
- Ignore tests/data/ directory

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-12 16:27:25 +08:00
Zijie Tian
de6f36bdb2 [docs] Added dist port issue. 2026-01-12 15:16:39 +08:00
Zijie Tian
8e0888c20c [docs] Added offload_acc issue. 2026-01-12 15:05:55 +08:00
Zijie Tian
2771312565 [docs] Add sparse prefill integration plan from int-minference analysis
Consolidated analysis from int-minference-1/2/3 branches into a unified
integration plan for MInference, XAttention, and FlexPrefill strategies.

Key design decisions:
- Backward compatible: Keep existing SparsePolicy interface
- Unified BlockMask intermediate representation for new strategies
- XAttention/FlexPrefill use block_sparse_attn_func kernel
- MInference can optionally use block_sparse_attn (Phase 4)

Five-phase implementation plan:
1. BlockMask + block_sparse_attn wrapper
2. XAttention implementation
3. FlexPrefill implementation
4. Optional MInference refactoring
5. Integration and testing

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 23:33:09 +08:00
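The commit names a "unified BlockMask intermediate representation" but does not define its shape. A plausible minimal form (field and method names here are hypothetical, for illustration only) might be:

```python
from dataclasses import dataclass

@dataclass
class BlockMask:
    """Hypothetical sketch of a block-level sparsity mask that
    XAttention / FlexPrefill could emit before dispatching to a
    block-sparse kernel. Field names are illustrative, not the repo's."""
    block_size: int
    num_q_blocks: int
    num_k_blocks: int
    # active[h][qb] -> list of key-block indices head h attends to
    active: list

    def density(self) -> float:
        kept = sum(len(kbs) for head in self.active for kbs in head)
        total = len(self.active) * self.num_q_blocks * self.num_k_blocks
        return kept / total

# One head, 2x2 causal block pattern (3 of 4 blocks kept).
m = BlockMask(64, 2, 2, [[[0], [0, 1]]])
print(m.density())  # 0.75
```

A shared representation like this is what lets multiple estimation strategies feed one kernel wrapper, matching the backward-compatible design decision above.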
Zijie Tian
de6eae472d [docs] Update CLAUDE.md with multi-model support documentation
- Update overview to reflect Qwen3/Qwen2/Llama support
- Add docs/multi_model_support.md to documentation index
- Add Llama-3.1-8B-Instruct to model limits

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 21:29:39 +08:00
Zijie Tian
067e36f4a2 [claudesquad] update from 'fix-bug-2' on 09 Jan 26 16:10 CST 2026-01-09 16:10:28 +08:00
Zijie Tian
79c4df4a27 [claudesquad] update from 'int-minference-1' on 08 Jan 26 23:42 CST 2026-01-08 23:42:30 +08:00
Zijie Tian
ea4e904de0 [claudesquad] update from 'int-minference-1' on 08 Jan 26 23:22 CST 2026-01-08 23:22:38 +08:00
Zijie Tian
0bfe1984ef [docs] Refine GPU mutex: exclusive for benchmarks, port check for tests
Benchmarks (bench*.py) still require exclusive GPU access for accurate
measurements. Other scripts (tests, examples) now only check for
distributed port 29500 conflicts, allowing parallel GPU sharing.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-08 21:35:08 +08:00
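The port-29500 conflict check described above can be sketched as "try to bind the default torch.distributed port and treat a bind failure as in use" — an assumption on my part; the repository's actual check may work differently:

```python
import socket

def port_in_use(port: int = 29500) -> bool:
    """Return True if another process already holds the port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind(("127.0.0.1", port))
            return False            # bind succeeded: port was free
        except OSError:
            return True             # bind failed: port is taken

print("port 29500 busy" if port_in_use() else "port 29500 free")
```

Unlike the exclusive-GPU mutex for benchmarks, this check only serializes runs that would collide on the distributed rendezvous port, so tests and examples can share the GPU.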
Zijie Tian
105201b902 [claudesquad] update from 'lw-offload-2' on 08 Jan 26 21:19 CST 2026-01-08 21:19:38 +08:00
Zijie Tian
b5c0ef3b7a [docs] Replace chunked prefill docs with layer-wise offload strategy
Remove all chunked prefill related documentation (ring buffer, sgDMA,
Triton merge kernels, known issues) and replace with layer-wise offload
system documentation including:
- Design philosophy and benefits
- Memory layout and per-layer KV size table
- Prefill and decode flow pseudocode
- Critical implementation details (sync offload, causal=False for decode)
- Helper methods in HybridKVCacheManager

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-08 05:39:26 +08:00
Zijie Tian
bbbfd1e7da [docs] Simplify multi-instance development with direct PYTHONPATH
Replace pip install -e . --prefix=./.local approach with simpler PYTHONPATH method:
- No pip install required
- Code changes take effect immediately
- Each worktree is completely isolated

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-08 04:51:55 +08:00
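The mechanism this workflow relies on — a directory on PYTHONPATH is importable with no pip install, so each worktree resolves its own copy of the code — can be demonstrated end to end (`fakepkg` below is a stand-in module, not part of the repository):

```python
# Demonstration of the PYTHONPATH mechanism: a directory placed on
# PYTHONPATH is importable by a child interpreter without pip install.
import os
import subprocess
import sys
import tempfile

with tempfile.TemporaryDirectory() as worktree:
    # Stand-in for a worktree containing the package source.
    with open(os.path.join(worktree, "fakepkg.py"), "w") as f:
        f.write("TAG = 'from-worktree'\n")

    env = dict(os.environ, PYTHONPATH=worktree)
    out = subprocess.run(
        [sys.executable, "-c", "import fakepkg; print(fakepkg.TAG)"],
        env=env, capture_output=True, text=True, check=True,
    )
    print(out.stdout.strip())
```

Because nothing is installed into site-packages, edits take effect on the next run and worktrees cannot shadow each other — the two properties the commit message calls out.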
Zijie Tian
8fd25d72d7 Merge perf_opt-1 and perf_opt-2 branches
Combines two performance optimization features:
- perf_opt-1: Cross-layer pipeline for decode (double-buffered layer cache)
- perf_opt-2: Per-layer prefill buffer for async offload

Both features are complementary and improve CPU offload performance.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-07 06:03:44 +08:00
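The double-buffered layer cache from perf_opt-1 can be sketched as a two-slot ring: while layer i computes against slot i % 2, layer i+1's KV is fetched into the other slot. A pure-Python simulation of the scheduling (no CUDA; all names illustrative — the real implementation overlaps a copy stream with compute):

```python
# Pure-Python simulation of a double-buffered cross-layer pipeline.
NUM_LAYERS = 4
cpu_kv = {i: f"kv_layer_{i}" for i in range(NUM_LAYERS)}

buffers = [None, None]           # two GPU-side slots
log = []

def prefetch(layer: int):
    # Copy layer's KV from CPU into its slot (layer % 2).
    buffers[layer % 2] = cpu_kv[layer]
    log.append(f"prefetch L{layer} -> buf{layer % 2}")

prefetch(0)                      # warm up the pipeline
for layer in range(NUM_LAYERS):
    if layer + 1 < NUM_LAYERS:
        prefetch(layer + 1)      # overlaps with compute in the real system
    kv = buffers[layer % 2]      # next layer went to the OTHER slot
    log.append(f"compute L{layer} with {kv}")

print("\n".join(log))
```

Alternating slots is what makes the two merged features complementary: the decode pipeline hides fetch latency behind compute, while the per-layer prefill buffer hides offload latency the same way in the other direction.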
Zijie Tian
ccf27d3a74 [claudesquad] update from 'perf_opt-1' on 07 Jan 26 05:58 CST 2026-01-07 05:58:23 +08:00
Zijie Tian
0ad86eb449 [claudesquad] update from 'perf_opt-2' on 07 Jan 26 05:58 CST 2026-01-07 05:58:10 +08:00
Zijie Tian
2fe50bab50 [claudesquad] update from 'debug_chunk-2' on 07 Jan 26 03:27 CST 2026-01-07 03:27:27 +08:00
Zijie Tian
f240903013 [docs] Add GPU mutex instructions for multi-instance debugging
Add instructions for Claude instances to check GPU availability before
running CUDA operations, preventing conflicts when multiple instances
debug in parallel on a single GPU.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-07 01:42:59 +08:00
Zijie Tian
edb5273e34 [WIP] Added basic test for quest. 2026-01-06 22:30:31 +08:00
Zijie Tian
e554d5482b [refactor] Delete unnecessary test, and refactor the offload prefix cache. 2026-01-05 20:31:42 +08:00
Zijie Tian
054aaff403 [fix] Fixed needle test bug. 2026-01-05 18:34:09 +08:00
Zijie Tian
9b52d25866 [docs] Update CLAUDE.md. 2026-01-03 20:46:00 +08:00
Zijie Tian
bf4c63c7ec [docs] Added Sparse Attn. 2025-12-29 19:56:54 +08:00
Zijie Tian
82ed34fc2d [opt] optimize nano-vllm performance to be comparable with vllm. 2025-12-25 03:47:07 +08:00
Zijie Tian
16fcf8350b [WIP] replace merge attention with Triton kernel. 2025-12-25 01:07:05 +08:00
Zijie Tian
6ec1b23982 [WIP] NEED to modify communication. 2025-12-24 21:57:51 +08:00
Zijie Tian
782437c486 [WIP] remove num_prefetch_blocks variable. 2025-12-24 18:22:26 +08:00
Zijie Tian
1907b625b6 [refactor] Remove legacy mode path. 2025-12-22 20:17:56 +08:00
Zijie Tian
08d83185ce [fix] fix bench*.py. 2025-12-22 19:53:50 +08:00
Zijie Tian
8df0c7517b [docs] refactor CLAUDE.md. 2025-12-15 21:43:33 +08:00
Zijie Tian
b8b6478506 [feat] Needs to be optimized with async prefetch. 2025-12-15 06:58:40 +08:00
Zijie Tian
1081ab51ea [refactor] Refactor offload code to multi-chunk. 2025-12-15 01:13:58 +08:00
Zijie Tian
5949537faf [docs] Start using CLAUDE rules. 2025-12-15 00:20:54 +08:00
Zijie Tian
a37f07943c [docs] Update the CLAUDE.md. 2025-12-15 00:13:27 +08:00
Zijie Tian
761929390e [bench] Added vllm vs nano-vllm bench. 2025-12-10 00:44:57 +08:00