Add a new Claude command that executes task_plan.md refactoring with:
- GPU isolation via --gpu <id> parameter (required)
- Optional --no-interrupt mode for autonomous execution
- Progress tracking via progress.md and findings.md
- Strict CUDA_VISIBLE_DEVICES enforcement
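The enforcement pattern can be sketched as follows (a minimal sketch; `enforce_gpu_isolation` is a hypothetical helper name, not necessarily the command's actual implementation). The key point is that the variable must be pinned before any CUDA-aware library is imported:

```python
import os

def enforce_gpu_isolation(gpu_id: int) -> None:
    """Pin CUDA_VISIBLE_DEVICES to a single GPU.

    Must run before `import torch`: CUDA reads the variable once at
    initialization, so setting it later has no effect.
    """
    existing = os.environ.get("CUDA_VISIBLE_DEVICES")
    if existing is not None and existing != str(gpu_id):
        raise RuntimeError(
            f"CUDA_VISIBLE_DEVICES already set to {existing!r}; "
            f"refusing to override with --gpu {gpu_id}"
        )
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
```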
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add ops module ported from tzj/minference branch containing:
- xattn.py: XAttention block importance estimation with Triton kernels
- xattn_estimate(): standard estimation for sparse attention mask
- xattn_estimate_chunked(): chunked prefill compatible version
- flat_group_gemm_fuse_reshape(): fused stride reshape + GEMM kernel
- softmax_fuse_block_sum(): online softmax + block-wise sum kernel
- chunked_attention.py: Flash attention with LSE output for chunk merging
- test_xattn_estimate_chunked.py: verification test (all seq_lens pass)
This prepares the foundation for AttentionPolicy refactoring where
XAttentionPolicy.estimate() will call these ops.
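The LSE-based chunk merging that chunked_attention.py enables can be illustrated with a small numpy sketch (illustrative only; the real kernels operate on GPU tensors and fused layouts):

```python
import numpy as np

def merge_attention_chunks(o1, lse1, o2, lse2):
    """Merge two partial attention outputs over disjoint KV chunks.

    o*:   [..., q_len, head_dim] partial outputs
    lse*: [..., q_len] log-sum-exp of the attention logits per chunk

    Each partial output is normalized by its own chunk's softmax sum;
    rescaling by exp(lse_i - lse) renormalizes against the combined sum.
    """
    lse = np.logaddexp(lse1, lse2)            # combined normalizer
    w1 = np.exp(lse1 - lse)[..., None]        # rescale factor for chunk 1
    w2 = np.exp(lse2 - lse)[..., None]        # rescale factor for chunk 2
    return o1 * w1 + o2 * w2, lse
```

Merging in this form is associative, so any number of prefill chunks can be folded in one at a time.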
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Move notes.md to docs/development_notes.md
- Move Xattention_analysis.md to docs/xattention_analysis.md
- Delete DEBUG_SUMMARY.md (no longer needed)
- Update CLAUDE.md with documentation index entries
Co-Authored-By: Claude <noreply@anthropic.com>
Add .claude/settings.json to enable claude-flow MCP in all worktrees.
This configuration includes:
- SessionStart hook to auto-start claude-flow daemon
- Auto-approval for claude-flow MCP tools and CLI commands
- Basic claude-flow settings
Co-Authored-By: Claude <noreply@anthropic.com>
Add comprehensive documentation for the MIT-Han-Lab Block-Sparse-Attention
library (3rdparty submodule, branch: tzj/minference).
The new document covers:
- Four sparse attention modes (dense, token/block streaming, block sparse)
- Hybrid mask support (different patterns per head)
- Complete API reference for all three functions
- Performance benchmarks (up to 3-4x speedup on A100)
- Integration considerations for nano-vllm
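For intuition, the block-level layout of a streaming pattern can be sketched as follows (illustrative only; `streaming_block_mask` is not the library's API, and the real masks are hybrid per-head structures):

```python
import numpy as np

def streaming_block_mask(num_blocks: int, sink: int = 1, window: int = 2) -> np.ndarray:
    """Boolean [q_blocks, kv_blocks] mask: global sink blocks + causal local window."""
    mask = np.zeros((num_blocks, num_blocks), dtype=bool)
    for q in range(num_blocks):
        mask[q, :sink] = True                 # always attend to the sink blocks
        lo = max(0, q - window + 1)
        mask[q, lo:q + 1] = True              # causal sliding window of blocks
    return mask
```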
Co-Authored-By: Claude <noreply@anthropic.com>
Add 3rdparty/Block-Sparse-Attention as a git submodule from the
tzj/minference branch of Zijie-Tian/Block-Sparse-Attention repository.
Co-Authored-By: Claude <noreply@anthropic.com>
Add comprehensive memory analysis for 64k inference on Llama 3.1 8B:
New documentation:
- docs/64k_memory_analysis.md: GPU-only vs offload memory analysis,
OOM root cause (memory fragmentation), RTX 3090 limitations,
theoretical vs actual memory usage breakdown
Test configuration updates:
- tests/test_ruler.py: Add --num-kv-buffers parameter for ring buffer
size tuning (default 4, can reduce to 1 for lower memory)
- Update default data_dir to ruler_64k
- Update default max_model_len to 65664 for 64k support
CLAUDE.md updates:
- Add 64k_memory_analysis.md to documentation index
- Document num_kv_buffers parameter in Configuration section
- Add 64k hardware requirements note to Model Limits
Key finding: 64k inference needs ~26GB (GPU-only) or ~23GB (offload);
memory fragmentation pushes both past what 24GB GPUs can serve, making
A100 (40GB+) the recommended hardware for 64k workloads.
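The headline numbers are consistent with a back-of-envelope calculation, assuming Llama 3.1 8B's published config (32 layers, 8 KV heads via GQA, head_dim 128) in fp16:

```python
# Back-of-envelope check of the ~26GB figure (assumed config, fp16 throughout).
GiB = 1024 ** 3

seq_len  = 65536
layers   = 32
kv_heads = 8        # grouped-query attention
head_dim = 128
dtype_b  = 2        # bytes per fp16/bf16 element

kv_cache = seq_len * layers * 2 * kv_heads * head_dim * dtype_b   # K and V
weights  = 8.0e9 * dtype_b                                        # ~8B params

print(f"KV cache: {kv_cache / GiB:.1f} GiB")   # 8.0 GiB
print(f"weights:  {weights / GiB:.1f} GiB")    # 14.9 GiB
```

KV cache (~8 GiB) plus weights (~15 GiB) already approach 23 GiB before activations, so fragmentation overhead readily pushes a 24GB card over the edge.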
Co-Authored-By: Claude <noreply@anthropic.com>
The add_rms_forward method processes two input tensors (x and residual),
which triggers repeated torch.compile recompilations. Keep @torch.compile
only on rms_forward, which processes a single input.
This prevents unnecessary recompilation overhead during inference.
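The resulting pattern looks roughly like this (a sketch, not the exact nano-vllm code):

```python
import torch
from torch import nn

class RMSNorm(nn.Module):
    """Sketch: compile only the single-tensor path."""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(hidden_size))

    @torch.compile
    def rms_forward(self, x: torch.Tensor) -> torch.Tensor:
        # Single input -> stable compile guards, compiled once.
        var = x.float().pow(2).mean(-1, keepdim=True)
        return (x.float() * torch.rsqrt(var + self.eps)).to(x.dtype) * self.weight

    def add_rms_forward(self, x: torch.Tensor, residual: torch.Tensor):
        # Two inputs (x, residual) -> left eager to avoid recompilation churn.
        x = x + residual
        var = x.float().pow(2).mean(-1, keepdim=True)
        out = (x.float() * torch.rsqrt(var + self.eps)).to(x.dtype) * self.weight
        return out, x
```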
Co-Authored-By: Claude <noreply@anthropic.com>
Implement chunked processing for LayerNorm, QKV projection, and MLP
layers to reduce peak activation memory for 64k sequence inference.
Changes:
- Chunked input_layernorm and post_attention_layernorm (chunk_size=128)
- Chunked QKV projection (chunk_size=128)
- Chunked MLP processing (chunk_size=128) with memory cleanup
- Added torch.cuda.empty_cache() calls after each chunk
This reduces peak activation from ~2 GB to ~50 MB per layer,
making 64k inference theoretically possible on 24GB GPUs
(though still limited by memory fragmentation).
Related: docs/64k_memory_analysis.md
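The chunking pattern is the same for all three sublayers and can be sketched generically (a sketch; `chunked_forward` is a hypothetical helper, while the actual code chunks each sublayer inline):

```python
import torch

def chunked_forward(module, x: torch.Tensor, chunk_size: int = 128) -> torch.Tensor:
    """Apply a per-token module (norm / QKV proj / MLP) chunk by chunk.

    Peak activation memory scales with chunk_size instead of seq_len, at
    the cost of extra kernel launches and cache flushes.
    """
    outputs = []
    for start in range(0, x.size(0), chunk_size):
        outputs.append(module(x[start:start + chunk_size]))
        # Return freed intermediate activations to the allocator pool.
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    return torch.cat(outputs, dim=0)
```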
Co-Authored-By: Claude <noreply@anthropic.com>
- Add test_ruler.py supporting all 13 RULER tasks (NIAH, QA, CWE, FWE, VT)
- Implement RULER official evaluation metrics (string_match_all/part)
- Pin max_model_len to 32896 to prevent decode OOM on long inputs
- Add ruler_benchmark_report.md with full test results (92.1% accuracy)
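For reference, the two metrics behave roughly as follows (a hedged sketch of my reading of RULER's definitions, not a verbatim port of its evaluation code):

```python
def string_match_all(pred: str, refs: list[str]) -> float:
    """Fraction of reference strings found in the prediction (all needles count)."""
    pred = pred.lower()
    return sum(r.lower() in pred for r in refs) / len(refs)

def string_match_part(pred: str, refs: list[str]) -> float:
    """Full credit if any reference string appears in the prediction."""
    pred = pred.lower()
    return float(any(r.lower() in pred for r in refs))
```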
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Document key finding: single-request inference works correctly (100% accuracy).
The 66% accuracy in batch mode is caused by state accumulating between
sequential requests in the same process.
- Add comparison table: independent (100%) vs batch (66%) testing modes
- Document root cause analysis: state cleanup issue between requests
- Add workaround using test_ruler_niah.sh for independent testing
- Update next steps to focus on OffloadEngine reset/cleanup logic
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add test_ruler_niah.sh for independent sample testing across multiple GPUs.
Each sample runs in a separate Python process to avoid state accumulation issues.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Auto port allocation with _find_free_port() in model_runner.py
- Resource management refactor with close() + context manager in llm_engine.py
- Add tests/test_port_conflict.py and tests/run_parallel_niah.sh
- Remove docs/torch_distributed_port_issue.md (issue fixed)
- Ignore tests/data/ directory
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Update overview to reflect Qwen3/Qwen2/Llama support
- Add docs/multi_model_support.md to documentation index
- Add Llama-3.1-8B-Instruct to model limits
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add model registry system for dynamic model loading
- Implement LlamaForCausalLM with Llama3 RoPE scaling
- Register Qwen3ForCausalLM and Qwen2ForCausalLM
- Update ModelRunner to use get_model_class() for dynamic model selection
Tested: needle 32k test PASSED
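The registry pattern is roughly the following (names illustrative, not the exact nano-vllm code):

```python
# Minimal sketch of a model registry keyed by HF architecture string.
_MODEL_REGISTRY: dict[str, type] = {}

def register_model(architecture: str):
    """Class decorator that records a model class under its architecture name."""
    def wrap(cls):
        _MODEL_REGISTRY[architecture] = cls
        return cls
    return wrap

def get_model_class(architecture: str) -> type:
    try:
        return _MODEL_REGISTRY[architecture]
    except KeyError:
        raise ValueError(f"Unsupported architecture: {architecture}") from None

@register_model("LlamaForCausalLM")
class LlamaForCausalLM:
    ...
```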
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Benchmarks (bench*.py) still require exclusive GPU access for accurate
measurements. Other scripts (tests, examples) now only check for
distributed port 29500 conflicts, allowing parallel GPU sharing.
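The conflict check amounts to probing the default torch.distributed port (a sketch; `port_in_use` is a hypothetical name):

```python
import socket

def port_in_use(port: int = 29500) -> bool:
    """Return True if the default torch.distributed port is already bound locally."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex(("127.0.0.1", port)) == 0
```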
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove all chunked-prefill documentation (ring buffer, sgDMA,
Triton merge kernels, known issues) and replace with layer-wise offload
system documentation including:
- Design philosophy and benefits
- Memory layout and per-layer KV size table
- Prefill and decode flow pseudocode
- Critical implementation details (sync offload, causal=False for decode)
- Helper methods in HybridKVCacheManager
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace the pip install -e . --prefix=./.local approach with a simpler PYTHONPATH method:
- No pip install required
- Code changes take effect immediately
- Each worktree is completely isolated
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>