The add_rms_forward method takes two input tensors (x and residual),
which causes torch.compile to recompile repeatedly. Keep @torch.compile only
on rms_forward, which takes a single input tensor.
This avoids unnecessary recompilation overhead during inference.
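A minimal sketch of the resulting split (method names follow the description above; the exact tensor math is illustrative):

```python
import torch
from torch import nn

class RMSNorm(nn.Module):
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    @torch.compile
    def rms_forward(self, x: torch.Tensor) -> torch.Tensor:
        # Single-input path keeps a stable signature, so the compiled graph is reused.
        var = x.float().pow(2).mean(dim=-1, keepdim=True)
        return (x.float() * torch.rsqrt(var + self.eps)).to(x.dtype) * self.weight

    def add_rms_forward(self, x: torch.Tensor, residual: torch.Tensor):
        # Two-input path stays uncompiled and delegates to the compiled kernel,
        # so varying (x, residual) combinations no longer force recompilation.
        x = x + residual
        return self.rms_forward(x), x
```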
Co-Authored-By: Claude <noreply@anthropic.com>
Implement chunked processing for LayerNorm, QKV projection, and MLP
layers to reduce peak activation memory for 64k sequence inference.
Changes:
- Chunked input_layernorm and post_attention_layernorm (chunk_size=128)
- Chunked QKV projection (chunk_size=128)
- Chunked MLP processing (chunk_size=128) with memory cleanup
- Added torch.cuda.empty_cache() calls after each chunk
This reduces peak activation memory from ~2 GB to ~50 MB per layer,
making 64k inference theoretically possible on 24 GB GPUs
(though still limited by memory fragmentation).
Related: docs/64k_memory_analysis.md
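The chunking pattern, sketched minimally (chunked_forward is illustrative; the actual changes apply it inline to input_layernorm, the QKV projection, and the MLP):

```python
import torch

def chunked_forward(module, x: torch.Tensor, chunk_size: int = 128) -> torch.Tensor:
    # Process the token dimension in fixed-size chunks so only one chunk's
    # intermediate activations are live at a time.
    outputs = []
    for start in range(0, x.shape[0], chunk_size):
        outputs.append(module(x[start:start + chunk_size]))
        torch.cuda.empty_cache()  # return the freed intermediates to the allocator pool
    return torch.cat(outputs, dim=0)
```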
Co-Authored-By: Claude <noreply@anthropic.com>
- Add test_ruler.py supporting all 13 RULER tasks (NIAH, QA, CWE, FWE, VT)
- Implement the official RULER evaluation metrics (string_match_all/part); sketched below
- Set max_model_len to 32896 to prevent decode OOM on long inputs
- Add ruler_benchmark_report.md with full test results (92.1% accuracy)
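For reference, the two metrics behave roughly as follows (paraphrased sketch, not copied from test_ruler.py):

```python
def string_match_part(preds: list[str], refs: list[list[str]]) -> float:
    # A sample scores 1.0 if any of its reference strings appears in the prediction.
    hits = [max(1.0 if r.lower() in p.lower() else 0.0 for r in ref)
            for p, ref in zip(preds, refs)]
    return 100.0 * sum(hits) / len(preds)

def string_match_all(preds: list[str], refs: list[list[str]]) -> float:
    # A sample scores the fraction of its reference strings found in the prediction.
    hits = [sum(1.0 if r.lower() in p.lower() else 0.0 for r in ref) / len(ref)
            for p, ref in zip(preds, refs)]
    return 100.0 * sum(hits) / len(preds)
```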
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Document key finding: single-request inference works correctly (100% accuracy).
The 66% accuracy in batch mode is caused by state accumulating between
sequential requests in the same process.
- Add comparison table: independent (100%) vs batch (66%) testing modes
- Document root cause analysis: state cleanup issue between requests
- Add workaround using test_ruler_niah.sh for independent testing
- Update next steps to focus on OffloadEngine reset/cleanup logic
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add test_ruler_niah.sh for independent sample testing across multiple GPUs.
Each sample runs in a separate Python process to avoid state accumulation issues.
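The script is equivalent in spirit to a driver like the following (flags and the GPU round-robin are hypothetical, shown only to illustrate the one-process-per-sample idea):

```python
import os
import subprocess

def run_samples_isolated(num_samples: int, gpus: list[int]) -> None:
    # Launch each sample in a fresh Python process so no engine or cache state
    # survives from one request to the next.
    for i in range(num_samples):
        env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpus[i % len(gpus)])}
        subprocess.run(
            ["python", "test_ruler.py", "--task", "niah_single_1", "--sample-idx", str(i)],
            env=env,
            check=True,
        )
```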
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Auto port allocation with _find_free_port() in model_runner.py
- Resource management refactor with close() + context manager in llm_engine.py (both sketched below)
- Add tests/test_port_conflict.py and tests/run_parallel_niah.sh
- Remove docs/torch_distributed_port_issue.md (issue fixed)
- Ignore tests/data/ directory
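A minimal sketch of both pieces (EngineHandle stands in for LLMEngine; the real close() also tears down workers and the process group):

```python
import socket
from contextlib import closing

def _find_free_port() -> int:
    # Ask the OS for an unused TCP port instead of hard-coding 29500,
    # so parallel runs on one machine no longer collide.
    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

class EngineHandle:
    """Illustrative stand-in for the new close() / context-manager surface."""

    def close(self) -> None:
        # Release model weights, KV cache, and distributed resources exactly once.
        pass

    def __enter__(self) -> "EngineHandle":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.close()
        return False  # never swallow exceptions
```

Callers can then write `with LLMEngine(...) as engine:` and get cleanup even when a test fails mid-run.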
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Update overview to reflect Qwen3/Qwen2/Llama support
- Add docs/multi_model_support.md to documentation index
- Add Llama-3.1-8B-Instruct to model limits
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add model registry system for dynamic model loading (sketched below)
- Implement LlamaForCausalLM with Llama3 RoPE scaling
- Register Qwen3ForCausalLM and Qwen2ForCausalLM
- Update ModelRunner to use get_model_class() for dynamic model selection
Tested: needle 32k test PASSED
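The registry pattern, roughly (decorator and dict layout are illustrative; only get_model_class() is named in the diff):

```python
from torch import nn

_MODEL_REGISTRY: dict[str, type[nn.Module]] = {}

def register_model(architecture: str):
    # Models self-register under the name that appears in config.architectures.
    def wrap(cls: type[nn.Module]) -> type[nn.Module]:
        _MODEL_REGISTRY[architecture] = cls
        return cls
    return wrap

def get_model_class(architecture: str) -> type[nn.Module]:
    # ModelRunner resolves the class dynamically instead of hard-coding one model.
    try:
        return _MODEL_REGISTRY[architecture]
    except KeyError:
        raise ValueError(f"Unsupported architecture: {architecture}") from None
```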
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Benchmarks (bench*.py) still require exclusive GPU access for accurate
measurements. Other scripts (tests, examples) now only check for
distributed port 29500 conflicts, allowing parallel GPU sharing.
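The port check amounts to probing whether something already listens on the default torch.distributed rendezvous port (illustrative sketch, not necessarily the exact check used):

```python
import socket

def distributed_port_in_use(port: int = 29500) -> bool:
    # Tests and examples only need to avoid clashing on the rendezvous port;
    # sharing the GPU itself is fine for them.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex(("127.0.0.1", port)) == 0
```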
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove all chunked-prefill-related documentation (ring buffer, sgDMA,
Triton merge kernels, known issues) and replace it with layer-wise offload
system documentation, including:
- Design philosophy and benefits
- Memory layout and per-layer KV size table
- Prefill and decode flow pseudocode (a condensed prefill sketch follows this list)
- Critical implementation details (sync offload, causal=False for decode)
- Helper methods in HybridKVCacheManager
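For orientation, the documented prefill flow condenses to roughly this (names such as offload_to_cpu are illustrative, not actual HybridKVCacheManager methods):

```python
def prefill_layer_wise(layers, hidden, kv_manager):
    # One reusable GPU KV buffer: compute a layer, then synchronously offload its
    # K/V to pinned CPU memory before the next layer overwrites the buffer.
    for layer_idx, layer in enumerate(layers):
        hidden, k, v = layer(hidden)                # attention writes K/V into the shared buffer
        kv_manager.offload_to_cpu(layer_idx, k, v)  # sync copy; GPU buffer is free afterwards
    return hidden
```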
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace the pip install -e . --prefix=./.local approach with a simpler PYTHONPATH-based method:
- No pip install required
- Code changes take effect immediately
- Each worktree is completely isolated
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Combines two performance optimization features:
- perf_opt-1: Cross-layer pipeline for decode (double-buffered layer cache)
- perf_opt-2: Per-layer prefill buffer for async offload
Both features are complementary and improve CPU offload performance.
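The decode-side pipeline (perf_opt-1) boils down to double buffering, roughly as below; the buffer layout, layer call signature, and stream handling are illustrative:

```python
import torch

def decode_double_buffered(layers, hidden, cpu_kv, gpu_buf, copy_stream):
    # gpu_buf: two reusable GPU KV buffers; cpu_kv: per-layer pinned CPU tensors.
    gpu_buf[0].copy_(cpu_kv[0], non_blocking=True)  # stage layer 0's KV up front
    for i, layer in enumerate(layers):
        cur, nxt = gpu_buf[i % 2], gpu_buf[(i + 1) % 2]
        if i + 1 < len(layers):
            # Prefetch the next layer's KV on a side stream so the H2D copy
            # overlaps with this layer's attention compute.
            copy_stream.wait_stream(torch.cuda.current_stream())
            with torch.cuda.stream(copy_stream):
                nxt.copy_(cpu_kv[i + 1], non_blocking=True)
        hidden = layer(hidden, kv_cache=cur)
        torch.cuda.current_stream().wait_stream(copy_stream)  # KV must land before the next layer reads it
    return hidden
```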
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add instructions for Claude instances to check GPU availability before
running CUDA operations, preventing conflicts when multiple instances
debug in parallel on a single GPU.
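One possible form of the check (an assumption; the instructions may simply point at nvidia-smi instead):

```python
import torch

def gpu_looks_free(device: int = 0, min_free_gib: float = 20.0) -> bool:
    # Another instance debugging on the same GPU shows up as reduced free memory.
    free_bytes, _total = torch.cuda.mem_get_info(device)
    return free_bytes / 2**30 >= min_free_gib
```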
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>