nano-vllm

Author	SHA1	Message	Date
Zijie Tian	de6f36bdb2	[docs] Added dist port issue.	2026-01-12 15:16:39 +08:00
Zijie Tian	8e0888c20c	[docs] Added offload_acc issue.	2026-01-12 15:05:55 +08:00
Zijie Tian	a6cc703d73	[tests] Added test_niah_standalone.py.	2026-01-12 00:16:37 +08:00
Zijie Tian	5895de0c97	[docs] Added transformers error description.	2026-01-11 18:48:50 +08:00
Zijie Tian	2771312565	[docs] Add sparse prefill integration plan from int-minference analysis Consolidated analysis from int-minference-1/2/3 branches into a unified integration plan for MInference, XAttention, and FlexPrefill strategies. Key design decisions: - Backward compatible: Keep existing SparsePolicy interface - Unified BlockMask intermediate representation for new strategies - XAttention/FlexPrefill use block_sparse_attn_func kernel - MInference can optionally use block_sparse_attn (Phase 4) Five-phase implementation plan: 1. BlockMask + block_sparse_attn wrapper 2. XAttention implementation 3. FlexPrefill implementation 4. Optional MInference refactoring 5. Integration and testing Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-10 23:33:09 +08:00
Zijie Tian	de6eae472d	[docs] Update CLAUDE.md with multi-model support documentation - Update overview to reflect Qwen3/Qwen2/Llama support - Add docs/multi_model_support.md to documentation index - Add Llama-3.1-8B-Instruct to model limits Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-10 21:29:39 +08:00
Zijie Tian	e23be2e844	Merge branch 'zijie/add-llama-1': Add multi-model support - Add model registry system for dynamic model loading - Implement LlamaForCausalLM with Llama3 RoPE scaling - Register Qwen3ForCausalLM and Qwen2ForCausalLM - Update ModelRunner to use get_model_class() for dynamic model selection Tested: needle 32k test PASSED Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-10 21:20:53 +08:00
Zijie Tian	24f5ae5fc3	[claudesquad] update from 'add-llama-1' on 10 Jan 26 21:14 CST	2026-01-10 21:14:32 +08:00
Zijie Tian	03a8c033cb	[claudesquad] update from 'add-llama-1' on 10 Jan 26 21:03 CST	2026-01-10 21:03:45 +08:00
Zijie Tian	9377ff63fe	Merge remote-tracking branch 'origin/zijie/fix-bug-2' into tzj/vs_offload	2026-01-09 16:13:38 +08:00
Zijie Tian	067e36f4a2	[claudesquad] update from 'fix-bug-2' on 09 Jan 26 16:10 CST	2026-01-09 16:10:28 +08:00
Zijie Tian	1425510a2e	[claudesquad] update from 'fix-bug-2' on 09 Jan 26 16:05 CST	2026-01-09 16:05:36 +08:00
Zijie Tian	335117bfca	Merge remote-tracking branch 'origin/zijie/fix-bug-2' into tzj/vs_offload	2026-01-09 15:21:48 +08:00
Zijie Tian	5012b11291	[bench] Modify bench_vllm.py	2026-01-09 15:20:37 +08:00
Zijie Tian	ccf04d3917	[claudesquad] update from 'fix-bug-2' on 09 Jan 26 15:16 CST	2026-01-09 15:16:55 +08:00
Zijie Tian	59f8970ed3	[claudesquad] update from 'fix-bug-2' on 09 Jan 26 15:12 CST	2026-01-09 15:12:42 +08:00
Zijie Tian	6378cb4c17	Merge remote-tracking branch 'origin/zijie/fix-ga-perf-2' into tzj/vs_offload	2026-01-09 14:21:00 +08:00
Zijie Tian	47e3e465f0	[claudesquad] update from 'fix-ga-perf-2' on 09 Jan 26 14:08 CST	2026-01-09 14:08:12 +08:00
Zijie Tian	aac94c9481	[claude] Added some commands.	2026-01-09 13:16:23 +08:00
Zijie Tian	79c4df4a27	[claudesquad] update from 'int-minference-1' on 08 Jan 26 23:42 CST	2026-01-08 23:42:30 +08:00
Zijie Tian	ea4e904de0	[claudesquad] update from 'int-minference-1' on 08 Jan 26 23:22 CST	2026-01-08 23:22:38 +08:00
Zijie Tian	0bfe1984ef	[docs] Refine GPU mutex: exclusive for benchmarks, port check for tests Benchmarks (bench*.py) still require exclusive GPU access for accurate measurements. Other scripts (tests, examples) now only check for distributed port 29500 conflicts, allowing parallel GPU sharing. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-08 21:35:08 +08:00
Zijie Tian	105201b902	[claudesquad] update from 'lw-offload-2' on 08 Jan 26 21:19 CST	2026-01-08 21:19:38 +08:00
Zijie Tian	a8c9f0d837	[claudesquad] update from 'lw-offload-2' on 08 Jan 26 20:53 CST	2026-01-08 20:53:08 +08:00
Zijie Tian	85bcca3d17	[claudesquad] update from 'int-offload-1' on 08 Jan 26 19:44 CST	2026-01-08 19:44:29 +08:00
Zijie Tian	b5c0ef3b7a	[docs] Replace chunked prefill docs with layer-wise offload strategy Remove all chunked prefill related documentation (ring buffer, sgDMA, Triton merge kernels, known issues) and replace with layer-wise offload system documentation including: - Design philosophy and benefits - Memory layout and per-layer KV size table - Prefill and decode flow pseudocode - Critical implementation details (sync offload, causal=False for decode) - Helper methods in HybridKVCacheManager 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-08 05:39:26 +08:00
Zijie Tian	bbbfd1e7da	[docs] Simplify multi-instance development with direct PYTHONPATH Replace pip install -e . --prefix=./.local approach with simpler PYTHONPATH method: - No pip install required - Code changes take effect immediately - Each worktree is completely isolated Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-08 04:51:55 +08:00
Zijie Tian	c1ddb44e5d	Merge branch 'zijie/layer-prefill-1' into tzj/vs_offload Adds MInference sparse attention support: - New MInference sparse policy implementation - A-shape, vertical-slash, and block-sparse patterns - Updated bench.py with sparse attention options - test_minference_gpu.py validation test 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-08 03:40:53 +08:00
Zijie Tian	d8a87da1c3	[claudesquad] update from 'layer-prefill-1' on 08 Jan 26 03:36 CST	2026-01-08 03:36:39 +08:00
Zijie Tian	ecd9ae0271	[WIP] changed to layerwise offload.	2026-01-08 00:28:27 +08:00
Zijie Tian	6575099a06	[refactor] Cleanup unused code after perf_opt merge Removed ~460 lines of unused/redundant code from offload_engine.py: - CUDA gather methods (gathered_h2d_*, update_gather_indices) - Legacy async transfer methods (prefetch_block_async, offload_block_async) - Legacy sync/wait methods (wait_for_block, wait_all_transfers, sync_indices) - Legacy compatibility methods (load_to_compute_layer, wait_compute_layer) - Unused gather_indices tensors and memory calculations Updated class docstring to reflect current architecture. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-07 06:25:21 +08:00
Zijie Tian	8fd25d72d7	Merge perf_opt-1 and perf_opt-2 branches Combines two performance optimization features: - perf_opt-1: Cross-layer pipeline for decode (double-buffered layer cache) - perf_opt-2: Per-layer prefill buffer for async offload Both features are complementary and improve CPU offload performance. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-07 06:03:44 +08:00
Zijie Tian	ccf27d3a74	[claudesquad] update from 'perf_opt-1' on 07 Jan 26 05:58 CST	2026-01-07 05:58:23 +08:00
Zijie Tian	0ad86eb449	[claudesquad] update from 'perf_opt-2' on 07 Jan 26 05:58 CST	2026-01-07 05:58:10 +08:00
Zijie Tian	aa953ecb59	[refactor] Aligned the bench.	2026-01-07 04:25:06 +08:00
Zijie Tian	362f5e575f	[fix] Fixed .gitignores.	2026-01-07 03:32:14 +08:00
Zijie Tian	58a06501c1	Merge branch 'zijie/debug_chunk-2' into tzj/minference	2026-01-07 03:30:38 +08:00
Zijie Tian	2a6e0a2c02	[feat] Added Quest Sparsity Policy.	2026-01-07 03:29:21 +08:00
Zijie Tian	2fe50bab50	[claudesquad] update from 'debug_chunk-2' on 07 Jan 26 03:27 CST	2026-01-07 03:27:27 +08:00
Zijie Tian	c99a6f3d3f	[WIP] Before add Quest policy.	2026-01-07 02:32:30 +08:00
Zijie Tian	f240903013	[docs] Add GPU mutex instructions for multi-instance debugging Add instructions for Claude instances to check GPU availability before running CUDA operations, preventing conflicts when multiple instances debug in parallel on a single GPU. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-07 01:42:59 +08:00
Zijie Tian	0e691f2d85	[WIP] move metadata to GPU.	2026-01-06 23:32:32 +08:00
Zijie Tian	edb5273e34	[WIP] Added basic test for quest.	2026-01-06 22:30:31 +08:00
Zijie Tian	690492e074	[WIP] Before refactor policies.	2026-01-06 20:47:55 +08:00
Zijie Tian	7cc8a394a5	[fix] Fixed bench_offload.py, but performance degraded.	2026-01-06 18:46:48 +08:00
Zijie Tian	535f2037ab	[WIP] Before fix bench_offload.py.	2026-01-06 18:41:08 +08:00
Zijie Tian	c7ac39dfbd	[refactor] Before adding sparse policy.	2026-01-05 21:19:24 +08:00
Zijie Tian	e554d5482b	[refactor] Delete unnecessary test, and refactor the offload prefix cache.	2026-01-05 20:31:42 +08:00
Zijie Tian	247c5312d9	[fix] Fixed decode misalign.	2026-01-05 19:00:44 +08:00
Zijie Tian	054aaff403	[fix] Fixed needle test bug.	2026-01-05 18:34:09 +08:00

149 Commits