nano-vllm

Author	SHA1	Message	Date
Zijie Tian	8fd25d72d7	Merge perf_opt-1 and perf_opt-2 branches. Combines two performance optimization features: perf_opt-1, a cross-layer pipeline for decode (double-buffered layer cache), and perf_opt-2, a per-layer prefill buffer for async offload. Both features are complementary and improve CPU offload performance. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-07 06:03:44 +08:00
Zijie Tian	ccf27d3a74	[claudesquad] update from 'perf_opt-1' on 07 Jan 26 05:58 CST	2026-01-07 05:58:23 +08:00
Zijie Tian	0ad86eb449	[claudesquad] update from 'perf_opt-2' on 07 Jan 26 05:58 CST	2026-01-07 05:58:10 +08:00
Zijie Tian	2a6e0a2c02	[feat] Added Quest Sparsity Policy.	2026-01-07 03:29:21 +08:00
Zijie Tian	c99a6f3d3f	[WIP] Before add Quest policy.	2026-01-07 02:32:30 +08:00
Zijie Tian	535f2037ab	[WIP] Before fix bench_offload.py.	2026-01-06 18:41:08 +08:00
Zijie Tian	c7ac39dfbd	[refactor] Before add sparse policy.	2026-01-05 21:19:24 +08:00
Zijie Tian	d623043a3c	[WIP] FIXED decode and prefill NEEDLE test.	2026-01-05 01:51:46 +08:00
Zijie Tian	e897380127	[test] Added test_align.py and Before change nanovllm attention.	2026-01-04 22:48:01 +08:00
Zijie Tian	30462fe89a	[WIP] Before fix needle.	2025-12-31 23:35:25 +08:00
Zijie Tian	89f8020d38	[WIP] fixing attention compute error.	2025-12-30 00:31:48 +08:00
Zijie Tian	1907b625b6	[refactor] Remove legacy mode path.	2025-12-22 20:17:56 +08:00
Zijie Tian	051f2295c9	[feat] Added sparse KVcache feature, NEED VERIFY.	2025-12-22 08:51:02 +08:00
Zijie Tian	dc7807a211	[feat] Fixed warmup memory overhead.	2025-12-15 21:39:14 +08:00
Zijie Tian	b8b6478506	[feat] Needs to be optimized with async prefetch.	2025-12-15 06:58:40 +08:00
Zijie Tian	1081ab51ea	[refactor] Refactor offload code to multi-chunk.	2025-12-15 01:13:58 +08:00
Zijie Tian	60d24f7c12	[feat] Added bench_offload.py and GreedySampler.	2025-12-12 00:24:08 +08:00
Zijie Tian	babfa17354	[refactor] Translate into English, avoid Chinese due to claude.	2025-12-11 00:30:24 +08:00
Zijie Tian	e85c2b4776	[fix] Fixed kvcache offload bugs.	2025-12-10 22:34:00 +08:00
Zijie Tian	190df5f70d	[refactor] Refactor current gpu and cpu block allocation strategy.	2025-12-10 21:23:31 +08:00
Zijie Tian	0a247ccb1b	[feat] Added `num_gpu_blocks` limit gpu blocks.	2025-12-10 20:17:42 +08:00
Zijie Tian	87055cc5ce	[refactor] Implement real chunked prefill mechanism.	2025-12-10 18:34:01 +08:00
Zijie Tian	0b6f19242d	[feat] Added chunked prefill and kvcache offload mechanism.	2025-12-10 03:47:37 +08:00
GeeeekExplorer	2f21442653	support qwen2	2025-11-04 01:44:42 +08:00
GeeeekExplorer	df99418f7d	simplify	2025-08-31 20:02:51 +08:00
PeterDing	f5b4840276	fix(model_runner): correct position indexing to be 0-based - Change position calculation from len(seq) to len(seq) - 1	2025-07-04 14:29:12 +08:00
GeeeekExplorer	cb0b3dec3f	remove rng state	2025-06-27 22:50:33 +08:00
GeeeekExplorer	1caeec8dfa	same as vllm	2025-06-27 18:50:56 +08:00
GeeeekExplorer	658520b788	warmup and allocate	2025-06-27 01:51:57 +08:00
GeeeekExplorer	03cfc13bb3	faster pickle	2025-06-23 00:51:52 +08:00
GeeeekExplorer	cde3fc22c2	simplify	2025-06-21 17:19:15 +08:00
jinghuan-Chen	ffafaeb133	Release CUDA Graphs resource before exit.	2025-06-18 16:17:31 +08:00
GeeeekExplorer	bc0ad5a116	better	2025-06-17 23:33:38 +08:00
GeeeekExplorer	7e42fa6f63	fix	2025-06-15 13:28:29 +08:00
GeeeekExplorer	fc778a4da9	better	2025-06-15 10:36:45 +08:00
cheunglei	53b3ef2e32	support tensor parallel	2025-06-15 01:31:24 +08:00
GeeeekExplorer	b6136383c9	support fast pickle	2025-06-14 13:36:57 +08:00
GeeeekExplorer	4a8aa090a7	fix	2025-06-14 00:56:07 +08:00
GeeeekExplorer	98a1551a7d	support CUDA_VISIBLE_DEVICES	2025-06-12 23:14:01 +08:00
GeeeekExplorer	fee58d44e4	fix	2025-06-12 01:00:31 +08:00
GeeeekExplorer	08c84ec08d	multi file loader	2025-06-12 01:00:09 +08:00
GeeeekExplorer	386290d69e	refactor	2025-06-11 21:12:57 +08:00
GeeeekExplorer	b98e1ca305	fix	2025-06-10 21:25:54 +08:00
GeeeekExplorer	a5a4909e6a	init commit	2025-06-10 00:27:01 +08:00

44 Commits