nano-vllm

Author	SHA1	Message	Date
Zijie Tian	484d0de9f9	[feat] Added debug hook to offload_engine.py.	2025-12-31 19:44:39 +08:00
Zijie Tian	89f8020d38	[WIP] fixing attention compute error.	2025-12-30 00:31:48 +08:00
Zijie Tian	82ed34fc2d	[opt] optimize nanovllm performance comparable with vllm.	2025-12-25 03:47:07 +08:00
Zijie Tian	16fcf8350b	[WIP] replace merge attention with triton kernel.	2025-12-25 01:07:05 +08:00
Zijie Tian	cf5e7df093	[WIP] Added sgDMA operator for scatter kvcache communication.	2025-12-24 23:48:52 +08:00
Zijie Tian	6ec1b23982	[WIP] NEED to modify communication.	2025-12-24 21:57:51 +08:00
Zijie Tian	782437c486	[WIP] remove num_prefetch_blocks variable.	2025-12-24 18:22:26 +08:00
Zijie Tian	4dcef16c13	[WIP] NEED refactor nanovllm mechanism.	2025-12-22 23:52:56 +08:00
Zijie Tian	1907b625b6	[refactor] Remove legacy mode path.	2025-12-22 20:17:56 +08:00
Zijie Tian	051f2295c9	[feat] Added sparse KVcache feature, NEED VERIFY.	2025-12-22 08:51:02 +08:00
Zijie Tian	dc7807a211	[feat] Fixed warmup memory overhead.	2025-12-15 21:39:14 +08:00
Zijie Tian	91a0f09a24	[feat] Optimized with ASYNC offload.	2025-12-15 07:21:35 +08:00
Zijie Tian	b8b6478506	[feat] Need to optimize with async prefetch.	2025-12-15 06:58:40 +08:00
Zijie Tian	1081ab51ea	[refactor] Refactor offload code to multi-chunk.	2025-12-15 01:13:58 +08:00
Zijie Tian	61edb8a344	[feat] Finished offload. Still need to optimize performance.	2025-12-12 02:27:40 +08:00
Zijie Tian	9b8165af5a	[fix] Fixed kvcache offload problem.	2025-12-12 01:35:30 +08:00
Zijie Tian	60d24f7c12	[feat] Added bench_offload.py and GreedySampler.	2025-12-12 00:24:08 +08:00
Zijie Tian	0bd7ba7536	[fix] Fixed chunked_attention.py implement.	2025-12-11 22:39:50 +08:00
Zijie Tian	b9ed77cbbb	[fix] Fix import error.	2025-12-11 05:31:06 +08:00
Zijie Tian	babfa17354	[refactor] Translate into English, avoid Chinese due to Claude.	2025-12-11 00:30:24 +08:00
Zijie Tian	e85c2b4776	[fix] Fixed kvcache offload bugs.	2025-12-10 22:34:00 +08:00
Zijie Tian	190df5f70d	[refactor] Refactor current gpu and cpu block allocation strategy.	2025-12-10 21:23:31 +08:00
Zijie Tian	0a247ccb1b	[feat] Added `num_gpu_blocks` to limit GPU blocks.	2025-12-10 20:17:42 +08:00
Zijie Tian	01f19ee4a6	[feat] Added logger into nanovllm.	2025-12-10 19:53:38 +08:00
Zijie Tian	87055cc5ce	[refactor] Implement real chunked prefill mechanism.	2025-12-10 18:34:01 +08:00
Zijie Tian	0b6f19242d	[feat] Added chunked prefill and kvcache offload mechanism.	2025-12-10 03:47:37 +08:00
Zijie Tian	204fe2b38f	[feat] Added metric into tqdm bar.	2025-12-10 00:52:13 +08:00
Zijie Tian	761929390e	[bench] Added vllm vs nano-vllm bench.	2025-12-10 00:44:57 +08:00
GeeeekExplorer	2f21442653	support qwen2	2025-11-04 01:44:42 +08:00
GeeeekExplorer	6ef2a4f630	compile random sampling	2025-08-31 22:55:34 +08:00
GeeeekExplorer	df99418f7d	simplify	2025-08-31 20:02:51 +08:00
PeterDing	f5b4840276	fix(model_runner): correct position indexing to be 0-based - Change position calculation from len(seq) to len(seq) - 1	2025-07-04 14:29:12 +08:00
GeeeekExplorer	38baf0bbe4	remove assert shape	2025-06-27 23:00:30 +08:00
GeeeekExplorer	cb0b3dec3f	remove rng state	2025-06-27 22:50:33 +08:00
GeeeekExplorer	1caeec8dfa	same as vllm	2025-06-27 18:50:56 +08:00
GeeeekExplorer	658520b788	warmup and allocate	2025-06-27 01:51:57 +08:00
xiaohajiayou	054aec852d	Fix: Division-by-Zero Risk and Typo	2025-06-24 02:02:33 +08:00
GeeeekExplorer	03cfc13bb3	faster pickle	2025-06-23 00:51:52 +08:00
GeeeekExplorer	cde3fc22c2	simplify	2025-06-21 17:19:15 +08:00
jinghuan-Chen	ffafaeb133	Release CUDA Graphs resource before exit.	2025-06-18 16:17:31 +08:00
Xingkai Yu	4fc764f175	Merge pull request #22 from cheunglei/use_spawn	2025-06-17 23:53:59 +08:00
cheunglei	b5ace32982	use spawn	2025-06-17 23:49:15 +08:00
GeeeekExplorer	bc0ad5a116	better	2025-06-17 23:33:38 +08:00
GeeeekExplorer	7e42fa6f63	fix	2025-06-15 13:28:29 +08:00
Xingkai Yu	326b121fad	Merge pull request #10 from MARD1NO/refine_return_hint_in_schedule	2025-06-15 10:39:51 +08:00
GeeeekExplorer	fc778a4da9	better	2025-06-15 10:36:45 +08:00
MARD1NO	98bbbefb68	schedule return bool args	2025-06-15 10:15:05 +08:00
cheunglei	53b3ef2e32	support tensor parallel	2025-06-15 01:31:24 +08:00
GeeeekExplorer	b6136383c9	support fast pickle	2025-06-14 13:36:57 +08:00
GeeeekExplorer	4a8aa090a7	fix	2025-06-14 00:56:07 +08:00

59 Commits