Zijie Tian
8fd25d72d7
Merge perf_opt-1 and perf_opt-2 branches
Combines two performance optimization features:
- perf_opt-1: Cross-layer pipeline for decode (double-buffered layer cache)
- perf_opt-2: Per-layer prefill buffer for async offload
Both features are complementary and improve CPU offload performance.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-07 06:03:44 +08:00
Zijie Tian
ccf27d3a74
[claudesquad] update from 'perf_opt-1' on 07 Jan 26 05:58 CST
2026-01-07 05:58:23 +08:00
Zijie Tian
0ad86eb449
[claudesquad] update from 'perf_opt-2' on 07 Jan 26 05:58 CST
2026-01-07 05:58:10 +08:00
Zijie Tian
2a6e0a2c02
[feat] Added Quest Sparsity Policy.
2026-01-07 03:29:21 +08:00
Zijie Tian
c99a6f3d3f
[WIP] Before adding Quest policy.
2026-01-07 02:32:30 +08:00
Zijie Tian
535f2037ab
[WIP] Before fixing bench_offload.py.
2026-01-06 18:41:08 +08:00
Zijie Tian
c7ac39dfbd
[refactor] Before adding sparse policy.
2026-01-05 21:19:24 +08:00
Zijie Tian
d623043a3c
[WIP] FIXED decode and prefill NEEDLE test.
2026-01-05 01:51:46 +08:00
Zijie Tian
e897380127
[test] Added test_align.py; before changing nanovllm attention.
2026-01-04 22:48:01 +08:00
Zijie Tian
30462fe89a
[WIP] Before fixing needle.
2025-12-31 23:35:25 +08:00
Zijie Tian
89f8020d38
[WIP] fixing attention compute error.
2025-12-30 00:31:48 +08:00
Zijie Tian
1907b625b6
[refactor] Remove legacy mode path.
2025-12-22 20:17:56 +08:00
Zijie Tian
051f2295c9
[feat] Added sparse KVcache feature, NEED VERIFY.
2025-12-22 08:51:02 +08:00
Zijie Tian
dc7807a211
[feat] Fixed warmup memory overhead.
2025-12-15 21:39:14 +08:00
Zijie Tian
b8b6478506
[feat] Needs to be optimized with async prefetch.
2025-12-15 06:58:40 +08:00
Zijie Tian
1081ab51ea
[refactor] Refactor offload code to multi-chunk.
2025-12-15 01:13:58 +08:00
Zijie Tian
60d24f7c12
[feat] Added bench_offload.py and GreedySampler.
2025-12-12 00:24:08 +08:00
Zijie Tian
babfa17354
[refactor] Translate into English, avoid Chinese due to Claude.
2025-12-11 00:30:24 +08:00
Zijie Tian
e85c2b4776
[fix] Fixed kvcache offload bugs.
2025-12-10 22:34:00 +08:00
Zijie Tian
190df5f70d
[refactor] Refactor current gpu and cpu block allocation strategy.
2025-12-10 21:23:31 +08:00
Zijie Tian
0a247ccb1b
[feat] Added num_gpu_blocks to limit GPU blocks.
2025-12-10 20:17:42 +08:00
Zijie Tian
87055cc5ce
[refactor] Implement real chunked prefill mechanism.
2025-12-10 18:34:01 +08:00
Zijie Tian
0b6f19242d
[feat] Added chunked prefill and kvcache offload mechanism.
2025-12-10 03:47:37 +08:00
GeeeekExplorer
2f21442653
support qwen2
2025-11-04 01:44:42 +08:00
GeeeekExplorer
df99418f7d
simplify
2025-08-31 20:02:51 +08:00
PeterDing
f5b4840276
fix(model_runner): correct position indexing to be 0-based
- Change position calculation from len(seq) to len(seq) - 1
2025-07-04 14:29:12 +08:00
GeeeekExplorer
cb0b3dec3f
remove rng state
2025-06-27 22:50:33 +08:00
GeeeekExplorer
1caeec8dfa
same as vllm
2025-06-27 18:50:56 +08:00
GeeeekExplorer
658520b788
warmup and allocate
2025-06-27 01:51:57 +08:00
GeeeekExplorer
03cfc13bb3
faster pickle
2025-06-23 00:51:52 +08:00
GeeeekExplorer
cde3fc22c2
simplify
2025-06-21 17:19:15 +08:00
jinghuan-Chen
ffafaeb133
Release CUDA Graphs resource before exit.
2025-06-18 16:17:31 +08:00
GeeeekExplorer
bc0ad5a116
better
2025-06-17 23:33:38 +08:00
GeeeekExplorer
7e42fa6f63
fix
2025-06-15 13:28:29 +08:00
GeeeekExplorer
fc778a4da9
better
2025-06-15 10:36:45 +08:00
cheunglei
53b3ef2e32
support tensor parallel
2025-06-15 01:31:24 +08:00
GeeeekExplorer
b6136383c9
support fast pickle
2025-06-14 13:36:57 +08:00
GeeeekExplorer
4a8aa090a7
fix
2025-06-14 00:56:07 +08:00
GeeeekExplorer
98a1551a7d
support CUDA_VISIBLE_DEVICES
2025-06-12 23:14:01 +08:00
GeeeekExplorer
fee58d44e4
fix
2025-06-12 01:00:31 +08:00
GeeeekExplorer
08c84ec08d
multi file loader
2025-06-12 01:00:09 +08:00
GeeeekExplorer
386290d69e
refactor
2025-06-11 21:12:57 +08:00
GeeeekExplorer
b98e1ca305
fix
2025-06-10 21:25:54 +08:00
GeeeekExplorer
a5a4909e6a
init commit
2025-06-10 00:27:01 +08:00