Commit Graph

119 Commits

Author SHA1 Message Date
Zijie Tian
6575099a06 [refactor] Cleanup unused code after perf_opt merge
Removed ~460 lines of unused/redundant code from offload_engine.py:
- CUDA gather methods (gathered_h2d_*, update_gather_indices)
- Legacy async transfer methods (prefetch_block_async, offload_block_async)
- Legacy sync/wait methods (wait_for_block, wait_all_transfers, sync_indices)
- Legacy compatibility methods (load_to_compute_layer, wait_compute_layer)
- Unused gather_indices tensors and memory calculations

Updated class docstring to reflect current architecture.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-07 06:25:21 +08:00
Zijie Tian
8fd25d72d7 Merge perf_opt-1 and perf_opt-2 branches
Combines two performance optimization features:
- perf_opt-1: Cross-layer pipeline for decode (double-buffered layer cache)
- perf_opt-2: Per-layer prefill buffer for async offload

Both features are complementary and improve CPU offload performance.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-07 06:03:44 +08:00
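
Note on 8fd25d72d7: the commit message names a "double-buffered layer cache" for the decode pipeline but does not show how it works. The following is a minimal sketch of the general double-buffering idea only, not the code in this repository. The names `load_layer`, `compute_layer`, and `run_decode_pipeline` are hypothetical, and a background thread stands in for what would presumably be an asynchronous CPU-to-GPU copy.

```python
from concurrent.futures import ThreadPoolExecutor


def load_layer(layer_id):
    # Hypothetical stand-in for copying one layer's KV cache from CPU to GPU.
    return [layer_id] * 4


def compute_layer(layer_id, kv):
    # Hypothetical stand-in for running attention over the loaded KV cache.
    return sum(kv) + layer_id


def run_decode_pipeline(num_layers):
    """Overlap the load of layer i+1 with the compute of layer i."""
    outputs = []
    buffers = [None, None]  # two slots, used alternately
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_layer, 0)
        for layer in range(num_layers):
            slot = layer % 2
            # Wait until the data for this layer has arrived.
            buffers[slot] = pending.result()
            # Start fetching the next layer before computing the current one.
            if layer + 1 < num_layers:
                pending = pool.submit(load_layer, layer + 1)
            outputs.append(compute_layer(layer, buffers[slot]))
    return outputs
```

With two slots, the transfer for the next layer can proceed while the current layer is still being computed, which is the overlap the merge message describes as improving CPU offload performance.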
Zijie Tian
ccf27d3a74 [claudesquad] update from 'perf_opt-1' on 07 Jan 26 05:58 CST 2026-01-07 05:58:23 +08:00
Zijie Tian
0ad86eb449 [claudesquad] update from 'perf_opt-2' on 07 Jan 26 05:58 CST 2026-01-07 05:58:10 +08:00
Zijie Tian
aa953ecb59 [refactor] Aligned the bench. 2026-01-07 04:25:06 +08:00
Zijie Tian
362f5e575f [fix] Fixed .gitignores. 2026-01-07 03:32:14 +08:00
Zijie Tian
58a06501c1 Merge branch 'zijie/debug_chunk-2' into tzj/minference 2026-01-07 03:30:38 +08:00
Zijie Tian
2a6e0a2c02 [feat] Added Quest Sparsity Policy. 2026-01-07 03:29:21 +08:00
Zijie Tian
2fe50bab50 [claudesquad] update from 'debug_chunk-2' on 07 Jan 26 03:27 CST 2026-01-07 03:27:27 +08:00
Zijie Tian
c99a6f3d3f [WIP] Before adding Quest policy. 2026-01-07 02:32:30 +08:00
Zijie Tian
f240903013 [docs] Add GPU mutex instructions for multi-instance debugging
Add instructions for Claude instances to check GPU availability before
running CUDA operations, preventing conflicts when multiple instances
debug in parallel on a single GPU.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-07 01:42:59 +08:00
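
Note on f240903013: the commit message says instances should check GPU availability before running CUDA operations, but it does not say how the check is done. One common way to get this kind of mutual exclusion is an exclusive lock file, sketched below. This is an assumption for illustration, not the procedure documented in the repository; `try_acquire`, `release`, `demo`, and the lock file name are all hypothetical.

```python
import os
import tempfile


def try_acquire(lock_path):
    """Try to claim the GPU by creating a lock file atomically."""
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another instance currently holds the GPU
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))  # record the owner for debugging
    return True


def release(lock_path):
    """Give the GPU back by removing the lock file."""
    os.remove(lock_path)


def demo():
    # Hypothetical lock location; a real setup would use a fixed shared path.
    lock_path = os.path.join(tempfile.mkdtemp(), "gpu0.lock")
    first = try_acquire(lock_path)   # lock is free
    second = try_acquire(lock_path)  # lock is already held
    release(lock_path)
    third = try_acquire(lock_path)   # lock is free again
    release(lock_path)
    return first, second, third
```

A lock file does not survive a crashed owner gracefully, so a real setup would also need some way to detect and clear stale locks.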
Zijie Tian
0e691f2d85 [WIP] move metadata to GPU. 2026-01-06 23:32:32 +08:00
Zijie Tian
edb5273e34 [WIP] Added basic test for quest. 2026-01-06 22:30:31 +08:00
Zijie Tian
690492e074 [WIP] Before refactoring policies. 2026-01-06 20:47:55 +08:00
Zijie Tian
7cc8a394a5 [fix] Fixed bench_offload.py, BUT performance DEGRADED. 2026-01-06 18:46:48 +08:00
Zijie Tian
535f2037ab [WIP] Before fixing bench_offload.py. 2026-01-06 18:41:08 +08:00
Zijie Tian
c7ac39dfbd [refactor] Before adding sparse policy. 2026-01-05 21:19:24 +08:00
Zijie Tian
e554d5482b [refactor] Delete unnecessary test and refactor the offload prefix cache. 2026-01-05 20:31:42 +08:00
Zijie Tian
247c5312d9 [fix] Fixed decode misalign. 2026-01-05 19:00:44 +08:00
Zijie Tian
054aaff403 [fix] Fixed needle test bug. 2026-01-05 18:34:09 +08:00
Zijie Tian
d623043a3c [WIP] FIXED decode and prefill NEEDLE test. 2026-01-05 01:51:46 +08:00
Zijie Tian
e897380127 [test] Added test_align.py, before changing nanovllm attention. 2026-01-04 22:48:01 +08:00
Zijie Tian
24096431ed [refactor] refactor test_align.py. 2026-01-04 20:55:40 +08:00
Zijie Tian
772313db8f [refactor] Refactor the kvcache offload. 2026-01-04 19:37:03 +08:00
Zijie Tian
00ed17c640 [feat] Added debug tools. 2026-01-03 22:36:40 +08:00
Zijie Tian
9b52d25866 [docs] Update CLAUDE.md. 2026-01-03 20:46:00 +08:00
Zijie Tian
8c3418725b [refactor] Refactor needle test. 2026-01-03 19:19:37 +08:00
Zijie Tian
b3685c9190 [test] Added test_align.py 2026-01-03 18:55:58 +08:00
Zijie Tian
6927a75ac3 [refactor] refactor needle.py. 2026-01-03 18:33:48 +08:00
Zijie Tian
ff8b09cd35 [test] Added test_needle_ref.py. 2026-01-02 22:03:23 +08:00
Zijie Tian
74ee6d0895 [WIP] Need to fix model so it decodes normally. 2026-01-01 05:18:27 +08:00
Zijie Tian
62b8a63314 [refactor] Refactor the test_chunked_prefill/decode. 2026-01-01 03:32:26 +08:00
Zijie Tian
965c8aff12 [WIP] Need to change flashattention to debug. 2026-01-01 00:58:22 +08:00
Zijie Tian
30462fe89a [WIP] Before fixing needle. 2025-12-31 23:35:25 +08:00
Zijie Tian
ccd1b3d4ab [WIP] Before modifying nanovllm CPU-GPU kvcache. 2025-12-31 22:41:07 +08:00
Zijie Tian
31e90a7268 [test] Added offload correctness verification. 2025-12-31 20:59:53 +08:00
Zijie Tian
484d0de9f9 [feat] Added debug hook to offload_engine.py. 2025-12-31 19:44:39 +08:00
Zijie Tian
7af721c12c [WIP] Before switching to FlashInfer. 2025-12-30 01:11:13 +08:00
Zijie Tian
89f8020d38 [WIP] fixing attention compute error. 2025-12-30 00:31:48 +08:00
Zijie Tian
bf4c63c7ec [docs] Added Sparse Attn. 2025-12-29 19:56:54 +08:00
Zijie Tian
600af0f59c [fix] Fixed compile problem. 2025-12-26 21:02:43 +08:00
Zijie Tian
82ed34fc2d [opt] Optimize nanovllm performance to be comparable with vllm. 2025-12-25 03:47:07 +08:00
Zijie Tian
16fcf8350b [WIP] replace merge attention with triton kernel. 2025-12-25 01:07:05 +08:00
Zijie Tian
cf5e7df093 [WIP] Added sgDMA operator for scatter kvcache communication. 2025-12-24 23:48:52 +08:00
Zijie Tian
6ec1b23982 [WIP] NEED to modify communication. 2025-12-24 21:57:51 +08:00
Zijie Tian
782437c486 [WIP] Remove num_prefetch_blocks variable. 2025-12-24 18:22:26 +08:00
Zijie Tian
b264de903d [test] Added a simple test_prefill.py. 2025-12-23 00:26:25 +08:00
Zijie Tian
4dcef16c13 [WIP] NEED to refactor nanovllm mechanism. 2025-12-22 23:52:56 +08:00
Zijie Tian
1907b625b6 [refactor] Remove legacy mode path. 2025-12-22 20:17:56 +08:00
Zijie Tian
08d83185ce [fix] Fix bench*.py. 2025-12-22 19:53:50 +08:00