Zijie Tian
76af506956
[claudesquad] update from 'multi-request-2' on 13 Jan 26 02:01 CST
2026-01-13 02:01:07 +08:00
Zijie Tian
64971c8e8a
Merge branch 'zijie/fix-dist-3': Fix distributed port conflict
...
- Auto port allocation with _find_free_port() in model_runner.py
- Resource management refactor with close() + context manager in llm_engine.py
- Add tests/test_port_conflict.py and tests/run_parallel_niah.sh
- Remove docs/torch_distributed_port_issue.md (issue fixed)
- Ignore tests/data/ directory
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-12 16:27:25 +08:00
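
Note on the port-conflict fix above: the commit message names `_find_free_port()` and a `close()` plus context-manager pattern but does not show either. The following is a minimal sketch of how such a pair is commonly written, not the code in model_runner.py or llm_engine.py. The names `find_free_port` and `Engine` are hypothetical, and binding to port 0 on localhost is an assumption about the approach.

```python
import socket


def find_free_port() -> int:
    """Ask the OS for an unused TCP port by binding to port 0 (assumed approach)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


class Engine:
    """Hypothetical stand-in for an engine that owns a distributed port."""

    def __init__(self) -> None:
        self.port = find_free_port()
        self.closed = False

    def close(self) -> None:
        # Idempotent: calling close() twice is harmless.
        if self.closed:
            return
        # A real engine would tear down its process group and workers here.
        self.closed = True

    def __enter__(self) -> "Engine":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.close()
        return False  # do not swallow exceptions
```

With this shape, `with Engine() as engine: ...` releases resources even when the body raises. Note that the port is only reserved while the probe socket is open, so another process could still claim it before the engine binds; the sketch does not address that race.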
Zijie Tian
a6cc703d73
[tests] Added test_niah_standalone.py.
2026-01-12 00:16:37 +08:00
Zijie Tian
e23be2e844
Merge branch 'zijie/add-llama-1': Add multi-model support
...
- Add model registry system for dynamic model loading
- Implement LlamaForCausalLM with Llama3 RoPE scaling
- Register Qwen3ForCausalLM and Qwen2ForCausalLM
- Update ModelRunner to use get_model_class() for dynamic model selection
Tested: needle 32k test PASSED
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 21:20:53 +08:00
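
Note on the model registry above: the commit message mentions a registry and a `get_model_class()` lookup used by ModelRunner, without showing the mechanism. Below is a small sketch of one common way to build such a registry with a decorator. The decorator name `register_model`, the dictionary `_MODEL_REGISTRY`, and the stub classes are hypothetical; only the architecture strings and the name `get_model_class` come from the commit message, and its real signature may differ.

```python
from typing import Callable, Dict

# Maps an architecture name (as found in a model config) to a model class.
_MODEL_REGISTRY: Dict[str, type] = {}


def register_model(*architectures: str) -> Callable[[type], type]:
    """Class decorator that registers a class under one or more names (hypothetical)."""
    def decorator(cls: type) -> type:
        for name in architectures:
            _MODEL_REGISTRY[name] = cls
        return cls
    return decorator


def get_model_class(architecture: str) -> type:
    """Look up a registered class; the signature here is an assumption."""
    try:
        return _MODEL_REGISTRY[architecture]
    except KeyError:
        known = ", ".join(sorted(_MODEL_REGISTRY))
        raise ValueError(
            f"Unsupported architecture {architecture!r}; known: {known}"
        ) from None


# Stub classes standing in for the real model implementations.
@register_model("LlamaForCausalLM")
class LlamaStub:
    pass


@register_model("Qwen3ForCausalLM")
class Qwen3Stub:
    pass


@register_model("Qwen2ForCausalLM")
class Qwen2Stub:
    pass
```

A runner could then call `get_model_class(name)` with the architecture string from the model config instead of importing a fixed class.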
Zijie Tian
03a8c033cb
[claudesquad] update from 'add-llama-1' on 10 Jan 26 21:03 CST
2026-01-10 21:03:45 +08:00
Zijie Tian
1425510a2e
[claudesquad] update from 'fix-bug-2' on 09 Jan 26 16:05 CST
2026-01-09 16:05:36 +08:00
Zijie Tian
59f8970ed3
[claudesquad] update from 'fix-bug-2' on 09 Jan 26 15:12 CST
2026-01-09 15:12:42 +08:00
Zijie Tian
47e3e465f0
[claudesquad] update from 'fix-ga-perf-2' on 09 Jan 26 14:08 CST
2026-01-09 14:08:12 +08:00
Zijie Tian
ea4e904de0
[claudesquad] update from 'int-minference-1' on 08 Jan 26 23:22 CST
2026-01-08 23:22:38 +08:00
Zijie Tian
a8c9f0d837
[claudesquad] update from 'lw-offload-2' on 08 Jan 26 20:53 CST
2026-01-08 20:53:08 +08:00
Zijie Tian
c1ddb44e5d
Merge branch 'zijie/layer-prefill-1' into tzj/vs_offload
...
Adds MInference sparse attention support:
- New MInference sparse policy implementation
- A-shape, vertical-slash, and block-sparse patterns
- Updated bench.py with sparse attention options
- test_minference_gpu.py validation test
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-08 03:40:53 +08:00
Zijie Tian
d8a87da1c3
[claudesquad] update from 'layer-prefill-1' on 08 Jan 26 03:36 CST
2026-01-08 03:36:39 +08:00
Zijie Tian
ecd9ae0271
[WIP] changed to layerwise offload.
2026-01-08 00:28:27 +08:00
Zijie Tian
6575099a06
[refactor] Cleanup unused code after perf_opt merge
...
Removed ~460 lines of unused/redundant code from offload_engine.py:
- CUDA gather methods (gathered_h2d_*, update_gather_indices)
- Legacy async transfer methods (prefetch_block_async, offload_block_async)
- Legacy sync/wait methods (wait_for_block, wait_all_transfers, sync_indices)
- Legacy compatibility methods (load_to_compute_layer, wait_compute_layer)
- Unused gather_indices tensors and memory calculations
Updated class docstring to reflect current architecture.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-07 06:25:21 +08:00
Zijie Tian
8fd25d72d7
Merge perf_opt-1 and perf_opt-2 branches
...
Combines two performance optimization features:
- perf_opt-1: Cross-layer pipeline for decode (double-buffered layer cache)
- perf_opt-2: Per-layer prefill buffer for async offload
Both features are complementary and improve CPU offload performance.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-07 06:03:44 +08:00
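
Note on the double-buffered layer cache above: the commit message describes a cross-layer decode pipeline but gives no detail. The sketch below illustrates the general double-buffering idea only: while layer i is computed from one buffer, layer i+1 is loaded into the other. It uses a worker thread and plain Python lists in place of CUDA streams and KV tensors; `load_layer`, `compute`, and `run_pipeline` are hypothetical names, not functions from the repository.

```python
from concurrent.futures import ThreadPoolExecutor


def load_layer(layer_id: int, buffer: dict) -> dict:
    """Stand-in for a CPU-to-GPU copy of one layer's KV cache into a buffer."""
    buffer["layer"] = layer_id
    buffer["data"] = [layer_id] * 4
    return buffer


def compute(buffer: dict) -> int:
    """Stand-in for attention over the loaded layer."""
    return sum(buffer["data"])


def run_pipeline(num_layers: int) -> list:
    buffers = [{}, {}]  # two slots, used alternately
    results = []
    if num_layers <= 0:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Prefetch layer 0 before the loop starts.
        pending = pool.submit(load_layer, 0, buffers[0])
        for layer_id in range(num_layers):
            current = pending.result()  # wait until this layer is loaded
            if layer_id + 1 < num_layers:
                # Start loading the next layer into the other slot. That slot
                # was last used by layer_id - 1, whose compute has finished.
                nxt = (layer_id + 1) % 2
                pending = pool.submit(load_layer, layer_id + 1, buffers[nxt])
            results.append(compute(current))  # overlaps with the load above
    return results
```

Two buffers are enough here because a slot is only overwritten after the compute that read it has returned; a real GPU implementation would need explicit stream or event synchronization to get the same guarantee.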
Zijie Tian
ccf27d3a74
[claudesquad] update from 'perf_opt-1' on 07 Jan 26 05:58 CST
2026-01-07 05:58:23 +08:00
Zijie Tian
0ad86eb449
[claudesquad] update from 'perf_opt-2' on 07 Jan 26 05:58 CST
2026-01-07 05:58:10 +08:00
Zijie Tian
58a06501c1
Merge branch 'zijie/debug_chunk-2' into tzj/minference
2026-01-07 03:30:38 +08:00
Zijie Tian
2a6e0a2c02
[feat] Added Quest Sparsity Policy.
2026-01-07 03:29:21 +08:00
Zijie Tian
2fe50bab50
[claudesquad] update from 'debug_chunk-2' on 07 Jan 26 03:27 CST
2026-01-07 03:27:27 +08:00
Zijie Tian
c99a6f3d3f
[WIP] Before adding Quest policy.
2026-01-07 02:32:30 +08:00
Zijie Tian
0e691f2d85
[WIP] move metadata to GPU.
2026-01-06 23:32:32 +08:00
Zijie Tian
690492e074
[WIP] Before refactoring policies.
2026-01-06 20:47:55 +08:00
Zijie Tian
7cc8a394a5
[fix] Fixed bench_offload.py, BUT performance DEGRADED.
2026-01-06 18:46:48 +08:00
Zijie Tian
535f2037ab
[WIP] Before fixing bench_offload.py.
2026-01-06 18:41:08 +08:00
Zijie Tian
c7ac39dfbd
[refactor] Before adding sparse policy.
2026-01-05 21:19:24 +08:00
Zijie Tian
e554d5482b
[refactor] Delete unnecessary test, and refactor the offload prefix cache.
2026-01-05 20:31:42 +08:00
Zijie Tian
247c5312d9
[fix] Fixed decode misalign.
2026-01-05 19:00:44 +08:00
Zijie Tian
054aaff403
[fix] Fixed needle test bug.
2026-01-05 18:34:09 +08:00
Zijie Tian
d623043a3c
[WIP] FIXED decode and prefill NEEDLE test.
2026-01-05 01:51:46 +08:00
Zijie Tian
e897380127
[test] Added test_align.py before changing nanovllm attention.
2026-01-04 22:48:01 +08:00
Zijie Tian
772313db8f
[refactor] Refactor the kvcache offload.
2026-01-04 19:37:03 +08:00
Zijie Tian
00ed17c640
[feat] Added debug tools.
2026-01-03 22:36:40 +08:00
Zijie Tian
74ee6d0895
[WIP] Need to fix model so it decodes normally.
2026-01-01 05:18:27 +08:00
Zijie Tian
965c8aff12
[WIP] Need to change flashattention to debug.
2026-01-01 00:58:22 +08:00
Zijie Tian
30462fe89a
[WIP] Before fixing needle.
2025-12-31 23:35:25 +08:00
Zijie Tian
ccd1b3d4ab
[WIP] Before modifying nanovllm CPU-GPU kvcache.
2025-12-31 22:41:07 +08:00
Zijie Tian
484d0de9f9
[feat] Added debug hook to offload_engine.py.
2025-12-31 19:44:39 +08:00
Zijie Tian
89f8020d38
[WIP] fixing attention compute error.
2025-12-30 00:31:48 +08:00
Zijie Tian
82ed34fc2d
[opt] Optimize nanovllm performance to be comparable with vllm.
2025-12-25 03:47:07 +08:00
Zijie Tian
16fcf8350b
[WIP] replace merge attention with triton kernel.
2025-12-25 01:07:05 +08:00
Zijie Tian
cf5e7df093
[WIP] Added sgDMA operator for scatter kvcache communication.
2025-12-24 23:48:52 +08:00
Zijie Tian
6ec1b23982
[WIP] NEED to modify communication.
2025-12-24 21:57:51 +08:00
Zijie Tian
782437c486
[WIP] Remove num_prefetch_blocks variable.
2025-12-24 18:22:26 +08:00
Zijie Tian
4dcef16c13
[WIP] NEED to refactor nanovllm mechanism.
2025-12-22 23:52:56 +08:00
Zijie Tian
1907b625b6
[refactor] Remove legacy mode path.
2025-12-22 20:17:56 +08:00
Zijie Tian
051f2295c9
[feat] Added sparse KVcache feature, NEED VERIFY.
2025-12-22 08:51:02 +08:00
Zijie Tian
dc7807a211
[feat] Fixed warmup memory overhead.
2025-12-15 21:39:14 +08:00
Zijie Tian
91a0f09a24
[feat] Optimized with ASYNC offload.
2025-12-15 07:21:35 +08:00
Zijie Tian
b8b6478506
[feat] Need to optimize with async prefetch.
2025-12-15 06:58:40 +08:00