Commit Graph

58 Commits

Author SHA1 Message Date
Zijie Tian
dce6ad6b74 ♻️ refactor: chunked LayerNorm/QKV/MLP for 64k memory optimization
Implement chunked processing for LayerNorm, QKV projection, and MLP
layers to reduce peak activation memory for 64k sequence inference.

Changes:
- Chunked input_layernorm and post_attention_layernorm (chunk_size=128)
- Chunked QKV projection (chunk_size=128)
- Chunked MLP processing (chunk_size=128) with memory cleanup
- Added torch.cuda.empty_cache() calls after each chunk

This reduces peak activation from ~2 GB to ~50 MB per layer,
making 64k inference theoretically possible on 24GB GPUs
(though still limited by memory fragmentation).

Related: docs/64k_memory_analysis.md

Co-Authored-By: Claude <noreply@anthropic.com>
2026-01-14 07:01:57 +08:00
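The chunked-processing scheme this commit describes can be sketched generically. This is a hypothetical stand-in, not the repo's code: the real loops operate on torch tensors inside the LayerNorm/QKV/MLP forward passes and call `torch.cuda.empty_cache()` between chunks.

```python
def chunked_apply(fn, xs, chunk_size=128):
    """Process xs in fixed-size chunks so only one chunk's intermediate
    result is live at a time, bounding peak activation memory."""
    out = []
    for start in range(0, len(xs), chunk_size):
        chunk = xs[start:start + chunk_size]
        out.extend(fn(chunk))
        # the real code releases cached GPU memory here:
        # torch.cuda.empty_cache()
    return out
```

With chunk_size=128, a 64k-token sequence is processed as 512 chunks, so the transient buffer for each projection is ~1/512 of the full-sequence size, matching the ~2 GB to ~50 MB per-layer reduction claimed above.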
Zijie Tian
76af506956 [claudesquad] update from 'multi-request-2' on 13 Jan 26 02:01 CST 2026-01-13 02:01:07 +08:00
Zijie Tian
64971c8e8a Merge branch 'zijie/fix-dist-3': Fix distributed port conflict
- Auto port allocation with _find_free_port() in model_runner.py
- Resource management refactor with close() + context manager in llm_engine.py
- Add tests/test_port_conflict.py and tests/run_parallel_niah.sh
- Remove docs/torch_distributed_port_issue.md (issue fixed)
- Ignore tests/data/ directory

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-12 16:27:25 +08:00
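The auto-port-allocation fix follows a standard pattern: bind to port 0 so the OS assigns a free ephemeral port. A minimal sketch (the actual `_find_free_port()` in model_runner.py may differ in details):

```python
import socket

def find_free_port() -> int:
    """Bind to port 0 so the OS picks a free ephemeral port; return it
    for use as the torch.distributed master port, avoiding conflicts
    when several engines start in parallel."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]
```

Note the small race inherent to this pattern: the port is released when the socket closes and could in principle be grabbed by another process before the distributed backend binds it.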
Zijie Tian
a6cc703d73 [tests] Added test_niah_standalone.py. 2026-01-12 00:16:37 +08:00
Zijie Tian
e23be2e844 Merge branch 'zijie/add-llama-1': Add multi-model support
- Add model registry system for dynamic model loading
- Implement LlamaForCausalLM with Llama3 RoPE scaling
- Register Qwen3ForCausalLM and Qwen2ForCausalLM
- Update ModelRunner to use get_model_class() for dynamic model selection

Tested: needle 32k test PASSED

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 21:20:53 +08:00
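The registry pattern described above can be sketched as a decorator that maps architecture strings to model classes, resolved at load time. Names are illustrative; the repo's actual registration API and class definitions may differ.

```python
# Hypothetical sketch of a model registry for dynamic model selection.
_MODEL_REGISTRY = {}

def register_model(arch_name):
    """Class decorator mapping an architecture string to a model class."""
    def wrap(cls):
        _MODEL_REGISTRY[arch_name] = cls
        return cls
    return wrap

def get_model_class(arch_name):
    """Resolve an architecture string (e.g. from the HF config) to its class."""
    if arch_name not in _MODEL_REGISTRY:
        raise ValueError(f"unsupported architecture: {arch_name}")
    return _MODEL_REGISTRY[arch_name]

@register_model("LlamaForCausalLM")
class LlamaForCausalLM:
    pass
```

ModelRunner can then pick the implementation from the checkpoint's `architectures` field instead of hard-coding one model class.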
Zijie Tian
03a8c033cb [claudesquad] update from 'add-llama-1' on 10 Jan 26 21:03 CST 2026-01-10 21:03:45 +08:00
Zijie Tian
1425510a2e [claudesquad] update from 'fix-bug-2' on 09 Jan 26 16:05 CST 2026-01-09 16:05:36 +08:00
Zijie Tian
59f8970ed3 [claudesquad] update from 'fix-bug-2' on 09 Jan 26 15:12 CST 2026-01-09 15:12:42 +08:00
Zijie Tian
47e3e465f0 [claudesquad] update from 'fix-ga-perf-2' on 09 Jan 26 14:08 CST 2026-01-09 14:08:12 +08:00
Zijie Tian
ea4e904de0 [claudesquad] update from 'int-minference-1' on 08 Jan 26 23:22 CST 2026-01-08 23:22:38 +08:00
Zijie Tian
a8c9f0d837 [claudesquad] update from 'lw-offload-2' on 08 Jan 26 20:53 CST 2026-01-08 20:53:08 +08:00
Zijie Tian
c1ddb44e5d Merge branch 'zijie/layer-prefill-1' into tzj/vs_offload
Adds MInference sparse attention support:
- New MInference sparse policy implementation
- A-shape, vertical-slash, and block-sparse patterns
- Updated bench.py with sparse attention options
- test_minference_gpu.py validation test

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-08 03:40:53 +08:00
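The vertical-slash pattern named above keeps a few "vertical" key columns (attended by every query) plus a diagonal band (the "slash") under a causal mask. A toy boolean-mask sketch of the shape, with illustrative parameters rather than MInference's actual kernel logic:

```python
def vertical_slash_mask(n, vertical_cols, slash_width):
    """Build an n x n causal attention mask keeping the given vertical
    key columns plus a diagonal band of slash_width recent positions."""
    mask = [[False] * n for _ in range(n)]
    for i in range(n):                      # query position
        for j in range(i + 1):              # causal: keys up to i only
            if j in vertical_cols or i - j < slash_width:
                mask[i][j] = True
    return mask
```

In practice the sparse kernels never materialize a dense mask like this; they index the selected columns and diagonal blocks directly, which is where the speedup comes from.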
Zijie Tian
d8a87da1c3 [claudesquad] update from 'layer-prefill-1' on 08 Jan 26 03:36 CST 2026-01-08 03:36:39 +08:00
Zijie Tian
ecd9ae0271 [WIP] changed to layerwise offload. 2026-01-08 00:28:27 +08:00
Zijie Tian
8fd25d72d7 Merge perf_opt-1 and perf_opt-2 branches
Combines two performance optimization features:
- perf_opt-1: Cross-layer pipeline for decode (double-buffered layer cache)
- perf_opt-2: Per-layer prefill buffer for async offload

Both features are complementary and improve CPU offload performance.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-07 06:03:44 +08:00
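The cross-layer pipeline idea can be sketched as a double-buffer loop: while layer i computes against one buffer, layer i+1's cache is loaded into the other, then the roles swap. The sketch below is synchronous and purely illustrative; the real pipeline overlaps `load` and `compute` with CUDA streams.

```python
def run_layers(num_layers, load, compute):
    """Double-buffered layer loop: prefetch layer i+1's cache into the
    idle buffer while layer i computes from the active one."""
    buffers = [None, None]
    buffers[0] = load(0)                  # prefetch the first layer's cache
    outputs = []
    for i in range(num_layers):
        cur, nxt = i % 2, (i + 1) % 2
        if i + 1 < num_layers:
            buffers[nxt] = load(i + 1)    # async on a side stream in real code
        outputs.append(compute(i, buffers[cur]))
    return outputs
```

Two buffers suffice because a layer's cache is only needed while that layer runs; the per-layer prefill buffer from perf_opt-2 plays the analogous role on the offload (write-back) side.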
Zijie Tian
ccf27d3a74 [claudesquad] update from 'perf_opt-1' on 07 Jan 26 05:58 CST 2026-01-07 05:58:23 +08:00
Zijie Tian
0ad86eb449 [claudesquad] update from 'perf_opt-2' on 07 Jan 26 05:58 CST 2026-01-07 05:58:10 +08:00
Zijie Tian
2a6e0a2c02 [feat] Added Quest Sparsity Policy. 2026-01-07 03:29:21 +08:00
Zijie Tian
c99a6f3d3f [WIP] Before adding Quest policy. 2026-01-07 02:32:30 +08:00
Zijie Tian
535f2037ab [WIP] Before fixing bench_offload.py. 2026-01-06 18:41:08 +08:00
Zijie Tian
c7ac39dfbd [refactor] Before adding sparse policy. 2026-01-05 21:19:24 +08:00
Zijie Tian
d623043a3c [WIP] FIXED decode and prefill NEEDLE test. 2026-01-05 01:51:46 +08:00
Zijie Tian
e897380127 [test] Added test_align.py, before changing nanovllm attention. 2026-01-04 22:48:01 +08:00
Zijie Tian
30462fe89a [WIP] Before fixing needle test. 2025-12-31 23:35:25 +08:00
Zijie Tian
89f8020d38 [WIP] fixing attention compute error. 2025-12-30 00:31:48 +08:00
Zijie Tian
1907b625b6 [refactor] Remove legacy mode path. 2025-12-22 20:17:56 +08:00
Zijie Tian
051f2295c9 [feat] Added sparse KV cache feature, NEEDS VERIFICATION. 2025-12-22 08:51:02 +08:00
Zijie Tian
dc7807a211 [feat] Fixed warmup memory overhead. 2025-12-15 21:39:14 +08:00
Zijie Tian
b8b6478506 [feat] Needs optimization with async prefetch. 2025-12-15 06:58:40 +08:00
Zijie Tian
1081ab51ea [refactor] Refactor offload code to multi-chunk. 2025-12-15 01:13:58 +08:00
Zijie Tian
60d24f7c12 [feat] Added bench_offload.py and GreedySampler. 2025-12-12 00:24:08 +08:00
Zijie Tian
babfa17354 [refactor] Translated into English, avoiding Chinese due to Claude. 2025-12-11 00:30:24 +08:00
Zijie Tian
e85c2b4776 [fix] Fixed kvcache offload bugs. 2025-12-10 22:34:00 +08:00
Zijie Tian
190df5f70d [refactor] Refactor current GPU and CPU block allocation strategy. 2025-12-10 21:23:31 +08:00
Zijie Tian
0a247ccb1b [feat] Added num_gpu_blocks to limit GPU blocks. 2025-12-10 20:17:42 +08:00
Zijie Tian
87055cc5ce [refactor] Implement real chunked prefill mechanism. 2025-12-10 18:34:01 +08:00
Zijie Tian
0b6f19242d [feat] Added chunked prefill and KV cache offload mechanism. 2025-12-10 03:47:37 +08:00
GeeeekExplorer
2f21442653 support qwen2 2025-11-04 01:44:42 +08:00
GeeeekExplorer
df99418f7d simplify 2025-08-31 20:02:51 +08:00
PeterDing
f5b4840276 fix(model_runner): correct position indexing to be 0-based
- Change position calculation from len(seq) to len(seq) - 1
2025-07-04 14:29:12 +08:00
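The off-by-one fixed here: with 0-based positions, the last token of a length-n sequence sits at position n - 1, so the position fed to the model at a decode step must be `len(seq) - 1`, not `len(seq)`.

```python
seq = [101, 42, 7]        # a 3-token sequence (token ids are illustrative)
wrong = len(seq)          # 3: one past the last valid 0-based position
position = len(seq) - 1   # 2: correct position of the token being decoded
```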
GeeeekExplorer
cb0b3dec3f remove rng state 2025-06-27 22:50:33 +08:00
GeeeekExplorer
1caeec8dfa same as vllm 2025-06-27 18:50:56 +08:00
GeeeekExplorer
658520b788 warmup and allocate 2025-06-27 01:51:57 +08:00
GeeeekExplorer
03cfc13bb3 faster pickle 2025-06-23 00:51:52 +08:00
GeeeekExplorer
cde3fc22c2 simplify 2025-06-21 17:19:15 +08:00
jinghuan-Chen
ffafaeb133 Release CUDA Graphs resource before exit. 2025-06-18 16:17:31 +08:00
GeeeekExplorer
bc0ad5a116 better 2025-06-17 23:33:38 +08:00
GeeeekExplorer
7e42fa6f63 fix 2025-06-15 13:28:29 +08:00
GeeeekExplorer
fc778a4da9 better 2025-06-15 10:36:45 +08:00
cheunglei
53b3ef2e32 support tensor parallel 2025-06-15 01:31:24 +08:00