⚡ feat: add Phase 5 CUDA Graph optimization for chunked prefill

Implement extended CUDA Graph coverage for CPU offload path:
- Add graphed_layers.py with N+2 graph architecture (EmbedGraph, FirstGraph, InterGraphs, LastGraph)
- Support both prefill (seq_len=chunk_size) and decode (seq_len=1) graph modes
- Extend graph coverage to ~70-80% including qkv_proj, rotary, o_proj
- Only attention core remains in eager mode for dynamic offload

Performance: Prefill throughput improved ~5.6% (3782 -> 3995 tok/s at 32K)

Also adds:
- --enforce-eager flag to bench_offload.py for comparison
- Offload mode constraint documentation in CLAUDE.md

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 07:38:40 +08:00
parent 0d31b3f71f
commit 0437311068
4 changed files with 740 additions and 2 deletions
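
Reviewer note: the commit message names an N+2 graph architecture but the part of the diff shown here does not say where the graph boundaries fall. The sketch below is one plausible reading, not the contents of graphed_layers.py: it assumes graphs are cut at each layer's attention core, so that an N-layer model needs one EmbedGraph, one FirstGraph, N-1 InterGraphs, and one LastGraph. `Step` and `build_schedule` are hypothetical names, and the sketch only models the execution order; it captures no real CUDA Graphs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    kind: str  # "graph" (replayed CUDA Graph) or "eager" (attention core)
    name: str


def build_schedule(num_layers: int) -> List[Step]:
    """Return a hypothetical N+2 execution order for num_layers layers.

    Assumption: every attention core runs eagerly, and everything between
    two consecutive attention cores is fused into one graph.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be >= 1")

    steps = [
        Step("graph", "EmbedGraph"),  # token embedding
        Step("graph", "FirstGraph"),  # layer 0: qkv_proj, rotary
    ]
    for i in range(num_layers):
        # Attention stays eager so it can work with dynamically offloaded KV.
        steps.append(Step("eager", f"attention[{i}]"))
        if i < num_layers - 1:
            # Layer i after attention (o_proj, MLP) plus
            # layer i+1 before attention (qkv_proj, rotary).
            steps.append(Step("graph", f"InterGraph[{i}]"))
    # Last layer after attention (o_proj, MLP) and any final ops.
    steps.append(Step("graph", "LastGraph"))
    return steps


if __name__ == "__main__":
    for step in build_schedule(3):
        print(f"{step.kind:5s} {step.name}")
```

Under this reading the number of graphs grows by one per layer, and the same partition would be captured twice, once at seq_len=chunk_size for prefill and once at seq_len=1 for decode.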
--- a/bench_offload.py
+++ b/bench_offload.py
@@ -69,6 +69,7 @@ def main():
    parser.add_argument("--max-len", type=int, default=32*1024, help="Max model length (default: 32K)")
    parser.add_argument("--bench-decode", action="store_true", help="Run decode benchmark (default: prefill only)")
    parser.add_argument("--bench-all", action="store_true", help="Run both prefill and decode benchmarks")
+    parser.add_argument("--enforce-eager", action="store_true", help="Disable CUDA Graphs (use eager mode)")
    args = parser.parse_args()

    path = os.path.expanduser(args.model)
@@ -89,7 +90,7 @@ def main():

    llm = LLM(
        path,
-        enforce_eager=False,
+        enforce_eager=args.enforce_eager,
        max_model_len=max_len,
        max_num_batched_tokens=max_len,
        enable_cpu_offload=True,