⚡ feat: add Phase 5 CUDA Graph optimization for chunked prefill

Implement extended CUDA Graph coverage for CPU offload path:
- Add graphed_layers.py with N+2 graph architecture (EmbedGraph, FirstGraph, InterGraphs, LastGraph)
- Support both prefill (seq_len=chunk_size) and decode (seq_len=1) graph modes
- Extend graph coverage to ~70-80% including qkv_proj, rotary, o_proj
- Only attention core remains in eager mode for dynamic offload

Performance: Prefill throughput improved ~5.6% (3782 -> 3995 tok/s at 32K)

Also adds:
- --enforce-eager flag to bench_offload.py for comparison
- Offload mode constraint documentation in CLAUDE.md

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
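
One way to read the N+2 count in the commit message is sketched below. This is an illustration only, not the contents of graphed_layers.py: it assumes each decoder layer is split around its eager attention core, so an N-layer model yields one EmbedGraph, one FirstGraph, N-1 InterGraphs, and one LastGraph. The helper `plan_graphs`, the `GraphSegment` type, and the op names other than qkv_proj, rotary, and o_proj are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GraphSegment:
    name: str    # label mirroring the graph names in the commit message
    ops: tuple   # ops assumed (not confirmed) to be captured in this graph


# Assumed split of one decoder layer around the eager attention core.
# Only qkv_proj, rotary and o_proj are named in the commit; the rest are guesses.
PRE_ATTN = ("input_layernorm", "qkv_proj", "rotary")
POST_ATTN = ("o_proj", "post_attention_layernorm", "mlp")


def plan_graphs(num_layers: int) -> list:
    """Return a hypothetical list of N+2 graph segments for num_layers layers."""
    if num_layers < 1:
        raise ValueError("num_layers must be >= 1")

    segments = [GraphSegment("EmbedGraph", ("embed_tokens",))]
    segments.append(
        GraphSegment("FirstGraph", tuple(f"layer0.{op}" for op in PRE_ATTN))
    )
    # Each inter-graph fuses the tail of layer i-1 with the head of layer i,
    # so the only eager step between two graphs is one attention core.
    for i in range(1, num_layers):
        ops = tuple(f"layer{i - 1}.{op}" for op in POST_ATTN)
        ops += tuple(f"layer{i}.{op}" for op in PRE_ATTN)
        segments.append(GraphSegment(f"InterGraph[{i - 1}]", ops))

    last = num_layers - 1
    segments.append(
        GraphSegment(
            "LastGraph",
            tuple(f"layer{last}.{op}" for op in POST_ATTN) + ("final_norm",),
        )
    )
    return segments


if __name__ == "__main__":
    for seg in plan_graphs(3):
        print(seg.name, seg.ops)
```

Under this reading, replaying the segments in order with one eager attention call between consecutive layer segments covers the whole forward pass, which would be consistent with the note that only the attention core stays in eager mode.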
2026-01-27 07:38:40 +08:00
parent 0d31b3f71f
commit 0437311068
4 changed files with 740 additions and 2 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -93,6 +93,8 @@ PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH python tests/test_needle.py

 **Files**: `bench.py` (GPU), `bench_offload.py` (CPU offload), `bench_vllm.py` (comparison)

+**Offload Mode Constraint**: When using `enable_cpu_offload=True`, only test with context length ≥ 32K. Shorter contexts don't exercise the chunked offload pipeline properly.
+
 **Common Issues**:
 1. `max_num_batched_tokens < max_model_len`: Set equal for long context
 2. CUDA graph dimension mismatch: Ensure `input_len + output_len <= max_model_len`
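
The constraints listed above could be checked before a benchmark run with a small helper like the following. This is a hedged sketch: `check_offload_bench_config` is a hypothetical function, not part of the repository, and it assumes that "32K" means 32 * 1024 tokens and that "context length" refers to the input length.

```python
def check_offload_bench_config(
    max_model_len,
    max_num_batched_tokens,
    input_len,
    output_len,
    enable_cpu_offload=False,
    min_offload_context=32 * 1024,  # assumption: "32K" read as 32768 tokens
):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []

    # Common issue 1: max_num_batched_tokens < max_model_len.
    if max_num_batched_tokens < max_model_len:
        problems.append(
            "max_num_batched_tokens < max_model_len: set them equal for long context"
        )

    # Common issue 2: CUDA graph dimension mismatch.
    if input_len + output_len > max_model_len:
        problems.append("input_len + output_len exceeds max_model_len")

    # Offload mode constraint: short contexts do not exercise chunked offload.
    if enable_cpu_offload and input_len < min_offload_context:
        problems.append(
            f"CPU offload enabled but context length {input_len} < {min_offload_context}"
        )

    return problems
```

Calling it with the intended benchmark parameters and printing any returned messages gives a quick sanity check before a long offload run.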