⚡ feat: add Phase 5 CUDA Graph optimization for chunked prefill
Implement extended CUDA Graph coverage for CPU offload path:
- Add graphed_layers.py with N+2 graph architecture (EmbedGraph, FirstGraph, InterGraphs, LastGraph)
- Support both prefill (seq_len=chunk_size) and decode (seq_len=1) graph modes
- Extend graph coverage to ~70-80% including qkv_proj, rotary, o_proj
- Only attention core remains in eager mode for dynamic offload

Performance: Prefill throughput improved ~5.6% (3782 -> 3995 tok/s at 32K)

Also adds:
- --enforce-eager flag to bench_offload.py for comparison
- Offload mode constraint documentation in CLAUDE.md

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
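The "N+2" count in the commit message can be illustrated with a small sketch. It assumes one plausible reading of the design: each layer's attention core runs eagerly and acts as a boundary between captured graphs, so N layers give N eager steps surrounded by N+1 graph segments, plus one graph for the embedding. The function `graph_schedule` and the step labels are hypothetical; only the graph names come from the commit message, and the real `graphed_layers.py` may split the work differently.

```python
def graph_schedule(num_layers):
    """List execution steps for one forward pass under the assumed split.

    Assumption (not confirmed by the commit): every attention core is an
    eager step that separates two captured graphs.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")

    # EmbedGraph covers the embedding; FirstGraph covers the work that
    # precedes the first attention core (e.g. qkv_proj and rotary).
    steps = [("graph", "EmbedGraph"), ("graph", "FirstGraph")]
    for layer in range(num_layers):
        # The attention core stays in eager mode for dynamic offload.
        steps.append(("eager", f"attention[{layer}]"))
        if layer < num_layers - 1:
            # Assumed to cover o_proj of this layer up to qkv_proj of the next.
            steps.append(("graph", f"InterGraph[{layer}]"))
        else:
            # Assumed to cover o_proj of the last layer and what follows it.
            steps.append(("graph", "LastGraph"))
    return steps


def count_graphs(num_layers):
    """Count the captured graphs in the schedule."""
    return sum(1 for kind, _ in graph_schedule(num_layers) if kind == "graph")
```

Under this reading, `count_graphs(n)` equals `n + 2` for any `n >= 1`: one EmbedGraph, one FirstGraph, `n - 1` InterGraphs, and one LastGraph.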
@@ -93,6 +93,8 @@ PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH python tests/test_needle.py
**Files**: `bench.py` (GPU), `bench_offload.py` (CPU offload), `bench_vllm.py` (comparison)

**Offload Mode Constraint**: When using `enable_cpu_offload=True`, only test with context length ≥ 32K. Shorter contexts don't exercise the chunked offload pipeline properly.

**Common Issues**:
1. `max_num_batched_tokens < max_model_len`: Set equal for long context
2. CUDA graph dimension mismatch: Ensure `input_len + output_len <= max_model_len`
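The constraints above can be checked before launching a benchmark. The following is a minimal sketch, not code from the repository: `BenchConfig` and `check_config` are hypothetical names, "32K" is assumed to mean 32768 tokens, and "context length" is assumed to refer to `input_len`.

```python
from dataclasses import dataclass

# Assumption: "32K" in the offload constraint means 32 * 1024 tokens.
MIN_OFFLOAD_CONTEXT = 32 * 1024


@dataclass
class BenchConfig:
    # Field names mirror the identifiers quoted in the notes above;
    # the class itself is hypothetical.
    max_model_len: int
    max_num_batched_tokens: int
    input_len: int
    output_len: int
    enable_cpu_offload: bool = False


def check_config(cfg):
    """Return a list of problems; an empty list means the config passes."""
    problems = []
    if cfg.max_num_batched_tokens < cfg.max_model_len:
        problems.append(
            "max_num_batched_tokens < max_model_len: set them equal for long context"
        )
    if cfg.input_len + cfg.output_len > cfg.max_model_len:
        problems.append(
            "input_len + output_len > max_model_len: CUDA graph dimension mismatch"
        )
    # Assumption: the context length in the constraint is the prompt length.
    if cfg.enable_cpu_offload and cfg.input_len < MIN_OFFLOAD_CONTEXT:
        problems.append(
            "enable_cpu_offload=True with context < 32K: "
            "the chunked offload pipeline is not exercised properly"
        )
    return problems
```

A caller could print the returned messages and abort when the list is non-empty, which surfaces all three issues at once instead of failing on the first.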