nano-vllm/nanovllm/layers/graphed_layers.py at c90dc196b255ecd7af18af577a7b7d309e75036c


Zijie Tian 0437311068 ⚡ feat: add Phase 5 CUDA Graph optimization for chunked prefill

Implement extended CUDA Graph coverage for the CPU offload path:
- Add graphed_layers.py with N+2 graph architecture (EmbedGraph, FirstGraph, InterGraphs, LastGraph)
- Support both prefill (seq_len=chunk_size) and decode (seq_len=1) graph modes
- Extend graph coverage to ~70-80% including qkv_proj, rotary, o_proj
- Only the attention core remains in eager mode, to allow dynamic offload
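
The N+2 layout above can be sketched as an execution schedule. This is an illustrative model only, not the code in graphed_layers.py: it assumes that one eager attention call sits between consecutive graphs, that there is one InterGraph per layer boundary (N-1 in total), and the `Step` and `build_schedule` names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    name: str
    graphed: bool  # True: replayed from a captured CUDA Graph; False: eager


def build_schedule(num_layers: int) -> List[Step]:
    """Return the per-forward-pass step order for a model with num_layers layers.

    Assumed layout (hypothetical, inferred from the bullet list above):
    EmbedGraph, FirstGraph, then one eager attention core per layer,
    each followed by an InterGraph (or LastGraph after the final layer).
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")

    steps = [Step("EmbedGraph", True), Step("FirstGraph", True)]
    for layer in range(num_layers):
        # The attention core stays eager so KV blocks can be offloaded dynamically.
        steps.append(Step(f"attention[{layer}]", False))
        if layer < num_layers - 1:
            steps.append(Step(f"InterGraph[{layer}]", True))
        else:
            steps.append(Step("LastGraph", True))
    return steps


schedule = build_schedule(4)
graph_count = sum(step.graphed for step in schedule)
eager_count = sum(not step.graphed for step in schedule)
print(graph_count, eager_count)
```

Under these assumptions a 4-layer model yields 6 graphs (N+2) and 4 eager attention calls, which matches the graph count named in the first bullet.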

Performance: Prefill throughput improved ~5.6% (3782 -> 3995 tok/s at 32K)
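
The quoted improvement follows directly from the two throughput figures; a quick check (variable names are illustrative):

```python
baseline_tok_s = 3782  # prefill throughput before this change, from the line above
graphed_tok_s = 3995   # prefill throughput with extended graph coverage

# Relative improvement as a percentage of the baseline.
speedup_pct = (graphed_tok_s - baseline_tok_s) / baseline_tok_s * 100
print(f"{speedup_pct:.1f}%")  # prints "5.6%"
```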

Also adds:
- --enforce-eager flag to bench_offload.py for comparison
- Offload mode constraint documentation in CLAUDE.md

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>

2026-01-27 07:38:40 +08:00

19 KiB
