- Update task_plan.md with 6-phase segmented graph implementation plan
- Add findings.md documenting 7 key discoveries about current implementation
- Add progress.md for tracking implementation progress
- Add test_chunk_attention_graph_reuse.py validating 2-graph reuse strategy

Key architecture decision: split each transformer layer into 3 segments:
- PRE-ATTENTION GRAPH: norm → qkv_proj → rotary (1 graph, reused)
- CHUNKED ATTENTION: H2D (eager) + flash_attn (2 graphs) + merge (eager)
- POST-ATTENTION GRAPH: o_proj → norm → FFN (1 graph, reused)

Total: 4 graphs serving all layers via copy_() tensor updates.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
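
A minimal sketch of the copy_() reuse pattern described above, using PyTorch's CUDA graph API. The `pre_attn` module, buffer shapes, and per-layer weight refresh are illustrative assumptions standing in for the real norm → qkv_proj → rotary segment, not the committed implementation:

```python
import torch

device = torch.device("cuda")

# Placeholder for the pre-attention segment; the real segment fuses
# norm -> qkv_proj -> rotary into one captured graph.
pre_attn = torch.nn.Linear(4096, 4096, device=device).eval()
for p in pre_attn.parameters():
    p.requires_grad_(False)

# Static input buffer: the captured graph bakes in this exact address,
# so every replay reads whatever was last copy_()'d into it.
static_in = torch.zeros(1, 4096, device=device)

# PyTorch requires warm-up iterations on a side stream before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        pre_attn(static_in)
torch.cuda.current_stream().wait_stream(s)

# Capture the segment once; static_out is the graph's fixed output buffer.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_out = pre_attn(static_in)

def run_pre_attention(layer_weight: torch.Tensor,
                      hidden_states: torch.Tensor) -> torch.Tensor:
    # One graph serves every layer: refresh the captured buffers in place
    # (here just the weight; a real layer refreshes all captured
    # parameters), then replay the fixed kernel sequence.
    pre_attn.weight.copy_(layer_weight)
    static_in.copy_(hidden_states)
    graph.replay()
    return static_out.clone()  # detach the result from the static buffer
```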