Zijie Tian
|
a5307fb124
|
📝 docs: add CUDA Graph optimization plan for offload mode decode
- Update task_plan.md with 6-phase segmented graph implementation plan
- Add findings.md documenting 7 key discoveries about current implementation
- Add progress.md for tracking implementation progress
- Add test_chunk_attention_graph_reuse.py validating 2-graph reuse strategy
Key architecture decision: Split transformer layer into 3 segments:
- PRE-ATTENTION GRAPH: norm → qkv_proj → rotary (1 graph, reused)
- CHUNKED ATTENTION: H2D (eager) + flash_attn (2 graphs) + merge (eager)
- POST-ATTENTION GRAPH: o_proj → norm → FFN (1 graph, reused)
Total: 4 graphs serving all layers via copy_() tensor updates.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-01-22 02:12:24 +08:00 |
|