Zijie Tian
a5307fb124
📝 docs: add CUDA Graph optimization plan for offload-mode decode
- Update task_plan.md with 6-phase segmented graph implementation plan
- Add findings.md documenting 7 key discoveries about current implementation
- Add progress.md for tracking implementation progress
- Add test_chunk_attention_graph_reuse.py validating 2-graph reuse strategy
Key architecture decision: split the transformer layer into 3 segments:
- PRE-ATTENTION GRAPH: norm → qkv_proj → rotary (1 graph, reused)
- CHUNKED ATTENTION: H2D (eager) + flash_attn (2 graphs) + merge (eager)
- POST-ATTENTION GRAPH: o_proj → norm → FFN (1 graph, reused)
Total: 4 graphs serving all layers via copy_() tensor updates.
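The reuse strategy above can be sketched as follows. This is a conceptual, CPU-only model of the "capture once, replay with `copy_()` updates" pattern, not the actual CUDA Graph implementation: a captured graph has fixed input/output buffer addresses, so reusing one graph across all layers means copying each layer's inputs into the static buffers before replay and cloning the static output afterward. The `SegmentGraph` class, its method names, and the toy kernel are all illustrative assumptions.

```python
class SegmentGraph:
    """Stand-in for one captured CUDA graph segment (e.g. PRE-ATTENTION).

    A real captured graph records a fixed kernel sequence against fixed
    buffer addresses; only buffer *contents* may change between replays.
    """

    def __init__(self, kernel, buf_size):
        self.inp = [0.0] * buf_size   # static input buffer (address fixed at capture)
        self.out = [0.0] * buf_size   # static output buffer (address fixed at capture)
        self.kernel = kernel          # "captured" per-element work, recorded once

    def replay(self, layer_input):
        # Analogue of static_inp.copy_(layer_input): update contents in place,
        # never reallocate, so the captured graph still points at valid memory.
        self.inp[:] = layer_input
        # Replay: the recorded kernels write into the same static output buffer.
        for i, v in enumerate(self.inp):
            self.out[i] = self.kernel(v)
        # Clone before returning: the next layer's replay overwrites self.out.
        return list(self.out)


# One graph serves every layer; only the buffer contents differ per layer.
pre_attn = SegmentGraph(kernel=lambda x: 2.0 * x, buf_size=2)
layer_inputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs = [pre_attn.replay(x) for x in layer_inputs]
print(outputs)  # [[2.0, 4.0], [6.0, 8.0], [10.0, 12.0]]
```

In the real plan the same idea applies to each of the 4 graphs: weights and activations for the current layer are `copy_()`-ed into the graph's static tensors, the graph is replayed, and the static outputs are consumed (or cloned) before the next layer reuses the same graph.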
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 02:12:24 +08:00