📝 docs: add CUDA Graph optimization plan for offload mode decode

- Update task_plan.md with 6-phase segmented graph implementation plan
- Add findings.md documenting 7 key discoveries about current implementation
- Add progress.md for tracking implementation progress
- Add test_chunk_attention_graph_reuse.py validating 2-graph reuse strategy

Key architecture decision: Split transformer layer into 3 segments:
- PRE-ATTENTION GRAPH: norm → qkv_proj → rotary (1 graph, reused)
- CHUNKED ATTENTION: H2D (eager) + flash_attn (2 graphs) + merge (eager)
- POST-ATTENTION GRAPH: o_proj → norm → FFN (1 graph, reused)

Total: 4 graphs serving all layers via copy_() tensor updates.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 02:12:24 +08:00
parent d808970f2f
commit a5307fb124
4 changed files with 651 additions and 64 deletions
--- a/findings.md
+++ b/findings.md
@@ -0,0 +1,109 @@
+# Findings: CUDA Graph for Offload Mode
+
+## Discovery 1: Why Offload Mode Does Not Use CUDA Graph
+
+**Location**: `nanovllm/engine/model_runner.py:421`
+
+```python
+use_eager = is_prefill or self.enforce_eager or input_ids.size(0) > 512 or context.is_chunked_prefill
+```
+
+**Reason**: `run_chunked_offload_decode()` sets `is_chunked_prefill=True`, which forces eager mode.
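+
+A minimal sketch of the same condition as a standalone predicate, to make the fallback explicit. The function name `should_use_eager` and its parameters are hypothetical; only the boolean expression is taken from the line quoted above.
+
+```python
+def should_use_eager(is_prefill, enforce_eager, num_tokens, is_chunked_prefill):
+    """Hypothetical helper mirroring the quoted `use_eager` expression."""
+    return is_prefill or enforce_eager or num_tokens > 512 or is_chunked_prefill
+
+
+# Plain decode with a small batch is eligible for graph replay.
+plain_decode = should_use_eager(False, False, 1, False)
+
+# Offload decode sets is_chunked_prefill=True, so it always falls back to eager.
+offload_decode = should_use_eager(False, False, 1, True)
+```
+
+Because the flag is OR-ed into the condition, no batch size can re-enable graphs for offload decode until that flag is handled separately.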
+
+---
+
+## Discovery 2: Current CUDA Graph Architecture
+
+**File**: `model_runner.py:682-717`
+
+```python
+def capture_cudagraph(self):
+    # Capture the full model forward pass for each batch size
+    for bs in [1, 2, 4, 8, 16, ...]:
+        with torch.cuda.graph(graph):
+            outputs[:bs] = self.model(input_ids[:bs], positions[:bs])
+```
+
+**Characteristics**:
+- Captures the complete `model()` call (all layers included)
+- Shares memory through a graph pool
+- Used only for decode (prefill always runs eager)
+
+---
+
+## Discovery 3: Attention Flow in Offload Decode
+
+**File**: `nanovllm/kvcache/sparse/full_policy.py:304-379`
+
+**Ring Buffer Pipeline**:
+```
+1. Preload the first N blocks into GPU slots
+2. For each block:
+   a. wait_slot_layer()       # wait for H2D
+   b. get_kv_for_slot()       # fetch KV
+   c. flash_attn_with_lse()   # ⭐ graphable
+   d. record_slot_compute_done()
+   e. load_next_block()       # start the next H2D
+   f. merge_attention_outputs() # ⭐ graphable (but dynamic)
+```
+
+**Key point**: H2D transfers cannot be captured in a graph, but the attention computation can.
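+
+The ordering of the ring buffer pipeline can be sketched as a pure-Python schedule with no GPU involved. This is an illustration only: `ring_buffer_schedule` and the event names are hypothetical, and mapping block `b` to slot `b % num_slots` is an assumption about slot assignment, not something taken from `full_policy.py`.
+
+```python
+def ring_buffer_schedule(num_blocks, num_slots):
+    """Return ordered (op, block, slot) events for one layer (illustrative only)."""
+    events = []
+    # 1. Preload the first N blocks into the GPU slots.
+    for block in range(min(num_slots, num_blocks)):
+        events.append(("h2d", block, block))
+    # 2. Process blocks in order, refilling each slot once its compute is done.
+    for block in range(num_blocks):
+        slot = block % num_slots  # assumed round-robin slot assignment
+        events.append(("wait", block, slot))   # wait for this block's H2D
+        events.append(("attn", block, slot))   # graphable attention step
+        next_block = block + num_slots
+        if next_block < num_blocks:
+            events.append(("h2d", next_block, slot))  # start the next H2D
+        if block > 0:
+            # Assumption: the first block's output seeds the accumulator,
+            # so merging starts from the second block.
+            events.append(("merge", block, slot))
+    return events
+
+
+schedule = ring_buffer_schedule(num_blocks=5, num_slots=2)
+```
+
+In this schedule only the `attn` and `merge` events are candidates for graph capture; the `h2d` and `wait` events stay eager.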
+
+---
+
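+## Discovery 4: Verifying That Graph Reuse Is Feasible
+
+**Test**: `tests/test_chunk_attention_graph_reuse.py`
+
+**Conclusions**:
+- Only 2 graphs are needed (causal + non-causal)
+- Static tensors are updated via `copy_()`
+- The graphs can be reused across all layers and all chunk pairs
+
+**Test results**: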
+```
+Layer 0: max_diff=3.91e-03 ✅
+Layer 1: max_diff=7.81e-03 ✅
+Layer 2: max_diff=3.91e-03 ✅
+✅ PASSED
+```
+
+---
+
+## Discovery 5: Relationship Between Chunk Size and Block Size
+
+**Observations**:
+- KV size of prefilled blocks = `block_size`
+- KV size of the decode buffer = `1` to `block_size` (dynamic)
+
+**Graph strategy**:
+- Prefilled blocks: fixed size = block_size, well suited to graph capture
+- Decode buffer: dynamic size, recommended to stay eager
+
+---
+
+## Discovery 6: Triton Operators in Use
+
+**File**: `nanovllm/ops/chunked_attention.py`
+
+| Operator | Function | Graphable |
+|----------|----------|-----------|
+| `flash_attn_with_lse()` | Attention + LSE | ✅ |
+| `merge_attention_outputs()` | Merges two attention outputs | ✅ |
+
+Both operators are pure GPU computation and can be captured by CUDA Graph.
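+
+For reference, the math behind merging two partial attention outputs through their log-sum-exp (LSE) values can be sketched in plain Python. This is not the Triton implementation: `attention_with_lse` and `merge_partial_attention` are hypothetical single-query helpers, and the usual `1/sqrt(d)` scaling is omitted for brevity.
+
+```python
+import math
+
+
+def attention_with_lse(q, keys, values):
+    """Softmax attention for one query; returns (output, log-sum-exp of scores)."""
+    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
+    m = max(scores)  # subtract the max for numerical stability
+    weights = [math.exp(s - m) for s in scores]
+    denom = sum(weights)
+    dim = len(values[0])
+    out = [sum(w * v[d] for w, v in zip(weights, values)) / denom for d in range(dim)]
+    return out, m + math.log(denom)
+
+
+def merge_partial_attention(o1, lse1, o2, lse2):
+    """Combine two partial outputs computed over disjoint KV chunks."""
+    m = max(lse1, lse2)
+    w1, w2 = math.exp(lse1 - m), math.exp(lse2 - m)
+    total = w1 + w2
+    merged = [(w1 * a + w2 * b) / total for a, b in zip(o1, o2)]
+    return merged, m + math.log(total)
+```
+
+Merging the outputs of two chunks this way gives the same result as attention over the concatenated keys and values, which is why per-block attention followed by a merge can replace one full pass.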
+
+---
+
+## Discovery 7: Data Dependency Analysis
+
+**Attention inputs**:
+- `q`: from the current layer's QKV projection, fixed shape
+- `k, v`: from a GPU slot (after the H2D transfer), shape = [1, block_size, heads, dim]
+
+**Dependency chain**:
+```
+H2D(block) → wait() → get_kv() → copy_to_static() → graph.replay() → clone_output()
+```
+
+**Key point**: The graph wraps only the attention computation and does not include data transfer.
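+
+The copy-in, replay, clone-out pattern in the dependency chain can be illustrated with a pure-Python stand-in that uses no CUDA or PyTorch. `StaticReplay` and `run_block` are hypothetical, and the doubling in `replay()` is a placeholder for the captured attention kernel; the only point is that the computation is bound to fixed buffers.
+
+```python
+class StaticReplay:
+    """Stand-in for a captured graph: computation bound to fixed buffers."""
+
+    def __init__(self, size):
+        self.static_in = [0.0] * size   # plays the role of a static input tensor
+        self.static_out = [0.0] * size  # plays the role of a static output tensor
+
+    def replay(self):
+        # Fixed work recorded at "capture" time; touches only the static buffers.
+        for i, x in enumerate(self.static_in):
+            self.static_out[i] = 2.0 * x  # placeholder for the attention kernel
+
+
+def run_block(graph, block_data):
+    # Assumes len(block_data) equals the static buffer size (fixed shape).
+    graph.static_in[:] = block_data   # copy_to_static(): in-place, same buffer
+    graph.replay()                    # graph.replay()
+    return list(graph.static_out)     # clone_output(): the next replay overwrites it
+
+
+graph = StaticReplay(size=3)
+first = run_block(graph, [1.0, 2.0, 3.0])
+second = run_block(graph, [4.0, 5.0, 6.0])
+```
+
+Without the copy in `run_block`, `first` would alias the static output and be overwritten by the second replay, which is the reason the chain ends with `clone_output()`.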