From be67fa80605ab504c788f49845dd11894dbac651 Mon Sep 17 00:00:00 2001
From: Zijie Tian <zijietian@mail.xmu.edu.cn>
Date: Fri, 23 Jan 2026 09:43:22 +0800
Subject: [PATCH] =?UTF-8?q?=F0=9F=97=91=EF=B8=8F=20chore:=20remove=20tempo?=
 =?UTF-8?q?rary=20planning=20files?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

These files are session-level temporary files and should not be tracked.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
---
 findings.md  | 109 ----------------
 progress.md  |  55 --------
 task_plan.md | 357 ---------------------------------------------------
 3 files changed, 521 deletions(-)
 delete mode 100644 findings.md
 delete mode 100644 progress.md
 delete mode 100644 task_plan.md

diff --git a/findings.md b/findings.md
deleted file mode 100644
index 2497c6f..0000000
--- a/findings.md
+++ /dev/null
@@ -1,109 +0,0 @@
-# Findings: CUDA Graph for Offload Mode
-
-## Discovery 1: Why Offload Mode Does Not Use CUDA Graph
-
-**Location**: `nanovllm/engine/model_runner.py:421`
-
-```python
-use_eager = is_prefill or self.enforce_eager or input_ids.size(0) > 512 or context.is_chunked_prefill
-```
-
-**Cause**: `run_chunked_offload_decode()` sets `is_chunked_prefill=True`, which forces eager mode.
-
----
-
-## Discovery 2: Current CUDA Graph Architecture
-
-**File**: `model_runner.py:682-717`
-
-```python
-def capture_cudagraph(self):
-    # Capture the full model forward for each batch size
-    for bs in [1, 2, 4, 8, 16, ...]:
-        with torch.cuda.graph(graph):
-            outputs[:bs] = self.model(input_ids[:bs], positions[:bs])
-```
-
-**Characteristics**:
-- Captures the full `model()` call (all layers included)
-- Shares memory through a graph pool
-- Used for decode only (prefill always runs eager)
-
----
-
-## Discovery 3: Attention Flow in Offload Decode
-
-**File**: `nanovllm/kvcache/sparse/full_policy.py:304-379`
-
-**Ring Buffer Pipeline**:
-```
-1. Preload the first N blocks into GPU slots
-2. For each block:
-   a. wait_slot_layer()       # wait for H2D
-   b. get_kv_for_slot()       # fetch KV
-   c. flash_attn_with_lse()   # ⭐ graphable
-   d. record_slot_compute_done()
-   e. load_next_block()       # start the next H2D
-   f. merge_attention_outputs() # ⭐ graphable (but dynamic)
-```
-
-**Key point**: H2D transfers cannot be captured in a graph, but the attention computation can.
-
----
-
-## Discovery 4: Verifying That Graph Reuse Is Feasible
-
-**Test**: `tests/test_chunk_attention_graph_reuse.py`
-
-**Conclusions**:
-- Only 2 graphs are needed (causal + non-causal)
-- Static tensors are updated via `copy_()`
-- The graphs can be reused across all layers and all chunk pairs
-
-**Test results**:
-```
-Layer 0: max_diff=3.91e-03 ✅
-Layer 1: max_diff=7.81e-03 ✅
-Layer 2: max_diff=3.91e-03 ✅
-✅ PASSED
-```
-
----
-
-## Discovery 5: Relationship Between Chunk Size and Block Size
-
-**Observations**:
-- KV size of prefilled blocks = `block_size`
-- KV size of the decode buffer = `1` to `block_size` (dynamic)
-
-**Graph strategy**:
-- Prefilled blocks: fixed size = block_size, well suited to graphs
-- Decode buffer: dynamic size, recommended to stay eager
-
----
-
-## Discovery 6: Triton Operators Used
-
-**File**: `nanovllm/ops/chunked_attention.py`
-
-| Operator | Function | Graphable |
-|----------|----------|-----------|
-| `flash_attn_with_lse()` | Attention + LSE | ✅ |
-| `merge_attention_outputs()` | Merges two attention outputs | ✅ |
-
-Both operators are pure GPU computation and can be captured by CUDA Graph.
-
----
-
-## Discovery 7: Data Dependency Analysis
-
-**Attention inputs**:
-- `q`: from the current layer's QKV projection, fixed shape
-- `k, v`: from a GPU slot (after the H2D transfer), shape = [1, block_size, heads, dim]
-
-**Dependency chain**:
-```
-H2D(block) → wait() → get_kv() → copy_to_static() → graph.replay() → clone_output()
-```
-
-**Key point**: The graph wraps only the attention computation, not the data transfer.
diff --git a/progress.md b/progress.md
deleted file mode 100644
index dc78479..0000000
--- a/progress.md
+++ /dev/null
@@ -1,55 +0,0 @@
-# Progress: CUDA Graph for Offload Mode
-
-## Session: 2026-01-22
-
-### Research Phase ✅ Done
-
-**Completed research**:
-
-1. ✅ Analyzed the CUDA Graph implementation in `model_runner.py`
-   - `capture_cudagraph()`: captures the full model forward for each batch size
-   - `run_model()`: chooses eager vs. graph based on `is_chunked_prefill`
-
-2. ✅ Analyzed the offload decode flow
-   - `run_chunked_offload_decode()` sets `is_chunked_prefill=True`
-   - As a result, eager mode is always used
-
-3. ✅ Analyzed the ring buffer pipeline
-   - `_decode_ring_buffer_pipeline()` contains H2D transfers + attention computation
-   - H2D cannot be graphed; attention can
-
-4. ✅ Verified the graph reuse strategy
-   - Created `test_chunk_attention_graph_reuse.py`
-   - Confirmed that 2 graphs can be reused across all layers
-
-### Plan Writing ✅ Done
-
-- ✅ Created `task_plan.md`
-- ✅ Created `findings.md`
-- ✅ Created `progress.md`
-
-### Next Step: Implementation
-
-**Phase 1**: Add graph capture to OffloadEngine
-- [ ] Add `capture_attention_graphs()` in `offload_engine.py`
-- [ ] Add the `attention_graph_causal` and `attention_graph_non_causal` attributes
-
-**Phase 2**: Modify the ring buffer pipeline
-- [ ] Use graph replay in `_decode_ring_buffer_pipeline()`
-- [ ] Keep H2D and merge eager
-
-**Phase 3**: Testing
-- [ ] Run the needle test to verify correctness
-- [ ] Compare performance
-
----
-
-## File List
-
-| File | Status | Description |
-|------|--------|-------------|
-| `tests/test_chunk_attention_graph.py` | ✅ Committed | Test for preallocated chunk-pair graphs |
-| `tests/test_chunk_attention_graph_reuse.py` | Pending commit | Graph reuse verification |
-| `task_plan.md` | ✅ Created | Implementation plan |
-| `findings.md` | ✅ Created | Research findings |
-| `progress.md` | ✅ Created | Progress log |
diff --git a/task_plan.md b/task_plan.md
deleted file mode 100644
index e2dcc8d..0000000
--- a/task_plan.md
+++ /dev/null
@@ -1,357 +0,0 @@
-# Task Plan: CUDA Graph Optimization for Offload Mode Decode
-
-## Goal
-
-Add CUDA Graph support to nanovllm's CPU offload mode to speed up computation in the decode phase.
-
-## Problem Analysis
-
-### Full Structure of a Transformer Layer
-
-```
-Qwen3DecoderLayer.forward:
-├── input_layernorm (RMSNorm)           # ✅ pure GPU
-├── self_attn:
-│   ├── qkv_proj (Linear)               # ✅ pure GPU
-│   ├── q_norm, k_norm (RMSNorm)        # ✅ pure GPU
-│   ├── rotary_emb                      # ✅ pure GPU
-│   ├── attn._chunked_decode_attention: # ⚠️ includes CPU→GPU
-│   │   ├── H2D transfer                # ❌ not graphable
-│   │   ├── flash_attn_with_lse         # ✅ graphable
-│   │   └── merge                       # ✅ pure GPU
-│   └── o_proj (Linear)                 # ✅ pure GPU
-├── post_attention_layernorm            # ✅ pure GPU
-└── mlp (FFN: gate, up, down)           # ✅ pure GPU
-```
-
-**Core problem**: The H2D transfer is embedded in the middle of attention, which breaks graph capture of the whole layer.
-
-### Candidate Approaches
-
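-| Approach | Description | Pros | Cons |
-|----------|-------------|------|------|
-| A. Segmented graphs | Split each layer into pre- and post-attention segments | Broad coverage | Large change; requires splitting layer execution |
-| B. Graph attention only | Optimize only flash_attn_with_lse | Small change | Limited gain |
-| C. Restructure execution | Rewrite the model forward entirely | Best result | Very large effort |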
-
-### Recommended: Approach A (Segmented Graphs)
-
-Split each layer into two graphs:
-1. **pre_attention_graph**: `norm → qkv_proj → q/k_norm → rotary`
-2. **post_attention_graph**: `o_proj → norm → FFN`
-
-The `_chunked_decode_attention` in between stays eager (it contains the H2D transfer), but the `flash_attn_with_lse` inside it uses a graph.
-
----
-
-## Current State Analysis
-
-### Existing CUDA Graph Implementation
-
-**File**: `nanovllm/engine/model_runner.py`
-
-| Method | Lines | Function |
-|--------|-------|----------|
-| `capture_cudagraph()` | 682-717 | Captures the full model forward for each batch size |
-| `run_model()` | 415-436 | Decides between eager and graph replay |
-
-**Key logic** (`run_model`):
-```python
-use_eager = is_prefill or self.enforce_eager or input_ids.size(0) > 512 or context.is_chunked_prefill
-```
-
-**Problem**: `run_chunked_offload_decode` sets `is_chunked_prefill=True`, so **eager mode is always used**.
-
-### Offload Decode Flow
-
-**File**: `nanovllm/kvcache/sparse/full_policy.py`
-
-`_decode_ring_buffer_pipeline()` (L304-379):
-```
-for block in cpu_blocks:
-    1. wait_slot_layer(slot)           # wait for H2D to finish
-    2. k, v = get_kv_for_slot(slot)    # fetch KV
-    3. o, lse = flash_attn_with_lse()  # ⭐ pure GPU computation
-    4. record_slot_compute_done(slot)  # mark compute as done
-    5. load_next_block()               # start the next H2D
-    6. merge_attention_outputs()       # ⭐ pure GPU computation
-```
-
-**What can be graphed**:
-- Graphable: `flash_attn_with_lse()` (pure GPU computation)
-- Not graphable: H2D transfers, dynamic merge
-
-## Verification Results
-
-**Test file**: `tests/test_chunk_attention_graph_reuse.py`
-
-| Test | Result |
-|------|--------|
-| 2 graphs reused across all layers and all chunks | ✅ PASSED |
-| `copy_()` updates static tensors | ✅ Works |
-| Eager merge | ✅ Accepted by the user |
-
-**Conclusion**: Only 2 graphs are needed (causal + non-causal), reused via `copy_()`.
-
----
-
-## Modification Plan (Approach A: Segmented Graphs)
-
-### Architecture Design
-
-```
-Per-layer execution flow (Offload Decode):
-┌─────────────────────────────────────────────────────────────┐
-│  PRE-ATTENTION GRAPH (reusable across all layers)           │
-│  input_layernorm → qkv_proj → q/k_norm → rotary → split Q  │
-└─────────────────────────────────────────────────────────────┘
-                              ↓
-┌─────────────────────────────────────────────────────────────┐
-│  CHUNKED ATTENTION (Eager + partial Graph)                  │
-│  for block in cpu_blocks:                                   │
-│      H2D transfer (eager)                                   │
-│      flash_attn_with_lse (GRAPH - 2 reusable)               │
-│      merge (eager)                                          │
-│  decode_buffer attention (eager)                            │
-└─────────────────────────────────────────────────────────────┘
-                              ↓
-┌─────────────────────────────────────────────────────────────┐
-│  POST-ATTENTION GRAPH (reusable across all layers)          │
-│  o_proj → post_layernorm → gate_proj → up_proj → SiLU      │
-│  → down_proj → residual                                     │
-└─────────────────────────────────────────────────────────────┘
-```
-
-**Total number of graphs required**:
-- 1 pre_attention_graph (reused by all layers)
-- 2 attention_graphs (causal + non-causal, reused by all layers)
-- 1 post_attention_graph (reused by all layers)
-- **Total: 4 graphs**
-
----
-
-### Phase 1: Split DecoderLayer Execution
-
-**Goal**: Split `Qwen3DecoderLayer.forward` into three independently callable segments
-
-**File to modify**: `nanovllm/models/qwen3.py`
-
-**New methods**:
-```python
-class Qwen3DecoderLayer:
-    def forward_pre_attention(self, positions, hidden_states, residual):
-        """Pre-attention: norm → qkv → rotary → returns q, k, v"""
-        if residual is None:
-            hidden_states, residual = self.input_layernorm(hidden_states), hidden_states
-        else:
-            hidden_states, residual = self.input_layernorm(hidden_states, residual)
-
-        qkv = self.self_attn.qkv_proj(hidden_states)
-        q, k, v = qkv.split([self.q_size, self.kv_size, self.kv_size], dim=-1)
-        q = q.view(-1, self.num_heads, self.head_dim)
-        k = k.view(-1, self.num_kv_heads, self.head_dim)
-        v = v.view(-1, self.num_kv_heads, self.head_dim)
-        q = self.self_attn.q_norm(q)
-        k = self.self_attn.k_norm(k)
-        q, k = self.self_attn.rotary_emb(positions, q, k)
-        return q, k, v, hidden_states, residual
-
-    def forward_post_attention(self, attn_output, hidden_states, residual):
-        """Post-attention: o_proj → norm → FFN"""
-        output = self.self_attn.o_proj(attn_output.flatten(1, -1))
-        hidden_states, residual = self.post_attention_layernorm(output, residual)
-        hidden_states = self.mlp(hidden_states)
-        return hidden_states, residual
-```
-
-**Status**: `pending`
-
----
-
-### Phase 2: Capture the Pre/Post Attention Graphs
-
-**Goal**: Capture graphs for pre_attention and post_attention
-
-**File to modify**: `nanovllm/engine/model_runner.py`
-
-**New method**: `capture_offload_layer_graphs()`
-
-```python
-def capture_offload_layer_graphs(self):
-    """Capture the layer graphs for offload mode"""
-    # Use any one layer as the template (all layers share the same structure)
-    layer = self.model.model.layers[0]
-
-    # Static tensors
-    static_hidden = torch.zeros(1, self.hidden_size, ...)
-    static_residual = torch.zeros(1, self.hidden_size, ...)
-    static_positions = torch.zeros(1, ...)
-
-    # Pre-attention graph
-    self.pre_attn_graph = torch.cuda.CUDAGraph()
-    with torch.cuda.graph(self.pre_attn_graph):
-        static_q, static_k, static_v, _, _ = layer.forward_pre_attention(
-            static_positions, static_hidden, static_residual
-        )
-
-    # Post-attention graph
-    self.post_attn_graph = torch.cuda.CUDAGraph()
-    with torch.cuda.graph(self.post_attn_graph):
-        _, _ = layer.forward_post_attention(
-            static_attn_output, static_hidden, static_residual
-        )
-```
-
-**Status**: `pending`
-
----
-
-### Phase 3: Capture the Attention Graphs
-
-**Goal**: Capture 2 attention graphs (causal + non-causal)
-
-**File to modify**: `nanovllm/kvcache/offload_engine.py`
-
-```python
-class OffloadEngine:
-    def capture_attention_graphs(self):
-        """Capture the attention graphs (reused across all layers)"""
-        self.attn_graph_causal = self._capture_attn_graph(causal=True)
-        self.attn_graph_non_causal = self._capture_attn_graph(causal=False)
-
-    def _capture_attn_graph(self, causal: bool):
-        static_q = torch.zeros(1, 1, num_heads, head_dim, ...)
-        static_k = torch.zeros(1, block_size, num_kv_heads, head_dim, ...)
-        static_v = torch.zeros(1, block_size, num_kv_heads, head_dim, ...)
-
-        graph = torch.cuda.CUDAGraph()
-        with torch.cuda.graph(graph):
-            output, lse = flash_attn_with_lse(static_q, static_k, static_v,
-                                              self.scale, causal)
-        return AttentionGraph(graph, static_q, static_k, static_v, output, lse)
-```
-
-**Status**: `pending`
-
----
-
-### Phase 4: Modify the Offload Decode Execution Flow
-
-**Goal**: Run offload decode using graph replay
-
-**File to modify**: `nanovllm/engine/model_runner.py`
-
-**Method to modify**: `run_chunked_offload_decode()`
-
-```python
-def run_chunked_offload_decode_with_graph(self, seqs):
-    """Offload decode accelerated with graphs"""
-    seq = seqs[0]
-
-    # Prepare inputs
-    input_ids = torch.tensor([seq.last_token], ...)
-    positions = torch.tensor([len(seq) - 1], ...)
-
-    # Embedding
-    hidden_states = self.model.model.embed_tokens(input_ids)
-    residual = None
-
-    for layer_id, layer in enumerate(self.model.model.layers):
-        # Phase 1: Pre-attention (GRAPH)
-        self.pre_attn_vars["hidden"].copy_(hidden_states)
-        if residual is not None: self.pre_attn_vars["residual"].copy_(residual)
-        self.pre_attn_vars["positions"].copy_(positions)
-        self.pre_attn_graph.replay()
-        q = self.pre_attn_vars["q"].clone()
-        k = self.pre_attn_vars["k"].clone()
-        v = self.pre_attn_vars["v"].clone()
-
-        # Phase 2: Chunked Attention (Eager + Graph)
-        attn_output = self._chunked_attention_with_graph(q, k, v, layer_id, ...)
-
-        # Phase 3: Post-attention (GRAPH)
-        self.post_attn_vars["attn_output"].copy_(attn_output)
-        self.post_attn_graph.replay()
-        hidden_states = self.post_attn_vars["hidden"].clone()
-        residual = self.post_attn_vars["residual"].clone()
-
-    # LM head
-    logits = self.model.compute_logits(hidden_states)
-    return logits
-```
-
-**Status**: `pending`
-
----
-
-### Phase 5: Modify the Ring Buffer Pipeline
-
-**Goal**: Use a graph inside attention
-
-**File to modify**: `nanovllm/kvcache/sparse/full_policy.py`
-
-**Change**: the `flash_attn_with_lse` call in `_decode_ring_buffer_pipeline()`
-
-```python
-# Current: eager
-prev_o, prev_lse = flash_attn_with_lse(q, k, v, scale, causal=False)
-
-# Changed to: graph replay
-graph = offload_engine.attn_graph_non_causal
-graph.static_q.copy_(q)
-graph.static_k.copy_(k)
-graph.static_v.copy_(v)
-graph.graph.replay()
-prev_o = graph.static_output.clone()
-prev_lse = graph.static_lse.clone()
-```
-
-**Status**: `pending`
-
----
-
-### Phase 6: Add a Config Switch
-
-**File to modify**: `nanovllm/config.py`
-
-```python
-enable_offload_graph: bool = True  # enabled by default
-```
-
-**Status**: `pending`
-
----
-
-## Files to Modify
-
-| File | Change type | Description |
-|------|-------------|-------------|
-| `nanovllm/engine/model_runner.py` | New method | `capture_offload_layer_graphs()` |
-| `nanovllm/kvcache/offload_engine.py` | New attributes + methods | Graph storage and access |
-| `nanovllm/kvcache/sparse/full_policy.py` | Modified method | Uses graph replay |
-| `nanovllm/config.py` | New config option | `enable_offload_graph` |
-
----
-
-## Risks and Caveats
-
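-1. **Graph capture timing**: Capture must happen after the KV cache is allocated and before the first decode
-2. **Chunk size match**: The graph's chunk_size must equal block_size
-3. **Multi-GPU**: Graphs must be captured separately on each GPU
-4. **Memory**: The extra memory overhead of the 2 graphs is small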
-
----
-
-## Test Plan
-
-1. **Unit tests**: Verify that graph replay produces correct results
-2. **Integration test**: Run `test_needle.py --enable-offload --input-len 32768`
-3. **Performance test**: Compare decode latency of eager vs. graph
-
----
-
-## Expected Benefits
-
-- Faster attention computation in the decode phase (less kernel launch overhead)
-- Compatible with the existing ring buffer pipeline
-- Minimal memory overhead (only 2 extra graphs)