♻️ refactor: remove cross-layer pipeline and rename compute_chunked_prefill

- Remove cross-layer pipeline from OffloadEngine (saves ~1GB GPU memory for long sequences)
  - Delete layer_k/v_buffer_a/b double buffers
  - Remove start_decode_pipeline, get_decode_layer_kv, end_decode_pipeline methods
  - Remove pipeline state tracking variables
- Simplify decode to use ring buffer pipeline only (more efficient for long sequences)
- Rename compute_chunked_attention → compute_chunked_prefill for clarity
- Add mandatory needle test requirements: --enable-offload --input-len 32768

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Author: Zijie Tian
Date: 2026-01-20 02:10:40 +08:00
Parent: 6080bf7554
Commit: fa7601f4b8
9 changed files with 67 additions and 299 deletions


@@ -77,6 +77,45 @@ Claude: Runs `python tests/test_needle.py ...` # NO! Missing GPU specification!
---
## Needle Test Requirements (MANDATORY)
When running `test_needle.py`, **ALWAYS** use these settings:
1. **Enable offload**: `--enable-offload` is **REQUIRED**
2. **Use 32K context**: `--input-len 32768` is **REQUIRED**
### Standard Needle Test Command
```bash
CUDA_VISIBLE_DEVICES=X PYTHONPATH=/path/to/nano-vllm:$PYTHONPATH \
python tests/test_needle.py \
--model ~/models/Llama-3.1-8B-Instruct \
--enable-offload \
--input-len 32768
```
### Why These Settings?
| Setting | Reason |
|---------|--------|
| `--enable-offload` | Tests the CPU offload pipeline which is the main feature being developed |
| `--input-len 32768` | 32K context properly exercises the chunked prefill/decode paths; 8K is too short to catch many issues |
### Do NOT Use
```bash
# ❌ Wrong: Missing offload
python tests/test_needle.py --model ~/models/Llama-3.1-8B-Instruct
# ❌ Wrong: Too short (default 8K)
python tests/test_needle.py --model ~/models/Llama-3.1-8B-Instruct --enable-offload
# ✅ Correct: Offload + 32K
python tests/test_needle.py --model ~/models/Llama-3.1-8B-Instruct --enable-offload --input-len 32768
```
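To keep the mandatory flags from being dropped by accident, a small shell wrapper along these lines could be kept in your environment. This is an illustrative sketch, not part of the repo: the `needle_test` name and GPU default are assumptions, and the model path is the one from the command above.

```shell
# Hypothetical wrapper: builds the mandated needle-test command string.
# GPU id, PYTHONPATH handling, and model path are illustrative; adjust to your setup.
needle_test() {
  local gpu="${1:-0}"
  echo "CUDA_VISIBLE_DEVICES=$gpu python tests/test_needle.py" \
       "--model ~/models/Llama-3.1-8B-Instruct" \
       "--enable-offload --input-len 32768"
}

# Print the command for GPU 3 (swap echo for eval inside the function to actually run it):
needle_test 3
```

Printing instead of executing keeps the wrapper safe to source anywhere; the required `--enable-offload` and `--input-len 32768` flags are baked in so they cannot be forgotten.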
---
## Combined Checklist
Before running any GPU test:


@@ -21,7 +21,7 @@ class PrefillOnlyPolicy(SparsePolicy):
     supports_prefill = True
     supports_decode = False
-    def compute_chunked_attention(self, ...):
+    def compute_chunked_prefill(self, ...):
         # Implement the prefill logic as normal
         ...
@@ -35,7 +35,7 @@ class DecodeOnlyPolicy(SparsePolicy):
     supports_prefill = False
     supports_decode = True
-    def compute_chunked_attention(self, ...):
+    def compute_chunked_prefill(self, ...):
         # Prefill is not supported; must assert False
         assert False, "DecodeOnlyPolicy does not support prefill phase"
@@ -53,7 +53,7 @@ class FullAttentionPolicy(SparsePolicy):
     supports_prefill = True
     supports_decode = True
-    def compute_chunked_attention(self, ...):
+    def compute_chunked_prefill(self, ...):
         # Full implementation
     def compute_chunked_decode(self, ...):
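The capability flags above gate which phases a policy may serve, and unsupported methods must fail loudly. A minimal sketch of that pattern follows; the `run_phase` dispatcher and the `"decode-ok"` return value are illustrative, not the repo's actual API:

```python
class SparsePolicy:
    """Base policy: capability flags declare which phases are implemented."""
    supports_prefill = False
    supports_decode = False


class DecodeOnlyPolicy(SparsePolicy):
    supports_prefill = False
    supports_decode = True

    def compute_chunked_prefill(self, *args):
        # Unsupported phase: fail loudly rather than silently misbehave.
        assert False, "DecodeOnlyPolicy does not support prefill phase"

    def compute_chunked_decode(self, *args):
        return "decode-ok"


def run_phase(policy, is_prefill):
    # Callers check the capability flags before dispatching to a phase.
    if is_prefill:
        assert policy.supports_prefill, "policy does not support prefill"
        return policy.compute_chunked_prefill()
    assert policy.supports_decode, "policy does not support decode"
    return policy.compute_chunked_decode()


print(run_phase(DecodeOnlyPolicy(), is_prefill=False))  # decode path works
```

Checking the flag at the call site gives a clear error at scheduling time instead of an assertion deep inside the attention kernel.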
@@ -85,14 +85,11 @@ if not sparse_policy.supports_decode:
In the `compute_chunked_*` methods of SparsePolicy, all CPU-GPU data transfers **must** go through `OffloadEngine`; direct use of `torch.Tensor.copy_()` or `.to(device)` is **forbidden**.
```python
-# ✅ Correct: use OffloadEngine's methods
+# ✅ Correct: use OffloadEngine's ring buffer methods
 offload_engine.load_to_slot_layer(slot, layer_id, cpu_block_id)
 offload_engine.wait_slot_layer(slot)
 k, v = offload_engine.get_kv_for_slot(slot)

-# ✅ Correct: use the cross-layer pipeline
-k, v = offload_engine.get_decode_layer_kv(layer_id, num_blocks)

 # ❌ Wrong: direct torch transfers
 gpu_tensor.copy_(cpu_tensor)
 gpu_tensor = cpu_tensor.to("cuda")
@@ -102,6 +99,6 @@ gpu_tensor = cpu_tensor.cuda()
```
### Rationale
1. **Stream synchronization**: OffloadEngine manages CUDA streams internally, ensuring correct synchronization
-2. **Pipeline optimization**: OffloadEngine implements the ring buffer and cross-layer pipelines
+2. **Pipeline optimization**: OffloadEngine implements the ring buffer pipeline
3. **Resource management**: OffloadEngine manages GPU buffer slots, avoiding memory fragmentation
4. **Consistency**: a unified interface simplifies debugging and maintenance
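The mandated slot access pattern (load, wait, consume) can be sketched with a stand-in engine. The stub below only mimics the `load_to_slot_layer` / `wait_slot_layer` / `get_kv_for_slot` interface named above; the real OffloadEngine performs asynchronous host-to-device copies on dedicated CUDA streams, which the stub replaces with plain dictionary lookups:

```python
class StubOffloadEngine:
    """Stand-in engine: slots hold (k, v) pairs staged from 'CPU' storage."""

    def __init__(self, num_slots, cpu_kv):
        self.slots = [None] * num_slots   # mock GPU buffer slots
        self.cpu_kv = cpu_kv              # cpu_block_id -> (k, v)

    def load_to_slot_layer(self, slot, layer_id, cpu_block_id):
        # Real engine: async H2D copy on a dedicated stream.
        self.slots[slot] = self.cpu_kv[cpu_block_id]

    def wait_slot_layer(self, slot):
        # Real engine: synchronize the copy stream for this slot.
        pass

    def get_kv_for_slot(self, slot):
        return self.slots[slot]


def chunked_decode(engine, layer_id, cpu_block_ids):
    """Walk KV blocks through the ring of slots: load, wait, then consume."""
    num_slots = len(engine.slots)
    out = []
    for i, block_id in enumerate(cpu_block_ids):
        slot = i % num_slots              # ring over the GPU slots
        engine.load_to_slot_layer(slot, layer_id, block_id)
        engine.wait_slot_layer(slot)
        k, v = engine.get_kv_for_slot(slot)
        out.append((k, v))
    return out


cpu_kv = {0: ("k0", "v0"), 1: ("k1", "v1"), 2: ("k2", "v2")}
engine = StubOffloadEngine(num_slots=2, cpu_kv=cpu_kv)
print(chunked_decode(engine, layer_id=0, cpu_block_ids=[0, 1, 2]))
```

Because the ring reuses a fixed set of slots, GPU memory stays bounded regardless of sequence length, which is what makes the ring buffer pipeline sufficient after the cross-layer double buffers were removed.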