[WIP] Before debug plan.
6
.gitignore
vendored
@@ -232,9 +232,9 @@ tests/data/
 .serena/
 
 # Planning-with-files temporary files
-task_plan.md
-findings.md
-progress.md
+# task_plan.md
+# findings.md
+# progress.md
 task_plan_*.md
 findings_*.md
 progress_*.md
133
findings.md
Normal file
@@ -0,0 +1,133 @@
# Findings: nanovllm State Leakage Investigation

## Key Discovery 1: OffloadEngine.reset() does not clear the CPU cache
**File**: `nanovllm/kvcache/offload_engine.py:247-274`

```python
def reset(self) -> None:
    # Clear GPU ring buffer slots
    self.k_cache_gpu.zero_()
    self.v_cache_gpu.zero_()
    # Clear per-layer decode buffers
    self.decode_k_buffer.zero_()
    self.decode_v_buffer.zero_()
    # Clear per-layer prefill buffers
    self.prefill_k_buffer.zero_()
    self.prefill_v_buffer.zero_()
    # Clear pending async events
    self.pending_events.clear()

    # ⚠️ Note: the following are NOT cleared!
    # - self.k_cache_cpu
    # - self.v_cache_cpu
    # - Ring buffer slot states
```

**Impact**: The CPU cache persists across requests, which may cause state leakage.
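
To make the gap concrete, here is a minimal sketch of a reset that also zeroes the CPU cache. The attribute names come from the `reset()` listing above; `zero_buffers` and `full_reset` are hypothetical helpers, not part of nanovllm, and the sketch only assumes that each buffer exposes `.zero_()` (as torch tensors do). Ring buffer slot state is not handled here because its layout is not shown in the source.

```python
from typing import Iterable

# Attribute names taken from the reset() listing above; the CPU entries are
# the ones reset() currently skips.
GPU_BUFFERS = ("k_cache_gpu", "v_cache_gpu",
               "decode_k_buffer", "decode_v_buffer",
               "prefill_k_buffer", "prefill_v_buffer")
CPU_BUFFERS = ("k_cache_cpu", "v_cache_cpu")


def zero_buffers(engine, names: Iterable[str]) -> None:
    """Call .zero_() on each named attribute of the engine."""
    for name in names:
        getattr(engine, name).zero_()


def full_reset(engine) -> None:
    """Hypothetical helper: what reset() does today, plus the CPU cache."""
    zero_buffers(engine, GPU_BUFFERS)
    zero_buffers(engine, CPU_BUFFERS)  # the step missing from reset()
    engine.pending_events.clear()
```

Zeroing a large CPU cache on every request may be expensive, so this is better suited to confirming the hypothesis than to serving as the final fix.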

## Key Discovery 2: deallocate() calls reset()
**File**: `nanovllm/kvcache/hybrid_manager.py:206-237`

`HybridKVCacheManager.deallocate()`:
1. Releases all logical blocks
2. Releases the corresponding CPU blocks
3. **Calls `offload_engine.reset()`**

However, this only happens when a finished sequence is released. If `deallocate()` is not called correctly, or if stale data remains in the CPU cache after the call, state leaks into the next request.

## Key Discovery 3: LLMEngine does not explicitly reset the KV cache
**File**: `nanovllm/engine/llm_engine.py:84-142`

`LLMEngine.generate()`:
- Calls `Observer.complete_reset()` to reset the performance observer
- **Does not call any KV cache reset method**

This means that if the previous request's state was not fully cleaned up, it affects the next request.

## Key Discovery 4: State-tracking variables
**File**: `nanovllm/kvcache/hybrid_manager.py`

HybridKVCacheManager maintains several state-tracking variables:
- `prefilled_blocks: Set[int]` - tracks blocks that have been prefilled
- `_decode_start_pos: Dict[int, int]` - decode start position of each sequence
- `_prefill_len: Dict[int, int]` - prefill length of each sequence

These variables are only partially cleaned up in `deallocate()`: `prefilled_blocks` is cleaned one block at a time via `discard()`.
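
As an illustration of what a complete per-sequence cleanup could look like, the sketch below models the three tracking fields with a toy class. `TrackingState`, `release`, and `leftovers` are hypothetical names, and keying the two dictionaries by a sequence id is an assumption; the real `HybridKVCacheManager` may differ.

```python
from typing import Dict, List, Set


class TrackingState:
    """Toy stand-in for the tracking fields listed above (not the real class)."""

    def __init__(self) -> None:
        self.prefilled_blocks: Set[int] = set()
        self._decode_start_pos: Dict[int, int] = {}
        self._prefill_len: Dict[int, int] = {}

    def release(self, seq_id: int, block_ids: List[int]) -> None:
        """Drop everything recorded for one sequence."""
        for block_id in block_ids:
            self.prefilled_blocks.discard(block_id)
        self._decode_start_pos.pop(seq_id, None)
        self._prefill_len.pop(seq_id, None)

    def leftovers(self) -> Dict[str, int]:
        """Sizes of the non-empty tracking containers; empty dict when clean."""
        sizes = {
            "prefilled_blocks": len(self.prefilled_blocks),
            "_decode_start_pos": len(self._decode_start_pos),
            "_prefill_len": len(self._prefill_len),
        }
        return {name: n for name, n in sizes.items() if n}
```

Calling something like `leftovers()` once every sequence has been released gives a cheap check: any non-empty result points at a tracking container that survived `deallocate()`.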

## Hypothesis: Root Cause Chain

```
Request A completes
    ↓
deallocate() is called
    ↓
offload_engine.reset() is called
    ↓
GPU buffers zeroed ✅
CPU cache not zeroed ❌ ← the problem
    ↓
Request B starts
    ↓
CPU cache may contain leftover data from Request A
    ↓
Incorrect attention computation
    ↓
Incorrect output
```
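
The chain above can be reproduced with a toy model. The sketch below is not nanovllm code: `ToyEngine` and `run_two_requests` are hypothetical, buffers are plain lists, and it assumes that Request B ends up reading a CPU block it never wrote, which is the condition under which stale data would matter.

```python
class ToyEngine:
    """Toy model of the chain above: one GPU slot, a CPU cache of blocks."""

    def __init__(self, num_cpu_blocks: int) -> None:
        self.gpu_slot = [0.0]
        self.cpu_cache = [[0.0] for _ in range(num_cpu_blocks)]

    def offload(self, block_id: int, value: float) -> None:
        """Write a KV value through the GPU slot into a CPU block."""
        self.gpu_slot[0] = value
        self.cpu_cache[block_id][0] = value

    def reset(self, clear_cpu: bool = False) -> None:
        """Mimic reset(): GPU always cleared, CPU only when asked."""
        self.gpu_slot[0] = 0.0
        if clear_cpu:
            for block in self.cpu_cache:
                block[0] = 0.0

    def load(self, block_id: int) -> float:
        """Read a CPU block, as attention would when loading history."""
        return self.cpu_cache[block_id][0]


def run_two_requests(clear_cpu: bool) -> float:
    engine = ToyEngine(num_cpu_blocks=2)
    engine.offload(block_id=1, value=3.5)  # Request A fills block 1
    engine.reset(clear_cpu=clear_cpu)      # A finishes: deallocate() -> reset()
    engine.offload(block_id=0, value=1.0)  # Request B only writes block 0
    return engine.load(1)                  # B reads block 1, which it never wrote
```

With `clear_cpu=False`, Request B sees the value left by Request A; with `clear_cpu=True` it sees zero. If the real code never reads blocks it has not written, the CPU cache would be harmless and the leak would have to come from elsewhere.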

## Verification Strategy: State Consistency Comparison

**Core idea**: Compare whether the state at the start of the same sample is identical in fresh-llm mode and in batch mode.

### State to Check

| Component | State | Check |
|-----------|-------|-------|
| OffloadEngine | `k_cache_cpu`, `v_cache_cpu` | `.sum()` or `.abs().max()` |
| OffloadEngine | `k_cache_gpu`, `v_cache_gpu` | `.sum()` or `.abs().max()` |
| OffloadEngine | `decode_k/v_buffer` | `.sum()` |
| OffloadEngine | `prefill_k/v_buffer` | `.sum()` |
| HybridManager | `prefilled_blocks` | `len()` |
| HybridManager | `free_logical_ids` | `len()` |
| HybridManager | `free_cpu_blocks` | `len()` |

### State-Check Code

```python
def dump_state(offload_engine, hybrid_manager, label=""):
    """Dump state for comparison."""
    state = {
        # OffloadEngine GPU state
        "k_gpu_sum": offload_engine.k_cache_gpu.sum().item(),
        "v_gpu_sum": offload_engine.v_cache_gpu.sum().item(),
        # OffloadEngine CPU state
        "k_cpu_sum": offload_engine.k_cache_cpu.sum().item(),
        "v_cpu_sum": offload_engine.v_cache_cpu.sum().item(),
        # Buffers
        "decode_k_sum": offload_engine.decode_k_buffer.sum().item(),
        "decode_v_sum": offload_engine.decode_v_buffer.sum().item(),
        "prefill_k_sum": offload_engine.prefill_k_buffer.sum().item(),
        "prefill_v_sum": offload_engine.prefill_v_buffer.sum().item(),
        # HybridManager
        "prefilled_blocks": len(hybrid_manager.prefilled_blocks),
        "free_logical": len(hybrid_manager.free_logical_ids),
        "free_cpu": len(hybrid_manager.free_cpu_blocks),
    }
    print(f"[STATE {label}] {state}")
    return state

def compare_states(s1, s2):
    """Compare two states, return differences."""
    diffs = {}
    for k in s1:
        if s1[k] != s2[k]:
            diffs[k] = (s1[k], s2[k])
    return diffs
```

### Verification Steps

1. **fresh-llm mode**: record the state at the start of sample N (S_fresh)
2. **batch mode**: record the state at the start of sample N (S_batch)
3. **Compare**: `compare_states(S_fresh, S_batch)`
4. **Conclusion**: the entries that differ are the candidate leak sources
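
One practical wrinkle in step 3: floating-point sums can differ by rounding noise or be NaN, and a strict `!=` would report both as differences. The sketch below is a hypothetical tolerance-aware variant (`compare_states_tol` is not in the source, and the tolerance defaults are arbitrary assumptions); it also reports keys present in only one of the two states.

```python
import math
from typing import Any, Dict, Tuple


def compare_states_tol(s1: Dict[str, Any], s2: Dict[str, Any],
                       rel_tol: float = 1e-6,
                       abs_tol: float = 1e-8) -> Dict[str, Tuple[Any, Any]]:
    """Return {key: (s1 value, s2 value)} for entries that really differ."""
    diffs = {}
    for key in sorted(set(s1) | set(s2)):
        a, b = s1.get(key), s2.get(key)
        if isinstance(a, float) and isinstance(b, float):
            if math.isnan(a) and math.isnan(b):
                continue  # both NaN: treat as equal
            if math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol):
                continue  # equal up to rounding noise
        elif a == b:
            continue
        diffs[key] = (a, b)
    return diffs
```

Note that a plain `.sum()` can also hide leftovers when positive and negative values cancel, which is why the table above lists `.abs().max()` as an alternative check.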
48
progress.md
Normal file
@@ -0,0 +1,48 @@
# Progress Log: nanovllm State Leakage Debug

## Session: 2026-01-20

### Entry 1: Initial Analysis Complete
**Time**: start of session

**Completed**:
- [x] Read `docs/ruler_32k_chunked_offload_issue.md` to understand the problem description
- [x] Read `nanovllm/kvcache/offload_engine.py` to analyze the reset() implementation
- [x] Read `nanovllm/kvcache/hybrid_manager.py` to analyze the deallocate() implementation
- [x] Read `nanovllm/engine/llm_engine.py` to analyze the request handling flow
- [x] Created planning files (task_plan.md, findings.md, progress.md)

**Key Finding**:
`OffloadEngine.reset()` clears the GPU buffers but **does not clear the CPU cache**. This is the most likely source of the state leakage.

**Next Steps**:
1. Verify the CPU cache hypothesis by adding CPU cache zeroing to reset()
2. Run the comparison tests to confirm the effect of the fix
3. Check for other possible state leakage points

---

### Entry 2: (to be filled in)
**Time**:

**Completed**:

**Issues**:

**Next Steps**:

---

## Test Results Summary
| Test | Before Fix | After Fix | Notes |
|------|------------|-----------|-------|
| niah_single_1 (fresh-llm) | 100% | - | Baseline |
| niah_single_1 (batch) | ~80% | - | State leakage |
| multikey_1 | ~94% | - | |
| multikey_2 | ~94% | - | |
| multikey_3 | ~56% | - | |

## Files Modified
| File | Change | Status |
|------|--------|--------|
| (to be recorded) | | |
218
task_plan.md
Normal file
@@ -0,0 +1,218 @@
# nanovllm State Leakage Debug Plan

**Task**: Fix state leakage between consecutive requests so that RULER 32K test accuracy rises from ~80% to 100%
**Created**: 2026-01-20
**Updated**: 2026-01-21
**Status**: `in_progress`
**Reference**: [docs/ruler_32k_chunked_offload_issue.md](docs/ruler_32k_chunked_offload_issue.md)

---

## Root Cause Summary

**Confirmed problem**: state leakage between consecutive requests
- **Evidence**: single-sample tests (LLM re-initialized each time) reach 100% accuracy, while batch tests reach ~80%
- **Gap**: the ~20% error rate in batch mode is attributed to state leakage

---

## Debugging Rules (MUST FOLLOW)

### 1. Parallel Workflow (multi-GPU + async Tasks + flag files)

```
┌─────────────────────────────────────────────────────────────┐
│ GPU 0: async Task - test current version (v1)               │
│         → result: /tmp/nanovllm_test_v1.json                │
│         → flag:   /tmp/nanovllm_test_v1.done                │
├─────────────────────────────────────────────────────────────┤
│ GPU 1: main agent - debug / fix code                        │
│         → state-comparison verification                     │
│         → modify code                                       │
├─────────────────────────────────────────────────────────────┤
│ GPU 2: async Task - test new version (v2, after fix)        │
│         → result: /tmp/nanovllm_test_v2.json                │
│         → flag:   /tmp/nanovllm_test_v2.done                │
├─────────────────────────────────────────────────────────────┤
│ Main agent: check flag files, read results, decide next step│
└─────────────────────────────────────────────────────────────┘
```

**Flag file convention**:
- Result file: `/tmp/nanovllm_test_<version>.json`
- Completion flag: `/tmp/nanovllm_test_<version>.done`
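
On the main-agent side, the convention above amounts to polling for the `.done` flag and then reading the matching `.json` file. A minimal sketch, assuming the result file is complete by the time the flag appears; `wait_for_result` and its timeout and polling parameters are hypothetical:

```python
import json
import time
from pathlib import Path
from typing import Any, Optional


def wait_for_result(version: str, tmp_dir: str = "/tmp",
                    timeout_s: float = 3600.0,
                    poll_s: float = 5.0) -> Optional[Any]:
    """Wait for the .done flag, then return the parsed JSON result.

    Returns None if the flag does not appear within timeout_s.
    """
    done_flag = Path(tmp_dir) / f"nanovllm_test_{version}.done"
    result_file = Path(tmp_dir) / f"nanovllm_test_{version}.json"
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if done_flag.exists():
            return json.loads(result_file.read_text())
        time.sleep(poll_s)
    return None
```

Stale flags from an earlier run would be picked up immediately, so it is worth deleting both files for a given version before launching its test.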

### 2. Verification Test Command

```bash
# Batch test of niah_single_1 (100 samples), used as the verification run.
# Replace X with the GPU index; VERSION follows the flag-file convention.
VERSION=v1
CUDA_VISIBLE_DEVICES=X PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH \
    python tests/test_ruler.py --task niah_single_1 --enable-offload \
    --json-output /tmp/nanovllm_test_${VERSION}.json

# Write the flag file once the run has finished
echo "done" > /tmp/nanovllm_test_${VERSION}.done
```
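
For launching the same run from a script, one option is a small wrapper that builds the command, pins the GPU, and writes the flag afterwards. This is a sketch only: `build_test_job` and `run_test_job` are hypothetical, the `test_ruler.py` flags are copied from the command above, and `PYTHONPATH` is assumed to be set already in the calling environment.

```python
import os
import subprocess
from pathlib import Path
from typing import Dict, List, Tuple


def build_test_job(version: str, gpu: int, task: str = "niah_single_1",
                   tmp_dir: str = "/tmp") -> Tuple[List[str], Dict[str, str], Path]:
    """Return (argv, env, done_flag) for one batch test run."""
    result = Path(tmp_dir) / f"nanovllm_test_{version}.json"
    done_flag = Path(tmp_dir) / f"nanovllm_test_{version}.done"
    argv = ["python", "tests/test_ruler.py", "--task", task,
            "--enable-offload", "--json-output", str(result)]
    # Inherit the current environment and restrict the job to one GPU.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    return argv, env, done_flag


def run_test_job(version: str, gpu: int) -> int:
    """Run the test to completion, then write the flag; returns the exit code."""
    argv, env, done_flag = build_test_job(version, gpu)
    code = subprocess.run(argv, env=env).returncode
    done_flag.write_text(f"exit_code={code}\n")
    return code
```

Recording the exit code in the flag file lets the main agent distinguish a finished run from a crashed one without parsing the result JSON.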

### 3. Report After Each Phase

After each Phase is complete:
1. **Update `progress.md`** - record test results and findings
2. **Report to the user** - summarize the results of this phase
3. **Wait for user confirmation** - do not start the next phase automatically

---

## Issue Priorities (updated)

| Priority | Issue | File location | Status |
|----------|-------|---------------|--------|
| **P0** | CPU cache not cleared | `offload_engine.py:reset()` | Needs verification |
| **P0** | Ring buffer slot state | `offload_engine.py` | Needs verification |
| **P1** | Sparse policy state | `sparse/policy.py` | To be checked |
| **P2** | HybridManager tracking variables | `hybrid_manager.py` | To be checked |

---

## Phase 0: State Analysis

**Status**: `completed`
**Objective**: Analyze the state management logic in the code

### Findings

#### OffloadEngine.reset() analysis
**File**: `nanovllm/kvcache/offload_engine.py:247-274`

| Component | Cleared by reset()? |
|-----------|---------------------|
| GPU ring buffer (k/v_cache_gpu) | Yes |
| Decode buffers (decode_k/v_buffer) | Yes |
| Prefill buffers (prefill_k/v_buffer) | Yes |
| Pending events | Yes |
| **CPU cache (k/v_cache_cpu)** | **No** |
| Ring buffer slot state | Needs verification |

#### HybridKVCacheManager.deallocate() analysis
**File**: `nanovllm/kvcache/hybrid_manager.py:206-237`

- Releases logical blocks
- Releases CPU blocks
- Calls `offload_engine.reset()`
- Is only called when a sequence completes

#### LLMEngine.generate() analysis
**File**: `nanovllm/engine/llm_engine.py:84-142`

- Calls `Observer.complete_reset()` to reset the performance observer
- **Does not explicitly call a KV cache reset**

---

## Phase 1: State Consistency Verification

**Status**: `in_progress`
**Objective**: Compare the initial state in fresh-llm mode and batch mode and find the differences

### Verification Idea

```
fresh-llm mode: new LLM per request  → state guaranteed clean → 100% accurate
batch mode:     reused LLM instance  → state may linger       → ~80% accurate

Difference = leak source
```

### Tasks

- [ ] 1.1 Add the state dump function to the code
- [ ] 1.2 Run fresh-llm mode and record the state at the start of a chosen sample (e.g. #40)
- [ ] 1.3 Run batch mode and record the state at the start of the same sample
- [ ] 1.4 Compare the two states and find the entries that differ
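
Because tasks 1.2 and 1.3 run in separate processes, the two states need to be persisted before they can be compared in task 1.4. A minimal sketch that stores each state dictionary as JSON; the helper names and the `state_<mode>_sample<N>.json` naming scheme are assumptions, not part of the plan:

```python
import json
from pathlib import Path
from typing import Any, Dict


def snapshot_path(out_dir: str, mode: str, sample_idx: int) -> Path:
    """File name for one snapshot, e.g. state_batch_sample40.json."""
    return Path(out_dir) / f"state_{mode}_sample{sample_idx}.json"


def save_snapshot(state: Dict[str, Any], out_dir: str,
                  mode: str, sample_idx: int) -> Path:
    """Write a state dict (as returned by a dump function) to disk."""
    path = snapshot_path(out_dir, mode, sample_idx)
    path.write_text(json.dumps(state, sort_keys=True, indent=2))
    return path


def load_snapshot(out_dir: str, mode: str, sample_idx: int) -> Dict[str, Any]:
    """Read a snapshot back for comparison."""
    return json.loads(snapshot_path(out_dir, mode, sample_idx).read_text())
```

Loading the `fresh-llm` and `batch` snapshots for the same sample index then yields the two dictionaries to diff.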

### State to Check

| Component | State | fresh-llm | batch | Differs? |
|-----------|-------|-----------|-------|----------|
| OffloadEngine | k_cache_cpu.sum() | - | - | - |
| OffloadEngine | v_cache_cpu.sum() | - | - | - |
| OffloadEngine | k_cache_gpu.sum() | - | - | - |
| OffloadEngine | v_cache_gpu.sum() | - | - | - |
| OffloadEngine | decode_k_buffer.sum() | - | - | - |
| OffloadEngine | prefill_k_buffer.sum() | - | - | - |
| HybridManager | len(prefilled_blocks) | - | - | - |
| HybridManager | len(free_logical_ids) | - | - | - |

### Expected Result

Identify exactly which states are non-zero (or different) in batch mode; those are the leak sources.

---

## Phase 2: Fix the Leak Sources

**Status**: `pending`
**Objective**: Fix the specific leak points based on the Phase 1 findings

### Tasks

- [ ] 2.1 Add the corresponding clearing logic for each difference identified in Phase 1
- [ ] 2.2 Run the verification test

---

## Phase 3: Verify the Fix

**Status**: `pending`
**Objective**: Confirm that accuracy reaches 100% after the fix

### Tasks

- [ ] 3.1 Run the batch-mode test (niah_single_1)
- [ ] 3.2 Compare accuracy before and after the fix
- [ ] 3.3 Run the other tasks (multikey) to verify

### Target

| Task | Before fix | Target after fix |
|------|------------|------------------|
| niah_single_1 (batch) | ~80% | 100% |
| niah_single_1 (fresh-llm) | 100% | 100% (baseline) |
| multikey_1 | ~94% | 100% |
| multikey_2 | ~94% | 100% |
| multikey_3 | ~56% | >90% |

---

## Errors Encountered

| Error | Phase | Attempt | Resolution |
|-------|-------|---------|------------|
| (to be recorded) | - | - | - |

---

## Decision Log

| Decision | Reason | Phase |
|----------|--------|-------|
| Use state consistency comparison for verification | Compares the differences directly, with no need to guess leak sources one by one | 1 |
| Use fresh-llm as the baseline | Single-sample tests are confirmed to pass at 100% | 0 |

---

## Files to Modify

| File | Modification | Phase |
|------|--------------|-------|
| `nanovllm/kvcache/offload_engine.py` | Zero the CPU cache in reset() | 1 |
| `nanovllm/kvcache/offload_engine.py` | Add slot state reset | 2 |
| `nanovllm/kvcache/sparse/policy.py` | Add reset() if needed | 3 |

---

## References

- [docs/ruler_32k_chunked_offload_issue.md](docs/ruler_32k_chunked_offload_issue.md) - problem background
- [docs/architecture_guide.md](docs/architecture_guide.md) - architecture reference
- [findings.md](findings.md) - code analysis findings
- [progress.md](progress.md) - progress log