[WIP] Before refactor.
9
.claude/ralph-loop.local.md
Normal file
@@ -0,0 +1,9 @@
---
active: true
iteration: 1
max_iterations: 0
completion_promise: "COMPLETE"
started_at: "2026-01-19T17:25:00Z"
---

Following the requirements in task_plan.md, refactor the nanovllm codebase and make sure the final goal in the plan is fully achieved. Note that you may only use GPU 0 for debugging; the other GPUs must not be used under any circumstances. Finally, write up the test results in a report. <promise>COMPLETE</promise> -max-iterations 30
107
.claude/rules/sparse-policy.md
Normal file
@@ -0,0 +1,107 @@
# Sparse Policy Code Conventions

## The supports_prefill / supports_decode Flags

Every SparsePolicy subclass must set both flags correctly:

```python
class MyPolicy(SparsePolicy):
    supports_prefill = True   # whether the prefill phase is supported
    supports_decode = False   # whether the decode phase is supported
```
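The base class that declares these flags is not shown in this document; the sketch below is an assumption for illustration (the `SparsePolicy` name, flag names, and `compute_chunked_*` method names follow this document, everything else is hypothetical). Defaulting both flags to `False` on the base class forces each subclass to opt in explicitly:

```python
class SparsePolicy:
    """Hypothetical base-class sketch; the real signatures may differ."""
    # Conservative defaults: subclasses must opt in to each phase explicitly.
    supports_prefill = False
    supports_decode = False

    def compute_chunked_attention(self, *args, **kwargs):
        raise NotImplementedError

    def compute_chunked_decode(self, *args, **kwargs):
        raise NotImplementedError


class MyPolicy(SparsePolicy):
    supports_prefill = True   # supports the prefill phase
    supports_decode = False   # does not support the decode phase


policy = MyPolicy()
print(policy.supports_prefill, policy.supports_decode)  # → True False
```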
## Method Implementation Conventions

### Rule: unsupported phases must `assert False`

If a policy does not support a given phase, the body of the corresponding `compute_chunked_*` method **must** `assert False`:

```python
class PrefillOnlyPolicy(SparsePolicy):
    supports_prefill = True
    supports_decode = False

    def compute_chunked_attention(self, ...):
        # normal prefill implementation
        ...

    def compute_chunked_decode(self, ...):
        # decode is not supported; must assert False
        assert False, "PrefillOnlyPolicy does not support decode phase"
```

```python
class DecodeOnlyPolicy(SparsePolicy):
    supports_prefill = False
    supports_decode = True

    def compute_chunked_attention(self, ...):
        # prefill is not supported; must assert False
        assert False, "DecodeOnlyPolicy does not support prefill phase"

    def compute_chunked_decode(self, ...):
        # normal decode implementation
        ...
```
### Rule: FullPolicy must support both phases

`FullAttentionPolicy` is the default policy and must support both prefill and decode:

```python
class FullAttentionPolicy(SparsePolicy):
    supports_prefill = True
    supports_decode = True

    def compute_chunked_attention(self, ...):
        # full implementation
        ...

    def compute_chunked_decode(self, ...):
        # full implementation
        ...
```
## Caller-Side Checks

`attention.py` should check whether the policy supports the current phase before calling into it:

```python
# Prefill path
if not sparse_policy.supports_prefill:
    raise RuntimeError(f"{sparse_policy} does not support prefill")

# Decode path
if not sparse_policy.supports_decode:
    raise RuntimeError(f"{sparse_policy} does not support decode")
```

This gives two layers of protection:

1. The caller-side check → a clear error message.
2. The in-method assert → catches any call that bypasses the check.
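As a runnable illustration of the two layers, here is a self-contained sketch; the `run_decode` dispatch helper and the simplified zero-argument method signatures are assumptions for demonstration, not the real attention.py code:

```python
class SparsePolicy:
    supports_prefill = False
    supports_decode = False


class PrefillOnlyPolicy(SparsePolicy):
    supports_prefill = True
    supports_decode = False

    def compute_chunked_attention(self):
        return "prefill ok"

    def compute_chunked_decode(self):
        # Layer 2: the in-method assert catches callers that skip the check.
        assert False, "PrefillOnlyPolicy does not support decode phase"


def run_decode(sparse_policy):
    # Layer 1: the caller-side check fails first, with a clear message.
    if not sparse_policy.supports_decode:
        raise RuntimeError(f"{sparse_policy} does not support decode")
    return sparse_policy.compute_chunked_decode()


policy = PrefillOnlyPolicy()
try:
    run_decode(policy)
except RuntimeError as e:
    print(e)  # clear error raised before the assert is ever reached
```

A caller that invokes `policy.compute_chunked_decode()` directly, bypassing `run_decode`, still fails, but with an `AssertionError` instead of the friendlier `RuntimeError`.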
## CPU-GPU Communication Conventions

### Rule: all transfers must go through OffloadEngine

Inside a SparsePolicy's `compute_chunked_*` methods, every CPU-GPU data transfer **must** go through the `OffloadEngine`; using `torch.Tensor.copy_()` or `.to(device)` directly is **forbidden**:

```python
# ✅ Correct: use OffloadEngine methods
offload_engine.load_to_slot_layer(slot, layer_id, cpu_block_id)
offload_engine.wait_slot_layer(slot)
k, v = offload_engine.get_kv_for_slot(slot)

# ✅ Correct: use the cross-layer pipeline
k, v = offload_engine.get_decode_layer_kv(layer_id, num_blocks)

# ❌ Wrong: raw torch transfers
gpu_tensor.copy_(cpu_tensor)
gpu_tensor = cpu_tensor.to("cuda")
gpu_tensor = cpu_tensor.cuda()
```
### Rationale

1. **Stream synchronization**: OffloadEngine manages CUDA streams internally, ensuring correct synchronization.
2. **Pipeline optimization**: OffloadEngine implements the ring buffer and the cross-layer pipeline.
3. **Resource management**: OffloadEngine manages the GPU buffer slots, avoiding memory fragmentation.
4. **Consistency**: a single interface is easier to debug and maintain.
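The load → wait → get sequence can be exercised against a stand-in engine. The stub below is purely an assumption for illustration (the real OffloadEngine drives async copies on dedicated CUDA streams and pinned buffers), but it demonstrates the call discipline a policy method is expected to follow:

```python
class StubOffloadEngine:
    """Stand-in for the real OffloadEngine; stores per-slot KV in a plain dict."""

    def __init__(self):
        self._slots = {}

    def load_to_slot_layer(self, slot, layer_id, cpu_block_id):
        # Real engine: enqueue an async H2D copy on a dedicated stream.
        self._slots[slot] = (f"k[{layer_id}:{cpu_block_id}]",
                             f"v[{layer_id}:{cpu_block_id}]")

    def wait_slot_layer(self, slot):
        # Real engine: synchronize on the copy stream's event.
        assert slot in self._slots, "wait_slot_layer called before load"

    def get_kv_for_slot(self, slot):
        return self._slots[slot]


def chunked_attention_step(offload_engine, slot, layer_id, cpu_block_id):
    # The only allowed transfer path inside compute_chunked_*: load -> wait -> get.
    offload_engine.load_to_slot_layer(slot, layer_id, cpu_block_id)
    offload_engine.wait_slot_layer(slot)
    return offload_engine.get_kv_for_slot(slot)


engine = StubOffloadEngine()
k, v = chunked_attention_step(engine, slot=0, layer_id=3, cpu_block_id=17)
print(k, v)  # → k[3:17] v[3:17]
```

Because the policy code only ever talks to the engine interface, a stub like this also makes the chunked-attention logic unit-testable without a GPU.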