Zijie Tian 69b779e252 📝 docs: add layer offload planning notes and task plan

Add planning documents for layer-wise offload implementation:
- notes.md: Implementation notes and findings
- task_plan.md: Detailed task breakdown and progress tracking

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-22 06:04:36 +08:00

Notes: SparsePolicy Refactoring Research

Sources

Source 1: tzj/minference branch - policy.py

Path: nanovllm/kvcache/sparse/policy.py
Key design points:
- The PolicyContext dataclass carries query_chunk_idx, num_query_chunks, layer_id, query, is_prefill, and other fields.
- select_blocks() requires an offload_engine argument.
- compute_chunked_prefill() and compute_chunked_decode() each implement the complete attention flow.
- The on_prefill_offload() / on_decode_offload() hooks are used to collect metadata.
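
To make the shape of the context object concrete, here is a minimal sketch of a dataclass with the five fields named above. Only the field names come from these notes; the types, defaults, and field order are assumptions, and the real class has further fields that are not listed here.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class PolicyContext:
    """Hypothetical reconstruction: only the field names come from the notes."""

    query_chunk_idx: int         # index of the query chunk being processed
    num_query_chunks: int        # total number of query chunks in this prefill
    layer_id: int                # transformer layer currently being computed
    query: Optional[Any] = None  # query tensor (assumed optional here)
    is_prefill: bool = True      # True during prefill, False during decode


# Example: first of four chunks, layer 2, prefill phase.
ctx = PolicyContext(query_chunk_idx=0, num_query_chunks=4, layer_id=2)
```

In layerwise mode the two chunk fields carry no information, which is why the notes later suggest the context can be simplified.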

Source 2: tzj/minference branch - full_policy.py

Path: nanovllm/kvcache/sparse/full_policy.py
Key implementation details:
- compute_chunked_prefill() loads blocks internally through a ring buffer pipeline.
- It uses flash_attn_with_lse and merge_attention_outputs to merge the attention results of multiple chunks.
- compute_chunked_decode() handles the prefilled blocks plus the decode buffer.
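
The reason partial attention results can be merged at all is that each partial result comes with its log-sum-exp (LSE), which is enough to rescale it against the others. The sketch below shows the idea on a toy case with one scalar query and scalar keys and values. The helper names `attend` and `merge` are illustrative only; they are not the signatures of flash_attn_with_lse or merge_attention_outputs.

```python
import math


def attend(q, keys, values):
    """Softmax attention for one scalar query; returns (output, lse)."""
    scores = [q * k for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    out = sum(math.exp(s - lse) * v for s, v in zip(scores, values))
    return out, lse


def merge(o1, lse1, o2, lse2):
    """Combine two partial results computed over disjoint sets of keys."""
    m = max(lse1, lse2)
    lse = m + math.log(math.exp(lse1 - m) + math.exp(lse2 - m))
    # Each partial output is reweighted by its share of the total softmax mass.
    out = o1 * math.exp(lse1 - lse) + o2 * math.exp(lse2 - lse)
    return out, lse


q = 0.5
keys = [1.0, -2.0, 0.3, 4.0]
values = [10.0, 20.0, 30.0, 40.0]

full_out, full_lse = attend(q, keys, values)          # one pass over all keys
o1, l1 = attend(q, keys[:2], values[:2])              # first chunk
o2, l2 = attend(q, keys[2:], values[2:])              # second chunk
merged_out, merged_lse = merge(o1, l1, o2, l2)        # chunked result
```

Up to floating-point error, `merged_out` equals `full_out`, which is what lets the chunked path process KV blocks one at a time.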

Source 3: tzj/layer-offload branch - model_runner.py

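Path: nanovllm/engine/model_runner.py
Key design points:
- run_layerwise_offload_prefill() processes one layer at a time and computes the complete attention in each layer.
- sparse_prefill_policy.sparse_prefill_attention(q, k, v, layer_id) is a simple interface.
- The FULL policy is represented by sparse_prefill_policy being None, which falls through to the else branch.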

Source 4: tzj/layer-offload branch - xattn.py

Path: nanovllm/kvcache/sparse/xattn.py
Key implementation details:
- sparse_prefill_attention() calls FlashAttention directly (because of limitations of the chunked prefill architecture).
- The Triton kernels are kept for a future GPU-only mode.

Synthesized Findings

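Summary of architectural differences

| Aspect | Chunked Offload | Layerwise Offload |
|---|---|---|
| Prefill flow | Chunk by chunk, across layers | Layer by layer, over the full sequence |
| KV storage | Offloaded immediately after each chunk | Offloaded after each layer is computed |
| Attention computation | Computed in several passes, then merged | Computed once, in full |
| Block loading | Historical blocks must be loaded from CPU | Not needed; KV is already on the GPU |
| Policy responsibility | The complete attention flow | Attention kernel selection only |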
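
The difference in prefill flow can be pictured as two loop orders. The sketch below only enumerates the order in which (input, layer) steps would run; the function names and the "full_sequence" label are made up for illustration, and no real offload or attention work is modelled.

```python
def chunked_order(num_layers, num_chunks):
    """Chunk by chunk: each chunk passes through every layer before the next chunk."""
    return [(chunk, layer) for chunk in range(num_chunks) for layer in range(num_layers)]


def layerwise_order(num_layers):
    """Layer by layer: each layer sees the full sequence exactly once."""
    return [("full_sequence", layer) for layer in range(num_layers)]


chunked = chunked_order(num_layers=2, num_chunks=3)
layerwise = layerwise_order(num_layers=2)

# In the chunked flow, layer 0 is visited once per chunk, so its attention
# has to be assembled from several partial results.
visits_to_layer0 = sum(1 for _, layer in chunked if layer == 0)
```

With three chunks, layer 0 is visited three times in the chunked flow and once in the layerwise flow, which is the source of the "several passes, then merged" versus "computed once" row in the table.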

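Simplifications in Layerwise Offload

- No block selection: the whole layer's KV is on the GPU, so there is nothing to select.
- No offload_engine parameter: the policy is not responsible for loading KV.
- No merge_attention_outputs: the complete attention is computed in one pass.
- No offload hooks: offloading is handled uniformly in model_runner.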

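Design recommendations

- Keep the interface simple: only compute_prefill_attention() and compute_decode_attention() are needed.
- FULL implements the methods too: stop branching on `is None`; every policy is called the same way.
- Remove unnecessary parameters: offload_engine, kvcache_manager, seq, and similar are not needed.
- Unify naming: use compute_*_attention rather than sparse_prefill_attention.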
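
As a rough illustration of these recommendations, the sketch below defines an abstract policy with a single prefill method and a FULL policy that implements it, so callers never branch on None. It is a toy under stated assumptions: q, k, v are nested Python lists of shape [tokens][dim] rather than tensors, the dense causal attention stands in for the FlashAttention call, and the class name FullAttentionPolicy is hypothetical.

```python
import math
from abc import ABC, abstractmethod


class AttentionPolicy(ABC):
    """Hypothetical unified interface: every policy, including FULL, implements it."""

    @abstractmethod
    def compute_prefill_attention(self, q, k, v, layer_id, softmax_scale):
        ...


class FullAttentionPolicy(AttentionPolicy):
    """Dense causal attention over plain lists (stand-in for FlashAttention)."""

    def compute_prefill_attention(self, q, k, v, layer_id, softmax_scale):
        dim = len(v[0])
        out = []
        for i, qi in enumerate(q):
            # Causal mask: token i attends to tokens 0..i only.
            scores = [softmax_scale * sum(a * b for a, b in zip(qi, k[j]))
                      for j in range(i + 1)]
            m = max(scores)
            weights = [math.exp(s - m) for s in scores]
            total = sum(weights)
            out.append([sum(w * v[j][d] for j, w in enumerate(weights)) / total
                        for d in range(dim)])
        return out


policy = FullAttentionPolicy()
q = k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out = policy.compute_prefill_attention(q, k, v, layer_id=0, softmax_scale=1.0)
```

The first token can only attend to itself, so `out[0]` equals its own value row. A sparse policy would subclass the same base and swap in a different kernel without changing the call site.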

Code Examples

Current call site (model_runner.py:876-891)

# Sparse or Full attention
if self.sparse_prefill_policy is not None:
    # MInference or other sparse prefill policy
    attn_output = self.sparse_prefill_policy.sparse_prefill_attention(
        q, k, v, layer_id
    )
else:
    # Full attention using FlashAttention
    attn_output = flash_attn_varlen_func(
        q, k, v, ...
    )

Proposed call site

# All policies are called the same way
attn_output = self.attention_policy.compute_prefill_attention(
    q, k, v, layer_id, softmax_scale=layer.self_attn.attn.scale
)

Questions Resolved

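Q: Is PolicyContext needed?
A: It can be simplified, because layerwise mode needs no chunk information.
Q: How is the decode phase handled?
A: Decode does not need a policy. run_layerwise_offload_decode() currently uses the standard layer(positions, hidden_states, residual) call, which goes through the Attention.forward() path.
Q: Why does decode not need sparse attention?
A: Each decode step has only one query token, so sparsification brings no benefit. Once KV has been loaded from the ring buffer, flash_attn_with_kvcache is used directly.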

Key Insight

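The policy design for Layerwise Offload should focus on prefill only:

Prefill: policy required
- Attention over the whole sequence is computed in one pass.
- Sparse attention methods can be used (for example MInference's vertical+slash pattern).
- The policy receives q, k, v, layer_id, softmax_scale.

Decode: no policy required
- Each step has a single query token.
- KV is loaded from the ring buffer.
- The standard flash_attn_with_kvcache is used.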

Interface Comparison Summary

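| Aspect | tzj/minference | tzj/layer-offload (new design) |
|---|---|---|
| Class name | SparsePolicy | AttentionPolicy |
| Prefill method | compute_chunked_prefill() | compute_attention() |
| Decode method | compute_chunked_decode() | Not needed (uses the standard path) |
| Requires offload_engine | Yes | No |
| Requires kvcache_manager | Yes | No |
| Requires seq | Yes | No |
| Supports FULL | Yes | Yes |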

Migration Path

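- Keep SparsePolicy as an alias of AttentionPolicy.
- Keep PolicyContext for future extension.
- Keep the select_blocks() method signature (even though it is unused).
- Remove the requires_block_selection attribute (not needed).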
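
One way the alias and the retained method could look is sketched below. The select_blocks() parameter list shown here is a placeholder, since these notes do not record the actual signature, and the pass-through default is an assumption about what "kept but unused" might mean in practice.

```python
class AttentionPolicy:
    """Hypothetical base class for the layerwise design."""

    def compute_prefill_attention(self, q, k, v, layer_id, softmax_scale):
        raise NotImplementedError

    def select_blocks(self, available_blocks, ctx):
        # Placeholder signature, kept only for compatibility. Layerwise
        # offload never selects blocks, so the default keeps every block.
        return available_blocks


# Backward-compatible alias so existing imports of SparsePolicy keep working.
SparsePolicy = AttentionPolicy

kept = SparsePolicy().select_blocks([0, 1, 2], ctx=None)
```

Because the alias is the same class object, existing isinstance checks and subclasses written against SparsePolicy would continue to work unchanged.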
