# Notes: Layerwise Offload Implementation
## Code Analysis
### Current Layerwise Offload Flow
**Prefill** (`model_runner.py:462-573`):
```
for layer_id in range(num_layers):
    q, k, v = compute_qkv(hidden_states)
    attn_out = flash_attn_varlen_func(q, k, v, causal=True)
    hidden_states = mlp(attn_out)
    _offload_layer_kv_to_cpu_sync(layer_id, k, v)  # BLOCKING!
```
**Decode** (`model_runner.py:641-817`):
```
for layer_id in range(num_layers):
    # Load all prefilled KV from CPU (SLOW!)
    for block_id in cpu_block_table:
        k_block = k_cache_cpu[layer_id, block_id].to("cuda")
        v_block = v_cache_cpu[layer_id, block_id].to("cuda")
    k_full = cat([k_prefill, k_decode_prev, k_new])
    v_full = cat([v_prefill, v_decode_prev, v_new])
    attn_out = flash_attn(q, k_full, v_full, causal=False)
    # Store new KV to decode buffer
    decode_k_buffer[layer_id, pos].copy_(k_new)

# Block-full offload (lines 793-811)
if block_is_full:
    for layer_id in range(num_layers):
        k_cache_cpu[layer_id, block].copy_(decode_k_buffer[layer_id], non_blocking=True)
    torch.cuda.synchronize()  # BAD: global sync
```
### OffloadEngine Existing Infrastructure
**Streams** (available for use):
- `compute_stream` - dedicated compute stream (not default!)
- `prefill_offload_streams[layer_id]` - per-layer D2H streams
- `slot_transfer_streams[slot_idx]` - per-slot H2D streams
- `transfer_stream_main` - main transfer stream
- `_pipeline_layer_stream` - cross-layer pipeline stream
**Events** (available for use):
- `prefill_offload_events[layer_id]` - per-layer offload completion
- `ring_slot_ready[slot]` - H2D completion
- `ring_slot_offload_done[slot]` - D2H completion
- `ring_slot_compute_done[slot]` - compute completion
- `_pipeline_next_layer_event` - pipeline next layer ready
**Buffers** (already allocated):
- `k_cache_cpu/v_cache_cpu` - [num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]
- `k_cache_gpu/v_cache_gpu` - [num_gpu_blocks, block_size, kv_heads, head_dim] (no layer dim!)
- `decode_k_buffer/v_buffer` - [num_layers, block_size, kv_heads, head_dim]
- `prefill_k_buffer/v_buffer` - [num_layers, block_size, kv_heads, head_dim]
- `layer_k_buffer_a/b, layer_v_buffer_a/b` - [max_prefill_blocks, block_size, kv_heads, head_dim]
### Useful Existing Methods
**Async offload** (currently unused in layerwise):
```python
offload_prefill_buffer_async(layer_id, cpu_block_id, num_valid_tokens)
wait_all_prefill_offloads()
wait_prefill_offload(layer_id)
```
**Cross-layer pipeline** (for decode):
```python
start_decode_pipeline(cpu_block_ids)
get_decode_layer_kv(layer_id, num_blocks) -> (k, v)
end_decode_pipeline()
```
### Chunked Prefill Code to Remove
**attention.py** (lines to remove):
- 172-312: `_chunked_prefill_attention()`
- 314-346: `_sync_load_previous_chunks()`
- 348-480: `_ring_buffer_pipeline_load()`
- 482-591: `_chunked_decode_attention()`
- 593-667: `_decode_ring_buffer_pipeline()`
- 669-726: `_decode_with_layer_pipeline()`
**context.py** (fields to remove):
- `is_chunked_prefill`
- `prev_kv_ranges`
- `chunk_offset`
- `chunked_seq`
- `decode_pos_in_block`
- `decode_start_pos_in_block`
- `current_chunk_idx`
**Keep**:
- `kvcache_manager` - still needed for layerwise
- `sparse_prefill_policy` - needed for MInference
---
## Memory Layout
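### New Design: Ring-Buffered GPU KV Cache
**Design principles**:
- Do not chase the lowest possible peak memory; pipeline correctness comes first
- The number of ring-buffer layers is externally configurable (default: 4)
- Pipeline depth = num_kv_buffers - 1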
```
# New: ring-buffered GPU cache (dedicated to layerwise offload)
# num_kv_buffers: externally configurable, default 4
layer_k_cache: [num_kv_buffers, max_seq_tokens, kv_heads, head_dim]
layer_v_cache: [num_kv_buffers, max_seq_tokens, kv_heads, head_dim]
# Removed: the old chunked-prefill ring buffer
# k_cache_gpu: [num_gpu_blocks, block_size, kv_heads, head_dim] <- delete
# v_cache_gpu: [num_gpu_blocks, block_size, kv_heads, head_dim] <- delete
```
**Why use a ring buffer?**
Pipeline requirements in the decode phase (using 4 buffers as an example):
```
Buffer 0: [Load L0] → [Compute L0] ──────────────────► [Load L4]
Buffer 1: [Load L1] → [Compute L1] ────────────────────►
Buffer 2: [Load L2] → [Compute L2] ────────────►
Buffer 3: [Load L3] → [Compute L3] ──►
```
With a pipeline depth of 3, up to 3 layers can be preloaded ahead of the layer being computed, which hides H2D latency better.
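The issue order implied by the diagram can be sketched in plain Python. This is an illustration only: `ring_schedule` is a hypothetical helper, and the `layer_id % num_kv_buffers` slot mapping is an assumption inferred from the diagram (L4 reuses Buffer 0), not code taken from `OffloadEngine`.

```python
from typing import List, Tuple

Event = Tuple[str, int, int]  # (kind, layer_id, slot)


def ring_schedule(num_layers: int, num_kv_buffers: int) -> List[Event]:
    """Return the order in which H2D loads and computes are issued."""
    if num_layers < 1 or num_kv_buffers < 1:
        raise ValueError("num_layers and num_kv_buffers must be positive")

    events: List[Event] = []
    # Warm-up: fill every slot before the first compute starts.
    for layer_id in range(min(num_kv_buffers, num_layers)):
        events.append(("load", layer_id, layer_id % num_kv_buffers))

    for layer_id in range(num_layers):
        slot = layer_id % num_kv_buffers  # assumed slot mapping
        events.append(("compute", layer_id, slot))
        # The slot is free once its compute is done, so refill it with
        # the next layer that maps to the same slot.
        next_layer = layer_id + num_kv_buffers
        if next_layer < num_layers:
            events.append(("load", next_layer, slot))
    return events


if __name__ == "__main__":
    for kind, layer_id, slot in ring_schedule(num_layers=6, num_kv_buffers=4):
        print(f"{kind:<7} L{layer_id} -> buffer {slot}")
```

In a real implementation each `load` that reuses a slot would have to wait on that slot's compute-done event before overwriting it; the sketch encodes this only through issue order.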
**Memory overhead** (Qwen3-4B, 128K tokens):
- Per-layer K (or V): 128K × 8 × 128 × 2 bytes = 256 MB
- Per-layer K + V: 2 × 256 MB = 512 MB
- 4-buffer ring (K + V): 4 × 512 MB = 2 GB
- All 28 layers on GPU: 28 × 512 MB = 14 GB
- **Savings**: 14 GB - 2 GB = 12 GB
**Configuration plumbing**:
```
LLM(num_kv_buffers=4) → Config → OffloadEngine(num_kv_buffers=...)
```
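A minimal sketch of that plumbing, assuming the value is simply forwarded unchanged at each hop. `LLMSketch`, `ConfigSketch`, and `OffloadEngineSketch` are hypothetical stand-ins for the real classes, and the validation rule is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ConfigSketch:
    """Hypothetical stand-in for Config; only the new field is modeled."""
    num_kv_buffers: int = 4

    def __post_init__(self) -> None:
        # Assumed validation: at least one buffer is required.
        if self.num_kv_buffers < 1:
            raise ValueError("num_kv_buffers must be >= 1")


class OffloadEngineSketch:
    """Hypothetical stand-in for OffloadEngine."""

    def __init__(self, num_kv_buffers: int) -> None:
        self.num_kv_buffers = num_kv_buffers
        # From the design principles: depth = num_kv_buffers - 1.
        self.pipeline_depth = num_kv_buffers - 1


class LLMSketch:
    """Hypothetical stand-in for LLM: forwards kwargs into the config."""

    def __init__(self, **kwargs: int) -> None:
        self.config = ConfigSketch(**kwargs)
        self.offload_engine = OffloadEngineSketch(
            num_kv_buffers=self.config.num_kv_buffers
        )


if __name__ == "__main__":
    llm = LLMSketch(num_kv_buffers=4)
    print(llm.offload_engine.pipeline_depth)
```

Validating in the config object keeps a bad value from reaching buffer allocation, which is where a failure would be harder to diagnose.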
### CPU Cache (unchanged)
```
k_cache_cpu: [num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]
v_cache_cpu: [num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]
```
Pinned memory for fast DMA transfers.
### Memory per Layer (Qwen3-4B)
- kv_heads = 8
- head_dim = 128
- dtype = bfloat16 (2 bytes)
- Per-token KV: 8 × 128 × 2 (K and V) × 2 bytes = 4 KiB
- 128K tokens: 512 MiB per layer
- 28 layers: 14 GiB total on CPU
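The arithmetic above can be checked with a few lines of Python. The helper names are illustrative, and the defaults are the figures listed in these notes.

```python
KIB = 1024
MIB = 1024 ** 2
GIB = 1024 ** 3


def kv_bytes_per_token(kv_heads: int = 8, head_dim: int = 128,
                       dtype_bytes: int = 2) -> int:
    """Bytes of K plus V stored for one token in one layer."""
    return 2 * kv_heads * head_dim * dtype_bytes  # leading 2 = K and V


def kv_bytes(num_tokens: int, num_layers: int = 1) -> int:
    """Bytes of K plus V for num_tokens tokens across num_layers layers."""
    return num_layers * num_tokens * kv_bytes_per_token()


if __name__ == "__main__":
    tokens = 128 * 1024
    print(kv_bytes_per_token() // KIB, "KiB per token")
    print(kv_bytes(tokens) // MIB, "MiB per layer")
    print(kv_bytes(tokens, num_layers=28) // GIB, "GiB for 28 layers")
    # Four resident layers (K and V), as in a 4-buffer ring.
    print(kv_bytes(tokens, num_layers=4) // GIB, "GiB for 4 layers")
```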
---
## Stream Synchronization Pattern
### Correct Pattern for Async Offload
```python
# In offload stream
with torch.cuda.stream(offload_stream):
    offload_stream.wait_stream(compute_stream)  # Wait for compute to finish
    cpu_tensor.copy_(gpu_tensor, non_blocking=True)
    event.record(offload_stream)
# Before reusing gpu_tensor
compute_stream.wait_event(event) # Wait for offload to complete
```
### Correct Pattern for Async Load
```python
# In load stream
with torch.cuda.stream(load_stream):
    gpu_buffer.copy_(cpu_tensor, non_blocking=True)
    event.record(load_stream)
# Before using gpu_buffer
compute_stream.wait_event(event) # Wait for load to complete
```
---
## Test Configuration
**Needle test command**:
```bash
PYTHONPATH=/home/zijie/.claude-squad/worktrees/zijie/int-offload-1_188890c8699249f7:$PYTHONPATH \
python tests/test_needle.py \
--model ~/models/Qwen3-4B-Instruct-2507/ \
--max-model-len 32768 \
--input-len 8192 \
--enable-offload \
--block-size 1024 \
--num-gpu-blocks 2
```
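As a quick sanity check on these flags, the block counts can be worked out as below. This assumes the prompt is exactly `--input-len` tokens long; `blocks_needed` is a hypothetical helper, not part of the test script.

```python
def blocks_needed(num_tokens: int, block_size: int) -> int:
    """Number of KV blocks required to hold num_tokens (ceiling division)."""
    if num_tokens < 0 or block_size < 1:
        raise ValueError("num_tokens must be >= 0 and block_size >= 1")
    return -(-num_tokens // block_size)


if __name__ == "__main__":
    block_size = 1024       # --block-size
    num_gpu_blocks = 2      # --num-gpu-blocks
    prompt_blocks = blocks_needed(8192, block_size)     # --input-len
    max_blocks = blocks_needed(32768, block_size)       # --max-model-len
    print(f"prompt spans {prompt_blocks} blocks, GPU holds {num_gpu_blocks}")
    print(f"a max-length sequence would span {max_blocks} blocks")
```

With these values the prompt needs more blocks than the GPU pool holds, so the run should have to go through the offload path.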
**GPU mutex check before running**:
```bash
nvidia-smi --query-compute-apps=pid,name,used_memory --format=csv,noheader
```