# Notes: Layerwise Offload Implementation

## Code Analysis

### Current Layerwise Offload Flow

**Prefill** (`model_runner.py:462-573`):
```
for layer_id in range(num_layers):
    q, k, v = compute_qkv(hidden_states)
    attn_out = flash_attn_varlen_func(q, k, v, causal=True)
    hidden_states = mlp(attn_out)
    _offload_layer_kv_to_cpu_sync(layer_id, k, v)  # BLOCKING!
```

**Decode** (`model_runner.py:641-817`):
```
for layer_id in range(num_layers):
    # Load all prefilled KV from CPU (SLOW!)
    for block_id in cpu_block_table:
        k_block = k_cache_cpu[layer_id, block_id].to("cuda")
        v_block = v_cache_cpu[layer_id, block_id].to("cuda")

    k_full = cat([k_prefill, k_decode_prev, k_new])
    attn_out = flash_attn(q, k_full, v_full, causal=False)

    # Store new KV to decode buffer
    decode_k_buffer[layer_id, pos].copy_(k_new)

# Block-full offload (lines 793-811)
if block_is_full:
    for layer_id in range(num_layers):
        k_cache_cpu[layer_id, block].copy_(decode_k_buffer[layer_id], non_blocking=True)
    torch.cuda.synchronize()  # BAD: global sync
```

### OffloadEngine Existing Infrastructure

**Streams** (available for use):
- `compute_stream` - dedicated compute stream (not default!)
- `prefill_offload_streams[layer_id]` - per-layer D2H streams
- `slot_transfer_streams[slot_idx]` - per-slot H2D streams
- `transfer_stream_main` - main transfer stream
- `_pipeline_layer_stream` - cross-layer pipeline stream

**Events** (available for use):
- `prefill_offload_events[layer_id]` - per-layer offload completion
- `ring_slot_ready[slot]` - H2D completion
- `ring_slot_offload_done[slot]` - D2H completion
- `ring_slot_compute_done[slot]` - compute completion
- `_pipeline_next_layer_event` - pipeline next layer ready

**Buffers** (already allocated):
- `k_cache_cpu/v_cache_cpu` - [num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]
- `k_cache_gpu/v_cache_gpu` - [num_gpu_blocks, block_size, kv_heads, head_dim] (no layer dim!)
- `decode_k_buffer/v_buffer` - [num_layers, block_size, kv_heads, head_dim]
- `prefill_k_buffer/v_buffer` - [num_layers, block_size, kv_heads, head_dim]
- `layer_k_buffer_a/b, layer_v_buffer_a/b` - [max_prefill_blocks, block_size, kv_heads, head_dim]

### Useful Existing Methods

**Async offload** (currently unused in layerwise):
```python
offload_prefill_buffer_async(layer_id, cpu_block_id, num_valid_tokens)
wait_all_prefill_offloads()
wait_prefill_offload(layer_id)
```

**Cross-layer pipeline** (for decode):
```python
start_decode_pipeline(cpu_block_ids)
get_decode_layer_kv(layer_id, num_blocks) -> (k, v)
end_decode_pipeline()
```

### Chunked Prefill Code to Remove

**attention.py** (lines to remove):
- 172-312: `_chunked_prefill_attention()`
- 314-346: `_sync_load_previous_chunks()`
- 348-480: `_ring_buffer_pipeline_load()`
- 482-591: `_chunked_decode_attention()`
- 593-667: `_decode_ring_buffer_pipeline()`
- 669-726: `_decode_with_layer_pipeline()`

**context.py** (fields to remove):
- `is_chunked_prefill`
- `prev_kv_ranges`
- `chunk_offset`
- `chunked_seq`
- `decode_pos_in_block`
- `decode_start_pos_in_block`
- `current_chunk_idx`

**Keep**:
- `kvcache_manager` - still needed for layerwise
- `sparse_prefill_policy` - needed for MInference

---

## Memory Layout

### New Design: Ring-Buffered GPU KV Cache

**Design principles**:
- Prioritize pipeline correctness over squeezing peak memory to the minimum
- The number of ring-buffer layers is externally configurable (default: 4)
- Pipeline depth = num_kv_buffers - 1

```
# New: ring-buffered GPU cache (layerwise offload only)
# num_kv_buffers: externally configurable, default 4
layer_k_cache: [num_kv_buffers, max_seq_tokens, kv_heads, head_dim]
layer_v_cache: [num_kv_buffers, max_seq_tokens, kv_heads, head_dim]

# Removed: old chunked-prefill ring buffer
# k_cache_gpu: [num_gpu_blocks, block_size, kv_heads, head_dim]  <- delete
# v_cache_gpu: [num_gpu_blocks, block_size, kv_heads, head_dim]  <- delete
```

**Why a ring buffer?**

Pipeline requirements during decode (example with 4 buffers):
```
Buffer 0: [Load L0] → [Compute L0] ──────────────────► [Load L4]
Buffer 1: [Load L1] → [Compute L1] ────────────────────►
Buffer 2: [Load L2] → [Compute L2] ────────────►
Buffer 3: [Load L3] → [Compute L3] ──►
```

With a pipeline depth of 3, up to 3 layers can be prefetched ahead of the layer being computed, which hides H2D latency better.

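
The slot reuse in the diagram can be sketched as a small pure-Python schedule. This is only an illustration: the function name `ring_schedule` is hypothetical, and it assumes layer `i` is mapped to slot `i % num_kv_buffers`, which matches the diagram (L4 reuses Buffer 0) but is not confirmed by the code. Under that assumption, the prefetch issued while computing layer `i` always lands in the slot that layer `i - 1` just finished with, so a load never overwrites the buffer being read.

```python
from typing import List, Optional, Tuple


def ring_schedule(
    num_layers: int, num_kv_buffers: int
) -> Tuple[List[int], List[Tuple[int, int, Optional[int]]]]:
    """Return (preload, steps) for a layerwise decode pass.

    preload: layers whose KV is loaded before the first compute starts.
    steps:   one (compute_layer, slot, prefetch_layer_or_None) per layer.
    """
    if num_kv_buffers < 2:
        raise ValueError("need at least 2 buffers to overlap load and compute")
    depth = num_kv_buffers - 1  # pipeline depth, as in the design notes
    preload = list(range(min(depth, num_layers)))
    steps = []
    for layer in range(num_layers):
        slot = layer % num_kv_buffers  # assumed layer-to-slot mapping
        ahead = layer + depth          # layer to prefetch during this compute
        prefetch = ahead if ahead < num_layers else None
        steps.append((layer, slot, prefetch))
    return preload, steps


preload, steps = ring_schedule(num_layers=28, num_kv_buffers=4)
print(preload)    # [0, 1, 2]
print(steps[:2])  # [(0, 0, 3), (1, 1, 4)]
```
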
**Memory overhead** (Qwen3-4B, 128K tokens):
- Per-layer K (or V) alone: 128K × 8 × 128 × 2 bytes = 256 MB, so K + V = 512 MB per layer
- 4-buffer ring (K + V): 4 × 512 MB = 2 GB
- For comparison, all 28 layers on GPU: 28 × 512 MB = 14 GB
- **Savings**: 14 GB - 2 GB = 12 GB

**Configuration passing**:
```
LLM(num_kv_buffers=4) → Config → OffloadEngine(num_kv_buffers=...)
```

### CPU Cache (unchanged)
```
k_cache_cpu: [num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]
v_cache_cpu: [num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]
```
Both tensors live in pinned memory for fast DMA transfers.

### Memory per Layer (Qwen3-4B)
- kv_heads = 8
- head_dim = 128
- dtype = bfloat16 (2 bytes)
- Per-token KV: 8 * 128 * 2 (K and V) * 2 bytes = 4 KB
- 128K tokens: 128K * 4 KB = 512 MB per layer
- 28 layers: 28 * 512 MB = 14 GB total on CPU

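
The arithmetic above can be checked with a few lines of Python. This is a sketch using the figures from these notes (8 KV heads, head_dim 128, bfloat16, 28 layers, 128K tokens); the helper name `kv_bytes_per_token` is hypothetical, and the ring-buffer line assumes `num_kv_buffers = 4` and counts both `layer_k_cache` and `layer_v_cache`. Units are binary (1 MB = 2^20 bytes here).

```python
KIB, MIB, GIB = 1024, 1024**2, 1024**3


def kv_bytes_per_token(kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache one token needs in one layer (factor 2 = K and V)."""
    return kv_heads * head_dim * 2 * dtype_bytes


num_layers = 28          # layer count used in these notes
num_tokens = 128 * 1024  # 128K-token context
num_kv_buffers = 4       # assumed default ring size

per_token = kv_bytes_per_token(kv_heads=8, head_dim=128, dtype_bytes=2)
per_layer = per_token * num_tokens    # K + V for one layer
cpu_total = per_layer * num_layers    # full KV cache held on CPU
gpu_ring = per_layer * num_kv_buffers  # K + V ring buffers on GPU

print(f"per token: {per_token // KIB} KB")   # per token: 4 KB
print(f"per layer: {per_layer // MIB} MB")   # per layer: 512 MB
print(f"CPU total: {cpu_total // GIB} GB")   # CPU total: 14 GB
print(f"GPU ring:  {gpu_ring // GIB} GB")    # GPU ring:  2 GB
```
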
---

## Stream Synchronization Pattern

### Correct Pattern for Async Offload
```python
# In offload stream
with torch.cuda.stream(offload_stream):
    offload_stream.wait_stream(compute_stream)  # Wait for compute to finish
    cpu_tensor.copy_(gpu_tensor, non_blocking=True)
    event.record(offload_stream)

# Before reusing gpu_tensor
compute_stream.wait_event(event)  # Wait for offload to complete
```

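
A minimal sketch of applying this pattern once per layer, so that completion is tracked by per-layer events rather than a global `torch.cuda.synchronize()`. The function `offload_layers_async` and its arguments are hypothetical and not part of `OffloadEngine`; it creates its own stream and events instead of using the pre-allocated ones, and the CPU-only branch exists only so the sketch runs on a machine without a GPU.

```python
from typing import List

import torch


def offload_layers_async(gpu_kv: List[torch.Tensor], cpu_kv: List[torch.Tensor]) -> None:
    """Copy each layer's tensor into its CPU buffer, waiting per layer at the end."""
    if not torch.cuda.is_available():
        # Fallback without CUDA: plain synchronous copies.
        for src, dst in zip(gpu_kv, cpu_kv):
            dst.copy_(src)
        return

    compute_stream = torch.cuda.current_stream()
    offload_stream = torch.cuda.Stream()
    events = []
    for src, dst in zip(gpu_kv, cpu_kv):
        # The copy must not start before the kernels that produced `src` finish.
        offload_stream.wait_stream(compute_stream)
        with torch.cuda.stream(offload_stream):
            dst.copy_(src, non_blocking=True)  # truly async only if dst is pinned
            event = torch.cuda.Event()
            event.record(offload_stream)
        events.append(event)

    # Block the host on each layer's own event instead of the whole device.
    for event in events:
        event.synchronize()


use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"
src = [torch.full((4, 2), float(i), device=device) for i in range(3)]
dst = [torch.empty(4, 2, pin_memory=use_cuda) for _ in range(3)]
offload_layers_async(src, dst)
```
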
### Correct Pattern for Async Load
```python
# In load stream
with torch.cuda.stream(load_stream):
    gpu_buffer.copy_(cpu_tensor, non_blocking=True)
    event.record(load_stream)

# Before using gpu_buffer
compute_stream.wait_event(event)  # Wait for load to complete
```

---

## Test Configuration

**Needle test command**:
```bash
PYTHONPATH=/home/zijie/.claude-squad/worktrees/zijie/int-offload-1_188890c8699249f7:$PYTHONPATH \
python tests/test_needle.py \
    --model ~/models/Qwen3-4B-Instruct-2507/ \
    --max-model-len 32768 \
    --input-len 8192 \
    --enable-offload \
    --block-size 1024 \
    --num-gpu-blocks 2
```

**GPU mutex check before running**:
```bash
nvidia-smi --query-compute-apps=pid,name,used_memory --format=csv,noheader
```