Notes: Layerwise Offload Implementation

Code Analysis

Current Layerwise Offload Flow

Prefill (model_runner.py:462-573):

for layer_id in range(num_layers):
    q, k, v = compute_qkv(hidden_states)
    attn_out = flash_attn_varlen_func(q, k, v, causal=True)
    hidden_states = mlp(attn_out)
    _offload_layer_kv_to_cpu_sync(layer_id, k, v)  # BLOCKING!

Decode (model_runner.py:641-817):

for layer_id in range(num_layers):
    # Load all prefilled KV from CPU (SLOW!)
    for block_id in cpu_block_table:
        k_block = k_cache_cpu[layer_id, block_id].to("cuda")
        v_block = v_cache_cpu[layer_id, block_id].to("cuda")

    k_full = cat([k_prefill, k_decode_prev, k_new])
    v_full = cat([v_prefill, v_decode_prev, v_new])
    attn_out = flash_attn(q, k_full, v_full, causal=False)

    # Store new KV to decode buffer
    decode_k_buffer[layer_id, pos].copy_(k_new)

# Block-full offload (lines 793-811)
if block_is_full:
    for layer_id in range(num_layers):
        k_cache_cpu[layer_id, block].copy_(decode_k_buffer[layer_id], non_blocking=True)
    torch.cuda.synchronize()  # BAD: global sync

OffloadEngine Existing Infrastructure

Streams (available for use):

  • compute_stream - dedicated compute stream (not default!)
  • prefill_offload_streams[layer_id] - per-layer D2H streams
  • slot_transfer_streams[slot_idx] - per-slot H2D streams
  • transfer_stream_main - main transfer stream
  • _pipeline_layer_stream - cross-layer pipeline stream

Events (available for use):

  • prefill_offload_events[layer_id] - per-layer offload completion
  • ring_slot_ready[slot] - H2D completion
  • ring_slot_offload_done[slot] - D2H completion
  • ring_slot_compute_done[slot] - compute completion
  • _pipeline_next_layer_event - pipeline next layer ready

Buffers (already allocated):

  • k_cache_cpu/v_cache_cpu - [num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]
  • k_cache_gpu/v_cache_gpu - [num_gpu_blocks, block_size, kv_heads, head_dim] (no layer dim!)
  • decode_k_buffer/decode_v_buffer - [num_layers, block_size, kv_heads, head_dim]
  • prefill_k_buffer/prefill_v_buffer - [num_layers, block_size, kv_heads, head_dim]
  • layer_k_buffer_a/b, layer_v_buffer_a/b - [max_prefill_blocks, block_size, kv_heads, head_dim]

Useful Existing Methods

Async offload (currently unused in layerwise):

offload_prefill_buffer_async(layer_id, cpu_block_id, num_valid_tokens)
wait_all_prefill_offloads()
wait_prefill_offload(layer_id)

Cross-layer pipeline (for decode):

start_decode_pipeline(cpu_block_ids)
get_decode_layer_kv(layer_id, num_blocks) -> (k, v)
end_decode_pipeline()

Chunked Prefill Code to Remove

attention.py (lines to remove):

  • 172-312: _chunked_prefill_attention()
  • 314-346: _sync_load_previous_chunks()
  • 348-480: _ring_buffer_pipeline_load()
  • 482-591: _chunked_decode_attention()
  • 593-667: _decode_ring_buffer_pipeline()
  • 669-726: _decode_with_layer_pipeline()

context.py (fields to remove):

  • is_chunked_prefill
  • prev_kv_ranges
  • chunk_offset
  • chunked_seq
  • decode_pos_in_block
  • decode_start_pos_in_block
  • current_chunk_idx

Keep:

  • kvcache_manager - still needed for layerwise
  • sparse_prefill_policy - needed for MInference

Memory Layout

New Design: Ring-Buffered GPU KV Cache

Design principles:

  • Prioritize pipeline correctness over squeezing peak memory to the minimum
  • The number of ring-buffer layers is externally configurable (default: 4)
  • Pipeline depth = num_kv_buffers - 1

# New: ring-buffered GPU cache (dedicated to layerwise offload)
# num_kv_buffers: externally configurable, default 4
layer_k_cache: [num_kv_buffers, max_seq_tokens, kv_heads, head_dim]
layer_v_cache: [num_kv_buffers, max_seq_tokens, kv_heads, head_dim]

# Removed: the old chunked-prefill ring buffer
# k_cache_gpu: [num_gpu_blocks, block_size, kv_heads, head_dim]  <- delete
# v_cache_gpu: [num_gpu_blocks, block_size, kv_heads, head_dim]  <- delete

Why Use a Ring Buffer?

Pipeline requirements during decode (example with 4 buffers):

Buffer 0: [Load L0] → [Compute L0] ──────────────────► [Load L4]
Buffer 1:           [Load L1] → [Compute L1] ────────────────────►
Buffer 2:                     [Load L2] → [Compute L2] ────────────►
Buffer 3:                               [Load L3] → [Compute L3] ──►

Pipeline depth = 3: up to three layers can be preloaded ahead of the layer being computed, which hides H2D latency better.
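
The slot reuse in the diagram (L4 lands in Buffer 0 once L0 has been computed) can be sketched in a few lines. This is a minimal illustration, assuming a round-robin mapping of layer index to buffer index; `ring_slot` and `layers_in_flight` are hypothetical helper names, not functions from the codebase.

```python
def ring_slot(layer_id: int, num_kv_buffers: int) -> int:
    """Buffer index holding a layer's KV (round-robin mapping, an assumption)."""
    return layer_id % num_kv_buffers


def layers_in_flight(layer_id: int, num_layers: int, num_kv_buffers: int) -> list[int]:
    """Layers whose H2D load may overlap with the compute of `layer_id`."""
    depth = num_kv_buffers - 1  # pipeline depth = num_kv_buffers - 1
    return [nxt for nxt in range(layer_id + 1, layer_id + 1 + depth) if nxt < num_layers]


for layer in (0, 1, 26):
    ahead = layers_in_flight(layer, num_layers=28, num_kv_buffers=4)
    slots = [ring_slot(nxt, 4) for nxt in ahead]
    print(f"compute L{layer} in buffer {ring_slot(layer, 4)}; "
          f"prefetch {ahead} into buffers {slots}")
```

With 4 buffers, the layer being computed and the (up to) three layers being prefetched always occupy distinct buffers, so a load never overwrites KV that is still being read.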

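Memory overhead (Qwen3-4B, 128K tokens):

  • Per-layer K (or V) alone: 128K × 8 × 128 × 2 bytes = 256 MB; K and V together: 512 MB
  • 4-layer ring buffer: 4 × 512 MB = 2 GB (1 GB each for layer_k_cache and layer_v_cache)
  • Versus keeping all 28 layers on GPU: 28 × 512 MB = 14 GB
  • Savings: 14 GB - 2 GB = 12 GB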
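
The figures can be checked with a short calculation. This sketch assumes 128K means 131072 tokens and uses binary units (1 MB = 1024² bytes); it reports one tensor (K or V) and the K+V pair separately, since `layer_k_cache` and `layer_v_cache` are allocated separately. `kv_tensor_bytes` is an illustrative helper, not part of the codebase.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def kv_tensor_bytes(num_tokens: int, kv_heads: int = 8, head_dim: int = 128,
                    dtype_bytes: int = 2) -> int:
    """Bytes for ONE tensor (K or V) of ONE layer."""
    return num_tokens * kv_heads * head_dim * dtype_bytes


tokens = 128 * 1024
per_layer_k = kv_tensor_bytes(tokens)   # K only (V is the same size)
per_layer_kv = 2 * per_layer_k          # K and V together
ring_kv = 4 * per_layer_kv              # num_kv_buffers = 4
full_kv = 28 * per_layer_kv             # all 28 layers resident on GPU

print(f"per-layer K:  {per_layer_k / MIB:.0f} MiB")
print(f"per-layer KV: {per_layer_kv / MIB:.0f} MiB")
print(f"ring buffer:  {ring_kv / GIB:.0f} GiB")
print(f"full GPU:     {full_kv / GIB:.0f} GiB")
print(f"saved:        {(full_kv - ring_kv) / GIB:.0f} GiB")
```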

Configuration plumbing:

LLM(num_kv_buffers=4) → Config → OffloadEngine(num_kv_buffers=...)
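
One way this hand-off could look is sketched below. The classes are stand-ins with a `Sketch` suffix so they are not mistaken for the real `LLM`, `Config`, and `OffloadEngine`; the `>= 1` validation is an assumption, not something these notes specify.

```python
from dataclasses import dataclass


@dataclass
class ConfigSketch:
    num_kv_buffers: int = 4  # default from the design principles

    def __post_init__(self) -> None:
        # Assumed validation: at least one buffer is needed to hold a layer.
        if self.num_kv_buffers < 1:
            raise ValueError("num_kv_buffers must be >= 1")


class OffloadEngineSketch:
    def __init__(self, num_kv_buffers: int) -> None:
        self.num_kv_buffers = num_kv_buffers
        self.pipeline_depth = num_kv_buffers - 1


class LLMSketch:
    def __init__(self, **kwargs) -> None:
        # User-facing kwargs are validated once in the config, then forwarded.
        self.config = ConfigSketch(**kwargs)
        self.offload_engine = OffloadEngineSketch(
            num_kv_buffers=self.config.num_kv_buffers
        )


llm = LLMSketch(num_kv_buffers=4)
print(llm.offload_engine.pipeline_depth)
```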

CPU Cache (unchanged)

k_cache_cpu: [num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]
v_cache_cpu: [num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]

Both caches are allocated in pinned (page-locked) host memory for fast DMA transfers.

Memory per Layer (Qwen3-4B)

  • kv_heads = 8
  • head_dim = 128
  • dtype = bfloat16 (2 bytes)
  • Per-token KV (K and V): 8 * 128 * 2 bytes * 2 tensors = 4 KB
  • 128K tokens: 512 MB per layer
  • 28 layers: 14 GB total on CPU
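
The same totals follow from the CPU cache shape. The sketch below assumes `block_size = 1024` (the value used in the needle test command) and 131072 tokens; `cpu_cache_shape` is a hypothetical helper.

```python
import math

GIB = 1024 ** 3


def cpu_cache_shape(num_layers: int, max_tokens: int, block_size: int,
                    kv_heads: int, head_dim: int) -> tuple[int, ...]:
    """Shape of k_cache_cpu (v_cache_cpu has the same shape)."""
    num_cpu_blocks = (max_tokens + block_size - 1) // block_size  # ceil division
    return (num_layers, num_cpu_blocks, block_size, kv_heads, head_dim)


shape = cpu_cache_shape(num_layers=28, max_tokens=128 * 1024,
                        block_size=1024, kv_heads=8, head_dim=128)
k_bytes = math.prod(shape) * 2   # bfloat16 = 2 bytes per element
total_bytes = 2 * k_bytes        # K cache + V cache

print(shape)
print(f"k_cache_cpu: {k_bytes / GIB:.0f} GiB, K+V: {total_bytes / GIB:.0f} GiB")
```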

Stream Synchronization Pattern

Correct Pattern for Async Offload

# In offload stream
with torch.cuda.stream(offload_stream):
    offload_stream.wait_stream(compute_stream)  # Wait for compute to finish
    cpu_tensor.copy_(gpu_tensor, non_blocking=True)
    event.record(offload_stream)

# Before reusing gpu_tensor
compute_stream.wait_event(event)  # Wait for offload to complete

Correct Pattern for Async Load

# In load stream
with torch.cuda.stream(load_stream):
    gpu_buffer.copy_(cpu_tensor, non_blocking=True)
    event.record(load_stream)

# Before using gpu_buffer
compute_stream.wait_event(event)  # Wait for load to complete
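
The two patterns can be combined into a small round trip: offload a GPU tensor to pinned CPU memory, then load it back, with only event-based waits. This is a sketch rather than the engine's code; it creates its own streams and events instead of using the `OffloadEngine` ones, and the `torch` import is guarded so the snippet still imports on a machine without PyTorch.

```python
try:
    import torch
except ImportError:  # lets the snippet import on machines without PyTorch
    torch = None


def cuda_ready() -> bool:
    return torch is not None and torch.cuda.is_available()


def roundtrip_through_pinned_cpu(gpu_tensor):
    """Copy GPU -> pinned CPU -> GPU using stream/event ordering only."""
    compute_stream = torch.cuda.current_stream()
    offload_stream = torch.cuda.Stream()
    load_stream = torch.cuda.Stream()
    offload_done = torch.cuda.Event()
    load_done = torch.cuda.Event()

    cpu_tensor = torch.empty(gpu_tensor.shape, dtype=gpu_tensor.dtype,
                             device="cpu", pin_memory=True)
    gpu_buffer = torch.empty_like(gpu_tensor)

    # D2H: wait for the producer of gpu_tensor, then copy asynchronously.
    with torch.cuda.stream(offload_stream):
        offload_stream.wait_stream(compute_stream)
        cpu_tensor.copy_(gpu_tensor, non_blocking=True)
        offload_done.record(offload_stream)

    # H2D: must not start before the D2H copy has filled cpu_tensor.
    with torch.cuda.stream(load_stream):
        load_stream.wait_event(offload_done)
        gpu_buffer.copy_(cpu_tensor, non_blocking=True)
        load_done.record(load_stream)

    # Consumers on the compute stream are ordered after the load.
    compute_stream.wait_event(load_done)
    return gpu_buffer


if cuda_ready():
    x = torch.arange(1024, dtype=torch.float32, device="cuda")
    print(torch.equal(roundtrip_through_pinned_cpu(x), x))
```

Only the streams involved wait on each other, so unrelated GPU work is not stalled the way it is by a global `torch.cuda.synchronize()`.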

Test Configuration

Needle test command:

PYTHONPATH=/home/zijie/.claude-squad/worktrees/zijie/int-offload-1_188890c8699249f7:$PYTHONPATH \
python tests/test_needle.py \
    --model ~/models/Qwen3-4B-Instruct-2507/ \
    --max-model-len 32768 \
    --input-len 8192 \
    --enable-offload \
    --block-size 1024 \
    --num-gpu-blocks 2

GPU mutex check before running:

nvidia-smi --query-compute-apps=pid,name,used_memory --format=csv,noheader
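
The check can also be scripted so a test run aborts when another process holds the GPU. This is a sketch, assuming the intent of the mutex check is "no other compute process is running"; `parse_compute_apps` and `gpu_is_free` are hypothetical helpers, and the sample line is made-up illustrative output.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-compute-apps=pid,name,used_memory",
         "--format=csv,noheader"]


def parse_compute_apps(output: str) -> list[dict]:
    """Parse `pid, name, used_memory` CSV lines into dictionaries."""
    apps = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        pid, rest = line.split(",", 1)           # pid is the first field
        name, used_memory = rest.rsplit(",", 1)  # memory is the last field
        apps.append({"pid": int(pid), "name": name.strip(),
                     "used_memory": used_memory.strip()})
    return apps


def gpu_is_free() -> bool:
    """True when nvidia-smi reports no compute processes."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return not parse_compute_apps(result.stdout)


sample = "12345, python, 1024 MiB\n"  # illustrative output, not a real log
print(parse_compute_apps(sample))
```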