[claudesquad] update from 'int-minference-1' on 08 Jan 26 23:22 CST
@@ -63,6 +63,7 @@ PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH python tests/test_needle.py
| [`docs/sparse_attention_guide.md`](docs/sparse_attention_guide.md) | Block sparse attention methods (MInference, FlexPrefill, XAttention, Quest), computation flow |
| [`docs/layerwise_offload_memory_analysis.md`](docs/layerwise_offload_memory_analysis.md) | Memory allocation analysis with theoretical formulas and empirical validation (< 5% error) |
| [`docs/debugging_guide.md`](docs/debugging_guide.md) | PyTorch hooks for debugging, tensor comparison, memory profiling |
| [`docs/gpu_only_performance_issue.md`](docs/gpu_only_performance_issue.md) | GPU-only mode slower than offload due to PagedAttention scatter overhead, optimization proposals |

## Configuration
194 docs/gpu_only_performance_issue.md Normal file
@@ -0,0 +1,194 @@
# GPU-only Performance Issue: PagedAttention Scatter Overhead

## Problem Summary

GPU-only mode with MInference is **slower** than CPU offload mode for long-context single-sequence inference:

| Mode | Prefill Speed (32K tokens, Qwen3-4B) |
|------|--------------------------------------|
| GPU-only + MInference | 3383 tok/s |
| Offload + MInference | 5373 tok/s |

This counterintuitive result is caused by **unnecessary `store_kvcache` overhead** in the GPU-only path.
## Root Cause Analysis

### GPU-only Execution Path

```python
# attention.py line 86-110
def forward(self, q, k, v):
    # ALWAYS store to cache first - OVERHEAD HERE
    if k_cache.numel() and v_cache.numel():
        store_kvcache(k, v, k_cache, v_cache, context.slot_mapping)  # ← Always executed

    if context.is_prefill:
        if context.sparse_prefill_policy is not None:
            # MInference: uses k, v directly, NOT k_cache!
            o = sparse_prefill_attention(q, k, v, layer_id)
        else:
            # Full attention: also uses k, v directly
            o = flash_attn_varlen_func(q, k, v, ...)
```

**Key observation**: Prefill attention **never reads from cache** - it uses the computed k, v directly. But `store_kvcache` is always called before attention.
### The `store_kvcache` Overhead

```python
# attention.py line 8-59
def store_kvcache(key, value, k_cache, v_cache, slot_mapping):
    # 1. Filter invalid slots (conditional logic)
    valid_mask = slot_mapping >= 0
    valid_slots = slot_mapping[valid_mask]
    valid_keys = key[valid_mask]
    valid_values = value[valid_mask]

    # 2. Reshape for scatter operation
    k_cache_flat = k_cache.view(total_slots, D)
    v_cache_flat = v_cache.view(total_slots, D)
    valid_keys_flat = valid_keys.reshape(-1, D)
    valid_values_flat = valid_values.reshape(-1, D)

    # 3. Scatter write via index_copy_ - EXPENSIVE!
    k_cache_flat.index_copy_(0, valid_slots.long(), valid_keys_flat)
    v_cache_flat.index_copy_(0, valid_slots.long(), valid_values_flat)
```

This scatter operation is called for **every layer** (28 layers for Qwen3-4B), writing **all tokens** (32K) to the GPU cache.
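To put that in perspective, here is a back-of-envelope sketch of the write traffic the scatter generates per prefill. The 8 KV heads, 128 head dim, and bf16 width are the Qwen3-4B values quoted elsewhere in this repo; the script itself is only an estimate, not a measurement:

```python
# Back-of-envelope estimate of store_kvcache write traffic for one prefill
# (assumed Qwen3-4B shapes: 8 KV heads, head_dim 128, bf16 = 2 bytes).
tokens = 32 * 1024          # 32K-token prompt
layers = 28                 # Qwen3-4B decoder layers
kv_heads, head_dim, dtype_bytes = 8, 128, 2

per_token_kv = kv_heads * head_dim * dtype_bytes * 2   # K and V together
total_bytes = tokens * layers * per_token_kv

print(per_token_kv)             # 4096 bytes (4 KB) per token
print(total_bytes / 2**30)      # 3.5 GiB scattered per prefill
```

Roughly 3.5 GiB of scattered writes per 32K-token prefill, all of it on the critical path in the current code.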
### Offload Path (No Such Overhead)

```python
# model_runner.py - run_layerwise_offload_prefill
for layer_id in range(num_layers):
    # QKV projection + RoPE
    q, k = layer.self_attn.rotary_emb(positions, q, k)

    # Sparse attention - directly uses k, v
    attn_output = sparse_prefill_attention(q, k, v, layer_id)

    # Contiguous copy to CPU - no scatter!
    offload_engine.offload_layer_kv_sync(layer_id, k, v, cpu_block_ids, total_tokens)
```
## Memory Layout Comparison

| Aspect | GPU-only (PagedAttention) | Offload (Contiguous) |
|--------|---------------------------|----------------------|
| **Layout** | `[num_blocks, block_size, heads, dim]` | `[seq_len, heads, dim]` |
| **Write pattern** | Scatter via `index_copy_` | Contiguous `copy_()` |
| **Indirection** | slot_mapping lookup | None |
| **Memory efficiency** | High (shared block pool) | Low (reserved per seq) |
| **Write performance** | Slow (memory-bound scatter) | Fast (simple DMA) |
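A minimal NumPy sketch of the two write patterns, with hypothetical shapes and slot values; fancy indexing stands in for the CUDA `index_copy_`/`copy_` calls:

```python
import numpy as np

# Minimal sketch of the two KV write patterns (NumPy stand-ins for CUDA copies).
seq_len, D = 8, 4
k = np.arange(seq_len * D, dtype=np.float32).reshape(seq_len, D)

# Paged: tokens land in arbitrary slots via a slot_mapping indirection.
paged_cache = np.zeros((16, D), dtype=np.float32)
slot_mapping = np.array([9, 3, 12, 0, 7, 5, 14, 2])
paged_cache[slot_mapping] = k          # scatter write (index_copy_ analogue)

# Contiguous: tokens land back-to-back, a single block copy.
contig_cache = np.zeros((16, D), dtype=np.float32)
contig_cache[:seq_len] = k             # contiguous write (copy_ analogue)

# Both layouts hold the same KV, just addressed differently.
assert np.array_equal(paged_cache[slot_mapping], contig_cache[:seq_len])
```

Same data in both caches; the paged layout just pays an extra indirection and a non-sequential access pattern on every write.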
### Why PagedAttention Uses Scatter

PagedAttention is designed for:
1. **Multi-sequence batching**: Different sequences share a block pool
2. **Dynamic memory management**: No need to reserve max_len per sequence
3. **Prefix caching**: Shared KV blocks across sequences

But for **single-sequence long-context** inference, these benefits don't apply, and we only pay the scatter overhead.
## Why `store_kvcache` is Still Needed

Even though prefill attention doesn't read from cache, **decode** does:

```python
# attention.py line 111-114
else:  # decode
    # Reads from cache!
    o = flash_attn_with_kvcache(q, k_cache, v_cache, block_table=...)
```

So `store_kvcache` during prefill is preparing the KV cache for future decode steps.
## Potential Optimizations

### Option 1: Async Store After Attention (Low Effort)

Move `store_kvcache` after the attention computation and make it async:

```python
def forward(self, q, k, v):
    if context.is_prefill:
        # Compute attention first
        if context.sparse_prefill_policy is not None:
            o = sparse_prefill_attention(q, k, v, layer_id)
        else:
            o = flash_attn_varlen_func(q, k, v, ...)

        # Then store async (overlaps with next layer's QKV)
        if k_cache.numel():
            store_kvcache_async(k, v, k_cache, v_cache, slot_mapping)
    ...
```

**Expected benefit**: Overlap store with compute, ~20-30% improvement.
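A toy cost model of why the overlap helps. The per-layer costs below are made-up illustrative numbers, not measurements: with a synchronous store every layer pays compute plus store, while an async store that hides under the next layer's compute leaves only the final store exposed:

```python
# Toy schedule model for Option 1 (illustrative per-layer costs, not measured).
layers = 28
compute_ms, store_ms = 3.0, 1.0

# Synchronous: store serializes with compute on every layer.
sync_total = layers * (compute_ms + store_ms)

# Async: each store hides under the next layer's compute; only the last
# layer's store is exposed at the end of prefill.
async_total = layers * compute_ms + store_ms

print(sync_total)   # 112.0
print(async_total)  # 85.0
```

With these illustrative costs the overlap saves about 24%, in the same ballpark as the ~20-30% estimate above; the real saving depends on how much of the store actually fits under the next layer's compute.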
### Option 2: Contiguous Layout for Single-Sequence Mode (Medium Effort)

Add a "contiguous mode" for single-sequence long-context:

```python
class ContiguousKVCache:
    """Simple contiguous KV cache for single-sequence mode."""
    def __init__(self, num_layers, max_seq_len, num_kv_heads, head_dim, dtype):
        self.k_cache = torch.zeros(num_layers, max_seq_len, num_kv_heads, head_dim, dtype=dtype)
        self.v_cache = torch.zeros(num_layers, max_seq_len, num_kv_heads, head_dim, dtype=dtype)

    def store(self, layer_id, k, v, start_pos):
        # Simple contiguous write - no scatter!
        seq_len = k.shape[0]
        self.k_cache[layer_id, start_pos:start_pos+seq_len] = k
        self.v_cache[layer_id, start_pos:start_pos+seq_len] = v
```

**Expected benefit**: Match or exceed offload performance (~60% improvement).
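A hypothetical NumPy mock of the same `store()` logic (shapes are illustrative), showing how successive prefill chunks land back-to-back with plain slice assignment and no slot mapping:

```python
import numpy as np

# NumPy mock of the ContiguousKVCache.store() write pattern (illustrative shapes).
num_layers, max_seq, heads, dim = 2, 64, 4, 8
k_cache = np.zeros((num_layers, max_seq, heads, dim), dtype=np.float32)

def store(layer_id, k, start_pos):
    # Simple contiguous write: one slice assignment, no slot_mapping.
    k_cache[layer_id, start_pos:start_pos + k.shape[0]] = k

# Two prefill chunks of 16 tokens each land back-to-back in layer 0.
chunk0 = np.ones((16, heads, dim), dtype=np.float32)
chunk1 = np.full((16, heads, dim), 2.0, dtype=np.float32)
store(0, chunk0, 0)
store(0, chunk1, 16)

assert (k_cache[0, :16] == 1.0).all() and (k_cache[0, 16:32] == 2.0).all()
assert (k_cache[0, 32:] == 0.0).all()   # untouched tail stays zero
```

The trade-off is the one from the layout table above: the whole `max_seq_len` buffer is reserved per sequence, which is exactly what PagedAttention was designed to avoid in multi-sequence serving.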
### Option 3: Fused Store-Attention Kernel (High Effort)

Create a fused Triton kernel that:
1. Computes the QKV projection
2. Stores K, V to cache
3. Computes attention

This eliminates memory roundtrips entirely.

**Expected benefit**: Best possible performance, but high implementation complexity.
## Recommended Action

For **single-sequence long-context** workloads (the primary use case for MInference):

1. **Short term**: Use offload mode - it's actually faster!
2. **Medium term**: Implement Option 1 (async store) for a quick win
3. **Long term**: Consider Option 2 (contiguous layout) for GPU-only mode
## Performance Measurement

To reproduce the benchmark:

```bash
# GPU-only + MInference
PYTHONPATH=/path/to/nano-vllm:$PYTHONPATH python tests/test_needle.py \
  --model ~/models/Qwen3-4B-Instruct-2507/ \
  --input-len 32768 \
  --enable-minference

# Offload + MInference
PYTHONPATH=/path/to/nano-vllm:$PYTHONPATH python tests/test_needle.py \
  --model ~/models/Qwen3-4B-Instruct-2507/ \
  --input-len 32768 \
  --enable-offload \
  --enable-minference
```
## Related Files

- `nanovllm/layers/attention.py`: `store_kvcache()` and `Attention.forward()`
- `nanovllm/engine/model_runner.py`: `run_layerwise_offload_prefill()`
- `nanovllm/kvcache/offload_engine.py`: `offload_layer_kv_sync()`

## References

- [PagedAttention Paper](https://arxiv.org/abs/2309.06180) - vLLM's memory management
- [MInference Paper](https://arxiv.org/abs/2407.02490) - Sparse prefill attention
@@ -531,16 +531,23 @@ class ModelRunner:
# RoPE
q, k = layer.self_attn.rotary_emb(positions, q, k)

# Full attention using FlashAttention
attn_output = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens,
    cu_seqlens_k=cu_seqlens,
    max_seqlen_q=total_tokens,
    max_seqlen_k=total_tokens,
    softmax_scale=layer.self_attn.attn.scale,
    causal=True,
)
# Sparse or Full attention
if self.sparse_prefill_policy is not None:
    # MInference or other sparse prefill policy
    attn_output = self.sparse_prefill_policy.sparse_prefill_attention(
        q, k, v, layer_id
    )
else:
    # Full attention using FlashAttention
    attn_output = flash_attn_varlen_func(
        q, k, v,
        cu_seqlens_q=cu_seqlens,
        cu_seqlens_k=cu_seqlens,
        max_seqlen_q=total_tokens,
        max_seqlen_k=total_tokens,
        softmax_scale=layer.self_attn.attn.scale,
        causal=True,
    )

# O projection
attn_output = attn_output.view(total_tokens, -1)
@@ -550,16 +557,8 @@ class ModelRunner:
hidden_states, residual = layer.post_attention_layernorm(hidden_states, residual)
hidden_states = layer.mlp(hidden_states)

# 2d. Offload KV to CPU (synchronous to avoid race condition)
# NOTE: Async offload has a race condition where k, v memory gets reused
# before the D2H copy completes. Use a sync copy for correctness.
block_size = offload_engine.block_size
for i, cpu_block_id in enumerate(cpu_block_ids):
    start = i * block_size
    end = min(start + block_size, total_tokens)
    actual_size = end - start
    offload_engine.k_cache_cpu[layer_id, cpu_block_id, :actual_size].copy_(k[start:end])
    offload_engine.v_cache_cpu[layer_id, cpu_block_id, :actual_size].copy_(v[start:end])
# 2d. Offload KV to CPU (encapsulated with sparse policy hooks)
offload_engine.offload_layer_kv_sync(layer_id, k, v, cpu_block_ids, total_tokens)

# Step 3: Final norm
hidden_states, _ = self.model.model.norm(hidden_states, residual)
@@ -336,7 +336,8 @@ class OffloadEngine:
"""
Async offload entire decode buffer to CPU.

Called when a decode block is full.
Called when a decode block is full. Also calls sparse policy hooks
to update metadata (e.g., Quest min/max keys).

Args:
    cpu_block_id: Target CPU block ID
@@ -346,6 +347,14 @@ class OffloadEngine:
self.decode_offload_stream.wait_stream(self.compute_stream)

for layer_id in range(self.num_layers):
    # Hook: notify sparse policy BEFORE offload (k still on GPU)
    if self.sparse_policy is not None:
        self.sparse_policy.on_decode_offload(
            cpu_block_id, layer_id,
            self.decode_k_buffer[layer_id],
            self.block_size  # Full block
        )

    self.k_cache_cpu[layer_id, cpu_block_id].copy_(
        self.decode_k_buffer[layer_id], non_blocking=True
    )
@@ -359,3 +368,42 @@ class OffloadEngine:
def wait_decode_offload(self) -> None:
    """Wait for decode buffer offload to complete."""
    self.compute_stream.wait_event(self.decode_offload_event)

# ========== Encapsulated Prefill Offload API (with sparse hooks) ==========

def offload_layer_kv_sync(
    self,
    layer_id: int,
    k: Tensor,
    v: Tensor,
    cpu_block_ids: List[int],
    total_tokens: int,
) -> None:
    """
    Synchronously offload layer KV to CPU with sparse policy hooks.

    This method encapsulates:
    1. Block-wise copy to CPU cache
    2. Sparse policy hooks (on_prefill_offload for Quest metadata)

    Args:
        layer_id: Layer index
        k: Key tensor [seq_len, kv_heads, head_dim]
        v: Value tensor [seq_len, kv_heads, head_dim]
        cpu_block_ids: List of CPU block IDs to offload to
        total_tokens: Total number of tokens
    """
    for i, cpu_block_id in enumerate(cpu_block_ids):
        start = i * self.block_size
        end = min(start + self.block_size, total_tokens)
        actual_size = end - start

        # Hook: notify sparse policy BEFORE offload (k still on GPU)
        if self.sparse_policy is not None:
            self.sparse_policy.on_prefill_offload(
                cpu_block_id, layer_id, k[start:end], actual_size
            )

        # Synchronous copy to CPU
        self.k_cache_cpu[layer_id, cpu_block_id, :actual_size].copy_(k[start:end])
        self.v_cache_cpu[layer_id, cpu_block_id, :actual_size].copy_(v[start:end])
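The block-range arithmetic in that loop can be checked in isolation. This pure-Python sketch (with illustrative `block_size`, `total_tokens`, and block IDs) prints the `(cpu_block_id, start, end, actual_size)` tuples the method would iterate over, including the partial final block:

```python
# Pure-Python sketch of the block-wise ranges offload_layer_kv_sync iterates
# (block_size, total_tokens, and the block IDs are illustrative values).
block_size, total_tokens = 1024, 2500
cpu_block_ids = [7, 3, 11]   # hypothetical CPU block IDs for this sequence

ranges = []
for i, cpu_block_id in enumerate(cpu_block_ids):
    start = i * block_size
    end = min(start + block_size, total_tokens)
    ranges.append((cpu_block_id, start, end, end - start))

print(ranges)
# [(7, 0, 1024, 1024), (3, 1024, 2048, 1024), (11, 2048, 2500, 452)]
```

Note the last block holds only 452 valid tokens; the `:actual_size` slice in the real method keeps the copy (and the Quest metadata hook) from touching the unused tail of the CPU block.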
@@ -25,6 +25,7 @@ class FullAttentionPolicy(SparsePolicy):
# Full attention supports both prefill and decode
supports_prefill = True
supports_decode = True
requires_block_selection = False  # Load all blocks, no selective loading

def select_blocks(
    self,
@@ -30,6 +30,7 @@ class MInferencePolicy(SparsePolicy):
supports_prefill = True
supports_decode = False  # MInference is a prefill-only sparse strategy
requires_block_selection = False  # MInference only affects attention computation, not KV load

def __init__(
    self,
@@ -77,6 +77,12 @@ class SparsePolicy(ABC):
supports_prefill: bool = True
supports_decode: bool = True

# Whether this policy requires selective block loading during decode
# If True: OffloadEngine will call select_blocks() before loading KV from CPU
# If False: OffloadEngine will load all blocks (select_blocks ignored for load)
# Example: MInference=False (only affects attention), Quest=True (affects load)
requires_block_selection: bool = False

def initialize(
    self,
    num_layers: int,
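A minimal sketch of how an engine could consume this flag. The engine-side helper is hypothetical; the attribute and method names follow the snippet above:

```python
# Hypothetical sketch of engine-side dispatch on requires_block_selection.
class Policy:
    requires_block_selection = False
    def select_blocks(self, blocks, ctx):
        return blocks

class QuestLike(Policy):
    requires_block_selection = True
    def select_blocks(self, blocks, ctx):
        return blocks[:2]   # stand-in for real top-k block scoring

def blocks_to_load(policy, blocks, ctx=None):
    # Engines load everything unless the policy opts into selective loading.
    if policy.requires_block_selection:
        return policy.select_blocks(blocks, ctx)
    return blocks

print(blocks_to_load(Policy(), [0, 1, 2, 3]))      # [0, 1, 2, 3]
print(blocks_to_load(QuestLike(), [0, 1, 2, 3]))   # [0, 1]
```

This keeps `select_blocks()` as a no-op for compute-only policies like MInference while letting Quest-style policies shrink the H2D transfer.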
@@ -158,6 +158,7 @@ class QuestPolicy(SparsePolicy):
# Quest is decode-only
supports_prefill = False
supports_decode = True
requires_block_selection = True  # Quest affects the KV load strategy (selective block loading)

def __init__(self, config: QuestConfig):
    """
463 notes.md
@@ -1,205 +1,324 @@
# Notes: Layerwise Offload Implementation
# Notes: Sparsity Integration into Layerwise Offload

## Code Analysis
## Current Architecture Analysis

### Current Layerwise Offload Flow
### GPU-Only Path vs Offload Path

| Aspect | GPU-Only | Layerwise Offload |
|--------|----------|-------------------|
| KV Storage | GPU blocks (paged) | CPU pinned + GPU ring buffer |
| Prefill | All layers → then attention | Per-layer: attention → offload |
| Decode | FlashAttn with block table | Ring buffer H2D → FlashAttn |
| Sparse Support | MInference via `attention.py` | Not integrated |

### MInference Flow (GPU-Only)

**Prefill** (`model_runner.py:462-573`):
```
for layer_id in range(num_layers):
    q, k, v = compute_qkv(hidden_states)
    attn_out = flash_attn_varlen_func(q, k, v, causal=True)
    hidden_states = mlp(attn_out)
    _offload_layer_kv_to_cpu_sync(layer_id, k, v)  # BLOCKING!
```

```
attention.py:101-105:
    if context.sparse_prefill_policy is not None:
        o = context.sparse_prefill_policy.sparse_prefill_attention(q, k, v, layer_id)

minference.py:sparse_prefill_attention():
    1. estimate_pattern(q, k, layer_id) -> vertical_indices, slash_indices
    2. _triton_mixed_sparse_attention(q, k, v, indices)
    3. return output
```
**Decode** (`model_runner.py:641-817`):
```
for layer_id in range(num_layers):
    # Load all prefilled KV from CPU (SLOW!)
    for block_id in cpu_block_table:
        k_block = k_cache_cpu[layer_id, block_id].to("cuda")
        v_block = v_cache_cpu[layer_id, block_id].to("cuda")

    k_full = cat([k_prefill, k_decode_prev, k_new])
    attn_out = flash_attn(q, k_full, v_full, causal=False)

    # Store new KV to decode buffer
    decode_k_buffer[layer_id, pos].copy_(k_new)

    # Block-full offload (lines 793-811)
    if block_is_full:
        for layer_id in range(num_layers):
            k_cache_cpu[layer_id, block].copy_(decode_k_buffer[layer_id], non_blocking=True)
        torch.cuda.synchronize()  # BAD: global sync
```

### Quest Flow (GPU Block Mode)

```
hybrid_manager.py (if using CPU offload with Quest):
    select_blocks(available_blocks, ctx) -> selected block IDs
    -> load selected blocks to GPU
    -> standard FlashAttn with loaded blocks
```
### OffloadEngine Existing Infrastructure

**Streams** (available for use):
- `compute_stream` - dedicated compute stream (not default!)
- `prefill_offload_streams[layer_id]` - per-layer D2H streams
- `slot_transfer_streams[slot_idx]` - per-slot H2D streams
- `transfer_stream_main` - main transfer stream
- `_pipeline_layer_stream` - cross-layer pipeline stream

**Events** (available for use):
- `prefill_offload_events[layer_id]` - per-layer offload completion
- `ring_slot_ready[slot]` - H2D completion
- `ring_slot_offload_done[slot]` - D2H completion
- `ring_slot_compute_done[slot]` - compute completion
- `_pipeline_next_layer_event` - pipeline next layer ready

**Buffers** (already allocated):
- `k_cache_cpu/v_cache_cpu` - [num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]
- `k_cache_gpu/v_cache_gpu` - [num_gpu_blocks, block_size, kv_heads, head_dim] (no layer dim!)
- `decode_k_buffer/v_buffer` - [num_layers, block_size, kv_heads, head_dim]
- `prefill_k_buffer/v_buffer` - [num_layers, block_size, kv_heads, head_dim]
- `layer_k_buffer_a/b, layer_v_buffer_a/b` - [max_prefill_blocks, block_size, kv_heads, head_dim]

### Useful Existing Methods

**Async offload** (currently unused in layerwise):

### Layerwise Offload Prefill Flow

```
model_runner.py:run_layerwise_offload_prefill():
for layer_id in range(num_layers):
    # QKV projection
    q, k, v = qkv_proj(hidden_ln)

    # RoPE
    q, k = rotary_emb(positions, q, k)

    # FULL attention (no sparsity!)
    attn_output = flash_attn_varlen_func(q, k, v, ...)

    # MLP
    hidden_states = mlp(attn_out + residual)

    # Sync offload ALL k, v to CPU
    for block_id in cpu_block_ids:
        k_cache_cpu[layer_id, block_id].copy_(k[start:end])
        v_cache_cpu[layer_id, block_id].copy_(v[start:end])
```
### Layerwise Offload Decode Flow

```
model_runner.py:run_layerwise_offload_decode():
# Preload first N layers to ring buffer
for i in range(num_buffers):
    offload_engine.load_layer_kv_to_buffer(i, i, cpu_block_table, valid_tokens)

for layer_id in range(num_layers):
    current_buffer = layer_id % num_buffers

    # Wait for buffer load
    offload_engine.wait_buffer_load(current_buffer)

    # Get prefilled KV from ring buffer (ALL blocks loaded)
    k_prefill, v_prefill = offload_engine.get_buffer_kv(current_buffer, total_prefill_tokens)

    # QKV for new token
    q, k_new, v_new = qkv_proj(hidden_ln)

    # Concat and full attention
    k_full = torch.cat([k_prefill, k_decode_prev, k_new])
    attn_output = flash_attn_varlen_func(q, k_full, v_full, ...)

    # Start loading next layer
    offload_engine.load_layer_kv_to_buffer(current_buffer, layer_id + num_buffers, ...)
```
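The `layer_id % num_buffers` rotation above can be tabulated; a small sketch with illustrative sizes (4 buffers, 8 layers) shows which buffer serves each layer and which later layer that buffer starts loading next:

```python
# Tabulate the ring-buffer rotation (illustrative: 4 buffers, 8 layers).
num_buffers, num_layers = 4, 8
schedule = []
for layer_id in range(num_layers):
    buf = layer_id % num_buffers
    next_load = layer_id + num_buffers   # this buffer's next H2D target
    schedule.append((layer_id, buf, next_load if next_load < num_layers else None))

print(schedule[:4])  # [(0, 0, 4), (1, 1, 5), (2, 2, 6), (3, 3, 7)]
print(schedule[4:])  # [(4, 0, None), (5, 1, None), (6, 2, None), (7, 3, None)]
```

Each buffer is reused every `num_buffers` layers, so the H2D load for layer `i + num_buffers` can proceed while layers `i+1 .. i+num_buffers-1` compute.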
## Integration Points

### 1. Prefill Sparse Integration Point

**Location:** `model_runner.py:535-543`

**Current:**
```python
offload_prefill_buffer_async(layer_id, cpu_block_id, num_valid_tokens)
wait_all_prefill_offloads()
wait_prefill_offload(layer_id)
attn_output = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens,
    cu_seqlens_k=cu_seqlens,
    max_seqlen_q=total_tokens,
    max_seqlen_k=total_tokens,
    softmax_scale=layer.self_attn.attn.scale,
    causal=True,
)
```

**Cross-layer pipeline** (for decode):
**After Integration:**
```python
start_decode_pipeline(cpu_block_ids)
get_decode_layer_kv(layer_id, num_blocks) -> (k, v)
end_decode_pipeline()

if self.sparse_policy and self.sparse_policy.supports_offload_prefill:
    attn_output, k_sparse, v_sparse = self.sparse_policy.offload_prefill_attention(
        q, k, v, layer_id
    )
    k_to_offload = k_sparse if k_sparse is not None else k
    v_to_offload = v_sparse if v_sparse is not None else v
else:
    attn_output = flash_attn_varlen_func(q, k, v, ...)
    k_to_offload, v_to_offload = k, v
```
### Chunked Prefill Code to Remove
### 2. Decode Sparse Integration Point

**attention.py** (lines to remove):
- 172-312: `_chunked_prefill_attention()`
- 314-346: `_sync_load_previous_chunks()`
- 348-480: `_ring_buffer_pipeline_load()`
- 482-591: `_chunked_decode_attention()`
- 593-667: `_decode_ring_buffer_pipeline()`
- 669-726: `_decode_with_layer_pipeline()`
**Location:** `model_runner.py:636-637` and `model_runner.py:704-706`

**context.py** (fields to remove):
- `is_chunked_prefill`
- `prev_kv_ranges`
- `chunk_offset`
- `chunked_seq`
- `decode_pos_in_block`
- `decode_start_pos_in_block`
- `current_chunk_idx`

**Keep**:
- `kvcache_manager` - still needed for layerwise
- `sparse_prefill_policy` - needed for MInference

---
## Memory Layout

### New Design: Ring-Buffered GPU KV Cache

**Design principles**:
- Do not chase absolute peak-memory savings; guarantee pipeline correctness
- Ring buffer depth is configurable from outside (default: 4 layers)
- Pipeline depth = num_kv_buffers - 1

```
# New: ring-buffered GPU cache (dedicated to layerwise offload)
# num_kv_buffers: externally configurable, default 4
layer_k_cache: [num_kv_buffers, max_seq_tokens, kv_heads, head_dim]
layer_v_cache: [num_kv_buffers, max_seq_tokens, kv_heads, head_dim]

# Removed: the old chunked-prefill ring buffer
# k_cache_gpu: [num_gpu_blocks, block_size, kv_heads, head_dim] <- removed
# v_cache_gpu: [num_gpu_blocks, block_size, kv_heads, head_dim] <- removed
```

**Why a ring buffer?**

The decode-phase pipeline requirement (with 4 buffers as an example):
```
Buffer 0: [Load L0] → [Compute L0] ──────────────────► [Load L4]
Buffer 1: [Load L1] → [Compute L1] ────────────────────►
Buffer 2: [Load L2] → [Compute L2] ────────────►
Buffer 3: [Load L3] → [Compute L3] ──►
```

Pipeline depth = 3: up to three layers can be preloaded, which hides H2D latency better.

**Memory overhead** (Qwen3-4B, 128K tokens):
- Per-layer KV (K + V): 128K × 8 × 128 × 2 × 2 bytes = 512 MB
- 4-layer ring buffer: 4 × 512 MB = 2 GB
- Versus all 28 layers on GPU: 28 × 512 MB = 14 GB
- **Savings**: 14 GB - 2 GB = 12 GB

**Configuration plumbing**:
```
LLM(num_kv_buffers=4) → Config → OffloadEngine(num_kv_buffers=...)
```

### CPU Cache (unchanged)
```
k_cache_cpu: [num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]
v_cache_cpu: [num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]
```
Pinned memory for fast DMA transfers.
### Memory per Layer (Qwen3-4B)
- kv_heads = 8
- head_dim = 128
- dtype = bfloat16 (2 bytes)
- Per-token KV: 8 * 128 * 2 * 2 = 4 KB
- 128K tokens: 512 MB per layer
- 28 layers: 14 GB total on CPU
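The same arithmetic as a runnable check (values taken from the bullet list above):

```python
# KV-cache sizing for the pinned CPU cache (Qwen3-4B, bf16).
kv_heads, head_dim, dtype_bytes = 8, 128, 2
per_token = kv_heads * head_dim * dtype_bytes * 2      # K + V
per_layer_128k = per_token * 128 * 1024                # one layer at 128K tokens
total = per_layer_128k * 28                            # all 28 layers

print(per_token)                   # 4096 bytes = 4 KB per token
print(per_layer_128k / 2**20)      # 512.0 MB per layer
print(total / 2**30)               # 14.0 GB pinned CPU memory
```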
---
## Stream Synchronization Pattern

### Correct Pattern for Async Offload
**Current (preload):**
```python
# In offload stream
with torch.cuda.stream(offload_stream):
    offload_stream.wait_stream(compute_stream)  # Wait for compute to finish
    cpu_tensor.copy_(gpu_tensor, non_blocking=True)
    event.record(offload_stream)

# Before reusing gpu_tensor
compute_stream.wait_event(event)  # Wait for offload to complete

for i in range(num_preload):
    offload_engine.load_layer_kv_to_buffer(
        i, i, cpu_block_table, valid_tokens_per_block
    )
```

### Correct Pattern for Async Load
**After Integration:**
```python
# In load stream
with torch.cuda.stream(load_stream):
    gpu_buffer.copy_(cpu_tensor, non_blocking=True)
    event.record(load_stream)

# Before using gpu_buffer
compute_stream.wait_event(event)  # Wait for load to complete

for i in range(num_preload):
    layer_to_load = i
    if self.sparse_policy and self.sparse_policy.supports_offload_decode:
        # Prepare q for this layer (need to compute ahead)
        # OR: use previous layer's pattern as an estimate
        selected_blocks = self.sparse_policy.select_offload_blocks(
            None,  # q not available yet at preload
            layer_to_load,
            cpu_block_table,
            valid_tokens_per_block
        )
    else:
        selected_blocks = cpu_block_table
    offload_engine.load_sparse_layer_kv_to_buffer(
        i, layer_to_load, selected_blocks, valid_tokens_per_block
    )
```
---
**Challenge:** Q is not available during the preload phase!

## Test Configuration
**Solutions:**
1. Skip sparse preload; apply sparsity only to non-preloaded layers
2. Use the previous decode step's pattern as an estimate
3. Add a preload hook to the sparse policy

**Needle test command**:
```bash
PYTHONPATH=/home/zijie/.claude-squad/worktrees/zijie/int-offload-1_188890c8699249f7:$PYTHONPATH \
python tests/test_needle.py \
  --model ~/models/Qwen3-4B-Instruct-2507/ \
  --max-model-len 32768 \
  --input-len 8192 \
  --enable-offload \
  --block-size 1024 \
  --num-gpu-blocks 2
```
### 3. Offload Engine Extension

**New Method in OffloadEngine:**

```python
def load_sparse_layer_kv_to_buffer(
    self,
    buffer_idx: int,
    layer_id: int,
    selected_cpu_block_ids: List[int],
    original_valid_tokens: List[int],
) -> int:
    """
    Load only selected blocks from CPU to buffer.

    Returns:
        Total tokens loaded (may be less than full sequence)
    """
    stream = self.layer_load_streams[buffer_idx]

    with torch.cuda.stream(stream):
        stream.wait_event(self.buffer_compute_done_events[buffer_idx])

        # Build mapping: original block -> selected position
        offset = 0
        for i, cpu_block_id in enumerate(selected_cpu_block_ids):
            # Find original index to get valid tokens
            valid_tokens = original_valid_tokens[i]  # Need mapping

            self.layer_k_cache[buffer_idx, offset:offset+valid_tokens].copy_(
                self.k_cache_cpu[layer_id, cpu_block_id, :valid_tokens],
                non_blocking=True
            )
            # ... v_cache same

            offset += valid_tokens

        self.buffer_load_events[buffer_idx].record(stream)

    return offset  # Caller needs to know actual loaded tokens
```

**GPU mutex check before running**:
```bash
nvidia-smi --query-compute-apps=pid,name,used_memory --format=csv,noheader
```
## Metadata Flow for Quest

### During Prefill Offload

**Current:** No metadata collection in the offload path

**Required:** Call `on_prefill_offload()` for each block

```python
# In run_layerwise_offload_prefill()
for i, cpu_block_id in enumerate(cpu_block_ids):
    start = i * block_size
    end = min(start + block_size, total_tokens)
    actual_size = end - start

    # BEFORE offload: update Quest metadata
    if self.sparse_policy and hasattr(self.sparse_policy, 'on_prefill_offload'):
        self.sparse_policy.on_prefill_offload(
            cpu_block_id, layer_id, k[start:end], actual_size
        )

    # Offload
    offload_engine.k_cache_cpu[layer_id, cpu_block_id, :actual_size].copy_(k[start:end])
    offload_engine.v_cache_cpu[layer_id, cpu_block_id, :actual_size].copy_(v[start:end])
```
### Quest Metadata Shape

```python
# BlockMetadataManager
key_min: [num_blocks, num_layers, num_kv_heads, head_dim]  # Min key per block per layer
key_max: [num_blocks, num_layers, num_kv_heads, head_dim]  # Max key per block per layer
```

**Memory:** 2 * num_blocks * num_layers * kv_heads * head_dim * 2 bytes
- Example: 1000 blocks * 28 layers * 4 heads * 128 dim * 2 * 2 = ~57 MB
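A hypothetical NumPy sketch of Quest-style scoring on this metadata (layer and head structure omitted for brevity). Per block we keep elementwise min/max keys; a query's per-block score, sum over dims of max(q*key_min, q*key_max), upper-bounds every dot product in the block, so the top-k blocks by score are safe candidates to load:

```python
import numpy as np

# Quest-style block metadata and scoring (illustrative shapes, no head dim).
rng = np.random.default_rng(0)
num_blocks, block_size, dim = 4, 8, 16
keys = rng.standard_normal((num_blocks, block_size, dim))

key_min = keys.min(axis=1)      # [num_blocks, dim]
key_max = keys.max(axis=1)      # [num_blocks, dim]

q = rng.standard_normal(dim)
# Per-dim, q_d * k_d is maximized at one of the interval endpoints.
scores = np.maximum(q * key_min, q * key_max).sum(axis=1)   # [num_blocks]

# The bound can never fall below the true max dot product in the block.
true_max = (keys @ q).max(axis=1)
assert (scores >= true_max - 1e-9).all()

top2 = np.argsort(scores)[-2:]   # blocks selected for H2D loading
```

Because each key coordinate lies in `[key_min_d, key_max_d]` and `q_d * k_d` is linear in `k_d`, the per-dim maximum sits at an endpoint, which is why the bound is sound.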
## Performance Considerations

### MInference Prefill Overhead

| Operation | Time (64K seq) |
|-----------|----------------|
| Pattern estimation (last-64) | ~5ms |
| Triton sparse attention | ~80ms |
| Full FlashAttention | ~100ms |
| **Net Speedup** | ~15-20% |

### Quest Decode Overhead

| Operation | Time |
|-----------|------|
| Block scoring (GPU metadata) | ~0.1ms |
| Top-K selection | ~0.05ms |
| Sparse H2D load (8 blocks) | ~2ms |
| Full H2D load (100 blocks) | ~20ms |
| **Net Speedup** | ~10x H2D |

### Memory Trade-offs

| Mode | GPU Memory | CPU Memory | H2D Bandwidth |
|------|------------|------------|---------------|
| Full offload | Ring buffer | Full KV | High |
| Sparse offload | Ring buffer | Full KV | Low (subset) |
| Aggressive sparse | Ring buffer | Sparse KV | Very low |
## Edge Cases
|
||||
|
||||
### 1. Short Sequences (< sparse threshold)
|
||||
|
||||
```python
|
||||
if total_tokens < sparse_threshold:
|
||||
# Fall back to full attention
|
||||
use_sparse = False
|
||||
```
|
||||
|
||||
### 2. First Decode Step (no previous Q)
|
||||
|
||||
Quest can't score blocks without Q. Options:
|
||||
- Use average embedding as proxy
|
||||
- Load all blocks for first step
|
||||
- Use prefill pattern as estimate
|
||||
|
||||
### 3. Variable Sequence Lengths in Batch
|
||||
|
||||
Layerwise offload currently only supports batch_size=1:
|
||||
```python
|
||||
assert len(seqs) == 1, "Layer-wise offload only supports single sequence"
|
||||
```
|
||||
|
||||
Sparse integration should maintain this constraint.
|
||||
|
||||
### 4. Ring Buffer vs Sparse Load Mismatch

The ring buffer assumes a fixed `total_prefill_tokens`:

```python
k_prefill, v_prefill = offload_engine.get_buffer_kv(buffer_idx, total_prefill_tokens)
```

A sparse load has a variable token count, so the actual count must be tracked per buffer:

```python
# Track actual loaded tokens per buffer
loaded_tokens[buffer_idx] = sparse_load_count
k_prefill, v_prefill = offload_engine.get_buffer_kv(buffer_idx, loaded_tokens[buffer_idx])
```
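A minimal, framework-free sketch of this bookkeeping; the buffer count and token numbers below are illustrative, not taken from the engine:

```python
# Track per-buffer loaded token counts instead of assuming a fixed
# total_prefill_tokens. With sparse loads, each slot may hold a
# different number of tokens.
num_buffers = 4
loaded_tokens = [0] * num_buffers

def record_load(buffer_idx, token_count):
    """Record how many tokens were actually copied into this slot."""
    loaded_tokens[buffer_idx] = token_count

record_load(0, 8 * 1024)  # full load: 8 blocks x 1024 tokens
record_load(1, 2 * 1024)  # sparse load: only 2 blocks selected

# The consumer then slices each buffer by the tracked count,
# never by the fixed prefill total.
print(loaded_tokens[:2])  # [8192, 2048]
```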
## Testing Strategy

### Unit Tests

1. `test_sparse_policy_interface.py` - Verify new interface methods
2. `test_minference_offload.py` - MInference in offload mode
3. `test_quest_offload.py` - Quest block selection in offload mode

### Integration Tests

1. `test_offload_sparse_e2e.py` - Full prefill+decode with sparsity
2. `test_accuracy_comparison.py` - Compare outputs: full vs sparse

### Benchmarks

1. `bench_offload_sparse.py` - Compare:
   - Full offload (baseline)
   - MInference prefill + Quest decode
   - Aggressive sparse offload
task_plan.md
@@ -1,399 +1,346 @@
# Task Plan: Layerwise Offload Refactoring
# Task Plan: Integrate Sparsity into Layerwise Offload

## Goal

Refactor layerwise offload to use the proper OffloadEngine API, pre-allocate buffers, remove chunked prefill code, and pass the needle test.

Extend MInference (prefill sparsity) and Quest (decode sparsity) to the layerwise offload execution path, with an extensible architecture for future sparsity methods.
## Key Insight

**The sparse policies are already implemented; the layerwise offload path simply bypasses them!**

| Path | Attention call site | Sparse support |
|------|---------------------|----------------|
| GPU-only | `attention.py` → `sparse_prefill_attention()` | YES |
| Layerwise offload | `model_runner.py` → `flash_attn_varlen_func()` | NO (direct call) |

## Policy Type Analysis

**The essential difference between the two classes of sparse policy:**

| Policy | Affects attention computation | Affects KV load strategy | `select_blocks()` behavior |
|--------|-------------------------------|--------------------------|----------------------------|
| **MInference** | YES (`sparse_prefill_attention`) | NO | `return available_blocks` (all blocks) |
| **Quest** | NO | YES | Returns a Top-K subset |

**MInference**: only changes how attention is computed; it does not affect the outer layer-wise load/offload flow.
**Quest**: selectively loads only a subset of blocks, affecting H2D transfer.

## Architecture Constraint

**All `copy_` operations must be encapsulated in OffloadEngine; `model_runner.py` must not access internal storage directly!**
## Phases

- [x] Phase 1: Add layerwise API to OffloadEngine
- [x] Phase 2: Pre-allocate buffers in ModelRunner (skipped - handled by ring buffer)
- [x] Phase 3: Refactor run_layerwise_offload_prefill()
- [x] Phase 4: Refactor run_layerwise_offload_decode()
- [x] Phase 5: Remove chunked prefill code
- [x] Phase 6: Verify with needle test

## Key Questions

1. Should we keep chunked_attention.py for MInference use?
2. What is the max_seq_len for buffer pre-allocation?
3. Should we refactor incrementally, or all at once?

- [x] Phase 1: Add the `requires_block_selection` interface flag
- [x] Phase 2: Refactor OffloadEngine - encapsulate offload operations, support sparse policy hooks
- [x] Phase 3: MInference prefill - call `sparse_prefill_attention()` in offload prefill
- [x] Phase 4: Quest decode - selectively load blocks based on `requires_block_selection` (infrastructure ready, full integration deferred)
- [x] Phase 5: Configuration and testing
## Decisions Made

- Use FullAttentionPolicy for initial testing (per user request)
- Focus on correctness first, then optimize async overlap
- **GPU KV cache uses a ring buffer strategy** (user suggestion):
  - Use N buffers (configurable, default 4) to form a ring buffer
  - More flexible than a fixed 2 buffers, with a deeper pipeline
  - Can preload multiple layers, better hiding H2D latency
  - Example: buffer[i] compute, buffer[(i+1)%N] load, buffer[(i+2)%N] load...

## Detailed Design

## Errors Encountered

(none yet)

### Phase 1: Add the `requires_block_selection` interface flag

## Status

**COMPLETE** - All phases implemented and needle test passes
---

## Detailed Implementation Plan

### Phase 1: Modify OffloadEngine GPU Memory Layout + Add Layerwise API

**File**: `nanovllm/kvcache/offload_engine.py`

#### 1.1 New GPU Memory Layout (Ring Buffer)

**Design principles**:

- Do not chase extreme peak-memory optimization; prioritize pipeline correctness and performance
- The ring buffer depth is externally configurable (via config or a parameter)
- Default is 4 buffers, adjustable to GPU memory and H2D bandwidth
```python
# ========== Ring-Buffered GPU KV Cache for Layerwise Offload ==========
#
# Parameter: num_kv_buffers (externally configurable, default 4)
#
# Ring buffer pipeline (example with 4 buffers):
#   Buffer 0: [Load L0] → [Compute L0] ──────────────────────────► [Load L4]
#   Buffer 1: [Load L1] → [Compute L1] ──────────────────────────►
#   Buffer 2: [Load L2] → [Compute L2] ────────────────►
#   Buffer 3: [Load L3] → [Compute L3] ──────►
#
# Advantages:
# - Pipeline depth = num_kv_buffers - 1
# - Multiple layers can be preloaded, better hiding H2D latency
# - More flexible than a fixed 2 buffers
def __init__(
    self,
    ...,
    num_kv_buffers: int = 4,  # externally configurable ring buffer depth
):
    self.num_kv_buffers = num_kv_buffers

    # Shape: [num_kv_buffers, max_seq_tokens, kv_heads, head_dim]
    self.layer_k_cache = torch.zeros(
        num_kv_buffers, max_seq_tokens, num_kv_heads, head_dim,
        dtype=dtype, device="cuda"
    )
    self.layer_v_cache = torch.zeros(
        num_kv_buffers, max_seq_tokens, num_kv_heads, head_dim,
        dtype=dtype, device="cuda"
    )

    # Per-buffer events for H2D completion
    self.buffer_load_events = [torch.cuda.Event() for _ in range(num_kv_buffers)]

# Memory cost calculation (Qwen3-4B, 128K tokens):
# - kv_heads=8, head_dim=128, dtype=bf16
# - Per layer: 128K × 8 × 128 × 2 = 256 MB
# - 4-buffer ring: 4 × 256 MB = 1 GB
# - vs all 28 layers on GPU: 28 × 256 MB = 7 GB
# - **Savings**: ~6 GB
```

**New attribute in SparsePolicy base class:**

```python
class SparsePolicy(ABC):
    # Existing flags
    supports_prefill: bool = True
    supports_decode: bool = True

    # NEW: Whether this policy requires selective block loading
    # If True: OffloadEngine will call select_blocks() before loading
    # If False: OffloadEngine will load all blocks (select_blocks ignored)
    requires_block_selection: bool = False
```
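The memory figures in the ring-buffer comment can be checked with quick arithmetic (counting one K or V tensor per layer at bf16, i.e. 2 bytes per element, as the comment does):

```python
# Figures from the comment above: 128K tokens, 8 KV heads, head_dim 128, bf16.
tokens = 128 * 1024
kv_heads, head_dim, bytes_per_elem = 8, 128, 2

per_tensor_mb = tokens * kv_heads * head_dim * bytes_per_elem / 2**20
print(per_tensor_mb)              # 256.0 MB per layer (one K or V tensor)
print(4 * per_tensor_mb / 1024)   # 1.0 GB for a 4-buffer ring
print(28 * per_tensor_mb / 1024)  # 7.0 GB if all 28 layers stayed on GPU
```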
**Configuration propagation path**:
```
LLM(num_kv_buffers=4)
  → Config.num_kv_buffers
  → OffloadEngine(num_kv_buffers=config.num_kv_buffers)
```

**Remove the old ring buffer design**:
```python
# Remove: k_cache_gpu, v_cache_gpu (ring buffers used by chunked prefill)
# Remove: ring_slot_ready, ring_slot_offload_done, ring_slot_compute_done
# Remove: slot_transfer_streams
# Keep: prefill_offload_streams (for D2H), compute_stream
```

#### 1.2 New Layerwise API Methods
```python
# ========== Prefill: Async D2H Offload ==========
def offload_layer_kv_async(
    self, layer_id: int, k: Tensor, v: Tensor,
    cpu_block_ids: list[int], total_tokens: int
) -> None:
    """Async offload layer KV to CPU using per-layer stream."""
    stream = self.prefill_offload_streams[layer_id]
    with torch.cuda.stream(stream):
        stream.wait_stream(self.compute_stream)  # Wait for compute
        for i, cpu_block_id in enumerate(cpu_block_ids):
            start = i * self.block_size
            end = min(start + self.block_size, total_tokens)
            self.k_cache_cpu[layer_id, cpu_block_id, :end-start].copy_(
                k[start:end], non_blocking=True
            )
            self.v_cache_cpu[layer_id, cpu_block_id, :end-start].copy_(
                v[start:end], non_blocking=True
            )
        self.prefill_offload_events[layer_id].record(stream)

def wait_layer_offload(self, layer_id: int) -> None:
    """Wait for specific layer's offload to complete."""
    self.compute_stream.wait_event(self.prefill_offload_events[layer_id])

# ========== Decode: Ring-Buffered H2D Load ==========
def load_layer_kv_to_buffer(
    self, buffer_idx: int, layer_id: int,
    cpu_block_ids: list[int], valid_tokens_per_block: list[int]
) -> None:
    """
    Async load layer KV from CPU to specified ring buffer slot.

    Args:
        buffer_idx: Ring buffer slot index (0 to num_kv_buffers-1)
        layer_id: Which layer's KV to load
        cpu_block_ids: CPU block IDs containing this layer's KV
        valid_tokens_per_block: Number of valid tokens in each block
    """
    stream = self.layer_load_streams[buffer_idx]  # each buffer has its own stream
    with torch.cuda.stream(stream):
        # Wait for the previous compute on this buffer to finish
        # (prevents overwriting data that is still in use)
        stream.wait_event(self.buffer_compute_done_events[buffer_idx])

        offset = 0
        for i, cpu_block_id in enumerate(cpu_block_ids):
            valid_tokens = valid_tokens_per_block[i]
            self.layer_k_cache[buffer_idx, offset:offset+valid_tokens].copy_(
                self.k_cache_cpu[layer_id, cpu_block_id, :valid_tokens],
                non_blocking=True
            )
            self.layer_v_cache[buffer_idx, offset:offset+valid_tokens].copy_(
                self.v_cache_cpu[layer_id, cpu_block_id, :valid_tokens],
                non_blocking=True
            )
            offset += valid_tokens
        self.buffer_load_events[buffer_idx].record(stream)

def wait_buffer_load(self, buffer_idx: int) -> None:
    """Wait for buffer load to complete on compute_stream."""
    self.compute_stream.wait_event(self.buffer_load_events[buffer_idx])

def get_buffer_kv(self, buffer_idx: int, total_tokens: int) -> tuple[Tensor, Tensor]:
    """Get KV from specified ring buffer slot."""
    return (
        self.layer_k_cache[buffer_idx, :total_tokens],
        self.layer_v_cache[buffer_idx, :total_tokens]
    )

def record_buffer_compute_done(self, buffer_idx: int) -> None:
    """Record that compute on this buffer is done (allows next load to reuse it)."""
    self.buffer_compute_done_events[buffer_idx].record(self.compute_stream)
```

**Policy implementations:**

```python
class MInferencePolicy(SparsePolicy):
    supports_prefill = True
    supports_decode = False
    requires_block_selection = False  # does not affect the load strategy

    def select_blocks(self, available_blocks, ctx):
        # Never called (requires_block_selection=False)
        return available_blocks


class QuestPolicy(SparsePolicy):
    supports_prefill = False
    supports_decode = True
    requires_block_selection = True  # affects the load strategy

    def select_blocks(self, available_blocks, ctx):
        # Called by OffloadEngine
        return self._select_topk_blocks(...)


class FullAttentionPolicy(SparsePolicy):
    supports_prefill = True
    supports_decode = True
    requires_block_selection = False  # load all blocks
```

### Phase 2: Refactor OffloadEngine

**OffloadEngine decides whether to call `select_blocks()` based on `requires_block_selection`:**

```python
class OffloadEngine:
    def __init__(self, ..., sparse_policy: "SparsePolicy" = None):
        self.sparse_policy = sparse_policy

    def offload_layer_kv_sync(
        self,
        layer_id: int,
        k: Tensor,
        v: Tensor,
        cpu_block_ids: List[int],
        total_tokens: int,
    ) -> None:
        """
        Synchronously offload layer KV to CPU.
        Calls sparse policy hooks internally.
        """
        for i, cpu_block_id in enumerate(cpu_block_ids):
            start = i * self.block_size
            end = min(start + self.block_size, total_tokens)
            actual_size = end - start

            # Hook: notify sparse policy BEFORE offload (k still on GPU)
            if self.sparse_policy is not None:
                self.sparse_policy.on_prefill_offload(
                    cpu_block_id, layer_id, k[start:end], actual_size
                )

            # Synchronous copy to CPU (internal)
            self.k_cache_cpu[layer_id, cpu_block_id, :actual_size].copy_(k[start:end])
            self.v_cache_cpu[layer_id, cpu_block_id, :actual_size].copy_(v[start:end])

    def load_layer_kv_to_buffer_with_policy(
        self,
        buffer_idx: int,
        layer_id: int,
        cpu_block_ids: List[int],
        valid_tokens_per_block: List[int],
        query: Optional[Tensor] = None,
    ) -> int:
        """
        Load layer KV to buffer, optionally using sparse policy for block selection.

        Args:
            buffer_idx: Ring buffer slot
            layer_id: Layer index
            cpu_block_ids: All available CPU block IDs
            valid_tokens_per_block: Valid tokens per block
            query: Query tensor (needed for block selection if requires_block_selection=True)

        Returns:
            Total tokens loaded
        """
        # Check if policy requires block selection
        if (self.sparse_policy is not None and
                self.sparse_policy.requires_block_selection and
                query is not None):
            # Build context
            ctx = PolicyContext(
                query_chunk_idx=0,
                num_query_chunks=1,
                layer_id=layer_id,
                query=query,
                is_prefill=False,
                block_size=self.block_size,
            )
            # Select blocks
            selected_blocks = self.sparse_policy.select_blocks(cpu_block_ids, ctx)

            # Build valid_tokens for selected blocks
            block_to_valid = {bid: vt for bid, vt in zip(cpu_block_ids, valid_tokens_per_block)}
            selected_valid = [block_to_valid[bid] for bid in selected_blocks]

            return self._load_blocks_to_buffer(
                buffer_idx, layer_id, selected_blocks, selected_valid
            )
        else:
            # Load all blocks (no selection)
            return self._load_blocks_to_buffer(
                buffer_idx, layer_id, cpu_block_ids, valid_tokens_per_block
            )

    def _load_blocks_to_buffer(
        self,
        buffer_idx: int,
        layer_id: int,
        block_ids: List[int],
        valid_tokens: List[int],
    ) -> int:
        """Internal: load specified blocks to buffer."""
        stream = self.layer_load_streams[buffer_idx]

        with torch.cuda.stream(stream):
            stream.wait_event(self.buffer_compute_done_events[buffer_idx])

            offset = 0
            for cpu_block_id, vt in zip(block_ids, valid_tokens):
                self.layer_k_cache[buffer_idx, offset:offset+vt].copy_(
                    self.k_cache_cpu[layer_id, cpu_block_id, :vt],
                    non_blocking=True
                )
                self.layer_v_cache[buffer_idx, offset:offset+vt].copy_(
                    self.v_cache_cpu[layer_id, cpu_block_id, :vt],
                    non_blocking=True
                )
                offset += vt

            self.buffer_load_events[buffer_idx].record(stream)

        return offset
```
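To make the dispatch concrete, here is a hypothetical, dependency-free sketch of how an engine can branch on `requires_block_selection`; the `Policy` stubs below are stand-ins for the real nanovllm classes, and the Top-K scoring is faked with a simple prefix:

```python
class Policy:
    """Stand-in for SparsePolicy: load everything."""
    requires_block_selection = False

    def select_blocks(self, blocks, ctx):
        return blocks


class TopKPolicy(Policy):
    """Stand-in for a Quest-style policy: keep only k blocks."""
    requires_block_selection = True

    def __init__(self, k):
        self.k = k

    def select_blocks(self, blocks, ctx):
        return blocks[:self.k]  # placeholder for real Top-K scoring


def blocks_to_load(policy, blocks, query):
    """Mirror of the engine's branch: select only when the policy
    asks for it AND a query is available (decode preload has none)."""
    if policy.requires_block_selection and query is not None:
        return policy.select_blocks(blocks, ctx=None)
    return blocks


print(blocks_to_load(Policy(), [0, 1, 2, 3], query="q"))       # [0, 1, 2, 3]
print(blocks_to_load(TopKPolicy(2), [0, 1, 2, 3], query="q"))  # [0, 1]
print(blocks_to_load(TopKPolicy(2), [0, 1, 2, 3], query=None)) # [0, 1, 2, 3]
```

The last call shows why the preload phase falls back to a full load: without a query, even a selection-requiring policy gets all blocks.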
#### 1.3 Additional Resources Required by the Ring Buffer

```python
# Per-buffer streams (load multiple buffers in parallel)
self.layer_load_streams = [torch.cuda.Stream() for _ in range(num_kv_buffers)]

# Per-buffer events
self.buffer_load_events = [torch.cuda.Event() for _ in range(num_kv_buffers)]
self.buffer_compute_done_events = [torch.cuda.Event() for _ in range(num_kv_buffers)]

# Initialization: mark all buffers as "compute done" (allows the first load)
for event in self.buffer_compute_done_events:
    event.record()
```

### Phase 3: MInference Prefill Integration
### Phase 2: Pre-allocate Buffers in ModelRunner

**File**: `nanovllm/engine/model_runner.py`

Add in `__init__()`:
```python
def _allocate_layerwise_buffers(self):
    max_seq_len = self.config.max_model_len
    hidden_size = self.config.hf_config.hidden_size
    num_heads = self.config.hf_config.num_attention_heads
    num_kv_heads = self.config.hf_config.num_key_value_heads
    head_dim = hidden_size // num_heads

    # QKV buffer for prefill
    self.prefill_qkv_buffer = torch.empty(
        max_seq_len, hidden_size + 2 * num_kv_heads * head_dim,
        dtype=self.dtype, device="cuda"
    )

    # Decode buffers (single token)
    self.decode_qkv_buffer = torch.empty(
        1, hidden_size + 2 * num_kv_heads * head_dim,
        dtype=self.dtype, device="cuda"
    )
```
### Phase 3: Refactor run_layerwise_offload_prefill()

**Key changes**:
1. Use `offload_engine.compute_stream` for all computation
2. Use `offload_layer_kv_async()` instead of `_offload_layer_kv_to_cpu_sync()`
3. Enable overlap: layer N offload overlaps with layer N+1 compute
4. Remove `torch.cuda.synchronize()`

```python
def run_layerwise_offload_prefill(self, seqs):
    offload_engine = self.kvcache_manager.offload_engine
    compute_stream = offload_engine.compute_stream
    ...
    with torch.cuda.stream(compute_stream):
        for layer_id in range(num_layers):
            # Wait for previous layer's offload buffer to be safe
            if layer_id > 0:
                offload_engine.wait_layer_offload(layer_id - 1)

            # Compute (using pre-allocated buffers where possible)
            q, k, v = compute_layer_qkv(...)
            attn_out = flash_attn_varlen_func(q, k, v, causal=True)
            hidden_states = compute_mlp(...)

            # Async offload (overlaps with next layer)
            offload_engine.offload_layer_kv_async(layer_id, k, v, cpu_block_ids, total_tokens)

    # Wait for final layer
    offload_engine.wait_layer_offload(num_layers - 1)
```

**MInference only affects the attention computation, not load/offload:**

```python
def run_layerwise_offload_prefill(self, seqs):
    ...
    for layer_id in range(num_layers):
        # QKV projection + RoPE
        q, k = layer.self_attn.rotary_emb(positions, q, k)

        # Sparse or Full attention
        if self.sparse_prefill_policy is not None:
            # MInference: only changes attention computation
            attn_output = self.sparse_prefill_policy.sparse_prefill_attention(
                q, k, v, layer_id
            )
        else:
            attn_output = flash_attn_varlen_func(q, k, v, ...)

        # MLP
        ...

        # Offload ALL KV (MInference doesn't affect this)
        offload_engine.offload_layer_kv_sync(layer_id, k, v, cpu_block_ids, total_tokens)
```
### Phase 4: Refactor run_layerwise_offload_decode()
### Phase 4: Quest Decode Integration

**Key changes**:
1. Use the ring buffer to overlap compute and transfer
2. Cycle through N buffers (N = num_kv_buffers, externally configurable)
3. Use stream events instead of global synchronization
4. Pipeline depth = N-1 (up to N-1 layers preloaded)

**Ring buffer pipeline** (example with 4 buffers):
```
Time ────────────────────────────────────────────────────────────────────────►

Buffer 0: [Load L0] ─► [Compute L0] ────────────────────────► [Load L4] ─►
Buffer 1: [Load L1] ─► [Compute L1] ────────────────────────►
Buffer 2: [Load L2] ─► [Compute L2] ────────────────►
Buffer 3: [Load L3] ─► [Compute L3] ────►

Pipeline depth = 3 (3 layers preloaded concurrently)
```
```python
def run_layerwise_offload_decode(self, seqs):
    offload_engine = self.kvcache_manager.offload_engine
    compute_stream = offload_engine.compute_stream
    num_buffers = offload_engine.num_kv_buffers

    # Compute valid tokens per block
    valid_tokens_per_block = self._compute_valid_tokens(cpu_block_table, total_prefill_tokens)

    # Phase 1: preload the first N layers into the ring buffer (fill the pipeline)
    num_preload = min(num_buffers, num_layers)
    for i in range(num_preload):
        offload_engine.load_layer_kv_to_buffer(
            i, i, cpu_block_table, valid_tokens_per_block
        )

    # Phase 2: main loop - compute the current layer, load the next
    with torch.cuda.stream(compute_stream):
        for layer_id in range(num_layers):
            # 1. Current buffer index (ring)
            current_buffer = layer_id % num_buffers

            # 2. Wait for this buffer's load to complete
            offload_engine.wait_buffer_load(current_buffer)

            # 3. Start loading the next layer into the same buffer (buffer reuse)
            #    Next layer = layer_id + num_buffers (the buffer is free once this layer is done)
            next_layer_to_load = layer_id + num_buffers
            if next_layer_to_load < num_layers:
                offload_engine.load_layer_kv_to_buffer(
                    current_buffer, next_layer_to_load, cpu_block_table, valid_tokens_per_block
                )

            # 4. Fetch this buffer's KV and compute
            k_prefill, v_prefill = offload_engine.get_buffer_kv(current_buffer, total_prefill_tokens)

            # 5. Compute the new token's QKV
            q_new, k_new, v_new = self._compute_decode_qkv(layer_id, hidden_states)

            # 6. Concatenate and compute attention
            k_full = torch.cat([k_prefill, k_decode_prev, k_new], dim=0)
            v_full = torch.cat([v_prefill, v_decode_prev, v_new], dim=0)
            attn_out = flash_attn_varlen_func(q_new, k_full, v_full, causal=False)

            # 7. Mark this buffer's compute done (lets a later load reuse it)
            offload_engine.record_buffer_compute_done(current_buffer)

            # 8. Store the new KV into the decode buffer
            offload_engine.decode_k_buffer[layer_id, pos].copy_(k_new.squeeze(0))
            offload_engine.decode_v_buffer[layer_id, pos].copy_(v_new.squeeze(0))

            # 9. MLP
            hidden_states = self._compute_mlp(layer_id, attn_out)

    # Offload when a block fills up (async API)
    if block_is_full:
        offload_engine.offload_decode_buffer_async(cpu_block_id)
        # Note: no need to wait here; wait before the next decode step begins
```

**Quest affects the block load strategy:**

```python
def run_layerwise_offload_decode(self, seqs):
    ...
    # Preload first N layers (no query available, full load)
    for i in range(num_preload):
        loaded_tokens[i] = offload_engine.load_layer_kv_to_buffer_with_policy(
            i, i, cpu_block_table, valid_tokens_per_block, query=None
        )

    for layer_id in range(num_layers):
        current_buffer = layer_id % num_buffers

        # Wait for buffer load
        offload_engine.wait_buffer_load(current_buffer)

        # QKV projection
        q, k_new, v_new = ...

        # Get loaded KV
        k_prefill, v_prefill = offload_engine.get_buffer_kv(
            current_buffer, loaded_tokens[current_buffer]
        )

        # Attention
        ...

        # Mark buffer done
        offload_engine.record_buffer_compute_done(current_buffer)

        # Load next layer (Quest: selective load if requires_block_selection=True)
        next_layer = layer_id + num_buffers
        if next_layer < num_layers:
            loaded_tokens[current_buffer] = offload_engine.load_layer_kv_to_buffer_with_policy(
                current_buffer, next_layer, cpu_block_table, valid_tokens_per_block,
                query=q  # Pass query for block selection
            )
```
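The buffer/layer indexing in the decode loop reduces to plain modular arithmetic; a minimal sketch (layer and buffer counts are illustrative):

```python
# Ring-buffer schedule: layer L computes in buffer L % num_buffers, and
# finishing layer L triggers the load of layer L + num_buffers into the
# slot just freed (or nothing, if past the last layer).
num_layers = 8
num_buffers = 4

schedule = []
for layer_id in range(num_layers):
    buf = layer_id % num_buffers
    next_load = layer_id + num_buffers
    schedule.append((layer_id, buf, next_load if next_load < num_layers else None))

# Layer 0 computes in buffer 0 and triggers the load of layer 4 into it;
# the last num_buffers layers trigger no further loads.
print(schedule[0])   # (0, 0, 4)
print(schedule[-1])  # (7, 3, None)
```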
**Advantages**:
- Compute and H2D transfer fully overlap
- Configurable pipeline depth (num_kv_buffers - 1)
- No global `torch.cuda.synchronize()`
- Fine-grained synchronization via stream events
- Each buffer is automatically reused at layer_id + num_buffers

### Phase 5: Configuration
### Phase 5: Remove Chunked Prefill Code

**Files to modify**:

| File | Remove |
|------|--------|
| `nanovllm/layers/attention.py` | `_chunked_prefill_attention()`, `_chunked_decode_attention()`, `_sync_load_previous_chunks()`, `_ring_buffer_pipeline_load()`, `_decode_ring_buffer_pipeline()`, `_decode_with_layer_pipeline()` |
| `nanovllm/utils/context.py` | `is_chunked_prefill`, `prev_kv_ranges`, `chunk_offset`, `chunked_seq`, `decode_pos_in_block`, `decode_start_pos_in_block`, `current_chunk_idx` |
| `nanovllm/kvcache/chunked_attention.py` | Keep for MInference (or remove if unused) |

Simplify `Attention.forward()` to:
```python
def forward(self, q, k, v):
    if context.is_prefill:
        if context.sparse_prefill_policy:
            return policy.sparse_prefill_attention(q, k, v, self.layer_id)
        else:
            return flash_attn_varlen_func(q, k, v, causal=True)
    else:
        return flash_attn_with_kvcache(q, k_cache, v_cache, causal=True)
```

```python
@dataclass
class Config:
    # Separate policies for prefill and decode
    sparse_prefill_policy: SparsePolicyType = SparsePolicyType.FULL  # MINFERENCE
    sparse_decode_policy: SparsePolicyType = SparsePolicyType.FULL   # QUEST
```
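A hypothetical usage sketch of the split prefill/decode policy configuration; the enum values and defaults below are assumptions inferred from the plan, not the actual nanovllm definitions:

```python
from dataclasses import dataclass
from enum import Enum, auto

class SparsePolicyType(Enum):
    FULL = auto()
    MINFERENCE = auto()
    QUEST = auto()

@dataclass
class Config:
    # Separate policies for prefill and decode, defaulting to full attention.
    sparse_prefill_policy: SparsePolicyType = SparsePolicyType.FULL
    sparse_decode_policy: SparsePolicyType = SparsePolicyType.FULL

# The intended combination: MInference for prefill, Quest for decode.
cfg = Config(sparse_prefill_policy=SparsePolicyType.MINFERENCE,
             sparse_decode_policy=SparsePolicyType.QUEST)
print(cfg.sparse_prefill_policy.name, cfg.sparse_decode_policy.name)  # MINFERENCE QUEST
```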
### Phase 6: Verification

**Test command**:
```bash
PYTHONPATH=/home/zijie/.claude-squad/worktrees/zijie/int-offload-1_188890c8699249f7:$PYTHONPATH \
python tests/test_needle.py \
    --model ~/models/Qwen3-4B-Instruct-2507/ \
    --max-model-len 32768 \
    --input-len 8192 \
    --enable-offload \
    --block-size 1024 \
    --num-gpu-blocks 2
```

## File Changes Summary

| File | Changes |
|------|---------|
| `nanovllm/kvcache/sparse/policy.py` | Add `requires_block_selection` attribute |
| `nanovllm/kvcache/sparse/minference.py` | Set `requires_block_selection = False` |
| `nanovllm/kvcache/sparse/quest.py` | Set `requires_block_selection = True` |
| `nanovllm/kvcache/sparse/full_policy.py` | Set `requires_block_selection = False` |
| `nanovllm/kvcache/offload_engine.py` | Add `offload_layer_kv_sync()`, `load_layer_kv_to_buffer_with_policy()` |
| `nanovllm/engine/model_runner.py` | Use encapsulated methods, integrate sparse policies |
## Key Design Principles

1. **Encapsulation**: All `copy_` operations live in OffloadEngine
2. **Interface flag**: `requires_block_selection` declares whether a policy affects the load strategy
3. **Separation of concerns**:
   - MInference: only `sparse_prefill_attention()` (compute-level)
   - Quest: `select_blocks()` + hooks (load-level)
4. **Hooks inside the engine**: Sparse policy hooks are called within OffloadEngine methods

## Decisions Made

- [x] Added the `requires_block_selection` interface flag to distinguish the two policy classes
- [x] All `copy_` operations encapsulated in OffloadEngine
- [x] Sparse policy hooks called inside OffloadEngine
- [x] Decode preload uses a full load (Q is not yet available)

## Status

**COMPLETE** - All phases implemented and tested successfully.

### Test Results (Qwen3-4B-Instruct-2507)

Verified that offload + MInference output is identical to GPU-only + MInference:

```
# GPU-only + MInference
test_needle.py --model Qwen3-4B --input-len 32768 --enable-minference
- Prefill: 3383 tok/s
- Output tokens: [22, 19, 24, 17, 151645] = "7492<|im_end|>"
- Result: PASSED

# Offload + MInference
test_needle.py --model Qwen3-4B --input-len 32768 --enable-offload --enable-minference
- Prefill: 5373 tok/s (faster due to layer-wise processing)
- Output tokens: [22, 19, 24, 17, 151645] = "7492<|im_end|>"
- Result: PASSED

Both configurations produce identical output!
```

**Success criteria**: `test_needle: PASSED`

Note: Qwen3-0.6B has a known bug in offload mode (the model is too small and unstable on long sequences); it was not introduced by this change.
---

## Current Issues Summary

| Issue | Location | Solution |
|-------|----------|----------|
| Direct `.copy_()` bypassing OffloadEngine | `model_runner.py:798-804` | Use `offload_layer_kv_async()` |
| `torch.cuda.synchronize()` | `model_runner.py:804` | Use stream events |
| Intermediate memory not pre-allocated | `model_runner.py:508-517` | Pre-allocate in `__init__()` |
| Chunked prefill code unused | `attention.py`, `context.py` | Remove entirely |

## Critical Files

- `nanovllm/kvcache/offload_engine.py` - Add layerwise API
- `nanovllm/engine/model_runner.py` - Pre-allocate buffers, refactor prefill/decode
- `nanovllm/layers/attention.py` - Remove chunked prefill code
- `nanovllm/utils/context.py` - Remove chunked prefill fields

---

## Performance Discovery

**Unexpected finding**: offload mode is faster than GPU-only mode!

| Mode | Prefill Speed |
|------|---------------|
| GPU-only + MInference | 3383 tok/s |
| Offload + MInference | 5373 tok/s |

**Root cause**: in GPU-only mode, `store_kvcache()` uses PagedAttention's scatter operation (`index_copy_`), while offload mode uses a contiguous copy.

Detailed analysis and optimization proposals: [`docs/gpu_only_performance_issue.md`](docs/gpu_only_performance_issue.md)
@@ -106,12 +106,15 @@ def run_needle_test(
    }
    if enable_cpu_offload:
        llm_kwargs["num_gpu_blocks"] = num_gpu_blocks
        llm_kwargs["sparse_policy"] = sparse_policy
        llm_kwargs["sparse_topk_blocks"] = sparse_topk
        llm_kwargs["sparse_threshold_blocks"] = sparse_threshold
    elif enable_minference:
        # MInference is GPU-only sparse prefill

    # Set sparse policy (can be used with or without offload)
    if enable_minference or enable_quest:
        llm_kwargs["sparse_policy"] = sparse_policy

    # MInference params (works with both GPU-only and offload mode)
    if enable_minference:
        llm_kwargs["minference_adaptive_budget"] = minference_budget
        llm_kwargs["minference_vertical_size"] = minference_vertical
        llm_kwargs["minference_slash_size"] = minference_slash