[claudesquad] update from 'int-minference-1' on 08 Jan 26 23:22 CST
463
notes.md
@@ -1,205 +1,324 @@
|
||||
# Notes: Layerwise Offload Implementation
|
||||
# Notes: Sparsity Integration into Layerwise Offload
|
||||
|
||||
## Code Analysis
|
||||
## Current Architecture Analysis
|
||||
|
||||
### Current Layerwise Offload Flow
|
||||
### GPU-Only Path vs Offload Path
|
||||
|
||||
| Aspect | GPU-Only | Layerwise Offload |
|--------|----------|-------------------|
| KV Storage | GPU blocks (paged) | CPU pinned + GPU ring buffer |
| Prefill | All layers → then attention | Per-layer: attention → offload |
| Decode | FlashAttn with block table | Ring buffer H2D → FlashAttn |
| Sparse Support | MInference via `attention.py` | Not integrated |
|
||||
|
||||
### MInference Flow (GPU-Only)
|
||||
|
||||
**Prefill** (`model_runner.py:462-573`):
|
||||
```
|
||||
for layer_id in range(num_layers):
|
||||
q, k, v = compute_qkv(hidden_states)
|
||||
attn_out = flash_attn_varlen_func(q, k, v, causal=True)
|
||||
hidden_states = mlp(attn_out)
|
||||
_offload_layer_kv_to_cpu_sync(layer_id, k, v) # BLOCKING!
|
||||
attention.py:101-105:
|
||||
if context.sparse_prefill_policy is not None:
|
||||
o = context.sparse_prefill_policy.sparse_prefill_attention(q, k, v, layer_id)
|
||||
|
||||
minference.py:sparse_prefill_attention():
|
||||
1. estimate_pattern(q, k, layer_id) -> vertical_indices, slash_indices
|
||||
2. _triton_mixed_sparse_attention(q, k, v, indices)
|
||||
3. return output
|
||||
```
|
||||
|
||||
**Decode** (`model_runner.py:641-817`):
|
||||
### Quest Flow (GPU Block Mode)
|
||||
|
||||
```
|
||||
for layer_id in range(num_layers):
|
||||
# Load all prefilled KV from CPU (SLOW!)
|
||||
for block_id in cpu_block_table:
|
||||
k_block = k_cache_cpu[layer_id, block_id].to("cuda")
|
||||
v_block = v_cache_cpu[layer_id, block_id].to("cuda")
|
||||
|
||||
k_full = cat([k_prefill, k_decode_prev, k_new])
|
||||
attn_out = flash_attn(q, k_full, v_full, causal=False)
|
||||
|
||||
# Store new KV to decode buffer
|
||||
decode_k_buffer[layer_id, pos].copy_(k_new)
|
||||
|
||||
# Block-full offload (lines 793-811)
|
||||
if block_is_full:
|
||||
for layer_id in range(num_layers):
|
||||
k_cache_cpu[layer_id, block].copy_(decode_k_buffer[layer_id], non_blocking=True)
|
||||
torch.cuda.synchronize() # BAD: global sync
|
||||
hybrid_manager.py (if using CPU offload with Quest):
|
||||
select_blocks(available_blocks, ctx) -> selected block IDs
|
||||
-> load selected blocks to GPU
|
||||
-> standard FlashAttn with loaded blocks
|
||||
```
|
||||
|
||||
### OffloadEngine Existing Infrastructure
|
||||
### Layerwise Offload Prefill Flow
|
||||
|
||||
**Streams** (available for use):
- `compute_stream` - dedicated compute stream (not default!)
- `prefill_offload_streams[layer_id]` - per-layer D2H streams
- `slot_transfer_streams[slot_idx]` - per-slot H2D streams
- `transfer_stream_main` - main transfer stream
- `_pipeline_layer_stream` - cross-layer pipeline stream
|
||||
```
|
||||
model_runner.py:run_layerwise_offload_prefill():
|
||||
for layer_id in range(num_layers):
|
||||
# QKV projection
|
||||
q, k, v = qkv_proj(hidden_ln)
|
||||
|
||||
**Events** (available for use):
- `prefill_offload_events[layer_id]` - per-layer offload completion
- `ring_slot_ready[slot]` - H2D completion
- `ring_slot_offload_done[slot]` - D2H completion
- `ring_slot_compute_done[slot]` - compute completion
- `_pipeline_next_layer_event` - pipeline next layer ready
|
||||
# RoPE
|
||||
q, k = rotary_emb(positions, q, k)
|
||||
|
||||
**Buffers** (already allocated):
- `k_cache_cpu/v_cache_cpu` - [num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]
- `k_cache_gpu/v_cache_gpu` - [num_gpu_blocks, block_size, kv_heads, head_dim] (no layer dim!)
- `decode_k_buffer/v_buffer` - [num_layers, block_size, kv_heads, head_dim]
- `prefill_k_buffer/v_buffer` - [num_layers, block_size, kv_heads, head_dim]
- `layer_k_buffer_a/b, layer_v_buffer_a/b` - [max_prefill_blocks, block_size, kv_heads, head_dim]
|
||||
# FULL attention (no sparsity!)
|
||||
attn_output = flash_attn_varlen_func(q, k, v, ...)
|
||||
|
||||
### Useful Existing Methods
|
||||
# MLP
|
||||
hidden_states = mlp(attn_out + residual)
|
||||
|
||||
**Async offload** (currently unused in layerwise):
|
||||
# Sync offload ALL k, v to CPU
|
||||
for block_id in cpu_block_ids:
|
||||
k_cache_cpu[layer_id, block_id].copy_(k[start:end])
|
||||
v_cache_cpu[layer_id, block_id].copy_(v[start:end])
|
||||
```
|
||||
|
||||
### Layerwise Offload Decode Flow
|
||||
|
||||
```
model_runner.py:run_layerwise_offload_decode():
    # Preload first N layers to ring buffer
    for i in range(num_buffers):
        offload_engine.load_layer_kv_to_buffer(i, i, cpu_block_table, valid_tokens)

    for layer_id in range(num_layers):
        current_buffer = layer_id % num_buffers

        # Wait for buffer load
        offload_engine.wait_buffer_load(current_buffer)

        # Get prefilled KV from ring buffer (ALL blocks loaded)
        k_prefill, v_prefill = offload_engine.get_buffer_kv(current_buffer, total_prefill_tokens)

        # QKV for new token
        q, k_new, v_new = qkv_proj(hidden_ln)

        # Concat and full attention
        k_full = torch.cat([k_prefill, k_decode_prev, k_new])
        attn_output = flash_attn_varlen_func(q, k_full, v_full, ...)

        # Start loading next layer
        offload_engine.load_layer_kv_to_buffer(current_buffer, layer_id + num_buffers, ...)
```
|
||||
|
||||
## Integration Points
|
||||
|
||||
### 1. Prefill Sparse Integration Point
|
||||
|
||||
**Location:** `model_runner.py:535-543`
|
||||
|
||||
**Current:**
|
||||
```python
|
||||
offload_prefill_buffer_async(layer_id, cpu_block_id, num_valid_tokens)
|
||||
wait_all_prefill_offloads()
|
||||
wait_prefill_offload(layer_id)
|
||||
attn_output = flash_attn_varlen_func(
|
||||
q, k, v,
|
||||
cu_seqlens_q=cu_seqlens,
|
||||
cu_seqlens_k=cu_seqlens,
|
||||
max_seqlen_q=total_tokens,
|
||||
max_seqlen_k=total_tokens,
|
||||
softmax_scale=layer.self_attn.attn.scale,
|
||||
causal=True,
|
||||
)
|
||||
```
|
||||
|
||||
**Cross-layer pipeline** (for decode):
|
||||
**After Integration:**
|
||||
```python
|
||||
start_decode_pipeline(cpu_block_ids)
|
||||
get_decode_layer_kv(layer_id, num_blocks) -> (k, v)
|
||||
end_decode_pipeline()
|
||||
if self.sparse_policy and self.sparse_policy.supports_offload_prefill:
|
||||
attn_output, k_sparse, v_sparse = self.sparse_policy.offload_prefill_attention(
|
||||
q, k, v, layer_id
|
||||
)
|
||||
k_to_offload = k_sparse if k_sparse is not None else k
|
||||
v_to_offload = v_sparse if v_sparse is not None else v
|
||||
else:
|
||||
attn_output = flash_attn_varlen_func(q, k, v, ...)
|
||||
k_to_offload, v_to_offload = k, v
|
||||
```
|
||||
|
||||
### Chunked Prefill Code to Remove
|
||||
### 2. Decode Sparse Integration Point
|
||||
|
||||
**attention.py** (lines to remove):
- 172-312: `_chunked_prefill_attention()`
- 314-346: `_sync_load_previous_chunks()`
- 348-480: `_ring_buffer_pipeline_load()`
- 482-591: `_chunked_decode_attention()`
- 593-667: `_decode_ring_buffer_pipeline()`
- 669-726: `_decode_with_layer_pipeline()`
|
||||
**Location:** `model_runner.py:636-637` and `model_runner.py:704-706`
|
||||
|
||||
**context.py** (fields to remove):
- `is_chunked_prefill`
- `prev_kv_ranges`
- `chunk_offset`
- `chunked_seq`
- `decode_pos_in_block`
- `decode_start_pos_in_block`
- `current_chunk_idx`

**Keep**:
- `kvcache_manager` - still needed for layerwise
- `sparse_prefill_policy` - needed for MInference
|
||||
|
||||
---
|
||||
|
||||
## Memory Layout

### New Design: Ring-Buffered GPU KV Cache

**Design principles**:
- Do not chase the lowest possible peak memory; pipeline correctness comes first
- The number of ring-buffer layers is externally configurable (default: 4)
- Pipeline depth = num_kv_buffers - 1

```
# New: ring-buffered GPU cache (used only by layerwise offload)
# num_kv_buffers: externally configurable, default 4
layer_k_cache: [num_kv_buffers, max_seq_tokens, kv_heads, head_dim]
layer_v_cache: [num_kv_buffers, max_seq_tokens, kv_heads, head_dim]

# Removed: the old chunked-prefill ring buffer
# k_cache_gpu: [num_gpu_blocks, block_size, kv_heads, head_dim] <- delete
# v_cache_gpu: [num_gpu_blocks, block_size, kv_heads, head_dim] <- delete
```

**Why a ring buffer?**

Pipeline requirements in the decode phase (with 4 buffers as an example):
```
Buffer 0: [Load L0] → [Compute L0] ──────────────────► [Load L4]
Buffer 1: [Load L1] → [Compute L1] ────────────────────►
Buffer 2: [Load L2] → [Compute L2] ────────────►
Buffer 3: [Load L3] → [Compute L3] ──►
```

With a pipeline depth of 3, three layers can be preloaded ahead of the one being computed, which hides H2D latency better.
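
The buffer rotation in the diagram can be checked with a small pure-Python simulation. This is only a sketch: `ring_schedule` is a hypothetical helper, not a function from the codebase, and it models which layer occupies which buffer slot, not stream timing.

```python
from typing import List, Optional, Tuple


def ring_schedule(
    num_layers: int, num_kv_buffers: int
) -> Tuple[List[Tuple[int, int]], List[Tuple[int, int, Optional[int]]]]:
    """Return (preload, steps) for a layerwise ring buffer.

    preload: (buffer_idx, layer_id) pairs loaded before decoding starts.
    steps:   (layer_id, buffer_idx, next_layer) per layer, where next_layer is
             the layer loaded into the same slot afterwards (None if none left).
    """
    # Preload one layer per buffer slot.
    preload = [(i, i) for i in range(min(num_kv_buffers, num_layers))]
    steps = []
    for layer_id in range(num_layers):
        buffer_idx = layer_id % num_kv_buffers
        # Once this layer's attention finishes, its slot is free for the
        # layer that sits num_kv_buffers ahead.
        next_layer = layer_id + num_kv_buffers
        steps.append(
            (layer_id, buffer_idx, next_layer if next_layer < num_layers else None)
        )
    return preload, steps


preload, steps = ring_schedule(num_layers=8, num_kv_buffers=4)
for step in steps:
    print(step)
```

With 4 buffers, buffer 0 serves layer 0 and is then refilled with layer 4, matching the `[Load L4]` entry in the diagram. The sketch also stops issuing loads once `layer_id + num_kv_buffers` runs past the last layer.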

**Memory overhead** (Qwen3-4B, 128K tokens):
- Single-layer K (or V) tensor: 128K × 8 × 128 × 2 bytes = 256 MB (512 MB for K and V together)
- 4-layer ring buffer: 4 × 256 MB = 1 GB per tensor (2 GB for K and V)
- Compared with all 28 layers on the GPU: 28 × 256 MB = 7 GB per tensor (14 GB for K and V)
- **Savings**: 7 GB - 1 GB = 6 GB per tensor (12 GB for K and V)

**Configuration passing**:
```
LLM(num_kv_buffers=4) → Config → OffloadEngine(num_kv_buffers=...)
```

### CPU Cache (unchanged)
```
k_cache_cpu: [num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]
v_cache_cpu: [num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]
```
Pinned memory for fast DMA transfers.

### Memory per Layer (Qwen3-4B)
- kv_heads = 8
- head_dim = 128
- dtype = bfloat16 (2 bytes)
- Per token KV: 8 * 128 * 2 * 2 = 4 KB
- 128K tokens: 512 MB per layer
- 28 layers: 14 GB total on CPU
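
The figures in this list can be reproduced with a few lines of arithmetic. This is a sketch: the helper name `kv_bytes_per_token` is made up for illustration, and sizes are in binary units (1 MB = 2^20 bytes), counting K and V together.

```python
def kv_bytes_per_token(kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    # The trailing factor 2 covers the K tensor plus the V tensor.
    return kv_heads * head_dim * dtype_bytes * 2


per_token = kv_bytes_per_token(kv_heads=8, head_dim=128, dtype_bytes=2)
tokens = 128 * 1024                            # 128K tokens
per_layer_mb = per_token * tokens / 2**20      # one layer, K and V
total_cpu_gb = per_layer_mb * 28 / 1024        # all 28 layers on the CPU
ring_gpu_gb = per_layer_mb * 4 / 1024          # 4 ring-buffer slots on the GPU

print(per_token, per_layer_mb, total_cpu_gb, ring_gpu_gb)
```

Under these assumptions a 4-slot ring buffer holds 2 GB of K and V on the GPU, against 14 GB if all 28 layers stayed resident.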
|
||||
|
||||
---
|
||||
|
||||
## Stream Synchronization Pattern
|
||||
|
||||
### Correct Pattern for Async Offload
|
||||
**Current (preload):**
|
||||
```python
|
||||
# In offload stream
|
||||
with torch.cuda.stream(offload_stream):
|
||||
offload_stream.wait_stream(compute_stream) # Wait for compute to finish
|
||||
cpu_tensor.copy_(gpu_tensor, non_blocking=True)
|
||||
event.record(offload_stream)
|
||||
|
||||
# Before reusing gpu_tensor
|
||||
compute_stream.wait_event(event) # Wait for offload to complete
|
||||
for i in range(num_preload):
|
||||
offload_engine.load_layer_kv_to_buffer(
|
||||
i, i, cpu_block_table, valid_tokens_per_block
|
||||
)
|
||||
```
|
||||
|
||||
### Correct Pattern for Async Load
|
||||
**After Integration:**
|
||||
```python
|
||||
# In load stream
|
||||
with torch.cuda.stream(load_stream):
|
||||
gpu_buffer.copy_(cpu_tensor, non_blocking=True)
|
||||
event.record(load_stream)
|
||||
|
||||
# Before using gpu_buffer
|
||||
compute_stream.wait_event(event) # Wait for load to complete
|
||||
for i in range(num_preload):
|
||||
layer_to_load = i
|
||||
if self.sparse_policy and self.sparse_policy.supports_offload_decode:
|
||||
# Prepare q for this layer (need to compute ahead)
|
||||
# OR: use previous layer's pattern as estimate
|
||||
selected_blocks = self.sparse_policy.select_offload_blocks(
|
||||
None, # q not available yet at preload
|
||||
layer_to_load,
|
||||
cpu_block_table,
|
||||
valid_tokens_per_block
|
||||
)
|
||||
else:
|
||||
selected_blocks = cpu_block_table
|
||||
offload_engine.load_sparse_layer_kv_to_buffer(
|
||||
i, layer_to_load, selected_blocks, valid_tokens_per_block
|
||||
)
|
||||
```
|
||||
|
||||
---
|
||||
**Challenge:** Q is not available during preload phase!
|
||||
|
||||
## Test Configuration
|
||||
**Solutions:**
|
||||
1. Skip sparse selection during preload; apply sparsity only to layers loaded after the preload phase
|
||||
2. Use previous decode step's pattern as estimate
|
||||
3. Add preload hook to sparse policy
|
||||
|
||||
**Needle test command**:
```bash
PYTHONPATH=/home/zijie/.claude-squad/worktrees/zijie/int-offload-1_188890c8699249f7:$PYTHONPATH \
python tests/test_needle.py \
    --model ~/models/Qwen3-4B-Instruct-2507/ \
    --max-model-len 32768 \
    --input-len 8192 \
    --enable-offload \
    --block-size 1024 \
    --num-gpu-blocks 2
```
|
||||
### 3. Offload Engine Extension
|
||||
|
||||
**New Method in OffloadEngine:**
|
||||
|
||||
```python
def load_sparse_layer_kv_to_buffer(
    self,
    buffer_idx: int,
    layer_id: int,
    selected_cpu_block_ids: List[int],
    original_valid_tokens: List[int],
) -> int:
    """
    Load only selected blocks from CPU to buffer.

    Returns:
        Total tokens loaded (may be less than full sequence)
    """
    stream = self.layer_load_streams[buffer_idx]

    with torch.cuda.stream(stream):
        stream.wait_event(self.buffer_compute_done_events[buffer_idx])

        # Build mapping: original block -> selected position
        offset = 0
        for i, cpu_block_id in enumerate(selected_cpu_block_ids):
            # Find original index to get valid tokens.
            # TODO: `i` indexes the selected list, not the original block table,
            # so this lookup is only correct when the selection keeps every
            # block at its original position (e.g. all blocks selected).
            valid_tokens = original_valid_tokens[i]  # Need mapping

            self.layer_k_cache[buffer_idx, offset:offset+valid_tokens].copy_(
                self.k_cache_cpu[layer_id, cpu_block_id, :valid_tokens],
                non_blocking=True
            )
            # ... v_cache same

            offset += valid_tokens

        self.buffer_load_events[buffer_idx].record(stream)

    return offset  # Caller needs to know actual loaded tokens
```
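
The "Need mapping" note in the method above can be resolved by looking up valid-token counts by CPU block ID instead of by position in the selected list. The following is a minimal sketch of that idea in plain Python. `build_copy_plan` and its `cpu_block_table` argument are hypothetical; the method as written does not receive the full block table, so the real signature would need to change or the mapping would have to be built by the caller.

```python
from typing import Dict, List, Tuple


def build_copy_plan(
    cpu_block_table: List[int],
    valid_tokens_per_block: List[int],
    selected_cpu_block_ids: List[int],
) -> Tuple[List[Tuple[int, int, int]], int]:
    """Return ([(cpu_block_id, dst_start, dst_end), ...], total_tokens)."""
    # Map each CPU block ID to its valid-token count in the original table.
    valid_by_block: Dict[int, int] = dict(
        zip(cpu_block_table, valid_tokens_per_block)
    )
    plan = []
    offset = 0
    for cpu_block_id in selected_cpu_block_ids:
        valid = valid_by_block[cpu_block_id]
        # Selected blocks are packed back to back in the destination buffer.
        plan.append((cpu_block_id, offset, offset + valid))
        offset += valid
    return plan, offset


# Four blocks, the last one partially filled; only blocks 10 and 13 are selected.
plan, total = build_copy_plan([10, 11, 12, 13], [1024, 1024, 1024, 300], [10, 13])
print(plan, total)
```

Each plan entry gives the destination slice for one block, and `total` is the value the load method would return so the caller knows how many tokens are actually in the buffer.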

**GPU mutex check before running**:
```bash
nvidia-smi --query-compute-apps=pid,name,used_memory --format=csv,noheader
```

## Metadata Flow for Quest

### During Prefill Offload

**Current:** No metadata collection in offload path

**Required:** Call `on_prefill_offload()` for each block

```python
# In run_layerwise_offload_prefill()
for i, cpu_block_id in enumerate(cpu_block_ids):
    start = i * block_size
    end = min(start + block_size, total_tokens)
    actual_size = end - start

    # BEFORE offload: update Quest metadata
    if self.sparse_policy and hasattr(self.sparse_policy, 'on_prefill_offload'):
        self.sparse_policy.on_prefill_offload(
            cpu_block_id, layer_id, k[start:end], actual_size
        )

    # Offload
    offload_engine.k_cache_cpu[layer_id, cpu_block_id, :actual_size].copy_(k[start:end])
    offload_engine.v_cache_cpu[layer_id, cpu_block_id, :actual_size].copy_(v[start:end])
```
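
What `on_prefill_offload()` records is not spelled out here. Assuming it stores the per-channel minimum and maximum of the keys in each block (the `key_min`/`key_max` metadata Quest relies on), the per-block computation could look like the sketch below. It uses plain Python lists for a single KV head; `block_key_bounds` is a hypothetical name, and a real implementation would operate on tensors.

```python
from typing import List, Tuple

Vector = List[float]


def block_key_bounds(
    keys: List[Vector], block_size: int
) -> List[Tuple[Vector, Vector]]:
    """Return one (key_min, key_max) pair per block of `block_size` tokens."""
    bounds = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]  # the last block may be partial
        # zip(*block) iterates over channels; reduce across the block's tokens.
        key_min = [min(channel) for channel in zip(*block)]
        key_max = [max(channel) for channel in zip(*block)]
        bounds.append((key_min, key_max))
    return bounds


# Three tokens with head_dim = 2, split into blocks of two tokens.
keys = [[1.0, -2.0], [3.0, 0.5], [-1.0, 4.0]]
bounds = block_key_bounds(keys, block_size=2)
print(bounds)
```

The partial last block is reduced over its valid tokens only, which corresponds to passing `k[start:end]` together with `actual_size` in the loop above.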

### Quest Metadata Shape

```python
# BlockMetadataManager
key_min: [num_blocks, num_layers, num_kv_heads, head_dim]  # Min key per block per layer
key_max: [num_blocks, num_layers, num_kv_heads, head_dim]  # Max key per block per layer
```

**Memory:** 2 * num_blocks * num_layers * kv_heads * head_dim * 2 bytes
- Example: 1000 blocks * 28 layers * 4 heads * 128 dim * 2 * 2 = ~57 MB
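
The metadata size formula is easy to evaluate for other configurations. A small sketch follows; the function name is hypothetical and the default of 2 bytes per element assumes bfloat16 or float16 metadata.

```python
def quest_metadata_bytes(
    num_blocks: int,
    num_layers: int,
    kv_heads: int,
    head_dim: int,
    dtype_bytes: int = 2,
) -> int:
    # Leading factor 2: one key_min tensor plus one key_max tensor.
    return 2 * num_blocks * num_layers * kv_heads * head_dim * dtype_bytes


example = quest_metadata_bytes(1000, 28, kv_heads=4, head_dim=128)
with_8_heads = quest_metadata_bytes(1000, 28, kv_heads=8, head_dim=128)
print(example / 1e6, with_8_heads / 1e6)  # sizes in decimal MB
```

The first value reproduces the ~57 MB example. With the `kv_heads = 8` listed for Qwen3-4B elsewhere in these notes, the same block count would need roughly twice that.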

## Performance Considerations

### MInference Prefill Overhead

| Operation | Time (64K seq) |
|-----------|----------------|
| Pattern estimation (last-64) | ~5ms |
| Triton sparse attention | ~80ms |
| Full FlashAttention | ~100ms |
| **Net Speedup** | ~15-20% |
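
As a quick check of the net figure, the sparse path costs pattern estimation plus sparse attention, compared against full FlashAttention. The timings below are the approximate values from the table, not new measurements.

```python
pattern_ms, sparse_ms, full_ms = 5.0, 80.0, 100.0

sparse_total_ms = pattern_ms + sparse_ms        # total cost of the sparse path
time_saved = 1 - sparse_total_ms / full_ms      # fraction of attention time saved
speedup = full_ms / sparse_total_ms             # throughput ratio

print(f"saved {time_saved:.0%}, speedup {speedup:.2f}x")
```

With these inputs the sparse path saves about 15% of attention time, or about 1.18x in throughput terms, which is the lower end of the range quoted in the table.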

### Quest Decode Overhead

| Operation | Time |
|-----------|------|
| Block scoring (GPU metadata) | ~0.1ms |
| Top-K selection | ~0.05ms |
| Sparse H2D load (8 blocks) | ~2ms |
| Full H2D load (100 blocks) | ~20ms |
| **Net Speedup** | ~10x H2D |
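
The block scoring and top-K rows can be illustrated with the upper-bound criterion Quest uses: for each channel, take the larger of `q * key_min` and `q * key_max`, then sum over channels. The sketch below is single-head, pure Python, and the function names are hypothetical; it is not the policy's actual `select_blocks` implementation.

```python
from typing import List, Tuple

Vector = List[float]


def block_upper_bound(q: Vector, key_min: Vector, key_max: Vector) -> float:
    # Per channel, the largest q_i * k_i possible given key_min_i <= k_i <= key_max_i.
    return sum(max(qi * lo, qi * hi) for qi, lo, hi in zip(q, key_min, key_max))


def select_top_k(
    q: Vector, bounds: List[Tuple[Vector, Vector]], block_ids: List[int], k: int
) -> List[int]:
    scores = [block_upper_bound(q, lo, hi) for lo, hi in bounds]
    ranked = sorted(range(len(block_ids)), key=lambda i: scores[i], reverse=True)[:k]
    # Return IDs in their original order so loaded tokens stay in sequence order.
    return [block_ids[i] for i in sorted(ranked)]


q = [1.0, -1.0]
bounds = [
    ([0.0, 0.0], [1.0, 1.0]),    # block 7: upper bound 1.0
    ([2.0, -3.0], [3.0, -2.0]),  # block 8: upper bound 6.0
    ([-1.0, 0.0], [0.0, 2.0]),   # block 9: upper bound 0.0
]
selected = select_top_k(q, bounds, block_ids=[7, 8, 9], k=2)
print(selected)
```

Only the selected block IDs then need an H2D copy, which is where the gap between the 8-block and 100-block load times in the table comes from.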

### Memory Trade-offs

| Mode | GPU Memory | CPU Memory | H2D Bandwidth |
|------|------------|------------|---------------|
| Full offload | Ring buffer | Full KV | High |
| Sparse offload | Ring buffer | Full KV | Low (subset) |
| Aggressive sparse | Ring buffer | Sparse KV | Very low |

## Edge Cases

### 1. Short Sequences (< sparse threshold)

```python
if total_tokens < sparse_threshold:
    # Fall back to full attention
    use_sparse = False
```

### 2. First Decode Step (no previous Q)

Quest can't score blocks without Q. Options:
- Use average embedding as proxy
- Load all blocks for first step
- Use prefill pattern as estimate

### 3. Variable Sequence Lengths in Batch

Layerwise offload currently only supports batch_size=1:
```python
assert len(seqs) == 1, "Layer-wise offload only supports single sequence"
```

Sparse integration should maintain this constraint.

### 4. Ring Buffer vs Sparse Load Mismatch

Ring buffer assumes fixed `total_prefill_tokens`:
```python
k_prefill, v_prefill = offload_engine.get_buffer_kv(buffer_idx, total_prefill_tokens)
```

Sparse load has variable token count. Need:
```python
# Track actual loaded tokens per buffer
loaded_tokens[buffer_idx] = sparse_load_count
k_prefill, v_prefill = offload_engine.get_buffer_kv(buffer_idx, loaded_tokens[buffer_idx])
```
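
One way to keep that bookkeeping in a single place is a small tracker owned by whoever issues the loads. This is a sketch under that assumption: `LoadedTokenTracker` is a hypothetical class, not part of `OffloadEngine`.

```python
from typing import Dict


class LoadedTokenTracker:
    """Remembers how many tokens the last load wrote into each ring-buffer slot."""

    def __init__(self, num_kv_buffers: int) -> None:
        self._loaded: Dict[int, int] = {i: 0 for i in range(num_kv_buffers)}

    def record_load(self, buffer_idx: int, num_tokens: int) -> None:
        # Call with the count returned by the sparse (or full) load.
        self._loaded[buffer_idx] = num_tokens

    def tokens_in(self, buffer_idx: int) -> int:
        # Use this value instead of a fixed total_prefill_tokens when slicing.
        return self._loaded[buffer_idx]


tracker = LoadedTokenTracker(num_kv_buffers=4)
tracker.record_load(buffer_idx=1, num_tokens=1324)
print(tracker.tokens_in(1), tracker.tokens_in(0))
```

Because each slot is overwritten when the next layer is loaded into it, the tracker only needs the most recent count per slot.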

## Testing Strategy

### Unit Tests

1. `test_sparse_policy_interface.py` - Verify new interface methods
2. `test_minference_offload.py` - MInference in offload mode
3. `test_quest_offload.py` - Quest block selection in offload mode

### Integration Tests

1. `test_offload_sparse_e2e.py` - Full prefill+decode with sparsity
2. `test_accuracy_comparison.py` - Compare outputs: full vs sparse

### Benchmarks

1. `bench_offload_sparse.py` - Compare:
   - Full offload (baseline)
   - MInference prefill + Quest decode
   - Aggressive sparse offload