# GPU-only Performance Issue: PagedAttention Scatter Overhead

## Problem Summary

GPU-only mode with MInference is **slower** than CPU offload mode for long-context single-sequence inference:

| Mode | Prefill Speed (32K tokens, Qwen3-4B) |
|------|--------------------------------------|
| GPU-only + MInference | 3383 tok/s |
| Offload + MInference | 5373 tok/s |

This counterintuitive result is caused by **unnecessary `store_kvcache` overhead** in the GPU-only path.

## Root Cause Analysis

### GPU-only Execution Path

```python
# attention.py line 86-110
def forward(self, q, k, v):
    # ALWAYS store to cache first - OVERHEAD HERE
    if k_cache.numel() and v_cache.numel():
        store_kvcache(k, v, k_cache, v_cache, context.slot_mapping)  # ← Always executed

    if context.is_prefill:
        if context.sparse_prefill_policy is not None:
            # MInference: uses k, v directly, NOT k_cache!
            o = sparse_prefill_attention(q, k, v, layer_id)
        else:
            # Full attention: also uses k, v directly
            o = flash_attn_varlen_func(q, k, v, ...)
```

**Key observation**: prefill attention **never reads from the cache** - it uses the computed k, v directly. Yet `store_kvcache` is always called before attention.

### The `store_kvcache` Overhead

```python
# attention.py line 8-59
def store_kvcache(key, value, k_cache, v_cache, slot_mapping):
    # 1. Filter invalid slots (conditional logic)
    valid_mask = slot_mapping >= 0
    valid_slots = slot_mapping[valid_mask]
    valid_keys = key[valid_mask]
    valid_values = value[valid_mask]

    # 2. Reshape for scatter operation
    k_cache_flat = k_cache.view(total_slots, D)
    v_cache_flat = v_cache.view(total_slots, D)
    valid_keys_flat = valid_keys.reshape(-1, D)
    valid_values_flat = valid_values.reshape(-1, D)

    # 3. Scatter write via index_copy_ - EXPENSIVE!
    k_cache_flat.index_copy_(0, valid_slots.long(), valid_keys_flat)
    v_cache_flat.index_copy_(0, valid_slots.long(), valid_values_flat)
```

This scatter operation runs for **every layer** (28 layers for Qwen3-4B) and writes **all tokens** (32K) to the GPU cache.
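The two write patterns can be contrasted with a minimal CPU-side sketch. This is a NumPy stand-in for the GPU tensors - the shapes, the slot pool size, and the invalid-slot spacing are all made up for illustration, not taken from the codebase - mimicking `store_kvcache`'s filter-then-scatter versus the offload path's contiguous copy:

```python
import numpy as np

# Illustrative shapes: 4K tokens, 8 KV heads x head_dim 128 flattened to D
num_tokens, D = 4096, 8 * 128
total_slots = num_tokens * 2  # paged pool is larger than one sequence

key = np.random.rand(num_tokens, D).astype(np.float32)
k_cache = np.zeros((total_slots, D), dtype=np.float32)

# Paged layout: each token lands in an arbitrary slot; -1 marks invalid slots
slot_mapping = np.random.permutation(total_slots)[:num_tokens]
slot_mapping[::512] = -1  # sprinkle a few invalid slots, as in the real path

# Scatter write, mirroring store_kvcache's mask filter + index_copy_
valid_mask = slot_mapping >= 0
k_cache[slot_mapping[valid_mask]] = key[valid_mask]

# Contiguous write, mirroring the offload path's simple copy
k_contig = np.empty_like(key)
k_contig[:] = key
```

On GPU the gap between the two is much larger than on CPU: the scattered rows land on non-adjacent addresses and defeat memory coalescing, while the contiguous copy is a single streaming transfer.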
### Offload Path (No Such Overhead)

```python
# model_runner.py - run_layerwise_offload_prefill
for layer_id in range(num_layers):
    # QKV projection + RoPE
    q, k = layer.self_attn.rotary_emb(positions, q, k)

    # Sparse attention - directly uses k, v
    attn_output = sparse_prefill_attention(q, k, v, layer_id)

    # Contiguous copy to CPU - no scatter!
    offload_engine.offload_layer_kv_sync(layer_id, k, v, cpu_block_ids, total_tokens)
```

## Memory Layout Comparison

| Aspect | GPU-only (PagedAttention) | Offload (Contiguous) |
|--------|---------------------------|----------------------|
| **Layout** | `[num_blocks, block_size, heads, dim]` | `[seq_len, heads, dim]` |
| **Write pattern** | Scatter via `index_copy_` | Contiguous `copy_()` |
| **Indirection** | `slot_mapping` lookup | None |
| **Memory efficiency** | High (shared block pool) | Low (reserved per sequence) |
| **Write performance** | Slow (memory-bound scatter) | Fast (simple DMA) |

### Why PagedAttention Uses Scatter

PagedAttention is designed for:

1. **Multi-sequence batching**: different sequences share one block pool
2. **Dynamic memory management**: no need to reserve `max_len` per sequence
3. **Prefix caching**: KV blocks shared across sequences

For **single-sequence long-context** inference, none of these benefits apply, so we pay the scatter overhead for nothing.

## Why `store_kvcache` Is Still Needed

Even though prefill attention doesn't read from the cache, **decode** does:

```python
# attention.py line 111-114
else:  # decode
    # Reads from cache!
    o = flash_attn_with_kvcache(q, k_cache, v_cache, block_table=...)
```

So `store_kvcache` during prefill prepares the KV cache for future decode steps.
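The indirection that decode depends on - and that forces prefill writes to be a scatter - can be sketched in plain Python. The block size and names here are illustrative (the real value comes from engine config); this is what `slot_mapping` precomputes for each token position:

```python
BLOCK_SIZE = 256  # illustrative; the real value comes from engine config

def slot_for(pos: int, block_table: list[int]) -> int:
    """Map a logical token position to a physical cache slot.

    Paged layout: the block table indirects each logical block to an
    arbitrary physical block, so consecutive positions can land in
    non-adjacent memory - which is why a bulk write becomes a scatter.
    """
    block = block_table[pos // BLOCK_SIZE]
    return block * BLOCK_SIZE + pos % BLOCK_SIZE

# A sequence whose logical blocks 0, 1, 2 live in physical blocks 7, 2, 9
block_table = [7, 2, 9]
assert slot_for(0, block_table) == 7 * BLOCK_SIZE        # start of block 7
assert slot_for(300, block_table) == 2 * BLOCK_SIZE + 44  # 300 = 1*256 + 44
```

In a contiguous layout the slot is just `pos` - no table lookup - so a prefill of N tokens collapses into one contiguous copy instead of N independently addressed rows.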
## Potential Optimizations

### Option 1: Async Store After Attention (Low Effort)

Move `store_kvcache` after the attention computation and make it asynchronous:

```python
def forward(self, q, k, v):
    if context.is_prefill:
        # Compute attention first
        if context.sparse_prefill_policy is not None:
            o = sparse_prefill_attention(q, k, v, layer_id)
        else:
            o = flash_attn_varlen_func(q, k, v, ...)

        # Then store async (overlaps with the next layer's QKV projection)
        if k_cache.numel():
            store_kvcache_async(k, v, k_cache, v_cache, slot_mapping)
    ...
```

**Expected benefit**: overlaps the store with compute, ~20-30% improvement.

### Option 2: Contiguous Layout for Single-Sequence Mode (Medium Effort)

Add a "contiguous mode" for single-sequence long-context inference:

```python
class ContiguousKVCache:
    """Simple contiguous KV cache for single-sequence mode."""

    def __init__(self, num_layers, max_seq_len, num_kv_heads, head_dim, dtype):
        self.k_cache = torch.zeros(num_layers, max_seq_len, num_kv_heads, head_dim, dtype=dtype)
        self.v_cache = torch.zeros(num_layers, max_seq_len, num_kv_heads, head_dim, dtype=dtype)

    def store(self, layer_id, k, v, start_pos):
        # Simple contiguous write - no scatter!
        seq_len = k.shape[0]
        self.k_cache[layer_id, start_pos:start_pos + seq_len] = k
        self.v_cache[layer_id, start_pos:start_pos + seq_len] = v
```

**Expected benefit**: matches or exceeds offload performance (~60% improvement).

### Option 3: Fused Store-Attention Kernel (High Effort)

Create a fused Triton kernel that:

1. Computes the QKV projection
2. Stores K, V to cache
3. Computes attention

This eliminates the memory round trips entirely.

**Expected benefit**: best possible performance, but high implementation complexity.

## Recommended Action

For **single-sequence long-context** workloads (the primary use case for MInference):

1. **Short term**: use offload mode - it's actually faster!
2. **Medium term**: implement Option 1 (async store) for a quick win
3. **Long term**: consider Option 2 (contiguous layout) for GPU-only mode

## Performance Measurement

To reproduce the benchmark:

```bash
# GPU-only + MInference
PYTHONPATH=/path/to/nano-vllm:$PYTHONPATH python tests/test_needle.py \
    --model ~/models/Qwen3-4B-Instruct-2507/ \
    --input-len 32768 \
    --enable-minference

# Offload + MInference
PYTHONPATH=/path/to/nano-vllm:$PYTHONPATH python tests/test_needle.py \
    --model ~/models/Qwen3-4B-Instruct-2507/ \
    --input-len 32768 \
    --enable-offload \
    --enable-minference
```

## Related Files

- `nanovllm/layers/attention.py`: `store_kvcache()` and `Attention.forward()`
- `nanovllm/engine/model_runner.py`: `run_layerwise_offload_prefill()`
- `nanovllm/kvcache/offload_engine.py`: `offload_layer_kv_sync()`

## References

- [PagedAttention Paper](https://arxiv.org/abs/2309.06180) - vLLM's memory management
- [MInference Paper](https://arxiv.org/abs/2407.02490) - Sparse prefill attention