`nano-vllm/docs/sparse_offload_integration.md`
# Sparse Policy Integration with Layerwise Offload

This document describes the architecture and design of integrating sparse attention policies (MInference, Quest) with the layerwise CPU offload execution path.

## Design Goals

1. **Extend sparse policies to the offload path**: the GPU-only path already supports sparse policies, but layerwise offload bypasses them
2. **Maintain encapsulation**: all `copy_()` operations stay inside `OffloadEngine` and are never exposed to `model_runner`
3. **Distinguish policy types**: some policies affect the attention computation (MInference), others affect the KV load strategy (Quest)
4. **Extensible architecture**: new sparse policies should be easy to add

## Key Insight

The existing sparse policy implementation works, but the layerwise offload path bypasses it:

| Path | Attention Method | Sparse Support |
|---|---|---|
| GPU-only | `attention.py` → `sparse_prefill_attention()` | YES |
| Layerwise offload | `model_runner.py` → `flash_attn_varlen_func()` | NO (direct call) |

## Two Types of Sparse Policies

The fundamental difference between sparse policies:

| Policy | Affects Attention Computation | Affects KV Load Strategy | `select_blocks()` Behavior |
|---|---|---|---|
| MInference | YES (`sparse_prefill_attention`) | NO | returns `available_blocks` (all) |
| Quest | NO | YES | returns a Top-K subset |

- **MInference**: only changes how attention is computed; does not affect the external load/offload flow
- **Quest**: selectively loads only some blocks, which affects the H2D transfer

## The `requires_block_selection` Interface Flag

To distinguish these policy types, we add a flag to the base class:

```python
# nanovllm/kvcache/sparse/policy.py
class SparsePolicy(ABC):
    # Existing flags
    supports_prefill: bool = True
    supports_decode: bool = True

    # NEW: whether this policy requires selective block loading.
    # If True: OffloadEngine calls select_blocks() before loading.
    # If False: OffloadEngine loads all blocks (select_blocks is ignored).
    requires_block_selection: bool = False
```

## Policy Implementations

```python
# MInference: prefill-only, no block selection
class MInferencePolicy(SparsePolicy):
    supports_prefill = True
    supports_decode = False
    requires_block_selection = False  # only affects attention computation

# Quest: decode-only, requires block selection
class QuestPolicy(SparsePolicy):
    supports_prefill = False
    supports_decode = True
    requires_block_selection = True  # affects the KV load strategy

# Full attention: baseline
class FullAttentionPolicy(SparsePolicy):
    supports_prefill = True
    supports_decode = True
    requires_block_selection = False  # load all blocks
```
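The dispatch implied by these flags can be shown with a minimal, runnable sketch. The class bodies and the `blocks_to_load` helper are simplified stand-ins for illustration, not the real nanovllm classes:

```python
from abc import ABC

class SparsePolicy(ABC):
    supports_prefill: bool = True
    supports_decode: bool = True
    requires_block_selection: bool = False

    def select_blocks(self, block_ids, ctx=None):
        # Default: keep everything (policies like MInference never override this)
        return block_ids

class QuestLike(SparsePolicy):
    supports_prefill = False
    supports_decode = True
    requires_block_selection = True

    def select_blocks(self, block_ids, ctx=None):
        # Toy stand-in for Quest's Top-K scoring: keep the last two blocks
        return block_ids[-2:]

def blocks_to_load(policy, block_ids, query):
    """Mirror of the engine's dispatch: consult select_blocks() only when
    the policy asks for it and a query is available."""
    if policy is not None and policy.requires_block_selection and query is not None:
        return policy.select_blocks(block_ids)
    return block_ids

print(blocks_to_load(SparsePolicy(), [0, 1, 2, 3], query="q"))  # [0, 1, 2, 3]
print(blocks_to_load(QuestLike(), [0, 1, 2, 3], query="q"))     # [2, 3]
```

Note that `QuestLike` with no query still falls back to a full load, which is exactly the preload situation described later for decode.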

## OffloadEngine Encapsulation

All KV cache operations are encapsulated in `OffloadEngine`; `model_runner` never accesses the internal storage directly.

### Prefill: Synchronous Offload with Hooks

```python
# nanovllm/kvcache/offload_engine.py
def offload_layer_kv_sync(
    self,
    layer_id: int,
    k: Tensor,
    v: Tensor,
    cpu_block_ids: List[int],
    total_tokens: int,
) -> None:
    """
    Synchronously offload a layer's KV to CPU.
    Calls sparse policy hooks internally.
    """
    for i, cpu_block_id in enumerate(cpu_block_ids):
        start = i * self.block_size
        end = min(start + self.block_size, total_tokens)
        actual_size = end - start

        # Hook: notify the sparse policy BEFORE offload (k is still on GPU)
        if self.sparse_policy is not None:
            self.sparse_policy.on_prefill_offload(
                cpu_block_id, layer_id, k[start:end], actual_size
            )

        # Synchronous copy to CPU (internal)
        self.k_cache_cpu[layer_id, cpu_block_id, :actual_size].copy_(k[start:end])
        self.v_cache_cpu[layer_id, cpu_block_id, :actual_size].copy_(v[start:end])
```
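The slicing logic above handles a ragged final block; the arithmetic can be isolated in a small sketch (the `block_slices` helper is illustrative, not part of the engine):

```python
def block_slices(total_tokens: int, block_size: int, num_blocks: int):
    """Compute the (start, end) token range written into each CPU block;
    only the last block may be partially filled."""
    slices = []
    for i in range(num_blocks):
        start = i * block_size
        end = min(start + block_size, total_tokens)
        slices.append((start, end))
    return slices

# 5 blocks of 16 tokens hold 70 tokens; the last block receives only 6.
print(block_slices(70, 16, 5))
# [(0, 16), (16, 32), (32, 48), (48, 64), (64, 70)]
```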

### Decode: Policy-Driven Block Loading

```python
def load_layer_kv_to_buffer_with_policy(
    self,
    buffer_idx: int,
    layer_id: int,
    cpu_block_ids: List[int],
    valid_tokens_per_block: List[int],
    query: Optional[Tensor] = None,
) -> int:
    """
    Load a layer's KV into the buffer, optionally using the sparse policy
    for block selection.

    Returns:
        Total tokens loaded
    """
    # Check whether the policy requires block selection
    if (self.sparse_policy is not None and
        self.sparse_policy.requires_block_selection and
        query is not None):
        # Build context
        ctx = PolicyContext(
            query_chunk_idx=0,
            num_query_chunks=1,
            layer_id=layer_id,
            query=query,
            is_prefill=False,
            block_size=self.block_size,
        )
        # Select blocks using the policy
        selected_blocks = self.sparse_policy.select_blocks(cpu_block_ids, ctx)

        # Build valid_tokens for the selected blocks
        block_to_valid = {bid: vt for bid, vt in zip(cpu_block_ids, valid_tokens_per_block)}
        selected_valid = [block_to_valid[bid] for bid in selected_blocks]

        return self._load_blocks_to_buffer(
            buffer_idx, layer_id, selected_blocks, selected_valid
        )
    else:
        # Load all blocks (no selection)
        return self._load_blocks_to_buffer(
            buffer_idx, layer_id, cpu_block_ids, valid_tokens_per_block
        )
```
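For reference, the block-scoring idea behind a Quest-style `select_blocks()` can be sketched in plain Python. The scoring rule (upper-bounding q·k per block via element-wise min/max key statistics) follows Quest's approach; `quest_score`, `select_topk_blocks`, and the `stats` layout are illustrative, not the actual nanovllm implementation:

```python
def quest_score(query, k_min, k_max):
    # Upper bound on q·k over any key in the block: per dimension keep
    # whichever of q*k_min, q*k_max is larger, then sum the dimensions.
    return sum(max(q * lo, q * hi) for q, lo, hi in zip(query, k_min, k_max))

def select_topk_blocks(block_ids, query, stats, topk):
    """stats maps block_id -> (k_min, k_max) per-dimension key bounds."""
    ranked = sorted(block_ids, key=lambda b: -quest_score(query, *stats[b]))
    keep = set(ranked[:topk])
    # Preserve the original block order so the loaded KV stays position-sorted
    return [b for b in block_ids if b in keep]

stats = {
    0: ([0.0, 0.0], [1.0, 1.0]),
    1: ([2.0, 2.0], [3.0, 3.0]),    # clearly the hottest block
    2: ([-1.0, -1.0], [0.0, 0.0]),  # clearly the coldest block
}
print(select_topk_blocks([0, 1, 2], query=[1.0, 1.0], stats=stats, topk=2))  # [0, 1]
```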

## Prefill Integration (MInference)

MInference only affects the attention computation, not the load/offload flow:

```python
# nanovllm/engine/model_runner.py - run_layerwise_offload_prefill()
def run_layerwise_offload_prefill(self, seqs):
    ...
    for layer_id in range(num_layers):
        # QKV projection + RoPE
        q, k = layer.self_attn.rotary_emb(positions, q, k)

        # Sparse or full attention
        if self.sparse_prefill_policy is not None:
            # MInference: only changes the attention computation
            attn_output = self.sparse_prefill_policy.sparse_prefill_attention(
                q, k, v, layer_id
            )
        else:
            # Full attention using FlashAttention
            attn_output = flash_attn_varlen_func(q, k, v, ...)

        # MLP
        ...

        # Offload ALL KV (MInference does not affect this)
        offload_engine.offload_layer_kv_sync(layer_id, k, v, cpu_block_ids, total_tokens)
```

## Execution Flow Diagram

```
┌─────────────────────────────────────────────────────────────────┐
│                    Layerwise Offload Prefill                     │
│                      with MInference                             │
└─────────────────────────────────────────────────────────────────┘

For each layer:
┌──────────────┐    ┌──────────────┐    ┌────────────────────────┐
│ QKV Proj     │───▶│ RoPE         │───▶│ sparse_prefill_attn()  │
│              │    │              │    │ (MInference pattern)   │
└──────────────┘    └──────────────┘    └───────────┬────────────┘
                                                    │
                    ┌──────────────┐    ┌───────────▼────────────┐
                    │ MLP          │◀───│ O Projection           │
                    │              │    │                        │
                    └──────┬───────┘    └────────────────────────┘
                           │
                    ┌──────▼───────┐
                    │ offload_     │    K, V still on GPU
                    │ layer_kv_    │───▶ Copy to CPU
                    │ sync()       │    (all blocks)
                    └──────────────┘
```

## Decode Integration (Quest - Infrastructure Ready)

Quest affects the block load strategy. The infrastructure is in place; full integration is deferred.

```python
# nanovllm/engine/model_runner.py - run_layerwise_offload_decode()
def run_layerwise_offload_decode(self, seqs):
    ...
    # Preload the first N layers (no query available yet, so load in full)
    for i in range(num_preload):
        loaded_tokens[i] = offload_engine.load_layer_kv_to_buffer(
            i, i, cpu_block_table, valid_tokens_per_block
        )

    for layer_id in range(num_layers):
        current_buffer = layer_id % num_buffers

        # Wait for the buffer load to finish
        offload_engine.wait_buffer_load(current_buffer)

        # QKV projection
        q, k_new, v_new = ...

        # Get the loaded KV from the ring buffer
        k_prefill, v_prefill = offload_engine.get_buffer_kv(
            current_buffer, loaded_tokens[current_buffer]
        )

        # Attention
        ...

        # Mark the buffer as done
        offload_engine.record_buffer_compute_done(current_buffer)

        # Kick off the next layer's load into the freed buffer
        # Future: use load_layer_kv_to_buffer_with_policy(query=q) for Quest
        next_layer = layer_id + num_buffers
        if next_layer < num_layers:
            loaded_tokens[current_buffer] = offload_engine.load_layer_kv_to_buffer(
                current_buffer, next_layer, cpu_block_table, valid_tokens_per_block
            )
```
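The buffer scheduling above is pure index arithmetic; a standalone sketch (with the hypothetical helper `ring_schedule`) replays which buffer each layer consumes and which layer is prefetched next:

```python
def ring_schedule(num_layers, num_buffers):
    """Replay the decode loop's buffer arithmetic: layer_id % num_buffers picks
    the buffer to consume; layer_id + num_buffers is the layer whose load is
    issued into the just-freed buffer (None once past the last layer)."""
    events = []
    for layer_id in range(num_layers):
        buf = layer_id % num_buffers
        next_layer = layer_id + num_buffers
        events.append((layer_id, buf, next_layer if next_layer < num_layers else None))
    return events

for layer, buf, prefetch in ring_schedule(num_layers=6, num_buffers=2):
    print(f"layer {layer}: consume buffer {buf}, prefetch layer {prefetch}")
```

With 2 buffers, even layers always land in buffer 0 and odd layers in buffer 1, and each layer's compute overlaps the load of the layer two steps ahead.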

## Quest Integration (Future Work)

When Quest is fully integrated:

```python
# Load the next layer with Quest block selection
if next_layer < num_layers:
    loaded_tokens[current_buffer] = offload_engine.load_layer_kv_to_buffer_with_policy(
        current_buffer, next_layer, cpu_block_table, valid_tokens_per_block,
        query=q  # pass the query for block selection
    )
```

Challenge: the first N layers are preloaded before any query is available, so they must use a full load.

## Configuration

### Enabling a Sparse Policy

```python
from nanovllm import LLM
from nanovllm.config import SparsePolicyType

# GPU-only with MInference
llm = LLM(
    model_path,
    sparse_policy=SparsePolicyType.MINFERENCE,
    minference_adaptive_budget=0.3,  # 30% of seq_len
)

# Offload with MInference
llm = LLM(
    model_path,
    enable_cpu_offload=True,
    num_gpu_blocks=2,
    sparse_policy=SparsePolicyType.MINFERENCE,
    minference_adaptive_budget=0.3,
)
```

### MInference Parameters

| Parameter | Default | Description |
|---|---|---|
| `minference_adaptive_budget` | 0.3 | Budget as a fraction of `seq_len` (0.3 = 30%) |
| `minference_vertical_size` | 1000 | Fixed vertical size (when budget is `None`) |
| `minference_slash_size` | 6096 | Fixed slash size (when budget is `None`) |
| `minference_num_sink_tokens` | 30 | Initial tokens that are always kept |
| `minference_num_recent_diags` | 100 | Recent diagonals that are always kept |
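Assuming the adaptive budget is applied as a plain fraction of the sequence length, the default setting on a 32K prompt works out to:

```python
seq_len = 32768
minference_adaptive_budget = 0.3  # default
budget_tokens = int(seq_len * minference_adaptive_budget)
print(budget_tokens)  # 9830
```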

### Quest Parameters (for future decode integration)

| Parameter | Default | Description |
|---|---|---|
| `sparse_topk_blocks` | 8 | Top-K blocks to load |
| `sparse_threshold_blocks` | 4 | Apply sparsity only when the block count exceeds this threshold |

## Sparse Policy Hooks

Sparse policies can implement hooks for metadata collection:

```python
class SparsePolicy(ABC):
    def on_prefill_offload(
        self,
        block_id: int,
        layer_id: int,
        key: torch.Tensor,
        valid_tokens: int,
    ) -> None:
        """
        Hook called during prefill offload BEFORE KV is copied to CPU.
        The key tensor is still on GPU, so metadata can be computed efficiently.

        Used by Quest to compute min/max key statistics for block selection.
        """
        pass

    def on_decode_offload(
        self,
        block_id: int,
        keys: torch.Tensor,  # [num_layers, block_size, kv_heads, head_dim]
    ) -> None:
        """
        Hook called when the decode buffer is offloaded to CPU.
        """
        pass
```
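A minimal sketch of what a Quest-style `on_prefill_offload` might store, using plain Python lists in place of GPU tensors (the `QuestStats` class and its layout are hypothetical, not the actual nanovllm code):

```python
class QuestStats:
    """Hypothetical metadata store: per (layer, block), the element-wise
    min/max of the block's keys, computed while the key data is still on GPU
    in the real flow."""
    def __init__(self):
        self.k_min, self.k_max = {}, {}

    def on_prefill_offload(self, block_id, layer_id, key, valid_tokens):
        # key: [tokens][head_dim] -- reduce over the token axis only
        rows = key[:valid_tokens]
        self.k_min[(layer_id, block_id)] = [min(col) for col in zip(*rows)]
        self.k_max[(layer_id, block_id)] = [max(col) for col in zip(*rows)]

stats = QuestStats()
stats.on_prefill_offload(0, 3, [[1.0, -2.0], [0.5, 4.0], [2.0, 0.0]], valid_tokens=3)
print(stats.k_min[(3, 0)], stats.k_max[(3, 0)])  # [0.5, -2.0] [2.0, 4.0]
```

These bounds are exactly what a Quest-style `select_blocks()` would later consult to score blocks against the decode query.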

## File Changes Summary

| File | Changes |
|---|---|
| `nanovllm/kvcache/sparse/policy.py` | Add the `requires_block_selection` attribute |
| `nanovllm/kvcache/sparse/minference.py` | Set `requires_block_selection = False` |
| `nanovllm/kvcache/sparse/quest.py` | Set `requires_block_selection = True` |
| `nanovllm/kvcache/sparse/full_policy.py` | Set `requires_block_selection = False` |
| `nanovllm/kvcache/offload_engine.py` | Add `offload_layer_kv_sync()` and sparse hooks |
| `nanovllm/engine/model_runner.py` | Integrate sparse policies into the offload paths |

## Key Design Principles

1. **Encapsulation**: all `copy_()` operations live inside `OffloadEngine`
2. **Interface flag**: `requires_block_selection` declares the policy type
3. **Separation of concerns**:
   - MInference: only `sparse_prefill_attention()` (compute-level)
   - Quest: `select_blocks()` + hooks (load-level)
4. **Hooks inside the engine**: policy hooks are called within `OffloadEngine` methods

## Test Results

Verified on Qwen3-4B-Instruct-2507 with a 32K-token input:

```
# GPU-only + MInference
test_needle.py --model Qwen3-4B --input-len 32768 --enable-minference
- Prefill: 3383 tok/s
- Output: "7492<|im_end|>"
- Result: PASSED

# Offload + MInference
test_needle.py --model Qwen3-4B --input-len 32768 --enable-offload --enable-minference
- Prefill: 5373 tok/s
- Output: "7492<|im_end|>"
- Result: PASSED
```

Both configurations produce identical outputs, confirming correctness.