Sparse Policy Integration with Layerwise Offload
This document describes the architecture and design of integrating sparse attention policies (MInference, Quest) with the layerwise CPU offload execution path.
Design Goals
- Extend sparse policies to offload path: GPU-only path already supports sparse policies, but layerwise offload bypasses them
- Maintain encapsulation: all `copy_()` operations must stay inside OffloadEngine, never exposed to model_runner
- Distinguish policy types: some policies affect attention computation (MInference), others affect the KV load strategy (Quest)
- Extensible architecture: Easy to add new sparse policies in the future
Key Insight
The existing sparse policy implementation works, but the layerwise offload path bypasses it:
| Path | Attention Method | Sparse Support |
|---|---|---|
| GPU-only | `attention.py` → `sparse_prefill_attention()` | YES |
| Layerwise offload | `model_runner.py` → `flash_attn_varlen_func()` | NO (direct call) |
Two Types of Sparse Policies
The fundamental difference between sparse policies:
| Policy | Affects Attention Computation | Affects KV Load Strategy | `select_blocks()` Behavior |
|---|---|---|---|
| MInference | YES (`sparse_prefill_attention`) | NO | Returns `available_blocks` (all) |
| Quest | NO | YES | Returns Top-K subset |
- MInference: Only changes how attention is computed, doesn't affect external load/offload flow
- Quest: Selectively loads only some blocks, affects H2D transfer
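To make the contrast concrete, here is a minimal pure-Python sketch (not the real nanovllm classes; block IDs and scores are invented for illustration) of the two `select_blocks()` behaviors:

```python
def minference_select(available_blocks):
    # MInference: sparsity happens inside the attention kernel,
    # so every block is still loaded.
    return available_blocks

def quest_select(available_blocks, scores, topk):
    # Quest: keep only the Top-K highest-scoring blocks,
    # preserving their original order.
    ranked = sorted(available_blocks, key=lambda b: scores[b], reverse=True)
    keep = set(ranked[:topk])
    return [b for b in available_blocks if b in keep]

blocks = [0, 1, 2, 3, 4]
scores = {0: 0.9, 1: 0.1, 2: 0.8, 3: 0.2, 4: 0.7}
print(minference_select(blocks))        # [0, 1, 2, 3, 4]
print(quest_select(blocks, scores, 3))  # [0, 2, 4]
```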
The requires_block_selection Interface Flag
To distinguish these policy types, we add a flag to the base class:
```python
# nanovllm/kvcache/sparse/policy.py
class SparsePolicy(ABC):
    # Existing flags
    supports_prefill: bool = True
    supports_decode: bool = True

    # NEW: whether this policy requires selective block loading.
    # If True: OffloadEngine will call select_blocks() before loading.
    # If False: OffloadEngine will load all blocks (select_blocks ignored).
    requires_block_selection: bool = False
```
Policy Implementations
```python
# MInference: prefill-only, no block selection
class MInferencePolicy(SparsePolicy):
    supports_prefill = True
    supports_decode = False
    requires_block_selection = False  # Only affects attention computation

# Quest: decode-only, requires block selection
class QuestPolicy(SparsePolicy):
    supports_prefill = False
    supports_decode = True
    requires_block_selection = True  # Affects KV load strategy

# Full attention: baseline
class FullAttentionPolicy(SparsePolicy):
    supports_prefill = True
    supports_decode = True
    requires_block_selection = False  # Load all blocks
```
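A quick sketch of how a caller might branch on these flags. The stand-in classes below mirror the attributes listed above, but `load_strategy` is a hypothetical helper for illustration, not part of nanovllm:

```python
# Stand-in classes mirroring the flag combinations in the document
class SparsePolicy:
    supports_prefill = True
    supports_decode = True
    requires_block_selection = False

class MInferencePolicy(SparsePolicy):
    supports_prefill = True
    supports_decode = False
    requires_block_selection = False

class QuestPolicy(SparsePolicy):
    supports_prefill = False
    supports_decode = True
    requires_block_selection = True

# The offload engine only needs the flag to pick a load path:
def load_strategy(policy):
    return "selective" if policy.requires_block_selection else "full"

print(load_strategy(MInferencePolicy()))  # full
print(load_strategy(QuestPolicy()))       # selective
```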
OffloadEngine Encapsulation
All KV cache operations are encapsulated in OffloadEngine. The model_runner never directly accesses internal storage.
Prefill: Synchronous Offload with Hooks
```python
# nanovllm/kvcache/offload_engine.py
def offload_layer_kv_sync(
    self,
    layer_id: int,
    k: Tensor,
    v: Tensor,
    cpu_block_ids: List[int],
    total_tokens: int,
) -> None:
    """
    Synchronously offload layer KV to CPU.
    Calls sparse policy hooks internally.
    """
    for i, cpu_block_id in enumerate(cpu_block_ids):
        start = i * self.block_size
        end = min(start + self.block_size, total_tokens)
        actual_size = end - start

        # Hook: notify sparse policy BEFORE offload (k still on GPU)
        if self.sparse_policy is not None:
            self.sparse_policy.on_prefill_offload(
                cpu_block_id, layer_id, k[start:end], actual_size
            )

        # Synchronous copy to CPU (internal)
        self.k_cache_cpu[layer_id, cpu_block_id, :actual_size].copy_(k[start:end])
        self.v_cache_cpu[layer_id, cpu_block_id, :actual_size].copy_(v[start:end])
```
Decode: Policy-Driven Block Loading
```python
def load_layer_kv_to_buffer_with_policy(
    self,
    buffer_idx: int,
    layer_id: int,
    cpu_block_ids: List[int],
    valid_tokens_per_block: List[int],
    query: Optional[Tensor] = None,
) -> int:
    """
    Load layer KV to buffer, optionally using the sparse policy
    for block selection.

    Returns:
        Total tokens loaded.
    """
    # Check whether the policy requires block selection
    if (self.sparse_policy is not None and
            self.sparse_policy.requires_block_selection and
            query is not None):
        # Build context
        ctx = PolicyContext(
            query_chunk_idx=0,
            num_query_chunks=1,
            layer_id=layer_id,
            query=query,
            is_prefill=False,
            block_size=self.block_size,
        )
        # Select blocks using the policy
        selected_blocks = self.sparse_policy.select_blocks(cpu_block_ids, ctx)
        # Build valid_tokens for the selected blocks
        block_to_valid = {bid: vt for bid, vt in zip(cpu_block_ids, valid_tokens_per_block)}
        selected_valid = [block_to_valid[bid] for bid in selected_blocks]
        return self._load_blocks_to_buffer(
            buffer_idx, layer_id, selected_blocks, selected_valid
        )
    else:
        # Load all blocks (no selection)
        return self._load_blocks_to_buffer(
            buffer_idx, layer_id, cpu_block_ids, valid_tokens_per_block
        )
```
Prefill Integration (MInference)
MInference only affects attention computation, not the load/offload flow:
```python
# nanovllm/engine/model_runner.py - run_layerwise_offload_prefill()
def run_layerwise_offload_prefill(self, seqs):
    ...
    for layer_id in range(num_layers):
        # QKV projection + RoPE
        q, k = layer.self_attn.rotary_emb(positions, q, k)

        # Sparse or full attention
        if self.sparse_prefill_policy is not None:
            # MInference: only changes the attention computation
            attn_output = self.sparse_prefill_policy.sparse_prefill_attention(
                q, k, v, layer_id
            )
        else:
            # Full attention using FlashAttention
            attn_output = flash_attn_varlen_func(q, k, v, ...)

        # MLP
        ...

        # Offload ALL KV (MInference doesn't affect this)
        offload_engine.offload_layer_kv_sync(layer_id, k, v, cpu_block_ids, total_tokens)
```
Execution Flow Diagram
```
┌─────────────────────────────────────────────────────────────────┐
│                   Layerwise Offload Prefill                     │
│                        with MInference                          │
└─────────────────────────────────────────────────────────────────┘

For each layer:

┌──────────────┐    ┌──────────────┐    ┌────────────────────────┐
│   QKV Proj   │───▶│     RoPE     │───▶│ sparse_prefill_attn()  │
│              │    │              │    │  (MInference pattern)  │
└──────────────┘    └──────────────┘    └───────────┬────────────┘
                                                    │
┌──────────────┐                        ┌───────────▼────────────┐
│     MLP      │◀───────────────────────│      O Projection      │
│              │                        │                        │
└──────┬───────┘                        └────────────────────────┘
       │
┌──────▼───────┐
│  offload_    │    K, V still on GPU
│  layer_kv_   │───▶ Copy to CPU
│  sync()      │    (all blocks)
└──────────────┘
```
Decode Integration (Quest - Infrastructure Ready)
Quest affects the block load strategy. The infrastructure is in place; full integration is deferred.
```python
# nanovllm/engine/model_runner.py - run_layerwise_offload_decode()
def run_layerwise_offload_decode(self, seqs):
    ...
    # Preload first N layers (no query available yet, so full load)
    for i in range(num_preload):
        loaded_tokens[i] = offload_engine.load_layer_kv_to_buffer(
            i, i, cpu_block_table, valid_tokens_per_block
        )

    for layer_id in range(num_layers):
        current_buffer = layer_id % num_buffers

        # Wait for the buffer load to finish
        offload_engine.wait_buffer_load(current_buffer)

        # QKV projection
        q, k_new, v_new = ...

        # Get loaded KV from the ring buffer
        k_prefill, v_prefill = offload_engine.get_buffer_kv(
            current_buffer, loaded_tokens[current_buffer]
        )

        # Attention
        ...

        # Mark buffer done
        offload_engine.record_buffer_compute_done(current_buffer)

        # Load the next layer
        # Future: use load_layer_kv_to_buffer_with_policy(query=q) for Quest
        next_layer = layer_id + num_buffers
        if next_layer < num_layers:
            loaded_tokens[current_buffer] = offload_engine.load_layer_kv_to_buffer(
                current_buffer, next_layer, cpu_block_table, valid_tokens_per_block
            )
```
Quest Integration (Future Work)
When Quest is fully integrated:
```python
# Load the next layer with Quest block selection
if next_layer < num_layers:
    loaded_tokens[current_buffer] = offload_engine.load_layer_kv_to_buffer_with_policy(
        current_buffer, next_layer, cpu_block_table, valid_tokens_per_block,
        query=q,  # Pass query for block selection
    )
```
Challenge: First N layers are preloaded before query is available, so they must use full load.
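The constraint can be made concrete with a small sketch (hypothetical layer counts; assumes the preload depth equals the number of ring buffers, as the decode loop above suggests):

```python
# With N ring buffers, layers 0..N-1 are loaded before the first query
# exists, so only layers >= N can use query-driven block selection.
num_layers = 8
num_buffers = 2  # also the preload depth in the decode loop

full_load_layers = list(range(num_buffers))                # preloaded, no query yet
policy_load_layers = list(range(num_buffers, num_layers))  # loaded in-loop, query available

print(full_load_layers)    # [0, 1]
print(policy_load_layers)  # [2, 3, 4, 5, 6, 7]
```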
Configuration
Enabling Sparse Policy
```python
from nanovllm import LLM
from nanovllm.config import SparsePolicyType

# GPU-only with MInference
llm = LLM(
    model_path,
    sparse_policy=SparsePolicyType.MINFERENCE,
    minference_adaptive_budget=0.3,  # 30% of seq_len
)

# Offload with MInference
llm = LLM(
    model_path,
    enable_cpu_offload=True,
    num_gpu_blocks=2,
    sparse_policy=SparsePolicyType.MINFERENCE,
    minference_adaptive_budget=0.3,
)
```
MInference Parameters
| Parameter | Default | Description |
|---|---|---|
| `minference_adaptive_budget` | 0.3 | Budget as fraction of seq_len (0.3 = 30%) |
| `minference_vertical_size` | 1000 | Fixed vertical size (when budget=None) |
| `minference_slash_size` | 6096 | Fixed slash size (when budget=None) |
| `minference_num_sink_tokens` | 30 | Always-kept initial tokens |
| `minference_num_recent_diags` | 100 | Always-kept recent diagonals |
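As a rough illustration of the adaptive budget, the conversion below is an assumption based on the parameter description above, not taken from the nanovllm source:

```python
def adaptive_budget_tokens(seq_len, budget_fraction=0.3):
    # A fraction of 0.3 means each query attends to roughly
    # 30% of the sequence instead of all of it.
    return int(seq_len * budget_fraction)

# For the 32K test input used later in this document:
print(adaptive_budget_tokens(32768))  # 9830
```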
Quest Parameters (for future decode integration)
| Parameter | Default | Description |
|---|---|---|
| `sparse_topk_blocks` | 8 | Top-K blocks to load |
| `sparse_threshold_blocks` | 4 | Apply sparse only when blocks > threshold |
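A sketch of how these two knobs likely interact, inferred from the descriptions above (the real policy ranks blocks by query-key scores; this sketch simply truncates to keep the example short):

```python
def quest_blocks_to_load(cpu_block_ids, topk=8, threshold=4):
    if len(cpu_block_ids) <= threshold:
        # Too few blocks for sparsity to pay off: load everything.
        return list(cpu_block_ids)
    # Otherwise Top-K selection would run here; for the sketch we
    # just keep the first topk blocks instead of scoring them.
    return list(cpu_block_ids)[:topk]

print(quest_blocks_to_load(list(range(3))))   # [0, 1, 2]  (below threshold)
print(quest_blocks_to_load(list(range(16))))  # first 8 blocks
```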
Sparse Policy Hooks
Sparse policies can implement hooks for metadata collection:
```python
class SparsePolicy(ABC):
    def on_prefill_offload(
        self,
        block_id: int,
        layer_id: int,
        key: torch.Tensor,
        valid_tokens: int,
    ) -> None:
        """
        Hook called during prefill offload BEFORE KV is copied to CPU.
        The key tensor is still on GPU, so metadata can be computed efficiently.
        Used by Quest to compute min/max key statistics for block selection.
        """
        pass

    def on_decode_offload(
        self,
        block_id: int,
        keys: torch.Tensor,  # [num_layers, block_size, kv_heads, head_dim]
    ) -> None:
        """
        Hook called when a decode buffer is offloaded to CPU.
        """
        pass
```
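The metadata flow these hooks enable can be sketched as follows. This is a pure-Python stand-in with a simplified hook signature, not the real Quest implementation: `on_prefill_offload()` records per-dimension min/max key statistics while the keys are cheap to read, and `select_blocks()` later uses them as an upper bound on query-key dot products:

```python
class QuestSketch:
    def __init__(self, topk):
        self.topk = topk
        self.meta = {}  # block_id -> (per-dim mins, per-dim maxs)

    def on_prefill_offload(self, block_id, keys):
        # keys: list of key vectors (plain float lists) in this block
        dims = len(keys[0])
        mins = [min(k[d] for k in keys) for d in range(dims)]
        maxs = [max(k[d] for k in keys) for d in range(dims)]
        self.meta[block_id] = (mins, maxs)

    def select_blocks(self, block_ids, query):
        # Upper bound on q.k for a block: per dimension, take whichever
        # of q_d*min_d and q_d*max_d is larger, then sum.
        def score(bid):
            mins, maxs = self.meta[bid]
            return sum(max(q * lo, q * hi) for q, lo, hi in zip(query, mins, maxs))
        ranked = sorted(block_ids, key=score, reverse=True)
        keep = set(ranked[:self.topk])
        return [b for b in block_ids if b in keep]

policy = QuestSketch(topk=1)
policy.on_prefill_offload(0, [[1.0, 0.0], [0.5, 0.2]])    # keys aligned with query
policy.on_prefill_offload(1, [[-1.0, 0.0], [-0.5, 0.1]])  # keys opposed to query
print(policy.select_blocks([0, 1], query=[1.0, 0.0]))     # [0]
```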
File Changes Summary
| File | Changes |
|---|---|
| `nanovllm/kvcache/sparse/policy.py` | Add `requires_block_selection` attribute |
| `nanovllm/kvcache/sparse/minference.py` | Set `requires_block_selection = False` |
| `nanovllm/kvcache/sparse/quest.py` | Set `requires_block_selection = True` |
| `nanovllm/kvcache/sparse/full_policy.py` | Set `requires_block_selection = False` |
| `nanovllm/kvcache/offload_engine.py` | Add `offload_layer_kv_sync()`, sparse hooks |
| `nanovllm/engine/model_runner.py` | Integrate sparse policies in offload paths |
Key Design Principles
- Encapsulation: all `copy_()` operations stay inside OffloadEngine
- Interface Flag: `requires_block_selection` declares the policy type
- Separation of Concerns:
  - MInference: only `sparse_prefill_attention()` (compute-level)
  - Quest: `select_blocks()` + hooks (load-level)
- Hooks Inside Engine: Policy hooks called within OffloadEngine methods
Test Results
Verified on Qwen3-4B-Instruct-2507 with 32K input:
```shell
# GPU-only + MInference
test_needle.py --model Qwen3-4B --input-len 32768 --enable-minference
```

- Prefill: 3383 tok/s
- Output: "7492<|im_end|>"
- Result: PASSED

```shell
# Offload + MInference
test_needle.py --model Qwen3-4B --input-len 32768 --enable-offload --enable-minference
```

- Prefill: 5373 tok/s
- Output: "7492<|im_end|>"
- Result: PASSED
Both configurations produce identical outputs, confirming correctness.
Related Documents
sparse_attention_guide.md: Algorithm details for sparse methodsarchitecture_guide.md: Overall system architecturegpu_only_performance_issue.md: Why offload is faster than GPU-only