# Sparse Policy Integration with Layerwise Offload

This document describes the architecture and design of integrating sparse attention policies (MInference, Quest) with the layerwise CPU offload execution path.

## Design Goals

1. **Extend sparse policies to the offload path**: the GPU-only path already supports sparse policies, but layerwise offload bypasses them
2. **Maintain encapsulation**: all `copy_()` operations stay inside `OffloadEngine` and are never exposed to `model_runner`
3. **Distinguish policy types**: some policies affect attention computation (MInference), while others affect the KV load strategy (Quest)
4. **Extensible architecture**: new sparse policies should be easy to add

## Key Insight

The existing sparse policy implementation works, but the layerwise offload path bypasses it:

| Path | Attention Method | Sparse Support |
|------|------------------|----------------|
| GPU-only | `attention.py` → `sparse_prefill_attention()` | YES |
| Layerwise offload | `model_runner.py` → `flash_attn_varlen_func()` | NO (direct call) |

## Two Types of Sparse Policies

The fundamental difference between the sparse policies:

| Policy | Affects Attention Computation | Affects KV Load Strategy | `select_blocks()` Behavior |
|--------|------------------------------|--------------------------|---------------------------|
| **MInference** | YES (`sparse_prefill_attention`) | NO | `return available_blocks` (all) |
| **Quest** | NO | YES | Returns a Top-K subset |

- **MInference**: only changes how attention is computed; it does not affect the external load/offload flow
- **Quest**: selectively loads only some blocks, which affects the H2D transfer

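The split can be seen in a minimal sketch of the two `select_blocks()` behaviors. The classes and the `block_scores` argument below are illustrative, not the actual nanovllm API (the real policies receive a `PolicyContext`):

```python
# Illustrative sketch only: real policies take a PolicyContext, not a score dict.
class MInferenceLikePolicy:
    requires_block_selection = False

    def select_blocks(self, available_blocks, ctx=None):
        # Load everything; sparsity is applied later, inside attention
        return available_blocks


class QuestLikePolicy:
    requires_block_selection = True

    def __init__(self, topk):
        self.topk = topk

    def select_blocks(self, available_blocks, block_scores, ctx=None):
        # Keep only the Top-K highest-scoring blocks (hypothetical scores)
        ranked = sorted(available_blocks, key=lambda b: block_scores[b], reverse=True)
        return ranked[:self.topk]
```

The MInference-style policy is a no-op at the load level, which is exactly why the offload engine can ignore it when deciding what to transfer.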
## The `requires_block_selection` Interface Flag

To distinguish these policy types, we add a flag to the base class:

```python
# nanovllm/kvcache/sparse/policy.py
class SparsePolicy(ABC):
    # Existing flags
    supports_prefill: bool = True
    supports_decode: bool = True

    # NEW: whether this policy requires selective block loading.
    # If True: OffloadEngine calls select_blocks() before loading.
    # If False: OffloadEngine loads all blocks (select_blocks is ignored).
    requires_block_selection: bool = False
```

### Policy Implementations

```python
# MInference: prefill-only, no block selection
class MInferencePolicy(SparsePolicy):
    supports_prefill = True
    supports_decode = False
    requires_block_selection = False  # Only affects attention computation

# Quest: decode-only, requires block selection
class QuestPolicy(SparsePolicy):
    supports_prefill = False
    supports_decode = True
    requires_block_selection = True  # Affects KV load strategy

# Full attention: baseline
class FullAttentionPolicy(SparsePolicy):
    supports_prefill = True
    supports_decode = True
    requires_block_selection = False  # Load all blocks
```
## OffloadEngine Encapsulation

All KV cache operations are encapsulated in `OffloadEngine`; `model_runner` never directly accesses the internal storage.

### Prefill: Synchronous Offload with Hooks

```python
# nanovllm/kvcache/offload_engine.py
def offload_layer_kv_sync(
    self,
    layer_id: int,
    k: Tensor,
    v: Tensor,
    cpu_block_ids: List[int],
    total_tokens: int,
) -> None:
    """
    Synchronously offload a layer's KV to CPU.
    Calls the sparse policy hooks internally.
    """
    for i, cpu_block_id in enumerate(cpu_block_ids):
        start = i * self.block_size
        end = min(start + self.block_size, total_tokens)
        actual_size = end - start

        # Hook: notify the sparse policy BEFORE the offload (k is still on GPU)
        if self.sparse_policy is not None:
            self.sparse_policy.on_prefill_offload(
                cpu_block_id, layer_id, k[start:end], actual_size
            )

        # Synchronous copy to CPU (internal)
        self.k_cache_cpu[layer_id, cpu_block_id, :actual_size].copy_(k[start:end])
        self.v_cache_cpu[layer_id, cpu_block_id, :actual_size].copy_(v[start:end])
```
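The loop above partitions the token range into fixed-size CPU blocks, with a possibly partial last block. A standalone sketch of that arithmetic (the helper name is ours, for illustration only):

```python
def block_ranges(block_size: int, total_tokens: int, num_blocks: int):
    """Token (start, end) range covered by each CPU block, mirroring the
    start/end/actual_size computation in offload_layer_kv_sync()."""
    ranges = []
    for i in range(num_blocks):
        start = i * block_size
        end = min(start + block_size, total_tokens)
        ranges.append((start, end))
    return ranges

# With block_size=4 and 10 tokens over 3 blocks, the last block holds 2 tokens:
# block_ranges(4, 10, 3) -> [(0, 4), (4, 8), (8, 10)]
```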

### Decode: Policy-Driven Block Loading

```python
def load_layer_kv_to_buffer_with_policy(
    self,
    buffer_idx: int,
    layer_id: int,
    cpu_block_ids: List[int],
    valid_tokens_per_block: List[int],
    query: Optional[Tensor] = None,
) -> int:
    """
    Load a layer's KV into the buffer, optionally using the sparse policy
    for block selection.

    Returns:
        Total tokens loaded
    """
    # Check whether the policy requires block selection
    if (self.sparse_policy is not None and
            self.sparse_policy.requires_block_selection and
            query is not None):
        # Build the policy context
        ctx = PolicyContext(
            query_chunk_idx=0,
            num_query_chunks=1,
            layer_id=layer_id,
            query=query,
            is_prefill=False,
            block_size=self.block_size,
        )
        # Select blocks using the policy
        selected_blocks = self.sparse_policy.select_blocks(cpu_block_ids, ctx)

        # Build valid_tokens for the selected blocks
        block_to_valid = {bid: vt for bid, vt in zip(cpu_block_ids, valid_tokens_per_block)}
        selected_valid = [block_to_valid[bid] for bid in selected_blocks]

        return self._load_blocks_to_buffer(
            buffer_idx, layer_id, selected_blocks, selected_valid
        )
    else:
        # Load all blocks (no selection)
        return self._load_blocks_to_buffer(
            buffer_idx, layer_id, cpu_block_ids, valid_tokens_per_block
        )
```

## Prefill Integration (MInference)

MInference only affects the attention computation, not the load/offload flow:

```python
# nanovllm/engine/model_runner.py - run_layerwise_offload_prefill()
def run_layerwise_offload_prefill(self, seqs):
    ...
    for layer_id in range(num_layers):
        # QKV projection + RoPE
        q, k = layer.self_attn.rotary_emb(positions, q, k)

        # Sparse or full attention
        if self.sparse_prefill_policy is not None:
            # MInference: only changes the attention computation
            attn_output = self.sparse_prefill_policy.sparse_prefill_attention(
                q, k, v, layer_id
            )
        else:
            # Full attention using FlashAttention
            attn_output = flash_attn_varlen_func(q, k, v, ...)

        # MLP
        ...

        # Offload ALL KV (MInference does not affect this)
        offload_engine.offload_layer_kv_sync(layer_id, k, v, cpu_block_ids, total_tokens)
```

### Execution Flow Diagram

```
┌─────────────────────────────────────────────────────────────────┐
│                    Layerwise Offload Prefill                    │
│                         with MInference                         │
└─────────────────────────────────────────────────────────────────┘

For each layer:
  ┌──────────────┐    ┌──────────────┐    ┌────────────────────────┐
  │   QKV Proj   │───▶│     RoPE     │───▶│ sparse_prefill_attn()  │
  │              │    │              │    │  (MInference pattern)  │
  └──────────────┘    └──────────────┘    └───────────┬────────────┘
                                                      │
  ┌──────────────┐    ┌────────────────────────┐      │
  │     MLP      │◀───│      O Projection      │◀─────┘
  │              │    │                        │
  └──────┬───────┘    └────────────────────────┘
         │
  ┌──────▼───────┐
  │  offload_    │     K, V still on GPU
  │  layer_kv_   │───▶ Copy to CPU
  │  sync()      │     (all blocks)
  └──────────────┘
```

## Decode Integration (Quest - Infrastructure Ready)

Quest affects the block load strategy. The infrastructure is in place; full integration is deferred.

```python
# nanovllm/engine/model_runner.py - run_layerwise_offload_decode()
def run_layerwise_offload_decode(self, seqs):
    ...
    # Preload the first N layers (no query available yet, so full load)
    for i in range(num_preload):
        loaded_tokens[i] = offload_engine.load_layer_kv_to_buffer(
            i, i, cpu_block_table, valid_tokens_per_block
        )

    for layer_id in range(num_layers):
        current_buffer = layer_id % num_buffers

        # Wait for the buffer load to complete
        offload_engine.wait_buffer_load(current_buffer)

        # QKV projection
        q, k_new, v_new = ...

        # Get the loaded KV from the ring buffer
        k_prefill, v_prefill = offload_engine.get_buffer_kv(
            current_buffer, loaded_tokens[current_buffer]
        )

        # Attention
        ...

        # Mark the buffer as done
        offload_engine.record_buffer_compute_done(current_buffer)

        # Load the next layer
        # Future: use load_layer_kv_to_buffer_with_policy(query=q) for Quest
        next_layer = layer_id + num_buffers
        if next_layer < num_layers:
            loaded_tokens[current_buffer] = offload_engine.load_layer_kv_to_buffer(
                current_buffer, next_layer, cpu_block_table, valid_tokens_per_block
            )
```

### Quest Integration (Future Work)

When Quest is fully integrated:

```python
# Load the next layer with Quest block selection
if next_layer < num_layers:
    loaded_tokens[current_buffer] = offload_engine.load_layer_kv_to_buffer_with_policy(
        current_buffer, next_layer, cpu_block_table, valid_tokens_per_block,
        query=q,  # Pass the query for block selection
    )
```

**Challenge**: The first N layers are preloaded before the query is available, so they must use a full load.

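One way to keep the two phases consistent is a small guard that enables selection only once a decode query exists and the layer is past the preload window. The helper below is hypothetical, not part of the current code:

```python
def can_select_blocks(layer_id: int, num_preload: int,
                      policy_requires_selection: bool, query) -> bool:
    """Hypothetical guard: block selection applies only when the policy asks
    for it, a decode query is available, and the layer was not preloaded."""
    return policy_requires_selection and query is not None and layer_id >= num_preload
```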
## Configuration

### Enabling Sparse Policy

```python
from nanovllm import LLM
from nanovllm.config import SparsePolicyType

# GPU-only with MInference
llm = LLM(
    model_path,
    sparse_policy=SparsePolicyType.MINFERENCE,
    minference_adaptive_budget=0.3,  # 30% of seq_len
)

# Offload with MInference
llm = LLM(
    model_path,
    enable_cpu_offload=True,
    num_gpu_blocks=2,
    sparse_policy=SparsePolicyType.MINFERENCE,
    minference_adaptive_budget=0.3,
)
```

### MInference Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `minference_adaptive_budget` | 0.3 | Budget as a fraction of seq_len (0.3 = 30%) |
| `minference_vertical_size` | 1000 | Fixed vertical size (when budget=None) |
| `minference_slash_size` | 6096 | Fixed slash size (when budget=None) |
| `minference_num_sink_tokens` | 30 | Always-kept initial tokens |
| `minference_num_recent_diags` | 100 | Always-kept recent diagonals |

### Quest Parameters (for future decode integration)

| Parameter | Default | Description |
|-----------|---------|-------------|
| `sparse_topk_blocks` | 8 | Top-K blocks to load |
| `sparse_threshold_blocks` | 4 | Apply sparse selection only when the block count exceeds the threshold |

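Under the usual Quest assumptions (per-block, channel-wise key min/max statistics gathered at offload time), Top-K selection could look roughly like the sketch below. Names and shapes are illustrative, not the actual nanovllm API:

```python
import torch

def quest_select_blocks(query, block_key_min, block_key_max, topk):
    """Quest-style Top-K block selection (sketch, not the real implementation).

    query:          [num_heads, head_dim] decode query
    block_key_min:  [num_blocks, num_heads, head_dim] per-block key minima
    block_key_max:  [num_blocks, num_heads, head_dim] per-block key maxima
    """
    # Per-channel upper bound on q·k: whichever extreme maximizes q_d * k_d
    ub = torch.maximum(query * block_key_min, query * block_key_max)
    # Sum over head_dim, then take the max bound over heads as the block score
    scores = ub.sum(dim=-1).max(dim=-1).values  # [num_blocks]
    k = min(topk, scores.numel())
    return torch.topk(scores, k).indices.tolist()
```

The idea is that the bound over-estimates each block's attention contribution, so the Top-K blocks by bound are a safe superset of the truly important ones.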
## Sparse Policy Hooks

Sparse policies can implement hooks for metadata collection:

```python
class SparsePolicy(ABC):
    def on_prefill_offload(
        self,
        block_id: int,
        layer_id: int,
        key: torch.Tensor,
        valid_tokens: int,
    ) -> None:
        """
        Hook called during prefill offload, BEFORE the KV is copied to CPU.
        The key tensor is still on the GPU, so metadata can be computed efficiently.

        Used by Quest to compute min/max key statistics for block selection.
        """
        pass

    def on_decode_offload(
        self,
        block_id: int,
        keys: torch.Tensor,  # [num_layers, block_size, kv_heads, head_dim]
    ) -> None:
        """
        Hook called when the decode buffer is offloaded to CPU.
        """
        pass
```

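As a concrete example of what a hook might do, a Quest-style policy could use `on_prefill_offload()` to record per-block key statistics while the key tensor is still on the GPU. The class and attribute names below are illustrative, not the actual nanovllm implementation:

```python
import torch

class QuestKeyStats:
    """Illustrative metadata store for a Quest-style policy.

    on_prefill_offload() computes per-block, per-channel key min/max while the
    key tensor is still on the GPU, then keeps only the small statistics on CPU.
    """

    def __init__(self):
        self.key_min = {}  # (layer_id, block_id) -> [kv_heads, head_dim]
        self.key_max = {}

    def on_prefill_offload(self, block_id, layer_id, key, valid_tokens):
        # key: [block_size, kv_heads, head_dim]; only the first valid_tokens count
        k = key[:valid_tokens]
        self.key_min[(layer_id, block_id)] = k.min(dim=0).values.cpu()
        self.key_max[(layer_id, block_id)] = k.max(dim=0).values.cpu()
```

Because the reduction runs on the GPU before the `copy_()`, the hook adds no extra H2D/D2H traffic beyond the tiny statistics tensors.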
## File Changes Summary

| File | Changes |
|------|---------|
| `nanovllm/kvcache/sparse/policy.py` | Add the `requires_block_selection` attribute |
| `nanovllm/kvcache/sparse/minference.py` | Set `requires_block_selection = False` |
| `nanovllm/kvcache/sparse/quest.py` | Set `requires_block_selection = True` |
| `nanovllm/kvcache/sparse/full_policy.py` | Set `requires_block_selection = False` |
| `nanovllm/kvcache/offload_engine.py` | Add `offload_layer_kv_sync()` and the sparse policy hooks |
| `nanovllm/engine/model_runner.py` | Integrate sparse policies in the offload paths |

## Key Design Principles

1. **Encapsulation**: all `copy_()` operations live inside `OffloadEngine`
2. **Interface flag**: `requires_block_selection` declares the policy type
3. **Separation of concerns**:
   - MInference: only `sparse_prefill_attention()` (compute-level)
   - Quest: `select_blocks()` + hooks (load-level)
4. **Hooks inside the engine**: policy hooks are called within `OffloadEngine` methods

## Test Results

Verified on Qwen3-4B-Instruct-2507 with a 32K-token input:

```
# GPU-only + MInference
test_needle.py --model Qwen3-4B --input-len 32768 --enable-minference
- Prefill: 3383 tok/s
- Output: "7492<|im_end|>"
- Result: PASSED

# Offload + MInference
test_needle.py --model Qwen3-4B --input-len 32768 --enable-offload --enable-minference
- Prefill: 5373 tok/s
- Output: "7492<|im_end|>"
- Result: PASSED
```

Both configurations produce identical outputs, confirming correctness.

## Related Documents

- [`sparse_attention_guide.md`](sparse_attention_guide.md): algorithm details for the sparse methods
- [`architecture_guide.md`](architecture_guide.md): overall system architecture
- [`gpu_only_performance_issue.md`](gpu_only_performance_issue.md): why offload is faster than GPU-only