# Architecture Guide
This document describes the core architecture and layer-wise CPU offload system of nano-vLLM.
## Core Components
| Component | File | Purpose |
|-----------|------|---------|
| **LLMEngine** | `llm_engine.py` | Main entry, runs prefill-decode loop |
| **ModelRunner** | `model_runner.py` | Loads weights, allocates KV cache, CUDA graphs, layer-wise offload |
| **Scheduler** | `scheduler.py` | Two-phase scheduling (prefill → decode) |
| **BlockManager** | `block_manager.py` | Paged attention with prefix caching (xxhash), default block size 4096 |
| **Attention** | `layers/attention.py` | FlashAttention for standard inference |
## Layer-wise CPU Offload System
### Design Philosophy
Unlike chunked prefill (which processes chunks across all layers), **layer-wise offload** processes the entire sequence through one layer at a time:
```
Layer 0: [full sequence] → compute → offload K,V to CPU
Layer 1: [full sequence] → compute → offload K,V to CPU
...
Layer N: [full sequence] → compute → offload K,V to CPU
```
**Benefits**:
- Supports MInference sparse attention (requires full KV access per layer)
- Simpler memory management (one layer's KV in GPU at a time)
- Peak GPU memory = one layer's KV cache + attention workspace
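To make the loop-order contrast concrete, here is a toy sketch of the two strategies (pure Python with a hypothetical `forward_layer` stand-in, not actual nano-vLLM code):
```python
# Toy illustration of the two loop orders (hypothetical helpers)
num_layers, chunk_size = 4, 2
sequence = list(range(6))  # stand-in for per-token hidden states

def forward_layer(layer_id, tokens):
    """Stand-in for one transformer layer: returns (hidden, K, V)."""
    return tokens, f"K{layer_id}", f"V{layer_id}"

# Chunked prefill: chunks outer, layers inner -> KV for ALL layers stays on GPU
for start in range(0, len(sequence), chunk_size):
    chunk = sequence[start:start + chunk_size]
    for layer_id in range(num_layers):
        chunk, k, v = forward_layer(layer_id, chunk)  # KV accumulates per layer

# Layer-wise offload: layers outer, full sequence inner -> one layer's KV at a time
hidden = sequence
for layer_id in range(num_layers):
    hidden, k, v = forward_layer(layer_id, hidden)
    print(f"offload layer {layer_id} KV to CPU")  # GPU then holds only the next layer's KV
```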
### Key Files
| File | Purpose |
|------|---------|
| `nanovllm/engine/model_runner.py` | Main implementation (`run_layerwise_offload_prefill`, `run_layerwise_offload_decode`) |
| `nanovllm/kvcache/hybrid_manager.py` | CPU block management helpers |
| `nanovllm/kvcache/offload_engine.py` | CPU/GPU cache storage, ring buffer, async transfers |
### Memory Layout
**CPU Cache** (pinned memory):
```python
k_cache_cpu: [num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]
v_cache_cpu: [num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]
```
**GPU Ring Buffer** (for decode H2D pipeline):
```python
layer_k_cache: [num_kv_buffers, max_seq_len, kv_heads, head_dim]
layer_v_cache: [num_kv_buffers, max_seq_len, kv_heads, head_dim]
```
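For illustration, such caches could be allocated as below (a minimal PyTorch sketch with assumed sizes; the real allocation lives in `offload_engine.py` and derives sizes from the model config and memory budget):
```python
import torch

# Illustrative sizes: 36 layers (Qwen3-4B), 128K-token CPU capacity, 2 GPU buffers
num_layers, num_cpu_blocks, block_size = 36, 32, 4096
num_kv_buffers, max_seq_len = 2, 131072
kv_heads, head_dim, dtype = 8, 128, torch.bfloat16

cpu_shape = (num_layers, num_cpu_blocks, block_size, kv_heads, head_dim)
# Pinned (page-locked) host memory enables async DMA transfers to/from the GPU
k_cache_cpu = torch.empty(cpu_shape, dtype=dtype, pin_memory=True)
v_cache_cpu = torch.empty(cpu_shape, dtype=dtype, pin_memory=True)

gpu_shape = (num_kv_buffers, max_seq_len, kv_heads, head_dim)
# GPU ring buffer: one slot per in-flight layer during decode
layer_k_cache = torch.empty(gpu_shape, dtype=dtype, device="cuda")
layer_v_cache = torch.empty(gpu_shape, dtype=dtype, device="cuda")
```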
**Per-layer KV size** (Qwen3-4B: 8 KV heads × 128 head_dim × 2 bytes for bf16/fp16 × 2 for K+V = 4 KB/token):
| Context Length | KV per Layer |
|----------------|--------------|
| 128K tokens | 512 MB |
| 256K tokens | 1 GB |
| 512K tokens | 2 GB |
| 1M tokens | 4 GB |
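The table follows directly from the per-token figure; a quick sanity check in Python:
```python
kv_heads, head_dim, bytes_per_elem = 8, 128, 2           # Qwen3-4B, bf16
kv_per_token = kv_heads * head_dim * bytes_per_elem * 2  # x2 for K and V
assert kv_per_token == 4096                              # 4 KB per token per layer

for tokens in (128 * 1024, 256 * 1024, 512 * 1024, 1024 * 1024):
    print(f"{tokens:>8} tokens -> {tokens * kv_per_token / 2**30:.1f} GiB per layer")
# 131072 tokens -> 0.5 GiB ... 1048576 tokens -> 4.0 GiB
```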
---
## Prefill Flow
```python
def run_layerwise_offload_prefill(self, seqs: list[Sequence]) -> list[int]:
    # 1. Embedding
    hidden_states = self.model.model.embed_tokens(input_ids)
    # 2. Process the full sequence through one layer at a time
    for layer_id in range(num_layers):
        # QKV projection + norms + RoPE
        q = apply_rotary_pos_emb(q_proj(hidden_states), cos, sin)
        k = apply_rotary_pos_emb(k_proj(hidden_states), cos, sin)
        v = v_proj(hidden_states)
        # Full FlashAttention over the entire sequence
        attn_out = flash_attn_varlen_func(q, k, v, cu_seqlens, max_seqlen, causal=True)
        # MLP (simplified: o_proj, norms, and the second residual are elided)
        hidden_states = mlp(attn_out + residual)
        # Synchronous offload to CPU (CRITICAL: must be sync to avoid memory reuse bugs)
        self._offload_layer_kv_to_cpu_sync(layer_id, k, v, cpu_block_ids, total_tokens)
    # 3. Final norm + sampling
    return sampled_tokens
```
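`_offload_layer_kv_to_cpu_sync` is not reproduced here; a hypothetical sketch of what such a helper does, matching the CPU layout above (block-by-block blocking D2H copies), might look like:
```python
def offload_layer_kv_to_cpu_sync(k_cache_cpu, v_cache_cpu, layer_id,
                                 k, v, cpu_block_ids, total_tokens,
                                 block_size=4096):
    """Copy one layer's K/V (GPU, [total_tokens, kv_heads, head_dim]) into the
    sequence's CPU blocks. Hypothetical sketch, not nano-vLLM's actual code."""
    for i, block_id in enumerate(cpu_block_ids):
        start = i * block_size
        size = min(block_size, total_tokens - start)
        if size <= 0:
            break
        # Blocking D2H copies: return only once the data is on the CPU, so
        # PyTorch cannot reuse k/v GPU memory while a copy is still in flight
        k_cache_cpu[layer_id, block_id, :size].copy_(k[start:start + size])
        v_cache_cpu[layer_id, block_id, :size].copy_(v[start:start + size])
```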
---
## Decode Flow
```python
def run_layerwise_offload_decode(self, seqs: list[Sequence]) -> list[int]:
    # Ring buffer pipeline: preload the first num_buffers layers
    for i in range(num_buffers):
        offload_engine.load_layer_kv_to_buffer(i, i, cpu_block_table, valid_tokens)
    # For each layer:
    for layer_id in range(num_layers):
        current_buffer = layer_id % num_buffers
        # 1. Wait for this buffer's H2D load to complete
        offload_engine.wait_buffer_load(current_buffer)
        # 2. Get prefilled KV from the ring buffer
        k_prefill, v_prefill = offload_engine.get_buffer_kv(current_buffer, total_prefill_tokens)
        # 3. Compute new Q, K, V for the current token
        q_new = apply_rotary_pos_emb(q_proj(hidden_states), cos, sin)
        k_new = apply_rotary_pos_emb(k_proj(hidden_states), cos, sin)
        v_new = v_proj(hidden_states)
        # 4. Concatenate and compute attention
        k_full = torch.cat([k_prefill, k_new], dim=0)
        v_full = torch.cat([v_prefill, v_new], dim=0)
        attn_out = flash_attn_varlen_func(q_new, k_full, v_full, ..., causal=False)
        # Note: causal=False because the single query token must attend to ALL keys
        # 5. Store the new KV, mark the buffer done, THEN start the next load
        offload_engine.store_decode_kv(layer_id, pos, k_new, v_new)
        offload_engine.record_buffer_compute_done(current_buffer)
        if layer_id + num_buffers < num_layers:
            offload_engine.load_layer_kv_to_buffer(current_buffer, layer_id + num_buffers, ...)
```
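The ring buffer exists so that the H2D load of layer `i + num_buffers` overlaps the attention compute of layer `i`. A rough feasibility estimate (assuming ~25 GB/s effective H2D bandwidth, roughly PCIe 4.0 x16; actual numbers vary by platform):
```python
tokens, kv_per_token = 128 * 1024, 4096       # 128K context, 4 KB/token/layer
bandwidth = 25e9                              # assumed ~PCIe 4.0 x16, bytes/s
load_ms = tokens * kv_per_token / bandwidth * 1e3
print(f"H2D load per layer: {load_ms:.1f} ms")  # ~21 ms
# With num_buffers >= 2, this transfer hides behind the previous layer's
# attention compute instead of serializing with it.
```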
---
## Critical Implementation Details
### 1. Synchronous Offload Required
Offloading with `non_blocking=True` and no synchronization causes memory reuse bugs, because PyTorch's caching allocator can hand the `k`/`v` GPU memory to another tensor before the async D2H copy has actually executed:
```python
# BUG: PyTorch may reuse k, v GPU memory before the async copy completes
offload_engine.k_cache_cpu[layer_id, block_id, :size].copy_(k[start:end], non_blocking=True)

# CORRECT: synchronous copy blocks until the data is safely in CPU memory
offload_engine.k_cache_cpu[layer_id, block_id, :size].copy_(k[start:end])  # sync
```
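Async offload can be made safe, but only with explicit stream/event synchronization before the GPU memory is reused; a sketch of that alternative using standard PyTorch APIs (illustrative, not what nano-vLLM does):
```python
import torch

copy_stream = torch.cuda.Stream()

def offload_async(dst_cpu_pinned, src_gpu):
    """Illustrative async D2H offload; dst must be pinned CPU memory."""
    copy_stream.wait_stream(torch.cuda.current_stream())  # order after compute
    with torch.cuda.stream(copy_stream):
        dst_cpu_pinned.copy_(src_gpu, non_blocking=True)
    # Tell the caching allocator that src is still in use on copy_stream, so
    # its memory is not handed out again before the copy has run
    src_gpu.record_stream(copy_stream)
    event = torch.cuda.Event()
    event.record(copy_stream)
    return event  # caller must event.synchronize() before reading the CPU data
```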
### 2. Decode Attention: causal=False
During decode, the single query token is the last position in the sequence, so it must attend to ALL keys (every prefill token plus itself):
```python
# Prefill: causal=True (each token only attends to previous tokens)
attn_out = flash_attn_varlen_func(..., causal=True)
# Decode: causal=False (the single query attends to every earlier key plus itself)
attn_out = flash_attn_varlen_func(..., causal=False)
```
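A toy mask comparison illustrates the difference (plain PyTorch, building the masks by hand; `True` means "may attend"):
```python
import torch

seqlen_q, seqlen_k = 1, 5  # one decode query over 4 prefill keys + itself

# A top-left-aligned causal mask lets row i attend only to columns <= i.
# For a single query row, that keeps only key 0 -- wrong for decode.
causal = torch.tril(torch.ones(seqlen_q, seqlen_k, dtype=torch.bool))
print(causal)  # tensor([[ True, False, False, False, False]])

# causal=False: the query sees every key, which is exactly what decode
# needs, since all keys precede (or are) the current token anyway.
full = torch.ones(seqlen_q, seqlen_k, dtype=torch.bool)
print(full)    # tensor([[True, True, True, True, True]])
```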
### 3. Ring Buffer Synchronization
The ring buffer pipeline requires careful ordering:
```python
# CORRECT order:
offload_engine.store_decode_kv(layer_id, pos, k_new, v_new) # Store new KV
offload_engine.record_buffer_compute_done(current_buffer) # Mark done FIRST
offload_engine.load_layer_kv_to_buffer(...) # THEN start next load
# BUG: Starting load before marking done causes race condition
offload_engine.load_layer_kv_to_buffer(...) # WRONG: buffer still in use!
offload_engine.record_buffer_compute_done(current_buffer)
```
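One way to implement this ordering is one CUDA event per buffer; the sketch below is a hypothetical class built on standard `torch.cuda` primitives, not nano-vLLM's actual `offload_engine`:
```python
import torch

class RingBufferSync:
    """Illustrative per-buffer event tracking; hypothetical implementation."""

    def __init__(self, num_buffers):
        self.load_stream = torch.cuda.Stream()
        self.compute_done = [torch.cuda.Event() for _ in range(num_buffers)]
        self.load_done = [torch.cuda.Event() for _ in range(num_buffers)]

    def record_compute_done(self, buf):
        # Mark, on the compute stream, the last kernel that read this buffer
        self.compute_done[buf].record(torch.cuda.current_stream())

    def load_to_buffer(self, buf, copy_fn):
        with torch.cuda.stream(self.load_stream):
            # The H2D copy must not start before the previous compute on this
            # buffer finished -- which is why record comes BEFORE load
            self.load_stream.wait_event(self.compute_done[buf])
            copy_fn()  # e.g. buffer.copy_(cpu_blocks, non_blocking=True)
            self.load_done[buf].record(self.load_stream)

    def wait_buffer_load(self, buf):
        # Compute stream blocks until the H2D copy into this buffer completes
        torch.cuda.current_stream().wait_event(self.load_done[buf])
```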
---
## Helper Methods in HybridKVCacheManager
```python
# Get all CPU blocks for a sequence
cpu_blocks = manager.get_all_cpu_blocks(seq) # List[int]
# Get only prefilled (offloaded) CPU blocks
prefilled_blocks = manager.get_prefilled_cpu_blocks(seq) # List[int]
# Get cached prefill length (doesn't change during decode)
prefill_len = manager.get_prefill_len(seq) # int
# Get decode start position
decode_pos = manager.get_decode_start_pos(seq) # int
```
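Illustrative glue code showing how these helpers feed the decode-side buffer preload (hypothetical composition; argument order follows the decode flow above):
```python
# Hypothetical glue: the helpers' outputs become the buffer-load arguments
cpu_block_table = manager.get_prefilled_cpu_blocks(seq)  # blocks with offloaded KV
valid_tokens = manager.get_prefill_len(seq)              # tokens those blocks hold

# Preload the first num_buffers layers (mirrors the decode flow above)
for buf in range(num_buffers):
    offload_engine.load_layer_kv_to_buffer(buf, buf, cpu_block_table, valid_tokens)
```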