# Layer-wise Offload Memory Analysis

This document provides a detailed analysis of memory allocations in the layer-wise CPU offload system, distinguishing between pre-allocated (managed) memory and temporary (non-pre-allocated) memory.

## Variable Notation

| Symbol | Description | Example (Qwen3-4B) |
|--------|-------------|-------------------|
| `seq_len` | Input sequence length | 131072 (128k) |
| `hidden_size` | Model hidden dimension | 2560 |
| `num_heads` | Number of attention heads | 20 |
| `num_kv_heads` | Number of KV heads (GQA) | 8 |
| `head_dim` | Dimension per head | 128 |
| `intermediate_size` | MLP intermediate dimension | 13696 |
| `num_layers` | Number of transformer layers | 36 |
| `block_size` | KV cache block size | 1024 |
| `num_kv_buffers` | Ring buffer count | 4 |
| `num_cpu_blocks` | Number of CPU cache blocks | 128 |
| `vocab_size` | Vocabulary size | 151936 |
| `dtype_size` | Bytes per element (fp16/bf16) | 2 |

Derived values:

- `kv_dim = num_kv_heads × head_dim`
- `q_size = num_heads × head_dim`
- `kv_size = num_kv_heads × head_dim`
- `qkv_size = q_size + 2 × kv_size`

---
## 1. Pre-allocated Memory (Managed by nanovllm)

These tensors are allocated once during initialization and reused throughout inference.

### 1.1 OffloadEngine Managed Memory

| Tensor | Shape | Size Formula | Location |
|--------|-------|--------------|----------|
| `layer_k_cache` | `[num_kv_buffers, seq_len, num_kv_heads, head_dim]` | `num_kv_buffers × seq_len × kv_dim × dtype_size` | GPU |
| `layer_v_cache` | `[num_kv_buffers, seq_len, num_kv_heads, head_dim]` | `num_kv_buffers × seq_len × kv_dim × dtype_size` | GPU |
| `decode_k_buffer` | `[num_layers, block_size, num_kv_heads, head_dim]` | `num_layers × block_size × kv_dim × dtype_size` | GPU |
| `decode_v_buffer` | `[num_layers, block_size, num_kv_heads, head_dim]` | `num_layers × block_size × kv_dim × dtype_size` | GPU |
| `k_cache_cpu` | `[num_layers, num_cpu_blocks, block_size, num_kv_heads, head_dim]` | `num_layers × num_cpu_blocks × block_size × kv_dim × dtype_size` | CPU (pinned) |
| `v_cache_cpu` | `[num_layers, num_cpu_blocks, block_size, num_kv_heads, head_dim]` | `num_layers × num_cpu_blocks × block_size × kv_dim × dtype_size` | CPU (pinned) |

**Total GPU (OffloadEngine)**: `2 × (num_kv_buffers × seq_len + num_layers × block_size) × kv_dim × dtype_size`

**Total CPU (OffloadEngine)**: `2 × num_layers × num_cpu_blocks × block_size × kv_dim × dtype_size`
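
Plugging in the Qwen3-4B example values, the two totals can be sanity-checked with a few lines of Python (the helper name is illustrative, not part of nanovllm):

```python
def offload_engine_memory(seq_len, num_layers, block_size, num_kv_buffers,
                          num_cpu_blocks, kv_dim, dtype_size):
    """Return (gpu_bytes, cpu_bytes) for the OffloadEngine buffers."""
    ring = 2 * num_kv_buffers * seq_len * kv_dim * dtype_size    # layer_k/v_cache
    decode = 2 * num_layers * block_size * kv_dim * dtype_size   # decode_k/v_buffer
    cpu = 2 * num_layers * num_cpu_blocks * block_size * kv_dim * dtype_size
    return ring + decode, cpu

gpu, cpu = offload_engine_memory(seq_len=131072, num_layers=36, block_size=1024,
                                 num_kv_buffers=4, num_cpu_blocks=128,
                                 kv_dim=8 * 128, dtype_size=2)
print(f"GPU: {gpu / 2**20:.0f} MiB, CPU: {cpu / 2**20:.0f} MiB")
# GPU: 2192 MiB, CPU: 18432 MiB
```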

### 1.2 Model Weights

| Component | Approximate Size |
|-----------|-----------------|
| Embedding | `vocab_size × hidden_size × dtype_size` |
| Per-layer QKV proj | `hidden_size × qkv_size × dtype_size` |
| Per-layer O proj | `q_size × hidden_size × dtype_size` |
| Per-layer MLP | `(hidden_size × 2 × intermediate_size + intermediate_size × hidden_size) × dtype_size` |
| Per-layer LayerNorm | `2 × hidden_size × dtype_size` |
| LM Head | `hidden_size × vocab_size × dtype_size` |
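
For a rough check, the per-layer rows above can be summed; with the Qwen3-4B values this gives about 236 MiB per layer, or ~8.3 GiB across 36 layers before embeddings (`per_layer_weight_bytes` is an illustrative helper, not nanovllm API):

```python
def per_layer_weight_bytes(hidden_size, q_size, kv_size, intermediate_size,
                           dtype_size=2):
    """Sum the per-layer rows of the weight table above."""
    qkv_size = q_size + 2 * kv_size
    qkv_proj = hidden_size * qkv_size * dtype_size
    o_proj = q_size * hidden_size * dtype_size
    mlp = (hidden_size * 2 * intermediate_size
           + intermediate_size * hidden_size) * dtype_size
    layernorm = 2 * hidden_size * dtype_size
    return qkv_proj + o_proj + mlp + layernorm

b = per_layer_weight_bytes(hidden_size=2560, q_size=2560, kv_size=1024,
                           intermediate_size=13696)
print(f"{b / 2**20:.0f} MiB per layer, {36 * b / 2**30:.1f} GiB for 36 layers")
```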

### 1.3 RoPE Cache

| Tensor | Shape | Size |
|--------|-------|------|
| `cos_sin_cache` | `[max_position, 1, head_dim]` | `max_position × head_dim × 4` (float32) |

---

## 2. Non-Pre-allocated Memory: Prefill Phase

Location: `model_runner.py:run_layerwise_offload_prefill()`

### 2.1 Persistent Tensors (Live Throughout Prefill)

| Variable | Line | Shape | Size | Notes |
|----------|------|-------|------|-------|
| `input_ids` | 488 | `[seq_len]` | `seq_len × 8` | int64 |
| `positions` | 489 | `[seq_len]` | `seq_len × 8` | int64 |
| `cu_seqlens` | 493 | `[2]` | negligible | int32 |
| `hidden_states` | 497 | `[seq_len, hidden_size]` | `seq_len × hidden_size × dtype_size` | Embedding output |
| `residual` | 506 | `[seq_len, hidden_size]` | `seq_len × hidden_size × dtype_size` | Residual connection |

### 2.2 Per-Layer Temporary Tensors

These are allocated and deallocated within each layer iteration.

#### 2.2.1 LayerNorm

| Variable | Line | Shape | Size | Notes |
|----------|------|-------|------|-------|
| `hidden_ln` | 506-508 | `[seq_len, hidden_size]` | `seq_len × hidden_size × dtype_size` | Input layernorm output |

**Inside RMSNorm** (`layernorm.py:add_rms_forward`):

| Variable | Shape | Size | Notes |
|----------|-------|------|-------|
| `x.float()` | `[seq_len, hidden_size]` | `seq_len × hidden_size × 4` | Upcast to float32 |
| `var` | `[seq_len, 1]` | `seq_len × 4` | Variance |
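
The float32 upcast is what makes RMSNorm contribute a `seq_len × hidden_size × 4` temporary even in an fp16 model. A minimal NumPy sketch of the same pattern (the real `add_rms_forward` also fuses the residual add, omitted here):

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm with the float32 upcast described above."""
    x32 = x.astype(np.float32)                        # seq_len × hidden_size × 4 bytes
    var = np.mean(x32 * x32, axis=-1, keepdims=True)  # seq_len × 4 bytes
    y = x32 / np.sqrt(var + eps) * weight
    return y.astype(x.dtype)                          # downcast back to the model dtype

x = np.random.randn(4, 8).astype(np.float16)
out = rms_norm(x, np.ones(8, dtype=np.float32))
assert out.dtype == np.float16 and out.shape == x.shape
```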

#### 2.2.2 QKV Projection

| Variable | Line | Shape | Size | Notes |
|----------|------|-------|------|-------|
| `qkv` | 512 | `[seq_len, q_size + 2 × kv_size]` | `seq_len × qkv_size × dtype_size` | Merged QKV output |
| `q` | 513-519 | `[seq_len, num_heads, head_dim]` | 0 (view) | View of qkv |
| `k` | 513-520 | `[seq_len, num_kv_heads, head_dim]` | 0 (view) | View of qkv |
| `v` | 513-521 | `[seq_len, num_kv_heads, head_dim]` | 0 (view) | View of qkv |

#### 2.2.3 Q/K Norms (Qwen3-specific)

| Variable | Line | Shape | Size | Notes |
|----------|------|-------|------|-------|
| `q.reshape()` | 526 | `[seq_len × num_heads, head_dim]` | 0 (view) | Reshape for norm |
| `k.reshape()` | 528 | `[seq_len × num_kv_heads, head_dim]` | 0 (view) | Reshape for norm |
| RMSNorm intermediates | - | see 2.2.1 | `seq_len × num_heads × head_dim × 4` | Float32 upcast |

#### 2.2.4 RoPE (Rotary Position Embedding)

Location: `rotary_embedding.py:apply_rotary_emb()`

| Variable | Line | Shape | Size | Notes |
|----------|------|-------|------|-------|
| `cos_sin` | 44 | `[seq_len, 1, head_dim]` | 0 (view) | View of cached cos_sin |
| `cos` | 45 | `[seq_len, 1, head_dim/2]` | 0 (view) | Chunk view |
| `sin` | 45 | `[seq_len, 1, head_dim/2]` | 0 (view) | Chunk view |

**Inside `apply_rotary_emb` for Q** (`rotary_embedding.py:6-14`):

| Variable | Shape | Size | Notes |
|----------|-------|------|-------|
| `x.float()` | `[seq_len, num_heads, head_dim]` | `seq_len × num_heads × head_dim × 4` | Upcast to float32 |
| `x1` | `[seq_len, num_heads, head_dim/2]` | 0 (view) | Chunk view |
| `x2` | `[seq_len, num_heads, head_dim/2]` | 0 (view) | Chunk view |
| `y1 = x1*cos - x2*sin` | `[seq_len, num_heads, head_dim/2]` | `seq_len × num_heads × head_dim/2 × 4` | New tensor |
| `y2 = x2*cos + x1*sin` | `[seq_len, num_heads, head_dim/2]` | `seq_len × num_heads × head_dim/2 × 4` | New tensor |
| `torch.cat((y1, y2))` | `[seq_len, num_heads, head_dim]` | `seq_len × num_heads × head_dim × 4` | New tensor |
| `.to(x.dtype)` | `[seq_len, num_heads, head_dim]` | `seq_len × num_heads × head_dim × dtype_size` | Downcast |

**Inside `apply_rotary_emb` for K**:

| Variable | Shape | Size | Notes |
|----------|-------|------|-------|
| Same pattern as Q | `[seq_len, num_kv_heads, head_dim]` | Same formulas with `num_kv_heads` | |

**Total RoPE temporaries for Q+K**: ~`seq_len × (num_heads + num_kv_heads) × head_dim × 4 × 3` (three float32 intermediates each)
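
A NumPy sketch of this `apply_rotary_emb` pattern makes the float32 temporaries (`x.float()`, `y1`/`y2`, and the concatenated result) visible; the exact nanovllm kernel may differ in layout:

```python
import numpy as np

def apply_rotary_emb(x, cos, sin):
    """Half-split ('NeoX-style') rotary embedding, mirroring the table above."""
    xf = x.astype(np.float32)             # float32 upcast temporary
    x1, x2 = np.split(xf, 2, axis=-1)     # half views, no copy
    y1 = x1 * cos - x2 * sin              # new float32 tensor
    y2 = x2 * cos + x1 * sin              # new float32 tensor
    return np.concatenate([y1, y2], axis=-1).astype(x.dtype)  # cat + downcast

seq_len, num_heads, head_dim = 16, 2, 8
x = np.random.randn(seq_len, num_heads, head_dim).astype(np.float16)
theta = np.random.randn(seq_len, 1, head_dim // 2).astype(np.float32)
out = apply_rotary_emb(x, np.cos(theta), np.sin(theta))
# A rotation preserves the norm of each (x1, x2) pair, hence of the whole vector
assert np.allclose(np.linalg.norm(out.astype(np.float32), axis=-1),
                   np.linalg.norm(x.astype(np.float32), axis=-1), rtol=1e-2)
```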

#### 2.2.5 FlashAttention

| Variable | Line | Shape | Size | Notes |
|----------|------|-------|------|-------|
| `attn_output` | 535 | `[seq_len, num_heads, head_dim]` | `seq_len × num_heads × head_dim × dtype_size` | Attention output |
| Internal workspace | - | O(seq_len) | Variable | FlashAttention internal |

#### 2.2.6 Output Projection

| Variable | Line | Shape | Size | Notes |
|----------|------|-------|------|-------|
| `attn_output.view()` | 546 | `[seq_len, q_size]` | 0 (view) | Reshape for o_proj |
| `o_proj(attn_output)` | 547 | `[seq_len, hidden_size]` | `seq_len × hidden_size × dtype_size` | O projection output |

#### 2.2.7 Post-Attention LayerNorm

Same as input layernorm (2.2.1).

#### 2.2.8 MLP

Location: `qwen3.py:Qwen3MLP.forward()`

| Variable | Line | Shape | Size | Notes |
|----------|------|-------|------|-------|
| `gate_up` | 117 | `[seq_len, 2 × intermediate_size]` | `seq_len × 2 × intermediate_size × dtype_size` | **LARGEST TEMPORARY!** |
| `x, y = chunk()` | activation.py:13 | `[seq_len, intermediate_size]` × 2 | 0 (views) | Chunk views |
| `F.silu(x)` | activation.py:14 | `[seq_len, intermediate_size]` | `seq_len × intermediate_size × dtype_size` | SiLU activation |
| `silu(x) * y` | activation.py:14 | `[seq_len, intermediate_size]` | `seq_len × intermediate_size × dtype_size` | Gated output |
| `down_proj()` | 119 | `[seq_len, hidden_size]` | `seq_len × hidden_size × dtype_size` | MLP output |

### 2.3 Prefill Memory Summary

**Peak per-layer temporary memory** (the float32 RoPE intermediates carry their own 4-byte factor and are not scaled by `dtype_size`):

```
= qkv + RoPE_temps + attn_output + o_proj + layernorm + MLP_gate_up + MLP_activation
≈ seq_len × (qkv_size + num_heads × head_dim + 2 × hidden_size + 3 × intermediate_size) × dtype_size
  + seq_len × (num_heads + num_kv_heads) × head_dim × 4 × 3   (float32 RoPE intermediates)
```

**Dominant term**: `seq_len × 2 × intermediate_size × dtype_size` (MLP gate_up)

---

## 3. Non-Pre-allocated Memory: Decode Phase

Location: `model_runner.py:run_layerwise_offload_decode()`

### 3.1 Persistent Tensors

| Variable | Line | Shape | Size | Notes |
|----------|------|-------|------|-------|
| `input_ids` | 604 | `[1]` | 8 bytes | Single token |
| `positions` | 605 | `[1]` | 8 bytes | Single position |
| `cu_seqlens_q` | 631 | `[2]` | 8 bytes | Fixed |
| `valid_tokens_per_block` | 613-622 | Python list | negligible | |

### 3.2 Per-Layer Temporary Tensors

#### 3.2.1 Views (Zero Additional Memory)

| Variable | Line | Shape | Notes |
|----------|------|-------|-------|
| `k_prefill` | 682 | `[prefill_len, num_kv_heads, head_dim]` | View of ring buffer |
| `v_prefill` | 682 | `[prefill_len, num_kv_heads, head_dim]` | View of ring buffer |
| `k_decode_prev` | 686-687 | `[num_decode_tokens-1, num_kv_heads, head_dim]` | View of decode buffer |
| `v_decode_prev` | 686-688 | `[num_decode_tokens-1, num_kv_heads, head_dim]` | View of decode buffer |

#### 3.2.2 New Allocations

| Variable | Line | Shape | Size | Notes |
|----------|------|-------|------|-------|
| `hidden_ln` | 654-657 | `[1, hidden_size]` | `hidden_size × dtype_size` | Tiny |
| `qkv` | 660 | `[1, qkv_size]` | `qkv_size × dtype_size` | Tiny |
| `q` | 667 | `[1, num_heads, head_dim]` | 0 (view) | |
| `k_new` | 668 | `[1, num_kv_heads, head_dim]` | 0 (view) | |
| `v_new` | 669 | `[1, num_kv_heads, head_dim]` | 0 (view) | |
| **`k_full`** | 689/692 | `[prefill_len + num_decode_tokens, num_kv_heads, head_dim]` | `(prefill_len + num_decode_tokens) × kv_dim × dtype_size` | **torch.cat - NEW ALLOCATION** |
| **`v_full`** | 690/693 | `[prefill_len + num_decode_tokens, num_kv_heads, head_dim]` | `(prefill_len + num_decode_tokens) × kv_dim × dtype_size` | **torch.cat - NEW ALLOCATION** |
| `cu_seqlens_k` | 710 | `[2]` | 8 bytes | Created per layer |
| `attn_output` | 712 | `[1, num_heads, head_dim]` | `num_heads × head_dim × dtype_size` | Tiny |
| MLP temps | 728 | `[1, ...]` | negligible | Single token |

### 3.3 Decode Memory Summary

**Peak per-layer temporary memory**:

```
= k_full + v_full + small_tensors
≈ 2 × (prefill_len + num_decode_tokens) × num_kv_heads × head_dim × dtype_size
≈ 2 × seq_len × kv_dim × dtype_size
```

**Dominant term**: `k_full` and `v_full` from `torch.cat()`

---

## 4. Memory Comparison Table

For Qwen3-4B with 128k context:

| Category | Memory | Notes |
|----------|--------|-------|
| **Pre-allocated GPU** | ~2.2 GB | Ring buffer + decode buffer |
| **Pre-allocated CPU** | ~18.4 GB | Pinned memory |
| **Model Weights** | ~8 GB | |
| **Prefill Peak Temp** | ~10-12 GB | MLP gate_up dominant |
| **Decode Peak Temp** | ~512 MB | k_full + v_full |

---

## 5. Optimization Opportunities

### 5.1 Decode: Pre-allocate k_full/v_full

**Current** (L689-693):

```python
k_full = torch.cat([k_prefill, k_decode_prev, k_new], dim=0)  # New allocation each layer
v_full = torch.cat([v_prefill, v_decode_prev, v_new], dim=0)  # New allocation each layer
```

**Optimized**:

```python
# Pre-allocate in OffloadEngine.__init__():
self.k_full_buffer = torch.zeros(max_seq_len + block_size, num_kv_heads, head_dim, ...)
self.v_full_buffer = torch.zeros(max_seq_len + block_size, num_kv_heads, head_dim, ...)

# In decode loop:
total_len = prefill_len + num_decode_tokens
k_full = self.k_full_buffer[:total_len]
k_full[:prefill_len].copy_(k_prefill)
k_full[prefill_len:prefill_len + num_decode_prev].copy_(k_decode_prev)
k_full[-1:].copy_(k_new)
```

**Savings**: ~512 MB per decode step (for 128k)

### 5.2 Decode: Reuse cu_seqlens_k

**Current** (L710):

```python
cu_seqlens_k = torch.tensor([0, total_kv_tokens], dtype=torch.int32, device="cuda")
```

**Optimized**:

```python
# Pre-allocate once:
self.cu_seqlens_k = torch.zeros(2, dtype=torch.int32, device="cuda")

# In decode loop:
self.cu_seqlens_k[1] = total_kv_tokens
```

**Savings**: Negligible memory, but reduces allocation overhead.

### 5.3 RoPE: In-place or Pre-allocated Buffers

The RoPE implementation creates multiple float32 intermediate tensors. Options:

1. Pre-allocate buffers for Q and K rotary outputs
2. Use in-place operations where possible
3. Use fused RoPE kernel (e.g., from FlashAttention)
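
As a sketch of option 2, the rotation can write into a pre-allocated output buffer instead of materializing `y1`, `y2`, and a concatenated result (NumPy stand-in; a real implementation would do this in a fused CUDA kernel):

```python
import numpy as np

def rotary_into(x, cos, sin, out):
    """Apply rotary embedding, writing into a caller-provided buffer."""
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    np.multiply(x1, cos, out=out[..., :half])  # out1 = x1*cos
    out[..., :half] -= x2 * sin                # out1 = x1*cos - x2*sin
    np.multiply(x2, cos, out=out[..., half:])  # out2 = x2*cos
    out[..., half:] += x1 * sin                # out2 = x2*cos + x1*sin
    return out

x = np.random.randn(16, 2, 8).astype(np.float32)
theta = np.random.randn(16, 1, 4).astype(np.float32)
cos, sin = np.cos(theta), np.sin(theta)
out = np.empty_like(x)       # pre-allocated once, reused every call
rotary_into(x, cos, sin, out)
ref = np.concatenate([x[..., :4] * cos - x[..., 4:] * sin,
                      x[..., 4:] * cos + x[..., :4] * sin], axis=-1)
assert np.allclose(out, ref)
```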

**Potential savings**: ~1.5 GB per layer during prefill

### 5.4 MLP: Cannot Optimize Easily

The MLP `gate_up` tensor is inherently required for the gated activation:

```python
gate_up = gate_up_proj(x)    # [seq_len, 2 × intermediate_size]
x, y = gate_up.chunk(2, -1)
output = silu(x) * y
```

This is a fundamental computation pattern. Potential optimizations:

- Chunked MLP computation (process seq_len in chunks)
- Fused kernels that avoid materializing full gate_up
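
A sketch of the chunked approach: process the sequence in slices so only `chunk × 2 × intermediate_size` of `gate_up` is live at once, at the cost of extra kernel launches (NumPy stands in for the matmuls):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def mlp_chunked(x, w_gate_up, w_down, chunk=4096):
    """Gated MLP computed in sequence chunks to bound the gate_up temporary."""
    out = np.empty((x.shape[0], w_down.shape[1]), dtype=x.dtype)
    inter = w_gate_up.shape[1] // 2
    for s in range(0, x.shape[0], chunk):
        gate_up = x[s:s + chunk] @ w_gate_up       # [chunk, 2 × intermediate]
        g, y = gate_up[:, :inter], gate_up[:, inter:]
        out[s:s + chunk] = (silu(g) * y) @ w_down  # [chunk, hidden]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 8)).astype(np.float32)
w_gu = rng.standard_normal((8, 32)).astype(np.float32)
w_d = rng.standard_normal((16, 8)).astype(np.float32)
full = (silu(x @ w_gu[:, :16]) * (x @ w_gu[:, 16:])) @ w_d
assert np.allclose(mlp_chunked(x, w_gu, w_d, chunk=3), full, atol=1e-5)
```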

---

## 6. Memory Flow Diagram

### Prefill (per layer):

```
hidden_states ──┬──► LayerNorm ──► hidden_ln
                │
residual ◄──────┘

hidden_ln ──► QKV_proj ──► qkv ──┬──► q ──► Q_norm ──► RoPE ──► q_rotated
                                 ├──► k ──► K_norm ──► RoPE ──► k_rotated
                                 └──► v

q_rotated, k_rotated, v ──► FlashAttention ──► attn_output

attn_output ──► O_proj ──► hidden_states'

hidden_states', residual ──► LayerNorm ──► hidden_ln', residual'

hidden_ln' ──► MLP_gate_up ──► gate_up ──► SiLU×gate ──► MLP_down ──► hidden_states''

k_rotated, v ──► CPU_offload (sync copy)
```

### Decode (per layer):

```
[CPU] k_cache_cpu, v_cache_cpu
          │
          ▼  (H2D async to ring buffer)
[GPU] layer_k_cache[buffer_idx], layer_v_cache[buffer_idx]
          │
          ▼  (view)
      k_prefill, v_prefill
          │
          ├──► torch.cat([k_prefill, k_decode_prev, k_new]) ──► k_full ⚠️ NEW ALLOC
          │
          └──► torch.cat([v_prefill, v_decode_prev, v_new]) ──► v_full ⚠️ NEW ALLOC

q_new, k_full, v_full ──► FlashAttention ──► attn_output

k_new, v_new ──► decode_k_buffer, decode_v_buffer (in-place store)
```

---

## 7. Appendix: Size Calculations

### Qwen3-4B Example (128k context)

```python
# Model config
seq_len = 131072
hidden_size = 2560
num_heads = 20
num_kv_heads = 8
head_dim = 128
intermediate_size = 13696
num_layers = 36
block_size = 1024
num_kv_buffers = 4
num_cpu_blocks = 128
dtype_size = 2  # fp16/bf16

# Derived
kv_dim = num_kv_heads * head_dim  # 1024
q_size = num_heads * head_dim     # 2560
qkv_size = q_size + 2 * kv_dim    # 4608

# Pre-allocated GPU (OffloadEngine)
ring_buffer = 2 * num_kv_buffers * seq_len * kv_dim * dtype_size
# = 2 * 4 * 131072 * 1024 * 2 = 2,147,483,648 bytes = 2048 MB

decode_buffer = 2 * num_layers * block_size * kv_dim * dtype_size
# = 2 * 36 * 1024 * 1024 * 2 = 150,994,944 bytes = 144 MB

# Pre-allocated CPU
cpu_cache = 2 * num_layers * num_cpu_blocks * block_size * kv_dim * dtype_size
# = 2 * 36 * 128 * 1024 * 1024 * 2 = 19,327,352,832 bytes = 18432 MB

# Prefill temporaries (per layer peak)
mlp_gate_up = seq_len * 2 * intermediate_size * dtype_size
# = 131072 * 2 * 13696 * 2 = 7,180,648,448 bytes = 6848 MB

# Decode temporaries (per layer)
k_full = seq_len * kv_dim * dtype_size
# = 131072 * 1024 * 2 = 268,435,456 bytes = 256 MB
v_full = k_full  # = 256 MB
# Total: 512 MB
```

---

## 8. Empirical Validation

This section validates the theoretical memory analysis against actual measurements.

### 8.1 Test Configuration

```bash
python tests/test_needle.py --enable-offload --input-len 100000 --block-size 1024
```

**Parameters:**

- Model: Qwen3-4B-Instruct
- `seq_len = 100000` (actual tokens: 99925)
- `block_size = 1024`
- `max_model_len = 131072`
- `num_kv_buffers = 4`

### 8.2 Theoretical Peak Memory Calculation

#### Step 1: Model Load Memory

| Component | Formula | Size |
|-----------|---------|------|
| Model weights | ~4B params × 2 bytes | ~8 GB |
| Ring buffer | 2 × 4 × 131072 × 1024 × 2 | 2048 MB |
| Decode buffer | 2 × 36 × 1024 × 1024 × 2 | 144 MB |
| **Subtotal** | | **~10.2 GB** |

#### Step 2: Prefill Activation Peak (per-layer)

| Component | Formula | Size |
|-----------|---------|------|
| hidden_states | 100000 × 2560 × 2 | 512 MB |
| residual | 100000 × 2560 × 2 | 512 MB |
| MLP gate_up | 100000 × 27392 × 2 | **5478 MB** |
| MLP silu×gate | 100000 × 13696 × 2 | 2739 MB |
| Other intermediates (qkv, RoPE, attn) | estimated (see §2.2) | ~1500 MB |
| **Subtotal** | | **~10 GB** |

#### Step 3: Total Peak

```
Total Peak = Model Load + Activation Peak
           = 10.2 GB + 10 GB
           = ~20.2 GB
```

### 8.3 Actual Measurement Results

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run inference ...
peak = torch.cuda.max_memory_allocated()
```

| Metric | Value |
|--------|-------|
| After model load | 9.82 GB |
| Peak during inference | **20.02 GB** |
| Activation peak (delta) | 10.20 GB |

### 8.4 Comparison: Theory vs Actual

| Component | Theoretical | Actual | Error |
|-----------|-------------|--------|-------|
| Model load memory | ~10.2 GB | 9.82 GB | -3.7% |
| Activation peak | ~10 GB | 10.20 GB | +2.0% |
| **Total peak** | **~20.2 GB** | **20.02 GB** | **< 1%** |

### 8.5 Key Findings

1. **Theoretical model is accurate**: < 5% error in all components.

2. **MLP gate_up is the dominant temporary**:
   - Size: ~5.5 GB (5478 MB, for 100k tokens)
   - Accounts for ~50% of activation peak
   - Formula: `seq_len × 2 × intermediate_size × dtype_size`

3. **Memory scaling with sequence length**:

   | seq_len | Model Load | Activation Peak | Total Peak |
   |---------|------------|-----------------|------------|
   | 8k | ~10 GB | ~0.8 GB | ~11 GB |
   | 32k | ~10 GB | ~3.2 GB | ~13 GB |
   | 64k | ~10 GB | ~6.4 GB | ~16 GB |
   | 100k | ~10 GB | ~10 GB | ~20 GB |
   | 128k | ~10 GB | ~13 GB | ~23 GB |

4. **Decode memory is much smaller**:
   - Per-step: ~512 MB for k_full + v_full (at 100k context)
   - Does not grow with decode steps (constant per layer)
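
The activation-peak scaling above is linear in `seq_len` and can be reproduced with a rough estimator built from the Step 2 components in §8.2 (the ~15 KB/token for "other intermediates" is a back-of-envelope constant, not a measured value):

```python
def activation_peak_bytes(seq_len, hidden_size=2560, intermediate_size=13696,
                          dtype_size=2, other_per_token=15_000):
    """Rough prefill activation peak: hidden + residual + gate_up + silu×gate + misc."""
    per_token = (2 * hidden_size * dtype_size          # hidden_states + residual
                 + 2 * intermediate_size * dtype_size  # MLP gate_up
                 + intermediate_size * dtype_size      # silu(x) * y
                 + other_per_token)                    # qkv, RoPE, attention, ...
    return seq_len * per_token

for n in (8_000, 32_000, 64_000, 100_000, 131_072):
    print(f"{n:>7} tokens: ~{activation_peak_bytes(n) / 1e9:.1f} GB")
```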

### 8.6 Memory Profiling Script

To reproduce the measurement:

```python
import os
os.environ["NANOVLLM_LOG_LEVEL"] = "INFO"

import torch
from nanovllm import LLM, SamplingParams
from tests.utils import generate_needle_prompt

# Reset memory stats
torch.cuda.reset_peak_memory_stats()
torch.cuda.empty_cache()

# Initialize LLM
llm = LLM(
    "path/to/model",
    enforce_eager=True,
    max_model_len=131072,
    max_num_batched_tokens=131072,
    enable_cpu_offload=True,
    kvcache_block_size=1024,
    num_gpu_blocks=2,
)

after_load = torch.cuda.memory_allocated()
print(f"After model load: {after_load / 1024**3:.2f} GB")

# Generate prompt and run inference
prompt, expected = generate_needle_prompt(
    tokenizer=llm.tokenizer,
    target_length=100000,
    needle_position=0.5,
)

torch.cuda.reset_peak_memory_stats()
outputs = llm.generate([prompt], SamplingParams(max_tokens=32))

peak = torch.cuda.max_memory_allocated()
print(f"Peak during inference: {peak / 1024**3:.2f} GB")
```