# 64K Prefill MLP Activation OOM Issue
## Problem Summary
When running the RULER benchmark with a 64K context length in CPU offload mode, an OOM occurs during the MLP forward pass in `run_layerwise_offload_prefill`. The KV cache is successfully offloaded to the CPU, but the MLP's intermediate activations exceed the available GPU memory.
## Environment
- GPU: RTX 3090 (24GB)
- Model: LLaMA 3.1 8B
- Sequence Length: 65536 tokens
- Mode: `enable_cpu_offload=True`, `num_gpu_blocks=2`
## Error Message
```
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.47 GiB.
GPU 0 has a total capacity of 23.57 GiB of which 2.66 GiB is free.
Including non-PyTorch memory, this process has 20.88 GiB memory in use.
Of the allocated memory 20.51 GiB is allocated by PyTorch, and 32.26 MiB
is reserved by PyTorch but unallocated.
```
## Stack Trace
```
File "nanovllm/engine/model_runner.py", line 843, in run_layerwise_offload_prefill
hidden_states = layer.mlp(hidden_states)
File "nanovllm/models/llama.py", line 103, in forward
gate_up = self.gate_up_proj(x)
File "nanovllm/layers/linear.py", line 73, in forward
return F.linear(x, self.weight, self.bias)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.47 GiB.
```
## Root Cause Analysis
### Memory Breakdown
| Component | Calculation | Size |
|-----------|-------------|------|
| Model weights (BF16) | 8B params × 2 bytes | ~16 GB |
| GPU KV cache | 2 blocks × 1024 tokens/block × 8 KB/token | ~16 MB |
| **Remaining for activations** | 24 GB - 16 GB - overhead | **~6-7 GB** |
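As a sanity check, the breakdown can be reproduced with a few lines of arithmetic (a minimal sketch; the 23.57 GiB capacity comes from the error message above, and the ~2 GiB CUDA context/allocator overhead is an assumption):

```python
GIB = 1024**3

weights = 8e9 * 2 / GIB               # 8B params x 2 bytes (BF16) -> ~14.9 GiB (~16 GB decimal)
kv_cache = 2 * 1024 * 8 * 1024 / GIB  # 2 blocks x 1024 tokens x 8 KB/token -> ~0.016 GiB
capacity = 23.57                      # usable capacity reported in the error message (GiB)
overhead = 2.0                        # assumed CUDA context / allocator overhead (GiB)
print(f"remaining for activations: ~{capacity - weights - kv_cache - overhead:.1f} GiB")
```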
### MLP Activation Memory (per layer)
For LLaMA 3.1 8B with `hidden_size=4096`, `intermediate_size=14336`:
| Tensor | Shape | Size (BF16) |
|--------|-------|-------------|
| MLP input | [65536, 4096] | 512 MB |
| gate_up output | [65536, 28672] | **3.47 GB** |
| down_proj input | [65536, 14336] | 1.75 GB |
| MLP output | [65536, 4096] | 512 MB |
**Peak MLP memory**: ~3.5-4 GB for intermediate tensors
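These sizes follow directly from the tensor shapes; a quick sketch to recompute them (BF16 = 2 bytes per element; the output is in GiB, which matches the table values up to GB-vs-GiB rounding):

```python
S = 65536      # sequence length (tokens)
hidden = 4096  # hidden_size
inter = 14336  # intermediate_size
BYTES = 2      # BF16

for name, width in [("MLP input", hidden),
                    ("gate_up output", 2 * inter),  # gate and up projections concatenated
                    ("down_proj input", inter),
                    ("MLP output", hidden)]:
    print(f"{name}: {S * width * BYTES / 1024**3:.2f} GiB")
```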
### Why OOM Occurs
1. Model weights consume ~16 GB (loaded on GPU for layer-wise processing)
2. Available memory: ~7 GB
3. MLP `gate_up_proj` output: 3.47 GB
4. Additional tensors (MLP input, residual stream, workspace buffers, etc.): ~1-2 GB (no gradients are involved; this is inference)
5. **Total required > Available** → OOM
## Code Location
The issue is in `nanovllm/engine/model_runner.py`:
```python
# Line 843 in run_layerwise_offload_prefill
hidden_states = layer.mlp(hidden_states) # <-- OOM here
```
The entire sequence (65536 tokens) is passed through the MLP in a single forward call.
## Current Configuration
From `model_wrappers.py` (RULER integration):
```python
llm_kwargs = {
"max_model_len": max_model_len, # 128 * 1024
"max_num_batched_tokens": max_model_len, # Same as max_model_len
"enable_cpu_offload": True,
"num_gpu_blocks": 2,
...
}
```
Setting `max_num_batched_tokens = max_model_len` causes nanovllm to process all tokens at once.
## Potential Solutions
### Option 1: Chunked MLP Processing
Modify `run_layerwise_offload_prefill` to process the MLP in chunks:
```python
# Instead of:
hidden_states = layer.mlp(hidden_states)

# Do:
chunk_size = 8192  # Process 8K tokens at a time
chunks = hidden_states.split(chunk_size, dim=0)
outputs = []
for chunk in chunks:
    outputs.append(layer.mlp(chunk))
hidden_states = torch.cat(outputs, dim=0)
```
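With `chunk_size = 8192`, the transient `gate_up` output shrinks to 8192 × 28672 × 2 bytes ≈ 448 MiB per chunk, comfortably within the ~6-7 GB activation budget. Note that the final `torch.cat` still materializes the full [65536, 4096] output (~512 MB), which was already affordable; writing each chunk's result into a preallocated output tensor instead would avoid that extra full-size copy.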
### Option 2: Activation Checkpointing
Use gradient checkpointing to recompute activations instead of storing them. Note this only helps when autograd is recording activations for backward; under `torch.no_grad()` inference the intermediates are already freed eagerly, so this option is mainly relevant if the prefill path runs with gradients enabled:
```python
from torch.utils.checkpoint import checkpoint
hidden_states = checkpoint(layer.mlp, hidden_states, use_reentrant=False)
```
### Option 3: Reduce Chunk Size via Config
Add a new config parameter `prefill_chunk_size` to control how many tokens are processed per forward pass.
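A minimal sketch of what Option 3 could look like; the `prefill_chunk_size` field and the `mlp_forward_chunked` helper below are hypothetical illustrations, not existing nanovllm API:

```python
import torch
from dataclasses import dataclass

# In nanovllm/config.py (hypothetical field):
@dataclass
class Config:
    # ... existing fields ...
    prefill_chunk_size: int = 8192  # tokens per MLP forward pass; 0 = no chunking

# In run_layerwise_offload_prefill (hypothetical usage):
def mlp_forward_chunked(mlp, hidden_states, chunk_size):
    """Run `mlp` over `hidden_states` in token chunks to bound peak activation memory."""
    if chunk_size <= 0 or hidden_states.shape[0] <= chunk_size:
        return mlp(hidden_states)
    out = torch.empty_like(hidden_states)  # MLP output keeps the [S, hidden_size] shape
    for start in range(0, hidden_states.shape[0], chunk_size):
        end = start + chunk_size
        out[start:end] = mlp(hidden_states[start:end])
    return out
```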
## Memory Estimation Formula
For a given sequence length `S` and model config:
```
MLP_peak_memory = S × intermediate_size × 2 (gate + up) × 2 bytes (BF16)
                = S × 14336 × 4 bytes

For S = 65536:
MLP_peak = 65536 × 14336 × 4 bytes ≈ 3.76 GB
```
Maximum safe sequence length for RTX 3090 (24GB):
```
S_max = available_memory / (intermediate_size × 4)
      = 6 GB / (14336 × 4 bytes)
      ≈ 100K tokens (theoretical)
      ≈ 8-16K tokens (practical, with safety margin)
```
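The same formula as a small helper (a sketch: the factor of 4 is the gate+up × BF16 constant derived above, and the safety divisor of 8 is an assumed rule of thumb that lands inside the 8-16K practical range):

```python
def max_safe_seq_len(available_bytes: float, intermediate_size: int = 14336) -> float:
    # Peak transient tensor: S x (2 * intermediate_size) elements x 2 bytes (BF16)
    return available_bytes / (intermediate_size * 4)

theoretical = max_safe_seq_len(6e9)  # ~105K tokens with a 6 GB budget (the ~100K figure above)
practical = theoretical / 8          # assumed safety margin for co-resident tensors
print(f"theoretical: {theoretical/1e3:.0f}K tokens, practical: ~{practical/1e3:.0f}K tokens")
```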
## Reproduction Steps
```bash
cd /home/zijie/Code/COMPASS/eval/RULER/scripts
# Set SEQ_LENGTHS to 65536 in config_models.sh
# Then run:
./run.sh llama3.1-8b-nanovllm synthetic --metric full --task niah_single_1
```
## Related Files
- `nanovllm/engine/model_runner.py`: `run_layerwise_offload_prefill()` (line 751+)
- `nanovllm/models/llama.py`: `LlamaMLP.forward()` (line 103)
- `nanovllm/config.py`: Config parameters
- RULER integration: `eval/RULER/scripts/pred/model_wrappers.py`