# 64K Prefill MLP Activation OOM Issue

## Problem Summary
When running the RULER benchmark with a 64K context length in CPU offload mode, an OOM occurs during the MLP forward pass in `run_layerwise_offload_prefill`. The KV cache is successfully offloaded to CPU, but the MLP intermediate activations exceed available GPU memory.
## Environment

- GPU: RTX 3090 (24 GB)
- Model: LLaMA 3.1 8B
- Sequence length: 65536 tokens
- Mode: `enable_cpu_offload=True`, `num_gpu_blocks=2`
## Error Message

```
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.47 GiB.
GPU 0 has a total capacity of 23.57 GiB of which 2.66 GiB is free.
Including non-PyTorch memory, this process has 20.88 GiB memory in use.
Of the allocated memory 20.51 GiB is allocated by PyTorch, and 32.26 MiB
is reserved by PyTorch but unallocated.
```
## Stack Trace

```
File "nanovllm/engine/model_runner.py", line 843, in run_layerwise_offload_prefill
  hidden_states = layer.mlp(hidden_states)
File "nanovllm/models/llama.py", line 103, in forward
  gate_up = self.gate_up_proj(x)
File "nanovllm/layers/linear.py", line 73, in forward
  return F.linear(x, self.weight, self.bias)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.47 GiB.
```
## Root Cause Analysis

### Memory Breakdown
| Component | Calculation | Size |
|---|---|---|
| Model weights (BF16) | 8B params × 2 bytes | ~16 GB |
| GPU KV cache | 2 blocks × 1024 tokens × 8KB/token | ~16 MB |
| Remaining for activations | 24 - 16 - overhead | ~6-7 GB |
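The first two rows can be sanity-checked with a few lines of arithmetic (plain Python, figures taken from the table above):

```python
GiB = 1024 ** 3
MiB = 1024 ** 2

# Model weights: 8B parameters x 2 bytes (BF16)
weights_bytes = 8_000_000_000 * 2
print(f"weights: {weights_bytes / GiB:.1f} GiB")   # ~14.9 GiB (~16 GB decimal)

# GPU KV cache: 2 blocks x 1024 tokens/block x 8 KB/token
kv_bytes = 2 * 1024 * 8 * 1024
print(f"kv cache: {kv_bytes / MiB:.0f} MiB")       # 16 MiB
```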
### MLP Activation Memory (per layer)

For LLaMA 3.1 8B with `hidden_size=4096`, `intermediate_size=14336`:
| Tensor | Shape | Size (BF16) |
|---|---|---|
| MLP input | [65536, 4096] | 512 MB |
| gate_up output | [65536, 28672] | 3.47 GB |
| down_proj input | [65536, 14336] | 1.75 GB |
| MLP output | [65536, 4096] | 512 MB |
Peak MLP memory: ~3.5-4 GB for intermediate tensors
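These sizes follow directly from rows × cols × 2 bytes; a short script to reproduce the table (shape values from the model config above):

```python
S = 65536       # sequence length
HIDDEN = 4096   # hidden_size
INTER = 14336   # intermediate_size

def bf16_bytes(rows: int, cols: int) -> int:
    """Size of a BF16 tensor of the given shape (2 bytes per element)."""
    return rows * cols * 2

sizes = {
    "mlp_input":       bf16_bytes(S, HIDDEN),     # 512 MiB
    "gate_up_output":  bf16_bytes(S, 2 * INTER),  # 3.5 GiB -- the failing allocation
    "down_proj_input": bf16_bytes(S, INTER),      # 1.75 GiB
    "mlp_output":      bf16_bytes(S, HIDDEN),     # 512 MiB
}
for name, nbytes in sizes.items():
    print(f"{name}: {nbytes / 2**30:.2f} GiB")
```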
### Why OOM Occurs

- Model weights consume ~16 GB (loaded on GPU for layer-wise processing)
- Available memory: ~7 GB
- MLP `gate_up_proj` output: 3.47 GB
- Additional tensors (input, temporaries, etc.): ~1-2 GB
- Total required > available → OOM
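Plugging in the numbers from the error message makes the failure concrete: the single `gate_up_proj` allocation is larger than everything left on the device.

```python
# Figures reported by torch.OutOfMemoryError above
total_capacity = 23.57  # GiB, RTX 3090 as seen by PyTorch
in_use         = 20.88  # GiB, process memory when the MLP runs
requested      = 3.47   # GiB, gate_up output for 65536 tokens

free = total_capacity - in_use
print(f"free: {free:.2f} GiB, requested: {requested:.2f} GiB")
assert requested > free  # -> torch.OutOfMemoryError
```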
## Code Location

The issue is in `nanovllm/engine/model_runner.py`:

```python
# Line 843 in run_layerwise_offload_prefill
hidden_states = layer.mlp(hidden_states)  # <-- OOM here
```

The entire sequence (65536 tokens) is passed through the MLP in one shot.
## Current Configuration

From `model_wrappers.py` (RULER integration):

```python
llm_kwargs = {
    "max_model_len": max_model_len,            # 128 * 1024
    "max_num_batched_tokens": max_model_len,   # Same as max_model_len
    "enable_cpu_offload": True,
    "num_gpu_blocks": 2,
    # ...
}
```

Setting `max_num_batched_tokens = max_model_len` causes nanovllm to process all tokens at once.
## Potential Solutions

### Option 1: Chunked MLP Processing

Modify `run_layerwise_offload_prefill` to process the MLP in chunks:

```python
# Instead of:
hidden_states = layer.mlp(hidden_states)

# Do:
chunk_size = 8192  # Process 8K tokens at a time
chunks = hidden_states.split(chunk_size, dim=0)
outputs = []
for chunk in chunks:
    outputs.append(layer.mlp(chunk))
hidden_states = torch.cat(outputs, dim=0)
```
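For intuition on what `split` produces (this helper is illustrative only, not part of nanovllm), the chunk boundaries look like:

```python
def chunk_sizes(n_tokens: int, chunk_size: int) -> list[int]:
    """Chunk lengths produced by tensor.split(chunk_size, dim=0):
    full chunks followed by one possibly-smaller remainder."""
    full, rem = divmod(n_tokens, chunk_size)
    return [chunk_size] * full + ([rem] if rem else [])

# 65536 tokens in 8K chunks: 8 equal chunks, so peak MLP activation
# memory drops by roughly 8x (~3.5 GiB -> ~450 MiB)
print(chunk_sizes(65536, 8192))    # eight 8192-token chunks
print(chunk_sizes(65536, 10000))   # six 10000-token chunks plus a 5536 remainder
```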
### Option 2: Activation Checkpointing

Use gradient checkpointing to recompute activations instead of storing them (note: checkpointing mainly helps training; in an inference-only forward pass activations are not retained for backward anyway, so the peak intermediate allocation is unchanged):

```python
from torch.utils.checkpoint import checkpoint

hidden_states = checkpoint(layer.mlp, hidden_states, use_reentrant=False)
```
### Option 3: Reduce Chunk Size via Config

Add a new config parameter `prefill_chunk_size` to control how many tokens are processed per forward pass.
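A sketch of how such a knob could look (the `OffloadConfig` field name and its default are proposals from this document, not existing nanovllm API):

```python
from dataclasses import dataclass

@dataclass
class OffloadConfig:
    # Proposed field: max tokens per MLP forward during layer-wise
    # offload prefill; 0 disables chunking (current behavior)
    prefill_chunk_size: int = 8192

def effective_chunk(cfg: OffloadConfig, seq_len: int) -> int:
    """Chunk length actually used for a given prompt."""
    if cfg.prefill_chunk_size <= 0:
        return seq_len  # chunking disabled: one-shot MLP
    return min(cfg.prefill_chunk_size, seq_len)

print(effective_chunk(OffloadConfig(), 65536))                      # 8192
print(effective_chunk(OffloadConfig(prefill_chunk_size=0), 65536))  # 65536
```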
## Memory Estimation Formula

For a given sequence length S and model config:

```
MLP_peak_memory = S × intermediate_size × 2 × 2 bytes
                = S × 14336 × 4 bytes
```

For S = 65536:

```
MLP_peak = 65536 × 14336 × 4 = 3.76 GB
```

Maximum safe sequence length for RTX 3090 (24 GB):

```
S_max = available_memory / (intermediate_size × 4)
      = 6 GB / (14336 × 4)
      ≈ 100K tokens (theoretical)
      ≈ 8-16K tokens (practical, with safety margin)
```
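The formula translates directly to code; plugging in the values above reproduces both the 3.76 GB peak and the ~100K theoretical ceiling:

```python
INTERMEDIATE_SIZE = 14336  # LLaMA 3.1 8B

def mlp_peak_bytes(seq_len: int) -> int:
    # gate and up projections concatenated (x2), BF16 (2 bytes/element)
    return seq_len * INTERMEDIATE_SIZE * 2 * 2

def max_seq_len(available_bytes: float) -> int:
    return int(available_bytes // (INTERMEDIATE_SIZE * 4))

print(mlp_peak_bytes(65536) / 1e9)  # ~3.76 GB
print(max_seq_len(6e9))             # ~104K tokens (theoretical)
```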
## Reproduction Steps

```shell
cd /home/zijie/Code/COMPASS/eval/RULER/scripts
# Set SEQ_LENGTHS to 65536 in config_models.sh
# Then run:
./run.sh llama3.1-8b-nanovllm synthetic --metric full --task niah_single_1
```
## Related Files

- `nanovllm/engine/model_runner.py`: `run_layerwise_offload_prefill()` (line 751+)
- `nanovllm/models/llama.py`: `LlamaMLP.forward()` (line 103)
- `nanovllm/config.py`: Config parameters
- RULER integration: `eval/RULER/scripts/pred/model_wrappers.py`