nano-vllm/docs/64k_mlp_activation_oom.md
Zijie Tian 86633004ca 📝 docs: add 64k memory analysis and test configuration updates
Add comprehensive memory analysis for 64k inference on Llama 3.1 8B:

New documentation:
- docs/64k_memory_analysis.md: GPU-only vs offload memory analysis,
  OOM root cause (memory fragmentation), RTX 3090 limitations,
  theoretical vs actual memory usage breakdown

Test configuration updates:
- tests/test_ruler.py: Add --num-kv-buffers parameter for ring buffer
  size tuning (default 4, can reduce to 1 for lower memory)
- Update default data_dir to ruler_64k
- Update default max_model_len to 65664 for 64k support

CLAUDE.md updates:
- Add 64k_memory_analysis.md to documentation index
- Document num_kv_buffers parameter in Configuration section
- Add 64k hardware requirements note to Model Limits

Key findings: 64k inference requires ~26GB (GPU-only) or ~23GB (offload)
due to memory fragmentation on 24GB GPUs, making A100 (40GB+) the
recommended hardware for 64k workloads.

Co-Authored-By: Claude <noreply@anthropic.com>
2026-01-14 07:02:09 +08:00


64K Prefill MLP Activation OOM Issue

Problem Summary

When running the RULER benchmark with a 64K context length in CPU offload mode, an OOM occurs during the MLP forward pass in run_layerwise_offload_prefill. The KV cache is successfully offloaded to CPU, but the MLP's intermediate activations exceed the available GPU memory.

Environment

  • GPU: RTX 3090 (24GB)
  • Model: LLaMA 3.1 8B
  • Sequence Length: 65536 tokens
  • Mode: enable_cpu_offload=True, num_gpu_blocks=2

Error Message

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.47 GiB.
GPU 0 has a total capacity of 23.57 GiB of which 2.66 GiB is free.
Including non-PyTorch memory, this process has 20.88 GiB memory in use.
Of the allocated memory 20.51 GiB is allocated by PyTorch, and 32.26 MiB
is reserved by PyTorch but unallocated.

Stack Trace

File "nanovllm/engine/model_runner.py", line 843, in run_layerwise_offload_prefill
    hidden_states = layer.mlp(hidden_states)
  File "nanovllm/models/llama.py", line 103, in forward
    gate_up = self.gate_up_proj(x)
  File "nanovllm/layers/linear.py", line 73, in forward
    return F.linear(x, self.weight, self.bias)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.47 GiB.

Root Cause Analysis

Memory Breakdown

| Component                 | Calculation                         | Size    |
|---------------------------|-------------------------------------|---------|
| Model weights (BF16)      | 8B params × 2 bytes                 | ~16 GB  |
| GPU KV cache              | 2 blocks × 1024 tokens × 8 KB/token | ~16 MB  |
| Remaining for activations | 24 GB − 16 GB − overhead            | ~6-7 GB |

MLP Activation Memory (per layer)

For LLaMA 3.1 8B with hidden_size=4096, intermediate_size=14336:

| Tensor          | Shape          | Size (BF16) |
|-----------------|----------------|-------------|
| MLP input       | [65536, 4096]  | 512 MB      |
| gate_up output  | [65536, 28672] | 3.47 GB     |
| down_proj input | [65536, 14336] | 1.75 GB     |
| MLP output      | [65536, 4096]  | 512 MB      |

Peak MLP memory: ~3.5-4 GB for intermediate tensors
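
The table entries can be sanity-checked with a few lines of arithmetic (shapes from LLaMA 3.1 8B: hidden_size=4096, intermediate_size=14336, gate and up projections fused into a single [S, 2 × 14336] output; the ~3.5 GiB result matches the ~3.47 GiB allocation in the error message):

```python
# Quick check of the activation sizes above (BF16 = 2 bytes per element).
S = 65536       # sequence length (tokens)
HIDDEN = 4096
INTER = 14336

def size_gib(rows: int, cols: int, bytes_per_el: int = 2) -> float:
    return rows * cols * bytes_per_el / 2**30

print(f"MLP input:       {size_gib(S, HIDDEN):.2f} GiB")     # 0.50
print(f"gate_up output:  {size_gib(S, 2 * INTER):.2f} GiB")  # 3.50
print(f"down_proj input: {size_gib(S, INTER):.2f} GiB")      # 1.75
```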

Why OOM Occurs

  1. Model weights consume ~16 GB (kept resident on the GPU for layer-wise processing)
  2. Memory available for activations: ~7 GB
  3. MLP gate_up_proj output alone: 3.47 GB
  4. Additional tensors (residual stream, attention outputs, allocator fragmentation): ~1-2 GB
  5. Peak demand exceeds the free pool at allocation time → OOM
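
Plugging in the figures from the error message confirms the failure mode (values in GiB, taken verbatim from the report above):

```python
# Arithmetic from the error message: the 3.47 GiB gate_up allocation
# cannot fit in the ~2.66 GiB that remains free on the 23.57 GiB device.
total_gib = 23.57      # GPU 0 total capacity
in_use_gib = 20.88     # process memory in use (incl. non-PyTorch)
request_gib = 3.47     # attempted gate_up_proj allocation

free_gib = total_gib - in_use_gib
print(round(free_gib, 2))       # 2.69 (error reports 2.66 GiB actually free)
print(request_gib > free_gib)   # True -> torch.OutOfMemoryError
```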

Code Location

The issue is in nanovllm/engine/model_runner.py:

# Line 843 in run_layerwise_offload_prefill
hidden_states = layer.mlp(hidden_states)  # <-- OOM here

The entire sequence (65536 tokens) is passed through MLP in one shot.

Current Configuration

From model_wrappers.py (RULER integration):

llm_kwargs = {
    "max_model_len": max_model_len,           # 128 * 1024
    "max_num_batched_tokens": max_model_len,  # Same as max_model_len
    "enable_cpu_offload": True,
    "num_gpu_blocks": 2,
    ...
}

Setting max_num_batched_tokens = max_model_len causes nanovllm to process all tokens at once.
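
One possible mitigation at the integration layer — assuming nanovllm honors a max_num_batched_tokens smaller than max_model_len by splitting the prefill into multiple forward passes — is to cap it directly in model_wrappers.py. This is a sketch, not a verified fix:

```python
# Hypothetical change to llm_kwargs in model_wrappers.py: cap the number of
# tokens per forward pass so the gate_up activation stays small.
# At 8192 tokens, gate_up is 8192 x 28672 x 2 bytes ~= 0.44 GiB (vs 3.47 GiB).
max_model_len = 128 * 1024

llm_kwargs = {
    "max_model_len": max_model_len,
    "max_num_batched_tokens": 8192,   # was max_model_len
    "enable_cpu_offload": True,
    "num_gpu_blocks": 2,
}
```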

Potential Solutions

Option 1: Chunked MLP Processing

Modify run_layerwise_offload_prefill to process MLP in chunks:

# Instead of:
hidden_states = layer.mlp(hidden_states)

# Do:
chunk_size = 8192  # Process 8K tokens at a time
chunks = hidden_states.split(chunk_size, dim=0)
outputs = []
for chunk in chunks:
    outputs.append(layer.mlp(chunk))
hidden_states = torch.cat(outputs, dim=0)

Option 2: Activation Checkpointing

Use gradient checkpointing to recompute activations instead of storing them:

from torch.utils.checkpoint import checkpoint
hidden_states = checkpoint(layer.mlp, hidden_states, use_reentrant=False)

Note: checkpointing trades memory for recomputation during the backward pass; under inference (torch.no_grad()), the peak forward allocation inside the MLP is unchanged, so Options 1 and 3 address this OOM more directly.

Option 3: Reduce Chunk Size via Config

Add a new config parameter prefill_chunk_size to control how many tokens are processed per forward pass.
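
A minimal sketch of Option 3 — the prefill_chunk_size field and the chunk iterator below are hypothetical, not existing nanovllm API:

```python
from dataclasses import dataclass

@dataclass
class Config:
    max_model_len: int = 65536
    prefill_chunk_size: int = 8192  # tokens per MLP forward; 0 = whole sequence

def chunk_ranges(num_tokens: int, cfg: Config):
    """Yield (start, end) token ranges for chunked prefill processing."""
    step = cfg.prefill_chunk_size or num_tokens
    for start in range(0, num_tokens, step):
        yield start, min(start + step, num_tokens)

# run_layerwise_offload_prefill would then loop over the ranges:
#   for s, e in chunk_ranges(num_tokens, cfg):
#       hidden_states[s:e] = layer.mlp(hidden_states[s:e])
```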

Memory Estimation Formula

For a given sequence length S and model config:

MLP_peak_memory = S × intermediate_size × 2 (gate + up) × 2 bytes (BF16)
                = S × 14336 × 4 bytes

For S = 65536:
MLP_peak = 65536 × 14336 × 4 = 3.76 GB

Maximum safe sequence length for RTX 3090 (24GB):

S_max = available_memory / (intermediate_size × 4)
      = 6GB / (14336 × 4)
      ≈ 100K tokens (theoretical)
      ≈ 8-16K tokens (practical, with safety margin)
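
The two formulas above as a small helper (assumes BF16 and the fused gate+up projection; figures are decimal GB, as in the text):

```python
def mlp_peak_bytes(seq_len: int, intermediate_size: int = 14336) -> int:
    # 2x intermediate_size for the fused gate+up output, 2 bytes for BF16
    return seq_len * intermediate_size * 2 * 2

def max_safe_seq_len(available_bytes: float, intermediate_size: int = 14336) -> int:
    return int(available_bytes / (intermediate_size * 4))

print(mlp_peak_bytes(65536) / 1e9)   # ~3.76 GB, matching the worked example
print(max_safe_seq_len(6e9))         # ~104K tokens (theoretical upper bound)
```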

Reproduction Steps

cd /home/zijie/Code/COMPASS/eval/RULER/scripts

# Set SEQ_LENGTHS to 65536 in config_models.sh
# Then run:
./run.sh llama3.1-8b-nanovllm synthetic --metric full --task niah_single_1

Related Files

  • nanovllm/engine/model_runner.py: run_layerwise_offload_prefill() (line 751+)
  • nanovllm/models/llama.py: LlamaMLP.forward() (line 103)
  • nanovllm/config.py: Config parameters
  • RULER integration: eval/RULER/scripts/pred/model_wrappers.py