📝 docs: add 64k memory analysis and test configuration updates

Add comprehensive memory analysis for 64k inference on Llama 3.1 8B: New documentation: - docs/64k_memory_analysis.md: GPU-only vs offload memory analysis, OOM root cause (memory fragmentation), RTX 3090 limitations, theoretical vs actual memory usage breakdown Test configuration updates: - tests/test_ruler.py: Add --num-kv-buffers parameter for ring buffer size tuning (default 4, can reduce to 1 for lower memory) - Update default data_dir to ruler_64k - Update default max_model_len to 65664 for 64k support CLAUDE.md updates: - Add 64k_memory_analysis.md to documentation index - Document num_kv_buffers parameter in Configuration section - Add 64k hardware requirements note to Model Limits Key findings: 64k inference requires ~26GB (GPU-only) or ~23GB (offload) due to memory fragmentation on 24GB GPUs, making A100 (40GB+) the recommended hardware for 64k workloads. Co-Authored-By: Claude <noreply@anthropic.com>
2026-01-14 07:02:09 +08:00
parent c51a640a29
commit 86633004ca
4 changed files with 303 additions and 4 deletions
--- a/docs/64k_mlp_activation_oom.md
+++ b/docs/64k_mlp_activation_oom.md
@@ -0,0 +1,161 @@
+# 64K Prefill MLP Activation OOM Issue
+
+## Problem Summary
+
+When running RULER benchmark with 64K context length using CPU offload mode, OOM occurs during MLP forward pass in `run_layerwise_offload_prefill`. The KV cache is successfully offloaded to CPU, but MLP intermediate activations exceed available GPU memory.
+
+## Environment
+
+- GPU: RTX 3090 (24GB)
+- Model: LLaMA 3.1 8B
+- Sequence Length: 65536 tokens
+- Mode: `enable_cpu_offload=True`, `num_gpu_blocks=2`
+
+## Error Message
+
+```
+torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.47 GiB.
+GPU 0 has a total capacity of 23.57 GiB of which 2.66 GiB is free.
+Including non-PyTorch memory, this process has 20.88 GiB memory in use.
+Of the allocated memory 20.51 GiB is allocated by PyTorch, and 32.26 MiB
+is reserved by PyTorch but unallocated.
+```
+
+## Stack Trace
+
+```
+File "nanovllm/engine/model_runner.py", line 843, in run_layerwise_offload_prefill
+    hidden_states = layer.mlp(hidden_states)
+  File "nanovllm/models/llama.py", line 103, in forward
+    gate_up = self.gate_up_proj(x)
+  File "nanovllm/layers/linear.py", line 73, in forward
+    return F.linear(x, self.weight, self.bias)
+torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.47 GiB.
+```
+
+## Root Cause Analysis
+
+### Memory Breakdown
+
+| Component | Calculation | Size |
+|-----------|-------------|------|
+| Model weights (BF16) | 8B params × 2 bytes | ~16 GB |
+| GPU KV cache | 2 blocks × 1024 tokens × 8KB/token | ~16 MB |
+| **Remaining for activations** | 24 - 16 - overhead | **~6-7 GB** |
+
+### MLP Activation Memory (per layer)
+
+For LLaMA 3.1 8B with `hidden_size=4096`, `intermediate_size=14336`:
+
+| Tensor | Shape | Size (BF16) |
+|--------|-------|-------------|
+| MLP input | [65536, 4096] | 512 MB |
+| gate_up output | [65536, 28672] | **3.47 GB** |
+| down_proj input | [65536, 14336] | 1.75 GB |
+| MLP output | [65536, 4096] | 512 MB |
+
+**Peak MLP memory**: ~3.5-4 GB for intermediate tensors
+
+### Why OOM Occurs
+
+1. Model weights consume ~16 GB (loaded on GPU for layer-wise processing)
+2. Available memory: ~7 GB
+3. MLP `gate_up_proj` output: 3.47 GB
+4. Additional tensors (input, gradients, etc.): ~1-2 GB
+5. **Total required > Available** → OOM
+
+## Code Location
+
+The issue is in `nanovllm/engine/model_runner.py`:
+
+```python
+# Line 843 in run_layerwise_offload_prefill
+hidden_states = layer.mlp(hidden_states)  # <-- OOM here
+```
+
+The entire sequence (65536 tokens) is passed through MLP in one shot.
+
+## Current Configuration
+
+From `model_wrappers.py` (RULER integration):
+
+```python
+llm_kwargs = {
+    "max_model_len": max_model_len,           # 128 * 1024
+    "max_num_batched_tokens": max_model_len,  # Same as max_model_len
+    "enable_cpu_offload": True,
+    "num_gpu_blocks": 2,
+    ...
+}
+```
+
+Setting `max_num_batched_tokens = max_model_len` causes nanovllm to process all tokens at once.
+
+## Potential Solutions
+
+### Option 1: Chunked MLP Processing
+
+Modify `run_layerwise_offload_prefill` to process MLP in chunks:
+
+```python
+# Instead of:
+hidden_states = layer.mlp(hidden_states)
+
+# Do:
+chunk_size = 8192  # Process 8K tokens at a time
+chunks = hidden_states.split(chunk_size, dim=0)
+outputs = []
+for chunk in chunks:
+    outputs.append(layer.mlp(chunk))
+hidden_states = torch.cat(outputs, dim=0)
+```
+
+### Option 2: Activation Checkpointing
+
+Use gradient checkpointing to recompute activations instead of storing them:
+
+```python
+from torch.utils.checkpoint import checkpoint
+hidden_states = checkpoint(layer.mlp, hidden_states, use_reentrant=False)
+```
+
+### Option 3: Reduce Chunk Size via Config
+
+Add a new config parameter `prefill_chunk_size` to control how many tokens are processed per forward pass.
+
+## Memory Estimation Formula
+
+For a given sequence length `S` and model config:
+
+```
+MLP_peak_memory = S × intermediate_size × 2 × 2 bytes
+                = S × 14336 × 4 bytes
+
+For S = 65536:
+MLP_peak = 65536 × 14336 × 4 = 3.76 GB
+```
+
+Maximum safe sequence length for RTX 3090 (24GB):
+```
+S_max = available_memory / (intermediate_size × 4)
+      = 6GB / (14336 × 4)
+      ≈ 100K tokens (theoretical)
+      ≈ 8-16K tokens (practical, with safety margin)
+```
+
+## Reproduction Steps
+
+```bash
+cd /home/zijie/Code/COMPASS/eval/RULER/scripts
+
+# Set SEQ_LENGTHS to 65536 in config_models.sh
+# Then run:
+./run.sh llama3.1-8b-nanovllm synthetic --metric full --task niah_single_1
+```
+
+## Related Files
+
+- `nanovllm/engine/model_runner.py`: `run_layerwise_offload_prefill()` (line 751+)
+- `nanovllm/models/llama.py`: `LlamaMLP.forward()` (line 103)
+- `nanovllm/config.py`: Config parameters
+- RULER integration: `eval/RULER/scripts/pred/model_wrappers.py`