📝 docs: add 64k memory analysis and test configuration updates
Add comprehensive memory analysis for 64k inference on Llama 3.1 8B.

New documentation:
- docs/64k_memory_analysis.md: GPU-only vs offload memory analysis, OOM root cause (memory fragmentation), RTX 3090 limitations, theoretical vs actual memory usage breakdown

Test configuration updates:
- tests/test_ruler.py: Add --num-kv-buffers parameter for ring buffer size tuning (default 4, can reduce to 1 for lower memory)
- Update default data_dir to ruler_64k
- Update default max_model_len to 65664 for 64k support

CLAUDE.md updates:
- Add 64k_memory_analysis.md to documentation index
- Document num_kv_buffers parameter in Configuration section
- Add 64k hardware requirements note to Model Limits

Key findings: 64k inference requires ~26GB (GPU-only) or ~23GB (offload) due to memory fragmentation on 24GB GPUs, making A100 (40GB+) the recommended hardware for 64k workloads.

Co-Authored-By: Claude <noreply@anthropic.com>
docs/64k_mlp_activation_oom.md (new file, 161 lines)
@@ -0,0 +1,161 @@
# 64K Prefill MLP Activation OOM Issue

## Problem Summary

When running the RULER benchmark with a 64K context length in CPU offload mode, an OOM occurs during the MLP forward pass in `run_layerwise_offload_prefill`. The KV cache is successfully offloaded to the CPU, but the MLP intermediate activations exceed the available GPU memory.
## Environment

- GPU: RTX 3090 (24GB)
- Model: LLaMA 3.1 8B
- Sequence Length: 65536 tokens
- Mode: `enable_cpu_offload=True`, `num_gpu_blocks=2`
## Error Message

```
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.47 GiB.
GPU 0 has a total capacity of 23.57 GiB of which 2.66 GiB is free.
Including non-PyTorch memory, this process has 20.88 GiB memory in use.
Of the allocated memory 20.51 GiB is allocated by PyTorch, and 32.26 MiB
is reserved by PyTorch but unallocated.
```
## Stack Trace

```
File "nanovllm/engine/model_runner.py", line 843, in run_layerwise_offload_prefill
    hidden_states = layer.mlp(hidden_states)
File "nanovllm/models/llama.py", line 103, in forward
    gate_up = self.gate_up_proj(x)
File "nanovllm/layers/linear.py", line 73, in forward
    return F.linear(x, self.weight, self.bias)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.47 GiB.
```
## Root Cause Analysis

### Memory Breakdown

| Component | Calculation | Size |
|-----------|-------------|------|
| Model weights (BF16) | 8B params × 2 bytes | ~16 GB |
| GPU KV cache | 2 blocks × 1024 tokens × 8KB/token | ~16 MB |
| **Remaining for activations** | 24 GB - 16 GB - overhead | **~6-7 GB** |
### MLP Activation Memory (per layer)

For LLaMA 3.1 8B with `hidden_size=4096`, `intermediate_size=14336`:

| Tensor | Shape | Size (BF16) |
|--------|-------|-------------|
| MLP input | [65536, 4096] | 512 MB |
| gate_up output | [65536, 28672] | **3.47 GB** |
| down_proj input | [65536, 14336] | 1.75 GB |
| MLP output | [65536, 4096] | 512 MB |

**Peak MLP memory**: ~3.5-4 GB for intermediate tensors
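These figures are just `numel × 2 bytes` for BF16 tensors; a quick sanity check in plain Python (no nanovllm code involved):

```python
# Sanity check of the table above: BF16 = 2 bytes per element.
seq_len = 65536
hidden_size = 4096
intermediate_size = 14336

def gib(n_bytes):
    return n_bytes / 1024**3

mlp_input    = seq_len * hidden_size * 2               # [65536, 4096]
gate_up_out  = seq_len * 2 * intermediate_size * 2     # [65536, 28672]
down_proj_in = seq_len * intermediate_size * 2         # [65536, 14336]

print(f"MLP input:      {gib(mlp_input):.2f} GiB")     # 0.50 GiB
print(f"gate_up output: {gib(gate_up_out):.2f} GiB")   # 3.50 GiB
print(f"down_proj in:   {gib(down_proj_in):.2f} GiB")  # 1.75 GiB
```

The computed `gate_up` size for the full 65536 tokens is ~3.50 GiB; the 3.47 GiB in the error message presumably corresponds to a prompt slightly shorter than the full context.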
### Why OOM Occurs

1. Model weights consume ~16 GB (loaded on GPU for layer-wise processing)
2. Available memory: ~7 GB
3. MLP `gate_up_proj` output: 3.47 GB
4. Additional tensors (layer input, temporary buffers, etc.): ~1-2 GB
5. **Total required > available** → OOM
## Code Location

The issue is in `nanovllm/engine/model_runner.py`:

```python
# Line 843 in run_layerwise_offload_prefill
hidden_states = layer.mlp(hidden_states)  # <-- OOM here
```

The entire sequence (65536 tokens) is passed through the MLP in a single forward call.
## Current Configuration

From `model_wrappers.py` (RULER integration):

```python
llm_kwargs = {
    "max_model_len": max_model_len,           # 128 * 1024
    "max_num_batched_tokens": max_model_len,  # Same as max_model_len
    "enable_cpu_offload": True,
    "num_gpu_blocks": 2,
    ...
}
```

Setting `max_num_batched_tokens = max_model_len` causes nanovllm to process all tokens at once.
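One obvious experiment is to lower `max_num_batched_tokens` so that prefill is fed to the model in smaller batches. Whether the layer-wise offload path actually splits a single long prompt on this value has not been verified here, so the setting below is illustrative only:

```python
# Illustrative only: cap prefill batches at 8K tokens instead of max_model_len.
# It is not verified that run_layerwise_offload_prefill honors this cap.
llm_kwargs = {
    "max_model_len": max_model_len,
    "max_num_batched_tokens": 8192,   # was max_model_len (128 * 1024)
    "enable_cpu_offload": True,
    "num_gpu_blocks": 2,
    # ... other kwargs as above ...
}
```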
## Potential Solutions

### Option 1: Chunked MLP Processing

Modify `run_layerwise_offload_prefill` to process the MLP in chunks:

```python
# Instead of:
hidden_states = layer.mlp(hidden_states)

# Do:
chunk_size = 8192  # Process 8K tokens at a time
chunks = hidden_states.split(chunk_size, dim=0)
outputs = []
for chunk in chunks:
    outputs.append(layer.mlp(chunk))
hidden_states = torch.cat(outputs, dim=0)
```
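A variant of the same idea writes each chunk into a pre-allocated buffer, so the output list and the final `torch.cat` result are never held at the same time (a sketch; it assumes `layer.mlp` maps `[N, hidden_size]` to `[N, hidden_size]`, which holds for `LlamaMLP`):

```python
# Sketch: chunked MLP with a pre-allocated output buffer.
chunk_size = 8192
output = torch.empty_like(hidden_states)  # same [seq_len, hidden_size] shape as the input
for start in range(0, hidden_states.shape[0], chunk_size):
    end = start + chunk_size
    output[start:end] = layer.mlp(hidden_states[start:end])
hidden_states = output
```

At `chunk_size = 8192` the per-chunk `gate_up` intermediate drops from ~3.5 GiB to ~448 MiB.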
### Option 2: Activation Checkpointing

Use gradient checkpointing to recompute activations instead of storing them:

```python
from torch.utils.checkpoint import checkpoint

hidden_states = checkpoint(layer.mlp, hidden_states, use_reentrant=False)
```

Caveat: checkpointing mainly saves memory that would otherwise be retained for a backward pass; the transient `gate_up_proj` output is still allocated inside the forward call, so for inference-only prefill this is unlikely to resolve the OOM on its own.
### Option 3: Reduce Chunk Size via Config

Add a new config parameter `prefill_chunk_size` to control how many tokens are processed per forward pass; a minimal sketch of how this could be wired up follows.
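A minimal sketch, assuming a dataclass-style `Config` as in `nanovllm/config.py`; the field name `prefill_chunk_size` and the helper `mlp_in_chunks` are proposals, not existing code:

```python
import torch
from dataclasses import dataclass

@dataclass
class Config:
    # ... existing fields ...
    prefill_chunk_size: int = 8192  # proposed: tokens per MLP forward during offload prefill

def mlp_in_chunks(mlp, hidden_states: torch.Tensor, chunk_size: int) -> torch.Tensor:
    """Run `mlp` over `hidden_states` in slices of at most `chunk_size` tokens."""
    if chunk_size <= 0 or hidden_states.shape[0] <= chunk_size:
        return mlp(hidden_states)
    outputs = [mlp(chunk) for chunk in hidden_states.split(chunk_size, dim=0)]
    return torch.cat(outputs, dim=0)

# In run_layerwise_offload_prefill, the call site would become:
# hidden_states = mlp_in_chunks(layer.mlp, hidden_states, config.prefill_chunk_size)
```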
## Memory Estimation Formula

For a given sequence length `S` and model config:

```
MLP_peak_memory = S × intermediate_size × 2 × 2 bytes
                = S × 14336 × 4 bytes

For S = 65536:
MLP_peak = 65536 × 14336 × 4 = 3.76 GB
```

Maximum safe sequence length for RTX 3090 (24GB):

```
S_max = available_memory / (intermediate_size × 4)
      = 6GB / (14336 × 4)
      ≈ 100K tokens (theoretical)
      ≈ 8-16K tokens (practical, with safety margin)
```
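The same estimate as a small helper (plain arithmetic; the ~6 GB activation budget is the figure assumed above):

```python
# Plain-arithmetic version of the formulas above (LLaMA 3.1 8B, BF16).
INTERMEDIATE_SIZE = 14336
BYTES_PER_TOKEN = INTERMEDIATE_SIZE * 2 * 2  # gate + up halves, 2 bytes each

def mlp_peak_bytes(seq_len: int) -> int:
    return seq_len * BYTES_PER_TOKEN

def max_seq_len(available_bytes: int) -> int:
    return available_bytes // BYTES_PER_TOKEN

print(mlp_peak_bytes(65536) / 1e9)   # ~3.76 GB
print(max_seq_len(int(6e9)))         # ~104,000 tokens (theoretical)
```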
## Reproduction Steps

```bash
cd /home/zijie/Code/COMPASS/eval/RULER/scripts

# Set SEQ_LENGTHS to 65536 in config_models.sh
# Then run:
./run.sh llama3.1-8b-nanovllm synthetic --metric full --task niah_single_1
```
## Related Files

- `nanovllm/engine/model_runner.py`: `run_layerwise_offload_prefill()` (line 751+)
- `nanovllm/models/llama.py`: `LlamaMLP.forward()` (line 103)
- `nanovllm/config.py`: Config parameters
- RULER integration: `eval/RULER/scripts/pred/model_wrappers.py`