nano-vllm/docs/issue_xattn_offload_gqa_buffer_oom.md
Zijie Tian ef37d4f1a8 🐛 docs: document XAttention offload GQA buffer OOM issue
Document OOM issue when using XAttention BSA + CPU offload
with large models (GLM-4-9B) on 24GB GPUs.

Issue: 8GB allocation for k_expanded buffer fails due to
using num_heads instead of num_kv_heads in GQA models.

Root cause analysis and proposed fix included.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 02:46:50 +08:00

# Issue: XAttention Offload Mode GQA Buffer OOM
## Problem Description
Running large models such as GLM-4-9B in XAttention BSA (Block Sparse Attention) + CPU offload mode fails with a CUDA OOM error.
### Error Message
```
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.00 GiB.
GPU 0 has a total capacity of 23.57 GiB of which 4.19 GiB is free.
```
### Reproduction Environment
| Item | Value |
|------|-------|
| Model | GLM-4-9B-Chat-1M |
| GPU | RTX 3090 (24GB) |
| Context Length | 32K |
| sparse_policy | XATTN_BSA |
| enable_cpu_offload | true |
| max_model_len | 1048576 (1M) |
### Error Location
```
File "nanovllm/kvcache/sparse/xattn_bsa.py", line 246, in alloc_policy_metadata
    self._k_expanded = torch.empty(shape, dtype=dtype, device=device)
```
---
## Analysis
### Memory Allocation Breakdown
`alloc_policy_metadata()` allocates the following buffers during KV cache initialization:
| Buffer | Purpose | Size (GLM-4, 1M seq) |
|--------|---------|----------------------|
| `_prefill_mask_buffer` | BSA mask | ~32 MB |
| `_m_partial_buffer` | KV chunking m stats | ~32 MB |
| `_l_partial_buffer` | KV chunking l stats | ~32 MB |
| `_block_sums_buffer` | Block sums | ~64 MB |
| **`_k_expanded`** | GQA K expansion | **~8 GB** |
| **`_v_expanded`** | GQA V expansion | **~8 GB** |
### GQA Buffer Size Calculation
```python
shape = (1, num_heads, max_seq_len, head_dim)
      = (1, 32, 1048576, 128)
size  = 1 × 32 × 1048576 × 128 × 2 bytes (fp16)
      = 8,589,934,592 bytes
      = 8 GiB per buffer
```
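The arithmetic can be sanity-checked directly. For comparison, sizing the buffer by `num_kv_heads` instead of `num_heads` would already shrink it dramatically (the `num_kv_heads = 2` value below is an assumption for illustration only; check the actual GLM-4-9B config):

```python
# Buffer size as currently allocated (sized by num_heads) vs. hypothetically
# sized by num_kv_heads. num_kv_heads=2 is an assumed value for illustration.
num_heads, num_kv_heads = 32, 2
max_seq_len, head_dim, elem_size = 1_048_576, 128, 2  # fp16 = 2 bytes

by_q_heads = 1 * num_heads * max_seq_len * head_dim * elem_size
by_kv_heads = 1 * num_kv_heads * max_seq_len * head_dim * elem_size

print(by_q_heads)            # 8589934592 bytes = 8 GiB, matching the OOM report
print(by_kv_heads // 2**20)  # 512 MiB if sized by KV heads instead
```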
### Root Cause
1. **Conflicting design intent**: the docstrings for `_k_expanded` / `_v_expanded` explicitly say they are "for GPU-only mode"
2. **Incomplete guard**: the code only checks `num_heads == num_kv_heads` to skip the allocation; it never checks for offload mode
3. **Offload mode does not need these buffers**: `compute_chunked_prefill()` uses a different compute path that does not rely on the pre-allocated GQA buffers
### Relevant Code
```python
# xattn_bsa.py:238-247
# Only allocate GQA expansion buffers if GQA (num_heads != num_kv_heads)
if num_heads == num_kv_heads:
    logger.info(f"[XAttn] No GQA expansion needed (num_heads == num_kv_heads = {num_heads})")
    return  # <-- only checks GQA; never checks offload mode
# Shape: [1, num_heads, max_seq_len, head_dim] for xattn_estimate format
shape = (1, num_heads, max_seq_len, head_dim)
self._k_expanded = torch.empty(shape, dtype=dtype, device=device)  # <-- OOM here
self._v_expanded = torch.empty(shape, dtype=dtype, device=device)
```
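To make the incomplete guard concrete, here is a torch-free stand-in that returns the bytes the two buffers would occupy; the `fixed` flag models the proposed offload check (dimensions are the GLM-4 values from the table above, `num_kv_heads = 2` is assumed):

```python
def gqa_buffer_bytes(num_heads, num_kv_heads, max_seq_len, head_dim,
                     elem_size=2, fixed=False, enable_cpu_offload=False):
    """Bytes that _k_expanded + _v_expanded together would occupy under each guard."""
    if fixed and enable_cpu_offload:
        return 0  # proposed fix: offload path skips the buffers entirely
    if num_heads == num_kv_heads:
        return 0  # existing check: no GQA expansion needed
    # Two buffers, each (1, num_heads, max_seq_len, head_dim) in fp16
    return 2 * (1 * num_heads * max_seq_len * head_dim * elem_size)

# Current guard still allocates 16 GiB total even in offload mode:
current = gqa_buffer_bytes(32, 2, 1_048_576, 128, enable_cpu_offload=True)
# With the proposed check, offload mode allocates nothing:
patched = gqa_buffer_bytes(32, 2, 1_048_576, 128, enable_cpu_offload=True, fixed=True)
print(current, patched)  # 17179869184 0
```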
---
## Proposed Fixes
### Option 1: Skip GQA Buffer Allocation in Offload Mode (Recommended)
Add an offload-mode check to `alloc_policy_metadata()`:
```python
def alloc_policy_metadata(
    self,
    num_heads: int,
    num_kv_heads: int,
    head_dim: int,
    max_seq_len: int,
    dtype: torch.dtype,
    device: torch.device,
    enable_cpu_offload: bool = False,  # <-- new parameter
) -> None:
    # ... allocate mask buffer and KV chunking buffers (needed in offload mode)

    # Skip GQA buffers in offload mode:
    # chunked prefill uses compute_chunked_prefill(), which doesn't need them
    if enable_cpu_offload:
        logger.info("[XAttn] Offload mode: skipping GQA expansion buffers")
        return

    # GPU-only mode: pre-allocate GQA buffers for compute_prefill()
    if num_heads == num_kv_heads:
        logger.info("[XAttn] No GQA expansion needed")
        return

    shape = (1, num_heads, max_seq_len, head_dim)
    self._k_expanded = torch.empty(shape, dtype=dtype, device=device)
    self._v_expanded = torch.empty(shape, dtype=dtype, device=device)
```
**Files to modify:**
1. `nanovllm/kvcache/sparse/xattn_bsa.py` - the `alloc_policy_metadata()` method
2. `nanovllm/engine/model_runner.py` - pass `enable_cpu_offload` when calling `alloc_policy_metadata()`
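A sketch of the call-site change (the real `model_runner.py` signature and config field names may differ; `StubPolicy` below is purely illustrative):

```python
class StubPolicy:
    """Stand-in for the XAttention BSA policy; records the new flag."""
    def alloc_policy_metadata(self, num_heads, num_kv_heads, head_dim,
                              max_seq_len, dtype, device,
                              enable_cpu_offload=False):
        self.offload_seen = enable_cpu_offload

policy = StubPolicy()
enable_cpu_offload = True  # in the real runner this would come from the engine config
policy.alloc_policy_metadata(
    num_heads=32, num_kv_heads=2, head_dim=128, max_seq_len=1_048_576,
    dtype="float16", device="cuda",
    enable_cpu_offload=enable_cpu_offload,  # <-- the one-line addition at the call site
)
print(policy.offload_seen)  # True
```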
### Option 2: Lazy Allocation
Allocate the GQA buffers only on the first call to `compute_prefill()`; offload mode goes through `compute_chunked_prefill()` and therefore never triggers the allocation.
```python
def compute_prefill(self, ...):
    # Lazy allocation on first use
    if self._k_expanded is None and num_heads != num_kv_heads:
        self._allocate_gqa_buffers(...)
    ...
```
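The same idea end-to-end as a torch-free sketch with toy dimensions (`_allocate_gqa_buffers` stands in for the two `torch.empty` calls): the offload path, which only ever calls `compute_chunked_prefill()`, never pays for the buffers.

```python
class XAttnPolicySketch:
    """Torch-free sketch of lazy GQA buffer allocation."""
    def __init__(self, num_heads, num_kv_heads, max_seq_len, head_dim):
        self.num_heads, self.num_kv_heads = num_heads, num_kv_heads
        self.max_seq_len, self.head_dim = max_seq_len, head_dim
        self._k_expanded = None  # allocated lazily

    def _allocate_gqa_buffers(self):
        # Stands in for torch.empty((1, num_heads, max_seq_len, head_dim), fp16)
        self._k_expanded = bytearray(self.num_heads * self.max_seq_len * self.head_dim * 2)

    def compute_prefill(self):           # GPU-only path
        if self._k_expanded is None and self.num_heads != self.num_kv_heads:
            self._allocate_gqa_buffers()

    def compute_chunked_prefill(self):   # offload path: never touches GQA buffers
        pass

p = XAttnPolicySketch(num_heads=4, num_kv_heads=2, max_seq_len=16, head_dim=8)
p.compute_chunked_prefill()
print(p._k_expanded is None)      # True: offload path stays buffer-free
p.compute_prefill()
print(p._k_expanded is not None)  # True: GPU-only path triggers allocation
```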
### Option 3: Cap the Buffer Size at chunk_size
Instead of pre-allocating for the full `max_seq_len`, allocate only `chunk_size` worth of buffer:
```python
# Before: max_seq_len (1M tokens)  -> 8 GiB
# After:  chunk_size (16K tokens)  -> ~128 MiB
buffer_len = self.chunk_size if enable_cpu_offload else max_seq_len
shape = (1, num_heads, buffer_len, head_dim)
```
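The savings can be checked with the same arithmetic as before; 16K is an assumed `chunk_size` here (the actual configured value may differ):

```python
def expanded_buffer_bytes(seq_len, num_heads=32, head_dim=128, elem_size=2):
    # Size of one (1, num_heads, seq_len, head_dim) fp16 buffer
    return 1 * num_heads * seq_len * head_dim * elem_size

full = expanded_buffer_bytes(1_048_576)  # sized by max_model_len (1M)
chunked = expanded_buffer_bytes(16_384)  # sized by an assumed 16K chunk_size
print(full // 2**30, "GiB vs", chunked // 2**20, "MiB per buffer")  # 8 GiB vs 128 MiB
```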
---
## Verification
After the fix, run the following command to verify:
```bash
cd /home/zijie/Code/COMPASS
GPULIST=0 ./scripts/run_ruler.sh glm4-9b-xattn-nanovllm synthetic xattn --task niah_single_1
```
Expected results:
- No more OOM on the 8 GiB allocation
- The model loads and completes inference normally
---
## Related Documents
- `docs/xattn_bsa_policy_design.md` - XAttention BSA policy design document
- `docs/gpu_only_xattn_guide.md` - GPU-only XAttention guide
## Priority
**High** - blocks 9B+ models from using XAttention + offload mode on 24GB GPUs