🐛 docs: document XAttention offload GQA buffer OOM issue
Document the OOM that occurs when using XAttention BSA + CPU offload with large models (GLM-4-9B) on 24GB GPUs. Issue: the 8 GB allocation for the k_expanded buffer fails because it is sized with num_heads instead of num_kv_heads in GQA models. Root cause analysis and proposed fixes included. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
New file: `docs/issue_xattn_offload_gqa_buffer_oom.md` (169 lines)
# Issue: XAttention Offload Mode GQA Buffer OOM

## Problem Description

When running large models such as GLM-4-9B with XAttention BSA (Block Sparse Attention) + CPU offload mode, a CUDA OOM error occurs.

### Error Message

```
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.00 GiB.
GPU 0 has a total capacity of 23.57 GiB of which 4.19 GiB is free.
```

### Reproduction Environment

| Item | Value |
|------|-------|
| Model | GLM-4-9B-Chat-1M |
| GPU | RTX 3090 (24GB) |
| Context Length | 32K |
| sparse_policy | XATTN_BSA |
| enable_cpu_offload | true |
| max_model_len | 1048576 (1M) |
### 错误位置
|
||||
|
||||
```
|
||||
File "nanovllm/kvcache/sparse/xattn_bsa.py", line 246, in alloc_policy_metadata
|
||||
self._k_expanded = torch.empty(shape, dtype=dtype, device=device)
|
||||
```
|
||||
|
||||
---

## Problem Analysis

### Memory Allocation Analysis

`alloc_policy_metadata()` allocates the following buffers at KV cache initialization:

| Buffer | Purpose | Size (GLM-4, 1M seq) |
|--------|---------|----------------------|
| `_prefill_mask_buffer` | BSA mask | ~32 MB |
| `_m_partial_buffer` | KV chunking m stats | ~32 MB |
| `_l_partial_buffer` | KV chunking l stats | ~32 MB |
| `_block_sums_buffer` | Block sums | ~64 MB |
| **`_k_expanded`** | GQA K expansion | **~8 GB** |
| **`_v_expanded`** | GQA V expansion | **~8 GB** |

### GQA Buffer Size Calculation

```
shape = (1, num_heads, max_seq_len, head_dim)
      = (1, 32, 1048576, 128)

size = 1 × 32 × 1048576 × 128 × 2 bytes (fp16)
     = 8,589,934,592 bytes
     = 8 GiB per buffer
```
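
The arithmetic above can be double-checked with a few lines of Python. This is a standalone sketch; `gqa_buffer_bytes` is a hypothetical helper, not part of the codebase:

```python
def gqa_buffer_bytes(num_heads: int, max_seq_len: int, head_dim: int,
                     bytes_per_elem: int = 2) -> int:
    """Size in bytes of one (1, num_heads, max_seq_len, head_dim) fp16 buffer."""
    return 1 * num_heads * max_seq_len * head_dim * bytes_per_elem

size = gqa_buffer_bytes(num_heads=32, max_seq_len=1_048_576, head_dim=128)
print(size)           # 8589934592 bytes
print(size / 2**30)   # 8.0 (GiB per buffer; two buffers -> 16 GiB total)
```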

### Root Cause

1. **Design-intent conflict**: the doc comments on `_k_expanded` and `_v_expanded` explicitly state they are "for GPU-only mode".
2. **Incomplete guard**: the code only checks `num_heads == num_kv_heads` to skip the allocation; it never checks for offload mode.
3. **Offload mode does not need these buffers**: `compute_chunked_prefill()` uses a different compute path that does not rely on the pre-allocated GQA buffers.
### Relevant Code

```python
# xattn_bsa.py:238-247
# Only allocate GQA expansion buffers if GQA (num_heads != num_kv_heads)
if num_heads == num_kv_heads:
    logger.info(f"[XAttn] No GQA expansion needed (num_heads == num_kv_heads = {num_heads})")
    return  # <-- only checks GQA, never checks offload mode

# Shape: [1, num_heads, max_seq_len, head_dim] for xattn_estimate format
shape = (1, num_heads, max_seq_len, head_dim)
self._k_expanded = torch.empty(shape, dtype=dtype, device=device)  # <-- OOM here
self._v_expanded = torch.empty(shape, dtype=dtype, device=device)
```

---

## Proposed Solutions

### Option 1: Skip GQA Buffer Allocation in Offload Mode (Recommended)

Add an offload-mode check to `alloc_policy_metadata()`:
```python
def alloc_policy_metadata(
    self,
    num_heads: int,
    num_kv_heads: int,
    head_dim: int,
    max_seq_len: int,
    dtype: torch.dtype,
    device: torch.device,
    enable_cpu_offload: bool = False,  # <-- new parameter
) -> None:
    # ... allocate mask buffer and KV chunking buffers (needed in offload mode)

    # Skip GQA buffers in offload mode
    # Chunked prefill uses compute_chunked_prefill() which doesn't need these
    if enable_cpu_offload:
        logger.info("[XAttn] Offload mode: skipping GQA expansion buffers")
        return

    # GPU-only mode: pre-allocate GQA buffers for compute_prefill()
    if num_heads == num_kv_heads:
        logger.info("[XAttn] No GQA expansion needed")
        return

    shape = (1, num_heads, max_seq_len, head_dim)
    self._k_expanded = torch.empty(shape, dtype=dtype, device=device)
    self._v_expanded = torch.empty(shape, dtype=dtype, device=device)
```

**Files to modify**:
1. `nanovllm/kvcache/sparse/xattn_bsa.py` - the `alloc_policy_metadata()` method
2. `nanovllm/engine/model_runner.py` - pass `enable_cpu_offload` when calling `alloc_policy_metadata()`
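
The guard logic of Option 1 can be isolated as a small predicate. This is an illustrative sketch, not the actual nano-vllm code, and the head counts used below are example values, not GLM-4-9B's real configuration:

```python
def needs_gqa_buffers(num_heads: int, num_kv_heads: int,
                      enable_cpu_offload: bool) -> bool:
    """Decide whether the GQA expansion buffers should be allocated."""
    if enable_cpu_offload:
        # Offload path goes through compute_chunked_prefill(), which
        # does not use the pre-allocated GQA buffers.
        return False
    # GPU-only path: expansion is only needed when heads are grouped (GQA).
    return num_heads != num_kv_heads

print(needs_gqa_buffers(32, 2, enable_cpu_offload=True))   # False -> no 8 GiB alloc
print(needs_gqa_buffers(32, 2, enable_cpu_offload=False))  # True  -> GPU-only GQA
print(needs_gqa_buffers(32, 32, enable_cpu_offload=False)) # False -> MHA, no expansion
```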

### Option 2: Lazy Allocation

Allocate the GQA buffers only on the first call to `compute_prefill()`; offload mode goes through `compute_chunked_prefill()` and therefore never triggers the allocation.

```python
def compute_prefill(self, ...):
    # Lazy allocation on first use
    if self._k_expanded is None and num_heads != num_kv_heads:
        self._allocate_gqa_buffers(...)
    ...
```
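
The lazy-allocation pattern can be sketched end to end in plain Python. Names are illustrative, and `torch.empty` is stood in for by a shape tuple and a counter so the sketch runs anywhere:

```python
class LazyGqaBuffers:
    """Minimal sketch: allocate GQA buffers on first compute_prefill() only."""

    def __init__(self, num_heads: int, num_kv_heads: int):
        self.num_heads = num_heads
        self.num_kv_heads = num_kv_heads
        self._k_expanded = None
        self.alloc_calls = 0  # counts how many times we actually allocate

    def _allocate_gqa_buffers(self, max_seq_len: int, head_dim: int) -> None:
        self.alloc_calls += 1
        # Placeholder for torch.empty((1, num_heads, max_seq_len, head_dim), ...)
        self._k_expanded = (1, self.num_heads, max_seq_len, head_dim)

    def compute_prefill(self, max_seq_len: int, head_dim: int) -> None:
        # Lazy allocation on first use; MHA (num_heads == num_kv_heads)
        # never allocates at all.
        if self._k_expanded is None and self.num_heads != self.num_kv_heads:
            self._allocate_gqa_buffers(max_seq_len, head_dim)

policy = LazyGqaBuffers(num_heads=32, num_kv_heads=2)
policy.compute_prefill(1_048_576, 128)
policy.compute_prefill(1_048_576, 128)
print(policy.alloc_calls)  # 1 -- allocated once, on the first prefill
```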

### Option 3: Cap the Buffer Size at chunk_size

Instead of pre-allocating at max_seq_len, allocate only chunk_size worth:

```python
# Before: max_seq_len (1M tokens) -> 8 GB
# After:  chunk_size (16K tokens) -> ~130 MB
buffer_len = self.chunk_size if enable_cpu_offload else max_seq_len
shape = (1, num_heads, buffer_len, head_dim)
```
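
The saving claimed in the comment can be verified numerically. A standalone sketch, using the head counts from the calculation earlier in this document and a 16K chunk size as an assumption:

```python
def buffer_mib(seq_len: int, num_heads: int = 32, head_dim: int = 128,
               bytes_per_elem: int = 2) -> float:
    """Size in MiB of one (1, num_heads, seq_len, head_dim) fp16 buffer."""
    return num_heads * seq_len * head_dim * bytes_per_elem / 2**20

print(buffer_mib(1_048_576))  # 8192.0 MiB (8 GiB -- the failing allocation)
print(buffer_mib(16_384))     # 128.0 MiB  (~130 MB -- fits comfortably)
```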

---

## Verification

After the fix, run the following command to verify:

```bash
cd /home/zijie/Code/COMPASS
GPULIST=0 ./scripts/run_ruler.sh glm4-9b-xattn-nanovllm synthetic xattn --task niah_single_1
```

Expected results:
- No more OOM error from the 8 GB allocation
- The model loads normally and completes inference
---

## Related Documents

- `docs/xattn_bsa_policy_design.md` - XAttention BSA Policy design document
- `docs/gpu_only_xattn_guide.md` - GPU-Only XAttention guide
## Priority

**High** - Blocks 9B+ models from using XAttention + offload mode on 24GB GPUs