From ef37d4f1a836fea6b10d3bb5fe8b1b7bdddfd8eb Mon Sep 17 00:00:00 2001
From: Zijie Tian
Date: Thu, 5 Feb 2026 02:46:50 +0800
Subject: [PATCH] =?UTF-8?q?=F0=9F=90=9B=20docs:=20document=20XAttention=20?=
 =?UTF-8?q?offload=20GQA=20buffer=20OOM=20issue?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Document an OOM that occurs when using XAttention BSA + CPU offload
with large models (GLM-4-9B) on 24 GB GPUs.

Issue: an 8 GB allocation for the k_expanded GQA expansion buffer
(sized for num_heads at max_model_len) fails; the buffer is only
needed in GPU-only mode, not in offload mode.

Root cause analysis and proposed fixes are included.

Co-Authored-By: Claude Opus 4.5

---
 docs/issue_xattn_offload_gqa_buffer_oom.md | 169 +++++++++++++++++++++
 1 file changed, 169 insertions(+)
 create mode 100644 docs/issue_xattn_offload_gqa_buffer_oom.md

diff --git a/docs/issue_xattn_offload_gqa_buffer_oom.md b/docs/issue_xattn_offload_gqa_buffer_oom.md
new file mode 100644
index 0000000..a6511bc
--- /dev/null
+++ b/docs/issue_xattn_offload_gqa_buffer_oom.md
@@ -0,0 +1,169 @@
+# Issue: XAttention Offload Mode GQA Buffer OOM
+
+## Problem Description
+
+Running large models such as GLM-4-9B with XAttention BSA (Block Sparse Attention) + CPU offload triggers a CUDA OOM error.
+
+### Error Message
+
+```
+torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.00 GiB.
+GPU 0 has a total capacity of 23.57 GiB of which 4.19 GiB is free.
+```
+
+### Reproduction Environment
+
+| Item | Value |
+|------|-------|
+| Model | GLM-4-9B-Chat-1M |
+| GPU | RTX 3090 (24 GB) |
+| Context Length | 32K |
+| sparse_policy | XATTN_BSA |
+| enable_cpu_offload | true |
+| max_model_len | 1048576 (1M) |
+
+### Error Location
+
+```
+File "nanovllm/kvcache/sparse/xattn_bsa.py", line 246, in alloc_policy_metadata
+    self._k_expanded = torch.empty(shape, dtype=dtype, device=device)
+```
+
+---
+
+## Analysis
+
+### Memory Allocation Analysis
+
+`alloc_policy_metadata()` allocates the following buffers at KV cache initialization:
+
+| Buffer | Purpose | Size (GLM-4, 1M seq) |
+|--------|---------|----------------------|
+| `_prefill_mask_buffer` | BSA mask | ~32 MB |
+| `_m_partial_buffer` | KV chunking m stats | ~32 MB |
+| `_l_partial_buffer` | KV chunking l stats | ~32 MB |
+| `_block_sums_buffer` | Block sums | ~64 MB |
+| **`_k_expanded`** | GQA K expansion | **~8 GB** |
+| **`_v_expanded`** | GQA V expansion | **~8 GB** |
+
+### GQA Buffer Size Calculation
+
+```python
+shape = (1, num_heads, max_seq_len, head_dim)   # = (1, 32, 1048576, 128)
+
+# fp16 => 2 bytes per element
+size = 1 * 32 * 1048576 * 128 * 2   # = 8,589,934,592 bytes = 8 GB per buffer
+```
+
+### Root Cause
+
+1. **Design intent mismatch**: the doc comments on `_k_expanded` and `_v_expanded` explicitly state they are "for GPU-only mode", yet they are allocated regardless of mode.
+2. **Incomplete guard**: the code only checks `num_heads == num_kv_heads` to decide whether to skip the allocation; it never checks whether offload mode is active.
+3. **Offload mode does not need these buffers**: `compute_chunked_prefill()` takes a different compute path and does not rely on the pre-allocated GQA buffers.
+
+### Relevant Code
+
+```python
+# xattn_bsa.py:238-247
+# Only allocate GQA expansion buffers if GQA (num_heads != num_kv_heads)
+if num_heads == num_kv_heads:
+    logger.info(f"[XAttn] No GQA expansion needed (num_heads == num_kv_heads = {num_heads})")
+    return  # <-- only checks GQA, never checks offload mode
+
+# Shape: [1, num_heads, max_seq_len, head_dim] for xattn_estimate format
+shape = (1, num_heads, max_seq_len, head_dim)
+self._k_expanded = torch.empty(shape, dtype=dtype, device=device)  # <-- OOM here
+self._v_expanded = torch.empty(shape, dtype=dtype, device=device)
+```
+
+---
+
+## Proposed Fixes
+
+### Option 1: Skip GQA Buffer Allocation in Offload Mode (Recommended)
+
+Add an offload-mode check in `alloc_policy_metadata()`:
+
+```python
+def alloc_policy_metadata(
+    self,
+    num_heads: int,
+    num_kv_heads: int,
+    head_dim: int,
+    max_seq_len: int,
+    dtype: torch.dtype,
+    device: torch.device,
+    enable_cpu_offload: bool = False,  # <-- new parameter
+) -> None:
+    # ... allocate mask buffer and KV chunking buffers (still needed in offload mode)
+
+    # Skip GQA buffers in offload mode
+    # Chunked prefill uses compute_chunked_prefill() which doesn't need these
+    if enable_cpu_offload:
+        logger.info("[XAttn] Offload mode: skipping GQA expansion buffers")
+        return
+
+    # GPU-only mode: pre-allocate GQA buffers for compute_prefill()
+    if num_heads == num_kv_heads:
+        logger.info("[XAttn] No GQA expansion needed")
+        return
+
+    shape = (1, num_heads, max_seq_len, head_dim)
+    self._k_expanded = torch.empty(shape, dtype=dtype, device=device)
+    self._v_expanded = torch.empty(shape, dtype=dtype, device=device)
+```
+
+**Files to modify**:
+1. `nanovllm/kvcache/sparse/xattn_bsa.py` - the `alloc_policy_metadata()` method
+2. `nanovllm/engine/model_runner.py` - pass `enable_cpu_offload` when calling `alloc_policy_metadata()`
+
+### Option 2: Lazy Allocation
+
+Allocate the GQA buffers only on the first call to `compute_prefill()`. Offload mode goes through `compute_chunked_prefill()`, so the allocation is never triggered.
+
+```python
+def compute_prefill(self, ...):
+    # Lazy allocation on first use
+    if self._k_expanded is None and num_heads != num_kv_heads:
+        self._allocate_gqa_buffers(...)
+    ...
+```
+
+### Option 3: Cap the Buffer Size at chunk_size
+
+Instead of pre-allocating for max_seq_len, allocate only chunk_size worth of tokens:
+
+```python
+# Before: max_seq_len (1M tokens) -> 8 GB
+# After:  chunk_size (16K tokens) -> ~130 MB
+buffer_len = self.chunk_size if enable_cpu_offload else max_seq_len
+shape = (1, num_heads, buffer_len, head_dim)
+```
+
+---
+
+## Verification
+
+After applying the fix, verify with:
+
+```bash
+cd /home/zijie/Code/COMPASS
+GPULIST=0 ./scripts/run_ruler.sh glm4-9b-xattn-nanovllm synthetic xattn --task niah_single_1
+```
+
+Expected results:
+- The 8 GB allocation OOM no longer occurs
+- The model loads and completes inference normally
+
+---
+
+## Related Documents
+
+- `docs/xattn_bsa_policy_design.md` - XAttention BSA Policy design document
+- `docs/gpu_only_xattn_guide.md` - GPU-Only XAttention guide
+
+## Priority
+
+**High** - blocks 9B+ models from using XAttention + offload mode on 24 GB GPUs
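The buffer sizes quoted in the issue can be sanity-checked with a short standalone sketch. The helper name `gqa_expansion_bytes` is illustrative and not part of the codebase; the GLM-4 values (32 query heads, head_dim 128, 1M max_model_len, fp16) come from the tables above, and the 16K chunk size is the figure assumed in the Option 3 comment:

```python
# Sanity check for the expansion-buffer sizes quoted in this issue.
# gqa_expansion_bytes is a hypothetical helper for illustration only.

FP16_BYTES = 2  # fp16 element size

def gqa_expansion_bytes(num_heads: int, seq_len: int, head_dim: int) -> int:
    """Bytes for one expansion buffer shaped (1, num_heads, seq_len, head_dim)."""
    return 1 * num_heads * seq_len * head_dim * FP16_BYTES

num_heads = 32          # GLM-4-9B query heads (expansion targets num_heads, not num_kv_heads)
head_dim = 128
max_seq_len = 1 << 20   # 1M tokens (max_model_len = 1048576)
chunk_size = 16 * 1024  # 16K-token chunks assumed in Option 3

full = gqa_expansion_bytes(num_heads, max_seq_len, head_dim)
chunked = gqa_expansion_bytes(num_heads, chunk_size, head_dim)

print(f"full buffer:    {full} bytes = {full / 2**30:.0f} GiB")  # 8 GiB per buffer
print(f"chunked buffer: {chunked / 2**20:.0f} MiB")              # 128 MiB per buffer
```

This reproduces the 8,589,934,592-byte (8 GiB) figure per buffer for the current allocation and the ~130 MB figure for the chunk-sized allocation of Option 3.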