Compare commits

2 Commits: 1c36d53570 ... ef37d4f1a8

| Author | SHA1 | Date |
|---|---|---|
| | ef37d4f1a8 | |
| | c8a5ef04c0 | |

90  .claude/rules/test-ruler.md  Normal file

@@ -0,0 +1,90 @@
# test_ruler.py Usage Rules

## Mandatory Rules

**Consult the documentation before running `test_ruler.py`**; do not run `--help` or guess at parameters.

| Forbidden | Reason |
|------|------|
| `python tests/test_ruler.py --help` | Wastes an interaction; the docs already cover everything |
| Guessing parameter formats | Error-prone and inefficient |

## Required Reading

**[`docs/test_ruler_usage_guide.md`](../docs/test_ruler_usage_guide.md)** - contains:
- Complete parameter reference
- Verified command examples
- GPU mode selection guide
- max-model-len configuration guide

## Quick Reference

### Standard Command Format

```bash
CUDA_VISIBLE_DEVICES=<GPU> PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH \
python tests/test_ruler.py \
    --model ~/models/<MODEL> \
    --data-dir tests/data/ruler_<CTX> \
    --datasets <TASK> \
    --num-samples <N> \
    --max-model-len <LEN> \
    --enable-offload \
    [--sparse-policy XATTN_BSA] \
    [--sparse-threshold 0.9]
```
### Common Parameters at a Glance

| Parameter | Purpose | Example |
|------|------|------|
| `--datasets` | Select tasks | `niah_single_1,qa_1` |
| `--num-samples` | Sample count | `1`, `10`, `0` (all) |
| `--sample-indices` | Specific indices | `0,5,10` |
| `--enable-offload` | CPU offload | Required on RTX 3090 |
| `--sparse-policy` | Sparse policy | `XATTN_BSA` |
| `--json-output` | JSON output | For scripting |
| `--quiet` | Quiet mode | Less output |

### max-model-len at a Glance

| Data directory | max-model-len |
|---------|---------------|
| ruler_32k | 40960 |
| ruler_64k | 72000 |
| ruler_128k | 135000 |

### Common Command Templates

**32K Offload + XAttn**:
```bash
CUDA_VISIBLE_DEVICES=<GPU> PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH \
python tests/test_ruler.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --data-dir tests/data/ruler_32k \
    --datasets niah_single_1 \
    --num-samples 1 \
    --max-model-len 40960 \
    --enable-offload \
    --sparse-policy XATTN_BSA
```

**64K Offload + XAttn**:
```bash
CUDA_VISIBLE_DEVICES=<GPU> PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH \
python tests/test_ruler.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --data-dir tests/data/ruler_64k \
    --datasets niah_single_1 \
    --num-samples 1 \
    --max-model-len 72000 \
    --enable-offload \
    --sparse-policy XATTN_BSA
```

## Pre-Run Checklist

- [ ] Did the user specify a GPU? If not, ask.
- [ ] RTX 3090/4090? `--enable-offload` is mandatory.
- [ ] Does the data-dir match max-model-len?
- [ ] Need density statistics? Add `--sparse-policy XATTN_BSA`.

@@ -45,6 +45,7 @@ Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline L
| [`docs/xattn_offload_stream_sync_fix.md`](docs/xattn_offload_stream_sync_fix.md) | 🐛 FIX: XAttention Offload stream-sync bug; Pass1/Pass2 K data mismatch; compute_stream wrapping |
| [`docs/xattn_density_types.md`](docs/xattn_density_types.md) | 📊 Compute vs Comm density: BSA block (128) vs CPU block (4096) granularity; aggregation drives comm to 100% |
| [`docs/xattn_density_alignment_verification.md`](docs/xattn_density_alignment_verification.md) | ✅ VERIFIED: GPU-only vs Offload density alignment (32K diff 0.37%, 64K diff 0.09%) |
| [`docs/test_ruler_usage_guide.md`](docs/test_ruler_usage_guide.md) | 📖 GUIDE: test_ruler.py usage guide; RULER benchmark test commands; verified command examples |

## Rules Index

@@ -55,6 +56,7 @@ Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline L
| [`.claude/rules/sparse-policy.md`](.claude/rules/sparse-policy.md) | SparsePolicy implementation requirements |
| [`.claude/rules/planning-with-files.md`](.claude/rules/planning-with-files.md) | Planning file management for complex tasks |
| [`.claude/rules/gpu-monitor.md`](.claude/rules/gpu-monitor.md) | **GPU memory monitoring**: use the gpu-monitor agent; manual nvidia-smi loops are forbidden |
| [`.claude/rules/test-ruler.md`](.claude/rules/test-ruler.md) | **test_ruler.py rules**: no --help; consult the docs; includes quick reference and command templates |

## GPU Mutex for Multi-Instance Debugging

169  docs/issue_xattn_offload_gqa_buffer_oom.md  Normal file

@@ -0,0 +1,169 @@
# Issue: XAttention Offload Mode GQA Buffer OOM

## Problem Description

Running larger models such as GLM-4-9B in XAttention BSA (Block Sparse Attention) + CPU Offload mode fails with a CUDA OOM error.

### Error Message

```
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.00 GiB.
GPU 0 has a total capacity of 23.57 GiB of which 4.19 GiB is free.
```

### Reproduction Environment

| Item | Value |
|------|-----|
| Model | GLM-4-9B-Chat-1M |
| GPU | RTX 3090 (24GB) |
| Context Length | 32K |
| sparse_policy | XATTN_BSA |
| enable_cpu_offload | true |
| max_model_len | 1048576 (1M) |

### Error Location

```
File "nanovllm/kvcache/sparse/xattn_bsa.py", line 246, in alloc_policy_metadata
  self._k_expanded = torch.empty(shape, dtype=dtype, device=device)
```

---

## Analysis

### Memory Allocation Breakdown

`alloc_policy_metadata()` allocates the following buffers at KV cache initialization:

| Buffer | Purpose | Size (GLM-4, 1M seq) |
|--------|------|----------------------|
| `_prefill_mask_buffer` | BSA mask | ~32 MB |
| `_m_partial_buffer` | KV chunking m stats | ~32 MB |
| `_l_partial_buffer` | KV chunking l stats | ~32 MB |
| `_block_sums_buffer` | Block sums | ~64 MB |
| **`_k_expanded`** | GQA K expansion | **~8 GB** |
| **`_v_expanded`** | GQA V expansion | **~8 GB** |
### GQA Buffer Size

```python
shape = (1, num_heads, max_seq_len, head_dim)
      = (1, 32, 1048576, 128)

size = 1 × 32 × 1048576 × 128 × 2 bytes (fp16)
     = 8,589,934,592 bytes
     = 8 GiB per buffer
```

### Root Cause

1. **Conflicting design intent**: the doc comments on `_k_expanded` and `_v_expanded` explicitly say they are "for GPU-only mode".
2. **Incomplete guard**: the code only checks `num_heads == num_kv_heads` to skip the allocation; it never checks for offload mode.
3. **Offload mode does not need these buffers**: `compute_chunked_prefill()` takes a different compute path that does not depend on the pre-allocated GQA buffers.

### Relevant Code

```python
# xattn_bsa.py:238-247
# Only allocate GQA expansion buffers if GQA (num_heads != num_kv_heads)
if num_heads == num_kv_heads:
    logger.info(f"[XAttn] No GQA expansion needed (num_heads == num_kv_heads = {num_heads})")
    return  # <-- only checks GQA; never checks offload mode

# Shape: [1, num_heads, max_seq_len, head_dim] for xattn_estimate format
shape = (1, num_heads, max_seq_len, head_dim)
self._k_expanded = torch.empty(shape, dtype=dtype, device=device)  # <-- OOM here
self._v_expanded = torch.empty(shape, dtype=dtype, device=device)
```
---

## Proposed Fixes

### Option 1: Skip GQA Buffer Allocation in Offload Mode (Recommended)

Add an offload-mode check to `alloc_policy_metadata()`:

```python
def alloc_policy_metadata(
    self,
    num_heads: int,
    num_kv_heads: int,
    head_dim: int,
    max_seq_len: int,
    dtype: torch.dtype,
    device: torch.device,
    enable_cpu_offload: bool = False,  # <-- new parameter
) -> None:
    # ... allocate the mask buffer and KV chunking buffers (still needed in offload mode)

    # Skip GQA buffers in offload mode
    # Chunked prefill uses compute_chunked_prefill() which doesn't need these
    if enable_cpu_offload:
        logger.info("[XAttn] Offload mode: skipping GQA expansion buffers")
        return

    # GPU-only mode: pre-allocate GQA buffers for compute_prefill()
    if num_heads == num_kv_heads:
        logger.info("[XAttn] No GQA expansion needed")
        return

    shape = (1, num_heads, max_seq_len, head_dim)
    self._k_expanded = torch.empty(shape, dtype=dtype, device=device)
    self._v_expanded = torch.empty(shape, dtype=dtype, device=device)
```

**Files to change**:
1. `nanovllm/kvcache/sparse/xattn_bsa.py` - the `alloc_policy_metadata()` method
2. `nanovllm/engine/model_runner.py` - pass `enable_cpu_offload` when calling `alloc_policy_metadata()` (see the sketch below)
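
A minimal call-site sketch for item 2, assuming the runner already holds the attention geometry and an `enable_cpu_offload` flag on its config; the attribute names here are illustrative, not the actual `model_runner.py` API:

```python
# model_runner.py (illustrative sketch -- attribute names are assumptions)
policy.alloc_policy_metadata(
    num_heads=config.num_attention_heads,
    num_kv_heads=config.num_key_value_heads,
    head_dim=config.head_dim,
    max_seq_len=config.max_model_len,
    dtype=dtype,
    device=device,
    enable_cpu_offload=config.enable_cpu_offload,  # <-- thread the flag through
)
```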
### Option 2: Lazy Allocation

Allocate the GQA buffers only on the first call to `compute_prefill()`; offload mode goes through `compute_chunked_prefill()` and never triggers the allocation.

```python
def compute_prefill(self, ...):
    # Lazy allocation on first use
    if self._k_expanded is None and num_heads != num_kv_heads:
        self._allocate_gqa_buffers(...)
    ...
```

### Option 3: Cap the Buffer Size at chunk_size

Instead of pre-allocating for max_seq_len, allocate only chunk_size:

```python
# Before: max_seq_len (1M tokens) -> 8 GiB
# After:  chunk_size (16K tokens) -> ~130 MB
buffer_len = self.chunk_size if enable_cpu_offload else max_seq_len
shape = (1, num_heads, buffer_len, head_dim)
```
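
A quick sanity check of the two sizes above; plain arithmetic with fp16 = 2 bytes per element:

```python
def gqa_buffer_bytes(num_heads: int, seq_len: int, head_dim: int, itemsize: int = 2) -> int:
    # Buffer shape is (1, num_heads, seq_len, head_dim); fp16 -> 2 bytes/element.
    return num_heads * seq_len * head_dim * itemsize

print(gqa_buffer_bytes(32, 1 << 20, 128) / 2**30)   # max_seq_len = 1M  -> 8.0 GiB
print(gqa_buffer_bytes(32, 16 * 1024, 128) / 2**20)  # chunk_size = 16K -> 128.0 MiB (~130 MB)
```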
---

## Verification

After the fix, run the following command to verify:

```bash
cd /home/zijie/Code/COMPASS
GPULIST=0 ./scripts/run_ruler.sh glm4-9b-xattn-nanovllm synthetic xattn --task niah_single_1
```

Expected results:
- No more OOM errors from the 8 GiB allocation
- The model loads and completes inference normally
---

## Related Documents

- `docs/xattn_bsa_policy_design.md` - XAttention BSA policy design document
- `docs/gpu_only_xattn_guide.md` - GPU-only XAttention guide

## Priority

**High** - blocks 9B+ models from using XAttention + Offload mode on 24 GB GPUs
338  docs/test_ruler_usage_guide.md  Normal file

@@ -0,0 +1,338 @@
# test_ruler.py Usage Guide

A comprehensive RULER benchmark harness for evaluating LLM long-context capability.

**Test date**: 2026-02-05
**Test GPU**: RTX 3090 (GPU 4)

---

## Supported Tasks

| Category | Tasks |
|------|------|
| NIAH (Needle-In-A-Haystack) | `niah_single_1/2/3`, `niah_multikey_1/2/3`, `niah_multiquery`, `niah_multivalue` |
| QA (Question Answering) | `qa_1`, `qa_2` |
| Recall | `cwe`, `fwe`, `vt` |

---

## Basic Command Format

```bash
CUDA_VISIBLE_DEVICES=<GPU_ID> PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH \
python tests/test_ruler.py [OPTIONS]
```

---

## Parameters

### Core Parameters

| Parameter | Default | Description |
|------|--------|------|
| `--model` | `~/models/Llama-3.1-8B-Instruct` | Model path |
| `--data-dir` | `tests/data/ruler_64k` | Data directory |
| `--max-model-len` | 65664 | Maximum context length |

### Data Selection

| Parameter | Default | Description |
|------|--------|------|
| `--datasets` | all | Comma-separated dataset names |
| `--num-samples` | 0 (all) | Samples to test per dataset |
| `--sample-indices` | - | Specific sample indices (e.g. `0,5,10`) |

### Offload Configuration

| Parameter | Default | Description |
|------|--------|------|
| `--enable-offload` | False | Enable CPU offload mode |
| `--num-gpu-blocks` | 4 | Number of KV cache blocks kept on GPU |
| `--block-size` | 4096 | KV cache block size (tokens) |
| `--num-kv-buffers` | 4 | Number of ring buffers |
| `--gpu-utilization` | 0.9 | GPU memory utilization |

### Sparse Attention Configuration

| Parameter | Default | Description |
|------|--------|------|
| `--sparse-policy` | - | Sparse policy: `FULL`, `QUEST`, `XATTN_BSA` |
| `--sparse-threshold` | 0.9 | XAttn cumulative attention threshold |
| `--sparse-samples` | 128 | XAttn samples per chunk |
| `--sparse-stride` | 8 | XAttn Q/K downsampling stride |

### Output Control

| Parameter | Description |
|------|------|
| `--quiet` / `-q` | Quiet mode |
| `--json-output` | JSON-formatted output |
| `--fresh-llm` | Re-initialize the LLM for every sample |

### Miscellaneous

| Parameter | Default | Description |
|------|--------|------|
| `--dtype` | auto | Model dtype (`bfloat16`, `float16`) |
| `--use-cuda-graph` | False | Enable CUDA Graph |
| `--max-new-tokens` | 16 | Maximum generated tokens |

---
## Verified Command Examples

All commands below were verified on an RTX 3090 (24GB).

### 1. Basic Offload Test (32K)

```bash
CUDA_VISIBLE_DEVICES=4 PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH \
python tests/test_ruler.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --data-dir tests/data/ruler_32k \
    --datasets niah_single_1 \
    --num-samples 1 \
    --max-model-len 40960 \
    --enable-offload
```

**Result**: 100% accuracy, ~16s

### 2. Offload + XAttention BSA (32K)

```bash
CUDA_VISIBLE_DEVICES=4 PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH \
python tests/test_ruler.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --data-dir tests/data/ruler_32k \
    --datasets niah_single_1 \
    --num-samples 1 \
    --max-model-len 40960 \
    --enable-offload \
    --sparse-policy XATTN_BSA \
    --sparse-threshold 0.9
```

**Result**: 100% accuracy, compute density ~50%, ~19s

### 3. Offload + XAttention BSA (64K)

```bash
CUDA_VISIBLE_DEVICES=4 PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH \
python tests/test_ruler.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --data-dir tests/data/ruler_64k \
    --datasets niah_single_1 \
    --num-samples 1 \
    --max-model-len 72000 \
    --enable-offload \
    --sparse-policy XATTN_BSA \
    --sparse-threshold 0.9
```

**Result**: 100% accuracy, compute density ~37%, ~52s

### 4. Multiple Datasets, Multiple Samples

```bash
CUDA_VISIBLE_DEVICES=4 PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH \
python tests/test_ruler.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --data-dir tests/data/ruler_32k \
    --datasets niah_single_1,qa_1 \
    --num-samples 2 \
    --max-model-len 40960 \
    --enable-offload \
    --sparse-policy XATTN_BSA
```

**Result**: 4/4 (100%), ~71s

### 5. Specific Sample Indices

```bash
CUDA_VISIBLE_DEVICES=4 PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH \
python tests/test_ruler.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --data-dir tests/data/ruler_32k \
    --datasets niah_single_1 \
    --sample-indices 0,5,10 \
    --max-model-len 40960 \
    --enable-offload
```

### 6. JSON Output Mode (for Scripts)

```bash
CUDA_VISIBLE_DEVICES=4 PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH \
python tests/test_ruler.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --data-dir tests/data/ruler_32k \
    --datasets niah_single_1 \
    --num-samples 1 \
    --max-model-len 40960 \
    --enable-offload \
    --json-output
```

**Output format**:
```json
{
  "total_correct": 1,
  "total_samples": 1,
  "overall_accuracy": 1.0,
  "avg_score": 1.0,
  "time": 30.44,
  "tasks": {"niah_single_1": {"correct": 1, "total": 1, "accuracy": 1.0}},
  "failed_samples": {}
}
```
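
A minimal consumer sketch for this mode. It assumes the JSON summary is the last line of stdout and reuses the field names from the example above; the wrapper itself is illustrative, not part of test_ruler.py:

```python
import json
import subprocess

# Invoke test_ruler.py with --json-output and parse the summary.
# CUDA_VISIBLE_DEVICES / PYTHONPATH must already be set in the environment.
proc = subprocess.run(
    ["python", "tests/test_ruler.py",
     "--data-dir", "tests/data/ruler_32k",
     "--datasets", "niah_single_1",
     "--num-samples", "1",
     "--max-model-len", "40960",
     "--enable-offload", "--json-output", "--quiet"],
    capture_output=True, text=True, check=True,
)
summary = json.loads(proc.stdout.strip().splitlines()[-1])  # assumption: summary is the last line
print(f"accuracy={summary['overall_accuracy']:.2%}, time={summary['time']:.1f}s")
for task, stats in summary["tasks"].items():
    print(f"  {task}: {stats['correct']}/{stats['total']}")
```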
### 7. Quiet Mode

```bash
CUDA_VISIBLE_DEVICES=4 PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH \
python tests/test_ruler.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --data-dir tests/data/ruler_32k \
    --datasets niah_single_1 \
    --num-samples 1 \
    --max-model-len 40960 \
    --enable-offload \
    --quiet
```

### 8. Adjusting the Number of GPU Blocks

```bash
CUDA_VISIBLE_DEVICES=4 PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH \
python tests/test_ruler.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --data-dir tests/data/ruler_32k \
    --datasets niah_single_1 \
    --num-samples 1 \
    --max-model-len 40960 \
    --enable-offload \
    --num-gpu-blocks 8 \
    --sparse-policy XATTN_BSA
```

### 9. GLM-4 Model Test

```bash
CUDA_VISIBLE_DEVICES=4 PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH \
python tests/test_ruler.py \
    --model ~/models/GLM-4-9B-Chat-1M \
    --data-dir tests/data/ruler_32k \
    --datasets niah_single_1 \
    --num-samples 1 \
    --max-model-len 40960 \
    --enable-offload \
    --dtype bfloat16
```

**Result**: 100% accuracy, ~17s

---

## Data Directory Layout

```
tests/data/
├── ruler_4k/     # 4K context
├── ruler_8k/     # 8K context
├── ruler_16k/    # 16K context
├── ruler_32k/    # 32K context (recommended for testing)
├── ruler_64k/    # 64K context
├── ruler_128k/   # 128K context
├── ruler_256k/   # 256K context
├── ruler_512k/   # 512K context
├── ruler_768k/   # 768K context
└── ruler_1m/     # 1M context
```
Each directory contains 13 task subdirectories, and each task has a `validation.jsonl` file; the sketch below shows one way to inspect them.
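
A minimal inspection sketch, assuming one JSON object per line; the per-task field names are not documented here, so it only counts samples and lists the keys:

```python
import json
from pathlib import Path

data_dir = Path("tests/data/ruler_32k")
for task_dir in sorted(p for p in data_dir.iterdir() if p.is_dir()):
    jsonl = task_dir / "validation.jsonl"
    if not jsonl.exists():
        continue
    with jsonl.open() as f:
        samples = [json.loads(line) for line in f]
    # Field names vary by task; report whatever the first sample carries.
    print(f"{task_dir.name}: {len(samples)} samples, keys={sorted(samples[0])}")
```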
---

## GPU and Mode Selection

| GPU Memory | Recommended Mode | Notes |
|---------|---------|------|
| 24GB (3090/4090) | `--enable-offload` | Offload is mandatory |
| 40GB+ (A100) | Either mode | Can also test GPU-only |

**RTX 3090 limitation**: memory constraints make `--enable-offload` mandatory.

---

## max-model-len Guide

| Data directory | Recommended max-model-len | Notes |
|---------|-------------------|------|
| ruler_4k | 5000 | Leaves room for output |
| ruler_8k | 9000 | |
| ruler_16k | 17000 | |
| ruler_32k | 40960 | |
| ruler_64k | 72000 | |
| ruler_128k | 135000 | |

**Formula**: `max_model_len >= max_input_len + max_new_tokens` - e.g. for `ruler_4k`, inputs run up to roughly 4K tokens, so 5000 covers the input plus the default 16 new tokens with margin.

---

## DensityObserver Output

Enabled automatically with `--sparse-policy XATTN_BSA`; sample output:

```
============================================================
Density Statistics (XAttention BSA)
============================================================
[DensityObserver] Mode: offload
Compute density: 0.3691 (min: 0.3691 @ layer 0)
Comm density: 1.0000 (CPU block granularity)
Savings ratio: 0.0% H2D transfer reduction
Num layers: 1
Layer 0 density: 0.369052
```

| Metric | Description |
|------|------|
| Compute density | Compute density at BSA block (128 tokens) granularity |
| Comm density | Communication density at CPU block (4096 tokens) granularity |
| Savings ratio | Fraction of H2D transfer saved |
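
The gap between the two metrics is an aggregation effect: a 4096-token CPU block must be transferred if any of its 32 inner 128-token BSA blocks is selected, so comm density saturates long before compute density does. A self-contained sketch of that effect (the 50% keep probability is an arbitrary illustration, not measured data):

```python
import random

BSA_BLOCK, CPU_BLOCK = 128, 4096
blocks_per_cpu = CPU_BLOCK // BSA_BLOCK  # 32 BSA blocks per CPU block

random.seed(0)
num_bsa_blocks = 72000 // BSA_BLOCK  # ~562 BSA blocks for a 72K-token sequence
kept = [random.random() < 0.5 for _ in range(num_bsa_blocks)]

compute_density = sum(kept) / len(kept)
# A CPU block must be transferred if ANY of its BSA blocks is kept.
cpu_needed = [any(kept[i:i + blocks_per_cpu]) for i in range(0, len(kept), blocks_per_cpu)]
comm_density = sum(cpu_needed) / len(cpu_needed)

print(f"compute density ~{compute_density:.2f}, comm density ~{comm_density:.2f}")
# With keep probability 0.5, P(an entire 32-block group is skipped) = 0.5**32,
# so comm density is effectively 1.0 -- matching the saturation seen above.
```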
---
## Common Issues

### 1. OOM Errors

**Cause**: insufficient GPU memory
**Fix**:
- Use `--enable-offload`
- Lower `--gpu-utilization`
- Reduce `--num-gpu-blocks`

### 2. Model Fails to Load

**Cause**: incompatible model configuration
**Fix**:
- Check the `--dtype` flag (GLM models need `--dtype bfloat16`)
- Confirm the model path is correct

### 3. Abnormal Accuracy

**Cause**: state leaking across samples
**Fix**: use `--fresh-llm` to re-initialize the LLM for every sample

---

## Related Documents

- [`docs/xattn_density_types.md`](xattn_density_types.md) - Compute vs Comm density explained
- [`docs/xattn_density_alignment_verification.md`](xattn_density_alignment_verification.md) - GPU-only vs Offload alignment verification
- [`docs/ruler_benchmark_results_32k.md`](ruler_benchmark_results_32k.md) - RULER 32K benchmark results