- Add DensityObserver class to track per-layer density statistics
- Integrate DensityObserver into compute_prefill for GPU-only mode
- Fix stride parameter not being passed to xattn_estimate
- Add density statistics output to test_ruler.py for XATTN_BSA
- Add comprehensive density benchmark documentation

Key changes:

- nanovllm/utils/density_observer.py: New Observer for density tracking
- xattn_bsa.py: Add stride param to xattn_estimate, integrate DensityObserver
- test_ruler.py: Enable DensityObserver and print summary for XATTN_BSA
- docs/xattn_density_benchmark.md: Benchmark results for 4K-32K contexts

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

# XAttention Density Benchmark

Density results for XAttention Block Sparse Attention (BSA) in GPU-only mode.

## Test Configuration

| Parameter | Value | Description |
|-----------|-------|-------------|
| Model | Llama-3.1-8B-Instruct | 32 layers, 32 heads, 8 KV heads |
| Block Size | 128 tokens | Fixed requirement of the BSA kernel |
| Threshold | 0.9 / 0.95 | Cumulative attention threshold |
| Stride | 4 / 8 / 16 | Q/K downsampling stride |
| Dataset | RULER niah_single_1 | Sample 0 |
| Mode | GPU-only | No CPU offload |

## Density Definition

```python
import torch

# Density = selected_blocks / total_causal_blocks
# Under causal attention, only blocks in the lower-triangular region are counted.
# Overall density = mean over all layers

def compute_density(mask, causal=True):
    """
    mask: [batch, heads, q_blocks, k_blocks] boolean tensor
    """
    batch, heads, q_blocks, k_blocks = mask.shape
    if causal:
        # Boolean lower-triangular mask so it can be AND-ed with `mask`
        causal_mask = torch.tril(
            torch.ones(q_blocks, k_blocks, dtype=torch.bool, device=mask.device)
        )
        total = causal_mask.sum() * batch * heads
        selected = (mask & causal_mask).sum()
    else:
        # Non-causal case: every block counts toward the total
        total = mask.numel()
        selected = mask.sum()
    return selected / total
```

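To make the denominator of the density concrete, the sketch below counts the causal blocks for a given context length at the 128-token block size and converts a reported density into an approximate number of selected blocks (on average, per head). It assumes the nominal context length (for example, 32K taken as exactly 32768 tokens) and a square block grid; the actual sample length in the benchmark may differ slightly, and the helper names are illustrative, not part of the repository.

```python
BLOCK_SIZE = 128  # tokens per block, from the test configuration


def causal_block_count(context_len: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of (q_block, k_block) pairs in the lower triangle, diagonal included."""
    n = -(-context_len // block_size)  # ceiling division
    return n * (n + 1) // 2


def selected_blocks(density: float, context_len: int) -> int:
    """Approximate number of blocks kept for a given density (average per head)."""
    return round(density * causal_block_count(context_len))


# Assumption: the nominal 32K context is exactly 32768 tokens.
print(causal_block_count(32768))       # 256 blocks per side -> 32896 causal pairs
print(selected_blocks(0.3700, 32768))  # blocks kept at the reported overall density
```

Because the causal block count grows quadratically with the number of blocks, the same density corresponds to far more selected blocks at 32K than at 4K.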
## Results

### threshold=0.9

#### Overall Density (mean over layers)

| Context | stride=4 | stride=8 | stride=16 |
|---------|----------|----------|-----------|
| **4K** | 0.5220 (52.2%) | 0.5292 (52.9%) | 0.5430 (54.3%) |
| **8K** | 0.5152 (51.5%) | 0.5252 (52.5%) | 0.5396 (54.0%) |
| **16K** | 0.4682 (46.8%) | 0.4775 (47.8%) | 0.4888 (48.9%) |
| **32K** | 0.3700 (37.0%) | 0.4012 (40.1%) | 0.4196 (42.0%) |

#### Min Density (per layer)

| Context | stride=4 | stride=8 | stride=16 |
|---------|----------|----------|-----------|
| **4K** | 0.2805 (Layer 3) | 0.3132 (Layer 3) | 0.3376 (Layer 5) |
| **8K** | 0.2886 (Layer 5) | 0.2725 (Layer 5) | 0.2995 (Layer 5) |
| **16K** | 0.2247 (Layer 5) | 0.2349 (Layer 5) | 0.2451 (Layer 5) |
| **32K** | 0.1799 (Layer 5) | 0.1846 (Layer 5) | 0.1964 (Layer 5) |

### threshold=0.95

#### Overall Density (mean over layers)

| Context | stride=4 | stride=8 | stride=16 |
|---------|----------|----------|-----------|
| **4K** | 0.6561 (65.6%) | 0.6699 (67.0%) | 0.6815 (68.2%) |
| **8K** | 0.6462 (64.6%) | 0.6584 (65.8%) | 0.6732 (67.3%) |
| **16K** | 0.6004 (60.0%) | 0.6114 (61.1%) | 0.6193 (61.9%) |
| **32K** | 0.4894 (48.9%) | 0.5203 (52.0%) | 0.5385 (53.9%) |

#### Min Density (per layer)

| Context | stride=4 | stride=8 | stride=16 |
|---------|----------|----------|-----------|
| **4K** | 0.3972 (Layer 3) | 0.4348 (Layer 5) | 0.4517 (Layer 4) |
| **8K** | 0.4004 (Layer 5) | 0.3906 (Layer 5) | 0.4239 (Layer 5) |
| **16K** | 0.3331 (Layer 5) | 0.3453 (Layer 5) | 0.3589 (Layer 5) |
| **32K** | 0.2656 (Layer 5) | 0.2784 (Layer 5) | 0.2917 (Layer 5) |

### Threshold Comparison (stride=8)

| Context | threshold=0.9 | threshold=0.95 | Difference (0.9 - 0.95, percentage points) |
|---------|---------------|----------------|--------------------------------------------|
| **4K** | 0.5292 (52.9%) | 0.6699 (67.0%) | -14.1 pp |
| **8K** | 0.5252 (52.5%) | 0.6584 (65.8%) | -13.3 pp |
| **16K** | 0.4775 (47.8%) | 0.6114 (61.1%) | -13.4 pp |
| **32K** | 0.4012 (40.1%) | 0.5203 (52.0%) | -11.9 pp |

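The difference column is an absolute gap in percentage points, not a relative change. A minimal sketch that recomputes it from the stride=8 overall densities reported in the tables above (the dictionary and function names are illustrative):

```python
# Overall density at stride=8, copied from the results tables.
density_t090 = {"4K": 0.5292, "8K": 0.5252, "16K": 0.4775, "32K": 0.4012}
density_t095 = {"4K": 0.6699, "8K": 0.6584, "16K": 0.6114, "32K": 0.5203}


def diff_pp(low: dict, high: dict) -> dict:
    """Difference low - high in percentage points, rounded to one decimal."""
    return {ctx: round((low[ctx] - high[ctx]) * 100, 1) for ctx in low}


for ctx, d in diff_pp(density_t090, density_t095).items():
    print(f"{ctx}: {d:+.1f} pp")
```

Expressed as a relative change the gap would look larger (for example, 0.4012 is roughly 23% below 0.5203), so it helps to keep the two readings separate.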
## Key Findings

### 1. Context Length Has the Largest Effect

Density drops markedly as context length grows (threshold=0.9, stride=8):

- 4K: 52.9% density
- 8K: 52.5% density
- 16K: 47.8% density
- 32K: 40.1% density

**Conclusion**: Longer sequences offer more room for sparsification, so XAttention's advantage is more pronounced on long sequences.

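### 2. Threshold Has a Significant Effect

Density at threshold=0.9 is about 12-14 percentage points lower than at threshold=0.95:

- 0.9 is more aggressive and selects fewer blocks
- 0.95 is more conservative and keeps more blocks
- Accuracy is unaffected in both cases (RULER NIAH passes in every run)
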
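### 3. Stride Has a Small Effect

At a fixed context length, density differs by about 2-5 percentage points across strides:

- Larger stride → slightly higher density (coarser sampling leads to a more conservative selection)
- stride=4 is the most aggressive, stride=16 the most conservative
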
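Both stride observations can be checked directly against the reported numbers. The sketch below uses the threshold=0.9 overall densities from the results tables; the function names are illustrative.

```python
# Overall density at threshold=0.9, copied from the results tables.
# Each tuple is ordered (stride=4, stride=8, stride=16).
overall_t090 = {
    "4K": (0.5220, 0.5292, 0.5430),
    "8K": (0.5152, 0.5252, 0.5396),
    "16K": (0.4682, 0.4775, 0.4888),
    "32K": (0.3700, 0.4012, 0.4196),
}


def stride_spread_pp(rows: dict) -> dict:
    """Max minus min density across strides, in percentage points."""
    return {ctx: round((max(v) - min(v)) * 100, 1) for ctx, v in rows.items()}


def increases_with_stride(rows: dict) -> bool:
    """True if density rises strictly from stride=4 to stride=16 in every row."""
    return all(a < b < c for a, b, c in rows.values())


print(stride_spread_pp(overall_t090))
print(increases_with_stride(overall_t090))
```

The spread is widest at 32K, which suggests that the choice of stride matters more as the context grows.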
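### 4. Min Density Concentrates Around Layer 5

- In most configurations the min density occurs at Layer 5; a few 4K runs hit Layer 3 or Layer 4 instead
- These layers (Layers 3-5 of the 32-layer model) are the sparsest, while the first and last layers are comparatively dense
- This is consistent with commonly observed Transformer attention patterns
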
### 5. Best Sparsification Configuration

32K + stride=4 + threshold=0.9 reaches the lowest density:

- Overall: **37.0%** (about 63% of the block computation saved)
- Min: **18.0%** (Layer 5)

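### 6. Accuracy Is Stable

RULER NIAH passes (score=1.0) under every configuration tested (niah_single_1, Sample 0), which indicates that:

- Both threshold=0.9 and threshold=0.95 are conservative enough to preserve accuracy
- Different strides do not change the final result
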
## Recommended Configurations

| Scenario | threshold | stride | Notes |
|----------|-----------|--------|-------|
| Accuracy first | 0.95 | 8 | Conservative, density ~52-67% |
| Balanced | 0.9 | 8 | Default, density ~40-53% |
| Performance first | 0.9 | 4 | Aggressive, density ~37-52% |

## Test Command

```bash
# Basic test
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=/path/to/nano-vllm:$PYTHONPATH \
    python tests/test_ruler.py \
    --data-dir tests/data/ruler_32k \
    --datasets niah_single_1 \
    --sample-indices 0 \
    --max-model-len 33792 \
    --sparse-policy XATTN_BSA \
    --sparse-threshold 0.9 \
    --sparse-stride 8 \
    --gpu-utilization 0.85

# Parameters
# --sparse-policy XATTN_BSA    Enable XAttention Block Sparse Attention
# --sparse-threshold 0.9       Cumulative attention threshold (0.9-0.99)
# --sparse-stride 8            Q/K downsampling stride (4/8/16)
```

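To reproduce the full threshold × stride grid at one context length, the command above can be generated for each combination. The following is a minimal sketch that only builds and prints the command lines; it reuses the flags shown above with the 32K data directory, and the helper name `build_commands` is hypothetical. `CUDA_VISIBLE_DEVICES` and `PYTHONPATH` would still need to be set as in the shell example.

```python
import itertools
import shlex

# Flags copied from the test command above; only threshold and stride vary.
BASE_CMD = [
    "python", "tests/test_ruler.py",
    "--data-dir", "tests/data/ruler_32k",
    "--datasets", "niah_single_1",
    "--sample-indices", "0",
    "--max-model-len", "33792",
    "--sparse-policy", "XATTN_BSA",
    "--gpu-utilization", "0.85",
]


def build_commands(thresholds=(0.9, 0.95), strides=(4, 8, 16)):
    """Return one argument list per (threshold, stride) combination."""
    return [
        BASE_CMD + ["--sparse-threshold", str(t), "--sparse-stride", str(s)]
        for t, s in itertools.product(thresholds, strides)
    ]


for cmd in build_commands():
    print(shlex.join(cmd))  # printed only; nothing is executed here
```

Each printed line can be pasted into a shell, or the argument lists can be passed to `subprocess.run` if sequential execution is preferred.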
## Using DensityObserver

```python
from nanovllm.utils.density_observer import DensityObserver

# Enable and reset
DensityObserver.enable()
DensityObserver.complete_reset()

# ... run inference (compute_prefill records density automatically) ...

# Get the results
summary = DensityObserver.get_summary()
# {
#     "mode": "gpu_only",
#     "overall_density": 0.40,  # mean over all layers
#     "per_layer_density": {0: 0.55, 1: 0.45, ...},
#     "num_layers": 32
# }

# Get the lowest density
min_layer, min_density = DensityObserver.get_min_density()

# Print the summary
DensityObserver.print_summary()
# [DensityObserver] Mode: gpu_only
#   Overall density: 0.4012
#   Min density: 0.1846 (layer 5)
#   Num layers: 32
```

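The overall and min values in the summary follow directly from the per-layer numbers. The sketch below shows that relationship on a plain dictionary with the same shape as the `get_summary()` output shown above; the density values are invented for illustration, and the helper functions are not part of `DensityObserver`.

```python
from statistics import mean

# Hypothetical summary with the same shape as the dict shown above;
# the density values are made up for illustration.
summary = {
    "mode": "gpu_only",
    "per_layer_density": {0: 0.55, 1: 0.45, 2: 0.30, 3: 0.50},
    "num_layers": 4,
}


def overall_density(per_layer: dict) -> float:
    """Overall density as the unweighted mean of the per-layer densities."""
    return mean(per_layer.values())


def min_density_layer(per_layer: dict) -> tuple:
    """Return (layer_index, density) for the sparsest layer."""
    layer = min(per_layer, key=per_layer.get)
    return layer, per_layer[layer]


print(overall_density(summary["per_layer_density"]))
print(min_density_layer(summary["per_layer_density"]))
```

An unweighted mean treats every layer equally, which matches the "mean over all layers" definition used in this document.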
## Related Files

| File | Description |
|------|-------------|
| `nanovllm/kvcache/sparse/xattn_bsa.py` | XAttention BSA policy implementation |
| `nanovllm/utils/density_observer.py` | Observer for density statistics |
| `nanovllm/ops/xattn.py` | Core xattn_estimate algorithm |
| `tests/test_ruler.py` | RULER benchmark test script |