- Add DensityObserver class to track per-layer density statistics
- Integrate DensityObserver into compute_prefill for GPU-only mode
- Fix stride parameter not being passed to xattn_estimate
- Add density statistics output to test_ruler.py for XATTN_BSA
- Add comprehensive density benchmark documentation

Key changes:
- nanovllm/utils/density_observer.py: New Observer for density tracking
- xattn_bsa.py: Add stride param to xattn_estimate, integrate DensityObserver
- test_ruler.py: Enable DensityObserver and print summary for XATTN_BSA
- docs/xattn_density_benchmark.md: Benchmark results for 4K-32K contexts

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
# XAttention Density Benchmark

Density measurements for XAttention Block Sparse Attention (BSA) in GPU-only mode.
## Test Configuration

| Parameter | Value | Notes |
|---|---|---|
| Model | Llama-3.1-8B-Instruct | 32 layers, 32 heads, 8 KV heads |
| Block Size | 128 tokens | Fixed requirement of the BSA kernel |
| Threshold | 0.9 / 0.95 | Cumulative attention threshold |
| Stride | 4 / 8 / 16 | Q/K downsampling stride |
| Dataset | RULER niah_single_1 | Sample 0 |
| Mode | GPU-only | No CPU offload |
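For orientation, the block-grid sizes implied by these settings can be worked out directly (plain arithmetic; this snippet is not part of the test harness):

```python
# Blocks per side and total lower-triangular (causal) block pairs
# for each context length tested, with the fixed 128-token block size.
BLOCK = 128
for ctx in (4096, 8192, 16384, 32768):
    n = ctx // BLOCK               # blocks along each axis
    causal = n * (n + 1) // 2      # causal block pairs (density denominator)
    print(f"{ctx:>6} tokens: {n:>3} blocks/side, {causal:>5} causal block pairs")
```

At 32K this is 256 blocks per side and 32,896 causal block pairs, so a 37% overall density corresponds to roughly 12,000 block computations instead of 32,896.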
## Density Definition

```python
import torch

# Density = selected_blocks / total_causal_blocks
# Under causal attention, only blocks in the lower-triangular region count.
# Overall density is the mean over all layers.
def compute_density(mask, causal=True):
    """
    mask: [batch, heads, q_blocks, k_blocks] boolean tensor
    """
    batch, heads, q_blocks, k_blocks = mask.shape
    if causal:
        causal_mask = torch.tril(
            torch.ones(q_blocks, k_blocks, dtype=torch.bool, device=mask.device)
        )
        total = causal_mask.sum() * batch * heads
        selected = (mask & causal_mask).sum()
        return selected / total
    return mask.sum() / mask.numel()
```
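A worked instance of this definition on a tiny synthetic mask (illustrative values only, not taken from the benchmark):

```python
import torch

# Synthetic 1-batch, 1-head mask over a 3x3 block grid where only the
# diagonal blocks are selected.
mask = torch.zeros(1, 1, 3, 3, dtype=torch.bool)
mask[..., range(3), range(3)] = True              # 3 selected blocks
causal = torch.tril(torch.ones(3, 3, dtype=torch.bool))
density = (mask & causal).sum() / causal.sum()    # 3 selected / 6 causal blocks
print(density.item())  # 0.5
```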
## Results

### threshold=0.9

#### Overall Density (mean over layers)
| Context | stride=4 | stride=8 | stride=16 |
|---|---|---|---|
| 4K | 0.5220 (52.2%) | 0.5292 (52.9%) | 0.5430 (54.3%) |
| 8K | 0.5152 (51.5%) | 0.5252 (52.5%) | 0.5396 (54.0%) |
| 16K | 0.4682 (46.8%) | 0.4775 (47.8%) | 0.4888 (48.9%) |
| 32K | 0.3700 (37.0%) | 0.4012 (40.1%) | 0.4196 (42.0%) |
#### Min Density (per layer)
| Context | stride=4 | stride=8 | stride=16 |
|---|---|---|---|
| 4K | 0.2805 (Layer 3) | 0.3132 (Layer 3) | 0.3376 (Layer 5) |
| 8K | 0.2886 (Layer 5) | 0.2725 (Layer 5) | 0.2995 (Layer 5) |
| 16K | 0.2247 (Layer 5) | 0.2349 (Layer 5) | 0.2451 (Layer 5) |
| 32K | 0.1799 (Layer 5) | 0.1846 (Layer 5) | 0.1964 (Layer 5) |
### threshold=0.95

#### Overall Density (mean over layers)
| Context | stride=4 | stride=8 | stride=16 |
|---|---|---|---|
| 4K | 0.6561 (65.6%) | 0.6699 (67.0%) | 0.6815 (68.2%) |
| 8K | 0.6462 (64.6%) | 0.6584 (65.8%) | 0.6732 (67.3%) |
| 16K | 0.6004 (60.0%) | 0.6114 (61.1%) | 0.6193 (61.9%) |
| 32K | 0.4894 (48.9%) | 0.5203 (52.0%) | 0.5385 (53.9%) |
#### Min Density (per layer)
| Context | stride=4 | stride=8 | stride=16 |
|---|---|---|---|
| 4K | 0.3972 (Layer 3) | 0.4348 (Layer 5) | 0.4517 (Layer 4) |
| 8K | 0.4004 (Layer 5) | 0.3906 (Layer 5) | 0.4239 (Layer 5) |
| 16K | 0.3331 (Layer 5) | 0.3453 (Layer 5) | 0.3589 (Layer 5) |
| 32K | 0.2656 (Layer 5) | 0.2784 (Layer 5) | 0.2917 (Layer 5) |
### Threshold Comparison (stride=8)

| Context | threshold=0.9 | threshold=0.95 | Difference (0.9 - 0.95) |
|---|---|---|---|
| 4K | 0.5292 (52.9%) | 0.6699 (67.0%) | -14.1% |
| 8K | 0.5252 (52.5%) | 0.6584 (65.8%) | -13.3% |
| 16K | 0.4775 (47.8%) | 0.6114 (61.1%) | -13.4% |
| 32K | 0.4012 (40.1%) | 0.5203 (52.0%) | -11.9% |
## Key Findings

### 1. Context length has the largest impact

Density drops significantly as context length grows (threshold=0.9, stride=8):

- 4K: 52.9% density
- 8K: 52.5% density
- 16K: 47.8% density
- 32K: 40.1% density

Conclusion: longer sequences offer more opportunity for sparsification, so XAttention's advantage is most pronounced on long contexts.
### 2. Threshold has a significant impact

threshold=0.9 yields density roughly 12-14 percentage points lower than 0.95:

- 0.9 is more aggressive and selects fewer blocks
- 0.95 is more conservative and keeps more blocks
- Accuracy is unaffected either way (all RULER NIAH tests PASS)
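A minimal sketch of what cumulative-threshold block selection can look like (a generic illustration; `select_blocks`, its signature, and the flat global sort are assumptions for exposition, not the actual `xattn_estimate` implementation):

```python
import torch

def select_blocks(scores, threshold=0.9):
    # Keep the highest-scoring blocks until their cumulative share of the
    # total score reaches `threshold`; all remaining blocks are skipped.
    flat = scores.flatten()
    order = torch.argsort(flat, descending=True)
    cum = torch.cumsum(flat[order], dim=0) / flat.sum()
    k = int(torch.searchsorted(cum, torch.tensor(threshold)).item()) + 1
    mask = torch.zeros_like(flat, dtype=torch.bool)
    mask[order[:k]] = True
    return mask.view_as(scores)
```

Raising the threshold from 0.9 to 0.95 forces more blocks into the kept set before the cumulative share is reached, which is consistent with the 12-14 point density gap in the tables above.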
### 3. Stride has a minor impact

At the same context length, varying the stride changes density by only about 2-5 percentage points:

- A larger stride gives slightly higher density (coarser sampling, more conservative selection)
- stride=4 is the most aggressive; stride=16 the most conservative
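To illustrate why a larger stride coarsens the estimate, here is a minimal sketch of stride-based downsampling (a generic illustration; `estimate_block_scores` is a hypothetical name and the real `xattn_estimate` kernel differs):

```python
import torch

def estimate_block_scores(q, k, stride=8, block=128):
    # Downsample Q and K by `stride`, run approximate attention on the
    # smaller matrices, then pool scores back onto the 128-token block grid.
    q_s, k_s = q[::stride], k[::stride]
    attn = torch.softmax(q_s @ k_s.T / q.shape[-1] ** 0.5, dim=-1)
    per_block = block // stride   # downsampled rows/cols per block
    nq, nk = q_s.shape[0] // per_block, k_s.shape[0] // per_block
    return attn.reshape(nq, per_block, nk, per_block).sum(dim=(1, 3))
```

With stride=16 each block's score is pooled from far fewer (q, k) samples than with stride=4, so the estimate is noisier; a threshold-based selector then tends to keep more blocks, matching the slightly higher density observed at larger strides.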
### 4. Min density concentrates in the middle layers

- In most configurations the minimum density occurs at Layer 5
- Middle layers are the sparsest; the first and last layers stay relatively dense
- This matches the commonly observed attention patterns in Transformers
### 5. Best sparsification configuration

32K + stride=4 + threshold=0.9 reaches the lowest density:

- Overall: 37.0% (63% of block computation skipped)
- Min: 18.0% (Layer 5)
### 6. Accuracy is stable

RULER NIAH passes (score=1.0) under every configuration, which indicates:

- Both threshold=0.9 and 0.95 are conservative enough to preserve accuracy
- The choice of stride does not change the final results
## Recommended Configurations

| Scenario | threshold | stride | Notes |
|---|---|---|---|
| Accuracy first | 0.95 | 8 | Conservative; density ~52-67% |
| Balanced | 0.9 | 8 | Default; density ~40-53% |
| Performance first | 0.9 | 4 | Aggressive; density ~37-52% |
## Test Commands

```bash
# Basic test
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=/path/to/nano-vllm:$PYTHONPATH \
python tests/test_ruler.py \
    --data-dir tests/data/ruler_32k \
    --datasets niah_single_1 \
    --sample-indices 0 \
    --max-model-len 33792 \
    --sparse-policy XATTN_BSA \
    --sparse-threshold 0.9 \
    --sparse-stride 8 \
    --gpu-utilization 0.85

# Flag reference:
# --sparse-policy XATTN_BSA   enable XAttention Block Sparse Attention
# --sparse-threshold 0.9      cumulative attention threshold (0.9-0.99)
# --sparse-stride 8           Q/K downsampling stride (4 / 8 / 16)
```
## Using DensityObserver

```python
from nanovllm.utils.density_observer import DensityObserver

# Enable tracking and reset any previous statistics
DensityObserver.enable()
DensityObserver.complete_reset()

# ... run inference (compute_prefill records densities automatically) ...

# Retrieve the results
summary = DensityObserver.get_summary()
# {
#     "mode": "gpu_only",
#     "overall_density": 0.40,  # mean over all layers
#     "per_layer_density": {0: 0.55, 1: 0.45, ...},
#     "num_layers": 32
# }

# Lowest per-layer density
min_layer, min_density = DensityObserver.get_min_density()

# Print a summary
DensityObserver.print_summary()
# [DensityObserver] Mode: gpu_only
#   Overall density: 0.4012
#   Min density: 0.1846 (layer 5)
#   Num layers: 32
```
## Related Files

| File | Description |
|---|---|
| `nanovllm/kvcache/sparse/xattn_bsa.py` | XAttention BSA policy implementation |
| `nanovllm/utils/density_observer.py` | Density-tracking observer |
| `nanovllm/ops/xattn.py` | Core `xattn_estimate` algorithm |
| `tests/test_ruler.py` | RULER benchmark test script |