nano-vllm/docs/gpuonly_density_alignment_test.md
Zijie Tian 232fcf043e 📝 docs: add GPU-only density alignment test results
Document test results verifying XAttention density calculation in
GPU-only mode matches independent xattn_estimate calls.

Test results (Llama-3.1-8B-Instruct, threshold=0.9):
- 4k:  Layer 0 density 63.8%, verified 
- 8k:  Layer 0 density 65.0%, verified 
- 16k: Layer 0 density 61.6%, verified 
- 32k: Layer 0 density 50.2%, verified 
- 64k: Layer 0 density 37.0%, verified 

All tests show exact match (attn_sums diff=0, mask exact match).

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-02-02 11:22:34 +08:00

# GPU-Only Density Alignment Test Results
This document verifies that the density computed by `xattn_bsa.py` in GPU-only mode matches the result of calling `xattn_estimate` independently.
## Test Configuration
- **Model**: Llama-3.1-8B-Instruct (32 layers, 32 heads, 8 KV heads, head_dim=128)
- **Threshold**: 0.9 (select the blocks that cover 90% of attention)
- **Block Size**: 128 tokens (BSA block)
- **Stride**: 8
- **Dataset**: RULER niah_single_1 (1 sample per context length)
## Test Results
| Context | Tokens | Layer 0 Density | Compute Density | Min Layer | Verification |
|---------|--------|-----------------|-----------------|-----------|--------------|
| 4k | 3,692 | 63.8% | 52.9% | Layer 3 (31.3%) | ✅ PASSED |
| 8k | 7,892 | 65.0% | 52.5% | Layer 5 (27.3%) | ✅ PASSED |
| 16k | 15,689 | 61.6% | 47.8% | Layer 5 (23.5%) | ✅ PASSED |
| 32k | 32,485 | 50.2% | 40.1% | Layer 5 (18.5%) | ✅ PASSED |
| 64k | 64,891 | 37.0% | 29.6% | Layer 5 (12.4%) | ✅ PASSED |
## Verification Metrics
For every tested context length, the verification script checks the following metrics:
| Metric | Expected | Actual |
|------|------|----------|
| attn_sums max diff | 0 | 0.000000e+00 |
| attn_sums mean diff | 0 | 0.000000e+00 |
| mask exact match | True | True |
| density diff | 0 | 0.000000 |
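
The checks in the table above can be sketched as follows. This is a minimal illustration in plain Python, not the code in `tests/test_gpuonly_density_alignment.py`: the function name `compare_runs` and the flattened-list inputs are assumptions made for the example, and the real script presumably operates on tensors.

```python
def compare_runs(sums_a, sums_b, mask_a, mask_b):
    """Compare flattened attn_sums and block masks from two runs.

    Hypothetical helper: inputs are flat lists of floats (sums) and
    bools (masks), one entry per block.
    """
    if len(sums_a) != len(sums_b) or len(mask_a) != len(mask_b):
        raise ValueError("shape mismatch between the two runs")
    diffs = [abs(a - b) for a, b in zip(sums_a, sums_b)]
    return {
        "max_diff": max(diffs, default=0.0),
        "mean_diff": sum(diffs) / len(diffs) if diffs else 0.0,
        "mask_exact_match": mask_a == mask_b,
    }


# Identical runs: every metric reports an exact match.
same = compare_runs([0.25, 0.75], [0.25, 0.75], [True, False], [True, False])

# Diverging runs: the difference shows up in all three metrics.
diverged = compare_runs([0.5], [0.25], [True], [False])
```

Requiring a difference of exactly 0 instead of a tolerance is the stricter choice: it only passes when both code paths perform the same arithmetic in the same order.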
## Density Formula
### Total (denominator)
```python
# Causal mask: Q block i may only attend to K blocks 0 through i + q_offset_blocks
causal_mask[i, j] = (j <= i + q_offset_blocks)
# Total = number of blocks in the causal region * batch * heads
total = causal_mask.sum() * batch * heads
# With q_offset_blocks = 0 and n = valid_q_blocks:
#   total = (n * (n + 1) / 2) * 1 * 32
```
### Selected (numerator)
```python
# Number of selected blocks (mask=True) inside the causal region
selected = (mask & causal_mask).sum()
```
### Density
```python
density = selected / total
```
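## Observations
1. **Density decreases as the context grows**: Layer 0 density falls from 63.8% at 4k to 37.0% at 64k, likely because attention is more dispersed in long sequences.
2. **Layer 5 is usually the sparsest layer**: Layer 5 has the lowest density in the 8k through 64k tests; at 4k the minimum is Layer 3.
3. **Layer 0 has the highest density**: the first layer has the densest attention pattern, possibly related to the sink-token effect.
4. **Threshold=0.9 corresponds to ~50% density at 32k**: threshold=0.9 means selecting the blocks that cover 90% of attention; at 32k context the resulting Layer 0 density is about 50% (50.2%), while the compute density is 40.1%.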
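
To illustrate why the same threshold can yield very different densities, here is a simplified sketch of threshold-based selection for a single Q block: keep the highest-scoring K blocks until their cumulative share of the score reaches the threshold. This is an assumption-laden illustration of the idea only; `select_blocks` is a hypothetical helper and does not reproduce the actual `xattn_estimate` implementation.

```python
def select_blocks(block_scores, threshold=0.9):
    """Return indices of the top-scoring blocks whose share reaches `threshold`.

    block_scores: non-negative attention mass per K block for one Q block
    (hypothetical input format).
    """
    total = sum(block_scores)
    if total <= 0:
        return []
    # Visit blocks from highest to lowest score.
    order = sorted(range(len(block_scores)),
                   key=lambda j: block_scores[j], reverse=True)
    chosen, covered = [], 0.0
    for j in order:
        chosen.append(j)
        covered += block_scores[j]
        if covered / total >= threshold:
            break
    return sorted(chosen)


# Spread-out attention: 4 of 5 blocks are needed to cover 90%.
spread = select_blocks([0.5, 0.25, 0.125, 0.0625, 0.0625], threshold=0.9)

# Concentrated attention: a single block already covers 90%.
peaked = select_blocks([0.96, 0.01, 0.01, 0.01, 0.01], threshold=0.9)
```

In this toy setting, the fraction of blocks kept depends on how concentrated the scores are, not on the threshold alone, which is consistent with the density varying across layers and context lengths in the results table.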
## Usage
### Step 1: Enable debug saving
```python
# nanovllm/kvcache/sparse/xattn_bsa.py
_DEBUG_SAVE_MASK = True  # set to True
```
### Step 2: Run GPU-only inference
```bash
CUDA_VISIBLE_DEVICES=0 python tests/test_ruler.py \
--model ~/models/Llama-3.1-8B-Instruct \
--data-dir tests/data/ruler_32k \
--datasets niah_single_1 \
--num-samples 1 \
--max-model-len 40960 \
--sparse-policy XATTN_BSA \
--sparse-threshold 0.9
```
### Step 3: Run the verification script
```bash
python tests/test_gpuonly_density_alignment.py
```
## Related Files
- `nanovllm/kvcache/sparse/xattn_bsa.py`: XAttention BSA policy implementation
- `nanovllm/ops/xattn.py`: the `xattn_estimate` function
- `tests/test_gpuonly_density_alignment.py`: verification script