From 1eb7521994444e5c5c01a9e8207b42ef01b412d6 Mon Sep 17 00:00:00 2001
From: Zijie Tian
Date: Thu, 5 Feb 2026 01:44:11 +0800
Subject: [PATCH] =?UTF-8?q?=F0=9F=93=9D=20docs:=20add=20XAttention=20densi?=
 =?UTF-8?q?ty=20types=20documentation?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Document the difference between compute density (BSA block level) and
communication density (CPU block level).

Key finding: even at 37% compute density, comm density can be 100%,
because any() aggregation across heads/Q-positions spreads the sparse
blocks across all CPU blocks.

Co-Authored-By: Claude Opus 4.5
---
 CLAUDE.md                   |   1 +
 docs/xattn_density_types.md | 152 ++++++++++++++++++++++++++++++++++++
 2 files changed, 153 insertions(+)
 create mode 100644 docs/xattn_density_types.md

diff --git a/CLAUDE.md b/CLAUDE.md
index ea89f70..ddb3fd4 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -43,6 +43,7 @@ Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline L
 | [`docs/gpuonly_density_alignment_test.md`](docs/gpuonly_density_alignment_test.md) | ✅ TEST: density alignment check (GPU-only + offload, 4K-64K); xattn_estimate vs KV chunking match exactly |
 | [`docs/xattn_memory_benchmark.md`](docs/xattn_memory_benchmark.md) | 📊 BENCH: XAttention memory benchmark; Qwen3-0.6B at 32K fits in 24GB VRAM (gpu-util=0.28) |
 | [`docs/xattn_offload_stream_sync_fix.md`](docs/xattn_offload_stream_sync_fix.md) | 🐛 FIX: XAttention offload stream-sync bug; Pass1/Pass2 K data mismatch; wrapped in compute_stream |
+| [`docs/xattn_density_types.md`](docs/xattn_density_types.md) | 📊 Compute vs comm density: BSA block (128) vs CPU block (4096) granularity; any() aggregation drives comm=100% |
 
 ## Rules Index
diff --git a/docs/xattn_density_types.md b/docs/xattn_density_types.md
new file mode 100644
index 0000000..b7ff331
--- /dev/null
+++ b/docs/xattn_density_types.md
@@ -0,0 +1,152 @@
+# XAttention Density Types: Compute vs Communication
+
+XAttention BSA tracks two densities at different granularities; they measure
+two different optimization effects.
+
+## The Two Density Definitions
+
+### 1. Compute Density
+
+**Granularity**: BSA block (128 tokens)
+
+**Formula**:
+```
+compute_density = selected_bsa_blocks / total_causal_bsa_blocks
+```
+
+**Meaning**: the fraction of blocks in the causal region that actually need
+attention computation.
+
+**Impact**: determines how much attention compute is saved.
+
+### 2. Communication Density
+
+**Granularity**: CPU block (4096 tokens = 32 BSA blocks)
+
+**Formula**:
+```
+comm_density = selected_cpu_blocks / total_cpu_blocks
+```
+
+**Meaning**: the fraction of all blocks that must be transferred from CPU to GPU.
+
+**Impact**: determines how much H2D transfer is saved.
+
+## Why Comm Density Is Usually Higher Than Compute Density
+
+### The Aggregation Effect
+
+Because the CPU block granularity is 32x the BSA block granularity, CPU block
+selection aggregates with `any()`:
+
+```python
+# BSA mask: [B, H, Q_bsa, K_bsa]
+# Reshape to CPU block level
+mask_per_cpu = mask.view(B, H, Q_bsa, num_cpu_blocks, bsa_per_cpu)
+# Any BSA block selected -> whole CPU block needed
+cpu_needed = mask_per_cpu.any(dim=-1).any(dim=2).any(dim=1)
+```
+
+A CPU block must be transferred in full as soon as **any one** of the
+following holds:
+- a head selects the block, or
+- a Q position selects the block, or
+- a BSA sub-block inside it is selected.
+
+### Example
+
+| Scenario | Compute Density | Comm Density | Notes |
+|----------|-----------------|--------------|-------|
+| 64K context, threshold=0.9 | 37% | 100% | sparse blocks spread evenly across all CPU blocks |
+| 32K context, threshold=0.9 | 50% | 100% | same as above |
+
+## Test Results
+
+### Test Command
+
+```bash
+# Offload-mode test
+CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler.py \
+    --model ~/models/Llama-3.1-8B-Instruct \
+    --data-dir tests/data/ruler_64k \
+    --datasets niah_single_1 \
+    --num-samples 1 \
+    --max-model-len 72000 \
+    --enable-offload \
+    --sparse-policy XATTN_BSA \
+    --sparse-threshold 0.9
+```
+
+### Sample Output
+
+```
+[DensityObserver] Mode: offload
+  Compute density: 0.3691 (min: 0.3691 @ layer 0)
+  Comm density: 1.0000 (CPU block granularity)
+  Savings ratio: 0.0% H2D transfer reduction
+  Num layers: 1
+  Layer 0 density: 0.369052
+```
+
+## Key Findings
+
+### Limits of XAttention's Current Communication Optimization
+
+1. **Compute density drops effectively**: ~37% @ 64K context (63% less compute)
+2. **Comm density does not drop**: 100% (no reduction in transfer volume)
+
+### Root Cause
+
+Characteristics of the attention pattern:
+- different heads attend to different positions
+- different Q positions attend to different K positions
+- the sparse selection is spread across the entire sequence
+
+So although each (head, Q, K) combination selects only a few blocks, the
+aggregation ends up covering all CPU blocks.
+
+### Potential Optimization Directions
+
+1. **Per-head block selection**: each head selects its CPU blocks independently
+2. **Block clustering**: group related blocks into the same CPU block
+3. **Dynamic block size**: adapt the CPU block size to the attention pattern
+
+## DensityObserver API
+
+### Enable and Reset
+
+```python
+from nanovllm.utils.density_observer import DensityObserver
+
+DensityObserver.enable()
+DensityObserver.complete_reset()
+DensityObserver.set_mode("offload")  # or "gpu_only"
+```
+
+### Recording
+
+```python
+# Compute density (recorded automatically in GPU-only mode)
+DensityObserver.record(layer_id, mask, causal=True)
+
+# Comm density (recorded in select_blocks in offload mode)
+DensityObserver.record_comm_density(layer_id, selected_cpu_blocks, total_cpu_blocks)
+```
+
+### Getting Results
+
+```python
+# Overall density
+overall_compute = DensityObserver.get_overall_density()
+overall_comm = DensityObserver.get_overall_comm_density()
+
+# Per-layer density
+per_layer_compute = DensityObserver.get_per_layer_density()
+per_layer_comm = DensityObserver.get_per_layer_comm_density()
+
+# Print a summary
+DensityObserver.print_summary()
+```
+
+## Related Files
+
+- `nanovllm/utils/density_observer.py`: DensityObserver implementation
+- `nanovllm/kvcache/sparse/xattn_bsa.py`: XAttention BSA policy (records comm density in `select_blocks`)
+- `tests/test_ruler.py`: RULER benchmark test script
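The aggregation effect described in the patch can be checked standalone. A minimal sketch, with NumPy standing in for the torch code above; `B`, `H`, and the block counts mirror the 64K example, and the uniform-random mask is an idealization (real attention patterns are structured, not uniform):

```python
import numpy as np

rng = np.random.default_rng(0)

B, H = 1, 8              # batch, attention heads (illustrative)
K_bsa = 512              # 64K tokens / 128-token BSA blocks
bsa_per_cpu = 32         # 4096-token CPU block = 32 BSA blocks
num_cpu = K_bsa // bsa_per_cpu
Q_bsa = K_bsa            # square mask for simplicity (no causal crop)

# Uniform random BSA-level mask with ~37% of blocks selected.
mask = rng.random((B, H, Q_bsa, K_bsa)) < 0.37

# Compute density: fraction of BSA blocks that need attention.
compute_density = mask.mean()

# Comm density: a CPU block is fetched if ANY head / Q position /
# BSA sub-block inside it selected something (the any() aggregation).
mask_per_cpu = mask.reshape(B, H, Q_bsa, num_cpu, bsa_per_cpu)
cpu_needed = mask_per_cpu.any(axis=-1).any(axis=2).any(axis=1)  # [B, num_cpu]
comm_density = cpu_needed.mean()

print(f"compute density: {compute_density:.2f}")   # ~0.37
print(f"comm density:    {comm_density:.2f}")      # 1.00 under uniform sparsity
```

With 32 x 512 x 8 chances per CPU block to contain a selected BSA block at 37% density, the probability that any CPU block stays empty is effectively zero, so uniform sparsity always yields comm density 1.0; only a clustered selection pattern (as in the optimization directions above) could lower it.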