diff --git a/CLAUDE.md b/CLAUDE.md
index f5c18c5..715d4cc 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -40,7 +40,7 @@ Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline L
 | [`docs/new_model_integration_guide.md`](docs/new_model_integration_guide.md) | 🔧 GUIDE: New model integration guide - config mapping, RoPE variants, EOS handling, weight conversion, validation checklist |
 | [`docs/xattn_density_alignment_analysis.md`](docs/xattn_density_alignment_analysis.md) | 📊 ANALYSIS: GPU-only vs Offload mode density alignment analysis, chunked softmax boundary effects, root cause of the 5-7% difference |
 | [`docs/xattn_kv_chunking_density_test.md`](docs/xattn_kv_chunking_density_test.md) | 🧪 TEST: XAttention KV chunking density verification, aligned at threshold=1.0, 10-13% difference at threshold<1.0 |
-| [`docs/gpuonly_density_alignment_test.md`](docs/gpuonly_density_alignment_test.md) | ✅ TEST: GPU-only density alignment verification (4K-64K), xattn_bsa vs xattn_estimate fully consistent |
+| [`docs/gpuonly_density_alignment_test.md`](docs/gpuonly_density_alignment_test.md) | ✅ TEST: Density alignment verification (GPU-only + Offload, 4K-64K), xattn_estimate vs KV chunking fully consistent |
 
 ## Rules Index
diff --git a/docs/gpuonly_density_alignment_test.md b/docs/gpuonly_density_alignment_test.md
index 36582a3..e14d7f4 100644
--- a/docs/gpuonly_density_alignment_test.md
+++ b/docs/gpuonly_density_alignment_test.md
@@ -1,16 +1,132 @@
-# GPU-Only Density Alignment Test Results
+# Density Alignment Test Results
 
-Verifies that the density computed by `xattn_bsa.py` in GPU-only mode matches a standalone call to `xattn_estimate`.
+Verifies the correctness of the three-stage KV chunking pipeline in both GPU-only and Offload modes.
 
 ## Test Configuration
 
-- **Model**: Llama-3.1-8B-Instruct (32 layers, 32 heads, 8 KV heads, head_dim=128)
-- **Threshold**: 0.9 (selects the blocks covering 90% of attention)
+### GPU-only mode
+- **Model**: Qwen3-0.6B (28 layers, 16 heads, 8 KV heads, head_dim=128)
+- **Threshold**: 0.9
 - **Block Size**: 128 tokens (BSA block)
 - **Stride**: 8
-- **Dataset**: RULER niah_single_1 (1 sample per length)
+- **Chunk Size**: 16384 tokens
 
-## Test Results
+### Offload mode
+- **Model**: Llama-3.1-8B-Instruct (32 layers, 32 heads, 8 KV heads, head_dim=128)
+- **Threshold**: 0.9
+- **Block Size**: 128 tokens (BSA block)
+- **Stride**: 4
+- **Chunk Size**: 4096 tokens
+
+## Three-Stage KV Chunking Alignment Test (2026-02-02)
+
+### Purpose
+
+Verify that the high-level `xattn_estimate` API and the manually driven three-stage KV chunking pipeline produce exactly the same results.
+
+### Three-Stage Pipeline
+
+```
+Stage 1: softmax_compute_partial_stats
+  └── each KV chunk independently computes partial stats (m_i, l_i)
+          │
+Stage 2: merge_softmax_stats
+  └── all chunks' stats are merged on the host: (m_global, l_global)
+          │
+Stage 3: softmax_normalize_and_block_sum
+  └── normalize with the global stats and compute block sums
+```
+
+### Test Results
+
+#### CHUNK_SIZE = 16384 (default)
+
+| Context | Tokens | Q Chunks | KV Chunks | Density | Mask diff | attn_sums diff | Result |
+|---------|--------|----------|-----------|---------|-----------|----------------|--------|
+| 4K | 3,692 | 1 | 1 | 63.84% | 0 | 0.0 | ✅ |
+| 8K | 7,892 | 1 | 1 | 64.98% | 0 | 0.0 | ✅ |
+| 16K | 15,689 | 1 | 1 | 61.63% | 0 | 0.0 | ✅ |
+| 32K | 32,485 | 2 | 2 | 50.21% | 0 | 0.0 | ✅ |
+| **64K** | **64,891** | **4** | **4** | **37.00%** | **0** | **0.0** | ✅ |
+
+#### CHUNK_SIZE = 4096 (more chunks)
+
+| Context | Tokens | Q Chunks | KV Chunks | Density | xattn_estimate vs KV chunking | Result |
+|---------|--------|----------|-----------|---------|-------------------------------|--------|
+| 4K | 3,692 | 1 | 1 | 63.84% | 0.000000 | ✅ |
+| 8K | 7,892 | 2 | 2 | 63.02% | 0.000000 | ✅ |
+| 16K | 15,689 | 4 | 4 | 60.08% | 0.000000 | ✅ |
+| 32K | 32,485 | 8 | 8 | 49.84% | 0.000000 | ✅ |
+| **64K** | **64,891** | **16** | **16** | **36.91%** | **0.000000** | ✅ |
+
+### 64K Detailed Verification (CHUNK_SIZE=4096)
+
+A 64K sequence with chunk_size=4096 yields a 16×16 chunk matrix:
+
+```
+seq_len: 64891, q_chunk_num: 16, kv_chunk_num: 16
+
+Q chunk 0: merged 16 KV chunks → attn_sum shape=[1, 32, 32, 512]
+Q chunk 1: merged 16 KV chunks → attn_sum shape=[1, 32, 32, 512]
+...
+Q chunk 15: merged 16 KV chunks → attn_sum shape=[1, 32, 32, 512]
+```
+
+Every Q chunk has to merge the softmax stats of 16 KV chunks, which thoroughly exercises `merge_softmax_stats` in a large-scale chunk-merging scenario.
+
+### Verification Metrics
+
+| Metric | Expected | Actual (all lengths) |
+|------|------|------------------|
+| attn_sums max diff | 0 | 0.000000e+00 |
+| attn_sums mean diff | 0 | 0.000000e+00 |
+| mask exact match | True | True |
+| density diff | 0% | 0.000000% |
+
+### Conclusion
+
+✅ **Three-stage KV chunking is exactly equivalent to single-pass processing, with no loss of precision.**
+
+- When seq_len < CHUNK_SIZE (16384): single-chunk processing
+- When seq_len >= CHUNK_SIZE: multi-chunk processing followed by merging, with results identical to single-pass processing
+
+---
+
+## Offload Mode Test (2026-02-02)
+
+Tested with real KV cache data saved in Offload mode.
+
+### Test Results
+
+| File | Tokens | Layer | Saved Density | Computed Density | Q/KV Chunks | Result |
+|------|--------|-------|---------------|------------------|-------------|------|
+| `qkv_3688.pt` | 3.7K | 3 | 38.34% | 38.34% | 1/1 | ✅ PASSED |
+| `qkv_7888.pt` | 7.9K | 3 | 29.06% | 27.56% | 2/2 | ✅ PASSED |
+| `qkv_15685.pt` | 15.7K | 3 | 19.77% | 18.60% | 4/4 | ✅ PASSED |
+| `qkv_32485.pt` | 32.5K | 5 | 15.71% | 15.62% | 8/8 | ✅ PASSED |
+| `qkv_64891.pt` | 64.9K | 3 | 11.09% | 11.09% | 16/16 | ✅ PASSED |
+
+### Layer 5 GPU-only Test (threshold=0.9)
+
+| Metric | Result |
+|------|------|
+| Q/K shape | `[1, 16, 21001, 128]` (21K tokens) |
+| Density | 6.24% |
+| xattn_estimate vs KV chunking | exact match (0.0000%) |
+| mask diff | 0 / 435600 blocks |
+| attn_sums diff | max=0.0, mean=0.0 |
+
+### Observations
+
+1. **Density decreases as context grows**: 3.7K (38%) → 64.9K (11%)
+2. **The xattn_estimate API matches three-stage KV chunking exactly**: 0.0000% difference at all lengths
+3. **Saved density and computed density differ slightly**: the saved density may have been recorded under a different chunking, so the accumulated computation differs slightly
+
+---
+
+## Appendix: xattn_bsa vs xattn_estimate Alignment
 
 | Context | Tokens | Layer 0 Density | Compute Density | Min Layer | Result |
 |---------|--------|-----------------|-----------------|-----------|----------|
@@ -20,17 +136,6 @@
 | 32k | 32,485 | 50.2% | 40.1% | Layer 5 (18.5%) | ✅ PASSED |
 | 64k | 64,891 | 37.0% | 29.6% | Layer 5 (12.4%) | ✅ PASSED |
 
-## Verification Metrics
-
-For every tested length, the validation script checks the following metrics:
-
-| Metric | Expected | Actual |
-|------|------|----------|
-| attn_sums max diff | 0 | 0.000000e+00 |
-| attn_sums mean diff | 0 | 0.000000e+00 |
-| mask exact match | True | True |
-| density diff | 0 | 0.000000 |
-
 ## Density Formula
 
 ### Total (denominator)
@@ -79,7 +184,8 @@ _DEBUG_SAVE_MASK = True  # change to True
 ### Step 2: Run GPU-only inference
 
 ```bash
-CUDA_VISIBLE_DEVICES=0 python tests/test_ruler.py \
+CUDA_VISIBLE_DEVICES=0 PYTHONPATH=/path/to/nano-vllm:$PYTHONPATH \
+    python tests/test_ruler.py \
     --model ~/models/Llama-3.1-8B-Instruct \
     --data-dir tests/data/ruler_32k \
     --datasets niah_single_1 \
@@ -89,14 +195,52 @@ CUDA_VISIBLE_DEVICES=0 python tests/test_ruler.py \
     --sparse-threshold 0.9
 ```
 
-### Step 3: Run the validation script
+### Step 3: Run the KV chunking alignment check
 
 ```bash
-python tests/test_gpuonly_density_alignment.py
+# Use data saved by GPU-only mode
+CUDA_VISIBLE_DEVICES=0 PYTHONPATH=/path/to/nano-vllm:$PYTHONPATH \
+    python tests/test_xattn_estimate_alignment.py --gpuonly
+
+# Use data saved by Offload mode (default)
+CUDA_VISIBLE_DEVICES=0 PYTHONPATH=/path/to/nano-vllm:$PYTHONPATH \
+    python tests/test_xattn_estimate_alignment.py
+
+# Point at a custom data file
+python tests/test_xattn_estimate_alignment.py --data-file /path/to/data.pt
+
+# Batch-test all Offload data
+for f in results/kvcache/qkv_*.pt; do
+    echo "Testing: $(basename $f)"
+    python tests/test_xattn_estimate_alignment.py --data-file "$f"
+done
+```
+
+### Batch-Test All Lengths
+
+```bash
+for ctx in 4k 8k 16k 32k 64k; do
+    case $ctx in
+        4k)  max_len=5000 ;;
+        8k)  max_len=9000 ;;
+        16k) max_len=17000 ;;
+        32k) max_len=34000 ;;
+        64k) max_len=65664 ;;
+    esac
+
+    echo "Testing $ctx..."
+    python tests/test_ruler.py \
+        --data-dir tests/data/ruler_$ctx \
+        --max-model-len $max_len \
+        --sparse-policy XATTN_BSA \
+        --num-samples 1 --quiet
+
+    python tests/test_xattn_estimate_alignment.py --gpuonly
+done
 ```
 
 ## Related Files
 
 - `nanovllm/kvcache/sparse/xattn_bsa.py`: XAttention BSA policy implementation
-- `nanovllm/ops/xattn.py`: the xattn_estimate function
-- `tests/test_gpuonly_density_alignment.py`: validation script
+- `nanovllm/ops/xattn.py`: xattn_estimate and the three-stage KV chunking kernels
+- `tests/test_xattn_estimate_alignment.py`: KV chunking alignment test script
diff --git a/tests/test_xattn_estimate_alignment.py b/tests/test_xattn_estimate_alignment.py
index 20dc0b9..d7b1aba 100644
--- a/tests/test_xattn_estimate_alignment.py
+++ b/tests/test_xattn_estimate_alignment.py
@@ -10,13 +10,23 @@ Test: verify that xattn_estimate matches the KV chunking kernels
   2. merge_softmax_stats: merge all chunks' stats on the host
   3. softmax_normalize_and_block_sum: normalize with the global stats
 
+Two supported data formats:
+  1. Saved by offload mode: {"query", "key", "stride", "threshold", "density", "layer_id"}
+  2. Saved by GPU-only mode: {"Q", "K", "chunk_size", "block_size", "stride", "threshold", "mask", "attn_sums", ...}
+
 Usage:
+    # Use offload-mode data
     CUDA_VISIBLE_DEVICES=0 PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH \
         python tests/test_xattn_estimate_alignment.py
+
+    # Use GPU-only-mode data
+    CUDA_VISIBLE_DEVICES=0 PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH \
+        python tests/test_xattn_estimate_alignment.py --gpuonly
 """
 import sys
 sys.path.insert(0, "/home/zijie/Code/nano-vllm")
+import argparse
 import torch
 import math
 from nanovllm.ops.xattn import (
@@ -28,13 +38,22 @@ from nanovllm.ops.xattn import (
     find_blocks_chunked,
 )
 
+# ============================================================
+# Command-line arguments
+# ============================================================
+parser = argparse.ArgumentParser()
+parser.add_argument("--gpuonly", action="store_true", help="use data saved by GPU-only mode")
+parser.add_argument("--data-file", type=str, default=None, help="path to the data file")
+parser.add_argument("--chunk-size", type=int, default=None, help="override CHUNK_SIZE (to test different chunk sizes)")
+args = parser.parse_args()
+
 # ============================================================
 # Parameter configuration
 # ============================================================
-DATA_FILE = "/home/zijie/Code/nano-vllm/results/kvcache/qkv_32485.pt"
-BSA_BLOCK_SIZE = 128
-CHUNK_SIZE = 16384  # xattn_estimate default
-USE_SAVED_PARAMS = True  # set to False to use defaults
+if args.gpuonly:
+    DATA_FILE = args.data_file or "/home/zijie/Code/nano-vllm/results/mask_alignment/gpuonly_layer0.pt"
+else:
+    DATA_FILE = args.data_file or "/home/zijie/Code/nano-vllm/results/kvcache/qkv_32485.pt"
 
 device = "cuda"
 
@@ -46,23 +65,54 @@ print("Step 1: load real KV cache data")
 print("=" * 60)
 
 data = torch.load(DATA_FILE, map_location="cpu")
-Q = data["query"].to(device)  # [1, 32, seq_len, 128]
-K = data["key"].to(device)  # [1, 32, seq_len, 128]
+
+# Detect the data format and load accordingly
+if "Q" in data:
+    # Format saved by GPU-only mode
+    print(f"[INFO] detected GPU-only data format")
+    Q = data["Q"].to(device)
+    K = data["K"].to(device)
+    BSA_BLOCK_SIZE = data.get("block_size", 128)
+    CHUNK_SIZE = data.get("chunk_size", 4096)
+    STRIDE = data.get("stride", 8)
+    THRESHOLD = data.get("threshold", 0.9)
+    if isinstance(THRESHOLD, torch.Tensor):
+        THRESHOLD = THRESHOLD.item()
+    # GPU-only mode also saved mask and attn_sums, usable for verification
+    saved_mask = data.get("mask", None)
+    saved_attn_sums = data.get("attn_sums", None)
+    saved_density = None  # GPU-only mode does not save density
+    layer_id = 0  # GPU-only mode only saves layer 0
+else:
+    # Format saved by offload mode
+    print(f"[INFO] detected offload data format")
+    Q = data["query"].to(device)
+    K = data["key"].to(device)
+    BSA_BLOCK_SIZE = 128
+    CHUNK_SIZE = 4096
+    STRIDE = data["stride"]
+    THRESHOLD = data["threshold"]
+    if isinstance(THRESHOLD, torch.Tensor):
+        THRESHOLD = THRESHOLD[0].item()
+    saved_mask = None
+    saved_attn_sums = None
+    saved_density = data.get("density", None)
+    layer_id = data.get("layer_id", 0)
 
 batch_size, num_heads, seq_len, head_dim = Q.shape
 
-# Read parameters from the saved data
-if USE_SAVED_PARAMS:
-    STRIDE = data["stride"]
-    THRESHOLD = data["threshold"][0].item() if isinstance(data["threshold"], torch.Tensor) else data["threshold"]
-else:
-    STRIDE = 8
-    THRESHOLD = 0.9
+# Command-line override for CHUNK_SIZE
+if args.chunk_size is not None:
+    CHUNK_SIZE = args.chunk_size
+    print(f"[INFO] using CHUNK_SIZE={CHUNK_SIZE} from the command line")
 
 print(f"Q shape: {Q.shape}")
 print(f"K shape: {K.shape}")
-print(f"Data layer_id: {data['layer_id']}, saved density: {data['density']:.4f}")
-print(f"Parameters: STRIDE={STRIDE}, THRESHOLD={THRESHOLD}, CHUNK_SIZE={CHUNK_SIZE}")
+if saved_density is not None:
+    print(f"Data layer_id: {layer_id}, saved density: {saved_density:.4f}")
+else:
+    print(f"Data layer_id: {layer_id}")
+print(f"Parameters: STRIDE={STRIDE}, THRESHOLD={THRESHOLD}, CHUNK_SIZE={CHUNK_SIZE}, BSA_BLOCK_SIZE={BSA_BLOCK_SIZE}")
 print()
 
 # ============================================================
@@ -259,7 +309,57 @@ print(f"| xattn_estimate API | {density_api:.6f} | - | - |")
 print(f"| KV chunking | {density_kv:.6f} | {abs(density_api - density_kv):.6f} | {100*mask_diff/mask_total:.4f}% |")
 print()
 
-if abs(density_api - density_kv) < 1e-6 and mask_diff / mask_total < 0.001:
+passed = abs(density_api - density_kv) < 1e-6 and mask_diff / mask_total < 0.001
+
+# ============================================================
+# Step 5: Compare against data saved by GPU-only mode (if present)
+# ============================================================
+if saved_mask is not None or saved_attn_sums is not None:
+    print("=" * 60)
+    print("Step 5: compare against data saved by GPU-only mode")
+    print("=" * 60)
+    print()
+
+    if saved_mask is not None:
+        saved_mask_gpu = saved_mask.to(device)
+        # Compare the masks
+        mask_saved_diff = (mask_api_valid != saved_mask_gpu).sum().item()
+        mask_saved_total = saved_mask_gpu.numel()
+        print(f"| xattn_estimate vs GPU-only saved mask | differing blocks: {mask_saved_diff} / {mask_saved_total} ({100*mask_saved_diff/mask_saved_total:.4f}%) |")
+
+        if mask_saved_diff == 0:
+            print("✅ mask matches the GPU-only saved data exactly")
+        else:
+            print("❌ mask differs from the GPU-only saved data")
+            passed = False
+
+    if saved_attn_sums is not None:
+        saved_attn_sums_gpu = saved_attn_sums.to(device)
+        # We need attn_sums from xattn_estimate; call it once more to obtain them
+        attn_sums_check, _ = xattn_estimate(
+            Q, K,
+            block_size=BSA_BLOCK_SIZE,
+            stride=STRIDE,
+            threshold=THRESHOLD,
+            chunk_size=CHUNK_SIZE,
+            causal=True,
+        )
+        attn_sums_check_valid = attn_sums_check[:, :, :q_blocks, :k_blocks]
+
+        max_diff = (attn_sums_check_valid - saved_attn_sums_gpu).abs().max().item()
+        mean_diff = (attn_sums_check_valid - saved_attn_sums_gpu).abs().mean().item()
+        print(f"| xattn_estimate vs GPU-only saved attn_sums | max diff: {max_diff:.6e}, mean diff: {mean_diff:.6e} |")
+
+        if max_diff < 1e-5:
+            print("✅ attn_sums match the GPU-only saved data")
+        else:
+            print("❌ attn_sums differ from the GPU-only saved data")
+            passed = False
+
+    print()
+
+if passed:
     print("test_xattn_estimate_alignment: PASSED")
 else:
     print("test_xattn_estimate_alignment: FAILED")
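The equivalence claim behind this change (three-stage KV chunking matches single-pass processing exactly) follows from the standard online-softmax merge identity. Below is a minimal, self-contained sketch in plain PyTorch of why that holds; it is illustrative only and does not use the repo's actual `softmax_compute_partial_stats` / `merge_softmax_stats` / `softmax_normalize_and_block_sum` kernels (the helper names `partial_stats` and `merge_stats` are invented for this sketch):

```python
import torch

def partial_stats(scores):
    """Stage 1 (sketch): per-chunk max m_i and sum l_i = sum(exp(s - m_i))."""
    m = scores.max(dim=-1, keepdim=True).values
    l = torch.exp(scores - m).sum(dim=-1, keepdim=True)
    return m, l

def merge_stats(stats):
    """Stage 2 (sketch): fold per-chunk (m_i, l_i) into global (m, l)."""
    m_g, l_g = stats[0]
    for m_i, l_i in stats[1:]:
        m_new = torch.maximum(m_g, m_i)
        # Rescale both partial sums to the new common max before adding.
        l_g = l_g * torch.exp(m_g - m_new) + l_i * torch.exp(m_i - m_new)
        m_g = m_new
    return m_g, l_g

torch.manual_seed(0)
scores = torch.randn(2, 4, 64, 1024)   # [batch, heads, q_len, kv_len]
chunks = scores.split(256, dim=-1)     # four KV chunks of 256 keys each

# Stage 3 (sketch): normalize every chunk with the *global* stats.
m, l = merge_stats([partial_stats(c) for c in chunks])
chunked = torch.cat([torch.exp(c - m) / l for c in chunks], dim=-1)

reference = torch.softmax(scores, dim=-1)  # one-shot softmax over all keys
print(torch.allclose(chunked, reference, atol=1e-6))  # True
```

Since `exp(s - m_global) / l_global` depends only on the merged global stats, the chunked result matches the one-shot softmax up to floating-point rounding, which is why the tables in the doc report zero mask and attn_sums diffs regardless of CHUNK_SIZE.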