📝 docs: add XAttention offload stream sync fix documentation

- Document the CUDA stream synchronization bug in XAttention BSA
- Include root cause analysis with stream timing diagrams
- Add test commands and verification results (100% accuracy)
- Update CLAUDE.md documentation index

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@@ -42,6 +42,7 @@ Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline L
| [`docs/xattn_kv_chunking_density_test.md`](docs/xattn_kv_chunking_density_test.md) | 🧪 TEST: XAttention KV chunking density verification; threshold=1.0 matches exactly, threshold<1.0 differs by 10-13% |
| [`docs/gpuonly_density_alignment_test.md`](docs/gpuonly_density_alignment_test.md) | ✅ TEST: Density alignment verification (GPU-only + Offload, 4K-64K); xattn_estimate vs KV chunking fully consistent |
| [`docs/xattn_memory_benchmark.md`](docs/xattn_memory_benchmark.md) | 📊 BENCH: XAttention memory benchmark; Qwen3-0.6B at 32K is feasible in 24GB VRAM (gpu-util=0.28) |
| [`docs/xattn_offload_stream_sync_fix.md`](docs/xattn_offload_stream_sync_fix.md) | 🐛 FIX: XAttention Offload stream sync bug; Pass1/Pass2 read inconsistent K data, fixed by wrapping the copy in compute_stream |
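The stream sync fix above concerns an async copy updating the K buffer while two compute passes read it. The actual fix uses CUDA streams; as a minimal analogue, the same race-and-fix pattern can be sketched with Python threads and an event acting as the synchronization point (all names here are illustrative, not nano-vllm code):

```python
import threading

def run(passes_wait_for_copy: bool):
    """Model of the offload race: a 'copy stream' thread updates the K
    buffer asynchronously while Pass1/Pass2 read it. The fix corresponds
    to making compute wait on a copy-done event before reading."""
    k_buffer = {"version": 0}      # version 0 = stale K, 1 = new K chunk
    copy_done = threading.Event()
    observed = []

    def copy_stream():
        k_buffer["version"] = 1    # stands in for the async H2D copy
        copy_done.set()

    def compute_pass():
        if passes_wait_for_copy:
            copy_done.wait()       # the synchronization point of the fix
        observed.append(k_buffer["version"])

    threads = [threading.Thread(target=copy_stream),
               threading.Thread(target=compute_pass),   # Pass1
               threading.Thread(target=compute_pass)]   # Pass2
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return observed

# With the sync in place, both passes observe the same (new) K data.
print(run(passes_wait_for_copy=True))  # [1, 1]
```

Without the wait, either pass may read version 0 or 1 nondeterministically, which is the Pass1/Pass2 inconsistency the doc describes.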
## Rules Index
@@ -106,6 +107,13 @@ PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH python tests/test_needle.py
**Files**: `bench.py` (GPU), `bench_offload.py` (CPU offload), `bench_vllm.py` (comparison)
**GPU-only test model selection**:

| GPU | VRAM | GPU-only test model |
|-----|------|---------------------|
| RTX 3090 | 24GB | **Qwen3-0.6B** (required; 7B+ models will OOM) |
| A100 | 40GB+ | Qwen3-0.6B / 4B / 7B all work |
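The table above can be encoded as a small selection helper, useful in a test harness. This is a hypothetical sketch (the function name and thresholds below just restate the table; it is not part of nano-vllm):

```python
def pick_test_model(vram_gb: float) -> str:
    """Pick a GPU-only test model by available VRAM, per the table above."""
    if vram_gb >= 40:
        # A100-class: 0.6B / 4B / 7B all fit; default to the largest listed
        return "Qwen3-7B"
    if vram_gb >= 24:
        # RTX 3090-class: 7B+ models OOM, so the 0.6B model is mandatory
        return "Qwen3-0.6B"
    raise ValueError("less than 24GB VRAM is not covered by the table")

print(pick_test_model(24))  # Qwen3-0.6B
print(pick_test_model(40))  # Qwen3-7B
```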
**Offload Mode Constraint**: When using `enable_cpu_offload=True`, only test with context length ≥ 32K. Shorter contexts don't exercise the chunked offload pipeline properly.
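A guard like the following could enforce this constraint in a test script; the helper name and error message are illustrative, not part of the nano-vllm API:

```python
MIN_OFFLOAD_CONTEXT = 32 * 1024  # 32K tokens, per the constraint above

def check_offload_context(enable_cpu_offload: bool, context_len: int) -> None:
    """Reject context lengths too short to exercise the chunked offload
    pipeline when CPU offload is enabled."""
    if enable_cpu_offload and context_len < MIN_OFFLOAD_CONTEXT:
        raise ValueError(
            f"context_len={context_len} < 32K: too short to exercise "
            "the chunked offload pipeline"
        )

check_offload_context(enable_cpu_offload=True, context_len=64 * 1024)  # ok
```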
**Common Issues**: