📝 docs: add XAttention offload stream sync fix documentation

- Document the CUDA stream synchronization bug in XAttention BSA
- Include root cause analysis with stream timing diagrams
- Add test commands and verification results (100% accuracy)
- Update CLAUDE.md documentation index

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
Zijie Tian
2026-02-05 01:32:50 +08:00
parent 829b311c02
commit 1ea5afd886
2 changed files with 315 additions and 0 deletions

View File

@@ -42,6 +42,7 @@ Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline L
| [`docs/xattn_kv_chunking_density_test.md`](docs/xattn_kv_chunking_density_test.md) | 🧪 TEST: XAttention KV chunking density 验证threshold=1.0 对齐threshold<1.0 差异 10-13% |
| [`docs/gpuonly_density_alignment_test.md`](docs/gpuonly_density_alignment_test.md) | ✅ TEST: Density 对齐验证 (GPU-only + Offload, 4K-64K)xattn_estimate vs KV chunking 完全一致 |
| [`docs/xattn_memory_benchmark.md`](docs/xattn_memory_benchmark.md) | 📊 BENCH: XAttention 内存基准测试Qwen3-0.6B 32K 在 24GB 显存可行 (gpu-util=0.28) |
| [`docs/xattn_offload_stream_sync_fix.md`](docs/xattn_offload_stream_sync_fix.md) | 🐛 FIX: XAttention Offload stream 同步 bugPass1/Pass2 K 数据不一致compute_stream 包装 |
## Rules Index
@@ -106,6 +107,13 @@ PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH python tests/test_needle.py
**Files**: `bench.py` (GPU), `bench_offload.py` (CPU offload), `bench_vllm.py` (comparison)
**GPU-only 测试模型选择**:
| GPU | 显存 | GPU-only 测试模型 |
|-----|------|------------------|
| RTX 3090 | 24GB | **Qwen3-0.6B** (必须7B+ 模型会 OOM) |
| A100 | 40GB+ | Qwen3-0.6B / 4B / 7B 均可 |
**Offload Mode Constraint**: When using `enable_cpu_offload=True`, only test with context length ≥ 32K. Shorter contexts don't exercise the chunked offload pipeline properly.
**Common Issues**: