docs: add XAttention integration guide
Comprehensive documentation for XAttention sparse policy integration:

- Algorithm principles (chunked estimation + block sparse attention)
- COMPASS source code analysis
- Design decisions for CPU offload mode
- Implementation details (utils.py, kernels.py, xattn.py)
- Problem-solving (OOM, GQA, abstract method)
- Test validation results (RULER 32k benchmark)

Co-Authored-By: Claude <noreply@anthropic.com>
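The "chunked estimation + block sparse attention" principle mentioned above can be illustrated with a minimal sketch: score key blocks cheaply per query block, keep only the highest-scoring blocks, and run softmax attention restricted to those blocks. This is a simplified, hypothetical illustration, not the project's implementation; real XAttention uses strided antidiagonal scoring inside fused GPU kernels and never materializes the full score matrix, which this toy version does for clarity.

```python
import numpy as np

def xattention_sketch(Q, K, V, block=4, keep_ratio=0.5):
    """Toy block-sparse attention: (1) estimate per-block importance via
    antidiagonal sums, (2) keep top key blocks per query block, (3) run
    softmax attention over the kept key positions only."""
    n, d = Q.shape
    nb = n // block
    # Full scores materialized only for this sketch; a real kernel avoids this.
    scores = Q @ K.T / np.sqrt(d)

    # (1) Block importance: sum one antidiagonal inside each (qb, kb) block.
    imp = np.zeros((nb, nb))
    for i in range(nb):
        for j in range(nb):
            blk = scores[i * block:(i + 1) * block, j * block:(j + 1) * block]
            imp[i, j] = sum(blk[a, block - 1 - a] for a in range(block))

    # (2)+(3) Keep the top key blocks per query block, attend only to them.
    k_keep = max(1, int(keep_ratio * nb))
    out = np.zeros_like(Q)
    for i in range(nb):
        kept = sorted(np.argsort(imp[i])[-k_keep:])
        cols = np.concatenate(
            [np.arange(j * block, (j + 1) * block) for j in kept])
        s = scores[i * block:(i + 1) * block, cols]
        p = np.exp(s - s.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        out[i * block:(i + 1) * block] = p @ V[cols]
    return out
```

With `keep_ratio=1.0` every block is kept and the sketch reduces to dense attention, which is a convenient sanity check for the sparse path.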
@@ -61,6 +61,7 @@ PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH python tests/test_needle.py
| [`docs/gpu_only_performance_issue.md`](docs/gpu_only_performance_issue.md) | GPU-only mode slower than offload due to PagedAttention scatter overhead, optimization proposals |
| [`docs/offload_accuracy_issue.md`](docs/offload_accuracy_issue.md) | **BUG**: CPU offload mode 66% accuracy vs 100% non-offload on RULER NIAH benchmark |
| [`docs/64k_memory_analysis.md`](docs/64k_memory_analysis.md) | 64k inference memory analysis: GPU-only vs offload, OOM root cause (fragmentation), RTX 3090 limitations |
| [`docs/xattention_integration.md`](docs/xattention_integration.md) | XAttention integration guide: algorithm, implementation, design decisions, and testing |
| [`docs/xattention_analysis.md`](docs/xattention_analysis.md) | XAttention algorithm analysis: chunked estimation, block sparse attention, integration design |
| [`docs/development_notes.md`](docs/development_notes.md) | Development notes and scratchpad for ongoing work |