📝 docs: add XAttention BSA Policy design documentation

- Create docs/xattn_bsa_policy_design.md with:
  - Algorithm overview and data flow diagram
  - select_blocks implementation details
  - GQA-aware aggregation and majority voting
  - compute_chunked_prefill ring buffer pipeline
  - Parameter configuration and usage examples
  - Performance characteristics and limitations
- Update CLAUDE.md documentation index

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 08:36:56 +08:00
parent 4d1e40152d
commit ad361c2c3b
2 changed files with 295 additions and 0 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -17,6 +17,7 @@ Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline L
 | [`docs/xattention_algorithm_guide.md`](docs/xattention_algorithm_guide.md) | XAttention algorithm in detail: stride reshape, Triton kernels, BSA dependency, block selection algorithm |
 | [`docs/xattn_kernels_guide.md`](docs/xattn_kernels_guide.md) | XAttention Triton kernels: flat_group_gemm (anti-diagonal summation), softmax_fuse_block_sum (block aggregation) |
 | [`docs/xattn_chunked_prefill.md`](docs/xattn_chunked_prefill.md) | XAttention chunked prefill: API, usage, consistency requirements |
+| [`docs/xattn_bsa_policy_design.md`](docs/xattn_bsa_policy_design.md) | XAttention BSA Policy design: select_blocks algorithm, majority voting, compute_chunked_prefill |
 | [`docs/block_sparse_attn_interface.md`](docs/block_sparse_attn_interface.md) | BSA (Block Sparse Attention) interface documentation: function signatures, usage examples, constraints |
 | [`docs/debugging_guide.md`](docs/debugging_guide.md) | PyTorch hooks for debugging, hook positions, tensor comparison, memory profiling |
 | [`docs/optimization_guide.md`](docs/optimization_guide.md) | Performance optimizations: sgDMA (15x), Triton merge (4.3x), N-way pipeline (2x) |