From 8d6fde3b2329cf3f2322333721ece11173b7d23e Mon Sep 17 00:00:00 2001
From: Zijie Tian
Date: Wed, 14 Jan 2026 08:39:03 +0800
Subject: [PATCH] docs: add Block-Sparse-Attention library reference

Add comprehensive documentation for the MIT-Han-Lab Block-Sparse-Attention
library (3rdparty submodule, branch: tzj/minference). The new document covers:

- Four sparse attention modes (dense, token/block streaming, block sparse)
- Hybrid mask support (different patterns per head)
- Complete API reference for all three functions
- Performance benchmarks (up to 3-4x speedup on A100)
- Integration considerations for nano-vllm

Co-Authored-By: Claude
---
 CLAUDE.md                          |   1 +
 docs/block_sparse_attention_lib.md | 191 +++++++++++++++++++++++++++++
 2 files changed, 192 insertions(+)
 create mode 100644 docs/block_sparse_attention_lib.md

diff --git a/CLAUDE.md b/CLAUDE.md
index 8559748..ef08768 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -53,6 +53,7 @@ PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH python tests/test_needle.py
 | [`docs/multi_model_support.md`](docs/multi_model_support.md) | Model registry system, adding new models (Qwen3/Llama), architecture differences, RoPE scaling |
 | [`docs/cuda_graph_offload_guide.md`](docs/cuda_graph_offload_guide.md) | CUDA graph support for CPU offload decode path, 4x decode speedup |
 | [`docs/sparse_attention_guide.md`](docs/sparse_attention_guide.md) | Block sparse attention methods (MInference, FlexPrefill, XAttention, Quest), computation flow |
+| [`docs/block_sparse_attention_lib.md`](docs/block_sparse_attention_lib.md) | MIT-Han-Lab Block-Sparse-Attention library reference: sparse modes, API, performance |
 | [`docs/sparse_prefill_integration_plan.md`](docs/sparse_prefill_integration_plan.md) | Integration plan for MInference/XAttention/FlexPrefill with unified BlockMask interface |
 | [`docs/sparse_offload_integration.md`](docs/sparse_offload_integration.md) | Sparse policy integration with layerwise offload, `requires_block_selection` interface design |
 | [`docs/layerwise_offload_memory_analysis.md`](docs/layerwise_offload_memory_analysis.md) | Memory allocation analysis with theoretical formulas and empirical validation (< 5% error) |
diff --git a/docs/block_sparse_attention_lib.md b/docs/block_sparse_attention_lib.md
new file mode 100644
index 0000000..28c3779
--- /dev/null
+++ b/docs/block_sparse_attention_lib.md
@@ -0,0 +1,191 @@
+# Block-Sparse-Attention Library Reference
+
+Block-sparse attention kernel library from MIT Han Lab, based on a modified FlashAttention 2.4.2, supporting multiple sparse attention modes.
+
+## Library Information
+
+- **Source**: [MIT-Han-Lab/Block-Sparse-Attention](https://github.com/mit-han-lab/Block-Sparse-Attention)
+- **Local path**: `3rdparty/Block-Sparse-Attention` (submodule, branch: `tzj/minference`)
+- **Based on**: FlashAttention 2.4.2
+- **Install location**: `site-packages/block_sparse_attn`
+
+## Supported Sparse Modes
+
+### 1. Dense Attention
+Computes the full attention matrix, with no sparsification.
+
+### 2. Token Streaming (token granularity)
+A fixed number of sink tokens plus local tokens, following [StreamingLLM](https://arxiv.org/abs/2309.17453).
+
+**Use case**: long-context inference that must preserve specific critical tokens exactly
+
+### 3. Block Streaming (block granularity)
+Streaming attention at block granularity, with block_size = 128.
+
+**Use case**: long-sequence inference, trading a small amount of accuracy for a larger speedup
+
+### 4. Block Sparse
+Sparse attention driven by a user-supplied block mask.
+
+**Use case**: workloads with known attention patterns
+
+### Hybrid Mode
+
+**Key feature**: different heads can use different sparse modes
+
+```python
+# Example hybrid configuration for 8 heads
+head_mask_type = [1, 1, 0, 0, 0, -1, 0, -1]
+# Meaning:
+# - heads 0,1: blocksparse (use basemask[0])
+# - heads 2-4,6: dense
+# - heads 5,7: streaming
+```
+
+**Mask type encoding**:
+- `0` = dense attention
+- `-1` = streaming attention
+- `1, 2, ...` = block sparse (uses basemask[mask_type - 1])
+
+## API Reference
+
+### `block_sparse_attn_func`
+
+General block-sparse attention function supporting all modes.
+
+```python
+from block_sparse_attn import block_sparse_attn_func
+
+output = block_sparse_attn_func(
+    q, k, v,                      # [total_tokens, heads, head_dim] unpadded
+    cu_seqlens_q, cu_seqlens_k,   # cumulative sequence lengths
+    head_mask_type,               # [heads] tensor, per-head mode
+    streaming_info,               # streaming config (sink/local counts)
+    base_blockmask,               # [q_blocks, k_blocks, n_masks] bool tensor
+    max_seqlen_q, max_seqlen_k,   # maximum sequence lengths
+    p_dropout,                    # dropout probability (0.0 for inference)
+    deterministic=False,
+    softmax_scale=None,
+    is_causal=False,
+    exact_streaming=False,        # True = token streaming, False = block streaming
+    return_attn_probs=False,
+)
+```
+
+**Key parameters**:
+| Parameter | Type | Description |
+|------|------|------|
+| `head_mask_type` | Tensor[heads] | Sparse mode per head: 0 = dense, -1 = streaming, 1+ = blocksparse |
+| `streaming_info` | Tensor | [sink_blocks, local_blocks] or [sink_tokens, local_tokens] |
+| `base_blockmask` | Tensor | Block mask of shape [q_blocks, k_blocks, n_masks] |
+| `exact_streaming` | bool | True = token-granularity, False = block-granularity streaming |
+
+### `block_streaming_attn_func`
+
+Block-granularity streaming attention (block_size = 128).
+
+```python
+from block_sparse_attn import block_streaming_attn_func
+
+output = block_streaming_attn_func(
+    q, k, v,
+    cu_seqlens_q, cu_seqlens_k,
+    head_mask_type,
+    streaming_info,               # [sink_blocks, local_blocks]
+    max_seqlen_q, max_seqlen_k,
+    p_dropout,
+    deterministic=False,
+    softmax_scale=None,
+    is_causal=True,
+    return_attn_probs=False,
+)
+```
+
+### `token_streaming_attn_func`
+
+Token-granularity streaming attention.
+
+**Note**: no backward pass is supported (inference only).
+
+```python
+from block_sparse_attn import token_streaming_attn_func
+
+output = token_streaming_attn_func(
+    q, k, v,
+    cu_seqlens_q, cu_seqlens_k,
+    head_mask_type,
+    streaming_info,               # [sink_tokens, local_tokens]
+    max_seqlen_q, max_seqlen_k,
+    deterministic=False,
+    softmax_scale=None,
+    return_attn_probs=False,
+)
+```
+
+## Technical Specifications
+
+| Feature | Support |
+|------|----------|
+| **Data types** | fp16, bf16 (bf16 requires Ampere/Ada/Hopper GPUs) |
+| **Head dimensions** | 32, 64, 128 |
+| **Block size** | 128 (fixed) |
+| **CUDA requirement** | 11.6+ |
+| **PyTorch requirement** | 1.12+ |
+
+## Performance Reference
+
+Test environment: A100 GPU, head_dim=128, 32 heads, batch_size=1
+
+### Block Sparse speedup
+- Up to **3-4x** over FlashAttention2
+- Speedup grows with sequence length
+
+### Streaming hybrid-mode speedup
+- Token streaming: 64 sink + 256 local tokens
+- Block streaming: 1 sink block + 3 local blocks
+- **50% dense + 50% streaming**: up to **2x** speedup
+
+## Integration Considerations for nano-vllm
+
+### Potential integration points
+
+1. **Long-context inference optimization**
+   - Use block streaming to reduce computation
+   - Reduce GPU-CPU transfers in CPU offload mode
+
+2. **Hybrid attention strategy**
+   - Some heads use streaming (less computation)
+   - Some heads use dense (preserving accuracy)
+   - See the hybrid pattern in the Duo Attention paper
+
+3. **Sparse offload**
+   - Offload only the KV cache of important blocks
+   - Combine with the `requires_block_selection` interface
+
+### Implementation notes
+
+1. **Input format**: the library uses an unpadded layout (`cu_seqlens`); conversion from nano-vllm's padded format is required
+2. **Fixed block size**: the library hardcodes block_size=128, which must be accommodated
+3. **Streaming info configuration**: sink/local counts need tuning to the model's characteristics
+
+## Related Work
+
+- [FlashAttention](https://github.com/Dao-AILab/flash-attention) - base implementation
+- [StreamingLLM](https://arxiv.org/abs/2309.17453) - theoretical basis for streaming attention
+- [Duo Attention](https://github.com/mit-han-lab/duo-attention) - hybrid dense/streaming modes
+- [MInference](https://arxiv.org/abs/2407.02490) - hybrid mask approach
+
+## Tests
+
+The library ships its own tests under `3rdparty/Block-Sparse-Attention/block_sparse_tests/`:
+
+```bash
+# Correctness tests
+cd 3rdparty/Block-Sparse-Attention/block_sparse_tests/fwd/test_correctness
+pytest full_test.py
+
+# Performance tests
+cd 3rdparty/Block-Sparse-Attention/block_sparse_tests/fwd/test_performance
+python token_streaming.py
+python blocksparse.py
+```
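
The unpadded (`cu_seqlens`) layout and the block-mask semantics in the patch above can be sketched host-side without a GPU. This is a minimal illustration under stated assumptions, not library code: `make_cu_seqlens`, `num_blocks`, and `causal_streaming_blockmask` are hypothetical helpers invented here; only the cumulative-seqlen convention, the fixed block_size = 128, and the sink/local streaming pattern are taken from the documented API.

```python
# Hypothetical helpers (not part of block_sparse_attn): build the host-side
# shapes that the varlen API and a block-level mask would use, in plain Python
# so the semantics are easy to check without a GPU.

BLOCK_SIZE = 128  # the library's fixed block size


def make_cu_seqlens(seqlens):
    """Cumulative sequence lengths for the unpadded layout: [0, n0, n0+n1, ...]."""
    cu = [0]
    for n in seqlens:
        cu.append(cu[-1] + n)
    return cu


def num_blocks(seqlen):
    """Number of 128-wide blocks covering a sequence (ceiling division)."""
    return (seqlen + BLOCK_SIZE - 1) // BLOCK_SIZE


def causal_streaming_blockmask(seqlen, sink_blocks, local_blocks):
    """Block-level mask combining causality with a streaming pattern:
    a query block attends to the first `sink_blocks` key blocks and to the
    `local_blocks` blocks ending at its own diagonal block."""
    n = num_blocks(seqlen)
    mask = [[False] * n for _ in range(n)]
    for qi in range(n):
        for ki in range(n):
            if ki > qi:
                continue  # causal: never attend to future blocks
            if ki < sink_blocks or qi - ki < local_blocks:
                mask[qi][ki] = True
    return mask
```

In actual use, the mask would be materialized as a CUDA bool tensor of shape [q_blocks, k_blocks, n_masks] for `base_blockmask`, with `head_mask_type` (e.g. `[1, 1, 0, 0, 0, -1, 0, -1]`) selecting which heads consume it, which run dense, and which run streaming.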