From 05ce57ee8e743090092beda5a6220ae72794c056 Mon Sep 17 00:00:00 2001
From: Zijie Tian
Date: Tue, 27 Jan 2026 04:36:31 +0800
Subject: [PATCH] =?UTF-8?q?=F0=9F=93=9D=20docs:=20add=20GPU-only=20sparse?=
 =?UTF-8?q?=20policy=20integration=20baseline?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Document baseline performance before integrating sparse attention into
GPU-only mode:
- GPU-only Full Attention: 4869 tok/s (32K prefill)
- CPU Offload Full Attention: 1500 tok/s (3.2x slower)

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude
Co-Authored-By: Happy
---
 docs/gpu_only_sparse_integration.md | 77 +++++++++++++++++++++++++++++
 1 file changed, 77 insertions(+)
 create mode 100644 docs/gpu_only_sparse_integration.md

diff --git a/docs/gpu_only_sparse_integration.md b/docs/gpu_only_sparse_integration.md
new file mode 100644
index 0000000..d165741
--- /dev/null
+++ b/docs/gpu_only_sparse_integration.md
@@ -0,0 +1,77 @@
+# GPU-only Sparse Policy Integration
+
+This document records the process of integrating sparse attention policies into GPU-only mode, along with before/after performance comparisons.
+
+## Background
+
+The sparse policies (Quest, XAttention) are currently implemented only in the CPU offload path. The goal is to extend them to GPU-only mode to improve performance in long-context scenarios.
+
+## Baseline Performance (Before Optimization)
+
+**Test environment**:
+- GPU: NVIDIA A100-SXM4-80GB
+- Model: Llama-3.1-8B-Instruct
+- Context length: 32K tokens
+- Date: 2026-01-27
+
+### Prefill Benchmark (32K context)
+
+| Mode | Throughput | Time | KV cache allocation |
+|------|------------|------|---------------------|
+| **GPU-only (Full Attention)** | 4869.67 tok/s | 6.73 s | 438 blocks (56 GB GPU) |
+| CPU Offload (Full Attention) | 1500.29 tok/s | 21.84 s | 4 GPU blocks + 32 CPU blocks |
+
+**Speedup**: GPU-only is **3.2x** faster than CPU offload.
+
+### Configuration Details
+
+**GPU-only mode**:
+```bash
+CUDA_VISIBLE_DEVICES=0 python bench.py \
+    --model ~/models/Llama-3.1-8B-Instruct \
+    --max-len 32768
+```
+
+**CPU offload mode**:
+```bash
+CUDA_VISIBLE_DEVICES=0 python bench_offload.py \
+    --model ~/models/Llama-3.1-8B-Instruct \
+    --max-len 32768
+```
+
+### KV Cache Configuration
+
+| Parameter | GPU-only | CPU Offload |
+|-----------|----------|-------------|
+| block_size | 1024 tokens | 1024 tokens |
+| per-token KV | 128 KB | 128 KB |
+| per-block KV | 128 MB | 128 MB |
+| GPU blocks | 438 | 4 |
+| CPU blocks | 0 | 32 |
+| Total memory | 56 GB | 4.6 GB |
+
+## Goals
+
+Integrate the following sparse policies into GPU-only mode:
+
+| Policy | Stage | Description |
+|--------|-------|-------------|
+| Quest | Decode | Top-K block selection based on query-key scores |
+| XAttention BSA | Prefill | Block sparse attention with cumulative threshold |
+
+## Implementation Progress
+
+- [ ] Analyze the existing sparse policy code structure
+- [ ] Design the GPU-only sparse policy interface
+- [ ] Implement GPU-only Quest decode
+- [ ] Implement GPU-only XAttention prefill
+- [ ] Benchmark and compare
+
+## Post-Optimization Performance
+
+*To be measured*
+
+| Mode | Throughput | Speedup vs Full |
+|------|------------|-----------------|
+| GPU-only + Quest (decode) | TBD | TBD |
+| GPU-only + XAttn (prefill) | TBD | TBD |
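
For reference, the "Top-K block selection based on query-key scores" that the patch attributes to Quest can be sketched as follows. This is a minimal NumPy illustration of the general technique (per-block key min/max metadata giving an upper bound on attention scores), not code from this repository; the function names `quest_block_scores` and `select_topk_blocks` are hypothetical, and the real decode path would operate on the paged KV cache rather than a dense array.

```python
import numpy as np

def quest_block_scores(q, keys, block_size):
    """Upper-bound each KV block's attention relevance, Quest-style.

    Each block summarizes its keys by elementwise min/max. Because every
    key k in the block satisfies kmin <= k <= kmax elementwise, the score
    sum(max(q*kmin, q*kmax)) is an upper bound on q.k for any k in the
    block, so blocks can be ranked without reading every key.
    """
    n_blocks = keys.shape[0] // block_size
    scores = np.empty(n_blocks)
    for b in range(n_blocks):
        blk = keys[b * block_size:(b + 1) * block_size]
        kmin, kmax = blk.min(axis=0), blk.max(axis=0)
        scores[b] = np.maximum(q * kmin, q * kmax).sum()
    return scores

def select_topk_blocks(q, keys, block_size, k):
    """Pick the k highest-scoring KV blocks for one decode-step query."""
    scores = quest_block_scores(q, keys, block_size)
    return np.sort(np.argsort(scores)[-k:])
```

In a real implementation the min/max metadata would be maintained incrementally as blocks fill during prefill/decode, so scoring a block costs O(d) instead of O(block_size * d).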