⚡️ perf: optimize XAttention estimate phase with K-only loading

Add load_k_only_to_slot_layer() to OffloadEngine for estimate phase:
- Only load K (not K+V) during block selection in select_blocks()
- Reduces H2D transfer by 50% in estimate phase
- 64K context: XAttn/Full ratio drops from 1.48x to 0.99x
- 32K context: XAttn/Full ratio drops from 1.67x to 1.20x

The estimate phase uses flat_group_gemm_fuse_reshape(Q, K) which only
requires K for attention score computation. V is unused.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-28 06:24:20 +08:00
parent a832d127b6
commit 3da9b8aef2
3 changed files with 102 additions and 7 deletions
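
As a rough sanity check on the figures in this commit, the prefill H2D volume can be modeled as an estimate-phase term plus a compute-phase term. The sketch below is not code from the repository: `xattn_over_full` is a hypothetical helper, and it assumes K and V occupy the same number of bytes per block, so that K-only loading halves the estimate-phase transfer. Under that assumption the ratio `1 + selection_density` from the benchmark doc becomes `0.5 + selection_density`.

```python
# Hypothetical back-of-the-envelope model; not code from the repository.
# Assumption: K and V are the same size per block, so loading only K
# halves the estimate-phase transfer.

def xattn_over_full(selection_density: float, k_only: bool) -> float:
    """Prefill H2D volume of XAttention relative to Full Policy.

    Estimate phase: every historical block is loaded (K+V = 1.0, K only = 0.5).
    Compute phase: only the selected fraction of blocks is loaded (K+V).
    """
    estimate = 0.5 if k_only else 1.0
    return estimate + selection_density


# Measured prefill H2D volumes in GB, copied from the tables in this commit.
measured = {
    "32K": {"full": 66.57, "before": 111.12, "after": 79.76},
    "64K": {"full": 262.13, "before": 386.62, "after": 258.78},
}

for ctx, gb in measured.items():
    # Density implied by the pre-optimization ratio (ratio = 1 + density).
    density = gb["before"] / gb["full"] - 1.0
    predicted_after = xattn_over_full(density, k_only=True)
    measured_after = gb["after"] / gb["full"]
    saved = 1.0 - gb["after"] / gb["before"]
    print(
        f"{ctx}: implied density {density:.2f}, "
        f"predicted {predicted_after:.2f}x, measured {measured_after:.2f}x, "
        f"saved {saved:.1%}"
    )
```

With the densities implied by the pre-optimization measurements, this simple model lands within a few hundredths of the measured post-optimization ratios, which is consistent with the estimate phase having been the source of the extra traffic.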
--- a/docs/memory_communication_benchmark.md
+++ b/docs/memory_communication_benchmark.md
@@ -34,17 +34,17 @@ GPU-CPU communication volume test results, comparing Full Policy and XAttention BSA Policy.
 | Decode H2D (32 tokens) | 262.13 GB | 262.13 GB | 1.00x |
 | TTFT | 27081 ms | 33634 ms | 1.24x |

-## Communication Volume Ratio Comparison
+## Communication Volume Ratio Comparison (before the K-only optimization)

 | Context length | XAttn/Full Prefill H2D ratio |
 |------------|----------------------------|
 | 32K | 1.67x |
 | 64K | 1.48x |

-### Analysis
+### Analysis (before the optimization)

 1. **Why XAttention transfers more data**:
-   - Estimate phase: loads **100%** of the historical blocks (to estimate attention scores)
+   - Estimate phase: loads **K+V** for **100%** of the historical blocks (to estimate attention scores)
    - Compute phase: loads the **selected** blocks (about 70-80%)
    - Theoretical ratio: `1 + selection_density`

@@ -57,6 +57,44 @@ GPU-CPU 通信量测试结果，对比 Full Policy 和 XAttention BSA Policy。
   - XAttention only supports the prefill phase
   - The decode phase falls back to Full Policy

+---
+
+## K-only Optimization (2026-01-28)
+
+### Optimization Rationale
+
+The `select_blocks` estimate phase of XAttention only needs K to compute attention scores:
+```python
+# flat_group_gemm_fuse_reshape only uses Q and K
+attn_scores = flat_group_gemm_fuse_reshape(Q, K_chunk, stride, ...)
+```
+
+V is never used during the estimate phase, yet the previous code loaded both K and V, so half of the estimate-phase H2D transfer was wasted.
+
+### Implementation
+
+1. **New method**: `OffloadEngine.load_k_only_to_slot_layer()` - loads only K
+2. **Modified `select_blocks`**: uses the new K-only loading method
+
+### Test Results After the Optimization
+
+| Context | Full Policy | XAttn (before) | XAttn (after) | Savings vs. before |
+|--------|-------------|---------------|---------------|---------|
+| 32K | 66.57 GB | 111.12 GB | **79.76 GB** | **28.2%** |
+| 64K | 262.13 GB | 386.62 GB | **258.78 GB** | **33.1%** |
+
+### XAttn/Full Ratio Change
+
+| Context | Ratio before | Ratio after |
+|--------|-----------|-----------|
+| 32K | 1.67x | **1.20x** |
+| 64K | 1.48x | **0.99x** |
+
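+### Conclusion
+
+After the optimization, XAttention's communication volume at 64K context is roughly on par with Full Policy (0.99x),
+and at 32K it drops from 1.67x to 1.20x. Loading only K in the estimate phase cuts XAttention's transfer volume by 28.2% (32K) and 33.1% (64K).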
+
 ## Test Command

 ```bash