Compare commits

6 Commits

Author SHA1 Message Date
Zijie Tian
0d31b3f71f 📝 docs: add CPU offload optimization strategies guide
- Document chunk size optimization (simplest, most effective)
- Analyze CUDA Graph limitations for offload scenarios
- Cover CUDA Graph applicability for MLP/Proj layers
- Survey frontier research: InfiniGen, ShadowKV, L2 Prefetch, KVPR
- Add optimization priority recommendations

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 04:44:36 +08:00
Zijie Tian
73c9dc46ff feat: add XAttention BSA support to bench_offload.py
- Add --model parameter (default: Llama-3.1-8B-Instruct)
- Add --enable-xattn flag for XAttention BSA sparse prefill
- Add --xattn-threshold and --xattn-stride parameters
- Change default num-gpu-blocks from 6 to 4
- Add benchmark results doc with Full vs XAttn comparison (32K/128K)

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 04:20:16 +08:00
Zijie Tian
924a0d2bfa 🔧 chore: add nsys profiling rule and update gitignore
- Add rule requiring profile_offload.sh for all nsys profiling
- Document available parameters and typical workflows
- Ignore Snipaste screenshot files

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 03:42:17 +08:00
Zijie Tian
0619accd1c 📝 docs: add CPU scheduling latency analysis for chunked attention
- Document kernel gap analysis showing 77-81% CPU scheduling overhead
- Identify GPU utilization at 12.8% with potential to reach 39.5%
- Outline optimization directions: CUDA Graph, Triton fusion, C++ extension
- Add documentation index entry in CLAUDE.md

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 03:42:12 +08:00
Zijie Tian
18bc433f09 perf: improve NVTX profiling with colored ranges and configurable slots
- Switch from torch.cuda.nvtx to nvtx package for colored range support
- Add color coding: blue for H2D, green for D2H decode, orange for D2H prefill
- Add --num-gpu-blocks parameter to profile_offload.sh
- Include slot count in output filename for easier comparison

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 03:42:05 +08:00
Zijie Tian
aea3812230 ♻️ refactor: unify KV cache operations through OffloadEngine
- Add write_to_prefill_buffer() and write_to_decode_buffer() methods
- Add chunk_idx parameter to load_to_slot_layer() for NVTX labeling
- Replace direct copy_() calls with OffloadEngine methods in attention.py
- Update all load_to_slot_layer() calls to pass chunk_idx
- NVTX markers now show chunk info: "H2D: L{layer} Chunk{chunk} CPU[{block}]->Slot[{slot}]"

All KV cache data transfers in chunked offload mode now go through
OffloadEngine, enabling better profiling and consistent management.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 02:20:59 +08:00
12 changed files with 793 additions and 38 deletions


@@ -0,0 +1,89 @@
# Nsys Profiling Rule
## Mandatory Rule
**All nsys profiling tasks must use the `scripts/profile_offload.sh` script**; never run nsys commands directly.
| Forbidden | Reason |
|------|------|
| `nsys profile python tests/test_ruler.py ...` | Inconsistent parameters, messy output paths |
| Hand-crafted nsys commands | Easy to miss key arguments |
## Usage
```bash
# Basic usage (default: 4 slots)
bash scripts/profile_offload.sh
# Specify the number of GPU slots
bash scripts/profile_offload.sh --num-gpu-blocks 8
# Specify a sample
bash scripts/profile_offload.sh --sample 5
# Specify a dataset
bash scripts/profile_offload.sh --dataset niah_single_1
# Disable offload (comparison run)
bash scripts/profile_offload.sh --no-offload
# Combine arguments
bash scripts/profile_offload.sh --num-gpu-blocks 8 --sample 0 --gpu 1
```
## Parameters
| Parameter | Default | Description |
|------|--------|------|
| `--dataset` | `niah_single_1` | RULER task name |
| `--sample` | `0` | Sample index |
| `--gpu` | `0` | GPU to use |
| `--num-gpu-blocks` | `4` | Number of GPU ring buffer slots |
| `--no-offload` | - | Disable CPU offload |
## Output Files
Output files are written automatically to the `results/nsys/` directory:
```
results/nsys/ruler_<dataset>_sample<index>_offload_<slots>slots_<timestamp>.nsys-rep
```
Example: `ruler_niah_single_1_sample0_offload_8slots_20260127_031500.nsys-rep`
## Viewing Results
```bash
# View in the GUI
nsight-sys results/nsys/<filename>.nsys-rep
# Command-line statistics
nsys stats --report cuda_api_sum results/nsys/<filename>.nsys-rep
nsys stats --report cuda_gpu_kern_sum results/nsys/<filename>.nsys-rep
```
## Typical Workflows
### 1. Compare different slot counts
```bash
# Test 4 slots (default)
bash scripts/profile_offload.sh --num-gpu-blocks 4
# Test 8 slots
bash scripts/profile_offload.sh --num-gpu-blocks 8
# Compare results
nsys stats --report cuda_gpu_kern_sum results/nsys/*4slots*.nsys-rep
nsys stats --report cuda_gpu_kern_sum results/nsys/*8slots*.nsys-rep
```
### 2. Analyze pipeline overlap
```bash
# Generate the profile
bash scripts/profile_offload.sh --num-gpu-blocks 8
# Inspect the CUDA HW timeline in the nsight-sys GUI
# Check whether H2D and flash_fwd_kernel overlap
```

.gitignore vendored

@@ -239,3 +239,4 @@ task_plan_*.md
 findings_*.md
 progress_*.md
 notes.md
+Snipaste*


@@ -26,6 +26,9 @@ Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline L
 | [`docs/ruler_32k_chunked_offload_issue.md`](docs/ruler_32k_chunked_offload_issue.md) | ⚠️ OPEN ISSUE: 32K chunked offload accuracy problem (20% error rate in RULER) |
 | [`docs/chunked_attention_solutions.md`](docs/chunked_attention_solutions.md) | 🔧 SOLUTIONS: Code analysis and fixes for the chunked attention accuracy problem |
 | [`docs/nsys_wrong_event_order_bug.md`](docs/nsys_wrong_event_order_bug.md) | 🐛 NSYS BUG: Debugging notes on out-of-order nsys timestamps triggered by the ring buffer pipeline |
+| [`docs/cpu_scheduling_latency_analysis.md`](docs/cpu_scheduling_latency_analysis.md) | ⚡ PERF: CPU scheduling latency analysis: kernel gap sources, GPU utilization optimization directions |
+| [`docs/bench_offload_results.md`](docs/bench_offload_results.md) | 📊 BENCH: CPU offload benchmark results: Full vs XAttention comparison (32K/128K) |
+| [`docs/cpu_offload_optimization_strategies.md`](docs/cpu_offload_optimization_strategies.md) | 🚀 OPT: CPU offload optimization strategies: chunk size, CUDA Graph, frontier research (InfiniGen/ShadowKV) |
 ## Rules Index


@@ -46,24 +46,41 @@ def main():
     from nanovllm.config import SparsePolicyType
     parser = argparse.ArgumentParser(description="Benchmark CPU offload performance")
-    parser.add_argument("--enable-quest", action="store_true", help="Enable Quest sparse attention for decode")
+    parser.add_argument("--model", type=str, default="~/models/Llama-3.1-8B-Instruct",
+                        help="Model path (default: ~/models/Llama-3.1-8B-Instruct)")
+    # Sparse policy selection (mutually exclusive)
+    sparse_group = parser.add_mutually_exclusive_group()
+    sparse_group.add_argument("--enable-quest", action="store_true",
+                              help="Enable Quest sparse attention (decode only, prefill uses full)")
+    sparse_group.add_argument("--enable-xattn", action="store_true",
+                              help="Enable XAttention BSA (prefill only, decode uses full)")
+    # Quest parameters
     parser.add_argument("--topk", type=int, default=16, help="Top-K blocks for Quest (default: 16)")
     parser.add_argument("--threshold", type=int, default=4, help="Apply sparse only when blocks > threshold (default: 4)")
+    # XAttention parameters
+    parser.add_argument("--xattn-threshold", type=float, default=0.95,
+                        help="XAttention cumulative attention threshold (default: 0.95)")
+    parser.add_argument("--xattn-stride", type=int, default=8,
+                        help="XAttention Q/K downsampling stride (default: 8)")
+    # General parameters
     parser.add_argument("--input-len", type=int, default=None, help="Input length in tokens")
     parser.add_argument("--output-len", type=int, default=64, help="Output length for decode benchmark (default: 64)")
-    parser.add_argument("--num-gpu-blocks", type=int, default=6, help="Number of GPU blocks (default: 6)")
+    parser.add_argument("--num-gpu-blocks", type=int, default=4, help="Number of GPU blocks (default: 4)")
     parser.add_argument("--max-len", type=int, default=32*1024, help="Max model length (default: 32K)")
     parser.add_argument("--bench-decode", action="store_true", help="Run decode benchmark (default: prefill only)")
     parser.add_argument("--bench-all", action="store_true", help="Run both prefill and decode benchmarks")
     args = parser.parse_args()
-    path = os.path.expanduser("~/models/Qwen3-4B-Instruct-2507/")
+    path = os.path.expanduser(args.model)
     max_len = args.max_len
     # Setup policy configuration
     if args.enable_quest:
         sparse_policy = SparsePolicyType.QUEST
-        print(f"\n[Quest Sparse Attention] topk={args.topk}, threshold={args.threshold}")
+        print(f"\n[Quest Sparse Attention] decode: Quest (topk={args.topk}, threshold={args.threshold}), prefill: Full")
+    elif args.enable_xattn:
+        sparse_policy = SparsePolicyType.XATTN_BSA
+        print(f"\n[XAttention BSA] prefill: XAttn (tau={args.xattn_threshold}, stride={args.xattn_stride}), decode: Full")
     else:
         sparse_policy = SparsePolicyType.FULL
         print("\n[Full Attention] baseline (no sparse)")
@@ -78,8 +95,12 @@ def main():
         enable_cpu_offload=True,
         num_gpu_blocks=args.num_gpu_blocks,
         sparse_policy=sparse_policy,
+        # Quest parameters
         sparse_topk_blocks=args.topk,
         sparse_threshold_blocks=args.threshold,
+        # XAttention parameters
+        sparse_threshold=args.xattn_threshold,
+        sparse_stride=args.xattn_stride,
     )
     # Warmup


@@ -0,0 +1,89 @@
# CPU Offload Benchmark Results
This document records performance results for `bench_offload.py` under different configurations.
## Test Environment
| Parameter | Value |
|------|-----|
| GPU | NVIDIA A100-SXM4-80GB |
| Model | Llama-3.1-8B-Instruct |
| GPU slots | 4 |
| Block size | 1024 tokens |
| Chunk size | 2048 tokens |
## Sparse Policy Configurations
| Policy | Prefill | Decode | Notes |
|------|---------|--------|------|
| FULL | Full Attention | Full Attention | Baseline; loads all blocks |
| XATTN_BSA | XAttention (tau=0.95, stride=8) | Full Attention (fallback) | Sparse prefill |
## Results
### 32K context
| Policy | Input Length | Time | Throughput | Relative |
|------|----------|------|--------|----------|
| Full Attention | 32767 tok | 20.64s | **1587.74 tok/s** | baseline |
| XAttention BSA | 32767 tok | 27.95s | **1172.33 tok/s** | 0.74x |
### 128K context
| Policy | Input Length | Time | Throughput | Relative |
|------|----------|------|--------|----------|
| Full Attention | 131071 tok | 237.18s | **552.63 tok/s** | baseline |
| XAttention BSA | 131071 tok | 281.17s | **466.17 tok/s** | 0.84x |
### KV Cache Configuration
| Context | GPU Memory | CPU Memory | Total |
|--------|------------|------------|-------|
| 32K | 512 MB (4 blocks) | 4096 MB (32 blocks) | 4608 MB |
| 128K | 512 MB (4 blocks) | 16384 MB (128 blocks) | 16896 MB |
## Analysis
### XAttention performance characteristics
1. **32K context**: XAttention is 26% slower than Full
2. **128K context**: XAttention is 16% slower than Full
As the context grows, XAttention's relative performance improves (74% → 84%), but it still does not beat Full Attention.
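Since the relative-performance column is just the throughput ratio, it can be rechecked from the raw numbers in the tables:

```python
# (input_tokens, seconds) for each run, copied from the tables above
runs = {
    "full_32k":   (32767, 20.64),
    "xattn_32k":  (32767, 27.95),
    "full_128k":  (131071, 237.18),
    "xattn_128k": (131071, 281.17),
}
tput = {name: tokens / secs for name, (tokens, secs) in runs.items()}
rel_32k = tput["xattn_32k"] / tput["full_32k"]     # → 0.74
rel_128k = tput["xattn_128k"] / tput["full_128k"]  # → 0.84
print(f"32K: {rel_32k:.2f}x, 128K: {rel_128k:.2f}x")
```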
### Why
1. **tau=0.95 is a high threshold**: covering 95% of cumulative attention leaves few blocks to skip
2. **Estimation overhead**: `xattn_estimate_chunked` must compute a sparse mask for every chunk
3. **BSA kernel overhead**: the block-sparse kernel adds mask handling and indexing costs
4. **Offload bottleneck**: CPU→GPU transfer dominates; sparse attention saves compute, not transfer
### When XAttention helps
XAttention BSA is better suited to:
- Longer contexts (256K+), where sparsity pays off more
- Compute-bound workloads (non-offload mode) where transfer is not the bottleneck
- Lower tau thresholds (e.g. 0.8), which increase sparsity
## Commands
```bash
# Full Attention (32K)
CUDA_VISIBLE_DEVICES=0 python bench_offload.py --max-len 32768
# XAttention BSA (32K)
CUDA_VISIBLE_DEVICES=0 python bench_offload.py --max-len 32768 --enable-xattn
# Full Attention (128K)
CUDA_VISIBLE_DEVICES=0 python bench_offload.py --max-len 131072
# XAttention BSA (128K)
CUDA_VISIBLE_DEVICES=0 python bench_offload.py --max-len 131072 --enable-xattn
# Tune XAttention parameters
CUDA_VISIBLE_DEVICES=0 python bench_offload.py --enable-xattn --xattn-threshold 0.8 --xattn-stride 16
```
## Changelog
- 2026-01-27: Initial benchmarks (Llama-3.1-8B-Instruct, A100 80GB)


@@ -0,0 +1,300 @@
# CPU Offload Optimization Strategies
This document analyzes performance optimization strategies for the CPU offload scenario, covering both practical approaches and frontier research directions.
## Problem Recap
Per the [CPU scheduling latency analysis](cpu_scheduling_latency_analysis.md), the current chunked attention pipeline suffers from:
| Metric | Current | Theoretical |
|------|--------|--------|
| Flash kernel execution time | ~138 μs | - |
| Flash kernel interval | ~942 μs | ~211 μs (H2D + merge only) |
| GPU utilization | **12.8%** | **39.5%** (upper bound) |
| CPU scheduling idle share | **77-81%** | 0% |
**Root cause**: every block goes through a full Python loop iteration, incurring heavy CPU scheduling latency.
---
## Strategy 1: Increase the Chunk Size (Recommended)
### Key Insight
**Merging several small chunks is equivalent to using one large chunk directly:**
```
Option A: merge 4 small chunks
[H2D 2K][H2D 2K][H2D 2K][H2D 2K] → concat → [Flash 8K] → merge
Option B: one large chunk
[H2D 8K] → [Flash 8K] → merge
The computed results are exactly equivalent!
```
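The equivalence can be verified numerically with a minimal single-query NumPy sketch (no softmax scaling; `chunk_attention` and `merge` are illustrative stand-ins for the repo's flash and merge kernels):

```python
import numpy as np

def chunk_attention(q, k, v):
    """Attention of q over one KV chunk; returns (output, log-sum-exp)."""
    s = q @ k.T                          # scores [1, chunk_len]
    lse = np.log(np.exp(s).sum())        # scalar log-sum-exp of the scores
    out = np.exp(s - lse) @ v            # softmax-weighted values
    return out, lse

def merge(out_a, lse_a, out_b, lse_b):
    """Merge two partial attention results using their LSEs."""
    lse = np.logaddexp(lse_a, lse_b)
    return np.exp(lse_a - lse) * out_a + np.exp(lse_b - lse) * out_b, lse

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal(shape) for shape in [(1, 16), (8, 16), (8, 16)])

full_out, _ = chunk_attention(q, k, v)         # Option B: one big chunk
out1, lse1 = chunk_attention(q, k[:4], v[:4])  # Option A: two small chunks...
out2, lse2 = chunk_attention(q, k[4:], v[4:])
merged_out, _ = merge(out1, lse1, out2, lse2)  # ...merged

assert np.allclose(full_out, merged_out)       # identical up to float error
```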
### Benefit Analysis
| Metric | Small chunk (2K) × 4 | Large chunk (8K) × 1 |
|------|-------------------|-------------------|
| H2D transfers | 4 | 1 |
| Flash kernel launches | 4 | 1 |
| Merge calls | 4 | 1 |
| Python loop iterations | 4 | 1 |
| CPU scheduling overhead | 4 × ~300μs = 1200μs | 1 × ~300μs = 300μs |
**In essence**: the root cause of the CPU scheduling latency is the number of loop iterations, and a larger chunk size directly reduces that count.
### Trade-offs
1. **More GPU memory**
   - 2K chunk: ~4MB per slot (K+V)
   - 8K chunk: ~16MB per slot (K+V)
   - 4 slots = 64MB, negligible on an 80GB A100
2. **Longer individual H2D transfers**
   - H2D 8K ≈ 350μs
   - Flash 8K ≈ 550μs
   - Since Flash > H2D, the pipeline still overlaps effectively
### Configuration
```bash
# Test different block sizes
python bench_offload.py --kvcache-block-size 2048  # baseline
python bench_offload.py --kvcache-block-size 4096  # 2x
python bench_offload.py --kvcache-block-size 8192  # 4x
```
---
## Strategy 2: CUDA Graph (for the non-attention parts)
### Why CUDA Graph is limited in the offload scenario
CUDA Graph assumes all operations are fixed at capture time, with stable data addresses.
**Reality in the offload scenario:**
1. **Dynamic H2D source addresses** - each load comes from a different CPU block
2. **Load decisions made at runtime** - which blocks to load is dynamic
3. **CPU must coordinate** - synchronizing H2D and compute requires CPU involvement
```
Offload scenario:
┌──────────────────────────────────────────────┐
│ Data lives on the CPU, loaded dynamically    │
│ [H2D_i] → [Compute] → [H2D_{i+n}] → ...      │
│     ↑ dynamic; the CPU must take part        │
└──────────────────────────────────────────────┘
Even with a graph:
Python: [wait_h2d] [replay] [launch_h2d] [wait_h2d] [replay] ...
         ↑ CPU involved  ↑ CPU involved  ↑ CPU involved
The CPU scheduling overhead remains; the graph only optimizes the compute in between.
```
**Conclusion**: CUDA Graph is not a silver bullet for the offload scenario.
### Where it applies: MLP and projection layers
Per-layer computation flow in an LLM:
```
┌─────────────────────────────────────────────────────────────┐
│ [LayerNorm] → [QKV Proj] → [Attention] → [O Proj] → [Add]   │
│                               ↑                             │
│                          KV Offload                         │
│ [LayerNorm] → [MLP: gate + up + down] → [Add]               │
└─────────────────────────────────────────────────────────────┘
```
| Component | Touches Offload | CUDA Graph-able |
|------|-------------|-----------------|
| LayerNorm | ❌ | ✅ |
| QKV Projection | ❌ | ✅ |
| **Attention** | ✅ | ❌ |
| Output Projection | ❌ | ✅ |
| MLP (FFN) | ❌ | ✅ |
**Only Attention involves dynamic KV cache loads; everything else is "pure compute" and can run under CUDA Graph.**
### Implementation Sketch
```python
# Pseudocode: capture() stands in for CUDA Graph capture of an op sequence
class OptimizedLayer:
    def __init__(self, layer):
        # Graph 1: everything before attention
        self.graph_pre_attn = capture([
            layer.input_layernorm,
            layer.self_attn.q_proj,
            layer.self_attn.k_proj,
            layer.self_attn.v_proj,
        ])
        # Graph 2: everything after attention + MLP
        self.graph_post_attn = capture([
            layer.self_attn.o_proj,
            # residual add
            layer.post_attention_layernorm,
            layer.mlp.gate_proj,
            layer.mlp.up_proj,
            layer.mlp.down_proj,
            # residual add
        ])

    def forward(self, hidden_states, kv_cache):
        # Pre-attention (CUDA Graph)
        self.graph_pre_attn.replay()
        # Attention with offload (dynamic; cannot be graphed)
        attn_output = chunked_attention_with_offload(q, kv_cache)
        # Post-attention + MLP (CUDA Graph)
        self.graph_post_attn.replay()
```
### Estimated Benefit
Typical per-layer MLP launch overhead:
- `gate_proj`, `up_proj`, `act_fn`, `gate * up`, `down_proj`, `residual add`
- ~30-50μs launch overhead per op, ~200μs per layer in total
- With CUDA Graph: ~30μs per layer
**32 layers × 170μs saved ≈ 5.4ms**
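As back-of-the-envelope arithmetic (the ~200 μs eager and ~30 μs graphed figures are the estimates above, not measurements):

```python
layers = 32
eager_launch_us = 200    # per-layer total, from ~30-50 μs per op across ~6 ops
graphed_launch_us = 30   # one graph replay per layer
saved_ms = layers * (eager_launch_us - graphed_launch_us) / 1000
print(f"saved ~{saved_ms:.1f} ms per forward pass")  # → saved ~5.4 ms per forward pass
```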
---
## Strategy 3: Frontier Research Directions
### 1. InfiniGen - speculative prefetch (OSDI'24)
**Core idea**: don't load all KV; prefetch only the "important" tokens.
```
Key insight: attention patterns of adjacent layers are highly similar
Use layer L's attention scores to predict which tokens layer L+1 needs
Prefetch only the top-k important KV entries instead of everything
```
**How it works:**
- "Rehearse" with the current layer's Q and part of the next layer's K
- Predict the next layer's attention distribution
- Asynchronously prefetch the predicted important tokens
- **Reduces wasted PCIe bandwidth rather than speeding up transfers**
**Result**: up to **3x speedup**
**Reference**: [InfiniGen (OSDI'24)](https://www.usenix.org/conference/osdi24/presentation/lee)
### 2. ShadowKV - low-rank compression + sparse offload (ICML'25 Spotlight)
**Core idea**: keep compressed Keys on the GPU, offload Values to the CPU, and load only 1.56% of the KV.
```
Pre-filling:
┌─────────────────────────────────────────────────────┐
│ Key cache → SVD low-rank compression → kept on GPU  │
│ Value cache → offloaded to CPU                      │
│ Compute a landmark (mean) per chunk                 │
│ Identify outlier tokens → kept on GPU               │
└─────────────────────────────────────────────────────┘
Decoding:
┌─────────────────────────────────────────────────────┐
│ Use landmarks to quickly estimate attention scores  │
│ Load only the top-k important Values (1.56% sparse) │
│ Combine with on-GPU outliers for the final result   │
└─────────────────────────────────────────────────────┘
```
**Result**: 6x larger batch size, **3.04x throughput**
**Reference**: [ShadowKV (ByteDance)](https://github.com/ByteDance-Seed/ShadowKV)
### 3. L2 cache asynchronous prefetch (2025)
**Core idea**: prefetch the next batch of KV into the GPU L2 cache while computing.
```
Traditional:
Compute: [Flash_i]          [Flash_{i+1}]
H2D:          [H2D_{i+1}]
              ↑ wait
L2 Prefetch:
Compute: [Flash_i + Prefetch_{i+1} to L2] [Flash_{i+1} L2 hit]
         ↑ prefetch using idle memory bandwidth during compute
```
**Technique:**
- Issue prefetch instructions inside the Flash Attention kernel
- Exploit idle memory bandwidth during compute
- The next access is a direct L2 hit
**Result**: **2.15x attention kernel efficiency**, 1.97x end-to-end throughput
**Reference**: [Asynchronous KV Cache Prefetching (2025)](https://arxiv.org/abs/2504.06319)
### 4. KVPR - I/O-aware scheduling (ACL'25)
**Core idea**: compute the optimal recompute vs offload ratio.
```
Trade-off:
- Recompute: recompute KV (spend GPU compute to save memory)
- Offload: load from CPU (spend PCIe bandwidth to save compute)
KVPR: dynamically picks the optimal ratio for the current load
      + prefetching to overlap transfers with compute
```
**Reference**: [KVPR (ACL'25)](https://aclanthology.org/2025.findings-acl.997.pdf)
---
## Summary of Strategies
### Recommended Priorities
| Priority | Strategy | Core Optimization | Complexity | Expected Gain |
|--------|------|---------|-----------|---------|
| **P0** | Larger chunk size | Fewer loop iterations | Very low (config change) | 2-4x |
| **P1** | MLP CUDA Graph | Lower launch overhead | Medium | ~5ms/request |
| **P2** | InfiniGen-style prefetch | Load only important tokens | Medium-high | 2-3x |
| **P3** | ShadowKV-style compression | Key compression + sparse | High | 3x |
| **P3** | C++ extension | Eliminate Python overhead | High | 2-3x |
### Separation of Concerns
```
┌─────────────────────────────────────────────────────────────┐
│ Attention + Offload:                                        │
│   - Bottleneck: H2D transfer + CPU scheduling               │
│   - Fix: larger chunk size / speculative prefetch / sparse  │
│                                                             │
│ MLP + Proj + Norm:                                          │
│   - Bottleneck: kernel launch overhead                      │
│   - Fix: CUDA Graph                                         │
└─────────────────────────────────────────────────────────────┘
The two optimizations are fully orthogonal and can be combined.
```
---
## Related Files
- `nanovllm/kvcache/sparse/full_policy.py`: chunked attention pipeline
- `nanovllm/kvcache/offload_engine.py`: H2D/D2H transfer management
- `docs/cpu_scheduling_latency_analysis.md`: problem analysis
## References
1. [InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management](https://www.usenix.org/conference/osdi24/presentation/lee) - OSDI'24
2. [ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference](https://github.com/ByteDance-Seed/ShadowKV) - ICML'25 Spotlight
3. [Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching](https://arxiv.org/abs/2504.06319) - 2025
4. [KVPR: Efficient LLM Inference with I/O-Aware KV Cache](https://aclanthology.org/2025.findings-acl.997.pdf) - ACL'25
5. [LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference](https://lmcache.ai/tech_report.pdf) - 2025


@@ -0,0 +1,177 @@
# CPU Scheduling Latency Analysis
## Overview
Analysis of nsys profiles revealed large **CPU scheduling latencies** in the chunked attention pipeline, significantly reducing GPU utilization.
## Observed Data
### Test Environment
- GPU: NVIDIA A100-SXM4-80GB
- Model: Llama-3.1-8B-Instruct
- Test: RULER niah_single_1, 64K context
- Profile file: `ruler_8slots_test.nsys-rep`
- Time window: 92.982s - 93.038s
### Kernel Execution Times
| Kernel | Typical Execution Time |
|--------|-------------|
| flash_fwd_kernel | ~138 μs |
| H2D memcpy (2MB) | ~87 μs |
| merge_lse_kernel | ~3.5 μs |
| merge_output_kernel | ~34 μs |
### Gap Analysis
Gaps observed in cuda_gpu_trace:
```
Start (ms)   Dur (μs)  Gap (μs)  Type
------------------------------------------------------------
92984.680    138.3     378.3     flash_fwd_kernel  ← GAP!
92985.051    86.8      232.9     H2D memcpy        ← GAP!
92985.141    86.8      2.8       H2D memcpy
92985.587    135.9     360.0     flash_fwd_kernel  ← GAP!
92986.026    3.4       302.4     merge_lse         ← GAP!
92986.164    33.5      135.0     merge_output      ← GAP!
92986.371    86.9      173.4     H2D memcpy        ← GAP!
92986.461    86.8      2.7       H2D memcpy
92986.816    137.9     268.2     flash_fwd_kernel  ← GAP!
```
### Flash Kernel Gap Breakdown
| Gap | Total | Useful Work | Idle |
|------|--------|-------------|---------|
| Flash 1 → Flash 2 | 769 μs | ~174 μs (2x H2D) | ~595 μs (77%) |
| Flash 2 → Flash 3 | 1092 μs | ~211 μs (merge + H2D) | ~881 μs (81%) |
| Flash 3 → Flash 4 | 965 μs | ~211 μs (merge + H2D) | ~754 μs (78%) |
**Key finding**: roughly **77-81% of the time between flash kernels is CPU scheduling idle time.**
## Where the Gaps Come From
### 1. Types of CPU scheduling latency
| Transition | Typical Latency | Cause |
|------|---------|------|
| Kernel end → next kernel start | 100-400 μs | CPU prepares arguments, calls the CUDA driver |
| Flash end → H2D start | ~233 μs | Python execution + CUDA launch |
| H2D end → Flash start | ~360 μs | Synchronization wait + kernel launch |
| Flash end → merge start | ~302 μs | Python execution |
### 2. Where the latency is introduced
```python
# full_policy.py: compute_chunked_prefill
for block_idx in range(num_blocks):
    # 1. Wait for H2D to finish (sync point)
    offload_engine.wait_slot_layer(current_slot)  # ← can introduce latency
    # 2. Fetch the KV data
    k_block, v_block = offload_engine.get_kv_for_slot(current_slot)
    # 3. Call flash attention (kernel launch)
    block_out, block_lse = flash_attn_with_kvcache(...)  # ← CPU scheduling latency
    # 4. Merge
    merge_output(...)  # ← CPU scheduling latency
    merge_lse(...)     # ← CPU scheduling latency
    # 5. Kick off the next H2D (async)
    offload_engine.load_to_slot_layer(next_slot, ...)  # ← CPU scheduling latency
```
### 3. Why the gaps between H2D copies are small
Consecutive H2D memcpys are only ~2.7 μs apart because:
- They are issued back-to-back on the same stream
- The CUDA driver can batch them
- No Python code runs in between
## GPU Utilization Calculation
Based on the observed data:
| Metric | Value |
|------|-----|
| Average flash kernel execution time | 138 μs |
| Average flash kernel interval | 942 μs |
| Flash kernel GPU utilization | 138 / (138 + 942) = **12.8%** |
With CPU scheduling latency eliminated (keeping only the necessary H2D + merge):
| Metric | Value |
|------|-----|
| Necessary interval (2x H2D + merge) | ~211 μs |
| Theoretical GPU utilization | 138 / (138 + 211) = **39.5%** |
**Potential gain**: 3x GPU utilization
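As a quick check of the arithmetic:

```python
def utilization(exec_us: float, interval_us: float) -> float:
    """Fraction of wall time the GPU spends inside the kernel."""
    return exec_us / (exec_us + interval_us)

current = utilization(138, 942)      # observed average interval
theoretical = utilization(138, 211)  # only the required H2D + merge between kernels
print(f"current: {current:.1%}, theoretical: {theoretical:.1%}, "
      f"gain: {theoretical / current:.2f}x")
# → current: 12.8%, theoretical: 39.5%, gain: 3.09x
```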
## Optimization Directions
### 1. CUDA Graph
Compile the whole per-block processing flow into a CUDA Graph, eliminating repeated kernel launch overhead.
```python
# Pseudocode
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    # Pre-record the flash + merge operations
    block_out, block_lse = flash_attn_with_kvcache(...)
    merge_output(...)
    merge_lse(...)
# At runtime, just replay
for block_idx in range(num_blocks):
    graph.replay()  # single launch, no Python in between
```
### 2. Custom Triton Kernel
Fuse flash + merge into a single kernel, reducing the number of kernel launches.
### 3. C++ Extension
Move the Python loop into C++, removing interpreter overhead.
### 4. Pipeline Overlap
Ensure H2D transfers fully overlap with the previous block's compute:
```
Block 0: [H2D slot0] [Flash slot0] [merge]
Block 1:             [H2D slot1]   [Flash slot1] [merge]
Block 2:                           [H2D slot2]   [Flash slot2] [merge]
```
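The effect of this overlap can be sketched host-side with a worker thread standing in for the copy engine (an analogy with toy sleep durations, not CUDA streams):

```python
import time
from concurrent.futures import ThreadPoolExecutor

H2D_S, FLASH_S, N = 0.02, 0.03, 4   # toy durations (seconds) and block count

def h2d(i):   time.sleep(H2D_S)     # stands in for the async copy engine
def flash(i): time.sleep(FLASH_S)   # stands in for flash_fwd_kernel

with ThreadPoolExecutor(max_workers=1) as copy_engine:
    t0 = time.perf_counter()
    pending = copy_engine.submit(h2d, 0)               # preload block 0
    for i in range(N):
        pending.result()                               # wait_slot_layer
        if i + 1 < N:
            pending = copy_engine.submit(h2d, i + 1)   # prefetch the next block
        flash(i)                                       # compute overlaps the prefetch
    pipelined = time.perf_counter() - t0

serial = N * (H2D_S + FLASH_S)                         # no overlap
print(f"serial ~{serial:.2f}s, pipelined ~{pipelined:.2f}s")
```

With compute longer than transfer, the pipelined run approaches one H2D plus N compute steps instead of the serial sum.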
## Verification
### 1. Analyze gaps with nsys
```bash
# Generate the profile
bash scripts/profile_offload.sh --num-gpu-blocks 8
# Inspect the kernel trace
nsys stats --report cuda_gpu_trace --format csv <file>.nsys-rep | \
  awk -F',' 'NR>1 && $1 >= START && $1 <= END'
```
### 2. Compute the gaps
```python
# From the trace data
prev_end = start + duration
gap = next_start - prev_end
```
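Applying that calculation to the first few rows of the trace excerpt (self-contained sketch; the table's Gap column was computed from finer-grained timestamps, so the recomputed values differ by a few tenths of a μs):

```python
# Rows (start_ms, dur_us, kernel) taken from the trace excerpt above
trace = [
    (92984.680, 138.3, "flash_fwd_kernel"),
    (92985.051, 86.8,  "H2D memcpy"),
    (92985.141, 86.8,  "H2D memcpy"),
    (92985.587, 135.9, "flash_fwd_kernel"),
]

def gaps(rows):
    """Gap before each event: its start minus the previous event's end, in μs."""
    result = []
    for (s0, d0, _), (s1, _, name) in zip(rows, rows[1:]):
        prev_end_us = s0 * 1000 + d0          # convert ms start to μs, add duration
        result.append((name, s1 * 1000 - prev_end_us))
    return result

for name, gap in gaps(trace):
    print(f"{name:18s} gap {gap:6.1f} us")
```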
## Related Files
- `nanovllm/kvcache/sparse/full_policy.py`: pipeline implementation
- `nanovllm/kvcache/offload_engine.py`: H2D/D2H transfers
- `scripts/profile_offload.sh`: profiling script
## References
- [CUDA Graphs documentation](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs)
- [nsys User Guide](https://docs.nvidia.com/nsight-systems/UserGuide/index.html)


@@ -9,6 +9,7 @@ Key design principles for CUDA Graph compatibility:
 import torch
 import torch.cuda.nvtx
+import nvtx
 from torch import Tensor
 from typing import Dict, List, Tuple, Optional
 from dataclasses import dataclass
@@ -374,7 +375,9 @@ class OffloadEngine:
         """
         self.ring_slot_compute_done[slot_idx].record()

-    def load_to_slot_layer(self, slot_idx: int, layer_id: int, cpu_block_id: int) -> None:
+    def load_to_slot_layer(
+        self, slot_idx: int, layer_id: int, cpu_block_id: int, chunk_idx: int = -1
+    ) -> None:
         """
         Async load a single CPU block to a ring buffer slot for one layer.
@@ -389,13 +392,20 @@ class OffloadEngine:
             slot_idx: Target GPU slot index
             layer_id: Layer index to load (for CPU cache indexing)
             cpu_block_id: Source CPU block ID
+            chunk_idx: Optional chunk index for NVTX labeling (-1 means not specified)
         """
         logger.debug(f"Ring load: layer={layer_id}, CPU[{cpu_block_id}] -> GPU slot[{slot_idx}]")
         # Use per-slot stream for parallel transfers across different slots
         stream = self.slot_transfer_streams[slot_idx]
-        torch.cuda.nvtx.range_push(f"H2D: L{layer_id} CPU[{cpu_block_id}]->Slot[{slot_idx}]")
+        # Build NVTX label with optional chunk info
+        if chunk_idx >= 0:
+            nvtx_label = f"H2D: L{layer_id} Chunk{chunk_idx} CPU[{cpu_block_id}]->Slot[{slot_idx}]"
+        else:
+            nvtx_label = f"H2D: L{layer_id} CPU[{cpu_block_id}]->Slot[{slot_idx}]"
+        nvtx.push_range(message=nvtx_label, color="blue")
         with torch.cuda.stream(stream):
             # Wait for previous compute on this slot to complete before overwriting
             # This prevents data race: transfer must not start until attention finishes reading
@@ -413,7 +423,7 @@ class OffloadEngine:
                 self.v_cache_cpu[layer_id, cpu_block_id], non_blocking=True
             )
             self.ring_slot_ready[slot_idx].record(stream)
-        torch.cuda.nvtx.range_pop()
+        nvtx.pop_range()

     def wait_slot_layer(self, slot_idx: int) -> None:
         """
@@ -470,7 +480,8 @@ class OffloadEngine:
             else:
                 self.sparse_policy.on_decode_offload(cpu_block_id, layer_id, k_cache, valid_tokens)
-        torch.cuda.nvtx.range_push(f"D2H: Slot[{slot_idx}]->CPU[L{layer_id},B{cpu_block_id}]")
+        nvtx_label = f"D2H: Slot[{slot_idx}]->CPU[L{layer_id},B{cpu_block_id}]"
+        nvtx.push_range(message=nvtx_label, color="green")
         with torch.cuda.stream(self.transfer_stream_main):
             # Wait for both compute_stream and default stream
             # - compute_stream: for flash attention operations
@@ -486,7 +497,7 @@ class OffloadEngine:
                 self.v_cache_gpu[slot_idx], non_blocking=True
             )
             self.ring_slot_offload_done[slot_idx].record(self.transfer_stream_main)
-        torch.cuda.nvtx.range_pop()
+        nvtx.pop_range()

     # ----- KV access methods for ring buffer -----
@@ -702,6 +713,61 @@ class OffloadEngine:
         v = self.prefill_v_buffer[layer_id, :num_tokens].unsqueeze(0)
         return k, v

+    def write_to_prefill_buffer(
+        self,
+        layer_id: int,
+        k: Tensor,
+        v: Tensor,
+        chunk_idx: int = -1,
+    ) -> None:
+        """
+        Write KV tensors to prefill buffer (D2D copy within GPU).
+        This is called during chunked prefill to store current chunk's KV
+        before computing attention.
+        Args:
+            layer_id: Layer index
+            k: Key tensor [num_tokens, kv_heads, head_dim]
+            v: Value tensor [num_tokens, kv_heads, head_dim]
+            chunk_idx: Current chunk index for NVTX labeling (-1 = not specified)
+        """
+        num_tokens = k.shape[0]
+        # Build NVTX label
+        if chunk_idx >= 0:
+            nvtx_label = f"D2D: L{layer_id} Chunk{chunk_idx} WritePrefillBuffer"
+        else:
+            nvtx_label = f"D2D: L{layer_id} WritePrefillBuffer"
+        torch.cuda.nvtx.range_push(nvtx_label)
+        self.prefill_k_buffer[layer_id, :num_tokens].copy_(k)
+        self.prefill_v_buffer[layer_id, :num_tokens].copy_(v)
+        torch.cuda.nvtx.range_pop()
+
+    def write_to_decode_buffer(
+        self,
+        layer_id: int,
+        pos_in_block: int,
+        k: Tensor,
+        v: Tensor,
+    ) -> None:
+        """
+        Write KV tensors to decode buffer (D2D copy within GPU).
+        This is called during chunked decode to store current decode token's KV.
+        Args:
+            layer_id: Layer index
+            pos_in_block: Position within the current block
+            k: Key tensor [kv_heads, head_dim] (single token, squeezed)
+            v: Value tensor [kv_heads, head_dim] (single token, squeezed)
+        """
+        torch.cuda.nvtx.range_push(f"D2D: L{layer_id} Pos{pos_in_block} WriteDecodeBuffer")
+        self.decode_k_buffer[layer_id, pos_in_block].copy_(k)
+        self.decode_v_buffer[layer_id, pos_in_block].copy_(v)
+        torch.cuda.nvtx.range_pop()
+
     def offload_prefill_buffer_async(
         self,
         layer_id: int,
@@ -729,7 +795,8 @@ class OffloadEngine:
         # Use per-layer stream for parallel offloads
         stream = self.prefill_offload_streams[layer_id]
-        torch.cuda.nvtx.range_push(f"AsyncPrefillOffload: L{layer_id}->CPU[{cpu_block_id}]")
+        nvtx_label = f"D2H: PrefillBuffer L{layer_id}->CPU[{cpu_block_id}]"
+        nvtx.push_range(message=nvtx_label, color="orange")
         with torch.cuda.stream(stream):
             # Wait for compute to finish writing to prefill buffer
             stream.wait_stream(self.compute_stream)
@@ -744,7 +811,7 @@ class OffloadEngine:
             # Record completion event
             self.prefill_offload_events[layer_id].record(stream)
-        torch.cuda.nvtx.range_pop()
+        nvtx.pop_range()

     def wait_all_prefill_offloads(self) -> None:
         """Wait for all prefill buffer offloads to complete."""


@@ -139,7 +139,8 @@ class FullAttentionPolicy(SparsePolicy):
             slot = load_slots[0]
             for block_idx in range(num_blocks):
                 cpu_block_id = cpu_block_table[block_idx]
-                offload_engine.load_to_slot_layer(slot, layer_id, cpu_block_id)
+                # cpu_block_id is the chunk index (block N = chunk N)
+                offload_engine.load_to_slot_layer(slot, layer_id, cpu_block_id, chunk_idx=cpu_block_id)
                 offload_engine.wait_slot_layer(slot)

                 with torch.cuda.stream(compute_stream):
@@ -159,7 +160,8 @@ class FullAttentionPolicy(SparsePolicy):
         num_slots = len(load_slots)
         num_preload = min(num_slots, num_blocks)
         for i in range(num_preload):
-            offload_engine.load_to_slot_layer(load_slots[i], layer_id, cpu_block_table[i])
+            cpu_block_id = cpu_block_table[i]
+            offload_engine.load_to_slot_layer(load_slots[i], layer_id, cpu_block_id, chunk_idx=cpu_block_id)

         for block_idx in range(num_blocks):
             current_slot = load_slots[block_idx % num_slots]
@@ -186,7 +188,7 @@ class FullAttentionPolicy(SparsePolicy):
             if next_block_idx < num_blocks:
                 next_slot = load_slots[next_block_idx % num_slots]
                 next_cpu_block_id = cpu_block_table[next_block_idx]
-                offload_engine.load_to_slot_layer(next_slot, layer_id, next_cpu_block_id)
+                offload_engine.load_to_slot_layer(next_slot, layer_id, next_cpu_block_id, chunk_idx=next_cpu_block_id)

             # Step 4: Compute attention to current chunk (causal mask)
             with torch.cuda.stream(compute_stream):
@@ -350,7 +352,8 @@ class FullAttentionPolicy(SparsePolicy):
         # Phase 1: Pre-load up to num_slots blocks
         num_preload = min(num_slots, num_blocks)
         for i in range(num_preload):
-            offload_engine.load_to_slot_layer(load_slots[i], layer_id, cpu_block_table[i])
+            cpu_block_id = cpu_block_table[i]
+            offload_engine.load_to_slot_layer(load_slots[i], layer_id, cpu_block_id, chunk_idx=cpu_block_id)

         # Phase 2: Process blocks with pipeline
         for block_idx in range(num_blocks):
@@ -383,7 +386,8 @@ class FullAttentionPolicy(SparsePolicy):
             # Start loading next block (pipeline)
             next_block_idx = block_idx + num_slots
             if next_block_idx < num_blocks:
-                offload_engine.load_to_slot_layer(current_slot, layer_id, cpu_block_table[next_block_idx])
+                next_cpu_block_id = cpu_block_table[next_block_idx]
+                offload_engine.load_to_slot_layer(current_slot, layer_id, next_cpu_block_id, chunk_idx=next_cpu_block_id)

             # Merge with accumulated
             with torch.cuda.stream(compute_stream):
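The two-phase loop in these hunks is a round-robin pipeline over a fixed pool of GPU slots: up to `num_slots` blocks are loaded up front, and each compute step kicks off the load of block `block_idx + num_slots` into the slot the current block is about to free. A pure-Python sketch of the resulting schedule (the `pipeline_schedule` helper is illustrative, not part of the repo):

```python
def pipeline_schedule(num_blocks: int, num_slots: int):
    """Return the ordered (event, block_idx, slot) list produced by the
    two-phase slot pipeline: preload, then compute + prefetch."""
    events = []
    # Phase 1: pre-load up to num_slots blocks, one per slot
    for i in range(min(num_slots, num_blocks)):
        events.append(("load", i, i % num_slots))
    # Phase 2: compute each block; prefetch block_idx + num_slots into
    # the slot that block will reuse once the current compute finishes
    for block_idx in range(num_blocks):
        slot = block_idx % num_slots
        events.append(("compute", block_idx, slot))
        next_block_idx = block_idx + num_slots
        if next_block_idx < num_blocks:
            events.append(("load", next_block_idx, next_block_idx % num_slots))
    return events
```

With `num_slots >= 2` each load overlaps the previous block's attention compute, which is the point of the pipeline; with one slot the schedule degenerates to strictly serial load-then-compute.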

View File

@@ -189,8 +189,8 @@ class XAttentionBSAPolicy(SparsePolicy):
         reshaped_block_size = block_size // self.stride  # e.g., 1024/8 = 128

         for cpu_block_id in available_blocks:
-            # Load K block from CPU to GPU
-            offload_engine.load_to_slot_layer(slot, layer_id, cpu_block_id)
+            # Load K block from CPU to GPU (cpu_block_id is chunk index)
+            offload_engine.load_to_slot_layer(slot, layer_id, cpu_block_id, chunk_idx=cpu_block_id)
             offload_engine.wait_slot_layer(slot)

             # Get KV: [1, block_size, num_kv_heads, head_dim]
@@ -382,7 +382,7 @@ class XAttentionBSAPolicy(SparsePolicy):
             slot = load_slots[0]
             for block_idx in range(num_blocks):
                 cpu_block_id = cpu_block_table[block_idx]
-                offload_engine.load_to_slot_layer(slot, layer_id, cpu_block_id)
+                offload_engine.load_to_slot_layer(slot, layer_id, cpu_block_id, chunk_idx=cpu_block_id)
                 offload_engine.wait_slot_layer(slot)

                 with torch.cuda.stream(compute_stream):
@@ -402,7 +402,8 @@ class XAttentionBSAPolicy(SparsePolicy):
         num_slots = len(load_slots)
         num_preload = min(num_slots, num_blocks)
         for i in range(num_preload):
-            offload_engine.load_to_slot_layer(load_slots[i], layer_id, cpu_block_table[i])
+            cpu_block_id = cpu_block_table[i]
+            offload_engine.load_to_slot_layer(load_slots[i], layer_id, cpu_block_id, chunk_idx=cpu_block_id)

         for block_idx in range(num_blocks):
             current_slot = load_slots[block_idx % num_slots]
@@ -428,7 +429,7 @@ class XAttentionBSAPolicy(SparsePolicy):
             if next_block_idx < num_blocks:
                 next_slot = load_slots[next_block_idx % num_slots]
                 next_cpu_block_id = cpu_block_table[next_block_idx]
-                offload_engine.load_to_slot_layer(next_slot, layer_id, next_cpu_block_id)
+                offload_engine.load_to_slot_layer(next_slot, layer_id, next_cpu_block_id, chunk_idx=next_cpu_block_id)

             # Compute attention to current chunk (causal mask)
             with torch.cuda.stream(compute_stream):
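Both policies accumulate attention chunk by chunk, merging each chunk's partial output with the running result using log-sum-exp (softmax-denominator) statistics, flash-attention style. A scalar numerical sketch of that merge step (names are illustrative; the repo's version operates on per-head GPU tensors):

```python
import math

def merge_partial_attention(o1, lse1, o2, lse2):
    """Merge two partial attention outputs for one query position.
    o1, o2: per-dim outputs already normalized within their own chunk;
    lse1, lse2: log-sum-exp of each chunk's attention scores."""
    m = max(lse1, lse2)                      # stabilize the exponentials
    lse = m + math.log(math.exp(lse1 - m) + math.exp(lse2 - m))
    w1, w2 = math.exp(lse1 - lse), math.exp(lse2 - lse)  # w1 + w2 == 1
    return [w1 * a + w2 * b for a, b in zip(o1, o2)], lse
```

Because the merge is associative, chunks can be folded into the accumulator in any streaming order, which is what lets the pipeline compute attention one loaded block at a time.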

View File

@@ -104,27 +104,21 @@ class Attention(nn.Module):
             # This enables fully async offloads since each layer has its own buffer.
             offload_engine = context.kvcache_manager.offload_engine
             compute_stream = offload_engine.compute_stream
+            chunk_idx = context.current_chunk_idx if hasattr(context, 'current_chunk_idx') else -1
             # Wait for default stream to ensure slot_mapping tensor transfer is complete
             compute_stream.wait_stream(torch.cuda.default_stream())
             with torch.cuda.stream(compute_stream):
-                # Write KV to per-layer prefill buffer (contiguous write, no slot_mapping)
+                # Write KV to per-layer prefill buffer via offload_engine
                 # k, v shape: [num_tokens, kv_heads, head_dim]
-                num_tokens = k.shape[0]
-                offload_engine.prefill_k_buffer[self.layer_id, :num_tokens].copy_(k)
-                offload_engine.prefill_v_buffer[self.layer_id, :num_tokens].copy_(v)
+                #! GPU 2 GPU
+                offload_engine.write_to_prefill_buffer(self.layer_id, k, v, chunk_idx=chunk_idx)
         elif is_chunked_offload:
-            # Chunked decode mode: use compute_stream for store_kvcache
-            # This ensures proper synchronization with per-layer offload
-            compute_stream = context.kvcache_manager.offload_engine.compute_stream
-            if k_cache.numel() and v_cache.numel():
-                # CRITICAL: Wait for default stream to ensure slot_mapping tensor transfer is complete
-                # slot_mapping is created with non_blocking=True on default stream, but we use it
-                # on compute_stream. Without this sync, index_copy_ can get corrupted indices.
-                compute_stream.wait_stream(torch.cuda.default_stream())
-                with torch.cuda.stream(compute_stream):
-                    store_kvcache(k, v, k_cache, v_cache, context.slot_mapping)
+            # Chunked decode mode: write KV to per-layer decode buffer via offload_engine
+            # KV will be written to decode buffer in the decode branch below
+            # No store_kvcache needed - all KV management goes through offload_engine
+            pass
         else:
             # Normal mode: store on default stream
             if k_cache.numel() and v_cache.numel():
@@ -155,8 +149,7 @@ class Attention(nn.Module):
                 offload_engine = kvcache_manager.offload_engine
                 pos_in_block = context.decode_pos_in_block
                 # k, v shape: [1, kv_heads, head_dim]
-                offload_engine.decode_k_buffer[self.layer_id, pos_in_block].copy_(k.squeeze(0))
-                offload_engine.decode_v_buffer[self.layer_id, pos_in_block].copy_(v.squeeze(0))
+                offload_engine.write_to_decode_buffer(self.layer_id, pos_in_block, k.squeeze(0), v.squeeze(0))
                 o = self._chunked_decode_attention(q, k, v, context)
             else:
                 o = flash_attn_with_kvcache(q.unsqueeze(1), k_cache, v_cache,
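After this refactor the attention layer never touches `decode_k_buffer`/`decode_v_buffer` directly; every decode-time KV write goes through `offload_engine.write_to_decode_buffer(layer_id, pos_in_block, k, v)`. A toy stand-in for that contract, using nested lists in place of GPU tensors (everything beyond the method signature is an assumption for illustration):

```python
class DecodeBufferSketch:
    """Per-layer decode buffer indexed as [layer_id][pos_in_block];
    the real engine holds GPU tensors and does .copy_() on a CUDA stream."""

    def __init__(self, num_layers: int, block_size: int):
        self.decode_k = [[None] * block_size for _ in range(num_layers)]
        self.decode_v = [[None] * block_size for _ in range(num_layers)]

    def write_to_decode_buffer(self, layer_id: int, pos_in_block: int, k, v):
        # One decode token's K/V per call, written at its position within
        # the current block; a full block is then offloaded to CPU.
        self.decode_k[layer_id][pos_in_block] = k
        self.decode_v[layer_id][pos_in_block] = v
```

Funneling writes through one method gives the engine a single place to attach NVTX ranges and stream synchronization, which is exactly what the OffloadEngine hunk above adds around the buffer copies.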

View File

@@ -9,6 +9,7 @@
 #   --dataset DATASET    Task name (default: niah_single_1)
 #   --sample INDEX       Sample index (default: 0)
 #   --gpu GPU_ID         GPU to use (default: 0)
+#   --num-gpu-blocks N   Number of GPU blocks/slots (default: 4)
 #   --no-offload         Disable CPU offload
 #
 # Output:
@@ -18,6 +19,7 @@
 #   bash scripts/profile_offload.sh
 #   bash scripts/profile_offload.sh --dataset niah_single_1 --sample 5
 #   bash scripts/profile_offload.sh --gpu 1 --no-offload
+#   bash scripts/profile_offload.sh --num-gpu-blocks 8

 set -e
@@ -25,6 +27,7 @@ set -e
 DATASET="niah_single_1"
 SAMPLE_INDEX="0"
 GPU_ID="0"
+NUM_GPU_BLOCKS="4"
 ENABLE_OFFLOAD="--enable-offload"

 # Parse arguments
@@ -46,6 +49,10 @@ while [[ $# -gt 0 ]]; do
             ENABLE_OFFLOAD=""
             shift
             ;;
+        --num-gpu-blocks)
+            NUM_GPU_BLOCKS="$2"
+            shift 2
+            ;;
         -h|--help)
             echo "Usage: $0 [options]"
             echo ""
@@ -54,6 +61,7 @@ while [[ $# -gt 0 ]]; do
             echo "  --sample INDEX    Sample index (default: 0)"
             echo "  --gpu GPU_ID      GPU to use (default: 0)"
             echo "  --no-offload      Disable CPU offload"
+            echo "  --num-gpu-blocks N  Number of GPU blocks/slots (default: 4)"
             exit 0
             ;;
         *)
@@ -76,7 +84,7 @@ mkdir -p "$OUTPUT_DIR"
 TIMESTAMP=$(date +%Y%m%d_%H%M%S)
 OFFLOAD_SUFFIX=""
 if [ -n "$ENABLE_OFFLOAD" ]; then
-    OFFLOAD_SUFFIX="_offload"
+    OFFLOAD_SUFFIX="_offload_${NUM_GPU_BLOCKS}slots"
 fi
 OUTPUT_FILE="$OUTPUT_DIR/ruler_${DATASET}_sample${SAMPLE_INDEX}${OFFLOAD_SUFFIX}_${TIMESTAMP}"
@@ -87,6 +95,7 @@ echo "Test script: $TEST_SCRIPT"
 echo "Dataset: $DATASET"
 echo "Sample: $SAMPLE_INDEX"
 echo "GPU: $GPU_ID"
+echo "GPU Blocks: $NUM_GPU_BLOCKS"
 echo "Offload: ${ENABLE_OFFLOAD:-disabled}"
 echo "Output file: $OUTPUT_FILE.nsys-rep"
 echo ""
@@ -109,6 +118,7 @@ nsys profile \
     python "$TEST_SCRIPT" \
         --datasets "$DATASET" \
         --sample-indices "$SAMPLE_INDEX" \
+        --num-gpu-blocks "$NUM_GPU_BLOCKS" \
        $ENABLE_OFFLOAD \
        --quiet
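The new `--num-gpu-blocks` option follows the script's existing `case`-based parsing: a two-token option consumes its value with `shift 2`. A minimal standalone sketch of that pattern (the `parse_blocks` function is illustrative, not in the script):

```shell
# Illustrative reduction of profile_offload.sh's argument loop to the
# single --num-gpu-blocks option; unrecognized arguments are skipped.
parse_blocks() {
    NUM_GPU_BLOCKS="4"   # default, matching the script
    while [ $# -gt 0 ]; do
        case "$1" in
            --num-gpu-blocks)
                NUM_GPU_BLOCKS="$2"
                shift 2
                ;;
            *)
                shift
                ;;
        esac
    done
    echo "$NUM_GPU_BLOCKS"
}
```

Quoting `"$2"` before the `shift 2` is what keeps a missing value from silently swallowing the next flag, so a caller who writes `--num-gpu-blocks 8` gets `8` and a bare invocation keeps the default.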