Compare commits: 3100724666...0d31b3f71f (6 commits)

Commits: 0d31b3f71f, 73c9dc46ff, 924a0d2bfa, 0619accd1c, 18bc433f09, aea3812230

`.claude/rules/nsys-profiling.md` (new file, 89 lines)
@@ -0,0 +1,89 @@
# Nsys Profiling Rule

## Mandatory Rule

**All nsys profiling tasks must use the `scripts/profile_offload.sh` script.** Running nsys commands directly is forbidden.

| Forbidden | Reason |
|------|------|
| `nsys profile python tests/test_ruler.py ...` | Inconsistent arguments, messy output paths |
| Hand-crafting nsys commands | Easy to miss critical arguments |

## Usage

```bash
# Basic usage (default: 4 slots)
bash scripts/profile_offload.sh

# Specify the number of GPU slots
bash scripts/profile_offload.sh --num-gpu-blocks 8

# Specify the sample
bash scripts/profile_offload.sh --sample 5

# Specify the dataset
bash scripts/profile_offload.sh --dataset niah_single_1

# Disable offload (for comparison runs)
bash scripts/profile_offload.sh --no-offload

# Combine arguments
bash scripts/profile_offload.sh --num-gpu-blocks 8 --sample 0 --gpu 1
```

## Arguments

| Argument | Default | Description |
|------|--------|------|
| `--dataset` | `niah_single_1` | RULER task name |
| `--sample` | `0` | Sample index |
| `--gpu` | `0` | GPU to use |
| `--num-gpu-blocks` | `4` | Number of GPU ring buffer slots |
| `--no-offload` | - | Disable CPU offload |

## Output Files

Output files are written automatically to the `results/nsys/` directory:

```
results/nsys/ruler_<dataset>_sample<index>_offload_<slots>slots_<timestamp>.nsys-rep
```

Example: `ruler_niah_single_1_sample0_offload_8slots_20260127_031500.nsys-rep`
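The naming scheme above is easy to reproduce when you need to predict where a run's report will land. A minimal sketch; `nsys_output_path` is a hypothetical helper for illustration, not part of the repo:

```python
from datetime import datetime

def nsys_output_path(dataset="niah_single_1", sample=0, slots=4, ts=None):
    # Hypothetical helper: builds the canonical profile path per the scheme above.
    ts = ts or datetime.now().strftime("%Y%m%d_%H%M%S")
    return f"results/nsys/ruler_{dataset}_sample{sample}_offload_{slots}slots_{ts}.nsys-rep"

print(nsys_output_path(ts="20260127_031500"))
# → results/nsys/ruler_niah_single_1_sample0_offload_4slots_20260127_031500.nsys-rep
```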
## Viewing Results

```bash
# View in the GUI
nsight-sys results/nsys/<filename>.nsys-rep

# Command-line statistics
nsys stats --report cuda_api_sum results/nsys/<filename>.nsys-rep
nsys stats --report cuda_gpu_kern_sum results/nsys/<filename>.nsys-rep
```

## Typical Workflows

### 1. Compare different slot counts

```bash
# Test 4 slots (default)
bash scripts/profile_offload.sh --num-gpu-blocks 4

# Test 8 slots
bash scripts/profile_offload.sh --num-gpu-blocks 8

# Compare the results
nsys stats --report cuda_gpu_kern_sum results/nsys/*4slots*.nsys-rep
nsys stats --report cuda_gpu_kern_sum results/nsys/*8slots*.nsys-rep
```

### 2. Analyze pipeline overlap

```bash
# Generate a profile
bash scripts/profile_offload.sh --num-gpu-blocks 8

# Inspect the CUDA HW timeline in the nsight-sys GUI
# Check whether H2D transfers and flash_fwd_kernel overlap
```
`.gitignore` (vendored, +1 line)

```diff
@@ -239,3 +239,4 @@ task_plan_*.md
 findings_*.md
 progress_*.md
 notes.md
+Snipaste*
```
```diff
@@ -26,6 +26,9 @@ Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline L
 | [`docs/ruler_32k_chunked_offload_issue.md`](docs/ruler_32k_chunked_offload_issue.md) | ⚠️ OPEN ISSUE: 32K chunked offload accuracy problem (20% error rate in RULER) |
 | [`docs/chunked_attention_solutions.md`](docs/chunked_attention_solutions.md) | 🔧 SOLUTIONS: Code analysis and fixes for the chunked attention accuracy problem |
 | [`docs/nsys_wrong_event_order_bug.md`](docs/nsys_wrong_event_order_bug.md) | 🐛 NSYS BUG: Debugging notes on out-of-order nsys timestamps triggered by the ring buffer pipeline |
+| [`docs/cpu_scheduling_latency_analysis.md`](docs/cpu_scheduling_latency_analysis.md) | ⚡ PERF: CPU scheduling latency analysis, sources of inter-kernel gaps, GPU utilization optimization directions |
+| [`docs/bench_offload_results.md`](docs/bench_offload_results.md) | 📊 BENCH: CPU offload benchmark results, Full vs XAttention comparison (32K/128K) |
+| [`docs/cpu_offload_optimization_strategies.md`](docs/cpu_offload_optimization_strategies.md) | 🚀 OPT: CPU offload optimization strategies: chunk size, CUDA Graph, frontier research (InfiniGen/ShadowKV) |

 ## Rules Index
```
`bench_offload.py`:

```diff
@@ -46,24 +46,41 @@ def main():
     from nanovllm.config import SparsePolicyType

     parser = argparse.ArgumentParser(description="Benchmark CPU offload performance")
-    parser.add_argument("--enable-quest", action="store_true", help="Enable Quest sparse attention for decode")
+    parser.add_argument("--model", type=str, default="~/models/Llama-3.1-8B-Instruct",
+                        help="Model path (default: ~/models/Llama-3.1-8B-Instruct)")
+    # Sparse policy selection (mutually exclusive)
+    sparse_group = parser.add_mutually_exclusive_group()
+    sparse_group.add_argument("--enable-quest", action="store_true",
+                              help="Enable Quest sparse attention (decode only, prefill uses full)")
+    sparse_group.add_argument("--enable-xattn", action="store_true",
+                              help="Enable XAttention BSA (prefill only, decode uses full)")
+    # Quest parameters
     parser.add_argument("--topk", type=int, default=16, help="Top-K blocks for Quest (default: 16)")
     parser.add_argument("--threshold", type=int, default=4, help="Apply sparse only when blocks > threshold (default: 4)")
+    # XAttention parameters
+    parser.add_argument("--xattn-threshold", type=float, default=0.95,
+                        help="XAttention cumulative attention threshold (default: 0.95)")
+    parser.add_argument("--xattn-stride", type=int, default=8,
+                        help="XAttention Q/K downsampling stride (default: 8)")
+    # General parameters
     parser.add_argument("--input-len", type=int, default=None, help="Input length in tokens")
     parser.add_argument("--output-len", type=int, default=64, help="Output length for decode benchmark (default: 64)")
-    parser.add_argument("--num-gpu-blocks", type=int, default=6, help="Number of GPU blocks (default: 6)")
+    parser.add_argument("--num-gpu-blocks", type=int, default=4, help="Number of GPU blocks (default: 4)")
     parser.add_argument("--max-len", type=int, default=32*1024, help="Max model length (default: 32K)")
     parser.add_argument("--bench-decode", action="store_true", help="Run decode benchmark (default: prefill only)")
     parser.add_argument("--bench-all", action="store_true", help="Run both prefill and decode benchmarks")
     args = parser.parse_args()

-    path = os.path.expanduser("~/models/Qwen3-4B-Instruct-2507/")
+    path = os.path.expanduser(args.model)
     max_len = args.max_len

     # Setup policy configuration
     if args.enable_quest:
         sparse_policy = SparsePolicyType.QUEST
-        print(f"\n[Quest Sparse Attention] topk={args.topk}, threshold={args.threshold}")
+        print(f"\n[Quest Sparse Attention] decode: Quest (topk={args.topk}, threshold={args.threshold}), prefill: Full")
+    elif args.enable_xattn:
+        sparse_policy = SparsePolicyType.XATTN_BSA
+        print(f"\n[XAttention BSA] prefill: XAttn (tau={args.xattn_threshold}, stride={args.xattn_stride}), decode: Full")
     else:
         sparse_policy = SparsePolicyType.FULL
         print("\n[Full Attention] baseline (no sparse)")
@@ -78,8 +95,12 @@ def main():
         enable_cpu_offload=True,
         num_gpu_blocks=args.num_gpu_blocks,
         sparse_policy=sparse_policy,
+        # Quest parameters
         sparse_topk_blocks=args.topk,
         sparse_threshold_blocks=args.threshold,
+        # XAttention parameters
+        sparse_threshold=args.xattn_threshold,
+        sparse_stride=args.xattn_stride,
     )

     # Warmup
```
`docs/bench_offload_results.md` (new file, 89 lines)

@@ -0,0 +1,89 @@
# CPU Offload Benchmark Results

This document records `bench_offload.py` performance results under different configurations.

## Test Environment

| Parameter | Value |
|------|-----|
| GPU | NVIDIA A100-SXM4-80GB |
| Model | Llama-3.1-8B-Instruct |
| GPU slots | 4 |
| Block size | 1024 tokens |
| Chunk size | 2048 tokens |

## Sparse Policy Configurations

| Policy | Prefill | Decode | Notes |
|------|---------|--------|------|
| FULL | Full Attention | Full Attention | Baseline; loads all blocks |
| XATTN_BSA | XAttention (tau=0.95, stride=8) | Full Attention (fallback) | Sparse prefill |

## Results

### 32K context

| Policy | Input length | Time | Throughput | Relative |
|------|----------|------|--------|----------|
| Full Attention | 32767 tok | 20.64s | **1587.74 tok/s** | baseline |
| XAttention BSA | 32767 tok | 27.95s | **1172.33 tok/s** | 0.74x |

### 128K context

| Policy | Input length | Time | Throughput | Relative |
|------|----------|------|--------|----------|
| Full Attention | 131071 tok | 237.18s | **552.63 tok/s** | baseline |
| XAttention BSA | 131071 tok | 281.17s | **466.17 tok/s** | 0.84x |

### KV Cache Configuration

| Context | GPU Memory | CPU Memory | Total |
|--------|------------|------------|-------|
| 32K | 512 MB (4 blocks) | 4096 MB (32 blocks) | 4608 MB |
| 128K | 512 MB (4 blocks) | 16384 MB (128 blocks) | 16896 MB |

## Analysis

### XAttention Performance Characteristics

1. **32K context**: XAttention is 26% slower than Full
2. **128K context**: XAttention is 16% slower than Full

As the context grows, XAttention's relative performance improves (74% → 84%), but it still does not beat Full Attention.

### Root Causes

1. **tau=0.95 is a high threshold**: covering 95% of cumulative attention means few blocks are actually skipped
2. **Estimation overhead**: `xattn_estimate_chunked` must compute a sparse mask for every chunk
3. **BSA kernel overhead**: the block-sparse kernel adds extra mask handling and indexing cost
4. **Offload bottleneck**: CPU→GPU transfer dominates, and sparse attention saves compute, not transfer

### When XAttention Fits

XAttention BSA is better suited to:

- Longer contexts (256K+), where the sparsity payoff is larger
- Compute-bound workloads (non-offload mode), where transfer is not the bottleneck
- Lower tau thresholds (e.g. 0.8), which increase sparsity

## Commands

```bash
# Full Attention (32K)
CUDA_VISIBLE_DEVICES=0 python bench_offload.py --max-len 32768

# XAttention BSA (32K)
CUDA_VISIBLE_DEVICES=0 python bench_offload.py --max-len 32768 --enable-xattn

# Full Attention (128K)
CUDA_VISIBLE_DEVICES=0 python bench_offload.py --max-len 131072

# XAttention BSA (128K)
CUDA_VISIBLE_DEVICES=0 python bench_offload.py --max-len 131072 --enable-xattn

# Tune XAttention parameters
CUDA_VISIBLE_DEVICES=0 python bench_offload.py --enable-xattn --xattn-threshold 0.8 --xattn-stride 16
```

## Changelog

- 2026-01-27: Initial tests, Llama-3.1-8B-Instruct, A100 80GB
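The KV cache figures in the table are consistent with simple arithmetic, assuming the usual Llama-3.1-8B attention shape (32 layers, 8 KV heads, head dim 128, fp16) — an assumption for illustration, not stated in the benchmark itself:

```python
# Assumed model shape (Llama-3.1-8B): 32 layers, 8 KV heads, head_dim 128, fp16
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 32, 8, 128, 2
BLOCK_TOKENS = 1024

per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES  # K and V, bytes/token
per_block_mb = per_token * BLOCK_TOKENS / (1024 ** 2)

print(per_block_mb)      # → 128.0  (MB per 1024-token block)
print(4 * per_block_mb)  # → 512.0  (4 GPU slots)
print(32 * per_block_mb) # → 4096.0 (CPU blocks for a 32K context)
```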
`docs/cpu_offload_optimization_strategies.md` (new file, 300 lines)

@@ -0,0 +1,300 @@
# CPU Offload Optimization Strategies

This document records an analysis of performance optimization strategies for the CPU offload scenario, covering both practically feasible approaches and frontier research directions.

## Problem Recap

According to the [CPU scheduling latency analysis](cpu_scheduling_latency_analysis.md), the main problems in the current chunked attention pipeline are:

| Metric | Current | Theoretical |
|------|--------|--------|
| Flash kernel execution time | ~138 μs | - |
| Flash kernel gap | ~942 μs | ~211 μs (H2D + merge only) |
| GPU utilization | **12.8%** | **39.5%** (upper bound) |
| CPU scheduling idle share | **77-81%** | 0% |

**Root cause**: every block goes through a full Python loop iteration, incurring heavy CPU scheduling latency.

---

## Strategy 1: Larger Chunk Size (Recommended)

### Core Insight

**Merging several small chunks is equivalent to using one large chunk directly**:

```
Plan A: merge 4 small chunks
[H2D 2K][H2D 2K][H2D 2K][H2D 2K] → concat → [Flash 8K] → merge

Plan B: one large chunk directly
[H2D 8K] → [Flash 8K] → merge

The computed results are exactly equivalent.
```
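The equivalence holds because chunked attention combines partial results using their log-sum-exps. A minimal numeric check in plain Python; `attend` and `merge` are illustrative stand-ins for what the flash-attention and merge kernels compute:

```python
import math

def attend(q, ks, vs):
    """Single-query softmax attention over (ks, vs); returns (output, log-sum-exp)."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in ks]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    denom = sum(w)
    out = [sum(wi * v[d] for wi, v in zip(w, vs)) / denom for d in range(len(vs[0]))]
    return out, m + math.log(denom)

def merge(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention results via their log-sum-exps."""
    m = max(lse_a, lse_b)
    lse = m + math.log(math.exp(lse_a - m) + math.exp(lse_b - m))
    wa, wb = math.exp(lse_a - lse), math.exp(lse_b - lse)
    return [wa * a + wb * b for a, b in zip(out_a, out_b)], lse

q = [0.3, -0.1]
ks = [[0.5, 0.2], [-0.4, 1.0], [0.1, 0.1], [0.9, -0.7]]
vs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8]]

full, _ = attend(q, ks, vs)            # Plan B: one big chunk
oa, la = attend(q, ks[:2], vs[:2])     # Plan A: chunk 0
ob, lb = attend(q, ks[2:], vs[2:])     # Plan A: chunk 1
merged, _ = merge(oa, la, ob, lb)
assert all(abs(f - g) < 1e-12 for f, g in zip(full, merged))
```

The merge is associative, so merging four 2K chunks pairwise gives the same output as one 8K chunk; only the scheduling cost differs.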
### Benefit Analysis

| Metric | Small chunk (2K) × 4 | Large chunk (8K) × 1 |
|------|-------------------|-------------------|
| H2D transfers | 4 | 1 |
| Flash kernel launches | 4 | 1 |
| Merge calls | 4 | 1 |
| Python loop iterations | 4 | 1 |
| CPU scheduling overhead | 4 × ~300μs = 1200μs | 1 × ~300μs = 300μs |

**Essence**: the CPU scheduling latency problem stems from too many loop iterations; a larger chunk size directly reduces the iteration count.

### Trade-offs

1. **Higher GPU memory**
   - 2K chunk: ~4MB per slot (K+V)
   - 8K chunk: ~16MB per slot (K+V)
   - 4 slots = 64MB, negligible on an 80GB A100

2. **Longer individual H2D transfers**
   - H2D 8K ≈ 350μs
   - Flash 8K ≈ 550μs
   - Since Flash > H2D, the pipeline still overlaps effectively

### How to Configure

```bash
# Test different block sizes
python bench_offload.py --kvcache-block-size 2048  # baseline
python bench_offload.py --kvcache-block-size 4096  # 2x
python bench_offload.py --kvcache-block-size 8192  # 4x
```

---
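The scheduling-overhead row in the table reduces to a one-line model: overhead scales with the number of loop iterations. A sketch using the doc's ~300 μs per-iteration figure (an illustrative model, not a measurement):

```python
import math

def pipeline_overhead_us(total_tokens, chunk_tokens, sched_us=300.0):
    # One Python-loop iteration (~sched_us of CPU scheduling) per chunk.
    chunks = math.ceil(total_tokens / chunk_tokens)
    return chunks * sched_us

# 8K tokens as 4 × 2K chunks vs 1 × 8K chunk (numbers from the table above)
print(pipeline_overhead_us(8192, 2048))  # → 1200.0
print(pipeline_overhead_us(8192, 8192))  # → 300.0
```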
## Strategy 2: CUDA Graph (for the Non-Attention Parts)

### Why CUDA Graph Is Limited in the Offload Scenario

CUDA Graph's premise: all operations are fixed at capture time and data addresses do not change.

**The reality of offload**:

1. **Dynamic H2D source addresses** - each load comes from a different CPU block
2. **Load decisions happen at runtime** - which blocks to load is dynamic
3. **The CPU must coordinate** - synchronizing H2D and compute requires CPU involvement

```
Offload scenario:
┌─────────────────────────────────────────┐
│ Data lives on the CPU, loaded dynamically│
│ [H2D_i] → [Compute] → [H2D_{i+n}] → ... │
│   ↑ dynamic; CPU must participate        │
└─────────────────────────────────────────┘

Even with a graph:
Python: [wait_h2d] [replay] [launch_h2d] [wait_h2d] [replay] ...
           ↑ CPU involved   ↑ CPU involved  ↑ CPU involved

The CPU scheduling overhead remains; the graph only optimizes the compute in between.
```

**Conclusion**: CUDA Graph is not a silver bullet for the offload scenario.

### Where It Does Apply: MLP and Projection Layers

The per-layer compute flow of an LLM:

```
┌─────────────────────────────────────────────────────────────┐
│ [LayerNorm] → [QKV Proj] → [Attention] → [O Proj] → [Add]  │
│                                 ↑                           │
│                            KV Offload                       │
│ [LayerNorm] → [MLP: gate + up + down] → [Add]              │
└─────────────────────────────────────────────────────────────┘
```

| Component | Involves offload | CUDA Graph capable |
|------|-------------|-----------------|
| LayerNorm | ❌ | ✅ |
| QKV Projection | ❌ | ✅ |
| **Attention** | ✅ | ❌ |
| Output Projection | ❌ | ✅ |
| MLP (FFN) | ❌ | ✅ |

**Only attention involves dynamic KV cache loading; everything else is "pure compute" and can use CUDA Graph.**

### Implementation Sketch

```python
class OptimizedLayer:
    def __init__(self, layer):
        # Graph 1: before attention
        self.graph_pre_attn = capture([
            layer.input_layernorm,
            layer.self_attn.q_proj,
            layer.self_attn.k_proj,
            layer.self_attn.v_proj,
        ])

        # Graph 2: after attention + MLP
        self.graph_post_attn = capture([
            layer.self_attn.o_proj,
            # residual add
            layer.post_attention_layernorm,
            layer.mlp.gate_proj,
            layer.mlp.up_proj,
            layer.mlp.down_proj,
            # residual add
        ])

    def forward(self, hidden_states, kv_cache):
        # Pre-attention (CUDA Graph)
        self.graph_pre_attn.replay()

        # Attention with offload (dynamic; cannot be graphed)
        attn_output = chunked_attention_with_offload(q, kv_cache)

        # Post-attention + MLP (CUDA Graph)
        self.graph_post_attn.replay()
```

### Expected Benefit

Typical per-layer MLP launch overhead:

- `gate_proj`, `up_proj`, `act_fn`, `gate * up`, `down_proj`, `residual add`
- ~30-50μs launch overhead per op, ~200μs per layer in total
- With CUDA Graph: ~30μs per layer

**32 layers × 170μs saved ≈ 5.4ms**

---

## Strategy 3: Frontier Research Directions

### 1. InfiniGen - Speculative Prefetch (OSDI'24)

**Core idea**: do not load all KV; prefetch only the "important" tokens.

```
Key insight: attention patterns of adjacent layers are highly similar
    ↓
Use layer L's attention scores to predict which tokens layer L+1 needs
    ↓
Prefetch only the top-k important KV entries (instead of all of them)
```

**Mechanism**:

- Run a "rehearsal" with the current layer's Q and part of the next layer's K
- Predict the next layer's attention distribution
- Asynchronously prefetch the predicted important tokens
- **Reduces wasted PCIe bandwidth rather than speeding up transfers**

**Result**: up to **3x speedup**

**Reference**: [InfiniGen (OSDI'24)](https://www.usenix.org/conference/osdi24/presentation/lee)

### 2. ShadowKV - Low-Rank Compression + Sparse Offload (ICML'25 Spotlight)

**Core idea**: keep compressed Keys on the GPU, offload Values to the CPU, and load only 1.56% of the KV.

```
Pre-filling:
┌─────────────────────────────────────────────────┐
│ Key cache → SVD low-rank compression → kept on GPU │
│ Value cache → offloaded to CPU                     │
│ Compute a landmark (mean) for each chunk           │
│ Identify outlier tokens → kept on GPU              │
└─────────────────────────────────────────────────┘

Decoding:
┌─────────────────────────────────────────────────┐
│ Use landmarks to cheaply estimate attention scores  │
│ Load only the top-k important Values (1.56% sparse) │
│ Combine with on-GPU outliers for the final result   │
└─────────────────────────────────────────────────┘
```

**Result**: 6x larger batch size, **3.04x throughput improvement**

**Reference**: [ShadowKV (ByteDance)](https://github.com/ByteDance-Seed/ShadowKV)
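The landmark-based selection in the decoding stage above can be illustrated with a toy sketch (hypothetical helpers: mean-of-keys landmarks scored by dot product with the query; real implementations score per head and handle outliers separately):

```python
def landmark(keys):
    # Per-chunk landmark: the mean of the chunk's key vectors.
    d = len(keys[0])
    return [sum(key[j] for key in keys) / len(keys) for j in range(d)]

def topk_chunks(q, chunk_landmarks, k):
    # Score each chunk by dot(q, landmark); return indices of the top-k chunks.
    scores = [sum(qi * li for qi, li in zip(q, lm)) for lm in chunk_landmarks]
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

# Toy key cache: 4 chunks of 2 keys each
chunks = [
    [[1.0, 0.0], [0.8, 0.1]],
    [[0.0, 1.0], [0.1, 0.9]],
    [[-1.0, 0.0], [-0.9, 0.2]],
    [[0.5, 0.5], [0.4, 0.6]],
]
landmarks = [landmark(c) for c in chunks]
print(topk_chunks([1.0, 0.2], landmarks, 2))  # → [0, 3]
```

Only the Values of the selected chunks would then be fetched from CPU memory.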
### 3. L2 Cache Asynchronous Prefetch (2025)

**Core idea**: use the GPU L2 cache for prefetching; fetch the next batch of KV while computing.

```
Traditional:
Compute: [Flash_i]           [Flash_{i+1}]
H2D:              [H2D_{i+1}]
                    ↑ waiting

L2 prefetch:
Compute: [Flash_i + Prefetch_{i+1} to L2] [Flash_{i+1} L2 hit]
              ↑ uses idle memory bandwidth during compute
```

**Technique**:

- Issue prefetch instructions inside the Flash Attention kernel
- Exploit idle memory bandwidth during compute
- The next access hits directly in L2

**Result**: **2.15x attention kernel efficiency**, 1.97x end-to-end throughput

**Reference**: [Asynchronous KV Cache Prefetching (2025)](https://arxiv.org/abs/2504.06319)

### 4. KVPR - I/O-Aware Scheduling (ACL'25)

**Core idea**: compute the optimal recompute-vs-offload ratio.

```
Trade-off:
- Recompute: recompute KV (trade GPU compute for memory)
- Offload: load from CPU (trade PCIe bandwidth for compute)

KVPR: dynamically picks the optimal ratio based on current load
      + prefetching to overlap data transfer with compute
```

**Reference**: [KVPR (ACL'25)](https://aclanthology.org/2025.findings-acl.997.pdf)

---
## Summary

### Recommended Priorities

| Priority | Strategy | Core optimization | Complexity | Expected gain |
|--------|------|---------|-----------|---------|
| **P0** | Larger chunk size | Fewer loop iterations | Very low (config change) | 2-4x |
| **P1** | MLP CUDA Graph | Lower launch overhead | Medium | ~5ms/request |
| **P2** | InfiniGen-style prefetch | Load only important tokens | Medium-high | 2-3x |
| **P3** | ShadowKV-style compression | Key compression + sparse | High | 3x |
| **P3** | C++ extension | Remove Python overhead | High | 2-3x |

### Separation of Concerns

```
┌─────────────────────────────────────────────────────────────┐
│ Attention + offload part:                                   │
│ - Bottleneck: H2D transfer + CPU scheduling                 │
│ - Optimizations: larger chunk size / speculative prefetch / sparse │
│                                                             │
│ MLP + projections + norms part:                             │
│ - Bottleneck: kernel launch overhead                        │
│ - Optimization: CUDA Graph                                  │
└─────────────────────────────────────────────────────────────┘

The two sets of optimizations are fully orthogonal and can be combined.
```

---

## Related Files

- `nanovllm/kvcache/sparse/full_policy.py`: chunked attention pipeline
- `nanovllm/kvcache/offload_engine.py`: H2D/D2H transfer management
- `docs/cpu_scheduling_latency_analysis.md`: problem analysis

## References

1. [InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management](https://www.usenix.org/conference/osdi24/presentation/lee) - OSDI'24
2. [ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference](https://github.com/ByteDance-Seed/ShadowKV) - ICML'25 Spotlight
3. [Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching](https://arxiv.org/abs/2504.06319) - 2025
4. [KVPR: Efficient LLM Inference with I/O-Aware KV Cache](https://aclanthology.org/2025.findings-acl.997.pdf) - ACL'25
5. [LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference](https://lmcache.ai/tech_report.pdf) - 2025
`docs/cpu_scheduling_latency_analysis.md` (new file, 177 lines)

@@ -0,0 +1,177 @@
# CPU Scheduling Latency Analysis

## Overview

While analyzing an nsys profile, we found substantial **CPU scheduling latency** in the chunked attention pipeline, causing a marked drop in GPU utilization.

## Observed Data

### Test Environment

- GPU: NVIDIA A100-SXM4-80GB
- Model: Llama-3.1-8B-Instruct
- Test: RULER niah_single_1, 64K context
- Profile file: `ruler_8slots_test.nsys-rep`
- Time window: 92.982s - 93.038s

### Kernel Execution Times

| Kernel | Typical execution time |
|--------|-------------|
| flash_fwd_kernel | ~138 μs |
| H2D memcpy (2MB) | ~87 μs |
| merge_lse_kernel | ~3.5 μs |
| merge_output_kernel | ~34 μs |

### Gap Analysis

Gaps observed in the cuda_gpu_trace:

```
Start (ms)    Dur (μs)   Gap (μs)   Type
------------------------------------------------------------
92984.680     138.3      378.3      flash_fwd_kernel  ← GAP!
92985.051     86.8       232.9      H2D memcpy        ← GAP!
92985.141     86.8       2.8        H2D memcpy
92985.587     135.9      360.0      flash_fwd_kernel  ← GAP!
92986.026     3.4        302.4      merge_lse         ← GAP!
92986.164     33.5       135.0      merge_output      ← GAP!
92986.371     86.9       173.4      H2D memcpy        ← GAP!
92986.461     86.8       2.7        H2D memcpy
92986.816     137.9      268.2      flash_fwd_kernel  ← GAP!
```
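The gap column can be recomputed directly from the start/duration pairs; the sub-microsecond discrepancies come from rounding in the printed start times. The same arithmetic reproduces the 12.8% / 39.5% utilization figures derived later in this document:

```python
# First four rows of the trace above: (start_ms, dur_us, name)
rows = [
    (92984.680, 138.3, "flash_fwd_kernel"),
    (92985.051, 86.8,  "H2D memcpy"),
    (92985.141, 86.8,  "H2D memcpy"),
    (92985.587, 135.9, "flash_fwd_kernel"),
]

gaps = []
for (s0, d0, _), (s1, _, name) in zip(rows, rows[1:]):
    prev_end_ms = s0 + d0 / 1000.0          # end of the previous op
    gaps.append(((s1 - prev_end_ms) * 1000.0, name))

for gap_us, name in gaps:
    print(f"{gap_us:6.1f} us before {name}")

# Utilization from average flash time vs. average gap
flash_us, avg_gap_us, required_us = 138.0, 942.0, 211.0
print(round(100 * flash_us / (flash_us + avg_gap_us), 1))   # → 12.8
print(round(100 * flash_us / (flash_us + required_us), 1))  # → 39.5
```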
### Flash Kernel Gap Breakdown

| Gap | Total | Useful work | Idle |
|------|--------|-------------|---------|
| Flash 1 → Flash 2 | 769 μs | ~174 μs (2x H2D) | ~595 μs (77%) |
| Flash 2 → Flash 3 | 1092 μs | ~211 μs (merge + H2D) | ~881 μs (81%) |
| Flash 3 → Flash 4 | 965 μs | ~211 μs (merge + H2D) | ~754 μs (78%) |

**Key finding**: roughly **77-81% of the time between flash kernels is CPU scheduling idle**.

## Where the Gaps Come From

### 1. Types of CPU Scheduling Latency

| Transition | Typical latency | Cause |
|------|---------|------|
| Kernel end → next kernel start | 100-400 μs | CPU prepares arguments and calls the CUDA driver |
| Flash end → H2D start | ~233 μs | Python execution + CUDA launch |
| H2D end → Flash start | ~360 μs | Synchronization wait + kernel launch |
| Flash end → merge start | ~302 μs | Python execution |

### 2. Where the Latency Originates in Code

```python
# full_policy.py: compute_chunked_prefill

for block_idx in range(num_blocks):
    # 1. Wait for H2D to finish (sync point)
    offload_engine.wait_slot_layer(current_slot)  # ← may introduce latency

    # 2. Fetch the KV data
    k_block, v_block = offload_engine.get_kv_for_slot(current_slot)

    # 3. Call flash attention (kernel launch)
    block_out, block_lse = flash_attn_with_kvcache(...)  # ← CPU scheduling latency

    # 4. Merge
    merge_output(...)  # ← CPU scheduling latency
    merge_lse(...)     # ← CPU scheduling latency

    # 5. Kick off the next H2D (async)
    offload_engine.load_to_slot_layer(next_slot, ...)  # ← CPU scheduling latency
```

### 3. Why the Gaps Between H2D Transfers Are Small

Consecutive H2D memcpys are only ~2.7 μs apart because:

- They are issued back-to-back on the same stream
- The CUDA driver can batch them
- No Python code intervenes

## GPU Utilization

Based on the observed data:

| Metric | Value |
|------|-----|
| Average flash kernel execution time | 138 μs |
| Average flash kernel gap | 942 μs |
| Flash kernel GPU utilization | 138 / (138 + 942) = **12.8%** |

If CPU scheduling latency were eliminated (keeping only the required H2D + merge):

| Metric | Value |
|------|-----|
| Required gap (2x H2D + merge) | ~211 μs |
| Theoretical GPU utilization | 138 / (138 + 211) = **39.5%** |

**Potential gain**: 3x GPU utilization

## Optimization Directions

### 1. CUDA Graph

Compile the whole per-block processing flow into a CUDA Graph to eliminate the repeated kernel launch overhead.

```python
# Pseudocode
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    # Pre-record the flash + merge operations
    block_out, block_lse = flash_attn_with_kvcache(...)
    merge_output(...)
    merge_lse(...)

# At runtime, only replay
for block_idx in range(num_blocks):
    graph.replay()  # single launch, no Python in the loop
```

### 2. Custom Triton Kernel

Fuse flash + merge into a single kernel, reducing the number of kernel launches.

### 3. C++ Extension

Move the Python loop into C++ to cut interpreter overhead.

### 4. Pipeline Overlap

Ensure H2D transfers fully overlap the previous block's compute:

```
Block 0: [H2D slot0] [Flash slot0] [merge]
Block 1:      [H2D slot1]          [Flash slot1] [merge]
Block 2:                 [H2D slot2]             [Flash slot2] [merge]
```

## Verification

### 1. Analyze gaps with nsys

```bash
# Generate a profile
bash scripts/profile_offload.sh --num-gpu-blocks 8

# Dump the kernel trace
nsys stats --report cuda_gpu_trace --format csv <file>.nsys-rep | \
  awk -F',' 'NR>1 && $1 >= START && $1 <= END'
```

### 2. Compute the gaps

```python
# From the trace data
prev_end = start + duration
gap = next_start - prev_end
```

## Related Files

- `nanovllm/kvcache/sparse/full_policy.py`: pipeline implementation
- `nanovllm/kvcache/offload_engine.py`: H2D/D2H transfers
- `scripts/profile_offload.sh`: profiling script

## References

- [CUDA Graph documentation](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs)
- [nsys User Guide](https://docs.nvidia.com/nsight-systems/UserGuide/index.html)
`nanovllm/kvcache/offload_engine.py`:

```diff
@@ -9,6 +9,7 @@ Key design principles for CUDA Graph compatibility:

 import torch
 import torch.cuda.nvtx
+import nvtx
 from torch import Tensor
 from typing import Dict, List, Tuple, Optional
 from dataclasses import dataclass
@@ -374,7 +375,9 @@ class OffloadEngine:
         """
         self.ring_slot_compute_done[slot_idx].record()

-    def load_to_slot_layer(self, slot_idx: int, layer_id: int, cpu_block_id: int) -> None:
+    def load_to_slot_layer(
+        self, slot_idx: int, layer_id: int, cpu_block_id: int, chunk_idx: int = -1
+    ) -> None:
         """
         Async load a single CPU block to a ring buffer slot for one layer.

@@ -389,13 +392,20 @@ class OffloadEngine:
             slot_idx: Target GPU slot index
             layer_id: Layer index to load (for CPU cache indexing)
             cpu_block_id: Source CPU block ID
+            chunk_idx: Optional chunk index for NVTX labeling (-1 means not specified)
         """
         logger.debug(f"Ring load: layer={layer_id}, CPU[{cpu_block_id}] -> GPU slot[{slot_idx}]")

         # Use per-slot stream for parallel transfers across different slots
         stream = self.slot_transfer_streams[slot_idx]

-        torch.cuda.nvtx.range_push(f"H2D: L{layer_id} CPU[{cpu_block_id}]->Slot[{slot_idx}]")
+        # Build NVTX label with optional chunk info
+        if chunk_idx >= 0:
+            nvtx_label = f"H2D: L{layer_id} Chunk{chunk_idx} CPU[{cpu_block_id}]->Slot[{slot_idx}]"
+        else:
+            nvtx_label = f"H2D: L{layer_id} CPU[{cpu_block_id}]->Slot[{slot_idx}]"
+
+        nvtx.push_range(message=nvtx_label, color="blue")
         with torch.cuda.stream(stream):
             # Wait for previous compute on this slot to complete before overwriting
             # This prevents data race: transfer must not start until attention finishes reading
@@ -413,7 +423,7 @@ class OffloadEngine:
                 self.v_cache_cpu[layer_id, cpu_block_id], non_blocking=True
             )
             self.ring_slot_ready[slot_idx].record(stream)
-        torch.cuda.nvtx.range_pop()
+        nvtx.pop_range()

     def wait_slot_layer(self, slot_idx: int) -> None:
         """
@@ -470,7 +480,8 @@ class OffloadEngine:
         else:
             self.sparse_policy.on_decode_offload(cpu_block_id, layer_id, k_cache, valid_tokens)

-        torch.cuda.nvtx.range_push(f"D2H: Slot[{slot_idx}]->CPU[L{layer_id},B{cpu_block_id}]")
+        nvtx_label = f"D2H: Slot[{slot_idx}]->CPU[L{layer_id},B{cpu_block_id}]"
+        nvtx.push_range(message=nvtx_label, color="green")
         with torch.cuda.stream(self.transfer_stream_main):
             # Wait for both compute_stream and default stream
             # - compute_stream: for flash attention operations
@@ -486,7 +497,7 @@ class OffloadEngine:
                 self.v_cache_gpu[slot_idx], non_blocking=True
             )
             self.ring_slot_offload_done[slot_idx].record(self.transfer_stream_main)
```
|
||||||
torch.cuda.nvtx.range_pop()
|
nvtx.pop_range()
|
||||||
|
|
||||||
# ----- KV access methods for ring buffer -----
|
# ----- KV access methods for ring buffer -----
|
||||||
|
|
||||||
@@ -702,6 +713,61 @@ class OffloadEngine:
|
|||||||
v = self.prefill_v_buffer[layer_id, :num_tokens].unsqueeze(0)
|
v = self.prefill_v_buffer[layer_id, :num_tokens].unsqueeze(0)
|
||||||
return k, v
|
return k, v
|
||||||
|
|
||||||
|
def write_to_prefill_buffer(
|
||||||
|
self,
|
||||||
|
layer_id: int,
|
||||||
|
k: Tensor,
|
||||||
|
v: Tensor,
|
||||||
|
chunk_idx: int = -1,
|
||||||
|
) -> None:
|
||||||
|
"""
|
||||||
|
Write KV tensors to prefill buffer (D2D copy within GPU).
|
||||||
|
|
||||||
|
This is called during chunked prefill to store current chunk's KV
|
||||||
|
before computing attention.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
layer_id: Layer index
|
||||||
|
k: Key tensor [num_tokens, kv_heads, head_dim]
|
||||||
|
v: Value tensor [num_tokens, kv_heads, head_dim]
|
||||||
|
chunk_idx: Current chunk index for NVTX labeling (-1 = not specified)
|
||||||
|
"""
|
||||||
|
num_tokens = k.shape[0]
|
||||||
|
|
||||||
|
# Build NVTX label
|
||||||
|
if chunk_idx >= 0:
|
||||||
|
nvtx_label = f"D2D: L{layer_id} Chunk{chunk_idx} WritePrefillBuffer"
|
||||||
|
else:
|
||||||
|
nvtx_label = f"D2D: L{layer_id} WritePrefillBuffer"
|
||||||
|
|
||||||
|
torch.cuda.nvtx.range_push(nvtx_label)
|
||||||
|
self.prefill_k_buffer[layer_id, :num_tokens].copy_(k)
|
||||||
|
self.prefill_v_buffer[layer_id, :num_tokens].copy_(v)
|
||||||
|
torch.cuda.nvtx.range_pop()
|
||||||
|
|
||||||
|
def write_to_decode_buffer(
|
||||||
|
self,
|
||||||
|
layer_id: int,
|
||||||
|
pos_in_block: int,
|
||||||
|
k: Tensor,
|
||||||
|
v: Tensor,
|
||||||
|
) -> None:
|
||||||
|
"""
|
||||||
|
Write KV tensors to decode buffer (D2D copy within GPU).
|
||||||
|
|
||||||
|
This is called during chunked decode to store current decode token's KV.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
layer_id: Layer index
|
||||||
|
pos_in_block: Position within the current block
|
||||||
|
k: Key tensor [kv_heads, head_dim] (single token, squeezed)
|
||||||
|
v: Value tensor [kv_heads, head_dim] (single token, squeezed)
|
||||||
|
"""
|
||||||
|
torch.cuda.nvtx.range_push(f"D2D: L{layer_id} Pos{pos_in_block} WriteDecodeBuffer")
|
||||||
|
self.decode_k_buffer[layer_id, pos_in_block].copy_(k)
|
||||||
|
self.decode_v_buffer[layer_id, pos_in_block].copy_(v)
|
||||||
|
torch.cuda.nvtx.range_pop()
|
||||||
|
|
||||||
def offload_prefill_buffer_async(
|
def offload_prefill_buffer_async(
|
||||||
self,
|
self,
|
||||||
layer_id: int,
|
layer_id: int,
|
||||||
@@ -729,7 +795,8 @@ class OffloadEngine:
|
|||||||
# Use per-layer stream for parallel offloads
|
# Use per-layer stream for parallel offloads
|
||||||
stream = self.prefill_offload_streams[layer_id]
|
stream = self.prefill_offload_streams[layer_id]
|
||||||
|
|
||||||
torch.cuda.nvtx.range_push(f"AsyncPrefillOffload: L{layer_id}->CPU[{cpu_block_id}]")
|
nvtx_label = f"D2H: PrefillBuffer L{layer_id}->CPU[{cpu_block_id}]"
|
||||||
|
nvtx.push_range(message=nvtx_label, color="orange")
|
||||||
with torch.cuda.stream(stream):
|
with torch.cuda.stream(stream):
|
||||||
# Wait for compute to finish writing to prefill buffer
|
# Wait for compute to finish writing to prefill buffer
|
||||||
stream.wait_stream(self.compute_stream)
|
stream.wait_stream(self.compute_stream)
|
||||||
@@ -744,7 +811,7 @@ class OffloadEngine:
|
|||||||
|
|
||||||
# Record completion event
|
# Record completion event
|
||||||
self.prefill_offload_events[layer_id].record(stream)
|
self.prefill_offload_events[layer_id].record(stream)
|
||||||
torch.cuda.nvtx.range_pop()
|
nvtx.pop_range()
|
||||||
|
|
||||||
def wait_all_prefill_offloads(self) -> None:
|
def wait_all_prefill_offloads(self) -> None:
|
||||||
"""Wait for all prefill buffer offloads to complete."""
|
"""Wait for all prefill buffer offloads to complete."""
|
||||||
|
|||||||
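The chunk-aware H2D labels above follow a small naming scheme. As a minimal sketch (plain Python, no `nvtx` or `torch` dependency; the standalone function name is illustrative), the label construction can be factored out and unit-tested on its own:

```python
def h2d_label(layer_id: int, cpu_block_id: int, slot_idx: int, chunk_idx: int = -1) -> str:
    """Build the H2D NVTX label; include Chunk<N> only when chunk_idx >= 0."""
    if chunk_idx >= 0:
        return f"H2D: L{layer_id} Chunk{chunk_idx} CPU[{cpu_block_id}]->Slot[{slot_idx}]"
    return f"H2D: L{layer_id} CPU[{cpu_block_id}]->Slot[{slot_idx}]"


print(h2d_label(0, 7, 2, chunk_idx=7))  # -> H2D: L0 Chunk7 CPU[7]->Slot[2]
print(h2d_label(0, 7, 2))               # -> H2D: L0 CPU[7]->Slot[2]
```

Keeping the label format in one place makes the Nsight Systems timeline searchable by a consistent `Chunk<N>` token across H2D, D2H, and D2D ranges.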
**`FullAttentionPolicy`** — pass the chunk index through to `load_to_slot_layer` so H2D transfers carry `Chunk<N>` labels:

```diff
@@ -139,7 +139,8 @@ class FullAttentionPolicy(SparsePolicy):
         slot = load_slots[0]
         for block_idx in range(num_blocks):
             cpu_block_id = cpu_block_table[block_idx]
-            offload_engine.load_to_slot_layer(slot, layer_id, cpu_block_id)
+            # cpu_block_id is the chunk index (block N = chunk N)
+            offload_engine.load_to_slot_layer(slot, layer_id, cpu_block_id, chunk_idx=cpu_block_id)
             offload_engine.wait_slot_layer(slot)

             with torch.cuda.stream(compute_stream):
@@ -159,7 +160,8 @@ class FullAttentionPolicy(SparsePolicy):
         num_slots = len(load_slots)
         num_preload = min(num_slots, num_blocks)
         for i in range(num_preload):
-            offload_engine.load_to_slot_layer(load_slots[i], layer_id, cpu_block_table[i])
+            cpu_block_id = cpu_block_table[i]
+            offload_engine.load_to_slot_layer(load_slots[i], layer_id, cpu_block_id, chunk_idx=cpu_block_id)

         for block_idx in range(num_blocks):
             current_slot = load_slots[block_idx % num_slots]
@@ -186,7 +188,7 @@ class FullAttentionPolicy(SparsePolicy):
             if next_block_idx < num_blocks:
                 next_slot = load_slots[next_block_idx % num_slots]
                 next_cpu_block_id = cpu_block_table[next_block_idx]
-                offload_engine.load_to_slot_layer(next_slot, layer_id, next_cpu_block_id)
+                offload_engine.load_to_slot_layer(next_slot, layer_id, next_cpu_block_id, chunk_idx=next_cpu_block_id)

             # Step 4: Compute attention to current chunk (causal mask)
             with torch.cuda.stream(compute_stream):
@@ -350,7 +352,8 @@ class FullAttentionPolicy(SparsePolicy):
         # Phase 1: Pre-load up to num_slots blocks
         num_preload = min(num_slots, num_blocks)
         for i in range(num_preload):
-            offload_engine.load_to_slot_layer(load_slots[i], layer_id, cpu_block_table[i])
+            cpu_block_id = cpu_block_table[i]
+            offload_engine.load_to_slot_layer(load_slots[i], layer_id, cpu_block_id, chunk_idx=cpu_block_id)

         # Phase 2: Process blocks with pipeline
         for block_idx in range(num_blocks):
@@ -383,7 +386,8 @@ class FullAttentionPolicy(SparsePolicy):
             # Start loading next block (pipeline)
             next_block_idx = block_idx + num_slots
             if next_block_idx < num_blocks:
-                offload_engine.load_to_slot_layer(current_slot, layer_id, cpu_block_table[next_block_idx])
+                next_cpu_block_id = cpu_block_table[next_block_idx]
+                offload_engine.load_to_slot_layer(current_slot, layer_id, next_cpu_block_id, chunk_idx=next_cpu_block_id)

             # Merge with accumulated
             with torch.cuda.stream(compute_stream):
```
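The preload-then-prefetch loop these hunks share can be simulated without CUDA to see the slot reuse order. A sketch under stated assumptions (pure Python; function name is illustrative; slots are assigned round-robin as `block_idx % num_slots`, matching the diff):

```python
def pipeline_schedule(num_blocks: int, num_slots: int):
    """Simulate the multi-slot pipeline: preload up to num_slots blocks, then
    for each block compute on slot (block_idx % num_slots) and prefetch block
    (block_idx + num_slots) into the slot that was just freed."""
    ops = []
    # Phase 1: pre-load up to num_slots blocks
    for i in range(min(num_slots, num_blocks)):
        ops.append(("load", i % num_slots, i))
    # Phase 2: compute each block, prefetching num_slots blocks ahead
    for block_idx in range(num_blocks):
        slot = block_idx % num_slots
        ops.append(("compute", slot, block_idx))
        next_block_idx = block_idx + num_slots
        if next_block_idx < num_blocks:
            # next_block_idx % num_slots == slot, so the prefetch reuses this slot
            ops.append(("load", slot, next_block_idx))
    return ops


print(pipeline_schedule(3, 2))
```

With 3 blocks and 2 slots this yields load 0, load 1, compute 0, load 2 (into slot 0), compute 1, compute 2 — transfers for block N+num_slots overlap with compute on block N, which is exactly what the NVTX chunk labels make visible on the timeline.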
**`XAttentionBSAPolicy`** — same change: forward the chunk index for NVTX labeling:

```diff
@@ -189,8 +189,8 @@ class XAttentionBSAPolicy(SparsePolicy):
         reshaped_block_size = block_size // self.stride  # e.g., 1024/8 = 128

         for cpu_block_id in available_blocks:
-            # Load K block from CPU to GPU
-            offload_engine.load_to_slot_layer(slot, layer_id, cpu_block_id)
+            # Load K block from CPU to GPU (cpu_block_id is chunk index)
+            offload_engine.load_to_slot_layer(slot, layer_id, cpu_block_id, chunk_idx=cpu_block_id)
             offload_engine.wait_slot_layer(slot)

             # Get KV: [1, block_size, num_kv_heads, head_dim]
@@ -382,7 +382,7 @@ class XAttentionBSAPolicy(SparsePolicy):
         slot = load_slots[0]
         for block_idx in range(num_blocks):
             cpu_block_id = cpu_block_table[block_idx]
-            offload_engine.load_to_slot_layer(slot, layer_id, cpu_block_id)
+            offload_engine.load_to_slot_layer(slot, layer_id, cpu_block_id, chunk_idx=cpu_block_id)
             offload_engine.wait_slot_layer(slot)

             with torch.cuda.stream(compute_stream):
@@ -402,7 +402,8 @@ class XAttentionBSAPolicy(SparsePolicy):
         num_slots = len(load_slots)
         num_preload = min(num_slots, num_blocks)
         for i in range(num_preload):
-            offload_engine.load_to_slot_layer(load_slots[i], layer_id, cpu_block_table[i])
+            cpu_block_id = cpu_block_table[i]
+            offload_engine.load_to_slot_layer(load_slots[i], layer_id, cpu_block_id, chunk_idx=cpu_block_id)

         for block_idx in range(num_blocks):
             current_slot = load_slots[block_idx % num_slots]
@@ -428,7 +429,7 @@ class XAttentionBSAPolicy(SparsePolicy):
             if next_block_idx < num_blocks:
                 next_slot = load_slots[next_block_idx % num_slots]
                 next_cpu_block_id = cpu_block_table[next_block_idx]
-                offload_engine.load_to_slot_layer(next_slot, layer_id, next_cpu_block_id)
+                offload_engine.load_to_slot_layer(next_slot, layer_id, next_cpu_block_id, chunk_idx=next_cpu_block_id)

             # Compute attention to current chunk (causal mask)
             with torch.cuda.stream(compute_stream):
```
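Both policies rely on the same per-slot handshake: `load_to_slot_layer` must not overwrite a slot until the previous compute has finished reading it, and compute must wait for the transfer via `wait_slot_layer`. A toy model of that protocol (pure Python, no CUDA events; the class and its fields are illustrative stand-ins for `ring_slot_ready` / `ring_slot_compute_done`):

```python
class RingSlot:
    """Toy model of one ring-buffer slot and its two synchronization events."""

    def __init__(self):
        self.block = None
        self.ready = False         # models ring_slot_ready (H2D finished)
        self.compute_done = True   # models ring_slot_compute_done (safe to overwrite)

    def load(self, cpu_block_id):
        # H2D transfer waits on compute_done before overwriting the slot
        assert self.compute_done, "data race: previous compute still reading slot"
        self.block = cpu_block_id
        self.ready, self.compute_done = True, False

    def compute(self):
        # Attention waits on ready before reading the slot
        assert self.ready, "slot not ready: transfer still in flight"
        self.ready, self.compute_done = False, True
        return self.block


slot = RingSlot()
slot.load(0)
assert slot.compute() == 0
slot.load(1)   # legal only because compute_done was recorded above
assert slot.compute() == 1
```

Loading the same slot twice without an intervening compute trips the assertion, which mirrors the hazard the real code prevents with event waits.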
**`Attention`** — route all KV writes through the new `offload_engine` helpers instead of poking its buffers directly; the chunked-decode branch no longer calls `store_kvcache` at all:

```diff
@@ -104,27 +104,21 @@ class Attention(nn.Module):
             # This enables fully async offloads since each layer has its own buffer.
             offload_engine = context.kvcache_manager.offload_engine
             compute_stream = offload_engine.compute_stream
+            chunk_idx = context.current_chunk_idx if hasattr(context, 'current_chunk_idx') else -1

             # Wait for default stream to ensure slot_mapping tensor transfer is complete
             compute_stream.wait_stream(torch.cuda.default_stream())

             with torch.cuda.stream(compute_stream):
-                # Write KV to per-layer prefill buffer (contiguous write, no slot_mapping)
+                # Write KV to per-layer prefill buffer via offload_engine
                 # k, v shape: [num_tokens, kv_heads, head_dim]
-                num_tokens = k.shape[0]
-                offload_engine.prefill_k_buffer[self.layer_id, :num_tokens].copy_(k)
-                offload_engine.prefill_v_buffer[self.layer_id, :num_tokens].copy_(v)
+                #! GPU 2 GPU
+                offload_engine.write_to_prefill_buffer(self.layer_id, k, v, chunk_idx=chunk_idx)
         elif is_chunked_offload:
-            # Chunked decode mode: use compute_stream for store_kvcache
-            # This ensures proper synchronization with per-layer offload
-            compute_stream = context.kvcache_manager.offload_engine.compute_stream
-            if k_cache.numel() and v_cache.numel():
-                # CRITICAL: Wait for default stream to ensure slot_mapping tensor transfer is complete
-                # slot_mapping is created with non_blocking=True on default stream, but we use it
-                # on compute_stream. Without this sync, index_copy_ can get corrupted indices.
-                compute_stream.wait_stream(torch.cuda.default_stream())
-                with torch.cuda.stream(compute_stream):
-                    store_kvcache(k, v, k_cache, v_cache, context.slot_mapping)
+            # Chunked decode mode: write KV to per-layer decode buffer via offload_engine
+            # KV will be written to decode buffer in the decode branch below
+            # No store_kvcache needed - all KV management goes through offload_engine
+            pass
         else:
             # Normal mode: store on default stream
             if k_cache.numel() and v_cache.numel():
@@ -155,8 +149,7 @@ class Attention(nn.Module):
             offload_engine = kvcache_manager.offload_engine
             pos_in_block = context.decode_pos_in_block
             # k, v shape: [1, kv_heads, head_dim]
-            offload_engine.decode_k_buffer[self.layer_id, pos_in_block].copy_(k.squeeze(0))
-            offload_engine.decode_v_buffer[self.layer_id, pos_in_block].copy_(v.squeeze(0))
+            offload_engine.write_to_decode_buffer(self.layer_id, pos_in_block, k.squeeze(0), v.squeeze(0))

             o = self._chunked_decode_attention(q, k, v, context)
         else:
             o = flash_attn_with_kvcache(q.unsqueeze(1), k_cache, v_cache,
```
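The `hasattr` fallback added for `current_chunk_idx` is equivalent to `getattr` with a default, which is the more idiomatic spelling. A quick sketch with a stand-in context object (the `Ctx` class is illustrative, not the real forward context):

```python
class Ctx:
    """Stand-in for the forward context; current_chunk_idx may be absent."""
    pass


ctx = Ctx()
# Same meaning as: ctx.current_chunk_idx if hasattr(ctx, 'current_chunk_idx') else -1
assert getattr(ctx, "current_chunk_idx", -1) == -1

ctx.current_chunk_idx = 5
assert getattr(ctx, "current_chunk_idx", -1) == 5
```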
**`scripts/profile_offload.sh`** — new `--num-gpu-blocks` option, threaded through the header comment, defaults, argument parsing, help text, output name, and the test invocation:

```diff
@@ -9,6 +9,7 @@
 # --dataset DATASET   Task name (default: niah_single_1)
 # --sample INDEX      Sample index (default: 0)
 # --gpu GPU_ID        GPU to use (default: 0)
+# --num-gpu-blocks N  Number of GPU blocks/slots (default: 4)
 # --no-offload        Disable CPU offload
 #
 # Output:
@@ -18,6 +19,7 @@
 #   bash scripts/profile_offload.sh
 #   bash scripts/profile_offload.sh --dataset niah_single_1 --sample 5
 #   bash scripts/profile_offload.sh --gpu 1 --no-offload
+#   bash scripts/profile_offload.sh --num-gpu-blocks 8

 set -e
@@ -25,6 +27,7 @@ set -e
 DATASET="niah_single_1"
 SAMPLE_INDEX="0"
 GPU_ID="0"
+NUM_GPU_BLOCKS="4"
 ENABLE_OFFLOAD="--enable-offload"

 # Parse arguments
@@ -46,6 +49,10 @@ while [[ $# -gt 0 ]]; do
             ENABLE_OFFLOAD=""
             shift
             ;;
+        --num-gpu-blocks)
+            NUM_GPU_BLOCKS="$2"
+            shift 2
+            ;;
         -h|--help)
             echo "Usage: $0 [options]"
             echo ""
@@ -54,6 +61,7 @@ while [[ $# -gt 0 ]]; do
             echo "  --sample INDEX     Sample index (default: 0)"
             echo "  --gpu GPU_ID       GPU to use (default: 0)"
             echo "  --no-offload       Disable CPU offload"
+            echo "  --num-gpu-blocks N Number of GPU blocks/slots (default: 4)"
             exit 0
             ;;
         *)
@@ -76,7 +84,7 @@ mkdir -p "$OUTPUT_DIR"
 TIMESTAMP=$(date +%Y%m%d_%H%M%S)
 OFFLOAD_SUFFIX=""
 if [ -n "$ENABLE_OFFLOAD" ]; then
-    OFFLOAD_SUFFIX="_offload"
+    OFFLOAD_SUFFIX="_offload_${NUM_GPU_BLOCKS}slots"
 fi
 OUTPUT_FILE="$OUTPUT_DIR/ruler_${DATASET}_sample${SAMPLE_INDEX}${OFFLOAD_SUFFIX}_${TIMESTAMP}"
@@ -87,6 +95,7 @@ echo "Test script: $TEST_SCRIPT"
 echo "Dataset: $DATASET"
 echo "Sample: $SAMPLE_INDEX"
 echo "GPU: $GPU_ID"
+echo "GPU Blocks: $NUM_GPU_BLOCKS"
 echo "Offload: ${ENABLE_OFFLOAD:-disabled}"
 echo "Output file: $OUTPUT_FILE.nsys-rep"
 echo ""
@@ -109,6 +118,7 @@ nsys profile \
 python "$TEST_SCRIPT" \
     --datasets "$DATASET" \
     --sample-indices "$SAMPLE_INDEX" \
+    --num-gpu-blocks "$NUM_GPU_BLOCKS" \
     $ENABLE_OFFLOAD \
     --quiet
```
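The output-name logic now encodes the slot count in the report filename. A Python mirror of the bash naming (an illustrative helper, not part of the script) shows the resulting pattern:

```python
def nsys_output_stem(dataset: str, sample_index: int, enable_offload: bool,
                     num_gpu_blocks: int, timestamp: str) -> str:
    """Mirror of the bash naming: ruler_<dataset>_sample<i>[_offload_<N>slots]_<ts>."""
    suffix = f"_offload_{num_gpu_blocks}slots" if enable_offload else ""
    return f"ruler_{dataset}_sample{sample_index}{suffix}_{timestamp}"


print(nsys_output_stem("niah_single_1", 0, True, 8, "20260127_031500") + ".nsys-rep")
# -> ruler_niah_single_1_sample0_offload_8slots_20260127_031500.nsys-rep
```

Encoding `num_gpu_blocks` in the name makes side-by-side `.nsys-rep` comparisons across slot counts self-describing, with no external bookkeeping.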