From 924a0d2bfa58a40bc9207bbeb0ceadc7113d3324 Mon Sep 17 00:00:00 2001
From: Zijie Tian
Date: Tue, 27 Jan 2026 03:42:17 +0800
Subject: [PATCH] =?UTF-8?q?=F0=9F=94=A7=20chore:=20add=20nsys=20profiling?=
 =?UTF-8?q?=20rule=20and=20update=20gitignore?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Add rule requiring profile_offload.sh for all nsys profiling
- Document available parameters and typical workflows
- Ignore Snipaste screenshot files

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude
Co-Authored-By: Happy
---
 .claude/rules/nsys-profiling.md | 89 +++++++++++++++++++++++++++++++++
 .gitignore                      |  1 +
 2 files changed, 90 insertions(+)
 create mode 100644 .claude/rules/nsys-profiling.md

diff --git a/.claude/rules/nsys-profiling.md b/.claude/rules/nsys-profiling.md
new file mode 100644
index 0000000..f8a5e0e
--- /dev/null
+++ b/.claude/rules/nsys-profiling.md
@@ -0,0 +1,89 @@
+# Nsys Profiling Rule
+
+## Mandatory Rule
+
+**All nsys profiling tasks must use the `scripts/profile_offload.sh` script.** Running nsys commands directly is prohibited.
+
+| Prohibited | Reason |
+|------------|--------|
+| `nsys profile python tests/test_ruler.py ...` | Inconsistent parameters and disorganized output paths |
+| Hand-constructed nsys commands | Easy to omit key parameters |
+
+## Usage
+
+```bash
+# Basic usage (4 slots by default)
+bash scripts/profile_offload.sh
+
+# Set the number of GPU slots
+bash scripts/profile_offload.sh --num-gpu-blocks 8
+
+# Select a sample
+bash scripts/profile_offload.sh --sample 5
+
+# Select a dataset
+bash scripts/profile_offload.sh --dataset niah_single_1
+
+# Disable offload (for comparison runs)
+bash scripts/profile_offload.sh --no-offload
+
+# Combine parameters
+bash scripts/profile_offload.sh --num-gpu-blocks 8 --sample 0 --gpu 1
+```
+
+## Parameters
+
+| Parameter | Default | Description |
+|-----------|---------|-------------|
+| `--dataset` | `niah_single_1` | RULER task name |
+| `--sample` | `0` | Sample index |
+| `--gpu` | `0` | GPU to use |
+| `--num-gpu-blocks` | `4` | Number of GPU ring buffer slots |
+| `--no-offload` | - | Disable CPU offload |
+
+## Output Files
+
+Output files are written automatically to the `results/nsys/` directory:
+
+```
+results/nsys/ruler__sample_offload_slots_.nsys-rep
+```
+
+Example: `ruler_niah_single_1_sample0_offload_8slots_20260127_031500.nsys-rep`
+
+## Viewing Results
+
+```bash
+# Open in the GUI
+nsight-sys results/nsys/.nsys-rep
+
+# Command-line statistics
+nsys stats --report cuda_api_sum results/nsys/.nsys-rep
+nsys stats --report cuda_gpu_kern_sum results/nsys/.nsys-rep
+```
+
+## Typical Workflows
+
+### 1. Compare different slot counts
+
+```bash
+# Profile with 4 slots (default)
+bash scripts/profile_offload.sh --num-gpu-blocks 4
+
+# Profile with 8 slots
+bash scripts/profile_offload.sh --num-gpu-blocks 8
+
+# Compare the results
+nsys stats --report cuda_gpu_kern_sum results/nsys/*4slots*.nsys-rep
+nsys stats --report cuda_gpu_kern_sum results/nsys/*8slots*.nsys-rep
+```
+
+### 2. Analyze pipeline overlap
+
+```bash
+# Generate a profile
+bash scripts/profile_offload.sh --num-gpu-blocks 8
+
+# Open the CUDA HW timeline in the nsight-sys GUI
+# Check whether H2D transfers and flash_fwd_kernel overlap
+```
diff --git a/.gitignore b/.gitignore
index 0d77302..0660f16 100644
--- a/.gitignore
+++ b/.gitignore
@@ -239,3 +239,4 @@ task_plan_*.md
 findings_*.md
 progress_*.md
 notes.md
+Snipaste*