🔧 chore: add nsys profiling rule and update gitignore

- Add rule requiring profile_offload.sh for all nsys profiling
- Document available parameters and typical workflows
- Ignore Snipaste screenshot files

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
Zijie Tian
2026-01-27 03:42:17 +08:00
parent 0619accd1c
commit 924a0d2bfa
2 changed files with 90 additions and 0 deletions

@@ -0,0 +1,89 @@
# Nsys Profiling Rule
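## Mandatory Rule
**All nsys profiling tasks must use the `scripts/profile_offload.sh` script.** Running nsys commands directly is prohibited.
| Prohibited | Reason |
|------|------|
| `nsys profile python tests/test_ruler.py ...` | Inconsistent parameters and scattered output paths |
| Hand-written nsys commands | Key parameters are easy to omit |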
## Usage
```bash
# Basic usage (default: 4 slots)
bash scripts/profile_offload.sh
# Set the number of GPU slots
bash scripts/profile_offload.sh --num-gpu-blocks 8
# Select a sample
bash scripts/profile_offload.sh --sample 5
# Select a dataset
bash scripts/profile_offload.sh --dataset niah_single_1
# Disable offload (for comparison runs)
bash scripts/profile_offload.sh --no-offload
# Combine parameters
bash scripts/profile_offload.sh --num-gpu-blocks 8 --sample 0 --gpu 1
```
## Parameters
| Parameter | Default | Description |
|------|--------|------|
| `--dataset` | `niah_single_1` | RULER task name |
| `--sample` | `0` | Sample index |
| `--gpu` | `0` | GPU to use |
| `--num-gpu-blocks` | `4` | Number of GPU ring buffer slots |
| `--no-offload` | - | Disable CPU offload |
## Output Files
Output files are written automatically to the `results/nsys/` directory:
```
results/nsys/ruler_<dataset>_sample<index>_offload_<slots>slots_<timestamp>.nsys-rep
```
Example: `ruler_niah_single_1_sample0_offload_8slots_20260127_031500.nsys-rep`
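
The naming pattern can be sketched as a small shell function. This is an illustration only: `build_report_name` is a hypothetical helper, not code taken from `scripts/profile_offload.sh`, and the timestamp format (`%Y%m%d_%H%M%S`) is inferred from the example filename. It covers only the offload case documented here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: rebuilds the documented report name from its parts.
# Not taken from scripts/profile_offload.sh; defaults mirror the parameter table.
build_report_name() {
    local dataset="${1:-niah_single_1}"              # --dataset default
    local sample="${2:-0}"                           # --sample default
    local slots="${3:-4}"                            # --num-gpu-blocks default
    local timestamp="${4:-$(date +%Y%m%d_%H%M%S)}"   # assumed timestamp format
    printf 'ruler_%s_sample%s_offload_%sslots_%s.nsys-rep\n' \
        "$dataset" "$sample" "$slots" "$timestamp"
}

# Reproduces the example filename.
build_report_name niah_single_1 0 8 20260127_031500
```

Passing the timestamp explicitly keeps the output deterministic; omitting it falls back to the current time.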
## Viewing Results
```bash
# Open in the GUI
nsight-sys results/nsys/<filename>.nsys-rep
# Command-line statistics
nsys stats --report cuda_api_sum results/nsys/<filename>.nsys-rep
nsys stats --report cuda_gpu_kern_sum results/nsys/<filename>.nsys-rep
```
## Typical Workflows
### 1. Compare different slot counts
```bash
# Profile with 4 slots (default)
bash scripts/profile_offload.sh --num-gpu-blocks 4
# Profile with 8 slots
bash scripts/profile_offload.sh --num-gpu-blocks 8
# Compare the results
nsys stats --report cuda_gpu_kern_sum results/nsys/*4slots*.nsys-rep
nsys stats --report cuda_gpu_kern_sum results/nsys/*8slots*.nsys-rep
```
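
Once several runs have accumulated, a wildcard such as `*4slots*` matches every report with that slot count, and it also matches names containing `14slots`. The sketch below picks only the newest report for an exact slot count. `latest_report` is a hypothetical helper, not part of the repository, and it assumes the filename pattern documented under the output files section.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the newest report for an exact slot count.
# Assumes names follow ..._offload_<slots>slots_<timestamp>.nsys-rep.
latest_report() {
    local slots="$1"
    local dir="${2:-results/nsys}"
    # The leading underscore keeps "4slots" from also matching "14slots".
    # ls -t sorts by modification time, newest first; prints nothing if no match.
    ls -t "$dir"/*_"${slots}"slots_*.nsys-rep 2>/dev/null | head -n 1
}

# Usage (requires nsys on PATH):
#   nsys stats --report cuda_gpu_kern_sum "$(latest_report 8)"
```

Parsing `ls` output is acceptable here only because the report names are generated by a script and contain no spaces or newlines.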
### 2. Analyze pipeline overlap
```bash
# Generate a profile
bash scripts/profile_offload.sh --num-gpu-blocks 8
# Open the report in the nsight-sys GUI and inspect the CUDA HW timeline
# Check whether H2D copies overlap with flash_fwd_kernel
```

.gitignore vendored

@@ -239,3 +239,4 @@ task_plan_*.md
findings_*.md
progress_*.md
notes.md
Snipaste*