🔧 chore: add nsys profiling rule and update gitignore
- Add rule requiring profile_offload.sh for all nsys profiling
- Document available parameters and typical workflows
- Ignore Snipaste screenshot files

Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
89
.claude/rules/nsys-profiling.md
Normal file
@@ -0,0 +1,89 @@
# Nsys Profiling Rule

## Mandatory Rule

**All nsys profiling tasks must use the `scripts/profile_offload.sh` script.** Running `nsys` commands directly is prohibited.

| Prohibited | Reason |
|------------|--------|
| `nsys profile python tests/test_ruler.py ...` | Inconsistent parameters and disorganized output paths |
| Hand-built nsys commands | Easy to omit key parameters |

## Usage

```bash
# Basic usage (defaults to 4 slots)
bash scripts/profile_offload.sh

# Set the number of GPU slots
bash scripts/profile_offload.sh --num-gpu-blocks 8

# Select a sample
bash scripts/profile_offload.sh --sample 5

# Select a dataset
bash scripts/profile_offload.sh --dataset niah_single_1

# Disable offload (for comparison runs)
bash scripts/profile_offload.sh --no-offload

# Combine parameters
bash scripts/profile_offload.sh --num-gpu-blocks 8 --sample 0 --gpu 1
```
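
Because the flags compose, a sweep over several slot counts can be scripted around the wrapper instead of typing each run by hand. The following is a minimal sketch, not part of the repository: `sweep_slots`, `PROFILE_CMD`, and `DRY_RUN` are hypothetical names introduced here for illustration, and only the `--num-gpu-blocks` flag documented above is assumed.

```bash
#!/usr/bin/env bash
# Hypothetical sweep helper; not part of scripts/profile_offload.sh.
# PROFILE_CMD and DRY_RUN are illustrative names, not existing settings.
PROFILE_CMD="${PROFILE_CMD:-bash scripts/profile_offload.sh}"

sweep_slots() {
    # Each argument is a slot count passed to --num-gpu-blocks.
    local slots
    for slots in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Print the command instead of running it.
            echo "$PROFILE_CMD --num-gpu-blocks $slots"
        else
            # PROFILE_CMD is deliberately unquoted so it splits into words.
            $PROFILE_CMD --num-gpu-blocks "$slots" || return 1
        fi
    done
}

# Example: preview with DRY_RUN=1, then run `sweep_slots 4 8`.
```

Setting `DRY_RUN=1` first prints the commands that would be run, which makes it easy to confirm that every run still goes through the wrapper script.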

## Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--dataset` | `niah_single_1` | RULER task name |
| `--sample` | `0` | Sample index |
| `--gpu` | `0` | GPU to use |
| `--num-gpu-blocks` | `4` | Number of GPU ring buffer slots |
| `--no-offload` | - | Disable CPU offload |

## Output Files

Output files are written automatically to the `results/nsys/` directory:

```
results/nsys/ruler_<dataset>_sample<index>_offload_<slots>slots_<timestamp>.nsys-rep
```

Example: `ruler_niah_single_1_sample0_offload_8slots_20260127_031500.nsys-rep`
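
Since the run configuration is encoded in the file name, it can be recovered from a report path when sorting or labeling results. The sketch below is a hypothetical helper (`parse_report_name` is not part of the repository). It assumes the timestamp always has the `YYYYMMDD_HHMMSS` shape seen in the example, and it only handles the `_offload_` naming pattern shown above.

```bash
#!/usr/bin/env bash
# Hypothetical helper; not part of scripts/profile_offload.sh.
# Assumes the documented pattern
#   ruler_<dataset>_sample<index>_offload_<slots>slots_<timestamp>.nsys-rep
# with a timestamp shaped like 20260127_031500 (an assumption from the example).
parse_report_name() {
    local name parsed
    name="$(basename "$1")"
    parsed="$(printf '%s\n' "$name" | sed -nE \
        's/^ruler_(.+)_sample([0-9]+)_offload_([0-9]+)slots_([0-9]{8}_[0-9]{6})\.nsys-rep$/dataset=\1 sample=\2 slots=\3 timestamp=\4/p')"
    if [ -z "$parsed" ]; then
        echo "unrecognized report name: $name" >&2
        return 1
    fi
    printf '%s\n' "$parsed"
}
```

For the example file above, the function prints `dataset=niah_single_1 sample=0 slots=8 timestamp=20260127_031500`. Names that do not follow the pattern produce an error on stderr and a non-zero return code.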

## Viewing Results

```bash
# View in the GUI
nsight-sys results/nsys/<filename>.nsys-rep

# Command-line statistics
nsys stats --report cuda_api_sum results/nsys/<filename>.nsys-rep
nsys stats --report cuda_gpu_kern_sum results/nsys/<filename>.nsys-rep
```

## Typical Workflows

### 1. Compare different slot counts

```bash
# Profile with 4 slots (default)
bash scripts/profile_offload.sh --num-gpu-blocks 4

# Profile with 8 slots
bash scripts/profile_offload.sh --num-gpu-blocks 8

# Compare results (each glob is assumed to match exactly one report)
nsys stats --report cuda_gpu_kern_sum results/nsys/*4slots*.nsys-rep
nsys stats --report cuda_gpu_kern_sum results/nsys/*8slots*.nsys-rep
```
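
Once several runs have accumulated in `results/nsys/`, a slot-count glob can match more than one report. One way to handle this is to pick the most recently modified report for a given slot count. The sketch below uses a hypothetical helper (`latest_report` is not part of the repository) and assumes the file naming pattern documented under Output Files.

```bash
#!/usr/bin/env bash
# Hypothetical helper; not part of scripts/profile_offload.sh.
# Prints the most recently modified report for a slot count, or nothing.
latest_report() {
    # $1 = slot count, $2 = report directory (defaults to results/nsys)
    local slots="$1" dir="${2:-results/nsys}"
    # The leading underscore keeps "8slots" from also matching "18slots".
    ls -t "$dir"/*_"${slots}"slots_*.nsys-rep 2>/dev/null | head -n 1
}

# Example:
#   nsys stats --report cuda_gpu_kern_sum "$(latest_report 8)"
```

Selecting by modification time is a design choice for this sketch. Because the timestamp is embedded in the file name, sorting by name would be an alternative when files have been copied and their modification times are no longer meaningful.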

### 2. Analyze pipeline overlap

```bash
# Generate a profile
bash scripts/profile_offload.sh --num-gpu-blocks 8

# Open the CUDA HW timeline in the nsight-sys GUI
# Check whether H2D copies overlap with flash_fwd_kernel
```
1
.gitignore
vendored
@@ -239,3 +239,4 @@ task_plan_*.md
findings_*.md
progress_*.md
notes.md
Snipaste*