nano-vllm/.claude/rules/nsys-profiling.md
Zijie Tian 924a0d2bfa 🔧 chore: add nsys profiling rule and update gitignore
- Add rule requiring profile_offload.sh for all nsys profiling
- Document available parameters and typical workflows
- Ignore Snipaste screenshot files

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 03:42:17 +08:00

# Nsys Profiling Rule
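## Mandatory Rule
**All nsys profiling tasks must go through the `scripts/profile_offload.sh` script.** Running nsys commands directly is prohibited.
| Prohibited | Reason |
|------|------|
| `nsys profile python tests/test_ruler.py ...` | Inconsistent parameters and disorganized output paths |
| Hand-built nsys commands | Key parameters are easily omitted |
## Usage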
```bash
# Basic usage (default: 4 slots)
bash scripts/profile_offload.sh
# Set the number of GPU slots
bash scripts/profile_offload.sh --num-gpu-blocks 8
# Select a sample
bash scripts/profile_offload.sh --sample 5
# Select a dataset
bash scripts/profile_offload.sh --dataset niah_single_1
# Disable offload (for comparison runs)
bash scripts/profile_offload.sh --no-offload
# Combine parameters
bash scripts/profile_offload.sh --num-gpu-blocks 8 --sample 0 --gpu 1
```
## Parameters
| Parameter | Default | Description |
|------|--------|------|
| `--dataset` | `niah_single_1` | RULER task name |
| `--sample` | `0` | Sample index |
| `--gpu` | `0` | GPU to use |
| `--num-gpu-blocks` | `4` | Number of GPU ring buffer slots |
| `--no-offload` | - | Disable CPU offload |
## Output Files
Output files are written automatically to the `results/nsys/` directory:
```
results/nsys/ruler_<dataset>_sample<index>_offload_<slots>slots_<timestamp>.nsys-rep
```
Example: `ruler_niah_single_1_sample0_offload_8slots_20260127_031500.nsys-rep`
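
The naming pattern above can be sketched as a small shell function, which helps when predicting or matching report names in other tooling. This is only an illustration: `nsys_output_name` is a hypothetical helper, not part of `profile_offload.sh`, and the `YYYYMMDD_HHMMSS` timestamp format is inferred from the example filename. How the name changes under `--no-offload` is not documented here, so the sketch covers the offload case only.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reproduces the documented naming pattern only.
# It is NOT taken from scripts/profile_offload.sh.
nsys_output_name() {
    local dataset="$1" sample="$2" slots="$3"
    # Timestamp format is an assumption inferred from the example filename.
    # Pass a fourth argument to get a reproducible name.
    local timestamp="${4:-$(date +%Y%m%d_%H%M%S)}"
    printf 'results/nsys/ruler_%s_sample%s_offload_%sslots_%s.nsys-rep\n' \
        "$dataset" "$sample" "$slots" "$timestamp"
}

# Rebuilds the path of the example file shown above.
nsys_output_name niah_single_1 0 8 20260127_031500
```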
## Viewing Results
```bash
# Open in the GUI
nsight-sys results/nsys/<filename>.nsys-rep
# Command-line statistics
nsys stats --report cuda_api_sum results/nsys/<filename>.nsys-rep
nsys stats --report cuda_gpu_kern_sum results/nsys/<filename>.nsys-rep
```
## Typical Workflows
### 1. Compare different slot counts
```bash
# Profile with 4 slots (the default)
bash scripts/profile_offload.sh --num-gpu-blocks 4
# Profile with 8 slots
bash scripts/profile_offload.sh --num-gpu-blocks 8
# Compare the results
nsys stats --report cuda_gpu_kern_sum results/nsys/*4slots*.nsys-rep
nsys stats --report cuda_gpu_kern_sum results/nsys/*8slots*.nsys-rep
```
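
Once `results/nsys/` holds several runs, the `*4slots*` and `*8slots*` globs above match every earlier report with that slot count. A minimal sketch of one way to narrow the comparison to the newest report per slot count follows. It assumes the filename pattern from the Output Files section, and `latest_report` is a hypothetical helper, not something the repository provides.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the most recently modified report for a slot count.
# Relies on the "_<slots>slots_" fragment of the documented filename pattern.
latest_report() {
    local slots="$1" dir="${2:-results/nsys}" newest="" f
    for f in "$dir"/*_"${slots}"slots_*.nsys-rep; do
        [ -e "$f" ] || continue          # the glob matched nothing
        if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
            newest="$f"
        fi
    done
    if [ -n "$newest" ]; then
        printf '%s\n' "$newest"
    fi
}

# Summarize kernels for the newest 4-slot and 8-slot reports, if present.
for slots in 4 8; do
    report="$(latest_report "$slots")"
    if [ -n "$report" ]; then
        nsys stats --report cuda_gpu_kern_sum "$report"
    else
        echo "no report found for ${slots} slots" >&2
    fi
done
```

Matching on `_<slots>slots_` with the leading underscore keeps a request for 8 slots from also picking up an 18-slot report.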
### 2. Analyze pipeline overlap
```bash
# Generate a profile
bash scripts/profile_offload.sh --num-gpu-blocks 8
# Open the report in the nsight-sys GUI and inspect the CUDA HW timeline
# Check whether H2D transfers and flash_fwd_kernel overlap
```