From 78a44f35365f94eefa4b1d721f90c3cd8519d231 Mon Sep 17 00:00:00 2001
From: Zijie Tian
Date: Sat, 24 Jan 2026 01:41:25 +0800
Subject: [PATCH] 📝 docs: add GPU memory monitoring rule
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Add .claude/rules/gpu-monitor.md requiring gpu-monitor agent for all GPU memory monitoring tasks
- Update CLAUDE.md rules index with reference to new rule

Co-Authored-By: Claude Opus 4.5
---
 .claude/rules/gpu-monitor.md | 74 ++++++++++++++++++++++++++++++++++++
 CLAUDE.md                    |  1 +
 2 files changed, 75 insertions(+)
 create mode 100644 .claude/rules/gpu-monitor.md

diff --git a/.claude/rules/gpu-monitor.md b/.claude/rules/gpu-monitor.md
new file mode 100644
index 0000000..1abe5f8
--- /dev/null
+++ b/.claude/rules/gpu-monitor.md
@@ -0,0 +1,74 @@
+# GPU Memory Monitoring Rule
+
+## Mandatory Rule
+
+**All GPU memory monitoring tasks must use the `gpu-monitor` agent.** The following approaches are prohibited:
+
+| ❌ Prohibited | Reason |
+|--------|------|
+| `nvidia-smi` loop + sleep | Blocks the main agent; cannot run in parallel |
+| Background bash monitoring script | Hard to manage; output gets mixed up |
+| Manual polling | Inefficient; consumes context |
+
+## Usage
+
+```python
+# Start GPU monitoring (runs in the background)
+Task(
+    subagent_type="gpu-monitor",
+    prompt="Monitor GPU 0 with 0.5 second interval",
+    run_in_background=True
+)
+```
+
+## Parameters
+
+| Parameter | Description | Example |
+|------|------|------|
+| GPU ID | GPU(s) to monitor | `GPU 0`, `GPU 0,1` |
+| interval | Sampling interval | `0.5 second`, `1 second` |
+| Purpose | Reason for monitoring | `for RULER benchmark test` |
+
+## Typical Usage
+
+### 1. Single-GPU benchmarking
+```
+Monitor GPU 0 with 1 second interval for benchmark profiling
+```
+
+### 2. Debugging OOM
+```
+Monitor GPU 0 with 0.5 second interval to track memory peak during inference
+```
+
+### 3. Multi-GPU training
+```
+Monitor GPU 0,1,2,3 with 2 second interval during training
+```
+
+## Retrieving Results
+
+Monitoring results are written automatically to output_file; read them as follows:
+
+```bash
+# View the latest output
+tail -50 /tmp/claude/.../tasks/.output
+
+# Find peak values
+grep -i "peak\|max" /tmp/claude/.../tasks/.output
+```
+
+## Running in Parallel with Tests
+
+gpu-monitor runs in the background and does not block tests:
+
+```python
+# 1. Start monitoring (background)
+Task(subagent_type="gpu-monitor", ..., run_in_background=True)
+
+# 2. Run the test (foreground)
+Bash("python tests/test_ruler.py ...")
+
+# 3. Check the monitoring results after the test completes
+Bash("tail -50 ")
+```
diff --git a/CLAUDE.md b/CLAUDE.md
index 70c7d06..bf08c7b 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -34,6 +34,7 @@ Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline L
 | [`.claude/rules/gpu-testing.md`](.claude/rules/gpu-testing.md) | GPU type detection, card assignment, needle test requirements |
 | [`.claude/rules/sparse-policy.md`](.claude/rules/sparse-policy.md) | SparsePolicy implementation requirements |
 | [`.claude/rules/planning-with-files.md`](.claude/rules/planning-with-files.md) | Planning file management for complex tasks |
+| [`.claude/rules/gpu-monitor.md`](.claude/rules/gpu-monitor.md) | **GPU memory monitoring**: must use the gpu-monitor agent; manual nvidia-smi loops are prohibited |
 
 ## GPU Mutex for Multi-Instance Debugging