From 78a44f35365f94eefa4b1d721f90c3cd8519d231 Mon Sep 17 00:00:00 2001
From: Zijie Tian <zijietian@mail.xmu.edu.cn>
Date: Sat, 24 Jan 2026 01:41:25 +0800
Subject: [PATCH] =?UTF-8?q?=F0=9F=93=9D=20docs:=20add=20GPU=20memory=20mon?=
 =?UTF-8?q?itoring=20rule?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Add .claude/rules/gpu-monitor.md requiring gpu-monitor agent for all GPU memory monitoring tasks
- Update CLAUDE.md rules index with reference to new rule

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
---
 .claude/rules/gpu-monitor.md | 74 ++++++++++++++++++++++++++++++++++++
 CLAUDE.md                    |  1 +
 2 files changed, 75 insertions(+)
 create mode 100644 .claude/rules/gpu-monitor.md

diff --git a/.claude/rules/gpu-monitor.md b/.claude/rules/gpu-monitor.md
new file mode 100644
index 0000000..1abe5f8
--- /dev/null
+++ b/.claude/rules/gpu-monitor.md
@@ -0,0 +1,74 @@
+# GPU Memory Monitoring Rule
+
+## Mandatory Rule
+
+**All GPU memory monitoring tasks must use the `gpu-monitor` agent.** The following approaches are prohibited:
+
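+| ❌ Prohibited | Reason |
+|--------|------|
+| `nvidia-smi` loop + sleep | Blocks the main agent; cannot run in parallel |
+| Background bash monitoring script | Hard to manage; output gets interleaved |
+| Manual polling | Inefficient; consumes context |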
+
+## Usage
+
+```python
+# Start GPU monitoring (runs in the background)
+Task(
+    subagent_type="gpu-monitor",
+    prompt="Monitor GPU 0 with 0.5 second interval",
+    run_in_background=True
+)
+```
+
+## Parameters
+
+| Parameter | Description | Example |
+|------|------|------|
+| GPU ID | GPU(s) to monitor | `GPU 0`, `GPU 0,1` |
+| interval | Sampling interval | `0.5 second`, `1 second` |
+| Purpose | Reason for monitoring | `for RULER benchmark test` |
+
+## Typical Usage
+
+### 1. Single-GPU benchmarking
+```
+Monitor GPU 0 with 1 second interval for benchmark profiling
+```
+
+### 2. Debugging OOM
+```
+Monitor GPU 0 with 0.5 second interval to track memory peak during inference
+```
+
+### 3. Multi-GPU training
+```
+Monitor GPU 0,1,2,3 with 2 second interval during training
+```
+
+## Retrieving Results
+
+Monitoring results are written automatically to the task's `output_file`. Read them as follows:
+
+```bash
+# View the latest output
+tail -50 /tmp/claude/.../tasks/<agent_id>.output
+
+# Find peak values
+grep -i "peak\|max" /tmp/claude/.../tasks/<agent_id>.output
+```
+
+## Running in Parallel with Tests
+
+`gpu-monitor` runs in the background and does not block the tests:
+
+```python
+# 1. Start monitoring (background)
+Task(subagent_type="gpu-monitor", ..., run_in_background=True)
+
+# 2. Run the test (foreground)
+Bash("python tests/test_ruler.py ...")
+
+# 3. Check the monitoring results after the test finishes
+Bash("tail -50 <output_file>")
+```
diff --git a/CLAUDE.md b/CLAUDE.md
index 70c7d06..bf08c7b 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -34,6 +34,7 @@ Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline L
 | [`.claude/rules/gpu-testing.md`](.claude/rules/gpu-testing.md) | GPU type detection, card assignment, needle test requirements |
 | [`.claude/rules/sparse-policy.md`](.claude/rules/sparse-policy.md) | SparsePolicy implementation requirements |
 | [`.claude/rules/planning-with-files.md`](.claude/rules/planning-with-files.md) | Planning file management for complex tasks |
+| [`.claude/rules/gpu-monitor.md`](.claude/rules/gpu-monitor.md) | **GPU memory monitoring**: must use the gpu-monitor agent; manual nvidia-smi loops are prohibited |
 
 ## GPU Mutex for Multi-Instance Debugging