🔧 chore: add Claude rules for agent result format and multi-GPU debugging
- Add agent-result-format.md: standardize output formats for background agents
- Add multi-gpu-debugging.md: guidelines for parallel GPU testing workflows
- Update CLAUDE.md: add documentation index entry for chunked offload issue

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
.claude/rules/agent-result-format.md (new file, 195 lines)
# Agent Result Format Rules

## Purpose

Minimize token usage when background agents return results to the main agent. Raw program output is verbose and wastes context window space.

---

## 1. Result Formatting Principle

**MUST** return **structured summaries** instead of raw output.

| Don't | Do |
|-------|-----|
| Full program stdout/stderr | Key metrics only |
| Debug logs | Pass/Fail status |
| Verbose error stacks | Error summary + location |

---

## 2. Standard Result Templates

### 2.1 Test Results (RULER, Unit Tests, etc.)

```markdown
## Test Results: [Task Name]

**Pass Rate**: X / Y (Z%)

### Failed Samples (if any)
| Sample | Expected | Got |
|--------|----------|-----|
| N | expected_value | actual_value |

### Passed Samples
[List sample IDs or "All N samples passed"]
```

**Example** (instead of raw test output):
```markdown
## Test Results: niah_single_1 (Samples 0-49)

**Pass Rate**: 50 / 50 (100%)

### Passed Samples
All 50 samples passed.
```
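A minimal sketch of how an agent might collapse raw per-sample output into this template. The log-line shape (modeled on the raw example in section 7) and the function name are assumptions for illustration, not a fixed contract:

```python
import re

# Assumed raw log format, e.g.:
#   [niah_single_1] Sample 0: PASS | Expected: 1234567 | Got: 1234567
LINE_RE = re.compile(
    r"\[(?P<task>[^\]]+)\] Sample (?P<idx>\d+): (?P<status>PASS|FAIL) "
    r"\| Expected: (?P<expected>\S+) \| Got: (?P<got>.+)"
)

def summarize_test_output(raw: str) -> str:
    """Collapse raw per-sample test lines into the structured summary."""
    rows = [m.groupdict() for line in raw.splitlines()
            if (m := LINE_RE.match(line.strip()))]
    if not rows:
        return "No recognizable test lines found."
    failed = [r for r in rows if r["status"] == "FAIL"]
    passed = len(rows) - len(failed)
    out = [f"## Test Results: {rows[0]['task']}", "",
           f"**Pass Rate**: {passed} / {len(rows)} ({100 * passed // len(rows)}%)", ""]
    if failed:
        out += ["### Failed Samples (if any)",
                "| Sample | Expected | Got |",
                "|--------|----------|-----|"]
        out += [f"| {r['idx']} | {r['expected']} | {r['got']} |" for r in failed]
    else:
        out += ["### Passed Samples", f"All {len(rows)} samples passed."]
    return "\n".join(out)
```

Non-matching lines (model loading logs, progress bars) are silently dropped, which already implements most of the exclusion rules in section 4.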

### 2.2 Benchmark Results

```markdown
## Benchmark Results: [Task Name]

| Metric | Value |
|--------|-------|
| Throughput | X tok/s |
| Latency (p50) | Y ms |
| Latency (p99) | Z ms |
| Memory Peak | W GB |
```
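As a sketch, this table can be emitted from a plain dict of metric names to pre-formatted values; the helper name and signature are illustrative, not part of any existing API:

```python
def benchmark_table(task: str, metrics: dict[str, str]) -> str:
    """Render a metrics dict as the markdown benchmark template."""
    lines = [f"## Benchmark Results: {task}", "",
             "| Metric | Value |", "|--------|-------|"]
    # Preserve insertion order so the table reads in the intended sequence.
    lines += [f"| {name} | {value} |" for name, value in metrics.items()]
    return "\n".join(lines)
```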

### 2.3 Build/Compile Results

```markdown
## Build Results: [Target]

**Status**: SUCCESS / FAILED

### Errors (if any)
| File | Line | Error |
|------|------|-------|
| path/to/file.py | 123 | error message |
```

### 2.4 Investigation/Research Results

```markdown
## Investigation: [Topic]

### Findings
1. Finding 1 (with file:line reference)
2. Finding 2

### Relevant Files
- path/to/file1.py: description
- path/to/file2.py: description

### Conclusion
[1-2 sentence summary]
```

---

## 3. Mandatory Fields by Task Type

| Task Type | Required Fields |
|-----------|-----------------|
| Test Run | Pass/Fail count, failed sample details |
| Benchmark | Key metrics (throughput, latency, memory) |
| Build | Status, error locations |
| Search | File paths, line numbers, brief context |
| Verification | Before/After comparison, conclusion |
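One way to enforce this table is a small check that a returned summary mentions each required field. The keyword mapping below is a loose, hypothetical reading of the table, not an exhaustive schema:

```python
# Hypothetical mapping from task type to keywords a summary must mention.
REQUIRED_FIELDS = {
    "test": ["Pass Rate"],
    "benchmark": ["Throughput", "Latency", "Memory"],
    "build": ["Status"],
}

def missing_fields(task_type: str, summary: str) -> list[str]:
    """Return the required keywords absent from a result summary."""
    return [f for f in REQUIRED_FIELDS.get(task_type, []) if f not in summary]
```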

---

## 4. What to EXCLUDE

**MUST NOT** include in results:

| Exclude | Reason |
|---------|--------|
| Full stack traces | Extract error type + location only |
| Model loading logs | Not relevant to result |
| Progress bars / tqdm output | Noise |
| Warnings (unless critical) | Noise |
| Repeated successful outputs | "All X passed" is sufficient |
| Timestamps | Usually not needed |
| Device info (unless debugging hardware) | Noise |
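Much of this table can be applied mechanically. A hedged sketch, where the regex patterns are guesses at common log shapes rather than a fixed specification:

```python
import re

NOISE_PATTERNS = [
    re.compile(r"^\s*\d+%\|"),                               # tqdm progress bars
    re.compile(r"warning", re.IGNORECASE),                   # non-critical warnings
    re.compile(r"^Loading model"),                           # model loading logs
    re.compile(r"^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}"),  # leading timestamps
]

def strip_noise(raw: str) -> str:
    """Drop lines matching any known-noise pattern."""
    return "\n".join(line for line in raw.splitlines()
                     if not any(p.search(line) for p in NOISE_PATTERNS))
```

A real deployment would tune these patterns to the actual tools in use; blanket warning suppression in particular should be relaxed when a warning is the thing being debugged.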

---

## 5. Agent Prompt Template

When spawning background agents, include this instruction:

```
When reporting results, use a structured summary format:
- For tests: Pass rate, failed sample details (expected vs actual)
- For benchmarks: Key metrics table
- Do NOT include raw program output, logs, or verbose debug info
- Focus on actionable information only
```

---

## 6. Main Agent Instructions

When spawning a background agent for testing:

**Before** (verbose):
```
Run tests for samples 0-49 and report the output.
```

**After** (structured):
```
Run tests for samples 0-49. Report results as:
- Total pass/fail count
- For each failure: sample ID, expected value, actual value
- Do NOT include raw program output or logs
```

---

## 7. Examples

### Bad (Wastes ~500 tokens):
```
The test output was:
Loading model from ~/models/Llama-3.1-8B-Instruct...
Model loaded in 12.3s
[niah_single_1] Sample 0: PASS | Expected: 1234567 | Got: : 1234567.<|eot_id|>
[niah_single_1] Sample 1: PASS | Expected: 2345678 | Got: : 2345678.<|eot_id|>
... (50 more lines) ...
```

### Good (Uses ~50 tokens):
```
## Test Results: niah_single_1 (Samples 0-49)

**Pass Rate**: 50 / 50 (100%)

All samples passed.
```

---

## 8. Token Savings Estimate

| Result Type | Raw Output | Structured | Savings |
|-------------|------------|------------|---------|
| 50-sample test | ~1000 tokens | ~100 tokens | 90% |
| Benchmark run | ~500 tokens | ~80 tokens | 84% |
| Build failure | ~2000 tokens | ~200 tokens | 90% |
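These figures can be sanity-checked with a characters-per-token heuristic (~4 characters per token for English text); this is a rough approximation, not a real tokenizer:

```python
def approx_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def savings_pct(raw: str, structured: str) -> int:
    """Percentage of tokens saved by replacing raw output with a summary."""
    r, s = approx_tokens(raw), approx_tokens(structured)
    return round(100 * (r - s) / r)
```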

---

## 9. Integration

This rule applies when:
1. Spawning agents via the Task tool
2. Running background commands
3. Processing results from completed agents

Combine with `multi-gpu-debugging.md` for efficient parallel testing workflows.