🔧 chore: add Claude rules for agent result format and multi-GPU debugging

- Add agent-result-format.md: standardize output formats for background agents
- Add multi-gpu-debugging.md: guidelines for parallel GPU testing workflows
- Update CLAUDE.md: add documentation index entry for chunked offload issue

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Zijie Tian
2026-01-20 23:41:08 +08:00
parent 6180055ed8
commit 512e1e5401
3 changed files with 667 additions and 0 deletions

# Agent Result Format Rules
## Purpose
Minimize token usage when background agents return results to the main agent. Raw program output is verbose and wastes context window space.

---
## 1. Result Formatting Principle
**MUST** return **structured summaries** instead of raw output.
| Don't | Do |
|-------|-----|
| Full program stdout/stderr | Key metrics only |
| Debug logs | Pass/Fail status |
| Verbose error stacks | Error summary + location |
---
## 2. Standard Result Templates
### 2.1 Test Results (RULER, Unit Tests, etc.)
```markdown
## Test Results: [Task Name]
**Pass Rate**: X / Y (Z%)
### Failed Samples (if any)
| Sample | Expected | Got |
|--------|----------|-----|
| N | expected_value | actual_value |
### Passed Samples
[List sample IDs or "All N samples passed"]
```
**Example** (instead of raw test output):
```markdown
## Test Results: niah_single_1 (Samples 0-49)
**Pass Rate**: 50 / 50 (100%)
### Passed Samples
All 50 samples passed.
```
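As a sketch, a background agent could render this template programmatically rather than pasting raw output; the `TestResult` dataclass and `summarize_test_results` helper below are hypothetical illustrations, not part of any existing tooling:

```python
from dataclasses import dataclass

@dataclass
class TestResult:
    # One sample's outcome; fields mirror the Failed Samples table columns.
    sample_id: int
    expected: str
    actual: str

    @property
    def passed(self) -> bool:
        return self.expected == self.actual

def summarize_test_results(task: str, results: list[TestResult]) -> str:
    """Render results as the structured summary above instead of raw output."""
    failed = [r for r in results if not r.passed]
    passed_count = len(results) - len(failed)
    rate = 100 * passed_count / len(results)
    lines = [
        f"## Test Results: {task}",
        f"**Pass Rate**: {passed_count} / {len(results)} ({rate:.0f}%)",
    ]
    if failed:
        lines += [
            "### Failed Samples",
            "| Sample | Expected | Got |",
            "|--------|----------|-----|",
        ]
        lines += [f"| {r.sample_id} | {r.expected} | {r.actual} |" for r in failed]
    else:
        lines += ["### Passed Samples", f"All {len(results)} samples passed."]
    return "\n".join(lines)
```

The key property is that the summary's size is bounded by the number of *failures*, not the number of samples, so an all-pass run costs a constant handful of tokens.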
### 2.2 Benchmark Results
```markdown
## Benchmark Results: [Task Name]
| Metric | Value |
|--------|-------|
| Throughput | X tok/s |
| Latency (p50) | Y ms |
| Latency (p99) | Z ms |
| Memory Peak | W GB |
```
### 2.3 Build/Compile Results
```markdown
## Build Results: [Target]
**Status**: SUCCESS / FAILED
### Errors (if any)
| File | Line | Error |
|------|------|-------|
| path/to/file.py | 123 | error message |
```
### 2.4 Investigation/Research Results
```markdown
## Investigation: [Topic]
### Findings
1. Finding 1 (with file:line reference)
2. Finding 2
### Relevant Files
- path/to/file1.py: description
- path/to/file2.py: description
### Conclusion
[1-2 sentence summary]
```
---
## 3. Mandatory Fields by Task Type
| Task Type | Required Fields |
|-----------|-----------------|
| Test Run | Pass/Fail count, failed sample details |
| Benchmark | Key metrics (throughput, latency, memory) |
| Build | Status, error locations |
| Search | File paths, line numbers, brief context |
| Verification | Before/After comparison, conclusion |
---
## 4. What to EXCLUDE
**MUST NOT** include in results:
| Exclude | Reason |
|---------|--------|
| Full stack traces | Extract error type + location only |
| Model loading logs | Not relevant to result |
| Progress bars / tqdm output | Noise |
| Warnings (unless critical) | Noise |
| Repeated successful outputs | "All X passed" is sufficient |
| Timestamps | Usually not needed |
| Device info (unless debugging hardware) | Noise |
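One way to enforce the exclusions above is to filter raw output before summarizing. This is a minimal sketch; the regex patterns are assumptions and should be tuned to the tools actually being run:

```python
import re

# Heuristic patterns for noise lines (assumed, not exhaustive):
NOISE_PATTERNS = [
    re.compile(r"^\s*\d{1,3}%\|"),                      # tqdm progress bars
    re.compile(r"Loading (model|checkpoint)"),          # model loading logs
    re.compile(r"^(UserWarning|FutureWarning|Warning)"),# non-critical warnings
    re.compile(r"^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}"),   # leading timestamps
]

def strip_noise(raw_output: str) -> str:
    """Drop lines matching any noise pattern before summarizing."""
    kept = [
        line for line in raw_output.splitlines()
        if not any(p.search(line) for p in NOISE_PATTERNS)
    ]
    return "\n".join(kept)
```

Filtering first keeps the later summarization step from spending effort (or tokens) on lines that would be excluded anyway.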
---
## 5. Agent Prompt Template
When spawning background agents, include this instruction:
```
When reporting results, use a structured summary format:
- For tests: Pass rate, failed sample details (expected vs actual)
- For benchmarks: Key metrics table
- Do NOT include raw program output, logs, or verbose debug info
- Focus on actionable information only
```
---
## 6. Main Agent Instructions
When spawning a background agent for testing:
**Before** (verbose):
```
Run tests for samples 0-49 and report the output.
```
**After** (structured):
```
Run tests for samples 0-49. Report results as:
- Total pass/fail count
- For each failure: sample ID, expected value, actual value
- Do NOT include raw program output or logs
```
---
## 7. Examples
### Bad (Wastes ~500 tokens):
```
The test output was:
Loading model from ~/models/Llama-3.1-8B-Instruct...
Model loaded in 12.3s
[niah_single_1] Sample 0: PASS | Expected: 1234567 | Got: : 1234567.<|eot_id|>
[niah_single_1] Sample 1: PASS | Expected: 2345678 | Got: : 2345678.<|eot_id|>
... (50 more lines) ...
```
### Good (Uses ~50 tokens):
```
## Test Results: niah_single_1 (Samples 0-49)
**Pass Rate**: 50 / 50 (100%)
All samples passed.
```
---
## 8. Token Savings Estimate
| Result Type | Raw Output | Structured | Savings |
|-------------|------------|------------|---------|
| 50-sample test | ~1000 tokens | ~100 tokens | 90% |
| Benchmark run | ~500 tokens | ~80 tokens | 84% |
| Build failure | ~2000 tokens | ~200 tokens | 90% |
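The savings column is just `1 - structured/raw`; a quick check of the table's arithmetic (token counts are the rough estimates from the table, not measurements):

```python
def token_savings(raw_tokens: int, structured_tokens: int) -> int:
    """Percentage of tokens saved by the structured format, rounded."""
    return round(100 * (1 - structured_tokens / raw_tokens))

# Rows from the table above: (raw, structured) token estimates.
assert token_savings(1000, 100) == 90   # 50-sample test
assert token_savings(500, 80) == 84     # benchmark run
assert token_savings(2000, 200) == 90   # build failure
```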
---
## 9. Integration
This rule should be applied when:
1. Spawning agents via Task tool
2. Running background commands
3. Processing results from completed agents

Combine with `multi-gpu-debugging.md` for efficient parallel testing workflows.