# Agent Result Format Rules
## Purpose

Minimize token usage when background agents return results to the main agent. Raw program output is verbose and wastes context window space.

---
## 1. Result Formatting Principle
**MUST** return **structured summaries** instead of raw output.

| Don't | Do |
|-------|-----|
| Full program stdout/stderr | Key metrics only |
| Debug logs | Pass/Fail status |
| Verbose error stacks | Error summary + location |

---
## 2. Standard Result Templates
### 2.1 Test Results (RULER, Unit Tests, etc.)

```markdown
## Test Results: [Task Name]

**Pass Rate**: X / Y (Z%)

### Failed Samples (if any)
| Sample | Expected | Got |
|--------|----------|-----|
| N | expected_value | actual_value |

### Passed Samples
[List sample IDs or "All N samples passed"]
```

**Example** (instead of raw test output):

```markdown
## Test Results: niah_single_1 (Samples 0-49)

**Pass Rate**: 50 / 50 (100%)

### Passed Samples
All 50 samples passed.
```
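As a sketch of how a wrapper script might produce this template mechanically, the function below collapses per-sample output into the 2.1 format. The `Sample N: PASS | Expected: X | Got: Y` line shape is an assumption for illustration, not a format agents are guaranteed to emit.

```python
import re

# Matches per-sample result lines of the (assumed) form:
#   [task] Sample N: PASS | Expected: X | Got: Y
RESULT_LINE = re.compile(r"Sample (\d+): (PASS|FAIL) \| Expected: (.+?) \| Got: (.+)")

def summarize_test_output(raw: str) -> str:
    """Collapse verbose per-sample output into the section 2.1 summary."""
    passed, failed = [], []
    for line in raw.splitlines():
        m = RESULT_LINE.search(line)
        if not m:
            continue  # drops loading logs, progress bars, and other noise
        sample, status, expected, got = m.groups()
        (passed if status == "PASS" else failed).append((sample, expected, got))
    total = len(passed) + len(failed)
    pct = 100 * len(passed) // total if total else 0
    out = [f"**Pass Rate**: {len(passed)} / {total} ({pct}%)"]
    if failed:
        out += ["", "### Failed Samples",
                "| Sample | Expected | Got |", "|--------|----------|-----|"]
        out += [f"| {s} | {e} | {g} |" for s, e, g in failed]
    else:
        out += ["", f"All {total} samples passed."]
    return "\n".join(out)
```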
### 2.2 Benchmark Results

```markdown
## Benchmark Results: [Task Name]

| Metric | Value |
|--------|-------|
| Throughput | X tok/s |
| Latency (p50) | Y ms |
| Latency (p99) | Z ms |
| Memory Peak | W GB |
```
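When filling in the latency rows, note that p50/p99 values depend on the percentile convention; a minimal nearest-rank sketch (one common choice among several, with hypothetical latencies):

```python
def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile over a non-empty sample (no interpolation)."""
    ordered = sorted(values)
    k = round(p / 100 * (len(ordered) - 1))
    return ordered[min(max(k, 0), len(ordered) - 1)]

# Hypothetical per-request latencies from a benchmark run.
latencies_ms = [11, 12, 12.5, 13, 13.5, 14, 14.5, 15, 16, 17, 90]
print(f"| Latency (p50) | {percentile(latencies_ms, 50)} ms |")
print(f"| Latency (p99) | {percentile(latencies_ms, 99)} ms |")
```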
### 2.3 Build/Compile Results

```markdown
## Build Results: [Target]

**Status**: SUCCESS / FAILED

### Errors (if any)
| File | Line | Error |
|------|------|-------|
| path/to/file.py | 123 | error message |
```
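A background agent could derive this table mechanically from compiler output; the sketch below assumes gcc/clang-style `file:line:col: error: message` diagnostics and ignores everything else.

```python
import re

# gcc/clang-style diagnostics: file:line[:col]: error: message
ERROR_RE = re.compile(r"^(.+?):(\d+)(?::\d+)?: error: (.+)$", re.MULTILINE)

def build_report(stderr: str) -> str:
    """Reduce raw compiler stderr to the section 2.3 Build Results table."""
    rows = ERROR_RE.findall(stderr)
    if not rows:
        return "**Status**: SUCCESS"
    lines = ["**Status**: FAILED", "", "### Errors",
             "| File | Line | Error |", "|------|------|-------|"]
    lines += [f"| {f} | {n} | {msg} |" for f, n, msg in rows]
    return "\n".join(lines)
```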
### 2.4 Investigation/Research Results

```markdown
## Investigation: [Topic]

### Findings
1. Finding 1 (with file:line reference)
2. Finding 2

### Relevant Files
- path/to/file1.py: description
- path/to/file2.py: description

### Conclusion
[1-2 sentence summary]
```

---
## 3. Mandatory Fields by Task Type
| Task Type | Required Fields |
|-----------|-----------------|
| Test Run | Pass/Fail count, failed sample details |
| Benchmark | Key metrics (throughput, latency, memory) |
| Build | Status, error locations |
| Search | File paths, line numbers, brief context |
| Verification | Before/After comparison, conclusion |
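A main agent can enforce this table with a trivial check before accepting a report. The field names below mirror the table, and the substring match is a deliberately loose sketch, not a full parser.

```python
# Required field names per task type (mirrors the table above).
REQUIRED_FIELDS = {
    "test": ["Pass Rate"],
    "benchmark": ["Throughput", "Latency"],
    "build": ["Status"],
}

def missing_fields(task_type: str, report: str) -> list[str]:
    """Return required fields absent from a report (substring check only)."""
    return [f for f in REQUIRED_FIELDS.get(task_type, []) if f not in report]
```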
---
## 4. What to EXCLUDE
**MUST NOT** include in results:

| Exclude | Reason |
|---------|--------|
| Full stack traces | Extract error type + location only |
| Model loading logs | Not relevant to result |
| Progress bars / tqdm output | Noise |
| Warnings (unless critical) | Noise |
| Repeated successful outputs | "All X passed" is sufficient |
| Timestamps | Usually not needed |
| Device info (unless debugging hardware) | Noise |
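These exclusions can be approximated with a simple line filter; the patterns below are assumed heuristics for common tqdm/log line shapes, not an exhaustive list.

```python
import re

# Heuristic patterns for known-noise lines (assumptions, not exhaustive).
NOISE_PATTERNS = [
    re.compile(r"\d+%\|"),                  # tqdm-style progress bars
    re.compile(r"^(Loading|Loaded) "),      # model loading logs
    re.compile(r"warning", re.IGNORECASE),  # non-critical warnings
    re.compile(r"^\d{4}-\d{2}-\d{2}[ T]"),  # ISO-style timestamps
]

def strip_noise(raw: str) -> str:
    """Drop lines matching any known-noise pattern."""
    return "\n".join(
        line for line in raw.splitlines()
        if not any(p.search(line) for p in NOISE_PATTERNS)
    )
```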
---
## 5. Agent Prompt Template
When spawning background agents, include this instruction:

```
When reporting results, use a structured summary format:
- For tests: Pass rate, failed sample details (expected vs actual)
- For benchmarks: Key metrics table
- Do NOT include raw program output, logs, or verbose debug info
- Focus on actionable information only
```
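One way to make this automatic is to append the instruction to every task prompt at spawn time; `build_agent_prompt` below is a hypothetical helper for illustration, not an existing API.

```python
# The standard instruction from section 5, kept as a single constant.
RESULT_FORMAT_INSTRUCTION = """\
When reporting results, use a structured summary format:
- For tests: Pass rate, failed sample details (expected vs actual)
- For benchmarks: Key metrics table
- Do NOT include raw program output, logs, or verbose debug info
- Focus on actionable information only"""

def build_agent_prompt(task: str) -> str:
    """Append the standard result-format instruction to a task prompt."""
    return f"{task}\n\n{RESULT_FORMAT_INSTRUCTION}"
```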
---
## 6. Main Agent Instructions
When spawning a background agent for testing:

**Before** (verbose):
```
Run tests for samples 0-49 and report the output.
```

**After** (structured):
```
Run tests for samples 0-49. Report results as:
- Total pass/fail count
- For each failure: sample ID, expected value, actual value
- Do NOT include raw program output or logs
```
---
## 7. Examples
### Bad (wastes ~500 tokens):
```
The test output was:
Loading model from ~/models/Llama-3.1-8B-Instruct...
Model loaded in 12.3s
[niah_single_1] Sample 0: PASS | Expected: 1234567 | Got: : 1234567.<|eot_id|>
[niah_single_1] Sample 1: PASS | Expected: 2345678 | Got: : 2345678.<|eot_id|>
... (50 more lines) ...
```

### Good (uses ~50 tokens):
```
## Test Results: niah_single_1 (Samples 0-49)

**Pass Rate**: 50 / 50 (100%)

All samples passed.
```
---
## 8. Token Savings Estimate
| Result Type | Raw Output | Structured | Savings |
|-------------|------------|------------|---------|
| 50-sample test | ~1000 tokens | ~100 tokens | 90% |
| Benchmark run | ~500 tokens | ~80 tokens | 84% |
| Build failure | ~2000 tokens | ~200 tokens | 90% |
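These figures can be sanity-checked with the rough ~4-characters-per-token heuristic (an approximation; real tokenizer counts vary):

```python
def approx_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token."""
    return max(1, len(text) // 4)

# Synthetic 50-sample raw output vs. its structured summary.
raw = "\n".join(
    f"[niah_single_1] Sample {i}: PASS | Expected: {i} | Got: {i}"
    for i in range(50)
)
summary = (
    "## Test Results: niah_single_1 (Samples 0-49)\n"
    "**Pass Rate**: 50 / 50 (100%)\n"
    "All 50 samples passed."
)
savings = 1 - approx_tokens(summary) / approx_tokens(raw)
print(f"Savings: {savings:.0%}")
```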
---
## 9. Integration
This rule should be applied when:

1. Spawning agents via the Task tool
2. Running background commands
3. Processing results from completed agents

Combine with `multi-gpu-debugging.md` for efficient parallel testing workflows.