# Agent Result Format Rules

## Purpose

Minimize token usage when background agents return results to the main agent. Raw program output is verbose and wastes context-window space.

## 1. Result Formatting Principle

Agents MUST return structured summaries instead of raw output.
| Don't | Do |
|---|---|
| Full program stdout/stderr | Key metrics only |
| Debug logs | Pass/Fail status |
| Verbose error stacks | Error summary + location |
## 2. Standard Result Templates

### 2.1 Test Results (RULER, Unit Tests, etc.)

```markdown
## Test Results: [Task Name]

**Pass Rate**: X / Y (Z%)

### Failed Samples (if any)
| Sample | Expected | Got |
|--------|----------|-----|
| N | expected_value | actual_value |

### Passed Samples
[List sample IDs or "All N samples passed"]
```

Example (instead of raw test output):

```markdown
## Test Results: niah_single_1 (Samples 0-49)

**Pass Rate**: 50 / 50 (100%)

### Passed Samples
All 50 samples passed.
```
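The 2.1 template can be rendered mechanically from test outcomes. A minimal sketch, where the function name `format_test_results` and the shape of the `results` argument are illustrative assumptions, not a fixed API:

```python
def format_test_results(task: str, results: dict[int, tuple[str, str, bool]]) -> str:
    """Render test outcomes as the structured summary template.

    `results` maps sample ID -> (expected, got, passed).
    """
    passed = [i for i, (_, _, ok) in results.items() if ok]
    failed = [(i, exp, got) for i, (exp, got, ok) in results.items() if not ok]
    total = len(results)
    lines = [
        f"## Test Results: {task}",
        f"**Pass Rate**: {len(passed)} / {total} ({100 * len(passed) // total}%)",
    ]
    if failed:
        # Only failures get per-sample detail; successes are collapsed.
        lines += ["### Failed Samples",
                  "| Sample | Expected | Got |",
                  "|--------|----------|-----|"]
        lines += [f"| {i} | {exp} | {got} |" for i, exp, got in failed]
    lines.append("### Passed Samples")
    lines.append(f"All {total} samples passed." if not failed
                 else ", ".join(map(str, passed)))
    return "\n".join(lines)
```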
### 2.2 Benchmark Results

```markdown
## Benchmark Results: [Task Name]

| Metric | Value |
|--------|-------|
| Throughput | X tok/s |
| Latency (p50) | Y ms |
| Latency (p99) | Z ms |
| Memory Peak | W GB |
```
### 2.3 Build/Compile Results

```markdown
## Build Results: [Target]

**Status**: SUCCESS / FAILED

### Errors (if any)
| File | Line | Error |
|------|------|-------|
| path/to/file.py | 123 | error message |
```
### 2.4 Investigation/Research Results

```markdown
## Investigation: [Topic]

### Findings
1. Finding 1 (with file:line reference)
2. Finding 2

### Relevant Files
- path/to/file1.py: description
- path/to/file2.py: description

### Conclusion
[1-2 sentence summary]
```
## 3. Mandatory Fields by Task Type
| Task Type | Required Fields |
|---|---|
| Test Run | Pass/Fail count, failed sample details |
| Benchmark | Key metrics (throughput, latency, memory) |
| Build | Status, error locations |
| Search | File paths, line numbers, brief context |
| Verification | Before/After comparison, conclusion |
## 4. What to EXCLUDE

Results MUST NOT include:
| Exclude | Reason |
|---|---|
| Full stack traces | Extract error type + location only |
| Model loading logs | Not relevant to result |
| Progress bars / tqdm output | Noise |
| Warnings (unless critical) | Noise |
| Repeated successful outputs | "All X passed" is sufficient |
| Timestamps | Usually not needed |
| Device info (unless debugging hardware) | Noise |
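The exclusions above can be enforced mechanically before a result is returned. A minimal sketch; the regex patterns are illustrative assumptions for each noise category and will vary by toolchain:

```python
import re

# One pattern per noise category from the exclusion table above.
# These exact regexes are examples, not a specification.
NOISE_PATTERNS = [
    re.compile(r"^\s*\d+%\|"),              # tqdm-style progress bars
    re.compile(r"(?i)^(loading|loaded) "),  # model loading logs
    re.compile(r"(?i)^warning\b"),          # non-critical warnings
    re.compile(r"^\d{4}-\d{2}-\d{2}[ T]"),  # leading timestamps
]

def strip_noise(raw: str) -> str:
    """Drop lines matching any noise pattern; keep everything else."""
    kept = [line for line in raw.splitlines()
            if not any(p.search(line) for p in NOISE_PATTERNS)]
    return "\n".join(kept)
```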
## 5. Agent Prompt Template

When spawning background agents, include this instruction:

```
When reporting results, use a structured summary format:
- For tests: Pass rate, failed sample details (expected vs actual)
- For benchmarks: Key metrics table
- Do NOT include raw program output, logs, or verbose debug info
- Focus on actionable information only
```
## 6. Main Agent Instructions

When spawning a background agent for testing:

**Before (verbose):**

```
Run tests for samples 0-49 and report the output.
```

**After (structured):**

```
Run tests for samples 0-49. Report results as:
- Total pass/fail count
- For each failure: sample ID, expected value, actual value
- Do NOT include raw program output or logs
```
## 7. Examples

**Bad** (wastes ~500 tokens):

```
The test output was:
Loading model from ~/models/Llama-3.1-8B-Instruct...
Model loaded in 12.3s
[niah_single_1] Sample 0: PASS | Expected: 1234567 | Got: : 1234567.<|eot_id|>
[niah_single_1] Sample 1: PASS | Expected: 2345678 | Got: : 2345678.<|eot_id|>
... (50 more lines) ...
```

**Good** (uses ~50 tokens):

```markdown
## Test Results: niah_single_1 (Samples 0-49)

**Pass Rate**: 50 / 50 (100%)

All samples passed.
```
## 8. Token Savings Estimate
| Result Type | Raw Output | Structured | Savings |
|---|---|---|---|
| 50-sample test | ~1000 tokens | ~100 tokens | 90% |
| Benchmark run | ~500 tokens | ~80 tokens | 84% |
| Build failure | ~2000 tokens | ~200 tokens | 90% |
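Estimates like those above can be reproduced with the common rough heuristic of about 4 characters per token; `estimate_tokens` and `savings` are illustrative helpers, not a real tokenizer:

```python
def estimate_tokens(text: str) -> int:
    """Rough token count using the ~4 characters/token heuristic."""
    return max(1, len(text) // 4)

def savings(raw: str, structured: str) -> float:
    """Fraction of tokens saved by returning the structured summary."""
    return 1 - estimate_tokens(structured) / estimate_tokens(raw)
```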
## 9. Integration

This rule applies when:

- Spawning agents via the Task tool
- Running background commands
- Processing results from completed agents

Combine with multi-gpu-debugging.md for efficient parallel testing workflows.