# Agent Result Format Rules

## Purpose

Minimize token usage when background agents return results to the main agent. Raw program output is verbose and wastes context-window space.

## 1. Result Formatting Principle

Agents MUST return structured summaries instead of raw output.
| Don't | Do |
|---|---|
| Full program stdout/stderr | Key metrics only |
| Debug logs | Pass/Fail status |
| Verbose error stacks | Error summary + location |
## 2. Standard Result Templates

### 2.1 Test Results (RULER, Unit Tests, etc.)

```markdown
## Test Results: [Task Name]

**Pass Rate**: X / Y (Z%)

### Failed Samples (if any)
| Sample | Expected | Got |
|--------|----------|-----|
| N | expected_value | actual_value |

### Passed Samples
[List sample IDs or "All N samples passed"]
```

Example (instead of raw test output):

```markdown
## Test Results: niah_single_1 (Samples 0-49)

**Pass Rate**: 50 / 50 (100%)

### Passed Samples
All 50 samples passed.
```
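The 2.1 template can be rendered mechanically from test outcomes. A minimal sketch, where the function name `format_test_results` and the shape of the `results` argument are illustrative assumptions, not a fixed API:

```python
def format_test_results(task: str, results: dict[int, tuple[str, str, bool]]) -> str:
    """Render test outcomes as the structured summary template.

    `results` maps sample ID -> (expected, got, passed).
    """
    passed = [i for i, (_, _, ok) in results.items() if ok]
    failed = [(i, exp, got) for i, (exp, got, ok) in results.items() if not ok]
    total = len(results)
    lines = [
        f"## Test Results: {task}",
        f"**Pass Rate**: {len(passed)} / {total} ({100 * len(passed) // total}%)",
    ]
    if failed:
        # Only failures get per-sample detail; successes are collapsed.
        lines += ["### Failed Samples",
                  "| Sample | Expected | Got |",
                  "|--------|----------|-----|"]
        lines += [f"| {i} | {exp} | {got} |" for i, exp, got in failed]
    lines.append("### Passed Samples")
    lines.append(f"All {total} samples passed." if not failed
                 else ", ".join(map(str, passed)))
    return "\n".join(lines)
```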
### 2.2 Benchmark Results

```markdown
## Benchmark Results: [Task Name]

| Metric | Value |
|--------|-------|
| Throughput | X tok/s |
| Latency (p50) | Y ms |
| Latency (p99) | Z ms |
| Memory Peak | W GB |
```
### 2.3 Build/Compile Results

```markdown
## Build Results: [Target]

**Status**: SUCCESS / FAILED

### Errors (if any)
| File | Line | Error |
|------|------|-------|
| path/to/file.py | 123 | error message |
```
### 2.4 Investigation/Research Results

```markdown
## Investigation: [Topic]

### Findings
1. Finding 1 (with file:line reference)
2. Finding 2

### Relevant Files
- path/to/file1.py: description
- path/to/file2.py: description

### Conclusion
[1-2 sentence summary]
```
## 3. Mandatory Fields by Task Type
| Task Type | Required Fields |
|---|---|
| Test Run | Pass/Fail count, failed sample details |
| Benchmark | Key metrics (throughput, latency, memory) |
| Build | Status, error locations |
| Search | File paths, line numbers, brief context |
| Verification | Before/After comparison, conclusion |
## 4. What to EXCLUDE

Results MUST NOT include:
| Exclude | Reason |
|---|---|
| Full stack traces | Extract error type + location only |
| Model loading logs | Not relevant to result |
| Progress bars / tqdm output | Noise |
| Warnings (unless critical) | Noise |
| Repeated successful outputs | "All X passed" is sufficient |
| Timestamps | Usually not needed |
| Device info (unless debugging hardware) | Noise |
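The exclusions above can be enforced mechanically before a result is returned. A minimal sketch; the regex patterns are illustrative assumptions for each noise category and will vary by toolchain:

```python
import re

# One pattern per noise category from the exclusion table above.
# These exact regexes are examples, not a specification.
NOISE_PATTERNS = [
    re.compile(r"^\s*\d+%\|"),              # tqdm-style progress bars
    re.compile(r"(?i)^(loading|loaded) "),  # model loading logs
    re.compile(r"(?i)^warning\b"),          # non-critical warnings
    re.compile(r"^\d{4}-\d{2}-\d{2}[ T]"),  # leading timestamps
]

def strip_noise(raw: str) -> str:
    """Drop lines matching any noise pattern; keep everything else."""
    kept = [line for line in raw.splitlines()
            if not any(p.search(line) for p in NOISE_PATTERNS)]
    return "\n".join(kept)
```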
## 5. Agent Prompt Template

When spawning background agents, include this instruction:

```
When reporting results, use a structured summary format:
- For tests: Pass rate, failed sample details (expected vs actual)
- For benchmarks: Key metrics table
- Do NOT include raw program output, logs, or verbose debug info
- Focus on actionable information only
```
## 6. Main Agent Instructions

When spawning a background agent for testing:

**Before (verbose):**

```
Run tests for samples 0-49 and report the output.
```

**After (structured):**

```
Run tests for samples 0-49. Report results as:
- Total pass/fail count
- For each failure: sample ID, expected value, actual value
- Do NOT include raw program output or logs
```
## 7. Examples

**Bad** (wastes ~500 tokens):

```
The test output was:
Loading model from ~/models/Llama-3.1-8B-Instruct...
Model loaded in 12.3s
[niah_single_1] Sample 0: PASS | Expected: 1234567 | Got: : 1234567.<|eot_id|>
[niah_single_1] Sample 1: PASS | Expected: 2345678 | Got: : 2345678.<|eot_id|>
... (50 more lines) ...
```

**Good** (uses ~50 tokens):

```markdown
## Test Results: niah_single_1 (Samples 0-49)

**Pass Rate**: 50 / 50 (100%)

All samples passed.
```
## 8. Token Savings Estimate
| Result Type | Raw Output | Structured | Savings |
|---|---|---|---|
| 50-sample test | ~1000 tokens | ~100 tokens | 90% |
| Benchmark run | ~500 tokens | ~80 tokens | 84% |
| Build failure | ~2000 tokens | ~200 tokens | 90% |
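Estimates like those above can be reproduced with the common rough heuristic of about 4 characters per token; `estimate_tokens` and `savings` are illustrative helpers, not a real tokenizer:

```python
def estimate_tokens(text: str) -> int:
    """Rough token count using the ~4 characters/token heuristic."""
    return max(1, len(text) // 4)

def savings(raw: str, structured: str) -> float:
    """Fraction of tokens saved by returning the structured summary."""
    return 1 - estimate_tokens(structured) / estimate_tokens(raw)
```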
## 9. Integration

This rule applies when:

- Spawning agents via the Task tool
- Running background commands
- Processing results from completed agents

Combine with multi-gpu-debugging.md for efficient parallel testing workflows.