# Agent Result Format Rules
## Purpose

Minimize token usage when background agents return results to the main agent. Raw program output is verbose and wastes context window space.

---
## 1. Result Formatting Principle
**MUST** return **structured summaries** instead of raw output.

| Don't | Do |
|-------|-----|
| Full program stdout/stderr | Key metrics only |
| Debug logs | Pass/Fail status |
| Verbose error stacks | Error summary + location |

---
## 2. Standard Result Templates
### 2.1 Test Results (RULER, Unit Tests, etc.)

```markdown
## Test Results: [Task Name]

**Pass Rate**: X / Y (Z%)

### Failed Samples (if any)
| Sample | Expected | Got |
|--------|----------|-----|
| N | expected_value | actual_value |

### Passed Samples
[List sample IDs or "All N samples passed"]
```

**Example** (instead of raw test output):

```markdown
## Test Results: niah_single_1 (Samples 0-49)

**Pass Rate**: 50 / 50 (100%)

### Passed Samples
All 50 samples passed.
```
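As a sketch of how a wrapper script might produce this template mechanically, the function below collapses per-sample output into the 2.1 format. The `Sample N: PASS | Expected: X | Got: Y` line shape is an assumption for illustration, not a format agents are guaranteed to emit.

```python
import re

# Matches per-sample result lines of the (assumed) form:
#   [task] Sample N: PASS | Expected: X | Got: Y
RESULT_LINE = re.compile(r"Sample (\d+): (PASS|FAIL) \| Expected: (.+?) \| Got: (.+)")

def summarize_test_output(raw: str) -> str:
    """Collapse verbose per-sample output into the section 2.1 summary."""
    passed, failed = [], []
    for line in raw.splitlines():
        m = RESULT_LINE.search(line)
        if not m:
            continue  # drops loading logs, progress bars, and other noise
        sample, status, expected, got = m.groups()
        (passed if status == "PASS" else failed).append((sample, expected, got))
    total = len(passed) + len(failed)
    pct = 100 * len(passed) // total if total else 0
    out = [f"**Pass Rate**: {len(passed)} / {total} ({pct}%)"]
    if failed:
        out += ["", "### Failed Samples",
                "| Sample | Expected | Got |", "|--------|----------|-----|"]
        out += [f"| {s} | {e} | {g} |" for s, e, g in failed]
    else:
        out += ["", f"All {total} samples passed."]
    return "\n".join(out)
```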
### 2.2 Benchmark Results

```markdown
## Benchmark Results: [Task Name]

| Metric | Value |
|--------|-------|
| Throughput | X tok/s |
| Latency (p50) | Y ms |
| Latency (p99) | Z ms |
| Memory Peak | W GB |
```
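When filling in the latency rows, note that p50/p99 values depend on the percentile convention; a minimal nearest-rank sketch (one common choice among several, with hypothetical latencies):

```python
def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile over a non-empty sample (no interpolation)."""
    ordered = sorted(values)
    k = round(p / 100 * (len(ordered) - 1))
    return ordered[min(max(k, 0), len(ordered) - 1)]

# Hypothetical per-request latencies from a benchmark run.
latencies_ms = [11, 12, 12.5, 13, 13.5, 14, 14.5, 15, 16, 17, 90]
print(f"| Latency (p50) | {percentile(latencies_ms, 50)} ms |")
print(f"| Latency (p99) | {percentile(latencies_ms, 99)} ms |")
```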
### 2.3 Build/Compile Results

```markdown
## Build Results: [Target]

**Status**: SUCCESS / FAILED

### Errors (if any)
| File | Line | Error |
|------|------|-------|
| path/to/file.py | 123 | error message |
```
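A background agent could derive this table mechanically from compiler output; the sketch below assumes gcc/clang-style `file:line:col: error: message` diagnostics and ignores everything else.

```python
import re

# gcc/clang-style diagnostics: file:line[:col]: error: message
ERROR_RE = re.compile(r"^(.+?):(\d+)(?::\d+)?: error: (.+)$", re.MULTILINE)

def build_report(stderr: str) -> str:
    """Reduce raw compiler stderr to the section 2.3 Build Results table."""
    rows = ERROR_RE.findall(stderr)
    if not rows:
        return "**Status**: SUCCESS"
    lines = ["**Status**: FAILED", "", "### Errors",
             "| File | Line | Error |", "|------|------|-------|"]
    lines += [f"| {f} | {n} | {msg} |" for f, n, msg in rows]
    return "\n".join(lines)
```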
### 2.4 Investigation/Research Results

```markdown
## Investigation: [Topic]

### Findings
1. Finding 1 (with file:line reference)
2. Finding 2

### Relevant Files
- path/to/file1.py: description
- path/to/file2.py: description

### Conclusion
[1-2 sentence summary]
```

---
## 3. Mandatory Fields by Task Type
| Task Type | Required Fields |
|-----------|-----------------|
| Test Run | Pass/Fail count, failed sample details |
| Benchmark | Key metrics (throughput, latency, memory) |
| Build | Status, error locations |
| Search | File paths, line numbers, brief context |
| Verification | Before/After comparison, conclusion |
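A main agent can enforce this table with a trivial check before accepting a report. The field names below mirror the table, and the substring match is a deliberately loose sketch, not a full parser.

```python
# Required field names per task type (mirrors the table above).
REQUIRED_FIELDS = {
    "test": ["Pass Rate"],
    "benchmark": ["Throughput", "Latency"],
    "build": ["Status"],
}

def missing_fields(task_type: str, report: str) -> list[str]:
    """Return required fields absent from a report (substring check only)."""
    return [f for f in REQUIRED_FIELDS.get(task_type, []) if f not in report]
```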
---
## 4. What to EXCLUDE
**MUST NOT** include in results:

| Exclude | Reason |
|---------|--------|
| Full stack traces | Extract error type + location only |
| Model loading logs | Not relevant to result |
| Progress bars / tqdm output | Noise |
| Warnings (unless critical) | Noise |
| Repeated successful outputs | "All X passed" is sufficient |
| Timestamps | Usually not needed |
| Device info (unless debugging hardware) | Noise |
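These exclusions can be approximated with a simple line filter; the patterns below are assumed heuristics for common tqdm/log line shapes, not an exhaustive list.

```python
import re

# Heuristic patterns for known-noise lines (assumptions, not exhaustive).
NOISE_PATTERNS = [
    re.compile(r"\d+%\|"),                  # tqdm-style progress bars
    re.compile(r"^(Loading|Loaded) "),      # model loading logs
    re.compile(r"warning", re.IGNORECASE),  # non-critical warnings
    re.compile(r"^\d{4}-\d{2}-\d{2}[ T]"),  # ISO-style timestamps
]

def strip_noise(raw: str) -> str:
    """Drop lines matching any known-noise pattern."""
    return "\n".join(
        line for line in raw.splitlines()
        if not any(p.search(line) for p in NOISE_PATTERNS)
    )
```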
---
## 5. Agent Prompt Template
When spawning background agents, include this instruction:

```
When reporting results, use a structured summary format:
- For tests: Pass rate, failed sample details (expected vs actual)
- For benchmarks: Key metrics table
- Do NOT include raw program output, logs, or verbose debug info
- Focus on actionable information only
```
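One way to make this automatic is to append the instruction to every task prompt at spawn time; `build_agent_prompt` below is a hypothetical helper for illustration, not an existing API.

```python
# The standard instruction from section 5, kept as a single constant.
RESULT_FORMAT_INSTRUCTION = """\
When reporting results, use a structured summary format:
- For tests: Pass rate, failed sample details (expected vs actual)
- For benchmarks: Key metrics table
- Do NOT include raw program output, logs, or verbose debug info
- Focus on actionable information only"""

def build_agent_prompt(task: str) -> str:
    """Append the standard result-format instruction to a task prompt."""
    return f"{task}\n\n{RESULT_FORMAT_INSTRUCTION}"
```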
---
## 6. Main Agent Instructions
When spawning a background agent for testing:

**Before** (verbose):
```
Run tests for samples 0-49 and report the output.
```

**After** (structured):
```
Run tests for samples 0-49. Report results as:
- Total pass/fail count
- For each failure: sample ID, expected value, actual value
- Do NOT include raw program output or logs
```
---
## 7. Examples
### Bad (wastes ~500 tokens):
```
The test output was:
Loading model from ~/models/Llama-3.1-8B-Instruct...
Model loaded in 12.3s
[niah_single_1] Sample 0: PASS | Expected: 1234567 | Got: : 1234567.<|eot_id|>
[niah_single_1] Sample 1: PASS | Expected: 2345678 | Got: : 2345678.<|eot_id|>
... (50 more lines) ...
```

### Good (uses ~50 tokens):
```
## Test Results: niah_single_1 (Samples 0-49)

**Pass Rate**: 50 / 50 (100%)

All samples passed.
```
---
## 8. Token Savings Estimate
| Result Type | Raw Output | Structured | Savings |
|-------------|------------|------------|---------|
| 50-sample test | ~1000 tokens | ~100 tokens | 90% |
| Benchmark run | ~500 tokens | ~80 tokens | 84% |
| Build failure | ~2000 tokens | ~200 tokens | 90% |
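These figures can be sanity-checked with the rough ~4-characters-per-token heuristic (an approximation; real tokenizer counts vary):

```python
def approx_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token."""
    return max(1, len(text) // 4)

# Synthetic 50-sample raw output vs. its structured summary.
raw = "\n".join(
    f"[niah_single_1] Sample {i}: PASS | Expected: {i} | Got: {i}"
    for i in range(50)
)
summary = (
    "## Test Results: niah_single_1 (Samples 0-49)\n"
    "**Pass Rate**: 50 / 50 (100%)\n"
    "All 50 samples passed."
)
savings = 1 - approx_tokens(summary) / approx_tokens(raw)
print(f"Savings: {savings:.0%}")
```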
---
## 9. Integration
This rule should be applied when:

1. Spawning agents via the Task tool
2. Running background commands
3. Processing results from completed agents

Combine with `multi-gpu-debugging.md` for efficient parallel testing workflows.