nano-vllm/.claude/rules/agent-result-format.md
Zijie Tian 512e1e5401 🔧 chore: add Claude rules for agent result format and multi-GPU debugging
- Add agent-result-format.md: standardize output formats for background agents
- Add multi-gpu-debugging.md: guidelines for parallel GPU testing workflows
- Update CLAUDE.md: add documentation index entry for chunked offload issue

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 23:41:08 +08:00


# Agent Result Format Rules

## Purpose

Minimize token usage when background agents return results to the main agent. Raw program output is verbose and wastes context window space.


## 1. Result Formatting Principle

**MUST** return structured summaries instead of raw output.

| Don't | Do |
|-------|----|
| Full program stdout/stderr | Key metrics only |
| Debug logs | Pass/Fail status |
| Verbose error stacks | Error summary + location |

## 2. Standard Result Templates

### 2.1 Test Results (RULER, Unit Tests, etc.)

```
## Test Results: [Task Name]

**Pass Rate**: X / Y (Z%)

### Failed Samples (if any)
| Sample | Expected | Got |
|--------|----------|-----|
| N | expected_value | actual_value |

### Passed Samples
[List sample IDs or "All N samples passed"]
```

Example (instead of raw test output):

```
## Test Results: niah_single_1 (Samples 0-49)

**Pass Rate**: 50 / 50 (100%)

### Passed Samples
All 50 samples passed.
```
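A background agent can produce this summary mechanically. The sketch below is a hypothetical helper (not part of nano-vllm) that condenses raw per-sample lines into the template above; the `[task] Sample N: PASS | Expected: X | Got: Y` line format is an assumption modeled on the example output later in this document.

```python
import re

# Assumed raw line format from the test harness (see the "Bad" example below).
LINE_RE = re.compile(
    r"\[(?P<task>[\w-]+)\] Sample (?P<idx>\d+): (?P<status>PASS|FAIL)"
    r" \| Expected: (?P<expected>\S+) \| Got: (?P<got>.+)"
)

def summarize_test_output(raw: str) -> str:
    """Condense raw per-sample test output into a structured markdown summary."""
    total, failures, task = 0, [], "unknown"
    for line in raw.splitlines():
        m = LINE_RE.search(line)
        if not m:
            continue  # skip model-loading logs, progress bars, and other noise
        total += 1
        task = m["task"]
        if m["status"] == "FAIL":
            failures.append((m["idx"], m["expected"], m["got"]))
    passed = total - len(failures)
    pct = 100 * passed / total if total else 0
    lines = [f"## Test Results: {task}", "",
             f"**Pass Rate**: {passed} / {total} ({pct:.0f}%)"]
    if failures:
        lines += ["", "### Failed Samples",
                  "| Sample | Expected | Got |", "|--------|----------|-----|"]
        lines += [f"| {i} | {e} | {g} |" for i, e, g in failures]
    else:
        lines += ["", f"All {total} samples passed."]
    return "\n".join(lines)
```

Only failures get per-sample rows; passing samples collapse into one line, which is where the token savings come from.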

### 2.2 Benchmark Results

```
## Benchmark Results: [Task Name]

| Metric | Value |
|--------|-------|
| Throughput | X tok/s |
| Latency (p50) | Y ms |
| Latency (p99) | Z ms |
| Memory Peak | W GB |
```

### 2.3 Build/Compile Results

```
## Build Results: [Target]

**Status**: SUCCESS / FAILED

### Errors (if any)
| File | Line | Error |
|------|------|-------|
| path/to/file.py | 123 | error message |
```
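Filling this template is a matter of pulling `file:line: message` locations out of the build log. A minimal sketch, assuming gcc/mypy-style `path:line: error: message` diagnostics (the pattern and the `summarize_build_log` name are illustrative, not an existing API):

```python
import re

# Assumed diagnostic format: "path/to/file.py:123: error: message"
ERROR_RE = re.compile(
    r"^(?P<file>[\w./-]+):(?P<line>\d+):\s*error:\s*(?P<msg>.+)$"
)

def summarize_build_log(log: str, target: str) -> str:
    """Reduce a build log to a status line plus an error-location table."""
    errors = [m for m in map(ERROR_RE.match, log.splitlines()) if m]
    status = "FAILED" if errors else "SUCCESS"
    out = [f"## Build Results: {target}", "", f"**Status**: {status}"]
    if errors:
        out += ["", "### Errors",
                "| File | Line | Error |", "|------|------|-------|"]
        out += [f"| {m['file']} | {m['line']} | {m['msg']} |" for m in errors]
    return "\n".join(out)
```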

### 2.4 Investigation/Research Results

```
## Investigation: [Topic]

### Findings
1. Finding 1 (with file:line reference)
2. Finding 2

### Relevant Files
- path/to/file1.py: description
- path/to/file2.py: description

### Conclusion
[1-2 sentence summary]
```

## 3. Mandatory Fields by Task Type

| Task Type | Required Fields |
|-----------|-----------------|
| Test Run | Pass/Fail count, failed sample details |
| Benchmark | Key metrics (throughput, latency, memory) |
| Build | Status, error locations |
| Search | File paths, line numbers, brief context |
| Verification | Before/After comparison, conclusion |

## 4. What to EXCLUDE

**MUST NOT** include in results:

| Exclude | Reason |
|---------|--------|
| Full stack traces | Extract error type + location only |
| Model loading logs | Not relevant to result |
| Progress bars / tqdm output | Noise |
| Warnings (unless critical) | Noise |
| Repeated successful outputs | "All X passed" is sufficient |
| Timestamps | Usually not needed |
| Device info (unless debugging hardware) | Noise |
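The exclusion list above can be applied as a pre-filter before any summarization. A rough sketch, where each regex is an assumption about what the noise looks like (tune them to the actual harness output):

```python
import re

# Illustrative noise patterns matching the exclusion table; adjust as needed.
NOISE_PATTERNS = [
    re.compile(r"^\s*\d+%\|"),                          # tqdm bars, e.g. " 42%|####"
    re.compile(r"^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}"),   # leading timestamps
    re.compile(r"warning", re.IGNORECASE),              # non-critical warnings
    re.compile(r"^Loading (model|checkpoint)\b"),       # model loading logs
]

def strip_noise(raw: str) -> str:
    """Keep only lines that match none of the noise patterns."""
    return "\n".join(
        line for line in raw.splitlines()
        if not any(p.search(line) for p in NOISE_PATTERNS)
    )
```

Critical warnings would need a carve-out (e.g. an allowlist checked before the warning pattern); this sketch drops them all.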

## 5. Agent Prompt Template

When spawning background agents, include this instruction:

```
When reporting results, use a structured summary format:
- For tests: Pass rate, failed sample details (expected vs actual)
- For benchmarks: Key metrics table
- Do NOT include raw program output, logs, or verbose debug info
- Focus on actionable information only
```

## 6. Main Agent Instructions

When spawning a background agent for testing:

**Before (verbose):**

```
Run tests for samples 0-49 and report the output.
```

**After (structured):**

```
Run tests for samples 0-49. Report results as:
- Total pass/fail count
- For each failure: sample ID, expected value, actual value
- Do NOT include raw program output or logs
```

## 7. Examples

**Bad (wastes ~500 tokens):**

```
The test output was:
Loading model from ~/models/Llama-3.1-8B-Instruct...
Model loaded in 12.3s
[niah_single_1] Sample 0: PASS | Expected: 1234567 | Got: : 1234567.<|eot_id|>
[niah_single_1] Sample 1: PASS | Expected: 2345678 | Got: : 2345678.<|eot_id|>
... (50 more lines) ...
```

**Good (uses ~50 tokens):**

```
## Test Results: niah_single_1 (Samples 0-49)

**Pass Rate**: 50 / 50 (100%)

All samples passed.
```

## 8. Token Savings Estimate

| Result Type | Raw Output | Structured | Savings |
|-------------|------------|------------|---------|
| 50-sample test | ~1000 tokens | ~100 tokens | 90% |
| Benchmark run | ~500 tokens | ~80 tokens | 84% |
| Build failure | ~2000 tokens | ~200 tokens | 90% |
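The figures above can be sanity-checked with the common rough heuristic of ~4 characters per token for English text. This is an approximation only; real counts depend on the tokenizer, so treat the percentages as order-of-magnitude estimates.

```python
def approx_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token (heuristic, not exact)."""
    return max(1, len(text) // 4)

def savings_pct(raw: str, structured: str) -> float:
    """Percentage of estimated tokens saved by the structured summary."""
    raw_t = approx_tokens(raw)
    return 100 * (raw_t - approx_tokens(structured)) / raw_t
```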

## 9. Integration

This rule should be applied when:

  1. Spawning agents via Task tool
  2. Running background commands
  3. Processing results from completed agents

Combine with `multi-gpu-debugging.md` for efficient parallel testing workflows.