🔧 chore: add Claude rules for agent result format and multi-GPU debugging
- Add agent-result-format.md: standardize output formats for background agents
- Add multi-gpu-debugging.md: guidelines for parallel GPU testing workflows
- Update CLAUDE.md: add documentation index entry for chunked offload issue

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
195 .claude/rules/agent-result-format.md (new file)
@@ -0,0 +1,195 @@
# Agent Result Format Rules

## Purpose

Minimize token usage when background agents return results to the main agent. Raw program output is verbose and wastes context window space.

---
## 1. Result Formatting Principle

**MUST** return **structured summaries** instead of raw output.

| Don't | Do |
|-------|-----|
| Full program stdout/stderr | Key metrics only |
| Debug logs | Pass/Fail status |
| Verbose error stacks | Error summary + location |

---
## 2. Standard Result Templates

### 2.1 Test Results (RULER, Unit Tests, etc.)

```markdown
## Test Results: [Task Name]

**Pass Rate**: X / Y (Z%)

### Failed Samples (if any)
| Sample | Expected | Got |
|--------|----------|-----|
| N | expected_value | actual_value |

### Passed Samples
[List sample IDs or "All N samples passed"]
```

**Example** (instead of raw test output):

```markdown
## Test Results: niah_single_1 (Samples 0-49)

**Pass Rate**: 50 / 50 (100%)

### Passed Samples
All 50 samples passed.
```
### 2.2 Benchmark Results

```markdown
## Benchmark Results: [Task Name]

| Metric | Value |
|--------|-------|
| Throughput | X tok/s |
| Latency (p50) | Y ms |
| Latency (p99) | Z ms |
| Memory Peak | W GB |
```
### 2.3 Build/Compile Results

```markdown
## Build Results: [Target]

**Status**: SUCCESS / FAILED

### Errors (if any)
| File | Line | Error |
|------|------|-------|
| path/to/file.py | 123 | error message |
```
### 2.4 Investigation/Research Results

```markdown
## Investigation: [Topic]

### Findings
1. Finding 1 (with file:line reference)
2. Finding 2

### Relevant Files
- path/to/file1.py: description
- path/to/file2.py: description

### Conclusion
[1-2 sentence summary]
```
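Filling the 2.1 template can be done mechanically instead of by hand. A minimal sketch in Python — the helper name `summarize_test_results` and its `(sample_id, expected, got, passed)` tuple layout are illustrative assumptions, not part of the rules:

```python
def summarize_test_results(task_name, results):
    """Render (sample_id, expected, got, passed) tuples as the 2.1 template."""
    total = len(results)
    failed = [r for r in results if not r[3]]
    passed_n = total - len(failed)
    pct = round(100 * passed_n / total) if total else 0
    lines = [
        f"## Test Results: {task_name}",
        "",
        f"**Pass Rate**: {passed_n} / {total} ({pct}%)",
        "",
    ]
    if failed:
        # Only failures get per-sample detail; passes are collapsed.
        lines += ["### Failed Samples", "| Sample | Expected | Got |",
                  "|--------|----------|-----|"]
        lines += [f"| {i} | {exp} | {got} |" for i, exp, got, _ in failed]
    else:
        lines += ["### Passed Samples", f"All {total} samples passed."]
    return "\n".join(lines)
```

The background agent formats once, and the main agent receives a fixed-shape summary regardless of how verbose the underlying run was.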
---

## 3. Mandatory Fields by Task Type

| Task Type | Required Fields |
|-----------|-----------------|
| Test Run | Pass/Fail count, failed sample details |
| Benchmark | Key metrics (throughput, latency, memory) |
| Build | Status, error locations |
| Search | File paths, line numbers, brief context |
| Verification | Before/After comparison, conclusion |

---
## 4. What to EXCLUDE

**MUST NOT** include in results:

| Exclude | Reason |
|---------|--------|
| Full stack traces | Extract error type + location only |
| Model loading logs | Not relevant to result |
| Progress bars / tqdm output | Noise |
| Warnings (unless critical) | Noise |
| Repeated successful outputs | "All X passed" is sufficient |
| Timestamps | Usually not needed |
| Device info (unless debugging hardware) | Noise |
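One way to enforce these exclusions is a line-level filter applied to raw output before summarizing. A hedged sketch — the patterns below are assumed heuristics for common noise, not an exhaustive or official list:

```python
import re

# Assumed heuristics for the categories in the table above.
NOISE_PATTERNS = [
    re.compile(r"^\s*\d+%\|"),              # tqdm-style progress bars
    re.compile(r"^Loading model"),           # model loading logs
    re.compile(r"(?i)warning"),             # any line mentioning a warning
    re.compile(r"^\d{4}-\d{2}-\d{2}[ T]"),  # timestamped log lines
]

def strip_noise(raw_output):
    """Drop lines matching any noise pattern; keep everything else."""
    kept = [line for line in raw_output.splitlines()
            if not any(p.search(line) for p in NOISE_PATTERNS)]
    return "\n".join(kept)
```

Critical warnings would need to be whitelisted before a filter like this; the sketch deliberately errs on the side of dropping.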
---
## 5. Agent Prompt Template

When spawning background agents, include this instruction:

```
When reporting results, use a structured summary format:
- For tests: Pass rate, failed sample details (expected vs actual)
- For benchmarks: Key metrics table
- Do NOT include raw program output, logs, or verbose debug info
- Focus on actionable information only
```

---
## 6. Main Agent Instructions

When spawning a background agent for testing:

**Before** (verbose):
```
Run tests for samples 0-49 and report the output.
```

**After** (structured):
```
Run tests for samples 0-49. Report results as:
- Total pass/fail count
- For each failure: sample ID, expected value, actual value
- Do NOT include raw program output or logs
```

---
## 7. Examples

### Bad (Wastes ~500 tokens):
```
The test output was:
Loading model from ~/models/Llama-3.1-8B-Instruct...
Model loaded in 12.3s
[niah_single_1] Sample 0: PASS | Expected: 1234567 | Got: : 1234567.<|eot_id|>
[niah_single_1] Sample 1: PASS | Expected: 2345678 | Got: : 2345678.<|eot_id|>
... (50 more lines) ...
```

### Good (Uses ~50 tokens):
```
## Test Results: niah_single_1 (Samples 0-49)

**Pass Rate**: 50 / 50 (100%)

All samples passed.
```

---
## 8. Token Savings Estimate

| Result Type | Raw Output | Structured | Savings |
|-------------|------------|------------|---------|
| 50-sample test | ~1000 tokens | ~100 tokens | 90% |
| Benchmark run | ~500 tokens | ~80 tokens | 84% |
| Build failure | ~2000 tokens | ~200 tokens | 90% |

---
## 9. Integration

This rule should be applied when:
1. Spawning agents via Task tool
2. Running background commands
3. Processing results from completed agents

Combine with `multi-gpu-debugging.md` for efficient parallel testing workflows.
463 .claude/rules/multi-gpu-debugging.md (new file)

@@ -0,0 +1,463 @@
# Multi-GPU Debugging and Experimentation Rules

## Purpose

This rule governs GPU resource allocation and task execution strategy during debugging and experimentation on multi-GPU machines. The goal is to maximize debugging efficiency by:
- Running long validations on minimal GPUs (1-2)
- Using remaining GPUs for parallel hypothesis exploration
- Executing only one task/dataset for full validation during debugging

---
## 1. Scenario Classification

### 1.1 Long-Running Validation (Triggers Conservative Allocation)

A task SHALL be classified as **long-running validation** if ANY of the following conditions apply:

| Condition | Threshold |
|-----------|-----------|
| Estimated runtime | > 20 minutes |
| Sample count | > 50 samples per task |
| Full dataset execution | Any complete validation.jsonl |
| Full training/fine-tuning | Any training run |
| Large-scale inference | > 10K tokens total |

**Examples:**
- Running all 100 samples of `niah_single_1`
- Full RULER benchmark (13 tasks × 100 samples)
- Complete model evaluation on any benchmark
### 1.2 Exploratory / Fast-Iteration Work (Allows Full GPU Use)
|
||||||
|
|
||||||
|
A task SHALL be classified as **exploratory** if ALL of the following apply:
|
||||||
|
|
||||||
|
| Condition | Threshold |
|
||||||
|
|-----------|-----------|
|
||||||
|
| Estimated runtime | < 10 minutes |
|
||||||
|
| Sample count | ≤ 10 samples |
|
||||||
|
| Purpose | Sanity check, minimal reproduction, hypothesis testing |
|
||||||
|
|
||||||
|
**Examples:**
|
||||||
|
- Testing 3-5 specific error samples
|
||||||
|
- Single-batch inference for debugging
|
||||||
|
- Verifying a code fix on minimal input
|
||||||
|
- Profiling a single forward pass
|
||||||
|
|
||||||
|
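The two classifications above reduce to a small function. A sketch under the stated thresholds — the conservative fallback for the middle ground the tables leave uncovered (e.g. 15 minutes / 30 samples) is an assumption:

```python
def classify_task(runtime_minutes, sample_count, full_dataset=False):
    """Apply the 1.1 (ANY condition) and 1.2 (ALL conditions) rules."""
    # 1.1: ANY condition triggers conservative allocation.
    if runtime_minutes > 20 or sample_count > 50 or full_dataset:
        return "long-running validation"
    # 1.2: ALL conditions must hold for full GPU use.
    if runtime_minutes < 10 and sample_count <= 10:
        return "exploratory"
    # Neither table covers this range; default conservatively (assumption).
    return "long-running validation"
```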
---
## 2. GPU Allocation Strategy

### 2.1 Core Allocation Rules

| Task Type | GPU Allocation | Remaining GPUs |
|-----------|----------------|----------------|
| Long-running validation | 1 GPU (default), max 2 GPUs | Reserved for exploration |
| Exploratory work | As needed, can use multiple | - |

### 2.2 Mandatory Constraints

1. **MUST NOT** occupy all available GPUs for a single long-running validation
2. **MUST** reserve at least 50% of GPUs (minimum 2) for parallel exploration when ≥4 GPUs available
3. **MUST** select GPUs based on this priority:
   - Idle GPUs first (check with `nvidia-smi`)
   - If load info unavailable, use lowest-numbered GPUs for validation
4. **MUST** avoid resource conflicts:
   - Each task uses unique `CUDA_VISIBLE_DEVICES`
   - Each task uses unique output directories
   - Log files include GPU ID in filename
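The conflict-avoidance constraints in 2.2 (unique `CUDA_VISIBLE_DEVICES`, unique output directory, GPU ID in the log filename) can be bundled into one launch helper. A sketch; the helper name and `results/gpuN/` directory layout are illustrative assumptions:

```python
import os
import subprocess

def launch_on_gpu(gpu_id, cmd, out_root="results"):
    """Start cmd pinned to one GPU, with a unique outdir and GPU-tagged log."""
    outdir = os.path.join(out_root, f"gpu{gpu_id}")
    os.makedirs(outdir, exist_ok=True)
    # Pin the process to a single GPU via the environment.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    # GPU ID in the log filename, per constraint 4.
    log = open(os.path.join(outdir, f"run_gpu{gpu_id}.log"), "w")
    return subprocess.Popen(cmd, env=env, stdout=log, stderr=subprocess.STDOUT)
```

Two concurrent calls with different `gpu_id` values cannot collide on GPU, output directory, or log file.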
### 2.3 GPU Selection Algorithm

```
IF num_available_gpus >= 4:
    validation_gpus = 1 (or 2 if justified)
    exploration_gpus = remaining GPUs
ELSE IF num_available_gpus == 3:
    validation_gpus = 1
    exploration_gpus = 2
ELSE IF num_available_gpus == 2:
    validation_gpus = 1
    exploration_gpus = 1
ELSE:
    validation_gpus = 1
    exploration_gpus = 0 (sequential exploration)
```
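The same branching, as runnable Python (the function name and the `justify_two` flag for the "2 if justified" case are assumptions):

```python
def allocate_gpus(num_available_gpus, justify_two=False):
    """Return (validation_gpus, exploration_gpus) per the algorithm above."""
    if num_available_gpus >= 4:
        validation = 2 if justify_two else 1
        return validation, num_available_gpus - validation
    if num_available_gpus in (2, 3):
        return 1, num_available_gpus - 1
    # Single GPU: explore sequentially after validation finishes.
    return 1, 0
```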
---
## 3. Task / Dataset Selection Policy

### 3.1 Single-Task Validation Rule

During debugging, when a long-running validation is required:

- **MUST** execute only ONE task/dataset fully
- **MUST NOT** run all tasks unless explicitly requested or conditions in Section 4 are met

### 3.2 Task Selection Priority

Select the single task based on this priority order:

| Priority | Criterion | Example |
|----------|-----------|---------|
| 1 | Task most likely to reproduce the bug | If error occurs in `niah_single_1`, use that |
| 2 | Smallest task covering critical paths | `niah_single_1` (100 samples) vs `niah_multikey_3` |
| 3 | Task with known error samples | Use task with documented failure cases |
| 4 | Most representative task | Single-key before multi-key for basic validation |

### 3.3 Other Tasks Handling

Tasks not selected for full validation:
- **MAY** receive lightweight sanity checks (≤5 samples)
- **MUST NOT** receive full end-to-end execution by default
- **SHOULD** be noted in execution plan for future validation

---
## 4. Scale-Up Conditions

Expansion to more GPUs or multiple full tasks is **ALLOWED ONLY IF**:

| Condition | Justification Required |
|-----------|------------------------|
| Single-task validation completed successfully | Confirm fix works on one task first |
| Critical bug identified and fixed | Need cross-task verification |
| Cross-dataset consistency required | Clear technical justification needed |
| User explicitly requests full-scale | User override |

### 4.1 Default Behavior

- **DEFAULT**: Conservative, non-expansive
- **MUST** ask for confirmation before scaling up
- **MUST** document reason for scale-up in execution plan

---
## 5. Execution Plan Transparency

### 5.1 Mandatory Pre-Execution Output

Before starting any validation, **MUST** output an execution plan containing:

```markdown
## Execution Plan

### Task Classification
- Type: [Long-running validation / Exploratory]
- Reason: [Why classified this way]

### GPU Allocation
- Validation GPU(s): [GPU IDs]
- Reason: [Why these GPUs selected]
- Exploration GPU(s): [GPU IDs]
- Exploration tasks: [List of parallel hypotheses to test]

### Task Selection
- Full validation task: [Task name]
- Reason: [Why this task selected]
- Other tasks: [Skipped / Sanity-check only]

### Stopping Criteria
- Time limit: [X minutes]
- Success metric: [e.g., accuracy > 90%]
- Error threshold: [e.g., stop if >20 samples fail]

### Expected Output
- [What results will be produced]
```
### 5.2 Progress Checkpoints

For long-running validations, **SHOULD** report progress at:
- 25% completion
- 50% completion
- 75% completion
- Final results

---
## 6. Configuration Defaults

### 6.1 Default Parameters

| Parameter | Default Value | Description |
|-----------|---------------|-------------|
| `LONG_RUNNING_THRESHOLD_MINUTES` | 20 | Runtime threshold for classification |
| `LONG_RUNNING_SAMPLE_THRESHOLD` | 50 | Sample count threshold |
| `MAX_VALIDATION_GPUS` | 2 | Maximum GPUs for long validation |
| `MIN_EXPLORATION_GPUS` | 2 | Minimum GPUs reserved for exploration (when ≥4 available) |
| `EXPLORATION_SAMPLE_LIMIT` | 10 | Max samples for exploratory tests |
| `SANITY_CHECK_SAMPLES` | 5 | Samples for non-selected tasks |

### 6.2 User Override

Users can override defaults by specifying in their request:
- "Use all GPUs for validation"
- "Run all tasks"
- "Increase validation GPUs to N"

---
## 7. Async Monitoring (CRITICAL)

### 7.1 Non-Blocking Principle

**MUST NOT** block the main agent with `sleep` commands waiting for results:
- ❌ `sleep 300 && check_results` (blocks main agent)
- ✅ Launch background tasks, continue thinking, check periodically

### 7.2 Continuous GPU Utilization

**MUST** maximize GPU utilization:
- When an agent completes a task, immediately assign new work
- Use `run_in_background: true` for all long-running agents
- Check agent completion via system notifications, not polling
### 7.3 Monitoring Strategy
|
||||||
|
|
||||||
|
```
|
||||||
|
CORRECT PATTERN:
|
||||||
|
1. Launch agents in background with run_in_background: true
|
||||||
|
2. Continue analysis, planning, or hypothesis generation
|
||||||
|
3. When agent completion notification arrives, process results
|
||||||
|
4. Immediately assign new tasks to freed GPUs
|
||||||
|
|
||||||
|
WRONG PATTERN:
|
||||||
|
1. Launch agents
|
||||||
|
2. sleep 300 # BLOCKS EVERYTHING!
|
||||||
|
3. Check results
|
||||||
|
4. GPU sits idle during sleep
|
||||||
|
```
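In plain Python terms, the correct pattern is a non-blocking status check between units of useful work, never a sleep. A sketch using `subprocess.Popen`-style handles — the helper name, the `(tag, proc)` pairing, and the callback signature are assumptions:

```python
def drain_completed(running, on_result):
    """One non-blocking pass: report finished jobs, return those still running.

    `running` is a list of (tag, proc) where proc.poll() returns the
    returncode, or None while the job is still alive.
    """
    still_running = []
    for tag, proc in running:
        rc = proc.poll()          # returns immediately, never blocks
        if rc is None:
            still_running.append((tag, proc))
        else:
            on_result(tag, rc)    # a GPU just freed up: assign new work here
    return still_running
```

The caller interleaves `drain_completed` with analysis or planning work, so no `sleep 300` ever stalls the loop.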
### 7.4 Between-Task Work

While waiting for agents, the main agent SHOULD:
- Analyze code for additional hypotheses
- Prepare next batch of tests
- Update documentation with interim findings
- Plan fix implementations based on emerging patterns
### 7.5 Idle GPU Utilization (CRITICAL)
|
||||||
|
|
||||||
|
**MUST** utilize idle GPUs for exploratory tests while waiting:
|
||||||
|
|
||||||
|
```
|
||||||
|
WRONG PATTERN:
|
||||||
|
1. Launch 2 agents on GPU 0-1
|
||||||
|
2. Wait for completion ← GPU 2-5 sit idle!
|
||||||
|
3. Process results
|
||||||
|
|
||||||
|
CORRECT PATTERN:
|
||||||
|
1. Launch 2 agents on GPU 0-1 for main validation
|
||||||
|
2. IMMEDIATELY launch exploratory tests on GPU 2-5:
|
||||||
|
- Test alternative configurations
|
||||||
|
- Verify edge cases
|
||||||
|
- Run sanity checks on other datasets
|
||||||
|
- Profile performance bottlenecks
|
||||||
|
3. Continue spawning new tasks as GPUs become free
|
||||||
|
4. Process results as they arrive
|
||||||
|
```
**Idle GPU Detection**:
```bash
# Check which GPUs are free
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv
```
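The CSV output of that command can be parsed to pick exploration GPUs automatically. A sketch; the idle thresholds (≤5% utilization, ≤500 MiB used) are assumptions, not values from this rule:

```python
def idle_gpus(nvidia_smi_csv, max_util_pct=5, max_mem_mib=500):
    """Parse `nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv`."""
    idle = []
    for row in nvidia_smi_csv.strip().splitlines()[1:]:  # skip the header row
        index, util, mem = (field.strip() for field in row.split(","))
        # Fields look like "37 %" and "1024 MiB"; keep the leading number.
        if int(util.split()[0]) <= max_util_pct and int(mem.split()[0]) <= max_mem_mib:
            idle.append(int(index))
    return idle
```

Feeding this the live command output (e.g. via `subprocess.check_output`) yields the GPU IDs to hand to exploratory agents.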
**Exploratory Test Ideas** (when main validation is running):

| GPU State | Suggested Task |
|-----------|----------------|
| Idle during single-task validation | Test same task with different config |
| Idle after quick test completes | Run related task (e.g., multikey after single-key) |
| Idle during long benchmark | Run profiling or memory analysis |
| Multiple GPUs idle | Parallelize hypothesis testing |

**Anti-Pattern**:
- ❌ "I'll wait for the 100-sample test to finish before doing anything else"
- ✅ "While GPU 0-1 run the 100-sample test, I'll use GPU 2-5 to test configs X, Y, Z"

---
## 8. Code Modification Policy (CRITICAL)

### 8.1 Evidence-Before-Action Principle

**MUST NOT** modify code until sufficient evidence has been gathered:

| Phase | Action | Code Modification |
|-------|--------|-------------------|
| Hypothesis Formation | Identify potential causes | ❌ NO |
| Evidence Gathering | Run targeted tests | ❌ NO |
| Pattern Analysis | Analyze test results | ❌ NO |
| Root Cause Confirmation | Validate with multiple tests | ❌ NO |
| Solution Design | Design fix based on evidence | ❌ NO |
| **Implementation** | Apply targeted fix | ✅ YES |
### 8.2 Minimum Evidence Requirements
|
||||||
|
|
||||||
|
Before proposing ANY code modification:
|
||||||
|
|
||||||
|
1. **Reproducibility**: Bug must be reproducible with specific test cases
|
||||||
|
2. **Isolation**: Root cause must be isolated (not symptoms)
|
||||||
|
3. **Multiple Data Points**: At least 3 independent test runs confirming the issue
|
||||||
|
4. **Counter-Evidence**: Attempted to disprove the hypothesis
|
||||||
|
5. **Mechanism Understanding**: Clear understanding of WHY the bug occurs
|
||||||
|
|
||||||
|
### 8.3 Main Agent Behavior
|
||||||
|
|
||||||
|
The main agent **SHOULD**:
|
||||||
|
- Keep thinking and analyzing while background agents run tests
|
||||||
|
- Formulate and refine hypotheses based on incoming results
|
||||||
|
- Document findings in `findings.md` as evidence accumulates
|
||||||
|
- Wait for sufficient test coverage before proposing fixes
|
||||||
|
|
||||||
|
The main agent **MUST NOT**:
|
||||||
|
- Rush to modify code after seeing first failure
|
||||||
|
- Propose fixes based on speculation
|
||||||
|
- Change multiple things at once "just to be safe"
|
||||||
|
- Assume correlation implies causation
|
||||||
|
|
||||||
|
### 8.4 Evidence Documentation Template
|
||||||
|
|
||||||
|
Before any code modification, document in `findings.md`:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
## Proposed Fix: [Brief Description]
|
||||||
|
|
||||||
|
### Evidence Summary
|
||||||
|
- Test A: [Result] - supports/contradicts hypothesis
|
||||||
|
- Test B: [Result] - supports/contradicts hypothesis
|
||||||
|
- Test C: [Result] - supports/contradicts hypothesis
|
||||||
|
|
||||||
|
### Root Cause Analysis
|
||||||
|
- What: [Specific bug behavior]
|
||||||
|
- Where: [File:line or function]
|
||||||
|
- Why: [Mechanism explanation]
|
||||||
|
- Confidence: [High/Medium/Low]
|
||||||
|
|
||||||
|
### Alternative Explanations Ruled Out
|
||||||
|
1. [Alternative A]: Ruled out because [reason]
|
||||||
|
2. [Alternative B]: Ruled out because [reason]
|
||||||
|
|
||||||
|
### Proposed Change
|
||||||
|
- File: [path]
|
||||||
|
- Change: [description]
|
||||||
|
- Expected Impact: [what should improve]
|
||||||
|
```
### 8.5 Anti-Patterns

| Don't | Do Instead |
|-------|------------|
| See error → immediately edit code | See error → gather more data → analyze → then edit |
| Fix based on single test failure | Reproduce failure 3+ times, understand pattern |
| Change code "to see what happens" | Form hypothesis first, design targeted experiment |
| Modify multiple files simultaneously | Isolate changes, verify each independently |
| Skip documentation of findings | Document every significant finding before changing code |

---
## 9. Example Scenario

### Setup
- **Machine**: 8 GPUs (GPU 0-7)
- **Task**: Debug RULER chunked attention 20% error rate
- **Available tasks**: 6 RULER tasks (niah_single_1/2/3, niah_multikey_1/2/3)
- **Estimated full validation time**: ~2 hours for all tasks
### Execution Plan Output
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
## Execution Plan
|
||||||
|
|
||||||
|
### Task Classification
|
||||||
|
- Type: Long-running validation
|
||||||
|
- Reason: Full validation of 100 samples × 6 tasks would take ~2 hours
|
||||||
|
|
||||||
|
### GPU Allocation
|
||||||
|
- Validation GPU(s): GPU 0 (1 GPU)
|
||||||
|
- Reason: Single GPU sufficient for sequential 100-sample validation
|
||||||
|
- Exploration GPU(s): GPU 1, 2, 3, 4, 5, 6, 7 (7 GPUs)
|
||||||
|
- Exploration tasks:
|
||||||
|
1. GPU 1: Test 2-slot vs 4-slot ring buffer on error samples
|
||||||
|
2. GPU 2: Test N-way merge implementation
|
||||||
|
3. GPU 3: Test LSE precision fix
|
||||||
|
4. GPU 4: Profile merge accumulation error
|
||||||
|
5. GPU 5: Test with ruler_64k dataset (5 samples)
|
||||||
|
6. GPU 6: Test decode boundary conditions
|
||||||
|
7. GPU 7: Reserved for ad-hoc hypothesis testing
|
||||||
|
|
||||||
|
### Task Selection
|
||||||
|
- Full validation task: niah_single_1
|
||||||
|
- Reason: Has documented error samples (19 known failures), smallest single-key task
|
||||||
|
- Other tasks: Sanity-check only (5 samples each) after fix verified
|
||||||
|
|
||||||
|
### Stopping Criteria
|
||||||
|
- Time limit: 60 minutes for full validation
|
||||||
|
- Success metric: Error rate < 10% (down from 20%)
|
||||||
|
- Error threshold: Pause if new error pattern emerges (>5 consecutive failures)
|
||||||
|
|
||||||
|
### Expected Output
|
||||||
|
- Accuracy comparison: before vs after fix
|
||||||
|
- Error sample analysis: which samples still fail
|
||||||
|
- Hypothesis validation: which exploration branch identified the fix
|
||||||
|
```
### Execution Flow

1. **GPU 0**: Runs full `niah_single_1` validation (100 samples, ~40 min)
2. **GPU 1-7**: Run parallel exploration tasks (each ~5-15 min)
3. **Checkpoint at 50%**: Report GPU 0 progress + any discoveries from exploration
4. **On discovery**: If exploration GPU finds fix, pause validation, apply fix, restart
5. **Completion**: Report final results, decide if scale-up needed

---
## 10. Quick Reference Checklist

Before starting any debugging validation:

- [ ] Classified task type? (Long-running vs Exploratory)
- [ ] If long-running: Limited to 1-2 GPUs?
- [ ] If long-running: Selected single task for full validation?
- [ ] Remaining GPUs allocated for exploration?
- [ ] Execution plan output with all required sections?
- [ ] Stopping criteria defined?
- [ ] No user override requested? (Default conservative behavior)

Before proposing any code modification:

- [ ] Bug reproducible with specific test cases?
- [ ] Root cause isolated (not just symptoms)?
- [ ] At least 3 independent test runs confirming the issue?
- [ ] Alternative explanations ruled out?
- [ ] Mechanism of bug clearly understood?
- [ ] Evidence documented in findings.md?

---
## 11. Rule Violations

The following actions **VIOLATE** this rule:

1. Using all 6+ GPUs for a single 100-sample validation
2. Running full validation on all tasks without completing single-task first
3. Starting long validation without outputting execution plan
4. Not reserving GPUs for exploration when ≥4 GPUs available
5. Scaling up without meeting conditions in Section 4
6. **Modifying code before gathering sufficient evidence** (Section 8)
7. Proposing fixes based on single test failure or speculation
8. Changing multiple code locations simultaneously without isolation testing

---
## 12. Integration with Other Rules

This rule works alongside:
- `gpu-testing.md`: GPU type detection and basic allocation
- `planning-with-files.md`: Progress tracking for long validations
- `testing.md`: Test script conventions

When conflicts arise, this rule takes precedence for debugging scenarios.
@@ -23,6 +23,15 @@ Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline L

| [`docs/ruler_32k_chunked_offload_issue.md`](docs/ruler_32k_chunked_offload_issue.md) | ⚠️ OPEN ISSUE: 32K chunked offload accuracy problem (20% error rate in RULER) |
| [`docs/chunked_attention_solutions.md`](docs/chunked_attention_solutions.md) | 🔧 SOLUTIONS: Code analysis and solutions for the chunked attention accuracy issue |
## Rules Index

| Rule | Purpose |
|------|---------|
| [`.claude/rules/multi-gpu-debugging.md`](.claude/rules/multi-gpu-debugging.md) | **Multi-GPU debugging**: GPU allocation (1-2 for validation, rest for exploration), single-task validation policy |
| [`.claude/rules/gpu-testing.md`](.claude/rules/gpu-testing.md) | GPU type detection, card assignment, needle test requirements |
| [`.claude/rules/sparse-policy.md`](.claude/rules/sparse-policy.md) | SparsePolicy implementation requirements |
| [`.claude/rules/planning-with-files.md`](.claude/rules/planning-with-files.md) | Planning file management for complex tasks |
## GPU Mutex for Multi-Instance Debugging

**IMPORTANT**: When running multiple Claude instances for parallel debugging, different rules apply based on script type: