# Multi-GPU Debugging and Experimentation Rules

## Purpose

This rule governs GPU resource allocation and task-execution strategy during debugging and experimentation on multi-GPU machines. The goal is to maximize debugging efficiency by:
- Running long validations on minimal GPUs (1-2)
- Using remaining GPUs for parallel hypothesis exploration
- Executing only one task/dataset for full validation during debugging
## 1. Scenario Classification

### 1.1 Long-Running Validation (Triggers Conservative Allocation)

A task SHALL be classified as long-running validation if ANY of the following conditions apply:
| Condition | Threshold |
|---|---|
| Estimated runtime | > 20 minutes |
| Sample count | > 50 samples per task |
| Full dataset execution | Any complete validation.jsonl |
| Full training/fine-tuning | Any training run |
| Large-scale inference | > 10K tokens total |
Examples:

- Running all 100 samples of `niah_single_1`
- Full RULER benchmark (13 tasks × 100 samples)
- Complete model evaluation on any benchmark
### 1.2 Exploratory / Fast-Iteration Work (Allows Full GPU Use)

A task SHALL be classified as exploratory if ALL of the following apply:
| Condition | Threshold |
|---|---|
| Estimated runtime | < 10 minutes |
| Sample count | ≤ 10 samples |
| Purpose | Sanity check, minimal reproduction, hypothesis testing |
Examples:
- Testing 3-5 specific error samples
- Single-batch inference for debugging
- Verifying a code fix on minimal input
- Profiling a single forward pass
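The classification rules in 1.1 and 1.2 can be expressed as a small decision function. The following Python sketch is illustrative, not part of any tooling; the function name and signature are hypothetical, and the thresholds mirror the defaults in Section 6.1.

```python
# Hypothetical sketch of the classification rules in Sections 1.1-1.2.
LONG_RUNNING_THRESHOLD_MINUTES = 20
LONG_RUNNING_SAMPLE_THRESHOLD = 50

def classify_task(est_minutes: float, n_samples: int,
                  full_dataset: bool = False, training: bool = False) -> str:
    # 1.1: ANY trigger makes the task long-running.
    if (est_minutes > LONG_RUNNING_THRESHOLD_MINUTES
            or n_samples > LONG_RUNNING_SAMPLE_THRESHOLD
            or full_dataset or training):
        return "long-running validation"
    # 1.2: ALL conditions must hold for the exploratory classification.
    if est_minutes < 10 and n_samples <= 10:
        return "exploratory"
    # Borderline cases (e.g. 15 minutes, 30 samples) default to the
    # conservative classification.
    return "long-running validation"
```

Note the asymmetry: 1.1 is an ANY-of test and 1.2 is an ALL-of test, so a task between the two threshold bands is treated conservatively.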
## 2. GPU Allocation Strategy

### 2.1 Core Allocation Rules
| Task Type | GPU Allocation | Remaining GPUs |
|---|---|---|
| Long-running validation | 1 GPU (default), max 2 GPUs | Reserved for exploration |
| Exploratory work | As needed, can use multiple | - |
### 2.2 Mandatory Constraints

- MUST NOT occupy all available GPUs for a single long-running validation
- MUST reserve at least 50% of GPUs (minimum 2) for parallel exploration when ≥4 GPUs are available
- MUST select GPUs based on this priority:
  - Idle GPUs first (check with `nvidia-smi`)
  - If load info is unavailable, use the lowest-numbered GPUs for validation
- MUST avoid resource conflicts:
  - Each task uses a unique `CUDA_VISIBLE_DEVICES` setting
  - Each task uses a unique output directory
  - Log files include the GPU ID in the filename
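The conflict-avoidance constraints can be bundled into one launch helper. This is a minimal Python sketch under assumptions: the helper name, the `runs/` layout, and the log naming are hypothetical, not part of this rule.

```python
import os
import pathlib
import subprocess

def launch_task(gpu_id: int, cmd: list, out_root: str = "runs"):
    # One GPU, one output directory, one log file per task (rule 2.2):
    # unique CUDA_VISIBLE_DEVICES, unique output dir, GPU ID in log filename.
    out_dir = pathlib.Path(out_root) / f"gpu{gpu_id}"
    out_dir.mkdir(parents=True, exist_ok=True)
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    log = open(out_dir / f"task_gpu{gpu_id}.log", "w")
    return subprocess.Popen(cmd, env=env, stdout=log, stderr=subprocess.STDOUT)
```

Because every task derives its paths and device mask from `gpu_id` alone, two tasks can never collide as long as their GPU IDs differ.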
### 2.3 GPU Selection Algorithm

```text
IF num_available_gpus >= 4:
    validation_gpus = 1 (or 2 if justified)
    exploration_gpus = remaining GPUs
ELSE IF num_available_gpus == 3:
    validation_gpus = 1
    exploration_gpus = 2
ELSE IF num_available_gpus == 2:
    validation_gpus = 1
    exploration_gpus = 1
ELSE:
    validation_gpus = 1
    exploration_gpus = 0 (sequential exploration)
```
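The pseudocode above collapses to a short Python function, since validation always gets one GPU (two only when justified and at least four GPUs are available). This is an illustrative sketch; the function name is hypothetical. Lowest-numbered GPUs go to validation, per rule 2.2.

```python
def allocate_gpus(num_available: int, two_gpu_justified: bool = False):
    # Validation gets 1 GPU, or 2 when justified AND >= 4 GPUs exist;
    # every remaining GPU is reserved for exploration.
    n_val = 2 if (two_gpu_justified and num_available >= 4) else 1
    validation = list(range(n_val))
    exploration = list(range(n_val, num_available))
    return validation, exploration
```

With a single GPU the exploration list is empty, which corresponds to the sequential-exploration fallback in the ELSE branch.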
## 3. Task / Dataset Selection Policy

### 3.1 Single-Task Validation Rule

During debugging, when a long-running validation is required:

- MUST execute only ONE task/dataset fully
- MUST NOT run all tasks unless explicitly requested or the conditions in Section 4 are met
### 3.2 Task Selection Priority
Select the single task based on this priority order:
| Priority | Criterion | Example |
|---|---|---|
| 1 | Task most likely to reproduce the bug | If error occurs in `niah_single_1`, use that |
| 2 | Smallest task covering critical paths | `niah_single_1` (100 samples) vs `niah_multikey_3` |
| 3 | Task with known error samples | Use task with documented failure cases |
| 4 | Most representative task | Single-key before multi-key for basic validation |
### 3.3 Other Tasks Handling
Tasks not selected for full validation:
- MAY receive lightweight sanity checks (≤5 samples)
- MUST NOT receive full end-to-end execution by default
- SHOULD be noted in execution plan for future validation
## 4. Scale-Up Conditions
Expansion to more GPUs or multiple full tasks is ALLOWED ONLY IF:
| Condition | Justification Required |
|---|---|
| Single-task validation completed successfully | Confirm fix works on one task first |
| Critical bug identified and fixed | Need cross-task verification |
| Cross-dataset consistency required | Clear technical justification needed |
| User explicitly requests full-scale | User override |
### 4.1 Default Behavior
- DEFAULT: Conservative, non-expansive
- MUST ask for confirmation before scaling up
- MUST document reason for scale-up in execution plan
## 5. Execution Plan Transparency

### 5.1 Mandatory Pre-Execution Output

Before starting any validation, MUST output an execution plan containing:
```markdown
## Execution Plan

### Task Classification
- Type: [Long-running validation / Exploratory]
- Reason: [Why classified this way]

### GPU Allocation
- Validation GPU(s): [GPU IDs]
- Reason: [Why these GPUs selected]
- Exploration GPU(s): [GPU IDs]
- Exploration tasks: [List of parallel hypotheses to test]

### Task Selection
- Full validation task: [Task name]
- Reason: [Why this task selected]
- Other tasks: [Skipped / Sanity-check only]

### Stopping Criteria
- Time limit: [X minutes]
- Success metric: [e.g., accuracy > 90%]
- Error threshold: [e.g., stop if >20 samples fail]

### Expected Output
- [What results will be produced]
```
### 5.2 Progress Checkpoints
For long-running validations, SHOULD report progress at:
- 25% completion
- 50% completion
- 75% completion
- Final results
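The checkpoint schedule maps directly to sample indices. A tiny illustrative helper (the name is hypothetical, not part of any tooling):

```python
def checkpoint_samples(total_samples: int):
    # Sample counts at which to report progress: 25%, 50%, 75%, and final.
    return [max(1, total_samples * pct // 100) for pct in (25, 50, 75, 100)]
```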
## 6. Configuration Defaults

### 6.1 Default Parameters

| Parameter | Default Value | Description |
|---|---|---|
| `LONG_RUNNING_THRESHOLD_MINUTES` | 20 | Runtime threshold for classification |
| `LONG_RUNNING_SAMPLE_THRESHOLD` | 50 | Sample count threshold |
| `MAX_VALIDATION_GPUS` | 2 | Maximum GPUs for long validation |
| `MIN_EXPLORATION_GPUS` | 2 | Minimum GPUs reserved for exploration (when ≥4 available) |
| `EXPLORATION_SAMPLE_LIMIT` | 10 | Max samples for exploratory tests |
| `SANITY_CHECK_SAMPLES` | 5 | Samples for non-selected tasks |
### 6.2 User Override
Users can override defaults by specifying in their request:
- "Use all GPUs for validation"
- "Run all tasks"
- "Increase validation GPUs to N"
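The defaults plus the override mechanism can be modeled as an immutable config with per-field replacement. A sketch under assumptions: the dataclass name and field names are hypothetical renderings of the Section 6.1 parameters.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class DebugDefaults:
    # Defaults from Section 6.1.
    long_running_threshold_minutes: int = 20
    long_running_sample_threshold: int = 50
    max_validation_gpus: int = 2
    min_exploration_gpus: int = 2
    exploration_sample_limit: int = 10
    sanity_check_samples: int = 5

# A user request like "Increase validation GPUs to N" maps to one field change;
# all other defaults stay untouched.
defaults = DebugDefaults()
overridden = replace(defaults, max_validation_gpus=4)
```

Keeping the config frozen means an override produces a new object, so the conservative defaults are always recoverable.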
## 7. Async Monitoring (CRITICAL)

### 7.1 Non-Blocking Principle

MUST NOT block the main agent with sleep commands waiting for results:

- ❌ `sleep 300 && check_results` (blocks the main agent)
- ✅ Launch background tasks, continue thinking, check periodically
### 7.2 Continuous GPU Utilization

MUST maximize GPU utilization:

- When an agent completes a task, immediately assign new work
- Use `run_in_background: true` for all long-running agents
- Check agent completion via system notifications, not polling
### 7.3 Monitoring Strategy

CORRECT PATTERN:

1. Launch agents in background with `run_in_background: true`
2. Continue analysis, planning, or hypothesis generation
3. When an agent completion notification arrives, process results
4. Immediately assign new tasks to freed GPUs

WRONG PATTERN:

1. Launch agents
2. `sleep 300` # BLOCKS EVERYTHING!
3. Check results
4. GPU sits idle during sleep
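The correct pattern can be sketched with `subprocess`: launch everything first, then poll with a short tick instead of one long sleep. This is a minimal illustration, not the agent runtime's actual mechanism (which uses completion notifications rather than polling); the function name is hypothetical.

```python
import subprocess
import time

def run_nonblocking(cmds, tick_seconds=0.1):
    # Launch every command in the background first; nothing blocks on launch.
    procs = [subprocess.Popen(c, stdout=subprocess.DEVNULL) for c in cmds]
    pending = set(range(len(procs)))
    while pending:
        for i in list(pending):
            if procs[i].poll() is not None:  # non-blocking completion check
                pending.discard(i)
                # ...process results here, assign new work to the freed GPU...
        # Do useful between-task work (Section 7.4) in this gap; the short
        # tick below is only a poll interval, never a `sleep 300`.
        time.sleep(tick_seconds)
    return [p.returncode for p in procs]
```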
### 7.4 Between-Task Work
While waiting for agents, the main agent SHOULD:
- Analyze code for additional hypotheses
- Prepare next batch of tests
- Update documentation with interim findings
- Plan fix implementations based on emerging patterns
### 7.5 Idle GPU Utilization (CRITICAL)

MUST utilize idle GPUs for exploratory tests while waiting:

WRONG PATTERN:

1. Launch 2 agents on GPU 0-1
2. Wait for completion ← GPU 2-5 sit idle!
3. Process results

CORRECT PATTERN:

1. Launch 2 agents on GPU 0-1 for main validation
2. IMMEDIATELY launch exploratory tests on GPU 2-5:
   - Test alternative configurations
   - Verify edge cases
   - Run sanity checks on other datasets
   - Profile performance bottlenecks
3. Continue spawning new tasks as GPUs become free
4. Process results as they arrive

Idle GPU Detection:

```shell
# Check which GPUs are free
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv
```
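The CSV output of that command can be filtered programmatically to pick idle GPUs. A hedged Python sketch: the idleness thresholds (≤5% utilization, ≤1024 MiB used) are assumptions, not part of this rule, and should be tuned to the machine.

```python
import csv
import io

def idle_gpus(smi_csv: str, max_util_pct: int = 5, max_mem_mib: int = 1024):
    # Parse the CSV produced by:
    #   nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv
    # Rows look like: "0, 95 %, 40000 MiB"; rows[0] is the header.
    rows = list(csv.reader(io.StringIO(smi_csv.strip())))
    idle = []
    for index, util, mem in rows[1:]:
        util_pct = int(util.strip().rstrip(" %"))
        mem_mib = int(mem.strip().split()[0])
        if util_pct <= max_util_pct and mem_mib <= max_mem_mib:
            idle.append(int(index))
    return idle
```

Feeding the helper fresh `nvidia-smi` output before each launch implements the "idle GPUs first" selection priority from rule 2.2.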
Exploratory Test Ideas (when main validation is running):
| GPU State | Suggested Task |
|---|---|
| Idle during single-task validation | Test same task with different config |
| Idle after quick test completes | Run related task (e.g., multikey after single-key) |
| Idle during long benchmark | Run profiling or memory analysis |
| Multiple GPUs idle | Parallelize hypothesis testing |
Anti-Pattern:
- ❌ "I'll wait for the 100-sample test to finish before doing anything else"
- ✅ "While GPU 0-1 run the 100-sample test, I'll use GPU 2-5 to test configs X, Y, Z"
## 8. Code Modification Policy (CRITICAL)

### 8.1 Evidence-Before-Action Principle
MUST NOT modify code until sufficient evidence has been gathered:
| Phase | Action | Code Modification |
|---|---|---|
| Hypothesis Formation | Identify potential causes | ❌ NO |
| Evidence Gathering | Run targeted tests | ❌ NO |
| Pattern Analysis | Analyze test results | ❌ NO |
| Root Cause Confirmation | Validate with multiple tests | ❌ NO |
| Solution Design | Design fix based on evidence | ❌ NO |
| Implementation | Apply targeted fix | ✅ YES |
### 8.2 Minimum Evidence Requirements
Before proposing ANY code modification:
- Reproducibility: Bug must be reproducible with specific test cases
- Isolation: Root cause must be isolated (not symptoms)
- Multiple Data Points: At least 3 independent test runs confirming the issue
- Counter-Evidence: Attempted to disprove the hypothesis
- Mechanism Understanding: Clear understanding of WHY the bug occurs
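The five requirements form a strict conjunction, which can be made explicit as a gate function. An illustrative sketch; the function name and parameters are hypothetical labels for the checklist above.

```python
def ready_to_propose_fix(reproducible: bool, root_cause_isolated: bool,
                         confirming_runs: int, counter_evidence_attempted: bool,
                         mechanism_understood: bool) -> bool:
    # All five requirements in 8.2 must hold; any single missing one
    # blocks the code modification.
    return (reproducible and root_cause_isolated and confirming_runs >= 3
            and counter_evidence_attempted and mechanism_understood)
```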
### 8.3 Main Agent Behavior
The main agent SHOULD:
- Keep thinking and analyzing while background agents run tests
- Formulate and refine hypotheses based on incoming results
- Document findings in `findings.md` as evidence accumulates
- Wait for sufficient test coverage before proposing fixes
The main agent MUST NOT:
- Rush to modify code after seeing first failure
- Propose fixes based on speculation
- Change multiple things at once "just to be safe"
- Assume correlation implies causation
### 8.4 Evidence Documentation Template

Before any code modification, document in `findings.md`:
```markdown
## Proposed Fix: [Brief Description]

### Evidence Summary
- Test A: [Result] - supports/contradicts hypothesis
- Test B: [Result] - supports/contradicts hypothesis
- Test C: [Result] - supports/contradicts hypothesis

### Root Cause Analysis
- What: [Specific bug behavior]
- Where: [File:line or function]
- Why: [Mechanism explanation]
- Confidence: [High/Medium/Low]

### Alternative Explanations Ruled Out
1. [Alternative A]: Ruled out because [reason]
2. [Alternative B]: Ruled out because [reason]

### Proposed Change
- File: [path]
- Change: [description]
- Expected Impact: [what should improve]
```
### 8.5 Anti-Patterns
| Don't | Do Instead |
|---|---|
| See error → immediately edit code | See error → gather more data → analyze → then edit |
| Fix based on single test failure | Reproduce failure 3+ times, understand pattern |
| Change code "to see what happens" | Form hypothesis first, design targeted experiment |
| Modify multiple files simultaneously | Isolate changes, verify each independently |
| Skip documentation of findings | Document every significant finding before changing code |
## 9. Example Scenario

### Setup
- Machine: 8 GPUs (GPU 0-7)
- Task: Debug RULER chunked attention 20% error rate
- Available tasks: 6 RULER tasks (`niah_single_1/2/3`, `niah_multikey_1/2/3`)
- Estimated full validation time: ~2 hours for all tasks
### Execution Plan Output

```markdown
## Execution Plan

### Task Classification
- Type: Long-running validation
- Reason: Full validation of 100 samples × 6 tasks would take ~2 hours

### GPU Allocation
- Validation GPU(s): GPU 0 (1 GPU)
- Reason: Single GPU sufficient for sequential 100-sample validation
- Exploration GPU(s): GPU 1, 2, 3, 4, 5, 6, 7 (7 GPUs)
- Exploration tasks:
  1. GPU 1: Test 2-slot vs 4-slot ring buffer on error samples
  2. GPU 2: Test N-way merge implementation
  3. GPU 3: Test LSE precision fix
  4. GPU 4: Profile merge accumulation error
  5. GPU 5: Test with ruler_64k dataset (5 samples)
  6. GPU 6: Test decode boundary conditions
  7. GPU 7: Reserved for ad-hoc hypothesis testing

### Task Selection
- Full validation task: niah_single_1
- Reason: Has documented error samples (19 known failures), smallest single-key task
- Other tasks: Sanity-check only (5 samples each) after fix verified

### Stopping Criteria
- Time limit: 60 minutes for full validation
- Success metric: Error rate < 10% (down from 20%)
- Error threshold: Pause if new error pattern emerges (>5 consecutive failures)

### Expected Output
- Accuracy comparison: before vs after fix
- Error sample analysis: which samples still fail
- Hypothesis validation: which exploration branch identified the fix
```
### Execution Flow

- GPU 0: Runs full `niah_single_1` validation (100 samples, ~40 min)
- GPU 1-7: Run parallel exploration tasks (each ~5-15 min)
- Checkpoint at 50%: Report GPU 0 progress + any discoveries from exploration
- On discovery: If exploration GPU finds fix, pause validation, apply fix, restart
- Completion: Report final results, decide if scale-up needed
## 10. Quick Reference Checklist

Before starting any debugging validation:

- [ ] Classified task type? (Long-running vs Exploratory)
- [ ] If long-running: Limited to 1-2 GPUs?
- [ ] If long-running: Selected single task for full validation?
- [ ] Remaining GPUs allocated for exploration?
- [ ] Execution plan output with all required sections?
- [ ] Stopping criteria defined?
- [ ] No user override requested? (Default conservative behavior)

Before proposing any code modification:

- [ ] Bug reproducible with specific test cases?
- [ ] Root cause isolated (not just symptoms)?
- [ ] At least 3 independent test runs confirming the issue?
- [ ] Alternative explanations ruled out?
- [ ] Mechanism of bug clearly understood?
- [ ] Evidence documented in `findings.md`?
## 11. Rule Violations
The following actions VIOLATE this rule:
- Using all 6+ GPUs for a single 100-sample validation
- Running full validation on all tasks without completing single-task first
- Starting long validation without outputting execution plan
- Not reserving GPUs for exploration when ≥4 GPUs available
- Scaling up without meeting conditions in Section 4
- Modifying code before gathering sufficient evidence (Section 8)
- Proposing fixes based on single test failure or speculation
- Changing multiple code locations simultaneously without isolation testing
## 12. Integration with Other Rules

This rule works alongside:

- `gpu-testing.md`: GPU type detection and basic allocation
- `planning-with-files.md`: Progress tracking for long validations
- `testing.md`: Test script conventions

When conflicts arise, this rule takes precedence for debugging scenarios.