nano-vllm/.claude/rules/multi-gpu-debugging.md
Zijie Tian 512e1e5401 🔧 chore: add Claude rules for agent result format and multi-GPU debugging
- Add agent-result-format.md: standardize output formats for background agents
- Add multi-gpu-debugging.md: guidelines for parallel GPU testing workflows
- Update CLAUDE.md: add documentation index entry for chunked offload issue

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 23:41:08 +08:00


Multi-GPU Debugging and Experimentation Rules

Purpose

This rule governs GPU resource allocation and task execution strategy during debugging and experimentation on multi-GPU machines. The goal is to maximize debugging efficiency by:

  • Running long validations on minimal GPUs (1-2)
  • Using remaining GPUs for parallel hypothesis exploration
  • Executing only one task/dataset for full validation during debugging

1. Scenario Classification

1.1 Long-Running Validation (Triggers Conservative Allocation)

A task SHALL be classified as long-running validation if ANY of the following conditions apply:

| Condition | Threshold |
|---|---|
| Estimated runtime | > 20 minutes |
| Sample count | > 50 samples per task |
| Full dataset execution | Any complete `validation.jsonl` |
| Full training/fine-tuning | Any training run |
| Large-scale inference | > 10K tokens total |

Examples:

  • Running all 100 samples of niah_single_1
  • Full RULER benchmark (13 tasks × 100 samples)
  • Complete model evaluation on any benchmark
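
For illustration, the thresholds above can be expressed as a small classifier (a sketch; the function and argument names are hypothetical, not part of any existing tooling):

```python
def classify_task(est_runtime_min, sample_count,
                  full_dataset=False, training_run=False, total_tokens=0):
    """Classify a task per the Section 1 thresholds.

    Returns "long-running" if ANY long-running condition applies,
    "exploratory" otherwise.
    """
    if (est_runtime_min > 20            # runtime threshold
            or sample_count > 50        # sample-count threshold
            or full_dataset             # any complete validation.jsonl
            or training_run             # any training/fine-tuning run
            or total_tokens > 10_000):  # large-scale inference
        return "long-running"
    return "exploratory"
```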

1.2 Exploratory / Fast-Iteration Work (Allows Full GPU Use)

A task SHALL be classified as exploratory if ALL of the following apply:

| Condition | Threshold |
|---|---|
| Estimated runtime | < 10 minutes |
| Sample count | ≤ 10 samples |
| Purpose | Sanity check, minimal reproduction, hypothesis testing |

Examples:

  • Testing 3-5 specific error samples
  • Single-batch inference for debugging
  • Verifying a code fix on minimal input
  • Profiling a single forward pass

2. GPU Allocation Strategy

2.1 Core Allocation Rules

| Task Type | GPU Allocation | Remaining GPUs |
|---|---|---|
| Long-running validation | 1 GPU (default), max 2 GPUs | Reserved for exploration |
| Exploratory work | As needed, can use multiple | - |

2.2 Mandatory Constraints

  1. MUST NOT occupy all available GPUs for a single long-running validation
  2. MUST reserve at least 50% of GPUs (minimum 2) for parallel exploration when ≥4 GPUs available
  3. MUST select GPUs based on this priority:
    • Idle GPUs first (check with nvidia-smi)
    • If load information is unavailable, use the lowest-numbered GPUs for validation
  4. MUST avoid resource conflicts:
    • Each task uses unique CUDA_VISIBLE_DEVICES
    • Each task uses unique output directories
    • Log files include GPU ID in filename
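
A minimal sketch of constraint 4, assuming tasks are launched as subprocesses (the function name, directory layout, and log naming are illustrative):

```python
import os
import subprocess

def launch_isolated(gpu_id, cmd, out_root="runs"):
    """Start one task pinned to a single GPU, with its own output
    directory and a GPU-tagged log file (avoids resource conflicts)."""
    out_dir = os.path.join(out_root, f"gpu{gpu_id}")
    os.makedirs(out_dir, exist_ok=True)
    # Unique CUDA_VISIBLE_DEVICES per task
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    # Log file includes the GPU ID in its filename
    log_path = os.path.join(out_dir, f"run_gpu{gpu_id}.log")
    with open(log_path, "w") as log:
        # The child inherits the fd at spawn time, so this is safe
        return subprocess.Popen(cmd, env=env, stdout=log,
                                stderr=subprocess.STDOUT)
```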

2.3 GPU Selection Algorithm

IF num_available_gpus >= 4:
    validation_gpus = 1 (or 2 if justified)
    exploration_gpus = remaining GPUs
ELSE IF num_available_gpus == 3:
    validation_gpus = 1
    exploration_gpus = 2
ELSE IF num_available_gpus == 2:
    validation_gpus = 1
    exploration_gpus = 1
ELSE:
    validation_gpus = 1
    exploration_gpus = 0 (sequential exploration)
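
The same selection algorithm in runnable form (a sketch; `allocate_gpus` is an illustrative name):

```python
def allocate_gpus(num_available_gpus, justify_two=False):
    """Split available GPUs into (validation, exploration) counts,
    following the Section 2.3 algorithm."""
    if num_available_gpus >= 4:
        # 1 validation GPU by default, 2 only if justified
        validation = 2 if justify_two else 1
    else:
        validation = 1
    # All remaining GPUs go to exploration (0 means sequential exploration)
    exploration = max(0, num_available_gpus - validation)
    return validation, exploration
```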

3. Task / Dataset Selection Policy

3.1 Single-Task Validation Rule

During debugging, when a long-running validation is required:

  • MUST execute only ONE task/dataset fully
  • MUST NOT run all tasks unless explicitly requested or conditions in Section 4 are met

3.2 Task Selection Priority

Select the single task based on this priority order:

| Priority | Criterion | Example |
|---|---|---|
| 1 | Task most likely to reproduce the bug | If error occurs in niah_single_1, use that |
| 2 | Smallest task covering critical paths | niah_single_1 (100 samples) vs niah_multikey_3 |
| 3 | Task with known error samples | Use task with documented failure cases |
| 4 | Most representative task | Single-key before multi-key for basic validation |

3.3 Other Tasks Handling

Tasks not selected for full validation:

  • MAY receive lightweight sanity checks (≤5 samples)
  • MUST NOT receive full end-to-end execution by default
  • SHOULD be noted in execution plan for future validation

4. Scale-Up Conditions

Expansion to more GPUs or multiple full tasks is ALLOWED ONLY IF:

| Condition | Justification Required |
|---|---|
| Single-task validation completed successfully | Confirm fix works on one task first |
| Critical bug identified and fixed | Need cross-task verification |
| Cross-dataset consistency required | Clear technical justification needed |
| User explicitly requests full-scale | User override |

4.1 Default Behavior

  • DEFAULT: Conservative, non-expansive
  • MUST ask for confirmation before scaling up
  • MUST document reason for scale-up in execution plan

5. Execution Plan Transparency

5.1 Mandatory Pre-Execution Output

Before starting any validation, MUST output an execution plan containing:

## Execution Plan

### Task Classification
- Type: [Long-running validation / Exploratory]
- Reason: [Why classified this way]

### GPU Allocation
- Validation GPU(s): [GPU IDs]
- Reason: [Why these GPUs selected]
- Exploration GPU(s): [GPU IDs]
- Exploration tasks: [List of parallel hypotheses to test]

### Task Selection
- Full validation task: [Task name]
- Reason: [Why this task selected]
- Other tasks: [Skipped / Sanity-check only]

### Stopping Criteria
- Time limit: [X minutes]
- Success metric: [e.g., accuracy > 90%]
- Error threshold: [e.g., stop if >20 samples fail]

### Expected Output
- [What results will be produced]

5.2 Progress Checkpoints

For long-running validations, SHOULD report progress at:

  • 25% completion
  • 50% completion
  • 75% completion
  • Final results
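
The checkpoint schedule can be computed up front; a minimal sketch (the function name is illustrative):

```python
def checkpoint_milestones(total_samples, fractions=(0.25, 0.50, 0.75, 1.0)):
    """Sample indices at which to report progress (25/50/75/100%)."""
    return [max(1, round(total_samples * f)) for f in fractions]
```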

6. Configuration Defaults

6.1 Default Parameters

| Parameter | Default Value | Description |
|---|---|---|
| `LONG_RUNNING_THRESHOLD_MINUTES` | 20 | Runtime threshold for classification |
| `LONG_RUNNING_SAMPLE_THRESHOLD` | 50 | Sample count threshold |
| `MAX_VALIDATION_GPUS` | 2 | Maximum GPUs for long validation |
| `MIN_EXPLORATION_GPUS` | 2 | Minimum GPUs reserved for exploration (when ≥4 available) |
| `EXPLORATION_SAMPLE_LIMIT` | 10 | Max samples for exploratory tests |
| `SANITY_CHECK_SAMPLES` | 5 | Samples for non-selected tasks |
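
For illustration, these defaults could be carried in a small frozen config object (the class name is hypothetical):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DebugDefaults:
    """Section 6.1 defaults; users may override per Section 6.2."""
    long_running_threshold_minutes: int = 20
    long_running_sample_threshold: int = 50
    max_validation_gpus: int = 2
    min_exploration_gpus: int = 2   # applies when >= 4 GPUs available
    exploration_sample_limit: int = 10
    sanity_check_samples: int = 5
```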

6.2 User Override

Users can override defaults by specifying in their request:

  • "Use all GPUs for validation"
  • "Run all tasks"
  • "Increase validation GPUs to N"

7. Async Monitoring (CRITICAL)

7.1 Non-Blocking Principle

MUST NOT block the main agent with sleep commands waiting for results:

  • WRONG: `sleep 300 && check_results` (blocks the main agent)
  • CORRECT: launch background tasks, continue thinking, check periodically

7.2 Continuous GPU Utilization

MUST maximize GPU utilization:

  • When an agent completes a task, immediately assign new work
  • Use run_in_background: true for all long-running agents
  • Check agent completion via system notifications, not polling

7.3 Monitoring Strategy

CORRECT PATTERN:
1. Launch agents in background with run_in_background: true
2. Continue analysis, planning, or hypothesis generation
3. When agent completion notification arrives, process results
4. Immediately assign new tasks to freed GPUs

WRONG PATTERN:
1. Launch agents
2. sleep 300  # BLOCKS EVERYTHING!
3. Check results
4. GPU sits idle during sleep
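
Where completion notifications are unavailable, the correct pattern can be approximated with a light, non-blocking poll that does useful work between checks (a sketch; `do_useful_work` stands in for the analysis and planning described in Section 7.4):

```python
import subprocess
import sys
import time

def monitor(procs, do_useful_work, poll_interval_s=1.0):
    """Watch background processes without blocking on any single one.

    `procs` maps task name -> subprocess.Popen. Between status checks
    the main agent keeps doing analysis work instead of sleeping idle.
    """
    pending = dict(procs)
    results = {}
    while pending:
        for name, proc in list(pending.items()):
            rc = proc.poll()        # non-blocking: None means still running
            if rc is not None:
                results[name] = rc  # completed: record the result,
                del pending[name]   # then assign new work to the freed GPU
        if pending:
            do_useful_work()        # analysis/planning, never a long sleep
            time.sleep(poll_interval_s)
    return results

# usage sketch:
procs = {"t1": subprocess.Popen([sys.executable, "-c", "pass"])}
monitor(procs, do_useful_work=lambda: None, poll_interval_s=0.1)
```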

7.4 Between-Task Work

While waiting for agents, the main agent SHOULD:

  • Analyze code for additional hypotheses
  • Prepare next batch of tests
  • Update documentation with interim findings
  • Plan fix implementations based on emerging patterns

7.5 Idle GPU Utilization (CRITICAL)

MUST utilize idle GPUs for exploratory tests while waiting:

WRONG PATTERN:
1. Launch 2 agents on GPU 0-1
2. Wait for completion  ← GPU 2-5 sit idle!
3. Process results

CORRECT PATTERN:
1. Launch 2 agents on GPU 0-1 for main validation
2. IMMEDIATELY launch exploratory tests on GPU 2-5:
   - Test alternative configurations
   - Verify edge cases
   - Run sanity checks on other datasets
   - Profile performance bottlenecks
3. Continue spawning new tasks as GPUs become free
4. Process results as they arrive

Idle GPU Detection:

# Check which GPUs are free
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv
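
Parsing that CSV output to pick idle GPUs might look like this (a sketch; the thresholds are illustrative, and `idle_gpus` assumes `nvidia-smi` is on PATH):

```python
import subprocess

def parse_idle(csv_text, util_threshold=5, mem_threshold_mib=1024):
    """Return GPU indices whose utilization (%) and memory use (MiB)
    both fall below the given thresholds."""
    free = []
    for line in csv_text.strip().splitlines():
        index, util, mem = (int(x.strip()) for x in line.split(","))
        if util < util_threshold and mem < mem_threshold_mib:
            free.append(index)
    return free

def idle_gpus(**kwargs):
    """Query nvidia-smi and return the currently idle GPU indices."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=index,utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"],
        text=True)
    return parse_idle(out, **kwargs)
```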

Exploratory Test Ideas (when main validation is running):

| GPU State | Suggested Task |
|---|---|
| Idle during single-task validation | Test same task with different config |
| Idle after quick test completes | Run related task (e.g., multikey after single-key) |
| Idle during long benchmark | Run profiling or memory analysis |
| Multiple GPUs idle | Parallelize hypothesis testing |

Anti-Pattern:

  • WRONG: "I'll wait for the 100-sample test to finish before doing anything else"
  • CORRECT: "While GPU 0-1 run the 100-sample test, I'll use GPU 2-5 to test configs X, Y, Z"

8. Code Modification Policy (CRITICAL)

8.1 Evidence-Before-Action Principle

MUST NOT modify code until sufficient evidence has been gathered:

| Phase | Action | Code Modification |
|---|---|---|
| Hypothesis Formation | Identify potential causes | NO |
| Evidence Gathering | Run targeted tests | NO |
| Pattern Analysis | Analyze test results | NO |
| Root Cause Confirmation | Validate with multiple tests | NO |
| Solution Design | Design fix based on evidence | NO |
| Implementation | Apply targeted fix | YES |

8.2 Minimum Evidence Requirements

Before proposing ANY code modification:

  1. Reproducibility: Bug must be reproducible with specific test cases
  2. Isolation: Root cause must be isolated (not symptoms)
  3. Multiple Data Points: At least 3 independent test runs confirming the issue
  4. Counter-Evidence: Attempted to disprove the hypothesis
  5. Mechanism Understanding: Clear understanding of WHY the bug occurs

8.3 Main Agent Behavior

The main agent SHOULD:

  • Keep thinking and analyzing while background agents run tests
  • Formulate and refine hypotheses based on incoming results
  • Document findings in findings.md as evidence accumulates
  • Wait for sufficient test coverage before proposing fixes

The main agent MUST NOT:

  • Rush to modify code after seeing first failure
  • Propose fixes based on speculation
  • Change multiple things at once "just to be safe"
  • Assume correlation implies causation

8.4 Evidence Documentation Template

Before any code modification, document in findings.md:

## Proposed Fix: [Brief Description]

### Evidence Summary
- Test A: [Result] - supports/contradicts hypothesis
- Test B: [Result] - supports/contradicts hypothesis
- Test C: [Result] - supports/contradicts hypothesis

### Root Cause Analysis
- What: [Specific bug behavior]
- Where: [File:line or function]
- Why: [Mechanism explanation]
- Confidence: [High/Medium/Low]

### Alternative Explanations Ruled Out
1. [Alternative A]: Ruled out because [reason]
2. [Alternative B]: Ruled out because [reason]

### Proposed Change
- File: [path]
- Change: [description]
- Expected Impact: [what should improve]

8.5 Anti-Patterns

| Don't | Do Instead |
|---|---|
| See error → immediately edit code | See error → gather more data → analyze → then edit |
| Fix based on single test failure | Reproduce failure 3+ times, understand pattern |
| Change code "to see what happens" | Form hypothesis first, design targeted experiment |
| Modify multiple files simultaneously | Isolate changes, verify each independently |
| Skip documentation of findings | Document every significant finding before changing code |

9. Example Scenario

Setup

  • Machine: 8 GPUs (GPU 0-7)
  • Task: Debug RULER chunked attention 20% error rate
  • Available tasks: 6 RULER tasks (niah_single_1/2/3, niah_multikey_1/2/3)
  • Estimated full validation time: ~2 hours for all tasks

Execution Plan Output

## Execution Plan

### Task Classification
- Type: Long-running validation
- Reason: Full validation of 100 samples × 6 tasks would take ~2 hours

### GPU Allocation
- Validation GPU(s): GPU 0 (1 GPU)
- Reason: Single GPU sufficient for sequential 100-sample validation
- Exploration GPU(s): GPU 1, 2, 3, 4, 5, 6, 7 (7 GPUs)
- Exploration tasks:
  1. GPU 1: Test 2-slot vs 4-slot ring buffer on error samples
  2. GPU 2: Test N-way merge implementation
  3. GPU 3: Test LSE precision fix
  4. GPU 4: Profile merge accumulation error
  5. GPU 5: Test with ruler_64k dataset (5 samples)
  6. GPU 6: Test decode boundary conditions
  7. GPU 7: Reserved for ad-hoc hypothesis testing

### Task Selection
- Full validation task: niah_single_1
- Reason: Has documented error samples (19 known failures), smallest single-key task
- Other tasks: Sanity-check only (5 samples each) after fix verified

### Stopping Criteria
- Time limit: 60 minutes for full validation
- Success metric: Error rate < 10% (down from 20%)
- Error threshold: Pause if new error pattern emerges (>5 consecutive failures)

### Expected Output
- Accuracy comparison: before vs after fix
- Error sample analysis: which samples still fail
- Hypothesis validation: which exploration branch identified the fix

Execution Flow

  1. GPU 0: Runs full niah_single_1 validation (100 samples, ~40 min)
  2. GPU 1-7: Run parallel exploration tasks (each ~5-15 min)
  3. Checkpoint at 50%: Report GPU 0 progress + any discoveries from exploration
  4. On discovery: If exploration GPU finds fix, pause validation, apply fix, restart
  5. Completion: Report final results, decide if scale-up needed

10. Quick Reference Checklist

Before starting any debugging validation:

  • Classified task type? (Long-running vs Exploratory)
  • If long-running: Limited to 1-2 GPUs?
  • If long-running: Selected single task for full validation?
  • Remaining GPUs allocated for exploration?
  • Execution plan output with all required sections?
  • Stopping criteria defined?
  • No user override requested? (Default conservative behavior)

Before proposing any code modification:

  • Bug reproducible with specific test cases?
  • Root cause isolated (not just symptoms)?
  • At least 3 independent test runs confirming the issue?
  • Alternative explanations ruled out?
  • Mechanism of bug clearly understood?
  • Evidence documented in findings.md?

11. Rule Violations

The following actions VIOLATE this rule:

  1. Using all 6+ GPUs for a single 100-sample validation
  2. Running full validation on all tasks without completing single-task first
  3. Starting long validation without outputting execution plan
  4. Not reserving GPUs for exploration when ≥4 GPUs available
  5. Scaling up without meeting conditions in Section 4
  6. Modifying code before gathering sufficient evidence (Section 8)
  7. Proposing fixes based on single test failure or speculation
  8. Changing multiple code locations simultaneously without isolation testing

12. Integration with Other Rules

This rule works alongside:

  • gpu-testing.md: GPU type detection and basic allocation
  • planning-with-files.md: Progress tracking for long validations
  • testing.md: Test script conventions

When conflicts arise, this rule takes precedence for debugging scenarios.