# Multi-GPU Debugging and Experimentation Rules

## Purpose

This rule governs GPU resource allocation and task execution strategy during debugging and experimentation on multi-GPU machines. The goal is to maximize debugging efficiency by:

- Running long validations on minimal GPUs (1-2)
- Using remaining GPUs for parallel hypothesis exploration
- Executing only one task/dataset for full validation during debugging

---

## 1. Scenario Classification

### 1.1 Long-Running Validation (Triggers Conservative Allocation)

A task SHALL be classified as **long-running validation** if ANY of the following conditions apply:

| Condition | Threshold |
|-----------|-----------|
| Estimated runtime | > 20 minutes |
| Sample count | > 50 samples per task |
| Full dataset execution | Any complete validation.jsonl |
| Full training/fine-tuning | Any training run |
| Large-scale inference | > 10K tokens total |

**Examples:**

- Running all 100 samples of `niah_single_1`
- Full RULER benchmark (13 tasks × 100 samples)
- Complete model evaluation on any benchmark

### 1.2 Exploratory / Fast-Iteration Work (Allows Full GPU Use)

A task SHALL be classified as **exploratory** if ALL of the following apply:

| Condition | Threshold |
|-----------|-----------|
| Estimated runtime | < 10 minutes |
| Sample count | ≤ 10 samples |
| Purpose | Sanity check, minimal reproduction, hypothesis testing |

**Examples:**

- Testing 3-5 specific error samples
- Single-batch inference for debugging
- Verifying a code fix on minimal input
- Profiling a single forward pass

---

## 2. GPU Allocation Strategy

### 2.1 Core Allocation Rules

| Task Type | GPU Allocation | Remaining GPUs |
|-----------|----------------|----------------|
| Long-running validation | 1 GPU (default), max 2 GPUs | Reserved for exploration |
| Exploratory work | As needed, can use multiple | - |

### 2.2 Mandatory Constraints

1. **MUST NOT** occupy all available GPUs for a single long-running validation
2. **MUST** reserve at least 50% of GPUs (minimum 2) for parallel exploration when ≥4 GPUs are available
3. **MUST** select GPUs based on this priority:
   - Idle GPUs first (check with `nvidia-smi`)
   - If load info is unavailable, use the lowest-numbered GPUs for validation
4. **MUST** avoid resource conflicts:
   - Each task uses a unique `CUDA_VISIBLE_DEVICES`
   - Each task uses unique output directories
   - Log files include the GPU ID in the filename

### 2.3 GPU Selection Algorithm

```
IF num_available_gpus >= 4:
    validation_gpus = 1 (or 2 if justified)
    exploration_gpus = remaining GPUs
ELSE IF num_available_gpus == 3:
    validation_gpus = 1
    exploration_gpus = 2
ELSE IF num_available_gpus == 2:
    validation_gpus = 1
    exploration_gpus = 1
ELSE:
    validation_gpus = 1
    exploration_gpus = 0 (sequential exploration)
```

---

## 3. Task / Dataset Selection Policy

### 3.1 Single-Task Validation Rule

During debugging, when a long-running validation is required:

- **MUST** execute only ONE task/dataset fully
- **MUST NOT** run all tasks unless explicitly requested or the conditions in Section 4 are met

### 3.2 Task Selection Priority

Select the single task based on this priority order:

| Priority | Criterion | Example |
|----------|-----------|---------|
| 1 | Task most likely to reproduce the bug | If the error occurs in `niah_single_1`, use that |
| 2 | Smallest task covering critical paths | `niah_single_1` (100 samples) vs `niah_multikey_3` |
| 3 | Task with known error samples | Use the task with documented failure cases |
| 4 | Most representative task | Single-key before multi-key for basic validation |

### 3.3 Other Tasks Handling

Tasks not selected for full validation:

- **MAY** receive lightweight sanity checks (≤5 samples)
- **MUST NOT** receive full end-to-end execution by default
- **SHOULD** be noted in the execution plan for future validation

---
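The selection algorithm in Section 2.3, combined with the lowest-numbered-GPU fallback from Section 2.2, can be sketched as a small helper. This is illustrative only; the function name and return shape are assumptions, not an existing API:

```python
def plan_gpu_allocation(num_gpus, justify_two_validation_gpus=False):
    """Split available GPUs into validation and exploration pools (Section 2.3 sketch)."""
    # With >=4 GPUs, validation gets 1 GPU (or 2 if explicitly justified);
    # with fewer GPUs, validation always gets exactly 1.
    validation = 2 if (num_gpus >= 4 and justify_two_validation_gpus) else 1
    exploration = max(num_gpus - validation, 0)
    return {
        # Lowest-numbered GPUs go to validation when load info is unavailable (Section 2.2).
        "validation_gpus": list(range(validation)),
        "exploration_gpus": list(range(validation, num_gpus)),
        # On a single-GPU machine, exploration must run sequentially after validation.
        "sequential_exploration": exploration == 0,
    }
```

On an 8-GPU machine this yields GPU 0 for validation and GPUs 1-7 for exploration, matching the example scenario in Section 9.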
## 4. Scale-Up Conditions

Expansion to more GPUs or multiple full tasks is **ALLOWED ONLY IF**:

| Condition | Justification Required |
|-----------|------------------------|
| Single-task validation completed successfully | Confirm the fix works on one task first |
| Critical bug identified and fixed | Need cross-task verification |
| Cross-dataset consistency required | Clear technical justification needed |
| User explicitly requests full-scale | User override |

### 4.1 Default Behavior

- **DEFAULT**: Conservative, non-expansive
- **MUST** ask for confirmation before scaling up
- **MUST** document the reason for scale-up in the execution plan

---

## 5. Execution Plan Transparency

### 5.1 Mandatory Pre-Execution Output

Before starting any validation, **MUST** output an execution plan containing:

```markdown
## Execution Plan

### Task Classification
- Type: [Long-running validation / Exploratory]
- Reason: [Why classified this way]

### GPU Allocation
- Validation GPU(s): [GPU IDs]
- Reason: [Why these GPUs selected]
- Exploration GPU(s): [GPU IDs]
- Exploration tasks: [List of parallel hypotheses to test]

### Task Selection
- Full validation task: [Task name]
- Reason: [Why this task selected]
- Other tasks: [Skipped / Sanity-check only]

### Stopping Criteria
- Time limit: [X minutes]
- Success metric: [e.g., accuracy > 90%]
- Error threshold: [e.g., stop if >20 samples fail]

### Expected Output
- [What results will be produced]
```

### 5.2 Progress Checkpoints

For long-running validations, **SHOULD** report progress at:

- 25% completion
- 50% completion
- 75% completion
- Final results

---
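The Section 5.2 checkpoints can be tracked with a tiny helper that a validation loop calls after each sample. This is a sketch; the function name and milestone bookkeeping are hypothetical:

```python
def checkpoint_due(done, total, reported):
    """Return the next unreported milestone (25/50/75/100 percent) that
    done/total has crossed, recording it in the `reported` set; return
    None if no new milestone was reached."""
    for pct in (25, 50, 75, 100):
        if done * 100 >= pct * total and pct not in reported:
            reported.add(pct)
            return pct
    return None

# Example: a validation loop would call checkpoint_due(i + 1, num_samples, reported)
# after each sample and emit a progress report whenever a milestone is returned.
```

Note that a slow loop may skip past a milestone between calls; because the helper returns the lowest unreported milestone first, repeated calls catch up one report at a time.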
## 6. Configuration Defaults

### 6.1 Default Parameters

| Parameter | Default Value | Description |
|-----------|---------------|-------------|
| `LONG_RUNNING_THRESHOLD_MINUTES` | 20 | Runtime threshold for classification |
| `LONG_RUNNING_SAMPLE_THRESHOLD` | 50 | Sample count threshold |
| `MAX_VALIDATION_GPUS` | 2 | Maximum GPUs for long validation |
| `MIN_EXPLORATION_GPUS` | 2 | Minimum GPUs reserved for exploration (when ≥4 available) |
| `EXPLORATION_SAMPLE_LIMIT` | 10 | Max samples for exploratory tests |
| `SANITY_CHECK_SAMPLES` | 5 | Samples for non-selected tasks |

### 6.2 User Override

Users can override defaults by specifying in their request:

- "Use all GPUs for validation"
- "Run all tasks"
- "Increase validation GPUs to N"

---

## 7. Async Monitoring (CRITICAL)

### 7.1 Non-Blocking Principle

**MUST NOT** block the main agent with `sleep` commands waiting for results:

- ❌ `sleep 300 && check_results` (blocks the main agent)
- ✅ Launch background tasks, continue thinking, check periodically

### 7.2 Continuous GPU Utilization

**MUST** maximize GPU utilization:

- When an agent completes a task, immediately assign new work
- Use `run_in_background: true` for all long-running agents
- Check agent completion via system notifications, not polling

### 7.3 Monitoring Strategy

```
CORRECT PATTERN:
1. Launch agents in background with run_in_background: true
2. Continue analysis, planning, or hypothesis generation
3. When an agent completion notification arrives, process results
4. Immediately assign new tasks to freed GPUs

WRONG PATTERN:
1. Launch agents
2. sleep 300  # BLOCKS EVERYTHING!
3. Check results
4. GPU sits idle during sleep
```

### 7.4 Between-Task Work

While waiting for agents, the main agent SHOULD:

- Analyze code for additional hypotheses
- Prepare the next batch of tests
- Update documentation with interim findings
- Plan fix implementations based on emerging patterns

### 7.5 Idle GPU Utilization (CRITICAL)

**MUST** utilize idle GPUs for exploratory tests while waiting:

```
WRONG PATTERN:
1. Launch 2 agents on GPU 0-1
2. Wait for completion  ← GPU 2-5 sit idle!
3. Process results

CORRECT PATTERN:
1. Launch 2 agents on GPU 0-1 for main validation
2. IMMEDIATELY launch exploratory tests on GPU 2-5:
   - Test alternative configurations
   - Verify edge cases
   - Run sanity checks on other datasets
   - Profile performance bottlenecks
3. Continue spawning new tasks as GPUs become free
4. Process results as they arrive
```

**Idle GPU Detection**:

```bash
# Check which GPUs are free
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv
```

**Exploratory Test Ideas** (when the main validation is running):

| GPU State | Suggested Task |
|-----------|----------------|
| Idle during single-task validation | Test the same task with a different config |
| Idle after a quick test completes | Run a related task (e.g., multikey after single-key) |
| Idle during a long benchmark | Run profiling or memory analysis |
| Multiple GPUs idle | Parallelize hypothesis testing |

**Anti-Pattern**:

- ❌ "I'll wait for the 100-sample test to finish before doing anything else"
- ✅ "While GPU 0-1 run the 100-sample test, I'll use GPU 2-5 to test configs X, Y, Z"

---
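The `nvidia-smi` check above can be automated so freed GPUs are detected and reassigned immediately. A sketch follows; the idle thresholds (≤5% utilization, ≤1 GiB memory used) are assumptions to tune per machine:

```python
import subprocess


def parse_idle_gpus(csv_text, util_max=5, mem_max_mib=1024):
    """Parse `index,utilization.gpu,memory.used` CSV rows (no header,
    no units) and return the indices of GPUs considered idle."""
    idle = []
    for row in csv_text.strip().splitlines():
        index, util, mem = (field.strip() for field in row.split(","))
        if int(util) <= util_max and int(mem) <= mem_max_mib:
            idle.append(int(index))
    return idle


def query_idle_gpus():
    """Run nvidia-smi and return the currently idle GPU indices."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_idle_gpus(out)
```

Each GPU index returned here is a candidate for the next exploratory task, keeping utilization continuous while the main validation runs.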
## 8. Code Modification Policy (CRITICAL)

### 8.1 Evidence-Before-Action Principle

**MUST NOT** modify code until sufficient evidence has been gathered:

| Phase | Action | Code Modification |
|-------|--------|-------------------|
| Hypothesis Formation | Identify potential causes | ❌ NO |
| Evidence Gathering | Run targeted tests | ❌ NO |
| Pattern Analysis | Analyze test results | ❌ NO |
| Root Cause Confirmation | Validate with multiple tests | ❌ NO |
| Solution Design | Design fix based on evidence | ❌ NO |
| **Implementation** | Apply targeted fix | ✅ YES |

### 8.2 Minimum Evidence Requirements

Before proposing ANY code modification:

1. **Reproducibility**: The bug must be reproducible with specific test cases
2. **Isolation**: The root cause must be isolated (not symptoms)
3. **Multiple Data Points**: At least 3 independent test runs confirming the issue
4. **Counter-Evidence**: Attempted to disprove the hypothesis
5. **Mechanism Understanding**: Clear understanding of WHY the bug occurs

### 8.3 Main Agent Behavior

The main agent **SHOULD**:

- Keep thinking and analyzing while background agents run tests
- Formulate and refine hypotheses based on incoming results
- Document findings in `findings.md` as evidence accumulates
- Wait for sufficient test coverage before proposing fixes

The main agent **MUST NOT**:

- Rush to modify code after seeing the first failure
- Propose fixes based on speculation
- Change multiple things at once "just to be safe"
- Assume correlation implies causation

### 8.4 Evidence Documentation Template

Before any code modification, document in `findings.md`:

```markdown
## Proposed Fix: [Brief Description]

### Evidence Summary
- Test A: [Result] - supports/contradicts hypothesis
- Test B: [Result] - supports/contradicts hypothesis
- Test C: [Result] - supports/contradicts hypothesis

### Root Cause Analysis
- What: [Specific bug behavior]
- Where: [File:line or function]
- Why: [Mechanism explanation]
- Confidence: [High/Medium/Low]
### Alternative Explanations Ruled Out
1. [Alternative A]: Ruled out because [reason]
2. [Alternative B]: Ruled out because [reason]

### Proposed Change
- File: [path]
- Change: [description]
- Expected Impact: [what should improve]
```

### 8.5 Anti-Patterns

| Don't | Do Instead |
|-------|------------|
| See error → immediately edit code | See error → gather more data → analyze → then edit |
| Fix based on single test failure | Reproduce failure 3+ times, understand pattern |
| Change code "to see what happens" | Form hypothesis first, design targeted experiment |
| Modify multiple files simultaneously | Isolate changes, verify each independently |
| Skip documentation of findings | Document every significant finding before changing code |

---

## 9. Example Scenario

### Setup

- **Machine**: 8 GPUs (GPU 0-7)
- **Task**: Debug RULER chunked attention 20% error rate
- **Available tasks**: 6 RULER tasks (niah_single_1/2/3, niah_multikey_1/2/3)
- **Estimated full validation time**: ~2 hours for all tasks

### Execution Plan Output

```markdown
## Execution Plan

### Task Classification
- Type: Long-running validation
- Reason: Full validation of 100 samples × 6 tasks would take ~2 hours

### GPU Allocation
- Validation GPU(s): GPU 0 (1 GPU)
- Reason: Single GPU sufficient for sequential 100-sample validation
- Exploration GPU(s): GPU 1, 2, 3, 4, 5, 6, 7 (7 GPUs)
- Exploration tasks:
  1. GPU 1: Test 2-slot vs 4-slot ring buffer on error samples
  2. GPU 2: Test N-way merge implementation
  3. GPU 3: Test LSE precision fix
  4. GPU 4: Profile merge accumulation error
  5. GPU 5: Test with ruler_64k dataset (5 samples)
  6. GPU 6: Test decode boundary conditions
  7. GPU 7: Reserved for ad-hoc hypothesis testing

### Task Selection
- Full validation task: niah_single_1
- Reason: Has documented error samples (19 known failures), smallest single-key task
- Other tasks: Sanity-check only (5 samples each) after fix verified

### Stopping Criteria
- Time limit: 60 minutes for full validation
- Success metric: Error rate < 10% (down from 20%)
- Error threshold: Pause if new error pattern emerges (>5 consecutive failures)

### Expected Output
- Accuracy comparison: before vs after fix
- Error sample analysis: which samples still fail
- Hypothesis validation: which exploration branch identified the fix
```

### Execution Flow

1. **GPU 0**: Runs full `niah_single_1` validation (100 samples, ~40 min)
2. **GPU 1-7**: Run parallel exploration tasks (each ~5-15 min)
3. **Checkpoint at 50%**: Report GPU 0 progress + any discoveries from exploration
4. **On discovery**: If an exploration GPU finds the fix, pause validation, apply the fix, restart
5. **Completion**: Report final results, decide whether scale-up is needed

---

## 10. Quick Reference Checklist

Before starting any debugging validation:

- [ ] Classified the task type? (Long-running vs Exploratory)
- [ ] If long-running: Limited to 1-2 GPUs?
- [ ] If long-running: Selected a single task for full validation?
- [ ] Remaining GPUs allocated for exploration?
- [ ] Execution plan output with all required sections?
- [ ] Stopping criteria defined?
- [ ] No user override requested? (Default conservative behavior)

Before proposing any code modification:

- [ ] Bug reproducible with specific test cases?
- [ ] Root cause isolated (not just symptoms)?
- [ ] At least 3 independent test runs confirming the issue?
- [ ] Alternative explanations ruled out?
- [ ] Mechanism of the bug clearly understood?
- [ ] Evidence documented in `findings.md`?

---

## 11. Rule Violations

The following actions **VIOLATE** this rule:

1. Using all 6+ GPUs for a single 100-sample validation
2. Running full validation on all tasks without completing the single-task validation first
3. Starting a long validation without outputting an execution plan
4. Not reserving GPUs for exploration when ≥4 GPUs are available
5. Scaling up without meeting the conditions in Section 4
6. **Modifying code before gathering sufficient evidence** (Section 8)
7. Proposing fixes based on a single test failure or speculation
8. Changing multiple code locations simultaneously without isolation testing

---

## 12. Integration with Other Rules

This rule works alongside:

- `gpu-testing.md`: GPU type detection and basic allocation
- `planning-with-files.md`: Progress tracking for long validations
- `testing.md`: Test script conventions

When conflicts arise, this rule takes precedence for debugging scenarios.