From 512e1e5401d8fdeb7830c4d94c037e1931bbe0a7 Mon Sep 17 00:00:00 2001
From: Zijie Tian
Date: Tue, 20 Jan 2026 23:41:08 +0800
Subject: [PATCH] =?UTF-8?q?=F0=9F=94=A7=20chore:=20add=20Claude=20rules=20?=
 =?UTF-8?q?for=20agent=20result=20format=20and=20multi-GPU=20debugging?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Add agent-result-format.md: standardize output formats for background agents
- Add multi-gpu-debugging.md: guidelines for parallel GPU testing workflows
- Update CLAUDE.md: add documentation index entry for chunked offload issue

Co-Authored-By: Claude Opus 4.5
---
 .claude/rules/agent-result-format.md | 195 +++++++++++
 .claude/rules/multi-gpu-debugging.md | 463 +++++++++++++++++++++++++++
 CLAUDE.md                            |   9 +
 3 files changed, 667 insertions(+)
 create mode 100644 .claude/rules/agent-result-format.md
 create mode 100644 .claude/rules/multi-gpu-debugging.md

diff --git a/.claude/rules/agent-result-format.md b/.claude/rules/agent-result-format.md
new file mode 100644
index 0000000..6cc5df0
--- /dev/null
+++ b/.claude/rules/agent-result-format.md
@@ -0,0 +1,195 @@
# Agent Result Format Rules

## Purpose

Minimize token usage when background agents return results to the main agent. Raw program output is verbose and wastes context window space.

---

## 1. Result Formatting Principle

**MUST** return **structured summaries** instead of raw output.

| Don't | Do |
|-------|-----|
| Full program stdout/stderr | Key metrics only |
| Debug logs | Pass/Fail status |
| Verbose error stacks | Error summary + location |

---

## 2. Standard Result Templates

### 2.1 Test Results (RULER, Unit Tests, etc.)
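As an illustration of this rule, raw harness output can be condensed into the template that follows programmatically. The helper and the per-sample log format below are hypothetical (not part of this repo); adapt the regex to the real harness output:

```python
import re


def summarize_test_output(raw_lines, task_name):
    """Condense raw per-sample PASS/FAIL lines into a structured summary.

    Assumes (hypothetically) lines shaped like:
        '[task] Sample 3: FAIL | Expected: 42 | Got: 41'
    Loading logs, progress bars, and other noise are skipped.
    """
    pattern = re.compile(
        r"Sample (\d+): (PASS|FAIL) \| Expected: (\S+) \| Got: (.+)"
    )
    passed, failed = [], []
    for line in raw_lines:
        m = pattern.search(line)
        if not m:
            continue  # not a per-sample result line: drop it
        sample, status, expected, got = m.groups()
        (passed if status == "PASS" else failed).append((sample, expected, got))
    total = len(passed) + len(failed)
    pct = 100 * len(passed) / total if total else 0
    out = [
        f"## Test Results: {task_name}",
        "",
        f"**Pass Rate**: {len(passed)} / {total} ({pct:.0f}%)",
    ]
    if failed:
        out += ["", "### Failed Samples",
                "| Sample | Expected | Got |",
                "|--------|----------|-----|"]
        out += [f"| {s} | {e} | {g} |" for s, e, g in failed]
    else:
        out += ["", f"All {total} samples passed."]
    return "\n".join(out)
```

A background agent running such a post-processing step reports only the table, never the raw stdout.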
+ +```markdown +## Test Results: [Task Name] + +**Pass Rate**: X / Y (Z%) + +### Failed Samples (if any) +| Sample | Expected | Got | +|--------|----------|-----| +| N | expected_value | actual_value | + +### Passed Samples +[List sample IDs or "All N samples passed"] +``` + +**Example** (instead of raw test output): +```markdown +## Test Results: niah_single_1 (Samples 0-49) + +**Pass Rate**: 50 / 50 (100%) + +### Passed Samples +All 50 samples passed. +``` + +### 2.2 Benchmark Results + +```markdown +## Benchmark Results: [Task Name] + +| Metric | Value | +|--------|-------| +| Throughput | X tok/s | +| Latency (p50) | Y ms | +| Latency (p99) | Z ms | +| Memory Peak | W GB | +``` + +### 2.3 Build/Compile Results + +```markdown +## Build Results: [Target] + +**Status**: SUCCESS / FAILED + +### Errors (if any) +| File | Line | Error | +|------|------|-------| +| path/to/file.py | 123 | error message | +``` + +### 2.4 Investigation/Research Results + +```markdown +## Investigation: [Topic] + +### Findings +1. Finding 1 (with file:line reference) +2. Finding 2 + +### Relevant Files +- path/to/file1.py: description +- path/to/file2.py: description + +### Conclusion +[1-2 sentence summary] +``` + +--- + +## 3. Mandatory Fields by Task Type + +| Task Type | Required Fields | +|-----------|-----------------| +| Test Run | Pass/Fail count, failed sample details | +| Benchmark | Key metrics (throughput, latency, memory) | +| Build | Status, error locations | +| Search | File paths, line numbers, brief context | +| Verification | Before/After comparison, conclusion | + +--- + +## 4. 
What to EXCLUDE + +**MUST NOT** include in results: + +| Exclude | Reason | +|---------|--------| +| Full stack traces | Extract error type + location only | +| Model loading logs | Not relevant to result | +| Progress bars / tqdm output | Noise | +| Warnings (unless critical) | Noise | +| Repeated successful outputs | "All X passed" is sufficient | +| Timestamps | Usually not needed | +| Device info (unless debugging hardware) | Noise | + +--- + +## 5. Agent Prompt Template + +When spawning background agents, include this instruction: + +``` +When reporting results, use a structured summary format: +- For tests: Pass rate, failed sample details (expected vs actual) +- For benchmarks: Key metrics table +- Do NOT include raw program output, logs, or verbose debug info +- Focus on actionable information only +``` + +--- + +## 6. Main Agent Instructions + +When spawning a background agent for testing: + +**Before** (verbose): +``` +Run tests for samples 0-49 and report the output. +``` + +**After** (structured): +``` +Run tests for samples 0-49. Report results as: +- Total pass/fail count +- For each failure: sample ID, expected value, actual value +- Do NOT include raw program output or logs +``` + +--- + +## 7. Examples + +### Bad (Wastes ~500 tokens): +``` +The test output was: +Loading model from ~/models/Llama-3.1-8B-Instruct... +Model loaded in 12.3s +[niah_single_1] Sample 0: PASS | Expected: 1234567 | Got: : 1234567.<|eot_id|> +[niah_single_1] Sample 1: PASS | Expected: 2345678 | Got: : 2345678.<|eot_id|> +... (50 more lines) ... +``` + +### Good (Uses ~50 tokens): +``` +## Test Results: niah_single_1 (Samples 0-49) + +**Pass Rate**: 50 / 50 (100%) + +All samples passed. +``` + +--- + +## 8. 
Token Savings Estimate

| Result Type | Raw Output | Structured | Savings |
|-------------|------------|------------|---------|
| 50-sample test | ~1000 tokens | ~100 tokens | 90% |
| Benchmark run | ~500 tokens | ~80 tokens | 84% |
| Build failure | ~2000 tokens | ~200 tokens | 90% |

---

## 9. Integration

This rule should be applied when:
1. Spawning agents via Task tool
2. Running background commands
3. Processing results from completed agents

Combine with `multi-gpu-debugging.md` for efficient parallel testing workflows.

diff --git a/.claude/rules/multi-gpu-debugging.md b/.claude/rules/multi-gpu-debugging.md
new file mode 100644
index 0000000..fdb98f0
--- /dev/null
+++ b/.claude/rules/multi-gpu-debugging.md
@@ -0,0 +1,463 @@
# Multi-GPU Debugging and Experimentation Rules

## Purpose

This rule governs GPU resource allocation and task execution strategy during debugging and experimentation on multi-GPU machines. The goal is to maximize debugging efficiency by:
- Running long validations on minimal GPUs (1-2)
- Using remaining GPUs for parallel hypothesis exploration
- Executing only one task/dataset for full validation during debugging

---

## 1. 
Scenario Classification + +### 1.1 Long-Running Validation (Triggers Conservative Allocation) + +A task SHALL be classified as **long-running validation** if ANY of the following conditions apply: + +| Condition | Threshold | +|-----------|-----------| +| Estimated runtime | > 20 minutes | +| Sample count | > 50 samples per task | +| Full dataset execution | Any complete validation.jsonl | +| Full training/fine-tuning | Any training run | +| Large-scale inference | > 10K tokens total | + +**Examples:** +- Running all 100 samples of `niah_single_1` +- Full RULER benchmark (13 tasks × 100 samples) +- Complete model evaluation on any benchmark + +### 1.2 Exploratory / Fast-Iteration Work (Allows Full GPU Use) + +A task SHALL be classified as **exploratory** if ALL of the following apply: + +| Condition | Threshold | +|-----------|-----------| +| Estimated runtime | < 10 minutes | +| Sample count | ≤ 10 samples | +| Purpose | Sanity check, minimal reproduction, hypothesis testing | + +**Examples:** +- Testing 3-5 specific error samples +- Single-batch inference for debugging +- Verifying a code fix on minimal input +- Profiling a single forward pass + +--- + +## 2. GPU Allocation Strategy + +### 2.1 Core Allocation Rules + +| Task Type | GPU Allocation | Remaining GPUs | +|-----------|----------------|----------------| +| Long-running validation | 1 GPU (default), max 2 GPUs | Reserved for exploration | +| Exploratory work | As needed, can use multiple | - | + +### 2.2 Mandatory Constraints + +1. **MUST NOT** occupy all available GPUs for a single long-running validation +2. **MUST** reserve at least 50% of GPUs (minimum 2) for parallel exploration when ≥4 GPUs available +3. **MUST** select GPUs based on this priority: + - Idle GPUs first (check with `nvidia-smi`) + - If load info unavailable, use lowest-numbered GPUs for validation +4. 
**MUST** avoid resource conflicts: + - Each task uses unique `CUDA_VISIBLE_DEVICES` + - Each task uses unique output directories + - Log files include GPU ID in filename + +### 2.3 GPU Selection Algorithm + +``` +IF num_available_gpus >= 4: + validation_gpus = 1 (or 2 if justified) + exploration_gpus = remaining GPUs +ELSE IF num_available_gpus == 3: + validation_gpus = 1 + exploration_gpus = 2 +ELSE IF num_available_gpus == 2: + validation_gpus = 1 + exploration_gpus = 1 +ELSE: + validation_gpus = 1 + exploration_gpus = 0 (sequential exploration) +``` + +--- + +## 3. Task / Dataset Selection Policy + +### 3.1 Single-Task Validation Rule + +During debugging, when a long-running validation is required: + +- **MUST** execute only ONE task/dataset fully +- **MUST NOT** run all tasks unless explicitly requested or conditions in Section 4 are met + +### 3.2 Task Selection Priority + +Select the single task based on this priority order: + +| Priority | Criterion | Example | +|----------|-----------|---------| +| 1 | Task most likely to reproduce the bug | If error occurs in `niah_single_1`, use that | +| 2 | Smallest task covering critical paths | `niah_single_1` (100 samples) vs `niah_multikey_3` | +| 3 | Task with known error samples | Use task with documented failure cases | +| 4 | Most representative task | Single-key before multi-key for basic validation | + +### 3.3 Other Tasks Handling + +Tasks not selected for full validation: +- **MAY** receive lightweight sanity checks (≤5 samples) +- **MUST NOT** receive full end-to-end execution by default +- **SHOULD** be noted in execution plan for future validation + +--- + +## 4. 
Scale-Up Conditions + +Expansion to more GPUs or multiple full tasks is **ALLOWED ONLY IF**: + +| Condition | Justification Required | +|-----------|------------------------| +| Single-task validation completed successfully | Confirm fix works on one task first | +| Critical bug identified and fixed | Need cross-task verification | +| Cross-dataset consistency required | Clear technical justification needed | +| User explicitly requests full-scale | User override | + +### 4.1 Default Behavior + +- **DEFAULT**: Conservative, non-expansive +- **MUST** ask for confirmation before scaling up +- **MUST** document reason for scale-up in execution plan + +--- + +## 5. Execution Plan Transparency + +### 5.1 Mandatory Pre-Execution Output + +Before starting any validation, **MUST** output an execution plan containing: + +```markdown +## Execution Plan + +### Task Classification +- Type: [Long-running validation / Exploratory] +- Reason: [Why classified this way] + +### GPU Allocation +- Validation GPU(s): [GPU IDs] +- Reason: [Why these GPUs selected] +- Exploration GPU(s): [GPU IDs] +- Exploration tasks: [List of parallel hypotheses to test] + +### Task Selection +- Full validation task: [Task name] +- Reason: [Why this task selected] +- Other tasks: [Skipped / Sanity-check only] + +### Stopping Criteria +- Time limit: [X minutes] +- Success metric: [e.g., accuracy > 90%] +- Error threshold: [e.g., stop if >20 samples fail] + +### Expected Output +- [What results will be produced] +``` + +### 5.2 Progress Checkpoints + +For long-running validations, **SHOULD** report progress at: +- 25% completion +- 50% completion +- 75% completion +- Final results + +--- + +## 6. 
Configuration Defaults + +### 6.1 Default Parameters + +| Parameter | Default Value | Description | +|-----------|---------------|-------------| +| `LONG_RUNNING_THRESHOLD_MINUTES` | 20 | Runtime threshold for classification | +| `LONG_RUNNING_SAMPLE_THRESHOLD` | 50 | Sample count threshold | +| `MAX_VALIDATION_GPUS` | 2 | Maximum GPUs for long validation | +| `MIN_EXPLORATION_GPUS` | 2 | Minimum GPUs reserved for exploration (when ≥4 available) | +| `EXPLORATION_SAMPLE_LIMIT` | 10 | Max samples for exploratory tests | +| `SANITY_CHECK_SAMPLES` | 5 | Samples for non-selected tasks | + +### 6.2 User Override + +Users can override defaults by specifying in their request: +- "Use all GPUs for validation" +- "Run all tasks" +- "Increase validation GPUs to N" + +--- + +## 7. Async Monitoring (CRITICAL) + +### 7.1 Non-Blocking Principle + +**MUST NOT** block the main agent with `sleep` commands waiting for results: +- ❌ `sleep 300 && check_results` (blocks main agent) +- ✅ Launch background tasks, continue thinking, check periodically + +### 7.2 Continuous GPU Utilization + +**MUST** maximize GPU utilization: +- When an agent completes a task, immediately assign new work +- Use `run_in_background: true` for all long-running agents +- Check agent completion via system notifications, not polling + +### 7.3 Monitoring Strategy + +``` +CORRECT PATTERN: +1. Launch agents in background with run_in_background: true +2. Continue analysis, planning, or hypothesis generation +3. When agent completion notification arrives, process results +4. Immediately assign new tasks to freed GPUs + +WRONG PATTERN: +1. Launch agents +2. sleep 300 # BLOCKS EVERYTHING! +3. Check results +4. 
GPU sits idle during sleep +``` + +### 7.4 Between-Task Work + +While waiting for agents, the main agent SHOULD: +- Analyze code for additional hypotheses +- Prepare next batch of tests +- Update documentation with interim findings +- Plan fix implementations based on emerging patterns + +### 7.5 Idle GPU Utilization (CRITICAL) + +**MUST** utilize idle GPUs for exploratory tests while waiting: + +``` +WRONG PATTERN: +1. Launch 2 agents on GPU 0-1 +2. Wait for completion ← GPU 2-5 sit idle! +3. Process results + +CORRECT PATTERN: +1. Launch 2 agents on GPU 0-1 for main validation +2. IMMEDIATELY launch exploratory tests on GPU 2-5: + - Test alternative configurations + - Verify edge cases + - Run sanity checks on other datasets + - Profile performance bottlenecks +3. Continue spawning new tasks as GPUs become free +4. Process results as they arrive +``` + +**Idle GPU Detection**: +```bash +# Check which GPUs are free +nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv +``` + +**Exploratory Test Ideas** (when main validation is running): + +| GPU State | Suggested Task | +|-----------|----------------| +| Idle during single-task validation | Test same task with different config | +| Idle after quick test completes | Run related task (e.g., multikey after single-key) | +| Idle during long benchmark | Run profiling or memory analysis | +| Multiple GPUs idle | Parallelize hypothesis testing | + +**Anti-Pattern**: +- ❌ "I'll wait for the 100-sample test to finish before doing anything else" +- ✅ "While GPU 0-1 run the 100-sample test, I'll use GPU 2-5 to test configs X, Y, Z" + +--- + +## 8. 
Code Modification Policy (CRITICAL) + +### 8.1 Evidence-Before-Action Principle + +**MUST NOT** modify code until sufficient evidence has been gathered: + +| Phase | Action | Code Modification | +|-------|--------|-------------------| +| Hypothesis Formation | Identify potential causes | ❌ NO | +| Evidence Gathering | Run targeted tests | ❌ NO | +| Pattern Analysis | Analyze test results | ❌ NO | +| Root Cause Confirmation | Validate with multiple tests | ❌ NO | +| Solution Design | Design fix based on evidence | ❌ NO | +| **Implementation** | Apply targeted fix | ✅ YES | + +### 8.2 Minimum Evidence Requirements + +Before proposing ANY code modification: + +1. **Reproducibility**: Bug must be reproducible with specific test cases +2. **Isolation**: Root cause must be isolated (not symptoms) +3. **Multiple Data Points**: At least 3 independent test runs confirming the issue +4. **Counter-Evidence**: Attempted to disprove the hypothesis +5. **Mechanism Understanding**: Clear understanding of WHY the bug occurs + +### 8.3 Main Agent Behavior + +The main agent **SHOULD**: +- Keep thinking and analyzing while background agents run tests +- Formulate and refine hypotheses based on incoming results +- Document findings in `findings.md` as evidence accumulates +- Wait for sufficient test coverage before proposing fixes + +The main agent **MUST NOT**: +- Rush to modify code after seeing first failure +- Propose fixes based on speculation +- Change multiple things at once "just to be safe" +- Assume correlation implies causation + +### 8.4 Evidence Documentation Template + +Before any code modification, document in `findings.md`: + +```markdown +## Proposed Fix: [Brief Description] + +### Evidence Summary +- Test A: [Result] - supports/contradicts hypothesis +- Test B: [Result] - supports/contradicts hypothesis +- Test C: [Result] - supports/contradicts hypothesis + +### Root Cause Analysis +- What: [Specific bug behavior] +- Where: [File:line or function] +- Why: [Mechanism 
explanation] +- Confidence: [High/Medium/Low] + +### Alternative Explanations Ruled Out +1. [Alternative A]: Ruled out because [reason] +2. [Alternative B]: Ruled out because [reason] + +### Proposed Change +- File: [path] +- Change: [description] +- Expected Impact: [what should improve] +``` + +### 8.5 Anti-Patterns + +| Don't | Do Instead | +|-------|------------| +| See error → immediately edit code | See error → gather more data → analyze → then edit | +| Fix based on single test failure | Reproduce failure 3+ times, understand pattern | +| Change code "to see what happens" | Form hypothesis first, design targeted experiment | +| Modify multiple files simultaneously | Isolate changes, verify each independently | +| Skip documentation of findings | Document every significant finding before changing code | + +--- + +## 9. Example Scenario + +### Setup +- **Machine**: 8 GPUs (GPU 0-7) +- **Task**: Debug RULER chunked attention 20% error rate +- **Available tasks**: 6 RULER tasks (niah_single_1/2/3, niah_multikey_1/2/3) +- **Estimated full validation time**: ~2 hours for all tasks + +### Execution Plan Output + +```markdown +## Execution Plan + +### Task Classification +- Type: Long-running validation +- Reason: Full validation of 100 samples × 6 tasks would take ~2 hours + +### GPU Allocation +- Validation GPU(s): GPU 0 (1 GPU) +- Reason: Single GPU sufficient for sequential 100-sample validation +- Exploration GPU(s): GPU 1, 2, 3, 4, 5, 6, 7 (7 GPUs) +- Exploration tasks: + 1. GPU 1: Test 2-slot vs 4-slot ring buffer on error samples + 2. GPU 2: Test N-way merge implementation + 3. GPU 3: Test LSE precision fix + 4. GPU 4: Profile merge accumulation error + 5. GPU 5: Test with ruler_64k dataset (5 samples) + 6. GPU 6: Test decode boundary conditions + 7. 
GPU 7: Reserved for ad-hoc hypothesis testing + +### Task Selection +- Full validation task: niah_single_1 +- Reason: Has documented error samples (19 known failures), smallest single-key task +- Other tasks: Sanity-check only (5 samples each) after fix verified + +### Stopping Criteria +- Time limit: 60 minutes for full validation +- Success metric: Error rate < 10% (down from 20%) +- Error threshold: Pause if new error pattern emerges (>5 consecutive failures) + +### Expected Output +- Accuracy comparison: before vs after fix +- Error sample analysis: which samples still fail +- Hypothesis validation: which exploration branch identified the fix +``` + +### Execution Flow + +1. **GPU 0**: Runs full `niah_single_1` validation (100 samples, ~40 min) +2. **GPU 1-7**: Run parallel exploration tasks (each ~5-15 min) +3. **Checkpoint at 50%**: Report GPU 0 progress + any discoveries from exploration +4. **On discovery**: If exploration GPU finds fix, pause validation, apply fix, restart +5. **Completion**: Report final results, decide if scale-up needed + +--- + +## 10. Quick Reference Checklist + +Before starting any debugging validation: + +- [ ] Classified task type? (Long-running vs Exploratory) +- [ ] If long-running: Limited to 1-2 GPUs? +- [ ] If long-running: Selected single task for full validation? +- [ ] Remaining GPUs allocated for exploration? +- [ ] Execution plan output with all required sections? +- [ ] Stopping criteria defined? +- [ ] No user override requested? (Default conservative behavior) + +Before proposing any code modification: + +- [ ] Bug reproducible with specific test cases? +- [ ] Root cause isolated (not just symptoms)? +- [ ] At least 3 independent test runs confirming the issue? +- [ ] Alternative explanations ruled out? +- [ ] Mechanism of bug clearly understood? +- [ ] Evidence documented in findings.md? + +--- + +## 11. Rule Violations + +The following actions **VIOLATE** this rule: + +1. 
Using all 6+ GPUs for a single 100-sample validation
2. Running full validation on all tasks without completing single-task validation first
3. Starting a long validation without outputting an execution plan
4. Not reserving GPUs for exploration when ≥4 GPUs are available
5. Scaling up without meeting the conditions in Section 4
6. **Modifying code before gathering sufficient evidence** (Section 8)
7. Proposing fixes based on a single test failure or speculation
8. Changing multiple code locations simultaneously without isolation testing

---

## 12. Integration with Other Rules

This rule works alongside:
- `gpu-testing.md`: GPU type detection and basic allocation
- `planning-with-files.md`: Progress tracking for long validations
- `testing.md`: Test script conventions

When conflicts arise, this rule takes precedence for debugging scenarios.

diff --git a/CLAUDE.md b/CLAUDE.md
index c0f4621..716e9db 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -23,6 +23,15 @@ Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline
| [`docs/ruler_32k_chunked_offload_issue.md`](docs/ruler_32k_chunked_offload_issue.md) | ⚠️ OPEN ISSUE: 32K chunked offload accuracy problem (20% error rate in RULER) |
| [`docs/chunked_attention_solutions.md`](docs/chunked_attention_solutions.md) | 🔧 SOLUTIONS: Code analysis and solutions for the chunked attention accuracy issue |

## Rules Index

| Rule | Purpose |
|------|---------|
| [`.claude/rules/multi-gpu-debugging.md`](.claude/rules/multi-gpu-debugging.md) | **Multi-GPU debugging**: GPU allocation (1-2 for validation, rest for exploration), single-task validation policy |
| [`.claude/rules/gpu-testing.md`](.claude/rules/gpu-testing.md) | GPU type detection, card assignment, needle test requirements |
| [`.claude/rules/sparse-policy.md`](.claude/rules/sparse-policy.md) | SparsePolicy implementation requirements |
| [`.claude/rules/planning-with-files.md`](.claude/rules/planning-with-files.md) | Planning file management for complex tasks |

## GPU Mutex for 
Multi-Instance Debugging **IMPORTANT**: When running multiple Claude instances for parallel debugging, different rules apply based on script type: