From 49519c7ce73ad21b3db9348f5fe9b29fc1d8d2b5 Mon Sep 17 00:00:00 2001
From: Zijie Tian
Date: Mon, 12 Jan 2026 21:08:35 +0800
Subject: [PATCH] =?UTF-8?q?=F0=9F=93=9D=20docs:=20update=20offload=20accur?=
 =?UTF-8?q?acy=20issue=20with=20independent=20testing=20results?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Document key finding: single request inference works correctly (100%
accuracy). The 66% accuracy issue in batch mode is due to state
accumulation between sequential requests in the same process.

- Add comparison table: independent (100%) vs batch (66%) testing modes
- Document root cause analysis: state cleanup issue between requests
- Add workaround using test_ruler_niah.sh for independent testing
- Update next steps to focus on OffloadEngine reset/cleanup logic

Co-Authored-By: Claude Opus 4.5
---
 docs/offload_accuracy_issue.md | 91 +++++++++++++++++++++++++++++-----
 1 file changed, 79 insertions(+), 12 deletions(-)

diff --git a/docs/offload_accuracy_issue.md b/docs/offload_accuracy_issue.md
index febadea..289bc39 100644
--- a/docs/offload_accuracy_issue.md
+++ b/docs/offload_accuracy_issue.md
@@ -2,14 +2,15 @@
 
 ## Problem Summary
 
-CPU offload mode produces significantly lower accuracy than non-offload mode on the RULER NIAH benchmark.
+**UPDATE (2026-01-12)**: Single request inference works correctly! The issue is with batch/sequential request handling.
 
-| Mode | Accuracy | Pass/Total |
-|------|----------|------------|
-| **Non-Offload (GPU only)** | **100%** | 100/100 |
-| **CPU Offload** | **66%** | 66/100 |
+| Mode | Testing Method | Accuracy |
+|------|----------------|----------|
+| **CPU Offload** | **Independent** (1 request per process) | **100%** ✓ |
+| **CPU Offload** | Batch (multiple requests per process) | 66% ✗ |
+| **Non-Offload** | Batch | 100% ✓ |
 
-This 34% accuracy drop indicates a bug in the offload implementation that affects inference correctness. 
+**Conclusion**: The offload implementation is correct for single requests. The bug is in state cleanup between sequential requests within the same process.
 
 ## Test Environment
 
@@ -223,17 +224,83 @@ CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py
 
 ## Test Results Log
 
-**Date**: 2025-01-12
+### 2026-01-12 (Updated - Independent Testing)
+
+**Key Finding**: When each sample is tested independently (separate Python process per sample), CPU offload achieves **100% accuracy**.
+
+| Test | Mode | Testing Method | Samples | Passed | Accuracy |
+|------|------|----------------|---------|--------|----------|
+| RULER NIAH 32K | CPU Offload | **Independent** (separate process) | 100 | 100 | **100%** |
+| RULER NIAH 32K | CPU Offload | Batch (single process) | 100 | 66 | 66% |
+| RULER NIAH 32K | Non-Offload | Batch (single process) | 100 | 100 | 100% |
+
+**Test Configuration (Independent Mode)**:
+- GPUs: 4x RTX 3090 (parallel testing)
+- Each sample: Fresh Python process with new LLM instance
+- Port: Each GPU uses unique port (2333+gpu_id)
+- Duration: 17.9 minutes for 100 samples
+- Throughput: 5.58 samples/min
+
+### 2026-01-12 (Original - Batch Testing)
 
 | Test | Mode | Samples | Passed | Accuracy |
 |------|------|---------|--------|----------|
 | RULER NIAH 32K | Non-Offload | 100 | 100 | 100% |
 | RULER NIAH 32K | CPU Offload | 100 | 66 | 66% |
 
+## Root Cause Analysis Update
+
+### Confirmed: Single Request Inference is Correct
+
+The 100% accuracy in independent testing mode confirms that:
+1. **Single request inference works correctly** - The offload engine, ring buffer, and chunked prefill are functioning properly for individual requests
+2. **The bug is in batch/sequential request handling** - State accumulation or incomplete cleanup between requests causes failures
+
+### Suspected Issue: State Accumulation Between Requests
+
+When multiple requests are processed in the same Python process:
+- The first request succeeds (e.g., Sample 0: PASS)
+- Subsequent requests may fail due to:
+  - Residual state in ring buffer
+  - Incomplete KV cache cleanup
+  - Position tracking errors across requests
+  - CPU block allocation fragmentation
+
+### Evidence
+
+From batch mode testing (5 samples):
+| Sample | Expected | Output | Status |
+|--------|----------|--------|--------|
+| 0 | 8930103 | `: 8930103.` | PASS (first request) |
+| 1 | 4194548 | `: 419 multiplication of 4548.` | **FAIL** (second request) |
+| 2 | 8231838 | `:ное 8231838.` | PASS |
+| 3 | 8835373 | `: 8835373.` | PASS |
+| 4 | 7754864 | `aster 7754864.` | PASS |
+
+The corrupted output in Sample 1 suggests interference from Sample 0's state; the stray tokens in Samples 2 and 4 (`ное`, `aster`) point the same way, though the correct digits survive there.
+
+## Workaround
+
+Use independent testing mode (separate process per request) for production evaluation:
+
+```bash
+# Using test_ruler_niah.sh for parallel independent testing
+./tests/test_ruler_niah.sh --gpus "0,1,2,3" --total 100
+
+# Or manually run each sample in a separate process
+for i in $(seq 0 99); do
+  CUDA_VISIBLE_DEVICES=0 python tests/test_ruler_niah.py \
+    --enable-offload --sample-indices $i --quiet
+done
+```
+
 ## Next Steps
 
-1. [ ] Identify pattern in failing samples (position of needle? specific numbers?)
-2. [ ] Add detailed logging to offload engine
-3. [ ] Compare logits between offload and non-offload modes
-4. [ ] Bisect the code to find the exact bug location
-5. [ ] Write unit test that isolates the bug
+1. [x] ~~Identify pattern in failing samples~~ → Pattern: the first sample usually passes; failures occur in subsequent samples
+2. [ ] **Investigate state cleanup between requests in offload mode**
+   - Check `OffloadEngine` reset/cleanup logic
+   - Check ring buffer state between requests
+   - Check CPU block manager cleanup
+3. [ ] Add `reset()` method to `OffloadEngine` for explicit state cleanup
+4. [ ] Compare state between first and second request in batch mode
+5. [ ] Write unit test that reproduces the batch mode failure
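
The state-accumulation hypothesis in the patch above can be sketched with a self-contained toy model: a ring-buffer write position that survives between sequential requests, while reads assume each request starts from a clean buffer. All names here (`ToyOffloadEngine`, `prefill`, `decode`, `reset`) are hypothetical stand-ins for illustration, not the project's actual `OffloadEngine` API:

```python
# Toy model of the suspected bug: residual ring-buffer state between
# sequential requests. Hypothetical sketch only -- NOT the real engine.

class ToyOffloadEngine:
    def __init__(self, num_slots):
        self.num_slots = num_slots
        self.slots = [None] * num_slots  # stand-in for offloaded CPU KV blocks
        self.write_pos = 0               # residual state between requests

    def prefill(self, tokens):
        """Write one 'KV block' per token, continuing from write_pos."""
        for i, tok in enumerate(tokens):
            self.slots[(self.write_pos + i) % self.num_slots] = tok
        self.write_pos = (self.write_pos + len(tokens)) % self.num_slots

    def decode(self, length):
        """Models the bug: reads assume the request's blocks start at
        slot 0, which only holds for the first request after a clean start."""
        return [self.slots[i % self.num_slots] for i in range(length)]

    def reset(self):
        """Proposed fix: explicit state cleanup between sequential requests."""
        self.slots = [None] * self.num_slots
        self.write_pos = 0


# Two sequential requests in the same process, no cleanup in between.
dirty = ToyOffloadEngine(num_slots=8)
dirty.prefill(list("893010"))
first = dirty.decode(6)            # first request reads its own blocks
dirty.prefill(list("419454"))      # second request wraps past stale blocks
second_dirty = dirty.decode(6)     # mixes new and stale data -> corrupted

# Same two requests with an explicit reset() between them.
clean = ToyOffloadEngine(num_slots=8)
clean.prefill(list("893010"))
clean.reset()                      # explicit cleanup between requests
clean.prefill(list("419454"))
second_clean = clean.decode(6)     # reads back exactly the second request
```

In this toy, the first request reads back exactly, while the second request's readback blends its own blocks with stale blocks from the first request, mirroring the observed symptom (Sample 0 exact, Sample 1 corrupted); an explicit `reset()` between requests, as proposed in Next Steps item 3, restores correct readback.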