📝 docs: update offload accuracy issue with independent testing results
Document key finding: single request inference works correctly (100% accuracy). The 66% accuracy issue in batch mode is due to state accumulation between sequential requests in the same process. - Add comparison table: independent (100%) vs batch (66%) testing modes - Document root cause analysis: state cleanup issue between requests - Add workaround using test_ruler_niah.sh for independent testing - Update next steps to focus on OffloadEngine reset/cleanup logic Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
@@ -2,14 +2,15 @@
|
||||
|
||||
## Problem Summary
|
||||
|
||||
CPU offload mode produces significantly lower accuracy than non-offload mode on the RULER NIAH benchmark.
|
||||
**UPDATE (2026-01-12)**: Single request inference works correctly! The issue is with batch/sequential request handling.
|
||||
|
||||
| Mode | Accuracy | Pass/Total |
|
||||
|------|----------|------------|
|
||||
| **Non-Offload (GPU only)** | **100%** | 100/100 |
|
||||
| **CPU Offload** | **66%** | 66/100 |
|
||||
| Mode | Testing Method | Accuracy |
|
||||
|------|----------------|----------|
|
||||
| **CPU Offload** | **Independent** (1 request per process) | **100%** ✓ |
|
||||
| **CPU Offload** | Batch (multiple requests per process) | 66% ✗ |
|
||||
| **Non-Offload** | Batch | 100% ✓ |
|
||||
|
||||
This 34% accuracy drop indicates a bug in the offload implementation that affects inference correctness.
|
||||
**Conclusion**: The offload implementation is correct for single requests. The bug is in state cleanup between sequential requests within the same process.
|
||||
|
||||
## Test Environment
|
||||
|
||||
@@ -223,17 +224,83 @@ CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py
|
||||
|
||||
## Test Results Log
|
||||
|
||||
**Date**: 2025-01-12
|
||||
### 2026-01-12 (Updated - Independent Testing)
|
||||
|
||||
**Key Finding**: When each sample is tested independently (separate Python process per sample), CPU offload achieves **100% accuracy**.
|
||||
|
||||
| Test | Mode | Testing Method | Samples | Passed | Accuracy |
|
||||
|------|------|----------------|---------|--------|----------|
|
||||
| RULER NIAH 32K | CPU Offload | **Independent** (separate process) | 100 | 100 | **100%** |
|
||||
| RULER NIAH 32K | CPU Offload | Batch (single process) | 100 | 66 | 66% |
|
||||
| RULER NIAH 32K | Non-Offload | Batch (single process) | 100 | 100 | 100% |
|
||||
|
||||
**Test Configuration (Independent Mode)**:
|
||||
- GPUs: 4x RTX 3090 (parallel testing)
|
||||
- Each sample: Fresh Python process with new LLM instance
|
||||
- Port: Each GPU uses unique port (2333+gpu_id)
|
||||
- Duration: 17.9 minutes for 100 samples
|
||||
- Throughput: 5.58 samples/min
|
||||
|
||||
### 2025-01-12 (Original - Batch Testing)
|
||||
|
||||
| Test | Mode | Samples | Passed | Accuracy |
|
||||
|------|------|---------|--------|----------|
|
||||
| RULER NIAH 32K | Non-Offload | 100 | 100 | 100% |
|
||||
| RULER NIAH 32K | CPU Offload | 100 | 66 | 66% |
|
||||
|
||||
## Root Cause Analysis Update
|
||||
|
||||
### Confirmed: Single Request Inference is Correct
|
||||
|
||||
The 100% accuracy in independent testing mode confirms that:
|
||||
1. **Single request inference works correctly** - The offload engine, ring buffer, and chunked prefill are functioning properly for individual requests
|
||||
2. **The bug is in batch/sequential request handling** - State accumulation or incomplete cleanup between requests causes failures
|
||||
|
||||
### Suspected Issue: State Accumulation Between Requests
|
||||
|
||||
When multiple requests are processed in the same Python process:
|
||||
- The first request succeeds (e.g., Sample 0: PASS)
|
||||
- Subsequent requests may fail due to:
|
||||
- Residual state in ring buffer
|
||||
- Incomplete KV cache cleanup
|
||||
- Position tracking errors across requests
|
||||
- CPU block allocation fragmentation
|
||||
|
||||
### Evidence
|
||||
|
||||
From batch mode testing (5 samples):
|
||||
| Sample | Expected | Output | Status |
|
||||
|--------|----------|--------|--------|
|
||||
| 0 | 8930103 | `: 8930103.` | PASS (first request) |
|
||||
| 1 | 4194548 | `: 419 multiplication of 4548.` | **FAIL** (second request) |
|
||||
| 2 | 8231838 | `:ное 8231838.` | PASS |
|
||||
| 3 | 8835373 | `: 8835373.` | PASS |
|
||||
| 4 | 7754864 | `aster 7754864.` | PASS |
|
||||
|
||||
The corrupted output in Sample 1 suggests interference from Sample 0's state.
|
||||
|
||||
## Workaround
|
||||
|
||||
Use independent testing mode (separate process per request) for production evaluation:
|
||||
|
||||
```bash
|
||||
# Using test_ruler_niah.sh for parallel independent testing
|
||||
./tests/test_ruler_niah.sh --gpus "0,1,2,3" --total 100
|
||||
|
||||
# Or manually run each sample in a separate process
|
||||
for i in $(seq 0 99); do
|
||||
CUDA_VISIBLE_DEVICES=0 python tests/test_ruler_niah.py \
|
||||
--enable-offload --sample-indices $i --quiet
|
||||
done
|
||||
```
|
||||
|
||||
## Next Steps
|
||||
|
||||
1. [ ] Identify pattern in failing samples (position of needle? specific numbers?)
|
||||
2. [ ] Add detailed logging to offload engine
|
||||
3. [ ] Compare logits between offload and non-offload modes
|
||||
4. [ ] Bisect the code to find the exact bug location
|
||||
5. [ ] Write unit test that isolates the bug
|
||||
1. [x] ~~Identify pattern in failing samples~~ → Pattern: First sample usually passes, failures occur in subsequent samples
|
||||
2. [ ] **Investigate state cleanup between requests in offload mode**
|
||||
- Check `OffloadEngine` reset/cleanup logic
|
||||
- Check ring buffer state between requests
|
||||
- Check CPU block manager cleanup
|
||||
3. [ ] Add `reset()` method to `OffloadEngine` for explicit state cleanup
|
||||
4. [ ] Compare state between first and second request in batch mode
|
||||
5. [ ] Write unit test that reproduces the batch mode failure
|
||||
|
||||
Reference in New Issue
Block a user