# CPU Offload Accuracy Issue Investigation

## Problem Summary

**UPDATE (2026-01-12)**: Single request inference works correctly! The issue is with batch/sequential request handling.

| Mode | Testing Method | Accuracy |
|------|----------------|----------|
| **CPU Offload** | **Independent** (1 request per process) | **100%** ✓ |
| **CPU Offload** | Batch (multiple requests per process) | 66% ✗ |
| **Non-Offload** | Batch | 100% ✓ |

**Conclusion**: The offload implementation is correct for single requests. The bug is in state cleanup between sequential requests within the same process.

## Test Environment

- **Model**: Llama-3.1-8B-Instruct
- **Task**: RULER NIAH (Needle-In-A-Haystack), 32K context
- **GPU**: NVIDIA A100-SXM4-80GB
- **Data**: `tests/data/ruler_niah/niah_single_1_32k.jsonl` (100 samples)

## Reproduction Commands

### Non-Offload Mode (100% accuracy)

```bash
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --gpu-utilization 0.7 \
    --quiet
```

**Configuration**:
- KV Cache: GPU only, 51 blocks (6528 MB)
- Block size: 1024 tokens

### Offload Mode (66% accuracy)

```bash
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --enable-offload \
    --quiet
```

**Configuration**:
- KV Cache: GPU 4 blocks (512 MB) + CPU 32 blocks (4096 MB)
- Ring buffer: 4 buffers × 33280 tokens (520 MB)
- Per-layer decode buffer: 128 MB
- Block size: 1024 tokens

## Observed Failure Patterns

From the 5-sample verbose test:

| Sample | Expected | Offload Output | Status |
|--------|----------|----------------|--------|
| 0 | 8930103 | `: 8930103.` | PASS |
| 1 | 4194548 | `: 419 multiplication of 4548.` | **FAIL** |
| 2 | 8231838 | `:ное 8231838.` | PASS |
| 3 | 8835373 | `: 8835373.` | PASS |
| 4 | 7754864 | `aster 7754864.` | PASS |

**Failure pattern**: The model sometimes produces corrupted or split outputs (e.g., "419 multiplication of 4548" instead of "4194548"). Note that even passing samples emit stray tokens (`ное`, `aster`) before the answer, which hints at mild corruption beyond the outright failures.

## Architecture Overview

### Offload Mode Data Flow

```
Prefill Phase:
1. Input tokens → chunked into 2048-token chunks
2. Each chunk is processed layer by layer:
   - Load KV from CPU → GPU ring buffer
   - Compute attention
   - Store KV back to CPU
3. Ring buffer holds recent KV for decode

Decode Phase:
1. For each new token:
   - Load all layer KV from CPU (one layer at a time)
   - Compute attention against full context
   - Generate next token
```
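The chunking arithmetic above can be sketched as follows. This is an illustrative model only (the function name and chunk size constant are assumptions, not the nanovllm API); it shows the invariant the prefill loop must maintain: chunks tile the sequence exactly, and position IDs continue from where the previous chunk ended rather than restarting at 0.

```python
# Illustrative sketch of the chunked-prefill position bookkeeping,
# not the actual nanovllm implementation.
CHUNK_SIZE = 2048  # matches the 2048-token chunks described above

def chunk_positions(total_tokens: int, chunk_size: int = CHUNK_SIZE):
    """Yield (start, end) absolute position ranges for each prefill chunk.

    Position IDs for chunk N+1 must begin where chunk N ended;
    restarting them per chunk is exactly the kind of bug that
    silently corrupts attention over long contexts.
    """
    for start in range(0, total_tokens, chunk_size):
        yield start, min(start + chunk_size, total_tokens)

chunks = list(chunk_positions(5000))
# Chunks tile the sequence with no gaps or overlaps.
assert chunks == [(0, 2048), (2048, 4096), (4096, 5000)]
```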
### Key Components

| File | Component | Description |
|------|-----------|-------------|
| `nanovllm/kvcache/offload_engine.py` | `OffloadEngine` | Manages CPU↔GPU KV cache transfers |
| `nanovllm/kvcache/offload_engine.py` | `RingKVBuffer` | GPU ring buffer for recent KV |
| `nanovllm/engine/model_runner.py` | `run_chunked_offload_prefill()` | Chunked prefill with offload |
| `nanovllm/engine/model_runner.py` | `run_offload_decode()` | Layer-wise decode with offload |
| `nanovllm/kvcache/hybrid_manager.py` | `HybridBlockManager` | CPU block allocation |

## Potential Root Causes

### 1. Ring Buffer Index/Position Issues

**Location**: `nanovllm/kvcache/offload_engine.py`

The ring buffer uses modular indexing. Potential issues:
- Position calculation errors during the prefill/decode transition
- Off-by-one errors in KV storage/retrieval
- Incorrect handling when the sequence length approaches `max_seq_len`

**Recent fix applied**: `max_seq_len = max_model_len + 512` to prevent overflow, but there may be other indexing issues.
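To make the off-by-one risk concrete, here is a minimal model of a modular ring buffer. It is a hedged sketch, not `RingKVBuffer` itself: it only shows the invariant that reads must satisfy (a position is resident only while it is within `capacity` of the write head), which is where wrap-around and transition bugs typically hide.

```python
# Minimal model of a modular ring buffer (illustrative; not RingKVBuffer).
class Ring:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.buf = [None] * capacity
        self.write_pos = 0  # absolute count of positions written so far

    def store(self, value):
        # New entries overwrite the oldest slot once the buffer wraps.
        self.buf[self.write_pos % self.capacity] = value
        self.write_pos += 1

    def load(self, pos: int):
        # Only the most recent `capacity` positions are still resident;
        # reading outside this window returns overwritten (stale) data.
        assert self.write_pos - self.capacity <= pos < self.write_pos
        return self.buf[pos % self.capacity]

r = Ring(4)
for t in range(6):        # write positions 0..5; slots 0,1 get overwritten
    r.store(t)
assert r.load(5) == 5     # recent positions read back correctly
assert r.load(2) == 2     # oldest still-resident position
```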
### 2. Chunked Prefill KV Storage

**Location**: `nanovllm/engine/model_runner.py:run_chunked_offload_prefill()`

During chunked prefill:
- KV computed for chunk N must be correctly stored before chunk N+1 is processed
- Position IDs must be correctly accumulated across chunks
- CPU block allocation must be contiguous and correctly tracked

**Suspect areas**:
```python
# Check if positions are correctly tracked across chunks
# Check if KV is correctly copied to CPU after each chunk
# Check if ring buffer indices align with CPU block indices
```
### 3. Decode Phase KV Loading

**Location**: `nanovllm/engine/model_runner.py:run_offload_decode()`

During decode:
- Must load KV for ALL previous tokens (both prefill and decode)
- Layer-by-layer loading must be synchronized correctly
- Attention computation must use the correct sequence length

**Suspect areas**:
```python
# Check if decode loads KV for full context length
# Check if new decode KV is stored correctly
# Check if attention mask/positions are correct
```
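The first bullet above is an exact arithmetic invariant, sketched here for clarity (the function name is invented for illustration, not a nanovllm API): at 0-based decode step `i`, attention must see KV for the full prompt plus every previously decoded token.

```python
# Illustrative invariant, not nanovllm code: how many KV entries
# must be loaded when generating decode token `decode_step` (0-based).
def expected_kv_len(prompt_len: int, decode_step: int) -> int:
    """Full prompt plus all previously decoded tokens."""
    return prompt_len + decode_step

# The first decode token attends over the prompt only; each later step
# must also see earlier decode KV. Forgetting to include newly stored
# decode KV (an off-by-one here) silently truncates the context.
assert expected_kv_len(32768, 0) == 32768
assert expected_kv_len(32768, 3) == 32771
```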
### 4. CPU↔GPU Transfer Synchronization

**Location**: `nanovllm/kvcache/offload_engine.py`

CUDA streams and synchronization:
- Async copies may complete out of order
- Missing synchronization points could cause stale data
- Stream priorities may affect correctness

### 5. Numerical Precision

- CPU tensors use float16/bfloat16
- GPU computation precision
- Potential precision loss during transfers

## Debugging Strategy

### Step 1: Identify Failing Samples

```bash
# Run verbose mode to see which samples fail
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --enable-offload \
    --verbose 2>&1 | tee offload_verbose.log
```

### Step 2: Compare Token-by-Token

Create a debug script to compare token generation between offload and non-offload modes for a failing sample:

```python
# Compare logits at each decode step
# Check if divergence starts at a specific position
# Log KV cache contents at divergence point
```
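The core of such a script is the divergence search, sketched below. This is a hedged, self-contained example using plain lists; in practice the inputs would be per-step token IDs (or argmax'd logits) captured from the two runs.

```python
# Find the first decode step where the offload run departs from the
# reference (non-offload) run. Works on token IDs or any comparable
# per-step values; unequal lengths count as divergence at the shorter end.
from itertools import zip_longest

def first_divergence(ref, offload):
    """Return the index of the first differing element, or None if equal."""
    for i, (a, b) in enumerate(zip_longest(ref, offload)):
        if a != b:
            return i
    return None

ref_tokens     = [11, 42, 7, 8930, 103, 13]
offload_tokens = [11, 42, 7, 419, 9999, 13]
assert first_divergence(ref_tokens, offload_tokens) == 3
assert first_divergence(ref_tokens, ref_tokens) is None
```

Once the first divergent step is known, logging the KV cache contents and position IDs at exactly that step (Steps 3 and 4 below) narrows the bug to either storage or retrieval.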
### Step 3: Verify KV Cache Contents

Add debugging to `OffloadEngine`:

```python
# In store_kv(): Log what's being stored
# In load_kv(): Log what's being loaded
# Compare loaded KV with expected values
```

### Step 4: Check Position/Index Calculations

```python
# Log ring buffer write/read positions
# Log CPU block indices
# Verify position IDs match actual token positions
```

### Step 5: Isolate the Bug

1. Test with shorter sequences (16K, 8K) to see if the issue is length-dependent
2. Test with a single chunk (no chunking) to isolate chunked prefill
3. Test prefill-only (no decode) to isolate the decode phase

## Quick Debugging Commands

```bash
# Test single failing sample with verbose output
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --enable-offload \
    --sample-indices 1 \
    --verbose

# Test with different context lengths
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --enable-offload \
    --max-model-len 16384 \
    --verbose
```

## Related Documentation

- [`docs/ruler_niah_standalone_test.md`](ruler_niah_standalone_test.md) - Test setup and background
- [`docs/layerwise_offload_memory_analysis.md`](layerwise_offload_memory_analysis.md) - Memory analysis (if it exists)

## Test Results Log

### 2026-01-12 (Updated - Independent Testing)

**Key Finding**: When each sample is tested independently (a separate Python process per sample), CPU offload achieves **100% accuracy**.

| Test | Mode | Testing Method | Samples | Passed | Accuracy |
|------|------|----------------|---------|--------|----------|
| RULER NIAH 32K | CPU Offload | **Independent** (separate process) | 100 | 100 | **100%** |
| RULER NIAH 32K | CPU Offload | Batch (single process) | 100 | 66 | 66% |
| RULER NIAH 32K | Non-Offload | Batch (single process) | 100 | 100 | 100% |

**Test Configuration (Independent Mode)**:
- GPUs: 4x RTX 3090 (parallel testing)
- Each sample: fresh Python process with a new LLM instance
- Port: each GPU uses a unique port (2333 + gpu_id)
- Duration: 17.9 minutes for 100 samples
- Throughput: 5.58 samples/min

### 2025-01-12 (Original - Batch Testing)

| Test | Mode | Samples | Passed | Accuracy |
|------|------|---------|--------|----------|
| RULER NIAH 32K | Non-Offload | 100 | 100 | 100% |
| RULER NIAH 32K | CPU Offload | 100 | 66 | 66% |

## Root Cause Analysis Update

### Confirmed: Single Request Inference is Correct

The 100% accuracy in independent testing mode confirms that:
1. **Single request inference works correctly** - the offload engine, ring buffer, and chunked prefill are functioning properly for individual requests
2. **The bug is in batch/sequential request handling** - state accumulation or incomplete cleanup between requests causes failures

### Suspected Issue: State Accumulation Between Requests

When multiple requests are processed in the same Python process:
- The first request succeeds (e.g., Sample 0: PASS)
- Subsequent requests may fail due to:
  - Residual state in the ring buffer
  - Incomplete KV cache cleanup
  - Position tracking errors across requests
  - CPU block allocation fragmentation
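The failure mode can be sketched as follows. `OffloadState` and its fields are invented for illustration (they are not the actual `OffloadEngine` attributes); the point is that any per-request counter or allocation list that survives into the next request shifts that request's positions and block indices.

```python
# Hedged sketch of cross-request state accumulation; not nanovllm code.
from dataclasses import dataclass, field

@dataclass
class OffloadState:
    write_pos: int = 0                              # ring-buffer write offset
    cpu_blocks: list = field(default_factory=list)  # allocated CPU block ids

    def run_request(self, num_tokens: int, block_size: int = 1024) -> int:
        # Simulate prefill: advance positions and allocate CPU blocks.
        n_blocks = (num_tokens + block_size - 1) // block_size
        self.cpu_blocks += [len(self.cpu_blocks) + i for i in range(n_blocks)]
        self.write_pos += num_tokens
        return self.write_pos  # final position seen by this request

    def reset(self):
        # What an explicit per-request cleanup would guarantee.
        self.write_pos = 0
        self.cpu_blocks.clear()

state = OffloadState()
assert state.run_request(2048) == 2048   # first request: positions start at 0
assert state.run_request(2048) == 4096   # second request inherits a stale offset
state.reset()
assert state.run_request(2048) == 2048   # fresh process / explicit reset is correct
```

This matches the observed pattern: a fresh process per sample (implicit reset) gives 100% accuracy, while sequential requests in one process do not.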
### Evidence

From batch mode testing (5 samples):

| Sample | Expected | Output | Status |
|--------|----------|--------|--------|
| 0 | 8930103 | `: 8930103.` | PASS (first request) |
| 1 | 4194548 | `: 419 multiplication of 4548.` | **FAIL** (second request) |
| 2 | 8231838 | `:ное 8231838.` | PASS |
| 3 | 8835373 | `: 8835373.` | PASS |
| 4 | 7754864 | `aster 7754864.` | PASS |

The corrupted output in Sample 1 suggests interference from Sample 0's state.

## Workaround

Use independent testing mode (a separate process per request) for production evaluation:

```bash
# Use test_ruler_niah.sh for parallel independent testing
./tests/test_ruler_niah.sh --gpus "0,1,2,3" --total 100

# Or manually run each sample in a separate process
for i in $(seq 0 99); do
    CUDA_VISIBLE_DEVICES=0 python tests/test_ruler_niah.py \
        --enable-offload --sample-indices $i --quiet
done
```

## Next Steps

1. [x] ~~Identify pattern in failing samples~~ → Pattern: the first sample usually passes; failures occur in subsequent samples
2. [ ] **Investigate state cleanup between requests in offload mode**
   - Check `OffloadEngine` reset/cleanup logic
   - Check ring buffer state between requests
   - Check CPU block manager cleanup
3. [ ] Add a `reset()` method to `OffloadEngine` for explicit state cleanup
4. [ ] Compare state between the first and second request in batch mode
5. [ ] Write a unit test that reproduces the batch mode failure