[docs] Added offload_acc issue.

2026-01-12 15:05:55 +08:00
parent a6cc703d73
commit 8e0888c20c
3 changed files with 623 additions and 74 deletions
--- a/docs/offload_accuracy_issue.md
+++ b/docs/offload_accuracy_issue.md
@@ -0,0 +1,239 @@
+# CPU Offload Accuracy Issue Investigation
+
+## Problem Summary
+
+CPU offload mode produces significantly lower accuracy than non-offload mode on the RULER NIAH benchmark.
+
+| Mode | Accuracy | Pass/Total |
+|------|----------|------------|
+| **Non-Offload (GPU only)** | **100%** | 100/100 |
+| **CPU Offload** | **66%** | 66/100 |
+
+This 34-percentage-point accuracy drop indicates a bug in the offload implementation that affects inference correctness.
+
+## Test Environment
+
+- **Model**: Llama-3.1-8B-Instruct
+- **Task**: RULER NIAH (Needle-In-A-Haystack) 32K context
+- **GPU**: NVIDIA A100-SXM4-80GB
+- **Data**: `tests/data/ruler_niah/niah_single_1_32k.jsonl` (100 samples)
+
+## Reproduction Commands
+
+### Non-Offload Mode (100% accuracy)
+
+```bash
+CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
+    --model ~/models/Llama-3.1-8B-Instruct \
+    --gpu-utilization 0.7 \
+    --quiet
+```
+
+**Configuration**:
+- KV Cache: GPU only, 51 blocks (6528 MB)
+- Block size: 1024 tokens
+
+### Offload Mode (66% accuracy)
+
+```bash
+CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
+    --model ~/models/Llama-3.1-8B-Instruct \
+    --enable-offload \
+    --quiet
+```
+
+**Configuration**:
+- KV Cache: GPU 4 blocks (512 MB) + CPU 32 blocks (4096 MB)
+- Ring buffer: 4 buffers × 33280 tokens (520 MB)
+- Per-layer decode buffer: 128 MB
+- Block size: 1024 tokens
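+
+The sizes listed in both configurations can be cross-checked by hand. The sketch below assumes Llama-3.1-8B's published shape (32 layers, 8 KV heads, head dimension 128) and a 2-byte cache dtype (float16/bfloat16). The helper names are illustrative and not part of the repository, and the ring buffer layout (one layer's KV per buffer) is an assumption that happens to match the reported 520 MB.
+
+```python
+# Assumed model shape for Llama-3.1-8B; DTYPE_BYTES assumes float16/bfloat16.
+NUM_LAYERS = 32
+NUM_KV_HEADS = 8
+HEAD_DIM = 128
+DTYPE_BYTES = 2
+BLOCK_TOKENS = 1024
+MIB = 1024 * 1024
+
+
+def kv_bytes_per_token_per_layer() -> int:
+    # K and V each hold NUM_KV_HEADS * HEAD_DIM values per token.
+    return 2 * NUM_KV_HEADS * HEAD_DIM * DTYPE_BYTES
+
+
+def block_mib() -> float:
+    # One block stores BLOCK_TOKENS tokens across all layers.
+    return kv_bytes_per_token_per_layer() * NUM_LAYERS * BLOCK_TOKENS / MIB
+
+
+def ring_buffer_mib(num_buffers: int, tokens: int) -> float:
+    # Assumption: each ring buffer holds one layer's KV for `tokens` tokens.
+    return num_buffers * tokens * kv_bytes_per_token_per_layer() / MIB
+
+
+print(block_mib())                # 128.0
+print(51 * block_mib())           # 6528.0
+print(ring_buffer_mib(4, 33280))  # 520.0
+```
+
+Under these assumptions one block is 128 MB, which agrees with 51 blocks = 6528 MB, 4 blocks = 512 MB, and 32 blocks = 4096 MB.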
+
+## Observed Failure Patterns
+
+From the 5-sample verbose test:
+
+| Sample | Expected | Offload Output | Status |
+|--------|----------|----------------|--------|
+| 0 | 8930103 | `: 8930103.` | PASS |
+| 1 | 4194548 | `: 419 multiplication of 4548.` | **FAIL** |
+| 2 | 8231838 | `:ное 8231838.` | PASS |
+| 3 | 8835373 | `: 8835373.` | PASS |
+| 4 | 7754864 | `aster 7754864.` | PASS |
+
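+**Failure pattern**: The model sometimes produces corrupted or split outputs (e.g., "419 multiplication of 4548" instead of "4194548"). Samples 2 and 4 pass but also begin with stray tokens (`:ное`, `aster`), so the corruption may be more widespread than the pass rate alone shows.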
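+
+One way to separate "needle retrieved but interrupted" from "needle not retrieved" is to compare the digits in the output against the expected value. The sketch below is a hypothetical helper, not the test's actual scoring code; it assumes a sample passes when the expected string appears verbatim in the output.
+
+```python
+import re
+
+
+def classify(expected: str, output: str) -> str:
+    """Label an output as 'exact', 'split', or 'missing'."""
+    if expected in output:
+        return "exact"
+    # Concatenate every digit in the output, ignoring the tokens in between.
+    digits = "".join(re.findall(r"\d", output))
+    if expected in digits:
+        return "split"
+    return "missing"
+
+
+samples = [
+    ("8930103", ": 8930103."),
+    ("4194548", ": 419 multiplication of 4548."),
+    ("8231838", ":ное 8231838."),
+]
+for expected, output in samples:
+    print(expected, classify(expected, output))
+# 8930103 exact
+# 4194548 split
+# 8231838 exact
+```
+
+Sample 1 is classified as `split`: all seven digits are present in order, with spurious tokens inserted between them.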
+
+## Architecture Overview
+
+### Offload Mode Data Flow
+
+```
+Prefill Phase:
+1. Input tokens → chunked into 2048-token chunks
+2. Each chunk processed layer by layer:
+   - Load KV from CPU → GPU ring buffer
+   - Compute attention
+   - Store KV back to CPU
+3. Ring buffer holds recent KV for decode
+
+Decode Phase:
+1. For each new token:
+   - Load all layer KV from CPU (one layer at a time)
+   - Compute attention against full context
+   - Generate next token
+```
+
+### Key Components
+
+| File | Component | Description |
+|------|-----------|-------------|
+| `nanovllm/kvcache/offload_engine.py` | `OffloadEngine` | Manages CPU↔GPU KV cache transfers |
+| `nanovllm/kvcache/offload_engine.py` | `RingKVBuffer` | GPU ring buffer for recent KV |
+| `nanovllm/engine/model_runner.py` | `run_chunked_offload_prefill()` | Chunked prefill with offload |
+| `nanovllm/engine/model_runner.py` | `run_offload_decode()` | Layer-wise decode with offload |
+| `nanovllm/kvcache/hybrid_manager.py` | `HybridBlockManager` | CPU block allocation |
+
+## Potential Root Causes
+
+### 1. Ring Buffer Index/Position Issues
+
+**Location**: `nanovllm/kvcache/offload_engine.py`
+
+The ring buffer uses modular indexing. Potential issues:
+- Position calculation errors during prefill/decode transition
+- Off-by-one errors in KV storage/retrieval
+- Incorrect handling when sequence length approaches `max_seq_len`
+
+**Recent fix applied**: `max_seq_len = max_model_len + 512` to prevent overflow, but there may be other indexing issues.
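+
+How the ring buffer maps positions to slots is not shown in this document. As an illustration only, the sketch below assumes a simple "token position modulo capacity" mapping and shows why a capacity equal to `max_model_len` leaves no headroom for decode tokens, while the extra 512 slots do. The function names and the 100-token decode length are hypothetical.
+
+```python
+def ring_slot(position: int, capacity: int) -> int:
+    # Hypothetical mapping: token position -> slot, wrapping at capacity.
+    return position % capacity
+
+
+def first_overwritten_position(total_tokens: int, capacity: int):
+    """Return the first position whose slot is already occupied, or None."""
+    seen = {}
+    for pos in range(total_tokens):
+        slot = ring_slot(pos, capacity)
+        if slot in seen:
+            return pos
+        seen[slot] = pos
+    return None
+
+
+max_model_len = 32768
+total = max_model_len + 100  # full-length prompt plus 100 decode tokens (assumed)
+
+print(first_overwritten_position(total, max_model_len))        # 32768
+print(first_overwritten_position(total, max_model_len + 512))  # None
+```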
+
+### 2. Chunked Prefill KV Storage
+
+**Location**: `nanovllm/engine/model_runner.py:run_chunked_offload_prefill()`
+
+During chunked prefill:
+- KV computed for chunk N must be correctly stored before processing chunk N+1
+- Position IDs must be correctly accumulated across chunks
+- CPU block allocation must be contiguous and correctly tracked
+
+**Suspect areas**:
+```python
+# Check if positions are correctly tracked across chunks
+# Check if KV is correctly copied to CPU after each chunk
+# Check if ring buffer indices align with CPU block indices
+```
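+
+The first and third checks can be expressed as invariants on the chunk schedule. The sketch below is a standalone model of that schedule, not code from `model_runner.py`: it assumes 2048-token chunks, 1024-token blocks, and logical block IDs that start at 0 and grow contiguously.
+
+```python
+CHUNK_TOKENS = 2048
+BLOCK_TOKENS = 1024
+
+
+def chunk_plan(total_tokens: int):
+    """Yield (start, end, block_ids) for each prefill chunk."""
+    for start in range(0, total_tokens, CHUNK_TOKENS):
+        end = min(start + CHUNK_TOKENS, total_tokens)
+        first_block = start // BLOCK_TOKENS
+        last_block = (end - 1) // BLOCK_TOKENS
+        yield start, end, list(range(first_block, last_block + 1))
+
+
+def check_plan(total_tokens: int) -> None:
+    positions, blocks = [], []
+    for start, end, block_ids in chunk_plan(total_tokens):
+        positions.extend(range(start, end))
+        blocks.extend(block_ids)
+    # Every token position appears exactly once, in order.
+    assert positions == list(range(total_tokens))
+    # Blocks are contiguous and no block is written by two chunks.
+    assert blocks == list(range(len(blocks)))
+
+
+check_plan(32768)
+check_plan(5000)  # last chunk and last block are partial
+print(list(chunk_plan(5000))[-1])  # (4096, 5000, [4])
+```
+
+Logging the same `(start, end, block_ids)` triple from the real prefill loop and comparing it with this model would show whether positions or block indices drift between chunks.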
+
+### 3. Decode Phase KV Loading
+
+**Location**: `nanovllm/engine/model_runner.py:run_offload_decode()`
+
+During decode:
+- Must load KV for ALL previous tokens (both prefill and decode)
+- Layer-by-layer loading must be synchronized correctly
+- Attention computation must use correct sequence length
+
+**Suspect areas**:
+```python
+# Check if decode loads KV for full context length
+# Check if new decode KV is stored correctly
+# Check if attention mask/positions are correct
+```
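+
+A common source of error at this stage is the last, partially filled block: attention must cover exactly the cached tokens and none of the unused slots. The sketch below computes the expected context length and per-block valid-token counts for a decode step. It is a reference model under an assumed convention (the query at step `step` attends to `prefill_len + step` cached tokens); the real engine's convention may differ by one.
+
+```python
+BLOCK_TOKENS = 1024
+
+
+def decode_load_plan(prefill_len: int, step: int):
+    """Return (context_len, [(block_id, valid_tokens), ...]) for a decode step."""
+    context_len = prefill_len + step
+    num_blocks = (context_len + BLOCK_TOKENS - 1) // BLOCK_TOKENS
+    plan = []
+    for block_id in range(num_blocks):
+        # Only the last block can hold fewer than BLOCK_TOKENS valid tokens.
+        valid = min(BLOCK_TOKENS, context_len - block_id * BLOCK_TOKENS)
+        plan.append((block_id, valid))
+    return context_len, plan
+
+
+context_len, plan = decode_load_plan(prefill_len=32700, step=5)
+assert sum(valid for _, valid in plan) == context_len
+print(context_len, len(plan), plan[-1])  # 32705 32 (31, 961)
+```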
+
+### 4. CPU↔GPU Transfer Synchronization
+
+**Location**: `nanovllm/kvcache/offload_engine.py`
+
+CUDA streams and synchronization:
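+- Async copies issued on different streams may complete out of order
+- Missing synchronization points could let compute read stale data from a buffer that is still being filled
+- Stream priorities change scheduling order and could expose a missing dependency between streams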
+
+### 5. Numerical Precision
+
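+- CPU tensors are stored in float16/bfloat16
+- GPU computation may use a different precision than the stored KV
+- A same-dtype copy between CPU and GPU is bit-exact, so any precision loss would have to come from a dtype conversion before or after the transfer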
+
+## Debugging Strategy
+
+### Step 1: Identify Failing Samples
+
+```bash
+# Run verbose mode to see which samples fail
+CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
+    --model ~/models/Llama-3.1-8B-Instruct \
+    --enable-offload \
+    --verbose 2>&1 | tee offload_verbose.log
+```
+
+### Step 2: Compare Token-by-Token
+
+Create a debug script to compare token generation between offload and non-offload modes for a failing sample:
+
+```python
+# Compare logits at each decode step
+# Check if divergence starts at a specific position
+# Log KV cache contents at divergence point
+```
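+
+The comparison itself needs very little code once both modes have dumped their per-step token IDs and logits. The sketch below shows the two helpers such a script might use; the token IDs are toy values, not output from the model, and how the traces are captured is left open.
+
+```python
+def first_divergence(ref, test):
+    """Index of the first differing element, or None if the traces match."""
+    for i, (a, b) in enumerate(zip(ref, test)):
+        if a != b:
+            return i
+    if len(ref) != len(test):
+        # One trace is a strict prefix of the other.
+        return min(len(ref), len(test))
+    return None
+
+
+def max_abs_diff(ref_logits, test_logits):
+    """Largest element-wise gap between two logit vectors of equal length."""
+    return max(abs(a - b) for a, b in zip(ref_logits, test_logits))
+
+
+# Toy token IDs standing in for real decode traces.
+ref = [25, 220, 19391, 20555, 23, 13]
+off = [25, 220, 19391, 47544, 315, 220]
+print(first_divergence(ref, off))                        # 3
+print(max_abs_diff([1.0, 2.5, -0.5], [1.0, 2.0, -0.5]))  # 0.5
+```
+
+If the logits already differ noticeably at step 0, before any decode KV has been written, the prefill path is the more likely suspect; if they drift only after several steps, the decode store/load path is.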
+
+### Step 3: Verify KV Cache Contents
+
+Add debugging to `OffloadEngine`:
+
+```python
+# In store_kv(): Log what's being stored
+# In load_kv(): Log what's being loaded
+# Compare loaded KV with expected values
+```
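+
+Rather than logging raw tensors, a digest per block keeps the log small and makes mismatches obvious. The sketch below is a hypothetical audit helper that works on raw bytes; the `KVAudit` class and its method names are not part of `OffloadEngine`, and how a KV block is converted to bytes is left to the caller.
+
+```python
+import hashlib
+
+
+class KVAudit:
+    """Records a digest per (layer, block) at store time and checks it at load time."""
+
+    def __init__(self):
+        self._digests = {}
+
+    def on_store(self, layer: int, block: int, raw: bytes) -> None:
+        self._digests[(layer, block)] = hashlib.sha256(raw).hexdigest()
+
+    def on_load(self, layer: int, block: int, raw: bytes) -> bool:
+        # False if the block was never stored or its contents changed.
+        expected = self._digests.get((layer, block))
+        return expected is not None and expected == hashlib.sha256(raw).hexdigest()
+
+
+audit = KVAudit()
+audit.on_store(layer=0, block=3, raw=b"\x01\x02\x03\x04")
+print(audit.on_load(0, 3, b"\x01\x02\x03\x04"))  # True
+print(audit.on_load(0, 3, b"\x01\x02\x03\x05"))  # False
+print(audit.on_load(1, 3, b"\x01\x02\x03\x04"))  # False
+```
+
+A `False` on load points at either a transfer that had not finished when the data was read or a wrong block index, which narrows the search to causes 1 and 4 above.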
+
+### Step 4: Check Position/Index Calculations
+
+```python
+# Log ring buffer write/read positions
+# Log CPU block indices
+# Verify position IDs match actual token positions
+```
+
+### Step 5: Isolate the Bug
+
+1. Test with shorter sequences (16K, 8K) to see if the issue is length-dependent
+2. Test with a single chunk (no chunking) to isolate chunked prefill
+3. Test prefill only (no decode) to isolate the decode phase
+
+## Quick Debugging Commands
+
+```bash
+# Test single failing sample with verbose output
+CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
+    --model ~/models/Llama-3.1-8B-Instruct \
+    --enable-offload \
+    --sample-indices 1 \
+    --verbose
+
+# Test with different context lengths
+CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
+    --model ~/models/Llama-3.1-8B-Instruct \
+    --enable-offload \
+    --max-model-len 16384 \
+    --verbose
+```
+
+## Related Documentation
+
+- [`docs/ruler_niah_standalone_test.md`](ruler_niah_standalone_test.md) - Test setup and background
+- [`docs/layerwise_offload_memory_analysis.md`](layerwise_offload_memory_analysis.md) - Memory analysis (if it exists)
+
+## Test Results Log
+
+**Date**: 2026-01-12
+
+| Test | Mode | Samples | Passed | Accuracy |
+|------|------|---------|--------|----------|
+| RULER NIAH 32K | Non-Offload | 100 | 100 | 100% |
+| RULER NIAH 32K | CPU Offload | 100 | 66 | 66% |
+
+## Next Steps
+
+1. [ ] Identify pattern in failing samples (position of needle? specific numbers?)
+2. [ ] Add detailed logging to offload engine
+3. [ ] Compare logits between offload and non-offload modes
+4. [ ] Bisect the code to find the exact bug location
+5. [ ] Write unit test that isolates the bug