# CPU Offload Accuracy Issue Investigation

## Problem Summary

**UPDATE (2026-01-12):** Single-request inference works correctly! The issue is with batch/sequential request handling.
| Mode | Testing Method | Accuracy |
|---|---|---|
| CPU Offload | Independent (1 request per process) | 100% ✓ |
| CPU Offload | Batch (multiple requests per process) | 66% ✗ |
| Non-Offload | Batch | 100% ✓ |
**Conclusion:** The offload implementation is correct for single requests. The bug is in state cleanup between sequential requests within the same process.
## Test Environment

- Model: Llama-3.1-8B-Instruct
- Task: RULER NIAH (Needle-In-A-Haystack), 32K context
- GPU: NVIDIA A100-SXM4-80GB
- Data: `tests/data/ruler_niah/niah_single_1_32k.jsonl` (100 samples)
## Reproduction Commands

### Non-Offload Mode (100% accuracy)

```bash
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --gpu-utilization 0.7 \
    --quiet
```
Configuration:

- KV cache: GPU only, 51 blocks (6528 MB)
- Block size: 1024 tokens
### Offload Mode (66% accuracy)

```bash
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --enable-offload \
    --quiet
```
Configuration:

- KV cache: GPU 4 blocks (512 MB) + CPU 32 blocks (4096 MB)
- Ring buffer: 4 buffers × 33280 tokens (520 MB)
- Per-layer decode buffer: 128 MB
- Block size: 1024 tokens
## Observed Failure Patterns

From the 5-sample verbose test:

| Sample | Expected | Offload Output | Status |
|---|---|---|---|
| 0 | 8930103 | `: 8930103.` | PASS |
| 1 | 4194548 | `: 419 multiplication of 4548.` | FAIL |
| 2 | 8231838 | `:ное 8231838.` | PASS |
| 3 | 8835373 | `: 8835373.` | PASS |
| 4 | 7754864 | `aster 7754864.` | PASS |
**Failure pattern:** The model sometimes produces corrupted or split outputs (e.g., `419 multiplication of 4548` instead of `4194548`). Note that samples 2 and 4 also emit stray prefix tokens (`ное`, `aster`) but still pass because the needle digits survive intact.
## Architecture Overview

### Offload Mode Data Flow

Prefill phase:

1. Input tokens are chunked into 2048-token chunks.
2. Each chunk is processed layer by layer:
   - Load KV from CPU into the GPU ring buffer
   - Compute attention
   - Store KV back to CPU
3. The ring buffer holds recent KV for decode.

Decode phase:

1. For each new token:
   - Load each layer's KV from CPU (one layer at a time)
   - Compute attention against the full context
   - Generate the next token
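
A minimal sketch of the prefill side of this flow, using assumed method names and signatures (the real logic lives in `run_chunked_offload_prefill()` and `OffloadEngine`):

```python
# Sketch only: load_kv/store_kv/attention names are assumptions,
# not the actual nanovllm API.
def chunked_offload_prefill(tokens, model, engine, chunk_size=2048):
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        # Positions must keep advancing across chunks (see root cause #2 below).
        positions = range(start, start + len(chunk))
        for layer in range(model.num_layers):
            past_kv = engine.load_kv(layer, end=start)   # CPU -> GPU ring buffer
            kv = model.attention(layer, chunk, positions, past_kv)
            engine.store_kv(layer, kv, start=start)      # GPU -> CPU
```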
### Key Components

| File | Component | Description |
|---|---|---|
| `nanovllm/kvcache/offload_engine.py` | `OffloadEngine` | Manages CPU↔GPU KV cache transfers |
| `nanovllm/kvcache/offload_engine.py` | `RingKVBuffer` | GPU ring buffer for recent KV |
| `nanovllm/engine/model_runner.py` | `run_chunked_offload_prefill()` | Chunked prefill with offload |
| `nanovllm/engine/model_runner.py` | `run_offload_decode()` | Layer-wise decode with offload |
| `nanovllm/kvcache/hybrid_manager.py` | `HybridBlockManager` | CPU block allocation |
## Potential Root Causes

### 1. Ring Buffer Index/Position Issues

Location: `nanovllm/kvcache/offload_engine.py`

The ring buffer uses modular indexing. Potential issues:

- Position calculation errors during the prefill/decode transition
- Off-by-one errors in KV storage/retrieval
- Incorrect handling when the sequence length approaches `max_seq_len`

Recent fix applied: `max_seq_len = max_model_len + 512` to prevent overflow, but there may be other indexing issues.
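
For intuition, here is a toy model of modular indexing (not the actual `RingKVBuffer` code) and the kind of silent aliasing it can produce:

```python
# Toy illustration of modular ring-buffer indexing. The classic bug:
# writes and reads derive slots from differently-tracked positions.
class RingIndex:
    def __init__(self, capacity_tokens: int):
        self.capacity = capacity_tokens

    def slot(self, token_pos: int) -> int:
        # Writes and reads MUST derive the slot from the same absolute position.
        return token_pos % self.capacity

idx = RingIndex(capacity_tokens=33280)
# Writing token 33280 lands in slot 0, overwriting token 0's KV; a later
# read of slot(0) that expects token 0 silently gets stale data.
assert idx.slot(33280) == idx.slot(0)
```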
### 2. Chunked Prefill KV Storage

Location: `nanovllm/engine/model_runner.py:run_chunked_offload_prefill()`

During chunked prefill:

- KV computed for chunk N must be fully stored before chunk N+1 is processed
- Position IDs must accumulate correctly across chunks
- CPU block allocation must be contiguous and correctly tracked

Suspect areas:

```python
# Check if positions are correctly tracked across chunks
# Check if KV is correctly copied to CPU after each chunk
# Check if ring buffer indices align with CPU block indices
```
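
One way to instrument these checks, sketched against hypothetical hook points (`cpu_blocks` and the fixed chunk size are assumptions):

```python
# Hypothetical hook: after each chunk, assert the bookkeeping is
# self-consistent before moving on (assumes 2048-token chunks and
# 1024-token CPU blocks).
def check_chunk_invariants(chunk_idx, positions, cpu_blocks,
                           chunk_size=2048, block_size=1024):
    # Positions must continue exactly where the previous chunk ended.
    expected_start = chunk_idx * chunk_size
    assert positions[0] == expected_start, (
        f"chunk {chunk_idx}: positions start at {positions[0]}, "
        f"expected {expected_start}")
    # Enough CPU blocks must exist to hold every token stored so far.
    tokens_stored = expected_start + len(positions)
    assert len(cpu_blocks) * block_size >= tokens_stored, (
        f"chunk {chunk_idx}: {len(cpu_blocks)} blocks cannot hold "
        f"{tokens_stored} tokens")
```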
### 3. Decode Phase KV Loading

Location: `nanovllm/engine/model_runner.py:run_offload_decode()`

During decode:

- KV must be loaded for ALL previous tokens (both prefill and decode)
- Layer-by-layer loading must be correctly synchronized
- Attention computation must use the correct sequence length

Suspect areas:

```python
# Check if decode loads KV for full context length
# Check if new decode KV is stored correctly
# Check if attention mask/positions are correct
```
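
A corresponding sanity check for the decode path might look like this (assumed tensor layout `[context_len, num_kv_heads, head_dim]`):

```python
# Hypothetical check: per layer, the loaded KV must cover the full
# context seen so far (prefill tokens plus previously decoded tokens).
def check_decode_kv(layer, loaded_k, loaded_v, context_len):
    assert loaded_k.shape == loaded_v.shape
    assert loaded_k.shape[0] == context_len, (
        f"layer {layer}: loaded {loaded_k.shape[0]} KV entries, "
        f"expected {context_len}")
    # A finiteness scan catches stale or uninitialized buffers early.
    assert loaded_k.isfinite().all() and loaded_v.isfinite().all()
```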
### 4. CPU↔GPU Transfer Synchronization

Location: `nanovllm/kvcache/offload_engine.py`

CUDA streams and synchronization:

- Async copies may complete out of order
- Missing synchronization points could cause stale data to be read
- Stream priorities may affect correctness
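
For reference, the event-based ordering the transfer path needs looks roughly like this in PyTorch (assumed structure, not the actual `OffloadEngine` code):

```python
import torch

# Compute must not read a ring-buffer slot until the H2D copy into
# that slot has finished on the copy stream.
copy_stream = torch.cuda.Stream()

def load_kv_async(cpu_kv: torch.Tensor, gpu_slot: torch.Tensor) -> torch.cuda.Event:
    with torch.cuda.stream(copy_stream):
        # non_blocking H2D copies need pinned CPU memory to be truly async.
        gpu_slot.copy_(cpu_kv, non_blocking=True)
        event = torch.cuda.Event()
        event.record(copy_stream)
    return event

def attend_after(event: torch.cuda.Event, compute_fn):
    # Make the compute stream wait on the copy without a full device sync.
    torch.cuda.current_stream().wait_event(event)
    return compute_fn()
```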
### 5. Numerical Precision

- CPU tensors store KV in float16/bfloat16
- GPU attention kernels may accumulate in a different precision than the stored dtype
- Transfers themselves are bit-exact unless they involve a dtype conversion
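
A quick way to rule this out is a round-trip equality check on a synthetic tensor; when dtypes match on both sides, the copy should be bit-exact:

```python
import torch

# Synthetic tensor; in the real system you would compare an actual KV block.
kv_gpu = torch.randn(1024, 8, 128, dtype=torch.float16, device="cuda")
kv_back = kv_gpu.to("cpu").to("cuda")
assert torch.equal(kv_gpu, kv_back)  # same dtype both ways -> no precision loss
```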
## Debugging Strategy

### Step 1: Identify Failing Samples

```bash
# Run verbose mode to see which samples fail
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --enable-offload \
    --verbose 2>&1 | tee offload_verbose.log
```
### Step 2: Compare Token-by-Token

Create a debug script that compares token generation between offload and non-offload modes for a failing sample:

```python
# Compare logits at each decode step
# Check if divergence starts at a specific position
# Log KV cache contents at divergence point
```
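
A sketch of such a script, assuming a hypothetical `generate_step` API that returns next-token logits for each runner:

```python
# Drive both runners in lockstep on the same prompt and report the first
# step where greedy tokens or logits diverge. generate_step is assumed.
def find_divergence(runner_ref, runner_off, prompt_ids, max_steps=64, atol=1e-2):
    for step in range(max_steps):
        logits_ref = runner_ref.generate_step(prompt_ids)  # non-offload
        logits_off = runner_off.generate_step(prompt_ids)  # offload
        max_diff = (logits_ref - logits_off).abs().max().item()
        tok_ref = logits_ref.argmax(-1).item()
        tok_off = logits_off.argmax(-1).item()
        if tok_ref != tok_off or max_diff > atol:
            print(f"step {step}: max logit diff {max_diff:.4f}, "
                  f"tokens {tok_ref} vs {tok_off}")
            return step
        prompt_ids = prompt_ids + [tok_ref]  # greedy: feed the agreed token back
    return None
```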
### Step 3: Verify KV Cache Contents

Add debugging to `OffloadEngine`:

```python
# In store_kv(): Log what's being stored
# In load_kv(): Log what's being loaded
# Compare loaded KV with expected values
```
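
One lightweight approach is checksum hooks at the (assumed) `store_kv()`/`load_kv()` call sites; any block whose contents change between store and load gets flagged:

```python
import torch

# Record a checksum per (layer, block) at store time, verify at load time.
checksums: dict[tuple[int, int], float] = {}

def on_store(layer: int, block: int, kv: torch.Tensor):
    checksums[(layer, block)] = kv.float().sum().item()

def on_load(layer: int, block: int, kv: torch.Tensor):
    expected = checksums.get((layer, block))
    got = kv.float().sum().item()
    if expected is not None and abs(got - expected) > 1e-3 * abs(expected):
        print(f"KV mismatch at layer {layer}, block {block}: "
              f"stored sum {expected}, loaded sum {got}")
```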
### Step 4: Check Position/Index Calculations

```python
# Log ring buffer write/read positions
# Log CPU block indices
# Verify position IDs match actual token positions
```
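
A minimal trace helper for this (call sites are hypothetical); misaligned indices show up as mismatched write/read lines for the same position:

```python
import logging

log = logging.getLogger("offload.trace")

def trace_write(token_pos: int, ring_slot: int, cpu_block: int):
    log.debug("W pos=%d slot=%d cpu_block=%d", token_pos, ring_slot, cpu_block)

def trace_read(token_pos: int, ring_slot: int, cpu_block: int):
    log.debug("R pos=%d slot=%d cpu_block=%d", token_pos, ring_slot, cpu_block)
```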
### Step 5: Isolate the Bug

- Test with shorter sequences (16K, 8K) to see if the issue is length-dependent
- Test with a single chunk (no chunking) to isolate chunked prefill
- Test prefill-only (no decode) to isolate the decode phase
## Quick Debugging Commands

```bash
# Test single failing sample with verbose output
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --enable-offload \
    --sample-indices 1 \
    --verbose

# Test with different context lengths
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --enable-offload \
    --max-model-len 16384 \
    --verbose
```
## Related Documentation

- `docs/ruler_niah_standalone_test.md` - Test setup and background
- `docs/layerwise_offload_memory_analysis.md` - Memory analysis (if it exists)
## Test Results Log

### 2026-01-12 (Updated - Independent Testing)

**Key Finding:** When each sample is tested independently (a separate Python process per sample), CPU offload achieves 100% accuracy.

| Test | Mode | Testing Method | Samples | Passed | Accuracy |
|---|---|---|---|---|---|
| RULER NIAH 32K | CPU Offload | Independent (separate process) | 100 | 100 | 100% |
| RULER NIAH 32K | CPU Offload | Batch (single process) | 100 | 66 | 66% |
| RULER NIAH 32K | Non-Offload | Batch (single process) | 100 | 100 | 100% |

Test configuration (independent mode):

- GPUs: 4x RTX 3090 (parallel testing)
- Each sample: fresh Python process with a new LLM instance
- Port: each GPU uses a unique port (`2333 + gpu_id`)
- Duration: 17.9 minutes for 100 samples
- Throughput: 5.58 samples/min
### 2025-01-12 (Original - Batch Testing)

| Test | Mode | Samples | Passed | Accuracy |
|---|---|---|---|---|
| RULER NIAH 32K | Non-Offload | 100 | 100 | 100% |
| RULER NIAH 32K | CPU Offload | 100 | 66 | 66% |
## Root Cause Analysis Update

### Confirmed: Single-Request Inference is Correct

The 100% accuracy in independent testing mode confirms that:

1. **Single-request inference works correctly** - the offload engine, ring buffer, and chunked prefill all function properly for individual requests.
2. **The bug is in batch/sequential request handling** - state accumulation or incomplete cleanup between requests causes failures.
### Suspected Issue: State Accumulation Between Requests

When multiple requests are processed in the same Python process:

- The first request succeeds (e.g., Sample 0: PASS)
- Subsequent requests may fail due to:
  - Residual state in the ring buffer
  - Incomplete KV cache cleanup
  - Position-tracking errors across requests
  - CPU block allocation fragmentation
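
The Next Steps at the end of this document propose an explicit `reset()` to address this; a sketch of the per-request state such a method would need to clear (attribute names are assumptions, not the real fields):

```python
import torch

# Hypothetical attribute names; the real state lives across OffloadEngine,
# RingKVBuffer, and HybridBlockManager.
class OffloadEngineResetSketch:
    def reset(self):
        """Clear all per-request state before accepting a new request."""
        torch.cuda.synchronize()            # drain in-flight async copies first
        self.ring_buffer.write_pos = 0      # rewind ring-buffer cursor
        self.ring_buffer.valid_tokens = 0   # mark all buffered KV as stale
        self.cpu_block_manager.free_all()   # return every CPU block
        self.position_offset = 0            # restart absolute positions at zero
```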
### Evidence

From batch mode testing (5 samples):

| Sample | Expected | Output | Status |
|---|---|---|---|
| 0 | 8930103 | `: 8930103.` | PASS (first request) |
| 1 | 4194548 | `: 419 multiplication of 4548.` | FAIL (second request) |
| 2 | 8231838 | `:ное 8231838.` | PASS |
| 3 | 8835373 | `: 8835373.` | PASS |
| 4 | 7754864 | `aster 7754864.` | PASS |

The corrupted output in Sample 1 suggests interference from Sample 0's state.
## Workaround

Use independent testing mode (a separate process per request) for production evaluation:

```bash
# Using test_ruler_niah.sh for parallel independent testing
./tests/test_ruler_niah.sh --gpus "0,1,2,3" --total 100

# Or manually run each sample in a separate process
for i in $(seq 0 99); do
    CUDA_VISIBLE_DEVICES=0 python tests/test_ruler_niah.py \
        --enable-offload --sample-indices $i --quiet
done
```
## Next Steps

- [x] Identify pattern in failing samples → Pattern: the first sample usually passes; failures occur in subsequent samples
- [ ] Investigate state cleanup between requests in offload mode:
  - Check `OffloadEngine` reset/cleanup logic
  - Check ring buffer state between requests
  - Check CPU block manager cleanup
- [ ] Add a `reset()` method to `OffloadEngine` for explicit state cleanup
- [ ] Compare state between the first and second request in batch mode
- [ ] Write a unit test that reproduces the batch mode failure (see the sketch below)
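
A sketch of such a test, assuming the vLLM-style `LLM`/`SamplingParams` interface that nanovllm mirrors plus a hypothetical `enable_offload` constructor flag; the prompt builder is stand-in code, not the RULER generator:

```python
from nanovllm import LLM, SamplingParams

def build_niah_prompt(needle: str) -> str:
    # Stand-in haystack; the real test uses RULER's 32K NIAH prompts.
    filler = "The grass is green. The sky is blue. " * 2000
    return f"{filler}\nThe magic number is {needle}.\n{filler}\nWhat is the magic number?"

def test_sequential_requests_do_not_interfere():
    llm = LLM("~/models/Llama-3.1-8B-Instruct", enable_offload=True)  # hypothetical flag
    params = SamplingParams(temperature=0.0, max_tokens=16)
    # The second request is the one that fails in batch mode (Sample 1 pattern).
    for needle in ["8930103", "4194548"]:
        out = llm.generate([build_niah_prompt(needle)], params)[0]
        text = out["text"] if isinstance(out, dict) else out.text  # output shape may differ
        assert needle in text, f"needle {needle} corrupted: {text!r}"
```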