Files
nano-vllm/docs/offload_accuracy_issue.md
Zijie Tian 49519c7ce7 📝 docs: update offload accuracy issue with independent testing results
Document key finding: single request inference works correctly (100% accuracy).
The 66% accuracy issue in batch mode is due to state accumulation between
sequential requests in the same process.

- Add comparison table: independent (100%) vs batch (66%) testing modes
- Document root cause analysis: state cleanup issue between requests
- Add workaround using test_ruler_niah.sh for independent testing
- Update next steps to focus on OffloadEngine reset/cleanup logic

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-12 21:08:35 +08:00

10 KiB
Raw Blame History

CPU Offload Accuracy Issue Investigation

Problem Summary

UPDATE (2026-01-12): Single request inference works correctly! The issue is with batch/sequential request handling.

Mode Testing Method Accuracy
CPU Offload Independent (1 request per process) 100%
CPU Offload Batch (multiple requests per process) 66% ✗
Non-Offload Batch 100% ✓

Conclusion: The offload implementation is correct for single requests. The bug is in state cleanup between sequential requests within the same process.

Test Environment

  • Model: Llama-3.1-8B-Instruct
  • Task: RULER NIAH (Needle-In-A-Haystack) 32K context
  • GPU: NVIDIA A100-SXM4-80GB
  • Data: tests/data/ruler_niah/niah_single_1_32k.jsonl (100 samples)

Reproduction Commands

Non-Offload Mode (100% accuracy)

CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --gpu-utilization 0.7 \
    --quiet

Configuration:

  • KV Cache: GPU only, 51 blocks (6528 MB)
  • Block size: 1024 tokens

Offload Mode (66% accuracy)

CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --enable-offload \
    --quiet

Configuration:

  • KV Cache: GPU 4 blocks (512 MB) + CPU 32 blocks (4096 MB)
  • Ring buffer: 4 buffers × 33280 tokens (520 MB)
  • Per-layer decode buffer: 128 MB
  • Block size: 1024 tokens

Observed Failure Patterns

From the 5-sample verbose test:

Sample Expected Offload Output Status
0 8930103 : 8930103. PASS
1 4194548 : 419 multiplication of 4548. FAIL
2 8231838 :ное 8231838. PASS
3 8835373 : 8835373. PASS
4 7754864 aster 7754864. PASS

Failure pattern: The model sometimes produces corrupted or split outputs (e.g., "419 multiplication of 4548" instead of "4194548").

Architecture Overview

Offload Mode Data Flow

Prefill Phase:
1. Input tokens → chunked into 2048-token chunks
2. Each chunk processed layer by layer:
   - Load KV from CPU → GPU ring buffer
   - Compute attention
   - Store KV back to CPU
3. Ring buffer holds recent KV for decode

Decode Phase:
1. For each new token:
   - Load all layer KV from CPU (one layer at a time)
   - Compute attention against full context
   - Generate next token

Key Components

File Component Description
nanovllm/kvcache/offload_engine.py OffloadEngine Manages CPU↔GPU KV cache transfers
nanovllm/kvcache/offload_engine.py RingKVBuffer GPU ring buffer for recent KV
nanovllm/engine/model_runner.py run_chunked_offload_prefill() Chunked prefill with offload
nanovllm/engine/model_runner.py run_offload_decode() Layer-wise decode with offload
nanovllm/kvcache/hybrid_manager.py HybridBlockManager CPU block allocation

Potential Root Causes

1. Ring Buffer Index/Position Issues

Location: nanovllm/kvcache/offload_engine.py

The ring buffer uses modular indexing. Potential issues:

  • Position calculation errors during prefill/decode transition
  • Off-by-one errors in KV storage/retrieval
  • Incorrect handling when sequence length approaches max_seq_len

Recent fix applied: max_seq_len = max_model_len + 512 to prevent overflow, but there may be other indexing issues.

2. Chunked Prefill KV Storage

Location: nanovllm/engine/model_runner.py:run_chunked_offload_prefill()

During chunked prefill:

  • KV computed for chunk N must be correctly stored before processing chunk N+1
  • Position IDs must be correctly accumulated across chunks
  • CPU block allocation must be contiguous and correctly tracked

Suspect areas:

# Check if positions are correctly tracked across chunks
# Check if KV is correctly copied to CPU after each chunk
# Check if ring buffer indices align with CPU block indices

3. Decode Phase KV Loading

Location: nanovllm/engine/model_runner.py:run_offload_decode()

During decode:

  • Must load KV for ALL previous tokens (both prefill and decode)
  • Layer-by-layer loading must be synchronized correctly
  • Attention computation must use correct sequence length

Suspect areas:

# Check if decode loads KV for full context length
# Check if new decode KV is stored correctly
# Check if attention mask/positions are correct

4. CPU↔GPU Transfer Synchronization

Location: nanovllm/kvcache/offload_engine.py

CUDA streams and synchronization:

  • Async copies may complete out of order
  • Missing synchronization points could cause stale data
  • Stream priorities may affect correctness

5. Numerical Precision

  • CPU tensors use float16/bfloat16
  • GPU computation precision
  • Potential precision loss during transfers

Debugging Strategy

Step 1: Identify Failing Samples

# Run verbose mode to see which samples fail
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --enable-offload \
    --verbose 2>&1 | tee offload_verbose.log

Step 2: Compare Token-by-Token

Create a debug script to compare token generation between offload and non-offload modes for a failing sample:

# Compare logits at each decode step
# Check if divergence starts at a specific position
# Log KV cache contents at divergence point

Step 3: Verify KV Cache Contents

Add debugging to OffloadEngine:

# In store_kv(): Log what's being stored
# In load_kv(): Log what's being loaded
# Compare loaded KV with expected values

Step 4: Check Position/Index Calculations

# Log ring buffer write/read positions
# Log CPU block indices
# Verify position IDs match actual token positions

Step 5: Isolate the Bug

  1. Test with shorter sequences (16K, 8K) to see if issue is length-dependent
  2. Test with single chunk (no chunking) to isolate chunked prefill
  3. Test prefill-only (no decode) to isolate decode phase

Quick Debugging Commands

# Test single failing sample with verbose output
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --enable-offload \
    --sample-indices 1 \
    --verbose

# Test with different context lengths
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --enable-offload \
    --max-model-len 16384 \
    --verbose

Test Results Log

2026-01-12 (Updated - Independent Testing)

Key Finding: When each sample is tested independently (separate Python process per sample), CPU offload achieves 100% accuracy.

Test Mode Testing Method Samples Passed Accuracy
RULER NIAH 32K CPU Offload Independent (separate process) 100 100 100%
RULER NIAH 32K CPU Offload Batch (single process) 100 66 66%
RULER NIAH 32K Non-Offload Batch (single process) 100 100 100%

Test Configuration (Independent Mode):

  • GPUs: 4x RTX 3090 (parallel testing)
  • Each sample: Fresh Python process with new LLM instance
  • Port: Each GPU uses unique port (2333+gpu_id)
  • Duration: 17.9 minutes for 100 samples
  • Throughput: 5.58 samples/min

2025-01-12 (Original - Batch Testing)

Test Mode Samples Passed Accuracy
RULER NIAH 32K Non-Offload 100 100 100%
RULER NIAH 32K CPU Offload 100 66 66%

Root Cause Analysis Update

Confirmed: Single Request Inference is Correct

The 100% accuracy in independent testing mode confirms that:

  1. Single request inference works correctly - The offload engine, ring buffer, and chunked prefill are functioning properly for individual requests
  2. The bug is in batch/sequential request handling - State accumulation or incomplete cleanup between requests causes failures

Suspected Issue: State Accumulation Between Requests

When multiple requests are processed in the same Python process:

  • The first request succeeds (e.g., Sample 0: PASS)
  • Subsequent requests may fail due to:
    • Residual state in ring buffer
    • Incomplete KV cache cleanup
    • Position tracking errors across requests
    • CPU block allocation fragmentation

Evidence

From batch mode testing (5 samples):

Sample Expected Output Status
0 8930103 : 8930103. PASS (first request)
1 4194548 : 419 multiplication of 4548. FAIL (second request)
2 8231838 :ное 8231838. PASS
3 8835373 : 8835373. PASS
4 7754864 aster 7754864. PASS

The corrupted output in Sample 1 suggests interference from Sample 0's state.

Workaround

Use independent testing mode (separate process per request) for production evaluation:

# Using test_ruler_niah.sh for parallel independent testing
./tests/test_ruler_niah.sh --gpus "0,1,2,3" --total 100

# Or manually run each sample in a separate process
for i in $(seq 0 99); do
    CUDA_VISIBLE_DEVICES=0 python tests/test_ruler_niah.py \
        --enable-offload --sample-indices $i --quiet
done

Next Steps

  1. Identify pattern in failing samples → Pattern: First sample usually passes, failures occur in subsequent samples
  2. Investigate state cleanup between requests in offload mode
    • Check OffloadEngine reset/cleanup logic
    • Check ring buffer state between requests
    • Check CPU block manager cleanup
  3. Add reset() method to OffloadEngine for explicit state cleanup
  4. Compare state between first and second request in batch mode
  5. Write unit test that reproduces the batch mode failure