# CPU Offload Accuracy Issue Investigation

## Problem Summary

**UPDATE (2026-01-12):** Single-request inference works correctly! The issue is with batch/sequential request handling.
| Mode | Testing Method | Accuracy |
|---|---|---|
| CPU Offload | Independent (1 request per process) | 100% ✓ |
| CPU Offload | Batch (multiple requests per process) | 66% ✗ |
| Non-Offload | Batch | 100% ✓ |
**Conclusion:** The offload implementation is correct for single requests. The bug is in state cleanup between sequential requests within the same process.
## Test Environment

- Model: Llama-3.1-8B-Instruct
- Task: RULER NIAH (Needle-In-A-Haystack), 32K context
- GPU: NVIDIA A100-SXM4-80GB
- Data: `tests/data/ruler_niah/niah_single_1_32k.jsonl` (100 samples)
## Reproduction Commands

### Non-Offload Mode (100% accuracy)

```bash
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --gpu-utilization 0.7 \
    --quiet
```
Configuration:

- KV cache: GPU only, 51 blocks (6528 MB)
- Block size: 1024 tokens
### Offload Mode (66% accuracy)

```bash
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --enable-offload \
    --quiet
```
Configuration:

- KV cache: GPU 4 blocks (512 MB) + CPU 32 blocks (4096 MB)
- Ring buffer: 4 buffers × 33280 tokens (520 MB)
- Per-layer decode buffer: 128 MB
- Block size: 1024 tokens
## Observed Failure Patterns

From the 5-sample verbose test:

| Sample | Expected | Offload Output | Status |
|---|---|---|---|
| 0 | 8930103 | `: 8930103.` | PASS |
| 1 | 4194548 | `: 419 multiplication of 4548.` | FAIL |
| 2 | 8231838 | `:ное 8231838.` | PASS |
| 3 | 8835373 | `: 8835373.` | PASS |
| 4 | 7754864 | `aster 7754864.` | PASS |
**Failure pattern:** The model sometimes produces corrupted or split outputs (e.g., `419 multiplication of 4548` instead of `4194548`). Note that samples 2 and 4 also emit stray prefix tokens (`ное`, `aster`) but still pass because the needle digits survive intact.
## Architecture Overview

### Offload Mode Data Flow

Prefill phase:

1. Input tokens are chunked into 2048-token chunks.
2. Each chunk is processed layer by layer:
   - Load KV from CPU into the GPU ring buffer
   - Compute attention
   - Store KV back to CPU
3. The ring buffer holds recent KV for decode.

Decode phase:

1. For each new token:
   - Load each layer's KV from CPU (one layer at a time)
   - Compute attention against the full context
   - Generate the next token
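
A minimal sketch of the prefill side of this flow, using assumed method names and signatures (the real logic lives in `run_chunked_offload_prefill()` and `OffloadEngine`):

```python
# Sketch only: load_kv/store_kv/attention names are assumptions,
# not the actual nanovllm API.
def chunked_offload_prefill(tokens, model, engine, chunk_size=2048):
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        # Positions must keep advancing across chunks (see root cause #2 below).
        positions = range(start, start + len(chunk))
        for layer in range(model.num_layers):
            past_kv = engine.load_kv(layer, end=start)   # CPU -> GPU ring buffer
            kv = model.attention(layer, chunk, positions, past_kv)
            engine.store_kv(layer, kv, start=start)      # GPU -> CPU
```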
### Key Components

| File | Component | Description |
|---|---|---|
| `nanovllm/kvcache/offload_engine.py` | `OffloadEngine` | Manages CPU↔GPU KV cache transfers |
| `nanovllm/kvcache/offload_engine.py` | `RingKVBuffer` | GPU ring buffer for recent KV |
| `nanovllm/engine/model_runner.py` | `run_chunked_offload_prefill()` | Chunked prefill with offload |
| `nanovllm/engine/model_runner.py` | `run_offload_decode()` | Layer-wise decode with offload |
| `nanovllm/kvcache/hybrid_manager.py` | `HybridBlockManager` | CPU block allocation |
## Potential Root Causes

### 1. Ring Buffer Index/Position Issues

Location: `nanovllm/kvcache/offload_engine.py`

The ring buffer uses modular indexing. Potential issues:

- Position calculation errors during the prefill/decode transition
- Off-by-one errors in KV storage/retrieval
- Incorrect handling when the sequence length approaches `max_seq_len`

Recent fix applied: `max_seq_len = max_model_len + 512` to prevent overflow, but there may be other indexing issues.
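
For intuition, here is a toy model of modular indexing (not the actual `RingKVBuffer` code) and the kind of silent aliasing it can produce:

```python
# Toy illustration of modular ring-buffer indexing. The classic bug:
# writes and reads derive slots from differently-tracked positions.
class RingIndex:
    def __init__(self, capacity_tokens: int):
        self.capacity = capacity_tokens

    def slot(self, token_pos: int) -> int:
        # Writes and reads MUST derive the slot from the same absolute position.
        return token_pos % self.capacity

idx = RingIndex(capacity_tokens=33280)
# Writing token 33280 lands in slot 0, overwriting token 0's KV; a later
# read of slot(0) that expects token 0 silently gets stale data.
assert idx.slot(33280) == idx.slot(0)
```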
### 2. Chunked Prefill KV Storage

Location: `nanovllm/engine/model_runner.py:run_chunked_offload_prefill()`

During chunked prefill:

- KV computed for chunk N must be fully stored before chunk N+1 is processed
- Position IDs must accumulate correctly across chunks
- CPU block allocation must be contiguous and correctly tracked

Suspect areas:

```python
# Check if positions are correctly tracked across chunks
# Check if KV is correctly copied to CPU after each chunk
# Check if ring buffer indices align with CPU block indices
```
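
One way to instrument these checks, sketched against hypothetical hook points (`cpu_blocks` and the fixed chunk size are assumptions):

```python
# Hypothetical hook: after each chunk, assert the bookkeeping is
# self-consistent before moving on (assumes 2048-token chunks and
# 1024-token CPU blocks).
def check_chunk_invariants(chunk_idx, positions, cpu_blocks,
                           chunk_size=2048, block_size=1024):
    # Positions must continue exactly where the previous chunk ended.
    expected_start = chunk_idx * chunk_size
    assert positions[0] == expected_start, (
        f"chunk {chunk_idx}: positions start at {positions[0]}, "
        f"expected {expected_start}")
    # Enough CPU blocks must exist to hold every token stored so far.
    tokens_stored = expected_start + len(positions)
    assert len(cpu_blocks) * block_size >= tokens_stored, (
        f"chunk {chunk_idx}: {len(cpu_blocks)} blocks cannot hold "
        f"{tokens_stored} tokens")
```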
### 3. Decode Phase KV Loading

Location: `nanovllm/engine/model_runner.py:run_offload_decode()`

During decode:

- KV must be loaded for ALL previous tokens (both prefill and decode)
- Layer-by-layer loading must be correctly synchronized
- Attention computation must use the correct sequence length

Suspect areas:

```python
# Check if decode loads KV for full context length
# Check if new decode KV is stored correctly
# Check if attention mask/positions are correct
```
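
A corresponding sanity check for the decode path might look like this (assumed tensor layout `[context_len, num_kv_heads, head_dim]`):

```python
# Hypothetical check: per layer, the loaded KV must cover the full
# context seen so far (prefill tokens plus previously decoded tokens).
def check_decode_kv(layer, loaded_k, loaded_v, context_len):
    assert loaded_k.shape == loaded_v.shape
    assert loaded_k.shape[0] == context_len, (
        f"layer {layer}: loaded {loaded_k.shape[0]} KV entries, "
        f"expected {context_len}")
    # A finiteness scan catches stale or uninitialized buffers early.
    assert loaded_k.isfinite().all() and loaded_v.isfinite().all()
```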
### 4. CPU↔GPU Transfer Synchronization

Location: `nanovllm/kvcache/offload_engine.py`

CUDA streams and synchronization:

- Async copies may complete out of order
- Missing synchronization points could cause stale data to be read
- Stream priorities may affect correctness
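
For reference, the event-based ordering the transfer path needs looks roughly like this in PyTorch (assumed structure, not the actual `OffloadEngine` code):

```python
import torch

# Compute must not read a ring-buffer slot until the H2D copy into
# that slot has finished on the copy stream.
copy_stream = torch.cuda.Stream()

def load_kv_async(cpu_kv: torch.Tensor, gpu_slot: torch.Tensor) -> torch.cuda.Event:
    with torch.cuda.stream(copy_stream):
        # non_blocking H2D copies need pinned CPU memory to be truly async.
        gpu_slot.copy_(cpu_kv, non_blocking=True)
        event = torch.cuda.Event()
        event.record(copy_stream)
    return event

def attend_after(event: torch.cuda.Event, compute_fn):
    # Make the compute stream wait on the copy without a full device sync.
    torch.cuda.current_stream().wait_event(event)
    return compute_fn()
```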
### 5. Numerical Precision

- CPU tensors store KV in float16/bfloat16
- GPU attention kernels may accumulate in a different precision than the stored dtype
- Transfers themselves are bit-exact unless they involve a dtype conversion
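
A quick way to rule this out is a round-trip equality check on a synthetic tensor; when dtypes match on both sides, the copy should be bit-exact:

```python
import torch

# Synthetic tensor; in the real system you would compare an actual KV block.
kv_gpu = torch.randn(1024, 8, 128, dtype=torch.float16, device="cuda")
kv_back = kv_gpu.to("cpu").to("cuda")
assert torch.equal(kv_gpu, kv_back)  # same dtype both ways -> no precision loss
```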
## Debugging Strategy

### Step 1: Identify Failing Samples

```bash
# Run verbose mode to see which samples fail
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --enable-offload \
    --verbose 2>&1 | tee offload_verbose.log
```
### Step 2: Compare Token-by-Token

Create a debug script that compares token generation between offload and non-offload modes for a failing sample:

```python
# Compare logits at each decode step
# Check if divergence starts at a specific position
# Log KV cache contents at divergence point
```
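
A sketch of such a script, assuming a hypothetical `generate_step` API that returns next-token logits for each runner:

```python
# Drive both runners in lockstep on the same prompt and report the first
# step where greedy tokens or logits diverge. generate_step is assumed.
def find_divergence(runner_ref, runner_off, prompt_ids, max_steps=64, atol=1e-2):
    for step in range(max_steps):
        logits_ref = runner_ref.generate_step(prompt_ids)  # non-offload
        logits_off = runner_off.generate_step(prompt_ids)  # offload
        max_diff = (logits_ref - logits_off).abs().max().item()
        tok_ref = logits_ref.argmax(-1).item()
        tok_off = logits_off.argmax(-1).item()
        if tok_ref != tok_off or max_diff > atol:
            print(f"step {step}: max logit diff {max_diff:.4f}, "
                  f"tokens {tok_ref} vs {tok_off}")
            return step
        prompt_ids = prompt_ids + [tok_ref]  # greedy: feed the agreed token back
    return None
```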
### Step 3: Verify KV Cache Contents

Add debugging to `OffloadEngine`:

```python
# In store_kv(): Log what's being stored
# In load_kv(): Log what's being loaded
# Compare loaded KV with expected values
```
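
One lightweight approach is checksum hooks at the (assumed) `store_kv()`/`load_kv()` call sites; any block whose contents change between store and load gets flagged:

```python
import torch

# Record a checksum per (layer, block) at store time, verify at load time.
checksums: dict[tuple[int, int], float] = {}

def on_store(layer: int, block: int, kv: torch.Tensor):
    checksums[(layer, block)] = kv.float().sum().item()

def on_load(layer: int, block: int, kv: torch.Tensor):
    expected = checksums.get((layer, block))
    got = kv.float().sum().item()
    if expected is not None and abs(got - expected) > 1e-3 * abs(expected):
        print(f"KV mismatch at layer {layer}, block {block}: "
              f"stored sum {expected}, loaded sum {got}")
```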
### Step 4: Check Position/Index Calculations

```python
# Log ring buffer write/read positions
# Log CPU block indices
# Verify position IDs match actual token positions
```
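
A minimal trace helper for this (call sites are hypothetical); misaligned indices show up as mismatched write/read lines for the same position:

```python
import logging

log = logging.getLogger("offload.trace")

def trace_write(token_pos: int, ring_slot: int, cpu_block: int):
    log.debug("W pos=%d slot=%d cpu_block=%d", token_pos, ring_slot, cpu_block)

def trace_read(token_pos: int, ring_slot: int, cpu_block: int):
    log.debug("R pos=%d slot=%d cpu_block=%d", token_pos, ring_slot, cpu_block)
```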
### Step 5: Isolate the Bug

- Test with shorter sequences (16K, 8K) to see if the issue is length-dependent
- Test with a single chunk (no chunking) to isolate chunked prefill
- Test prefill-only (no decode) to isolate the decode phase
## Quick Debugging Commands

```bash
# Test single failing sample with verbose output
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --enable-offload \
    --sample-indices 1 \
    --verbose

# Test with different context lengths
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --enable-offload \
    --max-model-len 16384 \
    --verbose
```
## Related Documentation

- `docs/ruler_niah_standalone_test.md` - Test setup and background
- `docs/layerwise_offload_memory_analysis.md` - Memory analysis (if it exists)
## Test Results Log

### 2026-01-12 (Updated - Independent Testing)

**Key Finding:** When each sample is tested independently (a separate Python process per sample), CPU offload achieves 100% accuracy.

| Test | Mode | Testing Method | Samples | Passed | Accuracy |
|---|---|---|---|---|---|
| RULER NIAH 32K | CPU Offload | Independent (separate process) | 100 | 100 | 100% |
| RULER NIAH 32K | CPU Offload | Batch (single process) | 100 | 66 | 66% |
| RULER NIAH 32K | Non-Offload | Batch (single process) | 100 | 100 | 100% |

Test configuration (independent mode):

- GPUs: 4x RTX 3090 (parallel testing)
- Each sample: fresh Python process with a new LLM instance
- Port: each GPU uses a unique port (`2333 + gpu_id`)
- Duration: 17.9 minutes for 100 samples
- Throughput: 5.58 samples/min
### 2025-01-12 (Original - Batch Testing)

| Test | Mode | Samples | Passed | Accuracy |
|---|---|---|---|---|
| RULER NIAH 32K | Non-Offload | 100 | 100 | 100% |
| RULER NIAH 32K | CPU Offload | 100 | 66 | 66% |
## Root Cause Analysis Update

### Confirmed: Single-Request Inference is Correct

The 100% accuracy in independent testing mode confirms that:

1. **Single-request inference works correctly** - the offload engine, ring buffer, and chunked prefill all function properly for individual requests.
2. **The bug is in batch/sequential request handling** - state accumulation or incomplete cleanup between requests causes failures.
### Suspected Issue: State Accumulation Between Requests

When multiple requests are processed in the same Python process:

- The first request succeeds (e.g., Sample 0: PASS)
- Subsequent requests may fail due to:
  - Residual state in the ring buffer
  - Incomplete KV cache cleanup
  - Position-tracking errors across requests
  - CPU block allocation fragmentation
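
The Next Steps at the end of this document propose an explicit `reset()` to address this; a sketch of the per-request state such a method would need to clear (attribute names are assumptions, not the real fields):

```python
import torch

# Hypothetical attribute names; the real state lives across OffloadEngine,
# RingKVBuffer, and HybridBlockManager.
class OffloadEngineResetSketch:
    def reset(self):
        """Clear all per-request state before accepting a new request."""
        torch.cuda.synchronize()            # drain in-flight async copies first
        self.ring_buffer.write_pos = 0      # rewind ring-buffer cursor
        self.ring_buffer.valid_tokens = 0   # mark all buffered KV as stale
        self.cpu_block_manager.free_all()   # return every CPU block
        self.position_offset = 0            # restart absolute positions at zero
```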
### Evidence

From batch mode testing (5 samples):

| Sample | Expected | Output | Status |
|---|---|---|---|
| 0 | 8930103 | `: 8930103.` | PASS (first request) |
| 1 | 4194548 | `: 419 multiplication of 4548.` | FAIL (second request) |
| 2 | 8231838 | `:ное 8231838.` | PASS |
| 3 | 8835373 | `: 8835373.` | PASS |
| 4 | 7754864 | `aster 7754864.` | PASS |

The corrupted output in Sample 1 suggests interference from Sample 0's state.
## Workaround

Use independent testing mode (a separate process per request) for production evaluation:

```bash
# Using test_ruler_niah.sh for parallel independent testing
./tests/test_ruler_niah.sh --gpus "0,1,2,3" --total 100

# Or manually run each sample in a separate process
for i in $(seq 0 99); do
    CUDA_VISIBLE_DEVICES=0 python tests/test_ruler_niah.py \
        --enable-offload --sample-indices $i --quiet
done
```
## Next Steps

- [x] Identify pattern in failing samples → Pattern: the first sample usually passes; failures occur in subsequent samples
- [ ] Investigate state cleanup between requests in offload mode:
  - Check `OffloadEngine` reset/cleanup logic
  - Check ring buffer state between requests
  - Check CPU block manager cleanup
- [ ] Add a `reset()` method to `OffloadEngine` for explicit state cleanup
- [ ] Compare state between the first and second request in batch mode
- [ ] Write a unit test that reproduces the batch mode failure (see the sketch below)
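
A sketch of such a test, assuming the vLLM-style `LLM`/`SamplingParams` interface that nanovllm mirrors plus a hypothetical `enable_offload` constructor flag; the prompt builder is stand-in code, not the RULER generator:

```python
from nanovllm import LLM, SamplingParams

def build_niah_prompt(needle: str) -> str:
    # Stand-in haystack; the real test uses RULER's 32K NIAH prompts.
    filler = "The grass is green. The sky is blue. " * 2000
    return f"{filler}\nThe magic number is {needle}.\n{filler}\nWhat is the magic number?"

def test_sequential_requests_do_not_interfere():
    llm = LLM("~/models/Llama-3.1-8B-Instruct", enable_offload=True)  # hypothetical flag
    params = SamplingParams(temperature=0.0, max_tokens=16)
    # The second request is the one that fails in batch mode (Sample 1 pattern).
    for needle in ["8930103", "4194548"]:
        out = llm.generate([build_niah_prompt(needle)], params)[0]
        text = out["text"] if isinstance(out, dict) else out.text  # output shape may differ
        assert needle in text, f"needle {needle} corrupted: {text!r}"
```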