zijie-tian/nano-vllm

Zijie Tian 8e0888c20c [docs] Added offload_acc issue.

2026-01-12 15:05:55 +08:00

CPU Offload Accuracy Issue Investigation

Problem Summary

CPU offload mode produces significantly lower accuracy than non-offload mode on the RULER NIAH benchmark.

Mode	Accuracy	Pass/Total
Non-Offload (GPU only)	100%	100/100
CPU Offload	66%	66/100

This drop of 34 percentage points indicates a bug in the offload implementation that affects inference correctness.

Test Environment

Model: Llama-3.1-8B-Instruct
Task: RULER NIAH (Needle-In-A-Haystack) 32K context
GPU: NVIDIA A100-SXM4-80GB
Data: tests/data/ruler_niah/niah_single_1_32k.jsonl (100 samples)

Reproduction Commands

Non-Offload Mode (100% accuracy)

CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --gpu-utilization 0.7 \
    --quiet

Configuration:

KV Cache: GPU only, 51 blocks (6528 MB)
Block size: 1024 tokens

Offload Mode (66% accuracy)

CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --enable-offload \
    --quiet

Configuration:

KV Cache: GPU 4 blocks (512 MB) + CPU 32 blocks (4096 MB)
Ring buffer: 4 buffers × 33280 tokens (520 MB)
Per-layer decode buffer: 128 MB
Block size: 1024 tokens
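
The reported sizes can be cross-checked with a little arithmetic. The sketch below assumes the published Llama-3.1-8B shape (32 layers, 8 KV heads, head dimension 128) and 2-byte float16/bfloat16 KV entries; none of these constants are stated in this document. It also assumes that "MB" here means MiB and that each ring buffer holds one layer's K and V for 33280 tokens.

```python
# Assumed Llama-3.1-8B shape; these constants are not stated in this document.
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_ELEM = 2  # float16 / bfloat16
MIB = 1024 * 1024
BLOCK_SIZE = 1024   # tokens per KV block, from the configuration above

# K and V for one token in one layer, then across all layers.
per_token_per_layer = 2 * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM
per_token_all_layers = per_token_per_layer * NUM_LAYERS

# Size of one KV block that covers every layer.
block_mib = BLOCK_SIZE * per_token_all_layers // MIB

sizes_mib = {
    "gpu_only_51_blocks": 51 * block_mib,
    "offload_gpu_4_blocks": 4 * block_mib,
    "offload_cpu_32_blocks": 32 * block_mib,
    # Assumption: each of the 4 ring buffers stores a single layer's K/V.
    "ring_4x33280_tokens": 4 * 33280 * per_token_per_layer // MIB,
}
for name, mib in sizes_mib.items():
    print(f"{name}: {mib} MiB")
```

Under these assumptions the four figures come out to 6528, 512, 4096 and 520 MiB, matching the configurations reported for both modes. Note also that 33280 = 32768 + 512.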

Observed Failure Patterns

From the 5-sample verbose test:

Sample	Expected	Offload Output	Status
0	8930103	`: 8930103.`	PASS
1	4194548	`: 419 multiplication of 4548.`	FAIL
2	8231838	`:ное 8231838.`	PASS
3	8835373	`: 8835373.`	PASS
4	7754864	`aster 7754864.`	PASS

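Failure pattern: The model sometimes produces corrupted or split outputs (e.g., "419 multiplication of 4548" instead of "4194548"). Two passing samples (2 and 4) also begin with a spurious token (":ное", "aster") before the correct number, so the corruption is not confined to the failing sample.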
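
To separate "split" answers from outright wrong ones across all 100 samples, a small classifier can help. This is a hypothetical helper, not the pass criterion used by tests/test_ruler_niah.py (which is not shown here); it only looks at the digits in the output.

```python
import re

def classify_output(expected: str, output: str) -> str:
    """Label an output as 'exact', 'split', or 'wrong'.

    'exact': the expected number appears contiguously in the output.
    'split': the output's digits spell the number, but other text interrupts them.
    'wrong': the expected number cannot be recovered from the output's digits.
    """
    if expected in output:
        return "exact"
    digits_only = "".join(re.findall(r"\d", output))
    if expected in digits_only:
        return "split"
    return "wrong"

# The five samples from the table above.
samples = [
    ("8930103", ": 8930103."),
    ("4194548", ": 419 multiplication of 4548."),
    ("8231838", ":ное 8231838."),
    ("8835373", ": 8835373."),
    ("7754864", "aster 7754864."),
]
labels = [classify_output(exp, out) for exp, out in samples]
print(labels)  # ['exact', 'split', 'exact', 'exact', 'exact']
```

Sample 1 is labelled "split": all seven digits are present and in order, which suggests an extra token was inserted mid-answer rather than the needle being lost.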

Architecture Overview

Offload Mode Data Flow

Prefill Phase:
1. Input tokens → chunked into 2048-token chunks
2. Each chunk processed layer by layer:
   - Load KV from CPU → GPU ring buffer
   - Compute attention
   - Store KV back to CPU
3. Ring buffer holds recent KV for decode

Decode Phase:
1. For each new token:
   - Load all layer KV from CPU (one layer at a time)
   - Compute attention against full context
   - Generate next token
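
Because the prefill chunk (2048 tokens) is twice the block size (1024 tokens), each full chunk spans two KV blocks. The sketch below shows the mapping from chunk index to token range and logical block indices. It assumes logical blocks cover token positions contiguously from 0; the actual allocation in HybridBlockManager may differ, and blocks_for_chunk is a name made up for this example.

```python
CHUNK_SIZE = 2048   # prefill chunk size from the data flow above
BLOCK_SIZE = 1024   # KV block size from the configuration

def blocks_for_chunk(chunk_idx: int, seq_len: int):
    """Return (start, end, block_indices) for one prefill chunk.

    start/end are token positions (end exclusive); block_indices are the
    logical KV blocks that the chunk's tokens fall into.
    """
    start = chunk_idx * CHUNK_SIZE
    end = min(start + CHUNK_SIZE, seq_len)
    if start >= end:
        raise ValueError("chunk index is past the end of the sequence")
    first_block = start // BLOCK_SIZE
    last_block = (end - 1) // BLOCK_SIZE
    return start, end, list(range(first_block, last_block + 1))

seq_len = 5000  # hypothetical prompt length
num_chunks = -(-seq_len // CHUNK_SIZE)  # ceiling division
layout = [blocks_for_chunk(i, seq_len) for i in range(num_chunks)]
for entry in layout:
    print(entry)
```

The last chunk is usually partial, which makes it a natural place to look for boundary errors.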

Key Components

File	Component	Description
`nanovllm/kvcache/offload_engine.py`	`OffloadEngine`	Manages CPU↔GPU KV cache transfers
`nanovllm/kvcache/offload_engine.py`	`RingKVBuffer`	GPU ring buffer for recent KV
`nanovllm/engine/model_runner.py`	`run_chunked_offload_prefill()`	Chunked prefill with offload
`nanovllm/engine/model_runner.py`	`run_offload_decode()`	Layer-wise decode with offload
`nanovllm/kvcache/hybrid_manager.py`	`HybridBlockManager`	CPU block allocation

Potential Root Causes

1. Ring Buffer Index/Position Issues

Location: nanovllm/kvcache/offload_engine.py

The ring buffer uses modular indexing. Potential issues:

Position calculation errors during prefill/decode transition
Off-by-one errors in KV storage/retrieval
Incorrect handling when sequence length approaches max_seq_len

Recent fix applied: max_seq_len was set to max_model_len + 512 to prevent overflow, but other indexing issues may remain.
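
One way to catch such errors early is to assert bounds on every ring buffer write. The sketch below is a hypothetical checker, not the project's RingKVBuffer. It assumes the ring rotates over layers (layer index modulo the number of buffers) and that each buffer holds up to max_seq_len tokens, which the reported sizes suggest but this document does not state.

```python
class RingSlotChecker:
    """Hypothetical bounds checker; not the project's RingKVBuffer."""

    def __init__(self, num_buffers: int, max_seq_len: int):
        self.num_buffers = num_buffers
        self.max_seq_len = max_seq_len

    def slot_for_layer(self, layer_idx: int) -> int:
        # Modular indexing: all layers share a small pool of buffers.
        return layer_idx % self.num_buffers

    def check_write(self, layer_idx: int, start_pos: int, num_tokens: int) -> int:
        """Validate a write of num_tokens at start_pos and return the slot."""
        if start_pos < 0 or num_tokens <= 0:
            raise ValueError(f"bad write range: start={start_pos}, n={num_tokens}")
        end_pos = start_pos + num_tokens  # exclusive
        if end_pos > self.max_seq_len:
            raise IndexError(
                f"layer {layer_idx}: write [{start_pos}, {end_pos}) exceeds "
                f"capacity {self.max_seq_len}"
            )
        return self.slot_for_layer(layer_idx)

# 4 buffers and max_seq_len = 32768 + 512, as in the offload configuration.
checker = RingSlotChecker(num_buffers=4, max_seq_len=32768 + 512)
print(checker.check_write(layer_idx=5, start_pos=32768, num_tokens=1))  # 1
```

A write that ends exactly at max_seq_len is valid and one token past it is not, which is the off-by-one boundary worth testing explicitly.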

2. Chunked Prefill KV Storage

Location: nanovllm/engine/model_runner.py:run_chunked_offload_prefill()

During chunked prefill:

KV computed for chunk N must be correctly stored before processing chunk N+1
Position IDs must be correctly accumulated across chunks
CPU block allocation must be contiguous and correctly tracked

Suspect areas:

# Check if positions are correctly tracked across chunks
# Check if KV is correctly copied to CPU after each chunk
# Check if ring buffer indices align with CPU block indices
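
The first check can be automated by logging each chunk's start position and length, then verifying that the chunks tile the sequence with no gaps or overlaps. A minimal sketch, assuming the log is a list of (start_pos, num_tokens) pairs; that log format and the function name are hypothetical.

```python
def check_chunk_positions(chunks):
    """Check that chunks tile the sequence contiguously from position 0.

    chunks: list of (start_pos, num_tokens) in processing order.
    Returns a list of problem descriptions; an empty list means no gap
    and no overlap was found.
    """
    problems = []
    expected_start = 0
    for i, (start, n) in enumerate(chunks):
        if n <= 0:
            problems.append(f"chunk {i}: non-positive length {n}")
            continue
        if start > expected_start:
            problems.append(
                f"chunk {i}: gap, expected start {expected_start}, got {start}"
            )
        elif start < expected_start:
            problems.append(
                f"chunk {i}: overlap, expected start {expected_start}, got {start}"
            )
        expected_start = start + n
    return problems

good = [(0, 2048), (2048, 2048), (4096, 904)]
bad = [(0, 2048), (2048, 2048), (4095, 905)]  # third chunk starts one token early
print(check_chunk_positions(good))
print(check_chunk_positions(bad))
```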

3. Decode Phase KV Loading

Location: nanovllm/engine/model_runner.py:run_offload_decode()

During decode:

Must load KV for ALL previous tokens (both prefill and decode)
Layer-by-layer loading must be synchronized correctly
Attention computation must use correct sequence length

Suspect areas:

# Check if decode loads KV for full context length
# Check if new decode KV is stored correctly
# Check if attention mask/positions are correct
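
The expected KV length at each decode step is easy to compute and can be asserted against what the decode path actually loads. The sketch below assumes decode step 0 processes the first generated token at position prompt_len, so it attends to prompt_len + 1 keys. The prompt length is hypothetical. Whether decode-time KV is stored in CPU blocks or only in the decode buffer is not stated in this document, so the capacity check is only an illustration of the invariant.

```python
def decode_kv_requirements(prompt_len, decode_step, block_size=1024):
    """Return (kv_len, num_blocks) needed to attend at a decode step.

    decode_step is 0 for the first decode iteration. The token being
    processed attends to the prompt, all earlier decode tokens, and itself.
    """
    kv_len = prompt_len + decode_step + 1
    num_blocks = -(-kv_len // block_size)  # ceiling division
    return kv_len, num_blocks

def first_overflow_step(prompt_len, max_new_tokens, capacity_blocks, block_size=1024):
    """Return the first decode step that needs more blocks than available, or None."""
    for step in range(max_new_tokens):
        _, num_blocks = decode_kv_requirements(prompt_len, step, block_size)
        if num_blocks > capacity_blocks:
            return step
    return None

# Hypothetical prompt length; 32 blocks matches the CPU block count above.
print(decode_kv_requirements(prompt_len=32700, decode_step=0))  # (32701, 32)
print(first_overflow_step(32700, max_new_tokens=128, capacity_blocks=32))  # 68
```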

4. CPU↔GPU Transfer Synchronization

Location: nanovllm/kvcache/offload_engine.py

CUDA streams and synchronization:

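Async copies issued on different streams may complete out of order
Missing synchronization points could cause attention to read stale data
Stream priorities only change scheduling, but a change in scheduling can expose a missing synchronization point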

5. Numerical Precision

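CPU tensors are stored in float16/bfloat16
GPU computation may run at a different precision than the stored KV
Precision can be lost during transfers if they involve a dtype conversion (a plain copy is bit-exact)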

Debugging Strategy

Step 1: Identify Failing Samples

# Run verbose mode to see which samples fail
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --enable-offload \
    --verbose 2>&1 | tee offload_verbose.log

Step 2: Compare Token-by-Token

Create a debug script to compare token generation between offload and non-offload modes for a failing sample:

# Compare logits at each decode step
# Check if divergence starts at a specific position
# Log KV cache contents at divergence point
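
A minimal sketch of the comparison, assuming each run has been dumped as a list of (token_id, logits) pairs with logits as plain Python floats. The dump format, the function name and the tolerance are assumptions for this example; nothing here is an existing nano-vllm API.

```python
def first_divergence(ref_steps, test_steps, atol=1e-2):
    """Return (step, max_abs_diff) for the first disagreeing step, else None.

    Each step is (token_id, logits). A step disagrees if the sampled token
    differs or if any logit differs by more than atol.
    """
    pairs = zip(ref_steps, test_steps)
    for step, ((ref_tok, ref_logits), (test_tok, test_logits)) in enumerate(pairs):
        max_diff = max(
            (abs(a - b) for a, b in zip(ref_logits, test_logits)), default=0.0
        )
        if ref_tok != test_tok or max_diff > atol:
            return step, max_diff
    return None

# Toy data for illustration only; not logits from the benchmark.
ref = [(5, [0.1, 0.9]), (7, [0.8, 0.2]), (2, [0.5, 0.5])]
off = [(5, [0.1, 0.9]), (3, [0.3, 0.7]), (2, [0.5, 0.5])]
print(first_divergence(ref, off))
```

Some drift between modes is expected from kernel and reduction-order differences, so the tolerance needs tuning. The useful signal is a sudden jump in max_abs_diff at one step.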

Step 3: Verify KV Cache Contents

Add debugging to OffloadEngine:

# In store_kv(): Log what's being stored
# In load_kv(): Log what's being loaded
# Compare loaded KV with expected values
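
Rather than logging full tensors, it is cheaper to record a digest for each stored block and compare it on load. The class below is a hypothetical standalone helper, not part of OffloadEngine; it assumes the caller can provide the raw bytes of each KV block, keyed by layer and block index.

```python
import hashlib

class KVChecksumLog:
    """Hypothetical helper: remembers a digest per (layer, block) on store
    and compares it on load."""

    def __init__(self):
        self._digests = {}

    @staticmethod
    def _digest(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    def record_store(self, layer: int, block: int, data: bytes) -> None:
        self._digests[(layer, block)] = self._digest(data)

    def verify_load(self, layer: int, block: int, data: bytes) -> bool:
        """Return True if the loaded bytes match what was last stored."""
        key = (layer, block)
        if key not in self._digests:
            raise KeyError(f"load of {key} with no prior store")
        return self._digests[key] == self._digest(data)

log = KVChecksumLog()
log.record_store(layer=0, block=3, data=b"\x01\x02\x03\x04")
print(log.verify_load(0, 3, b"\x01\x02\x03\x04"))  # True
print(log.verify_load(0, 3, b"\x01\x02\x03\x05"))  # False
```

A mismatch means the transfer path or the indexing is wrong. If all digests match but the outputs still diverge, the attention or position logic becomes the main suspect.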

Step 4: Check Position/Index Calculations

# Log ring buffer write/read positions
# Log CPU block indices
# Verify position IDs match actual token positions

Step 5: Isolate the Bug

Test with shorter sequences (16K, 8K) to see if the issue is length-dependent
Test with a single chunk (no chunking) to rule chunked prefill in or out
Test prefill only (no decode) to rule the decode phase in or out

Quick Debugging Commands

# Test single failing sample with verbose output
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --enable-offload \
    --sample-indices 1 \
    --verbose

# Test with different context lengths
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --enable-offload \
    --max-model-len 16384 \
    --verbose

Related Documents

docs/ruler_niah_standalone_test.md - Test setup and background
docs/layerwise_offload_memory_analysis.md - Memory analysis (if it exists)

Test Results Log

Date: 2025-01-12

Test	Mode	Samples	Passed	Accuracy
RULER NIAH 32K	Non-Offload	100	100	100%
RULER NIAH 32K	CPU Offload	100	66	66%

Next Steps

Identify pattern in failing samples (position of needle? specific numbers?)
Add detailed logging to offload engine
Compare logits between offload and non-offload modes
Bisect the code to find the exact bug location
Write unit test that isolates the bug
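
For the first step, one simple approach is to bucket results by needle position and compare pass rates. The sketch assumes each sample can be reduced to a (depth_fraction, passed) pair, where depth_fraction is the needle's relative position in the context. Whether the RULER data file exposes that position directly is not confirmed here. The records below are made up for illustration.

```python
from collections import defaultdict

def pass_rate_by_bucket(records, num_buckets=4):
    """Group results into equal-width depth buckets.

    records: iterable of (depth_fraction, passed), depth in [0.0, 1.0].
    Returns {bucket_index: (passed_count, total_count)}.
    """
    counts = defaultdict(lambda: [0, 0])
    for depth, passed in records:
        if not 0.0 <= depth <= 1.0:
            raise ValueError(f"depth fraction out of range: {depth}")
        # depth == 1.0 is clamped into the last bucket.
        bucket = min(int(depth * num_buckets), num_buckets - 1)
        counts[bucket][1] += 1
        if passed:
            counts[bucket][0] += 1
    return {b: tuple(c) for b, c in sorted(counts.items())}

# Made-up records for illustration only; not results from the benchmark.
demo = [(0.05, True), (0.30, True), (0.55, False), (0.60, True), (1.0, False)]
print(pass_rate_by_bucket(demo))
```

If failures cluster in particular buckets, comparing those depths against chunk and block boundaries would narrow down where the KV data goes wrong.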
