# Chunked Prefill Bug Debug Summary

## Problem

`test_needle.py --enable-offload --input-len 8192` fails with garbage output: the model generates completely wrong tokens instead of the expected "7492".
## Investigation Progress

### 1. Stream Synchronization Fix (Completed)

- Replaced the Triton `store_kvcache` kernel with pure PyTorch operations
- Moved `store_kvcache` to `compute_stream` in chunked prefill mode
- Added sync: `compute_stream.wait_event(offload_done)` after per-layer offload
- Added sync: `default_stream.wait_stream(compute_stream)` before return (pattern sketched below)
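A minimal sketch of the synchronization pattern described above. The stream, event, and buffer names are stand-ins for the engine's actual objects, not nanovllm's real code:

```python
import torch

# Hypothetical stand-ins for the engine's streams/events (a sketch, not the real implementation).
compute_stream = torch.cuda.Stream()
offload_stream = torch.cuda.Stream()
offload_done = torch.cuda.Event()

def store_and_offload_layer(k, v, k_cache, v_cache, slot_mapping, cpu_k, cpu_v):
    # Store KV with pure PyTorch indexing on the compute stream (no Triton kernel).
    with torch.cuda.stream(compute_stream):
        k_cache[slot_mapping] = k
        v_cache[slot_mapping] = v
    # Copy this layer's KV to (assumed pinned) CPU buffers on a side stream.
    offload_stream.wait_stream(compute_stream)      # the copy must see the freshly stored KV
    with torch.cuda.stream(offload_stream):
        cpu_k.copy_(k_cache[slot_mapping], non_blocking=True)
        cpu_v.copy_(v_cache[slot_mapping], non_blocking=True)
        offload_done.record()
    # Per-layer sync: compute may not reuse the GPU cache until the copy has finished.
    compute_stream.wait_event(offload_done)

# Before returning to the caller, the default stream waits on all compute work.
torch.cuda.default_stream().wait_stream(compute_stream)
```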
### 2. KV Cache Alignment Verification (Completed)

Created alignment tests to compare K/V tensors between the torch reference and nanovllm.

**RoPE Alignment:**
- RoPE implementations match closely (max_diff=0.002, cosine ~1.0)
- Confirms RoPE is NOT the cause of the bug

**K/V Cache Alignment (Chunk 0):**
- Cosine similarity: ~1.0 for all layers
- Max diff: 2-7 (grows linearly with position, characteristic of FP16 precision)
- Mean diff: < 0.001
- Conclusion: K/V cache offload is working correctly (comparison helper sketched below)
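A sketch of the kind of comparison used for these checks; the tensors passed in are assumed to be matching per-layer K/V captures from the two paths (names hypothetical):

```python
import torch
import torch.nn.functional as F

def compare_kv(ref: torch.Tensor, got: torch.Tensor, label: str) -> None:
    """Report cosine similarity and absolute error between two KV tensors."""
    ref32, got32 = ref.float().flatten(), got.float().flatten()
    cos = F.cosine_similarity(ref32, got32, dim=0).item()
    diff = (ref32 - got32).abs()
    print(f"{label}: cosine={cos:.4f} max_diff={diff.max().item():.3f} "
          f"mean_diff={diff.mean().item():.6f}")

# Hypothetical usage with per-layer captures from both paths:
# compare_kv(ref_k[layer], offload_k[layer], f"layer {layer} K")
```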
### 3. Layer Output Divergence Analysis (Completed)

Created a per-chunk layer output comparison (capture sketch below).

**Chunk 0 (tokens 0-4096):**
- All layers pass with excellent cosine similarity (0.999+)
- Max diff grows in later layers but stays within an acceptable range

**Chunk 1 (tokens 4096-8192):**
- Layers 0-19: OK (cosine ~1.0)
- Layers 20-27: diverge (cosine 0.83-0.96, max_diff up to 114)
- Divergence is concentrated in the later transformer layers
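One way to capture per-layer outputs for such a comparison is with forward hooks; this assumes the model exposes its decoder layers as an iterable (e.g. `model.model.layers`, an assumption about the module tree):

```python
import torch

def capture_layer_outputs(layers):
    """Register forward hooks that record each decoder layer's hidden states."""
    outputs: list[torch.Tensor] = []
    handles = []
    for layer in layers:
        def hook(_module, _inputs, output, _store=outputs):
            hidden = output[0] if isinstance(output, tuple) else output
            _store.append(hidden.detach().float().cpu())
        handles.append(layer.register_forward_hook(hook))
    return outputs, handles

# Hypothetical usage: run the same chunk through the reference and the offload path,
# then compare the recorded outputs layer by layer with compare_kv() above.
# ref_outs, handles = capture_layer_outputs(ref_model.model.layers)
```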
### 4. Critical Discovery: Single-Chunk Offload Also Fails

Key finding: even with `input_len=2048` (a single chunk, so no chunked attention), the model produces garbage output with CPU offload enabled.

```bash
# Without offload: PASSES
python tests/test_needle.py --input-len 2048
# Output: "7492" (correct)

# With offload: FAILS
python tests/test_needle.py --enable-offload --input-len 2048
# Output: "The Ble White Th G Lopsiswin..." (garbage)
```
This indicates the bug is NOT in:
- Chunked attention logic (`merge_attention_outputs`)
- Multi-chunk KV loading
- The ring buffer pipeline

The bug IS in:
- The decode path when CPU offload is enabled
- How prefilled KV is loaded/used during decode
### 5. Decode Path Analysis (In Progress)

The decode path in CPU offload mode:
- Prefill writes KV to GPU and offloads it to CPU
- Decode loads the prefilled KV from CPU via `_decode_ring_buffer_pipeline`
- Attention runs over the prefilled KV plus the accumulated decode tokens
- The partial results are merged (a reference merge is sketched after this list)
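For reference, a standard log-sum-exp merge of two partial attention results, which is the role `merge_attention_outputs` plays here; the shapes and signature below are assumptions rather than nanovllm's actual code:

```python
import torch

def merge_attention_outputs(o1: torch.Tensor, lse1: torch.Tensor,
                            o2: torch.Tensor, lse2: torch.Tensor):
    """Combine two partial attention results (output, log-sum-exp) computed over
    disjoint key/value ranges into the result over their union.

    Assumed shapes: o* = [num_heads, q_len, head_dim], lse* = [num_heads, q_len].
    """
    max_lse = torch.maximum(lse1, lse2)
    w1 = torch.exp(lse1 - max_lse)            # renormalization weights
    w2 = torch.exp(lse2 - max_lse)
    denom = w1 + w2
    out = (o1 * (w1 / denom).unsqueeze(-1) +
           o2 * (w2 / denom).unsqueeze(-1))
    lse = max_lse + torch.log(denom)          # log-sum-exp over the union of keys
    return out, lse
```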
Observations (diagnostic sketch below):
- The `prefilled_blocks` set is empty after decode (it should contain block IDs)
- The CPU cache has valid data (reasonable mean/std values)
- The decode buffer is all zeros (decode tokens not being stored correctly?)
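A sketch of the kind of diagnostic behind these observations; the argument names mirror, but may not match, the real attributes on the manager and offload engine:

```python
import torch

def dump_offload_state(prefilled_blocks: set, cpu_k_cache: torch.Tensor,
                       decode_buffer: torch.Tensor) -> None:
    """Print the three things checked above (argument names are hypothetical)."""
    print(f"prefilled_blocks: {sorted(prefilled_blocks)}")            # expected: non-empty block IDs
    k = cpu_k_cache.float()
    print(f"cpu K cache: mean={k.mean().item():.4f} std={k.std().item():.4f}")
    nonzero = decode_buffer.count_nonzero().item()
    print(f"decode buffer non-zero elements: {nonzero}")              # 0 => decode KV never stored
```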
## Current Status

**Working:**
- Stream synchronization fixes
- K/V cache offload to CPU (alignment verified)
- RoPE implementation
- Chunked prefill attention for the first chunk

**Not Working:**
- Decode with CPU offload (even for single-chunk inputs)
- Multi-chunk attention (divergence in later layers for chunk 1)
## Next Steps

- Debug why `prefilled_blocks` is empty after decode
- Check whether the decode path correctly loads KV from CPU
- Verify the decode buffer is being written correctly
- Compare decode attention outputs between offload and non-offload modes (comparison sketch below)
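One possible shape for that last comparison, assuming each run dumps its per-layer decode attention outputs to a file; the file names and dump mechanism are hypothetical:

```python
import torch

def compare_decode_runs(baseline_path: str, offload_path: str, atol: float = 0.05) -> None:
    """Compare per-layer decode attention outputs saved from two runs."""
    baseline = torch.load(baseline_path)   # hypothetical: a list of per-layer tensors
    offload = torch.load(offload_path)
    for i, (ref, got) in enumerate(zip(baseline, offload)):
        diff = (ref.float() - got.float()).abs().max().item()
        flag = "OK" if diff <= atol else "DIVERGES"
        print(f"layer {i:2d}: max_diff={diff:.4f} {flag}")

# compare_decode_runs("decode_attn_baseline.pt", "decode_attn_offload.pt")
```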
## Key Files

- `nanovllm/layers/attention.py` - Main attention implementation with chunked paths
- `nanovllm/kvcache/offload_engine.py` - CPU-GPU transfer engine
- `nanovllm/kvcache/hybrid_manager.py` - KV cache management with `prefilled_blocks`
- `nanovllm/engine/model_runner.py` - Prefill/decode orchestration
## Hypothesis

The decode path fails because:
- `prefilled_blocks` is not being tracked correctly, causing `get_prefilled_cpu_blocks()` to return an empty set
- OR the decode attention is not correctly loading/using the prefilled KV from CPU
- OR there is a stream synchronization issue specific to the decode path