
Chunked Prefill Bug Debug Summary

Problem

Running tests/test_needle.py --enable-offload --input-len 8192 fails with garbage output.

The model generates completely wrong tokens instead of the expected "7492".

Investigation Progress

1. Stream Synchronization Fix (Completed)

  • Replaced the Triton store_kvcache kernel with pure PyTorch operations
  • Moved store_kvcache onto compute_stream in chunked prefill mode
  • Added sync: compute_stream.wait_event(offload_done) after each per-layer offload
  • Added sync: default_stream.wait_stream(compute_stream) before returning (ordering sketched below)
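
A minimal sketch of the ordering these fixes enforce (stream and tensor names are illustrative, not the repo's exact code):

import torch

compute_stream = torch.cuda.Stream()
offload_stream = torch.cuda.Stream()
offload_done = torch.cuda.Event()

gpu_block = torch.empty(256, 128, device="cuda", dtype=torch.float16)
cpu_block = torch.empty(256, 128, dtype=torch.float16, pin_memory=True)
new_kv = torch.randn(256, 128, device="cuda", dtype=torch.float16)

with torch.cuda.stream(compute_stream):
    gpu_block.copy_(new_kv)                     # store_kvcache now runs on compute_stream

with torch.cuda.stream(offload_stream):
    offload_stream.wait_stream(compute_stream)  # copy only after the store has landed
    cpu_block.copy_(gpu_block, non_blocking=True)
    offload_done.record()                       # event recorded on offload_stream

compute_stream.wait_event(offload_done)         # fix: don't overwrite gpu_block mid-copy
torch.cuda.default_stream().wait_stream(compute_stream)  # fix: sync before returning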

2. KV Cache Alignment Verification (Completed)

Created alignment tests that compare K/V tensors between a plain PyTorch reference implementation and nanovllm:

RoPE Alignment:

  • The RoPE implementations match within numerical tolerance (max_diff=0.002, cosine ~1.0)
  • Confirmed RoPE is NOT the cause of the bug

K/V Cache Alignment (Chunk 0):

  • Cosine similarity: ~1.0 for all layers
  • Max diff: 2-7 (grows linearly with position, characteristic of FP16 rounding on growing magnitudes)
  • Mean diff: < 0.001
  • Conclusion: the K/V cache offload itself is working correctly (comparison sketched below)
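
For reference, the metrics behind these numbers can be computed along these lines (a sketch; the exact cache layout in the repo may differ):

import torch
import torch.nn.functional as F

def alignment_stats(ref: torch.Tensor, test: torch.Tensor):
    # Compare a reference K/V tensor against the offloaded copy, in FP32.
    ref32, test32 = ref.float(), test.float()
    cos = F.cosine_similarity(ref32.flatten(), test32.flatten(), dim=0)
    diff = (ref32 - test32).abs()
    return cos.item(), diff.max().item(), diff.mean().item()

# Per layer this reports cos ~1.0, max_diff 2-7, mean_diff < 0.001 for chunk 0.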

3. Layer Output Divergence Analysis (Completed)

Created a per-chunk, per-layer output comparison (hook-based; sketched below):

Chunk 0 (tokens 0-4096):

  • All layers pass with excellent cosine similarity (0.999+)
  • Max diff grows in the later layers but stays within an acceptable range

Chunk 1 (tokens 4096-8192):

  • Layers 0-19: OK (cosine ~1.0)
  • Layers 20-27: diverge (cosine 0.83-0.96, max_diff up to 114)
  • The divergence is confined to the deeper transformer layers
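
The comparison hooks each decoder layer and records its output. A sketch of the approach (model.model.layers is an assumption about the reference model's module path):

import torch
import torch.nn.functional as F

def capture_layer_outputs(model, store):
    # Record each decoder layer's hidden states during one forward pass.
    handles = []
    for idx, layer in enumerate(model.model.layers):
        def hook(mod, args, out, idx=idx):
            h = out[0] if isinstance(out, tuple) else out
            store[idx] = h.detach().float().cpu()
        handles.append(layer.register_forward_hook(hook))
    return handles

def compare(ref_store, test_store):
    for idx in sorted(ref_store):
        a, b = ref_store[idx], test_store[idx]
        cos = F.cosine_similarity(a.flatten(), b.flatten(), dim=0).item()
        print(f"layer {idx}: cos={cos:.4f}, max_diff={(a - b).abs().max():.2f}")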

4. Critical Discovery: Single-Chunk Offload Also Fails

Key finding: even with input_len=2048 (a single chunk, so no chunked attention path), the model produces garbage output when CPU offload is enabled.

# Without offload: PASSES
python tests/test_needle.py --input-len 2048
# Output: "7492" (correct)

# With offload: FAILS
python tests/test_needle.py --enable-offload --input-len 2048
# Output: "The Ble White Th G Lopsiswin..." (garbage)

This proves the bug is NOT in:

  • Chunked attention logic (merge_attention_outputs; the standard merge is sketched at the end of this section)
  • Multi-chunk KV loading
  • Ring buffer pipeline

The bug IS in:

  • The decode path when CPU offload is enabled
  • How prefilled KV is loaded/used during decode
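
For completeness: merging per-chunk attention outputs is the standard log-sum-exp combine, which can be unit-tested against unchunked attention, supporting the rule-out above. A sketch of the technique (not necessarily the repo's exact merge_attention_outputs):

import torch

def attend(q, k, v):
    # Naive attention over one KV chunk; returns output and log-sum-exp.
    s = (q @ k.transpose(-1, -2)) / q.shape[-1] ** 0.5
    return torch.softmax(s, -1) @ v, torch.logsumexp(s, -1)

def merge_attention_outputs(o1, lse1, o2, lse2):
    m = torch.maximum(lse1, lse2)
    w1, w2 = torch.exp(lse1 - m).unsqueeze(-1), torch.exp(lse2 - m).unsqueeze(-1)
    return (o1 * w1 + o2 * w2) / (w1 + w2)

# Sanity check: two merged chunks must equal full attention.
q, k, v = (torch.randn(8, n, 64) for n in (4, 256, 256))
full, _ = attend(q, k, v)
o1, l1 = attend(q, k[:, :128], v[:, :128])
o2, l2 = attend(q, k[:, 128:], v[:, 128:])
assert torch.allclose(merge_attention_outputs(o1, l1, o2, l2), full, atol=1e-5)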

5. Decode Path Analysis (In Progress)

The decode path in CPU offload mode (steps sketched in code below):

  1. Prefill writes KV to the GPU, then offloads it to CPU
  2. Decode loads the prefilled KV back from CPU via _decode_ring_buffer_pipeline
  3. Attention runs over the prefilled KV plus the accumulated decode tokens
  4. The partial attention outputs are merged
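
In code terms, steps 2-4 look roughly like this (a naive-attention sketch; the real path uses paged attention kernels and asynchronous ring-buffer loads):

import torch

def attend(q, k, v):
    s = (q @ k.transpose(-1, -2)) / q.shape[-1] ** 0.5
    return torch.softmax(s, -1) @ v, torch.logsumexp(s, -1)

def decode_step(q, cpu_kv_chunks, decode_k, decode_v):
    # Step 3: attend to decode tokens already resident on the GPU.
    out, lse = attend(q, decode_k, decode_v)
    # Step 2: stream each prefilled KV chunk back from CPU and attend to it.
    for k_cpu, v_cpu in cpu_kv_chunks:
        k, v = k_cpu.cuda(non_blocking=True), v_cpu.cuda(non_blocking=True)
        o2, l2 = attend(q, k, v)
        # Step 4: merge partial outputs with log-sum-exp weights.
        m = torch.maximum(lse, l2)
        w1, w2 = torch.exp(lse - m).unsqueeze(-1), torch.exp(l2 - m).unsqueeze(-1)
        out, lse = (out * w1 + o2 * w2) / (w1 + w2), torch.logaddexp(lse, l2)
    return out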

Observations:

  • The prefilled_blocks set is empty after decode (it should contain the prefilled block IDs)
  • The CPU cache holds valid data (reasonable mean/std values)
  • The decode buffer is all zeros (decode tokens apparently not being stored correctly; probes sketched below)
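
These observations came from probes along the following lines (manager/engine attribute names here are assumptions about the internals, not confirmed APIs):

# Hypothetical probes; manager/engine attribute names are assumptions.
print("prefilled_blocks:", sorted(manager.prefilled_blocks))   # observed: empty

k_cpu = engine.cpu_cache[layer_idx][0].float()                 # observed: healthy stats
print("cpu K mean/std:", k_cpu.mean().item(), k_cpu.std().item())

buf = engine.decode_buffer[layer_idx]                          # observed: all zeros
print("decode buffer nonzero fraction:", (buf != 0).float().mean().item())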

Current Status

Working

  • Stream synchronization fixes
  • K/V cache offload to CPU (verified alignment)
  • RoPE implementation
  • Chunked prefill attention for the first chunk

Not Working

  • Decode with CPU offload (fails even for single-chunk inputs)
  • Multi-chunk attention (divergence in the later layers for chunk 1)

Next Steps

  1. Debug why prefilled_blocks is empty after decode
  2. Check whether the decode path correctly loads KV from CPU
  3. Verify the decode buffer is being written correctly
  4. Compare decode attention outputs between offload and non-offload modes (divergence probe sketched below)
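
For step 4, a first-divergence probe is usually the quickest check (a sketch; per-step logits would need to be collected by instrumenting model_runner in both modes):

import torch
import torch.nn.functional as F

def first_divergence(logits_ref, logits_off):
    # logits_*: one [vocab] tensor per decode step, from the two runs.
    for step, (a, b) in enumerate(zip(logits_ref, logits_off)):
        cos = F.cosine_similarity(a.float(), b.float(), dim=0).item()
        if a.argmax() != b.argmax() or cos < 0.99:
            print(f"step {step}: cos={cos:.4f}, "
                  f"ref token {a.argmax().item()} vs offload token {b.argmax().item()}")
            return step
    print("no divergence found")
    return None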

Key Files

  • nanovllm/layers/attention.py - Main attention implementation with chunked paths
  • nanovllm/kvcache/offload_engine.py - CPU-GPU transfer engine
  • nanovllm/kvcache/hybrid_manager.py - KV cache management with prefilled_blocks
  • nanovllm/engine/model_runner.py - Prefill/decode orchestration

Hypothesis

The decode path fails because one of the following holds:

  1. prefilled_blocks is not being tracked correctly, causing get_prefilled_cpu_blocks() to return an empty set
  2. The decode attention is not correctly loading/using the prefilled KV from CPU
  3. There is a stream synchronization issue specific to the decode path