# Chunked Prefill Bug Debug Summary

## Problem

`test_needle.py --enable-offload --input-len 8192` fails with garbage output. The model generates completely wrong tokens instead of the expected "7492".

## Investigation Progress

### 1. Stream Synchronization Fix (Completed)

- Replaced the Triton `store_kvcache` kernel with pure PyTorch operations
- Moved `store_kvcache` to `compute_stream` in chunked prefill mode
- Added sync: `compute_stream.wait_event(offload_done)` after the per-layer offload
- Added sync: `default_stream.wait_stream(compute_stream)` before returning (the overall pattern is sketched below)
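
A minimal sketch of the synchronization pattern above, assuming `compute_stream` and `offload_stream` are dedicated `torch.cuda.Stream`s and `offload_done` is a `torch.cuda.Event` recorded on the offload stream; the tensors are stand-ins, not the real cache layout:

```
import torch

compute_stream = torch.cuda.Stream()
offload_stream = torch.cuda.Stream()

gpu_kv = torch.zeros(256, 128, device="cuda", dtype=torch.float16)  # stand-in KV block
cpu_kv = torch.empty_like(gpu_kv, device="cpu").pin_memory()        # pinned host buffer

# 1. Attention + store_kvcache run on compute_stream (pure PyTorch store, no Triton).
with torch.cuda.stream(compute_stream):
    gpu_kv.add_(1.0)  # placeholder for the per-layer attention + KV store

# 2. The per-layer offload runs on offload_stream once the KV write is visible.
offload_done = torch.cuda.Event()
with torch.cuda.stream(offload_stream):
    offload_stream.wait_stream(compute_stream)
    cpu_kv.copy_(gpu_kv, non_blocking=True)
    offload_done.record(offload_stream)

# 3. compute_stream must not reuse the GPU block until the offload has finished.
compute_stream.wait_event(offload_done)

# 4. Before returning, the default stream waits on compute_stream.
torch.cuda.default_stream().wait_stream(compute_stream)
torch.cuda.synchronize()  # only needed for this standalone demo
```
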
### 2. KV Cache Alignment Verification (Completed)

Created alignment tests to compare K/V tensors between the torch reference and nanovllm:

**RoPE Alignment:**

- RoPE implementations match closely (max_diff = 0.002, cosine similarity ~1.0)
- Confirms RoPE is NOT the cause of the bug

**K/V Cache Alignment (Chunk 0):**

- Cosine similarity: ~1.0 for all layers
- Max diff: 2-7 (grows roughly linearly with position, consistent with FP16 precision loss)
- Mean diff: < 0.001
- **Conclusion: K/V cache offload is working correctly** (comparison sketch below)
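
A minimal sketch of the comparison used for these checks; the tensor names and the call pattern are illustrative, not the actual test code:

```
import torch
import torch.nn.functional as F

def compare_tensors(ref: torch.Tensor, test: torch.Tensor, name: str) -> None:
    """Report cosine similarity, max diff, and mean diff between two K/V tensors."""
    ref32, test32 = ref.float(), test.float()
    cos = F.cosine_similarity(ref32.flatten(), test32.flatten(), dim=0).item()
    diff = (ref32 - test32).abs()
    print(f"{name}: cosine={cos:.4f}  max_diff={diff.max().item():.3f}  "
          f"mean_diff={diff.mean().item():.5f}")

# Illustrative usage: per-layer K comparison between the torch reference and nanovllm.
# for i, (ref_k, nano_k) in enumerate(zip(ref_k_cache, nano_k_cache)):
#     compare_tensors(ref_k, nano_k, f"layer {i} K")
```
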
### 3. Layer Output Divergence Analysis (Completed)

Created a per-chunk layer output comparison:

**Chunk 0 (tokens 0-4096):**

- All layers pass with excellent cosine similarity (0.999+)
- Max diff grows in later layers but stays within an acceptable range

**Chunk 1 (tokens 4096-8192):**

- Layers 0-19: OK (cosine ~1.0)
- Layers 20-27: diverge (cosine 0.83-0.96, max_diff up to 114)
- The divergence is concentrated in the later transformer layers (a capture sketch follows this list)
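
One way to collect these per-layer outputs is a forward hook on every decoder layer. A sketch assuming the layers are reachable as `model.model.layers` (the exact module path depends on the model class):

```
import torch

def capture_layer_outputs(layers):
    """Attach forward hooks that record each decoder layer's output hidden states."""
    outputs, handles = {}, []

    def make_hook(idx):
        def hook(module, inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            outputs[idx] = hidden.detach().float().cpu()
        return hook

    for idx, layer in enumerate(layers):
        handles.append(layer.register_forward_hook(make_hook(idx)))
    return outputs, handles

# Illustrative usage (module path is an assumption about the model structure):
# outputs, handles = capture_layer_outputs(model.model.layers)
# run_one_chunk(model, chunk_input_ids)   # hypothetical helper: forward one prefill chunk
# for h in handles:
#     h.remove()
# ...then compare outputs[i] between the reference and nanovllm runs per chunk.
```
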
### 4. Critical Discovery: Single-Chunk Offload Also Fails

**Key finding:** Even with `input_len=2048` (single chunk, no chunked attention), the model produces garbage output with CPU offload enabled.

```
# Without offload: PASSES
python tests/test_needle.py --input-len 2048
# Output: "7492" (correct)

# With offload: FAILS
python tests/test_needle.py --enable-offload --input-len 2048
# Output: "The Ble White Th G Lopsiswin..." (garbage)
```

**This proves the bug is NOT in:**

- Chunked attention logic (`merge_attention_outputs`)
- Multi-chunk KV loading
- Ring buffer pipeline

**The bug IS in:**

- The decode path when CPU offload is enabled
- How prefilled KV is loaded/used during decode

### 5. Decode Path Analysis (In Progress)

The decode path in CPU offload mode:

1. Prefill writes KV to the GPU cache and offloads it to CPU
2. Decode loads the prefilled KV from CPU via `_decode_ring_buffer_pipeline`
3. Attention runs over the prefilled KV plus the accumulated decode tokens
4. The partial results are merged (the merge step is sketched below)
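
The merge in step 4 is typically a log-sum-exp rescaling of the partial attention outputs. The sketch below shows that standard technique; it is not necessarily the exact `merge_attention_outputs` implementation in `attention.py`:

```
import torch

def merge_partial_attention(o1, lse1, o2, lse2):
    """Merge attention outputs computed over two disjoint KV segments.

    o1, o2:     [num_heads, head_dim] partial outputs for each segment
    lse1, lse2: [num_heads] log-sum-exp of the attention logits per segment
    """
    lse = torch.logaddexp(lse1, lse2)             # combined softmax normalizer
    w1 = torch.exp(lse1 - lse).unsqueeze(-1)      # weight of segment 1
    w2 = torch.exp(lse2 - lse).unsqueeze(-1)      # weight of segment 2
    return w1 * o1 + w2 * o2, lse
```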

**Observations:**

- The `prefilled_blocks` set is empty after decode (it should contain block IDs)
- The CPU cache holds valid data (reasonable mean/std values)
- The decode buffer contains only zeros (decode tokens not being stored correctly?)
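
A rough diagnostic for these observations. All attribute names here (`hybrid_manager.prefilled_blocks`, `offload_engine.cpu_cache`, `offload_engine.decode_buffer`) are assumptions about the internals and may need to be adjusted to the real fields:

```
import torch

def dump_offload_state(hybrid_manager, offload_engine) -> None:
    """Print the state that the observations above call into question.

    Attribute names are assumptions for illustration; adjust to the actual
    fields in hybrid_manager.py / offload_engine.py.
    """
    print("prefilled_blocks:", sorted(hybrid_manager.prefilled_blocks))

    cpu_cache = offload_engine.cpu_cache        # assumed CPU-side KV tensor
    print(f"cpu_cache: mean={cpu_cache.float().mean():.4f} "
          f"std={cpu_cache.float().std():.4f}")

    decode_buf = offload_engine.decode_buffer   # assumed per-step decode KV buffer
    nonzero = decode_buf.count_nonzero().item()
    status = "EMPTY - decode tokens not stored?" if nonzero == 0 else "has data"
    print(f"decode_buffer: {nonzero} non-zero elements ({status})")
```
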
## Current Status

### Working

- Stream synchronization fixes
- K/V cache offload to CPU (alignment verified)
- RoPE implementation
- Chunked prefill attention for the first chunk

### Not Working

- Decode with CPU offload (even for single-chunk inputs)
- Multi-chunk attention (divergence in later layers for chunk 1)

## Next Steps

1. Debug why `prefilled_blocks` is empty after decode
2. Check if the decode path correctly loads KV from CPU
3. Verify the decode buffer is being written correctly
4. Compare decode attention outputs between offload and non-offload modes

## Key Files

- `nanovllm/layers/attention.py` - Main attention implementation with chunked paths
- `nanovllm/kvcache/offload_engine.py` - CPU-GPU transfer engine
- `nanovllm/kvcache/hybrid_manager.py` - KV cache management with `prefilled_blocks`
- `nanovllm/engine/model_runner.py` - Prefill/decode orchestration

## Hypothesis

The decode path fails because one (or more) of the following holds:

1. `prefilled_blocks` is not being tracked correctly, causing `get_prefilled_cpu_blocks()` to return an empty set
2. The decode attention is not correctly loading/using the prefilled KV from CPU
3. There is a stream synchronization issue specific to the decode path