Known Issues and Fixes
This document records bugs that were discovered and fixed in nano-vLLM.
Partial Last Block Bug (FIXED ✓)
Problem
When the prefill token count is not an exact multiple of block_size, decode produces garbage output.
Root Cause
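_chunked_decode_attention computed last_block_valid_tokens from len(seq) - 1, which grows with every decode step. The CPU blocks, however, are fixed once prefill finishes, so the computed count drifts away from the number of valid tokens actually stored in the last block.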
# BUG: len(seq) increases each decode step
total_prefill_tokens = len(seq) - 1 # Wrong!
last_block_valid_tokens = total_prefill_tokens % block_size # Reads garbage from CPU
Fix
Cache the original prefill length in HybridKVCacheManager and read it back through get_prefill_len():
# CORRECT: Use cached prefill length
total_prefill_tokens = kvcache_manager.get_prefill_len(seq) # Fixed value
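The drift can be reproduced with plain arithmetic. The following is a minimal sketch, not nano-vLLM code: it assumes that len(seq) equals the prefill length plus one at the first decode step and grows by one token per step, and the variable names are illustrative.

```python
block_size = 1024
prefill_len = 100          # fixed once prefill finishes
seq_len = prefill_len + 1  # assumed sequence length at the first decode step

buggy, fixed = [], []
for _ in range(3):  # three decode steps
    # Buggy variant: derived from the growing sequence length.
    buggy.append((seq_len - 1) % block_size)
    # Fixed variant: derived from the cached prefill length.
    fixed.append(prefill_len % block_size)
    seq_len += 1  # each decode step appends one token

print(buggy)  # [100, 101, 102]
print(fixed)  # [100, 100, 100]
```

From the second decode step onward, the buggy value claims more valid tokens than the last CPU block holds, so the extra positions are read as uninitialized data.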
Files Modified
- nanovllm/kvcache/hybrid_manager.py: Added _prefill_len dict and get_prefill_len() method
- nanovllm/layers/attention.py: Use get_prefill_len() instead of len(seq) - 1
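The caching pattern behind this change might look like the sketch below. It is a hypothetical stand-in, not the actual HybridKVCacheManager: the class name PrefillLenCache, the record_prefill and release hooks, and the integer seq_id key are all assumptions made for illustration.

```python
class PrefillLenCache:
    """Hypothetical stand-in for the prefill-length bookkeeping."""

    def __init__(self):
        # seq_id -> number of tokens written during prefill
        self._prefill_len = {}

    def record_prefill(self, seq_id, num_tokens):
        # Assumed hook: called when prefill for this sequence completes.
        # setdefault keeps the first value, so later calls cannot overwrite it.
        self._prefill_len.setdefault(seq_id, num_tokens)

    def get_prefill_len(self, seq_id):
        return self._prefill_len[seq_id]

    def release(self, seq_id):
        # Drop the entry when the sequence finishes so the dict does not grow.
        self._prefill_len.pop(seq_id, None)


cache = PrefillLenCache()
cache.record_prefill(seq_id=7, num_tokens=5000)
cache.record_prefill(seq_id=7, num_tokens=5003)  # ignored: length is already fixed
print(cache.get_prefill_len(7) % 4096)  # 904
```

Keeping the value write-once is the point of the design: decode steps can change the sequence, but they cannot change the answer returned for the prefill length.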
Verification
Tested with various prefill lengths (not multiples of block_size):
- 100 tokens (block_size=1024)
- 5000 tokens (block_size=4096)
- 15000 tokens (block_size=4096)
All tests now produce correct output.
Block Size 4096 Race Condition (FIXED ✓)
Problem
With block_size=4096 and multiple chunks, processing Chunk 2 failed with the CUDA error index_copy_(): index out of bounds.
Root Cause
Race condition between the default stream and the compute stream. In _prepare_chunked_offload_chunk(), the slot_mapping tensor is created with a non_blocking=True H2D transfer on the default stream, but store_kvcache runs on compute_stream. Without synchronization, compute_stream could read slot_mapping before the transfer completed, yielding corrupted indices.
Fix
Added explicit stream synchronization in attention.py:
if is_chunked_offload:
    compute_stream = context.kvcache_manager.offload_engine.compute_stream
    if k_cache.numel() and v_cache.numel():
        # CRITICAL: Wait for default stream to ensure slot_mapping tensor transfer is complete
        compute_stream.wait_stream(torch.cuda.default_stream())
        with torch.cuda.stream(compute_stream):
            store_kvcache(k, v, k_cache, v_cache, context.slot_mapping)
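The same ordering pattern can be tried in isolation. This is a minimal sketch that assumes PyTorch and a CUDA device; it returns None when either is missing. The function name scatter_on_side_stream is illustrative, the freshly created stream stands in for offload_engine.compute_stream, and index_copy_ stands in for store_kvcache. It uses torch.cuda.current_stream(), which is the default stream unless another stream context is active.

```python
try:
    import torch
except ImportError:  # keeps the sketch importable without PyTorch
    torch = None


def scatter_on_side_stream(slot_ids, num_slots):
    """Copy indices H2D on the current stream, then consume them on a side stream."""
    if torch is None or not torch.cuda.is_available():
        return None  # nothing to demonstrate without a CUDA device

    compute_stream = torch.cuda.Stream()  # stand-in for the offload compute stream

    # Asynchronous H2D copy from pinned memory, enqueued on the current stream.
    slots = torch.tensor(slot_ids, dtype=torch.long).pin_memory().cuda(non_blocking=True)
    cache = torch.zeros(num_slots, device="cuda")
    values = torch.ones(len(slot_ids), device="cuda")

    # Order the side stream after everything queued so far on the current
    # stream, including the index transfer above.
    compute_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(compute_stream):
        cache.index_copy_(0, slots, values)

    # Make the current stream wait for the side stream before reading back.
    torch.cuda.current_stream().wait_stream(compute_stream)
    return cache.cpu()


result = scatter_on_side_stream([0, 2], 4)
print(result)
```

wait_stream only enqueues a dependency between the two streams and does not block the host, which is why it is cheaper than a full torch.cuda.synchronize() here.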
Verification
Tested block sizes: 512, 1024, 4096, 8192 - all pass.
Files Modified
- nanovllm/layers/attention.py: Added compute_stream.wait_stream(torch.cuda.default_stream())
Reporting New Issues
If you discover a new bug, please document it here with:
- Problem: Clear description of the issue
- Root Cause: Analysis of why it happens
- Fix: Code changes to resolve it
- Files Modified: List of affected files
- Verification: How the fix was tested
Author: Zijie Tian