Zijie Tian e6e0dc5d7d ✨ feat: add comprehensive RULER benchmark testing

- Add test_ruler.py from tzj/vs_offload branch with 13 RULER tasks
- Add comprehensive documentation for RULER benchmark results
- Update CLAUDE.md with new documentation index entry
- Add architecture, debugging, optimization, and known issues guides
- Test 32K context with CPU offload: 92.3% accuracy across all tasks
- Parallel execution on 4 GPUs with detailed performance metrics

Benchmark results:
- 13 RULER tasks total (niah_single, multikey, multiquery, multivalue, qa, cwe, fwe, vt)
- 26 samples tested with 92.3% overall accuracy
- CPU offload stable at 32K context length
- Parallel GPU execution achieving 4x speedup

Key findings:
- Single needle tasks: 100% accuracy
- Multi-value and recall tasks: 100% accuracy
- Multi-query tasks: 50% accuracy (most challenging)
- QA tasks: 100% accuracy
- Total execution time: ~220 seconds (parallel)

2026-01-18 20:34:06 +08:00


Known Issues and Fixes

This document records bugs that were discovered and fixed in nano-vLLM.

Partial Last Block Bug (FIXED ✓)

Problem

When the prefill token count is not an exact multiple of block_size, decode produces garbage output.

Root Cause

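_chunked_decode_attention calculated last_block_valid_tokens from len(seq) - 1, which grows by one on every decode step. The CPU blocks, however, are fixed once prefill completes, so the computed count no longer matches the data actually stored in the last block.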

# BUG: len(seq) increases each decode step
total_prefill_tokens = len(seq) - 1  # Wrong!
last_block_valid_tokens = total_prefill_tokens % block_size  # Reads garbage from CPU
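To make the drift concrete, the following sketch tabulates the buggy value against the fixed prefill-based value over a few decode steps. It assumes that len(seq) equals the prefill length plus one at the first decode step and grows by one token per step. The helper name valid_tokens_drift is hypothetical and not part of nano-vLLM.

```python
def valid_tokens_drift(prefill_len: int, block_size: int, decode_steps: int):
    """Return (step, buggy, correct) tuples for the last-block token count.

    Assumption: one token is appended to the sequence per decode step,
    so len(seq) == prefill_len + step.
    """
    rows = []
    for step in range(1, decode_steps + 1):
        seq_len = prefill_len + step                # grows during decode
        buggy = (seq_len - 1) % block_size          # derived from the live length
        correct = prefill_len % block_size          # derived from the fixed prefill length
        rows.append((step, buggy, correct))
    return rows


for step, buggy, correct in valid_tokens_drift(5000, 4096, 3):
    print(f"step={step} buggy={buggy} correct={correct}")
```

Under these assumptions the two values agree on the first decode step and diverge by one more token on each later step, which would explain why the first generated token can look fine while later ones do not.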

Fix

Cache the original prefill length in HybridKVCacheManager and read it back through get_prefill_len():

# CORRECT: Use cached prefill length
total_prefill_tokens = kvcache_manager.get_prefill_len(seq)  # Fixed value
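A minimal sketch of such a cache is shown below. It is not the actual HybridKVCacheManager code: the class name PrefillLenCache, the record_prefill and release methods, and the use of an integer sequence id as the key are all assumptions made for illustration. Only the _prefill_len dict and the get_prefill_len() name come from this document.

```python
class PrefillLenCache:
    """Hypothetical stand-in for the _prefill_len bookkeeping."""

    def __init__(self) -> None:
        # Maps a sequence id to the token count at the end of prefill.
        self._prefill_len: dict[int, int] = {}

    def record_prefill(self, seq_id: int, num_tokens: int) -> None:
        # setdefault keeps the first recorded value, so calls made
        # during decode cannot overwrite the prefill length.
        self._prefill_len.setdefault(seq_id, num_tokens)

    def get_prefill_len(self, seq_id: int) -> int:
        # Fixed for the lifetime of the sequence, unlike len(seq).
        return self._prefill_len[seq_id]

    def release(self, seq_id: int) -> None:
        # Drop the entry when the sequence finishes to avoid leaking keys.
        self._prefill_len.pop(seq_id, None)


cache = PrefillLenCache()
cache.record_prefill(seq_id=7, num_tokens=5000)
cache.record_prefill(seq_id=7, num_tokens=5003)  # ignored: already recorded
print(cache.get_prefill_len(7))
```

The design point is that the value is written once and only read afterwards, so every decode step sees the same block layout that prefill produced.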

Files Modified

nanovllm/kvcache/hybrid_manager.py: Added _prefill_len dict and get_prefill_len() method
nanovllm/layers/attention.py: Use get_prefill_len() instead of len(seq) - 1

Verification

Tested with various prefill lengths (not multiples of block_size):

100 tokens (block_size=1024)
5000 tokens (block_size=4096)
15000 tokens (block_size=4096)

All tests now produce correct output.
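For reference, the block layout implied by each tested configuration can be worked out with plain arithmetic. The helper block_layout below is hypothetical and only illustrates the calculation. Treating an exact multiple as a full last block is an assumption about the intended behavior, not something this document states.

```python
def block_layout(prefill_len: int, block_size: int) -> tuple[int, int]:
    """Return (number of blocks, valid tokens in the last block)."""
    num_blocks = -(-prefill_len // block_size)   # ceiling division
    remainder = prefill_len % block_size
    # Assumption: an exact multiple means the last block is completely full.
    last_valid = remainder if remainder else block_size
    return num_blocks, last_valid


for prefill_len, block_size in [(100, 1024), (5000, 4096), (15000, 4096)]:
    print(prefill_len, block_size, block_layout(prefill_len, block_size))
```

All three tested lengths leave the last block only partially filled, which is the case the bug affected.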

Block Size 4096 Race Condition (FIXED ✓)

Problem

With block_size=4096 and multiple chunks, an index_copy_(): index out of bounds CUDA error was raised while processing chunk 2.

Root Cause

A race condition between the default stream and the compute stream. In _prepare_chunked_offload_chunk(), the slot_mapping tensor was created with a non_blocking=True H2D transfer on the default stream, whereas store_kvcache runs on compute_stream. Without synchronization, compute_stream could read slot_mapping before the transfer completed, which yielded corrupted indices.

Fix

Added explicit stream synchronization in attention.py:

if is_chunked_offload:
    compute_stream = context.kvcache_manager.offload_engine.compute_stream
    if k_cache.numel() and v_cache.numel():
        # CRITICAL: Wait for default stream to ensure slot_mapping tensor transfer is complete
        compute_stream.wait_stream(torch.cuda.default_stream())
        with torch.cuda.stream(compute_stream):
            store_kvcache(k, v, k_cache, v_cache, context.slot_mapping)
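The same ordering pattern can be reproduced outside nano-vLLM. The sketch below is a standalone illustration and not the project's code: the function name store_with_stream_sync is hypothetical, index_copy_ stands in for store_kvcache, and a CPU fallback is included so the example also runs without a GPU. It assumes PyTorch is installed.

```python
import torch


def store_with_stream_sync(src: torch.Tensor, cache: torch.Tensor,
                           slot_mapping_cpu: torch.Tensor) -> torch.Tensor:
    """Write rows of src into cache at the rows listed in slot_mapping_cpu."""
    if not torch.cuda.is_available():
        # CPU fallback: no streams involved, so ordering is trivially correct.
        cache.index_copy_(0, slot_mapping_cpu, src)
        return cache

    compute_stream = torch.cuda.Stream()
    # Asynchronous H2D copies are enqueued on the current (default) stream.
    slot_mapping = slot_mapping_cpu.pin_memory().cuda(non_blocking=True)
    src_gpu, cache_gpu = src.cuda(), cache.cuda()

    # Without this wait, compute_stream could read slot_mapping too early.
    compute_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(compute_stream):
        cache_gpu.index_copy_(0, slot_mapping, src_gpu)

    # Make later work on the current stream see the finished write.
    torch.cuda.current_stream().wait_stream(compute_stream)
    return cache_gpu  # note: a GPU copy, not the original CPU tensor


out = store_with_stream_sync(torch.ones(2, 2), torch.zeros(4, 2),
                             torch.tensor([1, 3]))
print(out.cpu())
```

wait_stream only orders the two streams on the device and does not block the host, which makes it cheaper than a full torch.cuda.synchronize() call.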

Verification

Tested block sizes: 512, 1024, 4096, 8192 - all pass.

Files Modified

nanovllm/layers/attention.py: Added compute_stream.wait_stream(torch.cuda.default_stream())

Reporting New Issues

If you discover a new bug, please document it here with:

Problem: Clear description of the issue
Root Cause: Analysis of why it happens
Fix: Code changes to resolve it
Files Modified: List of affected files
Verification: How the fix was tested

Author: Zijie Tian
