# Architecture Guide
This document describes the core components and design of nano-vLLM, with detailed focus on the CPU offload system.
## Core Components

### LLMEngine (`llm_engine.py`)
Main entry point that runs the prefill-decode loop. Manages the overall inference workflow.
### ModelRunner (`model_runner.py`)
- Loads model weights
- Allocates KV cache
- Manages CUDA graphs for decode acceleration
### Scheduler (`scheduler.py`)
Two-phase scheduling system:
- Prefill phase: Processes prompt tokens
- Decode phase: Generates output tokens autoregressively
### BlockManager (`block_manager.py`)
- Paged attention implementation
- Prefix caching using xxhash
- Default block size: 4096 tokens
### Attention (`layers/attention.py`)
- FlashAttention for efficient computation
- Chunked methods for CPU offload mode
## CPU Offload System

### Ring Buffer Design

The CPU offload system uses a unified ring buffer to manage GPU memory slots:

```
GPU Slots: [0] [1] [2] [3] ... (unified ring buffer)

Prefill: slot = chunk_idx % N
Decode:  slot[0] = decode, slots[1:] = load previous chunks
```
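The slot-assignment rule above can be sketched in a few lines. This is a minimal illustration with hypothetical function names, not the actual nano-vLLM API:

```python
# Ring-buffer slot assignment: prefill cycles round-robin through all
# N slots; decode pins slot 0 and uses the rest for prefetch.
# Function names are illustrative, not the real nano-vLLM API.

def prefill_slot(chunk_idx: int, num_slots: int) -> int:
    """Prefill chunks cycle through all slots round-robin."""
    return chunk_idx % num_slots

def decode_slots(num_slots: int) -> tuple[int, list[int]]:
    """Decode pins slot 0 for the active token; the remaining
    slots prefetch previously offloaded chunks."""
    return 0, list(range(1, num_slots))

# With N=4 slots, prefill chunks 0..5 land in slots 0,1,2,3,0,1:
print([prefill_slot(i, 4) for i in range(6)])  # [0, 1, 2, 3, 0, 1]
print(decode_slots(4))                         # (0, [1, 2, 3])
```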
Key files: `kvcache/offload_engine.py`, `kvcache/hybrid_manager.py`
### Memory Layout

GPU memory:

```
[num_layers, num_gpu_blocks, block_size, kv_heads, head_dim]
```

CPU memory (pinned):

```
[num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]
```
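To make the layout concrete, here is a rough footprint calculation for tensors of this shape. The factor of 2 accounts for separate K and V tensors; the concrete dimension values are illustrative, not nano-vLLM defaults:

```python
# Rough footprint of a KV cache with the five-dimensional layout above.
# The 2x covers K and V; all concrete dimensions below are illustrative
# examples, not nano-vLLM defaults.

def kv_cache_bytes(num_layers: int, num_blocks: int, block_size: int,
                   kv_heads: int, head_dim: int,
                   dtype_bytes: int = 2) -> int:
    """Bytes needed for K+V at the given shape (fp16/bf16 by default)."""
    return (2 * num_layers * num_blocks * block_size
            * kv_heads * head_dim * dtype_bytes)

# e.g. 32 layers, 2 GPU blocks of 1024 tokens, 8 KV heads, head_dim 128:
gpu_bytes = kv_cache_bytes(32, 2, 1024, 8, 128)
print(f"{gpu_bytes / 2**20:.0f} MiB")  # 256 MiB
```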
### Key Methods

| Method | Purpose |
|---|---|
| `load_to_slot_layer(slot, layer, cpu_block)` | Async H2D load for a specific layer |
| `offload_slot_to_cpu(slot, cpu_block)` | Async D2H offload |
| Per-slot per-layer CUDA events | Fine-grained synchronization |
### Pipeline Architecture

N-way pipeline with dedicated streams for full compute-transfer overlap:

- Prefill pipeline depth: `N-1`
- Decode pipeline depth: `(N-1)/2`
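The depth formulas above can be expressed directly; helper names here are illustrative:

```python
# Pipeline depth as a function of ring-buffer size N, per the rules
# above. Helper names are illustrative, not nano-vLLM API.

def prefill_pipeline_depth(n_slots: int) -> int:
    # All slots except the one being computed on can hold in-flight loads.
    return n_slots - 1

def decode_pipeline_depth(n_slots: int) -> int:
    # Slot 0 is pinned for decode, roughly halving the prefetch depth.
    return (n_slots - 1) // 2

print(prefill_pipeline_depth(4), decode_pipeline_depth(4))  # 3 1
```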
### Stream Architecture

```
Transfer Streams: [slot_0_stream] [slot_1_stream] ... [slot_N_stream]
                        ↓               ↓                   ↓
GPU Slots:           [slot_0]        [slot_1]     ...   [slot_N]
                        ↓               ↓                   ↓
Compute Stream:   ←←←←←←←←←←←← [dedicated compute stream] →→→→→→→→→→→→
```
### Key Design Decisions

1. **Per-slot transfer streams**: Each GPU slot has its own CUDA stream for H2D transfers, enabling parallel loading
2. **Dedicated compute stream**: Created with `torch.cuda.Stream()` (NOT `current_stream()`) to avoid implicit synchronization with the CUDA default stream
3. **CUDA events**: `ring_slot_ready` signals transfer complete; `ring_slot_compute_done` signals that the slot is safe to overwrite
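The two events enforce a classic producer-consumer handoff per slot. The sketch below models that ordering in pure Python, with `threading.Event` standing in for CUDA events and threads standing in for streams; it illustrates the synchronization pattern only and is not GPU code:

```python
import threading

# Pure-Python analogue of the per-slot CUDA events described above:
# `ring_slot_ready` gates compute on transfer completion, and
# `ring_slot_compute_done` gates the next overwrite on compute
# completion. Threads stand in for CUDA streams.

class Slot:
    def __init__(self):
        self.ready = threading.Event()         # ~ ring_slot_ready
        self.compute_done = threading.Event()  # ~ ring_slot_compute_done
        self.compute_done.set()                # slot starts free
        self.log = []

def transfer_stream(slot: Slot, chunks: range) -> None:
    for c in chunks:
        slot.compute_done.wait()   # never overwrite a slot still in use
        slot.compute_done.clear()
        slot.log.append(f"load {c}")
        slot.ready.set()           # signal: transfer complete

def compute_stream(slot: Slot, chunks: range) -> None:
    for c in chunks:
        slot.ready.wait()          # never read before data arrives
        slot.ready.clear()
        slot.log.append(f"attend {c}")
        slot.compute_done.set()    # signal: slot may be overwritten

slot = Slot()
t = threading.Thread(target=transfer_stream, args=(slot, range(2)))
c = threading.Thread(target=compute_stream, args=(slot, range(2)))
t.start(); c.start(); t.join(); c.join()
print(slot.log)  # ['load 0', 'attend 0', 'load 1', 'attend 1']
```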
## Chunked Offload Flow

Prefill phase:

1. For each chunk, assign `slot = chunk_idx % N`
2. Load the required KV blocks from CPU into the assigned slot
3. Compute attention on the current chunk
4. Offload results back to CPU if needed

Decode phase:

1. Use `slot[0]` for active decode computation
2. Use `slots[1:]` to prefetch upcoming chunks
3. Rotate slots as decoding progresses
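The two phases above can be traced step by step. Names here are hypothetical; the real logic lives in `kvcache/offload_engine.py` and `kvcache/hybrid_manager.py`:

```python
# Trace of the prefill and decode flows above for N=4 ring-buffer
# slots. Function names are hypothetical sketches.

N = 4

def prefill_trace(num_chunks: int) -> list[str]:
    events = []
    for chunk in range(num_chunks):
        slot = chunk % N                                       # step 1
        events.append(f"load chunk {chunk} -> slot {slot}")    # step 2
        events.append(f"attend chunk {chunk} in slot {slot}")  # step 3
        events.append(f"offload chunk {chunk} -> CPU")         # step 4
    return events

def decode_prefetch_order(num_past_chunks: int) -> list[tuple[int, int]]:
    # Slot 0 is reserved for the active decode step; slots 1..N-1
    # rotate through previously offloaded chunks.
    prefetch = list(range(1, N))
    return [(prefetch[i % len(prefetch)], chunk)
            for i, chunk in enumerate(range(num_past_chunks))]

print(prefill_trace(1))
print(decode_prefetch_order(5))  # [(1, 0), (2, 1), (3, 2), (1, 3), (2, 4)]
```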
## Configuration Parameters

| Parameter | Default | Description |
|---|---|---|
| `kvcache_block_size` | 1024 | Tokens per KV cache block |
| `num_gpu_blocks` | 2 | Number of GPU blocks for offload |
| `num_kv_buffers` | 4 | Ring buffer size (1-4); lower = less memory but slower decode |
| `enable_cpu_offload` | False | Enable CPU offload mode |
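The knobs from the table can be collected into a config object. Field names and defaults follow the table; the dataclass itself is a sketch, not nano-vLLM's actual `Config` class:

```python
from dataclasses import dataclass

# Sketch of the offload-related knobs from the table above.
# Illustrative only; not the actual nano-vLLM Config class.

@dataclass
class OffloadConfig:
    kvcache_block_size: int = 1024    # tokens per KV cache block
    num_gpu_blocks: int = 2           # GPU blocks used for offload
    num_kv_buffers: int = 4           # ring-buffer size (1-4)
    enable_cpu_offload: bool = False  # offload is off by default

# Enable offload with a smaller ring buffer to save GPU memory:
cfg = OffloadConfig(enable_cpu_offload=True, num_kv_buffers=2)
print(cfg.num_kv_buffers, cfg.enable_cpu_offload)  # 2 True
```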
## Trade-offs
- More GPU blocks: Higher memory usage, faster prefill (fewer transfers)
- Fewer GPU blocks: Lower memory usage, more frequent transfers
- Larger ring buffer: More memory, better prefetch overlap
- Smaller ring buffer: Less memory, potential compute stalls
Author: Zijie Tian