Architecture Guide

This document describes the core components and design of nano-vLLM, with detailed focus on the CPU offload system.

Core Components

LLMEngine (llm_engine.py)

Main entry point that runs the prefill-decode loop. Manages the overall inference workflow.

ModelRunner (model_runner.py)

  • Loads model weights
  • Allocates KV cache
  • Manages CUDA graphs for decode acceleration

Scheduler (scheduler.py)

Two-phase scheduling system:

  • Prefill phase: Processes prompt tokens
  • Decode phase: Generates output tokens autoregressively

BlockManager (block_manager.py)

  • Paged attention implementation
  • Prefix caching using xxhash
  • Default block size: 4096 tokens

Attention (layers/attention.py)

  • FlashAttention for efficient computation
  • Chunked methods for CPU offload mode

CPU Offload System

Ring Buffer Design

The CPU offload system uses a unified ring buffer to manage GPU memory slots:

GPU Slots: [0]  [1]  [2]  [3]  ...  (unified ring buffer)
Prefill:  slot = chunk_idx % N
Decode:   slot[0] = decode, slots[1:] = load previous chunks

Key Files: kvcache/offload_engine.py, kvcache/hybrid_manager.py
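
The slot-assignment rule above can be sketched in a few lines of Python (function names are illustrative, not the actual nano-vLLM API):

```python
def prefill_slot(chunk_idx: int, num_slots: int) -> int:
    # Prefill: chunks cycle through the unified ring buffer.
    return chunk_idx % num_slots

def decode_slots(num_slots: int) -> tuple[int, list[int]]:
    # Decode: slot 0 holds the active decode KV; the remaining slots
    # are used to load previously offloaded chunks.
    return 0, list(range(1, num_slots))

# With a 4-slot ring buffer, prefill chunks 0..5 wrap around:
assert [prefill_slot(i, 4) for i in range(6)] == [0, 1, 2, 3, 0, 1]
assert decode_slots(4) == (0, [1, 2, 3])
```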

Memory Layout

GPU Memory:

[num_layers, num_gpu_blocks, block_size, kv_heads, head_dim]

CPU Memory (pinned):

[num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]
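
As a sketch, the two pools could be allocated as below. Sizes are illustrative, not the real defaults, and the CUDA device and pinned memory are only requested when a GPU is available:

```python
import torch

# Illustrative sizes; the real values come from the model and engine config.
num_layers, block_size, kv_heads, head_dim = 2, 256, 8, 128
num_gpu_blocks, num_cpu_blocks = 2, 8
use_cuda = torch.cuda.is_available()

# GPU slots: [num_layers, num_gpu_blocks, block_size, kv_heads, head_dim]
gpu_cache = torch.empty(
    (num_layers, num_gpu_blocks, block_size, kv_heads, head_dim),
    dtype=torch.float16, device="cuda" if use_cuda else "cpu",
)

# CPU pool: [num_layers, num_cpu_blocks, block_size, kv_heads, head_dim],
# allocated in pinned (page-locked) memory so that H2D/D2H copies can run
# asynchronously on side streams.
cpu_cache = torch.empty(
    (num_layers, num_cpu_blocks, block_size, kv_heads, head_dim),
    dtype=torch.float16, pin_memory=use_cuda,
)
```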

Key Methods

| Method | Purpose |
| --- | --- |
| load_to_slot_layer(slot, layer, cpu_block) | Async H2D load for specific layer |
| offload_slot_to_cpu(slot, cpu_block) | Async D2H offload |
| Per-slot per-layer CUDA events | Fine-grained synchronization |
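
A hypothetical sketch of how these pieces fit together; the class, attributes, and method bodies are illustrative, not the real offload_engine.py implementation:

```python
import torch

class OffloadEngineSketch:
    """Per-slot transfer streams plus per-slot, per-layer ready events."""

    def __init__(self, gpu_cache, cpu_cache, num_slots):
        self.gpu_cache, self.cpu_cache = gpu_cache, cpu_cache
        num_layers = gpu_cache.shape[0]
        self.streams = [torch.cuda.Stream() for _ in range(num_slots)]
        self.ready = [[torch.cuda.Event() for _ in range(num_layers)]
                      for _ in range(num_slots)]

    def load_to_slot_layer(self, slot, layer, cpu_block):
        # Async H2D copy on the slot's own stream; record a per-layer event
        # so compute can start as soon as *this* layer has landed.
        with torch.cuda.stream(self.streams[slot]):
            self.gpu_cache[layer, slot].copy_(
                self.cpu_cache[layer, cpu_block], non_blocking=True)
            self.ready[slot][layer].record(self.streams[slot])

    def offload_slot_to_cpu(self, slot, cpu_block):
        # Async D2H copy of every layer back to the pinned CPU pool.
        with torch.cuda.stream(self.streams[slot]):
            for layer in range(self.gpu_cache.shape[0]):
                self.cpu_cache[layer, cpu_block].copy_(
                    self.gpu_cache[layer, slot], non_blocking=True)
```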

Pipeline Architecture

An N-way pipeline with dedicated streams provides full compute-transfer overlap:

  • Prefill pipeline depth: N-1
  • Decode pipeline depth: (N-1)/2

Stream Architecture

Transfer Streams: [slot_0_stream] [slot_1_stream] ... [slot_N_stream]
                       ↓              ↓                    ↓
GPU Slots:          [slot_0]      [slot_1]    ...     [slot_N]
                       ↓              ↓                    ↓
Compute Stream:    ←←←←←←←←←←←← [dedicated compute stream] →→→→→→→→→→→→

Key Design Decisions

  1. Per-slot transfer streams: Each GPU slot has its own CUDA stream for H2D transfers, enabling parallel loading

  2. Dedicated compute stream: Created with torch.cuda.Stream() (NOT current_stream()) to avoid implicit synchronization with the CUDA default stream

  3. CUDA Events:

    • ring_slot_ready: Signals transfer complete
    • ring_slot_compute_done: Signals safe to overwrite slot
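
The stream-and-event handshake from decisions 1-3 can be sketched as follows (the event names follow the text; the surrounding code is illustrative and the demo block is guarded so it only runs on a CUDA machine):

```python
import torch

def make_streams_and_events(num_slots: int):
    # torch.cuda.Stream() creates a NEW stream; torch.cuda.current_stream()
    # would alias the default stream and reintroduce implicit synchronization.
    compute_stream = torch.cuda.Stream()
    transfer_streams = [torch.cuda.Stream() for _ in range(num_slots)]
    ring_slot_ready = [torch.cuda.Event() for _ in range(num_slots)]
    ring_slot_compute_done = [torch.cuda.Event() for _ in range(num_slots)]
    return compute_stream, transfer_streams, ring_slot_ready, ring_slot_compute_done

if torch.cuda.is_available():
    compute, transfers, ready, done = make_streams_and_events(4)
    slot = 0
    with torch.cuda.stream(transfers[slot]):
        # ... enqueue the H2D copy into this slot here ...
        ready[slot].record()                # signal: transfer complete
    compute.wait_event(ready[slot])         # compute waits for the transfer
    with torch.cuda.stream(compute):
        # ... run attention over the slot here ...
        done[slot].record()                 # signal: safe to overwrite slot
    transfers[slot].wait_event(done[slot])  # next load waits for compute
```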

Chunked Offload Flow

Prefill Phase:

  1. For each chunk, assign slot = chunk_idx % N
  2. Load required KV blocks from CPU to assigned slot
  3. Compute attention on current chunk
  4. Offload results back to CPU if needed

Decode Phase:

  1. Use slot[0] for active decode computation
  2. Use slots[1:] to prefetch upcoming chunks
  3. Rotate slots as decoding progresses
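
Both phases can be sketched as a driver loop over a hypothetical engine object; the method names here are placeholders, not the real calls in hybrid_manager.py:

```python
def run_prefill(chunks, engine, num_slots):
    # Prefill steps 1-4 from above.
    for chunk_idx, chunk in enumerate(chunks):
        slot = chunk_idx % num_slots       # 1. ring-buffer slot assignment
        engine.load(slot, chunk)           # 2. async H2D load of needed KV
        engine.compute(slot, chunk)        # 3. attention on the current chunk
        engine.offload(slot, chunk)        # 4. D2H write-back if needed

def run_decode_step(engine, num_slots, pending_chunks):
    # Decode steps 1-2 from above: slot 0 decodes, slots 1..N-1 prefetch.
    engine.decode(slot=0)
    for slot, chunk in zip(range(1, num_slots), pending_chunks):
        engine.load(slot, chunk)
```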

Configuration Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| kvcache_block_size | 1024 | Tokens per KV cache block |
| num_gpu_blocks | 2 | Number of GPU blocks for offload |
| num_kv_buffers | 4 | Ring buffer size (1-4), lower = less memory but slower decode |
| enable_cpu_offload | False | Enable CPU offload mode |
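
Sketched as a config object, where the field names follow the table but the dataclass itself is illustrative (the real configuration lives in nano-vLLM's engine setup):

```python
from dataclasses import dataclass

@dataclass
class OffloadConfig:
    kvcache_block_size: int = 1024    # tokens per KV cache block
    num_gpu_blocks: int = 2           # GPU blocks reserved for offload
    num_kv_buffers: int = 4           # ring buffer size (1-4)
    enable_cpu_offload: bool = False  # offload disabled by default

# Enable offload with a smaller ring buffer to save GPU memory:
cfg = OffloadConfig(enable_cpu_offload=True, num_kv_buffers=2)
assert cfg.kvcache_block_size == 1024
```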

Trade-offs

  • More GPU blocks: Higher memory usage, faster prefill (fewer transfers)
  • Fewer GPU blocks: Lower memory usage, more frequent transfers
  • Larger ring buffer: More memory, better prefetch overlap
  • Smaller ring buffer: Less memory, potential compute stalls

Author: Zijie Tian