Architecture Guide

This document describes the core components and design of nano-vLLM, with detailed focus on the CPU offload system.

Core Components

LLMEngine (llm_engine.py)

Main entry point that runs the prefill-decode loop. Manages the overall inference workflow.

ModelRunner (model_runner.py)

  • Loads model weights
  • Allocates KV cache
  • Manages CUDA graphs for decode acceleration

Scheduler (scheduler.py)

Two-phase scheduling system:

  • Prefill phase: Processes prompt tokens
  • Decode phase: Generates output tokens autoregressively

BlockManager (block_manager.py)

  • Paged attention implementation
  • Prefix caching using xxhash
  • Default block size: 4096 tokens

Attention (layers/attention.py)

  • FlashAttention for efficient computation
  • Chunked methods for CPU offload mode

CPU Offload System

Ring Buffer Design

The CPU offload system uses a unified ring buffer to manage GPU memory slots:

GPU Slots: [0]  [1]  [2]  [3]  ...  (unified ring buffer)
Prefill:  slot = chunk_idx % N
Decode:   slot[0] = decode, slots[1:] = load previous chunks

Key Files: kvcache/offload_engine.py, kvcache/hybrid_manager.py
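
The slot-assignment rule above can be sketched in a few lines of Python (function names are illustrative, not the actual nano-vLLM API):

```python
def prefill_slot(chunk_idx: int, num_slots: int) -> int:
    # Prefill: chunks cycle through the unified ring buffer.
    return chunk_idx % num_slots

def decode_slots(num_slots: int) -> tuple[int, list[int]]:
    # Decode: slot 0 holds the active decode KV; the remaining slots
    # are used to load previously offloaded chunks.
    return 0, list(range(1, num_slots))

# With a 4-slot ring buffer, prefill chunks 0..5 wrap around:
assert [prefill_slot(i, 4) for i in range(6)] == [0, 1, 2, 3, 0, 1]
assert decode_slots(4) == (0, [1, 2, 3])
```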

Memory Layout

GPU Memory:

[num_layers, num_gpu_blocks, block_size, kv_heads, head_dim]

CPU Memory (pinned):

[num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]
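
As a sketch, the two pools could be allocated as below. Sizes are illustrative, not the real defaults, and the CUDA device and pinned memory are only requested when a GPU is available:

```python
import torch

# Illustrative sizes; the real values come from the model and engine config.
num_layers, block_size, kv_heads, head_dim = 2, 256, 8, 128
num_gpu_blocks, num_cpu_blocks = 2, 8
use_cuda = torch.cuda.is_available()

# GPU slots: [num_layers, num_gpu_blocks, block_size, kv_heads, head_dim]
gpu_cache = torch.empty(
    (num_layers, num_gpu_blocks, block_size, kv_heads, head_dim),
    dtype=torch.float16, device="cuda" if use_cuda else "cpu",
)

# CPU pool: [num_layers, num_cpu_blocks, block_size, kv_heads, head_dim],
# allocated in pinned (page-locked) memory so that H2D/D2H copies can run
# asynchronously on side streams.
cpu_cache = torch.empty(
    (num_layers, num_cpu_blocks, block_size, kv_heads, head_dim),
    dtype=torch.float16, pin_memory=use_cuda,
)
```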

Key Methods

| Method | Purpose |
| --- | --- |
| load_to_slot_layer(slot, layer, cpu_block) | Async H2D load for specific layer |
| offload_slot_to_cpu(slot, cpu_block) | Async D2H offload |
| Per-slot per-layer CUDA events | Fine-grained synchronization |
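
A hypothetical sketch of how these pieces fit together; the class, attributes, and method bodies are illustrative, not the real offload_engine.py implementation:

```python
import torch

class OffloadEngineSketch:
    """Per-slot transfer streams plus per-slot, per-layer ready events."""

    def __init__(self, gpu_cache, cpu_cache, num_slots):
        self.gpu_cache, self.cpu_cache = gpu_cache, cpu_cache
        num_layers = gpu_cache.shape[0]
        self.streams = [torch.cuda.Stream() for _ in range(num_slots)]
        self.ready = [[torch.cuda.Event() for _ in range(num_layers)]
                      for _ in range(num_slots)]

    def load_to_slot_layer(self, slot, layer, cpu_block):
        # Async H2D copy on the slot's own stream; record a per-layer event
        # so compute can start as soon as *this* layer has landed.
        with torch.cuda.stream(self.streams[slot]):
            self.gpu_cache[layer, slot].copy_(
                self.cpu_cache[layer, cpu_block], non_blocking=True)
            self.ready[slot][layer].record(self.streams[slot])

    def offload_slot_to_cpu(self, slot, cpu_block):
        # Async D2H copy of every layer back to the pinned CPU pool.
        with torch.cuda.stream(self.streams[slot]):
            for layer in range(self.gpu_cache.shape[0]):
                self.cpu_cache[layer, cpu_block].copy_(
                    self.gpu_cache[layer, slot], non_blocking=True)
```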

Pipeline Architecture

An N-way pipeline with dedicated streams provides full compute-transfer overlap:

  • Prefill pipeline depth: N-1
  • Decode pipeline depth: (N-1)/2

Stream Architecture

Transfer Streams: [slot_0_stream] [slot_1_stream] ... [slot_N_stream]
                       ↓              ↓                    ↓
GPU Slots:          [slot_0]      [slot_1]    ...     [slot_N]
                       ↓              ↓                    ↓
Compute Stream:    ←←←←←←←←←←←← [dedicated compute stream] →→→→→→→→→→→→

Key Design Decisions

  1. Per-slot transfer streams: Each GPU slot has its own CUDA stream for H2D transfers, enabling parallel loading

  2. Dedicated compute stream: Created with torch.cuda.Stream() (NOT current_stream()) to avoid implicit synchronization with the CUDA default stream

  3. CUDA Events:

    • ring_slot_ready: Signals transfer complete
    • ring_slot_compute_done: Signals safe to overwrite slot
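
The stream-and-event handshake from decisions 1-3 can be sketched as follows (the event names follow the text; the surrounding code is illustrative and the demo block is guarded so it only runs on a CUDA machine):

```python
import torch

def make_streams_and_events(num_slots: int):
    # torch.cuda.Stream() creates a NEW stream; torch.cuda.current_stream()
    # would alias the default stream and reintroduce implicit synchronization.
    compute_stream = torch.cuda.Stream()
    transfer_streams = [torch.cuda.Stream() for _ in range(num_slots)]
    ring_slot_ready = [torch.cuda.Event() for _ in range(num_slots)]
    ring_slot_compute_done = [torch.cuda.Event() for _ in range(num_slots)]
    return compute_stream, transfer_streams, ring_slot_ready, ring_slot_compute_done

if torch.cuda.is_available():
    compute, transfers, ready, done = make_streams_and_events(4)
    slot = 0
    with torch.cuda.stream(transfers[slot]):
        # ... enqueue the H2D copy into this slot here ...
        ready[slot].record()                # signal: transfer complete
    compute.wait_event(ready[slot])         # compute waits for the transfer
    with torch.cuda.stream(compute):
        # ... run attention over the slot here ...
        done[slot].record()                 # signal: safe to overwrite slot
    transfers[slot].wait_event(done[slot])  # next load waits for compute
```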

Chunked Offload Flow

Prefill Phase:

  1. For each chunk, assign slot = chunk_idx % N
  2. Load required KV blocks from CPU to assigned slot
  3. Compute attention on current chunk
  4. Offload results back to CPU if needed

Decode Phase:

  1. Use slot[0] for active decode computation
  2. Use slots[1:] to prefetch upcoming chunks
  3. Rotate slots as decoding progresses
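
Both phases can be sketched as a driver loop over a hypothetical engine object; the method names here are placeholders, not the real calls in hybrid_manager.py:

```python
def run_prefill(chunks, engine, num_slots):
    # Prefill steps 1-4 from above.
    for chunk_idx, chunk in enumerate(chunks):
        slot = chunk_idx % num_slots       # 1. ring-buffer slot assignment
        engine.load(slot, chunk)           # 2. async H2D load of needed KV
        engine.compute(slot, chunk)        # 3. attention on the current chunk
        engine.offload(slot, chunk)        # 4. D2H write-back if needed

def run_decode_step(engine, num_slots, pending_chunks):
    # Decode steps 1-2 from above: slot 0 decodes, slots 1..N-1 prefetch.
    engine.decode(slot=0)
    for slot, chunk in zip(range(1, num_slots), pending_chunks):
        engine.load(slot, chunk)
```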

Configuration Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| kvcache_block_size | 1024 | Tokens per KV cache block |
| num_gpu_blocks | 2 | Number of GPU blocks for offload |
| num_kv_buffers | 4 | Ring buffer size (1-4), lower = less memory but slower decode |
| enable_cpu_offload | False | Enable CPU offload mode |
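
Sketched as a config object, where the field names follow the table but the dataclass itself is illustrative (the real configuration lives in nano-vLLM's engine setup):

```python
from dataclasses import dataclass

@dataclass
class OffloadConfig:
    kvcache_block_size: int = 1024    # tokens per KV cache block
    num_gpu_blocks: int = 2           # GPU blocks reserved for offload
    num_kv_buffers: int = 4           # ring buffer size (1-4)
    enable_cpu_offload: bool = False  # offload disabled by default

# Enable offload with a smaller ring buffer to save GPU memory:
cfg = OffloadConfig(enable_cpu_offload=True, num_kv_buffers=2)
assert cfg.kvcache_block_size == 1024
```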

Trade-offs

  • More GPU blocks: Higher memory usage, faster prefill (fewer transfers)
  • Fewer GPU blocks: Lower memory usage, more frequent transfers
  • Larger ring buffer: More memory, better prefetch overlap
  • Smaller ring buffer: Less memory, potential compute stalls

Author: Zijie Tian