CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Overview

Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline LLM inference. It currently supports Qwen3 models.

Architecture

Core Components

LLMEngine (nanovllm/engine/llm_engine.py):

  • Main entry point, wraps ModelRunner and Scheduler
  • generate() runs prefill-decode loop until all sequences finish
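
A minimal sketch of that loop (the Scheduler/ModelRunner calls shown here are illustrative, not the exact repo API):

def generate(self, prompts, sampling_params):
    # Illustrative sketch of LLMEngine.generate(); helper names are assumptions.
    for prompt in prompts:
        self.scheduler.add(Sequence(prompt, sampling_params))
    finished = []
    while not self.scheduler.is_finished():
        seqs, is_prefill = self.scheduler.schedule()      # prefill batch or decode batch
        token_ids = self.model_runner.run(seqs, is_prefill)
        self.scheduler.postprocess(seqs, token_ids)       # append tokens, retire finished seqs
        finished += [s for s in seqs if s.is_finished]
    return [s.completion for s in finished]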

ModelRunner (nanovllm/engine/model_runner.py):

  • Loads model weights, allocates KV cache, captures CUDA graphs
  • Rank 0 is the main process; ranks 1+ run via loop(), synchronized through shared-memory events
  • Chunked offload methods: run_chunked_offload_prefill(), run_chunked_offload_decode()

Scheduler (nanovllm/engine/scheduler.py):

  • Two-phase scheduling: prefill (waiting queue) then decode (running queue)
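
Roughly, the policy looks like this (a sketch; queue and helper names are assumptions):

def schedule(self):
    # Phase 1 (prefill): admit waiting sequences while KV blocks are available.
    scheduled = []
    while self.waiting and self.block_manager.can_allocate(self.waiting[0]):
        seq = self.waiting.popleft()
        self.block_manager.allocate(seq)
        self.running.append(seq)
        scheduled.append(seq)
    if scheduled:
        return scheduled, True        # run a prefill step
    # Phase 2 (decode): every running sequence advances by one token.
    return list(self.running), False  # run a decode step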

BlockManager (nanovllm/engine/block_manager.py):

  • Paged attention block allocation with prefix caching via xxhash
  • Blocks are 256 tokens by default
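
A sketch of the prefix-cache key derivation (xxhash as noted above; helper and table names are assumptions):

import numpy as np
import xxhash

BLOCK_SIZE = 256  # default tokens per block

def compute_block_hash(token_ids, prefix_hash=None):
    # Chain the previous block's hash so identical prefixes produce identical keys.
    h = xxhash.xxh64()
    if prefix_hash is not None:
        h.update(prefix_hash.to_bytes(8, "little"))
    h.update(np.asarray(token_ids, dtype=np.int64).tobytes())
    return h.intdigest()

# Allocation then checks hash_to_block_id: on a hit, the existing block is reused
# (its ref count bumped) instead of recomputing and rewriting that block's KV.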

Model & Attention

Attention (nanovllm/layers/attention.py):

  • FlashAttention: flash_attn_varlen_func (prefill), flash_attn_with_kvcache (decode)
  • Triton kernel store_kvcache_kernel for KV cache writes
  • Chunked attention methods: _chunked_prefill_attention(), _chunked_decode_attention()
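
Roughly how the two FlashAttention entry points are invoked (a sketch assuming the flash-attn 2.x API; tensor names come from the context fields below):

from flash_attn import flash_attn_varlen_func, flash_attn_with_kvcache

# Prefill: ragged batch of prompt tokens, causal self-attention.
o = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens_q, cu_seqlens_k=cu_seqlens_k,
    max_seqlen_q=max_seqlen_q, max_seqlen_k=max_seqlen_k,
    softmax_scale=scale, causal=True)

# Decode: one new query token per sequence, reading KV from paged cache blocks.
o = flash_attn_with_kvcache(
    q.unsqueeze(1), k_cache, v_cache,
    cache_seqlens=context_lens, block_table=block_tables,
    softmax_scale=scale, causal=True)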

Global Context (nanovllm/utils/context.py):

  • Stores attention metadata via get_context()/set_context()
  • Key fields: cu_seqlens, slot_mapping, block_tables, chunked_seq, kvcache_manager
  • kvcache_manager: Reference to HybridKVCacheManager for chunked attention (set when is_chunked_prefill=True)
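
A simplified version of that context (a sketch; the real dataclass carries more fields):

from dataclasses import dataclass

@dataclass
class Context:
    # Per-forward-pass attention metadata visible to every layer.
    is_prefill: bool = False
    cu_seqlens: object = None        # cumulative sequence lengths for varlen attention
    slot_mapping: object = None      # where each token's KV is written in the cache
    block_tables: object = None      # per-sequence block ids for paged attention
    chunked_seq: object = None       # sequence currently running in chunked-offload mode
    kvcache_manager: object = None   # HybridKVCacheManager, set when is_chunked_prefill=True

_CONTEXT = Context()

def set_context(ctx):
    global _CONTEXT
    _CONTEXT = ctx

def get_context():
    return _CONTEXT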

CPU Offload System

Overview

When enable_cpu_offload=True, KV cache is stored on CPU with a small GPU buffer for computation. This enables long-context inference with limited GPU memory.

Three-Region GPU Buffer Design

GPU Slots: [0]     [1, 2, 3]        [4, 5]
           ↑           ↑               ↑
        decode     compute         prefetch
        (1 slot)   (N slots)       (M slots)

  • Decode slot: the new token's KV is written here during decode
  • Compute region: CPU blocks are loaded here for the current chunk's computation
  • Prefetch region: the next chunk is loaded asynchronously while the current one computes
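
Expressed as code, the partitioning might look like this (a sketch matching the diagram; the attribute names appear under offload_engine.py below):

def partition_gpu_slots(num_gpu_slots, num_prefetch_slots):
    # Slot 0 is reserved for decode writes; the rest split into compute and prefetch.
    decode_slot = 0
    compute_slots = list(range(1, num_gpu_slots - num_prefetch_slots))
    prefetch_slots = list(range(num_gpu_slots - num_prefetch_slots, num_gpu_slots))
    return decode_slot, compute_slots, prefetch_slots

# partition_gpu_slots(6, 2) -> (0, [1, 2, 3], [4, 5]), as in the diagram above.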

File: nanovllm/kvcache/offload_engine.py

Key attributes:

  • decode_slot = 0: Fixed slot for decode KV writes
  • compute_slots: List of GPU slots for compute region
  • prefetch_slots: List of GPU slots for prefetch region
  • k_cache_gpu/v_cache_gpu: Shape [num_layers, num_gpu_blocks, block_size, kv_heads, head_dim]
  • k_cache_cpu/v_cache_cpu: Shape [num_layers, num_cpu_blocks, block_size, kv_heads, head_dim] (pinned memory)
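
A sketch of how the two pools could be allocated (shapes from the bullets above; sizes and dtype are illustrative):

import torch

num_layers, num_gpu_blocks, num_cpu_blocks = 28, 6, 1024      # illustrative sizes
block_size, kv_heads, head_dim = 256, 8, 128
gpu_shape = (num_layers, num_gpu_blocks, block_size, kv_heads, head_dim)
cpu_shape = (num_layers, num_cpu_blocks, block_size, kv_heads, head_dim)

k_cache_gpu = torch.empty(gpu_shape, dtype=torch.bfloat16, device="cuda")
v_cache_gpu = torch.empty(gpu_shape, dtype=torch.bfloat16, device="cuda")
# Pinned (page-locked) host memory lets H2D/D2H copies run asynchronously on side streams.
k_cache_cpu = torch.empty(cpu_shape, dtype=torch.bfloat16, pin_memory=True)
v_cache_cpu = torch.empty(cpu_shape, dtype=torch.bfloat16, pin_memory=True)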

Per-Layer Loading (Critical Design)

Problem solved: the original design had layer 0 load ALL layers' KV at once. When layer 0 moved on to chunk 1, it overwrote chunk 0's data before layers 1+ had read it.

Solution: Each layer independently loads only its own KV data:

# Per-layer methods in OffloadEngine
load_to_compute_layer(layer_id, cpu_block_ids)  # Load single layer to compute region
wait_compute_layer(layer_id)                     # Wait for layer's transfer
load_to_prefetch_layer(layer_id, cpu_block_ids) # Load single layer to prefetch region
wait_prefetch_layer(layer_id)                    # Wait for layer's prefetch

Chunked Prefill Flow

File: nanovllm/layers/attention.py - _chunked_prefill_attention()

For each prefill chunk:
1. Current chunk's KV is written to GPU (compute region slots)
2. Load previous chunks' KV from CPU to prefetch region
3. Compute attention against previous KV (no causal mask)
4. Compute attention against current KV (causal mask)
5. Merge results using online softmax (LSE)
6. Offload current chunk's KV to CPU

Important: prefill loads previous chunks' KV ONLY into the prefetch region, so it never conflicts with the current chunk's KV being written to the compute region.
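
Putting the steps together for one layer and one chunk (a sketch: attn_with_lse, prefetch_kv, offload_chunk_to_cpu and num_prefetch_slots are illustrative names, the load/wait calls are the OffloadEngine methods listed above, and merge_attention_outputs is assumed to return the merged (output, lse) pair):

def chunked_prefill_attention(layer_id, q, k_cur, v_cur, engine, prev_cpu_blocks):
    o_acc, lse_acc = None, None
    # Steps 2-3: stream previous chunks' KV for this layer through the prefetch region.
    n = engine.num_prefetch_slots
    for i in range(0, len(prev_cpu_blocks), n):
        engine.load_to_prefetch_layer(layer_id, prev_cpu_blocks[i:i + n])
        engine.wait_prefetch_layer(layer_id)
        k_prev, v_prev = engine.prefetch_kv(layer_id)            # illustrative accessor
        o, lse = attn_with_lse(q, k_prev, v_prev, causal=False)  # all previous tokens are visible
        o_acc, lse_acc = (o, lse) if o_acc is None else merge_attention_outputs(o_acc, lse_acc, o, lse)
    # Step 4: attention within the current chunk, causal mask applied.
    o, lse = attn_with_lse(q, k_cur, v_cur, causal=True)
    # Step 5: merge partial outputs with the online-softmax rule.
    o_acc, lse_acc = (o, lse) if o_acc is None else merge_attention_outputs(o_acc, lse_acc, o, lse)
    # Step 6: push this chunk's KV to CPU so later chunks (and decode) can read it.
    engine.offload_chunk_to_cpu(layer_id, k_cur, v_cur)          # illustrative
    return o_acc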

Chunked Decode Flow (Double Buffering)

File: nanovllm/layers/attention.py - _chunked_decode_attention()

Timeline (async double buffering):
        ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
Load:   │C0 → Compute │ │C1 → Prefetch│ │C2 → Compute │
        └─────────────┘ └─────────────┘ └─────────────┘
                      ↘               ↘               ↘
Compute:               [C0]           [C1]           [C2]

1. Pre-load first chunk to compute region
2. Wait for current buffer, trigger async prefetch of next chunk to OTHER buffer
3. Compute attention, merge results
4. Swap buffers, repeat
5. Finally attend to decode slot (new token's KV)
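
A sketch of that loop for one layer (double buffering expressed by alternating between the compute and prefetch regions; accessor names other than the documented load/wait methods are illustrative, and merge_attention_outputs is assumed to return the merged (output, lse) pair):

def chunked_decode_attention(layer_id, q, engine, cpu_chunks):
    load = {0: engine.load_to_compute_layer, 1: engine.load_to_prefetch_layer}
    wait = {0: engine.wait_compute_layer, 1: engine.wait_prefetch_layer}
    read = {0: engine.compute_kv, 1: engine.prefetch_kv}          # illustrative accessors
    load[0](layer_id, cpu_chunks[0])                              # 1. pre-load first chunk
    o_acc, lse_acc = None, None
    for i, _ in enumerate(cpu_chunks):
        buf = i % 2
        wait[buf](layer_id)                                       # 2. wait for current buffer
        if i + 1 < len(cpu_chunks):
            load[(i + 1) % 2](layer_id, cpu_chunks[i + 1])        #    async prefetch into other buffer
        k, v = read[buf](layer_id)
        o, lse = attn_with_lse(q, k, v, causal=False)             # 3. attend and merge
        o_acc, lse_acc = (o, lse) if o_acc is None else merge_attention_outputs(o_acc, lse_acc, o, lse)
        # 4. buffers swap implicitly via i % 2
    k_new, v_new = engine.decode_slot_kv(layer_id)                # 5. finally, the new token's KV
    o, lse = attn_with_lse(q, k_new, v_new, causal=False)
    o_acc, _ = (o, lse) if o_acc is None else merge_attention_outputs(o_acc, lse_acc, o, lse)
    return o_acc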

HybridKVCacheManager

File: nanovllm/kvcache/hybrid_manager.py

Manages both GPU and CPU blocks:

  • allocate(): Allocate GPU block first, fallback to CPU
  • allocate_cpu_only(): Force CPU allocation (for chunked offload mode)
  • get_all_cpu_blocks(seq): Get all CPU block IDs for a sequence
  • get_prefilled_cpu_blocks(seq): Get CPU blocks from previous chunks
  • get_write_slot_for_chunked_offload(seq): Get GPU slot for writing new KV (returns decode_slot)
  • may_offload(): Offload GPU blocks to CPU when decode slot fills
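
A sketch of the GPU-first allocation policy (free-list names are assumptions):

def allocate(self, seq):
    # Prefer a free GPU block; fall back to a CPU block when the GPU pool is exhausted.
    if self.free_gpu_blocks:
        block_id, on_gpu = self.free_gpu_blocks.pop(), True
    elif self.free_cpu_blocks:
        block_id, on_gpu = self.free_cpu_blocks.pop(), False
    else:
        raise RuntimeError("out of KV cache blocks")
    seq.block_table.append((block_id, on_gpu))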

Online Softmax Merge

File: nanovllm/kvcache/chunked_attention.py

When computing attention across multiple chunks, results are merged using log-sum-exp:

def merge_attention_outputs(o1, lse1, o2, lse2):
    # Uses LSE to correctly weight and combine partial attention outputs
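
The merge itself reduces to a weighted sum; a minimal sketch, assuming o1/o2 have shape [tokens, heads, head_dim] and lse1/lse2 have shape [tokens, heads] (the real function's tensor layout may differ):

import torch

def merge_attention_outputs(o1, lse1, o2, lse2):
    lse = torch.logaddexp(lse1, lse2)            # combined log-sum-exp over both chunks
    w1 = torch.exp(lse1 - lse).unsqueeze(-1)     # fraction of softmax mass in chunk 1
    w2 = torch.exp(lse2 - lse).unsqueeze(-1)     # fraction of softmax mass in chunk 2
    return o1 * w1 + o2 * w2, lse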

Ring Buffer Design (Future Optimization)

The current double buffering limits pipeline depth. Planned improvements:

  • Unified ring buffer using all GPU slots (except decode)
  • Per-slot per-layer CUDA events for fine-grained sync
  • Deeper pipeline: prefetch N-1 blocks ahead (vs 1 chunk)