# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Overview
Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline LLM inference. Currently supports Qwen3 models.
## Architecture

### Core Components

**LLMEngine** (`nanovllm/engine/llm_engine.py`):
- Main entry point; wraps ModelRunner and Scheduler
- `generate()` runs the prefill-decode loop until all sequences finish (sketched below)
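A minimal sketch of that loop, assuming illustrative method names on Scheduler and ModelRunner (not the exact API):

```python
# Illustrative prefill-decode loop; Scheduler/ModelRunner method names are assumptions.
def generate(self, prompts, sampling_params):
    for prompt in prompts:
        self.scheduler.add(Sequence(prompt, sampling_params))
    outputs = {}
    while not self.scheduler.is_finished():
        seqs, is_prefill = self.scheduler.schedule()          # prefill batches first, then decode
        token_ids = self.model_runner.run(seqs, is_prefill)   # forward pass + sampling
        self.scheduler.postprocess(seqs, token_ids)           # append tokens, retire finished seqs
        outputs.update({s.seq_id: s.completion_token_ids for s in seqs if s.is_finished})
    return [outputs[seq_id] for seq_id in sorted(outputs)]
```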
**ModelRunner** (`nanovllm/engine/model_runner.py`):
- Loads model weights, allocates KV cache, captures CUDA graphs
- Rank 0 is the main process; ranks 1+ run via `loop()` with shared-memory events
**Scheduler** (`nanovllm/engine/scheduler.py`):
- Two-phase scheduling: prefill (waiting queue) then decode (running queue); see the sketch below
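A minimal sketch of the two-phase policy (queue and BlockManager method names beyond `waiting`/`running` are illustrative):

```python
# Illustrative two-phase schedule(): build a prefill batch from the waiting
# queue if possible; otherwise build a decode batch from the running queue.
def schedule(self):
    scheduled = []
    # Phase 1: prefill — admit waiting sequences while KV blocks are available.
    while self.waiting and self.block_manager.can_allocate(self.waiting[0]):
        seq = self.waiting.popleft()
        self.block_manager.allocate(seq)
        self.running.append(seq)
        scheduled.append(seq)
    if scheduled:
        return scheduled, True                 # is_prefill = True
    # Phase 2: decode — every running sequence advances by one token.
    for seq in self.running:
        self.block_manager.may_append(seq)     # grow block table at block boundaries
    return list(self.running), False           # is_prefill = False
```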
**BlockManager** (`nanovllm/engine/block_manager.py`):
- Paged-attention block allocation with prefix caching via xxhash (hash sketch below)
- Blocks are 256 tokens by default
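Prefix caching keys each full block by hashing its token IDs chained with the previous block's hash, so identical prefixes map to reusable blocks. A minimal sketch (function name is illustrative):

```python
import numpy as np
import xxhash

# Illustrative block hash for prefix caching: chain the previous block's hash
# with this block's token IDs so equal prefixes produce equal hashes.
def compute_block_hash(token_ids: list[int], prefix_hash: int = -1) -> int:
    h = xxhash.xxh64()
    if prefix_hash != -1:
        h.update(prefix_hash.to_bytes(8, "little"))
    h.update(np.asarray(token_ids, dtype=np.int64).tobytes())
    return h.intdigest()
```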
### Model & Attention

**Attention** (`nanovllm/layers/attention.py`):
- FlashAttention: `flash_attn_varlen_func` (prefill), `flash_attn_with_kvcache` (decode)
- Triton kernel `store_kvcache_kernel` for KV cache writes
- Chunked attention methods: `_chunked_prefill_attention()`, `_chunked_decode_attention()`
**Global Context** (`nanovllm/utils/context.py`):
- Stores attention metadata via `get_context()` / `set_context()`
- Key fields: `cu_seqlens`, `slot_mapping`, `block_tables`, `chunked_seq`
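A minimal sketch of the global-context pattern, assuming a module-level dataclass (field list abbreviated):

```python
from dataclasses import dataclass
from typing import Optional
import torch

# Illustrative per-batch attention metadata shared across layers.
@dataclass
class Context:
    is_prefill: bool = False
    cu_seqlens: Optional[torch.Tensor] = None
    slot_mapping: Optional[torch.Tensor] = None
    block_tables: Optional[torch.Tensor] = None
    chunked_seq: Optional[object] = None

_CONTEXT = Context()

def get_context() -> Context:
    return _CONTEXT

def set_context(**fields) -> None:
    global _CONTEXT
    _CONTEXT = Context(**fields)
```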
## CPU Offload System

### Overview
When `enable_cpu_offload=True`, the KV cache is stored on CPU with a small GPU buffer for computation. This enables long-context inference with limited GPU memory.
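A hypothetical usage sketch, assuming `enable_cpu_offload` is a keyword argument on the `LLM` entry point (the model path and sampling arguments are illustrative):

```python
from nanovllm import LLM, SamplingParams

# Hypothetical example: long-context inference with the KV cache offloaded to CPU.
llm = LLM("Qwen/Qwen3-0.6B", enable_cpu_offload=True)
outputs = llm.generate(
    ["Summarize this long document: ..."],
    SamplingParams(temperature=0.6, max_tokens=256),
)
```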
### Three-Region GPU Buffer Design

```
GPU Slots:   [0]       [1, 2, 3]     [4, 5]
              ↑             ↑            ↑
           decode        compute     prefetch
          (1 slot)      (N slots)    (M slots)
```
- **Decode slot**: The new token's KV is written here during decode
- **Compute region**: Holds CPU blocks loaded for the current chunk's computation
- **Prefetch region**: Asynchronously loads the next chunk while the current one is being computed
File: `nanovllm/kvcache/offload_engine.py`

Key attributes:
- `decode_slot = 0`: Fixed slot for decode KV writes
- `compute_slots`: List of GPU slots for the compute region
- `prefetch_slots`: List of GPU slots for the prefetch region
- `k_cache_gpu` / `v_cache_gpu`: Shape `[num_layers, num_gpu_blocks, block_size, kv_heads, head_dim]`
- `k_cache_cpu` / `v_cache_cpu`: Shape `[num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]` (pinned memory)
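A minimal allocation sketch following the shapes above (all sizes and names other than the listed attributes are illustrative):

```python
import torch

# Illustrative sizes; the real values come from the model config and memory budget.
num_layers, kv_heads, head_dim, block_size = 28, 8, 128, 256
num_gpu_blocks, num_cpu_blocks, num_compute_slots = 6, 1024, 3
dtype = torch.bfloat16

gpu_shape = (num_layers, num_gpu_blocks, block_size, kv_heads, head_dim)
cpu_shape = (num_layers, num_cpu_blocks, block_size, kv_heads, head_dim)

k_cache_gpu = torch.empty(gpu_shape, dtype=dtype, device="cuda")
v_cache_gpu = torch.empty(gpu_shape, dtype=dtype, device="cuda")

# Pinned (page-locked) CPU memory enables async, non-blocking H2D/D2H copies.
k_cache_cpu = torch.empty(cpu_shape, dtype=dtype, device="cpu", pin_memory=True)
v_cache_cpu = torch.empty(cpu_shape, dtype=dtype, device="cpu", pin_memory=True)

# Fixed slot layout: slot 0 for decode writes, then compute and prefetch regions.
decode_slot = 0
compute_slots = list(range(1, 1 + num_compute_slots))
prefetch_slots = list(range(1 + num_compute_slots, num_gpu_blocks))
```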
### Per-Layer Loading (Critical Design)
**Problem solved**: The original design had layer 0 load ALL layers' KV at once. When layer 0 processed chunk 1, it overwrote chunk 0's data before layers 1+ could read it.

**Solution**: Each layer independently loads only its own KV data:
```python
# Per-layer methods in OffloadEngine
load_to_compute_layer(layer_id, cpu_block_ids)    # Load a single layer's blocks into the compute region
wait_compute_layer(layer_id)                      # Wait for that layer's transfer
load_to_prefetch_layer(layer_id, cpu_block_ids)   # Load a single layer's blocks into the prefetch region
wait_prefetch_layer(layer_id)                     # Wait for that layer's prefetch
```
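A minimal sketch of what one such per-layer load might look like, using a dedicated copy stream and per-layer CUDA events (stream, event, and attribute names are illustrative):

```python
import torch

# Illustrative per-layer H2D copy: only layer_id's KV is staged, so other
# layers' data in the compute region is never clobbered mid-chunk.
def load_to_compute_layer(self, layer_id, cpu_block_ids):
    with torch.cuda.stream(self.h2d_stream):                    # dedicated copy stream (assumed attribute)
        for slot, cpu_block in zip(self.compute_slots, cpu_block_ids):
            self.k_cache_gpu[layer_id, slot].copy_(self.k_cache_cpu[layer_id, cpu_block], non_blocking=True)
            self.v_cache_gpu[layer_id, slot].copy_(self.v_cache_cpu[layer_id, cpu_block], non_blocking=True)
        self.compute_events[layer_id].record(self.h2d_stream)   # one CUDA event per layer (assumed attribute)

def wait_compute_layer(self, layer_id):
    # Block the current (compute) stream until this layer's copy has completed.
    torch.cuda.current_stream().wait_event(self.compute_events[layer_id])
```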
### Chunked Prefill Flow

File: `nanovllm/layers/attention.py` - `_chunked_prefill_attention()`
For each prefill chunk:
1. Current chunk's KV is written to GPU (compute region slots)
2. Load previous chunks' KV from CPU to prefetch region
3. Compute attention against previous KV (no causal mask)
4. Compute attention against current KV (causal mask)
5. Merge results using online softmax (LSE)
6. Offload current chunk's KV to CPU
**Important**: Prefill uses ONLY the prefetch region, to avoid conflicting with the current chunk's KV being written into the compute region.
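A condensed, illustrative per-layer view of that flow (`attention_with_lse()` and `gather_prefetch()` are hypothetical helpers standing in for the FlashAttention calls and GPU-slot gathering in the real code):

```python
# Illustrative per-layer chunked-prefill attention for one chunk of queries.
def prefill_chunk_attention(layer_id, q, k_cur, v_cur, prev_cpu_blocks, engine, blocks_per_load):
    o_acc, lse_acc = None, None
    # Steps 2-3: stream previous chunks' KV from CPU through the prefetch region.
    for start in range(0, len(prev_cpu_blocks), blocks_per_load):
        blocks = prev_cpu_blocks[start:start + blocks_per_load]
        engine.load_to_prefetch_layer(layer_id, blocks)
        engine.wait_prefetch_layer(layer_id)
        k_prev, v_prev = engine.gather_prefetch(layer_id, len(blocks))   # hypothetical helper
        o, lse = attention_with_lse(q, k_prev, v_prev, causal=False)     # no causal mask vs. the past
        o_acc, lse_acc = (o, lse) if o_acc is None else merge_attention_outputs(o_acc, lse_acc, o, lse)
    # Step 4: attend to the current chunk's own KV with a causal mask.
    o_cur, lse_cur = attention_with_lse(q, k_cur, v_cur, causal=True)
    if o_acc is None:
        return o_cur
    # Step 5: merge partial results via log-sum-exp; step 6 (offload to CPU) happens in the caller.
    o_acc, _ = merge_attention_outputs(o_acc, lse_acc, o_cur, lse_cur)
    return o_acc
```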
### Chunked Decode Flow (Double Buffering)

File: `nanovllm/layers/attention.py` - `_chunked_decode_attention()`
Timeline (async double buffering):

```
         ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
Load:    │C0 → Compute │    │C1 → Prefetch│    │C2 → Compute │
         └─────────────┘    └─────────────┘    └─────────────┘
               ↘                   ↘                  ↘
Compute:      [C0]                [C1]               [C2]
```
1. Pre-load first chunk to compute region
2. Wait for current buffer, trigger async prefetch of next chunk to OTHER buffer
3. Compute attention, merge results
4. Swap buffers, repeat
5. Finally attend to decode slot (new token's KV)
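A condensed, illustrative per-layer view of that loop (`load_chunk()`, `wait_chunk()`, `gather()`, `read_decode_slot()`, and `attention_with_lse()` are hypothetical helpers, not the real OffloadEngine API):

```python
# Illustrative double-buffered decode attention: overlap the next chunk's H2D
# copy with attention over the current chunk, then attend to the decode slot.
def decode_chunk_attention(layer_id, q, chunks, engine):
    o_acc, lse_acc = None, None
    regions = ("compute", "prefetch")
    engine.load_chunk(layer_id, chunks[0], region="compute")            # step 1: pre-load first chunk
    for i in range(len(chunks)):
        cur = regions[i % 2]
        engine.wait_chunk(layer_id, region=cur)                         # step 2: wait for current buffer...
        if i + 1 < len(chunks):                                         # ...and prefetch the next chunk
            engine.load_chunk(layer_id, chunks[i + 1], region=regions[(i + 1) % 2])
        k, v = engine.gather(layer_id, region=cur)
        o, lse = attention_with_lse(q, k, v, causal=False)              # step 3: attend and merge
        o_acc, lse_acc = (o, lse) if o_acc is None else merge_attention_outputs(o_acc, lse_acc, o, lse)
    k_new, v_new = engine.read_decode_slot(layer_id)                    # step 5: new token's KV
    o_new, lse_new = attention_with_lse(q, k_new, v_new, causal=False)
    o_acc, _ = merge_attention_outputs(o_acc, lse_acc, o_new, lse_new)
    return o_acc
```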
### HybridKVCacheManager

File: `nanovllm/kvcache/hybrid_manager.py`
Manages both GPU and CPU blocks:
- `allocate()`: Allocate a GPU block first, falling back to CPU
- `allocate_cpu_only()`: Force CPU allocation (for chunked prefill)
- `get_all_cpu_blocks(seq)`: Get all CPU block IDs for a sequence
- `get_prefilled_cpu_blocks(seq)`: Get CPU blocks from previous chunks
- `may_offload()`: Offload GPU blocks to CPU when the decode slot fills
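A minimal sketch of the GPU-first allocation policy (free-list and sequence attribute names are illustrative):

```python
# Illustrative GPU-first block allocation with CPU fallback.
# free_gpu_blocks / free_cpu_blocks are assumed to be deques of free block IDs.
def allocate(self, seq, num_blocks: int):
    for _ in range(num_blocks):
        if self.free_gpu_blocks:                         # prefer GPU blocks while they last
            seq.block_table.append(("gpu", self.free_gpu_blocks.popleft()))
        elif self.free_cpu_blocks:                       # otherwise fall back to CPU blocks
            seq.block_table.append(("cpu", self.free_cpu_blocks.popleft()))
        else:
            raise RuntimeError("out of KV cache blocks")
```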
### Online Softmax Merge

File: `nanovllm/kvcache/chunked_attention.py`
When computing attention across multiple chunks, results are merged using log-sum-exp:
```python
def merge_attention_outputs(o1, lse1, o2, lse2):
    # Uses LSE to correctly weight and combine partial attention outputs
    ...
```
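A self-contained sketch of that merge, assuming each `o` has shape `[num_tokens, num_heads, head_dim]` and each `lse` has shape `[num_tokens, num_heads]`:

```python
import torch

# Illustrative log-sum-exp merge of two partial attention outputs.
def merge_attention_outputs(o1, lse1, o2, lse2):
    lse = torch.logaddexp(lse1, lse2)           # combined normalizer
    w1 = torch.exp(lse1 - lse).unsqueeze(-1)    # weight of the first partial result
    w2 = torch.exp(lse2 - lse).unsqueeze(-1)    # weight of the second partial result
    return w1 * o1 + w2 * o2, lse
```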
### Ring Buffer Design (Future Optimization)
The current double-buffering limits pipeline depth. Planned improvements:
- Unified ring buffer using all GPU slots (except decode)
- Per-slot per-layer CUDA events for fine-grained sync
- Deeper pipeline: prefetch N-1 blocks ahead (vs 1 chunk)