From a37f07943cc8e8cf827f5b6654fb32d324edbb51 Mon Sep 17 00:00:00 2001
From: Zijie Tian
Date: Mon, 15 Dec 2025 00:13:27 +0800
Subject: [PATCH] [docs] Update the CLAUDE.md.

---
 CLAUDE.md | 158 +++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 132 insertions(+), 26 deletions(-)

diff --git a/CLAUDE.md b/CLAUDE.md
index adcd99b..b466f4c 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -15,8 +15,13 @@ pip install -e .
 # Run example
 python example.py

-# Run benchmark
-python bench.py
+# Run benchmarks
+python bench.py          # Standard benchmark
+python bench_offload.py  # CPU offload benchmark
+
+# Test chunked attention
+CUDA_VISIBLE_DEVICES=4,5 python tests/test_chunked_attention.py 6 2048 64 2
+# Args: num_gpu_blocks input_len output_len num_prefetch_blocks
 ```

 ## Architecture
@@ -25,49 +30,150 @@ python bench.py
 **LLMEngine** (`nanovllm/engine/llm_engine.py`):
 - Main entry point, wraps ModelRunner and Scheduler
-- Handles tokenization and multi-process tensor parallelism coordination
-- `generate()` method runs the prefill-decode loop until all sequences finish
+- `generate()` runs the prefill-decode loop until all sequences finish

 **ModelRunner** (`nanovllm/engine/model_runner.py`):
 - Loads model weights, allocates KV cache, captures CUDA graphs
-- Rank 0 is the main process; ranks 1+ run in separate processes via `loop()` waiting on shared memory events
-- `run()` prepares inputs and executes model forward pass
+- Rank 0 is the main process; ranks 1+ run `loop()` in separate processes, driven by shared memory events

 **Scheduler** (`nanovllm/engine/scheduler.py`):
 - Two-phase scheduling: prefill (waiting queue) then decode (running queue)
-- Handles preemption when memory is constrained by moving sequences back to waiting

 **BlockManager** (`nanovllm/engine/block_manager.py`):
 - Paged attention block allocation with prefix caching via xxhash
-- Blocks are 256 tokens by default, tracked with reference counting
+- Blocks are 256 tokens by default

-**Sequence** (`nanovllm/engine/sequence.py`):
-- Tracks token IDs, block table, and sampling parameters per request
-- Custom `__getstate__`/`__setstate__` for efficient pickling across processes
-
-### Model Implementation
-
-**Qwen3ForCausalLM** (`nanovllm/models/qwen3.py`):
-- Standard transformer: embedding → decoder layers → RMSNorm → LM head
-- Uses `packed_modules_mapping` for weight loading (q/k/v → qkv_proj, gate/up → gate_up_proj)
+### Model & Attention

 **Attention** (`nanovllm/layers/attention.py`):
-- Uses FlashAttention (`flash_attn_varlen_func` for prefill, `flash_attn_with_kvcache` for decode)
-- Custom Triton kernel `store_kvcache_kernel` for KV cache writes
+- FlashAttention: `flash_attn_varlen_func` (prefill), `flash_attn_with_kvcache` (decode)
+- Triton kernel `store_kvcache_kernel` for KV cache writes
+- Chunked attention methods: `_chunked_prefill_attention()`, `_chunked_decode_attention()`

-**Parallel Layers** (`nanovllm/layers/linear.py`, `embed_head.py`):
-- Tensor parallelism via column/row parallel linear layers with custom weight loaders
+**Global Context** (`nanovllm/utils/context.py`):
+- Stores attention metadata via `get_context()`/`set_context()`
+- Key fields: `cu_seqlens`, `slot_mapping`, `block_tables`, `chunked_seq`

-### Key Design Patterns
+## CPU Offload System

-- **Global Context**: `nanovllm/utils/context.py` stores attention metadata (cu_seqlens, slot_mapping, block_tables) accessed via `get_context()`/`set_context()`
-- **CUDA Graph Capture**: Decode phase uses captured graphs for batch sizes 1, 2, 4, 8, 16, 32... up to max_num_seqs (capped at 512)
-- **Shared Memory IPC**: Tensor parallel workers receive commands via pickled data in SharedMemory, synchronized with Events
+### Overview

-### Config Defaults
+When `enable_cpu_offload=True`, the KV cache is stored in CPU memory and a small GPU buffer is used for computation. This enables long-context inference with limited GPU memory.
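+
+A minimal usage sketch for enabling offload from Python. Hedged: it assumes
+`enable_cpu_offload` is accepted as an `LLM`/`Config` keyword argument and uses an
+illustrative model path; check `nanovllm/config.py` for the authoritative option names.
+
+```python
+from nanovllm import LLM, SamplingParams
+
+# Illustrative only: model path and kwargs are examples, not prescriptive defaults.
+llm = LLM("Qwen/Qwen3-0.6B", enforce_eager=True, enable_cpu_offload=True)
+outs = llm.generate(["Hello, my name is"], SamplingParams(temperature=0.6, max_tokens=32))
+print(outs[0]["text"])
+```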
+
+### Three-Region GPU Buffer Design
+
+```
+GPU Slots:   [0]        [1, 2, 3]     [4, 5]
+              ↑             ↑            ↑
+           decode        compute     prefetch
+          (1 slot)      (N slots)    (M slots)
+
+- Decode slot: New token's KV written here during decode
+- Compute region: Load CPU blocks for current chunk computation
+- Prefetch region: Async load next chunk while computing current
+```
+
+**File**: `nanovllm/kvcache/offload_engine.py`
+
+Key attributes:
+- `decode_slot = 0`: Fixed slot for decode KV writes
+- `compute_slots`: List of GPU slots for compute region
+- `prefetch_slots`: List of GPU slots for prefetch region
+- `k_cache_gpu/v_cache_gpu`: Shape `[num_layers, num_gpu_blocks, block_size, kv_heads, head_dim]`
+- `k_cache_cpu/v_cache_cpu`: Shape `[num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]` (pinned memory)
+
+### Per-Layer Loading (Critical Design)
+
+**Problem solved**: the original design had layer 0 load ALL layers' KV at once, so when layer 0 moved on to chunk 1 it overwrote chunk 0's data before layers 1+ had read it.
+
+**Solution**: each layer independently loads only its own KV data:
+```python
+# Per-layer methods in OffloadEngine
+load_to_compute_layer(layer_id, cpu_block_ids)   # Load single layer to compute region
+wait_compute_layer(layer_id)                     # Wait for layer's transfer
+load_to_prefetch_layer(layer_id, cpu_block_ids)  # Load single layer to prefetch region
+wait_prefetch_layer(layer_id)                    # Wait for layer's prefetch
+```
+
+### Chunked Prefill Flow
+
+**File**: `nanovllm/layers/attention.py` - `_chunked_prefill_attention()`
+
+```
+For each prefill chunk:
+1. Current chunk's KV is written to GPU (compute region slots)
+2. Load previous chunks' KV from CPU to prefetch region
+3. Compute attention against previous KV (no causal mask)
+4. Compute attention against current KV (causal mask)
+5. Merge results using online softmax (LSE)
+6. Offload current chunk's KV to CPU
+```
+
+**Important**: prefill loads previous chunks ONLY into the prefetch region, so they never conflict with the current chunk's KV, which is being written to the compute region.
+
+### Chunked Decode Flow (Double Buffering)
+
+**File**: `nanovllm/layers/attention.py` - `_chunked_decode_attention()`
+
+```
+Timeline (async double buffering):
+         ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
+Load:    │C0 → Compute │    │C1 → Prefetch│    │C2 → Compute │
+         └─────────────┘    └─────────────┘    └─────────────┘
+                ↘                  ↘                  ↘
+Compute:       [C0]               [C1]               [C2]
+
+1. Pre-load first chunk to compute region
+2. Wait for current buffer, trigger async prefetch of next chunk to OTHER buffer
+3. Compute attention, merge results
+4. Swap buffers, repeat
+5. Finally, attend to the decode slot (the new token's KV)
+```
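+
+A schematic sketch of this loop for a single layer (hedged: `attend_region` is a
+hypothetical stand-in for the FlashAttention call against one GPU region, and
+`merge_attention_outputs` is assumed to return the merged `(output, lse)` pair;
+the real code in `attention.py` may differ in its details):
+
+```python
+# Schematic only: mirrors the numbered steps above, not the exact implementation.
+from nanovllm.kvcache.chunked_attention import merge_attention_outputs
+
+def chunked_decode_layer(layer_id, q, cpu_chunks, engine, attend_region):
+    # Step 1: pre-load the first chunk into the compute region for this layer
+    engine.load_to_compute_layer(layer_id, cpu_chunks[0])
+    out = lse = None
+    regions = ["compute", "prefetch"]
+    for i in range(len(cpu_chunks)):
+        cur, nxt = regions[i % 2], regions[(i + 1) % 2]
+        # Step 2: wait for the current buffer, then kick off an async prefetch of
+        # the next chunk into the OTHER buffer so the copy overlaps with compute
+        if cur == "compute":
+            engine.wait_compute_layer(layer_id)
+        else:
+            engine.wait_prefetch_layer(layer_id)
+        if i + 1 < len(cpu_chunks):
+            if nxt == "compute":
+                engine.load_to_compute_layer(layer_id, cpu_chunks[i + 1])
+            else:
+                engine.load_to_prefetch_layer(layer_id, cpu_chunks[i + 1])
+        # Step 3: attention against this chunk's KV, merged into the running result via LSE
+        o_i, lse_i = attend_region(q, cur)
+        out, lse = (o_i, lse_i) if out is None else merge_attention_outputs(out, lse, o_i, lse_i)
+        # Step 4: buffers swap implicitly on the next iteration
+    # Step 5: finally attend to the decode slot holding the new token's KV
+    o_d, lse_d = attend_region(q, "decode")
+    out, _ = merge_attention_outputs(out, lse, o_d, lse_d)
+    return out
+```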
+
+### HybridKVCacheManager
+
+**File**: `nanovllm/kvcache/hybrid_manager.py`
+
+Manages both GPU and CPU blocks:
+- `allocate()`: Allocate a GPU block first, falling back to CPU
+- `allocate_cpu_only()`: Force CPU allocation (for chunked prefill)
+- `get_all_cpu_blocks(seq)`: Get all CPU block IDs for a sequence
+- `get_prefilled_cpu_blocks(seq)`: Get CPU blocks from previous chunks
+- `may_offload()`: Offload GPU blocks to CPU when the decode slot fills
+
+### Online Softmax Merge
+
+**File**: `nanovllm/kvcache/chunked_attention.py`
+
+When computing attention across multiple chunks, the partial results are merged using log-sum-exp:
+```python
+def merge_attention_outputs(o1, lse1, o2, lse2):
+    # Uses the LSE values to correctly weight and combine partial attention outputs
+    ...
+```
+
+### Ring Buffer Design (Future Optimization)
+
+The current double buffering limits pipeline depth. Planned improvements:
+- Unified ring buffer using all GPU slots (except decode)
+- Per-slot, per-layer CUDA events for fine-grained sync
+- Deeper pipeline: prefetch N-1 blocks ahead (vs. 1 chunk)
+
+## Config Defaults

 - `max_num_batched_tokens`: 16384
 - `max_num_seqs`: 512
 - `kvcache_block_size`: 256
 - `gpu_memory_utilization`: 0.9
 - `enforce_eager`: False (enables CUDA graphs)
+
+## Testing CPU Offload
+
+```bash
+# Basic test with limited GPU blocks to trigger offload
+CUDA_VISIBLE_DEVICES=4,5 python tests/test_chunked_attention.py 6 2048 64 2
+
+# Verify consistency (repeated runs should produce identical output)
+for i in 1 2 3; do
+  CUDA_VISIBLE_DEVICES=4,5 python tests/test_chunked_attention.py 6 2048 32 2 2>&1 | tail -3
+done
+```
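+
+A quick numerical sanity check of the LSE merge itself (a standalone sketch in plain
+PyTorch, independent of the repo's actual `merge_attention_outputs` implementation):
+
+```python
+# Verifies that merging per-chunk attention via log-sum-exp matches full attention.
+import torch
+
+def attn_chunk(q, k, v):
+    # q: [heads, dim]; k, v: [n, heads, dim] -> per-chunk output and log-sum-exp
+    s = torch.einsum("hd,nhd->hn", q, k) / q.shape[-1] ** 0.5
+    return torch.einsum("hn,nhd->hd", torch.softmax(s, -1), v), torch.logsumexp(s, -1)
+
+def merge(o1, lse1, o2, lse2):
+    # Reweight each partial output by its softmax denominator, then renormalize
+    m = torch.maximum(lse1, lse2)
+    w1, w2 = torch.exp(lse1 - m), torch.exp(lse2 - m)
+    return (o1 * w1[:, None] + o2 * w2[:, None]) / (w1 + w2)[:, None]
+
+q, k, v = torch.randn(4, 64), torch.randn(1024, 4, 64), torch.randn(1024, 4, 64)
+o_full, _ = attn_chunk(q, k, v)
+o1, l1 = attn_chunk(q, k[:512], v[:512])
+o2, l2 = attn_chunk(q, k[512:], v[512:])
+assert torch.allclose(merge(o1, l1, o2, l2), o_full, atol=1e-4)
+```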