[docs] Update the CLAUDE.md.

Author: Zijie Tian
Date: 2025-12-15 00:13:27 +08:00
Parent: 61edb8a344
Commit: a37f07943c

CLAUDE.md (158 lines changed)

```bash
pip install -e .

# Run example
python example.py
# Run benchmarks
python bench.py # Standard benchmark
python bench_offload.py # CPU offload benchmark
# Test chunked attention
CUDA_VISIBLE_DEVICES=4,5 python tests/test_chunked_attention.py 6 2048 64 2
# Args: num_gpu_blocks input_len output_len num_prefetch_blocks
```
## Architecture
**LLMEngine** (`nanovllm/engine/llm_engine.py`):
- Main entry point, wraps ModelRunner and Scheduler
- Handles tokenization and multi-process tensor parallelism coordination
- `generate()` method runs the prefill-decode loop until all sequences finish
**ModelRunner** (`nanovllm/engine/model_runner.py`):
- Loads model weights, allocates KV cache, captures CUDA graphs
- Rank 0 is the main process; ranks 1+ run in separate processes via `loop()` waiting on shared memory events
- `run()` prepares inputs and executes the model forward pass
**Scheduler** (`nanovllm/engine/scheduler.py`):
- Two-phase scheduling: prefill (waiting queue) then decode (running queue)
- Handles preemption when memory is constrained by moving sequences back to waiting
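A minimal sketch of this two-phase loop, assuming a toy sequence that only tracks its prompt length; the queue names (`waiting`, `running`) come from the bullets above, everything else is illustrative:
```python
from collections import deque

class ToySequence:
    """Placeholder sequence: only the prompt length matters for budgeting here."""
    def __init__(self, num_prompt_tokens):
        self.num_prompt_tokens = num_prompt_tokens

class ToyScheduler:
    """Two-phase scheduling: prefill from `waiting`, then decode from `running`."""
    def __init__(self, max_num_batched_tokens=16384, max_num_seqs=512):
        self.max_num_batched_tokens = max_num_batched_tokens
        self.max_num_seqs = max_num_seqs
        self.waiting = deque()
        self.running = deque()

    def schedule(self):
        # Phase 1: prefill — admit waiting sequences while the token budget allows.
        batch, budget = [], self.max_num_batched_tokens
        while self.waiting and len(batch) < self.max_num_seqs:
            seq = self.waiting[0]
            if seq.num_prompt_tokens > budget:
                break
            budget -= seq.num_prompt_tokens
            batch.append(self.waiting.popleft())
            self.running.append(seq)
        if batch:
            return batch, True          # is_prefill = True

        # Phase 2: decode — every running sequence generates one token.
        return list(self.running)[: self.max_num_seqs], False

    def preempt(self, seq):
        # Under memory pressure, push a running sequence back to `waiting`
        # so its blocks can be freed and re-prefilled later.
        self.running.remove(seq)
        self.waiting.appendleft(seq)
```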
**BlockManager** (`nanovllm/engine/block_manager.py`):
- Paged attention block allocation with prefix caching via xxhash
- Blocks are 256 tokens by default, tracked with reference counting
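A sketch of the prefix-cache lookup implied above: hash each full block of token IDs with xxhash, chained with the previous block's hash, and bump a reference count on a hit. The table and variable names here are illustrative, not the BlockManager's actual fields:
```python
import numpy as np
import xxhash

BLOCK_SIZE = 256                     # tokens per block (default)
hash_to_block = {}                   # block hash -> block id (prefix cache)
ref_counts = {}                      # block id -> reference count
free_blocks = list(range(1024))      # toy free list

def compute_hash(token_ids, prefix_hash=-1):
    # Chain the previous block's hash so identical prefixes share block chains.
    h = xxhash.xxh64()
    h.update(prefix_hash.to_bytes(8, "little", signed=True))
    h.update(np.asarray(token_ids, dtype=np.int64).tobytes())
    return h.intdigest()

def allocate_block(token_ids, prefix_hash=-1):
    assert len(token_ids) == BLOCK_SIZE      # only full blocks are hashed/cached
    block_hash = compute_hash(token_ids, prefix_hash)
    if block_hash in hash_to_block:          # prefix cache hit: reuse the block
        block_id = hash_to_block[block_hash]
        ref_counts[block_id] += 1
        return block_id, block_hash
    block_id = free_blocks.pop()             # miss: take a fresh block
    hash_to_block[block_hash] = block_id
    ref_counts[block_id] = 1
    return block_id, block_hash
```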
**Sequence** (`nanovllm/engine/sequence.py`):
- Tracks token IDs, block table, and sampling parameters per request
- Custom `__getstate__`/`__setstate__` for efficient pickling across processes
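The pickling pattern in miniature; the fields below are placeholders chosen to show the idea (ship only the small, shared state across processes), not the real `Sequence` attributes:
```python
class PicklableSequence:
    """Illustrates slimmed-down pickling; field names are illustrative only."""
    def __init__(self, seq_id, token_ids, sampling_params):
        self.seq_id = seq_id
        self.token_ids = token_ids
        self.sampling_params = sampling_params
        self.block_table = []          # managed on the scheduler side
        self._scratch = None           # large, worker-local, never shipped

    def __getstate__(self):
        # Send only the small fields the other process actually needs.
        return (self.seq_id, self.token_ids, self.sampling_params, self.block_table)

    def __setstate__(self, state):
        self.seq_id, self.token_ids, self.sampling_params, self.block_table = state
        self._scratch = None           # transient state is recreated lazily
```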
### Model & Attention
**Qwen3ForCausalLM** (`nanovllm/models/qwen3.py`):
- Standard transformer: embedding → decoder layers → RMSNorm → LM head
- Uses `packed_modules_mapping` for weight loading (q/k/v → qkv_proj, gate/up → gate_up_proj)
**Attention** (`nanovllm/layers/attention.py`):
- Uses FlashAttention (`flash_attn_varlen_func` for prefill, `flash_attn_with_kvcache` for decode)
- Custom Triton kernel `store_kvcache_kernel` for KV cache writes
- Chunked attention methods: `_chunked_prefill_attention()`, `_chunked_decode_attention()`
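For reference, the two FlashAttention entry points are typically invoked as below (flash-attn 2.x signatures). The wrapper names and tensor layouts are assumptions consistent with the context fields described later, not the repo's exact code:
```python
import torch
from flash_attn import flash_attn_varlen_func, flash_attn_with_kvcache

def prefill_attention(q, k, v, cu_seqlens, max_seqlen, scale):
    # Varlen packing: q/k/v are [total_tokens, num_heads, head_dim] with
    # per-sequence boundaries given by cu_seqlens (shape [batch + 1]).
    return flash_attn_varlen_func(
        q, k, v,
        cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
        max_seqlen_q=max_seqlen, max_seqlen_k=max_seqlen,
        softmax_scale=scale, causal=True,
    )

def decode_attention(q, k_cache, v_cache, cache_seqlens, block_tables, scale):
    # q is [batch, 1, num_heads, head_dim]; k_cache/v_cache are paged
    # [num_blocks, block_size, num_kv_heads, head_dim], indexed via block_tables.
    return flash_attn_with_kvcache(
        q, k_cache, v_cache,
        cache_seqlens=cache_seqlens, block_table=block_tables,
        softmax_scale=scale, causal=True,
    )
```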
**Parallel Layers** (`nanovllm/layers/linear.py`, `embed_head.py`):
- Tensor parallelism via column/row parallel linear layers with custom weight loaders
**Global Context** (`nanovllm/utils/context.py`):
- Stores attention metadata via `get_context()`/`set_context()`
- Key fields: `cu_seqlens`, `slot_mapping`, `block_tables`, `chunked_seq`
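A sketch of the get/set pattern using only the fields named above (the real dataclass likely carries more metadata):
```python
from dataclasses import dataclass
from typing import Optional
import torch

@dataclass
class Context:
    # Only the fields mentioned in this doc; the actual module may hold more.
    cu_seqlens: Optional[torch.Tensor] = None
    slot_mapping: Optional[torch.Tensor] = None
    block_tables: Optional[torch.Tensor] = None
    chunked_seq: Optional[object] = None

_CONTEXT = Context()

def set_context(**kwargs):
    # The model runner fills the metadata once per forward pass...
    global _CONTEXT
    _CONTEXT = Context(**kwargs)

def get_context() -> Context:
    # ...and attention layers read it without threading it through call signatures.
    return _CONTEXT
```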
### Key Design Patterns
- **CUDA Graph Capture**: Decode phase uses captured graphs for batch sizes 1, 2, 4, 8, 16, 32, ... up to `max_num_seqs` (capped at 512)
- **Shared Memory IPC**: Tensor-parallel workers receive commands as pickled data in SharedMemory, synchronized with Events

## CPU Offload System

### Overview
When `enable_cpu_offload=True`, the KV cache is stored on CPU with a small GPU buffer for computation. This enables long-context inference with limited GPU memory.
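Hypothetical usage: `enable_cpu_offload` is documented above, but the other constructor arguments, the model path, and the output format follow the upstream nano-vllm style and may differ here:
```python
from nanovllm import LLM, SamplingParams  # import path assumed from the repo layout

# Assumption: Config kwargs are forwarded through the LLM constructor.
llm = LLM(
    "Qwen/Qwen3-0.6B",
    enable_cpu_offload=True,   # KV cache lives on CPU, streamed through a small GPU buffer
    tensor_parallel_size=1,
)
outputs = llm.generate(
    ["Summarize the paged-attention design in one paragraph."],
    SamplingParams(temperature=0.6, max_tokens=128),
)
print(outputs[0]["text"])
```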
### Three-Region GPU Buffer Design
```
GPU Slots:  [0]    [1, 2, 3]    [4, 5]
             ↑         ↑           ↑
          decode    compute    prefetch
         (1 slot)  (N slots)  (M slots)
```
- Decode slot: the new token's KV is written here during decode
- Compute region: CPU blocks are loaded here for the current chunk's computation
- Prefetch region: the next chunk is loaded asynchronously while the current chunk computes
**File**: `nanovllm/kvcache/offload_engine.py`
Key attributes:
- `decode_slot = 0`: Fixed slot for decode KV writes
- `compute_slots`: List of GPU slots for compute region
- `prefetch_slots`: List of GPU slots for prefetch region
- `k_cache_gpu/v_cache_gpu`: Shape `[num_layers, num_gpu_blocks, block_size, kv_heads, head_dim]`
- `k_cache_cpu/v_cache_cpu`: Shape `[num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]` (pinned memory)
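How these buffers could be allocated, as a sketch; the constructor parameters, stream names, and slot-partition arithmetic are illustrative — only the shapes and the decode/compute/prefetch split come from this doc:
```python
import torch

class ToyOffloadEngine:
    """Allocates the GPU slot buffer and the pinned CPU pool described above."""
    def __init__(self, num_layers, num_gpu_blocks, num_cpu_blocks,
                 block_size=256, kv_heads=8, head_dim=128,
                 num_prefetch_blocks=2, dtype=torch.bfloat16):
        shape_gpu = (num_layers, num_gpu_blocks, block_size, kv_heads, head_dim)
        shape_cpu = (num_layers, num_cpu_blocks, block_size, kv_heads, head_dim)
        # Small GPU working set: decode slot + compute region + prefetch region.
        self.k_cache_gpu = torch.empty(shape_gpu, dtype=dtype, device="cuda")
        self.v_cache_gpu = torch.empty(shape_gpu, dtype=dtype, device="cuda")
        # Full KV history lives in pinned host memory so H2D/D2H copies can be async.
        self.k_cache_cpu = torch.empty(shape_cpu, dtype=dtype, pin_memory=True)
        self.v_cache_cpu = torch.empty(shape_cpu, dtype=dtype, pin_memory=True)
        # Slot partition: [decode][compute ...][prefetch ...]
        self.decode_slot = 0
        self.prefetch_slots = list(range(num_gpu_blocks - num_prefetch_blocks, num_gpu_blocks))
        self.compute_slots = list(range(1, num_gpu_blocks - num_prefetch_blocks))
        # Dedicated copy streams keep transfers off the compute stream.
        self.h2d_stream = torch.cuda.Stream()
        self.d2h_stream = torch.cuda.Stream()
```
With the test configuration above (`6` GPU blocks, `2` prefetch blocks) this partition reproduces the diagram: slot 0 for decode, slots 1–3 for compute, slots 4–5 for prefetch.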
### Per-Layer Loading (Critical Design)
**Problem solved**: The original design had layer 0 load ALL layers' KV at once, so when layer 0 moved on to chunk 1 it overwrote chunk 0's data before layers 1+ had read it.
**Solution**: Each layer independently loads only its own KV data:
```python
# Per-layer methods in OffloadEngine
load_to_compute_layer(layer_id, cpu_block_ids) # Load single layer to compute region
wait_compute_layer(layer_id) # Wait for layer's transfer
load_to_prefetch_layer(layer_id, cpu_block_ids) # Load single layer to prefetch region
wait_prefetch_layer(layer_id) # Wait for layer's prefetch
```
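Sketch of a layer-by-layer pass using these methods; `layers`, `hidden`, and the call pattern are placeholders (in the real pipeline the next layer's load would be issued earlier so the copy overlaps with compute):
```python
def run_chunk_through_layers(engine, layers, hidden, cpu_block_ids):
    # Each layer fetches ONLY its own slice [layer_id, blocks, ...] of the CPU cache,
    # so advancing one layer to the next chunk can never clobber KV that a deeper
    # layer has not read yet (the problem described above).
    for layer_id, layer in enumerate(layers):
        engine.load_to_compute_layer(layer_id, cpu_block_ids)  # async H2D copy for this layer
        engine.wait_compute_layer(layer_id)                    # sync right before this layer attends
        hidden = layer(hidden)                                 # attention reads the compute region
    return hidden
```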
### Chunked Prefill Flow
**File**: `nanovllm/layers/attention.py` - `_chunked_prefill_attention()`
```
For each prefill chunk:
1. Current chunk's KV is written to GPU (compute region slots)
2. Load previous chunks' KV from CPU to prefetch region
3. Compute attention against previous KV (no causal mask)
4. Compute attention against current KV (causal mask)
5. Merge results using online softmax (LSE)
6. Offload current chunk's KV to CPU
```
**Important**: During prefill, previous chunks' KV is loaded ONLY into the prefetch region; the compute region is left alone because it holds the current chunk's KV as it is being written.
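A sketch of one prefill chunk for a single layer, following steps 1–6; `attend`, `merge`, `offload`, and the `prefetch_region` accessor are hypothetical stand-ins for the repo's kernels and helpers:
```python
def chunked_prefill_step(q, k_cur, v_cur, layer_id, engine, prev_cpu_blocks,
                         attend, merge, offload):
    # One prefill chunk for one layer, mirroring steps 1-6 above.
    # Step 1 (writing k_cur/v_cur into the compute region) is assumed already done.
    engine.load_to_prefetch_layer(layer_id, prev_cpu_blocks)    # 2. previous chunks -> prefetch region
    engine.wait_prefetch_layer(layer_id)
    k_prev, v_prev = engine.prefetch_region(layer_id)           # hypothetical accessor for the loaded KV
    o_prev, lse_prev = attend(q, k_prev, v_prev, causal=False)  # 3. queries may see every earlier key
    o_cur, lse_cur = attend(q, k_cur, v_cur, causal=True)       # 4. current chunk attends to itself causally
    o, lse = merge(o_prev, lse_prev, o_cur, lse_cur)            # 5. online-softmax (LSE) merge
    offload(layer_id, k_cur, v_cur)                             # 6. current chunk's KV -> CPU pool
    return o
```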
### Chunked Decode Flow (Double Buffering)
**File**: `nanovllm/layers/attention.py` - `_chunked_decode_attention()`
```
Timeline (async double buffering):
          ┌─────────────┐   ┌─────────────┐   ┌─────────────┐
Load:     │C0 → Compute │   │C1 → Prefetch│   │C2 → Compute │
          └─────────────┘   └─────────────┘   └─────────────┘
                  ↘                 ↘                 ↘
Compute:         [C0]              [C1]              [C2]
1. Pre-load first chunk to compute region
2. Wait for current buffer, trigger async prefetch of next chunk to OTHER buffer
3. Compute attention, merge results
4. Swap buffers, repeat
5. Finally attend to decode slot (new token's KV)
```
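The same timeline as a sketch for one layer; `attend(q, region)` returning `(output, lse)` and the region-swapping bookkeeping are illustrative — only the four per-layer load/wait methods and the decode slot come from this doc:
```python
def chunked_decode_attention(q, layer_id, engine, cpu_chunks, attend, merge):
    # Double-buffered decode for one layer (steps 1-5 above).
    if not cpu_chunks:
        return attend(q, "decode")[0]
    regions = ("compute", "prefetch")
    load = {"compute": engine.load_to_compute_layer, "prefetch": engine.load_to_prefetch_layer}
    wait = {"compute": engine.wait_compute_layer, "prefetch": engine.wait_prefetch_layer}

    load["compute"](layer_id, cpu_chunks[0])               # 1. pre-load first chunk
    o, lse = None, None
    for i, _ in enumerate(cpu_chunks):
        cur, nxt = regions[i % 2], regions[(i + 1) % 2]
        wait[cur](layer_id)                                # 2. current buffer ready...
        if i + 1 < len(cpu_chunks):
            load[nxt](layer_id, cpu_chunks[i + 1])         #    ...async prefetch into the OTHER buffer
        o_i, lse_i = attend(q, cur)                        # 3. attention over chunk i
        o, lse = (o_i, lse_i) if o is None else merge(o, lse, o_i, lse_i)
        # 4. the i % 2 alternation is the buffer swap
    o_new, lse_new = attend(q, "decode")                   # 5. finally, the new token's KV in the decode slot
    return merge(o, lse, o_new, lse_new)[0]
```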
### HybridKVCacheManager
**File**: `nanovllm/kvcache/hybrid_manager.py`
Manages both GPU and CPU blocks:
- `allocate()`: Allocate GPU block first, fallback to CPU
- `allocate_cpu_only()`: Force CPU allocation (for chunked prefill)
- `get_all_cpu_blocks(seq)`: Get all CPU block IDs for a sequence
- `get_prefilled_cpu_blocks(seq)`: Get CPU blocks from previous chunks
- `may_offload()`: Offload GPU blocks to CPU when decode slot fills
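The allocation policy in miniature (GPU first, CPU fallback, CPU-only for chunked prefill); the free-list representation is illustrative:
```python
class ToyHybridManager:
    """GPU-first allocation with CPU fallback, mirroring allocate()/allocate_cpu_only()."""
    def __init__(self, num_gpu_blocks, num_cpu_blocks):
        self.free_gpu = list(range(num_gpu_blocks))
        self.free_cpu = list(range(num_cpu_blocks))

    def allocate(self):
        # Prefer a GPU block; fall back to CPU when the GPU pool is exhausted.
        if self.free_gpu:
            return ("gpu", self.free_gpu.pop())
        return ("cpu", self.free_cpu.pop())

    def allocate_cpu_only(self):
        # Chunked prefill writes straight to CPU so the small GPU buffer stays free.
        return ("cpu", self.free_cpu.pop())
```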
### Online Softmax Merge
**File**: `nanovllm/kvcache/chunked_attention.py`
When computing attention across multiple chunks, results are merged using log-sum-exp:
```python
def merge_attention_outputs(o1, lse1, o2, lse2):
# Uses LSE to correctly weight and combine partial attention outputs
```
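A reference sketch of the merge math (numerically stable log-sum-exp weighting); the actual `merge_attention_outputs` may differ in tensor layout:
```python
import torch

def merge_attention_outputs_sketch(o1, lse1, o2, lse2):
    # o*: [..., head_dim]; lse*: matching leading dims, the log of each partial
    # softmax denominator. The two partial results cover disjoint key sets.
    max_lse = torch.maximum(lse1, lse2)
    w1 = torch.exp(lse1 - max_lse)           # renormalized weight of partial result 1
    w2 = torch.exp(lse2 - max_lse)           # renormalized weight of partial result 2
    denom = w1 + w2
    o = (o1 * w1.unsqueeze(-1) + o2 * w2.unsqueeze(-1)) / denom.unsqueeze(-1)
    lse = max_lse + torch.log(denom)         # combined LSE, reusable for further merges
    return o, lse
```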
### Ring Buffer Design (Future Optimization)
Current double-buffering limits pipeline depth. Planned improvement:
- Unified ring buffer using all GPU slots (except decode)
- Per-slot per-layer CUDA events for fine-grained sync
- Deeper pipeline: prefetch N-1 blocks ahead (vs 1 chunk)
## Config Defaults
- `max_num_batched_tokens`: 16384
- `max_num_seqs`: 512
- `kvcache_block_size`: 256
- `gpu_memory_utilization`: 0.9
- `enforce_eager`: False (enables CUDA graphs)
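The same defaults as a dataclass sketch; field names follow this doc, and the real `Config` may define additional fields:
```python
from dataclasses import dataclass

@dataclass
class ToyConfig:
    max_num_batched_tokens: int = 16384
    max_num_seqs: int = 512
    kvcache_block_size: int = 256
    gpu_memory_utilization: float = 0.9
    enforce_eager: bool = False        # False -> decode runs through captured CUDA graphs
    enable_cpu_offload: bool = False   # True -> KV cache lives on CPU (see above)
```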
## Testing CPU Offload
```bash
# Basic test with limited GPU blocks to trigger offload
CUDA_VISIBLE_DEVICES=4,5 python tests/test_chunked_attention.py 6 2048 64 2
# Verify consistency (run multiple times, output should be identical)
for i in 1 2 3; do
  CUDA_VISIBLE_DEVICES=4,5 python tests/test_chunked_attention.py 6 2048 32 2 2>&1 | tail -3
done
```