[docs] refactor CLAUDE.md.

Zijie Tian
2025-12-15 21:43:33 +08:00
parent dc7807a211
commit 8df0c7517b
2 changed files with 70 additions and 2 deletions


@@ -21,6 +21,6 @@ python bench_offload.py # CPU offload benchmark
- `max_num_batched_tokens`: 16384
- `max_num_seqs`: 512
- `kvcache_block_size`: 4096
- `gpu_memory_utilization`: 0.9
- `enforce_eager`: False (enables CUDA graphs; usage sketch below)
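These defaults map directly onto engine construction. A minimal usage sketch, assuming the `LLM` constructor forwards these keyword arguments to the engine config as in upstream nano-vllm (the model path is a placeholder):

```python
from nanovllm import LLM, SamplingParams

# Sketch only: assumes LLM forwards these kwargs to the engine config,
# as in upstream nano-vllm; the model path is a placeholder.
llm = LLM(
    "/path/to/model",
    max_num_batched_tokens=16384,
    max_num_seqs=512,
    kvcache_block_size=4096,       # larger blocks mean fewer, larger H2D transfers
    gpu_memory_utilization=0.9,
    enforce_eager=False,           # keep CUDA graphs enabled
)
outputs = llm.generate(["Hello"], SamplingParams(temperature=0.6, max_tokens=32))
```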


@@ -24,7 +24,7 @@ Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline L
**BlockManager** (`nanovllm/engine/block_manager.py`):
- Paged attention block allocation with prefix caching via xxhash
- Blocks are 4096 tokens by default (configurable via `kvcache_block_size`)

### Model & Attention
@@ -85,12 +85,40 @@ offload_slot_to_cpu(slot_idx, cpu_block_id) # Async offload to CPU
Each slot has per-layer CUDA events for fine-grained synchronization (see the sketch below):
- `ring_slot_ready[slot_idx][layer_id]`: H2D transfer completion
- `ring_slot_offload_done[slot_idx][layer_id]`: D2H transfer completion
- `ring_slot_compute_done[slot_idx][layer_id]`: Attention compute completion (for safe buffer reuse)

This enables:
1. Overlapped H2D transfer with attention computation
2. Each layer independently waits for its own data
3. Pipeline depth = N-1 for prefill (N slots, 1 for writing)
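A minimal sketch of how such per-slot, per-layer events can be wired up with `torch.cuda.Event`. The event and function names mirror the ones above; the sizes, stream setup, and buffer arguments are illustrative assumptions, not the repo's exact code:

```python
import torch

num_slots, num_layers = 4, 32                      # illustrative sizes
h2d_stream = torch.cuda.Stream()                   # side stream for async H2D copies

ring_slot_ready = [[torch.cuda.Event() for _ in range(num_layers)] for _ in range(num_slots)]
ring_slot_compute_done = [[torch.cuda.Event() for _ in range(num_layers)] for _ in range(num_slots)]

def load_to_slot_layer(slot, layer, gpu_buf, cpu_buf):
    """Async H2D copy of one layer's KV chunk into a ring slot, gated by compute_done."""
    with torch.cuda.stream(h2d_stream):
        # Do not overwrite the slot until the attention pass that last read it has finished.
        ring_slot_compute_done[slot][layer].wait(h2d_stream)
        gpu_buf.copy_(cpu_buf, non_blocking=True)  # cpu_buf must be pinned for a truly async copy
        ring_slot_ready[slot][layer].record(h2d_stream)

def wait_slot_layer(slot, layer):
    """Make the compute stream wait until this slot/layer's data is resident on the GPU."""
    ring_slot_ready[slot][layer].wait(torch.cuda.current_stream())
```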
### Async Pipeline with Double Buffering
**File**: `nanovllm/layers/attention.py` - `_ring_buffer_pipeline_load()`
The async pipeline uses double buffering with `compute_done` events to prevent data races:
```python
# Synchronization flow for the safe async pipeline:
#   1. load_to_slot_layer() waits for compute_done[slot] before overwriting
#   2. wait_slot_layer() waits for slot_ready[slot] before reading
#   3. After flash_attn, record_slot_compute_done(slot) allows the next load
#
# Timeline with 2 slots (A, B):
#   Load B0→A
#               Load B1→B     Load B2→A   ...
#   Compute(A)                Compute(B)  ...
```
**Key**: `load_to_slot_layer` internally waits for `compute_done` before starting the transfer, preventing the data race where freshly loaded data overwrites data that has not yet been read.
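Putting the helpers together, a minimal double-buffering loop for one layer might look like the sketch below. `attend` stands in for the real flash-attention call; the block iteration and buffer handling are assumptions, not the repo's `_ring_buffer_pipeline_load()` itself:

```python
def pipelined_prefill_layer(layer, cpu_blocks, slot_bufs, attend):
    """Double-buffered load/compute loop for one layer, using the event helpers sketched above.

    cpu_blocks: pinned CPU KV tensors; slot_bufs: GPU ring-slot buffers;
    attend: callable that runs flash-attention over one resident KV chunk.
    """
    num_slots = len(slot_bufs)                                   # 2 slots = classic double buffering
    load_to_slot_layer(0, layer, slot_bufs[0], cpu_blocks[0])    # prime the pipeline
    for i in range(len(cpu_blocks)):
        slot = i % num_slots
        if i + 1 < len(cpu_blocks):                              # prefetch the next block into the other slot
            nxt = (i + 1) % num_slots
            load_to_slot_layer(nxt, layer, slot_bufs[nxt], cpu_blocks[i + 1])
        wait_slot_layer(slot, layer)                             # wait for this slot's H2D transfer
        attend(slot_bufs[slot])                                  # compute on the resident KV chunk
        ring_slot_compute_done[slot][layer].record()             # slot may now be safely overwritten
```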
### Chunked Prefill Flow (Ring Buffer Pipeline)
**File**: `nanovllm/layers/attention.py` - `_chunked_prefill_attention()`
@@ -163,7 +191,47 @@ def merge_attention_outputs(o1, lse1, o2, lse2):
# Uses LSE to correctly weight and combine partial attention outputs
```
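The merge itself is standard online-softmax algebra. A standalone sketch of the function named in the hunk above; the tensor shapes in the docstring are assumptions, not necessarily the repo's exact layout:

```python
import torch

def merge_attention_outputs(o1, lse1, o2, lse2):
    """Combine partial attention over two disjoint KV chunks via their log-sum-exp.

    Assumed shapes: o*   = [batch, heads, seqlen, head_dim]
                    lse* = [batch, heads, seqlen]
    """
    lse = torch.logaddexp(lse1, lse2)           # combined normalizer over both chunks
    w1 = torch.exp(lse1 - lse).unsqueeze(-1)    # weight of chunk 1 in the merged softmax
    w2 = torch.exp(lse2 - lse).unsqueeze(-1)    # weight of chunk 2
    return w1 * o1 + w2 * o2, lse
```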
### Flash Attention with LSE
**File**: `nanovllm/kvcache/chunked_attention.py` - `flash_attn_with_lse()`
Uses the native `flash_attn_func` with `return_attn_probs=True` to get the LSE output (wrapper sketch below). This:
- Natively supports GQA (no memory overhead for head replication)
- Avoids `repeat_interleave` which would copy K/V heads (40MB+ per call)
- Returns `(output, lse)` for online softmax merging
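A hedged sketch of such a wrapper; the exact tuple returned by `flash_attn_func` with `return_attn_probs=True` can vary across flash-attn versions, hence the defensive unpacking (the shapes in the docstring are assumptions):

```python
from flash_attn import flash_attn_func

def flash_attn_with_lse(q, k, v, causal=True, softmax_scale=None):
    """Run flash attention and return (output, lse).

    Assumed shapes: q    = [batch, seqlen_q, heads_q, head_dim]
                    k, v = [batch, seqlen_k, heads_kv, head_dim]  (GQA handled natively)
    """
    result = flash_attn_func(
        q, k, v,
        softmax_scale=softmax_scale,
        causal=causal,
        return_attn_probs=True,      # also returns the softmax LSE
    )
    out, lse = result[0], result[1]  # drop any trailing debug tensor (e.g. S_dmask)
    return out, lse
```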
### Pipeline Depth
- **Prefill**: Pipeline depth = N-1 (where N = num_gpu_blocks)
- **Decode**: Pipeline depth = (N-1)/2 (double buffering within decode_load_slots); worked example below
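For concreteness, assuming the decode depth rounds down (the specific N is illustrative):

```python
num_gpu_blocks = 8                           # N, illustrative
prefill_depth = num_gpu_blocks - 1           # 7: one slot is always reserved for writing
decode_depth = (num_gpu_blocks - 1) // 2     # 3: double buffering within decode_load_slots
```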
## Performance Optimizations
### Warmup Model Optimization
**File**: `nanovllm/engine/model_runner.py` - `warmup_model()`
Warmup uses a short sequence length (`block_size * 2`) instead of `max_model_len` (sketch below):
- Avoids huge intermediate activation memory allocation
- 8192 tokens (2 × the default 4096-token block) are enough to trigger CUDA kernel JIT compilation
- Prevents OOM during initialization for long-context configs (256K+)
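A sketch of the sizing idea only; `run_prefill` stands in for the runner's real prefill entry point and is an assumption, not the repo's API:

```python
import torch

def warmup(run_prefill, kvcache_block_size=4096):
    """Warm up kernels with a short dummy prefill instead of a max_model_len one (sketch)."""
    torch.cuda.empty_cache()
    warmup_len = kvcache_block_size * 2      # 8192 tokens rather than a 256K+ max_model_len
    run_prefill([0] * warmup_len)            # dummy token ids suffice to trigger JIT / autotuning
    torch.cuda.empty_cache()                 # drop warmup activations before sizing the KV cache
```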
### Memory Considerations
**GQA Head Replication**: The chunked attention uses native `flash_attn_func` which handles GQA internally without memory overhead. Previous implementation used `repeat_interleave` which copied K/V heads, adding ~40MB per attention call.
**Block Size Trade-off**:
- Larger block_size (4096) = fewer H2D transfers, better throughput
- Smaller block_size (256) = finer granularity, less wasted memory
- Current default: 4096 tokens per block (back-of-the-envelope sizing below)
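To make the per-block cost concrete, a back-of-the-envelope calculation; the model dimensions below are illustrative assumptions, not tied to a specific checkpoint:

```python
def kv_block_bytes(block_size, num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache held by one block across all layers (2x for K and V)."""
    return 2 * num_layers * block_size * num_kv_heads * head_dim * dtype_bytes

# Example: 32 layers, 8 KV heads, head_dim 128, fp16 (illustrative numbers only)
print(kv_block_bytes(4096, 32, 8, 128) / 2**20)   # ~512 MiB per 4096-token block
print(kv_block_bytes(256, 32, 8, 128) / 2**20)    # ~32 MiB per 256-token block
```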
## Configuration Defaults
| Parameter | Default | Description |
|-----------|---------|-------------|
| `kvcache_block_size` | 4096 | Tokens per KV cache block |
| `max_num_batched_tokens` | 16384 | Max tokens per batch |
| `max_num_seqs` | 512 | Max concurrent sequences |
| `gpu_memory_utilization` | 0.9 | GPU memory fraction for KV cache |
| `enforce_eager` | False | Disable CUDA graphs if True |
| `num_prefetch_blocks` | 2 | Ring buffer pipeline depth (deprecated, uses num_gpu_blocks) |