[docs] refactor CLAUDE.md.

Zijie Tian
2025-12-15 21:43:33 +08:00
parent dc7807a211
commit 8df0c7517b
2 changed files with 70 additions and 2 deletions


@@ -21,6 +21,6 @@ python bench_offload.py # CPU offload benchmark
- `max_num_batched_tokens`: 16384
- `max_num_seqs`: 512
- `kvcache_block_size`: 4096
- `gpu_memory_utilization`: 0.9
- `enforce_eager`: False (enables CUDA graphs; usage sketch below)
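These defaults map directly onto engine construction. A minimal usage sketch, assuming the `LLM` constructor forwards these keyword arguments to the engine config as in upstream nano-vllm (the model path is a placeholder):

```python
from nanovllm import LLM, SamplingParams

# Sketch only: assumes LLM forwards these kwargs to the engine config,
# as in upstream nano-vllm; the model path is a placeholder.
llm = LLM(
    "/path/to/model",
    max_num_batched_tokens=16384,
    max_num_seqs=512,
    kvcache_block_size=4096,       # larger blocks mean fewer, larger H2D transfers
    gpu_memory_utilization=0.9,
    enforce_eager=False,           # keep CUDA graphs enabled
)
outputs = llm.generate(["Hello"], SamplingParams(temperature=0.6, max_tokens=32))
```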


@@ -24,7 +24,7 @@ Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline L
**BlockManager** (`nanovllm/engine/block_manager.py`):
- Paged attention block allocation with prefix caching via xxhash
- Blocks are 4096 tokens by default (configurable via `kvcache_block_size`)

### Model & Attention
@@ -85,12 +85,40 @@ offload_slot_to_cpu(slot_idx, cpu_block_id) # Async offload to CPU
Each slot has per-layer CUDA events for fine-grained synchronization (see the sketch below):
- `ring_slot_ready[slot_idx][layer_id]`: H2D transfer completion
- `ring_slot_offload_done[slot_idx][layer_id]`: D2H transfer completion
- `ring_slot_compute_done[slot_idx][layer_id]`: Attention compute completion (for safe buffer reuse)

This enables:
1. Overlapped H2D transfer with attention computation
2. Each layer independently waits for its own data
3. Pipeline depth = N-1 for prefill (N slots, 1 for writing)
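A minimal sketch of how such per-slot, per-layer events can be wired up with `torch.cuda.Event`. The event and function names mirror the ones above; the sizes, stream setup, and buffer arguments are illustrative assumptions, not the repo's exact code:

```python
import torch

num_slots, num_layers = 4, 32                      # illustrative sizes
h2d_stream = torch.cuda.Stream()                   # side stream for async H2D copies

ring_slot_ready = [[torch.cuda.Event() for _ in range(num_layers)] for _ in range(num_slots)]
ring_slot_compute_done = [[torch.cuda.Event() for _ in range(num_layers)] for _ in range(num_slots)]

def load_to_slot_layer(slot, layer, gpu_buf, cpu_buf):
    """Async H2D copy of one layer's KV chunk into a ring slot, gated by compute_done."""
    with torch.cuda.stream(h2d_stream):
        # Do not overwrite the slot until the attention pass that last read it has finished.
        ring_slot_compute_done[slot][layer].wait(h2d_stream)
        gpu_buf.copy_(cpu_buf, non_blocking=True)  # cpu_buf must be pinned for a truly async copy
        ring_slot_ready[slot][layer].record(h2d_stream)

def wait_slot_layer(slot, layer):
    """Make the compute stream wait until this slot/layer's data is resident on the GPU."""
    ring_slot_ready[slot][layer].wait(torch.cuda.current_stream())
```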
### Async Pipeline with Double Buffering
**File**: `nanovllm/layers/attention.py` - `_ring_buffer_pipeline_load()`
The async pipeline uses double buffering with `compute_done` events to prevent data races:
```python
# Synchronization flow for the safe async pipeline:
#   1. load_to_slot_layer() waits for compute_done[slot] before overwriting
#   2. wait_slot_layer() waits for slot_ready[slot] before reading
#   3. After flash_attn, record_slot_compute_done(slot) allows the next load
#
# Timeline with 2 slots (A, B):
#   Load B0→A
#               Load B1→B     Load B2→A   ...
#   Compute(A)                Compute(B)  ...
```
**Key**: `load_to_slot_layer` internally waits for `compute_done` before starting the transfer, preventing the data race where freshly loaded data overwrites data that has not yet been read.
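Putting the helpers together, a minimal double-buffering loop for one layer might look like the sketch below. `attend` stands in for the real flash-attention call; the block iteration and buffer handling are assumptions, not the repo's `_ring_buffer_pipeline_load()` itself:

```python
def pipelined_prefill_layer(layer, cpu_blocks, slot_bufs, attend):
    """Double-buffered load/compute loop for one layer, using the event helpers sketched above.

    cpu_blocks: pinned CPU KV tensors; slot_bufs: GPU ring-slot buffers;
    attend: callable that runs flash-attention over one resident KV chunk.
    """
    num_slots = len(slot_bufs)                                   # 2 slots = classic double buffering
    load_to_slot_layer(0, layer, slot_bufs[0], cpu_blocks[0])    # prime the pipeline
    for i in range(len(cpu_blocks)):
        slot = i % num_slots
        if i + 1 < len(cpu_blocks):                              # prefetch the next block into the other slot
            nxt = (i + 1) % num_slots
            load_to_slot_layer(nxt, layer, slot_bufs[nxt], cpu_blocks[i + 1])
        wait_slot_layer(slot, layer)                             # wait for this slot's H2D transfer
        attend(slot_bufs[slot])                                  # compute on the resident KV chunk
        ring_slot_compute_done[slot][layer].record()             # slot may now be safely overwritten
```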
### Chunked Prefill Flow (Ring Buffer Pipeline)
**File**: `nanovllm/layers/attention.py` - `_chunked_prefill_attention()`
@@ -163,7 +191,47 @@ def merge_attention_outputs(o1, lse1, o2, lse2):
# Uses LSE to correctly weight and combine partial attention outputs
```
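The merge itself is standard online-softmax algebra. A standalone sketch of the function named in the hunk above; the tensor shapes in the docstring are assumptions, not necessarily the repo's exact layout:

```python
import torch

def merge_attention_outputs(o1, lse1, o2, lse2):
    """Combine partial attention over two disjoint KV chunks via their log-sum-exp.

    Assumed shapes: o*   = [batch, heads, seqlen, head_dim]
                    lse* = [batch, heads, seqlen]
    """
    lse = torch.logaddexp(lse1, lse2)           # combined normalizer over both chunks
    w1 = torch.exp(lse1 - lse).unsqueeze(-1)    # weight of chunk 1 in the merged softmax
    w2 = torch.exp(lse2 - lse).unsqueeze(-1)    # weight of chunk 2
    return w1 * o1 + w2 * o2, lse
```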
### Flash Attention with LSE
**File**: `nanovllm/kvcache/chunked_attention.py` - `flash_attn_with_lse()`
Uses the native `flash_attn_func` with `return_attn_probs=True` to get the LSE output (wrapper sketch below). This:
- Natively supports GQA (no memory overhead for head replication)
- Avoids `repeat_interleave` which would copy K/V heads (40MB+ per call)
- Returns `(output, lse)` for online softmax merging
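A hedged sketch of such a wrapper; the exact tuple returned by `flash_attn_func` with `return_attn_probs=True` can vary across flash-attn versions, hence the defensive unpacking (the shapes in the docstring are assumptions):

```python
from flash_attn import flash_attn_func

def flash_attn_with_lse(q, k, v, causal=True, softmax_scale=None):
    """Run flash attention and return (output, lse).

    Assumed shapes: q    = [batch, seqlen_q, heads_q, head_dim]
                    k, v = [batch, seqlen_k, heads_kv, head_dim]  (GQA handled natively)
    """
    result = flash_attn_func(
        q, k, v,
        softmax_scale=softmax_scale,
        causal=causal,
        return_attn_probs=True,      # also returns the softmax LSE
    )
    out, lse = result[0], result[1]  # drop any trailing debug tensor (e.g. S_dmask)
    return out, lse
```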
### Pipeline Depth
- **Prefill**: Pipeline depth = N-1 (where N = num_gpu_blocks)
- **Decode**: Pipeline depth = (N-1)/2 (double buffering within decode_load_slots); worked example below
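For concreteness, assuming the decode depth rounds down (the specific N is illustrative):

```python
num_gpu_blocks = 8                           # N, illustrative
prefill_depth = num_gpu_blocks - 1           # 7: one slot is always reserved for writing
decode_depth = (num_gpu_blocks - 1) // 2     # 3: double buffering within decode_load_slots
```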
## Performance Optimizations
### Warmup Model Optimization
**File**: `nanovllm/engine/model_runner.py` - `warmup_model()`
Warmup uses a short sequence length (`block_size * 2`) instead of `max_model_len` (sketch below):
- Avoids huge intermediate activation memory allocation
- 8192 tokens (2 × the default 4096-token block) are enough to trigger CUDA kernel JIT compilation
- Prevents OOM during initialization for long-context configs (256K+)
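A sketch of the sizing idea only; `run_prefill` stands in for the runner's real prefill entry point and is an assumption, not the repo's API:

```python
import torch

def warmup(run_prefill, kvcache_block_size=4096):
    """Warm up kernels with a short dummy prefill instead of a max_model_len one (sketch)."""
    torch.cuda.empty_cache()
    warmup_len = kvcache_block_size * 2      # 8192 tokens rather than a 256K+ max_model_len
    run_prefill([0] * warmup_len)            # dummy token ids suffice to trigger JIT / autotuning
    torch.cuda.empty_cache()                 # drop warmup activations before sizing the KV cache
```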
### Memory Considerations
**GQA Head Replication**: The chunked attention uses native `flash_attn_func` which handles GQA internally without memory overhead. Previous implementation used `repeat_interleave` which copied K/V heads, adding ~40MB per attention call.
**Block Size Trade-off**:
- Larger block_size (4096) = fewer H2D transfers, better throughput
- Smaller block_size (256) = finer granularity, less wasted memory
- Current default: 4096 tokens per block (back-of-the-envelope sizing below)
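To make the per-block cost concrete, a back-of-the-envelope calculation; the model dimensions below are illustrative assumptions, not tied to a specific checkpoint:

```python
def kv_block_bytes(block_size, num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache held by one block across all layers (2x for K and V)."""
    return 2 * num_layers * block_size * num_kv_heads * head_dim * dtype_bytes

# Example: 32 layers, 8 KV heads, head_dim 128, fp16 (illustrative numbers only)
print(kv_block_bytes(4096, 32, 8, 128) / 2**20)   # ~512 MiB per 4096-token block
print(kv_block_bytes(256, 32, 8, 128) / 2**20)    # ~32 MiB per 256-token block
```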
## Configuration Defaults
| Parameter | Default | Description |
|-----------|---------|-------------|
| `kvcache_block_size` | 4096 | Tokens per KV cache block |
| `max_num_batched_tokens` | 16384 | Max tokens per batch |
| `max_num_seqs` | 512 | Max concurrent sequences |
| `gpu_memory_utilization` | 0.9 | GPU memory fraction for KV cache |
| `enforce_eager` | False | Disable CUDA graphs if True |
| `num_prefetch_blocks` | 2 | Ring buffer pipeline depth (deprecated, uses num_gpu_blocks) |