From a37f07943cc8e8cf827f5b6654fb32d324edbb51 Mon Sep 17 00:00:00 2001
From: Zijie Tian
Date: Mon, 15 Dec 2025 00:13:27 +0800
Subject: [PATCH] [docs] Update the CLAUDE.md.

---
 CLAUDE.md | 158 +++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 132 insertions(+), 26 deletions(-)

diff --git a/CLAUDE.md b/CLAUDE.md
index adcd99b..b466f4c 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -15,8 +15,13 @@ pip install -e .
 # Run example
 python example.py

-# Run benchmark
-python bench.py
+# Run benchmarks
+python bench.py          # Standard benchmark
+python bench_offload.py  # CPU offload benchmark
+
+# Test chunked attention
+CUDA_VISIBLE_DEVICES=4,5 python tests/test_chunked_attention.py 6 2048 64 2
+# Args: num_gpu_blocks input_len output_len num_prefetch_blocks
 ```

 ## Architecture
@@ -25,49 +30,150 @@ python bench.py
 **LLMEngine** (`nanovllm/engine/llm_engine.py`):
 - Main entry point, wraps ModelRunner and Scheduler
-- Handles tokenization and multi-process tensor parallelism coordination
-- `generate()` method runs the prefill-decode loop until all sequences finish
+- `generate()` runs the prefill-decode loop until all sequences finish

 **ModelRunner** (`nanovllm/engine/model_runner.py`):
 - Loads model weights, allocates KV cache, captures CUDA graphs
-- Rank 0 is the main process; ranks 1+ run in separate processes via `loop()` waiting on shared memory events
-- `run()` prepares inputs and executes model forward pass
+- Rank 0 is the main process; ranks 1+ run `loop()` in separate processes, driven by shared memory events

 **Scheduler** (`nanovllm/engine/scheduler.py`):
 - Two-phase scheduling: prefill (waiting queue) then decode (running queue)
-- Handles preemption when memory is constrained by moving sequences back to waiting

 **BlockManager** (`nanovllm/engine/block_manager.py`):
 - Paged attention block allocation with prefix caching via xxhash
-- Blocks are 256 tokens by default, tracked with reference counting
+- Blocks are 256 tokens by default

-**Sequence** (`nanovllm/engine/sequence.py`):
-- Tracks token IDs, block table, and sampling parameters per request
-- Custom `__getstate__`/`__setstate__` for efficient pickling across processes
-
-### Model Implementation
-
-**Qwen3ForCausalLM** (`nanovllm/models/qwen3.py`):
-- Standard transformer: embedding → decoder layers → RMSNorm → LM head
-- Uses `packed_modules_mapping` for weight loading (q/k/v → qkv_proj, gate/up → gate_up_proj)
+### Model & Attention

 **Attention** (`nanovllm/layers/attention.py`):
-- Uses FlashAttention (`flash_attn_varlen_func` for prefill, `flash_attn_with_kvcache` for decode)
-- Custom Triton kernel `store_kvcache_kernel` for KV cache writes
+- FlashAttention: `flash_attn_varlen_func` (prefill), `flash_attn_with_kvcache` (decode)
+- Triton kernel `store_kvcache_kernel` for KV cache writes
+- Chunked attention methods: `_chunked_prefill_attention()`, `_chunked_decode_attention()`

-**Parallel Layers** (`nanovllm/layers/linear.py`, `embed_head.py`):
-- Tensor parallelism via column/row parallel linear layers with custom weight loaders
+**Global Context** (`nanovllm/utils/context.py`):
+- Stores attention metadata via `get_context()`/`set_context()`
+- Key fields: `cu_seqlens`, `slot_mapping`, `block_tables`, `chunked_seq`

-### Key Design Patterns
+## CPU Offload System

-- **Global Context**: `nanovllm/utils/context.py` stores attention metadata (cu_seqlens, slot_mapping, block_tables) accessed via `get_context()`/`set_context()`
-- **CUDA Graph Capture**: Decode phase uses captured graphs for batch sizes 1, 2, 4, 8, 16, 32... up to max_num_seqs (capped at 512)
-- **Shared Memory IPC**: Tensor parallel workers receive commands via pickled data in SharedMemory, synchronized with Events
+### Overview

-### Config Defaults
+When `enable_cpu_offload=True`, the KV cache is stored in CPU memory and a small GPU buffer is used for computation. This enables long-context inference with limited GPU memory.
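+
+A minimal usage sketch for enabling offload from Python. Hedged: it assumes
+`enable_cpu_offload` is accepted as an `LLM`/`Config` keyword argument and uses an
+illustrative model path; check `nanovllm/config.py` for the authoritative option names.
+
+```python
+from nanovllm import LLM, SamplingParams
+
+# Illustrative only: model path and kwargs are examples, not prescriptive defaults.
+llm = LLM("Qwen/Qwen3-0.6B", enforce_eager=True, enable_cpu_offload=True)
+outs = llm.generate(["Hello, my name is"], SamplingParams(temperature=0.6, max_tokens=32))
+print(outs[0]["text"])
+```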
+
+### Three-Region GPU Buffer Design
+
+```
+GPU Slots:   [0]        [1, 2, 3]     [4, 5]
+              ↑             ↑            ↑
+           decode        compute     prefetch
+          (1 slot)      (N slots)    (M slots)
+
+- Decode slot: New token's KV written here during decode
+- Compute region: Load CPU blocks for current chunk computation
+- Prefetch region: Async load next chunk while computing current
+```
+
+**File**: `nanovllm/kvcache/offload_engine.py`
+
+Key attributes:
+- `decode_slot = 0`: Fixed slot for decode KV writes
+- `compute_slots`: List of GPU slots for compute region
+- `prefetch_slots`: List of GPU slots for prefetch region
+- `k_cache_gpu/v_cache_gpu`: Shape `[num_layers, num_gpu_blocks, block_size, kv_heads, head_dim]`
+- `k_cache_cpu/v_cache_cpu`: Shape `[num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]` (pinned memory)
+
+### Per-Layer Loading (Critical Design)
+
+**Problem solved**: the original design had layer 0 load ALL layers' KV at once, so when layer 0 moved on to chunk 1 it overwrote chunk 0's data before layers 1+ had read it.
+
+**Solution**: each layer independently loads only its own KV data:
+```python
+# Per-layer methods in OffloadEngine
+load_to_compute_layer(layer_id, cpu_block_ids)   # Load single layer to compute region
+wait_compute_layer(layer_id)                     # Wait for layer's transfer
+load_to_prefetch_layer(layer_id, cpu_block_ids)  # Load single layer to prefetch region
+wait_prefetch_layer(layer_id)                    # Wait for layer's prefetch
+```
+
+### Chunked Prefill Flow
+
+**File**: `nanovllm/layers/attention.py` - `_chunked_prefill_attention()`
+
+```
+For each prefill chunk:
+1. Current chunk's KV is written to GPU (compute region slots)
+2. Load previous chunks' KV from CPU to prefetch region
+3. Compute attention against previous KV (no causal mask)
+4. Compute attention against current KV (causal mask)
+5. Merge results using online softmax (LSE)
+6. Offload current chunk's KV to CPU
+```
+
+**Important**: prefill loads previous chunks ONLY into the prefetch region, so they never conflict with the current chunk's KV, which is being written to the compute region.
+
+### Chunked Decode Flow (Double Buffering)
+
+**File**: `nanovllm/layers/attention.py` - `_chunked_decode_attention()`
+
+```
+Timeline (async double buffering):
+         ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
+Load:    │C0 → Compute │    │C1 → Prefetch│    │C2 → Compute │
+         └─────────────┘    └─────────────┘    └─────────────┘
+                ↘                  ↘                  ↘
+Compute:       [C0]               [C1]               [C2]
+
+1. Pre-load first chunk to compute region
+2. Wait for current buffer, trigger async prefetch of next chunk to OTHER buffer
+3. Compute attention, merge results
+4. Swap buffers, repeat
+5. Finally, attend to the decode slot (the new token's KV)
+```
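+
+A schematic sketch of this loop for a single layer (hedged: `attend_region` is a
+hypothetical stand-in for the FlashAttention call against one GPU region, and
+`merge_attention_outputs` is assumed to return the merged `(output, lse)` pair;
+the real code in `attention.py` may differ in its details):
+
+```python
+# Schematic only: mirrors the numbered steps above, not the exact implementation.
+from nanovllm.kvcache.chunked_attention import merge_attention_outputs
+
+def chunked_decode_layer(layer_id, q, cpu_chunks, engine, attend_region):
+    # Step 1: pre-load the first chunk into the compute region for this layer
+    engine.load_to_compute_layer(layer_id, cpu_chunks[0])
+    out = lse = None
+    regions = ["compute", "prefetch"]
+    for i in range(len(cpu_chunks)):
+        cur, nxt = regions[i % 2], regions[(i + 1) % 2]
+        # Step 2: wait for the current buffer, then kick off an async prefetch of
+        # the next chunk into the OTHER buffer so the copy overlaps with compute
+        if cur == "compute":
+            engine.wait_compute_layer(layer_id)
+        else:
+            engine.wait_prefetch_layer(layer_id)
+        if i + 1 < len(cpu_chunks):
+            if nxt == "compute":
+                engine.load_to_compute_layer(layer_id, cpu_chunks[i + 1])
+            else:
+                engine.load_to_prefetch_layer(layer_id, cpu_chunks[i + 1])
+        # Step 3: attention against this chunk's KV, merged into the running result via LSE
+        o_i, lse_i = attend_region(q, cur)
+        out, lse = (o_i, lse_i) if out is None else merge_attention_outputs(out, lse, o_i, lse_i)
+        # Step 4: buffers swap implicitly on the next iteration
+    # Step 5: finally attend to the decode slot holding the new token's KV
+    o_d, lse_d = attend_region(q, "decode")
+    out, _ = merge_attention_outputs(out, lse, o_d, lse_d)
+    return out
+```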
+
+### HybridKVCacheManager
+
+**File**: `nanovllm/kvcache/hybrid_manager.py`
+
+Manages both GPU and CPU blocks:
+- `allocate()`: Allocate a GPU block first, falling back to CPU
+- `allocate_cpu_only()`: Force CPU allocation (for chunked prefill)
+- `get_all_cpu_blocks(seq)`: Get all CPU block IDs for a sequence
+- `get_prefilled_cpu_blocks(seq)`: Get CPU blocks from previous chunks
+- `may_offload()`: Offload GPU blocks to CPU when the decode slot fills
+
+### Online Softmax Merge
+
+**File**: `nanovllm/kvcache/chunked_attention.py`
+
+When computing attention across multiple chunks, the partial results are merged using log-sum-exp:
+```python
+def merge_attention_outputs(o1, lse1, o2, lse2):
+    # Uses the LSE values to correctly weight and combine partial attention outputs
+    ...
+```
+
+### Ring Buffer Design (Future Optimization)
+
+The current double buffering limits pipeline depth. Planned improvements:
+- Unified ring buffer using all GPU slots (except decode)
+- Per-slot, per-layer CUDA events for fine-grained sync
+- Deeper pipeline: prefetch N-1 blocks ahead (vs. 1 chunk)
+
+## Config Defaults

 - `max_num_batched_tokens`: 16384
 - `max_num_seqs`: 512
 - `kvcache_block_size`: 256
 - `gpu_memory_utilization`: 0.9
 - `enforce_eager`: False (enables CUDA graphs)
+
+## Testing CPU Offload
+
+```bash
+# Basic test with limited GPU blocks to trigger offload
+CUDA_VISIBLE_DEVICES=4,5 python tests/test_chunked_attention.py 6 2048 64 2
+
+# Verify consistency (repeated runs should produce identical output)
+for i in 1 2 3; do
+  CUDA_VISIBLE_DEVICES=4,5 python tests/test_chunked_attention.py 6 2048 32 2 2>&1 | tail -3
+done
+```
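+
+A quick numerical sanity check of the LSE merge itself (a standalone sketch in plain
+PyTorch, independent of the repo's actual `merge_attention_outputs` implementation):
+
+```python
+# Verifies that merging per-chunk attention via log-sum-exp matches full attention.
+import torch
+
+def attn_chunk(q, k, v):
+    # q: [heads, dim]; k, v: [n, heads, dim] -> per-chunk output and log-sum-exp
+    s = torch.einsum("hd,nhd->hn", q, k) / q.shape[-1] ** 0.5
+    return torch.einsum("hn,nhd->hd", torch.softmax(s, -1), v), torch.logsumexp(s, -1)
+
+def merge(o1, lse1, o2, lse2):
+    # Reweight each partial output by its softmax denominator, then renormalize
+    m = torch.maximum(lse1, lse2)
+    w1, w2 = torch.exp(lse1 - m), torch.exp(lse2 - m)
+    return (o1 * w1[:, None] + o2 * w2[:, None]) / (w1 + w2)[:, None]
+
+q, k, v = torch.randn(4, 64), torch.randn(1024, 4, 64), torch.randn(1024, 4, 64)
+o_full, _ = attn_chunk(q, k, v)
+o1, l1 = attn_chunk(q, k[:512], v[:512])
+o2, l2 = attn_chunk(q, k[512:], v[512:])
+assert torch.allclose(merge(o1, l1, o2, l2), o_full, atol=1e-4)
+```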