# CLAUDE.md

This file provides guidance to Claude Code when working with this repository.

## Overview

Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline LLM inference. Supports Qwen3 models with CPU offload for long-context inference.

## GPU Mutex for Multi-Instance Debugging

**IMPORTANT**: When running multiple Claude instances for parallel debugging, only one GPU (cuda:0) is available. Before executing ANY command that uses the GPU (python scripts, benchmarks, tests), Claude MUST:

1. **Check GPU availability** by running:

```bash
nvidia-smi --query-compute-apps=pid,name,used_memory --format=csv,noheader
```

2. **If processes are running on GPU**:
- Wait and retry every 10 seconds until GPU is free
- Use this polling loop:

```bash
while [ -n "$(nvidia-smi --query-compute-apps=pid --format=csv,noheader)" ]; do
    echo "GPU busy, waiting 10s..."
    sleep 10
done
```

3. **Only proceed** when `nvidia-smi --query-compute-apps=pid --format=csv,noheader` returns empty output

**Example workflow**:

```bash
# First check if GPU is in use
nvidia-smi --query-compute-apps=pid,name,used_memory --format=csv,noheader

# If output is empty, proceed with your command
python bench_offload.py

# If output shows processes, wait until they finish
```

**Note**: This applies to ALL GPU operations including:
- Running tests (`python tests/test_*.py`)
- Running benchmarks (`python bench*.py`)
- Running examples (`python example.py`)
- Any script that imports torch/cuda

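The same check-and-wait logic can be wrapped in Python when a test harness needs it. This is a sketch only; `gpu_is_free` and `wait_for_gpu` are hypothetical helpers, not part of the repository:

```python
import subprocess
import time


def gpu_is_free(csv_output: str) -> bool:
    """True when the nvidia-smi compute-apps query printed nothing (no GPU processes)."""
    return csv_output.strip() == ""


def wait_for_gpu(poll_seconds: float = 10.0) -> None:
    """Block until nvidia-smi reports no compute processes on the GPU."""
    while True:
        result = subprocess.run(
            ["nvidia-smi", "--query-compute-apps=pid", "--format=csv,noheader"],
            capture_output=True, text=True,
        )
        if gpu_is_free(result.stdout):
            return
        print(f"GPU busy, waiting {poll_seconds:.0f}s...")
        time.sleep(poll_seconds)
```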
## Multi-Instance Development with PYTHONPATH

**IMPORTANT**: When running multiple Claude instances on different worktrees, do NOT use `pip install -e .` globally as it will affect other instances.

**Use PYTHONPATH directly** - no pip install needed:

```bash
# Set PYTHONPATH to point to the project root directory
PYTHONPATH=/path/to/your/worktree:$PYTHONPATH python <script.py>

# Example: running tests
PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH python tests/test_needle.py
```

**Benefits**:
- No `pip install` required
- Code changes take effect immediately (no reinstall needed)
- Each worktree is completely isolated

**For shell session** (optional):

```bash
export PYTHONPATH=/path/to/your/worktree:$PYTHONPATH
python tests/test_needle.py  # PYTHONPATH already set
```

## Sparse Attention

For sparse attention related content (block sparse attention, MInference, FlexPrefill, XAttention, AvgPool, etc.), refer to [`docs/sparse_attention_guide.md`](docs/sparse_attention_guide.md).

### Quest Sparse Policy

**Files**: `nanovllm/kvcache/sparse/quest.py`, `nanovllm/kvcache/sparse/policy.py`

Quest policy selects Top-K blocks based on query-key similarity bounds using min/max key metadata.

**Scoring Mechanism**:

```python
score_min = torch.einsum('hd,bhd->bh', q, key_min)  # [num_blocks, kv_heads]
score_max = torch.einsum('hd,bhd->bh', q, key_max)  # [num_blocks, kv_heads]
scores = torch.maximum(score_min, score_max).mean(dim=-1)  # [num_blocks] ← averaged!
```

**Critical Limitation - No Per-Head Scheduling**:

The `.mean(dim=-1)` averages scores across all heads, making a **unified** block selection for all heads:

```
Block A: head0 needs (+4), head1 doesn't (-4) → avg = 0  → NOT selected
Block B: head0 doesn't (-4), head1 needs (+4) → avg = 0  → NOT selected
Block C: both heads moderately need (+2, +2)  → avg = +2 → selected
```

**Why Per-Head Scheduling is Infeasible**:
1. **Memory Layout**: GPU cache stores all heads together `[block_size, kv_heads, head_dim]`
2. **FlashAttention**: Requires complete heads - partial heads cause dimension mismatch
3. **Block Granularity**: If any head needs a block, the entire block (all heads) must be loaded

**Policy Types**:
- `FullAttentionPolicy`: `supports_prefill=True, supports_decode=True` - loads all blocks
- `QuestPolicy`: `supports_prefill=False, supports_decode=True` - decode-only Top-K selection

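The scoring rule and its limitation can be reproduced in a few lines of plain Python. This is a sketch for intuition only (the real implementation in `quest.py` is tensorized); `quest_block_scores` and `topk_blocks` are illustrative names:

```python
def quest_block_scores(q, key_min, key_max):
    """Score each block by the head-averaged upper bound on q·k.

    q:       [kv_heads][head_dim]
    key_min: [num_blocks][kv_heads][head_dim] (elementwise min over the block's keys)
    key_max: same shape (elementwise max)
    """
    scores = []
    for b in range(len(key_min)):
        per_head = []
        for h in range(len(q)):
            s_min = sum(qd * kd for qd, kd in zip(q[h], key_min[b][h]))
            s_max = sum(qd * kd for qd, kd in zip(q[h], key_max[b][h]))
            per_head.append(max(s_min, s_max))        # bound on any q·k in block b
        scores.append(sum(per_head) / len(per_head))  # mean over heads → unified choice
    return scores


def topk_blocks(scores, k):
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]


# The limitation above, concretely: blocks A and B tie at 0 and lose to C.
q = [[1.0], [1.0]]           # 2 heads, head_dim = 1
meta = [[[4.0], [-4.0]],     # Block A: head0 +4, head1 -4
        [[-4.0], [4.0]],     # Block B: head0 -4, head1 +4
        [[2.0], [2.0]]]      # Block C: both heads +2
print(quest_block_scores(q, meta, meta))              # [0.0, 0.0, 2.0]
print(topk_blocks(quest_block_scores(q, meta, meta), 1))  # [2] → only Block C
```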
## Architecture

### Core Components

- **LLMEngine** (`llm_engine.py`): Main entry, runs prefill-decode loop
- **ModelRunner** (`model_runner.py`): Loads weights, allocates KV cache, CUDA graphs, layer-wise offload
- **Scheduler** (`scheduler.py`): Two-phase scheduling (prefill → decode)
- **BlockManager** (`block_manager.py`): Paged attention with prefix caching (xxhash), default block size 4096
- **Attention** (`layers/attention.py`): FlashAttention for standard inference

## PyTorch Hooks for Debugging

### Hook Positions in Qwen3

```
decoder_layer
├── input_layernorm (RMSNorm)
├── self_attn (Qwen3Attention)   ← Hook here for attention I/O after o_proj
│   ├── q_proj → q_norm → RoPE
│   ├── k_proj → k_norm → RoPE
│   ├── v_proj
│   ├── attn (Attention)         ← Hook here for Q/K/V tensors
│   │   └── FlashAttention / SDPA
│   └── o_proj
├── post_attention_layernorm (RMSNorm)
└── mlp (Qwen3MLP)
```

### Hook Types & Data Shapes

| Hook Position | Type | Captured Data |
|---------------|------|---------------|
| `self_attn` | post | `[batch, seq_len, hidden_size]` - after o_proj |
| `self_attn.attn` | pre | Q,K,V: `[seq_len, num_heads, head_dim]` - after RoPE |
| `self_attn.attn` | post | `[seq_len, num_heads, head_dim]` - before o_proj |

### Example: Capture Attention Outputs

```python
storage = {}

def make_hook(layer_id: int, storage: dict):
    def hook(module, inputs, output):
        if isinstance(output, tuple):
            attn_output = output[0]
        else:
            attn_output = output
        # nanovllm shape: [num_tokens, hidden_size] -> add batch dim
        if attn_output.dim() == 2:
            attn_output = attn_output.unsqueeze(0)
        storage[layer_id] = attn_output.detach().clone()
    return hook

# Register hooks
hooks = []
for layer_idx, layer in enumerate(model.model.layers):
    hooks.append(layer.self_attn.register_forward_hook(make_hook(layer_idx, storage)))

# Run inference...

# Cleanup
for hook in hooks:
    hook.remove()
```

### Reference Implementation

Key files:
- `tests/modeling_qwen3.py`: Reference Qwen3 implementation (torch + transformers only)
- `tests/test_needle_ref.py`: Reference needle test using custom Qwen3
- `tests/test_needle.py`: Needle-in-haystack test for nanovllm

### Common Pitfalls

1. **Shape mismatch**: nanovllm uses `[num_tokens, ...]` while torch uses `[batch, seq_len, ...]`
2. **Hook position**: `self_attn` captures after o_proj, `self_attn.attn` captures before o_proj
3. **Output format**: nanovllm returns tuple `(attn_output, None)`, handle with `output[0]`

## Layer-wise CPU Offload System

### Design Philosophy

Unlike chunked prefill (which processes chunks across all layers), **layer-wise offload** processes the entire sequence through one layer at a time:

```
Layer 0: [full sequence] → compute → offload K,V to CPU
Layer 1: [full sequence] → compute → offload K,V to CPU
...
Layer N: [full sequence] → compute → offload K,V to CPU
```

**Benefits**:
- Supports MInference sparse attention (requires full KV access per layer)
- Simpler memory management (one layer's KV in GPU at a time)
- Peak GPU memory = one layer's KV cache + attention workspace

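Quick arithmetic behind that peak-memory claim, as a sketch assuming Qwen3-4B-like numbers (36 layers, 8 KV heads, 128 head_dim, fp16):

```python
# Per-token, per-layer KV bytes: kv_heads * head_dim * 2 bytes (fp16) * 2 (K and V)
bytes_per_token_per_layer = 8 * 128 * 2 * 2   # 4096 B = 4 KB
num_layers = 36                               # assumed Qwen3-4B layer count
context = 256 * 1024                          # 256K-token context

one_layer_gib = context * bytes_per_token_per_layer / 2**30
all_layers_gib = num_layers * one_layer_gib
print(one_layer_gib, all_layers_gib)  # 1.0 vs 36.0 → why only one layer stays on GPU
```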
### Key Files

- `nanovllm/engine/model_runner.py`: Main implementation (`run_layerwise_offload_prefill`, `run_layerwise_offload_decode`)
- `nanovllm/kvcache/hybrid_manager.py`: CPU block management helpers
- `nanovllm/kvcache/offload_engine.py`: CPU/GPU cache storage

### Memory Layout

**CPU Cache** (pinned memory):

```python
k_cache_cpu: [num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]
v_cache_cpu: [num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]
```

**Per-layer KV size** (Qwen3-4B: 8 kv_heads × 128 head_dim × 2 bytes × 2 for K+V = 4KB/token):

| Context Length | KV per Layer |
|----------------|--------------|
| 128K tokens | 512 MB |
| 256K tokens | 1 GB |
| 512K tokens | 2 GB |
| 1M tokens | 4 GB |

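The table follows directly from the 4 KB/token figure; a quick check:

```python
# Reproduce the per-layer KV table: kv_heads * head_dim * fp16 bytes * (K+V)
bytes_per_token = 8 * 128 * 2 * 2  # = 4096 B = 4 KB/token/layer

for tokens_k in (128, 256, 512, 1024):           # context length in K tokens
    mb = tokens_k * 1024 * bytes_per_token / 2**20
    print(f"{tokens_k}K tokens → {mb:.0f} MB per layer")
# 128K → 512 MB, 256K → 1024 MB, 512K → 2048 MB, 1M → 4096 MB
```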
### Prefill Flow

```python
def run_layerwise_offload_prefill(self, seqs: list[Sequence]) -> list[int]:
    # 1. Embedding
    hidden_states = self.model.model.embed_tokens(input_ids)

    # 2. Process each layer
    for layer_id in range(num_layers):
        # QKV projection + norms + RoPE
        q = apply_rotary_pos_emb(q_proj(hidden_states), cos, sin)
        k = apply_rotary_pos_emb(k_proj(hidden_states), cos, sin)
        v = v_proj(hidden_states)

        # Full FlashAttention (entire sequence)
        attn_out = flash_attn_varlen_func(q, k, v, cu_seqlens, max_seqlen, causal=True)

        # MLP
        hidden_states = mlp(attn_out + residual)

        # Synchronous offload to CPU (CRITICAL: must be sync to avoid memory reuse bugs)
        self._offload_layer_kv_to_cpu_sync(layer_id, k, v, cpu_block_ids, total_tokens)

    # 3. Final norm + sampling
    return sampled_tokens
```

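The offload call at the end of the loop has to map the layer's flat K/V rows onto fixed-size CPU blocks. A sketch of that indexing (`block_slices` is an illustrative name, not the repository's API):

```python
def block_slices(total_tokens: int, block_size: int = 4096):
    """Return (block_idx, start, end) triples so k[start:end] fills CPU block block_idx."""
    slices = []
    num_blocks = (total_tokens + block_size - 1) // block_size  # ceil division
    for block_idx in range(num_blocks):
        start = block_idx * block_size
        end = min(start + block_size, total_tokens)  # last block may be partial
        slices.append((block_idx, start, end))
    return slices


print(block_slices(10000))  # [(0, 0, 4096), (1, 4096, 8192), (2, 8192, 10000)]
```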
### Decode Flow

```python
def run_layerwise_offload_decode(self, seqs: list[Sequence]) -> list[int]:
    # For each layer:
    for layer_id in range(num_layers):
        # 1. Load all prefilled KV from CPU
        for block_idx, cpu_block_id in enumerate(cpu_block_table):
            k_block = offload_engine.k_cache_cpu[layer_id, cpu_block_id, :valid_tokens].to("cuda")
            v_block = offload_engine.v_cache_cpu[layer_id, cpu_block_id, :valid_tokens].to("cuda")

        # 2. Compute new Q,K,V for current token
        q_new = apply_rotary_pos_emb(q_proj(hidden_states), cos, sin)
        k_new = apply_rotary_pos_emb(k_proj(hidden_states), cos, sin)
        v_new = v_proj(hidden_states)

        # 3. Concatenate and compute attention
        k_full = torch.cat([k_prefill, k_new], dim=0)
        v_full = torch.cat([v_prefill, v_new], dim=0)
        attn_out = flash_attn_varlen_func(q_new, k_full, v_full, ..., causal=False)
        # Note: causal=False because single query token should attend to ALL keys
```

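Conceptually, the decode step is ordinary unmasked attention with a single query over all loaded keys. A pure-Python sketch of that single-query attention (for intuition, not the FlashAttention kernel):

```python
import math


def single_query_attention(q, keys, values):
    """softmax(q·K) @ V for one query vector; no mask, every key participates."""
    scores = [sum(qd * kd for qd, kd in zip(q, k)) for k in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]  # numerically stable softmax
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


# Two keys with equal scores → output is the mean of their values
print(single_query_attention([1.0], [[0.0], [0.0]], [[1.0], [3.0]]))  # [2.0]
```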
### Critical Implementation Details

**1. Synchronous Offload Required**

Async offload with `non_blocking=True` causes memory reuse bugs:

```python
# BUG: PyTorch may reuse k,v GPU memory before async copy completes
offload_engine.k_cache_cpu[layer_id, block_id].copy_(k[start:end], non_blocking=True)

# CORRECT: Synchronous copy ensures data integrity
offload_engine.k_cache_cpu[layer_id, block_id, :size].copy_(k[start:end])  # sync
```

**2. Decode Attention: causal=False**

During decode, the single query token must attend to ALL keys (not just preceding ones):

```python
# Prefill: causal=True (each token only attends to previous tokens)
attn_out = flash_attn_varlen_func(..., causal=True)

# Decode: causal=False (query at position N attends to all N-1 prefill + itself)
attn_out = flash_attn_varlen_func(..., causal=False)
```

### Helper Methods in HybridKVCacheManager

```python
# Get all CPU blocks for a sequence
cpu_blocks = manager.get_all_cpu_blocks(seq)  # List[int]

# Get only prefilled (offloaded) CPU blocks
prefilled_blocks = manager.get_prefilled_cpu_blocks(seq)  # List[int]

# Get cached prefill length (doesn't change during decode)
prefill_len = manager.get_prefill_len(seq)  # int

# Get decode start position
decode_pos = manager.get_decode_start_pos(seq)  # int
```

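For orientation, the per-sequence state those helpers read can be sketched as follows. This is a hypothetical class for illustration only; the real bookkeeping lives in `nanovllm/kvcache/hybrid_manager.py`:

```python
class SeqOffloadState:
    """Illustrative per-sequence state behind the HybridKVCacheManager helpers."""

    def __init__(self, cpu_blocks, prefill_len, block_size=4096):
        self.cpu_blocks = list(cpu_blocks)  # all CPU block ids for the sequence
        self.prefill_len = prefill_len      # tokens offloaded during prefill (fixed)
        self.block_size = block_size

    def prefilled_cpu_blocks(self):
        # Only the blocks that actually hold prefill tokens
        n = (self.prefill_len + self.block_size - 1) // self.block_size
        return self.cpu_blocks[:n]

    def decode_start_pos(self):
        return self.prefill_len  # decode tokens begin right after the prefill


state = SeqOffloadState(cpu_blocks=[3, 7, 9], prefill_len=5000)
print(state.prefilled_cpu_blocks())  # [3, 7] — 5000 tokens span two 4096-token blocks
print(state.decode_start_pos())      # 5000
```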
## Configuration

| Parameter | Default | Notes |
|-----------|---------|-------|
| `kvcache_block_size` | 4096 | Tokens per block |
| `max_num_batched_tokens` | 16384 | Set = max_model_len for long context |
| `gpu_memory_utilization` | 0.9 | GPU memory fraction |
| `enable_cpu_offload` | False | Enable for long context |

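A typical long-context settings bundle, going by the table above (a sketch; verify the exact parameter names against nanovllm's `LLM` signature before relying on them):

```python
# Hypothetical kwargs for a 256K-token long-context run
long_context_kwargs = dict(
    kvcache_block_size=4096,
    max_num_batched_tokens=262144,  # set equal to max_model_len for long context
    gpu_memory_utilization=0.9,
    enable_cpu_offload=True,        # enable for long context
)
print(long_context_kwargs["max_num_batched_tokens"])  # 262144
```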
## Benchmarking

**Files**: `bench.py` (GPU), `bench_offload.py` (CPU offload), `bench_vllm.py` (comparison)

**Common Issues**:
1. `max_num_batched_tokens < max_model_len`: Set them equal for long context
2. CUDA graph dimension mismatch: Ensure `input_len + output_len <= max_model_len`
3. RoPE out of bounds: Check the model's `max_position_embeddings` in config.json

**Model Limits**:
- Qwen3-0.6B/4B: 40960 tokens
- Qwen2.5-7B-Instruct-1M: 1048576 tokens

**Performance (Qwen3-0.6B)**:
- GPU: ~18k tok/s (prefill), ~100 tok/s (decode)
- CPU Offload (16K): ~14k tok/s (prefill)
- CPU Offload (32K): ~13k tok/s (prefill)

---

**Author**: Zijie Tian