From 105201b902e2e03ed39faea48ad9e6a181e0b16a Mon Sep 17 00:00:00 2001
From: Zijie Tian
Date: Thu, 8 Jan 2026 21:19:38 +0800
Subject: [PATCH] [claudesquad] update from 'lw-offload-2' on 08 Jan 26 21:19 CST

---
 .claude/rules/doc-management.md           | 105 +++++++++
 .claude/rules/no-extra-docs.md            |  46 ++--
 CLAUDE.md                                 | 269 +---------------------
 docs/architecture_guide.md                | 189 +++++++++++++++
 docs/debugging_guide.md                   | 142 ++++++++++++
 docs/layerwise_offload_memory_analysis.md | 138 +++++++++++
 docs/sparse_attention_guide.md            |  39 ++++
 7 files changed, 649 insertions(+), 279 deletions(-)
 create mode 100644 .claude/rules/doc-management.md
 create mode 100644 docs/architecture_guide.md
 create mode 100644 docs/debugging_guide.md

diff --git a/.claude/rules/doc-management.md b/.claude/rules/doc-management.md
new file mode 100644
index 0000000..dd84110
--- /dev/null
+++ b/.claude/rules/doc-management.md
@@ -0,0 +1,105 @@
+# Documentation Management
+
+## CLAUDE.md Content Policy
+
+**CLAUDE.md should only contain operational requirements:**
+- Environment setup (PYTHONPATH, GPU mutex)
+- Execution requirements (how to run tests/benchmarks)
+- Quick configuration reference
+- Documentation index (links to detailed docs)
+
+**Technical details should go to docs/:**
+- Architecture and design explanations
+- Implementation details and code flows
+- Debugging techniques
+- Memory analysis and profiling
+- Algorithm explanations
+
+## When Adding New Technical Content
+
+Follow this workflow:
+
+### Step 1: Analyze and Document
+
+If doing technical analysis (e.g., memory profiling):
+1. Calculate theoretical values using formulas
+2. Run actual tests to measure real values
+3. Compare theoretical vs actual (expect < 10% error for valid models)
+4. Document findings with both theory and empirical validation
+
+### Step 2: Create/Update docs/
+
+Create a new doc or update existing one in `docs/`:
+```
+docs/
+├── architecture_guide.md                    # Core components, design, flows
+├── sparse_attention_guide.md                # Sparse attention methods
+├── layerwise_offload_memory_analysis.md     # Memory analysis
+├── debugging_guide.md                       # Debugging techniques
+└── _guide.md                                # New technical topic
+```
+
+### Step 3: Update CLAUDE.md Documentation Index
+
+Add entry to the Documentation Index table:
+```markdown
+| Document | Purpose |
+|----------|---------|
+| [`docs/new_doc.md`](docs/new_doc.md) | Brief description |
+```
+
+### Step 4: Refactor if Needed
+
+If CLAUDE.md grows too large (> 150 lines), refactor:
+1. Identify technical details that can be moved
+2. Create appropriate doc in docs/
+3. Replace detailed content with reference link
+4. Keep only operational essentials in CLAUDE.md
+
+## Documentation Structure Template
+
+For new technical docs:
+
+```markdown
+# Topic Guide
+
+Brief overview of what this document covers.
+
+## Section 1: Concepts
+- Key concepts and terminology
+
+## Section 2: Implementation
+- Code locations
+- Key methods/functions
+
+## Section 3: Details
+- Detailed explanations
+- Code examples
+
+## Section 4: Validation (if applicable)
+- Theoretical analysis
+- Empirical measurements
+- Comparison table
+```
+
+## Memory Analysis Template
+
+When documenting memory behavior:
+
+```markdown
+## Theoretical Calculation
+
+| Component | Formula | Size |
+|-----------|---------|------|
+| Buffer X | `param1 × param2 × dtype_size` | X MB |
+
+## Empirical Validation
+
+| Metric | Theoretical | Actual | Error |
+|--------|-------------|--------|-------|
+| Peak memory | X GB | Y GB | Z% |
+
+## Key Findings
+1. Finding 1
+2. Finding 2
+```
diff --git a/.claude/rules/no-extra-docs.md b/.claude/rules/no-extra-docs.md
index 87a806b..165f949 100644
--- a/.claude/rules/no-extra-docs.md
+++ b/.claude/rules/no-extra-docs.md
@@ -2,39 +2,47 @@
 
 ## Do Not Create Unnecessary Documentation
 
-**IMPORTANT**: Do NOT create extra markdown documentation files unless explicitly requested by the user.
+**IMPORTANT**: Do NOT create extra markdown documentation files proactively unless:
+1. User explicitly requests documentation
+2. Refactoring CLAUDE.md to move technical details to docs/ (see `doc-management.md`)
 
 ### What NOT to do:
 
-- ❌ Do NOT create README files proactively
-- ❌ Do NOT create analysis documents (*.md) after completing tasks
-- ❌ Do NOT create tutorial/guide documents
-- ❌ Do NOT create summary documents
+- Do NOT create README files proactively
+- Do NOT create standalone analysis documents after completing tasks
+- Do NOT create summary documents without request
 
 ### What TO do:
 
-- ✅ Only create documentation when user explicitly asks for it
-- ✅ Provide information directly in conversation instead
-- ✅ Update existing documentation if changes require it
-- ✅ Add inline code comments where necessary
+- Provide information directly in conversation by default
+- When user requests documentation, follow `doc-management.md` workflow
+- Update existing docs in `docs/` when code changes affect them
+- Keep CLAUDE.md concise (< 150 lines), move technical details to docs/
 
-### Exceptions:
+### Documentation Locations:
 
-Documentation is acceptable ONLY when:
-1. User explicitly requests "create a README" or "write documentation"
-2. Updating existing documentation to reflect code changes
-3. Adding inline comments/docstrings to code itself
+| Type | Location |
+|------|----------|
+| Operational requirements | CLAUDE.md |
+| Technical details | docs/*.md |
+| Code comments | Inline in source |
 
 ### Examples:
 
-**Bad** (Don't do this):
+**Proactive docs (Don't do)**:
 ```
 User: "Profile the code"
-Assistant: [Creates profiling_results.md after profiling]
+Assistant: [Creates profiling_results.md without being asked]
 ```
 
-**Good** (Do this instead):
+**On-request docs (Do this)**:
 ```
-User: "Profile the code"
-Assistant: [Runs profiling, shows results in conversation]
+User: "Profile the code and document the findings"
+Assistant: [Runs profiling, creates/updates docs/memory_analysis.md]
+```
+
+**Refactoring (Do this)**:
+```
+User: "CLAUDE.md is too long, refactor it"
+Assistant: [Moves technical sections to docs/, updates CLAUDE.md index]
 ```
diff --git a/CLAUDE.md b/CLAUDE.md
index b3514a6..986bc7c 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -27,17 +27,6 @@ Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline L
 
 3. **Only proceed** when `nvidia-smi --query-compute-apps=pid --format=csv,noheader` returns empty output
 
-**Example workflow**:
-```bash
-# First check if GPU is in use
-nvidia-smi --query-compute-apps=pid,name,used_memory --format=csv,noheader
-
-# If output is empty, proceed with your command
-python bench_offload.py
-
-# If output shows processes, wait until they finish
-```
-
 **Note**: This applies to ALL GPU operations including:
 - Running tests (`python tests/test_*.py`)
 - Running benchmarks (`python bench*.py`)
@@ -63,256 +52,14 @@ PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH python tests/test_needle.py
 - Code changes take effect immediately (no reinstall needed)
 - Each worktree is completely isolated
 
-**For shell session** (optional):
-```bash
-export PYTHONPATH=/path/to/your/worktree:$PYTHONPATH
-python tests/test_needle.py  # PYTHONPATH already set
-```
+## Documentation Index
 
-## Sparse Attention
-
-For sparse attention related content (block sparse attention, MInference, FlexPrefill, XAttention, AvgPool, etc.), refer to [`docs/sparse_attention_guide.md`](docs/sparse_attention_guide.md).
-
-### Quest Sparse Policy
-
-**Files**: `nanovllm/kvcache/sparse/quest.py`, `nanovllm/kvcache/sparse/policy.py`
-
-Quest policy selects Top-K blocks based on query-key similarity bounds using min/max key metadata.
-
-**Scoring Mechanism**:
-```python
-score_min = torch.einsum('hd,bhd->bh', q, key_min)  # [num_blocks, kv_heads]
-score_max = torch.einsum('hd,bhd->bh', q, key_max)  # [num_blocks, kv_heads]
-scores = torch.maximum(score_min, score_max).mean(dim=-1)  # [num_blocks] ← averaged!
-```
-
-**Critical Limitation - No Per-Head Scheduling**:
-
-The `.mean(dim=-1)` averages scores across all heads, making a **unified** block selection for all heads:
-
-```
-Block A: head0 needs (+4), head1 doesn't (-4) → avg = 0 → NOT selected
-Block B: head0 doesn't (-4), head1 needs (+4) → avg = 0 → NOT selected
-Block C: both heads moderately need (+2, +2) → avg = +2 → selected
-```
-
-**Why Per-Head Scheduling is Infeasible**:
-1. **Memory Layout**: GPU cache stores all heads together `[block_size, kv_heads, head_dim]`
-2. **FlashAttention**: Requires complete heads - partial heads cause dimension mismatch
-3. **Block Granularity**: If any head needs a block, the entire block (all heads) must be loaded
-
-**Policy Types**:
-- `FullAttentionPolicy`: `supports_prefill=True, supports_decode=True` - loads all blocks
-- `QuestPolicy`: `supports_prefill=False, supports_decode=True` - decode-only Top-K selection
-
-## Architecture
-
-### Core Components
-
-- **LLMEngine** (`llm_engine.py`): Main entry, runs prefill-decode loop
-- **ModelRunner** (`model_runner.py`): Loads weights, allocates KV cache, CUDA graphs, layer-wise offload
-- **Scheduler** (`scheduler.py`): Two-phase scheduling (prefill → decode)
-- **BlockManager** (`block_manager.py`): Paged attention with prefix caching (xxhash), default block size 4096
-- **Attention** (`layers/attention.py`): FlashAttention for standard inference
-
-## PyTorch Hooks for Debugging
-
-### Hook Positions in Qwen3
-
-```
-decoder_layer
-├── input_layernorm (RMSNorm)
-├── self_attn (Qwen3Attention)          ← Hook here for attention I/O after o_proj
-│   ├── q_proj → q_norm → RoPE
-│   ├── k_proj → k_norm → RoPE
-│   ├── v_proj
-│   ├── attn (Attention)                ← Hook here for Q/K/V tensors
-│   │   └── FlashAttention / SDPA
-│   └── o_proj
-├── post_attention_layernorm (RMSNorm)
-└── mlp (Qwen3MLP)
-```
-
-### Hook Types & Data Shapes
-
-| Hook Position | Type | Captured Data |
-|---------------|------|---------------|
-| `self_attn` | post | `[batch, seq_len, hidden_size]` - after o_proj |
-| `self_attn.attn` | pre | Q,K,V: `[seq_len, num_heads, head_dim]` - after RoPE |
-| `self_attn.attn` | post | `[seq_len, num_heads, head_dim]` - before o_proj |
-
-### Example: Capture Attention Outputs
-
-```python
-storage = {}
-
-def make_hook(layer_id: int, storage: dict):
-    def hook(module, inputs, output):
-        if isinstance(output, tuple):
-            attn_output = output[0]
-        else:
-            attn_output = output
-        # nanovllm shape: [num_tokens, hidden_size] -> add batch dim
-        if attn_output.dim() == 2:
-            attn_output = attn_output.unsqueeze(0)
-        storage[layer_id] = attn_output.detach().clone()
-    return hook
-
-# Register hooks
-hooks = []
-for layer_idx, layer in enumerate(model.model.layers):
-    hooks.append(layer.self_attn.register_forward_hook(make_hook(layer_idx, storage)))
-
-# Run inference...
-
-# Cleanup
-for hook in hooks:
-    hook.remove()
-```
-
-### Reference Implementation
-
-Key files:
-- `tests/modeling_qwen3.py`: Reference Qwen3 implementation (torch + transformers only)
-- `tests/test_needle_ref.py`: Reference needle test using custom Qwen3
-- `tests/test_needle.py`: Needle-in-haystack test for nanovllm
-
-### Common Pitfalls
-
-1. **Shape mismatch**: nanovllm uses `[num_tokens, ...]` while torch uses `[batch, seq_len, ...]`
-2. **Hook position**: `self_attn` captures after o_proj, `self_attn.attn` captures before o_proj
-3. **Output format**: nanovllm returns tuple `(attn_output, None)`, handle with `output[0]`
-
-## Layer-wise CPU Offload System
-
-### Design Philosophy
-
-Unlike chunked prefill (which processes chunks across all layers), **layer-wise offload** processes the entire sequence through one layer at a time:
-
-```
-Layer 0: [full sequence] → compute → offload K,V to CPU
-Layer 1: [full sequence] → compute → offload K,V to CPU
-...
-Layer N: [full sequence] → compute → offload K,V to CPU
-```
-
-**Benefits**:
-- Supports MInference sparse attention (requires full KV access per layer)
-- Simpler memory management (one layer's KV in GPU at a time)
-- Peak GPU memory = one layer's KV cache + attention workspace
-
-### Key Files
-
-- `nanovllm/engine/model_runner.py`: Main implementation (`run_layerwise_offload_prefill`, `run_layerwise_offload_decode`)
-- `nanovllm/kvcache/hybrid_manager.py`: CPU block management helpers
-- `nanovllm/kvcache/offload_engine.py`: CPU/GPU cache storage
-
-### Memory Layout
-
-**CPU Cache** (pinned memory):
-```python
-k_cache_cpu: [num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]
-v_cache_cpu: [num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]
-```
-
-**Per-layer KV size** (Qwen3-4B: 8 kv_heads × 128 head_dim × 2 bytes × 2 for K+V = 4KB/token):
-
-| Context Length | KV per Layer |
-|----------------|--------------|
-| 128K tokens | 512 MB |
-| 256K tokens | 1 GB |
-| 512K tokens | 2 GB |
-| 1M tokens | 4 GB |
-
-### Prefill Flow
-
-```python
-def run_layerwise_offload_prefill(self, seqs: list[Sequence]) -> list[int]:
-    # 1. Embedding
-    hidden_states = self.model.model.embed_tokens(input_ids)
-
-    # 2. Process each layer
-    for layer_id in range(num_layers):
-        # QKV projection + norms + RoPE
-        q = apply_rotary_pos_emb(q_proj(hidden_states), cos, sin)
-        k = apply_rotary_pos_emb(k_proj(hidden_states), cos, sin)
-        v = v_proj(hidden_states)
-
-        # Full FlashAttention (entire sequence)
-        attn_out = flash_attn_varlen_func(q, k, v, cu_seqlens, max_seqlen, causal=True)
-
-        # MLP
-        hidden_states = mlp(attn_out + residual)
-
-        # Synchronous offload to CPU (CRITICAL: must be sync to avoid memory reuse bugs)
-        self._offload_layer_kv_to_cpu_sync(layer_id, k, v, cpu_block_ids, total_tokens)
-
-    # 3. Final norm + sampling
-    return sampled_tokens
-```
-
-### Decode Flow
-
-```python
-def run_layerwise_offload_decode(self, seqs: list[Sequence]) -> list[int]:
-    # For each layer:
-    for layer_id in range(num_layers):
-        # 1. Load all prefilled KV from CPU
-        for block_idx, cpu_block_id in enumerate(cpu_block_table):
-            k_block = offload_engine.k_cache_cpu[layer_id, cpu_block_id, :valid_tokens].to("cuda")
-            v_block = offload_engine.v_cache_cpu[layer_id, cpu_block_id, :valid_tokens].to("cuda")
-
-        # 2. Compute new Q,K,V for current token
-        q_new = apply_rotary_pos_emb(q_proj(hidden_states), cos, sin)
-        k_new = apply_rotary_pos_emb(k_proj(hidden_states), cos, sin)
-        v_new = v_proj(hidden_states)
-
-        # 3. Concatenate and compute attention
-        k_full = torch.cat([k_prefill, k_new], dim=0)
-        v_full = torch.cat([v_prefill, v_new], dim=0)
-        attn_out = flash_attn_varlen_func(q_new, k_full, v_full, ..., causal=False)
-        # Note: causal=False because single query token should attend to ALL keys
-```
-
-### Critical Implementation Details
-
-**1. Synchronous Offload Required**
-
-Async offload with `non_blocking=True` causes memory reuse bugs:
-```python
-# BUG: PyTorch may reuse k,v GPU memory before async copy completes
-offload_engine.k_cache_cpu[layer_id, block_id].copy_(k[start:end], non_blocking=True)
-
-# CORRECT: Synchronous copy ensures data integrity
-offload_engine.k_cache_cpu[layer_id, block_id, :size].copy_(k[start:end])  # sync
-```
-
-**2. Decode Attention: causal=False**
-
-During decode, the single query token must attend to ALL keys (not just preceding ones):
-```python
-# Prefill: causal=True (each token only attends to previous tokens)
-attn_out = flash_attn_varlen_func(..., causal=True)
-
-# Decode: causal=False (query at position N attends to all N-1 prefill + itself)
-attn_out = flash_attn_varlen_func(..., causal=False)
-```
-
-### Helper Methods in HybridKVCacheManager
-
-```python
-# Get all CPU blocks for a sequence
-cpu_blocks = manager.get_all_cpu_blocks(seq)  # List[int]
-
-# Get only prefilled (offloaded) CPU blocks
-prefilled_blocks = manager.get_prefilled_cpu_blocks(seq)  # List[int]
-
-# Get cached prefill length (doesn't change during decode)
-prefill_len = manager.get_prefill_len(seq)  # int
-
-# Get decode start position
-decode_pos = manager.get_decode_start_pos(seq)  # int
-```
+| Document | Purpose |
+|----------|---------|
+| [`docs/architecture_guide.md`](docs/architecture_guide.md) | Core components, layer-wise CPU offload design, prefill/decode flows, implementation details |
+| [`docs/sparse_attention_guide.md`](docs/sparse_attention_guide.md) | Block sparse attention methods (MInference, FlexPrefill, XAttention, Quest), computation flow |
+| [`docs/layerwise_offload_memory_analysis.md`](docs/layerwise_offload_memory_analysis.md) | Memory allocation analysis with theoretical formulas and empirical validation (< 5% error) |
+| [`docs/debugging_guide.md`](docs/debugging_guide.md) | PyTorch hooks for debugging, tensor comparison, memory profiling |
 
 ## Configuration
 
@@ -322,6 +69,8 @@ decode_pos = manager.get_decode_start_pos(seq)  # int
 | `max_num_batched_tokens` | 16384 | Set = max_model_len for long context |
 | `gpu_memory_utilization` | 0.9 | GPU memory fraction |
 | `enable_cpu_offload` | False | Enable for long context |
+| `num_gpu_blocks` | 2 | GPU blocks for offload mode |
+| `num_kv_buffers` | 4 | Ring buffer size for decode pipeline |
 
 ## Benchmarking
 
diff --git a/docs/architecture_guide.md b/docs/architecture_guide.md
new file mode 100644
index 0000000..47bacff
--- /dev/null
+++ b/docs/architecture_guide.md
@@ -0,0 +1,189 @@
+# Architecture Guide
+
+This document describes the core architecture and layer-wise CPU offload system of nano-vLLM.
+
+## Core Components
+
+| Component | File | Purpose |
+|-----------|------|---------|
+| **LLMEngine** | `llm_engine.py` | Main entry, runs prefill-decode loop |
+| **ModelRunner** | `model_runner.py` | Loads weights, allocates KV cache, CUDA graphs, layer-wise offload |
+| **Scheduler** | `scheduler.py` | Two-phase scheduling (prefill → decode) |
+| **BlockManager** | `block_manager.py` | Paged attention with prefix caching (xxhash), default block size 4096 |
+| **Attention** | `layers/attention.py` | FlashAttention for standard inference |
+
+## Layer-wise CPU Offload System
+
+### Design Philosophy
+
+Unlike chunked prefill (which processes chunks across all layers), **layer-wise offload** processes the entire sequence through one layer at a time:
+
+```
+Layer 0: [full sequence] → compute → offload K,V to CPU
+Layer 1: [full sequence] → compute → offload K,V to CPU
+...
+Layer N: [full sequence] → compute → offload K,V to CPU
+```
+
+**Benefits**:
+- Supports MInference sparse attention (requires full KV access per layer)
+- Simpler memory management (one layer's KV in GPU at a time)
+- Peak GPU memory = one layer's KV cache + attention workspace
+
+### Key Files
+
+| File | Purpose |
+|------|---------|
+| `nanovllm/engine/model_runner.py` | Main implementation (`run_layerwise_offload_prefill`, `run_layerwise_offload_decode`) |
+| `nanovllm/kvcache/hybrid_manager.py` | CPU block management helpers |
+| `nanovllm/kvcache/offload_engine.py` | CPU/GPU cache storage, ring buffer, async transfers |
+
+### Memory Layout
+
+**CPU Cache** (pinned memory):
+```python
+k_cache_cpu: [num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]
+v_cache_cpu: [num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]
+```
+
+**GPU Ring Buffer** (for decode H2D pipeline):
+```python
+layer_k_cache: [num_kv_buffers, max_seq_len, kv_heads, head_dim]
+layer_v_cache: [num_kv_buffers, max_seq_len, kv_heads, head_dim]
+```
+
+**Per-layer KV size** (Qwen3-4B: 8 kv_heads × 128 head_dim × 2 bytes × 2 for K+V = 4KB/token):
+
+| Context Length | KV per Layer |
+|----------------|--------------|
+| 128K tokens | 512 MB |
+| 256K tokens | 1 GB |
+| 512K tokens | 2 GB |
+| 1M tokens | 4 GB |
+
+---
+
+## Prefill Flow
+
+```python
+def run_layerwise_offload_prefill(self, seqs: list[Sequence]) -> list[int]:
+    # 1. Embedding
+    hidden_states = self.model.model.embed_tokens(input_ids)
+
+    # 2. Process each layer
+    for layer_id in range(num_layers):
+        # QKV projection + norms + RoPE
+        q = apply_rotary_pos_emb(q_proj(hidden_states), cos, sin)
+        k = apply_rotary_pos_emb(k_proj(hidden_states), cos, sin)
+        v = v_proj(hidden_states)
+
+        # Full FlashAttention (entire sequence)
+        attn_out = flash_attn_varlen_func(q, k, v, cu_seqlens, max_seqlen, causal=True)
+
+        # MLP
+        hidden_states = mlp(attn_out + residual)
+
+        # Synchronous offload to CPU (CRITICAL: must be sync to avoid memory reuse bugs)
+        self._offload_layer_kv_to_cpu_sync(layer_id, k, v, cpu_block_ids, total_tokens)
+
+    # 3. Final norm + sampling
+    return sampled_tokens
+```
+
+---
+
+## Decode Flow
+
+```python
+def run_layerwise_offload_decode(self, seqs: list[Sequence]) -> list[int]:
+    # Ring buffer pipeline: preload first N layers
+    for i in range(num_buffers):
+        offload_engine.load_layer_kv_to_buffer(i, i, cpu_block_table, valid_tokens)
+
+    # For each layer:
+    for layer_id in range(num_layers):
+        current_buffer = layer_id % num_buffers
+
+        # 1. Wait for buffer load to complete
+        offload_engine.wait_buffer_load(current_buffer)
+
+        # 2. Get prefilled KV from ring buffer
+        k_prefill, v_prefill = offload_engine.get_buffer_kv(current_buffer, total_prefill_tokens)
+
+        # 3. Compute new Q,K,V for current token
+        q_new = apply_rotary_pos_emb(q_proj(hidden_states), cos, sin)
+        k_new = apply_rotary_pos_emb(k_proj(hidden_states), cos, sin)
+        v_new = v_proj(hidden_states)
+
+        # 4. Concatenate and compute attention
+        k_full = torch.cat([k_prefill, k_new], dim=0)
+        v_full = torch.cat([v_prefill, v_new], dim=0)
+        attn_out = flash_attn_varlen_func(q_new, k_full, v_full, ..., causal=False)
+        # Note: causal=False because single query token should attend to ALL keys
+
+        # 5. Mark buffer done, start loading next layer
+        offload_engine.record_buffer_compute_done(current_buffer)
+        if layer_id + num_buffers < num_layers:
+            offload_engine.load_layer_kv_to_buffer(current_buffer, layer_id + num_buffers, ...)
+```
+
+---
+
+## Critical Implementation Details
+
+### 1. Synchronous Offload Required
+
+Async offload with `non_blocking=True` causes memory reuse bugs:
+
+```python
+# BUG: PyTorch may reuse k,v GPU memory before async copy completes
+offload_engine.k_cache_cpu[layer_id, block_id].copy_(k[start:end], non_blocking=True)
+
+# CORRECT: Synchronous copy ensures data integrity
+offload_engine.k_cache_cpu[layer_id, block_id, :size].copy_(k[start:end])  # sync
+```
+
+### 2. Decode Attention: causal=False
+
+During decode, the single query token must attend to ALL keys (not just preceding ones):
+
+```python
+# Prefill: causal=True (each token only attends to previous tokens)
+attn_out = flash_attn_varlen_func(..., causal=True)
+
+# Decode: causal=False (query at position N attends to all N-1 prefill + itself)
+attn_out = flash_attn_varlen_func(..., causal=False)
+```
+
+### 3. Ring Buffer Synchronization
+
+The ring buffer pipeline requires careful ordering:
+
+```python
+# CORRECT order:
+offload_engine.store_decode_kv(layer_id, pos, k_new, v_new)  # Store new KV
+offload_engine.record_buffer_compute_done(current_buffer)    # Mark done FIRST
+offload_engine.load_layer_kv_to_buffer(...)                  # THEN start next load
+
+# BUG: Starting load before marking done causes race condition
+offload_engine.load_layer_kv_to_buffer(...)  # WRONG: buffer still in use!
+offload_engine.record_buffer_compute_done(current_buffer)
+```
+
+---
+
+## Helper Methods in HybridKVCacheManager
+
+```python
+# Get all CPU blocks for a sequence
+cpu_blocks = manager.get_all_cpu_blocks(seq)  # List[int]
+
+# Get only prefilled (offloaded) CPU blocks
+prefilled_blocks = manager.get_prefilled_cpu_blocks(seq)  # List[int]
+
+# Get cached prefill length (doesn't change during decode)
+prefill_len = manager.get_prefill_len(seq)  # int
+
+# Get decode start position
+decode_pos = manager.get_decode_start_pos(seq)  # int
+```
diff --git a/docs/debugging_guide.md b/docs/debugging_guide.md
new file mode 100644
index 0000000..efc00be
--- /dev/null
+++ b/docs/debugging_guide.md
@@ -0,0 +1,142 @@
+# Debugging Guide
+
+This document provides debugging techniques for nano-vLLM, including PyTorch hooks for capturing intermediate tensors.
+
+## PyTorch Hooks for Debugging
+
+### Hook Positions in Qwen3
+
+```
+decoder_layer
+├── input_layernorm (RMSNorm)
+├── self_attn (Qwen3Attention)          ← Hook here for attention I/O after o_proj
+│   ├── q_proj → q_norm → RoPE
+│   ├── k_proj → k_norm → RoPE
+│   ├── v_proj
+│   ├── attn (Attention)                ← Hook here for Q/K/V tensors
+│   │   └── FlashAttention / SDPA
+│   └── o_proj
+├── post_attention_layernorm (RMSNorm)
+└── mlp (Qwen3MLP)
+```
+
+### Hook Types & Data Shapes
+
+| Hook Position | Type | Captured Data |
+|---------------|------|---------------|
+| `self_attn` | post | `[batch, seq_len, hidden_size]` - after o_proj |
+| `self_attn.attn` | pre | Q,K,V: `[seq_len, num_heads, head_dim]` - after RoPE |
+| `self_attn.attn` | post | `[seq_len, num_heads, head_dim]` - before o_proj |
+
+### Example: Capture Attention Outputs
+
+```python
+storage = {}
+
+def make_hook(layer_id: int, storage: dict):
+    def hook(module, inputs, output):
+        if isinstance(output, tuple):
+            attn_output = output[0]
+        else:
+            attn_output = output
+        # nanovllm shape: [num_tokens, hidden_size] -> add batch dim
+        if attn_output.dim() == 2:
+            attn_output = attn_output.unsqueeze(0)
+        storage[layer_id] = attn_output.detach().clone()
+    return hook
+
+# Register hooks
+hooks = []
+for layer_idx, layer in enumerate(model.model.layers):
+    hooks.append(layer.self_attn.register_forward_hook(make_hook(layer_idx, storage)))
+
+# Run inference...
+
+# Cleanup
+for hook in hooks:
+    hook.remove()
+```
+
+### Reference Implementation
+
+Key files for comparison testing:
+
+| File | Purpose |
+|------|---------|
+| `tests/modeling_qwen3.py` | Reference Qwen3 implementation (torch + transformers only) |
+| `tests/test_needle_ref.py` | Reference needle test using custom Qwen3 |
+| `tests/test_needle.py` | Needle-in-haystack test for nanovllm |
+
+### Common Pitfalls
+
+1. **Shape mismatch**: nanovllm uses `[num_tokens, ...]` while torch uses `[batch, seq_len, ...]`
+2. **Hook position**: `self_attn` captures after o_proj, `self_attn.attn` captures before o_proj
+3. **Output format**: nanovllm returns tuple `(attn_output, None)`, handle with `output[0]`
+
+---
+
+## Memory Debugging
+
+### Track Peak GPU Memory
+
+```python
+import torch
+
+# Reset stats before operation
+torch.cuda.reset_peak_memory_stats()
+torch.cuda.empty_cache()
+
+# Run operation
+outputs = llm.generate([prompt], sampling_params)
+
+# Check peak
+peak_gb = torch.cuda.max_memory_allocated() / 1024**3
+print(f"Peak GPU memory: {peak_gb:.2f} GB")
+```
+
+### Monitor Memory During Execution
+
+```python
+import torch
+
+def memory_snapshot():
+    allocated = torch.cuda.memory_allocated() / 1024**3
+    reserved = torch.cuda.memory_reserved() / 1024**3
+    print(f"Allocated: {allocated:.2f} GB, Reserved: {reserved:.2f} GB")
+
+# Add snapshots at key points in your code
+```
+
+---
+
+## Comparing Outputs
+
+### Needle-in-Haystack Test
+
+```bash
+# Test with CPU offload
+PYTHONPATH=/path/to/nano-vllm:$PYTHONPATH python tests/test_needle.py --enable-offload --input-len 8192
+
+# Test without CPU offload (GPU-only)
+PYTHONPATH=/path/to/nano-vllm:$PYTHONPATH python tests/test_needle.py --input-len 8192
+
+# Compare with reference implementation
+PYTHONPATH=/path/to/nano-vllm:$PYTHONPATH python tests/test_needle_ref.py --input-len 8192
+```
+
+### Tensor Comparison
+
+```python
+def compare_tensors(a, b, name, rtol=1e-3, atol=1e-5):
+    if a.shape != b.shape:
+        print(f"{name}: Shape mismatch {a.shape} vs {b.shape}")
+        return False
+
+    diff = (a - b).abs()
+    max_diff = diff.max().item()
+    mean_diff = diff.mean().item()
+
+    close = torch.allclose(a, b, rtol=rtol, atol=atol)
+    print(f"{name}: max_diff={max_diff:.6f}, mean_diff={mean_diff:.6f}, close={close}")
+    return close
+```
diff --git a/docs/layerwise_offload_memory_analysis.md b/docs/layerwise_offload_memory_analysis.md
index 5a5adb3..560ff39 100644
--- a/docs/layerwise_offload_memory_analysis.md
+++ b/docs/layerwise_offload_memory_analysis.md
@@ -407,3 +407,141 @@ k_full = seq_len * kv_dim * dtype_size
 v_full = k_full  # = 256 MB
 # Total: 512 MB
 ```
+
+---
+
+## 8. Empirical Validation
+
+This section validates the theoretical memory analysis against actual measurements.
+
+### 8.1 Test Configuration
+
+```bash
+python tests/test_needle.py --enable-offload --input-len 100000 --block-size 1024
+```
+
+**Parameters:**
+- Model: Qwen3-4B-Instruct
+- `seq_len = 100000` (actual tokens: 99925)
+- `block_size = 1024`
+- `max_model_len = 131072`
+- `num_kv_buffers = 4`
+
+### 8.2 Theoretical Peak Memory Calculation
+
+#### Step 1: Model Load Memory
+
+| Component | Formula | Size |
+|-----------|---------|------|
+| Model weights | ~4B params × 2 bytes | ~8 GB |
+| Ring buffer | 2 × 4 × 131072 × 1024 × 2 | 2048 MB |
+| Decode buffer | 2 × 36 × 1024 × 1024 × 2 | 144 MB |
+| **Subtotal** | | **~10.2 GB** |
+
+#### Step 2: Prefill Activation Peak (per-layer)
+
+| Component | Formula | Size |
+|-----------|---------|------|
+| hidden_states | 100000 × 2560 × 2 | 512 MB |
+| residual | 100000 × 2560 × 2 | 512 MB |
+| MLP gate_up | 100000 × 27392 × 2 | **5478 MB** |
+| MLP silu×gate | 100000 × 13696 × 2 | 2739 MB |
+| Other intermediates (qkv, RoPE, attn) | ~1-2 GB | ~1500 MB |
+| **Subtotal** | | **~10 GB** |
+
+#### Step 3: Total Peak
+
+```
+Total Peak = Model Load + Activation Peak
+           = 10.2 GB + 10 GB
+           = ~20.2 GB
+```
+
+### 8.3 Actual Measurement Results
+
+```python
+import torch
+torch.cuda.reset_peak_memory_stats()
+# ... run inference ...
+peak = torch.cuda.max_memory_allocated()
+```
+
+| Metric | Value |
+|--------|-------|
+| After model load | 9.82 GB |
+| Peak during inference | **20.02 GB** |
+| Activation peak (delta) | 10.20 GB |
+
+### 8.4 Comparison: Theory vs Actual
+
+| Component | Theoretical | Actual | Error |
+|-----------|-------------|--------|-------|
+| Model load memory | ~10.2 GB | 9.82 GB | -3.7% |
+| Activation peak | ~10 GB | 10.20 GB | +2.0% |
+| **Total peak** | **~20.2 GB** | **20.02 GB** | **< 1%** |
+
+### 8.5 Key Findings
+
+1. **Theoretical model is accurate**: < 5% error in all components.
+
+2. **MLP gate_up is the dominant temporary**:
+   - Size: 5.35 GB (for 100k tokens)
+   - Accounts for ~50% of activation peak
+   - Formula: `seq_len × 2 × intermediate_size × dtype_size`
+
+3. **Memory scaling with sequence length**:
+   | seq_len | Model Load | Activation Peak | Total Peak |
+   |---------|------------|-----------------|------------|
+   | 8k | ~10 GB | ~0.8 GB | ~11 GB |
+   | 32k | ~10 GB | ~3.2 GB | ~13 GB |
+   | 64k | ~10 GB | ~6.4 GB | ~16 GB |
+   | 100k | ~10 GB | ~10 GB | ~20 GB |
+   | 128k | ~10 GB | ~13 GB | ~23 GB |
+
+4. **Decode memory is much smaller**:
+   - Per-step: ~512 MB for k_full + v_full (at 100k context)
+   - Does not grow with decode steps (constant per layer)
+
+### 8.6 Memory Profiling Script
+
+To reproduce the measurement:
+
+```python
+import os
+os.environ["NANOVLLM_LOG_LEVEL"] = "INFO"
+
+import torch
+from nanovllm import LLM, SamplingParams
+from tests.utils import generate_needle_prompt
+
+# Reset memory stats
+torch.cuda.reset_peak_memory_stats()
+torch.cuda.empty_cache()
+
+# Initialize LLM
+llm = LLM(
+    "path/to/model",
+    enforce_eager=True,
+    max_model_len=131072,
+    max_num_batched_tokens=131072,
+    enable_cpu_offload=True,
+    kvcache_block_size=1024,
+    num_gpu_blocks=2,
+)
+
+after_load = torch.cuda.memory_allocated()
+print(f"After model load: {after_load / 1024**3:.2f} GB")
+
+# Generate prompt and run inference
+prompt, expected = generate_needle_prompt(
+    tokenizer=llm.tokenizer,
+    target_length=100000,
+    needle_position=0.5,
+)
+
+torch.cuda.reset_peak_memory_stats()
+outputs = llm.generate([prompt], SamplingParams(max_tokens=32))
+
+peak = torch.cuda.max_memory_allocated()
+print(f"Peak during inference: {peak / 1024**3:.2f} GB")
+```
diff --git a/docs/sparse_attention_guide.md b/docs/sparse_attention_guide.md
index 5d441a6..ba3c0df 100644
--- a/docs/sparse_attention_guide.md
+++ b/docs/sparse_attention_guide.md
@@ -440,3 +440,42 @@ Required libraries:
 - `minference`: For MInference vertical_slash kernel
 
 Docker image `tzj/xattn:v0.5` has all dependencies pre-installed.
+
+---
+
+## Quest Sparse Policy (nano-vLLM)
+
+**Files**: `nanovllm/kvcache/sparse/quest.py`, `nanovllm/kvcache/sparse/policy.py`
+
+Quest policy is used in nano-vLLM for CPU offload mode. It selects Top-K blocks based on query-key similarity bounds using min/max key metadata.
+
+### Scoring Mechanism
+
+```python
+score_min = torch.einsum('hd,bhd->bh', q, key_min)  # [num_blocks, kv_heads]
+score_max = torch.einsum('hd,bhd->bh', q, key_max)  # [num_blocks, kv_heads]
+scores = torch.maximum(score_min, score_max).mean(dim=-1)  # [num_blocks] ← averaged!
+```
+
+### Critical Limitation - No Per-Head Scheduling
+
+The `.mean(dim=-1)` averages scores across all heads, making a **unified** block selection for all heads:
+
+```
+Block A: head0 needs (+4), head1 doesn't (-4) → avg = 0 → NOT selected
+Block B: head0 doesn't (-4), head1 needs (+4) → avg = 0 → NOT selected
+Block C: both heads moderately need (+2, +2) → avg = +2 → selected
+```
+
+### Why Per-Head Scheduling is Infeasible
+
+1. **Memory Layout**: GPU cache stores all heads together `[block_size, kv_heads, head_dim]`
+2. **FlashAttention**: Requires complete heads - partial heads cause dimension mismatch
+3. **Block Granularity**: If any head needs a block, the entire block (all heads) must be loaded
+
+### Policy Types
+
+| Policy | `supports_prefill` | `supports_decode` | Description |
+|--------|-------------------|-------------------|-------------|
+| `FullAttentionPolicy` | True | True | Loads all blocks (baseline) |
+| `QuestPolicy` | False | True | Decode-only Top-K selection |
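Reviewer note on the Quest section added above: the head-averaging limitation can be reproduced with a toy calculation. The following is a minimal sketch in plain Python, not the code in `nanovllm/kvcache/sparse/quest.py`. The helper names `select_top_k` and `per_head_top1` are hypothetical, and the scores are the ones from the Block A/B/C illustration.

```python
# Illustrative sketch only: toy per-head scores, not nano-vLLM's QuestPolicy.

def select_top_k(per_head_scores, k):
    """Average each block's scores over heads, then return the k best block names."""
    averaged = {
        name: sum(scores) / len(scores)
        for name, scores in per_head_scores.items()
    }
    ranked = sorted(averaged, key=averaged.get, reverse=True)
    return ranked[:k], averaged


def per_head_top1(per_head_scores):
    """Return the block each head would pick if it could choose on its own."""
    num_heads = len(next(iter(per_head_scores.values())))
    return [
        max(per_head_scores, key=lambda name: per_head_scores[name][head])
        for head in range(num_heads)
    ]


# Scores from the Block A/B/C illustration: (head0, head1) per block.
toy_scores = {
    "A": [4.0, -4.0],
    "B": [-4.0, 4.0],
    "C": [2.0, 2.0],
}

selected, averaged = select_top_k(toy_scores, k=1)
wanted = per_head_top1(toy_scores)

print("averaged scores:", averaged)
print("unified selection:", selected)
print("per-head preference:", wanted)
```

With these toy numbers the unified selection is block C, while head0 alone would prefer A and head1 alone would prefer B, which is the trade-off the section describes.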