[claudesquad] update from 'lw-offload-2' on 08 Jan 26 21:19 CST

2026-01-08 21:19:38 +08:00
parent a8c9f0d837
commit 105201b902
7 changed files with 649 additions and 279 deletions
--- a/.claude/rules/doc-management.md
+++ b/.claude/rules/doc-management.md
@@ -0,0 +1,105 @@
 # Documentation Management
 ## CLAUDE.md Content Policy
 **CLAUDE.md should only contain operational requirements:**
 - Environment setup (PYTHONPATH, GPU mutex)
 - Execution requirements (how to run tests/benchmarks)
 - Quick configuration reference
 - Documentation index (links to detailed docs)
 **Technical details should go to docs/:**
 - Architecture and design explanations
 - Implementation details and code flows
 - Debugging techniques
 - Memory analysis and profiling
 - Algorithm explanations
 ## When Adding New Technical Content
 Follow this workflow:
 ### Step 1: Analyze and Document
 If doing technical analysis (e.g., memory profiling):
 1. Calculate theoretical values using formulas
 2. Run actual tests to measure real values
 3. Compare theoretical vs actual (expect < 10% error for valid models)
 4. Document findings with both theory and empirical validation
 ### Step 2: Create/Update docs/
 Create a new doc or update an existing one in `docs/`:
 ```
 docs/
 ├── architecture_guide.md      # Core components, design, flows
 ├── sparse_attention_guide.md  # Sparse attention methods
 ├── layerwise_offload_memory_analysis.md  # Memory analysis
 ├── debugging_guide.md         # Debugging techniques
 └── <new_topic>_guide.md       # New technical topic
 ```
 ### Step 3: Update CLAUDE.md Documentation Index
 Add an entry to the Documentation Index table:
 ```markdown
 | Document | Purpose |
 |----------|---------|
 | [`docs/new_doc.md`](docs/new_doc.md) | Brief description |
 ```
 ### Step 4: Refactor if Needed
 If CLAUDE.md grows too large (> 150 lines), refactor:
 1. Identify technical details that can be moved
 2. Create appropriate doc in docs/
 3. Replace detailed content with reference link
 4. Keep only operational essentials in CLAUDE.md
 ## Documentation Structure Template
 For new technical docs:
 ```markdown
 # Topic Guide
 Brief overview of what this document covers.
 ## Section 1: Concepts
 - Key concepts and terminology
 ## Section 2: Implementation
 - Code locations
 - Key methods/functions
 ## Section 3: Details
 - Detailed explanations
 - Code examples
 ## Section 4: Validation (if applicable)
 - Theoretical analysis
 - Empirical measurements
 - Comparison table
 ```
 ## Memory Analysis Template
 When documenting memory behavior:
 ```markdown
 ## Theoretical Calculation
 | Component | Formula | Size |
 |-----------|---------|------|
 | Buffer X | `param1 × param2 × dtype_size` | X MB |
 ## Empirical Validation
 | Metric | Theoretical | Actual | Error |
 |--------|-------------|--------|-------|
 | Peak memory | X GB | Y GB | Z% |
 ## Key Findings
 1. Finding 1
 2. Finding 2
 ```
--- a/.claude/rules/no-extra-docs.md
+++ b/.claude/rules/no-extra-docs.md
@@ -2,39 +2,47 @@
 ## Do Not Create Unnecessary Documentation
-**IMPORTANT**: Do NOT create extra markdown documentation files unless explicitly requested by the user.
+**IMPORTANT**: Do NOT create extra markdown documentation files unless one of the following applies:
 1. User explicitly requests documentation
 2. Refactoring CLAUDE.md to move technical details to docs/ (see `doc-management.md`)
 ### What NOT to do:
- ❌ Do NOT create README files proactively
+- Do NOT create README files proactively
- ❌ Do NOT create analysis documents (*.md) after completing tasks
+- Do NOT create standalone analysis documents after completing tasks
- ❌ Do NOT create tutorial/guide documents
+- Do NOT create summary documents without request
 - ❌ Do NOT create summary documents
 ### What TO do:
- ✅ Only create documentation when user explicitly asks for it
+- Provide information directly in conversation by default
- ✅ Provide information directly in conversation instead
+- When user requests documentation, follow `doc-management.md` workflow
- ✅ Update existing documentation if changes require it
+- Update existing docs in `docs/` when code changes affect them
- ✅ Add inline code comments where necessary
+- Keep CLAUDE.md concise (< 150 lines), move technical details to docs/
-### Exceptions:
+### Documentation Locations:
-Documentation is acceptable ONLY when:
+| Type | Location |
-1. User explicitly requests "create a README" or "write documentation"
+|------|----------|
-2. Updating existing documentation to reflect code changes
+| Operational requirements | CLAUDE.md |
-3. Adding inline comments/docstrings to code itself
+| Technical details | docs/*.md |
 | Code comments | Inline in source |
 ### Examples:
-**Bad** (Don't do this):
+**Proactive docs (Don't do)**:
 ```
 User: "Profile the code"
-Assistant: [Creates profiling_results.md after profiling]
+Assistant: [Creates profiling_results.md without being asked]
 ```
-**Good** (Do this instead):
+**On-request docs (Do this)**:
 ```
-User: "Profile the code"
+User: "Profile the code and document the findings"
-Assistant: [Runs profiling, shows results in conversation]
+Assistant: [Runs profiling, creates/updates docs/memory_analysis.md]
 ```
 **Refactoring (Do this)**:
 ```
 User: "CLAUDE.md is too long, refactor it"
 Assistant: [Moves technical sections to docs/, updates CLAUDE.md index]
 ```
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -27,17 +27,6 @@ Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline L
 3. **Only proceed** when `nvidia-smi --query-compute-apps=pid --format=csv,noheader` returns empty output
 **Example workflow**:
 ```bash
 # First check if GPU is in use
 nvidia-smi --query-compute-apps=pid,name,used_memory --format=csv,noheader
 # If output is empty, proceed with your command
 python bench_offload.py
 # If output shows processes, wait until they finish
 ```
 **Note**: This applies to ALL GPU operations including:
 - Running tests (`python tests/test_*.py`)
 - Running benchmarks (`python bench*.py`)
@@ -63,256 +52,14 @@ PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH python tests/test_needle.py
 - Code changes take effect immediately (no reinstall needed)
 - Each worktree is completely isolated
-**For shell session** (optional):
+## Documentation Index
 ```bash
 export PYTHONPATH=/path/to/your/worktree:$PYTHONPATH
 python tests/test_needle.py  # PYTHONPATH already set
 ```
-## Sparse Attention
+| Document | Purpose |
-
+|----------|---------|
-For sparse attention related content (block sparse attention, MInference, FlexPrefill, XAttention, AvgPool, etc.), refer to [`docs/sparse_attention_guide.md`](docs/sparse_attention_guide.md).
+| [`docs/architecture_guide.md`](docs/architecture_guide.md) | Core components, layer-wise CPU offload design, prefill/decode flows, implementation details |
-
+| [`docs/sparse_attention_guide.md`](docs/sparse_attention_guide.md) | Block sparse attention methods (MInference, FlexPrefill, XAttention, Quest), computation flow |
-### Quest Sparse Policy
+| [`docs/layerwise_offload_memory_analysis.md`](docs/layerwise_offload_memory_analysis.md) | Memory allocation analysis with theoretical formulas and empirical validation (< 5% error) |
-
+| [`docs/debugging_guide.md`](docs/debugging_guide.md) | PyTorch hooks for debugging, tensor comparison, memory profiling |
 **Files**: `nanovllm/kvcache/sparse/quest.py`, `nanovllm/kvcache/sparse/policy.py`
 Quest policy selects Top-K blocks based on query-key similarity bounds using min/max key metadata.
 **Scoring Mechanism**:
 ```python
 score_min = torch.einsum('hd,bhd->bh', q, key_min)  # [num_blocks, kv_heads]
 score_max = torch.einsum('hd,bhd->bh', q, key_max)  # [num_blocks, kv_heads]
 scores = torch.maximum(score_min, score_max).mean(dim=-1)  # [num_blocks] ← averaged!
 ```
 **Critical Limitation - No Per-Head Scheduling**:
 The `.mean(dim=-1)` averages scores across all heads, making a **unified** block selection for all heads:
 ```
 Block A: head0 needs (+4), head1 doesn't (-4) → avg = 0 → NOT selected
 Block B: head0 doesn't (-4), head1 needs (+4) → avg = 0 → NOT selected
 Block C: both heads moderately need (+2, +2) → avg = +2 → selected
 ```
 **Why Per-Head Scheduling is Infeasible**:
 1. **Memory Layout**: GPU cache stores all heads together `[block_size, kv_heads, head_dim]`
 2. **FlashAttention**: Requires complete heads - partial heads cause dimension mismatch
 3. **Block Granularity**: If any head needs a block, the entire block (all heads) must be loaded
 **Policy Types**:
 - `FullAttentionPolicy`: `supports_prefill=True, supports_decode=True` - loads all blocks
 - `QuestPolicy`: `supports_prefill=False, supports_decode=True` - decode-only Top-K selection
 ## Architecture
 ### Core Components
 - **LLMEngine** (`llm_engine.py`): Main entry, runs prefill-decode loop
 - **ModelRunner** (`model_runner.py`): Loads weights, allocates KV cache, CUDA graphs, layer-wise offload
 - **Scheduler** (`scheduler.py`): Two-phase scheduling (prefill → decode)
 - **BlockManager** (`block_manager.py`): Paged attention with prefix caching (xxhash), default block size 4096
 - **Attention** (`layers/attention.py`): FlashAttention for standard inference
 ## PyTorch Hooks for Debugging
 ### Hook Positions in Qwen3
 ```
 decoder_layer
 ├── input_layernorm (RMSNorm)
 ├── self_attn (Qwen3Attention)          ← Hook here for attention I/O after o_proj
 │   ├── q_proj → q_norm → RoPE
 │   ├── k_proj → k_norm → RoPE
 │   ├── v_proj
 │   ├── attn (Attention)                ← Hook here for Q/K/V tensors
 │   │   └── FlashAttention / SDPA
 │   └── o_proj
 ├── post_attention_layernorm (RMSNorm)
 └── mlp (Qwen3MLP)
 ```
 ### Hook Types & Data Shapes
 | Hook Position | Type | Captured Data |
 |---------------|------|---------------|
 | `self_attn` | post | `[batch, seq_len, hidden_size]` - after o_proj |
 | `self_attn.attn` | pre | Q,K,V: `[seq_len, num_heads, head_dim]` - after RoPE |
 | `self_attn.attn` | post | `[seq_len, num_heads, head_dim]` - before o_proj |
 ### Example: Capture Attention Outputs
 ```python
 storage = {}
 def make_hook(layer_id: int, storage: dict):
    def hook(module, inputs, output):
        if isinstance(output, tuple):
            attn_output = output[0]
        else:
            attn_output = output
        # nanovllm shape: [num_tokens, hidden_size] -> add batch dim
        if attn_output.dim() == 2:
            attn_output = attn_output.unsqueeze(0)
        storage[layer_id] = attn_output.detach().clone()
    return hook
 # Register hooks
 hooks = []
 for layer_idx, layer in enumerate(model.model.layers):
    hooks.append(layer.self_attn.register_forward_hook(make_hook(layer_idx, storage)))
 # Run inference...
 # Cleanup
 for hook in hooks:
    hook.remove()
 ```
 ### Reference Implementation
 Key files:
 - `tests/modeling_qwen3.py`: Reference Qwen3 implementation (torch + transformers only)
 - `tests/test_needle_ref.py`: Reference needle test using custom Qwen3
 - `tests/test_needle.py`: Needle-in-haystack test for nanovllm
 ### Common Pitfalls
 1. **Shape mismatch**: nanovllm uses `[num_tokens, ...]` while torch uses `[batch, seq_len, ...]`
 2. **Hook position**: `self_attn` captures after o_proj, `self_attn.attn` captures before o_proj
 3. **Output format**: nanovllm returns tuple `(attn_output, None)`, handle with `output[0]`
 ## Layer-wise CPU Offload System
 ### Design Philosophy
 Unlike chunked prefill (which processes chunks across all layers), **layer-wise offload** processes the entire sequence through one layer at a time:
 ```
 Layer 0: [full sequence] → compute → offload K,V to CPU
 Layer 1: [full sequence] → compute → offload K,V to CPU
 ...
 Layer N: [full sequence] → compute → offload K,V to CPU
 ```
 **Benefits**:
 - Supports MInference sparse attention (requires full KV access per layer)
 - Simpler memory management (one layer's KV in GPU at a time)
 - Peak GPU memory = one layer's KV cache + attention workspace
 ### Key Files
 - `nanovllm/engine/model_runner.py`: Main implementation (`run_layerwise_offload_prefill`, `run_layerwise_offload_decode`)
 - `nanovllm/kvcache/hybrid_manager.py`: CPU block management helpers
 - `nanovllm/kvcache/offload_engine.py`: CPU/GPU cache storage
 ### Memory Layout
 **CPU Cache** (pinned memory):
 ```python
 k_cache_cpu: [num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]
 v_cache_cpu: [num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]
 ```
 **Per-layer KV size** (Qwen3-4B: 8 kv_heads × 128 head_dim × 2 bytes × 2 for K+V = 4KB/token):
 | Context Length | KV per Layer |
 |----------------|--------------|
 | 128K tokens | 512 MB |
 | 256K tokens | 1 GB |
 | 512K tokens | 2 GB |
 | 1M tokens | 4 GB |
 ### Prefill Flow
 ```python
 def run_layerwise_offload_prefill(self, seqs: list[Sequence]) -> list[int]:
    # 1. Embedding
    hidden_states = self.model.model.embed_tokens(input_ids)
    # 2. Process each layer
    for layer_id in range(num_layers):
        # QKV projection + norms + RoPE
        q = apply_rotary_pos_emb(q_proj(hidden_states), cos, sin)
        k = apply_rotary_pos_emb(k_proj(hidden_states), cos, sin)
        v = v_proj(hidden_states)
        # Full FlashAttention (entire sequence)
        attn_out = flash_attn_varlen_func(q, k, v, cu_seqlens, max_seqlen, causal=True)
        # MLP
        hidden_states = mlp(attn_out + residual)
        # Synchronous offload to CPU (CRITICAL: must be sync to avoid memory reuse bugs)
        self._offload_layer_kv_to_cpu_sync(layer_id, k, v, cpu_block_ids, total_tokens)
    # 3. Final norm + sampling
    return sampled_tokens
 ```
 ### Decode Flow
 ```python
 def run_layerwise_offload_decode(self, seqs: list[Sequence]) -> list[int]:
    # For each layer:
    for layer_id in range(num_layers):
        # 1. Load all prefilled KV from CPU
        for block_idx, cpu_block_id in enumerate(cpu_block_table):
            k_block = offload_engine.k_cache_cpu[layer_id, cpu_block_id, :valid_tokens].to("cuda")
            v_block = offload_engine.v_cache_cpu[layer_id, cpu_block_id, :valid_tokens].to("cuda")
        # 2. Compute new Q,K,V for current token
        q_new = apply_rotary_pos_emb(q_proj(hidden_states), cos, sin)
        k_new = apply_rotary_pos_emb(k_proj(hidden_states), cos, sin)
        v_new = v_proj(hidden_states)
        # 3. Concatenate and compute attention
        k_full = torch.cat([k_prefill, k_new], dim=0)
        v_full = torch.cat([v_prefill, v_new], dim=0)
        attn_out = flash_attn_varlen_func(q_new, k_full, v_full, ..., causal=False)
        # Note: causal=False because single query token should attend to ALL keys
 ```
 ### Critical Implementation Details
 **1. Synchronous Offload Required**
 Async offload with `non_blocking=True` causes memory reuse bugs:
 ```python
 # BUG: PyTorch may reuse k,v GPU memory before async copy completes
 offload_engine.k_cache_cpu[layer_id, block_id].copy_(k[start:end], non_blocking=True)
 # CORRECT: Synchronous copy ensures data integrity
 offload_engine.k_cache_cpu[layer_id, block_id, :size].copy_(k[start:end])  # sync
 ```
 **2. Decode Attention: causal=False**
 During decode, the single query token must attend to ALL keys (not just preceding ones):
 ```python
 # Prefill: causal=True (each token only attends to previous tokens)
 attn_out = flash_attn_varlen_func(..., causal=True)
 # Decode: causal=False (query at position N attends to all N-1 prefill + itself)
 attn_out = flash_attn_varlen_func(..., causal=False)
 ```
 ### Helper Methods in HybridKVCacheManager
 ```python
 # Get all CPU blocks for a sequence
 cpu_blocks = manager.get_all_cpu_blocks(seq)  # List[int]
 # Get only prefilled (offloaded) CPU blocks
 prefilled_blocks = manager.get_prefilled_cpu_blocks(seq)  # List[int]
 # Get cached prefill length (doesn't change during decode)
 prefill_len = manager.get_prefill_len(seq)  # int
 # Get decode start position
 decode_pos = manager.get_decode_start_pos(seq)  # int
 ```
 ## Configuration
@@ -322,6 +69,8 @@ decode_pos = manager.get_decode_start_pos(seq)  # int
 | `max_num_batched_tokens` | 16384 | Set = max_model_len for long context |
 | `gpu_memory_utilization` | 0.9 | GPU memory fraction |
 | `enable_cpu_offload` | False | Enable for long context |
 | `num_gpu_blocks` | 2 | GPU blocks for offload mode |
 | `num_kv_buffers` | 4 | Ring buffer size for decode pipeline |
 ## Benchmarking
--- a/docs/architecture_guide.md
+++ b/docs/architecture_guide.md
@@ -0,0 +1,189 @@
 # Architecture Guide
 This document describes the core architecture and layer-wise CPU offload system of nano-vLLM.
 ## Core Components
 | Component | File | Purpose |
 |-----------|------|---------|
 | **LLMEngine** | `llm_engine.py` | Main entry, runs prefill-decode loop |
 | **ModelRunner** | `model_runner.py` | Loads weights, allocates KV cache, CUDA graphs, layer-wise offload |
 | **Scheduler** | `scheduler.py` | Two-phase scheduling (prefill → decode) |
 | **BlockManager** | `block_manager.py` | Paged attention with prefix caching (xxhash), default block size 4096 |
 | **Attention** | `layers/attention.py` | FlashAttention for standard inference |
 ## Layer-wise CPU Offload System
 ### Design Philosophy
 Unlike chunked prefill (which processes chunks across all layers), **layer-wise offload** processes the entire sequence through one layer at a time:
 ```
 Layer 0: [full sequence] → compute → offload K,V to CPU
 Layer 1: [full sequence] → compute → offload K,V to CPU
 ...
 Layer N: [full sequence] → compute → offload K,V to CPU
 ```
 **Benefits**:
 - Supports MInference sparse attention (requires full KV access per layer)
 - Simpler memory management (one layer's KV in GPU at a time)
 - Peak GPU memory = one layer's KV cache + attention workspace
 ### Key Files
 | File | Purpose |
 |------|---------|
 | `nanovllm/engine/model_runner.py` | Main implementation (`run_layerwise_offload_prefill`, `run_layerwise_offload_decode`) |
 | `nanovllm/kvcache/hybrid_manager.py` | CPU block management helpers |
 | `nanovllm/kvcache/offload_engine.py` | CPU/GPU cache storage, ring buffer, async transfers |
 ### Memory Layout
 **CPU Cache** (pinned memory):
 ```python
 k_cache_cpu: [num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]
 v_cache_cpu: [num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]
 ```
 **GPU Ring Buffer** (for decode H2D pipeline):
 ```python
 layer_k_cache: [num_kv_buffers, max_seq_len, kv_heads, head_dim]
 layer_v_cache: [num_kv_buffers, max_seq_len, kv_heads, head_dim]
 ```
 **Per-layer KV size** (Qwen3-4B: 8 kv_heads × 128 head_dim × 2 bytes × 2 for K+V = 4KB/token):
 | Context Length | KV per Layer |
 |----------------|--------------|
 | 128K tokens | 512 MB |
 | 256K tokens | 1 GB |
 | 512K tokens | 2 GB |
 | 1M tokens | 4 GB |
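The per-token figure and the table above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming 2-byte (fp16/bf16) storage and reading "128K" as 128 × 1024 tokens; the function names are illustrative and not part of nano-vLLM.

```python
def kv_bytes_per_token(kv_heads: int = 8, head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Bytes of K plus V stored per token for a single layer."""
    return 2 * kv_heads * head_dim * dtype_bytes  # leading 2 = K and V


def kv_bytes_per_layer(num_tokens: int, kv_heads: int = 8, head_dim: int = 128,
                       dtype_bytes: int = 2) -> int:
    """Bytes of K plus V for one layer at the given context length."""
    return num_tokens * kv_bytes_per_token(kv_heads, head_dim, dtype_bytes)


if __name__ == "__main__":
    # Assumption: "K" means 1024 tokens, "M" means 1024 * 1024 tokens.
    contexts = [("128K", 128 * 1024), ("256K", 256 * 1024),
                ("512K", 512 * 1024), ("1M", 1024 * 1024)]
    for label, tokens in contexts:
        mib = kv_bytes_per_layer(tokens) / 2**20
        print(f"{label:>5} tokens: {mib:.0f} MiB per layer")
```

Multiplying the per-layer figure by the number of layers gives the CPU cache footprint for the whole model at that context length.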
 ---
 ## Prefill Flow
 ```python
 def run_layerwise_offload_prefill(self, seqs: list[Sequence]) -> list[int]:
    # 1. Embedding
    hidden_states = self.model.model.embed_tokens(input_ids)
    # 2. Process each layer
    for layer_id in range(num_layers):
        # QKV projection + norms + RoPE
        q = apply_rotary_pos_emb(q_proj(hidden_states), cos, sin)
        k = apply_rotary_pos_emb(k_proj(hidden_states), cos, sin)
        v = v_proj(hidden_states)
        # Full FlashAttention (entire sequence)
        attn_out = flash_attn_varlen_func(q, k, v, cu_seqlens, max_seqlen, causal=True)
        # MLP
        hidden_states = mlp(attn_out + residual)
        # Synchronous offload to CPU (CRITICAL: must be sync to avoid memory reuse bugs)
        self._offload_layer_kv_to_cpu_sync(layer_id, k, v, cpu_block_ids, total_tokens)
    # 3. Final norm + sampling
    return sampled_tokens
 ```
 ---
 ## Decode Flow
 ```python
 def run_layerwise_offload_decode(self, seqs: list[Sequence]) -> list[int]:
    # Ring buffer pipeline: preload first N layers
    for i in range(num_buffers):
        offload_engine.load_layer_kv_to_buffer(i, i, cpu_block_table, valid_tokens)
    # For each layer:
    for layer_id in range(num_layers):
        current_buffer = layer_id % num_buffers
        # 1. Wait for buffer load to complete
        offload_engine.wait_buffer_load(current_buffer)
        # 2. Get prefilled KV from ring buffer
        k_prefill, v_prefill = offload_engine.get_buffer_kv(current_buffer, total_prefill_tokens)
        # 3. Compute new Q,K,V for current token
        q_new = apply_rotary_pos_emb(q_proj(hidden_states), cos, sin)
        k_new = apply_rotary_pos_emb(k_proj(hidden_states), cos, sin)
        v_new = v_proj(hidden_states)
        # 4. Concatenate and compute attention
        k_full = torch.cat([k_prefill, k_new], dim=0)
        v_full = torch.cat([v_prefill, v_new], dim=0)
        attn_out = flash_attn_varlen_func(q_new, k_full, v_full, ..., causal=False)
        # Note: causal=False because single query token should attend to ALL keys
        # 5. Mark buffer done, start loading next layer
        offload_engine.record_buffer_compute_done(current_buffer)
        if layer_id + num_buffers < num_layers:
            offload_engine.load_layer_kv_to_buffer(current_buffer, layer_id + num_buffers, ...)
 ```
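To see which layer occupies which buffer at each step, the pipeline can be traced without a GPU. The sketch below only models the ordering of events from the pseudocode above; `ring_buffer_schedule` and the event names are hypothetical, and the `min(...)` guard for models with fewer layers than buffers is an added assumption.

```python
def ring_buffer_schedule(num_layers: int, num_buffers: int) -> list[tuple[str, int, int]]:
    """Return the ordered (event, layer_id, buffer_id) trace of one decode step."""
    events: list[tuple[str, int, int]] = []

    # Preload: buffer i receives layer i (guarded in case num_layers < num_buffers).
    for i in range(min(num_buffers, num_layers)):
        events.append(("load", i, i))

    for layer_id in range(num_layers):
        buf = layer_id % num_buffers
        events.append(("wait", layer_id, buf))     # block until this layer's KV has arrived
        events.append(("compute", layer_id, buf))  # attention over the buffered KV
        events.append(("done", layer_id, buf))     # buffer may now be overwritten
        next_layer = layer_id + num_buffers
        if next_layer < num_layers:
            # Reuse the buffer that was just released for a later layer.
            events.append(("load", next_layer, buf))
    return events


if __name__ == "__main__":
    for event, layer_id, buf in ring_buffer_schedule(num_layers=6, num_buffers=4):
        print(f"{event:<8} layer={layer_id} buffer={buf}")
```

In the trace, every reload of a buffer appears after the `done` event of its previous occupant and before the `wait` of the layer it now holds, which is the ordering the pipeline depends on.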
 ---
 ## Critical Implementation Details
 ### 1. Synchronous Offload Required
 Async offload with `non_blocking=True` causes memory reuse bugs:
 ```python
 # BUG: PyTorch may reuse k,v GPU memory before async copy completes
 offload_engine.k_cache_cpu[layer_id, block_id].copy_(k[start:end], non_blocking=True)
 # CORRECT: Synchronous copy ensures data integrity
 offload_engine.k_cache_cpu[layer_id, block_id, :size].copy_(k[start:end])  # sync
 ```
 ### 2. Decode Attention: causal=False
 During decode, the single query token must attend to ALL keys (every cached prefill key plus its own):
 ```python
 # Prefill: causal=True (each token only attends to previous tokens)
 attn_out = flash_attn_varlen_func(..., causal=True)
 # Decode: causal=False (the query at position N attends to the N cached keys at positions 0..N-1 plus its own)
 attn_out = flash_attn_varlen_func(..., causal=False)
 ```
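Why the flag matters for a single query can be illustrated with plain boolean masks. The sketch below builds a causal mask under two alignment conventions, top-left and bottom-right; which convention a given attention kernel applies when the query and key lengths differ is not stated in this document and should be checked against that kernel's documentation. The function names are illustrative only.

```python
def causal_mask(seqlen_q: int, seqlen_k: int, align: str) -> list[list[bool]]:
    """mask[i][j] is True when query i may attend to key j."""
    # Bottom-right alignment shifts the diagonal so the last query sees the last key.
    offset = seqlen_k - seqlen_q if align == "bottom_right" else 0
    return [[j <= i + offset for j in range(seqlen_k)] for i in range(seqlen_q)]


def full_mask(seqlen_q: int, seqlen_k: int) -> list[list[bool]]:
    """No masking at all, i.e. the causal=False case."""
    return [[True] * seqlen_k for _ in range(seqlen_q)]


if __name__ == "__main__":
    # Decode step: 1 new query, 4 cached keys plus its own key = 5 keys.
    print("top-left    :", causal_mask(1, 5, "top_left"))
    print("bottom-right:", causal_mask(1, 5, "bottom_right"))
    print("no mask     :", full_mask(1, 5))
```

Under top-left alignment the lone query would see only the first key, so the cached context would be lost; under bottom-right alignment the causal mask coincides with no mask. Passing `causal=False` gives the intended result under either convention.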
 ### 3. Ring Buffer Synchronization
 The ring buffer pipeline requires careful ordering:
 ```python
 # CORRECT order:
 offload_engine.store_decode_kv(layer_id, pos, k_new, v_new)  # Store new KV
 offload_engine.record_buffer_compute_done(current_buffer)     # Mark done FIRST
 offload_engine.load_layer_kv_to_buffer(...)                   # THEN start next load
 # BUG: Starting load before marking done causes race condition
 offload_engine.load_layer_kv_to_buffer(...)  # WRONG: buffer still in use!
 offload_engine.record_buffer_compute_done(current_buffer)
 ```
 ---
 ## Helper Methods in HybridKVCacheManager
 ```python
 # Get all CPU blocks for a sequence
 cpu_blocks = manager.get_all_cpu_blocks(seq)  # List[int]
 # Get only prefilled (offloaded) CPU blocks
 prefilled_blocks = manager.get_prefilled_cpu_blocks(seq)  # List[int]
 # Get cached prefill length (doesn't change during decode)
 prefill_len = manager.get_prefill_len(seq)  # int
 # Get decode start position
 decode_pos = manager.get_decode_start_pos(seq)  # int
 ```
--- a/docs/debugging_guide.md
+++ b/docs/debugging_guide.md
@@ -0,0 +1,142 @@
 # Debugging Guide
 This document provides debugging techniques for nano-vLLM, including PyTorch hooks for capturing intermediate tensors.
 ## PyTorch Hooks for Debugging
 ### Hook Positions in Qwen3
 ```
 decoder_layer
 ├── input_layernorm (RMSNorm)
 ├── self_attn (Qwen3Attention)          ← Hook here for attention I/O after o_proj
 │   ├── q_proj → q_norm → RoPE
 │   ├── k_proj → k_norm → RoPE
 │   ├── v_proj
 │   ├── attn (Attention)                ← Hook here for Q/K/V tensors
 │   │   └── FlashAttention / SDPA
 │   └── o_proj
 ├── post_attention_layernorm (RMSNorm)
 └── mlp (Qwen3MLP)
 ```
 ### Hook Types & Data Shapes
 | Hook Position | Type | Captured Data |
 |---------------|------|---------------|
 | `self_attn` | post | `[batch, seq_len, hidden_size]` - after o_proj |
 | `self_attn.attn` | pre | Q,K,V: `[seq_len, num_heads, head_dim]` - after RoPE |
 | `self_attn.attn` | post | `[seq_len, num_heads, head_dim]` - before o_proj |
 ### Example: Capture Attention Outputs
 ```python
 storage = {}
 def make_hook(layer_id: int, storage: dict):
    def hook(module, inputs, output):
        if isinstance(output, tuple):
            attn_output = output[0]
        else:
            attn_output = output
        # nanovllm shape: [num_tokens, hidden_size] -> add batch dim
        if attn_output.dim() == 2:
            attn_output = attn_output.unsqueeze(0)
        storage[layer_id] = attn_output.detach().clone()
    return hook
 # Register hooks
 hooks = []
 for layer_idx, layer in enumerate(model.model.layers):
    hooks.append(layer.self_attn.register_forward_hook(make_hook(layer_idx, storage)))
 # Run inference...
 # Cleanup
 for hook in hooks:
    hook.remove()
 ```
 ### Reference Implementation
 Key files for comparison testing:
 | File | Purpose |
 |------|---------|
 | `tests/modeling_qwen3.py` | Reference Qwen3 implementation (torch + transformers only) |
 | `tests/test_needle_ref.py` | Reference needle test using custom Qwen3 |
 | `tests/test_needle.py` | Needle-in-haystack test for nanovllm |
 ### Common Pitfalls
 1. **Shape mismatch**: nanovllm uses `[num_tokens, ...]` while torch uses `[batch, seq_len, ...]`
 2. **Hook position**: `self_attn` captures after o_proj, `self_attn.attn` captures before o_proj
 3. **Output format**: nanovllm returns tuple `(attn_output, None)`, handle with `output[0]`
 ---
 ## Memory Debugging
 ### Track Peak GPU Memory
 ```python
 import torch
 # Reset stats before operation
 torch.cuda.reset_peak_memory_stats()
 torch.cuda.empty_cache()
 # Run operation
 outputs = llm.generate([prompt], sampling_params)
 # Check peak
 peak_gb = torch.cuda.max_memory_allocated() / 1024**3
 print(f"Peak GPU memory: {peak_gb:.2f} GB")
 ```
 ### Monitor Memory During Execution
 ```python
 import torch
 def memory_snapshot():
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    print(f"Allocated: {allocated:.2f} GB, Reserved: {reserved:.2f} GB")
 # Add snapshots at key points in your code
 ```
 ---
 ## Comparing Outputs
 ### Needle-in-Haystack Test
 ```bash
 # Test with CPU offload
 PYTHONPATH=/path/to/nano-vllm:$PYTHONPATH python tests/test_needle.py --enable-offload --input-len 8192
 # Test without CPU offload (GPU-only)
 PYTHONPATH=/path/to/nano-vllm:$PYTHONPATH python tests/test_needle.py --input-len 8192
 # Compare with reference implementation
 PYTHONPATH=/path/to/nano-vllm:$PYTHONPATH python tests/test_needle_ref.py --input-len 8192
 ```
 ### Tensor Comparison
 ```python
 import torch
 def compare_tensors(a, b, name, rtol=1e-3, atol=1e-5):
    if a.shape != b.shape:
        print(f"{name}: Shape mismatch {a.shape} vs {b.shape}")
        return False
    diff = (a - b).abs()
    max_diff = diff.max().item()
    mean_diff = diff.mean().item()
    close = torch.allclose(a, b, rtol=rtol, atol=atol)
    print(f"{name}: max_diff={max_diff:.6f}, mean_diff={mean_diff:.6f}, close={close}")
    return close
 ```
--- a/docs/layerwise_offload_memory_analysis.md
+++ b/docs/layerwise_offload_memory_analysis.md
@@ -407,3 +407,141 @@ k_full = seq_len * kv_dim * dtype_size
 v_full = k_full  # = 256 MB
 # Total: 512 MB
 ```
 ---
 ## 8. Empirical Validation
 This section validates the theoretical memory analysis against actual measurements.
 ### 8.1 Test Configuration
 ```bash
 python tests/test_needle.py --enable-offload --input-len 100000 --block-size 1024
 ```
 **Parameters:**
 - Model: Qwen3-4B-Instruct
 - `seq_len = 100000` (actual tokens: 99925)
 - `block_size = 1024`
 - `max_model_len = 131072`
 - `num_kv_buffers = 4`
 ### 8.2 Theoretical Peak Memory Calculation
 #### Step 1: Model Load Memory
 | Component | Formula | Size |
 |-----------|---------|------|
 | Model weights | ~4B params × 2 bytes | ~8 GB |
 | Ring buffer | 2 × 4 × 131072 × 1024 × 2 | 2048 MB |
 | Decode buffer | 2 × 36 × 1024 × 1024 × 2 | 144 MB |
 | **Subtotal** | | **~10.2 GB** |
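The two buffer rows can be checked directly. A minimal sketch, assuming the factors in the table read as (K and V) × buffers or layers × tokens × `kv_dim` × bytes per element, with `kv_dim = 8 × 128 = 1024` and 36 layers; that reading, and the function names, are assumptions rather than something taken from the nano-vLLM source.

```python
def ring_buffer_bytes(num_kv_buffers: int, max_seq_len: int, kv_dim: int,
                      dtype_bytes: int = 2) -> int:
    """K and V ring buffers, each shaped [num_kv_buffers, max_seq_len, kv_dim]."""
    return 2 * num_kv_buffers * max_seq_len * kv_dim * dtype_bytes


def decode_buffer_bytes(num_layers: int, block_size: int, kv_dim: int,
                        dtype_bytes: int = 2) -> int:
    """K and V decode buffers: one block of tokens per layer (assumed layout)."""
    return 2 * num_layers * block_size * kv_dim * dtype_bytes


if __name__ == "__main__":
    kv_dim = 8 * 128  # kv_heads * head_dim
    ring = ring_buffer_bytes(num_kv_buffers=4, max_seq_len=131072, kv_dim=kv_dim)
    decode = decode_buffer_bytes(num_layers=36, block_size=1024, kv_dim=kv_dim)
    print(f"Ring buffer:   {ring / 2**20:.0f} MiB")
    print(f"Decode buffer: {decode / 2**20:.0f} MiB")
```

Because the ring buffer scales with `max_model_len` rather than with the actual prompt length, lowering `max_model_len` is the most direct way to shrink this fixed cost.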
 #### Step 2: Prefill Activation Peak (per-layer)
 | Component | Formula | Size |
 |-----------|---------|------|
 | hidden_states | 100000 × 2560 × 2 | 512 MB |
 | residual | 100000 × 2560 × 2 | 512 MB |
 | MLP gate_up | 100000 × 27392 × 2 | **5478 MB** |
 | MLP silu×gate | 100000 × 13696 × 2 | 2739 MB |
 | Other intermediates (qkv, RoPE, attn) | ~1-2 GB | ~1500 MB |
 | **Subtotal** | | **~10 GB** |
 #### Step 3: Total Peak
 ```
 Total Peak = Model Load + Activation Peak
           = 10.2 GB + 10 GB
           = ~20.2 GB
 ```
 ### 8.3 Actual Measurement Results
 ```python
 import torch
 torch.cuda.reset_peak_memory_stats()
 # ... run inference ...
 peak = torch.cuda.max_memory_allocated()
 ```
 | Metric | Value |
 |--------|-------|
 | After model load | 9.82 GB |
 | Peak during inference | **20.02 GB** |
 | Activation peak (delta) | 10.20 GB |
 ### 8.4 Comparison: Theory vs Actual
 | Component | Theoretical | Actual | Error |
 |-----------|-------------|--------|-------|
 | Model load memory | ~10.2 GB | 9.82 GB | -3.7% |
 | Activation peak | ~10 GB | 10.20 GB | +2.0% |
 | **Total peak** | **~20.2 GB** | **20.02 GB** | **< 1%** |
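The error column follows from the usual signed relative error, (actual − theoretical) / theoretical. A short sketch that recomputes it from the figures in the table; the helper name is illustrative.

```python
def relative_error_pct(theoretical: float, actual: float) -> float:
    """Signed relative error of the measurement against the prediction, in percent."""
    return (actual - theoretical) / theoretical * 100.0


if __name__ == "__main__":
    # (theoretical GB, measured GB), copied from the comparison table.
    rows = {
        "Model load memory": (10.2, 9.82),
        "Activation peak": (10.0, 10.20),
        "Total peak": (20.2, 20.02),
    }
    for name, (theory, actual) in rows.items():
        print(f"{name:<18} {relative_error_pct(theory, actual):+.1f}%")
```

The negative error on model load and the positive error on activations partly cancel, which is why the total lands closer to the prediction than either component.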
 ### 8.5 Key Findings
 1. **Theoretical model is accurate**: < 5% error in all components.
 2. **MLP gate_up is the dominant temporary**:
   - Size: 5478 MB, about 5.1 GiB (for 100k tokens)
   - Accounts for ~50% of activation peak
   - Formula: `seq_len × 2 × intermediate_size × dtype_size`
 3. **Memory scaling with sequence length**:
   | seq_len | Model Load | Activation Peak | Total Peak |
   |---------|------------|-----------------|------------|
   | 8k | ~10 GB | ~0.8 GB | ~11 GB |
   | 32k | ~10 GB | ~3.2 GB | ~13 GB |
   | 64k | ~10 GB | ~6.4 GB | ~16 GB |
   | 100k | ~10 GB | ~10 GB | ~20 GB |
   | 128k | ~10 GB | ~13 GB | ~23 GB |
 4. **Decode memory is much smaller**:
   - Per-step: ~400 MB for k_full + v_full at 100k context (100000 tokens × 4 KB/token per layer); ~512 MB at the full 131072-token `max_model_len`
   - Does not grow with decode steps (constant per layer)
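The scaling table in finding 3 is consistent with a simple linear model: a fixed model-load cost plus an activation peak proportional to `seq_len`. The sketch below extrapolates from the single 100k measurement; the constants are rounded from the table, "k" is taken as 1000 tokens, and the linearity itself is an assumption that has only been measured at one point.

```python
MODEL_LOAD_GB = 10.0                      # rounded from the table
ACTIVATION_GB_PER_TOKEN = 10.0 / 100_000  # ~10 GB activation peak at 100k tokens


def estimate_peak_gb(seq_len: int) -> tuple[float, float]:
    """Return (activation_peak_gb, total_peak_gb) under the linear assumption."""
    activation = ACTIVATION_GB_PER_TOKEN * seq_len
    return activation, MODEL_LOAD_GB + activation


if __name__ == "__main__":
    for seq_len in (8_000, 32_000, 64_000, 100_000, 128_000):
        activation, total = estimate_peak_gb(seq_len)
        print(f"{seq_len:>7} tokens: activation ~{activation:.1f} GB, total ~{total:.0f} GB")
```

An estimate like this is useful for deciding whether a given prompt length fits on a particular GPU before running it; it should be re-measured if the model or the MLP implementation changes.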
 ### 8.6 Memory Profiling Script
 To reproduce the measurement:
 ```python
 import os
 os.environ["NANOVLLM_LOG_LEVEL"] = "INFO"
 import torch
 from nanovllm import LLM, SamplingParams
 from tests.utils import generate_needle_prompt
 # Reset memory stats
 torch.cuda.reset_peak_memory_stats()
 torch.cuda.empty_cache()
 # Initialize LLM
 llm = LLM(
    "path/to/model",
    enforce_eager=True,
    max_model_len=131072,
    max_num_batched_tokens=131072,
    enable_cpu_offload=True,
    kvcache_block_size=1024,
    num_gpu_blocks=2,
 )
 after_load = torch.cuda.memory_allocated()
 print(f"After model load: {after_load / 1024**3:.2f} GB")
 # Generate prompt and run inference
 prompt, expected = generate_needle_prompt(
    tokenizer=llm.tokenizer,
    target_length=100000,
    needle_position=0.5,
 )
 torch.cuda.reset_peak_memory_stats()
 outputs = llm.generate([prompt], SamplingParams(max_tokens=32))
 peak = torch.cuda.max_memory_allocated()
 print(f"Peak during inference: {peak / 1024**3:.2f} GB")
 ```
--- a/docs/sparse_attention_guide.md
+++ b/docs/sparse_attention_guide.md
@@ -440,3 +440,42 @@ Required libraries:
 - `minference`: For MInference vertical_slash kernel
 Docker image `tzj/xattn:v0.5` has all dependencies pre-installed.
 ---
 ## Quest Sparse Policy (nano-vLLM)
 **Files**: `nanovllm/kvcache/sparse/quest.py`, `nanovllm/kvcache/sparse/policy.py`
 Quest policy is used in nano-vLLM for CPU offload mode. It selects Top-K blocks based on query-key similarity bounds using min/max key metadata.
 ### Scoring Mechanism
 ```python
 score_min = torch.einsum('hd,bhd->bh', q, key_min)  # [num_blocks, kv_heads]
 score_max = torch.einsum('hd,bhd->bh', q, key_max)  # [num_blocks, kv_heads]
 scores = torch.maximum(score_min, score_max).mean(dim=-1)  # [num_blocks] ← averaged!
 ```
 ### Critical Limitation - No Per-Head Scheduling
 The `.mean(dim=-1)` averages scores across all heads, making a **unified** block selection for all heads:
 ```
 Block A: head0 needs (+4), head1 doesn't (-4) → avg = 0 → NOT selected
 Block B: head0 doesn't (-4), head1 needs (+4) → avg = 0 → NOT selected
 Block C: both heads moderately need (+2, +2) → avg = +2 → selected
 ```
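The three-block example above can be replayed in plain Python. This is a toy sketch of head-averaged Top-K selection using the scores from the example; it does not call the real `QuestPolicy`, and the function names are illustrative.

```python
def select_blocks(per_head_scores: dict[str, list[float]], top_k: int) -> list[str]:
    """Average each block's per-head scores, then keep the top_k blocks."""
    averaged = {name: sum(s) / len(s) for name, s in per_head_scores.items()}
    ranked = sorted(averaged, key=averaged.get, reverse=True)
    return ranked[:top_k]


def top_block_for_head(per_head_scores: dict[str, list[float]], head: int) -> str:
    """The block a single head would pick if it could choose on its own."""
    return max(per_head_scores, key=lambda name: per_head_scores[name][head])


if __name__ == "__main__":
    scores = {"A": [4.0, -4.0], "B": [-4.0, 4.0], "C": [2.0, 2.0]}
    print("Unified selection:", select_blocks(scores, top_k=1))
    print("head0 would pick: ", top_block_for_head(scores, head=0))
    print("head1 would pick: ", top_block_for_head(scores, head=1))
```

With a budget of one block, the averaged ranking keeps block C even though neither head ranks it first, which is the limitation described in this section.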
 ### Why Per-Head Scheduling is Infeasible
 1. **Memory Layout**: GPU cache stores all heads together `[block_size, kv_heads, head_dim]`
 2. **FlashAttention**: Requires complete heads - partial heads cause dimension mismatch
 3. **Block Granularity**: If any head needs a block, the entire block (all heads) must be loaded
 ### Policy Types
 | Policy | `supports_prefill` | `supports_decode` | Description |
 |--------|-------------------|-------------------|-------------|
 | `FullAttentionPolicy` | True | True | Loads all blocks (baseline) |
 | `QuestPolicy` | False | True | Decode-only Top-K selection |