[claudesquad] update from 'lw-offload-2' on 08 Jan 26 21:19 CST

2026-01-08 21:19:38 +08:00
parent a8c9f0d837
commit 105201b902
7 changed files with 649 additions and 279 deletions
--- a/docs/architecture_guide.md
+++ b/docs/architecture_guide.md
@@ -0,0 +1,189 @@
+# Architecture Guide
+
+This document describes the core architecture and layer-wise CPU offload system of nano-vLLM.
+
+## Core Components
+
+| Component | File | Purpose |
+|-----------|------|---------|
+| **LLMEngine** | `llm_engine.py` | Main entry, runs prefill-decode loop |
+| **ModelRunner** | `model_runner.py` | Loads weights, allocates KV cache, CUDA graphs, layer-wise offload |
+| **Scheduler** | `scheduler.py` | Two-phase scheduling (prefill → decode) |
+| **BlockManager** | `block_manager.py` | Paged attention with prefix caching (xxhash), default block size 4096 |
+| **Attention** | `layers/attention.py` | FlashAttention for standard inference |
+
+## Layer-wise CPU Offload System
+
+### Design Philosophy
+
+Unlike chunked prefill (which processes chunks across all layers), **layer-wise offload** processes the entire sequence through one layer at a time:
+
+```
+Layer 0: [full sequence] → compute → offload K,V to CPU
+Layer 1: [full sequence] → compute → offload K,V to CPU
+...
+Layer N: [full sequence] → compute → offload K,V to CPU
+```
+
+**Benefits**:
+- Supports MInference sparse attention (requires full KV access per layer)
+- Simpler memory management (one layer's KV in GPU at a time)
+- Peak GPU KV memory during prefill = one layer's KV cache; activations and the attention workspace come on top of that
+
+### Key Files
+
+| File | Purpose |
+|------|---------|
+| `nanovllm/engine/model_runner.py` | Main implementation (`run_layerwise_offload_prefill`, `run_layerwise_offload_decode`) |
+| `nanovllm/kvcache/hybrid_manager.py` | CPU block management helpers |
+| `nanovllm/kvcache/offload_engine.py` | CPU/GPU cache storage, ring buffer, async transfers |
+
+### Memory Layout
+
+**CPU Cache** (pinned memory):
+```python
+k_cache_cpu: [num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]
+v_cache_cpu: [num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]
+```
+
+**GPU Ring Buffer** (for decode H2D pipeline):
+```python
+layer_k_cache: [num_kv_buffers, max_seq_len, kv_heads, head_dim]
+layer_v_cache: [num_kv_buffers, max_seq_len, kv_heads, head_dim]
+```
+
+**Per-layer KV size** (Qwen3-4B: 8 kv_heads × 128 head_dim × 2 bytes × 2 for K+V = 4KB/token):
+
+| Context Length | KV per Layer |
+|----------------|--------------|
+| 128K tokens | 512 MB |
+| 256K tokens | 1 GB |
+| 512K tokens | 2 GB |
+| 1M tokens | 4 GB |
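+
+The table follows directly from the 4 KB/token figure. A minimal sketch of the arithmetic, assuming binary units (1 MB = 2^20 bytes) and "K" = 1024 tokens; `kv_bytes_per_layer` is a hypothetical helper, not part of nano-vLLM:
+
+```python
+def kv_bytes_per_layer(num_tokens: int, kv_heads: int = 8, head_dim: int = 128,
+                       dtype_bytes: int = 2) -> int:
+    """Bytes of K plus V that one layer stores for `num_tokens` tokens."""
+    per_token = kv_heads * head_dim * dtype_bytes * 2  # x2 covers K and V
+    return num_tokens * per_token
+
+
+for label, tokens in [("128K", 128 * 1024), ("256K", 256 * 1024),
+                      ("512K", 512 * 1024), ("1M", 1024 * 1024)]:
+    print(f"{label}: {kv_bytes_per_layer(tokens) / 2**20:.0f} MB")
+```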
+
+---
+
+## Prefill Flow
+
+```python
+def run_layerwise_offload_prefill(self, seqs: list[Sequence]) -> list[int]:
+    # 1. Embedding
+    hidden_states = self.model.model.embed_tokens(input_ids)
+
+    # 2. Process each layer
+    for layer_id in range(num_layers):
+        # QKV projection + norms + RoPE
+        q = apply_rotary_pos_emb(q_proj(hidden_states), cos, sin)
+        k = apply_rotary_pos_emb(k_proj(hidden_states), cos, sin)
+        v = v_proj(hidden_states)
+
+        # Full FlashAttention (entire sequence)
+        attn_out = flash_attn_varlen_func(q, k, v, cu_seqlens, max_seqlen, causal=True)
+
+        # MLP
+        hidden_states = mlp(attn_out + residual)
+
+        # Synchronous offload to CPU (CRITICAL: must be sync to avoid memory reuse bugs)
+        self._offload_layer_kv_to_cpu_sync(layer_id, k, v, cpu_block_ids, total_tokens)
+
+    # 3. Final norm + sampling
+    return sampled_tokens
+```
+
+---
+
+## Decode Flow
+
+```python
+def run_layerwise_offload_decode(self, seqs: list[Sequence]) -> list[int]:
+    # Ring buffer pipeline: preload first N layers
+    for i in range(num_buffers):
+        offload_engine.load_layer_kv_to_buffer(i, i, cpu_block_table, valid_tokens)
+
+    # For each layer:
+    for layer_id in range(num_layers):
+        current_buffer = layer_id % num_buffers
+
+        # 1. Wait for buffer load to complete
+        offload_engine.wait_buffer_load(current_buffer)
+
+        # 2. Get prefilled KV from ring buffer
+        k_prefill, v_prefill = offload_engine.get_buffer_kv(current_buffer, total_prefill_tokens)
+
+        # 3. Compute new Q,K,V for current token
+        q_new = apply_rotary_pos_emb(q_proj(hidden_states), cos, sin)
+        k_new = apply_rotary_pos_emb(k_proj(hidden_states), cos, sin)
+        v_new = v_proj(hidden_states)
+
+        # 4. Concatenate and compute attention
+        k_full = torch.cat([k_prefill, k_new], dim=0)
+        v_full = torch.cat([v_prefill, v_new], dim=0)
+        attn_out = flash_attn_varlen_func(q_new, k_full, v_full, ..., causal=False)
+        # Note: causal=False because single query token should attend to ALL keys
+
+        # 5. Mark buffer done, start loading next layer
+        offload_engine.record_buffer_compute_done(current_buffer)
+        if layer_id + num_buffers < num_layers:
+            offload_engine.load_layer_kv_to_buffer(current_buffer, layer_id + num_buffers, ...)
+```
+
+---
+
+## Critical Implementation Details
+
+### 1. Synchronous Offload Required
+
+An asynchronous offload with `non_blocking=True` can corrupt the CPU copy, because the `k`/`v` GPU memory may be reused before the device-to-host copy has finished:
+
+```python
+# BUG: PyTorch may reuse k,v GPU memory before async copy completes
+offload_engine.k_cache_cpu[layer_id, block_id, :size].copy_(k[start:end], non_blocking=True)
+
+# CORRECT: Synchronous copy ensures data integrity
+offload_engine.k_cache_cpu[layer_id, block_id, :size].copy_(k[start:end])  # sync
+```
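+
+The `[start:end]` slice and the `:size` index above imply a per-block split of the token range. A minimal sketch of that split, assuming tokens fill consecutive CPU blocks in order and only the last block can be partial; `block_slices` is a hypothetical helper, not a nano-vLLM function:
+
+```python
+def block_slices(total_tokens: int, block_size: int):
+    """Yield (block_index, start, end, size) for each block covering the sequence."""
+    for block_index, start in enumerate(range(0, total_tokens, block_size)):
+        end = min(start + block_size, total_tokens)
+        yield block_index, start, end, end - start
+
+
+# 10 tokens with block_size=4: two full blocks and one partial block of 2 tokens
+for block_index, start, end, size in block_slices(10, 4):
+    print(f"block {block_index}: k[{start}:{end}] -> cache[..., :{size}]")
+```
+
+In the real offload path, `block_index` would presumably be mapped through `cpu_block_ids` to find the destination block.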
+
+### 2. Decode Attention: causal=False
+
+During decode, the single query token sits at the newest position, so every key (all prefill keys plus its own) is visible to it and no causal mask is needed:
+
+```python
+# Prefill: causal=True (each token only attends to previous tokens)
+attn_out = flash_attn_varlen_func(..., causal=True)
+
+# Decode: causal=False (the one new query attends to every prefill key plus its own)
+attn_out = flash_attn_varlen_func(..., causal=False)
+```
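+
+The equivalence can be checked on a toy example: the last row of a causal attention already sees every key, so an unmasked single-query attention gives the same result. This is a pure-Python sketch with scalar values and made-up inputs, not the FlashAttention kernel:
+
+```python
+import math
+
+
+def attend(query, keys, values):
+    """Softmax attention of one query over all given keys (scalar values for brevity)."""
+    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
+    peak = max(scores)  # subtract the max for numerical stability
+    weights = [math.exp(s - peak) for s in scores]
+    return sum(w * v for w, v in zip(weights, values)) / sum(weights)
+
+
+def causal_rows(queries, keys, values):
+    """Causal attention: row i sees keys 0..i only."""
+    return [attend(q, keys[: i + 1], values[: i + 1]) for i, q in enumerate(queries)]
+
+
+keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
+values = [1.0, 2.0, 3.0]
+queries = [[0.5, 0.5], [1.0, -1.0], [0.2, 0.8]]
+
+last_causal = causal_rows(queries, keys, values)[-1]  # prefill-style, last position
+decode_step = attend(queries[-1], keys, values)       # decode-style, no mask
+print(math.isclose(last_causal, decode_step))
+```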
+
+### 3. Ring Buffer Synchronization
+
+The ring buffer pipeline requires careful ordering:
+
+```python
+# CORRECT order:
+offload_engine.store_decode_kv(layer_id, pos, k_new, v_new)  # Store new KV
+offload_engine.record_buffer_compute_done(current_buffer)     # Mark done FIRST
+offload_engine.load_layer_kv_to_buffer(...)                   # THEN start next load
+
+# BUG: Starting load before marking done causes race condition
+offload_engine.load_layer_kv_to_buffer(...)  # WRONG: buffer still in use!
+offload_engine.record_buffer_compute_done(current_buffer)
+```
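+
+The resulting pipeline order can be sketched as a plain event list. This is an illustrative simulation under the assumption that layer `i` always uses buffer `i % num_buffers`, as in the decode flow above; `ring_schedule` is a hypothetical helper and performs no real transfers:
+
+```python
+def ring_schedule(num_layers: int, num_buffers: int):
+    """Return the ordered (event, layer_id, buffer_id) list for one decode step."""
+    events = []
+    # Preload: the first num_buffers layers each get their own slot
+    for layer_id in range(min(num_buffers, num_layers)):
+        events.append(("load", layer_id, layer_id))
+    for layer_id in range(num_layers):
+        buf = layer_id % num_buffers
+        events.append(("compute", layer_id, buf))
+        next_layer = layer_id + num_buffers
+        if next_layer < num_layers:
+            # The slot is refilled only after its compute has been recorded
+            events.append(("load", next_layer, buf))
+    return events
+
+
+for event in ring_schedule(num_layers=6, num_buffers=2):
+    print(event)
+```
+
+Every `load` into a buffer appears after the `compute` that last used it, which is the ordering the CORRECT variant enforces.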
+
+---
+
+## Helper Methods in HybridKVCacheManager
+
+```python
+# Get all CPU blocks for a sequence
+cpu_blocks = manager.get_all_cpu_blocks(seq)  # List[int]
+
+# Get only prefilled (offloaded) CPU blocks
+prefilled_blocks = manager.get_prefilled_cpu_blocks(seq)  # List[int]
+
+# Get cached prefill length (doesn't change during decode)
+prefill_len = manager.get_prefill_len(seq)  # int
+
+# Get decode start position
+decode_pos = manager.get_decode_start_pos(seq)  # int
+```
--- a/docs/debugging_guide.md
+++ b/docs/debugging_guide.md
@@ -0,0 +1,142 @@
+# Debugging Guide
+
+This document provides debugging techniques for nano-vLLM, including PyTorch hooks for capturing intermediate tensors.
+
+## PyTorch Hooks for Debugging
+
+### Hook Positions in Qwen3
+
+```
+decoder_layer
+├── input_layernorm (RMSNorm)
+├── self_attn (Qwen3Attention)          ← Hook here for attention I/O after o_proj
+│   ├── q_proj → q_norm → RoPE
+│   ├── k_proj → k_norm → RoPE
+│   ├── v_proj
+│   ├── attn (Attention)                ← Hook here for Q/K/V tensors
+│   │   └── FlashAttention / SDPA
+│   └── o_proj
+├── post_attention_layernorm (RMSNorm)
+└── mlp (Qwen3MLP)
+```
+
+### Hook Types & Data Shapes
+
+| Hook Position | Type | Captured Data |
+|---------------|------|---------------|
+| `self_attn` | post | `[batch, seq_len, hidden_size]` - after o_proj |
+| `self_attn.attn` | pre | Q,K,V: `[seq_len, num_heads, head_dim]` - after RoPE |
+| `self_attn.attn` | post | `[seq_len, num_heads, head_dim]` - before o_proj |
+
+### Example: Capture Attention Outputs
+
+```python
+storage = {}
+
+def make_hook(layer_id: int, storage: dict):
+    def hook(module, inputs, output):
+        if isinstance(output, tuple):
+            attn_output = output[0]
+        else:
+            attn_output = output
+        # nanovllm shape: [num_tokens, hidden_size] -> add batch dim
+        if attn_output.dim() == 2:
+            attn_output = attn_output.unsqueeze(0)
+        storage[layer_id] = attn_output.detach().clone()
+    return hook
+
+# Register hooks
+hooks = []
+for layer_idx, layer in enumerate(model.model.layers):
+    hooks.append(layer.self_attn.register_forward_hook(make_hook(layer_idx, storage)))
+
+# Run inference...
+
+# Cleanup
+for hook in hooks:
+    hook.remove()
+```
+
+### Reference Implementation
+
+Key files for comparison testing:
+
+| File | Purpose |
+|------|---------|
+| `tests/modeling_qwen3.py` | Reference Qwen3 implementation (torch + transformers only) |
+| `tests/test_needle_ref.py` | Reference needle test using custom Qwen3 |
+| `tests/test_needle.py` | Needle-in-haystack test for nanovllm |
+
+### Common Pitfalls
+
+1. **Shape mismatch**: nanovllm uses `[num_tokens, ...]` while the reference (torch/transformers) implementation uses `[batch, seq_len, ...]`
+2. **Hook position**: `self_attn` captures after o_proj, `self_attn.attn` captures before o_proj
+3. **Output format**: nanovllm returns tuple `(attn_output, None)`, handle with `output[0]`
+
+---
+
+## Memory Debugging
+
+### Track Peak GPU Memory
+
+```python
+import torch
+
+# Reset stats before operation
+torch.cuda.reset_peak_memory_stats()
+torch.cuda.empty_cache()
+
+# Run operation
+outputs = llm.generate([prompt], sampling_params)
+
+# Check peak
+peak_gb = torch.cuda.max_memory_allocated() / 1024**3
+print(f"Peak GPU memory: {peak_gb:.2f} GB")
+```
+
+### Monitor Memory During Execution
+
+```python
+import torch
+
+def memory_snapshot():
+    allocated = torch.cuda.memory_allocated() / 1024**3
+    reserved = torch.cuda.memory_reserved() / 1024**3
+    print(f"Allocated: {allocated:.2f} GB, Reserved: {reserved:.2f} GB")
+
+# Add snapshots at key points in your code
+```
+
+---
+
+## Comparing Outputs
+
+### Needle-in-Haystack Test
+
+```bash
+# Test with CPU offload
+PYTHONPATH=/path/to/nano-vllm:$PYTHONPATH python tests/test_needle.py --enable-offload --input-len 8192
+
+# Test without CPU offload (GPU-only)
+PYTHONPATH=/path/to/nano-vllm:$PYTHONPATH python tests/test_needle.py --input-len 8192
+
+# Compare with reference implementation
+PYTHONPATH=/path/to/nano-vllm:$PYTHONPATH python tests/test_needle_ref.py --input-len 8192
+```
+
+### Tensor Comparison
+
+```python
+import torch
+
+
+def compare_tensors(a, b, name, rtol=1e-3, atol=1e-5):
+    if a.shape != b.shape:
+        print(f"{name}: Shape mismatch {a.shape} vs {b.shape}")
+        return False
+
+    diff = (a - b).abs()
+    max_diff = diff.max().item()
+    mean_diff = diff.mean().item()
+
+    close = torch.allclose(a, b, rtol=rtol, atol=atol)
+    print(f"{name}: max_diff={max_diff:.6f}, mean_diff={mean_diff:.6f}, close={close}")
+    return close
+```
--- a/docs/layerwise_offload_memory_analysis.md
+++ b/docs/layerwise_offload_memory_analysis.md
@@ -407,3 +407,141 @@ k_full = seq_len * kv_dim * dtype_size
 v_full = k_full  # = 256 MB
 # Total: 512 MB
 ```
+
+---
+
+## 8. Empirical Validation
+
+This section validates the theoretical memory analysis against actual measurements.
+
+### 8.1 Test Configuration
+
+```bash
+python tests/test_needle.py --enable-offload --input-len 100000 --block-size 1024
+```
+
+**Parameters:**
+- Model: Qwen3-4B-Instruct
+- `seq_len = 100000` (actual tokens: 99925)
+- `block_size = 1024`
+- `max_model_len = 131072`
+- `num_kv_buffers = 4`
+
+### 8.2 Theoretical Peak Memory Calculation
+
+#### Step 1: Model Load Memory
+
+| Component | Formula | Size |
+|-----------|---------|------|
+| Model weights | ~4B params × 2 bytes | ~8 GB |
+| Ring buffer | 2 × 4 × 131072 × 1024 × 2 | 2048 MB |
+| Decode buffer | 2 × 36 × 1024 × 1024 × 2 | 144 MB |
+| **Subtotal** | | **~10.2 GB** |
+
+#### Step 2: Prefill Activation Peak (per-layer)
+
+| Component | Formula | Size |
+|-----------|---------|------|
+| hidden_states | 100000 × 2560 × 2 | 512 MB |
+| residual | 100000 × 2560 × 2 | 512 MB |
+| MLP gate_up | 100000 × 27392 × 2 | **5478 MB** |
+| MLP silu×gate | 100000 × 13696 × 2 | 2739 MB |
+| Other intermediates (qkv, RoPE, attn) | ~1-2 GB | ~1500 MB |
+| **Subtotal** | | **~10 GB** |
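+
+The rows above can be reproduced with a few lines of arithmetic. This sketch takes its constants from the table (an intermediate size of 13696 is implied by the 27392-wide gate_up row) and uses decimal megabytes, the unit that matches the listed figures; the 1500 MB "other" entry is the table's rough estimate, not a derived value:
+
+```python
+SEQ_LEN = 100_000
+HIDDEN = 2560
+INTERMEDIATE = 13_696  # implied by the gate_up row: 2 x 13696 = 27392
+DTYPE_BYTES = 2        # 2-byte elements, as in the table formulas
+MB = 1_000_000
+
+activations_mb = {
+    "hidden_states": SEQ_LEN * HIDDEN * DTYPE_BYTES / MB,
+    "residual": SEQ_LEN * HIDDEN * DTYPE_BYTES / MB,
+    "mlp_gate_up": SEQ_LEN * 2 * INTERMEDIATE * DTYPE_BYTES / MB,
+    "mlp_silu_x_gate": SEQ_LEN * INTERMEDIATE * DTYPE_BYTES / MB,
+    "other_estimate": 1500.0,  # rough figure copied from the table
+}
+
+for name, size in activations_mb.items():
+    print(f"{name:>16}: {size:8.0f} MB")
+
+total_mb = sum(activations_mb.values())
+print(f"{'total':>16}: {total_mb:8.0f} MB ({total_mb * MB / 2**30:.1f} GiB)")
+```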
+
+#### Step 3: Total Peak
+
+```
+Total Peak = Model Load + Activation Peak
+           = 10.2 GB + 10 GB
+           = ~20.2 GB
+```
+
+### 8.3 Actual Measurement Results
+
+```python
+import torch
+torch.cuda.reset_peak_memory_stats()
+# ... run inference ...
+peak = torch.cuda.max_memory_allocated()
+```
+
+| Metric | Value |
+|--------|-------|
+| After model load | 9.82 GB |
+| Peak during inference | **20.02 GB** |
+| Activation peak (delta) | 10.20 GB |
+
+### 8.4 Comparison: Theory vs Actual
+
+| Component | Theoretical | Actual | Error |
+|-----------|-------------|--------|-------|
+| Model load memory | ~10.2 GB | 9.82 GB | -3.7% |
+| Activation peak | ~10 GB | 10.20 GB | +2.0% |
+| **Total peak** | **~20.2 GB** | **20.02 GB** | **< 1%** |
+
+### 8.5 Key Findings
+
+1. **Theoretical model is accurate**: < 5% error in all components.
+
+2. **MLP gate_up is the dominant temporary**:
+   - Size: ~5.48 GB (5478 MB) for 100k tokens
+   - Accounts for ~50% of activation peak
+   - Formula: `seq_len × 2 × intermediate_size × dtype_size`
+
+3. **Memory scaling with sequence length**:
+   | seq_len | Model Load | Activation Peak | Total Peak |
+   |---------|------------|-----------------|------------|
+   | 8k | ~10 GB | ~0.8 GB | ~11 GB |
+   | 32k | ~10 GB | ~3.2 GB | ~13 GB |
+   | 64k | ~10 GB | ~6.4 GB | ~16 GB |
+   | 100k | ~10 GB | ~10 GB | ~20 GB |
+   | 128k | ~10 GB | ~13 GB | ~23 GB |
+
+4. **Decode memory is much smaller**:
+   - Per-step: ~400 MB for k_full + v_full at 100k context (512 MB at 128K)
+   - Does not grow with decode steps (constant per layer)
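+
+The scaling table in finding 3 is consistent with a simple linear extrapolation from the 100k measurement. A rough sketch, assuming ~10 GB of fixed model-load memory, ~10 GB of activations per 100k tokens, and "k" read as 1000 tokens; `estimate_peak_gb` is a hypothetical helper:
+
+```python
+def estimate_peak_gb(seq_len: int, model_load_gb: float = 10.0,
+                     activation_gb_per_100k: float = 10.0) -> float:
+    """Linear estimate of peak GPU memory for a prefill of `seq_len` tokens."""
+    activation_gb = activation_gb_per_100k * seq_len / 100_000
+    return model_load_gb + activation_gb
+
+
+for seq_len in (8_000, 32_000, 64_000, 100_000, 128_000):
+    print(f"{seq_len:>7} tokens: ~{estimate_peak_gb(seq_len):.1f} GB")
+```
+
+The estimate only covers the prefill activation peak; it says nothing about CPU-side memory or about models with different layer widths.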
+
+### 8.6 Memory Profiling Script
+
+To reproduce the measurement:
+
+```python
+import os
+os.environ["NANOVLLM_LOG_LEVEL"] = "INFO"
+
+import torch
+from nanovllm import LLM, SamplingParams
+from tests.utils import generate_needle_prompt
+
+# Reset memory stats
+torch.cuda.reset_peak_memory_stats()
+torch.cuda.empty_cache()
+
+# Initialize LLM
+llm = LLM(
+    "path/to/model",
+    enforce_eager=True,
+    max_model_len=131072,
+    max_num_batched_tokens=131072,
+    enable_cpu_offload=True,
+    kvcache_block_size=1024,
+    num_gpu_blocks=2,
+)
+
+after_load = torch.cuda.memory_allocated()
+print(f"After model load: {after_load / 1024**3:.2f} GB")
+
+# Generate prompt and run inference
+prompt, expected = generate_needle_prompt(
+    tokenizer=llm.tokenizer,
+    target_length=100000,
+    needle_position=0.5,
+)
+
+torch.cuda.reset_peak_memory_stats()
+outputs = llm.generate([prompt], SamplingParams(max_tokens=32))
+
+peak = torch.cuda.max_memory_allocated()
+print(f"Peak during inference: {peak / 1024**3:.2f} GB")
+```
--- a/docs/sparse_attention_guide.md
+++ b/docs/sparse_attention_guide.md
@@ -440,3 +440,42 @@ Required libraries:
 - `minference`: For MInference vertical_slash kernel

 Docker image `tzj/xattn:v0.5` has all dependencies pre-installed.
+
+---
+
+## Quest Sparse Policy (nano-vLLM)
+
+**Files**: `nanovllm/kvcache/sparse/quest.py`, `nanovllm/kvcache/sparse/policy.py`
+
+Quest policy is used in nano-vLLM for CPU offload mode. It selects Top-K blocks based on query-key similarity bounds using min/max key metadata.
+
+### Scoring Mechanism
+
+```python
+score_min = torch.einsum('hd,bhd->bh', q, key_min)  # [num_blocks, kv_heads]
+score_max = torch.einsum('hd,bhd->bh', q, key_max)  # [num_blocks, kv_heads]
+scores = torch.maximum(score_min, score_max).mean(dim=-1)  # [num_blocks] ← averaged!
+```
+
+### Critical Limitation - No Per-Head Scheduling
+
+The `.mean(dim=-1)` averages scores across all heads, making a **unified** block selection for all heads:
+
+```
+Block A: head0 needs (+4), head1 doesn't (-4) → avg = 0 → NOT selected
+Block B: head0 doesn't (-4), head1 needs (+4) → avg = 0 → NOT selected
+Block C: both heads moderately need (+2, +2) → avg = +2 → selected
+```
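+
+The same three blocks can be run through a toy selector. This pure-Python sketch uses the illustrative scores from the example above; `select_unified` mirrors the head-averaged choice, while `select_per_head` is a hypothetical contrast case and not something nano-vLLM implements:
+
+```python
+scores = {"A": [4.0, -4.0], "B": [-4.0, 4.0], "C": [2.0, 2.0]}  # per-head scores
+
+
+def select_unified(block_scores, top_k):
+    """Average over heads, then pick one Top-K block set shared by all heads."""
+    averaged = {name: sum(per_head) / len(per_head)
+                for name, per_head in block_scores.items()}
+    return sorted(averaged, key=averaged.get, reverse=True)[:top_k]
+
+
+def select_per_head(block_scores, top_k):
+    """Pick a separate Top-K block set for each head (for contrast only)."""
+    num_heads = len(next(iter(block_scores.values())))
+    return [sorted(block_scores, key=lambda name: block_scores[name][head],
+                   reverse=True)[:top_k]
+            for head in range(num_heads)]
+
+
+print(select_unified(scores, top_k=1))   # ['C']
+print(select_per_head(scores, top_k=1))  # [['A'], ['B']]
+```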
+
+### Why Per-Head Scheduling is Infeasible
+
+1. **Memory Layout**: GPU cache stores all heads together `[block_size, kv_heads, head_dim]`
+2. **FlashAttention**: Requires complete heads - partial heads cause dimension mismatch
+3. **Block Granularity**: If any head needs a block, the entire block (all heads) must be loaded
+
+### Policy Types
+
+| Policy | `supports_prefill` | `supports_decode` | Description |
+|--------|-------------------|-------------------|-------------|
+| `FullAttentionPolicy` | True | True | Loads all blocks (baseline) |
+| `QuestPolicy` | False | True | Decode-only Top-K selection |