[claudesquad] update from 'lw-offload-2' on 08 Jan 26 21:19 CST
105
.claude/rules/doc-management.md
Normal file
@@ -0,0 +1,105 @@
# Documentation Management

## CLAUDE.md Content Policy

**CLAUDE.md should only contain operational requirements:**
- Environment setup (PYTHONPATH, GPU mutex)
- Execution requirements (how to run tests/benchmarks)
- Quick configuration reference
- Documentation index (links to detailed docs)

**Technical details should go to docs/:**
- Architecture and design explanations
- Implementation details and code flows
- Debugging techniques
- Memory analysis and profiling
- Algorithm explanations

## When Adding New Technical Content

Follow this workflow:

### Step 1: Analyze and Document

If doing technical analysis (e.g., memory profiling):
1. Calculate theoretical values using formulas
2. Run actual tests to measure real values
3. Compare theoretical vs actual (expect < 10% error for valid models)
4. Document findings with both theory and empirical validation
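
The comparison in step 3 can be sketched as a small helper. This is a minimal illustration, not code from the repository: the names `relative_error` and `is_valid_model` are hypothetical, the error is assumed to be normalized by the measured value (the rule does not say which value is the denominator), and the sample numbers are made up.

```python
def relative_error(theoretical: float, actual: float) -> float:
    """Return |actual - theoretical| / |actual| as a fraction (assumed convention)."""
    if actual == 0:
        raise ValueError("actual measurement must be non-zero")
    return abs(actual - theoretical) / abs(actual)


def is_valid_model(theoretical: float, actual: float, tolerance: float = 0.10) -> bool:
    """True when the theoretical value is within the tolerance (10% per step 3)."""
    return relative_error(theoretical, actual) < tolerance


# Illustrative numbers only, not measurements from this repository.
print(f"{relative_error(7.6, 8.0):.1%}")  # → 5.0%
print(is_valid_model(7.6, 8.0))           # → True
```

The resulting percentage is what would go in the `Error` column of the empirical validation table.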

### Step 2: Create/Update docs/

Create a new doc or update existing one in `docs/`:
```
docs/
├── architecture_guide.md                  # Core components, design, flows
├── sparse_attention_guide.md              # Sparse attention methods
├── layerwise_offload_memory_analysis.md   # Memory analysis
├── debugging_guide.md                     # Debugging techniques
└── <new_topic>_guide.md                   # New technical topic
```

### Step 3: Update CLAUDE.md Documentation Index

Add entry to the Documentation Index table:
```markdown
| Document | Purpose |
|----------|---------|
| [`docs/new_doc.md`](docs/new_doc.md) | Brief description |
```

### Step 4: Refactor if Needed

If CLAUDE.md grows too large (> 150 lines), refactor:
1. Identify technical details that can be moved
2. Create appropriate doc in docs/
3. Replace detailed content with reference link
4. Keep only operational essentials in CLAUDE.md
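
The size check that triggers Step 4 could be automated along these lines. This is only a sketch: `count_lines` and `needs_refactor` are hypothetical helpers, and the script assumes it is run from the repository root, where `CLAUDE.md` lives.

```python
from pathlib import Path

MAX_LINES = 150  # threshold stated in Step 4


def count_lines(text: str) -> int:
    """Count lines the same way an editor would (trailing newline not counted twice)."""
    return len(text.splitlines())


def needs_refactor(text: str, limit: int = MAX_LINES) -> bool:
    """True when the text exceeds the line limit and should be split into docs/."""
    return count_lines(text) > limit


if __name__ == "__main__":
    path = Path("CLAUDE.md")  # assumed location: repository root
    if path.exists():
        text = path.read_text(encoding="utf-8")
        print(f"{path}: {count_lines(text)} lines, refactor needed: {needs_refactor(text)}")
```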

## Documentation Structure Template

For new technical docs:

```markdown
# Topic Guide

Brief overview of what this document covers.

## Section 1: Concepts
- Key concepts and terminology

## Section 2: Implementation
- Code locations
- Key methods/functions

## Section 3: Details
- Detailed explanations
- Code examples

## Section 4: Validation (if applicable)
- Theoretical analysis
- Empirical measurements
- Comparison table
```

## Memory Analysis Template

When documenting memory behavior:

```markdown
## Theoretical Calculation

| Component | Formula | Size |
|-----------|---------|------|
| Buffer X | `param1 × param2 × dtype_size` | X MB |

## Empirical Validation

| Metric | Theoretical | Actual | Error |
|--------|-------------|--------|-------|
| Peak memory | X GB | Y GB | Z% |

## Key Findings
1. Finding 1
2. Finding 2
```
@@ -2,39 +2,47 @@
|
|||||||
|
|
||||||
## Do Not Create Unnecessary Documentation
|
## Do Not Create Unnecessary Documentation
|
||||||
|
|
||||||
**IMPORTANT**: Do NOT create extra markdown documentation files unless explicitly requested by the user.
|
**IMPORTANT**: Do NOT create extra markdown documentation files proactively unless:
|
||||||
|
1. User explicitly requests documentation
|
||||||
|
2. Refactoring CLAUDE.md to move technical details to docs/ (see `doc-management.md`)
|
||||||
|
|
||||||
### What NOT to do:
|
### What NOT to do:
|
||||||
|
|
||||||
- ❌ Do NOT create README files proactively
|
- Do NOT create README files proactively
|
||||||
- ❌ Do NOT create analysis documents (*.md) after completing tasks
|
- Do NOT create standalone analysis documents after completing tasks
|
||||||
- ❌ Do NOT create tutorial/guide documents
|
- Do NOT create summary documents without request
|
||||||
- ❌ Do NOT create summary documents
|
|
||||||
|
|
||||||
### What TO do:
|
### What TO do:
|
||||||
|
|
||||||
- ✅ Only create documentation when user explicitly asks for it
|
- Provide information directly in conversation by default
|
||||||
- ✅ Provide information directly in conversation instead
|
- When user requests documentation, follow `doc-management.md` workflow
|
||||||
- ✅ Update existing documentation if changes require it
|
- Update existing docs in `docs/` when code changes affect them
|
||||||
- ✅ Add inline code comments where necessary
|
- Keep CLAUDE.md concise (< 150 lines), move technical details to docs/
|
||||||
|
|
||||||
### Exceptions:
|
### Documentation Locations:
|
||||||
|
|
||||||
Documentation is acceptable ONLY when:
|
| Type | Location |
|
||||||
1. User explicitly requests "create a README" or "write documentation"
|
|------|----------|
|
||||||
2. Updating existing documentation to reflect code changes
|
| Operational requirements | CLAUDE.md |
|
||||||
3. Adding inline comments/docstrings to code itself
|
| Technical details | docs/*.md |
|
||||||
|
| Code comments | Inline in source |
|
||||||
|
|
||||||
### Examples:
|
### Examples:
|
||||||
|
|
||||||
**Bad** (Don't do this):
|
**Proactive docs (Don't do)**:
|
||||||
```
|
```
|
||||||
User: "Profile the code"
|
User: "Profile the code"
|
||||||
Assistant: [Creates profiling_results.md after profiling]
|
Assistant: [Creates profiling_results.md without being asked]
|
||||||
```
|
```
|
||||||
|
|
||||||
**Good** (Do this instead):
|
**On-request docs (Do this)**:
|
||||||
```
|
```
|
||||||
User: "Profile the code"
|
User: "Profile the code and document the findings"
|
||||||
Assistant: [Runs profiling, shows results in conversation]
|
Assistant: [Runs profiling, creates/updates docs/memory_analysis.md]
|
||||||
|
```
|
||||||
|
|
||||||
|
**Refactoring (Do this)**:
|
||||||
|
```
|
||||||
|
User: "CLAUDE.md is too long, refactor it"
|
||||||
|
Assistant: [Moves technical sections to docs/, updates CLAUDE.md index]
|
||||||
```
|
```
|
||||||
|
|||||||
269
CLAUDE.md
@@ -27,17 +27,6 @@ Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline L
|
|||||||
|
|
||||||
3. **Only proceed** when `nvidia-smi --query-compute-apps=pid --format=csv,noheader` returns empty output
|
3. **Only proceed** when `nvidia-smi --query-compute-apps=pid --format=csv,noheader` returns empty output
|
||||||
|
|
||||||
**Example workflow**:
|
|
||||||
```bash
|
|
||||||
# First check if GPU is in use
|
|
||||||
nvidia-smi --query-compute-apps=pid,name,used_memory --format=csv,noheader
|
|
||||||
|
|
||||||
# If output is empty, proceed with your command
|
|
||||||
python bench_offload.py
|
|
||||||
|
|
||||||
# If output shows processes, wait until they finish
|
|
||||||
```
|
|
||||||
|
|
||||||
**Note**: This applies to ALL GPU operations including:
|
**Note**: This applies to ALL GPU operations including:
|
||||||
- Running tests (`python tests/test_*.py`)
|
- Running tests (`python tests/test_*.py`)
|
||||||
- Running benchmarks (`python bench*.py`)
|
- Running benchmarks (`python bench*.py`)
|
||||||
@@ -63,256 +52,14 @@ PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH python tests/test_needle.py
|
|||||||
- Code changes take effect immediately (no reinstall needed)
|
- Code changes take effect immediately (no reinstall needed)
|
||||||
- Each worktree is completely isolated
|
- Each worktree is completely isolated
|
||||||
|
|
||||||
**For shell session** (optional):
|
## Documentation Index
|
||||||
```bash
|
|
||||||
export PYTHONPATH=/path/to/your/worktree:$PYTHONPATH
|
|
||||||
python tests/test_needle.py # PYTHONPATH already set
|
|
||||||
```
|
|
||||||
|
|
||||||
## Sparse Attention
|
| Document | Purpose |
|
||||||
|
|----------|---------|
|
||||||
For sparse attention related content (block sparse attention, MInference, FlexPrefill, XAttention, AvgPool, etc.), refer to [`docs/sparse_attention_guide.md`](docs/sparse_attention_guide.md).
|
| [`docs/architecture_guide.md`](docs/architecture_guide.md) | Core components, layer-wise CPU offload design, prefill/decode flows, implementation details |
|
||||||
|
| [`docs/sparse_attention_guide.md`](docs/sparse_attention_guide.md) | Block sparse attention methods (MInference, FlexPrefill, XAttention, Quest), computation flow |
|
||||||
### Quest Sparse Policy
|
| [`docs/layerwise_offload_memory_analysis.md`](docs/layerwise_offload_memory_analysis.md) | Memory allocation analysis with theoretical formulas and empirical validation (< 5% error) |
|
||||||
|
| [`docs/debugging_guide.md`](docs/debugging_guide.md) | PyTorch hooks for debugging, tensor comparison, memory profiling |
|
||||||
**Files**: `nanovllm/kvcache/sparse/quest.py`, `nanovllm/kvcache/sparse/policy.py`
|
|
||||||
|
|
||||||
Quest policy selects Top-K blocks based on query-key similarity bounds using min/max key metadata.
|
|
||||||
|
|
||||||
**Scoring Mechanism**:
|
|
||||||
```python
|
|
||||||
score_min = torch.einsum('hd,bhd->bh', q, key_min) # [num_blocks, kv_heads]
|
|
||||||
score_max = torch.einsum('hd,bhd->bh', q, key_max) # [num_blocks, kv_heads]
|
|
||||||
scores = torch.maximum(score_min, score_max).mean(dim=-1) # [num_blocks] ← averaged!
|
|
||||||
```
|
|
||||||
|
|
||||||
**Critical Limitation - No Per-Head Scheduling**:
|
|
||||||
|
|
||||||
The `.mean(dim=-1)` averages scores across all heads, making a **unified** block selection for all heads:
|
|
||||||
|
|
||||||
```
|
|
||||||
Block A: head0 needs (+4), head1 doesn't (-4) → avg = 0 → NOT selected
|
|
||||||
Block B: head0 doesn't (-4), head1 needs (+4) → avg = 0 → NOT selected
|
|
||||||
Block C: both heads moderately need (+2, +2) → avg = +2 → selected
|
|
||||||
```
|
|
||||||
|
|
||||||
**Why Per-Head Scheduling is Infeasible**:
|
|
||||||
1. **Memory Layout**: GPU cache stores all heads together `[block_size, kv_heads, head_dim]`
|
|
||||||
2. **FlashAttention**: Requires complete heads - partial heads cause dimension mismatch
|
|
||||||
3. **Block Granularity**: If any head needs a block, the entire block (all heads) must be loaded
|
|
||||||
|
|
||||||
**Policy Types**:
|
|
||||||
- `FullAttentionPolicy`: `supports_prefill=True, supports_decode=True` - loads all blocks
|
|
||||||
- `QuestPolicy`: `supports_prefill=False, supports_decode=True` - decode-only Top-K selection
|
|
||||||
|
|
||||||
## Architecture
|
|
||||||
|
|
||||||
### Core Components
|
|
||||||
|
|
||||||
- **LLMEngine** (`llm_engine.py`): Main entry, runs prefill-decode loop
|
|
||||||
- **ModelRunner** (`model_runner.py`): Loads weights, allocates KV cache, CUDA graphs, layer-wise offload
|
|
||||||
- **Scheduler** (`scheduler.py`): Two-phase scheduling (prefill → decode)
|
|
||||||
- **BlockManager** (`block_manager.py`): Paged attention with prefix caching (xxhash), default block size 4096
|
|
||||||
- **Attention** (`layers/attention.py`): FlashAttention for standard inference
|
|
||||||
|
|
||||||
## PyTorch Hooks for Debugging
|
|
||||||
|
|
||||||
### Hook Positions in Qwen3
|
|
||||||
|
|
||||||
```
|
|
||||||
decoder_layer
|
|
||||||
├── input_layernorm (RMSNorm)
|
|
||||||
├── self_attn (Qwen3Attention) ← Hook here for attention I/O after o_proj
|
|
||||||
│ ├── q_proj → q_norm → RoPE
|
|
||||||
│ ├── k_proj → k_norm → RoPE
|
|
||||||
│ ├── v_proj
|
|
||||||
│ ├── attn (Attention) ← Hook here for Q/K/V tensors
|
|
||||||
│ │ └── FlashAttention / SDPA
|
|
||||||
│ └── o_proj
|
|
||||||
├── post_attention_layernorm (RMSNorm)
|
|
||||||
└── mlp (Qwen3MLP)
|
|
||||||
```
|
|
||||||
|
|
||||||
### Hook Types & Data Shapes
|
|
||||||
|
|
||||||
| Hook Position | Type | Captured Data |
|
|
||||||
|---------------|------|---------------|
|
|
||||||
| `self_attn` | post | `[batch, seq_len, hidden_size]` - after o_proj |
|
|
||||||
| `self_attn.attn` | pre | Q,K,V: `[seq_len, num_heads, head_dim]` - after RoPE |
|
|
||||||
| `self_attn.attn` | post | `[seq_len, num_heads, head_dim]` - before o_proj |
|
|
||||||
|
|
||||||
### Example: Capture Attention Outputs
|
|
||||||
|
|
||||||
```python
|
|
||||||
storage = {}
|
|
||||||
|
|
||||||
def make_hook(layer_id: int, storage: dict):
|
|
||||||
def hook(module, inputs, output):
|
|
||||||
if isinstance(output, tuple):
|
|
||||||
attn_output = output[0]
|
|
||||||
else:
|
|
||||||
attn_output = output
|
|
||||||
# nanovllm shape: [num_tokens, hidden_size] -> add batch dim
|
|
||||||
if attn_output.dim() == 2:
|
|
||||||
attn_output = attn_output.unsqueeze(0)
|
|
||||||
storage[layer_id] = attn_output.detach().clone()
|
|
||||||
return hook
|
|
||||||
|
|
||||||
# Register hooks
|
|
||||||
hooks = []
|
|
||||||
for layer_idx, layer in enumerate(model.model.layers):
|
|
||||||
hooks.append(layer.self_attn.register_forward_hook(make_hook(layer_idx, storage)))
|
|
||||||
|
|
||||||
# Run inference...
|
|
||||||
|
|
||||||
# Cleanup
|
|
||||||
for hook in hooks:
|
|
||||||
hook.remove()
|
|
||||||
```
|
|
||||||
|
|
||||||
### Reference Implementation
|
|
||||||
|
|
||||||
Key files:
|
|
||||||
- `tests/modeling_qwen3.py`: Reference Qwen3 implementation (torch + transformers only)
|
|
||||||
- `tests/test_needle_ref.py`: Reference needle test using custom Qwen3
|
|
||||||
- `tests/test_needle.py`: Needle-in-haystack test for nanovllm
|
|
||||||
|
|
||||||
### Common Pitfalls
|
|
||||||
|
|
||||||
1. **Shape mismatch**: nanovllm uses `[num_tokens, ...]` while torch uses `[batch, seq_len, ...]`
|
|
||||||
2. **Hook position**: `self_attn` captures after o_proj, `self_attn.attn` captures before o_proj
|
|
||||||
3. **Output format**: nanovllm returns tuple `(attn_output, None)`, handle with `output[0]`
|
|
||||||
|
|
||||||
## Layer-wise CPU Offload System
|
|
||||||
|
|
||||||
### Design Philosophy
|
|
||||||
|
|
||||||
Unlike chunked prefill (which processes chunks across all layers), **layer-wise offload** processes the entire sequence through one layer at a time:
|
|
||||||
|
|
||||||
```
|
|
||||||
Layer 0: [full sequence] → compute → offload K,V to CPU
|
|
||||||
Layer 1: [full sequence] → compute → offload K,V to CPU
|
|
||||||
...
|
|
||||||
Layer N: [full sequence] → compute → offload K,V to CPU
|
|
||||||
```
|
|
||||||
|
|
||||||
**Benefits**:
|
|
||||||
- Supports MInference sparse attention (requires full KV access per layer)
|
|
||||||
- Simpler memory management (one layer's KV in GPU at a time)
|
|
||||||
- Peak GPU memory = one layer's KV cache + attention workspace
|
|
||||||
|
|
||||||
### Key Files
|
|
||||||
|
|
||||||
- `nanovllm/engine/model_runner.py`: Main implementation (`run_layerwise_offload_prefill`, `run_layerwise_offload_decode`)
|
|
||||||
- `nanovllm/kvcache/hybrid_manager.py`: CPU block management helpers
|
|
||||||
- `nanovllm/kvcache/offload_engine.py`: CPU/GPU cache storage
|
|
||||||
|
|
||||||
### Memory Layout
|
|
||||||
|
|
||||||
**CPU Cache** (pinned memory):
|
|
||||||
```python
|
|
||||||
k_cache_cpu: [num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]
|
|
||||||
v_cache_cpu: [num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]
|
|
||||||
```
|
|
||||||
|
|
||||||
**Per-layer KV size** (Qwen3-4B: 8 kv_heads × 128 head_dim × 2 bytes × 2 for K+V = 4KB/token):
|
|
||||||
|
|
||||||
| Context Length | KV per Layer |
|
|
||||||
|----------------|--------------|
|
|
||||||
| 128K tokens | 512 MB |
|
|
||||||
| 256K tokens | 1 GB |
|
|
||||||
| 512K tokens | 2 GB |
|
|
||||||
| 1M tokens | 4 GB |
|
|
||||||
|
|
||||||
### Prefill Flow
|
|
||||||
|
|
||||||
```python
|
|
||||||
def run_layerwise_offload_prefill(self, seqs: list[Sequence]) -> list[int]:
|
|
||||||
# 1. Embedding
|
|
||||||
hidden_states = self.model.model.embed_tokens(input_ids)
|
|
||||||
|
|
||||||
# 2. Process each layer
|
|
||||||
for layer_id in range(num_layers):
|
|
||||||
# QKV projection + norms + RoPE
|
|
||||||
q = apply_rotary_pos_emb(q_proj(hidden_states), cos, sin)
|
|
||||||
k = apply_rotary_pos_emb(k_proj(hidden_states), cos, sin)
|
|
||||||
v = v_proj(hidden_states)
|
|
||||||
|
|
||||||
# Full FlashAttention (entire sequence)
|
|
||||||
attn_out = flash_attn_varlen_func(q, k, v, cu_seqlens, max_seqlen, causal=True)
|
|
||||||
|
|
||||||
# MLP
|
|
||||||
hidden_states = mlp(attn_out + residual)
|
|
||||||
|
|
||||||
# Synchronous offload to CPU (CRITICAL: must be sync to avoid memory reuse bugs)
|
|
||||||
self._offload_layer_kv_to_cpu_sync(layer_id, k, v, cpu_block_ids, total_tokens)
|
|
||||||
|
|
||||||
# 3. Final norm + sampling
|
|
||||||
return sampled_tokens
|
|
||||||
```
|
|
||||||
|
|
||||||
### Decode Flow
|
|
||||||
|
|
||||||
```python
|
|
||||||
def run_layerwise_offload_decode(self, seqs: list[Sequence]) -> list[int]:
|
|
||||||
# For each layer:
|
|
||||||
for layer_id in range(num_layers):
|
|
||||||
# 1. Load all prefilled KV from CPU
|
|
||||||
for block_idx, cpu_block_id in enumerate(cpu_block_table):
|
|
||||||
k_block = offload_engine.k_cache_cpu[layer_id, cpu_block_id, :valid_tokens].to("cuda")
|
|
||||||
v_block = offload_engine.v_cache_cpu[layer_id, cpu_block_id, :valid_tokens].to("cuda")
|
|
||||||
|
|
||||||
# 2. Compute new Q,K,V for current token
|
|
||||||
q_new = apply_rotary_pos_emb(q_proj(hidden_states), cos, sin)
|
|
||||||
k_new = apply_rotary_pos_emb(k_proj(hidden_states), cos, sin)
|
|
||||||
v_new = v_proj(hidden_states)
|
|
||||||
|
|
||||||
# 3. Concatenate and compute attention
|
|
||||||
k_full = torch.cat([k_prefill, k_new], dim=0)
|
|
||||||
v_full = torch.cat([v_prefill, v_new], dim=0)
|
|
||||||
attn_out = flash_attn_varlen_func(q_new, k_full, v_full, ..., causal=False)
|
|
||||||
# Note: causal=False because single query token should attend to ALL keys
|
|
||||||
```
|
|
||||||
|
|
||||||
### Critical Implementation Details
|
|
||||||
|
|
||||||
**1. Synchronous Offload Required**
|
|
||||||
|
|
||||||
Async offload with `non_blocking=True` causes memory reuse bugs:
|
|
||||||
```python
|
|
||||||
# BUG: PyTorch may reuse k,v GPU memory before async copy completes
|
|
||||||
offload_engine.k_cache_cpu[layer_id, block_id].copy_(k[start:end], non_blocking=True)
|
|
||||||
|
|
||||||
# CORRECT: Synchronous copy ensures data integrity
|
|
||||||
offload_engine.k_cache_cpu[layer_id, block_id, :size].copy_(k[start:end]) # sync
|
|
||||||
```
|
|
||||||
|
|
||||||
**2. Decode Attention: causal=False**
|
|
||||||
|
|
||||||
During decode, the single query token must attend to ALL keys (not just preceding ones):
|
|
||||||
```python
|
|
||||||
# Prefill: causal=True (each token only attends to previous tokens)
|
|
||||||
attn_out = flash_attn_varlen_func(..., causal=True)
|
|
||||||
|
|
||||||
# Decode: causal=False (query at position N attends to all N-1 prefill + itself)
|
|
||||||
attn_out = flash_attn_varlen_func(..., causal=False)
|
|
||||||
```
|
|
||||||
|
|
||||||
### Helper Methods in HybridKVCacheManager
|
|
||||||
|
|
||||||
```python
|
|
||||||
# Get all CPU blocks for a sequence
|
|
||||||
cpu_blocks = manager.get_all_cpu_blocks(seq) # List[int]
|
|
||||||
|
|
||||||
# Get only prefilled (offloaded) CPU blocks
|
|
||||||
prefilled_blocks = manager.get_prefilled_cpu_blocks(seq) # List[int]
|
|
||||||
|
|
||||||
# Get cached prefill length (doesn't change during decode)
|
|
||||||
prefill_len = manager.get_prefill_len(seq) # int
|
|
||||||
|
|
||||||
# Get decode start position
|
|
||||||
decode_pos = manager.get_decode_start_pos(seq) # int
|
|
||||||
```
|
|
||||||
|
|
||||||
## Configuration
|
## Configuration
|
||||||
|
|
||||||
@@ -322,6 +69,8 @@ decode_pos = manager.get_decode_start_pos(seq) # int
|
|||||||
| `max_num_batched_tokens` | 16384 | Set = max_model_len for long context |
|
| `max_num_batched_tokens` | 16384 | Set = max_model_len for long context |
|
||||||
| `gpu_memory_utilization` | 0.9 | GPU memory fraction |
|
| `gpu_memory_utilization` | 0.9 | GPU memory fraction |
|
||||||
| `enable_cpu_offload` | False | Enable for long context |
|
| `enable_cpu_offload` | False | Enable for long context |
|
||||||
|
| `num_gpu_blocks` | 2 | GPU blocks for offload mode |
|
||||||
|
| `num_kv_buffers` | 4 | Ring buffer size for decode pipeline |
|
||||||
|
|
||||||
## Benchmarking
|
## Benchmarking
|
||||||
|
|
||||||
|
|||||||
189
docs/architecture_guide.md
Normal file
@@ -0,0 +1,189 @@
# Architecture Guide

This document describes the core architecture and layer-wise CPU offload system of nano-vLLM.

## Core Components

| Component | File | Purpose |
|-----------|------|---------|
| **LLMEngine** | `llm_engine.py` | Main entry, runs prefill-decode loop |
| **ModelRunner** | `model_runner.py` | Loads weights, allocates KV cache, CUDA graphs, layer-wise offload |
| **Scheduler** | `scheduler.py` | Two-phase scheduling (prefill → decode) |
| **BlockManager** | `block_manager.py` | Paged attention with prefix caching (xxhash), default block size 4096 |
| **Attention** | `layers/attention.py` | FlashAttention for standard inference |

## Layer-wise CPU Offload System

### Design Philosophy

Unlike chunked prefill (which processes chunks across all layers), **layer-wise offload** processes the entire sequence through one layer at a time:

```
Layer 0: [full sequence] → compute → offload K,V to CPU
Layer 1: [full sequence] → compute → offload K,V to CPU
...
Layer N: [full sequence] → compute → offload K,V to CPU
```

**Benefits**:
- Supports MInference sparse attention (requires full KV access per layer)
- Simpler memory management (one layer's KV in GPU at a time)
- Peak GPU memory = one layer's KV cache + attention workspace

### Key Files

| File | Purpose |
|------|---------|
| `nanovllm/engine/model_runner.py` | Main implementation (`run_layerwise_offload_prefill`, `run_layerwise_offload_decode`) |
| `nanovllm/kvcache/hybrid_manager.py` | CPU block management helpers |
| `nanovllm/kvcache/offload_engine.py` | CPU/GPU cache storage, ring buffer, async transfers |

### Memory Layout

**CPU Cache** (pinned memory):
```python
k_cache_cpu: [num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]
v_cache_cpu: [num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]
```

**GPU Ring Buffer** (for decode H2D pipeline):
```python
layer_k_cache: [num_kv_buffers, max_seq_len, kv_heads, head_dim]
layer_v_cache: [num_kv_buffers, max_seq_len, kv_heads, head_dim]
```

**Per-layer KV size** (Qwen3-4B: 8 kv_heads × 128 head_dim × 2 bytes × 2 for K+V = 4KB/token):

| Context Length | KV per Layer |
|----------------|--------------|
| 128K tokens | 512 MB |
| 256K tokens | 1 GB |
| 512K tokens | 2 GB |
| 1M tokens | 4 GB |
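
The table above follows directly from the per-token formula. The sketch below reproduces it; the function names are hypothetical, and it assumes binary units throughout (1K tokens = 1024 tokens, 1 MB = 2^20 bytes), which is the reading under which the table's round numbers come out exactly.

```python
def kv_bytes_per_token(kv_heads: int = 8, head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Bytes of K plus V stored for one token in one layer (defaults: Qwen3-4B, 2-byte dtype)."""
    return kv_heads * head_dim * dtype_bytes * 2  # final factor 2 covers K and V


def kv_bytes_per_layer(num_tokens: int, **kwargs: int) -> int:
    """Bytes of KV cache one layer needs for a sequence of num_tokens tokens."""
    return num_tokens * kv_bytes_per_token(**kwargs)


for label, tokens in [("128K", 128 * 1024), ("256K", 256 * 1024),
                      ("512K", 512 * 1024), ("1M", 1024 * 1024)]:
    mib = kv_bytes_per_layer(tokens) / 2**20
    print(f"{label} tokens: {mib:.0f} MiB per layer")
# → 128K tokens: 512 MiB per layer
# → 256K tokens: 1024 MiB per layer
# → 512K tokens: 2048 MiB per layer
# → 1M tokens: 4096 MiB per layer
```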

---

## Prefill Flow

```python
def run_layerwise_offload_prefill(self, seqs: list[Sequence]) -> list[int]:
    # 1. Embedding
    hidden_states = self.model.model.embed_tokens(input_ids)

    # 2. Process each layer
    for layer_id in range(num_layers):
        # QKV projection + norms + RoPE
        q = apply_rotary_pos_emb(q_proj(hidden_states), cos, sin)
        k = apply_rotary_pos_emb(k_proj(hidden_states), cos, sin)
        v = v_proj(hidden_states)

        # Full FlashAttention (entire sequence)
        attn_out = flash_attn_varlen_func(q, k, v, cu_seqlens, max_seqlen, causal=True)

        # MLP
        hidden_states = mlp(attn_out + residual)

        # Synchronous offload to CPU (CRITICAL: must be sync to avoid memory reuse bugs)
        self._offload_layer_kv_to_cpu_sync(layer_id, k, v, cpu_block_ids, total_tokens)

    # 3. Final norm + sampling
    return sampled_tokens
```

---

## Decode Flow

```python
def run_layerwise_offload_decode(self, seqs: list[Sequence]) -> list[int]:
    # Ring buffer pipeline: preload first N layers
    for i in range(num_buffers):
        offload_engine.load_layer_kv_to_buffer(i, i, cpu_block_table, valid_tokens)

    # For each layer:
    for layer_id in range(num_layers):
        current_buffer = layer_id % num_buffers

        # 1. Wait for buffer load to complete
        offload_engine.wait_buffer_load(current_buffer)

        # 2. Get prefilled KV from ring buffer
        k_prefill, v_prefill = offload_engine.get_buffer_kv(current_buffer, total_prefill_tokens)

        # 3. Compute new Q,K,V for current token
        q_new = apply_rotary_pos_emb(q_proj(hidden_states), cos, sin)
        k_new = apply_rotary_pos_emb(k_proj(hidden_states), cos, sin)
        v_new = v_proj(hidden_states)

        # 4. Concatenate and compute attention
        k_full = torch.cat([k_prefill, k_new], dim=0)
        v_full = torch.cat([v_prefill, v_new], dim=0)
        attn_out = flash_attn_varlen_func(q_new, k_full, v_full, ..., causal=False)
        # Note: causal=False because single query token should attend to ALL keys

        # 5. Mark buffer done, start loading next layer
        offload_engine.record_buffer_compute_done(current_buffer)
        if layer_id + num_buffers < num_layers:
            offload_engine.load_layer_kv_to_buffer(current_buffer, layer_id + num_buffers, ...)
```

---

## Critical Implementation Details

### 1. Synchronous Offload Required

Async offload with `non_blocking=True` causes memory reuse bugs:

```python
# BUG: PyTorch may reuse k,v GPU memory before async copy completes
offload_engine.k_cache_cpu[layer_id, block_id].copy_(k[start:end], non_blocking=True)

# CORRECT: Synchronous copy ensures data integrity
offload_engine.k_cache_cpu[layer_id, block_id, :size].copy_(k[start:end])  # sync
```

### 2. Decode Attention: causal=False

During decode, the single query token must attend to ALL keys (not just preceding ones):

```python
# Prefill: causal=True (each token only attends to previous tokens)
attn_out = flash_attn_varlen_func(..., causal=True)

# Decode: causal=False (query at position N attends to all N-1 prefill + itself)
attn_out = flash_attn_varlen_func(..., causal=False)
```

### 3. Ring Buffer Synchronization

The ring buffer pipeline requires careful ordering:

```python
# CORRECT order:
offload_engine.store_decode_kv(layer_id, pos, k_new, v_new)  # Store new KV
offload_engine.record_buffer_compute_done(current_buffer)    # Mark done FIRST
offload_engine.load_layer_kv_to_buffer(...)                  # THEN start next load

# BUG: Starting load before marking done causes race condition
offload_engine.load_layer_kv_to_buffer(...)  # WRONG: buffer still in use!
offload_engine.record_buffer_compute_done(current_buffer)
```
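
The ordering rule above can be checked with a small pure-Python simulation. This is a sketch under stated assumptions, not the real engine: `simulate_ring_buffer` is a hypothetical name, no GPU or `offload_engine` calls are made, and the `in_use` flag stands in for whatever synchronization `record_buffer_compute_done` performs. It shows which layer each buffer holds and asserts that a load never overwrites a buffer that is still being read.

```python
def simulate_ring_buffer(num_layers: int, num_buffers: int) -> list[tuple[int, int]]:
    """Return (layer_id, buffer_idx) pairs in compute order for the decode pipeline."""
    buffers = [None] * num_buffers   # layer currently held by each buffer slot
    in_use = [False] * num_buffers   # True while attention is reading the slot

    def load(buffer_idx: int, layer_id: int) -> None:
        # A load may only overwrite a slot whose compute has been marked done.
        assert not in_use[buffer_idx], "race: buffer still in use"
        buffers[buffer_idx] = layer_id

    # Preload the first layers (capped in case there are fewer layers than buffers).
    for i in range(min(num_buffers, num_layers)):
        load(i, i)

    schedule = []
    for layer_id in range(num_layers):
        current = layer_id % num_buffers
        assert buffers[current] == layer_id  # stands in for waiting on the load
        in_use[current] = True               # attention reads the slot
        schedule.append((layer_id, current))
        in_use[current] = False              # mark compute done FIRST
        next_layer = layer_id + num_buffers
        if next_layer < num_layers:
            load(current, next_layer)        # THEN start the next load
    return schedule


print(simulate_ring_buffer(num_layers=6, num_buffers=4))
# → [(0, 0), (1, 1), (2, 2), (3, 3), (4, 0), (5, 1)]
```

Swapping the last two steps inside the loop (calling `load` before clearing `in_use`) trips the assertion, which mirrors the race described above.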

---

## Helper Methods in HybridKVCacheManager

```python
# Get all CPU blocks for a sequence
cpu_blocks = manager.get_all_cpu_blocks(seq)  # List[int]

# Get only prefilled (offloaded) CPU blocks
prefilled_blocks = manager.get_prefilled_cpu_blocks(seq)  # List[int]

# Get cached prefill length (doesn't change during decode)
prefill_len = manager.get_prefill_len(seq)  # int

# Get decode start position
decode_pos = manager.get_decode_start_pos(seq)  # int
```
142
docs/debugging_guide.md
Normal file
@@ -0,0 +1,142 @@
|
|||||||
|
# Debugging Guide
|
||||||
|
|
||||||
|
This document provides debugging techniques for nano-vLLM, including PyTorch hooks for capturing intermediate tensors.
|
||||||
|
|
||||||
|
## PyTorch Hooks for Debugging
|
||||||
|
|
||||||
|
### Hook Positions in Qwen3
|
||||||
|
|
||||||
|
```
|
||||||
|
decoder_layer
|
||||||
|
├── input_layernorm (RMSNorm)
|
||||||
|
├── self_attn (Qwen3Attention) ← Hook here for attention I/O after o_proj
|
||||||
|
│ ├── q_proj → q_norm → RoPE
|
||||||
|
│ ├── k_proj → k_norm → RoPE
|
||||||
|
│ ├── v_proj
|
||||||
|
│ ├── attn (Attention) ← Hook here for Q/K/V tensors
|
||||||
|
│ │ └── FlashAttention / SDPA
|
||||||
|
│ └── o_proj
|
||||||
|
├── post_attention_layernorm (RMSNorm)
|
||||||
|
└── mlp (Qwen3MLP)
|
||||||
|
```
|
||||||
|
|
||||||
|
### Hook Types & Data Shapes
|
||||||
|
|
||||||
|
| Hook Position | Type | Captured Data |
|
||||||
|
|---------------|------|---------------|
|
||||||
|
| `self_attn` | post | `[batch, seq_len, hidden_size]` - after o_proj |
|
||||||
|
| `self_attn.attn` | pre | Q,K,V: `[seq_len, num_heads, head_dim]` - after RoPE |
|
||||||
|
| `self_attn.attn` | post | `[seq_len, num_heads, head_dim]` - before o_proj |
|
||||||
|
|
||||||
|
### Example: Capture Attention Outputs
|
||||||
|
|
||||||
|
```python
|
||||||
|
storage = {}
|
||||||
|
|
||||||
|
def make_hook(layer_id: int, storage: dict):
|
||||||
|
def hook(module, inputs, output):
|
||||||
|
if isinstance(output, tuple):
|
||||||
|
attn_output = output[0]
|
||||||
|
else:
|
||||||
|
attn_output = output
|
||||||
|
# nanovllm shape: [num_tokens, hidden_size] -> add batch dim
|
||||||
|
if attn_output.dim() == 2:
|
||||||
|
attn_output = attn_output.unsqueeze(0)
|
||||||
|
storage[layer_id] = attn_output.detach().clone()
|
||||||
|
return hook
|
||||||
|
|
||||||
|
# Register hooks
|
||||||
|
hooks = []
|
||||||
|
for layer_idx, layer in enumerate(model.model.layers):
|
||||||
|
hooks.append(layer.self_attn.register_forward_hook(make_hook(layer_idx, storage)))
|
||||||
|
|
||||||
|
# Run inference...
|
||||||
|
|
||||||
|
# Cleanup
|
||||||
|
for hook in hooks:
|
||||||
|
hook.remove()
|
||||||
|
```
|
||||||
|
|
||||||
|
### Reference Implementation
|
||||||
|
|
||||||
|
Key files for comparison testing:
|
||||||
|
|
||||||
|
| File | Purpose |
|------|---------|
| `tests/modeling_qwen3.py` | Reference Qwen3 implementation (torch + transformers only) |
| `tests/test_needle_ref.py` | Reference needle test using custom Qwen3 |
| `tests/test_needle.py` | Needle-in-haystack test for nanovllm |
|
||||||
|
|
||||||
|
### Common Pitfalls
|
||||||
|
|
||||||
|
1. **Shape mismatch**: nanovllm uses `[num_tokens, ...]` while the torch reference implementation uses `[batch, seq_len, ...]`
|
||||||
|
2. **Hook position**: `self_attn` captures after o_proj, `self_attn.attn` captures before o_proj
|
||||||
|
3. **Output format**: nanovllm returns a tuple `(attn_output, None)`; take `output[0]` before comparing
|
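The three pitfalls above can be handled in one place before any comparison. The helper below is a hypothetical sketch (`normalize_attn_output` is not part of nanovllm): it unwraps tuple outputs, optionally merges the head dimensions of a tensor hooked before o_proj, and adds the missing batch dimension.

```python
import torch


def normalize_attn_output(output, before_o_proj: bool = False) -> torch.Tensor:
    """Hypothetical helper: bring a hooked output to [batch, seq_len, features]."""
    # Pitfall 3: unwrap tuple outputs such as (attn_output, None).
    tensor = output[0] if isinstance(output, tuple) else output
    # Pitfall 2: outputs hooked before o_proj are [seq_len, num_heads, head_dim];
    # merge the head dimensions so the last dim is num_heads * head_dim.
    if before_o_proj and tensor.dim() == 3:
        tensor = tensor.flatten(start_dim=1)
    # Pitfall 1: add the batch dimension that the flat token layout lacks.
    if tensor.dim() == 2:
        tensor = tensor.unsqueeze(0)
    return tensor.detach().clone()


after_o_proj = normalize_attn_output((torch.randn(7, 16), None))
pre_o_proj = normalize_attn_output(torch.randn(7, 2, 8), before_o_proj=True)
```

Only compare tensors captured at the same hook position: values before and after o_proj differ even when their shapes happen to match.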
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Memory Debugging
|
||||||
|
|
||||||
|
### Track Peak GPU Memory
|
||||||
|
|
||||||
|
```python
|
||||||
|
import torch
|
||||||
|
|
||||||
|
# Reset stats before operation
|
||||||
|
torch.cuda.reset_peak_memory_stats()
|
||||||
|
torch.cuda.empty_cache()
|
||||||
|
|
||||||
|
# Run operation
|
||||||
|
outputs = llm.generate([prompt], sampling_params)
|
||||||
|
|
||||||
|
# Check peak
|
||||||
|
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
|
||||||
|
print(f"Peak GPU memory: {peak_gb:.2f} GB")
|
||||||
|
```
|
||||||
|
|
||||||
|
### Monitor Memory During Execution
|
||||||
|
|
||||||
|
```python
import torch

def memory_snapshot():
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    print(f"Allocated: {allocated:.2f} GB, Reserved: {reserved:.2f} GB")

# Add snapshots at key points in your code
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Comparing Outputs
|
||||||
|
|
||||||
|
### Needle-in-Haystack Test
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Test with CPU offload
|
||||||
|
PYTHONPATH=/path/to/nano-vllm:$PYTHONPATH python tests/test_needle.py --enable-offload --input-len 8192
|
||||||
|
|
||||||
|
# Test without CPU offload (GPU-only)
|
||||||
|
PYTHONPATH=/path/to/nano-vllm:$PYTHONPATH python tests/test_needle.py --input-len 8192
|
||||||
|
|
||||||
|
# Compare with reference implementation
|
||||||
|
PYTHONPATH=/path/to/nano-vllm:$PYTHONPATH python tests/test_needle_ref.py --input-len 8192
|
||||||
|
```
|
||||||
|
|
||||||
|
### Tensor Comparison
|
||||||
|
|
||||||
|
```python
import torch

def compare_tensors(a, b, name, rtol=1e-3, atol=1e-5):
    # Shapes must match before an element-wise comparison makes sense
    if a.shape != b.shape:
        print(f"{name}: Shape mismatch {a.shape} vs {b.shape}")
        return False

    diff = (a - b).abs()
    max_diff = diff.max().item()
    mean_diff = diff.mean().item()

    close = torch.allclose(a, b, rtol=rtol, atol=atol)
    print(f"{name}: max_diff={max_diff:.6f}, mean_diff={mean_diff:.6f}, close={close}")
    return close
```
|
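When both implementations have been hooked layer by layer, the first layer that diverges usually localizes the bug. A minimal sketch, assuming two dicts keyed by layer index and already normalized to the same shape; `first_divergent_layer` is a hypothetical helper, and the default tolerances are only a starting point (half-precision runs may need looser values):

```python
import torch


def first_divergent_layer(ref: dict, test: dict, rtol=1e-3, atol=1e-5):
    """Return the lowest layer id whose captured tensors differ, or None."""
    for layer_id in sorted(ref.keys() & test.keys()):
        # Compare in float32 so half-precision captures use the same arithmetic.
        a, b = ref[layer_id].float(), test[layer_id].float()
        if a.shape != b.shape or not torch.allclose(a, b, rtol=rtol, atol=atol):
            return layer_id
    return None


ref_storage = {0: torch.ones(1, 4, 8), 1: torch.ones(1, 4, 8)}
test_storage = {0: torch.ones(1, 4, 8), 1: torch.ones(1, 4, 8) + 0.1}
bad_layer = first_divergent_layer(ref_storage, test_storage)
```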
||||||
@@ -407,3 +407,141 @@ k_full = seq_len * kv_dim * dtype_size
|
|||||||
v_full = k_full # = 256 MB
|
v_full = k_full # = 256 MB
|
||||||
# Total: 512 MB
|
# Total: 512 MB
|
||||||
```
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 8. Empirical Validation
|
||||||
|
|
||||||
|
This section validates the theoretical memory analysis against actual measurements.
|
||||||
|
|
||||||
|
### 8.1 Test Configuration
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python tests/test_needle.py --enable-offload --input-len 100000 --block-size 1024
|
||||||
|
```
|
||||||
|
|
||||||
|
**Parameters:**
|
||||||
|
- Model: Qwen3-4B-Instruct
|
||||||
|
- `seq_len = 100000` (actual tokens: 99925)
|
||||||
|
- `block_size = 1024`
|
||||||
|
- `max_model_len = 131072`
|
||||||
|
- `num_kv_buffers = 4`
|
||||||
|
|
||||||
|
### 8.2 Theoretical Peak Memory Calculation
|
||||||
|
|
||||||
|
#### Step 1: Model Load Memory
|
||||||
|
|
||||||
|
| Component | Formula | Size |
|-----------|---------|------|
| Model weights | ~4B params × 2 bytes | ~8 GB |
| Ring buffer | 2 × 4 × 131072 × 1024 × 2 | 2048 MB |
| Decode buffer | 2 × 36 × 1024 × 1024 × 2 | 144 MB |
| **Subtotal** | | **~10.2 GB** |
|
||||||
|
|
||||||
|
#### Step 2: Prefill Activation Peak (per-layer)
|
||||||
|
|
||||||
|
| Component | Formula | Size |
|-----------|---------|------|
| hidden_states | 100000 × 2560 × 2 | 512 MB |
| residual | 100000 × 2560 × 2 | 512 MB |
| MLP gate_up | 100000 × 27392 × 2 | **5478 MB** |
| MLP silu×gate | 100000 × 13696 × 2 | 2739 MB |
| Other intermediates (qkv, RoPE, attn) | ~1-2 GB | ~1500 MB |
| **Subtotal** | | **~10 GB** |
|
||||||
|
|
||||||
|
#### Step 3: Total Peak
|
||||||
|
|
||||||
|
```
|
||||||
|
Total Peak = Model Load + Activation Peak
|
||||||
|
= 10.2 GB + 10 GB
|
||||||
|
= ~20.2 GB
|
||||||
|
```
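The Step 1 and Step 2 formulas can be evaluated directly. The sketch below is one reading of the table rows: the factor names (`num_kv_buffers`, `kv_dim`, `block_size`, and so on) are assumptions about what each number in the formulas stands for, and the "Other intermediates" row is left out because the table gives it only as an estimate. The byte counts match the tables when the buffer rows are read as binary MiB and the activation rows as decimal MB.

```python
MIB = 1024**2


def model_load_components(num_params=4e9, dtype_size=2, num_kv_buffers=4,
                          max_model_len=131072, kv_dim=1024,
                          num_layers=36, block_size=1024):
    """Bytes held after model load, following the Step 1 formulas."""
    return {
        "weights": int(num_params * dtype_size),
        # 2 (K and V) x buffers x max_model_len x kv_dim x dtype_size
        "ring_buffer": 2 * num_kv_buffers * max_model_len * kv_dim * dtype_size,
        # 2 (K and V) x layers x block_size x kv_dim x dtype_size
        "decode_buffer": 2 * num_layers * block_size * kv_dim * dtype_size,
    }


def prefill_activation_components(seq_len=100_000, hidden_size=2560,
                                  intermediate_size=13696, dtype_size=2):
    """Per-layer prefill temporaries in bytes, following the Step 2 formulas."""
    return {
        "hidden_states": seq_len * hidden_size * dtype_size,
        "residual": seq_len * hidden_size * dtype_size,
        "mlp_gate_up": seq_len * 2 * intermediate_size * dtype_size,
        "mlp_silu_gate": seq_len * intermediate_size * dtype_size,
    }


load = model_load_components()
act = prefill_activation_components()
print(f"ring buffer:   {load['ring_buffer'] / MIB:.0f} MiB")
print(f"decode buffer: {load['decode_buffer'] / MIB:.0f} MiB")
print(f"MLP gate_up:   {act['mlp_gate_up'] / 1e6:.0f} MB")
```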
|
||||||
|
|
||||||
|
### 8.3 Actual Measurement Results
|
||||||
|
|
||||||
|
```python
|
||||||
|
import torch
|
||||||
|
torch.cuda.reset_peak_memory_stats()
|
||||||
|
# ... run inference ...
|
||||||
|
peak = torch.cuda.max_memory_allocated()
|
||||||
|
```
|
||||||
|
|
||||||
|
| Metric | Value |
|--------|-------|
| After model load | 9.82 GB |
| Peak during inference | **20.02 GB** |
| Activation peak (delta) | 10.20 GB |
|
||||||
|
|
||||||
|
### 8.4 Comparison: Theory vs Actual
|
||||||
|
|
||||||
|
| Component | Theoretical | Actual | Error |
|-----------|-------------|--------|-------|
| Model load memory | ~10.2 GB | 9.82 GB | -3.7% |
| Activation peak | ~10 GB | 10.20 GB | +2.0% |
| **Total peak** | **~20.2 GB** | **20.02 GB** | **< 1%** |
|
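The Error column is consistent with a signed error of the measurement relative to the theoretical estimate. A short sketch of that arithmetic, using the values from the table (`relative_error_pct` is a name chosen for illustration):

```python
def relative_error_pct(theoretical: float, actual: float) -> float:
    """Signed error of the measurement relative to the estimate, in percent."""
    return (actual - theoretical) / theoretical * 100.0


rows = {
    "Model load memory": (10.2, 9.82),
    "Activation peak": (10.0, 10.20),
    "Total peak": (20.2, 20.02),
}
for name, (theory, actual) in rows.items():
    print(f"{name}: {relative_error_pct(theory, actual):+.1f}%")
```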
||||||
|
|
||||||
|
### 8.5 Key Findings
|
||||||
|
|
||||||
|
1. **Theoretical model is accurate**: < 5% error in all components.
|
||||||
|
|
||||||
|
2. **MLP gate_up is the dominant temporary**:
|
||||||
|
- Size: 5478 MB for 100k tokens (~5.5 GB decimal, ~5.1 GiB)
|
||||||
|
- Accounts for ~50% of activation peak
|
||||||
|
- Formula: `seq_len × 2 × intermediate_size × dtype_size`
|
||||||
|
|
||||||
|
3. **Memory scaling with sequence length**:
|
||||||
|
| seq_len | Model Load | Activation Peak | Total Peak |
|---------|------------|-----------------|------------|
| 8k | ~10 GB | ~0.8 GB | ~11 GB |
| 32k | ~10 GB | ~3.2 GB | ~13 GB |
| 64k | ~10 GB | ~6.4 GB | ~16 GB |
| 100k | ~10 GB | ~10 GB | ~20 GB |
| 128k | ~10 GB | ~13 GB | ~23 GB |
|
||||||
|
|
||||||
|
4. **Decode memory is much smaller**:
|
||||||
|
- Per-step: ~512 MB for k_full + v_full (at 100k context)
|
||||||
|
- Does not grow with decode steps (constant per layer)
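The scaling table in finding 3 follows from a linear model: every prefill temporary in Step 2 is proportional to `seq_len`, while the model load memory is fixed. A rough sketch under those assumptions, anchored at the 100k measurement; it also assumes "8k" means 8000 tokens and that the "Other intermediates" term scales linearly as well:

```python
def estimate_peak_gb(seq_len: int, model_load_gb: float = 10.0,
                     ref_seq_len: int = 100_000,
                     ref_activation_gb: float = 10.0):
    """Return (activation_peak_gb, total_peak_gb) under linear scaling."""
    activation_gb = ref_activation_gb * seq_len / ref_seq_len
    return activation_gb, model_load_gb + activation_gb


for tokens in (8_000, 32_000, 64_000, 100_000, 128_000):
    activation, total = estimate_peak_gb(tokens)
    print(f"{tokens:>7}: activation ~{activation:.1f} GB, total ~{total:.1f} GB")
```

This is an estimate for planning, not a substitute for measuring with `torch.cuda.max_memory_allocated()` on the target configuration.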
|
||||||
|
|
||||||
|
### 8.6 Memory Profiling Script
|
||||||
|
|
||||||
|
To reproduce the measurement:
|
||||||
|
|
||||||
|
```python
import os
os.environ["NANOVLLM_LOG_LEVEL"] = "INFO"

import torch
from nanovllm import LLM, SamplingParams
from tests.utils import generate_needle_prompt

# Reset memory stats
torch.cuda.reset_peak_memory_stats()
torch.cuda.empty_cache()

# Initialize LLM
llm = LLM(
    "path/to/model",
    enforce_eager=True,
    max_model_len=131072,
    max_num_batched_tokens=131072,
    enable_cpu_offload=True,
    kvcache_block_size=1024,
    num_gpu_blocks=2,
)

after_load = torch.cuda.memory_allocated()
print(f"After model load: {after_load / 1024**3:.2f} GB")

# Generate prompt and run inference
prompt, expected = generate_needle_prompt(
    tokenizer=llm.tokenizer,
    target_length=100000,
    needle_position=0.5,
)

torch.cuda.reset_peak_memory_stats()
outputs = llm.generate([prompt], SamplingParams(max_tokens=32))

peak = torch.cuda.max_memory_allocated()
print(f"Peak during inference: {peak / 1024**3:.2f} GB")
```
|
||||||
|
|||||||
@@ -440,3 +440,42 @@ Required libraries:
|
|||||||
- `minference`: For MInference vertical_slash kernel
|
- `minference`: For MInference vertical_slash kernel
|
||||||
|
|
||||||
Docker image `tzj/xattn:v0.5` has all dependencies pre-installed.
|
Docker image `tzj/xattn:v0.5` has all dependencies pre-installed.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Quest Sparse Policy (nano-vLLM)
|
||||||
|
|
||||||
|
**Files**: `nanovllm/kvcache/sparse/quest.py`, `nanovllm/kvcache/sparse/policy.py`
|
||||||
|
|
||||||
|
The Quest policy is used in nano-vLLM's CPU offload mode. It selects the Top-K blocks by bounding query-key similarity with per-block min/max key metadata.
|
||||||
|
|
||||||
|
### Scoring Mechanism
|
||||||
|
|
||||||
|
```python
|
||||||
|
score_min = torch.einsum('hd,bhd->bh', q, key_min) # [num_blocks, kv_heads]
|
||||||
|
score_max = torch.einsum('hd,bhd->bh', q, key_max) # [num_blocks, kv_heads]
|
||||||
|
scores = torch.maximum(score_min, score_max).mean(dim=-1) # [num_blocks] ← averaged!
|
||||||
|
```
|
||||||
|
|
||||||
|
### Critical Limitation - No Per-Head Scheduling
|
||||||
|
|
||||||
|
The `.mean(dim=-1)` averages scores across all heads, making a **unified** block selection for all heads:
|
||||||
|
|
||||||
|
```
|
||||||
|
Block A: head0 needs (+4), head1 doesn't (-4) → avg = 0 → NOT selected
|
||||||
|
Block B: head0 doesn't (-4), head1 needs (+4) → avg = 0 → NOT selected
|
||||||
|
Block C: both heads moderately need (+2, +2) → avg = +2 → selected
|
||||||
|
```
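The three-block example can be reproduced in a few lines. This is an illustrative sketch with the scores from the example hard-coded; `select_top_k` is a hypothetical function, not the code in `quest.py`:

```python
def select_top_k(scores_per_block: dict, k: int) -> list:
    """Rank blocks by head-averaged score and keep the k best."""
    averaged = {name: sum(s) / len(s) for name, s in scores_per_block.items()}
    return sorted(averaged, key=averaged.get, reverse=True)[:k]


# Per-block scores as [head0, head1], taken from the example above.
scores = {"A": [4.0, -4.0], "B": [-4.0, 4.0], "C": [2.0, 2.0]}

selected = select_top_k(scores, k=1)
# Best block for each head taken individually.
per_head_best = [max(scores, key=lambda b: scores[b][h]) for h in range(2)]

print("averaged selection:", selected)
print("per-head favourites:", per_head_best)
```

With `k=1` the averaged ranking keeps only block C, although head 0 scores block A highest and head 1 scores block B highest.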
|
||||||
|
|
||||||
|
### Why Per-Head Scheduling is Infeasible
|
||||||
|
|
||||||
|
1. **Memory Layout**: GPU cache stores all heads together `[block_size, kv_heads, head_dim]`
|
||||||
|
2. **FlashAttention**: Requires complete heads - partial heads cause dimension mismatch
|
||||||
|
3. **Block Granularity**: If any head needs a block, the entire block (all heads) must be loaded
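Point 3 has a direct cost: because a block is loaded for all heads at once, per-head selection degenerates into loading the union of every head's choices. A small sketch with made-up block indices (hypothetical data, not measured):

```python
def blocks_to_load(per_head_selection: list) -> set:
    """Whole blocks are loaded, so the load set is the union over heads."""
    loaded = set()
    for blocks in per_head_selection:
        loaded |= set(blocks)
    return loaded


# Top-2 block indices chosen independently by four KV heads (illustrative).
per_head = [[0, 3], [1, 3], [2, 5], [4, 3]]
loaded = blocks_to_load(per_head)
print(f"budget per head: 2 blocks, actually loaded: {len(loaded)} blocks")
```

Here a nominal budget of two blocks per head turns into six loaded blocks, which removes most of the benefit of sparse selection.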
|
||||||
|
|
||||||
|
### Policy Types
|
||||||
|
|
||||||
|
| Policy | `supports_prefill` | `supports_decode` | Description |
|--------|-------------------|-------------------|-------------|
| `FullAttentionPolicy` | True | True | Loads all blocks (baseline) |
| `QuestPolicy` | False | True | Decode-only Top-K selection |
|
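One way to read the two capability flags is as a per-phase gate. The sketch below is hypothetical throughout: `PolicyCaps` and `effective_policy` are illustrative names, and the fallback to full attention during an unsupported phase is an assumption, not confirmed behavior of `nanovllm/kvcache/sparse/policy.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyCaps:
    """Hypothetical record mirroring the table's two capability flags."""
    name: str
    supports_prefill: bool
    supports_decode: bool


FULL = PolicyCaps("FullAttentionPolicy", True, True)
QUEST = PolicyCaps("QuestPolicy", False, True)


def effective_policy(requested: PolicyCaps, phase: str) -> PolicyCaps:
    """Use the requested policy if it supports the phase.

    Assumption: unsupported phases fall back to full attention.
    """
    if phase not in ("prefill", "decode"):
        raise ValueError(f"unknown phase: {phase}")
    supported = (requested.supports_prefill if phase == "prefill"
                 else requested.supports_decode)
    return requested if supported else FULL


print(effective_policy(QUEST, "prefill").name)
print(effective_policy(QUEST, "decode").name)
```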
||||||
|
|||||||