[claudesquad] update from 'lw-offload-2' on 08 Jan 26 21:19 CST

Zijie Tian
2026-01-08 21:19:38 +08:00
parent a8c9f0d837
commit 105201b902
7 changed files with 649 additions and 279 deletions

@@ -440,3 +440,42 @@ Required libraries:
- `minference`: For MInference vertical_slash kernel
Docker image `tzj/xattn:v0.5` has all dependencies pre-installed.
---
## Quest Sparse Policy (nano-vLLM)
**Files**: `nanovllm/kvcache/sparse/quest.py`, `nanovllm/kvcache/sparse/policy.py`
Quest policy is used in nano-vLLM's CPU offload mode. At each decode step it selects the Top-K KV-cache blocks by bounding query-key similarity with per-block min/max key metadata.
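How that metadata might be produced is not shown in the files above; below is a minimal sketch under the assumption that each block's metadata is the channel-wise min/max over its keys (`build_block_metadata` and all shapes are illustrative, not the repo's actual API):
```python
import torch

def build_block_metadata(k_cache, block_size):
    # k_cache: [seq_len, kv_heads, head_dim] keys for one layer.
    # Returns (key_min, key_max), each [num_blocks, kv_heads, head_dim].
    num_blocks = k_cache.shape[0] // block_size  # assume divisible for brevity
    blocks = k_cache[:num_blocks * block_size].view(
        num_blocks, block_size, *k_cache.shape[1:])
    return blocks.amin(dim=1), blocks.amax(dim=1)

key_min, key_max = build_block_metadata(torch.randn(128, 4, 64), block_size=16)
print(key_min.shape)  # torch.Size([8, 4, 64])
```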
### Scoring Mechanism
```python
# q: [kv_heads, head_dim] query for the current decode token
# key_min / key_max: [num_blocks, kv_heads, head_dim] per-block key bounds
score_min = torch.einsum('hd,bhd->bh', q, key_min)  # [num_blocks, kv_heads]
score_max = torch.einsum('hd,bhd->bh', q, key_max)  # [num_blocks, kv_heads]
scores = torch.maximum(score_min, score_max).mean(dim=-1)  # [num_blocks] ← averaged!
```
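Putting scoring and selection together, a self-contained sketch might look like the following (`quest_select_blocks` and the toy shapes are assumptions for illustration; only the three scoring lines above come from the repo):
```python
import torch

def quest_select_blocks(q, key_min, key_max, top_k):
    # Score every block with the min/max similarity bound, then keep
    # the top_k highest-scoring blocks (one shared set for all heads).
    score_min = torch.einsum('hd,bhd->bh', q, key_min)
    score_max = torch.einsum('hd,bhd->bh', q, key_max)
    scores = torch.maximum(score_min, score_max).mean(dim=-1)
    return torch.topk(scores, min(top_k, scores.numel())).indices

# Toy usage: 8 blocks, 4 KV heads, head_dim 64, keep the best 2 blocks.
q = torch.randn(4, 64)
key_min = torch.randn(8, 4, 64)
key_max = key_min + torch.rand(8, 4, 64)  # guarantee max >= min
print(quest_select_blocks(q, key_min, key_max, top_k=2))
```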
### Critical Limitation - No Per-Head Scheduling
The `.mean(dim=-1)` averages scores across all KV heads, so a single **unified** block selection is made for every head:
```
Block A: head0 needs (+4), head1 doesn't (-4) → avg = 0 → NOT selected
Block B: head0 doesn't (-4), head1 needs (+4) → avg = 0 → NOT selected
Block C: both heads moderately need (+2, +2) → avg = +2 → selected
```
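The effect is easy to reproduce with the toy scores above (the tensors are illustrative, mirroring the A/B/C example):
```python
import torch

per_head = torch.tensor([[ 4.0, -4.0],   # block A: head0 needs it
                         [-4.0,  4.0],   # block B: head1 needs it
                         [ 2.0,  2.0]])  # block C: both moderately need it

unified = per_head.mean(dim=-1)           # tensor([0., 0., 2.])
print(torch.topk(unified, 1).indices)     # tensor([2]): only block C survives

# A per-head Top-1 would instead pick block A for head0 and block B for head1:
print(per_head.argmax(dim=0))             # tensor([0, 1])
```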
### Why Per-Head Scheduling is Infeasible
1. **Memory Layout**: The GPU cache stores all heads of a block together as `[block_size, kv_heads, head_dim]`
2. **FlashAttention**: Expects the complete set of heads in one contiguous tensor; feeding it a per-head subset causes a dimension mismatch
3. **Block Granularity**: If any head needs a block, the entire block (all heads) must be loaded, as the sketch below shows
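A minimal illustration of the third point, assuming a CPU-side cache laid out one entry per block (the tensor names and transfer step are illustrative):
```python
import torch

num_blocks, block_size, kv_heads, head_dim = 8, 16, 4, 64

# CPU-side cache: one entry per block, all heads stored together.
cpu_k_cache = torch.randn(num_blocks, block_size, kv_heads, head_dim)

# Even if only head 0 "needs" blocks {1, 5}, the transfer granularity is
# the whole block: every head's keys for those blocks move to the GPU.
selected = torch.tensor([1, 5])
device = 'cuda' if torch.cuda.is_available() else 'cpu'
gpu_k = cpu_k_cache[selected].to(device)
print(gpu_k.shape)  # torch.Size([2, 16, 4, 64]): all 4 heads loaded
```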
### Policy Types
| Policy | `supports_prefill` | `supports_decode` | Description |
|--------|-------------------|-------------------|-------------|
| `FullAttentionPolicy` | True | True | Loads all blocks (baseline) |
| `QuestPolicy` | False | True | Decode-only Top-K selection |
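A hypothetical sketch of how these two policies might share one interface; only the class names and the `supports_prefill`/`supports_decode` flags come from the table, while everything else (the method name, the metadata dict) is assumed:
```python
from abc import ABC, abstractmethod
import torch

class SparsePolicy(ABC):
    supports_prefill: bool
    supports_decode: bool

    @abstractmethod
    def select_blocks(self, q, metadata, top_k):
        """Return indices of the KV blocks to load for this step."""

class FullAttentionPolicy(SparsePolicy):
    supports_prefill = True
    supports_decode = True

    def select_blocks(self, q, metadata, top_k):
        return torch.arange(metadata['num_blocks'])  # baseline: every block

class QuestPolicy(SparsePolicy):
    supports_prefill = False  # decode-only; prefill must fall back elsewhere
    supports_decode = True

    def select_blocks(self, q, metadata, top_k):
        s_min = torch.einsum('hd,bhd->bh', q, metadata['key_min'])
        s_max = torch.einsum('hd,bhd->bh', q, metadata['key_max'])
        scores = torch.maximum(s_min, s_max).mean(dim=-1)
        return torch.topk(scores, min(top_k, scores.numel())).indices
```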