[claudesquad] update from 'lw-offload-2' on 08 Jan 26 21:19 CST
@@ -440,3 +440,42 @@ Required libraries:
- `minference`: For MInference vertical_slash kernel

Docker image `tzj/xattn:v0.5` has all dependencies pre-installed.

---

## Quest Sparse Policy (nano-vLLM)

**Files**: `nanovllm/kvcache/sparse/quest.py`, `nanovllm/kvcache/sparse/policy.py`

The Quest policy is used by nano-vLLM in CPU offload mode. It selects the Top-K blocks by scoring each block against the current query, using per-block min/max key metadata to bound the query-key similarity.

### Scoring Mechanism

```python
import torch

# q: [kv_heads, head_dim]; key_min, key_max: [num_blocks, kv_heads, head_dim]
score_min = torch.einsum('hd,bhd->bh', q, key_min)  # [num_blocks, kv_heads]
score_max = torch.einsum('hd,bhd->bh', q, key_max)  # [num_blocks, kv_heads]
scores = torch.maximum(score_min, score_max).mean(dim=-1)  # [num_blocks], averaged over heads
```
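
To make the scoring concrete, here is a minimal end-to-end sketch under stated assumptions. It uses NumPy in place of torch so it runs on CPU without extra setup, and the helper names `build_block_metadata` and `select_topk_blocks` are hypothetical, not functions from `quest.py`. It also assumes the token count is a multiple of the block size.

```python
import numpy as np


def build_block_metadata(keys, block_size):
    """Per-block elementwise min/max of keys (hypothetical helper).

    keys: [num_tokens, kv_heads, head_dim]
    returns: key_min, key_max, each [num_blocks, kv_heads, head_dim]
    """
    num_tokens, kv_heads, head_dim = keys.shape
    assert num_tokens % block_size == 0, "sketch assumes full blocks only"
    blocks = keys.reshape(-1, block_size, kv_heads, head_dim)
    return blocks.min(axis=1), blocks.max(axis=1)


def select_topk_blocks(q, key_min, key_max, topk):
    """Score blocks as in the snippet above and return the Top-K block ids."""
    score_min = np.einsum('hd,bhd->bh', q, key_min)  # [num_blocks, kv_heads]
    score_max = np.einsum('hd,bhd->bh', q, key_max)  # [num_blocks, kv_heads]
    scores = np.maximum(score_min, score_max).mean(axis=-1)  # [num_blocks]
    k = min(topk, scores.shape[0])
    order = np.argsort(-scores, kind="stable")[:k]  # highest scores first
    return np.sort(order), scores


# Toy data: 8 tokens, 2 KV heads, head_dim 4, block_size 2 -> 4 blocks.
keys = np.zeros((8, 2, 4))
keys[4:6] = 1.0  # block 2 aligns strongly with the query
keys[0] = 0.5    # block 0 aligns weakly
key_min, key_max = build_block_metadata(keys, block_size=2)
q = np.ones((2, 4))
selected, scores = select_topk_blocks(q, key_min, key_max, topk=2)
print(selected.tolist(), scores.tolist())  # prints [0, 2] [2.0, 0.0, 4.0, 0.0]
```

The selected ids are returned in ascending order, which is a choice made for this sketch rather than something the source specifies.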

### Critical Limitation - No Per-Head Scheduling

The `.mean(dim=-1)` averages scores across all KV heads, so one **unified** block selection is shared by every head. A block that one head needs strongly can be averaged away by a head that does not need it:

```
Block A: head0 needs (+4), head1 doesn't (-4) → avg = 0 → NOT selected
Block B: head0 doesn't (-4), head1 needs (+4) → avg = 0 → NOT selected
Block C: both heads moderately need (+2, +2)  → avg = +2 → selected
```
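
The same three-block example can be reproduced numerically. This is an illustrative sketch only: it assumes a Top-1 budget and uses NumPy, and the variable names are made up for the example.

```python
import numpy as np

block_names = ["A", "B", "C"]
# Rows are blocks, columns are heads: [num_blocks, kv_heads]
per_head_scores = np.array([
    [4.0, -4.0],   # Block A: head0 needs it, head1 does not
    [-4.0, 4.0],   # Block B: head1 needs it, head0 does not
    [2.0, 2.0],    # Block C: both heads need it moderately
])

# Unified selection: average over heads, then pick the best block.
unified_scores = per_head_scores.mean(axis=-1)
unified_pick = block_names[int(unified_scores.argmax())]

# What each head would pick if it could choose on its own.
per_head_pick = [block_names[int(i)] for i in per_head_scores.argmax(axis=0)]

print(unified_scores.tolist())  # prints [0.0, 0.0, 2.0]
print(unified_pick)             # prints C
print(per_head_pick)            # prints ['A', 'B']
```

Neither head gets its own best block under the averaged score, which is exactly the limitation described above.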

### Why Per-Head Scheduling is Infeasible

1. **Memory Layout**: The GPU cache stores all heads of a block together, as `[block_size, kv_heads, head_dim]`.
2. **FlashAttention**: Requires complete heads; passing only some heads causes a dimension mismatch.
3. **Block Granularity**: If any head needs a block, the entire block (all heads) must be loaded.
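
The block-granularity point can be sketched as follows: because a block carries every head, a per-head scheme would have to load the union of each head's choices. The function `blocks_to_load_per_head` is a hypothetical illustration written with NumPy, not code from nano-vLLM.

```python
import numpy as np


def blocks_to_load_per_head(per_head_scores, topk):
    """Union of each head's Top-K blocks (hypothetical helper).

    per_head_scores: [num_blocks, kv_heads]
    returns: sorted list of block ids that would have to be loaded
    """
    num_blocks, kv_heads = per_head_scores.shape
    k = min(topk, num_blocks)
    needed = set()
    for h in range(kv_heads):
        top = np.argsort(-per_head_scores[:, h], kind="stable")[:k]
        needed.update(int(b) for b in top)  # a whole block is loaded for one head
    return sorted(needed)


# Heads disagree: a per-head budget of 1 still loads 2 full blocks.
disagree = np.array([[4.0, -4.0], [-4.0, 4.0], [2.0, 2.0]])
print(blocks_to_load_per_head(disagree, topk=1))  # prints [0, 1]

# Heads agree: the union collapses to a single block.
agree = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]])
print(blocks_to_load_per_head(agree, topk=1))  # prints [2]
```

When heads disagree, the number of loaded blocks grows toward `topk * kv_heads`, so the transfer budget is no longer bounded by `topk` alone.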

### Policy Types

| Policy | `supports_prefill` | `supports_decode` | Description |
|--------|-------------------|-------------------|-------------|
| `FullAttentionPolicy` | True | True | Loads all blocks (baseline) |
| `QuestPolicy` | False | True | Decode-only Top-K selection |
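
One way to read the table is as a pair of capability flags consulted per phase. The sketch below is an assumption about how such flags could be used, not the actual classes in `policy.py`: `PolicyCaps` and `policy_for_phase` are hypothetical names, and falling back to full attention when a phase is unsupported is a guess, since the source does not say what happens in that case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyCaps:
    """Capability flags mirroring the table columns (hypothetical type)."""
    name: str
    supports_prefill: bool
    supports_decode: bool


FULL = PolicyCaps("FullAttentionPolicy", supports_prefill=True, supports_decode=True)
QUEST = PolicyCaps("QuestPolicy", supports_prefill=False, supports_decode=True)


def policy_for_phase(preferred, phase, fallback=FULL):
    """Return `preferred` if it supports `phase`, else `fallback` (assumed behavior)."""
    if phase not in ("prefill", "decode"):
        raise ValueError(f"unknown phase: {phase!r}")
    supported = (
        preferred.supports_prefill if phase == "prefill" else preferred.supports_decode
    )
    return preferred if supported else fallback


print(policy_for_phase(QUEST, "prefill").name)  # prints FullAttentionPolicy
print(policy_for_phase(QUEST, "decode").name)   # prints QuestPolicy
```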