✨ feat: add comprehensive RULER benchmark testing

- Add test_ruler.py from tzj/vs_offload branch with 13 RULER tasks
- Add comprehensive documentation for RULER benchmark results
- Update CLAUDE.md with new documentation index entry
- Add architecture, debugging, optimization, and known issues guides
- Test 32K context with CPU offload: 92.3% accuracy across all tasks
- Parallel execution on 4 GPUs with detailed performance metrics

Benchmark results:
- 13 RULER tasks total (niah_single, multikey, multiquery, multivalue, qa, cwe, fwe, vt)
- 26 samples tested with 92.3% overall accuracy
- CPU offload stable at 32K context length
- Parallel GPU execution achieving 4x speedup

Key findings:
- Single needle tasks: 100% accuracy
- Multi-value and recall tasks: 100% accuracy
- Multi-query tasks: 50% accuracy (most challenging)
- QA tasks: 100% accuracy
- Total execution time: ~220 seconds (parallel)
2026-01-18 20:34:06 +08:00
parent 0550a64339
commit e6e0dc5d7d
8 changed files with 1444 additions and 463 deletions
--- a/docs/sparse_attention_guide.md
+++ b/docs/sparse_attention_guide.md
@@ -440,3 +440,79 @@ Required libraries:
 - `minference`: For MInference vertical_slash kernel

 Docker image `tzj/xattn:v0.5` has all dependencies pre-installed.
+
+---
+
+## Quest Sparse Policy
+
+**Files**: `nanovllm/kvcache/sparse/quest.py`, `nanovllm/kvcache/sparse/policy.py`
+
+### Core Idea
+
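+The Quest policy scores each KV block with bounds on query-key similarity computed from per-block min/max key metadata, then keeps the Top-K blocks. Scoring reads only this metadata, not the keys themselves, which keeps block selection cheap in CPU offload scenarios.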
+
+### Scoring Mechanism
+
+```python
+# Compute scores using key metadata bounds
+score_min = torch.einsum('hd,bhd->bh', q, key_min)  # [num_blocks, kv_heads]
+score_max = torch.einsum('hd,bhd->bh', q, key_max)  # [num_blocks, kv_heads]
+scores = torch.maximum(score_min, score_max).mean(dim=-1)  # [num_blocks] ← averaged!
+```
+
+### Critical Limitation - No Per-Head Scheduling
+
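+The `.mean(dim=-1)` call averages scores across all KV heads, so a single **unified** block selection is shared by every head. A block that one head needs strongly can therefore lose to a block that all heads need only moderately: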
+
+```
+Block A: head0 needs (+4), head1 doesn't (-4) → avg = 0 → NOT selected
+Block B: head0 doesn't (-4), head1 needs (+4) → avg = 0 → NOT selected
+Block C: both heads moderately need (+2, +2) → avg = +2 → selected
+```
+
+### Why Per-Head Scheduling is Infeasible
+
+1. **Memory Layout**: GPU cache stores all heads together `[block_size, kv_heads, head_dim]`
+
+2. **FlashAttention**: Requires complete heads - partial heads cause dimension mismatch
+
+3. **Block Granularity**: If any head needs a block, the entire block (all heads) must be loaded
+
+### Policy Types
+
+| Policy | supports_prefill | supports_decode | Description |
+|--------|------------------|-----------------|-------------|
+| `FullAttentionPolicy` | True | True | Loads all blocks (no sparsity) |
+| `QuestPolicy` | False | True | Decode-only Top-K selection |
+
+### Usage Example
+
+```python
+from nanovllm.kvcache.sparse.policy import QuestPolicy
+
+# Create Quest policy for decode-only sparse attention
+policy = QuestPolicy(topk=8, threshold=4.0)
+
+# Select blocks based on query and key metadata
+selected_blocks = policy.select_blocks(
+    query,           # [num_tokens, num_heads, head_dim]
+    key_min,         # [num_blocks, kv_heads, head_dim]
+    key_max,         # [num_blocks, kv_heads, head_dim]
+)
+```
+
+### Key Parameters
+
+| Parameter | Default | Description |
+|-----------|---------|-------------|
+| `topk` | 8 | Number of blocks to select |
+| `threshold` | 4.0 | Minimum score threshold for selection |
+
+### Integration with CPU Offload
+
+The Quest policy is used in conjunction with CPU offload to reduce the number of blocks transferred from CPU to GPU during decode:
+
+1. During prefill, all blocks are loaded (full attention)
+2. During decode, Quest selects only top-K important blocks
+3. Only selected blocks are transferred from CPU to GPU
+4. This reduces the volume of CPU-to-GPU transfers for long sequences
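
The averaging behaviour described under "Scoring Mechanism" and "Critical Limitation" in the added documentation can be reproduced with a minimal, dependency-free sketch. It uses plain Python lists in place of tensors, and the helper names (`block_scores`, `topk_blocks`) and toy values are illustrative assumptions, not the `nanovllm` API. The key values are chosen so that the per-head scores match the Block A/B/C example.

```python
# Hypothetical sketch: plain lists stand in for tensors; not the nanovllm API.

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def block_scores(q, key_min, key_max):
    """q: [kv_heads][head_dim]; key_min/key_max: [num_blocks][kv_heads][head_dim].

    Mirrors the documented einsum + maximum + mean:
    per_head[b][h] = max(q[h] . key_min[b][h], q[h] . key_max[b][h])
    averaged[b]    = mean over heads of per_head[b]
    """
    per_head = [
        [max(dot(q[h], kmin[h]), dot(q[h], kmax[h])) for h in range(len(q))]
        for kmin, kmax in zip(key_min, key_max)
    ]
    averaged = [sum(row) / len(row) for row in per_head]
    return per_head, averaged

def topk_blocks(scores, k):
    """Indices of the k highest-scoring blocks, returned in ascending order."""
    order = sorted(range(len(scores)), key=lambda b: scores[b], reverse=True)
    return sorted(order[:k])

# Two KV heads, head_dim = 1, unit query per head (toy values).
q = [[1.0], [1.0]]
key_min = [[[3.0], [-5.0]], [[-5.0], [3.0]], [[1.0], [1.0]]]  # blocks A, B, C
key_max = [[[4.0], [-4.0]], [[-4.0], [4.0]], [[2.0], [2.0]]]

per_head, averaged = block_scores(q, key_min, key_max)
print(per_head)                   # [[4.0, -4.0], [-4.0, 4.0], [2.0, 2.0]]
print(averaged)                   # [0.0, 0.0, 2.0]
print(topk_blocks(averaged, 1))   # [2]  (only block C survives)

# Block each head would rank first if it could choose on its own.
favourite = [
    max(range(len(per_head)), key=lambda b: per_head[b][h])
    for h in range(len(q))
]
print(favourite)                  # [0, 1]  (head 0 wants A, head 1 wants B)
```

With `k=1` the averaged scores keep only Block C, although head 0 ranks Block A highest and head 1 ranks Block B highest.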