[claudesquad] update from 'int-minference-1' on 08 Jan 26 23:22 CST

Zijie Tian
2026-01-08 23:22:38 +08:00
parent 0bfe1984ef
commit ea4e904de0
11 changed files with 853 additions and 533 deletions

notes.md

@@ -1,205 +1,324 @@
# Notes: Sparsity Integration into Layerwise Offload

## Current Architecture Analysis

### GPU-Only Path vs Offload Path
| Aspect | GPU-Only | Layerwise Offload |
|--------|----------|-------------------|
| KV Storage | GPU blocks (paged) | CPU pinned + GPU ring buffer |
| Prefill | All layers → then attention | Per-layer: attention → offload |
| Decode | FlashAttn with block table | Ring buffer H2D → FlashAttn |
| Sparse Support | MInference via `attention.py` | Not integrated |
### MInference Flow (GPU-Only)
```
attention.py:101-105:
    if context.sparse_prefill_policy is not None:
        o = context.sparse_prefill_policy.sparse_prefill_attention(q, k, v, layer_id)

minference.py:sparse_prefill_attention():
    1. estimate_pattern(q, k, layer_id) -> vertical_indices, slash_indices
    2. _triton_mixed_sparse_attention(q, k, v, indices)
    3. return output
```
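For intuition, a rough sketch of the kind of computation `estimate_pattern` performs (MInference estimates the pattern from the last 64 queries, per the overhead table later in these notes). The `[heads, seq, dim]` layout, the budget defaults, and all internals here are assumptions, not the real implementation:

```python
import torch

def estimate_pattern(q, k, last_q=64, top_vertical=1024, top_slash=64):
    # q, k: [num_heads, seq_len, head_dim] (assumed layout)
    num_heads, seq_len, head_dim = k.shape
    qs = q[:, -last_q:]                                        # probe queries
    scores = torch.einsum("hqd,hkd->hqk", qs, k) * head_dim ** -0.5
    # Causal mask: a probe query at global position p only sees keys <= p
    q_pos = torch.arange(seq_len - last_q, seq_len, device=q.device)
    k_pos = torch.arange(seq_len, device=q.device)
    scores.masked_fill_(k_pos[None, None, :] > q_pos[None, :, None], float("-inf"))
    probs = scores.softmax(dim=-1)                             # [H, last_q, S]
    # Vertical pattern: key columns with high mass across the probe queries
    vertical_indices = probs.sum(dim=1).topk(min(top_vertical, seq_len), dim=-1).indices
    # Slash pattern: diagonals (q_pos - k_pos) with high accumulated mass
    slash_id = (q_pos[:, None] - k_pos[None, :]).clamp_min(0)  # [last_q, S]
    slash_sum = torch.zeros(num_heads, seq_len, dtype=probs.dtype, device=q.device)
    slash_sum.scatter_add_(1, slash_id.reshape(1, -1).expand(num_heads, -1),
                           probs.reshape(num_heads, -1))
    slash_indices = slash_sum.topk(min(top_slash, seq_len), dim=-1).indices
    return vertical_indices, slash_indices
```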
### Quest Flow (GPU Block Mode)
```
hybrid_manager.py (if using CPU offload with Quest):
    select_blocks(available_blocks, ctx) -> selected block IDs
    -> load selected blocks to GPU
    -> standard FlashAttn with loaded blocks
```
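A minimal sketch of the Quest-style scoring behind `select_blocks`: each block's attention score is upper-bounded elementwise from the per-block key min/max metadata, and the top-k blocks win. The shapes and the `top_k` default are assumptions:

```python
import torch

def select_blocks(q, key_min, key_max, top_k=8):
    # q: [num_kv_heads, head_dim]
    # key_min/key_max: [num_blocks, num_kv_heads, head_dim]
    # Upper bound on q . k for any key in the block:
    # take q * key_max where q >= 0, and q * key_min where q < 0
    upper = torch.where(q >= 0, q * key_max, q * key_min)  # [B, H, D]
    block_scores = upper.sum(dim=-1).amax(dim=-1)          # sum over dim, max over heads
    return block_scores.topk(min(top_k, block_scores.numel())).indices
```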
### OffloadEngine Existing Infrastructure
**Streams** (available for use):
- `compute_stream` - dedicated compute stream (not the default stream!)
- `prefill_offload_streams[layer_id]` - per-layer D2H streams
- `slot_transfer_streams[slot_idx]` - per-slot H2D streams
- `transfer_stream_main` - main transfer stream
- `_pipeline_layer_stream` - cross-layer pipeline stream

**Events** (available for use):
- `prefill_offload_events[layer_id]` - per-layer offload completion
- `ring_slot_ready[slot]` - H2D completion
- `ring_slot_offload_done[slot]` - D2H completion
- `ring_slot_compute_done[slot]` - compute completion
- `_pipeline_next_layer_event` - pipeline next layer ready

**Buffers** (already allocated):
- `k_cache_cpu/v_cache_cpu` - [num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]
- `k_cache_gpu/v_cache_gpu` - [num_gpu_blocks, block_size, kv_heads, head_dim] (no layer dim!)
- `decode_k_buffer/v_buffer` - [num_layers, block_size, kv_heads, head_dim]
- `prefill_k_buffer/v_buffer` - [num_layers, block_size, kv_heads, head_dim]
- `layer_k_buffer_a/b, layer_v_buffer_a/b` - [max_prefill_blocks, block_size, kv_heads, head_dim]

### Useful Existing Methods
**Async offload** (currently unused in layerwise):
```
offload_prefill_buffer_async(layer_id, cpu_block_id, num_valid_tokens)
wait_all_prefill_offloads()
wait_prefill_offload(layer_id)
```
**Cross-layer pipeline** (for decode):
```
start_decode_pipeline(cpu_block_ids)
get_decode_layer_kv(layer_id, num_blocks) -> (k, v)
end_decode_pipeline()
```

### Layerwise Offload Prefill Flow
```
model_runner.py:run_layerwise_offload_prefill():
    for layer_id in range(num_layers):
        # QKV projection
        q, k, v = qkv_proj(hidden_ln)
        # RoPE
        q, k = rotary_emb(positions, q, k)
        # FULL attention (no sparsity!)
        attn_output = flash_attn_varlen_func(q, k, v, ...)
        # MLP
        hidden_states = mlp(attn_out + residual)
        # Sync offload ALL k, v to CPU
        for block_id in cpu_block_ids:
            k_cache_cpu[layer_id, block_id].copy_(k[start:end])
            v_cache_cpu[layer_id, block_id].copy_(v[start:end])
```
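The synchronous tail offload above is the obvious stall. A sketch of how the per-layer streams and events listed under the infrastructure section could make it asynchronous; the engine attribute names come from these notes, while the helper itself and the `block_size` slicing are assumptions:

```python
import torch

def offload_layer_kv_async(engine, layer_id, k, v, cpu_block_ids, block_size):
    stream = engine.prefill_offload_streams[layer_id]
    stream.wait_stream(torch.cuda.current_stream())  # k/v must be fully written
    with torch.cuda.stream(stream):
        for i, cpu_block_id in enumerate(cpu_block_ids):
            start = i * block_size
            end = min(start + block_size, k.shape[0])
            engine.k_cache_cpu[layer_id, cpu_block_id, :end - start].copy_(
                k[start:end], non_blocking=True)
            engine.v_cache_cpu[layer_id, cpu_block_id, :end - start].copy_(
                v[start:end], non_blocking=True)
        engine.prefill_offload_events[layer_id].record(stream)

# Later, before k/v's memory is reused:
# torch.cuda.current_stream().wait_event(engine.prefill_offload_events[layer_id])
```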
### Layerwise Offload Decode Flow
```
model_runner.py:run_layerwise_offload_decode():
    # Preload first N layers to ring buffer
    for i in range(num_buffers):
        offload_engine.load_layer_kv_to_buffer(i, i, cpu_block_table, valid_tokens)
    for layer_id in range(num_layers):
        current_buffer = layer_id % num_buffers
        # Wait for buffer load
        offload_engine.wait_buffer_load(current_buffer)
        # Get prefilled KV from ring buffer (ALL blocks loaded)
        k_prefill, v_prefill = offload_engine.get_buffer_kv(current_buffer, total_prefill_tokens)
        # QKV for new token
        q, k_new, v_new = qkv_proj(hidden_ln)
        # Concat and full attention
        k_full = torch.cat([k_prefill, k_decode_prev, k_new])
        attn_output = flash_attn_varlen_func(q, k_full, v_full, ...)
        # Start loading next layer
        offload_engine.load_layer_kv_to_buffer(current_buffer, layer_id + num_buffers, ...)
```
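A throwaway dry-run of the ring schedule (purely illustrative) makes the `layer_id % num_buffers` mapping and the prefetch distance concrete:

```python
def ring_schedule(num_layers=28, num_buffers=4):
    for layer_id in range(num_layers):
        buf = layer_id % num_buffers
        nxt = layer_id + num_buffers
        kick = f"then kick load of L{nxt} into buf {buf}" if nxt < num_layers else "no more loads"
        print(f"compute L{layer_id} from buf {buf}; {kick}")

ring_schedule()
```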
## Integration Points
### 1. Prefill Sparse Integration Point
**Location:** `model_runner.py:535-543`
**Current:**
```python
attn_output = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens,
    cu_seqlens_k=cu_seqlens,
    max_seqlen_q=total_tokens,
    max_seqlen_k=total_tokens,
    softmax_scale=layer.self_attn.attn.scale,
    causal=True,
)
```
**After Integration:**
```python
if self.sparse_policy and self.sparse_policy.supports_offload_prefill:
    attn_output, k_sparse, v_sparse = self.sparse_policy.offload_prefill_attention(
        q, k, v, layer_id
    )
    k_to_offload = k_sparse if k_sparse is not None else k
    v_to_offload = v_sparse if v_sparse is not None else v
else:
    attn_output = flash_attn_varlen_func(q, k, v, ...)
    k_to_offload, v_to_offload = k, v
```
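The snippets above and below assume a policy object with a handful of capability flags and hooks. A sketch of that interface, collecting the method names used throughout these notes; the class name, exact signatures, and defaults are assumptions:

```python
from abc import ABC, abstractmethod
from typing import List, Optional, Tuple

import torch


class SparsePolicy(ABC):
    # Capability flags checked at the integration points
    supports_offload_prefill: bool = False
    supports_offload_decode: bool = False

    @abstractmethod
    def offload_prefill_attention(
        self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, layer_id: int
    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[torch.Tensor]]:
        """Return (attn_output, k_sparse, v_sparse); None k/v means
        'offload the full tensors'."""

    def select_offload_blocks(
        self,
        q: Optional[torch.Tensor],
        layer_id: int,
        cpu_block_table: List[int],
        valid_tokens_per_block: List[int],
    ) -> List[int]:
        # Default: no sparsity, load every block
        return cpu_block_table

    def on_prefill_offload(
        self, cpu_block_id: int, layer_id: int, k_block: torch.Tensor, size: int
    ) -> None:
        # Metadata hook (e.g., Quest min/max keys); no-op by default
        pass
```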
### 2. Decode Sparse Integration Point
**Location:** `model_runner.py:636-637` and `model_runner.py:704-706`
**Current (preload):**
```python
for i in range(num_preload):
    offload_engine.load_layer_kv_to_buffer(
        i, i, cpu_block_table, valid_tokens_per_block
    )
```
**After Integration:**
```python
for i in range(num_preload):
    layer_to_load = i
    if self.sparse_policy and self.sparse_policy.supports_offload_decode:
        # Prepare q for this layer (would need to compute ahead),
        # OR use the previous layer's pattern as an estimate
        selected_blocks = self.sparse_policy.select_offload_blocks(
            None,  # q not available yet at preload
            layer_to_load,
            cpu_block_table,
            valid_tokens_per_block,
        )
    else:
        selected_blocks = cpu_block_table
    offload_engine.load_sparse_layer_kv_to_buffer(
        i, layer_to_load, selected_blocks, valid_tokens_per_block
    )
```
**Challenge:** Q is not available during the preload phase!
**Solutions:**
1. Skip sparse preload; apply sparsity only to non-preloaded layers
2. Use the previous decode step's pattern as an estimate
3. Add a preload hook to the sparse policy

### 3. Offload Engine Extension
**New Method in OffloadEngine:**
```python
def load_sparse_layer_kv_to_buffer(
    self,
    buffer_idx: int,
    layer_id: int,
    selected_cpu_block_ids: List[int],
    original_valid_tokens: List[int],
) -> int:
    """
    Load only the selected blocks from CPU to the ring buffer.

    Returns:
        Total tokens loaded (may be less than the full sequence).
    """
    stream = self.layer_load_streams[buffer_idx]
    with torch.cuda.stream(stream):
        stream.wait_event(self.buffer_compute_done_events[buffer_idx])
        # Build mapping: original block -> selected position
        offset = 0
        for i, cpu_block_id in enumerate(selected_cpu_block_ids):
            # Find original index to get valid tokens
            valid_tokens = original_valid_tokens[i]  # Need mapping
            self.layer_k_cache[buffer_idx, offset:offset + valid_tokens].copy_(
                self.k_cache_cpu[layer_id, cpu_block_id, :valid_tokens],
                non_blocking=True,
            )
            # ... same for v_cache
            offset += valid_tokens
        self.buffer_load_events[buffer_idx].record(stream)
    return offset  # Caller needs to know the actual loaded token count
```

### Chunked Prefill Code to Remove
**attention.py** (line ranges to remove):
- 172-312: `_chunked_prefill_attention()`
- 314-346: `_sync_load_previous_chunks()`
- 348-480: `_ring_buffer_pipeline_load()`
- 482-591: `_chunked_decode_attention()`
- 593-667: `_decode_ring_buffer_pipeline()`
- 669-726: `_decode_with_layer_pipeline()`

**context.py** (fields to remove):
- `is_chunked_prefill`
- `prev_kv_ranges`
- `chunk_offset`
- `chunked_seq`
- `decode_pos_in_block`
- `decode_start_pos_in_block`
- `current_chunk_idx`

**Keep**:
- `kvcache_manager` - still needed for layerwise
- `sparse_prefill_policy` - needed for MInference
---
## Memory Layout
### New Design: Ring-Buffered GPU KV Cache
**Design principles**:
- Don't chase extreme peak-memory savings; guarantee pipeline correctness first
- The number of ring-buffer layers is configurable from outside (default: 4)
- Pipeline depth = num_kv_buffers - 1
```
# New: ring-buffered GPU cache (dedicated to layerwise offload)
# num_kv_buffers: externally configurable, default 4
layer_k_cache: [num_kv_buffers, max_seq_tokens, kv_heads, head_dim]
layer_v_cache: [num_kv_buffers, max_seq_tokens, kv_heads, head_dim]
# Removed: the old chunked-prefill ring buffer
# k_cache_gpu: [num_gpu_blocks, block_size, kv_heads, head_dim] <- delete
# v_cache_gpu: [num_gpu_blocks, block_size, kv_heads, head_dim] <- delete
```
**Why a ring buffer?**
The decode-phase pipeline requirement (with 4 buffers as an example):
```
Buffer 0: [Load L0] → [Compute L0] ──────────────────► [Load L4]
Buffer 1: [Load L1] → [Compute L1] ────────────────────►
Buffer 2: [Load L2] → [Compute L2] ────────────►
Buffer 3: [Load L3] → [Compute L3] ──►
```
Pipeline depth = 3: up to three layers can be preloaded ahead, which better hides H2D latency.
**Memory cost** (Qwen3-4B, 128K tokens):
- Per-layer KV (K+V): 128K × 8 × 128 × 2 bytes × 2 = 512 MB
- 4-layer ring buffer: 4 × 512 MB = 2 GB
- vs. all 28 layers on GPU: 28 × 512 MB = 14 GB
- **Savings**: 14 GB - 2 GB = 12 GB
**Config plumbing**:
```
LLM(num_kv_buffers=4) → Config → OffloadEngine(num_kv_buffers=...)
```
### CPU Cache (unchanged)
```
k_cache_cpu: [num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]
v_cache_cpu: [num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]
```
Pinned memory for fast DMA transfers.
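For concreteness, an allocation sketch with the Qwen3-4B numbers used in these notes; `num_cpu_blocks = 128` is an assumed value for a 128K-token budget at `block_size = 1024`:

```python
import torch

num_layers, num_cpu_blocks, block_size, kv_heads, head_dim = 28, 128, 1024, 8, 128
shape = (num_layers, num_cpu_blocks, block_size, kv_heads, head_dim)
# pin_memory=True gives page-locked host memory, required for true async DMA.
# At these sizes this is ~7 GB per tensor, ~14 GB for K+V together.
k_cache_cpu = torch.empty(shape, dtype=torch.bfloat16, pin_memory=True)
v_cache_cpu = torch.empty(shape, dtype=torch.bfloat16, pin_memory=True)
```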
### Memory per Layer (Qwen3-4B)
- kv_heads = 8
- head_dim = 128
- dtype = bfloat16 (2 bytes)
- Per token KV: 8 * 128 * 2 * 2 = 4KB
- 128K tokens: 512 MB per layer
- 28 layers: 14 GB total on CPU
---
## Stream Synchronization Pattern
### Correct Pattern for Async Offload
```python
# In offload stream
with torch.cuda.stream(offload_stream):
    offload_stream.wait_stream(compute_stream)  # Wait for compute to finish
    cpu_tensor.copy_(gpu_tensor, non_blocking=True)
    event.record(offload_stream)
# Before reusing gpu_tensor
compute_stream.wait_event(event)  # Wait for offload to complete
```
### Correct Pattern for Async Load
```python
# In load stream
with torch.cuda.stream(load_stream):
    gpu_buffer.copy_(cpu_tensor, non_blocking=True)
    event.record(load_stream)
# Before using gpu_buffer
compute_stream.wait_event(event)  # Wait for load to complete
```
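Putting the two patterns together, a self-contained sketch of the double-buffered H2D pipeline that the decode ring buffer relies on; every name here is local to the example, and the CPU tensors are assumed pinned:

```python
import torch

def pipelined_layer_loads(cpu_layers, gpu_buffers, load_streams, compute_fn):
    """Double-buffered H2D: load layer i+1 while computing layer i."""
    ready = [torch.cuda.Event() for _ in gpu_buffers]  # load finished
    done = [torch.cuda.Event() for _ in gpu_buffers]   # compute finished
    for e in done:
        e.record()  # buffers start out free
    for i, cpu_t in enumerate(cpu_layers):
        buf = i % len(gpu_buffers)
        with torch.cuda.stream(load_streams[buf]):
            load_streams[buf].wait_event(done[buf])        # buffer free?
            gpu_buffers[buf].copy_(cpu_t, non_blocking=True)
            ready[buf].record(load_streams[buf])
        torch.cuda.current_stream().wait_event(ready[buf])  # load finished?
        compute_fn(i, gpu_buffers[buf])
        done[buf].record(torch.cuda.current_stream())
    torch.cuda.synchronize()
```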
## Test Configuration
**Needle test command**:
```bash
PYTHONPATH=/home/zijie/.claude-squad/worktrees/zijie/int-offload-1_188890c8699249f7:$PYTHONPATH \
python tests/test_needle.py \
    --model ~/models/Qwen3-4B-Instruct-2507/ \
    --max-model-len 32768 \
    --input-len 8192 \
    --enable-offload \
    --block-size 1024 \
    --num-gpu-blocks 2
```
**GPU mutex check before running**:
```bash
nvidia-smi --query-compute-apps=pid,name,used_memory --format=csv,noheader
```
## Metadata Flow for Quest
### During Prefill Offload
**Current:** No metadata collection in offload path
**Required:** Call `on_prefill_offload()` for each block
```python
# In run_layerwise_offload_prefill()
for i, cpu_block_id in enumerate(cpu_block_ids):
    start = i * block_size
    end = min(start + block_size, total_tokens)
    actual_size = end - start
    # BEFORE offload: update Quest metadata
    if self.sparse_policy and hasattr(self.sparse_policy, 'on_prefill_offload'):
        self.sparse_policy.on_prefill_offload(
            cpu_block_id, layer_id, k[start:end], actual_size
        )
    # Offload
    offload_engine.k_cache_cpu[layer_id, cpu_block_id, :actual_size].copy_(k[start:end])
    offload_engine.v_cache_cpu[layer_id, cpu_block_id, :actual_size].copy_(v[start:end])
```
### Quest Metadata Shape
```python
# BlockMetadataManager
key_min: [num_blocks, num_layers, num_kv_heads, head_dim] # Min key per block per layer
key_max: [num_blocks, num_layers, num_kv_heads, head_dim] # Max key per block per layer
```
**Memory:** 2 * num_blocks * num_layers * kv_heads * head_dim * 2 bytes
- Example: 1000 blocks * 28 layers * 4 heads * 128 dim * 2 * 2 = ~57 MB
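A sketch of how `BlockMetadataManager` could maintain this layout from the `on_prefill_offload()` hook; the constructor and method bodies are assumptions built around the shapes above:

```python
import torch

class BlockMetadataManager:
    def __init__(self, num_blocks, num_layers, num_kv_heads, head_dim,
                 dtype=torch.bfloat16, device="cuda"):
        shape = (num_blocks, num_layers, num_kv_heads, head_dim)
        # Start at +/- inf so the first update always wins
        self.key_min = torch.full(shape, float("inf"), dtype=dtype, device=device)
        self.key_max = torch.full(shape, float("-inf"), dtype=dtype, device=device)

    def on_prefill_offload(self, cpu_block_id, layer_id, k_block, size):
        # k_block: [size, num_kv_heads, head_dim]; reduce over the token dim
        self.key_min[cpu_block_id, layer_id] = k_block[:size].amin(dim=0)
        self.key_max[cpu_block_id, layer_id] = k_block[:size].amax(dim=0)
```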
## Performance Considerations
### MInference Prefill Overhead
| Operation | Time (64K seq) |
|-----------|----------------|
| Pattern estimation (last-64) | ~5ms |
| Triton sparse attention | ~80ms |
| Full FlashAttention | ~100ms |
| **Net Speedup** | ~15-20% |
### Quest Decode Overhead
| Operation | Time |
|-----------|------|
| Block scoring (GPU metadata) | ~0.1ms |
| Top-K selection | ~0.05ms |
| Sparse H2D load (8 blocks) | ~2ms |
| Full H2D load (100 blocks) | ~20ms |
| **Net Speedup** | ~10x H2D |
### Memory Trade-offs
| Mode | GPU Memory | CPU Memory | H2D Bandwidth |
|------|------------|------------|---------------|
| Full offload | Ring buffer | Full KV | High |
| Sparse offload | Ring buffer | Full KV | Low (subset) |
| Aggressive sparse | Ring buffer | Sparse KV | Very low |
## Edge Cases
### 1. Short Sequences (< sparse threshold)
```python
if total_tokens < sparse_threshold:
    # Fall back to full attention
    use_sparse = False
```
### 2. First Decode Step (no previous Q)
Quest can't score blocks without Q. Options:
- Use average embedding as proxy
- Load all blocks for first step
- Use prefill pattern as estimate
### 3. Variable Sequence Lengths in Batch
Layerwise offload currently only supports batch_size=1:
```python
assert len(seqs) == 1, "Layer-wise offload only supports single sequence"
```
Sparse integration should maintain this constraint.
### 4. Ring Buffer vs Sparse Load Mismatch
Ring buffer assumes fixed `total_prefill_tokens`:
```python
k_prefill, v_prefill = offload_engine.get_buffer_kv(buffer_idx, total_prefill_tokens)
```
Sparse load has variable token count. Need:
```python
# Track actual loaded tokens per buffer
loaded_tokens[buffer_idx] = sparse_load_count
k_prefill, v_prefill = offload_engine.get_buffer_kv(buffer_idx, loaded_tokens[buffer_idx])
```
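This is the same bookkeeping flagged as `# Need mapping` in integration point 3. A sketch of one way to resolve both at once; the helper and its names are invented here:

```python
from typing import List, Tuple

def plan_sparse_load(
    cpu_block_table: List[int],
    valid_tokens_per_block: List[int],
    selected_cpu_block_ids: List[int],
) -> Tuple[List[Tuple[int, int]], int]:
    """Pair each selected CPU block with its original valid-token count,
    and return the total so the caller can fill loaded_tokens[buffer_idx]."""
    valid_by_block = dict(zip(cpu_block_table, valid_tokens_per_block))
    plan = [(bid, valid_by_block[bid]) for bid in selected_cpu_block_ids]
    total = sum(n for _, n in plan)
    return plan, total
```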
## Testing Strategy
### Unit Tests
1. `test_sparse_policy_interface.py` - Verify new interface methods
2. `test_minference_offload.py` - MInference in offload mode
3. `test_quest_offload.py` - Quest block selection in offload mode
### Integration Tests
1. `test_offload_sparse_e2e.py` - Full prefill+decode with sparsity
2. `test_accuracy_comparison.py` - Compare outputs: full vs sparse
### Benchmarks
1. `bench_offload_sparse.py` - Compare:
- Full offload (baseline)
- MInference prefill + Quest decode
- Aggressive sparse offload
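
As a starting point for `bench_offload_sparse.py`, a toy timing loop for the H2D cost that dominates the comparison; everything here is a hypothetical sketch (pinned CPU blocks, preallocated GPU buffer):

```python
import time
import torch

def bench_h2d(cpu_blocks, gpu_buffer, block_ids, iters=10):
    """Average seconds to load the given block subset host-to-device;
    call with all blocks vs. a sparse selection to compare."""
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        offset = 0
        for bid in block_ids:
            n = cpu_blocks[bid].shape[0]
            gpu_buffer[offset:offset + n].copy_(cpu_blocks[bid], non_blocking=True)
            offset += n
        torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters
```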