[WIP] replace merge attention with triton kernel.

Zijie Tian
2025-12-25 01:07:05 +08:00
parent cf5e7df093
commit 16fcf8350b
5 changed files with 490 additions and 405 deletions

CLAUDE.md

@@ -1,365 +1,172 @@
# CLAUDE.md
This file provides guidance to Claude Code when working with this repository.
## Overview
Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline LLM inference. Supports Qwen3 models with CPU offload for long-context inference.
## Architecture
- **LLMEngine** (`llm_engine.py`): Main entry point; runs the prefill-decode loop (usage sketch below)
- **ModelRunner** (`model_runner.py`): Loads weights, allocates KV cache, captures CUDA graphs
- **Scheduler** (`scheduler.py`): Two-phase scheduling (prefill → decode)
- **BlockManager** (`block_manager.py`): Paged attention with prefix caching (xxhash), default block size 4096
- **Attention** (`layers/attention.py`): FlashAttention with chunked methods for CPU offload
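For orientation, a minimal offline-inference sketch. The `LLM`/`SamplingParams` interface shown here follows upstream nano-vllm; treat the exact argument names and return format as assumptions and verify against `example.py` in this repo.
```python
# Minimal usage sketch (API names follow upstream nano-vllm; verify against example.py).
from nanovllm import LLM, SamplingParams

llm = LLM("~/models/Qwen3-0.6B/", max_model_len=4096)
params = SamplingParams(temperature=0.6, max_tokens=128)

# LLMEngine.generate() runs the prefill-decode loop until every sequence finishes.
outputs = llm.generate(["Explain paged attention in one sentence."], params)
print(outputs[0]["text"])
```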
## CPU Offload System
### Overview
When `enable_cpu_offload=True`, KV cache is stored on CPU with a small GPU buffer for computation. This enables long-context inference with limited GPU memory.
### Ring Buffer Design
```
GPU Slots: [0] [1] [2] [3] ... (unified ring buffer)
Prefill: slot = chunk_idx % N
Decode:  slot[0] = decode, slots[1:] = load previous chunks
```
**Key Files**: `kvcache/offload_engine.py`, `kvcache/hybrid_manager.py`

**Memory Layout**:
- GPU: `[num_layers, num_gpu_blocks, block_size, kv_heads, head_dim]`
- CPU: `[num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]` (pinned memory)

**Key Methods**:
- `load_to_slot_layer(slot, layer, cpu_block)`: Async H2D load
- `offload_slot_to_cpu(slot, cpu_block)`: Async D2H offload
- Per-slot per-layer CUDA events for fine-grained synchronization

**Pipeline**: Double buffering with `compute_done` events prevents data races. Pipeline depth = N-1 for prefill, (N-1)/2 for decode. See the sketch below.
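A schematic of the prefill-side ring-buffer pipeline described above. Method names follow this document; the real signatures live in `kvcache/offload_engine.py` and `layers/attention.py`, so treat this as an illustrative sketch, not the actual implementation.
```python
from itertools import cycle

def process_prefill_chunk(engine, layer_id, chunk_idx, prev_cpu_blocks, new_cpu_block, num_slots):
    """Illustrative only: pipelined load/compute for one layer of one prefill chunk."""
    write_slot = chunk_idx % num_slots                        # current chunk's KV is written here
    load_slots = [s for s in range(num_slots) if s != write_slot]

    # Previous chunks are assigned to the remaining slots round-robin. Each async
    # H2D load waits on that slot's compute_done event before overwriting it, so
    # data still being read is never clobbered (pipeline depth = N - 1).
    assignments = list(zip(cycle(load_slots), prev_cpu_blocks))
    for slot, cpu_block in assignments:
        engine.load_to_slot_layer(slot, layer_id, cpu_block)

    for slot, _ in assignments:
        engine.wait_slot_layer(slot, layer_id)     # this layer's H2D transfer finished
        # ... flash-attention against the KV now resident in `slot`,
        #     partial outputs merged via LSE ...
        engine.record_slot_compute_done(slot)      # slot may be reused by the next load

    # Persist the freshly written chunk so later chunks (and decode) can reload it.
    engine.offload_slot_to_cpu(write_slot, new_cpu_block)
```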
## Scatter-Gather DMA (sgDMA) - INTEGRATED ✓

### Problem & Solution
**Problem**: Strided CPU cache access `k_cache_cpu[:, block_id]` caused slow Device→Pageable transfers at ~1.4 GB/s instead of the optimal ~24 GB/s pinned-memory bandwidth.

**Solution**: Implemented `cudaMemcpy2D` via a custom CUDA extension to handle strided layouts natively. **Integration complete** as of 2025-12-25.
### Quick Start
```python
from nanovllm.comm import memcpy_2d_async

# Transfer block_id across all layers
spitch = num_blocks * features * dtype_size   # source stride between layers
dpitch = features * dtype_size                # contiguous destination stride
width = features * dtype_size                 # bytes copied per row
height = num_layers                           # number of rows
memcpy_2d_async(gpu_buf, cpu_cache[:, block_id], dpitch, spitch, width, height, "h2d", stream)
```
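The pitch values above can be derived directly from the CPU cache tensor. A hypothetical helper (not part of the nanovllm API) for the `[num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]` layout:
```python
import torch

def sgdma_params(cpu_cache: torch.Tensor):
    """Compute cudaMemcpy2D pitches for copying one block across all layers (sketch)."""
    num_layers, num_blocks = cpu_cache.shape[:2]
    features = cpu_cache[0, 0].numel()           # block_size * kv_heads * head_dim
    dtype_size = cpu_cache.element_size()
    spitch = num_blocks * features * dtype_size  # bytes between layer rows in the strided CPU cache
    dpitch = features * dtype_size               # destination rows are packed back-to-back
    width = features * dtype_size                # bytes actually copied per row
    height = num_layers                          # one row per layer
    return dpitch, spitch, width, height
```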
### Benchmark Performance (Synthetic, 256MB)
| Method | Bandwidth | Speedup |
|--------|-----------|---------|
| **cudaMemcpy2D (sgDMA)** | **24.95 GB/s** | **Baseline** |
| PyTorch strided | 4.25 GB/s | **5.87x slower** |
| PyTorch contiguous | 24.92 GB/s | Same |
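A minimal pure-PyTorch sketch of the strided-vs-contiguous gap behind this table (illustrative sizes; the numbers above come from `tests/test_sgdma.py`):
```python
import torch

L, B, F = 32, 64, 4096 * 8   # layers, CPU blocks, features per block (illustrative sizes)
strided_cpu = torch.zeros(L, B, F, dtype=torch.float16, pin_memory=True)   # current layout
packed_cpu = torch.zeros(B, L, F, dtype=torch.float16, pin_memory=True)    # block-major layout
gpu = torch.empty(L, F, dtype=torch.float16, device="cuda")

def bandwidth_gbps(copy_fn, iters=20):
    copy_fn(); torch.cuda.synchronize()   # warmup
    start, end = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        copy_fn()
    end.record(); torch.cuda.synchronize()
    gbytes = gpu.numel() * gpu.element_size() * iters / 1e9
    return gbytes / (start.elapsed_time(end) / 1e3)

print("strided pinned slice   :", bandwidth_gbps(lambda: gpu.copy_(strided_cpu[:, 0], non_blocking=True)))
print("contiguous pinned slice:", bandwidth_gbps(lambda: gpu.copy_(packed_cpu[0], non_blocking=True)))
```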
### Real-World Performance (A100, Attention Offload)
**Measured from `test_attention_offload.py` profiling**:

| Transfer Type | Count | Bandwidth | Previous | Speedup |
|---------------|-------|-----------|----------|---------|
| **Device→Pinned (D2H)** | 416 | **21.49 GB/s** | 1.40 GB/s | **15.35x** |
| **Pinned→Device (H2D)** | 24,960 | **23.39 GB/s** | N/A | N/A |
| Device→Pageable (D2H) | **0** | N/A | ~40 transfers | **Eliminated** |

**Verification**: All slow Device→Pageable transfers have been eliminated; the system reaches near-optimal PCIe Gen3 x16 bandwidth.

**Build**: `python setup.py build_ext --inplace`
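After building, a quick smoke test along these lines confirms the extension loads and matches a plain copy (hypothetical script; it assumes `memcpy_2d_async` accepts tensor views with the positional signature shown in Quick Start):
```python
import torch
from nanovllm.comm import memcpy_2d_async

L, B, F = 4, 8, 1024
cpu = torch.randn(L, B, F, dtype=torch.float16, pin_memory=True)
gpu = torch.empty(L, F, dtype=torch.float16, device="cuda")
elem = cpu.element_size()

# Copy block 3 across all layers: source rows are strided by B * F elements.
memcpy_2d_async(gpu, cpu[:, 3], F * elem, B * F * elem, F * elem, L, "h2d",
                torch.cuda.current_stream())
torch.cuda.synchronize()
assert torch.equal(gpu.cpu(), cpu[:, 3])
print("sgDMA smoke test passed")
```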
**Files**:
- `csrc/sgdma_kernel.cu`, `csrc/sgdma.cpp`: CUDA extension
- `nanovllm/comm/sgdma.py`: Python API
- `tests/test_sgdma.py`: Standalone benchmark
- `kvcache/offload_engine.py`: Integration (4 methods updated)

### Integration Details
**Modified methods in `offload_engine.py`**:
- `load_to_slot_all_layers()`: H2D ring buffer load
- `offload_slot_to_cpu()`: D2H ring buffer offload
- `offload_decode_slot()`: D2H decode slot offload
- `load_cpu_blocks_to_gpu_slots_all_layers()`: Batch H2D load
**Example replacement**:
```python
# Before (slow, Device→Pageable fallback)
self.k_cache_gpu[:, slot].copy_(self.k_cache_cpu[:, cpu_block], non_blocking=True)

# After (fast, Device→Pinned via sgDMA)
memcpy_2d_async(
    self.k_cache_gpu[:, slot], self.k_cache_cpu[:, cpu_block],
    self.gpu_pitch, self.cpu_pitch, self.width, self.height,
    "h2d", stream=self.transfer_stream_main
)
```
**Actual Impact**: 15.35x faster D2H transfers, eliminating the memory-transfer bottleneck. Expected 2-3x overall prefill throughput improvement.

## Configuration
| Parameter | Default | Notes |
|-----------|---------|-------|
| `kvcache_block_size` | 4096 | Tokens per block |
| `max_num_batched_tokens` | 16384 | Set = max_model_len for long context |
| `gpu_memory_utilization` | 0.9 | GPU memory fraction |
| `enable_cpu_offload` | False | Enable for long context |
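A hedged configuration example for long-context CPU offload. The keyword names follow the table above and `bench_offload.py`; verify them against `nanovllm/config.py` before relying on this.
```python
from nanovllm import LLM

# Long-context CPU-offload setup (parameter names assumed from the table above).
llm = LLM(
    "~/models/Qwen3-0.6B/",
    max_model_len=40960,
    max_num_batched_tokens=40960,   # set equal to max_model_len for long context
    kvcache_block_size=4096,
    gpu_memory_utilization=0.9,
    enable_cpu_offload=True,
    num_gpu_blocks=8,               # GPU ring-buffer slots, as used in bench_offload.py
)
```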
## Benchmarking
**Files**: `bench.py` (GPU), `bench_offload.py` (CPU offload), `bench_vllm.py` (vLLM comparison)

**Common Issues**:
1. `max_num_batched_tokens < max_model_len`: set them equal for long context
2. CUDA graph dimension mismatch: ensure `input_len + output_len <= max_model_len`
3. RoPE out of bounds: check the model's `max_position_embeddings` in `config.json`
**Model Limits**:
- Qwen3-0.6B/4B: 40960 tokens
- Qwen2.5-7B-Instruct-1M: 1048576 tokens

**Performance (Qwen3-0.6B, 40K context)**:
- GPU: ~18k tok/s prefill, ~100 tok/s decode
- CPU Offload: ~7.2k tok/s prefill, ~3.5 tok/s decode (trades speed for memory capacity)
## TODO: Alternative Optimizations

### 1. Pure PyTorch Layout Reorganization (Alternative to sgDMA)
**Note**: sgDMA (above) already solves this problem; this is a pure-PyTorch alternative that requires more extensive code changes.
**Change Layout**:
```python
# Current (non-contiguous access)
k_cache_cpu = torch.zeros(num_layers, num_cpu_blocks, block_size, kv_heads, head_dim,
                          dtype=dtype, device="cpu", pin_memory=True)
# Access: k_cache_cpu[:, block_id] -> strided, slow

# Optimized (contiguous access)
k_cache_cpu = torch.zeros(num_cpu_blocks, num_layers, block_size, kv_heads, head_dim,
                          dtype=dtype, device="cpu", pin_memory=True)
# Access: k_cache_cpu[block_id] -> contiguous, fast
```
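A small pure-PyTorch check (tiny illustrative sizes) showing why the reordered layout yields contiguous, pinned per-block slices while the current one does not:
```python
import torch

num_layers, num_blocks, block_size, kv_heads, head_dim = 4, 8, 16, 2, 64  # tiny sizes

current = torch.zeros(num_layers, num_blocks, block_size, kv_heads, head_dim, pin_memory=True)
optimized = torch.zeros(num_blocks, num_layers, block_size, kv_heads, head_dim, pin_memory=True)

print(current[:, 0].is_contiguous())   # False -> copies fall back to the pageable path
print(optimized[0].is_contiguous())    # True  -> eligible for fast pinned DMA
print(optimized[0].is_pinned())        # True  -> views of pinned storage remain pinned
```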
**Files to Modify**:
- `kvcache/offload_engine.py`: update all indexing in `load_to_slot_layer()`, `offload_slot_to_cpu()`
- Audit all other `k_cache_cpu`/`v_cache_cpu` accesses

**Trade-off**:
- **sgDMA**: minimal code changes, requires a CUDA extension, 24.95 GB/s
- **Layout change**: pure PyTorch, extensive refactoring, 24.91 GB/s (same performance)

**Recommendation**: Use sgDMA; it is faster to implement and delivers the same performance.
---
**Author**: Zijie Tian