[WIP] replace merge attention with triton kernel.

Zijie Tian
2025-12-25 01:07:05 +08:00
parent cf5e7df093
commit 16fcf8350b
5 changed files with 490 additions and 405 deletions

CLAUDE.md

@@ -1,365 +1,172 @@
# CLAUDE.md
This file provides guidance to Claude Code when working with this repository.
## Overview
Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline LLM inference. Supports Qwen3 models with CPU offload for long-context inference.
## Architecture
- **LLMEngine** (`llm_engine.py`): Main entry point; runs the prefill-decode loop (usage sketch below)
- **ModelRunner** (`model_runner.py`): Loads weights, allocates KV cache, captures CUDA graphs
- **Scheduler** (`scheduler.py`): Two-phase scheduling (prefill → decode)
- **BlockManager** (`block_manager.py`): Paged attention with prefix caching (xxhash), default block size 4096
- **Attention** (`layers/attention.py`): FlashAttention with chunked methods for CPU offload
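For orientation, a minimal offline-inference sketch. The `LLM`/`SamplingParams` interface shown here follows upstream nano-vllm; treat the exact argument names and return format as assumptions and verify against `example.py` in this repo.
```python
# Minimal usage sketch (API names follow upstream nano-vllm; verify against example.py).
from nanovllm import LLM, SamplingParams

llm = LLM("~/models/Qwen3-0.6B/", max_model_len=4096)
params = SamplingParams(temperature=0.6, max_tokens=128)

# LLMEngine.generate() runs the prefill-decode loop until every sequence finishes.
outputs = llm.generate(["Explain paged attention in one sentence."], params)
print(outputs[0]["text"])
```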
## CPU Offload System
### Overview
When `enable_cpu_offload=True`, KV cache is stored on CPU with a small GPU buffer for computation. This enables long-context inference with limited GPU memory.
### Ring Buffer Design
```
GPU Slots: [0] [1] [2] [3] ... (unified ring buffer)
Prefill: slot = chunk_idx % N
Decode:  slot[0] = decode, slots[1:] = load previous chunks
```
**Key Files**: `kvcache/offload_engine.py`, `kvcache/hybrid_manager.py`

**Memory Layout**:
- GPU: `[num_layers, num_gpu_blocks, block_size, kv_heads, head_dim]`
- CPU: `[num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]` (pinned memory)

**Key Methods**:
- `load_to_slot_layer(slot, layer, cpu_block)`: Async H2D load
- `offload_slot_to_cpu(slot, cpu_block)`: Async D2H offload
- Per-slot per-layer CUDA events for fine-grained synchronization

**Pipeline**: Double buffering with `compute_done` events prevents data races. Pipeline depth = N-1 for prefill, (N-1)/2 for decode. See the sketch below.
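A schematic of the prefill-side ring-buffer pipeline described above. Method names follow this document; the real signatures live in `kvcache/offload_engine.py` and `layers/attention.py`, so treat this as an illustrative sketch, not the actual implementation.
```python
from itertools import cycle

def process_prefill_chunk(engine, layer_id, chunk_idx, prev_cpu_blocks, new_cpu_block, num_slots):
    """Illustrative only: pipelined load/compute for one layer of one prefill chunk."""
    write_slot = chunk_idx % num_slots                        # current chunk's KV is written here
    load_slots = [s for s in range(num_slots) if s != write_slot]

    # Previous chunks are assigned to the remaining slots round-robin. Each async
    # H2D load waits on that slot's compute_done event before overwriting it, so
    # data still being read is never clobbered (pipeline depth = N - 1).
    assignments = list(zip(cycle(load_slots), prev_cpu_blocks))
    for slot, cpu_block in assignments:
        engine.load_to_slot_layer(slot, layer_id, cpu_block)

    for slot, _ in assignments:
        engine.wait_slot_layer(slot, layer_id)     # this layer's H2D transfer finished
        # ... flash-attention against the KV now resident in `slot`,
        #     partial outputs merged via LSE ...
        engine.record_slot_compute_done(slot)      # slot may be reused by the next load

    # Persist the freshly written chunk so later chunks (and decode) can reload it.
    engine.offload_slot_to_cpu(write_slot, new_cpu_block)
```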
## Scatter-Gather DMA (sgDMA) - INTEGRATED ✓

### Problem & Solution
**Problem**: Strided CPU cache access `k_cache_cpu[:, block_id]` caused slow Device→Pageable transfers at ~1.4 GB/s instead of the optimal ~24 GB/s pinned-memory bandwidth.

**Solution**: Implemented `cudaMemcpy2D` via a custom CUDA extension to handle strided layouts natively. **Integration complete** as of 2025-12-25.
### Quick Start
```python
from nanovllm.comm import memcpy_2d_async

# Transfer block_id across all layers
spitch = num_blocks * features * dtype_size   # source stride between layers
dpitch = features * dtype_size                # contiguous destination stride
width = features * dtype_size                 # bytes copied per row
height = num_layers                           # number of rows
memcpy_2d_async(gpu_buf, cpu_cache[:, block_id], dpitch, spitch, width, height, "h2d", stream)
```
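The pitch values above can be derived directly from the CPU cache tensor. A hypothetical helper (not part of the nanovllm API) for the `[num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]` layout:
```python
import torch

def sgdma_params(cpu_cache: torch.Tensor):
    """Compute cudaMemcpy2D pitches for copying one block across all layers (sketch)."""
    num_layers, num_blocks = cpu_cache.shape[:2]
    features = cpu_cache[0, 0].numel()           # block_size * kv_heads * head_dim
    dtype_size = cpu_cache.element_size()
    spitch = num_blocks * features * dtype_size  # bytes between layer rows in the strided CPU cache
    dpitch = features * dtype_size               # destination rows are packed back-to-back
    width = features * dtype_size                # bytes actually copied per row
    height = num_layers                          # one row per layer
    return dpitch, spitch, width, height
```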
### Benchmark Performance (Synthetic, 256MB)
| Method | Bandwidth | Speedup |
|--------|-----------|---------|
| **cudaMemcpy2D (sgDMA)** | **24.95 GB/s** | **Baseline** |
| PyTorch strided | 4.25 GB/s | **5.87x slower** |
| PyTorch contiguous | 24.92 GB/s | Same |
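A minimal pure-PyTorch sketch of the strided-vs-contiguous gap behind this table (illustrative sizes; the numbers above come from `tests/test_sgdma.py`):
```python
import torch

L, B, F = 32, 64, 4096 * 8   # layers, CPU blocks, features per block (illustrative sizes)
strided_cpu = torch.zeros(L, B, F, dtype=torch.float16, pin_memory=True)   # current layout
packed_cpu = torch.zeros(B, L, F, dtype=torch.float16, pin_memory=True)    # block-major layout
gpu = torch.empty(L, F, dtype=torch.float16, device="cuda")

def bandwidth_gbps(copy_fn, iters=20):
    copy_fn(); torch.cuda.synchronize()   # warmup
    start, end = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        copy_fn()
    end.record(); torch.cuda.synchronize()
    gbytes = gpu.numel() * gpu.element_size() * iters / 1e9
    return gbytes / (start.elapsed_time(end) / 1e3)

print("strided pinned slice   :", bandwidth_gbps(lambda: gpu.copy_(strided_cpu[:, 0], non_blocking=True)))
print("contiguous pinned slice:", bandwidth_gbps(lambda: gpu.copy_(packed_cpu[0], non_blocking=True)))
```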
### Real-World Performance (A100, Attention Offload)
**Measured from `test_attention_offload.py` profiling**:

| Transfer Type | Count | Bandwidth | Previous | Speedup |
|---------------|-------|-----------|----------|---------|
| **Device→Pinned (D2H)** | 416 | **21.49 GB/s** | 1.40 GB/s | **15.35x** |
| **Pinned→Device (H2D)** | 24,960 | **23.39 GB/s** | N/A | N/A |
| Device→Pageable (D2H) | **0** | N/A | ~40 transfers | **Eliminated** |

**Verification**: All slow Device→Pageable transfers have been eliminated; the system reaches near-optimal PCIe Gen3 x16 bandwidth.

**Build**: `python setup.py build_ext --inplace`
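After building, a quick smoke test along these lines confirms the extension loads and matches a plain copy (hypothetical script; it assumes `memcpy_2d_async` accepts tensor views with the positional signature shown in Quick Start):
```python
import torch
from nanovllm.comm import memcpy_2d_async

L, B, F = 4, 8, 1024
cpu = torch.randn(L, B, F, dtype=torch.float16, pin_memory=True)
gpu = torch.empty(L, F, dtype=torch.float16, device="cuda")
elem = cpu.element_size()

# Copy block 3 across all layers: source rows are strided by B * F elements.
memcpy_2d_async(gpu, cpu[:, 3], F * elem, B * F * elem, F * elem, L, "h2d",
                torch.cuda.current_stream())
torch.cuda.synchronize()
assert torch.equal(gpu.cpu(), cpu[:, 3])
print("sgDMA smoke test passed")
```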
**Files**:
- `csrc/sgdma_kernel.cu`, `csrc/sgdma.cpp`: CUDA extension
- `nanovllm/comm/sgdma.py`: Python API
- `tests/test_sgdma.py`: Standalone benchmark
- `kvcache/offload_engine.py`: Integration (4 methods updated)

### Integration Details
**Modified methods in `offload_engine.py`**:
- `load_to_slot_all_layers()`: H2D ring buffer load
- `offload_slot_to_cpu()`: D2H ring buffer offload
- `offload_decode_slot()`: D2H decode slot offload
- `load_cpu_blocks_to_gpu_slots_all_layers()`: Batch H2D load
**Example replacement**:
```python
# Before (slow, Device→Pageable fallback)
self.k_cache_gpu[:, slot].copy_(self.k_cache_cpu[:, cpu_block], non_blocking=True)

# After (fast, Device→Pinned via sgDMA)
memcpy_2d_async(
    self.k_cache_gpu[:, slot], self.k_cache_cpu[:, cpu_block],
    self.gpu_pitch, self.cpu_pitch, self.width, self.height,
    "h2d", stream=self.transfer_stream_main
)
```
**Actual Impact**: 15.35x faster D2H transfers, eliminating the memory-transfer bottleneck. Expected 2-3x overall prefill throughput improvement.

## Configuration
| Parameter | Default | Notes |
|-----------|---------|-------|
| `kvcache_block_size` | 4096 | Tokens per block |
| `max_num_batched_tokens` | 16384 | Set = max_model_len for long context |
| `gpu_memory_utilization` | 0.9 | GPU memory fraction |
| `enable_cpu_offload` | False | Enable for long context |
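A hedged configuration example for long-context CPU offload. The keyword names follow the table above and `bench_offload.py`; verify them against `nanovllm/config.py` before relying on this.
```python
from nanovllm import LLM

# Long-context CPU-offload setup (parameter names assumed from the table above).
llm = LLM(
    "~/models/Qwen3-0.6B/",
    max_model_len=40960,
    max_num_batched_tokens=40960,   # set equal to max_model_len for long context
    kvcache_block_size=4096,
    gpu_memory_utilization=0.9,
    enable_cpu_offload=True,
    num_gpu_blocks=8,               # GPU ring-buffer slots, as used in bench_offload.py
)
```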
## Benchmarking
**Files**: `bench.py` (GPU), `bench_offload.py` (CPU offload), `bench_vllm.py` (vLLM comparison)

**Common Issues**:
1. `max_num_batched_tokens < max_model_len`: set them equal for long context
2. CUDA graph dimension mismatch: ensure `input_len + output_len <= max_model_len`
3. RoPE out of bounds: check the model's `max_position_embeddings` in `config.json`
**Model Limits**:
- Qwen3-0.6B/4B: 40960 tokens
- Qwen2.5-7B-Instruct-1M: 1048576 tokens

**Performance (Qwen3-0.6B, 40K context)**:
- GPU: ~18k tok/s prefill, ~100 tok/s decode
- CPU Offload: ~7.2k tok/s prefill, ~3.5 tok/s decode (trades speed for memory capacity)
## TODO: Alternative Optimizations

### 1. Pure PyTorch Layout Reorganization (Alternative to sgDMA)
**Note**: sgDMA (above) already solves this problem; this is a pure-PyTorch alternative that requires more extensive code changes.
**Change Layout**:
```python
# Current (non-contiguous access)
k_cache_cpu = torch.zeros(num_layers, num_cpu_blocks, block_size, kv_heads, head_dim,
                          dtype=dtype, device="cpu", pin_memory=True)
# Access: k_cache_cpu[:, block_id] -> strided, slow

# Optimized (contiguous access)
k_cache_cpu = torch.zeros(num_cpu_blocks, num_layers, block_size, kv_heads, head_dim,
                          dtype=dtype, device="cpu", pin_memory=True)
# Access: k_cache_cpu[block_id] -> contiguous, fast
```
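A small pure-PyTorch check (tiny illustrative sizes) showing why the reordered layout yields contiguous, pinned per-block slices while the current one does not:
```python
import torch

num_layers, num_blocks, block_size, kv_heads, head_dim = 4, 8, 16, 2, 64  # tiny sizes

current = torch.zeros(num_layers, num_blocks, block_size, kv_heads, head_dim, pin_memory=True)
optimized = torch.zeros(num_blocks, num_layers, block_size, kv_heads, head_dim, pin_memory=True)

print(current[:, 0].is_contiguous())   # False -> copies fall back to the pageable path
print(optimized[0].is_contiguous())    # True  -> eligible for fast pinned DMA
print(optimized[0].is_pinned())        # True  -> views of pinned storage remain pinned
```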
**Files to Modify**:
- `kvcache/offload_engine.py`: update all indexing in `load_to_slot_layer()`, `offload_slot_to_cpu()`
- Audit all other `k_cache_cpu`/`v_cache_cpu` accesses

**Trade-off**:
- **sgDMA**: minimal code changes, requires a CUDA extension, 24.95 GB/s
- **Layout change**: pure PyTorch, extensive refactoring, 24.91 GB/s (same performance)

**Recommendation**: Use sgDMA; it is faster to implement and delivers the same performance.
---
**Author**: Zijie Tian