[WIP] replace merge attention with triton kernel.

Author: Zijie Tian
Date: 2025-12-25 01:07:05 +08:00
parent cf5e7df093
commit 16fcf8350b
5 changed files with 490 additions and 405 deletions

CLAUDE.md

@@ -1,365 +1,172 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Overview

Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline LLM inference. It supports Qwen3 models, with CPU offload for long-context inference.
## Architecture

### Core Components
**LLMEngine** (`nanovllm/engine/llm_engine.py`):
- Main entry point, wraps ModelRunner and Scheduler
- `generate()` runs the prefill-decode loop until all sequences finish

**ModelRunner** (`nanovllm/engine/model_runner.py`):
- Loads model weights, allocates KV cache, captures CUDA graphs
- Rank 0 is main process; ranks 1+ run via `loop()` with shared memory events
- Chunked offload methods: `run_chunked_offload_prefill()`, `run_chunked_offload_decode()`
**Scheduler** (`nanovllm/engine/scheduler.py`):
- Two-phase scheduling: prefill (waiting queue) then decode (running queue)
**BlockManager** (`nanovllm/engine/block_manager.py`):
- Paged attention block allocation with prefix caching via xxhash
- Blocks are 4096 tokens by default (configurable via `kvcache_block_size`)
### Model & Attention
**Attention** (`nanovllm/layers/attention.py`):
- FlashAttention: `flash_attn_varlen_func` (prefill), `flash_attn_with_kvcache` (decode)
- Triton kernel `store_kvcache_kernel` for KV cache writes
- Chunked attention methods: `_chunked_prefill_attention()`, `_chunked_decode_attention()`
**Global Context** (`nanovllm/utils/context.py`):
- Stores attention metadata via `get_context()`/`set_context()`
- Key fields: `cu_seqlens`, `slot_mapping`, `block_tables`, `chunked_seq`, `kvcache_manager`
- `kvcache_manager`: Reference to HybridKVCacheManager for chunked attention (set when `is_chunked_prefill=True`)
## CPU Offload System

### Overview

When `enable_cpu_offload=True`, the KV cache is stored on CPU with a small GPU buffer for computation. This enables long-context inference with limited GPU memory.

### Unified Ring Buffer Design

```
GPU Slots: [0] [1] [2] [3] [4] ...    (all slots form one unified ring buffer)

Prefill: ALL slots cycle as a ring buffer   [slot = chunk_idx % N]
Decode:  slot[0] = decode_slot, slots[1:] = load slots for previous chunks
```
**File**: `nanovllm/kvcache/offload_engine.py` (see also `nanovllm/kvcache/hybrid_manager.py`)

Key attributes:
- `num_ring_slots`: Total GPU slots (= num_gpu_blocks)
- `ring_slots`: List of all GPU slot indices [0, 1, 2, ...]
- `decode_slot = 0`: Fixed slot for decode KV writes
- `decode_load_slots`: Slots [1:] used to load previous chunks during decode
- `k_cache_gpu`/`v_cache_gpu`: Shape `[num_layers, num_gpu_blocks, block_size, kv_heads, head_dim]`
- `k_cache_cpu`/`v_cache_cpu`: Shape `[num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]` (pinned memory)

Key methods:

```python
# Prefill: get write slot and load slots
get_write_slot_for_prefill(chunk_idx)       # Returns chunk_idx % num_ring_slots
get_load_slots_for_prefill(write_slot_idx)  # Returns all slots except write_slot

# Decode: get load slots (excludes decode_slot)
get_load_slots_for_decode()                 # Returns slots[1:]

# Per-slot per-layer operations
load_to_slot_layer(slot_idx, layer_id, cpu_block_id)  # Async H2D load of a single block
wait_slot_layer(slot_idx, layer_id)                   # Wait for that layer's transfer
offload_slot_to_cpu(slot_idx, cpu_block_id)           # Async D2H offload to CPU
```
### Per-Slot Per-Layer Events (Critical Design)

Each slot has per-layer CUDA events for fine-grained synchronization:
- `ring_slot_ready[slot_idx][layer_id]`: H2D transfer completion
- `ring_slot_offload_done[slot_idx][layer_id]`: D2H transfer completion
- `ring_slot_compute_done[slot_idx][layer_id]`: Attention compute completion (for safe buffer reuse)

This enables:
1. Overlapped H2D transfers with attention computation
2. Each layer independently waiting for its own data
3. Pipeline depth of N-1 for prefill (N slots, 1 reserved for writing)
### Async Pipeline with Double Buffering

**File**: `nanovllm/layers/attention.py` - `_ring_buffer_pipeline_load()`

The async pipeline uses double buffering with `compute_done` events to prevent data races:

```
Synchronization flow for a safe async pipeline:
1. load_to_slot_layer() waits for compute_done[slot] before overwriting
2. wait_slot_layer() waits for slot_ready[slot] before reading
3. After flash_attn, record_slot_compute_done(slot) allows the next load

Timeline with 2 slots (A, B):
  Load B0→A
              Load B1→B     Load B2→A    ...
              Compute(A)    Compute(B)   ...
```

**Key**: `load_to_slot_layer` internally waits for `compute_done` before starting the transfer, preventing a data race where new data overwrites unread data.
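Schematically, the per-layer contract looks like the sketch below (illustrative only: `attend_fn` stands in for the flash-attn call, the loop simplifies `_ring_buffer_pipeline_load()`, and the `engine` method names are the ones listed above):

```python
def pipelined_layer_attention(engine, layer_id, prev_cpu_blocks, load_slots, attend_fn):
    """Illustrative ring-buffer pipeline for one layer (not the real implementation)."""
    outputs = []
    if not prev_cpu_blocks:
        return outputs

    # Kick off the first async H2D load; load_to_slot_layer() internally waits on
    # compute_done[slot] so it never overwrites data that is still being read.
    engine.load_to_slot_layer(load_slots[0], layer_id, prev_cpu_blocks[0])

    for i in range(len(prev_cpu_blocks)):
        slot = load_slots[i % len(load_slots)]

        # Prefetch the next block into a different slot while this one is consumed.
        if i + 1 < len(prev_cpu_blocks):
            nxt = load_slots[(i + 1) % len(load_slots)]
            engine.load_to_slot_layer(nxt, layer_id, prev_cpu_blocks[i + 1])

        # Wait only for this slot's transfer of this layer, then compute.
        engine.wait_slot_layer(slot, layer_id)
        outputs.append(attend_fn(slot, layer_id))

        # Mark the slot reusable so a later load may overwrite it.
        engine.record_slot_compute_done(slot)
    return outputs
```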
### Chunked Prefill Flow (Ring Buffer Pipeline)

**File**: `nanovllm/layers/attention.py` - `_chunked_prefill_attention()`

```
For prefill chunk K:
1. Current chunk's KV is written to ring_slot[K % N]
2. Load previous chunks from CPU using the N-1 available slots (pipelined)
3. Compute attention against previous KV (no causal mask)
4. Compute attention against current KV (causal mask)
5. Merge results using online softmax (LSE)
6. Offload the current slot to CPU

Pipeline timeline (4 slots, processing chunk 3):
  write_slot = 3, load_slots = [0, 1, 2]

  Load B0→S0    Load B1→S1    Load B2→S2    (wait)
       ↘             ↘             ↘
          Attn(B0)      Attn(B1)      Attn(B2)
```

**Key**: The write slot cycles through ALL slots; the load slots are all slots except the write slot.
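Putting steps 2-5 together, the per-chunk merge logic looks roughly like this (a sketch only: `load_kv_block` stands in for the ring-buffer load path, and `flash_attn_with_lse` is assumed to take `flash_attn_func`-style `softmax_scale`/`causal` keywords; the import path is the one documented later in this file):

```python
from nanovllm.kvcache.chunked_attention import flash_attn_with_lse, merge_attention_outputs

def prefill_chunk_attention(q, k_cur, v_cur, prev_blocks, load_kv_block, scale):
    """Sketch of attention for one prefill chunk against all previously offloaded blocks."""
    o_acc, lse_acc = None, None

    # Steps 2-3: attend to every previously offloaded block (no causal mask).
    for block_id in prev_blocks:
        k_prev, v_prev = load_kv_block(block_id)
        o_i, lse_i = flash_attn_with_lse(q, k_prev, v_prev, softmax_scale=scale, causal=False)
        if o_acc is None:
            o_acc, lse_acc = o_i, lse_i
        else:
            # Step 5: online-softmax merge of partial results.
            o_acc, lse_acc = merge_attention_outputs(o_acc, lse_acc, o_i, lse_i)

    # Step 4: attend to the current chunk with a causal mask, then merge.
    o_cur, lse_cur = flash_attn_with_lse(q, k_cur, v_cur, softmax_scale=scale, causal=True)
    if o_acc is None:
        return o_cur
    o_acc, _ = merge_attention_outputs(o_acc, lse_acc, o_cur, lse_cur)
    return o_acc
```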
### Chunked Decode Flow (Double Buffering)

**File**: `nanovllm/layers/attention.py` - `_chunked_decode_attention()`

Decode uses legacy double buffering with `decode_load_slots`:
- First half of `decode_load_slots`: 'compute' buffer
- Second half: 'prefetch' buffer

```
Timeline:
  Load:    C0 → buf0    C1 → buf1    C2 → buf0   ...
                 ↘            ↘            ↘
  Compute:         [C0]         [C1]         [C2]

1. Pre-load the first chunk into the compute buffer
2. Wait for the current buffer, trigger an async prefetch into the OTHER buffer
3. Compute attention, merge results
4. Swap buffers, repeat
5. Finally attend to decode_slot (the new token's KV)
```
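The same idea, as a schematic sketch (the `load`/`wait`/`attend`/`merge` callables and buffer handles are generic placeholders, not the real OffloadEngine API):

```python
def decode_attention_over_history(q, cpu_blocks, load, wait, attend, merge):
    """Schematic double-buffered decode loop over previously offloaded CPU blocks."""
    compute_buf, prefetch_buf = 0, 1          # two halves of decode_load_slots
    o_acc, lse_acc = None, None

    # 1. Pre-load the first chunk into the compute buffer.
    load(cpu_blocks[0], compute_buf)

    for i in range(len(cpu_blocks)):
        # 2. Trigger an async prefetch of the NEXT chunk into the OTHER buffer.
        if i + 1 < len(cpu_blocks):
            load(cpu_blocks[i + 1], prefetch_buf)
        wait(compute_buf)

        # 3. Compute attention for this chunk and merge via online softmax (LSE).
        o_i, lse_i = attend(q, compute_buf)
        o_acc, lse_acc = (o_i, lse_i) if o_acc is None else merge(o_acc, lse_acc, o_i, lse_i)

        # 4. Swap buffers and repeat.
        compute_buf, prefetch_buf = prefetch_buf, compute_buf

    # 5. Finally attend to decode_slot (the new token's KV) and merge once more.
    o_new, lse_new = attend(q, "decode_slot")
    o_final, _ = merge(o_acc, lse_acc, o_new, lse_new) if o_acc is not None else (o_new, lse_new)
    return o_final
```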
### HybridKVCacheManager
**File**: `nanovllm/kvcache/hybrid_manager.py`
CPU-primary KV cache manager with GPU ring buffer design:
- All KV cache is stored on CPU as primary storage
- GPU is used as a ring buffer for computation only
- Ring buffer enables pipelined H2D transfers overlapped with computation
Key methods:
- `allocate()` / `allocate_cpu_only()`: Allocate all blocks to CPU
- `get_all_cpu_blocks(seq)`: Get all CPU block IDs for a sequence
- `get_prefilled_cpu_blocks(seq)`: Get CPU blocks from previous chunks
- `get_write_slot_for_chunked_offload(seq)`: Get GPU slot for writing new KV (returns decode_slot)
### Online Softmax Merge

**File**: `nanovllm/kvcache/chunked_attention.py`

When computing attention across multiple chunks, results are merged using log-sum-exp:

```python
def merge_attention_outputs(o1, lse1, o2, lse2):
    # Uses LSE to correctly weight and combine partial attention outputs
    ...
```
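For reference, the merge rule in plain PyTorch (it mirrors the formula implemented by `merge_attention_outputs`; `o` is `[batch, seqlen_q, nheads, headdim]`, `lse` is `[batch, nheads, seqlen_q]`):

```python
import torch

def merge_attention_outputs_reference(o1, lse1, o2, lse2):
    """Numerically stable online-softmax merge of two partial attention results."""
    max_lse = torch.maximum(lse1, lse2)        # m_new = max(lse1, lse2)
    exp1 = torch.exp(lse1 - max_lse)           # rescale factors
    exp2 = torch.exp(lse2 - max_lse)

    # Broadcast [batch, nheads, seqlen_q] -> [batch, seqlen_q, nheads, 1] for the outputs.
    w1 = exp1.transpose(1, 2).unsqueeze(-1)
    w2 = exp2.transpose(1, 2).unsqueeze(-1)

    o_merged = (o1 * w1 + o2 * w2) / (w1 + w2)
    lse_merged = max_lse + torch.log(exp1 + exp2)
    return o_merged.to(o1.dtype), lse_merged
```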
### Flash Attention with LSE

**File**: `nanovllm/kvcache/chunked_attention.py` - `flash_attn_with_lse()`

Uses the native `flash_attn_func` with `return_attn_probs=True` to get the LSE output. This:
- Natively supports GQA (no memory overhead for head replication)
- Avoids `repeat_interleave`, which would copy K/V heads (40MB+ per call)
- Returns `(output, lse)` for online softmax merging

### Pipeline Depth

- **Prefill**: Pipeline depth = N-1 (where N = num_gpu_blocks)
- **Decode**: Pipeline depth = (N-1)/2 (double buffering within decode_load_slots)

## Scatter-Gather DMA (sgDMA) - INTEGRATED ✓

### Problem & Solution

**Problem**: Strided CPU cache access `k_cache_cpu[:, block_id]` caused slow Device→Pageable transfers at ~1.4 GB/s instead of the optimal ~24 GB/s pinned-memory bandwidth.

**Solution**: `cudaMemcpy2D` via a custom CUDA extension handles strided layouts natively. **Integration complete** as of 2025-12-25.

### Quick Start

```python
from nanovllm.comm import memcpy_2d_async

# Transfer block_id across all layers
spitch = num_blocks * features * dtype_size   # stride between layers
dpitch = features * dtype_size                # contiguous destination
width = features * dtype_size                 # bytes per row
height = num_layers                           # number of rows

memcpy_2d_async(gpu_buf, cpu_cache[:, block_id], dpitch, spitch, width, height, "h2d", stream)
```

### Benchmark Performance (Synthetic, 256MB)

| Method | Bandwidth | Speedup |
|--------|-----------|---------|
| **cudaMemcpy2D (sgDMA)** | **24.95 GB/s** | **Baseline** |
| PyTorch strided | 4.25 GB/s | **5.87x slower** |
| PyTorch contiguous | 24.92 GB/s | Same |

### Real-World Performance (A100, Attention Offload)

**Measured from `test_attention_offload.py` profiling**:

| Transfer Type | Count | Bandwidth | Previous | Speedup |
|---------------|-------|-----------|----------|---------|
| **Device→Pinned (D2H)** | 416 | **21.49 GB/s** | 1.40 GB/s | **15.35x** |
| **Pinned→Device (H2D)** | 24,960 | **23.39 GB/s** | N/A | N/A |
| Device→Pageable (D2H) | **0** | N/A | ~40 transfers | **Eliminated** |

**Verification**: All slow Device→Pageable transfers are eliminated; the system achieves near-optimal PCIe bandwidth.

**Build**: `python setup.py build_ext --inplace`

**Files**:
- `csrc/sgdma_kernel.cu`, `csrc/sgdma.cpp`: CUDA extension
- `nanovllm/comm/sgdma.py`: Python API
- `tests/test_sgdma.py`: Standalone benchmark
- `kvcache/offload_engine.py`: Integration (4 methods updated)

### Integration Details

**Modified methods in `offload_engine.py`**:
- `load_to_slot_all_layers()`: H2D ring buffer load
- `offload_slot_to_cpu()`: D2H ring buffer offload
- `offload_decode_slot()`: D2H decode slot offload
- `load_cpu_blocks_to_gpu_slots_all_layers()`: Batch H2D load

**Example replacement**:

```python
# Before (slow, Device→Pageable fallback)
self.k_cache_gpu[:, slot].copy_(self.k_cache_cpu[:, cpu_block], non_blocking=True)

# After (fast, Device→Pinned via sgDMA)
memcpy_2d_async(
    self.k_cache_gpu[:, slot], self.k_cache_cpu[:, cpu_block],
    self.gpu_pitch, self.cpu_pitch, self.width, self.height,
    "h2d", stream=self.transfer_stream_main
)
```

**Actual Impact**: 15.35x faster D2H transfers, eliminating the memory-transfer bottleneck. Expected 2-3x overall prefill throughput improvement.
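To make the pitch parameters concrete, here is a small self-contained example (the sizes are made up; the `memcpy_2d_async` argument order follows the Quick Start above):

```python
import torch
from nanovllm.comm import memcpy_2d_async

# Hypothetical small config, for illustration only.
num_layers, num_cpu_blocks = 4, 8
block_size, kv_heads, head_dim = 1024, 8, 64
dtype = torch.float16

features = block_size * kv_heads * head_dim           # elements per (layer, block)
spitch = num_cpu_blocks * features * dtype.itemsize   # bytes between consecutive layers (CPU side)
dpitch = features * dtype.itemsize                    # destination rows are contiguous
width = features * dtype.itemsize                     # bytes copied per row
height = num_layers                                   # one row per layer

cpu_cache = torch.randn(num_layers, num_cpu_blocks, features, dtype=dtype, pin_memory=True)
gpu_buf = torch.empty(num_layers, features, dtype=dtype, device="cuda")

stream = torch.cuda.Stream()
block_id = 3
memcpy_2d_async(gpu_buf, cpu_cache[:, block_id], dpitch, spitch, width, height, "h2d", stream)
stream.synchronize()

# The strided gather should match a plain PyTorch copy of the same slice.
assert torch.equal(gpu_buf.cpu(), cpu_cache[:, block_id])
```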
## Performance Optimizations
### Warmup Model Optimization
**File**: `nanovllm/engine/model_runner.py` - `warmup_model()`
Warmup uses a reasonable sequence length (`block_size * 2`) instead of `max_model_len`:
- Avoids huge intermediate activation memory allocation
- 8192 tokens is sufficient to trigger CUDA kernel JIT compilation
- Prevents OOM during initialization for long-context configs (256K+)
### Memory Considerations
**GQA Head Replication**: The chunked attention uses native `flash_attn_func` which handles GQA internally without memory overhead. Previous implementation used `repeat_interleave` which copied K/V heads, adding ~40MB per attention call.
**Block Size Trade-off**:
- Larger block_size (4096) = fewer H2D transfers, better throughput
- Smaller block_size (256) = finer granularity, less wasted memory
- Current default: 4096 tokens per block
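As a quick worked example of the trade-off (plain arithmetic, not a measurement):

```python
# H2D loads needed to stream the KV history of a 40,960-token sequence once, per layer pass:
seq_len = 40960
for block_size in (4096, 256):
    num_blocks = seq_len // block_size
    print(f"block_size={block_size}: {num_blocks} blocks -> {num_blocks} H2D loads")
# block_size=4096: 10 blocks; block_size=256: 160 blocks (16x more transfer launches)
```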
## Configuration Defaults

| Parameter | Default | Description |
|-----------|---------|-------------|
| `kvcache_block_size` | 4096 | Tokens per KV cache block |
| `max_num_batched_tokens` | 16384 | Max tokens per batch; set equal to `max_model_len` for long context |
| `max_num_seqs` | 512 | Max concurrent sequences |
| `gpu_memory_utilization` | 0.9 | GPU memory fraction for KV cache |
| `enforce_eager` | False | Disable CUDA graphs if True |
| `enable_cpu_offload` | False | Enable CPU offload for long context |
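A sketch of wiring these defaults together for a long-context offload run (assuming the `LLM` entry point forwards these `Config` fields as keyword arguments, as the benchmark scripts do; names may need adjusting to the actual constructor):

```python
from nanovllm import LLM, SamplingParams

# Illustrative settings for a 40K-context CPU-offload run; parameter names are
# assumed to match the Config fields documented above.
llm = LLM(
    "~/models/Qwen3-0.6B/",
    max_model_len=40960,
    max_num_batched_tokens=40960,   # must be >= max_model_len for long context
    kvcache_block_size=4096,
    gpu_memory_utilization=0.9,
    enable_cpu_offload=True,
)

outputs = llm.generate(["Summarize the repository layout."],
                       SamplingParams(temperature=0.6, max_tokens=128))
print(outputs[0])
```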
## Benchmarking

### Benchmark Files

| File | Purpose | Key Parameters |
|------|---------|----------------|
| `bench.py` | Standard GPU benchmark | Pure GPU inference |
| `bench_offload.py` | CPU offload benchmark | `enable_cpu_offload=True`, `num_gpu_blocks=8` |
| `bench_vllm.py` | vLLM comparison | Uses the vLLM API for a baseline comparison |

### Current Test Configuration

All benchmark files are aligned to use:
- **Model**: `~/models/Qwen3-0.6B/`
- **max_model_len**: 40960 (limited by the model's `max_position_embeddings`)
- **Prefill test**: input_len = max_len - 1 (40959 tokens)
- **Decode test**: input_len = max_len - 128, output_len = 128
### Common Issues and Solutions

**1. `max_num_batched_tokens` assertion error**
```
AssertionError: assert self.max_num_batched_tokens >= self.max_model_len
```
**Solution**: Set `max_num_batched_tokens=max_model_len` when using large context lengths.
**2. CUDA graph block_tables dimension mismatch**
```
RuntimeError: The expanded size of the tensor (1) must match the existing size (2)
```
**Cause**: `input_len + output_len > max_model_len` causes more blocks than pre-allocated in CUDA graph.
**Solution**: Ensure `input_len + output_len <= max_model_len`.
**3. RoPE position embedding out of bounds**
```
Assertion `index out of bounds: 0 <= ... < 40960` failed
```
**Cause**: Sequence length exceeds model's `max_position_embeddings`.
**Solution**: Check model's `config.json` for `max_position_embeddings` and limit `max_model_len` accordingly.
### Model Context Length Limits
| Model | max_position_embeddings | Notes |
|-------|------------------------|-------|
| Qwen3-0.6B | 40960 | ~40K context |
| Qwen3-4B | 40960 | ~40K context |
| Qwen2.5-7B-Instruct-1M | 1048576 | 1M context |
**Important**: Always check `max_position_embeddings` in `config.json` before setting `max_model_len`.
### Performance Reference (Qwen3-0.6B, 40K context)
| Mode | Prefill (tok/s) | Decode (tok/s) |
|------|-----------------|----------------|
| GPU (bench.py) | ~18,000 | ~100 |
| CPU Offload (bench_offload.py) | ~7,200 | ~3.5 |
CPU offload trades performance for memory efficiency, enabling long-context inference on limited GPU memory.
## TODO: Alternative Optimizations

### 1. Pure PyTorch Layout Reorganization (Alternative to sgDMA)

**Note**: sgDMA (above) already solves this; this is a pure-PyTorch alternative that requires more code changes.

**Problem**: Device-to-Pageable transfers caused a ~16x slowdown in CPU offload.
**Root Cause**:
Current CPU cache layout `[num_layers, num_cpu_blocks, ...]` causes non-contiguous memory access when slicing `k_cache_cpu[:, cpu_block_id]`. Although the tensor is pinned, CUDA runtime falls back to slow pageable transfer path because the slice is non-contiguous.
**Evidence from Profiling** (`tests/test_pinned_transfer.py` + nsys):
```
Non-contiguous slice (current):
- Transfer type: Device -> Pageable
- Avg duration: 5.825 ms
- Bandwidth: 1.44 GB/s
Contiguous layout (optimized):
- Transfer type: Device -> Pinned
- Avg duration: 0.364 ms
- Bandwidth: 23.11 GB/s
Performance gain: 16x faster!
```
**Technical Details**:
- Pinned memory requires both `pin_memory=True` AND contiguous layout for fast DMA
- Non-contiguous slice forces CUDA to:
1. Allocate temporary pageable buffer on CPU
2. Copy non-contiguous data to buffer (CPU overhead)
3. Transfer from pageable buffer to GPU (slow path)
- PCIe DMA engine requires contiguous memory blocks for optimal throughput
**Solution**:
Change CPU cache tensor layout from:
```python
# Current (non-contiguous when accessing per-block):
k_cache_cpu = torch.zeros(
    num_layers, num_cpu_blocks, block_size, num_kv_heads, head_dim,
    dtype=dtype, device="cpu", pin_memory=True
)
# Access: k_cache_cpu[:, cpu_block_id] -> non-contiguous!

# Optimized (contiguous per-block access):
k_cache_cpu = torch.zeros(
    num_cpu_blocks, num_layers, block_size, num_kv_heads, head_dim,
    dtype=dtype, device="cpu", pin_memory=True
)
# Access: k_cache_cpu[cpu_block_id] -> contiguous!
```
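A minimal way to reproduce the contiguity effect with plain PyTorch, in the spirit of `tests/test_pinned_transfer.py` (illustrative sizes, not the actual test):

```python
import torch

def h2d_bandwidth_gbs(src_cpu, iters=20):
    """Time repeated H2D copies of `src_cpu` and return GB/s."""
    dst = torch.empty_like(src_cpu, device="cuda")
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        dst.copy_(src_cpu, non_blocking=True)
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1000
    return src_cpu.numel() * src_cpu.element_size() * iters / seconds / 1e9

layers, blocks = 8, 16
numel = 1024 * 8 * 64   # elements per (layer, block) slice
layout_a = torch.randn(layers, blocks, numel, dtype=torch.float16, pin_memory=True)  # current layout
layout_b = torch.randn(blocks, layers, numel, dtype=torch.float16, pin_memory=True)  # proposed layout

print("strided slice   :", h2d_bandwidth_gbs(layout_a[:, 0]), "GB/s")  # falls back to the pageable path
print("contiguous slice:", h2d_bandwidth_gbs(layout_b[0]), "GB/s")     # stays on the pinned fast path
```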
**Files to modify**:
- `nanovllm/kvcache/offload_engine.py`:
  - Lines 104-111: Change tensor allocation layout
  - All methods accessing the CPU cache: update indexing
    (`load_to_slot_layer()`, `offload_slot_to_cpu()`, `offload_slot_layer_to_cpu()`)
- Update any other code that accesses `k_cache_cpu`/`v_cache_cpu`
**Expected Impact**:
- ~16x faster D2H transfers in CPU offload mode
- Overall prefill throughput improvement: ~2-3x (D2H was the bottleneck)
- No change to API or functionality; a pure performance optimization

**Trade-off**:
- **sgDMA**: Minimal code changes, requires a CUDA extension, 24.95 GB/s
- **Layout change**: Pure PyTorch, extensive refactoring, 24.91 GB/s (same performance)

**Recommendation**: Use sgDMA for the faster implementation at the same performance.

**Reference**:
- Test: `tests/test_pinned_transfer.py`
- Profiling: `results/nsys/pinned_transfer_20251224_213158.nsys-rep`
- Analysis: See traces showing Device→Pageable vs Device→Pinned

---
**Author**: Zijie Tian

nanovllm/kvcache/chunked_attention.py

@@ -275,6 +275,85 @@ def flash_attn_with_lse(
    return out, lse
@triton.jit
def _merge_lse_kernel(
    lse1_ptr, lse2_ptr, lse_out_ptr,
    num_elements: tl.constexpr,
    BLOCK_SIZE: tl.constexpr,
):
    """Fused kernel for merging LSE values."""
    # Each program handles BLOCK_SIZE elements
    pid = tl.program_id(0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < num_elements

    # Load lse values
    lse1 = tl.load(lse1_ptr + offsets, mask=mask)
    lse2 = tl.load(lse2_ptr + offsets, mask=mask)

    # Compute max for numerical stability
    max_lse = tl.maximum(lse1, lse2)

    # Compute exp(lse - max_lse)
    exp1 = tl.exp(lse1 - max_lse)
    exp2 = tl.exp(lse2 - max_lse)

    # Compute merged LSE: max_lse + log(exp1 + exp2)
    lse_merged = max_lse + tl.log(exp1 + exp2)

    # Store result
    tl.store(lse_out_ptr + offsets, lse_merged, mask=mask)


@triton.jit
def _merge_output_kernel(
    o1_ptr, o2_ptr, lse1_ptr, lse2_ptr, o_out_ptr,
    batch: tl.constexpr, seqlen_q: tl.constexpr, nheads: tl.constexpr, headdim: tl.constexpr,
    BLOCK_SIZE: tl.constexpr,
):
    """Fused kernel for merging attention outputs."""
    # Each program handles BLOCK_SIZE elements along headdim for one (batch, seqlen_q, nheads) position
    pid_batch = tl.program_id(0)
    pid_seq = tl.program_id(1)
    pid_head = tl.program_id(2)

    # Compute LSE index: [batch, nheads, seqlen_q]
    lse_idx = pid_batch * nheads * seqlen_q + pid_head * seqlen_q + pid_seq

    # Load LSE values
    lse1 = tl.load(lse1_ptr + lse_idx)
    lse2 = tl.load(lse2_ptr + lse_idx)

    # Compute max and scaling factors
    max_lse = tl.maximum(lse1, lse2)
    exp1 = tl.exp(lse1 - max_lse)
    exp2 = tl.exp(lse2 - max_lse)
    sum_exp = exp1 + exp2

    # Process headdim in chunks
    for d_offset in range(0, headdim, BLOCK_SIZE):
        d_idx = d_offset + tl.arange(0, BLOCK_SIZE)
        mask = d_idx < headdim

        # Compute output index: [batch, seqlen_q, nheads, headdim]
        base_idx = (pid_batch * seqlen_q * nheads * headdim +
                    pid_seq * nheads * headdim +
                    pid_head * headdim)
        o_idx = base_idx + d_idx

        # Load o1, o2
        o1_val = tl.load(o1_ptr + o_idx, mask=mask, other=0.0)
        o2_val = tl.load(o2_ptr + o_idx, mask=mask, other=0.0)

        # Compute merged output: (o1 * exp1 + o2 * exp2) / sum_exp
        o_merged = (o1_val * exp1 + o2_val * exp2) / sum_exp

        # Store result
        tl.store(o_out_ptr + o_idx, o_merged, mask=mask)
def merge_attention_outputs(
    o1: torch.Tensor,
    lse1: torch.Tensor,
@@ -282,7 +361,7 @@ def merge_attention_outputs(
    lse2: torch.Tensor,
) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Merge two attention outputs using online softmax (Triton fused kernel).

    This implements the online softmax merging formula:
    - m_new = max(lse1, lse2)
@@ -299,31 +378,30 @@ def merge_attention_outputs(
        o_merged: Merged output [batch, seqlen_q, nheads, headdim]
        lse_merged: Merged LSE [batch, nheads, seqlen_q]
    """
    batch, seqlen_q, nheads, headdim = o1.shape

    # Allocate output tensors
    o_merged = torch.empty_like(o1)
    lse_merged = torch.empty_like(lse1)

    # Launch LSE merge kernel
    num_lse_elements = batch * nheads * seqlen_q
    BLOCK_SIZE_LSE = 256
    grid_lse = (triton.cdiv(num_lse_elements, BLOCK_SIZE_LSE),)
    _merge_lse_kernel[grid_lse](
        lse1, lse2, lse_merged,
        num_lse_elements,
        BLOCK_SIZE=BLOCK_SIZE_LSE,
    )

    # Launch output merge kernel
    BLOCK_SIZE = 128
    grid_output = (batch, seqlen_q, nheads)
    _merge_output_kernel[grid_output](
        o1, o2, lse1, lse2, o_merged,
        batch, seqlen_q, nheads, headdim,
        BLOCK_SIZE=BLOCK_SIZE,
    )

    return o_merged, lse_merged
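A quick numerical sanity check of the fused kernels against a straightforward PyTorch merge (a sketch; the import path follows the layout documented in CLAUDE.md):

```python
import torch
# Path as documented in CLAUDE.md; adjust if the module moves.
from nanovllm.kvcache.chunked_attention import merge_attention_outputs

def reference_merge(o1, lse1, o2, lse2):
    max_lse = torch.maximum(lse1, lse2)
    e1, e2 = torch.exp(lse1 - max_lse), torch.exp(lse2 - max_lse)
    w1, w2 = e1.transpose(1, 2).unsqueeze(-1), e2.transpose(1, 2).unsqueeze(-1)
    return ((o1 * w1 + o2 * w2) / (w1 + w2)).to(o1.dtype), max_lse + torch.log(e1 + e2)

batch, seqlen_q, nheads, headdim = 2, 128, 8, 64
o1 = torch.randn(batch, seqlen_q, nheads, headdim, dtype=torch.float16, device="cuda")
o2 = torch.randn_like(o1)
lse1 = torch.randn(batch, nheads, seqlen_q, dtype=torch.float32, device="cuda")
lse2 = torch.randn_like(lse1)

o_triton, lse_triton = merge_attention_outputs(o1, lse1, o2, lse2)
o_ref, lse_ref = reference_merge(o1, lse1, o2, lse2)
assert torch.allclose(lse_triton, lse_ref, atol=1e-5)
assert torch.allclose(o_triton.float(), o_ref.float(), atol=1e-2)
```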

nanovllm/kvcache/offload_engine.py

@@ -14,6 +14,7 @@ from typing import Dict, List, Tuple, Optional
from dataclasses import dataclass

from nanovllm.kvcache.kernels import gathered_copy_kv
from nanovllm.comm import memcpy_2d_async
from nanovllm.utils.logger import get_logger

logger = get_logger("offload_engine")
@@ -65,6 +66,16 @@ class OffloadEngine:
        self.kv_dim = num_kv_heads * head_dim
        self.block_numel = block_size * self.kv_dim

        # ========== sgDMA pitch parameters for strided transfers ==========
        self.dtype_size = dtype.itemsize
        self.cpu_pitch = num_cpu_blocks * self.block_numel * self.dtype_size
        self.gpu_pitch = num_gpu_blocks * self.block_numel * self.dtype_size
        self.width = self.block_numel * self.dtype_size
        self.height = num_layers
        logger.info(f"sgDMA parameters: cpu_pitch={self.cpu_pitch}, gpu_pitch={self.gpu_pitch}, "
                    f"width={self.width}, height={self.height}")
        # ========== Unified Ring Buffer configuration ==========
        # Constraint checks
        assert num_gpu_blocks >= 2, \
@@ -478,14 +489,18 @@ class OffloadEngine:
        with torch.cuda.stream(stream):
            for cpu_block_id, gpu_slot in zip(cpu_block_ids, gpu_slot_ids):
                # Copy all layers at once using sgDMA
                memcpy_2d_async(
                    self.k_cache_gpu[:, gpu_slot],
                    self.k_cache_cpu[:, cpu_block_id],
                    self.gpu_pitch, self.cpu_pitch, self.width, self.height,
                    "h2d", stream=stream
                )
                memcpy_2d_async(
                    self.v_cache_gpu[:, gpu_slot],
                    self.v_cache_cpu[:, cpu_block_id],
                    self.gpu_pitch, self.cpu_pitch, self.width, self.height,
                    "h2d", stream=stream
                )
        stream.synchronize()
@@ -697,11 +712,17 @@ class OffloadEngine:
logger.debug(f"Ring load all layers: CPU[{cpu_block_id}] -> GPU slot[{slot_idx}]") logger.debug(f"Ring load all layers: CPU[{cpu_block_id}] -> GPU slot[{slot_idx}]")
with torch.cuda.stream(self.transfer_stream_main): with torch.cuda.stream(self.transfer_stream_main):
self.k_cache_gpu[:, slot_idx].copy_( memcpy_2d_async(
self.k_cache_cpu[:, cpu_block_id], non_blocking=True self.k_cache_gpu[:, slot_idx],
self.k_cache_cpu[:, cpu_block_id],
self.gpu_pitch, self.cpu_pitch, self.width, self.height,
"h2d", stream=self.transfer_stream_main
) )
self.v_cache_gpu[:, slot_idx].copy_( memcpy_2d_async(
self.v_cache_cpu[:, cpu_block_id], non_blocking=True self.v_cache_gpu[:, slot_idx],
self.v_cache_cpu[:, cpu_block_id],
self.gpu_pitch, self.cpu_pitch, self.width, self.height,
"h2d", stream=self.transfer_stream_main
) )
self.ring_slot_all_layers_ready[slot_idx].record(self.transfer_stream_main) self.ring_slot_all_layers_ready[slot_idx].record(self.transfer_stream_main)
@@ -724,11 +745,17 @@ class OffloadEngine:
        torch.cuda.nvtx.range_push(f"D2H: Slot[{slot_idx}]->CPU[{cpu_block_id}]")
        with torch.cuda.stream(self.transfer_stream_main):
            self.transfer_stream_main.wait_stream(self.compute_stream)
            memcpy_2d_async(
                self.k_cache_cpu[:, cpu_block_id],
                self.k_cache_gpu[:, slot_idx],
                self.cpu_pitch, self.gpu_pitch, self.width, self.height,
                "d2h", stream=self.transfer_stream_main
            )
            memcpy_2d_async(
                self.v_cache_cpu[:, cpu_block_id],
                self.v_cache_gpu[:, slot_idx],
                self.cpu_pitch, self.gpu_pitch, self.width, self.height,
                "d2h", stream=self.transfer_stream_main
            )
            self.ring_slot_all_layers_offload_done[slot_idx].record(self.transfer_stream_main)
        torch.cuda.nvtx.range_pop()
@@ -813,11 +840,17 @@ class OffloadEngine:
        with torch.cuda.stream(self.transfer_stream_main):
            self.transfer_stream_main.wait_stream(self.compute_stream)
            memcpy_2d_async(
                self.k_cache_cpu[:, cpu_block_id],
                self.k_cache_gpu[:, self.decode_slot],
                self.cpu_pitch, self.gpu_pitch, self.width, self.height,
                "d2h", stream=self.transfer_stream_main
            )
            memcpy_2d_async(
                self.v_cache_cpu[:, cpu_block_id],
                self.v_cache_gpu[:, self.decode_slot],
                self.cpu_pitch, self.gpu_pitch, self.width, self.height,
                "d2h", stream=self.transfer_stream_main
            )
            self.decode_offload_done.record(self.transfer_stream_main)

tests/test_attention_offload.py

@@ -0,0 +1,221 @@
"""
Test Attention layer with KV cache offload in isolation.
This test demonstrates how to use Attention + HybridKVCacheManager directly
without requiring full LLMEngine/ModelRunner setup.
"""
import torch
from nanovllm.layers.attention import Attention
from nanovllm.kvcache.hybrid_manager import HybridKVCacheManager
from nanovllm.engine.sequence import Sequence
from nanovllm.utils.context import set_context, reset_context
# ============================================================
# Configuration
# ============================================================
NUM_LAYERS = 8 # Multi-layer for realistic profiling
NUM_HEADS = 8
NUM_KV_HEADS = 8
HEAD_DIM = 64
BLOCK_SIZE = 1024 # tokens per block
CHUNK_SIZE = 1024 # tokens per chunk (same as block for simplicity)
NUM_GPU_SLOTS = 4
NUM_CPU_BLOCKS = 16
DTYPE = torch.float16
DEVICE = "cuda"
# ============================================================
# Setup: Create Manager and Attention Layers
# ============================================================
def create_manager():
"""Create and initialize HybridKVCacheManager with OffloadEngine."""
manager = HybridKVCacheManager(
num_gpu_slots=NUM_GPU_SLOTS,
num_cpu_blocks=NUM_CPU_BLOCKS,
block_size=BLOCK_SIZE,
)
# Initialize offload engine (this creates k_cache_gpu/cpu, v_cache_gpu/cpu)
manager.allocate_cache(
num_layers=NUM_LAYERS,
num_kv_heads=NUM_KV_HEADS,
head_dim=HEAD_DIM,
dtype=DTYPE,
)
return manager
def create_attention_layers(manager):
"""Create attention layers and bind KV cache."""
layers = []
for layer_id in range(NUM_LAYERS):
attn = Attention(
num_heads=NUM_HEADS,
head_dim=HEAD_DIM,
scale=HEAD_DIM ** -0.5,
num_kv_heads=NUM_KV_HEADS,
)
attn.layer_id = layer_id
# Bind KV cache from manager
k_cache, v_cache = manager.get_layer_cache(layer_id)
attn.k_cache = k_cache
attn.v_cache = v_cache
layers.append(attn.to(DEVICE))
return layers
def create_test_sequence(manager, num_chunks=3):
"""Create a test sequence and allocate blocks."""
total_tokens = num_chunks * CHUNK_SIZE
# Sequence only takes token_ids
seq = Sequence(token_ids=list(range(total_tokens)))
# Set block_size for this test
seq.block_size = BLOCK_SIZE
# Allocate blocks (will be on CPU in CPU-primary mode)
manager.allocate(seq)
return seq
# ============================================================
# Chunked Prefill Simulation
# ============================================================
def simulate_chunk_forward(
    layers,
    manager,
    seq,
    chunk_idx,
    chunk_size,
):
"""
Simulate forward pass for one chunk through all layers.
Returns:
output: Final layer attention output
"""
# Generate random Q, K, V for this chunk
hidden = torch.randn(chunk_size, NUM_HEADS, HEAD_DIM, dtype=DTYPE, device=DEVICE)
# Build slot_mapping: maps token positions to GPU slots
write_slot = manager.offload_engine.get_write_slot_for_prefill(chunk_idx)
slot_mapping = torch.full((chunk_size,), write_slot * BLOCK_SIZE, dtype=torch.long, device=DEVICE)
slot_mapping += torch.arange(chunk_size, device=DEVICE)
# Build cu_seqlens for flash attention
cu_seqlens = torch.tensor([0, chunk_size], dtype=torch.int32, device=DEVICE)
# Set context for this chunk
set_context(
is_prefill=True,
is_chunked_prefill=True,
cu_seqlens_q=cu_seqlens,
cu_seqlens_k=cu_seqlens,
max_seqlen_q=chunk_size,
max_seqlen_k=chunk_size,
slot_mapping=slot_mapping,
kvcache_manager=manager,
chunked_seq=seq,
current_chunk_idx=chunk_idx,
)
# Forward through all layers
output = hidden
for layer in layers:
k = torch.randn(chunk_size, NUM_KV_HEADS, HEAD_DIM, dtype=DTYPE, device=DEVICE)
v = torch.randn(chunk_size, NUM_KV_HEADS, HEAD_DIM, dtype=DTYPE, device=DEVICE)
output = layer.forward(output, k, v)
# Offload current chunk to CPU
logical_id = seq.block_table[chunk_idx]
cpu_block_id = manager.logical_blocks[logical_id].cpu_block_id
manager.offload_engine.offload_slot_to_cpu(write_slot, cpu_block_id)
manager.prefilled_blocks.add(logical_id)
return output
# ============================================================
# Main Test
# ============================================================
print("=" * 60)
print("Test: Attention Layer with KV Cache Offload")
print("=" * 60)
# 1. Setup
print("\n[1] Creating manager and attention layers...")
manager = create_manager()
layers = create_attention_layers(manager)
print(f" - Manager: {NUM_GPU_SLOTS} GPU slots, {NUM_CPU_BLOCKS} CPU blocks")
print(f" - Layers: {NUM_LAYERS} layers, {NUM_HEADS} heads, {HEAD_DIM} head_dim")
print(f" - OffloadEngine initialized: {manager.offload_engine is not None}")
# 2. Test configuration
print("\n[2] Test configuration...")
NUM_CHUNKS = NUM_CPU_BLOCKS # Use all CPU blocks
print(f" - Total tokens: {NUM_CHUNKS * CHUNK_SIZE}")
print(f" - Chunks: {NUM_CHUNKS}")
# 3. Warmup runs
print(f"\n[3] Warmup runs (3 iterations)...")
for warmup_iter in range(3):
    manager.prefilled_blocks.clear()
    seq = create_test_sequence(manager, num_chunks=NUM_CHUNKS)
    for chunk_idx in range(NUM_CHUNKS):
        write_slot = manager.offload_engine.get_write_slot_for_prefill(chunk_idx)
        output = simulate_chunk_forward(layers, manager, seq, chunk_idx, CHUNK_SIZE)
    manager.deallocate(seq)
    print(f" - Warmup {warmup_iter + 1}/3 completed")
# 4. Benchmark runs
print(f"\n[4] Benchmark runs (10 iterations)...")
for bench_iter in range(10):
    manager.prefilled_blocks.clear()
    seq = create_test_sequence(manager, num_chunks=NUM_CHUNKS)
    for chunk_idx in range(NUM_CHUNKS):
        write_slot = manager.offload_engine.get_write_slot_for_prefill(chunk_idx)
        load_slots = manager.offload_engine.get_load_slots_for_prefill(write_slot)
        output = simulate_chunk_forward(layers, manager, seq, chunk_idx, CHUNK_SIZE)
    manager.deallocate(seq)
    print(f" - Iteration {bench_iter + 1}/10 completed")
# 5. Verify results (using last iteration's seq)
print("\n[5] Verifying ring buffer and offload...")
for chunk_idx in range(NUM_CHUNKS):
    expected_slot = chunk_idx % NUM_GPU_SLOTS
    actual_slot = manager.offload_engine.get_write_slot_for_prefill(chunk_idx)
    assert actual_slot == expected_slot, f"Chunk {chunk_idx}: expected slot {expected_slot}, got {actual_slot}"
cpu_block_table = manager.get_prefilled_cpu_blocks(seq)
assert cpu_block_table == seq.block_table[:NUM_CHUNKS], "CPU block table mismatch"
print(" - Ring buffer cycling verified ✓")
print(" - CPU offload verified ✓")
# Cleanup
manager.deallocate(seq)
# Cleanup
reset_context()
print("\n" + "=" * 60)
print("test_attention_offload: PASSED")
print("=" * 60)

tests/test_sgdma.py

@@ -6,7 +6,7 @@ Author: Zijie Tian
import torch
import time

from nanovllm.comm import memcpy_2d, memcpy_2d_async

# ============================================================
# Configuration
@@ -34,64 +34,12 @@ class Config:
# ============================================================
# Test 1: Async Transfer
# ============================================================

def test_async_transfer():
    """Test asynchronous transfer with CUDA stream."""
    print("\n[Test 1] Async Transfer Test")

    cfg = Config()

    # Create test data
    cpu_data = torch.randn(
        cfg.num_layers,
        cfg.num_blocks,
        cfg.features_per_block,
        dtype=cfg.dtype,
        pin_memory=True
    )
    gpu_buffer = torch.empty(
        cfg.num_layers,
        cfg.features_per_block,
        dtype=cfg.dtype,
        device='cuda'
    )

    # Create CUDA stream
    stream = torch.cuda.Stream()

    test_block_id = 5
    spitch = cfg.bytes_per_layer
    dpitch = cfg.bytes_per_block
    width = cfg.bytes_per_block
    height = cfg.num_layers

    # Async transfer
    with torch.cuda.stream(stream):
        src_view = cpu_data[:, test_block_id, :]
        memcpy_2d_async(gpu_buffer, src_view, dpitch, spitch, width, height, "h2d", stream)

    # Wait for completion
    stream.synchronize()

    # Verify
    expected = cpu_data[:, test_block_id, :].cuda()
    if torch.allclose(gpu_buffer, expected, rtol=1e-3, atol=1e-3):
        print(" Result: PASSED ✓")
        return True
    else:
        print(" Result: FAILED ✗")
        return False

# ============================================================
# Test 2: Performance Benchmark
# ============================================================

def benchmark_sgdma():
    """Benchmark cudaMemcpy2D vs standard PyTorch methods."""
    print("\n[Test 2] Performance Benchmark")

    cfg = Config()
@@ -212,19 +160,17 @@ def benchmark_sgdma():
# ============================================================
if __name__ == "__main__":
    print("=== CUDA sgDMA (cudaMemcpy2D) Tests ===")

    # Check CUDA availability
    if not torch.cuda.is_available():
        print("CUDA not available. Skipping tests.")
        exit(1)

    # Print GPU info
    print(f"Using GPU: {torch.cuda.get_device_name()}")

    # Run tests
    test1_passed = test_async_transfer()
    benchmark_sgdma()

    print("\n=== Tests Complete ===")
    print(f"All tests {'PASSED ✓' if test1_passed else 'FAILED ✗'}")