[opt] optimize nanovllm performance to be comparable with vllm.

Zijie Tian
2025-12-25 03:47:07 +08:00
parent 16fcf8350b
commit 82ed34fc2d
7 changed files with 450 additions and 208 deletions

CLAUDE.md

@@ -37,7 +37,22 @@ Decode: slot[0] = decode, slots[1:] = load previous chunks
- `offload_slot_to_cpu(slot, cpu_block)`: Async D2H offload
- Per-slot per-layer CUDA events for fine-grained synchronization
**Pipeline**: N-way pipeline with dedicated streams for full compute-transfer overlap. Pipeline depth = N-1 (prefill), (N-1)/2 (decode).
### Stream Architecture
```
Transfer Streams: [slot_0_stream] [slot_1_stream] ... [slot_N_stream]
↓ ↓ ↓
GPU Slots: [slot_0] [slot_1] ... [slot_N]
↓ ↓ ↓
Compute Stream: ←←←←←←←←←←←← [dedicated compute stream] →→→→→→→→→→→→
```
**Key Design Decisions**:
- **Per-slot transfer streams**: Each GPU slot has its own CUDA stream for H2D transfers, enabling parallel loading
- **Dedicated compute stream**: Created with `torch.cuda.Stream()` (NOT `current_stream()`) to avoid implicit synchronization with default stream
- **CUDA Events**: `ring_slot_ready` (transfer complete), `ring_slot_compute_done` (safe to overwrite)
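A minimal sketch of this stream/event layout, assuming `num_slots` ring slots (the function names, loop structure, and commented-out copy/attention calls are illustrative placeholders, not the actual `offload_engine.py` code):
```python
import torch

# Minimal sketch of the stream/event layout described above. Names, the loop
# structure, and the commented-out copy/attention calls are placeholders.
num_slots = 4

compute_stream = torch.cuda.Stream()                             # dedicated; NOT torch.cuda.current_stream()
slot_streams = [torch.cuda.Stream() for _ in range(num_slots)]   # one H2D transfer stream per GPU slot
ring_slot_ready = [torch.cuda.Event() for _ in range(num_slots)]         # transfer complete
ring_slot_compute_done = [torch.cuda.Event() for _ in range(num_slots)]  # safe to overwrite

def load_chunk_to_slot(slot: int, chunk_id: int) -> None:
    """Issue the H2D load of one KV chunk on the slot's own transfer stream."""
    with torch.cuda.stream(slot_streams[slot]):
        # Do not overwrite the slot until the compute that last read it is done.
        slot_streams[slot].wait_event(ring_slot_compute_done[slot])
        # per layer: k_slot_gpu[layer, slot].copy_(k_cache_cpu[layer, chunk_id], non_blocking=True)
        ring_slot_ready[slot].record(slot_streams[slot])

def compute_on_slot(slot: int) -> None:
    """Run attention over the chunk resident in `slot` on the compute stream."""
    with torch.cuda.stream(compute_stream):
        compute_stream.wait_event(ring_slot_ready[slot])         # KV must be on the GPU
        # ... flash-attention over this slot's KV, merged into the running output ...
        ring_slot_compute_done[slot].record(compute_stream)

# N-way pipeline: while one slot is being computed, the other N-1 slots can be
# loading their next chunks in parallel, giving pipeline depth N-1 for prefill.
```
Because the compute stream is created explicitly rather than taken from `current_stream()`, work queued on it does not serialize against kernels enqueued on the default stream.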
## Scatter-Gather DMA (sgDMA) - INTEGRATED ✓
@@ -112,6 +127,99 @@ memcpy_2d_async(
**Actual Impact**: 15.35x faster D2H transfers, eliminates memory transfer bottleneck. Expected 2-3x overall prefill throughput improvement.
## Online Softmax Merge - Triton Fused Kernel ✓
### Problem & Solution
**Problem**: Original PyTorch implementation of `merge_attention_outputs()` launches 7 separate kernels per merge operation:
1. `torch.maximum()` - max(lse1, lse2)
2. `torch.exp()` (2x) - exp(lse1-max), exp(lse2-max)
3. `transpose()` + `unsqueeze()` - reshape for broadcasting
4. Accumulation (6x) - weighted sum operations
5. Division - normalize output
6. `torch.log()` - merge LSE
7. `.to()` - type conversion
**Profiling revealed**: In ChunkedPrefill with 8 layers, these operations consumed **698 ms** GPU time (vs FlashAttention 603 ms), becoming a major bottleneck.
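For reference, this merge is a log-sum-exp (LSE) combination of two partial attention outputs; a plain-PyTorch sketch of the kernel sequence listed above (tensor shapes and names are illustrative, not the exact code being replaced):
```python
import torch

def merge_reference(o1, lse1, o2, lse2):
    """Reference merge of two partial attention outputs via their LSE stats.

    Assumed shapes (illustrative): o1, o2 are [num_tokens, num_heads, head_dim],
    lse1, lse2 are [num_heads, num_tokens]. Each step maps to one of the
    separate kernel launches enumerated above.
    """
    max_lse = torch.maximum(lse1, lse2)              # 1. max(lse1, lse2)
    w1 = torch.exp(lse1 - max_lse)                   # 2. exp (x2)
    w2 = torch.exp(lse2 - max_lse)
    w1_b = w1.transpose(0, 1).unsqueeze(-1)          # 3. transpose + unsqueeze for broadcasting
    w2_b = w2.transpose(0, 1).unsqueeze(-1)
    acc = o1.float() * w1_b + o2.float() * w2_b      # 4. weighted accumulation
    out = acc / (w1_b + w2_b)                        # 5. division (normalize)
    lse = max_lse + torch.log(w1 + w2)               # 6. log to merge LSE
    return out.to(o1.dtype), lse                     # 7. type conversion
```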
**Solution**: Implemented Triton fused kernels that combine all operations into 2 kernels. **Integration complete** as of 2025-12-25.
### Implementation
**File**: `nanovllm/kvcache/chunked_attention.py:278-408`
Two Triton kernels replace all PyTorch operations:
```python
@triton.jit
def _merge_lse_kernel(...):
    """Fused: max + exp + log"""
    max_lse = tl.maximum(lse1, lse2)
    exp1 = tl.exp(lse1 - max_lse)
    exp2 = tl.exp(lse2 - max_lse)
    lse_merged = max_lse + tl.log(exp1 + exp2)
    tl.store(lse_out_ptr + offsets, lse_merged, mask=mask)

@triton.jit
def _merge_output_kernel(...):
    """Fused: broadcast + weighted sum + division"""
    # Load LSE, compute scaling factors
    exp1 = tl.exp(lse1 - max_lse)
    exp2 = tl.exp(lse2 - max_lse)
    sum_exp = exp1 + exp2
    # Process headdim in chunks
    for d_offset in range(0, headdim, BLOCK_SIZE):
        o1_val = tl.load(o1_ptr + o_idx, mask=mask)
        o2_val = tl.load(o2_ptr + o_idx, mask=mask)
        o_merged = (o1_val * exp1 + o2_val * exp2) / sum_exp
        tl.store(o_out_ptr + o_idx, o_merged, mask=mask)
```
### Performance Results
**From `test_attention_offload.py` profiling** (8 layers, 16K tokens, 16 chunks, 10 iterations):
| Metric | PyTorch (7 kernels) | Triton (2 kernels) | Speedup |
|--------|---------------------|---------------------|---------|
| **GPU time (8 layers)** | 698 ms | 160.7 ms | **4.3x** |
| **Per-layer time** | 87.3 ms | 20.1 ms | **4.3x** |
| **Avg per merge** | 56 µs | 12.9 µs | **4.3x** |
| **Kernel launches** | 10,920 | 3,120 | **71% reduction** |
**Breakdown** (per-layer, 1,560 merges):
- `_merge_output_kernel`: 126.9 ms / 8 = 15.9 ms/layer (avg 10.2 µs/call)
- `_merge_lse_kernel`: 33.8 ms / 8 = 4.2 ms/layer (avg 2.7 µs/call)
### Overall ChunkedPrefill Impact
**GPU time distribution** (test_attention_offload.py):
| Component | Time (ms) | Percentage |
|-----------|-----------|------------|
| FlashAttention | 603.2 | 74.8% |
| Triton Merge | 160.7 | 19.9% |
| Other | 42.1 | 5.3% |
| **Total** | **806.0** | **100%** |
**If using PyTorch merge** (estimated):
- Total GPU time: ~1,343 ms (603 ms FlashAttention + 698 ms PyTorch merge + 42 ms other)
- **Overall speedup with Triton**: 1.67x (1,343 ms / 806 ms)
### Correctness Verification
**Test**: `tests/test_chunked_attention.py`
- 12 test cases (6 configs × 2 dtypes)
- All tests PASS with max error < 0.01
- float16: max_diff=0.000488, mean_diff~0.00001
- bfloat16: max_diff=0.003906, mean_diff~0.0001
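The check behind these numbers can be sketched as follows, assuming `merge_attention_outputs(o1, lse1, o2, lse2)` returns the merged output and LSE (signature, shapes, and fixtures here are assumptions; the real tests differ in detail):
```python
import torch
from nanovllm.kvcache.chunked_attention import merge_attention_outputs  # signature assumed

def check_merge(dtype=torch.float16, num_tokens=1024, num_heads=16, head_dim=128):
    o1 = torch.randn(num_tokens, num_heads, head_dim, device="cuda", dtype=dtype)
    o2 = torch.randn(num_tokens, num_heads, head_dim, device="cuda", dtype=dtype)
    lse1 = torch.randn(num_heads, num_tokens, device="cuda", dtype=torch.float32)
    lse2 = torch.randn(num_heads, num_tokens, device="cuda", dtype=torch.float32)

    # Float32 reference using the same LSE math the fused kernels implement.
    m = torch.maximum(lse1, lse2)
    w1, w2 = torch.exp(lse1 - m), torch.exp(lse2 - m)
    w1_b, w2_b = w1.transpose(0, 1).unsqueeze(-1), w2.transpose(0, 1).unsqueeze(-1)
    o_ref = (o1.float() * w1_b + o2.float() * w2_b) / (w1_b + w2_b)

    o_fused, _ = merge_attention_outputs(o1, lse1, o2, lse2)   # assumed return: (output, lse)
    max_diff = (o_fused.float() - o_ref).abs().max().item()
    assert max_diff < 0.01, f"max_diff={max_diff}"             # threshold used by the tests above
```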
### Key Files
- `nanovllm/kvcache/chunked_attention.py`: Triton kernels + merge function
- `tests/test_chunked_attention.py`: Correctness tests
- `tests/test_attention_offload.py`: Performance profiling
## Configuration
| Parameter | Default | Notes |
@@ -134,38 +242,57 @@ memcpy_2d_async(
- Qwen3-0.6B/4B: 40960 tokens
- Qwen2.5-7B-Instruct-1M: 1048576 tokens
**Performance (Qwen3-0.6B)**:
- GPU: ~18k tok/s (prefill), ~100 tok/s (decode)
- CPU Offload (16K): ~14k tok/s (prefill)
- CPU Offload (32K): ~13k tok/s (prefill)
- CPU Offload (40K, pre-optimization): ~7.2k tok/s (prefill), ~3.5 tok/s (decode)
## TODO: Alternative Optimizations

### 1. Pure PyTorch Layout Reorganization (Alternative to sgDMA)

**Note**: sgDMA (above) already solves this. This is a pure-PyTorch alternative requiring more code changes.

**Change Layout**:
```python
# Current (non-contiguous access)
k_cache_cpu = torch.zeros(num_layers, num_cpu_blocks, block_size, kv_heads, head_dim,
                          pin_memory=True)
# Access: k_cache_cpu[:, block_id] -> strided, slow

# Optimized (contiguous access)
k_cache_cpu = torch.zeros(num_cpu_blocks, num_layers, block_size, kv_heads, head_dim,
                          pin_memory=True)
# Access: k_cache_cpu[block_id] -> contiguous, fast
```

**Files to Modify**:
- `kvcache/offload_engine.py`: Update all indexing in `load_to_slot_layer()`, `offload_slot_to_cpu()`
- Audit all `k_cache_cpu`/`v_cache_cpu` accesses

**Trade-off**:
- **sgDMA**: Minimal code changes, requires CUDA extension, 24.95 GB/s
- **Layout Change**: Pure PyTorch, extensive refactoring, 24.91 GB/s (same performance)

**Recommendation**: Use sgDMA for faster implementation with same performance.

## Performance Summary

### Completed Optimizations ✓

1. **sgDMA Integration** (2025-12-25)
   - Eliminated Device→Pageable transfers
   - Achieved 21-23 GB/s bandwidth (near PCIe limit)
   - 15.35x speedup on memory transfers

2. **Triton Fused Merge Kernel** (2025-12-25)
   - Reduced 7 PyTorch kernels → 2 Triton kernels
   - 4.3x speedup on merge operations
   - 1.67x overall ChunkedPrefill speedup

3. **N-way Pipeline with Dedicated Streams** (2025-12-25)
   - Per-slot transfer streams for parallel H2D across slots
   - Dedicated compute stream (avoids CUDA default stream implicit sync)
   - N-way pipeline using all available slots (not just 2-slot double buffering)
   - **2.0x improvement**: 7.2k → 14.1k tok/s (16K tokens prefill)

### Current Performance Bottlenecks

**From profiling** (`test_attention_offload.py`, 8 layers, 16K tokens):

| Component | GPU Time | Percentage | Optimization Potential |
|-----------|----------|------------|------------------------|
| FlashAttention | 603 ms | 74.8% | ⚠️ Main bottleneck |
| Triton Merge | 161 ms | 19.9% | ✓ Optimized |
| Other | 42 ms | 5.3% | Minor |

### Future Optimization Directions

1. **FlashAttention Optimization** (highest priority)
   - Current: 74.8% of GPU time
   - Potential: Custom FlashAttention kernel for the chunked case
   - Expected: 1.5-2x additional speedup

2. ~~**Pipeline Optimization**~~ ✓ COMPLETED
   - ~~Better overlap between compute and memory transfer~~
   - ~~Multi-stream execution~~
   - See: N-way Pipeline with Dedicated Streams above

3. **Alternative to sgDMA** (lower priority, PyTorch-only)
   - Reorganize cache layout: `[num_cpu_blocks, num_layers, ...]` instead of `[num_layers, num_cpu_blocks, ...]` (see the TODO section above and the sketch below)
   - Trade-off: Extensive refactoring vs minimal sgDMA approach
   - Same performance as sgDMA (~24 GB/s)
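A minimal sketch of the access-pattern difference behind direction 3 (sizes are arbitrary; this illustrates the strided vs. contiguous slice, not a benchmark from this repository):
```python
import torch

# Illustrative sizes only.
num_layers, num_cpu_blocks, block_size, kv_heads, head_dim = 8, 16, 256, 8, 128

# Current layout: one block's data is spread across layers with a large stride.
cache_current = torch.zeros(num_layers, num_cpu_blocks, block_size, kv_heads, head_dim,
                            dtype=torch.float16, pin_memory=True)
# Alternative layout: one block is a single contiguous pinned region.
cache_alt = torch.zeros(num_cpu_blocks, num_layers, block_size, kv_heads, head_dim,
                        dtype=torch.float16, pin_memory=True)

slot = torch.empty(num_layers, block_size, kv_heads, head_dim,
                   dtype=torch.float16, device="cuda")
block_id = 3

print(cache_current[:, block_id].is_contiguous())  # False -> strided source, slow path
print(cache_alt[block_id].is_contiguous())         # True  -> single pinned H2D DMA

# Loading one block into a GPU slot under each layout:
slot.copy_(cache_current[:, block_id], non_blocking=True)  # strided
slot.copy_(cache_alt[block_id], non_blocking=True)         # contiguous
```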
---