[opt] optimize nanovllm performance to be comparable with vllm.

Zijie Tian
2025-12-25 03:47:07 +08:00
parent 16fcf8350b
commit 82ed34fc2d
7 changed files with 450 additions and 208 deletions

CLAUDE.md

@@ -37,7 +37,22 @@ Decode: slot[0] = decode, slots[1:] = load previous chunks
- `offload_slot_to_cpu(slot, cpu_block)`: Async D2H offload
- Per-slot per-layer CUDA events for fine-grained synchronization
**Pipeline**: N-way pipeline with dedicated streams for full compute-transfer overlap. Pipeline depth = N-1 (prefill), (N-1)/2 (decode).
### Stream Architecture
```
Transfer Streams: [slot_0_stream] [slot_1_stream] ... [slot_N_stream]
↓ ↓ ↓
GPU Slots: [slot_0] [slot_1] ... [slot_N]
↓ ↓ ↓
Compute Stream: ←←←←←←←←←←←← [dedicated compute stream] →→→→→→→→→→→→
```
**Key Design Decisions**:
- **Per-slot transfer streams**: Each GPU slot has its own CUDA stream for H2D transfers, enabling parallel loading
- **Dedicated compute stream**: Created with `torch.cuda.Stream()` (NOT `current_stream()`) to avoid implicit synchronization with default stream
- **CUDA Events**: `ring_slot_ready` (transfer complete), `ring_slot_compute_done` (safe to overwrite)
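A minimal sketch of this stream/event layout, assuming `num_slots` ring slots (the function names, loop structure, and commented-out copy/attention calls are illustrative placeholders, not the actual `offload_engine.py` code):
```python
import torch

# Minimal sketch of the stream/event layout described above. Names, the loop
# structure, and the commented-out copy/attention calls are placeholders.
num_slots = 4

compute_stream = torch.cuda.Stream()                             # dedicated; NOT torch.cuda.current_stream()
slot_streams = [torch.cuda.Stream() for _ in range(num_slots)]   # one H2D transfer stream per GPU slot
ring_slot_ready = [torch.cuda.Event() for _ in range(num_slots)]         # transfer complete
ring_slot_compute_done = [torch.cuda.Event() for _ in range(num_slots)]  # safe to overwrite

def load_chunk_to_slot(slot: int, chunk_id: int) -> None:
    """Issue the H2D load of one KV chunk on the slot's own transfer stream."""
    with torch.cuda.stream(slot_streams[slot]):
        # Do not overwrite the slot until the compute that last read it is done.
        slot_streams[slot].wait_event(ring_slot_compute_done[slot])
        # per layer: k_slot_gpu[layer, slot].copy_(k_cache_cpu[layer, chunk_id], non_blocking=True)
        ring_slot_ready[slot].record(slot_streams[slot])

def compute_on_slot(slot: int) -> None:
    """Run attention over the chunk resident in `slot` on the compute stream."""
    with torch.cuda.stream(compute_stream):
        compute_stream.wait_event(ring_slot_ready[slot])         # KV must be on the GPU
        # ... flash-attention over this slot's KV, merged into the running output ...
        ring_slot_compute_done[slot].record(compute_stream)

# N-way pipeline: while one slot is being computed, the other N-1 slots can be
# loading their next chunks in parallel, giving pipeline depth N-1 for prefill.
```
Because the compute stream is created explicitly rather than taken from `current_stream()`, work queued on it does not serialize against kernels enqueued on the default stream.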
## Scatter-Gather DMA (sgDMA) - INTEGRATED ✓
@@ -112,6 +127,99 @@ memcpy_2d_async(
**Actual Impact**: 15.35x faster D2H transfers, eliminates memory transfer bottleneck. Expected 2-3x overall prefill throughput improvement.
## Online Softmax Merge - Triton Fused Kernel ✓
### Problem & Solution
**Problem**: Original PyTorch implementation of `merge_attention_outputs()` launches 7 separate kernels per merge operation:
1. `torch.maximum()` - max(lse1, lse2)
2. `torch.exp()` (2x) - exp(lse1-max), exp(lse2-max)
3. `transpose()` + `unsqueeze()` - reshape for broadcasting
4. Accumulation (6x) - weighted sum operations
5. Division - normalize output
6. `torch.log()` - merge LSE
7. `.to()` - type conversion
**Profiling revealed**: In ChunkedPrefill with 8 layers, these operations consumed **698 ms** GPU time (vs FlashAttention 603 ms), becoming a major bottleneck.
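For reference, this merge is a log-sum-exp (LSE) combination of two partial attention outputs; a plain-PyTorch sketch of the kernel sequence listed above (tensor shapes and names are illustrative, not the exact code being replaced):
```python
import torch

def merge_reference(o1, lse1, o2, lse2):
    """Reference merge of two partial attention outputs via their LSE stats.

    Assumed shapes (illustrative): o1, o2 are [num_tokens, num_heads, head_dim],
    lse1, lse2 are [num_heads, num_tokens]. Each step maps to one of the
    separate kernel launches enumerated above.
    """
    max_lse = torch.maximum(lse1, lse2)              # 1. max(lse1, lse2)
    w1 = torch.exp(lse1 - max_lse)                   # 2. exp (x2)
    w2 = torch.exp(lse2 - max_lse)
    w1_b = w1.transpose(0, 1).unsqueeze(-1)          # 3. transpose + unsqueeze for broadcasting
    w2_b = w2.transpose(0, 1).unsqueeze(-1)
    acc = o1.float() * w1_b + o2.float() * w2_b      # 4. weighted accumulation
    out = acc / (w1_b + w2_b)                        # 5. division (normalize)
    lse = max_lse + torch.log(w1 + w2)               # 6. log to merge LSE
    return out.to(o1.dtype), lse                     # 7. type conversion
```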
**Solution**: Implemented Triton fused kernels that combine all operations into 2 kernels. **Integration complete** as of 2025-12-25.
### Implementation
**File**: `nanovllm/kvcache/chunked_attention.py:278-408`
Two Triton kernels replace all PyTorch operations:
```python
@triton.jit
def _merge_lse_kernel(...):
    """Fused: max + exp + log"""
    max_lse = tl.maximum(lse1, lse2)
    exp1 = tl.exp(lse1 - max_lse)
    exp2 = tl.exp(lse2 - max_lse)
    lse_merged = max_lse + tl.log(exp1 + exp2)
    tl.store(lse_out_ptr + offsets, lse_merged, mask=mask)

@triton.jit
def _merge_output_kernel(...):
    """Fused: broadcast + weighted sum + division"""
    # Load LSE, compute scaling factors
    exp1 = tl.exp(lse1 - max_lse)
    exp2 = tl.exp(lse2 - max_lse)
    sum_exp = exp1 + exp2
    # Process headdim in chunks
    for d_offset in range(0, headdim, BLOCK_SIZE):
        o1_val = tl.load(o1_ptr + o_idx, mask=mask)
        o2_val = tl.load(o2_ptr + o_idx, mask=mask)
        o_merged = (o1_val * exp1 + o2_val * exp2) / sum_exp
        tl.store(o_out_ptr + o_idx, o_merged, mask=mask)
```
### Performance Results
**From `test_attention_offload.py` profiling** (8 layers, 16K tokens, 16 chunks, 10 iterations):
| Metric | PyTorch (7 kernels) | Triton (2 kernels) | Speedup |
|--------|---------------------|---------------------|---------|
| **GPU time (8 layers)** | 698 ms | 160.7 ms | **4.3x** |
| **Per-layer time** | 87.3 ms | 20.1 ms | **4.3x** |
| **Avg per merge** | 56 µs | 12.9 µs | **4.3x** |
| **Kernel launches** | 10,920 | 3,120 | **71% reduction** |
**Breakdown** (per-layer, 1,560 merges):
- `_merge_output_kernel`: 126.9 ms / 8 = 15.9 ms/layer (avg 10.2 µs/call)
- `_merge_lse_kernel`: 33.8 ms / 8 = 4.2 ms/layer (avg 2.7 µs/call)
### Overall ChunkedPrefill Impact
**GPU time distribution** (test_attention_offload.py):
| Component | Time (ms) | Percentage |
|-----------|-----------|------------|
| FlashAttention | 603.2 | 74.8% |
| Triton Merge | 160.7 | 19.9% |
| Other | 42.1 | 5.3% |
| **Total** | **806.0** | **100%** |
**If using PyTorch merge** (estimated):
- Total GPU time: ~1,343 ms (603 ms FlashAttention + 698 ms PyTorch merge + 42 ms other)
- **Overall speedup with Triton**: 1.67x (1,343 ms / 806 ms)
### Correctness Verification
**Test**: `tests/test_chunked_attention.py`
- 12 test cases (6 configs × 2 dtypes)
- All tests PASS with max error < 0.01
- float16: max_diff=0.000488, mean_diff~0.00001
- bfloat16: max_diff=0.003906, mean_diff~0.0001
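The check behind these numbers can be sketched as follows, assuming `merge_attention_outputs(o1, lse1, o2, lse2)` returns the merged output and LSE (signature, shapes, and fixtures here are assumptions; the real tests differ in detail):
```python
import torch
from nanovllm.kvcache.chunked_attention import merge_attention_outputs  # signature assumed

def check_merge(dtype=torch.float16, num_tokens=1024, num_heads=16, head_dim=128):
    o1 = torch.randn(num_tokens, num_heads, head_dim, device="cuda", dtype=dtype)
    o2 = torch.randn(num_tokens, num_heads, head_dim, device="cuda", dtype=dtype)
    lse1 = torch.randn(num_heads, num_tokens, device="cuda", dtype=torch.float32)
    lse2 = torch.randn(num_heads, num_tokens, device="cuda", dtype=torch.float32)

    # Float32 reference using the same LSE math the fused kernels implement.
    m = torch.maximum(lse1, lse2)
    w1, w2 = torch.exp(lse1 - m), torch.exp(lse2 - m)
    w1_b, w2_b = w1.transpose(0, 1).unsqueeze(-1), w2.transpose(0, 1).unsqueeze(-1)
    o_ref = (o1.float() * w1_b + o2.float() * w2_b) / (w1_b + w2_b)

    o_fused, _ = merge_attention_outputs(o1, lse1, o2, lse2)   # assumed return: (output, lse)
    max_diff = (o_fused.float() - o_ref).abs().max().item()
    assert max_diff < 0.01, f"max_diff={max_diff}"             # threshold used by the tests above
```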
### Key Files
- `nanovllm/kvcache/chunked_attention.py`: Triton kernels + merge function
- `tests/test_chunked_attention.py`: Correctness tests
- `tests/test_attention_offload.py`: Performance profiling
## Configuration
| Parameter | Default | Notes |
@@ -134,38 +242,57 @@ memcpy_2d_async(
- Qwen3-0.6B/4B: 40960 tokens
- Qwen2.5-7B-Instruct-1M: 1048576 tokens
**Performance (Qwen3-0.6B)**:
- GPU: ~18k tok/s (prefill), ~100 tok/s (decode)
- CPU Offload (16K): ~14k tok/s (prefill)
- CPU Offload (32K): ~13k tok/s (prefill)
- CPU Offload (40K, pre-optimization): ~7.2k tok/s (prefill), ~3.5 tok/s (decode)
## TODO: Alternative Optimizations

### 1. Pure PyTorch Layout Reorganization (Alternative to sgDMA)

**Note**: sgDMA (above) already solves this. This is a pure-PyTorch alternative requiring more code changes.

**Change Layout**:
```python
# Current (non-contiguous access)
k_cache_cpu = torch.zeros(num_layers, num_cpu_blocks, block_size, kv_heads, head_dim,
                          pin_memory=True)
# Access: k_cache_cpu[:, block_id] -> strided, slow

# Optimized (contiguous access)
k_cache_cpu = torch.zeros(num_cpu_blocks, num_layers, block_size, kv_heads, head_dim,
                          pin_memory=True)
# Access: k_cache_cpu[block_id] -> contiguous, fast
```

**Files to Modify**:
- `kvcache/offload_engine.py`: Update all indexing in `load_to_slot_layer()`, `offload_slot_to_cpu()`
- Audit all `k_cache_cpu`/`v_cache_cpu` accesses

**Trade-off**:
- **sgDMA**: Minimal code changes, requires CUDA extension, 24.95 GB/s
- **Layout Change**: Pure PyTorch, extensive refactoring, 24.91 GB/s (same performance)

**Recommendation**: Use sgDMA for faster implementation with same performance.

## Performance Summary

### Completed Optimizations ✓

1. **sgDMA Integration** (2025-12-25)
   - Eliminated Device→Pageable transfers
   - Achieved 21-23 GB/s bandwidth (near PCIe limit)
   - 15.35x speedup on memory transfers

2. **Triton Fused Merge Kernel** (2025-12-25)
   - Reduced 7 PyTorch kernels → 2 Triton kernels
   - 4.3x speedup on merge operations
   - 1.67x overall ChunkedPrefill speedup

3. **N-way Pipeline with Dedicated Streams** (2025-12-25)
   - Per-slot transfer streams for parallel H2D across slots
   - Dedicated compute stream (avoids CUDA default stream implicit sync)
   - N-way pipeline using all available slots (not just 2-slot double buffering)
   - **2.0x improvement**: 7.2k → 14.1k tok/s (16K tokens prefill)

### Current Performance Bottlenecks

**From profiling** (`test_attention_offload.py`, 8 layers, 16K tokens):

| Component | GPU Time | Percentage | Optimization Potential |
|-----------|----------|------------|------------------------|
| FlashAttention | 603 ms | 74.8% | ⚠️ Main bottleneck |
| Triton Merge | 161 ms | 19.9% | ✓ Optimized |
| Other | 42 ms | 5.3% | Minor |

### Future Optimization Directions

1. **FlashAttention Optimization** (highest priority)
   - Current: 74.8% of GPU time
   - Potential: Custom FlashAttention kernel for the chunked case
   - Expected: 1.5-2x additional speedup

2. ~~**Pipeline Optimization**~~ ✓ COMPLETED
   - ~~Better overlap between compute and memory transfer~~
   - ~~Multi-stream execution~~
   - See: N-way Pipeline with Dedicated Streams above

3. **Alternative to sgDMA** (lower priority, PyTorch-only)
   - Reorganize cache layout: `[num_cpu_blocks, num_layers, ...]` instead of `[num_layers, num_cpu_blocks, ...]` (see the TODO section above and the sketch below)
   - Trade-off: Extensive refactoring vs minimal sgDMA approach
   - Same performance as sgDMA (~24 GB/s)
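A minimal sketch of the access-pattern difference behind direction 3 (sizes are arbitrary; this illustrates the strided vs. contiguous slice, not a benchmark from this repository):
```python
import torch

# Illustrative sizes only.
num_layers, num_cpu_blocks, block_size, kv_heads, head_dim = 8, 16, 256, 8, 128

# Current layout: one block's data is spread across layers with a large stride.
cache_current = torch.zeros(num_layers, num_cpu_blocks, block_size, kv_heads, head_dim,
                            dtype=torch.float16, pin_memory=True)
# Alternative layout: one block is a single contiguous pinned region.
cache_alt = torch.zeros(num_cpu_blocks, num_layers, block_size, kv_heads, head_dim,
                        dtype=torch.float16, pin_memory=True)

slot = torch.empty(num_layers, block_size, kv_heads, head_dim,
                   dtype=torch.float16, device="cuda")
block_id = 3

print(cache_current[:, block_id].is_contiguous())  # False -> strided source, slow path
print(cache_alt[block_id].is_contiguous())         # True  -> single pinned H2D DMA

# Loading one block into a GPU slot under each layout:
slot.copy_(cache_current[:, block_id], non_blocking=True)  # strided
slot.copy_(cache_alt[block_id], non_blocking=True)         # contiguous
```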
---