✨ feat: add comprehensive RULER benchmark testing
- Add test_ruler.py from tzj/vs_offload branch with 13 RULER tasks
- Add comprehensive documentation for RULER benchmark results
- Update CLAUDE.md with new documentation index entry
- Add architecture, debugging, optimization, and known issues guides
- Test 32K context with CPU offload: 92.3% accuracy across all tasks
- Parallel execution on 4 GPUs with detailed performance metrics

Benchmark results:
- 13 RULER tasks total (niah_single, multikey, multiquery, multivalue, qa, cwe, fwe, vt)
- 26 samples tested with 92.3% overall accuracy
- CPU offload stable at 32K context length
- Parallel GPU execution achieving 4x speedup

Key findings:
- Single needle tasks: 100% accuracy
- Multi-value and recall tasks: 100% accuracy
- Multi-query tasks: 50% accuracy (most challenging)
- QA tasks: 100% accuracy
- Total execution time: ~220 seconds (parallel)
docs/architecture_guide.md (new file, 125 lines)
@@ -0,0 +1,125 @@
# Architecture Guide

This document describes the core components and design of nano-vLLM, with a detailed focus on the CPU offload system.

## Core Components

### LLMEngine (`llm_engine.py`)

Main entry point that runs the prefill-decode loop and manages the overall inference workflow.

### ModelRunner (`model_runner.py`)

- Loads model weights
- Allocates the KV cache
- Manages CUDA graphs for decode acceleration

### Scheduler (`scheduler.py`)

Two-phase scheduling system:

- **Prefill phase**: processes prompt tokens
- **Decode phase**: generates output tokens autoregressively

### BlockManager (`block_manager.py`)

- Paged attention implementation
- Prefix caching using xxhash
- Default block size: 4096 tokens

### Attention (`layers/attention.py`)

- FlashAttention for efficient computation
- Chunked attention methods for CPU offload mode

---

## CPU Offload System

### Ring Buffer Design

The CPU offload system uses a unified ring buffer to manage GPU memory slots:

```
GPU Slots: [0] [1] [2] [3] ... (unified ring buffer)
Prefill:   slot = chunk_idx % N
Decode:    slot[0] = decode, slots[1:] = load previous chunks
```
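The slot arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not the actual nano-vLLM API: `num_slots` corresponds to N, and both function names are made up for the example.

```python
def prefill_slot(chunk_idx: int, num_slots: int) -> int:
    """Prefill: chunks are written round-robin into the ring buffer."""
    return chunk_idx % num_slots


def decode_slot_roles(num_slots: int) -> tuple[int, list[int]]:
    """Decode: slot 0 holds the active decode block; the remaining
    slots are free to load previously offloaded chunks."""
    return 0, list(range(1, num_slots))


# With N = 4 slots, chunks 0..5 land in slots 0, 1, 2, 3, 0, 1.
assignments = [prefill_slot(c, 4) for c in range(6)]
```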

**Key Files**: `kvcache/offload_engine.py`, `kvcache/hybrid_manager.py`

### Memory Layout

**GPU Memory**:
```
[num_layers, num_gpu_blocks, block_size, kv_heads, head_dim]
```

**CPU Memory** (pinned):
```
[num_layers, num_cpu_blocks, block_size, kv_heads, head_dim]
```
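For sizing purposes, the footprint of one such tensor follows directly from the shape. The sketch below assumes separate K and V tensors of this shape and a 2-byte dtype (fp16/bf16); both are assumptions for illustration, as are the concrete model dimensions in the example.

```python
def kv_tensor_bytes(num_layers: int, num_blocks: int, block_size: int,
                    kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes for one [num_layers, num_blocks, block_size, kv_heads, head_dim] tensor."""
    return num_layers * num_blocks * block_size * kv_heads * head_dim * dtype_bytes


# Example: 32 layers, 2 GPU blocks of 1024 tokens, 8 KV heads, head_dim 128.
# Factor 2 accounts for K and V: 268435456 bytes = 256 MiB on the GPU side.
gpu_side = 2 * kv_tensor_bytes(32, 2, 1024, 8, 128)
```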

### Key Methods

| Method | Purpose |
|--------|---------|
| `load_to_slot_layer(slot, layer, cpu_block)` | Async H2D load for a specific layer |
| `offload_slot_to_cpu(slot, cpu_block)` | Async D2H offload |
| Per-slot, per-layer CUDA events | Fine-grained synchronization |
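The per-slot, per-layer event bookkeeping can be illustrated with a pure-Python stand-in. The method name `load_to_slot_layer` follows the table; `layer_is_ready` and the dictionary of flags are hypothetical stand-ins for real CUDA events, and no actual copies are performed.

```python
class OffloadEngineSketch:
    """Stand-in for the event bookkeeping in kvcache/offload_engine.py.
    Real code records a CUDA event per (slot, layer) after each async copy."""

    def __init__(self, num_slots: int, num_layers: int):
        self.num_slots = num_slots
        self.num_layers = num_layers
        # (slot, layer) -> bool, mirroring per-slot per-layer CUDA events
        self.ready: dict[tuple[int, int], bool] = {}

    def load_to_slot_layer(self, slot: int, layer: int, cpu_block: int) -> None:
        # An async H2D copy of cpu_block would be enqueued here on the
        # slot's transfer stream, followed by event.record().
        self.ready[(slot, layer)] = True

    def layer_is_ready(self, slot: int, layer: int) -> bool:
        # The compute stream would call event.wait() instead of polling.
        return self.ready.get((slot, layer), False)
```

Because each (slot, layer) pair has its own event, the compute stream can start attention for layer 0 as soon as that layer's blocks arrive, without waiting for the whole slot.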

### Pipeline Architecture

**N-way pipeline** with dedicated streams for full compute-transfer overlap:

- **Prefill pipeline depth**: N-1
- **Decode pipeline depth**: (N-1)/2
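The number of in-flight transfers is a simple function of the ring size. A sketch, assuming the decode depth uses integer division (an assumption; the guide only gives the formula (N-1)/2):

```python
def pipeline_depths(num_slots: int) -> tuple[int, int]:
    """Prefill can keep N-1 loads in flight while one slot computes;
    decode keeps roughly half that, since slot 0 is pinned to the active token."""
    prefill_depth = num_slots - 1
    decode_depth = (num_slots - 1) // 2
    return prefill_depth, decode_depth
```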

### Stream Architecture

```
Transfer Streams: [slot_0_stream] [slot_1_stream] ... [slot_N_stream]
                        ↓               ↓                   ↓
GPU Slots:          [slot_0]        [slot_1]      ...   [slot_N]
                        ↓               ↓                   ↓
Compute Stream:   ←←←←←←←←←←←← [dedicated compute stream] →→→→→→→→→→→→
```

### Key Design Decisions

1. **Per-slot transfer streams**: each GPU slot has its own CUDA stream for H2D transfers, enabling parallel loading.

2. **Dedicated compute stream**: created with `torch.cuda.Stream()` (not `current_stream()`) to avoid implicit synchronization with the CUDA default stream.

3. **CUDA events**:
   - `ring_slot_ready`: signals that a transfer is complete
   - `ring_slot_compute_done`: signals that the slot is safe to overwrite

### Chunked Offload Flow

**Prefill Phase**:
1. For each chunk, assign `slot = chunk_idx % N`
2. Load the required KV blocks from CPU into the assigned slot
3. Compute attention on the current chunk
4. Offload results back to CPU if needed

**Decode Phase**:
1. Use `slot[0]` for the active decode computation
2. Use `slots[1:]` to prefetch upcoming chunks
3. Rotate slots as decoding progresses
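The two phases above can be simulated in plain Python to check the slot rotation. This is a schematic trace only: real transfers and attention kernels are replaced by log entries, and the rotation formula for decode prefetch is an illustrative assumption.

```python
def prefill_trace(num_chunks: int, num_slots: int) -> list[tuple[str, int, int]]:
    """Steps 1-4 of the prefill phase as (op, chunk, slot) log entries."""
    log = []
    for chunk in range(num_chunks):
        slot = chunk % num_slots              # step 1: ring-buffer assignment
        log.append(("load", chunk, slot))     # step 2: H2D into the slot
        log.append(("compute", chunk, slot))  # step 3: attention on the chunk
        log.append(("offload", chunk, slot))  # step 4: D2H back to pinned CPU
    return log


def decode_prefetch_slots(step: int, num_slots: int) -> list[int]:
    """Slots 1..N-1 rotate through prefetch duty while slot 0 decodes."""
    n = num_slots - 1
    return [1 + (step + i) % n for i in range(n)]
```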

---

## Configuration Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `kvcache_block_size` | 1024 | Tokens per KV cache block |
| `num_gpu_blocks` | 2 | Number of GPU blocks for offload |
| `num_kv_buffers` | 4 | Ring buffer size (1-4); lower = less memory but slower decode |
| `enable_cpu_offload` | False | Enable CPU offload mode |
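A configuration sketch with these defaults. The dataclass is illustrative; nano-vLLM's actual config object may be shaped differently, but the names, defaults, and the 1-4 range for `num_kv_buffers` follow the table.

```python
from dataclasses import dataclass


@dataclass
class OffloadConfig:
    kvcache_block_size: int = 1024   # tokens per KV cache block
    num_gpu_blocks: int = 2          # GPU blocks reserved for offload
    num_kv_buffers: int = 4          # ring buffer size, valid range 1-4
    enable_cpu_offload: bool = False

    def __post_init__(self):
        if not 1 <= self.num_kv_buffers <= 4:
            raise ValueError("num_kv_buffers must be in the range 1-4")
```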

### Trade-offs

- **More GPU blocks**: higher memory usage, faster prefill (fewer transfers)
- **Fewer GPU blocks**: lower memory usage, more frequent transfers
- **Larger ring buffer**: more memory, better prefetch overlap
- **Smaller ring buffer**: less memory, potential compute stalls

---

**Author**: Zijie Tian