# Nano-vLLM
A lightweight vLLM implementation built from scratch.
## Key Features
* 🚀 **Fast offline inference** - Inference speeds comparable to vLLM
* 📖 **Readable codebase** - A clean implementation in ~1,200 lines of Python
* ⚡ **Optimization suite** - Prefix caching, tensor parallelism, Torch compilation, CUDA graphs, etc. (see the sketch below)
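
These optimizations are toggled through the `LLM` constructor flags that appear in Quick Start. A minimal sketch, assuming the flags follow vLLM's semantics (an assumption: in vLLM, `enforce_eager=False` permits CUDA graph capture and `tensor_parallel_size` shards the model across GPUs):

```python
from nanovllm import LLM

# Assumption: these flags behave as they do in vLLM.
llm = LLM(
    "/YOUR/MODEL/PATH",
    enforce_eager=False,     # allow CUDA graph capture / compilation
    tensor_parallel_size=2,  # shard weights across 2 GPUs
)
```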
## Installation
```bash
pip install git+https://github.com/GeeeekExplorer/nano-vllm.git
```
## Quick Start
See `example.py` for usage. The API mirrors vLLM's interface with minor differences in the `LLM.generate` method.
```python
from nanovllm import LLM, SamplingParams

# Load the model; flags mirror vLLM (eager execution, single GPU).
llm = LLM("/YOUR/MODEL/PATH", enforce_eager=True, tensor_parallel_size=1)

# Sample up to 256 new tokens at temperature 0.6.
sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
prompts = ["Hello, Nano-vLLM."]

# generate() returns one result per prompt, each a plain dict.
outputs = llm.generate(prompts, sampling_params)
print(outputs[0]["text"])
```
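
The visible difference from vLLM here is the return type: Nano-vLLM's `generate` hands back plain dicts keyed by `"text"`, while upstream vLLM returns `RequestOutput` objects. A hedged comparison, assuming vLLM's standard API:

```python
# Nano-vLLM (as above): results are plain dicts.
print(outputs[0]["text"])

# Upstream vLLM equivalent, for comparison:
#   from vllm import LLM, SamplingParams
#   outputs = llm.generate(prompts, sampling_params)
#   print(outputs[0].outputs[0].text)
```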
## Benchmark
See `bench.py` for the benchmark script.
**Test Configuration:**
- Hardware: RTX 4070
- Model: Qwen3-0.6B
- Total requests: 256 sequences
- Input length: randomly sampled between 100 and 1024 tokens
- Output length: randomly sampled between 100 and 1024 tokens
**Performance Results:**

| Inference Engine | Output Tokens | Time (s) | Throughput (tokens/s) |
|------------------|---------------|----------|-----------------------|
| vLLM             | 133,966       | 98.95    | 1353.86               |
| Nano-vLLM        | 133,966       | 101.90   | 1314.65               |
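
To reproduce a comparable measurement, the sketch below times a batch of requests and reports tokens per second. It is not the contents of `bench.py`: prompt lengths are only approximated with repeated filler text, a single shared `max_tokens` stands in for the per-request sampled output lengths, and generated tokens are re-counted with a Hugging Face tokenizer (all assumptions).

```python
import random
import time

from transformers import AutoTokenizer
from nanovllm import LLM, SamplingParams

MODEL = "/YOUR/MODEL/PATH"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
llm = LLM(MODEL, enforce_eager=True, tensor_parallel_size=1)

# 256 requests with varied prompt lengths; filler text only roughly
# approximates the 100-1024 token range used in the table above.
random.seed(0)
prompts = ["Benchmark prompt. " * random.randint(8, 80) for _ in range(256)]
sampling_params = SamplingParams(temperature=0.6, max_tokens=1024)

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

# Approximate the output token count by re-tokenizing the text.
total = sum(len(tokenizer.encode(o["text"])) for o in outputs)
print(f"{total} tokens in {elapsed:.2f} s -> {total / elapsed:.2f} tokens/s")
```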