# Nano-vLLM

A lightweight vLLM implementation built from scratch.
## Key Features

- 🚀 **Fast offline inference** - comparable inference speed to vLLM
- 📖 **Readable codebase** - clean implementation in under 1,200 lines of Python
- ⚡ **Optimization suite** - prefix caching, Torch compilation, CUDA graphs, etc.
## Installation

```bash
pip install git+https://github.com/GeeeekExplorer/nano-vllm.git
```
## Quick Start

See `example.py` for usage. The API mirrors vLLM's interface, with minor differences in the `LLM.generate` method.
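A minimal sketch of what usage might look like, assuming the package exposes `LLM` and `SamplingParams` with vLLM-style semantics; the model path, the `enforce_eager` flag, and the output dictionary shape are assumptions, not confirmed API:

```python
from nanovllm import LLM, SamplingParams

# Model path is a placeholder; point it at a local checkpoint.
# enforce_eager is assumed to mirror vLLM's flag (disables CUDA graphs).
llm = LLM("/path/to/Qwen3-0.6B", enforce_eager=True)

# Parameter names assumed to follow vLLM's SamplingParams.
sampling_params = SamplingParams(temperature=0.6, max_tokens=256)

prompts = ["Introduce yourself in one sentence."]
outputs = llm.generate(prompts, sampling_params)

# Output shape is an assumption: one entry per prompt carrying the text.
print(outputs[0]["text"])
```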
## Benchmark

See `bench.py` for the benchmark script.
Test Configuration:
- Hardware: RTX 4070
- Model: Qwen3-0.6B
- Total Requests: 256 sequences
- Input Length: Randomly sampled between 100–1024 tokens
- Output Length: Randomly sampled between 100–1024 tokens
Performance Results:
| Inference Engine | Output Tokens | Time (s) | Throughput (tokens/s) |
|---|---|---|---|
| vLLM | 133,966 | 98.95 | 1353.86 |
| Nano-vLLM | 133,966 | 101.90 | 1314.65 |
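Throughput here is output tokens divided by wall-clock time (e.g. 133,966 / 98.95 ≈ 1,353.9 tokens/s for vLLM). A hypothetical harness in the spirit of `bench.py` might look like the sketch below; the random length sampling matches the configuration above, but pre-tokenized prompts, per-request parameter lists, and `ignore_eos` are assumptions borrowed from vLLM's API rather than confirmed features:

```python
import time
from random import randint, seed

from nanovllm import LLM, SamplingParams

seed(0)
llm = LLM("/path/to/Qwen3-0.6B")  # placeholder path

NUM_SEQS = 256
# Per the configuration above: input and output lengths drawn from [100, 1024].
# Prompts as lists of token ids are an assumption; swap in strings if the
# engine only accepts text prompts.
prompts = [[randint(0, 10_000) for _ in range(randint(100, 1024))]
           for _ in range(NUM_SEQS)]
params = [SamplingParams(temperature=0.6, ignore_eos=True,
                         max_tokens=randint(100, 1024))
          for _ in range(NUM_SEQS)]

start = time.perf_counter()
llm.generate(prompts, params)
elapsed = time.perf_counter() - start

# Throughput = generated tokens / wall-clock time.
total_tokens = sum(p.max_tokens for p in params)
print(f"{total_tokens} tokens in {elapsed:.2f}s -> "
      f"{total_tokens / elapsed:.2f} tokens/s")
```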