Nano-vLLM

A lightweight vLLM implementation built from scratch.

Key Features

  • 🚀 Fast offline inference - Comparable inference speeds to vLLM
  • 📖 Readable codebase - Clean implementation in ~1,200 lines of Python code
  • ⚡ Optimization suite - Prefix caching, Torch compilation, CUDA graphs, etc.

Installation

pip install git+https://github.com/GeeeekExplorer/nano-vllm.git

Quick Start

See example.py for usage. The API mirrors vLLM's interface with minor differences in the LLM.generate method.
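
The sketch below illustrates the kind of call pattern that description implies. It is modeled on vLLM's API; the module name nanovllm, the enforce_eager flag, and the field accessed on the returned output are assumptions, so example.py remains the authoritative reference.

from nanovllm import LLM, SamplingParams

# Model path and constructor flags below are placeholders.
llm = LLM("/path/to/Qwen3-0.6B", enforce_eager=True)

sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
prompts = ["Explain prefix caching in one sentence."]

# As in vLLM, generate takes a batch of prompts plus sampling parameters;
# the exact shape of the returned objects is assumed here.
outputs = llm.generate(prompts, sampling_params)
print(outputs[0]["text"])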

Benchmark

See bench.py for the benchmark script.

Test Configuration:

  • Hardware: RTX 4070
  • Model: Qwen3-0.6B
  • Total Requests: 256 sequences
  • Input Length: Randomly sampled between 100 and 1,024 tokens
  • Output Length: Randomly sampled between 100 and 1,024 tokens

Performance Results:

Inference Engine   Output Tokens   Time (s)   Throughput (tokens/s)
vLLM               133,966         98.95      1353.86
Nano-vLLM          133,966         101.90     1314.65
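
Throughput is total output tokens divided by wall-clock time (for vLLM above, 133,966 / 98.95 s ≈ 1354 tokens/s). A minimal sketch of such a measurement follows; it reuses the assumed LLM/SamplingParams interface from the Quick Start example, and details such as passing token-id lists directly to generate, the ignore_eos flag, and per-request sampling parameters are assumptions, so bench.py remains the authoritative harness.

import time
from random import randint, seed

from nanovllm import LLM, SamplingParams  # assumed interface, as above

seed(0)
num_seqs = 256
llm = LLM("/path/to/Qwen3-0.6B")

# 256 requests with input and output lengths drawn from [100, 1024],
# mirroring the test configuration listed above.
prompt_token_ids = [
    [randint(0, 10000) for _ in range(randint(100, 1024))]
    for _ in range(num_seqs)
]
sampling_params = [
    SamplingParams(temperature=0.6, ignore_eos=True, max_tokens=randint(100, 1024))
    for _ in range(num_seqs)
]

start = time.time()
llm.generate(prompt_token_ids, sampling_params)
elapsed = time.time() - start

# With ignore_eos, every request generates exactly max_tokens tokens.
total_output_tokens = sum(sp.max_tokens for sp in sampling_params)
print(f"{total_output_tokens} tokens in {elapsed:.2f}s -> "
      f"{total_output_tokens / elapsed:.2f} tokens/s")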