Nano-vLLM

A lightweight vLLM implementation built from scratch.

Key Features

  • 🚀 Fast offline inference - Comparable inference speeds to vLLM
  • 📖 Readable codebase - Clean implementation in ~1,200 lines of Python code
  • ⚡ Optimization suite - Prefix caching, Torch compilation, CUDA graphs, etc.

Installation

pip install git+https://github.com/GeeeekExplorer/nano-vllm.git

Quick Start

See example.py for usage. The API mirrors vLLM's interface with minor differences in the LLM.generate method.
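
The sketch below illustrates the kind of call pattern that description implies. It is modeled on vLLM's API; the module name nanovllm, the enforce_eager flag, and the field accessed on the returned output are assumptions, so example.py remains the authoritative reference.

from nanovllm import LLM, SamplingParams

# Model path and constructor flags below are placeholders.
llm = LLM("/path/to/Qwen3-0.6B", enforce_eager=True)

sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
prompts = ["Explain prefix caching in one sentence."]

# As in vLLM, generate takes a batch of prompts plus sampling parameters;
# the exact shape of the returned objects is assumed here.
outputs = llm.generate(prompts, sampling_params)
print(outputs[0]["text"])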

Benchmark

See bench.py for the benchmark script.

Test Configuration:

  • Hardware: RTX 4070
  • Model: Qwen3-0.6B
  • Total Requests: 256 sequences
  • Input Length: Randomly sampled between 100 and 1,024 tokens
  • Output Length: Randomly sampled between 100 and 1,024 tokens

Performance Results:

Inference Engine   Output Tokens   Time (s)   Throughput (tokens/s)
vLLM               133,966         98.95      1353.86
Nano-vLLM          133,966         101.90     1314.65
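
Throughput is total output tokens divided by wall-clock time (for vLLM above, 133,966 / 98.95 s ≈ 1354 tokens/s). A minimal sketch of such a measurement follows; it reuses the assumed LLM/SamplingParams interface from the Quick Start example, and details such as passing token-id lists directly to generate, the ignore_eos flag, and per-request sampling parameters are assumptions, so bench.py remains the authoritative harness.

import time
from random import randint, seed

from nanovllm import LLM, SamplingParams  # assumed interface, as above

seed(0)
num_seqs = 256
llm = LLM("/path/to/Qwen3-0.6B")

# 256 requests with input and output lengths drawn from [100, 1024],
# mirroring the test configuration listed above.
prompt_token_ids = [
    [randint(0, 10000) for _ in range(randint(100, 1024))]
    for _ in range(num_seqs)
]
sampling_params = [
    SamplingParams(temperature=0.6, ignore_eos=True, max_tokens=randint(100, 1024))
    for _ in range(num_seqs)
]

start = time.time()
llm.generate(prompt_token_ids, sampling_params)
elapsed = time.time() - start

# With ignore_eos, every request generates exactly max_tokens tokens.
total_output_tokens = sum(sp.max_tokens for sp in sampling_params)
print(f"{total_output_tokens} tokens in {elapsed:.2f}s -> "
      f"{total_output_tokens / elapsed:.2f} tokens/s")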