
Nano-vLLM

A lightweight vLLM implementation built from scratch.

Key Features

  • 🚀 Fast offline inference - Comparable inference speeds to vLLM
  • 📖 Readable codebase - Clean implementation in ~1,200 lines of Python code
  • Optimization suite - Prefix caching, tensor parallelism, Torch compilation, CUDA graphs, etc.
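
To give a feel for what prefix caching means, here is a minimal, generic sketch of block-level prefix matching. It is an illustration only, not Nano-vLLM's actual implementation: the block size, the function names (block_hashes, count_cached_blocks), and the use of SHA-256 are all assumptions made for this example. Each full block of token IDs is hashed together with the hash of the block before it, so two prompts that share a prefix produce identical leading hashes and the cached work for those blocks can be reused.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical block size, kept tiny for the demo


def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Return one chained hash per full block of token IDs."""
    hashes = []
    prev = b""
    full = len(token_ids) - len(token_ids) % block_size  # ignore the partial tail
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = prev + ",".join(map(str, block)).encode()
        prev = hashlib.sha256(payload).digest()  # chaining ties a block to its prefix
        hashes.append(prev)
    return hashes


def count_cached_blocks(token_ids, cache):
    """Count how many leading blocks of a prompt are already in the cache."""
    reused = 0
    for h in block_hashes(token_ids):
        if h not in cache:
            break
        reused += 1
    return reused


cache = set()
first = [1, 2, 3, 4, 5, 6, 7, 8, 9]
cache.update(block_hashes(first))  # two full blocks are cached

second = [1, 2, 3, 4, 5, 6, 7, 8, 42, 43, 44, 45]
reused = count_cached_blocks(second, cache)
print(f"Reusable blocks for second prompt: {reused}")
```

Because the hashes are chained, a block only counts as cached when every block before it matches too, which is what makes the lookup a prefix match rather than a match on isolated blocks.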

Installation

pip install git+https://github.com/GeeeekExplorer/nano-vllm.git

Quick Start

See example.py for usage. The API mirrors vLLM's interface with minor differences in the LLM.generate method.

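from nanovllm import LLM, SamplingParams

# Load the model from a local directory; enforce_eager=True skips CUDA graph capture.
llm = LLM("/YOUR/MODEL/PATH", enforce_eager=True, tensor_parallel_size=1)
sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
prompts = ["Hello, Nano-vLLM."]
# generate returns one result per prompt; the generated text is under the "text" key.
outputs = llm.generate(prompts, sampling_params)
print(outputs[0]["text"])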

Benchmark

See bench.py for the benchmark script.

Test Configuration:

  • Hardware: RTX 4070 Laptop (8GB)
  • Model: Qwen3-0.6B
  • Total Requests: 256 sequences
  • Input Length: Randomly sampled between 100 and 1024 tokens
  • Output Length: Randomly sampled between 100 and 1024 tokens
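
A workload with this shape could be generated roughly as follows. This is a sketch under stated assumptions, not the contents of bench.py: the vocabulary size, the seed, and the make_workload helper are hypothetical, and prompts are represented as lists of random token IDs rather than text.

```python
import random

NUM_SEQS = 256               # total requests, from the configuration above
MIN_LEN, MAX_LEN = 100, 1024  # token length range, from the configuration above
VOCAB_SIZE = 10_000           # hypothetical: upper bound for random token IDs


def make_workload(seed=0):
    """Build random token-ID prompts and a target output length per request."""
    rng = random.Random(seed)  # fixed seed so engines see the same workload
    prompts = [
        [rng.randrange(VOCAB_SIZE) for _ in range(rng.randint(MIN_LEN, MAX_LEN))]
        for _ in range(NUM_SEQS)
    ]
    output_lens = [rng.randint(MIN_LEN, MAX_LEN) for _ in range(NUM_SEQS)]
    return prompts, output_lens


prompts, output_lens = make_workload()
print(f"{len(prompts)} prompts, {sum(output_lens)} output tokens requested")
```

Fixing the seed matters when comparing engines: both should receive identical prompts and identical output-length targets, otherwise the throughput numbers are not comparable.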

Performance Results:

Inference Engine    Output Tokens    Time (s)    Throughput (tokens/s)
vLLM                133,966          98.37       1361.84
Nano-vLLM           133,966          93.41       1434.13
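
The throughput column is output tokens divided by wall-clock time. The short sketch below recomputes it from the reported figures and derives the relative speedup; the throughput helper and variable names are introduced here for illustration only. Since the reported times are rounded to two decimals, the recomputed values land within a few hundredths of a token per second of the reported ones rather than matching them exactly.

```python
def throughput(output_tokens: int, seconds: float) -> float:
    """Output tokens generated per second of wall-clock time."""
    return output_tokens / seconds


# (output tokens, time in seconds) as reported in the results above
results = {
    "vLLM": (133_966, 98.37),
    "Nano-vLLM": (133_966, 93.41),
}

rates = {name: throughput(tokens, secs) for name, (tokens, secs) in results.items()}
speedup = rates["Nano-vLLM"] / rates["vLLM"]

for name, rate in rates.items():
    print(f"{name}: {rate:.2f} tokens/s")
print(f"Nano-vLLM relative throughput: {speedup:.3f}x")
```

On this configuration the ratio works out to roughly a 5% throughput advantage for Nano-vLLM, which is consistent with the "comparable inference speeds" claim rather than a large gap.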