# Nano-vLLM

A lightweight vLLM implementation built from scratch.

## Key Features

* 🚀 **Fast offline inference** - Inference speeds comparable to vLLM
* 📖 **Readable codebase** - Clean implementation in ~1,200 lines of Python code
* ⚡ **Optimization Suite** - Prefix caching, Torch compilation, CUDA graphs, etc.

## Installation

```bash
pip install git+https://github.com/GeeeekExplorer/nano-vllm.git
```

## Quick Start

See `example.py` for usage. The API mirrors vLLM's interface, with minor differences in the `LLM.generate` method. A minimal usage sketch appears after the benchmark section below.

## Benchmark

See `bench.py` for the benchmark script.

**Test Configuration:**

- Hardware: RTX 4070
- Model: Qwen3-0.6B
- Total Requests: 256 sequences
- Input Length: Randomly sampled between 100 and 1024 tokens
- Output Length: Randomly sampled between 100 and 1024 tokens

**Performance Results:**

| Inference Engine | Output Tokens | Time (s) | Throughput (tokens/s) |
|------------------|---------------|----------|-----------------------|
| vLLM             | 133,966       | 98.95    | 1353.86               |
| Nano-vLLM        | 133,966       | 101.90   | 1314.65               |
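## Usage Sketch

Since `example.py` is not reproduced here, the following is a minimal sketch of what usage might look like, assuming the package exposes `LLM` and `SamplingParams` from a `nanovllm` module and that `LLM.generate` accepts a list of prompts plus sampling parameters, mirroring vLLM's API as the Quick Start section notes. The model path, parameter names, and output format are illustrative assumptions; `example.py` remains the authoritative reference.

```python
# Minimal usage sketch, not a verbatim copy of example.py.
# Assumes the package exposes LLM and SamplingParams under `nanovllm`,
# mirroring vLLM's interface per the Quick Start section above.
from nanovllm import LLM, SamplingParams

# Model identifier is an assumption; a local path may also work.
llm = LLM("Qwen/Qwen3-0.6B")

# Sampling parameters mirroring vLLM's SamplingParams (values illustrative).
sampling_params = SamplingParams(temperature=0.6, max_tokens=256)

prompts = ["Hello, Nano-vLLM."]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    # Assumed output format; this is one of the places where LLM.generate
    # may differ from vLLM, as the Quick Start section mentions.
    print(output["text"])
```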