11a867f6fba4b97dd42885a35e3e23e0a31c89f8
In offload mode, GQA expansion buffers (_k_expanded, _v_expanded) are not needed since compute_chunked_prefill() handles GQA inline. Previously, these buffers were always allocated based on max_model_len, causing OOM on 24GB GPUs (e.g., RTX 3090) when max_model_len=1M (16GB buffer). Changes: - Add enable_cpu_offload parameter to alloc_policy_metadata() in base class - Skip GQA buffer allocation when enable_cpu_offload=True in XAttentionBSAPolicy - Pass enable_cpu_offload from model_runner to policy Memory savings: ~16GB for 1M seq, ~1.1GB for 72K seq Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Nano-vLLM
A lightweight vLLM implementation built from scratch.
Key Features
- 🚀 Fast offline inference - Comparable inference speeds to vLLM
- 📖 Readable codebase - Clean implementation in ~ 1,200 lines of Python code
- ⚡ Optimization Suite - Prefix caching, Tensor Parallelism, Torch compilation, CUDA graph, etc.
Installation
pip install git+https://github.com/GeeeekExplorer/nano-vllm.git
Model Download
To download the model weights manually, use the following command:
huggingface-cli download --resume-download Qwen/Qwen3-0.6B \
--local-dir ~/huggingface/Qwen3-0.6B/ \
--local-dir-use-symlinks False
Quick Start
See example.py for usage. The API mirrors vLLM's interface with minor differences in the LLM.generate method:
from nanovllm import LLM, SamplingParams
llm = LLM("/YOUR/MODEL/PATH", enforce_eager=True, tensor_parallel_size=1)
sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
prompts = ["Hello, Nano-vLLM."]
outputs = llm.generate(prompts, sampling_params)
outputs[0]["text"]
Benchmark
See bench.py for benchmark.
Test Configuration:
- Hardware: RTX 4070 Laptop (8GB)
- Model: Qwen3-0.6B
- Total Requests: 256 sequences
- Input Length: Randomly sampled between 100–1024 tokens
- Output Length: Randomly sampled between 100–1024 tokens
Performance Results:
| Inference Engine | Output Tokens | Time (s) | Throughput (tokens/s) |
|---|---|---|---|
| vLLM | 133,966 | 98.37 | 1361.84 |
| Nano-vLLM | 133,966 | 93.41 | 1434.13 |
Star History
Languages
Python
96.2%
Shell
2.4%
C++
1.1%
Cuda
0.3%
