better
README.md
@@ -6,7 +6,7 @@ A lightweight vLLM implementation built from scratch.
* 🚀 **Fast offline inference** - Comparable inference speeds to vLLM
* 📖 **Readable codebase** - Clean implementation in ~ 1,200 lines of Python code
-* ⚡ **Optimization Suite** - Prefix caching, Torch compilation, CUDA graph, etc.
+* ⚡ **Optimization Suite** - Prefix caching, Tensor Parallelism, Torch compilation, CUDA graph, etc.

## Installation
@@ -17,6 +17,14 @@ pip install git+https://github.com/GeeeekExplorer/nano-vllm.git
## Quick Start
See `example.py` for usage. The API mirrors vLLM's interface with minor differences in the `LLM.generate` method.
```python
from nanovllm import LLM, SamplingParams
# Point LLM at a local model directory; enforce_eager=True runs the model
# eagerly instead of capturing CUDA graphs.
llm = LLM("/YOUR/MODEL/PATH", enforce_eager=True, tensor_parallel_size=1)
sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
prompts = ["Hello, Nano-vLLM."]
outputs = llm.generate(prompts, sampling_params)
# Each output is a dict; its "text" field holds the generated completion.
print(outputs[0]["text"])
```
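
The visible change in this commit is the new **Tensor Parallelism** entry in the feature list. As a purely illustrative sketch (not taken from the README): the same Quick Start API should shard the model across GPUs by raising `tensor_parallel_size`, the parameter already shown above, assuming at least two GPUs are visible.

```python
from nanovllm import LLM, SamplingParams

# Hypothetical multi-GPU variant of the Quick Start: assumes 2 GPUs are
# visible; tensor_parallel_size=2 shards the model weights across them.
llm = LLM("/YOUR/MODEL/PATH", tensor_parallel_size=2)
sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
outputs = llm.generate(["Hello, Nano-vLLM."], sampling_params)
print(outputs[0]["text"])
```

As in the example above, `generate` returns one result per prompt, so `outputs[0]` is the completion for the single prompt passed in.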
## Benchmark