Zijie Tian 726e4b58cf ✨ feat: add GLM-4-9B-Chat-1M model support

Add support for GLM-4 model architecture with the following changes:

- Add glm4.py with ChatGLMForCausalLM, GLM4Model, GLM4Attention, GLM4MLP
- Add GLM4RotaryEmbedding with interleaved partial rotation (rotary_dim = head_dim // 2)
- Add apply_rotary_emb_interleaved function for GLM-4 style RoPE
- Add GLM-4 weight name conversion and loading in loader.py
- Add GLM-4 chat template conversion in test_ruler.py
- Add trust_remote_code=True for GLM-4 config loading
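The interleaved partial rotation named above can be illustrated with a minimal pure-Python sketch. It is not the code in glm4.py: the function name `rotate_interleaved_partial` and its arguments are hypothetical, and it handles a single head vector at a single position rather than batched tensors. It assumes the standard RoPE frequency schedule over `rotary_dim`.

```python
import math

def rotate_interleaved_partial(x, position, rotary_dim, base=10000.0):
    """Rotate the first `rotary_dim` entries of one head vector in adjacent pairs.

    Hypothetical helper for illustration; entries past `rotary_dim` are untouched.
    """
    assert rotary_dim % 2 == 0 and rotary_dim <= len(x)
    out = list(x)  # copy so the unrotated tail passes through unchanged
    for i in range(rotary_dim // 2):
        # Assumed standard RoPE frequency for pair i.
        theta = position * base ** (-2.0 * i / rotary_dim)
        cos, sin = math.cos(theta), math.sin(theta)
        # Interleaved layout: pairs are (0, 1), (2, 3), ... rather than
        # (i, i + rotary_dim // 2) as in the first-half/second-half layout.
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * cos - b * sin
        out[2 * i + 1] = a * sin + b * cos
    return out

head_dim = 8
q = [float(v) for v in range(1, head_dim + 1)]
# Partial rotation: only head_dim // 2 entries are rotated.
rotated = rotate_interleaved_partial(q, position=3, rotary_dim=head_dim // 2)
print(rotated)
```

Each pair is rotated by a plain 2D rotation, so its norm is preserved, and position 0 leaves the vector unchanged.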

Key GLM-4 specific adaptations:
- QKV bias enabled (add_qkv_bias: true)
- RoPE with rope_ratio scaling (base = 10000 * rope_ratio)
- Interleaved RoPE (pairs adjacent elements, not first/second half)
- Partial rotation (only half of head_dim is rotated)
- Uses multi_query_group_num instead of num_key_value_heads
- Uses kv_channels instead of head_dim
- Uses ffn_hidden_size instead of intermediate_size
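One way to picture the field-name adaptations listed above is a small normalization step over the config. This is a sketch under stated assumptions, not the code in this commit: `normalize_glm4_config` is a hypothetical helper, the target keys `rope_theta`, `rotary_dim`, and `qkv_bias` are illustrative names, and the example values are placeholders rather than values read from the actual checkpoint.

```python
def normalize_glm4_config(cfg):
    """Map GLM-4 style config keys onto the more common naming (hypothetical helper)."""
    out = dict(cfg)
    out["num_key_value_heads"] = cfg["multi_query_group_num"]
    out["head_dim"] = cfg["kv_channels"]
    out["intermediate_size"] = cfg["ffn_hidden_size"]
    # RoPE base scales with rope_ratio: base = 10000 * rope_ratio.
    out["rope_theta"] = 10000.0 * cfg.get("rope_ratio", 1.0)
    # Partial rotation: only half of head_dim is rotated.
    out["rotary_dim"] = out["head_dim"] // 2
    out["qkv_bias"] = cfg.get("add_qkv_bias", False)
    return out

# Placeholder values for illustration only.
example = {
    "multi_query_group_num": 2,
    "kv_channels": 128,
    "ffn_hidden_size": 13696,
    "rope_ratio": 500,
    "add_qkv_bias": True,
}
normalized = normalize_glm4_config(example)
print(normalized["rope_theta"], normalized["rotary_dim"])
```

Keeping the original keys alongside the normalized ones means code that still reads the GLM-4 names continues to work.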

Tested with RULER niah_single_1 (5 samples): 100% accuracy
Both GPU-only and CPU offload modes verified

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-28 13:15:57 +08:00

Name	Last commit	Date
.claude	✨ feat: add nsys-profiler agent for kernel performance analysis	2026-01-28 06:24:09 +08:00
3rdparty	chore: add Block-SparseAttention submodule from tzj/vs_offload	2026-01-18 19:22:40 +08:00
assets	add logo and trendshift	2025-11-04 00:45:10 +08:00
csrc	[WIP] Added sgDMA operator for scatter kvcache communication.	2025-12-24 23:48:52 +08:00
docs	⚡️ perf: replace Triton merge with FlashInfer merge_state	2026-01-28 10:04:38 +08:00
nanovllm	✨ feat: add GLM-4-9B-Chat-1M model support	2026-01-28 13:15:57 +08:00
scripts	✨ feat: add context length and error handling to profile_offload.sh	2026-01-28 00:28:37 +08:00
tests	✨ feat: add GLM-4-9B-Chat-1M model support	2026-01-28 13:15:57 +08:00
.gitignore	🔧 chore: add nsys profiling rule and update gitignore	2026-01-27 03:42:17 +08:00
.gitmodules	chore: add Block-SparseAttention submodule from tzj/vs_offload	2026-01-18 19:22:40 +08:00
bench_offload.py	📈 feat: add MemoryObserver for GPU-CPU communication tracking	2026-01-28 04:06:45 +08:00
bench_vllm.py	🔧 chore: add --use-v1 flag to bench_vllm.py	2026-01-27 09:14:55 +08:00
bench.py	♻️ refactor: restructure Observer as base class with InferenceObserver	2026-01-28 03:15:33 +08:00
CLAUDE.md	📚 docs: add 1M+ context length models reference list	2026-01-28 09:04:55 +08:00
example.py	simplify	2025-08-31 20:02:51 +08:00
LICENSE	init commit	2025-06-10 00:27:01 +08:00
pyproject.toml	[WIP] Added sgDMA operator for scatter kvcache communication.	2025-12-24 23:48:52 +08:00
README.md	support qwen2	2025-11-04 01:44:42 +08:00
setup.py	[fix] Fixed compile problem.	2025-12-26 21:02:43 +08:00

README.md

Nano-vLLM

A lightweight vLLM implementation built from scratch.

Key Features

🚀 Fast offline inference - Inference speed comparable to vLLM
📖 Readable codebase - Clean implementation in ~1,200 lines of Python code
⚡ Optimization Suite - Prefix caching, Tensor Parallelism, Torch compilation, CUDA graph, etc.

Installation

pip install git+https://github.com/GeeeekExplorer/nano-vllm.git

Model Download

To download the model weights manually, use the following command:

huggingface-cli download --resume-download Qwen/Qwen3-0.6B \
  --local-dir ~/huggingface/Qwen3-0.6B/ \
  --local-dir-use-symlinks False

Quick Start

See example.py for usage. The API mirrors vLLM's interface with minor differences in the LLM.generate method:

from nanovllm import LLM, SamplingParams
llm = LLM("/YOUR/MODEL/PATH", enforce_eager=True, tensor_parallel_size=1)
sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
prompts = ["Hello, Nano-vLLM."]
outputs = llm.generate(prompts, sampling_params)
print(outputs[0]["text"])

Benchmark

See bench.py for the benchmark script.

Test Configuration:

Hardware: RTX 4070 Laptop (8GB)
Model: Qwen3-0.6B
Total Requests: 256 sequences
Input Length: Randomly sampled between 100–1024 tokens
Output Length: Randomly sampled between 100–1024 tokens

Performance Results:

Inference Engine	Output Tokens	Time (s)	Throughput (tokens/s)
vLLM	133,966	98.37	1361.84
Nano-vLLM	133,966	93.41	1434.13
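The throughput column is output tokens divided by wall-clock time. The short check below recomputes it from the table; the `throughput` helper is an illustrative name, not part of bench.py. Because the times in the table are rounded to two decimals, the recomputed values differ from the reported ones by less than 0.1 tokens/s. On these numbers, Nano-vLLM's reported throughput is about 5% higher than vLLM's.

```python
def throughput(output_tokens, seconds):
    """Output tokens generated per second of wall-clock time."""
    return output_tokens / seconds

# (output tokens, time in seconds) as reported in the table above.
results = {
    "vLLM": (133_966, 98.37),
    "Nano-vLLM": (133_966, 93.41),
}

for name, (tokens, secs) in results.items():
    print(f"{name}: {throughput(tokens, secs):.2f} tokens/s")

# Ratio of the reported throughputs.
relative = 1434.13 / 1361.84
print(f"Nano-vLLM / vLLM: {relative:.3f}x")
```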
