Zijie Tian 86633004ca 📝 docs: add 64k memory analysis and test configuration updates

Add comprehensive memory analysis for 64k inference on Llama 3.1 8B:

New documentation:
- docs/64k_memory_analysis.md: GPU-only vs offload memory analysis,
  OOM root cause (memory fragmentation), RTX 3090 limitations,
  theoretical vs actual memory usage breakdown

Test configuration updates:
- tests/test_ruler.py: Add --num-kv-buffers parameter for ring buffer
  size tuning (default 4, can reduce to 1 for lower memory)
- Update default data_dir to ruler_64k
- Update default max_model_len to 65664 for 64k support

CLAUDE.md updates:
- Add 64k_memory_analysis.md to documentation index
- Document num_kv_buffers parameter in Configuration section
- Add 64k hardware requirements note to Model Limits

Key findings: 64k inference requires ~26GB (GPU-only) or ~23GB (offload)
due to memory fragmentation on 24GB GPUs, making A100 (40GB+) the
recommended hardware for 64k workloads.

Co-Authored-By: Claude <noreply@anthropic.com>
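A rough back-of-envelope check helps put the key findings in context. The sketch below is not taken from docs/64k_memory_analysis.md; it assumes Llama 3.1 8B's published configuration (32 layers, 8 KV heads, head dimension 128, roughly 8.03B parameters) with 16-bit weights and KV cache, and the function names are hypothetical. It covers only weights and KV cache, so activations, ring buffers, allocator overhead, and fragmentation come on top.

```python
# Theoretical lower bound for 64k-context memory on Llama 3.1 8B.
# All model constants below are assumptions based on the public model config,
# not values read from this repository.

GIB = 1024 ** 3


def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=8,
                   head_dim=128, dtype_bytes=2):
    """Bytes needed to hold K and V for every layer and every token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # 2 = K and V
    return per_token * num_tokens


def weight_bytes(num_params=8.03e9, dtype_bytes=2):
    """Bytes needed for the model weights in a 16-bit dtype."""
    return num_params * dtype_bytes


kv = kv_cache_bytes(64 * 1024)
weights = weight_bytes()
print(f"KV cache: {kv / GIB:.2f} GiB")
print(f"Weights:  {weights / GIB:.2f} GiB")
print(f"Total:    {(kv + weights) / GIB:.2f} GiB")
```

Under these assumptions the KV cache alone takes 8 GiB and the weights roughly 15 GiB, which already leaves very little headroom on a 24GB card before any fragmentation is considered.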

2026-01-14 07:02:09 +08:00

.claude

Merge branch 'zijie/add-llama-1': Add multi-model support

2026-01-10 21:20:53 +08:00

assets

add logo and trendshift

2025-11-04 00:45:10 +08:00

csrc

[WIP] Added sgDMA operator for scatter kvcache communication.

2025-12-24 23:48:52 +08:00

docs

📝 docs: add 64k memory analysis and test configuration updates

2026-01-14 07:02:09 +08:00

nanovllm

🐛 fix: remove torch.compile from add_rms_forward to avoid recompilation

2026-01-14 07:02:02 +08:00

scripts

[WIP] NEED to modify communication.

2025-12-24 21:57:51 +08:00

tests

📝 docs: add 64k memory analysis and test configuration updates

2026-01-14 07:02:09 +08:00

.gitignore

[claudesquad] update from 'multi-request-2' on 13 Jan 26 02:01 CST

2026-01-13 02:01:07 +08:00

bench_offload.py

[refactor] Aligned the bench.

2026-01-07 04:25:06 +08:00

bench_vllm.py

[bench] Modify bench_vllm.py

2026-01-09 15:20:37 +08:00

bench.py

[claudesquad] update from 'layer-prefill-1' on 08 Jan 26 03:36 CST

2026-01-08 03:36:39 +08:00

CLAUDE.md

📝 docs: add 64k memory analysis and test configuration updates

2026-01-14 07:02:09 +08:00

DEBUG_SUMMARY.md

[refactor] Refactor the kvcache offload.

2026-01-04 19:37:03 +08:00

example.py

simplify

2025-08-31 20:02:51 +08:00

findings.md

[claudesquad] update from 'multi-request-2' on 13 Jan 26 02:01 CST

2026-01-13 02:01:07 +08:00

LICENSE

init commit

2025-06-10 00:27:01 +08:00

notes.md

[claudesquad] update from 'int-minference-1' on 08 Jan 26 23:22 CST

2026-01-08 23:22:38 +08:00

progress.md

[claudesquad] update from 'multi-request-2' on 13 Jan 26 02:01 CST

2026-01-13 02:01:07 +08:00

pyproject.toml

[WIP] Added sgDMA operator for scatter kvcache communication.

2025-12-24 23:48:52 +08:00

README.md

support qwen2

2025-11-04 01:44:42 +08:00

setup.py

[fix] Fixed compile problem.

2025-12-26 21:02:43 +08:00

task_plan.md

[claudesquad] update from 'multi-request-2' on 13 Jan 26 02:01 CST

2026-01-13 02:01:07 +08:00

README.md

Nano-vLLM

A lightweight vLLM implementation built from scratch.

Key Features

🚀 Fast offline inference - Inference speeds comparable to vLLM
📖 Readable codebase - Clean implementation in ~1,200 lines of Python code
⚡ Optimization suite - Prefix caching, tensor parallelism, Torch compilation, CUDA graphs, etc.

Installation

pip install git+https://github.com/GeeeekExplorer/nano-vllm.git

Model Download

To download the model weights manually, use the following command:

huggingface-cli download --resume-download Qwen/Qwen3-0.6B \
  --local-dir ~/huggingface/Qwen3-0.6B/ \
  --local-dir-use-symlinks False

Quick Start

See example.py for usage. The API mirrors vLLM's interface with minor differences in the LLM.generate method:

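from nanovllm import LLM, SamplingParams

# Load the model from a local directory; enforce_eager=True disables CUDA graph capture.
llm = LLM("/YOUR/MODEL/PATH", enforce_eager=True, tensor_parallel_size=1)
sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
prompts = ["Hello, Nano-vLLM."]
# generate returns one result per prompt; each result is a dict with a "text" field.
outputs = llm.generate(prompts, sampling_params)
print(outputs[0]["text"])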

Benchmark

See bench.py for the benchmark script.

Test Configuration:

Hardware: RTX 4070 Laptop (8GB)
Model: Qwen3-0.6B
Total Requests: 256 sequences
Input Length: Randomly sampled between 100–1024 tokens
Output Length: Randomly sampled between 100–1024 tokens
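A workload with this shape can be sketched as follows. This is an illustration of the configuration above, not the contents of bench.py: the function name, the vocabulary size used to draw random token IDs, and the seed are all assumptions.

```python
import random


def make_workload(num_seqs=256, min_len=100, max_len=1024,
                  vocab_size=10_000, seed=0):
    """Build random token-ID prompts and per-request output lengths.

    vocab_size and seed are illustrative assumptions; only the request
    count and the 100-1024 length range come from the configuration above.
    """
    rng = random.Random(seed)
    # One prompt per request, each with a randomly sampled input length.
    prompts = [
        [rng.randrange(vocab_size) for _ in range(rng.randint(min_len, max_len))]
        for _ in range(num_seqs)
    ]
    # One target output length per request, sampled from the same range.
    output_lens = [rng.randint(min_len, max_len) for _ in range(num_seqs)]
    return prompts, output_lens


prompts, output_lens = make_workload()
print(len(prompts), "requests,", sum(output_lens), "output tokens requested")
```

Fixing the seed keeps the sampled lengths identical across engines, so both see the same total number of output tokens.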

Performance Results:

Inference Engine    Output Tokens    Time (s)    Throughput (tokens/s)
vLLM                133,966          98.37       1361.84
Nano-vLLM           133,966          93.41       1434.13
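The throughput column is output tokens divided by wall-clock time. A minimal sketch that recomputes it from the table and derives the relative speedup (the helper name is hypothetical; results may differ from the table in the second decimal because the reported times are rounded):

```python
def throughput(output_tokens, seconds):
    """Output tokens generated per second of wall-clock time."""
    return output_tokens / seconds


# (output tokens, time in seconds) copied from the table above.
results = {
    "vLLM": (133_966, 98.37),
    "Nano-vLLM": (133_966, 93.41),
}

for name, (tokens, secs) in results.items():
    print(f"{name}: {throughput(tokens, secs):.2f} tokens/s")

# Both engines produced the same number of tokens, so the speedup
# reduces to the ratio of the two run times.
speedup = results["vLLM"][1] / results["Nano-vLLM"][1]
print(f"Nano-vLLM speedup over vLLM: {speedup:.3f}x")
```

On this run Nano-vLLM comes out about 5% faster than vLLM.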
