From de6eae472debc62dea7160fcf880316caefd37f2 Mon Sep 17 00:00:00 2001
From: Zijie Tian
Date: Sat, 10 Jan 2026 21:29:39 +0800
Subject: [PATCH] [docs] Update CLAUDE.md with multi-model support documentation

- Update overview to reflect Qwen3/Qwen2/Llama support
- Add docs/multi_model_support.md to documentation index
- Add Llama-3.1-8B-Instruct to model limits

Co-Authored-By: Claude Opus 4.5
---
 CLAUDE.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/CLAUDE.md b/CLAUDE.md
index e181e67..c65436a 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -4,7 +4,7 @@ This file provides guidance to Claude Code when working with this repository.
 
 ## Overview
 
-Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline LLM inference. Supports Qwen3 models with CPU offload for long-context inference.
+Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline LLM inference. Supports multiple model architectures (Qwen3, Qwen2, Llama) with CPU offload for long-context inference.
 
 ## GPU Mutex for Multi-Instance Debugging
 
@@ -60,6 +60,7 @@ PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH python tests/test_needle.py
 | Document | Purpose |
 |----------|---------|
 | [`docs/architecture_guide.md`](docs/architecture_guide.md) | Core components, layer-wise CPU offload design, prefill/decode flows, implementation details |
+| [`docs/multi_model_support.md`](docs/multi_model_support.md) | Model registry system, adding new models (Qwen3/Llama), architecture differences, RoPE scaling |
 | [`docs/cuda_graph_offload_guide.md`](docs/cuda_graph_offload_guide.md) | CUDA graph support for CPU offload decode path, 4x decode speedup |
 | [`docs/sparse_attention_guide.md`](docs/sparse_attention_guide.md) | Block sparse attention methods (MInference, FlexPrefill, XAttention, Quest), computation flow |
 | [`docs/sparse_offload_integration.md`](docs/sparse_offload_integration.md) | Sparse policy integration with layerwise offload, `requires_block_selection` interface design |
@@ -91,6 +92,7 @@ PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH python tests/test_needle.py
 **Model Limits**:
 - Qwen3-0.6B/4B: 40960 tokens
 - Qwen2.5-7B-Instruct-1M: 1048576 tokens
+- Llama-3.1-8B-Instruct: 131072 tokens
 
 **Performance (Qwen3-4B, CPU Offload)**:
 - Prefill: ~5700-8000 tok/s (varies by context length)