From de6eae472debc62dea7160fcf880316caefd37f2 Mon Sep 17 00:00:00 2001
From: Zijie Tian <zijietian@mail.xmu.edu.cn>
Date: Sat, 10 Jan 2026 21:29:39 +0800
Subject: [PATCH] [docs] Update CLAUDE.md with multi-model support
 documentation

- Update overview to reflect Qwen3/Qwen2/Llama support
- Add docs/multi_model_support.md to documentation index
- Add Llama-3.1-8B-Instruct to model limits

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
---
 CLAUDE.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/CLAUDE.md b/CLAUDE.md
index e181e67..c65436a 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -4,7 +4,7 @@ This file provides guidance to Claude Code when working with this repository.
 
 ## Overview
 
-Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline LLM inference. Supports Qwen3 models with CPU offload for long-context inference.
+Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline LLM inference. Supports multiple model architectures (Qwen3, Qwen2, Llama) with CPU offload for long-context inference.
 
 ## GPU Mutex for Multi-Instance Debugging
 
@@ -60,6 +60,7 @@ PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH python tests/test_needle.py
 | Document | Purpose |
 |----------|---------|
 | [`docs/architecture_guide.md`](docs/architecture_guide.md) | Core components, layer-wise CPU offload design, prefill/decode flows, implementation details |
+| [`docs/multi_model_support.md`](docs/multi_model_support.md) | Model registry system, adding new models (Qwen3/Llama), architecture differences, RoPE scaling |
 | [`docs/cuda_graph_offload_guide.md`](docs/cuda_graph_offload_guide.md) | CUDA graph support for CPU offload decode path, 4x decode speedup |
 | [`docs/sparse_attention_guide.md`](docs/sparse_attention_guide.md) | Block sparse attention methods (MInference, FlexPrefill, XAttention, Quest), computation flow |
 | [`docs/sparse_offload_integration.md`](docs/sparse_offload_integration.md) | Sparse policy integration with layerwise offload, `requires_block_selection` interface design |
@@ -91,6 +92,7 @@ PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH python tests/test_needle.py
 **Model Limits**:
 - Qwen3-0.6B/4B: 40960 tokens
 - Qwen2.5-7B-Instruct-1M: 1048576 tokens
+- Llama-3.1-8B-Instruct: 131072 tokens
 
 **Performance (Qwen3-4B, CPU Offload)**:
 - Prefill: ~5700-8000 tok/s (varies by context length)