[docs] Update CLAUDE.md with multi-model support documentation

- Update overview to reflect Qwen3/Qwen2/Llama support - Add docs/multi_model_support.md to documentation index - Add Llama-3.1-8B-Instruct to model limits Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 21:29:39 +08:00
parent e23be2e844
commit de6eae472d
1 changed files with 3 additions and 1 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -4,7 +4,7 @@ This file provides guidance to Claude Code when working with this repository.

 ## Overview

-Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline LLM inference. Supports Qwen3 models with CPU offload for long-context inference.
+Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline LLM inference. Supports multiple model architectures (Qwen3, Qwen2, Llama) with CPU offload for long-context inference.

 ## GPU Mutex for Multi-Instance Debugging

@@ -60,6 +60,7 @@ PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH python tests/test_needle.py
 | Document | Purpose |
 |----------|---------|
 | [`docs/architecture_guide.md`](docs/architecture_guide.md) | Core components, layer-wise CPU offload design, prefill/decode flows, implementation details |
+| [`docs/multi_model_support.md`](docs/multi_model_support.md) | Model registry system, adding new models (Qwen3/Llama), architecture differences, RoPE scaling |
 | [`docs/cuda_graph_offload_guide.md`](docs/cuda_graph_offload_guide.md) | CUDA graph support for CPU offload decode path, 4x decode speedup |
 | [`docs/sparse_attention_guide.md`](docs/sparse_attention_guide.md) | Block sparse attention methods (MInference, FlexPrefill, XAttention, Quest), computation flow |
 | [`docs/sparse_offload_integration.md`](docs/sparse_offload_integration.md) | Sparse policy integration with layerwise offload, `requires_block_selection` interface design |
@@ -91,6 +92,7 @@ PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH python tests/test_needle.py
 **Model Limits**:
 - Qwen3-0.6B/4B: 40960 tokens
 - Qwen2.5-7B-Instruct-1M: 1048576 tokens
+- Llama-3.1-8B-Instruct: 131072 tokens

 **Performance (Qwen3-4B, CPU Offload)**:
 - Prefill: ~5700-8000 tok/s (varies by context length)