[docs] Update CLAUDE.md with multi-model support documentation
- Update overview to reflect Qwen3/Qwen2/Llama support - Add docs/multi_model_support.md to documentation index - Add Llama-3.1-8B-Instruct to model limits Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
@@ -4,7 +4,7 @@ This file provides guidance to Claude Code when working with this repository.
|
||||
|
||||
## Overview
|
||||
|
||||
Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline LLM inference. Supports Qwen3 models with CPU offload for long-context inference.
|
||||
Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline LLM inference. Supports multiple model architectures (Qwen3, Qwen2, Llama) with CPU offload for long-context inference.
|
||||
|
||||
## GPU Mutex for Multi-Instance Debugging
|
||||
|
||||
@@ -60,6 +60,7 @@ PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH python tests/test_needle.py
|
||||
| Document | Purpose |
|
||||
|----------|---------|
|
||||
| [`docs/architecture_guide.md`](docs/architecture_guide.md) | Core components, layer-wise CPU offload design, prefill/decode flows, implementation details |
|
||||
| [`docs/multi_model_support.md`](docs/multi_model_support.md) | Model registry system, adding new models (Qwen3/Llama), architecture differences, RoPE scaling |
|
||||
| [`docs/cuda_graph_offload_guide.md`](docs/cuda_graph_offload_guide.md) | CUDA graph support for CPU offload decode path, 4x decode speedup |
|
||||
| [`docs/sparse_attention_guide.md`](docs/sparse_attention_guide.md) | Block sparse attention methods (MInference, FlexPrefill, XAttention, Quest), computation flow |
|
||||
| [`docs/sparse_offload_integration.md`](docs/sparse_offload_integration.md) | Sparse policy integration with layerwise offload, `requires_block_selection` interface design |
|
||||
@@ -91,6 +92,7 @@ PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH python tests/test_needle.py
|
||||
**Model Limits**:
|
||||
- Qwen3-0.6B/4B: 40960 tokens
|
||||
- Qwen2.5-7B-Instruct-1M: 1048576 tokens
|
||||
- Llama-3.1-8B-Instruct: 131072 tokens
|
||||
|
||||
**Performance (Qwen3-4B, CPU Offload)**:
|
||||
- Prefill: ~5700-8000 tok/s (varies by context length)
|
||||
|
||||
Reference in New Issue
Block a user