# Task Plan: Multi-Model Support for nanovllm
## Goal

Extend the nanovllm framework to support multiple models (currently only Qwen3 is supported), in particular adding Llama-3.1-8B-Instruct support, and establish an extensible paradigm for adding new models.
## Current State Analysis

### Hardcoded Locations

- `nanovllm/engine/model_runner.py:35`: directly instantiates `Qwen3ForCausalLM(hf_config)`
- `nanovllm/engine/model_runner.py:9`: hardcoded import `from nanovllm.models.qwen3 import Qwen3ForCausalLM`
### Qwen3 vs Llama 3.1 Architecture Differences
| Feature | Qwen3 | Llama 3.1 |
|---|---|---|
| Config Class | Qwen3Config | LlamaConfig |
| attention_bias | True (configurable) | False |
| q_norm/k_norm | Present (when bias=False) | Absent |
| mlp_bias | N/A | False |
| RoPE Scaling | None (currently) | llama3 type |
| RoPE theta | 1000000 | 500000 |
| hidden_act | silu | silu |
| tie_word_embeddings | True | False |
### Key Constraint

- `rotary_embedding.py:59` contains `assert rope_scaling is None`, so RoPE scaling is not supported yet; Llama 3.1's llama3-type scaling cannot be loaded as-is.
## Phases

### Phase 1: Create Model Registry Pattern [pending]

Files to modify:
- `nanovllm/models/__init__.py` (new)
- `nanovllm/models/registry.py` (new)
Tasks:
- Create the model registry mechanism
- Define the model registration decorator `@register_model`
- Implement a `get_model_class(hf_config)` function that selects the model class automatically from the config's `architectures` field
Design:

```python
MODEL_REGISTRY: dict[str, type] = {}

def register_model(*architectures):
    """Decorator to register a model class for given architecture names."""
    def decorator(cls):
        for arch in architectures:
            MODEL_REGISTRY[arch] = cls
        return cls
    return decorator

def get_model_class(hf_config) -> type:
    """Get model class based on HF config architectures."""
    for arch in hf_config.architectures:
        if arch in MODEL_REGISTRY:
            return MODEL_REGISTRY[arch]
    raise ValueError(f"Unsupported architecture: {hf_config.architectures}")
```
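As a quick sanity check, the registry can be exercised with a stand-in config object; the `SimpleNamespace` config and the dummy model class below are purely illustrative, standing in for a real HF config (which carries an `architectures` list) and the real model class. The registry code is repeated so the example is self-contained.

```python
from types import SimpleNamespace

# Same registry as the Design sketch above, repeated for self-containment.
MODEL_REGISTRY: dict[str, type] = {}

def register_model(*architectures):
    """Decorator to register a model class for given architecture names."""
    def decorator(cls):
        for arch in architectures:
            MODEL_REGISTRY[arch] = cls
        return cls
    return decorator

def get_model_class(hf_config) -> type:
    """Get model class based on HF config architectures."""
    for arch in hf_config.architectures:
        if arch in MODEL_REGISTRY:
            return MODEL_REGISTRY[arch]
    raise ValueError(f"Unsupported architecture: {hf_config.architectures}")

# Dummy model class standing in for the real implementation.
@register_model("Qwen3ForCausalLM")
class Qwen3ForCausalLM:
    pass

# An HF config only needs an `architectures` list for the lookup.
cfg = SimpleNamespace(architectures=["Qwen3ForCausalLM"])
assert get_model_class(cfg) is Qwen3ForCausalLM
```

Unknown architectures fall through the loop and raise `ValueError`, which surfaces a clear error at load time instead of a confusing weight-loading failure later.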
### Phase 2: Add Llama3 RoPE Scaling Support [pending]

Files to modify:
- `nanovllm/layers/rotary_embedding.py`
Tasks:
- Implement a `Llama3RotaryEmbedding` class that supports the llama3 rope_type
- Modify the `get_rope()` function to select an implementation based on the rope_scaling type
- Keep backward compatibility (rope_scaling=None uses the original implementation)
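The dispatch in `get_rope()` could look roughly like this. The stub classes stand in for the real implementations in `nanovllm/layers/rotary_embedding.py`, and the constructor signatures are assumptions for illustration, not the actual API:

```python
class RotaryEmbedding:
    """Stub for the existing rotary embedding (rope_scaling=None path)."""
    def __init__(self, head_size, rotary_dim, max_position, base):
        self.head_size = head_size
        self.rotary_dim = rotary_dim
        self.max_position = max_position
        self.base = base

class Llama3RotaryEmbedding(RotaryEmbedding):
    """Stub for the new llama3-scaled variant (Phase 2)."""
    def __init__(self, head_size, rotary_dim, max_position, base, scaling):
        super().__init__(head_size, rotary_dim, max_position, base)
        self.scaling = scaling  # factor, low/high_freq_factor, etc.

def get_rope(head_size, rotary_dim, max_position, base, rope_scaling=None):
    # Backward compatible: rope_scaling=None keeps the original path.
    if rope_scaling is None:
        return RotaryEmbedding(head_size, rotary_dim, max_position, base)
    # Newer HF configs use "rope_type"; older checkpoints may use "type".
    rope_type = rope_scaling.get("rope_type", rope_scaling.get("type"))
    if rope_type == "llama3":
        return Llama3RotaryEmbedding(head_size, rotary_dim, max_position,
                                     base, rope_scaling)
    raise ValueError(f"Unsupported rope_scaling type: {rope_type}")
```

This keeps the current `assert`-free happy path for Qwen3 while failing loudly on scaling types the framework has not implemented.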
Llama3 RoPE Scaling Formula:

```python
# From transformers:
# low_freq_factor, high_freq_factor, original_max_position_embeddings
# Adjust frequencies based on wavelength thresholds
```
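Expanded, the adjustment transformers applies for the llama3 rope_type looks roughly like the sketch below. It is a plain-Python rendering of the logic (transformers implements it over tensors in `_compute_llama3_parameters`); the function name and the use of lists are simplifications:

```python
import math

def apply_llama3_scaling(inv_freq, factor, low_freq_factor,
                         high_freq_factor, original_max_position_embeddings):
    """Scale inverse frequencies the way the llama3 rope_type does.

    High-frequency components (short wavelengths) are kept as-is,
    low-frequency components are divided by `factor`, and the band in
    between is linearly interpolated between the two extremes.
    """
    low_freq_wavelen = original_max_position_embeddings / low_freq_factor
    high_freq_wavelen = original_max_position_embeddings / high_freq_factor
    scaled = []
    for freq in inv_freq:
        wavelen = 2 * math.pi / freq
        if wavelen < high_freq_wavelen:      # high frequency: keep
            scaled.append(freq)
        elif wavelen > low_freq_wavelen:     # low frequency: scale down
            scaled.append(freq / factor)
        else:                                # medium band: interpolate
            smooth = (original_max_position_embeddings / wavelen
                      - low_freq_factor) / (high_freq_factor - low_freq_factor)
            scaled.append((1 - smooth) * freq / factor + smooth * freq)
    return scaled

# Llama-3.1 ships factor=8.0, low_freq_factor=1.0, high_freq_factor=4.0,
# original_max_position_embeddings=8192 in its rope_scaling config.
```

The real `Llama3RotaryEmbedding` would apply this once to the precomputed `inv_freq` buffer, so the hot path of the existing rotary embedding stays unchanged.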
### Phase 3: Implement Llama Model [pending]

Files to create:
- `nanovllm/models/llama.py`
Tasks:
- Create a `LlamaAttention` class (no q_norm/k_norm, no QKV bias)
- Create a `LlamaMLP` class (similar to `Qwen3MLP`, no bias)
- Create a `LlamaDecoderLayer` class
- Create `LlamaModel` and `LlamaForCausalLM` classes
- Add `packed_modules_mapping` to support weight loading
- Register with `@register_model("LlamaForCausalLM")`
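A minimal skeleton of the registration plus weight-loading mapping might look as follows. The registry is inlined so the sketch runs on its own; in `nanovllm/models/llama.py` it would be imported from `nanovllm.models.registry`, and the real class would subclass `nn.Module` and reuse the existing layers. The mapping format shown (checkpoint shard name to fused parameter and shard id) mirrors what nanovllm's Qwen3 model uses, but should be verified against the actual weight loader:

```python
# Inlined stand-in for nanovllm.models.registry (see Phase 1).
MODEL_REGISTRY: dict[str, type] = {}

def register_model(*architectures):
    def decorator(cls):
        for arch in architectures:
            MODEL_REGISTRY[arch] = cls
        return cls
    return decorator

@register_model("LlamaForCausalLM")
class LlamaForCausalLM:
    # Tell the weight loader which HF checkpoint shards pack into which
    # fused parameters: q/k/v -> qkv_proj, gate/up -> gate_up_proj.
    packed_modules_mapping = {
        "q_proj": ("qkv_proj", "q"),
        "k_proj": ("qkv_proj", "k"),
        "v_proj": ("qkv_proj", "v"),
        "gate_proj": ("gate_up_proj", 0),
        "up_proj": ("gate_up_proj", 1),
    }
```

Since Llama has no q_norm/k_norm and no biases, the rest of the file is largely the Qwen3 module structure with those pieces removed and the llama3 RoPE wired in.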
### Phase 4: Modify ModelRunner for Dynamic Loading [pending]

Files to modify:
- `nanovllm/engine/model_runner.py`
Tasks:
- Remove the hardcoded import `from nanovllm.models.qwen3 import Qwen3ForCausalLM`
- Import `from nanovllm.models import get_model_class`
- Replace `self.model = Qwen3ForCausalLM(hf_config)` with:

```python
model_class = get_model_class(hf_config)
self.model = model_class(hf_config)
```
### Phase 5: Register Qwen3 Model [pending]

Files to modify:
- `nanovllm/models/qwen3.py`
Tasks:
- Import `from nanovllm.models.registry import register_model`
- Add the `@register_model("Qwen3ForCausalLM", "Qwen2ForCausalLM")` decorator
### Phase 6: Test with Llama-3.1-8B-Instruct [pending]

Files:
- `tests/test_needle.py` (existing, used for validation)
Tasks:
- Run the needle test: `python tests/test_needle.py --model ~/models/Llama-3.1-8B-Instruct`
- Verify the model loads correctly
- Verify the inference output is correct
## Errors Encountered
| Error | Attempt | Resolution |
|---|---|---|
| (none yet) | | |
## Success Criteria

- Analysis complete: current architecture and required changes are understood
- Phase 1: model registry implemented
- Phase 2: Llama3 RoPE scaling supported
- Phase 3: Llama model implemented
- Phase 4: ModelRunner loads models dynamically
- Phase 5: Qwen3 model registered
- Phase 6: Llama needle test passes
## Notes

- Keep existing Qwen3 functionality unchanged
- Follow the existing code style
- Reuse the existing layers components (Linear, RMSNorm, Embedding, etc.)
- Add only the necessary code; do not over-engineer