Task Plan: Multi-Model Support for nanovllm

Goal

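Extend the nanovllm framework to support multiple models (only Qwen3 is supported today), in particular by adding Llama-3.1-8B-Instruct support, and establish an extensible pattern for adding further models.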

Current State Analysis

Hard-coded locations

  • nanovllm/engine/model_runner.py:35: instantiates Qwen3ForCausalLM(hf_config) directly
  • nanovllm/engine/model_runner.py:9: hard-coded import from nanovllm.models.qwen3 import Qwen3ForCausalLM

Qwen3 vs Llama 3.1 architecture differences

| Feature | Qwen3 | Llama 3.1 |
|---|---|---|
| Config Class | Qwen3Config | LlamaConfig |
| attention_bias | True (configurable) | False |
| q_norm/k_norm | Yes (when bias=False) | No |
| mlp_bias | N/A | False |
| RoPE Scaling | None (currently) | llama3 type |
| RoPE theta | 1000000 | 500000 |
| hidden_act | silu | silu |
| tie_word_embeddings | True | False |

Key limitation

  • rotary_embedding.py:59: assert rope_scaling is None, so RoPE scaling is not supported

Phases

Phase 1: Create Model Registry Pattern [pending]

Files to modify:

  • nanovllm/models/__init__.py (new)
  • nanovllm/models/registry.py (new)

Tasks:

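  1. Create the model registry mechanism
  2. Define the model registration decorator @register_model
  3. Implement get_model_class(hf_config), which selects the model class automatically from the architectures field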

Design:

MODEL_REGISTRY: dict[str, type] = {}

def register_model(*architectures):
    """Decorator to register a model class for given architecture names."""
    def decorator(cls):
        for arch in architectures:
            MODEL_REGISTRY[arch] = cls
        return cls
    return decorator

def get_model_class(hf_config) -> type:
    """Get model class based on HF config architectures."""
    for arch in hf_config.architectures:
        if arch in MODEL_REGISTRY:
            return MODEL_REGISTRY[arch]
    raise ValueError(f"Unsupported architecture: {hf_config.architectures}")
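
As a usage sketch of the design above, the snippet below restates the registry in compact form so that it runs on its own, then registers a placeholder class and resolves it from a stand-in config. `DummyLlama` and the `SimpleNamespace` configs are hypothetical stand-ins for a real model class and a Hugging Face config object. Note that a decorator only runs when its module is imported, so `nanovllm/models/__init__.py` would presumably need to import each model module for the registry to be populated.

```python
from types import SimpleNamespace

MODEL_REGISTRY: dict[str, type] = {}

def register_model(*architectures):
    def decorator(cls):
        for arch in architectures:
            MODEL_REGISTRY[arch] = cls
        return cls
    return decorator

def get_model_class(hf_config) -> type:
    for arch in hf_config.architectures:
        if arch in MODEL_REGISTRY:
            return MODEL_REGISTRY[arch]
    raise ValueError(f"Unsupported architecture: {hf_config.architectures}")

# Hypothetical placeholder; a real model class would build its layers from hf_config.
@register_model("LlamaForCausalLM")
class DummyLlama:
    def __init__(self, hf_config):
        self.config = hf_config

# Stand-in for a Hugging Face config; only the `architectures` field is read.
llama_config = SimpleNamespace(architectures=["LlamaForCausalLM"])
model = get_model_class(llama_config)(llama_config)

# An unregistered architecture raises ValueError rather than falling back silently.
try:
    get_model_class(SimpleNamespace(architectures=["MistralForCausalLM"]))
    lookup_failed = False
except ValueError:
    lookup_failed = True
```

The lookup walks `architectures` in order and returns the first registered match, so a config listing several names still resolves as long as one of them is registered.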

Phase 2: Add Llama3 RoPE Scaling Support [pending]

Files to modify:

  • nanovllm/layers/rotary_embedding.py

Tasks:

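  1. Implement a Llama3RotaryEmbedding class that supports the llama3 rope_type
  2. Modify get_rope() to select the implementation based on the rope_scaling type
  3. Keep backward compatibility (rope_scaling=None uses the original implementation)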
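
The selection logic in tasks 2 and 3 could look roughly like the sketch below. `select_rope_variant` is a hypothetical helper that returns a label instead of constructing a module, so that it stays self-contained; the real get_rope() would build the corresponding embedding class. It assumes `rope_scaling` is either None or a dict in the Hugging Face style, where newer configs use the key `rope_type` and older ones use `type`.

```python
def select_rope_variant(rope_scaling):
    """Return which rotary implementation a get_rope()-style factory should build."""
    if rope_scaling is None:
        # Backward-compatible path: the original, unscaled rotary embedding.
        return "default"
    # Prefer the newer "rope_type" key and fall back to the legacy "type" key.
    rope_type = rope_scaling.get("rope_type", rope_scaling.get("type"))
    if rope_type == "llama3":
        return "llama3"
    # Fail loudly for scaling schemes that have not been implemented.
    raise ValueError(f"Unsupported rope_scaling type: {rope_type!r}")
```

If get_rope() is memoized with `functools.lru_cache`, a dict argument is unhashable, so the scaling parameters would need to be converted to a hashable form (for example a tuple of items) before the cached call.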

Llama3 RoPE Scaling Formula:

# From transformers:
# low_freq_factor, high_freq_factor, original_max_position_embeddings
# Adjust frequencies based on wavelength thresholds
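
The adjustment hinted at in these comments can be sketched in plain Python as follows. This follows the llama3 scheme as it is understood from transformers: short wavelengths are left unchanged, long wavelengths are divided by `factor`, and the band in between is interpolated. The function name `llama3_scaled_inv_freq` is hypothetical, and the default arguments are assumed to match the Llama-3.1-8B-Instruct config; they should be checked against the actual checkpoint. A real Llama3RotaryEmbedding would compute the same values as a tensor.

```python
import math

def llama3_scaled_inv_freq(
    head_dim=128,
    rope_theta=500000.0,
    factor=8.0,
    low_freq_factor=1.0,
    high_freq_factor=4.0,
    original_max_position_embeddings=8192,
):
    """Return the llama3-scaled inverse frequencies, one per rotary pair."""
    # Wavelength thresholds separating the three frequency bands.
    low_freq_wavelen = original_max_position_embeddings / low_freq_factor
    high_freq_wavelen = original_max_position_embeddings / high_freq_factor
    scaled = []
    for i in range(0, head_dim, 2):
        inv_freq = 1.0 / (rope_theta ** (i / head_dim))
        wavelen = 2 * math.pi / inv_freq
        if wavelen < high_freq_wavelen:
            # High-frequency band: keep the original frequency.
            scaled.append(inv_freq)
        elif wavelen > low_freq_wavelen:
            # Low-frequency band: stretch the wavelength by `factor`.
            scaled.append(inv_freq / factor)
        else:
            # Middle band: blend linearly between the scaled and unscaled values.
            smooth = (original_max_position_embeddings / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor
            )
            scaled.append((1 - smooth) * inv_freq / factor + smooth * inv_freq)
    return scaled
```

The cos/sin cache would then be built from these frequencies in the same way as in the unscaled implementation.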

Phase 3: Implement Llama Model [pending]

Files to create:

  • nanovllm/models/llama.py

Tasks:

  1. Create the LlamaAttention class (no q_norm/k_norm, no QKV bias)
  2. Create the LlamaMLP class (similar to Qwen3MLP, no bias)
  3. Create LlamaDecoderLayer
  4. Create LlamaModel and LlamaForCausalLM
  5. Add packed_modules_mapping to support weight loading
  6. Register the model with @register_model("LlamaForCausalLM")
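
To illustrate task 5, the sketch below shows how a packed_modules_mapping could route per-projection checkpoint weights into fused parameters. The mapping contents are an assumption, taken to mirror what the existing Qwen3 model uses (a fused qkv_proj and a fused gate_up_proj); `resolve_packed_weight` is a hypothetical helper, not the project's loader.

```python
# Assumed mapping: checkpoint name fragment -> (fused parameter fragment, shard id).
PACKED_MODULES_MAPPING = {
    "q_proj": ("qkv_proj", "q"),
    "k_proj": ("qkv_proj", "k"),
    "v_proj": ("qkv_proj", "v"),
    "gate_proj": ("gate_up_proj", 0),
    "up_proj": ("gate_up_proj", 1),
}

def resolve_packed_weight(weight_name, mapping=PACKED_MODULES_MAPPING):
    """Map a checkpoint weight name to (parameter name, shard id or None)."""
    for fragment, (packed, shard_id) in mapping.items():
        if fragment in weight_name:
            # The weight is one shard of a fused parameter.
            return weight_name.replace(fragment, packed), shard_id
    # Not packed: the weight loads into a parameter of the same name.
    return weight_name, None
```

For example, `model.layers.0.self_attn.q_proj.weight` resolves to the `q` shard of `model.layers.0.self_attn.qkv_proj.weight`, while `o_proj` and `down_proj` weights pass through unchanged.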

Phase 4: Modify ModelRunner for Dynamic Loading [pending]

Files to modify:

  • nanovllm/engine/model_runner.py

Tasks:

  1. Remove the hard-coded import from nanovllm.models.qwen3 import Qwen3ForCausalLM
  2. Add the import from nanovllm.models import get_model_class
  3. Replace self.model = Qwen3ForCausalLM(hf_config) with:
    model_class = get_model_class(hf_config)
    self.model = model_class(hf_config)
    

Phase 5: Register Qwen3 Model [pending]

Files to modify:

  • nanovllm/models/qwen3.py

Tasks:

  1. Add the import from nanovllm.models.registry import register_model
  2. Add the @register_model("Qwen3ForCausalLM", "Qwen2ForCausalLM") decorator

Phase 6: Test with Llama-3.1-8B-Instruct [pending]

Files:

  • tests/test_needle.py (existing, use for validation)

Tasks:

  1. Run the needle test: python tests/test_needle.py --model ~/models/Llama-3.1-8B-Instruct
  2. Verify that the model loads correctly
  3. Verify that the inference output is correct

Errors Encountered

| Error | Attempt | Resolution |
|---|---|---|
| (none yet) | | |

Success Criteria

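  • Analysis complete: the current architecture and the required changes are understood
  • Phase 1: model registry implemented
  • Phase 2: Llama3 RoPE scaling supported
  • Phase 3: Llama model implemented
  • Phase 4: ModelRunner loads models dynamically
  • Phase 5: Qwen3 model registered
  • Phase 6: Llama needle test passes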

Notes

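  • Keep existing Qwen3 functionality unchanged
  • Follow the existing code style
  • Reuse the existing layers components (Linear, RMSNorm, Embedding, etc.)
  • Add only the code that is necessary; do not over-engineer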