
Findings: Multi-Model Support Analysis

Current Architecture Analysis

Model Loading Flow

LLM(model_path)
  → LLMEngine.__init__()
    → Config.__post_init__()
      → hf_config = AutoConfig.from_pretrained(model)
    → ModelRunner.__init__()
      → model = Qwen3ForCausalLM(hf_config)  ← HARDCODED
      → load_model(model, config.model)
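
The hardcoded Qwen3ForCausalLM instantiation is the main blocker for multi-model support. One possible fix is to dispatch on hf_config.architectures[0] through a small registry; the sketch below is hypothetical (MODEL_REGISTRY and get_model_class do not exist in the codebase):

# Hypothetical registry keyed by the HF "architectures" string; not part of
# the current codebase.
from nanovllm.models.qwen3 import Qwen3ForCausalLM

MODEL_REGISTRY = {
    "Qwen3ForCausalLM": Qwen3ForCausalLM,
    # "LlamaForCausalLM": LlamaForCausalLM,  # once models/llama.py exists
}

def get_model_class(hf_config):
    arch = hf_config.architectures[0]
    if arch not in MODEL_REGISTRY:
        raise ValueError(f"Unsupported architecture: {arch}")
    return MODEL_REGISTRY[arch]

# In ModelRunner.__init__(), instead of the hardcoded class:
#   model = get_model_class(hf_config)(hf_config)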

Key Files

File                                 Purpose
nanovllm/engine/model_runner.py      Model loading and execution
nanovllm/models/qwen3.py             Qwen3 model definition
nanovllm/utils/loader.py             safetensors weight loading
nanovllm/layers/rotary_embedding.py  RoPE implementation

Llama 3.1 Config Analysis

{
  "architectures": ["LlamaForCausalLM"],
  "model_type": "llama",
  "attention_bias": false,
  "mlp_bias": false,
  "head_dim": 128,
  "hidden_size": 4096,
  "intermediate_size": 14336,
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "hidden_act": "silu",
  "rms_norm_eps": 1e-05,
  "rope_theta": 500000.0,
  "rope_scaling": {
    "factor": 8.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "max_position_embeddings": 131072,
  "tie_word_embeddings": false,
  "vocab_size": 128256
}
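
A small sketch of how the llama3 RoPE variant could be detected from the HF config that Config.__post_init__() already loads; this check does not exist in the codebase yet, and the model path is illustrative:

# Illustrative only: detect the llama3 rope_scaling variant from the HF config.
from transformers import AutoConfig

hf_config = AutoConfig.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
rope_scaling = getattr(hf_config, "rope_scaling", None)  # None for standard RoPE (e.g. Qwen3)
if rope_scaling is not None and rope_scaling.get("rope_type") == "llama3":
    # factor / low_freq_factor / high_freq_factor / original_max_position_embeddings
    # need to reach the rotary embedding in layers/rotary_embedding.py.
    print("llama3 scaling:", rope_scaling)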

Llama 3 RoPE Scaling

Llama 3 uses a dedicated RoPE scaling scheme (rope_type: "llama3"):

  • High-frequency components (short wavelengths, short-range dependencies) are kept unchanged
  • Low-frequency components (long wavelengths, long-range dependencies) are divided by factor
  • Wavelengths between the two cutoffs are smoothly interpolated
  • Parameters: factor, low_freq_factor, high_freq_factor, original_max_position_embeddings

Reference implementation (adapted from transformers):

import math

import torch


def _compute_llama3_parameters(config, device, inv_freq):
    # Adapted from transformers' llama3 rope-scaling helper; the rope_scaling
    # fields are read as plain attributes here for brevity.
    factor = config.factor
    low_freq_factor = config.low_freq_factor
    high_freq_factor = config.high_freq_factor
    old_context_len = config.original_max_position_embeddings

    low_freq_wavelen = old_context_len / low_freq_factor
    high_freq_wavelen = old_context_len / high_freq_factor

    wavelen = 2 * math.pi / inv_freq
    # Low frequencies (wavelength above the low-freq cutoff): divide by factor.
    # High frequencies (wavelength below the high-freq cutoff): keep unchanged.
    inv_freq_llama = torch.where(
        wavelen > low_freq_wavelen,
        inv_freq / factor,
        inv_freq,
    )
    # Medium frequencies: interpolate smoothly between scaled and unscaled.
    smooth_factor = (old_context_len / wavelen - low_freq_factor) / (high_freq_factor - low_freq_factor)
    smoothed_inv_freq = (1 - smooth_factor) * inv_freq / factor + smooth_factor * inv_freq
    is_medium_freq = (wavelen >= high_freq_wavelen) & (wavelen <= low_freq_wavelen)
    inv_freq_llama = torch.where(is_medium_freq, smoothed_inv_freq, inv_freq_llama)
    return inv_freq_llama
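
A quick usage check against the Llama 3.1 values shown above (head_dim=128, rope_theta=500000.0); the config object here is a stand-in namespace, not the HF config class:

# Stand-in config carrying only the rope_scaling fields the function reads.
from types import SimpleNamespace

import torch

config = SimpleNamespace(
    factor=8.0,
    low_freq_factor=1.0,
    high_freq_factor=4.0,
    original_max_position_embeddings=8192,
)
head_dim, rope_theta = 128, 500000.0
inv_freq = 1.0 / (rope_theta ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
scaled = _compute_llama3_parameters(config, device="cpu", inv_freq=inv_freq)
print(scaled.shape)  # torch.Size([64]), same shape as inv_freq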

Weight Mapping Analysis

Qwen3 packed_modules_mapping

packed_modules_mapping = {
    "q_proj": ("qkv_proj", "q"),
    "k_proj": ("qkv_proj", "k"),
    "v_proj": ("qkv_proj", "v"),
    "gate_proj": ("gate_up_proj", 0),
    "up_proj": ("gate_up_proj", 1),
}

Llama Weight Names (from safetensors)

Llama weight names are expected to follow the same scheme as Qwen3:

  • model.layers.{i}.self_attn.q_proj.weight
  • model.layers.{i}.self_attn.k_proj.weight
  • model.layers.{i}.self_attn.v_proj.weight
  • model.layers.{i}.self_attn.o_proj.weight
  • model.layers.{i}.mlp.gate_proj.weight
  • model.layers.{i}.mlp.up_proj.weight
  • model.layers.{i}.mlp.down_proj.weight
  • model.layers.{i}.input_layernorm.weight
  • model.layers.{i}.post_attention_layernorm.weight

Conclusion: Llama's packed_modules_mapping is identical to Qwen3's and can be reused; the renaming step is illustrated below.
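
For illustration, the mapping applied to a Llama checkpoint key (the helper below is a sketch of the renaming step only, not the actual loader code in utils/loader.py):

# Sketch of the rename step: checkpoint key -> fused parameter name + shard id.
packed_modules_mapping = {
    "q_proj": ("qkv_proj", "q"),
    "k_proj": ("qkv_proj", "k"),
    "v_proj": ("qkv_proj", "v"),
    "gate_proj": ("gate_up_proj", 0),
    "up_proj": ("gate_up_proj", 1),
}

def map_weight_name(name):
    for src, (dst, shard_id) in packed_modules_mapping.items():
        if src in name:
            return name.replace(src, dst), shard_id
    return name, None  # unpacked weights (o_proj, norms, embeddings) load as-is

print(map_weight_name("model.layers.0.self_attn.q_proj.weight"))
# -> ('model.layers.0.self_attn.qkv_proj.weight', 'q')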


Shared Components (Can Reuse)

Component                   File                  Notes
RMSNorm                     layers/layernorm.py   Generic
SiluAndMul                  layers/activation.py  Generic
Attention                   layers/attention.py   FlashAttention wrapper
QKVParallelLinear           layers/linear.py      Supports bias=False
RowParallelLinear           layers/linear.py      Generic
MergedColumnParallelLinear  layers/linear.py      Generic
VocabParallelEmbedding      layers/embed_head.py  Generic
ParallelLMHead              layers/embed_head.py  Generic
load_model                  utils/loader.py       Generic

Llama vs Qwen3 Implementation Diff

Attention

Feature   Qwen3Attention                 LlamaAttention
QKV bias  Configurable (attention_bias)  Always False
q_norm    Yes (RMSNorm per head)         No
k_norm    Yes (RMSNorm per head)         No
RoPE      Standard                       Llama3 scaled

MLP

Feature       Qwen3MLP  LlamaMLP
gate/up bias  False     False
down bias     False     False
hidden_act    silu      silu

Conclusion: LlamaMLP is nearly identical to Qwen3MLP and can be reused directly or simplified; see the sketch below.
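
A minimal sketch of a reused LlamaMLP, assuming the layer constructors are called the same way Qwen3MLP calls them in nanovllm/models/qwen3.py; the signatures are assumptions and should be checked against layers/linear.py:

# Hypothetical LlamaMLP built from the shared layers; constructor signatures
# are assumed to mirror Qwen3MLP, not verified here.
from torch import nn

from nanovllm.layers.activation import SiluAndMul
from nanovllm.layers.linear import MergedColumnParallelLinear, RowParallelLinear


class LlamaMLP(nn.Module):
    def __init__(self, hidden_size, intermediate_size):
        super().__init__()
        # gate_proj and up_proj fused into one column-parallel matmul.
        self.gate_up_proj = MergedColumnParallelLinear(hidden_size, [intermediate_size] * 2, bias=False)
        self.down_proj = RowParallelLinear(intermediate_size, hidden_size, bias=False)
        self.act_fn = SiluAndMul()  # silu(gate) * up

    def forward(self, x):
        return self.down_proj(self.act_fn(self.gate_up_proj(x)))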


Risk Assessment

Risk                       Impact                      Mitigation
Incorrect RoPE scaling     High - wrong model output   Follow the transformers reference; add unit tests
Incorrect weight mapping   High - model fails to load  Check safetensors key names
Registry circular imports  Medium - startup failure    Lazy imports (see sketch below)
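
The lazy-import mitigation in the last row could look like the sketch below: the registry stores module paths as strings and resolves them with importlib only on first use, so importing the registry never imports the model modules. Names are hypothetical, and nanovllm.models.llama does not exist yet.

import importlib

# Registry of module-path strings; the model module is imported only when the
# architecture is actually requested, avoiding import cycles at startup.
_LAZY_MODEL_REGISTRY = {
    "Qwen3ForCausalLM": ("nanovllm.models.qwen3", "Qwen3ForCausalLM"),
    "LlamaForCausalLM": ("nanovllm.models.llama", "LlamaForCausalLM"),  # planned module
}

def resolve_model_class(architecture):
    module_name, class_name = _LAZY_MODEL_REGISTRY[architecture]
    module = importlib.import_module(module_name)  # deferred until first use
    return getattr(module, class_name)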