
Progress Log: Multi-Model Support

Session: 2026-01-10

Initial Analysis Complete

Time: Session start

Actions:

  1. Read nanovllm/engine/model_runner.py - confirmed where the model is hardcoded (line 35)
  2. Read nanovllm/models/qwen3.py - understood the Qwen3 model structure
  3. Read nanovllm/utils/loader.py - understood the weight-loading mechanism
  4. Read nanovllm/layers/rotary_embedding.py - found the RoPE scaling limitation
  5. Read /home/zijie/models/Llama-3.1-8B-Instruct/config.json - understood the Llama configuration

Key Findings:

  • Model loading is hardcoded to Qwen3 at model_runner.py:35
  • RoPE does not currently support scaling (assert rope_scaling is None)
  • Llama 3.1 requires "llama3"-type RoPE scaling
  • Llama has no q_norm/k_norm and no attention bias
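
The "llama3" RoPE scaling noted above can be sketched roughly as follows. This is a minimal pure-Python illustration of the frequency rescaling as it is commonly described for Llama 3.1, not the code in rotary_embedding.py. The function name `llama3_scaled_inv_freq` is hypothetical, and the default values (head_dim 128, rope_theta 500000, factor 8, low_freq_factor 1, high_freq_factor 4, original context 8192) are assumptions based on typical Llama 3.1 configs; in practice they should be read from the model's config.json.

```python
import math


def llama3_scaled_inv_freq(
    head_dim=128,                 # assumed; read from config in practice
    base=500000.0,                # assumed rope_theta
    factor=8.0,                   # assumed rope_scaling["factor"]
    low_freq_factor=1.0,          # assumed
    high_freq_factor=4.0,         # assumed
    original_max_position=8192,   # assumed original context length
):
    """Return (unscaled, scaled) inverse frequencies for llama3-style RoPE."""
    inv_freq = [1.0 / (base ** (i / head_dim)) for i in range(0, head_dim, 2)]

    low_freq_wavelen = original_max_position / low_freq_factor
    high_freq_wavelen = original_max_position / high_freq_factor

    scaled = []
    for f in inv_freq:
        wavelen = 2 * math.pi / f
        if wavelen < high_freq_wavelen:
            # High-frequency components are left unchanged.
            scaled.append(f)
        elif wavelen > low_freq_wavelen:
            # Low-frequency components are divided by the scaling factor.
            scaled.append(f / factor)
        else:
            # The band in between is interpolated smoothly between the two.
            smooth = (original_max_position / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor
            )
            scaled.append((1 - smooth) * f / factor + smooth * f)
    return inv_freq, scaled


inv_freq, scaled = llama3_scaled_inv_freq()
```

Only the inverse frequencies change; the cos/sin cache would then be built from `scaled` in the same way as for unscaled RoPE.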

Created:

  • task_plan.md - 6-phase implementation plan
  • findings.md - technical analysis and findings

Phase Status

| Phase | Status | Notes |
|---|---|---|
| 1. Model Registry | COMPLETED | registry.py, __init__.py |
| 2. Llama3 RoPE | COMPLETED | rotary_embedding.py |
| 3. Llama Model | COMPLETED | llama.py |
| 4. ModelRunner | COMPLETED | Dynamic loading |
| 5. Qwen3 Register | COMPLETED | @register_model decorator |
| 6. Testing | COMPLETED | Both Llama & Qwen3 pass |
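
A decorator-based registry of the kind used in phases 1, 4, and 5 might look like the sketch below. Only the name `register_model` comes from this log; `_MODEL_REGISTRY`, `get_model_class`, the placeholder classes, and the lookup by the `architectures` field of the Hugging Face config are assumptions for illustration, and the actual registry.py may differ.

```python
from types import SimpleNamespace

# Hypothetical module-level table mapping architecture name -> model class.
_MODEL_REGISTRY = {}


def register_model(*architectures):
    """Class decorator that registers a model under one or more names."""
    def decorator(cls):
        for arch in architectures:
            _MODEL_REGISTRY[arch] = cls
        return cls
    return decorator


def get_model_class(hf_config):
    """Return the first registered class matching hf_config.architectures."""
    for arch in getattr(hf_config, "architectures", None) or []:
        if arch in _MODEL_REGISTRY:
            return _MODEL_REGISTRY[arch]
    raise ValueError(
        f"Unsupported architectures: {getattr(hf_config, 'architectures', None)}"
    )


# Placeholder classes standing in for the real implementations.
@register_model("Qwen3ForCausalLM")
class Qwen3ForCausalLM:
    pass


@register_model("LlamaForCausalLM")
class LlamaForCausalLM:
    pass


# Stand-in for a loaded config.json; a model runner could do the same lookup
# instead of hardcoding a single model class.
config = SimpleNamespace(architectures=["LlamaForCausalLM"])
model_cls = get_model_class(config)
```

With this shape, supporting another model only requires defining and decorating the new class; the runner's lookup code stays unchanged.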

Test Results

Llama 3.1-8B-Instruct (32K needle, GPU 0, offload)

Input: 32768 tokens
Expected: 7492
Output: 7492
Status: PASSED
Prefill: 1644 tok/s

Qwen3-4B (8K needle, GPU 1, offload) - Regression Test

Input: 8192 tokens
Expected: 7492
Output: 7492
Status: PASSED
Prefill: 3295 tok/s
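
For a rough sense of scale, the prefill throughput figures above can be converted into approximate prefill times. This is a small sketch that assumes the reported tok/s is simply input tokens divided by prefill wall time, which the log does not state; the helper name `prefill_seconds` is hypothetical.

```python
def prefill_seconds(input_tokens, tokens_per_second):
    """Approximate prefill time, assuming throughput = tokens / seconds."""
    return input_tokens / tokens_per_second


llama_s = prefill_seconds(32768, 1644)  # Llama 3.1-8B-Instruct, 32K needle
qwen_s = prefill_seconds(8192, 3295)    # Qwen3-4B, 8K needle

print(f"Llama 3.1-8B-Instruct: about {llama_s:.1f} s")  # about 19.9 s
print(f"Qwen3-4B: about {qwen_s:.1f} s")                # about 2.5 s
```

The two runs use different models and context lengths, so these numbers are not a like-for-like comparison.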

Files Modified This Session

| File | Action | Description |
|---|---|---|
| nanovllm/models/registry.py | created | Model registry with @register_model decorator |
| nanovllm/models/__init__.py | created | Export registry functions, import models |
| nanovllm/models/llama.py | created | Llama model implementation |
| nanovllm/models/qwen3.py | modified | Added @register_model decorator |
| nanovllm/layers/rotary_embedding.py | modified | Added Llama3 RoPE scaling |
| nanovllm/engine/model_runner.py | modified | Dynamic model loading via registry |
| .claude/rules/gpu-testing.md | created | GPU testing rules |
| task_plan.md | created | Implementation plan |
| findings.md | created | Technical findings |
| progress.md | created | Progress tracking |