Progress Log: Multi-Model Support
Session: 2026-01-10
Initial Analysis Complete
Time: Session start
Actions:
- Read nanovllm/engine/model_runner.py: confirmed the hard-coded location (line 35)
- Read nanovllm/models/qwen3.py: understood the Qwen3 model structure
- Read nanovllm/utils/loader.py: understood the weight-loading mechanism
- Read nanovllm/layers/rotary_embedding.py: found the RoPE scaling limitation
- Read /home/zijie/models/Llama-3.1-8B-Instruct/config.json: understood the Llama configuration
Key Findings:
- Model loading is hard-coded to Qwen3 at model_runner.py:35
- RoPE does not currently support scaling (assert rope_scaling is None)
- Llama 3.1 requires RoPE scaling of type "llama3"
- Llama has no q_norm/k_norm and no attention bias
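For reference, the "llama3" RoPE scaling rule mentioned in the findings can be sketched in plain Python as below. This is a minimal illustration, not the code in rotary_embedding.py: the function name `llama3_scaled_inv_freq` is hypothetical, and the default parameter values (base, factor, low/high frequency factors, original context length) are assumptions that should be read from the model's config.json rather than trusted from here.

```python
import math


def llama3_scaled_inv_freq(
    head_dim,
    base=500000.0,                 # assumed rope_theta; read from config.json
    factor=8.0,                    # assumed rope_scaling["factor"]
    low_freq_factor=1.0,           # assumed rope_scaling["low_freq_factor"]
    high_freq_factor=4.0,          # assumed rope_scaling["high_freq_factor"]
    original_max_position=8192,    # assumed original context length
):
    """Return RoPE inverse frequencies after llama3-style scaling."""
    # Standard RoPE inverse frequencies, one per pair of head dimensions.
    inv_freq = [1.0 / (base ** (i / head_dim)) for i in range(0, head_dim, 2)]

    low_freq_wavelen = original_max_position / low_freq_factor
    high_freq_wavelen = original_max_position / high_freq_factor

    scaled = []
    for f in inv_freq:
        wavelen = 2 * math.pi / f
        if wavelen < high_freq_wavelen:
            # High-frequency components are left unchanged.
            scaled.append(f)
        elif wavelen > low_freq_wavelen:
            # Low-frequency components are divided by the scaling factor.
            scaled.append(f / factor)
        else:
            # Components in between are interpolated smoothly.
            smooth = (original_max_position / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor
            )
            scaled.append((1 - smooth) * f / factor + smooth * f)
    return scaled
```

Calling `llama3_scaled_inv_freq(128)` returns 64 frequencies. The highest-frequency entries are unchanged and the lowest-frequency entries are divided by the factor.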
Created:
- task_plan.md: 6-phase implementation plan
- findings.md: technical analysis and findings
Phase Status
| Phase | Status | Notes |
|---|---|---|
| 1. Model Registry | COMPLETED | registry.py, __init__.py |
| 2. Llama3 RoPE | COMPLETED | rotary_embedding.py |
| 3. Llama Model | COMPLETED | llama.py |
| 4. ModelRunner | COMPLETED | Dynamic loading |
| 5. Qwen3 Register | COMPLETED | @register_model decorator |
| 6. Testing | COMPLETED | Both Llama & Qwen3 pass |
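A decorator-based registry of the kind listed in phases 1, 4, and 5 might look roughly like the sketch below. Only the `@register_model` name comes from this log. The `_MODEL_REGISTRY` dict, the `get_model_class` helper, the architecture strings, and the empty placeholder classes are hypothetical and stand in for whatever registry.py actually defines.

```python
# Hypothetical registry mapping architecture names to model classes.
_MODEL_REGISTRY = {}


def register_model(*architectures):
    """Class decorator that registers a model under one or more names."""
    def decorator(cls):
        for name in architectures:
            if name in _MODEL_REGISTRY:
                raise ValueError(f"Model already registered: {name}")
            _MODEL_REGISTRY[name] = cls
        return cls
    return decorator


def get_model_class(architectures):
    """Return the first registered class matching the given names.

    `architectures` is assumed to be a list of names such as the
    "architectures" field of a Hugging Face config.json.
    """
    for name in architectures:
        if name in _MODEL_REGISTRY:
            return _MODEL_REGISTRY[name]
    raise ValueError(f"Unsupported architectures: {architectures}")


# Placeholder classes; the real ones live in qwen3.py and llama.py.
@register_model("Qwen3ForCausalLM")
class Qwen3ForCausalLM:
    pass


@register_model("LlamaForCausalLM")
class LlamaForCausalLM:
    pass
```

With a lookup like this, the model runner can resolve the class from the loaded config instead of hard-coding Qwen3, for example `get_model_class(["LlamaForCausalLM"])`.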
Test Results
Llama 3.1-8B-Instruct (32K needle, GPU 0, offload)
Input: 32768 tokens
Expected: 7492
Output: 7492
Status: PASSED
Prefill: 1644 tok/s
Qwen3-4B (8K needle, GPU 1, offload) - Regression Test
Input: 8192 tokens
Expected: 7492
Output: 7492
Status: PASSED
Prefill: 3295 tok/s
Files Modified This Session
| File | Action | Description |
|---|---|---|
| nanovllm/models/registry.py | created | Model registry with @register_model decorator |
| nanovllm/models/__init__.py | created | Export registry functions, import models |
| nanovllm/models/llama.py | created | Llama model implementation |
| nanovllm/models/qwen3.py | modified | Added @register_model decorator |
| nanovllm/layers/rotary_embedding.py | modified | Added Llama3 RoPE scaling |
| nanovllm/engine/model_runner.py | modified | Dynamic model loading via registry |
| .claude/rules/gpu-testing.md | created | GPU testing rules |
| task_plan.md | created | Implementation plan |
| findings.md | created | Technical findings |
| progress.md | created | Progress tracking |