# Progress Log: Multi-Model Support
## Session: 2026-01-10
### Initial Analysis Complete
**Time**: Session start
**Actions:**
1. Read `nanovllm/engine/model_runner.py` - confirmed the hard-coded location (line 35)
2. Read `nanovllm/models/qwen3.py` - understood the Qwen3 model structure
3. Read `nanovllm/utils/loader.py` - understood the weight-loading mechanism
4. Read `nanovllm/layers/rotary_embedding.py` - found the RoPE scaling limitation
5. Read `/home/zijie/models/Llama-3.1-8B-Instruct/config.json` - understood the Llama configuration
**Key Findings:**
- Model loading is hard-coded to Qwen3 at `model_runner.py:35`
- RoPE does not currently support scaling (`assert rope_scaling is None`)
- Llama 3.1 requires RoPE scaling of type "llama3"
- Llama has no q_norm/k_norm and no attention bias
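
For reference, the "llama3" RoPE scaling rule can be sketched in plain Python as below. This is a minimal sketch of the publicly documented rule, not the code in `rotary_embedding.py`: the function name and parameter names are hypothetical, and the values in the usage line are assumptions for illustration that should be read from the model's `config.json` in practice.

```python
import math


def llama3_scaled_inv_freq(head_dim, base, factor, low_freq_factor,
                           high_freq_factor, original_max_pos):
    """Return per-pair inverse frequencies after llama3-style scaling.

    Hypothetical helper for illustration only; names are not taken from the repo.
    """
    # Standard RoPE inverse frequencies: one per pair of head dimensions.
    inv_freq = [1.0 / (base ** (i / head_dim)) for i in range(0, head_dim, 2)]

    low_freq_wavelen = original_max_pos / low_freq_factor
    high_freq_wavelen = original_max_pos / high_freq_factor

    scaled = []
    for f in inv_freq:
        wavelen = 2 * math.pi / f
        if wavelen < high_freq_wavelen:
            # High-frequency band: left unchanged.
            scaled.append(f)
        elif wavelen > low_freq_wavelen:
            # Low-frequency band: divided by the scaling factor.
            scaled.append(f / factor)
        else:
            # Medium band: interpolate smoothly between scaled and unscaled.
            smooth = (original_max_pos / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor
            )
            scaled.append((1 - smooth) * f / factor + smooth * f)
    return scaled


# Assumed example values (check the model's config.json before relying on them).
freqs = llama3_scaled_inv_freq(
    head_dim=128, base=500000.0, factor=8.0,
    low_freq_factor=1.0, high_freq_factor=4.0, original_max_pos=8192,
)
```

Only the inverse frequencies change; the cos/sin cache can then be built from them exactly as in the unscaled case.
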
**Created:**
- `task_plan.md` - 6-phase implementation plan
- `findings.md` - technical analysis and findings
---
### Phase Status
| Phase | Status | Notes |
|-------|--------|-------|
| 1. Model Registry | **COMPLETED** | `registry.py`, `__init__.py` |
| 2. Llama3 RoPE | **COMPLETED** | `rotary_embedding.py` |
| 3. Llama Model | **COMPLETED** | `llama.py` |
| 4. ModelRunner | **COMPLETED** | Dynamic loading |
| 5. Qwen3 Register | **COMPLETED** | `@register_model` decorator |
| 6. Testing | **COMPLETED** | Both Llama & Qwen3 pass |
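
Phases 1, 4 and 5 together amount to a decorator-based registry plus a lookup at load time. The sketch below shows that pattern under stated assumptions: only the `@register_model` name comes from this log, while its signature, the `get_model_class` helper, the registry dict, and the use of architecture-name strings as keys are hypothetical and may differ from `registry.py`.

```python
# Hypothetical registry: maps an architecture name to a model class.
_MODEL_REGISTRY = {}


def register_model(*architectures):
    """Class decorator that registers a model under one or more names."""
    def decorator(cls):
        for name in architectures:
            _MODEL_REGISTRY[name] = cls
        return cls
    return decorator


def get_model_class(architectures):
    """Return the first registered class matching the given names."""
    for name in architectures:
        if name in _MODEL_REGISTRY:
            return _MODEL_REGISTRY[name]
    raise ValueError(f"Unsupported architectures: {list(architectures)}")


# Placeholder classes standing in for the real model implementations.
@register_model("Qwen3ForCausalLM")
class Qwen3ForCausalLM:
    pass


@register_model("LlamaForCausalLM")
class LlamaForCausalLM:
    pass


# A runner could then resolve the class from the config instead of
# hard-coding Qwen3 (the architecture list here is an assumed example).
model_cls = get_model_class(["LlamaForCausalLM"])
```

With this shape, supporting another model means adding one decorated class; the runner itself does not change.
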
---
## Test Results
### Llama 3.1-8B-Instruct (32K needle, GPU 0, offload)
```
Input: 32768 tokens
Expected: 7492
Output: 7492
Status: PASSED
Prefill: 1644 tok/s
```
### Qwen3-4B (8K needle, GPU 1, offload) - Regression Test
```
Input: 8192 tokens
Expected: 7492
Output: 7492
Status: PASSED
Prefill: 3295 tok/s
```
---
## Files Modified This Session
| File | Action | Description |
|------|--------|-------------|
| `nanovllm/models/registry.py` | created | Model registry with `@register_model` decorator |
| `nanovllm/models/__init__.py` | created | Export registry functions, import models |
| `nanovllm/models/llama.py` | created | Llama model implementation |
| `nanovllm/models/qwen3.py` | modified | Added `@register_model` decorator |
| `nanovllm/layers/rotary_embedding.py` | modified | Added Llama3 RoPE scaling |
| `nanovllm/engine/model_runner.py` | modified | Dynamic model loading via registry |
| `.claude/rules/gpu-testing.md` | created | GPU testing rules |
| `task_plan.md` | created | Implementation plan |
| `findings.md` | created | Technical findings |
| `progress.md` | created | Progress tracking |