📚 docs: add new model integration guide

Summarizes lessons learned from GLM-4 integration:
- Config field mapping (multi_query_group_num, kv_channels, etc.)
- RoPE variants (interleaved vs half, partial vs full rotation)
- EOS token handling for multi-EOS models
- Weight name conversion patterns
- Verification checklist

Also updates CLAUDE.md to reflect GLM-4 support.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
Zijie Tian
2026-01-28 13:36:24 +08:00
parent 29e102720b
commit a239bfb40d
2 changed files with 325 additions and 1 deletions

View File

@@ -4,7 +4,7 @@ This file provides guidance to Claude Code when working with this repository.
## Overview
Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline LLM inference. Supports Qwen3 models with CPU offload for long-context inference.
Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline LLM inference. Supports Qwen3, Llama-3, and GLM-4 models with CPU offload for long-context inference.
## Documentation Index
@@ -35,6 +35,7 @@ Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline L
| [`docs/memory_communication_benchmark.md`](docs/memory_communication_benchmark.md) | 📊 通信量测试: Full vs XAttention 通信量对比 (32K/64K)、阶段分离统计 |
| [`docs/estimate_block_size_performance.md`](docs/estimate_block_size_performance.md) | 🔥 PERF: estimate 阶段 block_size 性能分析softmax_fuse_block_sum 最优点 (512-1024),当前 4096 慢 15x |
| [`docs/long_context_models_1m.md`](docs/long_context_models_1m.md) | 📚 REF: 1M+ 上下文长度模型列表 (Qwen/GLM/InternLM/Llama/VL)≤10B 推荐模型 |
| [`docs/new_model_integration_guide.md`](docs/new_model_integration_guide.md) | 🔧 GUIDE: 新模型整合指南 - 配置映射、RoPE变体、EOS处理、权重转换、验证清单 |
## Rules Index