# Findings: Torch Distributed Port Conflict

## Problem Analysis

### Issue Summary

Creating multiple LLM instances raises a port conflict (EADDRINUSE), so the second instance fails to start.

### Root Cause Deep Dive

#### 1. Where the Resource Is Bound

```python
# nanovllm/engine/model_runner.py:30-32
import os
port = os.environ.get("NANOVLLM_DIST_PORT", "2333")
dist.init_process_group("nccl", f"tcp://localhost:{port}", world_size=self.world_size, rank=rank)
```

- The default port is **2333**, configurable via the `NANOVLLM_DIST_PORT` environment variable
- `init_process_group()` binds a TCP port for inter-process communication
- The port stays bound until `destroy_process_group()` is called

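Because the port is read from the environment at construction time (model_runner.py:30 above), a second instance can be pointed at a different port before it is built. This is a hypothetical workaround, not the eventual fix; the port number here is arbitrary:

```python
import os

# Point the NEXT LLM instance at a different port before constructing it,
# so it does not collide with a leaked binding on the default port 2333.
os.environ["NANOVLLM_DIST_PORT"] = "2334"  # arbitrary unused port
```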
#### 2. Flawed Cleanup Mechanism

```python
# nanovllm/engine/llm_engine.py:37
atexit.register(self.exit)

# nanovllm/engine/llm_engine.py:39-43
def exit(self):
    self.model_runner.call("exit")
    del self.model_runner
    for p in self.ps:
        p.join()

# nanovllm/engine/model_runner.py:66-78
def exit(self):
    # ... cleanup code ...
    dist.destroy_process_group()
```

**Key problem**: `atexit` handlers fire only when the **Python interpreter exits**, not when the object is deleted!
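The pitfall can be reproduced in a few lines. `Engine` here is a hypothetical stand-in, not the real nanovllm class:

```python
import atexit

cleanup_ran = []

class Engine:
    """Hypothetical minimal model of the engine, not the real LLMEngine."""
    def __init__(self):
        atexit.register(self.exit)  # fires at interpreter exit only

    def exit(self):
        cleanup_ran.append(True)

engine = Engine()
del engine          # drops our reference, but the atexit handler has NOT run
                    # (atexit even keeps the object alive via the bound method)
print(cleanup_ran)  # → []
```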
#### 3. Failure Timeline

```
1. Create LLM #1
   ├── init_process_group() binds port 2333 ✓
   └── atexit.register(self.exit) registered

2. LLM #1 goes out of scope or is del'd
   ├── Python GC reclaims the object's memory
   ├── atexit handler does not fire (the process has not exited)
   ├── worker processes are still running
   └── port 2333 is still occupied ❌

3. Create LLM #2
   ├── init_process_group() tries to bind port 2333
   └── EADDRINUSE error ❌

4. Program exits (only now does atexit run)
   └── Too late: the crash already happened
```

---

## Solution Analysis

### Option Comparison

| Option | Reliability | Backward Compatible | Effort | Recommendation |
|--------|-------------|---------------------|--------|----------------|
| `close()` method | Highest | Yes | Low | ★★★★★ |
| `__del__` method | Medium | Yes | Low | ★★★☆☆ |
| Port check + retry | Medium | Yes | Low | ★★★☆☆ |
| Context manager | Highest | Requires caller changes | Low | ★★★★☆ |
| Dynamic port | Low | Yes | Low | ★★☆☆☆ |
### Why a Three-Layer Defense

1. **Layer 1: `close()`** - explicit user control, the most reliable
2. **Layer 2: `__del__`** - automatic cleanup that covers most scenarios
3. **Layer 3: port check** - last line of defense, provides a clear error message
### Limitations of `__del__`

Python does not guarantee that `__del__` is called:

- It may not fire when the object is part of a reference cycle
- At interpreter shutdown, modules it depends on may no longer be accessible
- Critical resource cleanup should never rely on `__del__`

It is still worth having as an **extra layer of defense**, because:

- It is called in the majority of cases
- It is better than nothing
- It does not interfere with the other cleanup mechanisms

---

## Code Structure Analysis

### LLMEngine Lifecycle

```
__init__()
├── spawn worker processes (self.ps)
├── create ModelRunner (self.model_runner)
├── register atexit handler
└── set up scheduler, tokenizer

close() [new]
├── check _closed flag (idempotent)
├── unregister atexit handler
├── call model_runner.exit()
├── join worker processes
└── set _closed = True

__del__() [new]
└── call close() (exceptions suppressed)

__enter__/__exit__() [new]
└── context manager support
```
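The lifecycle above can be sketched as a minimal pattern. This is a simplified illustration, not the actual nanovllm implementation: the worker/ModelRunner teardown is elided, and only the `_closed` flag, atexit handling, and the three entry points are shown:

```python
import atexit

class LLMEngine:
    """Sketch of the close()/__del__/context-manager pattern described above."""

    def __init__(self):
        self._closed = False
        # keep a reference to the exact handler so it can be unregistered later
        self._atexit_handler = self.close
        atexit.register(self._atexit_handler)

    def close(self):
        if self._closed:                    # idempotent: safe to call twice
            return
        atexit.unregister(self._atexit_handler)
        # ... call model_runner.exit(), join worker processes ...
        self._closed = True

    def __del__(self):
        try:
            self.close()
        except Exception:
            pass                            # never raise from a destructor

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False                        # do not suppress exceptions

engine = LLMEngine()
engine.close()
engine.close()          # second call is a no-op

with LLMEngine() as e:
    pass                # close() runs when the with block exits
```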
### ModelRunner Resources

```
__init__()
├── torch.distributed init (binds the port)
├── model loading
├── KV cache allocation
├── CUDA graph capture (optional)
└── SharedMemory creation (multi-GPU)

exit()
├── SharedMemory cleanup
├── CUDA graph cleanup
└── dist.destroy_process_group()
```

---

## Risk Assessment

| Risk | Impact | Mitigation |
|------|--------|------------|
| `__del__` never called | Medium (port leak) | Layer 3 port check gives a clear error |
| `close()` called twice | Low | `_closed` flag guarantees idempotency |
| atexit double invocation | Low | unregister mechanism prevents it |
| Stray child processes | High | `join()` ensures workers exit |
| CUDA resource leak | Medium | `ModelRunner.exit()` cleans up |

---

## Implementation Notes

### atexit.unregister Compatibility

- Supported on Python 3.7+
- `unregister()` must be passed a handler that compares equal to the one that was registered
- Store the handler as `self._atexit_handler` instead of passing `self.exit` each time, so it can be unregistered reliably
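The subtlety with bound methods can be seen directly (a small illustration, independent of nanovllm): every attribute access builds a fresh bound-method object, so identity is not preserved, but equality is:

```python
class Engine:
    def exit(self):
        pass

e = Engine()
print(e.exit is e.exit)   # False: a new bound method is created on each access
print(e.exit == e.exit)   # True: same instance and same underlying function
```

`atexit.unregister()` compares handlers with `==`, so a plain bound method would still match; storing the handler once in `self._atexit_handler` additionally keeps the code robust if the handler is ever replaced by a lambda or `functools.partial`, which would not compare equal across separate constructions.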
### Port Availability Check

```python
import socket

def _check_port_available(port: int, host: str = "localhost") -> bool:
    """Probe the port with socket connect_ex to see whether it is in use."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(1)
            result = s.connect_ex((host, port))
            return result != 0  # 0 = connected = port in use
    except Exception:
        return True  # on error, assume the port is available
```

**Note**: this check has a TOCTOU (time-of-check to time-of-use) race condition, but it is good enough for our use case.
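An alternative that sidesteps the fixed default entirely is to let the OS pick the port. The `_find_free_port()` helper mentioned in the commit summary likely follows the standard bind-to-port-0 idiom; a minimal sketch (the actual implementation in model_runner.py may differ):

```python
import socket

def _find_free_port() -> int:
    """Bind to port 0 so the kernel assigns an unused ephemeral port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("localhost", 0))      # port 0: kernel chooses a free port
        return s.getsockname()[1]

port = _find_free_port()
```

The socket is closed before the caller rebinds the port, so a narrow TOCTOU window remains here too, but collisions become far less likely than with a single hard-coded default.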