- Auto port allocation with `_find_free_port()` in `model_runner.py`
- Resource management refactor with `close()` + context manager in `llm_engine.py`
- Add `tests/test_port_conflict.py` and `tests/run_parallel_niah.sh`
- Remove `docs/torch_distributed_port_issue.md` (issue fixed)
- Ignore `tests/data/` directory

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
# Progress Log: Fix Torch Distributed Port Conflict
## Status: COMPLETED & CLEANED UP

## Session: 2026-01-12

### Task Overview
Fix the EADDRINUSE port conflict that occurs when multiple LLM instances are created sequentially in the same Python process, and support launching multiple independent processes concurrently in a multi-GPU environment.
---
### Phase Status

| Phase | Description | Status |
|-------|-------------|--------|
| Phase 1 | ModelRunner dynamic port allocation | COMPLETED |
| Phase 2 | LLMEngine `close()` and context manager | COMPLETED |
| Phase 3 | Test verification (GPU 4,5) | COMPLETED |
| Phase 4 | Documentation update | COMPLETED |
---
### Implementation Summary

#### Phase 1: Dynamic Port Allocation
**File**: `nanovllm/engine/model_runner.py`
- Added `_find_free_port()` function using socket binding
- Modified port selection logic: use env var if set, otherwise auto-assign
- Added logging for auto-assigned ports
#### Phase 2: Resource Cleanup Enhancement
**File**: `nanovllm/engine/llm_engine.py`
- Added `_closed` flag for idempotent cleanup
- Added `close()` method for explicit resource release
- Added `__del__()` for GC fallback
- Added `__enter__()` and `__exit__()` for context manager support
- Modified atexit registration to use `_atexit_handler`
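A minimal sketch of this cleanup protocol, assuming only what the list above states (the flag, method, and hook names); the actual resource-release body is elided:

```python
import atexit


class LLMEngine:
    """Sketch of the idempotent-cleanup pattern described above."""

    def __init__(self):
        self._closed = False
        atexit.register(self._atexit_handler)

    def close(self):
        """Explicit resource release; safe to call more than once."""
        if self._closed:
            return
        self._closed = True
        # ... tear down workers / distributed process group here ...

    def exit(self):
        """Backward-compatible alias for close()."""
        self.close()

    def _atexit_handler(self):
        self.close()

    def __del__(self):
        self.close()  # GC fallback; the _closed flag prevents double cleanup

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.close()
        return False  # do not swallow exceptions from the with-block
```

Note one subtlety of this sketch: registering a bound method with `atexit` keeps a reference to the instance alive until interpreter exit, so `__del__` mostly serves as a fallback for unregistered paths; the real code may handle this differently.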
#### Phase 3: Testing (GPU 4,5)
**File**: `tests/test_port_conflict.py`
- Created comprehensive test script

**Test Results**:

| Test | Status | Notes |
|------|--------|-------|
| Sequential creation (3 instances) | PASSED | Ports: 50405, 47835, 53011 |
| Context manager | PASSED | Auto-cleanup works |
| Parallel processes (GPU 4,5) | PASSED | Ports: 34631, 56097 |
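The sequential-creation and context-manager checks can be illustrated with a self-contained stand-in. The `FakeEngine` below is purely illustrative; the real `tests/test_port_conflict.py` drives actual LLM instances on GPUs:

```python
import socket


class FakeEngine:
    """Illustrative stand-in for LLM: binds a fresh auto-assigned port on creation."""

    def __init__(self):
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self._sock.bind(("127.0.0.1", 0))  # OS picks a free port each time
        self.port = self._sock.getsockname()[1]

    def close(self):
        self._sock.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.close()
        return False


# Sequential creation: each instance gets its own port, so no EADDRINUSE
# even while the previous port may still sit in TIME_WAIT.
ports = []
for _ in range(3):
    with FakeEngine() as eng:
        ports.append(eng.port)
print(ports)
```

The ports printed here play the role of the auto-assigned values recorded in the table above; note the test should not assert the ports are pairwise distinct, since the OS may legitimately reuse a port once it is released.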
#### Phase 4: Documentation
**File**: `docs/torch_distributed_port_issue.md`
- Updated status to RESOLVED
- Documented solution details
- Added usage examples
- Doc later removed during cleanup once the fix was confirmed (see Files Modified)
---
### Files Modified

| File | Action | Description |
|------|--------|-------------|
| `nanovllm/engine/model_runner.py` | Modified | Added `_find_free_port()`, dynamic port logic |
| `nanovllm/engine/llm_engine.py` | Modified | Added `close()`, `__del__`, context manager |
| `tests/test_port_conflict.py` | Created | Test script for port conflict fix |
| `docs/torch_distributed_port_issue.md` | Deleted | Issue resolved, doc removed |
| `CLAUDE.md` | Modified | Removed port conflict warnings, updated doc index |
---
### Key Features After Fix

1. **Multi-GPU Parallel Testing**
   ```bash
   CUDA_VISIBLE_DEVICES=0 python test1.py &
   CUDA_VISIBLE_DEVICES=1 python test2.py &
   # Both run with different auto-assigned ports
   ```

2. **Sequential LLM Creation**
   ```python
   for i in range(3):
       with LLM(model_path) as llm:
           outputs = llm.generate(prompts, params)
       # Automatically cleaned up
   ```

3. **Backward Compatible**
   - `NANOVLLM_DIST_PORT` env var still works
   - `llm.exit()` still works (alias for `close()`)