Commit summary:

- Auto port allocation with `_find_free_port()` in `model_runner.py`
- Resource management refactor with `close()` + context manager in `llm_engine.py`
- Add `tests/test_port_conflict.py` and `tests/run_parallel_niah.sh`
- Remove `docs/torch_distributed_port_issue.md` (issue fixed)
- Ignore `tests/data/` directory

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
# Progress Log: Fix Torch Distributed Port Conflict

Status: COMPLETED & CLEANED UP
Session: 2026-01-12

## Task Overview

Fix the EADDRINUSE port conflict that occurs when multiple LLM instances are created sequentially within the same Python process, and support launching multiple independent processes concurrently in a multi-GPU environment.
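The failure mode can be reproduced with plain sockets, independent of torch: binding to a fixed port that another socket already holds raises EADDRINUSE, which is exactly what happens when a second instance reuses the first instance's hard-coded distributed port.

```python
import errno
import socket

# Hold a port that the OS picked for us.
holder = socket.socket()
holder.bind(("127.0.0.1", 0))
port = holder.getsockname()[1]

# A second bind to the same fixed port fails -- this is the conflict
# that sequentially created LLM instances used to hit.
conflict = None
dup = socket.socket()
try:
    dup.bind(("127.0.0.1", port))
except OSError as e:
    conflict = e.errno
finally:
    dup.close()

print(conflict == errno.EADDRINUSE)  # True: the fixed port is taken
holder.close()
```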
## Phase Status

| Phase | Description | Status |
|---|---|---|
| Phase 1 | Dynamic port allocation in ModelRunner | COMPLETED |
| Phase 2 | LLMEngine close() and context manager | COMPLETED |
| Phase 3 | Test validation (GPUs 4, 5) | COMPLETED |
| Phase 4 | Documentation update | COMPLETED |
## Implementation Summary

### Phase 1: Dynamic Port Allocation

File: `nanovllm/engine/model_runner.py`

- Added a `_find_free_port()` function using socket binding
- Modified the port selection logic: use the env var if set, otherwise auto-assign
- Added logging for auto-assigned ports
### Phase 2: Resource Cleanup Enhancement

File: `nanovllm/engine/llm_engine.py`

- Added a `_closed` flag for idempotent cleanup
- Added a `close()` method for explicit resource release
- Added `__del__()` as a GC fallback
- Added `__enter__()` and `__exit__()` for context manager support
- Modified the atexit registration to use `_atexit_handler`
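The pieces above fit together as the standard idempotent-cleanup pattern. A self-contained sketch (the `Engine` class and the elided resource-release body are illustrative stand-ins, not the real `LLMEngine`):

```python
import atexit

class Engine:  # hypothetical stand-in for LLMEngine
    def __init__(self):
        self._closed = False
        atexit.register(self._atexit_handler)  # fallback if close() is never called

    def close(self):
        if self._closed:          # idempotent: repeated calls are no-ops
            return
        self._closed = True
        # ... release process group, workers, GPU memory (omitted) ...

    def _atexit_handler(self):
        self.close()

    def __del__(self):            # GC fallback
        self.close()

    def __enter__(self):          # context manager support
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False              # do not swallow exceptions
```

With the `_closed` flag, it is safe for `close()`, `__del__`, `__exit__`, and the atexit handler to all fire; only the first call does real work.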
### Phase 3: Testing (GPUs 4, 5)

File: `tests/test_port_conflict.py`

- Created a comprehensive test script
Test Results:
| Test | Status | Notes |
|---|---|---|
| Sequential creation (3 instances) | PASSED | Ports: 50405, 47835, 53011 |
| Context manager | PASSED | Auto-cleanup works |
| Parallel processes (GPU 4,5) | PASSED | Ports: 34631, 56097 |
### Phase 4: Documentation

File: `docs/torch_distributed_port_issue.md`

- Updated status to RESOLVED
- Documented solution details
- Added usage examples
- (This document was later deleted during cleanup since the issue is resolved; see Files Modified below.)
## Files Modified

| File | Action | Description |
|---|---|---|
| `nanovllm/engine/model_runner.py` | Modified | Added `_find_free_port()`, dynamic port logic |
| `nanovllm/engine/llm_engine.py` | Modified | Added `close()`, `__del__`, context manager |
| `tests/test_port_conflict.py` | Created | Test script for port conflict fix |
| `docs/torch_distributed_port_issue.md` | Deleted | Issue resolved, doc removed |
| `CLAUDE.md` | Modified | Removed port conflict warnings, updated doc index |
## Key Features After Fix

- Multi-GPU parallel testing:

  ```shell
  CUDA_VISIBLE_DEVICES=0 python test1.py &
  CUDA_VISIBLE_DEVICES=1 python test2.py &
  # Both run with different auto-assigned ports
  ```

- Sequential LLM creation:

  ```python
  for i in range(3):
      with LLM(model_path) as llm:
          outputs = llm.generate(prompts, params)
      # Automatically cleaned up
  ```

- Backward compatible:
  - The `NANOVLLM_DIST_PORT` env var still works
  - `llm.exit()` still works (alias for `close()`)