nano-vllm/progress.md
Zijie Tian 64971c8e8a Merge branch 'zijie/fix-dist-3': Fix distributed port conflict
- Auto port allocation with _find_free_port() in model_runner.py
- Resource management refactor with close() + context manager in llm_engine.py
- Add tests/test_port_conflict.py and tests/run_parallel_niah.sh
- Remove docs/torch_distributed_port_issue.md (issue fixed)
- Ignore tests/data/ directory

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-12 16:27:25 +08:00

# Progress Log: Fix Torch Distributed Port Conflict
## Status: COMPLETED & CLEANED UP
## Session: 2026-01-12
### Task Overview
Fix the EADDRINUSE port conflict that occurs when multiple LLM instances are created sequentially in the same Python process, and support launching multiple independent processes concurrently in a multi-GPU environment.
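The failure mode can be reproduced with plain sockets, independent of nano-vllm: a second bind to a port that is already held fails with EADDRINUSE. This is illustrative only, not project code.

```python
import errno
import socket

# A first socket holds some port; a second bind to the same port fails.
holder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
holder.bind(("127.0.0.1", 0))  # let the OS pick a currently free port
port = holder.getsockname()[1]

conflicting = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    conflicting.bind(("127.0.0.1", port))  # same port, still held
except OSError as e:
    assert e.errno == errno.EADDRINUSE
    print(f"port {port} already in use (errno {e.errno})")
finally:
    holder.close()
    conflicting.close()
```

The same thing happens when torch.distributed tries to rebind a fixed master port that a previous (or concurrent) LLM instance is still holding.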
---
### Phase Status
| Phase | Description | Status |
|-------|-------------|--------|
| Phase 1 | Dynamic port allocation in ModelRunner | COMPLETED |
| Phase 2 | LLMEngine `close()` and context manager | COMPLETED |
| Phase 3 | Testing and verification (GPU 4,5) | COMPLETED |
| Phase 4 | Documentation updates | COMPLETED |
---
### Implementation Summary
#### Phase 1: Dynamic Port Allocation
**File**: `nanovllm/engine/model_runner.py`
- Added `_find_free_port()` function using socket binding
- Modified port selection logic: use env var if set, otherwise auto-assign
- Added logging for auto-assigned ports
#### Phase 2: Resource Cleanup Enhancement
**File**: `nanovllm/engine/llm_engine.py`
- Added `_closed` flag for idempotent cleanup
- Added `close()` method for explicit resource release
- Added `__del__()` for GC fallback
- Added `__enter__()` and `__exit__()` for context manager support
- Modified atexit registration to use `_atexit_handler`
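The cleanup pattern described above can be sketched as follows (class body abridged; the commented-out teardown steps stand in for the real resource release):

```python
import atexit

class LLMEngine:
    def __init__(self):
        self._closed = False
        # ... model runner / distributed initialization would go here
        atexit.register(self._atexit_handler)  # fallback at interpreter exit

    def close(self):
        """Explicit, idempotent resource release."""
        if self._closed:
            return  # safe to call more than once
        self._closed = True
        # ... destroy process group, free GPU memory, join worker processes

    def _atexit_handler(self):
        self.close()

    def __del__(self):
        self.close()  # GC fallback if close() was never called

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not suppress exceptions from the with-block
```

The `_closed` flag is what makes the four entry points (`close()`, `__del__`, `__exit__`, atexit) safe to trigger in any combination.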
#### Phase 3: Testing (GPU 4,5)
**File**: `tests/test_port_conflict.py`
- Created comprehensive test script
**Test Results**:
| Test | Status | Notes |
|------|--------|-------|
| Sequential creation (3 instances) | PASSED | Ports: 50405, 47835, 53011 |
| Context manager | PASSED | Auto-cleanup works |
| Parallel processes (GPU 4,5) | PASSED | Ports: 34631, 56097 |
#### Phase 4: Documentation
**File**: `docs/torch_distributed_port_issue.md`
- Updated status to RESOLVED
- Documented solution details
- Added usage examples
- The doc was subsequently deleted during cleanup since the issue is fixed (see Files Modified below)
---
### Files Modified
| File | Action | Description |
|------|--------|-------------|
| `nanovllm/engine/model_runner.py` | Modified | Added `_find_free_port()`, dynamic port logic |
| `nanovllm/engine/llm_engine.py` | Modified | Added `close()`, `__del__`, context manager |
| `tests/test_port_conflict.py` | Created | Test script for port conflict fix |
| `docs/torch_distributed_port_issue.md` | Deleted | Issue resolved, doc removed |
| `CLAUDE.md` | Modified | Removed port conflict warnings, updated doc index |
---
### Key Features After Fix
1. **Multi-GPU Parallel Testing**
```bash
CUDA_VISIBLE_DEVICES=0 python test1.py &
CUDA_VISIBLE_DEVICES=1 python test2.py &
# Both run with different auto-assigned ports
```
2. **Sequential LLM Creation**
```python
for i in range(3):
    with LLM(model_path) as llm:
        outputs = llm.generate(prompts, params)
    # Automatically cleaned up on exiting the with-block
```
3. **Backward Compatible**
- `NANOVLLM_DIST_PORT` env var still works
- `llm.exit()` still works (alias for `close()`)
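The env-var-first selection logic implied above can be sketched like this (the function name `select_dist_port` is illustrative, not the actual API in `model_runner.py`):

```python
import os
import socket

def select_dist_port() -> int:
    """Honor NANOVLLM_DIST_PORT when set; otherwise auto-assign a free port."""
    env = os.environ.get("NANOVLLM_DIST_PORT")
    if env:
        return int(env)  # explicit pin wins, preserving old behavior
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # OS picks a currently free port
        return s.getsockname()[1]
```

Pinning the port via the env var remains useful when a firewall or orchestration layer requires a known port.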