# Progress Log: Fix Torch Distributed Port Conflict

## Status: COMPLETED & CLEANED UP

## Session: 2026-01-12

### Task Overview

Fix the EADDRINUSE port conflict that occurs when multiple LLM instances are created sequentially in the same Python process, and support launching multiple independent processes concurrently in a multi-GPU environment.

---

### Phase Status

| Phase | Description | Status |
|-------|-------------|--------|
| Phase 1 | Dynamic port allocation in ModelRunner | COMPLETED |
| Phase 2 | LLMEngine `close()` and context manager | COMPLETED |
| Phase 3 | Test verification (GPUs 4, 5) | COMPLETED |
| Phase 4 | Documentation update | COMPLETED |

---

### Implementation Summary

#### Phase 1: Dynamic Port Allocation

**File**: `nanovllm/engine/model_runner.py`

- Added a `_find_free_port()` function that uses socket binding
- Modified the port selection logic: use the env var if set, otherwise auto-assign
- Added logging for auto-assigned ports

#### Phase 2: Resource Cleanup Enhancement

**File**: `nanovllm/engine/llm_engine.py`

- Added a `_closed` flag for idempotent cleanup
- Added a `close()` method for explicit resource release
- Added `__del__()` as a GC fallback
- Added `__enter__()` and `__exit__()` for context manager support
- Changed the atexit registration to use `_atexit_handler`

#### Phase 3: Testing (GPUs 4, 5)

**File**: `tests/test_port_conflict.py`

- Created a comprehensive test script

**Test Results**:

| Test | Status | Notes |
|------|--------|-------|
| Sequential creation (3 instances) | PASSED | Ports: 50405, 47835, 53011 |
| Context manager | PASSED | Auto-cleanup works |
| Parallel processes (GPUs 4, 5) | PASSED | Ports: 34631, 56097 |

#### Phase 4: Documentation

**File**: `docs/torch_distributed_port_issue.md`

- Updated status to RESOLVED
- Documented solution details
- Added usage examples

(The doc was subsequently removed during cleanup; see Files Modified below.)

---

### Files Modified

| File | Action | Description |
|------|--------|-------------|
| `nanovllm/engine/model_runner.py` | Modified | Added `_find_free_port()`, dynamic port logic |
| `nanovllm/engine/llm_engine.py` | Modified | Added `close()`, `__del__`, context manager |
| `tests/test_port_conflict.py` | Created | Test script for the port conflict fix |
| `docs/torch_distributed_port_issue.md` | Deleted | Issue resolved, doc removed |
| `CLAUDE.md` | Modified | Removed port conflict warnings, updated doc index |

---

### Key Features After Fix

1. **Multi-GPU Parallel Testing**

   ```bash
   CUDA_VISIBLE_DEVICES=0 python test1.py &
   CUDA_VISIBLE_DEVICES=1 python test2.py &
   # Both run with different auto-assigned ports
   ```

2. **Sequential LLM Creation**

   ```python
   for i in range(3):
       with LLM(model_path) as llm:
           outputs = llm.generate(prompts, params)
       # Automatically cleaned up
   ```

3. **Backward Compatible**

   - The `NANOVLLM_DIST_PORT` env var still works
   - `llm.exit()` still works (alias for `close()`)