nano-vllm/task_plan.md
Zijie Tian 64971c8e8a Merge branch 'zijie/fix-dist-3': Fix distributed port conflict
- Auto port allocation with _find_free_port() in model_runner.py
- Resource management refactor with close() + context manager in llm_engine.py
- Add tests/test_port_conflict.py and tests/run_parallel_niah.sh
- Remove docs/torch_distributed_port_issue.md (issue fixed)
- Ignore tests/data/ directory

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-12 16:27:25 +08:00

# Task Plan: Fix Torch Distributed Port Conflict
## Goal
Support launching multiple independent nanovllm processes simultaneously for testing in a multi-GPU environment, without manual port management.
## Problem Analysis
### Core Problem
```
Current: all nanovllm instances default to port 2333
└── multiple independent processes running at the same time will conflict!
CUDA_VISIBLE_DEVICES=0 python test1.py  # binds port 2333 ✓
CUDA_VISIBLE_DEVICES=1 python test2.py  # tries to bind 2333 → EADDRINUSE ❌
```
### Root Cause
- A port is a system-wide resource, unrelated to GPUs
- Even when using different GPUs, the port still conflicts
- The default port `2333` is currently hard-coded
---
## Solution: Dynamic Port Allocation
### Core Approach
```python
import os
import socket

def _find_free_port() -> int:
    """Let the OS allocate a free port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

# Prefer the environment variable; otherwise auto-allocate
port = os.environ.get("NANOVLLM_DIST_PORT")
if port is None:
    port = _find_free_port()
else:
    port = int(port)
```
### Effect
```bash
# No need to specify ports manually; multiple tests can run side by side
CUDA_VISIBLE_DEVICES=0 python test1.py &  # auto port 54321
CUDA_VISIBLE_DEVICES=1 python test2.py &  # auto port 54322
CUDA_VISIBLE_DEVICES=2 python test3.py &  # auto port 54323
# Manual override is still supported (backward compatible)
NANOVLLM_DIST_PORT=2333 python test.py
```
---
## Implementation Phases
### Phase 1: Dynamic Port in ModelRunner [pending]
**File**: `nanovllm/engine/model_runner.py`
```python
import os
import socket

def _find_free_port() -> int:
    """Find a free port for distributed communication."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # SO_REUSEADDR must be set before bind() to take effect
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind(("", 0))
        return s.getsockname()[1]

class ModelRunner:
    def __init__(self, config: Config, rank: int, event: Event | list[Event]):
        # ... existing code ...
        port = os.environ.get("NANOVLLM_DIST_PORT")
        if port is None:
            port = _find_free_port()
            logger.info(f"Auto-assigned distributed port: {port}")
        else:
            port = int(port)
        dist.init_process_group("nccl", init_method=f"tcp://localhost:{port}", ...)
```
### Phase 2: LLMEngine Resource Cleanup [pending]
**File**: `nanovllm/engine/llm_engine.py`
Add a `close()` method and context-manager support to ensure resources are released correctly:
```python
import atexit

class LLMEngine:
    def __init__(self, model, **kwargs):
        # ... existing code ...
        self._closed = False
        atexit.register(self._atexit_handler)

    def _atexit_handler(self):
        if not self._closed:
            self.close()

    def close(self):
        """Explicitly close the engine and release all resources."""
        if self._closed:
            return
        self._closed = True
        try:
            atexit.unregister(self._atexit_handler)
        except Exception:
            pass
        self.model_runner.call("exit")
        del self.model_runner
        for p in self.ps:
            p.join()

    def exit(self):
        """Alias for close() - backward compatibility."""
        self.close()

    def __del__(self):
        try:
            self.close()
        except Exception:
            pass

    def __enter__(self):
        return self

    def __exit__(self, *args):
        self.close()
        return False
```
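The close-once pattern can be exercised with a toy stand-in (a sketch; `FakeEngine` is hypothetical and needs no GPU): `close()` must run its cleanup exactly once, whether triggered explicitly, by the context manager, or by the atexit hook.

```python
import atexit

class FakeEngine:
    """Toy stand-in for LLMEngine's lifecycle handling."""
    def __init__(self):
        self.close_calls = 0          # counts actual cleanup runs
        self._closed = False
        atexit.register(self._atexit_handler)

    def _atexit_handler(self):
        if not self._closed:
            self.close()

    def close(self):
        if self._closed:              # idempotent: later calls are no-ops
            return
        self._closed = True
        try:
            atexit.unregister(self._atexit_handler)
        except Exception:
            pass
        self.close_calls += 1         # real engine would join workers here

    def __enter__(self):
        return self

    def __exit__(self, *args):
        self.close()
        return False

with FakeEngine() as e:
    pass                              # __exit__ closes the engine
e.close()                             # explicit second call is a no-op
print(e.close_calls)  # → 1
```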
### Phase 3: Test Verification [pending]
**File**: `tests/test_multiple_processes.py` (new)
```python
"""Test multiple independent nanovllm processes."""
import os
import subprocess
import sys
import time

def test_parallel_processes():
    """Test running multiple nanovllm processes in parallel."""
    script = '''
import sys
sys.path.insert(0, ".")
from nanovllm import LLM, SamplingParams
import os

gpu = os.environ.get("CUDA_VISIBLE_DEVICES", "0")
print(f"[GPU {gpu}] Starting LLM")
llm = LLM("path/to/model", enable_cpu_offload=True)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=10))
print(f"[GPU {gpu}] Output: {outputs[0]['text'][:50]}")
llm.close()
print(f"[GPU {gpu}] Done")
'''
    # Start 2 processes on different GPUs
    procs = []
    for gpu in [0, 1]:
        env = {"CUDA_VISIBLE_DEVICES": str(gpu)}
        p = subprocess.Popen(
            [sys.executable, "-c", script],
            env={**os.environ, **env},
        )
        procs.append(p)
        time.sleep(1)  # Stagger start slightly
    # Wait for all
    for p in procs:
        assert p.wait() == 0, f"Process failed with code {p.returncode}"
    print("PASSED: test_parallel_processes")

if __name__ == "__main__":
    test_parallel_processes()
```
### Phase 4: Documentation Update [pending]
**File**: `docs/torch_distributed_port_issue.md`
Update the document to note that the issue has been resolved by dynamic port allocation.
---
## Usage After Fix
### Scenario 1: Parallel Multi-Process Testing (primary scenario)
```bash
# No extra configuration needed; just run
CUDA_VISIBLE_DEVICES=0 python test_group1.py &
CUDA_VISIBLE_DEVICES=1 python test_group2.py &
CUDA_VISIBLE_DEVICES=2 python test_group3.py &
wait
```
### Scenario 2: Sequential Creation in One Process (also supported)
```python
for i in range(3):
    with LLM(model_path) as llm:
        outputs = llm.generate(prompts, params)
    # Automatic cleanup; the next instance can use a fresh random port
```
### Scenario 3: Manually Specified Port (backward compatible)
```bash
NANOVLLM_DIST_PORT=2333 python test.py
```
---
## Success Criteria
- [ ] Multiple independent processes can run simultaneously (on different GPUs)
- [ ] No manual port configuration required
- [ ] Backward compatible (the environment variable still works)
- [ ] Sequential creation within one process also works
- [ ] Resources are cleaned up correctly
---
## Files to Modify
| File | Action | Status |
|------|--------|--------|
| `nanovllm/engine/model_runner.py` | Add `_find_free_port()` | pending |
| `nanovllm/engine/llm_engine.py` | Add `close()`, context manager | pending |
| `tests/test_multiple_processes.py` | Create | pending |
| `docs/torch_distributed_port_issue.md` | Update | pending |