nano-vllm/docs/torch_distributed_port_issue.md
2026-01-12 15:16:39 +08:00

# Torch Distributed Port Conflict Issue
## Problem Summary
When attempting to create multiple `LLM` instances sequentially in the same Python process (e.g., for grouped testing), the second and subsequent instances fail with:
```
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address.
port: 2333, useIpv6: false, code: -98, name: EADDRINUSE, message: address already in use
```
## Root Cause Analysis
### 1. Distributed Process Group Initialization
In `nanovllm/engine/model_runner.py:30-32`:
```python
import os
port = os.environ.get("NANOVLLM_DIST_PORT", "2333")
dist.init_process_group("nccl", f"tcp://localhost:{port}", world_size=self.world_size, rank=rank)
```
- Default port is **2333** (configurable via `NANOVLLM_DIST_PORT` env var)
- `init_process_group()` binds a TCP socket to this port
- This binding persists until `destroy_process_group()` is called
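The failure is plain TCP semantics rather than anything torch-specific. A stdlib-only sketch (no torch, no GPU) reproduces the same `EADDRINUSE` errno that shows up as `code: -98` in the traceback:

```python
import errno
import socket

# Bind and listen, as the TCP store behind init_process_group() does.
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
first.listen()
port = first.getsockname()[1]

# While the first socket is still bound, a second bind to the same port
# fails with EADDRINUSE (errno 98 on Linux) -- the error from the report.
second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    second.bind(("127.0.0.1", port))
    conflict = False
except OSError as e:
    conflict = e.errno == errno.EADDRINUSE
finally:
    second.close()
    first.close()

print(conflict)  # True
```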
### 2. Cleanup Mechanism
In `nanovllm/engine/llm_engine.py:37`:
```python
atexit.register(self.exit)
```
In `nanovllm/engine/llm_engine.py:39-43`:
```python
def exit(self):
    self.model_runner.call("exit")
    del self.model_runner
    for p in self.ps:
        p.join()
```
In `nanovllm/engine/model_runner.py:66-78`:
```python
def exit(self):
    # ... cleanup code ...
    dist.destroy_process_group()
```
### 3. The Problem
**`atexit` only triggers when the Python interpreter exits, NOT when the object is deleted or goes out of scope.**
Timeline of the bug:
```
1. Create LLM instance #1
├── init_process_group() binds port 2333 ✓
└── atexit.register(self.exit) registered
2. LLM #1 goes out of scope
├── No cleanup hook runs on deletion
├── (atexit.register(self.exit) even holds a reference to the object, keeping it alive)
└── Port 2333 still bound! ❌
3. Create LLM instance #2
├── init_process_group() tries to bind port 2333
└── EADDRINUSE error! ❌
4. Program exits (only now atexit runs)
└── Too late - already crashed
```
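The central claim above — that `atexit` handlers fire only at interpreter shutdown — can be checked with a hypothetical stand-in class (no nano-vllm required). Note a second subtlety: `atexit.register(self.exit)` stores a reference to the bound method, which keeps the engine object alive, so even garbage collection cannot rescue the port:

```python
import atexit

class FakeEngine:
    """Stand-in for LLMEngine's cleanup registration (hypothetical class)."""
    cleaned_up = False  # class-level flag so it is inspectable after `del`

    def __init__(self):
        # Same pattern as llm_engine.py: cleanup deferred to interpreter exit.
        # Registering the bound method also pins the object in memory.
        atexit.register(self.exit)

    def exit(self):
        FakeEngine.cleaned_up = True

engine = FakeEngine()
del engine                    # the name goes away, but no cleanup runs...
print(FakeEngine.cleaned_up)  # False: the handler fires only at shutdown
```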
## Impact
This issue affects:
1. **Grouped testing mode** (`test_ruler_niah.py --group-size N`)
- Each group needs a fresh LLM instance
- Second group fails with port conflict
2. **Multiple LLM instances in same process**
- Any code that creates LLM, deletes it, then creates another
3. **Interactive/notebook usage**
- Re-running cells that create LLM instances
## Proposed Solutions
### Solution A: Add `__del__` Method (Quick Fix)
Add destructor to `LLMEngine` that calls cleanup:
```python
# In nanovllm/engine/llm_engine.py
def __del__(self):
    try:
        self.exit()
    except Exception:
        pass  # Ignore errors during cleanup
```
**Pros**: Simple, backwards compatible
**Cons**: `__del__` is not guaranteed to run promptly — reference cycles defer it to a GC pass, and non-refcounting interpreters (e.g. PyPy) make its timing unpredictable
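For the simple create-delete-create pattern, CPython's reference counting runs `__del__` the moment the last reference disappears, which is why Solution A usually suffices; the caveat is that this timing is an implementation detail. A sketch with a hypothetical class:

```python
class DelEngine:
    """Hypothetical engine using Solution A's __del__ cleanup."""
    instances_closed = 0  # class-level counter, inspectable after `del`

    def exit(self):
        DelEngine.instances_closed += 1

    def __del__(self):
        try:
            self.exit()
        except Exception:
            pass  # never raise from a destructor

e = DelEngine()
del e  # CPython: refcount hits zero, __del__ runs immediately
print(DelEngine.instances_closed)  # 1 on CPython; timing not guaranteed elsewhere
```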
### Solution B: Context Manager Pattern (Recommended)
Make `LLMEngine` a context manager:
```python
# In nanovllm/engine/llm_engine.py
def __enter__(self):
    return self

def __exit__(self, exc_type, exc_val, exc_tb):
    self.exit()
    return False
```
Usage:
```python
with LLM(model_path) as llm:
    outputs = llm.generate(prompts, params)
# Cleanup happens automatically here
```
**Pros**: Explicit, guaranteed cleanup, Pythonic
**Cons**: Requires usage pattern change
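The "guaranteed cleanup" claim holds even when the `with` body raises, because `__exit__` runs during stack unwinding. A minimal stand-in (hypothetical class, mirroring the methods above) demonstrates this:

```python
class CtxEngine:
    """Hypothetical engine exposing Solution B's context-manager protocol."""
    def __init__(self):
        self.closed = False

    def exit(self):
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.exit()
        return False  # returning False propagates the exception

engine = CtxEngine()
try:
    with engine:
        raise RuntimeError("generation failed mid-run")
except RuntimeError:
    pass

print(engine.closed)  # True: cleanup ran despite the exception
```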
### Solution C: Check and Cleanup Before Init (Defensive)
In `ModelRunner.__init__`, check if process group exists:
```python
# In nanovllm/engine/model_runner.py
if dist.is_initialized():
    dist.destroy_process_group()
dist.init_process_group("nccl", f"tcp://localhost:{port}", ...)
```
**Pros**: Self-healing, no usage pattern change
**Cons**: May mask other issues, global state manipulation
### Solution D: Subprocess Isolation (For Testing)
For grouped testing specifically, run each group in a subprocess:
```python
import subprocess
import sys

for group in groups:
    subprocess.run([sys.executable, "test_ruler_niah.py",
                    "--sample-indices", f"{start}-{end}"])
```
**Pros**: Complete isolation, no code changes to nanovllm
**Cons**: More overhead, only solves testing use case
### Solution E: Dynamic Port Allocation
Instead of fixed port 2333, use dynamic port:
```python
import os
import socket

def find_free_port():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # port 0: the OS assigns a free ephemeral port
        return s.getsockname()[1]

port = os.environ.get("NANOVLLM_DIST_PORT") or str(find_free_port())
```
**Pros**: Avoids conflicts entirely
**Cons**: More complex; there is a race window between releasing the probe socket and the actual bind, and with `world_size > 1` all ranks must agree on the chosen port
## Recommended Implementation
**Combine Solutions A + B + C** for maximum robustness:
1. Add `__del__` for best-effort cleanup
2. Add context manager for explicit cleanup
3. Add `is_initialized()` check as defensive measure
```python
# nanovllm/engine/llm_engine.py
class LLMEngine:

    def __init__(self, model, **kwargs):
        # ... existing code ...
        self._exited = False  # set before registering, so exit() is always safe
        atexit.register(self.exit)

    def exit(self):
        if self._exited:
            return
        self._exited = True
        self.model_runner.call("exit")
        del self.model_runner
        for p in self.ps:
            p.join()

    def __del__(self):
        try:
            self.exit()
        except Exception:
            pass

    def __enter__(self):
        return self

    def __exit__(self, *args):
        self.exit()
        return False

# nanovllm/engine/model_runner.py
class ModelRunner:

    def __init__(self, config: Config, rank: int, event):
        # ... existing code before init_process_group ...
        import os
        port = os.environ.get("NANOVLLM_DIST_PORT", "2333")
        # Defensive cleanup
        if dist.is_initialized():
            dist.destroy_process_group()
        dist.init_process_group("nccl", f"tcp://localhost:{port}",
                                world_size=self.world_size, rank=rank)
        # ... rest of init ...
```
## Workaround for Current Code
Until the fix is implemented, use one of these workarounds:
### Workaround 1: Manual Cleanup
```python
llm = LLM(model_path)
outputs = llm.generate(...)
llm.model_runner.call("exit")  # runs ModelRunner.exit(), which destroys the process group
del llm
# Now port 2333 is free and a new LLM can be created
llm2 = LLM(model_path)
```
### Workaround 2: Subprocess Testing
```bash
# Run each test group as a separate process
for i in $(seq 0 5 95); do
    python test_ruler_niah.py --sample-indices $i-$((i+4)) --enable-offload
done
```
### Workaround 3: Environment Variable Port
```bash
# Use different port for each run
NANOVLLM_DIST_PORT=2334 python test.py
NANOVLLM_DIST_PORT=2335 python test.py
```
## Related Files
| File | Relevant Code |
|------|---------------|
| `nanovllm/engine/model_runner.py:30-32` | `init_process_group()` call |
| `nanovllm/engine/model_runner.py:66-78` | `exit()` and `destroy_process_group()` |
| `nanovllm/engine/llm_engine.py:37` | `atexit.register()` |
| `nanovllm/engine/llm_engine.py:39-43` | `exit()` method |
## Testing the Fix
After implementing the fix, verify with:
```python
# test_multiple_llm.py
from nanovllm import LLM, SamplingParams

for i in range(3):
    print(f"Creating LLM instance {i+1}")
    llm = LLM("path/to/model", enable_cpu_offload=True)
    outputs = llm.generate(["Hello"], SamplingParams(max_tokens=10))
    print(f"Instance {i+1} output: {outputs[0]['text']}")
    del llm
    print(f"Instance {i+1} deleted\n")

print("All instances created and deleted successfully!")
```
Expected: No port conflict errors, all 3 instances work.
## Priority
**High** - This blocks grouped testing and any multi-LLM-instance workflows.