Merge branch 'zijie/fix-dist-3': Fix distributed port conflict
- Auto port allocation with _find_free_port() in model_runner.py
- Resource management refactor with close() + context manager in llm_engine.py
- Add tests/test_port_conflict.py and tests/run_parallel_niah.sh
- Remove docs/torch_distributed_port_issue.md (issue fixed)
- Ignore tests/data/ directory

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
# Torch Distributed Port Conflict Issue

## Problem Summary

When attempting to create multiple `LLM` instances sequentially in the same Python process (e.g., for grouped testing), the second and subsequent instances fail with:

```
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address.
port: 2333, useIpv6: false, code: -98, name: EADDRINUSE, message: address already in use
```
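The failure mode is ordinary TCP port exhaustion, not anything torch-specific. A minimal sketch with plain sockets reproduces it (standard library only; `errno.EADDRINUSE` is 98 on Linux, which appears negated as `code: -98` in the torch error):

```python
import errno
import socket

# Bind a first socket and start listening, like init_process_group()
# holding its rendezvous port.
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 0))
first.listen()
port = first.getsockname()[1]

# A second bind to the same port fails the same way the second LLM does.
second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
caught = None
try:
    second.bind(("127.0.0.1", port))
except OSError as e:
    caught = e.errno
finally:
    second.close()
    first.close()

assert caught == errno.EADDRINUSE
print("second bind failed with EADDRINUSE")
```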
## Root Cause Analysis

### 1. Distributed Process Group Initialization

In `nanovllm/engine/model_runner.py:30-32`:

```python
import os
port = os.environ.get("NANOVLLM_DIST_PORT", "2333")
dist.init_process_group("nccl", f"tcp://localhost:{port}", world_size=self.world_size, rank=rank)
```

- Default port is **2333** (configurable via the `NANOVLLM_DIST_PORT` env var)
- `init_process_group()` binds a TCP socket to this port
- The binding persists until `destroy_process_group()` is called
### 2. Cleanup Mechanism

In `nanovllm/engine/llm_engine.py:37`:

```python
atexit.register(self.exit)
```

In `nanovllm/engine/llm_engine.py:39-43`:

```python
def exit(self):
    self.model_runner.call("exit")
    del self.model_runner
    for p in self.ps:
        p.join()
```

In `nanovllm/engine/model_runner.py:66-78`:

```python
def exit(self):
    # ... cleanup code ...
    dist.destroy_process_group()
```
### 3. The Problem

**`atexit` handlers only run when the Python interpreter exits, NOT when the object is deleted or goes out of scope.**

Timeline of the bug:

```
1. Create LLM instance #1
   ├── init_process_group() binds port 2333 ✓
   └── atexit.register(self.exit) registered

2. LLM #1 goes out of scope (garbage collected)
   ├── Python's GC deletes the object
   ├── BUT the atexit handler has NOT been triggered yet
   └── Port 2333 is still bound! ❌

3. Create LLM instance #2
   ├── init_process_group() tries to bind port 2333
   └── EADDRINUSE error! ❌

4. Program exits (only now does atexit run)
   └── Too late - already crashed
```
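The core behavior is easy to demonstrate with the standard library alone (a minimal sketch; `Holder` is a stand-in for `LLMEngine`, not nanovllm code):

```python
import atexit

cleaned_up = []

class Holder:
    """Stand-in for LLMEngine: cleanup is registered with atexit only."""
    def __init__(self):
        atexit.register(self.exit)

    def exit(self):
        cleaned_up.append(True)

h = Holder()
del h  # the name is gone...
# ...but the atexit handler has not fired: the "port" would still be
# bound at this point, and a second Holder would hit EADDRINUSE.
assert cleaned_up == []
print("cleanup has not run; it only runs at interpreter exit")
```

Note also that `atexit.register(self.exit)` stores the bound method, which holds a strong reference to the instance, so the object is not even eligible for garbage collection after `del`.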
## Impact

This issue affects:

1. **Grouped testing mode** (`test_ruler_niah.py --group-size N`)
   - Each group needs a fresh LLM instance
   - The second group fails with a port conflict

2. **Multiple LLM instances in the same process**
   - Any code that creates an LLM, deletes it, then creates another

3. **Interactive/notebook usage**
   - Re-running cells that create LLM instances
## Proposed Solutions

### Solution A: Add `__del__` Method (Quick Fix)

Add a destructor to `LLMEngine` that calls cleanup:

```python
# In nanovllm/engine/llm_engine.py

def __del__(self):
    try:
        self.exit()
    except Exception:
        pass  # Ignore errors during cleanup
```

**Pros**: Simple, backwards compatible
**Cons**: `__del__` is not guaranteed to be called (circular references, etc.)
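One caveat worth spelling out: because the current code calls `atexit.register(self.exit)`, atexit holds a strong reference to the engine via the bound method, so `__del__` alone would never fire before interpreter exit. A minimal sketch of both effects, using stand-in classes rather than nanovllm code:

```python
import atexit
import gc

log = []

class WithDel:
    def __del__(self):
        log.append("del ran")

# Without an atexit registration, CPython's reference counting runs
# __del__ as soon as the last reference disappears.
obj = WithDel()
del obj
assert log == ["del ran"]

class Registered:
    def __init__(self):
        atexit.register(self.exit)  # bound method -> strong ref to self

    def exit(self):
        log.append("exit ran")

    def __del__(self):
        log.append("registered del ran")

obj2 = Registered()
del obj2
gc.collect()
# atexit still references the instance, so neither __del__ nor exit()
# has run: __del__ alone cannot fix the port leak here.
assert log == ["del ran"]
print("atexit registration keeps the object alive")
```

This suggests `__del__` is only useful in combination with unregistering the atexit handler (or with the idempotent `exit()` guard shown in the recommended implementation below).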
### Solution B: Context Manager Pattern (Recommended)

Make `LLMEngine` a context manager:

```python
# In nanovllm/engine/llm_engine.py

def __enter__(self):
    return self

def __exit__(self, exc_type, exc_val, exc_tb):
    self.exit()
    return False
```

Usage:

```python
with LLM(model_path) as llm:
    outputs = llm.generate(prompts, params)
# Cleanup happens automatically here
```

**Pros**: Explicit, guaranteed cleanup, Pythonic
**Cons**: Requires a usage-pattern change
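Until the class itself gains `__enter__`/`__exit__`, callers can get the same guarantee by wrapping any engine-like object with `contextlib` (a sketch; `DummyEngine` is a hypothetical stand-in for anything exposing an `exit()` method):

```python
from contextlib import contextmanager

class DummyEngine:
    """Stand-in for LLMEngine: anything exposing exit()."""
    def __init__(self):
        self.exited = False

    def exit(self):
        self.exited = True

@contextmanager
def managed(engine):
    """Guarantee engine.exit() runs, even if the body raises."""
    try:
        yield engine
    finally:
        engine.exit()

eng = DummyEngine()
with managed(eng) as e:
    pass  # llm.generate(...) would go here
assert eng.exited
print("exit() was called on block exit")
```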
### Solution C: Check and Cleanup Before Init (Defensive)

In `ModelRunner.__init__`, check whether a process group already exists:

```python
# In nanovllm/engine/model_runner.py

if dist.is_initialized():
    dist.destroy_process_group()
dist.init_process_group("nccl", f"tcp://localhost:{port}", ...)
```

**Pros**: Self-healing, no usage-pattern change
**Cons**: May mask other issues; manipulates global state
### Solution D: Subprocess Isolation (For Testing)

For grouped testing specifically, run each group in a subprocess:

```python
import subprocess
import sys

for group in groups:
    subprocess.run([sys.executable, "test_ruler_niah.py",
                    "--sample-indices", f"{start}-{end}"])
```

**Pros**: Complete isolation, no code changes to nanovllm
**Cons**: More overhead, only solves the testing use case
### Solution E: Dynamic Port Allocation

Instead of the fixed port 2333, pick a free port at runtime:

```python
import os
import socket

def find_free_port():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(('', 0))
        return s.getsockname()[1]

port = os.environ.get("NANOVLLM_DIST_PORT") or find_free_port()
```

**Pros**: Avoids conflicts entirely
**Cons**: More complex; the freed port can be taken by another process before `init_process_group()` rebinds it (a race), and all ranks must agree on the chosen port
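A quick sanity check of the helper (a sketch; note that `os.environ.get` returns a string while `find_free_port()` returns an int, so in practice the result should be normalized with `int(...)` before use):

```python
import socket

def find_free_port():
    # Bind to port 0: the OS assigns a free ephemeral port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(('', 0))
        return s.getsockname()[1]

port = find_free_port()
assert 1024 <= port <= 65535  # ephemeral ports are unprivileged
print(f"allocated port {port}")

# The socket is closed on return, so the port is immediately bindable
# again; nothing reserves it, which is exactly the race noted above.
probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
probe.bind(('', port))
probe.close()
```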
## Recommended Implementation

**Combine Solutions A + B + C** for maximum robustness:

1. Add `__del__` for best-effort cleanup
2. Add a context manager for explicit cleanup
3. Add an `is_initialized()` check as a defensive measure

```python
# nanovllm/engine/llm_engine.py

class LLMEngine:
    def __init__(self, model, **kwargs):
        # ... existing code ...
        self._exited = False
        atexit.register(self.exit)

    def exit(self):
        if self._exited:
            return
        self._exited = True
        self.model_runner.call("exit")
        del self.model_runner
        for p in self.ps:
            p.join()

    def __del__(self):
        try:
            self.exit()
        except Exception:
            pass

    def __enter__(self):
        return self

    def __exit__(self, *args):
        self.exit()
        return False


# nanovllm/engine/model_runner.py

class ModelRunner:
    def __init__(self, config: Config, rank: int, event):
        # ... existing code before init_process_group ...

        import os
        port = os.environ.get("NANOVLLM_DIST_PORT", "2333")

        # Defensive cleanup
        if dist.is_initialized():
            dist.destroy_process_group()

        dist.init_process_group("nccl", f"tcp://localhost:{port}",
                                world_size=self.world_size, rank=rank)
        # ... rest of init ...
```
## Workaround for Current Code

Until the fix is implemented, use one of these workarounds:

### Workaround 1: Manual Cleanup

```python
llm = LLM(model_path)
outputs = llm.generate(...)
llm.model_runner.call("exit")  # Manual cleanup
del llm

# Now a new LLM can be created
llm2 = LLM(model_path)
```
### Workaround 2: Subprocess Testing

```bash
# Run each test group as a separate process
for i in $(seq 0 5 95); do
    python test_ruler_niah.py --sample-indices $i-$((i+4)) --enable-offload
done
```
### Workaround 3: Environment Variable Port

```bash
# Use a different port for each run
NANOVLLM_DIST_PORT=2334 python test.py
NANOVLLM_DIST_PORT=2335 python test.py
```
## Related Files

| File | Relevant Code |
|------|---------------|
| `nanovllm/engine/model_runner.py:30-32` | `init_process_group()` call |
| `nanovllm/engine/model_runner.py:66-78` | `exit()` and `destroy_process_group()` |
| `nanovllm/engine/llm_engine.py:37` | `atexit.register()` |
| `nanovllm/engine/llm_engine.py:39-43` | `exit()` method |
## Testing the Fix

After implementing the fix, verify with:

```python
# test_multiple_llm.py
from nanovllm import LLM, SamplingParams

for i in range(3):
    print(f"Creating LLM instance {i+1}")
    llm = LLM("path/to/model", enable_cpu_offload=True)
    outputs = llm.generate(["Hello"], SamplingParams(max_tokens=10))
    print(f"Instance {i+1} output: {outputs[0]['text']}")
    del llm
    print(f"Instance {i+1} deleted\n")

print("All instances created and deleted successfully!")
```

Expected: no port conflict errors; all 3 instances work.
## Priority

**High** - This blocks grouped testing and any multi-LLM-instance workflows.