# Torch Distributed Port Conflict Issue

## Problem Summary
When attempting to create multiple LLM instances sequentially in the same Python process (e.g., for grouped testing), the second and subsequent instances fail with:
```
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address.
port: 2333, useIpv6: false, code: -98, name: EADDRINUSE, message: address already in use
```
## Root Cause Analysis

### 1. Distributed Process Group Initialization

In `nanovllm/engine/model_runner.py:30-32`:
```python
import os
port = os.environ.get("NANOVLLM_DIST_PORT", "2333")
dist.init_process_group("nccl", f"tcp://localhost:{port}", world_size=self.world_size, rank=rank)
```

- The default port is `2333` (configurable via the `NANOVLLM_DIST_PORT` env var)
- `init_process_group()` binds a TCP socket to this port
- The binding persists until `destroy_process_group()` is called
### 2. Cleanup Mechanism

In `nanovllm/engine/llm_engine.py:37`:

```python
atexit.register(self.exit)
```

In `nanovllm/engine/llm_engine.py:39-43`:
```python
def exit(self):
    self.model_runner.call("exit")
    del self.model_runner
    for p in self.ps:
        p.join()
```

In `nanovllm/engine/model_runner.py:66-78`:

```python
def exit(self):
    # ... cleanup code ...
    dist.destroy_process_group()
```
### 3. The Problem

`atexit` handlers run only when the Python interpreter exits, NOT when the object is deleted or goes out of scope.
Timeline of the bug:

```
1. Create LLM instance #1
   ├── init_process_group() binds port 2333 ✓
   └── atexit.register(self.exit) registered
2. LLM #1 goes out of scope
   ├── the name is dropped (the registered atexit handler still holds a reference)
   ├── BUT the atexit handler has NOT run yet
   └── Port 2333 still bound! ❌
3. Create LLM instance #2
   ├── init_process_group() tries to bind port 2333
   └── EADDRINUSE error! ❌
4. Program exits (only now atexit runs)
   └── Too late - already crashed
```
## Impact

This issue affects:

- **Grouped testing mode** (`test_ruler_niah.py --group-size N`)
  - Each group needs a fresh LLM instance
  - The second group fails with a port conflict
- **Multiple LLM instances in the same process**
  - Any code that creates an LLM, deletes it, then creates another
- **Interactive/notebook usage**
  - Re-running cells that create LLM instances
## Proposed Solutions

### Solution A: Add `__del__` Method (Quick Fix)

Add a destructor to `LLMEngine` that calls the cleanup:
```python
# In nanovllm/engine/llm_engine.py
def __del__(self):
    try:
        self.exit()
    except Exception:
        pass  # Ignore errors during cleanup
```

Pros: Simple, backwards compatible
Cons: `__del__` is not guaranteed to run (reference cycles, interpreter shutdown); moreover, the bound method passed to `atexit.register(self.exit)` keeps the engine alive, so `__del__` will not fire before interpreter exit unless the handler is also unregistered
### Solution B: Context Manager Pattern (Recommended)

Make `LLMEngine` a context manager:
```python
# In nanovllm/engine/llm_engine.py
def __enter__(self):
    return self

def __exit__(self, exc_type, exc_val, exc_tb):
    self.exit()
    return False
```
Usage:
```python
with LLM(model_path) as llm:
    outputs = llm.generate(prompts, params)
# Cleanup happens automatically here
```

Pros: Explicit, guaranteed cleanup (even on exceptions), Pythonic
Cons: Requires a usage-pattern change
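The guarantee can be illustrated with a stand-in class (a hypothetical `FakeEngine`, not the real `LLMEngine`): `__exit__` runs even when the body raises, so the port would always be released.

```python
class FakeEngine:
    """Hypothetical stand-in for LLMEngine, to show the pattern only."""

    def __init__(self):
        self.closed = False

    def exit(self):
        # Real code would call dist.destroy_process_group() here
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.exit()
        return False  # do not swallow exceptions

eng = FakeEngine()
try:
    with eng:
        raise RuntimeError("generate failed")
except RuntimeError:
    pass

print(eng.closed)  # True: cleanup ran despite the exception
```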
### Solution C: Check and Cleanup Before Init (Defensive)

In `ModelRunner.__init__`, check whether a process group already exists:
```python
# In nanovllm/engine/model_runner.py
if dist.is_initialized():
    dist.destroy_process_group()
dist.init_process_group("nccl", f"tcp://localhost:{port}", ...)
```

Pros: Self-healing, no usage-pattern change
Cons: May mask other issues; manipulates global state
### Solution D: Subprocess Isolation (For Testing)
For grouped testing specifically, run each group in a subprocess:
```python
import subprocess
import sys

for start, end in groups:
    subprocess.run([sys.executable, "test_ruler_niah.py",
                    "--sample-indices", f"{start}-{end}"])
```

Pros: Complete isolation, no code changes to nanovllm
Cons: More overhead; only solves the testing use case
### Solution E: Dynamic Port Allocation

Instead of the fixed port 2333, pick a free port dynamically:
```python
import os
import socket

def find_free_port():
    # Bind to port 0: the OS assigns an unused ephemeral port
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

port = os.environ.get("NANOVLLM_DIST_PORT") or find_free_port()
```

Pros: Avoids conflicts entirely
Cons: More complex; the port can be taken by another process between discovery and bind, and with multiple ranks every process must agree on the same port
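A stand-alone sanity check of the helper (re-declared here so the snippet runs on its own): the OS hands back a usable port, though this cannot remove the race described above.

```python
import socket

def find_free_port():
    # Bind to port 0: the OS assigns an unused ephemeral port
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

port = find_free_port()

# The port was free at the time of the check, so re-binding
# immediately afterwards normally succeeds (still racy in principle)
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.bind(("", port))

print(port)
```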
## Recommended Implementation

Combine Solutions A + B + C for maximum robustness:

- Add `__del__` for best-effort cleanup
- Add a context manager for explicit cleanup
- Add an `is_initialized()` check as a defensive measure
```python
# nanovllm/engine/llm_engine.py
class LLMEngine:
    def __init__(self, model, **kwargs):
        # ... existing code ...
        atexit.register(self.exit)
        self._exited = False

    def exit(self):
        if self._exited:
            return
        self._exited = True
        self.model_runner.call("exit")
        del self.model_runner
        for p in self.ps:
            p.join()

    def __del__(self):
        try:
            self.exit()
        except Exception:
            pass

    def __enter__(self):
        return self

    def __exit__(self, *args):
        self.exit()
        return False
```
```python
# nanovllm/engine/model_runner.py
class ModelRunner:
    def __init__(self, config: Config, rank: int, event):
        # ... existing code before init_process_group ...
        import os
        port = os.environ.get("NANOVLLM_DIST_PORT", "2333")
        # Defensive cleanup
        if dist.is_initialized():
            dist.destroy_process_group()
        dist.init_process_group("nccl", f"tcp://localhost:{port}",
                                world_size=self.world_size, rank=rank)
        # ... rest of init ...
```
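The `_exited` flag is what lets `atexit`, `__del__`, and `__exit__` all call `exit()` safely; a minimal stand-alone illustration of the idempotency guard (hypothetical `Engine` class, not nanovllm code):

```python
class Engine:
    """Hypothetical stand-in: demonstrates only the idempotent-exit guard."""

    def __init__(self):
        self._exited = False
        self.teardowns = 0

    def exit(self):
        if self._exited:
            return  # already cleaned up: do nothing
        self._exited = True
        self.teardowns += 1  # real code would destroy the process group here

e = Engine()
e.exit()   # performs the teardown
e.exit()   # no-op
e.exit()   # no-op
print(e.teardowns)  # 1
```

Without the guard, the same teardown could run up to three times (from `__exit__`, `__del__`, and the `atexit` handler), and a second `destroy_process_group()` call would raise.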
## Workaround for Current Code
Until the fix is implemented, use one of these workarounds:
### Workaround 1: Manual Cleanup
```python
llm = LLM(model_path)
outputs = llm.generate(...)

llm.model_runner.call("exit")  # Manual cleanup (calls dist.destroy_process_group())
del llm

# Now a new LLM can be created
llm2 = LLM(model_path)
```
### Workaround 2: Subprocess Testing

```bash
# Run each test group as a separate process
for i in $(seq 0 5 95); do
    python test_ruler_niah.py --sample-indices $i-$((i+4)) --enable-offload
done
```
### Workaround 3: Environment Variable Port

```bash
# Use a different port for each run
NANOVLLM_DIST_PORT=2334 python test.py
NANOVLLM_DIST_PORT=2335 python test.py
```
## Related Files

| File | Relevant Code |
|---|---|
| `nanovllm/engine/model_runner.py:30-32` | `init_process_group()` call |
| `nanovllm/engine/model_runner.py:66-78` | `exit()` and `destroy_process_group()` |
| `nanovllm/engine/llm_engine.py:37` | `atexit.register()` |
| `nanovllm/engine/llm_engine.py:39-43` | `exit()` method |
## Testing the Fix
After implementing the fix, verify with:
```python
# test_multiple_llm.py
from nanovllm import LLM, SamplingParams

for i in range(3):
    print(f"Creating LLM instance {i+1}")
    llm = LLM("path/to/model", enable_cpu_offload=True)
    outputs = llm.generate(["Hello"], SamplingParams(max_tokens=10))
    print(f"Instance {i+1} output: {outputs[0]['text']}")
    del llm
    print(f"Instance {i+1} deleted\n")

print("All instances created and deleted successfully!")
```

Expected: no port-conflict errors; all three instances work.
## Priority

**High** - This blocks grouped testing and any multi-LLM-instance workflows.