# Torch Distributed Port Conflict Issue

## Problem Summary

When attempting to create multiple `LLM` instances sequentially in the same Python process (e.g., for grouped testing), the second and subsequent instances fail with:

```
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address.
port: 2333, useIpv6: false, code: -98, name: EADDRINUSE, message: address already in use
```

## Root Cause Analysis

### 1. Distributed Process Group Initialization

In `nanovllm/engine/model_runner.py:30-32`:

```python
import os
port = os.environ.get("NANOVLLM_DIST_PORT", "2333")
dist.init_process_group("nccl", f"tcp://localhost:{port}", world_size=self.world_size, rank=rank)
```

- Default port is **2333** (configurable via the `NANOVLLM_DIST_PORT` env var)
- `init_process_group()` binds a TCP socket to this port
- The binding persists until `destroy_process_group()` is called

### 2. Cleanup Mechanism

In `nanovllm/engine/llm_engine.py:37`:

```python
atexit.register(self.exit)
```

In `nanovllm/engine/llm_engine.py:39-43`:

```python
def exit(self):
    self.model_runner.call("exit")
    del self.model_runner
    for p in self.ps:
        p.join()
```

In `nanovllm/engine/model_runner.py:66-78`:

```python
def exit(self):
    # ... cleanup code ...
    dist.destroy_process_group()
```

### 3. The Problem

**`atexit` only triggers when the Python interpreter exits, NOT when the object is deleted or goes out of scope.** Worse, `atexit.register(self.exit)` stores a reference to the bound method, which keeps the engine object alive even after the caller drops its last reference.

Timeline of the bug:

```
1. Create LLM instance #1
   ├── init_process_group() binds port 2333 ✓
   └── atexit.register(self.exit) registered

2. LLM #1 goes out of scope
   ├── Caller's reference is dropped
   ├── BUT atexit handler NOT triggered yet
   └── Port 2333 still bound! ❌

3. Create LLM instance #2
   ├── init_process_group() tries to bind port 2333
   └── EADDRINUSE error! ❌

4. Program exits (only now atexit runs)
   └── Too late - already crashed
```

## Impact

This issue affects:

1. **Grouped testing mode** (`test_ruler_niah.py --group-size N`)
   - Each group needs a fresh LLM instance
   - Second group fails with port conflict
2. **Multiple LLM instances in same process**
   - Any code that creates an LLM, deletes it, then creates another
3. **Interactive/notebook usage**
   - Re-running cells that create LLM instances

## Proposed Solutions

### Solution A: Add `__del__` Method (Quick Fix)

Add a destructor to `LLMEngine` that calls cleanup:

```python
# In nanovllm/engine/llm_engine.py
def __del__(self):
    try:
        self.exit()
    except Exception:
        pass  # Ignore errors during cleanup
```

**Pros**: Simple, backwards compatible
**Cons**: `__del__` is not guaranteed to run (circular references, etc.), and here the `atexit` registration keeps the object alive, so it may never fire on its own

### Solution B: Context Manager Pattern (Recommended)

Make `LLMEngine` a context manager:

```python
# In nanovllm/engine/llm_engine.py
def __enter__(self):
    return self

def __exit__(self, exc_type, exc_val, exc_tb):
    self.exit()
    return False
```

Usage:

```python
with LLM(model_path) as llm:
    outputs = llm.generate(prompts, params)
# Cleanup happens automatically here
```

**Pros**: Explicit, guaranteed cleanup, Pythonic
**Cons**: Requires a usage pattern change

### Solution C: Check and Cleanup Before Init (Defensive)

In `ModelRunner.__init__`, check whether a process group already exists:

```python
# In nanovllm/engine/model_runner.py
if dist.is_initialized():
    dist.destroy_process_group()
dist.init_process_group("nccl", f"tcp://localhost:{port}", ...)
```

**Pros**: Self-healing, no usage pattern change
**Cons**: May mask other issues, global state manipulation

### Solution D: Subprocess Isolation (For Testing)

For grouped testing specifically, run each group in a subprocess:

```python
import subprocess
import sys

for group in groups:
    subprocess.run([sys.executable, "test_ruler_niah.py", "--sample-indices", f"{start}-{end}"])
```

**Pros**: Complete isolation, no code changes to nanovllm
**Cons**: More overhead, only solves the testing use case

### Solution E: Dynamic Port Allocation

Instead of the fixed port 2333, pick a free port at runtime:

```python
import socket

def find_free_port():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(('', 0))
        return s.getsockname()[1]

port = os.environ.get("NANOVLLM_DIST_PORT") or find_free_port()
```

**Pros**: Avoids conflicts entirely
**Cons**: More complex; all ranks must rendezvous on the same port, so each rank independently picking its own free port would break multi-process initialization

## Recommended Implementation

**Combine Solutions A + B + C** for maximum robustness:

1. Add `__del__` for best-effort cleanup
2. Add a context manager for explicit cleanup
3. Add an `is_initialized()` check as a defensive measure

```python
# nanovllm/engine/llm_engine.py
class LLMEngine:
    def __init__(self, model, **kwargs):
        # ... existing code ...
        self._exited = False
        atexit.register(self.exit)

    def exit(self):
        if self._exited:
            return
        self._exited = True
        self.model_runner.call("exit")
        del self.model_runner
        for p in self.ps:
            p.join()

    def __del__(self):
        try:
            self.exit()
        except Exception:
            pass

    def __enter__(self):
        return self

    def __exit__(self, *args):
        self.exit()
        return False


# nanovllm/engine/model_runner.py
class ModelRunner:
    def __init__(self, config: Config, rank: int, event):
        # ... existing code before init_process_group ...
        import os
        port = os.environ.get("NANOVLLM_DIST_PORT", "2333")
        # Defensive cleanup
        if dist.is_initialized():
            dist.destroy_process_group()
        dist.init_process_group("nccl", f"tcp://localhost:{port}", world_size=self.world_size, rank=rank)
        # ... rest of init ...
```

## Workaround for Current Code

Until the fix is implemented, use one of these workarounds:

### Workaround 1: Manual Cleanup

```python
llm = LLM(model_path)
outputs = llm.generate(...)
llm.model_runner.call("exit")  # Manual cleanup
del llm

# Now a new LLM can be created
llm2 = LLM(model_path)
```

### Workaround 2: Subprocess Testing

```bash
# Run each test group as a separate process
for i in $(seq 0 5 95); do
    python test_ruler_niah.py --sample-indices $i-$((i+4)) --enable-offload
done
```

### Workaround 3: Environment Variable Port

```bash
# Use a different port for each run
NANOVLLM_DIST_PORT=2334 python test.py
NANOVLLM_DIST_PORT=2335 python test.py
```

## Related Files

| File | Relevant Code |
|------|---------------|
| `nanovllm/engine/model_runner.py:30-32` | `init_process_group()` call |
| `nanovllm/engine/model_runner.py:66-78` | `exit()` and `destroy_process_group()` |
| `nanovllm/engine/llm_engine.py:37` | `atexit.register()` |
| `nanovllm/engine/llm_engine.py:39-43` | `exit()` method |

## Testing the Fix

After implementing the fix, verify with:

```python
# test_multiple_llm.py
from nanovllm import LLM, SamplingParams

for i in range(3):
    print(f"Creating LLM instance {i+1}")
    llm = LLM("path/to/model", enable_cpu_offload=True)
    outputs = llm.generate(["Hello"], SamplingParams(max_tokens=10))
    print(f"Instance {i+1} output: {outputs[0]['text']}")
    del llm
    print(f"Instance {i+1} deleted\n")

print("All instances created and deleted successfully!")
```

Expected: no port conflict errors, all 3 instances work.

## Priority

**High** - This blocks grouped testing and any multi-LLM-instance workflows.
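## Appendix: Reproducing the `atexit` Behavior

The root-cause claim — that `atexit` handlers run only at interpreter exit, and that registering a bound method keeps the object alive in the meantime — can be checked without torch or a GPU. A minimal sketch; the `Engine` class here is a hypothetical stand-in for `LLMEngine`, not nanovllm code:

```python
import atexit
import gc

calls = []

class Engine:
    """Hypothetical stand-in for LLMEngine: registers cleanup with atexit."""
    def __init__(self):
        # atexit stores a reference to the bound method, which references self
        atexit.register(self.exit)

    def exit(self):
        calls.append("cleanup")

engine = Engine()
del engine       # drops our reference, but atexit still holds self.exit
gc.collect()     # even a forced collection does not trigger the handler
print(calls)     # [] -- cleanup has not run; it fires only at interpreter exit
```

This mirrors step 2 of the timeline: the caller's reference is gone, yet the cleanup (and, in nanovllm, the `destroy_process_group()` that would release port 2333) has not happened.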