Merge branch 'zijie/fix-dist-3': Fix distributed port conflict
- Auto port allocation with _find_free_port() in model_runner.py
- Resource management refactor with close() + context manager in llm_engine.py
- Add tests/test_port_conflict.py and tests/run_parallel_niah.sh
- Remove docs/torch_distributed_port_issue.md (issue fixed)
- Ignore tests/data/ directory

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
# Torch Distributed Port Conflict Issue

## Problem Summary

When attempting to create multiple `LLM` instances sequentially in the same Python process (e.g., for grouped testing), the second and subsequent instances fail with:

```
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address.
port: 2333, useIpv6: false, code: -98, name: EADDRINUSE, message: address already in use
```
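The failure mode is ordinary TCP port exhaustion, not anything torch-specific. A minimal sketch with plain sockets reproduces it (standard library only; `errno.EADDRINUSE` is 98 on Linux, which appears negated as `code: -98` in the torch error):

```python
import errno
import socket

# Bind a first socket and start listening, like init_process_group()
# holding its rendezvous port.
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 0))
first.listen()
port = first.getsockname()[1]

# A second bind to the same port fails the same way the second LLM does.
second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
caught = None
try:
    second.bind(("127.0.0.1", port))
except OSError as e:
    caught = e.errno
finally:
    second.close()
    first.close()

assert caught == errno.EADDRINUSE
print("second bind failed with EADDRINUSE")
```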
## Root Cause Analysis

### 1. Distributed Process Group Initialization

In `nanovllm/engine/model_runner.py:30-32`:

```python
import os
port = os.environ.get("NANOVLLM_DIST_PORT", "2333")
dist.init_process_group("nccl", f"tcp://localhost:{port}", world_size=self.world_size, rank=rank)
```

- Default port is **2333** (configurable via the `NANOVLLM_DIST_PORT` env var)
- `init_process_group()` binds a TCP socket to this port
- The binding persists until `destroy_process_group()` is called
### 2. Cleanup Mechanism

In `nanovllm/engine/llm_engine.py:37`:

```python
atexit.register(self.exit)
```

In `nanovllm/engine/llm_engine.py:39-43`:

```python
def exit(self):
    self.model_runner.call("exit")
    del self.model_runner
    for p in self.ps:
        p.join()
```

In `nanovllm/engine/model_runner.py:66-78`:

```python
def exit(self):
    # ... cleanup code ...
    dist.destroy_process_group()
```
### 3. The Problem

**`atexit` handlers only run when the Python interpreter exits, NOT when the object is deleted or goes out of scope.**

Timeline of the bug:

```
1. Create LLM instance #1
   ├── init_process_group() binds port 2333 ✓
   └── atexit.register(self.exit) registered

2. LLM #1 goes out of scope (garbage collected)
   ├── Python's GC deletes the object
   ├── BUT the atexit handler has NOT been triggered yet
   └── Port 2333 is still bound! ❌

3. Create LLM instance #2
   ├── init_process_group() tries to bind port 2333
   └── EADDRINUSE error! ❌

4. Program exits (only now does atexit run)
   └── Too late - already crashed
```
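The core behavior is easy to demonstrate with the standard library alone (a minimal sketch; `Holder` is a stand-in for `LLMEngine`, not nanovllm code):

```python
import atexit

cleaned_up = []

class Holder:
    """Stand-in for LLMEngine: cleanup is registered with atexit only."""
    def __init__(self):
        atexit.register(self.exit)

    def exit(self):
        cleaned_up.append(True)

h = Holder()
del h  # the name is gone...
# ...but the atexit handler has not fired: the "port" would still be
# bound at this point, and a second Holder would hit EADDRINUSE.
assert cleaned_up == []
print("cleanup has not run; it only runs at interpreter exit")
```

Note also that `atexit.register(self.exit)` stores the bound method, which holds a strong reference to the instance, so the object is not even eligible for garbage collection after `del`.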
## Impact

This issue affects:

1. **Grouped testing mode** (`test_ruler_niah.py --group-size N`)
   - Each group needs a fresh LLM instance
   - The second group fails with a port conflict

2. **Multiple LLM instances in the same process**
   - Any code that creates an LLM, deletes it, then creates another

3. **Interactive/notebook usage**
   - Re-running cells that create LLM instances
## Proposed Solutions

### Solution A: Add `__del__` Method (Quick Fix)

Add a destructor to `LLMEngine` that calls cleanup:

```python
# In nanovllm/engine/llm_engine.py

def __del__(self):
    try:
        self.exit()
    except Exception:
        pass  # Ignore errors during cleanup
```

**Pros**: Simple, backwards compatible
**Cons**: `__del__` is not guaranteed to be called (circular references, etc.)
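One caveat worth spelling out: because the current code calls `atexit.register(self.exit)`, atexit holds a strong reference to the engine via the bound method, so `__del__` alone would never fire before interpreter exit. A minimal sketch of both effects, using stand-in classes rather than nanovllm code:

```python
import atexit
import gc

log = []

class WithDel:
    def __del__(self):
        log.append("del ran")

# Without an atexit registration, CPython's reference counting runs
# __del__ as soon as the last reference disappears.
obj = WithDel()
del obj
assert log == ["del ran"]

class Registered:
    def __init__(self):
        atexit.register(self.exit)  # bound method -> strong ref to self

    def exit(self):
        log.append("exit ran")

    def __del__(self):
        log.append("registered del ran")

obj2 = Registered()
del obj2
gc.collect()
# atexit still references the instance, so neither __del__ nor exit()
# has run: __del__ alone cannot fix the port leak here.
assert log == ["del ran"]
print("atexit registration keeps the object alive")
```

This suggests `__del__` is only useful in combination with unregistering the atexit handler (or with the idempotent `exit()` guard shown in the recommended implementation below).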
### Solution B: Context Manager Pattern (Recommended)

Make `LLMEngine` a context manager:

```python
# In nanovllm/engine/llm_engine.py

def __enter__(self):
    return self

def __exit__(self, exc_type, exc_val, exc_tb):
    self.exit()
    return False
```

Usage:

```python
with LLM(model_path) as llm:
    outputs = llm.generate(prompts, params)
# Cleanup happens automatically here
```

**Pros**: Explicit, guaranteed cleanup, Pythonic
**Cons**: Requires a usage-pattern change
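Until the class itself gains `__enter__`/`__exit__`, callers can get the same guarantee by wrapping any engine-like object with `contextlib` (a sketch; `DummyEngine` is a hypothetical stand-in for anything exposing an `exit()` method):

```python
from contextlib import contextmanager

class DummyEngine:
    """Stand-in for LLMEngine: anything exposing exit()."""
    def __init__(self):
        self.exited = False

    def exit(self):
        self.exited = True

@contextmanager
def managed(engine):
    """Guarantee engine.exit() runs, even if the body raises."""
    try:
        yield engine
    finally:
        engine.exit()

eng = DummyEngine()
with managed(eng) as e:
    pass  # llm.generate(...) would go here
assert eng.exited
print("exit() was called on block exit")
```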
### Solution C: Check and Cleanup Before Init (Defensive)

In `ModelRunner.__init__`, check whether a process group already exists:

```python
# In nanovllm/engine/model_runner.py

if dist.is_initialized():
    dist.destroy_process_group()
dist.init_process_group("nccl", f"tcp://localhost:{port}", ...)
```

**Pros**: Self-healing, no usage-pattern change
**Cons**: May mask other issues; manipulates global state
### Solution D: Subprocess Isolation (For Testing)

For grouped testing specifically, run each group in a subprocess:

```python
import subprocess
import sys

for group in groups:
    subprocess.run([sys.executable, "test_ruler_niah.py",
                    "--sample-indices", f"{start}-{end}"])
```

**Pros**: Complete isolation, no code changes to nanovllm
**Cons**: More overhead, only solves the testing use case
### Solution E: Dynamic Port Allocation

Instead of the fixed port 2333, pick a free port at runtime:

```python
import os
import socket

def find_free_port():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(('', 0))
        return s.getsockname()[1]

port = os.environ.get("NANOVLLM_DIST_PORT") or find_free_port()
```

**Pros**: Avoids conflicts entirely
**Cons**: More complex; the freed port can be taken by another process before `init_process_group()` rebinds it (a race), and all ranks must agree on the chosen port
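A quick sanity check of the helper (a sketch; note that `os.environ.get` returns a string while `find_free_port()` returns an int, so in practice the result should be normalized with `int(...)` before use):

```python
import socket

def find_free_port():
    # Bind to port 0: the OS assigns a free ephemeral port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(('', 0))
        return s.getsockname()[1]

port = find_free_port()
assert 1024 <= port <= 65535  # ephemeral ports are unprivileged
print(f"allocated port {port}")

# The socket is closed on return, so the port is immediately bindable
# again; nothing reserves it, which is exactly the race noted above.
probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
probe.bind(('', port))
probe.close()
```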
## Recommended Implementation

**Combine Solutions A + B + C** for maximum robustness:

1. Add `__del__` for best-effort cleanup
2. Add a context manager for explicit cleanup
3. Add an `is_initialized()` check as a defensive measure

```python
# nanovllm/engine/llm_engine.py

class LLMEngine:
    def __init__(self, model, **kwargs):
        # ... existing code ...
        self._exited = False
        atexit.register(self.exit)

    def exit(self):
        if self._exited:
            return
        self._exited = True
        self.model_runner.call("exit")
        del self.model_runner
        for p in self.ps:
            p.join()

    def __del__(self):
        try:
            self.exit()
        except Exception:
            pass

    def __enter__(self):
        return self

    def __exit__(self, *args):
        self.exit()
        return False


# nanovllm/engine/model_runner.py

class ModelRunner:
    def __init__(self, config: Config, rank: int, event):
        # ... existing code before init_process_group ...

        import os
        port = os.environ.get("NANOVLLM_DIST_PORT", "2333")

        # Defensive cleanup
        if dist.is_initialized():
            dist.destroy_process_group()

        dist.init_process_group("nccl", f"tcp://localhost:{port}",
                                world_size=self.world_size, rank=rank)
        # ... rest of init ...
```
## Workaround for Current Code

Until the fix is implemented, use one of these workarounds:

### Workaround 1: Manual Cleanup

```python
llm = LLM(model_path)
outputs = llm.generate(...)
llm.model_runner.call("exit")  # Manual cleanup
del llm

# Now a new LLM can be created
llm2 = LLM(model_path)
```
### Workaround 2: Subprocess Testing

```bash
# Run each test group as a separate process
for i in $(seq 0 5 95); do
    python test_ruler_niah.py --sample-indices $i-$((i+4)) --enable-offload
done
```
### Workaround 3: Environment Variable Port

```bash
# Use a different port for each run
NANOVLLM_DIST_PORT=2334 python test.py
NANOVLLM_DIST_PORT=2335 python test.py
```
## Related Files

| File | Relevant Code |
|------|---------------|
| `nanovllm/engine/model_runner.py:30-32` | `init_process_group()` call |
| `nanovllm/engine/model_runner.py:66-78` | `exit()` and `destroy_process_group()` |
| `nanovllm/engine/llm_engine.py:37` | `atexit.register()` |
| `nanovllm/engine/llm_engine.py:39-43` | `exit()` method |
## Testing the Fix

After implementing the fix, verify with:

```python
# test_multiple_llm.py
from nanovllm import LLM, SamplingParams

for i in range(3):
    print(f"Creating LLM instance {i+1}")
    llm = LLM("path/to/model", enable_cpu_offload=True)
    outputs = llm.generate(["Hello"], SamplingParams(max_tokens=10))
    print(f"Instance {i+1} output: {outputs[0]['text']}")
    del llm
    print(f"Instance {i+1} deleted\n")

print("All instances created and deleted successfully!")
```

Expected: no port conflict errors; all 3 instances work.
## Priority

**High** - This blocks grouped testing and any multi-LLM-instance workflows.