nano-vllm/docs/torch_distributed_port_issue.md
2026-01-12 15:16:39 +08:00

Torch Distributed Port Conflict Issue

Problem Summary

When attempting to create multiple LLM instances sequentially in the same Python process (e.g., for grouped testing), the second and subsequent instances fail with:

torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address.
port: 2333, useIpv6: false, code: -98, name: EADDRINUSE, message: address already in use

Root Cause Analysis

1. Distributed Process Group Initialization

In nanovllm/engine/model_runner.py:30-32:

import os
port = os.environ.get("NANOVLLM_DIST_PORT", "2333")
dist.init_process_group("nccl", f"tcp://localhost:{port}", world_size=self.world_size, rank=rank)
  • Default port is 2333 (configurable via NANOVLLM_DIST_PORT env var)
  • init_process_group() binds a TCP socket to this port
  • This binding persists until destroy_process_group() is called
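
The failure mode is ordinary TCP socket semantics, independent of torch; a minimal sketch with the standard socket module reproduces it:

```python
import errno
import socket

# Simulate what the TCP store behind init_process_group() does: bind a
# listening socket on a fixed port and keep it open.
s1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s1.bind(("127.0.0.1", 0))        # let the OS pick a free port
s1.listen()
port = s1.getsockname()[1]

# A second bind to the same port fails exactly like the torch error above.
s2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    s2.bind(("127.0.0.1", port))
    conflict = False
except OSError as e:
    conflict = e.errno == errno.EADDRINUSE  # errno 98 on Linux

s1.close()
s2.close()
print(conflict)  # True -- the port is unusable until the first socket closes
```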

2. Cleanup Mechanism

In nanovllm/engine/llm_engine.py:37:

atexit.register(self.exit)

In nanovllm/engine/llm_engine.py:39-43:

def exit(self):
    self.model_runner.call("exit")
    del self.model_runner
    for p in self.ps:
        p.join()

In nanovllm/engine/model_runner.py:66-78:

def exit(self):
    # ... cleanup code ...
    dist.destroy_process_group()

3. The Problem

atexit only triggers when the Python interpreter exits, NOT when the object is deleted or goes out of scope.
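
A few lines verify this (`Engine` here is a hypothetical stand-in for LLMEngine):

```python
import atexit

log = []

class Engine:
    def __init__(self):
        # mirrors LLMEngine.__init__: register cleanup at interpreter exit
        atexit.register(self.exit)

    def exit(self):
        log.append("cleaned up")

e = Engine()
del e          # drops the name, but atexit still references the bound method
print(log)     # [] -- cleanup has NOT run; the port would still be bound
```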

Timeline of the bug:

1. Create LLM instance #1
   ├── init_process_group() binds port 2333 ✓
   └── atexit.register(self.exit) registered

2. LLM #1 goes out of scope
   ├── atexit.register(self.exit) still holds a reference to the bound
   │   method, so the engine is never garbage collected
   ├── atexit handler NOT triggered yet (interpreter still running)
   └── Port 2333 still bound! ❌

3. Create LLM instance #2
   ├── init_process_group() tries to bind port 2333
   └── EADDRINUSE error! ❌

4. Program exits (only now atexit runs)
   └── Too late - already crashed

Impact

This issue affects:

  1. Grouped testing mode (test_ruler_niah.py --group-size N)

    • Each group needs a fresh LLM instance
    • Second group fails with port conflict
  2. Multiple LLM instances in same process

    • Any code that creates LLM, deletes it, then creates another
  3. Interactive/notebook usage

    • Re-running cells that create LLM instances

Proposed Solutions

Solution A: Add __del__ Method (Quick Fix)

Add destructor to LLMEngine that calls cleanup:

# In nanovllm/engine/llm_engine.py

def __del__(self):
    try:
        self.exit()
    except Exception:
        pass  # Ignore errors during cleanup

Pros: Simple, backwards compatible
Cons: __del__ is not guaranteed to run promptly (reference cycles, interpreter shutdown ordering); worse, the reference held by atexit.register(self.exit) keeps the object alive, so __del__ alone may never fire
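
The "not guaranteed" caveat is easy to demonstrate with a reference cycle (no atexit involved here):

```python
import gc

calls = []

class A:
    def __del__(self):
        calls.append("del")

a = A()
a.self_ref = a   # reference cycle: refcount never drops to zero
del a
print(calls)     # [] -- __del__ deferred until the cycle collector runs
gc.collect()
print(calls)     # ['del']
```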

Solution B: Context Manager Protocol

Make LLMEngine a context manager:

# In nanovllm/engine/llm_engine.py

def __enter__(self):
    return self

def __exit__(self, exc_type, exc_val, exc_tb):
    self.exit()
    return False

Usage:

with LLM(model_path) as llm:
    outputs = llm.generate(prompts, params)
# Cleanup happens automatically here

Pros: Explicit, guaranteed cleanup, Pythonic
Cons: Requires a usage-pattern change in calling code
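
Callers who cannot wait for the class change can get the same guarantee today with a small wrapper (`managed` and `FakeEngine` below are hypothetical names; only an `exit()` method is assumed):

```python
from contextlib import contextmanager

@contextmanager
def managed(engine):
    """Run exit() when the with-block ends, even on exceptions."""
    try:
        yield engine
    finally:
        engine.exit()

class FakeEngine:
    """Stand-in for LLM; records whether exit() ran."""
    def __init__(self):
        self.closed = False
    def exit(self):
        self.closed = True

with managed(FakeEngine()) as llm:
    captured = llm
print(captured.closed)  # True -- exit() ran when the block ended
```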

Solution C: Check and Cleanup Before Init (Defensive)

In ModelRunner.__init__, check if process group exists:

# In nanovllm/engine/model_runner.py

if dist.is_initialized():
    dist.destroy_process_group()
dist.init_process_group("nccl", f"tcp://localhost:{port}", ...)

Pros: Self-healing, no usage pattern change
Cons: May mask other issues; manipulates global process-group state

Solution D: Subprocess Isolation (For Testing)

For grouped testing specifically, run each group in a subprocess:

import subprocess
import sys

for start, end in groups:  # e.g. [(0, 4), (5, 9), ...]
    subprocess.run(
        [sys.executable, "test_ruler_niah.py",
         "--sample-indices", f"{start}-{end}"],
        check=True,  # fail fast if a group crashes
    )

Pros: Complete isolation, no code changes to nanovllm
Cons: More overhead; only solves the testing use case
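
The property Solution D relies on, that the OS closes all sockets when a process exits, can be checked without torch:

```python
import socket
import subprocess
import sys

# Find a port that is free right now.
probe = socket.socket()
probe.bind(("127.0.0.1", 0))
port = probe.getsockname()[1]
probe.close()

# Each child binds the SAME fixed port, like init_process_group would.
child = f"import socket; s = socket.socket(); s.bind(('127.0.0.1', {port})); s.listen()"

# Sequential children never conflict: each socket dies with its process.
codes = [
    subprocess.run([sys.executable, "-c", child], check=True).returncode
    for _ in range(3)
]
print(codes)  # [0, 0, 0]
```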

Solution E: Dynamic Port Allocation

Instead of fixed port 2333, use dynamic port:

import socket

def find_free_port():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(('', 0))
        return s.getsockname()[1]

port = os.environ.get("NANOVLLM_DIST_PORT") or find_free_port()

Pros: Avoids conflicts entirely
Cons: The port may be claimed by another process between find_free_port() returning and init_process_group() binding (a race), and every rank must be told the same port

Recommended Approach

Combine Solutions A + B + C for maximum robustness:

  1. Add __del__ for best-effort cleanup
  2. Add context manager for explicit cleanup
  3. Add is_initialized() check as defensive measure

# nanovllm/engine/llm_engine.py

class LLMEngine:
    def __init__(self, model, **kwargs):
        # ... existing code ...
        atexit.register(self.exit)
        self._exited = False

    def exit(self):
        if self._exited:
            return
        self._exited = True
        atexit.unregister(self.exit)  # drop the reference atexit holds,
                                      # so __del__ can actually fire
        self.model_runner.call("exit")
        del self.model_runner
        for p in self.ps:
            p.join()

    def __del__(self):
        try:
            self.exit()
        except Exception:
            pass

    def __enter__(self):
        return self

    def __exit__(self, *args):
        self.exit()
        return False


# nanovllm/engine/model_runner.py

class ModelRunner:
    def __init__(self, config: Config, rank: int, event):
        # ... existing code before init_process_group ...

        import os
        port = os.environ.get("NANOVLLM_DIST_PORT", "2333")

        # Defensive cleanup
        if dist.is_initialized():
            dist.destroy_process_group()

        dist.init_process_group("nccl", f"tcp://localhost:{port}",
                                world_size=self.world_size, rank=rank)
        # ... rest of init ...
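
A self-contained sketch of the combined pattern (`Engine` is a minimal stand-in for LLMEngine; the atexit.unregister call is an assumption of this sketch, added so atexit no longer keeps the object alive):

```python
import atexit

ran = []

class Engine:
    """Minimal stand-in for LLMEngine demonstrating the combined pattern."""
    def __init__(self):
        self._exited = False
        atexit.register(self.exit)

    def exit(self):
        if self._exited:              # guard: repeated calls are harmless
            return
        self._exited = True
        atexit.unregister(self.exit)  # drop the reference atexit holds
        ran.append("cleanup ran")     # real code: call("exit"), join workers

    def __del__(self):                # best-effort cleanup on GC
        try:
            self.exit()
        except Exception:
            pass

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.exit()
        return False

with Engine():
    pass

print(ran)  # ['cleanup ran'] -- exactly once, via __exit__
```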

Workaround for Current Code

Until the fix is implemented, use one of these workarounds:

Workaround 1: Manual Cleanup

llm = LLM(model_path)
outputs = llm.generate(...)
llm.model_runner.call("exit")  # manual cleanup: runs dist.destroy_process_group()
del llm

# Now a new LLM can be created
llm2 = LLM(model_path)

Workaround 2: Subprocess Testing

# Run each test group as separate process
for i in $(seq 0 5 95); do
    python test_ruler_niah.py --sample-indices $i-$((i+4)) --enable-offload
done

Workaround 3: Environment Variable Port

# Use different port for each run
NANOVLLM_DIST_PORT=2334 python test.py
NANOVLLM_DIST_PORT=2335 python test.py
File                                     Relevant Code
nanovllm/engine/model_runner.py:30-32    init_process_group() call
nanovllm/engine/model_runner.py:66-78    exit() and destroy_process_group()
nanovllm/engine/llm_engine.py:37         atexit.register()
nanovllm/engine/llm_engine.py:39-43      exit() method

Testing the Fix

After implementing the fix, verify with:

# test_multiple_llm.py
from nanovllm import LLM, SamplingParams

for i in range(3):
    print(f"Creating LLM instance {i+1}")
    llm = LLM("path/to/model", enable_cpu_offload=True)
    outputs = llm.generate(["Hello"], SamplingParams(max_tokens=10))
    print(f"Instance {i+1} output: {outputs[0]['text']}")
    del llm
    print(f"Instance {i+1} deleted\n")

print("All instances created and deleted successfully!")

Expected: No port conflict errors, all 3 instances work.

Priority

High - This blocks grouped testing and any multi-LLM-instance workflows.