Merge branch 'zijie/fix-dist-3': Fix distributed port conflict
- Auto port allocation with _find_free_port() in model_runner.py
- Resource management refactor with close() + context manager in llm_engine.py
- Add tests/test_port_conflict.py and tests/run_parallel_niah.sh
- Remove docs/torch_distributed_port_issue.md (issue fixed)
- Ignore tests/data/ directory

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
`.gitignore` (vendored, 3 lines changed)
@@ -224,3 +224,6 @@ coordination/orchestration/*
 claude-flow
 # Removed Windows wrapper files per user request
 hive-mind-prompt-*.txt
+
+# Test data
+tests/data/
`CLAUDE.md` (15 lines changed)
@@ -22,19 +22,9 @@ while [ -n "$(nvidia-smi --query-compute-apps=pid --format=csv,noheader)" ]; do
 done
 ```
 
-### Other Scripts (tests, examples) - Port Conflict Check Only
+### Other Scripts (tests, examples) - No Special Requirements
 
-For non-benchmark scripts, exclusive GPU access is NOT required. However, check for **distributed port conflicts** before running:
+For non-benchmark scripts, exclusive GPU access is NOT required. Multiple nanovllm processes can run simultaneously on different GPUs - each process automatically selects a unique port for `torch.distributed` communication.
 
-```bash
-# Check if port 2333 (nanovllm default) is in use
-if lsof -i :2333 >/dev/null 2>&1; then
-    echo "Port 2333 in use, waiting 10s..."
-    sleep 10
-fi
-```
-
-**Note**: nanovllm uses port 2333 for `torch.distributed`. See [`docs/torch_distributed_port_issue.md`](docs/torch_distributed_port_issue.md) for known issues with creating multiple LLM instances in the same process.
-
 ## Multi-Instance Development with PYTHONPATH
 
@@ -68,7 +58,6 @@ PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH python tests/test_needle.py
 | [`docs/layerwise_offload_memory_analysis.md`](docs/layerwise_offload_memory_analysis.md) | Memory allocation analysis with theoretical formulas and empirical validation (< 5% error) |
 | [`docs/debugging_guide.md`](docs/debugging_guide.md) | PyTorch hooks for debugging, tensor comparison, memory profiling |
 | [`docs/gpu_only_performance_issue.md`](docs/gpu_only_performance_issue.md) | GPU-only mode slower than offload due to PagedAttention scatter overhead, optimization proposals |
-| [`docs/torch_distributed_port_issue.md`](docs/torch_distributed_port_issue.md) | **BUG**: Port conflict when creating multiple LLM instances, root cause and proposed solutions |
 | [`docs/offload_accuracy_issue.md`](docs/offload_accuracy_issue.md) | **BUG**: CPU offload mode 66% accuracy vs 100% non-offload on RULER NIAH benchmark |
 
 ## Configuration
`docs/torch_distributed_port_issue.md` (deleted, 308 lines)

@@ -1,308 +0,0 @@
# Torch Distributed Port Conflict Issue

## Problem Summary

When attempting to create multiple `LLM` instances sequentially in the same Python process (e.g., for grouped testing), the second and subsequent instances fail with:

```
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address.
port: 2333, useIpv6: false, code: -98, name: EADDRINUSE, message: address already in use
```

## Root Cause Analysis

### 1. Distributed Process Group Initialization

In `nanovllm/engine/model_runner.py:30-32`:

```python
import os
port = os.environ.get("NANOVLLM_DIST_PORT", "2333")
dist.init_process_group("nccl", f"tcp://localhost:{port}", world_size=self.world_size, rank=rank)
```

- The default port is **2333** (configurable via the `NANOVLLM_DIST_PORT` env var)
- `init_process_group()` binds a TCP socket to this port
- The binding persists until `destroy_process_group()` is called

### 2. Cleanup Mechanism

In `nanovllm/engine/llm_engine.py:37`:

```python
atexit.register(self.exit)
```

In `nanovllm/engine/llm_engine.py:39-43`:

```python
def exit(self):
    self.model_runner.call("exit")
    del self.model_runner
    for p in self.ps:
        p.join()
```

In `nanovllm/engine/model_runner.py:66-78`:

```python
def exit(self):
    # ... cleanup code ...
    dist.destroy_process_group()
```

### 3. The Problem

**`atexit` only triggers when the Python interpreter exits, NOT when the object is deleted or goes out of scope.**

Timeline of the bug:

```
1. Create LLM instance #1
   ├── init_process_group() binds port 2333 ✓
   └── atexit.register(self.exit) registered

2. LLM #1 goes out of scope (garbage collected)
   ├── Python's GC deletes the object
   ├── BUT the atexit handler has NOT been triggered yet
   └── Port 2333 is still bound! ❌

3. Create LLM instance #2
   ├── init_process_group() tries to bind port 2333
   └── EADDRINUSE error! ❌

4. Program exits (only now does atexit run)
   └── Too late - already crashed
```
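The atexit pitfall in this timeline can be reproduced with plain Python; `FakeEngine` below is a hypothetical stand-in for `LLMEngine`, not nanovllm code:

```python
import atexit

cleaned = []

class FakeEngine:
    """Hypothetical stand-in for LLMEngine: registers cleanup via atexit."""
    def __init__(self):
        atexit.register(self.exit)

    def exit(self):
        cleaned.append("port released")

engine = FakeEngine()
del engine  # the name is gone, but nothing has been cleaned up

# atexit.register() also keeps a reference to the bound method, so the
# object is not even garbage collected here; the handler runs only when
# the interpreter exits.
print(cleaned)  # → []
```

In the real engine, the state that survives the `del` is the bound TCP socket, which is exactly why a second `init_process_group()` on the same port fails.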
## Impact

This issue affects:

1. **Grouped testing mode** (`test_ruler_niah.py --group-size N`)
   - Each group needs a fresh LLM instance
   - The second group fails with a port conflict

2. **Multiple LLM instances in the same process**
   - Any code that creates an LLM, deletes it, then creates another

3. **Interactive/notebook usage**
   - Re-running cells that create LLM instances

## Proposed Solutions

### Solution A: Add `__del__` Method (Quick Fix)

Add a destructor to `LLMEngine` that calls cleanup:

```python
# In nanovllm/engine/llm_engine.py

def __del__(self):
    try:
        self.exit()
    except Exception:
        pass  # Ignore errors during cleanup
```

**Pros**: Simple, backwards compatible
**Cons**: `__del__` is not guaranteed to be called (circular references, etc.)

### Solution B: Context Manager Pattern (Recommended)

Make `LLMEngine` a context manager:

```python
# In nanovllm/engine/llm_engine.py

def __enter__(self):
    return self

def __exit__(self, exc_type, exc_val, exc_tb):
    self.exit()
    return False
```

Usage:
```python
with LLM(model_path) as llm:
    outputs = llm.generate(prompts, params)
# Cleanup happens automatically here
```
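The guarantee Solution B relies on can be checked with a dependency-free sketch (a dummy engine, not nanovllm's): `__exit__` runs even when the body raises.

```python
class FakeEngine:
    """Dummy engine used to illustrate the context manager contract."""
    def __init__(self):
        self.closed = False

    def exit(self):
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.exit()
        return False  # do not suppress the exception

eng = FakeEngine()
try:
    with eng:
        raise RuntimeError("generation failed mid-run")
except RuntimeError:
    pass

print(eng.closed)  # → True: cleanup ran despite the exception
```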
**Pros**: Explicit, guaranteed cleanup, Pythonic
**Cons**: Requires a usage pattern change

### Solution C: Check and Cleanup Before Init (Defensive)

In `ModelRunner.__init__`, check whether a process group already exists:

```python
# In nanovllm/engine/model_runner.py

if dist.is_initialized():
    dist.destroy_process_group()
dist.init_process_group("nccl", f"tcp://localhost:{port}", ...)
```

**Pros**: Self-healing, no usage pattern change
**Cons**: May mask other issues, manipulates global state

### Solution D: Subprocess Isolation (For Testing)

For grouped testing specifically, run each group in a subprocess:

```python
import subprocess
for group in groups:
    subprocess.run([sys.executable, "test_ruler_niah.py",
                    "--sample-indices", f"{start}-{end}"])
```

**Pros**: Complete isolation, no code changes to nanovllm
**Cons**: More overhead, only solves the testing use case

### Solution E: Dynamic Port Allocation

Instead of the fixed port 2333, use a dynamically assigned port:

```python
import socket

def find_free_port():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(('', 0))
        return s.getsockname()[1]

port = os.environ.get("NANOVLLM_DIST_PORT") or find_free_port()
```

**Pros**: Avoids conflicts entirely
**Cons**: More complex, may have side effects

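The `find_free_port()` helper from Solution E is runnable as-is; the sketch below also spells out the caveat hiding behind "may have side effects": the port is only known to be free at the moment of the check.

```python
import socket

def find_free_port() -> int:
    # Binding to port 0 asks the OS to pick any unused port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

port = find_free_port()
print(0 < port < 65536)  # → True

# Caveat: the socket is closed before the caller re-binds the port, so
# another process could grab it in between (a TOCTOU race). In practice
# the window is tiny, but it is not zero.
```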
## Recommended Implementation

**Combine Solutions A + B + C** for maximum robustness:

1. Add `__del__` for best-effort cleanup
2. Add a context manager for explicit cleanup
3. Add an `is_initialized()` check as a defensive measure

```python
# nanovllm/engine/llm_engine.py

class LLMEngine:
    def __init__(self, model, **kwargs):
        # ... existing code ...
        atexit.register(self.exit)
        self._exited = False

    def exit(self):
        if self._exited:
            return
        self._exited = True
        self.model_runner.call("exit")
        del self.model_runner
        for p in self.ps:
            p.join()

    def __del__(self):
        try:
            self.exit()
        except Exception:
            pass

    def __enter__(self):
        return self

    def __exit__(self, *args):
        self.exit()
        return False


# nanovllm/engine/model_runner.py

class ModelRunner:
    def __init__(self, config: Config, rank: int, event):
        # ... existing code before init_process_group ...

        import os
        port = os.environ.get("NANOVLLM_DIST_PORT", "2333")

        # Defensive cleanup
        if dist.is_initialized():
            dist.destroy_process_group()

        dist.init_process_group("nccl", f"tcp://localhost:{port}",
                                world_size=self.world_size, rank=rank)
        # ... rest of init ...
```
## Workaround for Current Code

Until the fix is implemented, use one of these workarounds:

### Workaround 1: Manual Cleanup

```python
import torch.distributed as dist

llm = LLM(model_path)
outputs = llm.generate(...)
llm.model_runner.call("exit")  # Manual cleanup
del llm

# Now a new LLM can be created
llm2 = LLM(model_path)
```

### Workaround 2: Subprocess Testing

```bash
# Run each test group as a separate process
for i in $(seq 0 5 95); do
    python test_ruler_niah.py --sample-indices $i-$((i+4)) --enable-offload
done
```

### Workaround 3: Environment Variable Port

```bash
# Use a different port for each run
NANOVLLM_DIST_PORT=2334 python test.py
NANOVLLM_DIST_PORT=2335 python test.py
```

## Related Files

| File | Relevant Code |
|------|---------------|
| `nanovllm/engine/model_runner.py:30-32` | `init_process_group()` call |
| `nanovllm/engine/model_runner.py:66-78` | `exit()` and `destroy_process_group()` |
| `nanovllm/engine/llm_engine.py:37` | `atexit.register()` |
| `nanovllm/engine/llm_engine.py:39-43` | `exit()` method |

## Testing the Fix

After implementing the fix, verify with:

```python
# test_multiple_llm.py
from nanovllm import LLM, SamplingParams

for i in range(3):
    print(f"Creating LLM instance {i+1}")
    llm = LLM("path/to/model", enable_cpu_offload=True)
    outputs = llm.generate(["Hello"], SamplingParams(max_tokens=10))
    print(f"Instance {i+1} output: {outputs[0]['text']}")
    del llm
    print(f"Instance {i+1} deleted\n")

print("All instances created and deleted successfully!")
```

Expected: no port conflict errors; all 3 instances work.

## Priority

**High** - This blocks grouped testing and any multi-LLM-instance workflows.
`findings.md` (275 lines changed)
@@ -1,160 +1,169 @@
-# Findings: Multi-Model Support Analysis
-
-## Current Architecture Analysis
-
-### Model Loading Flow
-```
-LLM(model_path)
-  → LLMEngine.__init__()
-    → Config.__post_init__()
-      → hf_config = AutoConfig.from_pretrained(model)
-    → ModelRunner.__init__()
-      → model = Qwen3ForCausalLM(hf_config)  ← HARDCODED
-      → load_model(model, config.model)
-```
-
-### Key Files
-| File | Purpose |
-|------|---------|
-| `nanovllm/engine/model_runner.py` | Model loading and execution |
-| `nanovllm/models/qwen3.py` | Qwen3 model definition |
-| `nanovllm/utils/loader.py` | safetensors weight loading |
-| `nanovllm/layers/rotary_embedding.py` | RoPE implementation |
-
----
-
-## Llama 3.1 Config Analysis
-
-```json
-{
-  "architectures": ["LlamaForCausalLM"],
-  "model_type": "llama",
-  "attention_bias": false,
-  "mlp_bias": false,
-  "head_dim": 128,
-  "hidden_size": 4096,
-  "intermediate_size": 14336,
-  "num_attention_heads": 32,
-  "num_hidden_layers": 32,
-  "num_key_value_heads": 8,
-  "hidden_act": "silu",
-  "rms_norm_eps": 1e-05,
-  "rope_theta": 500000.0,
-  "rope_scaling": {
-    "factor": 8.0,
-    "high_freq_factor": 4.0,
-    "low_freq_factor": 1.0,
-    "original_max_position_embeddings": 8192,
-    "rope_type": "llama3"
-  },
-  "max_position_embeddings": 131072,
-  "tie_word_embeddings": false,
-  "vocab_size": 128256
-}
-```
-
-### Llama 3 RoPE Scaling
-Llama 3 uses a special RoPE scaling strategy (`rope_type: "llama3"`):
-- high-frequency components are kept unchanged (they encode short-range dependencies)
-- low-frequency components are scaled/interpolated (they encode long-range dependencies)
-- parameters: `factor`, `low_freq_factor`, `high_freq_factor`, `original_max_position_embeddings`
-
-Reference implementation (transformers):
-```python
-def _compute_llama3_parameters(config, device, inv_freq):
-    factor = config.factor
-    low_freq_factor = config.low_freq_factor
-    high_freq_factor = config.high_freq_factor
-    old_context_len = config.original_max_position_embeddings
-
-    low_freq_wavelen = old_context_len / low_freq_factor
-    high_freq_wavelen = old_context_len / high_freq_factor
-
-    wavelen = 2 * math.pi / inv_freq
-    inv_freq_llama = torch.where(
-        wavelen > low_freq_wavelen,
-        inv_freq / factor,
-        inv_freq
-    )
-    smooth_factor = (old_context_len / wavelen - low_freq_factor) / (high_freq_factor - low_freq_factor)
-    smoothed_inv_freq = (1 - smooth_factor) * inv_freq_llama + smooth_factor * inv_freq
-    is_medium_freq = (wavelen >= high_freq_wavelen) & (wavelen <= low_freq_wavelen)
-    inv_freq_llama = torch.where(is_medium_freq, smoothed_inv_freq, inv_freq_llama)
-    return inv_freq_llama
-```
-
----
-
-## Weight Mapping Analysis
-
-### Qwen3 packed_modules_mapping
-```python
-packed_modules_mapping = {
-    "q_proj": ("qkv_proj", "q"),
-    "k_proj": ("qkv_proj", "k"),
-    "v_proj": ("qkv_proj", "v"),
-    "gate_proj": ("gate_up_proj", 0),
-    "up_proj": ("gate_up_proj", 1),
-}
-```
-
-### Llama Weight Names (from safetensors)
-Expected Llama weight names follow the same pattern as Qwen3:
-- `model.layers.{i}.self_attn.q_proj.weight`
-- `model.layers.{i}.self_attn.k_proj.weight`
-- `model.layers.{i}.self_attn.v_proj.weight`
-- `model.layers.{i}.self_attn.o_proj.weight`
-- `model.layers.{i}.mlp.gate_proj.weight`
-- `model.layers.{i}.mlp.up_proj.weight`
-- `model.layers.{i}.mlp.down_proj.weight`
-- `model.layers.{i}.input_layernorm.weight`
-- `model.layers.{i}.post_attention_layernorm.weight`
-
-**Conclusion**: Llama's `packed_modules_mapping` is identical to Qwen3's and can be reused.
-
----
-
-## Shared Components (Can Reuse)
-
-| Component | File | Notes |
-|-----------|------|-------|
-| `RMSNorm` | `layers/layernorm.py` | generic |
-| `SiluAndMul` | `layers/activation.py` | generic |
-| `Attention` | `layers/attention.py` | FlashAttention wrapper |
-| `QKVParallelLinear` | `layers/linear.py` | supports bias=False |
-| `RowParallelLinear` | `layers/linear.py` | generic |
-| `MergedColumnParallelLinear` | `layers/linear.py` | generic |
-| `VocabParallelEmbedding` | `layers/embed_head.py` | generic |
-| `ParallelLMHead` | `layers/embed_head.py` | generic |
-| `load_model` | `utils/loader.py` | generic |
-
----
-
-## Llama vs Qwen3 Implementation Diff
-
-### Attention
-| Feature | Qwen3Attention | LlamaAttention |
-|---------|----------------|----------------|
-| QKV bias | configurable (attention_bias) | always False |
-| q_norm | present (when bias=False) | absent |
-| k_norm | present (when bias=False) | absent |
-| RoPE | Standard | Llama3 scaled |
-
-### MLP
-| Feature | Qwen3MLP | LlamaMLP |
-|---------|----------|----------|
-| gate/up bias | False | False |
-| down bias | False | False |
-| hidden_act | silu | silu |
-
-**Conclusion**: Llama's MLP is nearly identical to Qwen3's and can be reused directly or simplified.
-
----
-
-## Risk Assessment
-
-| Risk | Impact | Mitigation |
-|------|--------|------------|
-| RoPE implementation error | High - wrong model output | Follow the transformers reference, add unit tests |
-| Weight mapping error | High - model fails to load | Inspect safetensors key names |
-| Registry circular imports | Medium - startup failure | Lazy imports |
+# Findings: Torch Distributed Port Conflict
+
+## Problem Analysis
+
+### Issue Summary
+Creating multiple LLM instances raises a port conflict (EADDRINUSE), so the second instance fails to start.
+
+### Root Cause Deep Dive
+
+#### 1. Where the Resource Is Bound
+
+```python
+# nanovllm/engine/model_runner.py:30-32
+import os
+port = os.environ.get("NANOVLLM_DIST_PORT", "2333")
+dist.init_process_group("nccl", f"tcp://localhost:{port}", world_size=self.world_size, rank=rank)
+```
+
+- The default port is **2333**, configurable via the `NANOVLLM_DIST_PORT` env var
+- `init_process_group()` binds a TCP port for inter-process communication
+- The binding persists until `destroy_process_group()` is called
+
+#### 2. Flawed Cleanup Mechanism
+
+```python
+# nanovllm/engine/llm_engine.py:37
+atexit.register(self.exit)
+
+# nanovllm/engine/llm_engine.py:39-43
+def exit(self):
+    self.model_runner.call("exit")
+    del self.model_runner
+    for p in self.ps:
+        p.join()
+
+# nanovllm/engine/model_runner.py:66-78
+def exit(self):
+    # ... cleanup code ...
+    dist.destroy_process_group()
+```
+
+**Key problem**: `atexit` fires only when the **Python interpreter exits**, not when the object is deleted!
+
+#### 3. Bug Timeline
+
+```
+1. Create LLM #1
+   ├── init_process_group() binds port 2333 ✓
+   └── atexit.register(self.exit) registered
+
+2. LLM #1 goes out of scope or is del'd
+   ├── Python's GC reclaims the object's memory
+   ├── atexit handler not triggered (the process has not exited)
+   ├── Worker processes still running
+   └── Port 2333 still occupied ❌
+
+3. Create LLM #2
+   ├── init_process_group() tries to bind port 2333
+   └── EADDRINUSE error ❌
+
+4. Program exits (only now does atexit run)
+   └── Too late - already crashed
+```
+
+---
+
+## Solution Analysis
+
+### Option Comparison
+
+| Option | Reliability | Backward compatible | Complexity | Recommendation |
+|--------|-------------|---------------------|------------|----------------|
+| `close()` method | Highest | Yes | Low | ★★★★★ |
+| `__del__` method | Medium | Yes | Low | ★★★☆☆ |
+| Port check + retry | Medium | Yes | Low | ★★★☆☆ |
+| Context manager | Highest | Requires code changes | Low | ★★★★☆ |
+| Dynamic port | Low | Yes | Low | ★★☆☆☆ |
+
+### Why Three Layers of Defense
+
+1. **Layer 1: close()** - explicit user control, most reliable
+2. **Layer 2: __del__** - automatic cleanup, covers most scenarios
+3. **Layer 3: port check** - last line of defense, provides a clear error message
+
+### Limitations of `__del__`
+
+Python's `__del__` is not guaranteed to run:
+- it may not fire when reference cycles exist
+- modules it depends on may already be torn down during interpreter shutdown
+- critical resource cleanup should not rely on `__del__`
+
+It is still worthwhile as an **extra layer of protection**, because:
+- it is invoked in the majority of cases
+- it is better than nothing
+- it does not interfere with the other cleanup mechanisms
+
+---
+
+## Code Structure Analysis
+
+### LLMEngine Lifecycle
+
+```
+__init__()
+├── create worker processes (self.ps)
+├── create ModelRunner (self.model_runner)
+├── register atexit handler
+└── set up scheduler, tokenizer
+
+close() [new]
+├── check the _closed flag (idempotent)
+├── unregister the atexit handler
+├── call model_runner.exit()
+├── join worker processes
+└── set _closed = True
+
+__del__() [new]
+└── call close() (ignoring exceptions)
+
+__enter__/__exit__() [new]
+└── context manager support
+```
+
+### ModelRunner Resources
+
+```
+__init__()
+├── torch.distributed init (binds the port)
+├── model loading
+├── KV cache allocation
+├── CUDA graph capture (optional)
+└── SharedMemory creation (multi-GPU)
+
+exit()
+├── SharedMemory cleanup
+├── CUDA graph cleanup
+└── dist.destroy_process_group()
+```
+
+---
+
+## Risk Assessment
+
+| Risk | Impact | Mitigation |
+|------|--------|------------|
+| `__del__` not called | Medium - port leak | Layer-3 port check yields a clear error |
+| Repeated close() calls | Low | `_closed` flag guarantees idempotency |
+| Double atexit invocation | Low | Prevented by unregistration |
+| Leftover child processes | High | join() ensures child processes exit |
+| CUDA resource leak | Medium | Cleaned up by ModelRunner.exit() |
+
+---
+
+## Implementation Notes
+
+### atexit.unregister Compatibility
+- Supported on Python 3.7+
+- Must be passed the same function object
+- Use `self._atexit_handler` rather than `self.exit` so unregistration works correctly
+
+### Port Detection Method
+
+```python
+def _check_port_available(port: int, host: str = "localhost") -> bool:
+    """Detect whether a port is occupied using socket connect_ex."""
+    try:
+        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
+            s.settimeout(1)
+            result = s.connect_ex((host, port))
+            return result != 0  # 0 = connected = port in use
+    except Exception:
+        return True  # assume available
+```
+
+**Note**: this check has a TOCTOU (time-of-check to time-of-use) race condition, but it is sufficient for our use case.
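The connect_ex-based check from the findings above can be exercised end to end with only the standard library; the listening socket here is just a stand-in for a running nanovllm process:

```python
import socket

def _check_port_available(port: int, host: str = "localhost") -> bool:
    """Probe a port with connect_ex: a result of 0 means something accepted, i.e. in use."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(1)
            return s.connect_ex((host, port)) != 0
    except Exception:
        return True  # on probe failure, assume available

# Occupy a port with a listening socket (stand-in for a running process).
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("localhost", 0))
server.listen(1)
busy_port = server.getsockname()[1]

in_use = not _check_port_available(busy_port)
print(in_use)  # → True: the probe sees the listener
server.close()
```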
`nanovllm/engine/llm_engine.py`

@@ -34,14 +34,56 @@ class LLMEngine:
         # Set Sequence.block_size to match the KV cache block size
         Sequence.block_size = config.kvcache_block_size
         self.scheduler = Scheduler(config, self.model_runner.kvcache_manager)
-        atexit.register(self.exit)
+        self._closed = False
+        atexit.register(self._atexit_handler)
 
-    def exit(self):
+    def _atexit_handler(self):
+        """Handler for atexit - only runs if close() wasn't called."""
+        if not self._closed:
+            self.close()
+
+    def close(self):
+        """Explicitly close the engine and release all resources.
+
+        This method is idempotent - calling it multiple times is safe.
+        Supports: explicit close(), context manager, and __del__ fallback.
+        """
+        if self._closed:
+            return
+        self._closed = True
+
+        # Unregister atexit to prevent double cleanup
+        try:
+            atexit.unregister(self._atexit_handler)
+        except Exception:
+            pass
+
+        # Cleanup resources
         self.model_runner.call("exit")
         del self.model_runner
         for p in self.ps:
             p.join()
 
+    def exit(self):
+        """Alias for close() - kept for backward compatibility."""
+        self.close()
+
+    def __del__(self):
+        """Destructor - attempt cleanup if not already done."""
+        try:
+            self.close()
+        except Exception:
+            pass
+
+    def __enter__(self):
+        """Context manager entry."""
+        return self
+
+    def __exit__(self, exc_type, exc_val, exc_tb):
+        """Context manager exit - ensures cleanup."""
+        self.close()
+        return False
+
     def add_request(self, prompt: str | list[int], sampling_params: SamplingParams):
         if isinstance(prompt, str):
             prompt = self.tokenizer.encode(prompt)
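The idempotent-close pattern introduced in the diff above can be verified in isolation; `Closeable` is a minimal sketch with a counter standing in for the real worker/port teardown, not nanovllm code:

```python
import atexit

class Closeable:
    """Minimal sketch of the idempotent-close pattern from the diff."""
    def __init__(self):
        self._closed = False
        self.close_count = 0
        atexit.register(self._atexit_handler)

    def _atexit_handler(self):
        # Safety net: only fires at interpreter exit if close() was skipped.
        if not self._closed:
            self.close()

    def close(self):
        if self._closed:
            return  # second and later calls are no-ops
        self._closed = True
        atexit.unregister(self._atexit_handler)  # prevent double cleanup
        self.close_count += 1  # real code releases workers/ports here

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.close()
        return False

c = Closeable()
c.close()
c.close()  # idempotent: nothing happens
print(c.close_count)  # → 1
```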
`nanovllm/engine/model_runner.py`

@@ -1,4 +1,6 @@
+import os
 import pickle
+import socket
 import torch
 import torch.distributed as dist
 from multiprocessing.synchronize import Event

@@ -16,6 +18,17 @@ from nanovllm.kvcache import create_kvcache_manager, KVCacheManager
 logger = get_logger("model_runner")
 
 
+def _find_free_port() -> int:
+    """Find a free port for distributed communication.
+
+    Uses socket binding with port 0 to let the OS assign an available port.
+    """
+    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
+        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
+        s.bind(('', 0))
+        return s.getsockname()[1]
+
+
 class ModelRunner:
 
     def __init__(self, config: Config, rank: int, event: Event | list[Event]):

@@ -27,8 +40,13 @@ class ModelRunner:
         self.rank = rank
         self.event = event
 
-        import os
-        port = os.environ.get("NANOVLLM_DIST_PORT", "2333")
+        # Dynamic port allocation: use the env var if set, otherwise find a free port
+        env_port = os.environ.get("NANOVLLM_DIST_PORT")
+        if env_port is not None:
+            port = int(env_port)
+        else:
+            port = _find_free_port()
+            logger.info(f"Auto-assigned distributed port: {port}")
         dist.init_process_group("nccl", f"tcp://localhost:{port}", world_size=self.world_size, rank=rank)
         torch.cuda.set_device(rank)
         default_dtype = torch.get_default_dtype()
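To make the port-selection logic from the diff above concrete, here is a dependency-free sketch; `pick_dist_port` is a hypothetical helper (the diff inlines this logic in `ModelRunner.__init__`), and the environment is passed in as a dict so the behaviour is easy to test:

```python
import socket

def _find_free_port() -> int:
    # Same idea as the diff: bind to port 0 and let the OS choose.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

def pick_dist_port(env: dict) -> int:
    """Hypothetical helper mirroring the diff: env var wins, else auto-assign."""
    env_port = env.get("NANOVLLM_DIST_PORT")
    if env_port is not None:
        return int(env_port)
    return _find_free_port()

print(pick_dist_port({"NANOVLLM_DIST_PORT": "2334"}))  # → 2334
auto_port = pick_dist_port({})
print(0 < auto_port < 65536)  # → True
```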
`progress.md` (127 lines changed)
@@ -1,76 +1,89 @@
|
|||||||
# Progress Log: Multi-Model Support
|
# Progress Log: Fix Torch Distributed Port Conflict
|
||||||
|
|
||||||
## Session: 2026-01-10
|
## Status: COMPLETED & CLEANED UP
|
||||||
|
|
||||||
### Initial Analysis Complete
|
## Session: 2026-01-12
|
||||||
|
|
||||||
**Time**: Session start
|
### Task Overview
|
||||||
|
修复在同一 Python 进程中顺序创建多个 LLM 实例时的 EADDRINUSE 端口冲突问题,以及支持多卡环境下同时启动多个独立进程。
|
||||||
**Actions:**
|
|
||||||
1. Read `nanovllm/engine/model_runner.py` - 确认硬编码位置 (line 35)
|
|
||||||
2. Read `nanovllm/models/qwen3.py` - 理解 Qwen3 模型结构
|
|
||||||
3. Read `nanovllm/utils/loader.py` - 理解权重加载机制
|
|
||||||
4. Read `nanovllm/layers/rotary_embedding.py` - 发现 RoPE scaling 限制
|
|
||||||
5. Read `/home/zijie/models/Llama-3.1-8B-Instruct/config.json` - 理解 Llama 配置
|
|
||||||
|
|
||||||
**Key Findings:**
|
|
||||||
- 模型加载在 `model_runner.py:35` 硬编码为 Qwen3
|
|
||||||
- RoPE 目前不支持 scaling (`assert rope_scaling is None`)
|
|
||||||
- Llama 3.1 需要 "llama3" 类型的 RoPE scaling
|
|
||||||
- Llama 无 q_norm/k_norm,无 attention bias
|
|
||||||
|
|
||||||
**Created:**
|
|
||||||
- `task_plan.md` - 6 阶段实施计划
|
|
||||||
- `findings.md` - 技术分析和发现
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
### Phase Status
|
### Phase Status
|
||||||
|
|
||||||
| Phase | Status | Notes |
|
| Phase | Description | Status |
|
||||||
|-------|--------|-------|
|
|-------|-------------|--------|
|
||||||
| 1. Model Registry | **COMPLETED** | `registry.py`, `__init__.py` |
|
| Phase 1 | ModelRunner 动态端口分配 | COMPLETED |
|
||||||
| 2. Llama3 RoPE | **COMPLETED** | `rotary_embedding.py` |
|
| Phase 2 | LLMEngine close() 和 context manager | COMPLETED |
|
||||||
| 3. Llama Model | **COMPLETED** | `llama.py` |
|
| Phase 3 | 测试验证(GPU 4,5) | COMPLETED |
|
||||||
| 4. ModelRunner | **COMPLETED** | Dynamic loading |
|
| Phase 4 | 更新文档 | COMPLETED |
|
||||||
| 5. Qwen3 Register | **COMPLETED** | `@register_model` decorator |
|
|
||||||
| 6. Testing | **COMPLETED** | Both Llama & Qwen3 pass |
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
-## Test Results
-
-### Llama 3.1-8B-Instruct (32K needle, GPU 0, offload)
-```
-Input: 32768 tokens
-Expected: 7492
-Output: 7492
-Status: PASSED
-Prefill: 1644 tok/s
-```
-
-### Qwen3-4B (8K needle, GPU 1, offload) - Regression Test
-```
-Input: 8192 tokens
-Expected: 7492
-Output: 7492
-Status: PASSED
-Prefill: 3295 tok/s
-```
+### Implementation Summary
+
+#### Phase 1: Dynamic Port Allocation
+**File**: `nanovllm/engine/model_runner.py`
+- Added `_find_free_port()` function using socket binding
+- Modified port selection logic: use env var if set, otherwise auto-assign
+- Added logging for auto-assigned ports
+
+#### Phase 2: Resource Cleanup Enhancement
+**File**: `nanovllm/engine/llm_engine.py`
+- Added `_closed` flag for idempotent cleanup
+- Added `close()` method for explicit resource release
+- Added `__del__()` for GC fallback
+- Added `__enter__()` and `__exit__()` for context manager support
+- Modified atexit registration to use `_atexit_handler`
+
+#### Phase 3: Testing (GPU 4,5)
+**File**: `tests/test_port_conflict.py`
+- Created comprehensive test script
+
+**Test Results**:
+| Test | Status | Notes |
+|------|--------|-------|
+| Sequential creation (3 instances) | PASSED | Ports: 50405, 47835, 53011 |
+| Context manager | PASSED | Auto-cleanup works |
+| Parallel processes (GPU 4,5) | PASSED | Ports: 34631, 56097 |
+
+#### Phase 4: Documentation
+**File**: `docs/torch_distributed_port_issue.md`
+- Updated status to RESOLVED
+- Documented solution details
+- Added usage examples
 
 ---
 
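Phase 1's auto-assignment leans on the OS ephemeral-port mechanism. A minimal standalone sketch of that mechanism, mirroring the `_find_free_port()` this commit adds (the name `find_free_port` here is illustrative):

```python
import socket

def find_free_port() -> int:
    # Binding to port 0 asks the OS for any currently free ephemeral port
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

port = find_free_port()
print(port)
```

Each call may return a different port, which is what lets several independent processes start without coordinating with each other.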
-## Files Modified This Session
+### Files Modified
 
 | File | Action | Description |
 |------|--------|-------------|
-| `nanovllm/models/registry.py` | created | Model registry with `@register_model` decorator |
-| `nanovllm/models/__init__.py` | created | Export registry functions, import models |
-| `nanovllm/models/llama.py` | created | Llama model implementation |
-| `nanovllm/models/qwen3.py` | modified | Added `@register_model` decorator |
-| `nanovllm/layers/rotary_embedding.py` | modified | Added Llama3 RoPE scaling |
-| `nanovllm/engine/model_runner.py` | modified | Dynamic model loading via registry |
-| `.claude/rules/gpu-testing.md` | created | GPU testing rules |
-| `task_plan.md` | created | Implementation plan |
-| `findings.md` | created | Technical findings |
-| `progress.md` | created | Progress tracking |
+| `nanovllm/engine/model_runner.py` | Modified | Added `_find_free_port()`, dynamic port logic |
+| `nanovllm/engine/llm_engine.py` | Modified | Added `close()`, `__del__`, context manager |
+| `tests/test_port_conflict.py` | Created | Test script for port conflict fix |
+| `docs/torch_distributed_port_issue.md` | Deleted | Issue resolved, doc removed |
+| `CLAUDE.md` | Modified | Removed port conflict warnings, updated doc index |
+
+---
+
+### Key Features After Fix
+
+1. **Multi-GPU Parallel Testing**
+   ```bash
+   CUDA_VISIBLE_DEVICES=0 python test1.py &
+   CUDA_VISIBLE_DEVICES=1 python test2.py &
+   # Both run with different auto-assigned ports
+   ```
+
+2. **Sequential LLM Creation**
+   ```python
+   for i in range(3):
+       with LLM(model_path) as llm:
+           outputs = llm.generate(prompts, params)
+       # Automatically cleaned up
+   ```
+
+3. **Backward Compatible**
+   - `NANOVLLM_DIST_PORT` env var still works
+   - `llm.exit()` still works (alias for `close()`)
464 task_plan.md
@@ -1,314 +1,230 @@
-# Task Plan: Enable CUDA Graphs for CPU Offload Mode
+# Task Plan: Fix Torch Distributed Port Conflict
 
-## Current Status: ✅ COMPLETED
+## Goal
+
+Support launching multiple independent nanovllm processes for testing in a multi-GPU environment, without manual port management.
 
-### Phase 0 Completed: Refactor Offload Decode to Use Standard Attention Path
-
-### Phases 1-3 Completed: CUDA Graph Support for Offload Mode
-
-**Implementation**: Added per-layer CUDA graph capture and replay for offload decode path.
+## Problem Analysis
+
+### Core Problem
+
+```
+Currently: all nanovllm instances default to port 2333
+└── Multiple independent processes conflict when run at the same time!
+
+CUDA_VISIBLE_DEVICES=0 python test1.py   # binds port 2333 ✓
+CUDA_VISIBLE_DEVICES=1 python test2.py   # tries to bind 2333 → EADDRINUSE ❌
+```
 
-**Key Changes**:
-1. `capture_offload_cudagraph()` captures one graph per transformer layer
-2. Each graph uses the corresponding ring buffer slot based on `layer_id % num_buffers`
-3. `run_layerwise_offload_decode()` replays graphs when `enforce_eager=False`
-4. Synchronization added between graph replays to ensure correct data flow
-
-**Test Results**:
-- `test_needle.py --input-len 32768 --enable-offload --use-cuda-graph`: **PASSED**
+### Root Cause
+
+- Ports are a system-level resource, independent of the GPU
+- Even on different GPUs, the ports still conflict
+- The default port `2333` is currently hard-coded
 
 ---
 
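The conflict described above needs no GPU at all to reproduce; a sketch of the failure mode, binding one fixed port twice on the same host:

```python
import errno
import socket

a = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
a.bind(("127.0.0.1", 0))          # OS picks a free port for the first "process"
port = a.getsockname()[1]

b = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    b.bind(("127.0.0.1", port))   # second bind to the same fixed port
    conflict = False
except OSError as e:
    conflict = e.errno == errno.EADDRINUSE
finally:
    b.close()
    a.close()

print(conflict)  # True - the second bind fails no matter which GPU each process uses
```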
-### Previous Work: Refactor Offload Decode to Use Standard Attention Path
-
-**Problem solved**: The original offload decode (`run_layerwise_offload_decode`) bypassed `Attention.forward()` by manually calling attention components. This was inconsistent with the standard execution path.
-
-**Solution implemented**: Refactored to use `layer.forward()` which goes through:
-```
-Qwen3DecoderLayer.forward()
-  → Qwen3Attention.forward()
-    → Attention.forward()  ← Now properly used!
-```
-
-### Code Changes Made
-
-**File**: `nanovllm/engine/model_runner.py`
-
-1. **`run_layerwise_offload_decode()` (line 841-991)** - Completely refactored:
-
-   Before (bypassed Attention):
-   ```python
-   qkv = layer.self_attn.qkv_proj(hidden_ln)
-   q, k_new, v_new = qkv.split(...)
-   q = layer.self_attn.q_norm(...)
-   k = layer.self_attn.k_norm(...)
-   q, k = layer.self_attn.rotary_emb(...)
-   attn_output = flash_attn_varlen_func(q, k_full, v_full, ...)  # Direct call!
-   hidden_states = layer.self_attn.o_proj(attn_output)
-   ```
-
-   After (uses standard path):
-   ```python
-   # Set up Attention module's cache to ring buffer
-   attn_module.k_cache = offload_engine.layer_k_cache[buffer_idx:buffer_idx+1]
-   attn_module.v_cache = offload_engine.layer_v_cache[buffer_idx:buffer_idx+1]
-
-   # Set context for contiguous mode
-   set_context(is_prefill=False, slot_mapping=..., context_lens=..., block_tables=None)
-
-   # Standard layer forward - goes through Attention.forward()!
-   hidden_states, residual = layer(positions, hidden_states, residual)
-   ```
-
-2. **`ModelRunner.__init__()` (line 46-57)** - Conditional CUDA graph capture:
-   ```python
-   if not self.enforce_eager:
-       if config.enable_cpu_offload:
-           # TODO: Implement capture_offload_cudagraph()
-           pass  # Temporarily use eager execution
-       else:
-           self.capture_cudagraph()
-   ```
+## Solution: Dynamic Port Allocation
+
+### Core Approach
+
+```python
+def _find_free_port() -> int:
+    """Let the OS assign a free port."""
+    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
+        s.bind(('', 0))
+        return s.getsockname()[1]
+
+# Prefer the environment variable; otherwise auto-assign
+port = os.environ.get("NANOVLLM_DIST_PORT")
+if port is None:
+    port = _find_free_port()
+else:
+    port = int(port)
+```
 
-### Test Results
-
-| Test | Mode | Status |
-|------|------|--------|
-| `test_needle.py --input-len 4096` | GPU-only | PASSED |
-| `test_needle.py --input-len 4096 --enable-offload` | CPU offload | PASSED |
-
-## Remaining Work: Implement Offload CUDA Graph
-
-### Why Standard `capture_cudagraph()` Cannot Be Used
-
-The standard capture function captures the PagedAttention decode path:
-```python
-# capture_cudagraph() sets up:
-k_cache: [num_blocks, block_size, kv_heads, head_dim]  # PagedAttention format
-block_tables: [...]  # Block indices for paged indexing
-```
-
-But offload mode uses contiguous ring buffer:
+### Effect
+
+```bash
+# No need to specify a port manually; multiple tests can run at the same time
+CUDA_VISIBLE_DEVICES=0 python test1.py &   # auto port 54321
+CUDA_VISIBLE_DEVICES=1 python test2.py &   # auto port 54322
+CUDA_VISIBLE_DEVICES=2 python test3.py &   # auto port 54323
+
+# Manual specification is still supported (backward compatible)
+NANOVLLM_DIST_PORT=2333 python test.py
+```
+
+---
-```python
-# Offload decode sets up:
-k_cache: [1, max_seq_len, kv_heads, head_dim]  # Contiguous format
-block_tables: None  # No paging
-```
-
-### Implementation Plan for `capture_offload_cudagraph()`
-
-#### Phase 1: Prepare Fixed-Address Tensors
-
-```python
-@torch.inference_mode()
-def capture_offload_cudagraph(self):
-    """Capture CUDA graphs for offload decode using ring buffer."""
-    offload_engine = self.kvcache_manager.offload_engine
-    num_buffers = offload_engine.num_kv_buffers
-
-    # Fixed-address tensors for graph capture
-    input_ids = torch.zeros(1, dtype=torch.int64, device="cuda")
-    positions = torch.zeros(1, dtype=torch.int64, device="cuda")
-    slot_mapping = torch.zeros(1, dtype=torch.int32, device="cuda")
-    context_lens = torch.zeros(1, dtype=torch.int32, device="cuda")
-
-    self.offload_graphs = {}
-    self.offload_graph_pool = None
-```
-
-#### Phase 2: Capture Per-Buffer Graphs
-
-Since layer processing rotates through ring buffers (`layer_id % num_buffers`), we need graphs for each buffer slot:
-
-```python
-for buffer_idx in range(num_buffers):
-    graph = torch.cuda.CUDAGraph()
-
-    # Set Attention cache to this buffer slot (fixed address)
-    for layer in self.model.model.layers:
-        layer.self_attn.attn.k_cache = offload_engine.layer_k_cache[buffer_idx:buffer_idx+1]
-        layer.self_attn.attn.v_cache = offload_engine.layer_v_cache[buffer_idx:buffer_idx+1]
-
-    # Set context
-    set_context(is_prefill=False, slot_mapping=slot_mapping,
-                context_lens=context_lens, block_tables=None)
-
-    # Warmup
-    hidden = self.model.model.embed_tokens(input_ids)
-    residual = None
-    for layer_id, layer in enumerate(self.model.model.layers):
-        if layer_id % num_buffers == buffer_idx:
-            hidden, residual = layer(positions, hidden, residual)
-
-    # Capture
-    with torch.cuda.graph(graph, self.offload_graph_pool):
-        # Same operations
-        ...
-
-    self.offload_graphs[buffer_idx] = graph
-```
-
-#### Phase 3: Use Graphs in Decode
-
-Modify `run_layerwise_offload_decode()` to replay graphs:
-
-```python
-for layer_id in range(num_layers):
-    current_buffer = layer_id % num_buffers
-
-    # Wait for H2D load
-    offload_engine.wait_buffer_load(current_buffer)
-
-    # Copy decode buffer to ring buffer (same as current)
-    ...
-
-    # Update graph variables
-    self.offload_graph_vars["positions"][0] = positions[0]
-    self.offload_graph_vars["slot_mapping"][0] = context_len
-    self.offload_graph_vars["context_lens"][0] = context_len + 1
-
-    # Replay graph instead of eager forward
-    self.offload_graphs[current_buffer].replay()
-
-    # Copy new KV to decode buffer (same as current)
-    ...
-```
-
-### Challenges and Considerations
-
-| Challenge | Solution |
-|-----------|----------|
-| H2D transfers interleaved with compute | H2D happens outside graph, only compute is captured |
-| Different layers use different buffers | Capture per-buffer graphs, replay correct one |
-| Variable context length | Use `cache_seqlens` parameter (fixed address, variable value) |
-| Per-layer buffer rotation | Graph captures single-layer forward, loop in Python |
-
-### Alternative: Full-Decode Graph (More Complex)
-
-Instead of per-layer graphs, capture entire decode step:
-1. Complete all H2D loads before graph
-2. Single graph covers all layers
-3. Better kernel fusion, less CPU overhead
-4. More complex to implement (need to handle buffer rotation inside graph)
-
 ## Implementation Phases
 
-| Phase | Description | Status |
-|-------|-------------|--------|
-| Phase 0 | Refactor offload decode to use Attention.forward() | ✅ Completed |
-| Phase 1 | Implement `capture_offload_cudagraph()` with per-layer graphs | ✅ Completed |
-| Phase 2 | Modify `run_layerwise_offload_decode()` to use graphs | ✅ Completed |
-| Phase 3 | Test and benchmark | ✅ Completed |
-| Phase 4 | (Optional) Optimize to full-decode graph | ⬜ Future |
+### Phase 1: ModelRunner Dynamic Port [pending]
+**File**: `nanovllm/engine/model_runner.py`
 
-## Architecture After Refactoring
-
-```
-┌─────────────────────────────────────────────────────────────────────────────┐
-│                  Offload Decode Flow (After Refactoring)                    │
-├─────────────────────────────────────────────────────────────────────────────┤
-│                                                                             │
-│  For each layer:                                                            │
-│    1. Wait for H2D load (ring buffer has prefill KV)                        │
-│    2. Copy decode buffer → ring buffer (at prefill_len offset)              │
-│    3. Set Attention.k_cache = ring_buffer[buffer_idx]                       │
-│    4. Set context (slot_mapping, context_lens, block_tables=None)           │
-│    5. layer.forward() → Qwen3Attention.forward() → Attention.forward()      │
-│         └── store_kvcache() stores new token to ring buffer                 │
-│         └── flash_attn_with_kvcache() computes attention                    │
-│    6. Copy new token KV: ring buffer → decode buffer                        │
-│    7. Start next layer H2D load                                             │
-│                                                                             │
-│  Key insight: Now uses standard Attention path, just with ring buffer       │
-│  as k_cache/v_cache in contiguous format (block_tables=None)                │
-│                                                                             │
-└─────────────────────────────────────────────────────────────────────────────┘
-```
-
-## Files Modified
-
-| File | Changes |
-|------|---------|
-| `model_runner.py:46-50` | Conditional CUDA graph capture: calls `capture_offload_cudagraph()` for offload mode |
-| `model_runner.py:69-73` | Updated `exit()` to clean up offload graph resources |
-| `model_runner.py:844-1031` | Refactored `run_layerwise_offload_decode()` to use standard `layer.forward()` with optional CUDA graph |
-| `model_runner.py:1075-1164` | New `capture_offload_cudagraph()` method for per-layer graph capture |
-| `tests/test_needle.py` | Added `--use-cuda-graph` flag to test CUDA graph mode |
-
-## Implementation Details
-
-### `capture_offload_cudagraph()` (line 1075-1164)
-
-Captures per-layer CUDA graphs for offload decode:
-
-```python
-def capture_offload_cudagraph(self):
-    # Fixed-address tensors for graph capture
-    hidden_states = torch.randn(1, hidden_size, ...)
-    residual = torch.randn(1, hidden_size, ...)
-    layer_outputs = torch.zeros(1, hidden_size, ...)
-    layer_residual = torch.zeros(1, hidden_size, ...)
-
-    for layer_id in range(num_layers):
-        buffer_idx = layer_id % num_buffers
-
-        # Set Attention cache to ring buffer
-        attn_module.k_cache = ring_buffer[buffer_idx:buffer_idx+1]
-        attn_module.v_cache = ring_buffer[buffer_idx:buffer_idx+1]
-
-        # Warmup and capture
-        with torch.cuda.graph(graph):
-            out_h, out_r = layer(positions, hidden_states, residual)
-            layer_outputs.copy_(out_h)
-            layer_residual.copy_(out_r)
-
-        # Update inputs for next layer
-        hidden_states.copy_(layer_outputs)
-        residual.copy_(layer_residual)
-```
+```python
+import socket
+
+def _find_free_port() -> int:
+    """Find a free port for distributed communication."""
+    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
+        s.bind(('', 0))
+        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
+        return s.getsockname()[1]
+
+class ModelRunner:
+    def __init__(self, config: Config, rank: int, event: Event | list[Event]):
+        # ... existing code ...
+        import os
+        port = os.environ.get("NANOVLLM_DIST_PORT")
+        if port is None:
+            port = _find_free_port()
+            logger.info(f"Auto-assigned distributed port: {port}")
+        else:
+            port = int(port)
+
+        dist.init_process_group("nccl", f"tcp://localhost:{port}", ...)
+```
 
-### `run_layerwise_offload_decode()` CUDA Graph Mode
-
-When CUDA graphs are available:
-
-```python
-use_cuda_graph = not self.enforce_eager and hasattr(self, 'offload_graphs')
-
-if use_cuda_graph:
-    # Use fixed-address tensors
-    graph_vars["positions"][0] = len(seq) - 1
-    graph_vars["slot_mapping"][0] = context_len
-    graph_vars["context_lens"][0] = context_len + 1
-    graph_vars["hidden_states"].copy_(embedding)
-    graph_vars["residual"].zero_()
-
-    for layer_id in range(num_layers):
-        # Set up ring buffer and context
-        ...
-
-        # Replay graph
-        self.offload_graphs[layer_id].replay()
-        torch.cuda.current_stream().synchronize()
-
-        # Copy outputs to inputs for next layer
-        if layer_id < num_layers - 1:
-            graph_vars["hidden_states"].copy_(graph_vars["layer_outputs"])
-            graph_vars["residual"].copy_(graph_vars["layer_residual"])
-```
+### Phase 2: LLMEngine Resource Cleanup Enhancement [pending]
+**File**: `nanovllm/engine/llm_engine.py`
+
+Add a `close()` method and context manager support to ensure resources are released correctly:
+
+```python
+class LLMEngine:
+    def __init__(self, model, **kwargs):
+        # ... existing code ...
+        self._closed = False
+        atexit.register(self._atexit_handler)
+
+    def _atexit_handler(self):
+        if not self._closed:
+            self.close()
+
+    def close(self):
+        """Explicitly close the engine and release all resources."""
+        if self._closed:
+            return
+        self._closed = True
+        try:
+            atexit.unregister(self._atexit_handler)
+        except Exception:
+            pass
+        self.model_runner.call("exit")
+        del self.model_runner
+        for p in self.ps:
+            p.join()
+
+    def exit(self):
+        """Alias for close() - backward compatibility."""
+        self.close()
+
+    def __del__(self):
+        try:
+            self.close()
+        except Exception:
+            pass
+
+    def __enter__(self):
+        return self
+
+    def __exit__(self, *args):
+        self.close()
+        return False
+```
 
-## Test Results
-
-| Test | Mode | CUDA Graph | Status |
-|------|------|------------|--------|
-| `test_needle.py --input-len 4096` | GPU-only | N/A | PASSED |
-| `test_needle.py --input-len 4096 --enable-offload` | CPU offload | Disabled | PASSED |
-| `test_needle.py --input-len 32768 --enable-offload` | CPU offload | Disabled | PASSED |
-| `test_needle.py --input-len 32768 --enable-offload --use-cuda-graph` | CPU offload | Enabled | PASSED |
-
-## Next Steps
-
-1. ~~Implement `capture_offload_cudagraph()` method~~ ✅
-2. ~~Modify `run_layerwise_offload_decode()` to optionally use captured graphs~~ ✅
-3. ~~Test correctness with needle-in-haystack~~ ✅
-4. Benchmark performance improvement from CUDA graphs (optional)
-5. Consider full-decode graph optimization for maximum performance (future)
+### Phase 3: Test Verification [pending]
+**File**: `tests/test_multiple_processes.py` (new)
+
+```python
+"""Test multiple independent nanovllm processes."""
+import subprocess
+import sys
+import time
+
+def test_parallel_processes():
+    """Test running multiple nanovllm processes in parallel."""
+    script = '''
+import sys
+sys.path.insert(0, ".")
+from nanovllm import LLM, SamplingParams
+import os
+
+gpu = os.environ.get("CUDA_VISIBLE_DEVICES", "0")
+print(f"[GPU {gpu}] Starting LLM")
+llm = LLM("path/to/model", enable_cpu_offload=True)
+outputs = llm.generate(["Hello"], SamplingParams(max_tokens=10))
+print(f"[GPU {gpu}] Output: {outputs[0]['text'][:50]}")
+llm.close()
+print(f"[GPU {gpu}] Done")
+'''
+
+    # Start 2 processes on different GPUs
+    procs = []
+    for gpu in [0, 1]:
+        env = {"CUDA_VISIBLE_DEVICES": str(gpu)}
+        p = subprocess.Popen(
+            [sys.executable, "-c", script],
+            env={**os.environ, **env}
+        )
+        procs.append(p)
+        time.sleep(1)  # Stagger start slightly
+
+    # Wait for all
+    for p in procs:
+        assert p.wait() == 0, f"Process failed with code {p.returncode}"
+
+    print("PASSED: test_parallel_processes")
+
+if __name__ == "__main__":
+    test_parallel_processes()
+```
 
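The launch/stagger/wait structure of that test can be sketched with trivial workers, no model or GPU needed:

```python
import subprocess
import sys

# Launch two trivial workers concurrently, then wait for both -
# the same start/stagger/wait pattern the planned test uses per GPU
procs = []
for worker_id in (0, 1):
    p = subprocess.Popen(
        [sys.executable, "-c", f"print('worker {worker_id} done')"],
    )
    procs.append(p)

codes = [p.wait() for p in procs]
print(codes)  # [0, 0]
```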
+### Phase 4: Documentation Update [pending]
+**File**: `docs/torch_distributed_port_issue.md`
+
+Update the document to mark the issue as resolved by dynamic port allocation.
+
+---
+
+## Usage After Fix
+
+### Scenario 1: Multi-process parallel testing (primary use case)
+```bash
+# No extra configuration needed; just run
+CUDA_VISIBLE_DEVICES=0 python test_group1.py &
+CUDA_VISIBLE_DEVICES=1 python test_group2.py &
+CUDA_VISIBLE_DEVICES=2 python test_group3.py &
+wait
+```
+
+### Scenario 2: Sequential creation in the same process (also supported)
+```python
+for i in range(3):
+    with LLM(model_path) as llm:
+        outputs = llm.generate(prompts, params)
+    # Automatic cleanup; the next instance can use a new random port
+```
+
+### Scenario 3: Manually specified port (backward compatible)
+```bash
+NANOVLLM_DIST_PORT=2333 python test.py
+```
+
+---
+
+## Success Criteria
+
+- [ ] Multiple independent processes can run at the same time (different GPUs)
+- [ ] No manual port specification needed
+- [ ] Backward compatible (the environment variable still works)
+- [ ] Sequential creation in the same process also works
+- [ ] Resources are cleaned up correctly
+
+---
+
+## Files to Modify
+
+| File | Action | Status |
+|------|--------|--------|
+| `nanovllm/engine/model_runner.py` | Add `_find_free_port()` | pending |
+| `nanovllm/engine/llm_engine.py` | Add `close()`, context manager | pending |
+| `tests/test_multiple_processes.py` | Create | pending |
+| `docs/torch_distributed_port_issue.md` | Update | pending |
112 tests/run_parallel_niah.sh (Executable file)
@@ -0,0 +1,112 @@
#!/bin/bash
# Run NIAH tests in parallel on 6 GPUs
# This tests the dynamic port allocation fix

set -e

MODEL="${1:-/home/zijie/models/Llama-3.1-8B-Instruct}"
PROJECT_ROOT="$(cd "$(dirname "$0")/.." && pwd)"

echo "=========================================="
echo "Parallel NIAH Test on 6 GPUs"
echo "=========================================="
echo "Model: $MODEL"
echo "Project: $PROJECT_ROOT"
echo ""

# Sample distribution (100 samples total):
# GPU 0: 0-16  (17 samples)
# GPU 1: 17-33 (17 samples)
# GPU 2: 34-50 (17 samples)
# GPU 3: 51-67 (17 samples)
# GPU 4: 68-83 (16 samples)
# GPU 5: 84-99 (16 samples)

declare -a RANGES=("0-16" "17-33" "34-50" "51-67" "68-83" "84-99")
declare -a PIDS=()

# Create log directory
LOG_DIR="$PROJECT_ROOT/logs"
mkdir -p "$LOG_DIR"

# Start all 6 processes
for gpu in {0..5}; do
    range="${RANGES[$gpu]}"
    log_file="$LOG_DIR/gpu${gpu}_${range}.log"

    echo "Starting GPU $gpu: samples $range -> $log_file"

    CUDA_VISIBLE_DEVICES=$gpu PYTHONPATH="$PROJECT_ROOT:$PYTHONPATH" \
        python "$PROJECT_ROOT/tests/test_ruler_niah.py" \
        --model "$MODEL" \
        --sample-indices "$range" \
        --enable-offload \
        --num-gpu-blocks 4 \
        --quiet \
        > "$log_file" 2>&1 &

    PIDS+=($!)

    # Small delay to stagger starts
    sleep 2
done

echo ""
echo "All 6 processes started. Waiting for completion..."
echo "PIDs: ${PIDS[*]}"
echo ""

# Wait for all processes and collect results
declare -a RESULTS=()
ALL_PASSED=true

for i in {0..5}; do
    pid="${PIDS[$i]}"
    range="${RANGES[$i]}"
    log_file="$LOG_DIR/gpu${i}_${range}.log"

    if wait $pid; then
        RESULTS+=("GPU $i ($range): PASSED")
        echo "GPU $i completed successfully"
    else
        RESULTS+=("GPU $i ($range): FAILED (exit code $?)")
        ALL_PASSED=false
        echo "GPU $i FAILED!"
    fi
done

echo ""
echo "=========================================="
echo "RESULTS SUMMARY"
echo "=========================================="
for result in "${RESULTS[@]}"; do
    echo "$result"
done
echo ""

# Show accuracy from each log
echo "Accuracy per GPU:"
for i in {0..5}; do
    range="${RANGES[$i]}"
    log_file="$LOG_DIR/gpu${i}_${range}.log"
    if [ -f "$log_file" ]; then
        accuracy=$(grep -E "Accuracy:|accuracy" "$log_file" | tail -1 || echo "N/A")
        port=$(grep "Auto-assigned distributed port" "$log_file" | head -1 || echo "N/A")
        echo "  GPU $i ($range): $accuracy | $port"
    fi
done

echo ""
if $ALL_PASSED; then
    echo "=========================================="
    echo "ALL 6 TESTS PASSED!"
    echo "Dynamic port allocation works correctly."
    echo "=========================================="
    exit 0
else
    echo "=========================================="
    echo "SOME TESTS FAILED!"
    echo "Check logs in $LOG_DIR"
    echo "=========================================="
    exit 1
fi
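The hard-coded `RANGES` table splits 100 samples over 6 GPUs as evenly as possible; a small sketch that derives the same ranges (`split_ranges` is a name of my choosing, not part of the repo):

```python
def split_ranges(total: int, parts: int) -> list:
    # The first (total % parts) parts get one extra sample, matching the table above
    base, extra = divmod(total, parts)
    ranges, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        ranges.append(f"{start}-{start + size - 1}")
        start += size
    return ranges

print(split_ranges(100, 6))  # ['0-16', '17-33', '34-50', '51-67', '68-83', '84-99']
```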
198 tests/test_port_conflict.py (New file)
@@ -0,0 +1,198 @@
"""Test for torch distributed port conflict fix.
|
||||||
|
|
||||||
|
This test verifies that:
|
||||||
|
1. Multiple independent processes can run simultaneously (dynamic port allocation)
|
||||||
|
2. Sequential LLM creation in same process works (proper cleanup)
|
||||||
|
|
||||||
|
Usage:
|
||||||
|
# Test parallel processes (requires 2 GPUs)
|
||||||
|
python tests/test_port_conflict.py --model ~/models/Qwen3-4B --gpus 4,5 --test parallel
|
||||||
|
|
||||||
|
# Test sequential creation in same process
|
||||||
|
CUDA_VISIBLE_DEVICES=4 python tests/test_port_conflict.py --model ~/models/Qwen3-4B --test sequential
|
||||||
|
"""
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import os
|
||||||
|
import subprocess
|
||||||
|
import sys
|
||||||
|
import time
|
||||||
|
|
||||||
|
|
||||||
|
def test_sequential_creation(model_path: str, enable_offload: bool = True):
|
||||||
|
"""Test creating multiple LLM instances sequentially in same process."""
|
||||||
|
# Add project root to path
|
||||||
|
project_root = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
|
||||||
|
sys.path.insert(0, project_root)
|
||||||
|
|
||||||
|
from nanovllm import LLM, SamplingParams
|
||||||
|
|
||||||
|
print("=" * 60)
|
||||||
|
print("Test: Sequential LLM Creation (same process)")
|
||||||
|
print("=" * 60)
|
||||||
|
|
||||||
|
for i in range(3):
|
||||||
|
print(f"\n--- Creating LLM instance {i+1}/3 ---")
|
||||||
|
|
||||||
|
llm_kwargs = {"enable_cpu_offload": enable_offload}
|
||||||
|
if enable_offload:
|
||||||
|
llm_kwargs["num_gpu_blocks"] = 2
|
||||||
|
|
||||||
|
llm = LLM(model_path, **llm_kwargs)
|
||||||
|
|
||||||
|
# Simple generation
|
||||||
|
outputs = llm.generate(
|
||||||
|
["Hello, how are you?"],
|
||||||
|
SamplingParams(max_tokens=20)
|
||||||
|
)
|
||||||
|
print(f"Output: {outputs[0]['text'][:50]}...")
|
||||||
|
|
||||||
|
# Explicit cleanup
|
||||||
|
llm.close()
|
||||||
|
print(f"Instance {i+1} closed successfully")
|
||||||
|
|
||||||
|
print("\n" + "=" * 60)
|
||||||
|
print("PASSED: test_sequential_creation")
|
||||||
|
print("=" * 60)
|
||||||
|
|
||||||
|
|
||||||
|
def test_context_manager(model_path: str, enable_offload: bool = True):
|
    """Test LLM with context manager."""
    project_root = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
    sys.path.insert(0, project_root)

    from nanovllm import LLM, SamplingParams

    print("=" * 60)
    print("Test: Context Manager")
    print("=" * 60)

    for i in range(2):
        print(f"\n--- Context manager instance {i+1}/2 ---")

        llm_kwargs = {"enable_cpu_offload": enable_offload}
        if enable_offload:
            llm_kwargs["num_gpu_blocks"] = 2

        with LLM(model_path, **llm_kwargs) as llm:
            outputs = llm.generate(
                ["What is 2+2?"],
                SamplingParams(max_tokens=20)
            )
            print(f"Output: {outputs[0]['text'][:50]}...")

        print(f"Instance {i+1} auto-closed via context manager")

    print("\n" + "=" * 60)
    print("PASSED: test_context_manager")
    print("=" * 60)


def test_parallel_processes(model_path: str, gpus: str, enable_offload: bool = True):
    """Test running multiple nanovllm processes in parallel."""
    gpu_list = [int(g.strip()) for g in gpus.split(",")]
    if len(gpu_list) < 2:
        print("ERROR: Need at least 2 GPUs for parallel test")
        return False

    print("=" * 60)
    print(f"Test: Parallel Processes (GPUs: {gpu_list})")
    print("=" * 60)

    project_root = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))

    # Script to run in each subprocess
    script = f'''
import sys
sys.path.insert(0, "{project_root}")
import os
from nanovllm import LLM, SamplingParams

gpu = os.environ.get("CUDA_VISIBLE_DEVICES", "?")
print(f"[GPU {{gpu}}] Starting LLM...")

llm_kwargs = {{"enable_cpu_offload": {enable_offload}}}
if {enable_offload}:
    llm_kwargs["num_gpu_blocks"] = 2

llm = LLM("{model_path}", **llm_kwargs)
print(f"[GPU {{gpu}}] LLM initialized, generating...")

outputs = llm.generate(["Hello world"], SamplingParams(max_tokens=10))
print(f"[GPU {{gpu}}] Output: {{outputs[0]['text'][:30]}}...")

llm.close()
print(f"[GPU {{gpu}}] Done")
'''

    # Start processes on different GPUs
    procs = []
    for i, gpu in enumerate(gpu_list[:2]):  # Use first 2 GPUs
        print(f"\nStarting process on GPU {gpu}...")
        env = os.environ.copy()
        env["CUDA_VISIBLE_DEVICES"] = str(gpu)

        p = subprocess.Popen(
            [sys.executable, "-c", script],
            env=env,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True
        )
        procs.append((gpu, p))
        time.sleep(2)  # Stagger starts to see concurrent running

    # Wait and collect results
    all_passed = True
    for gpu, p in procs:
        stdout, _ = p.communicate(timeout=300)
        print(f"\n--- GPU {gpu} output ---")
        print(stdout)

        if p.returncode != 0:
            print(f"ERROR: GPU {gpu} process failed with code {p.returncode}")
            all_passed = False
        else:
            print(f"GPU {gpu} process completed successfully")

    print("\n" + "=" * 60)
    if all_passed:
        print("PASSED: test_parallel_processes")
    else:
        print("FAILED: test_parallel_processes")
    print("=" * 60)

    return all_passed


def main():
    parser = argparse.ArgumentParser(description="Test port conflict fix")
    parser.add_argument("--model", "-m", required=True, help="Path to model")
    parser.add_argument("--gpus", default="0,1", help="GPUs to use for parallel test (comma-separated)")
    parser.add_argument("--test", choices=["sequential", "context", "parallel", "all"],
                        default="all", help="Which test to run")
    parser.add_argument("--no-offload", action="store_true", help="Disable CPU offload")
    args = parser.parse_args()

    enable_offload = not args.no_offload
    model_path = os.path.expanduser(args.model)

    print(f"Model: {model_path}")
    print(f"CPU Offload: {enable_offload}")
    print(f"GPUs for parallel test: {args.gpus}")
    print()

    if args.test in ["sequential", "all"]:
        test_sequential_creation(model_path, enable_offload)
        print()

    if args.test in ["context", "all"]:
        test_context_manager(model_path, enable_offload)
        print()

    if args.test in ["parallel", "all"]:
        test_parallel_processes(model_path, args.gpus, enable_offload)


if __name__ == "__main__":
    main()