Merge branch 'zijie/fix-dist-3': Fix distributed port conflict

- Auto port allocation with _find_free_port() in model_runner.py
- Resource management refactor with close() + context manager in llm_engine.py
- Add tests/test_port_conflict.py and tests/run_parallel_niah.sh
- Remove docs/torch_distributed_port_issue.md (issue fixed)
- Ignore tests/data/ directory

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
Zijie Tian
2026-01-12 16:20:44 +08:00
parent de6f36bdb2
commit 64971c8e8a
10 changed files with 784 additions and 792 deletions

.gitignore vendored
View File

@@ -224,3 +224,6 @@ coordination/orchestration/*
claude-flow
# Removed Windows wrapper files per user request
hive-mind-prompt-*.txt
# Test data
tests/data/

View File

@@ -22,19 +22,9 @@ while [ -n "$(nvidia-smi --query-compute-apps=pid --format=csv,noheader)" ]; do
done
```
### Other Scripts (tests, examples) - Port Conflict Check Only
### Other Scripts (tests, examples) - No Special Requirements
For non-benchmark scripts, exclusive GPU access is NOT required. However, check for **distributed port conflicts** before running:
```bash
# Check if port 2333 (nanovllm default) is in use
if lsof -i :2333 >/dev/null 2>&1; then
echo "Port 2333 in use, waiting 10s..."
sleep 10
fi
```
**Note**: nanovllm uses port 2333 for `torch.distributed`. See [`docs/torch_distributed_port_issue.md`](docs/torch_distributed_port_issue.md) for known issues with creating multiple LLM instances in the same process.
For non-benchmark scripts, exclusive GPU access is NOT required. Multiple nanovllm processes can run simultaneously on different GPUs - each process automatically selects a unique port for `torch.distributed` communication.
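The mechanism behind the automatic selection is the standard OS trick of binding to port 0. A minimal self-contained sketch (`pick_port` is a hypothetical stand-in for the helper nanovllm uses internally):

```python
import socket

def pick_port() -> int:
    """Ask the OS for a free ephemeral port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]  # the kernel-assigned port number

print(pick_port())
```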
## Multi-Instance Development with PYTHONPATH
@@ -68,7 +58,6 @@ PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH python tests/test_needle.py
| [`docs/layerwise_offload_memory_analysis.md`](docs/layerwise_offload_memory_analysis.md) | Memory allocation analysis with theoretical formulas and empirical validation (< 5% error) |
| [`docs/debugging_guide.md`](docs/debugging_guide.md) | PyTorch hooks for debugging, tensor comparison, memory profiling |
| [`docs/gpu_only_performance_issue.md`](docs/gpu_only_performance_issue.md) | GPU-only mode slower than offload due to PagedAttention scatter overhead, optimization proposals |
| [`docs/torch_distributed_port_issue.md`](docs/torch_distributed_port_issue.md) | **BUG**: Port conflict when creating multiple LLM instances, root cause and proposed solutions |
| [`docs/offload_accuracy_issue.md`](docs/offload_accuracy_issue.md) | **BUG**: CPU offload mode 66% accuracy vs 100% non-offload on RULER NIAH benchmark |
## Configuration

View File

@@ -1,308 +0,0 @@
# Torch Distributed Port Conflict Issue
## Problem Summary
When attempting to create multiple `LLM` instances sequentially in the same Python process (e.g., for grouped testing), the second and subsequent instances fail with:
```
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address.
port: 2333, useIpv6: false, code: -98, name: EADDRINUSE, message: address already in use
```
## Root Cause Analysis
### 1. Distributed Process Group Initialization
In `nanovllm/engine/model_runner.py:30-32`:
```python
import os
port = os.environ.get("NANOVLLM_DIST_PORT", "2333")
dist.init_process_group("nccl", f"tcp://localhost:{port}", world_size=self.world_size, rank=rank)
```
- Default port is **2333** (configurable via `NANOVLLM_DIST_PORT` env var)
- `init_process_group()` binds a TCP socket to this port
- This binding persists until `destroy_process_group()` is called
### 2. Cleanup Mechanism
In `nanovllm/engine/llm_engine.py:37`:
```python
atexit.register(self.exit)
```
In `nanovllm/engine/llm_engine.py:39-43`:
```python
def exit(self):
self.model_runner.call("exit")
del self.model_runner
for p in self.ps:
p.join()
```
In `nanovllm/engine/model_runner.py:66-78`:
```python
def exit(self):
# ... cleanup code ...
dist.destroy_process_group()
```
### 3. The Problem
**`atexit` only triggers when the Python interpreter exits, NOT when the object is deleted or goes out of scope.**
Timeline of the bug:
```
1. Create LLM instance #1
├── init_process_group() binds port 2333 ✓
└── atexit.register(self.exit) registered
2. LLM #1 goes out of scope (garbage collected)
├── Python's GC deletes the object
├── BUT atexit handler NOT triggered yet
└── Port 2333 still bound! ❌
3. Create LLM instance #2
├── init_process_group() tries to bind port 2333
└── EADDRINUSE error! ❌
4. Program exits (only now atexit runs)
└── Too late - already crashed
```
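The core claim here - that `atexit` handlers do not fire on `del` - can be demonstrated with a self-contained sketch (not nanovllm code):

```python
import atexit

calls = []

class Holder:
    def __init__(self):
        # Registering a bound method also keeps the object alive,
        # mirroring how LLMEngine instances linger until interpreter exit.
        atexit.register(self.cleanup)

    def cleanup(self):
        calls.append("cleanup")

h = Holder()
del h               # the name is gone, but the interpreter keeps running
assert calls == []  # the atexit handler has NOT fired yet
```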
## Impact
This issue affects:
1. **Grouped testing mode** (`test_ruler_niah.py --group-size N`)
- Each group needs a fresh LLM instance
- Second group fails with port conflict
2. **Multiple LLM instances in same process**
- Any code that creates LLM, deletes it, then creates another
3. **Interactive/notebook usage**
- Re-running cells that create LLM instances
## Proposed Solutions
### Solution A: Add `__del__` Method (Quick Fix)
Add destructor to `LLMEngine` that calls cleanup:
```python
# In nanovllm/engine/llm_engine.py
def __del__(self):
try:
self.exit()
except Exception:
pass # Ignore errors during cleanup
```
**Pros**: Simple, backwards compatible
**Cons**: `__del__` is not guaranteed to be called (circular references, etc.)
### Solution B: Context Manager Pattern (Recommended)
Make `LLMEngine` a context manager:
```python
# In nanovllm/engine/llm_engine.py
def __enter__(self):
return self
def __exit__(self, exc_type, exc_val, exc_tb):
self.exit()
return False
```
Usage:
```python
with LLM(model_path) as llm:
outputs = llm.generate(prompts, params)
# Cleanup happens automatically here
```
**Pros**: Explicit, guaranteed cleanup, Pythonic
**Cons**: Requires usage pattern change
### Solution C: Check and Cleanup Before Init (Defensive)
In `ModelRunner.__init__`, check if process group exists:
```python
# In nanovllm/engine/model_runner.py
if dist.is_initialized():
dist.destroy_process_group()
dist.init_process_group("nccl", f"tcp://localhost:{port}", ...)
```
**Pros**: Self-healing, no usage pattern change
**Cons**: May mask other issues, global state manipulation
### Solution D: Subprocess Isolation (For Testing)
For grouped testing specifically, run each group in a subprocess:
```python
import subprocess
for group in groups:
subprocess.run([sys.executable, "test_ruler_niah.py",
"--sample-indices", f"{start}-{end}"])
```
**Pros**: Complete isolation, no code changes to nanovllm
**Cons**: More overhead, only solves testing use case
### Solution E: Dynamic Port Allocation
Instead of fixed port 2333, use dynamic port:
```python
import socket
def find_free_port():
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
s.bind(('', 0))
return s.getsockname()[1]
port = os.environ.get("NANOVLLM_DIST_PORT") or find_free_port()
```
**Pros**: Avoids conflicts entirely
**Cons**: More complex, may have side effects
## Recommended Implementation
**Combine Solutions A + B + C** for maximum robustness:
1. Add `__del__` for best-effort cleanup
2. Add context manager for explicit cleanup
3. Add `is_initialized()` check as defensive measure
```python
# nanovllm/engine/llm_engine.py
class LLMEngine:
def __init__(self, model, **kwargs):
# ... existing code ...
atexit.register(self.exit)
self._exited = False
def exit(self):
if self._exited:
return
self._exited = True
self.model_runner.call("exit")
del self.model_runner
for p in self.ps:
p.join()
def __del__(self):
try:
self.exit()
except Exception:
pass
def __enter__(self):
return self
def __exit__(self, *args):
self.exit()
return False
# nanovllm/engine/model_runner.py
class ModelRunner:
def __init__(self, config: Config, rank: int, event):
# ... existing code before init_process_group ...
import os
port = os.environ.get("NANOVLLM_DIST_PORT", "2333")
# Defensive cleanup
if dist.is_initialized():
dist.destroy_process_group()
dist.init_process_group("nccl", f"tcp://localhost:{port}",
world_size=self.world_size, rank=rank)
# ... rest of init ...
```
## Workaround for Current Code
Until the fix is implemented, use one of these workarounds:
### Workaround 1: Manual Cleanup
```python
import torch.distributed as dist
llm = LLM(model_path)
outputs = llm.generate(...)
llm.model_runner.call("exit") # Manual cleanup
del llm
# Now can create new LLM
llm2 = LLM(model_path)
```
### Workaround 2: Subprocess Testing
```bash
# Run each test group as separate process
for i in $(seq 0 5 95); do
python test_ruler_niah.py --sample-indices $i-$((i+4)) --enable-offload
done
```
### Workaround 3: Environment Variable Port
```bash
# Use different port for each run
NANOVLLM_DIST_PORT=2334 python test.py
NANOVLLM_DIST_PORT=2335 python test.py
```
## Related Files
| File | Relevant Code |
|------|---------------|
| `nanovllm/engine/model_runner.py:30-32` | `init_process_group()` call |
| `nanovllm/engine/model_runner.py:66-78` | `exit()` and `destroy_process_group()` |
| `nanovllm/engine/llm_engine.py:37` | `atexit.register()` |
| `nanovllm/engine/llm_engine.py:39-43` | `exit()` method |
## Testing the Fix
After implementing the fix, verify with:
```python
# test_multiple_llm.py
from nanovllm import LLM, SamplingParams
for i in range(3):
print(f"Creating LLM instance {i+1}")
llm = LLM("path/to/model", enable_cpu_offload=True)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=10))
print(f"Instance {i+1} output: {outputs[0]['text']}")
del llm
print(f"Instance {i+1} deleted\n")
print("All instances created and deleted successfully!")
```
Expected: No port conflict errors, all 3 instances work.
## Priority
**High** - This blocks grouped testing and any multi-LLM-instance workflows.

View File

@@ -1,160 +1,169 @@
# Findings: Multi-Model Support Analysis
# Findings: Torch Distributed Port Conflict
## Current Architecture Analysis
## Problem Analysis
### Model Loading Flow
```
LLM(model_path)
→ LLMEngine.__init__()
→ Config.__post_init__()
→ hf_config = AutoConfig.from_pretrained(model)
→ ModelRunner.__init__()
→ model = Qwen3ForCausalLM(hf_config) ← HARDCODED
→ load_model(model, config.model)
```
### Issue Summary
Creating multiple LLM instances in the same process triggers a port conflict (EADDRINUSE), so the second instance fails to start.
### Key Files
| File | Purpose |
|------|---------|
| `nanovllm/engine/model_runner.py` | Model loading and execution |
| `nanovllm/models/qwen3.py` | Qwen3 model definition |
| `nanovllm/utils/loader.py` | safetensors weight loading |
| `nanovllm/layers/rotary_embedding.py` | RoPE implementation |
### Root Cause Deep Dive
---
## Llama 3.1 Config Analysis
```json
{
"architectures": ["LlamaForCausalLM"],
"model_type": "llama",
"attention_bias": false,
"mlp_bias": false,
"head_dim": 128,
"hidden_size": 4096,
"intermediate_size": 14336,
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 8,
"hidden_act": "silu",
"rms_norm_eps": 1e-05,
"rope_theta": 500000.0,
"rope_scaling": {
"factor": 8.0,
"high_freq_factor": 4.0,
"low_freq_factor": 1.0,
"original_max_position_embeddings": 8192,
"rope_type": "llama3"
},
"max_position_embeddings": 131072,
"tie_word_embeddings": false,
"vocab_size": 128256
}
```
### Llama 3 RoPE Scaling
Llama 3 uses a special RoPE scaling strategy (`rope_type: "llama3"`):
- High-frequency components (short wavelengths, short-range dependencies) are kept unchanged
- Low-frequency components (long wavelengths, long-range dependencies) are scaled down by `factor`
- Frequencies in between are smoothly interpolated
- Parameters: `factor`, `low_freq_factor`, `high_freq_factor`, `original_max_position_embeddings`
Reference implementation (transformers):
#### 1. Where the Port Is Bound
```python
def _compute_llama3_parameters(config, device, inv_freq):
factor = config.factor
low_freq_factor = config.low_freq_factor
high_freq_factor = config.high_freq_factor
old_context_len = config.original_max_position_embeddings
low_freq_wavelen = old_context_len / low_freq_factor
high_freq_wavelen = old_context_len / high_freq_factor
wavelen = 2 * math.pi / inv_freq
inv_freq_llama = torch.where(
wavelen > low_freq_wavelen,
inv_freq / factor,
inv_freq
)
smooth_factor = (old_context_len / wavelen - low_freq_factor) / (high_freq_factor - low_freq_factor)
smoothed_inv_freq = (1 - smooth_factor) * inv_freq_llama + smooth_factor * inv_freq
is_medium_freq = (wavelen >= high_freq_wavelen) & (wavelen <= low_freq_wavelen)
inv_freq_llama = torch.where(is_medium_freq, smoothed_inv_freq, inv_freq_llama)
return inv_freq_llama
# nanovllm/engine/model_runner.py:30-32
import os
port = os.environ.get("NANOVLLM_DIST_PORT", "2333")
dist.init_process_group("nccl", f"tcp://localhost:{port}", world_size=self.world_size, rank=rank)
```
---
- Default port is **2333**, configurable via the `NANOVLLM_DIST_PORT` environment variable
- `init_process_group()` binds a TCP port for inter-process communication
- The binding persists until `destroy_process_group()` is called
## Weight Mapping Analysis
### Qwen3 packed_modules_mapping
#### 2. Cleanup Mechanism Flaw
```python
packed_modules_mapping = {
"q_proj": ("qkv_proj", "q"),
"k_proj": ("qkv_proj", "k"),
"v_proj": ("qkv_proj", "v"),
"gate_proj": ("gate_up_proj", 0),
"up_proj": ("gate_up_proj", 1),
}
# nanovllm/engine/llm_engine.py:37
atexit.register(self.exit)
# nanovllm/engine/llm_engine.py:39-43
def exit(self):
self.model_runner.call("exit")
del self.model_runner
for p in self.ps:
p.join()
# nanovllm/engine/model_runner.py:66-78
def exit(self):
# ... cleanup code ...
dist.destroy_process_group()
```
### Llama Weight Names (from safetensors)
The Llama weight names are expected to be similar to Qwen3's:
- `model.layers.{i}.self_attn.q_proj.weight`
- `model.layers.{i}.self_attn.k_proj.weight`
- `model.layers.{i}.self_attn.v_proj.weight`
- `model.layers.{i}.self_attn.o_proj.weight`
- `model.layers.{i}.mlp.gate_proj.weight`
- `model.layers.{i}.mlp.up_proj.weight`
- `model.layers.{i}.mlp.down_proj.weight`
- `model.layers.{i}.input_layernorm.weight`
- `model.layers.{i}.post_attention_layernorm.weight`
**Key issue**: `atexit` only fires when the **Python interpreter exits**, not when the object is deleted!
**Conclusion**: Llama's `packed_modules_mapping` is identical to Qwen3's and can be reused.
#### 3. Bug Timeline
```
1. Create LLM #1
├── init_process_group() binds port 2333 ✓
└── atexit.register(self.exit) registered
2. LLM #1 goes out of scope or is del'd
├── Python GC reclaims the object
├── atexit handler not triggered (process still running)
├── Worker processes still alive
└── Port 2333 still bound ❌
3. Create LLM #2
├── init_process_group() tries to bind port 2333
└── EADDRINUSE error ❌
4. Program exits (only now does atexit run)
└── Too late - already crashed
```
---
## Shared Components (Can Reuse)
## Solution Analysis
| Component | File | Notes |
|-----------|------|-------|
| `RMSNorm` | `layers/layernorm.py` | Generic |
| `SiluAndMul` | `layers/activation.py` | Generic |
| `Attention` | `layers/attention.py` | FlashAttention wrapper |
| `QKVParallelLinear` | `layers/linear.py` | Supports bias=False |
| `RowParallelLinear` | `layers/linear.py` | Generic |
| `MergedColumnParallelLinear` | `layers/linear.py` | Generic |
| `VocabParallelEmbedding` | `layers/embed_head.py` | Generic |
| `ParallelLMHead` | `layers/embed_head.py` | Generic |
| `load_model` | `utils/loader.py` | Generic |
### Solution Comparison
| Solution | Reliability | Backward Compatible | Complexity | Recommendation |
|------|--------|----------|------------|--------|
| `close()` method | Highest | Yes | Low | ★★★★★ |
| `__del__` method | Medium | Yes | Low | ★★★☆☆ |
| Port check + retry | Medium | Yes | Low | ★★★☆☆ |
| Context manager | Highest | Requires code changes | Low | ★★★★☆ |
| Dynamic port | Low | Yes | Low | ★★☆☆☆ |
### Why Three Layers of Defense
1. **Layer 1: close()** - explicit user control, most reliable
2. **Layer 2: __del__** - automatic cleanup, covers most scenarios
3. **Layer 3: port check** - last line of defense, produces a clear error message
### Limitations of `__del__`
Python does not guarantee that `__del__` is called:
- It may not fire when reference cycles are involved
- During interpreter shutdown, modules it depends on may no longer be accessible
- Critical resource cleanup should never rely on `__del__` alone
It is still valuable as an **extra layer of defense**, because:
- It is called in most cases
- It is better than nothing
- It does not interfere with the other cleanup mechanisms
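The reference-cycle caveat is observable directly in CPython (3.4+, where the cycle collector can finalize objects with `__del__`); a self-contained sketch:

```python
import gc

deleted = []

class Node:
    def __del__(self):
        deleted.append(id(self))

a, b = Node(), Node()
a.peer, b.peer = b, a  # reference cycle keeps both refcounts nonzero
del a, b
assert deleted == []   # __del__ did not run on del
gc.collect()           # the cycle collector finalizes both objects
assert len(deleted) == 2
```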
---
## Llama vs Qwen3 Implementation Diff
## Code Structure Analysis
### Attention
| Feature | Qwen3Attention | LlamaAttention |
|---------|----------------|----------------|
| QKV bias | Configurable (attention_bias) | Always False |
| q_norm | Yes (when bias=False) | No |
| k_norm | Yes (when bias=False) | No |
| RoPE | Standard | Llama3 scaled |
### LLMEngine Lifecycle
```
__init__()
├── Create worker processes (self.ps)
├── Create ModelRunner (self.model_runner)
├── Register atexit handler
└── Set up scheduler, tokenizer
### MLP
| Feature | Qwen3MLP | LlamaMLP |
|---------|----------|----------|
| gate/up bias | False | False |
| down bias | False | False |
| hidden_act | silu | silu |
close() [new]
├── Check _closed flag (idempotent)
├── Unregister atexit handler
├── Call model_runner.exit()
├── join worker processes
└── Set _closed = True
**Conclusion**: Llama MLP is nearly identical to Qwen3 MLP and can be reused directly or simplified.
__del__() [new]
└── Calls close() (exceptions ignored)
__enter__/__exit__() [new]
└── Context manager support
```
### ModelRunner Resources
```
__init__()
├── torch.distributed init (binds the port)
├── Model loading
├── KV cache allocation
├── CUDA graph capture (optional)
└── SharedMemory creation (multi-GPU)
exit()
├── SharedMemory cleanup
├── CUDA graph cleanup
└── dist.destroy_process_group()
```
---
## Risk Assessment
| Risk | Impact | Mitigation |
|------|--------|------------|
| Incorrect RoPE implementation | - wrong outputs | Follow the transformers reference implementation; unit tests |
| Wrong weight mapping | High - model fails to load | Check safetensors key names |
| Registry circular import | Medium - startup failure | Lazy imports |
| Risk | Impact | Mitigation |
|------|------|----------|
| `__del__` not called | - port leak | Layer 3 port check gives a clear error |
| Repeated close() calls | Low | `_closed` flag guarantees idempotency |
| Double atexit invocation | Low | Unregister mechanism prevents it |
| Leftover worker processes | High | join() ensures child processes exit |
| CUDA resource leak | Medium | Cleaned up by ModelRunner.exit() |
---
## Implementation Notes
### atexit.unregister Compatibility
- Supported in Python 3.7+
- Must be passed the same function object that was registered
- Use `self._atexit_handler` rather than `self.exit` so unregistration matches correctly
### 端口检测方法
```python
def _check_port_available(port: int, host: str = "localhost") -> bool:
    """Detect whether a port is in use via socket connect_ex."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(1)
            result = s.connect_ex((host, port))
            return result != 0  # 0 = connected = port in use
    except Exception:
        return True  # assume available
```
**Note**: This check has a TOCTOU (time-of-check to time-of-use) race condition, but it is sufficient for our use case.
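A quick self-test of this helper (reproducing the function above and occupying a throwaway port to exercise the busy branch):

```python
import socket

def _check_port_available(port: int, host: str = "localhost") -> bool:
    """Detect whether a port is in use via socket connect_ex."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(1)
            return s.connect_ex((host, port)) != 0  # 0 = connected = in use
    except Exception:
        return True  # assume available

# Occupy a port, then verify the check reports it as busy
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("localhost", 0))
srv.listen(1)
busy = srv.getsockname()[1]
assert _check_port_available(busy) is False
srv.close()
```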

View File

@@ -34,14 +34,56 @@ class LLMEngine:
# Set Sequence.block_size to match the KV cache block size
Sequence.block_size = config.kvcache_block_size
self.scheduler = Scheduler(config, self.model_runner.kvcache_manager)
atexit.register(self.exit)
self._closed = False
atexit.register(self._atexit_handler)
def exit(self):
def _atexit_handler(self):
"""Handler for atexit - only runs if close() wasn't called."""
if not self._closed:
self.close()
def close(self):
"""Explicitly close the engine and release all resources.
This method is idempotent - calling it multiple times is safe.
Supports: explicit close(), context manager, and __del__ fallback.
"""
if self._closed:
return
self._closed = True
# Unregister atexit to prevent double cleanup
try:
atexit.unregister(self._atexit_handler)
except Exception:
pass
# Cleanup resources
self.model_runner.call("exit")
del self.model_runner
for p in self.ps:
p.join()
def exit(self):
"""Alias for close() - kept for backward compatibility."""
self.close()
def __del__(self):
"""Destructor - attempt cleanup if not already done."""
try:
self.close()
except Exception:
pass
def __enter__(self):
"""Context manager entry."""
return self
def __exit__(self, exc_type, exc_val, exc_tb):
"""Context manager exit - ensures cleanup."""
self.close()
return False
def add_request(self, prompt: str | list[int], sampling_params: SamplingParams):
if isinstance(prompt, str):
prompt = self.tokenizer.encode(prompt)
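The cleanup contract above can be exercised in isolation. A minimal sketch (`_Engine` is a hypothetical stand-in for `LLMEngine`, stripped down to the lifecycle logic):

```python
import atexit

class _Engine:
    def __init__(self):
        self._closed = False
        atexit.register(self._atexit_handler)

    def _atexit_handler(self):
        # Only acts at interpreter exit if close() was never called
        if not self._closed:
            self.close()

    def close(self):
        if self._closed:  # idempotent: repeat calls are no-ops
            return
        self._closed = True
        try:
            atexit.unregister(self._atexit_handler)  # prevent double cleanup
        except Exception:
            pass

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.close()
        return False

with _Engine() as e:
    pass
assert e._closed  # context manager closed it
e.close()         # safe to call again
```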

View File

@@ -1,4 +1,6 @@
import os
import pickle
import socket
import torch
import torch.distributed as dist
from multiprocessing.synchronize import Event
@@ -16,6 +18,17 @@ from nanovllm.kvcache import create_kvcache_manager, KVCacheManager
logger = get_logger("model_runner")
def _find_free_port() -> int:
    """Find a free port for distributed communication.

    Uses socket binding with port 0 to let the OS assign an available port.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)  # must be set before bind to take effect
        s.bind(('', 0))
        return s.getsockname()[1]
class ModelRunner:
def __init__(self, config: Config, rank: int, event: Event | list[Event]):
@@ -27,8 +40,13 @@ class ModelRunner:
self.rank = rank
self.event = event
import os
port = os.environ.get("NANOVLLM_DIST_PORT", "2333")
# Dynamic port allocation: use env var if set, otherwise find a free port
env_port = os.environ.get("NANOVLLM_DIST_PORT")
if env_port is not None:
port = int(env_port)
else:
port = _find_free_port()
logger.info(f"Auto-assigned distributed port: {port}")
dist.init_process_group("nccl", f"tcp://localhost:{port}", world_size=self.world_size, rank=rank)
torch.cuda.set_device(rank)
default_dtype = torch.get_default_dtype()

View File

@@ -1,76 +1,89 @@
# Progress Log: Multi-Model Support
# Progress Log: Fix Torch Distributed Port Conflict
## Session: 2026-01-10
## Status: COMPLETED & CLEANED UP
### Initial Analysis Complete
## Session: 2026-01-12
**Time**: Session start
**Actions:**
1. Read `nanovllm/engine/model_runner.py` - confirmed the hardcoded location (line 35)
2. Read `nanovllm/models/qwen3.py` - understood the Qwen3 model structure
3. Read `nanovllm/utils/loader.py` - understood the weight-loading mechanism
4. Read `nanovllm/layers/rotary_embedding.py` - found the RoPE scaling limitation
5. Read `/home/zijie/models/Llama-3.1-8B-Instruct/config.json` - understood the Llama config
**Key Findings:**
- Model loading is hardcoded to Qwen3 at `model_runner.py:35`
- RoPE currently does not support scaling (`assert rope_scaling is None`)
- Llama 3.1 requires the "llama3" RoPE scaling type
- Llama has no q_norm/k_norm and no attention bias
**Created:**
- `task_plan.md` - 6-phase implementation plan
- `findings.md` - technical analysis and findings
### Task Overview
Fix the EADDRINUSE port conflict when creating multiple LLM instances sequentially in the same Python process, and support launching multiple independent processes simultaneously in a multi-GPU environment.
---
### Phase Status
| Phase | Status | Notes |
|-------|--------|-------|
| 1. Model Registry | **COMPLETED** | `registry.py`, `__init__.py` |
| 2. Llama3 RoPE | **COMPLETED** | `rotary_embedding.py` |
| 3. Llama Model | **COMPLETED** | `llama.py` |
| 4. ModelRunner | **COMPLETED** | Dynamic loading |
| 5. Qwen3 Register | **COMPLETED** | `@register_model` decorator |
| 6. Testing | **COMPLETED** | Both Llama & Qwen3 pass |
| Phase | Description | Status |
|-------|-------------|--------|
| Phase 1 | Dynamic port allocation in ModelRunner | COMPLETED |
| Phase 2 | LLMEngine close() and context manager | COMPLETED |
| Phase 3 | Test validation (GPU 4,5) | COMPLETED |
| Phase 4 | Documentation updates | COMPLETED |
---
## Test Results
### Implementation Summary
### Llama 3.1-8B-Instruct (32K needle, GPU 0, offload)
```
Input: 32768 tokens
Expected: 7492
Output: 7492
Status: PASSED
Prefill: 1644 tok/s
```
#### Phase 1: Dynamic Port Allocation
**File**: `nanovllm/engine/model_runner.py`
- Added `_find_free_port()` function using socket binding
- Modified port selection logic: use env var if set, otherwise auto-assign
- Added logging for auto-assigned ports
### Qwen3-4B (8K needle, GPU 1, offload) - Regression Test
```
Input: 8192 tokens
Expected: 7492
Output: 7492
Status: PASSED
Prefill: 3295 tok/s
```
#### Phase 2: Resource Cleanup Enhancement
**File**: `nanovllm/engine/llm_engine.py`
- Added `_closed` flag for idempotent cleanup
- Added `close()` method for explicit resource release
- Added `__del__()` for GC fallback
- Added `__enter__()` and `__exit__()` for context manager support
- Modified atexit registration to use `_atexit_handler`
#### Phase 3: Testing (GPU 4,5)
**File**: `tests/test_port_conflict.py`
- Created comprehensive test script
**Test Results**:
| Test | Status | Notes |
|------|--------|-------|
| Sequential creation (3 instances) | PASSED | Ports: 50405, 47835, 53011 |
| Context manager | PASSED | Auto-cleanup works |
| Parallel processes (GPU 4,5) | PASSED | Ports: 34631, 56097 |
#### Phase 4: Documentation
**File**: `docs/torch_distributed_port_issue.md`
- Updated status to RESOLVED
- Documented solution details
- Added usage examples
---
## Files Modified This Session
### Files Modified
| File | Action | Description |
|------|--------|-------------|
| `nanovllm/models/registry.py` | created | Model registry with `@register_model` decorator |
| `nanovllm/models/__init__.py` | created | Export registry functions, import models |
| `nanovllm/models/llama.py` | created | Llama model implementation |
| `nanovllm/models/qwen3.py` | modified | Added `@register_model` decorator |
| `nanovllm/layers/rotary_embedding.py` | modified | Added Llama3 RoPE scaling |
| `nanovllm/engine/model_runner.py` | modified | Dynamic model loading via registry |
| `.claude/rules/gpu-testing.md` | created | GPU testing rules |
| `task_plan.md` | created | Implementation plan |
| `findings.md` | created | Technical findings |
| `progress.md` | created | Progress tracking |
| `nanovllm/engine/model_runner.py` | Modified | Added `_find_free_port()`, dynamic port logic |
| `nanovllm/engine/llm_engine.py` | Modified | Added `close()`, `__del__`, context manager |
| `tests/test_port_conflict.py` | Created | Test script for port conflict fix |
| `docs/torch_distributed_port_issue.md` | Deleted | Issue resolved, doc removed |
| `CLAUDE.md` | Modified | Removed port conflict warnings, updated doc index |
---
### Key Features After Fix
1. **Multi-GPU Parallel Testing**
```bash
CUDA_VISIBLE_DEVICES=0 python test1.py &
CUDA_VISIBLE_DEVICES=1 python test2.py &
# Both run with different auto-assigned ports
```
2. **Sequential LLM Creation**
```python
for i in range(3):
with LLM(model_path) as llm:
outputs = llm.generate(prompts, params)
# Automatically cleaned up
```
3. **Backward Compatible**
- `NANOVLLM_DIST_PORT` env var still works
- `llm.exit()` still works (alias for `close()`)
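The backward-compatibility claim can be sanity-checked with a small sketch (`resolve_dist_port` is a hypothetical helper mirroring the committed selection logic):

```python
import os
import socket

def _find_free_port() -> int:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

def resolve_dist_port() -> int:
    # Env var wins; otherwise let the OS pick a free port
    env_port = os.environ.get("NANOVLLM_DIST_PORT")
    return int(env_port) if env_port is not None else _find_free_port()

os.environ["NANOVLLM_DIST_PORT"] = "2333"
assert resolve_dist_port() == 2333       # manual override honored
del os.environ["NANOVLLM_DIST_PORT"]
assert 0 < resolve_dist_port() < 65536   # OS-assigned otherwise
```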

View File

@@ -1,314 +1,230 @@
# Task Plan: Enable CUDA Graphs for CPU Offload Mode
# Task Plan: Fix Torch Distributed Port Conflict
## Current Status: ✅ COMPLETED
## Goal
Support launching multiple independent nanovllm processes for testing in a multi-GPU environment, without manual port management.
### Phase 0 Completed: Refactor Offload Decode to Use Standard Attention Path
## Problem Analysis
### Phases 1-3 Completed: CUDA Graph Support for Offload Mode
### Core Problem
```
Currently: all nanovllm instances default to port 2333
└── Multiple independent processes running at the same time will conflict!
**Implementation**: Added per-layer CUDA graph capture and replay for offload decode path.
CUDA_VISIBLE_DEVICES=0 python test1.py  # binds port 2333 ✓
CUDA_VISIBLE_DEVICES=1 python test2.py  # tries to bind 2333 → EADDRINUSE ❌
```
**Key Changes**:
1. `capture_offload_cudagraph()` captures one graph per transformer layer
2. Each graph uses the corresponding ring buffer slot based on `layer_id % num_buffers`
3. `run_layerwise_offload_decode()` replays graphs when `enforce_eager=False`
4. Synchronization added between graph replays to ensure correct data flow
**Test Results**:
- `test_needle.py --input-len 32768 --enable-offload --use-cuda-graph`: **PASSED**
### Root Cause
- The port is a system-level resource, unrelated to GPUs
- Even with different GPUs, the port still conflicts
- The default port `2333` is currently hardcoded
---
### Previous Work: Refactor Offload Decode to Use Standard Attention Path
## Solution: Dynamic Port Allocation
**Problem solved**: The original offload decode (`run_layerwise_offload_decode`) bypassed `Attention.forward()` by manually calling attention components. This was inconsistent with the standard execution path.
**Solution implemented**: Refactored to use `layer.forward()` which goes through:
```
Qwen3DecoderLayer.forward()
→ Qwen3Attention.forward()
→ Attention.forward() ← Now properly used!
```
### Code Changes Made
**File**: `nanovllm/engine/model_runner.py`
1. **`run_layerwise_offload_decode()` (line 841-991)** - Completely refactored:
Before (bypassed Attention):
```python
qkv = layer.self_attn.qkv_proj(hidden_ln)
q, k_new, v_new = qkv.split(...)
q = layer.self_attn.q_norm(...)
k = layer.self_attn.k_norm(...)
q, k = layer.self_attn.rotary_emb(...)
attn_output = flash_attn_varlen_func(q, k_full, v_full, ...) # Direct call!
hidden_states = layer.self_attn.o_proj(attn_output)
```
After (uses standard path):
```python
# Set up Attention module's cache to ring buffer
attn_module.k_cache = offload_engine.layer_k_cache[buffer_idx:buffer_idx+1]
attn_module.v_cache = offload_engine.layer_v_cache[buffer_idx:buffer_idx+1]
# Set context for contiguous mode
set_context(is_prefill=False, slot_mapping=..., context_lens=..., block_tables=None)
# Standard layer forward - goes through Attention.forward()!
hidden_states, residual = layer(positions, hidden_states, residual)
```
2. **`ModelRunner.__init__()` (line 46-57)** - Conditional CUDA graph capture:
```python
if not self.enforce_eager:
if config.enable_cpu_offload:
# TODO: Implement capture_offload_cudagraph()
pass # Temporarily use eager execution
else:
self.capture_cudagraph()
```
### Test Results
| Test | Mode | Status |
|------|------|--------|
| `test_needle.py --input-len 4096` | GPU-only | PASSED |
| `test_needle.py --input-len 4096 --enable-offload` | CPU offload | PASSED |
## Remaining Work: Implement Offload CUDA Graph
### Why Standard `capture_cudagraph()` Cannot Be Used
The standard capture function captures the PagedAttention decode path:
### Core Approach
```python
# capture_cudagraph() sets up:
k_cache: [num_blocks, block_size, kv_heads, head_dim] # PagedAttention format
block_tables: [...] # Block indices for paged indexing
def _find_free_port() -> int:
    """Let the OS assign a free port automatically."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(('', 0))
        return s.getsockname()[1]
# Prefer the environment variable if set; otherwise auto-assign
port = os.environ.get("NANOVLLM_DIST_PORT")
if port is None:
port = _find_free_port()
else:
port = int(port)
```
But offload mode uses contiguous ring buffer:
```python
# Offload decode sets up:
k_cache: [1, max_seq_len, kv_heads, head_dim] # Contiguous format
block_tables: None # No paging
### Effect
```bash
# No need to pick ports manually; multiple tests can run concurrently
CUDA_VISIBLE_DEVICES=0 python test1.py &  # auto port 54321
CUDA_VISIBLE_DEVICES=1 python test2.py &  # auto port 54322
CUDA_VISIBLE_DEVICES=2 python test3.py &  # auto port 54323
# Manual override still supported (backward compatible)
NANOVLLM_DIST_PORT=2333 python test.py
```
### Implementation Plan for `capture_offload_cudagraph()`
#### Phase 1: Prepare Fixed-Address Tensors
```python
@torch.inference_mode()
def capture_offload_cudagraph(self):
"""Capture CUDA graphs for offload decode using ring buffer."""
offload_engine = self.kvcache_manager.offload_engine
num_buffers = offload_engine.num_kv_buffers
# Fixed-address tensors for graph capture
input_ids = torch.zeros(1, dtype=torch.int64, device="cuda")
positions = torch.zeros(1, dtype=torch.int64, device="cuda")
slot_mapping = torch.zeros(1, dtype=torch.int32, device="cuda")
context_lens = torch.zeros(1, dtype=torch.int32, device="cuda")
self.offload_graphs = {}
self.offload_graph_pool = None
```
#### Phase 2: Capture Per-Buffer Graphs
Since layer processing rotates through ring buffers (`layer_id % num_buffers`), we need graphs for each buffer slot:
```python
for buffer_idx in range(num_buffers):
graph = torch.cuda.CUDAGraph()
# Set Attention cache to this buffer slot (fixed address)
for layer in self.model.model.layers:
layer.self_attn.attn.k_cache = offload_engine.layer_k_cache[buffer_idx:buffer_idx+1]
layer.self_attn.attn.v_cache = offload_engine.layer_v_cache[buffer_idx:buffer_idx+1]
# Set context
set_context(is_prefill=False, slot_mapping=slot_mapping,
context_lens=context_lens, block_tables=None)
# Warmup
hidden = self.model.model.embed_tokens(input_ids)
residual = None
for layer_id, layer in enumerate(self.model.model.layers):
if layer_id % num_buffers == buffer_idx:
hidden, residual = layer(positions, hidden, residual)
# Capture
with torch.cuda.graph(graph, self.offload_graph_pool):
# Same operations
...
self.offload_graphs[buffer_idx] = graph
```
#### Phase 3: Use Graphs in Decode
Modify `run_layerwise_offload_decode()` to replay graphs:
```python
for layer_id in range(num_layers):
current_buffer = layer_id % num_buffers
# Wait for H2D load
offload_engine.wait_buffer_load(current_buffer)
# Copy decode buffer to ring buffer (same as current)
...
# Update graph variables
self.offload_graph_vars["positions"][0] = positions[0]
self.offload_graph_vars["slot_mapping"][0] = context_len
self.offload_graph_vars["context_lens"][0] = context_len + 1
# Replay graph instead of eager forward
self.offload_graphs[current_buffer].replay()
# Copy new KV to decode buffer (same as current)
...
```
### Challenges and Considerations
| Challenge | Solution |
|-----------|----------|
| H2D transfers interleaved with compute | H2D happens outside graph, only compute is captured |
| Different layers use different buffers | Capture per-buffer graphs, replay correct one |
| Variable context length | Use `cache_seqlens` parameter (fixed address, variable value) |
| Per-layer buffer rotation | Graph captures single-layer forward, loop in Python |
### Alternative: Full-Decode Graph (More Complex)
Instead of per-layer graphs, capture entire decode step:
1. Complete all H2D loads before graph
2. Single graph covers all layers
3. Better kernel fusion, less CPU overhead
4. More complex to implement (need to handle buffer rotation inside graph)
---
## Implementation Phases
| Phase | Description | Status |
|-------|-------------|--------|
| Phase 0 | Refactor offload decode to use Attention.forward() | ✅ Completed |
| Phase 1 | Implement `capture_offload_cudagraph()` with per-layer graphs | ✅ Completed |
| Phase 2 | Modify `run_layerwise_offload_decode()` to use graphs | ✅ Completed |
| Phase 3 | Test and benchmark | ✅ Completed |
| Phase 4 | (Optional) Optimize to full-decode graph | ⬜ Future |
## Architecture After Refactoring
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ Offload Decode Flow (After Refactoring) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ For each layer: │
│ 1. Wait for H2D load (ring buffer has prefill KV) │
│ 2. Copy decode buffer → ring buffer (at prefill_len offset) │
│ 3. Set Attention.k_cache = ring_buffer[buffer_idx] │
│ 4. Set context (slot_mapping, context_lens, block_tables=None) │
│ 5. layer.forward() → Qwen3Attention.forward() → Attention.forward() │
│ └── store_kvcache() stores new token to ring buffer │
│ └── flash_attn_with_kvcache() computes attention │
│ 6. Copy new token KV: ring buffer → decode buffer │
│ 7. Start next layer H2D load │
│ │
│ Key insight: Now uses standard Attention path, just with ring buffer │
│ as k_cache/v_cache in contiguous format (block_tables=None) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```
## Files Modified
| File | Changes |
|------|---------|
| `model_runner.py:46-50` | Conditional CUDA graph capture: calls `capture_offload_cudagraph()` for offload mode |
| `model_runner.py:69-73` | Updated `exit()` to clean up offload graph resources |
| `model_runner.py:844-1031` | Refactored `run_layerwise_offload_decode()` to use standard `layer.forward()` with optional CUDA graph |
| `model_runner.py:1075-1164` | New `capture_offload_cudagraph()` method for per-layer graph capture |
| `tests/test_needle.py` | Added `--use-cuda-graph` flag to test CUDA graph mode |
## Implementation Details
### `capture_offload_cudagraph()` (line 1075-1164)
Captures per-layer CUDA graphs for offload decode:
```python
def capture_offload_cudagraph(self):
    # Fixed-address tensors for graph capture
    hidden_states = torch.randn(1, hidden_size, ...)
    residual = torch.randn(1, hidden_size, ...)
    layer_outputs = torch.zeros(1, hidden_size, ...)
    layer_residual = torch.zeros(1, hidden_size, ...)

    for layer_id in range(num_layers):
        buffer_idx = layer_id % num_buffers

        # Set Attention cache to ring buffer
        attn_module.k_cache = ring_buffer[buffer_idx:buffer_idx+1]
        attn_module.v_cache = ring_buffer[buffer_idx:buffer_idx+1]

        # Warmup and capture
        with torch.cuda.graph(graph):
            out_h, out_r = layer(positions, hidden_states, residual)
            layer_outputs.copy_(out_h)
            layer_residual.copy_(out_r)

        # Update inputs for next layer
        hidden_states.copy_(layer_outputs)
        residual.copy_(layer_residual)
```
### `run_layerwise_offload_decode()` CUDA Graph Mode
When CUDA graphs are available:
```python
use_cuda_graph = not self.enforce_eager and hasattr(self, 'offload_graphs')

if use_cuda_graph:
    # Use fixed-address tensors
    graph_vars["positions"][0] = len(seq) - 1
    graph_vars["slot_mapping"][0] = context_len
    graph_vars["context_lens"][0] = context_len + 1
    graph_vars["hidden_states"].copy_(embedding)
    graph_vars["residual"].zero_()

    for layer_id in range(num_layers):
        # Set up ring buffer and context
        ...

        # Replay graph
        self.offload_graphs[layer_id].replay()
        torch.cuda.current_stream().synchronize()

        # Copy outputs to inputs for next layer
        if layer_id < num_layers - 1:
            graph_vars["hidden_states"].copy_(graph_vars["layer_outputs"])
            graph_vars["residual"].copy_(graph_vars["layer_residual"])
```
## Test Results
| Test | Mode | CUDA Graph | Status |
|------|------|------------|--------|
| `test_needle.py --input-len 4096` | GPU-only | N/A | PASSED |
| `test_needle.py --input-len 4096 --enable-offload` | CPU offload | Disabled | PASSED |
| `test_needle.py --input-len 32768 --enable-offload` | CPU offload | Disabled | PASSED |
| `test_needle.py --input-len 32768 --enable-offload --use-cuda-graph` | CPU offload | Enabled | PASSED |
## Next Steps
1. ~~Implement `capture_offload_cudagraph()` method~~ ✅
2. ~~Modify `run_layerwise_offload_decode()` to optionally use captured graphs~~ ✅
3. ~~Test correctness with needle-in-haystack~~ ✅
4. Benchmark performance improvement from CUDA graphs (optional)
5. Consider full-decode graph optimization for maximum performance (future)
---
## Distributed Port Conflict Fix
### Phase 1: ModelRunner Dynamic Port [pending]
**File**: `nanovllm/engine/model_runner.py`
```python
import socket

def _find_free_port() -> int:
    """Find a free port for distributed communication."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind(('', 0))
        return s.getsockname()[1]

class ModelRunner:
    def __init__(self, config: Config, rank: int, event: Event | list[Event]):
        # ... existing code ...
        import os
        port = os.environ.get("NANOVLLM_DIST_PORT")
        if port is None:
            port = _find_free_port()
            logger.info(f"Auto-assigned distributed port: {port}")
        else:
            port = int(port)
        dist.init_process_group("nccl", f"tcp://localhost:{port}", ...)
```
### Phase 2: LLMEngine Resource Cleanup [pending]
**File**: `nanovllm/engine/llm_engine.py`
Add a `close()` method and context manager support to ensure resources are released correctly:
```python
class LLMEngine:
    def __init__(self, model, **kwargs):
        # ... existing code ...
        self._closed = False
        atexit.register(self._atexit_handler)

    def _atexit_handler(self):
        if not self._closed:
            self.close()

    def close(self):
        """Explicitly close the engine and release all resources."""
        if self._closed:
            return
        self._closed = True
        try:
            atexit.unregister(self._atexit_handler)
        except Exception:
            pass
        self.model_runner.call("exit")
        del self.model_runner
        for p in self.ps:
            p.join()

    def exit(self):
        """Alias for close() - backward compatibility."""
        self.close()

    def __del__(self):
        try:
            self.close()
        except Exception:
            pass

    def __enter__(self):
        return self

    def __exit__(self, *args):
        self.close()
        return False
```
### Phase 3: Test Verification [pending]
**File**: `tests/test_multiple_processes.py` (new)
```python
"""Test multiple independent nanovllm processes."""
import os
import subprocess
import sys
import time


def test_parallel_processes():
    """Test running multiple nanovllm processes in parallel."""
    script = '''
import sys
sys.path.insert(0, ".")
from nanovllm import LLM, SamplingParams
import os

gpu = os.environ.get("CUDA_VISIBLE_DEVICES", "0")
print(f"[GPU {gpu}] Starting LLM")
llm = LLM("path/to/model", enable_cpu_offload=True)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=10))
print(f"[GPU {gpu}] Output: {outputs[0]['text'][:50]}")
llm.close()
print(f"[GPU {gpu}] Done")
'''
    # Start 2 processes on different GPUs
    procs = []
    for gpu in [0, 1]:
        env = {"CUDA_VISIBLE_DEVICES": str(gpu)}
        p = subprocess.Popen(
            [sys.executable, "-c", script],
            env={**os.environ, **env}
        )
        procs.append(p)
        time.sleep(1)  # Stagger start slightly

    # Wait for all
    for p in procs:
        assert p.wait() == 0, f"Process failed with code {p.returncode}"
    print("PASSED: test_parallel_processes")


if __name__ == "__main__":
    test_parallel_processes()
```
### Phase 4: Documentation Update [pending]
**File**: `docs/torch_distributed_port_issue.md`
Update the document to mark the issue as resolved by dynamic port allocation.
---
## Usage After Fix
### Scenario 1: Parallel multi-process tests (primary use case)
```bash
# No extra configuration needed - just run
CUDA_VISIBLE_DEVICES=0 python test_group1.py &
CUDA_VISIBLE_DEVICES=1 python test_group2.py &
CUDA_VISIBLE_DEVICES=2 python test_group3.py &
wait
```
### Scenario 2: Sequential creation in the same process (also supported)
```python
for i in range(3):
    with LLM(model_path) as llm:
        outputs = llm.generate(prompts, params)
    # Cleaned up automatically; the next instance gets a fresh random port
```
### Scenario 3: Manually specified port (backward compatible)
```bash
NANOVLLM_DIST_PORT=2333 python test.py
```
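Scenario 2 only works if cleanup is idempotent and runs exactly once. A minimal sketch of the Phase 2 pattern (guard flag plus `atexit` register/unregister), using a counter as a stand-in for the real resource release, so `close_count` here is illustrative rather than part of the actual `LLMEngine`:

```python
import atexit

class Engine:
    """Dummy stand-in for LLMEngine's cleanup protocol."""

    def __init__(self):
        self.close_count = 0
        self._closed = False
        atexit.register(self._atexit_handler)  # safety net for a forgotten close()

    def _atexit_handler(self):
        if not self._closed:
            self.close()

    def close(self):
        if self._closed:        # idempotent: a second close is a no-op
            return
        self._closed = True
        try:
            atexit.unregister(self._atexit_handler)  # avoid a redundant close at exit
        except Exception:
            pass
        self.close_count += 1   # the real engine releases ports/processes here

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()
        return False

with Engine() as e:
    pass                        # __exit__ closes once
e.close()                       # explicit second close is harmless
assert e.close_count == 1
```

The same structure lets `__del__`, `atexit`, the context manager, and an explicit `close()` all coexist without double-releasing anything.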
---
## Success Criteria
- [ ] Multiple independent processes can run simultaneously (on different GPUs)
- [ ] No manual port configuration required
- [ ] Backward compatible (the environment variable still works)
- [ ] Sequential creation in the same process works
- [ ] Resources are cleaned up correctly
---
## Files to Modify
| File | Action | Status |
|------|--------|--------|
| `nanovllm/engine/model_runner.py` | Add `_find_free_port()` | pending |
| `nanovllm/engine/llm_engine.py` | Add `close()`, context manager | pending |
| `tests/test_multiple_processes.py` | Create | pending |
| `docs/torch_distributed_port_issue.md` | Update | pending |
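The core of the Phase 1 fix can be exercised standalone. Binding to port 0 asks the OS for an ephemeral port; the sketch below checks that the returned port is immediately bindable again. Note the small inherent race: another process could claim the port between discovery and `init_process_group`, which is accepted here because ephemeral-port collisions are rare in practice.

```python
import socket

def _find_free_port() -> int:
    """Ask the OS for an unused TCP port (port 0 = ephemeral allocation)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind(('', 0))
        return s.getsockname()[1]

port = _find_free_port()
assert 0 < port <= 65535

# The socket above is closed, so a consumer can now bind the discovered port.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(('', port))  # would raise OSError if the port were taken
```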

---
**New file**: `tests/run_parallel_niah.sh` (executable, 112 lines)
#!/bin/bash
# Run NIAH tests in parallel on 6 GPUs
# This tests the dynamic port allocation fix
set -e
MODEL="${1:-/home/zijie/models/Llama-3.1-8B-Instruct}"
PROJECT_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
echo "=========================================="
echo "Parallel NIAH Test on 6 GPUs"
echo "=========================================="
echo "Model: $MODEL"
echo "Project: $PROJECT_ROOT"
echo ""
# Sample distribution (100 samples total):
# GPU 0: 0-16 (17 samples)
# GPU 1: 17-33 (17 samples)
# GPU 2: 34-50 (17 samples)
# GPU 3: 51-67 (17 samples)
# GPU 4: 68-83 (16 samples)
# GPU 5: 84-99 (16 samples)
declare -a RANGES=("0-16" "17-33" "34-50" "51-67" "68-83" "84-99")
declare -a PIDS=()
# Create log directory
LOG_DIR="$PROJECT_ROOT/logs"
mkdir -p "$LOG_DIR"
# Start all 6 processes
for gpu in {0..5}; do
range="${RANGES[$gpu]}"
log_file="$LOG_DIR/gpu${gpu}_${range}.log"
echo "Starting GPU $gpu: samples $range -> $log_file"
CUDA_VISIBLE_DEVICES=$gpu PYTHONPATH="$PROJECT_ROOT:$PYTHONPATH" \
python "$PROJECT_ROOT/tests/test_ruler_niah.py" \
--model "$MODEL" \
--sample-indices "$range" \
--enable-offload \
--num-gpu-blocks 4 \
--quiet \
> "$log_file" 2>&1 &
PIDS+=($!)
# Small delay to stagger starts
sleep 2
done
echo ""
echo "All 6 processes started. Waiting for completion..."
echo "PIDs: ${PIDS[*]}"
echo ""
# Wait for all processes and collect results
declare -a RESULTS=()
ALL_PASSED=true
for i in {0..5}; do
pid="${PIDS[$i]}"
range="${RANGES[$i]}"
log_file="$LOG_DIR/gpu${i}_${range}.log"
if wait $pid; then
RESULTS+=("GPU $i ($range): PASSED")
echo "GPU $i completed successfully"
else
RESULTS+=("GPU $i ($range): FAILED (exit code $?)")
ALL_PASSED=false
echo "GPU $i FAILED!"
fi
done
echo ""
echo "=========================================="
echo "RESULTS SUMMARY"
echo "=========================================="
for result in "${RESULTS[@]}"; do
echo "$result"
done
echo ""
# Show accuracy from each log
echo "Accuracy per GPU:"
for i in {0..5}; do
range="${RANGES[$i]}"
log_file="$LOG_DIR/gpu${i}_${range}.log"
if [ -f "$log_file" ]; then
accuracy=$(grep -E "Accuracy:|accuracy" "$log_file" | tail -1 || echo "N/A")
port=$(grep "Auto-assigned distributed port" "$log_file" | head -1 || echo "N/A")
echo " GPU $i ($range): $accuracy | $port"
fi
done
echo ""
if $ALL_PASSED; then
echo "=========================================="
echo "ALL 6 TESTS PASSED!"
echo "Dynamic port allocation works correctly."
echo "=========================================="
exit 0
else
echo "=========================================="
echo "SOME TESTS FAILED!"
echo "Check logs in $LOG_DIR"
echo "=========================================="
exit 1
fi

---
**New file**: `tests/test_port_conflict.py` (198 lines)
"""Test for torch distributed port conflict fix.
This test verifies that:
1. Multiple independent processes can run simultaneously (dynamic port allocation)
2. Sequential LLM creation in same process works (proper cleanup)
Usage:
# Test parallel processes (requires 2 GPUs)
python tests/test_port_conflict.py --model ~/models/Qwen3-4B --gpus 4,5 --test parallel
# Test sequential creation in same process
CUDA_VISIBLE_DEVICES=4 python tests/test_port_conflict.py --model ~/models/Qwen3-4B --test sequential
"""
import argparse
import os
import subprocess
import sys
import time
def test_sequential_creation(model_path: str, enable_offload: bool = True):
"""Test creating multiple LLM instances sequentially in same process."""
# Add project root to path
project_root = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.insert(0, project_root)
from nanovllm import LLM, SamplingParams
print("=" * 60)
print("Test: Sequential LLM Creation (same process)")
print("=" * 60)
for i in range(3):
print(f"\n--- Creating LLM instance {i+1}/3 ---")
llm_kwargs = {"enable_cpu_offload": enable_offload}
if enable_offload:
llm_kwargs["num_gpu_blocks"] = 2
llm = LLM(model_path, **llm_kwargs)
# Simple generation
outputs = llm.generate(
["Hello, how are you?"],
SamplingParams(max_tokens=20)
)
print(f"Output: {outputs[0]['text'][:50]}...")
# Explicit cleanup
llm.close()
print(f"Instance {i+1} closed successfully")
print("\n" + "=" * 60)
print("PASSED: test_sequential_creation")
print("=" * 60)
def test_context_manager(model_path: str, enable_offload: bool = True):
"""Test LLM with context manager."""
project_root = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.insert(0, project_root)
from nanovllm import LLM, SamplingParams
print("=" * 60)
print("Test: Context Manager")
print("=" * 60)
for i in range(2):
print(f"\n--- Context manager instance {i+1}/2 ---")
llm_kwargs = {"enable_cpu_offload": enable_offload}
if enable_offload:
llm_kwargs["num_gpu_blocks"] = 2
with LLM(model_path, **llm_kwargs) as llm:
outputs = llm.generate(
["What is 2+2?"],
SamplingParams(max_tokens=20)
)
print(f"Output: {outputs[0]['text'][:50]}...")
print(f"Instance {i+1} auto-closed via context manager")
print("\n" + "=" * 60)
print("PASSED: test_context_manager")
print("=" * 60)
def test_parallel_processes(model_path: str, gpus: str, enable_offload: bool = True):
"""Test running multiple nanovllm processes in parallel."""
gpu_list = [int(g.strip()) for g in gpus.split(",")]
if len(gpu_list) < 2:
print("ERROR: Need at least 2 GPUs for parallel test")
return False
print("=" * 60)
print(f"Test: Parallel Processes (GPUs: {gpu_list})")
print("=" * 60)
project_root = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
# Script to run in each subprocess
script = f'''
import sys
sys.path.insert(0, "{project_root}")
import os
from nanovllm import LLM, SamplingParams
gpu = os.environ.get("CUDA_VISIBLE_DEVICES", "?")
print(f"[GPU {{gpu}}] Starting LLM...")
llm_kwargs = {{"enable_cpu_offload": {enable_offload}}}
if {enable_offload}:
llm_kwargs["num_gpu_blocks"] = 2
llm = LLM("{model_path}", **llm_kwargs)
print(f"[GPU {{gpu}}] LLM initialized, generating...")
outputs = llm.generate(["Hello world"], SamplingParams(max_tokens=10))
print(f"[GPU {{gpu}}] Output: {{outputs[0]['text'][:30]}}...")
llm.close()
print(f"[GPU {{gpu}}] Done")
'''
# Start processes on different GPUs
procs = []
for i, gpu in enumerate(gpu_list[:2]): # Use first 2 GPUs
print(f"\nStarting process on GPU {gpu}...")
env = os.environ.copy()
env["CUDA_VISIBLE_DEVICES"] = str(gpu)
p = subprocess.Popen(
[sys.executable, "-c", script],
env=env,
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT,
text=True
)
procs.append((gpu, p))
time.sleep(2) # Stagger starts to see concurrent running
# Wait and collect results
all_passed = True
for gpu, p in procs:
stdout, _ = p.communicate(timeout=300)
print(f"\n--- GPU {gpu} output ---")
print(stdout)
if p.returncode != 0:
print(f"ERROR: GPU {gpu} process failed with code {p.returncode}")
all_passed = False
else:
print(f"GPU {gpu} process completed successfully")
print("\n" + "=" * 60)
if all_passed:
print("PASSED: test_parallel_processes")
else:
print("FAILED: test_parallel_processes")
print("=" * 60)
return all_passed
def main():
parser = argparse.ArgumentParser(description="Test port conflict fix")
parser.add_argument("--model", "-m", required=True, help="Path to model")
parser.add_argument("--gpus", default="0,1", help="GPUs to use for parallel test (comma-separated)")
parser.add_argument("--test", choices=["sequential", "context", "parallel", "all"],
default="all", help="Which test to run")
parser.add_argument("--no-offload", action="store_true", help="Disable CPU offload")
args = parser.parse_args()
enable_offload = not args.no_offload
model_path = os.path.expanduser(args.model)
print(f"Model: {model_path}")
print(f"CPU Offload: {enable_offload}")
print(f"GPUs for parallel test: {args.gpus}")
print()
if args.test in ["sequential", "all"]:
test_sequential_creation(model_path, enable_offload)
print()
if args.test in ["context", "all"]:
test_context_manager(model_path, enable_offload)
print()
if args.test in ["parallel", "all"]:
test_parallel_processes(model_path, args.gpus, enable_offload)
if __name__ == "__main__":
main()