Merge branch 'zijie/fix-dist-3': Fix distributed port conflict
- Auto port allocation with _find_free_port() in model_runner.py
- Resource management refactor with close() + context manager in llm_engine.py
- Add tests/test_port_conflict.py and tests/run_parallel_niah.sh
- Remove docs/torch_distributed_port_issue.md (issue fixed)
- Ignore tests/data/ directory

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
`.gitignore` (vendored, 3 lines changed)
@@ -224,3 +224,6 @@ coordination/orchestration/*
 claude-flow
 # Removed Windows wrapper files per user request
 hive-mind-prompt-*.txt
+
+# Test data
+tests/data/
`CLAUDE.md` (15 lines changed)
@@ -22,19 +22,9 @@ while [ -n "$(nvidia-smi --query-compute-apps=pid --format=csv,noheader)" ]; do
 done
 ```
 
-### Other Scripts (tests, examples) - Port Conflict Check Only
+### Other Scripts (tests, examples) - No Special Requirements
 
-For non-benchmark scripts, exclusive GPU access is NOT required. However, check for **distributed port conflicts** before running:
+For non-benchmark scripts, exclusive GPU access is NOT required. Multiple nanovllm processes can run simultaneously on different GPUs - each process automatically selects a unique port for `torch.distributed` communication.
 
-```bash
-# Check if port 2333 (nanovllm default) is in use
-if lsof -i :2333 >/dev/null 2>&1; then
-    echo "Port 2333 in use, waiting 10s..."
-    sleep 10
-fi
-```
-
-**Note**: nanovllm uses port 2333 for `torch.distributed`. See [`docs/torch_distributed_port_issue.md`](docs/torch_distributed_port_issue.md) for known issues with creating multiple LLM instances in the same process.
-
 ## Multi-Instance Development with PYTHONPATH
 
@@ -68,7 +58,6 @@ PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH python tests/test_needle.py
 | [`docs/layerwise_offload_memory_analysis.md`](docs/layerwise_offload_memory_analysis.md) | Memory allocation analysis with theoretical formulas and empirical validation (< 5% error) |
 | [`docs/debugging_guide.md`](docs/debugging_guide.md) | PyTorch hooks for debugging, tensor comparison, memory profiling |
 | [`docs/gpu_only_performance_issue.md`](docs/gpu_only_performance_issue.md) | GPU-only mode slower than offload due to PagedAttention scatter overhead, optimization proposals |
-| [`docs/torch_distributed_port_issue.md`](docs/torch_distributed_port_issue.md) | **BUG**: Port conflict when creating multiple LLM instances, root cause and proposed solutions |
 | [`docs/offload_accuracy_issue.md`](docs/offload_accuracy_issue.md) | **BUG**: CPU offload mode 66% accuracy vs 100% non-offload on RULER NIAH benchmark |
 
 ## Configuration
`docs/torch_distributed_port_issue.md` (deleted, 308 lines)

@@ -1,308 +0,0 @@
# Torch Distributed Port Conflict Issue

## Problem Summary

When attempting to create multiple `LLM` instances sequentially in the same Python process (e.g., for grouped testing), the second and subsequent instances fail with:

```
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address.
port: 2333, useIpv6: false, code: -98, name: EADDRINUSE, message: address already in use
```

## Root Cause Analysis

### 1. Distributed Process Group Initialization

In `nanovllm/engine/model_runner.py:30-32`:

```python
import os
port = os.environ.get("NANOVLLM_DIST_PORT", "2333")
dist.init_process_group("nccl", f"tcp://localhost:{port}", world_size=self.world_size, rank=rank)
```

- The default port is **2333** (configurable via the `NANOVLLM_DIST_PORT` env var)
- `init_process_group()` binds a TCP socket to this port
- The binding persists until `destroy_process_group()` is called

### 2. Cleanup Mechanism

In `nanovllm/engine/llm_engine.py:37`:

```python
atexit.register(self.exit)
```

In `nanovllm/engine/llm_engine.py:39-43`:

```python
def exit(self):
    self.model_runner.call("exit")
    del self.model_runner
    for p in self.ps:
        p.join()
```

In `nanovllm/engine/model_runner.py:66-78`:

```python
def exit(self):
    # ... cleanup code ...
    dist.destroy_process_group()
```

### 3. The Problem

**`atexit` only triggers when the Python interpreter exits, NOT when the object is deleted or goes out of scope.**

Timeline of the bug:

```
1. Create LLM instance #1
   ├── init_process_group() binds port 2333 ✓
   └── atexit.register(self.exit) registered

2. LLM #1 goes out of scope (garbage collected)
   ├── Python's GC deletes the object
   ├── BUT the atexit handler has NOT been triggered yet
   └── Port 2333 is still bound! ❌

3. Create LLM instance #2
   ├── init_process_group() tries to bind port 2333
   └── EADDRINUSE error! ❌

4. Program exits (only now does atexit run)
   └── Too late - already crashed
```
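The atexit pitfall in this timeline can be reproduced with plain Python; `FakeEngine` below is a hypothetical stand-in for `LLMEngine`, not nanovllm code:

```python
import atexit

cleaned = []

class FakeEngine:
    """Hypothetical stand-in for LLMEngine: registers cleanup via atexit."""
    def __init__(self):
        atexit.register(self.exit)

    def exit(self):
        cleaned.append("port released")

engine = FakeEngine()
del engine  # the name is gone, but nothing has been cleaned up

# atexit.register() also keeps a reference to the bound method, so the
# object is not even garbage collected here; the handler runs only when
# the interpreter exits.
print(cleaned)  # → []
```

In the real engine, the state that survives the `del` is the bound TCP socket, which is exactly why a second `init_process_group()` on the same port fails.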
## Impact

This issue affects:

1. **Grouped testing mode** (`test_ruler_niah.py --group-size N`)
   - Each group needs a fresh LLM instance
   - The second group fails with a port conflict

2. **Multiple LLM instances in the same process**
   - Any code that creates an LLM, deletes it, then creates another

3. **Interactive/notebook usage**
   - Re-running cells that create LLM instances

## Proposed Solutions

### Solution A: Add `__del__` Method (Quick Fix)

Add a destructor to `LLMEngine` that calls cleanup:

```python
# In nanovllm/engine/llm_engine.py

def __del__(self):
    try:
        self.exit()
    except Exception:
        pass  # Ignore errors during cleanup
```

**Pros**: Simple, backwards compatible
**Cons**: `__del__` is not guaranteed to be called (circular references, etc.)

### Solution B: Context Manager Pattern (Recommended)

Make `LLMEngine` a context manager:

```python
# In nanovllm/engine/llm_engine.py

def __enter__(self):
    return self

def __exit__(self, exc_type, exc_val, exc_tb):
    self.exit()
    return False
```

Usage:
```python
with LLM(model_path) as llm:
    outputs = llm.generate(prompts, params)
# Cleanup happens automatically here
```
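The guarantee Solution B relies on can be checked with a dependency-free sketch (a dummy engine, not nanovllm's): `__exit__` runs even when the body raises.

```python
class FakeEngine:
    """Dummy engine used to illustrate the context manager contract."""
    def __init__(self):
        self.closed = False

    def exit(self):
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.exit()
        return False  # do not suppress the exception

eng = FakeEngine()
try:
    with eng:
        raise RuntimeError("generation failed mid-run")
except RuntimeError:
    pass

print(eng.closed)  # → True: cleanup ran despite the exception
```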
**Pros**: Explicit, guaranteed cleanup, Pythonic
**Cons**: Requires a usage pattern change

### Solution C: Check and Cleanup Before Init (Defensive)

In `ModelRunner.__init__`, check whether a process group already exists:

```python
# In nanovllm/engine/model_runner.py

if dist.is_initialized():
    dist.destroy_process_group()
dist.init_process_group("nccl", f"tcp://localhost:{port}", ...)
```

**Pros**: Self-healing, no usage pattern change
**Cons**: May mask other issues, manipulates global state

### Solution D: Subprocess Isolation (For Testing)

For grouped testing specifically, run each group in a subprocess:

```python
import subprocess
for group in groups:
    subprocess.run([sys.executable, "test_ruler_niah.py",
                    "--sample-indices", f"{start}-{end}"])
```

**Pros**: Complete isolation, no code changes to nanovllm
**Cons**: More overhead, only solves the testing use case

### Solution E: Dynamic Port Allocation

Instead of the fixed port 2333, use a dynamically assigned port:

```python
import socket

def find_free_port():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(('', 0))
        return s.getsockname()[1]

port = os.environ.get("NANOVLLM_DIST_PORT") or find_free_port()
```

**Pros**: Avoids conflicts entirely
**Cons**: More complex, may have side effects

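The `find_free_port()` helper from Solution E is runnable as-is; the sketch below also spells out the caveat hiding behind "may have side effects": the port is only known to be free at the moment of the check.

```python
import socket

def find_free_port() -> int:
    # Binding to port 0 asks the OS to pick any unused port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

port = find_free_port()
print(0 < port < 65536)  # → True

# Caveat: the socket is closed before the caller re-binds the port, so
# another process could grab it in between (a TOCTOU race). In practice
# the window is tiny, but it is not zero.
```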
## Recommended Implementation

**Combine Solutions A + B + C** for maximum robustness:

1. Add `__del__` for best-effort cleanup
2. Add a context manager for explicit cleanup
3. Add an `is_initialized()` check as a defensive measure

```python
# nanovllm/engine/llm_engine.py

class LLMEngine:
    def __init__(self, model, **kwargs):
        # ... existing code ...
        atexit.register(self.exit)
        self._exited = False

    def exit(self):
        if self._exited:
            return
        self._exited = True
        self.model_runner.call("exit")
        del self.model_runner
        for p in self.ps:
            p.join()

    def __del__(self):
        try:
            self.exit()
        except Exception:
            pass

    def __enter__(self):
        return self

    def __exit__(self, *args):
        self.exit()
        return False


# nanovllm/engine/model_runner.py

class ModelRunner:
    def __init__(self, config: Config, rank: int, event):
        # ... existing code before init_process_group ...

        import os
        port = os.environ.get("NANOVLLM_DIST_PORT", "2333")

        # Defensive cleanup
        if dist.is_initialized():
            dist.destroy_process_group()

        dist.init_process_group("nccl", f"tcp://localhost:{port}",
                                world_size=self.world_size, rank=rank)
        # ... rest of init ...
```
## Workaround for Current Code

Until the fix is implemented, use one of these workarounds:

### Workaround 1: Manual Cleanup

```python
import torch.distributed as dist

llm = LLM(model_path)
outputs = llm.generate(...)
llm.model_runner.call("exit")  # Manual cleanup
del llm

# Now a new LLM can be created
llm2 = LLM(model_path)
```

### Workaround 2: Subprocess Testing

```bash
# Run each test group as a separate process
for i in $(seq 0 5 95); do
    python test_ruler_niah.py --sample-indices $i-$((i+4)) --enable-offload
done
```

### Workaround 3: Environment Variable Port

```bash
# Use a different port for each run
NANOVLLM_DIST_PORT=2334 python test.py
NANOVLLM_DIST_PORT=2335 python test.py
```

## Related Files

| File | Relevant Code |
|------|---------------|
| `nanovllm/engine/model_runner.py:30-32` | `init_process_group()` call |
| `nanovllm/engine/model_runner.py:66-78` | `exit()` and `destroy_process_group()` |
| `nanovllm/engine/llm_engine.py:37` | `atexit.register()` |
| `nanovllm/engine/llm_engine.py:39-43` | `exit()` method |

## Testing the Fix

After implementing the fix, verify with:

```python
# test_multiple_llm.py
from nanovllm import LLM, SamplingParams

for i in range(3):
    print(f"Creating LLM instance {i+1}")
    llm = LLM("path/to/model", enable_cpu_offload=True)
    outputs = llm.generate(["Hello"], SamplingParams(max_tokens=10))
    print(f"Instance {i+1} output: {outputs[0]['text']}")
    del llm
    print(f"Instance {i+1} deleted\n")

print("All instances created and deleted successfully!")
```

Expected: no port conflict errors; all 3 instances work.

## Priority

**High** - This blocks grouped testing and any multi-LLM-instance workflows.
`findings.md` (275 lines changed)
@@ -1,160 +1,169 @@
-# Findings: Multi-Model Support Analysis
-
-## Current Architecture Analysis
-
-### Model Loading Flow
-```
-LLM(model_path)
-  → LLMEngine.__init__()
-    → Config.__post_init__()
-      → hf_config = AutoConfig.from_pretrained(model)
-    → ModelRunner.__init__()
-      → model = Qwen3ForCausalLM(hf_config)  ← HARDCODED
-      → load_model(model, config.model)
-```
-
-### Key Files
-| File | Purpose |
-|------|---------|
-| `nanovllm/engine/model_runner.py` | Model loading and execution |
-| `nanovllm/models/qwen3.py` | Qwen3 model definition |
-| `nanovllm/utils/loader.py` | safetensors weight loading |
-| `nanovllm/layers/rotary_embedding.py` | RoPE implementation |
-
----
-
-## Llama 3.1 Config Analysis
-
-```json
-{
-  "architectures": ["LlamaForCausalLM"],
-  "model_type": "llama",
-  "attention_bias": false,
-  "mlp_bias": false,
-  "head_dim": 128,
-  "hidden_size": 4096,
-  "intermediate_size": 14336,
-  "num_attention_heads": 32,
-  "num_hidden_layers": 32,
-  "num_key_value_heads": 8,
-  "hidden_act": "silu",
-  "rms_norm_eps": 1e-05,
-  "rope_theta": 500000.0,
-  "rope_scaling": {
-    "factor": 8.0,
-    "high_freq_factor": 4.0,
-    "low_freq_factor": 1.0,
-    "original_max_position_embeddings": 8192,
-    "rope_type": "llama3"
-  },
-  "max_position_embeddings": 131072,
-  "tie_word_embeddings": false,
-  "vocab_size": 128256
-}
-```
-
-### Llama 3 RoPE Scaling
-Llama 3 uses a special RoPE scaling strategy (`rope_type: "llama3"`):
-- high-frequency components are kept unchanged (they encode short-range dependencies)
-- low-frequency components are scaled/interpolated (they encode long-range dependencies)
-- parameters: `factor`, `low_freq_factor`, `high_freq_factor`, `original_max_position_embeddings`
-
-Reference implementation (transformers):
-```python
-def _compute_llama3_parameters(config, device, inv_freq):
-    factor = config.factor
-    low_freq_factor = config.low_freq_factor
-    high_freq_factor = config.high_freq_factor
-    old_context_len = config.original_max_position_embeddings
-
-    low_freq_wavelen = old_context_len / low_freq_factor
-    high_freq_wavelen = old_context_len / high_freq_factor
-
-    wavelen = 2 * math.pi / inv_freq
-    inv_freq_llama = torch.where(
-        wavelen > low_freq_wavelen,
-        inv_freq / factor,
-        inv_freq
-    )
-    smooth_factor = (old_context_len / wavelen - low_freq_factor) / (high_freq_factor - low_freq_factor)
-    smoothed_inv_freq = (1 - smooth_factor) * inv_freq_llama + smooth_factor * inv_freq
-    is_medium_freq = (wavelen >= high_freq_wavelen) & (wavelen <= low_freq_wavelen)
-    inv_freq_llama = torch.where(is_medium_freq, smoothed_inv_freq, inv_freq_llama)
-    return inv_freq_llama
-```
-
----
-
-## Weight Mapping Analysis
-
-### Qwen3 packed_modules_mapping
-```python
-packed_modules_mapping = {
-    "q_proj": ("qkv_proj", "q"),
-    "k_proj": ("qkv_proj", "k"),
-    "v_proj": ("qkv_proj", "v"),
-    "gate_proj": ("gate_up_proj", 0),
-    "up_proj": ("gate_up_proj", 1),
-}
-```
-
-### Llama Weight Names (from safetensors)
-Expected Llama weight names follow the same pattern as Qwen3:
-- `model.layers.{i}.self_attn.q_proj.weight`
-- `model.layers.{i}.self_attn.k_proj.weight`
-- `model.layers.{i}.self_attn.v_proj.weight`
-- `model.layers.{i}.self_attn.o_proj.weight`
-- `model.layers.{i}.mlp.gate_proj.weight`
-- `model.layers.{i}.mlp.up_proj.weight`
-- `model.layers.{i}.mlp.down_proj.weight`
-- `model.layers.{i}.input_layernorm.weight`
-- `model.layers.{i}.post_attention_layernorm.weight`
-
-**Conclusion**: Llama's `packed_modules_mapping` is identical to Qwen3's and can be reused.
-
----
-
-## Shared Components (Can Reuse)
-
-| Component | File | Notes |
-|-----------|------|-------|
-| `RMSNorm` | `layers/layernorm.py` | generic |
-| `SiluAndMul` | `layers/activation.py` | generic |
-| `Attention` | `layers/attention.py` | FlashAttention wrapper |
-| `QKVParallelLinear` | `layers/linear.py` | supports bias=False |
-| `RowParallelLinear` | `layers/linear.py` | generic |
-| `MergedColumnParallelLinear` | `layers/linear.py` | generic |
-| `VocabParallelEmbedding` | `layers/embed_head.py` | generic |
-| `ParallelLMHead` | `layers/embed_head.py` | generic |
-| `load_model` | `utils/loader.py` | generic |
-
----
-
-## Llama vs Qwen3 Implementation Diff
-
-### Attention
-| Feature | Qwen3Attention | LlamaAttention |
-|---------|----------------|----------------|
-| QKV bias | configurable (attention_bias) | always False |
-| q_norm | present (when bias=False) | absent |
-| k_norm | present (when bias=False) | absent |
-| RoPE | Standard | Llama3 scaled |
-
-### MLP
-| Feature | Qwen3MLP | LlamaMLP |
-|---------|----------|----------|
-| gate/up bias | False | False |
-| down bias | False | False |
-| hidden_act | silu | silu |
-
-**Conclusion**: Llama's MLP is nearly identical to Qwen3's and can be reused directly or simplified.
-
----
-
-## Risk Assessment
-
-| Risk | Impact | Mitigation |
-|------|--------|------------|
-| RoPE implementation error | High - wrong model output | Follow the transformers reference, add unit tests |
-| Weight mapping error | High - model fails to load | Inspect safetensors key names |
-| Registry circular imports | Medium - startup failure | Lazy imports |
+# Findings: Torch Distributed Port Conflict
+
+## Problem Analysis
+
+### Issue Summary
+Creating multiple LLM instances raises a port conflict (EADDRINUSE), so the second instance fails to start.
+
+### Root Cause Deep Dive
+
+#### 1. Where the Resource Is Bound
+
+```python
+# nanovllm/engine/model_runner.py:30-32
+import os
+port = os.environ.get("NANOVLLM_DIST_PORT", "2333")
+dist.init_process_group("nccl", f"tcp://localhost:{port}", world_size=self.world_size, rank=rank)
+```
+
+- The default port is **2333**, configurable via the `NANOVLLM_DIST_PORT` env var
+- `init_process_group()` binds a TCP port for inter-process communication
+- The binding persists until `destroy_process_group()` is called
+
+#### 2. Flawed Cleanup Mechanism
+
+```python
+# nanovllm/engine/llm_engine.py:37
+atexit.register(self.exit)
+
+# nanovllm/engine/llm_engine.py:39-43
+def exit(self):
+    self.model_runner.call("exit")
+    del self.model_runner
+    for p in self.ps:
+        p.join()
+
+# nanovllm/engine/model_runner.py:66-78
+def exit(self):
+    # ... cleanup code ...
+    dist.destroy_process_group()
+```
+
+**Key problem**: `atexit` fires only when the **Python interpreter exits**, not when the object is deleted!
+
+#### 3. Bug Timeline
+
+```
+1. Create LLM #1
+   ├── init_process_group() binds port 2333 ✓
+   └── atexit.register(self.exit) registered
+
+2. LLM #1 goes out of scope or is del'd
+   ├── Python's GC reclaims the object's memory
+   ├── atexit handler not triggered (the process has not exited)
+   ├── Worker processes still running
+   └── Port 2333 still occupied ❌
+
+3. Create LLM #2
+   ├── init_process_group() tries to bind port 2333
+   └── EADDRINUSE error ❌
+
+4. Program exits (only now does atexit run)
+   └── Too late - already crashed
+```
+
+---
+
+## Solution Analysis
+
+### Option Comparison
+
+| Option | Reliability | Backward compatible | Complexity | Recommendation |
+|--------|-------------|---------------------|------------|----------------|
+| `close()` method | Highest | Yes | Low | ★★★★★ |
+| `__del__` method | Medium | Yes | Low | ★★★☆☆ |
+| Port check + retry | Medium | Yes | Low | ★★★☆☆ |
+| Context manager | Highest | Requires code changes | Low | ★★★★☆ |
+| Dynamic port | Low | Yes | Low | ★★☆☆☆ |
+
+### Why Three Layers of Defense
+
+1. **Layer 1: close()** - explicit user control, most reliable
+2. **Layer 2: __del__** - automatic cleanup, covers most scenarios
+3. **Layer 3: port check** - last line of defense, provides a clear error message
+
+### Limitations of `__del__`
+
+Python's `__del__` is not guaranteed to run:
+- it may not fire when reference cycles exist
+- modules it depends on may already be torn down during interpreter shutdown
+- critical resource cleanup should not rely on `__del__`
+
+It is still worthwhile as an **extra layer of protection**, because:
+- it is invoked in the majority of cases
+- it is better than nothing
+- it does not interfere with the other cleanup mechanisms
+
+---
+
+## Code Structure Analysis
+
+### LLMEngine Lifecycle
+
+```
+__init__()
+├── create worker processes (self.ps)
+├── create ModelRunner (self.model_runner)
+├── register atexit handler
+└── set up scheduler, tokenizer
+
+close() [new]
+├── check the _closed flag (idempotent)
+├── unregister the atexit handler
+├── call model_runner.exit()
+├── join worker processes
+└── set _closed = True
+
+__del__() [new]
+└── call close() (ignoring exceptions)
+
+__enter__/__exit__() [new]
+└── context manager support
+```
+
+### ModelRunner Resources
+
+```
+__init__()
+├── torch.distributed init (binds the port)
+├── model loading
+├── KV cache allocation
+├── CUDA graph capture (optional)
+└── SharedMemory creation (multi-GPU)
+
+exit()
+├── SharedMemory cleanup
+├── CUDA graph cleanup
+└── dist.destroy_process_group()
+```
+
+---
+
+## Risk Assessment
+
+| Risk | Impact | Mitigation |
+|------|--------|------------|
+| `__del__` not called | Medium - port leak | Layer-3 port check yields a clear error |
+| Repeated close() calls | Low | `_closed` flag guarantees idempotency |
+| Double atexit invocation | Low | Prevented by unregistration |
+| Leftover child processes | High | join() ensures child processes exit |
+| CUDA resource leak | Medium | Cleaned up by ModelRunner.exit() |
+
+---
+
+## Implementation Notes
+
+### atexit.unregister Compatibility
+- Supported on Python 3.7+
+- Must be passed the same function object
+- Use `self._atexit_handler` rather than `self.exit` so unregistration works correctly
+
+### Port Detection Method
+
+```python
+def _check_port_available(port: int, host: str = "localhost") -> bool:
+    """Detect whether a port is occupied using socket connect_ex."""
+    try:
+        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
+            s.settimeout(1)
+            result = s.connect_ex((host, port))
+            return result != 0  # 0 = connected = port in use
+    except Exception:
+        return True  # assume available
+```
+
+**Note**: this check has a TOCTOU (time-of-check to time-of-use) race condition, but it is sufficient for our use case.
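The connect_ex-based check from the findings above can be exercised end to end with only the standard library; the listening socket here is just a stand-in for a running nanovllm process:

```python
import socket

def _check_port_available(port: int, host: str = "localhost") -> bool:
    """Probe a port with connect_ex: a result of 0 means something accepted, i.e. in use."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(1)
            return s.connect_ex((host, port)) != 0
    except Exception:
        return True  # on probe failure, assume available

# Occupy a port with a listening socket (stand-in for a running process).
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("localhost", 0))
server.listen(1)
busy_port = server.getsockname()[1]

in_use = not _check_port_available(busy_port)
print(in_use)  # → True: the probe sees the listener
server.close()
```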
`nanovllm/engine/llm_engine.py`

@@ -34,14 +34,56 @@ class LLMEngine:
         # Set Sequence.block_size to match the KV cache block size
         Sequence.block_size = config.kvcache_block_size
         self.scheduler = Scheduler(config, self.model_runner.kvcache_manager)
-        atexit.register(self.exit)
+        self._closed = False
+        atexit.register(self._atexit_handler)
 
-    def exit(self):
+    def _atexit_handler(self):
+        """Handler for atexit - only runs if close() wasn't called."""
+        if not self._closed:
+            self.close()
+
+    def close(self):
+        """Explicitly close the engine and release all resources.
+
+        This method is idempotent - calling it multiple times is safe.
+        Supports: explicit close(), context manager, and __del__ fallback.
+        """
+        if self._closed:
+            return
+        self._closed = True
+
+        # Unregister atexit to prevent double cleanup
+        try:
+            atexit.unregister(self._atexit_handler)
+        except Exception:
+            pass
+
+        # Cleanup resources
         self.model_runner.call("exit")
         del self.model_runner
         for p in self.ps:
             p.join()
 
+    def exit(self):
+        """Alias for close() - kept for backward compatibility."""
+        self.close()
+
+    def __del__(self):
+        """Destructor - attempt cleanup if not already done."""
+        try:
+            self.close()
+        except Exception:
+            pass
+
+    def __enter__(self):
+        """Context manager entry."""
+        return self
+
+    def __exit__(self, exc_type, exc_val, exc_tb):
+        """Context manager exit - ensures cleanup."""
+        self.close()
+        return False
+
     def add_request(self, prompt: str | list[int], sampling_params: SamplingParams):
         if isinstance(prompt, str):
             prompt = self.tokenizer.encode(prompt)
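The idempotent-close pattern introduced in the diff above can be verified in isolation; `Closeable` is a minimal sketch with a counter standing in for the real worker/port teardown, not nanovllm code:

```python
import atexit

class Closeable:
    """Minimal sketch of the idempotent-close pattern from the diff."""
    def __init__(self):
        self._closed = False
        self.close_count = 0
        atexit.register(self._atexit_handler)

    def _atexit_handler(self):
        # Safety net: only fires at interpreter exit if close() was skipped.
        if not self._closed:
            self.close()

    def close(self):
        if self._closed:
            return  # second and later calls are no-ops
        self._closed = True
        atexit.unregister(self._atexit_handler)  # prevent double cleanup
        self.close_count += 1  # real code releases workers/ports here

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.close()
        return False

c = Closeable()
c.close()
c.close()  # idempotent: nothing happens
print(c.close_count)  # → 1
```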
`nanovllm/engine/model_runner.py`

@@ -1,4 +1,6 @@
+import os
 import pickle
+import socket
 import torch
 import torch.distributed as dist
 from multiprocessing.synchronize import Event

@@ -16,6 +18,17 @@ from nanovllm.kvcache import create_kvcache_manager, KVCacheManager
 logger = get_logger("model_runner")
 
 
+def _find_free_port() -> int:
+    """Find a free port for distributed communication.
+
+    Uses socket binding with port 0 to let the OS assign an available port.
+    """
+    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
+        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
+        s.bind(('', 0))
+        return s.getsockname()[1]
+
+
 class ModelRunner:
 
     def __init__(self, config: Config, rank: int, event: Event | list[Event]):

@@ -27,8 +40,13 @@ class ModelRunner:
         self.rank = rank
         self.event = event
 
-        import os
-        port = os.environ.get("NANOVLLM_DIST_PORT", "2333")
+        # Dynamic port allocation: use the env var if set, otherwise find a free port
+        env_port = os.environ.get("NANOVLLM_DIST_PORT")
+        if env_port is not None:
+            port = int(env_port)
+        else:
+            port = _find_free_port()
+            logger.info(f"Auto-assigned distributed port: {port}")
         dist.init_process_group("nccl", f"tcp://localhost:{port}", world_size=self.world_size, rank=rank)
         torch.cuda.set_device(rank)
         default_dtype = torch.get_default_dtype()
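To make the port-selection logic from the diff above concrete, here is a dependency-free sketch; `pick_dist_port` is a hypothetical helper (the diff inlines this logic in `ModelRunner.__init__`), and the environment is passed in as a dict so the behaviour is easy to test:

```python
import socket

def _find_free_port() -> int:
    # Same idea as the diff: bind to port 0 and let the OS choose.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

def pick_dist_port(env: dict) -> int:
    """Hypothetical helper mirroring the diff: env var wins, else auto-assign."""
    env_port = env.get("NANOVLLM_DIST_PORT")
    if env_port is not None:
        return int(env_port)
    return _find_free_port()

print(pick_dist_port({"NANOVLLM_DIST_PORT": "2334"}))  # → 2334
auto_port = pick_dist_port({})
print(0 < auto_port < 65536)  # → True
```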
`progress.md` (127 lines changed)
@@ -1,76 +1,89 @@
|
|||||||
# Progress Log: Multi-Model Support
|
# Progress Log: Fix Torch Distributed Port Conflict
|
||||||
|
|
||||||
## Session: 2026-01-10
|
## Status: COMPLETED & CLEANED UP
|
||||||
|
|
||||||
### Initial Analysis Complete
|
## Session: 2026-01-12
|
||||||
|
|
||||||
**Time**: Session start
|
### Task Overview
|
||||||
|
修复在同一 Python 进程中顺序创建多个 LLM 实例时的 EADDRINUSE 端口冲突问题,以及支持多卡环境下同时启动多个独立进程。
|
||||||
**Actions:**
|
|
||||||
1. Read `nanovllm/engine/model_runner.py` - 确认硬编码位置 (line 35)
|
|
||||||
2. Read `nanovllm/models/qwen3.py` - 理解 Qwen3 模型结构
|
|
||||||
3. Read `nanovllm/utils/loader.py` - 理解权重加载机制
|
|
||||||
4. Read `nanovllm/layers/rotary_embedding.py` - 发现 RoPE scaling 限制
|
|
||||||
5. Read `/home/zijie/models/Llama-3.1-8B-Instruct/config.json` - 理解 Llama 配置
|
|
||||||
|
|
||||||
**Key Findings:**
|
|
||||||
- 模型加载在 `model_runner.py:35` 硬编码为 Qwen3
|
|
||||||
- RoPE 目前不支持 scaling (`assert rope_scaling is None`)
|
|
||||||
- Llama 3.1 需要 "llama3" 类型的 RoPE scaling
|
|
||||||
- Llama 无 q_norm/k_norm,无 attention bias
|
|
||||||
|
|
||||||
**Created:**
|
|
||||||
- `task_plan.md` - 6 阶段实施计划
|
|
||||||
- `findings.md` - 技术分析和发现
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
### Phase Status
|
### Phase Status
|
||||||
|
|
||||||
| Phase | Status | Notes |
|
| Phase | Description | Status |
|
||||||
|-------|--------|-------|
|
|-------|-------------|--------|
|
||||||
| 1. Model Registry | **COMPLETED** | `registry.py`, `__init__.py` |
|
| Phase 1 | ModelRunner 动态端口分配 | COMPLETED |
|
||||||
| 2. Llama3 RoPE | **COMPLETED** | `rotary_embedding.py` |
|
| Phase 2 | LLMEngine close() 和 context manager | COMPLETED |
|
||||||
| 3. Llama Model | **COMPLETED** | `llama.py` |
|
| Phase 3 | 测试验证(GPU 4,5) | COMPLETED |
|
||||||
| 4. ModelRunner | **COMPLETED** | Dynamic loading |
|
| Phase 4 | 更新文档 | COMPLETED |
|
||||||
| 5. Qwen3 Register | **COMPLETED** | `@register_model` decorator |
|
|
||||||
| 6. Testing | **COMPLETED** | Both Llama & Qwen3 pass |
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
-## Test Results
-
-### Llama 3.1-8B-Instruct (32K needle, GPU 0, offload)
-```
-Input: 32768 tokens
-Expected: 7492
-Output: 7492
-Status: PASSED
-Prefill: 1644 tok/s
-```
-
-### Qwen3-4B (8K needle, GPU 1, offload) - Regression Test
-```
-Input: 8192 tokens
-Expected: 7492
-Output: 7492
-Status: PASSED
-Prefill: 3295 tok/s
-```
+### Implementation Summary
+
+#### Phase 1: Dynamic Port Allocation
+**File**: `nanovllm/engine/model_runner.py`
+- Added `_find_free_port()` function using socket binding
+- Modified port selection logic: use env var if set, otherwise auto-assign
+- Added logging for auto-assigned ports
+
+#### Phase 2: Resource Cleanup Enhancement
+**File**: `nanovllm/engine/llm_engine.py`
+- Added `_closed` flag for idempotent cleanup
+- Added `close()` method for explicit resource release
+- Added `__del__()` for GC fallback
+- Added `__enter__()` and `__exit__()` for context manager support
+- Modified atexit registration to use `_atexit_handler`
+
+#### Phase 3: Testing (GPU 4,5)
+**File**: `tests/test_port_conflict.py`
+- Created comprehensive test script
+
+**Test Results**:
+| Test | Status | Notes |
+|------|--------|-------|
+| Sequential creation (3 instances) | PASSED | Ports: 50405, 47835, 53011 |
+| Context manager | PASSED | Auto-cleanup works |
+| Parallel processes (GPU 4,5) | PASSED | Ports: 34631, 56097 |
+
+#### Phase 4: Documentation
+**File**: `docs/torch_distributed_port_issue.md`
+- Updated status to RESOLVED
+- Documented solution details
+- Added usage examples
 
 ---
 
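Phase 1's auto-assignment leans on the OS ephemeral-port mechanism. A minimal standalone sketch of that mechanism, mirroring the `_find_free_port()` this commit adds (the name `find_free_port` here is illustrative):

```python
import socket

def find_free_port() -> int:
    # Binding to port 0 asks the OS for any currently free ephemeral port
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

port = find_free_port()
print(port)
```

Each call may return a different port, which is what lets several independent processes start without coordinating with each other.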
-## Files Modified This Session
+### Files Modified
 
 | File | Action | Description |
 |------|--------|-------------|
-| `nanovllm/models/registry.py` | created | Model registry with `@register_model` decorator |
-| `nanovllm/models/__init__.py` | created | Export registry functions, import models |
-| `nanovllm/models/llama.py` | created | Llama model implementation |
-| `nanovllm/models/qwen3.py` | modified | Added `@register_model` decorator |
-| `nanovllm/layers/rotary_embedding.py` | modified | Added Llama3 RoPE scaling |
-| `nanovllm/engine/model_runner.py` | modified | Dynamic model loading via registry |
-| `.claude/rules/gpu-testing.md` | created | GPU testing rules |
-| `task_plan.md` | created | Implementation plan |
-| `findings.md` | created | Technical findings |
-| `progress.md` | created | Progress tracking |
+| `nanovllm/engine/model_runner.py` | Modified | Added `_find_free_port()`, dynamic port logic |
+| `nanovllm/engine/llm_engine.py` | Modified | Added `close()`, `__del__`, context manager |
+| `tests/test_port_conflict.py` | Created | Test script for port conflict fix |
+| `docs/torch_distributed_port_issue.md` | Deleted | Issue resolved, doc removed |
+| `CLAUDE.md` | Modified | Removed port conflict warnings, updated doc index |
+
+---
+
+### Key Features After Fix
+
+1. **Multi-GPU Parallel Testing**
+   ```bash
+   CUDA_VISIBLE_DEVICES=0 python test1.py &
+   CUDA_VISIBLE_DEVICES=1 python test2.py &
+   # Both run with different auto-assigned ports
+   ```
+
+2. **Sequential LLM Creation**
+   ```python
+   for i in range(3):
+       with LLM(model_path) as llm:
+           outputs = llm.generate(prompts, params)
+       # Automatically cleaned up
+   ```
+
+3. **Backward Compatible**
+   - `NANOVLLM_DIST_PORT` env var still works
+   - `llm.exit()` still works (alias for `close()`)
464 task_plan.md
@@ -1,314 +1,230 @@
-# Task Plan: Enable CUDA Graphs for CPU Offload Mode
+# Task Plan: Fix Torch Distributed Port Conflict
 
-## Current Status: ✅ COMPLETED
+## Goal
+
+Support launching multiple independent nanovllm processes for testing in a multi-GPU environment, without manual port management.
 
-### Phase 0 Completed: Refactor Offload Decode to Use Standard Attention Path
-
-### Phases 1-3 Completed: CUDA Graph Support for Offload Mode
-
-**Implementation**: Added per-layer CUDA graph capture and replay for offload decode path.
+## Problem Analysis
+
+### Core Problem
+
+```
+Currently: all nanovllm instances default to port 2333
+└── Multiple independent processes conflict when run at the same time!
+
+CUDA_VISIBLE_DEVICES=0 python test1.py   # binds port 2333 ✓
+CUDA_VISIBLE_DEVICES=1 python test2.py   # tries to bind 2333 → EADDRINUSE ❌
+```
 
-**Key Changes**:
-1. `capture_offload_cudagraph()` captures one graph per transformer layer
-2. Each graph uses the corresponding ring buffer slot based on `layer_id % num_buffers`
-3. `run_layerwise_offload_decode()` replays graphs when `enforce_eager=False`
-4. Synchronization added between graph replays to ensure correct data flow
-
-**Test Results**:
-- `test_needle.py --input-len 32768 --enable-offload --use-cuda-graph`: **PASSED**
+### Root Cause
+
+- Ports are a system-level resource, independent of the GPU
+- Even on different GPUs, the ports still conflict
+- The default port `2333` is currently hard-coded
 
 ---
 
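The conflict described above needs no GPU at all to reproduce; a sketch of the failure mode, binding one fixed port twice on the same host:

```python
import errno
import socket

a = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
a.bind(("127.0.0.1", 0))          # OS picks a free port for the first "process"
port = a.getsockname()[1]

b = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    b.bind(("127.0.0.1", port))   # second bind to the same fixed port
    conflict = False
except OSError as e:
    conflict = e.errno == errno.EADDRINUSE
finally:
    b.close()
    a.close()

print(conflict)  # True - the second bind fails no matter which GPU each process uses
```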
-### Previous Work: Refactor Offload Decode to Use Standard Attention Path
-
-**Problem solved**: The original offload decode (`run_layerwise_offload_decode`) bypassed `Attention.forward()` by manually calling attention components. This was inconsistent with the standard execution path.
-
-**Solution implemented**: Refactored to use `layer.forward()` which goes through:
-```
-Qwen3DecoderLayer.forward()
-  → Qwen3Attention.forward()
-    → Attention.forward()  ← Now properly used!
-```
-
-### Code Changes Made
-
-**File**: `nanovllm/engine/model_runner.py`
-
-1. **`run_layerwise_offload_decode()` (line 841-991)** - Completely refactored:
-
-   Before (bypassed Attention):
-   ```python
-   qkv = layer.self_attn.qkv_proj(hidden_ln)
-   q, k_new, v_new = qkv.split(...)
-   q = layer.self_attn.q_norm(...)
-   k = layer.self_attn.k_norm(...)
-   q, k = layer.self_attn.rotary_emb(...)
-   attn_output = flash_attn_varlen_func(q, k_full, v_full, ...)  # Direct call!
-   hidden_states = layer.self_attn.o_proj(attn_output)
-   ```
-
-   After (uses standard path):
-   ```python
-   # Set up Attention module's cache to ring buffer
-   attn_module.k_cache = offload_engine.layer_k_cache[buffer_idx:buffer_idx+1]
-   attn_module.v_cache = offload_engine.layer_v_cache[buffer_idx:buffer_idx+1]
-
-   # Set context for contiguous mode
-   set_context(is_prefill=False, slot_mapping=..., context_lens=..., block_tables=None)
-
-   # Standard layer forward - goes through Attention.forward()!
-   hidden_states, residual = layer(positions, hidden_states, residual)
-   ```
-
-2. **`ModelRunner.__init__()` (line 46-57)** - Conditional CUDA graph capture:
-   ```python
-   if not self.enforce_eager:
-       if config.enable_cpu_offload:
-           # TODO: Implement capture_offload_cudagraph()
-           pass  # Temporarily use eager execution
-       else:
-           self.capture_cudagraph()
-   ```
+## Solution: Dynamic Port Allocation
+
+### Core Approach
+
+```python
+def _find_free_port() -> int:
+    """Let the OS assign a free port."""
+    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
+        s.bind(('', 0))
+        return s.getsockname()[1]
+
+# Prefer the environment variable; otherwise auto-assign
+port = os.environ.get("NANOVLLM_DIST_PORT")
+if port is None:
+    port = _find_free_port()
+else:
+    port = int(port)
+```
 
-### Test Results
-
-| Test | Mode | Status |
-|------|------|--------|
-| `test_needle.py --input-len 4096` | GPU-only | PASSED |
-| `test_needle.py --input-len 4096 --enable-offload` | CPU offload | PASSED |
-
-## Remaining Work: Implement Offload CUDA Graph
-
-### Why Standard `capture_cudagraph()` Cannot Be Used
-
-The standard capture function captures the PagedAttention decode path:
-```python
-# capture_cudagraph() sets up:
-k_cache: [num_blocks, block_size, kv_heads, head_dim]  # PagedAttention format
-block_tables: [...]  # Block indices for paged indexing
-```
-
-But offload mode uses contiguous ring buffer:
+### Effect
+
+```bash
+# No need to specify a port manually; multiple tests can run at the same time
+CUDA_VISIBLE_DEVICES=0 python test1.py &   # auto port 54321
+CUDA_VISIBLE_DEVICES=1 python test2.py &   # auto port 54322
+CUDA_VISIBLE_DEVICES=2 python test3.py &   # auto port 54323
+
+# Manual specification is still supported (backward compatible)
+NANOVLLM_DIST_PORT=2333 python test.py
+```
+
+---
-```python
-# Offload decode sets up:
-k_cache: [1, max_seq_len, kv_heads, head_dim]  # Contiguous format
-block_tables: None  # No paging
-```
-
-### Implementation Plan for `capture_offload_cudagraph()`
-
-#### Phase 1: Prepare Fixed-Address Tensors
-
-```python
-@torch.inference_mode()
-def capture_offload_cudagraph(self):
-    """Capture CUDA graphs for offload decode using ring buffer."""
-    offload_engine = self.kvcache_manager.offload_engine
-    num_buffers = offload_engine.num_kv_buffers
-
-    # Fixed-address tensors for graph capture
-    input_ids = torch.zeros(1, dtype=torch.int64, device="cuda")
-    positions = torch.zeros(1, dtype=torch.int64, device="cuda")
-    slot_mapping = torch.zeros(1, dtype=torch.int32, device="cuda")
-    context_lens = torch.zeros(1, dtype=torch.int32, device="cuda")
-
-    self.offload_graphs = {}
-    self.offload_graph_pool = None
-```
-
-#### Phase 2: Capture Per-Buffer Graphs
-
-Since layer processing rotates through ring buffers (`layer_id % num_buffers`), we need graphs for each buffer slot:
-
-```python
-for buffer_idx in range(num_buffers):
-    graph = torch.cuda.CUDAGraph()
-
-    # Set Attention cache to this buffer slot (fixed address)
-    for layer in self.model.model.layers:
-        layer.self_attn.attn.k_cache = offload_engine.layer_k_cache[buffer_idx:buffer_idx+1]
-        layer.self_attn.attn.v_cache = offload_engine.layer_v_cache[buffer_idx:buffer_idx+1]
-
-    # Set context
-    set_context(is_prefill=False, slot_mapping=slot_mapping,
-                context_lens=context_lens, block_tables=None)
-
-    # Warmup
-    hidden = self.model.model.embed_tokens(input_ids)
-    residual = None
-    for layer_id, layer in enumerate(self.model.model.layers):
-        if layer_id % num_buffers == buffer_idx:
-            hidden, residual = layer(positions, hidden, residual)
-
-    # Capture
-    with torch.cuda.graph(graph, self.offload_graph_pool):
-        # Same operations
-        ...
-
-    self.offload_graphs[buffer_idx] = graph
-```
-
-#### Phase 3: Use Graphs in Decode
-
-Modify `run_layerwise_offload_decode()` to replay graphs:
-
-```python
-for layer_id in range(num_layers):
-    current_buffer = layer_id % num_buffers
-
-    # Wait for H2D load
-    offload_engine.wait_buffer_load(current_buffer)
-
-    # Copy decode buffer to ring buffer (same as current)
-    ...
-
-    # Update graph variables
-    self.offload_graph_vars["positions"][0] = positions[0]
-    self.offload_graph_vars["slot_mapping"][0] = context_len
-    self.offload_graph_vars["context_lens"][0] = context_len + 1
-
-    # Replay graph instead of eager forward
-    self.offload_graphs[current_buffer].replay()
-
-    # Copy new KV to decode buffer (same as current)
-    ...
-```
-
-### Challenges and Considerations
-
-| Challenge | Solution |
-|-----------|----------|
-| H2D transfers interleaved with compute | H2D happens outside graph, only compute is captured |
-| Different layers use different buffers | Capture per-buffer graphs, replay correct one |
-| Variable context length | Use `cache_seqlens` parameter (fixed address, variable value) |
-| Per-layer buffer rotation | Graph captures single-layer forward, loop in Python |
-
-### Alternative: Full-Decode Graph (More Complex)
-
-Instead of per-layer graphs, capture entire decode step:
-1. Complete all H2D loads before graph
-2. Single graph covers all layers
-3. Better kernel fusion, less CPU overhead
-4. More complex to implement (need to handle buffer rotation inside graph)
-
 ## Implementation Phases
 
-| Phase | Description | Status |
-|-------|-------------|--------|
-| Phase 0 | Refactor offload decode to use Attention.forward() | ✅ Completed |
-| Phase 1 | Implement `capture_offload_cudagraph()` with per-layer graphs | ✅ Completed |
-| Phase 2 | Modify `run_layerwise_offload_decode()` to use graphs | ✅ Completed |
-| Phase 3 | Test and benchmark | ✅ Completed |
-| Phase 4 | (Optional) Optimize to full-decode graph | ⬜ Future |
+### Phase 1: ModelRunner Dynamic Port [pending]
+**File**: `nanovllm/engine/model_runner.py`
 
-## Architecture After Refactoring
-
-```
-┌─────────────────────────────────────────────────────────────────────────────┐
-│                  Offload Decode Flow (After Refactoring)                    │
-├─────────────────────────────────────────────────────────────────────────────┤
-│                                                                             │
-│  For each layer:                                                            │
-│    1. Wait for H2D load (ring buffer has prefill KV)                        │
-│    2. Copy decode buffer → ring buffer (at prefill_len offset)              │
-│    3. Set Attention.k_cache = ring_buffer[buffer_idx]                       │
-│    4. Set context (slot_mapping, context_lens, block_tables=None)           │
-│    5. layer.forward() → Qwen3Attention.forward() → Attention.forward()      │
-│         └── store_kvcache() stores new token to ring buffer                 │
-│         └── flash_attn_with_kvcache() computes attention                    │
-│    6. Copy new token KV: ring buffer → decode buffer                        │
-│    7. Start next layer H2D load                                             │
-│                                                                             │
-│  Key insight: Now uses standard Attention path, just with ring buffer       │
-│  as k_cache/v_cache in contiguous format (block_tables=None)                │
-│                                                                             │
-└─────────────────────────────────────────────────────────────────────────────┘
-```
-
-## Files Modified
-
-| File | Changes |
-|------|---------|
-| `model_runner.py:46-50` | Conditional CUDA graph capture: calls `capture_offload_cudagraph()` for offload mode |
-| `model_runner.py:69-73` | Updated `exit()` to clean up offload graph resources |
-| `model_runner.py:844-1031` | Refactored `run_layerwise_offload_decode()` to use standard `layer.forward()` with optional CUDA graph |
-| `model_runner.py:1075-1164` | New `capture_offload_cudagraph()` method for per-layer graph capture |
-| `tests/test_needle.py` | Added `--use-cuda-graph` flag to test CUDA graph mode |
-
-## Implementation Details
-
-### `capture_offload_cudagraph()` (line 1075-1164)
-
-Captures per-layer CUDA graphs for offload decode:
-
-```python
-def capture_offload_cudagraph(self):
-    # Fixed-address tensors for graph capture
-    hidden_states = torch.randn(1, hidden_size, ...)
-    residual = torch.randn(1, hidden_size, ...)
-    layer_outputs = torch.zeros(1, hidden_size, ...)
-    layer_residual = torch.zeros(1, hidden_size, ...)
-
-    for layer_id in range(num_layers):
-        buffer_idx = layer_id % num_buffers
-
-        # Set Attention cache to ring buffer
-        attn_module.k_cache = ring_buffer[buffer_idx:buffer_idx+1]
-        attn_module.v_cache = ring_buffer[buffer_idx:buffer_idx+1]
-
-        # Warmup and capture
-        with torch.cuda.graph(graph):
-            out_h, out_r = layer(positions, hidden_states, residual)
-            layer_outputs.copy_(out_h)
-            layer_residual.copy_(out_r)
-
-        # Update inputs for next layer
-        hidden_states.copy_(layer_outputs)
-        residual.copy_(layer_residual)
-```
+```python
+import socket
+
+def _find_free_port() -> int:
+    """Find a free port for distributed communication."""
+    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
+        s.bind(('', 0))
+        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
+        return s.getsockname()[1]
+
+class ModelRunner:
+    def __init__(self, config: Config, rank: int, event: Event | list[Event]):
+        # ... existing code ...
+        import os
+        port = os.environ.get("NANOVLLM_DIST_PORT")
+        if port is None:
+            port = _find_free_port()
+            logger.info(f"Auto-assigned distributed port: {port}")
+        else:
+            port = int(port)
+
+        dist.init_process_group("nccl", f"tcp://localhost:{port}", ...)
+```
 
-### `run_layerwise_offload_decode()` CUDA Graph Mode
-
-When CUDA graphs are available:
-
-```python
-use_cuda_graph = not self.enforce_eager and hasattr(self, 'offload_graphs')
-
-if use_cuda_graph:
-    # Use fixed-address tensors
-    graph_vars["positions"][0] = len(seq) - 1
-    graph_vars["slot_mapping"][0] = context_len
-    graph_vars["context_lens"][0] = context_len + 1
-    graph_vars["hidden_states"].copy_(embedding)
-    graph_vars["residual"].zero_()
-
-    for layer_id in range(num_layers):
-        # Set up ring buffer and context
-        ...
-
-        # Replay graph
-        self.offload_graphs[layer_id].replay()
-        torch.cuda.current_stream().synchronize()
-
-        # Copy outputs to inputs for next layer
-        if layer_id < num_layers - 1:
-            graph_vars["hidden_states"].copy_(graph_vars["layer_outputs"])
-            graph_vars["residual"].copy_(graph_vars["layer_residual"])
-```
+### Phase 2: LLMEngine Resource Cleanup Enhancement [pending]
+**File**: `nanovllm/engine/llm_engine.py`
+
+Add a `close()` method and context manager support to ensure resources are released correctly:
+
+```python
+class LLMEngine:
+    def __init__(self, model, **kwargs):
+        # ... existing code ...
+        self._closed = False
+        atexit.register(self._atexit_handler)
+
+    def _atexit_handler(self):
+        if not self._closed:
+            self.close()
+
+    def close(self):
+        """Explicitly close the engine and release all resources."""
+        if self._closed:
+            return
+        self._closed = True
+        try:
+            atexit.unregister(self._atexit_handler)
+        except Exception:
+            pass
+        self.model_runner.call("exit")
+        del self.model_runner
+        for p in self.ps:
+            p.join()
+
+    def exit(self):
+        """Alias for close() - backward compatibility."""
+        self.close()
+
+    def __del__(self):
+        try:
+            self.close()
+        except Exception:
+            pass
+
+    def __enter__(self):
+        return self
+
+    def __exit__(self, *args):
+        self.close()
+        return False
+```
 
-## Test Results
-
-| Test | Mode | CUDA Graph | Status |
-|------|------|------------|--------|
-| `test_needle.py --input-len 4096` | GPU-only | N/A | PASSED |
-| `test_needle.py --input-len 4096 --enable-offload` | CPU offload | Disabled | PASSED |
-| `test_needle.py --input-len 32768 --enable-offload` | CPU offload | Disabled | PASSED |
-| `test_needle.py --input-len 32768 --enable-offload --use-cuda-graph` | CPU offload | Enabled | PASSED |
-
-## Next Steps
-
-1. ~~Implement `capture_offload_cudagraph()` method~~ ✅
-2. ~~Modify `run_layerwise_offload_decode()` to optionally use captured graphs~~ ✅
-3. ~~Test correctness with needle-in-haystack~~ ✅
-4. Benchmark performance improvement from CUDA graphs (optional)
-5. Consider full-decode graph optimization for maximum performance (future)
+### Phase 3: Test Verification [pending]
+**File**: `tests/test_multiple_processes.py` (new)
+
+```python
+"""Test multiple independent nanovllm processes."""
+import subprocess
+import sys
+import time
+
+def test_parallel_processes():
+    """Test running multiple nanovllm processes in parallel."""
+    script = '''
+import sys
+sys.path.insert(0, ".")
+from nanovllm import LLM, SamplingParams
+import os
+
+gpu = os.environ.get("CUDA_VISIBLE_DEVICES", "0")
+print(f"[GPU {gpu}] Starting LLM")
+llm = LLM("path/to/model", enable_cpu_offload=True)
+outputs = llm.generate(["Hello"], SamplingParams(max_tokens=10))
+print(f"[GPU {gpu}] Output: {outputs[0]['text'][:50]}")
+llm.close()
+print(f"[GPU {gpu}] Done")
+'''
+
+    # Start 2 processes on different GPUs
+    procs = []
+    for gpu in [0, 1]:
+        env = {"CUDA_VISIBLE_DEVICES": str(gpu)}
+        p = subprocess.Popen(
+            [sys.executable, "-c", script],
+            env={**os.environ, **env}
+        )
+        procs.append(p)
+        time.sleep(1)  # Stagger start slightly
+
+    # Wait for all
+    for p in procs:
+        assert p.wait() == 0, f"Process failed with code {p.returncode}"
+
+    print("PASSED: test_parallel_processes")
+
+if __name__ == "__main__":
+    test_parallel_processes()
+```
 
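The launch/stagger/wait structure of that test can be sketched with trivial workers, no model or GPU needed:

```python
import subprocess
import sys

# Launch two trivial workers concurrently, then wait for both -
# the same start/stagger/wait pattern the planned test uses per GPU
procs = []
for worker_id in (0, 1):
    p = subprocess.Popen(
        [sys.executable, "-c", f"print('worker {worker_id} done')"],
    )
    procs.append(p)

codes = [p.wait() for p in procs]
print(codes)  # [0, 0]
```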
+### Phase 4: Documentation Update [pending]
+**File**: `docs/torch_distributed_port_issue.md`
+
+Update the document to mark the issue as resolved by dynamic port allocation.
+
+---
+
+## Usage After Fix
+
+### Scenario 1: Multi-process parallel testing (primary use case)
+```bash
+# No extra configuration needed; just run
+CUDA_VISIBLE_DEVICES=0 python test_group1.py &
+CUDA_VISIBLE_DEVICES=1 python test_group2.py &
+CUDA_VISIBLE_DEVICES=2 python test_group3.py &
+wait
+```
+
+### Scenario 2: Sequential creation in the same process (also supported)
+```python
+for i in range(3):
+    with LLM(model_path) as llm:
+        outputs = llm.generate(prompts, params)
+    # Automatic cleanup; the next instance can use a new random port
+```
+
+### Scenario 3: Manually specified port (backward compatible)
+```bash
+NANOVLLM_DIST_PORT=2333 python test.py
+```
+
+---
+
+## Success Criteria
+
+- [ ] Multiple independent processes can run at the same time (different GPUs)
+- [ ] No manual port specification needed
+- [ ] Backward compatible (the environment variable still works)
+- [ ] Sequential creation in the same process also works
+- [ ] Resources are cleaned up correctly
+
+---
+
+## Files to Modify
+
+| File | Action | Status |
+|------|--------|--------|
+| `nanovllm/engine/model_runner.py` | Add `_find_free_port()` | pending |
+| `nanovllm/engine/llm_engine.py` | Add `close()`, context manager | pending |
+| `tests/test_multiple_processes.py` | Create | pending |
+| `docs/torch_distributed_port_issue.md` | Update | pending |
112 tests/run_parallel_niah.sh (Executable file)
@@ -0,0 +1,112 @@
#!/bin/bash
# Run NIAH tests in parallel on 6 GPUs
# This tests the dynamic port allocation fix

set -e

MODEL="${1:-/home/zijie/models/Llama-3.1-8B-Instruct}"
PROJECT_ROOT="$(cd "$(dirname "$0")/.." && pwd)"

echo "=========================================="
echo "Parallel NIAH Test on 6 GPUs"
echo "=========================================="
echo "Model: $MODEL"
echo "Project: $PROJECT_ROOT"
echo ""

# Sample distribution (100 samples total):
# GPU 0: 0-16  (17 samples)
# GPU 1: 17-33 (17 samples)
# GPU 2: 34-50 (17 samples)
# GPU 3: 51-67 (17 samples)
# GPU 4: 68-83 (16 samples)
# GPU 5: 84-99 (16 samples)

declare -a RANGES=("0-16" "17-33" "34-50" "51-67" "68-83" "84-99")
declare -a PIDS=()

# Create log directory
LOG_DIR="$PROJECT_ROOT/logs"
mkdir -p "$LOG_DIR"

# Start all 6 processes
for gpu in {0..5}; do
    range="${RANGES[$gpu]}"
    log_file="$LOG_DIR/gpu${gpu}_${range}.log"

    echo "Starting GPU $gpu: samples $range -> $log_file"

    CUDA_VISIBLE_DEVICES=$gpu PYTHONPATH="$PROJECT_ROOT:$PYTHONPATH" \
        python "$PROJECT_ROOT/tests/test_ruler_niah.py" \
        --model "$MODEL" \
        --sample-indices "$range" \
        --enable-offload \
        --num-gpu-blocks 4 \
        --quiet \
        > "$log_file" 2>&1 &

    PIDS+=($!)

    # Small delay to stagger starts
    sleep 2
done

echo ""
echo "All 6 processes started. Waiting for completion..."
echo "PIDs: ${PIDS[*]}"
echo ""

# Wait for all processes and collect results
declare -a RESULTS=()
ALL_PASSED=true

for i in {0..5}; do
    pid="${PIDS[$i]}"
    range="${RANGES[$i]}"
    log_file="$LOG_DIR/gpu${i}_${range}.log"

    if wait $pid; then
        RESULTS+=("GPU $i ($range): PASSED")
        echo "GPU $i completed successfully"
    else
        RESULTS+=("GPU $i ($range): FAILED (exit code $?)")
        ALL_PASSED=false
        echo "GPU $i FAILED!"
    fi
done

echo ""
echo "=========================================="
echo "RESULTS SUMMARY"
echo "=========================================="
for result in "${RESULTS[@]}"; do
    echo "$result"
done
echo ""

# Show accuracy from each log
echo "Accuracy per GPU:"
for i in {0..5}; do
    range="${RANGES[$i]}"
    log_file="$LOG_DIR/gpu${i}_${range}.log"
    if [ -f "$log_file" ]; then
        accuracy=$(grep -E "Accuracy:|accuracy" "$log_file" | tail -1 || echo "N/A")
        port=$(grep "Auto-assigned distributed port" "$log_file" | head -1 || echo "N/A")
        echo "  GPU $i ($range): $accuracy | $port"
    fi
done

echo ""
if $ALL_PASSED; then
    echo "=========================================="
    echo "ALL 6 TESTS PASSED!"
    echo "Dynamic port allocation works correctly."
    echo "=========================================="
    exit 0
else
    echo "=========================================="
    echo "SOME TESTS FAILED!"
    echo "Check logs in $LOG_DIR"
    echo "=========================================="
    exit 1
fi
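The hard-coded `RANGES` table splits 100 samples over 6 GPUs as evenly as possible; a small sketch that derives the same ranges (`split_ranges` is a name of my choosing, not part of the repo):

```python
def split_ranges(total: int, parts: int) -> list:
    # The first (total % parts) parts get one extra sample, matching the table above
    base, extra = divmod(total, parts)
    ranges, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        ranges.append(f"{start}-{start + size - 1}")
        start += size
    return ranges

print(split_ranges(100, 6))  # ['0-16', '17-33', '34-50', '51-67', '68-83', '84-99']
```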
198 tests/test_port_conflict.py (New file)
@@ -0,0 +1,198 @@
"""Test for torch distributed port conflict fix.
|
||||||
|
|
||||||
|
This test verifies that:
|
||||||
|
1. Multiple independent processes can run simultaneously (dynamic port allocation)
|
||||||
|
2. Sequential LLM creation in same process works (proper cleanup)
|
||||||
|
|
||||||
|
Usage:
|
||||||
|
# Test parallel processes (requires 2 GPUs)
|
||||||
|
python tests/test_port_conflict.py --model ~/models/Qwen3-4B --gpus 4,5 --test parallel
|
||||||
|
|
||||||
|
# Test sequential creation in same process
|
||||||
|
CUDA_VISIBLE_DEVICES=4 python tests/test_port_conflict.py --model ~/models/Qwen3-4B --test sequential
|
||||||
|
"""
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import os
|
||||||
|
import subprocess
|
||||||
|
import sys
|
||||||
|
import time
|
||||||
|
|
||||||
|
|
||||||
|
def test_sequential_creation(model_path: str, enable_offload: bool = True):
|
||||||
|
"""Test creating multiple LLM instances sequentially in same process."""
|
||||||
|
# Add project root to path
|
||||||
|
project_root = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
|
||||||
|
sys.path.insert(0, project_root)
|
||||||
|
|
||||||
|
from nanovllm import LLM, SamplingParams
|
||||||
|
|
||||||
|
print("=" * 60)
|
||||||
|
print("Test: Sequential LLM Creation (same process)")
|
||||||
|
print("=" * 60)
|
||||||
|
|
||||||
|
for i in range(3):
|
||||||
|
print(f"\n--- Creating LLM instance {i+1}/3 ---")
|
||||||
|
|
||||||
|
llm_kwargs = {"enable_cpu_offload": enable_offload}
|
||||||
|
if enable_offload:
|
||||||
|
llm_kwargs["num_gpu_blocks"] = 2
|
||||||
|
|
||||||
|
llm = LLM(model_path, **llm_kwargs)
|
||||||
|
|
||||||
|
# Simple generation
|
||||||
|
outputs = llm.generate(
|
||||||
|
["Hello, how are you?"],
|
||||||
|
SamplingParams(max_tokens=20)
|
||||||
|
)
|
||||||
|
print(f"Output: {outputs[0]['text'][:50]}...")
|
||||||
|
|
||||||
|
# Explicit cleanup
|
||||||
|
llm.close()
|
||||||
|
print(f"Instance {i+1} closed successfully")
|
||||||
|
|
||||||
|
print("\n" + "=" * 60)
|
||||||
|
print("PASSED: test_sequential_creation")
|
||||||
|
print("=" * 60)
|
||||||
|
|
||||||
|
|
||||||
|
def test_context_manager(model_path: str, enable_offload: bool = True):
|
    """Test LLM with context manager."""
    project_root = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
    sys.path.insert(0, project_root)

    from nanovllm import LLM, SamplingParams

    print("=" * 60)
    print("Test: Context Manager")
    print("=" * 60)

    for i in range(2):
        print(f"\n--- Context manager instance {i+1}/2 ---")

        llm_kwargs = {"enable_cpu_offload": enable_offload}
        if enable_offload:
            llm_kwargs["num_gpu_blocks"] = 2

        with LLM(model_path, **llm_kwargs) as llm:
            outputs = llm.generate(
                ["What is 2+2?"],
                SamplingParams(max_tokens=20)
            )
            print(f"Output: {outputs[0]['text'][:50]}...")

        print(f"Instance {i+1} auto-closed via context manager")

    print("\n" + "=" * 60)
    print("PASSED: test_context_manager")
    print("=" * 60)


def test_parallel_processes(model_path: str, gpus: str, enable_offload: bool = True):
    """Test running multiple nanovllm processes in parallel."""
    gpu_list = [int(g.strip()) for g in gpus.split(",")]
    if len(gpu_list) < 2:
        print("ERROR: Need at least 2 GPUs for parallel test")
        return False

    print("=" * 60)
    print(f"Test: Parallel Processes (GPUs: {gpu_list})")
    print("=" * 60)

    project_root = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))

    # Script to run in each subprocess
    script = f'''
import sys
sys.path.insert(0, "{project_root}")
import os
from nanovllm import LLM, SamplingParams

gpu = os.environ.get("CUDA_VISIBLE_DEVICES", "?")
print(f"[GPU {{gpu}}] Starting LLM...")

llm_kwargs = {{"enable_cpu_offload": {enable_offload}}}
if {enable_offload}:
    llm_kwargs["num_gpu_blocks"] = 2

llm = LLM("{model_path}", **llm_kwargs)
print(f"[GPU {{gpu}}] LLM initialized, generating...")

outputs = llm.generate(["Hello world"], SamplingParams(max_tokens=10))
print(f"[GPU {{gpu}}] Output: {{outputs[0]['text'][:30]}}...")

llm.close()
print(f"[GPU {{gpu}}] Done")
'''

    # Start processes on different GPUs
    procs = []
    for i, gpu in enumerate(gpu_list[:2]):  # Use first 2 GPUs
        print(f"\nStarting process on GPU {gpu}...")
        env = os.environ.copy()
        env["CUDA_VISIBLE_DEVICES"] = str(gpu)

        p = subprocess.Popen(
            [sys.executable, "-c", script],
            env=env,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True
        )
        procs.append((gpu, p))
        time.sleep(2)  # Stagger starts to see concurrent running

    # Wait and collect results
    all_passed = True
    for gpu, p in procs:
        stdout, _ = p.communicate(timeout=300)
        print(f"\n--- GPU {gpu} output ---")
        print(stdout)

        if p.returncode != 0:
            print(f"ERROR: GPU {gpu} process failed with code {p.returncode}")
            all_passed = False
        else:
            print(f"GPU {gpu} process completed successfully")

    print("\n" + "=" * 60)
    if all_passed:
        print("PASSED: test_parallel_processes")
    else:
        print("FAILED: test_parallel_processes")
    print("=" * 60)

    return all_passed


def main():
    parser = argparse.ArgumentParser(description="Test port conflict fix")
    parser.add_argument("--model", "-m", required=True, help="Path to model")
    parser.add_argument("--gpus", default="0,1", help="GPUs to use for parallel test (comma-separated)")
    parser.add_argument("--test", choices=["sequential", "context", "parallel", "all"],
                        default="all", help="Which test to run")
    parser.add_argument("--no-offload", action="store_true", help="Disable CPU offload")
    args = parser.parse_args()

    enable_offload = not args.no_offload
    model_path = os.path.expanduser(args.model)

    print(f"Model: {model_path}")
    print(f"CPU Offload: {enable_offload}")
    print(f"GPUs for parallel test: {args.gpus}")
    print()

    if args.test in ["sequential", "all"]:
        test_sequential_creation(model_path, enable_offload)
        print()

    if args.test in ["context", "all"]:
        test_context_manager(model_path, enable_offload)
        print()

    if args.test in ["parallel", "all"]:
        test_parallel_processes(model_path, args.gpus, enable_offload)


if __name__ == "__main__":
    main()