nano-vllm/progress.md
Zijie Tian 64971c8e8a Merge branch 'zijie/fix-dist-3': Fix distributed port conflict
- Auto port allocation with _find_free_port() in model_runner.py
- Resource management refactor with close() + context manager in llm_engine.py
- Add tests/test_port_conflict.py and tests/run_parallel_niah.sh
- Remove docs/torch_distributed_port_issue.md (issue fixed)
- Ignore tests/data/ directory

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-12 16:27:25 +08:00


Progress Log: Fix Torch Distributed Port Conflict

Status: COMPLETED & CLEANED UP

Session: 2026-01-12

Task Overview

Fix the EADDRINUSE port conflict that occurs when multiple LLM instances are created sequentially in the same Python process, and support launching multiple independent processes simultaneously in a multi-GPU environment.


Phase Status

| Phase | Description | Status |
|-------|-------------|--------|
| Phase 1 | ModelRunner dynamic port allocation | COMPLETED |
| Phase 2 | LLMEngine close() and context manager | COMPLETED |
| Phase 3 | Test verification (GPU 4,5) | COMPLETED |
| Phase 4 | Update documentation | COMPLETED |

Implementation Summary

Phase 1: Dynamic Port Allocation

File: nanovllm/engine/model_runner.py

  • Added _find_free_port() function using socket binding
  • Modified port selection logic: use env var if set, otherwise auto-assign
  • Added logging for auto-assigned ports

Phase 2: Resource Cleanup Enhancement

File: nanovllm/engine/llm_engine.py

  • Added _closed flag for idempotent cleanup
  • Added close() method for explicit resource release
  • Added __del__() for GC fallback
  • Added __enter__() and __exit__() for context manager support
  • Modified atexit registration to use _atexit_handler
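The cleanup pattern described above can be sketched as follows. This is a simplified skeleton, not the real llm_engine.py: the actual resource teardown (destroying the process group, stopping workers) is elided behind a comment.

```python
import atexit

class LLMEngine:
    """Skeleton showing idempotent cleanup + context-manager support."""

    def __init__(self):
        self._closed = False
        # Register a bound handler so cleanup still runs at interpreter exit.
        atexit.register(self._atexit_handler)

    def close(self):
        # Idempotent: safe to call multiple times (explicitly, via
        # __exit__, via __del__, and via atexit).
        if self._closed:
            return
        self._closed = True
        # ... release distributed process group, worker processes, etc.

    def _atexit_handler(self):
        self.close()

    def __del__(self):
        # GC fallback if the user never called close().
        self.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not suppress exceptions
```

The `_closed` flag is what makes the four cleanup entry points safe to combine: whichever fires first does the work, the rest become no-ops.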

Phase 3: Testing (GPU 4,5)

File: tests/test_port_conflict.py

  • Created comprehensive test script

Test Results:

| Test | Status | Notes |
|------|--------|-------|
| Sequential creation (3 instances) | PASSED | Ports: 50405, 47835, 53011 |
| Context manager | PASSED | Auto-cleanup works |
| Parallel processes (GPU 4,5) | PASSED | Ports: 34631, 56097 |
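The core property these tests exercise, that concurrently live instances never share a rendezvous port, can be illustrated with plain sockets (a simplified stand-in; the real tests/test_port_conflict.py drives full LLM instances):

```python
import socket

# Bind three sockets to OS-assigned ports and keep them all open at
# once, mimicking three engine instances whose ports must not collide.
socks = [socket.socket(socket.AF_INET, socket.SOCK_STREAM) for _ in range(3)]
for s in socks:
    s.bind(("", 0))  # port 0 = let the kernel choose

ports = [s.getsockname()[1] for s in socks]
assert len(set(ports)) == 3  # all distinct while simultaneously bound

for s in socks:
    s.close()
```

Because all three sockets are bound at the same time, the kernel is guaranteed to hand out distinct ports, which is exactly why auto-assignment avoids EADDRINUSE.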

Phase 4: Documentation

File: docs/torch_distributed_port_issue.md (subsequently deleted during cleanup; see Files Modified)

  • Updated status to RESOLVED
  • Documented solution details
  • Added usage examples

Files Modified

| File | Action | Description |
|------|--------|-------------|
| nanovllm/engine/model_runner.py | Modified | Added _find_free_port(), dynamic port logic |
| nanovllm/engine/llm_engine.py | Modified | Added close(), __del__, context manager |
| tests/test_port_conflict.py | Created | Test script for port conflict fix |
| docs/torch_distributed_port_issue.md | Deleted | Issue resolved, doc removed |
| CLAUDE.md | Modified | Removed port conflict warnings, updated doc index |

Key Features After Fix

  1. Multi-GPU Parallel Testing

    CUDA_VISIBLE_DEVICES=0 python test1.py &
    CUDA_VISIBLE_DEVICES=1 python test2.py &
    # Both run with different auto-assigned ports
    
  2. Sequential LLM Creation

    for i in range(3):
        with LLM(model_path) as llm:
            outputs = llm.generate(prompts, params)
        # Automatically cleaned up
    
  3. Backward Compatible

    • NANOVLLM_DIST_PORT env var still works
    • llm.exit() still works (alias for close())