[docs] Added dist port issue.

This commit is contained in:
Zijie Tian
2026-01-12 15:16:39 +08:00
parent 8e0888c20c
commit de6f36bdb2
3 changed files with 569 additions and 372 deletions

CLAUDE.md

@@ -1,389 +1,108 @@
# Claude Code Configuration - SPARC Development Environment
# CLAUDE.md
## 🚨 CRITICAL: CONCURRENT EXECUTION & FILE MANAGEMENT
This file provides guidance to Claude Code when working with this repository.
**ABSOLUTE RULES**:
1. ALL operations MUST be concurrent/parallel in a single message
2. **NEVER save working files, text/mds and tests to the root folder**
3. ALWAYS organize files in appropriate subdirectories
4. **USE CLAUDE CODE'S TASK TOOL** for spawning agents concurrently, not just MCP
## Overview
### ⚡ GOLDEN RULE: "1 MESSAGE = ALL RELATED OPERATIONS"
Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline LLM inference. Supports multiple model architectures (Qwen3, Qwen2, Llama) with CPU offload for long-context inference.
**MANDATORY PATTERNS:**
- **TodoWrite**: ALWAYS batch ALL todos in ONE call (5-10+ todos minimum)
- **Task tool (Claude Code)**: ALWAYS spawn ALL agents in ONE message with full instructions
- **File operations**: ALWAYS batch ALL reads/writes/edits in ONE message
- **Bash commands**: ALWAYS batch ALL terminal operations in ONE message
- **Memory operations**: ALWAYS batch ALL memory store/retrieve in ONE message
## GPU Mutex for Multi-Instance Debugging
### 🎯 CRITICAL: Claude Code Task Tool for Agent Execution
**IMPORTANT**: When running multiple Claude instances for parallel debugging, different rules apply based on script type:
**Claude Code's Task tool is the PRIMARY way to spawn agents:**
```javascript
// ✅ CORRECT: Use Claude Code's Task tool for parallel agent execution
[Single Message]:
Task("Research agent", "Analyze requirements and patterns...", "researcher")
Task("Coder agent", "Implement core features...", "coder")
Task("Tester agent", "Create comprehensive tests...", "tester")
Task("Reviewer agent", "Review code quality...", "reviewer")
Task("Architect agent", "Design system architecture...", "system-architect")
```
### Benchmarks (`bench*.py`) - Exclusive GPU Access Required
**MCP tools are ONLY for coordination setup:**
- `mcp__claude-flow__swarm_init` - Initialize coordination topology
- `mcp__claude-flow__agent_spawn` - Define agent types for coordination
- `mcp__claude-flow__task_orchestrate` - Orchestrate high-level workflows
### 📁 File Organization Rules
**NEVER save to root folder. Use these directories:**
- `/src` - Source code files
- `/tests` - Test files
- `/docs` - Documentation and markdown files
- `/config` - Configuration files
- `/scripts` - Utility scripts
- `/examples` - Example code
## Project Overview
This project uses SPARC (Specification, Pseudocode, Architecture, Refinement, Completion) methodology with Claude-Flow orchestration for systematic Test-Driven Development.
## SPARC Commands
### Core Commands
- `npx claude-flow sparc modes` - List available modes
- `npx claude-flow sparc run <mode> "<task>"` - Execute specific mode
- `npx claude-flow sparc tdd "<feature>"` - Run complete TDD workflow
- `npx claude-flow sparc info <mode>` - Get mode details
### Batchtools Commands
- `npx claude-flow sparc batch <modes> "<task>"` - Parallel execution
- `npx claude-flow sparc pipeline "<task>"` - Full pipeline processing
- `npx claude-flow sparc concurrent <mode> "<tasks-file>"` - Multi-task processing
### Build Commands
- `npm run build` - Build project
- `npm run test` - Run tests
- `npm run lint` - Linting
- `npm run typecheck` - Type checking
## SPARC Workflow Phases
1. **Specification** - Requirements analysis (`sparc run spec-pseudocode`)
2. **Pseudocode** - Algorithm design (`sparc run spec-pseudocode`)
3. **Architecture** - System design (`sparc run architect`)
4. **Refinement** - TDD implementation (`sparc tdd`)
5. **Completion** - Integration (`sparc run integration`)
## Code Style & Best Practices
- **Modular Design**: Files under 500 lines
- **Environment Safety**: Never hardcode secrets
- **Test-First**: Write tests before implementation
- **Clean Architecture**: Separate concerns
- **Documentation**: Keep updated
## 🚀 Available Agents (54 Total)
### Core Development
`coder`, `reviewer`, `tester`, `planner`, `researcher`
### Swarm Coordination
`hierarchical-coordinator`, `mesh-coordinator`, `adaptive-coordinator`, `collective-intelligence-coordinator`, `swarm-memory-manager`
### Consensus & Distributed
`byzantine-coordinator`, `raft-manager`, `gossip-coordinator`, `consensus-builder`, `crdt-synchronizer`, `quorum-manager`, `security-manager`
### Performance & Optimization
`perf-analyzer`, `performance-benchmarker`, `task-orchestrator`, `memory-coordinator`, `smart-agent`
### GitHub & Repository
`github-modes`, `pr-manager`, `code-review-swarm`, `issue-tracker`, `release-manager`, `workflow-automation`, `project-board-sync`, `repo-architect`, `multi-repo-swarm`
### SPARC Methodology
`sparc-coord`, `sparc-coder`, `specification`, `pseudocode`, `architecture`, `refinement`
### Specialized Development
`backend-dev`, `mobile-dev`, `ml-developer`, `cicd-engineer`, `api-docs`, `system-architect`, `code-analyzer`, `base-template-generator`
### Testing & Validation
`tdd-london-swarm`, `production-validator`
### Migration & Planning
`migration-planner`, `swarm-init`
## 🎯 Claude Code vs MCP Tools
### Claude Code Handles ALL EXECUTION:
- **Task tool**: Spawn and run agents concurrently for actual work
- File operations (Read, Write, Edit, MultiEdit, Glob, Grep)
- Code generation and programming
- Bash commands and system operations
- Implementation work
- Project navigation and analysis
- TodoWrite and task management
- Git operations
- Package management
- Testing and debugging
### MCP Tools ONLY COORDINATE:
- Swarm initialization (topology setup)
- Agent type definitions (coordination patterns)
- Task orchestration (high-level planning)
- Memory management
- Neural features
- Performance tracking
- GitHub integration
**KEY**: MCP coordinates the strategy, Claude Code's Task tool executes with real agents.
## 🚀 Quick Setup
Before running any `bench*.py` script, Claude MUST wait for exclusive GPU access:
```bash
# Add MCP servers (Claude Flow required, others optional)
claude mcp add claude-flow npx claude-flow@alpha mcp start
claude mcp add ruv-swarm npx ruv-swarm mcp start # Optional: Enhanced coordination
claude mcp add flow-nexus npx flow-nexus@latest mcp start # Optional: Cloud features
# Check and wait for GPU to be free
while [ -n "$(nvidia-smi --query-compute-apps=pid --format=csv,noheader)" ]; do
    echo "GPU busy, waiting 10s..."
    sleep 10
done
```
## MCP Tool Categories
### Other Scripts (tests, examples) - Port Conflict Check Only
### Coordination
`swarm_init`, `agent_spawn`, `task_orchestrate`
For non-benchmark scripts, exclusive GPU access is NOT required. However, check for **distributed port conflicts** before running:
### Monitoring
`swarm_status`, `agent_list`, `agent_metrics`, `task_status`, `task_results`
### Memory & Neural
`memory_usage`, `neural_status`, `neural_train`, `neural_patterns`
### GitHub Integration
`github_swarm`, `repo_analyze`, `pr_enhance`, `issue_triage`, `code_review`
### System
`benchmark_run`, `features_detect`, `swarm_monitor`
### Flow-Nexus MCP Tools (Optional Advanced Features)
Flow-Nexus extends MCP capabilities with 70+ cloud-based orchestration tools:
**Key MCP Tool Categories:**
- **Swarm & Agents**: `swarm_init`, `swarm_scale`, `agent_spawn`, `task_orchestrate`
- **Sandboxes**: `sandbox_create`, `sandbox_execute`, `sandbox_upload` (cloud execution)
- **Templates**: `template_list`, `template_deploy` (pre-built project templates)
- **Neural AI**: `neural_train`, `neural_patterns`, `seraphina_chat` (AI assistant)
- **GitHub**: `github_repo_analyze`, `github_pr_manage` (repository management)
- **Real-time**: `execution_stream_subscribe`, `realtime_subscribe` (live monitoring)
- **Storage**: `storage_upload`, `storage_list` (cloud file management)
**Authentication Required:**
- Register: `mcp__flow-nexus__user_register` or `npx flow-nexus@latest register`
- Login: `mcp__flow-nexus__user_login` or `npx flow-nexus@latest login`
- Access 70+ specialized MCP tools for advanced orchestration
## 🚀 Agent Execution Flow with Claude Code
### The Correct Pattern:
1. **Optional**: Use MCP tools to set up coordination topology
2. **REQUIRED**: Use Claude Code's Task tool to spawn agents that do actual work
3. **REQUIRED**: Each agent runs hooks for coordination
4. **REQUIRED**: Batch all operations in single messages
### Example Full-Stack Development:
```javascript
// Single message with all agent spawning via Claude Code's Task tool
[Parallel Agent Execution]:
Task("Backend Developer", "Build REST API with Express. Use hooks for coordination.", "backend-dev")
Task("Frontend Developer", "Create React UI. Coordinate with backend via memory.", "coder")
Task("Database Architect", "Design PostgreSQL schema. Store schema in memory.", "code-analyzer")
Task("Test Engineer", "Write Jest tests. Check memory for API contracts.", "tester")
Task("DevOps Engineer", "Setup Docker and CI/CD. Document in memory.", "cicd-engineer")
Task("Security Auditor", "Review authentication. Report findings via hooks.", "reviewer")
// All todos batched together
TodoWrite { todos: [...8-10 todos...] }
// All file operations together
Write "backend/server.js"
Write "frontend/App.jsx"
Write "database/schema.sql"
```
## 📋 Agent Coordination Protocol
### Every Agent Spawned via Task Tool MUST:
**1️⃣ BEFORE Work:**
```bash
npx claude-flow@alpha hooks pre-task --description "[task]"
npx claude-flow@alpha hooks session-restore --session-id "swarm-[id]"
# Check if port 2333 (nanovllm default) is in use
if lsof -i :2333 >/dev/null 2>&1; then
    echo "Port 2333 in use, waiting 10s..."
    sleep 10
fi
```
**2️⃣ DURING Work:**
**Note**: nanovllm uses port 2333 for `torch.distributed`. See [`docs/torch_distributed_port_issue.md`](docs/torch_distributed_port_issue.md) for known issues with creating multiple LLM instances in the same process.
## Multi-Instance Development with PYTHONPATH
**IMPORTANT**: When running multiple Claude instances on different worktrees, do NOT use `pip install -e .` globally as it will affect other instances.
**Use PYTHONPATH directly** - no pip install needed:
```bash
npx claude-flow@alpha hooks post-edit --file "[file]" --memory-key "swarm/[agent]/[step]"
npx claude-flow@alpha hooks notify --message "[what was done]"
# Set PYTHONPATH to point to the project root directory
PYTHONPATH=/path/to/your/worktree:$PYTHONPATH python <script.py>
# Example: running tests
PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH python tests/test_needle.py
```
**3️⃣ AFTER Work:**
```bash
npx claude-flow@alpha hooks post-task --task-id "[task]"
npx claude-flow@alpha hooks session-end --export-metrics true
```
**Benefits**:
- No `pip install` required
- Code changes take effect immediately (no reinstall needed)
- Each worktree is completely isolated
## 🎯 Concurrent Execution Examples
## Documentation Index
### ✅ CORRECT WORKFLOW: MCP Coordinates, Claude Code Executes
| Document | Purpose |
|----------|---------|
| [`docs/architecture_guide.md`](docs/architecture_guide.md) | Core components, layer-wise CPU offload design, prefill/decode flows, implementation details |
| [`docs/multi_model_support.md`](docs/multi_model_support.md) | Model registry system, adding new models (Qwen3/Llama), architecture differences, RoPE scaling |
| [`docs/cuda_graph_offload_guide.md`](docs/cuda_graph_offload_guide.md) | CUDA graph support for CPU offload decode path, 4x decode speedup |
| [`docs/sparse_attention_guide.md`](docs/sparse_attention_guide.md) | Block sparse attention methods (MInference, FlexPrefill, XAttention, Quest), computation flow |
| [`docs/sparse_prefill_integration_plan.md`](docs/sparse_prefill_integration_plan.md) | Integration plan for MInference/XAttention/FlexPrefill with unified BlockMask interface |
| [`docs/sparse_offload_integration.md`](docs/sparse_offload_integration.md) | Sparse policy integration with layerwise offload, `requires_block_selection` interface design |
| [`docs/layerwise_offload_memory_analysis.md`](docs/layerwise_offload_memory_analysis.md) | Memory allocation analysis with theoretical formulas and empirical validation (< 5% error) |
| [`docs/debugging_guide.md`](docs/debugging_guide.md) | PyTorch hooks for debugging, tensor comparison, memory profiling |
| [`docs/gpu_only_performance_issue.md`](docs/gpu_only_performance_issue.md) | GPU-only mode slower than offload due to PagedAttention scatter overhead, optimization proposals |
| [`docs/torch_distributed_port_issue.md`](docs/torch_distributed_port_issue.md) | **BUG**: Port conflict when creating multiple LLM instances, root cause and proposed solutions |
| [`docs/offload_accuracy_issue.md`](docs/offload_accuracy_issue.md) | **BUG**: CPU offload mode 66% accuracy vs 100% non-offload on RULER NIAH benchmark |
```javascript
// Step 1: MCP tools set up coordination (optional, for complex tasks)
[Single Message - Coordination Setup]:
mcp__claude-flow__swarm_init { topology: "mesh", maxAgents: 6 }
mcp__claude-flow__agent_spawn { type: "researcher" }
mcp__claude-flow__agent_spawn { type: "coder" }
mcp__claude-flow__agent_spawn { type: "tester" }
// Step 2: Claude Code Task tool spawns ACTUAL agents that do the work
[Single Message - Parallel Agent Execution]:
// Claude Code's Task tool spawns real agents concurrently
Task("Research agent", "Analyze API requirements and best practices. Check memory for prior decisions.", "researcher")
Task("Coder agent", "Implement REST endpoints with authentication. Coordinate via hooks.", "coder")
Task("Database agent", "Design and implement database schema. Store decisions in memory.", "code-analyzer")
Task("Tester agent", "Create comprehensive test suite with 90% coverage.", "tester")
Task("Reviewer agent", "Review code quality and security. Document findings.", "reviewer")
// Batch ALL todos in ONE call
TodoWrite { todos: [
{id: "1", content: "Research API patterns", status: "in_progress", priority: "high"},
{id: "2", content: "Design database schema", status: "in_progress", priority: "high"},
{id: "3", content: "Implement authentication", status: "pending", priority: "high"},
{id: "4", content: "Build REST endpoints", status: "pending", priority: "high"},
{id: "5", content: "Write unit tests", status: "pending", priority: "medium"},
{id: "6", content: "Integration tests", status: "pending", priority: "medium"},
{id: "7", content: "API documentation", status: "pending", priority: "low"},
{id: "8", content: "Performance optimization", status: "pending", priority: "low"}
]}
// Parallel file operations
Bash "mkdir -p app/{src,tests,docs,config}"
Write "app/package.json"
Write "app/src/server.js"
Write "app/tests/server.test.js"
Write "app/docs/API.md"
```

## Configuration
| Parameter | Default | Notes |
|-----------|---------|-------|
| `kvcache_block_size` | 4096 | Tokens per block |
| `max_num_batched_tokens` | 16384 | Set = max_model_len for long context |
| `gpu_memory_utilization` | 0.9 | GPU memory fraction |
| `enable_cpu_offload` | False | Enable for long context |
| `num_gpu_blocks` | 2 | GPU blocks for offload mode |
| `num_kv_buffers` | 4 | Ring buffer size for decode pipeline |
| `enforce_eager` | False | Set True to disable CUDA graphs |
### ❌ WRONG (Multiple Messages):
```javascript
Message 1: mcp__claude-flow__swarm_init
Message 2: Task("agent 1")
Message 3: TodoWrite { todos: [single todo] }
Message 4: Write "file.js"
// This breaks parallel coordination!
```
## Benchmarking
## Performance Benefits
**Files**: `bench.py` (GPU), `bench_offload.py` (CPU offload), `bench_vllm.py` (comparison)
- **84.8% SWE-Bench solve rate**
- **32.3% token reduction**
- **2.8-4.4x speed improvement**
- **27+ neural models**
**Common Issues**:
1. `max_num_batched_tokens < max_model_len`: Set equal for long context
2. CUDA graph dimension mismatch: Ensure `input_len + output_len <= max_model_len`
3. RoPE out of bounds: Check model's `max_position_embeddings` in config.json
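The first two rules can be sanity-checked before launch. A hedged sketch: the key names mirror the configuration table, and passing them to `LLM(...)` is an assumption, not something this file shows.

```python
# Long-context configuration sketch; key names follow the configuration table.
max_model_len = 131072  # e.g. the Llama-3.1-8B-Instruct limit

cfg = dict(
    max_model_len=max_model_len,
    max_num_batched_tokens=max_model_len,  # rule 1: set equal for long context
    enable_cpu_offload=True,
    enforce_eager=False,                   # keep CUDA graphs for faster decode
)

# Rule 2: requested lengths must fit inside max_model_len.
input_len, output_len = 130000, 1000
assert input_len + output_len <= cfg["max_model_len"]
assert cfg["max_num_batched_tokens"] >= cfg["max_model_len"]
print("config ok")
```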
## Hooks Integration
**Model Limits**:
- Qwen3-0.6B/4B: 40960 tokens
- Qwen2.5-7B-Instruct-1M: 1048576 tokens
- Llama-3.1-8B-Instruct: 131072 tokens
### Pre-Operation
- Auto-assign agents by file type
- Validate commands for safety
- Prepare resources automatically
- Optimize topology by complexity
- Cache searches
### Post-Operation
- Auto-format code
- Train neural patterns
- Update memory
- Analyze performance
- Track token usage
### Session Management
- Generate summaries
- Persist state
- Track metrics
- Restore context
- Export workflows
## Advanced Features (v2.0.0)
- 🚀 Automatic Topology Selection
- ⚡ Parallel Execution (2.8-4.4x speed)
- 🧠 Neural Training
- 📊 Bottleneck Analysis
- 🤖 Smart Auto-Spawning
- 🛡️ Self-Healing Workflows
- 💾 Cross-Session Memory
- 🔗 GitHub Integration
## Integration Tips
1. Start with basic swarm init
2. Scale agents gradually
3. Use memory for context
4. Monitor progress regularly
5. Train patterns from success
6. Enable hooks automation
7. Use GitHub tools first
## Support
- Documentation: https://github.com/ruvnet/claude-flow
- Issues: https://github.com/ruvnet/claude-flow/issues
- Flow-Nexus Platform: https://flow-nexus.ruv.io (registration required for cloud features)
**Performance (Qwen3-4B, CPU Offload)**:
- Prefill: ~5700-8000 tok/s (varies by context length)
- Decode with CUDA Graph: ~50 tok/s (TPOT ~19ms)
- Decode Eager Mode: ~12 tok/s (TPOT ~80ms)
- **CUDA Graph speedup: 4x decode throughput**
---
Remember: **Claude Flow coordinates, Claude Code creates!**
# Nano-vLLM Testing
## RULER NIAH Benchmark Test
Tests long context retrieval capability using RULER benchmark's NIAH (Needle-In-A-Haystack) task data (~32K tokens).
**Documentation**:
- [`docs/ruler_niah_standalone_test.md`](docs/ruler_niah_standalone_test.md) - Test setup and usage
- [`docs/offload_accuracy_issue.md`](docs/offload_accuracy_issue.md) - **[BUG]** Offload mode accuracy issue (66% vs 100%)
### Quick Start
```bash
# Single sample test (recommended for initial verification)
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --enable-offload

# All 5 samples
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --enable-offload \
    --sample-indices 0-4
```
### Options
| Option | Default | Description |
|--------|---------|-------------|
| `--model` | `~/models/Llama-3.1-8B-Instruct` | Model path |
| `--enable-offload` | False | Enable CPU offload (required for 24GB GPUs) |
| `--sample-indices` | all | Samples to test (e.g., `0,1,2` or `0-4`) |
| `--max-model-len` | 32768 | Maximum context length |
| `--use-cuda-graph` | False | Enable CUDA graph (faster decode) |
---
# important-instruction-reminders
Do what has been asked; nothing more, nothing less.
NEVER create files unless they're absolutely necessary for achieving your goal.
ALWAYS prefer editing an existing file to creating a new one.
NEVER proactively create documentation files (*.md) or README files. Only create documentation files if explicitly requested by the User.
Never save working files, text/mds and tests to the root folder.
**Author**: Zijie Tian


@@ -0,0 +1,308 @@
# Torch Distributed Port Conflict Issue
## Problem Summary
When attempting to create multiple `LLM` instances sequentially in the same Python process (e.g., for grouped testing), the second and subsequent instances fail with:
```
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address.
port: 2333, useIpv6: false, code: -98, name: EADDRINUSE, message: address already in use
```
## Root Cause Analysis
### 1. Distributed Process Group Initialization
In `nanovllm/engine/model_runner.py:30-32`:
```python
import os
port = os.environ.get("NANOVLLM_DIST_PORT", "2333")
dist.init_process_group("nccl", f"tcp://localhost:{port}", world_size=self.world_size, rank=rank)
```
- Default port is **2333** (configurable via `NANOVLLM_DIST_PORT` env var)
- `init_process_group()` binds a TCP socket to this port
- This binding persists until `destroy_process_group()` is called
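The EADDRINUSE behavior can be reproduced with nothing but the standard library; the sketch below uses plain `socket` objects as a stand-in for the TCP store that `init_process_group()` opens.

```python
import errno
import socket

# Stand-in for the TCP store socket that init_process_group() opens.
s1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s1.bind(("127.0.0.1", 0))      # port 0: let the OS choose a free port
port = s1.getsockname()[1]
s1.listen()

# A second bind to the same port fails while the first socket is open,
# exactly like the second LLM instance in the same process.
s2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    s2.bind(("127.0.0.1", port))
    err = None
except OSError as e:
    err = e.errno
finally:
    s2.close()

s1.close()  # once released (the destroy_process_group analogue), rebinding works
s3 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s3.bind(("127.0.0.1", port))
s3.close()

print(err == errno.EADDRINUSE)  # True
```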
### 2. Cleanup Mechanism
In `nanovllm/engine/llm_engine.py:37`:
```python
atexit.register(self.exit)
```
In `nanovllm/engine/llm_engine.py:39-43`:
```python
def exit(self):
    self.model_runner.call("exit")
    del self.model_runner
    for p in self.ps:
        p.join()
```
In `nanovllm/engine/model_runner.py:66-78`:
```python
def exit(self):
    # ... cleanup code ...
    dist.destroy_process_group()
```
### 3. The Problem
**`atexit` only triggers when the Python interpreter exits, NOT when the object is deleted or goes out of scope.**
Timeline of the bug:
```
1. Create LLM instance #1
   ├── init_process_group() binds port 2333 ✓
   └── atexit.register(self.exit) registered

2. LLM #1 goes out of scope (garbage collected)
   ├── Python's GC deletes the object
   ├── BUT atexit handler NOT triggered yet
   └── Port 2333 still bound! ❌

3. Create LLM instance #2
   ├── init_process_group() tries to bind port 2333
   └── EADDRINUSE error! ❌

4. Program exits (only now atexit runs)
   └── Too late - already crashed
```
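A ten-line experiment confirms the timeline (toy code, not nanovllm): an `atexit` handler registered in `__init__` does not run on `del`, and the registered bound method even keeps the instance alive.

```python
import atexit

events = []

class Resource:
    def __init__(self):
        # Same pattern as LLMEngine.__init__: register cleanup up front.
        atexit.register(self.cleanup)

    def cleanup(self):
        events.append("cleaned")

r = Resource()
del r           # the name is gone...
print(events)   # [] -- cleanup has NOT run; atexit fires only at interpreter exit
```

Note that `atexit.register(self.cleanup)` stores a bound method, which itself holds a reference to the instance, so `del` never drops the last reference either.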
## Impact
This issue affects:
1. **Grouped testing mode** (`test_ruler_niah.py --group-size N`)
- Each group needs a fresh LLM instance
- Second group fails with port conflict
2. **Multiple LLM instances in same process**
- Any code that creates LLM, deletes it, then creates another
3. **Interactive/notebook usage**
- Re-running cells that create LLM instances
## Proposed Solutions
### Solution A: Add `__del__` Method (Quick Fix)
Add destructor to `LLMEngine` that calls cleanup:
```python
# In nanovllm/engine/llm_engine.py
def __del__(self):
    try:
        self.exit()
    except Exception:
        pass  # Ignore errors during cleanup
```
**Pros**: Simple, backwards compatible
**Cons**: `__del__` is not guaranteed to run promptly (reference cycles, interpreter-shutdown ordering); worse, `atexit.register(self.exit)` stores a bound method that keeps the instance alive, so `del llm` alone never drops the last reference unless the handler is also unregistered
### Solution B: Context Manager Pattern (Recommended)
Make `LLMEngine` a context manager:
```python
# In nanovllm/engine/llm_engine.py
def __enter__(self):
    return self

def __exit__(self, exc_type, exc_val, exc_tb):
    self.exit()
    return False
```
Usage:
```python
with LLM(model_path) as llm:
    outputs = llm.generate(prompts, params)
# Cleanup happens automatically here
```
**Pros**: Explicit, guaranteed cleanup, Pythonic
**Cons**: Requires usage pattern change
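The "guaranteed cleanup" claim is the whole point of the pattern, so here is a minimal model of it (a hypothetical `Engine` class, not the real `LLMEngine`): `__exit__` runs even when the body raises.

```python
class Engine:
    """Toy stand-in for LLMEngine with the proposed protocol methods."""
    def __init__(self):
        self.closed = False

    def exit(self):
        self.closed = True  # in the real engine: destroy_process_group() etc.

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.exit()
        return False  # propagate exceptions, don't swallow them

eng = Engine()
try:
    with eng:
        raise RuntimeError("generation failed mid-run")
except RuntimeError:
    pass

print(eng.closed)  # True -- cleanup ran despite the exception
```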
### Solution C: Check and Cleanup Before Init (Defensive)
In `ModelRunner.__init__`, check if process group exists:
```python
# In nanovllm/engine/model_runner.py
if dist.is_initialized():
    dist.destroy_process_group()
dist.init_process_group("nccl", f"tcp://localhost:{port}", ...)
```
**Pros**: Self-healing, no usage pattern change
**Cons**: May mask other issues, global state manipulation
### Solution D: Subprocess Isolation (For Testing)
For grouped testing specifically, run each group in a subprocess:
```python
import subprocess
import sys

# groups: list of (start, end) sample-index ranges prepared by the caller
for start, end in groups:
    subprocess.run(
        [sys.executable, "test_ruler_niah.py",
         "--sample-indices", f"{start}-{end}"],
        check=True,
    )
```
**Pros**: Complete isolation, no code changes to nanovllm
**Cons**: More overhead, only solves testing use case
### Solution E: Dynamic Port Allocation
Instead of fixed port 2333, use dynamic port:
```python
import socket

def find_free_port():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(('', 0))
        return s.getsockname()[1]

port = os.environ.get("NANOVLLM_DIST_PORT") or find_free_port()
```
**Pros**: Avoids conflicts entirely
**Cons**: More complex; every rank must still agree on the same port, so in multi-process runs it has to be picked once and passed to all workers
## Recommended Implementation
**Combine Solutions A + B + C** for maximum robustness:
1. Add `__del__` for best-effort cleanup
2. Add context manager for explicit cleanup
3. Add `is_initialized()` check as defensive measure
```python
# nanovllm/engine/llm_engine.py
class LLMEngine:
    def __init__(self, model, **kwargs):
        # ... existing code ...
        atexit.register(self.exit)
        self._exited = False

    def exit(self):
        if self._exited:
            return
        self._exited = True
        self.model_runner.call("exit")
        del self.model_runner
        for p in self.ps:
            p.join()

    def __del__(self):
        try:
            self.exit()
        except Exception:
            pass

    def __enter__(self):
        return self

    def __exit__(self, *args):
        self.exit()
        return False


# nanovllm/engine/model_runner.py
class ModelRunner:
    def __init__(self, config: Config, rank: int, event):
        # ... existing code before init_process_group ...
        import os
        port = os.environ.get("NANOVLLM_DIST_PORT", "2333")
        # Defensive cleanup
        if dist.is_initialized():
            dist.destroy_process_group()
        dist.init_process_group("nccl", f"tcp://localhost:{port}",
                                world_size=self.world_size, rank=rank)
        # ... rest of init ...
```
## Workaround for Current Code
Until the fix is implemented, use one of these workarounds:
### Workaround 1: Manual Cleanup
```python
llm = LLM(model_path)
outputs = llm.generate(...)
llm.model_runner.call("exit")  # manual cleanup: triggers destroy_process_group()
del llm

# Now a new instance can bind port 2333 again
llm2 = LLM(model_path)
```
### Workaround 2: Subprocess Testing
```bash
# Run each test group as separate process
for i in $(seq 0 5 95); do
    python test_ruler_niah.py --sample-indices $i-$((i+4)) --enable-offload
done
```
### Workaround 3: Environment Variable Port
```bash
# Use different port for each run
NANOVLLM_DIST_PORT=2334 python test.py
NANOVLLM_DIST_PORT=2335 python test.py
```
## Related Files
| File | Relevant Code |
|------|---------------|
| `nanovllm/engine/model_runner.py:30-32` | `init_process_group()` call |
| `nanovllm/engine/model_runner.py:66-78` | `exit()` and `destroy_process_group()` |
| `nanovllm/engine/llm_engine.py:37` | `atexit.register()` |
| `nanovllm/engine/llm_engine.py:39-43` | `exit()` method |
## Testing the Fix
After implementing the fix, verify with:
```python
# test_multiple_llm.py
from nanovllm import LLM, SamplingParams

for i in range(3):
    print(f"Creating LLM instance {i+1}")
    llm = LLM("path/to/model", enable_cpu_offload=True)
    outputs = llm.generate(["Hello"], SamplingParams(max_tokens=10))
    print(f"Instance {i+1} output: {outputs[0]['text']}")
    del llm
    print(f"Instance {i+1} deleted\n")

print("All instances created and deleted successfully!")
```
Expected: No port conflict errors, all 3 instances work.
## Priority
**High** - This blocks grouped testing and any multi-LLM-instance workflows.


@@ -14,6 +14,9 @@ Usage:
# Test with custom model
python tests/test_ruler_niah.py --model /path/to/model --enable-offload
# Group mode: test in batches with separate LLM initialization per group
python tests/test_ruler_niah.py --enable-offload --group-size 5
"""
import os
@@ -216,6 +219,143 @@ def run_ruler_niah_test(
return correct, total
# ============================================================
# Grouped Test Function
# ============================================================
def run_grouped_test(
    model_path: str,
    data_file: Path,
    group_size: int = 5,
    total_samples: Optional[int] = None,
    max_model_len: int = DEFAULT_MAX_MODEL_LEN,
    max_new_tokens: int = DEFAULT_MAX_NEW_TOKENS,
    enable_cpu_offload: bool = False,
    num_gpu_blocks: int = 4,
    block_size: int = 1024,
    gpu_utilization: float = 0.9,
    enforce_eager: bool = True,
) -> Tuple[int, int, List[dict]]:
    """
    Run RULER NIAH test in groups, with separate LLM initialization per group.

    This mode is useful for:
    - Avoiding state accumulation issues
    - Testing LLM initialization stability
    - Running large-scale tests with memory cleanup between groups

    Args:
        model_path: Path to the model
        data_file: Path to JSONL data file
        group_size: Number of samples per group
        total_samples: Total samples to test (None = all in file)
        Other args: Same as run_ruler_niah_test

    Returns:
        (total_correct, total_tested, group_results): Results summary
    """
    import time
    import gc
    import torch

    # Count total samples in file
    file_sample_count = count_samples(data_file)
    if total_samples is None:
        total_samples = file_sample_count
    else:
        total_samples = min(total_samples, file_sample_count)

    num_groups = (total_samples + group_size - 1) // group_size

    print(f"\n{'='*60}")
    print(f"RULER NIAH Grouped Test")
    print(f"{'='*60}")
    print(f"Model: {model_path}")
    print(f"Data file: {data_file}")
    print(f"Total samples: {total_samples}")
    print(f"Group size: {group_size}")
    print(f"Number of groups: {num_groups}")
    print(f"CPU offload: {enable_cpu_offload}")
    print(f"{'='*60}\n")

    total_correct = 0
    total_tested = 0
    group_results = []
    all_failed = []
    test_start_time = time.time()

    for group_idx in range(num_groups):
        start_idx = group_idx * group_size
        end_idx = min(start_idx + group_size, total_samples)
        sample_indices = list(range(start_idx, end_idx))

        print(f"\n{'='*60}")
        print(f"Group {group_idx + 1}/{num_groups}: Samples {start_idx}-{end_idx - 1}")
        print(f"{'='*60}")

        group_start_time = time.time()

        # Run test for this group
        correct, tested = run_ruler_niah_test(
            model_path=model_path,
            data_file=data_file,
            sample_indices=sample_indices,
            max_model_len=max_model_len,
            max_new_tokens=max_new_tokens,
            enable_cpu_offload=enable_cpu_offload,
            num_gpu_blocks=num_gpu_blocks,
            block_size=block_size,
            gpu_utilization=gpu_utilization,
            enforce_eager=enforce_eager,
            verbose=True,
        )

        group_time = time.time() - group_start_time
        total_correct += correct
        total_tested += tested

        group_result = {
            "group": group_idx + 1,
            "samples": f"{start_idx}-{end_idx - 1}",
            "correct": correct,
            "total": tested,
            "accuracy": 100 * correct / tested if tested > 0 else 0,
            "time": group_time,
        }
        group_results.append(group_result)

        print(f"\nGroup {group_idx + 1} Summary: {correct}/{tested} PASSED ({group_result['accuracy']:.1f}%) in {group_time:.1f}s")

        # Force cleanup between groups
        gc.collect()
        torch.cuda.empty_cache()

        # Small delay to ensure port is released
        if group_idx < num_groups - 1:
            time.sleep(3)

    total_time = time.time() - test_start_time

    # Final summary
    print(f"\n{'='*60}")
    print(f"FINAL SUMMARY")
    print(f"{'='*60}")
    print(f"\nGroup Results:")
    print(f"{'Group':<8} {'Samples':<12} {'Result':<12} {'Accuracy':<10} {'Time':<10}")
    print(f"{'-'*52}")
    for r in group_results:
        print(f"{r['group']:<8} {r['samples']:<12} {r['correct']}/{r['total']:<9} {r['accuracy']:.1f}%{'':<5} {r['time']:.1f}s")
    print(f"{'-'*52}")

    overall_accuracy = 100 * total_correct / total_tested if total_tested > 0 else 0
    print(f"{'TOTAL':<8} {'0-' + str(total_tested-1):<12} {total_correct}/{total_tested:<9} {overall_accuracy:.1f}%{'':<5} {total_time:.1f}s")
    print(f"{'='*60}\n")

    return total_correct, total_tested, group_results
# ============================================================
# CLI Entry Point
# ============================================================
@@ -326,6 +466,18 @@ Examples:
action="store_true",
help="Quiet mode, only print final result"
)
parser.add_argument(
    "--group-size",
    type=int,
    default=0,
    help="Enable grouped testing mode with specified group size. Each group initializes LLM separately. (default: 0 = disabled)"
)
parser.add_argument(
    "--total-samples",
    type=int,
    default=0,
    help="Total number of samples to test in group mode (default: 0 = all samples in file)"
)
args = parser.parse_args()
@@ -334,20 +486,38 @@ Examples:
enforce_eager = not args.use_cuda_graph
verbose = not args.quiet
# Run test
correct, total = run_ruler_niah_test(
    model_path=os.path.expanduser(args.model),
    data_file=Path(args.data_file),
    sample_indices=sample_indices,
    max_model_len=args.max_model_len,
    max_new_tokens=args.max_new_tokens,
    enable_cpu_offload=args.enable_offload,
    num_gpu_blocks=args.num_gpu_blocks,
    block_size=args.block_size,
    gpu_utilization=args.gpu_utilization,
    enforce_eager=enforce_eager,
    verbose=verbose,
)
# Check if group mode is enabled
if args.group_size > 0:
    # Grouped testing mode
    total_samples = args.total_samples if args.total_samples > 0 else None
    correct, total, _ = run_grouped_test(
        model_path=os.path.expanduser(args.model),
        data_file=Path(args.data_file),
        group_size=args.group_size,
        total_samples=total_samples,
        max_model_len=args.max_model_len,
        max_new_tokens=args.max_new_tokens,
        enable_cpu_offload=args.enable_offload,
        num_gpu_blocks=args.num_gpu_blocks,
        block_size=args.block_size,
        gpu_utilization=args.gpu_utilization,
        enforce_eager=enforce_eager,
    )
else:
    # Standard testing mode
    correct, total = run_ruler_niah_test(
        model_path=os.path.expanduser(args.model),
        data_file=Path(args.data_file),
        sample_indices=sample_indices,
        max_model_len=args.max_model_len,
        max_new_tokens=args.max_new_tokens,
        enable_cpu_offload=args.enable_offload,
        num_gpu_blocks=args.num_gpu_blocks,
        block_size=args.block_size,
        gpu_utilization=args.gpu_utilization,
        enforce_eager=enforce_eager,
        verbose=verbose,
    )
# Final status
if correct == total: