[docs] Added dist port issue.
CLAUDE.md (435 changed lines)
@@ -1,389 +1,108 @@
# Claude Code Configuration - SPARC Development Environment
# CLAUDE.md

## 🚨 CRITICAL: CONCURRENT EXECUTION & FILE MANAGEMENT

This file provides guidance to Claude Code when working with this repository.

**ABSOLUTE RULES**:
1. ALL operations MUST be concurrent/parallel in a single message
2. **NEVER save working files, text/mds, and tests to the root folder**
3. ALWAYS organize files in appropriate subdirectories
4. **USE CLAUDE CODE'S TASK TOOL** for spawning agents concurrently, not just MCP

## Overview

### ⚡ GOLDEN RULE: "1 MESSAGE = ALL RELATED OPERATIONS"

Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline LLM inference. It supports multiple model architectures (Qwen3, Qwen2, Llama) with CPU offload for long-context inference.

**MANDATORY PATTERNS:**
- **TodoWrite**: ALWAYS batch ALL todos in ONE call (5-10+ todos minimum)
- **Task tool (Claude Code)**: ALWAYS spawn ALL agents in ONE message with full instructions
- **File operations**: ALWAYS batch ALL reads/writes/edits in ONE message
- **Bash commands**: ALWAYS batch ALL terminal operations in ONE message
- **Memory operations**: ALWAYS batch ALL memory store/retrieve operations in ONE message

## GPU Mutex for Multi-Instance Debugging

### 🎯 CRITICAL: Claude Code Task Tool for Agent Execution

**IMPORTANT**: When running multiple Claude instances for parallel debugging, different rules apply based on script type:

**Claude Code's Task tool is the PRIMARY way to spawn agents:**

```javascript
// ✅ CORRECT: Use Claude Code's Task tool for parallel agent execution
[Single Message]:
  Task("Research agent", "Analyze requirements and patterns...", "researcher")
  Task("Coder agent", "Implement core features...", "coder")
  Task("Tester agent", "Create comprehensive tests...", "tester")
  Task("Reviewer agent", "Review code quality...", "reviewer")
  Task("Architect agent", "Design system architecture...", "system-architect")
```
### Benchmarks (`bench*.py`) - Exclusive GPU Access Required

**MCP tools are ONLY for coordination setup:**
- `mcp__claude-flow__swarm_init` - Initialize coordination topology
- `mcp__claude-flow__agent_spawn` - Define agent types for coordination
- `mcp__claude-flow__task_orchestrate` - Orchestrate high-level workflows

### 📁 File Organization Rules

**NEVER save to the root folder. Use these directories:**
- `/src` - Source code files
- `/tests` - Test files
- `/docs` - Documentation and markdown files
- `/config` - Configuration files
- `/scripts` - Utility scripts
- `/examples` - Example code
## Project Overview

This project uses the SPARC (Specification, Pseudocode, Architecture, Refinement, Completion) methodology with Claude-Flow orchestration for systematic Test-Driven Development.

## SPARC Commands

### Core Commands
- `npx claude-flow sparc modes` - List available modes
- `npx claude-flow sparc run <mode> "<task>"` - Execute a specific mode
- `npx claude-flow sparc tdd "<feature>"` - Run the complete TDD workflow
- `npx claude-flow sparc info <mode>` - Get mode details

### Batchtools Commands
- `npx claude-flow sparc batch <modes> "<task>"` - Parallel execution
- `npx claude-flow sparc pipeline "<task>"` - Full pipeline processing
- `npx claude-flow sparc concurrent <mode> "<tasks-file>"` - Multi-task processing

### Build Commands
- `npm run build` - Build project
- `npm run test` - Run tests
- `npm run lint` - Linting
- `npm run typecheck` - Type checking

## SPARC Workflow Phases

1. **Specification** - Requirements analysis (`sparc run spec-pseudocode`)
2. **Pseudocode** - Algorithm design (`sparc run spec-pseudocode`)
3. **Architecture** - System design (`sparc run architect`)
4. **Refinement** - TDD implementation (`sparc tdd`)
5. **Completion** - Integration (`sparc run integration`)

## Code Style & Best Practices

- **Modular Design**: Keep files under 500 lines
- **Environment Safety**: Never hardcode secrets
- **Test-First**: Write tests before implementation
- **Clean Architecture**: Separate concerns
- **Documentation**: Keep it updated
## 🚀 Available Agents (54 Total)

### Core Development
`coder`, `reviewer`, `tester`, `planner`, `researcher`

### Swarm Coordination
`hierarchical-coordinator`, `mesh-coordinator`, `adaptive-coordinator`, `collective-intelligence-coordinator`, `swarm-memory-manager`

### Consensus & Distributed
`byzantine-coordinator`, `raft-manager`, `gossip-coordinator`, `consensus-builder`, `crdt-synchronizer`, `quorum-manager`, `security-manager`

### Performance & Optimization
`perf-analyzer`, `performance-benchmarker`, `task-orchestrator`, `memory-coordinator`, `smart-agent`

### GitHub & Repository
`github-modes`, `pr-manager`, `code-review-swarm`, `issue-tracker`, `release-manager`, `workflow-automation`, `project-board-sync`, `repo-architect`, `multi-repo-swarm`

### SPARC Methodology
`sparc-coord`, `sparc-coder`, `specification`, `pseudocode`, `architecture`, `refinement`

### Specialized Development
`backend-dev`, `mobile-dev`, `ml-developer`, `cicd-engineer`, `api-docs`, `system-architect`, `code-analyzer`, `base-template-generator`

### Testing & Validation
`tdd-london-swarm`, `production-validator`

### Migration & Planning
`migration-planner`, `swarm-init`
## 🎯 Claude Code vs MCP Tools

### Claude Code Handles ALL EXECUTION:
- **Task tool**: Spawn and run agents concurrently for actual work
- File operations (Read, Write, Edit, MultiEdit, Glob, Grep)
- Code generation and programming
- Bash commands and system operations
- Implementation work
- Project navigation and analysis
- TodoWrite and task management
- Git operations
- Package management
- Testing and debugging

### MCP Tools ONLY COORDINATE:
- Swarm initialization (topology setup)
- Agent type definitions (coordination patterns)
- Task orchestration (high-level planning)
- Memory management
- Neural features
- Performance tracking
- GitHub integration

**KEY**: MCP coordinates the strategy; Claude Code's Task tool executes with real agents.
## 🚀 Quick Setup

Before running any `bench*.py` script, Claude MUST wait for exclusive GPU access:

```bash
# Add MCP servers (Claude Flow required, others optional)
claude mcp add claude-flow npx claude-flow@alpha mcp start
claude mcp add ruv-swarm npx ruv-swarm mcp start  # Optional: enhanced coordination
claude mcp add flow-nexus npx flow-nexus@latest mcp start  # Optional: cloud features

# Check and wait for the GPU to be free
while [ -n "$(nvidia-smi --query-compute-apps=pid --format=csv,noheader)" ]; do
    echo "GPU busy, waiting 10s..."
    sleep 10
done
```
## MCP Tool Categories

### Other Scripts (tests, examples) - Port Conflict Check Only

For non-benchmark scripts, exclusive GPU access is NOT required. However, check for **distributed port conflicts** before running:

### Coordination
`swarm_init`, `agent_spawn`, `task_orchestrate`

### Monitoring
`swarm_status`, `agent_list`, `agent_metrics`, `task_status`, `task_results`

### Memory & Neural
`memory_usage`, `neural_status`, `neural_train`, `neural_patterns`

### GitHub Integration
`github_swarm`, `repo_analyze`, `pr_enhance`, `issue_triage`, `code_review`

### System
`benchmark_run`, `features_detect`, `swarm_monitor`

### Flow-Nexus MCP Tools (Optional Advanced Features)
Flow-Nexus extends MCP capabilities with 70+ cloud-based orchestration tools:

**Key MCP Tool Categories:**
- **Swarm & Agents**: `swarm_init`, `swarm_scale`, `agent_spawn`, `task_orchestrate`
- **Sandboxes**: `sandbox_create`, `sandbox_execute`, `sandbox_upload` (cloud execution)
- **Templates**: `template_list`, `template_deploy` (pre-built project templates)
- **Neural AI**: `neural_train`, `neural_patterns`, `seraphina_chat` (AI assistant)
- **GitHub**: `github_repo_analyze`, `github_pr_manage` (repository management)
- **Real-time**: `execution_stream_subscribe`, `realtime_subscribe` (live monitoring)
- **Storage**: `storage_upload`, `storage_list` (cloud file management)

**Authentication Required:**
- Register: `mcp__flow-nexus__user_register` or `npx flow-nexus@latest register`
- Login: `mcp__flow-nexus__user_login` or `npx flow-nexus@latest login`
- Access 70+ specialized MCP tools for advanced orchestration
## 🚀 Agent Execution Flow with Claude Code

### The Correct Pattern:

1. **Optional**: Use MCP tools to set up the coordination topology
2. **REQUIRED**: Use Claude Code's Task tool to spawn agents that do the actual work
3. **REQUIRED**: Each agent runs hooks for coordination
4. **REQUIRED**: Batch all operations in single messages

### Example Full-Stack Development:

```javascript
// Single message with all agent spawning via Claude Code's Task tool
[Parallel Agent Execution]:
  Task("Backend Developer", "Build REST API with Express. Use hooks for coordination.", "backend-dev")
  Task("Frontend Developer", "Create React UI. Coordinate with backend via memory.", "coder")
  Task("Database Architect", "Design PostgreSQL schema. Store schema in memory.", "code-analyzer")
  Task("Test Engineer", "Write Jest tests. Check memory for API contracts.", "tester")
  Task("DevOps Engineer", "Setup Docker and CI/CD. Document in memory.", "cicd-engineer")
  Task("Security Auditor", "Review authentication. Report findings via hooks.", "reviewer")

  // All todos batched together
  TodoWrite { todos: [...8-10 todos...] }

  // All file operations together
  Write "backend/server.js"
  Write "frontend/App.jsx"
  Write "database/schema.sql"
```
## 📋 Agent Coordination Protocol

### Every Agent Spawned via Task Tool MUST:

**1️⃣ BEFORE Work:**
```bash
npx claude-flow@alpha hooks pre-task --description "[task]"
npx claude-flow@alpha hooks session-restore --session-id "swarm-[id]"

# Check whether port 2333 (the nanovllm default) is in use
if lsof -i :2333 >/dev/null 2>&1; then
    echo "Port 2333 in use, waiting 10s..."
    sleep 10
fi
```

**2️⃣ DURING Work:**
**Note**: nanovllm uses port 2333 for `torch.distributed`. See [`docs/torch_distributed_port_issue.md`](docs/torch_distributed_port_issue.md) for known issues with creating multiple LLM instances in the same process.
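The `lsof -i :2333` check above can also be done portably from Python with only the standard library. A minimal sketch (a generic TCP port probe, not part of nanovllm):

```python
import socket

def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if the TCP port can be bound, i.e. nothing is listening on it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False

# Demo: occupy an ephemeral port, then observe that it is reported busy
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
busy_port = listener.getsockname()[1]
print(port_is_free(busy_port))  # → False
listener.close()
```

A `port_is_free(2333)` call before constructing an `LLM` would mirror the `lsof` check in the hook snippet above.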
## Multi-Instance Development with PYTHONPATH

**IMPORTANT**: When running multiple Claude instances on different worktrees, do NOT use `pip install -e .` globally, as it will affect the other instances.

**Use PYTHONPATH directly** - no pip install needed:

```bash
npx claude-flow@alpha hooks post-edit --file "[file]" --memory-key "swarm/[agent]/[step]"
npx claude-flow@alpha hooks notify --message "[what was done]"

# Set PYTHONPATH to point to the project root directory
PYTHONPATH=/path/to/your/worktree:$PYTHONPATH python <script.py>

# Example: running tests
PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH python tests/test_needle.py
```

**3️⃣ AFTER Work:**
```bash
npx claude-flow@alpha hooks post-task --task-id "[task]"
npx claude-flow@alpha hooks session-end --export-metrics true
```

**Benefits**:
- No `pip install` required
- Code changes take effect immediately (no reinstall needed)
- Each worktree is completely isolated
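Prepending to `PYTHONPATH` works because those entries land at the front of `sys.path`, so the worktree's copy of a package shadows any globally installed one. A self-contained sketch of that resolution order (using a throwaway module name, `demo_pkg_xyz`, invented for the demo):

```python
import importlib
import os
import sys
import tempfile

# Create a throwaway "worktree" containing a module named demo_pkg_xyz
worktree = tempfile.mkdtemp()
with open(os.path.join(worktree, "demo_pkg_xyz.py"), "w") as f:
    f.write("WHERE = 'worktree'\n")

# Equivalent of PYTHONPATH=<worktree>: prepend the directory to sys.path
sys.path.insert(0, worktree)
mod = importlib.import_module("demo_pkg_xyz")

print(mod.WHERE)                          # → worktree
print(mod.__file__.startswith(worktree))  # → True
```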
## 🎯 Concurrent Execution Examples

## Documentation Index

| Document | Purpose |
|----------|---------|
| [`docs/architecture_guide.md`](docs/architecture_guide.md) | Core components, layer-wise CPU offload design, prefill/decode flows, implementation details |
| [`docs/multi_model_support.md`](docs/multi_model_support.md) | Model registry system, adding new models (Qwen3/Llama), architecture differences, RoPE scaling |
| [`docs/cuda_graph_offload_guide.md`](docs/cuda_graph_offload_guide.md) | CUDA graph support for the CPU offload decode path, 4x decode speedup |
| [`docs/sparse_attention_guide.md`](docs/sparse_attention_guide.md) | Block sparse attention methods (MInference, FlexPrefill, XAttention, Quest), computation flow |
| [`docs/sparse_prefill_integration_plan.md`](docs/sparse_prefill_integration_plan.md) | Integration plan for MInference/XAttention/FlexPrefill with a unified BlockMask interface |
| [`docs/sparse_offload_integration.md`](docs/sparse_offload_integration.md) | Sparse policy integration with layerwise offload, `requires_block_selection` interface design |
| [`docs/layerwise_offload_memory_analysis.md`](docs/layerwise_offload_memory_analysis.md) | Memory allocation analysis with theoretical formulas and empirical validation (< 5% error) |
| [`docs/debugging_guide.md`](docs/debugging_guide.md) | PyTorch hooks for debugging, tensor comparison, memory profiling |
| [`docs/gpu_only_performance_issue.md`](docs/gpu_only_performance_issue.md) | GPU-only mode slower than offload due to PagedAttention scatter overhead, optimization proposals |
| [`docs/torch_distributed_port_issue.md`](docs/torch_distributed_port_issue.md) | **BUG**: Port conflict when creating multiple LLM instances, root cause and proposed solutions |
| [`docs/offload_accuracy_issue.md`](docs/offload_accuracy_issue.md) | **BUG**: CPU offload mode reaches 66% accuracy vs 100% non-offload on the RULER NIAH benchmark |

### ✅ CORRECT WORKFLOW: MCP Coordinates, Claude Code Executes
```javascript
// Step 1: MCP tools set up coordination (optional, for complex tasks)
[Single Message - Coordination Setup]:
  mcp__claude-flow__swarm_init { topology: "mesh", maxAgents: 6 }
  mcp__claude-flow__agent_spawn { type: "researcher" }
  mcp__claude-flow__agent_spawn { type: "coder" }
  mcp__claude-flow__agent_spawn { type: "tester" }

// Step 2: Claude Code's Task tool spawns the ACTUAL agents that do the work
[Single Message - Parallel Agent Execution]:
  // Claude Code's Task tool spawns real agents concurrently
  Task("Research agent", "Analyze API requirements and best practices. Check memory for prior decisions.", "researcher")
  Task("Coder agent", "Implement REST endpoints with authentication. Coordinate via hooks.", "coder")
  Task("Database agent", "Design and implement database schema. Store decisions in memory.", "code-analyzer")
  Task("Tester agent", "Create comprehensive test suite with 90% coverage.", "tester")
  Task("Reviewer agent", "Review code quality and security. Document findings.", "reviewer")

  // Batch ALL todos in ONE call
  TodoWrite { todos: [
    {id: "1", content: "Research API patterns", status: "in_progress", priority: "high"},
    {id: "2", content: "Design database schema", status: "in_progress", priority: "high"},
    {id: "3", content: "Implement authentication", status: "pending", priority: "high"},
    {id: "4", content: "Build REST endpoints", status: "pending", priority: "high"},
    {id: "5", content: "Write unit tests", status: "pending", priority: "medium"},
    {id: "6", content: "Integration tests", status: "pending", priority: "medium"},
    {id: "7", content: "API documentation", status: "pending", priority: "low"},
    {id: "8", content: "Performance optimization", status: "pending", priority: "low"}
  ]}

  // Parallel file operations
  Bash "mkdir -p app/{src,tests,docs,config}"
  Write "app/package.json"
  Write "app/src/server.js"
  Write "app/tests/server.test.js"
  Write "app/docs/API.md"
```

## Configuration

| Parameter | Default | Notes |
|-----------|---------|-------|
| `kvcache_block_size` | 4096 | Tokens per block |
| `max_num_batched_tokens` | 16384 | Set equal to `max_model_len` for long context |
| `gpu_memory_utilization` | 0.9 | GPU memory fraction |
| `enable_cpu_offload` | False | Enable for long context |
| `num_gpu_blocks` | 2 | GPU blocks for offload mode |
| `num_kv_buffers` | 4 | Ring buffer size for the decode pipeline |
| `enforce_eager` | False | Set True to disable CUDA graphs |
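For reference, the defaults above as a plain Python mapping (a sketch; whether the `LLM` constructor accepts exactly these keyword names should be checked against nanovllm's own `Config`):

```python
# Defaults from the configuration table; key names assumed to match nanovllm's fields
offload_config = {
    "kvcache_block_size": 4096,       # tokens per KV-cache block
    "max_num_batched_tokens": 16384,  # set equal to max_model_len for long context
    "gpu_memory_utilization": 0.9,    # fraction of GPU memory to use
    "enable_cpu_offload": False,      # enable for long context
    "num_gpu_blocks": 2,              # GPU blocks for offload mode
    "num_kv_buffers": 4,              # ring buffer size for the decode pipeline
    "enforce_eager": False,           # True disables CUDA graphs
}

# Hypothetical usage:
# llm = LLM(model_path, **offload_config)
print(offload_config["max_num_batched_tokens"])  # → 16384
```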
### ❌ WRONG (Multiple Messages):
```javascript
Message 1: mcp__claude-flow__swarm_init
Message 2: Task("agent 1")
Message 3: TodoWrite { todos: [single todo] }
Message 4: Write "file.js"
// This breaks parallel coordination!
```

## Benchmarking

**Files**: `bench.py` (GPU), `bench_offload.py` (CPU offload), `bench_vllm.py` (comparison)

## Performance Benefits

- **84.8% SWE-Bench solve rate**
- **32.3% token reduction**
- **2.8-4.4x speed improvement**
- **27+ neural models**

**Common Issues**:
1. `max_num_batched_tokens < max_model_len`: set them equal for long context
2. CUDA graph dimension mismatch: ensure `input_len + output_len <= max_model_len`
3. RoPE out of bounds: check the model's `max_position_embeddings` in config.json

## Hooks Integration

**Model Limits**:
- Qwen3-0.6B/4B: 40960 tokens
- Qwen2.5-7B-Instruct-1M: 1048576 tokens
- Llama-3.1-8B-Instruct: 131072 tokens
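Common issues 1 and 2 above are simple inequality checks; a tiny helper (hypothetical, not part of nanovllm) can validate a configuration before launch:

```python
def validate_lengths(max_num_batched_tokens: int, max_model_len: int,
                     input_len: int, output_len: int) -> list:
    """Return a list of configuration problems (empty list means OK)."""
    problems = []
    if max_num_batched_tokens < max_model_len:
        problems.append("max_num_batched_tokens < max_model_len: set them equal for long context")
    if input_len + output_len > max_model_len:
        problems.append("input_len + output_len exceeds max_model_len (CUDA graph dimension mismatch)")
    return problems

print(validate_lengths(32768, 32768, 30000, 2048))       # → []
print(len(validate_lengths(16384, 32768, 31000, 2048)))  # → 2
```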
### Pre-Operation
- Auto-assign agents by file type
- Validate commands for safety
- Prepare resources automatically
- Optimize topology by complexity
- Cache searches

### Post-Operation
- Auto-format code
- Train neural patterns
- Update memory
- Analyze performance
- Track token usage

### Session Management
- Generate summaries
- Persist state
- Track metrics
- Restore context
- Export workflows
## Advanced Features (v2.0.0)

- 🚀 Automatic Topology Selection
- ⚡ Parallel Execution (2.8-4.4x speed)
- 🧠 Neural Training
- 📊 Bottleneck Analysis
- 🤖 Smart Auto-Spawning
- 🛡️ Self-Healing Workflows
- 💾 Cross-Session Memory
- 🔗 GitHub Integration

## Integration Tips

1. Start with basic swarm init
2. Scale agents gradually
3. Use memory for context
4. Monitor progress regularly
5. Train patterns from success
6. Enable hooks automation
7. Use GitHub tools first

## Support

- Documentation: https://github.com/ruvnet/claude-flow
- Issues: https://github.com/ruvnet/claude-flow/issues
- Flow-Nexus Platform: https://flow-nexus.ruv.io (registration required for cloud features)
**Performance (Qwen3-4B, CPU Offload)**:
- Prefill: ~5700-8000 tok/s (varies by context length)
- Decode with CUDA graph: ~50 tok/s (TPOT ~19 ms)
- Decode in eager mode: ~12 tok/s (TPOT ~80 ms)
- **CUDA graph speedup: 4x decode throughput**

---

Remember: **Claude Flow coordinates, Claude Code creates!**
# Nano-vLLM Testing

## RULER NIAH Benchmark Test

Tests long-context retrieval capability using the RULER benchmark's NIAH (Needle-In-A-Haystack) task data (~32K tokens).

**Documentation**:
- [`docs/ruler_niah_standalone_test.md`](docs/ruler_niah_standalone_test.md) - Test setup and usage
- [`docs/offload_accuracy_issue.md`](docs/offload_accuracy_issue.md) - **[BUG]** Offload mode accuracy issue (66% vs 100%)

### Quick Start

```bash
# Single-sample test (recommended for initial verification)
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --enable-offload

# All 5 samples
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --enable-offload \
    --sample-indices 0-4
```
### Options

| Option | Default | Description |
|--------|---------|-------------|
| `--model` | `~/models/Llama-3.1-8B-Instruct` | Model path |
| `--enable-offload` | False | Enable CPU offload (required for 24GB GPUs) |
| `--sample-indices` | all | Samples to test (e.g., `0,1,2` or `0-4`) |
| `--max-model-len` | 32768 | Maximum context length |
| `--use-cuda-graph` | False | Enable CUDA graph (faster decode) |

---
# important-instruction-reminders

Do what has been asked; nothing more, nothing less.
NEVER create files unless they're absolutely necessary for achieving your goal.
ALWAYS prefer editing an existing file to creating a new one.
NEVER proactively create documentation files (*.md) or README files. Only create documentation files if explicitly requested by the User.
Never save working files, text/mds, and tests to the root folder.

**Author**: Zijie Tian
docs/torch_distributed_port_issue.md (308 lines, new file)

@@ -0,0 +1,308 @@
# Torch Distributed Port Conflict Issue

## Problem Summary

When attempting to create multiple `LLM` instances sequentially in the same Python process (e.g., for grouped testing), the second and subsequent instances fail with:

```
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address.
port: 2333, useIpv6: false, code: -98, name: EADDRINUSE, message: address already in use
```
## Root Cause Analysis

### 1. Distributed Process Group Initialization

In `nanovllm/engine/model_runner.py:30-32`:

```python
import os
port = os.environ.get("NANOVLLM_DIST_PORT", "2333")
dist.init_process_group("nccl", f"tcp://localhost:{port}", world_size=self.world_size, rank=rank)
```

- The default port is **2333** (configurable via the `NANOVLLM_DIST_PORT` env var)
- `init_process_group()` binds a TCP socket to this port
- The binding persists until `destroy_process_group()` is called
### 2. Cleanup Mechanism

In `nanovllm/engine/llm_engine.py:37`:

```python
atexit.register(self.exit)
```

In `nanovllm/engine/llm_engine.py:39-43`:

```python
def exit(self):
    self.model_runner.call("exit")
    del self.model_runner
    for p in self.ps:
        p.join()
```

In `nanovllm/engine/model_runner.py:66-78`:

```python
def exit(self):
    # ... cleanup code ...
    dist.destroy_process_group()
```
### 3. The Problem

**`atexit` only triggers when the Python interpreter exits, NOT when the object is deleted or goes out of scope.**

Timeline of the bug:

```
1. Create LLM instance #1
   ├── init_process_group() binds port 2333 ✓
   └── atexit.register(self.exit) registered

2. LLM #1 goes out of scope (garbage collected)
   ├── Python's GC deletes the object
   ├── BUT the atexit handler is NOT triggered yet
   └── Port 2333 is still bound! ❌

3. Create LLM instance #2
   ├── init_process_group() tries to bind port 2333
   └── EADDRINUSE error! ❌

4. Program exits (only now does atexit run)
   └── Too late - already crashed
```
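The behavior in step 2 is easy to reproduce without nanovllm. A minimal sketch showing that deleting an object does not fire its `atexit` handler (in fact, registering a bound method means `atexit` itself keeps the object alive):

```python
import atexit

cleanup_calls = []

class Resource:
    def __init__(self):
        # Mirrors LLMEngine's atexit.register(self.exit)
        atexit.register(self.close)

    def close(self):
        cleanup_calls.append("closed")

r = Resource()
del r  # the name is gone, but the atexit handler has NOT fired

# Nothing has been cleaned up yet; atexit only runs at interpreter shutdown.
print(cleanup_calls)  # → []
```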
## Impact

This issue affects:

1. **Grouped testing mode** (`test_ruler_niah.py --group-size N`)
   - Each group needs a fresh LLM instance
   - The second group fails with a port conflict

2. **Multiple LLM instances in the same process**
   - Any code that creates an LLM, deletes it, then creates another

3. **Interactive/notebook usage**
   - Re-running cells that create LLM instances
## Proposed Solutions

### Solution A: Add `__del__` Method (Quick Fix)

Add a destructor to `LLMEngine` that calls cleanup:

```python
# In nanovllm/engine/llm_engine.py

def __del__(self):
    try:
        self.exit()
    except Exception:
        pass  # Ignore errors during cleanup
```

**Pros**: Simple, backwards compatible
**Cons**: `__del__` is not guaranteed to be called (circular references, etc.)
### Solution B: Context Manager Pattern (Recommended)

Make `LLMEngine` a context manager:

```python
# In nanovllm/engine/llm_engine.py

def __enter__(self):
    return self

def __exit__(self, exc_type, exc_val, exc_tb):
    self.exit()
    return False
```

Usage:
```python
with LLM(model_path) as llm:
    outputs = llm.generate(prompts, params)
# Cleanup happens automatically here
```

**Pros**: Explicit, guaranteed cleanup, Pythonic
**Cons**: Requires a usage-pattern change
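The key property of the context-manager approach is that cleanup runs even when the body raises. A stripped-down, runnable sketch using a stand-in class (only the cleanup protocol matters; the real `LLMEngine` is not needed):

```python
class FakeEngine:
    """Stand-in for LLMEngine; implements only the cleanup protocol."""
    def __init__(self):
        self.closed = False

    def exit(self):
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.exit()
        return False  # do not swallow exceptions

# Cleanup runs on normal exit...
with FakeEngine() as eng:
    pass
print(eng.closed)  # → True

# ...and also when the body raises
try:
    with FakeEngine() as eng2:
        raise RuntimeError("boom")
except RuntimeError:
    pass
print(eng2.closed)  # → True
```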
### Solution C: Check and Cleanup Before Init (Defensive)

In `ModelRunner.__init__`, check whether a process group already exists:

```python
# In nanovllm/engine/model_runner.py

if dist.is_initialized():
    dist.destroy_process_group()
dist.init_process_group("nccl", f"tcp://localhost:{port}", ...)
```

**Pros**: Self-healing, no usage-pattern change
**Cons**: May mask other issues; manipulates global state
### Solution D: Subprocess Isolation (For Testing)

For grouped testing specifically, run each group in a subprocess:

```python
import subprocess
import sys

for group in groups:
    subprocess.run([sys.executable, "test_ruler_niah.py",
                    "--sample-indices", f"{start}-{end}"])
```

**Pros**: Complete isolation, no code changes to nanovllm
**Cons**: More overhead; only solves the testing use case
### Solution E: Dynamic Port Allocation

Instead of the fixed port 2333, allocate a port dynamically:

```python
import os
import socket

def find_free_port():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(('', 0))
        return s.getsockname()[1]

port = os.environ.get("NANOVLLM_DIST_PORT") or find_free_port()
```

**Pros**: Avoids conflicts entirely
**Cons**: More complex; another process can grab the port between `find_free_port()` and `init_process_group()` (a TOCTOU race)
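A quick, self-contained sanity check of the allocation helper above (no nanovllm required):

```python
import socket

def find_free_port():
    # Bind to port 0: the OS picks an unused ephemeral port
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(('', 0))
        return s.getsockname()[1]

port = find_free_port()
print(0 < port < 65536)  # → True

# The returned port is immediately bindable again (barring a race with
# another process, which is the TOCTOU caveat noted above)
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(('', port))
```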
## Recommended Implementation

**Combine Solutions A + B + C** for maximum robustness:

1. Add `__del__` for best-effort cleanup
2. Add a context manager for explicit cleanup
3. Add an `is_initialized()` check as a defensive measure

```python
# nanovllm/engine/llm_engine.py

class LLMEngine:
    def __init__(self, model, **kwargs):
        # ... existing code ...
        atexit.register(self.exit)
        self._exited = False

    def exit(self):
        if self._exited:
            return
        self._exited = True
        self.model_runner.call("exit")
        del self.model_runner
        for p in self.ps:
            p.join()

    def __del__(self):
        try:
            self.exit()
        except Exception:
            pass

    def __enter__(self):
        return self

    def __exit__(self, *args):
        self.exit()
        return False


# nanovllm/engine/model_runner.py

class ModelRunner:
    def __init__(self, config: Config, rank: int, event):
        # ... existing code before init_process_group ...

        import os
        port = os.environ.get("NANOVLLM_DIST_PORT", "2333")

        # Defensive cleanup
        if dist.is_initialized():
            dist.destroy_process_group()

        dist.init_process_group("nccl", f"tcp://localhost:{port}",
                                world_size=self.world_size, rank=rank)
        # ... rest of init ...
```
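The `_exited` flag in the combined implementation makes cleanup idempotent, so `__del__`, `__exit__`, and the `atexit` handler can all call `exit()` without double-freeing. A stripped-down sketch of just that guard pattern:

```python
class Engine:
    def __init__(self):
        self._exited = False
        self.exit_count = 0  # instrumentation for the demo

    def exit(self):
        if self._exited:       # guard: later calls become no-ops
            return
        self._exited = True
        self.exit_count += 1   # real cleanup (destroy_process_group, joins) goes here

e = Engine()
e.exit()
e.exit()  # second call is a no-op
print(e.exit_count)  # → 1
```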
## Workaround for Current Code

Until the fix is implemented, use one of these workarounds:

### Workaround 1: Manual Cleanup

```python
llm = LLM(model_path)
outputs = llm.generate(...)
llm.model_runner.call("exit")  # Manual cleanup: destroys the process group
del llm

# Now a new LLM can be created
llm2 = LLM(model_path)
```
### Workaround 2: Subprocess Testing

```bash
# Run each test group as a separate process
for i in $(seq 0 5 95); do
    python test_ruler_niah.py --sample-indices $i-$((i+4)) --enable-offload
done
```

### Workaround 3: Environment Variable Port

```bash
# Use a different port for each run
NANOVLLM_DIST_PORT=2334 python test.py
NANOVLLM_DIST_PORT=2335 python test.py
```
## Related Files

| File | Relevant Code |
|------|---------------|
| `nanovllm/engine/model_runner.py:30-32` | `init_process_group()` call |
| `nanovllm/engine/model_runner.py:66-78` | `exit()` and `destroy_process_group()` |
| `nanovllm/engine/llm_engine.py:37` | `atexit.register()` |
| `nanovllm/engine/llm_engine.py:39-43` | `exit()` method |
## Testing the Fix

After implementing the fix, verify with:

```python
# test_multiple_llm.py
from nanovllm import LLM, SamplingParams

for i in range(3):
    print(f"Creating LLM instance {i+1}")
    llm = LLM("path/to/model", enable_cpu_offload=True)
    outputs = llm.generate(["Hello"], SamplingParams(max_tokens=10))
    print(f"Instance {i+1} output: {outputs[0]['text']}")
    del llm
    print(f"Instance {i+1} deleted\n")

print("All instances created and deleted successfully!")
```

Expected: no port conflict errors; all 3 instances work.

## Priority

**High** - This blocks grouped testing and any multi-LLM-instance workflows.
@@ -14,6 +14,9 @@ Usage:

# Test with a custom model
python tests/test_ruler_niah.py --model /path/to/model --enable-offload

# Group mode: test in batches, with a separate LLM initialization per group
python tests/test_ruler_niah.py --enable-offload --group-size 5
"""

import os
```diff
@@ -216,6 +219,143 @@ def run_ruler_niah_test(
     return correct, total
 
 
+# ============================================================
+# Grouped Test Function
+# ============================================================
+
+def run_grouped_test(
+    model_path: str,
+    data_file: Path,
+    group_size: int = 5,
+    total_samples: Optional[int] = None,
+    max_model_len: int = DEFAULT_MAX_MODEL_LEN,
+    max_new_tokens: int = DEFAULT_MAX_NEW_TOKENS,
+    enable_cpu_offload: bool = False,
+    num_gpu_blocks: int = 4,
+    block_size: int = 1024,
+    gpu_utilization: float = 0.9,
+    enforce_eager: bool = True,
+) -> Tuple[int, int, List[dict]]:
+    """
+    Run RULER NIAH test in groups, with separate LLM initialization per group.
+
+    This mode is useful for:
+    - Avoiding state accumulation issues
+    - Testing LLM initialization stability
+    - Running large-scale tests with memory cleanup between groups
+
+    Args:
+        model_path: Path to the model
+        data_file: Path to JSONL data file
+        group_size: Number of samples per group
+        total_samples: Total samples to test (None = all in file)
+        Other args: Same as run_ruler_niah_test
+
+    Returns:
+        (total_correct, total_tested, group_results): Results summary
+    """
+    import time
+    import gc
+    import torch
+
+    # Count total samples in file
+    file_sample_count = count_samples(data_file)
+    if total_samples is None:
+        total_samples = file_sample_count
+    else:
+        total_samples = min(total_samples, file_sample_count)
+
+    num_groups = (total_samples + group_size - 1) // group_size
+
+    print(f"\n{'='*60}")
+    print(f"RULER NIAH Grouped Test")
+    print(f"{'='*60}")
+    print(f"Model: {model_path}")
+    print(f"Data file: {data_file}")
+    print(f"Total samples: {total_samples}")
+    print(f"Group size: {group_size}")
+    print(f"Number of groups: {num_groups}")
+    print(f"CPU offload: {enable_cpu_offload}")
+    print(f"{'='*60}\n")
+
+    total_correct = 0
+    total_tested = 0
+    group_results = []
+    all_failed = []
+
+    test_start_time = time.time()
+
+    for group_idx in range(num_groups):
+        start_idx = group_idx * group_size
+        end_idx = min(start_idx + group_size, total_samples)
+        sample_indices = list(range(start_idx, end_idx))
+
+        print(f"\n{'='*60}")
+        print(f"Group {group_idx + 1}/{num_groups}: Samples {start_idx}-{end_idx - 1}")
+        print(f"{'='*60}")
+
+        group_start_time = time.time()
+
+        # Run test for this group
+        correct, tested = run_ruler_niah_test(
+            model_path=model_path,
+            data_file=data_file,
+            sample_indices=sample_indices,
+            max_model_len=max_model_len,
+            max_new_tokens=max_new_tokens,
+            enable_cpu_offload=enable_cpu_offload,
+            num_gpu_blocks=num_gpu_blocks,
+            block_size=block_size,
+            gpu_utilization=gpu_utilization,
+            enforce_eager=enforce_eager,
+            verbose=True,
+        )
+
+        group_time = time.time() - group_start_time
+
+        total_correct += correct
+        total_tested += tested
+
+        group_result = {
+            "group": group_idx + 1,
+            "samples": f"{start_idx}-{end_idx - 1}",
+            "correct": correct,
+            "total": tested,
+            "accuracy": 100 * correct / tested if tested > 0 else 0,
+            "time": group_time,
+        }
+        group_results.append(group_result)
+
+        print(f"\nGroup {group_idx + 1} Summary: {correct}/{tested} PASSED ({group_result['accuracy']:.1f}%) in {group_time:.1f}s")
+
+        # Force cleanup between groups
+        gc.collect()
+        torch.cuda.empty_cache()
+
+        # Small delay to ensure port is released
+        if group_idx < num_groups - 1:
+            time.sleep(3)
+
+    total_time = time.time() - test_start_time
+
+    # Final summary
+    print(f"\n{'='*60}")
+    print(f"FINAL SUMMARY")
+    print(f"{'='*60}")
+    print(f"\nGroup Results:")
+    print(f"{'Group':<8} {'Samples':<12} {'Result':<12} {'Accuracy':<10} {'Time':<10}")
+    print(f"{'-'*52}")
+    for r in group_results:
+        print(f"{r['group']:<8} {r['samples']:<12} {r['correct']}/{r['total']:<9} {r['accuracy']:.1f}%{'':<5} {r['time']:.1f}s")
+
+    print(f"{'-'*52}")
+    overall_accuracy = 100 * total_correct / total_tested if total_tested > 0 else 0
+    print(f"{'TOTAL':<8} {'0-' + str(total_tested-1):<12} {total_correct}/{total_tested:<9} {overall_accuracy:.1f}%{'':<5} {total_time:.1f}s")
+    print(f"{'='*60}\n")
+
+    return total_correct, total_tested, group_results
+
+
 # ============================================================
 # CLI Entry Point
 # ============================================================
```
```diff
@@ -326,6 +466,18 @@ Examples:
         action="store_true",
         help="Quiet mode, only print final result"
     )
+    parser.add_argument(
+        "--group-size",
+        type=int,
+        default=0,
+        help="Enable grouped testing mode with specified group size. Each group initializes LLM separately. (default: 0 = disabled)"
+    )
+    parser.add_argument(
+        "--total-samples",
+        type=int,
+        default=0,
+        help="Total number of samples to test in group mode (default: 0 = all samples in file)"
+    )
 
     args = parser.parse_args()
```
```diff
@@ -334,20 +486,38 @@ Examples:
     enforce_eager = not args.use_cuda_graph
     verbose = not args.quiet
 
-    # Run test
-    correct, total = run_ruler_niah_test(
-        model_path=os.path.expanduser(args.model),
-        data_file=Path(args.data_file),
-        sample_indices=sample_indices,
-        max_model_len=args.max_model_len,
-        max_new_tokens=args.max_new_tokens,
-        enable_cpu_offload=args.enable_offload,
-        num_gpu_blocks=args.num_gpu_blocks,
-        block_size=args.block_size,
-        gpu_utilization=args.gpu_utilization,
-        enforce_eager=enforce_eager,
-        verbose=verbose,
-    )
+    # Check if group mode is enabled
+    if args.group_size > 0:
+        # Grouped testing mode
+        total_samples = args.total_samples if args.total_samples > 0 else None
+        correct, total, _ = run_grouped_test(
+            model_path=os.path.expanduser(args.model),
+            data_file=Path(args.data_file),
+            group_size=args.group_size,
+            total_samples=total_samples,
+            max_model_len=args.max_model_len,
+            max_new_tokens=args.max_new_tokens,
+            enable_cpu_offload=args.enable_offload,
+            num_gpu_blocks=args.num_gpu_blocks,
+            block_size=args.block_size,
+            gpu_utilization=args.gpu_utilization,
+            enforce_eager=enforce_eager,
+        )
+    else:
+        # Standard testing mode
+        correct, total = run_ruler_niah_test(
+            model_path=os.path.expanduser(args.model),
+            data_file=Path(args.data_file),
+            sample_indices=sample_indices,
+            max_model_len=args.max_model_len,
+            max_new_tokens=args.max_new_tokens,
+            enable_cpu_offload=args.enable_offload,
+            num_gpu_blocks=args.num_gpu_blocks,
+            block_size=args.block_size,
+            gpu_utilization=args.gpu_utilization,
+            enforce_eager=enforce_eager,
+            verbose=verbose,
+        )
 
     # Final status
     if correct == total:
```