diff --git a/CLAUDE.md b/CLAUDE.md index de95981..7705687 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -1,389 +1,108 @@ -# Claude Code Configuration - SPARC Development Environment +# CLAUDE.md -## 🚨 CRITICAL: CONCURRENT EXECUTION & FILE MANAGEMENT +This file provides guidance to Claude Code when working with this repository. -**ABSOLUTE RULES**: -1. ALL operations MUST be concurrent/parallel in a single message -2. **NEVER save working files, text/mds and tests to the root folder** -3. ALWAYS organize files in appropriate subdirectories -4. **USE CLAUDE CODE'S TASK TOOL** for spawning agents concurrently, not just MCP +## Overview -### ⚡ GOLDEN RULE: "1 MESSAGE = ALL RELATED OPERATIONS" +Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline LLM inference. Supports multiple model architectures (Qwen3, Qwen2, Llama) with CPU offload for long-context inference. -**MANDATORY PATTERNS:** -- **TodoWrite**: ALWAYS batch ALL todos in ONE call (5-10+ todos minimum) -- **Task tool (Claude Code)**: ALWAYS spawn ALL agents in ONE message with full instructions -- **File operations**: ALWAYS batch ALL reads/writes/edits in ONE message -- **Bash commands**: ALWAYS batch ALL terminal operations in ONE message -- **Memory operations**: ALWAYS batch ALL memory store/retrieve in ONE message +## GPU Mutex for Multi-Instance Debugging -### 🎯 CRITICAL: Claude Code Task Tool for Agent Execution +**IMPORTANT**: When running multiple Claude instances for parallel debugging, different rules apply based on script type: -**Claude Code's Task tool is the PRIMARY way to spawn agents:** -```javascript -// ✅ CORRECT: Use Claude Code's Task tool for parallel agent execution -[Single Message]: - Task("Research agent", "Analyze requirements and patterns...", "researcher") - Task("Coder agent", "Implement core features...", "coder") - Task("Tester agent", "Create comprehensive tests...", "tester") - Task("Reviewer agent", "Review code quality...", "reviewer") - Task("Architect 
agent", "Design system architecture...", "system-architect") -``` +### Benchmarks (`bench*.py`) - Exclusive GPU Access Required -**MCP tools are ONLY for coordination setup:** -- `mcp__claude-flow__swarm_init` - Initialize coordination topology -- `mcp__claude-flow__agent_spawn` - Define agent types for coordination -- `mcp__claude-flow__task_orchestrate` - Orchestrate high-level workflows - -### 📁 File Organization Rules - -**NEVER save to root folder. Use these directories:** -- `/src` - Source code files -- `/tests` - Test files -- `/docs` - Documentation and markdown files -- `/config` - Configuration files -- `/scripts` - Utility scripts -- `/examples` - Example code - -## Project Overview - -This project uses SPARC (Specification, Pseudocode, Architecture, Refinement, Completion) methodology with Claude-Flow orchestration for systematic Test-Driven Development. - -## SPARC Commands - -### Core Commands -- `npx claude-flow sparc modes` - List available modes -- `npx claude-flow sparc run ""` - Execute specific mode -- `npx claude-flow sparc tdd ""` - Run complete TDD workflow -- `npx claude-flow sparc info ` - Get mode details - -### Batchtools Commands -- `npx claude-flow sparc batch ""` - Parallel execution -- `npx claude-flow sparc pipeline ""` - Full pipeline processing -- `npx claude-flow sparc concurrent ""` - Multi-task processing - -### Build Commands -- `npm run build` - Build project -- `npm run test` - Run tests -- `npm run lint` - Linting -- `npm run typecheck` - Type checking - -## SPARC Workflow Phases - -1. **Specification** - Requirements analysis (`sparc run spec-pseudocode`) -2. **Pseudocode** - Algorithm design (`sparc run spec-pseudocode`) -3. **Architecture** - System design (`sparc run architect`) -4. **Refinement** - TDD implementation (`sparc tdd`) -5. 
**Completion** - Integration (`sparc run integration`) - -## Code Style & Best Practices - -- **Modular Design**: Files under 500 lines -- **Environment Safety**: Never hardcode secrets -- **Test-First**: Write tests before implementation -- **Clean Architecture**: Separate concerns -- **Documentation**: Keep updated - -## 🚀 Available Agents (54 Total) - -### Core Development -`coder`, `reviewer`, `tester`, `planner`, `researcher` - -### Swarm Coordination -`hierarchical-coordinator`, `mesh-coordinator`, `adaptive-coordinator`, `collective-intelligence-coordinator`, `swarm-memory-manager` - -### Consensus & Distributed -`byzantine-coordinator`, `raft-manager`, `gossip-coordinator`, `consensus-builder`, `crdt-synchronizer`, `quorum-manager`, `security-manager` - -### Performance & Optimization -`perf-analyzer`, `performance-benchmarker`, `task-orchestrator`, `memory-coordinator`, `smart-agent` - -### GitHub & Repository -`github-modes`, `pr-manager`, `code-review-swarm`, `issue-tracker`, `release-manager`, `workflow-automation`, `project-board-sync`, `repo-architect`, `multi-repo-swarm` - -### SPARC Methodology -`sparc-coord`, `sparc-coder`, `specification`, `pseudocode`, `architecture`, `refinement` - -### Specialized Development -`backend-dev`, `mobile-dev`, `ml-developer`, `cicd-engineer`, `api-docs`, `system-architect`, `code-analyzer`, `base-template-generator` - -### Testing & Validation -`tdd-london-swarm`, `production-validator` - -### Migration & Planning -`migration-planner`, `swarm-init` - -## 🎯 Claude Code vs MCP Tools - -### Claude Code Handles ALL EXECUTION: -- **Task tool**: Spawn and run agents concurrently for actual work -- File operations (Read, Write, Edit, MultiEdit, Glob, Grep) -- Code generation and programming -- Bash commands and system operations -- Implementation work -- Project navigation and analysis -- TodoWrite and task management -- Git operations -- Package management -- Testing and debugging - -### MCP Tools ONLY COORDINATE: -- 
Swarm initialization (topology setup) -- Agent type definitions (coordination patterns) -- Task orchestration (high-level planning) -- Memory management -- Neural features -- Performance tracking -- GitHub integration - -**KEY**: MCP coordinates the strategy, Claude Code's Task tool executes with real agents. - -## 🚀 Quick Setup +Before running any `bench*.py` script, Claude MUST wait for exclusive GPU access: ```bash -# Add MCP servers (Claude Flow required, others optional) -claude mcp add claude-flow npx claude-flow@alpha mcp start -claude mcp add ruv-swarm npx ruv-swarm mcp start # Optional: Enhanced coordination -claude mcp add flow-nexus npx flow-nexus@latest mcp start # Optional: Cloud features +# Check and wait for GPU to be free +while [ -n "$(nvidia-smi --query-compute-apps=pid --format=csv,noheader)" ]; do + echo "GPU busy, waiting 10s..." + sleep 10 +done ``` -## MCP Tool Categories +### Other Scripts (tests, examples) - Port Conflict Check Only -### Coordination -`swarm_init`, `agent_spawn`, `task_orchestrate` +For non-benchmark scripts, exclusive GPU access is NOT required. 
However, check for **distributed port conflicts** before running: -### Monitoring -`swarm_status`, `agent_list`, `agent_metrics`, `task_status`, `task_results` - -### Memory & Neural -`memory_usage`, `neural_status`, `neural_train`, `neural_patterns` - -### GitHub Integration -`github_swarm`, `repo_analyze`, `pr_enhance`, `issue_triage`, `code_review` - -### System -`benchmark_run`, `features_detect`, `swarm_monitor` - -### Flow-Nexus MCP Tools (Optional Advanced Features) -Flow-Nexus extends MCP capabilities with 70+ cloud-based orchestration tools: - -**Key MCP Tool Categories:** -- **Swarm & Agents**: `swarm_init`, `swarm_scale`, `agent_spawn`, `task_orchestrate` -- **Sandboxes**: `sandbox_create`, `sandbox_execute`, `sandbox_upload` (cloud execution) -- **Templates**: `template_list`, `template_deploy` (pre-built project templates) -- **Neural AI**: `neural_train`, `neural_patterns`, `seraphina_chat` (AI assistant) -- **GitHub**: `github_repo_analyze`, `github_pr_manage` (repository management) -- **Real-time**: `execution_stream_subscribe`, `realtime_subscribe` (live monitoring) -- **Storage**: `storage_upload`, `storage_list` (cloud file management) - -**Authentication Required:** -- Register: `mcp__flow-nexus__user_register` or `npx flow-nexus@latest register` -- Login: `mcp__flow-nexus__user_login` or `npx flow-nexus@latest login` -- Access 70+ specialized MCP tools for advanced orchestration - -## 🚀 Agent Execution Flow with Claude Code - -### The Correct Pattern: - -1. **Optional**: Use MCP tools to set up coordination topology -2. **REQUIRED**: Use Claude Code's Task tool to spawn agents that do actual work -3. **REQUIRED**: Each agent runs hooks for coordination -4. **REQUIRED**: Batch all operations in single messages - -### Example Full-Stack Development: - -```javascript -// Single message with all agent spawning via Claude Code's Task tool -[Parallel Agent Execution]: - Task("Backend Developer", "Build REST API with Express. 
Use hooks for coordination.", "backend-dev")
-  Task("Frontend Developer", "Create React UI. Coordinate with backend via memory.", "coder")
-  Task("Database Architect", "Design PostgreSQL schema. Store schema in memory.", "code-analyzer")
-  Task("Test Engineer", "Write Jest tests. Check memory for API contracts.", "tester")
-  Task("DevOps Engineer", "Setup Docker and CI/CD. Document in memory.", "cicd-engineer")
-  Task("Security Auditor", "Review authentication. Report findings via hooks.", "reviewer")
-
-  // All todos batched together
-  TodoWrite { todos: [...8-10 todos...] }
-
-  // All file operations together
-  Write "backend/server.js"
-  Write "frontend/App.jsx"
-  Write "database/schema.sql"
-```
-
-## 📋 Agent Coordination Protocol
-
-### Every Agent Spawned via Task Tool MUST:
-
-**1️⃣ BEFORE Work:**
 ```bash
-npx claude-flow@alpha hooks pre-task --description "[task]"
-npx claude-flow@alpha hooks session-restore --session-id "swarm-[id]"
+# Wait until port 2333 (nanovllm default) is free
+while lsof -i :2333 >/dev/null 2>&1; do
+  echo "Port 2333 in use, waiting 10s..."
+  sleep 10
+done
 ```
 
-**2️⃣ DURING Work:**
+**Note**: nanovllm uses port 2333 for `torch.distributed`. See [`docs/torch_distributed_port_issue.md`](docs/torch_distributed_port_issue.md) for known issues with creating multiple LLM instances in the same process.
+
+## Multi-Instance Development with PYTHONPATH
+
+**IMPORTANT**: When running multiple Claude instances on different worktrees, do NOT use `pip install -e .` globally, as it will affect other instances.
+ +**Use PYTHONPATH directly** - no pip install needed: + ```bash -npx claude-flow@alpha hooks post-edit --file "[file]" --memory-key "swarm/[agent]/[step]" -npx claude-flow@alpha hooks notify --message "[what was done]" +# Set PYTHONPATH to point to the project root directory +PYTHONPATH=/path/to/your/worktree:$PYTHONPATH python + +# Example: running tests +PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH python tests/test_needle.py ``` -**3️⃣ AFTER Work:** -```bash -npx claude-flow@alpha hooks post-task --task-id "[task]" -npx claude-flow@alpha hooks session-end --export-metrics true -``` +**Benefits**: +- No `pip install` required +- Code changes take effect immediately (no reinstall needed) +- Each worktree is completely isolated -## 🎯 Concurrent Execution Examples +## Documentation Index -### ✅ CORRECT WORKFLOW: MCP Coordinates, Claude Code Executes +| Document | Purpose | +|----------|---------| +| [`docs/architecture_guide.md`](docs/architecture_guide.md) | Core components, layer-wise CPU offload design, prefill/decode flows, implementation details | +| [`docs/multi_model_support.md`](docs/multi_model_support.md) | Model registry system, adding new models (Qwen3/Llama), architecture differences, RoPE scaling | +| [`docs/cuda_graph_offload_guide.md`](docs/cuda_graph_offload_guide.md) | CUDA graph support for CPU offload decode path, 4x decode speedup | +| [`docs/sparse_attention_guide.md`](docs/sparse_attention_guide.md) | Block sparse attention methods (MInference, FlexPrefill, XAttention, Quest), computation flow | +| [`docs/sparse_prefill_integration_plan.md`](docs/sparse_prefill_integration_plan.md) | Integration plan for MInference/XAttention/FlexPrefill with unified BlockMask interface | +| [`docs/sparse_offload_integration.md`](docs/sparse_offload_integration.md) | Sparse policy integration with layerwise offload, `requires_block_selection` interface design | +| [`docs/layerwise_offload_memory_analysis.md`](docs/layerwise_offload_memory_analysis.md) | 
Memory allocation analysis with theoretical formulas and empirical validation (< 5% error) | +| [`docs/debugging_guide.md`](docs/debugging_guide.md) | PyTorch hooks for debugging, tensor comparison, memory profiling | +| [`docs/gpu_only_performance_issue.md`](docs/gpu_only_performance_issue.md) | GPU-only mode slower than offload due to PagedAttention scatter overhead, optimization proposals | +| [`docs/torch_distributed_port_issue.md`](docs/torch_distributed_port_issue.md) | **BUG**: Port conflict when creating multiple LLM instances, root cause and proposed solutions | +| [`docs/offload_accuracy_issue.md`](docs/offload_accuracy_issue.md) | **BUG**: CPU offload mode 66% accuracy vs 100% non-offload on RULER NIAH benchmark | -```javascript -// Step 1: MCP tools set up coordination (optional, for complex tasks) -[Single Message - Coordination Setup]: - mcp__claude-flow__swarm_init { topology: "mesh", maxAgents: 6 } - mcp__claude-flow__agent_spawn { type: "researcher" } - mcp__claude-flow__agent_spawn { type: "coder" } - mcp__claude-flow__agent_spawn { type: "tester" } +## Configuration -// Step 2: Claude Code Task tool spawns ACTUAL agents that do the work -[Single Message - Parallel Agent Execution]: - // Claude Code's Task tool spawns real agents concurrently - Task("Research agent", "Analyze API requirements and best practices. Check memory for prior decisions.", "researcher") - Task("Coder agent", "Implement REST endpoints with authentication. Coordinate via hooks.", "coder") - Task("Database agent", "Design and implement database schema. Store decisions in memory.", "code-analyzer") - Task("Tester agent", "Create comprehensive test suite with 90% coverage.", "tester") - Task("Reviewer agent", "Review code quality and security. 
Document findings.", "reviewer") - - // Batch ALL todos in ONE call - TodoWrite { todos: [ - {id: "1", content: "Research API patterns", status: "in_progress", priority: "high"}, - {id: "2", content: "Design database schema", status: "in_progress", priority: "high"}, - {id: "3", content: "Implement authentication", status: "pending", priority: "high"}, - {id: "4", content: "Build REST endpoints", status: "pending", priority: "high"}, - {id: "5", content: "Write unit tests", status: "pending", priority: "medium"}, - {id: "6", content: "Integration tests", status: "pending", priority: "medium"}, - {id: "7", content: "API documentation", status: "pending", priority: "low"}, - {id: "8", content: "Performance optimization", status: "pending", priority: "low"} - ]} - - // Parallel file operations - Bash "mkdir -p app/{src,tests,docs,config}" - Write "app/package.json" - Write "app/src/server.js" - Write "app/tests/server.test.js" - Write "app/docs/API.md" -``` +| Parameter | Default | Notes | +|-----------|---------|-------| +| `kvcache_block_size` | 4096 | Tokens per block | +| `max_num_batched_tokens` | 16384 | Set = max_model_len for long context | +| `gpu_memory_utilization` | 0.9 | GPU memory fraction | +| `enable_cpu_offload` | False | Enable for long context | +| `num_gpu_blocks` | 2 | GPU blocks for offload mode | +| `num_kv_buffers` | 4 | Ring buffer size for decode pipeline | +| `enforce_eager` | False | Set True to disable CUDA graphs | -### ❌ WRONG (Multiple Messages): -```javascript -Message 1: mcp__claude-flow__swarm_init -Message 2: Task("agent 1") -Message 3: TodoWrite { todos: [single todo] } -Message 4: Write "file.js" -// This breaks parallel coordination! -``` +## Benchmarking -## Performance Benefits +**Files**: `bench.py` (GPU), `bench_offload.py` (CPU offload), `bench_vllm.py` (comparison) -- **84.8% SWE-Bench solve rate** -- **32.3% token reduction** -- **2.8-4.4x speed improvement** -- **27+ neural models** +**Common Issues**: +1. 
`max_num_batched_tokens < max_model_len`: Set equal for long context +2. CUDA graph dimension mismatch: Ensure `input_len + output_len <= max_model_len` +3. RoPE out of bounds: Check model's `max_position_embeddings` in config.json -## Hooks Integration +**Model Limits**: +- Qwen3-0.6B/4B: 40960 tokens +- Qwen2.5-7B-Instruct-1M: 1048576 tokens +- Llama-3.1-8B-Instruct: 131072 tokens -### Pre-Operation -- Auto-assign agents by file type -- Validate commands for safety -- Prepare resources automatically -- Optimize topology by complexity -- Cache searches - -### Post-Operation -- Auto-format code -- Train neural patterns -- Update memory -- Analyze performance -- Track token usage - -### Session Management -- Generate summaries -- Persist state -- Track metrics -- Restore context -- Export workflows - -## Advanced Features (v2.0.0) - -- 🚀 Automatic Topology Selection -- ⚡ Parallel Execution (2.8-4.4x speed) -- 🧠 Neural Training -- 📊 Bottleneck Analysis -- 🤖 Smart Auto-Spawning -- 🛡️ Self-Healing Workflows -- 💾 Cross-Session Memory -- 🔗 GitHub Integration - -## Integration Tips - -1. Start with basic swarm init -2. Scale agents gradually -3. Use memory for context -4. Monitor progress regularly -5. Train patterns from success -6. Enable hooks automation -7. Use GitHub tools first - -## Support - -- Documentation: https://github.com/ruvnet/claude-flow -- Issues: https://github.com/ruvnet/claude-flow/issues -- Flow-Nexus Platform: https://flow-nexus.ruv.io (registration required for cloud features) +**Performance (Qwen3-4B, CPU Offload)**: +- Prefill: ~5700-8000 tok/s (varies by context length) +- Decode with CUDA Graph: ~50 tok/s (TPOT ~19ms) +- Decode Eager Mode: ~12 tok/s (TPOT ~80ms) +- **CUDA Graph speedup: 4x decode throughput** --- -Remember: **Claude Flow coordinates, Claude Code creates!** - -# Nano-vLLM Testing - -## RULER NIAH Benchmark Test - -Tests long context retrieval capability using RULER benchmark's NIAH (Needle-In-A-Haystack) task data (~32K tokens). 
- -**Documentation**: -- [`docs/ruler_niah_standalone_test.md`](docs/ruler_niah_standalone_test.md) - Test setup and usage -- [`docs/offload_accuracy_issue.md`](docs/offload_accuracy_issue.md) - **[BUG]** Offload mode accuracy issue (66% vs 100%) - -### Quick Start - -```bash -# Single sample test (recommended for initial verification) -CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \ - --model ~/models/Llama-3.1-8B-Instruct \ - --enable-offload - -# All 5 samples -CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \ - --model ~/models/Llama-3.1-8B-Instruct \ - --enable-offload \ - --sample-indices 0-4 -``` - -### Options - -| Option | Default | Description | -|--------|---------|-------------| -| `--model` | `~/models/Llama-3.1-8B-Instruct` | Model path | -| `--enable-offload` | False | Enable CPU offload (required for 24GB GPUs) | -| `--sample-indices` | all | Samples to test (e.g., `0,1,2` or `0-4`) | -| `--max-model-len` | 32768 | Maximum context length | -| `--use-cuda-graph` | False | Enable CUDA graph (faster decode) | - ---- - -# important-instruction-reminders -Do what has been asked; nothing more, nothing less. -NEVER create files unless they're absolutely necessary for achieving your goal. -ALWAYS prefer editing an existing file to creating a new one. -NEVER proactively create documentation files (*.md) or README files. Only create documentation files if explicitly requested by the User. -Never save working files, text/mds and tests to the root folder. 
+**Author**: Zijie Tian diff --git a/docs/torch_distributed_port_issue.md b/docs/torch_distributed_port_issue.md new file mode 100644 index 0000000..889ac44 --- /dev/null +++ b/docs/torch_distributed_port_issue.md @@ -0,0 +1,308 @@ +# Torch Distributed Port Conflict Issue + +## Problem Summary + +When attempting to create multiple `LLM` instances sequentially in the same Python process (e.g., for grouped testing), the second and subsequent instances fail with: + +``` +torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. +port: 2333, useIpv6: false, code: -98, name: EADDRINUSE, message: address already in use +``` + +## Root Cause Analysis + +### 1. Distributed Process Group Initialization + +In `nanovllm/engine/model_runner.py:30-32`: + +```python +import os +port = os.environ.get("NANOVLLM_DIST_PORT", "2333") +dist.init_process_group("nccl", f"tcp://localhost:{port}", world_size=self.world_size, rank=rank) +``` + +- Default port is **2333** (configurable via `NANOVLLM_DIST_PORT` env var) +- `init_process_group()` binds a TCP socket to this port +- This binding persists until `destroy_process_group()` is called + +### 2. Cleanup Mechanism + +In `nanovllm/engine/llm_engine.py:37`: + +```python +atexit.register(self.exit) +``` + +In `nanovllm/engine/llm_engine.py:39-43`: + +```python +def exit(self): + self.model_runner.call("exit") + del self.model_runner + for p in self.ps: + p.join() +``` + +In `nanovllm/engine/model_runner.py:66-78`: + +```python +def exit(self): + # ... cleanup code ... + dist.destroy_process_group() +``` + +### 3. The Problem + +**`atexit` only triggers when the Python interpreter exits, NOT when the object is deleted or goes out of scope.** + +Timeline of the bug: + +``` +1. Create LLM instance #1 + ├── init_process_group() binds port 2333 ✓ + └── atexit.register(self.exit) registered + +2. 
LLM #1 goes out of scope
+   ├── `del` or scope exit drops the user's reference
+   ├── BUT atexit still holds a reference to `self.exit`, so the
+   │   object is not garbage collected and exit() never runs
+   └── Port 2333 still bound! ❌
+
+3. Create LLM instance #2
+   ├── init_process_group() tries to bind port 2333
+   └── EADDRINUSE error! ❌
+
+4. Program exits (only now atexit runs)
+   └── Too late - already crashed
+```
+
+## Impact
+
+This issue affects:
+
+1. **Grouped testing mode** (`test_ruler_niah.py --group-size N`)
+   - Each group needs a fresh LLM instance
+   - Second group fails with port conflict
+
+2. **Multiple LLM instances in same process**
+   - Any code that creates LLM, deletes it, then creates another
+
+3. **Interactive/notebook usage**
+   - Re-running cells that create LLM instances
+
+## Proposed Solutions
+
+### Solution A: Add `__del__` Method (Quick Fix)
+
+Add a destructor to `LLMEngine` that calls cleanup:
+
+```python
+# In nanovllm/engine/llm_engine.py
+
+def __del__(self):
+    try:
+        self.exit()
+    except Exception:
+        pass  # Ignore errors during cleanup
+```
+
+**Pros**: Simple, backwards compatible
+**Cons**: `__del__` is not guaranteed to run; in particular, `atexit.register(self.exit)` keeps a strong reference to the engine, so `del llm` alone never triggers it
+
+### Solution B: Context Manager Pattern (Recommended)
+
+Make `LLMEngine` a context manager:
+
+```python
+# In nanovllm/engine/llm_engine.py
+
+def __enter__(self):
+    return self
+
+def __exit__(self, exc_type, exc_val, exc_tb):
+    self.exit()
+    return False
+```
+
+Usage:
+```python
+with LLM(model_path) as llm:
+    outputs = llm.generate(prompts, params)
+# Cleanup happens automatically here
+```
+
+**Pros**: Explicit, guaranteed cleanup, Pythonic
+**Cons**: Requires usage pattern change
+
+### Solution C: Check and Cleanup Before Init (Defensive)
+
+In `ModelRunner.__init__`, check if a process group already exists:
+
+```python
+# In nanovllm/engine/model_runner.py
+
+if dist.is_initialized():
+    dist.destroy_process_group()
+dist.init_process_group("nccl", f"tcp://localhost:{port}", ...)
+``` + +**Pros**: Self-healing, no usage pattern change +**Cons**: May mask other issues, global state manipulation + +### Solution D: Subprocess Isolation (For Testing) + +For grouped testing specifically, run each group in a subprocess: + +```python +import subprocess +for group in groups: + subprocess.run([sys.executable, "test_ruler_niah.py", + "--sample-indices", f"{start}-{end}"]) +``` + +**Pros**: Complete isolation, no code changes to nanovllm +**Cons**: More overhead, only solves testing use case + +### Solution E: Dynamic Port Allocation + +Instead of fixed port 2333, use dynamic port: + +```python +import socket + +def find_free_port(): + with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s: + s.bind(('', 0)) + return s.getsockname()[1] + +port = os.environ.get("NANOVLLM_DIST_PORT") or find_free_port() +``` + +**Pros**: Avoids conflicts entirely +**Cons**: More complex, may have side effects + +## Recommended Implementation + +**Combine Solutions A + B + C** for maximum robustness: + +1. Add `__del__` for best-effort cleanup +2. Add context manager for explicit cleanup +3. Add `is_initialized()` check as defensive measure + +```python +# nanovllm/engine/llm_engine.py + +class LLMEngine: + def __init__(self, model, **kwargs): + # ... existing code ... + atexit.register(self.exit) + self._exited = False + + def exit(self): + if self._exited: + return + self._exited = True + self.model_runner.call("exit") + del self.model_runner + for p in self.ps: + p.join() + + def __del__(self): + try: + self.exit() + except Exception: + pass + + def __enter__(self): + return self + + def __exit__(self, *args): + self.exit() + return False + + +# nanovllm/engine/model_runner.py + +class ModelRunner: + def __init__(self, config: Config, rank: int, event): + # ... existing code before init_process_group ... 
+ + import os + port = os.environ.get("NANOVLLM_DIST_PORT", "2333") + + # Defensive cleanup + if dist.is_initialized(): + dist.destroy_process_group() + + dist.init_process_group("nccl", f"tcp://localhost:{port}", + world_size=self.world_size, rank=rank) + # ... rest of init ... +``` + +## Workaround for Current Code + +Until the fix is implemented, use one of these workarounds: + +### Workaround 1: Manual Cleanup + +```python +import torch.distributed as dist + +llm = LLM(model_path) +outputs = llm.generate(...) +llm.model_runner.call("exit") # Manual cleanup +del llm + +# Now can create new LLM +llm2 = LLM(model_path) +``` + +### Workaround 2: Subprocess Testing + +```bash +# Run each test group as separate process +for i in $(seq 0 5 95); do + python test_ruler_niah.py --sample-indices $i-$((i+4)) --enable-offload +done +``` + +### Workaround 3: Environment Variable Port + +```bash +# Use different port for each run +NANOVLLM_DIST_PORT=2334 python test.py +NANOVLLM_DIST_PORT=2335 python test.py +``` + +## Related Files + +| File | Relevant Code | +|------|---------------| +| `nanovllm/engine/model_runner.py:30-32` | `init_process_group()` call | +| `nanovllm/engine/model_runner.py:66-78` | `exit()` and `destroy_process_group()` | +| `nanovllm/engine/llm_engine.py:37` | `atexit.register()` | +| `nanovllm/engine/llm_engine.py:39-43` | `exit()` method | + +## Testing the Fix + +After implementing the fix, verify with: + +```python +# test_multiple_llm.py +from nanovllm import LLM, SamplingParams + +for i in range(3): + print(f"Creating LLM instance {i+1}") + llm = LLM("path/to/model", enable_cpu_offload=True) + outputs = llm.generate(["Hello"], SamplingParams(max_tokens=10)) + print(f"Instance {i+1} output: {outputs[0]['text']}") + del llm + print(f"Instance {i+1} deleted\n") + +print("All instances created and deleted successfully!") +``` + +Expected: No port conflict errors, all 3 instances work. 
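The `_exited` guard is what makes the three cleanup paths safe to combine. As a minimal sketch, assuming nothing about nanovllm itself (the `Engine` class below is a hypothetical stand-in for `LLMEngine`, with the port binding replaced by a `cleanup_log` list):

```python
cleanup_log = []

class Engine:
    """Toy stand-in for LLMEngine: exit() must run exactly once per instance."""

    def __init__(self, name):
        self.name = name
        self._exited = False

    def exit(self):
        if self._exited:        # idempotent guard, as in the recommended fix
            return
        self._exited = True
        cleanup_log.append(self.name)   # real engine: destroy_process_group()

    def __del__(self):          # Solution A: best-effort cleanup
        try:
            self.exit()
        except Exception:
            pass

    def __enter__(self):        # Solution B: explicit, guaranteed cleanup
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.exit()
        return False

e = Engine("first")
e.exit()        # explicit cleanup
del e           # __del__ also calls exit(), but the guard makes it a no-op

with Engine("second"):
    pass        # __exit__ releases the resource even if the body raised

print(cleanup_log)  # ['first', 'second']
```

In the real engine, `atexit.register(self.exit)` additionally holds a reference to the object until interpreter exit; the guard keeps that late call harmless.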
+ +## Priority + +**High** - This blocks grouped testing and any multi-LLM-instance workflows. diff --git a/tests/test_ruler_niah.py b/tests/test_ruler_niah.py index d39b747..3c727bc 100644 --- a/tests/test_ruler_niah.py +++ b/tests/test_ruler_niah.py @@ -14,6 +14,9 @@ Usage: # Test with custom model python tests/test_ruler_niah.py --model /path/to/model --enable-offload + + # Group mode: test in batches with separate LLM initialization per group + python tests/test_ruler_niah.py --enable-offload --group-size 5 """ import os @@ -216,6 +219,143 @@ def run_ruler_niah_test( return correct, total +# ============================================================ +# Grouped Test Function +# ============================================================ + +def run_grouped_test( + model_path: str, + data_file: Path, + group_size: int = 5, + total_samples: Optional[int] = None, + max_model_len: int = DEFAULT_MAX_MODEL_LEN, + max_new_tokens: int = DEFAULT_MAX_NEW_TOKENS, + enable_cpu_offload: bool = False, + num_gpu_blocks: int = 4, + block_size: int = 1024, + gpu_utilization: float = 0.9, + enforce_eager: bool = True, +) -> Tuple[int, int, List[dict]]: + """ + Run RULER NIAH test in groups, with separate LLM initialization per group. 
+ + This mode is useful for: + - Avoiding state accumulation issues + - Testing LLM initialization stability + - Running large-scale tests with memory cleanup between groups + + Args: + model_path: Path to the model + data_file: Path to JSONL data file + group_size: Number of samples per group + total_samples: Total samples to test (None = all in file) + Other args: Same as run_ruler_niah_test + + Returns: + (total_correct, total_tested, group_results): Results summary + """ + import time + import gc + import torch + + # Count total samples in file + file_sample_count = count_samples(data_file) + if total_samples is None: + total_samples = file_sample_count + else: + total_samples = min(total_samples, file_sample_count) + + num_groups = (total_samples + group_size - 1) // group_size + + print(f"\n{'='*60}") + print(f"RULER NIAH Grouped Test") + print(f"{'='*60}") + print(f"Model: {model_path}") + print(f"Data file: {data_file}") + print(f"Total samples: {total_samples}") + print(f"Group size: {group_size}") + print(f"Number of groups: {num_groups}") + print(f"CPU offload: {enable_cpu_offload}") + print(f"{'='*60}\n") + + total_correct = 0 + total_tested = 0 + group_results = [] + all_failed = [] + + test_start_time = time.time() + + for group_idx in range(num_groups): + start_idx = group_idx * group_size + end_idx = min(start_idx + group_size, total_samples) + sample_indices = list(range(start_idx, end_idx)) + + print(f"\n{'='*60}") + print(f"Group {group_idx + 1}/{num_groups}: Samples {start_idx}-{end_idx - 1}") + print(f"{'='*60}") + + group_start_time = time.time() + + # Run test for this group + correct, tested = run_ruler_niah_test( + model_path=model_path, + data_file=data_file, + sample_indices=sample_indices, + max_model_len=max_model_len, + max_new_tokens=max_new_tokens, + enable_cpu_offload=enable_cpu_offload, + num_gpu_blocks=num_gpu_blocks, + block_size=block_size, + gpu_utilization=gpu_utilization, + enforce_eager=enforce_eager, + verbose=True, + ) + + 
group_time = time.time() - group_start_time + + total_correct += correct + total_tested += tested + + group_result = { + "group": group_idx + 1, + "samples": f"{start_idx}-{end_idx - 1}", + "correct": correct, + "total": tested, + "accuracy": 100 * correct / tested if tested > 0 else 0, + "time": group_time, + } + group_results.append(group_result) + + print(f"\nGroup {group_idx + 1} Summary: {correct}/{tested} PASSED ({group_result['accuracy']:.1f}%) in {group_time:.1f}s") + + # Force cleanup between groups + gc.collect() + torch.cuda.empty_cache() + + # Small delay to ensure port is released + if group_idx < num_groups - 1: + time.sleep(3) + + total_time = time.time() - test_start_time + + # Final summary + print(f"\n{'='*60}") + print(f"FINAL SUMMARY") + print(f"{'='*60}") + print(f"\nGroup Results:") + print(f"{'Group':<8} {'Samples':<12} {'Result':<12} {'Accuracy':<10} {'Time':<10}") + print(f"{'-'*52}") + for r in group_results: + print(f"{r['group']:<8} {r['samples']:<12} {r['correct']}/{r['total']:<9} {r['accuracy']:.1f}%{'':<5} {r['time']:.1f}s") + + print(f"{'-'*52}") + overall_accuracy = 100 * total_correct / total_tested if total_tested > 0 else 0 + print(f"{'TOTAL':<8} {'0-' + str(total_tested-1):<12} {total_correct}/{total_tested:<9} {overall_accuracy:.1f}%{'':<5} {total_time:.1f}s") + print(f"{'='*60}\n") + + return total_correct, total_tested, group_results + + # ============================================================ # CLI Entry Point # ============================================================ @@ -326,6 +466,18 @@ Examples: action="store_true", help="Quiet mode, only print final result" ) + parser.add_argument( + "--group-size", + type=int, + default=0, + help="Enable grouped testing mode with specified group size. Each group initializes LLM separately. 
(default: 0 = disabled)" + ) + parser.add_argument( + "--total-samples", + type=int, + default=0, + help="Total number of samples to test in group mode (default: 0 = all samples in file)" + ) args = parser.parse_args() @@ -334,20 +486,38 @@ Examples: enforce_eager = not args.use_cuda_graph verbose = not args.quiet - # Run test - correct, total = run_ruler_niah_test( - model_path=os.path.expanduser(args.model), - data_file=Path(args.data_file), - sample_indices=sample_indices, - max_model_len=args.max_model_len, - max_new_tokens=args.max_new_tokens, - enable_cpu_offload=args.enable_offload, - num_gpu_blocks=args.num_gpu_blocks, - block_size=args.block_size, - gpu_utilization=args.gpu_utilization, - enforce_eager=enforce_eager, - verbose=verbose, - ) + # Check if group mode is enabled + if args.group_size > 0: + # Grouped testing mode + total_samples = args.total_samples if args.total_samples > 0 else None + correct, total, _ = run_grouped_test( + model_path=os.path.expanduser(args.model), + data_file=Path(args.data_file), + group_size=args.group_size, + total_samples=total_samples, + max_model_len=args.max_model_len, + max_new_tokens=args.max_new_tokens, + enable_cpu_offload=args.enable_offload, + num_gpu_blocks=args.num_gpu_blocks, + block_size=args.block_size, + gpu_utilization=args.gpu_utilization, + enforce_eager=enforce_eager, + ) + else: + # Standard testing mode + correct, total = run_ruler_niah_test( + model_path=os.path.expanduser(args.model), + data_file=Path(args.data_file), + sample_indices=sample_indices, + max_model_len=args.max_model_len, + max_new_tokens=args.max_new_tokens, + enable_cpu_offload=args.enable_offload, + num_gpu_blocks=args.num_gpu_blocks, + block_size=args.block_size, + gpu_utilization=args.gpu_utilization, + enforce_eager=enforce_eager, + verbose=verbose, + ) # Final status if correct == total:
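Viewed on its own, the group partitioning in `run_grouped_test` is plain ceiling division over sample indices. A small sketch (hypothetical `group_ranges` helper, mirroring the `start_idx`/`end_idx` arithmetic above):

```python
def group_ranges(total_samples, group_size):
    """Yield (start, end) inclusive sample-index ranges, one per group."""
    num_groups = (total_samples + group_size - 1) // group_size  # ceil division
    for g in range(num_groups):
        start = g * group_size
        end = min(start + group_size, total_samples) - 1
        yield start, end

# 12 samples with --group-size 5 -> three groups, the last one partial
print(list(group_ranges(12, 5)))  # [(0, 4), (5, 9), (10, 11)]
```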