[docs] Added dist port issue.

This commit is contained in:
Zijie Tian
2026-01-12 15:16:39 +08:00
parent 8e0888c20c
commit de6f36bdb2
3 changed files with 569 additions and 372 deletions

CLAUDE.md

@@ -1,389 +1,108 @@
# Claude Code Configuration - SPARC Development Environment
# CLAUDE.md
## 🚨 CRITICAL: CONCURRENT EXECUTION & FILE MANAGEMENT
This file provides guidance to Claude Code when working with this repository.
**ABSOLUTE RULES**:
1. ALL operations MUST be concurrent/parallel in a single message
2. **NEVER save working files, text/mds and tests to the root folder**
3. ALWAYS organize files in appropriate subdirectories
4. **USE CLAUDE CODE'S TASK TOOL** for spawning agents concurrently, not just MCP
## Overview
### ⚡ GOLDEN RULE: "1 MESSAGE = ALL RELATED OPERATIONS"
Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline LLM inference. Supports multiple model architectures (Qwen3, Qwen2, Llama) with CPU offload for long-context inference.
**MANDATORY PATTERNS:**
- **TodoWrite**: ALWAYS batch ALL todos in ONE call (5-10+ todos minimum)
- **Task tool (Claude Code)**: ALWAYS spawn ALL agents in ONE message with full instructions
- **File operations**: ALWAYS batch ALL reads/writes/edits in ONE message
- **Bash commands**: ALWAYS batch ALL terminal operations in ONE message
- **Memory operations**: ALWAYS batch ALL memory store/retrieve in ONE message
## GPU Mutex for Multi-Instance Debugging
### 🎯 CRITICAL: Claude Code Task Tool for Agent Execution
**IMPORTANT**: When running multiple Claude instances for parallel debugging, different rules apply based on script type:
**Claude Code's Task tool is the PRIMARY way to spawn agents:**
```javascript
// ✅ CORRECT: Use Claude Code's Task tool for parallel agent execution
[Single Message]:
Task("Research agent", "Analyze requirements and patterns...", "researcher")
Task("Coder agent", "Implement core features...", "coder")
Task("Tester agent", "Create comprehensive tests...", "tester")
Task("Reviewer agent", "Review code quality...", "reviewer")
Task("Architect agent", "Design system architecture...", "system-architect")
```
### Benchmarks (`bench*.py`) - Exclusive GPU Access Required
**MCP tools are ONLY for coordination setup:**
- `mcp__claude-flow__swarm_init` - Initialize coordination topology
- `mcp__claude-flow__agent_spawn` - Define agent types for coordination
- `mcp__claude-flow__task_orchestrate` - Orchestrate high-level workflows
### 📁 File Organization Rules
**NEVER save to root folder. Use these directories:**
- `/src` - Source code files
- `/tests` - Test files
- `/docs` - Documentation and markdown files
- `/config` - Configuration files
- `/scripts` - Utility scripts
- `/examples` - Example code
## Project Overview
This project uses SPARC (Specification, Pseudocode, Architecture, Refinement, Completion) methodology with Claude-Flow orchestration for systematic Test-Driven Development.
## SPARC Commands
### Core Commands
- `npx claude-flow sparc modes` - List available modes
- `npx claude-flow sparc run <mode> "<task>"` - Execute specific mode
- `npx claude-flow sparc tdd "<feature>"` - Run complete TDD workflow
- `npx claude-flow sparc info <mode>` - Get mode details
### Batchtools Commands
- `npx claude-flow sparc batch <modes> "<task>"` - Parallel execution
- `npx claude-flow sparc pipeline "<task>"` - Full pipeline processing
- `npx claude-flow sparc concurrent <mode> "<tasks-file>"` - Multi-task processing
### Build Commands
- `npm run build` - Build project
- `npm run test` - Run tests
- `npm run lint` - Linting
- `npm run typecheck` - Type checking
## SPARC Workflow Phases
1. **Specification** - Requirements analysis (`sparc run spec-pseudocode`)
2. **Pseudocode** - Algorithm design (`sparc run spec-pseudocode`)
3. **Architecture** - System design (`sparc run architect`)
4. **Refinement** - TDD implementation (`sparc tdd`)
5. **Completion** - Integration (`sparc run integration`)
## Code Style & Best Practices
- **Modular Design**: Files under 500 lines
- **Environment Safety**: Never hardcode secrets
- **Test-First**: Write tests before implementation
- **Clean Architecture**: Separate concerns
- **Documentation**: Keep updated
## 🚀 Available Agents (54 Total)
### Core Development
`coder`, `reviewer`, `tester`, `planner`, `researcher`
### Swarm Coordination
`hierarchical-coordinator`, `mesh-coordinator`, `adaptive-coordinator`, `collective-intelligence-coordinator`, `swarm-memory-manager`
### Consensus & Distributed
`byzantine-coordinator`, `raft-manager`, `gossip-coordinator`, `consensus-builder`, `crdt-synchronizer`, `quorum-manager`, `security-manager`
### Performance & Optimization
`perf-analyzer`, `performance-benchmarker`, `task-orchestrator`, `memory-coordinator`, `smart-agent`
### GitHub & Repository
`github-modes`, `pr-manager`, `code-review-swarm`, `issue-tracker`, `release-manager`, `workflow-automation`, `project-board-sync`, `repo-architect`, `multi-repo-swarm`
### SPARC Methodology
`sparc-coord`, `sparc-coder`, `specification`, `pseudocode`, `architecture`, `refinement`
### Specialized Development
`backend-dev`, `mobile-dev`, `ml-developer`, `cicd-engineer`, `api-docs`, `system-architect`, `code-analyzer`, `base-template-generator`
### Testing & Validation
`tdd-london-swarm`, `production-validator`
### Migration & Planning
`migration-planner`, `swarm-init`
## 🎯 Claude Code vs MCP Tools
### Claude Code Handles ALL EXECUTION:
- **Task tool**: Spawn and run agents concurrently for actual work
- File operations (Read, Write, Edit, MultiEdit, Glob, Grep)
- Code generation and programming
- Bash commands and system operations
- Implementation work
- Project navigation and analysis
- TodoWrite and task management
- Git operations
- Package management
- Testing and debugging
### MCP Tools ONLY COORDINATE:
- Swarm initialization (topology setup)
- Agent type definitions (coordination patterns)
- Task orchestration (high-level planning)
- Memory management
- Neural features
- Performance tracking
- GitHub integration
**KEY**: MCP coordinates the strategy, Claude Code's Task tool executes with real agents.
## 🚀 Quick Setup
Before running any `bench*.py` script, Claude MUST wait for exclusive GPU access:
```bash
# Add MCP servers (Claude Flow required, others optional)
claude mcp add claude-flow npx claude-flow@alpha mcp start
claude mcp add ruv-swarm npx ruv-swarm mcp start # Optional: Enhanced coordination
claude mcp add flow-nexus npx flow-nexus@latest mcp start # Optional: Cloud features
# Check and wait for GPU to be free
while [ -n "$(nvidia-smi --query-compute-apps=pid --format=csv,noheader)" ]; do
    echo "GPU busy, waiting 10s..."
    sleep 10
done
```
## MCP Tool Categories
### Other Scripts (tests, examples) - Port Conflict Check Only
### Coordination
`swarm_init`, `agent_spawn`, `task_orchestrate`
For non-benchmark scripts, exclusive GPU access is NOT required. However, check for **distributed port conflicts** before running:
### Monitoring
`swarm_status`, `agent_list`, `agent_metrics`, `task_status`, `task_results`
### Memory & Neural
`memory_usage`, `neural_status`, `neural_train`, `neural_patterns`
### GitHub Integration
`github_swarm`, `repo_analyze`, `pr_enhance`, `issue_triage`, `code_review`
### System
`benchmark_run`, `features_detect`, `swarm_monitor`
### Flow-Nexus MCP Tools (Optional Advanced Features)
Flow-Nexus extends MCP capabilities with 70+ cloud-based orchestration tools:
**Key MCP Tool Categories:**
- **Swarm & Agents**: `swarm_init`, `swarm_scale`, `agent_spawn`, `task_orchestrate`
- **Sandboxes**: `sandbox_create`, `sandbox_execute`, `sandbox_upload` (cloud execution)
- **Templates**: `template_list`, `template_deploy` (pre-built project templates)
- **Neural AI**: `neural_train`, `neural_patterns`, `seraphina_chat` (AI assistant)
- **GitHub**: `github_repo_analyze`, `github_pr_manage` (repository management)
- **Real-time**: `execution_stream_subscribe`, `realtime_subscribe` (live monitoring)
- **Storage**: `storage_upload`, `storage_list` (cloud file management)
**Authentication Required:**
- Register: `mcp__flow-nexus__user_register` or `npx flow-nexus@latest register`
- Login: `mcp__flow-nexus__user_login` or `npx flow-nexus@latest login`
- Access 70+ specialized MCP tools for advanced orchestration
## 🚀 Agent Execution Flow with Claude Code
### The Correct Pattern:
1. **Optional**: Use MCP tools to set up coordination topology
2. **REQUIRED**: Use Claude Code's Task tool to spawn agents that do actual work
3. **REQUIRED**: Each agent runs hooks for coordination
4. **REQUIRED**: Batch all operations in single messages
### Example Full-Stack Development:
```javascript
// Single message with all agent spawning via Claude Code's Task tool
[Parallel Agent Execution]:
Task("Backend Developer", "Build REST API with Express. Use hooks for coordination.", "backend-dev")
Task("Frontend Developer", "Create React UI. Coordinate with backend via memory.", "coder")
Task("Database Architect", "Design PostgreSQL schema. Store schema in memory.", "code-analyzer")
Task("Test Engineer", "Write Jest tests. Check memory for API contracts.", "tester")
Task("DevOps Engineer", "Setup Docker and CI/CD. Document in memory.", "cicd-engineer")
Task("Security Auditor", "Review authentication. Report findings via hooks.", "reviewer")
// All todos batched together
TodoWrite { todos: [...8-10 todos...] }
// All file operations together
Write "backend/server.js"
Write "frontend/App.jsx"
Write "database/schema.sql"
```
## 📋 Agent Coordination Protocol
### Every Agent Spawned via Task Tool MUST:
**1️⃣ BEFORE Work:**
```bash
npx claude-flow@alpha hooks pre-task --description "[task]"
npx claude-flow@alpha hooks session-restore --session-id "swarm-[id]"
# Check if port 2333 (nanovllm default) is in use
if lsof -i :2333 >/dev/null 2>&1; then
    echo "Port 2333 in use, waiting 10s..."
    sleep 10
fi
```
**2️⃣ DURING Work:**
**Note**: nanovllm uses port 2333 for `torch.distributed`. See [`docs/torch_distributed_port_issue.md`](docs/torch_distributed_port_issue.md) for known issues with creating multiple LLM instances in the same process.
## Multi-Instance Development with PYTHONPATH
**IMPORTANT**: When running multiple Claude instances on different worktrees, do NOT use `pip install -e .` globally as it will affect other instances.
**Use PYTHONPATH directly** - no pip install needed:
```bash
npx claude-flow@alpha hooks post-edit --file "[file]" --memory-key "swarm/[agent]/[step]"
npx claude-flow@alpha hooks notify --message "[what was done]"
# Set PYTHONPATH to point to the project root directory
PYTHONPATH=/path/to/your/worktree:$PYTHONPATH python <script.py>
# Example: running tests
PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH python tests/test_needle.py
```
**3️⃣ AFTER Work:**
```bash
npx claude-flow@alpha hooks post-task --task-id "[task]"
npx claude-flow@alpha hooks session-end --export-metrics true
```
**Benefits**:
- No `pip install` required
- Code changes take effect immediately (no reinstall needed)
- Each worktree is completely isolated
## 🎯 Concurrent Execution Examples
## Documentation Index
### ✅ CORRECT WORKFLOW: MCP Coordinates, Claude Code Executes
| Document | Purpose |
|----------|---------|
| [`docs/architecture_guide.md`](docs/architecture_guide.md) | Core components, layer-wise CPU offload design, prefill/decode flows, implementation details |
| [`docs/multi_model_support.md`](docs/multi_model_support.md) | Model registry system, adding new models (Qwen3/Llama), architecture differences, RoPE scaling |
| [`docs/cuda_graph_offload_guide.md`](docs/cuda_graph_offload_guide.md) | CUDA graph support for CPU offload decode path, 4x decode speedup |
| [`docs/sparse_attention_guide.md`](docs/sparse_attention_guide.md) | Block sparse attention methods (MInference, FlexPrefill, XAttention, Quest), computation flow |
| [`docs/sparse_prefill_integration_plan.md`](docs/sparse_prefill_integration_plan.md) | Integration plan for MInference/XAttention/FlexPrefill with unified BlockMask interface |
| [`docs/sparse_offload_integration.md`](docs/sparse_offload_integration.md) | Sparse policy integration with layerwise offload, `requires_block_selection` interface design |
| [`docs/layerwise_offload_memory_analysis.md`](docs/layerwise_offload_memory_analysis.md) | Memory allocation analysis with theoretical formulas and empirical validation (< 5% error) |
| [`docs/debugging_guide.md`](docs/debugging_guide.md) | PyTorch hooks for debugging, tensor comparison, memory profiling |
| [`docs/gpu_only_performance_issue.md`](docs/gpu_only_performance_issue.md) | GPU-only mode slower than offload due to PagedAttention scatter overhead, optimization proposals |
| [`docs/torch_distributed_port_issue.md`](docs/torch_distributed_port_issue.md) | **BUG**: Port conflict when creating multiple LLM instances, root cause and proposed solutions |
| [`docs/offload_accuracy_issue.md`](docs/offload_accuracy_issue.md) | **BUG**: CPU offload mode 66% accuracy vs 100% non-offload on RULER NIAH benchmark |
```javascript
// Step 1: MCP tools set up coordination (optional, for complex tasks)
[Single Message - Coordination Setup]:
mcp__claude-flow__swarm_init { topology: "mesh", maxAgents: 6 }
mcp__claude-flow__agent_spawn { type: "researcher" }
mcp__claude-flow__agent_spawn { type: "coder" }
mcp__claude-flow__agent_spawn { type: "tester" }
// Step 2: Claude Code Task tool spawns ACTUAL agents that do the work
[Single Message - Parallel Agent Execution]:
// Claude Code's Task tool spawns real agents concurrently
Task("Research agent", "Analyze API requirements and best practices. Check memory for prior decisions.", "researcher")
Task("Coder agent", "Implement REST endpoints with authentication. Coordinate via hooks.", "coder")
Task("Database agent", "Design and implement database schema. Store decisions in memory.", "code-analyzer")
Task("Tester agent", "Create comprehensive test suite with 90% coverage.", "tester")
Task("Reviewer agent", "Review code quality and security. Document findings.", "reviewer")
// Batch ALL todos in ONE call
TodoWrite { todos: [
{id: "1", content: "Research API patterns", status: "in_progress", priority: "high"},
{id: "2", content: "Design database schema", status: "in_progress", priority: "high"},
{id: "3", content: "Implement authentication", status: "pending", priority: "high"},
{id: "4", content: "Build REST endpoints", status: "pending", priority: "high"},
{id: "5", content: "Write unit tests", status: "pending", priority: "medium"},
{id: "6", content: "Integration tests", status: "pending", priority: "medium"},
{id: "7", content: "API documentation", status: "pending", priority: "low"},
{id: "8", content: "Performance optimization", status: "pending", priority: "low"}
]}
// Parallel file operations
Bash "mkdir -p app/{src,tests,docs,config}"
Write "app/package.json"
Write "app/src/server.js"
Write "app/tests/server.test.js"
Write "app/docs/API.md"
```

## Configuration
| Parameter | Default | Notes |
|-----------|---------|-------|
| `kvcache_block_size` | 4096 | Tokens per block |
| `max_num_batched_tokens` | 16384 | Set = max_model_len for long context |
| `gpu_memory_utilization` | 0.9 | GPU memory fraction |
| `enable_cpu_offload` | False | Enable for long context |
| `num_gpu_blocks` | 2 | GPU blocks for offload mode |
| `num_kv_buffers` | 4 | Ring buffer size for decode pipeline |
| `enforce_eager` | False | Set True to disable CUDA graphs |
### ❌ WRONG (Multiple Messages):
```javascript
Message 1: mcp__claude-flow__swarm_init
Message 2: Task("agent 1")
Message 3: TodoWrite { todos: [single todo] }
Message 4: Write "file.js"
// This breaks parallel coordination!
```
## Benchmarking
## Performance Benefits
**Files**: `bench.py` (GPU), `bench_offload.py` (CPU offload), `bench_vllm.py` (comparison)
- **84.8% SWE-Bench solve rate**
- **32.3% token reduction**
- **2.8-4.4x speed improvement**
- **27+ neural models**
**Common Issues**:
1. `max_num_batched_tokens < max_model_len`: Set equal for long context
2. CUDA graph dimension mismatch: Ensure `input_len + output_len <= max_model_len`
3. RoPE out of bounds: Check model's `max_position_embeddings` in config.json
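The first two rules can be sanity-checked before launch. A hedged sketch: the key names mirror the configuration table, and passing them to `LLM(...)` is an assumption, not something this file shows.

```python
# Long-context configuration sketch; key names follow the configuration table.
max_model_len = 131072  # e.g. the Llama-3.1-8B-Instruct limit

cfg = dict(
    max_model_len=max_model_len,
    max_num_batched_tokens=max_model_len,  # rule 1: set equal for long context
    enable_cpu_offload=True,
    enforce_eager=False,                   # keep CUDA graphs for faster decode
)

# Rule 2: requested lengths must fit inside max_model_len.
input_len, output_len = 130000, 1000
assert input_len + output_len <= cfg["max_model_len"]
assert cfg["max_num_batched_tokens"] >= cfg["max_model_len"]
print("config ok")
```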
## Hooks Integration
**Model Limits**:
- Qwen3-0.6B/4B: 40960 tokens
- Qwen2.5-7B-Instruct-1M: 1048576 tokens
- Llama-3.1-8B-Instruct: 131072 tokens
### Pre-Operation
- Auto-assign agents by file type
- Validate commands for safety
- Prepare resources automatically
- Optimize topology by complexity
- Cache searches
### Post-Operation
- Auto-format code
- Train neural patterns
- Update memory
- Analyze performance
- Track token usage
### Session Management
- Generate summaries
- Persist state
- Track metrics
- Restore context
- Export workflows
## Advanced Features (v2.0.0)
- 🚀 Automatic Topology Selection
- ⚡ Parallel Execution (2.8-4.4x speed)
- 🧠 Neural Training
- 📊 Bottleneck Analysis
- 🤖 Smart Auto-Spawning
- 🛡️ Self-Healing Workflows
- 💾 Cross-Session Memory
- 🔗 GitHub Integration
## Integration Tips
1. Start with basic swarm init
2. Scale agents gradually
3. Use memory for context
4. Monitor progress regularly
5. Train patterns from success
6. Enable hooks automation
7. Use GitHub tools first
## Support
- Documentation: https://github.com/ruvnet/claude-flow
- Issues: https://github.com/ruvnet/claude-flow/issues
- Flow-Nexus Platform: https://flow-nexus.ruv.io (registration required for cloud features)
**Performance (Qwen3-4B, CPU Offload)**:
- Prefill: ~5700-8000 tok/s (varies by context length)
- Decode with CUDA Graph: ~50 tok/s (TPOT ~19ms)
- Decode Eager Mode: ~12 tok/s (TPOT ~80ms)
- **CUDA Graph speedup: 4x decode throughput**
---
Remember: **Claude Flow coordinates, Claude Code creates!**
# Nano-vLLM Testing
## RULER NIAH Benchmark Test
Tests long context retrieval capability using RULER benchmark's NIAH (Needle-In-A-Haystack) task data (~32K tokens).
**Documentation**:
- [`docs/ruler_niah_standalone_test.md`](docs/ruler_niah_standalone_test.md) - Test setup and usage
- [`docs/offload_accuracy_issue.md`](docs/offload_accuracy_issue.md) - **[BUG]** Offload mode accuracy issue (66% vs 100%)
### Quick Start
```bash
# Single sample test (recommended for initial verification)
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --enable-offload

# All 5 samples
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --enable-offload \
    --sample-indices 0-4
```
### Options
| Option | Default | Description |
|--------|---------|-------------|
| `--model` | `~/models/Llama-3.1-8B-Instruct` | Model path |
| `--enable-offload` | False | Enable CPU offload (required for 24GB GPUs) |
| `--sample-indices` | all | Samples to test (e.g., `0,1,2` or `0-4`) |
| `--max-model-len` | 32768 | Maximum context length |
| `--use-cuda-graph` | False | Enable CUDA graph (faster decode) |
---
# important-instruction-reminders
Do what has been asked; nothing more, nothing less.
NEVER create files unless they're absolutely necessary for achieving your goal.
ALWAYS prefer editing an existing file to creating a new one.
NEVER proactively create documentation files (*.md) or README files. Only create documentation files if explicitly requested by the User.
Never save working files, text/mds and tests to the root folder.
**Author**: Zijie Tian


@@ -0,0 +1,308 @@
# Torch Distributed Port Conflict Issue
## Problem Summary
When attempting to create multiple `LLM` instances sequentially in the same Python process (e.g., for grouped testing), the second and subsequent instances fail with:
```
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address.
port: 2333, useIpv6: false, code: -98, name: EADDRINUSE, message: address already in use
```
## Root Cause Analysis
### 1. Distributed Process Group Initialization
In `nanovllm/engine/model_runner.py:30-32`:
```python
import os
port = os.environ.get("NANOVLLM_DIST_PORT", "2333")
dist.init_process_group("nccl", f"tcp://localhost:{port}", world_size=self.world_size, rank=rank)
```
- Default port is **2333** (configurable via `NANOVLLM_DIST_PORT` env var)
- `init_process_group()` binds a TCP socket to this port
- This binding persists until `destroy_process_group()` is called
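The EADDRINUSE behavior can be reproduced with nothing but the standard library; the sketch below uses plain `socket` objects as a stand-in for the TCP store that `init_process_group()` opens.

```python
import errno
import socket

# Stand-in for the TCP store socket that init_process_group() opens.
s1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s1.bind(("127.0.0.1", 0))      # port 0: let the OS choose a free port
port = s1.getsockname()[1]
s1.listen()

# A second bind to the same port fails while the first socket is open,
# exactly like the second LLM instance in the same process.
s2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    s2.bind(("127.0.0.1", port))
    err = None
except OSError as e:
    err = e.errno
finally:
    s2.close()

s1.close()  # once released (the destroy_process_group analogue), rebinding works
s3 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s3.bind(("127.0.0.1", port))
s3.close()

print(err == errno.EADDRINUSE)  # True
```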
### 2. Cleanup Mechanism
In `nanovllm/engine/llm_engine.py:37`:
```python
atexit.register(self.exit)
```
In `nanovllm/engine/llm_engine.py:39-43`:
```python
def exit(self):
    self.model_runner.call("exit")
    del self.model_runner
    for p in self.ps:
        p.join()
```
In `nanovllm/engine/model_runner.py:66-78`:
```python
def exit(self):
    # ... cleanup code ...
    dist.destroy_process_group()
```
### 3. The Problem
**`atexit` only triggers when the Python interpreter exits, NOT when the object is deleted or goes out of scope.**
Timeline of the bug:
```
1. Create LLM instance #1
   ├── init_process_group() binds port 2333 ✓
   └── atexit.register(self.exit) registered

2. LLM #1 goes out of scope (garbage collected)
   ├── Python's GC deletes the object
   ├── BUT atexit handler NOT triggered yet
   └── Port 2333 still bound! ❌

3. Create LLM instance #2
   ├── init_process_group() tries to bind port 2333
   └── EADDRINUSE error! ❌

4. Program exits (only now atexit runs)
   └── Too late - already crashed
```
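A ten-line experiment confirms the timeline (toy code, not nanovllm): an `atexit` handler registered in `__init__` does not run on `del`, and the registered bound method even keeps the instance alive.

```python
import atexit

events = []

class Resource:
    def __init__(self):
        # Same pattern as LLMEngine.__init__: register cleanup up front.
        atexit.register(self.cleanup)

    def cleanup(self):
        events.append("cleaned")

r = Resource()
del r           # the name is gone...
print(events)   # [] -- cleanup has NOT run; atexit fires only at interpreter exit
```

Note that `atexit.register(self.cleanup)` stores a bound method, which itself holds a reference to the instance, so `del` never drops the last reference either.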
## Impact
This issue affects:
1. **Grouped testing mode** (`test_ruler_niah.py --group-size N`)
- Each group needs a fresh LLM instance
- Second group fails with port conflict
2. **Multiple LLM instances in same process**
- Any code that creates LLM, deletes it, then creates another
3. **Interactive/notebook usage**
- Re-running cells that create LLM instances
## Proposed Solutions
### Solution A: Add `__del__` Method (Quick Fix)
Add destructor to `LLMEngine` that calls cleanup:
```python
# In nanovllm/engine/llm_engine.py
def __del__(self):
    try:
        self.exit()
    except Exception:
        pass  # Ignore errors during cleanup
```
**Pros**: Simple, backwards compatible
**Cons**: `__del__` is not guaranteed to run promptly (reference cycles, interpreter-shutdown ordering); worse, `atexit.register(self.exit)` stores a bound method that keeps the instance alive, so `del llm` alone never drops the last reference unless the handler is also unregistered
### Solution B: Context Manager Pattern (Recommended)
Make `LLMEngine` a context manager:
```python
# In nanovllm/engine/llm_engine.py
def __enter__(self):
    return self

def __exit__(self, exc_type, exc_val, exc_tb):
    self.exit()
    return False
```
Usage:
```python
with LLM(model_path) as llm:
    outputs = llm.generate(prompts, params)
# Cleanup happens automatically here
```
**Pros**: Explicit, guaranteed cleanup, Pythonic
**Cons**: Requires usage pattern change
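The "guaranteed cleanup" claim is the whole point of the pattern, so here is a minimal model of it (a hypothetical `Engine` class, not the real `LLMEngine`): `__exit__` runs even when the body raises.

```python
class Engine:
    """Toy stand-in for LLMEngine with the proposed protocol methods."""
    def __init__(self):
        self.closed = False

    def exit(self):
        self.closed = True  # in the real engine: destroy_process_group() etc.

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.exit()
        return False  # propagate exceptions, don't swallow them

eng = Engine()
try:
    with eng:
        raise RuntimeError("generation failed mid-run")
except RuntimeError:
    pass

print(eng.closed)  # True -- cleanup ran despite the exception
```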
### Solution C: Check and Cleanup Before Init (Defensive)
In `ModelRunner.__init__`, check if process group exists:
```python
# In nanovllm/engine/model_runner.py
if dist.is_initialized():
    dist.destroy_process_group()
dist.init_process_group("nccl", f"tcp://localhost:{port}", ...)
```
**Pros**: Self-healing, no usage pattern change
**Cons**: May mask other issues, global state manipulation
### Solution D: Subprocess Isolation (For Testing)
For grouped testing specifically, run each group in a subprocess:
```python
import subprocess
import sys

# groups: list of (start, end) sample-index ranges prepared by the caller
for start, end in groups:
    subprocess.run(
        [sys.executable, "test_ruler_niah.py",
         "--sample-indices", f"{start}-{end}"],
        check=True,
    )
```
**Pros**: Complete isolation, no code changes to nanovllm
**Cons**: More overhead, only solves testing use case
### Solution E: Dynamic Port Allocation
Instead of fixed port 2333, use dynamic port:
```python
import socket

def find_free_port():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(('', 0))
        return s.getsockname()[1]

port = os.environ.get("NANOVLLM_DIST_PORT") or find_free_port()
```
**Pros**: Avoids conflicts entirely
**Cons**: More complex; every rank must still agree on the same port, so in multi-process runs it has to be picked once and passed to all workers
## Recommended Implementation
**Combine Solutions A + B + C** for maximum robustness:
1. Add `__del__` for best-effort cleanup
2. Add context manager for explicit cleanup
3. Add `is_initialized()` check as defensive measure
```python
# nanovllm/engine/llm_engine.py
class LLMEngine:
    def __init__(self, model, **kwargs):
        # ... existing code ...
        atexit.register(self.exit)
        self._exited = False

    def exit(self):
        if self._exited:
            return
        self._exited = True
        self.model_runner.call("exit")
        del self.model_runner
        for p in self.ps:
            p.join()

    def __del__(self):
        try:
            self.exit()
        except Exception:
            pass

    def __enter__(self):
        return self

    def __exit__(self, *args):
        self.exit()
        return False


# nanovllm/engine/model_runner.py
class ModelRunner:
    def __init__(self, config: Config, rank: int, event):
        # ... existing code before init_process_group ...
        import os
        port = os.environ.get("NANOVLLM_DIST_PORT", "2333")
        # Defensive cleanup
        if dist.is_initialized():
            dist.destroy_process_group()
        dist.init_process_group("nccl", f"tcp://localhost:{port}",
                                world_size=self.world_size, rank=rank)
        # ... rest of init ...
```
## Workaround for Current Code
Until the fix is implemented, use one of these workarounds:
### Workaround 1: Manual Cleanup
```python
llm = LLM(model_path)
outputs = llm.generate(...)
llm.model_runner.call("exit")  # manual cleanup: triggers destroy_process_group()
del llm

# Now a new instance can bind port 2333 again
llm2 = LLM(model_path)
```
### Workaround 2: Subprocess Testing
```bash
# Run each test group as separate process
for i in $(seq 0 5 95); do
    python test_ruler_niah.py --sample-indices $i-$((i+4)) --enable-offload
done
```
### Workaround 3: Environment Variable Port
```bash
# Use different port for each run
NANOVLLM_DIST_PORT=2334 python test.py
NANOVLLM_DIST_PORT=2335 python test.py
```
## Related Files
| File | Relevant Code |
|------|---------------|
| `nanovllm/engine/model_runner.py:30-32` | `init_process_group()` call |
| `nanovllm/engine/model_runner.py:66-78` | `exit()` and `destroy_process_group()` |
| `nanovllm/engine/llm_engine.py:37` | `atexit.register()` |
| `nanovllm/engine/llm_engine.py:39-43` | `exit()` method |
## Testing the Fix
After implementing the fix, verify with:
```python
# test_multiple_llm.py
from nanovllm import LLM, SamplingParams

for i in range(3):
    print(f"Creating LLM instance {i+1}")
    llm = LLM("path/to/model", enable_cpu_offload=True)
    outputs = llm.generate(["Hello"], SamplingParams(max_tokens=10))
    print(f"Instance {i+1} output: {outputs[0]['text']}")
    del llm
    print(f"Instance {i+1} deleted\n")

print("All instances created and deleted successfully!")
```
Expected: No port conflict errors, all 3 instances work.
## Priority
**High** - This blocks grouped testing and any multi-LLM-instance workflows.


@@ -14,6 +14,9 @@ Usage:
# Test with custom model
python tests/test_ruler_niah.py --model /path/to/model --enable-offload
# Group mode: test in batches with separate LLM initialization per group
python tests/test_ruler_niah.py --enable-offload --group-size 5
"""
import os
@@ -216,6 +219,143 @@ def run_ruler_niah_test(
return correct, total
# ============================================================
# Grouped Test Function
# ============================================================
def run_grouped_test(
    model_path: str,
    data_file: Path,
    group_size: int = 5,
    total_samples: Optional[int] = None,
    max_model_len: int = DEFAULT_MAX_MODEL_LEN,
    max_new_tokens: int = DEFAULT_MAX_NEW_TOKENS,
    enable_cpu_offload: bool = False,
    num_gpu_blocks: int = 4,
    block_size: int = 1024,
    gpu_utilization: float = 0.9,
    enforce_eager: bool = True,
) -> Tuple[int, int, List[dict]]:
    """
    Run RULER NIAH test in groups, with separate LLM initialization per group.

    This mode is useful for:
    - Avoiding state accumulation issues
    - Testing LLM initialization stability
    - Running large-scale tests with memory cleanup between groups

    Args:
        model_path: Path to the model
        data_file: Path to JSONL data file
        group_size: Number of samples per group
        total_samples: Total samples to test (None = all in file)
        Other args: Same as run_ruler_niah_test

    Returns:
        (total_correct, total_tested, group_results): Results summary
    """
    import time
    import gc
    import torch

    # Count total samples in file
    file_sample_count = count_samples(data_file)
    if total_samples is None:
        total_samples = file_sample_count
    else:
        total_samples = min(total_samples, file_sample_count)

    num_groups = (total_samples + group_size - 1) // group_size

    print(f"\n{'='*60}")
    print(f"RULER NIAH Grouped Test")
    print(f"{'='*60}")
    print(f"Model: {model_path}")
    print(f"Data file: {data_file}")
    print(f"Total samples: {total_samples}")
    print(f"Group size: {group_size}")
    print(f"Number of groups: {num_groups}")
    print(f"CPU offload: {enable_cpu_offload}")
    print(f"{'='*60}\n")

    total_correct = 0
    total_tested = 0
    group_results = []
    all_failed = []
    test_start_time = time.time()

    for group_idx in range(num_groups):
        start_idx = group_idx * group_size
        end_idx = min(start_idx + group_size, total_samples)
        sample_indices = list(range(start_idx, end_idx))

        print(f"\n{'='*60}")
        print(f"Group {group_idx + 1}/{num_groups}: Samples {start_idx}-{end_idx - 1}")
        print(f"{'='*60}")

        group_start_time = time.time()

        # Run test for this group
        correct, tested = run_ruler_niah_test(
            model_path=model_path,
            data_file=data_file,
            sample_indices=sample_indices,
            max_model_len=max_model_len,
            max_new_tokens=max_new_tokens,
            enable_cpu_offload=enable_cpu_offload,
            num_gpu_blocks=num_gpu_blocks,
            block_size=block_size,
            gpu_utilization=gpu_utilization,
            enforce_eager=enforce_eager,
            verbose=True,
        )

        group_time = time.time() - group_start_time
        total_correct += correct
        total_tested += tested

        group_result = {
            "group": group_idx + 1,
            "samples": f"{start_idx}-{end_idx - 1}",
            "correct": correct,
            "total": tested,
            "accuracy": 100 * correct / tested if tested > 0 else 0,
            "time": group_time,
        }
        group_results.append(group_result)

        print(f"\nGroup {group_idx + 1} Summary: {correct}/{tested} PASSED ({group_result['accuracy']:.1f}%) in {group_time:.1f}s")

        # Force cleanup between groups
        gc.collect()
        torch.cuda.empty_cache()

        # Small delay to ensure port is released
        if group_idx < num_groups - 1:
            time.sleep(3)

    total_time = time.time() - test_start_time

    # Final summary
    print(f"\n{'='*60}")
    print(f"FINAL SUMMARY")
    print(f"{'='*60}")
    print(f"\nGroup Results:")
    print(f"{'Group':<8} {'Samples':<12} {'Result':<12} {'Accuracy':<10} {'Time':<10}")
    print(f"{'-'*52}")
    for r in group_results:
        print(f"{r['group']:<8} {r['samples']:<12} {r['correct']}/{r['total']:<9} {r['accuracy']:.1f}%{'':<5} {r['time']:.1f}s")
    print(f"{'-'*52}")

    overall_accuracy = 100 * total_correct / total_tested if total_tested > 0 else 0
    print(f"{'TOTAL':<8} {'0-' + str(total_tested-1):<12} {total_correct}/{total_tested:<9} {overall_accuracy:.1f}%{'':<5} {total_time:.1f}s")
    print(f"{'='*60}\n")

    return total_correct, total_tested, group_results
# ============================================================
# CLI Entry Point
# ============================================================
@@ -326,6 +466,18 @@ Examples:
action="store_true",
help="Quiet mode, only print final result"
)
parser.add_argument(
    "--group-size",
    type=int,
    default=0,
    help="Enable grouped testing mode with specified group size. Each group initializes LLM separately. (default: 0 = disabled)"
)
parser.add_argument(
    "--total-samples",
    type=int,
    default=0,
    help="Total number of samples to test in group mode (default: 0 = all samples in file)"
)
args = parser.parse_args()
@@ -334,20 +486,38 @@ Examples:
enforce_eager = not args.use_cuda_graph
verbose = not args.quiet
# Run test
correct, total = run_ruler_niah_test(
    model_path=os.path.expanduser(args.model),
    data_file=Path(args.data_file),
    sample_indices=sample_indices,
    max_model_len=args.max_model_len,
    max_new_tokens=args.max_new_tokens,
    enable_cpu_offload=args.enable_offload,
    num_gpu_blocks=args.num_gpu_blocks,
    block_size=args.block_size,
    gpu_utilization=args.gpu_utilization,
    enforce_eager=enforce_eager,
    verbose=verbose,
)
# Check if group mode is enabled
if args.group_size > 0:
    # Grouped testing mode
    total_samples = args.total_samples if args.total_samples > 0 else None
    correct, total, _ = run_grouped_test(
        model_path=os.path.expanduser(args.model),
        data_file=Path(args.data_file),
        group_size=args.group_size,
        total_samples=total_samples,
        max_model_len=args.max_model_len,
        max_new_tokens=args.max_new_tokens,
        enable_cpu_offload=args.enable_offload,
        num_gpu_blocks=args.num_gpu_blocks,
        block_size=args.block_size,
        gpu_utilization=args.gpu_utilization,
        enforce_eager=enforce_eager,
    )
else:
    # Standard testing mode
    correct, total = run_ruler_niah_test(
        model_path=os.path.expanduser(args.model),
        data_file=Path(args.data_file),
        sample_indices=sample_indices,
        max_model_len=args.max_model_len,
        max_new_tokens=args.max_new_tokens,
        enable_cpu_offload=args.enable_offload,
        num_gpu_blocks=args.num_gpu_blocks,
        block_size=args.block_size,
        gpu_utilization=args.gpu_utilization,
        enforce_eager=enforce_eager,
        verbose=verbose,
    )
# Final status
if correct == total: