[docs] Added offload_acc issue.
.gitignore (vendored): 27 lines changed
@@ -197,3 +197,30 @@ cython_debug/
results/
outputs/
.local/

# Claude Flow generated files
.claude/settings.local.json
.mcp.json
claude-flow.config.json
.swarm/
.hive-mind/
.claude-flow/
memory/
coordination/
memory/claude-flow-data.json
memory/sessions/*
!memory/sessions/README.md
memory/agents/*
!memory/agents/README.md
coordination/memory_bank/*
coordination/subtasks/*
coordination/orchestration/*
*.db
*.db-journal
*.db-wal
*.sqlite
*.sqlite-journal
*.sqlite-wal
claude-flow
# Removed Windows wrapper files per user request
hive-mind-prompt-*.txt
CLAUDE.md: 431 lines changed
@@ -1,106 +1,389 @@
# CLAUDE.md
# Claude Code Configuration - SPARC Development Environment

This file provides guidance to Claude Code when working with this repository.

## 🚨 CRITICAL: CONCURRENT EXECUTION & FILE MANAGEMENT

## Overview

**ABSOLUTE RULES**:
1. ALL operations MUST be concurrent/parallel in a single message
2. **NEVER save working files, text/mds and tests to the root folder**
3. ALWAYS organize files in appropriate subdirectories
4. **USE CLAUDE CODE'S TASK TOOL** for spawning agents concurrently, not just MCP

Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline LLM inference. It supports multiple model architectures (Qwen3, Qwen2, Llama) with CPU offload for long-context inference.

### ⚡ GOLDEN RULE: "1 MESSAGE = ALL RELATED OPERATIONS"

## GPU Mutex for Multi-Instance Debugging

**MANDATORY PATTERNS:**
- **TodoWrite**: ALWAYS batch ALL todos in ONE call (5-10+ todos minimum)
- **Task tool (Claude Code)**: ALWAYS spawn ALL agents in ONE message with full instructions
- **File operations**: ALWAYS batch ALL reads/writes/edits in ONE message
- **Bash commands**: ALWAYS batch ALL terminal operations in ONE message
- **Memory operations**: ALWAYS batch ALL memory store/retrieve in ONE message

**IMPORTANT**: When running multiple Claude instances for parallel debugging, different rules apply based on script type:

### 🎯 CRITICAL: Claude Code Task Tool for Agent Execution

### Benchmarks (`bench*.py`) - Exclusive GPU Access Required

Before running any `bench*.py` script, Claude MUST wait for exclusive GPU access:

```bash
# Check and wait for the GPU to be free
while [ -n "$(nvidia-smi --query-compute-apps=pid --format=csv,noheader)" ]; do
    echo "GPU busy, waiting 10s..."
    sleep 10
done
```

**Claude Code's Task tool is the PRIMARY way to spawn agents:**
```javascript
// ✅ CORRECT: Use Claude Code's Task tool for parallel agent execution
[Single Message]:
Task("Research agent", "Analyze requirements and patterns...", "researcher")
Task("Coder agent", "Implement core features...", "coder")
Task("Tester agent", "Create comprehensive tests...", "tester")
Task("Reviewer agent", "Review code quality...", "reviewer")
Task("Architect agent", "Design system architecture...", "system-architect")
```
### Other Scripts (tests, examples) - Port Conflict Check Only

**MCP tools are ONLY for coordination setup:**
- `mcp__claude-flow__swarm_init` - Initialize coordination topology
- `mcp__claude-flow__agent_spawn` - Define agent types for coordination
- `mcp__claude-flow__task_orchestrate` - Orchestrate high-level workflows

For non-benchmark scripts, exclusive GPU access is NOT required. However, check for **distributed port conflicts** before running:

### 📁 File Organization Rules

**NEVER save to root folder. Use these directories:**
- `/src` - Source code files
- `/tests` - Test files
- `/docs` - Documentation and markdown files
- `/config` - Configuration files
- `/scripts` - Utility scripts
- `/examples` - Example code

## Project Overview

This project uses SPARC (Specification, Pseudocode, Architecture, Refinement, Completion) methodology with Claude-Flow orchestration for systematic Test-Driven Development.

## SPARC Commands

### Core Commands
- `npx claude-flow sparc modes` - List available modes
- `npx claude-flow sparc run <mode> "<task>"` - Execute specific mode
- `npx claude-flow sparc tdd "<feature>"` - Run complete TDD workflow
- `npx claude-flow sparc info <mode>` - Get mode details

### Batchtools Commands
- `npx claude-flow sparc batch <modes> "<task>"` - Parallel execution
- `npx claude-flow sparc pipeline "<task>"` - Full pipeline processing
- `npx claude-flow sparc concurrent <mode> "<tasks-file>"` - Multi-task processing

### Build Commands
- `npm run build` - Build project
- `npm run test` - Run tests
- `npm run lint` - Linting
- `npm run typecheck` - Type checking

## SPARC Workflow Phases

1. **Specification** - Requirements analysis (`sparc run spec-pseudocode`)
2. **Pseudocode** - Algorithm design (`sparc run spec-pseudocode`)
3. **Architecture** - System design (`sparc run architect`)
4. **Refinement** - TDD implementation (`sparc tdd`)
5. **Completion** - Integration (`sparc run integration`)

## Code Style & Best Practices

- **Modular Design**: Files under 500 lines
- **Environment Safety**: Never hardcode secrets
- **Test-First**: Write tests before implementation
- **Clean Architecture**: Separate concerns
- **Documentation**: Keep updated

## 🚀 Available Agents (54 Total)

### Core Development
`coder`, `reviewer`, `tester`, `planner`, `researcher`

### Swarm Coordination
`hierarchical-coordinator`, `mesh-coordinator`, `adaptive-coordinator`, `collective-intelligence-coordinator`, `swarm-memory-manager`

### Consensus & Distributed
`byzantine-coordinator`, `raft-manager`, `gossip-coordinator`, `consensus-builder`, `crdt-synchronizer`, `quorum-manager`, `security-manager`

### Performance & Optimization
`perf-analyzer`, `performance-benchmarker`, `task-orchestrator`, `memory-coordinator`, `smart-agent`

### GitHub & Repository
`github-modes`, `pr-manager`, `code-review-swarm`, `issue-tracker`, `release-manager`, `workflow-automation`, `project-board-sync`, `repo-architect`, `multi-repo-swarm`

### SPARC Methodology
`sparc-coord`, `sparc-coder`, `specification`, `pseudocode`, `architecture`, `refinement`

### Specialized Development
`backend-dev`, `mobile-dev`, `ml-developer`, `cicd-engineer`, `api-docs`, `system-architect`, `code-analyzer`, `base-template-generator`
### Testing & Validation
`tdd-london-swarm`, `production-validator`

### Migration & Planning
`migration-planner`, `swarm-init`

## 🎯 Claude Code vs MCP Tools

### Claude Code Handles ALL EXECUTION:
- **Task tool**: Spawn and run agents concurrently for actual work
- File operations (Read, Write, Edit, MultiEdit, Glob, Grep)
- Code generation and programming
- Bash commands and system operations
- Implementation work
- Project navigation and analysis
- TodoWrite and task management
- Git operations
- Package management
- Testing and debugging

### MCP Tools ONLY COORDINATE:
- Swarm initialization (topology setup)
- Agent type definitions (coordination patterns)
- Task orchestration (high-level planning)
- Memory management
- Neural features
- Performance tracking
- GitHub integration

**KEY**: MCP coordinates the strategy, Claude Code's Task tool executes with real agents.

## 🚀 Quick Setup

```bash
# Check if port 29500 (the default torch distributed port) is in use
if lsof -i :29500 >/dev/null 2>&1; then
    echo "Port 29500 in use, waiting 10s..."
    sleep 10
fi

# Add MCP servers (Claude Flow required, others optional)
claude mcp add claude-flow npx claude-flow@alpha mcp start
claude mcp add ruv-swarm npx ruv-swarm mcp start  # Optional: Enhanced coordination
claude mcp add flow-nexus npx flow-nexus@latest mcp start  # Optional: Cloud features
```
**Note**: nanovllm's distributed port handling is not yet robust: two processes competing for the same port will cause errors. This check prevents that issue.

## MCP Tool Categories

## Multi-Instance Development with PYTHONPATH

### Coordination
`swarm_init`, `agent_spawn`, `task_orchestrate`

**IMPORTANT**: When running multiple Claude instances on different worktrees, do NOT use `pip install -e .` globally, as it will affect other instances.

### Monitoring
`swarm_status`, `agent_list`, `agent_metrics`, `task_status`, `task_results`

**Use PYTHONPATH directly** - no pip install needed:

### Memory & Neural
`memory_usage`, `neural_status`, `neural_train`, `neural_patterns`

### GitHub Integration
`github_swarm`, `repo_analyze`, `pr_enhance`, `issue_triage`, `code_review`

### System
`benchmark_run`, `features_detect`, `swarm_monitor`

### Flow-Nexus MCP Tools (Optional Advanced Features)
Flow-Nexus extends MCP capabilities with 70+ cloud-based orchestration tools:

**Key MCP Tool Categories:**
- **Swarm & Agents**: `swarm_init`, `swarm_scale`, `agent_spawn`, `task_orchestrate`
- **Sandboxes**: `sandbox_create`, `sandbox_execute`, `sandbox_upload` (cloud execution)
- **Templates**: `template_list`, `template_deploy` (pre-built project templates)
- **Neural AI**: `neural_train`, `neural_patterns`, `seraphina_chat` (AI assistant)
- **GitHub**: `github_repo_analyze`, `github_pr_manage` (repository management)
- **Real-time**: `execution_stream_subscribe`, `realtime_subscribe` (live monitoring)
- **Storage**: `storage_upload`, `storage_list` (cloud file management)

**Authentication Required:**
- Register: `mcp__flow-nexus__user_register` or `npx flow-nexus@latest register`
- Login: `mcp__flow-nexus__user_login` or `npx flow-nexus@latest login`
- Access 70+ specialized MCP tools for advanced orchestration

## 🚀 Agent Execution Flow with Claude Code

### The Correct Pattern:

1. **Optional**: Use MCP tools to set up coordination topology
2. **REQUIRED**: Use Claude Code's Task tool to spawn agents that do actual work
3. **REQUIRED**: Each agent runs hooks for coordination
4. **REQUIRED**: Batch all operations in single messages

### Example Full-Stack Development:

```javascript
// Single message with all agent spawning via Claude Code's Task tool
[Parallel Agent Execution]:
Task("Backend Developer", "Build REST API with Express. Use hooks for coordination.", "backend-dev")
Task("Frontend Developer", "Create React UI. Coordinate with backend via memory.", "coder")
Task("Database Architect", "Design PostgreSQL schema. Store schema in memory.", "code-analyzer")
Task("Test Engineer", "Write Jest tests. Check memory for API contracts.", "tester")
Task("DevOps Engineer", "Setup Docker and CI/CD. Document in memory.", "cicd-engineer")
Task("Security Auditor", "Review authentication. Report findings via hooks.", "reviewer")

// All todos batched together
TodoWrite { todos: [...8-10 todos...] }

// All file operations together
Write "backend/server.js"
Write "frontend/App.jsx"
Write "database/schema.sql"
```
## 📋 Agent Coordination Protocol

### Every Agent Spawned via Task Tool MUST:

**1️⃣ BEFORE Work:**
```bash
# Set PYTHONPATH to point to the project root directory
PYTHONPATH=/path/to/your/worktree:$PYTHONPATH python <script.py>

# Example: running tests
PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH python tests/test_needle.py

npx claude-flow@alpha hooks pre-task --description "[task]"
npx claude-flow@alpha hooks session-restore --session-id "swarm-[id]"
```

**Benefits**:
- No `pip install` required
- Code changes take effect immediately (no reinstall needed)
- Each worktree is completely isolated

**2️⃣ DURING Work:**
```bash
npx claude-flow@alpha hooks post-edit --file "[file]" --memory-key "swarm/[agent]/[step]"
npx claude-flow@alpha hooks notify --message "[what was done]"
```

## Documentation Index

**3️⃣ AFTER Work:**
```bash
npx claude-flow@alpha hooks post-task --task-id "[task]"
npx claude-flow@alpha hooks session-end --export-metrics true
```

| Document | Purpose |
|----------|---------|
| [`docs/architecture_guide.md`](docs/architecture_guide.md) | Core components, layer-wise CPU offload design, prefill/decode flows, implementation details |
| [`docs/multi_model_support.md`](docs/multi_model_support.md) | Model registry system, adding new models (Qwen3/Llama), architecture differences, RoPE scaling |
| [`docs/cuda_graph_offload_guide.md`](docs/cuda_graph_offload_guide.md) | CUDA graph support for CPU offload decode path, 4x decode speedup |
| [`docs/sparse_attention_guide.md`](docs/sparse_attention_guide.md) | Block sparse attention methods (MInference, FlexPrefill, XAttention, Quest), computation flow |
| [`docs/sparse_prefill_integration_plan.md`](docs/sparse_prefill_integration_plan.md) | Integration plan for MInference/XAttention/FlexPrefill with unified BlockMask interface |
| [`docs/sparse_offload_integration.md`](docs/sparse_offload_integration.md) | Sparse policy integration with layerwise offload, `requires_block_selection` interface design |
| [`docs/layerwise_offload_memory_analysis.md`](docs/layerwise_offload_memory_analysis.md) | Memory allocation analysis with theoretical formulas and empirical validation (< 5% error) |
| [`docs/debugging_guide.md`](docs/debugging_guide.md) | PyTorch hooks for debugging, tensor comparison, memory profiling |
| [`docs/gpu_only_performance_issue.md`](docs/gpu_only_performance_issue.md) | GPU-only mode slower than offload due to PagedAttention scatter overhead, optimization proposals |

## 🎯 Concurrent Execution Examples

## Configuration

### ✅ CORRECT WORKFLOW: MCP Coordinates, Claude Code Executes

| Parameter | Default | Notes |
|-----------|---------|-------|
| `kvcache_block_size` | 4096 | Tokens per block |
| `max_num_batched_tokens` | 16384 | Set = max_model_len for long context |
| `gpu_memory_utilization` | 0.9 | GPU memory fraction |
| `enable_cpu_offload` | False | Enable for long context |
| `num_gpu_blocks` | 2 | GPU blocks for offload mode |
| `num_kv_buffers` | 4 | Ring buffer size for decode pipeline |
| `enforce_eager` | False | Set True to disable CUDA graphs |

```javascript
// Step 1: MCP tools set up coordination (optional, for complex tasks)
[Single Message - Coordination Setup]:
mcp__claude-flow__swarm_init { topology: "mesh", maxAgents: 6 }
mcp__claude-flow__agent_spawn { type: "researcher" }
mcp__claude-flow__agent_spawn { type: "coder" }
mcp__claude-flow__agent_spawn { type: "tester" }

// Step 2: Claude Code Task tool spawns ACTUAL agents that do the work
[Single Message - Parallel Agent Execution]:
// Claude Code's Task tool spawns real agents concurrently
Task("Research agent", "Analyze API requirements and best practices. Check memory for prior decisions.", "researcher")
Task("Coder agent", "Implement REST endpoints with authentication. Coordinate via hooks.", "coder")
Task("Database agent", "Design and implement database schema. Store decisions in memory.", "code-analyzer")
Task("Tester agent", "Create comprehensive test suite with 90% coverage.", "tester")
Task("Reviewer agent", "Review code quality and security. Document findings.", "reviewer")

// Batch ALL todos in ONE call
TodoWrite { todos: [
  {id: "1", content: "Research API patterns", status: "in_progress", priority: "high"},
  {id: "2", content: "Design database schema", status: "in_progress", priority: "high"},
  {id: "3", content: "Implement authentication", status: "pending", priority: "high"},
  {id: "4", content: "Build REST endpoints", status: "pending", priority: "high"},
  {id: "5", content: "Write unit tests", status: "pending", priority: "medium"},
  {id: "6", content: "Integration tests", status: "pending", priority: "medium"},
  {id: "7", content: "API documentation", status: "pending", priority: "low"},
  {id: "8", content: "Performance optimization", status: "pending", priority: "low"}
]}

// Parallel file operations
Bash "mkdir -p app/{src,tests,docs,config}"
Write "app/package.json"
Write "app/src/server.js"
Write "app/tests/server.test.js"
Write "app/docs/API.md"
```

## Benchmarking
**Files**: `bench.py` (GPU), `bench_offload.py` (CPU offload), `bench_vllm.py` (comparison)

### ❌ WRONG (Multiple Messages):
```javascript
Message 1: mcp__claude-flow__swarm_init
Message 2: Task("agent 1")
Message 3: TodoWrite { todos: [single todo] }
Message 4: Write "file.js"
// This breaks parallel coordination!
```

**Common Issues**:
1. `max_num_batched_tokens < max_model_len`: Set them equal for long context
2. CUDA graph dimension mismatch: Ensure `input_len + output_len <= max_model_len`
3. RoPE out of bounds: Check the model's `max_position_embeddings` in config.json
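The first two issues can be caught before launching a run. A minimal pre-flight sketch (the helper name `validate_run_config` is illustrative, not part of nanovllm):

```python
def validate_run_config(max_model_len: int, max_num_batched_tokens: int,
                        input_len: int, output_len: int) -> list[str]:
    """Return a list of config problems matching the common issues above."""
    problems = []
    if max_num_batched_tokens < max_model_len:
        # Issue 1: long prompts cannot be batched in one prefill pass
        problems.append("max_num_batched_tokens < max_model_len: set them equal")
    if input_len + output_len > max_model_len:
        # Issue 2: CUDA graph dimensions would exceed the captured shape
        problems.append("input_len + output_len exceeds max_model_len")
    return problems

print(validate_run_config(32768, 32768, 30000, 2000))  # [] -- config is fine
```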
## Performance Benefits

**Model Limits**:
- Qwen3-0.6B/4B: 40960 tokens
- Qwen2.5-7B-Instruct-1M: 1048576 tokens
- Llama-3.1-8B-Instruct: 131072 tokens

- **84.8% SWE-Bench solve rate**
- **32.3% token reduction**
- **2.8-4.4x speed improvement**
- **27+ neural models**

**Performance (Qwen3-4B, CPU Offload)**:
- Prefill: ~5700-8000 tok/s (varies by context length)
- Decode with CUDA Graph: ~50 tok/s (TPOT ~19ms)
- Decode Eager Mode: ~12 tok/s (TPOT ~80ms)
- **CUDA Graph speedup: 4x decode throughput**

## Hooks Integration

### Pre-Operation
- Auto-assign agents by file type
- Validate commands for safety
- Prepare resources automatically
- Optimize topology by complexity
- Cache searches

### Post-Operation
- Auto-format code
- Train neural patterns
- Update memory
- Analyze performance
- Track token usage

### Session Management
- Generate summaries
- Persist state
- Track metrics
- Restore context
- Export workflows

## Advanced Features (v2.0.0)

- 🚀 Automatic Topology Selection
- ⚡ Parallel Execution (2.8-4.4x speed)
- 🧠 Neural Training
- 📊 Bottleneck Analysis
- 🤖 Smart Auto-Spawning
- 🛡️ Self-Healing Workflows
- 💾 Cross-Session Memory
- 🔗 GitHub Integration

## Integration Tips

1. Start with basic swarm init
2. Scale agents gradually
3. Use memory for context
4. Monitor progress regularly
5. Train patterns from success
6. Enable hooks automation
7. Use GitHub tools first

## Support

- Documentation: https://github.com/ruvnet/claude-flow
- Issues: https://github.com/ruvnet/claude-flow/issues
- Flow-Nexus Platform: https://flow-nexus.ruv.io (registration required for cloud features)

---

**Author**: Zijie Tian

Remember: **Claude Flow coordinates, Claude Code creates!**

# Nano-vLLM Testing

## RULER NIAH Benchmark Test

Tests long-context retrieval capability using the RULER benchmark's NIAH (Needle-In-A-Haystack) task data (~32K tokens).

**Documentation**:
- [`docs/ruler_niah_standalone_test.md`](docs/ruler_niah_standalone_test.md) - Test setup and usage
- [`docs/offload_accuracy_issue.md`](docs/offload_accuracy_issue.md) - **[BUG]** Offload mode accuracy issue (66% vs 100%)
### Quick Start

```bash
# Single sample test (recommended for initial verification)
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --enable-offload

# All 5 samples
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --enable-offload \
    --sample-indices 0-4
```

### Options

| Option | Default | Description |
|--------|---------|-------------|
| `--model` | `~/models/Llama-3.1-8B-Instruct` | Model path |
| `--enable-offload` | False | Enable CPU offload (required for 24GB GPUs) |
| `--sample-indices` | all | Samples to test (e.g., `0,1,2` or `0-4`) |
| `--max-model-len` | 32768 | Maximum context length |
| `--use-cuda-graph` | False | Enable CUDA graph (faster decode) |
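A minimal sketch of how a `--sample-indices` value like `0,1,2` or `0-4` can be expanded into a list of indices (the helper name is illustrative; the actual test script may parse differently):

```python
def parse_sample_indices(spec: str) -> list[int]:
    """Expand '0,1,2' or '0-4' (or a mix like '0,2-4') into sorted indices."""
    indices: set = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            indices.update(range(int(lo), int(hi) + 1))  # inclusive range
        else:
            indices.add(int(part))
    return sorted(indices)

print(parse_sample_indices("0-4"))    # [0, 1, 2, 3, 4]
print(parse_sample_indices("0,2-3"))  # [0, 2, 3]
```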
---

# important-instruction-reminders
Do what has been asked; nothing more, nothing less.
NEVER create files unless they're absolutely necessary for achieving your goal.
ALWAYS prefer editing an existing file to creating a new one.
NEVER proactively create documentation files (*.md) or README files. Only create documentation files if explicitly requested by the User.
Never save working files, text/mds and tests to the root folder.
docs/offload_accuracy_issue.md (new file): 239 lines
@@ -0,0 +1,239 @@
# CPU Offload Accuracy Issue Investigation

## Problem Summary

CPU offload mode produces significantly lower accuracy than non-offload mode on the RULER NIAH benchmark.

| Mode | Accuracy | Pass/Total |
|------|----------|------------|
| **Non-Offload (GPU only)** | **100%** | 100/100 |
| **CPU Offload** | **66%** | 66/100 |

This 34-point accuracy drop indicates a bug in the offload implementation that affects inference correctness.

## Test Environment

- **Model**: Llama-3.1-8B-Instruct
- **Task**: RULER NIAH (Needle-In-A-Haystack) 32K context
- **GPU**: NVIDIA A100-SXM4-80GB
- **Data**: `tests/data/ruler_niah/niah_single_1_32k.jsonl` (100 samples)

## Reproduction Commands

### Non-Offload Mode (100% accuracy)

```bash
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --gpu-utilization 0.7 \
    --quiet
```

**Configuration**:
- KV Cache: GPU only, 51 blocks (6528 MB)
- Block size: 1024 tokens

### Offload Mode (66% accuracy)

```bash
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --enable-offload \
    --quiet
```

**Configuration**:
- KV Cache: GPU 4 blocks (512 MB) + CPU 32 blocks (4096 MB)
- Ring buffer: 4 buffers × 33280 tokens (520 MB)
- Per-layer decode buffer: 128 MB
- Block size: 1024 tokens
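These sizes are consistent with Llama-3.1-8B's KV geometry, assuming 32 layers, 8 KV heads (GQA), head dim 128, and fp16 storage; a quick sanity check of the reported numbers:

```python
# KV bytes per token per layer: K and V, kv_heads * head_dim each, 2 bytes (fp16)
KV_HEADS, HEAD_DIM, LAYERS = 8, 128, 32
per_token_per_layer = 2 * KV_HEADS * HEAD_DIM * 2   # 4096 bytes

# One 1024-token block across all 32 layers -> 128 MB, so 51 blocks -> 6528 MB
block_mb = 1024 * LAYERS * per_token_per_layer / 2**20
print(block_mb, 51 * block_mb)   # 128.0 6528.0

# Ring buffer: 4 buffers of 33280 tokens, each holding one layer's KV -> 520 MB
ring_mb = 4 * 33280 * per_token_per_layer / 2**20
print(ring_mb)                   # 520.0
```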
## Observed Failure Patterns

From the 5-sample verbose test:

| Sample | Expected | Offload Output | Status |
|--------|----------|----------------|--------|
| 0 | 8930103 | `: 8930103.` | PASS |
| 1 | 4194548 | `: 419 multiplication of 4548.` | **FAIL** |
| 2 | 8231838 | `:ное 8231838.` | PASS |
| 3 | 8835373 | `: 8835373.` | PASS |
| 4 | 7754864 | `aster 7754864.` | PASS |

**Failure pattern**: The model sometimes produces corrupted or split outputs (e.g., "419 multiplication of 4548" instead of "4194548").
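The PASS/FAIL judgments above are consistent with a substring-style check that scans the raw output for the expected needle; a minimal sketch (the scorer name is illustrative, not the test's actual implementation):

```python
def needle_found(output: str, expected: str) -> bool:
    """PASS iff the expected needle appears verbatim in the (possibly noisy) output."""
    return expected in output

# Sample 2 passes despite stray tokens; sample 1 fails because the digits were split.
print(needle_found(":ное 8231838.", "8231838"))                  # True
print(needle_found(": 419 multiplication of 4548.", "4194548"))  # False
```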
## Architecture Overview

### Offload Mode Data Flow

```
Prefill Phase:
1. Input tokens → chunked into 2048-token chunks
2. Each chunk processed layer by layer:
   - Load KV from CPU → GPU ring buffer
   - Compute attention
   - Store KV back to CPU
3. Ring buffer holds recent KV for decode

Decode Phase:
1. For each new token:
   - Load all layer KV from CPU (one layer at a time)
   - Compute attention against full context
   - Generate next token
```

### Key Components

| File | Component | Description |
|------|-----------|-------------|
| `nanovllm/kvcache/offload_engine.py` | `OffloadEngine` | Manages CPU↔GPU KV cache transfers |
| `nanovllm/kvcache/offload_engine.py` | `RingKVBuffer` | GPU ring buffer for recent KV |
| `nanovllm/engine/model_runner.py` | `run_chunked_offload_prefill()` | Chunked prefill with offload |
| `nanovllm/engine/model_runner.py` | `run_offload_decode()` | Layer-wise decode with offload |
| `nanovllm/kvcache/hybrid_manager.py` | `HybridBlockManager` | CPU block allocation |

## Potential Root Causes

### 1. Ring Buffer Index/Position Issues

**Location**: `nanovllm/kvcache/offload_engine.py`

The ring buffer uses modulo indexing. Potential issues:
- Position calculation errors during the prefill/decode transition
- Off-by-one errors in KV storage/retrieval
- Incorrect handling when the sequence length approaches `max_seq_len`

**Recent fix applied**: `max_seq_len = max_model_len + 512` to prevent overflow, but there may be other indexing issues.
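To illustrate the failure class (this is a toy, not nanovllm's actual `RingKVBuffer`): with modulo indexing, any position past the buffer capacity silently wraps and overwrites the oldest slot, so a reader that assumes slot == position gets stale or foreign KV.

```python
class ToyRingBuffer:
    """Minimal ring buffer with modulo indexing, to show the wrap-around hazard."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.slots = [None] * capacity

    def store(self, pos: int, kv):
        self.slots[pos % self.capacity] = kv    # wraps silently past capacity

    def load(self, pos: int):
        return self.slots[pos % self.capacity]  # no check that slot still holds `pos`

buf = ToyRingBuffer(capacity=4)
for pos in range(6):              # write positions 0..5 into only 4 slots
    buf.store(pos, f"kv{pos}")

print(buf.load(0))   # 'kv4' -- position 0 was silently overwritten by position 4
```

The `max_seq_len` bump reduces how often a wrap occurs but does not make the load side verify which position a slot actually holds.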
### 2. Chunked Prefill KV Storage

**Location**: `nanovllm/engine/model_runner.py:run_chunked_offload_prefill()`

During chunked prefill:
- KV computed for chunk N must be correctly stored before processing chunk N+1
- Position IDs must be correctly accumulated across chunks
- CPU block allocation must be contiguous and correctly tracked

**Suspect areas**:
```python
# Check if positions are correctly tracked across chunks
# Check if KV is correctly copied to CPU after each chunk
# Check if ring buffer indices align with CPU block indices
```
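A hedged sketch of the position-ID invariant from the second bullet: across chunks, position IDs must continue from a running offset rather than restart at 0 (the helper is illustrative, not nanovllm code):

```python
def chunk_positions(total_len: int, chunk_size: int) -> list:
    """Position IDs per chunk; each chunk continues from the running offset."""
    chunks, offset = [], 0
    while offset < total_len:
        end = min(offset + chunk_size, total_len)
        chunks.append(list(range(offset, end)))  # NOT range(0, end - offset)
        offset = end
    return chunks

pos = chunk_positions(total_len=5000, chunk_size=2048)
print([c[0] for c in pos])   # [0, 2048, 4096] -- a restart at 0 here would corrupt RoPE
```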
### 3. Decode Phase KV Loading

**Location**: `nanovllm/engine/model_runner.py:run_offload_decode()`

During decode:
- Must load KV for ALL previous tokens (both prefill and decode)
- Layer-by-layer loading must be synchronized correctly
- Attention computation must use the correct sequence length

**Suspect areas**:
```python
# Check if decode loads KV for the full context length
# Check if new decode KV is stored correctly
# Check if attention mask/positions are correct
```
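The first bullet is easy to check mechanically: at decode step t (0-based), attention must see all prefill KV plus every earlier decode token plus the current one. A toy illustration of one hypothetical bug shape (both helpers are illustrative, not nanovllm code):

```python
def expected_kv_len(prefill_len: int, decode_step: int) -> int:
    """KV entries attention must cover at decode step `decode_step` (0-based)."""
    return prefill_len + decode_step + 1

def loaded_kv_len_buggy(prefill_len: int, decode_step: int) -> int:
    # Hypothetical bug: decode loads only prefill KV, dropping earlier decode tokens
    return prefill_len

step = 5
print(expected_kv_len(32768, step) - loaded_kv_len_buggy(32768, step))  # 6 missing entries
```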
### 4. CPU↔GPU Transfer Synchronization

**Location**: `nanovllm/kvcache/offload_engine.py`

CUDA streams and synchronization:
- Async copies may complete out of order
- Missing synchronization points could cause stale data
- Stream priorities may affect correctness

### 5. Numerical Precision

- CPU tensors use float16/bfloat16
- GPU computation precision
- Potential precision loss during transfers

## Debugging Strategy

### Step 1: Identify Failing Samples

```bash
# Run verbose mode to see which samples fail
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --enable-offload \
    --verbose 2>&1 | tee offload_verbose.log
```

### Step 2: Compare Token-by-Token

Create a debug script to compare token generation between offload and non-offload modes for a failing sample:

```python
# Compare logits at each decode step
# Check if divergence starts at a specific position
# Log KV cache contents at divergence point
```
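The comparison in Step 2 reduces to finding the first decode step where the two modes disagree; a minimal sketch over per-step argmax token IDs (collecting those IDs from each mode is left to the debug script):

```python
def first_divergence(tokens_a, tokens_b):
    """Index of the first differing token between two runs, or None if identical."""
    for i, (a, b) in enumerate(zip(tokens_a, tokens_b)):
        if a != b:
            return i
    if len(tokens_a) != len(tokens_b):
        return min(len(tokens_a), len(tokens_b))  # one run stopped early
    return None

gpu_only = [42, 17, 8, 993, 5]
offload  = [42, 17, 9, 100, 5]
print(first_divergence(gpu_only, offload))   # 2 -- inspect KV cache state at this step
```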
### Step 3: Verify KV Cache Contents

Add debugging to `OffloadEngine`:

```python
# In store_kv(): Log what's being stored
# In load_kv(): Log what's being loaded
# Compare loaded KV with expected values
```

### Step 4: Check Position/Index Calculations

```python
# Log ring buffer write/read positions
# Log CPU block indices
# Verify position IDs match actual token positions
```

### Step 5: Isolate the Bug

1. Test with shorter sequences (16K, 8K) to see if the issue is length-dependent
2. Test with a single chunk (no chunking) to isolate chunked prefill
3. Test prefill-only (no decode) to isolate the decode phase
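Step 5's length sweep can be turned into a bisection: once one passing short length and one failing long length are known, binary search finds the smallest failing length in O(log n) runs. A sketch with a stand-in predicate (a real version would run the NIAH test at each length):

```python
def min_failing_length(fails, lo_pass: int, hi_fail: int) -> int:
    """Smallest length in (lo_pass, hi_fail] for which `fails(length)` is True.

    Assumes the failure is monotone in length: fails(lo_pass) is False,
    fails(hi_fail) is True.
    """
    while hi_fail - lo_pass > 1:
        mid = (lo_pass + hi_fail) // 2
        if fails(mid):
            hi_fail = mid
        else:
            lo_pass = mid
    return hi_fail

# Stand-in predicate: pretend the bug appears once context exceeds one ring-buffer wrap
fails = lambda length: length > 33280
print(min_failing_length(fails, lo_pass=8192, hi_fail=65536))   # 33281
```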
## Quick Debugging Commands

```bash
# Test single failing sample with verbose output
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --enable-offload \
    --sample-indices 1 \
    --verbose

# Test with different context lengths
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --enable-offload \
    --max-model-len 16384 \
    --verbose
```

## Related Documentation

- [`docs/ruler_niah_standalone_test.md`](ruler_niah_standalone_test.md) - Test setup and background
- [`docs/layerwise_offload_memory_analysis.md`](layerwise_offload_memory_analysis.md) - Memory analysis (if it exists)

## Test Results Log

**Date**: 2025-01-12

| Test | Mode | Samples | Passed | Accuracy |
|------|------|---------|--------|----------|
| RULER NIAH 32K | Non-Offload | 100 | 100 | 100% |
| RULER NIAH 32K | CPU Offload | 100 | 66 | 66% |

## Next Steps

1. [ ] Identify the pattern in failing samples (position of needle? specific numbers?)
2. [ ] Add detailed logging to the offload engine
3. [ ] Compare logits between offload and non-offload modes
4. [ ] Bisect the code to find the exact bug location
5. [ ] Write a unit test that isolates the bug