[docs] Added offload_acc issue.
# Claude Code Configuration - SPARC Development Environment

This file provides guidance to Claude Code when working with this repository.

## Overview

Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline LLM inference. It supports multiple model architectures (Qwen3, Qwen2, Llama) with CPU offload for long-context inference.

## 🚨 CRITICAL: CONCURRENT EXECUTION & FILE MANAGEMENT

**ABSOLUTE RULES**:

1. ALL operations MUST be concurrent/parallel in a single message
2. **NEVER save working files, text/mds, and tests to the root folder**
3. ALWAYS organize files in appropriate subdirectories
4. **USE CLAUDE CODE'S TASK TOOL** for spawning agents concurrently, not just MCP

### ⚡ GOLDEN RULE: "1 MESSAGE = ALL RELATED OPERATIONS"
**MANDATORY PATTERNS:**

- **TodoWrite**: ALWAYS batch ALL todos in ONE call (5-10+ todos minimum)
- **Task tool (Claude Code)**: ALWAYS spawn ALL agents in ONE message with full instructions
- **File operations**: ALWAYS batch ALL reads/writes/edits in ONE message
- **Bash commands**: ALWAYS batch ALL terminal operations in ONE message
- **Memory operations**: ALWAYS batch ALL memory store/retrieve in ONE message

### 🎯 CRITICAL: Claude Code Task Tool for Agent Execution

**Claude Code's Task tool is the PRIMARY way to spawn agents:**

```javascript
// ✅ CORRECT: Use Claude Code's Task tool for parallel agent execution
[Single Message]:
  Task("Research agent", "Analyze requirements and patterns...", "researcher")
  Task("Coder agent", "Implement core features...", "coder")
  Task("Tester agent", "Create comprehensive tests...", "tester")
  Task("Reviewer agent", "Review code quality...", "reviewer")
  Task("Architect agent", "Design system architecture...", "system-architect")
```

## GPU Mutex for Multi-Instance Debugging

**IMPORTANT**: When running multiple Claude instances for parallel debugging, different rules apply based on script type:

### Benchmarks (`bench*.py`) - Exclusive GPU Access Required

Before running any `bench*.py` script, Claude MUST wait for exclusive GPU access:

```bash
# Check and wait for the GPU to be free
while [ -n "$(nvidia-smi --query-compute-apps=pid --format=csv,noheader)" ]; do
  echo "GPU busy, waiting 10s..."
  sleep 10
done
```
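The same wait loop can be sketched in Python for scripts that want to self-gate; a minimal sketch (function names are illustrative, the `nvidia-smi` query flags are the ones used above):

```python
import shutil
import subprocess
import time

def gpu_busy() -> bool:
    """Return True if any compute process currently holds the GPU.

    Falls back to False when nvidia-smi is not on PATH.
    """
    if shutil.which("nvidia-smi") is None:
        return False
    out = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid", "--format=csv,noheader"],
        capture_output=True, text=True,
    ).stdout.strip()
    return bool(out)

def wait_for_gpu(poll_s: float = 10.0) -> None:
    """Block until no compute process holds the GPU (exclusive access)."""
    while gpu_busy():
        print(f"GPU busy, waiting {poll_s:.0f}s...")
        time.sleep(poll_s)
```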
### Other Scripts (tests, examples) - Port Conflict Check Only

For non-benchmark scripts, exclusive GPU access is NOT required. However, check for **distributed port conflicts** before running (see the port check under Quick Setup).

**MCP tools are ONLY for coordination setup:**

- `mcp__claude-flow__swarm_init` - Initialize coordination topology
- `mcp__claude-flow__agent_spawn` - Define agent types for coordination
- `mcp__claude-flow__task_orchestrate` - Orchestrate high-level workflows

### 📁 File Organization Rules

**NEVER save to the root folder. Use these directories:**

- `/src` - Source code files
- `/tests` - Test files
- `/docs` - Documentation and markdown files
- `/config` - Configuration files
- `/scripts` - Utility scripts
- `/examples` - Example code
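The layout above can be created in one go; a minimal sketch (demonstrated under a scratch directory so it is safe to run anywhere):

```python
import pathlib
import tempfile

# Create the standard project layout (under a scratch root for the demo).
root = pathlib.Path(tempfile.mkdtemp())
for d in ["src", "tests", "docs", "config", "scripts", "examples"]:
    (root / d).mkdir()
print(sorted(p.name for p in root.iterdir()))
```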
## Project Overview

This project uses SPARC (Specification, Pseudocode, Architecture, Refinement, Completion) methodology with Claude-Flow orchestration for systematic Test-Driven Development.

## SPARC Commands

### Core Commands

- `npx claude-flow sparc modes` - List available modes
- `npx claude-flow sparc run <mode> "<task>"` - Execute specific mode
- `npx claude-flow sparc tdd "<feature>"` - Run complete TDD workflow
- `npx claude-flow sparc info <mode>` - Get mode details

### Batchtools Commands

- `npx claude-flow sparc batch <modes> "<task>"` - Parallel execution
- `npx claude-flow sparc pipeline "<task>"` - Full pipeline processing
- `npx claude-flow sparc concurrent <mode> "<tasks-file>"` - Multi-task processing

### Build Commands

- `npm run build` - Build project
- `npm run test` - Run tests
- `npm run lint` - Linting
- `npm run typecheck` - Type checking

## SPARC Workflow Phases

1. **Specification** - Requirements analysis (`sparc run spec-pseudocode`)
2. **Pseudocode** - Algorithm design (`sparc run spec-pseudocode`)
3. **Architecture** - System design (`sparc run architect`)
4. **Refinement** - TDD implementation (`sparc tdd`)
5. **Completion** - Integration (`sparc run integration`)

## Code Style & Best Practices

- **Modular Design**: Files under 500 lines
- **Environment Safety**: Never hardcode secrets
- **Test-First**: Write tests before implementation
- **Clean Architecture**: Separate concerns
- **Documentation**: Keep updated
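The 500-line modular-design limit is easy to audit; a minimal sketch (the helper and demo files are illustrative, not part of the repo):

```python
import pathlib
import tempfile

# Sketch: flag files that exceed the 500-line modular-design limit.
# Demonstrated on a temporary tree; file names are illustrative.
root = pathlib.Path(tempfile.mkdtemp())
(root / "big.py").write_text("x = 0\n" * 600)
(root / "small.py").write_text("x = 0\n" * 10)

over = [p.name for p in root.glob("**/*.py")
        if len(p.read_text().splitlines()) > 500]
print(over)  # ['big.py']
```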
## 🚀 Available Agents (54 Total)

### Core Development
`coder`, `reviewer`, `tester`, `planner`, `researcher`

### Swarm Coordination
`hierarchical-coordinator`, `mesh-coordinator`, `adaptive-coordinator`, `collective-intelligence-coordinator`, `swarm-memory-manager`

### Consensus & Distributed
`byzantine-coordinator`, `raft-manager`, `gossip-coordinator`, `consensus-builder`, `crdt-synchronizer`, `quorum-manager`, `security-manager`

### Performance & Optimization
`perf-analyzer`, `performance-benchmarker`, `task-orchestrator`, `memory-coordinator`, `smart-agent`

### GitHub & Repository
`github-modes`, `pr-manager`, `code-review-swarm`, `issue-tracker`, `release-manager`, `workflow-automation`, `project-board-sync`, `repo-architect`, `multi-repo-swarm`

### SPARC Methodology
`sparc-coord`, `sparc-coder`, `specification`, `pseudocode`, `architecture`, `refinement`

### Specialized Development
`backend-dev`, `mobile-dev`, `ml-developer`, `cicd-engineer`, `api-docs`, `system-architect`, `code-analyzer`, `base-template-generator`

### Testing & Validation
`tdd-london-swarm`, `production-validator`

### Migration & Planning
`migration-planner`, `swarm-init`
## 🎯 Claude Code vs MCP Tools

### Claude Code Handles ALL EXECUTION:

- **Task tool**: Spawn and run agents concurrently for actual work
- File operations (Read, Write, Edit, MultiEdit, Glob, Grep)
- Code generation and programming
- Bash commands and system operations
- Implementation work
- Project navigation and analysis
- TodoWrite and task management
- Git operations
- Package management
- Testing and debugging

### MCP Tools ONLY COORDINATE:

- Swarm initialization (topology setup)
- Agent type definitions (coordination patterns)
- Task orchestration (high-level planning)
- Memory management
- Neural features
- Performance tracking
- GitHub integration

**KEY**: MCP coordinates the strategy; Claude Code's Task tool executes with real agents.
## 🚀 Quick Setup

```bash
# Add MCP servers (Claude Flow required, others optional)
claude mcp add claude-flow npx claude-flow@alpha mcp start
claude mcp add ruv-swarm npx ruv-swarm mcp start  # Optional: Enhanced coordination
claude mcp add flow-nexus npx flow-nexus@latest mcp start  # Optional: Cloud features
```

For non-benchmark nano-vllm scripts, check the default torch distributed port before running:

```bash
# Check if port 29500 (default torch distributed port) is in use
if lsof -i :29500 >/dev/null 2>&1; then
  echo "Port 29500 in use, waiting 10s..."
  sleep 10
fi
```

**Note**: nano-vllm's distributed port handling is not yet robust - two processes competing for the same port will cause errors. This check prevents that issue.
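A portable alternative to the `lsof` check, using only the Python standard library (a sketch; 29500 is the default torch distributed port mentioned above):

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        return s.connect_ex((host, port)) == 0

if port_in_use(29500):
    print("Port 29500 in use - wait before launching")
```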
## MCP Tool Categories

### Coordination
`swarm_init`, `agent_spawn`, `task_orchestrate`

### Monitoring
`swarm_status`, `agent_list`, `agent_metrics`, `task_status`, `task_results`

### Memory & Neural
`memory_usage`, `neural_status`, `neural_train`, `neural_patterns`

### GitHub Integration
`github_swarm`, `repo_analyze`, `pr_enhance`, `issue_triage`, `code_review`

### System
`benchmark_run`, `features_detect`, `swarm_monitor`

### Flow-Nexus MCP Tools (Optional Advanced Features)

Flow-Nexus extends MCP capabilities with 70+ cloud-based orchestration tools:

**Key MCP Tool Categories:**

- **Swarm & Agents**: `swarm_init`, `swarm_scale`, `agent_spawn`, `task_orchestrate`
- **Sandboxes**: `sandbox_create`, `sandbox_execute`, `sandbox_upload` (cloud execution)
- **Templates**: `template_list`, `template_deploy` (pre-built project templates)
- **Neural AI**: `neural_train`, `neural_patterns`, `seraphina_chat` (AI assistant)
- **GitHub**: `github_repo_analyze`, `github_pr_manage` (repository management)
- **Real-time**: `execution_stream_subscribe`, `realtime_subscribe` (live monitoring)
- **Storage**: `storage_upload`, `storage_list` (cloud file management)

**Authentication Required:**

- Register: `mcp__flow-nexus__user_register` or `npx flow-nexus@latest register`
- Login: `mcp__flow-nexus__user_login` or `npx flow-nexus@latest login`
- Access 70+ specialized MCP tools for advanced orchestration

## Multi-Instance Development with PYTHONPATH

**IMPORTANT**: When running multiple Claude instances on different worktrees, do NOT use `pip install -e .` globally, as it will affect other instances.

**Use PYTHONPATH directly** - no pip install needed.
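Why `PYTHONPATH` works here: Python prepends its entries to `sys.path`, so the worktree's own package shadows any globally installed copy. A quick self-contained check:

```python
import os
import subprocess
import sys
import tempfile

# Launch a child interpreter with PYTHONPATH pointing at a scratch
# "worktree" and confirm that directory lands on the child's sys.path.
with tempfile.TemporaryDirectory() as worktree:
    env = dict(os.environ, PYTHONPATH=worktree)
    out = subprocess.run(
        [sys.executable, "-c",
         "import os, sys; print(os.environ['PYTHONPATH'] in sys.path)"],
        env=env, capture_output=True, text=True,
    ).stdout.strip()
    print(out)
```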
## 🚀 Agent Execution Flow with Claude Code

### The Correct Pattern:

1. **Optional**: Use MCP tools to set up coordination topology
2. **REQUIRED**: Use Claude Code's Task tool to spawn agents that do actual work
3. **REQUIRED**: Each agent runs hooks for coordination
4. **REQUIRED**: Batch all operations in single messages

### Example Full-Stack Development:

```javascript
// Single message with all agent spawning via Claude Code's Task tool
[Parallel Agent Execution]:
  Task("Backend Developer", "Build REST API with Express. Use hooks for coordination.", "backend-dev")
  Task("Frontend Developer", "Create React UI. Coordinate with backend via memory.", "coder")
  Task("Database Architect", "Design PostgreSQL schema. Store schema in memory.", "code-analyzer")
  Task("Test Engineer", "Write Jest tests. Check memory for API contracts.", "tester")
  Task("DevOps Engineer", "Setup Docker and CI/CD. Document in memory.", "cicd-engineer")
  Task("Security Auditor", "Review authentication. Report findings via hooks.", "reviewer")

  // All todos batched together
  TodoWrite { todos: [...8-10 todos...] }

  // All file operations together
  Write "backend/server.js"
  Write "frontend/App.jsx"
  Write "database/schema.sql"
```
For nano-vllm scripts, set `PYTHONPATH` to the worktree root instead of installing the package:

```bash
# Set PYTHONPATH to point to the project root directory
PYTHONPATH=/path/to/your/worktree:$PYTHONPATH python <script.py>

# Example: running tests
PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH python tests/test_needle.py
```

**Benefits**:

- No `pip install` required
- Code changes take effect immediately (no reinstall needed)
- Each worktree is completely isolated

## 📋 Agent Coordination Protocol

### Every Agent Spawned via Task Tool MUST:

**1️⃣ BEFORE Work:**

```bash
npx claude-flow@alpha hooks pre-task --description "[task]"
npx claude-flow@alpha hooks session-restore --session-id "swarm-[id]"
```

**2️⃣ DURING Work:**

```bash
npx claude-flow@alpha hooks post-edit --file "[file]" --memory-key "swarm/[agent]/[step]"
npx claude-flow@alpha hooks notify --message "[what was done]"
```
**3️⃣ AFTER Work:**

```bash
npx claude-flow@alpha hooks post-task --task-id "[task]"
npx claude-flow@alpha hooks session-end --export-metrics true
```

## Documentation Index

| Document | Purpose |
|----------|---------|
| [`docs/architecture_guide.md`](docs/architecture_guide.md) | Core components, layer-wise CPU offload design, prefill/decode flows, implementation details |
| [`docs/multi_model_support.md`](docs/multi_model_support.md) | Model registry system, adding new models (Qwen3/Llama), architecture differences, RoPE scaling |
| [`docs/cuda_graph_offload_guide.md`](docs/cuda_graph_offload_guide.md) | CUDA graph support for CPU offload decode path, 4x decode speedup |
| [`docs/sparse_attention_guide.md`](docs/sparse_attention_guide.md) | Block sparse attention methods (MInference, FlexPrefill, XAttention, Quest), computation flow |
| [`docs/sparse_prefill_integration_plan.md`](docs/sparse_prefill_integration_plan.md) | Integration plan for MInference/XAttention/FlexPrefill with unified BlockMask interface |
| [`docs/sparse_offload_integration.md`](docs/sparse_offload_integration.md) | Sparse policy integration with layerwise offload, `requires_block_selection` interface design |
| [`docs/layerwise_offload_memory_analysis.md`](docs/layerwise_offload_memory_analysis.md) | Memory allocation analysis with theoretical formulas and empirical validation (< 5% error) |
| [`docs/debugging_guide.md`](docs/debugging_guide.md) | PyTorch hooks for debugging, tensor comparison, memory profiling |
| [`docs/gpu_only_performance_issue.md`](docs/gpu_only_performance_issue.md) | GPU-only mode slower than offload due to PagedAttention scatter overhead, optimization proposals |
## 🎯 Concurrent Execution Examples
### ✅ CORRECT WORKFLOW: MCP Coordinates, Claude Code Executes

```javascript
// Step 1: MCP tools set up coordination (optional, for complex tasks)
[Single Message - Coordination Setup]:
  mcp__claude-flow__swarm_init { topology: "mesh", maxAgents: 6 }
  mcp__claude-flow__agent_spawn { type: "researcher" }
  mcp__claude-flow__agent_spawn { type: "coder" }
  mcp__claude-flow__agent_spawn { type: "tester" }

// Step 2: Claude Code Task tool spawns ACTUAL agents that do the work
[Single Message - Parallel Agent Execution]:
  // Claude Code's Task tool spawns real agents concurrently
  Task("Research agent", "Analyze API requirements and best practices. Check memory for prior decisions.", "researcher")
  Task("Coder agent", "Implement REST endpoints with authentication. Coordinate via hooks.", "coder")
  Task("Database agent", "Design and implement database schema. Store decisions in memory.", "code-analyzer")
  Task("Tester agent", "Create comprehensive test suite with 90% coverage.", "tester")
  Task("Reviewer agent", "Review code quality and security. Document findings.", "reviewer")

  // Batch ALL todos in ONE call
  TodoWrite { todos: [
    {id: "1", content: "Research API patterns", status: "in_progress", priority: "high"},
    {id: "2", content: "Design database schema", status: "in_progress", priority: "high"},
    {id: "3", content: "Implement authentication", status: "pending", priority: "high"},
    {id: "4", content: "Build REST endpoints", status: "pending", priority: "high"},
    {id: "5", content: "Write unit tests", status: "pending", priority: "medium"},
    {id: "6", content: "Integration tests", status: "pending", priority: "medium"},
    {id: "7", content: "API documentation", status: "pending", priority: "low"},
    {id: "8", content: "Performance optimization", status: "pending", priority: "low"}
  ]}

  // Parallel file operations
  Bash "mkdir -p app/{src,tests,docs,config}"
  Write "app/package.json"
  Write "app/src/server.js"
  Write "app/tests/server.test.js"
  Write "app/docs/API.md"
```

### ❌ WRONG (Multiple Messages):

```javascript
Message 1: mcp__claude-flow__swarm_init
Message 2: Task("agent 1")
Message 3: TodoWrite { todos: [single todo] }
Message 4: Write "file.js"
// This breaks parallel coordination!
```

## Configuration

| Parameter | Default | Notes |
|-----------|---------|-------|
| `kvcache_block_size` | 4096 | Tokens per block |
| `max_num_batched_tokens` | 16384 | Set equal to `max_model_len` for long context |
| `gpu_memory_utilization` | 0.9 | GPU memory fraction |
| `enable_cpu_offload` | False | Enable for long context |
| `num_gpu_blocks` | 2 | GPU blocks for offload mode |
| `num_kv_buffers` | 4 | Ring buffer size for decode pipeline |
| `enforce_eager` | False | Set True to disable CUDA graphs |

## Benchmarking

**Files**: `bench.py` (GPU), `bench_offload.py` (CPU offload), `bench_vllm.py` (comparison)
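The `kvcache_block_size` and `num_gpu_blocks` settings above imply simple block arithmetic; a minimal sketch (the helper name is illustrative, not nano-vllm API):

```python
def num_kv_blocks(context_len: int, block_size: int = 4096) -> int:
    """KV-cache blocks needed for a context, with kvcache_block_size=4096."""
    return -(-context_len // block_size)  # ceiling division

# A 32K context needs 8 blocks; with num_gpu_blocks=2 in offload mode,
# only 2 of them are resident on the GPU at a time.
print(num_kv_blocks(32768))   # 8
print(num_kv_blocks(40960))   # 10
```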
**Common Issues**:

1. `max_num_batched_tokens < max_model_len`: Set them equal for long context
2. CUDA graph dimension mismatch: Ensure `input_len + output_len <= max_model_len`
3. RoPE out of bounds: Check the model's `max_position_embeddings` in config.json

**Model Limits**:

- Qwen3-0.6B/4B: 40960 tokens
- Qwen2.5-7B-Instruct-1M: 1048576 tokens
- Llama-3.1-8B-Instruct: 131072 tokens

**Performance (Qwen3-4B, CPU Offload)**:

- Prefill: ~5700-8000 tok/s (varies by context length)
- Decode with CUDA graph: ~50 tok/s (TPOT ~19 ms)
- Decode in eager mode: ~12 tok/s (TPOT ~80 ms)
- **CUDA graph speedup: 4x decode throughput**

## Performance Benefits

- **84.8% SWE-Bench solve rate**
- **32.3% token reduction**
- **2.8-4.4x speed improvement**
- **27+ neural models**
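The issues and numbers above can be sanity-checked in a few lines (the helper functions and the limits dict are illustrative, not part of the codebase):

```python
# Model context limits from the list above.
MODEL_MAX_LEN = {
    "Qwen3-4B": 40960,
    "Qwen2.5-7B-Instruct-1M": 1048576,
    "Llama-3.1-8B-Instruct": 131072,
}

def fits(model: str, input_len: int, output_len: int) -> bool:
    """Check issue 2: input_len + output_len must not exceed max_model_len."""
    return input_len + output_len <= MODEL_MAX_LEN[model]

def tok_per_s(tpot_ms: float) -> float:
    """Convert time-per-output-token (ms) to decode throughput (tok/s)."""
    return 1000.0 / tpot_ms

print(fits("Llama-3.1-8B-Instruct", 131072, 64))  # False: would overflow
print(round(tok_per_s(19)))   # ~53 tok/s, matching the CUDA graph figure
print(round(tok_per_s(80)))   # ~12 tok/s, matching the eager-mode figure
```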
## Hooks Integration

### Pre-Operation

- Auto-assign agents by file type
- Validate commands for safety
- Prepare resources automatically
- Optimize topology by complexity
- Cache searches

### Post-Operation

- Auto-format code
- Train neural patterns
- Update memory
- Analyze performance
- Track token usage

### Session Management

- Generate summaries
- Persist state
- Track metrics
- Restore context
- Export workflows
## Advanced Features (v2.0.0)

- 🚀 Automatic Topology Selection
- ⚡ Parallel Execution (2.8-4.4x speed)
- 🧠 Neural Training
- 📊 Bottleneck Analysis
- 🤖 Smart Auto-Spawning
- 🛡️ Self-Healing Workflows
- 💾 Cross-Session Memory
- 🔗 GitHub Integration

## Integration Tips

1. Start with basic swarm init
2. Scale agents gradually
3. Use memory for context
4. Monitor progress regularly
5. Train patterns from success
6. Enable hooks automation
7. Use GitHub tools first

## Support

- Documentation: https://github.com/ruvnet/claude-flow
- Issues: https://github.com/ruvnet/claude-flow/issues
- Flow-Nexus Platform: https://flow-nexus.ruv.io (registration required for cloud features)

---

**Author**: Zijie Tian

Remember: **Claude Flow coordinates, Claude Code creates!**
# Nano-vLLM Testing

## RULER NIAH Benchmark Test

Tests long-context retrieval capability using the RULER benchmark's NIAH (Needle-In-A-Haystack) task data (~32K tokens).

**Documentation**:

- [`docs/ruler_niah_standalone_test.md`](docs/ruler_niah_standalone_test.md) - Test setup and usage
- [`docs/offload_accuracy_issue.md`](docs/offload_accuracy_issue.md) - **[BUG]** Offload mode accuracy issue (66% vs 100%)

### Quick Start

```bash
# Single sample test (recommended for initial verification)
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --enable-offload

# All 5 samples
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --enable-offload \
    --sample-indices 0-4
```

### Options

| Option | Default | Description |
|--------|---------|-------------|
| `--model` | `~/models/Llama-3.1-8B-Instruct` | Model path |
| `--enable-offload` | False | Enable CPU offload (required for 24GB GPUs) |
| `--sample-indices` | all | Samples to test (e.g., `0,1,2` or `0-4`) |
| `--max-model-len` | 32768 | Maximum context length |
| `--use-cuda-graph` | False | Enable CUDA graph (faster decode) |
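For reference, NIAH retrieval accuracy reduces to a per-sample containment check; an illustrative helper (not the test script's actual implementation):

```python
def niah_accuracy(outputs, needles):
    """Fraction of samples whose generated output contains the expected needle."""
    hits = sum(needle in out for out, needle in zip(outputs, needles))
    return hits / len(needles)

# e.g. 2 of 3 needles retrieved -> 0.67, the kind of drop
# reported for offload mode in docs/offload_accuracy_issue.md
print(round(niah_accuracy(["a b needle1", "needle2 x", "oops"],
                          ["needle1", "needle2", "needle3"]), 2))  # 0.67
```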
---

# important-instruction-reminders

Do what has been asked; nothing more, nothing less.
NEVER create files unless they're absolutely necessary for achieving your goal.
ALWAYS prefer editing an existing file to creating a new one.
NEVER proactively create documentation files (*.md) or README files. Only create documentation files if explicitly requested by the User.
Never save working files, text/mds, and tests to the root folder.