From 8e0888c20cdfba1c0378a2462f2e99a0928a8518 Mon Sep 17 00:00:00 2001
From: Zijie Tian
Date: Mon, 12 Jan 2026 15:05:55 +0800
Subject: [PATCH] [docs] Added offload_acc issue.

---
 .gitignore                     |  27 +++
 CLAUDE.md                      | 431 +++++++++++++++++++++++++++------
 docs/offload_accuracy_issue.md | 239 ++++++++++++++++++
 3 files changed, 623 insertions(+), 74 deletions(-)
 create mode 100644 docs/offload_accuracy_issue.md

diff --git a/.gitignore b/.gitignore
index b5557b5..39b338c 100644
--- a/.gitignore
+++ b/.gitignore
@@ -197,3 +197,30 @@ cython_debug/
 results/
 outputs/
 .local/
+
+# Claude Flow generated files
+.claude/settings.local.json
+.mcp.json
+claude-flow.config.json
+.swarm/
+.hive-mind/
+.claude-flow/
+memory/
+coordination/
+memory/claude-flow-data.json
+memory/sessions/*
+!memory/sessions/README.md
+memory/agents/*
+!memory/agents/README.md
+coordination/memory_bank/*
+coordination/subtasks/*
+coordination/orchestration/*
+*.db
+*.db-journal
+*.db-wal
+*.sqlite
+*.sqlite-journal
+*.sqlite-wal
+claude-flow
+# Removed Windows wrapper files per user request
+hive-mind-prompt-*.txt
diff --git a/CLAUDE.md b/CLAUDE.md
index 38d83ec..de95981 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -1,106 +1,389 @@
-# CLAUDE.md
+# Claude Code Configuration - SPARC Development Environment
-This file provides guidance to Claude Code when working with this repository.
+## 🚨 CRITICAL: CONCURRENT EXECUTION & FILE MANAGEMENT
-## Overview
+**ABSOLUTE RULES**:
+1. ALL operations MUST be concurrent/parallel in a single message
+2. **NEVER save working files, text/mds and tests to the root folder**
+3. ALWAYS organize files in appropriate subdirectories
+4. **USE CLAUDE CODE'S TASK TOOL** for spawning agents concurrently, not just MCP
-Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline LLM inference. Supports multiple model architectures (Qwen3, Qwen2, Llama) with CPU offload for long-context inference.
+### ⚑ GOLDEN RULE: "1 MESSAGE = ALL RELATED OPERATIONS" -## GPU Mutex for Multi-Instance Debugging +**MANDATORY PATTERNS:** +- **TodoWrite**: ALWAYS batch ALL todos in ONE call (5-10+ todos minimum) +- **Task tool (Claude Code)**: ALWAYS spawn ALL agents in ONE message with full instructions +- **File operations**: ALWAYS batch ALL reads/writes/edits in ONE message +- **Bash commands**: ALWAYS batch ALL terminal operations in ONE message +- **Memory operations**: ALWAYS batch ALL memory store/retrieve in ONE message -**IMPORTANT**: When running multiple Claude instances for parallel debugging, different rules apply based on script type: +### 🎯 CRITICAL: Claude Code Task Tool for Agent Execution -### Benchmarks (`bench*.py`) - Exclusive GPU Access Required - -Before running any `bench*.py` script, Claude MUST wait for exclusive GPU access: - -```bash -# Check and wait for GPU to be free -while [ -n "$(nvidia-smi --query-compute-apps=pid --format=csv,noheader)" ]; do - echo "GPU busy, waiting 10s..." - sleep 10 -done +**Claude Code's Task tool is the PRIMARY way to spawn agents:** +```javascript +// βœ… CORRECT: Use Claude Code's Task tool for parallel agent execution +[Single Message]: + Task("Research agent", "Analyze requirements and patterns...", "researcher") + Task("Coder agent", "Implement core features...", "coder") + Task("Tester agent", "Create comprehensive tests...", "tester") + Task("Reviewer agent", "Review code quality...", "reviewer") + Task("Architect agent", "Design system architecture...", "system-architect") ``` -### Other Scripts (tests, examples) - Port Conflict Check Only +**MCP tools are ONLY for coordination setup:** +- `mcp__claude-flow__swarm_init` - Initialize coordination topology +- `mcp__claude-flow__agent_spawn` - Define agent types for coordination +- `mcp__claude-flow__task_orchestrate` - Orchestrate high-level workflows -For non-benchmark scripts, exclusive GPU access is NOT required. 
However, check for **distributed port conflicts** before running: +### πŸ“ File Organization Rules + +**NEVER save to root folder. Use these directories:** +- `/src` - Source code files +- `/tests` - Test files +- `/docs` - Documentation and markdown files +- `/config` - Configuration files +- `/scripts` - Utility scripts +- `/examples` - Example code + +## Project Overview + +This project uses SPARC (Specification, Pseudocode, Architecture, Refinement, Completion) methodology with Claude-Flow orchestration for systematic Test-Driven Development. + +## SPARC Commands + +### Core Commands +- `npx claude-flow sparc modes` - List available modes +- `npx claude-flow sparc run ""` - Execute specific mode +- `npx claude-flow sparc tdd ""` - Run complete TDD workflow +- `npx claude-flow sparc info ` - Get mode details + +### Batchtools Commands +- `npx claude-flow sparc batch ""` - Parallel execution +- `npx claude-flow sparc pipeline ""` - Full pipeline processing +- `npx claude-flow sparc concurrent ""` - Multi-task processing + +### Build Commands +- `npm run build` - Build project +- `npm run test` - Run tests +- `npm run lint` - Linting +- `npm run typecheck` - Type checking + +## SPARC Workflow Phases + +1. **Specification** - Requirements analysis (`sparc run spec-pseudocode`) +2. **Pseudocode** - Algorithm design (`sparc run spec-pseudocode`) +3. **Architecture** - System design (`sparc run architect`) +4. **Refinement** - TDD implementation (`sparc tdd`) +5. 
**Completion** - Integration (`sparc run integration`) + +## Code Style & Best Practices + +- **Modular Design**: Files under 500 lines +- **Environment Safety**: Never hardcode secrets +- **Test-First**: Write tests before implementation +- **Clean Architecture**: Separate concerns +- **Documentation**: Keep updated + +## πŸš€ Available Agents (54 Total) + +### Core Development +`coder`, `reviewer`, `tester`, `planner`, `researcher` + +### Swarm Coordination +`hierarchical-coordinator`, `mesh-coordinator`, `adaptive-coordinator`, `collective-intelligence-coordinator`, `swarm-memory-manager` + +### Consensus & Distributed +`byzantine-coordinator`, `raft-manager`, `gossip-coordinator`, `consensus-builder`, `crdt-synchronizer`, `quorum-manager`, `security-manager` + +### Performance & Optimization +`perf-analyzer`, `performance-benchmarker`, `task-orchestrator`, `memory-coordinator`, `smart-agent` + +### GitHub & Repository +`github-modes`, `pr-manager`, `code-review-swarm`, `issue-tracker`, `release-manager`, `workflow-automation`, `project-board-sync`, `repo-architect`, `multi-repo-swarm` + +### SPARC Methodology +`sparc-coord`, `sparc-coder`, `specification`, `pseudocode`, `architecture`, `refinement` + +### Specialized Development +`backend-dev`, `mobile-dev`, `ml-developer`, `cicd-engineer`, `api-docs`, `system-architect`, `code-analyzer`, `base-template-generator` + +### Testing & Validation +`tdd-london-swarm`, `production-validator` + +### Migration & Planning +`migration-planner`, `swarm-init` + +## 🎯 Claude Code vs MCP Tools + +### Claude Code Handles ALL EXECUTION: +- **Task tool**: Spawn and run agents concurrently for actual work +- File operations (Read, Write, Edit, MultiEdit, Glob, Grep) +- Code generation and programming +- Bash commands and system operations +- Implementation work +- Project navigation and analysis +- TodoWrite and task management +- Git operations +- Package management +- Testing and debugging + +### MCP Tools ONLY COORDINATE: +- 
Swarm initialization (topology setup) +- Agent type definitions (coordination patterns) +- Task orchestration (high-level planning) +- Memory management +- Neural features +- Performance tracking +- GitHub integration + +**KEY**: MCP coordinates the strategy, Claude Code's Task tool executes with real agents. + +## πŸš€ Quick Setup ```bash -# Check if port 29500 (default torch distributed port) is in use -if lsof -i :29500 >/dev/null 2>&1; then - echo "Port 29500 in use, waiting 10s..." - sleep 10 -fi +# Add MCP servers (Claude Flow required, others optional) +claude mcp add claude-flow npx claude-flow@alpha mcp start +claude mcp add ruv-swarm npx ruv-swarm mcp start # Optional: Enhanced coordination +claude mcp add flow-nexus npx flow-nexus@latest mcp start # Optional: Cloud features ``` -**Note**: nanovllm's distributed port handling is not yet robust - two processes competing for the same port will cause errors. This check prevents that issue. +## MCP Tool Categories -## Multi-Instance Development with PYTHONPATH +### Coordination +`swarm_init`, `agent_spawn`, `task_orchestrate` -**IMPORTANT**: When running multiple Claude instances on different worktrees, do NOT use `pip install -e .` globally as it will affect other instances. 
+### Monitoring +`swarm_status`, `agent_list`, `agent_metrics`, `task_status`, `task_results` -**Use PYTHONPATH directly** - no pip install needed: +### Memory & Neural +`memory_usage`, `neural_status`, `neural_train`, `neural_patterns` +### GitHub Integration +`github_swarm`, `repo_analyze`, `pr_enhance`, `issue_triage`, `code_review` + +### System +`benchmark_run`, `features_detect`, `swarm_monitor` + +### Flow-Nexus MCP Tools (Optional Advanced Features) +Flow-Nexus extends MCP capabilities with 70+ cloud-based orchestration tools: + +**Key MCP Tool Categories:** +- **Swarm & Agents**: `swarm_init`, `swarm_scale`, `agent_spawn`, `task_orchestrate` +- **Sandboxes**: `sandbox_create`, `sandbox_execute`, `sandbox_upload` (cloud execution) +- **Templates**: `template_list`, `template_deploy` (pre-built project templates) +- **Neural AI**: `neural_train`, `neural_patterns`, `seraphina_chat` (AI assistant) +- **GitHub**: `github_repo_analyze`, `github_pr_manage` (repository management) +- **Real-time**: `execution_stream_subscribe`, `realtime_subscribe` (live monitoring) +- **Storage**: `storage_upload`, `storage_list` (cloud file management) + +**Authentication Required:** +- Register: `mcp__flow-nexus__user_register` or `npx flow-nexus@latest register` +- Login: `mcp__flow-nexus__user_login` or `npx flow-nexus@latest login` +- Access 70+ specialized MCP tools for advanced orchestration + +## πŸš€ Agent Execution Flow with Claude Code + +### The Correct Pattern: + +1. **Optional**: Use MCP tools to set up coordination topology +2. **REQUIRED**: Use Claude Code's Task tool to spawn agents that do actual work +3. **REQUIRED**: Each agent runs hooks for coordination +4. **REQUIRED**: Batch all operations in single messages + +### Example Full-Stack Development: + +```javascript +// Single message with all agent spawning via Claude Code's Task tool +[Parallel Agent Execution]: + Task("Backend Developer", "Build REST API with Express. 
Use hooks for coordination.", "backend-dev") + Task("Frontend Developer", "Create React UI. Coordinate with backend via memory.", "coder") + Task("Database Architect", "Design PostgreSQL schema. Store schema in memory.", "code-analyzer") + Task("Test Engineer", "Write Jest tests. Check memory for API contracts.", "tester") + Task("DevOps Engineer", "Setup Docker and CI/CD. Document in memory.", "cicd-engineer") + Task("Security Auditor", "Review authentication. Report findings via hooks.", "reviewer") + + // All todos batched together + TodoWrite { todos: [...8-10 todos...] } + + // All file operations together + Write "backend/server.js" + Write "frontend/App.jsx" + Write "database/schema.sql" +``` + +## πŸ“‹ Agent Coordination Protocol + +### Every Agent Spawned via Task Tool MUST: + +**1️⃣ BEFORE Work:** ```bash -# Set PYTHONPATH to point to the project root directory -PYTHONPATH=/path/to/your/worktree:$PYTHONPATH python - -# Example: running tests -PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH python tests/test_needle.py +npx claude-flow@alpha hooks pre-task --description "[task]" +npx claude-flow@alpha hooks session-restore --session-id "swarm-[id]" ``` -**Benefits**: -- No `pip install` required -- Code changes take effect immediately (no reinstall needed) -- Each worktree is completely isolated +**2️⃣ DURING Work:** +```bash +npx claude-flow@alpha hooks post-edit --file "[file]" --memory-key "swarm/[agent]/[step]" +npx claude-flow@alpha hooks notify --message "[what was done]" +``` -## Documentation Index +**3️⃣ AFTER Work:** +```bash +npx claude-flow@alpha hooks post-task --task-id "[task]" +npx claude-flow@alpha hooks session-end --export-metrics true +``` -| Document | Purpose | -|----------|---------| -| [`docs/architecture_guide.md`](docs/architecture_guide.md) | Core components, layer-wise CPU offload design, prefill/decode flows, implementation details | -| [`docs/multi_model_support.md`](docs/multi_model_support.md) | Model registry system, 
adding new models (Qwen3/Llama), architecture differences, RoPE scaling | -| [`docs/cuda_graph_offload_guide.md`](docs/cuda_graph_offload_guide.md) | CUDA graph support for CPU offload decode path, 4x decode speedup | -| [`docs/sparse_attention_guide.md`](docs/sparse_attention_guide.md) | Block sparse attention methods (MInference, FlexPrefill, XAttention, Quest), computation flow | -| [`docs/sparse_prefill_integration_plan.md`](docs/sparse_prefill_integration_plan.md) | Integration plan for MInference/XAttention/FlexPrefill with unified BlockMask interface | -| [`docs/sparse_offload_integration.md`](docs/sparse_offload_integration.md) | Sparse policy integration with layerwise offload, `requires_block_selection` interface design | -| [`docs/layerwise_offload_memory_analysis.md`](docs/layerwise_offload_memory_analysis.md) | Memory allocation analysis with theoretical formulas and empirical validation (< 5% error) | -| [`docs/debugging_guide.md`](docs/debugging_guide.md) | PyTorch hooks for debugging, tensor comparison, memory profiling | -| [`docs/gpu_only_performance_issue.md`](docs/gpu_only_performance_issue.md) | GPU-only mode slower than offload due to PagedAttention scatter overhead, optimization proposals | +## 🎯 Concurrent Execution Examples -## Configuration +### βœ… CORRECT WORKFLOW: MCP Coordinates, Claude Code Executes -| Parameter | Default | Notes | -|-----------|---------|-------| -| `kvcache_block_size` | 4096 | Tokens per block | -| `max_num_batched_tokens` | 16384 | Set = max_model_len for long context | -| `gpu_memory_utilization` | 0.9 | GPU memory fraction | -| `enable_cpu_offload` | False | Enable for long context | -| `num_gpu_blocks` | 2 | GPU blocks for offload mode | -| `num_kv_buffers` | 4 | Ring buffer size for decode pipeline | -| `enforce_eager` | False | Set True to disable CUDA graphs | +```javascript +// Step 1: MCP tools set up coordination (optional, for complex tasks) +[Single Message - Coordination Setup]: + 
mcp__claude-flow__swarm_init { topology: "mesh", maxAgents: 6 } + mcp__claude-flow__agent_spawn { type: "researcher" } + mcp__claude-flow__agent_spawn { type: "coder" } + mcp__claude-flow__agent_spawn { type: "tester" } -## Benchmarking +// Step 2: Claude Code Task tool spawns ACTUAL agents that do the work +[Single Message - Parallel Agent Execution]: + // Claude Code's Task tool spawns real agents concurrently + Task("Research agent", "Analyze API requirements and best practices. Check memory for prior decisions.", "researcher") + Task("Coder agent", "Implement REST endpoints with authentication. Coordinate via hooks.", "coder") + Task("Database agent", "Design and implement database schema. Store decisions in memory.", "code-analyzer") + Task("Tester agent", "Create comprehensive test suite with 90% coverage.", "tester") + Task("Reviewer agent", "Review code quality and security. Document findings.", "reviewer") + + // Batch ALL todos in ONE call + TodoWrite { todos: [ + {id: "1", content: "Research API patterns", status: "in_progress", priority: "high"}, + {id: "2", content: "Design database schema", status: "in_progress", priority: "high"}, + {id: "3", content: "Implement authentication", status: "pending", priority: "high"}, + {id: "4", content: "Build REST endpoints", status: "pending", priority: "high"}, + {id: "5", content: "Write unit tests", status: "pending", priority: "medium"}, + {id: "6", content: "Integration tests", status: "pending", priority: "medium"}, + {id: "7", content: "API documentation", status: "pending", priority: "low"}, + {id: "8", content: "Performance optimization", status: "pending", priority: "low"} + ]} + + // Parallel file operations + Bash "mkdir -p app/{src,tests,docs,config}" + Write "app/package.json" + Write "app/src/server.js" + Write "app/tests/server.test.js" + Write "app/docs/API.md" +``` -**Files**: `bench.py` (GPU), `bench_offload.py` (CPU offload), `bench_vllm.py` (comparison) +### ❌ WRONG (Multiple Messages): 
+```javascript +Message 1: mcp__claude-flow__swarm_init +Message 2: Task("agent 1") +Message 3: TodoWrite { todos: [single todo] } +Message 4: Write "file.js" +// This breaks parallel coordination! +``` -**Common Issues**: -1. `max_num_batched_tokens < max_model_len`: Set equal for long context -2. CUDA graph dimension mismatch: Ensure `input_len + output_len <= max_model_len` -3. RoPE out of bounds: Check model's `max_position_embeddings` in config.json +## Performance Benefits -**Model Limits**: -- Qwen3-0.6B/4B: 40960 tokens -- Qwen2.5-7B-Instruct-1M: 1048576 tokens -- Llama-3.1-8B-Instruct: 131072 tokens +- **84.8% SWE-Bench solve rate** +- **32.3% token reduction** +- **2.8-4.4x speed improvement** +- **27+ neural models** -**Performance (Qwen3-4B, CPU Offload)**: -- Prefill: ~5700-8000 tok/s (varies by context length) -- Decode with CUDA Graph: ~50 tok/s (TPOT ~19ms) -- Decode Eager Mode: ~12 tok/s (TPOT ~80ms) -- **CUDA Graph speedup: 4x decode throughput** +## Hooks Integration + +### Pre-Operation +- Auto-assign agents by file type +- Validate commands for safety +- Prepare resources automatically +- Optimize topology by complexity +- Cache searches + +### Post-Operation +- Auto-format code +- Train neural patterns +- Update memory +- Analyze performance +- Track token usage + +### Session Management +- Generate summaries +- Persist state +- Track metrics +- Restore context +- Export workflows + +## Advanced Features (v2.0.0) + +- πŸš€ Automatic Topology Selection +- ⚑ Parallel Execution (2.8-4.4x speed) +- 🧠 Neural Training +- πŸ“Š Bottleneck Analysis +- πŸ€– Smart Auto-Spawning +- πŸ›‘οΈ Self-Healing Workflows +- πŸ’Ύ Cross-Session Memory +- πŸ”— GitHub Integration + +## Integration Tips + +1. Start with basic swarm init +2. Scale agents gradually +3. Use memory for context +4. Monitor progress regularly +5. Train patterns from success +6. Enable hooks automation +7. 
Use GitHub tools first
+
+## Support
+
+- Documentation: https://github.com/ruvnet/claude-flow
+- Issues: https://github.com/ruvnet/claude-flow/issues
+- Flow-Nexus Platform: https://flow-nexus.ruv.io (registration required for cloud features)

 ---

-**Author**: Zijie Tian
+Remember: **Claude Flow coordinates, Claude Code creates!**
+
+# Nano-vLLM Testing
+
+## RULER NIAH Benchmark Test
+
+Tests long context retrieval capability using RULER benchmark's NIAH (Needle-In-A-Haystack) task data (~32K tokens).
+
+**Documentation**:
+- [`docs/ruler_niah_standalone_test.md`](docs/ruler_niah_standalone_test.md) - Test setup and usage
+- [`docs/offload_accuracy_issue.md`](docs/offload_accuracy_issue.md) - **[BUG]** Offload mode accuracy issue (66% vs 100%)
+
+### Quick Start
+
+```bash
+# Single sample test (recommended for initial verification)
+CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
+  --model ~/models/Llama-3.1-8B-Instruct \
+  --enable-offload
+
+# All 5 samples
+CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
+  --model ~/models/Llama-3.1-8B-Instruct \
+  --enable-offload \
+  --sample-indices 0-4
+```
+
+### Options
+
+| Option | Default | Description |
+|--------|---------|-------------|
+| `--model` | `~/models/Llama-3.1-8B-Instruct` | Model path |
+| `--enable-offload` | False | Enable CPU offload (required for 24GB GPUs) |
+| `--sample-indices` | all | Samples to test (e.g., `0,1,2` or `0-4`) |
+| `--max-model-len` | 32768 | Maximum context length |
+| `--use-cuda-graph` | False | Enable CUDA graph (faster decode) |
+
+---
+
+# important-instruction-reminders
+Do what has been asked; nothing more, nothing less.
+NEVER create files unless they're absolutely necessary for achieving your goal.
+ALWAYS prefer editing an existing file to creating a new one.
+NEVER proactively create documentation files (*.md) or README files. Only create documentation files if explicitly requested by the User.
Never save working files, text/mds and tests to the root folder.
diff --git a/docs/offload_accuracy_issue.md b/docs/offload_accuracy_issue.md
new file mode 100644
index 0000000..febadea
--- /dev/null
+++ b/docs/offload_accuracy_issue.md
@@ -0,0 +1,239 @@
+# CPU Offload Accuracy Issue Investigation
+
+## Problem Summary
+
+CPU offload mode produces significantly lower accuracy than non-offload mode on the RULER NIAH benchmark.
+
+| Mode | Accuracy | Pass/Total |
+|------|----------|------------|
+| **Non-Offload (GPU only)** | **100%** | 100/100 |
+| **CPU Offload** | **66%** | 66/100 |
+
+This 34-percentage-point drop indicates a bug in the offload implementation that affects inference correctness.
+
+## Test Environment
+
+- **Model**: Llama-3.1-8B-Instruct
+- **Task**: RULER NIAH (Needle-In-A-Haystack) 32K context
+- **GPU**: NVIDIA A100-SXM4-80GB
+- **Data**: `tests/data/ruler_niah/niah_single_1_32k.jsonl` (100 samples)
+
+## Reproduction Commands
+
+### Non-Offload Mode (100% accuracy)
+
+```bash
+CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
+  --model ~/models/Llama-3.1-8B-Instruct \
+  --gpu-utilization 0.7 \
+  --quiet
+```
+
+**Configuration**:
+- KV Cache: GPU only, 51 blocks (6528 MB)
+- Block size: 1024 tokens
+
+### Offload Mode (66% accuracy)
+
+```bash
+CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
+  --model ~/models/Llama-3.1-8B-Instruct \
+  --enable-offload \
+  --quiet
+```
+
+**Configuration**:
+- KV Cache: GPU 4 blocks (512 MB) + CPU 32 blocks (4096 MB)
+- Ring buffer: 4 buffers × 33280 tokens (520 MB)
+- Per-layer decode buffer: 128 MB
+- Block size: 1024 tokens
+
+## Observed Failure Patterns
+
+From the 5-sample verbose test:
+
+| Sample | Expected | Offload Output | Status |
+|--------|----------|----------------|--------|
+| 0 | 8930103 | `: 8930103.` | PASS |
+| 1 | 4194548 | `: 419 multiplication of 4548.` | **FAIL** |
+| 2 | 8231838 | `:Π½ΠΎΠ΅ 8231838.` | PASS |
+| 3 | 8835373 | `: 8835373.` | PASS |
+| 4 | 7754864 | `aster 7754864.` | PASS |
+
+**Failure pattern**: The model sometimes produces corrupted or split outputs (e.g., "419 multiplication of 4548" instead of "4194548"). Note that even some passing samples emit stray tokens before the needle (samples 2 and 4), which also suggests a corrupted context.
+
+## Architecture Overview
+
+### Offload Mode Data Flow
+
+```
+Prefill Phase:
+1. Input tokens → chunked into 2048-token chunks
+2. Each chunk processed layer by layer:
+   - Load KV from CPU → GPU ring buffer
+   - Compute attention
+   - Store KV back to CPU
+3. Ring buffer holds recent KV for decode
+
+Decode Phase:
+1. For each new token:
+   - Load all layer KV from CPU (one layer at a time)
+   - Compute attention against full context
+   - Generate next token
+```
+
+### Key Components
+
+| File | Component | Description |
+|------|-----------|-------------|
+| `nanovllm/kvcache/offload_engine.py` | `OffloadEngine` | Manages CPU↔GPU KV cache transfers |
+| `nanovllm/kvcache/offload_engine.py` | `RingKVBuffer` | GPU ring buffer for recent KV |
+| `nanovllm/engine/model_runner.py` | `run_chunked_offload_prefill()` | Chunked prefill with offload |
+| `nanovllm/engine/model_runner.py` | `run_offload_decode()` | Layer-wise decode with offload |
+| `nanovllm/kvcache/hybrid_manager.py` | `HybridBlockManager` | CPU block allocation |
+
+## Potential Root Causes
+
+### 1. Ring Buffer Index/Position Issues
+
+**Location**: `nanovllm/kvcache/offload_engine.py`
+
+The ring buffer uses modulo indexing. Potential issues:
+- Position calculation errors during the prefill/decode transition
+- Off-by-one errors in KV storage/retrieval
+- Incorrect handling when sequence length approaches `max_seq_len`
+
+**Recent fix applied**: `max_seq_len = max_model_len + 512` to prevent overflow, but there may be other indexing issues.
+
+### 2. Chunked Prefill KV Storage
+
+**Location**: `nanovllm/engine/model_runner.py:run_chunked_offload_prefill()`
+
+During chunked prefill:
+- KV computed for chunk N must be fully stored before chunk N+1 is processed
+- Position IDs must be correctly accumulated across chunks
+- CPU block allocation must be contiguous and correctly tracked
+
+**Suspect areas**:
+```python
+# Check if positions are correctly tracked across chunks
+# Check if KV is correctly copied to CPU after each chunk
+# Check if ring buffer indices align with CPU block indices
+```
+
+### 3. Decode Phase KV Loading
+
+**Location**: `nanovllm/engine/model_runner.py:run_offload_decode()`
+
+During decode:
+- Must load KV for ALL previous tokens (both prefill and decode)
+- Layer-by-layer loading must be synchronized correctly
+- Attention computation must use the correct sequence length
+
+**Suspect areas**:
+```python
+# Check if decode loads KV for the full context length
+# Check if new decode KV is stored correctly
+# Check if attention mask/positions are correct
+```
+
+### 4. CPU↔GPU Transfer Synchronization
+
+**Location**: `nanovllm/kvcache/offload_engine.py`
+
+CUDA streams and synchronization:
+- Async copies may complete out of order
+- Missing synchronization points could cause stale data to be read
+- Stream priorities may affect correctness
+
+### 5. Numerical Precision
+
+- CPU tensors use float16/bfloat16
+- GPU computation precision may differ
+- Potential precision loss during transfers
+
+Precision alone is an unlikely explanation: rounding error would degrade outputs gradually, whereas the observed failures are structural (split/corrupted tokens), which points to indexing or synchronization instead.
+
+## Debugging Strategy
+
+### Step 1: Identify Failing Samples
+
+```bash
+# Run verbose mode to see which samples fail
+CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
+  --model ~/models/Llama-3.1-8B-Instruct \
+  --enable-offload \
+  --verbose 2>&1 | tee offload_verbose.log
+```
+
+### Step 2: Compare Token-by-Token
+
+Create a debug script to compare token generation between offload and non-offload modes for a failing sample:
+
+```python
+# Compare logits at each decode step
+# Check if divergence starts at a specific position
+# Log KV cache contents at divergence point
+```
+
+### Step 3: Verify KV Cache Contents
+
+Add debugging to `OffloadEngine`:
+
+```python
+# In store_kv(): log what is being stored
+# In load_kv(): log what is being loaded
+# Compare loaded KV with expected values
+```
+
+### Step 4: Check Position/Index Calculations
+
+```python
+# Log ring buffer write/read positions
+# Log CPU block indices
+# Verify position IDs match actual token positions
+```
+
+### Step 5: Isolate the Bug
+
+1. Test with shorter sequences (16K, 8K) to see if the issue is length-dependent
+2. Test with a single chunk (no chunking) to isolate chunked prefill
+3. Test prefill-only (no decode) to isolate the decode phase
+
+## Quick Debugging Commands
+
+```bash
+# Test single failing sample with verbose output
+CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
+  --model ~/models/Llama-3.1-8B-Instruct \
+  --enable-offload \
+  --sample-indices 1 \
+  --verbose
+
+# Test with different context lengths
+CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
+  --model ~/models/Llama-3.1-8B-Instruct \
+  --enable-offload \
+  --max-model-len 16384 \
+  --verbose
+```
+
+## Related Documentation
+
+- [`docs/ruler_niah_standalone_test.md`](ruler_niah_standalone_test.md) - Test setup and background
+- [`docs/layerwise_offload_memory_analysis.md`](layerwise_offload_memory_analysis.md) - Memory allocation analysis
+
+## Test Results Log
+
+**Date**: 2025-01-12
+
+| Test | Mode | Samples | Passed | Accuracy |
+|------|------|---------|--------|----------|
+| RULER NIAH 32K | Non-Offload | 100 | 100 | 100% |
+| RULER NIAH 32K | CPU Offload | 100 | 66 | 66% |
+
+## Next Steps
+
+1. [ ] Identify a pattern in the failing samples (position of the needle? specific numbers?)
+2. [ ] Add detailed logging to the offload engine
+3. [ ] Compare logits between offload and non-offload modes
+4. [ ] Bisect the code to find the exact bug location
+5. [ ] Write a unit test that isolates the bug
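
## Appendix: Divergence-Finding Sketch

The token-by-token comparison in Step 2 of the debugging strategy can be fleshed out independently of the engine. The sketch below is a minimal, self-contained example: it assumes the two runs (offload and non-offload) have already dumped their generated token IDs and, optionally, per-step logits to disk; `first_divergence` and `max_logit_gap` are hypothetical helper names, and nanovllm itself is deliberately not imported.

```python
# Hypothetical helpers for Step 2: locate the first decode step where the
# offload and non-offload runs diverge. Inputs are assumed to come from
# dumps produced by two separate runs (one per mode).
from typing import Optional, Sequence


def first_divergence(ref_ids: Sequence[int], test_ids: Sequence[int]) -> Optional[int]:
    """Index of the first differing token, or None if the sequences match."""
    for i, (a, b) in enumerate(zip(ref_ids, test_ids)):
        if a != b:
            return i
    if len(ref_ids) != len(test_ids):
        # One sequence is a strict prefix of the other.
        return min(len(ref_ids), len(test_ids))
    return None


def max_logit_gap(ref_logits: Sequence[Sequence[float]],
                  test_logits: Sequence[Sequence[float]]) -> float:
    """Largest absolute element-wise logit difference across decode steps.
    A sudden jump at one step suggests stale or mis-indexed KV there."""
    gap = 0.0
    for step_ref, step_test in zip(ref_logits, test_logits):
        for x, y in zip(step_ref, step_test):
            gap = max(gap, abs(x - y))
    return gap


if __name__ == "__main__":
    # Toy data standing in for dumps from the two modes.
    gpu_only = [101, 220, 415, 948, 7]
    offload = [101, 220, 413, 948, 7]
    print(first_divergence(gpu_only, offload))  # prints 2
```

If the returned index tends to fall near a chunk boundary (a multiple of the 2048-token chunk size or the 1024-token block size), that would be strong evidence for the chunked-prefill or block-indexing hypotheses above.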