[docs] Added offload_acc issue.

This commit is contained in:
Zijie Tian
2026-01-12 15:05:55 +08:00
parent a6cc703d73
commit 8e0888c20c
3 changed files with 623 additions and 74 deletions

27
.gitignore vendored

@@ -197,3 +197,30 @@ cython_debug/
results/
outputs/
.local/
# Claude Flow generated files
.claude/settings.local.json
.mcp.json
claude-flow.config.json
.swarm/
.hive-mind/
.claude-flow/
memory/
coordination/
memory/claude-flow-data.json
memory/sessions/*
!memory/sessions/README.md
memory/agents/*
!memory/agents/README.md
coordination/memory_bank/*
coordination/subtasks/*
coordination/orchestration/*
*.db
*.db-journal
*.db-wal
*.sqlite
*.sqlite-journal
*.sqlite-wal
claude-flow
# Removed Windows wrapper files per user request
hive-mind-prompt-*.txt

431
CLAUDE.md

@@ -1,106 +1,389 @@
# CLAUDE.md
# Claude Code Configuration - SPARC Development Environment
This file provides guidance to Claude Code when working with this repository.
## 🚨 CRITICAL: CONCURRENT EXECUTION & FILE MANAGEMENT
## Overview
**ABSOLUTE RULES**:
1. ALL operations MUST be concurrent/parallel in a single message
2. **NEVER save working files, text/mds and tests to the root folder**
3. ALWAYS organize files in appropriate subdirectories
4. **USE CLAUDE CODE'S TASK TOOL** for spawning agents concurrently, not just MCP
Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline LLM inference. It supports multiple model architectures (Qwen3, Qwen2, Llama) with CPU offload for long-context inference.
### ⚡ GOLDEN RULE: "1 MESSAGE = ALL RELATED OPERATIONS"
## GPU Mutex for Multi-Instance Debugging
**MANDATORY PATTERNS:**
- **TodoWrite**: ALWAYS batch ALL todos in ONE call (5-10+ todos minimum)
- **Task tool (Claude Code)**: ALWAYS spawn ALL agents in ONE message with full instructions
- **File operations**: ALWAYS batch ALL reads/writes/edits in ONE message
- **Bash commands**: ALWAYS batch ALL terminal operations in ONE message
- **Memory operations**: ALWAYS batch ALL memory store/retrieve in ONE message
**IMPORTANT**: When running multiple Claude instances for parallel debugging, different rules apply based on script type:
### 🎯 CRITICAL: Claude Code Task Tool for Agent Execution
### Benchmarks (`bench*.py`) - Exclusive GPU Access Required
Before running any `bench*.py` script, Claude MUST wait for exclusive GPU access:
```bash
# Check and wait for GPU to be free
while [ -n "$(nvidia-smi --query-compute-apps=pid --format=csv,noheader)" ]; do
  echo "GPU busy, waiting 10s..."
  sleep 10
done
```

**Claude Code's Task tool is the PRIMARY way to spawn agents:**
```javascript
// ✅ CORRECT: Use Claude Code's Task tool for parallel agent execution
[Single Message]:
Task("Research agent", "Analyze requirements and patterns...", "researcher")
Task("Coder agent", "Implement core features...", "coder")
Task("Tester agent", "Create comprehensive tests...", "tester")
Task("Reviewer agent", "Review code quality...", "reviewer")
Task("Architect agent", "Design system architecture...", "system-architect")
```
### Other Scripts (tests, examples) - Port Conflict Check Only
**MCP tools are ONLY for coordination setup:**
- `mcp__claude-flow__swarm_init` - Initialize coordination topology
- `mcp__claude-flow__agent_spawn` - Define agent types for coordination
- `mcp__claude-flow__task_orchestrate` - Orchestrate high-level workflows
For non-benchmark scripts, exclusive GPU access is NOT required. However, check for **distributed port conflicts** before running (see the port check under Quick Setup below).
### 📁 File Organization Rules
**NEVER save to root folder. Use these directories:**
- `/src` - Source code files
- `/tests` - Test files
- `/docs` - Documentation and markdown files
- `/config` - Configuration files
- `/scripts` - Utility scripts
- `/examples` - Example code
## Project Overview
This project uses SPARC (Specification, Pseudocode, Architecture, Refinement, Completion) methodology with Claude-Flow orchestration for systematic Test-Driven Development.
## SPARC Commands
### Core Commands
- `npx claude-flow sparc modes` - List available modes
- `npx claude-flow sparc run <mode> "<task>"` - Execute specific mode
- `npx claude-flow sparc tdd "<feature>"` - Run complete TDD workflow
- `npx claude-flow sparc info <mode>` - Get mode details
### Batchtools Commands
- `npx claude-flow sparc batch <modes> "<task>"` - Parallel execution
- `npx claude-flow sparc pipeline "<task>"` - Full pipeline processing
- `npx claude-flow sparc concurrent <mode> "<tasks-file>"` - Multi-task processing
### Build Commands
- `npm run build` - Build project
- `npm run test` - Run tests
- `npm run lint` - Linting
- `npm run typecheck` - Type checking
## SPARC Workflow Phases
1. **Specification** - Requirements analysis (`sparc run spec-pseudocode`)
2. **Pseudocode** - Algorithm design (`sparc run spec-pseudocode`)
3. **Architecture** - System design (`sparc run architect`)
4. **Refinement** - TDD implementation (`sparc tdd`)
5. **Completion** - Integration (`sparc run integration`)
## Code Style & Best Practices
- **Modular Design**: Files under 500 lines
- **Environment Safety**: Never hardcode secrets
- **Test-First**: Write tests before implementation
- **Clean Architecture**: Separate concerns
- **Documentation**: Keep updated
## 🚀 Available Agents (54 Total)
### Core Development
`coder`, `reviewer`, `tester`, `planner`, `researcher`
### Swarm Coordination
`hierarchical-coordinator`, `mesh-coordinator`, `adaptive-coordinator`, `collective-intelligence-coordinator`, `swarm-memory-manager`
### Consensus & Distributed
`byzantine-coordinator`, `raft-manager`, `gossip-coordinator`, `consensus-builder`, `crdt-synchronizer`, `quorum-manager`, `security-manager`
### Performance & Optimization
`perf-analyzer`, `performance-benchmarker`, `task-orchestrator`, `memory-coordinator`, `smart-agent`
### GitHub & Repository
`github-modes`, `pr-manager`, `code-review-swarm`, `issue-tracker`, `release-manager`, `workflow-automation`, `project-board-sync`, `repo-architect`, `multi-repo-swarm`
### SPARC Methodology
`sparc-coord`, `sparc-coder`, `specification`, `pseudocode`, `architecture`, `refinement`
### Specialized Development
`backend-dev`, `mobile-dev`, `ml-developer`, `cicd-engineer`, `api-docs`, `system-architect`, `code-analyzer`, `base-template-generator`
### Testing & Validation
`tdd-london-swarm`, `production-validator`
### Migration & Planning
`migration-planner`, `swarm-init`
## 🎯 Claude Code vs MCP Tools
### Claude Code Handles ALL EXECUTION:
- **Task tool**: Spawn and run agents concurrently for actual work
- File operations (Read, Write, Edit, MultiEdit, Glob, Grep)
- Code generation and programming
- Bash commands and system operations
- Implementation work
- Project navigation and analysis
- TodoWrite and task management
- Git operations
- Package management
- Testing and debugging
### MCP Tools ONLY COORDINATE:
- Swarm initialization (topology setup)
- Agent type definitions (coordination patterns)
- Task orchestration (high-level planning)
- Memory management
- Neural features
- Performance tracking
- GitHub integration
**KEY**: MCP coordinates the strategy, Claude Code's Task tool executes with real agents.
## 🚀 Quick Setup
```bash
# Add MCP servers (Claude Flow required, others optional)
claude mcp add claude-flow npx claude-flow@alpha mcp start
claude mcp add ruv-swarm npx ruv-swarm mcp start  # Optional: Enhanced coordination
claude mcp add flow-nexus npx flow-nexus@latest mcp start  # Optional: Cloud features
```

```bash
# Check if port 29500 (default torch distributed port) is in use
if lsof -i :29500 >/dev/null 2>&1; then
  echo "Port 29500 in use, waiting 10s..."
  sleep 10
fi
```
**Note**: nanovllm's distributed port handling is not yet robust - two processes competing for the same port will cause errors. This check prevents that issue.
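Until that handling is hardened, each instance can also pick a free port itself instead of polling. A minimal sketch in Python (`MASTER_PORT` is the standard `torch.distributed` environment override; `find_free_port` and `port_in_use` are illustrative helpers, not part of nanovllm):

```python
import os
import socket

def find_free_port() -> int:
    """Ask the OS for an unused TCP port instead of racing on 29500."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0 -> the OS assigns a free port
        return s.getsockname()[1]

def port_in_use(port: int) -> bool:
    """True if something is already listening on the port (like `lsof -i`)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex(("127.0.0.1", port)) == 0

if __name__ == "__main__":
    if port_in_use(29500):
        # Redirect this instance to a fresh port before launching the script
        os.environ["MASTER_PORT"] = str(find_free_port())
```

Setting `MASTER_PORT` per instance avoids the wait loop entirely when many worktrees run in parallel.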
## MCP Tool Categories
## Multi-Instance Development with PYTHONPATH
### Coordination
`swarm_init`, `agent_spawn`, `task_orchestrate`
**IMPORTANT**: When running multiple Claude instances on different worktrees, do NOT use `pip install -e .` globally as it will affect other instances.
### Monitoring
`swarm_status`, `agent_list`, `agent_metrics`, `task_status`, `task_results`
**Use PYTHONPATH directly** - no pip install needed:
### Memory & Neural
`memory_usage`, `neural_status`, `neural_train`, `neural_patterns`
### GitHub Integration
`github_swarm`, `repo_analyze`, `pr_enhance`, `issue_triage`, `code_review`
### System
`benchmark_run`, `features_detect`, `swarm_monitor`
### Flow-Nexus MCP Tools (Optional Advanced Features)
Flow-Nexus extends MCP capabilities with 70+ cloud-based orchestration tools:
**Key MCP Tool Categories:**
- **Swarm & Agents**: `swarm_init`, `swarm_scale`, `agent_spawn`, `task_orchestrate`
- **Sandboxes**: `sandbox_create`, `sandbox_execute`, `sandbox_upload` (cloud execution)
- **Templates**: `template_list`, `template_deploy` (pre-built project templates)
- **Neural AI**: `neural_train`, `neural_patterns`, `seraphina_chat` (AI assistant)
- **GitHub**: `github_repo_analyze`, `github_pr_manage` (repository management)
- **Real-time**: `execution_stream_subscribe`, `realtime_subscribe` (live monitoring)
- **Storage**: `storage_upload`, `storage_list` (cloud file management)
**Authentication Required:**
- Register: `mcp__flow-nexus__user_register` or `npx flow-nexus@latest register`
- Login: `mcp__flow-nexus__user_login` or `npx flow-nexus@latest login`
- Access 70+ specialized MCP tools for advanced orchestration
## 🚀 Agent Execution Flow with Claude Code
### The Correct Pattern:
1. **Optional**: Use MCP tools to set up coordination topology
2. **REQUIRED**: Use Claude Code's Task tool to spawn agents that do actual work
3. **REQUIRED**: Each agent runs hooks for coordination
4. **REQUIRED**: Batch all operations in single messages
### Example Full-Stack Development:
```javascript
// Single message with all agent spawning via Claude Code's Task tool
[Parallel Agent Execution]:
Task("Backend Developer", "Build REST API with Express. Use hooks for coordination.", "backend-dev")
Task("Frontend Developer", "Create React UI. Coordinate with backend via memory.", "coder")
Task("Database Architect", "Design PostgreSQL schema. Store schema in memory.", "code-analyzer")
Task("Test Engineer", "Write Jest tests. Check memory for API contracts.", "tester")
Task("DevOps Engineer", "Setup Docker and CI/CD. Document in memory.", "cicd-engineer")
Task("Security Auditor", "Review authentication. Report findings via hooks.", "reviewer")
// All todos batched together
TodoWrite { todos: [...8-10 todos...] }
// All file operations together
Write "backend/server.js"
Write "frontend/App.jsx"
Write "database/schema.sql"
```
## 📋 Agent Coordination Protocol
### Every Agent Spawned via Task Tool MUST:
**1️⃣ BEFORE Work:**
```bash
npx claude-flow@alpha hooks pre-task --description "[task]"
npx claude-flow@alpha hooks session-restore --session-id "swarm-[id]"
```

```bash
# Set PYTHONPATH to point to the project root directory
PYTHONPATH=/path/to/your/worktree:$PYTHONPATH python <script.py>

# Example: running tests
PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH python tests/test_needle.py
```
**Benefits**:
- No `pip install` required
- Code changes take effect immediately (no reinstall needed)
- Each worktree is completely isolated
**2️⃣ DURING Work:**
```bash
npx claude-flow@alpha hooks post-edit --file "[file]" --memory-key "swarm/[agent]/[step]"
npx claude-flow@alpha hooks notify --message "[what was done]"
```
**3️⃣ AFTER Work:**
```bash
npx claude-flow@alpha hooks post-task --task-id "[task]"
npx claude-flow@alpha hooks session-end --export-metrics true
```

## Documentation Index
| Document | Purpose |
|----------|---------|
| [`docs/architecture_guide.md`](docs/architecture_guide.md) | Core components, layer-wise CPU offload design, prefill/decode flows, implementation details |
| [`docs/multi_model_support.md`](docs/multi_model_support.md) | Model registry system, adding new models (Qwen3/Llama), architecture differences, RoPE scaling |
| [`docs/cuda_graph_offload_guide.md`](docs/cuda_graph_offload_guide.md) | CUDA graph support for CPU offload decode path, 4x decode speedup |
| [`docs/sparse_attention_guide.md`](docs/sparse_attention_guide.md) | Block sparse attention methods (MInference, FlexPrefill, XAttention, Quest), computation flow |
| [`docs/sparse_prefill_integration_plan.md`](docs/sparse_prefill_integration_plan.md) | Integration plan for MInference/XAttention/FlexPrefill with unified BlockMask interface |
| [`docs/sparse_offload_integration.md`](docs/sparse_offload_integration.md) | Sparse policy integration with layerwise offload, `requires_block_selection` interface design |
| [`docs/layerwise_offload_memory_analysis.md`](docs/layerwise_offload_memory_analysis.md) | Memory allocation analysis with theoretical formulas and empirical validation (< 5% error) |
| [`docs/debugging_guide.md`](docs/debugging_guide.md) | PyTorch hooks for debugging, tensor comparison, memory profiling |
| [`docs/gpu_only_performance_issue.md`](docs/gpu_only_performance_issue.md) | GPU-only mode slower than offload due to PagedAttention scatter overhead, optimization proposals |
## 🎯 Concurrent Execution Examples
## Configuration

| Parameter | Default | Notes |
|-----------|---------|-------|
| `kvcache_block_size` | 4096 | Tokens per block |
| `max_num_batched_tokens` | 16384 | Set = max_model_len for long context |
| `gpu_memory_utilization` | 0.9 | GPU memory fraction |
| `enable_cpu_offload` | False | Enable for long context |
| `num_gpu_blocks` | 2 | GPU blocks for offload mode |
| `num_kv_buffers` | 4 | Ring buffer size for decode pipeline |
| `enforce_eager` | False | Set True to disable CUDA graphs |

### ✅ CORRECT WORKFLOW: MCP Coordinates, Claude Code Executes
```javascript
// Step 1: MCP tools set up coordination (optional, for complex tasks)
[Single Message - Coordination Setup]:
mcp__claude-flow__swarm_init { topology: "mesh", maxAgents: 6 }
mcp__claude-flow__agent_spawn { type: "researcher" }
mcp__claude-flow__agent_spawn { type: "coder" }
mcp__claude-flow__agent_spawn { type: "tester" }
// Step 2: Claude Code Task tool spawns ACTUAL agents that do the work
[Single Message - Parallel Agent Execution]:
// Claude Code's Task tool spawns real agents concurrently
Task("Research agent", "Analyze API requirements and best practices. Check memory for prior decisions.", "researcher")
Task("Coder agent", "Implement REST endpoints with authentication. Coordinate via hooks.", "coder")
Task("Database agent", "Design and implement database schema. Store decisions in memory.", "code-analyzer")
Task("Tester agent", "Create comprehensive test suite with 90% coverage.", "tester")
Task("Reviewer agent", "Review code quality and security. Document findings.", "reviewer")
// Batch ALL todos in ONE call
TodoWrite { todos: [
{id: "1", content: "Research API patterns", status: "in_progress", priority: "high"},
{id: "2", content: "Design database schema", status: "in_progress", priority: "high"},
{id: "3", content: "Implement authentication", status: "pending", priority: "high"},
{id: "4", content: "Build REST endpoints", status: "pending", priority: "high"},
{id: "5", content: "Write unit tests", status: "pending", priority: "medium"},
{id: "6", content: "Integration tests", status: "pending", priority: "medium"},
{id: "7", content: "API documentation", status: "pending", priority: "low"},
{id: "8", content: "Performance optimization", status: "pending", priority: "low"}
]}
// Parallel file operations
Bash "mkdir -p app/{src,tests,docs,config}"
Write "app/package.json"
Write "app/src/server.js"
Write "app/tests/server.test.js"
Write "app/docs/API.md"
```
## Benchmarking

**Files**: `bench.py` (GPU), `bench_offload.py` (CPU offload), `bench_vllm.py` (comparison)
### ❌ WRONG (Multiple Messages):
```javascript
Message 1: mcp__claude-flow__swarm_init
Message 2: Task("agent 1")
Message 3: TodoWrite { todos: [single todo] }
Message 4: Write "file.js"
// This breaks parallel coordination!
```
**Common Issues**:
1. `max_num_batched_tokens < max_model_len`: Set equal for long context
2. CUDA graph dimension mismatch: Ensure `input_len + output_len <= max_model_len`
3. RoPE out of bounds: Check model's `max_position_embeddings` in config.json
**Model Limits**:
- Qwen3-0.6B/4B: 40960 tokens
- Qwen2.5-7B-Instruct-1M: 1048576 tokens
- Llama-3.1-8B-Instruct: 131072 tokens

**Performance (Qwen3-4B, CPU Offload)**:
- Prefill: ~5700-8000 tok/s (varies by context length)
- Decode with CUDA Graph: ~50 tok/s (TPOT ~19ms)
- Decode Eager Mode: ~12 tok/s (TPOT ~80ms)
- **CUDA Graph speedup: 4x decode throughput**

## Performance Benefits
- **84.8% SWE-Bench solve rate**
- **32.3% token reduction**
- **2.8-4.4x speed improvement**
- **27+ neural models**
## Hooks Integration
### Pre-Operation
- Auto-assign agents by file type
- Validate commands for safety
- Prepare resources automatically
- Optimize topology by complexity
- Cache searches
### Post-Operation
- Auto-format code
- Train neural patterns
- Update memory
- Analyze performance
- Track token usage
### Session Management
- Generate summaries
- Persist state
- Track metrics
- Restore context
- Export workflows
## Advanced Features (v2.0.0)
- 🚀 Automatic Topology Selection
- ⚡ Parallel Execution (2.8-4.4x speed)
- 🧠 Neural Training
- 📊 Bottleneck Analysis
- 🤖 Smart Auto-Spawning
- 🛡️ Self-Healing Workflows
- 💾 Cross-Session Memory
- 🔗 GitHub Integration
## Integration Tips
1. Start with basic swarm init
2. Scale agents gradually
3. Use memory for context
4. Monitor progress regularly
5. Train patterns from success
6. Enable hooks automation
7. Use GitHub tools first
## Support
- Documentation: https://github.com/ruvnet/claude-flow
- Issues: https://github.com/ruvnet/claude-flow/issues
- Flow-Nexus Platform: https://flow-nexus.ruv.io (registration required for cloud features)
---
**Author**: Zijie Tian
Remember: **Claude Flow coordinates, Claude Code creates!**
# Nano-vLLM Testing
## RULER NIAH Benchmark Test
Tests long context retrieval capability using RULER benchmark's NIAH (Needle-In-A-Haystack) task data (~32K tokens).
**Documentation**:
- [`docs/ruler_niah_standalone_test.md`](docs/ruler_niah_standalone_test.md) - Test setup and usage
- [`docs/offload_accuracy_issue.md`](docs/offload_accuracy_issue.md) - **[BUG]** Offload mode accuracy issue (66% vs 100%)
### Quick Start
```bash
# Single sample test (recommended for initial verification)
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
--model ~/models/Llama-3.1-8B-Instruct \
--enable-offload
# All 5 samples
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
--model ~/models/Llama-3.1-8B-Instruct \
--enable-offload \
--sample-indices 0-4
```
### Options
| Option | Default | Description |
|--------|---------|-------------|
| `--model` | `~/models/Llama-3.1-8B-Instruct` | Model path |
| `--enable-offload` | False | Enable CPU offload (required for 24GB GPUs) |
| `--sample-indices` | all | Samples to test (e.g., `0,1,2` or `0-4`) |
| `--max-model-len` | 32768 | Maximum context length |
| `--use-cuda-graph` | False | Enable CUDA graph (faster decode) |
---
# important-instruction-reminders
Do what has been asked; nothing more, nothing less.
NEVER create files unless they're absolutely necessary for achieving your goal.
ALWAYS prefer editing an existing file to creating a new one.
NEVER proactively create documentation files (*.md) or README files. Only create documentation files if explicitly requested by the User.
Never save working files, text/mds and tests to the root folder.

docs/offload_accuracy_issue.md

@@ -0,0 +1,239 @@
# CPU Offload Accuracy Issue Investigation
## Problem Summary
CPU offload mode produces significantly lower accuracy than non-offload mode on the RULER NIAH benchmark.
| Mode | Accuracy | Pass/Total |
|------|----------|------------|
| **Non-Offload (GPU only)** | **100%** | 100/100 |
| **CPU Offload** | **66%** | 66/100 |
This 34-percentage-point accuracy drop indicates a bug in the offload implementation that affects inference correctness.
## Test Environment
- **Model**: Llama-3.1-8B-Instruct
- **Task**: RULER NIAH (Needle-In-A-Haystack) 32K context
- **GPU**: NVIDIA A100-SXM4-80GB
- **Data**: `tests/data/ruler_niah/niah_single_1_32k.jsonl` (100 samples)
## Reproduction Commands
### Non-Offload Mode (100% accuracy)
```bash
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
--model ~/models/Llama-3.1-8B-Instruct \
--gpu-utilization 0.7 \
--quiet
```
**Configuration**:
- KV Cache: GPU only, 51 blocks (6528 MB)
- Block size: 1024 tokens
### Offload Mode (66% accuracy)
```bash
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
--model ~/models/Llama-3.1-8B-Instruct \
--enable-offload \
--quiet
```
**Configuration**:
- KV Cache: GPU 4 blocks (512 MB) + CPU 32 blocks (4096 MB)
- Ring buffer: 4 buffers × 33280 tokens (520 MB)
- Per-layer decode buffer: 128 MB
- Block size: 1024 tokens
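The reported sizes are internally consistent with Llama-3.1-8B's KV geometry (32 layers, 8 KV heads, head_dim 128, fp16 — model facts from its config.json, not nanovllm API). A quick arithmetic sanity check:

```python
# Llama-3.1-8B KV-cache geometry (from the model's config.json)
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 32, 8, 128, 2  # fp16
BLOCK_TOKENS = 1024

# Per-token KV across all layers: K and V for every layer
bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES  # 131072 B = 128 KiB
block_mb = bytes_per_token * BLOCK_TOKENS // 2**20                # MiB per 1024-token block

assert block_mb == 128
assert 51 * block_mb == 6528   # non-offload: 51 GPU blocks -> 6528 MB
assert 4 * block_mb == 512     # offload: 4 GPU blocks -> 512 MB
assert 32 * block_mb == 4096   # offload: 32 CPU blocks -> 4096 MB

# The ring buffer holds one layer at a time: per-token-per-layer KV is 4 KiB
per_layer_token_kib = bytes_per_token // LAYERS // 1024
assert per_layer_token_kib == 4
assert 4 * 33280 * per_layer_token_kib // 1024 == 520  # 4 buffers x 33280 tokens -> 520 MB
```

Every configuration number above checks out, so the accuracy bug is unlikely to be a simple sizing error.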
## Observed Failure Patterns
From the 5-sample verbose test:
| Sample | Expected | Offload Output | Status |
|--------|----------|----------------|--------|
| 0 | 8930103 | `: 8930103.` | PASS |
| 1 | 4194548 | `: 419 multiplication of 4548.` | **FAIL** |
| 2 | 8231838 | `:ное 8231838.` | PASS |
| 3 | 8835373 | `: 8835373.` | PASS |
| 4 | 7754864 | `aster 7754864.` | PASS |
**Failure pattern**: The model sometimes produces corrupted or split outputs (e.g., "419 multiplication of 4548" instead of "4194548").
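This failure mode matters for scoring: in sample 1 every digit is present, but split by inserted tokens. A sketch of a contiguous-digit-run check of the kind the test presumably applies (the exact scoring logic in `tests/test_ruler_niah.py` may differ):

```python
import re

def needle_found(output: str, expected: str) -> bool:
    """Pass only if the expected number appears as one contiguous run of digits."""
    return expected in re.findall(r"\d+", output)

# Outputs observed in the 5-sample verbose run
assert needle_found(": 8930103.", "8930103")                         # sample 0: PASS
assert not needle_found(": 419 multiplication of 4548.", "4194548")  # sample 1: FAIL
assert needle_found(":ное 8231838.", "8231838")                      # sample 2: PASS
```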
## Architecture Overview
### Offload Mode Data Flow
```
Prefill Phase:
1. Input tokens → chunked into 2048-token chunks
2. Each chunk processed layer by layer:
- Load KV from CPU → GPU ring buffer
- Compute attention
- Store KV back to CPU
3. Ring buffer holds recent KV for decode
Decode Phase:
1. For each new token:
- Load all layer KV from CPU (one layer at a time)
- Compute attention against full context
- Generate next token
```
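The bookkeeping the prefill loop must get right can be illustrated without the model: chunk boundaries and the absolute position IDs each chunk carries. A sketch (2048 is the chunk size quoted above; this helper is illustrative, not the actual `run_chunked_offload_prefill()` logic):

```python
def chunk_positions(seq_len: int, chunk: int = 2048):
    """Yield (start, end) token ranges; a chunk's position IDs are range(start, end)."""
    for start in range(0, seq_len, chunk):
        yield start, min(start + chunk, seq_len)

spans = list(chunk_positions(33000))
assert spans[0] == (0, 2048)
assert spans[-1] == (32768, 33000)            # last chunk is partial
assert sum(e - s for s, e in spans) == 33000  # no token dropped or duplicated
# Each chunk's position IDs must continue exactly where the previous chunk ended:
assert all(spans[i][1] == spans[i + 1][0] for i in range(len(spans) - 1))
```

The invariants in the assertions (no gap, no overlap, exact total) are exactly what a bug in cross-chunk position accumulation would violate.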
### Key Components
| File | Component | Description |
|------|-----------|-------------|
| `nanovllm/kvcache/offload_engine.py` | `OffloadEngine` | Manages CPU↔GPU KV cache transfers |
| `nanovllm/kvcache/offload_engine.py` | `RingKVBuffer` | GPU ring buffer for recent KV |
| `nanovllm/engine/model_runner.py` | `run_chunked_offload_prefill()` | Chunked prefill with offload |
| `nanovllm/engine/model_runner.py` | `run_offload_decode()` | Layer-wise decode with offload |
| `nanovllm/kvcache/hybrid_manager.py` | `HybridBlockManager` | CPU block allocation |
## Potential Root Causes
### 1. Ring Buffer Index/Position Issues
**Location**: `nanovllm/kvcache/offload_engine.py`
The ring buffer uses modulo indexing. Potential issues:
- Position calculation errors during prefill/decode transition
- Off-by-one errors in KV storage/retrieval
- Incorrect handling when sequence length approaches `max_seq_len`
**Recent fix applied**: `max_seq_len = max_model_len + 512` to prevent overflow, but there may be other indexing issues.
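A toy version of the modulo indexing shows the hazard: once the sequence position wraps past the buffer capacity, older entries are silently overwritten, so every read must verify that the requested position is still resident. (A sketch, not the actual `RingKVBuffer`.)

```python
class ToyRingBuffer:
    """Minimal ring buffer keyed by absolute sequence position."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.slots = [None] * capacity      # stored values
        self.positions = [None] * capacity  # which absolute position each slot holds

    def store(self, pos: int, value):
        self.slots[pos % self.capacity] = value
        self.positions[pos % self.capacity] = pos

    def load(self, pos: int):
        idx = pos % self.capacity
        if self.positions[idx] != pos:
            raise KeyError(f"position {pos} evicted (slot holds {self.positions[idx]})")
        return self.slots[idx]

buf = ToyRingBuffer(capacity=4)
for pos in range(6):            # write positions 0..5 into only 4 slots
    buf.store(pos, f"kv{pos}")
assert buf.load(5) == "kv5"     # recent position is still resident
try:
    buf.load(1)                 # position 1 was overwritten by position 5
    raise AssertionError("stale read went undetected")
except KeyError:
    pass
```

If the real buffer lacks an equivalent residency check, a wrap-around during the prefill/decode transition would serve stale KV without any error, which matches the "sometimes corrupted" failure pattern.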
### 2. Chunked Prefill KV Storage
**Location**: `nanovllm/engine/model_runner.py:run_chunked_offload_prefill()`
During chunked prefill:
- KV computed for chunk N must be correctly stored before processing chunk N+1
- Position IDs must be correctly accumulated across chunks
- CPU block allocation must be contiguous and correctly tracked
**Suspect areas**:
```python
# Check if positions are correctly tracked across chunks
# Check if KV is correctly copied to CPU after each chunk
# Check if ring buffer indices align with CPU block indices
```
### 3. Decode Phase KV Loading
**Location**: `nanovllm/engine/model_runner.py:run_offload_decode()`
During decode:
- Must load KV for ALL previous tokens (both prefill and decode)
- Layer-by-layer loading must be synchronized correctly
- Attention computation must use correct sequence length
**Suspect areas**:
```python
# Check if decode loads KV for full context length
# Check if new decode KV is stored correctly
# Check if attention mask/positions are correct
```
### 4. CPU↔GPU Transfer Synchronization
**Location**: `nanovllm/kvcache/offload_engine.py`
CUDA streams and synchronization:
- Async copies may complete out of order
- Missing synchronization points could cause stale data
- Stream priorities may affect correctness
### 5. Numerical Precision
- CPU tensors use float16/bfloat16
- GPU computation precision
- Potential precision loss during transfers
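fp16 round-trips alone should be bit-exact (CPU↔GPU copies do not change bits), but any intermediate conversion through another dtype is lossy. The magnitude of fp16 rounding is easy to see with the stdlib half-float codec (`struct` format `'e'`):

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE-754 half precision."""
    return struct.unpack("e", struct.pack("e", x))[0]

assert to_fp16(0.5) == 0.5                        # exactly representable
assert to_fp16(1 / 3) != 1 / 3                    # rounded: fp16 has a 10-bit mantissa
assert to_fp16(2049.0) == 2048.0                  # integers above 2048 are not all exact
assert to_fp16(to_fp16(1 / 3)) == to_fp16(1 / 3)  # a second round-trip is lossless
```

Note the last assertion: precision loss of this kind is deterministic and small, so it would cause uniform minor logit drift, not the localized token corruption seen above — which makes causes 1-4 more likely suspects than this one.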
## Debugging Strategy
### Step 1: Identify Failing Samples
```bash
# Run verbose mode to see which samples fail
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
--model ~/models/Llama-3.1-8B-Instruct \
--enable-offload \
--verbose 2>&1 | tee offload_verbose.log
```
### Step 2: Compare Token-by-Token
Create a debug script to compare token generation between offload and non-offload modes for a failing sample:
```python
# Compare logits at each decode step
# Check if divergence starts at a specific position
# Log KV cache contents at divergence point
```
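Once both modes dump their generated token IDs (and optionally per-step logits), the first divergence point localizes the bug. A sketch of the comparison step (pure Python; the dump format is hypothetical):

```python
def first_divergence(ref_tokens, test_tokens):
    """Return the first step where the two runs disagree, or None if identical."""
    for i, (a, b) in enumerate(zip(ref_tokens, test_tokens)):
        if a != b:
            return i
    if len(ref_tokens) != len(test_tokens):
        return min(len(ref_tokens), len(test_tokens))
    return None

# e.g. non-offload emits the needle as one digit run, offload splits it
ref  = [101, 42, 4194, 548, 13]
test = [101, 42, 4194, 220, 548]
assert first_divergence(ref, test) == 3
assert first_divergence(ref, ref) is None
```

Mapping the divergence step back to a sequence position (near a chunk boundary? near a ring-buffer wrap?) is the fastest way to narrow the root-cause list above.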
### Step 3: Verify KV Cache Contents
Add debugging to `OffloadEngine`:
```python
# In store_kv(): Log what's being stored
# In load_kv(): Log what's being loaded
# Compare loaded KV with expected values
```
### Step 4: Check Position/Index Calculations
```python
# Log ring buffer write/read positions
# Log CPU block indices
# Verify position IDs match actual token positions
```
### Step 5: Isolate the Bug
1. Test with shorter sequences (16K, 8K) to see if issue is length-dependent
2. Test with single chunk (no chunking) to isolate chunked prefill
3. Test prefill-only (no decode) to isolate decode phase
## Quick Debugging Commands
```bash
# Test single failing sample with verbose output
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
--model ~/models/Llama-3.1-8B-Instruct \
--enable-offload \
--sample-indices 1 \
--verbose
# Test with different context lengths
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
--model ~/models/Llama-3.1-8B-Instruct \
--enable-offload \
--max-model-len 16384 \
--verbose
```
## Related Documentation
- [`docs/ruler_niah_standalone_test.md`](ruler_niah_standalone_test.md) - Test setup and background
- [`docs/layerwise_offload_memory_analysis.md`](layerwise_offload_memory_analysis.md) - Memory allocation analysis
## Test Results Log
**Date**: 2025-01-12
| Test | Mode | Samples | Passed | Accuracy |
|------|------|---------|--------|----------|
| RULER NIAH 32K | Non-Offload | 100 | 100 | 100% |
| RULER NIAH 32K | CPU Offload | 100 | 66 | 66% |
## Next Steps
1. [ ] Identify pattern in failing samples (position of needle? specific numbers?)
2. [ ] Add detailed logging to offload engine
3. [ ] Compare logits between offload and non-offload modes
4. [ ] Bisect the code to find the exact bug location
5. [ ] Write unit test that isolates the bug