[docs] Added offload_acc issue.

This commit is contained in:
Zijie Tian
2026-01-12 15:05:55 +08:00
parent a6cc703d73
commit 8e0888c20c
3 changed files with 623 additions and 74 deletions

27
.gitignore vendored

@@ -197,3 +197,30 @@ cython_debug/
results/
outputs/
.local/
# Claude Flow generated files
.claude/settings.local.json
.mcp.json
claude-flow.config.json
.swarm/
.hive-mind/
.claude-flow/
memory/
coordination/
memory/claude-flow-data.json
memory/sessions/*
!memory/sessions/README.md
memory/agents/*
!memory/agents/README.md
coordination/memory_bank/*
coordination/subtasks/*
coordination/orchestration/*
*.db
*.db-journal
*.db-wal
*.sqlite
*.sqlite-journal
*.sqlite-wal
claude-flow
# Removed Windows wrapper files per user request
hive-mind-prompt-*.txt

431
CLAUDE.md

@@ -1,106 +1,389 @@
# CLAUDE.md
# Claude Code Configuration - SPARC Development Environment
This file provides guidance to Claude Code when working with this repository.
## 🚨 CRITICAL: CONCURRENT EXECUTION & FILE MANAGEMENT
## Overview
**ABSOLUTE RULES**:
1. ALL operations MUST be concurrent/parallel in a single message
2. **NEVER save working files, text/mds and tests to the root folder**
3. ALWAYS organize files in appropriate subdirectories
4. **USE CLAUDE CODE'S TASK TOOL** for spawning agents concurrently, not just MCP
Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline LLM inference. It supports multiple model architectures (Qwen3, Qwen2, Llama) with CPU offload for long-context inference.
### ⚡ GOLDEN RULE: "1 MESSAGE = ALL RELATED OPERATIONS"
## GPU Mutex for Multi-Instance Debugging
**MANDATORY PATTERNS:**
- **TodoWrite**: ALWAYS batch ALL todos in ONE call (5-10+ todos minimum)
- **Task tool (Claude Code)**: ALWAYS spawn ALL agents in ONE message with full instructions
- **File operations**: ALWAYS batch ALL reads/writes/edits in ONE message
- **Bash commands**: ALWAYS batch ALL terminal operations in ONE message
- **Memory operations**: ALWAYS batch ALL memory store/retrieve in ONE message
**IMPORTANT**: When running multiple Claude instances for parallel debugging, different rules apply based on script type:
### 🎯 CRITICAL: Claude Code Task Tool for Agent Execution
### Benchmarks (`bench*.py`) - Exclusive GPU Access Required
Before running any `bench*.py` script, Claude MUST wait for exclusive GPU access:
```bash
# Check and wait for GPU to be free
while [ -n "$(nvidia-smi --query-compute-apps=pid --format=csv,noheader)" ]; do
  echo "GPU busy, waiting 10s..."
  sleep 10
done
```

**Claude Code's Task tool is the PRIMARY way to spawn agents:**
```javascript
// ✅ CORRECT: Use Claude Code's Task tool for parallel agent execution
[Single Message]:
Task("Research agent", "Analyze requirements and patterns...", "researcher")
Task("Coder agent", "Implement core features...", "coder")
Task("Tester agent", "Create comprehensive tests...", "tester")
Task("Reviewer agent", "Review code quality...", "reviewer")
Task("Architect agent", "Design system architecture...", "system-architect")
```
### Other Scripts (tests, examples) - Port Conflict Check Only
**MCP tools are ONLY for coordination setup:**
- `mcp__claude-flow__swarm_init` - Initialize coordination topology
- `mcp__claude-flow__agent_spawn` - Define agent types for coordination
- `mcp__claude-flow__task_orchestrate` - Orchestrate high-level workflows
For non-benchmark scripts, exclusive GPU access is NOT required. However, check for **distributed port conflicts** before running (see the port check under Quick Setup below).
### 📁 File Organization Rules
**NEVER save to root folder. Use these directories:**
- `/src` - Source code files
- `/tests` - Test files
- `/docs` - Documentation and markdown files
- `/config` - Configuration files
- `/scripts` - Utility scripts
- `/examples` - Example code
## Project Overview
This project uses SPARC (Specification, Pseudocode, Architecture, Refinement, Completion) methodology with Claude-Flow orchestration for systematic Test-Driven Development.
## SPARC Commands
### Core Commands
- `npx claude-flow sparc modes` - List available modes
- `npx claude-flow sparc run <mode> "<task>"` - Execute specific mode
- `npx claude-flow sparc tdd "<feature>"` - Run complete TDD workflow
- `npx claude-flow sparc info <mode>` - Get mode details
### Batchtools Commands
- `npx claude-flow sparc batch <modes> "<task>"` - Parallel execution
- `npx claude-flow sparc pipeline "<task>"` - Full pipeline processing
- `npx claude-flow sparc concurrent <mode> "<tasks-file>"` - Multi-task processing
### Build Commands
- `npm run build` - Build project
- `npm run test` - Run tests
- `npm run lint` - Linting
- `npm run typecheck` - Type checking
## SPARC Workflow Phases
1. **Specification** - Requirements analysis (`sparc run spec-pseudocode`)
2. **Pseudocode** - Algorithm design (`sparc run spec-pseudocode`)
3. **Architecture** - System design (`sparc run architect`)
4. **Refinement** - TDD implementation (`sparc tdd`)
5. **Completion** - Integration (`sparc run integration`)
## Code Style & Best Practices
- **Modular Design**: Files under 500 lines
- **Environment Safety**: Never hardcode secrets
- **Test-First**: Write tests before implementation
- **Clean Architecture**: Separate concerns
- **Documentation**: Keep updated
## 🚀 Available Agents (54 Total)
### Core Development
`coder`, `reviewer`, `tester`, `planner`, `researcher`
### Swarm Coordination
`hierarchical-coordinator`, `mesh-coordinator`, `adaptive-coordinator`, `collective-intelligence-coordinator`, `swarm-memory-manager`
### Consensus & Distributed
`byzantine-coordinator`, `raft-manager`, `gossip-coordinator`, `consensus-builder`, `crdt-synchronizer`, `quorum-manager`, `security-manager`
### Performance & Optimization
`perf-analyzer`, `performance-benchmarker`, `task-orchestrator`, `memory-coordinator`, `smart-agent`
### GitHub & Repository
`github-modes`, `pr-manager`, `code-review-swarm`, `issue-tracker`, `release-manager`, `workflow-automation`, `project-board-sync`, `repo-architect`, `multi-repo-swarm`
### SPARC Methodology
`sparc-coord`, `sparc-coder`, `specification`, `pseudocode`, `architecture`, `refinement`
### Specialized Development
`backend-dev`, `mobile-dev`, `ml-developer`, `cicd-engineer`, `api-docs`, `system-architect`, `code-analyzer`, `base-template-generator`
### Testing & Validation
`tdd-london-swarm`, `production-validator`
### Migration & Planning
`migration-planner`, `swarm-init`
## 🎯 Claude Code vs MCP Tools
### Claude Code Handles ALL EXECUTION:
- **Task tool**: Spawn and run agents concurrently for actual work
- File operations (Read, Write, Edit, MultiEdit, Glob, Grep)
- Code generation and programming
- Bash commands and system operations
- Implementation work
- Project navigation and analysis
- TodoWrite and task management
- Git operations
- Package management
- Testing and debugging
### MCP Tools ONLY COORDINATE:
- Swarm initialization (topology setup)
- Agent type definitions (coordination patterns)
- Task orchestration (high-level planning)
- Memory management
- Neural features
- Performance tracking
- GitHub integration
**KEY**: MCP coordinates the strategy, Claude Code's Task tool executes with real agents.
## 🚀 Quick Setup
```bash
# Add MCP servers (Claude Flow required, others optional)
claude mcp add claude-flow npx claude-flow@alpha mcp start
claude mcp add ruv-swarm npx ruv-swarm mcp start  # Optional: Enhanced coordination
claude mcp add flow-nexus npx flow-nexus@latest mcp start  # Optional: Cloud features
```

```bash
# Check if port 29500 (default torch distributed port) is in use
if lsof -i :29500 >/dev/null 2>&1; then
  echo "Port 29500 in use, waiting 10s..."
  sleep 10
fi
```
**Note**: nanovllm's distributed port handling is not yet robust - two processes competing for the same port will cause errors. This check prevents that issue.
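Until that handling is hardened, each instance can also pick a free port itself instead of polling. A minimal sketch in Python (`MASTER_PORT` is the standard `torch.distributed` environment override; `find_free_port` and `port_in_use` are illustrative helpers, not part of nanovllm):

```python
import os
import socket

def find_free_port() -> int:
    """Ask the OS for an unused TCP port instead of racing on 29500."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0 -> the OS assigns a free port
        return s.getsockname()[1]

def port_in_use(port: int) -> bool:
    """True if something is already listening on the port (like `lsof -i`)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex(("127.0.0.1", port)) == 0

if __name__ == "__main__":
    if port_in_use(29500):
        # Redirect this instance to a fresh port before launching the script
        os.environ["MASTER_PORT"] = str(find_free_port())
```

Setting `MASTER_PORT` per instance avoids the wait loop entirely when many worktrees run in parallel.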
## MCP Tool Categories
## Multi-Instance Development with PYTHONPATH
### Coordination
`swarm_init`, `agent_spawn`, `task_orchestrate`
**IMPORTANT**: When running multiple Claude instances on different worktrees, do NOT use `pip install -e .` globally as it will affect other instances.
### Monitoring
`swarm_status`, `agent_list`, `agent_metrics`, `task_status`, `task_results`
**Use PYTHONPATH directly** - no pip install needed:
### Memory & Neural
`memory_usage`, `neural_status`, `neural_train`, `neural_patterns`
### GitHub Integration
`github_swarm`, `repo_analyze`, `pr_enhance`, `issue_triage`, `code_review`
### System
`benchmark_run`, `features_detect`, `swarm_monitor`
### Flow-Nexus MCP Tools (Optional Advanced Features)
Flow-Nexus extends MCP capabilities with 70+ cloud-based orchestration tools:
**Key MCP Tool Categories:**
- **Swarm & Agents**: `swarm_init`, `swarm_scale`, `agent_spawn`, `task_orchestrate`
- **Sandboxes**: `sandbox_create`, `sandbox_execute`, `sandbox_upload` (cloud execution)
- **Templates**: `template_list`, `template_deploy` (pre-built project templates)
- **Neural AI**: `neural_train`, `neural_patterns`, `seraphina_chat` (AI assistant)
- **GitHub**: `github_repo_analyze`, `github_pr_manage` (repository management)
- **Real-time**: `execution_stream_subscribe`, `realtime_subscribe` (live monitoring)
- **Storage**: `storage_upload`, `storage_list` (cloud file management)
**Authentication Required:**
- Register: `mcp__flow-nexus__user_register` or `npx flow-nexus@latest register`
- Login: `mcp__flow-nexus__user_login` or `npx flow-nexus@latest login`
- Access 70+ specialized MCP tools for advanced orchestration
## 🚀 Agent Execution Flow with Claude Code
### The Correct Pattern:
1. **Optional**: Use MCP tools to set up coordination topology
2. **REQUIRED**: Use Claude Code's Task tool to spawn agents that do actual work
3. **REQUIRED**: Each agent runs hooks for coordination
4. **REQUIRED**: Batch all operations in single messages
### Example Full-Stack Development:
```javascript
// Single message with all agent spawning via Claude Code's Task tool
[Parallel Agent Execution]:
Task("Backend Developer", "Build REST API with Express. Use hooks for coordination.", "backend-dev")
Task("Frontend Developer", "Create React UI. Coordinate with backend via memory.", "coder")
Task("Database Architect", "Design PostgreSQL schema. Store schema in memory.", "code-analyzer")
Task("Test Engineer", "Write Jest tests. Check memory for API contracts.", "tester")
Task("DevOps Engineer", "Setup Docker and CI/CD. Document in memory.", "cicd-engineer")
Task("Security Auditor", "Review authentication. Report findings via hooks.", "reviewer")
// All todos batched together
TodoWrite { todos: [...8-10 todos...] }
// All file operations together
Write "backend/server.js"
Write "frontend/App.jsx"
Write "database/schema.sql"
```
## 📋 Agent Coordination Protocol
### Every Agent Spawned via Task Tool MUST:
**1️⃣ BEFORE Work:**
```bash
npx claude-flow@alpha hooks pre-task --description "[task]"
npx claude-flow@alpha hooks session-restore --session-id "swarm-[id]"
```

```bash
# Set PYTHONPATH to point to the project root directory
PYTHONPATH=/path/to/your/worktree:$PYTHONPATH python <script.py>

# Example: running tests
PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH python tests/test_needle.py
```
**Benefits**:
- No `pip install` required
- Code changes take effect immediately (no reinstall needed)
- Each worktree is completely isolated
**2️⃣ DURING Work:**
```bash
npx claude-flow@alpha hooks post-edit --file "[file]" --memory-key "swarm/[agent]/[step]"
npx claude-flow@alpha hooks notify --message "[what was done]"
```
**3️⃣ AFTER Work:**
```bash
npx claude-flow@alpha hooks post-task --task-id "[task]"
npx claude-flow@alpha hooks session-end --export-metrics true
```

## Documentation Index
| Document | Purpose |
|----------|---------|
| [`docs/architecture_guide.md`](docs/architecture_guide.md) | Core components, layer-wise CPU offload design, prefill/decode flows, implementation details |
| [`docs/multi_model_support.md`](docs/multi_model_support.md) | Model registry system, adding new models (Qwen3/Llama), architecture differences, RoPE scaling |
| [`docs/cuda_graph_offload_guide.md`](docs/cuda_graph_offload_guide.md) | CUDA graph support for CPU offload decode path, 4x decode speedup |
| [`docs/sparse_attention_guide.md`](docs/sparse_attention_guide.md) | Block sparse attention methods (MInference, FlexPrefill, XAttention, Quest), computation flow |
| [`docs/sparse_prefill_integration_plan.md`](docs/sparse_prefill_integration_plan.md) | Integration plan for MInference/XAttention/FlexPrefill with unified BlockMask interface |
| [`docs/sparse_offload_integration.md`](docs/sparse_offload_integration.md) | Sparse policy integration with layerwise offload, `requires_block_selection` interface design |
| [`docs/layerwise_offload_memory_analysis.md`](docs/layerwise_offload_memory_analysis.md) | Memory allocation analysis with theoretical formulas and empirical validation (< 5% error) |
| [`docs/debugging_guide.md`](docs/debugging_guide.md) | PyTorch hooks for debugging, tensor comparison, memory profiling |
| [`docs/gpu_only_performance_issue.md`](docs/gpu_only_performance_issue.md) | GPU-only mode slower than offload due to PagedAttention scatter overhead, optimization proposals |
## 🎯 Concurrent Execution Examples
## Configuration

| Parameter | Default | Notes |
|-----------|---------|-------|
| `kvcache_block_size` | 4096 | Tokens per block |
| `max_num_batched_tokens` | 16384 | Set = max_model_len for long context |
| `gpu_memory_utilization` | 0.9 | GPU memory fraction |
| `enable_cpu_offload` | False | Enable for long context |
| `num_gpu_blocks` | 2 | GPU blocks for offload mode |
| `num_kv_buffers` | 4 | Ring buffer size for decode pipeline |
| `enforce_eager` | False | Set True to disable CUDA graphs |

### ✅ CORRECT WORKFLOW: MCP Coordinates, Claude Code Executes
```javascript
// Step 1: MCP tools set up coordination (optional, for complex tasks)
[Single Message - Coordination Setup]:
mcp__claude-flow__swarm_init { topology: "mesh", maxAgents: 6 }
mcp__claude-flow__agent_spawn { type: "researcher" }
mcp__claude-flow__agent_spawn { type: "coder" }
mcp__claude-flow__agent_spawn { type: "tester" }
// Step 2: Claude Code Task tool spawns ACTUAL agents that do the work
[Single Message - Parallel Agent Execution]:
// Claude Code's Task tool spawns real agents concurrently
Task("Research agent", "Analyze API requirements and best practices. Check memory for prior decisions.", "researcher")
Task("Coder agent", "Implement REST endpoints with authentication. Coordinate via hooks.", "coder")
Task("Database agent", "Design and implement database schema. Store decisions in memory.", "code-analyzer")
Task("Tester agent", "Create comprehensive test suite with 90% coverage.", "tester")
Task("Reviewer agent", "Review code quality and security. Document findings.", "reviewer")
// Batch ALL todos in ONE call
TodoWrite { todos: [
{id: "1", content: "Research API patterns", status: "in_progress", priority: "high"},
{id: "2", content: "Design database schema", status: "in_progress", priority: "high"},
{id: "3", content: "Implement authentication", status: "pending", priority: "high"},
{id: "4", content: "Build REST endpoints", status: "pending", priority: "high"},
{id: "5", content: "Write unit tests", status: "pending", priority: "medium"},
{id: "6", content: "Integration tests", status: "pending", priority: "medium"},
{id: "7", content: "API documentation", status: "pending", priority: "low"},
{id: "8", content: "Performance optimization", status: "pending", priority: "low"}
]}
// Parallel file operations
Bash "mkdir -p app/{src,tests,docs,config}"
Write "app/package.json"
Write "app/src/server.js"
Write "app/tests/server.test.js"
Write "app/docs/API.md"
```
## Benchmarking

**Files**: `bench.py` (GPU), `bench_offload.py` (CPU offload), `bench_vllm.py` (comparison)
### ❌ WRONG (Multiple Messages):
```javascript
Message 1: mcp__claude-flow__swarm_init
Message 2: Task("agent 1")
Message 3: TodoWrite { todos: [single todo] }
Message 4: Write "file.js"
// This breaks parallel coordination!
```
**Common Issues**:
1. `max_num_batched_tokens < max_model_len`: Set equal for long context
2. CUDA graph dimension mismatch: Ensure `input_len + output_len <= max_model_len`
3. RoPE out of bounds: Check model's `max_position_embeddings` in config.json
**Model Limits**:
- Qwen3-0.6B/4B: 40960 tokens
- Qwen2.5-7B-Instruct-1M: 1048576 tokens
- Llama-3.1-8B-Instruct: 131072 tokens

**Performance (Qwen3-4B, CPU Offload)**:
- Prefill: ~5700-8000 tok/s (varies by context length)
- Decode with CUDA Graph: ~50 tok/s (TPOT ~19ms)
- Decode Eager Mode: ~12 tok/s (TPOT ~80ms)
- **CUDA Graph speedup: 4x decode throughput**

## Performance Benefits
- **84.8% SWE-Bench solve rate**
- **32.3% token reduction**
- **2.8-4.4x speed improvement**
- **27+ neural models**
## Hooks Integration
### Pre-Operation
- Auto-assign agents by file type
- Validate commands for safety
- Prepare resources automatically
- Optimize topology by complexity
- Cache searches
### Post-Operation
- Auto-format code
- Train neural patterns
- Update memory
- Analyze performance
- Track token usage
### Session Management
- Generate summaries
- Persist state
- Track metrics
- Restore context
- Export workflows
## Advanced Features (v2.0.0)
- 🚀 Automatic Topology Selection
- ⚡ Parallel Execution (2.8-4.4x speed)
- 🧠 Neural Training
- 📊 Bottleneck Analysis
- 🤖 Smart Auto-Spawning
- 🛡️ Self-Healing Workflows
- 💾 Cross-Session Memory
- 🔗 GitHub Integration
## Integration Tips
1. Start with basic swarm init
2. Scale agents gradually
3. Use memory for context
4. Monitor progress regularly
5. Train patterns from success
6. Enable hooks automation
7. Use GitHub tools first
## Support
- Documentation: https://github.com/ruvnet/claude-flow
- Issues: https://github.com/ruvnet/claude-flow/issues
- Flow-Nexus Platform: https://flow-nexus.ruv.io (registration required for cloud features)
---
**Author**: Zijie Tian
Remember: **Claude Flow coordinates, Claude Code creates!**
# Nano-vLLM Testing
## RULER NIAH Benchmark Test
Tests long context retrieval capability using RULER benchmark's NIAH (Needle-In-A-Haystack) task data (~32K tokens).
**Documentation**:
- [`docs/ruler_niah_standalone_test.md`](docs/ruler_niah_standalone_test.md) - Test setup and usage
- [`docs/offload_accuracy_issue.md`](docs/offload_accuracy_issue.md) - **[BUG]** Offload mode accuracy issue (66% vs 100%)
### Quick Start
```bash
# Single sample test (recommended for initial verification)
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
--model ~/models/Llama-3.1-8B-Instruct \
--enable-offload
# All 5 samples
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
--model ~/models/Llama-3.1-8B-Instruct \
--enable-offload \
--sample-indices 0-4
```
### Options
| Option | Default | Description |
|--------|---------|-------------|
| `--model` | `~/models/Llama-3.1-8B-Instruct` | Model path |
| `--enable-offload` | False | Enable CPU offload (required for 24GB GPUs) |
| `--sample-indices` | all | Samples to test (e.g., `0,1,2` or `0-4`) |
| `--max-model-len` | 32768 | Maximum context length |
| `--use-cuda-graph` | False | Enable CUDA graph (faster decode) |
---
# important-instruction-reminders
Do what has been asked; nothing more, nothing less.
NEVER create files unless they're absolutely necessary for achieving your goal.
ALWAYS prefer editing an existing file to creating a new one.
NEVER proactively create documentation files (*.md) or README files. Only create documentation files if explicitly requested by the User.
Never save working files, text/mds and tests to the root folder.

docs/offload_accuracy_issue.md

@@ -0,0 +1,239 @@
# CPU Offload Accuracy Issue Investigation
## Problem Summary
CPU offload mode produces significantly lower accuracy than non-offload mode on the RULER NIAH benchmark.
| Mode | Accuracy | Pass/Total |
|------|----------|------------|
| **Non-Offload (GPU only)** | **100%** | 100/100 |
| **CPU Offload** | **66%** | 66/100 |
This 34-percentage-point accuracy drop indicates a bug in the offload implementation that affects inference correctness.
## Test Environment
- **Model**: Llama-3.1-8B-Instruct
- **Task**: RULER NIAH (Needle-In-A-Haystack) 32K context
- **GPU**: NVIDIA A100-SXM4-80GB
- **Data**: `tests/data/ruler_niah/niah_single_1_32k.jsonl` (100 samples)
## Reproduction Commands
### Non-Offload Mode (100% accuracy)
```bash
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
--model ~/models/Llama-3.1-8B-Instruct \
--gpu-utilization 0.7 \
--quiet
```
**Configuration**:
- KV Cache: GPU only, 51 blocks (6528 MB)
- Block size: 1024 tokens
### Offload Mode (66% accuracy)
```bash
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
--model ~/models/Llama-3.1-8B-Instruct \
--enable-offload \
--quiet
```
**Configuration**:
- KV Cache: GPU 4 blocks (512 MB) + CPU 32 blocks (4096 MB)
- Ring buffer: 4 buffers × 33280 tokens (520 MB)
- Per-layer decode buffer: 128 MB
- Block size: 1024 tokens
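The reported sizes are internally consistent with Llama-3.1-8B's KV geometry (32 layers, 8 KV heads, head_dim 128, fp16 — model facts from its config.json, not nanovllm API). A quick arithmetic sanity check:

```python
# Llama-3.1-8B KV-cache geometry (from the model's config.json)
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 32, 8, 128, 2  # fp16
BLOCK_TOKENS = 1024

# Per-token KV across all layers: K and V for every layer
bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES  # 131072 B = 128 KiB
block_mb = bytes_per_token * BLOCK_TOKENS // 2**20                # MiB per 1024-token block

assert block_mb == 128
assert 51 * block_mb == 6528   # non-offload: 51 GPU blocks -> 6528 MB
assert 4 * block_mb == 512     # offload: 4 GPU blocks -> 512 MB
assert 32 * block_mb == 4096   # offload: 32 CPU blocks -> 4096 MB

# The ring buffer holds one layer at a time: per-token-per-layer KV is 4 KiB
per_layer_token_kib = bytes_per_token // LAYERS // 1024
assert per_layer_token_kib == 4
assert 4 * 33280 * per_layer_token_kib // 1024 == 520  # 4 buffers x 33280 tokens -> 520 MB
```

Every configuration number above checks out, so the accuracy bug is unlikely to be a simple sizing error.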
## Observed Failure Patterns
From the 5-sample verbose test:
| Sample | Expected | Offload Output | Status |
|--------|----------|----------------|--------|
| 0 | 8930103 | `: 8930103.` | PASS |
| 1 | 4194548 | `: 419 multiplication of 4548.` | **FAIL** |
| 2 | 8231838 | `:ное 8231838.` | PASS |
| 3 | 8835373 | `: 8835373.` | PASS |
| 4 | 7754864 | `aster 7754864.` | PASS |
**Failure pattern**: The model sometimes produces corrupted or split outputs (e.g., "419 multiplication of 4548" instead of "4194548").
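This failure mode matters for scoring: in sample 1 every digit is present, but split by inserted tokens. A sketch of a contiguous-digit-run check of the kind the test presumably applies (the exact scoring logic in `tests/test_ruler_niah.py` may differ):

```python
import re

def needle_found(output: str, expected: str) -> bool:
    """Pass only if the expected number appears as one contiguous run of digits."""
    return expected in re.findall(r"\d+", output)

# Outputs observed in the 5-sample verbose run
assert needle_found(": 8930103.", "8930103")                         # sample 0: PASS
assert not needle_found(": 419 multiplication of 4548.", "4194548")  # sample 1: FAIL
assert needle_found(":ное 8231838.", "8231838")                      # sample 2: PASS
```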
## Architecture Overview
### Offload Mode Data Flow
```
Prefill Phase:
1. Input tokens → chunked into 2048-token chunks
2. Each chunk processed layer by layer:
- Load KV from CPU → GPU ring buffer
- Compute attention
- Store KV back to CPU
3. Ring buffer holds recent KV for decode
Decode Phase:
1. For each new token:
- Load all layer KV from CPU (one layer at a time)
- Compute attention against full context
- Generate next token
```
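The bookkeeping the prefill loop must get right can be illustrated without the model: chunk boundaries and the absolute position IDs each chunk carries. A sketch (2048 is the chunk size quoted above; this helper is illustrative, not the actual `run_chunked_offload_prefill()` logic):

```python
def chunk_positions(seq_len: int, chunk: int = 2048):
    """Yield (start, end) token ranges; a chunk's position IDs are range(start, end)."""
    for start in range(0, seq_len, chunk):
        yield start, min(start + chunk, seq_len)

spans = list(chunk_positions(33000))
assert spans[0] == (0, 2048)
assert spans[-1] == (32768, 33000)            # last chunk is partial
assert sum(e - s for s, e in spans) == 33000  # no token dropped or duplicated
# Each chunk's position IDs must continue exactly where the previous chunk ended:
assert all(spans[i][1] == spans[i + 1][0] for i in range(len(spans) - 1))
```

The invariants in the assertions (no gap, no overlap, exact total) are exactly what a bug in cross-chunk position accumulation would violate.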
### Key Components
| File | Component | Description |
|------|-----------|-------------|
| `nanovllm/kvcache/offload_engine.py` | `OffloadEngine` | Manages CPU↔GPU KV cache transfers |
| `nanovllm/kvcache/offload_engine.py` | `RingKVBuffer` | GPU ring buffer for recent KV |
| `nanovllm/engine/model_runner.py` | `run_chunked_offload_prefill()` | Chunked prefill with offload |
| `nanovllm/engine/model_runner.py` | `run_offload_decode()` | Layer-wise decode with offload |
| `nanovllm/kvcache/hybrid_manager.py` | `HybridBlockManager` | CPU block allocation |
## Potential Root Causes
### 1. Ring Buffer Index/Position Issues
**Location**: `nanovllm/kvcache/offload_engine.py`
The ring buffer uses modulo indexing. Potential issues:
- Position calculation errors during prefill/decode transition
- Off-by-one errors in KV storage/retrieval
- Incorrect handling when sequence length approaches `max_seq_len`
**Recent fix applied**: `max_seq_len = max_model_len + 512` to prevent overflow, but there may be other indexing issues.
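A toy version of the modulo indexing shows the hazard: once the sequence position wraps past the buffer capacity, older entries are silently overwritten, so every read must verify that the requested position is still resident. (A sketch, not the actual `RingKVBuffer`.)

```python
class ToyRingBuffer:
    """Minimal ring buffer keyed by absolute sequence position."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.slots = [None] * capacity      # stored values
        self.positions = [None] * capacity  # which absolute position each slot holds

    def store(self, pos: int, value):
        self.slots[pos % self.capacity] = value
        self.positions[pos % self.capacity] = pos

    def load(self, pos: int):
        idx = pos % self.capacity
        if self.positions[idx] != pos:
            raise KeyError(f"position {pos} evicted (slot holds {self.positions[idx]})")
        return self.slots[idx]

buf = ToyRingBuffer(capacity=4)
for pos in range(6):            # write positions 0..5 into only 4 slots
    buf.store(pos, f"kv{pos}")
assert buf.load(5) == "kv5"     # recent position is still resident
try:
    buf.load(1)                 # position 1 was overwritten by position 5
    raise AssertionError("stale read went undetected")
except KeyError:
    pass
```

If the real buffer lacks an equivalent residency check, a wrap-around during the prefill/decode transition would serve stale KV without any error, which matches the "sometimes corrupted" failure pattern.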
### 2. Chunked Prefill KV Storage
**Location**: `nanovllm/engine/model_runner.py:run_chunked_offload_prefill()`
During chunked prefill:
- KV computed for chunk N must be correctly stored before processing chunk N+1
- Position IDs must be correctly accumulated across chunks
- CPU block allocation must be contiguous and correctly tracked
**Suspect areas**:
```python
# Check if positions are correctly tracked across chunks
# Check if KV is correctly copied to CPU after each chunk
# Check if ring buffer indices align with CPU block indices
```
### 3. Decode Phase KV Loading
**Location**: `nanovllm/engine/model_runner.py:run_offload_decode()`
During decode:
- Must load KV for ALL previous tokens (both prefill and decode)
- Layer-by-layer loading must be synchronized correctly
- Attention computation must use correct sequence length
**Suspect areas**:
```python
# Check if decode loads KV for full context length
# Check if new decode KV is stored correctly
# Check if attention mask/positions are correct
```
### 4. CPU↔GPU Transfer Synchronization
**Location**: `nanovllm/kvcache/offload_engine.py`
CUDA streams and synchronization:
- Async copies may complete out of order
- Missing synchronization points could cause stale data
- Stream priorities may affect correctness
### 5. Numerical Precision
- CPU tensors use float16/bfloat16
- GPU computation precision
- Potential precision loss during transfers
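fp16 round-trips alone should be bit-exact (CPU↔GPU copies do not change bits), but any intermediate conversion through another dtype is lossy. The magnitude of fp16 rounding is easy to see with the stdlib half-float codec (`struct` format `'e'`):

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE-754 half precision."""
    return struct.unpack("e", struct.pack("e", x))[0]

assert to_fp16(0.5) == 0.5                        # exactly representable
assert to_fp16(1 / 3) != 1 / 3                    # rounded: fp16 has a 10-bit mantissa
assert to_fp16(2049.0) == 2048.0                  # integers above 2048 are not all exact
assert to_fp16(to_fp16(1 / 3)) == to_fp16(1 / 3)  # a second round-trip is lossless
```

Note the last assertion: precision loss of this kind is deterministic and small, so it would cause uniform minor logit drift, not the localized token corruption seen above — which makes causes 1-4 more likely suspects than this one.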
## Debugging Strategy
### Step 1: Identify Failing Samples
```bash
# Run verbose mode to see which samples fail
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
--model ~/models/Llama-3.1-8B-Instruct \
--enable-offload \
--verbose 2>&1 | tee offload_verbose.log
```
### Step 2: Compare Token-by-Token
Create a debug script to compare token generation between offload and non-offload modes for a failing sample:
```python
# Compare logits at each decode step
# Check if divergence starts at a specific position
# Log KV cache contents at divergence point
```
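Once both modes dump their generated token IDs (and optionally per-step logits), the first divergence point localizes the bug. A sketch of the comparison step (pure Python; the dump format is hypothetical):

```python
def first_divergence(ref_tokens, test_tokens):
    """Return the first step where the two runs disagree, or None if identical."""
    for i, (a, b) in enumerate(zip(ref_tokens, test_tokens)):
        if a != b:
            return i
    if len(ref_tokens) != len(test_tokens):
        return min(len(ref_tokens), len(test_tokens))
    return None

# e.g. non-offload emits the needle as one digit run, offload splits it
ref  = [101, 42, 4194, 548, 13]
test = [101, 42, 4194, 220, 548]
assert first_divergence(ref, test) == 3
assert first_divergence(ref, ref) is None
```

Mapping the divergence step back to a sequence position (near a chunk boundary? near a ring-buffer wrap?) is the fastest way to narrow the root-cause list above.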
### Step 3: Verify KV Cache Contents
Add debugging to `OffloadEngine`:
```python
# In store_kv(): Log what's being stored
# In load_kv(): Log what's being loaded
# Compare loaded KV with expected values
```
### Step 4: Check Position/Index Calculations
```python
# Log ring buffer write/read positions
# Log CPU block indices
# Verify position IDs match actual token positions
```
### Step 5: Isolate the Bug
1. Test with shorter sequences (16K, 8K) to see if issue is length-dependent
2. Test with single chunk (no chunking) to isolate chunked prefill
3. Test prefill-only (no decode) to isolate decode phase
## Quick Debugging Commands
```bash
# Test single failing sample with verbose output
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
--model ~/models/Llama-3.1-8B-Instruct \
--enable-offload \
--sample-indices 1 \
--verbose
# Test with different context lengths
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler_niah.py \
--model ~/models/Llama-3.1-8B-Instruct \
--enable-offload \
--max-model-len 16384 \
--verbose
```
## Related Documentation
- [`docs/ruler_niah_standalone_test.md`](ruler_niah_standalone_test.md) - Test setup and background
- [`docs/layerwise_offload_memory_analysis.md`](layerwise_offload_memory_analysis.md) - Memory allocation analysis
## Test Results Log
**Date**: 2025-01-12
| Test | Mode | Samples | Passed | Accuracy |
|------|------|---------|--------|----------|
| RULER NIAH 32K | Non-Offload | 100 | 100 | 100% |
| RULER NIAH 32K | CPU Offload | 100 | 66 | 66% |
## Next Steps
1. [ ] Identify pattern in failing samples (position of needle? specific numbers?)
2. [ ] Add detailed logging to offload engine
3. [ ] Compare logits between offload and non-offload modes
4. [ ] Bisect the code to find the exact bug location
5. [ ] Write unit test that isolates the bug