- Document accuracy degradation issue in 32K context with chunked offload
- Add detailed hypothesis analysis and debugging approach
- Include 4-slot ring buffer experiment results
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove cross-layer pipeline from OffloadEngine (saves ~1GB GPU memory for long sequences)
- Delete layer_k/v_buffer_a/b double buffers
- Remove start_decode_pipeline, get_decode_layer_kv, end_decode_pipeline methods
- Remove pipeline state tracking variables
- Simplify decode to use ring buffer pipeline only (more efficient for long sequences)
- Rename compute_chunked_attention → compute_chunked_prefill for clarity
- Add mandatory needle test requirements: --enable-offload --input-len 32768
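
For reference, the slot reuse behind a ring buffer pipeline can be sketched as below. This is a minimal illustration, not the OffloadEngine code: the function name `ring_buffer_schedule`, the event tuples, and the default of 4 slots are assumptions made for the example.

```python
from typing import Iterator, Tuple


def ring_buffer_schedule(num_chunks: int, num_slots: int = 4) -> Iterator[Tuple[str, int, int]]:
    """Yield ("load" | "compute", chunk_index, slot_index) in issue order.

    Hypothetical helper: it only models which GPU slot each KV chunk
    would occupy, not the actual copies or attention kernels.
    """
    if num_slots < 1:
        raise ValueError("num_slots must be at least 1")

    # Prime the pipeline: fill every slot (or fewer, if there are few chunks).
    for chunk in range(min(num_slots, num_chunks)):
        yield ("load", chunk, chunk % num_slots)

    for chunk in range(num_chunks):
        slot = chunk % num_slots
        yield ("compute", chunk, slot)
        # The slot is free again once its chunk has been consumed,
        # so the chunk that is num_slots ahead can be loaded into it.
        upcoming = chunk + num_slots
        if upcoming < num_chunks:
            yield ("load", upcoming, slot)


if __name__ == "__main__":
    for event in ring_buffer_schedule(num_chunks=6, num_slots=4):
        print(event)
```

In a real pipeline the loads would be asynchronous host-to-device copies overlapping with compute; the sketch only shows the ordering and slot assignment.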
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add planning files (task_plan.md, findings.md, progress.md) to .gitignore
- Remove existing planning files from git index (keep local)
- Update planning-with-files rule with git management policy
These temporary session files should not be version controlled.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add comprehensive documentation for the SparsePolicy abstraction:
- SparsePolicy base class and abstract methods
- FullAttentionPolicy prefill/decode flow
- Ring buffer and cross-layer pipeline modes
- Code conventions and testing guidelines
Update the CLAUDE.md documentation index with a reference to the new document.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Move decode attention computation from attention.py to SparsePolicy:
- Add compute_chunked_decode abstract method to SparsePolicy base class
- Implement compute_chunked_decode in FullAttentionPolicy with:
  - Ring buffer pipeline (_decode_ring_buffer_pipeline)
  - Cross-layer pipeline (_decode_with_layer_pipeline)
  - Decode buffer handling
- Simplify _chunked_decode_attention to only validate and delegate
- Remove _decode_ring_buffer_pipeline and _decode_with_layer_pipeline from attention.py
- Add supports_decode check for policy validation
This completes the SparsePolicy v5 refactoring where both prefill and
decode paths now delegate all computation to the sparse policy.
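
The validate-and-delegate shape described above might look roughly like the sketch below. Only the names SparsePolicy, FullAttentionPolicy, select_blocks, compute_chunked_decode, and supports_decode come from this commit; every signature, the placeholder return value, and the free function `chunked_decode_attention` are assumptions for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any, List, Optional, Sequence


class SparsePolicy(ABC):
    # Assumed class-level capability flag checked by the attention layer.
    supports_decode: bool = True

    @abstractmethod
    def select_blocks(self, available_blocks: Sequence[int],
                      offload_engine: Optional[Any] = None) -> List[int]:
        """Choose which KV blocks take part in attention."""

    @abstractmethod
    def compute_chunked_decode(self, q: Any, block_ids: Sequence[int],
                               offload_engine: Optional[Any] = None) -> Any:
        """Run decode attention over the selected blocks."""


class FullAttentionPolicy(SparsePolicy):
    def select_blocks(self, available_blocks, offload_engine=None):
        # Full attention keeps every block.
        return list(available_blocks)

    def compute_chunked_decode(self, q, block_ids, offload_engine=None):
        # Placeholder result: a real policy would stream KV chunks
        # through the offload engine and compute attention here.
        return {"query": q, "blocks": list(block_ids)}


def chunked_decode_attention(policy: Optional[SparsePolicy], q: Any,
                             available_blocks: Sequence[int],
                             offload_engine: Optional[Any] = None) -> Any:
    """Hypothetical stand-in for the attention layer: validate, then delegate."""
    if policy is None:
        raise ValueError("a sparse policy is required for chunked decode")
    if not policy.supports_decode:
        raise RuntimeError(f"{type(policy).__name__} does not support decode")
    blocks = policy.select_blocks(available_blocks, offload_engine)
    return policy.compute_chunked_decode(q, blocks, offload_engine)
```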
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove unused claude-flow hooks, permissions, and daemon settings.
Add disabled MCP servers list for claude-flow related servers.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Move all chunked prefill attention computation from attention.py to
SparsePolicy.compute_chunked_attention(). This is the v4 architecture
refactoring for sparse attention policies.
Changes:
- Add compute_chunked_attention abstract method to SparsePolicy base
- Add offload_engine parameter to select_blocks for policies needing
  KV access during block selection
- Implement compute_chunked_attention in FullAttentionPolicy with
  complete ring buffer pipeline logic
- Simplify attention.py to delegate all chunked prefill to policy
- Remove redundant _sync_load_previous_chunks and
  _ring_buffer_pipeline_load methods from Attention class
Test: test_needle.py --enable-offload PASSED
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Import os and socket modules
- Add _find_free_port() function for automatic port detection
- Use NANOVLLM_DIST_PORT env var if set, otherwise auto-assign
- Enables running multiple model instances without port conflicts
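
A minimal sketch of the port selection described above, assuming the usual bind-to-port-0 approach. `_find_free_port` and `NANOVLLM_DIST_PORT` are named in this commit; the wrapper `resolve_dist_port` and the exact fallback logic are assumptions for illustration.

```python
import os
import socket


def _find_free_port() -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


def resolve_dist_port() -> int:
    """Hypothetical helper: prefer NANOVLLM_DIST_PORT, else auto-assign."""
    configured = os.environ.get("NANOVLLM_DIST_PORT")
    if configured:
        return int(configured)
    return _find_free_port()


if __name__ == "__main__":
    print(resolve_dist_port())
```

The port is released before the caller binds it, so another process could claim it in between; for a handful of local model instances that window is usually acceptable.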
Co-Authored-By: Claude <noreply@anthropic.com>
- Add ignore patterns for Claude Flow generated files
- Add ignore pattern for the test data directory
- Add ignore pattern for the Serena MCP tool config
- Add ignore patterns for Windows wrapper files
These configurations improve development workflow by excluding temporary
and generated files from version control.
Add .claude/settings.json to enable claude-flow MCP in all worktrees.
This configuration includes:
- SessionStart hook to auto-start claude-flow daemon
- Auto-approval for claude-flow MCP tools and CLI commands
- Basic claude-flow settings
Co-Authored-By: Claude <noreply@anthropic.com>
Combines two performance optimization features:
- perf_opt-1: Cross-layer pipeline for decode (double-buffered layer cache)
- perf_opt-2: Per-layer prefill buffer for async offload
Both features are complementary and improve CPU offload performance.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add instructions for Claude instances to check GPU availability before
running CUDA operations, preventing conflicts when multiple instances
debug in parallel on a single GPU.
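
One way such an availability check could be scripted is sketched below. It is an illustration only: the function names and the 10% memory threshold are assumptions, and the check relies on the `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits` query rather than anything defined in this repository.

```python
import subprocess
from typing import List, Tuple


def query_gpu_memory() -> str:
    """Return nvidia-smi CSV output: one "used, total" line (MiB) per GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_gpu_memory(csv_text: str) -> List[Tuple[int, int]]:
    """Parse the CSV text into (used_mib, total_mib) pairs."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def free_gpus(csv_text: str, max_used_fraction: float = 0.1) -> List[int]:
    """Indices of GPUs whose memory use is at or below the assumed threshold."""
    return [
        index
        for index, (used, total) in enumerate(parse_gpu_memory(csv_text))
        if total > 0 and used / total <= max_used_fraction
    ]


if __name__ == "__main__":
    print(free_gpus(query_gpu_memory()))
```

Memory usage is only a proxy for "in use"; a process that has initialized CUDA but allocated little memory would not be caught by this threshold.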
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>