- Add xattn_estimate_chunked function ported from COMPASS
- Support chunked prefill with q_start_pos parameter
- Ensure results match the standard xattn_estimate exactly when
  the same chunk_size is used
- Add test and documentation
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add documentation analyzing the accuracy issues seen with 32K chunked
offload, with proposed solutions covering LSE precision, ring buffer
state management, and position encoding validation.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add docs/block_sparse_attn_interface.md with BSA function signatures
- Update CLAUDE.md documentation index
- Remove obsolete DEBUG_SUMMARY.md and test_report_sparse_policy_refactor.md
- Add notes.md to .gitignore
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Create docs/sparse_policy_implementation_guide.md as a full implementation guide
- Rewrite .claude/rules/sparse-policy.md with mandatory base class requirements
- Add new doc reference to CLAUDE.md documentation index
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Document the accuracy degradation at 32K context with chunked offload
- Add detailed hypothesis analysis and debugging approach
- Include 4-slot ring buffer experiment results
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add comprehensive documentation for the SparsePolicy abstraction:
- SparsePolicy base class and abstract methods
- FullAttentionPolicy prefill/decode flow
- Ring buffer and cross-layer pipeline modes
- Code conventions and testing guidelines
Update the CLAUDE.md documentation index with a reference to the new doc.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Combine two performance optimization features:
- perf_opt-1: Cross-layer pipeline for decode (double-buffered layer cache)
- perf_opt-2: Per-layer prefill buffer for async offload
The two features are complementary, and both improve CPU offload performance.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add instructions for Claude instances to check GPU availability before
running CUDA operations, preventing conflicts when multiple instances
debug in parallel on a single GPU.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>