- Document ring buffer pipeline triggering nsys timestamp bug
- Update profile_offload.sh to use test_ruler.py with options
- Add reference to new doc in CLAUDE.md
Root cause: 4-slot ring buffer pipeline (4 transfer streams +
1 compute stream) triggers event ordering bug in nsys < 2024.2
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add new sections to xattn_bsa_policy_design.md:
- Performance benchmarks: 128K context comparison (Full vs XAttn BSA)
- Density trend analysis across chunks
- Memory leak issue and fix (64GB -> 4GB reduction)
- Memory monitoring guide with gpu-monitor agent
- Density statistics API documentation
- Known issues and optimization directions
Update CLAUDE.md description to reflect new content.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Document flat_group_gemm_fuse_reshape and softmax_fuse_block_sum kernels
- Explain anti-diagonal sum principle and stride sampling
- Add GPU-specific BLOCK_M/N constraints (RTX 3090 vs A100)
- Show Q/K can have different lengths (chunked prefill support)
- Update CLAUDE.md with doc reference
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add xattn_estimate_chunked function ported from COMPASS
- Support chunked prefill with q_start_pos parameter
- Ensure 100% consistency with standard xattn_estimate when
using matching chunk_size parameter
- Add test and documentation
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Document CUDA Graph memory behavior based on actual testing:
- Memory overhead at each stage (model, cache, warmup, capture, replay)
- StaticCache is the main overhead (~144MB for 1K tokens)
- Graph capture adds minimal overhead (~8MB)
- Graph replay requires zero additional allocation
- Performance improvement: ~2.8x decode throughput
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Root Cause:
- OffloadEngine.reset() cleared GPU buffers but NOT CPU cache
- Previous request's KV cache data persisted in CPU memory, contaminating subsequent requests
Fixes:
- Add k_cache_cpu.zero_() and v_cache_cpu.zero_() to OffloadEngine.reset()
- Add clear_decode_tracking(seq) call in HybridKVCacheManager.deallocate()
Results:
- niah_single_1 accuracy improved from ~80% to 94% (+14%)
- Remaining ~6% errors are model limitations, not state leakage
Also:
- Update docs/ruler_32k_chunked_offload_issue.md with fix details
- Remove debug planning files (findings.md, progress.md, task_plan.md)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add comprehensive documentation analyzing the 32K chunked offload
accuracy issues with proposed solutions covering LSE precision,
ring buffer state management, and position encoding validation.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add docs/block_sparse_attn_interface.md with BSA function signatures
- Update CLAUDE.md documentation index
- Remove obsolete DEBUG_SUMMARY.md and test_report_sparse_policy_refactor.md
- Add notes.md to .gitignore
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Create docs/sparse_policy_implementation_guide.md with comprehensive guide
- Rewrite .claude/rules/sparse-policy.md with mandatory base class requirements
- Add new doc reference to CLAUDE.md documentation index
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Document accuracy degradation issue in 32K context with chunked offload
- Add detailed hypothesis analysis and debugging approach
- Include 4-slot ring buffer experiment results
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove cross-layer pipeline from OffloadEngine (saves ~1GB GPU memory for long sequences)
- Delete layer_k/v_buffer_a/b double buffers
- Remove start_decode_pipeline, get_decode_layer_kv, end_decode_pipeline methods
- Remove pipeline state tracking variables
- Simplify decode to use ring buffer pipeline only (more efficient for long sequences)
- Rename compute_chunked_attention → compute_chunked_prefill for clarity
- Add mandatory needle test requirements: --enable-offload --input-len 32768
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add comprehensive documentation for the SparsePolicy abstraction:
- SparsePolicy base class and abstract methods
- FullAttentionPolicy prefill/decode flow
- Ring buffer and cross-layer pipeline modes
- Code conventions and testing guidelines
Update CLAUDE.md documentation index with reference.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>