Root Cause:
- OffloadEngine.reset() cleared GPU buffers but NOT CPU cache
- Previous request's KV cache data persisted in CPU memory, contaminating subsequent requests
Fixes:
- Add k_cache_cpu.zero_() and v_cache_cpu.zero_() to OffloadEngine.reset()
- Add clear_decode_tracking(seq) call in HybridKVCacheManager.deallocate()
Results:
- niah_single_1 accuracy improved from ~80% to 94% (+14%)
- Remaining ~6% errors are model limitations, not state leakage
Also:
- Update docs/ruler_32k_chunked_offload_issue.md with fix details
- Remove debug planning files (findings.md, progress.md, task_plan.md)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Document accuracy degradation issue in 32K context with chunked offload
- Add detailed hypothesis analysis and debugging approach
- Include 4-slot ring buffer experiment results
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>