Root Cause: - OffloadEngine.reset() cleared GPU buffers but NOT CPU cache - Previous request's KV cache data persisted in CPU memory, contaminating subsequent requests Fixes: - Add k_cache_cpu.zero_() and v_cache_cpu.zero_() to OffloadEngine.reset() - Add clear_decode_tracking(seq) call in HybridKVCacheManager.deallocate() Results: - niah_single_1 accuracy improved from ~80% to 94% (+14%) - Remaining ~6% errors are model limitations, not state leakage Also: - Update docs/ruler_32k_chunked_offload_issue.md with fix details - Remove debug planning files (findings.md, progress.md, task_plan.md) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
20 KiB
20 KiB