🐛 fix: resolve CPU KV cache state leakage between requests

Root Cause:
- OffloadEngine.reset() cleared GPU buffers but NOT CPU cache
- Previous request's KV cache data persisted in CPU memory, contaminating subsequent requests

Fixes:
- Add k_cache_cpu.zero_() and v_cache_cpu.zero_() to OffloadEngine.reset()
- Add clear_decode_tracking(seq) call in HybridKVCacheManager.deallocate()

Results:
- niah_single_1 accuracy improved from ~80% to 94% (+14%)
- Remaining ~6% errors are model limitations, not state leakage

Also:
- Update docs/ruler_32k_chunked_offload_issue.md with fix details
- Remove debug planning files (findings.md, progress.md, task_plan.md)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
Zijie Tian
2026-01-21 01:12:21 +08:00
parent 4d8ae951c3
commit 78050aef9f
6 changed files with 67 additions and 425 deletions

View File

@@ -231,6 +231,9 @@ class HybridKVCacheManager(KVCacheManager):
seq.num_cached_tokens = 0
seq.block_table.clear()
# Clear decode position tracking for this sequence
self.clear_decode_tracking(seq)
# Reset OffloadEngine state to prevent request-to-request contamination
# This clears all KV buffers and pending async events
if self.offload_engine is not None: