Root Cause:
- OffloadEngine.reset() cleared GPU buffers but NOT CPU cache
- Previous request's KV cache data persisted in CPU memory, contaminating subsequent requests
Fixes:
- Add k_cache_cpu.zero_() and v_cache_cpu.zero_() to OffloadEngine.reset()
- Add clear_decode_tracking(seq) call in HybridKVCacheManager.deallocate()
Results:
- niah_single_1 accuracy improved from ~80% to 94% (+14%)
- Remaining ~6% errors are model limitations, not state leakage
Also:
- Update docs/ruler_32k_chunked_offload_issue.md with fix details
- Remove debug planning files (findings.md, progress.md, task_plan.md)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Create nanovllm/ops/ module for low-level attention operators
- Move chunked_attention.py from kvcache/ to ops/
- Update imports in full_policy.py (3 locations)
- Fix: remove dead code in OffloadEngine.reset() referencing
non-existent layer_k/v_buffer_a/b attributes
Verified with needle test (32K offload): PASSED
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove cross-layer pipeline from OffloadEngine (saves ~1GB GPU memory for long sequences)
- Delete layer_k/v_buffer_a/b double buffers
- Remove start_decode_pipeline, get_decode_layer_kv, end_decode_pipeline methods
- Remove pipeline state tracking variables
- Simplify decode to use ring buffer pipeline only (more efficient for long sequences)
- Rename compute_chunked_attention → compute_chunked_prefill for clarity
- Add mandatory needle test requirements: --enable-offload --input-len 32768
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Combines two performance optimization features:
- perf_opt-1: Cross-layer pipeline for decode (double-buffered layer cache)
- perf_opt-2: Per-layer prefill buffer for async offload
Both features are complementary and improve CPU offload performance.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>