Commit Graph

5 Commits

Author SHA1 Message Date
Zijie Tian
a5307fb124 📝 docs: add CUDA Graph optimization plan for offload mode decode
- Update task_plan.md with 6-phase segmented graph implementation plan
- Add findings.md documenting 7 key discoveries about current implementation
- Add progress.md for tracking implementation progress
- Add test_chunk_attention_graph_reuse.py validating 2-graph reuse strategy

Key architecture decision: Split transformer layer into 3 segments:
- PRE-ATTENTION GRAPH: norm → qkv_proj → rotary (1 graph, reused)
- CHUNKED ATTENTION: H2D (eager) + flash_attn (2 graphs) + merge (eager)
- POST-ATTENTION GRAPH: o_proj → norm → FFN (1 graph, reused)

Total: 4 graphs serving all layers via copy_() tensor updates.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 02:12:24 +08:00
Zijie Tian
78050aef9f 🐛 fix: resolve CPU KV cache state leakage between requests
Root Cause:
- OffloadEngine.reset() cleared GPU buffers but NOT CPU cache
- Previous request's KV cache data persisted in CPU memory, contaminating subsequent requests

Fixes:
- Add k_cache_cpu.zero_() and v_cache_cpu.zero_() to OffloadEngine.reset()
- Add clear_decode_tracking(seq) call in HybridKVCacheManager.deallocate()

Results:
- niah_single_1 accuracy improved from ~80% to 94% (+14%)
- Remaining ~6% errors are model limitations, not state leakage

Also:
- Update docs/ruler_32k_chunked_offload_issue.md with fix details
- Remove debug planning files (findings.md, progress.md, task_plan.md)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-21 01:12:21 +08:00
Zijie Tian
4d8ae951c3 [WIP] Before debug plan. 2026-01-21 00:01:10 +08:00
Zijie Tian
b97b0b96a0 [WIP] Before refactor the nanovllm sparse policy. 2026-01-19 22:34:44 +08:00
Zijie Tian
03a8c033cb [claudesquad] update from 'add-llama-1' on 10 Jan 26 21:03 CST 2026-01-10 21:03:45 +08:00