Implement correct 3-stage KV chunking for XAttention offload mode:

- Stage 1: Compute partial softmax stats (m, l) for each KV chunk
- Stage 2: Merge all partial stats to get global normalization factors
- Stage 3: Normalize with global stats and compute block sums

Key fixes:

- Add wait_all_prefill_offloads() before loading CPU blocks to ensure
  async offload completion (fixes a stale-data bug)
- Pre-allocate the m/l partial buffers and the block_sums buffer

This produces density identical to the GPU-only xattn_estimate while
using O(S×C) peak memory instead of O(S²).

Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
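The three stages above follow the standard online-softmax merge rule. A minimal NumPy sketch of the idea (the function name, shapes, and the assumption that the chunk size divides the sequence length and is a multiple of the block size are all illustrative, not the actual XAttention offload implementation):

```python
import numpy as np

def chunked_block_sums(q, k, chunk, block):
    """3-stage chunked softmax over KV: partial (m, l) per chunk,
    global merge, then normalized per-block probability sums.
    Assumes S % chunk == 0 and chunk % block == 0 for brevity."""
    S = k.shape[0]

    # Stage 1: partial softmax stats (row max m_i, exp-sum l_i) per KV chunk
    ms, ls = [], []
    for s0 in range(0, S, chunk):
        sc = q @ k[s0:s0 + chunk].T            # (n_q, chunk) partial scores
        m_i = sc.max(axis=-1)
        l_i = np.exp(sc - m_i[:, None]).sum(axis=-1)
        ms.append(m_i)
        ls.append(l_i)

    # Stage 2: merge partial stats into global normalization factors
    m = np.max(np.stack(ms), axis=0)
    l = sum(l_i * np.exp(m_i - m) for m_i, l_i in zip(ms, ls))

    # Stage 3: recompute each chunk's scores, normalize with the global
    # (m, l), and accumulate sums over key blocks
    block_sums = np.zeros((q.shape[0], S // block))
    for s0 in range(0, S, chunk):
        p = np.exp(q @ k[s0:s0 + chunk].T - m[:, None]) / l[:, None]
        for b in range(s0 // block, (s0 + chunk) // block):
            lo, hi = b * block - s0, (b + 1) * block - s0
            block_sums[:, b] = p[:, lo:hi].sum(axis=-1)
    return block_sums
```

Because the merge in Stage 2 recovers the exact global max and exp-sum, the block sums match a full (non-chunked) softmax bit-for-bit up to floating-point rounding, while only one (n_q × chunk) score tile is live at a time.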