nano-vllm/nanovllm/kvcache/sparse/xattn_bsa.py at 3da9b8aef29e69f1677fed26e4bb8aa79b5766f8


Zijie Tian 3da9b8aef2 ⚡️ perf: optimize XAttention estimate phase with K-only loading

Add load_k_only_to_slot_layer() to OffloadEngine for estimate phase:
- Only load K (not K+V) during block selection in select_blocks()
- Reduces H2D transfer by 50% in estimate phase
- 64K context: XAttn/Full ratio drops from 1.48x to 0.99x
- 32K context: XAttn/Full ratio drops from 1.67x to 1.20x
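
The 50% figure above follows from K and V having the same shape, so dropping V halves the bytes copied per block. A minimal arithmetic sketch of that, assuming a hypothetical block layout (`block_size`, `num_kv_heads`, `head_dim`, and a 2-byte dtype are illustrative values, not taken from the repository):

```python
def h2d_bytes_per_block(block_size, num_kv_heads, head_dim,
                        dtype_bytes=2, load_v=True):
    """Bytes copied host-to-device for one KV block of one layer.

    All parameters are illustrative assumptions; K and V are assumed
    to have identical shapes, as is usual for a KV cache.
    """
    k_bytes = block_size * num_kv_heads * head_dim * dtype_bytes
    # Loading V as well doubles the transfer, since V matches K in shape.
    return 2 * k_bytes if load_v else k_bytes


full = h2d_bytes_per_block(1024, 8, 128, load_v=True)
k_only = h2d_bytes_per_block(1024, 8, 128, load_v=False)
saving = 1 - k_only / full  # fraction of estimate-phase transfer avoided
```

This only covers the estimate phase; blocks that are selected still need both K and V loaded for the actual attention computation.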

The estimate phase uses flat_group_gemm_fuse_reshape(Q, K) which
only requires K for attention score computation. V is unused.
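
To illustrate why V is not needed for selection, here is a simplified pure-Python stand-in for block scoring. It is not `flat_group_gemm_fuse_reshape` and not the repository's `select_blocks()`; the helper names and the scoring rule (sum of Q·K dot products per block) are assumptions chosen for brevity:

```python
def score_blocks(q, k_blocks):
    """Score each block from Q and K alone (hypothetical scoring rule).

    q        : query vector, a list of floats
    k_blocks : list of blocks, each a list of key vectors
    """
    scores = []
    for block in k_blocks:
        # Sum of dot products between the query and every key in the block.
        scores.append(sum(sum(qi * ki for qi, ki in zip(q, k)) for k in block))
    return scores


def select_top_blocks(q, k_blocks, top_k):
    """Return indices of the top_k highest-scoring blocks, in ascending order."""
    scores = score_blocks(q, k_blocks)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])


q = [1.0, 0.0]
k_blocks = [[[0.1, 5.0]], [[2.0, 0.0]], [[1.0, 0.0]]]
selected = select_top_blocks(q, k_blocks, top_k=2)
```

No value tensor appears anywhere in the selection path, which is why the estimate phase can skip transferring V.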

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>

2026-01-28 06:24:20 +08:00

32 KiB
