nano-vllm

Files

Zijie Tian 3da9b8aef2 ⚡️ perf: optimize XAttention estimate phase with K-only loading

Add load_k_only_to_slot_layer() to OffloadEngine for estimate phase:
- Only load K (not K+V) during block selection in select_blocks()
- Reduces H2D transfer by 50% in estimate phase
- 64K context: XAttn/Full ratio drops from 1.48x to 0.99x
- 32K context: XAttn/Full ratio drops from 1.67x to 1.20x

The estimate phase uses flat_group_gemm_fuse_reshape(Q, K) which
only requires K for attention score computation. V is unused.
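The 50% figure follows from K and V having the same shape, so skipping V halves the bytes copied per block. A minimal back-of-the-envelope sketch, assuming fp16 storage and hypothetical block dimensions (`BlockShape`, `h2d_bytes`, and the numbers below are illustrative and not taken from the nano-vllm code):

```python
from dataclasses import dataclass


@dataclass
class BlockShape:
    """Hypothetical shape of one KV cache block for a single layer."""
    block_size: int       # tokens per block (assumed value)
    num_kv_heads: int     # KV heads (assumed value)
    head_dim: int         # per-head dimension (assumed value)
    dtype_bytes: int = 2  # fp16/bf16 assumed

    def tensor_bytes(self) -> int:
        # Size of one K (or one V) tensor for this block.
        return self.block_size * self.num_kv_heads * self.head_dim * self.dtype_bytes


def h2d_bytes(shape: BlockShape, num_blocks: int, load_v: bool) -> int:
    """Bytes copied host-to-device for num_blocks blocks of one layer."""
    tensors_per_block = 2 if load_v else 1  # K+V versus K only
    return shape.tensor_bytes() * tensors_per_block * num_blocks


shape = BlockShape(block_size=1024, num_kv_heads=8, head_dim=128)
full = h2d_bytes(shape, num_blocks=64, load_v=True)     # K+V load
k_only = h2d_bytes(shape, num_blocks=64, load_v=False)  # K-only load
ratio = k_only / full
print(f"K+V: {full} bytes, K only: {k_only} bytes, ratio: {ratio}")
```

This only models transfer volume in the estimate phase. The end-to-end XAttn/Full ratios reported above also depend on compute time and transfer overlap, which the sketch does not capture.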

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>

2026-01-28 06:24:20 +08:00

policies

[feat] Added chunked prefill and kvcache offload mechanism.

2025-12-10 03:47:37 +08:00

sparse

⚡️ perf: optimize XAttention estimate phase with K-only loading

2026-01-28 06:24:20 +08:00

__init__.py

✨ feat: integrate sparse policy architecture into GPU-only mode

2026-01-27 05:08:02 +08:00

base_manager.py

[feat] Added chunked prefill and kvcache offload mechanism.