Implement correct 3-stage KV chunking for XAttention offload mode:

- Stage 1: Compute partial softmax stats (m, l) for each KV chunk
- Stage 2: Merge all partial stats to get global normalization factors
- Stage 3: Normalize with global stats and compute block sums

Key fixes:

- Add wait_all_prefill_offloads() before loading CPU blocks to ensure
  async offload completion (fixes a stale-data bug)
- Pre-allocate the m/l partial buffers and the block_sums buffer

This produces density identical to the GPU-only xattn_estimate while
using O(S×C) peak memory instead of O(S²).

Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
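The three stages above follow the standard online-softmax merge rule. A minimal NumPy sketch of the idea (the function name, shapes, and the assumption that the chunk size divides the sequence length and is a multiple of the block size are all illustrative, not the actual XAttention offload implementation):

```python
import numpy as np

def chunked_block_sums(q, k, chunk, block):
    """3-stage chunked softmax over KV: partial (m, l) per chunk,
    global merge, then normalized per-block probability sums.
    Assumes S % chunk == 0 and chunk % block == 0 for brevity."""
    S = k.shape[0]

    # Stage 1: partial softmax stats (row max m_i, exp-sum l_i) per KV chunk
    ms, ls = [], []
    for s0 in range(0, S, chunk):
        sc = q @ k[s0:s0 + chunk].T            # (n_q, chunk) partial scores
        m_i = sc.max(axis=-1)
        l_i = np.exp(sc - m_i[:, None]).sum(axis=-1)
        ms.append(m_i)
        ls.append(l_i)

    # Stage 2: merge partial stats into global normalization factors
    m = np.max(np.stack(ms), axis=0)
    l = sum(l_i * np.exp(m_i - m) for m_i, l_i in zip(ms, ls))

    # Stage 3: recompute each chunk's scores, normalize with the global
    # (m, l), and accumulate sums over key blocks
    block_sums = np.zeros((q.shape[0], S // block))
    for s0 in range(0, S, chunk):
        p = np.exp(q @ k[s0:s0 + chunk].T - m[:, None]) / l[:, None]
        for b in range(s0 // block, (s0 + chunk) // block):
            lo, hi = b * block - s0, (b + 1) * block - s0
            block_sums[:, b] = p[:, lo:hi].sum(axis=-1)
    return block_sums
```

Because the merge in Stage 2 recovers the exact global max and exp-sum, the block sums match a full (non-chunked) softmax bit-for-bit up to floating-point rounding, while only one (n_q × chunk) score tile is live at a time.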