nano-vllm/nanovllm/kvcache/sparse/xattn_bsa.py at 829b311c028e5513c238bcef2f6679e96a9b7ca6

Zijie Tian 829b311c02 🐛 fix: stream synchronization for XAttention estimate kernels in offload mode

- Wrap all compute kernels in select_blocks with compute_stream context
  (Pass 1 historical blocks, Pass 1 current chunk, Step 2 merge,
   Pass 2 historical blocks, Pass 2 current chunk, Step 4 block selection)
- Fix K data mismatch between Pass 1 and Pass 2 by ensuring wait_slot_layer
  syncs with compute_stream where kernels actually run
- Remove STRONG SYNC code from offload_engine.py (now handled by events)
- Remove debug print statements and torch.save code
- Consolidate fallback conditions in compute_with_xattn
- Change default chunk_size from 16384 to 4096 for density alignment

The bug caused Pass 1 and Pass 2 to see different K data from the same
CPU block because compute kernels ran on default stream while
wait_slot_layer only synced compute_stream.
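
The ordering problem described above can be illustrated with a minimal PyTorch sketch. It is not the repository's code: `copy_then_compute`, `transfer_stream`, and `copy_done` are hypothetical names, and the event wait only stands in for whatever `wait_slot_layer` does internally. The point it shows is that a wait issued on `compute_stream` protects only kernels launched on `compute_stream`, so the compute has to be wrapped in that stream's context rather than left on the default stream.

```python
import torch


def copy_then_compute(host_block: torch.Tensor) -> torch.Tensor:
    """Copy a CPU block to the GPU on one stream, then reduce it on another.

    Hypothetical helper for illustration only; it is not part of nano-vllm.
    """
    if not torch.cuda.is_available():
        # Without a GPU there are no streams to order; compute on the CPU.
        return host_block.float().sum()

    transfer_stream = torch.cuda.Stream()
    compute_stream = torch.cuda.Stream()
    copy_done = torch.cuda.Event()

    # Issue the host-to-device copy on the transfer stream and mark its end.
    with torch.cuda.stream(transfer_stream):
        gpu_block = host_block.pin_memory().to("cuda", non_blocking=True)
        copy_done.record(transfer_stream)

    # Make compute_stream wait for the copy. This orders only work that is
    # later submitted to compute_stream, not work on the default stream.
    compute_stream.wait_event(copy_done)

    # Run the kernel on the same stream that performed the wait. Launching
    # it outside this context would put it on the default stream, where it
    # could read gpu_block before the copy has finished.
    with torch.cuda.stream(compute_stream):
        result = gpu_block.float().sum()

    # gpu_block was allocated under transfer_stream; tell the caching
    # allocator that compute_stream also uses it.
    gpu_block.record_stream(compute_stream)

    compute_stream.synchronize()
    return result.cpu()
```

Calling `copy_then_compute(torch.arange(8, dtype=torch.float32))` sums the block after the copy is guaranteed to be complete; on a machine without CUDA the helper falls back to a plain CPU sum.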

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-02-05 01:30:23 +08:00

49 KiB
