nano-vllm

Files

Zijie Tian 829b311c02 🐛 fix: stream synchronization for XAttention estimate kernels in offload mode

- Wrap all compute kernels in select_blocks with compute_stream context
  (Pass 1 historical blocks, Pass 1 current chunk, Step 2 merge,
   Pass 2 historical blocks, Pass 2 current chunk, Step 4 block selection)
- Fix K data mismatch between Pass 1 and Pass 2 by ensuring wait_slot_layer
  syncs with compute_stream where kernels actually run
- Remove STRONG SYNC code from offload_engine.py (now handled by events)
- Remove debug print statements and torch.save code
- Consolidate fallback conditions in compute_with_xattn
- Change default chunk_size from 16384 to 4096 for density alignment

The bug caused Pass 1 and Pass 2 to see different K data from the same
CPU block because the compute kernels ran on the default stream while
wait_slot_layer only synchronized with compute_stream.
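
The race described above can be illustrated with a toy, pure-Python model of stream ordering. This is only a sketch under stated assumptions: `Event`, `run_streams`, `simulate`, and the `slot` dictionary are hypothetical names made up for illustration, not nano-vllm code, and the scheduler deliberately picks one unlucky interleaving (on real hardware the outcome depends on timing). In PyTorch the corresponding mechanisms would be `torch.cuda.Stream.wait_event` and the `torch.cuda.stream(...)` context manager.

```python
from collections import deque


class Event:
    """Stands in for a CUDA event: set once the recording stream reaches it."""

    def __init__(self):
        self.done = False


def run_streams(streams):
    """Run queued ops until every stream is empty.

    streams maps a name to a deque of (event_to_wait_on_or_None, fn).
    Each stream is FIFO. A stream whose head op waits on an unfinished
    event is skipped for that round, which models stream.wait_event(event).
    """
    while any(streams.values()):
        progressed = False
        for ops in streams.values():
            if ops and (ops[0][0] is None or ops[0][0].done):
                _, fn = ops.popleft()
                fn()
                progressed = True
        if not progressed:
            raise RuntimeError("deadlock: every stream is waiting")


def simulate(kernel_stream):
    """Return the K data the kernel observes when enqueued on kernel_stream."""
    slot = {"k": "stale"}   # GPU slot still holding the previous block
    ready = Event()         # completes when the host-to-device copy finishes
    seen = []

    def h2d_copy():
        slot["k"] = "fresh"
        ready.done = True

    def kernel():
        seen.append(slot["k"])

    # Iteration order is the (assumed, unlucky) scheduling priority.
    streams = {
        "default": deque(),
        "compute": deque(),
        "transfer": deque([(None, h2d_copy)]),
    }
    # Model of wait_slot_layer: only the compute stream waits on the event.
    streams["compute"].append((ready, lambda: None))
    # The kernel lands on whichever stream is current at launch time.
    streams[kernel_stream].append((None, kernel))
    run_streams(streams)
    return seen[0]


print(simulate("default"))  # kernel is not ordered after the copy
print(simulate("compute"))  # kernel is queued behind the event wait
```

In this model, a kernel on the default stream has no ordering against the copy and can read the slot before it is refilled, while a kernel on the compute stream is queued behind the event wait. That is the effect the fix obtains by wrapping the kernels in the compute_stream context.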

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-02-05 01:30:23 +08:00

policies

[feat] Added chunked prefill and kvcache offload mechanism.

2025-12-10 03:47:37 +08:00

sparse

🐛 fix: stream synchronization for XAttention estimate kernels in offload mode

2026-02-05 01:30:23 +08:00

__init__.py

✨ feat: integrate sparse policy architecture into GPU-only mode

2026-01-27 05:08:02 +08:00

base_manager.py

[feat] Added chunked prefill and kvcache offload mechanism.