nano-vllm/nanovllm/kvcache/sparse/quest.py at 54fd302fa8449790b6f553eef28dc1a1f0a8170e

Files

Zijie Tian 2e96d1d97d WIP: Enhance sparse attention with density tracking and block selection improvements

- Added analysis documentation for xattn density alignment.
- Refactored ModelRunner to pre-allocate policy metadata buffers regardless of CPU offload configuration.
- Updated FullAttentionPolicy and SparsePolicy to accept query and key tensors for block selection.
- Enhanced QuestPolicy to utilize query tensor for block selection and improved handling of selected blocks.
- Expanded XAttentionBSAPolicy to support chunked prefill and improved attention score computation with historical and current chunk handling.
- Introduced DensityObserver to track compute and communication density for sparse attention layers.
- Updated attention layer to ensure block selection is always called, improving robustness in first chunk scenarios.
- Added tests for attention kernel behavior with enhanced input patterns.

2026-01-31 14:48:23 +08:00

11 KiB

Raw Blame History

View Raw

11 KiB Raw Blame History

11 KiB

Raw Blame History