nano-vllm

Files

Zijie Tian 2e96d1d97d WIP: Enhance sparse attention with density tracking and block selection improvements

- Added analysis documentation for xattn density alignment.
- Refactored ModelRunner to pre-allocate policy metadata buffers regardless of CPU offload configuration.
- Updated FullAttentionPolicy and SparsePolicy to accept query and key tensors for block selection.
- Enhanced QuestPolicy to utilize query tensor for block selection and improved handling of selected blocks.
- Expanded XAttentionBSAPolicy to support chunked prefill and improved attention score computation with historical and current chunk handling.
- Introduced DensityObserver to track compute and communication density for sparse attention layers.
- Updated attention layer to ensure block selection is always called, improving robustness in first chunk scenarios.
- Added tests for attention kernel behavior with enhanced input patterns.

2026-01-31 14:48:23 +08:00

comm

[WIP] Added sgDMA operator for scatter kvcache communication.

2025-12-24 23:48:52 +08:00

debug

[refactor] Refactor the kvcache offload.

2026-01-04 19:37:03 +08:00

engine

WIP: Enhance sparse attention with density tracking and block selection improvements

2026-01-31 14:48:23 +08:00

kvcache

WIP: Enhance sparse attention with density tracking and block selection improvements