nano-vllm

Files

Zijie Tian 2e96d1d97d WIP: Enhance sparse attention with density tracking and block selection improvements

- Added analysis documentation for xattn density alignment.
- Refactored ModelRunner to pre-allocate policy metadata buffers regardless of CPU offload configuration.
- Updated FullAttentionPolicy and SparsePolicy to accept query and key tensors for block selection.
- Enhanced QuestPolicy to use the query tensor for block selection and improved its handling of selected blocks.
- Expanded XAttentionBSAPolicy to support chunked prefill and improved attention score computation across historical and current chunks.
- Introduced DensityObserver to track compute and communication density for sparse attention layers.
- Updated the attention layer so that block selection is always called, improving robustness in first-chunk scenarios.
- Added tests for attention kernel behavior with enhanced input patterns.

2026-01-31 14:48:23 +08:00

activation.py

fix

2025-06-15 13:28:29 +08:00

attention.py

WIP: Enhance sparse attention with density tracking and block selection improvements

2026-01-31 14:48:23 +08:00

embed_head.py

simplify

2025-08-31 20:02:51 +08:00

graphed_layers.py

⚡ feat: add Phase 5 CUDA Graph optimization for chunked prefill