nano-vllm/nanovllm/kvcache/sparse/full_policy.py at 8d19e61446e2ebd0310a3867297a669bea87ccd3

Files

Zijie Tian 8d19e61446 ⚡️ perf: replace Triton merge with FlashInfer merge_state

Use FlashInfer's optimized merge_state kernel for attention output merging
in chunked prefill. End-to-end improvement: +0.8% (32K) to +2.4% (64K).

Key changes:
- Add merge_attention_outputs_flashinfer() with LSE format conversion
- FlashInfer uses log2, flash_attn uses ln: convert via LOG2_E/LN_2
- Keep original Triton kernel for fallback

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-28 10:04:38 +08:00

19 KiB

Raw Blame History

View Raw

19 KiB Raw Blame History

19 KiB

Raw Blame History