✨ feat: add nanovllm.ops module with XAttention estimation kernels
Add ops module ported from tzj/minference branch containing:

- xattn.py: XAttention block importance estimation with Triton kernels
- xattn_estimate(): standard estimation for sparse attention mask
- xattn_estimate_chunked(): chunked prefill compatible version
- flat_group_gemm_fuse_reshape(): fused stride reshape + GEMM kernel
- softmax_fuse_block_sum(): online softmax + block-wise sum kernel
- chunked_attention.py: Flash attention with LSE output for chunk merging
- test_xattn_estimate_chunked.py: verification test (all seq_lens pass)

This prepares the foundation for the AttentionPolicy refactoring, where XAttentionPolicy.estimate() will call these ops.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
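The softmax + block-wise sum that softmax_fuse_block_sum() fuses into one Triton kernel can be sketched unfused in plain NumPy. This is a reference for what such a kernel computes, not the kernel's actual signature; the function name, shapes, and the tile-sum layout are illustrative assumptions:

```python
import numpy as np

def block_importance(scores, block_size):
    """Reference (unfused) softmax + block-wise sum.

    scores: (q_len, k_len) attention logits.
    Returns a (q_len // block_size, k_len // block_size) map where each
    entry is the total softmax probability mass falling inside that
    (block_size x block_size) tile -- a per-block importance score.
    Assumes q_len and k_len are multiples of block_size for brevity.
    """
    q_len, k_len = scores.shape
    # row-wise numerically stable softmax over all keys
    p = np.exp(scores - scores.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    nq, nk = q_len // block_size, k_len // block_size
    # sum probabilities within each tile
    return p.reshape(nq, block_size, nk, block_size).sum(axis=(1, 3))
```

A sparse attention mask could then be built by keeping, per query block, the key blocks whose cumulative importance exceeds a threshold; the fused kernel avoids materializing the full probability matrix that this reference builds.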
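chunked_attention.py returns the log-sum-exp (LSE) alongside the attention output precisely so that outputs computed over separate key/value chunks can be merged exactly. A minimal NumPy sketch of that merge rule (the function name and shapes here are illustrative, not the module's actual API):

```python
import numpy as np

def merge_attn_chunks(o1, lse1, o2, lse2):
    """Merge partial attention outputs computed over two key/value chunks.

    o_i:   (q_len, head_dim) partial output, softmax-normalized within chunk i
    lse_i: (q_len,) log-sum-exp of the attention logits over chunk i
    """
    lse = np.logaddexp(lse1, lse2)        # combined normalizer in log space
    w1 = np.exp(lse1 - lse)[:, None]      # chunk 1's share of the merged softmax
    w2 = np.exp(lse2 - lse)[:, None]
    return w1 * o1 + w2 * o2, lse
```

The merged result matches full-sequence softmax attention exactly (the weights rescale each chunk's local normalizer to the global one), which is why a chunked-prefill path needs the LSE and not just the output.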
nanovllm/ops/xattn.py | 1167 (new file)