Implement three-phase KV chunking for sparse attention estimation:
1. softmax_compute_partial_stats: compute softmax stats (row max m, exp-sum l) per KV chunk
2. merge_softmax_stats: merge the per-chunk (m, l) stats into global stats on the host
3. softmax_normalize_and_block_sum: normalize chunk scores with the global stats and reduce them to per-block sums
This allows computing sparse attention masks without materializing the full
raw attention score matrix in GPU memory, reducing peak memory usage
from O(q_len * k_full_len) to O(q_len * k_chunk_len); the underlying
streaming-softmax math is sketched below.
Key changes:
- Add softmax_partial_stats_kernel with causal mask support
- Add softmax_normalize_block_sum_kernel with kv_offset parameter
- Add Python wrappers for the new kernels (chunked driver loop sketched below)
- Update test script to validate KV chunking alignment
- Add documentation for the new kernels
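A hypothetical driver loop tying the three phases together over KV chunks,
building on the NumPy sketch above. The function name estimate_block_mass, its
arguments, and the causal-mask handling via kv_offset are assumptions for
illustration, not the actual wrapper API:

```python
import numpy as np  # reuses the three phase functions sketched above

def estimate_block_mass(q, k, q_offset, chunk_len, block=64):
    # q: [q_len, d], k: [k_full_len, d]; q_offset is the absolute position of
    # the first query row. k_full_len is assumed divisible by chunk_len, and
    # chunk_len by block, to keep the sketch short.
    q_pos = q_offset + np.arange(q.shape[0])[:, None]

    def chunk_scores(kv_offset):
        kc = k[kv_offset:kv_offset + chunk_len]
        k_pos = kv_offset + np.arange(kc.shape[0])[None, :]
        # Causal mask located via kv_offset; a large negative value stands in
        # for -inf so the (m, l) math stays finite.
        return np.where(k_pos <= q_pos, q @ kc.T, -1e9)

    offsets = range(0, k.shape[0], chunk_len)
    # Pass 1: only one chunk of raw scores is alive at a time.
    stats = [softmax_compute_partial_stats(chunk_scores(o)) for o in offsets]
    m_glob, l_glob = merge_softmax_stats(stats)  # merged on the host
    # Pass 2: recompute each chunk, normalize with global stats, block-sum.
    blocks = [softmax_normalize_and_block_sum(chunk_scores(o), m_glob, l_glob, block)
              for o in offsets]
    return np.concatenate(blocks, axis=-1)  # per key-block attention mass
```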
Test results show near-exact alignment with the xattn_estimate API:
- Density difference: 0.000000
- Mask difference: 0.0044%