nano-vllm

Author	SHA1	Message	Date
Zijie Tian	8d19e61446	⚡️ perf: replace Triton merge with FlashInfer merge_state Use FlashInfer's optimized merge_state kernel for attention output merging in chunked prefill. End-to-end improvement: +0.8% (32K) to +2.4% (64K). Key changes: - Add merge_attention_outputs_flashinfer() with LSE format conversion - FlashInfer uses log2, flash_attn uses ln: convert via LOG2_E/LN_2 - Keep original Triton kernel for fallback Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-28 10:04:38 +08:00
Zijie Tian	2c2383c786	⚡️ perf: optimize XAttention estimate with hierarchical block sum Replace slow softmax_fuse_block_sum (block_size=4096) with optimized hierarchical approach (estimate_block_size=1024): - Add estimate_block_size parameter to XAttentionBSAPolicy (default 1024) - Rewrite select_blocks to use hierarchical aggregation: 1. Fine-grained softmax with small block size (15x faster kernel) 2. Aggregate to CPU block level via reshape + sum 3. Score + threshold selection (replaces mask + voting) Performance improvement (CPU Offload mode): - softmax_fuse_block_sum: 48% → 1% of total time (44x faster) - 128K: XAttention now +2.4% faster than Full (was -59%) - 64K: -3.8% (was -21%) - 32K: -6.0% (was -14%) Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-28 06:47:13 +08:00
Zijie Tian	e874229adc	📝 docs: add comprehensive GPU-only vs Offload benchmark results - Add --block-size argument to bench.py for configurable KV cache block size - Update bench_offload_results.md with complete benchmark analysis: - GPU-only: XAttention shows +15% to +41% speedup - CPU Offload: XAttention shows -14% to -59% slowdown - Block size 4096 recommended for best performance - Document why XAttention hurts Offload mode (transfer bottleneck) Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-27 22:32:07 +08:00
Zijie Tian	73c9dc46ff	✨ feat: add XAttention BSA support to bench_offload.py - Add --model parameter (default: Llama-3.1-8B-Instruct) - Add --enable-xattn flag for XAttention BSA sparse prefill - Add --xattn-threshold and --xattn-stride parameters - Change default num-gpu-blocks from 6 to 4 - Add benchmark results doc with Full vs XAttn comparison (32K/128K) Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-27 04:20:16 +08:00
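
Note on commit 2c2383c786: the hierarchical aggregation it describes (fine-grained block sums, then reshape + sum up to the CPU block level, then score + threshold selection) can be sketched as below. This is a minimal pure-Python illustration, not the repository's `select_blocks`. The helper names, the `ratio` parameter (for example 4096 / 1024 = 4), and the cumulative-mass threshold rule are all assumptions made for the example.

```python
def aggregate_block_sums(fine_sums, ratio):
    """Sum every `ratio` consecutive fine-grained block sums into one coarse score.

    Equivalent to reshaping to (num_coarse, ratio) and summing over the last axis.
    """
    if ratio <= 0 or len(fine_sums) % ratio != 0:
        raise ValueError("fine_sums length must be a positive multiple of ratio")
    return [sum(fine_sums[i:i + ratio]) for i in range(0, len(fine_sums), ratio)]


def select_blocks_by_mass(coarse_scores, threshold):
    """Hypothetical selection rule: keep the highest-scoring coarse blocks until
    their cumulative score reaches `threshold` times the total score."""
    total = sum(coarse_scores)
    order = sorted(range(len(coarse_scores)),
                   key=lambda i: coarse_scores[i], reverse=True)
    chosen, mass = [], 0.0
    for i in order:
        chosen.append(i)
        mass += coarse_scores[i]
        if mass >= threshold * total:
            break
    return sorted(chosen)


# Twelve fine-grained sums, grouped four at a time into three coarse blocks.
fine = [0.01, 0.02, 0.01, 0.01,
        0.30, 0.20, 0.10, 0.05,
        0.05, 0.05, 0.05, 0.15]
coarse = aggregate_block_sums(fine, ratio=4)
selected = select_blocks_by_mass(coarse, threshold=0.9)
```

The point of the two-level layout is that the softmax kernel runs at the smaller block size while selection still happens at the granularity of the blocks that are actually transferred.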

4 Commits
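
Note on commit 8d19e61446: the message mentions merging attention outputs in chunked prefill and converting the log-sum-exp (LSE) between natural log and log2. The sketch below shows the underlying arithmetic for one query row in pure Python. It is an illustration only, not FlashInfer's `merge_state` kernel and not the repository's `merge_attention_outputs_flashinfer()`. The helper names are hypothetical, and `LOG2_E` / `LN_2` are defined here as plain constants under the assumption that they mean log2(e) and ln(2).

```python
import math

LOG2_E = math.log2(math.e)  # natural-log LSE * LOG2_E -> base-2 LSE
LN_2 = math.log(2.0)        # base-2 LSE * LN_2 -> natural-log LSE


def attention_row(scores, values):
    """Reference softmax attention for one query; returns (output, natural-log LSE)."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / z for d in range(dim)]
    return out, m + math.log(z)


def merge_partial(o_a, lse_a, o_b, lse_b):
    """Merge two partial outputs computed over disjoint key chunks.

    Both LSE values are in natural log. Each output is weighted by its share
    of the combined softmax denominator.
    """
    m = max(lse_a, lse_b)
    w_a = math.exp(lse_a - m)
    w_b = math.exp(lse_b - m)
    total = w_a + w_b
    merged = [(w_a * x + w_b * y) / total for x, y in zip(o_a, o_b)]
    return merged, m + math.log(total)


def merge_partial_log2(o_a, lse2_a, o_b, lse2_b):
    """Same merge, but with base-2 LSE inputs and a base-2 LSE result."""
    merged, lse = merge_partial(o_a, lse2_a * LN_2, o_b, lse2_b * LN_2)
    return merged, lse * LOG2_E


scores = [0.5, -1.0, 2.0, 0.1]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]]

full_out, full_lse = attention_row(scores, values)
out_a, lse_a = attention_row(scores[:2], values[:2])
out_b, lse_b = attention_row(scores[2:], values[2:])
merged_out, merged_lse = merge_partial(out_a, lse_a, out_b, lse_b)
merged2_out, merged2_lse = merge_partial_log2(out_a, lse_a * LOG2_E,
                                              out_b, lse_b * LOG2_E)
```

Merging the two chunk results reproduces the full-attention output, and the base-2 variant differs only by the constant factor applied to the LSE on the way in and out.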