Zijie Tian
2c2383c786
⚡ perf: optimize XAttention estimate with hierarchical block sum
Replace the slow softmax_fuse_block_sum path (block_size=4096) with an
optimized hierarchical approach (estimate_block_size=1024):
- Add estimate_block_size parameter to XAttentionBSAPolicy (default 1024)
- Rewrite select_blocks to use hierarchical aggregation:
1. Fine-grained softmax with small block size (15x faster kernel)
2. Aggregate to CPU block level via reshape + sum
3. Score + threshold selection (replaces mask + voting)
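The three steps above can be sketched in PyTorch as follows. Function name, tensor shapes, and the top-mass selection rule are hypothetical illustrations; the actual kernel is fused and operates on attention estimates, not full score matrices:

```python
import torch

def hierarchical_block_scores(attn_logits, estimate_block_size=1024,
                              cpu_block_size=4096, threshold=0.9):
    # attn_logits: (heads, q_len, kv_len) -- assumed layout for illustration.
    h, q, kv = attn_logits.shape
    # 1. Fine-grained softmax, summed over small estimate blocks.
    probs = torch.softmax(attn_logits, dim=-1)
    fine = probs.reshape(h, q, kv // estimate_block_size,
                         estimate_block_size).sum(-1)
    # 2. Aggregate fine blocks up to CPU block granularity via reshape + sum.
    ratio = cpu_block_size // estimate_block_size
    coarse = fine.reshape(h, q, -1, ratio).sum(-1)
    # 3. Score + threshold selection: keep the smallest set of CPU blocks
    #    covering `threshold` of the total attention mass.
    scores = coarse.sum(dim=(0, 1))
    order = torch.argsort(scores, descending=True)
    csum = torch.cumsum(scores[order], 0) / scores.sum()
    k = int((csum < threshold).sum().item()) + 1
    return order[:k]
```

Because the softmax runs at the small estimate_block_size while selection happens at the coarse CPU block size, the expensive kernel stays fast and only the cheap reshape + sum bridges the two granularities.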
Performance improvement (CPU Offload mode):
- softmax_fuse_block_sum: 48% → 1% of total time (44x faster)
- 128K: XAttention now +2.4% vs Full (was -59%)
- 64K: -3.8% (was -21%)
- 32K: -6.0% (was -14%)
Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)
Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-28 06:47:13 +08:00
Zijie Tian
e874229adc
📝 docs: add comprehensive GPU-only vs Offload benchmark results
- Add --block-size argument to bench.py for configurable KV cache block size
- Update bench_offload_results.md with complete benchmark analysis:
- GPU-only: XAttention shows +15% to +41% speedup
- CPU Offload: XAttention shows -14% to -59% slowdown
- Block size 4096 recommended for best performance
- Document why XAttention hurts Offload mode (transfer bottleneck)
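A minimal argparse sketch of the new flag; the surrounding bench.py setup is assumed, and the help text simply echoes the recommendation above:

```python
import argparse

# Hypothetical sketch of the --block-size flag added to bench.py.
parser = argparse.ArgumentParser(description="bench.py (sketch)")
parser.add_argument("--block-size", type=int, default=4096,
                    help="KV cache block size; 4096 recommended")

args = parser.parse_args(["--block-size", "2048"])
print(args.block_size)
```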
Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)
Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 22:32:07 +08:00
Zijie Tian
73c9dc46ff
✨ feat: add XAttention BSA support to bench_offload.py
- Add --model parameter (default: Llama-3.1-8B-Instruct)
- Add --enable-xattn flag for XAttention BSA sparse prefill
- Add --xattn-threshold and --xattn-stride parameters
- Change default num-gpu-blocks from 6 to 4
- Add benchmark results doc with Full vs XAttn comparison (32K/128K)
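The CLI surface described above can be sketched with argparse. Flag names and the num-gpu-blocks default come from the bullet list; the defaults for --xattn-threshold and --xattn-stride are placeholders, not the script's actual values:

```python
import argparse

def build_parser():
    # Hypothetical sketch of bench_offload.py's CLI; threshold/stride
    # defaults below are illustrative placeholders.
    p = argparse.ArgumentParser(
        description="offload benchmark with XAttention BSA (sketch)")
    p.add_argument("--model", default="Llama-3.1-8B-Instruct")
    p.add_argument("--enable-xattn", action="store_true",
                   help="enable XAttention BSA sparse prefill")
    p.add_argument("--xattn-threshold", type=float, default=0.9)
    p.add_argument("--xattn-stride", type=int, default=16)
    p.add_argument("--num-gpu-blocks", type=int, default=4)
    return p

args = build_parser().parse_args(["--enable-xattn"])
```

Keeping XAttention behind an opt-in flag lets the same script produce the Full vs XAttn comparison runs mentioned above.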
Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)
Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 04:20:16 +08:00