Commit Graph

2 Commits

Author SHA1 Message Date
Zijie Tian
e874229adc 📝 docs: add comprehensive GPU-only vs Offload benchmark results
- Add --block-size argument to bench.py for configurable KV cache block size
- Update bench_offload_results.md with complete benchmark analysis:
  - GPU-only: XAttention shows +15% to +41% speedup
  - CPU Offload: XAttention shows -14% to -59% slowdown
  - Block size 4096 recommended for best performance
  - Document why XAttention hurts Offload mode (transfer bottleneck)

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 22:32:07 +08:00
Zijie Tian
73c9dc46ff feat: add XAttention BSA support to bench_offload.py
- Add --model parameter (default: Llama-3.1-8B-Instruct)
- Add --enable-xattn flag for XAttention BSA sparse prefill
- Add --xattn-threshold and --xattn-stride parameters
- Change default num-gpu-blocks from 6 to 4
- Add benchmark results doc with Full vs XAttn comparison (32K/128K)

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 04:20:16 +08:00