nano-vllm

zijie-tian/nano-vllm

Commit Graph

Author

SHA1

Message

Date

Zijie Tian

e874229adc

📝 docs: add comprehensive GPU-only vs Offload benchmark results

- Add --block-size argument to bench.py for configurable KV cache block size
- Update bench_offload_results.md with complete benchmark analysis:
  - GPU-only: XAttention shows +15% to +41% speedup
  - CPU Offload: XAttention shows -14% to -59% slowdown
  - Block size 4096 recommended for best performance
  - Document why XAttention hurts Offload mode (transfer bottleneck)

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>

2026-01-27 22:32:07 +08:00

Zijie Tian

73c9dc46ff

✨ feat: add XAttention BSA support to bench_offload.py

- Add --model parameter (default: Llama-3.1-8B-Instruct)
- Add --enable-xattn flag for XAttention BSA sparse prefill
- Add --xattn-threshold and --xattn-stride parameters
- Change default num-gpu-blocks from 6 to 4
- Add benchmark results doc with Full vs XAttn comparison (32K/128K)

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>

2026-01-27 04:20:16 +08:00
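
The flags listed in commit 73c9dc46ff could be wired up roughly as follows. This is a minimal sketch, not the contents of bench_offload.py: the flag names `--model`, `--enable-xattn`, `--xattn-threshold` and `--xattn-stride` and the two stated defaults come from the commit message, while the `--num-gpu-blocks` spelling, the argument types, and the `None` defaults for threshold and stride are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags named in the commit message."""
    parser = argparse.ArgumentParser(
        description="Offload benchmark options (illustrative sketch)"
    )
    # Default taken from the commit message.
    parser.add_argument("--model", default="Llama-3.1-8B-Instruct")
    # Boolean switch for XAttention BSA sparse prefill.
    parser.add_argument("--enable-xattn", action="store_true")
    # Types and defaults below are assumptions; the commit does not state them.
    parser.add_argument("--xattn-threshold", type=float, default=None)
    parser.add_argument("--xattn-stride", type=int, default=None)
    # The commit changes this default from 6 to 4; the flag spelling is assumed.
    parser.add_argument("--num-gpu-blocks", type=int, default=4)
    return parser


# Parse an explicit argument list so the sketch runs without a real CLI call.
example_args = build_parser().parse_args(["--enable-xattn", "--xattn-stride", "8"])
print(example_args)
```

Passing an explicit list to `parse_args` keeps the example independent of `sys.argv`; a real script would normally call `parse_args()` with no arguments.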

2 Commits