Zijie Tian
e874229adc
📝 docs: add comprehensive GPU-only vs Offload benchmark results
...
- Add --block-size argument to bench.py for configurable KV cache block size
- Update bench_offload_results.md with complete benchmark analysis:
  - GPU-only: XAttention shows +15% to +41% speedup
  - CPU Offload: XAttention shows -14% to -59% slowdown
  - Block size 4096 recommended for best performance
- Document why XAttention hurts Offload mode (transfer bottleneck)
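A flag like the one above is typically wired through argparse; a minimal sketch (only `--block-size` itself comes from this commit, the surrounding parser is illustrative):

```python
import argparse

# Hypothetical bench.py argument wiring; the description strings are assumed.
parser = argparse.ArgumentParser(description="nano-vllm benchmark (sketch)")
parser.add_argument("--block-size", type=int, default=4096,
                    help="KV cache block size; 4096 performed best in the benchmarks")

args = parser.parse_args(["--block-size", "8192"])
print(args.block_size)  # 8192
```

Leaving the default at 4096 matches the recommendation above, so existing invocations keep the best-performing configuration.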
Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)
Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 22:32:07 +08:00
Zijie Tian
9177b62d7f
✨ feat: add --enforce-eager option to bench.py
...
Allow disabling CUDA graphs for benchmarking comparison between
eager mode and graph mode execution.
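Such an on/off switch is naturally a `store_true` flag; a minimal sketch (how the flag reaches the engine config is assumed, not shown in the commit):

```python
import argparse

# --enforce-eager disables CUDA graph capture so eager-mode and graph-mode
# throughput can be compared with otherwise identical settings.
parser = argparse.ArgumentParser()
parser.add_argument("--enforce-eager", action="store_true",
                    help="disable CUDA graphs and run the model eagerly")

eager = parser.parse_args(["--enforce-eager"]).enforce_eager
graph = parser.parse_args([]).enforce_eager
print(eager, graph)  # True False
```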
Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)
Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 09:19:53 +08:00
Zijie Tian
a504bd873d
⚡ perf: pre-allocate GQA buffers in XAttention policy
...
Add alloc_policy_metadata() method to SparsePolicy base class for
pre-allocating GPU buffers during initialization. This avoids
dynamic memory allocation during forward pass.
Changes:
- Add alloc_policy_metadata() to SparsePolicy base class
- Implement GQA buffer pre-allocation in XAttentionBSAPolicy
- Call alloc_policy_metadata() in model_runner for GPU-only mode
- Modify compute_prefill() to reuse pre-allocated buffers
- Add --gpu-util parameter to bench.py
Memory savings:
- Previously: 2x GQA expansion (~2GB for 64K)
- Now: 1x pre-allocated buffer (~1GB for 64K, reused)
Tested:
- GPU-only 32K: 5602 tok/s (512MB pre-allocated)
- GPU-only 64K: 4821 tok/s (1GB pre-allocated, gpu_util=0.7)
- Offload Full: PASSED (no changes to offload path)
- Offload XAttention: PASSED (uses compute_chunked_prefill)
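The pre-allocation idea can be sketched as follows, assuming the class and method names (they are not the repo's API): allocate the GQA-expanded buffer once at init, then write into it on every prefill instead of materializing a fresh expanded copy per forward pass.

```python
import numpy as np

class GQABuffers:
    """Illustrative stand-in for alloc_policy_metadata()-style pre-allocation:
    one persistent buffer holds the K heads expanded to the query-head count."""

    def __init__(self, max_tokens, num_kv_heads, group_size, head_dim):
        self.group_size = group_size
        # Allocated once; sized for the expanded (num_q_heads) layout.
        self.k_buf = np.empty((max_tokens, num_kv_heads * group_size, head_dim),
                              dtype=np.float32)

    def expand_k(self, k):
        # k: (tokens, num_kv_heads, head_dim). Broadcast each KV head into
        # group_size query-head slots of the pre-allocated buffer.
        t, h, d = k.shape
        out = self.k_buf[:t]
        out.reshape(t, h, self.group_size, d)[...] = k[:, :, None, :]
        return out

buf = GQABuffers(max_tokens=8, num_kv_heads=2, group_size=4, head_dim=3)
k = np.arange(2 * 2 * 3, dtype=np.float32).reshape(2, 2, 3)
expanded = buf.expand_k(k)
print(expanded.shape)  # (2, 8, 3)
```

Because `expand_k` returns a view into the persistent buffer, repeated prefill calls reuse the same memory, matching the 2x-to-1x savings described above.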
Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)
Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 05:49:23 +08:00
Zijie Tian
076656c9c2
✨ feat: add GPU-only XAttention BSA sparse attention support
...
- Implement compute_prefill() in XAttentionBSAPolicy for GPU-only mode
- Uses xattn_estimate to compute sparse block mask
- Uses block_sparse_attn_func for efficient sparse attention
- Handles GQA by expanding K/V heads
- Falls back to flash_attn for paged KV cache (prefix cache)
- Implement compute_decode() by delegating to FullAttentionPolicy
- Add --policy xattn option to bench.py
Verified: RULER 32k niah_single_1 5/5 samples passed (100%)
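The mask-estimation step can be illustrated with a toy stand-in for `xattn_estimate` (this is only the idea, not the real estimator, which uses strided antidiagonal sampling): pool Q and K per block, score block pairs, and keep the highest-scoring key blocks per query block.

```python
import numpy as np

def estimate_block_mask(q, k, block_size, keep_ratio=0.5):
    """Toy block-mask estimator (function name and pooling scheme assumed):
    mean-pool Q and K per block, score every (query block, key block) pair
    with a pooled dot product, keep the top fraction per query block."""
    nq, nk = q.shape[0] // block_size, k.shape[0] // block_size
    qp = q.reshape(nq, block_size, -1).mean(axis=1)   # (nq, d) pooled queries
    kp = k.reshape(nk, block_size, -1).mean(axis=1)   # (nk, d) pooled keys
    scores = qp @ kp.T                                # (nq, nk) block relevance
    keep = max(1, int(keep_ratio * nk))
    mask = np.zeros((nq, nk), dtype=bool)
    top = np.argsort(scores, axis=1)[:, -keep:]       # top-k key blocks per row
    np.put_along_axis(mask, top, True, axis=1)
    return mask

rng = np.random.default_rng(0)
q = rng.standard_normal((64, 16))
k = rng.standard_normal((64, 16))
mask = estimate_block_mask(q, k, block_size=16, keep_ratio=0.5)
print(mask.shape)  # (4, 4)
```

A mask like this is what a kernel such as `block_sparse_attn_func` consumes: attention is computed only for the block pairs marked True.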
Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)
Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 05:19:24 +08:00
Zijie Tian
09b2136e9f
✨ feat: integrate sparse policy architecture into GPU-only mode
...
- Add compute_prefill() and compute_decode() GPU-only methods to SparsePolicy base class
- Implement GPU-only methods in FullAttentionPolicy using flash_attn
- Add sparse_policy parameter to GPUOnlyManager
- Update create_kvcache_manager() to create FullAttentionPolicy for GPU-only mode
- Route GPU-only attention through sparse_policy in attention.py
- Pass kvcache_manager to context for policy access
- Add --enable-policy flag to bench.py for testing
- Handle warmup phase when kvcache_manager is not yet allocated
This allows GPU-only mode to use the same policy architecture as CPU offload mode,
enabling future sparse attention implementations (Quest, XAttention) in GPU-only mode.
Performance verified: ~4890 tok/s (unchanged from baseline)
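The routing described above can be sketched as a small strategy pattern (method names come from the commit message; the bodies are illustrative stand-ins, not the real flash_attn calls):

```python
class SparsePolicy:
    """Base-class contract: each policy owns how attention is computed
    for prefill and decode in GPU-only mode."""

    def compute_prefill(self, q, k, v):
        raise NotImplementedError

    def compute_decode(self, q, k_cache, v_cache):
        raise NotImplementedError


class FullAttentionPolicy(SparsePolicy):
    # The real implementation calls flash_attn kernels; string tags stand
    # in here so the routing itself can be exercised.
    def compute_prefill(self, q, k, v):
        return "flash_attn prefill"

    def compute_decode(self, q, k_cache, v_cache):
        return "flash_attn decode"


def attention_forward(policy, q, k, v, is_prefill):
    # attention.py routes through the policy instead of calling a kernel
    # directly, so Quest or XAttention variants slot in without touching
    # the attention layer itself.
    if is_prefill:
        return policy.compute_prefill(q, k, v)
    return policy.compute_decode(q, k, v)


print(attention_forward(FullAttentionPolicy(), None, None, None, True))
```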
Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)
Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 05:08:02 +08:00
Zijie Tian
c717072f31
✨ feat: add --model argument to bench.py for configurable model path
...
Previously bench.py had a hardcoded model path. It now accepts a --model
argument (default: Llama-3.1-8B-Instruct) to align with bench_offload.py.
Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)
Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 04:36:17 +08:00
Zijie Tian
aa953ecb59
[refactor] Aligned the bench.
2026-01-07 04:25:06 +08:00
Zijie Tian
82ed34fc2d
[opt] Optimize nano-vllm performance to be comparable with vLLM.
2025-12-25 03:47:07 +08:00
Zijie Tian
08d83185ce
[fix] fix bench*.py.
2025-12-22 19:53:50 +08:00
Zijie Tian
051f2295c9
[feat] Added sparse KV cache feature, NEEDS VERIFICATION.
2025-12-22 08:51:02 +08:00
Zijie Tian
0b6f19242d
[feat] Added chunked prefill and KV cache offload mechanism.
2025-12-10 03:47:37 +08:00
Zijie Tian
761929390e
[bench] Added vLLM vs nano-vllm benchmark.
2025-12-10 00:44:57 +08:00
GeeeekExplorer
801365a611
update bench
2025-06-19 23:28:11 +08:00
cheunglei
b5ace32982
use spawn
2025-06-17 23:49:15 +08:00
GeeeekExplorer
59aa3ff57c
better
2025-06-13 13:07:33 +08:00
GeeeekExplorer
135d1b38a2
release
2025-06-13 09:01:08 +08:00
GeeeekExplorer
ec3c60d96f
update bench
2025-06-12 22:54:51 +08:00
GeeeekExplorer
fee58d44e4
fix
2025-06-12 01:00:31 +08:00
GeeeekExplorer
b98e1ca305
fix
2025-06-10 21:25:54 +08:00
GeeeekExplorer
a5a4909e6a
init commit
2025-06-10 00:27:01 +08:00