📝 docs: add comprehensive GPU-only vs Offload benchmark results
- Add --block-size argument to bench.py for configurable KV cache block size
- Update bench_offload_results.md with complete benchmark analysis:
  - GPU-only: XAttention shows +15% to +41% speedup
  - CPU Offload: XAttention shows -14% to -59% slowdown
  - Block size 4096 recommended for best performance
  - Document why XAttention hurts Offload mode (transfer bottleneck)

Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
This commit is contained in:
3
bench.py
3
bench.py
@@ -58,6 +58,8 @@ def main():
|
||||
help="Enable sparse policy routing (FullAttentionPolicy by default)")
|
||||
parser.add_argument("--gpu-util", type=float, default=0.9,
|
||||
help="GPU memory utilization (default: 0.9)")
|
||||
parser.add_argument("--block-size", type=int, default=1024,
|
||||
help="KV cache block size (default: 1024)")
|
||||
parser.add_argument("--enforce-eager", action="store_true",
|
||||
help="Disable CUDA graphs (default: False)")
|
||||
args = parser.parse_args()
|
||||
@@ -83,6 +85,7 @@ def main():
|
||||
max_num_batched_tokens=max_len,
|
||||
sparse_policy=sparse_policy,
|
||||
gpu_memory_utilization=args.gpu_util,
|
||||
kvcache_block_size=args.block_size,
|
||||
)
|
||||
|
||||
# Warmup
|
||||
|
||||
Reference in New Issue
Block a user