⚡ perf: pre-allocate GQA buffers in XAttention policy

Add alloc_policy_metadata() method to SparsePolicy base class for pre-allocating GPU buffers during initialization. This avoids dynamic memory allocation during forward pass.

Changes:
- Add alloc_policy_metadata() to SparsePolicy base class
- Implement GQA buffer pre-allocation in XAttentionBSAPolicy
- Call alloc_policy_metadata() in model_runner for GPU-only mode
- Modify compute_prefill() to reuse pre-allocated buffers
- Add --gpu-util parameter to bench.py

Memory savings:
- Previously: 2x GQA expansion (~2GB for 64K)
- Now: 1x pre-allocated buffer (~1GB for 64K, reused)

Tested:
- GPU-only 32K: 5602 tok/s (512MB pre-allocated)
- GPU-only 64K: 4821 tok/s (1GB pre-allocated, gpu_util=0.7)
- Offload Full: PASSED (no changes to offload path)
- Offload XAttention: PASSED (uses compute_chunked_prefill)

Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
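
As a sanity check on the pre-allocation sizes quoted above, here is a rough back-of-the-envelope sketch. The commit does not state the model configuration or the exact buffer layout. The values below (32 query heads, head_dim 128, a 2-byte dtype such as bf16, and one expanded buffer each for K and V) are assumptions, chosen only because they reproduce the 512MB and 1GB figures. `expanded_kv_bytes` is a hypothetical helper, not part of nanovllm.

```python
def expanded_kv_bytes(num_heads, head_dim, seq_len, bytes_per_elem=2, num_buffers=2):
    """Bytes needed to hold K and V after GQA expansion to `num_heads` heads.

    Assumed layout (not confirmed by the commit): `num_buffers` tensors
    (K and V), each of shape [seq_len, num_heads, head_dim].
    """
    return num_buffers * num_heads * seq_len * head_dim * bytes_per_elem


MIB = 1024 ** 2

# Assumed configuration: 32 query heads, head_dim 128, 2-byte elements.
for seq_len in (32 * 1024, 64 * 1024):
    size = expanded_kv_bytes(num_heads=32, head_dim=128, seq_len=seq_len)
    print(f"{seq_len // 1024}K tokens: {size // MIB} MiB")
# prints:
# 32K tokens: 512 MiB
# 64K tokens: 1024 MiB
```

Under these assumptions the buffer grows linearly with max_seq_len, which is consistent with sizing it once from config.max_model_len at initialization and reusing it on every prefill.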
2026-01-27 05:49:23 +08:00
parent 076656c9c2
commit a504bd873d
4 changed files with 116 additions and 7 deletions
--- a/nanovllm/engine/model_runner.py
+++ b/nanovllm/engine/model_runner.py
@@ -208,6 +208,19 @@ class ModelRunner:
                device=torch.device("cuda"),
            )

+            # GPU-only mode: pre-allocate policy metadata buffers
+            # This avoids dynamic GPU memory allocation during forward pass
+            if not config.enable_cpu_offload:
+                num_heads = hf_config.num_attention_heads // self.world_size
+                self.kvcache_manager.sparse_policy.alloc_policy_metadata(
+                    num_heads=num_heads,
+                    num_kv_heads=num_kv_heads,
+                    head_dim=head_dim,
+                    max_seq_len=config.max_model_len,
+                    dtype=hf_config.torch_dtype,
+                    device=torch.device("cuda"),
+                )
+
            # Log policy info (handle both enum and None cases)
            policy_name = config.sparse_policy.name if config.sparse_policy is not None else "FULL"
            logger.info(