In offload mode, the GQA expansion buffers (_k_expanded, _v_expanded)
are not needed because compute_chunked_prefill() handles GQA inline.
Previously, these buffers were always allocated based on max_model_len,
which caused OOM on 24GB GPUs (e.g., RTX 3090) when max_model_len=1M
(16GB of buffers).

Changes:
- Add enable_cpu_offload parameter to alloc_policy_metadata() in base class
- Skip GQA buffer allocation when enable_cpu_offload=True in XAttentionBSAPolicy
- Pass enable_cpu_offload from model_runner to policy
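
The allocation guard described in the list above can be sketched roughly
as follows. This is an illustration only, not the project's code:
PolicySketch, the Shape alias, and the (1, num_heads, max_seq_len,
head_dim) layout are hypothetical, and the sketch records buffer shapes
instead of allocating GPU tensors so that it runs without a GPU.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Shape = Tuple[int, int, int, int]


@dataclass
class PolicySketch:
    """Hypothetical stand-in for a policy such as XAttentionBSAPolicy."""

    k_expanded_shape: Optional[Shape] = None
    v_expanded_shape: Optional[Shape] = None

    def alloc_policy_metadata(
        self,
        num_heads: int,
        head_dim: int,
        max_seq_len: int,
        enable_cpu_offload: bool = False,
    ) -> None:
        if enable_cpu_offload:
            # The offload path handles GQA inline, so no expansion
            # buffers are reserved at all.
            self.k_expanded_shape = None
            self.v_expanded_shape = None
            return
        # Assumed layout; a real policy would allocate tensors of this
        # shape on the GPU here.
        shape = (1, num_heads, max_seq_len, head_dim)
        self.k_expanded_shape = shape
        self.v_expanded_shape = shape
```

The caller (model_runner in this commit) would forward its own
enable_cpu_offload setting when it invokes alloc_policy_metadata().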

Memory savings: ~16GB for a 1M-token sequence, ~1.1GB for a 72K-token
sequence.

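The savings figures can be reproduced with a back-of-the-envelope
calculation. The values below are assumptions chosen because they match
the reported numbers, not values taken from the code: 32 query heads,
head_dim 128, 2-byte elements (fp16/bf16), two buffers (K and V), batch
size 1, and "1M" and "72K" read as 1024*1024 and 72*1024 tokens.

```python
GIB = 1024 ** 3


def gqa_expansion_bytes(num_heads, head_dim, max_seq_len,
                        dtype_bytes=2, num_buffers=2):
    """Total bytes for the K and V expansion buffers at batch size 1."""
    return num_buffers * num_heads * head_dim * max_seq_len * dtype_bytes


# Assumed model shape: 32 query heads, head_dim 128, 2-byte dtype.
size_1m_gib = gqa_expansion_bytes(32, 128, 1024 * 1024) / GIB
size_72k_gib = gqa_expansion_bytes(32, 128, 72 * 1024) / GIB

print(size_1m_gib)   # 16.0
print(size_72k_gib)  # 1.125
```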
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Document OOM issue when using XAttention BSA + CPU offload
with large models (GLM-4-9B) on 24GB GPUs.

Issue: the 8GB allocation for the k_expanded buffer fails because the
buffer is sized with num_heads instead of num_kv_heads in GQA models.

Includes root cause analysis and a proposed fix.
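
To illustrate why the head count matters, the sketch below compares the
size of a single buffer when it is sized with query heads versus KV
heads. All numbers are assumptions for illustration: 32 query heads,
2 KV heads, head_dim 128, 2-byte elements, and a 1024*1024-token
sequence. The helper name is hypothetical.

```python
GIB = 1024 ** 3


def single_buffer_gib(heads, head_dim, seq_len, dtype_bytes=2):
    """Size in GiB of one (heads, seq_len, head_dim) buffer."""
    return heads * head_dim * seq_len * dtype_bytes / GIB


seq_len = 1024 * 1024  # assumed meaning of "1M"

with_query_heads = single_buffer_gib(32, 128, seq_len)  # assumed num_heads
with_kv_heads = single_buffer_gib(2, 128, seq_len)      # assumed num_kv_heads

print(with_query_heads)  # 8.0
print(with_kv_heads)     # 0.5
```

Under these assumptions the buffer shrinks by the ratio
num_heads / num_kv_heads, here a factor of 16.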
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>