nano-vllm/nanovllm/kvcache/sparse/policy.py at 52b12a89e33458cd53c23f7af15ed54b410990cc

Files

Zijie Tian 11a867f6fb 🐛 fix: skip GQA buffer allocation in XAttention offload mode

In offload mode, GQA expansion buffers (_k_expanded, _v_expanded) are not
needed since compute_chunked_prefill() handles GQA inline. Previously,
these buffers were always allocated based on max_model_len, causing OOM
on 24GB GPUs (e.g., RTX 3090) when max_model_len=1M (16GB buffer).

Changes:
- Add enable_cpu_offload parameter to alloc_policy_metadata() in base class
- Skip GQA buffer allocation when enable_cpu_offload=True in XAttentionBSAPolicy
- Pass enable_cpu_offload from model_runner to policy

Memory savings: ~16GB for 1M seq, ~1.1GB for 72K seq

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-02-05 02:57:18 +08:00

14 KiB

Raw Blame History

View Raw

14 KiB Raw Blame History

14 KiB

Raw Blame History