Commit Graph

1 commit

Author: Zijie Tian
ef37d4f1a8 🐛 docs: document XAttention offload GQA buffer OOM issue
Document an OOM issue when using XAttention BSA + CPU offload
with large models (GLM-4-9B) on 24GB GPUs.

Issue: an 8GB allocation for the k_expanded buffer fails because
it is sized by num_heads instead of num_kv_heads in GQA models.

Root cause analysis and proposed fix included.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Date: 2026-02-05 02:46:50 +08:00
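The root cause described in the commit can be sketched with back-of-the-envelope arithmetic. The head counts, sequence length, and head dimension below are illustrative assumptions (typical of a GLM-4-9B-class GQA model), not values taken from the repository:

```python
def kv_buffer_bytes(n_heads: int, seq_len: int, head_dim: int,
                    dtype_bytes: int = 2) -> int:
    """Size in bytes of a [n_heads, seq_len, head_dim] fp16 KV buffer."""
    return n_heads * seq_len * head_dim * dtype_bytes

# Illustrative GQA config (assumed, not from the repo):
num_heads = 32      # query heads
num_kv_heads = 2    # KV heads shared across query-head groups
seq_len = 1 << 20   # long offloaded context
head_dim = 128

# Buggy sizing: expanding K per *query* head -> 2^33 bytes = 8 GiB,
# matching the failed 8GB allocation in the commit message.
wrong = kv_buffer_bytes(num_heads, seq_len, head_dim)

# Proposed fix: size per *KV* head -> 2^29 bytes = 0.5 GiB.
right = kv_buffer_bytes(num_kv_heads, seq_len, head_dim)

print(f"{wrong / 2**30:.1f} GiB vs {right / 2**30:.1f} GiB "
      f"({wrong // right}x overallocation)")
```

The overallocation factor is exactly the GQA group size (num_heads / num_kv_heads), so the waste grows with more aggressive KV-head sharing.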