Document OOM in XAttention BSA + CPU offload (GLM-4-9B, 24GB GPUs)

Issue: an 8GB allocation for the k_expanded buffer fails because the
buffer is sized with num_heads instead of num_kv_heads, over-allocating
by the GQA group factor. Root cause analysis and a proposed fix are
included.
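A minimal sketch of the sizing mismatch. The head counts, sequence
length, and dtype below are illustrative assumptions chosen to reproduce
an 8 GiB figure, not values read from the actual GLM-4-9B config or the
real buffer-allocation code:

```python
def k_expanded_bytes(num_heads: int, seq_len: int, head_dim: int,
                     dtype_bytes: int = 2, batch: int = 1) -> int:
    """Bytes needed for a [batch, num_heads, seq_len, head_dim] buffer."""
    return batch * num_heads * seq_len * head_dim * dtype_bytes

num_q_heads = 32     # assumed query-head count
num_kv_heads = 2     # assumed KV-head count under GQA
seq_len = 1_048_576  # assumed context length
head_dim = 128       # assumed head dimension, fp16 (2 bytes)

# Bug: the K buffer holds num_kv_heads heads, but is sized with num_heads.
buggy = k_expanded_bytes(num_q_heads, seq_len, head_dim)
fixed = k_expanded_bytes(num_kv_heads, seq_len, head_dim)
print(f"buggy: {buggy / 2**30:.1f} GiB, fixed: {fixed / 2**30:.1f} GiB")
# buggy: 8.0 GiB, fixed: 0.5 GiB
```

With these assumed values the mismatch inflates the buffer by the GQA
group factor num_q_heads / num_kv_heads = 16x, turning a 0.5 GiB
allocation into the 8 GiB request that OOMs on a 24GB GPU.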
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>