nano-vllm/docs at d21b40f48f19f794568955117114be5cfa6ec80c

zijie-tian/nano-vllm

Zijie Tian 42cf124343 📝 docs: add CUDA Graph memory mechanism guide

Document CUDA Graph memory behavior based on actual testing:
- Memory overhead at each stage (model, cache, warmup, capture, replay)
- StaticCache is the main overhead (~144MB for 1K tokens)
- Graph capture adds minimal overhead (~8MB)
- Graph replay requires zero additional allocation
- Performance improvement: ~2.8x decode throughput

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-21 02:59:21 +08:00

architecture_guide.md

✨ feat: add comprehensive RULER benchmark testing

2026-01-18 20:34:06 +08:00

block_sparse_attn_interface.md

📝 docs: add BSA interface documentation and cleanup temp files

2026-01-20 04:27:19 +08:00

chunked_attention_solutions.md

📝 docs: add chunked attention solutions guide and update doc index

2026-01-20 04:48:20 +08:00

cuda_graph_memory_guide.md

📝 docs: add CUDA Graph memory mechanism guide

2026-01-21 02:59:21 +08:00

debugging_guide.md

✨ feat: add comprehensive RULER benchmark testing

2026-01-18 20:34:06 +08:00

known_issues.md

✨ feat: add comprehensive RULER benchmark testing

2026-01-18 20:34:06 +08:00

optimization_guide.md

✨ feat: add comprehensive RULER benchmark testing

2026-01-18 20:34:06 +08:00

ruler_32k_chunked_offload_issue.md

🐛 fix: resolve CPU KV cache state leakage between requests

2026-01-21 01:12:21 +08:00

ruler_benchmark_results_32k.md

✨ feat: add comprehensive RULER benchmark testing

2026-01-18 20:34:06 +08:00

sparse_attention_guide.md

📝 docs: add XAttention algorithm guide based on COMPASS implementation

2026-01-20 02:50:03 +08:00

sparse_policy_architecture.md

♻️ refactor: remove cross-layer pipeline and rename compute_chunked_prefill

2026-01-20 02:10:40 +08:00

sparse_policy_implementation_guide.md

📝 docs: add SparsePolicy implementation guide and update rules

2026-01-20 02:25:46 +08:00

xattention_algorithm_guide.md

📝 docs: add XAttention algorithm guide based on COMPASS implementation

2026-01-20 02:50:03 +08:00

xattention_bsa_test_report.md

[WIP] Before integrating the xattn operator.

2026-01-19 21:19:21 +08:00