Commit Graph

1 Commit

Author: Zijie Tian
SHA1: cfb188c34a
Message: docs: add chunked prefill analysis for ultra-long sequences
Add comprehensive analysis document covering:
- MLP activation memory bottlenecks with SwiGLU architecture
- Chunked MLP strategy (98% memory reduction)
- Chunked prefill for single layers (78% memory reduction)
- Streaming Chunked Prefill (optimal approach): GPU memory becomes constant
- Memory formulas and implementation guidance
- Theoretical maximum: 4M tokens on 24GB GPU (128× improvement)

Co-Authored-By: Claude <noreply@anthropic.com>
Date: 2026-01-16 10:38:02 +08:00
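The chunked-MLP strategy named in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code added by the commit: the names (`swiglu_mlp`, `chunked_mlp`, `d_model`, `d_ff`, `chunk`) and dimensions are all hypothetical. The idea is that running a SwiGLU MLP over all `T` tokens at once materializes activations of size `T × d_ff`, while processing the sequence in token chunks caps peak activation memory at `chunk × d_ff` without changing the result, since each token's MLP output is independent of the others.

```python
# Hypothetical sketch of chunked MLP execution (not the commit's actual code).
# Peak activation memory drops from ~T*d_ff to ~chunk*d_ff; outputs match
# the unchunked computation because the MLP acts on each token independently.
import numpy as np

def swiglu_mlp(x, w_gate, w_up, w_down):
    """SwiGLU MLP: (silu(x @ w_gate) * (x @ w_up)) @ w_down."""
    gate = x @ w_gate
    gate = gate * (1.0 / (1.0 + np.exp(-gate)))  # SiLU activation
    return (gate * (x @ w_up)) @ w_down

def chunked_mlp(x, w_gate, w_up, w_down, chunk=4):
    """Run the same MLP over token chunks to bound peak activation size."""
    out = np.empty_like(x)
    for i in range(0, x.shape[0], chunk):
        out[i:i + chunk] = swiglu_mlp(x[i:i + chunk], w_gate, w_up, w_down)
    return out

# Toy sizes for illustration only.
rng = np.random.default_rng(0)
T, d_model, d_ff = 16, 8, 32
x = rng.standard_normal((T, d_model))
w_gate = rng.standard_normal((d_model, d_ff))
w_up = rng.standard_normal((d_model, d_ff))
w_down = rng.standard_normal((d_ff, d_model))

full = swiglu_mlp(x, w_gate, w_up, w_down)
chunked = chunked_mlp(x, w_gate, w_up, w_down, chunk=4)
print(np.allclose(full, chunked))  # chunking changes memory use, not the math
```

The same per-token independence is what makes the chunked-prefill variants in the commit possible; attention is the part that needs extra care, since each chunk must still attend to earlier tokens' KV cache.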