docs: add chunked prefill analysis for ultra-long sequences
Add a comprehensive analysis document covering:

- MLP activation memory bottlenecks with the SwiGLU architecture
- Chunked MLP strategy (98% memory reduction)
- Chunked prefill for single layers (78% memory reduction)
- Streaming chunked prefill (optimal approach): GPU memory becomes constant
- Memory formulas and implementation guidance
- Theoretical maximum: 4M tokens on a 24GB GPU (a 128× improvement)

Co-Authored-By: Claude <noreply@anthropic.com>
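The chunked-MLP idea named above can be sketched as follows. This is a minimal NumPy illustration, not the repository's implementation: the function and weight names (`w_gate`, `w_up`, `w_down`, `chunk_size`) are assumptions chosen to match the standard SwiGLU layout, where processing the sequence in chunks bounds the peak `[seq, d_ff]` intermediate activation at `[chunk_size, d_ff]`.

```python
import numpy as np

def swiglu_mlp(x, w_gate, w_up, w_down):
    """Reference SwiGLU MLP: materializes the full [seq, d_ff] activation."""
    gate = x @ w_gate                     # [seq, d_ff]
    up = x @ w_up                         # [seq, d_ff]
    silu = gate / (1.0 + np.exp(-gate))   # SiLU(gate)
    return (silu * up) @ w_down           # [seq, d_model]

def chunked_swiglu_mlp(x, w_gate, w_up, w_down, chunk_size=1024):
    """Chunked variant: identical output, but the [*, d_ff] intermediates
    only ever exist for one chunk of rows at a time."""
    out = np.empty((x.shape[0], w_down.shape[1]), dtype=x.dtype)
    for start in range(0, x.shape[0], chunk_size):
        chunk = x[start:start + chunk_size]
        out[start:start + chunk_size] = swiglu_mlp(chunk, w_gate, w_up, w_down)
    return out
```

Because the MLP acts on each token independently, the chunked result is bitwise-equivalent in exact arithmetic and numerically identical up to floating-point ordering, which is why the memory saving comes for free.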
1 changed file: docs/chunked_prefill_analysis.md (new file, 1055 lines)