Zijie Tian
cfb188c34a
docs: add chunked prefill analysis for ultra-long sequences
Add comprehensive analysis document covering:
- MLP activation memory bottlenecks with SwiGLU architecture
- Chunked MLP strategy (98% memory reduction)
- Chunked prefill for single layers (78% memory reduction)
- Streaming Chunked Prefill (optimal approach): GPU memory becomes constant
- Memory formulas and implementation guidance
- Theoretical maximum: 4M tokens on 24GB GPU (128× improvement)
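A minimal sketch of the chunked-MLP idea described above: processing the sequence dimension in chunks so that the large intermediate SwiGLU activation is only ever materialized for one chunk at a time. All names (`swiglu_mlp_chunked`, the weight arguments, `chunk_size`) are illustrative, not taken from the referenced analysis document.

```python
import numpy as np

def silu(x):
    # SiLU activation used by SwiGLU: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_mlp_chunked(x, w_gate, w_up, w_down, chunk_size=1024):
    """Hypothetical chunked SwiGLU MLP.

    Splits the (seq_len, d_model) input along the sequence axis so
    peak activation memory scales with chunk_size rather than seq_len.
    """
    outs = []
    for start in range(0, x.shape[0], chunk_size):
        c = x[start:start + chunk_size]
        # Intermediate (chunk_size, d_ff) activation exists only per chunk.
        h = silu(c @ w_gate) * (c @ w_up)
        outs.append(h @ w_down)
    return np.concatenate(outs, axis=0)
```

Because the MLP acts on each token independently, chunking along the sequence axis is exact: the chunked result matches an unchunked pass bit-for-bit up to floating-point associativity.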
Co-Authored-By: Claude <noreply@anthropic.com>
2026-01-16 10:38:02 +08:00