docs: add chunked prefill analysis for ultra-long sequences
Add a comprehensive analysis document covering:

- MLP activation memory bottlenecks with the SwiGLU architecture
- Chunked MLP strategy (98% memory reduction)
- Chunked prefill for single layers (78% memory reduction)
- Streaming chunked prefill (optimal approach): GPU memory becomes constant
- Memory formulas and implementation guidance
- Theoretical maximum: 4M tokens on a 24GB GPU (a 128× improvement)

Co-Authored-By: Claude <noreply@anthropic.com>
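The chunked-MLP idea named above can be sketched as follows. This is a minimal NumPy illustration, not the repository's implementation: the function and weight names (`w_gate`, `w_up`, `w_down`, `chunk_size`) are assumptions chosen to match the standard SwiGLU layout, where processing the sequence in chunks bounds the peak `[seq, d_ff]` intermediate activation at `[chunk_size, d_ff]`.

```python
import numpy as np

def swiglu_mlp(x, w_gate, w_up, w_down):
    """Reference SwiGLU MLP: materializes the full [seq, d_ff] activation."""
    gate = x @ w_gate                     # [seq, d_ff]
    up = x @ w_up                         # [seq, d_ff]
    silu = gate / (1.0 + np.exp(-gate))   # SiLU(gate)
    return (silu * up) @ w_down           # [seq, d_model]

def chunked_swiglu_mlp(x, w_gate, w_up, w_down, chunk_size=1024):
    """Chunked variant: identical output, but the [*, d_ff] intermediates
    only ever exist for one chunk of rows at a time."""
    out = np.empty((x.shape[0], w_down.shape[1]), dtype=x.dtype)
    for start in range(0, x.shape[0], chunk_size):
        chunk = x[start:start + chunk_size]
        out[start:start + chunk_size] = swiglu_mlp(chunk, w_gate, w_up, w_down)
    return out
```

Because the MLP acts on each token independently, the chunked result is bitwise-equivalent in exact arithmetic and numerically identical up to floating-point ordering, which is why the memory saving comes for free.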
1 changed file: docs/chunked_prefill_analysis.md (new file, 1055 lines)