docs: add chunked prefill analysis for ultra-long sequences
Add a comprehensive analysis document covering:

- MLP activation memory bottlenecks with the SwiGLU architecture
- Chunked MLP strategy (98% memory reduction)
- Chunked prefill for single layers (78% memory reduction)
- Streaming chunked prefill (optimal approach): GPU memory becomes constant
- Memory formulas and implementation guidance
- Theoretical maximum: 4M tokens on a 24GB GPU (128× improvement)

Co-Authored-By: Claude <noreply@anthropic.com>
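The chunked-MLP idea summarized above can be sketched as follows. This is an illustrative sketch, not the actual nano-vllm implementation (names like `swiglu_mlp` and `chunked_mlp` are hypothetical): because each token position passes through the MLP independently, the sequence can be split into chunks so that the large `[tokens, intermediate]` SwiGLU activation is only ever materialized for one chunk at a time, and the result is exact rather than an approximation.

```python
import numpy as np

def silu(x):
    # SiLU (a.k.a. swish): x * sigmoid(x), the gating nonlinearity in SwiGLU.
    return x / (1.0 + np.exp(-x))

def swiglu_mlp(x, w_gate, w_up, w_down):
    # Full-sequence SwiGLU MLP. The intermediate activation
    # silu(x @ w_gate) * (x @ w_up) has shape [tokens, intermediate],
    # which is the memory bottleneck for ultra-long prefill.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def chunked_mlp(x, w_gate, w_up, w_down, chunk_size=1024):
    # Process the sequence chunk-by-chunk: peak activation memory now
    # scales with chunk_size instead of the full sequence length.
    outs = [
        swiglu_mlp(x[i:i + chunk_size], w_gate, w_up, w_down)
        for i in range(0, x.shape[0], chunk_size)
    ]
    return np.concatenate(outs, axis=0)
```

Since the MLP has no cross-token interaction, `chunked_mlp` produces bit-for-bit the same result as the unchunked version (up to floating-point accumulation order); only attention needs the separate chunked-prefill treatment described in the document.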
@@ -64,6 +64,7 @@ PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH python tests/test_needle.py
 | [`docs/xattention_integration.md`](docs/xattention_integration.md) | XAttention integration guide: algorithm, implementation, design decisions, and testing |
 | [`docs/xattention_analysis.md`](docs/xattention_analysis.md) | XAttention algorithm analysis: chunked estimation, block sparse attention, integration design |
 | [`docs/development_notes.md`](docs/development_notes.md) | Development notes and scratchpad for ongoing work |
+| [`docs/chunked_prefill_analysis.md`](docs/chunked_prefill_analysis.md) | **NEW**: Chunked prefill for ultra-long sequences (1M+), memory analysis, MLP activation breakdown, implementation guide |

 ## Configuration
 docs/chunked_prefill_analysis.md | 1055 +++++++++++ (new file)
File diff suppressed because it is too large