docs: add chunked prefill analysis for ultra-long sequences
Add a comprehensive analysis document covering:

- MLP activation memory bottlenecks with the SwiGLU architecture
- Chunked MLP strategy (98% memory reduction)
- Chunked prefill for single layers (78% memory reduction)
- Streaming chunked prefill (optimal approach): GPU memory becomes constant
- Memory formulas and implementation guidance
- Theoretical maximum: 4M tokens on a 24GB GPU (128× improvement)

Co-Authored-By: Claude <noreply@anthropic.com>
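The chunked-MLP idea summarized above can be sketched as follows. This is an illustrative sketch, not the actual nano-vllm implementation (names like `swiglu_mlp` and `chunked_mlp` are hypothetical): because each token position passes through the MLP independently, the sequence can be split into chunks so that the large `[tokens, intermediate]` SwiGLU activation is only ever materialized for one chunk at a time, and the result is exact rather than an approximation.

```python
import numpy as np

def silu(x):
    # SiLU (a.k.a. swish): x * sigmoid(x), the gating nonlinearity in SwiGLU.
    return x / (1.0 + np.exp(-x))

def swiglu_mlp(x, w_gate, w_up, w_down):
    # Full-sequence SwiGLU MLP. The intermediate activation
    # silu(x @ w_gate) * (x @ w_up) has shape [tokens, intermediate],
    # which is the memory bottleneck for ultra-long prefill.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def chunked_mlp(x, w_gate, w_up, w_down, chunk_size=1024):
    # Process the sequence chunk-by-chunk: peak activation memory now
    # scales with chunk_size instead of the full sequence length.
    outs = [
        swiglu_mlp(x[i:i + chunk_size], w_gate, w_up, w_down)
        for i in range(0, x.shape[0], chunk_size)
    ]
    return np.concatenate(outs, axis=0)
```

Since the MLP has no cross-token interaction, `chunked_mlp` produces bit-for-bit the same result as the unchunked version (up to floating-point accumulation order); only attention needs the separate chunked-prefill treatment described in the document.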
@@ -64,6 +64,7 @@ PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH python tests/test_needle.py
 | [`docs/xattention_integration.md`](docs/xattention_integration.md) | XAttention integration guide: algorithm, implementation, design decisions, and testing |
 | [`docs/xattention_analysis.md`](docs/xattention_analysis.md) | XAttention algorithm analysis: chunked estimation, block sparse attention, integration design |
 | [`docs/development_notes.md`](docs/development_notes.md) | Development notes and scratchpad for ongoing work |
+| [`docs/chunked_prefill_analysis.md`](docs/chunked_prefill_analysis.md) | **NEW**: Chunked prefill for ultra-long sequences (1M+), memory analysis, MLP activation breakdown, implementation guide |

 ## Configuration
 docs/chunked_prefill_analysis.md | 1055 +++++++++++ (new file)
File diff suppressed because it is too large