Implement chunked processing for LayerNorm, QKV projection, and MLP
layers to reduce peak activation memory for 64k sequence inference.
Changes:
- Chunked input_layernorm and post_attention_layernorm (chunk_size=128)
- Chunked QKV projection (chunk_size=128)
- Chunked MLP processing (chunk_size=128) with memory cleanup
- Added torch.cuda.empty_cache() calls after each chunk (see the sketch below)
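
For reference, a minimal sketch of the chunking pattern, shown here for
the MLP path; chunked_mlp and its argument names are illustrative, not
the exact identifiers in this change:

    import torch

    def chunked_mlp(mlp, hidden_states, chunk_size=128):
        # hidden_states: [batch, seq_len, hidden]. Only the token (seq)
        # dimension is sliced, so just one chunk's intermediate
        # activations (e.g. the 4x up-projection) are live at a time.
        outputs = []
        for start in range(0, hidden_states.size(1), chunk_size):
            chunk = hidden_states[:, start:start + chunk_size, :]
            outputs.append(mlp(chunk))
            # Return cached blocks to the allocator between chunks;
            # lowers the high-water mark at some throughput cost.
            torch.cuda.empty_cache()
        return torch.cat(outputs, dim=1)

Chunking along the token dimension is lossless for these ops: LayerNorm
normalizes over the hidden dimension, and the QKV and MLP projections
act per token, so no information crosses sequence positions. Calling
empty_cache() after every chunk does add allocator overhead, which is
the accepted trade for the lower peak.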
This reduces peak activation memory from ~2 GB to ~50 MB per layer,
making 64k-token inference theoretically possible on 24 GB GPUs
(though in practice still limited by memory fragmentation).
Related: docs/64k_memory_analysis.md
Co-Authored-By: Claude <noreply@anthropic.com>