nano-vllm

Files

Zijie Tian c90dc196b2 📝 docs: add estimate block_size performance analysis

Document the performance impact of block_size on softmax_fuse_block_sum:
- Current 4096 (reshaped 512) is the WORST point: 95ms
- Optimal 1024 (reshaped 128): 6ms - 15x faster
- Performance follows U-shaped curve

Add tests/bench_estimate_block_size.py for benchmarking and propose
hierarchical block sum approach for optimization.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>

2026-01-28 06:24:28 +08:00

__init__.py

[WIP] NEED refactor nanovllm mechenism.

2025-12-22 23:52:56 +08:00

bench_estimate_block_size.py

📝 docs: add estimate block_size performance analysis

2026-01-28 06:24:28 +08:00

modeling_qwen3.py

[refactor] Refactor needle test.

2026-01-03 19:19:37 +08:00

test_chunk_attention_graph_reuse.py

📝 docs: add CUDA Graph optimization plan for offload mode decode