nano-vllm/docs at tzj/vs_offload

zijie-tian/nano-vllm

Zijie Tian 13586e689b docs: add chunked prefill integration plan

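Analyze the memory-layout differences between the two branches and establish why a
block-based design matters for supporting inference at arbitrary sequence lengths.

Key findings:
- The max_seq_len design in tzj/vs_offload makes GPU memory grow with sequence length
- The block-based design in tzj/minference keeps GPU memory fixed (~1.6 GB)
- Inference over 4M+ tokens can be supported on a 24 GB RTX 3090

Plan to port the chunked prefill mechanism from tzj/minference to the tzj/vs_offload branch:
- Block-based GPU cache (no layer dimension)
- Per-layer prefill buffer (fully parallel offload)
- Cross-layer pipeline buffers (double-buffering)
- Chunked prefill flow and online LSE merging

Sparse policy: keep the architecture; for now only the FULL policy is implemented

Related files:
- docs/chunked_prefill_integration_plan.md (new)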
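
The online LSE merging named in the plan above can be illustrated with a small pure-Python sketch. It is not taken from the nano-vllm code: the function names `attend_chunk` and `merge_lse` are hypothetical, it handles a single query vector, and it omits the usual 1/sqrt(d) score scaling for brevity. It only shows the general technique: each KV chunk yields a partial attention output plus the log-sum-exp (LSE) of its scores, and two partial results can be combined exactly without revisiting the earlier chunk.

```python
import math

def attend_chunk(q, keys, values):
    """Softmax attention of one query over one KV chunk.

    Returns (output, lse), where lse is the log-sum-exp of the raw scores.
    Hypothetical helper for illustration; score scaling is omitted.
    """
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)                          # subtract max for stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total
           for d in range(dim)]
    return out, m + math.log(total)

def merge_lse(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention results computed over disjoint chunks."""
    m = max(lse_a, lse_b)
    lse = m + math.log(math.exp(lse_a - m) + math.exp(lse_b - m))
    w_a = math.exp(lse_a - lse)              # share of softmax mass in chunk a
    w_b = math.exp(lse_b - lse)              # share of softmax mass in chunk b
    merged = [w_a * a + w_b * b for a, b in zip(out_a, out_b)]
    return merged, lse

# Toy data: one query, four keys/values split into two chunks of two.
q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

full_out, full_lse = attend_chunk(q, keys, values)
out_1, lse_1 = attend_chunk(q, keys[:2], values[:2])
out_2, lse_2 = attend_chunk(q, keys[2:], values[2:])
merged_out, merged_lse = merge_lse(out_1, lse_1, out_2, lse_2)
```

In this sketch `merged_out` and `merged_lse` agree with `full_out` and `full_lse` up to floating-point error, which is the property that lets a chunked prefill process one KV chunk at a time while keeping only a running output and LSE per query.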

2026-01-18 18:49:19 +08:00

64k_memory_analysis.md

📝 docs: add 64k memory analysis and test configuration updates

2026-01-14 07:02:09 +08:00

64k_mlp_activation_oom.md

📝 docs: add 64k memory analysis and test configuration updates

2026-01-14 07:02:09 +08:00

architecture_guide.md

[claudesquad] update from 'lw-offload-2' on 08 Jan 26 21:19 CST

2026-01-08 21:19:38 +08:00

block_sparse_attention_lib.md

docs: add Block-Sparse-Attention library reference

2026-01-14 08:39:03 +08:00

chunked_prefill_analysis.md

docs: add chunked prefill analysis for ultra-long sequences

2026-01-16 10:38:02 +08:00

chunked_prefill_integration_plan.md

docs: add chunked prefill integration plan

2026-01-18 18:49:19 +08:00

cuda_graph_offload_guide.md

[claudesquad] update from 'fix-bug-2' on 09 Jan 26 16:10 CST

2026-01-09 16:10:28 +08:00

debugging_guide.md

[claudesquad] update from 'lw-offload-2' on 08 Jan 26 21:19 CST

2026-01-08 21:19:38 +08:00

development_notes.md

docs: reorganize documentation files

2026-01-14 10:08:41 +08:00

gpu_only_performance_issue.md

[claudesquad] update from 'int-minference-1' on 08 Jan 26 23:22 CST

2026-01-08 23:22:38 +08:00

layerwise_offload_memory_analysis.md

[claudesquad] update from 'lw-offload-2' on 08 Jan 26 21:19 CST

2026-01-08 21:19:38 +08:00

multi_model_support.md

[claudesquad] update from 'add-llama-1' on 10 Jan 26 21:14 CST

2026-01-10 21:14:32 +08:00

offload_accuracy_issue.md

📝 docs: update offload accuracy issue with independent testing results

2026-01-12 21:08:35 +08:00

ruler_benchmark_report.md

✅ test: add comprehensive RULER benchmark test suite

2026-01-14 00:51:30 +08:00

ruler_niah_standalone_test.md

[tests] Added test_niah_standalone.py.

2026-01-12 00:16:37 +08:00

sparse_attention_guide.md

[claudesquad] update from 'lw-offload-2' on 08 Jan 26 21:19 CST

2026-01-08 21:19:38 +08:00

sparse_offload_integration.md

[claudesquad] update from 'int-minference-1' on 08 Jan 26 23:42 CST

2026-01-08 23:42:30 +08:00

sparse_prefill_integration_plan.md

[docs] Add sparse prefill integration plan from int-minference analysis

2026-01-10 23:33:09 +08:00

transformers_compatibility.md

[docs] Added transformers error desp.

2026-01-11 18:48:50 +08:00

xattention_analysis.md

docs: reorganize documentation files

2026-01-14 10:08:41 +08:00

xattention_integration.md

docs: add XAttention integration guide

2026-01-14 10:16:21 +08:00