nano-vllm

Files

Zijie Tian 86633004ca 📝 docs: add 64k memory analysis and test configuration updates

Add comprehensive memory analysis for 64k inference on Llama 3.1 8B:

New documentation:
- docs/64k_memory_analysis.md: GPU-only vs offload memory analysis,
  OOM root cause (memory fragmentation), RTX 3090 limitations,
  theoretical vs actual memory usage breakdown

Test configuration updates:
- tests/test_ruler.py: Add --num-kv-buffers parameter for ring buffer
  size tuning (default 4, can reduce to 1 for lower memory)
- Update default data_dir to ruler_64k
- Update default max_model_len to 65664 for 64k support
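
The test configuration changes above can be pictured with a small argparse sketch. This is not the contents of tests/test_ruler.py. Only the --num-kv-buffers flag and the three default values come from the commit message; the --data-dir and --max-model-len flag names, the help text, and the build_parser helper are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the defaults named in the commit message."""
    parser = argparse.ArgumentParser(description="RULER test options (illustrative sketch)")
    # Flag named in the commit message: ring buffer size, default 4,
    # can be reduced to 1 for lower memory use.
    parser.add_argument("--num-kv-buffers", type=int, default=4,
                        help="number of KV ring buffers (lower uses less memory)")
    # Assumed flag names; the commit only states the new default values.
    parser.add_argument("--data-dir", default="ruler_64k",
                        help="dataset directory (flag name is an assumption)")
    parser.add_argument("--max-model-len", type=int, default=65664,
                        help="maximum model length (flag name is an assumption)")
    return parser


# Defaults as described in the commit message.
defaults = build_parser().parse_args([])
# Lower-memory setting: shrink the ring buffer to a single KV buffer.
low_mem = build_parser().parse_args(["--num-kv-buffers", "1"])
```

Passing --num-kv-buffers 1 corresponds to the lower-memory setting the commit mentions; how the real test forwards the value to the engine is not shown here.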

CLAUDE.md updates:
- Add 64k_memory_analysis.md to documentation index
- Document num_kv_buffers parameter in Configuration section
- Add 64k hardware requirements note to Model Limits

Key findings: 64k inference requires ~26GB (GPU-only) or ~23GB (offload).
On 24GB GPUs, memory fragmentation leaves too little headroom even with
offload, making A100 (40GB+) the recommended hardware for 64k workloads.
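
As a rough cross-check of the scale of these figures, the theoretical KV cache and weight footprint can be estimated from the model shape. The sketch below is not taken from docs/64k_memory_analysis.md. It assumes the published Llama 3.1 8B configuration (32 layers, 8 KV heads, head dimension 128), bf16 storage, and roughly 8.03 billion parameters. It ignores activations, buffers, and fragmentation, so it gives a lower bound and does not reproduce the measured ~26GB and ~23GB numbers.

```python
GIB = 1024 ** 3


def kv_cache_bytes(num_tokens: int,
                   num_layers: int = 32,    # assumed Llama 3.1 8B config
                   num_kv_heads: int = 8,   # grouped-query attention KV heads
                   head_dim: int = 128,
                   dtype_bytes: int = 2) -> int:  # bf16
    """Bytes needed to hold K and V for num_tokens across all layers."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # K + V
    return per_token * num_tokens


def weight_bytes(num_params: float = 8.03e9, dtype_bytes: int = 2) -> float:
    """Approximate bf16 weight footprint; the parameter count is an assumption."""
    return num_params * dtype_bytes


kv_64k = kv_cache_bytes(65536)         # 64k tokens of context
kv_max = kv_cache_bytes(65664)         # max_model_len default from the commit
theoretical_total = kv_64k + weight_bytes()

print(f"KV cache @ 65536 tokens: {kv_64k / GIB:.2f} GiB")
print(f"KV cache @ 65664 tokens: {kv_max / GIB:.2f} GiB")
print(f"weights + KV cache:      {theoretical_total / GIB:.2f} GiB")
```

Under these assumptions the full KV cache plus weights already sit close to the capacity of a 24GB card before any activation memory is counted, which is consistent with the commit's recommendation of larger GPUs for 64k workloads.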

Co-Authored-By: Claude <noreply@anthropic.com>

2026-01-14 07:02:09 +08:00

__init__.py

[WIP] NEED to refactor nanovllm mechanism.

2025-12-22 23:52:56 +08:00

modeling_qwen3.py

[refactor] Refactor needle test.

2026-01-03 19:19:37 +08:00

run_parallel_niah.sh

Merge branch 'zijie/fix-dist-3': Fix distributed port conflict

2026-01-12 16:27:25 +08:00

test_minference_gpu.py

[claudesquad] update from 'layer-prefill-1' on 08 Jan 26 03:36 CST