Add comprehensive memory analysis for 64k inference on Llama 3.1 8B

New documentation:
- docs/64k_memory_analysis.md: GPU-only vs offload memory analysis, OOM root cause (memory fragmentation), RTX 3090 limitations, theoretical vs actual memory usage breakdown

Test configuration updates:
- tests/test_ruler.py: Add --num-kv-buffers parameter for ring buffer size tuning (default 4, can reduce to 1 for lower memory)
- Update default data_dir to ruler_64k
- Update default max_model_len to 65664 for 64k support

CLAUDE.md updates:
- Add 64k_memory_analysis.md to documentation index
- Document num_kv_buffers parameter in Configuration section
- Add 64k hardware requirements note to Model Limits

Key findings: 64k inference requires ~26GB (GPU-only) or ~23GB (offload) due to memory fragmentation on 24GB GPUs, making an A100 (40GB+) the recommended hardware for 64k workloads.

Co-Authored-By: Claude <noreply@anthropic.com>
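The ~26GB figure can be sanity-checked with back-of-envelope arithmetic. A minimal sketch, using the public Llama 3.1 8B architecture values (32 layers, 8 GQA KV heads, head dim 128) and a 64k context; note this models only weights plus KV cache, not the activations and allocator fragmentation that push actual usage past 24GB:

```python
# Back-of-envelope memory sizing for Llama 3.1 8B at 64k context.
# Architecture constants are from the public Llama 3.1 8B config;
# activations and CUDA-allocator fragmentation are NOT modeled here.

N_LAYERS = 32     # num_hidden_layers
N_KV_HEADS = 8    # num_key_value_heads (grouped-query attention)
HEAD_DIM = 128    # hidden_size / num_attention_heads = 4096 / 32
BYTES_FP16 = 2
CONTEXT = 65536   # 64k tokens

# K and V each store (n_kv_heads * head_dim) fp16 values per layer per token.
kv_bytes_per_token = N_LAYERS * 2 * N_KV_HEADS * HEAD_DIM * BYTES_FP16
kv_cache_gib = kv_bytes_per_token * CONTEXT / 2**30

weights_gib = 8.03e9 * BYTES_FP16 / 2**30  # ~8.03B params in fp16

print(f"KV cache at 64k: {kv_cache_gib:.1f} GiB")  # 8.0 GiB
print(f"fp16 weights:    {weights_gib:.1f} GiB")   # ~15.0 GiB
```

Weights plus KV cache alone come to roughly 23 GiB, so a 24GB card like the RTX 3090 has essentially no headroom for activations or fragmentation, consistent with the OOM root cause documented above.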