Commit Graph

2 Commits

Author SHA1 Message Date
Zijie Tian
86633004ca 📝 docs: add 64k memory analysis and test configuration updates
Add comprehensive memory analysis for 64k inference on Llama 3.1 8B:

New documentation:
- docs/64k_memory_analysis.md: GPU-only vs offload memory analysis,
  OOM root cause (memory fragmentation), RTX 3090 limitations,
  theoretical vs actual memory usage breakdown

Test configuration updates:
- tests/test_ruler.py: Add --num-kv-buffers parameter for ring buffer
  size tuning (default 4, can reduce to 1 for lower memory)
- Update default data_dir to ruler_64k
- Update default max_model_len to 65664 for 64k support

CLAUDE.md updates:
- Add 64k_memory_analysis.md to documentation index
- Document num_kv_buffers parameter in Configuration section
- Add 64k hardware requirements note to Model Limits

Key findings: 64k inference requires ~26GB (GPU-only) or ~23GB (offload)
due to memory fragmentation on 24GB GPUs, making A100 (40GB+) the
recommended hardware for 64k workloads.

Co-Authored-By: Claude <noreply@anthropic.com>
2026-01-14 07:02:09 +08:00
Zijie Tian
cf168fd9b9 test: add comprehensive RULER benchmark test suite
- Add test_ruler.py supporting all 13 RULER tasks (NIAH, QA, CWE, FWE, VT)
- Implement RULER official evaluation metrics (string_match_all/part)
- Fix max_model_len to 32896 to prevent decode OOM on long inputs
- Add ruler_benchmark_report.md with full test results (92.1% accuracy)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-14 00:51:30 +08:00