Add comprehensive memory analysis for 64k inference on Llama 3.1 8B:
New documentation:
- docs/64k_memory_analysis.md: GPU-only vs offload memory analysis,
OOM root cause (memory fragmentation), RTX 3090 limitations,
theoretical vs actual memory usage breakdown
Test configuration updates:
- tests/test_ruler.py: Add --num-kv-buffers parameter for ring buffer
size tuning (default 4, can reduce to 1 for lower memory)
- Update default data_dir to ruler_64k
- Update default max_model_len to 65664 for 64k support
CLAUDE.md updates:
- Add 64k_memory_analysis.md to documentation index
- Document num_kv_buffers parameter in Configuration section
- Add 64k hardware requirements note to Model Limits
Key findings: 64k inference requires ~26GB (GPU-only) or ~23GB (offload)
due to memory fragmentation on 24GB GPUs, making A100 (40GB+) the
recommended hardware for 64k workloads.
Co-Authored-By: Claude <noreply@anthropic.com>
- Add test_ruler.py supporting all 13 RULER tasks (NIAH, QA, CWE, FWE, VT)
- Implement RULER official evaluation metrics (string_match_all/part)
- Fix max_model_len to 32896 to prevent decode OOM on long inputs
- Add ruler_benchmark_report.md with full test results (92.1% accuracy)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Document key finding: single request inference works correctly (100% accuracy).
The 66% accuracy issue in batch mode is due to state accumulation between
sequential requests in the same process.
- Add comparison table: independent (100%) vs batch (66%) testing modes
- Document root cause analysis: state cleanup issue between requests
- Add workaround using test_ruler_niah.sh for independent testing
- Update next steps to focus on OffloadEngine reset/cleanup logic
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Auto port allocation with _find_free_port() in model_runner.py
- Resource management refactor with close() + context manager in llm_engine.py
- Add tests/test_port_conflict.py and tests/run_parallel_niah.sh
- Remove docs/torch_distributed_port_issue.md (issue fixed)
- Ignore tests/data/ directory
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add model registry system for dynamic model loading
- Implement LlamaForCausalLM with Llama3 RoPE scaling
- Register Qwen3ForCausalLM and Qwen2ForCausalLM
- Update ModelRunner to use get_model_class() for dynamic model selection
Tested: needle 32k test PASSED
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>