Zijie Tian
|
1ab4676396
|
♻️ refactor: consolidate RULER test files and document root cause
- test_ruler.py: add --fresh-llm, --sample-indices, --json-output options
- test_ruler.py: consolidate test_ruler_single_sample.py, test_ruler_sequential.py, test_ruler_samples.py
- docs: update chunked offload issue with root cause (state leakage confirmed)
- docs: add single-sample test results showing 100% accuracy for niah_single_1
Deleted redundant test files:
- tests/test_ruler_single_sample.py
- tests/test_ruler_sequential.py
- tests/test_ruler_samples.py
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-01-20 23:41:17 +08:00 |
|
Zijie Tian
|
e6e0dc5d7d
|
✨ feat: add comprehensive RULER benchmark testing
- Add test_ruler.py from tzj/vs_offload branch with 13 RULER tasks
- Add comprehensive documentation for RULER benchmark results
- Update CLAUDE.md with new documentation index entry
- Add architecture, debugging, optimization, and known issues guides
- Test 32K context with CPU offload: 92.3% accuracy across all tasks
- Parallel execution on 4 GPUs with detailed performance metrics
Benchmark results:
- 13 RULER tasks total (niah_single, multikey, multiquery, multivalue, qa, cwe, fwe, vt)
- 26 samples tested with 92.3% overall accuracy
- CPU offload stable at 32K context length
- Parallel GPU execution achieving 4x speedup
Key findings:
- Single needle tasks: 100% accuracy
- Multi-value and recall tasks: 100% accuracy
- Multi-query tasks: 50% accuracy (most challenging)
- QA tasks: 100% accuracy
- Total execution time: ~220 seconds (parallel)
|
2026-01-18 20:34:06 +08:00 |
|