nano-vllm

Zijie Tian 49519c7ce7 📝 docs: update offload accuracy issue with independent testing results

Document key finding: single-request inference works correctly (100% accuracy).
The 66% accuracy seen in batch mode is caused by state accumulating between
sequential requests in the same process.

- Add comparison table: independent (100%) vs batch (66%) testing modes
- Document root cause analysis: state cleanup issue between requests
- Add workaround using test_ruler_niah.sh for independent testing
- Update next steps to focus on OffloadEngine reset/cleanup logic

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-12 21:08:35 +08:00

architecture_guide.md

[claudesquad] update from 'lw-offload-2' on 08 Jan 26 21:19 CST

2026-01-08 21:19:38 +08:00

cuda_graph_offload_guide.md

[claudesquad] update from 'fix-bug-2' on 09 Jan 26 16:10 CST

2026-01-09 16:10:28 +08:00

debugging_guide.md

[claudesquad] update from 'lw-offload-2' on 08 Jan 26 21:19 CST

2026-01-08 21:19:38 +08:00

gpu_only_performance_issue.md

[claudesquad] update from 'int-minference-1' on 08 Jan 26 23:22 CST