Document CUDA Graph memory behavior based on actual testing

Documented findings:
- Memory overhead at each stage (model, cache, warmup, capture, replay)
- StaticCache is the main overhead (~144MB for 1K tokens)
- Graph capture adds minimal overhead (~8MB)
- Graph replay requires zero additional allocation
- Performance improvement: ~2.8x decode throughput
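
As a sanity check on the StaticCache figure, the size of a preallocated KV cache can be estimated from the model shape. The sketch below is illustrative only: the layer count, KV head count, head dimension, and 2-byte (fp16/bf16) element size are assumed values chosen because they reproduce ~144MB at 1024 tokens, not the configuration that was actually tested, and `static_cache_bytes` is a hypothetical helper name.

```python
def static_cache_bytes(num_layers, num_kv_heads, head_dim, max_tokens,
                       bytes_per_elem=2, batch_size=1):
    """Estimate the bytes held by a preallocated key/value cache."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return (2 * batch_size * num_layers * num_kv_heads
            * head_dim * max_tokens * bytes_per_elem)


# Assumed (hypothetical) model shape: 36 layers, 8 KV heads, head_dim 128.
size = static_cache_bytes(num_layers=36, num_kv_heads=8, head_dim=128,
                          max_tokens=1024)
print(f"{size / 2**20:.0f} MiB")  # prints "144 MiB"
```

Because the cache is allocated up front for `max_tokens`, this cost scales linearly with the configured cache length and is paid regardless of how many tokens are actually generated.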

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>