Document the performance impact of block_size on softmax_fuse_block_sum

- Current block_size 4096 (reshaped 512) is the worst point measured: 95 ms
- Optimal block_size 1024 (reshaped 128): 6 ms, about 15x faster
- Performance follows a U-shaped curve as block_size varies

Add tests/bench_estimate_block_size.py for benchmarking and propose a
hierarchical block sum approach for optimization.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>