A few days ago I posted about "octopus-inspired GPU optimization" with claims of a 14x speedup. It got torn apart in the comments, and honestly, fair.
The main problem: my "naive" baseline was one image per thread. Nobody does that. It's not a real comparison; it's beating up a strawman I built myself.
Some comments that hit hard:
The block metadata approach still has value: O(1) lookup, 9000x less memory than lookup tables, and you can attach scheduling info per block. But it's not a 14x win. On an RTX 4090 with 72 MB of L2 cache, binary search is basically free.
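For readers who want the comparison in concrete terms, here is a rough sketch of the two lookups in plain Python rather than CUDA, so only the indexing logic is shown. It is not the code from the original post. It assumes the problem is mapping a block index to a variable-sized work item (for example an image), and every name in it (`BLOCK_SIZE`, `build_block_starts`, `build_block_metadata`, and so on) is hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate

BLOCK_SIZE = 256  # hypothetical number of elements handled per block


def blocks_needed(n, block_size=BLOCK_SIZE):
    """Ceiling division: blocks required to cover n elements."""
    return (n + block_size - 1) // block_size


def build_block_starts(sizes, block_size=BLOCK_SIZE):
    """Prefix sum of blocks per item: entry i is the first block of item i."""
    return [0] + list(accumulate(blocks_needed(n, block_size) for n in sizes))


def lookup_binary_search(block_starts, block_id):
    """O(log n) lookup: binary-search the prefix sum for the owning item."""
    item = bisect_right(block_starts, block_id) - 1
    return item, block_id - block_starts[item]


def build_block_metadata(sizes, block_size=BLOCK_SIZE):
    """One (item, local_block) record per block, built once up front."""
    return [
        (item, local)
        for item, n in enumerate(sizes)
        for local in range(blocks_needed(n, block_size))
    ]


def lookup_metadata(meta, block_id):
    """O(1) lookup: the block reads its own record directly."""
    return meta[block_id]


if __name__ == "__main__":
    sizes = [1000, 300, 70]  # made-up element counts per item
    starts = build_block_starts(sizes)
    meta = build_block_metadata(sizes)
    # Both schemes must agree on which item each block belongs to.
    for b in range(len(meta)):
        assert lookup_binary_search(starts, b) == lookup_metadata(meta, b)
```

Both return the same answer. The metadata table costs one record per block, while the binary search only needs one entry per item plus a handful of extra reads. When those reads hit cache, the difference in lookup cost is small, which fits the point above about the L2 cache.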
Lesson learned: if your speedup looks too good, check your baseline.
This is my first time writing this kind of benchmark. Happy to hear what I'm still getting wrong.