A few days ago I posted about "octopus-inspired GPU optimization" with claims of a 14x speedup. It got torn apart in the comments, and honestly, fair.

The main problem: my "naive" baseline was one image per thread. Nobody does that. It's not a real comparison; it's beating up a strawman I built myself.

Some comments that hit hard:

A speedup is a combined measure of how good your solution is mixed with how bad the baseline is
warp-level uniformity is literally the first thing everyone considers
OpenMP did that before AI was a buzzword

The block metadata approach still has value: O(1) lookup, 9000x less memory than lookup tables, and you can attach scheduling info per block. But it's not a 14x win. On an RTX 4090 with 72MB of L2 cache, binary search is basically free.
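
For anyone who missed the first post, here is a rough CPU-side sketch of the two lookups being compared. It's plain Python, not the actual kernel. BLOCK_SIZE, the function names, and the "pad each image to a whole number of blocks" layout are assumptions made up for illustration.

```python
from bisect import bisect_right

BLOCK_SIZE = 256  # hypothetical work-block size (think: one CUDA thread block)


def build_offsets(sizes):
    """Prefix sums over image sizes: offsets[i] is where image i starts."""
    offsets = [0]
    for s in sizes:
        offsets.append(offsets[-1] + s)
    return offsets


def find_image_binary_search(offsets, pixel):
    """O(log n) baseline: find which image owns a flat pixel index."""
    return bisect_right(offsets, pixel) - 1


def build_block_metadata(sizes):
    """One (image_id, first_local_pixel) entry per block.

    Assumption: every image is padded up to a whole number of blocks,
    so a block never straddles two images.
    """
    meta = []
    for image_id, size in enumerate(sizes):
        num_blocks = (size + BLOCK_SIZE - 1) // BLOCK_SIZE  # ceiling division
        for b in range(num_blocks):
            meta.append((image_id, b * BLOCK_SIZE))
    return meta


def find_image_block_meta(meta, block_id, lane):
    """O(1) lookup: a single table read, no search.

    Returns (image_id, local_pixel). The caller still has to skip lanes
    where local_pixel falls in the padding past the end of the image.
    """
    image_id, start = meta[block_id]
    return image_id, start + lane


sizes = [1000, 300, 70]  # made-up image sizes in pixels
offsets = build_offsets(sizes)
meta = build_block_metadata(sizes)
print(find_image_binary_search(offsets, 1266))
print(find_image_block_meta(meta, 5, 10))
```

Both calls land on the same pixel of the same image. The metadata table has one entry per block instead of one per pixel, which is where the memory saving over a per-pixel lookup table comes from. The binary search only touches a tiny offsets array, which is why it ends up so cheap when that array stays in cache.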

What I've learned: if your speedup looks too good, check your baseline.

This is my first time writing this kind of benchmark. Happy to hear what I'm still getting wrong.

Got roasted for my 'octopus GPU' post. Went back and did it right