Got roasted for my 'octopus GPU' post. Went back and did it right

A few days ago I posted about "octopus-inspired GPU optimization", claiming a 14x speedup. Got torn apart in the comments, and honestly, fair.

The main problem: my "naive" baseline was one image per thread. Nobody does that. It's not a real comparison; it's beating up a strawman I built myself.

Some comments that hit hard:

The block metadata approach still has value: O(1) lookup, 9000x less memory than lookup tables, and you can attach scheduling info per block. But it's not a 14x win. On an RTX 4090 with 72MB L2 cache, binary search is basically free.
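For anyone who hasn't seen the two lookups side by side, here is a rough CPU-side sketch in Python. It is not my actual kernel code. It assumes the problem is "given a block, which image does it belong to?" for a batch of variable-sized images. The names (`BLOCK_SIZE`, `build_offsets`, `find_item_binary_search`, `build_block_metadata`) and the sizes are made up for illustration.

```python
from bisect import bisect_right
from itertools import accumulate

BLOCK_SIZE = 256  # hypothetical number of elements handled per block


def build_offsets(sizes):
    """Prefix sums: offsets[i] is the first flat element index of item i."""
    return [0] + list(accumulate(sizes))


def find_item_binary_search(offsets, flat_index):
    """O(log n) lookup: find the item that owns a flat element index."""
    return bisect_right(offsets, flat_index) - 1


def build_block_metadata(sizes, block_size=BLOCK_SIZE):
    """One (item_id, start_within_item) record per block.

    Assumes a block never spans two items, so each block can read its
    own record by block index in O(1).
    """
    metadata = []
    for item_id, size in enumerate(sizes):
        for start in range(0, size, block_size):
            metadata.append((item_id, start))
    return metadata


sizes = [1000, 300, 50]  # made-up element counts for three items
offsets = build_offsets(sizes)
metadata = build_block_metadata(sizes)

# Binary search path: any flat index maps to an item in O(log n).
item_for_element = find_item_binary_search(offsets, 1000)

# Metadata path: block b reads metadata[b] directly, no search.
item_for_block, start_in_item = metadata[4]
```

The metadata table has one entry per block, while a full lookup table would need one entry per element, so in this sketch the memory ratio depends on the block size. The binary search needs only the small offsets array, which is why it stays cheap when that array fits in cache.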

What I've learned: if your speedup looks too good, check your baseline.

This is my first time writing this kind of benchmark. Happy to hear what I'm still getting wrong.

submitted by /u/matthewlammw