
I wrote about an interesting amd64-specific quirk. If a large array starts at an address that is 4-byte aligned but not 8-byte aligned, padding it to 8-byte alignment can make clearing the array ~49% faster (at least on my Intel machine). The post also touches on Intel's REP STOSQ implementation, ERMS, and other optimizations related to array clearing.

How 4 bytes of padding make array clearing 49% faster
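
For anyone who wants to poke at the alignment effect themselves, here is a minimal C sketch of the setup, not the code from the post. It only builds the two cases (a buffer starting on an 8-byte boundary and one starting 4 bytes past it) and clears them with `memset`. The helper names `misalignment8`, `at_offset_mod8` and `clear_array` are made up for illustration, and whether `memset` ends up using REP STOSQ depends on your libc and CPU, so treat any timing you measure as specific to your machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Distance of p from the previous 8-byte boundary (0 means 8-byte aligned). */
static size_t misalignment8(const void *p) {
    return (size_t)((uintptr_t)p & 7u);
}

/* Hypothetical helper: return a pointer inside `raw` whose address is
 * congruent to `offset` modulo 8. It advances by at most 7 bytes, so the
 * caller must allocate at least 7 spare bytes beyond the array length. */
static unsigned char *at_offset_mod8(unsigned char *raw, size_t offset) {
    assert(offset < 8);
    size_t pad = (offset + 8 - misalignment8(raw)) & 7u;
    return raw + pad;
}

/* The operation under test: zero `len` bytes starting at `start`.
 * Assumption: plain memset stands in for whatever clearing routine
 * your runtime actually uses. */
static void clear_array(unsigned char *start, size_t len) {
    memset(start, 0, len);
}
```

To compare the two cases, allocate `len + 8` bytes, call `at_offset_mod8(raw, 0)` and `at_offset_mod8(raw, 4)`, and time `clear_array` on each pointer over many iterations with a large `len`.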