How 4 bytes of padding make array clearing 49% faster

I wrote about an interesting amd64-specific quirk: if a large array is misaligned by 4 bytes, padding it so that it is 8-byte aligned can make clearing it ~49% faster (at least on my Intel machine). The post also touches on Intel's REP STOSQ implementation, ERMS, and other optimizations related to array clearing.
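To make the alignment idea concrete, here is a minimal C sketch. It is not taken from the post and it does not measure anything. It only shows how one might check whether a buffer is 8-byte aligned and skip the padding bytes before clearing. The helper names (`is_aligned`, `padding_for`, `clear_aligned8`) are hypothetical, and whether `memset` ends up using REP STOSQ depends on the compiler, libc and CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: nonzero if p is a multiple of `alignment`
 * (alignment must be a power of two). */
static int is_aligned(const void *p, size_t alignment) {
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Hypothetical helper: number of padding bytes needed to move p up to
 * the next multiple of `alignment` (0 if p is already aligned). */
static size_t padding_for(const void *p, size_t alignment) {
    return (size_t)(-(uintptr_t)p & (alignment - 1));
}

/* Hypothetical helper: clears n bytes starting at the first 8-byte
 * aligned address at or after p, and returns that address.
 * Assumption: the caller's buffer has room for n bytes plus up to
 * 7 bytes of padding. */
static void *clear_aligned8(void *p, size_t n) {
    assert(p != NULL);
    unsigned char *start = (unsigned char *)p + padding_for(p, 8);
    memset(start, 0, n);
    return start;
}
```

For a pointer that is 4 bytes past an 8-byte boundary, `padding_for(p, 8)` returns 4, which is the "4 bytes of padding" from the title. Any speedup from clearing the aligned region instead is hardware-dependent; the ~49% figure is the author's measurement on one Intel machine.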

submitted by /u/watman12