Introduction to PTX Optimization

I wrote a guide on PTX optimization that goes from the basics up to tensor cores. It covers why FlashAttention uses PTX mma instructions instead of WMMA, plus async copies, cache hints, and warp shuffles.

submitted by /u/Venom_moneV