mareko added a comment.

I might change the intrinsic to add the option to insert "s_and_saveexec s[N:M], 1" and "s_mov_b64 exec, s[N:M]" around the intrinsic to get an optimal single-lane block.

Repository:
  rL LLVM

https://reviews.llvm.org/D52944
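
For readers unfamiliar with the save/narrow/restore pattern on the exec mask that the comment refers to, here is a minimal Python sketch of it. It is a toy model, not LLVM or hardware code. It assumes the instruction meant is `s_and_saveexec_b64` (save the old exec mask, then AND exec with the source operand) and a 64-lane wavefront. The helper names `active_lanes` and `run_single_lane` are hypothetical and exist only for illustration.

```python
# Toy model of a 64-bit exec mask. Assumption: 64-lane wavefront.
MASK64 = (1 << 64) - 1


def active_lanes(exec_mask):
    """Return the indices of lanes whose exec bit is set."""
    return [lane for lane in range(64) if (exec_mask >> lane) & 1]


def s_and_saveexec_b64(exec_mask, src):
    """Model of s_and_saveexec_b64: saved = exec; exec = src & exec; SCC = (exec != 0)."""
    saved = exec_mask & MASK64
    new_exec = saved & src
    scc = int(new_exec != 0)
    return saved, new_exec, scc


def run_single_lane(exec_mask, body):
    """Hypothetical helper: run body under exec & 1, then hand back the saved mask."""
    # s_and_saveexec s[N:M], 1  -> at most lane 0 stays active
    saved, narrowed, _scc = s_and_saveexec_b64(exec_mask, 1)
    for lane in active_lanes(narrowed):
        body(lane)
    # s_mov_b64 exec, s[N:M]    -> restore the original mask
    return saved


if __name__ == "__main__":
    executed = []
    restored = run_single_lane(0b1111, executed.append)
    print(executed, bin(restored))
```

In this model, if lane 0 is inactive on entry the narrowed mask is empty and the body runs on no lanes. The comment does not say how the intrinsic would handle that case.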