[llvm] [AMDGPU] - Add s_bitreplicate intrinsic (PR #69209)
Nicolai Hähnle via llvm-commits
llvm-commits at lists.llvm.org
Tue Oct 24 09:28:10 PDT 2023
nhaehnle wrote:
I forgot a final lshl_or in the sequence above, but it randomly occurred to me that there is a slightly better vector code sequence:
* 2x v_perm_b32 to distribute bytes of the input
* For 3 levels of bit width (4, 2, 1), do:
  * 2x v_lshl_or_b32 to shift the high halves of each bit group
  * 2x v_and_b32 to mask out the "dirt" leftover from the lshl_or
* 2x v_lshl_or_b32 for the final bit replication
That's slightly better at 16 VALU instructions, but still a massive performance cliff. I still think it's justified to have an intrinsic that only allows uniform inputs.
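For readers who want to check the sequence, here is a rough Python model of it. This is only a sketch under assumptions: the helper names are made up for illustration, the v_perm_b32 byte selectors are not spelled out above so the byte distribution is modelled with plain shifts and masks, and the mask constants are the usual bit-interleave ones rather than anything taken from the PR. It assumes the intrinsic duplicates each of the 32 input bits into two adjacent bits of a 64-bit result.

```python
MASK32 = 0xFFFFFFFF


def lshl_or(src, shift, acc):
    """Model of v_lshl_or_b32: (src << shift) | acc, truncated to 32 bits."""
    return ((src << shift) | acc) & MASK32


def spread_bytes(half16):
    """Stand-in for v_perm_b32 (assumed layout): byte 0 stays at bits 0..7,
    byte 1 moves to bits 16..23, the other bytes are zero."""
    return (half16 & 0xFF) | ((half16 & 0xFF00) << 8)


def replicate16(half16):
    """Replicate each bit of a 16-bit value into a 32-bit result."""
    x = spread_bytes(half16)
    # Three levels of bit width (4, 2, 1): shift the high half of each
    # group up with lshl_or, then mask out the leftover "dirt".
    for shift, mask in ((4, 0x0F0F0F0F), (2, 0x33333333), (1, 0x55555555)):
        x = lshl_or(x, shift, x) & mask
    # Every input bit now sits at an even position; the final lshl_or
    # copies it into the odd position next to it.
    return lshl_or(x, 1, x)


def bitreplicate_b64_b32(value):
    """Each 32-bit half of the result comes from one 16-bit half of the
    input, which is why every step above is needed twice."""
    lo = replicate16(value & 0xFFFF)
    hi = replicate16((value >> 16) & 0xFFFF)
    return (hi << 32) | lo


def reference(value):
    """Bit-by-bit reference used only to cross-check the sequence."""
    out = 0
    for i in range(32):
        if (value >> i) & 1:
            out |= 0b11 << (2 * i)
    return out
```

Counting per output half gives 1 perm, 3 x (1 lshl_or + 1 and) and 1 final lshl_or, so 8 operations per half and 16 in total, which matches the figure above.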
https://github.com/llvm/llvm-project/pull/69209