[llvm] [AMDGPU] - Add s_bitreplicate intrinsic (PR #69209)

Tue Oct 24 09:28:10 PDT 2023

nhaehnle wrote:

I forgot a final lshl_or in the sequence above, but it randomly occurred to me that there is a slightly better vector code sequence:

* 2x v_perm_b32 to distribute bytes of the input
* For 3 levels of bit width (4, 2, 1), do:
  * 2x v_lshl_or_b32 to shift the high halves of each bit group
  * 2x v_and_b32 to mask out the "dirt" left over from the lshl_or
* 2x v_lshl_or_b32 for the final bit replication

That's slightly better at 16 VALU instructions, but still a massive performance cliff compared to the single scalar instruction. I still think it's justified to have an intrinsic that only allows uniform inputs.
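
For reference, here is a rough host-side model of the sequence listed above. It is only a sketch in Python, not AMDGPU code. The function names are made up for illustration. The masks are the usual bit-spreading constants, which I am assuming are what the v_and_b32 steps would use, and the v_perm_b32 selector is modelled as a plain byte shuffle rather than with its real encoding.

```python
MASK32 = 0xFFFFFFFF

def spread_half(h16):
    """Replicate each bit of a 16-bit value twice, giving a 32-bit result."""
    # Models one v_perm_b32: byte 0 stays at bits 0..7, byte 1 moves to bits 16..23.
    v = (h16 & 0x00FF) | ((h16 & 0xFF00) << 8)
    # Three levels of bit width (4, 2, 1): one lshl_or moves the high half of
    # each group up, one and masks out the "dirt" in between.
    for shift, mask in ((4, 0x0F0F0F0F), (2, 0x33333333), (1, 0x55555555)):
        v = ((v << shift) | v) & mask
    # Final lshl_or: every input bit now sits at an even position; copy it
    # into the odd position above it.
    return ((v << 1) | v) & MASK32

def bitreplicate_b64_b32(x):
    """Hypothetical model: 32-bit input, 64-bit output, each bit doubled."""
    lo = spread_half(x & 0xFFFF)
    hi = spread_half((x >> 16) & 0xFFFF)
    return (hi << 32) | lo

def reference(x):
    """Bit-by-bit reference used only to check the model."""
    out = 0
    for i in range(32):
        if (x >> i) & 1:
            out |= 0b11 << (2 * i)
    return out
```

Per 32-bit output half this is 1 perm, 3 x (1 lshl_or + 1 and), and 1 final lshl_or, i.e. 8 operations, which gives the 16 total for both halves.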

https://github.com/llvm/llvm-project/pull/69209