[llvm] [AMDGPU] - Add s_bitreplicate intrinsic (PR #69209)

Tue Oct 24 07:01:16 PDT 2023

https://github.com/nhaehnle commented:

I think this is good, though there should still be a comment in IntrinsicsAMDGPU.td describing the constraints / definition of the intrinsic.

> Emitting readfirstlane is just wrong, you have to emulate the operation with a VALU expansion if needed

That's how we usually do it, but this case is a bit unusual. It's the combination of:

* The whole point of adding the intrinsic is to expose a corner case optimization opportunity provided by this SALU instruction
* All VALU expansions I can think of are completely horrible, leaving a massive performance footgun if the intrinsic ever happens to run into a VGPR input
* We wouldn't have good test coverage for the complex VALU expansion

The cheapest VALU expansion I can think of is:

* 2x v_perm_b32 to distribute bytes of the input
* For 3 levels of bit width (4, 2, 1), do:
  * 4x v_and_b32 to mask out low and high halves
  * 2x v_lshl_or_b32 for shifted combine of the masked parts

... which amounts to 2 + 3 * (4 + 2) = 20 VALU instructions in place of 1 SALU instruction.
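
To make the instruction count concrete, here is a rough Python model of one way such an expansion could be structured. It is only a sketch under assumptions, not the lowering from the PR: it assumes the intrinsic duplicates each bit of a 32-bit input into two adjacent bits of a 64-bit result, it assumes the byte-distribution step duplicates each input byte, and the mask constants and function names are made up for illustration. Each commented step stands in for one VALU instruction.

```python
MASK32 = 0xFFFFFFFF

def bitreplicate_reference(x):
    """Assumed semantics: bit i of the 32-bit input lands in bits 2i and 2i+1."""
    out = 0
    for i in range(32):
        if (x >> i) & 1:
            out |= 0b11 << (2 * i)
    return out

def duplicate_bytes(x, lo_byte):
    """Stand-in for one v_perm_b32 (hypothetical byte selection):
    builds [b1 b1 b0 b0] from two adjacent input bytes."""
    b0 = (x >> (8 * lo_byte)) & 0xFF
    b1 = (x >> (8 * (lo_byte + 1))) & 0xFF
    return b0 | (b0 << 8) | (b1 << 16) | (b1 << 24)

# (shift, mask for the part that gets shifted, mask for the part that stays).
# These constants are assumptions chosen so that [H L H L] becomes [H H L L]
# at field widths of 4, 2 and 1 bits.
LEVELS = [
    (4, 0x00FF00FF, 0xF00FF00F),
    (2, 0x0F0F0F0F, 0xC3C3C3C3),
    (1, 0x33333333, 0x99999999),
]

def bitreplicate_valu_sketch(x):
    """Returns (64-bit result, number of modelled VALU instructions)."""
    ops = 0
    halves = []
    for lo_byte in (0, 2):            # low and high dword of the result
        w = duplicate_bytes(x, lo_byte)
        ops += 1                      # v_perm_b32
        for shift, moved, kept in LEVELS:
            a = w & moved             # v_and_b32
            b = w & kept              # v_and_b32
            w = ((a << shift) | b) & MASK32   # v_lshl_or_b32
            ops += 3
        halves.append(w)
    return halves[0] | (halves[1] << 32), ops
```

For example, `bitreplicate_valu_sketch(0x80)` yields `0xC000` with a modelled count of 20 instructions, matching the tally above. Even in this compact form, it is a lot of code to carry and test for a path that should essentially never be hit.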

For these reasons, I recommended defining the intrinsic so that it is only defined for uniform inputs.

https://github.com/llvm/llvm-project/pull/69209