[llvm] [AMDGPU] - Add s_bitreplicate intrinsic (PR #69209)
Nicolai Hähnle via llvm-commits
llvm-commits at lists.llvm.org
Tue Oct 24 07:01:16 PDT 2023
https://github.com/nhaehnle commented:
I think this is good, though there should still be a comment in IntrinsicsAMDGPU.td describing the constraints / definition of the intrinsic.
> Emitting readfirstlane is just wrong, you have to emulate the operation with a VALU expansion if needed
That's how we usually do it, but this case is a bit unusual. It's the combination of:
* The whole point of adding the intrinsic is to expose a corner case optimization opportunity provided by this SALU instruction
* All VALU expansions I can think of are completely horrible, leaving a massive performance footgun if the intrinsic ever happens to run into a VGPR input
* We wouldn't have good test coverage for the complex VALU expansion
The cheapest VALU expansion I can think of is:
* 2x v_perm_b32 to distribute the bytes of the input
* For 3 levels of bit width (4, 2, 1), do:
  * 4x v_and_b32 to mask out low and high halves
  * 2x v_lshl_or_b32 for shifted combine of the masked parts
... which amounts to 20 VALU instructions in place of 1 SALU instruction.
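
To make the instruction count concrete, here is a minimal Python model of the operation and of one mask-and-shift expansion that fits the outline above. It is a sketch under assumptions, not necessarily the exact sequence intended: the function names are hypothetical, the masks are one choice that works, and `dup_bytes` only models the effect of a v_perm_b32 byte select (each input byte duplicated into a 16-bit slot) without modelling the real selector encoding.

```python
MASK32 = 0xFFFFFFFF

def bitreplicate_ref(x):
    """Reference semantics: bit i of the 32-bit input becomes
    bits 2i and 2i+1 of the 64-bit result."""
    out = 0
    for i in range(32):
        if (x >> i) & 1:
            out |= 0b11 << (2 * i)
    return out

def dup_bytes(lo_byte, hi_byte):
    # Stand-in for one v_perm_b32: result bytes are [lo, lo, hi, hi]
    # (byte 0 first). Assumed layout, chosen for this sketch.
    return (lo_byte * 0x0101) | ((hi_byte * 0x0101) << 16)

# (shift, mask of the part that moves, mask of the part that stays)
# for the three levels of bit width 4, 2, 1.
LEVELS = (
    (4, 0x00FF00FF, 0xF00FF00F),
    (2, 0x0F0F0F0F, 0xC3C3C3C3),
    (1, 0x33333333, 0x99999999),
)

def spread_half(x):
    # Per level: two ANDs (v_and_b32) and one shifted OR (v_lshl_or_b32).
    for shift, move, keep in LEVELS:
        x = (((x & move) << shift) | (x & keep)) & MASK32
    return x

def bitreplicate_emulated(x):
    b = [(x >> (8 * i)) & 0xFF for i in range(4)]
    lo = spread_half(dup_bytes(b[0], b[1]))  # low 32 bits of the result
    hi = spread_half(dup_bytes(b[2], b[3]))  # high 32 bits of the result
    return (hi << 32) | lo

print(hex(bitreplicate_emulated(0b101)))  # 0x33
```

Counting the modelled operations gives 2 byte permutes plus 3 levels of (4 ANDs + 2 shifted ORs) across the two halves, i.e. 2 + 3 * 6 = 20.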
For these reasons, I recommended defining the intrinsic such that it's only defined for uniform inputs.
https://github.com/llvm/llvm-project/pull/69209