[llvm] [AMDGPU] Add pattern to select scalar ops for fshr with uniform operands (PR #165295)
Jay Foad via llvm-commits
llvm-commits at lists.llvm.org
Thu Nov 6 02:17:17 PST 2025
jayfoad wrote:
> > > limit pattern to gfx9+
> >
> >
> > Why?
>
> For other architectures, it leads to less optimized codegen and increased VGPR usage. For example:
>
> Looking at the ASM for the amdgcn.bitcast.32bit.ll test for gfx600:
>
> With change:
>
> ```
> v_and_b32_e32 v1, 0xffff0000, v4
> v_and_b32_e32 v0, 0xffff0000, v2
> v_add_f32_e32 v1, 0x40c00000, v1
> v_add_f32_e32 v0, 0x40c00000, v0
> v_lshrrev_b32_e32 v1, 16, v1
> v_lshr_b64 v[0:1], v[0:1], 16
> ```
>
> Without change:
>
> ```
> v_and_b32_e32 v1, 0xffff0000, v1
> v_and_b32_e32 v0, 0xffff0000, v2
> v_add_f32_e32 v1, 0x40c00000, v1
> v_add_f32_e32 v0, 0x40c00000, v0
> v_lshrrev_b32_e32 v1, 16, v1
> v_alignbit_b32 v0, v1, v0, 16
> ```
>
> Semantically, both are the same. However, with the change we end up using a 64-bit instruction and more VGPRs (5 vs 3). I don't see the same behavior for GFX9, 10, 11, and 12. Hence, limiting the blast radius to GFX9+.
Hmm. I don't know exactly what happened in that test. I can't think of any reason why using s_lshr_b64 would be bad for GFX6 but good for GFX9+. I would prefer to enable your patch for all architectures.
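
For reference, the "semantically the same" claim in the quoted comparison can be checked with a small model. The sketch below is only an illustration, not the pattern from the patch: the helper names `alignbit32` and `lshr64` are hypothetical, and the instruction semantics are assumed to be `({hi, lo} >> amt[4:0]) & 0xffffffff` for `v_alignbit_b32` and a plain 64-bit logical right shift for `v_lshr_b64` / `s_lshr_b64`. Under those assumptions, the low dword of the 64-bit shift matches the alignbit result for any shift amount below 32.

```python
MASK32 = 0xFFFFFFFF

def alignbit32(hi, lo, amt):
    """Assumed model of v_alignbit_b32: low 32 bits of ({hi, lo} >> amt[4:0])."""
    amt &= 31
    wide = ((hi & MASK32) << 32) | (lo & MASK32)
    return (wide >> amt) & MASK32

def lshr64(hi, lo, amt):
    """Assumed model of a 64-bit logical shift right on the pair hi:lo.

    Returns (new_lo, new_hi), mirroring a result written to v[0:1].
    """
    amt &= 63
    wide = (((hi & MASK32) << 32) | (lo & MASK32)) >> amt
    return wide & MASK32, (wide >> 32) & MASK32

# Arbitrary sample operands, with the shift amount 16 used in the quoted asm.
hi, lo = 0x12345678, 0x9ABCDEF0
new_lo, new_hi = lshr64(hi, lo, 16)

# The low dword of the 64-bit shift equals the funnel-shift result.
assert new_lo == alignbit32(hi, lo, 16)

# The equivalence holds for every shift amount in [0, 31].
assert all(lshr64(hi, lo, s)[0] == alignbit32(hi, lo, s) for s in range(32))
```

The 64-bit form also produces a high dword that fshr does not need, which is consistent with the extra register pressure reported above when the shift ends up on the VALU.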
https://github.com/llvm/llvm-project/pull/165295