[llvm] [AMDGPU] Add pattern to select scalar ops for fshr with uniform operands (PR #165295)

Thu Nov 6 02:17:17 PST 2025

jayfoad wrote:

> > > limit pattern to gfx9+
> > 
> > 
> > Why?
> 
> For other architectures, it leads to less optimized codegen and increased VGPR usage. For example:
> 
> Looking at the ASM for the amdgcn.bitcast.32bit.ll test for gfx600:
> 
> With change:
> 
> ```
> v_and_b32_e32 v1, 0xffff0000, v4
> v_and_b32_e32 v0, 0xffff0000, v2
> v_add_f32_e32 v1, 0x40c00000, v1
> v_add_f32_e32 v0, 0x40c00000, v0
> v_lshrrev_b32_e32 v1, 16, v1
> v_lshr_b64 v[0:1], v[0:1], 16
> ```
> 
> Without change:
> 
> ```
> v_and_b32_e32 v1, 0xffff0000, v1
> v_and_b32_e32 v0, 0xffff0000, v2
> v_add_f32_e32 v1, 0x40c00000, v1
> v_add_f32_e32 v0, 0x40c00000, v0
> v_lshrrev_b32_e32 v1, 16, v1
> v_alignbit_b32 v0, v1, v0, 16
> ```
> 
> Semantically, both are the same. However, with the change we end up using a 64-bit instruction and more VGPRs (5 vs. 3). I don't see the same behavior for GFX9, 10, 11, and 12, hence limiting the blast radius to GFX9+.

Hmm. I don't know exactly what happened in that test. I can't think of any reason why using s_lshr_b64 would be bad for GFX6 but good for GFX9+. I would prefer to enable your patch for all architectures.
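
For reference, here is a minimal sketch of why the two sequences above compute the same value in the low dword. It models a 32-bit funnel shift right (what `v_alignbit_b32` computes) and the low half of a 64-bit logical shift right on plain Python integers. The helper names `fshr32` and `lshr64_lo` are hypothetical and exist only for this illustration; nothing here is taken from the patch itself.

```python
MASK32 = 0xFFFFFFFF

def fshr32(hi, lo, amt):
    """32-bit funnel shift right, built from two 32-bit shifts (illustrative helper)."""
    amt &= 31  # only the low 5 bits of the shift amount are used
    if amt == 0:
        return lo & MASK32
    low_part = (lo & MASK32) >> amt            # bits that stay from the low dword
    high_part = (hi << (32 - amt)) & MASK32    # bits shifted in from the high dword
    return low_part | high_part

def lshr64_lo(hi, lo, amt):
    """Low dword of a 64-bit logical shift right of hi:lo (illustrative helper)."""
    amt &= 63
    value = ((hi & MASK32) << 32) | (lo & MASK32)
    return (value >> amt) & MASK32

# The two agree for every shift amount in 0..31, including the 16 used in the test.
for amt in range(32):
    assert fshr32(0x12345678, 0x9ABCDEF0, amt) == lshr64_lo(0x12345678, 0x9ABCDEF0, amt)
```

The equivalence only covers the low 32 bits of the 64-bit result and shift amounts below 32, which is the case in the quoted test.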

https://github.com/llvm/llvm-project/pull/165295