[llvm] [AMDGPU] Add pattern to select scalar ops for fshr with uniform operands (PR #165295)

Wed Nov 5 08:12:24 PST 2025

akadutta wrote:

> > limit pattern to gfx9+
> 
> Why?

For other architectures, it leads to less optimal codegen and higher VGPR usage. For example, here is the ASM for the amdgcn.bitcast.32bit.ll test on gfx600:

With change:
```
v_and_b32_e32 v1, 0xffff0000, v4
v_and_b32_e32 v0, 0xffff0000, v2
v_add_f32_e32 v1, 0x40c00000, v1
v_add_f32_e32 v0, 0x40c00000, v0
v_lshrrev_b32_e32 v1, 16, v1
v_lshr_b64 v[0:1], v[0:1], 16
```

Without change:
```
v_and_b32_e32 v1, 0xffff0000, v1
v_and_b32_e32 v0, 0xffff0000, v2
v_add_f32_e32 v1, 0x40c00000, v1
v_add_f32_e32 v0, 0x40c00000, v0
v_lshrrev_b32_e32 v1, 16, v1
v_alignbit_b32 v0, v1, v0, 16
```

Semantically, both are the same. However, with the change we end up using a 64-bit instruction and more VGPRs (5 vs 3). I don't see the same behavior on GFX9, 10, 11, and 12, hence limiting the blast radius to GFX9+.
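
To illustrate the "semantically the same" point, here is a rough Python model (not the actual selection pattern). It assumes `v_alignbit_b32 d, s0, s1, n` computes the low 32 bits of `{s0,s1} >> n`, and that `v_lshr_b64` shifts the full register pair. The helper names `fshr32` and `lshr64` are made up for this sketch.

```python
MASK32 = 0xFFFFFFFF

def fshr32(hi, lo, amt):
    """Model of a 32-bit funnel shift right (what v_alignbit_b32 is assumed to do)."""
    amt &= 31  # shift amount is taken modulo 32
    if amt == 0:
        return lo & MASK32
    return ((hi << (32 - amt)) | ((lo & MASK32) >> amt)) & MASK32

def lshr64(hi, lo, amt):
    """Model of a 64-bit logical shift right on a register pair (hi:lo).

    Returns (new_hi, new_lo); both halves are written, which is why the
    64-bit form needs a full VGPR pair for its result.
    """
    value = ((hi & MASK32) << 32) | (lo & MASK32)
    shifted = value >> (amt & 63)
    return (shifted >> 32) & MASK32, shifted & MASK32

# The low dword of the 64-bit shift matches the funnel shift for amounts 0..31.
hi, lo = 0x12345678, 0x9ABCDEF0
for amt in range(32):
    assert lshr64(hi, lo, amt)[1] == fshr32(hi, lo, amt)

print(hex(fshr32(hi, lo, 16)))  # 0x56789abc
```

Only the low half of the 64-bit result is needed here; the high half is extra output that the 32-bit form never produces.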

https://github.com/llvm/llvm-project/pull/165295