[llvm] WIP: [AMDGPU] Use s_cselect_b32 for uniform select of f32 values (PR #111688)

Jay Foad via llvm-commits llvm-commits at lists.llvm.org
Fri Oct 18 05:02:13 PDT 2024


================
@@ -9,7 +9,9 @@ define amdgpu_ps float @xor3_i1_const(float inreg %arg1, i32 inreg %arg2) {
 ; GCN-NEXT:    v_cmp_lt_f32_e64 s[2:3], s0, 0
 ; GCN-NEXT:    v_cmp_lt_f32_e32 vcc, s0, v0
 ; GCN-NEXT:    s_and_b64 s[0:1], s[2:3], vcc
-; GCN-NEXT:    v_cndmask_b32_e64 v0, 1.0, 0, s[0:1]
+; GCN-NEXT:    s_and_b64 s[0:1], s[0:1], exec
+; GCN-NEXT:    s_cselect_b32 s0, 0, 1.0
+; GCN-NEXT:    v_mov_b32_e32 v0, s0
----------------
jayfoad wrote:

I realize there are no VGPR to SGPR copies, but there is a VCC to SCC copy which is morally kind of the same thing: we could use a cost model to decide whether to implement that copy (by inserting S_AND) or to promote the user of scc to an equivalent VALU op.
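A rough sketch of the two options, reusing the sequences from the diff above (not actual compiler output):
```
; option 1: implement the VCC->SCC copy by inserting S_AND, keep the SALU select
s_and_b64 s[0:1], s[0:1], exec
s_cselect_b32 s0, 0, 1.0

; option 2: promote the user of scc to an equivalent VALU op instead
v_cndmask_b32_e64 v0, 1.0, 0, s[0:1]
```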

It's hard to do this at instruction selection time because the optimal sequence depends on whether the result of the select needs to go in an SGPR or not:
```
  %2 = V_CMP_EQ_F32_e64
  %5 = V_CNDMASK_B32
  ; result is in VGPR
```
vs:
```
  %2 = V_CMP_EQ_F32_e64
  %6 = S_AND_B64
  %5 = S_CSELECT_B32
  ; result is in SGPR
```
In the second case we don't want to use V_CNDMASK_B32, because we would need to insert a V_READFIRSTLANE_B32 afterwards, which is generally bad for performance.
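For comparison, a hypothetical sketch of what the V_CNDMASK_B32 route would look like when the result has to end up in an SGPR:
```
  %2 = V_CMP_EQ_F32_e64
  %5 = V_CNDMASK_B32
  %6 = V_READFIRSTLANE_B32 %5
  ; result is in SGPR, but via an extra VALU op and a cross-bank copy
```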

https://github.com/llvm/llvm-project/pull/111688
