[llvm] WIP: [AMDGPU] Use s_cselect_b32 for uniform select of f32 values (PR #111688)

Tue Oct 15 14:25:39 PDT 2024

================
@@ -9,7 +9,9 @@ define amdgpu_ps float @xor3_i1_const(float inreg %arg1, i32 inreg %arg2) {
 ; GCN-NEXT:    v_cmp_lt_f32_e64 s[2:3], s0, 0
 ; GCN-NEXT:    v_cmp_lt_f32_e32 vcc, s0, v0
 ; GCN-NEXT:    s_and_b64 s[0:1], s[2:3], vcc
-; GCN-NEXT:    v_cndmask_b32_e64 v0, 1.0, 0, s[0:1]
+; GCN-NEXT:    s_and_b64 s[0:1], s[0:1], exec
+; GCN-NEXT:    s_cselect_b32 s0, 0, 1.0
+; GCN-NEXT:    v_mov_b32_e32 v0, s0
----------------
alex-t wrote:

SIFixSGPRCopies has nothing to do with that MIR at all. There are no VGPR-to-SGPR copies in the input MIR. The only thing this pass changes is that it lowers the VCC-to-SCC copy to S_AND_B64.
I agree that a Select SDNode whose Cond operand is produced by the VALU, and whose user expects the result in a VGPR, should be selected to V_CNDMASK_B32. Since this is unrelated to lowering VGPR-to-SGPR copies, I am curious what the right place for that optimization would be. It seems it would be better done in ISel.
The problem is that selection walks the DAG from the root upward, so when selecting the Select SDNode we would have to walk its Cond operand sub-DAG to guess whether it is going to be selected to VALU. For this particular case, we could deduce it from the fact that the Setcc has an operand of FP type and we have no scalar FP comparison. So I would consider a custom selection callback for that purpose.
I attached the selection DAG after DAGCombine ([combined.pdf](https://github.com/user-attachments/files/17385275/combined.pdf)) to illustrate the idea.
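
As a rough illustration of the heuristic described above, here is a self-contained toy model, not LLVM code. The types `Node`, `Opcode`, `ValueKind` and `SelectForm`, and the functions `condLikelyVALU` and `chooseSelect`, are all hypothetical stand-ins invented for this sketch; a real implementation would work on `SDNode`s inside a custom selection callback. It assumes, as stated above, that the target has no scalar FP comparison, and it further assumes that walking only through logical ops (and/or/xor) of the Cond sub-DAG is enough for this case.

```cpp
#include <cassert>
#include <vector>

// Hypothetical toy types for illustration; NOT LLVM's SelectionDAG API.
enum class Opcode { SetCC, And, Or, Xor, Other };
enum class ValueKind { Int, Float };

struct Node {
  Opcode Op;
  ValueKind OperandKind;         // type of the compared operands (SetCC only)
  std::vector<const Node *> Ops; // operands within the condition sub-DAG
};

enum class SelectForm { VCndMask, SCSelect };

// Guess whether the condition sub-DAG will be selected to VALU.
// Assumption from the discussion: without a scalar FP compare, a SetCC
// with FP operands has to become a VALU compare (e.g. v_cmp_lt_f32).
bool condLikelyVALU(const Node *N, bool HasScalarFPCompare = false) {
  if (!N)
    return false;
  switch (N->Op) {
  case Opcode::SetCC:
    return N->OperandKind == ValueKind::Float && !HasScalarFPCompare;
  case Opcode::And:
  case Opcode::Or:
  case Opcode::Xor:
    // A logical op is treated as VALU-produced if any input is.
    for (const Node *Op : N->Ops)
      if (condLikelyVALU(Op, HasScalarFPCompare))
        return true;
    return false;
  default:
    return false;
  }
}

// Pick V_CNDMASK_B32 only when the condition comes from the VALU and the
// user wants the result in a VGPR; otherwise keep the scalar select.
SelectForm chooseSelect(const Node *Cond, bool UserWantsVGPR,
                        bool HasScalarFPCompare = false) {
  if (UserWantsVGPR && condLikelyVALU(Cond, HasScalarFPCompare))
    return SelectForm::VCndMask;
  return SelectForm::SCSelect;
}

// Mirrors the shape of the test above: and(setcc f32, setcc f32) feeding a
// select whose result is returned in a VGPR.
void checkThreadExample() {
  Node CmpA{Opcode::SetCC, ValueKind::Float, {}};
  Node CmpB{Opcode::SetCC, ValueKind::Float, {}};
  Node Cond{Opcode::And, ValueKind::Int, {&CmpA, &CmpB}};
  assert(chooseSelect(&Cond, /*UserWantsVGPR=*/true) == SelectForm::VCndMask);
}
```

In this model the `xor3_i1_const` pattern would keep `v_cndmask_b32`, while a select on an integer compare would still be free to use `s_cselect_b32`. Whether such a sub-DAG walk is cheap and precise enough in practice is an open question here.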

https://github.com/llvm/llvm-project/pull/111688