[llvm] [AMDGPU] Adopt new lowering sequence for `fdiv16` (PR #109295)

Thu Sep 26 05:07:11 PDT 2024

================
@@ -10606,19 +10606,40 @@ SDValue SITargetLowering::LowerFDIV16(SDValue Op, SelectionDAG &DAG) const {
     return FastLowered;
 
   SDLoc SL(Op);
-  SDValue Src0 = Op.getOperand(0);
-  SDValue Src1 = Op.getOperand(1);
-
-  SDValue CvtSrc0 = DAG.getNode(ISD::FP_EXTEND, SL, MVT::f32, Src0);
-  SDValue CvtSrc1 = DAG.getNode(ISD::FP_EXTEND, SL, MVT::f32, Src1);
-
-  SDValue RcpSrc1 = DAG.getNode(AMDGPUISD::RCP, SL, MVT::f32, CvtSrc1);
-  SDValue Quot = DAG.getNode(ISD::FMUL, SL, MVT::f32, CvtSrc0, RcpSrc1);
-
-  SDValue FPRoundFlag = DAG.getTargetConstant(0, SL, MVT::i32);
-  SDValue BestQuot = DAG.getNode(ISD::FP_ROUND, SL, MVT::f16, Quot, FPRoundFlag);
+  SDValue LHS = Op.getOperand(0);
+  SDValue RHS = Op.getOperand(1);
 
-  return DAG.getNode(AMDGPUISD::DIV_FIXUP, SL, MVT::f16, BestQuot, Src1, Src0);
+  // a32.u = opx(V_CVT_F32_F16, a.u);
+  // b32.u = opx(V_CVT_F32_F16, b.u);
+  // r32.u = opx(V_RCP_F32, b32.u);
+  // q32.u = opx(V_MUL_F32, a32.u, r32.u);
+  // e32.u = opx(V_MAD_F32, (b32.u^_neg32), q32.u, a32.u);
+  // q32.u = opx(V_MAD_F32, e32.u, r32.u, q32.u);
+  // e32.u = opx(V_MAD_F32, (b32.u^_neg32), q32.u, a32.u);
+  // tmp.u = opx(V_MUL_F32, e32.u, r32.u);
+  // tmp.u = opx(V_AND_B32, tmp.u, 0xff800000)
+  // tmp.u = opx(V_FREXP_MANT_F32, tmp.u);
+  // q32.u = opx(V_ADD_F32, tmp.u, q32.u);
+  // q16.u = opx(V_CVT_F16_F32, q32.u);
+  // q16.u = opx(V_DIV_FIXUP_F16, q16.u, b.u, a.u);
+
+  SDValue LHSExt = DAG.getNode(ISD::FP_EXTEND, SL, MVT::f32, LHS);
+  SDValue RHSExt = DAG.getNode(ISD::FP_EXTEND, SL, MVT::f32, RHS);
+  SDValue NegRHSExt = DAG.getNode(ISD::FNEG, SL, MVT::f32, RHSExt);
+  SDValue Rcp = DAG.getNode(AMDGPUISD::RCP, SL, MVT::f32, RHSExt);
+  SDValue Quot = DAG.getNode(ISD::FMUL, SL, MVT::f32, LHSExt, Rcp);
+  SDValue Err = DAG.getNode(ISD::FMA, SL, MVT::f32, NegRHSExt, Quot, LHSExt);
----------------
jayfoad wrote:

> @jayfoad The HW sequence is `FMAD`, but I have done my experiments using `FMA` from the very beginning. It works perfectly fine (passing both @b-sumner's test case and the OpenCL CTS). That's why the first version of this PR used `FMA` regardless, until @arsenm pointed it out in this thread.
> 
> I understand `FMA` and `FMAD` are (slightly) semantically different, but since `FMA` is widely available and passes the OpenCL CTS (not just the math tests, but also all the other FP16-related ones), do you think it makes sense to just use `FMA` here? @jayfoad @arsenm @b-sumner What's more, @arsenm already suggested choosing between `FMA` and `FMAD` based on availability, which also indicates `FMA` can be a "direct" replacement.
> 
> BTW, the reason I chose `FMA` to begin with was, as I mentioned in the ticket, I found `FMAD` is not always available and I asked the question in the ticket, both @arsenm and @b-sumner said `FMA` can be used.

I don't know how thorough the OpenCL conformance tests are. Is it possible to exhaustively test this on all ~2^32 possible inputs?
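
The ~2^32 figure comes from the two f16 operands: 2^16 bit patterns each, so 2^16 * 2^16 pairs. A minimal sketch of what such a sweep could look like on the host side is below. It is not part of the PR: `countMismatches`, `isF16NaN`, and the `Candidate`/`Reference` callbacks are hypothetical names, and the sketch assumes the candidate (for example, the lowered sequence run on the target) and the reference (for example, a correctly rounded division) are supplied by the caller as functions over raw f16 bits. It also assumes any two NaN results should compare equal.

```cpp
#include <cstdint>

// Returns true if the 16-bit pattern encodes an IEEE half-precision NaN
// (exponent all ones, mantissa non-zero).
inline bool isF16NaN(uint16_t Bits) { return (Bits & 0x7FFF) > 0x7C00; }

// Hypothetical harness: walks pairs of f16 bit patterns (A, B) and counts
// the pairs on which Candidate and Reference disagree. With Stride == 1 this
// visits all 2^16 * 2^16 = 2^32 operand pairs; a larger Stride subsamples.
// Both callbacks take and return raw f16 bits and are placeholders for
// whatever actually computes A / B.
template <typename CandidateFn, typename ReferenceFn>
uint64_t countMismatches(CandidateFn Candidate, ReferenceFn Reference,
                         uint32_t Stride = 1) {
  if (Stride == 0)
    Stride = 1; // Avoid an infinite loop on a zero stride.

  uint64_t Mismatches = 0;
  for (uint32_t A = 0; A <= 0xFFFF; A += Stride) {
    for (uint32_t B = 0; B <= 0xFFFF; B += Stride) {
      uint16_t Got = Candidate(uint16_t(A), uint16_t(B));
      uint16_t Want = Reference(uint16_t(A), uint16_t(B));
      // Assumption: NaN payload and sign are not compared.
      if (isF16NaN(Got) && isF16NaN(Want))
        continue;
      if (Got != Want)
        ++Mismatches;
    }
  }
  return Mismatches;
}
```

A result of zero with `Stride == 1` would mean the two implementations agree bit-for-bit on every non-NaN result. Whether exact agreement or an ULP bound is the right pass criterion is a separate question that this sketch does not settle.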

https://github.com/llvm/llvm-project/pull/109295