[llvm] [NVPTX] Fold (add (select 0, (mul a, b)), c) -> (select c, (mad a, b, c)) (PR #96352)

Tue Jun 25 12:30:00 PDT 2024

================
@@ -5215,103 +5215,129 @@ bool NVPTXTargetLowering::allowUnsafeFPMath(MachineFunction &MF) const {
   return F.getFnAttribute("unsafe-fp-math").getValueAsBool();
 }
 
+static bool isConstZero(const SDValue &Operand) {
+  const auto *Const = dyn_cast<ConstantSDNode>(Operand);
+  return Const && Const->getZExtValue() == 0;
+}
+
 /// PerformADDCombineWithOperands - Try DAG combinations for an ADD with
 /// operands N0 and N1.  This is a helper for PerformADDCombine that is
 /// called with the default operands, and if that fails, with commuted
 /// operands.
-static SDValue PerformADDCombineWithOperands(
-    SDNode *N, SDValue N0, SDValue N1, TargetLowering::DAGCombinerInfo &DCI,
-    const NVPTXSubtarget &Subtarget, CodeGenOptLevel OptLevel) {
-  SelectionDAG  &DAG = DCI.DAG;
-  // Skip non-integer, non-scalar case
-  EVT VT=N0.getValueType();
-  if (VT.isVector())
+static SDValue
+PerformADDCombineWithOperands(SDNode *N, SDValue N0, SDValue N1,
+                              TargetLowering::DAGCombinerInfo &DCI) {
+  EVT VT = N0.getValueType();
+
+  // Since integer multiply-add costs the same as integer multiply
+  // but is more costly than integer add, do the fusion only when
+  // the mul is only used in the add.
----------------
jlebar wrote:

1. Is this statement true?  According to https://docs.nvidia.com/cuda/cuda-c-programming-guide/#arithmetic-instructions, on sm70+ integer multiply, integer add, and integer madd all have the same throughput.

3. If we really do want to add the hasOneUse() constraint for the simple madd transformation, we should definitely call it out in the commit message (or even do it as a separate commit?)

https://github.com/llvm/llvm-project/pull/96352