[PATCH] R600/SI: Aggressively fold fma and mad

Tom Stellard tom at stellard.net
Wed Jan 28 10:11:17 PST 2015


On Mon, Jan 26, 2015 at 12:59:24PM -0800, Matt Arsenault wrote:
> Hi,
> 
> These allow using fma and mad instructions in more situations, and fix incorrectly using v_mad_f32 when denormals are requested.
> 
> 

> From 0e1e5e441862f01b39205657f5eb1decbd836df1 Mon Sep 17 00:00:00 2001
> From: Matt Arsenault <Matthew.Arsenault at amd.com>
> Date: Thu, 22 Jan 2015 18:40:47 -0800
> Subject: [PATCH 1/5] R600/SI: Fix tonga's basic scheduling model
> 
> ---
>  lib/Target/R600/Processors.td | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>

LGTM. 

> From 63ff27d46e77352e21aed2055d7d087cbbbd4246 Mon Sep 17 00:00:00 2001
> From: Matt Arsenault <Matthew.Arsenault at amd.com>
> Date: Thu, 22 Jan 2015 18:39:58 -0800
> Subject: [PATCH 2/5] R600/SI: Add subtarget feature for if f32 fma is fast
> 
> ---
>  lib/Target/R600/AMDGPU.td           |  6 ++++++
>  lib/Target/R600/AMDGPUSubtarget.cpp |  3 ++-
>  lib/Target/R600/AMDGPUSubtarget.h   |  5 +++++
>  lib/Target/R600/Processors.td       | 12 +++++++++---
>  lib/Target/R600/SIISelLowering.cpp  |  2 +-
>  5 files changed, 23 insertions(+), 5 deletions(-)

LGTM.

> From 6d5416b90a73b55aeb2fbee5f8426c01b1c876cc Mon Sep 17 00:00:00 2001
> From: Matt Arsenault <Matthew.Arsenault at amd.com>
> Date: Wed, 21 Jan 2015 17:29:48 -0800
> Subject: [PATCH 3/5] R600/SI: Implement enableAggressiveFMAFusion
> 
> Add tests for the various combines. For f64 this should always
> be at least cycle neutral on all subtargets, and faster on some.
> For f32 we should prefer selecting
> v_mad_f32 over v_fma_f32.
> ---
>  lib/Target/R600/SIISelLowering.cpp |  31 +++-
>  lib/Target/R600/SIISelLowering.h   |   1 +
>  test/CodeGen/R600/fma-combine.ll   | 368 +++++++++++++++++++++++++++++++++++++
>  3 files changed, 399 insertions(+), 1 deletion(-)
>  create mode 100644 test/CodeGen/R600/fma-combine.ll
> 

LGTM.
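
For reference, the interesting new folds this enables in the generic
combiner are the ones through an intermediate fma, e.g.:

  (fadd (fma x, y, (fmul u, v)), z) -> (fma x, y, (fma u, v, z))

which the new tests exercise.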

> From 19ce04e9893142edfa79078d9b5e9991a9cb7445 Mon Sep 17 00:00:00 2001
> From: Matt Arsenault <Matthew.Arsenault at amd.com>
> Date: Sun, 25 Jan 2015 12:56:18 -0800
> Subject: [PATCH 4/5] R600: Copy aggressive fma combines for mad
> 
> v_mad_f32 has the same result as the separate add and
> multiply, and is always full rate, so we should always
> try to form these as long as we don't need to support
> denormals. I don't think there is a great way to share this
> code without adding a new generic mad node and a generic
> check for denormal support.

Could you add a TLI query, something like:

SDValue mergeMulAdd(SDValue A, SDValue B, SDValue C)

so that the target could decide what opcode to use?
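
Something like this, maybe (sketch only; the real hook would need the
DAG and debug location plumbed through):

  // Default in TargetLowering: form the generic fused node.
  virtual SDValue mergeMulAdd(SelectionDAG &DAG, SDLoc SL, EVT VT,
                              SDValue A, SDValue B, SDValue C) const {
    return DAG.getNode(ISD::FMA, SL, VT, A, B, C);
  }

  // AMDGPU override: form mad instead when that is preferable.
  SDValue mergeMulAdd(SelectionDAG &DAG, SDLoc SL, EVT VT,
                      SDValue A, SDValue B, SDValue C) const override {
    return DAG.getNode(AMDGPUISD::MAD, SL, VT, A, B, C);
  }

Then the shared combine code could call the hook instead of hardcoding
ISD::FMA or AMDGPUISD::MAD.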

> ---
>  lib/Target/R600/AMDGPUISelLowering.cpp | 133 ++++++++++
>  lib/Target/R600/AMDGPUISelLowering.h   |   3 +
>  lib/Target/R600/AMDGPUInstructions.td  |   5 -
>  lib/Target/R600/R600Instructions.td    |   2 +-
>  lib/Target/R600/SIISelLowering.cpp     |  25 +-
>  lib/Target/R600/SIInstructions.td      |   5 +-
>  test/CodeGen/R600/mad-combine.ll       | 446 +++++++++++++++++++++++++++++++++
>  7 files changed, 587 insertions(+), 32 deletions(-)
>  create mode 100644 test/CodeGen/R600/mad-combine.ll
> 
> diff --git a/lib/Target/R600/AMDGPUISelLowering.cpp b/lib/Target/R600/AMDGPUISelLowering.cpp
> index d3897fe..f3769e3 100644
> --- a/lib/Target/R600/AMDGPUISelLowering.cpp
> +++ b/lib/Target/R600/AMDGPUISelLowering.cpp
> @@ -395,6 +395,9 @@ AMDGPUTargetLowering::AMDGPUTargetLowering(TargetMachine &TM) :
>    setTargetDAGCombine(ISD::SELECT_CC);
>    setTargetDAGCombine(ISD::STORE);
>  
> +  setTargetDAGCombine(ISD::FADD);
> +  setTargetDAGCombine(ISD::FSUB);
> +
>    setBooleanContents(ZeroOrNegativeOneBooleanContent);
>    setBooleanVectorContents(ZeroOrNegativeOneBooleanContent);
>  
> @@ -2419,6 +2422,128 @@ SDValue AMDGPUTargetLowering::performMulCombine(SDNode *N,
>    return DAG.getSExtOrTrunc(Mul, DL, VT);
>  }
>  
> +// FIXME: Mostly copied directly from generic FMA combines.
> +// We can form f32 mads as long as denormals are not requested.

Do you plan to add the denormal check here? It looks like patch 5 of
this series adds it.

> +SDValue AMDGPUTargetLowering::performFAddCombine(SDNode *N,
> +                                                 DAGCombinerInfo &DCI) const {
> +  EVT VT = N->getValueType(0);
> +
> +  if (VT != MVT::f32) // There is no mad instruction for f64.
> +    return SDValue();
> +
> +  SelectionDAG &DAG = DCI.DAG;
> +  SDLoc SL(N);
> +
> +  SDValue N0 = N->getOperand(0);
> +  SDValue N1 = N->getOperand(1);
> +
> +  // fold (fadd (fmul x, y), z) -> (mad x, y, z)
> +  if (N0.getOpcode() == ISD::FMUL) {
> +    return DAG.getNode(AMDGPUISD::MAD, SL, VT,
> +                       N0.getOperand(0), N0.getOperand(1), N1);
> +  }
> +
> +  // fold (fadd x, (fmul y, z)) -> (mad y, z, x)
> +  // Note: Commutes FADD operands.
> +  if (N1.getOpcode() == ISD::FMUL) {
> +    return DAG.getNode(AMDGPUISD::MAD, SL, VT,
> +                       N1.getOperand(0), N1.getOperand(1), N0);
> +  }
> +
> +  // fold (fadd (mad x, y, (fmul u, v)), z) -> (mad x, y, (mad u, v, z))
> +  if (N0.getOpcode() == AMDGPUISD::MAD &&
> +      N0.getOperand(2).getOpcode() == ISD::FMUL) {
> +    return DAG.getNode(AMDGPUISD::MAD, SL, VT,
> +                       N0.getOperand(0), N0.getOperand(1),
> +                       DAG.getNode(AMDGPUISD::MAD, SL, VT,
> +                                   N0.getOperand(2).getOperand(0),
> +                                   N0.getOperand(2).getOperand(1),
> +                                   N1));
> +  }
> +
> +  // fold (fadd x, (mad y, z, (fmul u, v))) -> (mad y, z, (mad u, v, x))
> +  if (N1->getOpcode() == AMDGPUISD::MAD &&
> +      N1.getOperand(2).getOpcode() == ISD::FMUL) {
> +    return DAG.getNode(AMDGPUISD::MAD, SL, VT,
> +                       N1.getOperand(0), N1.getOperand(1),
> +                       DAG.getNode(AMDGPUISD::MAD, SL, VT,
> +                                   N1.getOperand(2).getOperand(0),
> +                                   N1.getOperand(2).getOperand(1),
> +                                   N0));
> +  }
> +
> +  return SDValue();
> +}
> +
> +// FIXME: Mostly copied directly from generic FMA combines.
> +SDValue AMDGPUTargetLowering::performFSubCombine(SDNode *N,
> +                                                 DAGCombinerInfo &DCI) const {
> +  EVT VT = N->getValueType(0);
> +
> +  if (VT != MVT::f32) // There is no mad instruction for f64.
> +    return SDValue();
> +
> +  SelectionDAG &DAG = DCI.DAG;
> +  SDLoc SL(N);
> +
> +  SDValue N0 = N->getOperand(0);
> +  SDValue N1 = N->getOperand(1);
> +
> +  // fold (fsub (fmul x, y), z) -> (mad x, y, (fneg z))
> +  if (N0.getOpcode() == ISD::FMUL) {
> +    return DAG.getNode(AMDGPUISD::MAD, SL, VT,
> +                       N0.getOperand(0), N0.getOperand(1),
> +                       DAG.getNode(ISD::FNEG, SL, VT, N1));
> +  }
> +
> +  // fold (fsub x, (fmul y, z)) -> (mad (fneg y), z, x)
> +  // Note: Commutes FSUB operands.
> +  if (N1.getOpcode() == ISD::FMUL) {
> +    return DAG.getNode(AMDGPUISD::MAD, SL, VT,
> +                       DAG.getNode(ISD::FNEG, SL, VT,
> +                                   N1.getOperand(0)),
> +                       N1.getOperand(1), N0);
> +  }
> +
> +  // fold (fsub (fneg (fmul x, y)), z) -> (mad (fneg x), y, (fneg z))
> +  if (N0.getOpcode() == ISD::FNEG &&
> +      N0.getOperand(0).getOpcode() == ISD::FMUL) {
> +    SDValue N00 = N0.getOperand(0).getOperand(0);
> +    SDValue N01 = N0.getOperand(0).getOperand(1);
> +    return DAG.getNode(AMDGPUISD::MAD, SL, VT,
> +                       DAG.getNode(ISD::FNEG, SL, VT, N00), N01,
> +                       DAG.getNode(ISD::FNEG, SL, VT, N1));
> +  }
> +
> +  // fold (fsub (mad x, y, (fmul u, v)), z)
> +  //   -> (mad x, y, (mad u, v, (fneg z)))
> +  if (N0.getOpcode() == AMDGPUISD::MAD &&
> +      N0.getOperand(2).getOpcode() == ISD::FMUL) {
> +    return DAG.getNode(AMDGPUISD::MAD, SL, VT,
> +                       N0.getOperand(0), N0.getOperand(1),
> +                       DAG.getNode(AMDGPUISD::MAD, SL, VT,
> +                                   N0.getOperand(2).getOperand(0),
> +                                   N0.getOperand(2).getOperand(1),
> +                                   DAG.getNode(ISD::FNEG, SL, VT, N1)));
> +  }
> +
> +  // fold (fsub x, (mad y, z, (fmul u, v)))
> +  //   -> (mad (fneg y), z, (mad (fneg u), v, x))
> +  if (N1.getOpcode() == AMDGPUISD::MAD &&
> +      N1.getOperand(2).getOpcode() == ISD::FMUL) {
> +    SDValue N20 = N1.getOperand(2).getOperand(0);
> +    SDValue N21 = N1.getOperand(2).getOperand(1);
> +    return DAG.getNode(AMDGPUISD::MAD, SL, VT,
> +                       DAG.getNode(ISD::FNEG, SL, VT, N1.getOperand(0)),
> +                       N1.getOperand(1),
> +                       DAG.getNode(AMDGPUISD::MAD, SL, VT,
> +                                   DAG.getNode(ISD::FNEG, SL, VT, N20),
> +                                   N21, N0));
> +  }
> +
> +  return SDValue();
> +}
> +
>  SDValue AMDGPUTargetLowering::PerformDAGCombine(SDNode *N,
>                                                  DAGCombinerInfo &DCI) const {
>    SelectionDAG &DAG = DCI.DAG;
> @@ -2436,6 +2561,14 @@ SDValue AMDGPUTargetLowering::PerformDAGCombine(SDNode *N,
>        simplifyI24(N1, DCI);
>        return SDValue();
>      }
> +  case ISD::FADD:
> +    if (DCI.getDAGCombineLevel() < AfterLegalizeDAG)
> +      break;
> +    return performFAddCombine(N, DCI);
> +  case ISD::FSUB:
> +    if (DCI.getDAGCombineLevel() < AfterLegalizeDAG)
> +      break;
> +    return performFSubCombine(N, DCI);
>    case ISD::SELECT: {
>      SDValue Cond = N->getOperand(0);
>      if (Cond.getOpcode() == ISD::SETCC && Cond.hasOneUse()) {
> diff --git a/lib/Target/R600/AMDGPUISelLowering.h b/lib/Target/R600/AMDGPUISelLowering.h
> index 387a58e..4aca7c6 100644
> --- a/lib/Target/R600/AMDGPUISelLowering.h
> +++ b/lib/Target/R600/AMDGPUISelLowering.h
> @@ -68,6 +68,9 @@ private:
>    SDValue performMulCombine(SDNode *N, DAGCombinerInfo &DCI) const;
>  
>  protected:
> +  SDValue performFAddCombine(SDNode *N, DAGCombinerInfo &DCI) const;
> +  SDValue performFSubCombine(SDNode *N, DAGCombinerInfo &DCI) const;
> +
>    static EVT getEquivalentMemType(LLVMContext &Context, EVT VT);
>    static EVT getEquivalentLoadRegType(LLVMContext &Context, EVT VT);
>  
> diff --git a/lib/Target/R600/AMDGPUInstructions.td b/lib/Target/R600/AMDGPUInstructions.td
> index e42796b..ff3fddc 100644
> --- a/lib/Target/R600/AMDGPUInstructions.td
> +++ b/lib/Target/R600/AMDGPUInstructions.td
> @@ -413,11 +413,6 @@ def atomic_xor_global : global_binary_atomic_op<atomic_load_xor>;
>  // Misc Pattern Fragments
>  //===----------------------------------------------------------------------===//
>  
> -def fmad : PatFrag <
> -  (ops node:$src0, node:$src1, node:$src2),
> -  (fadd (fmul node:$src0, node:$src1), node:$src2)
> ->;
> -
>  class Constants {
>  int TWO_PI = 0x40c90fdb;
>  int PI = 0x40490fdb;
> diff --git a/lib/Target/R600/R600Instructions.td b/lib/Target/R600/R600Instructions.td
> index d004262..06acd6b 100644
> --- a/lib/Target/R600/R600Instructions.td
> +++ b/lib/Target/R600/R600Instructions.td
> @@ -914,7 +914,7 @@ class MULADD_Common <bits<5> inst> : R600_3OP <
>  
>  class MULADD_IEEE_Common <bits<5> inst> : R600_3OP <
>    inst, "MULADD_IEEE",
> -  [(set f32:$dst, (fadd (fmul f32:$src0, f32:$src1), f32:$src2))]
> +  [(set f32:$dst, (AMDGPUmad f32:$src0, f32:$src1, f32:$src2))]
>  >;
>  
>  class FMA_Common <bits<5> inst> : R600_3OP <
> diff --git a/lib/Target/R600/SIISelLowering.cpp b/lib/Target/R600/SIISelLowering.cpp
> index 894bd6e..6dc97ea 100644
> --- a/lib/Target/R600/SIISelLowering.cpp
> +++ b/lib/Target/R600/SIISelLowering.cpp
> @@ -1603,7 +1603,7 @@ SDValue SITargetLowering::PerformDAGCombine(SDNode *N,
>        }
>      }
>  
> -    break;
> +    return AMDGPUTargetLowering::performFAddCombine(N, DCI);
>    }
>    case ISD::FSUB: {
>      if (DCI.getDAGCombineLevel() < AfterLegalizeDAG)
> @@ -1616,27 +1616,6 @@ SDValue SITargetLowering::PerformDAGCombine(SDNode *N,
>      if (VT == MVT::f32) {
>        SDValue LHS = N->getOperand(0);
>        SDValue RHS = N->getOperand(1);
> -
> -      if (LHS.getOpcode() == ISD::FMUL) {
> -        // (fsub (fmul a, b), c) -> mad a, b, (fneg c)
> -
> -        SDValue A = LHS.getOperand(0);
> -        SDValue B = LHS.getOperand(1);
> -        SDValue C = DAG.getNode(ISD::FNEG, DL, VT, RHS);
> -
> -        return DAG.getNode(AMDGPUISD::MAD, DL, VT, A, B, C);
> -      }
> -
> -      if (RHS.getOpcode() == ISD::FMUL) {
> -        // (fsub c, (fmul a, b)) -> mad (fneg a), b, c
> -
> -        SDValue A = DAG.getNode(ISD::FNEG, DL, VT, RHS.getOperand(0));
> -        SDValue B = RHS.getOperand(1);
> -        SDValue C = LHS;
> -
> -        return DAG.getNode(AMDGPUISD::MAD, DL, VT, A, B, C);
> -      }
> -
>        if (LHS.getOpcode() == ISD::FADD) {
>          // (fsub (fadd a, a), c) -> mad 2.0, a, (fneg c)
>  
> @@ -1658,6 +1637,8 @@ SDValue SITargetLowering::PerformDAGCombine(SDNode *N,
>            return DAG.getNode(AMDGPUISD::MAD, DL, VT, NegTwo, A, LHS);
>          }
>        }
> +
> +      return AMDGPUTargetLowering::performFSubCombine(N, DCI);
>      }
>  
>      break;
> diff --git a/lib/Target/R600/SIInstructions.td b/lib/Target/R600/SIInstructions.td
> index d758f9f..c614609 100644
> --- a/lib/Target/R600/SIInstructions.td
> +++ b/lib/Target/R600/SIInstructions.td
> @@ -1607,7 +1607,7 @@ defm V_MAD_LEGACY_F32 : VOP3Inst <vop3<0x140, 0x1c0>, "v_mad_legacy_f32",
>  >;
>  
>  defm V_MAD_F32 : VOP3Inst <vop3<0x141, 0x1c1>, "v_mad_f32",
> -  VOP_F32_F32_F32_F32, fmad
> +  VOP_F32_F32_F32_F32, AMDGPUmad
>  >;
>  
>  defm V_MAD_I32_I24 : VOP3Inst <vop3<0x142, 0x1c2>, "v_mad_i32_i24",
> @@ -2748,9 +2748,6 @@ def : Pat <
>    (V_MUL_HI_I32 $src0, $src1)
>  >;
>  
> -def : Vop3ModPat<V_MAD_F32, VOP_F32_F32_F32_F32, AMDGPUmad>;
> -
> -
>  defm : BFIPatterns <V_BFI_B32, S_MOV_B32, SReg_64>;
>  def : ROTRPattern <V_ALIGNBIT_B32>;
>  
> diff --git a/test/CodeGen/R600/mad-combine.ll b/test/CodeGen/R600/mad-combine.ll
> new file mode 100644
> index 0000000..b116b2c
> --- /dev/null
> +++ b/test/CodeGen/R600/mad-combine.ll
> @@ -0,0 +1,446 @@
> +; RUN: llc -march=amdgcn -mcpu=tahiti -verify-machineinstrs < %s | FileCheck -check-prefix=SI -check-prefix=FUNC %s
> +; RUN: llc -march=amdgcn -mcpu=tahiti -verify-machineinstrs -fp-contract=fast < %s | FileCheck -check-prefix=SI -check-prefix=FUNC %s
> +; RUN: llc -march=amdgcn -mcpu=tahiti -verify-machineinstrs -enable-unsafe-fp-math < %s | FileCheck -check-prefix=SI -check-prefix=FUNC %s
> +
> +; Make sure we still form mad, instead of fma, even when unsafe math or fp-contract is allowed.
> +
> +
> +declare i32 @llvm.r600.read.tidig.x() #0
> +declare float @llvm.fabs.f32(float) #0
> +declare float @llvm.fma.f32(float, float, float) #0
> +declare float @llvm.fmuladd.f32(float, float, float) #0
> +
> +; (fadd (fmul x, y), z) -> (fma x, y, z)
> +; FUNC-LABEL: {{^}}combine_to_mad_f32_0:
> +; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> +; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> +; SI: v_mad_f32 [[RESULT:v[0-9]+]], [[A]], [[B]], [[C]]
> +; SI: buffer_store_dword [[RESULT]]
> +define void @combine_to_mad_f32_0(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
> +  %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> +  %gep.0 = getelementptr float addrspace(1)* %in, i32 %tid
> +  %gep.1 = getelementptr float addrspace(1)* %gep.0, i32 1
> +  %gep.2 = getelementptr float addrspace(1)* %gep.0, i32 2
> +  %gep.out = getelementptr float addrspace(1)* %out, i32 %tid
> +
> +  %a = load float addrspace(1)* %gep.0
> +  %b = load float addrspace(1)* %gep.1
> +  %c = load float addrspace(1)* %gep.2
> +
> +  %mul = fmul float %a, %b
> +  %fma = fadd float %mul, %c
> +  store float %fma, float addrspace(1)* %gep.out
> +  ret void
> +}
> +
> +; (fadd (fmul x, y), z) -> (fma x, y, z)
> +; FUNC-LABEL: {{^}}combine_to_mad_f32_0_2use:
> +; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> +; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> +; SI-DAG: buffer_load_dword [[D:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:12{{$}}
> +; SI-DAG: v_mad_f32 [[RESULT0:v[0-9]+]], [[A]], [[B]], [[C]]
> +; SI-DAG: v_mad_f32 [[RESULT1:v[0-9]+]], [[A]], [[B]], [[D]]
> +; SI-DAG: buffer_store_dword [[RESULT0]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI-DAG: buffer_store_dword [[RESULT1]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> +; SI: s_endpgm
> +define void @combine_to_mad_f32_0_2use(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
> +  %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> +  %gep.0 = getelementptr float addrspace(1)* %in, i32 %tid
> +  %gep.1 = getelementptr float addrspace(1)* %gep.0, i32 1
> +  %gep.2 = getelementptr float addrspace(1)* %gep.0, i32 2
> +  %gep.3 = getelementptr float addrspace(1)* %gep.0, i32 3
> +  %gep.out.0 = getelementptr float addrspace(1)* %out, i32 %tid
> +  %gep.out.1 = getelementptr float addrspace(1)* %gep.out.0, i32 1
> +
> +  %a = load float addrspace(1)* %gep.0
> +  %b = load float addrspace(1)* %gep.1
> +  %c = load float addrspace(1)* %gep.2
> +  %d = load float addrspace(1)* %gep.3
> +
> +  %mul = fmul float %a, %b
> +  %fma0 = fadd float %mul, %c
> +  %fma1 = fadd float %mul, %d
> +
> +  store float %fma0, float addrspace(1)* %gep.out.0
> +  store float %fma1, float addrspace(1)* %gep.out.1
> +  ret void
> +}
> +
> +; (fadd x, (fmul y, z)) -> (fma y, z, x)
> +; FUNC-LABEL: {{^}}combine_to_mad_f32_1:
> +; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> +; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> +; SI: v_mad_f32 [[RESULT:v[0-9]+]], [[A]], [[B]], [[C]]
> +; SI: buffer_store_dword [[RESULT]]
> +define void @combine_to_mad_f32_1(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
> +  %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> +  %gep.0 = getelementptr float addrspace(1)* %in, i32 %tid
> +  %gep.1 = getelementptr float addrspace(1)* %gep.0, i32 1
> +  %gep.2 = getelementptr float addrspace(1)* %gep.0, i32 2
> +  %gep.out = getelementptr float addrspace(1)* %out, i32 %tid
> +
> +  %a = load float addrspace(1)* %gep.0
> +  %b = load float addrspace(1)* %gep.1
> +  %c = load float addrspace(1)* %gep.2
> +
> +  %mul = fmul float %a, %b
> +  %fma = fadd float %c, %mul
> +  store float %fma, float addrspace(1)* %gep.out
> +  ret void
> +}
> +
> +; (fsub (fmul x, y), z) -> (fma x, y, (fneg z))
> +; FUNC-LABEL: {{^}}combine_to_mad_fsub_0_f32:
> +; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> +; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> +; SI: v_mad_f32 [[RESULT:v[0-9]+]], [[A]], [[B]], -[[C]]
> +; SI: buffer_store_dword [[RESULT]]
> +define void @combine_to_mad_fsub_0_f32(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
> +  %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> +  %gep.0 = getelementptr float addrspace(1)* %in, i32 %tid
> +  %gep.1 = getelementptr float addrspace(1)* %gep.0, i32 1
> +  %gep.2 = getelementptr float addrspace(1)* %gep.0, i32 2
> +  %gep.out = getelementptr float addrspace(1)* %out, i32 %tid
> +
> +  %a = load float addrspace(1)* %gep.0
> +  %b = load float addrspace(1)* %gep.1
> +  %c = load float addrspace(1)* %gep.2
> +
> +  %mul = fmul float %a, %b
> +  %fma = fsub float %mul, %c
> +  store float %fma, float addrspace(1)* %gep.out
> +  ret void
> +}
> +
> +; (fsub (fmul x, y), z) -> (fma x, y, (fneg z))
> +; FUNC-LABEL: {{^}}combine_to_mad_fsub_0_f32_2use:
> +; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> +; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> +; SI-DAG: buffer_load_dword [[D:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:12{{$}}
> +; SI-DAG: v_mad_f32 [[RESULT0:v[0-9]+]], [[A]], [[B]], -[[C]]
> +; SI-DAG: v_mad_f32 [[RESULT1:v[0-9]+]], [[A]], [[B]], -[[D]]
> +; SI-DAG: buffer_store_dword [[RESULT0]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI-DAG: buffer_store_dword [[RESULT1]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> +; SI: s_endpgm
> +define void @combine_to_mad_fsub_0_f32_2use(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
> +  %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> +  %gep.0 = getelementptr float addrspace(1)* %in, i32 %tid
> +  %gep.1 = getelementptr float addrspace(1)* %gep.0, i32 1
> +  %gep.2 = getelementptr float addrspace(1)* %gep.0, i32 2
> +  %gep.3 = getelementptr float addrspace(1)* %gep.0, i32 3
> +  %gep.out.0 = getelementptr float addrspace(1)* %out, i32 %tid
> +  %gep.out.1 = getelementptr float addrspace(1)* %gep.out.0, i32 1
> +
> +  %a = load float addrspace(1)* %gep.0
> +  %b = load float addrspace(1)* %gep.1
> +  %c = load float addrspace(1)* %gep.2
> +  %d = load float addrspace(1)* %gep.3
> +
> +  %mul = fmul float %a, %b
> +  %fma0 = fsub float %mul, %c
> +  %fma1 = fsub float %mul, %d
> +  store float %fma0, float addrspace(1)* %gep.out.0
> +  store float %fma1, float addrspace(1)* %gep.out.1
> +  ret void
> +}
> +
> +; (fsub x, (fmul y, z)) -> (fma (fneg y), z, x)
> +; FUNC-LABEL: {{^}}combine_to_mad_fsub_1_f32:
> +; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> +; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> +; SI: v_mad_f32 [[RESULT:v[0-9]+]], -[[A]], [[B]], [[C]]
> +; SI: buffer_store_dword [[RESULT]]
> +define void @combine_to_mad_fsub_1_f32(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
> +  %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> +  %gep.0 = getelementptr float addrspace(1)* %in, i32 %tid
> +  %gep.1 = getelementptr float addrspace(1)* %gep.0, i32 1
> +  %gep.2 = getelementptr float addrspace(1)* %gep.0, i32 2
> +  %gep.out = getelementptr float addrspace(1)* %out, i32 %tid
> +
> +  %a = load float addrspace(1)* %gep.0
> +  %b = load float addrspace(1)* %gep.1
> +  %c = load float addrspace(1)* %gep.2
> +
> +  %mul = fmul float %a, %b
> +  %fma = fsub float %c, %mul
> +  store float %fma, float addrspace(1)* %gep.out
> +  ret void
> +}
> +
> +; (fsub x, (fmul y, z)) -> (fma (fneg y), z, x)
> +; FUNC-LABEL: {{^}}combine_to_mad_fsub_1_f32_2use:
> +; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> +; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> +; SI-DAG: buffer_load_dword [[D:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:12{{$}}
> +; SI-DAG: v_mad_f32 [[RESULT0:v[0-9]+]], -[[A]], [[B]], [[C]]
> +; SI-DAG: v_mad_f32 [[RESULT1:v[0-9]+]], -[[A]], [[B]], [[D]]
> +; SI-DAG: buffer_store_dword [[RESULT0]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI-DAG: buffer_store_dword [[RESULT1]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> +; SI: s_endpgm
> +define void @combine_to_mad_fsub_1_f32_2use(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
> +  %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> +  %gep.0 = getelementptr float addrspace(1)* %in, i32 %tid
> +  %gep.1 = getelementptr float addrspace(1)* %gep.0, i32 1
> +  %gep.2 = getelementptr float addrspace(1)* %gep.0, i32 2
> +  %gep.3 = getelementptr float addrspace(1)* %gep.0, i32 3
> +  %gep.out.0 = getelementptr float addrspace(1)* %out, i32 %tid
> +  %gep.out.1 = getelementptr float addrspace(1)* %gep.out.0, i32 1
> +
> +  %a = load float addrspace(1)* %gep.0
> +  %b = load float addrspace(1)* %gep.1
> +  %c = load float addrspace(1)* %gep.2
> +  %d = load float addrspace(1)* %gep.3
> +
> +  %mul = fmul float %a, %b
> +  %fma0 = fsub float %c, %mul
> +  %fma1 = fsub float %d, %mul
> +  store float %fma0, float addrspace(1)* %gep.out.0
> +  store float %fma1, float addrspace(1)* %gep.out.1
> +  ret void
> +}
> +
> +; (fsub (fneg (fmul x, y)), z) -> (fma (fneg x), y, (fneg z))
> +; FUNC-LABEL: {{^}}combine_to_mad_fsub_2_f32:
> +; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> +; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> +; SI: v_mad_f32 [[RESULT:v[0-9]+]], -[[A]], [[B]], -[[C]]
> +; SI: buffer_store_dword [[RESULT]]
> +define void @combine_to_mad_fsub_2_f32(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
> +  %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> +  %gep.0 = getelementptr float addrspace(1)* %in, i32 %tid
> +  %gep.1 = getelementptr float addrspace(1)* %gep.0, i32 1
> +  %gep.2 = getelementptr float addrspace(1)* %gep.0, i32 2
> +  %gep.out = getelementptr float addrspace(1)* %out, i32 %tid
> +
> +  %a = load float addrspace(1)* %gep.0
> +  %b = load float addrspace(1)* %gep.1
> +  %c = load float addrspace(1)* %gep.2
> +
> +  %mul = fmul float %a, %b
> +  %mul.neg = fsub float -0.0, %mul
> +  %fma = fsub float %mul.neg, %c
> +
> +  store float %fma, float addrspace(1)* %gep.out
> +  ret void
> +}
> +
> +; (fsub (fneg (fmul x, y)), z) -> (fma (fneg x), y, (fneg z))
> +; FUNC-LABEL: {{^}}combine_to_mad_fsub_2_f32_2uses_neg:
> +; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> +; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> +; SI-DAG: buffer_load_dword [[D:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:12{{$}}
> +; SI-DAG: v_mad_f32 [[RESULT0:v[0-9]+]], -[[A]], [[B]], -[[C]]
> +; SI-DAG: v_mad_f32 [[RESULT1:v[0-9]+]], -[[A]], [[B]], -[[D]]
> +; SI-DAG: buffer_store_dword [[RESULT0]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI-DAG: buffer_store_dword [[RESULT1]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> +; SI: s_endpgm
> +define void @combine_to_mad_fsub_2_f32_2uses_neg(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
> +  %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> +  %gep.0 = getelementptr float addrspace(1)* %in, i32 %tid
> +  %gep.1 = getelementptr float addrspace(1)* %gep.0, i32 1
> +  %gep.2 = getelementptr float addrspace(1)* %gep.0, i32 2
> +  %gep.3 = getelementptr float addrspace(1)* %gep.0, i32 3
> +  %gep.out.0 = getelementptr float addrspace(1)* %out, i32 %tid
> +  %gep.out.1 = getelementptr float addrspace(1)* %gep.out.0, i32 1
> +
> +  %a = load float addrspace(1)* %gep.0
> +  %b = load float addrspace(1)* %gep.1
> +  %c = load float addrspace(1)* %gep.2
> +  %d = load float addrspace(1)* %gep.3
> +
> +  %mul = fmul float %a, %b
> +  %mul.neg = fsub float -0.0, %mul
> +  %fma0 = fsub float %mul.neg, %c
> +  %fma1 = fsub float %mul.neg, %d
> +
> +  store float %fma0, float addrspace(1)* %gep.out.0
> +  store float %fma1, float addrspace(1)* %gep.out.1
> +  ret void
> +}
> +
> +; (fsub (fneg (fmul x, y)), z) -> (fma (fneg x), y, (fneg z))
> +; FUNC-LABEL: {{^}}combine_to_mad_fsub_2_f32_2uses_mul:
> +; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> +; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> +; SI-DAG: buffer_load_dword [[D:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:12{{$}}
> +; SI-DAG: v_mad_f32 [[RESULT0:v[0-9]+]], -[[A]], [[B]], -[[C]]
> +; SI-DAG: v_mad_f32 [[RESULT1:v[0-9]+]], [[A]], [[B]], -[[D]]
> +; SI-DAG: buffer_store_dword [[RESULT0]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI-DAG: buffer_store_dword [[RESULT1]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> +; SI: s_endpgm
> +define void @combine_to_mad_fsub_2_f32_2uses_mul(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
> +  %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> +  %gep.0 = getelementptr float addrspace(1)* %in, i32 %tid
> +  %gep.1 = getelementptr float addrspace(1)* %gep.0, i32 1
> +  %gep.2 = getelementptr float addrspace(1)* %gep.0, i32 2
> +  %gep.3 = getelementptr float addrspace(1)* %gep.0, i32 3
> +  %gep.out.0 = getelementptr float addrspace(1)* %out, i32 %tid
> +  %gep.out.1 = getelementptr float addrspace(1)* %gep.out.0, i32 1
> +
> +  %a = load float addrspace(1)* %gep.0
> +  %b = load float addrspace(1)* %gep.1
> +  %c = load float addrspace(1)* %gep.2
> +  %d = load float addrspace(1)* %gep.3
> +
> +  %mul = fmul float %a, %b
> +  %mul.neg = fsub float -0.0, %mul
> +  %fma0 = fsub float %mul.neg, %c
> +  %fma1 = fsub float %mul, %d
> +
> +  store float %fma0, float addrspace(1)* %gep.out.0
> +  store float %fma1, float addrspace(1)* %gep.out.1
> +  ret void
> +}
> +
> +; fold (fsub (fma x, y, (fmul u, v)), z) -> (fma x, y, (fma u, v, (fneg z)))
> +
> +; FUNC-LABEL: {{^}}aggressive_combine_to_mad_fsub_0_f32:
> +; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> +; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> +; SI-DAG: buffer_load_dword [[D:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:12{{$}}
> +; SI-DAG: buffer_load_dword [[E:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:16{{$}}
> +; SI: v_mul_f32_e32 [[TMP0:v[0-9]+]], [[E]], [[D]]
> +; SI: v_fma_f32 [[TMP1:v[0-9]+]], [[A]], [[B]], [[TMP0]]
> +; SI: v_subrev_f32_e32 [[RESULT:v[0-9]+]], [[C]], [[TMP1]]
> +; SI: buffer_store_dword [[RESULT]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +define void @aggressive_combine_to_mad_fsub_0_f32(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
> +  %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> +  %gep.0 = getelementptr float addrspace(1)* %in, i32 %tid
> +  %gep.1 = getelementptr float addrspace(1)* %gep.0, i32 1
> +  %gep.2 = getelementptr float addrspace(1)* %gep.0, i32 2
> +  %gep.3 = getelementptr float addrspace(1)* %gep.0, i32 3
> +  %gep.4 = getelementptr float addrspace(1)* %gep.0, i32 4
> +  %gep.out = getelementptr float addrspace(1)* %out, i32 %tid
> +
> +  %x = load float addrspace(1)* %gep.0
> +  %y = load float addrspace(1)* %gep.1
> +  %z = load float addrspace(1)* %gep.2
> +  %u = load float addrspace(1)* %gep.3
> +  %v = load float addrspace(1)* %gep.4
> +
> +  %tmp0 = fmul float %u, %v
> +  %tmp1 = call float @llvm.fma.f32(float %x, float %y, float %tmp0) #0
> +  %tmp2 = fsub float %tmp1, %z
> +
> +  store float %tmp2, float addrspace(1)* %gep.out
> +  ret void
> +}
> +
> +; fold (fsub x, (fma y, z, (fmul u, v)))
> +;   -> (fma (fneg y), z, (fma (fneg u), v, x))
> +
> +; FUNC-LABEL: {{^}}aggressive_combine_to_mad_fsub_1_f32:
> +; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> +; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> +; SI-DAG: buffer_load_dword [[D:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:12{{$}}
> +; SI-DAG: buffer_load_dword [[E:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:16{{$}}
> +; SI: v_mul_f32_e32 [[TMP0:v[0-9]+]], [[E]], [[D]]
> +; SI: v_fma_f32 [[TMP1:v[0-9]+]], [[B]], [[C]], [[TMP0]]
> +; SI: v_subrev_f32_e32 [[RESULT:v[0-9]+]], [[TMP1]], [[A]]
> +; SI: buffer_store_dword [[RESULT]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI: s_endpgm
> +define void @aggressive_combine_to_mad_fsub_1_f32(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
> +  %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> +  %gep.0 = getelementptr float addrspace(1)* %in, i32 %tid
> +  %gep.1 = getelementptr float addrspace(1)* %gep.0, i32 1
> +  %gep.2 = getelementptr float addrspace(1)* %gep.0, i32 2
> +  %gep.3 = getelementptr float addrspace(1)* %gep.0, i32 3
> +  %gep.4 = getelementptr float addrspace(1)* %gep.0, i32 4
> +  %gep.out = getelementptr float addrspace(1)* %out, i32 %tid
> +
> +  %x = load float addrspace(1)* %gep.0
> +  %y = load float addrspace(1)* %gep.1
> +  %z = load float addrspace(1)* %gep.2
> +  %u = load float addrspace(1)* %gep.3
> +  %v = load float addrspace(1)* %gep.4
> +
> +  %tmp0 = fmul float %u, %v
> +  %tmp1 = call float @llvm.fma.f32(float %y, float %z, float %tmp0) #0
> +  %tmp2 = fsub float %x, %tmp1
> +
> +  store float %tmp2, float addrspace(1)* %gep.out
> +  ret void
> +}
> +
> +; fold (fsub (fma x, y, (fmul u, v)), z) -> (fma x, y, (fma u, v, (fneg z)))
> +
> +; FUNC-LABEL: {{^}}aggressive_combine_to_mad_fsub_2_f32:
> +; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> +; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> +; SI-DAG: buffer_load_dword [[D:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:12{{$}}
> +; SI-DAG: buffer_load_dword [[E:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:16{{$}}
> +; SI-DAG: v_mad_f32 [[TMP:v[0-9]+]], [[D]], [[E]], -[[C]]
> +; SI-DAG: v_mad_f32 [[RESULT:v[0-9]+]], [[A]], [[B]], [[TMP]]
> +; SI-DAG: buffer_store_dword [[RESULT]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI: s_endpgm
> +define void @aggressive_combine_to_mad_fsub_2_f32(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
> +  %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> +  %gep.0 = getelementptr float addrspace(1)* %in, i32 %tid
> +  %gep.1 = getelementptr float addrspace(1)* %gep.0, i32 1
> +  %gep.2 = getelementptr float addrspace(1)* %gep.0, i32 2
> +  %gep.3 = getelementptr float addrspace(1)* %gep.0, i32 3
> +  %gep.4 = getelementptr float addrspace(1)* %gep.0, i32 4
> +  %gep.out = getelementptr float addrspace(1)* %out, i32 %tid
> +
> +  %x = load float addrspace(1)* %gep.0
> +  %y = load float addrspace(1)* %gep.1
> +  %z = load float addrspace(1)* %gep.2
> +  %u = load float addrspace(1)* %gep.3
> +  %v = load float addrspace(1)* %gep.4
> +
> +  %tmp0 = fmul float %u, %v
> +  %tmp1 = call float @llvm.fmuladd.f32(float %x, float %y, float %tmp0) #0
> +  %tmp2 = fsub float %tmp1, %z
> +
> +  store float %tmp2, float addrspace(1)* %gep.out
> +  ret void
> +}
> +
> +; fold (fsub x, (fmuladd y, z, (fmul u, v)))
> +;   -> (fmuladd (fneg y), z, (fmuladd (fneg u), v, x))
> +
> +; FUNC-LABEL: {{^}}aggressive_combine_to_mad_fsub_3_f32:
> +; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> +; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> +; SI-DAG: buffer_load_dword [[D:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:12{{$}}
> +; SI-DAG: buffer_load_dword [[E:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:16{{$}}
> +; SI-DAG: v_mad_f32 [[TMP:v[0-9]+]], -[[D]], [[E]], [[A]]
> +; SI-DAG: v_mad_f32 [[RESULT:v[0-9]+]], -[[B]], [[C]], [[TMP]]
> +; SI-DAG: buffer_store_dword [[RESULT]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI: s_endpgm
> +define void @aggressive_combine_to_mad_fsub_3_f32(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
> +  %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> +  %gep.0 = getelementptr float addrspace(1)* %in, i32 %tid
> +  %gep.1 = getelementptr float addrspace(1)* %gep.0, i32 1
> +  %gep.2 = getelementptr float addrspace(1)* %gep.0, i32 2
> +  %gep.3 = getelementptr float addrspace(1)* %gep.0, i32 3
> +  %gep.4 = getelementptr float addrspace(1)* %gep.0, i32 4
> +  %gep.out = getelementptr float addrspace(1)* %out, i32 %tid
> +
> +  %x = load float addrspace(1)* %gep.0
> +  %y = load float addrspace(1)* %gep.1
> +  %z = load float addrspace(1)* %gep.2
> +  %u = load float addrspace(1)* %gep.3
> +  %v = load float addrspace(1)* %gep.4
> +
> +  %tmp0 = fmul float %u, %v
> +  %tmp1 = call float @llvm.fmuladd.f32(float %y, float %z, float %tmp0) #0
> +  %tmp2 = fsub float %x, %tmp1
> +
> +  store float %tmp2, float addrspace(1)* %gep.out
> +  ret void
> +}
> +
> +attributes #0 = { nounwind readnone }
> +attributes #1 = { nounwind }
> -- 
> 2.2.1
> 

> From b4f466d606626343ca30a9a8daec35bd61362027 Mon Sep 17 00:00:00 2001
> From: Matt Arsenault <Matthew.Arsenault at amd.com>
> Date: Thu, 22 Jan 2015 18:41:24 -0800
> Subject: [PATCH 5/5] R600/SI: Only form v_mad_f32 without denormals
> 
> According to some sources, v_mad_f32 does not support them.

Do we ever have denormals enabled?
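
(For what it's worth, the new RUN lines below do turn them on explicitly
via the subtarget feature, e.g.:

  llc -march=amdgcn -mcpu=tahiti -mattr=+fp32-denormals -fp-contract=fast

so both modes end up with test coverage.)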

> ---
>  lib/Target/R600/AMDGPUISelLowering.cpp |   8 ++
>  lib/Target/R600/SIISelLowering.cpp     |  16 ++-
>  test/CodeGen/R600/mad-combine.ll       | 183 +++++++++++++++++++++++++++------
>  3 files changed, 173 insertions(+), 34 deletions(-)
> 
> diff --git a/lib/Target/R600/AMDGPUISelLowering.cpp b/lib/Target/R600/AMDGPUISelLowering.cpp
> index f3769e3..6f7d6e3 100644
> --- a/lib/Target/R600/AMDGPUISelLowering.cpp
> +++ b/lib/Target/R600/AMDGPUISelLowering.cpp
> @@ -2426,6 +2426,10 @@ SDValue AMDGPUTargetLowering::performMulCombine(SDNode *N,
>  // We can form f32 mads as long as denormals are not requested.
>  SDValue AMDGPUTargetLowering::performFAddCombine(SDNode *N,
>                                                   DAGCombinerInfo &DCI) const {
> +  // v_mad_f32 does not support denormals.
> +  if (Subtarget->hasFP32Denormals())
> +    return SDValue();
> +
>    EVT VT = N->getValueType(0);
>  
>    if (VT != MVT::f32) // There is no mad instruction for f64.
> @@ -2478,6 +2482,10 @@ SDValue AMDGPUTargetLowering::performFAddCombine(SDNode *N,
>  // FIXME: Mostly copied directly from generic FMA combines.
>  SDValue AMDGPUTargetLowering::performFSubCombine(SDNode *N,
>                                                   DAGCombinerInfo &DCI) const {
> +  // v_mad_f32 does not support denormals.
> +  if (Subtarget->hasFP32Denormals())
> +    return SDValue();
> +
>    EVT VT = N->getValueType(0);
>  
>    if (VT != MVT::f32) // There is no mad instruction for f64.
> diff --git a/lib/Target/R600/SIISelLowering.cpp b/lib/Target/R600/SIISelLowering.cpp
> index 6dc97ea..ce803e8 100644
> --- a/lib/Target/R600/SIISelLowering.cpp
> +++ b/lib/Target/R600/SIISelLowering.cpp
> @@ -672,8 +672,9 @@ bool SITargetLowering::isFMAFasterThanFMulAndFAdd(EVT VT) const {
>    case MVT::f32:
>      // This is as fast on some subtargets. However, we always have full rate f32
>      // mad available which returns the same result as the separate operations
> -    // which we should prefer over fma.
> -    return false;
> +    // which we should prefer over fma. We can't use mad if we want to support
> +    // denormals, so report fma as faster only in that case.
> +    return Subtarget->hasFP32Denormals() && Subtarget->hasFastFMAF32();
>    case MVT::f64:
>      return true;
>    default:
> @@ -1579,6 +1580,11 @@ SDValue SITargetLowering::PerformDAGCombine(SDNode *N,
>      if (VT != MVT::f32)
>        break;
>  
> +    // Only do this if we are not trying to support denormals; v_mad_f32 never
> +    // supports denormals.
> +    if (Subtarget->hasFP32Denormals())
> +      break;
> +
>      SDValue LHS = N->getOperand(0);
>      SDValue RHS = N->getOperand(1);
>  
> @@ -1613,7 +1619,11 @@ SDValue SITargetLowering::PerformDAGCombine(SDNode *N,
>  
>      // Try to get the fneg to fold into the source modifier. This undoes generic
>      // DAG combines and folds them into the mad.
> -    if (VT == MVT::f32) {
> +    //
> +    // Only do this if we are not trying to support denormals; v_mad_f32 never
> +    // supports denormals.
> +    if (VT == MVT::f32 &&
> +        !Subtarget->hasFP32Denormals()) {
>        SDValue LHS = N->getOperand(0);
>        SDValue RHS = N->getOperand(1);
>        if (LHS.getOpcode() == ISD::FADD) {
> diff --git a/test/CodeGen/R600/mad-combine.ll b/test/CodeGen/R600/mad-combine.ll
> index b116b2c..8c4e09b 100644
> --- a/test/CodeGen/R600/mad-combine.ll
> +++ b/test/CodeGen/R600/mad-combine.ll
> @@ -1,9 +1,12 @@
> -; RUN: llc -march=amdgcn -mcpu=tahiti -verify-machineinstrs < %s | FileCheck -check-prefix=SI -check-prefix=FUNC %s
> -; RUN: llc -march=amdgcn -mcpu=tahiti -verify-machineinstrs -fp-contract=fast < %s | FileCheck -check-prefix=SI -check-prefix=FUNC %s
> -; RUN: llc -march=amdgcn -mcpu=tahiti -verify-machineinstrs -enable-unsafe-fp-math < %s | FileCheck -check-prefix=SI -check-prefix=FUNC %s
> -
>  ; Make sure we still form mad, instead of fma, even when unsafe math or fp-contract is allowed.
>  
> +; RUN: llc -march=amdgcn -mcpu=tahiti -verify-machineinstrs < %s | FileCheck -check-prefix=SI -check-prefix=SI-STD -check-prefix=FUNC %s
> +; RUN: llc -march=amdgcn -mcpu=tahiti -verify-machineinstrs -fp-contract=fast < %s | FileCheck -check-prefix=SI -check-prefix=SI-STD -check-prefix=FUNC %s
> +; RUN: llc -march=amdgcn -mcpu=tahiti -verify-machineinstrs -enable-unsafe-fp-math < %s | FileCheck -check-prefix=SI -check-prefix=SI-STD -check-prefix=FUNC %s
> +
> +; Make sure we don't form mad when denormals are requested
> +; RUN: llc -march=amdgcn -mcpu=tahiti -mattr=+fp32-denormals -fp-contract=fast -verify-machineinstrs < %s | FileCheck -check-prefix=SI -check-prefix=SI-DENORM -check-prefix=FUNC %s
> +; RUN: llc -march=amdgcn -mcpu=verde -mattr=+fp32-denormals -fp-contract=fast -verify-machineinstrs < %s | FileCheck -check-prefix=SI -check-prefix=SI-DENORM-SLOWFMAF -check-prefix=FUNC %s
>  
>  declare i32 @llvm.r600.read.tidig.x() #0
>  declare float @llvm.fabs.f32(float) #0
> @@ -15,7 +18,17 @@ declare float @llvm.fmuladd.f32(float, float, float) #0
>  ; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
>  ; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
>  ; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> -; SI: v_mad_f32 [[RESULT:v[0-9]+]], [[A]], [[B]], [[C]]
> +
> +; SI-STD: v_mad_f32 [[RESULT:v[0-9]+]], [[A]], [[B]], [[C]]
> +
> +; SI-DENORM: v_fma_f32 [[RESULT:v[0-9]+]], [[A]], [[B]], [[C]]
> +
> +; SI-DENORM-SLOWFMAF-NOT: v_fma
> +; SI-DENORM-SLOWFMAF-NOT: v_mad
> +
> +; SI-DENORM-SLOWFMAF: v_mul_f32_e32 [[TMP:v[0-9]+]], [[B]], [[A]]
> +; SI-DENORM-SLOWFMAF: v_add_f32_e32 [[RESULT:v[0-9]+]], [[C]], [[TMP]]
> +
>  ; SI: buffer_store_dword [[RESULT]]
>  define void @combine_to_mad_f32_0(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
>    %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> @@ -40,8 +53,17 @@ define void @combine_to_mad_f32_0(float addrspace(1)* noalias %out, float addrsp
>  ; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
>  ; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
>  ; SI-DAG: buffer_load_dword [[D:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:12{{$}}
> -; SI-DAG: v_mad_f32 [[RESULT0:v[0-9]+]], [[A]], [[B]], [[C]]
> -; SI-DAG: v_mad_f32 [[RESULT1:v[0-9]+]], [[A]], [[B]], [[D]]
> +
> +; SI-STD-DAG: v_mad_f32 [[RESULT0:v[0-9]+]], [[A]], [[B]], [[C]]
> +; SI-STD-DAG: v_mad_f32 [[RESULT1:v[0-9]+]], [[A]], [[B]], [[D]]
> +
> +; SI-DENORM-DAG: v_fma_f32 [[RESULT0:v[0-9]+]], [[A]], [[B]], [[C]]
> +; SI-DENORM-DAG: v_fma_f32 [[RESULT1:v[0-9]+]], [[A]], [[B]], [[D]]
> +
> +; SI-DENORM-SLOWFMAF: v_mul_f32_e32 [[TMP:v[0-9]+]], [[B]], [[A]]
> +; SI-DENORM-SLOWFMAF-DAG: v_add_f32_e32 [[RESULT0:v[0-9]+]], [[C]], [[TMP]]
> +; SI-DENORM-SLOWFMAF-DAG: v_add_f32_e32 [[RESULT1:v[0-9]+]], [[D]], [[TMP]]
> +
>  ; SI-DAG: buffer_store_dword [[RESULT0]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
>  ; SI-DAG: buffer_store_dword [[RESULT1]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
>  ; SI: s_endpgm
> @@ -73,7 +95,13 @@ define void @combine_to_mad_f32_0_2use(float addrspace(1)* noalias %out, float a
>  ; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
>  ; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
>  ; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> -; SI: v_mad_f32 [[RESULT:v[0-9]+]], [[A]], [[B]], [[C]]
> +
> +; SI-STD: v_mad_f32 [[RESULT:v[0-9]+]], [[A]], [[B]], [[C]]
> +; SI-DENORM: v_fma_f32 [[RESULT:v[0-9]+]], [[A]], [[B]], [[C]]
> +
> +; SI-DENORM-SLOWFMAF: v_mul_f32_e32 [[TMP:v[0-9]+]], [[B]], [[A]]
> +; SI-DENORM-SLOWFMAF: v_add_f32_e32 [[RESULT:v[0-9]+]], [[TMP]], [[C]]
> +
>  ; SI: buffer_store_dword [[RESULT]]
>  define void @combine_to_mad_f32_1(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
>    %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> @@ -97,7 +125,13 @@ define void @combine_to_mad_f32_1(float addrspace(1)* noalias %out, float addrsp
>  ; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
>  ; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
>  ; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> -; SI: v_mad_f32 [[RESULT:v[0-9]+]], [[A]], [[B]], -[[C]]
> +
> +; SI-STD: v_mad_f32 [[RESULT:v[0-9]+]], [[A]], [[B]], -[[C]]
> +; SI-DENORM: v_fma_f32 [[RESULT:v[0-9]+]], [[A]], [[B]], -[[C]]
> +
> +; SI-DENORM-SLOWFMAF: v_mul_f32_e32 [[TMP:v[0-9]+]], [[B]], [[A]]
> +; SI-DENORM-SLOWFMAF: v_subrev_f32_e32 [[RESULT:v[0-9]+]], [[C]], [[TMP]]
> +
>  ; SI: buffer_store_dword [[RESULT]]
>  define void @combine_to_mad_fsub_0_f32(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
>    %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> @@ -122,8 +156,17 @@ define void @combine_to_mad_fsub_0_f32(float addrspace(1)* noalias %out, float a
>  ; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
>  ; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
>  ; SI-DAG: buffer_load_dword [[D:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:12{{$}}
> -; SI-DAG: v_mad_f32 [[RESULT0:v[0-9]+]], [[A]], [[B]], -[[C]]
> -; SI-DAG: v_mad_f32 [[RESULT1:v[0-9]+]], [[A]], [[B]], -[[D]]
> +
> +; SI-STD-DAG: v_mad_f32 [[RESULT0:v[0-9]+]], [[A]], [[B]], -[[C]]
> +; SI-STD-DAG: v_mad_f32 [[RESULT1:v[0-9]+]], [[A]], [[B]], -[[D]]
> +
> +; SI-DENORM-DAG: v_fma_f32 [[RESULT0:v[0-9]+]], [[A]], [[B]], -[[C]]
> +; SI-DENORM-DAG: v_fma_f32 [[RESULT1:v[0-9]+]], [[A]], [[B]], -[[D]]
> +
> +; SI-DENORM-SLOWFMAF: v_mul_f32_e32 [[TMP:v[0-9]+]], [[B]], [[A]]
> +; SI-DENORM-SLOWFMAF-DAG: v_subrev_f32_e32 [[RESULT0:v[0-9]+]], [[C]], [[TMP]]
> +; SI-DENORM-SLOWFMAF-DAG: v_subrev_f32_e32 [[RESULT1:v[0-9]+]], [[D]], [[TMP]]
> +
>  ; SI-DAG: buffer_store_dword [[RESULT0]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
>  ; SI-DAG: buffer_store_dword [[RESULT1]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
>  ; SI: s_endpgm
> @@ -154,7 +197,13 @@ define void @combine_to_mad_fsub_0_f32_2use(float addrspace(1)* noalias %out, fl
>  ; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
>  ; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
>  ; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> -; SI: v_mad_f32 [[RESULT:v[0-9]+]], -[[A]], [[B]], [[C]]
> +
> +; SI-STD: v_mad_f32 [[RESULT:v[0-9]+]], -[[A]], [[B]], [[C]]
> +; SI-DENORM: v_fma_f32 [[RESULT:v[0-9]+]], -[[A]], [[B]], [[C]]
> +
> +; SI-DENORM-SLOWFMAF: v_mul_f32_e32 [[TMP:v[0-9]+]], [[B]], [[A]]
> +; SI-DENORM-SLOWFMAF: v_subrev_f32_e32 [[RESULT:v[0-9]+]], [[TMP]], [[C]]
> +
>  ; SI: buffer_store_dword [[RESULT]]
>  define void @combine_to_mad_fsub_1_f32(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
>    %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> @@ -178,8 +227,17 @@ define void @combine_to_mad_fsub_1_f32(float addrspace(1)* noalias %out, float a
>  ; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
>  ; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
>  ; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
>  ; SI-DAG: buffer_load_dword [[D:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:12{{$}}
> -; SI-DAG: v_mad_f32 [[RESULT0:v[0-9]+]], -[[A]], [[B]], [[C]]
> -; SI-DAG: v_mad_f32 [[RESULT1:v[0-9]+]], -[[A]], [[B]], [[D]]
> +
> +; SI-STD-DAG: v_mad_f32 [[RESULT0:v[0-9]+]], -[[A]], [[B]], [[C]]
> +; SI-STD-DAG: v_mad_f32 [[RESULT1:v[0-9]+]], -[[A]], [[B]], [[D]]
> +
> +; SI-DENORM-DAG: v_fma_f32 [[RESULT0:v[0-9]+]], -[[A]], [[B]], [[C]]
> +; SI-DENORM-DAG: v_fma_f32 [[RESULT1:v[0-9]+]], -[[A]], [[B]], [[D]]
> +
> +; SI-DENORM-SLOWFMAF: v_mul_f32_e32 [[TMP:v[0-9]+]], [[B]], [[A]]
> +; SI-DENORM-SLOWFMAF-DAG: v_subrev_f32_e32 [[RESULT0:v[0-9]+]], [[TMP]], [[C]]
> +; SI-DENORM-SLOWFMAF-DAG: v_subrev_f32_e32 [[RESULT1:v[0-9]+]], [[TMP]], [[D]]
> +
>  ; SI-DAG: buffer_store_dword [[RESULT0]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
>  ; SI-DAG: buffer_store_dword [[RESULT1]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
>  ; SI: s_endpgm
> @@ -210,7 +268,14 @@ define void @combine_to_mad_fsub_1_f32_2use(float addrspace(1)* noalias %out, fl
>  ; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
>  ; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
>  ; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> -; SI: v_mad_f32 [[RESULT:v[0-9]+]], -[[A]], [[B]], -[[C]]
> +
> +; SI-STD: v_mad_f32 [[RESULT:v[0-9]+]], -[[A]], [[B]], -[[C]]
> +
> +; SI-DENORM: v_fma_f32 [[RESULT:v[0-9]+]], -[[A]], [[B]], -[[C]]
> +
> +; SI-DENORM-SLOWFMAF: v_mul_f32_e32 [[TMP:v[0-9]+]], [[B]], [[A]]
> +; SI-DENORM-SLOWFMAF: v_sub_f32_e64 [[RESULT:v[0-9]+]], -[[TMP]], [[C]]
> +
>  ; SI: buffer_store_dword [[RESULT]]
>  define void @combine_to_mad_fsub_2_f32(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
>    %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> @@ -236,8 +301,17 @@ define void @combine_to_mad_fsub_2_f32(float addrspace(1)* noalias %out, float a
>  ; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
>  ; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
>  ; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
>  ; SI-DAG: buffer_load_dword [[D:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:12{{$}}
> -; SI-DAG: v_mad_f32 [[RESULT0:v[0-9]+]], -[[A]], [[B]], -[[C]]
> -; SI-DAG: v_mad_f32 [[RESULT1:v[0-9]+]], -[[A]], [[B]], -[[D]]
> +
> +; SI-STD-DAG: v_mad_f32 [[RESULT0:v[0-9]+]], -[[A]], [[B]], -[[C]]
> +; SI-STD-DAG: v_mad_f32 [[RESULT1:v[0-9]+]], -[[A]], [[B]], -[[D]]
> +
> +; SI-DENORM-DAG: v_fma_f32 [[RESULT0:v[0-9]+]], -[[A]], [[B]], -[[C]]
> +; SI-DENORM-DAG: v_fma_f32 [[RESULT1:v[0-9]+]], -[[A]], [[B]], -[[D]]
> +
> +; SI-DENORM-SLOWFMAF: v_mul_f32_e32 [[TMP:v[0-9]+]], [[B]], [[A]]
> +; SI-DENORM-SLOWFMAF-DAG: v_sub_f32_e64 [[RESULT0:v[0-9]+]], -[[TMP]], [[C]]
> +; SI-DENORM-SLOWFMAF-DAG: v_sub_f32_e64 [[RESULT1:v[0-9]+]], -[[TMP]], [[D]]
> +
>  ; SI-DAG: buffer_store_dword [[RESULT0]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
>  ; SI-DAG: buffer_store_dword [[RESULT1]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
>  ; SI: s_endpgm
> @@ -270,8 +344,17 @@ define void @combine_to_mad_fsub_2_f32_2uses_neg(float addrspace(1)* noalias %ou
>  ; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
>  ; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
>  ; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
>  ; SI-DAG: buffer_load_dword [[D:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:12{{$}}
> -; SI-DAG: v_mad_f32 [[RESULT0:v[0-9]+]], -[[A]], [[B]], -[[C]]
> -; SI-DAG: v_mad_f32 [[RESULT1:v[0-9]+]], [[A]], [[B]], -[[D]]
> +
> +; SI-STD-DAG: v_mad_f32 [[RESULT0:v[0-9]+]], -[[A]], [[B]], -[[C]]
> +; SI-STD-DAG: v_mad_f32 [[RESULT1:v[0-9]+]], [[A]], [[B]], -[[D]]
> +
> +; SI-DENORM-DAG: v_fma_f32 [[RESULT0:v[0-9]+]], -[[A]], [[B]], -[[C]]
> +; SI-DENORM-DAG: v_fma_f32 [[RESULT1:v[0-9]+]], [[A]], [[B]], -[[D]]
> +
> +; SI-DENORM-SLOWFMAF: v_mul_f32_e32 [[TMP:v[0-9]+]], [[B]], [[A]]
> +; SI-DENORM-SLOWFMAF-DAG: v_sub_f32_e64 [[RESULT0:v[0-9]+]], -[[TMP]], [[C]]
> +; SI-DENORM-SLOWFMAF-DAG: v_subrev_f32_e32 [[RESULT1:v[0-9]+]], [[D]], [[TMP]]
> +
>  ; SI-DAG: buffer_store_dword [[RESULT0]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
>  ; SI-DAG: buffer_store_dword [[RESULT1]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
>  ; SI: s_endpgm
> @@ -307,9 +390,18 @@ define void @combine_to_mad_fsub_2_f32_2uses_mul(float addrspace(1)* noalias %ou
>  ; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
>  ; SI-DAG: buffer_load_dword [[D:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:12{{$}}
>  ; SI-DAG: buffer_load_dword [[E:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:16{{$}}
> -; SI: v_mul_f32_e32 [[TMP0:v[0-9]+]], [[E]], [[D]]
> -; SI: v_fma_f32 [[TMP1:v[0-9]+]], [[A]], [[B]], [[TMP0]]
> -; SI: v_subrev_f32_e32 [[RESULT:v[0-9]+]], [[C]], [[TMP1]]
> +
> +; SI-STD: v_mul_f32_e32 [[TMP0:v[0-9]+]], [[E]], [[D]]
> +; SI-STD: v_fma_f32 [[TMP1:v[0-9]+]], [[A]], [[B]], [[TMP0]]
> +; SI-STD: v_subrev_f32_e32 [[RESULT:v[0-9]+]], [[C]], [[TMP1]]
> +
> +; SI-DENORM: v_fma_f32 [[TMP0:v[0-9]+]], [[D]], [[E]], -[[C]]
> +; SI-DENORM: v_fma_f32 [[RESULT:v[0-9]+]], [[A]], [[B]], [[TMP0]]
> +
> +; SI-DENORM-SLOWFMAF: v_mul_f32_e32 [[TMP0:v[0-9]+]], [[E]], [[D]]
> +; SI-DENORM-SLOWFMAF: v_fma_f32 [[TMP1:v[0-9]+]], [[A]], [[B]], [[TMP0]]
> +; SI-DENORM-SLOWFMAF: v_subrev_f32_e32 [[RESULT:v[0-9]+]], [[C]], [[TMP1]]
> +
>  ; SI: buffer_store_dword [[RESULT]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
>  define void @aggressive_combine_to_mad_fsub_0_f32(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
>    %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> @@ -343,9 +435,18 @@ define void @aggressive_combine_to_mad_fsub_0_f32(float addrspace(1)* noalias %o
>  ; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
>  ; SI-DAG: buffer_load_dword [[D:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:12{{$}}
>  ; SI-DAG: buffer_load_dword [[E:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:16{{$}}
> -; SI: v_mul_f32_e32 [[TMP0:v[0-9]+]], [[E]], [[D]]
> -; SI: v_fma_f32 [[TMP1:v[0-9]+]], [[B]], [[C]], [[TMP0]]
> -; SI: v_subrev_f32_e32 [[RESULT:v[0-9]+]], [[TMP1]], [[A]]
> +
> +; SI-STD: v_mul_f32_e32 [[TMP0:v[0-9]+]], [[E]], [[D]]
> +; SI-STD: v_fma_f32 [[TMP1:v[0-9]+]], [[B]], [[C]], [[TMP0]]
> +; SI-STD: v_subrev_f32_e32 [[RESULT:v[0-9]+]], [[TMP1]], [[A]]
> +
> +; SI-DENORM: v_fma_f32 [[TMP0:v[0-9]+]], -[[D]], [[E]], [[A]]
> +; SI-DENORM: v_fma_f32 [[RESULT:v[0-9]+]], -[[B]], [[C]], [[TMP0]]
> +
> +; SI-DENORM-SLOWFMAF: v_mul_f32_e32 [[TMP0:v[0-9]+]], [[E]], [[D]]
> +; SI-DENORM-SLOWFMAF: v_fma_f32 [[TMP1:v[0-9]+]], [[B]], [[C]], [[TMP0]]
> +; SI-DENORM-SLOWFMAF: v_subrev_f32_e32 [[RESULT:v[0-9]+]], [[TMP1]], [[A]]
> +
>  ; SI: buffer_store_dword [[RESULT]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
>  ; SI: s_endpgm
>  define void @aggressive_combine_to_mad_fsub_1_f32(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
> @@ -379,9 +480,19 @@ define void @aggressive_combine_to_mad_fsub_1_f32(float addrspace(1)* noalias %o
>  ; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
>  ; SI-DAG: buffer_load_dword [[D:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:12{{$}}
>  ; SI-DAG: buffer_load_dword [[E:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:16{{$}}
> -; SI-DAG: v_mad_f32 [[TMP:v[0-9]+]], [[D]], [[E]], -[[C]]
> -; SI-DAG: v_mad_f32 [[RESULT:v[0-9]+]], [[A]], [[B]], [[TMP]]
> -; SI-DAG: buffer_store_dword [[RESULT]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +
> +; SI-STD: v_mad_f32 [[TMP:v[0-9]+]], [[D]], [[E]], -[[C]]
> +; SI-STD: v_mad_f32 [[RESULT:v[0-9]+]], [[A]], [[B]], [[TMP]]
> +
> +; SI-DENORM: v_fma_f32 [[TMP:v[0-9]+]], [[D]], [[E]], -[[C]]
> +; SI-DENORM: v_fma_f32 [[RESULT:v[0-9]+]], [[A]], [[B]], [[TMP]]
> +
> +; SI-DENORM-SLOWFMAF: v_mul_f32_e32 [[TMP0:v[0-9]+]], [[E]], [[D]]
> +; SI-DENORM-SLOWFMAF: v_mul_f32_e32 [[TMP1:v[0-9]+]], [[B]], [[A]]
> +; SI-DENORM-SLOWFMAF: v_add_f32_e32 [[TMP2:v[0-9]+]], [[TMP0]], [[TMP1]]
> +; SI-DENORM-SLOWFMAF: v_subrev_f32_e32 [[RESULT:v[0-9]+]], [[C]], [[TMP2]]
> +
> +; SI: buffer_store_dword [[RESULT]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
>  ; SI: s_endpgm
>  define void @aggressive_combine_to_mad_fsub_2_f32(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
>    %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> @@ -415,9 +526,19 @@ define void @aggressive_combine_to_mad_fsub_2_f32(float addrspace(1)* noalias %o
>  ; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
>  ; SI-DAG: buffer_load_dword [[D:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:12{{$}}
>  ; SI-DAG: buffer_load_dword [[E:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:16{{$}}
> -; SI-DAG: v_mad_f32 [[TMP:v[0-9]+]], -[[D]], [[E]], [[A]]
> -; SI-DAG: v_mad_f32 [[RESULT:v[0-9]+]], -[[B]], [[C]], [[TMP]]
> -; SI-DAG: buffer_store_dword [[RESULT]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +
> +; SI-STD: v_mad_f32 [[TMP:v[0-9]+]], -[[D]], [[E]], [[A]]
> +; SI-STD: v_mad_f32 [[RESULT:v[0-9]+]], -[[B]], [[C]], [[TMP]]
> +
> +; SI-DENORM: v_fma_f32 [[TMP:v[0-9]+]], -[[D]], [[E]], [[A]]
> +; SI-DENORM: v_fma_f32 [[RESULT:v[0-9]+]], -[[B]], [[C]], [[TMP]]
> +
> +; SI-DENORM-SLOWFMAF: v_mul_f32_e32 [[TMP0:v[0-9]+]], [[E]], [[D]]
> +; SI-DENORM-SLOWFMAF: v_mul_f32_e32 [[TMP1:v[0-9]+]], [[C]], [[B]]
> +; SI-DENORM-SLOWFMAF: v_add_f32_e32 [[TMP2:v[0-9]+]], [[TMP0]], [[TMP1]]
> +; SI-DENORM-SLOWFMAF: v_subrev_f32_e32 [[RESULT:v[0-9]+]], [[TMP2]], [[A]]
> +
> +; SI: buffer_store_dword [[RESULT]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
>  ; SI: s_endpgm
>  define void @aggressive_combine_to_mad_fsub_3_f32(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
>    %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> -- 
> 2.2.1
> 

> _______________________________________________
> llvm-commits mailing list
> llvm-commits at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits



