[PATCH] R600/SI: Aggressively fold fma and mad
Tom Stellard
tom at stellard.net
Wed Jan 28 10:11:17 PST 2015
On Mon, Jan 26, 2015 at 12:59:24PM -0800, Matt Arsenault wrote:
> Hi,
>
> These allow using fma and mad instructions in more situations, and fix incorrectly using v_mad_f32 when denormals are requested.
>
>
> From 0e1e5e441862f01b39205657f5eb1decbd836df1 Mon Sep 17 00:00:00 2001
> From: Matt Arsenault <Matthew.Arsenault at amd.com>
> Date: Thu, 22 Jan 2015 18:40:47 -0800
> Subject: [PATCH 1/5] R600/SI: Fix tonga's basic scheduling model
>
> ---
> lib/Target/R600/Processors.td | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
LGTM.
> From 63ff27d46e77352e21aed2055d7d087cbbbd4246 Mon Sep 17 00:00:00 2001
> From: Matt Arsenault <Matthew.Arsenault at amd.com>
> Date: Thu, 22 Jan 2015 18:39:58 -0800
> Subject: [PATCH 2/5] R600/SI: Add subtarget feature for if f32 fma is fast
>
> ---
> lib/Target/R600/AMDGPU.td | 6 ++++++
> lib/Target/R600/AMDGPUSubtarget.cpp | 3 ++-
> lib/Target/R600/AMDGPUSubtarget.h | 5 +++++
> lib/Target/R600/Processors.td | 12 +++++++++---
> lib/Target/R600/SIISelLowering.cpp | 2 +-
> 5 files changed, 23 insertions(+), 5 deletions(-)
LGTM.
> From 6d5416b90a73b55aeb2fbee5f8426c01b1c876cc Mon Sep 17 00:00:00 2001
> From: Matt Arsenault <Matthew.Arsenault at amd.com>
> Date: Wed, 21 Jan 2015 17:29:48 -0800
> Subject: [PATCH 3/5] R600/SI: Implement enableAggressiveFMAFusion
>
> Add tests for the various combines. This should
> always be at least cycle neutral on all subtargets for f64,
> and faster on some. For f32 we should prefer selecting
> v_mad_f32 over v_fma_f32.
> ---
> lib/Target/R600/SIISelLowering.cpp | 31 +++-
> lib/Target/R600/SIISelLowering.h | 1 +
> test/CodeGen/R600/fma-combine.ll | 368 +++++++++++++++++++++++++++++++++++++
> 3 files changed, 399 insertions(+), 1 deletion(-)
> create mode 100644 test/CodeGen/R600/fma-combine.ll
>
LGTM.
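For anyone following the fma/mad distinction in this series: a fused multiply-add rounds once, while a separate multiply and add round twice, so the two can give different answers. Below is a minimal host-side C++ sketch of that difference. It is not GPU code and does not model v_mad_f32 or v_fma_f32 themselves; the function names and input values are chosen only for illustration.

```cpp
#include <cassert>
#include <cmath>

// Multiply, round to float, then add and round again.
// The volatile store keeps the compiler from contracting this into an fma.
float mul_then_add(float a, float b, float c) {
    volatile float product = a * b;  // first rounding happens here
    return product + c;              // second rounding happens here
}

// Fused multiply-add: a * b + c is formed exactly and rounded once.
float fused(float a, float b, float c) {
    return std::fma(a, b, c);
}

// Illustrative inputs: a = 1 + 2^-12, so a * a = 1 + 2^-11 + 2^-24 exactly.
// That value is a tie in float and rounds to 1 + 2^-11, losing the 2^-24 term.
void demo() {
    const float a = 1.0f + std::ldexp(1.0f, -12);
    const float c = -(1.0f + std::ldexp(1.0f, -11));
    assert(mul_then_add(a, a, c) == 0.0f);               // low bit already lost
    assert(fused(a, a, c) == std::ldexp(1.0f, -24));     // low bit survives
}
```

Calling demo() from a small main is enough to check both cases on a host with IEEE single-precision floats.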
> From 19ce04e9893142edfa79078d9b5e9991a9cb7445 Mon Sep 17 00:00:00 2001
> From: Matt Arsenault <Matthew.Arsenault at amd.com>
> Date: Sun, 25 Jan 2015 12:56:18 -0800
> Subject: [PATCH 4/5] R600: Copy aggressive fma combines for mad
>
> v_mad_f32 has the same result as the separate add and
> multiply, and is always full rate, so we should always
> try to form these as long as we don't need to support
> denormals. I don't think there is a great way to share this
> code without adding a new generic mad node and a generic
> check for denormal support.
Could you add a TLI query something like:
SDValue mergeMulAdd(SDValue A, SDValue B, SDValue C)
so that the target could decide what opcode to use?
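A rough sketch of the shape such a hook could take, using toy types rather than the real SelectionDAG classes. Everything here (Node, TargetHooks, MadTarget, FmaTarget, combineFAdd) is hypothetical and only borrows the mergeMulAdd name from the suggestion above; it is not existing LLVM API.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy expression node; a stand-in for illustration, not LLVM's SDValue.
struct Node {
    std::string op;                           // "fadd", "fmul", "mad", "fma", or a leaf name
    std::vector<std::shared_ptr<Node>> args;  // operands; empty for leaves
};
using NodePtr = std::shared_ptr<Node>;

NodePtr make(std::string op, std::vector<NodePtr> args = {}) {
    return std::make_shared<Node>(Node{std::move(op), std::move(args)});
}

// Hypothetical target hook: the target decides which opcode to build.
struct TargetHooks {
    virtual ~TargetHooks() = default;
    virtual NodePtr mergeMulAdd(NodePtr a, NodePtr b, NodePtr c) const = 0;
};

// A target that prefers an unfused mad node.
struct MadTarget : TargetHooks {
    NodePtr mergeMulAdd(NodePtr a, NodePtr b, NodePtr c) const override {
        return make("mad", {a, b, c});
    }
};

// A target that prefers a fused fma node.
struct FmaTarget : TargetHooks {
    NodePtr mergeMulAdd(NodePtr a, NodePtr b, NodePtr c) const override {
        return make("fma", {a, b, c});
    }
};

// Generic combine: matches the pattern once, then defers to the target.
// Returns nullptr when nothing matches.
NodePtr combineFAdd(const NodePtr &n, const TargetHooks &tli) {
    if (n->op != "fadd" || n->args.size() != 2)
        return nullptr;
    const NodePtr &lhs = n->args[0];
    const NodePtr &rhs = n->args[1];
    // fold (fadd (fmul x, y), z) -> merge(x, y, z)
    if (lhs->op == "fmul" && lhs->args.size() == 2)
        return tli.mergeMulAdd(lhs->args[0], lhs->args[1], rhs);
    // fold (fadd x, (fmul y, z)) -> merge(y, z, x)
    if (rhs->op == "fmul" && rhs->args.size() == 2)
        return tli.mergeMulAdd(rhs->args[0], rhs->args[1], lhs);
    return nullptr;
}
```

With this shape the pattern matching lives in one place, and the same (fadd (fmul x, y), z) input becomes a "mad" node under MadTarget and an "fma" node under FmaTarget.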
> ---
> lib/Target/R600/AMDGPUISelLowering.cpp | 133 ++++++++++
> lib/Target/R600/AMDGPUISelLowering.h | 3 +
> lib/Target/R600/AMDGPUInstructions.td | 5 -
> lib/Target/R600/R600Instructions.td | 2 +-
> lib/Target/R600/SIISelLowering.cpp | 25 +-
> lib/Target/R600/SIInstructions.td | 5 +-
> test/CodeGen/R600/mad-combine.ll | 446 +++++++++++++++++++++++++++++++++
> 7 files changed, 587 insertions(+), 32 deletions(-)
> create mode 100644 test/CodeGen/R600/mad-combine.ll
>
> diff --git a/lib/Target/R600/AMDGPUISelLowering.cpp b/lib/Target/R600/AMDGPUISelLowering.cpp
> index d3897fe..f3769e3 100644
> --- a/lib/Target/R600/AMDGPUISelLowering.cpp
> +++ b/lib/Target/R600/AMDGPUISelLowering.cpp
> @@ -395,6 +395,9 @@ AMDGPUTargetLowering::AMDGPUTargetLowering(TargetMachine &TM) :
> setTargetDAGCombine(ISD::SELECT_CC);
> setTargetDAGCombine(ISD::STORE);
>
> + setTargetDAGCombine(ISD::FADD);
> + setTargetDAGCombine(ISD::FSUB);
> +
> setBooleanContents(ZeroOrNegativeOneBooleanContent);
> setBooleanVectorContents(ZeroOrNegativeOneBooleanContent);
>
> @@ -2419,6 +2422,128 @@ SDValue AMDGPUTargetLowering::performMulCombine(SDNode *N,
> return DAG.getSExtOrTrunc(Mul, DL, VT);
> }
>
> +// FIXME: Mostly copied directly from generic FMA combines.
> +// We can form f32 mads as long as denormals are not requested.
Do you
> +SDValue AMDGPUTargetLowering::performFAddCombine(SDNode *N,
> + DAGCombinerInfo &DCI) const {
> + EVT VT = N->getValueType(0);
> +
> + if (VT != MVT::f32) // There is no mad instruction for f64.
> + return SDValue();
> +
> + SelectionDAG &DAG = DCI.DAG;
> + SDLoc SL(N);
> +
> + SDValue N0 = N->getOperand(0);
> + SDValue N1 = N->getOperand(1);
> +
> + // fold (fadd (fmul x, y), z) -> (mad x, y, z)
> + if (N0.getOpcode() == ISD::FMUL) {
> + return DAG.getNode(AMDGPUISD::MAD, SL, VT,
> + N0.getOperand(0), N0.getOperand(1), N1);
> + }
> +
> + // fold (fadd x, (fmul y, z)) -> (mad y, z, x)
> + // Note: Commutes FADD operands.
> + if (N1.getOpcode() == ISD::FMUL) {
> + return DAG.getNode(AMDGPUISD::MAD, SL, VT,
> + N1.getOperand(0), N1.getOperand(1), N0);
> + }
> +
> + // fold (fadd (mad x, y, (fmul u, v)), z) -> (mad x, y, (mad u, v, z))
> + if (N0.getOpcode() == AMDGPUISD::MAD &&
> + N0.getOperand(2).getOpcode() == ISD::FMUL) {
> + return DAG.getNode(AMDGPUISD::MAD, SL, VT,
> + N0.getOperand(0), N0.getOperand(1),
> + DAG.getNode(AMDGPUISD::MAD, SL, VT,
> + N0.getOperand(2).getOperand(0),
> + N0.getOperand(2).getOperand(1),
> + N1));
> + }
> +
> + // fold (fadd x, (mad y, z, (fmul u, v))) -> (mad y, z, (mad u, v, x))
> + if (N1->getOpcode() == AMDGPUISD::MAD &&
> + N1.getOperand(2).getOpcode() == ISD::FMUL) {
> + return DAG.getNode(AMDGPUISD::MAD, SL, VT,
> + N1.getOperand(0), N1.getOperand(1),
> + DAG.getNode(AMDGPUISD::MAD, SL, VT,
> + N1.getOperand(2).getOperand(0),
> + N1.getOperand(2).getOperand(1),
> + N0));
> + }
> +
> + return SDValue();
> +}
> +
> +// FIXME: Mostly copied directly from generic FMA combines.
> +SDValue AMDGPUTargetLowering::performFSubCombine(SDNode *N,
> + DAGCombinerInfo &DCI) const {
> + EVT VT = N->getValueType(0);
> +
> + if (VT != MVT::f32) // There is no mad instruction for f64.
> + return SDValue();
> +
> + SelectionDAG &DAG = DCI.DAG;
> + SDLoc SL(N);
> +
> + SDValue N0 = N->getOperand(0);
> + SDValue N1 = N->getOperand(1);
> +
> + // fold (fsub (fmul x, y), z) -> (mad x, y, (fneg z))
> + if (N0.getOpcode() == ISD::FMUL) {
> + return DAG.getNode(AMDGPUISD::MAD, SL, VT,
> + N0.getOperand(0), N0.getOperand(1),
> + DAG.getNode(ISD::FNEG, SL, VT, N1));
> + }
> +
> + // fold (fsub x, (fmul y, z)) -> (mad (fneg y), z, x)
> + // Note: Commutes FSUB operands.
> + if (N1.getOpcode() == ISD::FMUL) {
> + return DAG.getNode(AMDGPUISD::MAD, SL, VT,
> + DAG.getNode(ISD::FNEG, SL, VT,
> + N1.getOperand(0)),
> + N1.getOperand(1), N0);
> + }
> +
> + // fold (fsub (fneg (fmul x, y)), z) -> (mad (fneg x), y, (fneg z))
> + if (N0.getOpcode() == ISD::FNEG &&
> + N0.getOperand(0).getOpcode() == ISD::FMUL) {
> + SDValue N00 = N0.getOperand(0).getOperand(0);
> + SDValue N01 = N0.getOperand(0).getOperand(1);
> + return DAG.getNode(AMDGPUISD::MAD, SL, VT,
> + DAG.getNode(ISD::FNEG, SL, VT, N00), N01,
> + DAG.getNode(ISD::FNEG, SL, VT, N1));
> + }
> +
> + // fold (fsub (mad x, y, (fmul u, v)), z)
> + // -> (mad x, y, (mad u, v, (fneg z)))
> + if (N0.getOpcode() == AMDGPUISD::MAD &&
> + N0.getOperand(2).getOpcode() == ISD::FMUL) {
> + return DAG.getNode(AMDGPUISD::MAD, SL, VT,
> + N0.getOperand(0), N0.getOperand(1),
> + DAG.getNode(AMDGPUISD::MAD, SL, VT,
> + N0.getOperand(2).getOperand(0),
> + N0.getOperand(2).getOperand(1),
> + DAG.getNode(ISD::FNEG, SL, VT, N1)));
> + }
> +
> + // fold (fsub x, (mad y, z, (fmul u, v)))
> + // -> (mad (fneg y), z, (mad (fneg u), v, x))
> + if (N1.getOpcode() == AMDGPUISD::MAD &&
> + N1.getOperand(2).getOpcode() == ISD::FMUL) {
> + SDValue N20 = N1.getOperand(2).getOperand(0);
> + SDValue N21 = N1.getOperand(2).getOperand(1);
> + return DAG.getNode(AMDGPUISD::MAD, SL, VT,
> + DAG.getNode(ISD::FNEG, SL, VT, N1.getOperand(0)),
> + N1.getOperand(1),
> + DAG.getNode(AMDGPUISD::MAD, SL, VT,
> + DAG.getNode(ISD::FNEG, SL, VT, N20),
> + N21, N0));
> + }
> +
> + return SDValue();
> +}
> +
> SDValue AMDGPUTargetLowering::PerformDAGCombine(SDNode *N,
> DAGCombinerInfo &DCI) const {
> SelectionDAG &DAG = DCI.DAG;
> @@ -2436,6 +2561,14 @@ SDValue AMDGPUTargetLowering::PerformDAGCombine(SDNode *N,
> simplifyI24(N1, DCI);
> return SDValue();
> }
> + case ISD::FADD:
> + if (DCI.getDAGCombineLevel() < AfterLegalizeDAG)
> + break;
> + return performFAddCombine(N, DCI);
> + case ISD::FSUB:
> + if (DCI.getDAGCombineLevel() < AfterLegalizeDAG)
> + break;
> + return performFSubCombine(N, DCI);
> case ISD::SELECT: {
> SDValue Cond = N->getOperand(0);
> if (Cond.getOpcode() == ISD::SETCC && Cond.hasOneUse()) {
> diff --git a/lib/Target/R600/AMDGPUISelLowering.h b/lib/Target/R600/AMDGPUISelLowering.h
> index 387a58e..4aca7c6 100644
> --- a/lib/Target/R600/AMDGPUISelLowering.h
> +++ b/lib/Target/R600/AMDGPUISelLowering.h
> @@ -68,6 +68,9 @@ private:
> SDValue performMulCombine(SDNode *N, DAGCombinerInfo &DCI) const;
>
> protected:
> + SDValue performFAddCombine(SDNode *N, DAGCombinerInfo &DCI) const;
> + SDValue performFSubCombine(SDNode *N, DAGCombinerInfo &DCI) const;
> +
> static EVT getEquivalentMemType(LLVMContext &Context, EVT VT);
> static EVT getEquivalentLoadRegType(LLVMContext &Context, EVT VT);
>
> diff --git a/lib/Target/R600/AMDGPUInstructions.td b/lib/Target/R600/AMDGPUInstructions.td
> index e42796b..ff3fddc 100644
> --- a/lib/Target/R600/AMDGPUInstructions.td
> +++ b/lib/Target/R600/AMDGPUInstructions.td
> @@ -413,11 +413,6 @@ def atomic_xor_global : global_binary_atomic_op<atomic_load_xor>;
> // Misc Pattern Fragments
> //===----------------------------------------------------------------------===//
>
> -def fmad : PatFrag <
> - (ops node:$src0, node:$src1, node:$src2),
> - (fadd (fmul node:$src0, node:$src1), node:$src2)
> ->;
> -
> class Constants {
> int TWO_PI = 0x40c90fdb;
> int PI = 0x40490fdb;
> diff --git a/lib/Target/R600/R600Instructions.td b/lib/Target/R600/R600Instructions.td
> index d004262..06acd6b 100644
> --- a/lib/Target/R600/R600Instructions.td
> +++ b/lib/Target/R600/R600Instructions.td
> @@ -914,7 +914,7 @@ class MULADD_Common <bits<5> inst> : R600_3OP <
>
> class MULADD_IEEE_Common <bits<5> inst> : R600_3OP <
> inst, "MULADD_IEEE",
> - [(set f32:$dst, (fadd (fmul f32:$src0, f32:$src1), f32:$src2))]
> + [(set f32:$dst, (AMDGPUmad f32:$src0, f32:$src1, f32:$src2))]
> >;
>
> class FMA_Common <bits<5> inst> : R600_3OP <
> diff --git a/lib/Target/R600/SIISelLowering.cpp b/lib/Target/R600/SIISelLowering.cpp
> index 894bd6e..6dc97ea 100644
> --- a/lib/Target/R600/SIISelLowering.cpp
> +++ b/lib/Target/R600/SIISelLowering.cpp
> @@ -1603,7 +1603,7 @@ SDValue SITargetLowering::PerformDAGCombine(SDNode *N,
> }
> }
>
> - break;
> + return AMDGPUTargetLowering::performFAddCombine(N, DCI);
> }
> case ISD::FSUB: {
> if (DCI.getDAGCombineLevel() < AfterLegalizeDAG)
> @@ -1616,27 +1616,6 @@ SDValue SITargetLowering::PerformDAGCombine(SDNode *N,
> if (VT == MVT::f32) {
> SDValue LHS = N->getOperand(0);
> SDValue RHS = N->getOperand(1);
> -
> - if (LHS.getOpcode() == ISD::FMUL) {
> - // (fsub (fmul a, b), c) -> mad a, b, (fneg c)
> -
> - SDValue A = LHS.getOperand(0);
> - SDValue B = LHS.getOperand(1);
> - SDValue C = DAG.getNode(ISD::FNEG, DL, VT, RHS);
> -
> - return DAG.getNode(AMDGPUISD::MAD, DL, VT, A, B, C);
> - }
> -
> - if (RHS.getOpcode() == ISD::FMUL) {
> - // (fsub c, (fmul a, b)) -> mad (fneg a), b, c
> -
> - SDValue A = DAG.getNode(ISD::FNEG, DL, VT, RHS.getOperand(0));
> - SDValue B = RHS.getOperand(1);
> - SDValue C = LHS;
> -
> - return DAG.getNode(AMDGPUISD::MAD, DL, VT, A, B, C);
> - }
> -
> if (LHS.getOpcode() == ISD::FADD) {
> // (fsub (fadd a, a), c) -> mad 2.0, a, (fneg c)
>
> @@ -1658,6 +1637,8 @@ SDValue SITargetLowering::PerformDAGCombine(SDNode *N,
> return DAG.getNode(AMDGPUISD::MAD, DL, VT, NegTwo, A, LHS);
> }
> }
> +
> + return AMDGPUTargetLowering::performFSubCombine(N, DCI);
> }
>
> break;
> diff --git a/lib/Target/R600/SIInstructions.td b/lib/Target/R600/SIInstructions.td
> index d758f9f..c614609 100644
> --- a/lib/Target/R600/SIInstructions.td
> +++ b/lib/Target/R600/SIInstructions.td
> @@ -1607,7 +1607,7 @@ defm V_MAD_LEGACY_F32 : VOP3Inst <vop3<0x140, 0x1c0>, "v_mad_legacy_f32",
> >;
>
> defm V_MAD_F32 : VOP3Inst <vop3<0x141, 0x1c1>, "v_mad_f32",
> - VOP_F32_F32_F32_F32, fmad
> + VOP_F32_F32_F32_F32, AMDGPUmad
> >;
>
> defm V_MAD_I32_I24 : VOP3Inst <vop3<0x142, 0x1c2>, "v_mad_i32_i24",
> @@ -2748,9 +2748,6 @@ def : Pat <
> (V_MUL_HI_I32 $src0, $src1)
> >;
>
> -def : Vop3ModPat<V_MAD_F32, VOP_F32_F32_F32_F32, AMDGPUmad>;
> -
> -
> defm : BFIPatterns <V_BFI_B32, S_MOV_B32, SReg_64>;
> def : ROTRPattern <V_ALIGNBIT_B32>;
>
> diff --git a/test/CodeGen/R600/mad-combine.ll b/test/CodeGen/R600/mad-combine.ll
> new file mode 100644
> index 0000000..b116b2c
> --- /dev/null
> +++ b/test/CodeGen/R600/mad-combine.ll
> @@ -0,0 +1,446 @@
> +; RUN: llc -march=amdgcn -mcpu=tahiti -verify-machineinstrs < %s | FileCheck -check-prefix=SI -check-prefix=FUNC %s
> +; RUN: llc -march=amdgcn -mcpu=tahiti -verify-machineinstrs -fp-contract=fast < %s | FileCheck -check-prefix=SI -check-prefix=FUNC %s
> +; RUN: llc -march=amdgcn -mcpu=tahiti -verify-machineinstrs -enable-unsafe-fp-math < %s | FileCheck -check-prefix=SI -check-prefix=FUNC %s
> +
> +; Make sure we still form mad, not fma, even when unsafe math or fp-contract is allowed.
> +
> +
> +declare i32 @llvm.r600.read.tidig.x() #0
> +declare float @llvm.fabs.f32(float) #0
> +declare float @llvm.fma.f32(float, float, float) #0
> +declare float @llvm.fmuladd.f32(float, float, float) #0
> +
> +; (fadd (fmul x, y), z) -> (fma x, y, z)
> +; FUNC-LABEL: {{^}}combine_to_mad_f32_0:
> +; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> +; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> +; SI: v_mad_f32 [[RESULT:v[0-9]+]], [[A]], [[B]], [[C]]
> +; SI: buffer_store_dword [[RESULT]]
> +define void @combine_to_mad_f32_0(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
> + %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> + %gep.0 = getelementptr float addrspace(1)* %in, i32 %tid
> + %gep.1 = getelementptr float addrspace(1)* %gep.0, i32 1
> + %gep.2 = getelementptr float addrspace(1)* %gep.0, i32 2
> + %gep.out = getelementptr float addrspace(1)* %out, i32 %tid
> +
> + %a = load float addrspace(1)* %gep.0
> + %b = load float addrspace(1)* %gep.1
> + %c = load float addrspace(1)* %gep.2
> +
> + %mul = fmul float %a, %b
> + %fma = fadd float %mul, %c
> + store float %fma, float addrspace(1)* %gep.out
> + ret void
> +}
> +
> +; (fadd (fmul x, y), z) -> (fma x, y, z)
> +; FUNC-LABEL: {{^}}combine_to_mad_f32_0_2use:
> +; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> +; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> +; SI-DAG: buffer_load_dword [[D:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:12{{$}}
> +; SI-DAG: v_mad_f32 [[RESULT0:v[0-9]+]], [[A]], [[B]], [[C]]
> +; SI-DAG: v_mad_f32 [[RESULT1:v[0-9]+]], [[A]], [[B]], [[D]]
> +; SI-DAG: buffer_store_dword [[RESULT0]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI-DAG: buffer_store_dword [[RESULT1]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> +; SI: s_endpgm
> +define void @combine_to_mad_f32_0_2use(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
> + %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> + %gep.0 = getelementptr float addrspace(1)* %in, i32 %tid
> + %gep.1 = getelementptr float addrspace(1)* %gep.0, i32 1
> + %gep.2 = getelementptr float addrspace(1)* %gep.0, i32 2
> + %gep.3 = getelementptr float addrspace(1)* %gep.0, i32 3
> + %gep.out.0 = getelementptr float addrspace(1)* %out, i32 %tid
> + %gep.out.1 = getelementptr float addrspace(1)* %gep.out.0, i32 1
> +
> + %a = load float addrspace(1)* %gep.0
> + %b = load float addrspace(1)* %gep.1
> + %c = load float addrspace(1)* %gep.2
> + %d = load float addrspace(1)* %gep.3
> +
> + %mul = fmul float %a, %b
> + %fma0 = fadd float %mul, %c
> + %fma1 = fadd float %mul, %d
> +
> + store float %fma0, float addrspace(1)* %gep.out.0
> + store float %fma1, float addrspace(1)* %gep.out.1
> + ret void
> +}
> +
> +; (fadd x, (fmul y, z)) -> (fma y, z, x)
> +; FUNC-LABEL: {{^}}combine_to_mad_f32_1:
> +; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> +; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> +; SI: v_mad_f32 [[RESULT:v[0-9]+]], [[A]], [[B]], [[C]]
> +; SI: buffer_store_dword [[RESULT]]
> +define void @combine_to_mad_f32_1(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
> + %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> + %gep.0 = getelementptr float addrspace(1)* %in, i32 %tid
> + %gep.1 = getelementptr float addrspace(1)* %gep.0, i32 1
> + %gep.2 = getelementptr float addrspace(1)* %gep.0, i32 2
> + %gep.out = getelementptr float addrspace(1)* %out, i32 %tid
> +
> + %a = load float addrspace(1)* %gep.0
> + %b = load float addrspace(1)* %gep.1
> + %c = load float addrspace(1)* %gep.2
> +
> + %mul = fmul float %a, %b
> + %fma = fadd float %c, %mul
> + store float %fma, float addrspace(1)* %gep.out
> + ret void
> +}
> +
> +; (fsub (fmul x, y), z) -> (fma x, y, (fneg z))
> +; FUNC-LABEL: {{^}}combine_to_mad_fsub_0_f32:
> +; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> +; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> +; SI: v_mad_f32 [[RESULT:v[0-9]+]], [[A]], [[B]], -[[C]]
> +; SI: buffer_store_dword [[RESULT]]
> +define void @combine_to_mad_fsub_0_f32(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
> + %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> + %gep.0 = getelementptr float addrspace(1)* %in, i32 %tid
> + %gep.1 = getelementptr float addrspace(1)* %gep.0, i32 1
> + %gep.2 = getelementptr float addrspace(1)* %gep.0, i32 2
> + %gep.out = getelementptr float addrspace(1)* %out, i32 %tid
> +
> + %a = load float addrspace(1)* %gep.0
> + %b = load float addrspace(1)* %gep.1
> + %c = load float addrspace(1)* %gep.2
> +
> + %mul = fmul float %a, %b
> + %fma = fsub float %mul, %c
> + store float %fma, float addrspace(1)* %gep.out
> + ret void
> +}
> +
> +; (fsub (fmul x, y), z) -> (fma x, y, (fneg z))
> +; FUNC-LABEL: {{^}}combine_to_mad_fsub_0_f32_2use:
> +; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> +; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> +; SI-DAG: buffer_load_dword [[D:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:12{{$}}
> +; SI-DAG: v_mad_f32 [[RESULT0:v[0-9]+]], [[A]], [[B]], -[[C]]
> +; SI-DAG: v_mad_f32 [[RESULT1:v[0-9]+]], [[A]], [[B]], -[[D]]
> +; SI-DAG: buffer_store_dword [[RESULT0]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI-DAG: buffer_store_dword [[RESULT1]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> +; SI: s_endpgm
> +define void @combine_to_mad_fsub_0_f32_2use(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
> + %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> + %gep.0 = getelementptr float addrspace(1)* %in, i32 %tid
> + %gep.1 = getelementptr float addrspace(1)* %gep.0, i32 1
> + %gep.2 = getelementptr float addrspace(1)* %gep.0, i32 2
> + %gep.3 = getelementptr float addrspace(1)* %gep.0, i32 3
> + %gep.out.0 = getelementptr float addrspace(1)* %out, i32 %tid
> + %gep.out.1 = getelementptr float addrspace(1)* %gep.out.0, i32 1
> +
> + %a = load float addrspace(1)* %gep.0
> + %b = load float addrspace(1)* %gep.1
> + %c = load float addrspace(1)* %gep.2
> + %d = load float addrspace(1)* %gep.3
> +
> + %mul = fmul float %a, %b
> + %fma0 = fsub float %mul, %c
> + %fma1 = fsub float %mul, %d
> + store float %fma0, float addrspace(1)* %gep.out.0
> + store float %fma1, float addrspace(1)* %gep.out.1
> + ret void
> +}
> +
> +; (fsub x, (fmul y, z)) -> (fma (fneg y), z, x)
> +; FUNC-LABEL: {{^}}combine_to_mad_fsub_1_f32:
> +; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> +; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> +; SI: v_mad_f32 [[RESULT:v[0-9]+]], -[[A]], [[B]], [[C]]
> +; SI: buffer_store_dword [[RESULT]]
> +define void @combine_to_mad_fsub_1_f32(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
> + %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> + %gep.0 = getelementptr float addrspace(1)* %in, i32 %tid
> + %gep.1 = getelementptr float addrspace(1)* %gep.0, i32 1
> + %gep.2 = getelementptr float addrspace(1)* %gep.0, i32 2
> + %gep.out = getelementptr float addrspace(1)* %out, i32 %tid
> +
> + %a = load float addrspace(1)* %gep.0
> + %b = load float addrspace(1)* %gep.1
> + %c = load float addrspace(1)* %gep.2
> +
> + %mul = fmul float %a, %b
> + %fma = fsub float %c, %mul
> + store float %fma, float addrspace(1)* %gep.out
> + ret void
> +}
> +
> +; (fsub x, (fmul y, z)) -> (fma (fneg y), z, x)
> +; FUNC-LABEL: {{^}}combine_to_mad_fsub_1_f32_2use:
> +; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
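> +; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> +; SI-DAG: buffer_load_dword [[D:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:12{{$}}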
> +; SI-DAG: v_mad_f32 [[RESULT0:v[0-9]+]], -[[A]], [[B]], [[C]]
> +; SI-DAG: v_mad_f32 [[RESULT1:v[0-9]+]], -[[A]], [[B]], [[D]]
> +; SI-DAG: buffer_store_dword [[RESULT0]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI-DAG: buffer_store_dword [[RESULT1]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> +; SI: s_endpgm
> +define void @combine_to_mad_fsub_1_f32_2use(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
> + %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> + %gep.0 = getelementptr float addrspace(1)* %in, i32 %tid
> + %gep.1 = getelementptr float addrspace(1)* %gep.0, i32 1
> + %gep.2 = getelementptr float addrspace(1)* %gep.0, i32 2
> + %gep.3 = getelementptr float addrspace(1)* %gep.0, i32 3
> + %gep.out.0 = getelementptr float addrspace(1)* %out, i32 %tid
> + %gep.out.1 = getelementptr float addrspace(1)* %gep.out.0, i32 1
> +
> + %a = load float addrspace(1)* %gep.0
> + %b = load float addrspace(1)* %gep.1
> + %c = load float addrspace(1)* %gep.2
> + %d = load float addrspace(1)* %gep.3
> +
> + %mul = fmul float %a, %b
> + %fma0 = fsub float %c, %mul
> + %fma1 = fsub float %d, %mul
> + store float %fma0, float addrspace(1)* %gep.out.0
> + store float %fma1, float addrspace(1)* %gep.out.1
> + ret void
> +}
> +
> +; (fsub (fneg (fmul x, y)), z) -> (fma (fneg x), y, (fneg z))
> +; FUNC-LABEL: {{^}}combine_to_mad_fsub_2_f32:
> +; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> +; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> +; SI: v_mad_f32 [[RESULT:v[0-9]+]], -[[A]], [[B]], -[[C]]
> +; SI: buffer_store_dword [[RESULT]]
> +define void @combine_to_mad_fsub_2_f32(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
> + %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> + %gep.0 = getelementptr float addrspace(1)* %in, i32 %tid
> + %gep.1 = getelementptr float addrspace(1)* %gep.0, i32 1
> + %gep.2 = getelementptr float addrspace(1)* %gep.0, i32 2
> + %gep.out = getelementptr float addrspace(1)* %out, i32 %tid
> +
> + %a = load float addrspace(1)* %gep.0
> + %b = load float addrspace(1)* %gep.1
> + %c = load float addrspace(1)* %gep.2
> +
> + %mul = fmul float %a, %b
> + %mul.neg = fsub float -0.0, %mul
> + %fma = fsub float %mul.neg, %c
> +
> + store float %fma, float addrspace(1)* %gep.out
> + ret void
> +}
> +
> +; (fsub (fneg (fmul x, y)), z) -> (fma (fneg x), y, (fneg z))
> +; FUNC-LABEL: {{^}}combine_to_mad_fsub_2_f32_2uses_neg:
> +; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
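> +; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> +; SI-DAG: buffer_load_dword [[D:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:12{{$}}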
> +; SI-DAG: v_mad_f32 [[RESULT0:v[0-9]+]], -[[A]], [[B]], -[[C]]
> +; SI-DAG: v_mad_f32 [[RESULT1:v[0-9]+]], -[[A]], [[B]], -[[D]]
> +; SI-DAG: buffer_store_dword [[RESULT0]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI-DAG: buffer_store_dword [[RESULT1]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> +; SI: s_endpgm
> +define void @combine_to_mad_fsub_2_f32_2uses_neg(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
> + %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> + %gep.0 = getelementptr float addrspace(1)* %in, i32 %tid
> + %gep.1 = getelementptr float addrspace(1)* %gep.0, i32 1
> + %gep.2 = getelementptr float addrspace(1)* %gep.0, i32 2
> + %gep.3 = getelementptr float addrspace(1)* %gep.0, i32 3
> + %gep.out.0 = getelementptr float addrspace(1)* %out, i32 %tid
> + %gep.out.1 = getelementptr float addrspace(1)* %gep.out.0, i32 1
> +
> + %a = load float addrspace(1)* %gep.0
> + %b = load float addrspace(1)* %gep.1
> + %c = load float addrspace(1)* %gep.2
> + %d = load float addrspace(1)* %gep.3
> +
> + %mul = fmul float %a, %b
> + %mul.neg = fsub float -0.0, %mul
> + %fma0 = fsub float %mul.neg, %c
> + %fma1 = fsub float %mul.neg, %d
> +
> + store float %fma0, float addrspace(1)* %gep.out.0
> + store float %fma1, float addrspace(1)* %gep.out.1
> + ret void
> +}
> +
> +; (fsub (fneg (fmul x, y)), z) -> (fma (fneg x), y, (fneg z))
> +; FUNC-LABEL: {{^}}combine_to_mad_fsub_2_f32_2uses_mul:
> +; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
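> +; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> +; SI-DAG: buffer_load_dword [[D:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:12{{$}}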
> +; SI-DAG: v_mad_f32 [[RESULT0:v[0-9]+]], -[[A]], [[B]], -[[C]]
> +; SI-DAG: v_mad_f32 [[RESULT1:v[0-9]+]], [[A]], [[B]], -[[D]]
> +; SI-DAG: buffer_store_dword [[RESULT0]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI-DAG: buffer_store_dword [[RESULT1]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> +; SI: s_endpgm
> +define void @combine_to_mad_fsub_2_f32_2uses_mul(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
> + %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> + %gep.0 = getelementptr float addrspace(1)* %in, i32 %tid
> + %gep.1 = getelementptr float addrspace(1)* %gep.0, i32 1
> + %gep.2 = getelementptr float addrspace(1)* %gep.0, i32 2
> + %gep.3 = getelementptr float addrspace(1)* %gep.0, i32 3
> + %gep.out.0 = getelementptr float addrspace(1)* %out, i32 %tid
> + %gep.out.1 = getelementptr float addrspace(1)* %gep.out.0, i32 1
> +
> + %a = load float addrspace(1)* %gep.0
> + %b = load float addrspace(1)* %gep.1
> + %c = load float addrspace(1)* %gep.2
> + %d = load float addrspace(1)* %gep.3
> +
> + %mul = fmul float %a, %b
> + %mul.neg = fsub float -0.0, %mul
> + %fma0 = fsub float %mul.neg, %c
> + %fma1 = fsub float %mul, %d
> +
> + store float %fma0, float addrspace(1)* %gep.out.0
> + store float %fma1, float addrspace(1)* %gep.out.1
> + ret void
> +}
> +
> +; fold (fsub (fma x, y, (fmul u, v)), z) -> (fma x, y (fma u, v, (fneg z)))
> +
> +; FUNC-LABEL: {{^}}aggressive_combine_to_mad_fsub_0_f32:
> +; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> +; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> +; SI-DAG: buffer_load_dword [[D:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:12{{$}}
> +; SI-DAG: buffer_load_dword [[E:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:16{{$}}
> +; SI: v_mul_f32_e32 [[TMP0:v[0-9]+]], [[E]], [[D]]
> +; SI: v_fma_f32 [[TMP1:v[0-9]+]], [[A]], [[B]], [[TMP0]]
> +; SI: v_subrev_f32_e32 [[RESULT:v[0-9]+]], [[C]], [[TMP1]]
> +; SI: buffer_store_dword [[RESULT]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +define void @aggressive_combine_to_mad_fsub_0_f32(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
> + %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> + %gep.0 = getelementptr float addrspace(1)* %in, i32 %tid
> + %gep.1 = getelementptr float addrspace(1)* %gep.0, i32 1
> + %gep.2 = getelementptr float addrspace(1)* %gep.0, i32 2
> + %gep.3 = getelementptr float addrspace(1)* %gep.0, i32 3
> + %gep.4 = getelementptr float addrspace(1)* %gep.0, i32 4
> + %gep.out = getelementptr float addrspace(1)* %out, i32 %tid
> +
> + %x = load float addrspace(1)* %gep.0
> + %y = load float addrspace(1)* %gep.1
> + %z = load float addrspace(1)* %gep.2
> + %u = load float addrspace(1)* %gep.3
> + %v = load float addrspace(1)* %gep.4
> +
> + %tmp0 = fmul float %u, %v
> + %tmp1 = call float @llvm.fma.f32(float %x, float %y, float %tmp0) #0
> + %tmp2 = fsub float %tmp1, %z
> +
> + store float %tmp2, float addrspace(1)* %gep.out
> + ret void
> +}
> +
> +; fold (fsub x, (fma y, z, (fmul u, v)))
> +; -> (fma (fneg y), z, (fma (fneg u), v, x))
> +
> +; FUNC-LABEL: {{^}}aggressive_combine_to_mad_fsub_1_f32:
> +; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> +; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> +; SI-DAG: buffer_load_dword [[D:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:12{{$}}
> +; SI-DAG: buffer_load_dword [[E:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:16{{$}}
> +; SI: v_mul_f32_e32 [[TMP0:v[0-9]+]], [[E]], [[D]]
> +; SI: v_fma_f32 [[TMP1:v[0-9]+]], [[B]], [[C]], [[TMP0]]
> +; SI: v_subrev_f32_e32 [[RESULT:v[0-9]+]], [[TMP1]], [[A]]
> +; SI: buffer_store_dword [[RESULT]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI: s_endpgm
> +define void @aggressive_combine_to_mad_fsub_1_f32(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
> + %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> + %gep.0 = getelementptr float addrspace(1)* %in, i32 %tid
> + %gep.1 = getelementptr float addrspace(1)* %gep.0, i32 1
> + %gep.2 = getelementptr float addrspace(1)* %gep.0, i32 2
> + %gep.3 = getelementptr float addrspace(1)* %gep.0, i32 3
> + %gep.4 = getelementptr float addrspace(1)* %gep.0, i32 4
> + %gep.out = getelementptr float addrspace(1)* %out, i32 %tid
> +
> + %x = load float addrspace(1)* %gep.0
> + %y = load float addrspace(1)* %gep.1
> + %z = load float addrspace(1)* %gep.2
> + %u = load float addrspace(1)* %gep.3
> + %v = load float addrspace(1)* %gep.4
> +
> + %tmp0 = fmul float %u, %v
> + %tmp1 = call float @llvm.fma.f32(float %y, float %z, float %tmp0) #0
> + %tmp2 = fsub float %x, %tmp1
> +
> + store float %tmp2, float addrspace(1)* %gep.out
> + ret void
> +}
> +
> +; fold (fsub (fma x, y, (fmul u, v)), z) -> (fma x, y, (fma u, v, (fneg z)))
> +
> +; FUNC-LABEL: {{^}}aggressive_combine_to_mad_fsub_2_f32:
> +; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> +; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> +; SI-DAG: buffer_load_dword [[D:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:12{{$}}
> +; SI-DAG: buffer_load_dword [[E:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:16{{$}}
> +; SI-DAG: v_mad_f32 [[TMP:v[0-9]+]], [[D]], [[E]], -[[C]]
> +; SI-DAG: v_mad_f32 [[RESULT:v[0-9]+]], [[A]], [[B]], [[TMP]]
> +; SI-DAG: buffer_store_dword [[RESULT]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI: s_endpgm
> +define void @aggressive_combine_to_mad_fsub_2_f32(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
> + %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> + %gep.0 = getelementptr float addrspace(1)* %in, i32 %tid
> + %gep.1 = getelementptr float addrspace(1)* %gep.0, i32 1
> + %gep.2 = getelementptr float addrspace(1)* %gep.0, i32 2
> + %gep.3 = getelementptr float addrspace(1)* %gep.0, i32 3
> + %gep.4 = getelementptr float addrspace(1)* %gep.0, i32 4
> + %gep.out = getelementptr float addrspace(1)* %out, i32 %tid
> +
> + %x = load float addrspace(1)* %gep.0
> + %y = load float addrspace(1)* %gep.1
> + %z = load float addrspace(1)* %gep.2
> + %u = load float addrspace(1)* %gep.3
> + %v = load float addrspace(1)* %gep.4
> +
> + %tmp0 = fmul float %u, %v
> + %tmp1 = call float @llvm.fmuladd.f32(float %x, float %y, float %tmp0) #0
> + %tmp2 = fsub float %tmp1, %z
> +
> + store float %tmp2, float addrspace(1)* %gep.out
> + ret void
> +}
> +
> +; fold (fsub x, (fmuladd y, z, (fmul u, v)))
> +; -> (fmuladd (fneg y), z, (fmuladd (fneg u), v, x))
> +
> +; FUNC-LABEL: {{^}}aggressive_combine_to_mad_fsub_3_f32:
> +; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> +; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> +; SI-DAG: buffer_load_dword [[D:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:12{{$}}
> +; SI-DAG: buffer_load_dword [[E:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:16{{$}}
> +; SI-DAG: v_mad_f32 [[TMP:v[0-9]+]], -[[D]], [[E]], [[A]]
> +; SI-DAG: v_mad_f32 [[RESULT:v[0-9]+]], -[[B]], [[C]], [[TMP]]
> +; SI-DAG: buffer_store_dword [[RESULT]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +; SI: s_endpgm
> +define void @aggressive_combine_to_mad_fsub_3_f32(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
> + %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> + %gep.0 = getelementptr float addrspace(1)* %in, i32 %tid
> + %gep.1 = getelementptr float addrspace(1)* %gep.0, i32 1
> + %gep.2 = getelementptr float addrspace(1)* %gep.0, i32 2
> + %gep.3 = getelementptr float addrspace(1)* %gep.0, i32 3
> + %gep.4 = getelementptr float addrspace(1)* %gep.0, i32 4
> + %gep.out = getelementptr float addrspace(1)* %out, i32 %tid
> +
> + %x = load float addrspace(1)* %gep.0
> + %y = load float addrspace(1)* %gep.1
> + %z = load float addrspace(1)* %gep.2
> + %u = load float addrspace(1)* %gep.3
> + %v = load float addrspace(1)* %gep.4
> +
> + %tmp0 = fmul float %u, %v
> + %tmp1 = call float @llvm.fmuladd.f32(float %y, float %z, float %tmp0) #0
> + %tmp2 = fsub float %x, %tmp1
> +
> + store float %tmp2, float addrspace(1)* %gep.out
> + ret void
> +}
> +
> +attributes #0 = { nounwind readnone }
> +attributes #1 = { nounwind }
> --
> 2.2.1
>
> From b4f466d606626343ca30a9a8daec35bd61362027 Mon Sep 17 00:00:00 2001
> From: Matt Arsenault <Matthew.Arsenault at amd.com>
> Date: Thu, 22 Jan 2015 18:41:24 -0800
> Subject: [PATCH 5/5] R600/SI: Only form v_mad_f32 without denormals
>
> According to some sources, v_mad_f32 does not support them.
Do we ever have denormals enabled?
> ---
> lib/Target/R600/AMDGPUISelLowering.cpp | 8 ++
> lib/Target/R600/SIISelLowering.cpp | 16 ++-
> test/CodeGen/R600/mad-combine.ll | 183 +++++++++++++++++++++++++++------
> 3 files changed, 173 insertions(+), 34 deletions(-)
>
> diff --git a/lib/Target/R600/AMDGPUISelLowering.cpp b/lib/Target/R600/AMDGPUISelLowering.cpp
> index f3769e3..6f7d6e3 100644
> --- a/lib/Target/R600/AMDGPUISelLowering.cpp
> +++ b/lib/Target/R600/AMDGPUISelLowering.cpp
> @@ -2426,6 +2426,10 @@ SDValue AMDGPUTargetLowering::performMulCombine(SDNode *N,
> // We can form f32 mads as long as denormals are not requested.
> SDValue AMDGPUTargetLowering::performFAddCombine(SDNode *N,
> DAGCombinerInfo &DCI) const {
> + // v_mad_f32 does not support denormals.
> + if (Subtarget->hasFP32Denormals())
> + return SDValue();
> +
> EVT VT = N->getValueType(0);
>
> if (VT != MVT::f32) // There is no mad instruction for f64.
> @@ -2478,6 +2482,10 @@ SDValue AMDGPUTargetLowering::performFAddCombine(SDNode *N,
> // FIXME: Mostly copied directly from generic FMA combines.
> SDValue AMDGPUTargetLowering::performFSubCombine(SDNode *N,
> DAGCombinerInfo &DCI) const {
> + // v_mad_f32 does not support denormals.
> + if (Subtarget->hasFP32Denormals())
> + return SDValue();
> +
> EVT VT = N->getValueType(0);
>
> if (VT != MVT::f32) // There is no mad instruction for f64.
> diff --git a/lib/Target/R600/SIISelLowering.cpp b/lib/Target/R600/SIISelLowering.cpp
> index 6dc97ea..ce803e8 100644
> --- a/lib/Target/R600/SIISelLowering.cpp
> +++ b/lib/Target/R600/SIISelLowering.cpp
> @@ -672,8 +672,9 @@ bool SITargetLowering::isFMAFasterThanFMulAndFAdd(EVT VT) const {
> case MVT::f32:
> // This is as fast on some subtargets. However, we always have full rate f32
> // mad available which returns the same result as the separate operations
> - // which we should prefer over fma.
> - return false;
> + // which we should prefer over fma. We can't use this if we want to support
> + // denormals, so only report this in these cases.
> + return Subtarget->hasFP32Denormals() && Subtarget->hasFastFMAF32();
> case MVT::f64:
> return true;
> default:
> @@ -1579,6 +1580,11 @@ SDValue SITargetLowering::PerformDAGCombine(SDNode *N,
> if (VT != MVT::f32)
> break;
>
> + // Only do this if we are not trying to support denormals. v_mad_f32 does
> + // not support denormals ever.
> + if (Subtarget->hasFP32Denormals())
> + break;
> +
> SDValue LHS = N->getOperand(0);
> SDValue RHS = N->getOperand(1);
>
> @@ -1613,7 +1619,11 @@ SDValue SITargetLowering::PerformDAGCombine(SDNode *N,
>
> // Try to get the fneg to fold into the source modifier. This undoes generic
> // DAG combines and folds them into the mad.
> - if (VT == MVT::f32) {
> + //
> + // Only do this if we are not trying to support denormals. v_mad_f32 does
> + // not support denormals ever.
> + if (VT == MVT::f32 &&
> + !Subtarget->hasFP32Denormals()) {
> SDValue LHS = N->getOperand(0);
> SDValue RHS = N->getOperand(1);
> if (LHS.getOpcode() == ISD::FADD) {
> diff --git a/test/CodeGen/R600/mad-combine.ll b/test/CodeGen/R600/mad-combine.ll
> index b116b2c..8c4e09b 100644
> --- a/test/CodeGen/R600/mad-combine.ll
> +++ b/test/CodeGen/R600/mad-combine.ll
> @@ -1,9 +1,12 @@
> -; RUN: llc -march=amdgcn -mcpu=tahiti -verify-machineinstrs < %s | FileCheck -check-prefix=SI -check-prefix=FUNC %s
> -; RUN: llc -march=amdgcn -mcpu=tahiti -verify-machineinstrs -fp-contract=fast < %s | FileCheck -check-prefix=SI -check-prefix=FUNC %s
> -; RUN: llc -march=amdgcn -mcpu=tahiti -verify-machineinstrs -enable-unsafe-fp-math < %s | FileCheck -check-prefix=SI -check-prefix=FUNC %s
> -
> ; Make sure we still form mad even when unsafe math or fp-contract is allowed instead of fma.
>
> +; RUN: llc -march=amdgcn -mcpu=tahiti -verify-machineinstrs < %s | FileCheck -check-prefix=SI -check-prefix=SI-STD -check-prefix=FUNC %s
> +; RUN: llc -march=amdgcn -mcpu=tahiti -verify-machineinstrs -fp-contract=fast < %s | FileCheck -check-prefix=SI -check-prefix=SI-STD -check-prefix=FUNC %s
> +; RUN: llc -march=amdgcn -mcpu=tahiti -verify-machineinstrs -enable-unsafe-fp-math < %s | FileCheck -check-prefix=SI -check-prefix=SI-STD -check-prefix=FUNC %s
> +
> +; Make sure we don't form mad with denormals
> +; RUN: llc -march=amdgcn -mcpu=tahiti -mattr=+fp32-denormals -fp-contract=fast -verify-machineinstrs < %s | FileCheck -check-prefix=SI -check-prefix=SI-DENORM -check-prefix=FUNC %s
> +; RUN: llc -march=amdgcn -mcpu=verde -mattr=+fp32-denormals -fp-contract=fast -verify-machineinstrs < %s | FileCheck -check-prefix=SI -check-prefix=SI-DENORM-SLOWFMAF -check-prefix=FUNC %s
>
> declare i32 @llvm.r600.read.tidig.x() #0
> declare float @llvm.fabs.f32(float) #0
> @@ -15,7 +18,17 @@ declare float @llvm.fmuladd.f32(float, float, float) #0
> ; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> ; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> ; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> -; SI: v_mad_f32 [[RESULT:v[0-9]+]], [[A]], [[B]], [[C]]
> +
> +; SI-STD: v_mad_f32 [[RESULT:v[0-9]+]], [[A]], [[B]], [[C]]
> +
> +; SI-DENORM: v_fma_f32 [[RESULT:v[0-9]+]], [[A]], [[B]], [[C]]
> +
> +; SI-DENORM-SLOWFMAF-NOT: v_fma
> +; SI-DENORM-SLOWFMAF-NOT: v_mad
> +
> +; SI-DENORM-SLOWFMAF: v_mul_f32_e32 [[TMP:v[0-9]+]], [[B]], [[A]]
> +; SI-DENORM-SLOWFMAF: v_add_f32_e32 [[RESULT:v[0-9]+]], [[C]], [[TMP]]
> +
> ; SI: buffer_store_dword [[RESULT]]
> define void @combine_to_mad_f32_0(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
> %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> @@ -40,8 +53,17 @@ define void @combine_to_mad_f32_0(float addrspace(1)* noalias %out, float addrsp
> ; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> ; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> ; SI-DAG: buffer_load_dword [[D:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:12{{$}}
> -; SI-DAG: v_mad_f32 [[RESULT0:v[0-9]+]], [[A]], [[B]], [[C]]
> -; SI-DAG: v_mad_f32 [[RESULT1:v[0-9]+]], [[A]], [[B]], [[D]]
> +
> +; SI-STD-DAG: v_mad_f32 [[RESULT0:v[0-9]+]], [[A]], [[B]], [[C]]
> +; SI-STD-DAG: v_mad_f32 [[RESULT1:v[0-9]+]], [[A]], [[B]], [[D]]
> +
> +; SI-DENORM-DAG: v_fma_f32 [[RESULT0:v[0-9]+]], [[A]], [[B]], [[C]]
> +; SI-DENORM-DAG: v_fma_f32 [[RESULT1:v[0-9]+]], [[A]], [[B]], [[D]]
> +
> +; SI-DENORM-SLOWFMAF: v_mul_f32_e32 [[TMP:v[0-9]+]], [[B]], [[A]]
> +; SI-DENORM-SLOWFMAF-DAG: v_add_f32_e32 [[RESULT0:v[0-9]+]], [[C]], [[TMP]]
> +; SI-DENORM-SLOWFMAF-DAG: v_add_f32_e32 [[RESULT1:v[0-9]+]], [[D]], [[TMP]]
> +
> ; SI-DAG: buffer_store_dword [[RESULT0]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> ; SI-DAG: buffer_store_dword [[RESULT1]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> ; SI: s_endpgm
> @@ -73,7 +95,13 @@ define void @combine_to_mad_f32_0_2use(float addrspace(1)* noalias %out, float a
> ; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> ; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> ; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> -; SI: v_mad_f32 [[RESULT:v[0-9]+]], [[A]], [[B]], [[C]]
> +
> +; SI-STD: v_mad_f32 [[RESULT:v[0-9]+]], [[A]], [[B]], [[C]]
> +; SI-DENORM: v_fma_f32 [[RESULT:v[0-9]+]], [[A]], [[B]], [[C]]
> +
> +; SI-DENORM-SLOWFMAF: v_mul_f32_e32 [[TMP:v[0-9]+]], [[B]], [[A]]
> +; SI-DENORM-SLOWFMAF: v_add_f32_e32 [[RESULT:v[0-9]+]], [[TMP]], [[C]]
> +
> ; SI: buffer_store_dword [[RESULT]]
> define void @combine_to_mad_f32_1(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
> %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> @@ -97,7 +125,13 @@ define void @combine_to_mad_f32_1(float addrspace(1)* noalias %out, float addrsp
> ; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> ; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> ; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> -; SI: v_mad_f32 [[RESULT:v[0-9]+]], [[A]], [[B]], -[[C]]
> +
> +; SI-STD: v_mad_f32 [[RESULT:v[0-9]+]], [[A]], [[B]], -[[C]]
> +; SI-DENORM: v_fma_f32 [[RESULT:v[0-9]+]], [[A]], [[B]], -[[C]]
> +
> +; SI-DENORM-SLOWFMAF: v_mul_f32_e32 [[TMP:v[0-9]+]], [[B]], [[A]]
> +; SI-DENORM-SLOWFMAF: v_subrev_f32_e32 [[RESULT:v[0-9]+]], [[C]], [[TMP]]
> +
> ; SI: buffer_store_dword [[RESULT]]
> define void @combine_to_mad_fsub_0_f32(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
> %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> @@ -122,8 +156,17 @@ define void @combine_to_mad_fsub_0_f32(float addrspace(1)* noalias %out, float a
> ; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> ; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> ; SI-DAG: buffer_load_dword [[D:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:12{{$}}
> -; SI-DAG: v_mad_f32 [[RESULT0:v[0-9]+]], [[A]], [[B]], -[[C]]
> -; SI-DAG: v_mad_f32 [[RESULT1:v[0-9]+]], [[A]], [[B]], -[[D]]
> +
> +; SI-STD-DAG: v_mad_f32 [[RESULT0:v[0-9]+]], [[A]], [[B]], -[[C]]
> +; SI-STD-DAG: v_mad_f32 [[RESULT1:v[0-9]+]], [[A]], [[B]], -[[D]]
> +
> +; SI-DENORM-DAG: v_fma_f32 [[RESULT0:v[0-9]+]], [[A]], [[B]], -[[C]]
> +; SI-DENORM-DAG: v_fma_f32 [[RESULT1:v[0-9]+]], [[A]], [[B]], -[[D]]
> +
> +; SI-DENORM-SLOWFMAF: v_mul_f32_e32 [[TMP:v[0-9]+]], [[B]], [[A]]
> +; SI-DENORM-SLOWFMAF-DAG: v_subrev_f32_e32 [[RESULT0:v[0-9]+]], [[C]], [[TMP]]
> +; SI-DENORM-SLOWFMAF-DAG: v_subrev_f32_e32 [[RESULT1:v[0-9]+]], [[D]], [[TMP]]
> +
> ; SI-DAG: buffer_store_dword [[RESULT0]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> ; SI-DAG: buffer_store_dword [[RESULT1]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> ; SI: s_endpgm
> @@ -154,7 +197,13 @@ define void @combine_to_mad_fsub_0_f32_2use(float addrspace(1)* noalias %out, fl
> ; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> ; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> ; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> -; SI: v_mad_f32 [[RESULT:v[0-9]+]], -[[A]], [[B]], [[C]]
> +
> +; SI-STD: v_mad_f32 [[RESULT:v[0-9]+]], -[[A]], [[B]], [[C]]
> +; SI-DENORM: v_fma_f32 [[RESULT:v[0-9]+]], -[[A]], [[B]], [[C]]
> +
> +; SI-DENORM-SLOWFMAF: v_mul_f32_e32 [[TMP:v[0-9]+]], [[B]], [[A]]
> +; SI-DENORM-SLOWFMAF: v_subrev_f32_e32 [[RESULT:v[0-9]+]], [[TMP]], [[C]]
> +
> ; SI: buffer_store_dword [[RESULT]]
> define void @combine_to_mad_fsub_1_f32(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
> %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> @@ -178,8 +227,17 @@ define void @combine_to_mad_fsub_1_f32(float addrspace(1)* noalias %out, float a
> ; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> ; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> ; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> -; SI-DAG: v_mad_f32 [[RESULT0:v[0-9]+]], -[[A]], [[B]], [[C]]
> -; SI-DAG: v_mad_f32 [[RESULT1:v[0-9]+]], -[[A]], [[B]], [[D]]
> +
> +; SI-STD-DAG: v_mad_f32 [[RESULT0:v[0-9]+]], -[[A]], [[B]], [[C]]
> +; SI-STD-DAG: v_mad_f32 [[RESULT1:v[0-9]+]], -[[A]], [[B]], [[D]]
> +
> +; SI-DENORM-DAG: v_fma_f32 [[RESULT0:v[0-9]+]], -[[A]], [[B]], [[C]]
> +; SI-DENORM-DAG: v_fma_f32 [[RESULT1:v[0-9]+]], -[[A]], [[B]], [[D]]
> +
> +; SI-DENORM-SLOWFMAF: v_mul_f32_e32 [[TMP:v[0-9]+]], [[B]], [[A]]
> +; SI-DENORM-SLOWFMAF-DAG: v_subrev_f32_e32 [[RESULT0:v[0-9]+]], [[TMP]], [[C]]
> +; SI-DENORM-SLOWFMAF-DAG: v_subrev_f32_e32 [[RESULT1:v[0-9]+]], [[TMP]], [[D]]
> +
> ; SI-DAG: buffer_store_dword [[RESULT0]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> ; SI-DAG: buffer_store_dword [[RESULT1]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> ; SI: s_endpgm
> @@ -210,7 +268,14 @@ define void @combine_to_mad_fsub_1_f32_2use(float addrspace(1)* noalias %out, fl
> ; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> ; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> ; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> -; SI: v_mad_f32 [[RESULT:v[0-9]+]], -[[A]], [[B]], -[[C]]
> +
> +; SI-STD: v_mad_f32 [[RESULT:v[0-9]+]], -[[A]], [[B]], -[[C]]
> +
> +; SI-DENORM: v_fma_f32 [[RESULT:v[0-9]+]], -[[A]], [[B]], -[[C]]
> +
> +; SI-DENORM-SLOWFMAF: v_mul_f32_e32 [[TMP:v[0-9]+]], [[B]], [[A]]
> +; SI-DENORM-SLOWFMAF: v_sub_f32_e64 [[RESULT:v[0-9]+]], -[[TMP]], [[C]]
> +
> ; SI: buffer_store_dword [[RESULT]]
> define void @combine_to_mad_fsub_2_f32(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
> %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> @@ -236,8 +301,17 @@ define void @combine_to_mad_fsub_2_f32(float addrspace(1)* noalias %out, float a
> ; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> ; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> ; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> -; SI-DAG: v_mad_f32 [[RESULT0:v[0-9]+]], -[[A]], [[B]], -[[C]]
> -; SI-DAG: v_mad_f32 [[RESULT1:v[0-9]+]], -[[A]], [[B]], -[[D]]
> +
> +; SI-STD-DAG: v_mad_f32 [[RESULT0:v[0-9]+]], -[[A]], [[B]], -[[C]]
> +; SI-STD-DAG: v_mad_f32 [[RESULT1:v[0-9]+]], -[[A]], [[B]], -[[D]]
> +
> +; SI-DENORM-DAG: v_fma_f32 [[RESULT0:v[0-9]+]], -[[A]], [[B]], -[[C]]
> +; SI-DENORM-DAG: v_fma_f32 [[RESULT1:v[0-9]+]], -[[A]], [[B]], -[[D]]
> +
> +; SI-DENORM-SLOWFMAF: v_mul_f32_e32 [[TMP:v[0-9]+]], [[B]], [[A]]
> +; SI-DENORM-SLOWFMAF-DAG: v_sub_f32_e64 [[RESULT0:v[0-9]+]], -[[TMP]], [[C]]
> +; SI-DENORM-SLOWFMAF-DAG: v_sub_f32_e64 [[RESULT1:v[0-9]+]], -[[TMP]], [[D]]
> +
> ; SI-DAG: buffer_store_dword [[RESULT0]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> ; SI-DAG: buffer_store_dword [[RESULT1]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> ; SI: s_endpgm
> @@ -270,8 +344,17 @@ define void @combine_to_mad_fsub_2_f32_2uses_neg(float addrspace(1)* noalias %ou
> ; SI-DAG: buffer_load_dword [[A:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> ; SI-DAG: buffer_load_dword [[B:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> ; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> -; SI-DAG: v_mad_f32 [[RESULT0:v[0-9]+]], -[[A]], [[B]], -[[C]]
> -; SI-DAG: v_mad_f32 [[RESULT1:v[0-9]+]], [[A]], [[B]], -[[D]]
> +
> +; SI-STD-DAG: v_mad_f32 [[RESULT0:v[0-9]+]], -[[A]], [[B]], -[[C]]
> +; SI-STD-DAG: v_mad_f32 [[RESULT1:v[0-9]+]], [[A]], [[B]], -[[D]]
> +
> +; SI-DENORM-DAG: v_fma_f32 [[RESULT0:v[0-9]+]], -[[A]], [[B]], -[[C]]
> +; SI-DENORM-DAG: v_fma_f32 [[RESULT1:v[0-9]+]], [[A]], [[B]], -[[D]]
> +
> +; SI-DENORM-SLOWFMAF: v_mul_f32_e32 [[TMP:v[0-9]+]], [[B]], [[A]]
> +; SI-DENORM-SLOWFMAF-DAG: v_sub_f32_e64 [[RESULT0:v[0-9]+]], -[[TMP]], [[C]]
> +; SI-DENORM-SLOWFMAF-DAG: v_subrev_f32_e32 [[RESULT1:v[0-9]+]], [[D]], [[TMP]]
> +
> ; SI-DAG: buffer_store_dword [[RESULT0]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> ; SI-DAG: buffer_store_dword [[RESULT1]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4{{$}}
> ; SI: s_endpgm
> @@ -307,9 +390,18 @@ define void @combine_to_mad_fsub_2_f32_2uses_mul(float addrspace(1)* noalias %ou
> ; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> ; SI-DAG: buffer_load_dword [[D:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:12{{$}}
> ; SI-DAG: buffer_load_dword [[E:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:16{{$}}
> -; SI: v_mul_f32_e32 [[TMP0:v[0-9]+]], [[E]], [[D]]
> -; SI: v_fma_f32 [[TMP1:v[0-9]+]], [[A]], [[B]], [[TMP0]]
> -; SI: v_subrev_f32_e32 [[RESULT:v[0-9]+]], [[C]], [[TMP1]]
> +
> +; SI-STD: v_mul_f32_e32 [[TMP0:v[0-9]+]], [[E]], [[D]]
> +; SI-STD: v_fma_f32 [[TMP1:v[0-9]+]], [[A]], [[B]], [[TMP0]]
> +; SI-STD: v_subrev_f32_e32 [[RESULT:v[0-9]+]], [[C]], [[TMP1]]
> +
> +; SI-DENORM: v_fma_f32 [[TMP0:v[0-9]+]], [[D]], [[E]], -[[C]]
> +; SI-DENORM: v_fma_f32 [[RESULT:v[0-9]+]], [[A]], [[B]], [[TMP0]]
> +
> +; SI-DENORM-SLOWFMAF: v_mul_f32_e32 [[TMP0:v[0-9]+]], [[E]], [[D]]
> +; SI-DENORM-SLOWFMAF: v_fma_f32 [[TMP1:v[0-9]+]], [[A]], [[B]], [[TMP0]]
> +; SI-DENORM-SLOWFMAF: v_subrev_f32_e32 [[RESULT:v[0-9]+]], [[C]], [[TMP1]]
> +
> ; SI: buffer_store_dword [[RESULT]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> define void @aggressive_combine_to_mad_fsub_0_f32(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
> %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> @@ -343,9 +435,18 @@ define void @aggressive_combine_to_mad_fsub_0_f32(float addrspace(1)* noalias %o
> ; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> ; SI-DAG: buffer_load_dword [[D:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:12{{$}}
> ; SI-DAG: buffer_load_dword [[E:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:16{{$}}
> -; SI: v_mul_f32_e32 [[TMP0:v[0-9]+]], [[E]], [[D]]
> -; SI: v_fma_f32 [[TMP1:v[0-9]+]], [[B]], [[C]], [[TMP0]]
> -; SI: v_subrev_f32_e32 [[RESULT:v[0-9]+]], [[TMP1]], [[A]]
> +
> +; SI-STD: v_mul_f32_e32 [[TMP0:v[0-9]+]], [[E]], [[D]]
> +; SI-STD: v_fma_f32 [[TMP1:v[0-9]+]], [[B]], [[C]], [[TMP0]]
> +; SI-STD: v_subrev_f32_e32 [[RESULT:v[0-9]+]], [[TMP1]], [[A]]
> +
> +; SI-DENORM: v_fma_f32 [[TMP0:v[0-9]+]], -[[D]], [[E]], [[A]]
> +; SI-DENORM: v_fma_f32 [[RESULT:v[0-9]+]], -[[B]], [[C]], [[TMP0]]
> +
> +; SI-DENORM-SLOWFMAF: v_mul_f32_e32 [[TMP0:v[0-9]+]], [[E]], [[D]]
> +; SI-DENORM-SLOWFMAF: v_fma_f32 [[TMP1:v[0-9]+]], [[B]], [[C]], [[TMP0]]
> +; SI-DENORM-SLOWFMAF: v_subrev_f32_e32 [[RESULT:v[0-9]+]], [[TMP1]], [[A]]
> +
> ; SI: buffer_store_dword [[RESULT]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> ; SI: s_endpgm
> define void @aggressive_combine_to_mad_fsub_1_f32(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
> @@ -379,9 +480,19 @@ define void @aggressive_combine_to_mad_fsub_1_f32(float addrspace(1)* noalias %o
> ; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> ; SI-DAG: buffer_load_dword [[D:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:12{{$}}
> ; SI-DAG: buffer_load_dword [[E:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:16{{$}}
> -; SI-DAG: v_mad_f32 [[TMP:v[0-9]+]], [[D]], [[E]], -[[C]]
> -; SI-DAG: v_mad_f32 [[RESULT:v[0-9]+]], [[A]], [[B]], [[TMP]]
> -; SI-DAG: buffer_store_dword [[RESULT]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +
> +; SI-STD: v_mad_f32 [[TMP:v[0-9]+]], [[D]], [[E]], -[[C]]
> +; SI-STD: v_mad_f32 [[RESULT:v[0-9]+]], [[A]], [[B]], [[TMP]]
> +
> +; SI-DENORM: v_fma_f32 [[TMP:v[0-9]+]], [[D]], [[E]], -[[C]]
> +; SI-DENORM: v_fma_f32 [[RESULT:v[0-9]+]], [[A]], [[B]], [[TMP]]
> +
> +; SI-DENORM-SLOWFMAF: v_mul_f32_e32 [[TMP0:v[0-9]+]], [[E]], [[D]]
> +; SI-DENORM-SLOWFMAF: v_mul_f32_e32 [[TMP1:v[0-9]+]], [[B]], [[A]]
> +; SI-DENORM-SLOWFMAF: v_add_f32_e32 [[TMP2:v[0-9]+]], [[TMP0]], [[TMP1]]
> +; SI-DENORM-SLOWFMAF: v_subrev_f32_e32 [[RESULT:v[0-9]+]], [[C]], [[TMP2]]
> +
> +; SI: buffer_store_dword [[RESULT]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> ; SI: s_endpgm
> define void @aggressive_combine_to_mad_fsub_2_f32(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
> %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> @@ -415,9 +526,19 @@ define void @aggressive_combine_to_mad_fsub_2_f32(float addrspace(1)* noalias %o
> ; SI-DAG: buffer_load_dword [[C:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8{{$}}
> ; SI-DAG: buffer_load_dword [[D:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:12{{$}}
> ; SI-DAG: buffer_load_dword [[E:v[0-9]+]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:16{{$}}
> -; SI-DAG: v_mad_f32 [[TMP:v[0-9]+]], -[[D]], [[E]], [[A]]
> -; SI-DAG: v_mad_f32 [[RESULT:v[0-9]+]], -[[B]], [[C]], [[TMP]]
> -; SI-DAG: buffer_store_dword [[RESULT]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> +
> +; SI-STD: v_mad_f32 [[TMP:v[0-9]+]], -[[D]], [[E]], [[A]]
> +; SI-STD: v_mad_f32 [[RESULT:v[0-9]+]], -[[B]], [[C]], [[TMP]]
> +
> +; SI-DENORM: v_fma_f32 [[TMP:v[0-9]+]], -[[D]], [[E]], [[A]]
> +; SI-DENORM: v_fma_f32 [[RESULT:v[0-9]+]], -[[B]], [[C]], [[TMP]]
> +
> +; SI-DENORM-SLOWFMAF: v_mul_f32_e32 [[TMP0:v[0-9]+]], [[E]], [[D]]
> +; SI-DENORM-SLOWFMAF: v_mul_f32_e32 [[TMP1:v[0-9]+]], [[C]], [[B]]
> +; SI-DENORM-SLOWFMAF: v_add_f32_e32 [[TMP2:v[0-9]+]], [[TMP0]], [[TMP1]]
> +; SI-DENORM-SLOWFMAF: v_subrev_f32_e32 [[RESULT:v[0-9]+]], [[TMP2]], [[A]]
> +
> +; SI: buffer_store_dword [[RESULT]], v{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
> ; SI: s_endpgm
> define void @aggressive_combine_to_mad_fsub_3_f32(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) #1 {
> %tid = tail call i32 @llvm.r600.read.tidig.x() #0
> --
> 2.2.1
>
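For reference, the selection that the new mad-combine.ll prefixes (SI-STD, SI-DENORM, SI-DENORM-SLOWFMAF) appear to encode for a plain f32 multiply-add can be sketched as below. This is a standalone model, not code from the patch: SubtargetSketch, F32MulAdd and selectF32MulAdd are hypothetical names, the two booleans stand in for hasFP32Denormals() and hasFastFMAF32(), and the AllowContract parameter is an assumption standing in for -fp-contract=fast, which the denormal RUN lines pass.

```cpp
// Hypothetical stand-in for the two subtarget queries used in the patch.
struct SubtargetSketch {
  bool FP32Denormals; // models Subtarget->hasFP32Denormals()
  bool FastFMAF32;    // models Subtarget->hasFastFMAF32()
};

// The three shapes the FileCheck lines distinguish for (fadd (fmul a, b), c).
enum class F32MulAdd { Mad, Fma, MulThenAdd };

// Mirrors the f32 case of isFMAFasterThanFMulAndFAdd after patch 5/5:
// fma is only reported as profitable when mad cannot be used.
inline bool isF32FMAFasterThanMulAdd(const SubtargetSketch &ST) {
  return ST.FP32Denormals && ST.FastFMAF32;
}

// Assumption: AllowContract models -fp-contract=fast (or unsafe math).
inline F32MulAdd selectF32MulAdd(const SubtargetSketch &ST,
                                 bool AllowContract) {
  // Without denormals, v_mad_f32 is always formed (SI-STD).
  if (!ST.FP32Denormals)
    return F32MulAdd::Mad;
  // With denormals, mad is off the table; use fma only if it is fast
  // and contraction is permitted (SI-DENORM).
  if (AllowContract && isF32FMAFasterThanMulAdd(ST))
    return F32MulAdd::Fma;
  // Otherwise keep the separate v_mul_f32 / v_add_f32 (SI-DENORM-SLOWFMAF).
  return F32MulAdd::MulThenAdd;
}
```

Under this model the tahiti run with +fp32-denormals maps to Fma, the verde run maps to MulThenAdd, and every run without the attribute maps to Mad, which is what the combine_to_mad_f32_0 checks expect.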
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits