Improve FMA folding

Tue Sep 16 15:08:37 PDT 2014

Hal,

Thanks for the quick review. I don't have commit access.

Olivier

Sent from my iPhone

> On Sep 16, 2014, at 5:41 PM, "Hal Finkel" <hfinkel at anl.gov> wrote:
>
> Hi,
>
> +bool PPCTargetLowering::enableAggressiveFMAFusion(EVT VT) const {
> +  return VT.isFloatingPoint();
> +}
>
> This should never be called on anything but floating-point types (we
don't have integer FMA nodes). I'd write this as:
>
> bool PPCTargetLowering::enableAggressiveFMAFusion(EVT VT) const {
>   assert(VT.isFloatingPoint() && "Non-floating-point FMA?");
>   return true;
> }
>
> otherwise, LGTM. Do you have commit access?
>
>  -Hal
>
> ----- Original Message -----
> > From: "Olivier H Sallenave" <ohsallen at us.ibm.com>
> > To: "Hal Finkel" <hfinkel at anl.gov>
> > Cc: llvm-commits at cs.uiuc.edu, "Samuel F Antao" <sfantao at us.ibm.com>
> > Sent: Tuesday, September 16, 2014 4:28:27 PM
> > Subject: Re: Improve FMA folding
> >
> >
> >
> > Hi Hal,
> >
> > Thanks for the advice. Attached is the suggested change.
> >
> > Cheers,
> > Olivier
> >
> > (See attached file: fma.diff)
> >
> > Inactive hide details for Hal Finkel ---09/16/2014 02:02:07 AM---Hi
> > Olivier, I don't want to duplicate the DAGCombine logic in Hal
> > Finkel ---09/16/2014 02:02:07 AM---Hi Olivier, I don't want to
> > duplicate the DAGCombine logic in the PPC backend. Instead, please
> > add a
> >
> > From: Hal Finkel <hfinkel at anl.gov>
> > To: Olivier H Sallenave/Watson/IBM at IBMUS
> > Cc: Samuel F Antao/Watson/IBM at IBMUS, <llvm-commits at cs.uiuc.edu>
> > Date: 09/16/2014 02:02 AM
> > Subject: Re: Improve FMA folding
> >
> >
> >
> >
> > Hi Olivier,
> >
> > I don't want to duplicate the DAGCombine logic in the PPC backend.
> > Instead, please add a TLI callback to optionally enable the
> > transformation in DAGCombine. Make this callback take the value type
> > as a parameter, and return false for all types by default. Then
> > override this callback in PPCISelLowering to return true (perhaps
> > for all types). You can call it something like
> > enableAggressiveFMAFormation, or enablePartialReplacementWithFMAs,
> > etc.
> >
> > Thanks again,
> > Hal
> >
> > ----- Original Message -----
> > > From: "Olivier H Sallenave" <ohsallen at us.ibm.com>
> > > To: llvm-commits at cs.uiuc.edu
> > > Cc: "Hal Finkel" <hfinkel at anl.gov>, "Samuel F Antao"
> > > <sfantao at us.ibm.com>
> > > Sent: Wednesday, September 10, 2014 10:41:01 AM
> > > Subject: Fw: Improve FMA folding
> > >
> > >
> > >
> > > Hi,
> > >
> > > We should indeed check that the number of cycles has been reduced
> > > (through TTI::getArithmeticInstrCost). However, AArch64 doesn't
> > > override the cost of FMA, FADD and FMUL, therefore they all cost 2
> > > by default.
> > >
> > > Instead, here is a patch specific to the PPC backend. I followed
> > > Kevin's proposal to always combine, which should be fine for this
> > > target.
> > >
> > > Olivier
> > >
> > > (See attached file: fma-ppc.diff)
> > >
> > > ----- Forwarded by Olivier H Sallenave/Watson/IBM on 09/10/2014
> > > 11:26
> > > AM -----
> > >
> > > From: Olivier H Sallenave/Watson/IBM
> > > To: Hal Finkel <hfinkel at anl.gov>
> > > Cc: llvm-commits at cs.uiuc.edu
> > > Date: 09/04/2014 05:26 PM
> > > Subject: Re: Improve FMA folding
> > >
> > >
> > >
> > > Hi Hal,
> > >
> > > Thanks for your feedback. Attached is the patch with the
> > > corrections
> > > and the regression test.
> > >
> > > Regarding the recursive call, I believe it is fine since it's
> > > guarded
> > > by some conditions which ensure it can only take an FNEG node as
> > > parameter — therefore the depth of the recursion cannot be greater
> > > than 1. If you still don't like it, we can replace the recursive
> > > call with an iteration over the uses of the FNEG node.
> > >
> > > More importantly, I found out that the following tests now fail:
> > > CodeGen/AArch64/fp-dp3.ll
> > > CodeGen/AArch64/neon-fma.ll
> > >
> > > The failing tests are basically the same, and check explicitly that
> > > FMUL isn't combined with two distinct FADD (see below). I am
> > > wondering why ARM doesn't want to combine here. It seems that FMUL
> > > and FMADD take 2 cycles, whereas FADD takes 1 cycle. Could that be
> > > the reason? If so, maybe we could check the latency of the
> > > instructions we combine?
> > >
> > > ; Another set of tests that check for multiply single use
> > >
> > > define float @test_fmadd_unfused_su(float %a, float %b, float %c) {
> > > ; CHECK-LABEL: test_fmadd_unfused_su:
> > > %prod = fmul float %b, %c
> > > %sum = fadd float %a, %prod
> > > %res = fadd float %sum, %prod
> > > ; CHECK-NOT: fmadd {{s[0-9]+}}, {{s[0-9]+}}, {{s[0-9]+}},
> > > {{s[0-9]+}}
> > > ; CHECK: fmul {{s[0-9]+}}, {{s[0-9]+}}, {{s[0-9]+}}
> > > ; CHECK: fadd {{s[0-9]+}}, {{s[0-9]+}}, {{s[0-9]+}}
> > > ; CHECK: fadd {{s[0-9]+}}, {{s[0-9]+}}, {{s[0-9]+}}
> > > ret float %res
> > > }
> > >
> > > Thanks very much for your help,
> > > Olivier
> > >
> > > (See attached file: fold-fma.diff)
> > >
> > > Inactive hide details for Hal Finkel ---09/03/2014 03:29:08 PM---Hi
> > > Olivier, Thanks for working on this. A few comments:Hal Finkel
> > > ---09/03/2014 03:29:08 PM---Hi Olivier, Thanks for working on this.
> > > A few comments:
> > >
> > > From: Hal Finkel <hfinkel at anl.gov>
> > > To: Olivier H Sallenave/Watson/IBM at IBMUS
> > > Cc: <llvm-commits at cs.uiuc.edu>
> > > Date: 09/03/2014 03:29 PM
> > > Subject: Re: Improve FMA folding
> > >
> > >
> > >
> > >
> > > Hi Olivier,
> > >
> > > Thanks for working on this. A few comments:
> > >
> > > +// operation. This is used to ensure that combining into FMA will
> > > actually
> > >
> > > into FMA -> into an FMA
> > >
> > > + if (N->getOpcode() != ISD::FMUL &&
> > > + N->getOpcode() != ISD::FNEG)
> > > + return false;
> > >
> > > These conditions guard every call to the function, so this can be
> > > an
> > > assert; something like this:
> > >
> > > assert((N->getOpcode() == ISD::FMUL || N->getOpcode() == ISD::FNEG)
> > > &&
> > > "Expecting fmul or fneg node");
> > >
> > > + else if (!isAlwaysFoldedIntoFMA(*UI))
> > > + return false;
> > >
> > > Recursion here seems overly-general. Is that the best way to do it?
> > > Should you have a depth limit?
> > >
> > > And finally, you'll need an LLVM IR-level CodeGen regression test.
> > > Look in test/CodeGen/PowerPC or test/CodeGen/X86 (it should be
> > > something like test/CodeGen/PowerPC/fma.ll).
> > >
> > > Thanks,
> > > Hal
> > >
> > > ----- Original Message -----
> > > > From: "Olivier H Sallenave" <ohsallen at us.ibm.com>
> > > > To: llvm-commits at cs.uiuc.edu
> > > > Sent: Wednesday, September 3, 2014 11:21:53 AM
> > > > Subject: Improve FMA folding
> > > >
> > > >
> > > >
> > > >
> > > > Hi,
> > > >
> > > > Here is a patch to improve FMA folding in DAGCombiner.cpp.
> > > >
> > > > Previously, when looking at a FMUL and a FADD, it would combine
> > > > them
> > > > into FMA if the FMUL is only used by the FADD. This was to ensure
> > > > that the FMUL will be removed, meaning that the number of
> > > > instructions will actually be reduced. However, if the FMUL has
> > > > many
> > > > uses but can be combined with each one of them (i.e. they are
> > > > either
> > > > FADD or FSUB), it should still be combined. Attached is a patch
> > > > which checks all uses of the FMUL.
> > > >
> > > > Olivier
> > > >
> > > >
> > > > (See attached file: fma.diff) (See attached file: test-fma.c)
> > > >
> > > > ----- Forwarded by Olivier H Sallenave/Watson/IBM on 09/03/2014
> > > > 12:08
> > > > PM -----
> > > >
> > > > From: Hal Finkel <hfinkel at anl.gov>
> > > > To: Kevin K O'Brien/Watson/IBM at IBMUS
> > > > Cc: Samuel F Antao/Watson/IBM at IBMUS, <llvmdev at cs.uiuc.edu>,
> > > > Olivier
> > > > H
> > > > Sallenave/Watson/IBM at IBMUS
> > > > Date: 08/26/2014 11:33 AM
> > > > Subject: Re: [LLVMdev] Multiply-add combining
> > > >
> > > >
> > > >
> > > >
> > > > Kevin, Olivier,
> > > >
> > > > The target independent heuristic could certainly check all uses,
> > > > patches welcome. We could also consider each case separately, as
> > > > Kevin suggests, but that might not be optimal on targets with
> > > > only
> > > > one floating-point pipeline, so we'd need to make it opt-in. You
> > > > should also look at the MachineCombiner Pass (added in r214832,
> > > > currently only used by the ARM backend I think) the tries to
> > > > solve
> > > > this problem in a more-sophisticated way.
> > > >
> > > > -Hal
> > > >
> > > > ----- Original Message -----
> > > > > From: "Kevin K O'Brien" <caomhin at us.ibm.com>
> > > > > To: "Olivier H Sallenave" <ohsallen at us.ibm.com>
> > > > > Cc: "Samuel F Antao" <sfantao at us.ibm.com>, llvmdev at cs.uiuc.edu
> > > > > Sent: Tuesday, August 26, 2014 10:16:23 AM
> > > > > Subject: Re: [LLVMdev] Multiply-add combining
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > Hi Olivier,
> > > > > I think we discussed this last Thursday? My feeling is that
> > > > > each
> > > > > use
> > > > > of the multiply can be considered separately. If it can be
> > > > > combined,
> > > > > then we should do so. The multiply should be left in place and
> > > > > removed by a dead code elimination pass sometime later. This is
> > > > > what
> > > > > TOBEY does. If you want me to explain the XL method in more
> > > > > detail,
> > > > > come talk to me.
> > > > >
> > > > > Kevin
> > > > > ----------------------------------------------
> > > > > Kevin O'Brien
> > > > > Manager, Advanced Compiler Technology
> > > > > IBM T.J Watson Research Center, Yorktown Heights, NY
> > > > >
> > > > > Inactive hide details for Olivier H Sallenave---08/26/2014
> > > > > 11:12:04
> > > > > AM---Hi, I tried to compile the following using
> > > > > -ffp-contraOlivier
> > > > > H
> > > > > Sallenave---08/26/2014 11:12:04 AM---Hi, I tried to compile the
> > > > > following using -ffp-contract=fast:
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > From:
> > > > > Olivier H Sallenave/Watson/IBM
> > > > >
> > > > >
> > > > >
> > > > > To:
> > > > > llvmdev at cs.uiuc.edu,
> > > > >
> > > > >
> > > > >
> > > > > Cc:
> > > > > Samuel F Antao/Watson/IBM at IBMUS, Kevin K
> > > > > O'Brien/Watson/IBM at IBMUS
> > > > >
> > > > >
> > > > >
> > > > > Date:
> > > > > 08/26/2014 11:12 AM
> > > > >
> > > > >
> > > > >
> > > > > Subject:
> > > > > Multiply-add combining
> > > > >
> > > > >
> > > > > Hi,
> > > > >
> > > > > I tried to compile the following using -ffp-contract=fast:
> > > > >
> > > > > %mul = fmul double %sub5, %x
> > > > > %add = fadd double %add6, %mul
> > > > > %sub = fsub double %sub5, %mul
> > > > >
> > > > > I expected fadd and fsub to be contracted with fmul, which
> > > > > didn't
> > > > > happen.
> > > > >
> > > > > When looking in DAGCombiner.cpp, it appears the result of the
> > > > > fmul
> > > > > needs to be used only once, which isn't the case here as it is
> > > > > used
> > > > > by both the fadd and the fsub:
> > > > >
> > > > > // fold (fadd (fmul x, y), z) -> (fma x, y, z)
> > > > > if (N0.getOpcode() == ISD::FMUL && N0.hasOneUse())
> > > > > return DAG.getNode(ISD::FMA, SDLoc(N), VT, N0.getOperand(0),
> > > > > N0.getOperand(1), N1);
> > > > >
> > > > > This heuristic looks a little conservative, could we instead
> > > > > check
> > > > > that every instruction using the result of the fmul are
> > > > > combinable
> > > > > (i.e., they are either fadd or fsub)?
> > > > >
> > > > >
> > > > > Thanks in advance,
> > > > > Olivier
> > > > >
> > > > >
> > > > > _______________________________________________
> > > > > LLVM Developers mailing list
> > > > > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> > > > > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> > > > >
> > > >
> > > > --
> > > > Hal Finkel
> > > > Assistant Computational Scientist
> > > > Leadership Computing Facility
> > > > Argonne National Laboratory
> > > >
> > > >
> > > > _______________________________________________
> > > > llvm-commits mailing list
> > > > llvm-commits at cs.uiuc.edu
> > > > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> > > >
> > >
> > > --
> > > Hal Finkel
> > > Assistant Computational Scientist
> > > Leadership Computing Facility
> > > Argonne National Laboratory
> > >
> > >
> > >
> >
> > --
> > Hal Finkel
> > Assistant Computational Scientist
> > Leadership Computing Facility
> > Argonne National Laboratory
> >
> >
> >
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20140916/25c8a853/attachment.html>