Improve FMA folding

Wed Sep 3 12:28:40 PDT 2014

Hi Olivier,

Thanks for working on this. A few comments:

+// operation. This is used to ensure that combining into FMA will actually

into FMA -> into an FMA

+  if (N->getOpcode() != ISD::FMUL &&
+      N->getOpcode() != ISD::FNEG)
+    return false;

These conditions guard every call to the function, so this can be an assert; something like this:

  assert((N->getOpcode() == ISD::FMUL || N->getOpcode() == ISD::FNEG) &&
         "Expecting fmul or fneg node");

+        else if (!isAlwaysFoldedIntoFMA(*UI))
+          return false;

Recursion here seems overly-general. Is that the best way to do it? Should you have a depth limit?

And finally, you'll need an LLVM IR-level CodeGen regression test. Look in test/CodeGen/PowerPC or test/CodeGen/X86 (it should be something like test/CodeGen/PowerPC/fma.ll).

Thanks,
Hal

----- Original Message -----
> From: "Olivier H Sallenave" <ohsallen at us.ibm.com>
> To: llvm-commits at cs.uiuc.edu
> Sent: Wednesday, September 3, 2014 11:21:53 AM
> Subject: Improve FMA folding
> 
> 
> 
> 
> Hi,
> 
> Here is a patch to improve FMA folding in DAGCombiner.cpp.
> 
> Previously, when looking at a FMUL and a FADD, it would combine them
> into FMA if the FMUL is only used by the FADD. This was to ensure
> that the FMUL will be removed, meaning that the number of
> instructions will actually be reduced. However, if the FMUL has many
> uses but can be combined with each one of them (i.e. they are either
> FADD or FSUB), it should still be combined. Attached is a patch
> which checks all uses of the FMUL.
> 
> Olivier
> 
> 
> (See attached file: fma.diff) (See attached file: test-fma.c)
> 
> ----- Forwarded by Olivier H Sallenave/Watson/IBM on 09/03/2014 12:08
> PM -----
> 
> From: Hal Finkel <hfinkel at anl.gov>
> To: Kevin K O'Brien/Watson/IBM at IBMUS
> Cc: Samuel F Antao/Watson/IBM at IBMUS, <llvmdev at cs.uiuc.edu>, Olivier H
> Sallenave/Watson/IBM at IBMUS
> Date: 08/26/2014 11:33 AM
> Subject: Re: [LLVMdev] Multiply-add combining
> 
> 
> 
> 
> Kevin, Olivier,
> 
> The target independent heuristic could certainly check all uses,
> patches welcome. We could also consider each case separately, as
> Kevin suggests, but that might not be optimal on targets with only
> one floating-point pipeline, so we'd need to make it opt-in. You
> should also look at the MachineCombiner Pass (added in r214832,
> currently only used by the ARM backend I think) the tries to solve
> this problem in a more-sophisticated way.
> 
> -Hal
> 
> ----- Original Message -----
> > From: "Kevin K O'Brien" <caomhin at us.ibm.com>
> > To: "Olivier H Sallenave" <ohsallen at us.ibm.com>
> > Cc: "Samuel F Antao" <sfantao at us.ibm.com>, llvmdev at cs.uiuc.edu
> > Sent: Tuesday, August 26, 2014 10:16:23 AM
> > Subject: Re: [LLVMdev] Multiply-add combining
> > 
> > 
> > 
> > 
> > Hi Olivier,
> > I think we discussed this last Thursday? My feeling is that each
> > use
> > of the multiply can be considered separately. If it can be
> > combined,
> > then we should do so. The multiply should be left in place and
> > removed by a dead code elimination pass sometime later. This is
> > what
> > TOBEY does. If you want me to explain the XL method in more detail,
> > come talk to me.
> > 
> > Kevin
> > ----------------------------------------------
> > Kevin O'Brien
> > Manager, Advanced Compiler Technology
> > IBM T.J Watson Research Center, Yorktown Heights, NY
> > 
> > Inactive hide details for Olivier H Sallenave---08/26/2014 11:12:04
> > AM---Hi, I tried to compile the following using -ffp-contraOlivier
> > H
> > Sallenave---08/26/2014 11:12:04 AM---Hi, I tried to compile the
> > following using -ffp-contract=fast:
> > 
> > 
> > 
> > 
> > From:
> > Olivier H Sallenave/Watson/IBM
> > 
> > 
> > 
> > To:
> > llvmdev at cs.uiuc.edu,
> > 
> > 
> > 
> > Cc:
> > Samuel F Antao/Watson/IBM at IBMUS, Kevin K O'Brien/Watson/IBM at IBMUS
> > 
> > 
> > 
> > Date:
> > 08/26/2014 11:12 AM
> > 
> > 
> > 
> > Subject:
> > Multiply-add combining
> > 
> > 
> > Hi,
> > 
> > I tried to compile the following using -ffp-contract=fast:
> > 
> > %mul = fmul double %sub5, %x
> > %add = fadd double %add6, %mul
> > %sub = fsub double %sub5, %mul
> > 
> > I expected fadd and fsub to be contracted with fmul, which didn't
> > happen.
> > 
> > When looking in DAGCombiner.cpp, it appears the result of the fmul
> > needs to be used only once, which isn't the case here as it is used
> > by both the fadd and the fsub:
> > 
> > // fold (fadd (fmul x, y), z) -> (fma x, y, z)
> > if (N0.getOpcode() == ISD::FMUL && N0.hasOneUse())
> > return DAG.getNode(ISD::FMA, SDLoc(N), VT, N0.getOperand(0),
> > N0.getOperand(1), N1);
> > 
> > This heuristic looks a little conservative, could we instead check
> > that every instruction using the result of the fmul are combinable
> > (i.e., they are either fadd or fsub)?
> > 
> > 
> > Thanks in advance,
> > Olivier
> > 
> > 
> > _______________________________________________
> > LLVM Developers mailing list
> > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> > 
> 
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
> 
> 
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory