[PATCH] D23583: [AArch64] Add feature has-fast-fma

James Greenhalgh via llvm-commits llvm-commits at lists.llvm.org
Thu Aug 18 06:34:04 PDT 2016


jgreenhalgh added a subscriber: jgreenhalgh.
jgreenhalgh added a comment.

In https://reviews.llvm.org/D23583#517938, @jmolloy wrote:

> Hi,
>
> I also have concerns here. The TargetLowering hook states:
>
>   /// Return true if target always beneficiates from combining into FMA for a
>   /// given value type. This must typically return false on targets where FMA
>   /// takes more cycles to execute than *FADD*.
>   
>
> Whereas you say:
>
>   In spite of what the original author intended, I observed that the extra folds are worth it if FMA is as quick as *FMUL* instead.
>   
>
> Which is correct? Or are you using this hook in a way the hook users don't intend? The wording used is vague and I really think we need to have more detail about what property of Exynos-M1 makes this good for Exynos but not for any other microarchitecture.


I'm not an expert on the codebase. As far as I can see, returning true here would enable two classes of additional transformations in DAGCombiner.cpp (DAGCombiner::visitFADDForFMACombine):

  (fadd (fmul x, y), z) -> (fma x, y, z)
  (fadd x, (fmul y, z)) -> (fma y, z, x)
  (fsub (fmul x, y), z) -> (fma x, y, (fneg z))
  (fsub x, (fmul y, z)) -> (fma (fneg y), z, x)

Where the FMUL has multiple uses (the single-use case is already permitted).

Presumably this is where the "faster than an FADD" comes from. This transform is FMUL + FADD + [use of FMUL] -> FMA + FMUL + [use of FMUL].

Is this really a good optimisation for Exynos-M1? The benefit here seems exceptionally unclear to me, and I'd personally be surprised if it were uniformly good, as it would seem to increase competition for resources in the processor. I suppose if you were a CPU on which forwarding a result from FMUL to FADD incurred a penalty that FMA didn't face, then this might give you a faster result from the FMA, and if that was on your critical path then you might see a win. But other scenarios are just as likely: the FMUL itself being on the critical path (and now competing with the FMA for multiply resources), or the second operand to the FADD coming from a long-latency instruction (so executing the FMUL while you waited would have hidden the latency).

I wonder whether really the gains for Exynos-M1 come from the second class of optimisation:

  (fadd (fma x, y, (fmul u, v)), z) -> (fma x, y, (fma u, v, z))
  (fadd x, (fma y, z, (fmul u, v))) -> (fma y, z, (fma u, v, x))

These look generally applicable, as long as forwarding into the accumulator operand has the same cost whether the producer is an FMA or an FMUL. This one should be good across multiple AArch64 cores.

There is a third class of optimisation, relating to folding through floating-point extend operations. For AArch64, LookThroughFPExt will return false, so these won't help you.

  (fadd (fma x, y, (fpext (fmul u, v))), z) -> (fma x, y, (fma (fpext u), (fpext v), z))

I'd guess that most of your benefit comes from the second class of folds, and that those are likely to be good across microarchitectures. The first class, I think, is a strange set of optimisations, and it isn't clear to me why it should uniformly be a good fold, even on microarchitectures where you can contrive a scenario in which there is a benefit.


Repository:
  rL LLVM

https://reviews.llvm.org/D23583




