[PATCH] D23583: [AArch64] Add feature has-fast-fma

Evandro Menezes via llvm-commits llvm-commits at lists.llvm.org
Thu Aug 18 08:39:39 PDT 2016


evandro added a comment.

In https://reviews.llvm.org/D23583#519379, @jgreenhalgh wrote:

> Presumably this is where the "faster than an FADD" comes from. This transform is FMUL + FADD + [use of FMUL] -> FMA + FMUL + [use of FMUL].


There are other cases, such as FADD + FMUL + FMA -> FMA + FMA.  Probably a better way to describe the use of `enableAggressiveFMAFusion` is in terms of the relative cost of FMA to FADD and FMUL.

> Is this really a good optimisation for Exynos-M1? I think this is of an exceptionally unclear benefit. I'd personally be surprised if it was uniformly good as it would seem to increase competition for resources in the processor. I suppose if you were a CPU on which forwarding a result from FMUL to FADD incurred a penalty that FMA didn't face, then this might give you a faster result from the FMA, and if that was on your critical path then you might see a win. But surely other scenarios like the FMUL being on the critical path (and now competing with FMA for multiply resources), or the second operand to the FADD coming from a long latency instruction (so executing the FMUL while you waited would have hidden the latency) are both possible and likely.


On Exynos M1, FMA uses the same resources as FMUL and avoids the additional resources required by FADD, so it tends to be beneficial there.  It's not always a win, though, such as when separate, finer grained FADD and FMUL operations allow more parallelism; but, in my experiments, such workloads were few.

> I wonder whether really the gains for Exynos-M1 come from the second class of optimisation:

> 

>   (fadd (fma x, y, (fmul u, v)), z) -> (fma x, y, (fma u, v, z))

>   (fadd x, (fma y, z, (fmul u, v))) -> (fma y, z, (fma u, v, x))

>    

> 

> These look generally applicable, as long as forwarding to the accumulator operand has the same cost whether you are coming from an FMA or an FMUL. This one should be good across multiple AArch64 cores.


Indeed.

> There is a third class of optimisation, but these relate to folding through extend operations. For AArch64 LookThroughFPExt will return false, so these won't help you.

> 

>   (fadd (fma x, y, (fpext (fmul u, v))), z) -> (fma x, y, (fma (fpext u), (fpext v), z))


True that.

> I'd guess that most of your benefit would come from the second class of folds, and that these are likely to be good across microarchitectures. The first class I think are a strange set of optimisations, and it isn't clear to me why that should uniformly be a good fold, even on microarchitectures where you can contrive a scenario where there is a benefit.


As I said above, FMA may free up the resources used by either FADD or FMUL for other computations.  But that will depend on the target.


Repository:
  rL LLVM

https://reviews.llvm.org/D23583
