[PATCH] D23583: [AArch64] Add feature has-fast-fma

James Greenhalgh via llvm-commits llvm-commits at lists.llvm.org
Thu Aug 18 10:03:35 PDT 2016


jgreenhalgh added a comment.

In https://reviews.llvm.org/D23583#519490, @evandro wrote:

> In https://reviews.llvm.org/D23583#519379, @jgreenhalgh wrote:
>
> > Presumably this is where the "faster than an FADD" comes from. This transform is FMUL + FADD + [use of FMUL] -> FMA + FMUL + [use of FMUL].
>
>
> There are other cases, such as FADD + FMUL + FMA -> FMA + FMA.  Probably a better way to describe the use of `enableAggressiveFMAFusion` is the relative cost of FMA to FADD and FMUL.


Yes, as you'll have seen from the rest of my review. But this transform in particular looks like trouble. Again, for clarity:

  (fadd (fmul x, y), z) -> (fma x, y, z) 

where the multiply has **multiple** uses (if it has a single use, you catch it regardless of `enableAggressiveFMAFusion`).

So the transformation here always results in both an FMUL and an FMA in the instruction stream. Imagine:

  fmul s0, s0, s0
  fadd s1, s1, s0
  str s0, [x0]
  str s1, [x1]

The transformation above would generate:

  fmul s2, s0, s0
  fmadd s1, s0, s0, s1
  str s2, [x0]
  str s1, [x1]
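As a rough C sketch of where such code comes from (the function and variable names are mine, not from the patch), the multiply result escapes through a store, so it has two uses:

```c
/* Illustrative only: 'm' has two uses (the store and the add), so after
   the fadd -> fma rewrite the compiler must still keep the fmul alive,
   yielding both an fmul and an fmadd as in the assembly above. */
float multi_use(float x, float y, float *out) {
    float m = x * x;   /* fmul s0, s0, s0 */
    *out = m;          /* str  s0, [x0]   */
    return y + m;      /* fadd s1, s1, s0 */
}
```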

This is why, for these transforms, the hook talks about the relative cost of an FADD to an FMA.

I would imagine for many cores, including Exynos-M1, this would not be a good transform to enable.

> > Is this really a good optimisation for Exynos-M1?
>
> In Exynos M1, FMA uses the same resources as FMUL and saves using the other resources required by FADD, so it tends to be beneficial on it. It's not always a win, though, such as when having finer-grained FADD and FMUL allows more parallelism, but, in my experiments, there were few such workloads.

Right. So given that the transformation will increase the traffic to your multiply unit, are you sure it is something you want to enable? As you describe, the FMA is not going to be faster here, and the code generated is objectively worse.

> > I wonder whether really the gains for Exynos-M1 come from the second class of optimisation:
> >
> >   (fadd (fma x, y, (fmul u, v)), z) -> (fma x, y, (fma u, v, z))
> >   (fadd x, (fma y, z, (fmul u, v))) -> (fma y, z, (fma u, v, x))
> >
> > These look generally applicable, as long as forwarding to the accumulator operand has the same cost whether you are coming from an FMA or an FMUL. This one should be good across multiple AArch64 cores.
>
> Indeed.


We can agree that these are good optimisations. No arguments from me there. In fact, these are good enough that they should be enabled for many AArch64 cores. If these were the only transforms enabled by `enableAggressiveFMAFusion`, I'd suggest that you'd actually want your target feature to have the opposite sense to what it has right now: most cores should enjoy these transforms, and those that don't could call it out in a cost model or feature flag.

It seems to me that decoupling the bad transforms from the good, enabling the good transforms for most AArch64 CPUs, and leaving the bad transform off would be the more beneficial approach.


Repository:
  rL LLVM

https://reviews.llvm.org/D23583




