[PATCH] D131028: [AArch64] Fix cost model for FADD vector reduction

Dave Green via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Aug 17 04:07:29 PDT 2022


dmgreen added a comment.

In D131028#3725287 <https://reviews.llvm.org/D131028#3725287>, @fhahn wrote:

> In D131028#3725222 <https://reviews.llvm.org/D131028#3725222>, @dmgreen wrote:
>
>> The example I shared was the most obviously worse, even if it is wrapped up in awkward SLP codegen. It is 20%-40% worse depending on the CPU. There are a few other cases that get worse that have the 4x manual unrolling, including an f64 matrix multiply and something called iir_lattice. As far as I can see, all the examples that get worse have multiplies into a reduction.
>
> Ok, I checked the public A75 optimization guide and it looks like FMADD has a throughput of 2 while FADDP (Q form) only has a throughput of 1 and worse latency. I guess that would explain the issue, or do you think the assembly diff is also worse assuming an implementation where FADDP has the same latency/throughput as FMADD?
>
> If the issue is the FADDP implementation on particular uarchs, then we should probably bump the FADDP cost on those uarchs.

I don't think this is about micro-architecture differences, just different benchmarks. I happen to have a lot of tests that hit the awkward SLP case and the 4x unrolled loops, so they show where this can perform worse. The same thing doesn't happen to come up in SPEC or the other benchmarks you ran, so you didn't see any differences. I think the manually unrolled loops are less important in the grand scheme of things.

The problem isn't FADD vs FADDP. I agree that in isolation this is a decent improvement to the cost of fadd reduction. The problem is FMADD vs vector FMUL + FADDPs. Unless a micro-arch is doing something very strange, back-to-back fmadds should be pretty efficient. There will not be a very large difference between 4x scalar fmadd and vector fmul+faddp+faddp, and yet the cost model will currently cost them as 8 vs 2+2, I think. With anything else going on (like the shuffles in the arm_biquad_cascade_df1_f32 case), the cost can easily be wrong enough to give worse performance.
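
To make the comparison concrete, here is a rough sketch of the two shapes being costed (hypothetical IR for illustration only, not taken from the actual benchmark; value names invented). The first is the "8" in the comparison above, the second the "2+2":

  ; Scalar chain: lowers to four back-to-back fmadd instructions.
  %m0 = fmul fast float %a0, %b0
  %s0 = fadd fast float %acc, %m0
  %m1 = fmul fast float %a1, %b1
  %s1 = fadd fast float %s0, %m1
  %m2 = fmul fast float %a2, %b2
  %s2 = fadd fast float %s1, %m2
  %m3 = fmul fast float %a3, %b3
  %s3 = fadd fast float %s2, %m3

  ; SLP form: one vector fmul plus a reduction that lowers to two FADDPs.
  %m = fmul fast <4 x float> %a, %b
  %r = call fast float @llvm.vector.reduce.fadd.v4f32(float %acc, <4 x float> %m)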

But our current FP costs are all set to 2 (which isn't necessarily deliberate, just the default), and any modifications I've tried so far have only really made things look worse. I was attempting to mark the cost of an fadd that is fed by an fmul as free, but it didn't help in the arm_biquad_cascade_df1_f32 case and seemed to cause other performance issues on its own. Perhaps it's best to move forward with this, accepting the regressions because it also gives improvements, and work on improving the other costs as we go.



================
Comment at: llvm/test/Analysis/CostModel/AArch64/reduce-fadd.ll:31
+; COMMON-LABEL: 'fast_fp_reductions'
+; NOFP16-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %fadd_v4f16_fast = call fast half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
+; NOFP16-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %fadd_v4f16_reassoc = call reassoc half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
----------------
Can we split the fp16 tests into their own functions, so that we can still use the script to generate the check lines? We do the same in a few of the other cost model tests.
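
Something like this (just a sketch; the function name is invented) would keep the same intrinsic calls but put the fp16 reductions in their own function so the update script can regenerate the checks:

  declare half @llvm.vector.reduce.fadd.v4f16(half, <4 x half>)

  define half @fast_fp_reductions_fp16() {
    %fadd_v4f16_fast = call fast half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
    %fadd_v4f16_reassoc = call reassoc half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
    ret half %fadd_v4f16_fast
  }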


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D131028/new/

https://reviews.llvm.org/D131028


