[PATCH] D131028: [AArch64] Fix cost model for FADD vector reduction

Tue Aug 16 01:21:46 PDT 2022

fhahn added a comment.

In D131028#3725222 <https://reviews.llvm.org/D131028#3725222>, @dmgreen wrote:

> The example I shared was the most obviously worse, even if it is wrapped up in awkward SLP codegen. It is 20%-40% worse depending on the CPU. There are a few other cases that get worse that have the 4x manual unrolling, including an f64 matrix multiply and something called iir_lattice. As far as I can see, all the examples that get worse have multiplies feeding into a reduction.

OK, I checked the public A75 optimization guide, and it looks like FMADD has a throughput of 2, while FADDP (Q form) only has a throughput of 1 and worse latency. I guess that would explain the issue. Or do you think the assembly diff would also be worse assuming an implementation of FADDP that has the same latency/throughput as FMADD?

If the issue is the FADDP implementation on particular uarchs, then we should probably bump the FADDP cost on those uarchs.
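
To make the suggestion concrete, here is a rough sketch of what a per-uarch bump could look like. It is a standalone toy model, not the AArch64 TTI code from this patch: the `Uarch` enum, both function names, and the cost numbers are all hypothetical, and the 2x factor is only an assumption loosely motivated by the throughput difference mentioned above.

```cpp
#include <cassert>

// Hypothetical micro-architecture classes; these are NOT LLVM subtarget names.
enum class Uarch { Generic, SlowFaddp };

// Assumed cost of a single pairwise-add step. The numbers are illustrative
// only and are not taken from any optimization guide.
inline unsigned faddpStepCost(Uarch U) {
  return U == Uarch::SlowFaddp ? 2 : 1;
}

// Cost of reducing NumElts lanes with a pairwise tree, which takes
// log2(NumElts) steps. NumElts is assumed to be a power of two >= 2.
inline unsigned faddReductionCost(unsigned NumElts, Uarch U) {
  assert(NumElts >= 2 && (NumElts & (NumElts - 1)) == 0 &&
         "expected a power-of-two element count");
  unsigned Steps = 0;
  for (unsigned N = NumElts; N > 1; N >>= 1)
    ++Steps;
  return Steps * faddpStepCost(U);
}
```

With this model a 4-lane reduction costs 2 on the generic class and 4 on the slow-FADDP class, so only the affected uarchs would see the reduction become less attractive.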

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D131028/new/

https://reviews.llvm.org/D131028