[PATCH] D131028: [AArch64] Fix cost model for FADD vector reduction
Florian Hahn via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Tue Aug 16 01:21:46 PDT 2022
fhahn added a comment.
In D131028#3725222 <https://reviews.llvm.org/D131028#3725222>, @dmgreen wrote:
> The example I shared was the most obviously worse, even if it is wrapped up in awkward SLP codegen. It is 20%-40% worse depending on the CPU. There are a few other cases that get worse that have the 4x manual unrolling, including an f64 matrix multiply and something called iir_lattice. As far as I can see, all the examples that get worse have multiplies feeding into a reduction.
OK, I checked the public A75 optimization guide: FMADD has a throughput of 2, while FADDP (Q form) has a throughput of only 1 and worse latency. I guess that would explain the issue. Or do you think the assembly diff is also worse even assuming an FADDP implementation with the same latency/throughput as FMADD?
If the issue is the FADDP implementation on particular uarchs, then we should probably bump the FADDP cost on those uarchs.
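For reference, a minimal sketch of the kind of source pattern being discussed: multiplies feeding a reduction, with 4x manual unrolling. This is a hypothetical illustration and not one of the regressing benchmarks; the function name `dot_unrolled4` is made up. Assuming reassociation is allowed (e.g. fast-math), the vectorizer could plausibly choose between a scalar chain of FMADDs and a vector multiply followed by an FADDP-based reduction, which is where the relative cost of the two matters.

```cpp
#include <cstddef>

// Hypothetical example (not from the benchmarks mentioned in the review):
// a dot product whose loop body is manually unrolled 4x, so each
// iteration contains four multiplies that all feed one reduction.
float dot_unrolled4(const float *a, const float *b, std::size_t n) {
  float sum = 0.0f;
  std::size_t i = 0;
  // Main loop: four multiply-accumulate steps per iteration.
  for (; i + 4 <= n; i += 4) {
    sum += a[i] * b[i];
    sum += a[i + 1] * b[i + 1];
    sum += a[i + 2] * b[i + 2];
    sum += a[i + 3] * b[i + 3];
  }
  // Remainder loop for the last n % 4 elements.
  for (; i < n; ++i)
    sum += a[i] * b[i];
  return sum;
}
```

Whether the reduction form is actually slower would depend on the FADDP latency/throughput of the target core, as noted above.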
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D131028/new/
https://reviews.llvm.org/D131028