[PATCH] D124564: [MachineCombiner, AArch64] Add a new pattern A-(B+C) => (A-B)-C to reduce latency

Thu May 12 01:11:47 PDT 2022

dmgreen added subscribers: malharJ, dmgreen.
dmgreen added a comment.

Hello. This looks like two different patches.

The first part is the subadd - If that is reducing the critical path length then it would seem to make sense. It might be worth adding a mir test? It's hard to tell with all the other changes how profitable it would be.

The other part is fixing the machine combines costs. That is something we've noticed before in the context of transient instructions and the cost model it uses seeming to be incorrect. D123512 <https://reviews.llvm.org/D123512> for example. I've not stepped through the new costs, but I think we should be preferring madd over mul and add, even if the scheduling model suggests otherwise. Most CPU implement forwarding into the accumulate of a madd, that probably isn't modelled in the A55 model that we use by default. The codegen we produce for -mcpu=generic should ideally be descent for the combination of big and little core. Scheduling we take from the little core because being in-order it requires the most help, and out-of-order cores can schedule inside the core themselves. The codegen preferably should be good on both though, if we can, and we should aim not to make the big core performance worse just because we are using a little-cores schedule.

The simplest solution is possibly to just add the forwarding paths to the A55 schedule (although every time we try things like that the performance seems to be mixed at best). For others like fmadd it is less obvious what we should do. At any rate it would be best if the addsub and the costmodel changes were two separate patches.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D124564/new/

https://reviews.llvm.org/D124564