[PATCH] D126234: [AArch64] Add support for FMA intrinsics to shouldSinkOperands.

Thu May 26 09:23:35 PDT 2022

dmgreen accepted this revision.
dmgreen added a comment.
This revision is now accepted and ready to land.

> From the publicly available ARM software optimization guides, it looks like the indexed/non-indexed variants should give the same performance on A57/A75, but the indexed variants have less throughput on A55. A quick glance at the scheduling model for A55 seems to indicate that this difference is not reflected in the model though.
>
> It also looks like the same issue would exist for other indexed variants handled there. If issues show up investigating handling this more granually sounds like a  good idea. We could also limit the sinking to cores where indexed and non-indexed variants have the same performance, if needed.

The testing I usually run included some A55 runs, and I wasn't seeing any differences in performance. It turns out that the test case I have for fma+dup would come through to MachineCombiner as FADD+FMUL+DUP though, not llvm.fma. It was MachineCombiner that did the conversion in the end though (at least for one of the two fma's, the other becomes a non-lane-indexed version because the pattern is actually `FADD(FMUL(COPY(DUP(..)))`, and I don't think the Machine Combiner is ignoring no-op copies).

Anyway - Having fixed that I still do not see any difference between the indexed and non-indexed variants of FMA A55 when the index versions are forced off, so I think performance-wise this is OK. The indexed variants seem to perform better in general, even if they do have a lower throughput.

So LGTM. Cheers

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D126234/new/

https://reviews.llvm.org/D126234