[PATCH] D80801: [DAGCombiner] allow more folding of fadd + fmul into fma

Sat May 30 06:52:24 PDT 2020

spatel added a comment.

In D80801#2064218 <https://reviews.llvm.org/D80801#2064218>, @rscottmanley wrote:

> I agree SDAG is not ideal -- I explored doing this earlier in opt and also later using MIs, but both places have their own problems. It's a surprisingly cumbersome optimization if you are concerned about multiple targets which have different sets of FMA "flavours".

This discussion reminded me of the examples here:
https://reviews.llvm.org/D18751#402906
(and that's where the MachineCombiner hook was added)

So we really can't win all cases - even on the same target - without seeing the entire loop. I still view this patch as an instruction/uop win, so it's the right default choice.

In D80801#2064204 <https://reviews.llvm.org/D80801#2064204>, @craig.topper wrote:

> Broadwell might be an interesting X86 target here. MUL and ADD both have 3 cycle latency and FMA is 5 cycle latency. Haswell is ADD 3, MUL/FMA 5. Everything is uniform on SKL at 4 cycles.

Thanks - I overlooked Broadwell. From what I see in Agner's tables, Broadwell is then tied with Ryzen for the worst relative FMA implementation (add/mul/FMA latencies of 3/3/5 cycles for single-precision).
Do you think this is worth trying as-is for x86, or do we need to work harder to undo FMA first? @RKSimon - any thoughts about AMD CPUs?
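
To make the latency trade-off concrete, here is a rough back-of-the-envelope sketch (not part of the patch). It uses only the cycle counts quoted above and a deliberately simplified model that ignores throughput, port pressure, and out-of-order overlap. The names `Latency`, `one_shot`, and `loop_carried` are hypothetical helpers made up for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Latency:
    """Instruction latencies in cycles (single-precision)."""
    add: int
    mul: int
    fma: int


# Figures as quoted in this thread; treat them as assumptions, not measurements.
TARGETS = {
    "Haswell": Latency(add=3, mul=5, fma=5),
    "Broadwell": Latency(add=3, mul=3, fma=5),
    "Skylake": Latency(add=4, mul=4, fma=4),
}


def one_shot(lat: Latency, fused: bool) -> int:
    """Latency of a single a*b + c when all three inputs are ready at once."""
    return lat.fma if fused else lat.mul + lat.add


def loop_carried(lat: Latency, fused: bool) -> int:
    """Per-iteration recurrence latency of acc = a[i]*b[i] + acc.

    Unfused, the multiply does not depend on acc, so only the add sits on
    the loop-carried chain. Fused, the whole FMA is on the chain.
    """
    return lat.fma if fused else lat.add


if __name__ == "__main__":
    for name, lat in TARGETS.items():
        print(
            f"{name}: one-shot mul+add={one_shot(lat, False)} fma={one_shot(lat, True)}; "
            f"loop-carried mul+add={loop_carried(lat, False)} fma={loop_carried(lat, True)}"
        )
```

Under this model, fusing shortens the one-shot case on all three targets, but lengthens the loop-carried recurrence on Haswell and Broadwell (5 cycles instead of 3). That is the kind of case that can't be decided without seeing the entire loop.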

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D80801/new/

https://reviews.llvm.org/D80801