[PATCH] D123512: [MachineCombiner]: Avoid including transient instructions in latency calculation

Mon Apr 11 09:29:25 PDT 2022

georges added inline comments.

================
Comment at: llvm/test/CodeGen/AArch64/neon-mla-mls.ll:141
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    mul v0.8b, v0.8b, v1.8b
-; CHECK-NEXT:    sub v0.8b, v0.8b, v2.8b
+; CHECK-NEXT:    neg v2.8b, v2.8b
+; CHECK-NEXT:    mla v2.8b, v0.8b, v1.8b
----------------
fhahn wrote:
> Is this actually profitable compared to the original code? Shouldn't this use `mls`?
I don't think you can use `mls` here, since the negation applies to the accumulator input rather than to the product. `mls` computes `acc - a*b` (i.e. `C - A*B`), whereas this function needs `A*B - C`.
Apart from that, I agree with @fhahn that this doesn't look obviously profitable. The `fmov`/`mov` needed to move the result back into `v0` is not zero-latency on many micro-architectures.
The equivalent case without the `fmov` (e.g. accumulating into `A` rather than `C`, so the result ends up in `v0` naturally) could plausibly be faster. Many micro-architectures implement late forwarding of the accumulator operand, and at worst that form should be no slower than the existing code. I'm not sure whether that case was already being handled before this change, though.

Repository:
  rG LLVM Github Monorepo


https://reviews.llvm.org/D123512