[PATCH] D123512: [MachineCombiner]: Avoid including transient instructions in latency calculation

Mon Apr 11 11:02:39 PDT 2022

malharJ added inline comments.

================
Comment at: llvm/test/CodeGen/AArch64/neon-mla-mls.ll:141
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    mul v0.8b, v0.8b, v1.8b
-; CHECK-NEXT:    sub v0.8b, v0.8b, v2.8b
+; CHECK-NEXT:    neg v2.8b, v2.8b
+; CHECK-NEXT:    mla v2.8b, v0.8b, v1.8b
----------------
georges wrote:
> fhahn wrote:
> > Is this actually profitable compared to the original code? Shouldn't this use `mls`?
> I don't think you can use `mls` here since the negation is on the input accumulator rather than the multiplicand.
> Apart from that, I agree with @fhahn that this doesn't look obviously profitable. `fmov`/`mov` are not zero-latency on a lot of micro-architectures.
> The equivalent case without the `fmov` (e.g. accumulating into `A` rather than `C` so it ends up in `v0` naturally) could be believably faster due to the late accumulator operand forwarding present on many micro-architectures, and at worst shouldn't be worse than the existing code. I'm not sure if that already happened prior to this change though.
I understand the concern about the additional `fmov`/`mov`,
but as @georges already pointed out, it comes from the
calling convention, which requires the result to be returned in `v0`.
It is also unrelated to my patch, which only fixes the latency/depth calculation.

I am happy to swap the order of `%A` and `%C`,
as that eliminates the `fmov`/`mov`, if that is what is required for this patch.

Also, independent of the above, I think the test can be renamed to use
`mla` (or maybe `negmla`) instead of `mls`.

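To make the `mla`/`mls` point concrete, here is a minimal Python sketch (a model of the lane arithmetic, not LLVM or compiler code) of the two sequences in the diff, treating each `.8b` lane as an unsigned 8-bit value with wrap-around. The function names are hypothetical and exist only for illustration. It shows that `neg` + `mla` computes the same `a * b - c` as `mul` + `sub`, whereas `mls` with `c` as the accumulator would compute `c - a * b`.

```python
MASK = 0xFF  # each lane of a .8b vector is 8 bits wide and wraps modulo 256


def mul_sub(a, b, c):
    """Old sequence: mul v0, v0, v1 ; sub v0, v0, v2  ->  a * b - c per lane."""
    return [(x * y - z) & MASK for x, y, z in zip(a, b, c)]


def neg_mla(a, b, c):
    """New sequence: neg v2, v2 ; mla v2, v0, v1  ->  (-c) + a * b per lane."""
    acc = [(-z) & MASK for z in c]  # neg: negate the accumulator input
    return [(s + x * y) & MASK for s, x, y in zip(acc, a, b)]  # mla: acc += a * b


def mls(a, b, c):
    """mls with c as the accumulator: acc -= a * b  ->  c - a * b per lane."""
    return [(z - x * y) & MASK for x, y, z in zip(a, b, c)]
```

The two sequences agree lane by lane, so the only open question is their relative cost; `mls` yields the negated result and so cannot replace the pair directly.
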
Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D123512/new/

https://reviews.llvm.org/D123512