[PATCH] D123512: [MachineCombiner]: Avoid including transient instructions in latency calculation
Malhar via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Mon Apr 11 11:02:39 PDT 2022
malharJ added inline comments.
================
Comment at: llvm/test/CodeGen/AArch64/neon-mla-mls.ll:141
; CHECK: // %bb.0:
-; CHECK-NEXT: mul v0.8b, v0.8b, v1.8b
-; CHECK-NEXT: sub v0.8b, v0.8b, v2.8b
+; CHECK-NEXT: neg v2.8b, v2.8b
+; CHECK-NEXT: mla v2.8b, v0.8b, v1.8b
----------------
georges wrote:
> fhahn wrote:
> > Is this actually profitable compared to the original code? Shouldn't this use `mls`?
> I don't think you can use `mls` here since the negation is on the input accumulator rather than the multiplicand.
> Apart from that, I agree with @fhahn that this doesn't look obviously profitable. `fmov`/`mov` are not zero-latency on a lot of micro-architectures.
> The equivalent case without the `fmov` (e.g. accumulating into `A` rather than `C` so it ends up in `v0` naturally) could be believably faster due to the late accumulator operand forwarding present on many micro-architectures, and at worst shouldn't be worse than the existing code. I'm not sure if that already happened prior to this change though.
I understand the concern about the additional fmov/mov,
but as @georges already pointed out, this is more related
to the calling convention of returning result in r0
(and also unrelated to my patch which just fixes the latency/depth calculation)
I am happy to interchange the order of %A and %C,
as that eliminates the fmov/mov, if that's what is required for this patch.
Also, independent of the above, I think the test can be renamed to use
"mla" (or maybe "negmla") instead of "mls"
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D123512/new/
https://reviews.llvm.org/D123512
More information about the llvm-commits
mailing list