[PATCH] D45733: [DAGCombiner] Unfold scalar masked merge if profitable

Wed Apr 18 10:42:02 PDT 2018

Currently llvm-mca doesn't know how to resolve variant scheduling classes.
This problem mostly affects the ARM target.
This has been reported here: https://bugs.llvm.org/show_bug.cgi?id=36672

The number of micro opcodes that you see in the llvm-mca output is the
default (invalid) number of micro opcodes for instructions associated with
a sched-variant class.

I plan to send a patch to address (most of) the issues related to the
presence of variant scheduling classes. However, keep in mind that ARM
sched-predicates heavily rely on TII hooks. Those are going to cause
problems for tools like mca (i.e. there is not an easy way to "fix" them).

At the moment, llvm-mca doesn't know how to analyze these two instructions,
since both are associated with a variant scheduling class:
   eor     w8, w0, w1
   mov     w0, w1

On Wed, Apr 18, 2018 at 5:10 PM, Sanjay Patel via Phabricator <
reviews at reviews.llvm.org> wrote:

> spatel added subscribers: courbet, andreadb.
> spatel added a comment.
>
> In https://reviews.llvm.org/D45733#1071005, @lebedev.ri wrote:
>
> > > Yeah, that is the question, i'm having. I did look at mca output.
> >
> > Here is what MCA says about that for
> > `-mtriple=aarch64-unknown-linux-gnu -mcpu=cortex-a75`
> >  F5971838: diff.txt <https://reviews.llvm.org/F5971838>
> >  Or is this a scheduling info problem?
>
>
> Cool - a chance to poke at llvm-mca! (cc @andreadb and @courbet)
>
> First thing I see is that it's harder to get the sequence we're after on
> x86 using the basic source premise:
>
>   int andandor(int x, int y)  {
>     __asm volatile("# LLVM-MCA-BEGIN ands");
>     int r = (x & 42) | (y & ~42);
>     __asm volatile("# LLVM-MCA-END ands");
>     return r;
>   }
>
>   int xorandxor(int x, int y) {
>     __asm volatile("# LLVM-MCA-BEGIN xors");
>     int r = ((x ^ y) & 42) ^ y;
>     __asm volatile("# LLVM-MCA-END xors");
>     return r;
>   }
>
> ...because the input param register doesn't match the output result
> register. We'd have to hack that in asm...or put the code in a loop, but
> subtract the loop overhead somehow. Things work/look alright to me other
> than that.
>
> I don't know AArch64 that well, but your example is a special case that may
> be going wrong. I.e., if we have a bit-string constant like 0xff000000, you
> could get:
>         bfxil   w0, w1, #0, #24
> ...which should certainly be better than:
>         eor     w8, w1, w0
>         and     w8, w8, #0xff000000
>         eor     w0, w8, w1
>
> AArch64 chose to convert to shift + possibly more expensive bfi for the
> 0x00ffff00 constant though. That's not something that we can account for in
> generic DAGCombiner, so I'd categorize that as an AArch64-specific bug
> (either don't use bfi there or fix the scheduling model or fix this up in
> MI somehow).
>
>
> Repository:
>   rL LLVM
>
> https://reviews.llvm.org/D45733
>
>
>
>
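For reference, the two source forms quoted above (and/or vs. xor/and/xor)
compute the same masked merge for any mask. Below is a minimal C sketch that
checks this for the three constants mentioned in the thread. The function
names are hypothetical and only for illustration; bfxil_low24 is my reading
of what `bfxil w0, w1, #0, #24` computes (low 24 bits taken from the source,
the rest of the destination preserved), not something taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* and/or form: take bits of x where m is set, bits of y elsewhere. */
static uint32_t merge_and_or(uint32_t x, uint32_t y, uint32_t m) {
    return (x & m) | (y & ~m);
}

/* xor/and/xor form of the same masked merge. */
static uint32_t merge_xor(uint32_t x, uint32_t y, uint32_t m) {
    return ((x ^ y) & m) ^ y;
}

/* Assumed C model of `bfxil w0, w1, #0, #24`:
 * insert the low 24 bits of src into dst, keep dst's top 8 bits. */
static uint32_t bfxil_low24(uint32_t dst, uint32_t src) {
    return (dst & 0xff000000u) | (src & 0x00ffffffu);
}

/* Compare both forms over pseudo-random inputs; returns the mismatch count. */
static unsigned count_mismatches(unsigned iterations) {
    static const uint32_t masks[] = { 42u, 0xff000000u, 0x00ffff00u };
    uint32_t x = 0x12345678u, y = 0x9abcdef0u;
    unsigned mismatches = 0;

    for (unsigned i = 0; i < iterations; ++i) {
        for (unsigned k = 0; k < sizeof masks / sizeof masks[0]; ++k) {
            if (merge_and_or(x, y, masks[k]) != merge_xor(x, y, masks[k]))
                ++mismatches;
        }
        /* The 0xff000000 merge should also match the bfxil model. */
        if (merge_xor(x, y, 0xff000000u) != bfxil_low24(x, y))
            ++mismatches;
        /* Simple linear congruential step to vary the inputs. */
        x = x * 1664525u + 1013904223u;
        y = y * 22695477u + 1u;
    }
    return mismatches;
}
```

Calling count_mismatches(1000) from a small main() should return 0 if the
forms agree. This only checks functional equivalence; it says nothing about
which sequence is cheaper on a given core, which is the scheduling-model
question discussed above.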