[PATCH] D45733: [DAGCombiner] Unfold scalar masked merge if profitable

Wed Apr 18 09:10:31 PDT 2018

spatel added subscribers: courbet, andreadb.
spatel added a comment.

In https://reviews.llvm.org/D45733#1071005, @lebedev.ri wrote:

> > Yeah, that is the question I'm having. I did look at mca output.
>
> Here is what MCA says about that for `-mtriple=aarch64-unknown-linux-gnu -mcpu=cortex-a75`
>  F5971838: diff.txt <https://reviews.llvm.org/F5971838>
>  Or is this a scheduling info problem?

Cool - a chance to poke at llvm-mca! (cc @andreadb and @courbet)

First thing I see is that it's harder to get the sequence we're after on x86 using the basic source premise:

  int andandor(int x, int y)  {
    __asm volatile("# LLVM-MCA-BEGIN ands");
    int r = (x & 42) | (y & ~42);
    __asm volatile("# LLVM-MCA-END ands");
    return r;
  }

  int xorandxor(int x, int y) {
    __asm volatile("# LLVM-MCA-BEGIN xors");
    int r = ((x ^ y) & 42) ^ y;
    __asm volatile("# LLVM-MCA-END xors");
    return r;
  }

...because the input parameter registers don't match the output result register. We'd have to hack that in asm, or put the code in a loop and subtract the loop overhead somehow. Other than that, things work and look alright to me.
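
As a sanity check that the two source forms above really compute the same masked merge, here is a minimal sketch in plain C. The function names and the sample values are made up for illustration and are not part of the patch; it only spot-checks a handful of inputs, it is not a proof.

```c
#include <assert.h>
#include <stdint.h>

/* and/andn/or form: take bits of x where m is set, bits of y elsewhere. */
uint32_t merge_andandor(uint32_t x, uint32_t y, uint32_t m) {
  return (x & m) | (y & ~m);
}

/* xor/and/xor form of the same masked merge. */
uint32_t merge_xorandxor(uint32_t x, uint32_t y, uint32_t m) {
  return ((x ^ y) & m) ^ y;
}

/* Hypothetical helper: assert that both forms agree for mask m
   on a small, arbitrary set of sample inputs. */
void check_merge_forms(uint32_t m) {
  static const uint32_t samples[] = {0u, 1u, 42u, 0x00ffff00u,
                                     0xff000000u, 0xdeadbeefu, 0xffffffffu};
  const int n = (int)(sizeof samples / sizeof samples[0]);
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
      assert(merge_andandor(samples[i], samples[j], m) ==
             merge_xorandxor(samples[i], samples[j], m));
}
```

Calling `check_merge_forms(42u)` exercises the constant used in the snippets above. The identity holds bit by bit: where a mask bit is set, `(x ^ y) ^ y` gives back `x`; where it is clear, the result is just `y`.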

I don't know AArch64 that well, but your example is a special case that may be going wrong. I.e., if we have a bit-string constant like 0xff000000, you could get:
	bfxil	w0, w1, #0, #24
...which should certainly be better than:
	eor	w8, w1, w0
	and	w8, w8, #0xff000000
	eor	w0, w8, w1
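
To make it easier to see that the single `bfxil` and the three-instruction xor sequence compute the same value, here is a rough C model of both. The `model_*` names are hypothetical, and the `bfxil` model is my reading of the instruction (copy `width` bits of the source, starting at bit `lsb`, into the low bits of the destination and leave the rest alone), so treat it as a sketch rather than a reference.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed model of `bfxil wd, wn, #lsb, #width`.
   Requires 1 <= width and lsb + width <= 32. */
uint32_t model_bfxil(uint32_t wd, uint32_t wn, unsigned lsb, unsigned width) {
  uint32_t low = (width == 32) ? 0xffffffffu : ((1u << width) - 1u);
  return (wd & ~low) | ((wn >> lsb) & low);
}

/* The eor/and/eor sequence, one statement per instruction. */
uint32_t model_eor_and_eor(uint32_t w0, uint32_t w1) {
  uint32_t w8 = w1 ^ w0; /* eor w8, w1, w0         */
  w8 &= 0xff000000u;     /* and w8, w8, #0xff000000 */
  return w8 ^ w1;        /* eor w0, w8, w1         */
}

/* Hypothetical helper: both models should agree for any w0, w1. */
void check_bfxil_matches_xor(uint32_t w0, uint32_t w1) {
  assert(model_bfxil(w0, w1, 0, 24) == model_eor_and_eor(w0, w1));
}
```

Both keep the top 8 bits of `w0` and take the low 24 bits from `w1`, which is the masked merge for the 0xff000000 constant.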

The AArch64 backend chose to convert to a shift plus a possibly more expensive bfi for the 0x00ffff00 constant, though. That's not something that we can account for in the generic DAGCombiner, so I'd categorize that as an AArch64-specific bug (either don't use bfi there, or fix the scheduling model, or fix this up in MI somehow).
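
For reference, here is a C sketch of what a shift + bfi lowering would compute for the 0x00ffff00 mask. The exact instruction sequence the backend emits is not shown in this thread, so the `lsr`/`bfi` pairing in the comments is an assumption, as are the function names and the `bfi` model (insert the low `width` bits of the source into the destination at bit `lsb`).

```c
#include <assert.h>
#include <stdint.h>

/* Assumed model of `bfi wd, wn, #lsb, #width`.
   Requires 1 <= width and lsb + width <= 32. */
uint32_t model_bfi(uint32_t wd, uint32_t wn, unsigned lsb, unsigned width) {
  uint32_t field = (width == 32) ? 0xffffffffu : ((1u << width) - 1u);
  return (wd & ~(field << lsb)) | ((wn & field) << lsb);
}

/* Hypothetical shift + bfi lowering of (x & 0x00ffff00) | (y & ~0x00ffff00). */
uint32_t merge_shift_bfi(uint32_t x, uint32_t y) {
  uint32_t t = x >> 8;           /* assumed: lsr w8, w0, #8      */
  return model_bfi(y, t, 8, 16); /* assumed: bfi w1, w8, #8, #16 */
}

/* Straightforward masked merge to compare against. */
uint32_t merge_reference(uint32_t x, uint32_t y) {
  return (x & 0x00ffff00u) | (y & ~0x00ffff00u);
}

/* Hypothetical helper: the lowering should match the reference. */
void check_shift_bfi(uint32_t x, uint32_t y) {
  assert(merge_shift_bfi(x, y) == merge_reference(x, y));
}
```

The point is only that the result is the same; whether the two-instruction form is actually slower than xor/and/xor depends on the bfi latency in the target's scheduling model.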

Repository:
  rL LLVM

https://reviews.llvm.org/D45733