[PATCH] D45733: [DAGCombiner] Unfold scalar masked merge if profitable

Wed Apr 18 09:30:29 PDT 2018

lebedev.ri added a comment.

In https://reviews.llvm.org/D45733#1071051, @spatel wrote:

> In https://reviews.llvm.org/D45733#1071005, @lebedev.ri wrote:
>
> > > Yeah, that is the question, i'm having. I did look at mca output.
> >
> > Here is what MCA says about that for `-mtriple=aarch64-unknown-linux-gnu -mcpu=cortex-a75`
> >  F5971838: diff.txt <https://reviews.llvm.org/F5971838>
> >  Or is this a scheduling info problem?
>
>
> Cool - a chance to poke at llvm-mca! (cc @andreadb and @courbet)
>
> First thing I see is that it's harder to get the sequence we're after on x86 using the basic source premise:
>
>   int andandor(int x, int y)  {
>     __asm volatile("# LLVM-MCA-BEGIN ands");
>     int r = (x & 42) | (y & ~42);
>     __asm volatile("# LLVM-MCA-END ands");
>     return r;
>   }
>  
>   int xorandxor(int x, int y) {
>     __asm volatile("# LLVM-MCA-BEGIN xors");
>     int r = ((x ^ y) & 42) ^ y;
>     __asm volatile("# LLVM-MCA-END xors");
>     return r;
>   }
>  
>
>
> ...because the input param register doesn't match the output result register. We'd have to hack that in asm...or put the code in a loop, but subtract the loop overhead somehow. Things work/look alright to me other than that.

I simply stored the lhs and rhs side of `// CHECK` lines from aarch64's `@in32_constmask` in two local files,
run llvm-mca on each of them, and diffed the output, no clang was involved.

> I don't know AArch that well, but your example is a special-case that may be going wrong. Ie, if we have a bit-string constant like 0xff000000, you could get:
> 
>   	bfxil	w0, w1, #0, #24
> 
> ...which should certainly be better than:
> 
>   	eor	w8, w1, w0
>   	and	w8, w8, #0xff000000
>   	eor	w0, w8, w1
>   
> 
> 
> AArch64 chose to convert to shift + possibly more expensive bfi for the 0x00ffff00 constant though. That's not something that we can account for in generic DAGCombiner, so I'd categorize that as an AArch64-specific bug (either don't use bfi there or fix the scheduling model or fix this up in MI somehow).

Ok, then let's assume until proven otherwise that if mask is a constant, unfolded variant is always better.
I'll unfold it in instcombine (since it seems the https://reviews.llvm.org/D45664 will already match the masked merge pattern, so it would not add much code).

In https://reviews.llvm.org/D45733#1070963, @spatel wrote:

> there's no need to add code bloat to the DAG to handle the pattern unless something in the backend can create this pattern (seems unlikely).

I don't think it will be possible to check that until after the instcombine part has landed, so ok, at least for now i will stop unfolding [constant mask] in dagcombine.

While there, any hint re pattern matchers for this code?

Repository:
  rL LLVM

https://reviews.llvm.org/D45733