[llvm-dev] [AVX512BW] Nasty KAND issue

Thu Oct 20 10:36:31 PDT 2016

On Thu, Oct 20, 2016 at 1:14 PM, Mehdi Amini <mehdi.amini at apple.com> wrote:
>
> On Oct 20, 2016, at 10:05 AM, Friedman, Eli <efriedma at codeaurora.org> wrote:
>
> On 10/20/2016 9:28 AM, Cameron McInally via llvm-dev wrote:
>
> I should have attached the generated asm to save some trouble.
> Apologies for that and attaching now...
>
>
>
> On Thu, Oct 20, 2016 at 12:26 PM, Cameron McInally
> <cameron.mcinally at nyu.edu> wrote:
>
> On Thu, Oct 20, 2016 at 12:05 PM, Mehdi Amini <mehdi.amini at apple.com> wrote:
>
> On Oct 20, 2016, at 8:54 AM, Cameron McInally via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>
> Hey guys,
>
> I've hit a pretty nasty issue on SKX with ANDs of masks <= 4 bits.
>
> In the IR, we represent a 4b vector mask as <4 x i1>. This assumes
> that the storage container for this type is also 4b, but it's not.
>
> The storage type is not relevant, these bits are “unreachable” from the IR
> point of view.
> The backend is supposed to lower the operation in a safe way when it is
> needed to clear these bit.
>
> For example if you were to perform some arithmetic operation on these, it is
> likely that they would get zero extended to 8bits first and this is where
> the upper bits would be cleared.
>
>
> The
> smallest mask register on SKX is 8b. This also implies that the
> smallest load/store moves 8b.
>
> We run into problems when we try to optimize ANDs (full test case attached):
>
>  %r1 = and <4 x i1> %r0, <i1 -1, i1 -1, i1 -1, i1 -1>
>
> At the IR level the all1s mask looks like the Identity for this
> operation, so LLVM will remove it. But it is not the Identity since
> this operation should clear the top 4 bits of the 8 bit hardware
> register in play. E.g.
>
> No, this operation alone does not need to clear the upper bit, they are
> undefined before and after.
>
>
>        kmovb   -4(%rsp), %k0
>        kandb   %k0, %k1, %k0
>        kmovb   %k0, -4(%rsp)
>
> I began tracking down this issue and found that InstCombine will
> incorrectly remove the AND. Then I noticed that the Reassociate pass
> would also remove the AND if InstCombine did not. That made me
> nervous. My current thinking is that this might be a larger problem
> that shouldn't be patched up. Or maybe I made a faulty assumption with
> the IR I choose for this operation.
>
> There might be a legitimate issue, but your example fails short to
> illustrate it right now: you’re not showing how these upper bits are leaking
> into the computation somewhere?
>
> —
> Mehdi
>
> Hi Mehdi,
>
> Yes, good point. Updated test case exhibiting the dirty bits attached.
> Notice that the kortest will operate on the dirty bits that should
> have been zeroed.
>
> Perhaps the problem is that the zext of the i4 to i16 does not get
> generated correctly.
>
>
> The problem with your IR is actually the load.  When you load a value whose
> size in bits is not a multiple of 8 (like i1, or <4 x i1>, the result is
> undefined unless the unused bits are zero.
>
>
> Almost: LangRef says "When loading a value of a type like i20 with a size
> that is not an integral number of bytes, the result is undefined if the
> value was not originally written using a store of the same type.”
>
> This is even stronger than what you described:  even if the bits are
> explicitly  0, the results can be undefined.
>
> Also, while I guess the most straightforward lowering of the store would
> clear the upper bits, that’s not a requirement of LangRef I believe (it
> seems legal to me to clear the bits on the load instead for example).
>
> You can see this in the debug output from llc:
>
> SelectionDAG has 15 nodes:
>  t0: ch = EntryToken
>        t21: i32 = X86ISD::KORTEST t19, t19
>      t22: i8 = X86ISD::SETCC Constant:i8<4>, t21
>    t23: i32 = zero_extend t22
>  t14: ch,glue = CopyToReg t0, Register:i32 %EAX, t23
>      t24: i16,ch = load<LD1[%XXX](align=4)(dereferenceable), zext from i8>
> t0, FrameIndex:i64<0>, undef:i64
>
>
> If I read correctly, the issue is the "zext from i8” part of the load right?
> The backend assumed they were cleared on the store.
>
> Thanks,
>
> —
> Mehdi

Well that wouldn't be too bad. I.e. explicitly zeroing the top bits on
the <4 x i1> store. I think it would require an extra instruction
though.

Handling <4 x i1> loads/stores as <8 x i1> at my interface sounds
awful. Just FYI.

-Cam