[PATCH] D108382: [X86] lowerShuffleAsDecomposedShuffleMerge(): if both inputs are broadcastable/identities, canonicalize broadcasts as such
Roman Lebedev via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Wed Sep 1 09:48:54 PDT 2021
lebedev.ri added inline comments.
================
Comment at: llvm/test/CodeGen/X86/oddshuffles.ll:2284
+; AVX2-FAST-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero
+; AVX2-FAST-NEXT: vpinsrd $2, 8(%rdi), %xmm0, %xmm1
; AVX2-FAST-NEXT: vpxor %xmm0, %xmm0, %xmm0
----------------
lebedev.ri wrote:
> RKSimon wrote:
> > lebedev.ri wrote:
> > > lebedev.ri wrote:
> > > > RKSimon wrote:
> > > > > any luck with this?
> > > > I wrote a comment here, and phab just lost it :(
> > > >
> > > > This seems like demandedelts failure.
> > > > In LHS, we successfully dropped this load.
> > > > Whenever in `SimplifyMultipleUseDemandedBits()` we look at this `insert_vector_elt`,
> > > > demandedelts implies that we demand all elements.
> > > > The problem is that we need to decode the target shuffle mask to notice that, i think.
> > > >
> > > > Wild guess: perhaps in `SimplifyMultipleUseDemandedBitsForTargetNode()` after `getTargetShuffleInputs()`,
> > > > we can call `SimplifyMultipleUseDemandedBits()` on inputs, and recreate the shuffle if that succeeded?
> > > > I'm not really sure if there is some other better place to do that.
> > > Actually, that won't work either.
> > > ```
> > > Optimized legalized selection DAG: %bb.0 'splat_v3i32:'
> > > SelectionDAG has 32 nodes:
> > > t0: ch = EntryToken
> > > t2: i64,ch = CopyFromReg t0, Register:i64 %0
> > > t24: v8i32 = BUILD_VECTOR Constant:i32<0>, undef:i32, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>
> > > t55: v8i32 = X86ISD::BLENDI t24, t58, TargetConstant:i8<2>
> > > t19: ch,glue = CopyToReg t0, Register:v8i32 $ymm0, t55
> > > t69: v32i8 = bitcast t58
> > > t76: i64 = X86ISD::Wrapper TargetConstantPool:i64<<32 x i8> <i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 0, i8 1, i8 2, i8 3, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128>> 0
> > > t74: v32i8,ch = load<(load (s256) from constant-pool)> t0, t76, undef:i64
> > > t71: v32i8 = X86ISD::PSHUFB t69, t74
> > > t72: v8i32 = bitcast t71
> > > t21: ch,glue = CopyToReg t19, Register:v8i32 $ymm1, t72, t19:1
> > > t27: i64,ch = load<(load (s64) from %ir.ptr, align 1)> t0, t2, undef:i64
> > > t30: v2i64 = scalar_to_vector t27
> > > t31: v4i32 = bitcast t30
> > > t28: i64 = add nuw t2, Constant:i64<8>
> > > t29: i32,ch = load<(load (s32) from %ir.ptr + 8, align 1)> t0, t28, undef:i64
> > > t61: v4i32 = insert_vector_elt t31, t29, Constant:i64<2>
> > > t58: v8i32 = insert_subvector undef:v8i32, t61, Constant:i64<0>
> > > t22: ch = X86ISD::RET_FLAG t21, TargetConstant:i32<0>, Register:v8i32 $ymm0, Register:v8i32 $ymm1, t21:1
> > >
> > >
> > > ===== Instruction selection begins: %bb.0 ''
> > > ```
> > > `t29`/`t61` is what we want to drop, but even if we could recreate subreg widening, `t58` has two uses.
> > > So i guess our only hope is `combineX86ShufflesRecursively()`?
> > What might work is calling SimplifyMultipleUseDemandedVectorElts on each operand at the end of the combineX86ShufflesRecursively recursion just before the calls into combineX86ShuffleChain?
> Let's see...
I've locally rebased this patch on top of the suggestion, which i've implemented in D109065, and it does not help.
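
For anyone following along, here is a rough standalone sketch of the demanded-elements reasoning for the DAG quoted above. It is not LLVM code: `DemandedInputs` and `demandedThroughShuffle` are hypothetical names made up for illustration, and the masks used in the note below are my reading of `t55` (BLENDI, imm 2) and `t71` (PSHUFB) in v8i32 terms.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper, NOT an LLVM API. For a two-input shuffle of N-element
// vectors, Mask[i] < 0 means "undef or zero", Mask[i] in [0, N) selects an
// element of input 0, and Mask[i] in [N, 2N) selects an element of input 1.
struct DemandedInputs {
  uint64_t Op0 = 0; // bit k set: element k of input 0 is demanded
  uint64_t Op1 = 0; // bit k set: element k of input 1 is demanded
};

// Propagate a bitmask of demanded result elements back through the shuffle
// mask onto the two inputs.
DemandedInputs demandedThroughShuffle(const std::vector<int> &Mask,
                                      uint64_t DemandedResult) {
  const int N = static_cast<int>(Mask.size());
  DemandedInputs D;
  for (int i = 0; i < N; ++i) {
    if (((DemandedResult >> i) & 1) == 0)
      continue; // this result element is not used at all
    const int M = Mask[i];
    if (M < 0)
      continue; // undef/zero lane, demands nothing from either input
    if (M < N)
      D.Op0 |= uint64_t(1) << M;
    else
      D.Op1 |= uint64_t(1) << (M - N);
  }
  return D;
}
```

Assuming the blend decodes to `{0, 9, 2, 3, 4, 5, 6, 7}` and the PSHUFB to `{-1, -1, 0, -1, -1, -1, -1, -1}`, the blend demands only element 1 of `t58` and the PSHUFB only element 0. Element 2, which is what `t61` inserts from `t29`, is demanded by neither use, yet each use on its own cannot drop it because `t58` has two uses.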
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D108382/new/
https://reviews.llvm.org/D108382