[llvm] r368183 - Recommit r367901 "[X86] Enable -x86-experimental-vector-widening-legalization by default."

Craig Topper via llvm-commits llvm-commits at lists.llvm.org
Mon Aug 19 22:24:16 PDT 2019


There have been quite a lot of follow-on patches to this. A lot of them
would need to be reverted to get back to the old state. I can start trying to
put that together.

~Craig


On Mon, Aug 19, 2019 at 9:55 PM Eric Christopher via llvm-commits <
llvm-commits at lists.llvm.org> wrote:

> Hi Craig,
>
> We're seeing rather a lot of performance regressions with this enabled
> by default. Is it possible to have it enabled only under a command flag
> for the near term while we work on getting you a pile of testcases?
> (Some of them are Eigen, and those will at least be easier since you have
> access to that source. :)
>
> Thoughts?
>
> Thanks!
>
> -eric
>
> On Wed, Aug 7, 2019 at 9:23 AM Craig Topper via llvm-commits
> <llvm-commits at lists.llvm.org> wrote:
> >
> > Author: ctopper
> > Date: Wed Aug  7 09:24:26 2019
> > New Revision: 368183
> >
> > URL: http://llvm.org/viewvc/llvm-project?rev=368183&view=rev
> > Log:
> > Recommit r367901 "[X86] Enable
> -x86-experimental-vector-widening-legalization by default."
> >
> > The assert that caused this to be reverted should be fixed now.
> >
> > Original commit message:
> >
> > This patch changes our default legalization behavior for 16-, 32-, and
> > 64-bit vectors with i8/i16/i32/i64 scalar types from promotion to
> > widening. For example, v8i8 will now be widened to v16i8 instead of
> > promoted to v8i16. This keeps the element widths the same and pads
> > with undef elements. We believe this is a better legalization strategy,
> > but it carries some issues due to the fragmented vector ISA. For
> > example, i8 shifts and multiplies get widened and then later have
> > to be promoted/split into vXi16 vectors.
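> >
> > As a hypothetical illustration (not a test from this patch), consider:
> >
> >   define <8 x i8> @add_v8i8(<8 x i8> %a, <8 x i8> %b) {
> >     ; v8i8 is not a legal X86 vector type, so the legalizer must act on it
> >     %r = add <8 x i8> %a, %b
> >     ret <8 x i8> %r
> >   }
> >
> > Previously the add was promoted and performed as <8 x i16>; with widening
> > it is instead performed as a <16 x i8> add whose upper eight lanes are
> > undef.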
> >
> > This has the potential to cause regressions, so we wanted to get
> > it in early in the 10.0 cycle to leave plenty of time to
> > address them.
> >
> > Next steps will be to merge the tests that explicitly test the
> > command-line option, and then remove the option and its associated
> > code.
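> >
> > (In the meantime, the old promotion behavior can still be selected with
> > the hidden flag, e.g. "llc -x86-experimental-vector-widening-legalization=false",
> > or "-mllvm -x86-experimental-vector-widening-legalization=false" from clang.)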
> >
> > Removed:
> >     llvm/trunk/test/Analysis/CostModel/X86/reduce-add-widen.ll
> > Modified:
> >     llvm/trunk/lib/Target/X86/X86ISelLowering.cpp
> >     llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp
> >     llvm/trunk/test/Analysis/CostModel/X86/alternate-shuffle-cost.ll
> >     llvm/trunk/test/Analysis/CostModel/X86/arith.ll
> >     llvm/trunk/test/Analysis/CostModel/X86/cast.ll
> >     llvm/trunk/test/Analysis/CostModel/X86/fptosi.ll
> >     llvm/trunk/test/Analysis/CostModel/X86/fptoui.ll
> >     llvm/trunk/test/Analysis/CostModel/X86/masked-intrinsic-cost.ll
> >     llvm/trunk/test/Analysis/CostModel/X86/reduce-add.ll
> >     llvm/trunk/test/Analysis/CostModel/X86/reduce-and.ll
> >     llvm/trunk/test/Analysis/CostModel/X86/reduce-mul.ll
> >     llvm/trunk/test/Analysis/CostModel/X86/reduce-or.ll
> >     llvm/trunk/test/Analysis/CostModel/X86/reduce-smax.ll
> >     llvm/trunk/test/Analysis/CostModel/X86/reduce-smin.ll
> >     llvm/trunk/test/Analysis/CostModel/X86/reduce-umax.ll
> >     llvm/trunk/test/Analysis/CostModel/X86/reduce-umin.ll
> >     llvm/trunk/test/Analysis/CostModel/X86/reduce-xor.ll
> >     llvm/trunk/test/Analysis/CostModel/X86/shuffle-transpose.ll
> >     llvm/trunk/test/Analysis/CostModel/X86/sitofp.ll
> >     llvm/trunk/test/Analysis/CostModel/X86/slm-arith-costs.ll
> >     llvm/trunk/test/Analysis/CostModel/X86/testshiftashr.ll
> >     llvm/trunk/test/Analysis/CostModel/X86/testshiftlshr.ll
> >     llvm/trunk/test/Analysis/CostModel/X86/testshiftshl.ll
> >     llvm/trunk/test/Analysis/CostModel/X86/uitofp.ll
> >     llvm/trunk/test/CodeGen/X86/2008-09-05-sinttofp-2xi32.ll
> >     llvm/trunk/test/CodeGen/X86/2009-06-05-VZextByteShort.ll
> >     llvm/trunk/test/CodeGen/X86/2011-10-19-LegelizeLoad.ll
> >     llvm/trunk/test/CodeGen/X86/2011-12-28-vselecti8.ll
> >     llvm/trunk/test/CodeGen/X86/2011-12-8-bitcastintprom.ll
> >     llvm/trunk/test/CodeGen/X86/2012-01-18-vbitcast.ll
> >     llvm/trunk/test/CodeGen/X86/2012-03-15-build_vector_wl.ll
> >     llvm/trunk/test/CodeGen/X86/2012-07-10-extload64.ll
> >     llvm/trunk/test/CodeGen/X86/3dnow-intrinsics.ll
> >     llvm/trunk/test/CodeGen/X86/4char-promote.ll
> >     llvm/trunk/test/CodeGen/X86/and-load-fold.ll
> >     llvm/trunk/test/CodeGen/X86/atomic-unordered.ll
> >     llvm/trunk/test/CodeGen/X86/avg.ll
> >     llvm/trunk/test/CodeGen/X86/avx-cvt-2.ll
> >     llvm/trunk/test/CodeGen/X86/avx-fp2int.ll
> >     llvm/trunk/test/CodeGen/X86/avx2-conversions.ll
> >     llvm/trunk/test/CodeGen/X86/avx2-masked-gather.ll
> >     llvm/trunk/test/CodeGen/X86/avx2-vbroadcast.ll
> >     llvm/trunk/test/CodeGen/X86/avx512-any_extend_load.ll
> >     llvm/trunk/test/CodeGen/X86/avx512-cvt.ll
> >     llvm/trunk/test/CodeGen/X86/avx512-ext.ll
> >     llvm/trunk/test/CodeGen/X86/avx512-intrinsics-upgrade.ll
> >     llvm/trunk/test/CodeGen/X86/avx512-mask-op.ll
> >     llvm/trunk/test/CodeGen/X86/avx512-trunc.ll
> >     llvm/trunk/test/CodeGen/X86/avx512-vec-cmp.ll
> >     llvm/trunk/test/CodeGen/X86/avx512-vec3-crash.ll
> >     llvm/trunk/test/CodeGen/X86/avx512bwvl-intrinsics-upgrade.ll
> >     llvm/trunk/test/CodeGen/X86/avx512vl-intrinsics-fast-isel.ll
> >     llvm/trunk/test/CodeGen/X86/avx512vl-intrinsics-upgrade.ll
> >     llvm/trunk/test/CodeGen/X86/bitcast-and-setcc-128.ll
> >     llvm/trunk/test/CodeGen/X86/bitcast-setcc-128.ll
> >     llvm/trunk/test/CodeGen/X86/bitcast-vector-bool.ll
> >     llvm/trunk/test/CodeGen/X86/bitreverse.ll
> >     llvm/trunk/test/CodeGen/X86/bswap-vector.ll
> >     llvm/trunk/test/CodeGen/X86/buildvec-insertvec.ll
> >     llvm/trunk/test/CodeGen/X86/combine-64bit-vec-binop.ll
> >     llvm/trunk/test/CodeGen/X86/combine-or.ll
> >     llvm/trunk/test/CodeGen/X86/complex-fastmath.ll
> >     llvm/trunk/test/CodeGen/X86/cvtv2f32.ll
> >     llvm/trunk/test/CodeGen/X86/extract-concat.ll
> >     llvm/trunk/test/CodeGen/X86/extract-insert.ll
> >     llvm/trunk/test/CodeGen/X86/f16c-intrinsics.ll
> >     llvm/trunk/test/CodeGen/X86/fold-vector-sext-zext.ll
> >     llvm/trunk/test/CodeGen/X86/insertelement-shuffle.ll
> >     llvm/trunk/test/CodeGen/X86/known-bits.ll
> >     llvm/trunk/test/CodeGen/X86/load-partial.ll
> >     llvm/trunk/test/CodeGen/X86/lower-bitcast.ll
> >     llvm/trunk/test/CodeGen/X86/madd.ll
> >     llvm/trunk/test/CodeGen/X86/masked_compressstore.ll
> >     llvm/trunk/test/CodeGen/X86/masked_expandload.ll
> >     llvm/trunk/test/CodeGen/X86/masked_gather_scatter.ll
> >     llvm/trunk/test/CodeGen/X86/masked_gather_scatter_widen.ll
> >     llvm/trunk/test/CodeGen/X86/masked_load.ll
> >     llvm/trunk/test/CodeGen/X86/masked_store.ll
> >     llvm/trunk/test/CodeGen/X86/masked_store_trunc.ll
> >     llvm/trunk/test/CodeGen/X86/masked_store_trunc_ssat.ll
> >     llvm/trunk/test/CodeGen/X86/masked_store_trunc_usat.ll
> >     llvm/trunk/test/CodeGen/X86/merge-consecutive-loads-256.ll
> >     llvm/trunk/test/CodeGen/X86/mmx-arg-passing-x86-64.ll
> >     llvm/trunk/test/CodeGen/X86/mmx-arith.ll
> >     llvm/trunk/test/CodeGen/X86/mmx-cvt.ll
> >     llvm/trunk/test/CodeGen/X86/mulvi32.ll
> >     llvm/trunk/test/CodeGen/X86/oddshuffles.ll
> >     llvm/trunk/test/CodeGen/X86/oddsubvector.ll
> >     llvm/trunk/test/CodeGen/X86/pmaddubsw.ll
> >     llvm/trunk/test/CodeGen/X86/pmulh.ll
> >     llvm/trunk/test/CodeGen/X86/pointer-vector.ll
> >     llvm/trunk/test/CodeGen/X86/pr14161.ll
> >     llvm/trunk/test/CodeGen/X86/pr35918.ll
> >     llvm/trunk/test/CodeGen/X86/pr40994.ll
> >     llvm/trunk/test/CodeGen/X86/promote-vec3.ll
> >     llvm/trunk/test/CodeGen/X86/promote.ll
> >     llvm/trunk/test/CodeGen/X86/psubus.ll
> >     llvm/trunk/test/CodeGen/X86/ret-mmx.ll
> >     llvm/trunk/test/CodeGen/X86/sad.ll
> >     llvm/trunk/test/CodeGen/X86/sadd_sat_vec.ll
> >     llvm/trunk/test/CodeGen/X86/scalar_widen_div.ll
> >     llvm/trunk/test/CodeGen/X86/select.ll
> >     llvm/trunk/test/CodeGen/X86/shift-combine.ll
> >     llvm/trunk/test/CodeGen/X86/shrink_vmul.ll
> >     llvm/trunk/test/CodeGen/X86/shuffle-strided-with-offset-128.ll
> >     llvm/trunk/test/CodeGen/X86/shuffle-strided-with-offset-256.ll
> >     llvm/trunk/test/CodeGen/X86/shuffle-strided-with-offset-512.ll
> >     llvm/trunk/test/CodeGen/X86/shuffle-vs-trunc-128.ll
> >     llvm/trunk/test/CodeGen/X86/shuffle-vs-trunc-256.ll
> >     llvm/trunk/test/CodeGen/X86/shuffle-vs-trunc-512.ll
> >     llvm/trunk/test/CodeGen/X86/slow-pmulld.ll
> >     llvm/trunk/test/CodeGen/X86/sse2-intrinsics-canonical.ll
> >     llvm/trunk/test/CodeGen/X86/sse2-vector-shifts.ll
> >     llvm/trunk/test/CodeGen/X86/ssub_sat_vec.ll
> >     llvm/trunk/test/CodeGen/X86/test-shrink-bug.ll
> >     llvm/trunk/test/CodeGen/X86/trunc-ext-ld-st.ll
> >     llvm/trunk/test/CodeGen/X86/trunc-subvector.ll
> >     llvm/trunk/test/CodeGen/X86/uadd_sat_vec.ll
> >     llvm/trunk/test/CodeGen/X86/unfold-masked-merge-vector-variablemask.ll
> >     llvm/trunk/test/CodeGen/X86/usub_sat_vec.ll
> >     llvm/trunk/test/CodeGen/X86/vec_cast2.ll
> >     llvm/trunk/test/CodeGen/X86/vec_cast3.ll
> >     llvm/trunk/test/CodeGen/X86/vec_ctbits.ll
> >     llvm/trunk/test/CodeGen/X86/vec_extract-mmx.ll
> >     llvm/trunk/test/CodeGen/X86/vec_fp_to_int.ll
> >     llvm/trunk/test/CodeGen/X86/vec_insert-5.ll
> >     llvm/trunk/test/CodeGen/X86/vec_insert-7.ll
> >     llvm/trunk/test/CodeGen/X86/vec_insert-mmx.ll
> >     llvm/trunk/test/CodeGen/X86/vec_int_to_fp.ll
> >     llvm/trunk/test/CodeGen/X86/vec_saddo.ll
> >     llvm/trunk/test/CodeGen/X86/vec_smulo.ll
> >     llvm/trunk/test/CodeGen/X86/vec_ssubo.ll
> >     llvm/trunk/test/CodeGen/X86/vec_uaddo.ll
> >     llvm/trunk/test/CodeGen/X86/vec_umulo.ll
> >     llvm/trunk/test/CodeGen/X86/vec_usubo.ll
> >     llvm/trunk/test/CodeGen/X86/vector-blend.ll
> >     llvm/trunk/test/CodeGen/X86/vector-ext-logic.ll
> >     llvm/trunk/test/CodeGen/X86/vector-gep.ll
> >     llvm/trunk/test/CodeGen/X86/vector-half-conversions.ll
> >     llvm/trunk/test/CodeGen/X86/vector-idiv-v2i32.ll
> >     llvm/trunk/test/CodeGen/X86/vector-narrow-binop.ll
> >     llvm/trunk/test/CodeGen/X86/vector-reduce-add.ll
> >     llvm/trunk/test/CodeGen/X86/vector-reduce-and-bool.ll
> >     llvm/trunk/test/CodeGen/X86/vector-reduce-and.ll
> >     llvm/trunk/test/CodeGen/X86/vector-reduce-mul.ll
> >     llvm/trunk/test/CodeGen/X86/vector-reduce-or-bool.ll
> >     llvm/trunk/test/CodeGen/X86/vector-reduce-or.ll
> >     llvm/trunk/test/CodeGen/X86/vector-reduce-smax.ll
> >     llvm/trunk/test/CodeGen/X86/vector-reduce-smin.ll
> >     llvm/trunk/test/CodeGen/X86/vector-reduce-umax.ll
> >     llvm/trunk/test/CodeGen/X86/vector-reduce-umin.ll
> >     llvm/trunk/test/CodeGen/X86/vector-reduce-xor-bool.ll
> >     llvm/trunk/test/CodeGen/X86/vector-reduce-xor.ll
> >     llvm/trunk/test/CodeGen/X86/vector-sext.ll
> >     llvm/trunk/test/CodeGen/X86/vector-shift-ashr-sub128.ll
> >     llvm/trunk/test/CodeGen/X86/vector-shift-by-select-loop.ll
> >     llvm/trunk/test/CodeGen/X86/vector-shift-lshr-sub128.ll
> >     llvm/trunk/test/CodeGen/X86/vector-shift-shl-sub128.ll
> >     llvm/trunk/test/CodeGen/X86/vector-shuffle-128-v16.ll
> >     llvm/trunk/test/CodeGen/X86/vector-shuffle-combining.ll
> >     llvm/trunk/test/CodeGen/X86/vector-trunc-packus.ll
> >     llvm/trunk/test/CodeGen/X86/vector-trunc-ssat.ll
> >     llvm/trunk/test/CodeGen/X86/vector-trunc-usat.ll
> >     llvm/trunk/test/CodeGen/X86/vector-trunc.ll
> >     llvm/trunk/test/CodeGen/X86/vector-truncate-combine.ll
> >     llvm/trunk/test/CodeGen/X86/vector-zext.ll
> >     llvm/trunk/test/CodeGen/X86/vsel-cmp-load.ll
> >     llvm/trunk/test/CodeGen/X86/vselect-avx.ll
> >     llvm/trunk/test/CodeGen/X86/vselect.ll
> >     llvm/trunk/test/CodeGen/X86/vshift-4.ll
> >     llvm/trunk/test/CodeGen/X86/widen_arith-1.ll
> >     llvm/trunk/test/CodeGen/X86/widen_arith-2.ll
> >     llvm/trunk/test/CodeGen/X86/widen_arith-3.ll
> >     llvm/trunk/test/CodeGen/X86/widen_bitops-0.ll
> >     llvm/trunk/test/CodeGen/X86/widen_cast-1.ll
> >     llvm/trunk/test/CodeGen/X86/widen_cast-2.ll
> >     llvm/trunk/test/CodeGen/X86/widen_cast-3.ll
> >     llvm/trunk/test/CodeGen/X86/widen_cast-4.ll
> >     llvm/trunk/test/CodeGen/X86/widen_cast-5.ll
> >     llvm/trunk/test/CodeGen/X86/widen_cast-6.ll
> >     llvm/trunk/test/CodeGen/X86/widen_compare-1.ll
> >     llvm/trunk/test/CodeGen/X86/widen_conv-1.ll
> >     llvm/trunk/test/CodeGen/X86/widen_conv-2.ll
> >     llvm/trunk/test/CodeGen/X86/widen_conv-3.ll
> >     llvm/trunk/test/CodeGen/X86/widen_conv-4.ll
> >     llvm/trunk/test/CodeGen/X86/widen_load-2.ll
> >     llvm/trunk/test/CodeGen/X86/widen_shuffle-1.ll
> >     llvm/trunk/test/CodeGen/X86/x86-interleaved-access.ll
> >     llvm/trunk/test/CodeGen/X86/x86-shifts.ll
> >     llvm/trunk/test/Transforms/SLPVectorizer/X86/blending-shuffle.ll
> >     llvm/trunk/test/Transforms/SLPVectorizer/X86/fptosi.ll
> >     llvm/trunk/test/Transforms/SLPVectorizer/X86/fptoui.ll
> >     llvm/trunk/test/Transforms/SLPVectorizer/X86/insert-element-build-vector.ll
> >     llvm/trunk/test/Transforms/SLPVectorizer/X86/sitofp.ll
> >     llvm/trunk/test/Transforms/SLPVectorizer/X86/uitofp.ll
> >
> > Modified: llvm/trunk/lib/Target/X86/X86ISelLowering.cpp
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/X86/X86ISelLowering.cpp?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/lib/Target/X86/X86ISelLowering.cpp (original)
> > +++ llvm/trunk/lib/Target/X86/X86ISelLowering.cpp Wed Aug  7 09:24:26
> 2019
> > @@ -66,7 +66,7 @@ using namespace llvm;
> >  STATISTIC(NumTailCalls, "Number of tail calls");
> >
> >  static cl::opt<bool> ExperimentalVectorWideningLegalization(
> > -    "x86-experimental-vector-widening-legalization", cl::init(false),
> > +    "x86-experimental-vector-widening-legalization", cl::init(true),
> >      cl::desc("Enable an experimental vector type legalization through
> widening "
> >               "rather than promotion."),
> >      cl::Hidden);
> > @@ -40453,8 +40453,7 @@ static SDValue combineStore(SDNode *N, S
> >    bool NoImplicitFloatOps =
> F.hasFnAttribute(Attribute::NoImplicitFloat);
> >    bool F64IsLegal =
> >        !Subtarget.useSoftFloat() && !NoImplicitFloatOps &&
> Subtarget.hasSSE2();
> > -  if (((VT.isVector() && !VT.isFloatingPoint()) ||
> > -       (VT == MVT::i64 && F64IsLegal && !Subtarget.is64Bit())) &&
> > +  if ((VT == MVT::i64 && F64IsLegal && !Subtarget.is64Bit()) &&
> >        isa<LoadSDNode>(St->getValue()) &&
> >        !cast<LoadSDNode>(St->getValue())->isVolatile() &&
> >        St->getChain().hasOneUse() && !St->isVolatile()) {
> >
> > Modified: llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp (original)
> > +++ llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp Wed Aug  7
> 09:24:26 2019
> > @@ -887,7 +887,7 @@ int X86TTIImpl::getArithmeticInstrCost(
> >  int X86TTIImpl::getShuffleCost(TTI::ShuffleKind Kind, Type *Tp, int
> Index,
> >                                 Type *SubTp) {
> >    // 64-bit packed float vectors (v2f32) are widened to type v4f32.
> > -  // 64-bit packed integer vectors (v2i32) are promoted to type v2i64.
> > +  // 64-bit packed integer vectors (v2i32) are widened to type v4i32.
> >    std::pair<int, MVT> LT = TLI->getTypeLegalizationCost(DL, Tp);
> >
> >    // Treat Transpose as 2-op shuffles - there's no difference in
> lowering.
> > @@ -2425,14 +2425,6 @@ int X86TTIImpl::getAddressComputationCos
> >
> >  int X86TTIImpl::getArithmeticReductionCost(unsigned Opcode, Type *ValTy,
> >                                             bool IsPairwise) {
> > -
> > -  std::pair<int, MVT> LT = TLI->getTypeLegalizationCost(DL, ValTy);
> > -
> > -  MVT MTy = LT.second;
> > -
> > -  int ISD = TLI->InstructionOpcodeToISD(Opcode);
> > -  assert(ISD && "Invalid opcode");
> > -
> >    // We use the Intel Architecture Code Analyzer(IACA) to measure the
> throughput
> >    // and make it as the cost.
> >
> > @@ -2440,7 +2432,10 @@ int X86TTIImpl::getArithmeticReductionCo
> >      { ISD::FADD,  MVT::v2f64,   2 },
> >      { ISD::FADD,  MVT::v4f32,   4 },
> >      { ISD::ADD,   MVT::v2i64,   2 },      // The data reported by the
> IACA tool is "1.6".
> > +    { ISD::ADD,   MVT::v2i32,   2 }, // FIXME: chosen to be less than
> v4i32.
> >      { ISD::ADD,   MVT::v4i32,   3 },      // The data reported by the
> IACA tool is "3.5".
> > +    { ISD::ADD,   MVT::v2i16,   3 }, // FIXME: chosen to be less than
> v4i16
> > +    { ISD::ADD,   MVT::v4i16,   4 }, // FIXME: chosen to be less than
> v8i16
> >      { ISD::ADD,   MVT::v8i16,   5 },
> >    };
> >
> > @@ -2449,8 +2444,11 @@ int X86TTIImpl::getArithmeticReductionCo
> >      { ISD::FADD,  MVT::v4f64,   5 },
> >      { ISD::FADD,  MVT::v8f32,   7 },
> >      { ISD::ADD,   MVT::v2i64,   1 },      // The data reported by the
> IACA tool is "1.5".
> > +    { ISD::ADD,   MVT::v2i32,   2 }, // FIXME: chosen to be less than
> v4i32
> >      { ISD::ADD,   MVT::v4i32,   3 },      // The data reported by the
> IACA tool is "3.5".
> >      { ISD::ADD,   MVT::v4i64,   5 },      // The data reported by the
> IACA tool is "4.8".
> > +    { ISD::ADD,   MVT::v2i16,   3 }, // FIXME: chosen to be less than
> v4i16
> > +    { ISD::ADD,   MVT::v4i16,   4 }, // FIXME: chosen to be less than
> v8i16
> >      { ISD::ADD,   MVT::v8i16,   5 },
> >      { ISD::ADD,   MVT::v8i32,   5 },
> >    };
> > @@ -2459,7 +2457,10 @@ int X86TTIImpl::getArithmeticReductionCo
> >      { ISD::FADD,  MVT::v2f64,   2 },
> >      { ISD::FADD,  MVT::v4f32,   4 },
> >      { ISD::ADD,   MVT::v2i64,   2 },      // The data reported by the
> IACA tool is "1.6".
> > +    { ISD::ADD,   MVT::v2i32,   2 }, // FIXME: chosen to be less than
> v4i32
> >      { ISD::ADD,   MVT::v4i32,   3 },      // The data reported by the
> IACA tool is "3.3".
> > +    { ISD::ADD,   MVT::v2i16,   2 },      // The data reported by the
> IACA tool is "4.3".
> > +    { ISD::ADD,   MVT::v4i16,   3 },      // The data reported by the
> IACA tool is "4.3".
> >      { ISD::ADD,   MVT::v8i16,   4 },      // The data reported by the
> IACA tool is "4.3".
> >    };
> >
> > @@ -2468,12 +2469,47 @@ int X86TTIImpl::getArithmeticReductionCo
> >      { ISD::FADD,  MVT::v4f64,   3 },
> >      { ISD::FADD,  MVT::v8f32,   4 },
> >      { ISD::ADD,   MVT::v2i64,   1 },      // The data reported by the
> IACA tool is "1.5".
> > +    { ISD::ADD,   MVT::v2i32,   2 }, // FIXME: chosen to be less than
> v4i32
> >      { ISD::ADD,   MVT::v4i32,   3 },      // The data reported by the
> IACA tool is "2.8".
> >      { ISD::ADD,   MVT::v4i64,   3 },
> > +    { ISD::ADD,   MVT::v2i16,   2 },      // The data reported by the
> IACA tool is "4.3".
> > +    { ISD::ADD,   MVT::v4i16,   3 },      // The data reported by the
> IACA tool is "4.3".
> >      { ISD::ADD,   MVT::v8i16,   4 },
> >      { ISD::ADD,   MVT::v8i32,   5 },
> >    };
> >
> > +  int ISD = TLI->InstructionOpcodeToISD(Opcode);
> > +  assert(ISD && "Invalid opcode");
> > +
> > +  // Before legalizing the type, give a chance to look up illegal
> narrow types
> > +  // in the table.
> > +  // FIXME: Is there a better way to do this?
> > +  EVT VT = TLI->getValueType(DL, ValTy);
> > +  if (VT.isSimple()) {
> > +    MVT MTy = VT.getSimpleVT();
> > +    if (IsPairwise) {
> > +      if (ST->hasAVX())
> > +        if (const auto *Entry = CostTableLookup(AVX1CostTblPairWise,
> ISD, MTy))
> > +          return Entry->Cost;
> > +
> > +      if (ST->hasSSE42())
> > +        if (const auto *Entry = CostTableLookup(SSE42CostTblPairWise,
> ISD, MTy))
> > +          return Entry->Cost;
> > +    } else {
> > +      if (ST->hasAVX())
> > +        if (const auto *Entry = CostTableLookup(AVX1CostTblNoPairWise,
> ISD, MTy))
> > +          return Entry->Cost;
> > +
> > +      if (ST->hasSSE42())
> > +        if (const auto *Entry = CostTableLookup(SSE42CostTblNoPairWise,
> ISD, MTy))
> > +          return Entry->Cost;
> > +    }
> > +  }
> > +
> > +  std::pair<int, MVT> LT = TLI->getTypeLegalizationCost(DL, ValTy);
> > +
> > +  MVT MTy = LT.second;
> > +
> >    if (IsPairwise) {
> >      if (ST->hasAVX())
> >        if (const auto *Entry = CostTableLookup(AVX1CostTblPairWise, ISD,
> MTy))
> >
> > Modified:
> llvm/trunk/test/Analysis/CostModel/X86/alternate-shuffle-cost.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/alternate-shuffle-cost.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/Analysis/CostModel/X86/alternate-shuffle-cost.ll
> (original)
> > +++ llvm/trunk/test/Analysis/CostModel/X86/alternate-shuffle-cost.ll Wed
> Aug  7 09:24:26 2019
> > @@ -18,9 +18,21 @@
> >  ; 64-bit packed float vectors (v2f32) are widened to type v4f32.
> >
> >  define <2 x i32> @test_v2i32(<2 x i32> %a, <2 x i32> %b) {
> > -; CHECK-LABEL: 'test_v2i32'
> > -; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %1 = shufflevector <2 x i32> %a, <2 x i32> %b, <2 x i32> <i32
> 0, i32 3>
> > -; CHECK-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret <2 x i32> %1
> > +; SSE2-LABEL: 'test_v2i32'
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %1 = shufflevector <2 x i32> %a, <2 x i32> %b, <2 x i32> <i32 0, i32 3>
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret <2 x i32> %1
> > +;
> > +; SSSE3-LABEL: 'test_v2i32'
> > +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %1 = shufflevector <2 x i32> %a, <2 x i32> %b, <2 x i32> <i32
> 0, i32 3>
> > +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret <2 x i32> %1
> > +;
> > +; SSE42-LABEL: 'test_v2i32'
> > +; SSE42-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %1 = shufflevector <2 x i32> %a, <2 x i32> %b, <2 x i32> <i32
> 0, i32 3>
> > +; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret <2 x i32> %1
> > +;
> > +; AVX-LABEL: 'test_v2i32'
> > +; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %1 = shufflevector <2 x i32> %a, <2 x i32> %b, <2 x i32> <i32 0, i32 3>
> > +; AVX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret <2 x i32> %1
> >  ;
> >  ; BTVER2-LABEL: 'test_v2i32'
> >  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %1 = shufflevector <2 x i32> %a, <2 x i32> %b, <2 x i32> <i32
> 0, i32 3>
> > @@ -56,9 +68,21 @@ define <2 x float> @test_v2f32(<2 x floa
> >  }
> >
> >  define <2 x i32> @test_v2i32_2(<2 x i32> %a, <2 x i32> %b) {
> > -; CHECK-LABEL: 'test_v2i32_2'
> > -; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %1 = shufflevector <2 x i32> %a, <2 x i32> %b, <2 x i32> <i32
> 2, i32 1>
> > -; CHECK-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret <2 x i32> %1
> > +; SSE2-LABEL: 'test_v2i32_2'
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %1 = shufflevector <2 x i32> %a, <2 x i32> %b, <2 x i32> <i32 2, i32 1>
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret <2 x i32> %1
> > +;
> > +; SSSE3-LABEL: 'test_v2i32_2'
> > +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %1 = shufflevector <2 x i32> %a, <2 x i32> %b, <2 x i32> <i32
> 2, i32 1>
> > +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret <2 x i32> %1
> > +;
> > +; SSE42-LABEL: 'test_v2i32_2'
> > +; SSE42-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %1 = shufflevector <2 x i32> %a, <2 x i32> %b, <2 x i32> <i32
> 2, i32 1>
> > +; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret <2 x i32> %1
> > +;
> > +; AVX-LABEL: 'test_v2i32_2'
> > +; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %1 = shufflevector <2 x i32> %a, <2 x i32> %b, <2 x i32> <i32 2, i32 1>
> > +; AVX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret <2 x i32> %1
> >  ;
> >  ; BTVER2-LABEL: 'test_v2i32_2'
> >  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %1 = shufflevector <2 x i32> %a, <2 x i32> %b, <2 x i32> <i32
> 2, i32 1>
> >
> > Modified: llvm/trunk/test/Analysis/CostModel/X86/arith.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/arith.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/Analysis/CostModel/X86/arith.ll (original)
> > +++ llvm/trunk/test/Analysis/CostModel/X86/arith.ll Wed Aug  7 09:24:26
> 2019
> > @@ -1342,36 +1342,32 @@ define i32 @mul(i32 %arg) {
> >  ; A <2 x i64> vector multiply is implemented using
> >  ; 3 PMULUDQ and 2 PADDS and 4 shifts.
> >  define void @mul_2i32() {
> > -; SSE-LABEL: 'mul_2i32'
> > -; SSE-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %A0 = mul <2 x i32> undef, undef
> > -; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret void
> > +; SSSE3-LABEL: 'mul_2i32'
> > +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for
> instruction: %A0 = mul <2 x i32> undef, undef
> > +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret void
> > +;
> > +; SSE42-LABEL: 'mul_2i32'
> > +; SSE42-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %A0 = mul <2 x i32> undef, undef
> > +; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret void
> >  ;
> >  ; AVX-LABEL: 'mul_2i32'
> > -; AVX-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %A0 = mul <2 x i32> undef, undef
> > +; AVX-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %A0 = mul <2 x i32> undef, undef
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret void
> >  ;
> > -; AVX512F-LABEL: 'mul_2i32'
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 8 for
> instruction: %A0 = mul <2 x i32> undef, undef
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret void
> > -;
> > -; AVX512BW-LABEL: 'mul_2i32'
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 8 for
> instruction: %A0 = mul <2 x i32> undef, undef
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret void
> > -;
> > -; AVX512DQ-LABEL: 'mul_2i32'
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %A0 = mul <2 x i32> undef, undef
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret void
> > +; AVX512-LABEL: 'mul_2i32'
> > +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %A0 = mul <2 x i32> undef, undef
> > +; AVX512-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret void
> >  ;
> >  ; SLM-LABEL: 'mul_2i32'
> > -; SLM-NEXT:  Cost Model: Found an estimated cost of 17 for instruction:
> %A0 = mul <2 x i32> undef, undef
> > +; SLM-NEXT:  Cost Model: Found an estimated cost of 11 for instruction:
> %A0 = mul <2 x i32> undef, undef
> >  ; SLM-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret void
> >  ;
> >  ; GLM-LABEL: 'mul_2i32'
> > -; GLM-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %A0 = mul <2 x i32> undef, undef
> > +; GLM-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %A0 = mul <2 x i32> undef, undef
> >  ; GLM-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret void
> >  ;
> >  ; BTVER2-LABEL: 'mul_2i32'
> > -; BTVER2-NEXT:  Cost Model: Found an estimated cost of 8 for
> instruction: %A0 = mul <2 x i32> undef, undef
> > +; BTVER2-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %A0 = mul <2 x i32> undef, undef
> >  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret void
> >  ;
> >    %A0 = mul <2 x i32> undef, undef
> >
> > Modified: llvm/trunk/test/Analysis/CostModel/X86/cast.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/cast.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/Analysis/CostModel/X86/cast.ll (original)
> > +++ llvm/trunk/test/Analysis/CostModel/X86/cast.ll Wed Aug  7 09:24:26
> 2019
> > @@ -315,10 +315,10 @@ define void @sitofp4(<4 x i1> %a, <4 x i
> >  ; SSE-LABEL: 'sitofp4'
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 5 for instruction:
> %A1 = sitofp <4 x i1> %a to <4 x float>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 40 for instruction:
> %A2 = sitofp <4 x i1> %a to <4 x double>
> > -; SSE-NEXT:  Cost Model: Found an estimated cost of 5 for instruction:
> %B1 = sitofp <4 x i8> %b to <4 x float>
> > -; SSE-NEXT:  Cost Model: Found an estimated cost of 40 for instruction:
> %B2 = sitofp <4 x i8> %b to <4 x double>
> > -; SSE-NEXT:  Cost Model: Found an estimated cost of 5 for instruction:
> %C1 = sitofp <4 x i16> %c to <4 x float>
> > -; SSE-NEXT:  Cost Model: Found an estimated cost of 40 for instruction:
> %C2 = sitofp <4 x i16> %c to <4 x double>
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %B1 = sitofp <4 x i8> %b to <4 x float>
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 160 for
> instruction: %B2 = sitofp <4 x i8> %b to <4 x double>
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 15 for instruction:
> %C1 = sitofp <4 x i16> %c to <4 x float>
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 80 for instruction:
> %C2 = sitofp <4 x i16> %c to <4 x double>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 5 for instruction:
> %D1 = sitofp <4 x i32> %d to <4 x float>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 40 for instruction:
> %D2 = sitofp <4 x i32> %d to <4 x double>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret void
> > @@ -359,7 +359,7 @@ define void @sitofp4(<4 x i1> %a, <4 x i
> >  define void @sitofp8(<8 x i1> %a, <8 x i8> %b, <8 x i16> %c, <8 x i32>
> %d) {
> >  ; SSE-LABEL: 'sitofp8'
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 15 for instruction:
> %A1 = sitofp <8 x i1> %a to <8 x float>
> > -; SSE-NEXT:  Cost Model: Found an estimated cost of 15 for instruction:
> %B1 = sitofp <8 x i8> %b to <8 x float>
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %B1 = sitofp <8 x i8> %b to <8 x float>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 15 for instruction:
> %C1 = sitofp <8 x i16> %c to <8 x float>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 10 for instruction:
> %D1 = sitofp <8 x i32> %d to <8 x float>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret void
> > @@ -390,9 +390,9 @@ define void @uitofp4(<4 x i1> %a, <4 x i
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %A1 = uitofp <4 x i1> %a to <4 x float>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 40 for instruction:
> %A2 = uitofp <4 x i1> %a to <4 x double>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %B1 = uitofp <4 x i8> %b to <4 x float>
> > -; SSE-NEXT:  Cost Model: Found an estimated cost of 40 for instruction:
> %B2 = uitofp <4 x i8> %b to <4 x double>
> > -; SSE-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %C1 = uitofp <4 x i16> %c to <4 x float>
> > -; SSE-NEXT:  Cost Model: Found an estimated cost of 40 for instruction:
> %C2 = uitofp <4 x i16> %c to <4 x double>
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 160 for
> instruction: %B2 = uitofp <4 x i8> %b to <4 x double>
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 15 for instruction:
> %C1 = uitofp <4 x i16> %c to <4 x float>
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 80 for instruction:
> %C2 = uitofp <4 x i16> %c to <4 x double>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %D1 = uitofp <4 x i32> %d to <4 x float>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 40 for instruction:
> %D2 = uitofp <4 x i32> %d to <4 x double>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret void
> > @@ -433,7 +433,7 @@ define void @uitofp4(<4 x i1> %a, <4 x i
> >  define void @uitofp8(<8 x i1> %a, <8 x i8> %b, <8 x i16> %c, <8 x i32>
> %d) {
> >  ; SSE-LABEL: 'uitofp8'
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 15 for instruction:
> %A1 = uitofp <8 x i1> %a to <8 x float>
> > -; SSE-NEXT:  Cost Model: Found an estimated cost of 15 for instruction:
> %B1 = uitofp <8 x i8> %b to <8 x float>
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %B1 = uitofp <8 x i8> %b to <8 x float>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 15 for instruction:
> %C1 = uitofp <8 x i16> %c to <8 x float>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 16 for instruction:
> %D1 = uitofp <8 x i32> %d to <8 x float>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret void
> >
> > Modified: llvm/trunk/test/Analysis/CostModel/X86/fptosi.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/fptosi.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/Analysis/CostModel/X86/fptosi.ll (original)
> > +++ llvm/trunk/test/Analysis/CostModel/X86/fptosi.ll Wed Aug  7 09:24:26
> 2019
> > @@ -92,35 +92,28 @@ define i32 @fptosi_double_i32(i32 %arg)
> >  define i32 @fptosi_double_i16(i32 %arg) {
> >  ; SSE-LABEL: 'fptosi_double_i16'
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %I16 = fptosi double undef to i16
> > -; SSE-NEXT:  Cost Model: Found an estimated cost of 6 for instruction:
> %V2I16 = fptosi <2 x double> undef to <2 x i16>
> > -; SSE-NEXT:  Cost Model: Found an estimated cost of 13 for instruction:
> %V4I16 = fptosi <4 x double> undef to <4 x i16>
> > -; SSE-NEXT:  Cost Model: Found an estimated cost of 27 for instruction:
> %V8I16 = fptosi <8 x double> undef to <8 x i16>
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V2I16 = fptosi <2 x double> undef to <2 x i16>
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V4I16 = fptosi <4 x double> undef to <4 x i16>
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 7 for instruction:
> %V8I16 = fptosi <8 x double> undef to <8 x i16>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; AVX-LABEL: 'fptosi_double_i16'
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %I16 = fptosi double undef to i16
> > -; AVX-NEXT:  Cost Model: Found an estimated cost of 6 for instruction:
> %V2I16 = fptosi <2 x double> undef to <2 x i16>
> > +; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V2I16 = fptosi <2 x double> undef to <2 x i16>
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V4I16 = fptosi <4 x double> undef to <4 x i16>
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V8I16 = fptosi <8 x double> undef to <8 x i16>
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> > -; AVX512F-LABEL: 'fptosi_double_i16'
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %I16 = fptosi double undef to i16
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 6 for
> instruction: %V2I16 = fptosi <2 x double> undef to <2 x i16>
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4I16 = fptosi <4 x double> undef to <4 x i16>
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8I16 = fptosi <8 x double> undef to <8 x i16>
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> > -;
> > -; AVX512DQ-LABEL: 'fptosi_double_i16'
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %I16 = fptosi double undef to i16
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V2I16 = fptosi <2 x double> undef to <2 x i16>
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4I16 = fptosi <4 x double> undef to <4 x i16>
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8I16 = fptosi <8 x double> undef to <8 x i16>
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> > +; AVX512-LABEL: 'fptosi_double_i16'
> > +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %I16 = fptosi double undef to i16
> > +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V2I16 = fptosi <2 x double> undef to <2 x i16>
> > +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4I16 = fptosi <4 x double> undef to <4 x i16>
> > +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8I16 = fptosi <8 x double> undef to <8 x i16>
> > +; AVX512-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; BTVER2-LABEL: 'fptosi_double_i16'
> >  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %I16 = fptosi double undef to i16
> > -; BTVER2-NEXT:  Cost Model: Found an estimated cost of 6 for
> instruction: %V2I16 = fptosi <2 x double> undef to <2 x i16>
> > +; BTVER2-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V2I16 = fptosi <2 x double> undef to <2 x i16>
> >  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4I16 = fptosi <4 x double> undef to <4 x i16>
> >  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V8I16 = fptosi <8 x double> undef to <8 x i16>
> >  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> > @@ -143,29 +136,22 @@ define i32 @fptosi_double_i8(i32 %arg) {
> >  ; AVX-LABEL: 'fptosi_double_i8'
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %I8 = fptosi double undef to i8
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 6 for instruction:
> %V2I8 = fptosi <2 x double> undef to <2 x i8>
> > -; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V4I8 = fptosi <4 x double> undef to <4 x i8>
> > -; AVX-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V8I8 = fptosi <8 x double> undef to <8 x i8>
> > +; AVX-NEXT:  Cost Model: Found an estimated cost of 12 for instruction:
> %V4I8 = fptosi <4 x double> undef to <4 x i8>
> > +; AVX-NEXT:  Cost Model: Found an estimated cost of 25 for instruction:
> %V8I8 = fptosi <8 x double> undef to <8 x i8>
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> > -; AVX512F-LABEL: 'fptosi_double_i8'
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %I8 = fptosi double undef to i8
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 6 for
> instruction: %V2I8 = fptosi <2 x double> undef to <2 x i8>
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4I8 = fptosi <4 x double> undef to <4 x i8>
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8I8 = fptosi <8 x double> undef to <8 x i8>
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> > -;
> > -; AVX512DQ-LABEL: 'fptosi_double_i8'
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %I8 = fptosi double undef to i8
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V2I8 = fptosi <2 x double> undef to <2 x i8>
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4I8 = fptosi <4 x double> undef to <4 x i8>
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8I8 = fptosi <8 x double> undef to <8 x i8>
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> > +; AVX512-LABEL: 'fptosi_double_i8'
> > +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %I8 = fptosi double undef to i8
> > +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V2I8 = fptosi <2 x double> undef to <2 x i8>
> > +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4I8 = fptosi <4 x double> undef to <4 x i8>
> > +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8I8 = fptosi <8 x double> undef to <8 x i8>
> > +; AVX512-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; BTVER2-LABEL: 'fptosi_double_i8'
> >  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %I8 = fptosi double undef to i8
> >  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 6 for
> instruction: %V2I8 = fptosi <2 x double> undef to <2 x i8>
> > -; BTVER2-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4I8 = fptosi <4 x double> undef to <4 x i8>
> > -; BTVER2-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V8I8 = fptosi <8 x double> undef to <8 x i8>
> > +; BTVER2-NEXT:  Cost Model: Found an estimated cost of 12 for
> instruction: %V4I8 = fptosi <4 x double> undef to <4 x i8>
> > +; BTVER2-NEXT:  Cost Model: Found an estimated cost of 25 for
> instruction: %V8I8 = fptosi <8 x double> undef to <8 x i8>
> >  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >    %I8 = fptosi double undef to i8
> > @@ -285,9 +271,9 @@ define i32 @fptosi_float_i16(i32 %arg) {
> >  define i32 @fptosi_float_i8(i32 %arg) {
> >  ; SSE-LABEL: 'fptosi_float_i8'
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %I8 = fptosi float undef to i8
> > -; SSE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V4I8 = fptosi <4 x float> undef to <4 x i8>
> > -; SSE-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V8I8 = fptosi <8 x float> undef to <8 x i8>
> > -; SSE-NEXT:  Cost Model: Found an estimated cost of 7 for instruction:
> %V16I8 = fptosi <16 x float> undef to <16 x i8>
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 12 for instruction:
> %V4I8 = fptosi <4 x float> undef to <4 x i8>
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 25 for instruction:
> %V8I8 = fptosi <8 x float> undef to <8 x i8>
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 51 for instruction:
> %V16I8 = fptosi <16 x float> undef to <16 x i8>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; AVX-LABEL: 'fptosi_float_i8'
> >
> > Modified: llvm/trunk/test/Analysis/CostModel/X86/fptoui.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/fptoui.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/Analysis/CostModel/X86/fptoui.ll (original)
> > +++ llvm/trunk/test/Analysis/CostModel/X86/fptoui.ll Wed Aug  7 09:24:26
> 2019
> > @@ -68,19 +68,12 @@ define i32 @fptoui_double_i32(i32 %arg)
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 33 for instruction:
> %V8I32 = fptoui <8 x double> undef to <8 x i32>
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> > -; AVX512F-LABEL: 'fptoui_double_i32'
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %I32 = fptoui double undef to i32
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 6 for
> instruction: %V2I32 = fptoui <2 x double> undef to <2 x i32>
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4I32 = fptoui <4 x double> undef to <4 x i32>
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8I32 = fptoui <8 x double> undef to <8 x i32>
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> > -;
> > -; AVX512DQ-LABEL: 'fptoui_double_i32'
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %I32 = fptoui double undef to i32
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V2I32 = fptoui <2 x double> undef to <2 x i32>
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4I32 = fptoui <4 x double> undef to <4 x i32>
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8I32 = fptoui <8 x double> undef to <8 x i32>
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> > +; AVX512-LABEL: 'fptoui_double_i32'
> > +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %I32 = fptoui double undef to i32
> > +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V2I32 = fptoui <2 x double> undef to <2 x i32>
> > +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4I32 = fptoui <4 x double> undef to <4 x i32>
> > +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8I32 = fptoui <8 x double> undef to <8 x i32>
> > +; AVX512-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; BTVER2-LABEL: 'fptoui_double_i32'
> >  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %I32 = fptoui double undef to i32
> > @@ -106,30 +99,23 @@ define i32 @fptoui_double_i16(i32 %arg)
> >  ;
> >  ; AVX-LABEL: 'fptoui_double_i16'
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %I16 = fptoui double undef to i16
> > -; AVX-NEXT:  Cost Model: Found an estimated cost of 6 for instruction:
> %V2I16 = fptoui <2 x double> undef to <2 x i16>
> > -; AVX-NEXT:  Cost Model: Found an estimated cost of 12 for instruction:
> %V4I16 = fptoui <4 x double> undef to <4 x i16>
> > -; AVX-NEXT:  Cost Model: Found an estimated cost of 25 for instruction:
> %V8I16 = fptoui <8 x double> undef to <8 x i16>
> > +; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V2I16 = fptoui <2 x double> undef to <2 x i16>
> > +; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V4I16 = fptoui <4 x double> undef to <4 x i16>
> > +; AVX-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V8I16 = fptoui <8 x double> undef to <8 x i16>
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> > -; AVX512F-LABEL: 'fptoui_double_i16'
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %I16 = fptoui double undef to i16
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 6 for
> instruction: %V2I16 = fptoui <2 x double> undef to <2 x i16>
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4I16 = fptoui <4 x double> undef to <4 x i16>
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V8I16 = fptoui <8 x double> undef to <8 x i16>
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> > -;
> > -; AVX512DQ-LABEL: 'fptoui_double_i16'
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %I16 = fptoui double undef to i16
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V2I16 = fptoui <2 x double> undef to <2 x i16>
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4I16 = fptoui <4 x double> undef to <4 x i16>
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V8I16 = fptoui <8 x double> undef to <8 x i16>
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> > +; AVX512-LABEL: 'fptoui_double_i16'
> > +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %I16 = fptoui double undef to i16
> > +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V2I16 = fptoui <2 x double> undef to <2 x i16>
> > +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4I16 = fptoui <4 x double> undef to <4 x i16>
> > +; AVX512-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V8I16 = fptoui <8 x double> undef to <8 x i16>
> > +; AVX512-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; BTVER2-LABEL: 'fptoui_double_i16'
> >  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %I16 = fptoui double undef to i16
> > -; BTVER2-NEXT:  Cost Model: Found an estimated cost of 6 for
> instruction: %V2I16 = fptoui <2 x double> undef to <2 x i16>
> > -; BTVER2-NEXT:  Cost Model: Found an estimated cost of 12 for
> instruction: %V4I16 = fptoui <4 x double> undef to <4 x i16>
> > -; BTVER2-NEXT:  Cost Model: Found an estimated cost of 25 for
> instruction: %V8I16 = fptoui <8 x double> undef to <8 x i16>
> > +; BTVER2-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V2I16 = fptoui <2 x double> undef to <2 x i16>
> > +; BTVER2-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4I16 = fptoui <4 x double> undef to <4 x i16>
> > +; BTVER2-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V8I16 = fptoui <8 x double> undef to <8 x i16>
> >  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >    %I16 = fptoui double undef to i16
> > @@ -154,19 +140,12 @@ define i32 @fptoui_double_i8(i32 %arg) {
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 25 for instruction:
> %V8I8 = fptoui <8 x double> undef to <8 x i8>
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> > -; AVX512F-LABEL: 'fptoui_double_i8'
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %I8 = fptoui double undef to i8
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 6 for
> instruction: %V2I8 = fptoui <2 x double> undef to <2 x i8>
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4I8 = fptoui <4 x double> undef to <4 x i8>
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V8I8 = fptoui <8 x double> undef to <8 x i8>
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> > -;
> > -; AVX512DQ-LABEL: 'fptoui_double_i8'
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %I8 = fptoui double undef to i8
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V2I8 = fptoui <2 x double> undef to <2 x i8>
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4I8 = fptoui <4 x double> undef to <4 x i8>
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V8I8 = fptoui <8 x double> undef to <8 x i8>
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> > +; AVX512-LABEL: 'fptoui_double_i8'
> > +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %I8 = fptoui double undef to i8
> > +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V2I8 = fptoui <2 x double> undef to <2 x i8>
> > +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4I8 = fptoui <4 x double> undef to <4 x i8>
> > +; AVX512-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V8I8 = fptoui <8 x double> undef to <8 x i8>
> > +; AVX512-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; BTVER2-LABEL: 'fptoui_double_i8'
> >  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %I8 = fptoui double undef to i8
> > @@ -277,7 +256,7 @@ define i32 @fptoui_float_i16(i32 %arg) {
> >  ;
> >  ; AVX-LABEL: 'fptoui_float_i16'
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %I16 = fptoui float undef to i16
> > -; AVX-NEXT:  Cost Model: Found an estimated cost of 12 for instruction:
> %V4I16 = fptoui <4 x float> undef to <4 x i16>
> > +; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V4I16 = fptoui <4 x float> undef to <4 x i16>
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V8I16 = fptoui <8 x float> undef to <8 x i16>
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V16I16 = fptoui <16 x float> undef to <16 x i16>
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> > @@ -291,7 +270,7 @@ define i32 @fptoui_float_i16(i32 %arg) {
> >  ;
> >  ; BTVER2-LABEL: 'fptoui_float_i16'
> >  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %I16 = fptoui float undef to i16
> > -; BTVER2-NEXT:  Cost Model: Found an estimated cost of 12 for
> instruction: %V4I16 = fptoui <4 x float> undef to <4 x i16>
> > +; BTVER2-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4I16 = fptoui <4 x float> undef to <4 x i16>
> >  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8I16 = fptoui <8 x float> undef to <8 x i16>
> >  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V16I16 = fptoui <16 x float> undef to <16 x i16>
> >  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> > @@ -314,8 +293,8 @@ define i32 @fptoui_float_i8(i32 %arg) {
> >  ; AVX-LABEL: 'fptoui_float_i8'
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %I8 = fptoui float undef to i8
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 12 for instruction:
> %V4I8 = fptoui <4 x float> undef to <4 x i8>
> > -; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V8I8 = fptoui <8 x float> undef to <8 x i8>
> > -; AVX-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V16I8 = fptoui <16 x float> undef to <16 x i8>
> > +; AVX-NEXT:  Cost Model: Found an estimated cost of 24 for instruction:
> %V8I8 = fptoui <8 x float> undef to <8 x i8>
> > +; AVX-NEXT:  Cost Model: Found an estimated cost of 49 for instruction:
> %V16I8 = fptoui <16 x float> undef to <16 x i8>
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; AVX512-LABEL: 'fptoui_float_i8'
> > @@ -328,8 +307,8 @@ define i32 @fptoui_float_i8(i32 %arg) {
> >  ; BTVER2-LABEL: 'fptoui_float_i8'
> >  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %I8 = fptoui float undef to i8
> >  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 12 for
> instruction: %V4I8 = fptoui <4 x float> undef to <4 x i8>
> > -; BTVER2-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8I8 = fptoui <8 x float> undef to <8 x i8>
> > -; BTVER2-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V16I8 = fptoui <16 x float> undef to <16 x i8>
> > +; BTVER2-NEXT:  Cost Model: Found an estimated cost of 24 for
> instruction: %V8I8 = fptoui <8 x float> undef to <8 x i8>
> > +; BTVER2-NEXT:  Cost Model: Found an estimated cost of 49 for
> instruction: %V16I8 = fptoui <16 x float> undef to <16 x i8>
> >  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >    %I8 = fptoui float undef to i8
> >
> > Modified: llvm/trunk/test/Analysis/CostModel/X86/masked-intrinsic-cost.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/masked-intrinsic-cost.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/Analysis/CostModel/X86/masked-intrinsic-cost.ll
> (original)
> > +++ llvm/trunk/test/Analysis/CostModel/X86/masked-intrinsic-cost.ll Wed
> Aug  7 09:24:26 2019
> > @@ -52,7 +52,7 @@ define i32 @masked_load() {
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 4 for instruction:
> %V16I32 = call <16 x i32> @llvm.masked.load.v16i32.p0v16i32(<16 x i32>*
> undef, i32 1, <16 x i1> undef, <16 x i32> undef)
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V8I32 = call <8 x i32> @llvm.masked.load.v8i32.p0v8i32(<8 x i32>* undef,
> i32 1, <8 x i1> undef, <8 x i32> undef)
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V4I32 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* undef,
> i32 1, <4 x i1> undef, <4 x i32> undef)
> > -; AVX-NEXT:  Cost Model: Found an estimated cost of 4 for instruction:
> %V2I32 = call <2 x i32> @llvm.masked.load.v2i32.p0v2i32(<2 x i32>* undef,
> i32 1, <2 x i1> undef, <2 x i32> undef)
> > +; AVX-NEXT:  Cost Model: Found an estimated cost of 6 for instruction:
> %V2I32 = call <2 x i32> @llvm.masked.load.v2i32.p0v2i32(<2 x i32>* undef,
> i32 1, <2 x i1> undef, <2 x i32> undef)
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 128 for
> instruction: %V32I16 = call <32 x i16>
> @llvm.masked.load.v32i16.p0v32i16(<32 x i16>* undef, i32 1, <32 x i1>
> undef, <32 x i16> undef)
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 64 for instruction:
> %V16I16 = call <16 x i16> @llvm.masked.load.v16i16.p0v16i16(<16 x i16>*
> undef, i32 1, <16 x i1> undef, <16 x i16> undef)
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 32 for instruction:
> %V8I16 = call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* undef,
> i32 1, <8 x i1> undef, <8 x i16> undef)
> > @@ -79,7 +79,7 @@ define i32 @masked_load() {
> >  ; KNL-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V16I32 = call <16 x i32> @llvm.masked.load.v16i32.p0v16i32(<16 x i32>*
> undef, i32 1, <16 x i1> undef, <16 x i32> undef)
> >  ; KNL-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V8I32 = call <8 x i32> @llvm.masked.load.v8i32.p0v8i32(<8 x i32>* undef,
> i32 1, <8 x i1> undef, <8 x i32> undef)
> >  ; KNL-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V4I32 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* undef,
> i32 1, <4 x i1> undef, <4 x i32> undef)
> > -; KNL-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2I32 = call <2 x i32> @llvm.masked.load.v2i32.p0v2i32(<2 x i32>* undef,
> i32 1, <2 x i1> undef, <2 x i32> undef)
> > +; KNL-NEXT:  Cost Model: Found an estimated cost of 5 for instruction:
> %V2I32 = call <2 x i32> @llvm.masked.load.v2i32.p0v2i32(<2 x i32>* undef,
> i32 1, <2 x i1> undef, <2 x i32> undef)
> >  ; KNL-NEXT:  Cost Model: Found an estimated cost of 128 for
> instruction: %V32I16 = call <32 x i16>
> @llvm.masked.load.v32i16.p0v32i16(<32 x i16>* undef, i32 1, <32 x i1>
> undef, <32 x i16> undef)
> >  ; KNL-NEXT:  Cost Model: Found an estimated cost of 64 for instruction:
> %V16I16 = call <16 x i16> @llvm.masked.load.v16i16.p0v16i16(<16 x i16>*
> undef, i32 1, <16 x i1> undef, <16 x i16> undef)
> >  ; KNL-NEXT:  Cost Model: Found an estimated cost of 32 for instruction:
> %V8I16 = call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* undef,
> i32 1, <8 x i1> undef, <8 x i16> undef)
> > @@ -106,15 +106,15 @@ define i32 @masked_load() {
> >  ; SKX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V16I32 = call <16 x i32> @llvm.masked.load.v16i32.p0v16i32(<16 x i32>*
> undef, i32 1, <16 x i1> undef, <16 x i32> undef)
> >  ; SKX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V8I32 = call <8 x i32> @llvm.masked.load.v8i32.p0v8i32(<8 x i32>* undef,
> i32 1, <8 x i1> undef, <8 x i32> undef)
> >  ; SKX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V4I32 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* undef,
> i32 1, <4 x i1> undef, <4 x i32> undef)
> > -; SKX-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2I32 = call <2 x i32> @llvm.masked.load.v2i32.p0v2i32(<2 x i32>* undef,
> i32 1, <2 x i1> undef, <2 x i32> undef)
> > +; SKX-NEXT:  Cost Model: Found an estimated cost of 5 for instruction:
> %V2I32 = call <2 x i32> @llvm.masked.load.v2i32.p0v2i32(<2 x i32>* undef,
> i32 1, <2 x i1> undef, <2 x i32> undef)
> >  ; SKX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V32I16 = call <32 x i16> @llvm.masked.load.v32i16.p0v32i16(<32 x i16>*
> undef, i32 1, <32 x i1> undef, <32 x i16> undef)
> >  ; SKX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V16I16 = call <16 x i16> @llvm.masked.load.v16i16.p0v16i16(<16 x i16>*
> undef, i32 1, <16 x i1> undef, <16 x i16> undef)
> >  ; SKX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V8I16 = call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* undef,
> i32 1, <8 x i1> undef, <8 x i16> undef)
> > -; SKX-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V4I16 = call <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>* undef,
> i32 1, <4 x i1> undef, <4 x i16> undef)
> > +; SKX-NEXT:  Cost Model: Found an estimated cost of 9 for instruction:
> %V4I16 = call <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>* undef,
> i32 1, <4 x i1> undef, <4 x i16> undef)
> >  ; SKX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V64I8 = call <64 x i8> @llvm.masked.load.v64i8.p0v64i8(<64 x i8>* undef,
> i32 1, <64 x i1> undef, <64 x i8> undef)
> >  ; SKX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V32I8 = call <32 x i8> @llvm.masked.load.v32i8.p0v32i8(<32 x i8>* undef,
> i32 1, <32 x i1> undef, <32 x i8> undef)
> >  ; SKX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V16I8 = call <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>* undef,
> i32 1, <16 x i1> undef, <16 x i8> undef)
> > -; SKX-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V8I8 = call <8 x i8> @llvm.masked.load.v8i8.p0v8i8(<8 x i8>* undef, i32 1,
> <8 x i1> undef, <8 x i8> undef)
> > +; SKX-NEXT:  Cost Model: Found an estimated cost of 17 for instruction:
> %V8I8 = call <8 x i8> @llvm.masked.load.v8i8.p0v8i8(<8 x i8>* undef, i32 1,
> <8 x i1> undef, <8 x i8> undef)
> >  ; SKX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 0
> >  ;
> >    %V8F64 = call <8 x double> @llvm.masked.load.v8f64.p0v8f64(<8 x
> double>* undef, i32 1, <8 x i1> undef, <8 x double> undef)
> > @@ -194,7 +194,7 @@ define i32 @masked_store() {
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 16 for instruction:
> call void @llvm.masked.store.v16i32.p0v16i32(<16 x i32> undef, <16 x i32>*
> undef, i32 1, <16 x i1> undef)
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> call void @llvm.masked.store.v8i32.p0v8i32(<8 x i32> undef, <8 x i32>*
> undef, i32 1, <8 x i1> undef)
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> undef, <4 x i32>*
> undef, i32 1, <4 x i1> undef)
> > -; AVX-NEXT:  Cost Model: Found an estimated cost of 10 for instruction:
> call void @llvm.masked.store.v2i32.p0v2i32(<2 x i32> undef, <2 x i32>*
> undef, i32 1, <2 x i1> undef)
> > +; AVX-NEXT:  Cost Model: Found an estimated cost of 12 for instruction:
> call void @llvm.masked.store.v2i32.p0v2i32(<2 x i32> undef, <2 x i32>*
> undef, i32 1, <2 x i1> undef)
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 128 for
> instruction: call void @llvm.masked.store.v32i16.p0v32i16(<32 x i16> undef,
> <32 x i16>* undef, i32 1, <32 x i1> undef)
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 64 for instruction:
> call void @llvm.masked.store.v16i16.p0v16i16(<16 x i16> undef, <16 x i16>*
> undef, i32 1, <16 x i1> undef)
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 32 for instruction:
> call void @llvm.masked.store.v8i16.p0v8i16(<8 x i16> undef, <8 x i16>*
> undef, i32 1, <8 x i1> undef)
> > @@ -221,7 +221,7 @@ define i32 @masked_store() {
> >  ; KNL-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> call void @llvm.masked.store.v16i32.p0v16i32(<16 x i32> undef, <16 x i32>*
> undef, i32 1, <16 x i1> undef)
> >  ; KNL-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> call void @llvm.masked.store.v8i32.p0v8i32(<8 x i32> undef, <8 x i32>*
> undef, i32 1, <8 x i1> undef)
> >  ; KNL-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> undef, <4 x i32>*
> undef, i32 1, <4 x i1> undef)
> > -; KNL-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> call void @llvm.masked.store.v2i32.p0v2i32(<2 x i32> undef, <2 x i32>*
> undef, i32 1, <2 x i1> undef)
> > +; KNL-NEXT:  Cost Model: Found an estimated cost of 5 for instruction:
> call void @llvm.masked.store.v2i32.p0v2i32(<2 x i32> undef, <2 x i32>*
> undef, i32 1, <2 x i1> undef)
> >  ; KNL-NEXT:  Cost Model: Found an estimated cost of 128 for
> instruction: call void @llvm.masked.store.v32i16.p0v32i16(<32 x i16> undef,
> <32 x i16>* undef, i32 1, <32 x i1> undef)
> >  ; KNL-NEXT:  Cost Model: Found an estimated cost of 64 for instruction:
> call void @llvm.masked.store.v16i16.p0v16i16(<16 x i16> undef, <16 x i16>*
> undef, i32 1, <16 x i1> undef)
> >  ; KNL-NEXT:  Cost Model: Found an estimated cost of 32 for instruction:
> call void @llvm.masked.store.v8i16.p0v8i16(<8 x i16> undef, <8 x i16>*
> undef, i32 1, <8 x i1> undef)
> > @@ -248,15 +248,15 @@ define i32 @masked_store() {
> >  ; SKX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> call void @llvm.masked.store.v16i32.p0v16i32(<16 x i32> undef, <16 x i32>*
> undef, i32 1, <16 x i1> undef)
> >  ; SKX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> call void @llvm.masked.store.v8i32.p0v8i32(<8 x i32> undef, <8 x i32>*
> undef, i32 1, <8 x i1> undef)
> >  ; SKX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> undef, <4 x i32>*
> undef, i32 1, <4 x i1> undef)
> > -; SKX-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> call void @llvm.masked.store.v2i32.p0v2i32(<2 x i32> undef, <2 x i32>*
> undef, i32 1, <2 x i1> undef)
> > +; SKX-NEXT:  Cost Model: Found an estimated cost of 5 for instruction:
> call void @llvm.masked.store.v2i32.p0v2i32(<2 x i32> undef, <2 x i32>*
> undef, i32 1, <2 x i1> undef)
> >  ; SKX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> call void @llvm.masked.store.v32i16.p0v32i16(<32 x i16> undef, <32 x i16>*
> undef, i32 1, <32 x i1> undef)
> >  ; SKX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> call void @llvm.masked.store.v16i16.p0v16i16(<16 x i16> undef, <16 x i16>*
> undef, i32 1, <16 x i1> undef)
> >  ; SKX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> call void @llvm.masked.store.v8i16.p0v8i16(<8 x i16> undef, <8 x i16>*
> undef, i32 1, <8 x i1> undef)
> > -; SKX-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> call void @llvm.masked.store.v4i16.p0v4i16(<4 x i16> undef, <4 x i16>*
> undef, i32 1, <4 x i1> undef)
> > +; SKX-NEXT:  Cost Model: Found an estimated cost of 9 for instruction:
> call void @llvm.masked.store.v4i16.p0v4i16(<4 x i16> undef, <4 x i16>*
> undef, i32 1, <4 x i1> undef)
> >  ; SKX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> call void @llvm.masked.store.v64i8.p0v64i8(<64 x i8> undef, <64 x i8>*
> undef, i32 1, <64 x i1> undef)
> >  ; SKX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> call void @llvm.masked.store.v32i8.p0v32i8(<32 x i8> undef, <32 x i8>*
> undef, i32 1, <32 x i1> undef)
> >  ; SKX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> call void @llvm.masked.store.v16i8.p0v16i8(<16 x i8> undef, <16 x i8>*
> undef, i32 1, <16 x i1> undef)
> > -; SKX-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> call void @llvm.masked.store.v8i8.p0v8i8(<8 x i8> undef, <8 x i8>* undef,
> i32 1, <8 x i1> undef)
> > +; SKX-NEXT:  Cost Model: Found an estimated cost of 17 for instruction:
> call void @llvm.masked.store.v8i8.p0v8i8(<8 x i8> undef, <8 x i8>* undef,
> i32 1, <8 x i1> undef)
> >  ; SKX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 0
> >  ;
> >    call void @llvm.masked.store.v8f64.p0v8f64(<8 x double> undef, <8 x
> double>* undef, i32 1, <8 x i1> undef)
> > @@ -960,15 +960,10 @@ define <8 x float> @test4(<8 x i32> %tri
> >  }
> >
> >  define void @test5(<2 x i32> %trigger, <2 x float>* %addr, <2 x float>
> %val) {
> > -; SSE2-LABEL: 'test5'
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 7 for instruction:
> call void @llvm.masked.store.v2f32.p0v2f32(<2 x float> %val, <2 x float>*
> %addr, i32 4, <2 x i1> %mask)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret void
> > -;
> > -; SSE42-LABEL: 'test5'
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 7 for
> instruction: call void @llvm.masked.store.v2f32.p0v2f32(<2 x float> %val,
> <2 x float>* %addr, i32 4, <2 x i1> %mask)
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret void
> > +; SSE-LABEL: 'test5'
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 7 for instruction:
> call void @llvm.masked.store.v2f32.p0v2f32(<2 x float> %val, <2 x float>*
> %addr, i32 4, <2 x i1> %mask)
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret void
> >  ;
> >  ; AVX-LABEL: 'test5'
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> > @@ -986,24 +981,19 @@ define void @test5(<2 x i32> %trigger, <
> >  }
> >
> >  define void @test6(<2 x i32> %trigger, <2 x i32>* %addr, <2 x i32>
> %val) {
> > -; SSE2-LABEL: 'test6'
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> call void @llvm.masked.store.v2i32.p0v2i32(<2 x i32> %val, <2 x i32>*
> %addr, i32 4, <2 x i1> %mask)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret void
> > -;
> > -; SSE42-LABEL: 'test6'
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 8 for
> instruction: call void @llvm.masked.store.v2i32.p0v2i32(<2 x i32> %val, <2
> x i32>* %addr, i32 4, <2 x i1> %mask)
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret void
> > +; SSE-LABEL: 'test6'
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> call void @llvm.masked.store.v2i32.p0v2i32(<2 x i32> %val, <2 x i32>*
> %addr, i32 4, <2 x i1> %mask)
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret void
> >  ;
> >  ; AVX-LABEL: 'test6'
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> > -; AVX-NEXT:  Cost Model: Found an estimated cost of 10 for instruction:
> call void @llvm.masked.store.v2i32.p0v2i32(<2 x i32> %val, <2 x i32>*
> %addr, i32 4, <2 x i1> %mask)
> > +; AVX-NEXT:  Cost Model: Found an estimated cost of 12 for instruction:
> call void @llvm.masked.store.v2i32.p0v2i32(<2 x i32> %val, <2 x i32>*
> %addr, i32 4, <2 x i1> %mask)
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret void
> >  ;
> >  ; AVX512-LABEL: 'test6'
> >  ; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> > -; AVX512-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: call void @llvm.masked.store.v2i32.p0v2i32(<2 x i32> %val, <2
> x i32>* %addr, i32 4, <2 x i1> %mask)
> > +; AVX512-NEXT:  Cost Model: Found an estimated cost of 5 for
> instruction: call void @llvm.masked.store.v2i32.p0v2i32(<2 x i32> %val, <2
> x i32>* %addr, i32 4, <2 x i1> %mask)
> >  ; AVX512-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret void
> >  ;
> >    %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> > @@ -1012,15 +1002,10 @@ define void @test6(<2 x i32> %trigger, <
> >  }
> >
> >  define <2 x float> @test7(<2 x i32> %trigger, <2 x float>* %addr, <2 x
> float> %dst) {
> > -; SSE2-LABEL: 'test7'
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 7 for instruction:
> %res = call <2 x float> @llvm.masked.load.v2f32.p0v2f32(<2 x float>* %addr,
> i32 4, <2 x i1> %mask, <2 x float> %dst)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret <2 x float> %res
> > -;
> > -; SSE42-LABEL: 'test7'
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 7 for
> instruction: %res = call <2 x float> @llvm.masked.load.v2f32.p0v2f32(<2 x
> float>* %addr, i32 4, <2 x i1> %mask, <2 x float> %dst)
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret <2 x float> %res
> > +; SSE-LABEL: 'test7'
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 7 for instruction:
> %res = call <2 x float> @llvm.masked.load.v2f32.p0v2f32(<2 x float>* %addr,
> i32 4, <2 x i1> %mask, <2 x float> %dst)
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret <2 x float> %res
> >  ;
> >  ; AVX-LABEL: 'test7'
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> > @@ -1038,24 +1023,19 @@ define <2 x float> @test7(<2 x i32> %tri
> >  }
> >
> >  define <2 x i32> @test8(<2 x i32> %trigger, <2 x i32>* %addr, <2 x i32>
> %dst) {
> > -; SSE2-LABEL: 'test8'
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %res = call <2 x i32> @llvm.masked.load.v2i32.p0v2i32(<2 x i32>* %addr, i32
> 4, <2 x i1> %mask, <2 x i32> %dst)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret <2 x i32> %res
> > -;
> > -; SSE42-LABEL: 'test8'
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 8 for
> instruction: %res = call <2 x i32> @llvm.masked.load.v2i32.p0v2i32(<2 x
> i32>* %addr, i32 4, <2 x i1> %mask, <2 x i32> %dst)
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret <2 x i32> %res
> > +; SSE-LABEL: 'test8'
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %res = call <2 x i32> @llvm.masked.load.v2i32.p0v2i32(<2 x i32>* %addr, i32
> 4, <2 x i1> %mask, <2 x i32> %dst)
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret <2 x i32> %res
> >  ;
> >  ; AVX-LABEL: 'test8'
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> > -; AVX-NEXT:  Cost Model: Found an estimated cost of 4 for instruction:
> %res = call <2 x i32> @llvm.masked.load.v2i32.p0v2i32(<2 x i32>* %addr, i32
> 4, <2 x i1> %mask, <2 x i32> %dst)
> > +; AVX-NEXT:  Cost Model: Found an estimated cost of 6 for instruction:
> %res = call <2 x i32> @llvm.masked.load.v2i32.p0v2i32(<2 x i32>* %addr, i32
> 4, <2 x i1> %mask, <2 x i32> %dst)
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret <2 x i32> %res
> >  ;
> >  ; AVX512-LABEL: 'test8'
> >  ; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> > -; AVX512-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %res = call <2 x i32> @llvm.masked.load.v2i32.p0v2i32(<2 x
> i32>* %addr, i32 4, <2 x i1> %mask, <2 x i32> %dst)
> > +; AVX512-NEXT:  Cost Model: Found an estimated cost of 5 for
> instruction: %res = call <2 x i32> @llvm.masked.load.v2i32.p0v2i32(<2 x
> i32>* %addr, i32 4, <2 x i1> %mask, <2 x i32> %dst)
> >  ; AVX512-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret <2 x i32> %res
> >  ;
> >    %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> >
> > Removed: llvm/trunk/test/Analysis/CostModel/X86/reduce-add-widen.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/reduce-add-widen.ll?rev=368182&view=auto
> >
> ==============================================================================
> > --- llvm/trunk/test/Analysis/CostModel/X86/reduce-add-widen.ll (original)
> > +++ llvm/trunk/test/Analysis/CostModel/X86/reduce-add-widen.ll (removed)
> > @@ -1,307 +0,0 @@
> > -; NOTE: Assertions have been autogenerated by
> utils/update_analyze_test_checks.py
> > -; RUN: opt < %s -x86-experimental-vector-widening-legalization
> -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+sse2 | FileCheck
> %s --check-prefixes=CHECK,SSE,SSE2
> > -; RUN: opt < %s -x86-experimental-vector-widening-legalization
> -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+ssse3 | FileCheck
> %s --check-prefixes=CHECK,SSE,SSSE3
> > -; RUN: opt < %s -x86-experimental-vector-widening-legalization
> -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+sse4.2 |
> FileCheck %s --check-prefixes=CHECK,SSE,SSE42
> > -; RUN: opt < %s -x86-experimental-vector-widening-legalization
> -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+avx | FileCheck
> %s --check-prefixes=CHECK,AVX,AVX1
> > -; RUN: opt < %s -x86-experimental-vector-widening-legalization
> -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+avx2 | FileCheck
> %s --check-prefixes=CHECK,AVX,AVX2
> > -; RUN: opt < %s -x86-experimental-vector-widening-legalization
> -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+avx512f |
> FileCheck %s --check-prefixes=CHECK,AVX512,AVX512F
> > -; RUN: opt < %s -x86-experimental-vector-widening-legalization
> -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+avx512f,+avx512bw
> | FileCheck %s --check-prefixes=CHECK,AVX512,AVX512BW
> > -; RUN: opt < %s -x86-experimental-vector-widening-legalization
> -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+avx512f,+avx512dq
> | FileCheck %s --check-prefixes=CHECK,AVX512,AVX512DQ
> > -
> > -define i32 @reduce_i64(i32 %arg) {
> > -; SSE2-LABEL: 'reduce_i64'
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> %V1 = call i64 @llvm.experimental.vector.reduce.add.v1i64(<1 x i64> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2 = call i64 @llvm.experimental.vector.reduce.add.v2i64(<2 x i64> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 4 for instruction:
> %V4 = call i64 @llvm.experimental.vector.reduce.add.v4i64(<4 x i64> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction:
> %V8 = call i64 @llvm.experimental.vector.reduce.add.v8i64(<8 x i64> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 10 for
> instruction: %V16 = call i64
> @llvm.experimental.vector.reduce.add.v16i64(<16 x i64> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> > -;
> > -; SSSE3-LABEL: 'reduce_i64'
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: %V1 = call i64 @llvm.experimental.vector.reduce.add.v1i64(<1 x
> i64> undef)
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i64 @llvm.experimental.vector.reduce.add.v2i64(<2 x
> i64> undef)
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 4 for
> instruction: %V4 = call i64 @llvm.experimental.vector.reduce.add.v4i64(<4 x
> i64> undef)
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for
> instruction: %V8 = call i64 @llvm.experimental.vector.reduce.add.v8i64(<8 x
> i64> undef)
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 10 for
> instruction: %V16 = call i64
> @llvm.experimental.vector.reduce.add.v16i64(<16 x i64> undef)
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> > -;
> > -; SSE42-LABEL: 'reduce_i64'
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: %V1 = call i64 @llvm.experimental.vector.reduce.add.v1i64(<1 x
> i64> undef)
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V2 = call i64 @llvm.experimental.vector.reduce.add.v2i64(<2 x
> i64> undef)
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 4 for
> instruction: %V4 = call i64 @llvm.experimental.vector.reduce.add.v4i64(<4 x
> i64> undef)
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 8 for
> instruction: %V8 = call i64 @llvm.experimental.vector.reduce.add.v8i64(<8 x
> i64> undef)
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 16 for
> instruction: %V16 = call i64
> @llvm.experimental.vector.reduce.add.v16i64(<16 x i64> undef)
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> > -;
> > -; AVX-LABEL: 'reduce_i64'
> > -; AVX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> %V1 = call i64 @llvm.experimental.vector.reduce.add.v1i64(<1 x i64> undef)
> > -; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V2 = call i64 @llvm.experimental.vector.reduce.add.v2i64(<2 x i64> undef)
> > -; AVX-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V4 = call i64 @llvm.experimental.vector.reduce.add.v4i64(<4 x i64> undef)
> > -; AVX-NEXT:  Cost Model: Found an estimated cost of 6 for instruction:
> %V8 = call i64 @llvm.experimental.vector.reduce.add.v8i64(<8 x i64> undef)
> > -; AVX-NEXT:  Cost Model: Found an estimated cost of 12 for instruction:
> %V16 = call i64 @llvm.experimental.vector.reduce.add.v16i64(<16 x i64>
> undef)
> > -; AVX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> > -;
> > -; AVX512-LABEL: 'reduce_i64'
> > -; AVX512-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: %V1 = call i64 @llvm.experimental.vector.reduce.add.v1i64(<1 x
> i64> undef)
> > -; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V2 = call i64 @llvm.experimental.vector.reduce.add.v2i64(<2 x
> i64> undef)
> > -; AVX512-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V4 = call i64 @llvm.experimental.vector.reduce.add.v4i64(<4 x
> i64> undef)
> > -; AVX512-NEXT:  Cost Model: Found an estimated cost of 7 for
> instruction: %V8 = call i64 @llvm.experimental.vector.reduce.add.v8i64(<8 x
> i64> undef)
> > -; AVX512-NEXT:  Cost Model: Found an estimated cost of 8 for
> instruction: %V16 = call i64
> @llvm.experimental.vector.reduce.add.v16i64(<16 x i64> undef)
> > -; AVX512-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> > -;
> > -  %V1  = call i64 @llvm.experimental.vector.reduce.add.v1i64(<1 x i64>
> undef)
> > -  %V2  = call i64 @llvm.experimental.vector.reduce.add.v2i64(<2 x i64>
> undef)
> > -  %V4  = call i64 @llvm.experimental.vector.reduce.add.v4i64(<4 x i64>
> undef)
> > -  %V8  = call i64 @llvm.experimental.vector.reduce.add.v8i64(<8 x i64>
> undef)
> > -  %V16 = call i64 @llvm.experimental.vector.reduce.add.v16i64(<16 x
> i64> undef)
> > -  ret i32 undef
> > -}
> > -
> > -define i32 @reduce_i32(i32 %arg) {
> > -; SSE2-LABEL: 'reduce_i32'
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2 = call i32 @llvm.experimental.vector.reduce.add.v2i32(<2 x i32> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 5 for instruction:
> %V4 = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction:
> %V8 = call i32 @llvm.experimental.vector.reduce.add.v8i32(<8 x i32> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %V16 = call i32 @llvm.experimental.vector.reduce.add.v16i32(<16 x i32>
> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 12 for
> instruction: %V32 = call i32
> @llvm.experimental.vector.reduce.add.v32i32(<32 x i32> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> > -;
> > -; SSSE3-LABEL: 'reduce_i32'
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i32 @llvm.experimental.vector.reduce.add.v2i32(<2 x
> i32> undef)
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 5 for
> instruction: %V4 = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x
> i32> undef)
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for
> instruction: %V8 = call i32 @llvm.experimental.vector.reduce.add.v8i32(<8 x
> i32> undef)
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for
> instruction: %V16 = call i32
> @llvm.experimental.vector.reduce.add.v16i32(<16 x i32> undef)
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 12 for
> instruction: %V32 = call i32
> @llvm.experimental.vector.reduce.add.v32i32(<32 x i32> undef)
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> > -;
> > -; SSE42-LABEL: 'reduce_i32'
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i32 @llvm.experimental.vector.reduce.add.v2i32(<2 x
> i32> undef)
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V4 = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x
> i32> undef)
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 6 for
> instruction: %V8 = call i32 @llvm.experimental.vector.reduce.add.v8i32(<8 x
> i32> undef)
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 12 for
> instruction: %V16 = call i32
> @llvm.experimental.vector.reduce.add.v16i32(<16 x i32> undef)
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 24 for
> instruction: %V32 = call i32
> @llvm.experimental.vector.reduce.add.v32i32(<32 x i32> undef)
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> > -;
> > -; AVX-LABEL: 'reduce_i32'
> > -; AVX-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2 = call i32 @llvm.experimental.vector.reduce.add.v2i32(<2 x i32> undef)
> > -; AVX-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V4 = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> undef)
> > -; AVX-NEXT:  Cost Model: Found an estimated cost of 5 for instruction:
> %V8 = call i32 @llvm.experimental.vector.reduce.add.v8i32(<8 x i32> undef)
> > -; AVX-NEXT:  Cost Model: Found an estimated cost of 10 for instruction:
> %V16 = call i32 @llvm.experimental.vector.reduce.add.v16i32(<16 x i32>
> undef)
> > -; AVX-NEXT:  Cost Model: Found an estimated cost of 20 for instruction:
> %V32 = call i32 @llvm.experimental.vector.reduce.add.v32i32(<32 x i32>
> undef)
> > -; AVX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> > -;
> > -; AVX512-LABEL: 'reduce_i32'
> > -; AVX512-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i32 @llvm.experimental.vector.reduce.add.v2i32(<2 x
> i32> undef)
> > -; AVX512-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V4 = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x
> i32> undef)
> > -; AVX512-NEXT:  Cost Model: Found an estimated cost of 5 for
> instruction: %V8 = call i32 @llvm.experimental.vector.reduce.add.v8i32(<8 x
> i32> undef)
> > -; AVX512-NEXT:  Cost Model: Found an estimated cost of 9 for
> instruction: %V16 = call i32
> @llvm.experimental.vector.reduce.add.v16i32(<16 x i32> undef)
> > -; AVX512-NEXT:  Cost Model: Found an estimated cost of 10 for
> instruction: %V32 = call i32
> @llvm.experimental.vector.reduce.add.v32i32(<32 x i32> undef)
> > -; AVX512-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> > -;
> > -  %V2  = call i32 @llvm.experimental.vector.reduce.add.v2i32(<2 x i32>
> undef)
> > -  %V4  = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32>
> undef)
> > -  %V8  = call i32 @llvm.experimental.vector.reduce.add.v8i32(<8 x i32>
> undef)
> > -  %V16 = call i32 @llvm.experimental.vector.reduce.add.v16i32(<16 x
> i32> undef)
> > -  %V32 = call i32 @llvm.experimental.vector.reduce.add.v32i32(<32 x
> i32> undef)
> > -  ret i32 undef
> > -}
> > -
> > -define i32 @reduce_i16(i32 %arg) {
> > -; SSE2-LABEL: 'reduce_i16'
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 7 for instruction:
> %V2 = call i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x i16> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 13 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.add.v4i16(<4 x
> i16> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 19 for
> instruction: %V8 = call i16 @llvm.experimental.vector.reduce.add.v8i16(<8 x
> i16> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 20 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.add.v16i16(<16 x i16> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 22 for
> instruction: %V32 = call i16
> @llvm.experimental.vector.reduce.add.v32i16(<32 x i16> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 26 for
> instruction: %V64 = call i16
> @llvm.experimental.vector.reduce.add.v64i16(<64 x i16> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> > -;
> > -; SSSE3-LABEL: 'reduce_i16'
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x
> i16> undef)
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 5 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.add.v4i16(<4 x
> i16> undef)
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 7 for
> instruction: %V8 = call i16 @llvm.experimental.vector.reduce.add.v8i16(<8 x
> i16> undef)
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.add.v16i16(<16 x i16> undef)
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 10 for
> instruction: %V32 = call i16
> @llvm.experimental.vector.reduce.add.v32i16(<32 x i16> undef)
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 14 for
> instruction: %V64 = call i16
> @llvm.experimental.vector.reduce.add.v64i16(<64 x i16> undef)
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> > -;
> > -; SSE42-LABEL: 'reduce_i16'
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 4 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x
> i16> undef)
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 4 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.add.v4i16(<4 x
> i16> undef)
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 4 for
> instruction: %V8 = call i16 @llvm.experimental.vector.reduce.add.v8i16(<8 x
> i16> undef)
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 8 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.add.v16i16(<16 x i16> undef)
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 16 for
> instruction: %V32 = call i16
> @llvm.experimental.vector.reduce.add.v32i16(<32 x i16> undef)
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 32 for
> instruction: %V64 = call i16
> @llvm.experimental.vector.reduce.add.v64i16(<64 x i16> undef)
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> > -;
> > -; AVX1-LABEL: 'reduce_i16'
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction:
> %V2 = call i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x i16> undef)
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction:
> %V4 = call i16 @llvm.experimental.vector.reduce.add.v4i16(<4 x i16> undef)
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction:
> %V8 = call i16 @llvm.experimental.vector.reduce.add.v8i16(<8 x i16> undef)
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 49 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.add.v16i16(<16 x i16> undef)
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 53 for
> instruction: %V32 = call i16
> @llvm.experimental.vector.reduce.add.v32i16(<32 x i16> undef)
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 61 for
> instruction: %V64 = call i16
> @llvm.experimental.vector.reduce.add.v64i16(<64 x i16> undef)
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> > -;
> > -; AVX2-LABEL: 'reduce_i16'
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 4 for instruction:
> %V2 = call i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x i16> undef)
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 4 for instruction:
> %V4 = call i16 @llvm.experimental.vector.reduce.add.v4i16(<4 x i16> undef)
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 4 for instruction:
> %V8 = call i16 @llvm.experimental.vector.reduce.add.v8i16(<8 x i16> undef)
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 21 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.add.v16i16(<16 x i16> undef)
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 22 for
> instruction: %V32 = call i16
> @llvm.experimental.vector.reduce.add.v32i16(<32 x i16> undef)
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 24 for
> instruction: %V64 = call i16
> @llvm.experimental.vector.reduce.add.v64i16(<64 x i16> undef)
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> > -;
> > -; AVX512F-LABEL: 'reduce_i16'
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 4 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x
> i16> undef)
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 4 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.add.v4i16(<4 x
> i16> undef)
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 4 for
> instruction: %V8 = call i16 @llvm.experimental.vector.reduce.add.v8i16(<8 x
> i16> undef)
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 21 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.add.v16i16(<16 x i16> undef)
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 22 for
> instruction: %V32 = call i16
> @llvm.experimental.vector.reduce.add.v32i16(<32 x i16> undef)
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 24 for
> instruction: %V64 = call i16
> @llvm.experimental.vector.reduce.add.v64i16(<64 x i16> undef)
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> > -;
> > -; AVX512BW-LABEL: 'reduce_i16'
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 4 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x
> i16> undef)
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 4 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.add.v4i16(<4 x
> i16> undef)
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 4 for
> instruction: %V8 = call i16 @llvm.experimental.vector.reduce.add.v8i16(<8 x
> i16> undef)
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 9 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.add.v16i16(<16 x i16> undef)
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 11 for
> instruction: %V32 = call i16
> @llvm.experimental.vector.reduce.add.v32i16(<32 x i16> undef)
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 12 for
> instruction: %V64 = call i16
> @llvm.experimental.vector.reduce.add.v64i16(<64 x i16> undef)
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> > -;
> > -; AVX512DQ-LABEL: 'reduce_i16'
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 4 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x
> i16> undef)
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 4 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.add.v4i16(<4 x
> i16> undef)
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 4 for
> instruction: %V8 = call i16 @llvm.experimental.vector.reduce.add.v8i16(<8 x
> i16> undef)
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 21 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.add.v16i16(<16 x i16> undef)
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 22 for
> instruction: %V32 = call i16
> @llvm.experimental.vector.reduce.add.v32i16(<32 x i16> undef)
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 24 for
> instruction: %V64 = call i16
> @llvm.experimental.vector.reduce.add.v64i16(<64 x i16> undef)
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> > -;
> > -  %V2  = call i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x i16>
> undef)
> > -  %V4  = call i16 @llvm.experimental.vector.reduce.add.v4i16(<4 x i16>
> undef)
> > -  %V8  = call i16 @llvm.experimental.vector.reduce.add.v8i16(<8 x i16>
> undef)
> > -  %V16 = call i16 @llvm.experimental.vector.reduce.add.v16i16(<16 x
> i16> undef)
> > -  %V32 = call i16 @llvm.experimental.vector.reduce.add.v32i16(<32 x
> i16> undef)
> > -  %V64 = call i16 @llvm.experimental.vector.reduce.add.v64i16(<64 x
> i16> undef)
> > -  ret i32 undef
> > -}
> > -
> > -define i32 @reduce_i8(i32 %arg) {
> > -; SSE2-LABEL: 'reduce_i8'
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 12 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x
> i8> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 23 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x
> i8> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 34 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x
> i8> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 45 for
> instruction: %V16 = call i8 @llvm.experimental.vector.reduce.add.v16i8(<16
> x i8> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 46 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.add.v32i8(<32
> x i8> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 48 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.add.v64i8(<64
> x i8> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 52 for
> instruction: %V128 = call i8
> @llvm.experimental.vector.reduce.add.v128i8(<128 x i8> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> > -;
> > -; SSSE3-LABEL: 'reduce_i8'
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x
> i8> undef)
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 5 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x
> i8> undef)
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 7 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x
> i8> undef)
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 9 for
> instruction: %V16 = call i8 @llvm.experimental.vector.reduce.add.v16i8(<16
> x i8> undef)
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 10 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.add.v32i8(<32
> x i8> undef)
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 12 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.add.v64i8(<64
> x i8> undef)
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 16 for
> instruction: %V128 = call i8
> @llvm.experimental.vector.reduce.add.v128i8(<128 x i8> undef)
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> > -;
> > -; SSE42-LABEL: 'reduce_i8'
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x
> i8> undef)
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 5 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x
> i8> undef)
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 7 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x
> i8> undef)
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 9 for
> instruction: %V16 = call i8 @llvm.experimental.vector.reduce.add.v16i8(<16
> x i8> undef)
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 10 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.add.v32i8(<32
> x i8> undef)
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 12 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.add.v64i8(<64
> x i8> undef)
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 16 for
> instruction: %V128 = call i8
> @llvm.experimental.vector.reduce.add.v128i8(<128 x i8> undef)
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> > -;
> > -; AVX1-LABEL: 'reduce_i8'
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x i8> undef)
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 5 for instruction:
> %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x i8> undef)
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 7 for instruction:
> %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x i8> undef)
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 9 for instruction:
> %V16 = call i8 @llvm.experimental.vector.reduce.add.v16i8(<16 x i8> undef)
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 61 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.add.v32i8(<32
> x i8> undef)
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 65 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.add.v64i8(<64
> x i8> undef)
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 73 for
> instruction: %V128 = call i8
> @llvm.experimental.vector.reduce.add.v128i8(<128 x i8> undef)
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> > -;
> > -; AVX2-LABEL: 'reduce_i8'
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x i8> undef)
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 5 for instruction:
> %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x i8> undef)
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 7 for instruction:
> %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x i8> undef)
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 9 for instruction:
> %V16 = call i8 @llvm.experimental.vector.reduce.add.v16i8(<16 x i8> undef)
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 26 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.add.v32i8(<32
> x i8> undef)
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 27 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.add.v64i8(<64
> x i8> undef)
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 29 for
> instruction: %V128 = call i8
> @llvm.experimental.vector.reduce.add.v128i8(<128 x i8> undef)
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> > -;
> > -; AVX512F-LABEL: 'reduce_i8'
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x
> i8> undef)
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 5 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x
> i8> undef)
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 7 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x
> i8> undef)
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 9 for
> instruction: %V16 = call i8 @llvm.experimental.vector.reduce.add.v16i8(<16
> x i8> undef)
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 26 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.add.v32i8(<32
> x i8> undef)
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 27 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.add.v64i8(<64
> x i8> undef)
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 29 for
> instruction: %V128 = call i8
> @llvm.experimental.vector.reduce.add.v128i8(<128 x i8> undef)
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> > -;
> > -; AVX512BW-LABEL: 'reduce_i8'
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x
> i8> undef)
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 5 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x
> i8> undef)
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 7 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x
> i8> undef)
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 9 for
> instruction: %V16 = call i8 @llvm.experimental.vector.reduce.add.v16i8(<16
> x i8> undef)
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 21 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.add.v32i8(<32
> x i8> undef)
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 55 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.add.v64i8(<64
> x i8> undef)
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 56 for
> instruction: %V128 = call i8
> @llvm.experimental.vector.reduce.add.v128i8(<128 x i8> undef)
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> > -;
> > -; AVX512DQ-LABEL: 'reduce_i8'
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x
> i8> undef)
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 5 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x
> i8> undef)
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 7 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x
> i8> undef)
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 9 for
> instruction: %V16 = call i8 @llvm.experimental.vector.reduce.add.v16i8(<16
> x i8> undef)
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 26 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.add.v32i8(<32
> x i8> undef)
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 27 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.add.v64i8(<64
> x i8> undef)
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 29 for
> instruction: %V128 = call i8
> @llvm.experimental.vector.reduce.add.v128i8(<128 x i8> undef)
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> > -;
> > -  %V2   = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x i8>
> undef)
> > -  %V4   = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x i8>
> undef)
> > -  %V8   = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x i8>
> undef)
> > -  %V16  = call i8 @llvm.experimental.vector.reduce.add.v16i8(<16 x i8>
> undef)
> > -  %V32  = call i8 @llvm.experimental.vector.reduce.add.v32i8(<32 x i8>
> undef)
> > -  %V64  = call i8 @llvm.experimental.vector.reduce.add.v64i8(<64 x i8>
> undef)
> > -  %V128 = call i8 @llvm.experimental.vector.reduce.add.v128i8(<128 x
> i8> undef)
> > -  ret i32 undef
> > -}
> > -
> > -declare i64 @llvm.experimental.vector.reduce.add.v1i64(<1 x i64>)
> > -declare i64 @llvm.experimental.vector.reduce.add.v2i64(<2 x i64>)
> > -declare i64 @llvm.experimental.vector.reduce.add.v4i64(<4 x i64>)
> > -declare i64 @llvm.experimental.vector.reduce.add.v8i64(<8 x i64>)
> > -declare i64 @llvm.experimental.vector.reduce.add.v16i64(<16 x i64>)
> > -
> > -declare i32 @llvm.experimental.vector.reduce.add.v2i32(<2 x i32>)
> > -declare i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32>)
> > -declare i32 @llvm.experimental.vector.reduce.add.v8i32(<8 x i32>)
> > -declare i32 @llvm.experimental.vector.reduce.add.v16i32(<16 x i32>)
> > -declare i32 @llvm.experimental.vector.reduce.add.v32i32(<32 x i32>)
> > -
> > -declare i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x i16>)
> > -declare i16 @llvm.experimental.vector.reduce.add.v4i16(<4 x i16>)
> > -declare i16 @llvm.experimental.vector.reduce.add.v8i16(<8 x i16>)
> > -declare i16 @llvm.experimental.vector.reduce.add.v16i16(<16 x i16>)
> > -declare i16 @llvm.experimental.vector.reduce.add.v32i16(<32 x i16>)
> > -declare i16 @llvm.experimental.vector.reduce.add.v64i16(<64 x i16>)
> > -
> > -declare i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x i8>)
> > -declare i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x i8>)
> > -declare i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x i8>)
> > -declare i8 @llvm.experimental.vector.reduce.add.v16i8(<16 x i8>)
> > -declare i8 @llvm.experimental.vector.reduce.add.v32i8(<32 x i8>)
> > -declare i8 @llvm.experimental.vector.reduce.add.v64i8(<64 x i8>)
> > -declare i8 @llvm.experimental.vector.reduce.add.v128i8(<128 x i8>)
> >
> > Modified: llvm/trunk/test/Analysis/CostModel/X86/reduce-add.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/reduce-add.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/Analysis/CostModel/X86/reduce-add.ll (original)
> > +++ llvm/trunk/test/Analysis/CostModel/X86/reduce-add.ll Wed Aug  7
> 09:24:26 2019
> > @@ -83,7 +83,7 @@ define i32 @reduce_i32(i32 %arg) {
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; AVX-LABEL: 'reduce_i32'
> > -; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V2 = call i32 @llvm.experimental.vector.reduce.add.v2i32(<2 x i32> undef)
> > +; AVX-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V2 = call i32 @llvm.experimental.vector.reduce.add.v2i32(<2 x i32> undef)
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V4 = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> undef)
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 5 for instruction:
> %V8 = call i32 @llvm.experimental.vector.reduce.add.v8i32(<8 x i32> undef)
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 10 for instruction:
> %V16 = call i32 @llvm.experimental.vector.reduce.add.v16i32(<16 x i32>
> undef)
> > @@ -91,7 +91,7 @@ define i32 @reduce_i32(i32 %arg) {
> >  ; AVX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; AVX512-LABEL: 'reduce_i32'
> > -; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V2 = call i32 @llvm.experimental.vector.reduce.add.v2i32(<2 x
> i32> undef)
> > +; AVX512-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V2 = call i32 @llvm.experimental.vector.reduce.add.v2i32(<2 x
> i32> undef)
> >  ; AVX512-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V4 = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x
> i32> undef)
> >  ; AVX512-NEXT:  Cost Model: Found an estimated cost of 5 for
> instruction: %V8 = call i32 @llvm.experimental.vector.reduce.add.v8i32(<8 x
> i32> undef)
> >  ; AVX512-NEXT:  Cost Model: Found an estimated cost of 9 for
> instruction: %V16 = call i32
> @llvm.experimental.vector.reduce.add.v16i32(<16 x i32> undef)
> > @@ -108,8 +108,8 @@ define i32 @reduce_i32(i32 %arg) {
> >
> >  define i32 @reduce_i16(i32 %arg) {
> >  ; SSE2-LABEL: 'reduce_i16'
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2 = call i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x i16> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 5 for instruction:
> %V4 = call i16 @llvm.experimental.vector.reduce.add.v4i16(<4 x i16> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 7 for instruction:
> %V2 = call i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x i16> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 13 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.add.v4i16(<4 x
> i16> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 19 for
> instruction: %V8 = call i16 @llvm.experimental.vector.reduce.add.v8i16(<8 x
> i16> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 20 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.add.v16i16(<16 x i16> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 22 for
> instruction: %V32 = call i16
> @llvm.experimental.vector.reduce.add.v32i16(<32 x i16> undef)
> > @@ -135,7 +135,7 @@ define i32 @reduce_i16(i32 %arg) {
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; AVX1-LABEL: 'reduce_i16'
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V2 = call i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x i16> undef)
> > +; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V2 = call i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x i16> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V4 = call i16 @llvm.experimental.vector.reduce.add.v4i16(<4 x i16> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction:
> %V8 = call i16 @llvm.experimental.vector.reduce.add.v8i16(<8 x i16> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 49 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.add.v16i16(<16 x i16> undef)
> > @@ -144,7 +144,7 @@ define i32 @reduce_i16(i32 %arg) {
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; AVX2-LABEL: 'reduce_i16'
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V2 = call i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x i16> undef)
> > +; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V2 = call i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x i16> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V4 = call i16 @llvm.experimental.vector.reduce.add.v4i16(<4 x i16> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 4 for instruction:
> %V8 = call i16 @llvm.experimental.vector.reduce.add.v8i16(<8 x i16> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 21 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.add.v16i16(<16 x i16> undef)
> > @@ -153,7 +153,7 @@ define i32 @reduce_i16(i32 %arg) {
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; AVX512F-LABEL: 'reduce_i16'
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x
> i16> undef)
> > +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x
> i16> undef)
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.add.v4i16(<4 x
> i16> undef)
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 4 for
> instruction: %V8 = call i16 @llvm.experimental.vector.reduce.add.v8i16(<8 x
> i16> undef)
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 21 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.add.v16i16(<16 x i16> undef)
> > @@ -162,7 +162,7 @@ define i32 @reduce_i16(i32 %arg) {
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; AVX512BW-LABEL: 'reduce_i16'
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x
> i16> undef)
> > +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x
> i16> undef)
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.add.v4i16(<4 x
> i16> undef)
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 4 for
> instruction: %V8 = call i16 @llvm.experimental.vector.reduce.add.v8i16(<8 x
> i16> undef)
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 9 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.add.v16i16(<16 x i16> undef)
> > @@ -171,7 +171,7 @@ define i32 @reduce_i16(i32 %arg) {
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; AVX512DQ-LABEL: 'reduce_i16'
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x
> i16> undef)
> > +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x
> i16> undef)
> >  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.add.v4i16(<4 x
> i16> undef)
> >  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 4 for
> instruction: %V8 = call i16 @llvm.experimental.vector.reduce.add.v8i16(<8 x
> i16> undef)
> >  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 21 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.add.v16i16(<16 x i16> undef)
> > @@ -190,9 +190,9 @@ define i32 @reduce_i16(i32 %arg) {
> >
> >  define i32 @reduce_i8(i32 %arg) {
> >  ; SSE2-LABEL: 'reduce_i8'
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x i8> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 5 for instruction:
> %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x i8> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 19 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x
> i8> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 12 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x
> i8> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 23 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x
> i8> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 34 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x
> i8> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 45 for
> instruction: %V16 = call i8 @llvm.experimental.vector.reduce.add.v16i8(<16
> x i8> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 46 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.add.v32i8(<32
> x i8> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 48 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.add.v64i8(<64
> x i8> undef)
> > @@ -210,9 +210,9 @@ define i32 @reduce_i8(i32 %arg) {
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; SSE42-LABEL: 'reduce_i8'
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x
> i8> undef)
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x
> i8> undef)
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 4 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x
> i8> undef)
> > +; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x
> i8> undef)
> > +; SSE42-NEXT:  Cost Model: Found an estimated cost of 5 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x
> i8> undef)
> > +; SSE42-NEXT:  Cost Model: Found an estimated cost of 7 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x
> i8> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 9 for
> instruction: %V16 = call i8 @llvm.experimental.vector.reduce.add.v16i8(<16
> x i8> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 10 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.add.v32i8(<32
> x i8> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 12 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.add.v64i8(<64
> x i8> undef)
> > @@ -220,9 +220,9 @@ define i32 @reduce_i8(i32 %arg) {
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; AVX1-LABEL: 'reduce_i8'
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x i8> undef)
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x i8> undef)
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction:
> %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x i8> undef)
> > +; AVX1-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x i8> undef)
> > +; AVX1-NEXT:  Cost Model: Found an estimated cost of 5 for instruction:
> %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x i8> undef)
> > +; AVX1-NEXT:  Cost Model: Found an estimated cost of 7 for instruction:
> %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x i8> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 9 for instruction:
> %V16 = call i8 @llvm.experimental.vector.reduce.add.v16i8(<16 x i8> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 61 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.add.v32i8(<32
> x i8> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 65 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.add.v64i8(<64
> x i8> undef)
> > @@ -230,9 +230,9 @@ define i32 @reduce_i8(i32 %arg) {
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; AVX2-LABEL: 'reduce_i8'
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x i8> undef)
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x i8> undef)
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 4 for instruction:
> %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x i8> undef)
> > +; AVX2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x i8> undef)
> > +; AVX2-NEXT:  Cost Model: Found an estimated cost of 5 for instruction:
> %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x i8> undef)
> > +; AVX2-NEXT:  Cost Model: Found an estimated cost of 7 for instruction:
> %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x i8> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 9 for instruction:
> %V16 = call i8 @llvm.experimental.vector.reduce.add.v16i8(<16 x i8> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 26 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.add.v32i8(<32
> x i8> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 27 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.add.v64i8(<64
> x i8> undef)
> > @@ -240,9 +240,9 @@ define i32 @reduce_i8(i32 %arg) {
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; AVX512F-LABEL: 'reduce_i8'
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x
> i8> undef)
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x
> i8> undef)
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 4 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x
> i8> undef)
> > +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x
> i8> undef)
> > +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 5 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x
> i8> undef)
> > +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 7 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x
> i8> undef)
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 9 for
> instruction: %V16 = call i8 @llvm.experimental.vector.reduce.add.v16i8(<16
> x i8> undef)
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 26 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.add.v32i8(<32
> x i8> undef)
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 27 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.add.v64i8(<64
> x i8> undef)
> > @@ -250,9 +250,9 @@ define i32 @reduce_i8(i32 %arg) {
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; AVX512BW-LABEL: 'reduce_i8'
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x
> i8> undef)
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x
> i8> undef)
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 4 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x
> i8> undef)
> > +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x
> i8> undef)
> > +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 5 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x
> i8> undef)
> > +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 7 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x
> i8> undef)
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 9 for
> instruction: %V16 = call i8 @llvm.experimental.vector.reduce.add.v16i8(<16
> x i8> undef)
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 21 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.add.v32i8(<32
> x i8> undef)
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 55 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.add.v64i8(<64
> x i8> undef)
> > @@ -260,9 +260,9 @@ define i32 @reduce_i8(i32 %arg) {
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; AVX512DQ-LABEL: 'reduce_i8'
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x
> i8> undef)
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x
> i8> undef)
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 4 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x
> i8> undef)
> > +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x
> i8> undef)
> > +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 5 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x
> i8> undef)
> > +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 7 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x
> i8> undef)
> >  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 9 for
> instruction: %V16 = call i8 @llvm.experimental.vector.reduce.add.v16i8(<16
> x i8> undef)
> >  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 26 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.add.v32i8(<32
> x i8> undef)
> >  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 27 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.add.v64i8(<64
> x i8> undef)
> >
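For anyone who wants to reproduce one of the cost numbers above locally, a
minimal sketch (not part of the patch; it assumes the same
"opt -cost-model -analyze" invocation these test files use, and the triple,
CPU attributes, and file name here are only illustrative):

; reduce-example.ll
; Run with:   opt < reduce-example.ll -cost-model -analyze \
;               -mtriple=x86_64-unknown-linux-gnu -mattr=+sse2
; then look for the "Cost Model: Found an estimated cost of N for
; instruction:" line printed for the call below.

declare i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x i8>)

define i8 @reduce_v8i8(<8 x i8> %v) {
  ; Under the new default, the v8i8 operand is widened to v16i8 (padded with
  ; undef elements) rather than promoted to v8i16, so the reported cost
  ; reflects the widened reduction sequence.
  %r = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x i8> %v)
  ret i8 %r
}
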
> > Modified: llvm/trunk/test/Analysis/CostModel/X86/reduce-and.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/reduce-and.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/Analysis/CostModel/X86/reduce-and.ll (original)
> > +++ llvm/trunk/test/Analysis/CostModel/X86/reduce-and.ll Wed Aug  7
> 09:24:26 2019
> > @@ -92,8 +92,8 @@ define i32 @reduce_i32(i32 %arg) {
> >
> >  define i32 @reduce_i16(i32 %arg) {
> >  ; SSE2-LABEL: 'reduce_i16'
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2 = call i16 @llvm.experimental.vector.reduce.and.v2i16(<2 x i16> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 5 for instruction:
> %V4 = call i16 @llvm.experimental.vector.reduce.and.v4i16(<4 x i16> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 7 for instruction:
> %V2 = call i16 @llvm.experimental.vector.reduce.and.v2i16(<2 x i16> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 13 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.and.v4i16(<4 x
> i16> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 19 for
> instruction: %V8 = call i16 @llvm.experimental.vector.reduce.and.v8i16(<8 x
> i16> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 20 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.and.v16i16(<16 x i16> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 22 for
> instruction: %V32 = call i16
> @llvm.experimental.vector.reduce.and.v32i16(<32 x i16> undef)
> > @@ -174,9 +174,9 @@ define i32 @reduce_i16(i32 %arg) {
> >
> >  define i32 @reduce_i8(i32 %arg) {
> >  ; SSE2-LABEL: 'reduce_i8'
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2 = call i8 @llvm.experimental.vector.reduce.and.v2i8(<2 x i8> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 5 for instruction:
> %V4 = call i8 @llvm.experimental.vector.reduce.and.v4i8(<4 x i8> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 19 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.and.v8i8(<8 x
> i8> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 12 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.and.v2i8(<2 x
> i8> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 23 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.and.v4i8(<4 x
> i8> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 34 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.and.v8i8(<8 x
> i8> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 45 for
> instruction: %V16 = call i8 @llvm.experimental.vector.reduce.and.v16i8(<16
> x i8> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 46 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.and.v32i8(<32
> x i8> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 48 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.and.v64i8(<64
> x i8> undef)
> >
> > Modified: llvm/trunk/test/Analysis/CostModel/X86/reduce-mul.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/reduce-mul.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/Analysis/CostModel/X86/reduce-mul.ll (original)
> > +++ llvm/trunk/test/Analysis/CostModel/X86/reduce-mul.ll Wed Aug  7
> 09:24:26 2019
> > @@ -67,7 +67,7 @@ define i32 @reduce_i64(i32 %arg) {
> >
> >  define i32 @reduce_i32(i32 %arg) {
> >  ; SSE2-LABEL: 'reduce_i32'
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 10 for
> instruction: %V2 = call i32 @llvm.experimental.vector.reduce.mul.v2i32(<2 x
> i32> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %V2 = call i32 @llvm.experimental.vector.reduce.mul.v2i32(<2 x i32> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 15 for
> instruction: %V4 = call i32 @llvm.experimental.vector.reduce.mul.v4i32(<4 x
> i32> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 21 for
> instruction: %V8 = call i32 @llvm.experimental.vector.reduce.mul.v8i32(<8 x
> i32> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 33 for
> instruction: %V16 = call i32
> @llvm.experimental.vector.reduce.mul.v16i32(<16 x i32> undef)
> > @@ -75,7 +75,7 @@ define i32 @reduce_i32(i32 %arg) {
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; SSSE3-LABEL: 'reduce_i32'
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 10 for
> instruction: %V2 = call i32 @llvm.experimental.vector.reduce.mul.v2i32(<2 x
> i32> undef)
> > +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for
> instruction: %V2 = call i32 @llvm.experimental.vector.reduce.mul.v2i32(<2 x
> i32> undef)
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 15 for
> instruction: %V4 = call i32 @llvm.experimental.vector.reduce.mul.v4i32(<4 x
> i32> undef)
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 21 for
> instruction: %V8 = call i32 @llvm.experimental.vector.reduce.mul.v8i32(<8 x
> i32> undef)
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 33 for
> instruction: %V16 = call i32
> @llvm.experimental.vector.reduce.mul.v16i32(<16 x i32> undef)
> > @@ -83,7 +83,7 @@ define i32 @reduce_i32(i32 %arg) {
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; SSE42-LABEL: 'reduce_i32'
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 10 for
> instruction: %V2 = call i32 @llvm.experimental.vector.reduce.mul.v2i32(<2 x
> i32> undef)
> > +; SSE42-NEXT:  Cost Model: Found an estimated cost of 4 for
> instruction: %V2 = call i32 @llvm.experimental.vector.reduce.mul.v2i32(<2 x
> i32> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 7 for
> instruction: %V4 = call i32 @llvm.experimental.vector.reduce.mul.v4i32(<4 x
> i32> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 9 for
> instruction: %V8 = call i32 @llvm.experimental.vector.reduce.mul.v8i32(<8 x
> i32> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 13 for
> instruction: %V16 = call i32
> @llvm.experimental.vector.reduce.mul.v16i32(<16 x i32> undef)
> > @@ -91,7 +91,7 @@ define i32 @reduce_i32(i32 %arg) {
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; AVX1-LABEL: 'reduce_i32'
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 10 for
> instruction: %V2 = call i32 @llvm.experimental.vector.reduce.mul.v2i32(<2 x
> i32> undef)
> > +; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction:
> %V2 = call i32 @llvm.experimental.vector.reduce.mul.v2i32(<2 x i32> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 7 for instruction:
> %V4 = call i32 @llvm.experimental.vector.reduce.mul.v4i32(<4 x i32> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 25 for
> instruction: %V8 = call i32 @llvm.experimental.vector.reduce.mul.v8i32(<8 x
> i32> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 29 for
> instruction: %V16 = call i32
> @llvm.experimental.vector.reduce.mul.v16i32(<16 x i32> undef)
> > @@ -99,36 +99,20 @@ define i32 @reduce_i32(i32 %arg) {
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; AVX2-LABEL: 'reduce_i32'
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 10 for
> instruction: %V2 = call i32 @llvm.experimental.vector.reduce.mul.v2i32(<2 x
> i32> undef)
> > +; AVX2-NEXT:  Cost Model: Found an estimated cost of 4 for instruction:
> %V2 = call i32 @llvm.experimental.vector.reduce.mul.v2i32(<2 x i32> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 7 for instruction:
> %V4 = call i32 @llvm.experimental.vector.reduce.mul.v4i32(<4 x i32> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 10 for
> instruction: %V8 = call i32 @llvm.experimental.vector.reduce.mul.v8i32(<8 x
> i32> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 12 for
> instruction: %V16 = call i32
> @llvm.experimental.vector.reduce.mul.v16i32(<16 x i32> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 16 for
> instruction: %V32 = call i32
> @llvm.experimental.vector.reduce.mul.v32i32(<32 x i32> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> > -; AVX512F-LABEL: 'reduce_i32'
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 10 for
> instruction: %V2 = call i32 @llvm.experimental.vector.reduce.mul.v2i32(<2 x
> i32> undef)
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 5 for
> instruction: %V4 = call i32 @llvm.experimental.vector.reduce.mul.v4i32(<4 x
> i32> undef)
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 7 for
> instruction: %V8 = call i32 @llvm.experimental.vector.reduce.mul.v8i32(<8 x
> i32> undef)
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 9 for
> instruction: %V16 = call i32
> @llvm.experimental.vector.reduce.mul.v16i32(<16 x i32> undef)
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 10 for
> instruction: %V32 = call i32
> @llvm.experimental.vector.reduce.mul.v32i32(<32 x i32> undef)
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> > -;
> > -; AVX512BW-LABEL: 'reduce_i32'
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 10 for
> instruction: %V2 = call i32 @llvm.experimental.vector.reduce.mul.v2i32(<2 x
> i32> undef)
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 5 for
> instruction: %V4 = call i32 @llvm.experimental.vector.reduce.mul.v4i32(<4 x
> i32> undef)
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 7 for
> instruction: %V8 = call i32 @llvm.experimental.vector.reduce.mul.v8i32(<8 x
> i32> undef)
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 9 for
> instruction: %V16 = call i32
> @llvm.experimental.vector.reduce.mul.v16i32(<16 x i32> undef)
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 10 for
> instruction: %V32 = call i32
> @llvm.experimental.vector.reduce.mul.v32i32(<32 x i32> undef)
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> > -;
> > -; AVX512DQ-LABEL: 'reduce_i32'
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i32 @llvm.experimental.vector.reduce.mul.v2i32(<2 x
> i32> undef)
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 5 for
> instruction: %V4 = call i32 @llvm.experimental.vector.reduce.mul.v4i32(<4 x
> i32> undef)
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 7 for
> instruction: %V8 = call i32 @llvm.experimental.vector.reduce.mul.v8i32(<8 x
> i32> undef)
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 9 for
> instruction: %V16 = call i32
> @llvm.experimental.vector.reduce.mul.v16i32(<16 x i32> undef)
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 10 for
> instruction: %V32 = call i32
> @llvm.experimental.vector.reduce.mul.v32i32(<32 x i32> undef)
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> > +; AVX512-LABEL: 'reduce_i32'
> > +; AVX512-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i32 @llvm.experimental.vector.reduce.mul.v2i32(<2 x
> i32> undef)
> > +; AVX512-NEXT:  Cost Model: Found an estimated cost of 5 for
> instruction: %V4 = call i32 @llvm.experimental.vector.reduce.mul.v4i32(<4 x
> i32> undef)
> > +; AVX512-NEXT:  Cost Model: Found an estimated cost of 7 for
> instruction: %V8 = call i32 @llvm.experimental.vector.reduce.mul.v8i32(<8 x
> i32> undef)
> > +; AVX512-NEXT:  Cost Model: Found an estimated cost of 9 for
> instruction: %V16 = call i32
> @llvm.experimental.vector.reduce.mul.v16i32(<16 x i32> undef)
> > +; AVX512-NEXT:  Cost Model: Found an estimated cost of 10 for
> instruction: %V32 = call i32
> @llvm.experimental.vector.reduce.mul.v32i32(<32 x i32> undef)
> > +; AVX512-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >    %V2  = call i32 @llvm.experimental.vector.reduce.mul.v2i32(<2 x i32>
> undef)
> >    %V4  = call i32 @llvm.experimental.vector.reduce.mul.v4i32(<4 x i32>
> undef)
> > @@ -140,8 +124,8 @@ define i32 @reduce_i32(i32 %arg) {
> >
> >  define i32 @reduce_i16(i32 %arg) {
> >  ; SSE2-LABEL: 'reduce_i16'
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 10 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.mul.v2i16(<2 x
> i16> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 15 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.mul.v4i16(<4 x
> i16> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 7 for instruction:
> %V2 = call i16 @llvm.experimental.vector.reduce.mul.v2i16(<2 x i16> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 13 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.mul.v4i16(<4 x
> i16> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 19 for
> instruction: %V8 = call i16 @llvm.experimental.vector.reduce.mul.v8i16(<8 x
> i16> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 20 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.mul.v16i16(<16 x i16> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 22 for
> instruction: %V32 = call i16
> @llvm.experimental.vector.reduce.mul.v32i16(<32 x i16> undef)
> > @@ -149,8 +133,8 @@ define i32 @reduce_i16(i32 %arg) {
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; SSSE3-LABEL: 'reduce_i16'
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 10 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.mul.v2i16(<2 x
> i16> undef)
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 15 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.mul.v4i16(<4 x
> i16> undef)
> > +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.mul.v2i16(<2 x
> i16> undef)
> > +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 5 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.mul.v4i16(<4 x
> i16> undef)
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 7 for
> instruction: %V8 = call i16 @llvm.experimental.vector.reduce.mul.v8i16(<8 x
> i16> undef)
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.mul.v16i16(<16 x i16> undef)
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 10 for
> instruction: %V32 = call i16
> @llvm.experimental.vector.reduce.mul.v32i16(<32 x i16> undef)
> > @@ -158,8 +142,8 @@ define i32 @reduce_i16(i32 %arg) {
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; SSE42-LABEL: 'reduce_i16'
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 10 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.mul.v2i16(<2 x
> i16> undef)
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 7 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.mul.v4i16(<4 x
> i16> undef)
> > +; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.mul.v2i16(<2 x
> i16> undef)
> > +; SSE42-NEXT:  Cost Model: Found an estimated cost of 5 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.mul.v4i16(<4 x
> i16> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 7 for
> instruction: %V8 = call i16 @llvm.experimental.vector.reduce.mul.v8i16(<8 x
> i16> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 8 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.mul.v16i16(<16 x i16> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 10 for
> instruction: %V32 = call i16
> @llvm.experimental.vector.reduce.mul.v32i16(<32 x i16> undef)
> > @@ -167,8 +151,8 @@ define i32 @reduce_i16(i32 %arg) {
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; AVX1-LABEL: 'reduce_i16'
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 10 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.mul.v2i16(<2 x
> i16> undef)
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 7 for instruction:
> %V4 = call i16 @llvm.experimental.vector.reduce.mul.v4i16(<4 x i16> undef)
> > +; AVX1-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2 = call i16 @llvm.experimental.vector.reduce.mul.v2i16(<2 x i16> undef)
> > +; AVX1-NEXT:  Cost Model: Found an estimated cost of 5 for instruction:
> %V4 = call i16 @llvm.experimental.vector.reduce.mul.v4i16(<4 x i16> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 7 for instruction:
> %V8 = call i16 @llvm.experimental.vector.reduce.mul.v8i16(<8 x i16> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 49 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.mul.v16i16(<16 x i16> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 53 for
> instruction: %V32 = call i16
> @llvm.experimental.vector.reduce.mul.v32i16(<32 x i16> undef)
> > @@ -176,8 +160,8 @@ define i32 @reduce_i16(i32 %arg) {
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; AVX2-LABEL: 'reduce_i16'
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 10 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.mul.v2i16(<2 x
> i16> undef)
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 7 for instruction:
> %V4 = call i16 @llvm.experimental.vector.reduce.mul.v4i16(<4 x i16> undef)
> > +; AVX2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2 = call i16 @llvm.experimental.vector.reduce.mul.v2i16(<2 x i16> undef)
> > +; AVX2-NEXT:  Cost Model: Found an estimated cost of 5 for instruction:
> %V4 = call i16 @llvm.experimental.vector.reduce.mul.v4i16(<4 x i16> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 7 for instruction:
> %V8 = call i16 @llvm.experimental.vector.reduce.mul.v8i16(<8 x i16> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 21 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.mul.v16i16(<16 x i16> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 22 for
> instruction: %V32 = call i16
> @llvm.experimental.vector.reduce.mul.v32i16(<32 x i16> undef)
> > @@ -185,7 +169,7 @@ define i32 @reduce_i16(i32 %arg) {
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; AVX512F-LABEL: 'reduce_i16'
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 10 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.mul.v2i16(<2 x
> i16> undef)
> > +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.mul.v2i16(<2 x
> i16> undef)
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 5 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.mul.v4i16(<4 x
> i16> undef)
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 7 for
> instruction: %V8 = call i16 @llvm.experimental.vector.reduce.mul.v8i16(<8 x
> i16> undef)
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 21 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.mul.v16i16(<16 x i16> undef)
> > @@ -194,7 +178,7 @@ define i32 @reduce_i16(i32 %arg) {
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; AVX512BW-LABEL: 'reduce_i16'
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 10 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.mul.v2i16(<2 x
> i16> undef)
> > +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.mul.v2i16(<2 x
> i16> undef)
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 5 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.mul.v4i16(<4 x
> i16> undef)
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 7 for
> instruction: %V8 = call i16 @llvm.experimental.vector.reduce.mul.v8i16(<8 x
> i16> undef)
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 9 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.mul.v16i16(<16 x i16> undef)
> > @@ -222,9 +206,9 @@ define i32 @reduce_i16(i32 %arg) {
> >
> >  define i32 @reduce_i8(i32 %arg) {
> >  ; SSE2-LABEL: 'reduce_i8'
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 10 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.mul.v2i8(<2 x
> i8> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 15 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.mul.v4i8(<4 x
> i8> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 19 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.mul.v8i8(<8 x
> i8> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 23 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.mul.v2i8(<2 x
> i8> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 45 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.mul.v4i8(<4 x
> i8> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 67 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.mul.v8i8(<8 x
> i8> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 89 for
> instruction: %V16 = call i8 @llvm.experimental.vector.reduce.mul.v16i8(<16
> x i8> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 101 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.mul.v32i8(<32
> x i8> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 125 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.mul.v64i8(<64
> x i8> undef)
> > @@ -232,9 +216,9 @@ define i32 @reduce_i8(i32 %arg) {
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; SSSE3-LABEL: 'reduce_i8'
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 10 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.mul.v2i8(<2 x
> i8> undef)
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 15 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.mul.v4i8(<4 x
> i8> undef)
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 7 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.mul.v8i8(<8 x
> i8> undef)
> > +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 14 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.mul.v2i8(<2 x
> i8> undef)
> > +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 27 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.mul.v4i8(<4 x
> i8> undef)
> > +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 40 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.mul.v8i8(<8 x
> i8> undef)
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 53 for
> instruction: %V16 = call i8 @llvm.experimental.vector.reduce.mul.v16i8(<16
> x i8> undef)
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 65 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.mul.v32i8(<32
> x i8> undef)
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 89 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.mul.v64i8(<64
> x i8> undef)
> > @@ -242,9 +226,9 @@ define i32 @reduce_i8(i32 %arg) {
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; SSE42-LABEL: 'reduce_i8'
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 10 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.mul.v2i8(<2 x
> i8> undef)
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 7 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.mul.v4i8(<4 x
> i8> undef)
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 7 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.mul.v8i8(<8 x
> i8> undef)
> > +; SSE42-NEXT:  Cost Model: Found an estimated cost of 14 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.mul.v2i8(<2 x
> i8> undef)
> > +; SSE42-NEXT:  Cost Model: Found an estimated cost of 27 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.mul.v4i8(<4 x
> i8> undef)
> > +; SSE42-NEXT:  Cost Model: Found an estimated cost of 40 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.mul.v8i8(<8 x
> i8> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 53 for
> instruction: %V16 = call i8 @llvm.experimental.vector.reduce.mul.v16i8(<16
> x i8> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 65 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.mul.v32i8(<32
> x i8> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 89 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.mul.v64i8(<64
> x i8> undef)
> > @@ -252,9 +236,9 @@ define i32 @reduce_i8(i32 %arg) {
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; AVX1-LABEL: 'reduce_i8'
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 10 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.mul.v2i8(<2 x
> i8> undef)
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 7 for instruction:
> %V4 = call i8 @llvm.experimental.vector.reduce.mul.v4i8(<4 x i8> undef)
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 7 for instruction:
> %V8 = call i8 @llvm.experimental.vector.reduce.mul.v8i8(<8 x i8> undef)
> > +; AVX1-NEXT:  Cost Model: Found an estimated cost of 14 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.mul.v2i8(<2 x
> i8> undef)
> > +; AVX1-NEXT:  Cost Model: Found an estimated cost of 27 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.mul.v4i8(<4 x
> i8> undef)
> > +; AVX1-NEXT:  Cost Model: Found an estimated cost of 40 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.mul.v8i8(<8 x
> i8> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 53 for
> instruction: %V16 = call i8 @llvm.experimental.vector.reduce.mul.v16i8(<16
> x i8> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 171 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.mul.v32i8(<32
> x i8> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 197 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.mul.v64i8(<64
> x i8> undef)
> > @@ -262,9 +246,9 @@ define i32 @reduce_i8(i32 %arg) {
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; AVX2-LABEL: 'reduce_i8'
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 10 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.mul.v2i8(<2 x
> i8> undef)
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 7 for instruction:
> %V4 = call i8 @llvm.experimental.vector.reduce.mul.v4i8(<4 x i8> undef)
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 7 for instruction:
> %V8 = call i8 @llvm.experimental.vector.reduce.mul.v8i8(<8 x i8> undef)
> > +; AVX2-NEXT:  Cost Model: Found an estimated cost of 9 for instruction:
> %V2 = call i8 @llvm.experimental.vector.reduce.mul.v2i8(<2 x i8> undef)
> > +; AVX2-NEXT:  Cost Model: Found an estimated cost of 17 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.mul.v4i8(<4 x
> i8> undef)
> > +; AVX2-NEXT:  Cost Model: Found an estimated cost of 25 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.mul.v8i8(<8 x
> i8> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 33 for
> instruction: %V16 = call i8 @llvm.experimental.vector.reduce.mul.v16i8(<16
> x i8> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 106 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.mul.v32i8(<32
> x i8> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 123 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.mul.v64i8(<64
> x i8> undef)
> > @@ -272,9 +256,9 @@ define i32 @reduce_i8(i32 %arg) {
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; AVX512F-LABEL: 'reduce_i8'
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 10 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.mul.v2i8(<2 x
> i8> undef)
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 5 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.mul.v4i8(<4 x
> i8> undef)
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 7 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.mul.v8i8(<8 x
> i8> undef)
> > +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 7 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.mul.v2i8(<2 x
> i8> undef)
> > +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 13 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.mul.v4i8(<4 x
> i8> undef)
> > +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 19 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.mul.v8i8(<8 x
> i8> undef)
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 25 for
> instruction: %V16 = call i8 @llvm.experimental.vector.reduce.mul.v16i8(<16
> x i8> undef)
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 86 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.mul.v32i8(<32
> x i8> undef)
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 99 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.mul.v64i8(<64
> x i8> undef)
> > @@ -282,9 +266,9 @@ define i32 @reduce_i8(i32 %arg) {
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; AVX512BW-LABEL: 'reduce_i8'
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 10 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.mul.v2i8(<2 x
> i8> undef)
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 5 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.mul.v4i8(<4 x
> i8> undef)
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 7 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.mul.v8i8(<8 x
> i8> undef)
> > +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 6 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.mul.v2i8(<2 x
> i8> undef)
> > +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 11 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.mul.v4i8(<4 x
> i8> undef)
> > +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 16 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.mul.v8i8(<8 x
> i8> undef)
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 21 for
> instruction: %V16 = call i8 @llvm.experimental.vector.reduce.mul.v16i8(<16
> x i8> undef)
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 36 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.mul.v32i8(<32
> x i8> undef)
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 115 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.mul.v64i8(<64
> x i8> undef)
> > @@ -292,9 +276,9 @@ define i32 @reduce_i8(i32 %arg) {
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; AVX512DQ-LABEL: 'reduce_i8'
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.mul.v2i8(<2 x
> i8> undef)
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 5 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.mul.v4i8(<4 x
> i8> undef)
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 7 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.mul.v8i8(<8 x
> i8> undef)
> > +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 7 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.mul.v2i8(<2 x
> i8> undef)
> > +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 13 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.mul.v4i8(<4 x
> i8> undef)
> > +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 19 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.mul.v8i8(<8 x
> i8> undef)
> >  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 25 for
> instruction: %V16 = call i8 @llvm.experimental.vector.reduce.mul.v16i8(<16
> x i8> undef)
> >  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 86 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.mul.v32i8(<32
> x i8> undef)
> >  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 99 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.mul.v64i8(<64
> x i8> undef)
> >
> > Modified: llvm/trunk/test/Analysis/CostModel/X86/reduce-or.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/reduce-or.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/Analysis/CostModel/X86/reduce-or.ll (original)
> > +++ llvm/trunk/test/Analysis/CostModel/X86/reduce-or.ll Wed Aug  7
> 09:24:26 2019
> > @@ -92,8 +92,8 @@ define i32 @reduce_i32(i32 %arg) {
> >
> >  define i32 @reduce_i16(i32 %arg) {
> >  ; SSE2-LABEL: 'reduce_i16'
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2 = call i16 @llvm.experimental.vector.reduce.or.v2i16(<2 x i16> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 5 for instruction:
> %V4 = call i16 @llvm.experimental.vector.reduce.or.v4i16(<4 x i16> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 7 for instruction:
> %V2 = call i16 @llvm.experimental.vector.reduce.or.v2i16(<2 x i16> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 13 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.or.v4i16(<4 x
> i16> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 19 for
> instruction: %V8 = call i16 @llvm.experimental.vector.reduce.or.v8i16(<8 x
> i16> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 20 for
> instruction: %V16 = call i16 @llvm.experimental.vector.reduce.or.v16i16(<16
> x i16> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 22 for
> instruction: %V32 = call i16 @llvm.experimental.vector.reduce.or.v32i16(<32
> x i16> undef)
> > @@ -174,9 +174,9 @@ define i32 @reduce_i16(i32 %arg) {
> >
> >  define i32 @reduce_i8(i32 %arg) {
> >  ; SSE2-LABEL: 'reduce_i8'
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2 = call i8 @llvm.experimental.vector.reduce.or.v2i8(<2 x i8> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 5 for instruction:
> %V4 = call i8 @llvm.experimental.vector.reduce.or.v4i8(<4 x i8> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 19 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.or.v8i8(<8 x
> i8> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 12 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.or.v2i8(<2 x
> i8> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 23 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.or.v4i8(<4 x
> i8> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 34 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.or.v8i8(<8 x
> i8> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 45 for
> instruction: %V16 = call i8 @llvm.experimental.vector.reduce.or.v16i8(<16 x
> i8> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 46 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.or.v32i8(<32 x
> i8> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 48 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.or.v64i8(<64 x
> i8> undef)
> >
> > Modified: llvm/trunk/test/Analysis/CostModel/X86/reduce-smax.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/reduce-smax.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/Analysis/CostModel/X86/reduce-smax.ll (original)
> > +++ llvm/trunk/test/Analysis/CostModel/X86/reduce-smax.ll Wed Aug  7
> 09:24:26 2019
> > @@ -83,7 +83,7 @@ define i32 @reduce_i32(i32 %arg) {
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; SSE42-LABEL: 'reduce_i32'
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 9 for
> instruction: %V2 = call i32 @llvm.experimental.vector.reduce.smax.v2i32(<2
> x i32> undef)
> > +; SSE42-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V2 = call i32 @llvm.experimental.vector.reduce.smax.v2i32(<2
> x i32> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V4 = call i32 @llvm.experimental.vector.reduce.smax.v4i32(<4
> x i32> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 4 for
> instruction: %V8 = call i32 @llvm.experimental.vector.reduce.smax.v8i32(<8
> x i32> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 8 for
> instruction: %V16 = call i32
> @llvm.experimental.vector.reduce.smax.v16i32(<16 x i32> undef)
> > @@ -91,7 +91,7 @@ define i32 @reduce_i32(i32 %arg) {
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; AVX1-LABEL: 'reduce_i32'
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2 = call i32 @llvm.experimental.vector.reduce.smax.v2i32(<2 x i32> undef)
> > +; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V2 = call i32 @llvm.experimental.vector.reduce.smax.v2i32(<2 x i32> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V4 = call i32 @llvm.experimental.vector.reduce.smax.v4i32(<4 x i32> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V8 = call i32 @llvm.experimental.vector.reduce.smax.v8i32(<8 x i32> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction:
> %V16 = call i32 @llvm.experimental.vector.reduce.smax.v16i32(<16 x i32>
> undef)
> > @@ -99,7 +99,7 @@ define i32 @reduce_i32(i32 %arg) {
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; AVX2-LABEL: 'reduce_i32'
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2 = call i32 @llvm.experimental.vector.reduce.smax.v2i32(<2 x i32> undef)
> > +; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V2 = call i32 @llvm.experimental.vector.reduce.smax.v2i32(<2 x i32> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V4 = call i32 @llvm.experimental.vector.reduce.smax.v4i32(<4 x i32> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V8 = call i32 @llvm.experimental.vector.reduce.smax.v8i32(<8 x i32> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V16 = call i32 @llvm.experimental.vector.reduce.smax.v16i32(<16 x i32>
> undef)
> > @@ -107,7 +107,7 @@ define i32 @reduce_i32(i32 %arg) {
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; AVX512-LABEL: 'reduce_i32'
> > -; AVX512-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i32 @llvm.experimental.vector.reduce.smax.v2i32(<2
> x i32> undef)
> > +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V2 = call i32 @llvm.experimental.vector.reduce.smax.v2i32(<2
> x i32> undef)
> >  ; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4 = call i32 @llvm.experimental.vector.reduce.smax.v4i32(<4
> x i32> undef)
> >  ; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8 = call i32 @llvm.experimental.vector.reduce.smax.v8i32(<8
> x i32> undef)
> >  ; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V16 = call i32
> @llvm.experimental.vector.reduce.smax.v16i32(<16 x i32> undef)
> > @@ -124,8 +124,8 @@ define i32 @reduce_i32(i32 %arg) {
> >
> >  define i32 @reduce_i16(i32 %arg) {
> >  ; SSE2-LABEL: 'reduce_i16'
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %V2 = call i16 @llvm.experimental.vector.reduce.smax.v2i16(<2 x i16> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %V4 = call i16 @llvm.experimental.vector.reduce.smax.v4i16(<4 x i16> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction:
> %V2 = call i16 @llvm.experimental.vector.reduce.smax.v2i16(<2 x i16> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction:
> %V4 = call i16 @llvm.experimental.vector.reduce.smax.v4i16(<4 x i16> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction:
> %V8 = call i16 @llvm.experimental.vector.reduce.smax.v8i16(<8 x i16> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 12 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.smax.v16i16(<16 x i16> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 24 for
> instruction: %V32 = call i16
> @llvm.experimental.vector.reduce.smax.v32i16(<32 x i16> undef)
> > @@ -133,8 +133,8 @@ define i32 @reduce_i16(i32 %arg) {
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; SSSE3-LABEL: 'reduce_i16'
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smax.v2i16(<2
> x i16> undef)
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.smax.v4i16(<4
> x i16> undef)
> > +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smax.v2i16(<2
> x i16> undef)
> > +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.smax.v4i16(<4
> x i16> undef)
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for
> instruction: %V8 = call i16 @llvm.experimental.vector.reduce.smax.v8i16(<8
> x i16> undef)
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 12 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.smax.v16i16(<16 x i16> undef)
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 24 for
> instruction: %V32 = call i16
> @llvm.experimental.vector.reduce.smax.v32i16(<32 x i16> undef)
> > @@ -142,7 +142,7 @@ define i32 @reduce_i16(i32 %arg) {
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; SSE42-LABEL: 'reduce_i16'
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 9 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smax.v2i16(<2
> x i16> undef)
> > +; SSE42-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smax.v2i16(<2
> x i16> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.smax.v4i16(<4
> x i16> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V8 = call i16 @llvm.experimental.vector.reduce.smax.v8i16(<8
> x i16> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 4 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.smax.v16i16(<16 x i16> undef)
> > @@ -151,7 +151,7 @@ define i32 @reduce_i16(i32 %arg) {
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; AVX1-LABEL: 'reduce_i16'
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2 = call i16 @llvm.experimental.vector.reduce.smax.v2i16(<2 x i16> undef)
> > +; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V2 = call i16 @llvm.experimental.vector.reduce.smax.v2i16(<2 x i16> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V4 = call i16 @llvm.experimental.vector.reduce.smax.v4i16(<4 x i16> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V8 = call i16 @llvm.experimental.vector.reduce.smax.v8i16(<8 x i16> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V16 = call i16 @llvm.experimental.vector.reduce.smax.v16i16(<16 x i16>
> undef)
> > @@ -160,7 +160,7 @@ define i32 @reduce_i16(i32 %arg) {
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; AVX2-LABEL: 'reduce_i16'
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2 = call i16 @llvm.experimental.vector.reduce.smax.v2i16(<2 x i16> undef)
> > +; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V2 = call i16 @llvm.experimental.vector.reduce.smax.v2i16(<2 x i16> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V4 = call i16 @llvm.experimental.vector.reduce.smax.v4i16(<4 x i16> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V8 = call i16 @llvm.experimental.vector.reduce.smax.v8i16(<8 x i16> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V16 = call i16 @llvm.experimental.vector.reduce.smax.v16i16(<16 x i16>
> undef)
> > @@ -169,7 +169,7 @@ define i32 @reduce_i16(i32 %arg) {
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; AVX512F-LABEL: 'reduce_i16'
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smax.v2i16(<2
> x i16> undef)
> > +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smax.v2i16(<2
> x i16> undef)
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.smax.v4i16(<4
> x i16> undef)
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8 = call i16 @llvm.experimental.vector.reduce.smax.v8i16(<8
> x i16> undef)
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.smax.v16i16(<16 x i16> undef)
> > @@ -178,7 +178,7 @@ define i32 @reduce_i16(i32 %arg) {
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; AVX512BW-LABEL: 'reduce_i16'
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smax.v2i16(<2
> x i16> undef)
> > +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smax.v2i16(<2
> x i16> undef)
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.smax.v4i16(<4
> x i16> undef)
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8 = call i16 @llvm.experimental.vector.reduce.smax.v8i16(<8
> x i16> undef)
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.smax.v16i16(<16 x i16> undef)
> > @@ -187,7 +187,7 @@ define i32 @reduce_i16(i32 %arg) {
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; AVX512DQ-LABEL: 'reduce_i16'
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smax.v2i16(<2
> x i16> undef)
> > +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smax.v2i16(<2
> x i16> undef)
> >  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.smax.v4i16(<4
> x i16> undef)
> >  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8 = call i16 @llvm.experimental.vector.reduce.smax.v8i16(<8
> x i16> undef)
> >  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.smax.v16i16(<16 x i16> undef)
> > @@ -206,8 +206,8 @@ define i32 @reduce_i16(i32 %arg) {
> >
> >  define i32 @reduce_i8(i32 %arg) {
> >  ; SSE2-LABEL: 'reduce_i8'
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %V2 = call i8 @llvm.experimental.vector.reduce.smax.v2i8(<2 x i8> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %V4 = call i8 @llvm.experimental.vector.reduce.smax.v4i8(<4 x i8> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction:
> %V2 = call i8 @llvm.experimental.vector.reduce.smax.v2i8(<2 x i8> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction:
> %V4 = call i8 @llvm.experimental.vector.reduce.smax.v4i8(<4 x i8> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction:
> %V8 = call i8 @llvm.experimental.vector.reduce.smax.v8i8(<8 x i8> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction:
> %V16 = call i8 @llvm.experimental.vector.reduce.smax.v16i8(<16 x i8> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 12 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.smax.v32i8(<32
> x i8> undef)
> > @@ -216,8 +216,8 @@ define i32 @reduce_i8(i32 %arg) {
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; SSSE3-LABEL: 'reduce_i8'
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smax.v2i8(<2 x
> i8> undef)
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smax.v4i8(<4 x
> i8> undef)
> > +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smax.v2i8(<2 x
> i8> undef)
> > +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smax.v4i8(<4 x
> i8> undef)
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smax.v8i8(<8 x
> i8> undef)
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for
> instruction: %V16 = call i8 @llvm.experimental.vector.reduce.smax.v16i8(<16
> x i8> undef)
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 12 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.smax.v32i8(<32
> x i8> undef)
> > @@ -226,9 +226,9 @@ define i32 @reduce_i8(i32 %arg) {
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; SSE42-LABEL: 'reduce_i8'
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 9 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smax.v2i8(<2 x
> i8> undef)
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smax.v4i8(<4 x
> i8> undef)
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smax.v8i8(<8 x
> i8> undef)
> > +; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smax.v2i8(<2 x
> i8> undef)
> > +; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smax.v4i8(<4 x
> i8> undef)
> > +; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smax.v8i8(<8 x
> i8> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V16 = call i8 @llvm.experimental.vector.reduce.smax.v16i8(<16
> x i8> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 6 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.smax.v32i8(<32
> x i8> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 12 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.smax.v64i8(<64
> x i8> undef)
> > @@ -236,9 +236,9 @@ define i32 @reduce_i8(i32 %arg) {
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; AVX1-LABEL: 'reduce_i8'
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2 = call i8 @llvm.experimental.vector.reduce.smax.v2i8(<2 x i8> undef)
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V4 = call i8 @llvm.experimental.vector.reduce.smax.v4i8(<4 x i8> undef)
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V8 = call i8 @llvm.experimental.vector.reduce.smax.v8i8(<8 x i8> undef)
> > +; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V2 = call i8 @llvm.experimental.vector.reduce.smax.v2i8(<2 x i8> undef)
> > +; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V4 = call i8 @llvm.experimental.vector.reduce.smax.v4i8(<4 x i8> undef)
> > +; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V8 = call i8 @llvm.experimental.vector.reduce.smax.v8i8(<8 x i8> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V16 = call i8 @llvm.experimental.vector.reduce.smax.v16i8(<16 x i8> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V32 = call i8 @llvm.experimental.vector.reduce.smax.v32i8(<32 x i8> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction:
> %V64 = call i8 @llvm.experimental.vector.reduce.smax.v64i8(<64 x i8> undef)
> > @@ -246,9 +246,9 @@ define i32 @reduce_i8(i32 %arg) {
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; AVX2-LABEL: 'reduce_i8'
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2 = call i8 @llvm.experimental.vector.reduce.smax.v2i8(<2 x i8> undef)
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V4 = call i8 @llvm.experimental.vector.reduce.smax.v4i8(<4 x i8> undef)
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V8 = call i8 @llvm.experimental.vector.reduce.smax.v8i8(<8 x i8> undef)
> > +; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V2 = call i8 @llvm.experimental.vector.reduce.smax.v2i8(<2 x i8> undef)
> > +; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V4 = call i8 @llvm.experimental.vector.reduce.smax.v4i8(<4 x i8> undef)
> > +; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V8 = call i8 @llvm.experimental.vector.reduce.smax.v8i8(<8 x i8> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V16 = call i8 @llvm.experimental.vector.reduce.smax.v16i8(<16 x i8> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V32 = call i8 @llvm.experimental.vector.reduce.smax.v32i8(<32 x i8> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V64 = call i8 @llvm.experimental.vector.reduce.smax.v64i8(<64 x i8> undef)
> > @@ -256,9 +256,9 @@ define i32 @reduce_i8(i32 %arg) {
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; AVX512F-LABEL: 'reduce_i8'
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smax.v2i8(<2 x
> i8> undef)
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smax.v4i8(<4 x
> i8> undef)
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smax.v8i8(<8 x
> i8> undef)
> > +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smax.v2i8(<2 x
> i8> undef)
> > +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smax.v4i8(<4 x
> i8> undef)
> > +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smax.v8i8(<8 x
> i8> undef)
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V16 = call i8 @llvm.experimental.vector.reduce.smax.v16i8(<16
> x i8> undef)
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.smax.v32i8(<32
> x i8> undef)
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.smax.v64i8(<64
> x i8> undef)
> > @@ -266,9 +266,9 @@ define i32 @reduce_i8(i32 %arg) {
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; AVX512BW-LABEL: 'reduce_i8'
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smax.v2i8(<2 x
> i8> undef)
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smax.v4i8(<4 x
> i8> undef)
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smax.v8i8(<8 x
> i8> undef)
> > +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smax.v2i8(<2 x
> i8> undef)
> > +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smax.v4i8(<4 x
> i8> undef)
> > +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smax.v8i8(<8 x
> i8> undef)
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V16 = call i8 @llvm.experimental.vector.reduce.smax.v16i8(<16
> x i8> undef)
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.smax.v32i8(<32
> x i8> undef)
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 61 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.smax.v64i8(<64
> x i8> undef)
> > @@ -276,9 +276,9 @@ define i32 @reduce_i8(i32 %arg) {
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; AVX512DQ-LABEL: 'reduce_i8'
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smax.v2i8(<2 x
> i8> undef)
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smax.v4i8(<4 x
> i8> undef)
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smax.v8i8(<8 x
> i8> undef)
> > +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smax.v2i8(<2 x
> i8> undef)
> > +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smax.v4i8(<4 x
> i8> undef)
> > +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smax.v8i8(<8 x
> i8> undef)
> >  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V16 = call i8 @llvm.experimental.vector.reduce.smax.v16i8(<16
> x i8> undef)
> >  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.smax.v32i8(<32
> x i8> undef)
> >  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.smax.v64i8(<64
> x i8> undef)
> >
> > Modified: llvm/trunk/test/Analysis/CostModel/X86/reduce-smin.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/reduce-smin.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/Analysis/CostModel/X86/reduce-smin.ll (original)
> > +++ llvm/trunk/test/Analysis/CostModel/X86/reduce-smin.ll Wed Aug  7
> 09:24:26 2019
> > @@ -83,7 +83,7 @@ define i32 @reduce_i32(i32 %arg) {
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; SSE42-LABEL: 'reduce_i32'
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 9 for
> instruction: %V2 = call i32 @llvm.experimental.vector.reduce.smin.v2i32(<2
> x i32> undef)
> > +; SSE42-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V2 = call i32 @llvm.experimental.vector.reduce.smin.v2i32(<2
> x i32> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V4 = call i32 @llvm.experimental.vector.reduce.smin.v4i32(<4
> x i32> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 4 for
> instruction: %V8 = call i32 @llvm.experimental.vector.reduce.smin.v8i32(<8
> x i32> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 8 for
> instruction: %V16 = call i32
> @llvm.experimental.vector.reduce.smin.v16i32(<16 x i32> undef)
> > @@ -91,7 +91,7 @@ define i32 @reduce_i32(i32 %arg) {
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; AVX1-LABEL: 'reduce_i32'
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2 = call i32 @llvm.experimental.vector.reduce.smin.v2i32(<2 x i32> undef)
> > +; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V2 = call i32 @llvm.experimental.vector.reduce.smin.v2i32(<2 x i32> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V4 = call i32 @llvm.experimental.vector.reduce.smin.v4i32(<4 x i32> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V8 = call i32 @llvm.experimental.vector.reduce.smin.v8i32(<8 x i32> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction:
> %V16 = call i32 @llvm.experimental.vector.reduce.smin.v16i32(<16 x i32>
> undef)
> > @@ -99,7 +99,7 @@ define i32 @reduce_i32(i32 %arg) {
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; AVX2-LABEL: 'reduce_i32'
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2 = call i32 @llvm.experimental.vector.reduce.smin.v2i32(<2 x i32> undef)
> > +; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V2 = call i32 @llvm.experimental.vector.reduce.smin.v2i32(<2 x i32> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V4 = call i32 @llvm.experimental.vector.reduce.smin.v4i32(<4 x i32> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V8 = call i32 @llvm.experimental.vector.reduce.smin.v8i32(<8 x i32> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V16 = call i32 @llvm.experimental.vector.reduce.smin.v16i32(<16 x i32>
> undef)
> > @@ -107,7 +107,7 @@ define i32 @reduce_i32(i32 %arg) {
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; AVX512-LABEL: 'reduce_i32'
> > -; AVX512-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i32 @llvm.experimental.vector.reduce.smin.v2i32(<2
> x i32> undef)
> > +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V2 = call i32 @llvm.experimental.vector.reduce.smin.v2i32(<2
> x i32> undef)
> >  ; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4 = call i32 @llvm.experimental.vector.reduce.smin.v4i32(<4
> x i32> undef)
> >  ; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8 = call i32 @llvm.experimental.vector.reduce.smin.v8i32(<8
> x i32> undef)
> >  ; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V16 = call i32
> @llvm.experimental.vector.reduce.smin.v16i32(<16 x i32> undef)
> > @@ -124,8 +124,8 @@ define i32 @reduce_i32(i32 %arg) {
> >
> >  define i32 @reduce_i16(i32 %arg) {
> >  ; SSE2-LABEL: 'reduce_i16'
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %V2 = call i16 @llvm.experimental.vector.reduce.smin.v2i16(<2 x i16> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %V4 = call i16 @llvm.experimental.vector.reduce.smin.v4i16(<4 x i16> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction:
> %V2 = call i16 @llvm.experimental.vector.reduce.smin.v2i16(<2 x i16> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction:
> %V4 = call i16 @llvm.experimental.vector.reduce.smin.v4i16(<4 x i16> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction:
> %V8 = call i16 @llvm.experimental.vector.reduce.smin.v8i16(<8 x i16> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 12 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.smin.v16i16(<16 x i16> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 24 for
> instruction: %V32 = call i16
> @llvm.experimental.vector.reduce.smin.v32i16(<32 x i16> undef)
> > @@ -133,8 +133,8 @@ define i32 @reduce_i16(i32 %arg) {
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; SSSE3-LABEL: 'reduce_i16'
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smin.v2i16(<2
> x i16> undef)
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.smin.v4i16(<4
> x i16> undef)
> > +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smin.v2i16(<2
> x i16> undef)
> > +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.smin.v4i16(<4
> x i16> undef)
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for
> instruction: %V8 = call i16 @llvm.experimental.vector.reduce.smin.v8i16(<8
> x i16> undef)
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 12 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.smin.v16i16(<16 x i16> undef)
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 24 for
> instruction: %V32 = call i16
> @llvm.experimental.vector.reduce.smin.v32i16(<32 x i16> undef)
> > @@ -142,7 +142,7 @@ define i32 @reduce_i16(i32 %arg) {
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; SSE42-LABEL: 'reduce_i16'
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 9 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smin.v2i16(<2
> x i16> undef)
> > +; SSE42-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smin.v2i16(<2
> x i16> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.smin.v4i16(<4
> x i16> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V8 = call i16 @llvm.experimental.vector.reduce.smin.v8i16(<8
> x i16> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 4 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.smin.v16i16(<16 x i16> undef)
> > @@ -151,7 +151,7 @@ define i32 @reduce_i16(i32 %arg) {
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; AVX1-LABEL: 'reduce_i16'
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2 = call i16 @llvm.experimental.vector.reduce.smin.v2i16(<2 x i16> undef)
> > +; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V2 = call i16 @llvm.experimental.vector.reduce.smin.v2i16(<2 x i16> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V4 = call i16 @llvm.experimental.vector.reduce.smin.v4i16(<4 x i16> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V8 = call i16 @llvm.experimental.vector.reduce.smin.v8i16(<8 x i16> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V16 = call i16 @llvm.experimental.vector.reduce.smin.v16i16(<16 x i16>
> undef)
> > @@ -160,7 +160,7 @@ define i32 @reduce_i16(i32 %arg) {
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; AVX2-LABEL: 'reduce_i16'
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2 = call i16 @llvm.experimental.vector.reduce.smin.v2i16(<2 x i16> undef)
> > +; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V2 = call i16 @llvm.experimental.vector.reduce.smin.v2i16(<2 x i16> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V4 = call i16 @llvm.experimental.vector.reduce.smin.v4i16(<4 x i16> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V8 = call i16 @llvm.experimental.vector.reduce.smin.v8i16(<8 x i16> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V16 = call i16 @llvm.experimental.vector.reduce.smin.v16i16(<16 x i16>
> undef)
> > @@ -169,7 +169,7 @@ define i32 @reduce_i16(i32 %arg) {
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; AVX512F-LABEL: 'reduce_i16'
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smin.v2i16(<2
> x i16> undef)
> > +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smin.v2i16(<2
> x i16> undef)
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.smin.v4i16(<4
> x i16> undef)
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8 = call i16 @llvm.experimental.vector.reduce.smin.v8i16(<8
> x i16> undef)
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.smin.v16i16(<16 x i16> undef)
> > @@ -178,7 +178,7 @@ define i32 @reduce_i16(i32 %arg) {
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; AVX512BW-LABEL: 'reduce_i16'
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smin.v2i16(<2
> x i16> undef)
> > +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smin.v2i16(<2
> x i16> undef)
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.smin.v4i16(<4
> x i16> undef)
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8 = call i16 @llvm.experimental.vector.reduce.smin.v8i16(<8
> x i16> undef)
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.smin.v16i16(<16 x i16> undef)
> > @@ -187,7 +187,7 @@ define i32 @reduce_i16(i32 %arg) {
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; AVX512DQ-LABEL: 'reduce_i16'
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smin.v2i16(<2
> x i16> undef)
> > +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smin.v2i16(<2
> x i16> undef)
> >  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.smin.v4i16(<4
> x i16> undef)
> >  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8 = call i16 @llvm.experimental.vector.reduce.smin.v8i16(<8
> x i16> undef)
> >  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.smin.v16i16(<16 x i16> undef)
> > @@ -206,8 +206,8 @@ define i32 @reduce_i16(i32 %arg) {
> >
> >  define i32 @reduce_i8(i32 %arg) {
> >  ; SSE2-LABEL: 'reduce_i8'
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %V2 = call i8 @llvm.experimental.vector.reduce.smin.v2i8(<2 x i8> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %V4 = call i8 @llvm.experimental.vector.reduce.smin.v4i8(<4 x i8> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction:
> %V2 = call i8 @llvm.experimental.vector.reduce.smin.v2i8(<2 x i8> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction:
> %V4 = call i8 @llvm.experimental.vector.reduce.smin.v4i8(<4 x i8> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction:
> %V8 = call i8 @llvm.experimental.vector.reduce.smin.v8i8(<8 x i8> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction:
> %V16 = call i8 @llvm.experimental.vector.reduce.smin.v16i8(<16 x i8> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 12 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.smin.v32i8(<32
> x i8> undef)
> > @@ -216,8 +216,8 @@ define i32 @reduce_i8(i32 %arg) {
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; SSSE3-LABEL: 'reduce_i8'
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smin.v2i8(<2 x
> i8> undef)
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smin.v4i8(<4 x
> i8> undef)
> > +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smin.v2i8(<2 x
> i8> undef)
> > +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smin.v4i8(<4 x
> i8> undef)
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smin.v8i8(<8 x
> i8> undef)
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for
> instruction: %V16 = call i8 @llvm.experimental.vector.reduce.smin.v16i8(<16
> x i8> undef)
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 12 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.smin.v32i8(<32
> x i8> undef)
> > @@ -226,9 +226,9 @@ define i32 @reduce_i8(i32 %arg) {
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; SSE42-LABEL: 'reduce_i8'
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 9 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smin.v2i8(<2 x
> i8> undef)
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smin.v4i8(<4 x
> i8> undef)
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smin.v8i8(<8 x
> i8> undef)
> > +; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smin.v2i8(<2 x
> i8> undef)
> > +; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smin.v4i8(<4 x
> i8> undef)
> > +; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smin.v8i8(<8 x
> i8> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V16 = call i8 @llvm.experimental.vector.reduce.smin.v16i8(<16
> x i8> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 6 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.smin.v32i8(<32
> x i8> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 12 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.smin.v64i8(<64
> x i8> undef)
> > @@ -236,9 +236,9 @@ define i32 @reduce_i8(i32 %arg) {
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; AVX1-LABEL: 'reduce_i8'
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2 = call i8 @llvm.experimental.vector.reduce.smin.v2i8(<2 x i8> undef)
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V4 = call i8 @llvm.experimental.vector.reduce.smin.v4i8(<4 x i8> undef)
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V8 = call i8 @llvm.experimental.vector.reduce.smin.v8i8(<8 x i8> undef)
> > +; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V2 = call i8 @llvm.experimental.vector.reduce.smin.v2i8(<2 x i8> undef)
> > +; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V4 = call i8 @llvm.experimental.vector.reduce.smin.v4i8(<4 x i8> undef)
> > +; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V8 = call i8 @llvm.experimental.vector.reduce.smin.v8i8(<8 x i8> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V16 = call i8 @llvm.experimental.vector.reduce.smin.v16i8(<16 x i8> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V32 = call i8 @llvm.experimental.vector.reduce.smin.v32i8(<32 x i8> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction:
> %V64 = call i8 @llvm.experimental.vector.reduce.smin.v64i8(<64 x i8> undef)
> > @@ -246,9 +246,9 @@ define i32 @reduce_i8(i32 %arg) {
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; AVX2-LABEL: 'reduce_i8'
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2 = call i8 @llvm.experimental.vector.reduce.smin.v2i8(<2 x i8> undef)
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V4 = call i8 @llvm.experimental.vector.reduce.smin.v4i8(<4 x i8> undef)
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V8 = call i8 @llvm.experimental.vector.reduce.smin.v8i8(<8 x i8> undef)
> > +; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V2 = call i8 @llvm.experimental.vector.reduce.smin.v2i8(<2 x i8> undef)
> > +; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V4 = call i8 @llvm.experimental.vector.reduce.smin.v4i8(<4 x i8> undef)
> > +; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V8 = call i8 @llvm.experimental.vector.reduce.smin.v8i8(<8 x i8> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V16 = call i8 @llvm.experimental.vector.reduce.smin.v16i8(<16 x i8> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V32 = call i8 @llvm.experimental.vector.reduce.smin.v32i8(<32 x i8> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V64 = call i8 @llvm.experimental.vector.reduce.smin.v64i8(<64 x i8> undef)
> > @@ -256,9 +256,9 @@ define i32 @reduce_i8(i32 %arg) {
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; AVX512F-LABEL: 'reduce_i8'
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smin.v2i8(<2 x
> i8> undef)
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smin.v4i8(<4 x
> i8> undef)
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smin.v8i8(<8 x
> i8> undef)
> > +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smin.v2i8(<2 x
> i8> undef)
> > +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smin.v4i8(<4 x
> i8> undef)
> > +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smin.v8i8(<8 x
> i8> undef)
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V16 = call i8 @llvm.experimental.vector.reduce.smin.v16i8(<16
> x i8> undef)
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.smin.v32i8(<32
> x i8> undef)
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.smin.v64i8(<64
> x i8> undef)
> > @@ -266,9 +266,9 @@ define i32 @reduce_i8(i32 %arg) {
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; AVX512BW-LABEL: 'reduce_i8'
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smin.v2i8(<2 x
> i8> undef)
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smin.v4i8(<4 x
> i8> undef)
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smin.v8i8(<8 x
> i8> undef)
> > +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smin.v2i8(<2 x
> i8> undef)
> > +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smin.v4i8(<4 x
> i8> undef)
> > +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smin.v8i8(<8 x
> i8> undef)
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V16 = call i8 @llvm.experimental.vector.reduce.smin.v16i8(<16
> x i8> undef)
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.smin.v32i8(<32
> x i8> undef)
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 61 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.smin.v64i8(<64
> x i8> undef)
> > @@ -276,9 +276,9 @@ define i32 @reduce_i8(i32 %arg) {
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; AVX512DQ-LABEL: 'reduce_i8'
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smin.v2i8(<2 x
> i8> undef)
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smin.v4i8(<4 x
> i8> undef)
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smin.v8i8(<8 x
> i8> undef)
> > +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smin.v2i8(<2 x
> i8> undef)
> > +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smin.v4i8(<4 x
> i8> undef)
> > +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smin.v8i8(<8 x
> i8> undef)
> >  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V16 = call i8 @llvm.experimental.vector.reduce.smin.v16i8(<16
> x i8> undef)
> >  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.smin.v32i8(<32
> x i8> undef)
> >  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.smin.v64i8(<64
> x i8> undef)
> >
> > Modified: llvm/trunk/test/Analysis/CostModel/X86/reduce-umax.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/reduce-umax.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/Analysis/CostModel/X86/reduce-umax.ll (original)
> > +++ llvm/trunk/test/Analysis/CostModel/X86/reduce-umax.ll Wed Aug  7
> 09:24:26 2019
> > @@ -83,7 +83,7 @@ define i32 @reduce_i32(i32 %arg) {
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; SSE42-LABEL: 'reduce_i32'
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 7 for
> instruction: %V2 = call i32 @llvm.experimental.vector.reduce.umax.v2i32(<2
> x i32> undef)
> > +; SSE42-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V2 = call i32 @llvm.experimental.vector.reduce.umax.v2i32(<2
> x i32> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4 = call i32 @llvm.experimental.vector.reduce.umax.v4i32(<4
> x i32> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V8 = call i32 @llvm.experimental.vector.reduce.umax.v8i32(<8
> x i32> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 4 for
> instruction: %V16 = call i32
> @llvm.experimental.vector.reduce.umax.v16i32(<16 x i32> undef)
> > @@ -91,7 +91,7 @@ define i32 @reduce_i32(i32 %arg) {
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; AVX1-LABEL: 'reduce_i32'
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2 = call i32 @llvm.experimental.vector.reduce.umax.v2i32(<2 x i32> undef)
> > +; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V2 = call i32 @llvm.experimental.vector.reduce.umax.v2i32(<2 x i32> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V4 = call i32 @llvm.experimental.vector.reduce.umax.v4i32(<4 x i32> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V8 = call i32 @llvm.experimental.vector.reduce.umax.v8i32(<8 x i32> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction:
> %V16 = call i32 @llvm.experimental.vector.reduce.umax.v16i32(<16 x i32>
> undef)
> > @@ -99,7 +99,7 @@ define i32 @reduce_i32(i32 %arg) {
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; AVX2-LABEL: 'reduce_i32'
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2 = call i32 @llvm.experimental.vector.reduce.umax.v2i32(<2 x i32> undef)
> > +; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V2 = call i32 @llvm.experimental.vector.reduce.umax.v2i32(<2 x i32> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V4 = call i32 @llvm.experimental.vector.reduce.umax.v4i32(<4 x i32> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V8 = call i32 @llvm.experimental.vector.reduce.umax.v8i32(<8 x i32> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V16 = call i32 @llvm.experimental.vector.reduce.umax.v16i32(<16 x i32>
> undef)
> > @@ -107,7 +107,7 @@ define i32 @reduce_i32(i32 %arg) {
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; AVX512-LABEL: 'reduce_i32'
> > -; AVX512-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i32 @llvm.experimental.vector.reduce.umax.v2i32(<2
> x i32> undef)
> > +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V2 = call i32 @llvm.experimental.vector.reduce.umax.v2i32(<2
> x i32> undef)
> >  ; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4 = call i32 @llvm.experimental.vector.reduce.umax.v4i32(<4
> x i32> undef)
> >  ; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8 = call i32 @llvm.experimental.vector.reduce.umax.v8i32(<8
> x i32> undef)
> >  ; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V16 = call i32
> @llvm.experimental.vector.reduce.umax.v16i32(<16 x i32> undef)
> > @@ -124,8 +124,8 @@ define i32 @reduce_i32(i32 %arg) {
> >
> >  define i32 @reduce_i16(i32 %arg) {
> >  ; SSE2-LABEL: 'reduce_i16'
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction:
> %V2 = call i16 @llvm.experimental.vector.reduce.umax.v2i16(<2 x i16> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction:
> %V4 = call i16 @llvm.experimental.vector.reduce.umax.v4i16(<4 x i16> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 4 for instruction:
> %V2 = call i16 @llvm.experimental.vector.reduce.umax.v2i16(<2 x i16> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 4 for instruction:
> %V4 = call i16 @llvm.experimental.vector.reduce.umax.v4i16(<4 x i16> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 4 for instruction:
> %V8 = call i16 @llvm.experimental.vector.reduce.umax.v8i16(<8 x i16> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %V16 = call i16 @llvm.experimental.vector.reduce.umax.v16i16(<16 x i16>
> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 16 for
> instruction: %V32 = call i16
> @llvm.experimental.vector.reduce.umax.v32i16(<32 x i16> undef)
> > @@ -133,8 +133,8 @@ define i32 @reduce_i16(i32 %arg) {
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; SSSE3-LABEL: 'reduce_i16'
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umax.v2i16(<2
> x i16> undef)
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.umax.v4i16(<4
> x i16> undef)
> > +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 4 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umax.v2i16(<2
> x i16> undef)
> > +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 4 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.umax.v4i16(<4
> x i16> undef)
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 4 for
> instruction: %V8 = call i16 @llvm.experimental.vector.reduce.umax.v8i16(<8
> x i16> undef)
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.umax.v16i16(<16 x i16> undef)
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 16 for
> instruction: %V32 = call i16
> @llvm.experimental.vector.reduce.umax.v32i16(<32 x i16> undef)
> > @@ -142,7 +142,7 @@ define i32 @reduce_i16(i32 %arg) {
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; SSE42-LABEL: 'reduce_i16'
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 7 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umax.v2i16(<2
> x i16> undef)
> > +; SSE42-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umax.v2i16(<2
> x i16> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.umax.v4i16(<4
> x i16> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8 = call i16 @llvm.experimental.vector.reduce.umax.v8i16(<8
> x i16> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.umax.v16i16(<16 x i16> undef)
> > @@ -151,7 +151,7 @@ define i32 @reduce_i16(i32 %arg) {
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; AVX1-LABEL: 'reduce_i16'
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2 = call i16 @llvm.experimental.vector.reduce.umax.v2i16(<2 x i16> undef)
> > +; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V2 = call i16 @llvm.experimental.vector.reduce.umax.v2i16(<2 x i16> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V4 = call i16 @llvm.experimental.vector.reduce.umax.v4i16(<4 x i16> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V8 = call i16 @llvm.experimental.vector.reduce.umax.v8i16(<8 x i16> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V16 = call i16 @llvm.experimental.vector.reduce.umax.v16i16(<16 x i16>
> undef)
> > @@ -160,7 +160,7 @@ define i32 @reduce_i16(i32 %arg) {
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; AVX2-LABEL: 'reduce_i16'
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2 = call i16 @llvm.experimental.vector.reduce.umax.v2i16(<2 x i16> undef)
> > +; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V2 = call i16 @llvm.experimental.vector.reduce.umax.v2i16(<2 x i16> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V4 = call i16 @llvm.experimental.vector.reduce.umax.v4i16(<4 x i16> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V8 = call i16 @llvm.experimental.vector.reduce.umax.v8i16(<8 x i16> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V16 = call i16 @llvm.experimental.vector.reduce.umax.v16i16(<16 x i16>
> undef)
> > @@ -169,7 +169,7 @@ define i32 @reduce_i16(i32 %arg) {
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; AVX512F-LABEL: 'reduce_i16'
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umax.v2i16(<2
> x i16> undef)
> > +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umax.v2i16(<2
> x i16> undef)
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.umax.v4i16(<4
> x i16> undef)
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8 = call i16 @llvm.experimental.vector.reduce.umax.v8i16(<8
> x i16> undef)
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.umax.v16i16(<16 x i16> undef)
> > @@ -178,7 +178,7 @@ define i32 @reduce_i16(i32 %arg) {
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; AVX512BW-LABEL: 'reduce_i16'
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umax.v2i16(<2
> x i16> undef)
> > +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umax.v2i16(<2
> x i16> undef)
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.umax.v4i16(<4
> x i16> undef)
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8 = call i16 @llvm.experimental.vector.reduce.umax.v8i16(<8
> x i16> undef)
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.umax.v16i16(<16 x i16> undef)
> > @@ -187,7 +187,7 @@ define i32 @reduce_i16(i32 %arg) {
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; AVX512DQ-LABEL: 'reduce_i16'
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umax.v2i16(<2
> x i16> undef)
> > +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umax.v2i16(<2
> x i16> undef)
> >  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.umax.v4i16(<4
> x i16> undef)
> >  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8 = call i16 @llvm.experimental.vector.reduce.umax.v8i16(<8
> x i16> undef)
> >  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.umax.v16i16(<16 x i16> undef)
> > @@ -206,9 +206,9 @@ define i32 @reduce_i16(i32 %arg) {
> >
> >  define i32 @reduce_i8(i32 %arg) {
> >  ; SSE2-LABEL: 'reduce_i8'
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction:
> %V2 = call i8 @llvm.experimental.vector.reduce.umax.v2i8(<2 x i8> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction:
> %V4 = call i8 @llvm.experimental.vector.reduce.umax.v4i8(<4 x i8> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 4 for instruction:
> %V8 = call i8 @llvm.experimental.vector.reduce.umax.v8i8(<8 x i8> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %V2 = call i8 @llvm.experimental.vector.reduce.umax.v2i8(<2 x i8> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %V4 = call i8 @llvm.experimental.vector.reduce.umax.v4i8(<4 x i8> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %V8 = call i8 @llvm.experimental.vector.reduce.umax.v8i8(<8 x i8> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %V16 = call i8 @llvm.experimental.vector.reduce.umax.v16i8(<16 x i8> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 16 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.umax.v32i8(<32
> x i8> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 32 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.umax.v64i8(<64
> x i8> undef)
> > @@ -216,9 +216,9 @@ define i32 @reduce_i8(i32 %arg) {
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; SSSE3-LABEL: 'reduce_i8'
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umax.v2i8(<2 x
> i8> undef)
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umax.v4i8(<4 x
> i8> undef)
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 4 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umax.v8i8(<8 x
> i8> undef)
> > +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umax.v2i8(<2 x
> i8> undef)
> > +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umax.v4i8(<4 x
> i8> undef)
> > +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umax.v8i8(<8 x
> i8> undef)
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for
> instruction: %V16 = call i8 @llvm.experimental.vector.reduce.umax.v16i8(<16
> x i8> undef)
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 16 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.umax.v32i8(<32
> x i8> undef)
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 32 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.umax.v64i8(<64
> x i8> undef)
> > @@ -226,9 +226,9 @@ define i32 @reduce_i8(i32 %arg) {
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; SSE42-LABEL: 'reduce_i8'
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 7 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umax.v2i8(<2 x
> i8> undef)
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umax.v4i8(<4 x
> i8> undef)
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umax.v8i8(<8 x
> i8> undef)
> > +; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umax.v2i8(<2 x
> i8> undef)
> > +; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umax.v4i8(<4 x
> i8> undef)
> > +; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umax.v8i8(<8 x
> i8> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V16 = call i8 @llvm.experimental.vector.reduce.umax.v16i8(<16
> x i8> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 6 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.umax.v32i8(<32
> x i8> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 12 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.umax.v64i8(<64
> x i8> undef)
> > @@ -236,9 +236,9 @@ define i32 @reduce_i8(i32 %arg) {
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; AVX1-LABEL: 'reduce_i8'
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2 = call i8 @llvm.experimental.vector.reduce.umax.v2i8(<2 x i8> undef)
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V4 = call i8 @llvm.experimental.vector.reduce.umax.v4i8(<4 x i8> undef)
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V8 = call i8 @llvm.experimental.vector.reduce.umax.v8i8(<8 x i8> undef)
> > +; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V2 = call i8 @llvm.experimental.vector.reduce.umax.v2i8(<2 x i8> undef)
> > +; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V4 = call i8 @llvm.experimental.vector.reduce.umax.v4i8(<4 x i8> undef)
> > +; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V8 = call i8 @llvm.experimental.vector.reduce.umax.v8i8(<8 x i8> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V16 = call i8 @llvm.experimental.vector.reduce.umax.v16i8(<16 x i8> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V32 = call i8 @llvm.experimental.vector.reduce.umax.v32i8(<32 x i8> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction:
> %V64 = call i8 @llvm.experimental.vector.reduce.umax.v64i8(<64 x i8> undef)
> > @@ -246,9 +246,9 @@ define i32 @reduce_i8(i32 %arg) {
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; AVX2-LABEL: 'reduce_i8'
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2 = call i8 @llvm.experimental.vector.reduce.umax.v2i8(<2 x i8> undef)
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V4 = call i8 @llvm.experimental.vector.reduce.umax.v4i8(<4 x i8> undef)
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V8 = call i8 @llvm.experimental.vector.reduce.umax.v8i8(<8 x i8> undef)
> > +; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V2 = call i8 @llvm.experimental.vector.reduce.umax.v2i8(<2 x i8> undef)
> > +; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V4 = call i8 @llvm.experimental.vector.reduce.umax.v4i8(<4 x i8> undef)
> > +; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V8 = call i8 @llvm.experimental.vector.reduce.umax.v8i8(<8 x i8> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V16 = call i8 @llvm.experimental.vector.reduce.umax.v16i8(<16 x i8> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V32 = call i8 @llvm.experimental.vector.reduce.umax.v32i8(<32 x i8> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V64 = call i8 @llvm.experimental.vector.reduce.umax.v64i8(<64 x i8> undef)
> > @@ -256,9 +256,9 @@ define i32 @reduce_i8(i32 %arg) {
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; AVX512F-LABEL: 'reduce_i8'
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umax.v2i8(<2 x
> i8> undef)
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umax.v4i8(<4 x
> i8> undef)
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umax.v8i8(<8 x
> i8> undef)
> > +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umax.v2i8(<2 x
> i8> undef)
> > +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umax.v4i8(<4 x
> i8> undef)
> > +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umax.v8i8(<8 x
> i8> undef)
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V16 = call i8 @llvm.experimental.vector.reduce.umax.v16i8(<16
> x i8> undef)
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.umax.v32i8(<32
> x i8> undef)
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.umax.v64i8(<64
> x i8> undef)
> > @@ -266,9 +266,9 @@ define i32 @reduce_i8(i32 %arg) {
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; AVX512BW-LABEL: 'reduce_i8'
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umax.v2i8(<2 x
> i8> undef)
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umax.v4i8(<4 x
> i8> undef)
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umax.v8i8(<8 x
> i8> undef)
> > +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umax.v2i8(<2 x
> i8> undef)
> > +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umax.v4i8(<4 x
> i8> undef)
> > +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umax.v8i8(<8 x
> i8> undef)
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V16 = call i8 @llvm.experimental.vector.reduce.umax.v16i8(<16
> x i8> undef)
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.umax.v32i8(<32
> x i8> undef)
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 61 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.umax.v64i8(<64
> x i8> undef)
> > @@ -276,9 +276,9 @@ define i32 @reduce_i8(i32 %arg) {
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; AVX512DQ-LABEL: 'reduce_i8'
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umax.v2i8(<2 x
> i8> undef)
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umax.v4i8(<4 x
> i8> undef)
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umax.v8i8(<8 x
> i8> undef)
> > +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umax.v2i8(<2 x
> i8> undef)
> > +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umax.v4i8(<4 x
> i8> undef)
> > +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umax.v8i8(<8 x
> i8> undef)
> >  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V16 = call i8 @llvm.experimental.vector.reduce.umax.v16i8(<16
> x i8> undef)
> >  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.umax.v32i8(<32
> x i8> undef)
> >  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.umax.v64i8(<64
> x i8> undef)
> >
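A minimal standalone repro for the umax reduction rows above, in case anyone wants to poke at this outside the checked-in test (the function name is invented for illustration; the intrinsic signature is exactly the one the test calls):

  declare i16 @llvm.experimental.vector.reduce.umax.v2i16(<2 x i16>)

  define i16 @reduce_umax_v2i16(<2 x i16> %v) {
    ; Unsigned-max reduction of a <2 x i16> vector down to a single i16;
    ; this is the call whose estimated cost the SSSE3/SSE42/AVX lines above check.
    %r = call i16 @llvm.experimental.vector.reduce.umax.v2i16(<2 x i16> %v)
    ret i16 %r
  }
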
> > Modified: llvm/trunk/test/Analysis/CostModel/X86/reduce-umin.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/reduce-umin.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/Analysis/CostModel/X86/reduce-umin.ll (original)
> > +++ llvm/trunk/test/Analysis/CostModel/X86/reduce-umin.ll Wed Aug  7
> 09:24:26 2019
> > @@ -83,7 +83,7 @@ define i32 @reduce_i32(i32 %arg) {
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; SSE42-LABEL: 'reduce_i32'
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 7 for
> instruction: %V2 = call i32 @llvm.experimental.vector.reduce.umin.v2i32(<2
> x i32> undef)
> > +; SSE42-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V2 = call i32 @llvm.experimental.vector.reduce.umin.v2i32(<2
> x i32> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4 = call i32 @llvm.experimental.vector.reduce.umin.v4i32(<4
> x i32> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V8 = call i32 @llvm.experimental.vector.reduce.umin.v8i32(<8
> x i32> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 4 for
> instruction: %V16 = call i32
> @llvm.experimental.vector.reduce.umin.v16i32(<16 x i32> undef)
> > @@ -91,7 +91,7 @@ define i32 @reduce_i32(i32 %arg) {
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; AVX1-LABEL: 'reduce_i32'
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2 = call i32 @llvm.experimental.vector.reduce.umin.v2i32(<2 x i32> undef)
> > +; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V2 = call i32 @llvm.experimental.vector.reduce.umin.v2i32(<2 x i32> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V4 = call i32 @llvm.experimental.vector.reduce.umin.v4i32(<4 x i32> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V8 = call i32 @llvm.experimental.vector.reduce.umin.v8i32(<8 x i32> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction:
> %V16 = call i32 @llvm.experimental.vector.reduce.umin.v16i32(<16 x i32>
> undef)
> > @@ -99,7 +99,7 @@ define i32 @reduce_i32(i32 %arg) {
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; AVX2-LABEL: 'reduce_i32'
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2 = call i32 @llvm.experimental.vector.reduce.umin.v2i32(<2 x i32> undef)
> > +; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V2 = call i32 @llvm.experimental.vector.reduce.umin.v2i32(<2 x i32> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V4 = call i32 @llvm.experimental.vector.reduce.umin.v4i32(<4 x i32> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V8 = call i32 @llvm.experimental.vector.reduce.umin.v8i32(<8 x i32> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V16 = call i32 @llvm.experimental.vector.reduce.umin.v16i32(<16 x i32>
> undef)
> > @@ -107,7 +107,7 @@ define i32 @reduce_i32(i32 %arg) {
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; AVX512-LABEL: 'reduce_i32'
> > -; AVX512-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i32 @llvm.experimental.vector.reduce.umin.v2i32(<2
> x i32> undef)
> > +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V2 = call i32 @llvm.experimental.vector.reduce.umin.v2i32(<2
> x i32> undef)
> >  ; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4 = call i32 @llvm.experimental.vector.reduce.umin.v4i32(<4
> x i32> undef)
> >  ; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8 = call i32 @llvm.experimental.vector.reduce.umin.v8i32(<8
> x i32> undef)
> >  ; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V16 = call i32
> @llvm.experimental.vector.reduce.umin.v16i32(<16 x i32> undef)
> > @@ -124,8 +124,8 @@ define i32 @reduce_i32(i32 %arg) {
> >
> >  define i32 @reduce_i16(i32 %arg) {
> >  ; SSE2-LABEL: 'reduce_i16'
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction:
> %V2 = call i16 @llvm.experimental.vector.reduce.umin.v2i16(<2 x i16> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction:
> %V4 = call i16 @llvm.experimental.vector.reduce.umin.v4i16(<4 x i16> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 4 for instruction:
> %V2 = call i16 @llvm.experimental.vector.reduce.umin.v2i16(<2 x i16> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 4 for instruction:
> %V4 = call i16 @llvm.experimental.vector.reduce.umin.v4i16(<4 x i16> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 4 for instruction:
> %V8 = call i16 @llvm.experimental.vector.reduce.umin.v8i16(<8 x i16> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %V16 = call i16 @llvm.experimental.vector.reduce.umin.v16i16(<16 x i16>
> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 16 for
> instruction: %V32 = call i16
> @llvm.experimental.vector.reduce.umin.v32i16(<32 x i16> undef)
> > @@ -133,8 +133,8 @@ define i32 @reduce_i16(i32 %arg) {
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; SSSE3-LABEL: 'reduce_i16'
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umin.v2i16(<2
> x i16> undef)
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.umin.v4i16(<4
> x i16> undef)
> > +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 4 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umin.v2i16(<2
> x i16> undef)
> > +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 4 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.umin.v4i16(<4
> x i16> undef)
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 4 for
> instruction: %V8 = call i16 @llvm.experimental.vector.reduce.umin.v8i16(<8
> x i16> undef)
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.umin.v16i16(<16 x i16> undef)
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 16 for
> instruction: %V32 = call i16
> @llvm.experimental.vector.reduce.umin.v32i16(<32 x i16> undef)
> > @@ -142,7 +142,7 @@ define i32 @reduce_i16(i32 %arg) {
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; SSE42-LABEL: 'reduce_i16'
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 7 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umin.v2i16(<2
> x i16> undef)
> > +; SSE42-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umin.v2i16(<2
> x i16> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.umin.v4i16(<4
> x i16> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8 = call i16 @llvm.experimental.vector.reduce.umin.v8i16(<8
> x i16> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.umin.v16i16(<16 x i16> undef)
> > @@ -151,7 +151,7 @@ define i32 @reduce_i16(i32 %arg) {
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; AVX1-LABEL: 'reduce_i16'
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2 = call i16 @llvm.experimental.vector.reduce.umin.v2i16(<2 x i16> undef)
> > +; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V2 = call i16 @llvm.experimental.vector.reduce.umin.v2i16(<2 x i16> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V4 = call i16 @llvm.experimental.vector.reduce.umin.v4i16(<4 x i16> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V8 = call i16 @llvm.experimental.vector.reduce.umin.v8i16(<8 x i16> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V16 = call i16 @llvm.experimental.vector.reduce.umin.v16i16(<16 x i16>
> undef)
> > @@ -160,7 +160,7 @@ define i32 @reduce_i16(i32 %arg) {
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; AVX2-LABEL: 'reduce_i16'
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2 = call i16 @llvm.experimental.vector.reduce.umin.v2i16(<2 x i16> undef)
> > +; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V2 = call i16 @llvm.experimental.vector.reduce.umin.v2i16(<2 x i16> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V4 = call i16 @llvm.experimental.vector.reduce.umin.v4i16(<4 x i16> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V8 = call i16 @llvm.experimental.vector.reduce.umin.v8i16(<8 x i16> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V16 = call i16 @llvm.experimental.vector.reduce.umin.v16i16(<16 x i16>
> undef)
> > @@ -169,7 +169,7 @@ define i32 @reduce_i16(i32 %arg) {
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; AVX512F-LABEL: 'reduce_i16'
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umin.v2i16(<2
> x i16> undef)
> > +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umin.v2i16(<2
> x i16> undef)
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.umin.v4i16(<4
> x i16> undef)
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8 = call i16 @llvm.experimental.vector.reduce.umin.v8i16(<8
> x i16> undef)
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.umin.v16i16(<16 x i16> undef)
> > @@ -178,7 +178,7 @@ define i32 @reduce_i16(i32 %arg) {
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; AVX512BW-LABEL: 'reduce_i16'
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umin.v2i16(<2
> x i16> undef)
> > +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umin.v2i16(<2
> x i16> undef)
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.umin.v4i16(<4
> x i16> undef)
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8 = call i16 @llvm.experimental.vector.reduce.umin.v8i16(<8
> x i16> undef)
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.umin.v16i16(<16 x i16> undef)
> > @@ -187,7 +187,7 @@ define i32 @reduce_i16(i32 %arg) {
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; AVX512DQ-LABEL: 'reduce_i16'
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umin.v2i16(<2
> x i16> undef)
> > +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umin.v2i16(<2
> x i16> undef)
> >  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.umin.v4i16(<4
> x i16> undef)
> >  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8 = call i16 @llvm.experimental.vector.reduce.umin.v8i16(<8
> x i16> undef)
> >  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.umin.v16i16(<16 x i16> undef)
> > @@ -206,9 +206,9 @@ define i32 @reduce_i16(i32 %arg) {
> >
> >  define i32 @reduce_i8(i32 %arg) {
> >  ; SSE2-LABEL: 'reduce_i8'
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction:
> %V2 = call i8 @llvm.experimental.vector.reduce.umin.v2i8(<2 x i8> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction:
> %V4 = call i8 @llvm.experimental.vector.reduce.umin.v4i8(<4 x i8> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 4 for instruction:
> %V8 = call i8 @llvm.experimental.vector.reduce.umin.v8i8(<8 x i8> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %V2 = call i8 @llvm.experimental.vector.reduce.umin.v2i8(<2 x i8> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %V4 = call i8 @llvm.experimental.vector.reduce.umin.v4i8(<4 x i8> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %V8 = call i8 @llvm.experimental.vector.reduce.umin.v8i8(<8 x i8> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %V16 = call i8 @llvm.experimental.vector.reduce.umin.v16i8(<16 x i8> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 16 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.umin.v32i8(<32
> x i8> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 32 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.umin.v64i8(<64
> x i8> undef)
> > @@ -216,9 +216,9 @@ define i32 @reduce_i8(i32 %arg) {
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; SSSE3-LABEL: 'reduce_i8'
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umin.v2i8(<2 x
> i8> undef)
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umin.v4i8(<4 x
> i8> undef)
> > -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 4 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umin.v8i8(<8 x
> i8> undef)
> > +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umin.v2i8(<2 x
> i8> undef)
> > +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umin.v4i8(<4 x
> i8> undef)
> > +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umin.v8i8(<8 x
> i8> undef)
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for
> instruction: %V16 = call i8 @llvm.experimental.vector.reduce.umin.v16i8(<16
> x i8> undef)
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 16 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.umin.v32i8(<32
> x i8> undef)
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 32 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.umin.v64i8(<64
> x i8> undef)
> > @@ -226,9 +226,9 @@ define i32 @reduce_i8(i32 %arg) {
> >  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; SSE42-LABEL: 'reduce_i8'
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 7 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umin.v2i8(<2 x
> i8> undef)
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umin.v4i8(<4 x
> i8> undef)
> > -; SSE42-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umin.v8i8(<8 x
> i8> undef)
> > +; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umin.v2i8(<2 x
> i8> undef)
> > +; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umin.v4i8(<4 x
> i8> undef)
> > +; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umin.v8i8(<8 x
> i8> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V16 = call i8 @llvm.experimental.vector.reduce.umin.v16i8(<16
> x i8> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 6 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.umin.v32i8(<32
> x i8> undef)
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 12 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.umin.v64i8(<64
> x i8> undef)
> > @@ -236,9 +236,9 @@ define i32 @reduce_i8(i32 %arg) {
> >  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; AVX1-LABEL: 'reduce_i8'
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2 = call i8 @llvm.experimental.vector.reduce.umin.v2i8(<2 x i8> undef)
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V4 = call i8 @llvm.experimental.vector.reduce.umin.v4i8(<4 x i8> undef)
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V8 = call i8 @llvm.experimental.vector.reduce.umin.v8i8(<8 x i8> undef)
> > +; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V2 = call i8 @llvm.experimental.vector.reduce.umin.v2i8(<2 x i8> undef)
> > +; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V4 = call i8 @llvm.experimental.vector.reduce.umin.v4i8(<4 x i8> undef)
> > +; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V8 = call i8 @llvm.experimental.vector.reduce.umin.v8i8(<8 x i8> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V16 = call i8 @llvm.experimental.vector.reduce.umin.v16i8(<16 x i8> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V32 = call i8 @llvm.experimental.vector.reduce.umin.v32i8(<32 x i8> undef)
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction:
> %V64 = call i8 @llvm.experimental.vector.reduce.umin.v64i8(<64 x i8> undef)
> > @@ -246,9 +246,9 @@ define i32 @reduce_i8(i32 %arg) {
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; AVX2-LABEL: 'reduce_i8'
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2 = call i8 @llvm.experimental.vector.reduce.umin.v2i8(<2 x i8> undef)
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V4 = call i8 @llvm.experimental.vector.reduce.umin.v4i8(<4 x i8> undef)
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V8 = call i8 @llvm.experimental.vector.reduce.umin.v8i8(<8 x i8> undef)
> > +; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V2 = call i8 @llvm.experimental.vector.reduce.umin.v2i8(<2 x i8> undef)
> > +; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V4 = call i8 @llvm.experimental.vector.reduce.umin.v4i8(<4 x i8> undef)
> > +; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V8 = call i8 @llvm.experimental.vector.reduce.umin.v8i8(<8 x i8> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V16 = call i8 @llvm.experimental.vector.reduce.umin.v16i8(<16 x i8> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V32 = call i8 @llvm.experimental.vector.reduce.umin.v32i8(<32 x i8> undef)
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V64 = call i8 @llvm.experimental.vector.reduce.umin.v64i8(<64 x i8> undef)
> > @@ -256,9 +256,9 @@ define i32 @reduce_i8(i32 %arg) {
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; AVX512F-LABEL: 'reduce_i8'
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umin.v2i8(<2 x
> i8> undef)
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umin.v4i8(<4 x
> i8> undef)
> > -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umin.v8i8(<8 x
> i8> undef)
> > +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umin.v2i8(<2 x
> i8> undef)
> > +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umin.v4i8(<4 x
> i8> undef)
> > +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umin.v8i8(<8 x
> i8> undef)
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V16 = call i8 @llvm.experimental.vector.reduce.umin.v16i8(<16
> x i8> undef)
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.umin.v32i8(<32
> x i8> undef)
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.umin.v64i8(<64
> x i8> undef)
> > @@ -266,9 +266,9 @@ define i32 @reduce_i8(i32 %arg) {
> >  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; AVX512BW-LABEL: 'reduce_i8'
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umin.v2i8(<2 x
> i8> undef)
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umin.v4i8(<4 x
> i8> undef)
> > -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umin.v8i8(<8 x
> i8> undef)
> > +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umin.v2i8(<2 x
> i8> undef)
> > +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umin.v4i8(<4 x
> i8> undef)
> > +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umin.v8i8(<8 x
> i8> undef)
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V16 = call i8 @llvm.experimental.vector.reduce.umin.v16i8(<16
> x i8> undef)
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.umin.v32i8(<32
> x i8> undef)
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 61 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.umin.v64i8(<64
> x i8> undef)
> > @@ -276,9 +276,9 @@ define i32 @reduce_i8(i32 %arg) {
> >  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret i32 undef
> >  ;
> >  ; AVX512DQ-LABEL: 'reduce_i8'
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 3 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umin.v2i8(<2 x
> i8> undef)
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umin.v4i8(<4 x
> i8> undef)
> > -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umin.v8i8(<8 x
> i8> undef)
> > +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umin.v2i8(<2 x
> i8> undef)
> > +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umin.v4i8(<4 x
> i8> undef)
> > +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umin.v8i8(<8 x
> i8> undef)
> >  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V16 = call i8 @llvm.experimental.vector.reduce.umin.v16i8(<16
> x i8> undef)
> >  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.umin.v32i8(<32
> x i8> undef)
> >  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.umin.v64i8(<64
> x i8> undef)
> >
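For triage, these CostModel files are normally driven through opt's cost-model analysis pass and checked with FileCheck. Roughly like the sketch below (the triple and -mattr values are placeholders, not copied from the file's actual RUN lines):

  ; RUN: opt < reduce-umin.ll -cost-model -analyze -mtriple=x86_64-unknown-linux-gnu \
  ; RUN:   -mattr=+sse4.2 | FileCheck reduce-umin.ll --check-prefix=SSE42

opt prints the "Found an estimated cost of N for instruction" output that FileCheck matches against the hunks above.
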
> > Modified: llvm/trunk/test/Analysis/CostModel/X86/reduce-xor.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/reduce-xor.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/Analysis/CostModel/X86/reduce-xor.ll (original)
> > +++ llvm/trunk/test/Analysis/CostModel/X86/reduce-xor.ll Wed Aug  7
> 09:24:26 2019
> > @@ -92,8 +92,8 @@ define i32 @reduce_i32(i32 %arg) {
> >
> >  define i32 @reduce_i16(i32 %arg) {
> >  ; SSE2-LABEL: 'reduce_i16'
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2 = call i16 @llvm.experimental.vector.reduce.xor.v2i16(<2 x i16> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 5 for instruction:
> %V4 = call i16 @llvm.experimental.vector.reduce.xor.v4i16(<4 x i16> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 7 for instruction:
> %V2 = call i16 @llvm.experimental.vector.reduce.xor.v2i16(<2 x i16> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 13 for
> instruction: %V4 = call i16 @llvm.experimental.vector.reduce.xor.v4i16(<4 x
> i16> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 19 for
> instruction: %V8 = call i16 @llvm.experimental.vector.reduce.xor.v8i16(<8 x
> i16> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 20 for
> instruction: %V16 = call i16
> @llvm.experimental.vector.reduce.xor.v16i16(<16 x i16> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 22 for
> instruction: %V32 = call i16
> @llvm.experimental.vector.reduce.xor.v32i16(<32 x i16> undef)
> > @@ -174,9 +174,9 @@ define i32 @reduce_i16(i32 %arg) {
> >
> >  define i32 @reduce_i8(i32 %arg) {
> >  ; SSE2-LABEL: 'reduce_i8'
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V2 = call i8 @llvm.experimental.vector.reduce.xor.v2i8(<2 x i8> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 5 for instruction:
> %V4 = call i8 @llvm.experimental.vector.reduce.xor.v4i8(<4 x i8> undef)
> > -; SSE2-NEXT:  Cost Model: Found an estimated cost of 19 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.xor.v8i8(<8 x
> i8> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 12 for
> instruction: %V2 = call i8 @llvm.experimental.vector.reduce.xor.v2i8(<2 x
> i8> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 23 for
> instruction: %V4 = call i8 @llvm.experimental.vector.reduce.xor.v4i8(<4 x
> i8> undef)
> > +; SSE2-NEXT:  Cost Model: Found an estimated cost of 34 for
> instruction: %V8 = call i8 @llvm.experimental.vector.reduce.xor.v8i8(<8 x
> i8> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 45 for
> instruction: %V16 = call i8 @llvm.experimental.vector.reduce.xor.v16i8(<16
> x i8> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 46 for
> instruction: %V32 = call i8 @llvm.experimental.vector.reduce.xor.v32i8(<32
> x i8> undef)
> >  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 48 for
> instruction: %V64 = call i8 @llvm.experimental.vector.reduce.xor.v64i8(<64
> x i8> undef)
> >
> > Modified: llvm/trunk/test/Analysis/CostModel/X86/shuffle-transpose.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/shuffle-transpose.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/Analysis/CostModel/X86/shuffle-transpose.ll
> (original)
> > +++ llvm/trunk/test/Analysis/CostModel/X86/shuffle-transpose.ll Wed Aug
> 7 09:24:26 2019
> > @@ -123,21 +123,21 @@ define void @test_vXf32(<2 x float> %a64
> >
> >  define void @test_vXi32(<2 x i32> %a64, <2 x i32> %b64, <4 x i32>
> %a128, <4 x i32> %b128, <8 x i32> %a256, <8 x i32> %b256, <16 x i32> %a512,
> <16 x i32> %b512) {
> >  ; SSE-LABEL: 'test_vXi32'
> > -; SSE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V64 = shufflevector <2 x i32> %a64, <2 x i32> %b64, <2 x i32> <i32 0, i32
> 2>
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V64 = shufflevector <2 x i32> %a64, <2 x i32> %b64, <2 x i32> <i32 0, i32
> 2>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V128 = shufflevector <4 x i32> %a128, <4 x i32> %b128, <4 x i32> <i32 0,
> i32 4, i32 2, i32 6>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 12 for instruction:
> %V256 = shufflevector <8 x i32> %a256, <8 x i32> %b256, <8 x i32> <i32 0,
> i32 8, i32 2, i32 10, i32 4, i32 12, i32 6, i32 14>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 56 for instruction:
> %V512 = shufflevector <16 x i32> %a512, <16 x i32> %b512, <16 x i32> <i32
> 0, i32 16, i32 2, i32 18, i32 4, i32 20, i32 6, i32 22, i32 8, i32 24, i32
> 10, i32 26, i32 12, i32 28, i32 14, i32 30>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret void
> >  ;
> >  ; AVX1-LABEL: 'test_vXi32'
> > -; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V64 = shufflevector <2 x i32> %a64, <2 x i32> %b64, <2 x i32> <i32 0, i32
> 2>
> > +; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V64 = shufflevector <2 x i32> %a64, <2 x i32> %b64, <2 x i32> <i32 0, i32
> 2>
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V128 = shufflevector <4 x i32> %a128, <4 x i32> %b128, <4 x i32> <i32 0,
> i32 4, i32 2, i32 6>
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction:
> %V256 = shufflevector <8 x i32> %a256, <8 x i32> %b256, <8 x i32> <i32 0,
> i32 8, i32 2, i32 10, i32 4, i32 12, i32 6, i32 14>
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 24 for
> instruction: %V512 = shufflevector <16 x i32> %a512, <16 x i32> %b512, <16
> x i32> <i32 0, i32 16, i32 2, i32 18, i32 4, i32 20, i32 6, i32 22, i32 8,
> i32 24, i32 10, i32 26, i32 12, i32 28, i32 14, i32 30>
> >  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret void
> >  ;
> >  ; AVX2-LABEL: 'test_vXi32'
> > -; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %V64 = shufflevector <2 x i32> %a64, <2 x i32> %b64, <2 x i32> <i32 0, i32
> 2>
> > +; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V64 = shufflevector <2 x i32> %a64, <2 x i32> %b64, <2 x i32> <i32 0, i32
> 2>
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %V128 = shufflevector <4 x i32> %a128, <4 x i32> %b128, <4 x i32> <i32 0,
> i32 4, i32 2, i32 6>
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %V256 = shufflevector <8 x i32> %a256, <8 x i32> %b256, <8 x i32> <i32 0,
> i32 8, i32 2, i32 10, i32 4, i32 12, i32 6, i32 14>
> >  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 18 for
> instruction: %V512 = shufflevector <16 x i32> %a512, <16 x i32> %b512, <16
> x i32> <i32 0, i32 16, i32 2, i32 18, i32 4, i32 20, i32 6, i32 22, i32 8,
> i32 24, i32 10, i32 26, i32 12, i32 28, i32 14, i32 30>
> > @@ -151,7 +151,7 @@ define void @test_vXi32(<2 x i32> %a64,
> >  ; AVX512-NEXT:  Cost Model: Found an estimated cost of 0 for
> instruction: ret void
> >  ;
> >  ; BTVER2-LABEL: 'test_vXi32'
> > -; BTVER2-NEXT:  Cost Model: Found an estimated cost of 1 for
> instruction: %V64 = shufflevector <2 x i32> %a64, <2 x i32> %b64, <2 x i32>
> <i32 0, i32 2>
> > +; BTVER2-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V64 = shufflevector <2 x i32> %a64, <2 x i32> %b64, <2 x i32>
> <i32 0, i32 2>
> >  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 2 for
> instruction: %V128 = shufflevector <4 x i32> %a128, <4 x i32> %b128, <4 x
> i32> <i32 0, i32 4, i32 2, i32 6>
> >  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 4 for
> instruction: %V256 = shufflevector <8 x i32> %a256, <8 x i32> %b256, <8 x
> i32> <i32 0, i32 8, i32 2, i32 10, i32 4, i32 12, i32 6, i32 14>
> >  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 24 for
> instruction: %V512 = shufflevector <16 x i32> %a512, <16 x i32> %b512, <16
> x i32> <i32 0, i32 16, i32 2, i32 18, i32 4, i32 20, i32 6, i32 22, i32 8,
> i32 24, i32 10, i32 26, i32 12, i32 28, i32 14, i32 30>
> >
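The shuffle-transpose changes all come down to the same two-source shufflevector pattern; the <2 x i32> case that moved from 1 to 2 can be reduced to this (function name invented for illustration):

  define <2 x i32> @transpose_v2i32(<2 x i32> %a, <2 x i32> %b) {
    ; Take lane 0 of %a and lane 0 of %b, i.e. the %V64 shuffle in the hunks above.
    %t = shufflevector <2 x i32> %a, <2 x i32> %b, <2 x i32> <i32 0, i32 2>
    ret <2 x i32> %t
  }
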
> > Modified: llvm/trunk/test/Analysis/CostModel/X86/sitofp.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/sitofp.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/Analysis/CostModel/X86/sitofp.ll (original)
> > +++ llvm/trunk/test/Analysis/CostModel/X86/sitofp.ll Wed Aug  7 09:24:26
> 2019
> > @@ -13,9 +13,9 @@
> >  define i32 @sitofp_i8_double() {
> >  ; SSE-LABEL: 'sitofp_i8_double'
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %cvt_i8_f64 = sitofp i8 undef to double
> > -; SSE-NEXT:  Cost Model: Found an estimated cost of 20 for instruction:
> %cvt_v2i8_v2f64 = sitofp <2 x i8> undef to <2 x double>
> > -; SSE-NEXT:  Cost Model: Found an estimated cost of 40 for instruction:
> %cvt_v4i8_v4f64 = sitofp <4 x i8> undef to <4 x double>
> > -; SSE-NEXT:  Cost Model: Found an estimated cost of 80 for instruction:
> %cvt_v8i8_v8f64 = sitofp <8 x i8> undef to <8 x double>
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 160 for
> instruction: %cvt_v2i8_v2f64 = sitofp <2 x i8> undef to <2 x double>
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 160 for
> instruction: %cvt_v4i8_v4f64 = sitofp <4 x i8> undef to <4 x double>
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 160 for
> instruction: %cvt_v8i8_v8f64 = sitofp <8 x i8> undef to <8 x double>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; AVX-LABEL: 'sitofp_i8_double'
> > @@ -49,8 +49,8 @@ define i32 @sitofp_i8_double() {
> >  define i32 @sitofp_i16_double() {
> >  ; SSE-LABEL: 'sitofp_i16_double'
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %cvt_i16_f64 = sitofp i16 undef to double
> > -; SSE-NEXT:  Cost Model: Found an estimated cost of 20 for instruction:
> %cvt_v2i16_v2f64 = sitofp <2 x i16> undef to <2 x double>
> > -; SSE-NEXT:  Cost Model: Found an estimated cost of 40 for instruction:
> %cvt_v4i16_v4f64 = sitofp <4 x i16> undef to <4 x double>
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 80 for instruction:
> %cvt_v2i16_v2f64 = sitofp <2 x i16> undef to <2 x double>
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 80 for instruction:
> %cvt_v4i16_v4f64 = sitofp <4 x i16> undef to <4 x double>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 80 for instruction:
> %cvt_v8i16_v8f64 = sitofp <8 x i16> undef to <8 x double>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> > @@ -85,7 +85,7 @@ define i32 @sitofp_i16_double() {
> >  define i32 @sitofp_i32_double() {
> >  ; SSE-LABEL: 'sitofp_i32_double'
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %cvt_i32_f64 = sitofp i32 undef to double
> > -; SSE-NEXT:  Cost Model: Found an estimated cost of 20 for instruction:
> %cvt_v2i32_v2f64 = sitofp <2 x i32> undef to <2 x double>
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 40 for instruction:
> %cvt_v2i32_v2f64 = sitofp <2 x i32> undef to <2 x double>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 40 for instruction:
> %cvt_v4i32_v4f64 = sitofp <4 x i32> undef to <4 x double>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 80 for instruction:
> %cvt_v8i32_v8f64 = sitofp <8 x i32> undef to <8 x double>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> > @@ -164,8 +164,8 @@ define i32 @sitofp_i64_double() {
> >  define i32 @sitofp_i8_float() {
> >  ; SSE-LABEL: 'sitofp_i8_float'
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %cvt_i8_f32 = sitofp i8 undef to float
> > -; SSE-NEXT:  Cost Model: Found an estimated cost of 5 for instruction:
> %cvt_v4i8_v4f32 = sitofp <4 x i8> undef to <4 x float>
> > -; SSE-NEXT:  Cost Model: Found an estimated cost of 15 for instruction:
> %cvt_v8i8_v8f32 = sitofp <8 x i8> undef to <8 x float>
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %cvt_v4i8_v4f32 = sitofp <4 x i8> undef to <4 x float>
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %cvt_v8i8_v8f32 = sitofp <8 x i8> undef to <8 x float>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %cvt_v16i8_v16f32 = sitofp <16 x i8> undef to <16 x float>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> > @@ -200,7 +200,7 @@ define i32 @sitofp_i8_float() {
> >  define i32 @sitofp_i16_float() {
> >  ; SSE-LABEL: 'sitofp_i16_float'
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %cvt_i16_f32 = sitofp i16 undef to float
> > -; SSE-NEXT:  Cost Model: Found an estimated cost of 5 for instruction:
> %cvt_v4i16_v4f32 = sitofp <4 x i16> undef to <4 x float>
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 15 for instruction:
> %cvt_v4i16_v4f32 = sitofp <4 x i16> undef to <4 x float>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 15 for instruction:
> %cvt_v8i16_v8f32 = sitofp <8 x i16> undef to <8 x float>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 30 for instruction:
> %cvt_v16i16_v16f32 = sitofp <16 x i16> undef to <16 x float>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >
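Similarly, each sitofp row is just a vector int-to-FP convert; the <2 x i8> to <2 x double> case from the first hunk, pulled out on its own (the name is illustrative only):

  define <2 x double> @cvt_v2i8_v2f64(<2 x i8> %v) {
    ; Signed i8 elements converted to double; the SSE line above now costs this at 160.
    %c = sitofp <2 x i8> %v to <2 x double>
    ret <2 x double> %c
  }
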
> > Modified: llvm/trunk/test/Analysis/CostModel/X86/slm-arith-costs.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/slm-arith-costs.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/Analysis/CostModel/X86/slm-arith-costs.ll (original)
> > +++ llvm/trunk/test/Analysis/CostModel/X86/slm-arith-costs.ll Wed Aug  7
> 09:24:26 2019
> > @@ -47,11 +47,11 @@ entry:
> >
> >  define <2 x i8> @slm-costs_8_v2_mul(<2 x i8> %a, <2 x i8> %b)  {
> >  ; SLM-LABEL: 'slm-costs_8_v2_mul'
> > -; SLM-NEXT:  Cost Model: Found an estimated cost of 17 for instruction:
> %res = mul nsw <2 x i8> %a, %b
> > +; SLM-NEXT:  Cost Model: Found an estimated cost of 14 for instruction:
> %res = mul nsw <2 x i8> %a, %b
> >  ; SLM-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret <2 x i8> %res
> >  ;
> >  ; GLM-LABEL: 'slm-costs_8_v2_mul'
> > -; GLM-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %res = mul nsw <2 x i8> %a, %b
> > +; GLM-NEXT:  Cost Model: Found an estimated cost of 12 for instruction:
> %res = mul nsw <2 x i8> %a, %b
> >  ; GLM-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret <2 x i8> %res
> >  ;
> >  entry:
> > @@ -61,11 +61,11 @@ entry:
> >
> >  define <4 x i8> @slm-costs_8_v4_mul(<4 x i8> %a, <4 x i8> %b)  {
> >  ; SLM-LABEL: 'slm-costs_8_v4_mul'
> > -; SLM-NEXT:  Cost Model: Found an estimated cost of 3 for instruction:
> %res = mul nsw <4 x i8> %a, %b
> > +; SLM-NEXT:  Cost Model: Found an estimated cost of 14 for instruction:
> %res = mul nsw <4 x i8> %a, %b
> >  ; SLM-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret <4 x i8> %res
> >  ;
> >  ; GLM-LABEL: 'slm-costs_8_v4_mul'
> > -; GLM-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %res = mul nsw <4 x i8> %a, %b
> > +; GLM-NEXT:  Cost Model: Found an estimated cost of 12 for instruction:
> %res = mul nsw <4 x i8> %a, %b
> >  ; GLM-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret <4 x i8> %res
> >  ;
> >  entry:
> > @@ -177,11 +177,11 @@ entry:
> >
> >  define <8 x i8> @slm-costs_8_v8_mul(<8 x i8> %a, <8 x i8> %b)  {
> >  ; SLM-LABEL: 'slm-costs_8_v8_mul'
> > -; SLM-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %res = mul nsw <8 x i8> %a, %b
> > +; SLM-NEXT:  Cost Model: Found an estimated cost of 14 for instruction:
> %res = mul nsw <8 x i8> %a, %b
> >  ; SLM-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret <8 x i8> %res
> >  ;
> >  ; GLM-LABEL: 'slm-costs_8_v8_mul'
> > -; GLM-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %res = mul nsw <8 x i8> %a, %b
> > +; GLM-NEXT:  Cost Model: Found an estimated cost of 12 for instruction:
> %res = mul nsw <8 x i8> %a, %b
> >  ; GLM-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret <8 x i8> %res
> >  ;
> >  entry:
> > @@ -216,11 +216,11 @@ entry:
> >
> >  define <2 x i16> @slm-costs_16_v2_mul(<2 x i16> %a, <2 x i16> %b)  {
> >  ; SLM-LABEL: 'slm-costs_16_v2_mul'
> > -; SLM-NEXT:  Cost Model: Found an estimated cost of 17 for instruction:
> %res = mul nsw <2 x i16> %a, %b
> > +; SLM-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %res = mul nsw <2 x i16> %a, %b
> >  ; SLM-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret <2 x i16> %res
> >  ;
> >  ; GLM-LABEL: 'slm-costs_16_v2_mul'
> > -; GLM-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %res = mul nsw <2 x i16> %a, %b
> > +; GLM-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %res = mul nsw <2 x i16> %a, %b
> >  ; GLM-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret <2 x i16> %res
> >  ;
> >  entry:
> > @@ -230,11 +230,11 @@ entry:
> >
> >  define <4 x i16> @slm-costs_16_v4_mul(<4 x i16> %a, <4 x i16> %b)  {
> >  ; SLM-LABEL: 'slm-costs_16_v4_mul'
> > -; SLM-NEXT:  Cost Model: Found an estimated cost of 5 for instruction:
> %res = mul nsw <4 x i16> %a, %b
> > +; SLM-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %res = mul nsw <4 x i16> %a, %b
> >  ; SLM-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret <4 x i16> %res
> >  ;
> >  ; GLM-LABEL: 'slm-costs_16_v4_mul'
> > -; GLM-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %res = mul nsw <4 x i16> %a, %b
> > +; GLM-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %res = mul nsw <4 x i16> %a, %b
> >  ; GLM-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret <4 x i16> %res
> >  ;
> >  entry:
> > @@ -385,11 +385,11 @@ entry:
> >
> >  define <2 x i32> @slm-costs_32_v2_mul(<2 x i32> %a, <2 x i32> %b)  {
> >  ; SLM-LABEL: 'slm-costs_32_v2_mul'
> > -; SLM-NEXT:  Cost Model: Found an estimated cost of 17 for instruction:
> %res = mul nsw <2 x i32> %a, %b
> > +; SLM-NEXT:  Cost Model: Found an estimated cost of 11 for instruction:
> %res = mul nsw <2 x i32> %a, %b
> >  ; SLM-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret <2 x i32> %res
> >  ;
> >  ; GLM-LABEL: 'slm-costs_32_v2_mul'
> > -; GLM-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %res = mul nsw <2 x i32> %a, %b
> > +; GLM-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:
> %res = mul nsw <2 x i32> %a, %b
> >  ; GLM-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret <2 x i32> %res
> >  ;
> >  entry:
> >
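The v2/v4/v8 i8 multiply hunks above now report one number per target (14 on SLM, 12 on GLM), i.e. each sub-128-bit case is costed as a single widened byte multiply. A minimal sketch of one such case, with an illustrative name:

define <4 x i8> @mul_v4i8_example(<4 x i8> %a, <4 x i8> %b) {
  ; Costed as a widened vector multiply, which is why the v2, v4 and v8
  ; i8 cases above all report the same estimate per target.
  %r = mul <4 x i8> %a, %b
  ret <4 x i8> %r
}
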
> > Modified: llvm/trunk/test/Analysis/CostModel/X86/testshiftashr.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/testshiftashr.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/Analysis/CostModel/X86/testshiftashr.ll (original)
> > +++ llvm/trunk/test/Analysis/CostModel/X86/testshiftashr.ll Wed Aug  7
> 09:24:26 2019
> > @@ -5,9 +5,9 @@
> >  define %shifttype @shift2i16(%shifttype %a, %shifttype %b) {
> >  entry:
> >    ; SSE2-LABEL: shift2i16
> > -  ; SSE2: cost of 12 {{.*}} ashr
> > +  ; SSE2: cost of 32 {{.*}} ashr
> >    ; SSE2-CODEGEN-LABEL: shift2i16
> > -  ; SSE2-CODEGEN: psrlq
> > +  ; SSE2-CODEGEN: psraw
> >
> >    %0 = ashr %shifttype %a , %b
> >    ret %shifttype %0
> > @@ -17,9 +17,9 @@ entry:
> >  define %shifttype4i16 @shift4i16(%shifttype4i16 %a, %shifttype4i16 %b) {
> >  entry:
> >    ; SSE2-LABEL: shift4i16
> > -  ; SSE2: cost of 16 {{.*}} ashr
> > +  ; SSE2: cost of 32 {{.*}} ashr
> >    ; SSE2-CODEGEN-LABEL: shift4i16
> > -  ; SSE2-CODEGEN: psrad
> > +  ; SSE2-CODEGEN: psraw
> >
> >    %0 = ashr %shifttype4i16 %a , %b
> >    ret %shifttype4i16 %0
> > @@ -65,9 +65,9 @@ entry:
> >  define %shifttype2i32 @shift2i32(%shifttype2i32 %a, %shifttype2i32 %b) {
> >  entry:
> >    ; SSE2-LABEL: shift2i32
> > -  ; SSE2: cost of 12 {{.*}} ashr
> > +  ; SSE2: cost of 16 {{.*}} ashr
> >    ; SSE2-CODEGEN-LABEL: shift2i32
> > -  ; SSE2-CODEGEN: psrlq
> > +  ; SSE2-CODEGEN: psrad
> >
> >    %0 = ashr %shifttype2i32 %a , %b
> >    ret %shifttype2i32 %0
> > @@ -185,9 +185,9 @@ entry:
> >  define %shifttype2i8 @shift2i8(%shifttype2i8 %a, %shifttype2i8 %b) {
> >  entry:
> >    ; SSE2-LABEL: shift2i8
> > -  ; SSE2: cost of 12 {{.*}} ashr
> > +  ; SSE2: cost of 54 {{.*}} ashr
> >    ; SSE2-CODEGEN-LABEL: shift2i8
> > -  ; SSE2-CODEGEN: psrlq
> > +  ; SSE2-CODEGEN: psrlw
> >
> >    %0 = ashr %shifttype2i8 %a , %b
> >    ret %shifttype2i8 %0
> > @@ -197,9 +197,9 @@ entry:
> >  define %shifttype4i8 @shift4i8(%shifttype4i8 %a, %shifttype4i8 %b) {
> >  entry:
> >    ; SSE2-LABEL: shift4i8
> > -  ; SSE2: cost of 16 {{.*}} ashr
> > +  ; SSE2: cost of 54 {{.*}} ashr
> >    ; SSE2-CODEGEN-LABEL: shift4i8
> > -  ; SSE2-CODEGEN: psrad
> > +  ; SSE2-CODEGEN: psraw
> >
> >    %0 = ashr %shifttype4i8 %a , %b
> >    ret %shifttype4i8 %0
> > @@ -209,7 +209,7 @@ entry:
> >  define %shifttype8i8 @shift8i8(%shifttype8i8 %a, %shifttype8i8 %b) {
> >  entry:
> >    ; SSE2-LABEL: shift8i8
> > -  ; SSE2: cost of 32 {{.*}} ashr
> > +  ; SSE2: cost of 54 {{.*}} ashr
> >    ; SSE2-CODEGEN-LABEL: shift8i8
> >    ; SSE2-CODEGEN: psraw
> >
> > @@ -247,9 +247,9 @@ entry:
> >  define %shifttypec @shift2i16const(%shifttypec %a, %shifttypec %b) {
> >  entry:
> >    ; SSE2-LABEL: shift2i16const
> > -  ; SSE2: cost of 4 {{.*}} ashr
> > +  ; SSE2: cost of 1 {{.*}} ashr
> >    ; SSE2-CODEGEN-LABEL: shift2i16const
> > -  ; SSE2-CODEGEN: psrad $3
> > +  ; SSE2-CODEGEN: psraw $3
> >
> >    %0 = ashr %shifttypec %a , <i16 3, i16 3>
> >    ret %shifttypec %0
> > @@ -261,7 +261,7 @@ entry:
> >    ; SSE2-LABEL: shift4i16const
> >    ; SSE2: cost of 1 {{.*}} ashr
> >    ; SSE2-CODEGEN-LABEL: shift4i16const
> > -  ; SSE2-CODEGEN: psrad $19
> > +  ; SSE2-CODEGEN: psraw $3
> >
> >    %0 = ashr %shifttypec4i16 %a , <i16 3, i16 3, i16 3, i16 3>
> >    ret %shifttypec4i16 %0
> > @@ -320,7 +320,7 @@ entry:
> >  define %shifttypec2i32 @shift2i32c(%shifttypec2i32 %a, %shifttypec2i32
> %b) {
> >  entry:
> >    ; SSE2-LABEL: shift2i32c
> > -  ; SSE2: cost of 4 {{.*}} ashr
> > +  ; SSE2: cost of 1 {{.*}} ashr
> >    ; SSE2-CODEGEN-LABEL: shift2i32c
> >    ; SSE2-CODEGEN: psrad $3
> >
> > @@ -464,7 +464,7 @@ entry:
> >    ; SSE2-LABEL: shift2i8c
> >    ; SSE2: cost of 4 {{.*}} ashr
> >    ; SSE2-CODEGEN-LABEL: shift2i8c
> > -  ; SSE2-CODEGEN: psrad $3
> > +  ; SSE2-CODEGEN: psrlw $3
> >
> >    %0 = ashr %shifttypec2i8 %a , <i8 3, i8 3>
> >    ret %shifttypec2i8 %0
> > @@ -474,9 +474,9 @@ entry:
> >  define %shifttypec4i8 @shift4i8c(%shifttypec4i8 %a, %shifttypec4i8 %b) {
> >  entry:
> >    ; SSE2-LABEL: shift4i8c
> > -  ; SSE2: cost of 1 {{.*}} ashr
> > +  ; SSE2: cost of 4 {{.*}} ashr
> >    ; SSE2-CODEGEN-LABEL: shift4i8c
> > -  ; SSE2-CODEGEN: psrad $27
> > +  ; SSE2-CODEGEN: psrlw $3
> >
> >    %0 = ashr %shifttypec4i8 %a , <i8 3, i8 3, i8 3, i8 3>
> >    ret %shifttypec4i8 %0
> > @@ -486,9 +486,9 @@ entry:
> >  define %shifttypec8i8 @shift8i8c(%shifttypec8i8 %a, %shifttypec8i8 %b) {
> >  entry:
> >    ; SSE2-LABEL: shift8i8c
> > -  ; SSE2: cost of 1 {{.*}} ashr
> > +  ; SSE2: cost of 4 {{.*}} ashr
> >    ; SSE2-CODEGEN-LABEL: shift8i8c
> > -  ; SSE2-CODEGEN: psraw $11
> > +  ; SSE2-CODEGEN: psrlw $3
> >
> >    %0 = ashr %shifttypec8i8 %a , <i8 3, i8 3, i8 3, i8 3,
> >                                   i8 3, i8 3, i8 3, i8 3>
> >
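The variable-shift hunks above keep the element width: the i16 and i8 cases now expect psraw/psrlw-based sequences and <2 x i32> expects psrad, while the uniform-constant i16 cases collapse to a single psraw $3. A distilled version of the <4 x i16> case (name is illustrative):

define <4 x i16> @ashr_v4i16_example(<4 x i16> %a, <4 x i16> %b) {
  ; Widened to v8i16, so SSE2 expands this as a variable 16-bit arithmetic
  ; shift (psraw-based), matching the new "cost of 32" line above.
  %r = ashr <4 x i16> %a, %b
  ret <4 x i16> %r
}
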
> > Modified: llvm/trunk/test/Analysis/CostModel/X86/testshiftlshr.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/testshiftlshr.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/Analysis/CostModel/X86/testshiftlshr.ll (original)
> > +++ llvm/trunk/test/Analysis/CostModel/X86/testshiftlshr.ll Wed Aug  7
> 09:24:26 2019
> > @@ -5,9 +5,9 @@
> >  define %shifttype @shift2i16(%shifttype %a, %shifttype %b) {
> >  entry:
> >    ; SSE2-LABEL: shift2i16
> > -  ; SSE2: cost of 4 {{.*}} lshr
> > +  ; SSE2: cost of 32 {{.*}} lshr
> >    ; SSE2-CODEGEN-LABEL: shift2i16
> > -  ; SSE2-CODEGEN: psrlq
> > +  ; SSE2-CODEGEN: psrlw
> >
> >    %0 = lshr %shifttype %a , %b
> >    ret %shifttype %0
> > @@ -17,9 +17,9 @@ entry:
> >  define %shifttype4i16 @shift4i16(%shifttype4i16 %a, %shifttype4i16 %b) {
> >  entry:
> >    ; SSE2-LABEL: shift4i16
> > -  ; SSE2: cost of 16 {{.*}} lshr
> > +  ; SSE2: cost of 32 {{.*}} lshr
> >    ; SSE2-CODEGEN-LABEL: shift4i16
> > -  ; SSE2-CODEGEN: psrld
> > +  ; SSE2-CODEGEN: psrlw
> >
> >    %0 = lshr %shifttype4i16 %a , %b
> >    ret %shifttype4i16 %0
> > @@ -65,9 +65,9 @@ entry:
> >  define %shifttype2i32 @shift2i32(%shifttype2i32 %a, %shifttype2i32 %b) {
> >  entry:
> >    ; SSE2-LABEL: shift2i32
> > -  ; SSE2: cost of 4 {{.*}} lshr
> > +  ; SSE2: cost of 16 {{.*}} lshr
> >    ; SSE2-CODEGEN-LABEL: shift2i32
> > -  ; SSE2-CODEGEN: psrlq
> > +  ; SSE2-CODEGEN: psrld
> >
> >    %0 = lshr %shifttype2i32 %a , %b
> >    ret %shifttype2i32 %0
> > @@ -185,9 +185,9 @@ entry:
> >  define %shifttype2i8 @shift2i8(%shifttype2i8 %a, %shifttype2i8 %b) {
> >  entry:
> >    ; SSE2-LABEL: shift2i8
> > -  ; SSE2: cost of 4 {{.*}} lshr
> > +  ; SSE2: cost of 26 {{.*}} lshr
> >    ; SSE2-CODEGEN-LABEL: shift2i8
> > -  ; SSE2-CODEGEN: psrlq
> > +  ; SSE2-CODEGEN: psrlw
> >
> >    %0 = lshr %shifttype2i8 %a , %b
> >    ret %shifttype2i8 %0
> > @@ -197,9 +197,9 @@ entry:
> >  define %shifttype4i8 @shift4i8(%shifttype4i8 %a, %shifttype4i8 %b) {
> >  entry:
> >    ; SSE2-LABEL: shift4i8
> > -  ; SSE2: cost of 16 {{.*}} lshr
> > +  ; SSE2: cost of 26 {{.*}} lshr
> >    ; SSE2-CODEGEN-LABEL: shift4i8
> > -  ; SSE2-CODEGEN: psrld
> > +  ; SSE2-CODEGEN: psrlw
> >
> >    %0 = lshr %shifttype4i8 %a , %b
> >    ret %shifttype4i8 %0
> > @@ -209,7 +209,7 @@ entry:
> >  define %shifttype8i8 @shift8i8(%shifttype8i8 %a, %shifttype8i8 %b) {
> >  entry:
> >    ; SSE2-LABEL: shift8i8
> > -  ; SSE2: cost of 32 {{.*}} lshr
> > +  ; SSE2: cost of 26 {{.*}} lshr
> >    ; SSE2-CODEGEN-LABEL: shift8i8
> >    ; SSE2-CODEGEN: psrlw
> >
> > @@ -249,7 +249,7 @@ entry:
> >    ; SSE2-LABEL: shift2i16const
> >    ; SSE2: cost of 1 {{.*}} lshr
> >    ; SSE2-CODEGEN-LABEL: shift2i16const
> > -  ; SSE2-CODEGEN: psrlq $3
> > +  ; SSE2-CODEGEN: psrlw $3
> >
> >    %0 = lshr %shifttypec %a , <i16 3, i16 3>
> >    ret %shifttypec %0
> > @@ -261,7 +261,7 @@ entry:
> >    ; SSE2-LABEL: shift4i16const
> >    ; SSE2: cost of 1 {{.*}} lshr
> >    ; SSE2-CODEGEN-LABEL: shift4i16const
> > -  ; SSE2-CODEGEN: psrld $3
> > +  ; SSE2-CODEGEN: psrlw $3
> >
> >    %0 = lshr %shifttypec4i16 %a , <i16 3, i16 3, i16 3, i16 3>
> >    ret %shifttypec4i16 %0
> > @@ -322,7 +322,7 @@ entry:
> >    ; SSE2-LABEL: shift2i32c
> >    ; SSE2: cost of 1 {{.*}} lshr
> >    ; SSE2-CODEGEN-LABEL: shift2i32c
> > -  ; SSE2-CODEGEN: psrlq $3
> > +  ; SSE2-CODEGEN: psrld $3
> >
> >    %0 = lshr %shifttypec2i32 %a , <i32 3, i32 3>
> >    ret %shifttypec2i32 %0
> > @@ -461,9 +461,9 @@ entry:
> >  define %shifttypec2i8 @shift2i8c(%shifttypec2i8 %a, %shifttypec2i8 %b) {
> >  entry:
> >    ; SSE2-LABEL: shift2i8c
> > -  ; SSE2: cost of 1 {{.*}} lshr
> > +  ; SSE2: cost of 2 {{.*}} lshr
> >    ; SSE2-CODEGEN-LABEL: shift2i8c
> > -  ; SSE2-CODEGEN: psrlq $3
> > +  ; SSE2-CODEGEN: psrlw $3
> >
> >    %0 = lshr %shifttypec2i8 %a , <i8 3, i8 3>
> >    ret %shifttypec2i8 %0
> > @@ -473,9 +473,9 @@ entry:
> >  define %shifttypec4i8 @shift4i8c(%shifttypec4i8 %a, %shifttypec4i8 %b) {
> >  entry:
> >    ; SSE2-LABEL: shift4i8c
> > -  ; SSE2: cost of 1 {{.*}} lshr
> > +  ; SSE2: cost of 2 {{.*}} lshr
> >    ; SSE2-CODEGEN-LABEL: shift4i8c
> > -  ; SSE2-CODEGEN: psrld $3
> > +  ; SSE2-CODEGEN: psrlw $3
> >
> >    %0 = lshr %shifttypec4i8 %a , <i8 3, i8 3, i8 3, i8 3>
> >    ret %shifttypec4i8 %0
> > @@ -485,7 +485,7 @@ entry:
> >  define %shifttypec8i8 @shift8i8c(%shifttypec8i8 %a, %shifttypec8i8 %b) {
> >  entry:
> >    ; SSE2-LABEL: shift8i8c
> > -  ; SSE2: cost of 1 {{.*}} lshr
> > +  ; SSE2: cost of 2 {{.*}} lshr
> >    ; SSE2-CODEGEN-LABEL: shift8i8c
> >    ; SSE2-CODEGEN: psrlw $3
> >
> >
> > Modified: llvm/trunk/test/Analysis/CostModel/X86/testshiftshl.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/testshiftshl.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/Analysis/CostModel/X86/testshiftshl.ll (original)
> > +++ llvm/trunk/test/Analysis/CostModel/X86/testshiftshl.ll Wed Aug  7
> 09:24:26 2019
> > @@ -5,9 +5,9 @@
> >  define %shifttype @shift2i16(%shifttype %a, %shifttype %b) {
> >  entry:
> >    ; SSE2-LABEL: shift2i16
> > -  ; SSE2: cost of 4 {{.*}} shl
> > +  ; SSE2: cost of 32 {{.*}} shl
> >    ; SSE2-CODEGEN-LABEL: shift2i16
> > -  ; SSE2-CODEGEN: psllq
> > +  ; SSE2-CODEGEN: pmullw
> >
> >    %0 = shl %shifttype %a , %b
> >    ret %shifttype %0
> > @@ -17,9 +17,9 @@ entry:
> >  define %shifttype4i16 @shift4i16(%shifttype4i16 %a, %shifttype4i16 %b) {
> >  entry:
> >    ; SSE2-LABEL: shift4i16
> > -  ; SSE2: cost of 10 {{.*}} shl
> > +  ; SSE2: cost of 32 {{.*}} shl
> >    ; SSE2-CODEGEN-LABEL: shift4i16
> > -  ; SSE2-CODEGEN: pmuludq
> > +  ; SSE2-CODEGEN: pmullw
> >
> >    %0 = shl %shifttype4i16 %a , %b
> >    ret %shifttype4i16 %0
> > @@ -65,9 +65,9 @@ entry:
> >  define %shifttype2i32 @shift2i32(%shifttype2i32 %a, %shifttype2i32 %b) {
> >  entry:
> >    ; SSE2-LABEL: shift2i32
> > -  ; SSE2: cost of 4 {{.*}} shl
> > +  ; SSE2: cost of 10 {{.*}} shl
> >    ; SSE2-CODEGEN-LABEL: shift2i32
> > -  ; SSE2-CODEGEN: psllq
> > +  ; SSE2-CODEGEN: pmuludq
> >
> >    %0 = shl %shifttype2i32 %a , %b
> >    ret %shifttype2i32 %0
> > @@ -185,9 +185,9 @@ entry:
> >  define %shifttype2i8 @shift2i8(%shifttype2i8 %a, %shifttype2i8 %b) {
> >  entry:
> >    ; SSE2-LABEL: shift2i8
> > -  ; SSE2: cost of 4 {{.*}} shl
> > +  ; SSE2: cost of 26 {{.*}} shl
> >    ; SSE2-CODEGEN-LABEL: shift2i8
> > -  ; SSE2-CODEGEN: psllq
> > +  ; SSE2-CODEGEN: psllw
> >
> >    %0 = shl %shifttype2i8 %a , %b
> >    ret %shifttype2i8 %0
> > @@ -197,9 +197,9 @@ entry:
> >  define %shifttype4i8 @shift4i8(%shifttype4i8 %a, %shifttype4i8 %b) {
> >  entry:
> >    ; SSE2-LABEL: shift4i8
> > -  ; SSE2: cost of 10 {{.*}} shl
> > +  ; SSE2: cost of 26 {{.*}} shl
> >    ; SSE2-CODEGEN-LABEL: shift4i8
> > -  ; SSE2-CODEGEN: pmuludq
> > +  ; SSE2-CODEGEN: psllw
> >
> >    %0 = shl %shifttype4i8 %a , %b
> >    ret %shifttype4i8 %0
> > @@ -209,9 +209,9 @@ entry:
> >  define %shifttype8i8 @shift8i8(%shifttype8i8 %a, %shifttype8i8 %b) {
> >  entry:
> >    ; SSE2-LABEL: shift8i8
> > -  ; SSE2: cost of 32 {{.*}} shl
> > +  ; SSE2: cost of 26 {{.*}} shl
> >    ; SSE2-CODEGEN-LABEL: shift8i8
> > -  ; SSE2-CODEGEN: pmullw
> > +  ; SSE2-CODEGEN: psllw
> >
> >    %0 = shl %shifttype8i8 %a , %b
> >    ret %shifttype8i8 %0
> > @@ -249,7 +249,7 @@ entry:
> >    ; SSE2-LABEL: shift2i16const
> >    ; SSE2: cost of 1 {{.*}} shl
> >    ; SSE2-CODEGEN-LABEL: shift2i16const
> > -  ; SSE2-CODEGEN: psllq $3
> > +  ; SSE2-CODEGEN: psllw $3
> >
> >    %0 = shl %shifttypec %a , <i16 3, i16 3>
> >    ret %shifttypec %0
> > @@ -261,7 +261,7 @@ entry:
> >    ; SSE2-LABEL: shift4i16const
> >    ; SSE2: cost of 1 {{.*}} shl
> >    ; SSE2-CODEGEN-LABEL: shift4i16const
> > -  ; SSE2-CODEGEN: pslld $3
> > +  ; SSE2-CODEGEN: psllw $3
> >
> >    %0 = shl %shifttypec4i16 %a , <i16 3, i16 3, i16 3, i16 3>
> >    ret %shifttypec4i16 %0
> > @@ -322,7 +322,7 @@ entry:
> >    ; SSE2-LABEL: shift2i32c
> >    ; SSE2: cost of 1 {{.*}} shl
> >    ; SSE2-CODEGEN-LABEL: shift2i32c
> > -  ; SSE2-CODEGEN: psllq $3
> > +  ; SSE2-CODEGEN: pslld $3
> >
> >    %0 = shl %shifttypec2i32 %a , <i32 3, i32 3>
> >    ret %shifttypec2i32 %0
> > @@ -461,9 +461,9 @@ entry:
> >  define %shifttypec2i8 @shift2i8c(%shifttypec2i8 %a, %shifttypec2i8 %b) {
> >  entry:
> >    ; SSE2-LABEL: shift2i8c
> > -  ; SSE2: cost of 1 {{.*}} shl
> > +  ; SSE2: cost of 2 {{.*}} shl
> >    ; SSE2-CODEGEN-LABEL: shift2i8c
> > -  ; SSE2-CODEGEN: psllq $3
> > +  ; SSE2-CODEGEN: psllw $3
> >
> >    %0 = shl %shifttypec2i8 %a , <i8 3, i8 3>
> >    ret %shifttypec2i8 %0
> > @@ -473,9 +473,9 @@ entry:
> >  define %shifttypec4i8 @shift4i8c(%shifttypec4i8 %a, %shifttypec4i8 %b) {
> >  entry:
> >    ; SSE2-LABEL: shift4i8c
> > -  ; SSE2: cost of 1 {{.*}} shl
> > +  ; SSE2: cost of 2 {{.*}} shl
> >    ; SSE2-CODEGEN-LABEL: shift4i8c
> > -  ; SSE2-CODEGEN: pslld $3
> > +  ; SSE2-CODEGEN: psllw $3
> >
> >    %0 = shl %shifttypec4i8 %a , <i8 3, i8 3, i8 3, i8 3>
> >    ret %shifttypec4i8 %0
> > @@ -485,7 +485,7 @@ entry:
> >  define %shifttypec8i8 @shift8i8c(%shifttypec8i8 %a, %shifttypec8i8 %b) {
> >  entry:
> >    ; SSE2-LABEL: shift8i8c
> > -  ; SSE2: cost of 1 {{.*}} shl
> > +  ; SSE2: cost of 2 {{.*}} shl
> >    ; SSE2-CODEGEN-LABEL: shift8i8c
> >    ; SSE2-CODEGEN: psllw $3
> >
> >
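For the uniform-constant byte shifts above, the new cost of 2 and the psllw $3 / psrlw $3 checks are consistent with doing the shift in 16-bit lanes and then masking off the bits that cross byte boundaries. A minimal sketch (name is mine):

define <4 x i8> @shl_v4i8_by3_example(<4 x i8> %a) {
  ; As a widened v16i8, a uniform shift by 3 is done with psllw plus a byte
  ; mask, which lines up with the "cost of 2" lines above.
  %r = shl <4 x i8> %a, <i8 3, i8 3, i8 3, i8 3>
  ret <4 x i8> %r
}
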
> > Modified: llvm/trunk/test/Analysis/CostModel/X86/uitofp.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/uitofp.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/Analysis/CostModel/X86/uitofp.ll (original)
> > +++ llvm/trunk/test/Analysis/CostModel/X86/uitofp.ll Wed Aug  7 09:24:26
> 2019
> > @@ -13,9 +13,9 @@
> >  define i32 @uitofp_i8_double() {
> >  ; SSE-LABEL: 'uitofp_i8_double'
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %cvt_i8_f64 = uitofp i8 undef to double
> > -; SSE-NEXT:  Cost Model: Found an estimated cost of 6 for instruction:
> %cvt_v2i8_v2f64 = uitofp <2 x i8> undef to <2 x double>
> > -; SSE-NEXT:  Cost Model: Found an estimated cost of 40 for instruction:
> %cvt_v4i8_v4f64 = uitofp <4 x i8> undef to <4 x double>
> > -; SSE-NEXT:  Cost Model: Found an estimated cost of 80 for instruction:
> %cvt_v8i8_v8f64 = uitofp <8 x i8> undef to <8 x double>
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 160 for
> instruction: %cvt_v2i8_v2f64 = uitofp <2 x i8> undef to <2 x double>
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 160 for
> instruction: %cvt_v4i8_v4f64 = uitofp <4 x i8> undef to <4 x double>
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 160 for
> instruction: %cvt_v8i8_v8f64 = uitofp <8 x i8> undef to <8 x double>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> >  ; AVX-LABEL: 'uitofp_i8_double'
> > @@ -49,8 +49,8 @@ define i32 @uitofp_i8_double() {
> >  define i32 @uitofp_i16_double() {
> >  ; SSE-LABEL: 'uitofp_i16_double'
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %cvt_i16_f64 = uitofp i16 undef to double
> > -; SSE-NEXT:  Cost Model: Found an estimated cost of 6 for instruction:
> %cvt_v2i16_v2f64 = uitofp <2 x i16> undef to <2 x double>
> > -; SSE-NEXT:  Cost Model: Found an estimated cost of 40 for instruction:
> %cvt_v4i16_v4f64 = uitofp <4 x i16> undef to <4 x double>
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 80 for instruction:
> %cvt_v2i16_v2f64 = uitofp <2 x i16> undef to <2 x double>
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 80 for instruction:
> %cvt_v4i16_v4f64 = uitofp <4 x i16> undef to <4 x double>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 80 for instruction:
> %cvt_v8i16_v8f64 = uitofp <8 x i16> undef to <8 x double>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> > @@ -85,7 +85,7 @@ define i32 @uitofp_i16_double() {
> >  define i32 @uitofp_i32_double() {
> >  ; SSE-LABEL: 'uitofp_i32_double'
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %cvt_i32_f64 = uitofp i32 undef to double
> > -; SSE-NEXT:  Cost Model: Found an estimated cost of 6 for instruction:
> %cvt_v2i32_v2f64 = uitofp <2 x i32> undef to <2 x double>
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 40 for instruction:
> %cvt_v2i32_v2f64 = uitofp <2 x i32> undef to <2 x double>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 40 for instruction:
> %cvt_v4i32_v4f64 = uitofp <4 x i32> undef to <4 x double>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 80 for instruction:
> %cvt_v8i32_v8f64 = uitofp <8 x i32> undef to <8 x double>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> > @@ -165,7 +165,7 @@ define i32 @uitofp_i8_float() {
> >  ; SSE-LABEL: 'uitofp_i8_float'
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %cvt_i8_f32 = uitofp i8 undef to float
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %cvt_v4i8_v4f32 = uitofp <4 x i8> undef to <4 x float>
> > -; SSE-NEXT:  Cost Model: Found an estimated cost of 15 for instruction:
> %cvt_v8i8_v8f32 = uitofp <8 x i8> undef to <8 x float>
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %cvt_v8i8_v8f32 = uitofp <8 x i8> undef to <8 x float>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %cvt_v16i8_v16f32 = uitofp <16 x i8> undef to <16 x float>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >  ;
> > @@ -200,7 +200,7 @@ define i32 @uitofp_i8_float() {
> >  define i32 @uitofp_i16_float() {
> >  ; SSE-LABEL: 'uitofp_i16_float'
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction:
> %cvt_i16_f32 = uitofp i16 undef to float
> > -; SSE-NEXT:  Cost Model: Found an estimated cost of 8 for instruction:
> %cvt_v4i16_v4f32 = uitofp <4 x i16> undef to <4 x float>
> > +; SSE-NEXT:  Cost Model: Found an estimated cost of 15 for instruction:
> %cvt_v4i16_v4f32 = uitofp <4 x i16> undef to <4 x float>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 15 for instruction:
> %cvt_v8i16_v8f32 = uitofp <8 x i16> undef to <8 x float>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 30 for instruction:
> %cvt_v16i16_v16f32 = uitofp <16 x i16> undef to <16 x float>
> >  ; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction:
> ret i32 undef
> >
> > Modified: llvm/trunk/test/CodeGen/X86/2008-09-05-sinttofp-2xi32.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/2008-09-05-sinttofp-2xi32.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/2008-09-05-sinttofp-2xi32.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/2008-09-05-sinttofp-2xi32.ll Wed Aug  7
> 09:24:26 2019
> > @@ -7,7 +7,6 @@
> >  define <2 x double> @a(<2 x i32> %x) nounwind {
> >  ; CHECK-LABEL: a:
> >  ; CHECK:       # %bb.0: # %entry
> > -; CHECK-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> >  ; CHECK-NEXT:    cvtdq2pd %xmm0, %xmm0
> >  ; CHECK-NEXT:    retl
> >  entry:
> > @@ -19,7 +18,6 @@ define <2 x i32> @b(<2 x double> %x) nou
> >  ; CHECK-LABEL: b:
> >  ; CHECK:       # %bb.0: # %entry
> >  ; CHECK-NEXT:    cvttpd2dq %xmm0, %xmm0
> > -; CHECK-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
> >  ; CHECK-NEXT:    retl
> >  entry:
> >    %y = fptosi <2 x double> %x to <2 x i32>
> >
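This is one of the cleanest illustrations of the change: the two i32 elements already sit in the low lanes of the widened register, so the pshufd fix-ups around cvtdq2pd/cvttpd2dq disappear. A distilled version of the sitofp direction (name is illustrative):

define <2 x double> @sitofp_v2i32_example(<2 x i32> %x) {
  ; The <2 x i32> occupies lanes 0 and 1 of an xmm register, so cvtdq2pd
  ; can consume it directly with no preparatory shuffle.
  %y = sitofp <2 x i32> %x to <2 x double>
  ret <2 x double> %y
}
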
> > Modified: llvm/trunk/test/CodeGen/X86/2009-06-05-VZextByteShort.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/2009-06-05-VZextByteShort.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/2009-06-05-VZextByteShort.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/2009-06-05-VZextByteShort.ll Wed Aug  7
> 09:24:26 2019
> > @@ -7,6 +7,7 @@ define <4 x i16> @a(i32* %x1) nounwind {
> >  ; CHECK-NEXT:    movl {{[0-9]+}}(%esp), %eax
> >  ; CHECK-NEXT:    movl (%eax), %eax
> >  ; CHECK-NEXT:    shrl %eax
> > +; CHECK-NEXT:    movzwl %ax, %eax
> >  ; CHECK-NEXT:    movd %eax, %xmm0
> >  ; CHECK-NEXT:    retl
> >
> > @@ -40,7 +41,7 @@ define <8 x i8> @c(i32* %x1) nounwind {
> >  ; CHECK-NEXT:    movl {{[0-9]+}}(%esp), %eax
> >  ; CHECK-NEXT:    movl (%eax), %eax
> >  ; CHECK-NEXT:    shrl %eax
> > -; CHECK-NEXT:    movzwl %ax, %eax
> > +; CHECK-NEXT:    movzbl %al, %eax
> >  ; CHECK-NEXT:    movd %eax, %xmm0
> >  ; CHECK-NEXT:    retl
> >
> >
> > Modified: llvm/trunk/test/CodeGen/X86/2011-10-19-LegelizeLoad.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/2011-10-19-LegelizeLoad.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/2011-10-19-LegelizeLoad.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/2011-10-19-LegelizeLoad.ll Wed Aug  7
> 09:24:26 2019
> > @@ -17,19 +17,23 @@ target triple = "x86_64-unknown-linux-gn
> >  define i32 @main() nounwind uwtable {
> >  ; CHECK-LABEL: main:
> >  ; CHECK:       # %bb.0: # %entry
> > -; CHECK-NEXT:    pmovsxbq {{.*}}(%rip), %xmm0
> > -; CHECK-NEXT:    pmovsxbq {{.*}}(%rip), %xmm1
> > -; CHECK-NEXT:    pextrq $1, %xmm1, %rax
> > -; CHECK-NEXT:    pextrq $1, %xmm0, %rcx
> > -; CHECK-NEXT:    cqto
> > -; CHECK-NEXT:    idivq %rcx
> > -; CHECK-NEXT:    movq %rax, %xmm2
> > -; CHECK-NEXT:    movq %xmm1, %rax
> > -; CHECK-NEXT:    movq %xmm0, %rcx
> > -; CHECK-NEXT:    cqto
> > -; CHECK-NEXT:    idivq %rcx
> > -; CHECK-NEXT:    movq %rax, %xmm0
> > -; CHECK-NEXT:    punpcklbw {{.*#+}} xmm0 =
> xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3],xmm0[4],xmm2[4],xmm0[5],xmm2[5],xmm0[6],xmm2[6],xmm0[7],xmm2[7]
> > +; CHECK-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> > +; CHECK-NEXT:    pextrb $1, %xmm0, %eax
> > +; CHECK-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
> > +; CHECK-NEXT:    pextrb $1, %xmm1, %ecx
> > +; CHECK-NEXT:    # kill: def $al killed $al killed $eax
> > +; CHECK-NEXT:    cbtw
> > +; CHECK-NEXT:    idivb %cl
> > +; CHECK-NEXT:    movl %eax, %ecx
> > +; CHECK-NEXT:    pextrb $0, %xmm0, %eax
> > +; CHECK-NEXT:    # kill: def $al killed $al killed $eax
> > +; CHECK-NEXT:    cbtw
> > +; CHECK-NEXT:    pextrb $0, %xmm1, %edx
> > +; CHECK-NEXT:    idivb %dl
> > +; CHECK-NEXT:    movzbl %cl, %ecx
> > +; CHECK-NEXT:    movzbl %al, %eax
> > +; CHECK-NEXT:    movd %eax, %xmm0
> > +; CHECK-NEXT:    pinsrb $1, %ecx, %xmm0
> >  ; CHECK-NEXT:    pextrw $0, %xmm0, {{.*}}(%rip)
> >  ; CHECK-NEXT:    xorl %eax, %eax
> >  ; CHECK-NEXT:    retq
> >
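The division above is now done per byte (cbtw + idivb on each lane, reassembled with movd/pinsrb) rather than sign-extending to 64-bit lanes and using idivq. A distilled sketch of the operation, with illustrative arguments instead of the test's constant globals:

define <2 x i8> @sdiv_v2i8_example(<2 x i8> %a, <2 x i8> %b) {
  ; Each i8 lane is divided with a scalar idivb and the quotients are
  ; inserted back into a vector, as in the CHECK lines above.
  %q = sdiv <2 x i8> %a, %b
  ret <2 x i8> %q
}
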
> > Modified: llvm/trunk/test/CodeGen/X86/2011-12-28-vselecti8.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/2011-12-28-vselecti8.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/2011-12-28-vselecti8.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/2011-12-28-vselecti8.ll Wed Aug  7
> 09:24:26 2019
> > @@ -18,10 +18,11 @@ target triple = "x86_64-apple-darwin11.2
> >  define void @foo8(float* nocapture %RET) nounwind {
> >  ; CHECK-LABEL: foo8:
> >  ; CHECK:       ## %bb.0: ## %allocas
> > -; CHECK-NEXT:    movaps {{.*#+}} xmm0 = [1.0E+2,2.0E+0,1.0E+2,4.0E+0]
> > -; CHECK-NEXT:    movaps {{.*#+}} xmm1 = [1.0E+2,6.0E+0,1.0E+2,8.0E+0]
> > -; CHECK-NEXT:    movups %xmm1, 16(%rdi)
> > -; CHECK-NEXT:    movups %xmm0, (%rdi)
> > +; CHECK-NEXT:    pmovzxbd {{.*#+}} xmm0 =
> mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
> > +; CHECK-NEXT:    cvtdq2ps %xmm0, %xmm0
> > +; CHECK-NEXT:    movaps {{.*#+}} xmm1 = [1.0E+2,2.0E+0,1.0E+2,4.0E+0]
> > +; CHECK-NEXT:    movups %xmm1, (%rdi)
> > +; CHECK-NEXT:    movups %xmm0, 16(%rdi)
> >  ; CHECK-NEXT:    retq
> >  allocas:
> >    %resultvec.i = select <8 x i1> <i1 false, i1 true, i1 false, i1 true,
> i1 false, i1 true, i1 false, i1 true>, <8 x i8> <i8 1, i8 2, i8 3, i8 4, i8
> 5, i8 6, i8 7, i8 8>, <8 x i8> <i8 100, i8 100, i8 100, i8 100, i8 100, i8
> 100, i8 100, i8 100>
> >
> > Modified: llvm/trunk/test/CodeGen/X86/2011-12-8-bitcastintprom.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/2011-12-8-bitcastintprom.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/2011-12-8-bitcastintprom.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/2011-12-8-bitcastintprom.ll Wed Aug  7
> 09:24:26 2019
> > @@ -6,16 +6,12 @@
> >  define void @prom_bug(<4 x i8> %t, i16* %p) {
> >  ; SSE2-LABEL: prom_bug:
> >  ; SSE2:       ## %bb.0:
> > -; SSE2-NEXT:    pand {{.*}}(%rip), %xmm0
> > -; SSE2-NEXT:    packuswb %xmm0, %xmm0
> > -; SSE2-NEXT:    packuswb %xmm0, %xmm0
> > -; SSE2-NEXT:    pextrw $0, %xmm0, %eax
> > +; SSE2-NEXT:    movd %xmm0, %eax
> >  ; SSE2-NEXT:    movw %ax, (%rdi)
> >  ; SSE2-NEXT:    retq
> >  ;
> >  ; SSE41-LABEL: prom_bug:
> >  ; SSE41:       ## %bb.0:
> > -; SSE41-NEXT:    pshufb {{.*#+}} xmm0 =
> xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
> >  ; SSE41-NEXT:    pextrw $0, %xmm0, (%rdi)
> >  ; SSE41-NEXT:    retq
> >    %r = bitcast <4 x i8> %t to <2 x i16>
> >
> > Modified: llvm/trunk/test/CodeGen/X86/2012-01-18-vbitcast.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/2012-01-18-vbitcast.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/2012-01-18-vbitcast.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/2012-01-18-vbitcast.ll Wed Aug  7
> 09:24:26 2019
> > @@ -4,9 +4,8 @@
> >  define <2 x i32> @vcast(<2 x float> %a, <2 x float> %b) {
> >  ; CHECK-LABEL: vcast:
> >  ; CHECK:       # %bb.0:
> > -; CHECK-NEXT:    pmovzxdq {{.*#+}} xmm0 = mem[0],zero,mem[1],zero
> > -; CHECK-NEXT:    pmovzxdq {{.*#+}} xmm1 = mem[0],zero,mem[1],zero
> > -; CHECK-NEXT:    psubq %xmm1, %xmm0
> > +; CHECK-NEXT:    movdqa (%rcx), %xmm0
> > +; CHECK-NEXT:    psubd (%rdx), %xmm0
> >  ; CHECK-NEXT:    retq
> >    %af = bitcast <2 x float> %a to <2 x i32>
> >    %bf = bitcast <2 x float> %b to <2 x i32>
> >
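The <2 x i32> subtraction above stays in 32-bit lanes: instead of zero-extending both operands to 64-bit lanes (pmovzxdq + psubq), it is now a psubd, with one operand folded from memory because of how that target passes the vectors. Distilled (name is mine):

define <2 x i32> @sub_v2i32_example(<2 x i32> %a, <2 x i32> %b) {
  ; No promotion to 64-bit lanes; the subtraction is a plain psubd on the
  ; low two lanes of the widened vectors.
  %d = sub <2 x i32> %a, %b
  ret <2 x i32> %d
}
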
> > Modified: llvm/trunk/test/CodeGen/X86/2012-03-15-build_vector_wl.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/2012-03-15-build_vector_wl.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/2012-03-15-build_vector_wl.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/2012-03-15-build_vector_wl.ll Wed Aug  7
> 09:24:26 2019
> > @@ -4,7 +4,6 @@
> >  define <4 x i8> @build_vector_again(<16 x i8> %in) nounwind readnone {
> >  ; CHECK-LABEL: build_vector_again:
> >  ; CHECK:       ## %bb.0: ## %entry
> > -; CHECK-NEXT:    vpmovzxbd {{.*#+}} xmm0 =
> xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero
> >  ; CHECK-NEXT:    retq
> >  entry:
> >    %out = shufflevector <16 x i8> %in, <16 x i8> undef, <4 x i32> <i32
> 0, i32 1, i32 2, i32 3>
> >
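Taking the low four bytes of a <16 x i8> as a <4 x i8> is now a no-op in the register, which is why the vpmovzxbd disappears above. A distilled version (name is illustrative):

define <4 x i8> @low_four_bytes_example(<16 x i8> %in) {
  ; The result is just the low four lanes of %in, which are already in
  ; place, so no shuffle or extension is emitted.
  %out = shufflevector <16 x i8> %in, <16 x i8> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  ret <4 x i8> %out
}
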
> > Modified: llvm/trunk/test/CodeGen/X86/2012-07-10-extload64.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/2012-07-10-extload64.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/2012-07-10-extload64.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/2012-07-10-extload64.ll Wed Aug  7
> 09:24:26 2019
> > @@ -33,7 +33,7 @@ define <2 x i32> @load_64(<2 x i32>* %pt
> >  ; CHECK-LABEL: load_64:
> >  ; CHECK:       # %bb.0: # %BB
> >  ; CHECK-NEXT:    movl {{[0-9]+}}(%esp), %eax
> > -; CHECK-NEXT:    pmovzxdq {{.*#+}} xmm0 = mem[0],zero,mem[1],zero
> > +; CHECK-NEXT:    movsd {{.*#+}} xmm0 = mem[0],zero
> >  ; CHECK-NEXT:    retl
> >  BB:
> >    %t = load <2 x i32>, <2 x i32>* %ptr
> >
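Loading a <2 x i32> is now a plain 8-byte load into the low half of an xmm register (movsd) rather than an extending pmovzxdq load. Distilled, with an illustrative name:

define <2 x i32> @load_v2i32_example(<2 x i32>* %p) {
  ; One 64-bit load fills the low two i32 lanes; no zero-extension to
  ; 64-bit lanes is needed any more.
  %v = load <2 x i32>, <2 x i32>* %p
  ret <2 x i32> %v
}
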
> > Modified: llvm/trunk/test/CodeGen/X86/3dnow-intrinsics.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/3dnow-intrinsics.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/3dnow-intrinsics.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/3dnow-intrinsics.ll Wed Aug  7 09:24:26
> 2019
> > @@ -14,8 +14,7 @@ define <8 x i8> @test_pavgusb(x86_mmx %a
> >  ; X64:       # %bb.0: # %entry
> >  ; X64-NEXT:    pavgusb %mm1, %mm0
> >  ; X64-NEXT:    movq %mm0, -{{[0-9]+}}(%rsp)
> > -; X64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> > -; X64-NEXT:    punpcklbw {{.*#+}} xmm0 =
> xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
> > +; X64-NEXT:    movaps -{{[0-9]+}}(%rsp), %xmm0
> >  ; X64-NEXT:    retq
> >  entry:
> >    %0 = bitcast x86_mmx %a.coerce to <8 x i8>
> > @@ -52,8 +51,7 @@ define <2 x i32> @test_pf2id(<2 x float>
> >  ; X64-NEXT:    movdq2q %xmm0, %mm0
> >  ; X64-NEXT:    pf2id %mm0, %mm0
> >  ; X64-NEXT:    movq %mm0, -{{[0-9]+}}(%rsp)
> > -; X64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> > -; X64-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
> > +; X64-NEXT:    movaps -{{[0-9]+}}(%rsp), %xmm0
> >  ; X64-NEXT:    retq
> >  entry:
> >    %0 = bitcast <2 x float> %a to x86_mmx
> > @@ -169,8 +167,7 @@ define <2 x i32> @test_pfcmpeq(<2 x floa
> >  ; X64-NEXT:    movdq2q %xmm0, %mm1
> >  ; X64-NEXT:    pfcmpeq %mm0, %mm1
> >  ; X64-NEXT:    movq %mm1, -{{[0-9]+}}(%rsp)
> > -; X64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> > -; X64-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
> > +; X64-NEXT:    movaps -{{[0-9]+}}(%rsp), %xmm0
> >  ; X64-NEXT:    retq
> >  entry:
> >    %0 = bitcast <2 x float> %a to x86_mmx
> > @@ -209,8 +206,7 @@ define <2 x i32> @test_pfcmpge(<2 x floa
> >  ; X64-NEXT:    movdq2q %xmm0, %mm1
> >  ; X64-NEXT:    pfcmpge %mm0, %mm1
> >  ; X64-NEXT:    movq %mm1, -{{[0-9]+}}(%rsp)
> > -; X64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> > -; X64-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
> > +; X64-NEXT:    movaps -{{[0-9]+}}(%rsp), %xmm0
> >  ; X64-NEXT:    retq
> >  entry:
> >    %0 = bitcast <2 x float> %a to x86_mmx
> > @@ -249,8 +245,7 @@ define <2 x i32> @test_pfcmpgt(<2 x floa
> >  ; X64-NEXT:    movdq2q %xmm0, %mm1
> >  ; X64-NEXT:    pfcmpgt %mm0, %mm1
> >  ; X64-NEXT:    movq %mm1, -{{[0-9]+}}(%rsp)
> > -; X64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> > -; X64-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
> > +; X64-NEXT:    movaps -{{[0-9]+}}(%rsp), %xmm0
> >  ; X64-NEXT:    retq
> >  entry:
> >    %0 = bitcast <2 x float> %a to x86_mmx
> > @@ -685,8 +680,7 @@ define <4 x i16> @test_pmulhrw(x86_mmx %
> >  ; X64:       # %bb.0: # %entry
> >  ; X64-NEXT:    pmulhrw %mm1, %mm0
> >  ; X64-NEXT:    movq %mm0, -{{[0-9]+}}(%rsp)
> > -; X64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> > -; X64-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3]
> > +; X64-NEXT:    movaps -{{[0-9]+}}(%rsp), %xmm0
> >  ; X64-NEXT:    retq
> >  entry:
> >    %0 = bitcast x86_mmx %a.coerce to <4 x i16>
> > @@ -723,8 +717,7 @@ define <2 x i32> @test_pf2iw(<2 x float>
> >  ; X64-NEXT:    movdq2q %xmm0, %mm0
> >  ; X64-NEXT:    pf2iw %mm0, %mm0
> >  ; X64-NEXT:    movq %mm0, -{{[0-9]+}}(%rsp)
> > -; X64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> > -; X64-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
> > +; X64-NEXT:    movaps -{{[0-9]+}}(%rsp), %xmm0
> >  ; X64-NEXT:    retq
> >  entry:
> >    %0 = bitcast <2 x float> %a to x86_mmx
> > @@ -896,12 +889,10 @@ define <2 x i32> @test_pswapdsi(<2 x i32
> >  ;
> >  ; X64-LABEL: test_pswapdsi:
> >  ; X64:       # %bb.0: # %entry
> > -; X64-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> >  ; X64-NEXT:    movdq2q %xmm0, %mm0
> >  ; X64-NEXT:    pswapd %mm0, %mm0 # mm0 = mm0[1,0]
> >  ; X64-NEXT:    movq %mm0, -{{[0-9]+}}(%rsp)
> > -; X64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> > -; X64-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
> > +; X64-NEXT:    movaps -{{[0-9]+}}(%rsp), %xmm0
> >  ; X64-NEXT:    retq
> >  entry:
> >    %0 = bitcast <2 x i32> %a to x86_mmx
> >
> > Modified: llvm/trunk/test/CodeGen/X86/4char-promote.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/4char-promote.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/4char-promote.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/4char-promote.ll Wed Aug  7 09:24:26 2019
> > @@ -7,8 +7,11 @@ target triple = "x86_64-apple-darwin"
> >  define <4 x i8> @foo(<4 x i8> %x, <4 x i8> %y) {
> >  ; CHECK-LABEL: foo:
> >  ; CHECK:       ## %bb.0: ## %entry
> > -; CHECK-NEXT:    pmulld %xmm0, %xmm1
> > -; CHECK-NEXT:    paddd %xmm1, %xmm0
> > +; CHECK-NEXT:    pmovzxbw {{.*#+}} xmm1 =
> xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero
> > +; CHECK-NEXT:    pmovzxbw {{.*#+}} xmm2 =
> xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
> > +; CHECK-NEXT:    pmullw %xmm1, %xmm2
> > +; CHECK-NEXT:    pshufb {{.*#+}} xmm2 =
> xmm2[0,2,4,6,u,u,u,u,u,u,u,u,u,u,u,u]
> > +; CHECK-NEXT:    paddb %xmm2, %xmm0
> >  ; CHECK-NEXT:    retq
> >  entry:
> >   %binop = mul <4 x i8> %x, %y
> >
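The <4 x i8> multiply above is no longer done as a <4 x i32> pmulld; it is widened, multiplied in 16-bit lanes (pmovzxbw + pmullw), shuffled back down to bytes, and the following add stays a paddb. A distilled sketch (name is mine):

define <4 x i8> @mul_add_v4i8_example(<4 x i8> %x, <4 x i8> %y) {
  ; The product is computed in 16-bit lanes and truncated back to bytes
  ; before the byte-wise add.
  %m = mul <4 x i8> %x, %y
  %s = add <4 x i8> %m, %x
  ret <4 x i8> %s
}
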
> > Modified: llvm/trunk/test/CodeGen/X86/and-load-fold.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/and-load-fold.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/and-load-fold.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/and-load-fold.ll Wed Aug  7 09:24:26 2019
> > @@ -6,10 +6,8 @@
> >  define i8 @foo(<4 x i8>* %V) {
> >  ; CHECK-LABEL: foo:
> >  ; CHECK:       # %bb.0:
> > -; CHECK-NEXT:    movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
> > -; CHECK-NEXT:    pextrw $1, %xmm0, %eax
> > +; CHECK-NEXT:    movb 2(%rdi), %al
> >  ; CHECK-NEXT:    andb $95, %al
> > -; CHECK-NEXT:    # kill: def $al killed $al killed $eax
> >  ; CHECK-NEXT:    retq
> >    %Vp = bitcast <4 x i8>* %V to <3 x i8>*
> >    %V3i8 = load <3 x i8>, <3 x i8>* %Vp, align 4
> >
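Extracting one byte of the loaded <3 x i8> and masking it now folds into a direct byte load (movb 2(%rdi)) plus a scalar andb, with no vector load or pextrw. A distilled version under that reading (names are illustrative):

define i8 @byte_extract_and_example(<4 x i8>* %p) {
  ; Element 2 of the <3 x i8> load is read straight from memory at offset 2
  ; and masked as a scalar.
  %q = bitcast <4 x i8>* %p to <3 x i8>*
  %v = load <3 x i8>, <3 x i8>* %q, align 4
  %e = extractelement <3 x i8> %v, i32 2
  %r = and i8 %e, 95
  ret i8 %r
}
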
> > Modified: llvm/trunk/test/CodeGen/X86/atomic-unordered.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/atomic-unordered.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/atomic-unordered.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/atomic-unordered.ll Wed Aug  7 09:24:26
> 2019
> > @@ -460,7 +460,7 @@ define void @vec_store(i32* %p0, <2 x i3
> >  ; CHECK-O0-LABEL: vec_store:
> >  ; CHECK-O0:       # %bb.0:
> >  ; CHECK-O0-NEXT:    vmovd %xmm0, %eax
> > -; CHECK-O0-NEXT:    vpextrd $2, %xmm0, %ecx
> > +; CHECK-O0-NEXT:    vpextrd $1, %xmm0, %ecx
> >  ; CHECK-O0-NEXT:    movl %eax, (%rdi)
> >  ; CHECK-O0-NEXT:    movl %ecx, 4(%rdi)
> >  ; CHECK-O0-NEXT:    retq
> > @@ -468,7 +468,7 @@ define void @vec_store(i32* %p0, <2 x i3
> >  ; CHECK-O3-LABEL: vec_store:
> >  ; CHECK-O3:       # %bb.0:
> >  ; CHECK-O3-NEXT:    vmovd %xmm0, %eax
> > -; CHECK-O3-NEXT:    vpextrd $2, %xmm0, %ecx
> > +; CHECK-O3-NEXT:    vpextrd $1, %xmm0, %ecx
> >  ; CHECK-O3-NEXT:    movl %eax, (%rdi)
> >  ; CHECK-O3-NEXT:    movl %ecx, 4(%rdi)
> >  ; CHECK-O3-NEXT:    retq
> > @@ -485,7 +485,7 @@ define void @vec_store_unaligned(i32* %p
> >  ; CHECK-O0-LABEL: vec_store_unaligned:
> >  ; CHECK-O0:       # %bb.0:
> >  ; CHECK-O0-NEXT:    vmovd %xmm0, %eax
> > -; CHECK-O0-NEXT:    vpextrd $2, %xmm0, %ecx
> > +; CHECK-O0-NEXT:    vpextrd $1, %xmm0, %ecx
> >  ; CHECK-O0-NEXT:    movl %eax, (%rdi)
> >  ; CHECK-O0-NEXT:    movl %ecx, 4(%rdi)
> >  ; CHECK-O0-NEXT:    retq
> > @@ -493,7 +493,7 @@ define void @vec_store_unaligned(i32* %p
> >  ; CHECK-O3-LABEL: vec_store_unaligned:
> >  ; CHECK-O3:       # %bb.0:
> >  ; CHECK-O3-NEXT:    vmovd %xmm0, %eax
> > -; CHECK-O3-NEXT:    vpextrd $2, %xmm0, %ecx
> > +; CHECK-O3-NEXT:    vpextrd $1, %xmm0, %ecx
> >  ; CHECK-O3-NEXT:    movl %eax, (%rdi)
> >  ; CHECK-O3-NEXT:    movl %ecx, 4(%rdi)
> >  ; CHECK-O3-NEXT:    retq
> >
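The second element of the <2 x i32> now lives in lane 1 of the register, so the store sequence reads it with vpextrd $1 instead of vpextrd $2. A distilled sketch of such a store, assuming the unordered-atomic shape used elsewhere in that file (names are mine):

define void @vec_store_example(i32* %p, <2 x i32> %v) {
  ; Element 1 is extracted from lane 1 before the two 32-bit stores.
  %e0 = extractelement <2 x i32> %v, i32 0
  %e1 = extractelement <2 x i32> %v, i32 1
  %p1 = getelementptr i32, i32* %p, i64 1
  store atomic i32 %e0, i32* %p unordered, align 4
  store atomic i32 %e1, i32* %p1 unordered, align 4
  ret void
}
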
> > Modified: llvm/trunk/test/CodeGen/X86/avg.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/avg.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/avg.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/avg.ll Wed Aug  7 09:24:26 2019
> > @@ -378,63 +378,65 @@ define void @avg_v48i8(<48 x i8>* %a, <4
> >  ; AVX2-LABEL: avg_v48i8:
> >  ; AVX2:       # %bb.0:
> >  ; AVX2-NEXT:    vmovdqa (%rdi), %xmm0
> > -; AVX2-NEXT:    vmovdqa 32(%rdi), %xmm1
> > -; AVX2-NEXT:    vpshufd {{.*#+}} xmm2 = xmm0[2,3,0,1]
> > -; AVX2-NEXT:    vpmovzxbd {{.*#+}} ymm2 =
> xmm2[0],zero,zero,zero,xmm2[1],zero,zero,zero,xmm2[2],zero,zero,zero,xmm2[3],zero,zero,zero,xmm2[4],zero,zero,zero,xmm2[5],zero,zero,zero,xmm2[6],zero,zero,zero,xmm2[7],zero,zero,zero
> > -; AVX2-NEXT:    vpmovzxbd {{.*#+}} ymm0 =
> xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero
> > -; AVX2-NEXT:    vpbroadcastq 24(%rdi), %xmm3
> > +; AVX2-NEXT:    vmovdqa 16(%rdi), %xmm1
> > +; AVX2-NEXT:    vmovdqa 32(%rdi), %xmm2
> > +; AVX2-NEXT:    vpshufd {{.*#+}} xmm3 = xmm0[2,3,0,1]
> >  ; AVX2-NEXT:    vpmovzxbd {{.*#+}} ymm3 =
> xmm3[0],zero,zero,zero,xmm3[1],zero,zero,zero,xmm3[2],zero,zero,zero,xmm3[3],zero,zero,zero,xmm3[4],zero,zero,zero,xmm3[5],zero,zero,zero,xmm3[6],zero,zero,zero,xmm3[7],zero,zero,zero
> > -; AVX2-NEXT:    vpmovzxbd {{.*#+}} ymm4 =
> mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
> > -; AVX2-NEXT:    vpshufd {{.*#+}} xmm5 = xmm1[2,3,0,1]
> > -; AVX2-NEXT:    vpmovzxbd {{.*#+}} ymm5 =
> xmm5[0],zero,zero,zero,xmm5[1],zero,zero,zero,xmm5[2],zero,zero,zero,xmm5[3],zero,zero,zero,xmm5[4],zero,zero,zero,xmm5[5],zero,zero,zero,xmm5[6],zero,zero,zero,xmm5[7],zero,zero,zero
> > -; AVX2-NEXT:    vpmovzxbd {{.*#+}} ymm8 =
> xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero,xmm1[4],zero,zero,zero,xmm1[5],zero,zero,zero,xmm1[6],zero,zero,zero,xmm1[7],zero,zero,zero
> > -; AVX2-NEXT:    vmovdqa (%rsi), %xmm6
> > -; AVX2-NEXT:    vmovdqa 32(%rsi), %xmm7
> > -; AVX2-NEXT:    vpshufd {{.*#+}} xmm1 = xmm6[2,3,0,1]
> > -; AVX2-NEXT:    vpmovzxbd {{.*#+}} ymm1 =
> xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero,xmm1[4],zero,zero,zero,xmm1[5],zero,zero,zero,xmm1[6],zero,zero,zero,xmm1[7],zero,zero,zero
> > -; AVX2-NEXT:    vpaddd %ymm1, %ymm2, %ymm1
> > -; AVX2-NEXT:    vpmovzxbd {{.*#+}} ymm2 =
> xmm6[0],zero,zero,zero,xmm6[1],zero,zero,zero,xmm6[2],zero,zero,zero,xmm6[3],zero,zero,zero,xmm6[4],zero,zero,zero,xmm6[5],zero,zero,zero,xmm6[6],zero,zero,zero,xmm6[7],zero,zero,zero
> > -; AVX2-NEXT:    vpaddd %ymm2, %ymm0, %ymm0
> > -; AVX2-NEXT:    vpbroadcastq 24(%rsi), %xmm2
> > -; AVX2-NEXT:    vpmovzxbd {{.*#+}} ymm2 =
> xmm2[0],zero,zero,zero,xmm2[1],zero,zero,zero,xmm2[2],zero,zero,zero,xmm2[3],zero,zero,zero,xmm2[4],zero,zero,zero,xmm2[5],zero,zero,zero,xmm2[6],zero,zero,zero,xmm2[7],zero,zero,zero
> > -; AVX2-NEXT:    vpaddd %ymm2, %ymm3, %ymm2
> > -; AVX2-NEXT:    vpmovzxbd {{.*#+}} ymm3 =
> mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
> > -; AVX2-NEXT:    vpaddd %ymm3, %ymm4, %ymm3
> > -; AVX2-NEXT:    vpshufd {{.*#+}} xmm4 = xmm7[2,3,0,1]
> > +; AVX2-NEXT:    vpmovzxbd {{.*#+}} ymm0 =
> xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero
> > +; AVX2-NEXT:    vpshufd {{.*#+}} xmm4 = xmm1[2,3,0,1]
> >  ; AVX2-NEXT:    vpmovzxbd {{.*#+}} ymm4 =
> xmm4[0],zero,zero,zero,xmm4[1],zero,zero,zero,xmm4[2],zero,zero,zero,xmm4[3],zero,zero,zero,xmm4[4],zero,zero,zero,xmm4[5],zero,zero,zero,xmm4[6],zero,zero,zero,xmm4[7],zero,zero,zero
> > -; AVX2-NEXT:    vpaddd %ymm4, %ymm5, %ymm4
> > +; AVX2-NEXT:    vpmovzxbd {{.*#+}} ymm1 =
> xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero,xmm1[4],zero,zero,zero,xmm1[5],zero,zero,zero,xmm1[6],zero,zero,zero,xmm1[7],zero,zero,zero
> > +; AVX2-NEXT:    vpshufd {{.*#+}} xmm5 = xmm2[2,3,0,1]
> > +; AVX2-NEXT:    vpmovzxbd {{.*#+}} ymm9 =
> xmm5[0],zero,zero,zero,xmm5[1],zero,zero,zero,xmm5[2],zero,zero,zero,xmm5[3],zero,zero,zero,xmm5[4],zero,zero,zero,xmm5[5],zero,zero,zero,xmm5[6],zero,zero,zero,xmm5[7],zero,zero,zero
> > +; AVX2-NEXT:    vpmovzxbd {{.*#+}} ymm8 =
> xmm2[0],zero,zero,zero,xmm2[1],zero,zero,zero,xmm2[2],zero,zero,zero,xmm2[3],zero,zero,zero,xmm2[4],zero,zero,zero,xmm2[5],zero,zero,zero,xmm2[6],zero,zero,zero,xmm2[7],zero,zero,zero
> > +; AVX2-NEXT:    vmovdqa (%rsi), %xmm6
> > +; AVX2-NEXT:    vmovdqa 16(%rsi), %xmm7
> > +; AVX2-NEXT:    vmovdqa 32(%rsi), %xmm2
> > +; AVX2-NEXT:    vpshufd {{.*#+}} xmm5 = xmm6[2,3,0,1]
> > +; AVX2-NEXT:    vpmovzxbd {{.*#+}} ymm5 =
> xmm5[0],zero,zero,zero,xmm5[1],zero,zero,zero,xmm5[2],zero,zero,zero,xmm5[3],zero,zero,zero,xmm5[4],zero,zero,zero,xmm5[5],zero,zero,zero,xmm5[6],zero,zero,zero,xmm5[7],zero,zero,zero
> > +; AVX2-NEXT:    vpaddd %ymm5, %ymm3, %ymm3
> > +; AVX2-NEXT:    vpmovzxbd {{.*#+}} ymm5 =
> xmm6[0],zero,zero,zero,xmm6[1],zero,zero,zero,xmm6[2],zero,zero,zero,xmm6[3],zero,zero,zero,xmm6[4],zero,zero,zero,xmm6[5],zero,zero,zero,xmm6[6],zero,zero,zero,xmm6[7],zero,zero,zero
> > +; AVX2-NEXT:    vpaddd %ymm5, %ymm0, %ymm0
> > +; AVX2-NEXT:    vpshufd {{.*#+}} xmm5 = xmm7[2,3,0,1]
> > +; AVX2-NEXT:    vpmovzxbd {{.*#+}} ymm5 =
> xmm5[0],zero,zero,zero,xmm5[1],zero,zero,zero,xmm5[2],zero,zero,zero,xmm5[3],zero,zero,zero,xmm5[4],zero,zero,zero,xmm5[5],zero,zero,zero,xmm5[6],zero,zero,zero,xmm5[7],zero,zero,zero
> > +; AVX2-NEXT:    vpaddd %ymm5, %ymm4, %ymm4
> >  ; AVX2-NEXT:    vpmovzxbd {{.*#+}} ymm5 =
> xmm7[0],zero,zero,zero,xmm7[1],zero,zero,zero,xmm7[2],zero,zero,zero,xmm7[3],zero,zero,zero,xmm7[4],zero,zero,zero,xmm7[5],zero,zero,zero,xmm7[6],zero,zero,zero,xmm7[7],zero,zero,zero
> > -; AVX2-NEXT:    vpaddd %ymm5, %ymm8, %ymm5
> > +; AVX2-NEXT:    vpaddd %ymm5, %ymm1, %ymm1
> > +; AVX2-NEXT:    vpshufd {{.*#+}} xmm5 = xmm2[2,3,0,1]
> > +; AVX2-NEXT:    vpmovzxbd {{.*#+}} ymm5 =
> xmm5[0],zero,zero,zero,xmm5[1],zero,zero,zero,xmm5[2],zero,zero,zero,xmm5[3],zero,zero,zero,xmm5[4],zero,zero,zero,xmm5[5],zero,zero,zero,xmm5[6],zero,zero,zero,xmm5[7],zero,zero,zero
> > +; AVX2-NEXT:    vpaddd %ymm5, %ymm9, %ymm5
> > +; AVX2-NEXT:    vpmovzxbd {{.*#+}} ymm2 =
> xmm2[0],zero,zero,zero,xmm2[1],zero,zero,zero,xmm2[2],zero,zero,zero,xmm2[3],zero,zero,zero,xmm2[4],zero,zero,zero,xmm2[5],zero,zero,zero,xmm2[6],zero,zero,zero,xmm2[7],zero,zero,zero
> > +; AVX2-NEXT:    vpaddd %ymm2, %ymm8, %ymm2
> >  ; AVX2-NEXT:    vpcmpeqd %ymm6, %ymm6, %ymm6
> > -; AVX2-NEXT:    vpsubd %ymm6, %ymm1, %ymm1
> > -; AVX2-NEXT:    vpsubd %ymm6, %ymm0, %ymm0
> > -; AVX2-NEXT:    vpsubd %ymm6, %ymm2, %ymm2
> >  ; AVX2-NEXT:    vpsubd %ymm6, %ymm3, %ymm3
> > +; AVX2-NEXT:    vpsubd %ymm6, %ymm0, %ymm0
> >  ; AVX2-NEXT:    vpsubd %ymm6, %ymm4, %ymm4
> > +; AVX2-NEXT:    vpsubd %ymm6, %ymm1, %ymm1
> >  ; AVX2-NEXT:    vpsubd %ymm6, %ymm5, %ymm5
> > +; AVX2-NEXT:    vpsubd %ymm6, %ymm2, %ymm2
> > +; AVX2-NEXT:    vpsrld $1, %ymm2, %ymm2
> >  ; AVX2-NEXT:    vpsrld $1, %ymm5, %ymm5
> > +; AVX2-NEXT:    vpsrld $1, %ymm1, %ymm1
> >  ; AVX2-NEXT:    vpsrld $1, %ymm4, %ymm4
> > -; AVX2-NEXT:    vpsrld $1, %ymm3, %ymm3
> > -; AVX2-NEXT:    vpsrld $1, %ymm2, %ymm2
> >  ; AVX2-NEXT:    vpsrld $1, %ymm0, %ymm0
> > -; AVX2-NEXT:    vpsrld $1, %ymm1, %ymm1
> > -; AVX2-NEXT:    vperm2i128 {{.*#+}} ymm6 = ymm0[2,3],ymm1[2,3]
> > -; AVX2-NEXT:    vinserti128 $1, %xmm1, %ymm0, %ymm0
> > +; AVX2-NEXT:    vpsrld $1, %ymm3, %ymm3
> > +; AVX2-NEXT:    vperm2i128 {{.*#+}} ymm6 = ymm0[2,3],ymm3[2,3]
> > +; AVX2-NEXT:    vinserti128 $1, %xmm3, %ymm0, %ymm0
> >  ; AVX2-NEXT:    vpackusdw %ymm6, %ymm0, %ymm0
> > -; AVX2-NEXT:    vmovdqa {{.*#+}} ymm1 =
> [255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255]
> > -; AVX2-NEXT:    vpand %ymm1, %ymm0, %ymm0
> > -; AVX2-NEXT:    vperm2i128 {{.*#+}} ymm6 = ymm3[2,3],ymm2[2,3]
> > -; AVX2-NEXT:    vinserti128 $1, %xmm2, %ymm3, %ymm2
> > -; AVX2-NEXT:    vpackusdw %ymm6, %ymm2, %ymm2
> > -; AVX2-NEXT:    vpand %ymm1, %ymm2, %ymm2
> > -; AVX2-NEXT:    vinserti128 $1, %xmm2, %ymm0, %ymm3
> > +; AVX2-NEXT:    vmovdqa {{.*#+}} ymm3 =
> [255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255]
> > +; AVX2-NEXT:    vpand %ymm3, %ymm0, %ymm0
> > +; AVX2-NEXT:    vperm2i128 {{.*#+}} ymm6 = ymm1[2,3],ymm4[2,3]
> > +; AVX2-NEXT:    vinserti128 $1, %xmm4, %ymm1, %ymm1
> > +; AVX2-NEXT:    vpackusdw %ymm6, %ymm1, %ymm1
> > +; AVX2-NEXT:    vpand %ymm3, %ymm1, %ymm1
> > +; AVX2-NEXT:    vinserti128 $1, %xmm1, %ymm0, %ymm4
> >  ; AVX2-NEXT:    vextracti128 $1, %ymm0, %xmm0
> > -; AVX2-NEXT:    vpblendd {{.*#+}} ymm0 = ymm0[0,1,2,3],ymm2[4,5,6,7]
> > -; AVX2-NEXT:    vpackuswb %ymm0, %ymm3, %ymm0
> > -; AVX2-NEXT:    vperm2i128 {{.*#+}} ymm2 = ymm5[2,3],ymm4[2,3]
> > -; AVX2-NEXT:    vinserti128 $1, %xmm4, %ymm5, %ymm3
> > -; AVX2-NEXT:    vpackusdw %ymm2, %ymm3, %ymm2
> > -; AVX2-NEXT:    vpand %ymm1, %ymm2, %ymm1
> > +; AVX2-NEXT:    vpblendd {{.*#+}} ymm0 = ymm0[0,1,2,3],ymm1[4,5,6,7]
> > +; AVX2-NEXT:    vpackuswb %ymm0, %ymm4, %ymm0
> > +; AVX2-NEXT:    vperm2i128 {{.*#+}} ymm1 = ymm2[2,3],ymm5[2,3]
> > +; AVX2-NEXT:    vinserti128 $1, %xmm5, %ymm2, %ymm2
> > +; AVX2-NEXT:    vpackusdw %ymm1, %ymm2, %ymm1
> > +; AVX2-NEXT:    vpand %ymm3, %ymm1, %ymm1
> >  ; AVX2-NEXT:    vextracti128 $1, %ymm1, %xmm2
> >  ; AVX2-NEXT:    vpackuswb %xmm2, %xmm1, %xmm1
> >  ; AVX2-NEXT:    vmovdqu %xmm1, (%rax)
> > @@ -1897,118 +1899,178 @@ define void @not_avg_v16i8_wide_constant
> >  ; SSE2-NEXT:    pushq %r13
> >  ; SSE2-NEXT:    pushq %r12
> >  ; SSE2-NEXT:    pushq %rbx
> > -; SSE2-NEXT:    movaps (%rdi), %xmm0
> > -; SSE2-NEXT:    movaps (%rsi), %xmm1
> > -; SSE2-NEXT:    movaps %xmm0, -{{[0-9]+}}(%rsp)
> > +; SSE2-NEXT:    movaps (%rdi), %xmm1
> > +; SSE2-NEXT:    movaps (%rsi), %xmm0
> > +; SSE2-NEXT:    movaps %xmm1, -{{[0-9]+}}(%rsp)
> >  ; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %eax
> >  ; SSE2-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > -; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %r13d
> >  ; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %eax
> >  ; SSE2-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> >  ; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %eax
> >  ; SSE2-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > -; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %r14d
> > -; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %r15d
> > +; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %eax
> > +; SSE2-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %eax
> > +; SSE2-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %r13d
> >  ; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %r12d
> > +; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %r15d
> >  ; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %r11d
> >  ; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %r10d
> >  ; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %r9d
> > +; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %r8d
> > +; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %edx
> >  ; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %ecx
> >  ; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %eax
> > -; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %edi
> > +; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %esi
> > +; SSE2-NEXT:    movaps %xmm0, -{{[0-9]+}}(%rsp)
> >  ; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %ebp
> > +; SSE2-NEXT:    addq %r11, %rbp
> > +; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %r14d
> > +; SSE2-NEXT:    addq %r10, %r14
> >  ; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %ebx
> > +; SSE2-NEXT:    addq %r9, %rbx
> > +; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %r11d
> > +; SSE2-NEXT:    addq %r8, %r11
> > +; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %r10d
> > +; SSE2-NEXT:    addq %rdx, %r10
> > +; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %r8d
> > +; SSE2-NEXT:    addq %rcx, %r8
> > +; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %edi
> > +; SSE2-NEXT:    addq %rax, %rdi
> >  ; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %edx
> > -; SSE2-NEXT:    movaps %xmm1, -{{[0-9]+}}(%rsp)
> > +; SSE2-NEXT:    addq %rsi, %rdx
> >  ; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %esi
> > -; SSE2-NEXT:    leal -1(%rdx,%rsi), %edx
> > -; SSE2-NEXT:    movl %edx, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill
> > -; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %edx
> > -; SSE2-NEXT:    leal -1(%rbx,%rdx), %edx
> > -; SSE2-NEXT:    movl %edx, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill
> > -; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %edx
> > -; SSE2-NEXT:    leal -1(%rbp,%rdx), %edx
> > -; SSE2-NEXT:    movl %edx, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill
> > -; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %edx
> > -; SSE2-NEXT:    leal -1(%rdi,%rdx), %r8d
> > -; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %edx
> > -; SSE2-NEXT:    leal -1(%rax,%rdx), %edi
> > -; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %eax
> > -; SSE2-NEXT:    leal -1(%rcx,%rax), %edx
> > -; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %eax
> > -; SSE2-NEXT:    leal -1(%r9,%rax), %ecx
> > +; SSE2-NEXT:    leaq -1(%r15,%rsi), %rax
> > +; SSE2-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> >  ; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %esi
> > -; SSE2-NEXT:    leal -1(%r10,%rsi), %eax
> > +; SSE2-NEXT:    leaq -1(%r12,%rsi), %rax
> > +; SSE2-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> >  ; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %esi
> > -; SSE2-NEXT:    leaq -1(%r11,%rsi), %rsi
> > -; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %ebx
> > -; SSE2-NEXT:    leaq -1(%r12,%rbx), %r12
> > -; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %ebx
> > -; SSE2-NEXT:    leaq -1(%r15,%rbx), %r15
> > -; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %ebx
> > -; SSE2-NEXT:    leaq -1(%r14,%rbx), %r14
> > -; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %ebx
> > -; SSE2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rbp # 8-byte Reload
> > -; SSE2-NEXT:    leaq -1(%rbp,%rbx), %r11
> > -; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %ebx
> > -; SSE2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rbp # 8-byte Reload
> > -; SSE2-NEXT:    leaq -1(%rbp,%rbx), %r10
> > -; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %ebx
> > -; SSE2-NEXT:    leaq -1(%r13,%rbx), %r9
> > -; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %ebx
> > -; SSE2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %r13 # 8-byte Reload
> > -; SSE2-NEXT:    leaq -1(%r13,%rbx), %rbx
> > -; SSE2-NEXT:    shrl %eax
> > -; SSE2-NEXT:    movd %eax, %xmm8
> > -; SSE2-NEXT:    shrl %ecx
> > -; SSE2-NEXT:    movd %ecx, %xmm15
> > -; SSE2-NEXT:    shrl %edx
> > -; SSE2-NEXT:    movd %edx, %xmm9
> > -; SSE2-NEXT:    shrl %edi
> > -; SSE2-NEXT:    movd %edi, %xmm2
> > -; SSE2-NEXT:    shrl %r8d
> > -; SSE2-NEXT:    movd %r8d, %xmm10
> > -; SSE2-NEXT:    movl {{[-0-9]+}}(%r{{[sb]}}p), %eax # 4-byte Reload
> > -; SSE2-NEXT:    shrl %eax
> > -; SSE2-NEXT:    movd %eax, %xmm6
> > -; SSE2-NEXT:    movl {{[-0-9]+}}(%r{{[sb]}}p), %eax # 4-byte Reload
> > -; SSE2-NEXT:    shrl %eax
> > -; SSE2-NEXT:    movd %eax, %xmm11
> > -; SSE2-NEXT:    movl {{[-0-9]+}}(%r{{[sb]}}p), %eax # 4-byte Reload
> > -; SSE2-NEXT:    shrl %eax
> > -; SSE2-NEXT:    movd %eax, %xmm4
> > -; SSE2-NEXT:    shrq %rsi
> > -; SSE2-NEXT:    movd %esi, %xmm12
> > -; SSE2-NEXT:    shrq %r12
> > -; SSE2-NEXT:    movd %r12d, %xmm3
> > -; SSE2-NEXT:    shrq %r15
> > -; SSE2-NEXT:    movd %r15d, %xmm13
> > -; SSE2-NEXT:    shrq %r14
> > -; SSE2-NEXT:    movd %r14d, %xmm7
> > -; SSE2-NEXT:    shrq %r11
> > -; SSE2-NEXT:    movd %r11d, %xmm14
> > -; SSE2-NEXT:    shrq %r10
> > -; SSE2-NEXT:    movd %r10d, %xmm5
> > -; SSE2-NEXT:    shrq %r9
> > -; SSE2-NEXT:    movd %r9d, %xmm0
> > -; SSE2-NEXT:    shrq %rbx
> > -; SSE2-NEXT:    movd %ebx, %xmm1
> > +; SSE2-NEXT:    leaq -1(%r13,%rsi), %rax
> > +; SSE2-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %esi
> > +; SSE2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> > +; SSE2-NEXT:    leaq -1(%rax,%rsi), %rax
> > +; SSE2-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %esi
> > +; SSE2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> > +; SSE2-NEXT:    leaq -1(%rax,%rsi), %rax
> > +; SSE2-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %esi
> > +; SSE2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> > +; SSE2-NEXT:    leaq -1(%rax,%rsi), %rax
> > +; SSE2-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %esi
> > +; SSE2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> > +; SSE2-NEXT:    leaq -1(%rax,%rsi), %rsi
> > +; SSE2-NEXT:    movq %rsi, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %esi
> > +; SSE2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> > +; SSE2-NEXT:    leaq -1(%rax,%rsi), %rsi
> > +; SSE2-NEXT:    movq %rsi, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; SSE2-NEXT:    addq $-1, %rbp
> > +; SSE2-NEXT:    movl $0, %r9d
> > +; SSE2-NEXT:    adcq $-1, %r9
> > +; SSE2-NEXT:    addq $-1, %r14
> > +; SSE2-NEXT:    movl $0, %esi
> > +; SSE2-NEXT:    adcq $-1, %rsi
> > +; SSE2-NEXT:    addq $-1, %rbx
> > +; SSE2-NEXT:    movl $0, %eax
> > +; SSE2-NEXT:    adcq $-1, %rax
> > +; SSE2-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; SSE2-NEXT:    addq $-1, %r11
> > +; SSE2-NEXT:    movl $0, %r12d
> > +; SSE2-NEXT:    adcq $-1, %r12
> > +; SSE2-NEXT:    addq $-1, %r10
> > +; SSE2-NEXT:    movl $0, %r13d
> > +; SSE2-NEXT:    adcq $-1, %r13
> > +; SSE2-NEXT:    addq $-1, %r8
> > +; SSE2-NEXT:    movl $0, %r15d
> > +; SSE2-NEXT:    adcq $-1, %r15
> > +; SSE2-NEXT:    addq $-1, %rdi
> > +; SSE2-NEXT:    movl $0, %ecx
> > +; SSE2-NEXT:    adcq $-1, %rcx
> > +; SSE2-NEXT:    addq $-1, %rdx
> > +; SSE2-NEXT:    movl $0, %eax
> > +; SSE2-NEXT:    adcq $-1, %rax
> > +; SSE2-NEXT:    shldq $63, %rdx, %rax
> > +; SSE2-NEXT:    shldq $63, %rdi, %rcx
> > +; SSE2-NEXT:    movq %rcx, %rdx
> > +; SSE2-NEXT:    shldq $63, %r8, %r15
> > +; SSE2-NEXT:    shldq $63, %r10, %r13
> > +; SSE2-NEXT:    shldq $63, %r11, %r12
> > +; SSE2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rdi # 8-byte Reload
> > +; SSE2-NEXT:    shldq $63, %rbx, %rdi
> > +; SSE2-NEXT:    shldq $63, %r14, %rsi
> > +; SSE2-NEXT:    shldq $63, %rbp, %r9
> > +; SSE2-NEXT:    movq %r9, %xmm8
> > +; SSE2-NEXT:    movq %rsi, %xmm15
> > +; SSE2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rcx # 8-byte Reload
> > +; SSE2-NEXT:    shrq %rcx
> > +; SSE2-NEXT:    movq %rcx, %xmm9
> > +; SSE2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rcx # 8-byte Reload
> > +; SSE2-NEXT:    shrq %rcx
> > +; SSE2-NEXT:    movq %rcx, %xmm2
> > +; SSE2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rcx # 8-byte Reload
> > +; SSE2-NEXT:    shrq %rcx
> > +; SSE2-NEXT:    movq %rcx, %xmm10
> > +; SSE2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rcx # 8-byte Reload
> > +; SSE2-NEXT:    shrq %rcx
> > +; SSE2-NEXT:    movq %rcx, %xmm4
> > +; SSE2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rcx # 8-byte Reload
> > +; SSE2-NEXT:    shrq %rcx
> > +; SSE2-NEXT:    movq %rcx, %xmm11
> > +; SSE2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rcx # 8-byte Reload
> > +; SSE2-NEXT:    shrq %rcx
> > +; SSE2-NEXT:    movq %rcx, %xmm7
> > +; SSE2-NEXT:    movq %rdi, %xmm12
> > +; SSE2-NEXT:    movq %r12, %xmm0
> > +; SSE2-NEXT:    movq %r13, %xmm13
> > +; SSE2-NEXT:    movq %r15, %xmm6
> > +; SSE2-NEXT:    movq %rdx, %xmm14
> > +; SSE2-NEXT:    movq %rax, %xmm5
> > +; SSE2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> > +; SSE2-NEXT:    shrq %rax
> > +; SSE2-NEXT:    movq %rax, %xmm3
> > +; SSE2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> > +; SSE2-NEXT:    shrq %rax
> > +; SSE2-NEXT:    movq %rax, %xmm1
> >  ; SSE2-NEXT:    punpcklbw {{.*#+}} xmm15 =
> xmm15[0],xmm8[0],xmm15[1],xmm8[1],xmm15[2],xmm8[2],xmm15[3],xmm8[3],xmm15[4],xmm8[4],xmm15[5],xmm8[5],xmm15[6],xmm8[6],xmm15[7],xmm8[7]
> >  ; SSE2-NEXT:    punpcklbw {{.*#+}} xmm2 =
> xmm2[0],xmm9[0],xmm2[1],xmm9[1],xmm2[2],xmm9[2],xmm2[3],xmm9[3],xmm2[4],xmm9[4],xmm2[5],xmm9[5],xmm2[6],xmm9[6],xmm2[7],xmm9[7]
> > -; SSE2-NEXT:    punpcklwd {{.*#+}} xmm2 =
> xmm2[0],xmm15[0],xmm2[1],xmm15[1],xmm2[2],xmm15[2],xmm2[3],xmm15[3]
> > -; SSE2-NEXT:    punpcklbw {{.*#+}} xmm6 =
> xmm6[0],xmm10[0],xmm6[1],xmm10[1],xmm6[2],xmm10[2],xmm6[3],xmm10[3],xmm6[4],xmm10[4],xmm6[5],xmm10[5],xmm6[6],xmm10[6],xmm6[7],xmm10[7]
> > -; SSE2-NEXT:    punpcklbw {{.*#+}} xmm4 =
> xmm4[0],xmm11[0],xmm4[1],xmm11[1],xmm4[2],xmm11[2],xmm4[3],xmm11[3],xmm4[4],xmm11[4],xmm4[5],xmm11[5],xmm4[6],xmm11[6],xmm4[7],xmm11[7]
> > -; SSE2-NEXT:    punpcklwd {{.*#+}} xmm4 =
> xmm4[0],xmm6[0],xmm4[1],xmm6[1],xmm4[2],xmm6[2],xmm4[3],xmm6[3]
> > -; SSE2-NEXT:    punpckldq {{.*#+}} xmm4 =
> xmm4[0],xmm2[0],xmm4[1],xmm2[1]
> > -; SSE2-NEXT:    punpcklbw {{.*#+}} xmm3 =
> xmm3[0],xmm12[0],xmm3[1],xmm12[1],xmm3[2],xmm12[2],xmm3[3],xmm12[3],xmm3[4],xmm12[4],xmm3[5],xmm12[5],xmm3[6],xmm12[6],xmm3[7],xmm12[7]
> > -; SSE2-NEXT:    punpcklbw {{.*#+}} xmm7 =
> xmm7[0],xmm13[0],xmm7[1],xmm13[1],xmm7[2],xmm13[2],xmm7[3],xmm13[3],xmm7[4],xmm13[4],xmm7[5],xmm13[5],xmm7[6],xmm13[6],xmm7[7],xmm13[7]
> > -; SSE2-NEXT:    punpcklwd {{.*#+}} xmm7 =
> xmm7[0],xmm3[0],xmm7[1],xmm3[1],xmm7[2],xmm3[2],xmm7[3],xmm3[3]
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm8 = xmm15[0,1,2,0]
> > +; SSE2-NEXT:    pand {{.*}}(%rip), %xmm8
> > +; SSE2-NEXT:    pslldq {{.*#+}} xmm2 =
> zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,xmm2[0,1]
> > +; SSE2-NEXT:    por %xmm8, %xmm2
> > +; SSE2-NEXT:    punpcklbw {{.*#+}} xmm4 =
> xmm4[0],xmm10[0],xmm4[1],xmm10[1],xmm4[2],xmm10[2],xmm4[3],xmm10[3],xmm4[4],xmm10[4],xmm4[5],xmm10[5],xmm4[6],xmm10[6],xmm4[7],xmm10[7]
> > +; SSE2-NEXT:    pslldq {{.*#+}} xmm4 =
> zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,xmm4[0,1,2,3,4,5]
> > +; SSE2-NEXT:    punpcklbw {{.*#+}} xmm7 =
> xmm7[0],xmm11[0],xmm7[1],xmm11[1],xmm7[2],xmm11[2],xmm7[3],xmm11[3],xmm7[4],xmm11[4],xmm7[5],xmm11[5],xmm7[6],xmm11[6],xmm7[7],xmm11[7]
> > +; SSE2-NEXT:    movdqa {{.*#+}} xmm8 =
> [65535,65535,65535,65535,65535,0,65535,65535]
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm7 = xmm7[0,1,0,1]
> > +; SSE2-NEXT:    pand %xmm8, %xmm7
> > +; SSE2-NEXT:    pandn %xmm4, %xmm8
> > +; SSE2-NEXT:    por %xmm7, %xmm8
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm4 = xmm8[0,1,2,2]
> > +; SSE2-NEXT:    punpckhdq {{.*#+}} xmm4 =
> xmm4[2],xmm2[2],xmm4[3],xmm2[3]
> > +; SSE2-NEXT:    punpcklbw {{.*#+}} xmm0 =
> xmm0[0],xmm12[0],xmm0[1],xmm12[1],xmm0[2],xmm12[2],xmm0[3],xmm12[3],xmm0[4],xmm12[4],xmm0[5],xmm12[5],xmm0[6],xmm12[6],xmm0[7],xmm12[7]
> > +; SSE2-NEXT:    punpcklbw {{.*#+}} xmm6 =
> xmm6[0],xmm13[0],xmm6[1],xmm13[1],xmm6[2],xmm13[2],xmm6[3],xmm13[3],xmm6[4],xmm13[4],xmm6[5],xmm13[5],xmm6[6],xmm13[6],xmm6[7],xmm13[7]
> > +; SSE2-NEXT:    movdqa {{.*#+}} xmm2 =
> [65535,0,65535,65535,65535,65535,65535,65535]
> > +; SSE2-NEXT:    pand %xmm2, %xmm0
> > +; SSE2-NEXT:    pslld $16, %xmm6
> > +; SSE2-NEXT:    pandn %xmm6, %xmm2
> > +; SSE2-NEXT:    por %xmm0, %xmm2
> >  ; SSE2-NEXT:    punpcklbw {{.*#+}} xmm5 =
> xmm5[0],xmm14[0],xmm5[1],xmm14[1],xmm5[2],xmm14[2],xmm5[3],xmm14[3],xmm5[4],xmm14[4],xmm5[5],xmm14[5],xmm5[6],xmm14[6],xmm5[7],xmm14[7]
> > -; SSE2-NEXT:    punpcklbw {{.*#+}} xmm1 =
> xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3],xmm1[4],xmm0[4],xmm1[5],xmm0[5],xmm1[6],xmm0[6],xmm1[7],xmm0[7]
> > -; SSE2-NEXT:    punpcklwd {{.*#+}} xmm1 =
> xmm1[0],xmm5[0],xmm1[1],xmm5[1],xmm1[2],xmm5[2],xmm1[3],xmm5[3]
> > -; SSE2-NEXT:    punpckldq {{.*#+}} xmm1 =
> xmm1[0],xmm7[0],xmm1[1],xmm7[1]
> > -; SSE2-NEXT:    punpcklqdq {{.*#+}} xmm4 = xmm4[0],xmm1[0]
> > -; SSE2-NEXT:    movdqu %xmm4, (%rax)
> > +; SSE2-NEXT:    psllq $48, %xmm5
> > +; SSE2-NEXT:    movdqa {{.*#+}} xmm0 =
> [65535,65535,65535,0,65535,65535,65535,65535]
> > +; SSE2-NEXT:    punpcklbw {{.*#+}} xmm1 =
> xmm1[0],xmm3[0],xmm1[1],xmm3[1],xmm1[2],xmm3[2],xmm1[3],xmm3[3],xmm1[4],xmm3[4],xmm1[5],xmm3[5],xmm1[6],xmm3[6],xmm1[7],xmm3[7]
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[0,0,1,1]
> > +; SSE2-NEXT:    pand %xmm0, %xmm1
> > +; SSE2-NEXT:    pandn %xmm5, %xmm0
> > +; SSE2-NEXT:    por %xmm1, %xmm0
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,1,2,3]
> > +; SSE2-NEXT:    punpckldq {{.*#+}} xmm2 =
> xmm2[0],xmm0[0],xmm2[1],xmm0[1]
> > +; SSE2-NEXT:    shufps {{.*#+}} xmm2 = xmm2[0,1],xmm4[2,3]
> > +; SSE2-NEXT:    movups %xmm2, (%rax)
> >  ; SSE2-NEXT:    popq %rbx
> >  ; SSE2-NEXT:    popq %r12
> >  ; SSE2-NEXT:    popq %r13
> > @@ -2025,118 +2087,181 @@ define void @not_avg_v16i8_wide_constant
> >  ; AVX1-NEXT:    pushq %r13
> >  ; AVX1-NEXT:    pushq %r12
> >  ; AVX1-NEXT:    pushq %rbx
> > -; AVX1-NEXT:    vpmovzxbw {{.*#+}} xmm1 =
> mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
> > -; AVX1-NEXT:    vpmovzxbw {{.*#+}} xmm3 =
> mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
> > +; AVX1-NEXT:    vpmovzxbw {{.*#+}} xmm4 =
> mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
> >  ; AVX1-NEXT:    vpmovzxbw {{.*#+}} xmm0 =
> mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
> > -; AVX1-NEXT:    vpmovzxbw {{.*#+}} xmm5 =
> mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
> > -; AVX1-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > -; AVX1-NEXT:    vpunpckhwd {{.*#+}} xmm6 =
> xmm3[4],xmm2[4],xmm3[5],xmm2[5],xmm3[6],xmm2[6],xmm3[7],xmm2[7]
> > -; AVX1-NEXT:    vpunpckhwd {{.*#+}} xmm4 =
> xmm1[4],xmm2[4],xmm1[5],xmm2[5],xmm1[6],xmm2[6],xmm1[7],xmm2[7]
> > -; AVX1-NEXT:    vpunpckhdq {{.*#+}} xmm7 =
> xmm4[2],xmm2[2],xmm4[3],xmm2[3]
> > -; AVX1-NEXT:    vpextrq $1, %xmm7, %r15
> > -; AVX1-NEXT:    vmovq %xmm7, %r14
> > -; AVX1-NEXT:    vpmovzxdq {{.*#+}} xmm4 = xmm4[0],zero,xmm4[1],zero
> > -; AVX1-NEXT:    vpextrq $1, %xmm4, %r11
> > -; AVX1-NEXT:    vmovq %xmm4, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Folded
> Spill
> > +; AVX1-NEXT:    vpmovzxbw {{.*#+}} xmm2 =
> mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
> > +; AVX1-NEXT:    vpmovzxbw {{.*#+}} xmm1 =
> mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
> > +; AVX1-NEXT:    vpmovzxwd {{.*#+}} xmm5 =
> xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero
> > +; AVX1-NEXT:    vpxor %xmm3, %xmm3, %xmm3
> > +; AVX1-NEXT:    vpunpckhdq {{.*#+}} xmm5 =
> xmm5[2],xmm3[2],xmm5[3],xmm3[3]
> > +; AVX1-NEXT:    vmovq %xmm5, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Folded
> Spill
> > +; AVX1-NEXT:    vpextrq $1, %xmm5, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte
> Folded Spill
> > +; AVX1-NEXT:    vpunpckhwd {{.*#+}} xmm5 =
> xmm0[4],xmm3[4],xmm0[5],xmm3[5],xmm0[6],xmm3[6],xmm0[7],xmm3[7]
> > +; AVX1-NEXT:    vpunpckhdq {{.*#+}} xmm6 =
> xmm5[2],xmm3[2],xmm5[3],xmm3[3]
> > +; AVX1-NEXT:    vmovq %xmm6, %r10
> > +; AVX1-NEXT:    vpextrq $1, %xmm6, %r9
> > +; AVX1-NEXT:    vpunpckhwd {{.*#+}} xmm6 =
> xmm4[4],xmm3[4],xmm4[5],xmm3[5],xmm4[6],xmm3[6],xmm4[7],xmm3[7]
> > +; AVX1-NEXT:    vpmovzxdq {{.*#+}} xmm7 = xmm6[0],zero,xmm6[1],zero
> > +; AVX1-NEXT:    vmovq %xmm7, %r8
> > +; AVX1-NEXT:    vpextrq $1, %xmm7, %rdi
> > +; AVX1-NEXT:    vpunpckhdq {{.*#+}} xmm6 =
> xmm6[2],xmm3[2],xmm6[3],xmm3[3]
> > +; AVX1-NEXT:    vpextrq $1, %xmm6, %rcx
> > +; AVX1-NEXT:    vmovq %xmm6, %r14
> > +; AVX1-NEXT:    vpmovzxwd {{.*#+}} xmm6 =
> xmm4[0],zero,xmm4[1],zero,xmm4[2],zero,xmm4[3],zero
> > +; AVX1-NEXT:    vpunpckhdq {{.*#+}} xmm6 =
> xmm6[2],xmm3[2],xmm6[3],xmm3[3]
> > +; AVX1-NEXT:    vpextrq $1, %xmm6, %rax
> > +; AVX1-NEXT:    vmovq %xmm6, %rbp
> > +; AVX1-NEXT:    vpmovzxdq {{.*#+}} xmm5 = xmm5[0],zero,xmm5[1],zero
> > +; AVX1-NEXT:    vpextrq $1, %xmm5, %r11
> > +; AVX1-NEXT:    vmovq %xmm5, %r15
> > +; AVX1-NEXT:    vpmovzxwq {{.*#+}} xmm8 =
> xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero
> > +; AVX1-NEXT:    vpmovzxwq {{.*#+}} xmm4 =
> xmm4[0],zero,zero,zero,xmm4[1],zero,zero,zero
> > +; AVX1-NEXT:    vpextrq $1, %xmm4, %rbx
> > +; AVX1-NEXT:    vmovq %xmm4, %rdx
> >  ; AVX1-NEXT:    vpmovzxwd {{.*#+}} xmm4 =
> xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero
> > -; AVX1-NEXT:    vpunpckhdq {{.*#+}} xmm4 =
> xmm4[2],xmm2[2],xmm4[3],xmm2[3]
> > -; AVX1-NEXT:    vpextrq $1, %xmm4, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte
> Folded Spill
> > -; AVX1-NEXT:    vmovq %xmm4, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Folded
> Spill
> > -; AVX1-NEXT:    vpmovzxwd {{.*#+}} xmm4 =
> xmm3[0],zero,xmm3[1],zero,xmm3[2],zero,xmm3[3],zero
> > -; AVX1-NEXT:    vpmovzxwq {{.*#+}} xmm7 =
> xmm3[0],zero,zero,zero,xmm3[1],zero,zero,zero
> > -; AVX1-NEXT:    vpmovzxwq {{.*#+}} xmm8 =
> xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero
> > -; AVX1-NEXT:    vpunpckhwd {{.*#+}} xmm1 =
> xmm5[4],xmm2[4],xmm5[5],xmm2[5],xmm5[6],xmm2[6],xmm5[7],xmm2[7]
> > -; AVX1-NEXT:    vpunpckhwd {{.*#+}} xmm3 =
> xmm0[4],xmm2[4],xmm0[5],xmm2[5],xmm0[6],xmm2[6],xmm0[7],xmm2[7]
> > -; AVX1-NEXT:    vmovd %xmm6, %ecx
> > -; AVX1-NEXT:    vpextrd $1, %xmm6, %edx
> > -; AVX1-NEXT:    vpextrd $2, %xmm6, %r13d
> > -; AVX1-NEXT:    vpextrd $3, %xmm6, %r12d
> > -; AVX1-NEXT:    vpunpckhdq {{.*#+}} xmm6 =
> xmm3[2],xmm2[2],xmm3[3],xmm2[3]
> > -; AVX1-NEXT:    vmovd %xmm1, %ebx
> > -; AVX1-NEXT:    vpextrd $1, %xmm1, %ebp
> > -; AVX1-NEXT:    vpextrd $2, %xmm1, %esi
> > -; AVX1-NEXT:    vpextrd $3, %xmm1, %edi
> > -; AVX1-NEXT:    vpmovzxwd {{.*#+}} xmm1 =
> xmm5[0],zero,xmm5[1],zero,xmm5[2],zero,xmm5[3],zero
> > -; AVX1-NEXT:    vpmovzxwq {{.*#+}} xmm5 =
> xmm5[0],zero,zero,zero,xmm5[1],zero,zero,zero
> > -; AVX1-NEXT:    vmovd %xmm7, %r8d
> > -; AVX1-NEXT:    leal -1(%r12,%rdi), %eax
> > -; AVX1-NEXT:    movl %eax, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill
> > -; AVX1-NEXT:    vpextrd $2, %xmm7, %eax
> > -; AVX1-NEXT:    leal -1(%r13,%rsi), %esi
> > -; AVX1-NEXT:    movl %esi, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill
> > -; AVX1-NEXT:    vpextrd $2, %xmm4, %edi
> > -; AVX1-NEXT:    leal -1(%rdx,%rbp), %edx
> > -; AVX1-NEXT:    movl %edx, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill
> > -; AVX1-NEXT:    vpextrd $3, %xmm4, %edx
> > -; AVX1-NEXT:    leal -1(%rcx,%rbx), %r10d
> > -; AVX1-NEXT:    vpextrd $3, %xmm1, %ecx
> > -; AVX1-NEXT:    leal -1(%rdx,%rcx), %r9d
> > -; AVX1-NEXT:    vpextrd $2, %xmm1, %ecx
> > -; AVX1-NEXT:    leal -1(%rdi,%rcx), %edi
> > -; AVX1-NEXT:    vpextrd $2, %xmm5, %ecx
> > -; AVX1-NEXT:    leal -1(%rax,%rcx), %eax
> > -; AVX1-NEXT:    vmovd %xmm5, %ecx
> > -; AVX1-NEXT:    leal -1(%r8,%rcx), %r8d
> > -; AVX1-NEXT:    vpextrq $1, %xmm6, %rdx
> > -; AVX1-NEXT:    leal -1(%r15,%rdx), %r15d
> > -; AVX1-NEXT:    vmovq %xmm6, %rdx
> > -; AVX1-NEXT:    vpmovzxdq {{.*#+}} xmm1 = xmm3[0],zero,xmm3[1],zero
> > -; AVX1-NEXT:    leal -1(%r14,%rdx), %r14d
> > -; AVX1-NEXT:    vpextrq $1, %xmm1, %rdx
> > -; AVX1-NEXT:    leal -1(%r11,%rdx), %edx
> > -; AVX1-NEXT:    vmovq %xmm1, %rcx
> > -; AVX1-NEXT:    vpmovzxwd {{.*#+}} xmm1 =
> xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero
> > -; AVX1-NEXT:    vpunpckhdq {{.*#+}} xmm1 =
> xmm1[2],xmm2[2],xmm1[3],xmm2[3]
> > -; AVX1-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rsi # 8-byte Reload
> > -; AVX1-NEXT:    leal -1(%rsi,%rcx), %ecx
> > -; AVX1-NEXT:    vpextrq $1, %xmm1, %rsi
> > -; AVX1-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rbp # 8-byte Reload
> > -; AVX1-NEXT:    leal -1(%rbp,%rsi), %esi
> > -; AVX1-NEXT:    vmovq %xmm1, %rbx
> > -; AVX1-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rbp # 8-byte Reload
> > -; AVX1-NEXT:    leal -1(%rbp,%rbx), %ebx
> > -; AVX1-NEXT:    vpextrq $1, %xmm8, %r11
> > -; AVX1-NEXT:    vpmovzxwq {{.*#+}} xmm0 =
> xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero
> > -; AVX1-NEXT:    vpextrq $1, %xmm0, %r12
> > -; AVX1-NEXT:    leal -1(%r11,%r12), %r11d
> > -; AVX1-NEXT:    vmovq %xmm8, %r12
> > +; AVX1-NEXT:    vpunpckhdq {{.*#+}} xmm4 =
> xmm4[2],xmm3[2],xmm4[3],xmm3[3]
> > +; AVX1-NEXT:    vpunpckhwd {{.*#+}} xmm7 =
> xmm1[4],xmm3[4],xmm1[5],xmm3[5],xmm1[6],xmm3[6],xmm1[7],xmm3[7]
> > +; AVX1-NEXT:    vpunpckhdq {{.*#+}} xmm5 =
> xmm7[2],xmm3[2],xmm7[3],xmm3[3]
> > +; AVX1-NEXT:    vpunpckhwd {{.*#+}} xmm0 =
> xmm2[4],xmm3[4],xmm2[5],xmm3[5],xmm2[6],xmm3[6],xmm2[7],xmm3[7]
> > +; AVX1-NEXT:    vpmovzxdq {{.*#+}} xmm6 = xmm0[0],zero,xmm0[1],zero
> > +; AVX1-NEXT:    vpunpckhdq {{.*#+}} xmm0 =
> xmm0[2],xmm3[2],xmm0[3],xmm3[3]
> > +; AVX1-NEXT:    vpextrq $1, %xmm0, %rsi
> > +; AVX1-NEXT:    addq %rcx, %rsi
> >  ; AVX1-NEXT:    vmovq %xmm0, %r13
> > -; AVX1-NEXT:    leal -1(%r12,%r13), %ebp
> > -; AVX1-NEXT:    shrl %ebp
> > -; AVX1-NEXT:    vmovd %ebp, %xmm0
> > -; AVX1-NEXT:    shrl %r11d
> > -; AVX1-NEXT:    vpinsrb $1, %r11d, %xmm0, %xmm0
> > -; AVX1-NEXT:    shrl %ebx
> > -; AVX1-NEXT:    vpinsrb $2, %ebx, %xmm0, %xmm0
> > -; AVX1-NEXT:    shrl %esi
> > -; AVX1-NEXT:    vpinsrb $3, %esi, %xmm0, %xmm0
> > -; AVX1-NEXT:    shrl %ecx
> > -; AVX1-NEXT:    vpinsrb $4, %ecx, %xmm0, %xmm0
> > -; AVX1-NEXT:    shrl %edx
> > -; AVX1-NEXT:    vpinsrb $5, %edx, %xmm0, %xmm0
> > -; AVX1-NEXT:    shrl %r14d
> > -; AVX1-NEXT:    vpinsrb $6, %r14d, %xmm0, %xmm0
> > -; AVX1-NEXT:    shrl %r15d
> > -; AVX1-NEXT:    vpinsrb $7, %r15d, %xmm0, %xmm0
> > -; AVX1-NEXT:    shrl %r8d
> > -; AVX1-NEXT:    vpinsrb $8, %r8d, %xmm0, %xmm0
> > -; AVX1-NEXT:    shrl %eax
> > -; AVX1-NEXT:    vpinsrb $9, %eax, %xmm0, %xmm0
> > -; AVX1-NEXT:    shrl %edi
> > -; AVX1-NEXT:    vpinsrb $10, %edi, %xmm0, %xmm0
> > -; AVX1-NEXT:    shrl %r9d
> > -; AVX1-NEXT:    vpinsrb $11, %r9d, %xmm0, %xmm0
> > -; AVX1-NEXT:    shrl %r10d
> > -; AVX1-NEXT:    vpinsrb $12, %r10d, %xmm0, %xmm0
> > -; AVX1-NEXT:    movl {{[-0-9]+}}(%r{{[sb]}}p), %eax # 4-byte Reload
> > -; AVX1-NEXT:    shrl %eax
> > -; AVX1-NEXT:    vpinsrb $13, %eax, %xmm0, %xmm0
> > -; AVX1-NEXT:    movl {{[-0-9]+}}(%r{{[sb]}}p), %eax # 4-byte Reload
> > -; AVX1-NEXT:    shrl %eax
> > -; AVX1-NEXT:    vpinsrb $14, %eax, %xmm0, %xmm0
> > -; AVX1-NEXT:    movl {{[-0-9]+}}(%r{{[sb]}}p), %eax # 4-byte Reload
> > -; AVX1-NEXT:    shrl %eax
> > -; AVX1-NEXT:    vpinsrb $15, %eax, %xmm0, %xmm0
> > +; AVX1-NEXT:    addq %r14, %r13
> > +; AVX1-NEXT:    vpmovzxwd {{.*#+}} xmm0 =
> xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero
> > +; AVX1-NEXT:    vpunpckhdq {{.*#+}} xmm0 =
> xmm0[2],xmm3[2],xmm0[3],xmm3[3]
> > +; AVX1-NEXT:    vpextrq $1, %xmm0, %r12
> > +; AVX1-NEXT:    addq %rax, %r12
> > +; AVX1-NEXT:    vmovq %xmm0, %r14
> > +; AVX1-NEXT:    addq %rbp, %r14
> > +; AVX1-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm7[0],zero,xmm7[1],zero
> > +; AVX1-NEXT:    vpextrq $1, %xmm0, %rbp
> > +; AVX1-NEXT:    addq %r11, %rbp
> > +; AVX1-NEXT:    vmovq %xmm0, %r11
> > +; AVX1-NEXT:    addq %r15, %r11
> > +; AVX1-NEXT:    vpmovzxwq {{.*#+}} xmm0 =
> xmm2[0],zero,zero,zero,xmm2[1],zero,zero,zero
> > +; AVX1-NEXT:    vpextrq $1, %xmm0, %r15
> > +; AVX1-NEXT:    addq %rbx, %r15
> > +; AVX1-NEXT:    vmovq %xmm0, %rbx
> > +; AVX1-NEXT:    addq %rdx, %rbx
> > +; AVX1-NEXT:    vpextrq $1, %xmm6, %rax
> > +; AVX1-NEXT:    leaq -1(%rdi,%rax), %rax
> > +; AVX1-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX1-NEXT:    vmovq %xmm6, %rax
> > +; AVX1-NEXT:    leaq -1(%r8,%rax), %rax
> > +; AVX1-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX1-NEXT:    vpextrq $1, %xmm5, %rax
> > +; AVX1-NEXT:    leaq -1(%r9,%rax), %rax
> > +; AVX1-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX1-NEXT:    vmovq %xmm5, %rax
> > +; AVX1-NEXT:    leaq -1(%r10,%rax), %rax
> > +; AVX1-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX1-NEXT:    vpextrq $1, %xmm4, %rax
> > +; AVX1-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rcx # 8-byte Reload
> > +; AVX1-NEXT:    leaq -1(%rcx,%rax), %rax
> > +; AVX1-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX1-NEXT:    vmovq %xmm4, %rax
> > +; AVX1-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rcx # 8-byte Reload
> > +; AVX1-NEXT:    leaq -1(%rcx,%rax), %rax
> > +; AVX1-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX1-NEXT:    vpextrq $1, %xmm8, %rax
> > +; AVX1-NEXT:    vpmovzxwq {{.*#+}} xmm0 =
> xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero
> > +; AVX1-NEXT:    vpextrq $1, %xmm0, %rcx
> > +; AVX1-NEXT:    leaq -1(%rax,%rcx), %rax
> > +; AVX1-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX1-NEXT:    vmovq %xmm8, %rax
> > +; AVX1-NEXT:    vmovq %xmm0, %rcx
> > +; AVX1-NEXT:    leaq -1(%rax,%rcx), %rax
> > +; AVX1-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX1-NEXT:    xorl %r10d, %r10d
> > +; AVX1-NEXT:    addq $-1, %rsi
> > +; AVX1-NEXT:    movq %rsi, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX1-NEXT:    movl $0, %ecx
> > +; AVX1-NEXT:    adcq $-1, %rcx
> > +; AVX1-NEXT:    addq $-1, %r13
> > +; AVX1-NEXT:    movl $0, %eax
> > +; AVX1-NEXT:    adcq $-1, %rax
> > +; AVX1-NEXT:    addq $-1, %r12
> > +; AVX1-NEXT:    movl $0, %edi
> > +; AVX1-NEXT:    adcq $-1, %rdi
> > +; AVX1-NEXT:    addq $-1, %r14
> > +; AVX1-NEXT:    movl $0, %esi
> > +; AVX1-NEXT:    adcq $-1, %rsi
> > +; AVX1-NEXT:    addq $-1, %rbp
> > +; AVX1-NEXT:    movl $0, %r9d
> > +; AVX1-NEXT:    adcq $-1, %r9
> > +; AVX1-NEXT:    addq $-1, %r11
> > +; AVX1-NEXT:    movl $0, %r8d
> > +; AVX1-NEXT:    adcq $-1, %r8
> > +; AVX1-NEXT:    addq $-1, %r15
> > +; AVX1-NEXT:    movl $0, %edx
> > +; AVX1-NEXT:    adcq $-1, %rdx
> > +; AVX1-NEXT:    addq $-1, %rbx
> > +; AVX1-NEXT:    adcq $-1, %r10
> > +; AVX1-NEXT:    shldq $63, %r11, %r8
> > +; AVX1-NEXT:    shldq $63, %rbp, %r9
> > +; AVX1-NEXT:    shldq $63, %r14, %rsi
> > +; AVX1-NEXT:    shldq $63, %r12, %rdi
> > +; AVX1-NEXT:    shldq $63, %r13, %rax
> > +; AVX1-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rbp # 8-byte Reload
> > +; AVX1-NEXT:    shldq $63, %rbp, %rcx
> > +; AVX1-NEXT:    shldq $63, %rbx, %r10
> > +; AVX1-NEXT:    shldq $63, %r15, %rdx
> > +; AVX1-NEXT:    vmovq %rcx, %xmm8
> > +; AVX1-NEXT:    vmovq %rax, %xmm9
> > +; AVX1-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> > +; AVX1-NEXT:    shrq %rax
> > +; AVX1-NEXT:    vmovq %rax, %xmm0
> > +; AVX1-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> > +; AVX1-NEXT:    shrq %rax
> > +; AVX1-NEXT:    vmovq %rax, %xmm11
> > +; AVX1-NEXT:    vmovq %rdi, %xmm12
> > +; AVX1-NEXT:    vmovq %rsi, %xmm13
> > +; AVX1-NEXT:    vmovq %rdx, %xmm14
> > +; AVX1-NEXT:    vmovq %r10, %xmm15
> > +; AVX1-NEXT:    vmovq %r9, %xmm10
> > +; AVX1-NEXT:    vmovq %r8, %xmm1
> > +; AVX1-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> > +; AVX1-NEXT:    shrq %rax
> > +; AVX1-NEXT:    vmovq %rax, %xmm2
> > +; AVX1-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> > +; AVX1-NEXT:    shrq %rax
> > +; AVX1-NEXT:    vmovq %rax, %xmm3
> > +; AVX1-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> > +; AVX1-NEXT:    shrq %rax
> > +; AVX1-NEXT:    vmovq %rax, %xmm4
> > +; AVX1-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> > +; AVX1-NEXT:    shrq %rax
> > +; AVX1-NEXT:    vmovq %rax, %xmm5
> > +; AVX1-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> > +; AVX1-NEXT:    shrq %rax
> > +; AVX1-NEXT:    vmovq %rax, %xmm6
> > +; AVX1-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> > +; AVX1-NEXT:    shrq %rax
> > +; AVX1-NEXT:    vmovq %rax, %xmm7
> > +; AVX1-NEXT:    vpunpcklbw {{.*#+}} xmm8 =
> xmm9[0],xmm8[0],xmm9[1],xmm8[1],xmm9[2],xmm8[2],xmm9[3],xmm8[3],xmm9[4],xmm8[4],xmm9[5],xmm8[5],xmm9[6],xmm8[6],xmm9[7],xmm8[7]
> > +; AVX1-NEXT:    vpunpcklbw {{.*#+}} xmm9 =
> xmm11[0],xmm0[0],xmm11[1],xmm0[1],xmm11[2],xmm0[2],xmm11[3],xmm0[3],xmm11[4],xmm0[4],xmm11[5],xmm0[5],xmm11[6],xmm0[6],xmm11[7],xmm0[7]
> > +; AVX1-NEXT:    vpsllq $48, %xmm8, %xmm8
> > +; AVX1-NEXT:    vpshufd {{.*#+}} xmm0 = xmm9[0,0,1,1]
> > +; AVX1-NEXT:    vpblendw {{.*#+}} xmm8 =
> xmm0[0,1,2],xmm8[3],xmm0[4,5,6,7]
> > +; AVX1-NEXT:    vpunpcklbw {{.*#+}} xmm0 =
> xmm13[0],xmm12[0],xmm13[1],xmm12[1],xmm13[2],xmm12[2],xmm13[3],xmm12[3],xmm13[4],xmm12[4],xmm13[5],xmm12[5],xmm13[6],xmm12[6],xmm13[7],xmm12[7]
> > +; AVX1-NEXT:    vpunpcklbw {{.*#+}} xmm9 =
> xmm15[0],xmm14[0],xmm15[1],xmm14[1],xmm15[2],xmm14[2],xmm15[3],xmm14[3],xmm15[4],xmm14[4],xmm15[5],xmm14[5],xmm15[6],xmm14[6],xmm15[7],xmm14[7]
> > +; AVX1-NEXT:    vpslld $16, %xmm0, %xmm0
> > +; AVX1-NEXT:    vpblendw {{.*#+}} xmm0 =
> xmm9[0],xmm0[1],xmm9[2,3,4,5,6,7]
> > +; AVX1-NEXT:    vpblendw {{.*#+}} xmm0 =
> xmm0[0,1],xmm8[2,3],xmm0[4,5,6,7]
> > +; AVX1-NEXT:    vpunpcklbw {{.*#+}} xmm1 =
> xmm1[0],xmm10[0],xmm1[1],xmm10[1],xmm1[2],xmm10[2],xmm1[3],xmm10[3],xmm1[4],xmm10[4],xmm1[5],xmm10[5],xmm1[6],xmm10[6],xmm1[7],xmm10[7]
> > +; AVX1-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[0,1,2,0]
> > +; AVX1-NEXT:    vpunpcklbw {{.*#+}} xmm2 =
> xmm3[0],xmm2[0],xmm3[1],xmm2[1],xmm3[2],xmm2[2],xmm3[3],xmm2[3],xmm3[4],xmm2[4],xmm3[5],xmm2[5],xmm3[6],xmm2[6],xmm3[7],xmm2[7]
> > +; AVX1-NEXT:    vpslldq {{.*#+}} xmm2 =
> zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,xmm2[0,1]
> > +; AVX1-NEXT:    vpblendw {{.*#+}} xmm1 = xmm1[0,1,2,3,4,5,6],xmm2[7]
> > +; AVX1-NEXT:    vpunpcklbw {{.*#+}} xmm2 =
> xmm5[0],xmm4[0],xmm5[1],xmm4[1],xmm5[2],xmm4[2],xmm5[3],xmm4[3],xmm5[4],xmm4[4],xmm5[5],xmm4[5],xmm5[6],xmm4[6],xmm5[7],xmm4[7]
> > +; AVX1-NEXT:    vpslldq {{.*#+}} xmm2 =
> zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,xmm2[0,1,2,3,4,5]
> > +; AVX1-NEXT:    vpunpcklbw {{.*#+}} xmm3 =
> xmm7[0],xmm6[0],xmm7[1],xmm6[1],xmm7[2],xmm6[2],xmm7[3],xmm6[3],xmm7[4],xmm6[4],xmm7[5],xmm6[5],xmm7[6],xmm6[6],xmm7[7],xmm6[7]
> > +; AVX1-NEXT:    vpshufd {{.*#+}} xmm3 = xmm3[0,1,0,1]
> > +; AVX1-NEXT:    vpblendw {{.*#+}} xmm2 =
> xmm3[0,1,2,3,4],xmm2[5],xmm3[6,7]
> > +; AVX1-NEXT:    vpblendw {{.*#+}} xmm1 = xmm2[0,1,2,3,4,5],xmm1[6,7]
> > +; AVX1-NEXT:    vpblendw {{.*#+}} xmm0 = xmm0[0,1,2,3],xmm1[4,5,6,7]
> >  ; AVX1-NEXT:    vmovdqu %xmm0, (%rax)
> >  ; AVX1-NEXT:    popq %rbx
> >  ; AVX1-NEXT:    popq %r12
> > @@ -2154,123 +2279,230 @@ define void @not_avg_v16i8_wide_constant
> >  ; AVX2-NEXT:    pushq %r13
> >  ; AVX2-NEXT:    pushq %r12
> >  ; AVX2-NEXT:    pushq %rbx
> > +; AVX2-NEXT:    subq $16, %rsp
> >  ; AVX2-NEXT:    vpmovzxbw {{.*#+}} ymm1 =
> mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
> >  ; AVX2-NEXT:    vpmovzxbw {{.*#+}} ymm0 =
> mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
> > -; AVX2-NEXT:    vpmovzxwd {{.*#+}} ymm2 =
> xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero
> > -; AVX2-NEXT:    vpmovzxdq {{.*#+}} ymm10 =
> xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero
> > -; AVX2-NEXT:    vextracti128 $1, %ymm1, %xmm1
> > +; AVX2-NEXT:    vextracti128 $1, %ymm1, %xmm2
> > +; AVX2-NEXT:    vpmovzxwd {{.*#+}} ymm2 =
> xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero,xmm2[4],zero,xmm2[5],zero,xmm2[6],zero,xmm2[7],zero
> > +; AVX2-NEXT:    vextracti128 $1, %ymm2, %xmm3
> > +; AVX2-NEXT:    vpmovzxdq {{.*#+}} ymm3 =
> xmm3[0],zero,xmm3[1],zero,xmm3[2],zero,xmm3[3],zero
> > +; AVX2-NEXT:    vextracti128 $1, %ymm3, %xmm4
> > +; AVX2-NEXT:    vpextrq $1, %xmm4, %rbx
> > +; AVX2-NEXT:    vmovq %xmm4, %rbp
> > +; AVX2-NEXT:    vpextrq $1, %xmm3, %rdi
> > +; AVX2-NEXT:    vmovq %xmm3, %rcx
> > +; AVX2-NEXT:    vpmovzxdq {{.*#+}} ymm2 =
> xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero
> > +; AVX2-NEXT:    vextracti128 $1, %ymm2, %xmm3
> > +; AVX2-NEXT:    vpextrq $1, %xmm3, %rdx
> > +; AVX2-NEXT:    vmovq %xmm3, %r9
> > +; AVX2-NEXT:    vpextrq $1, %xmm2, %r13
> > +; AVX2-NEXT:    vmovq %xmm2, %r12
> >  ; AVX2-NEXT:    vpmovzxwd {{.*#+}} ymm1 =
> xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero
> > -; AVX2-NEXT:    vextracti128 $1, %ymm1, %xmm3
> > -; AVX2-NEXT:    vpmovzxdq {{.*#+}} ymm5 =
> xmm3[0],zero,xmm3[1],zero,xmm3[2],zero,xmm3[3],zero
> > -; AVX2-NEXT:    vextracti128 $1, %ymm5, %xmm4
> > -; AVX2-NEXT:    vpmovzxdq {{.*#+}} ymm9 =
> xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero
> > -; AVX2-NEXT:    vextracti128 $1, %ymm9, %xmm7
> > -; AVX2-NEXT:    vextracti128 $1, %ymm2, %xmm1
> > -; AVX2-NEXT:    vpmovzxdq {{.*#+}} ymm1 =
> xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero
> >  ; AVX2-NEXT:    vextracti128 $1, %ymm1, %xmm2
> > -; AVX2-NEXT:    vpextrq $1, %xmm2, %r15
> > -; AVX2-NEXT:    vmovq %xmm2, %r14
> > +; AVX2-NEXT:    vpmovzxdq {{.*#+}} ymm2 =
> xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero
> > +; AVX2-NEXT:    vextracti128 $1, %ymm2, %xmm3
> > +; AVX2-NEXT:    vpextrq $1, %xmm3, %r14
> > +; AVX2-NEXT:    vmovq %xmm3, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Folded
> Spill
> > +; AVX2-NEXT:    vpextrq $1, %xmm2, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte
> Folded Spill
> > +; AVX2-NEXT:    vmovq %xmm2, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Folded
> Spill
> > +; AVX2-NEXT:    vpmovzxdq {{.*#+}} ymm1 =
> xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero
> >  ; AVX2-NEXT:    vpextrq $1, %xmm1, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte
> Folded Spill
> > -; AVX2-NEXT:    vmovq %xmm1, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Folded
> Spill
> > -; AVX2-NEXT:    vextracti128 $1, %ymm10, %xmm1
> > -; AVX2-NEXT:    vpextrq $1, %xmm1, %r13
> > -; AVX2-NEXT:    vmovq %xmm1, %r11
> > -; AVX2-NEXT:    vpmovzxwd {{.*#+}} ymm2 =
> xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
> > -; AVX2-NEXT:    vpmovzxdq {{.*#+}} ymm11 =
> xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero
> > -; AVX2-NEXT:    vextracti128 $1, %ymm0, %xmm0
> > +; AVX2-NEXT:    vmovq %xmm1, %r10
> > +; AVX2-NEXT:    vextracti128 $1, %ymm1, %xmm1
> > +; AVX2-NEXT:    vextracti128 $1, %ymm0, %xmm2
> > +; AVX2-NEXT:    vpmovzxwd {{.*#+}} ymm2 =
> xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero,xmm2[4],zero,xmm2[5],zero,xmm2[6],zero,xmm2[7],zero
> > +; AVX2-NEXT:    vextracti128 $1, %ymm2, %xmm3
> > +; AVX2-NEXT:    vpmovzxdq {{.*#+}} ymm3 =
> xmm3[0],zero,xmm3[1],zero,xmm3[2],zero,xmm3[3],zero
> > +; AVX2-NEXT:    vextracti128 $1, %ymm3, %xmm4
> > +; AVX2-NEXT:    vpextrq $1, %xmm4, %rax
> > +; AVX2-NEXT:    addq %rbx, %rax
> > +; AVX2-NEXT:    movq %rax, %rbx
> > +; AVX2-NEXT:    vmovq %xmm4, %rsi
> > +; AVX2-NEXT:    addq %rbp, %rsi
> > +; AVX2-NEXT:    vpextrq $1, %xmm3, %rax
> > +; AVX2-NEXT:    addq %rdi, %rax
> > +; AVX2-NEXT:    movq %rax, %rdi
> > +; AVX2-NEXT:    vmovq %xmm3, %r11
> > +; AVX2-NEXT:    addq %rcx, %r11
> > +; AVX2-NEXT:    vpmovzxdq {{.*#+}} ymm2 =
> xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero
> > +; AVX2-NEXT:    vextracti128 $1, %ymm2, %xmm3
> > +; AVX2-NEXT:    vpextrq $1, %xmm3, %rcx
> > +; AVX2-NEXT:    addq %rdx, %rcx
> > +; AVX2-NEXT:    vmovq %xmm3, %r8
> > +; AVX2-NEXT:    addq %r9, %r8
> > +; AVX2-NEXT:    vpextrq $1, %xmm2, %r9
> > +; AVX2-NEXT:    addq %r13, %r9
> > +; AVX2-NEXT:    vmovq %xmm2, %r15
> > +; AVX2-NEXT:    addq %r12, %r15
> >  ; AVX2-NEXT:    vpmovzxwd {{.*#+}} ymm0 =
> xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
> > -; AVX2-NEXT:    vextracti128 $1, %ymm0, %xmm1
> > -; AVX2-NEXT:    vpmovzxdq {{.*#+}} ymm8 =
> xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero
> > -; AVX2-NEXT:    vextracti128 $1, %ymm8, %xmm1
> > -; AVX2-NEXT:    vpmovzxdq {{.*#+}} ymm3 =
> xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero
> > -; AVX2-NEXT:    vextracti128 $1, %ymm3, %xmm6
> > -; AVX2-NEXT:    vextracti128 $1, %ymm2, %xmm0
> > -; AVX2-NEXT:    vpmovzxdq {{.*#+}} ymm2 =
> xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero
> > -; AVX2-NEXT:    vmovd %xmm9, %r12d
> > -; AVX2-NEXT:    vpextrd $2, %xmm9, %r9d
> > -; AVX2-NEXT:    vextracti128 $1, %ymm2, %xmm0
> > -; AVX2-NEXT:    vmovd %xmm7, %ecx
> > -; AVX2-NEXT:    vpextrd $2, %xmm7, %edi
> > -; AVX2-NEXT:    vmovd %xmm5, %ebx
> > -; AVX2-NEXT:    vpextrd $2, %xmm5, %esi
> > -; AVX2-NEXT:    vmovd %xmm4, %edx
> > -; AVX2-NEXT:    vpextrd $2, %xmm4, %ebp
> > -; AVX2-NEXT:    vpextrd $2, %xmm1, %eax
> > -; AVX2-NEXT:    leal -1(%rbp,%rax), %eax
> > -; AVX2-NEXT:    movl %eax, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill
> > -; AVX2-NEXT:    vmovd %xmm1, %eax
> > -; AVX2-NEXT:    leal -1(%rdx,%rax), %eax
> > -; AVX2-NEXT:    movl %eax, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill
> > -; AVX2-NEXT:    vpextrd $2, %xmm8, %eax
> > -; AVX2-NEXT:    leal -1(%rsi,%rax), %eax
> > -; AVX2-NEXT:    movl %eax, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill
> > -; AVX2-NEXT:    vmovd %xmm8, %eax
> > -; AVX2-NEXT:    leal -1(%rbx,%rax), %r10d
> > -; AVX2-NEXT:    vpextrd $2, %xmm6, %eax
> > -; AVX2-NEXT:    leal -1(%rdi,%rax), %r8d
> > -; AVX2-NEXT:    vmovd %xmm6, %eax
> > -; AVX2-NEXT:    leal -1(%rcx,%rax), %edi
> > -; AVX2-NEXT:    vpextrd $2, %xmm3, %eax
> > -; AVX2-NEXT:    leal -1(%r9,%rax), %r9d
> > -; AVX2-NEXT:    vmovd %xmm3, %ecx
> > -; AVX2-NEXT:    leal -1(%r12,%rcx), %r12d
> > -; AVX2-NEXT:    vpextrq $1, %xmm0, %rcx
> > -; AVX2-NEXT:    leal -1(%r15,%rcx), %r15d
> > -; AVX2-NEXT:    vmovq %xmm0, %rcx
> > -; AVX2-NEXT:    leal -1(%r14,%rcx), %r14d
> > -; AVX2-NEXT:    vpextrq $1, %xmm2, %rdx
> > -; AVX2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> > -; AVX2-NEXT:    leal -1(%rax,%rdx), %edx
> > +; AVX2-NEXT:    vextracti128 $1, %ymm0, %xmm2
> > +; AVX2-NEXT:    vpmovzxdq {{.*#+}} ymm2 =
> xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero
> > +; AVX2-NEXT:    vextracti128 $1, %ymm2, %xmm3
> > +; AVX2-NEXT:    vpextrq $1, %xmm3, %rax
> > +; AVX2-NEXT:    addq %r14, %rax
> > +; AVX2-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX2-NEXT:    vmovq %xmm3, %rax
> > +; AVX2-NEXT:    addq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Folded
> Reload
> > +; AVX2-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX2-NEXT:    vpextrq $1, %xmm2, %rax
> > +; AVX2-NEXT:    addq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Folded
> Reload
> > +; AVX2-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> >  ; AVX2-NEXT:    vmovq %xmm2, %rax
> > -; AVX2-NEXT:    vextracti128 $1, %ymm11, %xmm0
> > +; AVX2-NEXT:    addq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Folded
> Reload
> > +; AVX2-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX2-NEXT:    vpmovzxdq {{.*#+}} ymm0 =
> xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero
> > +; AVX2-NEXT:    vpextrq $1, %xmm0, %rbp
> > +; AVX2-NEXT:    addq {{[-0-9]+}}(%r{{[sb]}}p), %rbp # 8-byte Folded
> Reload
> > +; AVX2-NEXT:    vmovq %xmm0, %r12
> > +; AVX2-NEXT:    addq %r10, %r12
> > +; AVX2-NEXT:    vpextrq $1, %xmm1, %rax
> > +; AVX2-NEXT:    vextracti128 $1, %ymm0, %xmm0
> > +; AVX2-NEXT:    vpextrq $1, %xmm0, %r10
> > +; AVX2-NEXT:    addq %rax, %r10
> > +; AVX2-NEXT:    vmovq %xmm1, %rax
> > +; AVX2-NEXT:    vmovq %xmm0, %rdx
> > +; AVX2-NEXT:    addq %rax, %rdx
> > +; AVX2-NEXT:    addq $-1, %rbx
> > +; AVX2-NEXT:    movq %rbx, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX2-NEXT:    movl $0, %eax
> > +; AVX2-NEXT:    adcq $-1, %rax
> > +; AVX2-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX2-NEXT:    addq $-1, %rsi
> > +; AVX2-NEXT:    movq %rsi, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX2-NEXT:    movl $0, %eax
> > +; AVX2-NEXT:    adcq $-1, %rax
> > +; AVX2-NEXT:    movq %rax, (%rsp) # 8-byte Spill
> > +; AVX2-NEXT:    addq $-1, %rdi
> > +; AVX2-NEXT:    movq %rdi, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX2-NEXT:    movl $0, %eax
> > +; AVX2-NEXT:    adcq $-1, %rax
> > +; AVX2-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX2-NEXT:    addq $-1, %r11
> > +; AVX2-NEXT:    movq %r11, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX2-NEXT:    movl $0, %eax
> > +; AVX2-NEXT:    adcq $-1, %rax
> > +; AVX2-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX2-NEXT:    addq $-1, %rcx
> > +; AVX2-NEXT:    movq %rcx, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX2-NEXT:    movl $0, %eax
> > +; AVX2-NEXT:    adcq $-1, %rax
> > +; AVX2-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX2-NEXT:    addq $-1, %r8
> > +; AVX2-NEXT:    movq %r8, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX2-NEXT:    movl $0, %eax
> > +; AVX2-NEXT:    adcq $-1, %rax
> > +; AVX2-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX2-NEXT:    addq $-1, %r9
> > +; AVX2-NEXT:    movq %r9, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX2-NEXT:    movl $0, %eax
> > +; AVX2-NEXT:    adcq $-1, %rax
> > +; AVX2-NEXT:    movq %rax, %rsi
> > +; AVX2-NEXT:    addq $-1, %r15
> > +; AVX2-NEXT:    movq %r15, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX2-NEXT:    movl $0, %r15d
> > +; AVX2-NEXT:    adcq $-1, %r15
> > +; AVX2-NEXT:    addq $-1, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Folded Spill
> > +; AVX2-NEXT:    movl $0, %r13d
> > +; AVX2-NEXT:    adcq $-1, %r13
> > +; AVX2-NEXT:    addq $-1, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Folded Spill
> > +; AVX2-NEXT:    movl $0, %r14d
> > +; AVX2-NEXT:    adcq $-1, %r14
> > +; AVX2-NEXT:    addq $-1, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Folded Spill
> > +; AVX2-NEXT:    movl $0, %ebx
> > +; AVX2-NEXT:    adcq $-1, %rbx
> > +; AVX2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> > +; AVX2-NEXT:    addq $-1, %rax
> > +; AVX2-NEXT:    movl $0, %r11d
> > +; AVX2-NEXT:    adcq $-1, %r11
> > +; AVX2-NEXT:    addq $-1, %rbp
> > +; AVX2-NEXT:    movl $0, %r9d
> > +; AVX2-NEXT:    adcq $-1, %r9
> > +; AVX2-NEXT:    addq $-1, %r12
> > +; AVX2-NEXT:    movl $0, %r8d
> > +; AVX2-NEXT:    adcq $-1, %r8
> > +; AVX2-NEXT:    addq $-1, %r10
> > +; AVX2-NEXT:    movl $0, %edi
> > +; AVX2-NEXT:    adcq $-1, %rdi
> > +; AVX2-NEXT:    addq $-1, %rdx
> > +; AVX2-NEXT:    movl $0, %ecx
> > +; AVX2-NEXT:    adcq $-1, %rcx
> > +; AVX2-NEXT:    shldq $63, %rdx, %rcx
> > +; AVX2-NEXT:    movq %rcx, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX2-NEXT:    shldq $63, %r10, %rdi
> > +; AVX2-NEXT:    shldq $63, %r12, %r8
> > +; AVX2-NEXT:    shldq $63, %rbp, %r9
> > +; AVX2-NEXT:    shldq $63, %rax, %r11
> > +; AVX2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rdx # 8-byte Reload
> > +; AVX2-NEXT:    shldq $63, %rdx, %rbx
> > +; AVX2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rdx # 8-byte Reload
> > +; AVX2-NEXT:    shldq $63, %rdx, %r14
> > +; AVX2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rdx # 8-byte Reload
> > +; AVX2-NEXT:    shldq $63, %rdx, %r13
> > +; AVX2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> > +; AVX2-NEXT:    shldq $63, %rax, %r15
> > +; AVX2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> > +; AVX2-NEXT:    shldq $63, %rax, %rsi
> > +; AVX2-NEXT:    movq %rsi, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rsi # 8-byte Reload
> > +; AVX2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> > +; AVX2-NEXT:    shldq $63, %rax, %rsi
> > +; AVX2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %r12 # 8-byte Reload
> > +; AVX2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> > +; AVX2-NEXT:    shldq $63, %rax, %r12
> >  ; AVX2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rcx # 8-byte Reload
> > -; AVX2-NEXT:    leal -1(%rcx,%rax), %eax
> > -; AVX2-NEXT:    vpextrq $1, %xmm0, %rsi
> > -; AVX2-NEXT:    leal -1(%r13,%rsi), %esi
> > -; AVX2-NEXT:    vmovq %xmm0, %rbx
> > -; AVX2-NEXT:    leal -1(%r11,%rbx), %ebx
> > -; AVX2-NEXT:    vpextrq $1, %xmm10, %rcx
> > -; AVX2-NEXT:    vpextrq $1, %xmm11, %r13
> > -; AVX2-NEXT:    leal -1(%rcx,%r13), %ecx
> > -; AVX2-NEXT:    vmovq %xmm10, %r13
> > -; AVX2-NEXT:    vmovq %xmm11, %r11
> > -; AVX2-NEXT:    leaq -1(%r13,%r11), %rbp
> > -; AVX2-NEXT:    shrq %rbp
> > -; AVX2-NEXT:    vmovd %ebp, %xmm0
> > -; AVX2-NEXT:    shrl %ecx
> > -; AVX2-NEXT:    vpinsrb $1, %ecx, %xmm0, %xmm0
> > -; AVX2-NEXT:    shrl %ebx
> > -; AVX2-NEXT:    vpinsrb $2, %ebx, %xmm0, %xmm0
> > -; AVX2-NEXT:    shrl %esi
> > -; AVX2-NEXT:    vpinsrb $3, %esi, %xmm0, %xmm0
> > -; AVX2-NEXT:    shrl %eax
> > -; AVX2-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0
> > -; AVX2-NEXT:    shrl %edx
> > -; AVX2-NEXT:    vpinsrb $5, %edx, %xmm0, %xmm0
> > -; AVX2-NEXT:    shrl %r14d
> > -; AVX2-NEXT:    vpinsrb $6, %r14d, %xmm0, %xmm0
> > -; AVX2-NEXT:    shrl %r15d
> > -; AVX2-NEXT:    vpinsrb $7, %r15d, %xmm0, %xmm0
> > -; AVX2-NEXT:    shrl %r12d
> > -; AVX2-NEXT:    vpinsrb $8, %r12d, %xmm0, %xmm0
> > -; AVX2-NEXT:    shrl %r9d
> > -; AVX2-NEXT:    vpinsrb $9, %r9d, %xmm0, %xmm0
> > -; AVX2-NEXT:    shrl %edi
> > -; AVX2-NEXT:    vpinsrb $10, %edi, %xmm0, %xmm0
> > -; AVX2-NEXT:    shrl %r8d
> > -; AVX2-NEXT:    vpinsrb $11, %r8d, %xmm0, %xmm0
> > -; AVX2-NEXT:    shrl %r10d
> > -; AVX2-NEXT:    vpinsrb $12, %r10d, %xmm0, %xmm0
> > -; AVX2-NEXT:    movl {{[-0-9]+}}(%r{{[sb]}}p), %eax # 4-byte Reload
> > -; AVX2-NEXT:    shrl %eax
> > -; AVX2-NEXT:    vpinsrb $13, %eax, %xmm0, %xmm0
> > -; AVX2-NEXT:    movl {{[-0-9]+}}(%r{{[sb]}}p), %eax # 4-byte Reload
> > -; AVX2-NEXT:    shrl %eax
> > -; AVX2-NEXT:    vpinsrb $14, %eax, %xmm0, %xmm0
> > -; AVX2-NEXT:    movl {{[-0-9]+}}(%r{{[sb]}}p), %eax # 4-byte Reload
> > -; AVX2-NEXT:    shrl %eax
> > -; AVX2-NEXT:    vpinsrb $15, %eax, %xmm0, %xmm0
> > +; AVX2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> > +; AVX2-NEXT:    shldq $63, %rax, %rcx
> > +; AVX2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %r10 # 8-byte Reload
> > +; AVX2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> > +; AVX2-NEXT:    shldq $63, %rax, %r10
> > +; AVX2-NEXT:    movq (%rsp), %rax # 8-byte Reload
> > +; AVX2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rdx # 8-byte Reload
> > +; AVX2-NEXT:    shldq $63, %rdx, %rax
> > +; AVX2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rdx # 8-byte Reload
> > +; AVX2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rbp # 8-byte Reload
> > +; AVX2-NEXT:    shldq $63, %rdx, %rbp
> > +; AVX2-NEXT:    vmovq %rbp, %xmm8
> > +; AVX2-NEXT:    vmovq %rax, %xmm9
> > +; AVX2-NEXT:    vmovq %r10, %xmm0
> > +; AVX2-NEXT:    vmovq %rcx, %xmm1
> > +; AVX2-NEXT:    vmovq %r12, %xmm12
> > +; AVX2-NEXT:    vmovq %rsi, %xmm13
> > +; AVX2-NEXT:    vmovq {{[-0-9]+}}(%r{{[sb]}}p), %xmm14 # 8-byte Folded
> Reload
> > +; AVX2-NEXT:    # xmm14 = mem[0],zero
> > +; AVX2-NEXT:    vmovq %r15, %xmm15
> > +; AVX2-NEXT:    vmovq %r13, %xmm10
> > +; AVX2-NEXT:    vmovq %r14, %xmm11
> > +; AVX2-NEXT:    vmovq %rbx, %xmm2
> > +; AVX2-NEXT:    vmovq %r11, %xmm3
> > +; AVX2-NEXT:    vmovq %r9, %xmm4
> > +; AVX2-NEXT:    vmovq %r8, %xmm5
> > +; AVX2-NEXT:    vmovq %rdi, %xmm6
> > +; AVX2-NEXT:    vmovq {{[-0-9]+}}(%r{{[sb]}}p), %xmm7 # 8-byte Folded
> Reload
> > +; AVX2-NEXT:    # xmm7 = mem[0],zero
> > +; AVX2-NEXT:    vpunpcklbw {{.*#+}} xmm8 =
> xmm9[0],xmm8[0],xmm9[1],xmm8[1],xmm9[2],xmm8[2],xmm9[3],xmm8[3],xmm9[4],xmm8[4],xmm9[5],xmm8[5],xmm9[6],xmm8[6],xmm9[7],xmm8[7]
> > +; AVX2-NEXT:    vpunpcklbw {{.*#+}} xmm9 =
> xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3],xmm1[4],xmm0[4],xmm1[5],xmm0[5],xmm1[6],xmm0[6],xmm1[7],xmm0[7]
> > +; AVX2-NEXT:    vpbroadcastw %xmm8, %xmm8
> > +; AVX2-NEXT:    vpbroadcastw %xmm9, %xmm0
> > +; AVX2-NEXT:    vpblendw {{.*#+}} xmm8 = xmm0[0,1,2,3,4,5,6],xmm8[7]
> > +; AVX2-NEXT:    vpunpcklbw {{.*#+}} xmm0 =
> xmm13[0],xmm12[0],xmm13[1],xmm12[1],xmm13[2],xmm12[2],xmm13[3],xmm12[3],xmm13[4],xmm12[4],xmm13[5],xmm12[5],xmm13[6],xmm12[6],xmm13[7],xmm12[7]
> > +; AVX2-NEXT:    vpunpcklbw {{.*#+}} xmm9 =
> xmm15[0],xmm14[0],xmm15[1],xmm14[1],xmm15[2],xmm14[2],xmm15[3],xmm14[3],xmm15[4],xmm14[4],xmm15[5],xmm14[5],xmm15[6],xmm14[6],xmm15[7],xmm14[7]
> > +; AVX2-NEXT:    vpbroadcastw %xmm0, %xmm0
> > +; AVX2-NEXT:    vpbroadcastw %xmm9, %xmm1
> > +; AVX2-NEXT:    vpblendw {{.*#+}} xmm0 =
> xmm1[0,1,2,3,4],xmm0[5],xmm1[6,7]
> > +; AVX2-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0,1,2],xmm8[3]
> > +; AVX2-NEXT:    vpunpcklbw {{.*#+}} xmm1 =
> xmm11[0],xmm10[0],xmm11[1],xmm10[1],xmm11[2],xmm10[2],xmm11[3],xmm10[3],xmm11[4],xmm10[4],xmm11[5],xmm10[5],xmm11[6],xmm10[6],xmm11[7],xmm10[7]
> > +; AVX2-NEXT:    vpunpcklbw {{.*#+}} xmm2 =
> xmm3[0],xmm2[0],xmm3[1],xmm2[1],xmm3[2],xmm2[2],xmm3[3],xmm2[3],xmm3[4],xmm2[4],xmm3[5],xmm2[5],xmm3[6],xmm2[6],xmm3[7],xmm2[7]
> > +; AVX2-NEXT:    vpbroadcastw %xmm1, %xmm1
> > +; AVX2-NEXT:    vpbroadcastw %xmm2, %xmm2
> > +; AVX2-NEXT:    vpblendw {{.*#+}} xmm1 =
> xmm2[0,1,2],xmm1[3],xmm2[4,5,6,7]
> > +; AVX2-NEXT:    vpunpcklbw {{.*#+}} xmm2 =
> xmm5[0],xmm4[0],xmm5[1],xmm4[1],xmm5[2],xmm4[2],xmm5[3],xmm4[3],xmm5[4],xmm4[4],xmm5[5],xmm4[5],xmm5[6],xmm4[6],xmm5[7],xmm4[7]
> > +; AVX2-NEXT:    vpunpcklbw {{.*#+}} xmm3 =
> xmm7[0],xmm6[0],xmm7[1],xmm6[1],xmm7[2],xmm6[2],xmm7[3],xmm6[3],xmm7[4],xmm6[4],xmm7[5],xmm6[5],xmm7[6],xmm6[6],xmm7[7],xmm6[7]
> > +; AVX2-NEXT:    vpbroadcastw %xmm3, %xmm3
> > +; AVX2-NEXT:    vpblendw {{.*#+}} xmm2 =
> xmm2[0],xmm3[1],xmm2[2,3,4,5,6,7]
> > +; AVX2-NEXT:    vpblendd {{.*#+}} xmm1 = xmm2[0],xmm1[1],xmm2[2,3]
> > +; AVX2-NEXT:    vpblendd {{.*#+}} xmm0 = xmm1[0,1],xmm0[2,3]
> >  ; AVX2-NEXT:    vmovdqu %xmm0, (%rax)
> > +; AVX2-NEXT:    addq $16, %rsp
> >  ; AVX2-NEXT:    popq %rbx
> >  ; AVX2-NEXT:    popq %r12
> >  ; AVX2-NEXT:    popq %r13
> > @@ -2280,139 +2512,414 @@ define void @not_avg_v16i8_wide_constant
> >  ; AVX2-NEXT:    vzeroupper
> >  ; AVX2-NEXT:    retq
> >  ;
> > -; AVX512-LABEL: not_avg_v16i8_wide_constants:
> > -; AVX512:       # %bb.0:
> > -; AVX512-NEXT:    pushq %rbp
> > -; AVX512-NEXT:    pushq %r15
> > -; AVX512-NEXT:    pushq %r14
> > -; AVX512-NEXT:    pushq %r13
> > -; AVX512-NEXT:    pushq %r12
> > -; AVX512-NEXT:    pushq %rbx
> > -; AVX512-NEXT:    vpmovzxbw {{.*#+}} ymm1 =
> mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
> > -; AVX512-NEXT:    vpmovzxbw {{.*#+}} ymm0 =
> mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
> > -; AVX512-NEXT:    vpmovzxwd {{.*#+}} ymm2 =
> xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero
> > -; AVX512-NEXT:    vpmovzxdq {{.*#+}} ymm10 =
> xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero
> > -; AVX512-NEXT:    vextracti128 $1, %ymm1, %xmm1
> > -; AVX512-NEXT:    vpmovzxwd {{.*#+}} ymm1 =
> xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero
> > -; AVX512-NEXT:    vextracti128 $1, %ymm1, %xmm3
> > -; AVX512-NEXT:    vpmovzxdq {{.*#+}} ymm5 =
> xmm3[0],zero,xmm3[1],zero,xmm3[2],zero,xmm3[3],zero
> > -; AVX512-NEXT:    vextracti128 $1, %ymm5, %xmm4
> > -; AVX512-NEXT:    vpmovzxdq {{.*#+}} ymm9 =
> xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero
> > -; AVX512-NEXT:    vextracti128 $1, %ymm9, %xmm7
> > -; AVX512-NEXT:    vextracti128 $1, %ymm2, %xmm1
> > -; AVX512-NEXT:    vpmovzxdq {{.*#+}} ymm1 =
> xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero
> > -; AVX512-NEXT:    vextracti128 $1, %ymm1, %xmm2
> > -; AVX512-NEXT:    vpextrq $1, %xmm2, %r15
> > -; AVX512-NEXT:    vmovq %xmm2, %r14
> > -; AVX512-NEXT:    vpextrq $1, %xmm1, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte
> Folded Spill
> > -; AVX512-NEXT:    vmovq %xmm1, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Folded
> Spill
> > -; AVX512-NEXT:    vextracti128 $1, %ymm10, %xmm1
> > -; AVX512-NEXT:    vpextrq $1, %xmm1, %r13
> > -; AVX512-NEXT:    vmovq %xmm1, %r11
> > -; AVX512-NEXT:    vpmovzxwd {{.*#+}} ymm2 =
> xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
> > -; AVX512-NEXT:    vpmovzxdq {{.*#+}} ymm11 =
> xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero
> > -; AVX512-NEXT:    vextracti128 $1, %ymm0, %xmm0
> > -; AVX512-NEXT:    vpmovzxwd {{.*#+}} ymm0 =
> xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
> > -; AVX512-NEXT:    vextracti128 $1, %ymm0, %xmm1
> > -; AVX512-NEXT:    vpmovzxdq {{.*#+}} ymm8 =
> xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero
> > -; AVX512-NEXT:    vextracti128 $1, %ymm8, %xmm1
> > -; AVX512-NEXT:    vpmovzxdq {{.*#+}} ymm3 =
> xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero
> > -; AVX512-NEXT:    vextracti128 $1, %ymm3, %xmm6
> > -; AVX512-NEXT:    vextracti128 $1, %ymm2, %xmm0
> > -; AVX512-NEXT:    vpmovzxdq {{.*#+}} ymm2 =
> xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero
> > -; AVX512-NEXT:    vmovd %xmm9, %r12d
> > -; AVX512-NEXT:    vpextrd $2, %xmm9, %r9d
> > -; AVX512-NEXT:    vextracti128 $1, %ymm2, %xmm0
> > -; AVX512-NEXT:    vmovd %xmm7, %ecx
> > -; AVX512-NEXT:    vpextrd $2, %xmm7, %edi
> > -; AVX512-NEXT:    vmovd %xmm5, %ebx
> > -; AVX512-NEXT:    vpextrd $2, %xmm5, %esi
> > -; AVX512-NEXT:    vmovd %xmm4, %edx
> > -; AVX512-NEXT:    vpextrd $2, %xmm4, %ebp
> > -; AVX512-NEXT:    vpextrd $2, %xmm1, %eax
> > -; AVX512-NEXT:    leal -1(%rbp,%rax), %eax
> > -; AVX512-NEXT:    movl %eax, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill
> > -; AVX512-NEXT:    vmovd %xmm1, %eax
> > -; AVX512-NEXT:    leal -1(%rdx,%rax), %eax
> > -; AVX512-NEXT:    movl %eax, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill
> > -; AVX512-NEXT:    vpextrd $2, %xmm8, %eax
> > -; AVX512-NEXT:    leal -1(%rsi,%rax), %eax
> > -; AVX512-NEXT:    movl %eax, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill
> > -; AVX512-NEXT:    vmovd %xmm8, %eax
> > -; AVX512-NEXT:    leal -1(%rbx,%rax), %r10d
> > -; AVX512-NEXT:    vpextrd $2, %xmm6, %eax
> > -; AVX512-NEXT:    leal -1(%rdi,%rax), %r8d
> > -; AVX512-NEXT:    vmovd %xmm6, %eax
> > -; AVX512-NEXT:    leal -1(%rcx,%rax), %edi
> > -; AVX512-NEXT:    vpextrd $2, %xmm3, %eax
> > -; AVX512-NEXT:    leal -1(%r9,%rax), %r9d
> > -; AVX512-NEXT:    vmovd %xmm3, %ecx
> > -; AVX512-NEXT:    leal -1(%r12,%rcx), %r12d
> > -; AVX512-NEXT:    vpextrq $1, %xmm0, %rcx
> > -; AVX512-NEXT:    leal -1(%r15,%rcx), %r15d
> > -; AVX512-NEXT:    vmovq %xmm0, %rcx
> > -; AVX512-NEXT:    leal -1(%r14,%rcx), %r14d
> > -; AVX512-NEXT:    vpextrq $1, %xmm2, %rdx
> > -; AVX512-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> > -; AVX512-NEXT:    leal -1(%rax,%rdx), %edx
> > -; AVX512-NEXT:    vmovq %xmm2, %rax
> > -; AVX512-NEXT:    vextracti128 $1, %ymm11, %xmm0
> > -; AVX512-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rcx # 8-byte Reload
> > -; AVX512-NEXT:    leal -1(%rcx,%rax), %eax
> > -; AVX512-NEXT:    vpextrq $1, %xmm0, %rsi
> > -; AVX512-NEXT:    leal -1(%r13,%rsi), %esi
> > -; AVX512-NEXT:    vmovq %xmm0, %rbx
> > -; AVX512-NEXT:    leal -1(%r11,%rbx), %ebx
> > -; AVX512-NEXT:    vpextrq $1, %xmm10, %rcx
> > -; AVX512-NEXT:    vpextrq $1, %xmm11, %r13
> > -; AVX512-NEXT:    leal -1(%rcx,%r13), %ecx
> > -; AVX512-NEXT:    vmovq %xmm10, %r13
> > -; AVX512-NEXT:    vmovq %xmm11, %r11
> > -; AVX512-NEXT:    leaq -1(%r13,%r11), %rbp
> > -; AVX512-NEXT:    shrq %rbp
> > -; AVX512-NEXT:    vmovd %ebp, %xmm0
> > -; AVX512-NEXT:    shrl %ecx
> > -; AVX512-NEXT:    vpinsrb $1, %ecx, %xmm0, %xmm0
> > -; AVX512-NEXT:    shrl %ebx
> > -; AVX512-NEXT:    vpinsrb $2, %ebx, %xmm0, %xmm0
> > -; AVX512-NEXT:    shrl %esi
> > -; AVX512-NEXT:    vpinsrb $3, %esi, %xmm0, %xmm0
> > -; AVX512-NEXT:    shrl %eax
> > -; AVX512-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0
> > -; AVX512-NEXT:    shrl %edx
> > -; AVX512-NEXT:    vpinsrb $5, %edx, %xmm0, %xmm0
> > -; AVX512-NEXT:    shrl %r14d
> > -; AVX512-NEXT:    vpinsrb $6, %r14d, %xmm0, %xmm0
> > -; AVX512-NEXT:    shrl %r15d
> > -; AVX512-NEXT:    vpinsrb $7, %r15d, %xmm0, %xmm0
> > -; AVX512-NEXT:    shrl %r12d
> > -; AVX512-NEXT:    vpinsrb $8, %r12d, %xmm0, %xmm0
> > -; AVX512-NEXT:    shrl %r9d
> > -; AVX512-NEXT:    vpinsrb $9, %r9d, %xmm0, %xmm0
> > -; AVX512-NEXT:    shrl %edi
> > -; AVX512-NEXT:    vpinsrb $10, %edi, %xmm0, %xmm0
> > -; AVX512-NEXT:    shrl %r8d
> > -; AVX512-NEXT:    vpinsrb $11, %r8d, %xmm0, %xmm0
> > -; AVX512-NEXT:    shrl %r10d
> > -; AVX512-NEXT:    vpinsrb $12, %r10d, %xmm0, %xmm0
> > -; AVX512-NEXT:    movl {{[-0-9]+}}(%r{{[sb]}}p), %eax # 4-byte Reload
> > -; AVX512-NEXT:    shrl %eax
> > -; AVX512-NEXT:    vpinsrb $13, %eax, %xmm0, %xmm0
> > -; AVX512-NEXT:    movl {{[-0-9]+}}(%r{{[sb]}}p), %eax # 4-byte Reload
> > -; AVX512-NEXT:    shrl %eax
> > -; AVX512-NEXT:    vpinsrb $14, %eax, %xmm0, %xmm0
> > -; AVX512-NEXT:    movl {{[-0-9]+}}(%r{{[sb]}}p), %eax # 4-byte Reload
> > -; AVX512-NEXT:    shrl %eax
> > -; AVX512-NEXT:    vpinsrb $15, %eax, %xmm0, %xmm0
> > -; AVX512-NEXT:    vmovdqu %xmm0, (%rax)
> > -; AVX512-NEXT:    popq %rbx
> > -; AVX512-NEXT:    popq %r12
> > -; AVX512-NEXT:    popq %r13
> > -; AVX512-NEXT:    popq %r14
> > -; AVX512-NEXT:    popq %r15
> > -; AVX512-NEXT:    popq %rbp
> > -; AVX512-NEXT:    vzeroupper
> > -; AVX512-NEXT:    retq
> > +; AVX512F-LABEL: not_avg_v16i8_wide_constants:
> > +; AVX512F:       # %bb.0:
> > +; AVX512F-NEXT:    pushq %rbp
> > +; AVX512F-NEXT:    pushq %r15
> > +; AVX512F-NEXT:    pushq %r14
> > +; AVX512F-NEXT:    pushq %r13
> > +; AVX512F-NEXT:    pushq %r12
> > +; AVX512F-NEXT:    pushq %rbx
> > +; AVX512F-NEXT:    vpmovzxbw {{.*#+}} ymm1 =
> mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
> > +; AVX512F-NEXT:    vpmovzxbw {{.*#+}} ymm3 =
> mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
> > +; AVX512F-NEXT:    vpmovzxwd {{.*#+}} ymm2 =
> xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero
> > +; AVX512F-NEXT:    vpmovzxdq {{.*#+}} ymm0 =
> xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero
> > +; AVX512F-NEXT:    vextracti128 $1, %ymm1, %xmm1
> > +; AVX512F-NEXT:    vpmovzxwd {{.*#+}} ymm1 =
> xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero
> > +; AVX512F-NEXT:    vextracti128 $1, %ymm1, %xmm4
> > +; AVX512F-NEXT:    vpmovzxdq {{.*#+}} ymm4 =
> xmm4[0],zero,xmm4[1],zero,xmm4[2],zero,xmm4[3],zero
> > +; AVX512F-NEXT:    vextracti128 $1, %ymm4, %xmm5
> > +; AVX512F-NEXT:    vpextrq $1, %xmm5, %rdx
> > +; AVX512F-NEXT:    vmovq %xmm5, %rcx
> > +; AVX512F-NEXT:    vpextrq $1, %xmm4, %rax
> > +; AVX512F-NEXT:    vmovq %xmm4, %rbx
> > +; AVX512F-NEXT:    vpmovzxdq {{.*#+}} ymm1 =
> xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero
> > +; AVX512F-NEXT:    vextracti128 $1, %ymm1, %xmm4
> > +; AVX512F-NEXT:    vpextrq $1, %xmm4, %rdi
> > +; AVX512F-NEXT:    vmovq %xmm4, %rsi
> > +; AVX512F-NEXT:    vpextrq $1, %xmm1, %r13
> > +; AVX512F-NEXT:    vmovq %xmm1, %r15
> > +; AVX512F-NEXT:    vextracti128 $1, %ymm2, %xmm1
> > +; AVX512F-NEXT:    vpmovzxdq {{.*#+}} ymm1 =
> xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero
> > +; AVX512F-NEXT:    vextracti128 $1, %ymm1, %xmm2
> > +; AVX512F-NEXT:    vpextrq $1, %xmm2, %r12
> > +; AVX512F-NEXT:    vmovq %xmm2, %r14
> > +; AVX512F-NEXT:    vpextrq $1, %xmm1, %r11
> > +; AVX512F-NEXT:    vmovq %xmm1, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte
> Folded Spill
> > +; AVX512F-NEXT:    vextracti128 $1, %ymm0, %xmm1
> > +; AVX512F-NEXT:    vpextrq $1, %xmm1, %r10
> > +; AVX512F-NEXT:    vmovq %xmm1, %r9
> > +; AVX512F-NEXT:    vpmovzxwd {{.*#+}} ymm2 =
> xmm3[0],zero,xmm3[1],zero,xmm3[2],zero,xmm3[3],zero,xmm3[4],zero,xmm3[5],zero,xmm3[6],zero,xmm3[7],zero
> > +; AVX512F-NEXT:    vpmovzxdq {{.*#+}} ymm1 =
> xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero
> > +; AVX512F-NEXT:    vextracti128 $1, %ymm3, %xmm3
> > +; AVX512F-NEXT:    vpmovzxwd {{.*#+}} ymm3 =
> xmm3[0],zero,xmm3[1],zero,xmm3[2],zero,xmm3[3],zero,xmm3[4],zero,xmm3[5],zero,xmm3[6],zero,xmm3[7],zero
> > +; AVX512F-NEXT:    vextracti128 $1, %ymm3, %xmm4
> > +; AVX512F-NEXT:    vpmovzxdq {{.*#+}} ymm4 =
> xmm4[0],zero,xmm4[1],zero,xmm4[2],zero,xmm4[3],zero
> > +; AVX512F-NEXT:    vextracti128 $1, %ymm4, %xmm5
> > +; AVX512F-NEXT:    vpextrq $1, %xmm5, %rbp
> > +; AVX512F-NEXT:    leal -1(%rdx,%rbp), %edx
> > +; AVX512F-NEXT:    movl %edx, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill
> > +; AVX512F-NEXT:    vmovq %xmm5, %rbp
> > +; AVX512F-NEXT:    leal -1(%rcx,%rbp), %ecx
> > +; AVX512F-NEXT:    movl %ecx, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill
> > +; AVX512F-NEXT:    vpextrq $1, %xmm4, %rbp
> > +; AVX512F-NEXT:    leal -1(%rax,%rbp), %eax
> > +; AVX512F-NEXT:    movl %eax, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill
> > +; AVX512F-NEXT:    vmovq %xmm4, %rbp
> > +; AVX512F-NEXT:    vpmovzxdq {{.*#+}} ymm3 =
> xmm3[0],zero,xmm3[1],zero,xmm3[2],zero,xmm3[3],zero
> > +; AVX512F-NEXT:    vextracti128 $1, %ymm3, %xmm4
> > +; AVX512F-NEXT:    leal -1(%rbx,%rbp), %r8d
> > +; AVX512F-NEXT:    vpextrq $1, %xmm4, %rbp
> > +; AVX512F-NEXT:    leal -1(%rdi,%rbp), %edi
> > +; AVX512F-NEXT:    vmovq %xmm4, %rbp
> > +; AVX512F-NEXT:    leal -1(%rsi,%rbp), %esi
> > +; AVX512F-NEXT:    vpextrq $1, %xmm3, %rbp
> > +; AVX512F-NEXT:    leal -1(%r13,%rbp), %r13d
> > +; AVX512F-NEXT:    vmovq %xmm3, %rbp
> > +; AVX512F-NEXT:    vextracti128 $1, %ymm2, %xmm2
> > +; AVX512F-NEXT:    vpmovzxdq {{.*#+}} ymm2 =
> xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero
> > +; AVX512F-NEXT:    vextracti128 $1, %ymm2, %xmm3
> > +; AVX512F-NEXT:    leal -1(%r15,%rbp), %r15d
> > +; AVX512F-NEXT:    vpextrq $1, %xmm3, %rbp
> > +; AVX512F-NEXT:    leal -1(%r12,%rbp), %r12d
> > +; AVX512F-NEXT:    vmovq %xmm3, %rbp
> > +; AVX512F-NEXT:    leal -1(%r14,%rbp), %r14d
> > +; AVX512F-NEXT:    vpextrq $1, %xmm2, %rdx
> > +; AVX512F-NEXT:    leal -1(%r11,%rdx), %r11d
> > +; AVX512F-NEXT:    vmovq %xmm2, %rbp
> > +; AVX512F-NEXT:    vextracti128 $1, %ymm1, %xmm2
> > +; AVX512F-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> > +; AVX512F-NEXT:    leal -1(%rax,%rbp), %ebp
> > +; AVX512F-NEXT:    vpextrq $1, %xmm2, %rcx
> > +; AVX512F-NEXT:    leal -1(%r10,%rcx), %ecx
> > +; AVX512F-NEXT:    vmovq %xmm2, %rax
> > +; AVX512F-NEXT:    leal -1(%r9,%rax), %eax
> > +; AVX512F-NEXT:    vpextrq $1, %xmm0, %rdx
> > +; AVX512F-NEXT:    vpextrq $1, %xmm1, %r10
> > +; AVX512F-NEXT:    leal -1(%rdx,%r10), %edx
> > +; AVX512F-NEXT:    vmovq %xmm0, %r10
> > +; AVX512F-NEXT:    vmovq %xmm1, %r9
> > +; AVX512F-NEXT:    leaq -1(%r10,%r9), %rbx
> > +; AVX512F-NEXT:    shrq %rbx
> > +; AVX512F-NEXT:    vmovd %ebx, %xmm0
> > +; AVX512F-NEXT:    shrl %edx
> > +; AVX512F-NEXT:    vpinsrb $1, %edx, %xmm0, %xmm0
> > +; AVX512F-NEXT:    shrl %eax
> > +; AVX512F-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0
> > +; AVX512F-NEXT:    shrl %ecx
> > +; AVX512F-NEXT:    vpinsrb $3, %ecx, %xmm0, %xmm0
> > +; AVX512F-NEXT:    shrl %ebp
> > +; AVX512F-NEXT:    vpinsrb $4, %ebp, %xmm0, %xmm0
> > +; AVX512F-NEXT:    shrl %r11d
> > +; AVX512F-NEXT:    vpinsrb $5, %r11d, %xmm0, %xmm0
> > +; AVX512F-NEXT:    shrl %r14d
> > +; AVX512F-NEXT:    vpinsrb $6, %r14d, %xmm0, %xmm0
> > +; AVX512F-NEXT:    shrl %r12d
> > +; AVX512F-NEXT:    vpinsrb $7, %r12d, %xmm0, %xmm0
> > +; AVX512F-NEXT:    shrl %r15d
> > +; AVX512F-NEXT:    vpinsrb $8, %r15d, %xmm0, %xmm0
> > +; AVX512F-NEXT:    shrl %r13d
> > +; AVX512F-NEXT:    vpinsrb $9, %r13d, %xmm0, %xmm0
> > +; AVX512F-NEXT:    shrl %esi
> > +; AVX512F-NEXT:    vpinsrb $10, %esi, %xmm0, %xmm0
> > +; AVX512F-NEXT:    shrl %edi
> > +; AVX512F-NEXT:    vpinsrb $11, %edi, %xmm0, %xmm0
> > +; AVX512F-NEXT:    shrl %r8d
> > +; AVX512F-NEXT:    vpinsrb $12, %r8d, %xmm0, %xmm0
> > +; AVX512F-NEXT:    movl {{[-0-9]+}}(%r{{[sb]}}p), %eax # 4-byte Reload
> > +; AVX512F-NEXT:    shrl %eax
> > +; AVX512F-NEXT:    vpinsrb $13, %eax, %xmm0, %xmm0
> > +; AVX512F-NEXT:    movl {{[-0-9]+}}(%r{{[sb]}}p), %eax # 4-byte Reload
> > +; AVX512F-NEXT:    shrl %eax
> > +; AVX512F-NEXT:    vpinsrb $14, %eax, %xmm0, %xmm0
> > +; AVX512F-NEXT:    movl {{[-0-9]+}}(%r{{[sb]}}p), %eax # 4-byte Reload
> > +; AVX512F-NEXT:    shrl %eax
> > +; AVX512F-NEXT:    vpinsrb $15, %eax, %xmm0, %xmm0
> > +; AVX512F-NEXT:    vmovdqu %xmm0, (%rax)
> > +; AVX512F-NEXT:    popq %rbx
> > +; AVX512F-NEXT:    popq %r12
> > +; AVX512F-NEXT:    popq %r13
> > +; AVX512F-NEXT:    popq %r14
> > +; AVX512F-NEXT:    popq %r15
> > +; AVX512F-NEXT:    popq %rbp
> > +; AVX512F-NEXT:    vzeroupper
> > +; AVX512F-NEXT:    retq
> > +;
> > +; AVX512BW-LABEL: not_avg_v16i8_wide_constants:
> > +; AVX512BW:       # %bb.0:
> > +; AVX512BW-NEXT:    pushq %rbp
> > +; AVX512BW-NEXT:    pushq %r15
> > +; AVX512BW-NEXT:    pushq %r14
> > +; AVX512BW-NEXT:    pushq %r13
> > +; AVX512BW-NEXT:    pushq %r12
> > +; AVX512BW-NEXT:    pushq %rbx
> > +; AVX512BW-NEXT:    subq $24, %rsp
> > +; AVX512BW-NEXT:    vpmovzxbw {{.*#+}} ymm0 =
> mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
> > +; AVX512BW-NEXT:    vpmovzxbw {{.*#+}} ymm1 =
> mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
> > +; AVX512BW-NEXT:    vpmovzxwd {{.*#+}} ymm2 =
> xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
> > +; AVX512BW-NEXT:    vpmovzxdq {{.*#+}} ymm3 =
> xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero
> > +; AVX512BW-NEXT:    vextracti128 $1, %ymm3, %xmm4
> > +; AVX512BW-NEXT:    vmovq %xmm4, %rbx
> > +; AVX512BW-NEXT:    vpextrq $1, %xmm4, %rbp
> > +; AVX512BW-NEXT:    vmovq %xmm3, %rdi
> > +; AVX512BW-NEXT:    vpextrq $1, %xmm3, %rsi
> > +; AVX512BW-NEXT:    vextracti128 $1, %ymm2, %xmm2
> > +; AVX512BW-NEXT:    vpmovzxdq {{.*#+}} ymm2 =
> xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero
> > +; AVX512BW-NEXT:    vextracti128 $1, %ymm2, %xmm3
> > +; AVX512BW-NEXT:    vmovq %xmm3, %rdx
> > +; AVX512BW-NEXT:    vpextrq $1, %xmm3, %r15
> > +; AVX512BW-NEXT:    vmovq %xmm2, %r8
> > +; AVX512BW-NEXT:    vpextrq $1, %xmm2, %r14
> > +; AVX512BW-NEXT:    vextracti128 $1, %ymm0, %xmm0
> > +; AVX512BW-NEXT:    vpmovzxwd {{.*#+}} ymm0 =
> xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
> > +; AVX512BW-NEXT:    vpmovzxdq {{.*#+}} ymm2 =
> xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero
> > +; AVX512BW-NEXT:    vextracti128 $1, %ymm2, %xmm3
> > +; AVX512BW-NEXT:    vmovq %xmm3, %r9
> > +; AVX512BW-NEXT:    vpextrq $1, %xmm3, %r10
> > +; AVX512BW-NEXT:    vmovq %xmm2, %r11
> > +; AVX512BW-NEXT:    vpextrq $1, %xmm2, {{[-0-9]+}}(%r{{[sb]}}p) #
> 8-byte Folded Spill
> > +; AVX512BW-NEXT:    vextracti128 $1, %ymm0, %xmm0
> > +; AVX512BW-NEXT:    vpmovzxdq {{.*#+}} ymm0 =
> xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero
> > +; AVX512BW-NEXT:    vextracti128 $1, %ymm0, %xmm2
> > +; AVX512BW-NEXT:    vmovq %xmm2, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte
> Folded Spill
> > +; AVX512BW-NEXT:    vpextrq $1, %xmm2, %r13
> > +; AVX512BW-NEXT:    vpmovzxwd {{.*#+}} ymm2 =
> xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero
> > +; AVX512BW-NEXT:    vpmovzxdq {{.*#+}} ymm3 =
> xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero
> > +; AVX512BW-NEXT:    vextracti128 $1, %ymm3, %xmm4
> > +; AVX512BW-NEXT:    vmovq %xmm4, %rax
> > +; AVX512BW-NEXT:    addq %rbx, %rax
> > +; AVX512BW-NEXT:    movq %rax, %rbx
> > +; AVX512BW-NEXT:    vpextrq $1, %xmm4, %rax
> > +; AVX512BW-NEXT:    addq %rbp, %rax
> > +; AVX512BW-NEXT:    movq %rax, %rbp
> > +; AVX512BW-NEXT:    vmovq %xmm3, %rcx
> > +; AVX512BW-NEXT:    addq %rdi, %rcx
> > +; AVX512BW-NEXT:    vpextrq $1, %xmm3, %r12
> > +; AVX512BW-NEXT:    addq %rsi, %r12
> > +; AVX512BW-NEXT:    vextracti128 $1, %ymm2, %xmm2
> > +; AVX512BW-NEXT:    vpmovzxdq {{.*#+}} ymm2 =
> xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero
> > +; AVX512BW-NEXT:    vextracti128 $1, %ymm2, %xmm3
> > +; AVX512BW-NEXT:    vmovq %xmm3, %rax
> > +; AVX512BW-NEXT:    addq %rdx, %rax
> > +; AVX512BW-NEXT:    movq %rax, %rdx
> > +; AVX512BW-NEXT:    vpextrq $1, %xmm3, %rax
> > +; AVX512BW-NEXT:    addq %r15, %rax
> > +; AVX512BW-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX512BW-NEXT:    vmovq %xmm2, %rax
> > +; AVX512BW-NEXT:    addq %r8, %rax
> > +; AVX512BW-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX512BW-NEXT:    vpextrq $1, %xmm2, %rax
> > +; AVX512BW-NEXT:    addq %r14, %rax
> > +; AVX512BW-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX512BW-NEXT:    vextracti128 $1, %ymm1, %xmm1
> > +; AVX512BW-NEXT:    vpmovzxwd {{.*#+}} ymm1 =
> xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero
> > +; AVX512BW-NEXT:    vpmovzxdq {{.*#+}} ymm2 =
> xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero
> > +; AVX512BW-NEXT:    vextracti128 $1, %ymm2, %xmm3
> > +; AVX512BW-NEXT:    vmovq %xmm3, %rax
> > +; AVX512BW-NEXT:    addq %r9, %rax
> > +; AVX512BW-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX512BW-NEXT:    vpextrq $1, %xmm3, %rax
> > +; AVX512BW-NEXT:    addq %r10, %rax
> > +; AVX512BW-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX512BW-NEXT:    vmovq %xmm2, %rax
> > +; AVX512BW-NEXT:    addq %r11, %rax
> > +; AVX512BW-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX512BW-NEXT:    vpextrq $1, %xmm2, %r14
> > +; AVX512BW-NEXT:    addq {{[-0-9]+}}(%r{{[sb]}}p), %r14 # 8-byte Folded
> Reload
> > +; AVX512BW-NEXT:    vextracti128 $1, %ymm1, %xmm1
> > +; AVX512BW-NEXT:    vpmovzxdq {{.*#+}} ymm1 =
> xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero
> > +; AVX512BW-NEXT:    vextracti128 $1, %ymm1, %xmm2
> > +; AVX512BW-NEXT:    vmovq %xmm2, %r10
> > +; AVX512BW-NEXT:    addq {{[-0-9]+}}(%r{{[sb]}}p), %r10 # 8-byte Folded
> Reload
> > +; AVX512BW-NEXT:    vpextrq $1, %xmm2, %r9
> > +; AVX512BW-NEXT:    addq %r13, %r9
> > +; AVX512BW-NEXT:    vmovq %xmm0, %rax
> > +; AVX512BW-NEXT:    vmovq %xmm1, %r8
> > +; AVX512BW-NEXT:    addq %rax, %r8
> > +; AVX512BW-NEXT:    vpextrq $1, %xmm0, %rdi
> > +; AVX512BW-NEXT:    vpextrq $1, %xmm1, %rsi
> > +; AVX512BW-NEXT:    addq %rdi, %rsi
> > +; AVX512BW-NEXT:    addq $-1, %rbx
> > +; AVX512BW-NEXT:    movq %rbx, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX512BW-NEXT:    movl $0, %r15d
> > +; AVX512BW-NEXT:    adcq $-1, %r15
> > +; AVX512BW-NEXT:    addq $-1, %rbp
> > +; AVX512BW-NEXT:    movq %rbp, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX512BW-NEXT:    movl $0, %ebx
> > +; AVX512BW-NEXT:    adcq $-1, %rbx
> > +; AVX512BW-NEXT:    addq $-1, %rcx
> > +; AVX512BW-NEXT:    movq %rcx, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX512BW-NEXT:    movl $0, %r11d
> > +; AVX512BW-NEXT:    adcq $-1, %r11
> > +; AVX512BW-NEXT:    addq $-1, %r12
> > +; AVX512BW-NEXT:    movq %r12, (%rsp) # 8-byte Spill
> > +; AVX512BW-NEXT:    movl $0, %edi
> > +; AVX512BW-NEXT:    adcq $-1, %rdi
> > +; AVX512BW-NEXT:    addq $-1, %rdx
> > +; AVX512BW-NEXT:    movq %rdx, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX512BW-NEXT:    movl $0, %eax
> > +; AVX512BW-NEXT:    adcq $-1, %rax
> > +; AVX512BW-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX512BW-NEXT:    addq $-1, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Folded
> Spill
> > +; AVX512BW-NEXT:    movl $0, %eax
> > +; AVX512BW-NEXT:    adcq $-1, %rax
> > +; AVX512BW-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX512BW-NEXT:    addq $-1, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Folded
> Spill
> > +; AVX512BW-NEXT:    movl $0, %r13d
> > +; AVX512BW-NEXT:    adcq $-1, %r13
> > +; AVX512BW-NEXT:    addq $-1, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Folded
> Spill
> > +; AVX512BW-NEXT:    movl $0, %r12d
> > +; AVX512BW-NEXT:    adcq $-1, %r12
> > +; AVX512BW-NEXT:    addq $-1, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Folded
> Spill
> > +; AVX512BW-NEXT:    movl $0, %eax
> > +; AVX512BW-NEXT:    adcq $-1, %rax
> > +; AVX512BW-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX512BW-NEXT:    addq $-1, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Folded
> Spill
> > +; AVX512BW-NEXT:    movl $0, %eax
> > +; AVX512BW-NEXT:    adcq $-1, %rax
> > +; AVX512BW-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX512BW-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rcx # 8-byte Reload
> > +; AVX512BW-NEXT:    addq $-1, %rcx
> > +; AVX512BW-NEXT:    movl $0, %eax
> > +; AVX512BW-NEXT:    adcq $-1, %rax
> > +; AVX512BW-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX512BW-NEXT:    addq $-1, %r14
> > +; AVX512BW-NEXT:    movl $0, %eax
> > +; AVX512BW-NEXT:    adcq $-1, %rax
> > +; AVX512BW-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX512BW-NEXT:    addq $-1, %r10
> > +; AVX512BW-NEXT:    movl $0, %eax
> > +; AVX512BW-NEXT:    adcq $-1, %rax
> > +; AVX512BW-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX512BW-NEXT:    addq $-1, %r9
> > +; AVX512BW-NEXT:    movl $0, %edx
> > +; AVX512BW-NEXT:    adcq $-1, %rdx
> > +; AVX512BW-NEXT:    addq $-1, %r8
> > +; AVX512BW-NEXT:    movl $0, %eax
> > +; AVX512BW-NEXT:    adcq $-1, %rax
> > +; AVX512BW-NEXT:    addq $-1, %rsi
> > +; AVX512BW-NEXT:    movl $0, %ebp
> > +; AVX512BW-NEXT:    adcq $-1, %rbp
> > +; AVX512BW-NEXT:    shldq $63, %rsi, %rbp
> > +; AVX512BW-NEXT:    movq %rbp, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX512BW-NEXT:    shldq $63, %r8, %rax
> > +; AVX512BW-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> > +; AVX512BW-NEXT:    shldq $63, %r9, %rdx
> > +; AVX512BW-NEXT:    movq %rdx, %rbp
> > +; AVX512BW-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %r8 # 8-byte Reload
> > +; AVX512BW-NEXT:    shldq $63, %r10, %r8
> > +; AVX512BW-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %r10 # 8-byte Reload
> > +; AVX512BW-NEXT:    shldq $63, %r14, %r10
> > +; AVX512BW-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %r9 # 8-byte Reload
> > +; AVX512BW-NEXT:    shldq $63, %rcx, %r9
> > +; AVX512BW-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %r14 # 8-byte Reload
> > +; AVX512BW-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> > +; AVX512BW-NEXT:    shldq $63, %rax, %r14
> > +; AVX512BW-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> > +; AVX512BW-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rsi # 8-byte Reload
> > +; AVX512BW-NEXT:    shldq $63, %rax, %rsi
> > +; AVX512BW-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> > +; AVX512BW-NEXT:    shldq $63, %rax, %r12
> > +; AVX512BW-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> > +; AVX512BW-NEXT:    shldq $63, %rax, %r13
> > +; AVX512BW-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> > +; AVX512BW-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rdx # 8-byte Reload
> > +; AVX512BW-NEXT:    shldq $63, %rax, %rdx
> > +; AVX512BW-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rcx # 8-byte Reload
> > +; AVX512BW-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> > +; AVX512BW-NEXT:    shldq $63, %rax, %rcx
> > +; AVX512BW-NEXT:    movq (%rsp), %rax # 8-byte Reload
> > +; AVX512BW-NEXT:    shldq $63, %rax, %rdi
> > +; AVX512BW-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> > +; AVX512BW-NEXT:    shldq $63, %rax, %r11
> > +; AVX512BW-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> > +; AVX512BW-NEXT:    shldq $63, %rax, %rbx
> > +; AVX512BW-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> > +; AVX512BW-NEXT:    shldq $63, %rax, %r15
> > +; AVX512BW-NEXT:    vmovq %r15, %xmm0
> > +; AVX512BW-NEXT:    vmovq %rbx, %xmm1
> > +; AVX512BW-NEXT:    vmovq %r11, %xmm2
> > +; AVX512BW-NEXT:    vinserti128 $1, %xmm1, %ymm0, %ymm0
> > +; AVX512BW-NEXT:    vmovq %rdi, %xmm1
> > +; AVX512BW-NEXT:    vinserti128 $1, %xmm1, %ymm2, %ymm1
> > +; AVX512BW-NEXT:    vinserti64x4 $1, %ymm0, %zmm1, %zmm0
> > +; AVX512BW-NEXT:    vextracti128 $1, %ymm0, %xmm1
> > +; AVX512BW-NEXT:    vpextrb $0, %xmm0, %eax
> > +; AVX512BW-NEXT:    vmovd %eax, %xmm2
> > +; AVX512BW-NEXT:    vpextrb $0, %xmm1, %eax
> > +; AVX512BW-NEXT:    vpinsrb $1, %eax, %xmm2, %xmm1
> > +; AVX512BW-NEXT:    vextracti32x4 $2, %zmm0, %xmm2
> > +; AVX512BW-NEXT:    vpextrb $0, %xmm2, %eax
> > +; AVX512BW-NEXT:    vpinsrb $2, %eax, %xmm1, %xmm1
> > +; AVX512BW-NEXT:    vextracti32x4 $3, %zmm0, %xmm0
> > +; AVX512BW-NEXT:    vpextrb $0, %xmm0, %eax
> > +; AVX512BW-NEXT:    vpinsrb $3, %eax, %xmm1, %xmm0
> > +; AVX512BW-NEXT:    vmovq %rcx, %xmm1
> > +; AVX512BW-NEXT:    vmovq %rdx, %xmm2
> > +; AVX512BW-NEXT:    vmovq %r13, %xmm3
> > +; AVX512BW-NEXT:    vinserti128 $1, %xmm2, %ymm1, %ymm1
> > +; AVX512BW-NEXT:    vmovq %r12, %xmm2
> > +; AVX512BW-NEXT:    vinserti128 $1, %xmm2, %ymm3, %ymm2
> > +; AVX512BW-NEXT:    vinserti64x4 $1, %ymm1, %zmm2, %zmm1
> > +; AVX512BW-NEXT:    vpextrb $0, %xmm1, %eax
> > +; AVX512BW-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0
> > +; AVX512BW-NEXT:    vextracti128 $1, %ymm1, %xmm2
> > +; AVX512BW-NEXT:    vpextrb $0, %xmm2, %eax
> > +; AVX512BW-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0
> > +; AVX512BW-NEXT:    vextracti32x4 $2, %zmm1, %xmm2
> > +; AVX512BW-NEXT:    vpextrb $0, %xmm2, %eax
> > +; AVX512BW-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0
> > +; AVX512BW-NEXT:    vextracti32x4 $3, %zmm1, %xmm1
> > +; AVX512BW-NEXT:    vpextrb $0, %xmm1, %eax
> > +; AVX512BW-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0
> > +; AVX512BW-NEXT:    vmovq %rsi, %xmm1
> > +; AVX512BW-NEXT:    vmovq %r14, %xmm2
> > +; AVX512BW-NEXT:    vmovq %r9, %xmm3
> > +; AVX512BW-NEXT:    vinserti128 $1, %xmm2, %ymm1, %ymm1
> > +; AVX512BW-NEXT:    vmovq %r10, %xmm2
> > +; AVX512BW-NEXT:    vinserti128 $1, %xmm2, %ymm3, %ymm2
> > +; AVX512BW-NEXT:    vinserti64x4 $1, %ymm1, %zmm2, %zmm1
> > +; AVX512BW-NEXT:    vpextrb $0, %xmm1, %eax
> > +; AVX512BW-NEXT:    vpinsrb $8, %eax, %xmm0, %xmm0
> > +; AVX512BW-NEXT:    vextracti128 $1, %ymm1, %xmm2
> > +; AVX512BW-NEXT:    vpextrb $0, %xmm2, %eax
> > +; AVX512BW-NEXT:    vpinsrb $9, %eax, %xmm0, %xmm0
> > +; AVX512BW-NEXT:    vextracti32x4 $2, %zmm1, %xmm2
> > +; AVX512BW-NEXT:    vpextrb $0, %xmm2, %eax
> > +; AVX512BW-NEXT:    vpinsrb $10, %eax, %xmm0, %xmm0
> > +; AVX512BW-NEXT:    vextracti32x4 $3, %zmm1, %xmm1
> > +; AVX512BW-NEXT:    vpextrb $0, %xmm1, %eax
> > +; AVX512BW-NEXT:    vpinsrb $11, %eax, %xmm0, %xmm0
> > +; AVX512BW-NEXT:    vmovq %r8, %xmm1
> > +; AVX512BW-NEXT:    vmovq %rbp, %xmm2
> > +; AVX512BW-NEXT:    vmovq {{[-0-9]+}}(%r{{[sb]}}p), %xmm3 # 8-byte
> Folded Reload
> > +; AVX512BW-NEXT:    # xmm3 = mem[0],zero
> > +; AVX512BW-NEXT:    vinserti128 $1, %xmm2, %ymm1, %ymm1
> > +; AVX512BW-NEXT:    vmovq {{[-0-9]+}}(%r{{[sb]}}p), %xmm2 # 8-byte
> Folded Reload
> > +; AVX512BW-NEXT:    # xmm2 = mem[0],zero
> > +; AVX512BW-NEXT:    vinserti128 $1, %xmm2, %ymm3, %ymm2
> > +; AVX512BW-NEXT:    vinserti64x4 $1, %ymm1, %zmm2, %zmm1
> > +; AVX512BW-NEXT:    vpextrb $0, %xmm1, %eax
> > +; AVX512BW-NEXT:    vpinsrb $12, %eax, %xmm0, %xmm0
> > +; AVX512BW-NEXT:    vextracti128 $1, %ymm1, %xmm2
> > +; AVX512BW-NEXT:    vpextrb $0, %xmm2, %eax
> > +; AVX512BW-NEXT:    vpinsrb $13, %eax, %xmm0, %xmm0
> > +; AVX512BW-NEXT:    vextracti32x4 $2, %zmm1, %xmm2
> > +; AVX512BW-NEXT:    vpextrb $0, %xmm2, %eax
> > +; AVX512BW-NEXT:    vpinsrb $14, %eax, %xmm0, %xmm0
> > +; AVX512BW-NEXT:    vextracti32x4 $3, %zmm1, %xmm1
> > +; AVX512BW-NEXT:    vpextrb $0, %xmm1, %eax
> > +; AVX512BW-NEXT:    vpinsrb $15, %eax, %xmm0, %xmm0
> > +; AVX512BW-NEXT:    vmovdqu %xmm0, (%rax)
> > +; AVX512BW-NEXT:    addq $24, %rsp
> > +; AVX512BW-NEXT:    popq %rbx
> > +; AVX512BW-NEXT:    popq %r12
> > +; AVX512BW-NEXT:    popq %r13
> > +; AVX512BW-NEXT:    popq %r14
> > +; AVX512BW-NEXT:    popq %r15
> > +; AVX512BW-NEXT:    popq %rbp
> > +; AVX512BW-NEXT:    vzeroupper
> > +; AVX512BW-NEXT:    retq
> >    %1 = load <16 x i8>, <16 x i8>* %a
> >    %2 = load <16 x i8>, <16 x i8>* %b
> >    %3 = zext <16 x i8> %1 to <16 x i128>
> >
> > Modified: llvm/trunk/test/CodeGen/X86/avx-cvt-2.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/avx-cvt-2.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/avx-cvt-2.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/avx-cvt-2.ll Wed Aug  7 09:24:26 2019
> > @@ -40,7 +40,7 @@ define void @fptoui8(%f32vec_t %a, %i8ve
> >  ; CHECK:       # %bb.0:
> >  ; CHECK-NEXT:    vcvttps2dq %ymm0, %ymm0
> >  ; CHECK-NEXT:    vextractf128 $1, %ymm0, %xmm1
> > -; CHECK-NEXT:    vpackusdw %xmm1, %xmm0, %xmm0
> > +; CHECK-NEXT:    vpackssdw %xmm1, %xmm0, %xmm0
> >  ; CHECK-NEXT:    vpackuswb %xmm0, %xmm0, %xmm0
> >  ; CHECK-NEXT:    vmovq %xmm0, (%rdi)
> >  ; CHECK-NEXT:    vzeroupper
> >
> > Modified: llvm/trunk/test/CodeGen/X86/avx-fp2int.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/avx-fp2int.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/avx-fp2int.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/avx-fp2int.ll Wed Aug  7 09:24:26 2019
> > @@ -7,6 +7,7 @@ define <4 x i8> @test1(<4 x double> %d)
> >  ; CHECK-LABEL: test1:
> >  ; CHECK:       ## %bb.0:
> >  ; CHECK-NEXT:    vcvttpd2dq %ymm0, %xmm0
> > +; CHECK-NEXT:    vpshufb {{.*#+}} xmm0 =
> xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
> >  ; CHECK-NEXT:    vzeroupper
> >  ; CHECK-NEXT:    retl
> >    %c = fptoui <4 x double> %d to <4 x i8>
> > @@ -16,6 +17,7 @@ define <4 x i8> @test2(<4 x double> %d)
> >  ; CHECK-LABEL: test2:
> >  ; CHECK:       ## %bb.0:
> >  ; CHECK-NEXT:    vcvttpd2dq %ymm0, %xmm0
> > +; CHECK-NEXT:    vpshufb {{.*#+}} xmm0 =
> xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
> >  ; CHECK-NEXT:    vzeroupper
> >  ; CHECK-NEXT:    retl
> >    %c = fptosi <4 x double> %d to <4 x i8>
> >
> > Modified: llvm/trunk/test/CodeGen/X86/avx2-conversions.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/avx2-conversions.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/avx2-conversions.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/avx2-conversions.ll Wed Aug  7 09:24:26
> 2019
> > @@ -117,14 +117,12 @@ define <8 x i32> @zext8(<8 x i16> %A) no
> >  define <8 x i32> @zext_8i8_8i32(<8 x i8> %A) nounwind {
> >  ; X32-LABEL: zext_8i8_8i32:
> >  ; X32:       # %bb.0:
> > -; X32-NEXT:    vpand {{\.LCPI.*}}, %xmm0, %xmm0
> > -; X32-NEXT:    vpmovzxwd {{.*#+}} ymm0 =
> xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
> > +; X32-NEXT:    vpmovzxbd {{.*#+}} ymm0 =
> xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero
> >  ; X32-NEXT:    retl
> >  ;
> >  ; X64-LABEL: zext_8i8_8i32:
> >  ; X64:       # %bb.0:
> > -; X64-NEXT:    vpand {{.*}}(%rip), %xmm0, %xmm0
> > -; X64-NEXT:    vpmovzxwd {{.*#+}} ymm0 =
> xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
> > +; X64-NEXT:    vpmovzxbd {{.*#+}} ymm0 =
> xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero
> >  ; X64-NEXT:    retq
> >    %B = zext <8 x i8> %A to <8 x i32>
> >    ret <8 x i32>%B
> >
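For reference, the zext_8i8_8i32 body quoted above is self-contained enough to try in isolation; a minimal sketch (the RUN lines aren't quoted here, so the triple/attrs on the llc invocation are assumed):

    define <8 x i32> @zext_8i8_8i32(<8 x i8> %A) nounwind {
      %B = zext <8 x i8> %A to <8 x i32>
      ret <8 x i32> %B
    }

    ; assumed invocation, something like:
    ; llc -mtriple=x86_64-unknown-unknown -mattr=+avx2 < zext_8i8_8i32.ll

With widening the <8 x i8> argument already sits in the low 8 bytes of an xmm, so a single vpmovzxbd covers the extension instead of the old vpand of each promoted i16 lane followed by vpmovzxwd.
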
> > Modified: llvm/trunk/test/CodeGen/X86/avx2-masked-gather.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/avx2-masked-gather.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/avx2-masked-gather.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/avx2-masked-gather.ll Wed Aug  7
> 09:24:26 2019
> > @@ -9,23 +9,21 @@ declare <2 x i32> @llvm.masked.gather.v2
> >  define <2 x i32> @masked_gather_v2i32(<2 x i32*>* %ptr, <2 x i1>
> %masks, <2 x i32> %passthro) {
> >  ; X86-LABEL: masked_gather_v2i32:
> >  ; X86:       # %bb.0: # %entry
> > -; X86-NEXT:    movl {{[0-9]+}}(%esp), %eax
> > -; X86-NEXT:    vmovq {{.*#+}} xmm2 = mem[0],zero
> > -; X86-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
> >  ; X86-NEXT:    vinsertps {{.*#+}} xmm0 = xmm0[0,2],zero,zero
> >  ; X86-NEXT:    vpslld $31, %xmm0, %xmm0
> > +; X86-NEXT:    movl {{[0-9]+}}(%esp), %eax
> > +; X86-NEXT:    vmovq {{.*#+}} xmm2 = mem[0],zero
> >  ; X86-NEXT:    vpgatherdd %xmm0, (,%xmm2), %xmm1
> > -; X86-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm1[0],zero,xmm1[1],zero
> > +; X86-NEXT:    vmovdqa %xmm1, %xmm0
> >  ; X86-NEXT:    retl
> >  ;
> >  ; X64-LABEL: masked_gather_v2i32:
> >  ; X64:       # %bb.0: # %entry
> >  ; X64-NEXT:    vmovdqa (%rdi), %xmm2
> > -; X64-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
> >  ; X64-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> >  ; X64-NEXT:    vpslld $31, %xmm0, %xmm0
> >  ; X64-NEXT:    vpgatherqd %xmm0, (,%xmm2), %xmm1
> > -; X64-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm1[0],zero,xmm1[1],zero
> > +; X64-NEXT:    vmovdqa %xmm1, %xmm0
> >  ; X64-NEXT:    retq
> >  ;
> >  ; NOGATHER-LABEL: masked_gather_v2i32:
> > @@ -43,14 +41,12 @@ define <2 x i32> @masked_gather_v2i32(<2
> >  ; NOGATHER-NEXT:    retq
> >  ; NOGATHER-NEXT:  .LBB0_1: # %cond.load
> >  ; NOGATHER-NEXT:    vmovq %xmm2, %rcx
> > -; NOGATHER-NEXT:    movl (%rcx), %ecx
> > -; NOGATHER-NEXT:    vpinsrq $0, %rcx, %xmm1, %xmm1
> > +; NOGATHER-NEXT:    vpinsrd $0, (%rcx), %xmm1, %xmm1
> >  ; NOGATHER-NEXT:    testb $2, %al
> >  ; NOGATHER-NEXT:    je .LBB0_4
> >  ; NOGATHER-NEXT:  .LBB0_3: # %cond.load1
> >  ; NOGATHER-NEXT:    vpextrq $1, %xmm2, %rax
> > -; NOGATHER-NEXT:    movl (%rax), %eax
> > -; NOGATHER-NEXT:    vpinsrq $1, %rax, %xmm1, %xmm1
> > +; NOGATHER-NEXT:    vpinsrd $1, (%rax), %xmm1, %xmm1
> >  ; NOGATHER-NEXT:    vmovdqa %xmm1, %xmm0
> >  ; NOGATHER-NEXT:    retq
> >  entry:
> > @@ -62,11 +58,10 @@ entry:
> >  define <4 x i32> @masked_gather_v2i32_concat(<2 x i32*>* %ptr, <2 x i1>
> %masks, <2 x i32> %passthro) {
> >  ; X86-LABEL: masked_gather_v2i32_concat:
> >  ; X86:       # %bb.0: # %entry
> > -; X86-NEXT:    movl {{[0-9]+}}(%esp), %eax
> > -; X86-NEXT:    vmovq {{.*#+}} xmm2 = mem[0],zero
> > -; X86-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
> >  ; X86-NEXT:    vinsertps {{.*#+}} xmm0 = xmm0[0,2],zero,zero
> >  ; X86-NEXT:    vpslld $31, %xmm0, %xmm0
> > +; X86-NEXT:    movl {{[0-9]+}}(%esp), %eax
> > +; X86-NEXT:    vmovq {{.*#+}} xmm2 = mem[0],zero
> >  ; X86-NEXT:    vpgatherdd %xmm0, (,%xmm2), %xmm1
> >  ; X86-NEXT:    vmovdqa %xmm1, %xmm0
> >  ; X86-NEXT:    retl
> > @@ -74,7 +69,6 @@ define <4 x i32> @masked_gather_v2i32_co
> >  ; X64-LABEL: masked_gather_v2i32_concat:
> >  ; X64:       # %bb.0: # %entry
> >  ; X64-NEXT:    vmovdqa (%rdi), %xmm2
> > -; X64-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
> >  ; X64-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> >  ; X64-NEXT:    vpslld $31, %xmm0, %xmm0
> >  ; X64-NEXT:    vpgatherqd %xmm0, (,%xmm2), %xmm1
> > @@ -92,19 +86,17 @@ define <4 x i32> @masked_gather_v2i32_co
> >  ; NOGATHER-NEXT:    testb $2, %al
> >  ; NOGATHER-NEXT:    jne .LBB1_3
> >  ; NOGATHER-NEXT:  .LBB1_4: # %else2
> > -; NOGATHER-NEXT:    vpshufd {{.*#+}} xmm0 = xmm1[0,2,2,3]
> > +; NOGATHER-NEXT:    vmovdqa %xmm1, %xmm0
> >  ; NOGATHER-NEXT:    retq
> >  ; NOGATHER-NEXT:  .LBB1_1: # %cond.load
> >  ; NOGATHER-NEXT:    vmovq %xmm2, %rcx
> > -; NOGATHER-NEXT:    movl (%rcx), %ecx
> > -; NOGATHER-NEXT:    vpinsrq $0, %rcx, %xmm1, %xmm1
> > +; NOGATHER-NEXT:    vpinsrd $0, (%rcx), %xmm1, %xmm1
> >  ; NOGATHER-NEXT:    testb $2, %al
> >  ; NOGATHER-NEXT:    je .LBB1_4
> >  ; NOGATHER-NEXT:  .LBB1_3: # %cond.load1
> >  ; NOGATHER-NEXT:    vpextrq $1, %xmm2, %rax
> > -; NOGATHER-NEXT:    movl (%rax), %eax
> > -; NOGATHER-NEXT:    vpinsrq $1, %rax, %xmm1, %xmm1
> > -; NOGATHER-NEXT:    vpshufd {{.*#+}} xmm0 = xmm1[0,2,2,3]
> > +; NOGATHER-NEXT:    vpinsrd $1, (%rax), %xmm1, %xmm1
> > +; NOGATHER-NEXT:    vmovdqa %xmm1, %xmm0
> >  ; NOGATHER-NEXT:    retq
> >  entry:
> >    %ld  = load <2 x i32*>, <2 x i32*>* %ptr
> > @@ -714,10 +706,10 @@ declare <2 x i64> @llvm.masked.gather.v2
> >  define <2 x i64> @masked_gather_v2i64(<2 x i64*>* %ptr, <2 x i1>
> %masks, <2 x i64> %passthro) {
> >  ; X86-LABEL: masked_gather_v2i64:
> >  ; X86:       # %bb.0: # %entry
> > -; X86-NEXT:    movl {{[0-9]+}}(%esp), %eax
> > -; X86-NEXT:    vpmovsxdq (%eax), %xmm2
> >  ; X86-NEXT:    vpsllq $63, %xmm0, %xmm0
> > -; X86-NEXT:    vpgatherqq %xmm0, (,%xmm2), %xmm1
> > +; X86-NEXT:    movl {{[0-9]+}}(%esp), %eax
> > +; X86-NEXT:    vmovq {{.*#+}} xmm2 = mem[0],zero
> > +; X86-NEXT:    vpgatherdq %xmm0, (,%xmm2), %xmm1
> >  ; X86-NEXT:    vmovdqa %xmm1, %xmm0
> >  ; X86-NEXT:    retl
> >  ;
> > @@ -763,10 +755,10 @@ declare <2 x double> @llvm.masked.gather
> >  define <2 x double> @masked_gather_v2double(<2 x double*>* %ptr, <2 x
> i1> %masks, <2 x double> %passthro) {
> >  ; X86-LABEL: masked_gather_v2double:
> >  ; X86:       # %bb.0: # %entry
> > -; X86-NEXT:    movl {{[0-9]+}}(%esp), %eax
> > -; X86-NEXT:    vpmovsxdq (%eax), %xmm2
> >  ; X86-NEXT:    vpsllq $63, %xmm0, %xmm0
> > -; X86-NEXT:    vgatherqpd %xmm0, (,%xmm2), %xmm1
> > +; X86-NEXT:    movl {{[0-9]+}}(%esp), %eax
> > +; X86-NEXT:    vmovsd {{.*#+}} xmm2 = mem[0],zero
> > +; X86-NEXT:    vgatherdpd %xmm0, (,%xmm2), %xmm1
> >  ; X86-NEXT:    vmovapd %xmm1, %xmm0
> >  ; X86-NEXT:    retl
> >  ;
> >
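The masked_gather_v2i32 changes above fall out of <2 x i32> now being widened into the low half of an xmm instead of promoted to <2 x i64>. A rough sketch of the pattern being exercised (the gather call itself isn't quoted, so the mangled intrinsic name and the i32 4 alignment operand below are assumptions):

    declare <2 x i32> @llvm.masked.gather.v2i32.v2p0i32(<2 x i32*>, i32, <2 x i1>, <2 x i32>)

    define <2 x i32> @masked_gather_v2i32(<2 x i32*>* %ptr, <2 x i1> %masks, <2 x i32> %passthro) {
    entry:
      %ld = load <2 x i32*>, <2 x i32*>* %ptr
      ; assumed call shape for the quoted test
      %res = call <2 x i32> @llvm.masked.gather.v2i32.v2p0i32(<2 x i32*> %ld, i32 4, <2 x i1> %masks, <2 x i32> %passthro)
      ret <2 x i32> %res
    }

Previously the passthru had to be compressed (vpshufd) before the gather and the result re-promoted (vpmovzxdq) afterwards; both shuffles disappear in the new output.
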
> > Modified: llvm/trunk/test/CodeGen/X86/avx2-vbroadcast.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/avx2-vbroadcast.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/avx2-vbroadcast.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/avx2-vbroadcast.ll Wed Aug  7 09:24:26
> 2019
> > @@ -657,12 +657,12 @@ define <4 x float> @_e2(float* %ptr) nou
> >  define <8 x i8> @_e4(i8* %ptr) nounwind uwtable readnone ssp {
> >  ; X32-LABEL: _e4:
> >  ; X32:       ## %bb.0:
> > -; X32-NEXT:    vmovaps {{.*#+}} xmm0 = [52,52,52,52,52,52,52,52]
> > +; X32-NEXT:    vmovaps {{.*#+}} xmm0 =
> <52,52,52,52,52,52,52,52,u,u,u,u,u,u,u,u>
> >  ; X32-NEXT:    retl
> >  ;
> >  ; X64-LABEL: _e4:
> >  ; X64:       ## %bb.0:
> > -; X64-NEXT:    vmovaps {{.*#+}} xmm0 = [52,52,52,52,52,52,52,52]
> > +; X64-NEXT:    vmovaps {{.*#+}} xmm0 =
> <52,52,52,52,52,52,52,52,u,u,u,u,u,u,u,u>
> >  ; X64-NEXT:    retq
> >    %vecinit0.i = insertelement <8 x i8> undef, i8       52, i32 0
> >    %vecinit1.i = insertelement <8 x i8> %vecinit0.i, i8 52, i32 1
> >
> > Modified: llvm/trunk/test/CodeGen/X86/avx512-any_extend_load.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/avx512-any_extend_load.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/avx512-any_extend_load.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/avx512-any_extend_load.ll Wed Aug  7
> 09:24:26 2019
> > @@ -4,13 +4,25 @@
> >
> >
> >  define void @any_extend_load_v8i64(<8 x i8> * %ptr) {
> > -; ALL-LABEL: any_extend_load_v8i64:
> > -; ALL:       # %bb.0:
> > -; ALL-NEXT:    vpmovzxbq {{.*#+}} zmm0 =
> mem[0],zero,zero,zero,zero,zero,zero,zero,mem[1],zero,zero,zero,zero,zero,zero,zero,mem[2],zero,zero,zero,zero,zero,zero,zero,mem[3],zero,zero,zero,zero,zero,zero,zero,mem[4],zero,zero,zero,zero,zero,zero,zero,mem[5],zero,zero,zero,zero,zero,zero,zero,mem[6],zero,zero,zero,zero,zero,zero,zero,mem[7],zero,zero,zero,zero,zero,zero,zero
> > -; ALL-NEXT:    vpaddq {{.*}}(%rip){1to8}, %zmm0, %zmm0
> > -; ALL-NEXT:    vpmovqb %zmm0, (%rdi)
> > -; ALL-NEXT:    vzeroupper
> > -; ALL-NEXT:    retq
> > +; KNL-LABEL: any_extend_load_v8i64:
> > +; KNL:       # %bb.0:
> > +; KNL-NEXT:    vmovq {{.*#+}} xmm0 = mem[0],zero
> > +; KNL-NEXT:    vpmovzxbq {{.*#+}} ymm1 =
> xmm0[0],zero,zero,zero,zero,zero,zero,zero,xmm0[1],zero,zero,zero,zero,zero,zero,zero,xmm0[2],zero,zero,zero,zero,zero,zero,zero,xmm0[3],zero,zero,zero,zero,zero,zero,zero
> > +; KNL-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[1,1,2,3]
> > +; KNL-NEXT:    vpmovzxbq {{.*#+}} ymm0 =
> xmm0[0],zero,zero,zero,zero,zero,zero,zero,xmm0[1],zero,zero,zero,zero,zero,zero,zero,xmm0[2],zero,zero,zero,zero,zero,zero,zero,xmm0[3],zero,zero,zero,zero,zero,zero,zero
> > +; KNL-NEXT:    vinserti64x4 $1, %ymm0, %zmm1, %zmm0
> > +; KNL-NEXT:    vpaddq {{.*}}(%rip){1to8}, %zmm0, %zmm0
> > +; KNL-NEXT:    vpmovqb %zmm0, (%rdi)
> > +; KNL-NEXT:    vzeroupper
> > +; KNL-NEXT:    retq
> > +;
> > +; SKX-LABEL: any_extend_load_v8i64:
> > +; SKX:       # %bb.0:
> > +; SKX-NEXT:    vpmovzxbq {{.*#+}} zmm0 =
> mem[0],zero,zero,zero,zero,zero,zero,zero,mem[1],zero,zero,zero,zero,zero,zero,zero,mem[2],zero,zero,zero,zero,zero,zero,zero,mem[3],zero,zero,zero,zero,zero,zero,zero,mem[4],zero,zero,zero,zero,zero,zero,zero,mem[5],zero,zero,zero,zero,zero,zero,zero,mem[6],zero,zero,zero,zero,zero,zero,zero,mem[7],zero,zero,zero,zero,zero,zero,zero
> > +; SKX-NEXT:    vpaddq {{.*}}(%rip){1to8}, %zmm0, %zmm0
> > +; SKX-NEXT:    vpmovqb %zmm0, (%rdi)
> > +; SKX-NEXT:    vzeroupper
> > +; SKX-NEXT:    retq
> >    %wide.load = load <8 x i8>, <8 x i8>* %ptr, align 1
> >    %1 = zext <8 x i8> %wide.load to <8 x i64>
> >    %2 = add nuw nsw <8 x i64> %1, <i64 4, i64 4, i64 4, i64 4, i64 4,
> i64 4, i64 4, i64 4>
> > @@ -23,10 +35,12 @@ define void @any_extend_load_v8i64(<8 x
> >  define void @any_extend_load_v8i32(<8 x i8> * %ptr) {
> >  ; KNL-LABEL: any_extend_load_v8i32:
> >  ; KNL:       # %bb.0:
> > -; KNL-NEXT:    vpmovzxbw {{.*#+}} xmm0 =
> mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
> > -; KNL-NEXT:    vpaddw {{.*}}(%rip), %xmm0, %xmm0
> > -; KNL-NEXT:    vpshufb {{.*#+}} xmm0 =
> xmm0[0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u]
> > +; KNL-NEXT:    vpmovzxbd {{.*#+}} ymm0 =
> mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
> > +; KNL-NEXT:    vpbroadcastd {{.*#+}} ymm1 = [4,4,4,4,4,4,4,4]
> > +; KNL-NEXT:    vpaddd %ymm1, %ymm0, %ymm0
> > +; KNL-NEXT:    vpmovdb %zmm0, %xmm0
> >  ; KNL-NEXT:    vmovq %xmm0, (%rdi)
> > +; KNL-NEXT:    vzeroupper
> >  ; KNL-NEXT:    retq
> >  ;
> >  ; SKX-LABEL: any_extend_load_v8i32:
> >
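The any_extend_load_v8i64 body is mostly visible in the hunk above; the tail isn't quoted, so the trunc/store below are reconstructed from the vpmovqb store in the checks:

    define void @any_extend_load_v8i64(<8 x i8>* %ptr) {
      %wide.load = load <8 x i8>, <8 x i8>* %ptr, align 1
      %1 = zext <8 x i8> %wide.load to <8 x i64>
      %2 = add nuw nsw <8 x i64> %1, <i64 4, i64 4, i64 4, i64 4, i64 4, i64 4, i64 4, i64 4>
      ; assumed tail: truncate back to bytes and store them
      %3 = trunc <8 x i64> %2 to <8 x i8>
      store <8 x i8> %3, <8 x i8>* %ptr, align 1
      ret void
    }

Note the KNL path now goes through a vmovq load plus two ymm zero-extends instead of the single zmm extend from memory that SKX keeps; presumably one of the follow-up-fixable artifacts of the switch.
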
> > Modified: llvm/trunk/test/CodeGen/X86/avx512-cvt.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/avx512-cvt.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/avx512-cvt.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/avx512-cvt.ll Wed Aug  7 09:24:26 2019
> > @@ -513,15 +513,14 @@ define <8 x i8> @f64to8uc(<8 x double> %
> >  ; NOVL-LABEL: f64to8uc:
> >  ; NOVL:       # %bb.0:
> >  ; NOVL-NEXT:    vcvttpd2dq %zmm0, %ymm0
> > -; NOVL-NEXT:    vpmovdw %zmm0, %ymm0
> > -; NOVL-NEXT:    # kill: def $xmm0 killed $xmm0 killed $ymm0
> > +; NOVL-NEXT:    vpmovdb %zmm0, %xmm0
> >  ; NOVL-NEXT:    vzeroupper
> >  ; NOVL-NEXT:    retq
> >  ;
> >  ; VL-LABEL: f64to8uc:
> >  ; VL:       # %bb.0:
> >  ; VL-NEXT:    vcvttpd2dq %zmm0, %ymm0
> > -; VL-NEXT:    vpmovdw %ymm0, %xmm0
> > +; VL-NEXT:    vpmovdb %ymm0, %xmm0
> >  ; VL-NEXT:    vzeroupper
> >  ; VL-NEXT:    retq
> >    %res = fptoui <8 x double> %f to <8 x i8>
> > @@ -657,15 +656,14 @@ define <8 x i8> @f64to8sc(<8 x double> %
> >  ; NOVL-LABEL: f64to8sc:
> >  ; NOVL:       # %bb.0:
> >  ; NOVL-NEXT:    vcvttpd2dq %zmm0, %ymm0
> > -; NOVL-NEXT:    vpmovdw %zmm0, %ymm0
> > -; NOVL-NEXT:    # kill: def $xmm0 killed $xmm0 killed $ymm0
> > +; NOVL-NEXT:    vpmovdb %zmm0, %xmm0
> >  ; NOVL-NEXT:    vzeroupper
> >  ; NOVL-NEXT:    retq
> >  ;
> >  ; VL-LABEL: f64to8sc:
> >  ; VL:       # %bb.0:
> >  ; VL-NEXT:    vcvttpd2dq %zmm0, %ymm0
> > -; VL-NEXT:    vpmovdw %ymm0, %xmm0
> > +; VL-NEXT:    vpmovdb %ymm0, %xmm0
> >  ; VL-NEXT:    vzeroupper
> >  ; VL-NEXT:    retq
> >    %res = fptosi <8 x double> %f to <8 x i8>
> > @@ -1557,9 +1555,7 @@ define <8 x double> @ssto16f64(<8 x i16>
> >  define <8 x double> @scto8f64(<8 x i8> %a) {
> >  ; ALL-LABEL: scto8f64:
> >  ; ALL:       # %bb.0:
> > -; ALL-NEXT:    vpmovzxwd {{.*#+}} ymm0 =
> xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
> > -; ALL-NEXT:    vpslld $24, %ymm0, %ymm0
> > -; ALL-NEXT:    vpsrad $24, %ymm0, %ymm0
> > +; ALL-NEXT:    vpmovsxbd %xmm0, %ymm0
> >  ; ALL-NEXT:    vcvtdq2pd %ymm0, %zmm0
> >  ; ALL-NEXT:    retq
> >    %1 = sitofp <8 x i8> %a to <8 x double>
> > @@ -1724,13 +1720,30 @@ define <2 x float> @sbto2f32(<2 x float>
> >  }
> >
> >  define <2 x double> @sbto2f64(<2 x double> %a) {
> > -; ALL-LABEL: sbto2f64:
> > -; ALL:       # %bb.0:
> > -; ALL-NEXT:    vxorpd %xmm1, %xmm1, %xmm1
> > -; ALL-NEXT:    vcmpltpd %xmm0, %xmm1, %xmm0
> > -; ALL-NEXT:    vpermilps {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > -; ALL-NEXT:    vcvtdq2pd %xmm0, %xmm0
> > -; ALL-NEXT:    retq
> > +; NOVL-LABEL: sbto2f64:
> > +; NOVL:       # %bb.0:
> > +; NOVL-NEXT:    vxorpd %xmm1, %xmm1, %xmm1
> > +; NOVL-NEXT:    vcmpltpd %xmm0, %xmm1, %xmm0
> > +; NOVL-NEXT:    vpermilps {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > +; NOVL-NEXT:    vcvtdq2pd %xmm0, %xmm0
> > +; NOVL-NEXT:    retq
> > +;
> > +; VLDQ-LABEL: sbto2f64:
> > +; VLDQ:       # %bb.0:
> > +; VLDQ-NEXT:    vxorpd %xmm1, %xmm1, %xmm1
> > +; VLDQ-NEXT:    vcmpltpd %xmm0, %xmm1, %k0
> > +; VLDQ-NEXT:    vpmovm2d %k0, %xmm0
> > +; VLDQ-NEXT:    vcvtdq2pd %xmm0, %xmm0
> > +; VLDQ-NEXT:    retq
> > +;
> > +; VLNODQ-LABEL: sbto2f64:
> > +; VLNODQ:       # %bb.0:
> > +; VLNODQ-NEXT:    vxorpd %xmm1, %xmm1, %xmm1
> > +; VLNODQ-NEXT:    vcmpltpd %xmm0, %xmm1, %k1
> > +; VLNODQ-NEXT:    vpcmpeqd %xmm0, %xmm0, %xmm0
> > +; VLNODQ-NEXT:    vmovdqa32 %xmm0, %xmm0 {%k1} {z}
> > +; VLNODQ-NEXT:    vcvtdq2pd %xmm0, %xmm0
> > +; VLNODQ-NEXT:    retq
> >    %cmpres = fcmp ogt <2 x double> %a, zeroinitializer
> >    %1 = sitofp <2 x i1> %cmpres to <2 x double>
> >    ret <2 x double> %1
> > @@ -1749,8 +1762,7 @@ define <16 x float> @ucto16f32(<16 x i8>
> >  define <8 x double> @ucto8f64(<8 x i8> %a) {
> >  ; ALL-LABEL: ucto8f64:
> >  ; ALL:       # %bb.0:
> > -; ALL-NEXT:    vpand {{.*}}(%rip), %xmm0, %xmm0
> > -; ALL-NEXT:    vpmovzxwd {{.*#+}} ymm0 =
> xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
> > +; ALL-NEXT:    vpmovzxbd {{.*#+}} ymm0 =
> xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero
> >  ; ALL-NEXT:    vcvtdq2pd %ymm0, %zmm0
> >  ; ALL-NEXT:    retq
> >    %b = uitofp <8 x i8> %a to <8 x double>
> > @@ -1993,29 +2005,42 @@ define <4 x double> @ubto4f64(<4 x i32>
> >  }
> >
> >  define <2 x float> @ubto2f32(<2 x i32> %a) {
> > -; ALL-LABEL: ubto2f32:
> > -; ALL:       # %bb.0:
> > -; ALL-NEXT:    vpxor %xmm1, %xmm1, %xmm1
> > -; ALL-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm1[1],xmm0[2],xmm1[3]
> > -; ALL-NEXT:    vpcmpeqq %xmm1, %xmm0, %xmm0
> > -; ALL-NEXT:    vpandn {{.*}}(%rip), %xmm0, %xmm0
> > -; ALL-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > -; ALL-NEXT:    retq
> > +; NOVL-LABEL: ubto2f32:
> > +; NOVL:       # %bb.0:
> > +; NOVL-NEXT:    vpxor %xmm1, %xmm1, %xmm1
> > +; NOVL-NEXT:    vpcmpeqd %xmm1, %xmm0, %xmm0
> > +; NOVL-NEXT:    vpbroadcastd {{.*#+}} xmm1 =
> [1065353216,1065353216,1065353216,1065353216]
> > +; NOVL-NEXT:    vpandn %xmm1, %xmm0, %xmm0
> > +; NOVL-NEXT:    retq
> > +;
> > +; VL-LABEL: ubto2f32:
> > +; VL:       # %bb.0:
> > +; VL-NEXT:    vpxor %xmm1, %xmm1, %xmm1
> > +; VL-NEXT:    vpcmpeqd %xmm1, %xmm0, %xmm0
> > +; VL-NEXT:    vpandnd {{.*}}(%rip){1to4}, %xmm0, %xmm0
> > +; VL-NEXT:    retq
> >    %mask = icmp ne <2 x i32> %a, zeroinitializer
> >    %1 = uitofp <2 x i1> %mask to <2 x float>
> >    ret <2 x float> %1
> >  }
> >
> >  define <2 x double> @ubto2f64(<2 x i32> %a) {
> > -; ALL-LABEL: ubto2f64:
> > -; ALL:       # %bb.0:
> > -; ALL-NEXT:    vpxor %xmm1, %xmm1, %xmm1
> > -; ALL-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm1[1],xmm0[2],xmm1[3]
> > -; ALL-NEXT:    vpcmpeqq %xmm1, %xmm0, %xmm0
> > -; ALL-NEXT:    vpandn {{.*}}(%rip), %xmm0, %xmm0
> > -; ALL-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > -; ALL-NEXT:    vcvtdq2pd %xmm0, %xmm0
> > -; ALL-NEXT:    retq
> > +; NOVL-LABEL: ubto2f64:
> > +; NOVL:       # %bb.0:
> > +; NOVL-NEXT:    vpxor %xmm1, %xmm1, %xmm1
> > +; NOVL-NEXT:    vpcmpeqd %xmm1, %xmm0, %xmm0
> > +; NOVL-NEXT:    vpbroadcastd {{.*#+}} xmm1 = [1,1,1,1]
> > +; NOVL-NEXT:    vpandn %xmm1, %xmm0, %xmm0
> > +; NOVL-NEXT:    vcvtdq2pd %xmm0, %xmm0
> > +; NOVL-NEXT:    retq
> > +;
> > +; VL-LABEL: ubto2f64:
> > +; VL:       # %bb.0:
> > +; VL-NEXT:    vpxor %xmm1, %xmm1, %xmm1
> > +; VL-NEXT:    vpcmpeqd %xmm1, %xmm0, %xmm0
> > +; VL-NEXT:    vpandnd {{.*}}(%rip){1to4}, %xmm0, %xmm0
> > +; VL-NEXT:    vcvtdq2pd %xmm0, %xmm0
> > +; VL-NEXT:    retq
> >    %mask = icmp ne <2 x i32> %a, zeroinitializer
> >    %1 = uitofp <2 x i1> %mask to <2 x double>
> >    ret <2 x double> %1
> >
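The sitofp/uitofp-from-<8 x i8> cases above are straightforward wins; standalone versions of the two quoted bodies (the ret lines are added here, and the triple/attrs on the llc line are assumed since the RUN lines aren't quoted):

    define <8 x double> @scto8f64(<8 x i8> %a) {
      %1 = sitofp <8 x i8> %a to <8 x double>
      ret <8 x double> %1
    }

    define <8 x double> @ucto8f64(<8 x i8> %a) {
      %b = uitofp <8 x i8> %a to <8 x double>
      ret <8 x double> %b
    }

    ; assumed invocation, something like:
    ; llc -mtriple=x86_64-unknown-unknown -mattr=+avx512f < avx512-cvt-sketch.ll

Keeping the elements as i8 lets the signed case use vpmovsxbd directly instead of the old zero-extend plus vpslld/vpsrad sign-extension trick, and the unsigned case drops its vpand mask.
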
> > Modified: llvm/trunk/test/CodeGen/X86/avx512-ext.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/avx512-ext.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/avx512-ext.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/avx512-ext.ll Wed Aug  7 09:24:26 2019
> > @@ -2134,28 +2134,53 @@ define <32 x i8> @zext_32xi1_to_32xi8(<3
> >  }
> >
> >  define <4 x i32> @zext_4xi1_to_4x32(<4 x i8> %x, <4 x i8> %y) #0 {
> > -; ALL-LABEL: zext_4xi1_to_4x32:
> > -; ALL:       # %bb.0:
> > -; ALL-NEXT:    vpbroadcastd {{.*#+}} xmm2 = [255,255,255,255]
> > -; ALL-NEXT:    vpand %xmm2, %xmm1, %xmm1
> > -; ALL-NEXT:    vpand %xmm2, %xmm0, %xmm0
> > -; ALL-NEXT:    vpcmpeqd %xmm1, %xmm0, %xmm0
> > -; ALL-NEXT:    vpsrld $31, %xmm0, %xmm0
> > -; ALL-NEXT:    retq
> > +; KNL-LABEL: zext_4xi1_to_4x32:
> > +; KNL:       # %bb.0:
> > +; KNL-NEXT:    vpcmpeqb %xmm1, %xmm0, %xmm0
> > +; KNL-NEXT:    vpmovzxbd {{.*#+}} xmm0 =
> xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero
> > +; KNL-NEXT:    vpbroadcastd {{.*#+}} xmm1 = [1,1,1,1]
> > +; KNL-NEXT:    vpand %xmm1, %xmm0, %xmm0
> > +; KNL-NEXT:    retq
> > +;
> > +; SKX-LABEL: zext_4xi1_to_4x32:
> > +; SKX:       # %bb.0:
> > +; SKX-NEXT:    vpcmpeqb %xmm1, %xmm0, %k0
> > +; SKX-NEXT:    vpmovm2d %k0, %xmm0
> > +; SKX-NEXT:    vpsrld $31, %xmm0, %xmm0
> > +; SKX-NEXT:    retq
> > +;
> > +; AVX512DQNOBW-LABEL: zext_4xi1_to_4x32:
> > +; AVX512DQNOBW:       # %bb.0:
> > +; AVX512DQNOBW-NEXT:    vpcmpeqb %xmm1, %xmm0, %xmm0
> > +; AVX512DQNOBW-NEXT:    vpmovzxbd {{.*#+}} xmm0 =
> xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero
> > +; AVX512DQNOBW-NEXT:    vpandd {{.*}}(%rip){1to4}, %xmm0, %xmm0
> > +; AVX512DQNOBW-NEXT:    retq
> >    %mask = icmp eq <4 x i8> %x, %y
> >    %1 = zext <4 x i1> %mask to <4 x i32>
> >    ret <4 x i32> %1
> >  }
> >
> >  define <2 x i64> @zext_2xi1_to_2xi64(<2 x i8> %x, <2 x i8> %y) #0 {
> > -; ALL-LABEL: zext_2xi1_to_2xi64:
> > -; ALL:       # %bb.0:
> > -; ALL-NEXT:    vpbroadcastq {{.*#+}} xmm2 = [255,255]
> > -; ALL-NEXT:    vpand %xmm2, %xmm1, %xmm1
> > -; ALL-NEXT:    vpand %xmm2, %xmm0, %xmm0
> > -; ALL-NEXT:    vpcmpeqq %xmm1, %xmm0, %xmm0
> > -; ALL-NEXT:    vpsrlq $63, %xmm0, %xmm0
> > -; ALL-NEXT:    retq
> > +; KNL-LABEL: zext_2xi1_to_2xi64:
> > +; KNL:       # %bb.0:
> > +; KNL-NEXT:    vpcmpeqb %xmm1, %xmm0, %xmm0
> > +; KNL-NEXT:    vpmovzxbq {{.*#+}} xmm0 =
> xmm0[0],zero,zero,zero,zero,zero,zero,zero,xmm0[1],zero,zero,zero,zero,zero,zero,zero
> > +; KNL-NEXT:    vpand {{.*}}(%rip), %xmm0, %xmm0
> > +; KNL-NEXT:    retq
> > +;
> > +; SKX-LABEL: zext_2xi1_to_2xi64:
> > +; SKX:       # %bb.0:
> > +; SKX-NEXT:    vpcmpeqb %xmm1, %xmm0, %k0
> > +; SKX-NEXT:    vpmovm2q %k0, %xmm0
> > +; SKX-NEXT:    vpsrlq $63, %xmm0, %xmm0
> > +; SKX-NEXT:    retq
> > +;
> > +; AVX512DQNOBW-LABEL: zext_2xi1_to_2xi64:
> > +; AVX512DQNOBW:       # %bb.0:
> > +; AVX512DQNOBW-NEXT:    vpcmpeqb %xmm1, %xmm0, %xmm0
> > +; AVX512DQNOBW-NEXT:    vpmovzxbq {{.*#+}} xmm0 =
> xmm0[0],zero,zero,zero,zero,zero,zero,zero,xmm0[1],zero,zero,zero,zero,zero,zero,zero
> > +; AVX512DQNOBW-NEXT:    vpand {{.*}}(%rip), %xmm0, %xmm0
> > +; AVX512DQNOBW-NEXT:    retq
> >    %mask = icmp eq <2 x i8> %x, %y
> >    %1 = zext <2 x i1> %mask to <2 x i64>
> >    ret <2 x i64> %1
> >
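The zext_4xi1_to_4x32 and zext_2xi1_to_2xi64 bodies are quoted in full above; as standalone functions (attribute group reference dropped) they are just:

    define <4 x i32> @zext_4xi1_to_4x32(<4 x i8> %x, <4 x i8> %y) {
      %mask = icmp eq <4 x i8> %x, %y
      %1 = zext <4 x i1> %mask to <4 x i32>
      ret <4 x i32> %1
    }

    define <2 x i64> @zext_2xi1_to_2xi64(<2 x i8> %x, <2 x i8> %y) {
      %mask = icmp eq <2 x i8> %x, %y
      %1 = zext <2 x i1> %mask to <2 x i64>
      ret <2 x i64> %1
    }

With the operands widened to v16i8 the compare becomes a plain vpcmpeqb, so the old per-element masking with 255 goes away; the cost on KNL is an extra AND with 1 to squash the compare result back down to 0/1.
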
> > Modified: llvm/trunk/test/CodeGen/X86/avx512-intrinsics-upgrade.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/avx512-intrinsics-upgrade.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/avx512-intrinsics-upgrade.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/avx512-intrinsics-upgrade.ll Wed Aug  7
> 09:24:26 2019
> > @@ -5478,19 +5478,19 @@ define <8 x i8> @test_cmp_q_512(<8 x i64
> >  ; CHECK-NEXT:    vpcmpgtq %zmm1, %zmm0, %k5 ## encoding:
> [0x62,0xf2,0xfd,0x48,0x37,0xe9]
> >  ; CHECK-NEXT:    kmovw %k0, %eax ## encoding: [0xc5,0xf8,0x93,0xc0]
> >  ; CHECK-NEXT:    vpxor %xmm0, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xef,0xc0]
> > -; CHECK-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xc4,0xc0,0x00]
> > +; CHECK-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x00]
> >  ; CHECK-NEXT:    kmovw %k1, %eax ## encoding: [0xc5,0xf8,0x93,0xc1]
> > -; CHECK-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xc4,0xc0,0x01]
> > +; CHECK-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x01]
> >  ; CHECK-NEXT:    kmovw %k2, %eax ## encoding: [0xc5,0xf8,0x93,0xc2]
> > -; CHECK-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xc4,0xc0,0x02]
> > +; CHECK-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x02]
> >  ; CHECK-NEXT:    kmovw %k3, %eax ## encoding: [0xc5,0xf8,0x93,0xc3]
> > -; CHECK-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xc4,0xc0,0x04]
> > +; CHECK-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x04]
> >  ; CHECK-NEXT:    kmovw %k4, %eax ## encoding: [0xc5,0xf8,0x93,0xc4]
> > -; CHECK-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xc4,0xc0,0x05]
> > +; CHECK-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x05]
> >  ; CHECK-NEXT:    kmovw %k5, %eax ## encoding: [0xc5,0xf8,0x93,0xc5]
> > -; CHECK-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xc4,0xc0,0x06]
> > +; CHECK-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x06]
> >  ; CHECK-NEXT:    movl $255, %eax ## encoding: [0xb8,0xff,0x00,0x00,0x00]
> > -; CHECK-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xc4,0xc0,0x07]
> > +; CHECK-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x07]
> >  ; CHECK-NEXT:    vzeroupper ## encoding: [0xc5,0xf8,0x77]
> >  ; CHECK-NEXT:    ret{{[l|q]}} ## encoding: [0xc3]
> >    %res0 = call i8 @llvm.x86.avx512.mask.cmp.q.512(<8 x i64> %a0, <8 x
> i64> %a1, i32 0, i8 -1)
> > @@ -5515,7 +5515,7 @@ define <8 x i8> @test_cmp_q_512(<8 x i64
> >  define <8 x i8> @test_mask_cmp_q_512(<8 x i64> %a0, <8 x i64> %a1, i8
> %mask) {
> >  ; X86-LABEL: test_mask_cmp_q_512:
> >  ; X86:       ## %bb.0:
> > -; X86-NEXT:    movzwl {{[0-9]+}}(%esp), %eax ## encoding:
> [0x0f,0xb7,0x44,0x24,0x04]
> > +; X86-NEXT:    movl {{[0-9]+}}(%esp), %eax ## encoding:
> [0x8b,0x44,0x24,0x04]
> >  ; X86-NEXT:    kmovw %eax, %k1 ## encoding: [0xc5,0xf8,0x92,0xc8]
> >  ; X86-NEXT:    vpcmpeqq %zmm1, %zmm0, %k0 {%k1} ## encoding:
> [0x62,0xf2,0xfd,0x49,0x29,0xc1]
> >  ; X86-NEXT:    vpcmpgtq %zmm0, %zmm1, %k2 {%k1} ## encoding:
> [0x62,0xf2,0xf5,0x49,0x37,0xd0]
> > @@ -5525,18 +5525,18 @@ define <8 x i8> @test_mask_cmp_q_512(<8
> >  ; X86-NEXT:    vpcmpgtq %zmm1, %zmm0, %k1 {%k1} ## encoding:
> [0x62,0xf2,0xfd,0x49,0x37,0xc9]
> >  ; X86-NEXT:    kmovw %k0, %ecx ## encoding: [0xc5,0xf8,0x93,0xc8]
> >  ; X86-NEXT:    vpxor %xmm0, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xef,0xc0]
> > -; X86-NEXT:    vpinsrw $0, %ecx, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xc4,0xc1,0x00]
> > +; X86-NEXT:    vpinsrb $0, %ecx, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x20,0xc1,0x00]
> >  ; X86-NEXT:    kmovw %k2, %ecx ## encoding: [0xc5,0xf8,0x93,0xca]
> > -; X86-NEXT:    vpinsrw $1, %ecx, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xc4,0xc1,0x01]
> > +; X86-NEXT:    vpinsrb $1, %ecx, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x20,0xc1,0x01]
> >  ; X86-NEXT:    kmovw %k3, %ecx ## encoding: [0xc5,0xf8,0x93,0xcb]
> > -; X86-NEXT:    vpinsrw $2, %ecx, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xc4,0xc1,0x02]
> > +; X86-NEXT:    vpinsrb $2, %ecx, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x20,0xc1,0x02]
> >  ; X86-NEXT:    kmovw %k4, %ecx ## encoding: [0xc5,0xf8,0x93,0xcc]
> > -; X86-NEXT:    vpinsrw $4, %ecx, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xc4,0xc1,0x04]
> > +; X86-NEXT:    vpinsrb $4, %ecx, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x20,0xc1,0x04]
> >  ; X86-NEXT:    kmovw %k5, %ecx ## encoding: [0xc5,0xf8,0x93,0xcd]
> > -; X86-NEXT:    vpinsrw $5, %ecx, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xc4,0xc1,0x05]
> > +; X86-NEXT:    vpinsrb $5, %ecx, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x20,0xc1,0x05]
> >  ; X86-NEXT:    kmovw %k1, %ecx ## encoding: [0xc5,0xf8,0x93,0xc9]
> > -; X86-NEXT:    vpinsrw $6, %ecx, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xc4,0xc1,0x06]
> > -; X86-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xc4,0xc0,0x07]
> > +; X86-NEXT:    vpinsrb $6, %ecx, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x20,0xc1,0x06]
> > +; X86-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x07]
> >  ; X86-NEXT:    vzeroupper ## encoding: [0xc5,0xf8,0x77]
> >  ; X86-NEXT:    retl ## encoding: [0xc3]
> >  ;
> > @@ -5551,18 +5551,18 @@ define <8 x i8> @test_mask_cmp_q_512(<8
> >  ; X64-NEXT:    vpcmpgtq %zmm1, %zmm0, %k1 {%k1} ## encoding:
> [0x62,0xf2,0xfd,0x49,0x37,0xc9]
> >  ; X64-NEXT:    kmovw %k0, %eax ## encoding: [0xc5,0xf8,0x93,0xc0]
> >  ; X64-NEXT:    vpxor %xmm0, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xef,0xc0]
> > -; X64-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xc4,0xc0,0x00]
> > +; X64-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x00]
> >  ; X64-NEXT:    kmovw %k2, %eax ## encoding: [0xc5,0xf8,0x93,0xc2]
> > -; X64-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xc4,0xc0,0x01]
> > +; X64-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x01]
> >  ; X64-NEXT:    kmovw %k3, %eax ## encoding: [0xc5,0xf8,0x93,0xc3]
> > -; X64-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xc4,0xc0,0x02]
> > +; X64-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x02]
> >  ; X64-NEXT:    kmovw %k4, %eax ## encoding: [0xc5,0xf8,0x93,0xc4]
> > -; X64-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xc4,0xc0,0x04]
> > +; X64-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x04]
> >  ; X64-NEXT:    kmovw %k5, %eax ## encoding: [0xc5,0xf8,0x93,0xc5]
> > -; X64-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xc4,0xc0,0x05]
> > +; X64-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x05]
> >  ; X64-NEXT:    kmovw %k1, %eax ## encoding: [0xc5,0xf8,0x93,0xc1]
> > -; X64-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xc4,0xc0,0x06]
> > -; X64-NEXT:    vpinsrw $7, %edi, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xc4,0xc7,0x07]
> > +; X64-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x06]
> > +; X64-NEXT:    vpinsrb $7, %edi, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x20,0xc7,0x07]
> >  ; X64-NEXT:    vzeroupper ## encoding: [0xc5,0xf8,0x77]
> >  ; X64-NEXT:    retq ## encoding: [0xc3]
> >    %res0 = call i8 @llvm.x86.avx512.mask.cmp.q.512(<8 x i64> %a0, <8 x
> i64> %a1, i32 0, i8 %mask)
> > @@ -5597,19 +5597,19 @@ define <8 x i8> @test_ucmp_q_512(<8 x i6
> >  ; CHECK-NEXT:    vpcmpnleuq %zmm1, %zmm0, %k5 ## encoding:
> [0x62,0xf3,0xfd,0x48,0x1e,0xe9,0x06]
> >  ; CHECK-NEXT:    kmovw %k0, %eax ## encoding: [0xc5,0xf8,0x93,0xc0]
> >  ; CHECK-NEXT:    vpxor %xmm0, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xef,0xc0]
> > -; CHECK-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xc4,0xc0,0x00]
> > +; CHECK-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x00]
> >  ; CHECK-NEXT:    kmovw %k1, %eax ## encoding: [0xc5,0xf8,0x93,0xc1]
> > -; CHECK-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xc4,0xc0,0x01]
> > +; CHECK-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x01]
> >  ; CHECK-NEXT:    kmovw %k2, %eax ## encoding: [0xc5,0xf8,0x93,0xc2]
> > -; CHECK-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xc4,0xc0,0x02]
> > +; CHECK-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x02]
> >  ; CHECK-NEXT:    kmovw %k3, %eax ## encoding: [0xc5,0xf8,0x93,0xc3]
> > -; CHECK-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xc4,0xc0,0x04]
> > +; CHECK-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x04]
> >  ; CHECK-NEXT:    kmovw %k4, %eax ## encoding: [0xc5,0xf8,0x93,0xc4]
> > -; CHECK-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xc4,0xc0,0x05]
> > +; CHECK-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x05]
> >  ; CHECK-NEXT:    kmovw %k5, %eax ## encoding: [0xc5,0xf8,0x93,0xc5]
> > -; CHECK-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xc4,0xc0,0x06]
> > +; CHECK-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x06]
> >  ; CHECK-NEXT:    movl $255, %eax ## encoding: [0xb8,0xff,0x00,0x00,0x00]
> > -; CHECK-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xc4,0xc0,0x07]
> > +; CHECK-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x07]
> >  ; CHECK-NEXT:    vzeroupper ## encoding: [0xc5,0xf8,0x77]
> >  ; CHECK-NEXT:    ret{{[l|q]}} ## encoding: [0xc3]
> >    %res0 = call i8 @llvm.x86.avx512.mask.ucmp.q.512(<8 x i64> %a0, <8 x
> i64> %a1, i32 0, i8 -1)
> > @@ -5634,7 +5634,7 @@ define <8 x i8> @test_ucmp_q_512(<8 x i6
> >  define <8 x i8> @test_mask_ucmp_q_512(<8 x i64> %a0, <8 x i64> %a1, i8
> %mask) {
> >  ; X86-LABEL: test_mask_ucmp_q_512:
> >  ; X86:       ## %bb.0:
> > -; X86-NEXT:    movzwl {{[0-9]+}}(%esp), %eax ## encoding:
> [0x0f,0xb7,0x44,0x24,0x04]
> > +; X86-NEXT:    movl {{[0-9]+}}(%esp), %eax ## encoding:
> [0x8b,0x44,0x24,0x04]
> >  ; X86-NEXT:    kmovw %eax, %k1 ## encoding: [0xc5,0xf8,0x92,0xc8]
> >  ; X86-NEXT:    vpcmpeqq %zmm1, %zmm0, %k0 {%k1} ## encoding:
> [0x62,0xf2,0xfd,0x49,0x29,0xc1]
> >  ; X86-NEXT:    vpcmpltuq %zmm1, %zmm0, %k2 {%k1} ## encoding:
> [0x62,0xf3,0xfd,0x49,0x1e,0xd1,0x01]
> > @@ -5644,18 +5644,18 @@ define <8 x i8> @test_mask_ucmp_q_512(<8
> >  ; X86-NEXT:    vpcmpnleuq %zmm1, %zmm0, %k1 {%k1} ## encoding:
> [0x62,0xf3,0xfd,0x49,0x1e,0xc9,0x06]
> >  ; X86-NEXT:    kmovw %k0, %ecx ## encoding: [0xc5,0xf8,0x93,0xc8]
> >  ; X86-NEXT:    vpxor %xmm0, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xef,0xc0]
> > -; X86-NEXT:    vpinsrw $0, %ecx, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xc4,0xc1,0x00]
> > +; X86-NEXT:    vpinsrb $0, %ecx, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x20,0xc1,0x00]
> >  ; X86-NEXT:    kmovw %k2, %ecx ## encoding: [0xc5,0xf8,0x93,0xca]
> > -; X86-NEXT:    vpinsrw $1, %ecx, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xc4,0xc1,0x01]
> > +; X86-NEXT:    vpinsrb $1, %ecx, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x20,0xc1,0x01]
> >  ; X86-NEXT:    kmovw %k3, %ecx ## encoding: [0xc5,0xf8,0x93,0xcb]
> > -; X86-NEXT:    vpinsrw $2, %ecx, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xc4,0xc1,0x02]
> > +; X86-NEXT:    vpinsrb $2, %ecx, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x20,0xc1,0x02]
> >  ; X86-NEXT:    kmovw %k4, %ecx ## encoding: [0xc5,0xf8,0x93,0xcc]
> > -; X86-NEXT:    vpinsrw $4, %ecx, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xc4,0xc1,0x04]
> > +; X86-NEXT:    vpinsrb $4, %ecx, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x20,0xc1,0x04]
> >  ; X86-NEXT:    kmovw %k5, %ecx ## encoding: [0xc5,0xf8,0x93,0xcd]
> > -; X86-NEXT:    vpinsrw $5, %ecx, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xc4,0xc1,0x05]
> > +; X86-NEXT:    vpinsrb $5, %ecx, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x20,0xc1,0x05]
> >  ; X86-NEXT:    kmovw %k1, %ecx ## encoding: [0xc5,0xf8,0x93,0xc9]
> > -; X86-NEXT:    vpinsrw $6, %ecx, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xc4,0xc1,0x06]
> > -; X86-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xc4,0xc0,0x07]
> > +; X86-NEXT:    vpinsrb $6, %ecx, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x20,0xc1,0x06]
> > +; X86-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x07]
> >  ; X86-NEXT:    vzeroupper ## encoding: [0xc5,0xf8,0x77]
> >  ; X86-NEXT:    retl ## encoding: [0xc3]
> >  ;
> > @@ -5670,18 +5670,18 @@ define <8 x i8> @test_mask_ucmp_q_512(<8
> >  ; X64-NEXT:    vpcmpnleuq %zmm1, %zmm0, %k1 {%k1} ## encoding:
> [0x62,0xf3,0xfd,0x49,0x1e,0xc9,0x06]
> >  ; X64-NEXT:    kmovw %k0, %eax ## encoding: [0xc5,0xf8,0x93,0xc0]
> >  ; X64-NEXT:    vpxor %xmm0, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xef,0xc0]
> > -; X64-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xc4,0xc0,0x00]
> > +; X64-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x00]
> >  ; X64-NEXT:    kmovw %k2, %eax ## encoding: [0xc5,0xf8,0x93,0xc2]
> > -; X64-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xc4,0xc0,0x01]
> > +; X64-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x01]
> >  ; X64-NEXT:    kmovw %k3, %eax ## encoding: [0xc5,0xf8,0x93,0xc3]
> > -; X64-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xc4,0xc0,0x02]
> > +; X64-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x02]
> >  ; X64-NEXT:    kmovw %k4, %eax ## encoding: [0xc5,0xf8,0x93,0xc4]
> > -; X64-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xc4,0xc0,0x04]
> > +; X64-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x04]
> >  ; X64-NEXT:    kmovw %k5, %eax ## encoding: [0xc5,0xf8,0x93,0xc5]
> > -; X64-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xc4,0xc0,0x05]
> > +; X64-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x05]
> >  ; X64-NEXT:    kmovw %k1, %eax ## encoding: [0xc5,0xf8,0x93,0xc1]
> > -; X64-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xc4,0xc0,0x06]
> > -; X64-NEXT:    vpinsrw $7, %edi, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xc4,0xc7,0x07]
> > +; X64-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x06]
> > +; X64-NEXT:    vpinsrb $7, %edi, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x20,0xc7,0x07]
> >  ; X64-NEXT:    vzeroupper ## encoding: [0xc5,0xf8,0x77]
> >  ; X64-NEXT:    retq ## encoding: [0xc3]
> >    %res0 = call i8 @llvm.x86.avx512.mask.ucmp.q.512(<8 x i64> %a0, <8 x
> i64> %a1, i32 0, i8 %mask)
> >
> > Modified: llvm/trunk/test/CodeGen/X86/avx512-mask-op.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/avx512-mask-op.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/avx512-mask-op.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/avx512-mask-op.ll Wed Aug  7 09:24:26
> 2019
> > @@ -2296,21 +2296,22 @@ define <2 x i16> @load_2i1(<2 x i1>* %a)
> >  ; KNL-LABEL: load_2i1:
> >  ; KNL:       ## %bb.0:
> >  ; KNL-NEXT:    kmovw (%rdi), %k1
> > -; KNL-NEXT:    vpternlogq $255, %zmm0, %zmm0, %zmm0 {%k1} {z}
> > -; KNL-NEXT:    ## kill: def $xmm0 killed $xmm0 killed $zmm0
> > +; KNL-NEXT:    vpternlogd $255, %zmm0, %zmm0, %zmm0 {%k1} {z}
> > +; KNL-NEXT:    vpmovdw %zmm0, %ymm0
> > +; KNL-NEXT:    ## kill: def $xmm0 killed $xmm0 killed $ymm0
> >  ; KNL-NEXT:    vzeroupper
> >  ; KNL-NEXT:    retq
> >  ;
> >  ; SKX-LABEL: load_2i1:
> >  ; SKX:       ## %bb.0:
> >  ; SKX-NEXT:    kmovb (%rdi), %k0
> > -; SKX-NEXT:    vpmovm2q %k0, %xmm0
> > +; SKX-NEXT:    vpmovm2w %k0, %xmm0
> >  ; SKX-NEXT:    retq
> >  ;
> >  ; AVX512BW-LABEL: load_2i1:
> >  ; AVX512BW:       ## %bb.0:
> > -; AVX512BW-NEXT:    kmovw (%rdi), %k1
> > -; AVX512BW-NEXT:    vpternlogq $255, %zmm0, %zmm0, %zmm0 {%k1} {z}
> > +; AVX512BW-NEXT:    kmovw (%rdi), %k0
> > +; AVX512BW-NEXT:    vpmovm2w %k0, %zmm0
> >  ; AVX512BW-NEXT:    ## kill: def $xmm0 killed $xmm0 killed $zmm0
> >  ; AVX512BW-NEXT:    vzeroupper
> >  ; AVX512BW-NEXT:    retq
> > @@ -2318,8 +2319,9 @@ define <2 x i16> @load_2i1(<2 x i1>* %a)
> >  ; AVX512DQ-LABEL: load_2i1:
> >  ; AVX512DQ:       ## %bb.0:
> >  ; AVX512DQ-NEXT:    kmovb (%rdi), %k0
> > -; AVX512DQ-NEXT:    vpmovm2q %k0, %zmm0
> > -; AVX512DQ-NEXT:    ## kill: def $xmm0 killed $xmm0 killed $zmm0
> > +; AVX512DQ-NEXT:    vpmovm2d %k0, %zmm0
> > +; AVX512DQ-NEXT:    vpmovdw %zmm0, %ymm0
> > +; AVX512DQ-NEXT:    ## kill: def $xmm0 killed $xmm0 killed $ymm0
> >  ; AVX512DQ-NEXT:    vzeroupper
> >  ; AVX512DQ-NEXT:    retq
> >  ;
> > @@ -2327,7 +2329,7 @@ define <2 x i16> @load_2i1(<2 x i1>* %a)
> >  ; X86:       ## %bb.0:
> >  ; X86-NEXT:    movl {{[0-9]+}}(%esp), %eax
> >  ; X86-NEXT:    kmovb (%eax), %k0
> > -; X86-NEXT:    vpmovm2q %k0, %xmm0
> > +; X86-NEXT:    vpmovm2w %k0, %xmm0
> >  ; X86-NEXT:    retl
> >    %b = load <2 x i1>, <2 x i1>* %a
> >    %c = sext <2 x i1> %b to <2 x i16>
> > @@ -2339,20 +2341,21 @@ define <4 x i16> @load_4i1(<4 x i1>* %a)
> >  ; KNL:       ## %bb.0:
> >  ; KNL-NEXT:    kmovw (%rdi), %k1
> >  ; KNL-NEXT:    vpternlogd $255, %zmm0, %zmm0, %zmm0 {%k1} {z}
> > -; KNL-NEXT:    ## kill: def $xmm0 killed $xmm0 killed $zmm0
> > +; KNL-NEXT:    vpmovdw %zmm0, %ymm0
> > +; KNL-NEXT:    ## kill: def $xmm0 killed $xmm0 killed $ymm0
> >  ; KNL-NEXT:    vzeroupper
> >  ; KNL-NEXT:    retq
> >  ;
> >  ; SKX-LABEL: load_4i1:
> >  ; SKX:       ## %bb.0:
> >  ; SKX-NEXT:    kmovb (%rdi), %k0
> > -; SKX-NEXT:    vpmovm2d %k0, %xmm0
> > +; SKX-NEXT:    vpmovm2w %k0, %xmm0
> >  ; SKX-NEXT:    retq
> >  ;
> >  ; AVX512BW-LABEL: load_4i1:
> >  ; AVX512BW:       ## %bb.0:
> > -; AVX512BW-NEXT:    kmovw (%rdi), %k1
> > -; AVX512BW-NEXT:    vpternlogd $255, %zmm0, %zmm0, %zmm0 {%k1} {z}
> > +; AVX512BW-NEXT:    kmovw (%rdi), %k0
> > +; AVX512BW-NEXT:    vpmovm2w %k0, %zmm0
> >  ; AVX512BW-NEXT:    ## kill: def $xmm0 killed $xmm0 killed $zmm0
> >  ; AVX512BW-NEXT:    vzeroupper
> >  ; AVX512BW-NEXT:    retq
> > @@ -2361,7 +2364,8 @@ define <4 x i16> @load_4i1(<4 x i1>* %a)
> >  ; AVX512DQ:       ## %bb.0:
> >  ; AVX512DQ-NEXT:    kmovb (%rdi), %k0
> >  ; AVX512DQ-NEXT:    vpmovm2d %k0, %zmm0
> > -; AVX512DQ-NEXT:    ## kill: def $xmm0 killed $xmm0 killed $zmm0
> > +; AVX512DQ-NEXT:    vpmovdw %zmm0, %ymm0
> > +; AVX512DQ-NEXT:    ## kill: def $xmm0 killed $xmm0 killed $ymm0
> >  ; AVX512DQ-NEXT:    vzeroupper
> >  ; AVX512DQ-NEXT:    retq
> >  ;
> > @@ -2369,7 +2373,7 @@ define <4 x i16> @load_4i1(<4 x i1>* %a)
> >  ; X86:       ## %bb.0:
> >  ; X86-NEXT:    movl {{[0-9]+}}(%esp), %eax
> >  ; X86-NEXT:    kmovb (%eax), %k0
> > -; X86-NEXT:    vpmovm2d %k0, %xmm0
> > +; X86-NEXT:    vpmovm2w %k0, %xmm0
> >  ; X86-NEXT:    retl
> >    %b = load <4 x i1>, <4 x i1>* %a
> >    %c = sext <4 x i1> %b to <4 x i16>
> >
> > Modified: llvm/trunk/test/CodeGen/X86/avx512-trunc.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/avx512-trunc.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/avx512-trunc.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/avx512-trunc.ll Wed Aug  7 09:24:26 2019
> > @@ -36,7 +36,7 @@ define <16 x i16> @trunc_v16i32_to_v16i1
> >  define <8 x i8> @trunc_qb_512(<8 x i64> %i) #0 {
> >  ; ALL-LABEL: trunc_qb_512:
> >  ; ALL:       ## %bb.0:
> > -; ALL-NEXT:    vpmovqw %zmm0, %xmm0
> > +; ALL-NEXT:    vpmovqb %zmm0, %xmm0
> >  ; ALL-NEXT:    vzeroupper
> >  ; ALL-NEXT:    retq
> >    %x = trunc <8 x i64> %i to <8 x i8>
> > @@ -58,14 +58,13 @@ define <4 x i8> @trunc_qb_256(<4 x i64>
> >  ; KNL-LABEL: trunc_qb_256:
> >  ; KNL:       ## %bb.0:
> >  ; KNL-NEXT:    ## kill: def $ymm0 killed $ymm0 def $zmm0
> > -; KNL-NEXT:    vpmovqd %zmm0, %ymm0
> > -; KNL-NEXT:    ## kill: def $xmm0 killed $xmm0 killed $ymm0
> > +; KNL-NEXT:    vpmovqb %zmm0, %xmm0
> >  ; KNL-NEXT:    vzeroupper
> >  ; KNL-NEXT:    retq
> >  ;
> >  ; SKX-LABEL: trunc_qb_256:
> >  ; SKX:       ## %bb.0:
> > -; SKX-NEXT:    vpmovqd %ymm0, %xmm0
> > +; SKX-NEXT:    vpmovqb %ymm0, %xmm0
> >  ; SKX-NEXT:    vzeroupper
> >  ; SKX-NEXT:    retq
> >    %x = trunc <4 x i64> %i to <4 x i8>
> > @@ -76,8 +75,7 @@ define void @trunc_qb_256_mem(<4 x i64>
> >  ; KNL-LABEL: trunc_qb_256_mem:
> >  ; KNL:       ## %bb.0:
> >  ; KNL-NEXT:    ## kill: def $ymm0 killed $ymm0 def $zmm0
> > -; KNL-NEXT:    vpmovqd %zmm0, %ymm0
> > -; KNL-NEXT:    vpshufb {{.*#+}} xmm0 =
> xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
> > +; KNL-NEXT:    vpmovqb %zmm0, %xmm0
> >  ; KNL-NEXT:    vmovd %xmm0, (%rdi)
> >  ; KNL-NEXT:    vzeroupper
> >  ; KNL-NEXT:    retq
> > @@ -95,6 +93,7 @@ define void @trunc_qb_256_mem(<4 x i64>
> >  define <2 x i8> @trunc_qb_128(<2 x i64> %i) #0 {
> >  ; ALL-LABEL: trunc_qb_128:
> >  ; ALL:       ## %bb.0:
> > +; ALL-NEXT:    vpshufb {{.*#+}} xmm0 =
> xmm0[0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u]
> >  ; ALL-NEXT:    retq
> >    %x = trunc <2 x i64> %i to <2 x i8>
> >    ret <2 x i8> %x
> > @@ -141,14 +140,13 @@ define <4 x i16> @trunc_qw_256(<4 x i64>
> >  ; KNL-LABEL: trunc_qw_256:
> >  ; KNL:       ## %bb.0:
> >  ; KNL-NEXT:    ## kill: def $ymm0 killed $ymm0 def $zmm0
> > -; KNL-NEXT:    vpmovqd %zmm0, %ymm0
> > -; KNL-NEXT:    ## kill: def $xmm0 killed $xmm0 killed $ymm0
> > +; KNL-NEXT:    vpmovqw %zmm0, %xmm0
> >  ; KNL-NEXT:    vzeroupper
> >  ; KNL-NEXT:    retq
> >  ;
> >  ; SKX-LABEL: trunc_qw_256:
> >  ; SKX:       ## %bb.0:
> > -; SKX-NEXT:    vpmovqd %ymm0, %xmm0
> > +; SKX-NEXT:    vpmovqw %ymm0, %xmm0
> >  ; SKX-NEXT:    vzeroupper
> >  ; SKX-NEXT:    retq
> >    %x = trunc <4 x i64> %i to <4 x i16>
> > @@ -159,8 +157,7 @@ define void @trunc_qw_256_mem(<4 x i64>
> >  ; KNL-LABEL: trunc_qw_256_mem:
> >  ; KNL:       ## %bb.0:
> >  ; KNL-NEXT:    ## kill: def $ymm0 killed $ymm0 def $zmm0
> > -; KNL-NEXT:    vpmovqd %zmm0, %ymm0
> > -; KNL-NEXT:    vpshufb {{.*#+}} xmm0 =
> xmm0[0,1,4,5,8,9,12,13,8,9,12,13,12,13,14,15]
> > +; KNL-NEXT:    vpmovqw %zmm0, %xmm0
> >  ; KNL-NEXT:    vmovq %xmm0, (%rdi)
> >  ; KNL-NEXT:    vzeroupper
> >  ; KNL-NEXT:    retq
> > @@ -176,9 +173,16 @@ define void @trunc_qw_256_mem(<4 x i64>
> >  }
> >
> >  define <2 x i16> @trunc_qw_128(<2 x i64> %i) #0 {
> > -; ALL-LABEL: trunc_qw_128:
> > -; ALL:       ## %bb.0:
> > -; ALL-NEXT:    retq
> > +; KNL-LABEL: trunc_qw_128:
> > +; KNL:       ## %bb.0:
> > +; KNL-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > +; KNL-NEXT:    vpshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> > +; KNL-NEXT:    retq
> > +;
> > +; SKX-LABEL: trunc_qw_128:
> > +; SKX:       ## %bb.0:
> > +; SKX-NEXT:    vpshufb {{.*#+}} xmm0 =
> xmm0[0,1,8,9,8,9,10,11,8,9,10,11,12,13,14,15]
> > +; SKX-NEXT:    retq
> >    %x = trunc <2 x i64> %i to <2 x i16>
> >    ret <2 x i16> %x
> >  }
> > @@ -260,6 +264,7 @@ define void @trunc_qd_256_mem(<4 x i64>
> >  define <2 x i32> @trunc_qd_128(<2 x i64> %i) #0 {
> >  ; ALL-LABEL: trunc_qd_128:
> >  ; ALL:       ## %bb.0:
> > +; ALL-NEXT:    vpermilps {{.*#+}} xmm0 = xmm0[0,2,2,3]
> >  ; ALL-NEXT:    retq
> >    %x = trunc <2 x i64> %i to <2 x i32>
> >    ret <2 x i32> %x
> > @@ -306,14 +311,13 @@ define <8 x i8> @trunc_db_256(<8 x i32>
> >  ; KNL-LABEL: trunc_db_256:
> >  ; KNL:       ## %bb.0:
> >  ; KNL-NEXT:    ## kill: def $ymm0 killed $ymm0 def $zmm0
> > -; KNL-NEXT:    vpmovdw %zmm0, %ymm0
> > -; KNL-NEXT:    ## kill: def $xmm0 killed $xmm0 killed $ymm0
> > +; KNL-NEXT:    vpmovdb %zmm0, %xmm0
> >  ; KNL-NEXT:    vzeroupper
> >  ; KNL-NEXT:    retq
> >  ;
> >  ; SKX-LABEL: trunc_db_256:
> >  ; SKX:       ## %bb.0:
> > -; SKX-NEXT:    vpmovdw %ymm0, %xmm0
> > +; SKX-NEXT:    vpmovdb %ymm0, %xmm0
> >  ; SKX-NEXT:    vzeroupper
> >  ; SKX-NEXT:    retq
> >    %x = trunc <8 x i32> %i to <8 x i8>
> > @@ -324,8 +328,7 @@ define void @trunc_db_256_mem(<8 x i32>
> >  ; KNL-LABEL: trunc_db_256_mem:
> >  ; KNL:       ## %bb.0:
> >  ; KNL-NEXT:    ## kill: def $ymm0 killed $ymm0 def $zmm0
> > -; KNL-NEXT:    vpmovdw %zmm0, %ymm0
> > -; KNL-NEXT:    vpshufb {{.*#+}} xmm0 =
> xmm0[0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u]
> > +; KNL-NEXT:    vpmovdb %zmm0, %xmm0
> >  ; KNL-NEXT:    vmovq %xmm0, (%rdi)
> >  ; KNL-NEXT:    vzeroupper
> >  ; KNL-NEXT:    retq
> > @@ -343,6 +346,7 @@ define void @trunc_db_256_mem(<8 x i32>
> >  define <4 x i8> @trunc_db_128(<4 x i32> %i) #0 {
> >  ; ALL-LABEL: trunc_db_128:
> >  ; ALL:       ## %bb.0:
> > +; ALL-NEXT:    vpshufb {{.*#+}} xmm0 =
> xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
> >  ; ALL-NEXT:    retq
> >    %x = trunc <4 x i32> %i to <4 x i8>
> >    ret <4 x i8> %x
> > @@ -513,6 +517,7 @@ define void @trunc_wb_256_mem(<16 x i16>
> >  define <8 x i8> @trunc_wb_128(<8 x i16> %i) #0 {
> >  ; ALL-LABEL: trunc_wb_128:
> >  ; ALL:       ## %bb.0:
> > +; ALL-NEXT:    vpshufb {{.*#+}} xmm0 =
> xmm0[0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u]
> >  ; ALL-NEXT:    retq
> >    %x = trunc <8 x i16> %i to <8 x i8>
> >    ret <8 x i8> %x
> > @@ -691,6 +696,7 @@ define <8 x i8> @usat_trunc_wb_128(<8 x
> >  ; ALL-LABEL: usat_trunc_wb_128:
> >  ; ALL:       ## %bb.0:
> >  ; ALL-NEXT:    vpminuw {{.*}}(%rip), %xmm0, %xmm0
> > +; ALL-NEXT:    vpackuswb %xmm0, %xmm0, %xmm0
> >  ; ALL-NEXT:    retq
> >    %x3 = icmp ult <8 x i16> %i, <i16 255, i16 255, i16 255, i16 255, i16
> 255, i16 255, i16 255, i16 255>
> >    %x5 = select <8 x i1> %x3, <8 x i16> %i, <8 x i16> <i16 255, i16 255,
> i16 255, i16 255, i16 255, i16 255, i16 255, i16 255>
> > @@ -716,16 +722,14 @@ define <16 x i8> @usat_trunc_db_256(<8 x
> >  ; KNL:       ## %bb.0:
> >  ; KNL-NEXT:    vpbroadcastd {{.*#+}} ymm1 =
> [255,255,255,255,255,255,255,255]
> >  ; KNL-NEXT:    vpminud %ymm1, %ymm0, %ymm0
> > -; KNL-NEXT:    vpmovdw %zmm0, %ymm0
> > -; KNL-NEXT:    vpackuswb %xmm0, %xmm0, %xmm0
> > +; KNL-NEXT:    vpmovdb %zmm0, %xmm0
> >  ; KNL-NEXT:    vzeroupper
> >  ; KNL-NEXT:    retq
> >  ;
> >  ; SKX-LABEL: usat_trunc_db_256:
> >  ; SKX:       ## %bb.0:
> >  ; SKX-NEXT:    vpminud {{.*}}(%rip){1to8}, %ymm0, %ymm0
> > -; SKX-NEXT:    vpmovdw %ymm0, %xmm0
> > -; SKX-NEXT:    vpackuswb %xmm0, %xmm0, %xmm0
> > +; SKX-NEXT:    vpmovdb %ymm0, %xmm0
> >  ; SKX-NEXT:    vzeroupper
> >  ; SKX-NEXT:    retq
> >    %tmp1 = icmp ult <8 x i32> %x, <i32 255, i32 255, i32 255, i32 255,
> i32 255, i32 255, i32 255, i32 255>
> >
> > Modified: llvm/trunk/test/CodeGen/X86/avx512-vec-cmp.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/avx512-vec-cmp.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/avx512-vec-cmp.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/avx512-vec-cmp.ll Wed Aug  7 09:24:26
> 2019
> > @@ -886,22 +886,14 @@ define <8 x double> @test43(<8 x double>
> >  define <4 x i32> @test44(<4 x i16> %x, <4 x i16> %y) #0 {
> >  ; AVX512-LABEL: test44:
> >  ; AVX512:       ## %bb.0:
> > -; AVX512-NEXT:    vpxor %xmm2, %xmm2, %xmm2 ## encoding:
> [0xc5,0xe9,0xef,0xd2]
> > -; AVX512-NEXT:    vpblendw $170, %xmm2, %xmm1, %xmm1 ## encoding:
> [0xc4,0xe3,0x71,0x0e,0xca,0xaa]
> > -; AVX512-NEXT:    ## xmm1 =
> xmm1[0],xmm2[1],xmm1[2],xmm2[3],xmm1[4],xmm2[5],xmm1[6],xmm2[7]
> > -; AVX512-NEXT:    vpblendw $170, %xmm2, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x0e,0xc2,0xaa]
> > -; AVX512-NEXT:    ## xmm0 =
> xmm0[0],xmm2[1],xmm0[2],xmm2[3],xmm0[4],xmm2[5],xmm0[6],xmm2[7]
> > -; AVX512-NEXT:    vpcmpeqd %xmm1, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0x76,0xc1]
> > +; AVX512-NEXT:    vpcmpeqw %xmm1, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0x75,0xc1]
> > +; AVX512-NEXT:    vpmovsxwd %xmm0, %xmm0 ## encoding:
> [0xc4,0xe2,0x79,0x23,0xc0]
> >  ; AVX512-NEXT:    retq ## encoding: [0xc3]
> >  ;
> >  ; SKX-LABEL: test44:
> >  ; SKX:       ## %bb.0:
> > -; SKX-NEXT:    vpxor %xmm2, %xmm2, %xmm2 ## EVEX TO VEX Compression
> encoding: [0xc5,0xe9,0xef,0xd2]
> > -; SKX-NEXT:    vpblendw $170, %xmm2, %xmm1, %xmm1 ## encoding:
> [0xc4,0xe3,0x71,0x0e,0xca,0xaa]
> > -; SKX-NEXT:    ## xmm1 =
> xmm1[0],xmm2[1],xmm1[2],xmm2[3],xmm1[4],xmm2[5],xmm1[6],xmm2[7]
> > -; SKX-NEXT:    vpblendw $170, %xmm2, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe3,0x79,0x0e,0xc2,0xaa]
> > -; SKX-NEXT:    ## xmm0 =
> xmm0[0],xmm2[1],xmm0[2],xmm2[3],xmm0[4],xmm2[5],xmm0[6],xmm2[7]
> > -; SKX-NEXT:    vpcmpeqd %xmm1, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0x76,0xc1]
> > +; SKX-NEXT:    vpcmpeqw %xmm1, %xmm0, %k0 ## encoding:
> [0x62,0xf1,0x7d,0x08,0x75,0xc1]
> > +; SKX-NEXT:    vpmovm2d %k0, %xmm0 ## encoding:
> [0x62,0xf2,0x7e,0x08,0x38,0xc0]
> >  ; SKX-NEXT:    retq ## encoding: [0xc3]
> >    %mask = icmp eq <4 x i16> %x, %y
> >    %1 = sext <4 x i1> %mask to <4 x i32>
> > @@ -911,23 +903,17 @@ define <4 x i32> @test44(<4 x i16> %x, <
> >  define <2 x i64> @test45(<2 x i16> %x, <2 x i16> %y) #0 {
> >  ; AVX512-LABEL: test45:
> >  ; AVX512:       ## %bb.0:
> > -; AVX512-NEXT:    vpxor %xmm2, %xmm2, %xmm2 ## encoding:
> [0xc5,0xe9,0xef,0xd2]
> > -; AVX512-NEXT:    vpblendw $17, %xmm1, %xmm2, %xmm1 ## encoding:
> [0xc4,0xe3,0x69,0x0e,0xc9,0x11]
> > -; AVX512-NEXT:    ## xmm1 = xmm1[0],xmm2[1,2,3],xmm1[4],xmm2[5,6,7]
> > -; AVX512-NEXT:    vpblendw $17, %xmm0, %xmm2, %xmm0 ## encoding:
> [0xc4,0xe3,0x69,0x0e,0xc0,0x11]
> > -; AVX512-NEXT:    ## xmm0 = xmm0[0],xmm2[1,2,3],xmm0[4],xmm2[5,6,7]
> > -; AVX512-NEXT:    vpcmpeqq %xmm1, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe2,0x79,0x29,0xc1]
> > -; AVX512-NEXT:    vpsrlq $63, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0x73,0xd0,0x3f]
> > +; AVX512-NEXT:    vpcmpeqw %xmm1, %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0x75,0xc1]
> > +; AVX512-NEXT:    vpmovzxwq %xmm0, %xmm0 ## encoding:
> [0xc4,0xe2,0x79,0x34,0xc0]
> > +; AVX512-NEXT:    ## xmm0 =
> xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero
> > +; AVX512-NEXT:    vpand {{.*}}(%rip), %xmm0, %xmm0 ## encoding:
> [0xc5,0xf9,0xdb,0x05,A,A,A,A]
> > +; AVX512-NEXT:    ## fixup A - offset: 4, value: LCPI46_0-4, kind:
> reloc_riprel_4byte
> >  ; AVX512-NEXT:    retq ## encoding: [0xc3]
> >  ;
> >  ; SKX-LABEL: test45:
> >  ; SKX:       ## %bb.0:
> > -; SKX-NEXT:    vpxor %xmm2, %xmm2, %xmm2 ## EVEX TO VEX Compression
> encoding: [0xc5,0xe9,0xef,0xd2]
> > -; SKX-NEXT:    vpblendw $17, %xmm1, %xmm2, %xmm1 ## encoding:
> [0xc4,0xe3,0x69,0x0e,0xc9,0x11]
> > -; SKX-NEXT:    ## xmm1 = xmm1[0],xmm2[1,2,3],xmm1[4],xmm2[5,6,7]
> > -; SKX-NEXT:    vpblendw $17, %xmm0, %xmm2, %xmm0 ## encoding:
> [0xc4,0xe3,0x69,0x0e,0xc0,0x11]
> > -; SKX-NEXT:    ## xmm0 = xmm0[0],xmm2[1,2,3],xmm0[4],xmm2[5,6,7]
> > -; SKX-NEXT:    vpcmpeqq %xmm1, %xmm0, %xmm0 ## encoding:
> [0xc4,0xe2,0x79,0x29,0xc1]
> > +; SKX-NEXT:    vpcmpeqw %xmm1, %xmm0, %k0 ## encoding:
> [0x62,0xf1,0x7d,0x08,0x75,0xc1]
> > +; SKX-NEXT:    vpmovm2q %k0, %xmm0 ## encoding:
> [0x62,0xf2,0xfe,0x08,0x38,0xc0]
> >  ; SKX-NEXT:    vpsrlq $63, %xmm0, %xmm0 ## EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0x73,0xd0,0x3f]
> >  ; SKX-NEXT:    retq ## encoding: [0xc3]
> >    %mask = icmp eq <2 x i16> %x, %y
> >
> > Modified: llvm/trunk/test/CodeGen/X86/avx512-vec3-crash.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/avx512-vec3-crash.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/avx512-vec3-crash.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/avx512-vec3-crash.ll Wed Aug  7 09:24:26
> 2019
> > @@ -6,19 +6,15 @@ define <3 x i8 > @foo(<3 x i8>%x, <3 x i
> >  ; CHECK-LABEL: foo:
> >  ; CHECK:       # %bb.0:
> >  ; CHECK-NEXT:    vmovd %edi, %xmm0
> > -; CHECK-NEXT:    vpinsrd $1, %esi, %xmm0, %xmm0
> > -; CHECK-NEXT:    vpinsrd $2, %edx, %xmm0, %xmm0
> > -; CHECK-NEXT:    vpslld $24, %xmm0, %xmm0
> > +; CHECK-NEXT:    vpinsrb $1, %esi, %xmm0, %xmm0
> > +; CHECK-NEXT:    vpinsrb $2, %edx, %xmm0, %xmm0
> >  ; CHECK-NEXT:    vmovd %ecx, %xmm1
> > -; CHECK-NEXT:    vpinsrd $1, %r8d, %xmm1, %xmm1
> > -; CHECK-NEXT:    vpsrad $24, %xmm0, %xmm0
> > -; CHECK-NEXT:    vpinsrd $2, %r9d, %xmm1, %xmm1
> > -; CHECK-NEXT:    vpslld $24, %xmm1, %xmm1
> > -; CHECK-NEXT:    vpsrad $24, %xmm1, %xmm1
> > -; CHECK-NEXT:    vpcmpgtd %xmm0, %xmm1, %xmm0
> > +; CHECK-NEXT:    vpinsrb $1, %r8d, %xmm1, %xmm1
> > +; CHECK-NEXT:    vpinsrb $2, %r9d, %xmm1, %xmm1
> > +; CHECK-NEXT:    vpcmpgtb %xmm0, %xmm1, %xmm0
> >  ; CHECK-NEXT:    vpextrb $0, %xmm0, %eax
> > -; CHECK-NEXT:    vpextrb $4, %xmm0, %edx
> > -; CHECK-NEXT:    vpextrb $8, %xmm0, %ecx
> > +; CHECK-NEXT:    vpextrb $1, %xmm0, %edx
> > +; CHECK-NEXT:    vpextrb $2, %xmm0, %ecx
> >  ; CHECK-NEXT:    # kill: def $al killed $al killed $eax
> >  ; CHECK-NEXT:    # kill: def $dl killed $dl killed $edx
> >  ; CHECK-NEXT:    # kill: def $cl killed $cl killed $ecx
> >
> > Modified: llvm/trunk/test/CodeGen/X86/avx512bwvl-intrinsics-upgrade.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/avx512bwvl-intrinsics-upgrade.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/avx512bwvl-intrinsics-upgrade.ll
> (original)
> > +++ llvm/trunk/test/CodeGen/X86/avx512bwvl-intrinsics-upgrade.ll Wed
> Aug  7 09:24:26 2019
> > @@ -5133,19 +5133,19 @@ define <8 x i8> @test_cmp_w_128(<8 x i16
> >  ; CHECK-NEXT:    vpcmpgtw %xmm1, %xmm0, %k5 # encoding:
> [0x62,0xf1,0x7d,0x08,0x65,0xe9]
> >  ; CHECK-NEXT:    kmovd %k0, %eax # encoding: [0xc5,0xfb,0x93,0xc0]
> >  ; CHECK-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xef,0xc0]
> > -; CHECK-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # EVEX TO VEX
> Compression encoding: [0xc5,0xf9,0xc4,0xc0,0x00]
> > +; CHECK-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # EVEX TO VEX
> Compression encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x00]
> >  ; CHECK-NEXT:    kmovd %k1, %eax # encoding: [0xc5,0xfb,0x93,0xc1]
> > -; CHECK-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # EVEX TO VEX
> Compression encoding: [0xc5,0xf9,0xc4,0xc0,0x01]
> > +; CHECK-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # EVEX TO VEX
> Compression encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x01]
> >  ; CHECK-NEXT:    kmovd %k2, %eax # encoding: [0xc5,0xfb,0x93,0xc2]
> > -; CHECK-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # EVEX TO VEX
> Compression encoding: [0xc5,0xf9,0xc4,0xc0,0x02]
> > +; CHECK-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # EVEX TO VEX
> Compression encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x02]
> >  ; CHECK-NEXT:    kmovd %k3, %eax # encoding: [0xc5,0xfb,0x93,0xc3]
> > -; CHECK-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # EVEX TO VEX
> Compression encoding: [0xc5,0xf9,0xc4,0xc0,0x04]
> > +; CHECK-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # EVEX TO VEX
> Compression encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x04]
> >  ; CHECK-NEXT:    kmovd %k4, %eax # encoding: [0xc5,0xfb,0x93,0xc4]
> > -; CHECK-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # EVEX TO VEX
> Compression encoding: [0xc5,0xf9,0xc4,0xc0,0x05]
> > +; CHECK-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # EVEX TO VEX
> Compression encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x05]
> >  ; CHECK-NEXT:    kmovd %k5, %eax # encoding: [0xc5,0xfb,0x93,0xc5]
> > -; CHECK-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # EVEX TO VEX
> Compression encoding: [0xc5,0xf9,0xc4,0xc0,0x06]
> > +; CHECK-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # EVEX TO VEX
> Compression encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x06]
> >  ; CHECK-NEXT:    movl $255, %eax # encoding: [0xb8,0xff,0x00,0x00,0x00]
> > -; CHECK-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # EVEX TO VEX
> Compression encoding: [0xc5,0xf9,0xc4,0xc0,0x07]
> > +; CHECK-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # EVEX TO VEX
> Compression encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x07]
> >  ; CHECK-NEXT:    ret{{[l|q]}} # encoding: [0xc3]
> >    %res0 = call i8 @llvm.x86.avx512.mask.cmp.w.128(<8 x i16> %a0, <8 x
> i16> %a1, i32 0, i8 -1)
> >    %vec0 = insertelement <8 x i8> undef, i8 %res0, i32 0
> > @@ -5169,7 +5169,7 @@ define <8 x i8> @test_cmp_w_128(<8 x i16
> >  define <8 x i8> @test_mask_cmp_w_128(<8 x i16> %a0, <8 x i16> %a1, i8
> %mask) {
> >  ; X86-LABEL: test_mask_cmp_w_128:
> >  ; X86:       # %bb.0:
> > -; X86-NEXT:    movzwl {{[0-9]+}}(%esp), %eax # encoding:
> [0x0f,0xb7,0x44,0x24,0x04]
> > +; X86-NEXT:    movl {{[0-9]+}}(%esp), %eax # encoding:
> [0x8b,0x44,0x24,0x04]
> >  ; X86-NEXT:    kmovd %eax, %k1 # encoding: [0xc5,0xfb,0x92,0xc8]
> >  ; X86-NEXT:    vpcmpeqw %xmm1, %xmm0, %k0 {%k1} # encoding:
> [0x62,0xf1,0x7d,0x09,0x75,0xc1]
> >  ; X86-NEXT:    vpcmpgtw %xmm0, %xmm1, %k2 {%k1} # encoding:
> [0x62,0xf1,0x75,0x09,0x65,0xd0]
> > @@ -5179,18 +5179,18 @@ define <8 x i8> @test_mask_cmp_w_128(<8
> >  ; X86-NEXT:    vpcmpgtw %xmm1, %xmm0, %k1 {%k1} # encoding:
> [0x62,0xf1,0x7d,0x09,0x65,0xc9]
> >  ; X86-NEXT:    kmovd %k0, %ecx # encoding: [0xc5,0xfb,0x93,0xc8]
> >  ; X86-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xef,0xc0]
> > -; X86-NEXT:    vpinsrw $0, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xc4,0xc1,0x00]
> > +; X86-NEXT:    vpinsrb $0, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x00]
> >  ; X86-NEXT:    kmovd %k2, %ecx # encoding: [0xc5,0xfb,0x93,0xca]
> > -; X86-NEXT:    vpinsrw $1, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xc4,0xc1,0x01]
> > +; X86-NEXT:    vpinsrb $1, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x01]
> >  ; X86-NEXT:    kmovd %k3, %ecx # encoding: [0xc5,0xfb,0x93,0xcb]
> > -; X86-NEXT:    vpinsrw $2, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xc4,0xc1,0x02]
> > +; X86-NEXT:    vpinsrb $2, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x02]
> >  ; X86-NEXT:    kmovd %k4, %ecx # encoding: [0xc5,0xfb,0x93,0xcc]
> > -; X86-NEXT:    vpinsrw $4, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xc4,0xc1,0x04]
> > +; X86-NEXT:    vpinsrb $4, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x04]
> >  ; X86-NEXT:    kmovd %k5, %ecx # encoding: [0xc5,0xfb,0x93,0xcd]
> > -; X86-NEXT:    vpinsrw $5, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xc4,0xc1,0x05]
> > +; X86-NEXT:    vpinsrb $5, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x05]
> >  ; X86-NEXT:    kmovd %k1, %ecx # encoding: [0xc5,0xfb,0x93,0xc9]
> > -; X86-NEXT:    vpinsrw $6, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xc4,0xc1,0x06]
> > -; X86-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xc4,0xc0,0x07]
> > +; X86-NEXT:    vpinsrb $6, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x06]
> > +; X86-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x07]
> >  ; X86-NEXT:    retl # encoding: [0xc3]
> >  ;
> >  ; X64-LABEL: test_mask_cmp_w_128:
> > @@ -5204,18 +5204,18 @@ define <8 x i8> @test_mask_cmp_w_128(<8
> >  ; X64-NEXT:    vpcmpgtw %xmm1, %xmm0, %k1 {%k1} # encoding:
> [0x62,0xf1,0x7d,0x09,0x65,0xc9]
> >  ; X64-NEXT:    kmovd %k0, %eax # encoding: [0xc5,0xfb,0x93,0xc0]
> >  ; X64-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xef,0xc0]
> > -; X64-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xc4,0xc0,0x00]
> > +; X64-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x00]
> >  ; X64-NEXT:    kmovd %k2, %eax # encoding: [0xc5,0xfb,0x93,0xc2]
> > -; X64-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xc4,0xc0,0x01]
> > +; X64-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x01]
> >  ; X64-NEXT:    kmovd %k3, %eax # encoding: [0xc5,0xfb,0x93,0xc3]
> > -; X64-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xc4,0xc0,0x02]
> > +; X64-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x02]
> >  ; X64-NEXT:    kmovd %k4, %eax # encoding: [0xc5,0xfb,0x93,0xc4]
> > -; X64-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xc4,0xc0,0x04]
> > +; X64-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x04]
> >  ; X64-NEXT:    kmovd %k5, %eax # encoding: [0xc5,0xfb,0x93,0xc5]
> > -; X64-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xc4,0xc0,0x05]
> > +; X64-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x05]
> >  ; X64-NEXT:    kmovd %k1, %eax # encoding: [0xc5,0xfb,0x93,0xc1]
> > -; X64-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xc4,0xc0,0x06]
> > -; X64-NEXT:    vpinsrw $7, %edi, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xc4,0xc7,0x07]
> > +; X64-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x06]
> > +; X64-NEXT:    vpinsrb $7, %edi, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc4,0xe3,0x79,0x20,0xc7,0x07]
> >  ; X64-NEXT:    retq # encoding: [0xc3]
> >    %res0 = call i8 @llvm.x86.avx512.mask.cmp.w.128(<8 x i16> %a0, <8 x
> i16> %a1, i32 0, i8 %mask)
> >    %vec0 = insertelement <8 x i8> undef, i8 %res0, i32 0
> > @@ -5249,19 +5249,19 @@ define <8 x i8> @test_ucmp_w_128(<8 x i1
> >  ; CHECK-NEXT:    vpcmpnleuw %xmm1, %xmm0, %k5 # encoding:
> [0x62,0xf3,0xfd,0x08,0x3e,0xe9,0x06]
> >  ; CHECK-NEXT:    kmovd %k0, %eax # encoding: [0xc5,0xfb,0x93,0xc0]
> >  ; CHECK-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xef,0xc0]
> > -; CHECK-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # EVEX TO VEX
> Compression encoding: [0xc5,0xf9,0xc4,0xc0,0x00]
> > +; CHECK-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # EVEX TO VEX
> Compression encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x00]
> >  ; CHECK-NEXT:    kmovd %k1, %eax # encoding: [0xc5,0xfb,0x93,0xc1]
> > -; CHECK-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # EVEX TO VEX
> Compression encoding: [0xc5,0xf9,0xc4,0xc0,0x01]
> > +; CHECK-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # EVEX TO VEX
> Compression encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x01]
> >  ; CHECK-NEXT:    kmovd %k2, %eax # encoding: [0xc5,0xfb,0x93,0xc2]
> > -; CHECK-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # EVEX TO VEX
> Compression encoding: [0xc5,0xf9,0xc4,0xc0,0x02]
> > +; CHECK-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # EVEX TO VEX
> Compression encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x02]
> >  ; CHECK-NEXT:    kmovd %k3, %eax # encoding: [0xc5,0xfb,0x93,0xc3]
> > -; CHECK-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # EVEX TO VEX
> Compression encoding: [0xc5,0xf9,0xc4,0xc0,0x04]
> > +; CHECK-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # EVEX TO VEX
> Compression encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x04]
> >  ; CHECK-NEXT:    kmovd %k4, %eax # encoding: [0xc5,0xfb,0x93,0xc4]
> > -; CHECK-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # EVEX TO VEX
> Compression encoding: [0xc5,0xf9,0xc4,0xc0,0x05]
> > +; CHECK-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # EVEX TO VEX
> Compression encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x05]
> >  ; CHECK-NEXT:    kmovd %k5, %eax # encoding: [0xc5,0xfb,0x93,0xc5]
> > -; CHECK-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # EVEX TO VEX
> Compression encoding: [0xc5,0xf9,0xc4,0xc0,0x06]
> > +; CHECK-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # EVEX TO VEX
> Compression encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x06]
> >  ; CHECK-NEXT:    movl $255, %eax # encoding: [0xb8,0xff,0x00,0x00,0x00]
> > -; CHECK-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # EVEX TO VEX
> Compression encoding: [0xc5,0xf9,0xc4,0xc0,0x07]
> > +; CHECK-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # EVEX TO VEX
> Compression encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x07]
> >  ; CHECK-NEXT:    ret{{[l|q]}} # encoding: [0xc3]
> >    %res0 = call i8 @llvm.x86.avx512.mask.ucmp.w.128(<8 x i16> %a0, <8 x
> i16> %a1, i32 0, i8 -1)
> >    %vec0 = insertelement <8 x i8> undef, i8 %res0, i32 0
> > @@ -5285,7 +5285,7 @@ define <8 x i8> @test_ucmp_w_128(<8 x i1
> >  define <8 x i8> @test_mask_ucmp_w_128(<8 x i16> %a0, <8 x i16> %a1, i8
> %mask) {
> >  ; X86-LABEL: test_mask_ucmp_w_128:
> >  ; X86:       # %bb.0:
> > -; X86-NEXT:    movzwl {{[0-9]+}}(%esp), %eax # encoding:
> [0x0f,0xb7,0x44,0x24,0x04]
> > +; X86-NEXT:    movl {{[0-9]+}}(%esp), %eax # encoding:
> [0x8b,0x44,0x24,0x04]
> >  ; X86-NEXT:    kmovd %eax, %k1 # encoding: [0xc5,0xfb,0x92,0xc8]
> >  ; X86-NEXT:    vpcmpeqw %xmm1, %xmm0, %k0 {%k1} # encoding:
> [0x62,0xf1,0x7d,0x09,0x75,0xc1]
> >  ; X86-NEXT:    vpcmpltuw %xmm1, %xmm0, %k2 {%k1} # encoding:
> [0x62,0xf3,0xfd,0x09,0x3e,0xd1,0x01]
> > @@ -5295,18 +5295,18 @@ define <8 x i8> @test_mask_ucmp_w_128(<8
> >  ; X86-NEXT:    vpcmpnleuw %xmm1, %xmm0, %k1 {%k1} # encoding:
> [0x62,0xf3,0xfd,0x09,0x3e,0xc9,0x06]
> >  ; X86-NEXT:    kmovd %k0, %ecx # encoding: [0xc5,0xfb,0x93,0xc8]
> >  ; X86-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xef,0xc0]
> > -; X86-NEXT:    vpinsrw $0, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xc4,0xc1,0x00]
> > +; X86-NEXT:    vpinsrb $0, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x00]
> >  ; X86-NEXT:    kmovd %k2, %ecx # encoding: [0xc5,0xfb,0x93,0xca]
> > -; X86-NEXT:    vpinsrw $1, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xc4,0xc1,0x01]
> > +; X86-NEXT:    vpinsrb $1, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x01]
> >  ; X86-NEXT:    kmovd %k3, %ecx # encoding: [0xc5,0xfb,0x93,0xcb]
> > -; X86-NEXT:    vpinsrw $2, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xc4,0xc1,0x02]
> > +; X86-NEXT:    vpinsrb $2, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x02]
> >  ; X86-NEXT:    kmovd %k4, %ecx # encoding: [0xc5,0xfb,0x93,0xcc]
> > -; X86-NEXT:    vpinsrw $4, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xc4,0xc1,0x04]
> > +; X86-NEXT:    vpinsrb $4, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x04]
> >  ; X86-NEXT:    kmovd %k5, %ecx # encoding: [0xc5,0xfb,0x93,0xcd]
> > -; X86-NEXT:    vpinsrw $5, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xc4,0xc1,0x05]
> > +; X86-NEXT:    vpinsrb $5, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x05]
> >  ; X86-NEXT:    kmovd %k1, %ecx # encoding: [0xc5,0xfb,0x93,0xc9]
> > -; X86-NEXT:    vpinsrw $6, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xc4,0xc1,0x06]
> > -; X86-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xc4,0xc0,0x07]
> > +; X86-NEXT:    vpinsrb $6, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x06]
> > +; X86-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x07]
> >  ; X86-NEXT:    retl # encoding: [0xc3]
> >  ;
> >  ; X64-LABEL: test_mask_ucmp_w_128:
> > @@ -5320,18 +5320,18 @@ define <8 x i8> @test_mask_ucmp_w_128(<8
> >  ; X64-NEXT:    vpcmpnleuw %xmm1, %xmm0, %k1 {%k1} # encoding:
> [0x62,0xf3,0xfd,0x09,0x3e,0xc9,0x06]
> >  ; X64-NEXT:    kmovd %k0, %eax # encoding: [0xc5,0xfb,0x93,0xc0]
> >  ; X64-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xef,0xc0]
> > -; X64-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xc4,0xc0,0x00]
> > +; X64-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x00]
> >  ; X64-NEXT:    kmovd %k2, %eax # encoding: [0xc5,0xfb,0x93,0xc2]
> > -; X64-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xc4,0xc0,0x01]
> > +; X64-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x01]
> >  ; X64-NEXT:    kmovd %k3, %eax # encoding: [0xc5,0xfb,0x93,0xc3]
> > -; X64-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xc4,0xc0,0x02]
> > +; X64-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x02]
> >  ; X64-NEXT:    kmovd %k4, %eax # encoding: [0xc5,0xfb,0x93,0xc4]
> > -; X64-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xc4,0xc0,0x04]
> > +; X64-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x04]
> >  ; X64-NEXT:    kmovd %k5, %eax # encoding: [0xc5,0xfb,0x93,0xc5]
> > -; X64-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xc4,0xc0,0x05]
> > +; X64-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x05]
> >  ; X64-NEXT:    kmovd %k1, %eax # encoding: [0xc5,0xfb,0x93,0xc1]
> > -; X64-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xc4,0xc0,0x06]
> > -; X64-NEXT:    vpinsrw $7, %edi, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xc4,0xc7,0x07]
> > +; X64-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x06]
> > +; X64-NEXT:    vpinsrb $7, %edi, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc4,0xe3,0x79,0x20,0xc7,0x07]
> >  ; X64-NEXT:    retq # encoding: [0xc3]
> >    %res0 = call i8 @llvm.x86.avx512.mask.ucmp.w.128(<8 x i16> %a0, <8 x
> i16> %a1, i32 0, i8 %mask)
> >    %vec0 = insertelement <8 x i8> undef, i8 %res0, i32 0
> >
> > Modified: llvm/trunk/test/CodeGen/X86/avx512vl-intrinsics-fast-isel.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/avx512vl-intrinsics-fast-isel.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/avx512vl-intrinsics-fast-isel.ll
> (original)
> > +++ llvm/trunk/test/CodeGen/X86/avx512vl-intrinsics-fast-isel.ll Wed
> Aug  7 09:24:26 2019
> > @@ -3326,6 +3326,8 @@ define <2 x i64> @test_mm256_cvtepi64_ep
> >  ; CHECK-LABEL: test_mm256_cvtepi64_epi8:
> >  ; CHECK:       # %bb.0: # %entry
> >  ; CHECK-NEXT:    vpmovqb %ymm0, %xmm0
> > +; CHECK-NEXT:    vpxor %xmm1, %xmm1, %xmm1
> > +; CHECK-NEXT:    vpblendw {{.*#+}} xmm0 = xmm0[0,1],xmm1[2,3,4,5,6,7]
> >  ; CHECK-NEXT:    vzeroupper
> >  ; CHECK-NEXT:    ret{{[l|q]}}
> >  entry:
> > @@ -3339,6 +3341,7 @@ define <2 x i64> @test_mm256_cvtepi64_ep
> >  ; CHECK-LABEL: test_mm256_cvtepi64_epi16:
> >  ; CHECK:       # %bb.0: # %entry
> >  ; CHECK-NEXT:    vpmovqw %ymm0, %xmm0
> > +; CHECK-NEXT:    vmovq {{.*#+}} xmm0 = xmm0[0],zero
> >  ; CHECK-NEXT:    vzeroupper
> >  ; CHECK-NEXT:    ret{{[l|q]}}
> >  entry:
> > @@ -3352,6 +3355,7 @@ define <2 x i64> @test_mm256_cvtepi32_ep
> >  ; CHECK-LABEL: test_mm256_cvtepi32_epi8:
> >  ; CHECK:       # %bb.0: # %entry
> >  ; CHECK-NEXT:    vpmovdb %ymm0, %xmm0
> > +; CHECK-NEXT:    vmovq {{.*#+}} xmm0 = xmm0[0],zero
> >  ; CHECK-NEXT:    vzeroupper
> >  ; CHECK-NEXT:    ret{{[l|q]}}
> >  entry:
> >
> > Modified: llvm/trunk/test/CodeGen/X86/avx512vl-intrinsics-upgrade.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/avx512vl-intrinsics-upgrade.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/avx512vl-intrinsics-upgrade.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/avx512vl-intrinsics-upgrade.ll Wed Aug
> 7 09:24:26 2019
> > @@ -8069,19 +8069,19 @@ define <8 x i8> @test_cmp_d_256(<8 x i32
> >  ; CHECK-NEXT:    vpcmpgtd %ymm1, %ymm0, %k5 # encoding:
> [0x62,0xf1,0x7d,0x28,0x66,0xe9]
> >  ; CHECK-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
> >  ; CHECK-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xef,0xc0]
> > -; CHECK-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x00]
> > +; CHECK-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x00]
> >  ; CHECK-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> > -; CHECK-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x01]
> > +; CHECK-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x01]
> >  ; CHECK-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> > -; CHECK-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x02]
> > +; CHECK-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x02]
> >  ; CHECK-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> > -; CHECK-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x04]
> > +; CHECK-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x04]
> >  ; CHECK-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> > -; CHECK-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x05]
> > +; CHECK-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x05]
> >  ; CHECK-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> > -; CHECK-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x06]
> > +; CHECK-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x06]
> >  ; CHECK-NEXT:    movl $255, %eax # encoding: [0xb8,0xff,0x00,0x00,0x00]
> > -; CHECK-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x07]
> > +; CHECK-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x07]
> >  ; CHECK-NEXT:    vzeroupper # encoding: [0xc5,0xf8,0x77]
> >  ; CHECK-NEXT:    ret{{[l|q]}} # encoding: [0xc3]
> >    %res0 = call i8 @llvm.x86.avx512.mask.cmp.d.256(<8 x i32> %a0, <8 x
> i32> %a1, i32 0, i8 -1)
> > @@ -8106,7 +8106,7 @@ define <8 x i8> @test_cmp_d_256(<8 x i32
> >  define <8 x i8> @test_mask_cmp_d_256(<8 x i32> %a0, <8 x i32> %a1, i8
> %mask) {
> >  ; X86-LABEL: test_mask_cmp_d_256:
> >  ; X86:       # %bb.0:
> > -; X86-NEXT:    movzwl {{[0-9]+}}(%esp), %eax # encoding:
> [0x0f,0xb7,0x44,0x24,0x04]
> > +; X86-NEXT:    movl {{[0-9]+}}(%esp), %eax # encoding:
> [0x8b,0x44,0x24,0x04]
> >  ; X86-NEXT:    kmovw %eax, %k1 # encoding: [0xc5,0xf8,0x92,0xc8]
> >  ; X86-NEXT:    vpcmpeqd %ymm1, %ymm0, %k0 {%k1} # encoding:
> [0x62,0xf1,0x7d,0x29,0x76,0xc1]
> >  ; X86-NEXT:    vpcmpgtd %ymm0, %ymm1, %k2 {%k1} # encoding:
> [0x62,0xf1,0x75,0x29,0x66,0xd0]
> > @@ -8116,18 +8116,18 @@ define <8 x i8> @test_mask_cmp_d_256(<8
> >  ; X86-NEXT:    vpcmpgtd %ymm1, %ymm0, %k1 {%k1} # encoding:
> [0x62,0xf1,0x7d,0x29,0x66,0xc9]
> >  ; X86-NEXT:    kmovw %k0, %ecx # encoding: [0xc5,0xf8,0x93,0xc8]
> >  ; X86-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xef,0xc0]
> > -; X86-NEXT:    vpinsrw $0, %ecx, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc1,0x00]
> > +; X86-NEXT:    vpinsrb $0, %ecx, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc1,0x00]
> >  ; X86-NEXT:    kmovw %k2, %ecx # encoding: [0xc5,0xf8,0x93,0xca]
> > -; X86-NEXT:    vpinsrw $1, %ecx, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc1,0x01]
> > +; X86-NEXT:    vpinsrb $1, %ecx, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc1,0x01]
> >  ; X86-NEXT:    kmovw %k3, %ecx # encoding: [0xc5,0xf8,0x93,0xcb]
> > -; X86-NEXT:    vpinsrw $2, %ecx, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc1,0x02]
> > +; X86-NEXT:    vpinsrb $2, %ecx, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc1,0x02]
> >  ; X86-NEXT:    kmovw %k4, %ecx # encoding: [0xc5,0xf8,0x93,0xcc]
> > -; X86-NEXT:    vpinsrw $4, %ecx, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc1,0x04]
> > +; X86-NEXT:    vpinsrb $4, %ecx, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc1,0x04]
> >  ; X86-NEXT:    kmovw %k5, %ecx # encoding: [0xc5,0xf8,0x93,0xcd]
> > -; X86-NEXT:    vpinsrw $5, %ecx, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc1,0x05]
> > +; X86-NEXT:    vpinsrb $5, %ecx, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc1,0x05]
> >  ; X86-NEXT:    kmovw %k1, %ecx # encoding: [0xc5,0xf8,0x93,0xc9]
> > -; X86-NEXT:    vpinsrw $6, %ecx, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc1,0x06]
> > -; X86-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x07]
> > +; X86-NEXT:    vpinsrb $6, %ecx, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc1,0x06]
> > +; X86-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x07]
> >  ; X86-NEXT:    vzeroupper # encoding: [0xc5,0xf8,0x77]
> >  ; X86-NEXT:    retl # encoding: [0xc3]
> >  ;
> > @@ -8142,18 +8142,18 @@ define <8 x i8> @test_mask_cmp_d_256(<8
> >  ; X64-NEXT:    vpcmpgtd %ymm1, %ymm0, %k1 {%k1} # encoding:
> [0x62,0xf1,0x7d,0x29,0x66,0xc9]
> >  ; X64-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
> >  ; X64-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xef,0xc0]
> > -; X64-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x00]
> > +; X64-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x00]
> >  ; X64-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> > -; X64-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x01]
> > +; X64-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x01]
> >  ; X64-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> > -; X64-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x02]
> > +; X64-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x02]
> >  ; X64-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> > -; X64-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x04]
> > +; X64-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x04]
> >  ; X64-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> > -; X64-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x05]
> > +; X64-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x05]
> >  ; X64-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> > -; X64-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x06]
> > -; X64-NEXT:    vpinsrw $7, %edi, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc7,0x07]
> > +; X64-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x06]
> > +; X64-NEXT:    vpinsrb $7, %edi, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc7,0x07]
> >  ; X64-NEXT:    vzeroupper # encoding: [0xc5,0xf8,0x77]
> >  ; X64-NEXT:    retq # encoding: [0xc3]
> >    %res0 = call i8 @llvm.x86.avx512.mask.cmp.d.256(<8 x i32> %a0, <8 x
> i32> %a1, i32 0, i8 %mask)
> > @@ -8188,19 +8188,19 @@ define <8 x i8> @test_ucmp_d_256(<8 x i3
> >  ; CHECK-NEXT:    vpcmpnleud %ymm1, %ymm0, %k5 # encoding:
> [0x62,0xf3,0x7d,0x28,0x1e,0xe9,0x06]
> >  ; CHECK-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
> >  ; CHECK-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xef,0xc0]
> > -; CHECK-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x00]
> > +; CHECK-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x00]
> >  ; CHECK-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> > -; CHECK-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x01]
> > +; CHECK-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x01]
> >  ; CHECK-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> > -; CHECK-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x02]
> > +; CHECK-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x02]
> >  ; CHECK-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> > -; CHECK-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x04]
> > +; CHECK-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x04]
> >  ; CHECK-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> > -; CHECK-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x05]
> > +; CHECK-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x05]
> >  ; CHECK-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> > -; CHECK-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x06]
> > +; CHECK-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x06]
> >  ; CHECK-NEXT:    movl $255, %eax # encoding: [0xb8,0xff,0x00,0x00,0x00]
> > -; CHECK-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x07]
> > +; CHECK-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x07]
> >  ; CHECK-NEXT:    vzeroupper # encoding: [0xc5,0xf8,0x77]
> >  ; CHECK-NEXT:    ret{{[l|q]}} # encoding: [0xc3]
> >    %res0 = call i8 @llvm.x86.avx512.mask.ucmp.d.256(<8 x i32> %a0, <8 x
> i32> %a1, i32 0, i8 -1)
> > @@ -8225,7 +8225,7 @@ define <8 x i8> @test_ucmp_d_256(<8 x i3
> >  define <8 x i8> @test_mask_ucmp_d_256(<8 x i32> %a0, <8 x i32> %a1, i8
> %mask) {
> >  ; X86-LABEL: test_mask_ucmp_d_256:
> >  ; X86:       # %bb.0:
> > -; X86-NEXT:    movzwl {{[0-9]+}}(%esp), %eax # encoding:
> [0x0f,0xb7,0x44,0x24,0x04]
> > +; X86-NEXT:    movl {{[0-9]+}}(%esp), %eax # encoding:
> [0x8b,0x44,0x24,0x04]
> >  ; X86-NEXT:    kmovw %eax, %k1 # encoding: [0xc5,0xf8,0x92,0xc8]
> >  ; X86-NEXT:    vpcmpeqd %ymm1, %ymm0, %k0 {%k1} # encoding:
> [0x62,0xf1,0x7d,0x29,0x76,0xc1]
> >  ; X86-NEXT:    vpcmpltud %ymm1, %ymm0, %k2 {%k1} # encoding:
> [0x62,0xf3,0x7d,0x29,0x1e,0xd1,0x01]
> > @@ -8235,18 +8235,18 @@ define <8 x i8> @test_mask_ucmp_d_256(<8
> >  ; X86-NEXT:    vpcmpnleud %ymm1, %ymm0, %k1 {%k1} # encoding:
> [0x62,0xf3,0x7d,0x29,0x1e,0xc9,0x06]
> >  ; X86-NEXT:    kmovw %k0, %ecx # encoding: [0xc5,0xf8,0x93,0xc8]
> >  ; X86-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xef,0xc0]
> > -; X86-NEXT:    vpinsrw $0, %ecx, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc1,0x00]
> > +; X86-NEXT:    vpinsrb $0, %ecx, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc1,0x00]
> >  ; X86-NEXT:    kmovw %k2, %ecx # encoding: [0xc5,0xf8,0x93,0xca]
> > -; X86-NEXT:    vpinsrw $1, %ecx, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc1,0x01]
> > +; X86-NEXT:    vpinsrb $1, %ecx, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc1,0x01]
> >  ; X86-NEXT:    kmovw %k3, %ecx # encoding: [0xc5,0xf8,0x93,0xcb]
> > -; X86-NEXT:    vpinsrw $2, %ecx, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc1,0x02]
> > +; X86-NEXT:    vpinsrb $2, %ecx, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc1,0x02]
> >  ; X86-NEXT:    kmovw %k4, %ecx # encoding: [0xc5,0xf8,0x93,0xcc]
> > -; X86-NEXT:    vpinsrw $4, %ecx, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc1,0x04]
> > +; X86-NEXT:    vpinsrb $4, %ecx, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc1,0x04]
> >  ; X86-NEXT:    kmovw %k5, %ecx # encoding: [0xc5,0xf8,0x93,0xcd]
> > -; X86-NEXT:    vpinsrw $5, %ecx, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc1,0x05]
> > +; X86-NEXT:    vpinsrb $5, %ecx, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc1,0x05]
> >  ; X86-NEXT:    kmovw %k1, %ecx # encoding: [0xc5,0xf8,0x93,0xc9]
> > -; X86-NEXT:    vpinsrw $6, %ecx, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc1,0x06]
> > -; X86-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x07]
> > +; X86-NEXT:    vpinsrb $6, %ecx, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc1,0x06]
> > +; X86-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x07]
> >  ; X86-NEXT:    vzeroupper # encoding: [0xc5,0xf8,0x77]
> >  ; X86-NEXT:    retl # encoding: [0xc3]
> >  ;
> > @@ -8261,18 +8261,18 @@ define <8 x i8> @test_mask_ucmp_d_256(<8
> >  ; X64-NEXT:    vpcmpnleud %ymm1, %ymm0, %k1 {%k1} # encoding:
> [0x62,0xf3,0x7d,0x29,0x1e,0xc9,0x06]
> >  ; X64-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
> >  ; X64-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xef,0xc0]
> > -; X64-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x00]
> > +; X64-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x00]
> >  ; X64-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> > -; X64-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x01]
> > +; X64-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x01]
> >  ; X64-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> > -; X64-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x02]
> > +; X64-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x02]
> >  ; X64-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> > -; X64-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x04]
> > +; X64-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x04]
> >  ; X64-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> > -; X64-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x05]
> > +; X64-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x05]
> >  ; X64-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> > -; X64-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x06]
> > -; X64-NEXT:    vpinsrw $7, %edi, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc7,0x07]
> > +; X64-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x06]
> > +; X64-NEXT:    vpinsrb $7, %edi, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc7,0x07]
> >  ; X64-NEXT:    vzeroupper # encoding: [0xc5,0xf8,0x77]
> >  ; X64-NEXT:    retq # encoding: [0xc3]
> >    %res0 = call i8 @llvm.x86.avx512.mask.ucmp.d.256(<8 x i32> %a0, <8 x
> i32> %a1, i32 0, i8 %mask)
> > @@ -8307,19 +8307,19 @@ define <8 x i8> @test_cmp_q_256(<4 x i64
> >  ; CHECK-NEXT:    vpcmpgtq %ymm1, %ymm0, %k5 # encoding:
> [0x62,0xf2,0xfd,0x28,0x37,0xe9]
> >  ; CHECK-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
> >  ; CHECK-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xef,0xc0]
> > -; CHECK-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x00]
> > +; CHECK-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x00]
> >  ; CHECK-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> > -; CHECK-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x01]
> > +; CHECK-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x01]
> >  ; CHECK-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> > -; CHECK-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x02]
> > +; CHECK-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x02]
> >  ; CHECK-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> > -; CHECK-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x04]
> > +; CHECK-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x04]
> >  ; CHECK-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> > -; CHECK-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x05]
> > +; CHECK-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x05]
> >  ; CHECK-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> > -; CHECK-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x06]
> > +; CHECK-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x06]
> >  ; CHECK-NEXT:    movl $15, %eax # encoding: [0xb8,0x0f,0x00,0x00,0x00]
> > -; CHECK-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x07]
> > +; CHECK-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x07]
> >  ; CHECK-NEXT:    vzeroupper # encoding: [0xc5,0xf8,0x77]
> >  ; CHECK-NEXT:    ret{{[l|q]}} # encoding: [0xc3]
> >    %res0 = call i8 @llvm.x86.avx512.mask.cmp.q.256(<4 x i64> %a0, <4 x
> i64> %a1, i32 0, i8 -1)
> > @@ -8356,19 +8356,19 @@ define <8 x i8> @test_mask_cmp_q_256(<4
> >  ; X86-NEXT:    kshiftrw $12, %k2, %k2 # encoding:
> [0xc4,0xe3,0xf9,0x30,0xd2,0x0c]
> >  ; X86-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
> >  ; X86-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xef,0xc0]
> > -; X86-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x00]
> > +; X86-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x00]
> >  ; X86-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> > -; X86-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x01]
> > +; X86-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x01]
> >  ; X86-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> > -; X86-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x02]
> > +; X86-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x02]
> >  ; X86-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> > -; X86-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x04]
> > +; X86-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x04]
> >  ; X86-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> > -; X86-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x05]
> > +; X86-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x05]
> >  ; X86-NEXT:    kmovw %k6, %eax # encoding: [0xc5,0xf8,0x93,0xc6]
> > -; X86-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x06]
> > +; X86-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x06]
> >  ; X86-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> > -; X86-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x07]
> > +; X86-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x07]
> >  ; X86-NEXT:    vzeroupper # encoding: [0xc5,0xf8,0x77]
> >  ; X86-NEXT:    retl # encoding: [0xc3]
> >  ;
> > @@ -8385,19 +8385,19 @@ define <8 x i8> @test_mask_cmp_q_256(<4
> >  ; X64-NEXT:    kshiftrw $12, %k2, %k2 # encoding:
> [0xc4,0xe3,0xf9,0x30,0xd2,0x0c]
> >  ; X64-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
> >  ; X64-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xef,0xc0]
> > -; X64-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x00]
> > +; X64-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x00]
> >  ; X64-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> > -; X64-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x01]
> > +; X64-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x01]
> >  ; X64-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> > -; X64-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x02]
> > +; X64-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x02]
> >  ; X64-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> > -; X64-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x04]
> > +; X64-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x04]
> >  ; X64-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> > -; X64-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x05]
> > +; X64-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x05]
> >  ; X64-NEXT:    kmovw %k6, %eax # encoding: [0xc5,0xf8,0x93,0xc6]
> > -; X64-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x06]
> > +; X64-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x06]
> >  ; X64-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> > -; X64-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x07]
> > +; X64-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x07]
> >  ; X64-NEXT:    vzeroupper # encoding: [0xc5,0xf8,0x77]
> >  ; X64-NEXT:    retq # encoding: [0xc3]
> >    %res0 = call i8 @llvm.x86.avx512.mask.cmp.q.256(<4 x i64> %a0, <4 x
> i64> %a1, i32 0, i8 %mask)
> > @@ -8432,19 +8432,19 @@ define <8 x i8> @test_ucmp_q_256(<4 x i6
> >  ; CHECK-NEXT:    vpcmpnleuq %ymm1, %ymm0, %k5 # encoding:
> [0x62,0xf3,0xfd,0x28,0x1e,0xe9,0x06]
> >  ; CHECK-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
> >  ; CHECK-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xef,0xc0]
> > -; CHECK-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x00]
> > +; CHECK-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x00]
> >  ; CHECK-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> > -; CHECK-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x01]
> > +; CHECK-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x01]
> >  ; CHECK-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> > -; CHECK-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x02]
> > +; CHECK-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x02]
> >  ; CHECK-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> > -; CHECK-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x04]
> > +; CHECK-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x04]
> >  ; CHECK-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> > -; CHECK-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x05]
> > +; CHECK-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x05]
> >  ; CHECK-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> > -; CHECK-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x06]
> > +; CHECK-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x06]
> >  ; CHECK-NEXT:    movl $15, %eax # encoding: [0xb8,0x0f,0x00,0x00,0x00]
> > -; CHECK-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x07]
> > +; CHECK-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x07]
> >  ; CHECK-NEXT:    vzeroupper # encoding: [0xc5,0xf8,0x77]
> >  ; CHECK-NEXT:    ret{{[l|q]}} # encoding: [0xc3]
> >    %res0 = call i8 @llvm.x86.avx512.mask.ucmp.q.256(<4 x i64> %a0, <4 x
> i64> %a1, i32 0, i8 -1)
> > @@ -8481,19 +8481,19 @@ define <8 x i8> @test_mask_ucmp_q_256(<4
> >  ; X86-NEXT:    kshiftrw $12, %k2, %k2 # encoding:
> [0xc4,0xe3,0xf9,0x30,0xd2,0x0c]
> >  ; X86-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
> >  ; X86-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xef,0xc0]
> > -; X86-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x00]
> > +; X86-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x00]
> >  ; X86-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> > -; X86-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x01]
> > +; X86-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x01]
> >  ; X86-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> > -; X86-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x02]
> > +; X86-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x02]
> >  ; X86-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> > -; X86-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x04]
> > +; X86-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x04]
> >  ; X86-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> > -; X86-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x05]
> > +; X86-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x05]
> >  ; X86-NEXT:    kmovw %k6, %eax # encoding: [0xc5,0xf8,0x93,0xc6]
> > -; X86-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x06]
> > +; X86-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x06]
> >  ; X86-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> > -; X86-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x07]
> > +; X86-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x07]
> >  ; X86-NEXT:    vzeroupper # encoding: [0xc5,0xf8,0x77]
> >  ; X86-NEXT:    retl # encoding: [0xc3]
> >  ;
> > @@ -8510,19 +8510,19 @@ define <8 x i8> @test_mask_ucmp_q_256(<4
> >  ; X64-NEXT:    kshiftrw $12, %k2, %k2 # encoding:
> [0xc4,0xe3,0xf9,0x30,0xd2,0x0c]
> >  ; X64-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
> >  ; X64-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xef,0xc0]
> > -; X64-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x00]
> > +; X64-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x00]
> >  ; X64-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> > -; X64-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x01]
> > +; X64-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x01]
> >  ; X64-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> > -; X64-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x02]
> > +; X64-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x02]
> >  ; X64-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> > -; X64-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x04]
> > +; X64-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x04]
> >  ; X64-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> > -; X64-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x05]
> > +; X64-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x05]
> >  ; X64-NEXT:    kmovw %k6, %eax # encoding: [0xc5,0xf8,0x93,0xc6]
> > -; X64-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x06]
> > +; X64-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x06]
> >  ; X64-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> > -; X64-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x07]
> > +; X64-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x07]
> >  ; X64-NEXT:    vzeroupper # encoding: [0xc5,0xf8,0x77]
> >  ; X64-NEXT:    retq # encoding: [0xc3]
> >    %res0 = call i8 @llvm.x86.avx512.mask.ucmp.q.256(<4 x i64> %a0, <4 x
> i64> %a1, i32 0, i8 %mask)
> > @@ -8557,19 +8557,19 @@ define <8 x i8> @test_cmp_d_128(<4 x i32
> >  ; CHECK-NEXT:    vpcmpgtd %xmm1, %xmm0, %k5 # encoding:
> [0x62,0xf1,0x7d,0x08,0x66,0xe9]
> >  ; CHECK-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
> >  ; CHECK-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xef,0xc0]
> > -; CHECK-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x00]
> > +; CHECK-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x00]
> >  ; CHECK-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> > -; CHECK-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x01]
> > +; CHECK-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x01]
> >  ; CHECK-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> > -; CHECK-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x02]
> > +; CHECK-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x02]
> >  ; CHECK-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> > -; CHECK-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x04]
> > +; CHECK-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x04]
> >  ; CHECK-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> > -; CHECK-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x05]
> > +; CHECK-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x05]
> >  ; CHECK-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> > -; CHECK-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x06]
> > +; CHECK-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x06]
> >  ; CHECK-NEXT:    movl $15, %eax # encoding: [0xb8,0x0f,0x00,0x00,0x00]
> > -; CHECK-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x07]
> > +; CHECK-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x07]
> >  ; CHECK-NEXT:    ret{{[l|q]}} # encoding: [0xc3]
> >    %res0 = call i8 @llvm.x86.avx512.mask.cmp.d.128(<4 x i32> %a0, <4 x
> i32> %a1, i32 0, i8 -1)
> >    %vec0 = insertelement <8 x i8> undef, i8 %res0, i32 0
> > @@ -8605,19 +8605,19 @@ define <8 x i8> @test_mask_cmp_d_128(<4
> >  ; X86-NEXT:    kshiftrw $12, %k2, %k2 # encoding:
> [0xc4,0xe3,0xf9,0x30,0xd2,0x0c]
> >  ; X86-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
> >  ; X86-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xef,0xc0]
> > -; X86-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x00]
> > +; X86-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x00]
> >  ; X86-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> > -; X86-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x01]
> > +; X86-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x01]
> >  ; X86-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> > -; X86-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x02]
> > +; X86-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x02]
> >  ; X86-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> > -; X86-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x04]
> > +; X86-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x04]
> >  ; X86-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> > -; X86-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x05]
> > +; X86-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x05]
> >  ; X86-NEXT:    kmovw %k6, %eax # encoding: [0xc5,0xf8,0x93,0xc6]
> > -; X86-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x06]
> > +; X86-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x06]
> >  ; X86-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> > -; X86-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x07]
> > +; X86-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x07]
> >  ; X86-NEXT:    retl # encoding: [0xc3]
> >  ;
> >  ; X64-LABEL: test_mask_cmp_d_128:
> > @@ -8633,19 +8633,19 @@ define <8 x i8> @test_mask_cmp_d_128(<4
> >  ; X64-NEXT:    kshiftrw $12, %k2, %k2 # encoding:
> [0xc4,0xe3,0xf9,0x30,0xd2,0x0c]
> >  ; X64-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
> >  ; X64-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xef,0xc0]
> > -; X64-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x00]
> > +; X64-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x00]
> >  ; X64-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> > -; X64-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x01]
> > +; X64-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x01]
> >  ; X64-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> > -; X64-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x02]
> > +; X64-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x02]
> >  ; X64-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> > -; X64-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x04]
> > +; X64-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x04]
> >  ; X64-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> > -; X64-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x05]
> > +; X64-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x05]
> >  ; X64-NEXT:    kmovw %k6, %eax # encoding: [0xc5,0xf8,0x93,0xc6]
> > -; X64-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x06]
> > +; X64-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x06]
> >  ; X64-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> > -; X64-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x07]
> > +; X64-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x07]
> >  ; X64-NEXT:    retq # encoding: [0xc3]
> >    %res0 = call i8 @llvm.x86.avx512.mask.cmp.d.128(<4 x i32> %a0, <4 x
> i32> %a1, i32 0, i8 %mask)
> >    %vec0 = insertelement <8 x i8> undef, i8 %res0, i32 0
> > @@ -8679,19 +8679,19 @@ define <8 x i8> @test_ucmp_d_128(<4 x i3
> >  ; CHECK-NEXT:    vpcmpnleud %xmm1, %xmm0, %k5 # encoding:
> [0x62,0xf3,0x7d,0x08,0x1e,0xe9,0x06]
> >  ; CHECK-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
> >  ; CHECK-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xef,0xc0]
> > -; CHECK-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x00]
> > +; CHECK-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x00]
> >  ; CHECK-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> > -; CHECK-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x01]
> > +; CHECK-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x01]
> >  ; CHECK-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> > -; CHECK-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x02]
> > +; CHECK-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x02]
> >  ; CHECK-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> > -; CHECK-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x04]
> > +; CHECK-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x04]
> >  ; CHECK-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> > -; CHECK-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x05]
> > +; CHECK-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x05]
> >  ; CHECK-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> > -; CHECK-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x06]
> > +; CHECK-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x06]
> >  ; CHECK-NEXT:    movl $15, %eax # encoding: [0xb8,0x0f,0x00,0x00,0x00]
> > -; CHECK-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x07]
> > +; CHECK-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x07]
> >  ; CHECK-NEXT:    ret{{[l|q]}} # encoding: [0xc3]
> >    %res0 = call i8 @llvm.x86.avx512.mask.ucmp.d.128(<4 x i32> %a0, <4 x
> i32> %a1, i32 0, i8 -1)
> >    %vec0 = insertelement <8 x i8> undef, i8 %res0, i32 0
> > @@ -8727,19 +8727,19 @@ define <8 x i8> @test_mask_ucmp_d_128(<4
> >  ; X86-NEXT:    kshiftrw $12, %k2, %k2 # encoding:
> [0xc4,0xe3,0xf9,0x30,0xd2,0x0c]
> >  ; X86-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
> >  ; X86-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xef,0xc0]
> > -; X86-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x00]
> > +; X86-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x00]
> >  ; X86-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> > -; X86-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x01]
> > +; X86-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x01]
> >  ; X86-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> > -; X86-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x02]
> > +; X86-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x02]
> >  ; X86-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> > -; X86-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x04]
> > +; X86-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x04]
> >  ; X86-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> > -; X86-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x05]
> > +; X86-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x05]
> >  ; X86-NEXT:    kmovw %k6, %eax # encoding: [0xc5,0xf8,0x93,0xc6]
> > -; X86-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x06]
> > +; X86-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x06]
> >  ; X86-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> > -; X86-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x07]
> > +; X86-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x07]
> >  ; X86-NEXT:    retl # encoding: [0xc3]
> >  ;
> >  ; X64-LABEL: test_mask_ucmp_d_128:
> > @@ -8755,19 +8755,19 @@ define <8 x i8> @test_mask_ucmp_d_128(<4
> >  ; X64-NEXT:    kshiftrw $12, %k2, %k2 # encoding:
> [0xc4,0xe3,0xf9,0x30,0xd2,0x0c]
> >  ; X64-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
> >  ; X64-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xef,0xc0]
> > -; X64-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x00]
> > +; X64-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x00]
> >  ; X64-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> > -; X64-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x01]
> > +; X64-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x01]
> >  ; X64-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> > -; X64-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x02]
> > +; X64-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x02]
> >  ; X64-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> > -; X64-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x04]
> > +; X64-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x04]
> >  ; X64-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> > -; X64-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x05]
> > +; X64-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x05]
> >  ; X64-NEXT:    kmovw %k6, %eax # encoding: [0xc5,0xf8,0x93,0xc6]
> > -; X64-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x06]
> > +; X64-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x06]
> >  ; X64-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> > -; X64-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x07]
> > +; X64-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x07]
> >  ; X64-NEXT:    retq # encoding: [0xc3]
> >    %res0 = call i8 @llvm.x86.avx512.mask.ucmp.d.128(<4 x i32> %a0, <4 x
> i32> %a1, i32 0, i8 %mask)
> >    %vec0 = insertelement <8 x i8> undef, i8 %res0, i32 0
> > @@ -8801,19 +8801,19 @@ define <8 x i8> @test_cmp_q_128(<2 x i64
> >  ; CHECK-NEXT:    vpcmpgtq %xmm1, %xmm0, %k5 # encoding:
> [0x62,0xf2,0xfd,0x08,0x37,0xe9]
> >  ; CHECK-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
> >  ; CHECK-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xef,0xc0]
> > -; CHECK-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x00]
> > +; CHECK-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x00]
> >  ; CHECK-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> > -; CHECK-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x01]
> > +; CHECK-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x01]
> >  ; CHECK-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> > -; CHECK-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x02]
> > +; CHECK-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x02]
> >  ; CHECK-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> > -; CHECK-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x04]
> > +; CHECK-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x04]
> >  ; CHECK-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> > -; CHECK-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x05]
> > +; CHECK-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x05]
> >  ; CHECK-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> > -; CHECK-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x06]
> > +; CHECK-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x06]
> >  ; CHECK-NEXT:    movl $3, %eax # encoding: [0xb8,0x03,0x00,0x00,0x00]
> > -; CHECK-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x07]
> > +; CHECK-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x07]
> >  ; CHECK-NEXT:    ret{{[l|q]}} # encoding: [0xc3]
> >    %res0 = call i8 @llvm.x86.avx512.mask.cmp.q.128(<2 x i64> %a0, <2 x
> i64> %a1, i32 0, i8 -1)
> >    %vec0 = insertelement <8 x i8> undef, i8 %res0, i32 0
> > @@ -8849,19 +8849,19 @@ define <8 x i8> @test_mask_cmp_q_128(<2
> >  ; X86-NEXT:    kshiftrw $14, %k2, %k2 # encoding:
> [0xc4,0xe3,0xf9,0x30,0xd2,0x0e]
> >  ; X86-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
> >  ; X86-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xef,0xc0]
> > -; X86-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x00]
> > +; X86-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x00]
> >  ; X86-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> > -; X86-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x01]
> > +; X86-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x01]
> >  ; X86-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> > -; X86-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x02]
> > +; X86-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x02]
> >  ; X86-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> > -; X86-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x04]
> > +; X86-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x04]
> >  ; X86-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> > -; X86-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x05]
> > +; X86-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x05]
> >  ; X86-NEXT:    kmovw %k6, %eax # encoding: [0xc5,0xf8,0x93,0xc6]
> > -; X86-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x06]
> > +; X86-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x06]
> >  ; X86-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> > -; X86-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x07]
> > +; X86-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x07]
> >  ; X86-NEXT:    retl # encoding: [0xc3]
> >  ;
> >  ; X64-LABEL: test_mask_cmp_q_128:
> > @@ -8877,19 +8877,19 @@ define <8 x i8> @test_mask_cmp_q_128(<2
> >  ; X64-NEXT:    kshiftrw $14, %k2, %k2 # encoding:
> [0xc4,0xe3,0xf9,0x30,0xd2,0x0e]
> >  ; X64-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
> >  ; X64-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xef,0xc0]
> > -; X64-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x00]
> > +; X64-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x00]
> >  ; X64-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> > -; X64-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x01]
> > +; X64-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x01]
> >  ; X64-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> > -; X64-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x02]
> > +; X64-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x02]
> >  ; X64-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> > -; X64-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x04]
> > +; X64-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x04]
> >  ; X64-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> > -; X64-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x05]
> > +; X64-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x05]
> >  ; X64-NEXT:    kmovw %k6, %eax # encoding: [0xc5,0xf8,0x93,0xc6]
> > -; X64-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x06]
> > +; X64-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x06]
> >  ; X64-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> > -; X64-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x07]
> > +; X64-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x07]
> >  ; X64-NEXT:    retq # encoding: [0xc3]
> >    %res0 = call i8 @llvm.x86.avx512.mask.cmp.q.128(<2 x i64> %a0, <2 x
> i64> %a1, i32 0, i8 %mask)
> >    %vec0 = insertelement <8 x i8> undef, i8 %res0, i32 0
> > @@ -8923,19 +8923,19 @@ define <8 x i8> @test_ucmp_q_128(<2 x i6
> >  ; CHECK-NEXT:    vpcmpnleuq %xmm1, %xmm0, %k5 # encoding:
> [0x62,0xf3,0xfd,0x08,0x1e,0xe9,0x06]
> >  ; CHECK-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
> >  ; CHECK-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xef,0xc0]
> > -; CHECK-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x00]
> > +; CHECK-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x00]
> >  ; CHECK-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> > -; CHECK-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x01]
> > +; CHECK-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x01]
> >  ; CHECK-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> > -; CHECK-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x02]
> > +; CHECK-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x02]
> >  ; CHECK-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> > -; CHECK-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x04]
> > +; CHECK-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x04]
> >  ; CHECK-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> > -; CHECK-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x05]
> > +; CHECK-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x05]
> >  ; CHECK-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> > -; CHECK-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x06]
> > +; CHECK-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x06]
> >  ; CHECK-NEXT:    movl $3, %eax # encoding: [0xb8,0x03,0x00,0x00,0x00]
> > -; CHECK-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x07]
> > +; CHECK-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x07]
> >  ; CHECK-NEXT:    ret{{[l|q]}} # encoding: [0xc3]
> >    %res0 = call i8 @llvm.x86.avx512.mask.ucmp.q.128(<2 x i64> %a0, <2 x
> i64> %a1, i32 0, i8 -1)
> >    %vec0 = insertelement <8 x i8> undef, i8 %res0, i32 0
> > @@ -8971,19 +8971,19 @@ define <8 x i8> @test_mask_ucmp_q_128(<2
> >  ; X86-NEXT:    kshiftrw $14, %k2, %k2 # encoding:
> [0xc4,0xe3,0xf9,0x30,0xd2,0x0e]
> >  ; X86-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
> >  ; X86-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xef,0xc0]
> > -; X86-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x00]
> > +; X86-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x00]
> >  ; X86-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> > -; X86-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x01]
> > +; X86-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x01]
> >  ; X86-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> > -; X86-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x02]
> > +; X86-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x02]
> >  ; X86-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> > -; X86-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x04]
> > +; X86-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x04]
> >  ; X86-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> > -; X86-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x05]
> > +; X86-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x05]
> >  ; X86-NEXT:    kmovw %k6, %eax # encoding: [0xc5,0xf8,0x93,0xc6]
> > -; X86-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x06]
> > +; X86-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x06]
> >  ; X86-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> > -; X86-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x07]
> > +; X86-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x07]
> >  ; X86-NEXT:    retl # encoding: [0xc3]
> >  ;
> >  ; X64-LABEL: test_mask_ucmp_q_128:
> > @@ -8999,19 +8999,19 @@ define <8 x i8> @test_mask_ucmp_q_128(<2
> >  ; X64-NEXT:    kshiftrw $14, %k2, %k2 # encoding:
> [0xc4,0xe3,0xf9,0x30,0xd2,0x0e]
> >  ; X64-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
> >  ; X64-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression
> encoding: [0xc5,0xf9,0xef,0xc0]
> > -; X64-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x00]
> > +; X64-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x00]
> >  ; X64-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> > -; X64-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x01]
> > +; X64-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x01]
> >  ; X64-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> > -; X64-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x02]
> > +; X64-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x02]
> >  ; X64-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> > -; X64-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x04]
> > +; X64-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x04]
> >  ; X64-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> > -; X64-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x05]
> > +; X64-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x05]
> >  ; X64-NEXT:    kmovw %k6, %eax # encoding: [0xc5,0xf8,0x93,0xc6]
> > -; X64-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x06]
> > +; X64-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x06]
> >  ; X64-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> > -; X64-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc5,0xf9,0xc4,0xc0,0x07]
> > +; X64-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding:
> [0xc4,0xe3,0x79,0x20,0xc0,0x07]
> >  ; X64-NEXT:    retq # encoding: [0xc3]
> >    %res0 = call i8 @llvm.x86.avx512.mask.ucmp.q.128(<2 x i64> %a0, <2 x
> i64> %a1, i32 0, i8 %mask)
> >    %vec0 = insertelement <8 x i8> undef, i8 %res0, i32 0
> >
> > Modified: llvm/trunk/test/CodeGen/X86/bitcast-and-setcc-128.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/bitcast-and-setcc-128.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/bitcast-and-setcc-128.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/bitcast-and-setcc-128.ll Wed Aug  7
> 09:24:26 2019
> > @@ -178,144 +178,63 @@ define i16 @v16i8(<16 x i8> %a, <16 x i8
> >  }
> >
> >  define i2 @v2i8(<2 x i8> %a, <2 x i8> %b, <2 x i8> %c, <2 x i8> %d) {
> > -; SSE2-SSSE3-LABEL: v2i8:
> > -; SSE2-SSSE3:       # %bb.0:
> > -; SSE2-SSSE3-NEXT:    psllq $56, %xmm2
> > -; SSE2-SSSE3-NEXT:    movdqa %xmm2, %xmm4
> > -; SSE2-SSSE3-NEXT:    psrad $31, %xmm4
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm4 = xmm4[1,3,2,3]
> > -; SSE2-SSSE3-NEXT:    psrad $24, %xmm2
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm2 = xmm2[1,3,2,3]
> > -; SSE2-SSSE3-NEXT:    punpckldq {{.*#+}} xmm2 =
> xmm2[0],xmm4[0],xmm2[1],xmm4[1]
> > -; SSE2-SSSE3-NEXT:    psllq $56, %xmm3
> > -; SSE2-SSSE3-NEXT:    movdqa %xmm3, %xmm4
> > -; SSE2-SSSE3-NEXT:    psrad $31, %xmm4
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm4 = xmm4[1,3,2,3]
> > -; SSE2-SSSE3-NEXT:    psrad $24, %xmm3
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm3 = xmm3[1,3,2,3]
> > -; SSE2-SSSE3-NEXT:    punpckldq {{.*#+}} xmm3 =
> xmm3[0],xmm4[0],xmm3[1],xmm4[1]
> > -; SSE2-SSSE3-NEXT:    psllq $56, %xmm0
> > -; SSE2-SSSE3-NEXT:    movdqa %xmm0, %xmm4
> > -; SSE2-SSSE3-NEXT:    psrad $31, %xmm4
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm4 = xmm4[1,3,2,3]
> > -; SSE2-SSSE3-NEXT:    psrad $24, %xmm0
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,3,2,3]
> > -; SSE2-SSSE3-NEXT:    punpckldq {{.*#+}} xmm0 =
> xmm0[0],xmm4[0],xmm0[1],xmm4[1]
> > -; SSE2-SSSE3-NEXT:    psllq $56, %xmm1
> > -; SSE2-SSSE3-NEXT:    movdqa %xmm1, %xmm4
> > -; SSE2-SSSE3-NEXT:    psrad $31, %xmm4
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm4 = xmm4[1,3,2,3]
> > -; SSE2-SSSE3-NEXT:    psrad $24, %xmm1
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[1,3,2,3]
> > -; SSE2-SSSE3-NEXT:    punpckldq {{.*#+}} xmm1 =
> xmm1[0],xmm4[0],xmm1[1],xmm4[1]
> > -; SSE2-SSSE3-NEXT:    movdqa {{.*#+}} xmm4 = [2147483648,2147483648]
> > -; SSE2-SSSE3-NEXT:    pxor %xmm4, %xmm1
> > -; SSE2-SSSE3-NEXT:    pxor %xmm4, %xmm0
> > -; SSE2-SSSE3-NEXT:    movdqa %xmm0, %xmm5
> > -; SSE2-SSSE3-NEXT:    pcmpeqd %xmm1, %xmm5
> > -; SSE2-SSSE3-NEXT:    pcmpgtd %xmm1, %xmm0
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm1 = xmm0[0,0,2,2]
> > -; SSE2-SSSE3-NEXT:    pand %xmm5, %xmm1
> > -; SSE2-SSSE3-NEXT:    por %xmm0, %xmm1
> > -; SSE2-SSSE3-NEXT:    pxor %xmm4, %xmm3
> > -; SSE2-SSSE3-NEXT:    pxor %xmm4, %xmm2
> > -; SSE2-SSSE3-NEXT:    movdqa %xmm2, %xmm0
> > -; SSE2-SSSE3-NEXT:    pcmpeqd %xmm3, %xmm0
> > -; SSE2-SSSE3-NEXT:    pcmpgtd %xmm3, %xmm2
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm3 = xmm2[0,0,2,2]
> > -; SSE2-SSSE3-NEXT:    pand %xmm0, %xmm3
> > -; SSE2-SSSE3-NEXT:    por %xmm2, %xmm3
> > -; SSE2-SSSE3-NEXT:    pand %xmm1, %xmm3
> > -; SSE2-SSSE3-NEXT:    movmskpd %xmm3, %eax
> > -; SSE2-SSSE3-NEXT:    # kill: def $al killed $al killed $eax
> > -; SSE2-SSSE3-NEXT:    retq
> > +; SSE2-LABEL: v2i8:
> > +; SSE2:       # %bb.0:
> > +; SSE2-NEXT:    pcmpgtb %xmm1, %xmm0
> > +; SSE2-NEXT:    punpcklbw {{.*#+}} xmm0 =
> xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
> > +; SSE2-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3]
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,0,1,1]
> > +; SSE2-NEXT:    pcmpgtb %xmm3, %xmm2
> > +; SSE2-NEXT:    punpcklbw {{.*#+}} xmm2 =
> xmm2[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
> > +; SSE2-NEXT:    punpcklwd {{.*#+}} xmm1 =
> xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3]
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[0,0,1,1]
> > +; SSE2-NEXT:    pand %xmm0, %xmm1
> > +; SSE2-NEXT:    movmskpd %xmm1, %eax
> > +; SSE2-NEXT:    # kill: def $al killed $al killed $eax
> > +; SSE2-NEXT:    retq
> > +;
> > +; SSSE3-LABEL: v2i8:
> > +; SSSE3:       # %bb.0:
> > +; SSSE3-NEXT:    pcmpgtb %xmm1, %xmm0
> > +; SSSE3-NEXT:    movdqa {{.*#+}} xmm1 =
> <u,u,0,0,u,u,0,0,u,u,1,1,u,u,1,1>
> > +; SSSE3-NEXT:    pshufb %xmm1, %xmm0
> > +; SSSE3-NEXT:    pcmpgtb %xmm3, %xmm2
> > +; SSSE3-NEXT:    pshufb %xmm1, %xmm2
> > +; SSSE3-NEXT:    pand %xmm0, %xmm2
> > +; SSSE3-NEXT:    movmskpd %xmm2, %eax
> > +; SSSE3-NEXT:    # kill: def $al killed $al killed $eax
> > +; SSSE3-NEXT:    retq
> >  ;
> > -; AVX1-LABEL: v2i8:
> > -; AVX1:       # %bb.0:
> > -; AVX1-NEXT:    vpsllq $56, %xmm3, %xmm3
> > -; AVX1-NEXT:    vpsrad $31, %xmm3, %xmm4
> > -; AVX1-NEXT:    vpsrad $24, %xmm3, %xmm3
> > -; AVX1-NEXT:    vpshufd {{.*#+}} xmm3 = xmm3[1,1,3,3]
> > -; AVX1-NEXT:    vpblendw {{.*#+}} xmm3 =
> xmm3[0,1],xmm4[2,3],xmm3[4,5],xmm4[6,7]
> > -; AVX1-NEXT:    vpsllq $56, %xmm2, %xmm2
> > -; AVX1-NEXT:    vpsrad $31, %xmm2, %xmm4
> > -; AVX1-NEXT:    vpsrad $24, %xmm2, %xmm2
> > -; AVX1-NEXT:    vpshufd {{.*#+}} xmm2 = xmm2[1,1,3,3]
> > -; AVX1-NEXT:    vpblendw {{.*#+}} xmm2 =
> xmm2[0,1],xmm4[2,3],xmm2[4,5],xmm4[6,7]
> > -; AVX1-NEXT:    vpcmpgtq %xmm3, %xmm2, %xmm2
> > -; AVX1-NEXT:    vpsllq $56, %xmm1, %xmm1
> > -; AVX1-NEXT:    vpsrad $31, %xmm1, %xmm3
> > -; AVX1-NEXT:    vpsrad $24, %xmm1, %xmm1
> > -; AVX1-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
> > -; AVX1-NEXT:    vpblendw {{.*#+}} xmm1 =
> xmm1[0,1],xmm3[2,3],xmm1[4,5],xmm3[6,7]
> > -; AVX1-NEXT:    vpsllq $56, %xmm0, %xmm0
> > -; AVX1-NEXT:    vpsrad $31, %xmm0, %xmm3
> > -; AVX1-NEXT:    vpsrad $24, %xmm0, %xmm0
> > -; AVX1-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[1,1,3,3]
> > -; AVX1-NEXT:    vpblendw {{.*#+}} xmm0 =
> xmm0[0,1],xmm3[2,3],xmm0[4,5],xmm3[6,7]
> > -; AVX1-NEXT:    vpcmpgtq %xmm1, %xmm0, %xmm0
> > -; AVX1-NEXT:    vpand %xmm2, %xmm0, %xmm0
> > -; AVX1-NEXT:    vmovmskpd %xmm0, %eax
> > -; AVX1-NEXT:    # kill: def $al killed $al killed $eax
> > -; AVX1-NEXT:    retq
> > -;
> > -; AVX2-LABEL: v2i8:
> > -; AVX2:       # %bb.0:
> > -; AVX2-NEXT:    vpsllq $56, %xmm3, %xmm3
> > -; AVX2-NEXT:    vpsrad $31, %xmm3, %xmm4
> > -; AVX2-NEXT:    vpsrad $24, %xmm3, %xmm3
> > -; AVX2-NEXT:    vpshufd {{.*#+}} xmm3 = xmm3[1,1,3,3]
> > -; AVX2-NEXT:    vpblendd {{.*#+}} xmm3 = xmm3[0],xmm4[1],xmm3[2],xmm4[3]
> > -; AVX2-NEXT:    vpsllq $56, %xmm2, %xmm2
> > -; AVX2-NEXT:    vpsrad $31, %xmm2, %xmm4
> > -; AVX2-NEXT:    vpsrad $24, %xmm2, %xmm2
> > -; AVX2-NEXT:    vpshufd {{.*#+}} xmm2 = xmm2[1,1,3,3]
> > -; AVX2-NEXT:    vpblendd {{.*#+}} xmm2 = xmm2[0],xmm4[1],xmm2[2],xmm4[3]
> > -; AVX2-NEXT:    vpcmpgtq %xmm3, %xmm2, %xmm2
> > -; AVX2-NEXT:    vpsllq $56, %xmm1, %xmm1
> > -; AVX2-NEXT:    vpsrad $31, %xmm1, %xmm3
> > -; AVX2-NEXT:    vpsrad $24, %xmm1, %xmm1
> > -; AVX2-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
> > -; AVX2-NEXT:    vpblendd {{.*#+}} xmm1 = xmm1[0],xmm3[1],xmm1[2],xmm3[3]
> > -; AVX2-NEXT:    vpsllq $56, %xmm0, %xmm0
> > -; AVX2-NEXT:    vpsrad $31, %xmm0, %xmm3
> > -; AVX2-NEXT:    vpsrad $24, %xmm0, %xmm0
> > -; AVX2-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[1,1,3,3]
> > -; AVX2-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm3[1],xmm0[2],xmm3[3]
> > -; AVX2-NEXT:    vpcmpgtq %xmm1, %xmm0, %xmm0
> > -; AVX2-NEXT:    vpand %xmm2, %xmm0, %xmm0
> > -; AVX2-NEXT:    vmovmskpd %xmm0, %eax
> > -; AVX2-NEXT:    # kill: def $al killed $al killed $eax
> > -; AVX2-NEXT:    retq
> > +; AVX12-LABEL: v2i8:
> > +; AVX12:       # %bb.0:
> > +; AVX12-NEXT:    vpcmpgtb %xmm1, %xmm0, %xmm0
> > +; AVX12-NEXT:    vpmovsxbq %xmm0, %xmm0
> > +; AVX12-NEXT:    vpcmpgtb %xmm3, %xmm2, %xmm1
> > +; AVX12-NEXT:    vpmovsxbq %xmm1, %xmm1
> > +; AVX12-NEXT:    vpand %xmm1, %xmm0, %xmm0
> > +; AVX12-NEXT:    vmovmskpd %xmm0, %eax
> > +; AVX12-NEXT:    # kill: def $al killed $al killed $eax
> > +; AVX12-NEXT:    retq
> >  ;
> >  ; AVX512F-LABEL: v2i8:
> >  ; AVX512F:       # %bb.0:
> > -; AVX512F-NEXT:    vpsllq $56, %xmm3, %xmm3
> > -; AVX512F-NEXT:    vpsraq $56, %xmm3, %xmm3
> > -; AVX512F-NEXT:    vpsllq $56, %xmm2, %xmm2
> > -; AVX512F-NEXT:    vpsraq $56, %xmm2, %xmm2
> > -; AVX512F-NEXT:    vpsllq $56, %xmm1, %xmm1
> > -; AVX512F-NEXT:    vpsraq $56, %xmm1, %xmm1
> > -; AVX512F-NEXT:    vpsllq $56, %xmm0, %xmm0
> > -; AVX512F-NEXT:    vpsraq $56, %xmm0, %xmm0
> > -; AVX512F-NEXT:    vpcmpgtq %xmm1, %xmm0, %k1
> > -; AVX512F-NEXT:    vpcmpgtq %xmm3, %xmm2, %k0 {%k1}
> > +; AVX512F-NEXT:    vpcmpgtb %xmm1, %xmm0, %xmm0
> > +; AVX512F-NEXT:    vpmovsxbd %xmm0, %zmm0
> > +; AVX512F-NEXT:    vptestmd %zmm0, %zmm0, %k0
> > +; AVX512F-NEXT:    vpcmpgtb %xmm3, %xmm2, %xmm0
> > +; AVX512F-NEXT:    vpmovsxbd %xmm0, %zmm0
> > +; AVX512F-NEXT:    vptestmd %zmm0, %zmm0, %k1
> > +; AVX512F-NEXT:    kandw %k1, %k0, %k0
> >  ; AVX512F-NEXT:    kmovw %k0, %eax
> >  ; AVX512F-NEXT:    # kill: def $al killed $al killed $eax
> > +; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> >  ;
> >  ; AVX512BW-LABEL: v2i8:
> >  ; AVX512BW:       # %bb.0:
> > -; AVX512BW-NEXT:    vpsllq $56, %xmm3, %xmm3
> > -; AVX512BW-NEXT:    vpsraq $56, %xmm3, %xmm3
> > -; AVX512BW-NEXT:    vpsllq $56, %xmm2, %xmm2
> > -; AVX512BW-NEXT:    vpsraq $56, %xmm2, %xmm2
> > -; AVX512BW-NEXT:    vpsllq $56, %xmm1, %xmm1
> > -; AVX512BW-NEXT:    vpsraq $56, %xmm1, %xmm1
> > -; AVX512BW-NEXT:    vpsllq $56, %xmm0, %xmm0
> > -; AVX512BW-NEXT:    vpsraq $56, %xmm0, %xmm0
> > -; AVX512BW-NEXT:    vpcmpgtq %xmm1, %xmm0, %k1
> > -; AVX512BW-NEXT:    vpcmpgtq %xmm3, %xmm2, %k0 {%k1}
> > +; AVX512BW-NEXT:    vpcmpgtb %xmm1, %xmm0, %k0
> > +; AVX512BW-NEXT:    vpcmpgtb %xmm3, %xmm2, %k1
> > +; AVX512BW-NEXT:    kandw %k1, %k0, %k0
> >  ; AVX512BW-NEXT:    kmovd %k0, %eax
> >  ; AVX512BW-NEXT:    # kill: def $al killed $al killed $eax
> >  ; AVX512BW-NEXT:    retq
> > @@ -329,142 +248,47 @@ define i2 @v2i8(<2 x i8> %a, <2 x i8> %b
> >  define i2 @v2i16(<2 x i16> %a, <2 x i16> %b, <2 x i16> %c, <2 x i16>
> %d) {
> >  ; SSE2-SSSE3-LABEL: v2i16:
> >  ; SSE2-SSSE3:       # %bb.0:
> > -; SSE2-SSSE3-NEXT:    psllq $48, %xmm2
> > -; SSE2-SSSE3-NEXT:    movdqa %xmm2, %xmm4
> > -; SSE2-SSSE3-NEXT:    psrad $31, %xmm4
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm4 = xmm4[1,3,2,3]
> > -; SSE2-SSSE3-NEXT:    psrad $16, %xmm2
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm2 = xmm2[1,3,2,3]
> > -; SSE2-SSSE3-NEXT:    punpckldq {{.*#+}} xmm2 =
> xmm2[0],xmm4[0],xmm2[1],xmm4[1]
> > -; SSE2-SSSE3-NEXT:    psllq $48, %xmm3
> > -; SSE2-SSSE3-NEXT:    movdqa %xmm3, %xmm4
> > -; SSE2-SSSE3-NEXT:    psrad $31, %xmm4
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm4 = xmm4[1,3,2,3]
> > -; SSE2-SSSE3-NEXT:    psrad $16, %xmm3
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm3 = xmm3[1,3,2,3]
> > -; SSE2-SSSE3-NEXT:    punpckldq {{.*#+}} xmm3 =
> xmm3[0],xmm4[0],xmm3[1],xmm4[1]
> > -; SSE2-SSSE3-NEXT:    psllq $48, %xmm0
> > -; SSE2-SSSE3-NEXT:    movdqa %xmm0, %xmm4
> > -; SSE2-SSSE3-NEXT:    psrad $31, %xmm4
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm4 = xmm4[1,3,2,3]
> > -; SSE2-SSSE3-NEXT:    psrad $16, %xmm0
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,3,2,3]
> > -; SSE2-SSSE3-NEXT:    punpckldq {{.*#+}} xmm0 =
> xmm0[0],xmm4[0],xmm0[1],xmm4[1]
> > -; SSE2-SSSE3-NEXT:    psllq $48, %xmm1
> > -; SSE2-SSSE3-NEXT:    movdqa %xmm1, %xmm4
> > -; SSE2-SSSE3-NEXT:    psrad $31, %xmm4
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm4 = xmm4[1,3,2,3]
> > -; SSE2-SSSE3-NEXT:    psrad $16, %xmm1
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[1,3,2,3]
> > -; SSE2-SSSE3-NEXT:    punpckldq {{.*#+}} xmm1 =
> xmm1[0],xmm4[0],xmm1[1],xmm4[1]
> > -; SSE2-SSSE3-NEXT:    movdqa {{.*#+}} xmm4 = [2147483648,2147483648]
> > -; SSE2-SSSE3-NEXT:    pxor %xmm4, %xmm1
> > -; SSE2-SSSE3-NEXT:    pxor %xmm4, %xmm0
> > -; SSE2-SSSE3-NEXT:    movdqa %xmm0, %xmm5
> > -; SSE2-SSSE3-NEXT:    pcmpeqd %xmm1, %xmm5
> > -; SSE2-SSSE3-NEXT:    pcmpgtd %xmm1, %xmm0
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm1 = xmm0[0,0,2,2]
> > -; SSE2-SSSE3-NEXT:    pand %xmm5, %xmm1
> > -; SSE2-SSSE3-NEXT:    por %xmm0, %xmm1
> > -; SSE2-SSSE3-NEXT:    pxor %xmm4, %xmm3
> > -; SSE2-SSSE3-NEXT:    pxor %xmm4, %xmm2
> > -; SSE2-SSSE3-NEXT:    movdqa %xmm2, %xmm0
> > -; SSE2-SSSE3-NEXT:    pcmpeqd %xmm3, %xmm0
> > -; SSE2-SSSE3-NEXT:    pcmpgtd %xmm3, %xmm2
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm3 = xmm2[0,0,2,2]
> > -; SSE2-SSSE3-NEXT:    pand %xmm0, %xmm3
> > -; SSE2-SSSE3-NEXT:    por %xmm2, %xmm3
> > -; SSE2-SSSE3-NEXT:    pand %xmm1, %xmm3
> > -; SSE2-SSSE3-NEXT:    movmskpd %xmm3, %eax
> > +; SSE2-SSSE3-NEXT:    pcmpgtw %xmm1, %xmm0
> > +; SSE2-SSSE3-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3]
> > +; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,0,1,1]
> > +; SSE2-SSSE3-NEXT:    pcmpgtw %xmm3, %xmm2
> > +; SSE2-SSSE3-NEXT:    punpcklwd {{.*#+}} xmm1 =
> xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3]
> > +; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[0,0,1,1]
> > +; SSE2-SSSE3-NEXT:    pand %xmm0, %xmm1
> > +; SSE2-SSSE3-NEXT:    movmskpd %xmm1, %eax
> >  ; SSE2-SSSE3-NEXT:    # kill: def $al killed $al killed $eax
> >  ; SSE2-SSSE3-NEXT:    retq
> >  ;
> > -; AVX1-LABEL: v2i16:
> > -; AVX1:       # %bb.0:
> > -; AVX1-NEXT:    vpsllq $48, %xmm3, %xmm3
> > -; AVX1-NEXT:    vpsrad $31, %xmm3, %xmm4
> > -; AVX1-NEXT:    vpsrad $16, %xmm3, %xmm3
> > -; AVX1-NEXT:    vpshufd {{.*#+}} xmm3 = xmm3[1,1,3,3]
> > -; AVX1-NEXT:    vpblendw {{.*#+}} xmm3 =
> xmm3[0,1],xmm4[2,3],xmm3[4,5],xmm4[6,7]
> > -; AVX1-NEXT:    vpsllq $48, %xmm2, %xmm2
> > -; AVX1-NEXT:    vpsrad $31, %xmm2, %xmm4
> > -; AVX1-NEXT:    vpsrad $16, %xmm2, %xmm2
> > -; AVX1-NEXT:    vpshufd {{.*#+}} xmm2 = xmm2[1,1,3,3]
> > -; AVX1-NEXT:    vpblendw {{.*#+}} xmm2 =
> xmm2[0,1],xmm4[2,3],xmm2[4,5],xmm4[6,7]
> > -; AVX1-NEXT:    vpcmpgtq %xmm3, %xmm2, %xmm2
> > -; AVX1-NEXT:    vpsllq $48, %xmm1, %xmm1
> > -; AVX1-NEXT:    vpsrad $31, %xmm1, %xmm3
> > -; AVX1-NEXT:    vpsrad $16, %xmm1, %xmm1
> > -; AVX1-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
> > -; AVX1-NEXT:    vpblendw {{.*#+}} xmm1 =
> xmm1[0,1],xmm3[2,3],xmm1[4,5],xmm3[6,7]
> > -; AVX1-NEXT:    vpsllq $48, %xmm0, %xmm0
> > -; AVX1-NEXT:    vpsrad $31, %xmm0, %xmm3
> > -; AVX1-NEXT:    vpsrad $16, %xmm0, %xmm0
> > -; AVX1-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[1,1,3,3]
> > -; AVX1-NEXT:    vpblendw {{.*#+}} xmm0 =
> xmm0[0,1],xmm3[2,3],xmm0[4,5],xmm3[6,7]
> > -; AVX1-NEXT:    vpcmpgtq %xmm1, %xmm0, %xmm0
> > -; AVX1-NEXT:    vpand %xmm2, %xmm0, %xmm0
> > -; AVX1-NEXT:    vmovmskpd %xmm0, %eax
> > -; AVX1-NEXT:    # kill: def $al killed $al killed $eax
> > -; AVX1-NEXT:    retq
> > -;
> > -; AVX2-LABEL: v2i16:
> > -; AVX2:       # %bb.0:
> > -; AVX2-NEXT:    vpsllq $48, %xmm3, %xmm3
> > -; AVX2-NEXT:    vpsrad $31, %xmm3, %xmm4
> > -; AVX2-NEXT:    vpsrad $16, %xmm3, %xmm3
> > -; AVX2-NEXT:    vpshufd {{.*#+}} xmm3 = xmm3[1,1,3,3]
> > -; AVX2-NEXT:    vpblendd {{.*#+}} xmm3 = xmm3[0],xmm4[1],xmm3[2],xmm4[3]
> > -; AVX2-NEXT:    vpsllq $48, %xmm2, %xmm2
> > -; AVX2-NEXT:    vpsrad $31, %xmm2, %xmm4
> > -; AVX2-NEXT:    vpsrad $16, %xmm2, %xmm2
> > -; AVX2-NEXT:    vpshufd {{.*#+}} xmm2 = xmm2[1,1,3,3]
> > -; AVX2-NEXT:    vpblendd {{.*#+}} xmm2 = xmm2[0],xmm4[1],xmm2[2],xmm4[3]
> > -; AVX2-NEXT:    vpcmpgtq %xmm3, %xmm2, %xmm2
> > -; AVX2-NEXT:    vpsllq $48, %xmm1, %xmm1
> > -; AVX2-NEXT:    vpsrad $31, %xmm1, %xmm3
> > -; AVX2-NEXT:    vpsrad $16, %xmm1, %xmm1
> > -; AVX2-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
> > -; AVX2-NEXT:    vpblendd {{.*#+}} xmm1 = xmm1[0],xmm3[1],xmm1[2],xmm3[3]
> > -; AVX2-NEXT:    vpsllq $48, %xmm0, %xmm0
> > -; AVX2-NEXT:    vpsrad $31, %xmm0, %xmm3
> > -; AVX2-NEXT:    vpsrad $16, %xmm0, %xmm0
> > -; AVX2-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[1,1,3,3]
> > -; AVX2-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm3[1],xmm0[2],xmm3[3]
> > -; AVX2-NEXT:    vpcmpgtq %xmm1, %xmm0, %xmm0
> > -; AVX2-NEXT:    vpand %xmm2, %xmm0, %xmm0
> > -; AVX2-NEXT:    vmovmskpd %xmm0, %eax
> > -; AVX2-NEXT:    # kill: def $al killed $al killed $eax
> > -; AVX2-NEXT:    retq
> > +; AVX12-LABEL: v2i16:
> > +; AVX12:       # %bb.0:
> > +; AVX12-NEXT:    vpcmpgtw %xmm1, %xmm0, %xmm0
> > +; AVX12-NEXT:    vpmovsxwq %xmm0, %xmm0
> > +; AVX12-NEXT:    vpcmpgtw %xmm3, %xmm2, %xmm1
> > +; AVX12-NEXT:    vpmovsxwq %xmm1, %xmm1
> > +; AVX12-NEXT:    vpand %xmm1, %xmm0, %xmm0
> > +; AVX12-NEXT:    vmovmskpd %xmm0, %eax
> > +; AVX12-NEXT:    # kill: def $al killed $al killed $eax
> > +; AVX12-NEXT:    retq
> >  ;
> >  ; AVX512F-LABEL: v2i16:
> >  ; AVX512F:       # %bb.0:
> > -; AVX512F-NEXT:    vpsllq $48, %xmm3, %xmm3
> > -; AVX512F-NEXT:    vpsraq $48, %xmm3, %xmm3
> > -; AVX512F-NEXT:    vpsllq $48, %xmm2, %xmm2
> > -; AVX512F-NEXT:    vpsraq $48, %xmm2, %xmm2
> > -; AVX512F-NEXT:    vpsllq $48, %xmm1, %xmm1
> > -; AVX512F-NEXT:    vpsraq $48, %xmm1, %xmm1
> > -; AVX512F-NEXT:    vpsllq $48, %xmm0, %xmm0
> > -; AVX512F-NEXT:    vpsraq $48, %xmm0, %xmm0
> > -; AVX512F-NEXT:    vpcmpgtq %xmm1, %xmm0, %k1
> > -; AVX512F-NEXT:    vpcmpgtq %xmm3, %xmm2, %k0 {%k1}
> > +; AVX512F-NEXT:    vpcmpgtw %xmm1, %xmm0, %xmm0
> > +; AVX512F-NEXT:    vpmovsxwd %xmm0, %ymm0
> > +; AVX512F-NEXT:    vptestmd %ymm0, %ymm0, %k0
> > +; AVX512F-NEXT:    vpcmpgtw %xmm3, %xmm2, %xmm0
> > +; AVX512F-NEXT:    vpmovsxwd %xmm0, %ymm0
> > +; AVX512F-NEXT:    vptestmd %ymm0, %ymm0, %k1
> > +; AVX512F-NEXT:    kandw %k1, %k0, %k0
> >  ; AVX512F-NEXT:    kmovw %k0, %eax
> >  ; AVX512F-NEXT:    # kill: def $al killed $al killed $eax
> > +; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> >  ;
> >  ; AVX512BW-LABEL: v2i16:
> >  ; AVX512BW:       # %bb.0:
> > -; AVX512BW-NEXT:    vpsllq $48, %xmm3, %xmm3
> > -; AVX512BW-NEXT:    vpsraq $48, %xmm3, %xmm3
> > -; AVX512BW-NEXT:    vpsllq $48, %xmm2, %xmm2
> > -; AVX512BW-NEXT:    vpsraq $48, %xmm2, %xmm2
> > -; AVX512BW-NEXT:    vpsllq $48, %xmm1, %xmm1
> > -; AVX512BW-NEXT:    vpsraq $48, %xmm1, %xmm1
> > -; AVX512BW-NEXT:    vpsllq $48, %xmm0, %xmm0
> > -; AVX512BW-NEXT:    vpsraq $48, %xmm0, %xmm0
> > -; AVX512BW-NEXT:    vpcmpgtq %xmm1, %xmm0, %k1
> > -; AVX512BW-NEXT:    vpcmpgtq %xmm3, %xmm2, %k0 {%k1}
> > +; AVX512BW-NEXT:    vpcmpgtw %xmm1, %xmm0, %k0
> > +; AVX512BW-NEXT:    vpcmpgtw %xmm3, %xmm2, %k1
> > +; AVX512BW-NEXT:    kandw %k1, %k0, %k0
> >  ; AVX512BW-NEXT:    kmovd %k0, %eax
> >  ; AVX512BW-NEXT:    # kill: def $al killed $al killed $eax
> >  ; AVX512BW-NEXT:    retq
> > @@ -478,118 +302,40 @@ define i2 @v2i16(<2 x i16> %a, <2 x i16>
> >  define i2 @v2i32(<2 x i32> %a, <2 x i32> %b, <2 x i32> %c, <2 x i32> %d) {
> >  ; SSE2-SSSE3-LABEL: v2i32:
> >  ; SSE2-SSSE3:       # %bb.0:
> > -; SSE2-SSSE3-NEXT:    psllq $32, %xmm2
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm4 = xmm2[1,3,2,3]
> > -; SSE2-SSSE3-NEXT:    psrad $31, %xmm2
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm2 = xmm2[1,3,2,3]
> > -; SSE2-SSSE3-NEXT:    punpckldq {{.*#+}} xmm4 = xmm4[0],xmm2[0],xmm4[1],xmm2[1]
> > -; SSE2-SSSE3-NEXT:    psllq $32, %xmm3
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm2 = xmm3[1,3,2,3]
> > -; SSE2-SSSE3-NEXT:    psrad $31, %xmm3
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm3 = xmm3[1,3,2,3]
> > -; SSE2-SSSE3-NEXT:    punpckldq {{.*#+}} xmm2 = xmm2[0],xmm3[0],xmm2[1],xmm3[1]
> > -; SSE2-SSSE3-NEXT:    psllq $32, %xmm0
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm3 = xmm0[1,3,2,3]
> > -; SSE2-SSSE3-NEXT:    psrad $31, %xmm0
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,3,2,3]
> > -; SSE2-SSSE3-NEXT:    punpckldq {{.*#+}} xmm3 = xmm3[0],xmm0[0],xmm3[1],xmm0[1]
> > -; SSE2-SSSE3-NEXT:    psllq $32, %xmm1
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm0 = xmm1[1,3,2,3]
> > -; SSE2-SSSE3-NEXT:    psrad $31, %xmm1
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[1,3,2,3]
> > -; SSE2-SSSE3-NEXT:    punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
> > -; SSE2-SSSE3-NEXT:    movdqa {{.*#+}} xmm1 = [2147483648,2147483648]
> > -; SSE2-SSSE3-NEXT:    pxor %xmm1, %xmm0
> > -; SSE2-SSSE3-NEXT:    pxor %xmm1, %xmm3
> > -; SSE2-SSSE3-NEXT:    movdqa %xmm3, %xmm5
> > -; SSE2-SSSE3-NEXT:    pcmpeqd %xmm0, %xmm5
> > -; SSE2-SSSE3-NEXT:    pcmpgtd %xmm0, %xmm3
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm0 = xmm3[0,0,2,2]
> > -; SSE2-SSSE3-NEXT:    pand %xmm5, %xmm0
> > -; SSE2-SSSE3-NEXT:    por %xmm3, %xmm0
> > -; SSE2-SSSE3-NEXT:    pxor %xmm1, %xmm2
> > -; SSE2-SSSE3-NEXT:    pxor %xmm1, %xmm4
> > -; SSE2-SSSE3-NEXT:    movdqa %xmm4, %xmm1
> > -; SSE2-SSSE3-NEXT:    pcmpeqd %xmm2, %xmm1
> > -; SSE2-SSSE3-NEXT:    pcmpgtd %xmm2, %xmm4
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm2 = xmm4[0,0,2,2]
> > -; SSE2-SSSE3-NEXT:    pand %xmm1, %xmm2
> > -; SSE2-SSSE3-NEXT:    por %xmm4, %xmm2
> > -; SSE2-SSSE3-NEXT:    pand %xmm0, %xmm2
> > -; SSE2-SSSE3-NEXT:    movmskpd %xmm2, %eax
> > +; SSE2-SSSE3-NEXT:    pcmpgtd %xmm1, %xmm0
> > +; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,0,1,1]
> > +; SSE2-SSSE3-NEXT:    pcmpgtd %xmm3, %xmm2
> > +; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm1 = xmm2[0,0,1,1]
> > +; SSE2-SSSE3-NEXT:    pand %xmm0, %xmm1
> > +; SSE2-SSSE3-NEXT:    movmskpd %xmm1, %eax
> >  ; SSE2-SSSE3-NEXT:    # kill: def $al killed $al killed $eax
> >  ; SSE2-SSSE3-NEXT:    retq
> >  ;
> > -; AVX1-LABEL: v2i32:
> > -; AVX1:       # %bb.0:
> > -; AVX1-NEXT:    vpsllq $32, %xmm3, %xmm4
> > -; AVX1-NEXT:    vpsrad $31, %xmm4, %xmm4
> > -; AVX1-NEXT:    vpblendw {{.*#+}} xmm3 = xmm3[0,1],xmm4[2,3],xmm3[4,5],xmm4[6,7]
> > -; AVX1-NEXT:    vpsllq $32, %xmm2, %xmm4
> > -; AVX1-NEXT:    vpsrad $31, %xmm4, %xmm4
> > -; AVX1-NEXT:    vpblendw {{.*#+}} xmm2 = xmm2[0,1],xmm4[2,3],xmm2[4,5],xmm4[6,7]
> > -; AVX1-NEXT:    vpcmpgtq %xmm3, %xmm2, %xmm2
> > -; AVX1-NEXT:    vpsllq $32, %xmm1, %xmm3
> > -; AVX1-NEXT:    vpsrad $31, %xmm3, %xmm3
> > -; AVX1-NEXT:    vpblendw {{.*#+}} xmm1 = xmm1[0,1],xmm3[2,3],xmm1[4,5],xmm3[6,7]
> > -; AVX1-NEXT:    vpsllq $32, %xmm0, %xmm3
> > -; AVX1-NEXT:    vpsrad $31, %xmm3, %xmm3
> > -; AVX1-NEXT:    vpblendw {{.*#+}} xmm0 = xmm0[0,1],xmm3[2,3],xmm0[4,5],xmm3[6,7]
> > -; AVX1-NEXT:    vpcmpgtq %xmm1, %xmm0, %xmm0
> > -; AVX1-NEXT:    vpand %xmm2, %xmm0, %xmm0
> > -; AVX1-NEXT:    vmovmskpd %xmm0, %eax
> > -; AVX1-NEXT:    # kill: def $al killed $al killed $eax
> > -; AVX1-NEXT:    retq
> > -;
> > -; AVX2-LABEL: v2i32:
> > -; AVX2:       # %bb.0:
> > -; AVX2-NEXT:    vpsllq $32, %xmm3, %xmm4
> > -; AVX2-NEXT:    vpsrad $31, %xmm4, %xmm4
> > -; AVX2-NEXT:    vpblendd {{.*#+}} xmm3 = xmm3[0],xmm4[1],xmm3[2],xmm4[3]
> > -; AVX2-NEXT:    vpsllq $32, %xmm2, %xmm4
> > -; AVX2-NEXT:    vpsrad $31, %xmm4, %xmm4
> > -; AVX2-NEXT:    vpblendd {{.*#+}} xmm2 = xmm2[0],xmm4[1],xmm2[2],xmm4[3]
> > -; AVX2-NEXT:    vpcmpgtq %xmm3, %xmm2, %xmm2
> > -; AVX2-NEXT:    vpsllq $32, %xmm1, %xmm3
> > -; AVX2-NEXT:    vpsrad $31, %xmm3, %xmm3
> > -; AVX2-NEXT:    vpblendd {{.*#+}} xmm1 = xmm1[0],xmm3[1],xmm1[2],xmm3[3]
> > -; AVX2-NEXT:    vpsllq $32, %xmm0, %xmm3
> > -; AVX2-NEXT:    vpsrad $31, %xmm3, %xmm3
> > -; AVX2-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm3[1],xmm0[2],xmm3[3]
> > -; AVX2-NEXT:    vpcmpgtq %xmm1, %xmm0, %xmm0
> > -; AVX2-NEXT:    vpand %xmm2, %xmm0, %xmm0
> > -; AVX2-NEXT:    vmovmskpd %xmm0, %eax
> > -; AVX2-NEXT:    # kill: def $al killed $al killed $eax
> > -; AVX2-NEXT:    retq
> > +; AVX12-LABEL: v2i32:
> > +; AVX12:       # %bb.0:
> > +; AVX12-NEXT:    vpcmpgtd %xmm1, %xmm0, %xmm0
> > +; AVX12-NEXT:    vpmovsxdq %xmm0, %xmm0
> > +; AVX12-NEXT:    vpcmpgtd %xmm3, %xmm2, %xmm1
> > +; AVX12-NEXT:    vpmovsxdq %xmm1, %xmm1
> > +; AVX12-NEXT:    vpand %xmm1, %xmm0, %xmm0
> > +; AVX12-NEXT:    vmovmskpd %xmm0, %eax
> > +; AVX12-NEXT:    # kill: def $al killed $al killed $eax
> > +; AVX12-NEXT:    retq
> >  ;
> >  ; AVX512F-LABEL: v2i32:
> >  ; AVX512F:       # %bb.0:
> > -; AVX512F-NEXT:    vpsllq $32, %xmm3, %xmm3
> > -; AVX512F-NEXT:    vpsraq $32, %xmm3, %xmm3
> > -; AVX512F-NEXT:    vpsllq $32, %xmm2, %xmm2
> > -; AVX512F-NEXT:    vpsraq $32, %xmm2, %xmm2
> > -; AVX512F-NEXT:    vpsllq $32, %xmm1, %xmm1
> > -; AVX512F-NEXT:    vpsraq $32, %xmm1, %xmm1
> > -; AVX512F-NEXT:    vpsllq $32, %xmm0, %xmm0
> > -; AVX512F-NEXT:    vpsraq $32, %xmm0, %xmm0
> > -; AVX512F-NEXT:    vpcmpgtq %xmm1, %xmm0, %k1
> > -; AVX512F-NEXT:    vpcmpgtq %xmm3, %xmm2, %k0 {%k1}
> > +; AVX512F-NEXT:    vpcmpgtd %xmm1, %xmm0, %k0
> > +; AVX512F-NEXT:    vpcmpgtd %xmm3, %xmm2, %k1
> > +; AVX512F-NEXT:    kandw %k1, %k0, %k0
> >  ; AVX512F-NEXT:    kmovw %k0, %eax
> >  ; AVX512F-NEXT:    # kill: def $al killed $al killed $eax
> >  ; AVX512F-NEXT:    retq
> >  ;
> >  ; AVX512BW-LABEL: v2i32:
> >  ; AVX512BW:       # %bb.0:
> > -; AVX512BW-NEXT:    vpsllq $32, %xmm3, %xmm3
> > -; AVX512BW-NEXT:    vpsraq $32, %xmm3, %xmm3
> > -; AVX512BW-NEXT:    vpsllq $32, %xmm2, %xmm2
> > -; AVX512BW-NEXT:    vpsraq $32, %xmm2, %xmm2
> > -; AVX512BW-NEXT:    vpsllq $32, %xmm1, %xmm1
> > -; AVX512BW-NEXT:    vpsraq $32, %xmm1, %xmm1
> > -; AVX512BW-NEXT:    vpsllq $32, %xmm0, %xmm0
> > -; AVX512BW-NEXT:    vpsraq $32, %xmm0, %xmm0
> > -; AVX512BW-NEXT:    vpcmpgtq %xmm1, %xmm0, %k1
> > -; AVX512BW-NEXT:    vpcmpgtq %xmm3, %xmm2, %k0 {%k1}
> > +; AVX512BW-NEXT:    vpcmpgtd %xmm1, %xmm0, %k0
> > +; AVX512BW-NEXT:    vpcmpgtd %xmm3, %xmm2, %k1
> > +; AVX512BW-NEXT:    kandw %k1, %k0, %k0
> >  ; AVX512BW-NEXT:    kmovd %k0, %eax
> >  ; AVX512BW-NEXT:    # kill: def $al killed $al killed $eax
> >  ; AVX512BW-NEXT:    retq
> > @@ -700,66 +446,47 @@ define i2 @v2f64(<2 x double> %a, <2 x d
> >  define i4 @v4i8(<4 x i8> %a, <4 x i8> %b, <4 x i8> %c, <4 x i8> %d) {
> >  ; SSE2-SSSE3-LABEL: v4i8:
> >  ; SSE2-SSSE3:       # %bb.0:
> > -; SSE2-SSSE3-NEXT:    pslld $24, %xmm3
> > -; SSE2-SSSE3-NEXT:    psrad $24, %xmm3
> > -; SSE2-SSSE3-NEXT:    pslld $24, %xmm2
> > -; SSE2-SSSE3-NEXT:    psrad $24, %xmm2
> > -; SSE2-SSSE3-NEXT:    pcmpgtd %xmm3, %xmm2
> > -; SSE2-SSSE3-NEXT:    pslld $24, %xmm1
> > -; SSE2-SSSE3-NEXT:    psrad $24, %xmm1
> > -; SSE2-SSSE3-NEXT:    pslld $24, %xmm0
> > -; SSE2-SSSE3-NEXT:    psrad $24, %xmm0
> > -; SSE2-SSSE3-NEXT:    pcmpgtd %xmm1, %xmm0
> > -; SSE2-SSSE3-NEXT:    pand %xmm2, %xmm0
> > -; SSE2-SSSE3-NEXT:    movmskps %xmm0, %eax
> > +; SSE2-SSSE3-NEXT:    pcmpgtb %xmm1, %xmm0
> > +; SSE2-SSSE3-NEXT:    punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
> > +; SSE2-SSSE3-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3]
> > +; SSE2-SSSE3-NEXT:    pcmpgtb %xmm3, %xmm2
> > +; SSE2-SSSE3-NEXT:    punpcklbw {{.*#+}} xmm2 = xmm2[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
> > +; SSE2-SSSE3-NEXT:    punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3]
> > +; SSE2-SSSE3-NEXT:    pand %xmm0, %xmm1
> > +; SSE2-SSSE3-NEXT:    movmskps %xmm1, %eax
> >  ; SSE2-SSSE3-NEXT:    # kill: def $al killed $al killed $eax
> >  ; SSE2-SSSE3-NEXT:    retq
> >  ;
> >  ; AVX12-LABEL: v4i8:
> >  ; AVX12:       # %bb.0:
> > -; AVX12-NEXT:    vpslld $24, %xmm3, %xmm3
> > -; AVX12-NEXT:    vpsrad $24, %xmm3, %xmm3
> > -; AVX12-NEXT:    vpslld $24, %xmm2, %xmm2
> > -; AVX12-NEXT:    vpsrad $24, %xmm2, %xmm2
> > -; AVX12-NEXT:    vpcmpgtd %xmm3, %xmm2, %xmm2
> > -; AVX12-NEXT:    vpslld $24, %xmm1, %xmm1
> > -; AVX12-NEXT:    vpsrad $24, %xmm1, %xmm1
> > -; AVX12-NEXT:    vpslld $24, %xmm0, %xmm0
> > -; AVX12-NEXT:    vpsrad $24, %xmm0, %xmm0
> > -; AVX12-NEXT:    vpcmpgtd %xmm1, %xmm0, %xmm0
> > -; AVX12-NEXT:    vpand %xmm2, %xmm0, %xmm0
> > +; AVX12-NEXT:    vpcmpgtb %xmm1, %xmm0, %xmm0
> > +; AVX12-NEXT:    vpmovsxbd %xmm0, %xmm0
> > +; AVX12-NEXT:    vpcmpgtb %xmm3, %xmm2, %xmm1
> > +; AVX12-NEXT:    vpmovsxbd %xmm1, %xmm1
> > +; AVX12-NEXT:    vpand %xmm1, %xmm0, %xmm0
> >  ; AVX12-NEXT:    vmovmskps %xmm0, %eax
> >  ; AVX12-NEXT:    # kill: def $al killed $al killed $eax
> >  ; AVX12-NEXT:    retq
> >  ;
> >  ; AVX512F-LABEL: v4i8:
> >  ; AVX512F:       # %bb.0:
> > -; AVX512F-NEXT:    vpslld $24, %xmm3, %xmm3
> > -; AVX512F-NEXT:    vpsrad $24, %xmm3, %xmm3
> > -; AVX512F-NEXT:    vpslld $24, %xmm2, %xmm2
> > -; AVX512F-NEXT:    vpsrad $24, %xmm2, %xmm2
> > -; AVX512F-NEXT:    vpslld $24, %xmm1, %xmm1
> > -; AVX512F-NEXT:    vpsrad $24, %xmm1, %xmm1
> > -; AVX512F-NEXT:    vpslld $24, %xmm0, %xmm0
> > -; AVX512F-NEXT:    vpsrad $24, %xmm0, %xmm0
> > -; AVX512F-NEXT:    vpcmpgtd %xmm1, %xmm0, %k1
> > -; AVX512F-NEXT:    vpcmpgtd %xmm3, %xmm2, %k0 {%k1}
> > +; AVX512F-NEXT:    vpcmpgtb %xmm1, %xmm0, %xmm0
> > +; AVX512F-NEXT:    vpmovsxbd %xmm0, %zmm0
> > +; AVX512F-NEXT:    vptestmd %zmm0, %zmm0, %k0
> > +; AVX512F-NEXT:    vpcmpgtb %xmm3, %xmm2, %xmm0
> > +; AVX512F-NEXT:    vpmovsxbd %xmm0, %zmm0
> > +; AVX512F-NEXT:    vptestmd %zmm0, %zmm0, %k1
> > +; AVX512F-NEXT:    kandw %k1, %k0, %k0
> >  ; AVX512F-NEXT:    kmovw %k0, %eax
> >  ; AVX512F-NEXT:    # kill: def $al killed $al killed $eax
> > +; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> >  ;
> >  ; AVX512BW-LABEL: v4i8:
> >  ; AVX512BW:       # %bb.0:
> > -; AVX512BW-NEXT:    vpslld $24, %xmm3, %xmm3
> > -; AVX512BW-NEXT:    vpsrad $24, %xmm3, %xmm3
> > -; AVX512BW-NEXT:    vpslld $24, %xmm2, %xmm2
> > -; AVX512BW-NEXT:    vpsrad $24, %xmm2, %xmm2
> > -; AVX512BW-NEXT:    vpslld $24, %xmm1, %xmm1
> > -; AVX512BW-NEXT:    vpsrad $24, %xmm1, %xmm1
> > -; AVX512BW-NEXT:    vpslld $24, %xmm0, %xmm0
> > -; AVX512BW-NEXT:    vpsrad $24, %xmm0, %xmm0
> > -; AVX512BW-NEXT:    vpcmpgtd %xmm1, %xmm0, %k1
> > -; AVX512BW-NEXT:    vpcmpgtd %xmm3, %xmm2, %k0 {%k1}
> > +; AVX512BW-NEXT:    vpcmpgtb %xmm1, %xmm0, %k0
> > +; AVX512BW-NEXT:    vpcmpgtb %xmm3, %xmm2, %k1
> > +; AVX512BW-NEXT:    kandw %k1, %k0, %k0
> >  ; AVX512BW-NEXT:    kmovd %k0, %eax
> >  ; AVX512BW-NEXT:    # kill: def $al killed $al killed $eax
> >  ; AVX512BW-NEXT:    retq
> > @@ -773,66 +500,45 @@ define i4 @v4i8(<4 x i8> %a, <4 x i8> %b
> >  define i4 @v4i16(<4 x i16> %a, <4 x i16> %b, <4 x i16> %c, <4 x i16> %d) {
> >  ; SSE2-SSSE3-LABEL: v4i16:
> >  ; SSE2-SSSE3:       # %bb.0:
> > -; SSE2-SSSE3-NEXT:    pslld $16, %xmm3
> > -; SSE2-SSSE3-NEXT:    psrad $16, %xmm3
> > -; SSE2-SSSE3-NEXT:    pslld $16, %xmm2
> > -; SSE2-SSSE3-NEXT:    psrad $16, %xmm2
> > -; SSE2-SSSE3-NEXT:    pcmpgtd %xmm3, %xmm2
> > -; SSE2-SSSE3-NEXT:    pslld $16, %xmm1
> > -; SSE2-SSSE3-NEXT:    psrad $16, %xmm1
> > -; SSE2-SSSE3-NEXT:    pslld $16, %xmm0
> > -; SSE2-SSSE3-NEXT:    psrad $16, %xmm0
> > -; SSE2-SSSE3-NEXT:    pcmpgtd %xmm1, %xmm0
> > -; SSE2-SSSE3-NEXT:    pand %xmm2, %xmm0
> > -; SSE2-SSSE3-NEXT:    movmskps %xmm0, %eax
> > +; SSE2-SSSE3-NEXT:    pcmpgtw %xmm1, %xmm0
> > +; SSE2-SSSE3-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3]
> > +; SSE2-SSSE3-NEXT:    pcmpgtw %xmm3, %xmm2
> > +; SSE2-SSSE3-NEXT:    punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3]
> > +; SSE2-SSSE3-NEXT:    pand %xmm0, %xmm1
> > +; SSE2-SSSE3-NEXT:    movmskps %xmm1, %eax
> >  ; SSE2-SSSE3-NEXT:    # kill: def $al killed $al killed $eax
> >  ; SSE2-SSSE3-NEXT:    retq
> >  ;
> >  ; AVX12-LABEL: v4i16:
> >  ; AVX12:       # %bb.0:
> > -; AVX12-NEXT:    vpslld $16, %xmm3, %xmm3
> > -; AVX12-NEXT:    vpsrad $16, %xmm3, %xmm3
> > -; AVX12-NEXT:    vpslld $16, %xmm2, %xmm2
> > -; AVX12-NEXT:    vpsrad $16, %xmm2, %xmm2
> > -; AVX12-NEXT:    vpcmpgtd %xmm3, %xmm2, %xmm2
> > -; AVX12-NEXT:    vpslld $16, %xmm1, %xmm1
> > -; AVX12-NEXT:    vpsrad $16, %xmm1, %xmm1
> > -; AVX12-NEXT:    vpslld $16, %xmm0, %xmm0
> > -; AVX12-NEXT:    vpsrad $16, %xmm0, %xmm0
> > -; AVX12-NEXT:    vpcmpgtd %xmm1, %xmm0, %xmm0
> > -; AVX12-NEXT:    vpand %xmm2, %xmm0, %xmm0
> > +; AVX12-NEXT:    vpcmpgtw %xmm1, %xmm0, %xmm0
> > +; AVX12-NEXT:    vpmovsxwd %xmm0, %xmm0
> > +; AVX12-NEXT:    vpcmpgtw %xmm3, %xmm2, %xmm1
> > +; AVX12-NEXT:    vpmovsxwd %xmm1, %xmm1
> > +; AVX12-NEXT:    vpand %xmm1, %xmm0, %xmm0
> >  ; AVX12-NEXT:    vmovmskps %xmm0, %eax
> >  ; AVX12-NEXT:    # kill: def $al killed $al killed $eax
> >  ; AVX12-NEXT:    retq
> >  ;
> >  ; AVX512F-LABEL: v4i16:
> >  ; AVX512F:       # %bb.0:
> > -; AVX512F-NEXT:    vpslld $16, %xmm3, %xmm3
> > -; AVX512F-NEXT:    vpsrad $16, %xmm3, %xmm3
> > -; AVX512F-NEXT:    vpslld $16, %xmm2, %xmm2
> > -; AVX512F-NEXT:    vpsrad $16, %xmm2, %xmm2
> > -; AVX512F-NEXT:    vpslld $16, %xmm1, %xmm1
> > -; AVX512F-NEXT:    vpsrad $16, %xmm1, %xmm1
> > -; AVX512F-NEXT:    vpslld $16, %xmm0, %xmm0
> > -; AVX512F-NEXT:    vpsrad $16, %xmm0, %xmm0
> > -; AVX512F-NEXT:    vpcmpgtd %xmm1, %xmm0, %k1
> > -; AVX512F-NEXT:    vpcmpgtd %xmm3, %xmm2, %k0 {%k1}
> > +; AVX512F-NEXT:    vpcmpgtw %xmm1, %xmm0, %xmm0
> > +; AVX512F-NEXT:    vpmovsxwd %xmm0, %ymm0
> > +; AVX512F-NEXT:    vptestmd %ymm0, %ymm0, %k0
> > +; AVX512F-NEXT:    vpcmpgtw %xmm3, %xmm2, %xmm0
> > +; AVX512F-NEXT:    vpmovsxwd %xmm0, %ymm0
> > +; AVX512F-NEXT:    vptestmd %ymm0, %ymm0, %k1
> > +; AVX512F-NEXT:    kandw %k1, %k0, %k0
> >  ; AVX512F-NEXT:    kmovw %k0, %eax
> >  ; AVX512F-NEXT:    # kill: def $al killed $al killed $eax
> > +; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> >  ;
> >  ; AVX512BW-LABEL: v4i16:
> >  ; AVX512BW:       # %bb.0:
> > -; AVX512BW-NEXT:    vpslld $16, %xmm3, %xmm3
> > -; AVX512BW-NEXT:    vpsrad $16, %xmm3, %xmm3
> > -; AVX512BW-NEXT:    vpslld $16, %xmm2, %xmm2
> > -; AVX512BW-NEXT:    vpsrad $16, %xmm2, %xmm2
> > -; AVX512BW-NEXT:    vpslld $16, %xmm1, %xmm1
> > -; AVX512BW-NEXT:    vpsrad $16, %xmm1, %xmm1
> > -; AVX512BW-NEXT:    vpslld $16, %xmm0, %xmm0
> > -; AVX512BW-NEXT:    vpsrad $16, %xmm0, %xmm0
> > -; AVX512BW-NEXT:    vpcmpgtd %xmm1, %xmm0, %k1
> > -; AVX512BW-NEXT:    vpcmpgtd %xmm3, %xmm2, %k0 {%k1}
> > +; AVX512BW-NEXT:    vpcmpgtw %xmm1, %xmm0, %k0
> > +; AVX512BW-NEXT:    vpcmpgtw %xmm3, %xmm2, %k1
> > +; AVX512BW-NEXT:    kandw %k1, %k0, %k0
> >  ; AVX512BW-NEXT:    kmovd %k0, %eax
> >  ; AVX512BW-NEXT:    # kill: def $al killed $al killed $eax
> >  ; AVX512BW-NEXT:    retq
> > @@ -846,35 +552,23 @@ define i4 @v4i16(<4 x i16> %a, <4 x i16>
> >  define i8 @v8i8(<8 x i8> %a, <8 x i8> %b, <8 x i8> %c, <8 x i8> %d) {
> >  ; SSE2-SSSE3-LABEL: v8i8:
> >  ; SSE2-SSSE3:       # %bb.0:
> > -; SSE2-SSSE3-NEXT:    psllw $8, %xmm3
> > -; SSE2-SSSE3-NEXT:    psraw $8, %xmm3
> > -; SSE2-SSSE3-NEXT:    psllw $8, %xmm2
> > -; SSE2-SSSE3-NEXT:    psraw $8, %xmm2
> > -; SSE2-SSSE3-NEXT:    pcmpgtw %xmm3, %xmm2
> > -; SSE2-SSSE3-NEXT:    psllw $8, %xmm1
> > -; SSE2-SSSE3-NEXT:    psraw $8, %xmm1
> > -; SSE2-SSSE3-NEXT:    psllw $8, %xmm0
> > -; SSE2-SSSE3-NEXT:    psraw $8, %xmm0
> > -; SSE2-SSSE3-NEXT:    pcmpgtw %xmm1, %xmm0
> > -; SSE2-SSSE3-NEXT:    pand %xmm2, %xmm0
> > -; SSE2-SSSE3-NEXT:    packsswb %xmm0, %xmm0
> > -; SSE2-SSSE3-NEXT:    pmovmskb %xmm0, %eax
> > +; SSE2-SSSE3-NEXT:    pcmpgtb %xmm1, %xmm0
> > +; SSE2-SSSE3-NEXT:    punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
> > +; SSE2-SSSE3-NEXT:    pcmpgtb %xmm3, %xmm2
> > +; SSE2-SSSE3-NEXT:    punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3],xmm1[4],xmm2[4],xmm1[5],xmm2[5],xmm1[6],xmm2[6],xmm1[7],xmm2[7]
> > +; SSE2-SSSE3-NEXT:    pand %xmm0, %xmm1
> > +; SSE2-SSSE3-NEXT:    packsswb %xmm0, %xmm1
> > +; SSE2-SSSE3-NEXT:    pmovmskb %xmm1, %eax
> >  ; SSE2-SSSE3-NEXT:    # kill: def $al killed $al killed $eax
> >  ; SSE2-SSSE3-NEXT:    retq
> >  ;
> >  ; AVX12-LABEL: v8i8:
> >  ; AVX12:       # %bb.0:
> > -; AVX12-NEXT:    vpsllw $8, %xmm3, %xmm3
> > -; AVX12-NEXT:    vpsraw $8, %xmm3, %xmm3
> > -; AVX12-NEXT:    vpsllw $8, %xmm2, %xmm2
> > -; AVX12-NEXT:    vpsraw $8, %xmm2, %xmm2
> > -; AVX12-NEXT:    vpcmpgtw %xmm3, %xmm2, %xmm2
> > -; AVX12-NEXT:    vpsllw $8, %xmm1, %xmm1
> > -; AVX12-NEXT:    vpsraw $8, %xmm1, %xmm1
> > -; AVX12-NEXT:    vpsllw $8, %xmm0, %xmm0
> > -; AVX12-NEXT:    vpsraw $8, %xmm0, %xmm0
> > -; AVX12-NEXT:    vpcmpgtw %xmm1, %xmm0, %xmm0
> > -; AVX12-NEXT:    vpand %xmm2, %xmm0, %xmm0
> > +; AVX12-NEXT:    vpcmpgtb %xmm1, %xmm0, %xmm0
> > +; AVX12-NEXT:    vpmovsxbw %xmm0, %xmm0
> > +; AVX12-NEXT:    vpcmpgtb %xmm3, %xmm2, %xmm1
> > +; AVX12-NEXT:    vpmovsxbw %xmm1, %xmm1
> > +; AVX12-NEXT:    vpand %xmm1, %xmm0, %xmm0
> >  ; AVX12-NEXT:    vpacksswb %xmm0, %xmm0, %xmm0
> >  ; AVX12-NEXT:    vpmovmskb %xmm0, %eax
> >  ; AVX12-NEXT:    # kill: def $al killed $al killed $eax
> > @@ -882,19 +576,13 @@ define i8 @v8i8(<8 x i8> %a, <8 x i8> %b
> >  ;
> >  ; AVX512F-LABEL: v8i8:
> >  ; AVX512F:       # %bb.0:
> > -; AVX512F-NEXT:    vpsllw $8, %xmm3, %xmm3
> > -; AVX512F-NEXT:    vpsraw $8, %xmm3, %xmm3
> > -; AVX512F-NEXT:    vpsllw $8, %xmm2, %xmm2
> > -; AVX512F-NEXT:    vpsraw $8, %xmm2, %xmm2
> > -; AVX512F-NEXT:    vpcmpgtw %xmm3, %xmm2, %xmm2
> > -; AVX512F-NEXT:    vpsllw $8, %xmm1, %xmm1
> > -; AVX512F-NEXT:    vpsraw $8, %xmm1, %xmm1
> > -; AVX512F-NEXT:    vpsllw $8, %xmm0, %xmm0
> > -; AVX512F-NEXT:    vpsraw $8, %xmm0, %xmm0
> > -; AVX512F-NEXT:    vpcmpgtw %xmm1, %xmm0, %xmm0
> > -; AVX512F-NEXT:    vpand %xmm2, %xmm0, %xmm0
> > -; AVX512F-NEXT:    vpmovsxwd %xmm0, %ymm0
> > -; AVX512F-NEXT:    vptestmd %ymm0, %ymm0, %k0
> > +; AVX512F-NEXT:    vpcmpgtb %xmm1, %xmm0, %xmm0
> > +; AVX512F-NEXT:    vpmovsxbd %xmm0, %zmm0
> > +; AVX512F-NEXT:    vptestmd %zmm0, %zmm0, %k0
> > +; AVX512F-NEXT:    vpcmpgtb %xmm3, %xmm2, %xmm0
> > +; AVX512F-NEXT:    vpmovsxbd %xmm0, %zmm0
> > +; AVX512F-NEXT:    vptestmd %zmm0, %zmm0, %k1
> > +; AVX512F-NEXT:    kandw %k1, %k0, %k0
> >  ; AVX512F-NEXT:    kmovw %k0, %eax
> >  ; AVX512F-NEXT:    # kill: def $al killed $al killed $eax
> >  ; AVX512F-NEXT:    vzeroupper
> > @@ -902,16 +590,9 @@ define i8 @v8i8(<8 x i8> %a, <8 x i8> %b
> >  ;
> >  ; AVX512BW-LABEL: v8i8:
> >  ; AVX512BW:       # %bb.0:
> > -; AVX512BW-NEXT:    vpsllw $8, %xmm3, %xmm3
> > -; AVX512BW-NEXT:    vpsraw $8, %xmm3, %xmm3
> > -; AVX512BW-NEXT:    vpsllw $8, %xmm2, %xmm2
> > -; AVX512BW-NEXT:    vpsraw $8, %xmm2, %xmm2
> > -; AVX512BW-NEXT:    vpsllw $8, %xmm1, %xmm1
> > -; AVX512BW-NEXT:    vpsraw $8, %xmm1, %xmm1
> > -; AVX512BW-NEXT:    vpsllw $8, %xmm0, %xmm0
> > -; AVX512BW-NEXT:    vpsraw $8, %xmm0, %xmm0
> > -; AVX512BW-NEXT:    vpcmpgtw %xmm1, %xmm0, %k1
> > -; AVX512BW-NEXT:    vpcmpgtw %xmm3, %xmm2, %k0 {%k1}
> > +; AVX512BW-NEXT:    vpcmpgtb %xmm1, %xmm0, %k0
> > +; AVX512BW-NEXT:    vpcmpgtb %xmm3, %xmm2, %k1
> > +; AVX512BW-NEXT:    kandw %k1, %k0, %k0
> >  ; AVX512BW-NEXT:    kmovd %k0, %eax
> >  ; AVX512BW-NEXT:    # kill: def $al killed $al killed $eax
> >  ; AVX512BW-NEXT:    retq
> >
> > Modified: llvm/trunk/test/CodeGen/X86/bitcast-setcc-128.ll
> > URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/bitcast-setcc-128.ll?rev=368183&r1=368182&r2=368183&view=diff
> > ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/bitcast-setcc-128.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/bitcast-setcc-128.ll Wed Aug  7 09:24:26 2019
> > @@ -144,87 +144,45 @@ define i16 @v16i8(<16 x i8> %a, <16 x i8
> >  }
> >
> >  define i2 @v2i8(<2 x i8> %a, <2 x i8> %b) {
> > -; SSE2-SSSE3-LABEL: v2i8:
> > -; SSE2-SSSE3:       # %bb.0:
> > -; SSE2-SSSE3-NEXT:    psllq $56, %xmm0
> > -; SSE2-SSSE3-NEXT:    movdqa %xmm0, %xmm2
> > -; SSE2-SSSE3-NEXT:    psrad $31, %xmm2
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm2 = xmm2[1,3,2,3]
> > -; SSE2-SSSE3-NEXT:    psrad $24, %xmm0
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,3,2,3]
> > -; SSE2-SSSE3-NEXT:    punpckldq {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1]
> > -; SSE2-SSSE3-NEXT:    psllq $56, %xmm1
> > -; SSE2-SSSE3-NEXT:    movdqa %xmm1, %xmm2
> > -; SSE2-SSSE3-NEXT:    psrad $31, %xmm2
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm2 = xmm2[1,3,2,3]
> > -; SSE2-SSSE3-NEXT:    psrad $24, %xmm1
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[1,3,2,3]
> > -; SSE2-SSSE3-NEXT:    punpckldq {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1]
> > -; SSE2-SSSE3-NEXT:    movdqa {{.*#+}} xmm2 = [2147483648,2147483648]
> > -; SSE2-SSSE3-NEXT:    pxor %xmm2, %xmm1
> > -; SSE2-SSSE3-NEXT:    pxor %xmm2, %xmm0
> > -; SSE2-SSSE3-NEXT:    movdqa %xmm0, %xmm2
> > -; SSE2-SSSE3-NEXT:    pcmpeqd %xmm1, %xmm2
> > -; SSE2-SSSE3-NEXT:    pcmpgtd %xmm1, %xmm0
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm1 = xmm0[0,0,2,2]
> > -; SSE2-SSSE3-NEXT:    pand %xmm2, %xmm1
> > -; SSE2-SSSE3-NEXT:    por %xmm0, %xmm1
> > -; SSE2-SSSE3-NEXT:    movmskpd %xmm1, %eax
> > -; SSE2-SSSE3-NEXT:    # kill: def $al killed $al killed $eax
> > -; SSE2-SSSE3-NEXT:    retq
> > +; SSE2-LABEL: v2i8:
> > +; SSE2:       # %bb.0:
> > +; SSE2-NEXT:    pcmpgtb %xmm1, %xmm0
> > +; SSE2-NEXT:    punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
> > +; SSE2-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3]
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,0,1,1]
> > +; SSE2-NEXT:    movmskpd %xmm0, %eax
> > +; SSE2-NEXT:    # kill: def $al killed $al killed $eax
> > +; SSE2-NEXT:    retq
> > +;
> > +; SSSE3-LABEL: v2i8:
> > +; SSSE3:       # %bb.0:
> > +; SSSE3-NEXT:    pcmpgtb %xmm1, %xmm0
> > +; SSSE3-NEXT:    pshufb {{.*#+}} xmm0 = xmm0[u,u,0,0,u,u,0,0,u,u,1,1,u,u,1,1]
> > +; SSSE3-NEXT:    movmskpd %xmm0, %eax
> > +; SSSE3-NEXT:    # kill: def $al killed $al killed $eax
> > +; SSSE3-NEXT:    retq
> >  ;
> > -; AVX1-LABEL: v2i8:
> > -; AVX1:       # %bb.0:
> > -; AVX1-NEXT:    vpsllq $56, %xmm1, %xmm1
> > -; AVX1-NEXT:    vpsrad $31, %xmm1, %xmm2
> > -; AVX1-NEXT:    vpsrad $24, %xmm1, %xmm1
> > -; AVX1-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
> > -; AVX1-NEXT:    vpblendw {{.*#+}} xmm1 = xmm1[0,1],xmm2[2,3],xmm1[4,5],xmm2[6,7]
> > -; AVX1-NEXT:    vpsllq $56, %xmm0, %xmm0
> > -; AVX1-NEXT:    vpsrad $31, %xmm0, %xmm2
> > -; AVX1-NEXT:    vpsrad $24, %xmm0, %xmm0
> > -; AVX1-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[1,1,3,3]
> > -; AVX1-NEXT:    vpblendw {{.*#+}} xmm0 = xmm0[0,1],xmm2[2,3],xmm0[4,5],xmm2[6,7]
> > -; AVX1-NEXT:    vpcmpgtq %xmm1, %xmm0, %xmm0
> > -; AVX1-NEXT:    vmovmskpd %xmm0, %eax
> > -; AVX1-NEXT:    # kill: def $al killed $al killed $eax
> > -; AVX1-NEXT:    retq
> > -;
> > -; AVX2-LABEL: v2i8:
> > -; AVX2:       # %bb.0:
> > -; AVX2-NEXT:    vpsllq $56, %xmm1, %xmm1
> > -; AVX2-NEXT:    vpsrad $31, %xmm1, %xmm2
> > -; AVX2-NEXT:    vpsrad $24, %xmm1, %xmm1
> > -; AVX2-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
> > -; AVX2-NEXT:    vpblendd {{.*#+}} xmm1 = xmm1[0],xmm2[1],xmm1[2],xmm2[3]
> > -; AVX2-NEXT:    vpsllq $56, %xmm0, %xmm0
> > -; AVX2-NEXT:    vpsrad $31, %xmm0, %xmm2
> > -; AVX2-NEXT:    vpsrad $24, %xmm0, %xmm0
> > -; AVX2-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[1,1,3,3]
> > -; AVX2-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm2[1],xmm0[2],xmm2[3]
> > -; AVX2-NEXT:    vpcmpgtq %xmm1, %xmm0, %xmm0
> > -; AVX2-NEXT:    vmovmskpd %xmm0, %eax
> > -; AVX2-NEXT:    # kill: def $al killed $al killed $eax
> > -; AVX2-NEXT:    retq
> > +; AVX12-LABEL: v2i8:
> > +; AVX12:       # %bb.0:
> > +; AVX12-NEXT:    vpcmpgtb %xmm1, %xmm0, %xmm0
> > +; AVX12-NEXT:    vpmovsxbq %xmm0, %xmm0
> > +; AVX12-NEXT:    vmovmskpd %xmm0, %eax
> > +; AVX12-NEXT:    # kill: def $al killed $al killed $eax
> > +; AVX12-NEXT:    retq
> >  ;
> >  ; AVX512F-LABEL: v2i8:
> >  ; AVX512F:       # %bb.0:
> > -; AVX512F-NEXT:    vpsllq $56, %xmm1, %xmm1
> > -; AVX512F-NEXT:    vpsraq $56, %xmm1, %xmm1
> > -; AVX512F-NEXT:    vpsllq $56, %xmm0, %xmm0
> > -; AVX512F-NEXT:    vpsraq $56, %xmm0, %xmm0
> > -; AVX512F-NEXT:    vpcmpgtq %xmm1, %xmm0, %k0
> > +; AVX512F-NEXT:    vpcmpgtb %xmm1, %xmm0, %xmm0
> > +; AVX512F-NEXT:    vpmovsxbd %xmm0, %zmm0
> > +; AVX512F-NEXT:    vptestmd %zmm0, %zmm0, %k0
> >  ; AVX512F-NEXT:    kmovw %k0, %eax
> >  ; AVX512F-NEXT:    # kill: def $al killed $al killed $eax
> > +; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> >  ;
> >  ; AVX512BW-LABEL: v2i8:
> >  ; AVX512BW:       # %bb.0:
> > -; AVX512BW-NEXT:    vpsllq $56, %xmm1, %xmm1
> > -; AVX512BW-NEXT:    vpsraq $56, %xmm1, %xmm1
> > -; AVX512BW-NEXT:    vpsllq $56, %xmm0, %xmm0
> > -; AVX512BW-NEXT:    vpsraq $56, %xmm0, %xmm0
> > -; AVX512BW-NEXT:    vpcmpgtq %xmm1, %xmm0, %k0
> > +; AVX512BW-NEXT:    vpcmpgtb %xmm1, %xmm0, %k0
> >  ; AVX512BW-NEXT:    kmovd %k0, %eax
> >  ; AVX512BW-NEXT:    # kill: def $al killed $al killed $eax
> >  ; AVX512BW-NEXT:    retq
> > @@ -236,85 +194,34 @@ define i2 @v2i8(<2 x i8> %a, <2 x i8> %b
> >  define i2 @v2i16(<2 x i16> %a, <2 x i16> %b) {
> >  ; SSE2-SSSE3-LABEL: v2i16:
> >  ; SSE2-SSSE3:       # %bb.0:
> > -; SSE2-SSSE3-NEXT:    psllq $48, %xmm0
> > -; SSE2-SSSE3-NEXT:    movdqa %xmm0, %xmm2
> > -; SSE2-SSSE3-NEXT:    psrad $31, %xmm2
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm2 = xmm2[1,3,2,3]
> > -; SSE2-SSSE3-NEXT:    psrad $16, %xmm0
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,3,2,3]
> > -; SSE2-SSSE3-NEXT:    punpckldq {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1]
> > -; SSE2-SSSE3-NEXT:    psllq $48, %xmm1
> > -; SSE2-SSSE3-NEXT:    movdqa %xmm1, %xmm2
> > -; SSE2-SSSE3-NEXT:    psrad $31, %xmm2
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm2 = xmm2[1,3,2,3]
> > -; SSE2-SSSE3-NEXT:    psrad $16, %xmm1
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[1,3,2,3]
> > -; SSE2-SSSE3-NEXT:    punpckldq {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1]
> > -; SSE2-SSSE3-NEXT:    movdqa {{.*#+}} xmm2 = [2147483648,2147483648]
> > -; SSE2-SSSE3-NEXT:    pxor %xmm2, %xmm1
> > -; SSE2-SSSE3-NEXT:    pxor %xmm2, %xmm0
> > -; SSE2-SSSE3-NEXT:    movdqa %xmm0, %xmm2
> > -; SSE2-SSSE3-NEXT:    pcmpeqd %xmm1, %xmm2
> > -; SSE2-SSSE3-NEXT:    pcmpgtd %xmm1, %xmm0
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm1 = xmm0[0,0,2,2]
> > -; SSE2-SSSE3-NEXT:    pand %xmm2, %xmm1
> > -; SSE2-SSSE3-NEXT:    por %xmm0, %xmm1
> > -; SSE2-SSSE3-NEXT:    movmskpd %xmm1, %eax
> > +; SSE2-SSSE3-NEXT:    pcmpgtw %xmm1, %xmm0
> > +; SSE2-SSSE3-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3]
> > +; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,0,1,1]
> > +; SSE2-SSSE3-NEXT:    movmskpd %xmm0, %eax
> >  ; SSE2-SSSE3-NEXT:    # kill: def $al killed $al killed $eax
> >  ; SSE2-SSSE3-NEXT:    retq
> >  ;
> > -; AVX1-LABEL: v2i16:
> > -; AVX1:       # %bb.0:
> > -; AVX1-NEXT:    vpsllq $48, %xmm1, %xmm1
> > -; AVX1-NEXT:    vpsrad $31, %xmm1, %xmm2
> > -; AVX1-NEXT:    vpsrad $16, %xmm1, %xmm1
> > -; AVX1-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
> > -; AVX1-NEXT:    vpblendw {{.*#+}} xmm1 = xmm1[0,1],xmm2[2,3],xmm1[4,5],xmm2[6,7]
> > -; AVX1-NEXT:    vpsllq $48, %xmm0, %xmm0
> > -; AVX1-NEXT:    vpsrad $31, %xmm0, %xmm2
> > -; AVX1-NEXT:    vpsrad $16, %xmm0, %xmm0
> > -; AVX1-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[1,1,3,3]
> > -; AVX1-NEXT:    vpblendw {{.*#+}} xmm0 = xmm0[0,1],xmm2[2,3],xmm0[4,5],xmm2[6,7]
> > -; AVX1-NEXT:    vpcmpgtq %xmm1, %xmm0, %xmm0
> > -; AVX1-NEXT:    vmovmskpd %xmm0, %eax
> > -; AVX1-NEXT:    # kill: def $al killed $al killed $eax
> > -; AVX1-NEXT:    retq
> > -;
> > -; AVX2-LABEL: v2i16:
> > -; AVX2:       # %bb.0:
> > -; AVX2-NEXT:    vpsllq $48, %xmm1, %xmm1
> > -; AVX2-NEXT:    vpsrad $31, %xmm1, %xmm2
> > -; AVX2-NEXT:    vpsrad $16, %xmm1, %xmm1
> > -; AVX2-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
> > -; AVX2-NEXT:    vpblendd {{.*#+}} xmm1 = xmm1[0],xmm2[1],xmm1[2],xmm2[3]
> > -; AVX2-NEXT:    vpsllq $48, %xmm0, %xmm0
> > -; AVX2-NEXT:    vpsrad $31, %xmm0, %xmm2
> > -; AVX2-NEXT:    vpsrad $16, %xmm0, %xmm0
> > -; AVX2-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[1,1,3,3]
> > -; AVX2-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm2[1],xmm0[2],xmm2[3]
> > -; AVX2-NEXT:    vpcmpgtq %xmm1, %xmm0, %xmm0
> > -; AVX2-NEXT:    vmovmskpd %xmm0, %eax
> > -; AVX2-NEXT:    # kill: def $al killed $al killed $eax
> > -; AVX2-NEXT:    retq
> > +; AVX12-LABEL: v2i16:
> > +; AVX12:       # %bb.0:
> > +; AVX12-NEXT:    vpcmpgtw %xmm1, %xmm0, %xmm0
> > +; AVX12-NEXT:    vpmovsxwq %xmm0, %xmm0
> > +; AVX12-NEXT:    vmovmskpd %xmm0, %eax
> > +; AVX12-NEXT:    # kill: def $al killed $al killed $eax
> > +; AVX12-NEXT:    retq
> >  ;
> >  ; AVX512F-LABEL: v2i16:
> >  ; AVX512F:       # %bb.0:
> > -; AVX512F-NEXT:    vpsllq $48, %xmm1, %xmm1
> > -; AVX512F-NEXT:    vpsraq $48, %xmm1, %xmm1
> > -; AVX512F-NEXT:    vpsllq $48, %xmm0, %xmm0
> > -; AVX512F-NEXT:    vpsraq $48, %xmm0, %xmm0
> > -; AVX512F-NEXT:    vpcmpgtq %xmm1, %xmm0, %k0
> > +; AVX512F-NEXT:    vpcmpgtw %xmm1, %xmm0, %xmm0
> > +; AVX512F-NEXT:    vpmovsxwd %xmm0, %ymm0
> > +; AVX512F-NEXT:    vptestmd %ymm0, %ymm0, %k0
> >  ; AVX512F-NEXT:    kmovw %k0, %eax
> >  ; AVX512F-NEXT:    # kill: def $al killed $al killed $eax
> > +; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> >  ;
> >  ; AVX512BW-LABEL: v2i16:
> >  ; AVX512BW:       # %bb.0:
> > -; AVX512BW-NEXT:    vpsllq $48, %xmm1, %xmm1
> > -; AVX512BW-NEXT:    vpsraq $48, %xmm1, %xmm1
> > -; AVX512BW-NEXT:    vpsllq $48, %xmm0, %xmm0
> > -; AVX512BW-NEXT:    vpsraq $48, %xmm0, %xmm0
> > -; AVX512BW-NEXT:    vpcmpgtq %xmm1, %xmm0, %k0
> > +; AVX512BW-NEXT:    vpcmpgtw %xmm1, %xmm0, %k0
> >  ; AVX512BW-NEXT:    kmovd %k0, %eax
> >  ; AVX512BW-NEXT:    # kill: def $al killed $al killed $eax
> >  ; AVX512BW-NEXT:    retq
> > @@ -326,73 +233,30 @@ define i2 @v2i16(<2 x i16> %a, <2 x i16>
> >  define i2 @v2i32(<2 x i32> %a, <2 x i32> %b) {
> >  ; SSE2-SSSE3-LABEL: v2i32:
> >  ; SSE2-SSSE3:       # %bb.0:
> > -; SSE2-SSSE3-NEXT:    psllq $32, %xmm0
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm2 = xmm0[1,3,2,3]
> > -; SSE2-SSSE3-NEXT:    psrad $31, %xmm0
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,3,2,3]
> > -; SSE2-SSSE3-NEXT:    punpckldq {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1]
> > -; SSE2-SSSE3-NEXT:    psllq $32, %xmm1
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm0 = xmm1[1,3,2,3]
> > -; SSE2-SSSE3-NEXT:    psrad $31, %xmm1
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[1,3,2,3]
> > -; SSE2-SSSE3-NEXT:    punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
> > -; SSE2-SSSE3-NEXT:    movdqa {{.*#+}} xmm1 = [2147483648,2147483648]
> > -; SSE2-SSSE3-NEXT:    pxor %xmm1, %xmm0
> > -; SSE2-SSSE3-NEXT:    pxor %xmm1, %xmm2
> > -; SSE2-SSSE3-NEXT:    movdqa %xmm2, %xmm1
> > -; SSE2-SSSE3-NEXT:    pcmpeqd %xmm0, %xmm1
> > -; SSE2-SSSE3-NEXT:    pcmpgtd %xmm0, %xmm2
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm0 = xmm2[0,0,2,2]
> > -; SSE2-SSSE3-NEXT:    pand %xmm1, %xmm0
> > -; SSE2-SSSE3-NEXT:    por %xmm2, %xmm0
> > +; SSE2-SSSE3-NEXT:    pcmpgtd %xmm1, %xmm0
> > +; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,0,1,1]
> >  ; SSE2-SSSE3-NEXT:    movmskpd %xmm0, %eax
> >  ; SSE2-SSSE3-NEXT:    # kill: def $al killed $al killed $eax
> >  ; SSE2-SSSE3-NEXT:    retq
> >  ;
> > -; AVX1-LABEL: v2i32:
> > -; AVX1:       # %bb.0:
> > -; AVX1-NEXT:    vpsllq $32, %xmm1, %xmm2
> > -; AVX1-NEXT:    vpsrad $31, %xmm2, %xmm2
> > -; AVX1-NEXT:    vpblendw {{.*#+}} xmm1 = xmm1[0,1],xmm2[2,3],xmm1[4,5],xmm2[6,7]
> > -; AVX1-NEXT:    vpsllq $32, %xmm0, %xmm2
> > -; AVX1-NEXT:    vpsrad $31, %xmm2, %xmm2
> > -; AVX1-NEXT:    vpblendw {{.*#+}} xmm0 = xmm0[0,1],xmm2[2,3],xmm0[4,5],xmm2[6,7]
> > -; AVX1-NEXT:    vpcmpgtq %xmm1, %xmm0, %xmm0
> > -; AVX1-NEXT:    vmovmskpd %xmm0, %eax
> > -; AVX1-NEXT:    # kill: def $al killed $al killed $eax
> > -; AVX1-NEXT:    retq
> > -;
> > -; AVX2-LABEL: v2i32:
> > -; AVX2:       # %bb.0:
> > -; AVX2-NEXT:    vpsllq $32, %xmm1, %xmm2
> > -; AVX2-NEXT:    vpsrad $31, %xmm2, %xmm2
> > -; AVX2-NEXT:    vpblendd {{.*#+}} xmm1 = xmm1[0],xmm2[1],xmm1[2],xmm2[3]
> > -; AVX2-NEXT:    vpsllq $32, %xmm0, %xmm2
> > -; AVX2-NEXT:    vpsrad $31, %xmm2, %xmm2
> > -; AVX2-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm2[1],xmm0[2],xmm2[3]
> > -; AVX2-NEXT:    vpcmpgtq %xmm1, %xmm0, %xmm0
> > -; AVX2-NEXT:    vmovmskpd %xmm0, %eax
> > -; AVX2-NEXT:    # kill: def $al killed $al killed $eax
> > -; AVX2-NEXT:    retq
> > +; AVX12-LABEL: v2i32:
> > +; AVX12:       # %bb.0:
> > +; AVX12-NEXT:    vpcmpgtd %xmm1, %xmm0, %xmm0
> > +; AVX12-NEXT:    vpmovsxdq %xmm0, %xmm0
> > +; AVX12-NEXT:    vmovmskpd %xmm0, %eax
> > +; AVX12-NEXT:    # kill: def $al killed $al killed $eax
> > +; AVX12-NEXT:    retq
> >  ;
> >  ; AVX512F-LABEL: v2i32:
> >  ; AVX512F:       # %bb.0:
> > -; AVX512F-NEXT:    vpsllq $32, %xmm1, %xmm1
> > -; AVX512F-NEXT:    vpsraq $32, %xmm1, %xmm1
> > -; AVX512F-NEXT:    vpsllq $32, %xmm0, %xmm0
> > -; AVX512F-NEXT:    vpsraq $32, %xmm0, %xmm0
> > -; AVX512F-NEXT:    vpcmpgtq %xmm1, %xmm0, %k0
> > +; AVX512F-NEXT:    vpcmpgtd %xmm1, %xmm0, %k0
> >  ; AVX512F-NEXT:    kmovw %k0, %eax
> >  ; AVX512F-NEXT:    # kill: def $al killed $al killed $eax
> >  ; AVX512F-NEXT:    retq
> >  ;
> >  ; AVX512BW-LABEL: v2i32:
> >  ; AVX512BW:       # %bb.0:
> > -; AVX512BW-NEXT:    vpsllq $32, %xmm1, %xmm1
> > -; AVX512BW-NEXT:    vpsraq $32, %xmm1, %xmm1
> > -; AVX512BW-NEXT:    vpsllq $32, %xmm0, %xmm0
> > -; AVX512BW-NEXT:    vpsraq $32, %xmm0, %xmm0
> > -; AVX512BW-NEXT:    vpcmpgtq %xmm1, %xmm0, %k0
> > +; AVX512BW-NEXT:    vpcmpgtd %xmm1, %xmm0, %k0
> >  ; AVX512BW-NEXT:    kmovd %k0, %eax
> >  ; AVX512BW-NEXT:    # kill: def $al killed $al killed $eax
> >  ; AVX512BW-NEXT:    retq
> > @@ -478,44 +342,34 @@ define i2 @v2f64(<2 x double> %a, <2 x d
> >  define i4 @v4i8(<4 x i8> %a, <4 x i8> %b) {
> >  ; SSE2-SSSE3-LABEL: v4i8:
> >  ; SSE2-SSSE3:       # %bb.0:
> > -; SSE2-SSSE3-NEXT:    pslld $24, %xmm1
> > -; SSE2-SSSE3-NEXT:    psrad $24, %xmm1
> > -; SSE2-SSSE3-NEXT:    pslld $24, %xmm0
> > -; SSE2-SSSE3-NEXT:    psrad $24, %xmm0
> > -; SSE2-SSSE3-NEXT:    pcmpgtd %xmm1, %xmm0
> > +; SSE2-SSSE3-NEXT:    pcmpgtb %xmm1, %xmm0
> > +; SSE2-SSSE3-NEXT:    punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
> > +; SSE2-SSSE3-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3]
> >  ; SSE2-SSSE3-NEXT:    movmskps %xmm0, %eax
> >  ; SSE2-SSSE3-NEXT:    # kill: def $al killed $al killed $eax
> >  ; SSE2-SSSE3-NEXT:    retq
> >  ;
> >  ; AVX12-LABEL: v4i8:
> >  ; AVX12:       # %bb.0:
> > -; AVX12-NEXT:    vpslld $24, %xmm1, %xmm1
> > -; AVX12-NEXT:    vpsrad $24, %xmm1, %xmm1
> > -; AVX12-NEXT:    vpslld $24, %xmm0, %xmm0
> > -; AVX12-NEXT:    vpsrad $24, %xmm0, %xmm0
> > -; AVX12-NEXT:    vpcmpgtd %xmm1, %xmm0, %xmm0
> > +; AVX12-NEXT:    vpcmpgtb %xmm1, %xmm0, %xmm0
> > +; AVX12-NEXT:    vpmovsxbd %xmm0, %xmm0
> >  ; AVX12-NEXT:    vmovmskps %xmm0, %eax
> >  ; AVX12-NEXT:    # kill: def $al killed $al killed $eax
> >  ; AVX12-NEXT:    retq
> >  ;
> >  ; AVX512F-LABEL: v4i8:
> >  ; AVX512F:       # %bb.0:
> > -; AVX512F-NEXT:    vpslld $24, %xmm1, %xmm1
> > -; AVX512F-NEXT:    vpsrad $24, %xmm1, %xmm1
> > -; AVX512F-NEXT:    vpslld $24, %xmm0, %xmm0
> > -; AVX512F-NEXT:    vpsrad $24, %xmm0, %xmm0
> > -; AVX512F-NEXT:    vpcmpgtd %xmm1, %xmm0, %k0
> > +; AVX512F-NEXT:    vpcmpgtb %xmm1, %xmm0, %xmm0
> > +; AVX512F-NEXT:    vpmovsxbd %xmm0, %zmm0
> > +; AVX512F-NEXT:    vptestmd %zmm0, %zmm0, %k0
> >  ; AVX512F-NEXT:    kmovw %k0, %eax
> >  ; AVX512F-NEXT:    # kill: def $al killed $al killed $eax
> > +; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> >  ;
> >  ; AVX512BW-LABEL: v4i8:
> >  ; AVX512BW:       # %bb.0:
> > -; AVX512BW-NEXT:    vpslld $24, %xmm1, %xmm1
> > -; AVX512BW-NEXT:    vpsrad $24, %xmm1, %xmm1
> > -; AVX512BW-NEXT:    vpslld $24, %xmm0, %xmm0
> > -; AVX512BW-NEXT:    vpsrad $24, %xmm0, %xmm0
> > -; AVX512BW-NEXT:    vpcmpgtd %xmm1, %xmm0, %k0
> > +; AVX512BW-NEXT:    vpcmpgtb %xmm1, %xmm0, %k0
> >  ; AVX512BW-NEXT:    kmovd %k0, %eax
> >  ; AVX512BW-NEXT:    # kill: def $al killed $al killed $eax
> >  ; AVX512BW-NEXT:    retq
> > @@ -527,44 +381,33 @@ define i4 @v4i8(<4 x i8> %a, <4 x i8> %b
> >  define i4 @v4i16(<4 x i16> %a, <4 x i16> %b) {
> >  ; SSE2-SSSE3-LABEL: v4i16:
> >  ; SSE2-SSSE3:       # %bb.0:
> > -; SSE2-SSSE3-NEXT:    pslld $16, %xmm1
> > -; SSE2-SSSE3-NEXT:    psrad $16, %xmm1
> > -; SSE2-SSSE3-NEXT:    pslld $16, %xmm0
> > -; SSE2-SSSE3-NEXT:    psrad $16, %xmm0
> > -; SSE2-SSSE3-NEXT:    pcmpgtd %xmm1, %xmm0
> > +; SSE2-SSSE3-NEXT:    pcmpgtw %xmm1, %xmm0
> > +; SSE2-SSSE3-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3]
> >  ; SSE2-SSSE3-NEXT:    movmskps %xmm0, %eax
> >  ; SSE2-SSSE3-NEXT:    # kill: def $al killed $al killed $eax
> >  ; SSE2-SSSE3-NEXT:    retq
> >  ;
> >  ; AVX12-LABEL: v4i16:
> >  ; AVX12:       # %bb.0:
> > -; AVX12-NEXT:    vpslld $16, %xmm1, %xmm1
> > -; AVX12-NEXT:    vpsrad $16, %xmm1, %xmm1
> > -; AVX12-NEXT:    vpslld $16, %xmm0, %xmm0
> > -; AVX12-NEXT:    vpsrad $16, %xmm0, %xmm0
> > -; AVX12-NEXT:    vpcmpgtd %xmm1, %xmm0, %xmm0
> > +; AVX12-NEXT:    vpcmpgtw %xmm1, %xmm0, %xmm0
> > +; AVX12-NEXT:    vpmovsxwd %xmm0, %xmm0
> >  ; AVX12-NEXT:    vmovmskps %xmm0, %eax
> >  ; AVX12-NEXT:    # kill: def $al killed $al killed $eax
> >  ; AVX12-NEXT:    retq
> >  ;
> >  ; AVX512F-LABEL: v4i16:
> >  ; AVX512F:       # %bb.0:
> > -; AVX512F-NEXT:    vpslld $16, %xmm1, %xmm1
> > -; AVX512F-NEXT:    vpsrad $16, %xmm1, %xmm1
> > -; AVX512F-NEXT:    vpslld $16, %xmm0, %xmm0
> > -; AVX512F-NEXT:    vpsrad $16, %xmm0, %xmm0
> > -; AVX512F-NEXT:    vpcmpgtd %xmm1, %xmm0, %k0
> > +; AVX512F-NEXT:    vpcmpgtw %xmm1, %xmm0, %xmm0
> > +; AVX512F-NEXT:    vpmovsxwd %xmm0, %ymm0
> > +; AVX512F-NEXT:    vptestmd %ymm0, %ymm0, %k0
> >  ; AVX512F-NEXT:    kmovw %k0, %eax
> >  ; AVX512F-NEXT:    # kill: def $al killed $al killed $eax
> > +; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> >  ;
> >  ; AVX512BW-LABEL: v4i16:
> >  ; AVX512BW:       # %bb.0:
> > -; AVX512BW-NEXT:    vpslld $16, %xmm1, %xmm1
> > -; AVX512BW-NEXT:    vpsrad $16, %xmm1, %xmm1
> > -; AVX512BW-NEXT:    vpslld $16, %xmm0, %xmm0
> > -; AVX512BW-NEXT:    vpsrad $16, %xmm0, %xmm0
> > -; AVX512BW-NEXT:    vpcmpgtd %xmm1, %xmm0, %k0
> > +; AVX512BW-NEXT:    vpcmpgtw %xmm1, %xmm0, %k0
> >  ; AVX512BW-NEXT:    kmovd %k0, %eax
> >  ; AVX512BW-NEXT:    # kill: def $al killed $al killed $eax
> >  ; AVX512BW-NEXT:    retq
> > @@ -576,11 +419,8 @@ define i4 @v4i16(<4 x i16> %a, <4 x i16>
> >  define i8 @v8i8(<8 x i8> %a, <8 x i8> %b) {
> >  ; SSE2-SSSE3-LABEL: v8i8:
> >  ; SSE2-SSSE3:       # %bb.0:
> > -; SSE2-SSSE3-NEXT:    psllw $8, %xmm1
> > -; SSE2-SSSE3-NEXT:    psraw $8, %xmm1
> > -; SSE2-SSSE3-NEXT:    psllw $8, %xmm0
> > -; SSE2-SSSE3-NEXT:    psraw $8, %xmm0
> > -; SSE2-SSSE3-NEXT:    pcmpgtw %xmm1, %xmm0
> > +; SSE2-SSSE3-NEXT:    pcmpgtb %xmm1, %xmm0
> > +; SSE2-SSSE3-NEXT:    punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
> >  ; SSE2-SSSE3-NEXT:    packsswb %xmm0, %xmm0
> >  ; SSE2-SSSE3-NEXT:    pmovmskb %xmm0, %eax
> >  ; SSE2-SSSE3-NEXT:    # kill: def $al killed $al killed $eax
> > @@ -588,11 +428,8 @@ define i8 @v8i8(<8 x i8> %a, <8 x i8> %b
> >  ;
> >  ; AVX12-LABEL: v8i8:
> >  ; AVX12:       # %bb.0:
> > -; AVX12-NEXT:    vpsllw $8, %xmm1, %xmm1
> > -; AVX12-NEXT:    vpsraw $8, %xmm1, %xmm1
> > -; AVX12-NEXT:    vpsllw $8, %xmm0, %xmm0
> > -; AVX12-NEXT:    vpsraw $8, %xmm0, %xmm0
> > -; AVX12-NEXT:    vpcmpgtw %xmm1, %xmm0, %xmm0
> > +; AVX12-NEXT:    vpcmpgtb %xmm1, %xmm0, %xmm0
> > +; AVX12-NEXT:    vpmovsxbw %xmm0, %xmm0
> >  ; AVX12-NEXT:    vpacksswb %xmm0, %xmm0, %xmm0
> >  ; AVX12-NEXT:    vpmovmskb %xmm0, %eax
> >  ; AVX12-NEXT:    # kill: def $al killed $al killed $eax
> > @@ -600,13 +437,9 @@ define i8 @v8i8(<8 x i8> %a, <8 x i8> %b
> >  ;
> >  ; AVX512F-LABEL: v8i8:
> >  ; AVX512F:       # %bb.0:
> > -; AVX512F-NEXT:    vpsllw $8, %xmm1, %xmm1
> > -; AVX512F-NEXT:    vpsraw $8, %xmm1, %xmm1
> > -; AVX512F-NEXT:    vpsllw $8, %xmm0, %xmm0
> > -; AVX512F-NEXT:    vpsraw $8, %xmm0, %xmm0
> > -; AVX512F-NEXT:    vpcmpgtw %xmm1, %xmm0, %xmm0
> > -; AVX512F-NEXT:    vpmovsxwd %xmm0, %ymm0
> > -; AVX512F-NEXT:    vptestmd %ymm0, %ymm0, %k0
> > +; AVX512F-NEXT:    vpcmpgtb %xmm1, %xmm0, %xmm0
> > +; AVX512F-NEXT:    vpmovsxbd %xmm0, %zmm0
> > +; AVX512F-NEXT:    vptestmd %zmm0, %zmm0, %k0
> >  ; AVX512F-NEXT:    kmovw %k0, %eax
> >  ; AVX512F-NEXT:    # kill: def $al killed $al killed $eax
> >  ; AVX512F-NEXT:    vzeroupper
> > @@ -614,11 +447,7 @@ define i8 @v8i8(<8 x i8> %a, <8 x i8> %b
> >  ;
> >  ; AVX512BW-LABEL: v8i8:
> >  ; AVX512BW:       # %bb.0:
> > -; AVX512BW-NEXT:    vpsllw $8, %xmm1, %xmm1
> > -; AVX512BW-NEXT:    vpsraw $8, %xmm1, %xmm1
> > -; AVX512BW-NEXT:    vpsllw $8, %xmm0, %xmm0
> > -; AVX512BW-NEXT:    vpsraw $8, %xmm0, %xmm0
> > -; AVX512BW-NEXT:    vpcmpgtw %xmm1, %xmm0, %k0
> > +; AVX512BW-NEXT:    vpcmpgtb %xmm1, %xmm0, %k0
> >  ; AVX512BW-NEXT:    kmovd %k0, %eax
> >  ; AVX512BW-NEXT:    # kill: def $al killed $al killed $eax
> >  ; AVX512BW-NEXT:    retq
> >
> > Modified: llvm/trunk/test/CodeGen/X86/bitcast-vector-bool.ll
> > URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/bitcast-vector-bool.ll?rev=368183&r1=368182&r2=368183&view=diff
> > ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/bitcast-vector-bool.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/bitcast-vector-bool.ll Wed Aug  7 09:24:26 2019
> > @@ -151,27 +151,14 @@ define i4 @bitcast_v8i16_to_v2i4(<8 x i1
> >  }
> >
> >  define i8 @bitcast_v16i8_to_v2i8(<16 x i8> %a0) nounwind {
> > -; SSE2-LABEL: bitcast_v16i8_to_v2i8:
> > -; SSE2:       # %bb.0:
> > -; SSE2-NEXT:    pmovmskb %xmm0, %eax
> > -; SSE2-NEXT:    movd %eax, %xmm0
> > -; SSE2-NEXT:    punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
> > -; SSE2-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3]
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
> > -; SSE2-NEXT:    movdqa %xmm0, -{{[0-9]+}}(%rsp)
> > -; SSE2-NEXT:    movb -{{[0-9]+}}(%rsp), %al
> > -; SSE2-NEXT:    addb -{{[0-9]+}}(%rsp), %al
> > -; SSE2-NEXT:    retq
> > -;
> > -; SSSE3-LABEL: bitcast_v16i8_to_v2i8:
> > -; SSSE3:       # %bb.0:
> > -; SSSE3-NEXT:    pmovmskb %xmm0, %eax
> > -; SSSE3-NEXT:    movd %eax, %xmm0
> > -; SSSE3-NEXT:    pshufb {{.*#+}} xmm0 = xmm0[0,u,u,u,u,u,u,u,1,u,u,u,u,u,u,u]
> > -; SSSE3-NEXT:    movdqa %xmm0, -{{[0-9]+}}(%rsp)
> > -; SSSE3-NEXT:    movb -{{[0-9]+}}(%rsp), %al
> > -; SSSE3-NEXT:    addb -{{[0-9]+}}(%rsp), %al
> > -; SSSE3-NEXT:    retq
> > +; SSE2-SSSE3-LABEL: bitcast_v16i8_to_v2i8:
> > +; SSE2-SSSE3:       # %bb.0:
> > +; SSE2-SSSE3-NEXT:    pmovmskb %xmm0, %eax
> > +; SSE2-SSSE3-NEXT:    movd %eax, %xmm0
> > +; SSE2-SSSE3-NEXT:    movdqa %xmm0, -{{[0-9]+}}(%rsp)
> > +; SSE2-SSSE3-NEXT:    movb -{{[0-9]+}}(%rsp), %al
> > +; SSE2-SSSE3-NEXT:    addb -{{[0-9]+}}(%rsp), %al
> > +; SSE2-SSSE3-NEXT:    retq
> >  ;
> >  ; AVX12-LABEL: bitcast_v16i8_to_v2i8:
> >  ; AVX12:       # %bb.0:
> > @@ -187,7 +174,7 @@ define i8 @bitcast_v16i8_to_v2i8(<16 x i
> >  ; AVX512:       # %bb.0:
> >  ; AVX512-NEXT:    vpmovb2m %xmm0, %k0
> >  ; AVX512-NEXT:    kmovw %k0, -{{[0-9]+}}(%rsp)
> > -; AVX512-NEXT:    vmovd {{.*#+}} xmm0 = mem[0],zero,zero,zero
> > +; AVX512-NEXT:    vmovdqa -{{[0-9]+}}(%rsp), %xmm0
> >  ; AVX512-NEXT:    vpextrb $0, %xmm0, %ecx
> >  ; AVX512-NEXT:    vpextrb $1, %xmm0, %eax
> >  ; AVX512-NEXT:    addb %cl, %al
> > @@ -318,29 +305,15 @@ define i4 @bitcast_v8i32_to_v2i4(<8 x i3
> >  }
> >
> >  define i8 @bitcast_v16i16_to_v2i8(<16 x i16> %a0) nounwind {
> > -; SSE2-LABEL: bitcast_v16i16_to_v2i8:
> > -; SSE2:       # %bb.0:
> > -; SSE2-NEXT:    packsswb %xmm1, %xmm0
> > -; SSE2-NEXT:    pmovmskb %xmm0, %eax
> > -; SSE2-NEXT:    movd %eax, %xmm0
> > -; SSE2-NEXT:    punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
> > -; SSE2-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3]
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
> > -; SSE2-NEXT:    movdqa %xmm0, -{{[0-9]+}}(%rsp)
> > -; SSE2-NEXT:    movb -{{[0-9]+}}(%rsp), %al
> > -; SSE2-NEXT:    addb -{{[0-9]+}}(%rsp), %al
> > -; SSE2-NEXT:    retq
> > -;
> > -; SSSE3-LABEL: bitcast_v16i16_to_v2i8:
> > -; SSSE3:       # %bb.0:
> > -; SSSE3-NEXT:    packsswb %xmm1, %xmm0
> > -; SSSE3-NEXT:    pmovmskb %xmm0, %eax
> > -; SSSE3-NEXT:    movd %eax, %xmm0
> > -; SSSE3-NEXT:    pshufb {{.*#+}} xmm0 = xmm0[0,u,u,u,u,u,u,u,1,u,u,u,u,u,u,u]
> > -; SSSE3-NEXT:    movdqa %xmm0, -{{[0-9]+}}(%rsp)
> > -; SSSE3-NEXT:    movb -{{[0-9]+}}(%rsp), %al
> > -; SSSE3-NEXT:    addb -{{[0-9]+}}(%rsp), %al
> > -; SSSE3-NEXT:    retq
> > +; SSE2-SSSE3-LABEL: bitcast_v16i16_to_v2i8:
> > +; SSE2-SSSE3:       # %bb.0:
> > +; SSE2-SSSE3-NEXT:    packsswb %xmm1, %xmm0
> > +; SSE2-SSSE3-NEXT:    pmovmskb %xmm0, %eax
> > +; SSE2-SSSE3-NEXT:    movd %eax, %xmm0
> > +; SSE2-SSSE3-NEXT:    movdqa %xmm0, -{{[0-9]+}}(%rsp)
> > +; SSE2-SSSE3-NEXT:    movb -{{[0-9]+}}(%rsp), %al
> > +; SSE2-SSSE3-NEXT:    addb -{{[0-9]+}}(%rsp), %al
> > +; SSE2-SSSE3-NEXT:    retq
> >  ;
> >  ; AVX1-LABEL: bitcast_v16i16_to_v2i8:
> >  ; AVX1:       # %bb.0:
> > @@ -374,7 +347,7 @@ define i8 @bitcast_v16i16_to_v2i8(<16 x
> >  ; AVX512:       # %bb.0:
> >  ; AVX512-NEXT:    vpmovw2m %ymm0, %k0
> >  ; AVX512-NEXT:    kmovw %k0, -{{[0-9]+}}(%rsp)
> > -; AVX512-NEXT:    vmovd {{.*#+}} xmm0 = mem[0],zero,zero,zero
> > +; AVX512-NEXT:    vmovdqa -{{[0-9]+}}(%rsp), %xmm0
> >  ; AVX512-NEXT:    vpextrb $0, %xmm0, %ecx
> >  ; AVX512-NEXT:    vpextrb $1, %xmm0, %eax
> >  ; AVX512-NEXT:    addb %cl, %al
> > @@ -392,12 +365,10 @@ define i8 @bitcast_v16i16_to_v2i8(<16 x
> >  define i16 @bitcast_v32i8_to_v2i16(<32 x i8> %a0) nounwind {
> >  ; SSE2-SSSE3-LABEL: bitcast_v32i8_to_v2i16:
> >  ; SSE2-SSSE3:       # %bb.0:
> > -; SSE2-SSSE3-NEXT:    pmovmskb %xmm0, %eax
> > -; SSE2-SSSE3-NEXT:    pmovmskb %xmm1, %ecx
> > -; SSE2-SSSE3-NEXT:    shll $16, %ecx
> > -; SSE2-SSSE3-NEXT:    orl %eax, %ecx
> > -; SSE2-SSSE3-NEXT:    movd %ecx, %xmm0
> > -; SSE2-SSSE3-NEXT:    pextrw $0, %xmm0, %ecx
> > +; SSE2-SSSE3-NEXT:    pmovmskb %xmm0, %ecx
> > +; SSE2-SSSE3-NEXT:    pmovmskb %xmm1, %eax
> > +; SSE2-SSSE3-NEXT:    shll $16, %eax
> > +; SSE2-SSSE3-NEXT:    movd %eax, %xmm0
> >  ; SSE2-SSSE3-NEXT:    pextrw $1, %xmm0, %eax
> >  ; SSE2-SSSE3-NEXT:    addl %ecx, %eax
> >  ; SSE2-SSSE3-NEXT:    # kill: def $ax killed $ax killed $eax
> > @@ -411,7 +382,6 @@ define i16 @bitcast_v32i8_to_v2i16(<32 x
> >  ; AVX1-NEXT:    shll $16, %ecx
> >  ; AVX1-NEXT:    orl %eax, %ecx
> >  ; AVX1-NEXT:    vmovd %ecx, %xmm0
> > -; AVX1-NEXT:    vpextrw $0, %xmm0, %ecx
> >  ; AVX1-NEXT:    vpextrw $1, %xmm0, %eax
> >  ; AVX1-NEXT:    addl %ecx, %eax
> >  ; AVX1-NEXT:    # kill: def $ax killed $ax killed $eax
> > @@ -420,9 +390,8 @@ define i16 @bitcast_v32i8_to_v2i16(<32 x
> >  ;
> >  ; AVX2-LABEL: bitcast_v32i8_to_v2i16:
> >  ; AVX2:       # %bb.0:
> > -; AVX2-NEXT:    vpmovmskb %ymm0, %eax
> > -; AVX2-NEXT:    vmovd %eax, %xmm0
> > -; AVX2-NEXT:    vpextrw $0, %xmm0, %ecx
> > +; AVX2-NEXT:    vpmovmskb %ymm0, %ecx
> > +; AVX2-NEXT:    vmovd %ecx, %xmm0
> >  ; AVX2-NEXT:    vpextrw $1, %xmm0, %eax
> >  ; AVX2-NEXT:    addl %ecx, %eax
> >  ; AVX2-NEXT:    # kill: def $ax killed $ax killed $eax
> > @@ -437,8 +406,8 @@ define i16 @bitcast_v32i8_to_v2i16(<32 x
> >  ; AVX512-NEXT:    subq $32, %rsp
> >  ; AVX512-NEXT:    vpmovb2m %ymm0, %k0
> >  ; AVX512-NEXT:    kmovd %k0, (%rsp)
> > -; AVX512-NEXT:    vmovd {{.*#+}} xmm0 = mem[0],zero,zero,zero
> > -; AVX512-NEXT:    vpextrw $0, %xmm0, %ecx
> > +; AVX512-NEXT:    vmovdqa (%rsp), %xmm0
> > +; AVX512-NEXT:    vmovd %xmm0, %ecx
> >  ; AVX512-NEXT:    vpextrw $1, %xmm0, %eax
> >  ; AVX512-NEXT:    addl %ecx, %eax
> >  ; AVX512-NEXT:    # kill: def $ax killed $ax killed $eax
> > @@ -579,33 +548,17 @@ define i4 @bitcast_v8i64_to_v2i4(<8 x i6
> >  }
> >
> >  define i8 @bitcast_v16i32_to_v2i8(<16 x i32> %a0) nounwind {
> > -; SSE2-LABEL: bitcast_v16i32_to_v2i8:
> > -; SSE2:       # %bb.0:
> > -; SSE2-NEXT:    packssdw %xmm3, %xmm2
> > -; SSE2-NEXT:    packssdw %xmm1, %xmm0
> > -; SSE2-NEXT:    packsswb %xmm2, %xmm0
> > -; SSE2-NEXT:    pmovmskb %xmm0, %eax
> > -; SSE2-NEXT:    movd %eax, %xmm0
> > -; SSE2-NEXT:    punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
> > -; SSE2-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3]
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
> > -; SSE2-NEXT:    movdqa %xmm0, -{{[0-9]+}}(%rsp)
> > -; SSE2-NEXT:    movb -{{[0-9]+}}(%rsp), %al
> > -; SSE2-NEXT:    addb -{{[0-9]+}}(%rsp), %al
> > -; SSE2-NEXT:    retq
> > -;
> > -; SSSE3-LABEL: bitcast_v16i32_to_v2i8:
> > -; SSSE3:       # %bb.0:
> > -; SSSE3-NEXT:    packssdw %xmm3, %xmm2
> > -; SSSE3-NEXT:    packssdw %xmm1, %xmm0
> > -; SSSE3-NEXT:    packsswb %xmm2, %xmm0
> > -; SSSE3-NEXT:    pmovmskb %xmm0, %eax
> > -; SSSE3-NEXT:    movd %eax, %xmm0
> > -; SSSE3-NEXT:    pshufb {{.*#+}} xmm0 = xmm0[0,u,u,u,u,u,u,u,1,u,u,u,u,u,u,u]
> > -; SSSE3-NEXT:    movdqa %xmm0, -{{[0-9]+}}(%rsp)
> > -; SSSE3-NEXT:    movb -{{[0-9]+}}(%rsp), %al
> > -; SSSE3-NEXT:    addb -{{[0-9]+}}(%rsp), %al
> > -; SSSE3-NEXT:    retq
> > +; SSE2-SSSE3-LABEL: bitcast_v16i32_to_v2i8:
> > +; SSE2-SSSE3:       # %bb.0:
> > +; SSE2-SSSE3-NEXT:    packssdw %xmm3, %xmm2
> > +; SSE2-SSSE3-NEXT:    packssdw %xmm1, %xmm0
> > +; SSE2-SSSE3-NEXT:    packsswb %xmm2, %xmm0
> > +; SSE2-SSSE3-NEXT:    pmovmskb %xmm0, %eax
> > +; SSE2-SSSE3-NEXT:    movd %eax, %xmm0
> > +; SSE2-SSSE3-NEXT:    movdqa %xmm0, -{{[0-9]+}}(%rsp)
> > +; SSE2-SSSE3-NEXT:    movb -{{[0-9]+}}(%rsp), %al
> > +; SSE2-SSSE3-NEXT:    addb -{{[0-9]+}}(%rsp), %al
> > +; SSE2-SSSE3-NEXT:    retq
> >  ;
> >  ; AVX1-LABEL: bitcast_v16i32_to_v2i8:
> >  ; AVX1:       # %bb.0:
> > @@ -646,7 +599,7 @@ define i8 @bitcast_v16i32_to_v2i8(<16 x
> >  ; AVX512-NEXT:    vpxor %xmm1, %xmm1, %xmm1
> >  ; AVX512-NEXT:    vpcmpgtd %zmm0, %zmm1, %k0
> >  ; AVX512-NEXT:    kmovw %k0, -{{[0-9]+}}(%rsp)
> > -; AVX512-NEXT:    vmovd {{.*#+}} xmm0 = mem[0],zero,zero,zero
> > +; AVX512-NEXT:    vmovdqa -{{[0-9]+}}(%rsp), %xmm0
> >  ; AVX512-NEXT:    vpextrb $0, %xmm0, %ecx
> >  ; AVX512-NEXT:    vpextrb $1, %xmm0, %eax
> >  ; AVX512-NEXT:    addb %cl, %al
> > @@ -665,13 +618,11 @@ define i16 @bitcast_v32i16_to_v2i16(<32
> >  ; SSE2-SSSE3-LABEL: bitcast_v32i16_to_v2i16:
> >  ; SSE2-SSSE3:       # %bb.0:
> >  ; SSE2-SSSE3-NEXT:    packsswb %xmm1, %xmm0
> > -; SSE2-SSSE3-NEXT:    pmovmskb %xmm0, %eax
> > +; SSE2-SSSE3-NEXT:    pmovmskb %xmm0, %ecx
> >  ; SSE2-SSSE3-NEXT:    packsswb %xmm3, %xmm2
> > -; SSE2-SSSE3-NEXT:    pmovmskb %xmm2, %ecx
> > -; SSE2-SSSE3-NEXT:    shll $16, %ecx
> > -; SSE2-SSSE3-NEXT:    orl %eax, %ecx
> > -; SSE2-SSSE3-NEXT:    movd %ecx, %xmm0
> > -; SSE2-SSSE3-NEXT:    pextrw $0, %xmm0, %ecx
> > +; SSE2-SSSE3-NEXT:    pmovmskb %xmm2, %eax
> > +; SSE2-SSSE3-NEXT:    shll $16, %eax
> > +; SSE2-SSSE3-NEXT:    movd %eax, %xmm0
> >  ; SSE2-SSSE3-NEXT:    pextrw $1, %xmm0, %eax
> >  ; SSE2-SSSE3-NEXT:    addl %ecx, %eax
> >  ; SSE2-SSSE3-NEXT:    # kill: def $ax killed $ax killed $eax
> > @@ -688,7 +639,6 @@ define i16 @bitcast_v32i16_to_v2i16(<32
> >  ; AVX1-NEXT:    shll $16, %ecx
> >  ; AVX1-NEXT:    orl %eax, %ecx
> >  ; AVX1-NEXT:    vmovd %ecx, %xmm0
> > -; AVX1-NEXT:    vpextrw $0, %xmm0, %ecx
> >  ; AVX1-NEXT:    vpextrw $1, %xmm0, %eax
> >  ; AVX1-NEXT:    addl %ecx, %eax
> >  ; AVX1-NEXT:    # kill: def $ax killed $ax killed $eax
> > @@ -699,9 +649,8 @@ define i16 @bitcast_v32i16_to_v2i16(<32
> >  ; AVX2:       # %bb.0:
> >  ; AVX2-NEXT:    vpacksswb %ymm1, %ymm0, %ymm0
> >  ; AVX2-NEXT:    vpermq {{.*#+}} ymm0 = ymm0[0,2,1,3]
> > -; AVX2-NEXT:    vpmovmskb %ymm0, %eax
> > -; AVX2-NEXT:    vmovd %eax, %xmm0
> > -; AVX2-NEXT:    vpextrw $0, %xmm0, %ecx
> > +; AVX2-NEXT:    vpmovmskb %ymm0, %ecx
> > +; AVX2-NEXT:    vmovd %ecx, %xmm0
> >  ; AVX2-NEXT:    vpextrw $1, %xmm0, %eax
> >  ; AVX2-NEXT:    addl %ecx, %eax
> >  ; AVX2-NEXT:    # kill: def $ax killed $ax killed $eax
> > @@ -716,8 +665,8 @@ define i16 @bitcast_v32i16_to_v2i16(<32
> >  ; AVX512-NEXT:    subq $32, %rsp
> >  ; AVX512-NEXT:    vpmovw2m %zmm0, %k0
> >  ; AVX512-NEXT:    kmovd %k0, (%rsp)
> > -; AVX512-NEXT:    vmovd {{.*#+}} xmm0 = mem[0],zero,zero,zero
> > -; AVX512-NEXT:    vpextrw $0, %xmm0, %ecx
> > +; AVX512-NEXT:    vmovdqa (%rsp), %xmm0
> > +; AVX512-NEXT:    vmovd %xmm0, %ecx
> >  ; AVX512-NEXT:    vpextrw $1, %xmm0, %eax
> >  ; AVX512-NEXT:    addl %ecx, %eax
> >  ; AVX512-NEXT:    # kill: def $ax killed $ax killed $eax
> > @@ -984,9 +933,9 @@ define i32 @bitcast_v64i8_to_v2i32(<64 x
> >  ; SSE2-SSSE3-NEXT:    orl %ecx, %edx
> >  ; SSE2-SSSE3-NEXT:    orl %eax, %edx
> >  ; SSE2-SSSE3-NEXT:    movw %dx, -{{[0-9]+}}(%rsp)
> > -; SSE2-SSSE3-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> > +; SSE2-SSSE3-NEXT:    movdqa -{{[0-9]+}}(%rsp), %xmm0
> >  ; SSE2-SSSE3-NEXT:    movd %xmm0, %ecx
> > -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,3,0,1]
> > +; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,1,2,3]
> >  ; SSE2-SSSE3-NEXT:    movd %xmm0, %eax
> >  ; SSE2-SSSE3-NEXT:    addl %ecx, %eax
> >  ; SSE2-SSSE3-NEXT:    retq
> > @@ -1246,7 +1195,7 @@ define i32 @bitcast_v64i8_to_v2i32(<64 x
> >  ; AVX1-NEXT:    orl %ecx, %edx
> >  ; AVX1-NEXT:    orl %eax, %edx
> >  ; AVX1-NEXT:    movl %edx, -{{[0-9]+}}(%rsp)
> > -; AVX1-NEXT:    vmovq {{.*#+}} xmm0 = mem[0],zero
> > +; AVX1-NEXT:    vmovdqa -{{[0-9]+}}(%rsp), %xmm0
> >  ; AVX1-NEXT:    vmovd %xmm0, %ecx
> >  ; AVX1-NEXT:    vpextrd $1, %xmm0, %eax
> >  ; AVX1-NEXT:    addl %ecx, %eax
> > @@ -1506,7 +1455,7 @@ define i32 @bitcast_v64i8_to_v2i32(<64 x
> >  ; AVX2-NEXT:    orl %ecx, %edx
> >  ; AVX2-NEXT:    orl %eax, %edx
> >  ; AVX2-NEXT:    movl %edx, -{{[0-9]+}}(%rsp)
> > -; AVX2-NEXT:    vmovq {{.*#+}} xmm0 = mem[0],zero
> > +; AVX2-NEXT:    vmovdqa -{{[0-9]+}}(%rsp), %xmm0
> >  ; AVX2-NEXT:    vmovd %xmm0, %ecx
> >  ; AVX2-NEXT:    vpextrd $1, %xmm0, %eax
> >  ; AVX2-NEXT:    addl %ecx, %eax
> > @@ -1517,7 +1466,7 @@ define i32 @bitcast_v64i8_to_v2i32(<64 x
> >  ; AVX512:       # %bb.0:
> >  ; AVX512-NEXT:    vpmovb2m %zmm0, %k0
> >  ; AVX512-NEXT:    kmovq %k0, -{{[0-9]+}}(%rsp)
> > -; AVX512-NEXT:    vmovq {{.*#+}} xmm0 = mem[0],zero
> > +; AVX512-NEXT:    vmovdqa -{{[0-9]+}}(%rsp), %xmm0
> >  ; AVX512-NEXT:    vmovd %xmm0, %ecx
> >  ; AVX512-NEXT:    vpextrd $1, %xmm0, %eax
> >  ; AVX512-NEXT:    addl %ecx, %eax
> >
> > Modified: llvm/trunk/test/CodeGen/X86/bitreverse.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/bitreverse.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/bitreverse.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/bitreverse.ll Wed Aug  7 09:24:26 2019
> > @@ -55,13 +55,11 @@ define <2 x i16> @test_bitreverse_v2i16(
> >  ; X64-NEXT:    pxor %xmm1, %xmm1
> >  ; X64-NEXT:    movdqa %xmm0, %xmm2
> >  ; X64-NEXT:    punpckhbw {{.*#+}} xmm2 =
> xmm2[8],xmm1[8],xmm2[9],xmm1[9],xmm2[10],xmm1[10],xmm2[11],xmm1[11],xmm2[12],xmm1[12],xmm2[13],xmm1[13],xmm2[14],xmm1[14],xmm2[15],xmm1[15]
> > -; X64-NEXT:    pshufd {{.*#+}} xmm2 = xmm2[2,3,0,1]
> > -; X64-NEXT:    pshuflw {{.*#+}} xmm2 = xmm2[3,2,1,0,4,5,6,7]
> > -; X64-NEXT:    pshufhw {{.*#+}} xmm2 = xmm2[0,1,2,3,7,6,5,4]
> > +; X64-NEXT:    pshuflw {{.*#+}} xmm2 = xmm2[1,0,3,2,4,5,6,7]
> > +; X64-NEXT:    pshufhw {{.*#+}} xmm2 = xmm2[0,1,2,3,5,4,7,6]
> >  ; X64-NEXT:    punpcklbw {{.*#+}} xmm0 =
> xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
> > -; X64-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[2,3,0,1]
> > -; X64-NEXT:    pshuflw {{.*#+}} xmm0 = xmm0[3,2,1,0,4,5,6,7]
> > -; X64-NEXT:    pshufhw {{.*#+}} xmm0 = xmm0[0,1,2,3,7,6,5,4]
> > +; X64-NEXT:    pshuflw {{.*#+}} xmm0 = xmm0[1,0,3,2,4,5,6,7]
> > +; X64-NEXT:    pshufhw {{.*#+}} xmm0 = xmm0[0,1,2,3,5,4,7,6]
> >  ; X64-NEXT:    packuswb %xmm2, %xmm0
> >  ; X64-NEXT:    movdqa %xmm0, %xmm1
> >  ; X64-NEXT:    psllw $4, %xmm1
> > @@ -81,7 +79,6 @@ define <2 x i16> @test_bitreverse_v2i16(
> >  ; X64-NEXT:    pand {{.*}}(%rip), %xmm0
> >  ; X64-NEXT:    psrlw $1, %xmm0
> >  ; X64-NEXT:    por %xmm1, %xmm0
> > -; X64-NEXT:    psrlq $48, %xmm0
> >  ; X64-NEXT:    retq
> >    %b = call <2 x i16> @llvm.bitreverse.v2i16(<2 x i16> %a)
> >    ret <2 x i16> %b
> > @@ -410,7 +407,7 @@ define <2 x i16> @fold_v2i16() {
> >  ;
> >  ; X64-LABEL: fold_v2i16:
> >  ; X64:       # %bb.0:
> > -; X64-NEXT:    movaps {{.*#+}} xmm0 = [61440,240]
> > +; X64-NEXT:    movaps {{.*#+}} xmm0 = <61440,240,u,u,u,u,u,u>
> >  ; X64-NEXT:    retq
> >    %b = call <2 x i16> @llvm.bitreverse.v2i16(<2 x i16> <i16 15, i16
> 3840>)
> >    ret <2 x i16> %b
> >
> > Modified: llvm/trunk/test/CodeGen/X86/bswap-vector.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/bswap-vector.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/bswap-vector.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/bswap-vector.ll Wed Aug  7 09:24:26 2019
> > @@ -291,23 +291,22 @@ define <4 x i16> @test7(<4 x i16> %v) {
> >  ; CHECK-NOSSSE3-NEXT:    pxor %xmm1, %xmm1
> >  ; CHECK-NOSSSE3-NEXT:    movdqa %xmm0, %xmm2
> >  ; CHECK-NOSSSE3-NEXT:    punpckhbw {{.*#+}} xmm2 =
> xmm2[8],xmm1[8],xmm2[9],xmm1[9],xmm2[10],xmm1[10],xmm2[11],xmm1[11],xmm2[12],xmm1[12],xmm2[13],xmm1[13],xmm2[14],xmm1[14],xmm2[15],xmm1[15]
> > -; CHECK-NOSSSE3-NEXT:    pshuflw {{.*#+}} xmm2 = xmm2[3,2,1,0,4,5,6,7]
> > -; CHECK-NOSSSE3-NEXT:    pshufhw {{.*#+}} xmm2 = xmm2[0,1,2,3,7,6,5,4]
> > +; CHECK-NOSSSE3-NEXT:    pshuflw {{.*#+}} xmm2 = xmm2[1,0,3,2,4,5,6,7]
> > +; CHECK-NOSSSE3-NEXT:    pshufhw {{.*#+}} xmm2 = xmm2[0,1,2,3,5,4,7,6]
> >  ; CHECK-NOSSSE3-NEXT:    punpcklbw {{.*#+}} xmm0 =
> xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
> > -; CHECK-NOSSSE3-NEXT:    pshuflw {{.*#+}} xmm0 = xmm0[3,2,1,0,4,5,6,7]
> > -; CHECK-NOSSSE3-NEXT:    pshufhw {{.*#+}} xmm0 = xmm0[0,1,2,3,7,6,5,4]
> > +; CHECK-NOSSSE3-NEXT:    pshuflw {{.*#+}} xmm0 = xmm0[1,0,3,2,4,5,6,7]
> > +; CHECK-NOSSSE3-NEXT:    pshufhw {{.*#+}} xmm0 = xmm0[0,1,2,3,5,4,7,6]
> >  ; CHECK-NOSSSE3-NEXT:    packuswb %xmm2, %xmm0
> > -; CHECK-NOSSSE3-NEXT:    psrld $16, %xmm0
> >  ; CHECK-NOSSSE3-NEXT:    retq
> >  ;
> >  ; CHECK-SSSE3-LABEL: test7:
> >  ; CHECK-SSSE3:       # %bb.0: # %entry
> > -; CHECK-SSSE3-NEXT:    pshufb {{.*#+}} xmm0 =
> xmm0[1,0],zero,zero,xmm0[5,4],zero,zero,xmm0[9,8],zero,zero,xmm0[13,12],zero,zero
> > +; CHECK-SSSE3-NEXT:    pshufb {{.*#+}} xmm0 =
> xmm0[1,0,3,2,5,4,7,6,9,8,11,10,13,12,15,14]
> >  ; CHECK-SSSE3-NEXT:    retq
> >  ;
> >  ; CHECK-AVX-LABEL: test7:
> >  ; CHECK-AVX:       # %bb.0: # %entry
> > -; CHECK-AVX-NEXT:    vpshufb {{.*#+}} xmm0 =
> xmm0[1,0],zero,zero,xmm0[5,4],zero,zero,xmm0[9,8],zero,zero,xmm0[13,12],zero,zero
> > +; CHECK-AVX-NEXT:    vpshufb {{.*#+}} xmm0 =
> xmm0[1,0,3,2,5,4,7,6,9,8,11,10,13,12,15,14]
> >  ; CHECK-AVX-NEXT:    retq
> >  ;
> >  ; CHECK-WIDE-AVX-LABEL: test7:
> >
> > Modified: llvm/trunk/test/CodeGen/X86/buildvec-insertvec.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/buildvec-insertvec.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/buildvec-insertvec.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/buildvec-insertvec.ll Wed Aug  7
> 09:24:26 2019
> > @@ -6,22 +6,29 @@ define void @foo(<3 x float> %in, <4 x i
> >  ; SSE2-LABEL: foo:
> >  ; SSE2:       # %bb.0:
> >  ; SSE2-NEXT:    cvttps2dq %xmm0, %xmm0
> > -; SSE2-NEXT:    movl $255, %eax
> > -; SSE2-NEXT:    movd %eax, %xmm1
> > -; SSE2-NEXT:    shufps {{.*#+}} xmm1 = xmm1[0,0],xmm0[2,0]
> > -; SSE2-NEXT:    shufps {{.*#+}} xmm0 = xmm0[0,1],xmm1[2,0]
> > -; SSE2-NEXT:    andps {{.*}}(%rip), %xmm0
> > -; SSE2-NEXT:    packuswb %xmm0, %xmm0
> > -; SSE2-NEXT:    packuswb %xmm0, %xmm0
> > +; SSE2-NEXT:    movaps %xmm0, -{{[0-9]+}}(%rsp)
> > +; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %eax
> > +; SSE2-NEXT:    movl -{{[0-9]+}}(%rsp), %ecx
> > +; SSE2-NEXT:    shll $8, %ecx
> > +; SSE2-NEXT:    orl %eax, %ecx
> > +; SSE2-NEXT:    movd %ecx, %xmm0
> > +; SSE2-NEXT:    movl $65280, %eax # imm = 0xFF00
> > +; SSE2-NEXT:    orl -{{[0-9]+}}(%rsp), %eax
> > +; SSE2-NEXT:    pinsrw $1, %eax, %xmm0
> >  ; SSE2-NEXT:    movd %xmm0, (%rdi)
> >  ; SSE2-NEXT:    retq
> >  ;
> >  ; SSE41-LABEL: foo:
> >  ; SSE41:       # %bb.0:
> >  ; SSE41-NEXT:    cvttps2dq %xmm0, %xmm0
> > +; SSE41-NEXT:    pextrb $8, %xmm0, %eax
> > +; SSE41-NEXT:    pextrb $4, %xmm0, %ecx
> > +; SSE41-NEXT:    pextrb $0, %xmm0, %edx
> > +; SSE41-NEXT:    movd %edx, %xmm0
> > +; SSE41-NEXT:    pinsrb $1, %ecx, %xmm0
> > +; SSE41-NEXT:    pinsrb $2, %eax, %xmm0
> >  ; SSE41-NEXT:    movl $255, %eax
> > -; SSE41-NEXT:    pinsrd $3, %eax, %xmm0
> > -; SSE41-NEXT:    pshufb {{.*#+}} xmm0 =
> xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
> > +; SSE41-NEXT:    pinsrb $3, %eax, %xmm0
> >  ; SSE41-NEXT:    movd %xmm0, (%rdi)
> >  ; SSE41-NEXT:    retq
> >    %t0 = fptoui <3 x float> %in to <3 x i8>
> >
> > Modified: llvm/trunk/test/CodeGen/X86/combine-64bit-vec-binop.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/combine-64bit-vec-binop.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/combine-64bit-vec-binop.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/combine-64bit-vec-binop.ll Wed Aug  7
> 09:24:26 2019
> > @@ -101,9 +101,9 @@ define double @test2_mul(double %A, doub
> >  define double @test3_mul(double %A, double %B) {
> >  ; SSE41-LABEL: test3_mul:
> >  ; SSE41:       # %bb.0:
> > -; SSE41-NEXT:    pmovzxbw {{.*#+}} xmm2 =
> xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
> > -; SSE41-NEXT:    pmovzxbw {{.*#+}} xmm0 =
> xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero
> > -; SSE41-NEXT:    pmullw %xmm2, %xmm0
> > +; SSE41-NEXT:    pmovzxbw {{.*#+}} xmm1 =
> xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero
> > +; SSE41-NEXT:    pmovzxbw {{.*#+}} xmm0 =
> xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
> > +; SSE41-NEXT:    pmullw %xmm1, %xmm0
> >  ; SSE41-NEXT:    pshufb {{.*#+}} xmm0 =
> xmm0[0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u]
> >  ; SSE41-NEXT:    retq
> >    %1 = bitcast double %A to <8 x i8>
> >
> > Modified: llvm/trunk/test/CodeGen/X86/combine-or.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/combine-or.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/combine-or.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/combine-or.ll Wed Aug  7 09:24:26 2019
> > @@ -362,7 +362,7 @@ define <4 x float> @test25(<4 x float> %
> >  define <4 x i8> @test_crash(<4 x i8> %a, <4 x i8> %b) {
> >  ; CHECK-LABEL: test_crash:
> >  ; CHECK:       # %bb.0:
> > -; CHECK-NEXT:    blendps {{.*#+}} xmm0 = xmm1[0,1],xmm0[2,3]
> > +; CHECK-NEXT:    pblendw {{.*#+}} xmm0 =
> xmm1[0],xmm0[1],xmm1[2,3,4,5,6,7]
> >  ; CHECK-NEXT:    retq
> >    %shuf1 = shufflevector <4 x i8> %a, <4 x i8> zeroinitializer, <4 x
> i32><i32 4, i32 4, i32 2, i32 3>
> >    %shuf2 = shufflevector <4 x i8> %b, <4 x i8> zeroinitializer, <4 x
> i32><i32 0, i32 1, i32 4, i32 4>
> >
> > Modified: llvm/trunk/test/CodeGen/X86/complex-fastmath.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/complex-fastmath.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/complex-fastmath.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/complex-fastmath.ll Wed Aug  7 09:24:26
> 2019
> > @@ -39,7 +39,7 @@ define <2 x float> @complex_square_f32(<
> >  ; FMA-NEXT:    vaddss %xmm0, %xmm0, %xmm2
> >  ; FMA-NEXT:    vmulss %xmm2, %xmm1, %xmm2
> >  ; FMA-NEXT:    vmulss %xmm1, %xmm1, %xmm1
> > -; FMA-NEXT:    vfmsub231ss %xmm0, %xmm0, %xmm1
> > +; FMA-NEXT:    vfmsub231ss {{.*#+}} xmm1 = (xmm0 * xmm0) - xmm1
> >  ; FMA-NEXT:    vinsertps {{.*#+}} xmm0 = xmm1[0],xmm2[0],xmm1[2,3]
> >  ; FMA-NEXT:    retq
> >    %2 = extractelement <2 x float> %0, i32 0
> > @@ -85,7 +85,7 @@ define <2 x double> @complex_square_f64(
> >  ; FMA-NEXT:    vaddsd %xmm0, %xmm0, %xmm2
> >  ; FMA-NEXT:    vmulsd %xmm2, %xmm1, %xmm2
> >  ; FMA-NEXT:    vmulsd %xmm1, %xmm1, %xmm1
> > -; FMA-NEXT:    vfmsub231sd %xmm0, %xmm0, %xmm1
> > +; FMA-NEXT:    vfmsub231sd {{.*#+}} xmm1 = (xmm0 * xmm0) - xmm1
> >  ; FMA-NEXT:    vunpcklpd {{.*#+}} xmm0 = xmm1[0],xmm2[0]
> >  ; FMA-NEXT:    retq
> >    %2 = extractelement <2 x double> %0, i32 0
> > @@ -137,9 +137,9 @@ define <2 x float> @complex_mul_f32(<2 x
> >  ; FMA-NEXT:    vmovshdup {{.*#+}} xmm2 = xmm0[1,1,3,3]
> >  ; FMA-NEXT:    vmovshdup {{.*#+}} xmm3 = xmm1[1,1,3,3]
> >  ; FMA-NEXT:    vmulss %xmm2, %xmm1, %xmm4
> > -; FMA-NEXT:    vfmadd231ss %xmm0, %xmm3, %xmm4
> > +; FMA-NEXT:    vfmadd231ss {{.*#+}} xmm4 = (xmm3 * xmm0) + xmm4
> >  ; FMA-NEXT:    vmulss %xmm2, %xmm3, %xmm2
> > -; FMA-NEXT:    vfmsub231ss %xmm0, %xmm1, %xmm2
> > +; FMA-NEXT:    vfmsub231ss {{.*#+}} xmm2 = (xmm1 * xmm0) - xmm2
> >  ; FMA-NEXT:    vinsertps {{.*#+}} xmm0 = xmm2[0],xmm4[0],xmm2[2,3]
> >  ; FMA-NEXT:    retq
> >    %3 = extractelement <2 x float> %0, i32 0
> > @@ -192,9 +192,9 @@ define <2 x double> @complex_mul_f64(<2
> >  ; FMA-NEXT:    vpermilpd {{.*#+}} xmm2 = xmm0[1,0]
> >  ; FMA-NEXT:    vpermilpd {{.*#+}} xmm3 = xmm1[1,0]
> >  ; FMA-NEXT:    vmulsd %xmm2, %xmm1, %xmm4
> > -; FMA-NEXT:    vfmadd231sd %xmm0, %xmm3, %xmm4
> > +; FMA-NEXT:    vfmadd231sd {{.*#+}} xmm4 = (xmm3 * xmm0) + xmm4
> >  ; FMA-NEXT:    vmulsd %xmm2, %xmm3, %xmm2
> > -; FMA-NEXT:    vfmsub231sd %xmm0, %xmm1, %xmm2
> > +; FMA-NEXT:    vfmsub231sd {{.*#+}} xmm2 = (xmm1 * xmm0) - xmm2
> >  ; FMA-NEXT:    vunpcklpd {{.*#+}} xmm0 = xmm2[0],xmm4[0]
> >  ; FMA-NEXT:    retq
> >    %3 = extractelement <2 x double> %0, i32 0
> >
> > Modified: llvm/trunk/test/CodeGen/X86/cvtv2f32.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/cvtv2f32.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/cvtv2f32.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/cvtv2f32.ll Wed Aug  7 09:24:26 2019
> > @@ -42,11 +42,9 @@ define <2 x float> @uitofp_2i32_cvt_buil
> >  define <2 x float> @uitofp_2i32_buildvector_cvt(i32 %x, i32 %y, <2 x
> float> %v) {
> >  ; X32-LABEL: uitofp_2i32_buildvector_cvt:
> >  ; X32:       # %bb.0:
> > -; X32-NEXT:    movss {{.*#+}} xmm1 = mem[0],zero,zero,zero
> > -; X32-NEXT:    movss {{.*#+}} xmm2 = mem[0],zero,zero,zero
> > -; X32-NEXT:    unpcklpd {{.*#+}} xmm2 = xmm2[0],xmm1[0]
> > -; X32-NEXT:    movapd {{.*#+}} xmm1 =
> [4.503599627370496E+15,4.503599627370496E+15]
> > -; X32-NEXT:    orpd %xmm1, %xmm2
> > +; X32-NEXT:    movdqa {{.*#+}} xmm1 =
> [4.503599627370496E+15,4.503599627370496E+15]
> > +; X32-NEXT:    pmovzxdq {{.*#+}} xmm2 = mem[0],zero,mem[1],zero
> > +; X32-NEXT:    por %xmm1, %xmm2
> >  ; X32-NEXT:    subpd %xmm1, %xmm2
> >  ; X32-NEXT:    cvtpd2ps %xmm2, %xmm1
> >  ; X32-NEXT:    mulps %xmm1, %xmm0
> > @@ -54,13 +52,13 @@ define <2 x float> @uitofp_2i32_buildvec
> >  ;
> >  ; X64-LABEL: uitofp_2i32_buildvector_cvt:
> >  ; X64:       # %bb.0:
> > -; X64-NEXT:    movd %esi, %xmm1
> > -; X64-NEXT:    movd %edi, %xmm2
> > -; X64-NEXT:    punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm1[0]
> > -; X64-NEXT:    movdqa {{.*#+}} xmm1 =
> [4.503599627370496E+15,4.503599627370496E+15]
> > -; X64-NEXT:    por %xmm1, %xmm2
> > -; X64-NEXT:    subpd %xmm1, %xmm2
> > -; X64-NEXT:    cvtpd2ps %xmm2, %xmm1
> > +; X64-NEXT:    movd %edi, %xmm1
> > +; X64-NEXT:    pinsrd $1, %esi, %xmm1
> > +; X64-NEXT:    pmovzxdq {{.*#+}} xmm1 = xmm1[0],zero,xmm1[1],zero
> > +; X64-NEXT:    movdqa {{.*#+}} xmm2 =
> [4.503599627370496E+15,4.503599627370496E+15]
> > +; X64-NEXT:    por %xmm2, %xmm1
> > +; X64-NEXT:    subpd %xmm2, %xmm1
> > +; X64-NEXT:    cvtpd2ps %xmm1, %xmm1
> >  ; X64-NEXT:    mulps %xmm1, %xmm0
> >  ; X64-NEXT:    retq
> >    %t1 = insertelement <2 x i32> undef, i32 %x, i32 0
> > @@ -73,23 +71,21 @@ define <2 x float> @uitofp_2i32_buildvec
> >  define <2 x float> @uitofp_2i32_legalized(<2 x i32> %in, <2 x float>
> %v) {
> >  ; X32-LABEL: uitofp_2i32_legalized:
> >  ; X32:       # %bb.0:
> > -; X32-NEXT:    xorps %xmm2, %xmm2
> > -; X32-NEXT:    blendps {{.*#+}} xmm2 = xmm0[0],xmm2[1],xmm0[2],xmm2[3]
> > -; X32-NEXT:    movaps {{.*#+}} xmm0 =
> [4.503599627370496E+15,4.503599627370496E+15]
> > -; X32-NEXT:    orps %xmm0, %xmm2
> > -; X32-NEXT:    subpd %xmm0, %xmm2
> > -; X32-NEXT:    cvtpd2ps %xmm2, %xmm0
> > +; X32-NEXT:    pmovzxdq {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero
> > +; X32-NEXT:    movdqa {{.*#+}} xmm2 =
> [4.503599627370496E+15,4.503599627370496E+15]
> > +; X32-NEXT:    por %xmm2, %xmm0
> > +; X32-NEXT:    subpd %xmm2, %xmm0
> > +; X32-NEXT:    cvtpd2ps %xmm0, %xmm0
> >  ; X32-NEXT:    mulps %xmm1, %xmm0
> >  ; X32-NEXT:    retl
> >  ;
> >  ; X64-LABEL: uitofp_2i32_legalized:
> >  ; X64:       # %bb.0:
> > -; X64-NEXT:    xorps %xmm2, %xmm2
> > -; X64-NEXT:    blendps {{.*#+}} xmm2 = xmm0[0],xmm2[1],xmm0[2],xmm2[3]
> > -; X64-NEXT:    movaps {{.*#+}} xmm0 =
> [4.503599627370496E+15,4.503599627370496E+15]
> > -; X64-NEXT:    orps %xmm0, %xmm2
> > -; X64-NEXT:    subpd %xmm0, %xmm2
> > -; X64-NEXT:    cvtpd2ps %xmm2, %xmm0
> > +; X64-NEXT:    pmovzxdq {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero
> > +; X64-NEXT:    movdqa {{.*#+}} xmm2 =
> [4.503599627370496E+15,4.503599627370496E+15]
> > +; X64-NEXT:    por %xmm2, %xmm0
> > +; X64-NEXT:    subpd %xmm2, %xmm0
> > +; X64-NEXT:    cvtpd2ps %xmm0, %xmm0
> >  ; X64-NEXT:    mulps %xmm1, %xmm0
> >  ; X64-NEXT:    retq
> >    %t1 = uitofp <2 x i32> %in to <2 x float>
> >
> > Modified: llvm/trunk/test/CodeGen/X86/extract-concat.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/extract-concat.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/extract-concat.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/extract-concat.ll Wed Aug  7 09:24:26
> 2019
> > @@ -5,9 +5,14 @@ define void @foo(<4 x float> %in, <4 x i
> >  ; CHECK-LABEL: foo:
> >  ; CHECK:       # %bb.0:
> >  ; CHECK-NEXT:    cvttps2dq %xmm0, %xmm0
> > +; CHECK-NEXT:    pextrb $8, %xmm0, %eax
> > +; CHECK-NEXT:    pextrb $4, %xmm0, %ecx
> > +; CHECK-NEXT:    pextrb $0, %xmm0, %edx
> > +; CHECK-NEXT:    movd %edx, %xmm0
> > +; CHECK-NEXT:    pinsrb $1, %ecx, %xmm0
> > +; CHECK-NEXT:    pinsrb $2, %eax, %xmm0
> >  ; CHECK-NEXT:    movl $255, %eax
> > -; CHECK-NEXT:    pinsrd $3, %eax, %xmm0
> > -; CHECK-NEXT:    pshufb {{.*#+}} xmm0 =
> xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
> > +; CHECK-NEXT:    pinsrb $3, %eax, %xmm0
> >  ; CHECK-NEXT:    movd %xmm0, (%rdi)
> >  ; CHECK-NEXT:    retq
> >    %t0 = fptosi <4 x float> %in to <4 x i32>
> >
> > Modified: llvm/trunk/test/CodeGen/X86/extract-insert.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/extract-insert.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/extract-insert.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/extract-insert.ll Wed Aug  7 09:24:26
> 2019
> > @@ -31,12 +31,10 @@ define i8 @extractelt_bitcast(i32 %x) no
> >  define i8 @extractelt_bitcast_extra_use(i32 %x, <4 x i8>* %p) nounwind {
> >  ; X86-LABEL: extractelt_bitcast_extra_use:
> >  ; X86:       # %bb.0:
> > -; X86-NEXT:    pushl %eax
> >  ; X86-NEXT:    movl {{[0-9]+}}(%esp), %eax
> >  ; X86-NEXT:    movl {{[0-9]+}}(%esp), %ecx
> >  ; X86-NEXT:    movl %eax, (%ecx)
> >  ; X86-NEXT:    # kill: def $al killed $al killed $eax
> > -; X86-NEXT:    popl %ecx
> >  ; X86-NEXT:    retl
> >  ;
> >  ; X64-LABEL: extractelt_bitcast_extra_use:
> >
> > Modified: llvm/trunk/test/CodeGen/X86/f16c-intrinsics.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/f16c-intrinsics.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/f16c-intrinsics.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/f16c-intrinsics.ll Wed Aug  7 09:24:26
> 2019
> > @@ -268,14 +268,12 @@ define void @test_x86_vcvtps2ph_128_m(<4
> >  ; X32-AVX512VL-LABEL: test_x86_vcvtps2ph_128_m:
> >  ; X32-AVX512VL:       # %bb.0: # %entry
> >  ; X32-AVX512VL-NEXT:    movl {{[0-9]+}}(%esp), %eax # encoding:
> [0x8b,0x44,0x24,0x04]
> > -; X32-AVX512VL-NEXT:    vcvtps2ph $3, %xmm0, %xmm0 # EVEX TO VEX
> Compression encoding: [0xc4,0xe3,0x79,0x1d,0xc0,0x03]
> > -; X32-AVX512VL-NEXT:    vmovlps %xmm0, (%eax) # EVEX TO VEX Compression
> encoding: [0xc5,0xf8,0x13,0x00]
> > +; X32-AVX512VL-NEXT:    vcvtps2ph $3, %xmm0, (%eax) # EVEX TO VEX
> Compression encoding: [0xc4,0xe3,0x79,0x1d,0x00,0x03]
> >  ; X32-AVX512VL-NEXT:    retl # encoding: [0xc3]
> >  ;
> >  ; X64-AVX512VL-LABEL: test_x86_vcvtps2ph_128_m:
> >  ; X64-AVX512VL:       # %bb.0: # %entry
> > -; X64-AVX512VL-NEXT:    vcvtps2ph $3, %xmm0, %xmm0 # EVEX TO VEX
> Compression encoding: [0xc4,0xe3,0x79,0x1d,0xc0,0x03]
> > -; X64-AVX512VL-NEXT:    vmovlps %xmm0, (%rdi) # EVEX TO VEX Compression
> encoding: [0xc5,0xf8,0x13,0x07]
> > +; X64-AVX512VL-NEXT:    vcvtps2ph $3, %xmm0, (%rdi) # EVEX TO VEX
> Compression encoding: [0xc4,0xe3,0x79,0x1d,0x07,0x03]
> >  ; X64-AVX512VL-NEXT:    retq # encoding: [0xc3]
> >  entry:
> >    %0 = tail call <8 x i16> @llvm.x86.vcvtps2ph.128(<4 x float> %a, i32
> 3)
> >
> > Modified: llvm/trunk/test/CodeGen/X86/fold-vector-sext-zext.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/fold-vector-sext-zext.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/fold-vector-sext-zext.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/fold-vector-sext-zext.ll Wed Aug  7
> 09:24:26 2019
> > @@ -11,12 +11,12 @@
> >  define <4 x i16> @test_sext_4i8_4i16() {
> >  ; X32-LABEL: test_sext_4i8_4i16:
> >  ; X32:       # %bb.0:
> > -; X32-NEXT:    vmovaps {{.*#+}} xmm0 = [0,4294967295,2,4294967293]
> > +; X32-NEXT:    vmovaps {{.*#+}} xmm0 = <0,65535,2,65533,u,u,u,u>
> >  ; X32-NEXT:    retl
> >  ;
> >  ; X64-LABEL: test_sext_4i8_4i16:
> >  ; X64:       # %bb.0:
> > -; X64-NEXT:    vmovaps {{.*#+}} xmm0 = [0,4294967295,2,4294967293]
> > +; X64-NEXT:    vmovaps {{.*#+}} xmm0 = <0,65535,2,65533,u,u,u,u>
> >  ; X64-NEXT:    retq
> >    %1 = insertelement <4 x i8> undef, i8 0, i32 0
> >    %2 = insertelement <4 x i8> %1, i8 -1, i32 1
> > @@ -29,12 +29,12 @@ define <4 x i16> @test_sext_4i8_4i16() {
> >  define <4 x i16> @test_sext_4i8_4i16_undef() {
> >  ; X32-LABEL: test_sext_4i8_4i16_undef:
> >  ; X32:       # %bb.0:
> > -; X32-NEXT:    vmovaps {{.*#+}} xmm0 = <u,4294967295,u,4294967293>
> > +; X32-NEXT:    vmovaps {{.*#+}} xmm0 = <u,65535,u,65533,u,u,u,u>
> >  ; X32-NEXT:    retl
> >  ;
> >  ; X64-LABEL: test_sext_4i8_4i16_undef:
> >  ; X64:       # %bb.0:
> > -; X64-NEXT:    vmovaps {{.*#+}} xmm0 = <u,4294967295,u,4294967293>
> > +; X64-NEXT:    vmovaps {{.*#+}} xmm0 = <u,65535,u,65533,u,u,u,u>
> >  ; X64-NEXT:    retq
> >    %1 = insertelement <4 x i8> undef, i8 undef, i32 0
> >    %2 = insertelement <4 x i8> %1, i8 -1, i32 1
> > @@ -207,12 +207,12 @@ define <8 x i32> @test_sext_8i8_8i32_und
> >  define <4 x i16> @test_zext_4i8_4i16() {
> >  ; X32-LABEL: test_zext_4i8_4i16:
> >  ; X32:       # %bb.0:
> > -; X32-NEXT:    vmovaps {{.*#+}} xmm0 = [0,255,2,253]
> > +; X32-NEXT:    vmovaps {{.*#+}} xmm0 = <0,255,2,253,u,u,u,u>
> >  ; X32-NEXT:    retl
> >  ;
> >  ; X64-LABEL: test_zext_4i8_4i16:
> >  ; X64:       # %bb.0:
> > -; X64-NEXT:    vmovaps {{.*#+}} xmm0 = [0,255,2,253]
> > +; X64-NEXT:    vmovaps {{.*#+}} xmm0 = <0,255,2,253,u,u,u,u>
> >  ; X64-NEXT:    retq
> >    %1 = insertelement <4 x i8> undef, i8 0, i32 0
> >    %2 = insertelement <4 x i8> %1, i8 -1, i32 1
> > @@ -261,12 +261,12 @@ define <4 x i64> @test_zext_4i8_4i64() {
> >  define <4 x i16> @test_zext_4i8_4i16_undef() {
> >  ; X32-LABEL: test_zext_4i8_4i16_undef:
> >  ; X32:       # %bb.0:
> > -; X32-NEXT:    vmovaps {{.*#+}} xmm0 = [0,255,0,253]
> > +; X32-NEXT:    vmovaps {{.*#+}} xmm0 = <0,255,0,253,u,u,u,u>
> >  ; X32-NEXT:    retl
> >  ;
> >  ; X64-LABEL: test_zext_4i8_4i16_undef:
> >  ; X64:       # %bb.0:
> > -; X64-NEXT:    vmovaps {{.*#+}} xmm0 = [0,255,0,253]
> > +; X64-NEXT:    vmovaps {{.*#+}} xmm0 = <0,255,0,253,u,u,u,u>
> >  ; X64-NEXT:    retq
> >    %1 = insertelement <4 x i8> undef, i8 undef, i32 0
> >    %2 = insertelement <4 x i8> %1, i8 -1, i32 1
> >
> > Modified: llvm/trunk/test/CodeGen/X86/insertelement-shuffle.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/insertelement-shuffle.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/insertelement-shuffle.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/insertelement-shuffle.ll Wed Aug  7
> 09:24:26 2019
> > @@ -30,18 +30,10 @@ define <8 x float> @insert_subvector_256
> >  define <8 x i64> @insert_subvector_512(i32 %x0, i32 %x1, <8 x i64> %v)
> nounwind {
> >  ; X86_AVX256-LABEL: insert_subvector_512:
> >  ; X86_AVX256:       # %bb.0:
> > -; X86_AVX256-NEXT:    pushl %ebp
> > -; X86_AVX256-NEXT:    movl %esp, %ebp
> > -; X86_AVX256-NEXT:    andl $-8, %esp
> > -; X86_AVX256-NEXT:    subl $8, %esp
> > -; X86_AVX256-NEXT:    vmovsd {{.*#+}} xmm2 = mem[0],zero
> > -; X86_AVX256-NEXT:    vmovlps %xmm2, (%esp)
> >  ; X86_AVX256-NEXT:    vextracti128 $1, %ymm0, %xmm2
> > -; X86_AVX256-NEXT:    vpinsrd $0, (%esp), %xmm2, %xmm2
> > +; X86_AVX256-NEXT:    vpinsrd $0, {{[0-9]+}}(%esp), %xmm2, %xmm2
> >  ; X86_AVX256-NEXT:    vpinsrd $1, {{[0-9]+}}(%esp), %xmm2, %xmm2
> >  ; X86_AVX256-NEXT:    vinserti128 $1, %xmm2, %ymm0, %ymm0
> > -; X86_AVX256-NEXT:    movl %ebp, %esp
> > -; X86_AVX256-NEXT:    popl %ebp
> >  ; X86_AVX256-NEXT:    retl
> >  ;
> >  ; X64_AVX256-LABEL: insert_subvector_512:
> >
> > Modified: llvm/trunk/test/CodeGen/X86/known-bits.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/known-bits.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/known-bits.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/known-bits.ll Wed Aug  7 09:24:26 2019
> > @@ -5,100 +5,44 @@
> >  define void @knownbits_zext_in_reg(i8*) nounwind {
> >  ; X32-LABEL: knownbits_zext_in_reg:
> >  ; X32:       # %bb.0: # %BB
> > -; X32-NEXT:    pushl %ebp
> >  ; X32-NEXT:    pushl %ebx
> > -; X32-NEXT:    pushl %edi
> > -; X32-NEXT:    pushl %esi
> > -; X32-NEXT:    subl $16, %esp
> >  ; X32-NEXT:    movl {{[0-9]+}}(%esp), %eax
> >  ; X32-NEXT:    movzbl (%eax), %ecx
> >  ; X32-NEXT:    imull $101, %ecx, %eax
> >  ; X32-NEXT:    shrl $14, %eax
> > -; X32-NEXT:    imull $177, %ecx, %ecx
> > -; X32-NEXT:    shrl $14, %ecx
> > -; X32-NEXT:    movzbl %al, %eax
> > -; X32-NEXT:    vpxor %xmm0, %xmm0, %xmm0
> > -; X32-NEXT:    vpinsrd $1, %eax, %xmm0, %xmm1
> > -; X32-NEXT:    vbroadcastss {{.*#+}} xmm2 =
> [3.57331108E-43,3.57331108E-43,3.57331108E-43,3.57331108E-43]
> > -; X32-NEXT:    vpand %xmm2, %xmm1, %xmm1
> > -; X32-NEXT:    movzbl %cl, %eax
> > -; X32-NEXT:    vpinsrd $1, %eax, %xmm0, %xmm0
> > -; X32-NEXT:    vpand %xmm2, %xmm0, %xmm0
> > -; X32-NEXT:    vpextrd $1, %xmm1, {{[-0-9]+}}(%e{{[sb]}}p) # 4-byte
> Folded Spill
> > -; X32-NEXT:    vpextrd $1, %xmm0, {{[-0-9]+}}(%e{{[sb]}}p) # 4-byte
> Folded Spill
> > -; X32-NEXT:    xorl %ecx, %ecx
> > -; X32-NEXT:    vmovd %xmm1, {{[-0-9]+}}(%e{{[sb]}}p) # 4-byte Folded
> Spill
> > -; X32-NEXT:    vmovd %xmm0, (%esp) # 4-byte Folded Spill
> > -; X32-NEXT:    vpextrd $2, %xmm1, %edi
> > -; X32-NEXT:    vpextrd $2, %xmm0, %esi
> > -; X32-NEXT:    vpextrd $3, %xmm1, %ebx
> > -; X32-NEXT:    vpextrd $3, %xmm0, %ebp
> > +; X32-NEXT:    imull $177, %ecx, %edx
> > +; X32-NEXT:    shrl $14, %edx
> > +; X32-NEXT:    movzbl %al, %ecx
> > +; X32-NEXT:    xorl %ebx, %ebx
> >  ; X32-NEXT:    .p2align 4, 0x90
> >  ; X32-NEXT:  .LBB0_1: # %CF
> >  ; X32-NEXT:    # =>This Loop Header: Depth=1
> >  ; X32-NEXT:    # Child Loop BB0_2 Depth 2
> > -; X32-NEXT:    movl {{[-0-9]+}}(%e{{[sb]}}p), %eax # 4-byte Reload
> > -; X32-NEXT:    xorl %edx, %edx
> > -; X32-NEXT:    divl {{[-0-9]+}}(%e{{[sb]}}p) # 4-byte Folded Reload
> > -; X32-NEXT:    movl {{[-0-9]+}}(%e{{[sb]}}p), %eax # 4-byte Reload
> > -; X32-NEXT:    xorl %edx, %edx
> > -; X32-NEXT:    divl (%esp) # 4-byte Folded Reload
> > -; X32-NEXT:    movl %edi, %eax
> > -; X32-NEXT:    xorl %edx, %edx
> > -; X32-NEXT:    divl %esi
> > -; X32-NEXT:    movl %ebx, %eax
> > -; X32-NEXT:    xorl %edx, %edx
> > -; X32-NEXT:    divl %ebp
> > +; X32-NEXT:    movl %ecx, %eax
> > +; X32-NEXT:    divb %dl
> >  ; X32-NEXT:    .p2align 4, 0x90
> >  ; X32-NEXT:  .LBB0_2: # %CF237
> >  ; X32-NEXT:    # Parent Loop BB0_1 Depth=1
> >  ; X32-NEXT:    # => This Inner Loop Header: Depth=2
> > -; X32-NEXT:    testb %cl, %cl
> > +; X32-NEXT:    testb %bl, %bl
> >  ; X32-NEXT:    jne .LBB0_2
> >  ; X32-NEXT:    jmp .LBB0_1
> >  ;
> >  ; X64-LABEL: knownbits_zext_in_reg:
> >  ; X64:       # %bb.0: # %BB
> > -; X64-NEXT:    pushq %rbp
> > -; X64-NEXT:    pushq %rbx
> >  ; X64-NEXT:    movzbl (%rdi), %eax
> >  ; X64-NEXT:    imull $101, %eax, %ecx
> >  ; X64-NEXT:    shrl $14, %ecx
> > -; X64-NEXT:    imull $177, %eax, %eax
> > -; X64-NEXT:    shrl $14, %eax
> > +; X64-NEXT:    imull $177, %eax, %edx
> > +; X64-NEXT:    shrl $14, %edx
> >  ; X64-NEXT:    movzbl %cl, %ecx
> > -; X64-NEXT:    vpxor %xmm0, %xmm0, %xmm0
> > -; X64-NEXT:    vpinsrd $1, %ecx, %xmm0, %xmm1
> > -; X64-NEXT:    vbroadcastss {{.*#+}} xmm2 =
> [3.57331108E-43,3.57331108E-43,3.57331108E-43,3.57331108E-43]
> > -; X64-NEXT:    vpand %xmm2, %xmm1, %xmm1
> > -; X64-NEXT:    movzbl %al, %eax
> > -; X64-NEXT:    vpinsrd $1, %eax, %xmm0, %xmm0
> > -; X64-NEXT:    vpand %xmm2, %xmm0, %xmm0
> > -; X64-NEXT:    vpextrd $1, %xmm1, %r8d
> > -; X64-NEXT:    vpextrd $1, %xmm0, %r9d
> >  ; X64-NEXT:    xorl %esi, %esi
> > -; X64-NEXT:    vmovd %xmm1, %r10d
> > -; X64-NEXT:    vmovd %xmm0, %r11d
> > -; X64-NEXT:    vpextrd $2, %xmm1, %edi
> > -; X64-NEXT:    vpextrd $2, %xmm0, %ebx
> > -; X64-NEXT:    vpextrd $3, %xmm1, %ecx
> > -; X64-NEXT:    vpextrd $3, %xmm0, %ebp
> >  ; X64-NEXT:    .p2align 4, 0x90
> >  ; X64-NEXT:  .LBB0_1: # %CF
> >  ; X64-NEXT:    # =>This Loop Header: Depth=1
> >  ; X64-NEXT:    # Child Loop BB0_2 Depth 2
> > -; X64-NEXT:    movl %r8d, %eax
> > -; X64-NEXT:    xorl %edx, %edx
> > -; X64-NEXT:    divl %r9d
> > -; X64-NEXT:    movl %r10d, %eax
> > -; X64-NEXT:    xorl %edx, %edx
> > -; X64-NEXT:    divl %r11d
> > -; X64-NEXT:    movl %edi, %eax
> > -; X64-NEXT:    xorl %edx, %edx
> > -; X64-NEXT:    divl %ebx
> >  ; X64-NEXT:    movl %ecx, %eax
> > -; X64-NEXT:    xorl %edx, %edx
> > -; X64-NEXT:    divl %ebp
> > +; X64-NEXT:    divb %dl
> >  ; X64-NEXT:    .p2align 4, 0x90
> >  ; X64-NEXT:  .LBB0_2: # %CF237
> >  ; X64-NEXT:    # Parent Loop BB0_1 Depth=1
> >
> > Modified: llvm/trunk/test/CodeGen/X86/load-partial.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/load-partial.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/load-partial.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/load-partial.ll Wed Aug  7 09:24:26 2019
> > @@ -145,18 +145,8 @@ define i32 @load_partial_illegal_type()
> >  ; SSE2:       # %bb.0:
> >  ; SSE2-NEXT:    movzwl {{.*}}(%rip), %eax
> >  ; SSE2-NEXT:    movd %eax, %xmm0
> > -; SSE2-NEXT:    movdqa %xmm0, %xmm1
> > -; SSE2-NEXT:    punpcklbw {{.*#+}} xmm1 =
> xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3],xmm1[4],xmm0[4],xmm1[5],xmm0[5],xmm1[6],xmm0[6],xmm1[7],xmm0[7]
> > -; SSE2-NEXT:    punpcklwd {{.*#+}} xmm1 =
> xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3]
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[1,1,0,3]
> > -; SSE2-NEXT:    punpckldq {{.*#+}} xmm0 =
> xmm0[0],xmm1[0],xmm0[1],xmm1[1]
> > -; SSE2-NEXT:    movl $2, %eax
> > -; SSE2-NEXT:    movd %eax, %xmm1
> > -; SSE2-NEXT:    shufps {{.*#+}} xmm1 = xmm1[0,0],xmm0[3,0]
> > -; SSE2-NEXT:    shufps {{.*#+}} xmm0 = xmm0[0,1],xmm1[0,2]
> > -; SSE2-NEXT:    andps {{.*}}(%rip), %xmm0
> > -; SSE2-NEXT:    packuswb %xmm0, %xmm0
> > -; SSE2-NEXT:    packuswb %xmm0, %xmm0
> > +; SSE2-NEXT:    pand {{.*}}(%rip), %xmm0
> > +; SSE2-NEXT:    por {{.*}}(%rip), %xmm0
> >  ; SSE2-NEXT:    movd %xmm0, %eax
> >  ; SSE2-NEXT:    retq
> >  ;
> > @@ -164,7 +154,8 @@ define i32 @load_partial_illegal_type()
> >  ; SSSE3:       # %bb.0:
> >  ; SSSE3-NEXT:    movzwl {{.*}}(%rip), %eax
> >  ; SSSE3-NEXT:    movd %eax, %xmm0
> > -; SSSE3-NEXT:    punpcklwd {{.*#+}} xmm0 =
> xmm0[0],mem[0],xmm0[1],mem[1],xmm0[2],mem[2],xmm0[3],mem[3]
> > +; SSSE3-NEXT:    pshufb {{.*#+}} xmm0 =
> xmm0[0,1],zero,xmm0[3,4,5,6,7,8,9,10,11,12,13,14,15]
> > +; SSSE3-NEXT:    por {{.*}}(%rip), %xmm0
> >  ; SSSE3-NEXT:    movd %xmm0, %eax
> >  ; SSSE3-NEXT:    retq
> >  ;
> > @@ -172,10 +163,8 @@ define i32 @load_partial_illegal_type()
> >  ; SSE41:       # %bb.0:
> >  ; SSE41-NEXT:    movzwl {{.*}}(%rip), %eax
> >  ; SSE41-NEXT:    movd %eax, %xmm0
> > -; SSE41-NEXT:    pshufb {{.*#+}} xmm0 =
> xmm0[0,1,2,3,1],zero,zero,zero,xmm0[u,u,u,u,u,u,u,u]
> >  ; SSE41-NEXT:    movl $2, %eax
> > -; SSE41-NEXT:    pinsrd $2, %eax, %xmm0
> > -; SSE41-NEXT:    pshufb {{.*#+}} xmm0 =
> xmm0[0,4,8,u,u,u,u,u,u,u,u,u,u,u,u,u]
> > +; SSE41-NEXT:    pinsrb $2, %eax, %xmm0
> >  ; SSE41-NEXT:    movd %xmm0, %eax
> >  ; SSE41-NEXT:    retq
> >  ;
> > @@ -183,10 +172,8 @@ define i32 @load_partial_illegal_type()
> >  ; AVX:       # %bb.0:
> >  ; AVX-NEXT:    movzwl {{.*}}(%rip), %eax
> >  ; AVX-NEXT:    vmovd %eax, %xmm0
> > -; AVX-NEXT:    vpshufb {{.*#+}} xmm0 =
> xmm0[0,1,2,3,1],zero,zero,zero,xmm0[u,u,u,u,u,u,u,u]
> >  ; AVX-NEXT:    movl $2, %eax
> > -; AVX-NEXT:    vpinsrd $2, %eax, %xmm0, %xmm0
> > -; AVX-NEXT:    vpshufb {{.*#+}} xmm0 =
> xmm0[0,4,8,u,u,u,u,u,u,u,u,u,u,u,u,u]
> > +; AVX-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0
> >  ; AVX-NEXT:    vmovd %xmm0, %eax
> >  ; AVX-NEXT:    retq
> >    %1 = load <2 x i8>, <2 x i8>* bitcast (i8* @h to <2 x i8>*), align 1
> >
> > Modified: llvm/trunk/test/CodeGen/X86/lower-bitcast.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/lower-bitcast.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/lower-bitcast.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/lower-bitcast.ll Wed Aug  7 09:24:26 2019
> > @@ -9,9 +9,7 @@
> >  define double @test1(double %A) {
> >  ; CHECK-LABEL: test1:
> >  ; CHECK:       # %bb.0:
> > -; CHECK-NEXT:    shufps {{.*#+}} xmm0 = xmm0[0,1,1,3]
> >  ; CHECK-NEXT:    paddd {{.*}}(%rip), %xmm0
> > -; CHECK-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> >  ; CHECK-NEXT:    retq
> >  ;
> >  ; CHECK-WIDE-LABEL: test1:
> > @@ -68,9 +66,7 @@ define i64 @test4(i64 %A) {
> >  ; CHECK-LABEL: test4:
> >  ; CHECK:       # %bb.0:
> >  ; CHECK-NEXT:    movq %rdi, %xmm0
> > -; CHECK-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
> >  ; CHECK-NEXT:    paddd {{.*}}(%rip), %xmm0
> > -; CHECK-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> >  ; CHECK-NEXT:    movq %xmm0, %rax
> >  ; CHECK-NEXT:    retq
> >  ;
> > @@ -108,9 +104,7 @@ define double @test5(double %A) {
> >  define double @test6(double %A) {
> >  ; CHECK-LABEL: test6:
> >  ; CHECK:       # %bb.0:
> > -; CHECK-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3]
> >  ; CHECK-NEXT:    paddw {{.*}}(%rip), %xmm0
> > -; CHECK-NEXT:    pshufb {{.*#+}} xmm0 =
> xmm0[0,1,4,5,8,9,12,13,8,9,12,13,12,13,14,15]
> >  ; CHECK-NEXT:    retq
> >  ;
> >  ; CHECK-WIDE-LABEL: test6:
> > @@ -147,9 +141,7 @@ define double @test7(double %A, double %
> >  define double @test8(double %A) {
> >  ; CHECK-LABEL: test8:
> >  ; CHECK:       # %bb.0:
> > -; CHECK-NEXT:    punpcklbw {{.*#+}} xmm0 =
> xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
> >  ; CHECK-NEXT:    paddb {{.*}}(%rip), %xmm0
> > -; CHECK-NEXT:    pshufb {{.*#+}} xmm0 =
> xmm0[0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u]
> >  ; CHECK-NEXT:    retq
> >  ;
> >  ; CHECK-WIDE-LABEL: test8:
> >
> > Modified: llvm/trunk/test/CodeGen/X86/madd.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/madd.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/madd.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/madd.ll Wed Aug  7 09:24:26 2019
> > @@ -1876,26 +1876,12 @@ define <4 x i32> @larger_mul(<16 x i16>
> >  ;
> >  ; AVX1-LABEL: larger_mul:
> >  ; AVX1:       # %bb.0:
> > -; AVX1-NEXT:    vpmovsxwd %xmm0, %xmm2
> > -; AVX1-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[2,3,0,1]
> > -; AVX1-NEXT:    vpmovsxwd %xmm0, %xmm0
> > -; AVX1-NEXT:    vpackssdw %xmm0, %xmm2, %xmm0
> > -; AVX1-NEXT:    vpmovsxwd %xmm1, %xmm2
> > -; AVX1-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[2,3,0,1]
> > -; AVX1-NEXT:    vpmovsxwd %xmm1, %xmm1
> > -; AVX1-NEXT:    vpackssdw %xmm1, %xmm2, %xmm1
> >  ; AVX1-NEXT:    vpmaddwd %xmm1, %xmm0, %xmm0
> >  ; AVX1-NEXT:    vzeroupper
> >  ; AVX1-NEXT:    retq
> >  ;
> >  ; AVX2-LABEL: larger_mul:
> >  ; AVX2:       # %bb.0:
> > -; AVX2-NEXT:    vpmovsxwd %xmm0, %ymm0
> > -; AVX2-NEXT:    vpmovsxwd %xmm1, %ymm1
> > -; AVX2-NEXT:    vextracti128 $1, %ymm1, %xmm2
> > -; AVX2-NEXT:    vpackssdw %xmm2, %xmm1, %xmm1
> > -; AVX2-NEXT:    vextracti128 $1, %ymm0, %xmm2
> > -; AVX2-NEXT:    vpackssdw %xmm2, %xmm0, %xmm0
> >  ; AVX2-NEXT:    vpmaddwd %xmm1, %xmm0, %xmm0
> >  ; AVX2-NEXT:    vzeroupper
> >  ; AVX2-NEXT:    retq
> > @@ -2597,29 +2583,29 @@ define <4 x i32> @pmaddwd_bad_indices(<8
> >  ; SSE2:       # %bb.0:
> >  ; SSE2-NEXT:    movdqa (%rdi), %xmm0
> >  ; SSE2-NEXT:    movdqa (%rsi), %xmm1
> > -; SSE2-NEXT:    pshuflw {{.*#+}} xmm2 = xmm1[0,2,2,3,4,5,6,7]
> > -; SSE2-NEXT:    pshufhw {{.*#+}} xmm2 = xmm2[0,1,2,3,4,6,6,7]
> > +; SSE2-NEXT:    pshuflw {{.*#+}} xmm2 = xmm0[2,1,2,3,4,5,6,7]
> > +; SSE2-NEXT:    pshufhw {{.*#+}} xmm2 = xmm2[0,1,2,3,6,5,6,7]
> >  ; SSE2-NEXT:    pshufd {{.*#+}} xmm2 = xmm2[0,2,2,3]
> > -; SSE2-NEXT:    pshuflw {{.*#+}} xmm3 = xmm0[2,1,2,3,4,5,6,7]
> > -; SSE2-NEXT:    pshufhw {{.*#+}} xmm3 = xmm3[0,1,2,3,6,5,6,7]
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm3[0,2,2,3]
> > -; SSE2-NEXT:    pshuflw {{.*#+}} xmm3 = xmm3[1,0,3,2,4,5,6,7]
> > -; SSE2-NEXT:    movdqa %xmm3, %xmm4
> > -; SSE2-NEXT:    pmulhw %xmm2, %xmm4
> > -; SSE2-NEXT:    pmullw %xmm2, %xmm3
> > -; SSE2-NEXT:    punpcklwd {{.*#+}} xmm3 =
> xmm3[0],xmm4[0],xmm3[1],xmm4[1],xmm3[2],xmm4[2],xmm3[3],xmm4[3]
> > +; SSE2-NEXT:    pshuflw {{.*#+}} xmm2 = xmm2[1,0,3,2,4,5,6,7]
> >  ; SSE2-NEXT:    pshuflw {{.*#+}} xmm0 = xmm0[0,3,2,3,4,5,6,7]
> >  ; SSE2-NEXT:    pshufhw {{.*#+}} xmm0 = xmm0[0,1,2,3,4,7,6,7]
> >  ; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > +; SSE2-NEXT:    pshuflw {{.*#+}} xmm3 = xmm1[0,2,2,3,4,5,6,7]
> > +; SSE2-NEXT:    pshufhw {{.*#+}} xmm3 = xmm3[0,1,2,3,4,6,6,7]
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm3[0,2,2,3]
> >  ; SSE2-NEXT:    pshuflw {{.*#+}} xmm1 = xmm1[3,1,2,3,4,5,6,7]
> >  ; SSE2-NEXT:    pshufhw {{.*#+}} xmm1 = xmm1[0,1,2,3,7,5,6,7]
> >  ; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
> >  ; SSE2-NEXT:    pshuflw {{.*#+}} xmm1 = xmm1[1,0,3,2,4,5,6,7]
> > -; SSE2-NEXT:    movdqa %xmm0, %xmm2
> > -; SSE2-NEXT:    pmulhw %xmm1, %xmm2
> > +; SSE2-NEXT:    movdqa %xmm2, %xmm4
> > +; SSE2-NEXT:    pmulhw %xmm3, %xmm4
> > +; SSE2-NEXT:    pmullw %xmm3, %xmm2
> > +; SSE2-NEXT:    punpcklwd {{.*#+}} xmm2 =
> xmm2[0],xmm4[0],xmm2[1],xmm4[1],xmm2[2],xmm4[2],xmm2[3],xmm4[3]
> > +; SSE2-NEXT:    movdqa %xmm0, %xmm3
> > +; SSE2-NEXT:    pmulhw %xmm1, %xmm3
> >  ; SSE2-NEXT:    pmullw %xmm1, %xmm0
> > -; SSE2-NEXT:    punpcklwd {{.*#+}} xmm0 =
> xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3]
> > -; SSE2-NEXT:    paddd %xmm3, %xmm0
> > +; SSE2-NEXT:    punpcklwd {{.*#+}} xmm0 =
> xmm0[0],xmm3[0],xmm0[1],xmm3[1],xmm0[2],xmm3[2],xmm0[3],xmm3[3]
> > +; SSE2-NEXT:    paddd %xmm2, %xmm0
> >  ; SSE2-NEXT:    retq
> >  ;
> >  ; AVX-LABEL: pmaddwd_bad_indices:
> > @@ -2627,13 +2613,13 @@ define <4 x i32> @pmaddwd_bad_indices(<8
> >  ; AVX-NEXT:    vmovdqa (%rdi), %xmm0
> >  ; AVX-NEXT:    vmovdqa (%rsi), %xmm1
> >  ; AVX-NEXT:    vpshufb {{.*#+}} xmm2 =
> xmm0[2,3,4,5,10,11,12,13,12,13,10,11,12,13,14,15]
> > -; AVX-NEXT:    vpmovsxwd %xmm2, %xmm2
> > +; AVX-NEXT:    vpshufb {{.*#+}} xmm0 =
> xmm0[0,1,6,7,8,9,14,15,8,9,14,15,12,13,14,15]
> >  ; AVX-NEXT:    vpshufb {{.*#+}} xmm3 =
> xmm1[0,1,4,5,8,9,12,13,8,9,12,13,12,13,14,15]
> > +; AVX-NEXT:    vpshufb {{.*#+}} xmm1 =
> xmm1[2,3,6,7,10,11,14,15,14,15,10,11,12,13,14,15]
> > +; AVX-NEXT:    vpmovsxwd %xmm2, %xmm2
> >  ; AVX-NEXT:    vpmovsxwd %xmm3, %xmm3
> >  ; AVX-NEXT:    vpmulld %xmm3, %xmm2, %xmm2
> > -; AVX-NEXT:    vpshufb {{.*#+}} xmm0 =
> xmm0[0,1,6,7,8,9,14,15,8,9,14,15,12,13,14,15]
> >  ; AVX-NEXT:    vpmovsxwd %xmm0, %xmm0
> > -; AVX-NEXT:    vpshufb {{.*#+}} xmm1 =
> xmm1[2,3,6,7,10,11,14,15,14,15,10,11,12,13,14,15]
> >  ; AVX-NEXT:    vpmovsxwd %xmm1, %xmm1
> >  ; AVX-NEXT:    vpmulld %xmm1, %xmm0, %xmm0
> >  ; AVX-NEXT:    vpaddd %xmm0, %xmm2, %xmm0
> >
> > Modified: llvm/trunk/test/CodeGen/X86/masked_compressstore.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/masked_compressstore.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/masked_compressstore.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/masked_compressstore.ll Wed Aug  7
> 09:24:26 2019
> > @@ -603,11 +603,9 @@ define void @compressstore_v16f64_v16i1(
> >  define void @compressstore_v2f32_v2i32(float* %base, <2 x float> %V, <2
> x i32> %trigger) {
> >  ; SSE2-LABEL: compressstore_v2f32_v2i32:
> >  ; SSE2:       ## %bb.0:
> > -; SSE2-NEXT:    pand {{.*}}(%rip), %xmm1
> >  ; SSE2-NEXT:    pxor %xmm2, %xmm2
> >  ; SSE2-NEXT:    pcmpeqd %xmm1, %xmm2
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm2[1,0,3,2]
> > -; SSE2-NEXT:    pand %xmm2, %xmm1
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm2[0,0,1,1]
> >  ; SSE2-NEXT:    movmskpd %xmm1, %eax
> >  ; SSE2-NEXT:    testb $1, %al
> >  ; SSE2-NEXT:    jne LBB2_1
> > @@ -629,8 +627,8 @@ define void @compressstore_v2f32_v2i32(f
> >  ; SSE42-LABEL: compressstore_v2f32_v2i32:
> >  ; SSE42:       ## %bb.0:
> >  ; SSE42-NEXT:    pxor %xmm2, %xmm2
> > -; SSE42-NEXT:    pblendw {{.*#+}} xmm1 =
> xmm1[0,1],xmm2[2,3],xmm1[4,5],xmm2[6,7]
> > -; SSE42-NEXT:    pcmpeqq %xmm2, %xmm1
> > +; SSE42-NEXT:    pcmpeqd %xmm1, %xmm2
> > +; SSE42-NEXT:    pmovsxdq %xmm2, %xmm1
> >  ; SSE42-NEXT:    movmskpd %xmm1, %eax
> >  ; SSE42-NEXT:    testb $1, %al
> >  ; SSE42-NEXT:    jne LBB2_1
> > @@ -648,69 +646,54 @@ define void @compressstore_v2f32_v2i32(f
> >  ; SSE42-NEXT:    extractps $1, %xmm0, (%rdi)
> >  ; SSE42-NEXT:    retq
> >  ;
> > -; AVX1-LABEL: compressstore_v2f32_v2i32:
> > -; AVX1:       ## %bb.0:
> > -; AVX1-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > -; AVX1-NEXT:    vpblendw {{.*#+}} xmm1 =
> xmm1[0,1],xmm2[2,3],xmm1[4,5],xmm2[6,7]
> > -; AVX1-NEXT:    vpcmpeqq %xmm2, %xmm1, %xmm1
> > -; AVX1-NEXT:    vmovmskpd %xmm1, %eax
> > -; AVX1-NEXT:    testb $1, %al
> > -; AVX1-NEXT:    jne LBB2_1
> > -; AVX1-NEXT:  ## %bb.2: ## %else
> > -; AVX1-NEXT:    testb $2, %al
> > -; AVX1-NEXT:    jne LBB2_3
> > -; AVX1-NEXT:  LBB2_4: ## %else2
> > -; AVX1-NEXT:    retq
> > -; AVX1-NEXT:  LBB2_1: ## %cond.store
> > -; AVX1-NEXT:    vmovss %xmm0, (%rdi)
> > -; AVX1-NEXT:    addq $4, %rdi
> > -; AVX1-NEXT:    testb $2, %al
> > -; AVX1-NEXT:    je LBB2_4
> > -; AVX1-NEXT:  LBB2_3: ## %cond.store1
> > -; AVX1-NEXT:    vextractps $1, %xmm0, (%rdi)
> > -; AVX1-NEXT:    retq
> > -;
> > -; AVX2-LABEL: compressstore_v2f32_v2i32:
> > -; AVX2:       ## %bb.0:
> > -; AVX2-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > -; AVX2-NEXT:    vpblendd {{.*#+}} xmm1 = xmm1[0],xmm2[1],xmm1[2],xmm2[3]
> > -; AVX2-NEXT:    vpcmpeqq %xmm2, %xmm1, %xmm1
> > -; AVX2-NEXT:    vmovmskpd %xmm1, %eax
> > -; AVX2-NEXT:    testb $1, %al
> > -; AVX2-NEXT:    jne LBB2_1
> > -; AVX2-NEXT:  ## %bb.2: ## %else
> > -; AVX2-NEXT:    testb $2, %al
> > -; AVX2-NEXT:    jne LBB2_3
> > -; AVX2-NEXT:  LBB2_4: ## %else2
> > -; AVX2-NEXT:    retq
> > -; AVX2-NEXT:  LBB2_1: ## %cond.store
> > -; AVX2-NEXT:    vmovss %xmm0, (%rdi)
> > -; AVX2-NEXT:    addq $4, %rdi
> > -; AVX2-NEXT:    testb $2, %al
> > -; AVX2-NEXT:    je LBB2_4
> > -; AVX2-NEXT:  LBB2_3: ## %cond.store1
> > -; AVX2-NEXT:    vextractps $1, %xmm0, (%rdi)
> > -; AVX2-NEXT:    retq
> > +; AVX1OR2-LABEL: compressstore_v2f32_v2i32:
> > +; AVX1OR2:       ## %bb.0:
> > +; AVX1OR2-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > +; AVX1OR2-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
> > +; AVX1OR2-NEXT:    vpmovsxdq %xmm1, %xmm1
> > +; AVX1OR2-NEXT:    vmovmskpd %xmm1, %eax
> > +; AVX1OR2-NEXT:    testb $1, %al
> > +; AVX1OR2-NEXT:    jne LBB2_1
> > +; AVX1OR2-NEXT:  ## %bb.2: ## %else
> > +; AVX1OR2-NEXT:    testb $2, %al
> > +; AVX1OR2-NEXT:    jne LBB2_3
> > +; AVX1OR2-NEXT:  LBB2_4: ## %else2
> > +; AVX1OR2-NEXT:    retq
> > +; AVX1OR2-NEXT:  LBB2_1: ## %cond.store
> > +; AVX1OR2-NEXT:    vmovss %xmm0, (%rdi)
> > +; AVX1OR2-NEXT:    addq $4, %rdi
> > +; AVX1OR2-NEXT:    testb $2, %al
> > +; AVX1OR2-NEXT:    je LBB2_4
> > +; AVX1OR2-NEXT:  LBB2_3: ## %cond.store1
> > +; AVX1OR2-NEXT:    vextractps $1, %xmm0, (%rdi)
> > +; AVX1OR2-NEXT:    retq
> >  ;
> >  ; AVX512F-LABEL: compressstore_v2f32_v2i32:
> >  ; AVX512F:       ## %bb.0:
> > +; AVX512F-NEXT:    ## kill: def $xmm1 killed $xmm1 def $zmm1
> >  ; AVX512F-NEXT:    ## kill: def $xmm0 killed $xmm0 def $zmm0
> > -; AVX512F-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > -; AVX512F-NEXT:    vpblendd {{.*#+}} xmm1 =
> xmm1[0],xmm2[1],xmm1[2],xmm2[3]
> > -; AVX512F-NEXT:    vptestnmq %zmm1, %zmm1, %k0
> > +; AVX512F-NEXT:    vptestnmd %zmm1, %zmm1, %k0
> >  ; AVX512F-NEXT:    kshiftlw $14, %k0, %k0
> >  ; AVX512F-NEXT:    kshiftrw $14, %k0, %k1
> >  ; AVX512F-NEXT:    vcompressps %zmm0, (%rdi) {%k1}
> >  ; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> >  ;
> > -; AVX512VL-LABEL: compressstore_v2f32_v2i32:
> > -; AVX512VL:       ## %bb.0:
> > -; AVX512VL-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > -; AVX512VL-NEXT:    vpblendd {{.*#+}} xmm1 =
> xmm1[0],xmm2[1],xmm1[2],xmm2[3]
> > -; AVX512VL-NEXT:    vptestnmq %xmm1, %xmm1, %k1
> > -; AVX512VL-NEXT:    vcompressps %xmm0, (%rdi) {%k1}
> > -; AVX512VL-NEXT:    retq
> > +; AVX512VLDQ-LABEL: compressstore_v2f32_v2i32:
> > +; AVX512VLDQ:       ## %bb.0:
> > +; AVX512VLDQ-NEXT:    vptestnmd %xmm1, %xmm1, %k0
> > +; AVX512VLDQ-NEXT:    kshiftlb $6, %k0, %k0
> > +; AVX512VLDQ-NEXT:    kshiftrb $6, %k0, %k1
> > +; AVX512VLDQ-NEXT:    vcompressps %xmm0, (%rdi) {%k1}
> > +; AVX512VLDQ-NEXT:    retq
> > +;
> > +; AVX512VLBW-LABEL: compressstore_v2f32_v2i32:
> > +; AVX512VLBW:       ## %bb.0:
> > +; AVX512VLBW-NEXT:    vptestnmd %xmm1, %xmm1, %k0
> > +; AVX512VLBW-NEXT:    kshiftlw $14, %k0, %k0
> > +; AVX512VLBW-NEXT:    kshiftrw $14, %k0, %k1
> > +; AVX512VLBW-NEXT:    vcompressps %xmm0, (%rdi) {%k1}
> > +; AVX512VLBW-NEXT:    retq
> >    %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> >    call void @llvm.masked.compressstore.v2f32(<2 x float> %V, float*
> %base, <2 x i1> %mask)
> >    ret void
> >
> > Modified: llvm/trunk/test/CodeGen/X86/masked_expandload.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/masked_expandload.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/masked_expandload.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/masked_expandload.ll Wed Aug  7 09:24:26
> 2019
> > @@ -1117,11 +1117,9 @@ define <16 x double> @expandload_v16f64_
> >  define <2 x float> @expandload_v2f32_v2i1(float* %base, <2 x float>
> %src0, <2 x i32> %trigger) {
> >  ; SSE2-LABEL: expandload_v2f32_v2i1:
> >  ; SSE2:       ## %bb.0:
> > -; SSE2-NEXT:    pand {{.*}}(%rip), %xmm1
> >  ; SSE2-NEXT:    pxor %xmm2, %xmm2
> >  ; SSE2-NEXT:    pcmpeqd %xmm1, %xmm2
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm2[1,0,3,2]
> > -; SSE2-NEXT:    pand %xmm2, %xmm1
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm2[0,0,1,1]
> >  ; SSE2-NEXT:    movmskpd %xmm1, %eax
> >  ; SSE2-NEXT:    testb $1, %al
> >  ; SSE2-NEXT:    jne LBB4_1
> > @@ -1146,8 +1144,8 @@ define <2 x float> @expandload_v2f32_v2i
> >  ; SSE42-LABEL: expandload_v2f32_v2i1:
> >  ; SSE42:       ## %bb.0:
> >  ; SSE42-NEXT:    pxor %xmm2, %xmm2
> > -; SSE42-NEXT:    pblendw {{.*#+}} xmm1 =
> xmm1[0,1],xmm2[2,3],xmm1[4,5],xmm2[6,7]
> > -; SSE42-NEXT:    pcmpeqq %xmm2, %xmm1
> > +; SSE42-NEXT:    pcmpeqd %xmm1, %xmm2
> > +; SSE42-NEXT:    pmovsxdq %xmm2, %xmm1
> >  ; SSE42-NEXT:    movmskpd %xmm1, %eax
> >  ; SSE42-NEXT:    testb $1, %al
> >  ; SSE42-NEXT:    jne LBB4_1
> > @@ -1166,58 +1164,34 @@ define <2 x float> @expandload_v2f32_v2i
> >  ; SSE42-NEXT:    insertps {{.*#+}} xmm0 = xmm0[0],mem[0],xmm0[2,3]
> >  ; SSE42-NEXT:    retq
> >  ;
> > -; AVX1-LABEL: expandload_v2f32_v2i1:
> > -; AVX1:       ## %bb.0:
> > -; AVX1-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > -; AVX1-NEXT:    vpblendw {{.*#+}} xmm1 =
> xmm1[0,1],xmm2[2,3],xmm1[4,5],xmm2[6,7]
> > -; AVX1-NEXT:    vpcmpeqq %xmm2, %xmm1, %xmm1
> > -; AVX1-NEXT:    vmovmskpd %xmm1, %eax
> > -; AVX1-NEXT:    testb $1, %al
> > -; AVX1-NEXT:    jne LBB4_1
> > -; AVX1-NEXT:  ## %bb.2: ## %else
> > -; AVX1-NEXT:    testb $2, %al
> > -; AVX1-NEXT:    jne LBB4_3
> > -; AVX1-NEXT:  LBB4_4: ## %else2
> > -; AVX1-NEXT:    retq
> > -; AVX1-NEXT:  LBB4_1: ## %cond.load
> > -; AVX1-NEXT:    vmovss {{.*#+}} xmm1 = mem[0],zero,zero,zero
> > -; AVX1-NEXT:    vblendps {{.*#+}} xmm0 = xmm1[0],xmm0[1,2,3]
> > -; AVX1-NEXT:    addq $4, %rdi
> > -; AVX1-NEXT:    testb $2, %al
> > -; AVX1-NEXT:    je LBB4_4
> > -; AVX1-NEXT:  LBB4_3: ## %cond.load1
> > -; AVX1-NEXT:    vinsertps {{.*#+}} xmm0 = xmm0[0],mem[0],xmm0[2,3]
> > -; AVX1-NEXT:    retq
> > -;
> > -; AVX2-LABEL: expandload_v2f32_v2i1:
> > -; AVX2:       ## %bb.0:
> > -; AVX2-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > -; AVX2-NEXT:    vpblendd {{.*#+}} xmm1 = xmm1[0],xmm2[1],xmm1[2],xmm2[3]
> > -; AVX2-NEXT:    vpcmpeqq %xmm2, %xmm1, %xmm1
> > -; AVX2-NEXT:    vmovmskpd %xmm1, %eax
> > -; AVX2-NEXT:    testb $1, %al
> > -; AVX2-NEXT:    jne LBB4_1
> > -; AVX2-NEXT:  ## %bb.2: ## %else
> > -; AVX2-NEXT:    testb $2, %al
> > -; AVX2-NEXT:    jne LBB4_3
> > -; AVX2-NEXT:  LBB4_4: ## %else2
> > -; AVX2-NEXT:    retq
> > -; AVX2-NEXT:  LBB4_1: ## %cond.load
> > -; AVX2-NEXT:    vmovss {{.*#+}} xmm1 = mem[0],zero,zero,zero
> > -; AVX2-NEXT:    vblendps {{.*#+}} xmm0 = xmm1[0],xmm0[1,2,3]
> > -; AVX2-NEXT:    addq $4, %rdi
> > -; AVX2-NEXT:    testb $2, %al
> > -; AVX2-NEXT:    je LBB4_4
> > -; AVX2-NEXT:  LBB4_3: ## %cond.load1
> > -; AVX2-NEXT:    vinsertps {{.*#+}} xmm0 = xmm0[0],mem[0],xmm0[2,3]
> > -; AVX2-NEXT:    retq
> > +; AVX1OR2-LABEL: expandload_v2f32_v2i1:
> > +; AVX1OR2:       ## %bb.0:
> > +; AVX1OR2-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > +; AVX1OR2-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
> > +; AVX1OR2-NEXT:    vpmovsxdq %xmm1, %xmm1
> > +; AVX1OR2-NEXT:    vmovmskpd %xmm1, %eax
> > +; AVX1OR2-NEXT:    testb $1, %al
> > +; AVX1OR2-NEXT:    jne LBB4_1
> > +; AVX1OR2-NEXT:  ## %bb.2: ## %else
> > +; AVX1OR2-NEXT:    testb $2, %al
> > +; AVX1OR2-NEXT:    jne LBB4_3
> > +; AVX1OR2-NEXT:  LBB4_4: ## %else2
> > +; AVX1OR2-NEXT:    retq
> > +; AVX1OR2-NEXT:  LBB4_1: ## %cond.load
> > +; AVX1OR2-NEXT:    vmovss {{.*#+}} xmm1 = mem[0],zero,zero,zero
> > +; AVX1OR2-NEXT:    vblendps {{.*#+}} xmm0 = xmm1[0],xmm0[1,2,3]
> > +; AVX1OR2-NEXT:    addq $4, %rdi
> > +; AVX1OR2-NEXT:    testb $2, %al
> > +; AVX1OR2-NEXT:    je LBB4_4
> > +; AVX1OR2-NEXT:  LBB4_3: ## %cond.load1
> > +; AVX1OR2-NEXT:    vinsertps {{.*#+}} xmm0 = xmm0[0],mem[0],xmm0[2,3]
> > +; AVX1OR2-NEXT:    retq
> >  ;
> >  ; AVX512F-LABEL: expandload_v2f32_v2i1:
> >  ; AVX512F:       ## %bb.0:
> > +; AVX512F-NEXT:    ## kill: def $xmm1 killed $xmm1 def $zmm1
> >  ; AVX512F-NEXT:    ## kill: def $xmm0 killed $xmm0 def $zmm0
> > -; AVX512F-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > -; AVX512F-NEXT:    vpblendd {{.*#+}} xmm1 =
> xmm1[0],xmm2[1],xmm1[2],xmm2[3]
> > -; AVX512F-NEXT:    vptestnmq %zmm1, %zmm1, %k0
> > +; AVX512F-NEXT:    vptestnmd %zmm1, %zmm1, %k0
> >  ; AVX512F-NEXT:    kshiftlw $14, %k0, %k0
> >  ; AVX512F-NEXT:    kshiftrw $14, %k0, %k1
> >  ; AVX512F-NEXT:    vexpandps (%rdi), %zmm0 {%k1}
> > @@ -1225,13 +1199,21 @@ define <2 x float> @expandload_v2f32_v2i
> >  ; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> >  ;
> > -; AVX512VL-LABEL: expandload_v2f32_v2i1:
> > -; AVX512VL:       ## %bb.0:
> > -; AVX512VL-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > -; AVX512VL-NEXT:    vpblendd {{.*#+}} xmm1 =
> xmm1[0],xmm2[1],xmm1[2],xmm2[3]
> > -; AVX512VL-NEXT:    vptestnmq %xmm1, %xmm1, %k1
> > -; AVX512VL-NEXT:    vexpandps (%rdi), %xmm0 {%k1}
> > -; AVX512VL-NEXT:    retq
> > +; AVX512VLDQ-LABEL: expandload_v2f32_v2i1:
> > +; AVX512VLDQ:       ## %bb.0:
> > +; AVX512VLDQ-NEXT:    vptestnmd %xmm1, %xmm1, %k0
> > +; AVX512VLDQ-NEXT:    kshiftlb $6, %k0, %k0
> > +; AVX512VLDQ-NEXT:    kshiftrb $6, %k0, %k1
> > +; AVX512VLDQ-NEXT:    vexpandps (%rdi), %xmm0 {%k1}
> > +; AVX512VLDQ-NEXT:    retq
> > +;
> > +; AVX512VLBW-LABEL: expandload_v2f32_v2i1:
> > +; AVX512VLBW:       ## %bb.0:
> > +; AVX512VLBW-NEXT:    vptestnmd %xmm1, %xmm1, %k0
> > +; AVX512VLBW-NEXT:    kshiftlw $14, %k0, %k0
> > +; AVX512VLBW-NEXT:    kshiftrw $14, %k0, %k1
> > +; AVX512VLBW-NEXT:    vexpandps (%rdi), %xmm0 {%k1}
> > +; AVX512VLBW-NEXT:    retq
> >    %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> >    %res = call <2 x float> @llvm.masked.expandload.v2f32(float* %base,
> <2 x i1> %mask, <2 x float> %src0)
> >    ret <2 x float> %res
> >
> > Modified: llvm/trunk/test/CodeGen/X86/masked_gather_scatter.ll
> > URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/masked_gather_scatter.ll?rev=368183&r1=368182&r2=368183&view=diff
> >
> ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/masked_gather_scatter.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/masked_gather_scatter.ll Wed Aug  7
> 09:24:26 2019
> > @@ -915,13 +915,12 @@ define <2 x double> @test17(double* %bas
> >  ; KNL_64-LABEL: test17:
> >  ; KNL_64:       # %bb.0:
> >  ; KNL_64-NEXT:    # kill: def $xmm2 killed $xmm2 def $zmm2
> > -; KNL_64-NEXT:    vpsllq $32, %xmm0, %xmm0
> > -; KNL_64-NEXT:    vpsraq $32, %zmm0, %zmm0
> > +; KNL_64-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
> >  ; KNL_64-NEXT:    vpsllq $63, %xmm1, %xmm1
> >  ; KNL_64-NEXT:    vptestmq %zmm1, %zmm1, %k0
> >  ; KNL_64-NEXT:    kshiftlw $14, %k0, %k0
> >  ; KNL_64-NEXT:    kshiftrw $14, %k0, %k1
> > -; KNL_64-NEXT:    vgatherqpd (%rdi,%zmm0,8), %zmm2 {%k1}
> > +; KNL_64-NEXT:    vgatherdpd (%rdi,%ymm0,8), %zmm2 {%k1}
> >  ; KNL_64-NEXT:    vmovapd %xmm2, %xmm0
> >  ; KNL_64-NEXT:    vzeroupper
> >  ; KNL_64-NEXT:    retq
> > @@ -929,36 +928,31 @@ define <2 x double> @test17(double* %bas
> >  ; KNL_32-LABEL: test17:
> >  ; KNL_32:       # %bb.0:
> >  ; KNL_32-NEXT:    # kill: def $xmm2 killed $xmm2 def $zmm2
> > -; KNL_32-NEXT:    vpsllq $32, %xmm0, %xmm0
> > -; KNL_32-NEXT:    vpsraq $32, %zmm0, %zmm0
> > +; KNL_32-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
> >  ; KNL_32-NEXT:    vpsllq $63, %xmm1, %xmm1
> >  ; KNL_32-NEXT:    vptestmq %zmm1, %zmm1, %k0
> >  ; KNL_32-NEXT:    kshiftlw $14, %k0, %k0
> >  ; KNL_32-NEXT:    kshiftrw $14, %k0, %k1
> >  ; KNL_32-NEXT:    movl {{[0-9]+}}(%esp), %eax
> > -; KNL_32-NEXT:    vgatherqpd (%eax,%zmm0,8), %zmm2 {%k1}
> > +; KNL_32-NEXT:    vgatherdpd (%eax,%ymm0,8), %zmm2 {%k1}
> >  ; KNL_32-NEXT:    vmovapd %xmm2, %xmm0
> >  ; KNL_32-NEXT:    vzeroupper
> >  ; KNL_32-NEXT:    retl
> >  ;
> >  ; SKX-LABEL: test17:
> >  ; SKX:       # %bb.0:
> > -; SKX-NEXT:    vpsllq $32, %xmm0, %xmm0
> > -; SKX-NEXT:    vpsraq $32, %xmm0, %xmm0
> >  ; SKX-NEXT:    vpsllq $63, %xmm1, %xmm1
> >  ; SKX-NEXT:    vpmovq2m %xmm1, %k1
> > -; SKX-NEXT:    vgatherqpd (%rdi,%xmm0,8), %xmm2 {%k1}
> > +; SKX-NEXT:    vgatherdpd (%rdi,%xmm0,8), %xmm2 {%k1}
> >  ; SKX-NEXT:    vmovapd %xmm2, %xmm0
> >  ; SKX-NEXT:    retq
> >  ;
> >  ; SKX_32-LABEL: test17:
> >  ; SKX_32:       # %bb.0:
> > -; SKX_32-NEXT:    vpsllq $32, %xmm0, %xmm0
> > -; SKX_32-NEXT:    vpsraq $32, %xmm0, %xmm0
> >  ; SKX_32-NEXT:    vpsllq $63, %xmm1, %xmm1
> >  ; SKX_32-NEXT:    vpmovq2m %xmm1, %k1
> >  ; SKX_32-NEXT:    movl {{[0-9]+}}(%esp), %eax
> > -; SKX_32-NEXT:    vgatherqpd (%eax,%xmm0,8), %xmm2 {%k1}
> > +; SKX_32-NEXT:    vgatherdpd (%eax,%xmm0,8), %xmm2 {%k1}
> >  ; SKX_32-NEXT:    vmovapd %xmm2, %xmm0
> >  ; SKX_32-NEXT:    retl
> >
> > @@ -1080,8 +1074,8 @@ define void @test20(<2 x float>%a1, <2 x
> >  ;
> >  ; KNL_32-LABEL: test20:
> >  ; KNL_32:       # %bb.0:
> > +; KNL_32-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> >  ; KNL_32-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
> > -; KNL_32-NEXT:    vpermilps {{.*#+}} xmm1 = xmm1[0,2,2,3]
> >  ; KNL_32-NEXT:    vpsllq $63, %xmm2, %xmm2
> >  ; KNL_32-NEXT:    vptestmq %zmm2, %zmm2, %k0
> >  ; KNL_32-NEXT:    kshiftlw $14, %k0, %k0
> > @@ -1099,7 +1093,6 @@ define void @test20(<2 x float>%a1, <2 x
> >  ;
> >  ; SKX_32-LABEL: test20:
> >  ; SKX_32:       # %bb.0:
> > -; SKX_32-NEXT:    vpermilps {{.*#+}} xmm1 = xmm1[0,2,2,3]
> >  ; SKX_32-NEXT:    vpsllq $63, %xmm2, %xmm2
> >  ; SKX_32-NEXT:    vpmovq2m %xmm2, %k1
> >  ; SKX_32-NEXT:    vscatterdps %xmm0, (,%xmm1) {%k1}
> > @@ -1113,9 +1106,9 @@ define void @test21(<2 x i32>%a1, <2 x i
> >  ; KNL_64-LABEL: test21:
> >  ; KNL_64:       # %bb.0:
> >  ; KNL_64-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> > +; KNL_64-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
> >  ; KNL_64-NEXT:    vpsllq $63, %xmm2, %xmm2
> >  ; KNL_64-NEXT:    vptestmq %zmm2, %zmm2, %k0
> > -; KNL_64-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> >  ; KNL_64-NEXT:    kshiftlw $14, %k0, %k0
> >  ; KNL_64-NEXT:    kshiftrw $14, %k0, %k1
> >  ; KNL_64-NEXT:    vpscatterqd %ymm0, (,%zmm1) {%k1}
> > @@ -1124,10 +1117,10 @@ define void @test21(<2 x i32>%a1, <2 x i
> >  ;
> >  ; KNL_32-LABEL: test21:
> >  ; KNL_32:       # %bb.0:
> > +; KNL_32-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> > +; KNL_32-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
> >  ; KNL_32-NEXT:    vpsllq $63, %xmm2, %xmm2
> >  ; KNL_32-NEXT:    vptestmq %zmm2, %zmm2, %k0
> > -; KNL_32-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > -; KNL_32-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
> >  ; KNL_32-NEXT:    kshiftlw $14, %k0, %k0
> >  ; KNL_32-NEXT:    kshiftrw $14, %k0, %k1
> >  ; KNL_32-NEXT:    vpscatterdd %zmm0, (,%zmm1) {%k1}
> > @@ -1138,7 +1131,6 @@ define void @test21(<2 x i32>%a1, <2 x i
> >  ; SKX:       # %bb.0:
> >  ; SKX-NEXT:    vpsllq $63, %xmm2, %xmm2
> >  ; SKX-NEXT:    vpmovq2m %xmm2, %k1
> > -; SKX-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> >  ; SKX-NEXT:    vpscatterqd %xmm0, (,%xmm1) {%k1}
> >  ; SKX-NEXT:    retq
> >  ;
> > @@ -1146,8 +1138,6 @@ define void @test21(<2 x i32>%a1, <2 x i
> >  ; SKX_32:       # %bb.0:
> >  ; SKX_32-NEXT:    vpsllq $63, %xmm2, %xmm2
> >  ; SKX_32-NEXT:    vpmovq2m %xmm2, %k1
> > -; SKX_32-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > -; SKX_32-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
> >  ; SKX_32-NEXT:    vpscatterdd %xmm0, (,%xmm1) {%k1}
> >  ; SKX_32-NEXT:    retl
> >    call void @llvm.masked.scatter.v2i32.v2p0i32(<2 x i32> %a1, <2 x i32*> %ptr, i32 4, <2 x i1> %mask)
> > @@ -1161,7 +1151,7 @@ define <2 x float> @test22(float* %base,
> >  ; KNL_64-LABEL: test22:
> >  ; KNL_64:       # %bb.0:
> >  ; KNL_64-NEXT:    # kill: def $xmm2 killed $xmm2 def $zmm2
> > -; KNL_64-NEXT:    vpermilps {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > +; KNL_64-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
> >  ; KNL_64-NEXT:    vpsllq $63, %xmm1, %xmm1
> >  ; KNL_64-NEXT:    vptestmq %zmm1, %zmm1, %k0
> >  ; KNL_64-NEXT:    kshiftlw $14, %k0, %k0
> > @@ -1174,7 +1164,7 @@ define <2 x float> @test22(float* %base,
> >  ; KNL_32-LABEL: test22:
> >  ; KNL_32:       # %bb.0:
> >  ; KNL_32-NEXT:    # kill: def $xmm2 killed $xmm2 def $zmm2
> > -; KNL_32-NEXT:    vpermilps {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > +; KNL_32-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
> >  ; KNL_32-NEXT:    vpsllq $63, %xmm1, %xmm1
> >  ; KNL_32-NEXT:    vptestmq %zmm1, %zmm1, %k0
> >  ; KNL_32-NEXT:    kshiftlw $14, %k0, %k0
> > @@ -1187,7 +1177,6 @@ define <2 x float> @test22(float* %base,
> >  ;
> >  ; SKX-LABEL: test22:
> >  ; SKX:       # %bb.0:
> > -; SKX-NEXT:    vpermilps {{.*#+}} xmm0 = xmm0[0,2,2,3]
> >  ; SKX-NEXT:    vpsllq $63, %xmm1, %xmm1
> >  ; SKX-NEXT:    vpmovq2m %xmm1, %k1
> >  ; SKX-NEXT:    vgatherdps (%rdi,%xmm0,4), %xmm2 {%k1}
> > @@ -1196,7 +1185,6 @@ define <2 x float> @test22(float* %base,
> >  ;
> >  ; SKX_32-LABEL: test22:
> >  ; SKX_32:       # %bb.0:
> > -; SKX_32-NEXT:    vpermilps {{.*#+}} xmm0 = xmm0[0,2,2,3]
> >  ; SKX_32-NEXT:    vpsllq $63, %xmm1, %xmm1
> >  ; SKX_32-NEXT:    vpmovq2m %xmm1, %k1
> >  ; SKX_32-NEXT:    movl {{[0-9]+}}(%esp), %eax
> > @@ -1264,28 +1252,28 @@ declare <2 x i64> @llvm.masked.gather.v2
> >  define <2 x i32> @test23(i32* %base, <2 x i32> %ind, <2 x i1> %mask, <2 x i32> %src0) {
> >  ; KNL_64-LABEL: test23:
> >  ; KNL_64:       # %bb.0:
> > +; KNL_64-NEXT:    # kill: def $xmm2 killed $xmm2 def $zmm2
> > +; KNL_64-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
> >  ; KNL_64-NEXT:    vpsllq $63, %xmm1, %xmm1
> >  ; KNL_64-NEXT:    vptestmq %zmm1, %zmm1, %k0
> > -; KNL_64-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > -; KNL_64-NEXT:    vpshufd {{.*#+}} xmm1 = xmm2[0,2,2,3]
> >  ; KNL_64-NEXT:    kshiftlw $14, %k0, %k0
> >  ; KNL_64-NEXT:    kshiftrw $14, %k0, %k1
> > -; KNL_64-NEXT:    vpgatherdd (%rdi,%zmm0,4), %zmm1 {%k1}
> > -; KNL_64-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm1[0],zero,xmm1[1],zero
> > +; KNL_64-NEXT:    vpgatherdd (%rdi,%zmm0,4), %zmm2 {%k1}
> > +; KNL_64-NEXT:    vmovdqa %xmm2, %xmm0
> >  ; KNL_64-NEXT:    vzeroupper
> >  ; KNL_64-NEXT:    retq
> >  ;
> >  ; KNL_32-LABEL: test23:
> >  ; KNL_32:       # %bb.0:
> > +; KNL_32-NEXT:    # kill: def $xmm2 killed $xmm2 def $zmm2
> > +; KNL_32-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
> >  ; KNL_32-NEXT:    vpsllq $63, %xmm1, %xmm1
> >  ; KNL_32-NEXT:    vptestmq %zmm1, %zmm1, %k0
> > -; KNL_32-NEXT:    movl {{[0-9]+}}(%esp), %eax
> > -; KNL_32-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > -; KNL_32-NEXT:    vpshufd {{.*#+}} xmm1 = xmm2[0,2,2,3]
> >  ; KNL_32-NEXT:    kshiftlw $14, %k0, %k0
> >  ; KNL_32-NEXT:    kshiftrw $14, %k0, %k1
> > -; KNL_32-NEXT:    vpgatherdd (%eax,%zmm0,4), %zmm1 {%k1}
> > -; KNL_32-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm1[0],zero,xmm1[1],zero
> > +; KNL_32-NEXT:    movl {{[0-9]+}}(%esp), %eax
> > +; KNL_32-NEXT:    vpgatherdd (%eax,%zmm0,4), %zmm2 {%k1}
> > +; KNL_32-NEXT:    vmovdqa %xmm2, %xmm0
> >  ; KNL_32-NEXT:    vzeroupper
> >  ; KNL_32-NEXT:    retl
> >  ;
> > @@ -1293,10 +1281,8 @@ define <2 x i32> @test23(i32* %base, <2
> >  ; SKX:       # %bb.0:
> >  ; SKX-NEXT:    vpsllq $63, %xmm1, %xmm1
> >  ; SKX-NEXT:    vpmovq2m %xmm1, %k1
> > -; SKX-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > -; SKX-NEXT:    vpshufd {{.*#+}} xmm1 = xmm2[0,2,2,3]
> > -; SKX-NEXT:    vpgatherdd (%rdi,%xmm0,4), %xmm1 {%k1}
> > -; SKX-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm1[0],zero,xmm1[1],zero
> > +; SKX-NEXT:    vpgatherdd (%rdi,%xmm0,4), %xmm2 {%k1}
> > +; SKX-NEXT:    vmovdqa %xmm2, %xmm0
> >  ; SKX-NEXT:    retq
> >  ;
> >  ; SKX_32-LABEL: test23:
> > @@ -1304,10 +1290,8 @@ define <2 x i32> @test23(i32* %base, <2
> >  ; SKX_32-NEXT:    vpsllq $63, %xmm1, %xmm1
> >  ; SKX_32-NEXT:    vpmovq2m %xmm1, %k1
> >  ; SKX_32-NEXT:    movl {{[0-9]+}}(%esp), %eax
> > -; SKX_32-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > -; SKX_32-NEXT:    vpshufd {{.*#+}} xmm1 = xmm2[0,2,2,3]
> > -; SKX_32-NEXT:    vpgatherdd (%eax,%xmm0,4), %xmm1 {%k1}
> > -; SKX_32-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm1[0],zero,xmm1[1],zero
> > +; SKX_32-NEXT:    vpgatherdd (%eax,%xmm0,4), %xmm2 {%k1}
> > +; SKX_32-NEXT:    vmovdqa %xmm2, %xmm0
> >  ; SKX_32-NEXT:    retl
> >    %sext_ind = sext <2 x i32> %ind to <2 x i64>
> >    %gep.random = getelementptr i32, i32* %base, <2 x i64> %sext_ind
> > @@ -1318,28 +1302,28 @@ define <2 x i32> @test23(i32* %base, <2
> >  define <2 x i32> @test23b(i32* %base, <2 x i64> %ind, <2 x i1> %mask, <2 x i32> %src0) {
> >  ; KNL_64-LABEL: test23b:
> >  ; KNL_64:       # %bb.0:
> > +; KNL_64-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
> >  ; KNL_64-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
> >  ; KNL_64-NEXT:    vpsllq $63, %xmm1, %xmm1
> >  ; KNL_64-NEXT:    vptestmq %zmm1, %zmm1, %k0
> > -; KNL_64-NEXT:    vpshufd {{.*#+}} xmm1 = xmm2[0,2,2,3]
> >  ; KNL_64-NEXT:    kshiftlw $14, %k0, %k0
> >  ; KNL_64-NEXT:    kshiftrw $14, %k0, %k1
> > -; KNL_64-NEXT:    vpgatherqd (%rdi,%zmm0,4), %ymm1 {%k1}
> > -; KNL_64-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm1[0],zero,xmm1[1],zero
> > +; KNL_64-NEXT:    vpgatherqd (%rdi,%zmm0,4), %ymm2 {%k1}
> > +; KNL_64-NEXT:    vmovdqa %xmm2, %xmm0
> >  ; KNL_64-NEXT:    vzeroupper
> >  ; KNL_64-NEXT:    retq
> >  ;
> >  ; KNL_32-LABEL: test23b:
> >  ; KNL_32:       # %bb.0:
> > +; KNL_32-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
> >  ; KNL_32-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
> >  ; KNL_32-NEXT:    vpsllq $63, %xmm1, %xmm1
> >  ; KNL_32-NEXT:    vptestmq %zmm1, %zmm1, %k0
> > -; KNL_32-NEXT:    movl {{[0-9]+}}(%esp), %eax
> > -; KNL_32-NEXT:    vpshufd {{.*#+}} xmm1 = xmm2[0,2,2,3]
> >  ; KNL_32-NEXT:    kshiftlw $14, %k0, %k0
> >  ; KNL_32-NEXT:    kshiftrw $14, %k0, %k1
> > -; KNL_32-NEXT:    vpgatherqd (%eax,%zmm0,4), %ymm1 {%k1}
> > -; KNL_32-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm1[0],zero,xmm1[1],zero
> > +; KNL_32-NEXT:    movl {{[0-9]+}}(%esp), %eax
> > +; KNL_32-NEXT:    vpgatherqd (%eax,%zmm0,4), %ymm2 {%k1}
> > +; KNL_32-NEXT:    vmovdqa %xmm2, %xmm0
> >  ; KNL_32-NEXT:    vzeroupper
> >  ; KNL_32-NEXT:    retl
> >  ;
> > @@ -1347,9 +1331,8 @@ define <2 x i32> @test23b(i32* %base, <2
> >  ; SKX:       # %bb.0:
> >  ; SKX-NEXT:    vpsllq $63, %xmm1, %xmm1
> >  ; SKX-NEXT:    vpmovq2m %xmm1, %k1
> > -; SKX-NEXT:    vpshufd {{.*#+}} xmm1 = xmm2[0,2,2,3]
> > -; SKX-NEXT:    vpgatherqd (%rdi,%xmm0,4), %xmm1 {%k1}
> > -; SKX-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm1[0],zero,xmm1[1],zero
> > +; SKX-NEXT:    vpgatherqd (%rdi,%xmm0,4), %xmm2 {%k1}
> > +; SKX-NEXT:    vmovdqa %xmm2, %xmm0
> >  ; SKX-NEXT:    retq
> >  ;
> >  ; SKX_32-LABEL: test23b:
> > @@ -1357,9 +1340,8 @@ define <2 x i32> @test23b(i32* %base, <2
> >  ; SKX_32-NEXT:    vpsllq $63, %xmm1, %xmm1
> >  ; SKX_32-NEXT:    vpmovq2m %xmm1, %k1
> >  ; SKX_32-NEXT:    movl {{[0-9]+}}(%esp), %eax
> > -; SKX_32-NEXT:    vpshufd {{.*#+}} xmm1 = xmm2[0,2,2,3]
> > -; SKX_32-NEXT:    vpgatherqd (%eax,%xmm0,4), %xmm1 {%k1}
> > -; SKX_32-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm1[0],zero,xmm1[1],zero
> > +; SKX_32-NEXT:    vpgatherqd (%eax,%xmm0,4), %xmm2 {%k1}
> > +; SKX_32-NEXT:    vmovdqa %xmm2, %xmm0
> >  ; SKX_32-NEXT:    retl
> >    %gep.random = getelementptr i32, i32* %base, <2 x i64> %ind
> >    %res = call <2 x i32> @llvm.masked.gather.v2i32.v2p0i32(<2 x i32*> %gep.random, i32 4, <2 x i1> %mask, <2 x i32> %src0)
> > @@ -1369,22 +1351,22 @@ define <2 x i32> @test23b(i32* %base, <2
> >  define <2 x i32> @test24(i32* %base, <2 x i32> %ind) {
> >  ; KNL_64-LABEL: test24:
> >  ; KNL_64:       # %bb.0:
> > -; KNL_64-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > +; KNL_64-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
> >  ; KNL_64-NEXT:    movw $3, %ax
> >  ; KNL_64-NEXT:    kmovw %eax, %k1
> >  ; KNL_64-NEXT:    vpgatherdd (%rdi,%zmm0,4), %zmm1 {%k1}
> > -; KNL_64-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm1[0],zero,xmm1[1],zero
> > +; KNL_64-NEXT:    vmovdqa %xmm1, %xmm0
> >  ; KNL_64-NEXT:    vzeroupper
> >  ; KNL_64-NEXT:    retq
> >  ;
> >  ; KNL_32-LABEL: test24:
> >  ; KNL_32:       # %bb.0:
> > +; KNL_32-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
> >  ; KNL_32-NEXT:    movl {{[0-9]+}}(%esp), %eax
> > -; KNL_32-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> >  ; KNL_32-NEXT:    movw $3, %cx
> >  ; KNL_32-NEXT:    kmovw %ecx, %k1
> >  ; KNL_32-NEXT:    vpgatherdd (%eax,%zmm0,4), %zmm1 {%k1}
> > -; KNL_32-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm1[0],zero,xmm1[1],zero
> > +; KNL_32-NEXT:    vmovdqa %xmm1, %xmm0
> >  ; KNL_32-NEXT:    vzeroupper
> >  ; KNL_32-NEXT:    retl
> >  ;
> > @@ -1392,9 +1374,8 @@ define <2 x i32> @test24(i32* %base, <2
> >  ; SKX:       # %bb.0:
> >  ; SKX-NEXT:    movb $3, %al
> >  ; SKX-NEXT:    kmovw %eax, %k1
> > -; SKX-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> >  ; SKX-NEXT:    vpgatherdd (%rdi,%xmm0,4), %xmm1 {%k1}
> > -; SKX-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm1[0],zero,xmm1[1],zero
> > +; SKX-NEXT:    vmovdqa %xmm1, %xmm0
> >  ; SKX-NEXT:    retq
> >  ;
> >  ; SKX_32-LABEL: test24:
> > @@ -1402,9 +1383,8 @@ define <2 x i32> @test24(i32* %base, <2
> >  ; SKX_32-NEXT:    movl {{[0-9]+}}(%esp), %eax
> >  ; SKX_32-NEXT:    movb $3, %cl
> >  ; SKX_32-NEXT:    kmovw %ecx, %k1
> > -; SKX_32-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> >  ; SKX_32-NEXT:    vpgatherdd (%eax,%xmm0,4), %xmm1 {%k1}
> > -; SKX_32-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm1[0],zero,xmm1[1],zero
> > +; SKX_32-NEXT:    vmovdqa %xmm1, %xmm0
> >  ; SKX_32-NEXT:    retl
> >    %sext_ind = sext <2 x i32> %ind to <2 x i64>
> >    %gep.random = getelementptr i32, i32* %base, <2 x i64> %sext_ind
> > @@ -1416,13 +1396,12 @@ define <2 x i64> @test25(i64* %base, <2
> >  ; KNL_64-LABEL: test25:
> >  ; KNL_64:       # %bb.0:
> >  ; KNL_64-NEXT:    # kill: def $xmm2 killed $xmm2 def $zmm2
> > -; KNL_64-NEXT:    vpsllq $32, %xmm0, %xmm0
> > -; KNL_64-NEXT:    vpsraq $32, %zmm0, %zmm0
> > +; KNL_64-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
> >  ; KNL_64-NEXT:    vpsllq $63, %xmm1, %xmm1
> >  ; KNL_64-NEXT:    vptestmq %zmm1, %zmm1, %k0
> >  ; KNL_64-NEXT:    kshiftlw $14, %k0, %k0
> >  ; KNL_64-NEXT:    kshiftrw $14, %k0, %k1
> > -; KNL_64-NEXT:    vpgatherqq (%rdi,%zmm0,8), %zmm2 {%k1}
> > +; KNL_64-NEXT:    vpgatherdq (%rdi,%ymm0,8), %zmm2 {%k1}
> >  ; KNL_64-NEXT:    vmovdqa %xmm2, %xmm0
> >  ; KNL_64-NEXT:    vzeroupper
> >  ; KNL_64-NEXT:    retq
> > @@ -1430,36 +1409,31 @@ define <2 x i64> @test25(i64* %base, <2
> >  ; KNL_32-LABEL: test25:
> >  ; KNL_32:       # %bb.0:
> >  ; KNL_32-NEXT:    # kill: def $xmm2 killed $xmm2 def $zmm2
> > -; KNL_32-NEXT:    vpsllq $32, %xmm0, %xmm0
> > -; KNL_32-NEXT:    vpsraq $32, %zmm0, %zmm0
> > +; KNL_32-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
> >  ; KNL_32-NEXT:    vpsllq $63, %xmm1, %xmm1
> >  ; KNL_32-NEXT:    vptestmq %zmm1, %zmm1, %k0
> >  ; KNL_32-NEXT:    kshiftlw $14, %k0, %k0
> >  ; KNL_32-NEXT:    kshiftrw $14, %k0, %k1
> >  ; KNL_32-NEXT:    movl {{[0-9]+}}(%esp), %eax
> > -; KNL_32-NEXT:    vpgatherqq (%eax,%zmm0,8), %zmm2 {%k1}
> > +; KNL_32-NEXT:    vpgatherdq (%eax,%ymm0,8), %zmm2 {%k1}
> >  ; KNL_32-NEXT:    vmovdqa %xmm2, %xmm0
> >  ; KNL_32-NEXT:    vzeroupper
> >  ; KNL_32-NEXT:    retl
> >  ;
> >  ; SKX-LABEL: test25:
> >  ; SKX:       # %bb.0:
> > -; SKX-NEXT:    vpsllq $32, %xmm0, %xmm0
> > -; SKX-NEXT:    vpsraq $32, %xmm0, %xmm0
> >  ; SKX-NEXT:    vpsllq $63, %xmm1, %xmm1
> >  ; SKX-NEXT:    vpmovq2m %xmm1, %k1
> > -; SKX-NEXT:    vpgatherqq (%rdi,%xmm0,8), %xmm2 {%k1}
> > +; SKX-NEXT:    vpgatherdq (%rdi,%xmm0,8), %xmm2 {%k1}
> >  ; SKX-NEXT:    vmovdqa %xmm2, %xmm0
> >  ; SKX-NEXT:    retq
> >  ;
> >  ; SKX_32-LABEL: test25:
> >  ; SKX_32:       # %bb.0:
> > -; SKX_32-NEXT:    vpsllq $32, %xmm0, %xmm0
> > -; SKX_32-NEXT:    vpsraq $32, %xmm0, %xmm0
> >  ; SKX_32-NEXT:    vpsllq $63, %xmm1, %xmm1
> >  ; SKX_32-NEXT:    vpmovq2m %xmm1, %k1
> >  ; SKX_32-NEXT:    movl {{[0-9]+}}(%esp), %eax
> > -; SKX_32-NEXT:    vpgatherqq (%eax,%xmm0,8), %xmm2 {%k1}
> > +; SKX_32-NEXT:    vpgatherdq (%eax,%xmm0,8), %xmm2 {%k1}
> >  ; SKX_32-NEXT:    vmovdqa %xmm2, %xmm0
> >  ; SKX_32-NEXT:    retl
> >    %sext_ind = sext <2 x i32> %ind to <2 x i64>
> > @@ -1472,11 +1446,10 @@ define <2 x i64> @test26(i64* %base, <2
> >  ; KNL_64-LABEL: test26:
> >  ; KNL_64:       # %bb.0:
> >  ; KNL_64-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> > -; KNL_64-NEXT:    vpsllq $32, %xmm0, %xmm0
> > -; KNL_64-NEXT:    vpsraq $32, %zmm0, %zmm0
> > +; KNL_64-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
> >  ; KNL_64-NEXT:    movb $3, %al
> >  ; KNL_64-NEXT:    kmovw %eax, %k1
> > -; KNL_64-NEXT:    vpgatherqq (%rdi,%zmm0,8), %zmm1 {%k1}
> > +; KNL_64-NEXT:    vpgatherdq (%rdi,%ymm0,8), %zmm1 {%k1}
> >  ; KNL_64-NEXT:    vmovdqa %xmm1, %xmm0
> >  ; KNL_64-NEXT:    vzeroupper
> >  ; KNL_64-NEXT:    retq
> > @@ -1484,32 +1457,27 @@ define <2 x i64> @test26(i64* %base, <2
> >  ; KNL_32-LABEL: test26:
> >  ; KNL_32:       # %bb.0:
> >  ; KNL_32-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> > -; KNL_32-NEXT:    vpsllq $32, %xmm0, %xmm0
> > -; KNL_32-NEXT:    vpsraq $32, %zmm0, %zmm0
> > +; KNL_32-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
> >  ; KNL_32-NEXT:    movl {{[0-9]+}}(%esp), %eax
> >  ; KNL_32-NEXT:    movb $3, %cl
> >  ; KNL_32-NEXT:    kmovw %ecx, %k1
> > -; KNL_32-NEXT:    vpgatherqq (%eax,%zmm0,8), %zmm1 {%k1}
> > +; KNL_32-NEXT:    vpgatherdq (%eax,%ymm0,8), %zmm1 {%k1}
> >  ; KNL_32-NEXT:    vmovdqa %xmm1, %xmm0
> >  ; KNL_32-NEXT:    vzeroupper
> >  ; KNL_32-NEXT:    retl
> >  ;
> >  ; SKX-LABEL: test26:
> >  ; SKX:       # %bb.0:
> > -; SKX-NEXT:    vpsllq $32, %xmm0, %xmm0
> > -; SKX-NEXT:    vpsraq $32, %xmm0, %xmm0
> >  ; SKX-NEXT:    kxnorw %k0, %k0, %k1
> > -; SKX-NEXT:    vpgatherqq (%rdi,%xmm0,8), %xmm1 {%k1}
> > +; SKX-NEXT:    vpgatherdq (%rdi,%xmm0,8), %xmm1 {%k1}
> >  ; SKX-NEXT:    vmovdqa %xmm1, %xmm0
> >  ; SKX-NEXT:    retq
> >  ;
> >  ; SKX_32-LABEL: test26:
> >  ; SKX_32:       # %bb.0:
> > -; SKX_32-NEXT:    vpsllq $32, %xmm0, %xmm0
> > -; SKX_32-NEXT:    vpsraq $32, %xmm0, %xmm0
> >  ; SKX_32-NEXT:    movl {{[0-9]+}}(%esp), %eax
> >  ; SKX_32-NEXT:    kxnorw %k0, %k0, %k1
> > -; SKX_32-NEXT:    vpgatherqq (%eax,%xmm0,8), %xmm1 {%k1}
> > +; SKX_32-NEXT:    vpgatherdq (%eax,%xmm0,8), %xmm1 {%k1}
> >  ; SKX_32-NEXT:    vmovdqa %xmm1, %xmm0
> >  ; SKX_32-NEXT:    retl
> >    %sext_ind = sext <2 x i32> %ind to <2 x i64>
> > @@ -1522,40 +1490,40 @@ define <2 x i64> @test26(i64* %base, <2
> >  define <2 x float> @test27(float* %base, <2 x i32> %ind) {
> >  ; KNL_64-LABEL: test27:
> >  ; KNL_64:       # %bb.0:
> > -; KNL_64-NEXT:    vpermilps {{.*#+}} xmm1 = xmm0[0,2,2,3]
> > +; KNL_64-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
> >  ; KNL_64-NEXT:    movw $3, %ax
> >  ; KNL_64-NEXT:    kmovw %eax, %k1
> > -; KNL_64-NEXT:    vgatherdps (%rdi,%zmm1,4), %zmm0 {%k1}
> > -; KNL_64-NEXT:    # kill: def $xmm0 killed $xmm0 killed $zmm0
> > +; KNL_64-NEXT:    vgatherdps (%rdi,%zmm0,4), %zmm1 {%k1}
> > +; KNL_64-NEXT:    vmovaps %xmm1, %xmm0
> >  ; KNL_64-NEXT:    vzeroupper
> >  ; KNL_64-NEXT:    retq
> >  ;
> >  ; KNL_32-LABEL: test27:
> >  ; KNL_32:       # %bb.0:
> > -; KNL_32-NEXT:    vpermilps {{.*#+}} xmm1 = xmm0[0,2,2,3]
> > +; KNL_32-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
> >  ; KNL_32-NEXT:    movl {{[0-9]+}}(%esp), %eax
> >  ; KNL_32-NEXT:    movw $3, %cx
> >  ; KNL_32-NEXT:    kmovw %ecx, %k1
> > -; KNL_32-NEXT:    vgatherdps (%eax,%zmm1,4), %zmm0 {%k1}
> > -; KNL_32-NEXT:    # kill: def $xmm0 killed $xmm0 killed $zmm0
> > +; KNL_32-NEXT:    vgatherdps (%eax,%zmm0,4), %zmm1 {%k1}
> > +; KNL_32-NEXT:    vmovaps %xmm1, %xmm0
> >  ; KNL_32-NEXT:    vzeroupper
> >  ; KNL_32-NEXT:    retl
> >  ;
> >  ; SKX-LABEL: test27:
> >  ; SKX:       # %bb.0:
> > -; SKX-NEXT:    vpermilps {{.*#+}} xmm1 = xmm0[0,2,2,3]
> >  ; SKX-NEXT:    movb $3, %al
> >  ; SKX-NEXT:    kmovw %eax, %k1
> > -; SKX-NEXT:    vgatherdps (%rdi,%xmm1,4), %xmm0 {%k1}
> > +; SKX-NEXT:    vgatherdps (%rdi,%xmm0,4), %xmm1 {%k1}
> > +; SKX-NEXT:    vmovaps %xmm1, %xmm0
> >  ; SKX-NEXT:    retq
> >  ;
> >  ; SKX_32-LABEL: test27:
> >  ; SKX_32:       # %bb.0:
> > -; SKX_32-NEXT:    vpermilps {{.*#+}} xmm1 = xmm0[0,2,2,3]
> >  ; SKX_32-NEXT:    movl {{[0-9]+}}(%esp), %eax
> >  ; SKX_32-NEXT:    movb $3, %cl
> >  ; SKX_32-NEXT:    kmovw %ecx, %k1
> > -; SKX_32-NEXT:    vgatherdps (%eax,%xmm1,4), %xmm0 {%k1}
> > +; SKX_32-NEXT:    vgatherdps (%eax,%xmm0,4), %xmm1 {%k1}
> > +; SKX_32-NEXT:    vmovaps %xmm1, %xmm0
> >  ; SKX_32-NEXT:    retl
> >    %sext_ind = sext <2 x i32> %ind to <2 x i64>
> >    %gep.random = getelementptr float, float* %base, <2 x i64> %sext_ind
> > @@ -1568,7 +1536,7 @@ define void @test28(<2 x i32>%a1, <2 x i
> >  ; KNL_64-LABEL: test28:
> >  ; KNL_64:       # %bb.0:
> >  ; KNL_64-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> > -; KNL_64-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > +; KNL_64-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
> >  ; KNL_64-NEXT:    movb $3, %al
> >  ; KNL_64-NEXT:    kmovw %eax, %k1
> >  ; KNL_64-NEXT:    vpscatterqd %ymm0, (,%zmm1) {%k1}
> > @@ -1577,8 +1545,8 @@ define void @test28(<2 x i32>%a1, <2 x i
> >  ;
> >  ; KNL_32-LABEL: test28:
> >  ; KNL_32:       # %bb.0:
> > -; KNL_32-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > -; KNL_32-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
> > +; KNL_32-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> > +; KNL_32-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
> >  ; KNL_32-NEXT:    movw $3, %ax
> >  ; KNL_32-NEXT:    kmovw %eax, %k1
> >  ; KNL_32-NEXT:    vpscatterdd %zmm0, (,%zmm1) {%k1}
> > @@ -1587,7 +1555,6 @@ define void @test28(<2 x i32>%a1, <2 x i
> >  ;
> >  ; SKX-LABEL: test28:
> >  ; SKX:       # %bb.0:
> > -; SKX-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> >  ; SKX-NEXT:    kxnorw %k0, %k0, %k1
> >  ; SKX-NEXT:    vpscatterqd %xmm0, (,%xmm1) {%k1}
> >  ; SKX-NEXT:    retq
> > @@ -1596,8 +1563,6 @@ define void @test28(<2 x i32>%a1, <2 x i
> >  ; SKX_32:       # %bb.0:
> >  ; SKX_32-NEXT:    movb $3, %al
> >  ; SKX_32-NEXT:    kmovw %eax, %k1
> > -; SKX_32-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > -; SKX_32-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
> >  ; SKX_32-NEXT:    vpscatterdd %xmm0, (,%xmm1) {%k1}
> >  ; SKX_32-NEXT:    retl
> >    call void @llvm.masked.scatter.v2i32.v2p0i32(<2 x i32> %a1, <2 x i32*> %ptr, i32 4, <2 x i1> <i1 true, i1 true>)
> > @@ -2673,9 +2638,7 @@ define <16 x float> @sext_i8_index(float
> >  define <8 x float> @sext_v8i8_index(float* %base, <8 x i8> %ind) {
> >  ; KNL_64-LABEL: sext_v8i8_index:
> >  ; KNL_64:       # %bb.0:
> > -; KNL_64-NEXT:    vpmovzxwd {{.*#+}} ymm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
> > -; KNL_64-NEXT:    vpslld $24, %ymm0, %ymm0
> > -; KNL_64-NEXT:    vpsrad $24, %ymm0, %ymm1
> > +; KNL_64-NEXT:    vpmovsxbd %xmm0, %ymm1
> >  ; KNL_64-NEXT:    movw $255, %ax
> >  ; KNL_64-NEXT:    kmovw %eax, %k1
> >  ; KNL_64-NEXT:    vgatherdps (%rdi,%zmm1,4), %zmm0 {%k1}
> > @@ -2684,10 +2647,8 @@ define <8 x float> @sext_v8i8_index(floa
> >  ;
> >  ; KNL_32-LABEL: sext_v8i8_index:
> >  ; KNL_32:       # %bb.0:
> > -; KNL_32-NEXT:    vpmovzxwd {{.*#+}} ymm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
> >  ; KNL_32-NEXT:    movl {{[0-9]+}}(%esp), %eax
> > -; KNL_32-NEXT:    vpslld $24, %ymm0, %ymm0
> > -; KNL_32-NEXT:    vpsrad $24, %ymm0, %ymm1
> > +; KNL_32-NEXT:    vpmovsxbd %xmm0, %ymm1
> >  ; KNL_32-NEXT:    movw $255, %cx
> >  ; KNL_32-NEXT:    kmovw %ecx, %k1
> >  ; KNL_32-NEXT:    vgatherdps (%eax,%zmm1,4), %zmm0 {%k1}
> > @@ -2696,20 +2657,16 @@ define <8 x float> @sext_v8i8_index(floa
> >  ;
> >  ; SKX-LABEL: sext_v8i8_index:
> >  ; SKX:       # %bb.0:
> > -; SKX-NEXT:    vpmovzxwd {{.*#+}} ymm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
> > +; SKX-NEXT:    vpmovsxbd %xmm0, %ymm1
> >  ; SKX-NEXT:    kxnorw %k0, %k0, %k1
> > -; SKX-NEXT:    vpslld $24, %ymm0, %ymm0
> > -; SKX-NEXT:    vpsrad $24, %ymm0, %ymm1
> >  ; SKX-NEXT:    vgatherdps (%rdi,%ymm1,4), %ymm0 {%k1}
> >  ; SKX-NEXT:    retq
> >  ;
> >  ; SKX_32-LABEL: sext_v8i8_index:
> >  ; SKX_32:       # %bb.0:
> > -; SKX_32-NEXT:    vpmovzxwd {{.*#+}} ymm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
> >  ; SKX_32-NEXT:    movl {{[0-9]+}}(%esp), %eax
> > +; SKX_32-NEXT:    vpmovsxbd %xmm0, %ymm1
> >  ; SKX_32-NEXT:    kxnorw %k0, %k0, %k1
> > -; SKX_32-NEXT:    vpslld $24, %ymm0, %ymm0
> > -; SKX_32-NEXT:    vpsrad $24, %ymm0, %ymm1
> >  ; SKX_32-NEXT:    vgatherdps (%eax,%ymm1,4), %ymm0 {%k1}
> >  ; SKX_32-NEXT:    retl
> >
> > @@ -2725,28 +2682,26 @@ declare <8 x float> @llvm.masked.gather.
> >  define void @test_scatter_2i32_index(<2 x double> %a1, double* %base, <2 x i32> %ind, <2 x i1> %mask) {
> >  ; KNL_64-LABEL: test_scatter_2i32_index:
> >  ; KNL_64:       # %bb.0:
> > +; KNL_64-NEXT:    # kill: def $xmm1 killed $xmm1 def $ymm1
> >  ; KNL_64-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
> > -; KNL_64-NEXT:    vpsllq $32, %xmm1, %xmm1
> > -; KNL_64-NEXT:    vpsraq $32, %zmm1, %zmm1
> >  ; KNL_64-NEXT:    vpsllq $63, %xmm2, %xmm2
> >  ; KNL_64-NEXT:    vptestmq %zmm2, %zmm2, %k0
> >  ; KNL_64-NEXT:    kshiftlw $14, %k0, %k0
> >  ; KNL_64-NEXT:    kshiftrw $14, %k0, %k1
> > -; KNL_64-NEXT:    vscatterqpd %zmm0, (%rdi,%zmm1,8) {%k1}
> > +; KNL_64-NEXT:    vscatterdpd %zmm0, (%rdi,%ymm1,8) {%k1}
> >  ; KNL_64-NEXT:    vzeroupper
> >  ; KNL_64-NEXT:    retq
> >  ;
> >  ; KNL_32-LABEL: test_scatter_2i32_index:
> >  ; KNL_32:       # %bb.0:
> > +; KNL_32-NEXT:    # kill: def $xmm1 killed $xmm1 def $ymm1
> >  ; KNL_32-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
> > -; KNL_32-NEXT:    vpsllq $32, %xmm1, %xmm1
> > -; KNL_32-NEXT:    vpsraq $32, %zmm1, %zmm1
> >  ; KNL_32-NEXT:    vpsllq $63, %xmm2, %xmm2
> >  ; KNL_32-NEXT:    vptestmq %zmm2, %zmm2, %k0
> >  ; KNL_32-NEXT:    kshiftlw $14, %k0, %k0
> >  ; KNL_32-NEXT:    kshiftrw $14, %k0, %k1
> >  ; KNL_32-NEXT:    movl {{[0-9]+}}(%esp), %eax
> > -; KNL_32-NEXT:    vscatterqpd %zmm0, (%eax,%zmm1,8) {%k1}
> > +; KNL_32-NEXT:    vscatterdpd %zmm0, (%eax,%ymm1,8) {%k1}
> >  ; KNL_32-NEXT:    vzeroupper
> >  ; KNL_32-NEXT:    retl
> >  ;
> > @@ -2754,19 +2709,15 @@ define void @test_scatter_2i32_index(<2
> >  ; SKX:       # %bb.0:
> >  ; SKX-NEXT:    vpsllq $63, %xmm2, %xmm2
> >  ; SKX-NEXT:    vpmovq2m %xmm2, %k1
> > -; SKX-NEXT:    vpsllq $32, %xmm1, %xmm1
> > -; SKX-NEXT:    vpsraq $32, %xmm1, %xmm1
> > -; SKX-NEXT:    vscatterqpd %xmm0, (%rdi,%xmm1,8) {%k1}
> > +; SKX-NEXT:    vscatterdpd %xmm0, (%rdi,%xmm1,8) {%k1}
> >  ; SKX-NEXT:    retq
> >  ;
> >  ; SKX_32-LABEL: test_scatter_2i32_index:
> >  ; SKX_32:       # %bb.0:
> >  ; SKX_32-NEXT:    vpsllq $63, %xmm2, %xmm2
> >  ; SKX_32-NEXT:    vpmovq2m %xmm2, %k1
> > -; SKX_32-NEXT:    vpsllq $32, %xmm1, %xmm1
> > -; SKX_32-NEXT:    vpsraq $32, %xmm1, %xmm1
> >  ; SKX_32-NEXT:    movl {{[0-9]+}}(%esp), %eax
> > -; SKX_32-NEXT:    vscatterqpd %xmm0, (%eax,%xmm1,8) {%k1}
> > +; SKX_32-NEXT:    vscatterdpd %xmm0, (%eax,%xmm1,8) {%k1}
> >  ; SKX_32-NEXT:    retl
> >    %gep = getelementptr double, double *%base, <2 x i32> %ind
> >    call void @llvm.masked.scatter.v2f64.v2p0f64(<2 x double> %a1, <2 x double*> %gep, i32 4, <2 x i1> %mask)
> >
> > Modified: llvm/trunk/test/CodeGen/X86/masked_gather_scatter_widen.ll
> > URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/masked_gather_scatter_widen.ll?rev=368183&r1=368182&r2=368183&view=diff
> > ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/masked_gather_scatter_widen.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/masked_gather_scatter_widen.ll Wed Aug  7 09:24:26 2019
> > @@ -30,24 +30,21 @@ define <2 x double> @test_gather_v2i32_i
> >  ;
> >  ; PROMOTE_SKX-LABEL: test_gather_v2i32_index:
> >  ; PROMOTE_SKX:       # %bb.0:
> > -; PROMOTE_SKX-NEXT:    vpsllq $32, %xmm0, %xmm0
> > -; PROMOTE_SKX-NEXT:    vpsraq $32, %xmm0, %xmm0
> >  ; PROMOTE_SKX-NEXT:    vpsllq $63, %xmm1, %xmm1
> >  ; PROMOTE_SKX-NEXT:    vpmovq2m %xmm1, %k1
> > -; PROMOTE_SKX-NEXT:    vgatherqpd (%rdi,%xmm0,8), %xmm2 {%k1}
> > +; PROMOTE_SKX-NEXT:    vgatherdpd (%rdi,%xmm0,8), %xmm2 {%k1}
> >  ; PROMOTE_SKX-NEXT:    vmovapd %xmm2, %xmm0
> >  ; PROMOTE_SKX-NEXT:    retq
> >  ;
> >  ; PROMOTE_KNL-LABEL: test_gather_v2i32_index:
> >  ; PROMOTE_KNL:       # %bb.0:
> >  ; PROMOTE_KNL-NEXT:    # kill: def $xmm2 killed $xmm2 def $zmm2
> > -; PROMOTE_KNL-NEXT:    vpsllq $32, %xmm0, %xmm0
> > -; PROMOTE_KNL-NEXT:    vpsraq $32, %zmm0, %zmm0
> > +; PROMOTE_KNL-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
> >  ; PROMOTE_KNL-NEXT:    vpsllq $63, %xmm1, %xmm1
> >  ; PROMOTE_KNL-NEXT:    vptestmq %zmm1, %zmm1, %k0
> >  ; PROMOTE_KNL-NEXT:    kshiftlw $14, %k0, %k0
> >  ; PROMOTE_KNL-NEXT:    kshiftrw $14, %k0, %k1
> > -; PROMOTE_KNL-NEXT:    vgatherqpd (%rdi,%zmm0,8), %zmm2 {%k1}
> > +; PROMOTE_KNL-NEXT:    vgatherdpd (%rdi,%ymm0,8), %zmm2 {%k1}
> >  ; PROMOTE_KNL-NEXT:    vmovapd %xmm2, %xmm0
> >  ; PROMOTE_KNL-NEXT:    vzeroupper
> >  ; PROMOTE_KNL-NEXT:    retq
> > @@ -61,11 +58,8 @@ define <2 x double> @test_gather_v2i32_i
> >  ;
> >  ; PROMOTE_AVX2-LABEL: test_gather_v2i32_index:
> >  ; PROMOTE_AVX2:       # %bb.0:
> > -; PROMOTE_AVX2-NEXT:    vpsllq $32, %xmm0, %xmm3
> > -; PROMOTE_AVX2-NEXT:    vpsrad $31, %xmm3, %xmm3
> > -; PROMOTE_AVX2-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm3[1],xmm0[2],xmm3[3]
> >  ; PROMOTE_AVX2-NEXT:    vpsllq $63, %xmm1, %xmm1
> > -; PROMOTE_AVX2-NEXT:    vgatherqpd %xmm1, (%rdi,%xmm0,8), %xmm2
> > +; PROMOTE_AVX2-NEXT:    vgatherdpd %xmm1, (%rdi,%xmm0,8), %xmm2
> >  ; PROMOTE_AVX2-NEXT:    vmovapd %xmm2, %xmm0
> >  ; PROMOTE_AVX2-NEXT:    retq
> >    %gep.random = getelementptr double, double* %base, <2 x i32> %ind
> > @@ -97,21 +91,18 @@ define void @test_scatter_v2i32_index(<2
> >  ; PROMOTE_SKX:       # %bb.0:
> >  ; PROMOTE_SKX-NEXT:    vpsllq $63, %xmm2, %xmm2
> >  ; PROMOTE_SKX-NEXT:    vpmovq2m %xmm2, %k1
> > -; PROMOTE_SKX-NEXT:    vpsllq $32, %xmm1, %xmm1
> > -; PROMOTE_SKX-NEXT:    vpsraq $32, %xmm1, %xmm1
> > -; PROMOTE_SKX-NEXT:    vscatterqpd %xmm0, (%rdi,%xmm1,8) {%k1}
> > +; PROMOTE_SKX-NEXT:    vscatterdpd %xmm0, (%rdi,%xmm1,8) {%k1}
> >  ; PROMOTE_SKX-NEXT:    retq
> >  ;
> >  ; PROMOTE_KNL-LABEL: test_scatter_v2i32_index:
> >  ; PROMOTE_KNL:       # %bb.0:
> > +; PROMOTE_KNL-NEXT:    # kill: def $xmm1 killed $xmm1 def $ymm1
> >  ; PROMOTE_KNL-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
> > -; PROMOTE_KNL-NEXT:    vpsllq $32, %xmm1, %xmm1
> > -; PROMOTE_KNL-NEXT:    vpsraq $32, %zmm1, %zmm1
> >  ; PROMOTE_KNL-NEXT:    vpsllq $63, %xmm2, %xmm2
> >  ; PROMOTE_KNL-NEXT:    vptestmq %zmm2, %zmm2, %k0
> >  ; PROMOTE_KNL-NEXT:    kshiftlw $14, %k0, %k0
> >  ; PROMOTE_KNL-NEXT:    kshiftrw $14, %k0, %k1
> > -; PROMOTE_KNL-NEXT:    vscatterqpd %zmm0, (%rdi,%zmm1,8) {%k1}
> > +; PROMOTE_KNL-NEXT:    vscatterdpd %zmm0, (%rdi,%ymm1,8) {%k1}
> >  ; PROMOTE_KNL-NEXT:    vzeroupper
> >  ; PROMOTE_KNL-NEXT:    retq
> >  ;
> > @@ -143,9 +134,7 @@ define void @test_scatter_v2i32_index(<2
> >  ;
> >  ; PROMOTE_AVX2-LABEL: test_scatter_v2i32_index:
> >  ; PROMOTE_AVX2:       # %bb.0:
> > -; PROMOTE_AVX2-NEXT:    vpsllq $32, %xmm1, %xmm3
> > -; PROMOTE_AVX2-NEXT:    vpsrad $31, %xmm3, %xmm3
> > -; PROMOTE_AVX2-NEXT:    vpblendd {{.*#+}} xmm1 = xmm1[0],xmm3[1],xmm1[2],xmm3[3]
> > +; PROMOTE_AVX2-NEXT:    vpmovsxdq %xmm1, %xmm1
> >  ; PROMOTE_AVX2-NEXT:    vpsllq $3, %xmm1, %xmm1
> >  ; PROMOTE_AVX2-NEXT:    vmovq %rdi, %xmm3
> >  ; PROMOTE_AVX2-NEXT:    vpbroadcastq %xmm3, %xmm3
> > @@ -199,21 +188,20 @@ define <2 x i32> @test_gather_v2i32_data
> >  ; PROMOTE_SKX:       # %bb.0:
> >  ; PROMOTE_SKX-NEXT:    vpsllq $63, %xmm1, %xmm1
> >  ; PROMOTE_SKX-NEXT:    vpmovq2m %xmm1, %k1
> > -; PROMOTE_SKX-NEXT:    vpshufd {{.*#+}} xmm1 = xmm2[0,2,2,3]
> > -; PROMOTE_SKX-NEXT:    vpgatherqd (,%xmm0), %xmm1 {%k1}
> > -; PROMOTE_SKX-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm1[0],zero,xmm1[1],zero
> > +; PROMOTE_SKX-NEXT:    vpgatherqd (,%xmm0), %xmm2 {%k1}
> > +; PROMOTE_SKX-NEXT:    vmovdqa %xmm2, %xmm0
> >  ; PROMOTE_SKX-NEXT:    retq
> >  ;
> >  ; PROMOTE_KNL-LABEL: test_gather_v2i32_data:
> >  ; PROMOTE_KNL:       # %bb.0:
> > +; PROMOTE_KNL-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
> >  ; PROMOTE_KNL-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
> >  ; PROMOTE_KNL-NEXT:    vpsllq $63, %xmm1, %xmm1
> >  ; PROMOTE_KNL-NEXT:    vptestmq %zmm1, %zmm1, %k0
> > -; PROMOTE_KNL-NEXT:    vpshufd {{.*#+}} xmm1 = xmm2[0,2,2,3]
> >  ; PROMOTE_KNL-NEXT:    kshiftlw $14, %k0, %k0
> >  ; PROMOTE_KNL-NEXT:    kshiftrw $14, %k0, %k1
> > -; PROMOTE_KNL-NEXT:    vpgatherqd (,%zmm0), %ymm1 {%k1}
> > -; PROMOTE_KNL-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm1[0],zero,xmm1[1],zero
> > +; PROMOTE_KNL-NEXT:    vpgatherqd (,%zmm0), %ymm2 {%k1}
> > +; PROMOTE_KNL-NEXT:    vmovdqa %xmm2, %xmm0
> >  ; PROMOTE_KNL-NEXT:    vzeroupper
> >  ; PROMOTE_KNL-NEXT:    retq
> >  ;
> > @@ -227,11 +215,10 @@ define <2 x i32> @test_gather_v2i32_data
> >  ;
> >  ; PROMOTE_AVX2-LABEL: test_gather_v2i32_data:
> >  ; PROMOTE_AVX2:       # %bb.0:
> > -; PROMOTE_AVX2-NEXT:    vpshufd {{.*#+}} xmm2 = xmm2[0,2,2,3]
> >  ; PROMOTE_AVX2-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
> >  ; PROMOTE_AVX2-NEXT:    vpslld $31, %xmm1, %xmm1
> >  ; PROMOTE_AVX2-NEXT:    vpgatherqd %xmm1, (,%xmm0), %xmm2
> > -; PROMOTE_AVX2-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm2[0],zero,xmm2[1],zero
> > +; PROMOTE_AVX2-NEXT:    vmovdqa %xmm2, %xmm0
> >  ; PROMOTE_AVX2-NEXT:    retq
> >    %res = call <2 x i32> @llvm.masked.gather.v2i32.v2p0i32(<2 x i32*> %ptr, i32 4, <2 x i1> %mask, <2 x i32> %src0)
> >    ret <2 x i32>%res
> > @@ -261,16 +248,15 @@ define void @test_scatter_v2i32_data(<2
> >  ; PROMOTE_SKX:       # %bb.0:
> >  ; PROMOTE_SKX-NEXT:    vpsllq $63, %xmm2, %xmm2
> >  ; PROMOTE_SKX-NEXT:    vpmovq2m %xmm2, %k1
> > -; PROMOTE_SKX-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> >  ; PROMOTE_SKX-NEXT:    vpscatterqd %xmm0, (,%xmm1) {%k1}
> >  ; PROMOTE_SKX-NEXT:    retq
> >  ;
> >  ; PROMOTE_KNL-LABEL: test_scatter_v2i32_data:
> >  ; PROMOTE_KNL:       # %bb.0:
> >  ; PROMOTE_KNL-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> > +; PROMOTE_KNL-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
> >  ; PROMOTE_KNL-NEXT:    vpsllq $63, %xmm2, %xmm2
> >  ; PROMOTE_KNL-NEXT:    vptestmq %zmm2, %zmm2, %k0
> > -; PROMOTE_KNL-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> >  ; PROMOTE_KNL-NEXT:    kshiftlw $14, %k0, %k0
> >  ; PROMOTE_KNL-NEXT:    kshiftrw $14, %k0, %k1
> >  ; PROMOTE_KNL-NEXT:    vpscatterqd %ymm0, (,%zmm1) {%k1}
> > @@ -316,7 +302,7 @@ define void @test_scatter_v2i32_data(<2
> >  ; PROMOTE_AVX2-NEXT:    je .LBB3_4
> >  ; PROMOTE_AVX2-NEXT:  .LBB3_3: # %cond.store1
> >  ; PROMOTE_AVX2-NEXT:    vpextrq $1, %xmm1, %rax
> > -; PROMOTE_AVX2-NEXT:    vextractps $2, %xmm0, (%rax)
> > +; PROMOTE_AVX2-NEXT:    vextractps $1, %xmm0, (%rax)
> >  ; PROMOTE_AVX2-NEXT:    retq
> >    call void @llvm.masked.scatter.v2i32.v2p0i32(<2 x i32> %a1, <2 x i32*> %ptr, i32 4, <2 x i1> %mask)
> >    ret void
> > @@ -348,22 +334,20 @@ define <2 x i32> @test_gather_v2i32_data
> >  ; PROMOTE_SKX:       # %bb.0:
> >  ; PROMOTE_SKX-NEXT:    vpsllq $63, %xmm1, %xmm1
> >  ; PROMOTE_SKX-NEXT:    vpmovq2m %xmm1, %k1
> > -; PROMOTE_SKX-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > -; PROMOTE_SKX-NEXT:    vpshufd {{.*#+}} xmm1 = xmm2[0,2,2,3]
> > -; PROMOTE_SKX-NEXT:    vpgatherdd (%rdi,%xmm0,4), %xmm1 {%k1}
> > -; PROMOTE_SKX-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm1[0],zero,xmm1[1],zero
> > +; PROMOTE_SKX-NEXT:    vpgatherdd (%rdi,%xmm0,4), %xmm2 {%k1}
> > +; PROMOTE_SKX-NEXT:    vmovdqa %xmm2, %xmm0
> >  ; PROMOTE_SKX-NEXT:    retq
> >  ;
> >  ; PROMOTE_KNL-LABEL: test_gather_v2i32_data_index:
> >  ; PROMOTE_KNL:       # %bb.0:
> > +; PROMOTE_KNL-NEXT:    # kill: def $xmm2 killed $xmm2 def $zmm2
> > +; PROMOTE_KNL-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
> >  ; PROMOTE_KNL-NEXT:    vpsllq $63, %xmm1, %xmm1
> >  ; PROMOTE_KNL-NEXT:    vptestmq %zmm1, %zmm1, %k0
> > -; PROMOTE_KNL-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > -; PROMOTE_KNL-NEXT:    vpshufd {{.*#+}} xmm1 = xmm2[0,2,2,3]
> >  ; PROMOTE_KNL-NEXT:    kshiftlw $14, %k0, %k0
> >  ; PROMOTE_KNL-NEXT:    kshiftrw $14, %k0, %k1
> > -; PROMOTE_KNL-NEXT:    vpgatherdd (%rdi,%zmm0,4), %zmm1 {%k1}
> > -; PROMOTE_KNL-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm1[0],zero,xmm1[1],zero
> > +; PROMOTE_KNL-NEXT:    vpgatherdd (%rdi,%zmm0,4), %zmm2 {%k1}
> > +; PROMOTE_KNL-NEXT:    vmovdqa %xmm2, %xmm0
> >  ; PROMOTE_KNL-NEXT:    vzeroupper
> >  ; PROMOTE_KNL-NEXT:    retq
> >  ;
> > @@ -377,12 +361,10 @@ define <2 x i32> @test_gather_v2i32_data
> >  ;
> >  ; PROMOTE_AVX2-LABEL: test_gather_v2i32_data_index:
> >  ; PROMOTE_AVX2:       # %bb.0:
> > -; PROMOTE_AVX2-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > -; PROMOTE_AVX2-NEXT:    vpshufd {{.*#+}} xmm2 = xmm2[0,2,2,3]
> >  ; PROMOTE_AVX2-NEXT:    vinsertps {{.*#+}} xmm1 = xmm1[0,2],zero,zero
> >  ; PROMOTE_AVX2-NEXT:    vpslld $31, %xmm1, %xmm1
> >  ; PROMOTE_AVX2-NEXT:    vpgatherdd %xmm1, (%rdi,%xmm0,4), %xmm2
> > -; PROMOTE_AVX2-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm2[0],zero,xmm2[1],zero
> > +; PROMOTE_AVX2-NEXT:    vmovdqa %xmm2, %xmm0
> >  ; PROMOTE_AVX2-NEXT:    retq
> >    %gep.random = getelementptr i32, i32* %base, <2 x i32> %ind
> >    %res = call <2 x i32> @llvm.masked.gather.v2i32.v2p0i32(<2 x i32*> %gep.random, i32 4, <2 x i1> %mask, <2 x i32> %src0)
> > @@ -413,17 +395,15 @@ define void @test_scatter_v2i32_data_ind
> >  ; PROMOTE_SKX:       # %bb.0:
> >  ; PROMOTE_SKX-NEXT:    vpsllq $63, %xmm2, %xmm2
> >  ; PROMOTE_SKX-NEXT:    vpmovq2m %xmm2, %k1
> > -; PROMOTE_SKX-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > -; PROMOTE_SKX-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
> >  ; PROMOTE_SKX-NEXT:    vpscatterdd %xmm0, (%rdi,%xmm1,4) {%k1}
> >  ; PROMOTE_SKX-NEXT:    retq
> >  ;
> >  ; PROMOTE_KNL-LABEL: test_scatter_v2i32_data_index:
> >  ; PROMOTE_KNL:       # %bb.0:
> > +; PROMOTE_KNL-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> > +; PROMOTE_KNL-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
> >  ; PROMOTE_KNL-NEXT:    vpsllq $63, %xmm2, %xmm2
> >  ; PROMOTE_KNL-NEXT:    vptestmq %zmm2, %zmm2, %k0
> > -; PROMOTE_KNL-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > -; PROMOTE_KNL-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
> >  ; PROMOTE_KNL-NEXT:    kshiftlw $14, %k0, %k0
> >  ; PROMOTE_KNL-NEXT:    kshiftrw $14, %k0, %k1
> >  ; PROMOTE_KNL-NEXT:    vpscatterdd %zmm0, (%rdi,%zmm1,4) {%k1}
> > @@ -458,9 +438,7 @@ define void @test_scatter_v2i32_data_ind
> >  ;
> >  ; PROMOTE_AVX2-LABEL: test_scatter_v2i32_data_index:
> >  ; PROMOTE_AVX2:       # %bb.0:
> > -; PROMOTE_AVX2-NEXT:    vpsllq $32, %xmm1, %xmm3
> > -; PROMOTE_AVX2-NEXT:    vpsrad $31, %xmm3, %xmm3
> > -; PROMOTE_AVX2-NEXT:    vpblendd {{.*#+}} xmm1 = xmm1[0],xmm3[1],xmm1[2],xmm3[3]
> > +; PROMOTE_AVX2-NEXT:    vpmovsxdq %xmm1, %xmm1
> >  ; PROMOTE_AVX2-NEXT:    vpsllq $2, %xmm1, %xmm1
> >  ; PROMOTE_AVX2-NEXT:    vmovq %rdi, %xmm3
> >  ; PROMOTE_AVX2-NEXT:    vpbroadcastq %xmm3, %xmm3
> > @@ -481,7 +459,7 @@ define void @test_scatter_v2i32_data_ind
> >  ; PROMOTE_AVX2-NEXT:    je .LBB5_4
> >  ; PROMOTE_AVX2-NEXT:  .LBB5_3: # %cond.store1
> >  ; PROMOTE_AVX2-NEXT:    vpextrq $1, %xmm1, %rax
> > -; PROMOTE_AVX2-NEXT:    vextractps $2, %xmm0, (%rax)
> > +; PROMOTE_AVX2-NEXT:    vextractps $1, %xmm0, (%rax)
> >  ; PROMOTE_AVX2-NEXT:    retq
> >    %gep = getelementptr i32, i32 *%base, <2 x i32> %ind
> >    call void @llvm.masked.scatter.v2i32.v2p0i32(<2 x i32> %a1, <2 x i32*> %gep, i32 4, <2 x i1> %mask)
> >
> > Modified: llvm/trunk/test/CodeGen/X86/masked_load.ll
> > URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/masked_load.ll?rev=368183&r1=368182&r2=368183&view=diff
> > ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/masked_load.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/masked_load.ll Wed Aug  7 09:24:26 2019
> > @@ -458,38 +458,40 @@ define <8 x double> @load_v8f64_v8i16(<8
> >  ;
> >  ; AVX1-LABEL: load_v8f64_v8i16:
> >  ; AVX1:       ## %bb.0:
> > -; AVX1-NEXT:    vpxor %xmm3, %xmm3, %xmm3
> > -; AVX1-NEXT:    vpunpckhwd {{.*#+}} xmm4 = xmm0[4],xmm3[4],xmm0[5],xmm3[5],xmm0[6],xmm3[6],xmm0[7],xmm3[7]
> > -; AVX1-NEXT:    vpcmpeqd %xmm3, %xmm4, %xmm4
> > -; AVX1-NEXT:    vpmovsxdq %xmm4, %xmm5
> > -; AVX1-NEXT:    vpshufd {{.*#+}} xmm4 = xmm4[2,3,0,1]
> > -; AVX1-NEXT:    vpmovsxdq %xmm4, %xmm4
> > -; AVX1-NEXT:    vinsertf128 $1, %xmm4, %ymm5, %ymm4
> > -; AVX1-NEXT:    vpmovzxwd {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero
> > -; AVX1-NEXT:    vpcmpeqd %xmm3, %xmm0, %xmm0
> > -; AVX1-NEXT:    vpmovsxdq %xmm0, %xmm3
> > +; AVX1-NEXT:    vpshufd {{.*#+}} xmm3 = xmm0[2,3,0,1]
> > +; AVX1-NEXT:    vpxor %xmm4, %xmm4, %xmm4
> > +; AVX1-NEXT:    vpcmpeqw %xmm4, %xmm3, %xmm3
> > +; AVX1-NEXT:    vpmovsxwd %xmm3, %xmm3
> > +; AVX1-NEXT:    vpmovsxdq %xmm3, %xmm5
> > +; AVX1-NEXT:    vpshufd {{.*#+}} xmm3 = xmm3[2,3,0,1]
> > +; AVX1-NEXT:    vpmovsxdq %xmm3, %xmm3
> > +; AVX1-NEXT:    vinsertf128 $1, %xmm3, %ymm5, %ymm3
> > +; AVX1-NEXT:    vpcmpeqw %xmm4, %xmm0, %xmm0
> > +; AVX1-NEXT:    vpmovsxwd %xmm0, %xmm0
> > +; AVX1-NEXT:    vpmovsxdq %xmm0, %xmm4
> >  ; AVX1-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[2,3,0,1]
> >  ; AVX1-NEXT:    vpmovsxdq %xmm0, %xmm0
> > -; AVX1-NEXT:    vinsertf128 $1, %xmm0, %ymm3, %ymm0
> > -; AVX1-NEXT:    vmaskmovpd (%rdi), %ymm0, %ymm3
> > -; AVX1-NEXT:    vblendvpd %ymm0, %ymm3, %ymm1, %ymm0
> > -; AVX1-NEXT:    vmaskmovpd 32(%rdi), %ymm4, %ymm1
> > -; AVX1-NEXT:    vblendvpd %ymm4, %ymm1, %ymm2, %ymm1
> > +; AVX1-NEXT:    vinsertf128 $1, %xmm0, %ymm4, %ymm0
> > +; AVX1-NEXT:    vmaskmovpd (%rdi), %ymm0, %ymm4
> > +; AVX1-NEXT:    vblendvpd %ymm0, %ymm4, %ymm1, %ymm0
> > +; AVX1-NEXT:    vmaskmovpd 32(%rdi), %ymm3, %ymm1
> > +; AVX1-NEXT:    vblendvpd %ymm3, %ymm1, %ymm2, %ymm1
> >  ; AVX1-NEXT:    retq
> >  ;
> >  ; AVX2-LABEL: load_v8f64_v8i16:
> >  ; AVX2:       ## %bb.0:
> > -; AVX2-NEXT:    vpxor %xmm3, %xmm3, %xmm3
> > -; AVX2-NEXT:    vpunpckhwd {{.*#+}} xmm4 = xmm0[4],xmm3[4],xmm0[5],xmm3[5],xmm0[6],xmm3[6],xmm0[7],xmm3[7]
> > -; AVX2-NEXT:    vpcmpeqd %xmm3, %xmm4, %xmm4
> > -; AVX2-NEXT:    vpmovsxdq %xmm4, %ymm4
> > -; AVX2-NEXT:    vpmovzxwd {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero
> > -; AVX2-NEXT:    vpcmpeqd %xmm3, %xmm0, %xmm0
> > +; AVX2-NEXT:    vpshufd {{.*#+}} xmm3 = xmm0[2,3,0,1]
> > +; AVX2-NEXT:    vpxor %xmm4, %xmm4, %xmm4
> > +; AVX2-NEXT:    vpcmpeqw %xmm4, %xmm3, %xmm3
> > +; AVX2-NEXT:    vpmovsxwd %xmm3, %xmm3
> > +; AVX2-NEXT:    vpmovsxdq %xmm3, %ymm3
> > +; AVX2-NEXT:    vpcmpeqw %xmm4, %xmm0, %xmm0
> > +; AVX2-NEXT:    vpmovsxwd %xmm0, %xmm0
> >  ; AVX2-NEXT:    vpmovsxdq %xmm0, %ymm0
> > -; AVX2-NEXT:    vmaskmovpd (%rdi), %ymm0, %ymm3
> > -; AVX2-NEXT:    vblendvpd %ymm0, %ymm3, %ymm1, %ymm0
> > -; AVX2-NEXT:    vmaskmovpd 32(%rdi), %ymm4, %ymm1
> > -; AVX2-NEXT:    vblendvpd %ymm4, %ymm1, %ymm2, %ymm1
> > +; AVX2-NEXT:    vmaskmovpd (%rdi), %ymm0, %ymm4
> > +; AVX2-NEXT:    vblendvpd %ymm0, %ymm4, %ymm1, %ymm0
> > +; AVX2-NEXT:    vmaskmovpd 32(%rdi), %ymm3, %ymm1
> > +; AVX2-NEXT:    vblendvpd %ymm3, %ymm1, %ymm2, %ymm1
> >  ; AVX2-NEXT:    retq
> >  ;
> >  ; AVX512F-LABEL: load_v8f64_v8i16:
> > @@ -723,11 +725,9 @@ define <8 x double> @load_v8f64_v8i64(<8
> >  define <2 x float> @load_v2f32_v2i32(<2 x i32> %trigger, <2 x float>* %addr, <2 x float> %dst) {
> >  ; SSE2-LABEL: load_v2f32_v2i32:
> >  ; SSE2:       ## %bb.0:
> > -; SSE2-NEXT:    pand {{.*}}(%rip), %xmm0
> >  ; SSE2-NEXT:    pxor %xmm2, %xmm2
> >  ; SSE2-NEXT:    pcmpeqd %xmm0, %xmm2
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm2[1,0,3,2]
> > -; SSE2-NEXT:    pand %xmm2, %xmm0
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm2[0,0,1,1]
> >  ; SSE2-NEXT:    movmskpd %xmm0, %eax
> >  ; SSE2-NEXT:    testb $1, %al
> >  ; SSE2-NEXT:    jne LBB7_1
> > @@ -753,8 +753,8 @@ define <2 x float> @load_v2f32_v2i32(<2
> >  ; SSE42-LABEL: load_v2f32_v2i32:
> >  ; SSE42:       ## %bb.0:
> >  ; SSE42-NEXT:    pxor %xmm2, %xmm2
> > -; SSE42-NEXT:    pblendw {{.*#+}} xmm0 = xmm0[0,1],xmm2[2,3],xmm0[4,5],xmm2[6,7]
> > -; SSE42-NEXT:    pcmpeqq %xmm2, %xmm0
> > +; SSE42-NEXT:    pcmpeqd %xmm0, %xmm2
> > +; SSE42-NEXT:    pmovsxdq %xmm2, %xmm0
> >  ; SSE42-NEXT:    movmskpd %xmm0, %eax
> >  ; SSE42-NEXT:    testb $1, %al
> >  ; SSE42-NEXT:    jne LBB7_1
> > @@ -774,32 +774,20 @@ define <2 x float> @load_v2f32_v2i32(<2
> >  ; SSE42-NEXT:    movaps %xmm1, %xmm0
> >  ; SSE42-NEXT:    retq
> >  ;
> > -; AVX1-LABEL: load_v2f32_v2i32:
> > -; AVX1:       ## %bb.0:
> > -; AVX1-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > -; AVX1-NEXT:    vpblendw {{.*#+}} xmm0 = xmm0[0,1],xmm2[2,3],xmm0[4,5],xmm2[6,7]
> > -; AVX1-NEXT:    vpcmpeqq %xmm2, %xmm0, %xmm0
> > -; AVX1-NEXT:    vinsertps {{.*#+}} xmm0 = xmm0[0,2],zero,zero
> > -; AVX1-NEXT:    vmaskmovps (%rdi), %xmm0, %xmm2
> > -; AVX1-NEXT:    vblendvps %xmm0, %xmm2, %xmm1, %xmm0
> > -; AVX1-NEXT:    retq
> > -;
> > -; AVX2-LABEL: load_v2f32_v2i32:
> > -; AVX2:       ## %bb.0:
> > -; AVX2-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > -; AVX2-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm2[1],xmm0[2],xmm2[3]
> > -; AVX2-NEXT:    vpcmpeqq %xmm2, %xmm0, %xmm0
> > -; AVX2-NEXT:    vinsertps {{.*#+}} xmm0 = xmm0[0,2],zero,zero
> > -; AVX2-NEXT:    vmaskmovps (%rdi), %xmm0, %xmm2
> > -; AVX2-NEXT:    vblendvps %xmm0, %xmm2, %xmm1, %xmm0
> > -; AVX2-NEXT:    retq
> > +; AVX1OR2-LABEL: load_v2f32_v2i32:
> > +; AVX1OR2:       ## %bb.0:
> > +; AVX1OR2-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > +; AVX1OR2-NEXT:    vpcmpeqd %xmm2, %xmm0, %xmm0
> > +; AVX1OR2-NEXT:    vmovq {{.*#+}} xmm0 = xmm0[0],zero
> > +; AVX1OR2-NEXT:    vmaskmovps (%rdi), %xmm0, %xmm2
> > +; AVX1OR2-NEXT:    vblendvps %xmm0, %xmm2, %xmm1, %xmm0
> > +; AVX1OR2-NEXT:    retq
> >  ;
> >  ; AVX512F-LABEL: load_v2f32_v2i32:
> >  ; AVX512F:       ## %bb.0:
> >  ; AVX512F-NEXT:    ## kill: def $xmm1 killed $xmm1 def $zmm1
> > -; AVX512F-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > -; AVX512F-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm2[1],xmm0[2],xmm2[3]
> > -; AVX512F-NEXT:    vptestnmq %zmm0, %zmm0, %k0
> > +; AVX512F-NEXT:    ## kill: def $xmm0 killed $xmm0 def $zmm0
> > +; AVX512F-NEXT:    vptestnmd %zmm0, %zmm0, %k0
> >  ; AVX512F-NEXT:    kshiftlw $14, %k0, %k0
> >  ; AVX512F-NEXT:    kshiftrw $14, %k0, %k1
> >  ; AVX512F-NEXT:    vblendmps (%rdi), %zmm1, %zmm0 {%k1}
> > @@ -807,13 +795,21 @@ define <2 x float> @load_v2f32_v2i32(<2
> >  ; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> >  ;
> > -; AVX512VL-LABEL: load_v2f32_v2i32:
> > -; AVX512VL:       ## %bb.0:
> > -; AVX512VL-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > -; AVX512VL-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm2[1],xmm0[2],xmm2[3]
> > -; AVX512VL-NEXT:    vptestnmq %xmm0, %xmm0, %k1
> > -; AVX512VL-NEXT:    vblendmps (%rdi), %xmm1, %xmm0 {%k1}
> > -; AVX512VL-NEXT:    retq
> > +; AVX512VLDQ-LABEL: load_v2f32_v2i32:
> > +; AVX512VLDQ:       ## %bb.0:
> > +; AVX512VLDQ-NEXT:    vptestnmd %xmm0, %xmm0, %k0
> > +; AVX512VLDQ-NEXT:    kshiftlb $6, %k0, %k0
> > +; AVX512VLDQ-NEXT:    kshiftrb $6, %k0, %k1
> > +; AVX512VLDQ-NEXT:    vblendmps (%rdi), %xmm1, %xmm0 {%k1}
> > +; AVX512VLDQ-NEXT:    retq
> > +;
> > +; AVX512VLBW-LABEL: load_v2f32_v2i32:
> > +; AVX512VLBW:       ## %bb.0:
> > +; AVX512VLBW-NEXT:    vptestnmd %xmm0, %xmm0, %k0
> > +; AVX512VLBW-NEXT:    kshiftlw $14, %k0, %k0
> > +; AVX512VLBW-NEXT:    kshiftrw $14, %k0, %k1
> > +; AVX512VLBW-NEXT:    vblendmps (%rdi), %xmm1, %xmm0 {%k1}
> > +; AVX512VLBW-NEXT:    retq
> >    %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> >    %res = call <2 x float> @llvm.masked.load.v2f32.p0v2f32(<2 x float>* %addr, i32 4, <2 x i1> %mask, <2 x float> %dst)
> >    ret <2 x float> %res
> > @@ -822,11 +818,9 @@ define <2 x float> @load_v2f32_v2i32(<2
> >  define <2 x float> @load_v2f32_v2i32_undef(<2 x i32> %trigger, <2 x float>* %addr) {
> >  ; SSE2-LABEL: load_v2f32_v2i32_undef:
> >  ; SSE2:       ## %bb.0:
> > -; SSE2-NEXT:    pand {{.*}}(%rip), %xmm0
> >  ; SSE2-NEXT:    pxor %xmm1, %xmm1
> >  ; SSE2-NEXT:    pcmpeqd %xmm0, %xmm1
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm1[1,0,3,2]
> > -; SSE2-NEXT:    pand %xmm1, %xmm0
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm1[0,0,1,1]
> >  ; SSE2-NEXT:    movmskpd %xmm0, %eax
> >  ; SSE2-NEXT:    testb $1, %al
> >  ; SSE2-NEXT:    ## implicit-def: $xmm0
> > @@ -850,8 +844,8 @@ define <2 x float> @load_v2f32_v2i32_und
> >  ; SSE42-LABEL: load_v2f32_v2i32_undef:
> >  ; SSE42:       ## %bb.0:
> >  ; SSE42-NEXT:    pxor %xmm1, %xmm1
> > -; SSE42-NEXT:    pblendw {{.*#+}} xmm0 = xmm0[0,1],xmm1[2,3],xmm0[4,5],xmm1[6,7]
> > -; SSE42-NEXT:    pcmpeqq %xmm1, %xmm0
> > +; SSE42-NEXT:    pcmpeqd %xmm0, %xmm1
> > +; SSE42-NEXT:    pmovsxdq %xmm1, %xmm0
> >  ; SSE42-NEXT:    movmskpd %xmm0, %eax
> >  ; SSE42-NEXT:    testb $1, %al
> >  ; SSE42-NEXT:    ## implicit-def: $xmm0
> > @@ -869,29 +863,18 @@ define <2 x float> @load_v2f32_v2i32_und
> >  ; SSE42-NEXT:    insertps {{.*#+}} xmm0 = xmm0[0],mem[0],xmm0[2,3]
> >  ; SSE42-NEXT:    retq
> >  ;
> > -; AVX1-LABEL: load_v2f32_v2i32_undef:
> > -; AVX1:       ## %bb.0:
> > -; AVX1-NEXT:    vpxor %xmm1, %xmm1, %xmm1
> > -; AVX1-NEXT:    vpblendw {{.*#+}} xmm0 = xmm0[0,1],xmm1[2,3],xmm0[4,5],xmm1[6,7]
> > -; AVX1-NEXT:    vpcmpeqq %xmm1, %xmm0, %xmm0
> > -; AVX1-NEXT:    vinsertps {{.*#+}} xmm0 = xmm0[0,2],zero,zero
> > -; AVX1-NEXT:    vmaskmovps (%rdi), %xmm0, %xmm0
> > -; AVX1-NEXT:    retq
> > -;
> > -; AVX2-LABEL: load_v2f32_v2i32_undef:
> > -; AVX2:       ## %bb.0:
> > -; AVX2-NEXT:    vpxor %xmm1, %xmm1, %xmm1
> > -; AVX2-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm1[1],xmm0[2],xmm1[3]
> > -; AVX2-NEXT:    vpcmpeqq %xmm1, %xmm0, %xmm0
> > -; AVX2-NEXT:    vinsertps {{.*#+}} xmm0 = xmm0[0,2],zero,zero
> > -; AVX2-NEXT:    vmaskmovps (%rdi), %xmm0, %xmm0
> > -; AVX2-NEXT:    retq
> > +; AVX1OR2-LABEL: load_v2f32_v2i32_undef:
> > +; AVX1OR2:       ## %bb.0:
> > +; AVX1OR2-NEXT:    vpxor %xmm1, %xmm1, %xmm1
> > +; AVX1OR2-NEXT:    vpcmpeqd %xmm1, %xmm0, %xmm0
> > +; AVX1OR2-NEXT:    vmovq {{.*#+}} xmm0 = xmm0[0],zero
> > +; AVX1OR2-NEXT:    vmaskmovps (%rdi), %xmm0, %xmm0
> > +; AVX1OR2-NEXT:    retq
> >  ;
> >  ; AVX512F-LABEL: load_v2f32_v2i32_undef:
> >  ; AVX512F:       ## %bb.0:
> > -; AVX512F-NEXT:    vpxor %xmm1, %xmm1, %xmm1
> > -; AVX512F-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm1[1],xmm0[2],xmm1[3]
> > -; AVX512F-NEXT:    vptestnmq %zmm0, %zmm0, %k0
> > +; AVX512F-NEXT:    ## kill: def $xmm0 killed $xmm0 def $zmm0
> > +; AVX512F-NEXT:    vptestnmd %zmm0, %zmm0, %k0
> >  ; AVX512F-NEXT:    kshiftlw $14, %k0, %k0
> >  ; AVX512F-NEXT:    kshiftrw $14, %k0, %k1
> >  ; AVX512F-NEXT:    vmovups (%rdi), %zmm0 {%k1} {z}
> > @@ -899,13 +882,21 @@ define <2 x float> @load_v2f32_v2i32_und
> >  ; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> >  ;
> > -; AVX512VL-LABEL: load_v2f32_v2i32_undef:
> > -; AVX512VL:       ## %bb.0:
> > -; AVX512VL-NEXT:    vpxor %xmm1, %xmm1, %xmm1
> > -; AVX512VL-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm1[1],xmm0[2],xmm1[3]
> > -; AVX512VL-NEXT:    vptestnmq %xmm0, %xmm0, %k1
> > -; AVX512VL-NEXT:    vmovups (%rdi), %xmm0 {%k1} {z}
> > -; AVX512VL-NEXT:    retq
> > +; AVX512VLDQ-LABEL: load_v2f32_v2i32_undef:
> > +; AVX512VLDQ:       ## %bb.0:
> > +; AVX512VLDQ-NEXT:    vptestnmd %xmm0, %xmm0, %k0
> > +; AVX512VLDQ-NEXT:    kshiftlb $6, %k0, %k0
> > +; AVX512VLDQ-NEXT:    kshiftrb $6, %k0, %k1
> > +; AVX512VLDQ-NEXT:    vmovups (%rdi), %xmm0 {%k1} {z}
> > +; AVX512VLDQ-NEXT:    retq
> > +;
> > +; AVX512VLBW-LABEL: load_v2f32_v2i32_undef:
> > +; AVX512VLBW:       ## %bb.0:
> > +; AVX512VLBW-NEXT:    vptestnmd %xmm0, %xmm0, %k0
> > +; AVX512VLBW-NEXT:    kshiftlw $14, %k0, %k0
> > +; AVX512VLBW-NEXT:    kshiftrw $14, %k0, %k1
> > +; AVX512VLBW-NEXT:    vmovups (%rdi), %xmm0 {%k1} {z}
> > +; AVX512VLBW-NEXT:    retq
> >    %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> >    %res = call <2 x float> @llvm.masked.load.v2f32.p0v2f32(<2 x float>* %addr, i32 4, <2 x i1> %mask, <2 x float>undef)
> >    ret <2 x float> %res
> > @@ -1792,38 +1783,40 @@ define <8 x i64> @load_v8i64_v8i16(<8 x
> >  ;
> >  ; AVX1-LABEL: load_v8i64_v8i16:
> >  ; AVX1:       ## %bb.0:
> > -; AVX1-NEXT:    vpxor %xmm3, %xmm3, %xmm3
> > -; AVX1-NEXT:    vpunpckhwd {{.*#+}} xmm4 = xmm0[4],xmm3[4],xmm0[5],xmm3[5],xmm0[6],xmm3[6],xmm0[7],xmm3[7]
> > -; AVX1-NEXT:    vpcmpeqd %xmm3, %xmm4, %xmm4
> > -; AVX1-NEXT:    vpmovsxdq %xmm4, %xmm5
> > -; AVX1-NEXT:    vpshufd {{.*#+}} xmm4 = xmm4[2,3,0,1]
> > -; AVX1-NEXT:    vpmovsxdq %xmm4, %xmm4
> > -; AVX1-NEXT:    vinsertf128 $1, %xmm4, %ymm5, %ymm4
> > -; AVX1-NEXT:    vpmovzxwd {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero
> > -; AVX1-NEXT:    vpcmpeqd %xmm3, %xmm0, %xmm0
> > -; AVX1-NEXT:    vpmovsxdq %xmm0, %xmm3
> > +; AVX1-NEXT:    vpshufd {{.*#+}} xmm3 = xmm0[2,3,0,1]
> > +; AVX1-NEXT:    vpxor %xmm4, %xmm4, %xmm4
> > +; AVX1-NEXT:    vpcmpeqw %xmm4, %xmm3, %xmm3
> > +; AVX1-NEXT:    vpmovsxwd %xmm3, %xmm3
> > +; AVX1-NEXT:    vpmovsxdq %xmm3, %xmm5
> > +; AVX1-NEXT:    vpshufd {{.*#+}} xmm3 = xmm3[2,3,0,1]
> > +; AVX1-NEXT:    vpmovsxdq %xmm3, %xmm3
> > +; AVX1-NEXT:    vinsertf128 $1, %xmm3, %ymm5, %ymm3
> > +; AVX1-NEXT:    vpcmpeqw %xmm4, %xmm0, %xmm0
> > +; AVX1-NEXT:    vpmovsxwd %xmm0, %xmm0
> > +; AVX1-NEXT:    vpmovsxdq %xmm0, %xmm4
> >  ; AVX1-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[2,3,0,1]
> >  ; AVX1-NEXT:    vpmovsxdq %xmm0, %xmm0
> > -; AVX1-NEXT:    vinsertf128 $1, %xmm0, %ymm3, %ymm0
> > -; AVX1-NEXT:    vmaskmovpd (%rdi), %ymm0, %ymm3
> > -; AVX1-NEXT:    vblendvpd %ymm0, %ymm3, %ymm1, %ymm0
> > -; AVX1-NEXT:    vmaskmovpd 32(%rdi), %ymm4, %ymm1
> > -; AVX1-NEXT:    vblendvpd %ymm4, %ymm1, %ymm2, %ymm1
> > +; AVX1-NEXT:    vinsertf128 $1, %xmm0, %ymm4, %ymm0
> > +; AVX1-NEXT:    vmaskmovpd (%rdi), %ymm0, %ymm4
> > +; AVX1-NEXT:    vblendvpd %ymm0, %ymm4, %ymm1, %ymm0
> > +; AVX1-NEXT:    vmaskmovpd 32(%rdi), %ymm3, %ymm1
> > +; AVX1-NEXT:    vblendvpd %ymm3, %ymm1, %ymm2, %ymm1
> >  ; AVX1-NEXT:    retq
> >  ;
> >  ; AVX2-LABEL: load_v8i64_v8i16:
> >  ; AVX2:       ## %bb.0:
> > -; AVX2-NEXT:    vpxor %xmm3, %xmm3, %xmm3
> > -; AVX2-NEXT:    vpunpckhwd {{.*#+}} xmm4 = xmm0[4],xmm3[4],xmm0[5],xmm3[5],xmm0[6],xmm3[6],xmm0[7],xmm3[7]
> > -; AVX2-NEXT:    vpcmpeqd %xmm3, %xmm4, %xmm4
> > -; AVX2-NEXT:    vpmovsxdq %xmm4, %ymm4
> > -; AVX2-NEXT:    vpmovzxwd {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero
> > -; AVX2-NEXT:    vpcmpeqd %xmm3, %xmm0, %xmm0
> > +; AVX2-NEXT:    vpshufd {{.*#+}} xmm3 = xmm0[2,3,0,1]
> > +; AVX2-NEXT:    vpxor %xmm4, %xmm4, %xmm4
> > +; AVX2-NEXT:    vpcmpeqw %xmm4, %xmm3, %xmm3
> > +; AVX2-NEXT:    vpmovsxwd %xmm3, %xmm3
> > +; AVX2-NEXT:    vpmovsxdq %xmm3, %ymm3
> > +; AVX2-NEXT:    vpcmpeqw %xmm4, %xmm0, %xmm0
> > +; AVX2-NEXT:    vpmovsxwd %xmm0, %xmm0
> >  ; AVX2-NEXT:    vpmovsxdq %xmm0, %ymm0
> > -; AVX2-NEXT:    vpmaskmovq (%rdi), %ymm0, %ymm3
> > -; AVX2-NEXT:    vblendvpd %ymm0, %ymm3, %ymm1, %ymm0
> > -; AVX2-NEXT:    vpmaskmovq 32(%rdi), %ymm4, %ymm1
> > -; AVX2-NEXT:    vblendvpd %ymm4, %ymm1, %ymm2, %ymm1
> > +; AVX2-NEXT:    vpmaskmovq (%rdi), %ymm0, %ymm4
> > +; AVX2-NEXT:    vblendvpd %ymm0, %ymm4, %ymm1, %ymm0
> > +; AVX2-NEXT:    vpmaskmovq 32(%rdi), %ymm3, %ymm1
> > +; AVX2-NEXT:    vblendvpd %ymm3, %ymm1, %ymm2, %ymm1
> >  ; AVX2-NEXT:    retq
> >  ;
> >  ; AVX512F-LABEL: load_v8i64_v8i16:
> > @@ -2061,11 +2054,9 @@ define <8 x i64> @load_v8i64_v8i64(<8 x
> >  define <2 x i32> @load_v2i32_v2i32(<2 x i32> %trigger, <2 x i32>* %addr, <2 x i32> %dst) {
> >  ; SSE2-LABEL: load_v2i32_v2i32:
> >  ; SSE2:       ## %bb.0:
> > -; SSE2-NEXT:    pand {{.*}}(%rip), %xmm0
> >  ; SSE2-NEXT:    pxor %xmm2, %xmm2
> >  ; SSE2-NEXT:    pcmpeqd %xmm0, %xmm2
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm2[1,0,3,2]
> > -; SSE2-NEXT:    pand %xmm2, %xmm0
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm2[0,0,1,1]
> >  ; SSE2-NEXT:    movmskpd %xmm0, %eax
> >  ; SSE2-NEXT:    testb $1, %al
> >  ; SSE2-NEXT:    jne LBB17_1
> > @@ -2073,26 +2064,26 @@ define <2 x i32> @load_v2i32_v2i32(<2 x
> >  ; SSE2-NEXT:    testb $2, %al
> >  ; SSE2-NEXT:    jne LBB17_3
> >  ; SSE2-NEXT:  LBB17_4: ## %else2
> > -; SSE2-NEXT:    movapd %xmm1, %xmm0
> > +; SSE2-NEXT:    movaps %xmm1, %xmm0
> >  ; SSE2-NEXT:    retq
> >  ; SSE2-NEXT:  LBB17_1: ## %cond.load
> > -; SSE2-NEXT:    movl (%rdi), %ecx
> > -; SSE2-NEXT:    movq %rcx, %xmm0
> > -; SSE2-NEXT:    movsd {{.*#+}} xmm1 = xmm0[0],xmm1[1]
> > +; SSE2-NEXT:    movss {{.*#+}} xmm0 = mem[0],zero,zero,zero
> > +; SSE2-NEXT:    movss {{.*#+}} xmm1 = xmm0[0],xmm1[1,2,3]
> >  ; SSE2-NEXT:    testb $2, %al
> >  ; SSE2-NEXT:    je LBB17_4
> >  ; SSE2-NEXT:  LBB17_3: ## %cond.load1
> > -; SSE2-NEXT:    movl 4(%rdi), %eax
> > -; SSE2-NEXT:    movq %rax, %xmm0
> > -; SSE2-NEXT:    unpcklpd {{.*#+}} xmm1 = xmm1[0],xmm0[0]
> > -; SSE2-NEXT:    movapd %xmm1, %xmm0
> > +; SSE2-NEXT:    movss {{.*#+}} xmm0 = mem[0],zero,zero,zero
> > +; SSE2-NEXT:    shufps {{.*#+}} xmm0 = xmm0[0,0],xmm1[0,0]
> > +; SSE2-NEXT:    shufps {{.*#+}} xmm0 = xmm0[2,0],xmm1[2,3]
> > +; SSE2-NEXT:    movaps %xmm0, %xmm1
> > +; SSE2-NEXT:    movaps %xmm1, %xmm0
> >  ; SSE2-NEXT:    retq
> >  ;
> >  ; SSE42-LABEL: load_v2i32_v2i32:
> >  ; SSE42:       ## %bb.0:
> >  ; SSE42-NEXT:    pxor %xmm2, %xmm2
> > -; SSE42-NEXT:    pblendw {{.*#+}} xmm0 = xmm0[0,1],xmm2[2,3],xmm0[4,5],xmm2[6,7]
> > -; SSE42-NEXT:    pcmpeqq %xmm2, %xmm0
> > +; SSE42-NEXT:    pcmpeqd %xmm0, %xmm2
> > +; SSE42-NEXT:    pmovsxdq %xmm2, %xmm0
> >  ; SSE42-NEXT:    movmskpd %xmm0, %eax
> >  ; SSE42-NEXT:    testb $1, %al
> >  ; SSE42-NEXT:    jne LBB17_1
> > @@ -2103,62 +2094,59 @@ define <2 x i32> @load_v2i32_v2i32(<2 x
> >  ; SSE42-NEXT:    movdqa %xmm1, %xmm0
> >  ; SSE42-NEXT:    retq
> >  ; SSE42-NEXT:  LBB17_1: ## %cond.load
> > -; SSE42-NEXT:    movl (%rdi), %ecx
> > -; SSE42-NEXT:    pinsrq $0, %rcx, %xmm1
> > +; SSE42-NEXT:    pinsrd $0, (%rdi), %xmm1
> >  ; SSE42-NEXT:    testb $2, %al
> >  ; SSE42-NEXT:    je LBB17_4
> >  ; SSE42-NEXT:  LBB17_3: ## %cond.load1
> > -; SSE42-NEXT:    movl 4(%rdi), %eax
> > -; SSE42-NEXT:    pinsrq $1, %rax, %xmm1
> > +; SSE42-NEXT:    pinsrd $1, 4(%rdi), %xmm1
> >  ; SSE42-NEXT:    movdqa %xmm1, %xmm0
> >  ; SSE42-NEXT:    retq
> >  ;
> >  ; AVX1-LABEL: load_v2i32_v2i32:
> >  ; AVX1:       ## %bb.0:
> >  ; AVX1-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > -; AVX1-NEXT:    vpblendw {{.*#+}} xmm0 = xmm0[0,1],xmm2[2,3],xmm0[4,5],xmm2[6,7]
> > -; AVX1-NEXT:    vpcmpeqq %xmm2, %xmm0, %xmm0
> > -; AVX1-NEXT:    vinsertps {{.*#+}} xmm0 = xmm0[0,2],zero,zero
> > +; AVX1-NEXT:    vpcmpeqd %xmm2, %xmm0, %xmm0
> > +; AVX1-NEXT:    vmovq {{.*#+}} xmm0 = xmm0[0],zero
> >  ; AVX1-NEXT:    vmaskmovps (%rdi), %xmm0, %xmm2
> > -; AVX1-NEXT:    vpermilps {{.*#+}} xmm1 = xmm1[0,2,2,3]
> >  ; AVX1-NEXT:    vblendvps %xmm0, %xmm2, %xmm1, %xmm0
> > -; AVX1-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero
> >  ; AVX1-NEXT:    retq
> >  ;
> >  ; AVX2-LABEL: load_v2i32_v2i32:
> >  ; AVX2:       ## %bb.0:
> >  ; AVX2-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > -; AVX2-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm2[1],xmm0[2],xmm2[3]
> > -; AVX2-NEXT:    vpcmpeqq %xmm2, %xmm0, %xmm0
> > -; AVX2-NEXT:    vinsertps {{.*#+}} xmm0 = xmm0[0,2],zero,zero
> > +; AVX2-NEXT:    vpcmpeqd %xmm2, %xmm0, %xmm0
> > +; AVX2-NEXT:    vmovq {{.*#+}} xmm0 = xmm0[0],zero
> >  ; AVX2-NEXT:    vpmaskmovd (%rdi), %xmm0, %xmm2
> > -; AVX2-NEXT:    vpermilps {{.*#+}} xmm1 = xmm1[0,2,2,3]
> >  ; AVX2-NEXT:    vblendvps %xmm0, %xmm2, %xmm1, %xmm0
> > -; AVX2-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero
> >  ; AVX2-NEXT:    retq
> >  ;
> >  ; AVX512F-LABEL: load_v2i32_v2i32:
> >  ; AVX512F:       ## %bb.0:
> > -; AVX512F-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > -; AVX512F-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm2[1],xmm0[2],xmm2[3]
> > -; AVX512F-NEXT:    vptestnmq %zmm0, %zmm0, %k0
> > -; AVX512F-NEXT:    vpshufd {{.*#+}} xmm0 = xmm1[0,2,2,3]
> > +; AVX512F-NEXT:    ## kill: def $xmm1 killed $xmm1 def $zmm1
> > +; AVX512F-NEXT:    ## kill: def $xmm0 killed $xmm0 def $zmm0
> > +; AVX512F-NEXT:    vptestnmd %zmm0, %zmm0, %k0
> >  ; AVX512F-NEXT:    kshiftlw $14, %k0, %k0
> >  ; AVX512F-NEXT:    kshiftrw $14, %k0, %k1
> > -; AVX512F-NEXT:    vmovdqu32 (%rdi), %zmm0 {%k1}
> > -; AVX512F-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero
> > +; AVX512F-NEXT:    vpblendmd (%rdi), %zmm1, %zmm0 {%k1}
> > +; AVX512F-NEXT:    ## kill: def $xmm0 killed $xmm0 killed $zmm0
> >  ; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> >  ;
> > -; AVX512VL-LABEL: load_v2i32_v2i32:
> > -; AVX512VL:       ## %bb.0:
> > -; AVX512VL-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > -; AVX512VL-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm2[1],xmm0[2],xmm2[3]
> > -; AVX512VL-NEXT:    vptestnmq %xmm0, %xmm0, %k1
> > -; AVX512VL-NEXT:    vpshufd {{.*#+}} xmm0 = xmm1[0,2,2,3]
> > -; AVX512VL-NEXT:    vmovdqu32 (%rdi), %xmm0 {%k1}
> > -; AVX512VL-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero
> > -; AVX512VL-NEXT:    retq
> > +; AVX512VLDQ-LABEL: load_v2i32_v2i32:
> > +; AVX512VLDQ:       ## %bb.0:
> > +; AVX512VLDQ-NEXT:    vptestnmd %xmm0, %xmm0, %k0
> > +; AVX512VLDQ-NEXT:    kshiftlb $6, %k0, %k0
> > +; AVX512VLDQ-NEXT:    kshiftrb $6, %k0, %k1
> > +; AVX512VLDQ-NEXT:    vpblendmd (%rdi), %xmm1, %xmm0 {%k1}
> > +; AVX512VLDQ-NEXT:    retq
> > +;
> > +; AVX512VLBW-LABEL: load_v2i32_v2i32:
> > +; AVX512VLBW:       ## %bb.0:
> > +; AVX512VLBW-NEXT:    vptestnmd %xmm0, %xmm0, %k0
> > +; AVX512VLBW-NEXT:    kshiftlw $14, %k0, %k0
> > +; AVX512VLBW-NEXT:    kshiftrw $14, %k0, %k1
> > +; AVX512VLBW-NEXT:    vpblendmd (%rdi), %xmm1, %xmm0 {%k1}
> > +; AVX512VLBW-NEXT:    retq
> >    %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> >    %res = call <2 x i32> @llvm.masked.load.v2i32.p0v2i32(<2 x i32>* %addr, i32 4, <2 x i1> %mask, <2 x i32> %dst)
> >    ret <2 x i32> %res
> >
> > Modified: llvm/trunk/test/CodeGen/X86/masked_store.ll
> > URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/masked_store.ll?rev=368183&r1=368182&r2=368183&view=diff
> > ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/masked_store.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/masked_store.ll Wed Aug  7 09:24:26 2019
> > @@ -165,11 +165,9 @@ define void @store_v4f64_v4i64(<4 x i64>
> >  define void @store_v2f32_v2i32(<2 x i32> %trigger, <2 x float>* %addr, <2 x float> %val) {
> >  ; SSE2-LABEL: store_v2f32_v2i32:
> >  ; SSE2:       ## %bb.0:
> > -; SSE2-NEXT:    pand {{.*}}(%rip), %xmm0
> >  ; SSE2-NEXT:    pxor %xmm2, %xmm2
> >  ; SSE2-NEXT:    pcmpeqd %xmm0, %xmm2
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm2[1,0,3,2]
> > -; SSE2-NEXT:    pand %xmm2, %xmm0
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm2[0,0,1,1]
> >  ; SSE2-NEXT:    movmskpd %xmm0, %eax
> >  ; SSE2-NEXT:    testb $1, %al
> >  ; SSE2-NEXT:    jne LBB3_1
> > @@ -190,8 +188,8 @@ define void @store_v2f32_v2i32(<2 x i32>
> >  ; SSE4-LABEL: store_v2f32_v2i32:
> >  ; SSE4:       ## %bb.0:
> >  ; SSE4-NEXT:    pxor %xmm2, %xmm2
> > -; SSE4-NEXT:    pblendw {{.*#+}} xmm0 = xmm0[0,1],xmm2[2,3],xmm0[4,5],xmm2[6,7]
> > -; SSE4-NEXT:    pcmpeqq %xmm2, %xmm0
> > +; SSE4-NEXT:    pcmpeqd %xmm0, %xmm2
> > +; SSE4-NEXT:    pmovsxdq %xmm2, %xmm0
> >  ; SSE4-NEXT:    movmskpd %xmm0, %eax
> >  ; SSE4-NEXT:    testb $1, %al
> >  ; SSE4-NEXT:    jne LBB3_1
> > @@ -208,43 +206,40 @@ define void @store_v2f32_v2i32(<2 x i32>
> >  ; SSE4-NEXT:    extractps $1, %xmm1, 4(%rdi)
> >  ; SSE4-NEXT:    retq
> >  ;
> > -; AVX1-LABEL: store_v2f32_v2i32:
> > -; AVX1:       ## %bb.0:
> > -; AVX1-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > -; AVX1-NEXT:    vpblendw {{.*#+}} xmm0 = xmm0[0,1],xmm2[2,3],xmm0[4,5],xmm2[6,7]
> > -; AVX1-NEXT:    vpcmpeqq %xmm2, %xmm0, %xmm0
> > -; AVX1-NEXT:    vinsertps {{.*#+}} xmm0 = xmm0[0,2],zero,zero
> > -; AVX1-NEXT:    vmaskmovps %xmm1, %xmm0, (%rdi)
> > -; AVX1-NEXT:    retq
> > -;
> > -; AVX2-LABEL: store_v2f32_v2i32:
> > -; AVX2:       ## %bb.0:
> > -; AVX2-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > -; AVX2-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm2[1],xmm0[2],xmm2[3]
> > -; AVX2-NEXT:    vpcmpeqq %xmm2, %xmm0, %xmm0
> > -; AVX2-NEXT:    vinsertps {{.*#+}} xmm0 = xmm0[0,2],zero,zero
> > -; AVX2-NEXT:    vmaskmovps %xmm1, %xmm0, (%rdi)
> > -; AVX2-NEXT:    retq
> > +; AVX1OR2-LABEL: store_v2f32_v2i32:
> > +; AVX1OR2:       ## %bb.0:
> > +; AVX1OR2-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > +; AVX1OR2-NEXT:    vpcmpeqd %xmm2, %xmm0, %xmm0
> > +; AVX1OR2-NEXT:    vmovq {{.*#+}} xmm0 = xmm0[0],zero
> > +; AVX1OR2-NEXT:    vmaskmovps %xmm1, %xmm0, (%rdi)
> > +; AVX1OR2-NEXT:    retq
> >  ;
> >  ; AVX512F-LABEL: store_v2f32_v2i32:
> >  ; AVX512F:       ## %bb.0:
> >  ; AVX512F-NEXT:    ## kill: def $xmm1 killed $xmm1 def $zmm1
> > -; AVX512F-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > -; AVX512F-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm2[1],xmm0[2],xmm2[3]
> > -; AVX512F-NEXT:    vptestnmq %zmm0, %zmm0, %k0
> > +; AVX512F-NEXT:    ## kill: def $xmm0 killed $xmm0 def $zmm0
> > +; AVX512F-NEXT:    vptestnmd %zmm0, %zmm0, %k0
> >  ; AVX512F-NEXT:    kshiftlw $14, %k0, %k0
> >  ; AVX512F-NEXT:    kshiftrw $14, %k0, %k1
> >  ; AVX512F-NEXT:    vmovups %zmm1, (%rdi) {%k1}
> >  ; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> >  ;
> > -; AVX512VL-LABEL: store_v2f32_v2i32:
> > -; AVX512VL:       ## %bb.0:
> > -; AVX512VL-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > -; AVX512VL-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm2[1],xmm0[2],xmm2[3]
> > -; AVX512VL-NEXT:    vptestnmq %xmm0, %xmm0, %k1
> > -; AVX512VL-NEXT:    vmovups %xmm1, (%rdi) {%k1}
> > -; AVX512VL-NEXT:    retq
> > +; AVX512VLDQ-LABEL: store_v2f32_v2i32:
> > +; AVX512VLDQ:       ## %bb.0:
> > +; AVX512VLDQ-NEXT:    vptestnmd %xmm0, %xmm0, %k0
> > +; AVX512VLDQ-NEXT:    kshiftlb $6, %k0, %k0
> > +; AVX512VLDQ-NEXT:    kshiftrb $6, %k0, %k1
> > +; AVX512VLDQ-NEXT:    vmovups %xmm1, (%rdi) {%k1}
> > +; AVX512VLDQ-NEXT:    retq
> > +;
> > +; AVX512VLBW-LABEL: store_v2f32_v2i32:
> > +; AVX512VLBW:       ## %bb.0:
> > +; AVX512VLBW-NEXT:    vptestnmd %xmm0, %xmm0, %k0
> > +; AVX512VLBW-NEXT:    kshiftlw $14, %k0, %k0
> > +; AVX512VLBW-NEXT:    kshiftrw $14, %k0, %k1
> > +; AVX512VLBW-NEXT:    vmovups %xmm1, (%rdi) {%k1}
> > +; AVX512VLBW-NEXT:    retq
> >    %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> >    call void @llvm.masked.store.v2f32.p0v2f32(<2 x float> %val, <2 x float>* %addr, i32 4, <2 x i1> %mask)
> >    ret void
> > @@ -1046,11 +1041,9 @@ define void @store_v1i32_v1i32(<1 x i32>
> >  define void @store_v2i32_v2i32(<2 x i32> %trigger, <2 x i32>* %addr, <2 x i32> %val) {
> >  ; SSE2-LABEL: store_v2i32_v2i32:
> >  ; SSE2:       ## %bb.0:
> > -; SSE2-NEXT:    pand {{.*}}(%rip), %xmm0
> >  ; SSE2-NEXT:    pxor %xmm2, %xmm2
> >  ; SSE2-NEXT:    pcmpeqd %xmm0, %xmm2
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm2[1,0,3,2]
> > -; SSE2-NEXT:    pand %xmm2, %xmm0
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm2[0,0,1,1]
> >  ; SSE2-NEXT:    movmskpd %xmm0, %eax
> >  ; SSE2-NEXT:    testb $1, %al
> >  ; SSE2-NEXT:    jne LBB10_1
> > @@ -1064,15 +1057,15 @@ define void @store_v2i32_v2i32(<2 x i32>
> >  ; SSE2-NEXT:    testb $2, %al
> >  ; SSE2-NEXT:    je LBB10_4
> >  ; SSE2-NEXT:  LBB10_3: ## %cond.store1
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm1[2,3,0,1]
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm1[1,1,2,3]
> >  ; SSE2-NEXT:    movd %xmm0, 4(%rdi)
> >  ; SSE2-NEXT:    retq
> >  ;
> >  ; SSE4-LABEL: store_v2i32_v2i32:
> >  ; SSE4:       ## %bb.0:
> >  ; SSE4-NEXT:    pxor %xmm2, %xmm2
> > -; SSE4-NEXT:    pblendw {{.*#+}} xmm0 = xmm0[0,1],xmm2[2,3],xmm0[4,5],xmm2[6,7]
> > -; SSE4-NEXT:    pcmpeqq %xmm2, %xmm0
> > +; SSE4-NEXT:    pcmpeqd %xmm0, %xmm2
> > +; SSE4-NEXT:    pmovsxdq %xmm2, %xmm0
> >  ; SSE4-NEXT:    movmskpd %xmm0, %eax
> >  ; SSE4-NEXT:    testb $1, %al
> >  ; SSE4-NEXT:    jne LBB10_1
> > @@ -1086,48 +1079,51 @@ define void @store_v2i32_v2i32(<2 x i32>
> >  ; SSE4-NEXT:    testb $2, %al
> >  ; SSE4-NEXT:    je LBB10_4
> >  ; SSE4-NEXT:  LBB10_3: ## %cond.store1
> > -; SSE4-NEXT:    extractps $2, %xmm1, 4(%rdi)
> > +; SSE4-NEXT:    extractps $1, %xmm1, 4(%rdi)
> >  ; SSE4-NEXT:    retq
> >  ;
> >  ; AVX1-LABEL: store_v2i32_v2i32:
> >  ; AVX1:       ## %bb.0:
> >  ; AVX1-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > -; AVX1-NEXT:    vpblendw {{.*#+}} xmm0 = xmm0[0,1],xmm2[2,3],xmm0[4,5],xmm2[6,7]
> > -; AVX1-NEXT:    vpcmpeqq %xmm2, %xmm0, %xmm0
> > -; AVX1-NEXT:    vinsertps {{.*#+}} xmm0 = xmm0[0,2],zero,zero
> > -; AVX1-NEXT:    vpermilps {{.*#+}} xmm1 = xmm1[0,2,2,3]
> > +; AVX1-NEXT:    vpcmpeqd %xmm2, %xmm0, %xmm0
> > +; AVX1-NEXT:    vmovq {{.*#+}} xmm0 = xmm0[0],zero
> >  ; AVX1-NEXT:    vmaskmovps %xmm1, %xmm0, (%rdi)
> >  ; AVX1-NEXT:    retq
> >  ;
> >  ; AVX2-LABEL: store_v2i32_v2i32:
> >  ; AVX2:       ## %bb.0:
> >  ; AVX2-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > -; AVX2-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm2[1],xmm0[2],xmm2[3]
> > -; AVX2-NEXT:    vpcmpeqq %xmm2, %xmm0, %xmm0
> > -; AVX2-NEXT:    vinsertps {{.*#+}} xmm0 = xmm0[0,2],zero,zero
> > -; AVX2-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
> > +; AVX2-NEXT:    vpcmpeqd %xmm2, %xmm0, %xmm0
> > +; AVX2-NEXT:    vmovq {{.*#+}} xmm0 = xmm0[0],zero
> >  ; AVX2-NEXT:    vpmaskmovd %xmm1, %xmm0, (%rdi)
> >  ; AVX2-NEXT:    retq
> >  ;
> >  ; AVX512F-LABEL: store_v2i32_v2i32:
> >  ; AVX512F:       ## %bb.0:
> > -; AVX512F-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > -; AVX512F-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm2[1],xmm0[2],xmm2[3]
> > -; AVX512F-NEXT:    vptestnmq %zmm0, %zmm0, %k0
> > -; AVX512F-NEXT:    vpshufd {{.*#+}} xmm0 = xmm1[0,2,2,3]
> > +; AVX512F-NEXT:    ## kill: def $xmm1 killed $xmm1 def $zmm1
> > +; AVX512F-NEXT:    ## kill: def $xmm0 killed $xmm0 def $zmm0
> > +; AVX512F-NEXT:    vptestnmd %zmm0, %zmm0, %k0
> >  ; AVX512F-NEXT:    kshiftlw $14, %k0, %k0
> >  ; AVX512F-NEXT:    kshiftrw $14, %k0, %k1
> > -; AVX512F-NEXT:    vmovdqu32 %zmm0, (%rdi) {%k1}
> > +; AVX512F-NEXT:    vmovdqu32 %zmm1, (%rdi) {%k1}
> >  ; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> >  ;
> > -; AVX512VL-LABEL: store_v2i32_v2i32:
> > -; AVX512VL:       ## %bb.0:
> > -; AVX512VL-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > -; AVX512VL-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm2[1],xmm0[2],xmm2[3]
> > -; AVX512VL-NEXT:    vptestnmq %xmm0, %xmm0, %k1
> > -; AVX512VL-NEXT:    vpmovqd %xmm1, (%rdi) {%k1}
> > -; AVX512VL-NEXT:    retq
> > +; AVX512VLDQ-LABEL: store_v2i32_v2i32:
> > +; AVX512VLDQ:       ## %bb.0:
> > +; AVX512VLDQ-NEXT:    vptestnmd %xmm0, %xmm0, %k0
> > +; AVX512VLDQ-NEXT:    kshiftlb $6, %k0, %k0
> > +; AVX512VLDQ-NEXT:    kshiftrb $6, %k0, %k1
> > +; AVX512VLDQ-NEXT:    vmovdqu32 %xmm1, (%rdi) {%k1}
> > +; AVX512VLDQ-NEXT:    retq
> > +;
> > +; AVX512VLBW-LABEL: store_v2i32_v2i32:
> > +; AVX512VLBW:       ## %bb.0:
> > +; AVX512VLBW-NEXT:    vptestnmd %xmm0, %xmm0, %k0
> > +; AVX512VLBW-NEXT:    kshiftlw $14, %k0, %k0
> > +; AVX512VLBW-NEXT:    kshiftrw $14, %k0, %k1
> > +; AVX512VLBW-NEXT:    vmovdqu32 %xmm1, (%rdi) {%k1}
> > +; AVX512VLBW-NEXT:    retq
> >    %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> >    call void @llvm.masked.store.v2i32.p0v2i32(<2 x i32> %val, <2 x i32>* %addr, i32 4, <2 x i1> %mask)
> >    ret void
> >
> > Modified: llvm/trunk/test/CodeGen/X86/masked_store_trunc.ll
> > URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/masked_store_trunc.ll?rev=368183&r1=368182&r2=368183&view=diff
> > ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/masked_store_trunc.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/masked_store_trunc.ll Wed Aug  7 09:24:26 2019
> > @@ -615,17 +615,15 @@ define void @truncstore_v8i64_v8i8(<8 x
> >  ; SSE2-LABEL: truncstore_v8i64_v8i8:
> >  ; SSE2:       # %bb.0:
> >  ; SSE2-NEXT:    pxor %xmm6, %xmm6
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
> > -; SSE2-NEXT:    pshuflw {{.*#+}} xmm1 = xmm1[0,2,2,3,4,5,6,7]
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > -; SSE2-NEXT:    pshuflw {{.*#+}} xmm7 = xmm0[0,2,2,3,4,5,6,7]
> > -; SSE2-NEXT:    punpckldq {{.*#+}} xmm7 = xmm7[0],xmm1[0],xmm7[1],xmm1[1]
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm3[0,2,2,3]
> > -; SSE2-NEXT:    pshuflw {{.*#+}} xmm1 = xmm0[0,1,0,2,4,5,6,7]
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm2[0,2,2,3]
> > -; SSE2-NEXT:    pshuflw {{.*#+}} xmm0 = xmm0[0,1,0,2,4,5,6,7]
> > -; SSE2-NEXT:    punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
> > -; SSE2-NEXT:    movsd {{.*#+}} xmm0 = xmm7[0],xmm0[1]
> > +; SSE2-NEXT:    movdqa {{.*#+}} xmm7 = [255,0,0,0,0,0,0,0,255,0,0,0,0,0,0,0]
> > +; SSE2-NEXT:    pand %xmm7, %xmm3
> > +; SSE2-NEXT:    pand %xmm7, %xmm2
> > +; SSE2-NEXT:    packuswb %xmm3, %xmm2
> > +; SSE2-NEXT:    pand %xmm7, %xmm1
> > +; SSE2-NEXT:    pand %xmm7, %xmm0
> > +; SSE2-NEXT:    packuswb %xmm1, %xmm0
> > +; SSE2-NEXT:    packuswb %xmm2, %xmm0
> > +; SSE2-NEXT:    packuswb %xmm0, %xmm0
> >  ; SSE2-NEXT:    pcmpeqd %xmm6, %xmm5
> >  ; SSE2-NEXT:    pcmpeqd %xmm1, %xmm1
> >  ; SSE2-NEXT:    pxor %xmm1, %xmm5
> > @@ -645,17 +643,26 @@ define void @truncstore_v8i64_v8i8(<8 x
> >  ; SSE2-NEXT:    jne .LBB2_5
> >  ; SSE2-NEXT:  .LBB2_6: # %else4
> >  ; SSE2-NEXT:    testb $8, %al
> > -; SSE2-NEXT:    jne .LBB2_7
> > +; SSE2-NEXT:    je .LBB2_8
> > +; SSE2-NEXT:  .LBB2_7: # %cond.store5
> > +; SSE2-NEXT:    shrl $24, %ecx
> > +; SSE2-NEXT:    movb %cl, 3(%rdi)
> >  ; SSE2-NEXT:  .LBB2_8: # %else6
> >  ; SSE2-NEXT:    testb $16, %al
> > -; SSE2-NEXT:    jne .LBB2_9
> > +; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> > +; SSE2-NEXT:    je .LBB2_10
> > +; SSE2-NEXT:  # %bb.9: # %cond.store7
> > +; SSE2-NEXT:    movb %cl, 4(%rdi)
> >  ; SSE2-NEXT:  .LBB2_10: # %else8
> >  ; SSE2-NEXT:    testb $32, %al
> > -; SSE2-NEXT:    jne .LBB2_11
> > +; SSE2-NEXT:    je .LBB2_12
> > +; SSE2-NEXT:  # %bb.11: # %cond.store9
> > +; SSE2-NEXT:    movb %ch, 5(%rdi)
> >  ; SSE2-NEXT:  .LBB2_12: # %else10
> >  ; SSE2-NEXT:    testb $64, %al
> > +; SSE2-NEXT:    pextrw $3, %xmm0, %ecx
> >  ; SSE2-NEXT:    jne .LBB2_13
> > -; SSE2-NEXT:  .LBB2_14: # %else12
> > +; SSE2-NEXT:  # %bb.14: # %else12
> >  ; SSE2-NEXT:    testb $-128, %al
> >  ; SSE2-NEXT:    jne .LBB2_15
> >  ; SSE2-NEXT:  .LBB2_16: # %else14
> > @@ -665,50 +672,36 @@ define void @truncstore_v8i64_v8i8(<8 x
> >  ; SSE2-NEXT:    testb $2, %al
> >  ; SSE2-NEXT:    je .LBB2_4
> >  ; SSE2-NEXT:  .LBB2_3: # %cond.store1
> > -; SSE2-NEXT:    shrl $16, %ecx
> > -; SSE2-NEXT:    movb %cl, 1(%rdi)
> > +; SSE2-NEXT:    movb %ch, 1(%rdi)
> >  ; SSE2-NEXT:    testb $4, %al
> >  ; SSE2-NEXT:    je .LBB2_6
> >  ; SSE2-NEXT:  .LBB2_5: # %cond.store3
> > -; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> > -; SSE2-NEXT:    movb %cl, 2(%rdi)
> > +; SSE2-NEXT:    movl %ecx, %edx
> > +; SSE2-NEXT:    shrl $16, %edx
> > +; SSE2-NEXT:    movb %dl, 2(%rdi)
> >  ; SSE2-NEXT:    testb $8, %al
> > -; SSE2-NEXT:    je .LBB2_8
> > -; SSE2-NEXT:  .LBB2_7: # %cond.store5
> > -; SSE2-NEXT:    pextrw $3, %xmm0, %ecx
> > -; SSE2-NEXT:    movb %cl, 3(%rdi)
> > -; SSE2-NEXT:    testb $16, %al
> > -; SSE2-NEXT:    je .LBB2_10
> > -; SSE2-NEXT:  .LBB2_9: # %cond.store7
> > -; SSE2-NEXT:    pextrw $4, %xmm0, %ecx
> > -; SSE2-NEXT:    movb %cl, 4(%rdi)
> > -; SSE2-NEXT:    testb $32, %al
> > -; SSE2-NEXT:    je .LBB2_12
> > -; SSE2-NEXT:  .LBB2_11: # %cond.store9
> > -; SSE2-NEXT:    pextrw $5, %xmm0, %ecx
> > -; SSE2-NEXT:    movb %cl, 5(%rdi)
> > -; SSE2-NEXT:    testb $64, %al
> > -; SSE2-NEXT:    je .LBB2_14
> > +; SSE2-NEXT:    jne .LBB2_7
> > +; SSE2-NEXT:    jmp .LBB2_8
> >  ; SSE2-NEXT:  .LBB2_13: # %cond.store11
> > -; SSE2-NEXT:    pextrw $6, %xmm0, %ecx
> >  ; SSE2-NEXT:    movb %cl, 6(%rdi)
> >  ; SSE2-NEXT:    testb $-128, %al
> >  ; SSE2-NEXT:    je .LBB2_16
> >  ; SSE2-NEXT:  .LBB2_15: # %cond.store13
> > -; SSE2-NEXT:    pextrw $7, %xmm0, %eax
> > -; SSE2-NEXT:    movb %al, 7(%rdi)
> > +; SSE2-NEXT:    movb %ch, 7(%rdi)
> >  ; SSE2-NEXT:    retq
> >  ;
> >  ; SSE4-LABEL: truncstore_v8i64_v8i8:
> >  ; SSE4:       # %bb.0:
> >  ; SSE4-NEXT:    pxor %xmm6, %xmm6
> > -; SSE4-NEXT:    pblendw {{.*#+}} xmm3 = xmm3[0],xmm6[1,2,3],xmm3[4],xmm6[5,6,7]
> > -; SSE4-NEXT:    pblendw {{.*#+}} xmm2 = xmm2[0],xmm6[1,2,3],xmm2[4],xmm6[5,6,7]
> > +; SSE4-NEXT:    movdqa {{.*#+}} xmm7 = [255,0,0,0,0,0,0,0,255,0,0,0,0,0,0,0]
> > +; SSE4-NEXT:    pand %xmm7, %xmm3
> > +; SSE4-NEXT:    pand %xmm7, %xmm2
> >  ; SSE4-NEXT:    packusdw %xmm3, %xmm2
> > -; SSE4-NEXT:    pblendw {{.*#+}} xmm1 = xmm1[0],xmm6[1,2,3],xmm1[4],xmm6[5,6,7]
> > -; SSE4-NEXT:    pblendw {{.*#+}} xmm0 = xmm0[0],xmm6[1,2,3],xmm0[4],xmm6[5,6,7]
> > +; SSE4-NEXT:    pand %xmm7, %xmm1
> > +; SSE4-NEXT:    pand %xmm7, %xmm0
> >  ; SSE4-NEXT:    packusdw %xmm1, %xmm0
> >  ; SSE4-NEXT:    packusdw %xmm2, %xmm0
> > +; SSE4-NEXT:    packuswb %xmm0, %xmm0
> >  ; SSE4-NEXT:    pcmpeqd %xmm6, %xmm5
> >  ; SSE4-NEXT:    pcmpeqd %xmm1, %xmm1
> >  ; SSE4-NEXT:    pxor %xmm1, %xmm5
> > @@ -747,36 +740,36 @@ define void @truncstore_v8i64_v8i8(<8 x
> >  ; SSE4-NEXT:    testb $2, %al
> >  ; SSE4-NEXT:    je .LBB2_4
> >  ; SSE4-NEXT:  .LBB2_3: # %cond.store1
> > -; SSE4-NEXT:    pextrb $2, %xmm0, 1(%rdi)
> > +; SSE4-NEXT:    pextrb $1, %xmm0, 1(%rdi)
> >  ; SSE4-NEXT:    testb $4, %al
> >  ; SSE4-NEXT:    je .LBB2_6
> >  ; SSE4-NEXT:  .LBB2_5: # %cond.store3
> > -; SSE4-NEXT:    pextrb $4, %xmm0, 2(%rdi)
> > +; SSE4-NEXT:    pextrb $2, %xmm0, 2(%rdi)
> >  ; SSE4-NEXT:    testb $8, %al
> >  ; SSE4-NEXT:    je .LBB2_8
> >  ; SSE4-NEXT:  .LBB2_7: # %cond.store5
> > -; SSE4-NEXT:    pextrb $6, %xmm0, 3(%rdi)
> > +; SSE4-NEXT:    pextrb $3, %xmm0, 3(%rdi)
> >  ; SSE4-NEXT:    testb $16, %al
> >  ; SSE4-NEXT:    je .LBB2_10
> >  ; SSE4-NEXT:  .LBB2_9: # %cond.store7
> > -; SSE4-NEXT:    pextrb $8, %xmm0, 4(%rdi)
> > +; SSE4-NEXT:    pextrb $4, %xmm0, 4(%rdi)
> >  ; SSE4-NEXT:    testb $32, %al
> >  ; SSE4-NEXT:    je .LBB2_12
> >  ; SSE4-NEXT:  .LBB2_11: # %cond.store9
> > -; SSE4-NEXT:    pextrb $10, %xmm0, 5(%rdi)
> > +; SSE4-NEXT:    pextrb $5, %xmm0, 5(%rdi)
> >  ; SSE4-NEXT:    testb $64, %al
> >  ; SSE4-NEXT:    je .LBB2_14
> >  ; SSE4-NEXT:  .LBB2_13: # %cond.store11
> > -; SSE4-NEXT:    pextrb $12, %xmm0, 6(%rdi)
> > +; SSE4-NEXT:    pextrb $6, %xmm0, 6(%rdi)
> >  ; SSE4-NEXT:    testb $-128, %al
> >  ; SSE4-NEXT:    je .LBB2_16
> >  ; SSE4-NEXT:  .LBB2_15: # %cond.store13
> > -; SSE4-NEXT:    pextrb $14, %xmm0, 7(%rdi)
> > +; SSE4-NEXT:    pextrb $7, %xmm0, 7(%rdi)
> >  ; SSE4-NEXT:    retq
> >  ;
> >  ; AVX1-LABEL: truncstore_v8i64_v8i8:
> >  ; AVX1:       # %bb.0:
> > -; AVX1-NEXT:    vmovaps {{.*#+}} ymm3 = [65535,65535,65535,65535]
> > +; AVX1-NEXT:    vmovaps {{.*#+}} ymm3 = [255,255,255,255]
> >  ; AVX1-NEXT:    vandps %ymm3, %ymm1, %ymm1
> >  ; AVX1-NEXT:    vextractf128 $1, %ymm1, %xmm4
> >  ; AVX1-NEXT:    vpackusdw %xmm4, %xmm1, %xmm1
> > @@ -784,6 +777,7 @@ define void @truncstore_v8i64_v8i8(<8 x
> >  ; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm3
> >  ; AVX1-NEXT:    vpackusdw %xmm3, %xmm0, %xmm0
> >  ; AVX1-NEXT:    vpackusdw %xmm1, %xmm0, %xmm0
> > +; AVX1-NEXT:    vpackuswb %xmm0, %xmm0, %xmm0
> >  ; AVX1-NEXT:    vextractf128 $1, %ymm2, %xmm1
> >  ; AVX1-NEXT:    vpxor %xmm3, %xmm3, %xmm3
> >  ; AVX1-NEXT:    vpcmpeqd %xmm3, %xmm1, %xmm1
> > @@ -822,44 +816,48 @@ define void @truncstore_v8i64_v8i8(<8 x
> >  ; AVX1-NEXT:    testb $2, %al
> >  ; AVX1-NEXT:    je .LBB2_4
> >  ; AVX1-NEXT:  .LBB2_3: # %cond.store1
> > -; AVX1-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> > +; AVX1-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX1-NEXT:    testb $4, %al
> >  ; AVX1-NEXT:    je .LBB2_6
> >  ; AVX1-NEXT:  .LBB2_5: # %cond.store3
> > -; AVX1-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> > +; AVX1-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> >  ; AVX1-NEXT:    testb $8, %al
> >  ; AVX1-NEXT:    je .LBB2_8
> >  ; AVX1-NEXT:  .LBB2_7: # %cond.store5
> > -; AVX1-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> > +; AVX1-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> >  ; AVX1-NEXT:    testb $16, %al
> >  ; AVX1-NEXT:    je .LBB2_10
> >  ; AVX1-NEXT:  .LBB2_9: # %cond.store7
> > -; AVX1-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> > +; AVX1-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
> >  ; AVX1-NEXT:    testb $32, %al
> >  ; AVX1-NEXT:    je .LBB2_12
> >  ; AVX1-NEXT:  .LBB2_11: # %cond.store9
> > -; AVX1-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> > +; AVX1-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
> >  ; AVX1-NEXT:    testb $64, %al
> >  ; AVX1-NEXT:    je .LBB2_14
> >  ; AVX1-NEXT:  .LBB2_13: # %cond.store11
> > -; AVX1-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> > +; AVX1-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
> >  ; AVX1-NEXT:    testb $-128, %al
> >  ; AVX1-NEXT:    je .LBB2_16
> >  ; AVX1-NEXT:  .LBB2_15: # %cond.store13
> > -; AVX1-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> > +; AVX1-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
> >  ; AVX1-NEXT:    vzeroupper
> >  ; AVX1-NEXT:    retq
> >  ;
> >  ; AVX2-LABEL: truncstore_v8i64_v8i8:
> >  ; AVX2:       # %bb.0:
> >  ; AVX2-NEXT:    vpxor %xmm3, %xmm3, %xmm3
> > -; AVX2-NEXT:    vextractf128 $1, %ymm1, %xmm4
> > -; AVX2-NEXT:    vshufps {{.*#+}} xmm1 = xmm1[0,2],xmm4[0,2]
> > -; AVX2-NEXT:    vextractf128 $1, %ymm0, %xmm4
> > -; AVX2-NEXT:    vshufps {{.*#+}} xmm0 = xmm0[0,2],xmm4[0,2]
> > -; AVX2-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
> > -; AVX2-NEXT:    vpshufb {{.*#+}} ymm0 = ymm0[0,1,4,5,8,9,12,13,8,9,12,13,12,13,14,15,16,17,20,21,24,25,28,29,24,25,28,29,28,29,30,31]
> > -; AVX2-NEXT:    vpermq {{.*#+}} ymm0 = ymm0[0,2,2,3]
> > +; AVX2-NEXT:    vextracti128 $1, %ymm1, %xmm4
> > +; AVX2-NEXT:    vmovdqa {{.*#+}} xmm5 = <u,u,0,8,u,u,u,u,u,u,u,u,u,u,u,u>
> > +; AVX2-NEXT:    vpshufb %xmm5, %xmm4, %xmm4
> > +; AVX2-NEXT:    vpshufb %xmm5, %xmm1, %xmm1
> > +; AVX2-NEXT:    vpunpcklwd {{.*#+}} xmm1 = xmm1[0],xmm4[0],xmm1[1],xmm4[1],xmm1[2],xmm4[2],xmm1[3],xmm4[3]
> > +; AVX2-NEXT:    vextracti128 $1, %ymm0, %xmm4
> > +; AVX2-NEXT:    vmovdqa {{.*#+}} xmm5 = <0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u>
> > +; AVX2-NEXT:    vpshufb %xmm5, %xmm4, %xmm4
> > +; AVX2-NEXT:    vpshufb %xmm5, %xmm0, %xmm0
> > +; AVX2-NEXT:    vpunpcklwd {{.*#+}} xmm0 = xmm0[0],xmm4[0],xmm0[1],xmm4[1],xmm0[2],xmm4[2],xmm0[3],xmm4[3]
> > +; AVX2-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm1[1],xmm0[2,3]
> >  ; AVX2-NEXT:    vpcmpeqd %ymm3, %ymm2, %ymm1
> >  ; AVX2-NEXT:    vmovmskps %ymm1, %eax
> >  ; AVX2-NEXT:    notl %eax
> > @@ -894,31 +892,31 @@ define void @truncstore_v8i64_v8i8(<8 x
> >  ; AVX2-NEXT:    testb $2, %al
> >  ; AVX2-NEXT:    je .LBB2_4
> >  ; AVX2-NEXT:  .LBB2_3: # %cond.store1
> > -; AVX2-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> > +; AVX2-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX2-NEXT:    testb $4, %al
> >  ; AVX2-NEXT:    je .LBB2_6
> >  ; AVX2-NEXT:  .LBB2_5: # %cond.store3
> > -; AVX2-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> > +; AVX2-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> >  ; AVX2-NEXT:    testb $8, %al
> >  ; AVX2-NEXT:    je .LBB2_8
> >  ; AVX2-NEXT:  .LBB2_7: # %cond.store5
> > -; AVX2-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> > +; AVX2-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> >  ; AVX2-NEXT:    testb $16, %al
> >  ; AVX2-NEXT:    je .LBB2_10
> >  ; AVX2-NEXT:  .LBB2_9: # %cond.store7
> > -; AVX2-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> > +; AVX2-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
> >  ; AVX2-NEXT:    testb $32, %al
> >  ; AVX2-NEXT:    je .LBB2_12
> >  ; AVX2-NEXT:  .LBB2_11: # %cond.store9
> > -; AVX2-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> > +; AVX2-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
> >  ; AVX2-NEXT:    testb $64, %al
> >  ; AVX2-NEXT:    je .LBB2_14
> >  ; AVX2-NEXT:  .LBB2_13: # %cond.store11
> > -; AVX2-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> > +; AVX2-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
> >  ; AVX2-NEXT:    testb $-128, %al
> >  ; AVX2-NEXT:    je .LBB2_16
> >  ; AVX2-NEXT:  .LBB2_15: # %cond.store13
> > -; AVX2-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> > +; AVX2-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
> >  ; AVX2-NEXT:    vzeroupper
> >  ; AVX2-NEXT:    retq
> >  ;
> > @@ -926,7 +924,7 @@ define void @truncstore_v8i64_v8i8(<8 x
> >  ; AVX512F:       # %bb.0:
> >  ; AVX512F-NEXT:    # kill: def $ymm1 killed $ymm1 def $zmm1
> >  ; AVX512F-NEXT:    vptestmd %zmm1, %zmm1, %k0
> > -; AVX512F-NEXT:    vpmovqw %zmm0, %xmm0
> > +; AVX512F-NEXT:    vpmovqb %zmm0, %xmm0
> >  ; AVX512F-NEXT:    kmovw %k0, %eax
> >  ; AVX512F-NEXT:    testb $1, %al
> >  ; AVX512F-NEXT:    jne .LBB2_1
> > @@ -959,31 +957,31 @@ define void @truncstore_v8i64_v8i8(<8 x
> >  ; AVX512F-NEXT:    testb $2, %al
> >  ; AVX512F-NEXT:    je .LBB2_4
> >  ; AVX512F-NEXT:  .LBB2_3: # %cond.store1
> > -; AVX512F-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> > +; AVX512F-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX512F-NEXT:    testb $4, %al
> >  ; AVX512F-NEXT:    je .LBB2_6
> >  ; AVX512F-NEXT:  .LBB2_5: # %cond.store3
> > -; AVX512F-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> > +; AVX512F-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> >  ; AVX512F-NEXT:    testb $8, %al
> >  ; AVX512F-NEXT:    je .LBB2_8
> >  ; AVX512F-NEXT:  .LBB2_7: # %cond.store5
> > -; AVX512F-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> > +; AVX512F-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> >  ; AVX512F-NEXT:    testb $16, %al
> >  ; AVX512F-NEXT:    je .LBB2_10
> >  ; AVX512F-NEXT:  .LBB2_9: # %cond.store7
> > -; AVX512F-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> > +; AVX512F-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
> >  ; AVX512F-NEXT:    testb $32, %al
> >  ; AVX512F-NEXT:    je .LBB2_12
> >  ; AVX512F-NEXT:  .LBB2_11: # %cond.store9
> > -; AVX512F-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> > +; AVX512F-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
> >  ; AVX512F-NEXT:    testb $64, %al
> >  ; AVX512F-NEXT:    je .LBB2_14
> >  ; AVX512F-NEXT:  .LBB2_13: # %cond.store11
> > -; AVX512F-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> > +; AVX512F-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
> >  ; AVX512F-NEXT:    testb $-128, %al
> >  ; AVX512F-NEXT:    je .LBB2_16
> >  ; AVX512F-NEXT:  .LBB2_15: # %cond.store13
> > -; AVX512F-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> > +; AVX512F-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
> >  ; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> >  ;
> > @@ -1147,7 +1145,11 @@ define void @truncstore_v4i64_v4i16(<4 x
> >  ; SSE2-LABEL: truncstore_v4i64_v4i16:
> >  ; SSE2:       # %bb.0:
> >  ; SSE2-NEXT:    pxor %xmm3, %xmm3
> > -; SSE2-NEXT:    shufps {{.*#+}} xmm0 = xmm0[0,2],xmm1[0,2]
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
> > +; SSE2-NEXT:    pshuflw {{.*#+}} xmm1 = xmm1[0,2,2,3,4,5,6,7]
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > +; SSE2-NEXT:    pshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> > +; SSE2-NEXT:    punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
> >  ; SSE2-NEXT:    pcmpeqd %xmm2, %xmm3
> >  ; SSE2-NEXT:    movmskps %xmm3, %eax
> >  ; SSE2-NEXT:    xorl $15, %eax
> > @@ -1170,24 +1172,28 @@ define void @truncstore_v4i64_v4i16(<4 x
> >  ; SSE2-NEXT:    testb $2, %al
> >  ; SSE2-NEXT:    je .LBB4_4
> >  ; SSE2-NEXT:  .LBB4_3: # %cond.store1
> > -; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> > +; SSE2-NEXT:    pextrw $1, %xmm0, %ecx
> >  ; SSE2-NEXT:    movw %cx, 2(%rdi)
> >  ; SSE2-NEXT:    testb $4, %al
> >  ; SSE2-NEXT:    je .LBB4_6
> >  ; SSE2-NEXT:  .LBB4_5: # %cond.store3
> > -; SSE2-NEXT:    pextrw $4, %xmm0, %ecx
> > +; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> >  ; SSE2-NEXT:    movw %cx, 4(%rdi)
> >  ; SSE2-NEXT:    testb $8, %al
> >  ; SSE2-NEXT:    je .LBB4_8
> >  ; SSE2-NEXT:  .LBB4_7: # %cond.store5
> > -; SSE2-NEXT:    pextrw $6, %xmm0, %eax
> > +; SSE2-NEXT:    pextrw $3, %xmm0, %eax
> >  ; SSE2-NEXT:    movw %ax, 6(%rdi)
> >  ; SSE2-NEXT:    retq
> >  ;
> >  ; SSE4-LABEL: truncstore_v4i64_v4i16:
> >  ; SSE4:       # %bb.0:
> >  ; SSE4-NEXT:    pxor %xmm3, %xmm3
> > -; SSE4-NEXT:    shufps {{.*#+}} xmm0 = xmm0[0,2],xmm1[0,2]
> > +; SSE4-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
> > +; SSE4-NEXT:    pshuflw {{.*#+}} xmm1 = xmm1[0,2,2,3,4,5,6,7]
> > +; SSE4-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > +; SSE4-NEXT:    pshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> > +; SSE4-NEXT:    punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
> >  ; SSE4-NEXT:    pcmpeqd %xmm2, %xmm3
> >  ; SSE4-NEXT:    movmskps %xmm3, %eax
> >  ; SSE4-NEXT:    xorl $15, %eax
> > @@ -1209,62 +1215,109 @@ define void @truncstore_v4i64_v4i16(<4 x
> >  ; SSE4-NEXT:    testb $2, %al
> >  ; SSE4-NEXT:    je .LBB4_4
> >  ; SSE4-NEXT:  .LBB4_3: # %cond.store1
> > -; SSE4-NEXT:    pextrw $2, %xmm0, 2(%rdi)
> > +; SSE4-NEXT:    pextrw $1, %xmm0, 2(%rdi)
> >  ; SSE4-NEXT:    testb $4, %al
> >  ; SSE4-NEXT:    je .LBB4_6
> >  ; SSE4-NEXT:  .LBB4_5: # %cond.store3
> > -; SSE4-NEXT:    pextrw $4, %xmm0, 4(%rdi)
> > +; SSE4-NEXT:    pextrw $2, %xmm0, 4(%rdi)
> >  ; SSE4-NEXT:    testb $8, %al
> >  ; SSE4-NEXT:    je .LBB4_8
> >  ; SSE4-NEXT:  .LBB4_7: # %cond.store5
> > -; SSE4-NEXT:    pextrw $6, %xmm0, 6(%rdi)
> > +; SSE4-NEXT:    pextrw $3, %xmm0, 6(%rdi)
> >  ; SSE4-NEXT:    retq
> >  ;
> > -; AVX-LABEL: truncstore_v4i64_v4i16:
> > -; AVX:       # %bb.0:
> > -; AVX-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > -; AVX-NEXT:    vextractf128 $1, %ymm0, %xmm3
> > -; AVX-NEXT:    vshufps {{.*#+}} xmm0 = xmm0[0,2],xmm3[0,2]
> > -; AVX-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
> > -; AVX-NEXT:    vmovmskps %xmm1, %eax
> > -; AVX-NEXT:    xorl $15, %eax
> > -; AVX-NEXT:    testb $1, %al
> > -; AVX-NEXT:    jne .LBB4_1
> > -; AVX-NEXT:  # %bb.2: # %else
> > -; AVX-NEXT:    testb $2, %al
> > -; AVX-NEXT:    jne .LBB4_3
> > -; AVX-NEXT:  .LBB4_4: # %else2
> > -; AVX-NEXT:    testb $4, %al
> > -; AVX-NEXT:    jne .LBB4_5
> > -; AVX-NEXT:  .LBB4_6: # %else4
> > -; AVX-NEXT:    testb $8, %al
> > -; AVX-NEXT:    jne .LBB4_7
> > -; AVX-NEXT:  .LBB4_8: # %else6
> > -; AVX-NEXT:    vzeroupper
> > -; AVX-NEXT:    retq
> > -; AVX-NEXT:  .LBB4_1: # %cond.store
> > -; AVX-NEXT:    vpextrw $0, %xmm0, (%rdi)
> > -; AVX-NEXT:    testb $2, %al
> > -; AVX-NEXT:    je .LBB4_4
> > -; AVX-NEXT:  .LBB4_3: # %cond.store1
> > -; AVX-NEXT:    vpextrw $2, %xmm0, 2(%rdi)
> > -; AVX-NEXT:    testb $4, %al
> > -; AVX-NEXT:    je .LBB4_6
> > -; AVX-NEXT:  .LBB4_5: # %cond.store3
> > -; AVX-NEXT:    vpextrw $4, %xmm0, 4(%rdi)
> > -; AVX-NEXT:    testb $8, %al
> > -; AVX-NEXT:    je .LBB4_8
> > -; AVX-NEXT:  .LBB4_7: # %cond.store5
> > -; AVX-NEXT:    vpextrw $6, %xmm0, 6(%rdi)
> > -; AVX-NEXT:    vzeroupper
> > -; AVX-NEXT:    retq
> > +; AVX1-LABEL: truncstore_v4i64_v4i16:
> > +; AVX1:       # %bb.0:
> > +; AVX1-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > +; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm3
> > +; AVX1-NEXT:    vpshufd {{.*#+}} xmm3 = xmm3[0,2,2,3]
> > +; AVX1-NEXT:    vpshuflw {{.*#+}} xmm3 = xmm3[0,2,2,3,4,5,6,7]
> > +; AVX1-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > +; AVX1-NEXT:    vpshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> > +; AVX1-NEXT:    vpunpckldq {{.*#+}} xmm0 = xmm0[0],xmm3[0],xmm0[1],xmm3[1]
> > +; AVX1-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
> > +; AVX1-NEXT:    vmovmskps %xmm1, %eax
> > +; AVX1-NEXT:    xorl $15, %eax
> > +; AVX1-NEXT:    testb $1, %al
> > +; AVX1-NEXT:    jne .LBB4_1
> > +; AVX1-NEXT:  # %bb.2: # %else
> > +; AVX1-NEXT:    testb $2, %al
> > +; AVX1-NEXT:    jne .LBB4_3
> > +; AVX1-NEXT:  .LBB4_4: # %else2
> > +; AVX1-NEXT:    testb $4, %al
> > +; AVX1-NEXT:    jne .LBB4_5
> > +; AVX1-NEXT:  .LBB4_6: # %else4
> > +; AVX1-NEXT:    testb $8, %al
> > +; AVX1-NEXT:    jne .LBB4_7
> > +; AVX1-NEXT:  .LBB4_8: # %else6
> > +; AVX1-NEXT:    vzeroupper
> > +; AVX1-NEXT:    retq
> > +; AVX1-NEXT:  .LBB4_1: # %cond.store
> > +; AVX1-NEXT:    vpextrw $0, %xmm0, (%rdi)
> > +; AVX1-NEXT:    testb $2, %al
> > +; AVX1-NEXT:    je .LBB4_4
> > +; AVX1-NEXT:  .LBB4_3: # %cond.store1
> > +; AVX1-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
> > +; AVX1-NEXT:    testb $4, %al
> > +; AVX1-NEXT:    je .LBB4_6
> > +; AVX1-NEXT:  .LBB4_5: # %cond.store3
> > +; AVX1-NEXT:    vpextrw $2, %xmm0, 4(%rdi)
> > +; AVX1-NEXT:    testb $8, %al
> > +; AVX1-NEXT:    je .LBB4_8
> > +; AVX1-NEXT:  .LBB4_7: # %cond.store5
> > +; AVX1-NEXT:    vpextrw $3, %xmm0, 6(%rdi)
> > +; AVX1-NEXT:    vzeroupper
> > +; AVX1-NEXT:    retq
> > +;
> > +; AVX2-LABEL: truncstore_v4i64_v4i16:
> > +; AVX2:       # %bb.0:
> > +; AVX2-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > +; AVX2-NEXT:    vextracti128 $1, %ymm0, %xmm3
> > +; AVX2-NEXT:    vpshufd {{.*#+}} xmm3 = xmm3[0,2,2,3]
> > +; AVX2-NEXT:    vpshuflw {{.*#+}} xmm3 = xmm3[0,2,2,3,4,5,6,7]
> > +; AVX2-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > +; AVX2-NEXT:    vpshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> > +; AVX2-NEXT:    vpunpckldq {{.*#+}} xmm0 = xmm0[0],xmm3[0],xmm0[1],xmm3[1]
> > +; AVX2-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
> > +; AVX2-NEXT:    vmovmskps %xmm1, %eax
> > +; AVX2-NEXT:    xorl $15, %eax
> > +; AVX2-NEXT:    testb $1, %al
> > +; AVX2-NEXT:    jne .LBB4_1
> > +; AVX2-NEXT:  # %bb.2: # %else
> > +; AVX2-NEXT:    testb $2, %al
> > +; AVX2-NEXT:    jne .LBB4_3
> > +; AVX2-NEXT:  .LBB4_4: # %else2
> > +; AVX2-NEXT:    testb $4, %al
> > +; AVX2-NEXT:    jne .LBB4_5
> > +; AVX2-NEXT:  .LBB4_6: # %else4
> > +; AVX2-NEXT:    testb $8, %al
> > +; AVX2-NEXT:    jne .LBB4_7
> > +; AVX2-NEXT:  .LBB4_8: # %else6
> > +; AVX2-NEXT:    vzeroupper
> > +; AVX2-NEXT:    retq
> > +; AVX2-NEXT:  .LBB4_1: # %cond.store
> > +; AVX2-NEXT:    vpextrw $0, %xmm0, (%rdi)
> > +; AVX2-NEXT:    testb $2, %al
> > +; AVX2-NEXT:    je .LBB4_4
> > +; AVX2-NEXT:  .LBB4_3: # %cond.store1
> > +; AVX2-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
> > +; AVX2-NEXT:    testb $4, %al
> > +; AVX2-NEXT:    je .LBB4_6
> > +; AVX2-NEXT:  .LBB4_5: # %cond.store3
> > +; AVX2-NEXT:    vpextrw $2, %xmm0, 4(%rdi)
> > +; AVX2-NEXT:    testb $8, %al
> > +; AVX2-NEXT:    je .LBB4_8
> > +; AVX2-NEXT:  .LBB4_7: # %cond.store5
> > +; AVX2-NEXT:    vpextrw $3, %xmm0, 6(%rdi)
> > +; AVX2-NEXT:    vzeroupper
> > +; AVX2-NEXT:    retq
> >  ;
> >  ; AVX512F-LABEL: truncstore_v4i64_v4i16:
> >  ; AVX512F:       # %bb.0:
> >  ; AVX512F-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> >  ; AVX512F-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
> >  ; AVX512F-NEXT:    vptestmd %zmm1, %zmm1, %k0
> > -; AVX512F-NEXT:    vpmovqd %zmm0, %ymm0
> > +; AVX512F-NEXT:    vpmovqw %zmm0, %xmm0
> >  ; AVX512F-NEXT:    kmovw %k0, %eax
> >  ; AVX512F-NEXT:    testb $1, %al
> >  ; AVX512F-NEXT:    jne .LBB4_1
> > @@ -1285,15 +1338,15 @@ define void @truncstore_v4i64_v4i16(<4 x
> >  ; AVX512F-NEXT:    testb $2, %al
> >  ; AVX512F-NEXT:    je .LBB4_4
> >  ; AVX512F-NEXT:  .LBB4_3: # %cond.store1
> > -; AVX512F-NEXT:    vpextrw $2, %xmm0, 2(%rdi)
> > +; AVX512F-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
> >  ; AVX512F-NEXT:    testb $4, %al
> >  ; AVX512F-NEXT:    je .LBB4_6
> >  ; AVX512F-NEXT:  .LBB4_5: # %cond.store3
> > -; AVX512F-NEXT:    vpextrw $4, %xmm0, 4(%rdi)
> > +; AVX512F-NEXT:    vpextrw $2, %xmm0, 4(%rdi)
> >  ; AVX512F-NEXT:    testb $8, %al
> >  ; AVX512F-NEXT:    je .LBB4_8
> >  ; AVX512F-NEXT:  .LBB4_7: # %cond.store5
> > -; AVX512F-NEXT:    vpextrw $6, %xmm0, 6(%rdi)
> > +; AVX512F-NEXT:    vpextrw $3, %xmm0, 6(%rdi)
> >  ; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> >  ;
> > @@ -1302,10 +1355,9 @@ define void @truncstore_v4i64_v4i16(<4 x
> >  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> >  ; AVX512BW-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
> >  ; AVX512BW-NEXT:    vptestmd %zmm1, %zmm1, %k0
> > -; AVX512BW-NEXT:    vpmovqd %zmm0, %ymm0
> > -; AVX512BW-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,1,4,5,8,9,12,13,8,9,12,13,12,13,14,15]
> >  ; AVX512BW-NEXT:    kshiftld $28, %k0, %k0
> >  ; AVX512BW-NEXT:    kshiftrd $28, %k0, %k1
> > +; AVX512BW-NEXT:    vpmovqw %zmm0, %xmm0
> >  ; AVX512BW-NEXT:    vmovdqu16 %zmm0, (%rdi) {%k1}
> >  ; AVX512BW-NEXT:    vzeroupper
> >  ; AVX512BW-NEXT:    retq
> > @@ -1326,47 +1378,55 @@ define void @truncstore_v4i64_v4i8(<4 x
> >  ; SSE2-LABEL: truncstore_v4i64_v4i8:
> >  ; SSE2:       # %bb.0:
> >  ; SSE2-NEXT:    pxor %xmm3, %xmm3
> > -; SSE2-NEXT:    shufps {{.*#+}} xmm0 = xmm0[0,2],xmm1[0,2]
> > +; SSE2-NEXT:    movdqa {{.*#+}} xmm4 = [255,0,0,0,0,0,0,0,255,0,0,0,0,0,0,0]
> > +; SSE2-NEXT:    pand %xmm4, %xmm1
> > +; SSE2-NEXT:    pand %xmm4, %xmm0
> > +; SSE2-NEXT:    packuswb %xmm1, %xmm0
> > +; SSE2-NEXT:    packuswb %xmm0, %xmm0
> > +; SSE2-NEXT:    packuswb %xmm0, %xmm0
> >  ; SSE2-NEXT:    pcmpeqd %xmm2, %xmm3
> > -; SSE2-NEXT:    movmskps %xmm3, %eax
> > -; SSE2-NEXT:    xorl $15, %eax
> > -; SSE2-NEXT:    testb $1, %al
> > +; SSE2-NEXT:    movmskps %xmm3, %ecx
> > +; SSE2-NEXT:    xorl $15, %ecx
> > +; SSE2-NEXT:    testb $1, %cl
> > +; SSE2-NEXT:    movd %xmm0, %eax
> >  ; SSE2-NEXT:    jne .LBB5_1
> >  ; SSE2-NEXT:  # %bb.2: # %else
> > -; SSE2-NEXT:    testb $2, %al
> > +; SSE2-NEXT:    testb $2, %cl
> >  ; SSE2-NEXT:    jne .LBB5_3
> >  ; SSE2-NEXT:  .LBB5_4: # %else2
> > -; SSE2-NEXT:    testb $4, %al
> > +; SSE2-NEXT:    testb $4, %cl
> >  ; SSE2-NEXT:    jne .LBB5_5
> >  ; SSE2-NEXT:  .LBB5_6: # %else4
> > -; SSE2-NEXT:    testb $8, %al
> > +; SSE2-NEXT:    testb $8, %cl
> >  ; SSE2-NEXT:    jne .LBB5_7
> >  ; SSE2-NEXT:  .LBB5_8: # %else6
> >  ; SSE2-NEXT:    retq
> >  ; SSE2-NEXT:  .LBB5_1: # %cond.store
> > -; SSE2-NEXT:    movd %xmm0, %ecx
> > -; SSE2-NEXT:    movb %cl, (%rdi)
> > -; SSE2-NEXT:    testb $2, %al
> > +; SSE2-NEXT:    movb %al, (%rdi)
> > +; SSE2-NEXT:    testb $2, %cl
> >  ; SSE2-NEXT:    je .LBB5_4
> >  ; SSE2-NEXT:  .LBB5_3: # %cond.store1
> > -; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> > -; SSE2-NEXT:    movb %cl, 1(%rdi)
> > -; SSE2-NEXT:    testb $4, %al
> > +; SSE2-NEXT:    movb %ah, 1(%rdi)
> > +; SSE2-NEXT:    testb $4, %cl
> >  ; SSE2-NEXT:    je .LBB5_6
> >  ; SSE2-NEXT:  .LBB5_5: # %cond.store3
> > -; SSE2-NEXT:    pextrw $4, %xmm0, %ecx
> > -; SSE2-NEXT:    movb %cl, 2(%rdi)
> > -; SSE2-NEXT:    testb $8, %al
> > +; SSE2-NEXT:    movl %eax, %edx
> > +; SSE2-NEXT:    shrl $16, %edx
> > +; SSE2-NEXT:    movb %dl, 2(%rdi)
> > +; SSE2-NEXT:    testb $8, %cl
> >  ; SSE2-NEXT:    je .LBB5_8
> >  ; SSE2-NEXT:  .LBB5_7: # %cond.store5
> > -; SSE2-NEXT:    pextrw $6, %xmm0, %eax
> > +; SSE2-NEXT:    shrl $24, %eax
> >  ; SSE2-NEXT:    movb %al, 3(%rdi)
> >  ; SSE2-NEXT:    retq
> >  ;
> >  ; SSE4-LABEL: truncstore_v4i64_v4i8:
> >  ; SSE4:       # %bb.0:
> >  ; SSE4-NEXT:    pxor %xmm3, %xmm3
> > -; SSE4-NEXT:    shufps {{.*#+}} xmm0 = xmm0[0,2],xmm1[0,2]
> > +; SSE4-NEXT:    movdqa {{.*#+}} xmm4 = <0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u>
> > +; SSE4-NEXT:    pshufb %xmm4, %xmm1
> > +; SSE4-NEXT:    pshufb %xmm4, %xmm0
> > +; SSE4-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3]
> >  ; SSE4-NEXT:    pcmpeqd %xmm2, %xmm3
> >  ; SSE4-NEXT:    movmskps %xmm3, %eax
> >  ; SSE4-NEXT:    xorl $15, %eax
> > @@ -1388,62 +1448,107 @@ define void @truncstore_v4i64_v4i8(<4 x
> >  ; SSE4-NEXT:    testb $2, %al
> >  ; SSE4-NEXT:    je .LBB5_4
> >  ; SSE4-NEXT:  .LBB5_3: # %cond.store1
> > -; SSE4-NEXT:    pextrb $4, %xmm0, 1(%rdi)
> > +; SSE4-NEXT:    pextrb $1, %xmm0, 1(%rdi)
> >  ; SSE4-NEXT:    testb $4, %al
> >  ; SSE4-NEXT:    je .LBB5_6
> >  ; SSE4-NEXT:  .LBB5_5: # %cond.store3
> > -; SSE4-NEXT:    pextrb $8, %xmm0, 2(%rdi)
> > +; SSE4-NEXT:    pextrb $2, %xmm0, 2(%rdi)
> >  ; SSE4-NEXT:    testb $8, %al
> >  ; SSE4-NEXT:    je .LBB5_8
> >  ; SSE4-NEXT:  .LBB5_7: # %cond.store5
> > -; SSE4-NEXT:    pextrb $12, %xmm0, 3(%rdi)
> > +; SSE4-NEXT:    pextrb $3, %xmm0, 3(%rdi)
> >  ; SSE4-NEXT:    retq
> >  ;
> > -; AVX-LABEL: truncstore_v4i64_v4i8:
> > -; AVX:       # %bb.0:
> > -; AVX-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > -; AVX-NEXT:    vextractf128 $1, %ymm0, %xmm3
> > -; AVX-NEXT:    vshufps {{.*#+}} xmm0 = xmm0[0,2],xmm3[0,2]
> > -; AVX-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
> > -; AVX-NEXT:    vmovmskps %xmm1, %eax
> > -; AVX-NEXT:    xorl $15, %eax
> > -; AVX-NEXT:    testb $1, %al
> > -; AVX-NEXT:    jne .LBB5_1
> > -; AVX-NEXT:  # %bb.2: # %else
> > -; AVX-NEXT:    testb $2, %al
> > -; AVX-NEXT:    jne .LBB5_3
> > -; AVX-NEXT:  .LBB5_4: # %else2
> > -; AVX-NEXT:    testb $4, %al
> > -; AVX-NEXT:    jne .LBB5_5
> > -; AVX-NEXT:  .LBB5_6: # %else4
> > -; AVX-NEXT:    testb $8, %al
> > -; AVX-NEXT:    jne .LBB5_7
> > -; AVX-NEXT:  .LBB5_8: # %else6
> > -; AVX-NEXT:    vzeroupper
> > -; AVX-NEXT:    retq
> > -; AVX-NEXT:  .LBB5_1: # %cond.store
> > -; AVX-NEXT:    vpextrb $0, %xmm0, (%rdi)
> > -; AVX-NEXT:    testb $2, %al
> > -; AVX-NEXT:    je .LBB5_4
> > -; AVX-NEXT:  .LBB5_3: # %cond.store1
> > -; AVX-NEXT:    vpextrb $4, %xmm0, 1(%rdi)
> > -; AVX-NEXT:    testb $4, %al
> > -; AVX-NEXT:    je .LBB5_6
> > -; AVX-NEXT:  .LBB5_5: # %cond.store3
> > -; AVX-NEXT:    vpextrb $8, %xmm0, 2(%rdi)
> > -; AVX-NEXT:    testb $8, %al
> > -; AVX-NEXT:    je .LBB5_8
> > -; AVX-NEXT:  .LBB5_7: # %cond.store5
> > -; AVX-NEXT:    vpextrb $12, %xmm0, 3(%rdi)
> > -; AVX-NEXT:    vzeroupper
> > -; AVX-NEXT:    retq
> > +; AVX1-LABEL: truncstore_v4i64_v4i8:
> > +; AVX1:       # %bb.0:
> > +; AVX1-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > +; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm3
> > +; AVX1-NEXT:    vmovdqa {{.*#+}} xmm4 = <0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u>
> > +; AVX1-NEXT:    vpshufb %xmm4, %xmm3, %xmm3
> > +; AVX1-NEXT:    vpshufb %xmm4, %xmm0, %xmm0
> > +; AVX1-NEXT:    vpunpcklwd {{.*#+}} xmm0 = xmm0[0],xmm3[0],xmm0[1],xmm3[1],xmm0[2],xmm3[2],xmm0[3],xmm3[3]
> > +; AVX1-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
> > +; AVX1-NEXT:    vmovmskps %xmm1, %eax
> > +; AVX1-NEXT:    xorl $15, %eax
> > +; AVX1-NEXT:    testb $1, %al
> > +; AVX1-NEXT:    jne .LBB5_1
> > +; AVX1-NEXT:  # %bb.2: # %else
> > +; AVX1-NEXT:    testb $2, %al
> > +; AVX1-NEXT:    jne .LBB5_3
> > +; AVX1-NEXT:  .LBB5_4: # %else2
> > +; AVX1-NEXT:    testb $4, %al
> > +; AVX1-NEXT:    jne .LBB5_5
> > +; AVX1-NEXT:  .LBB5_6: # %else4
> > +; AVX1-NEXT:    testb $8, %al
> > +; AVX1-NEXT:    jne .LBB5_7
> > +; AVX1-NEXT:  .LBB5_8: # %else6
> > +; AVX1-NEXT:    vzeroupper
> > +; AVX1-NEXT:    retq
> > +; AVX1-NEXT:  .LBB5_1: # %cond.store
> > +; AVX1-NEXT:    vpextrb $0, %xmm0, (%rdi)
> > +; AVX1-NEXT:    testb $2, %al
> > +; AVX1-NEXT:    je .LBB5_4
> > +; AVX1-NEXT:  .LBB5_3: # %cond.store1
> > +; AVX1-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> > +; AVX1-NEXT:    testb $4, %al
> > +; AVX1-NEXT:    je .LBB5_6
> > +; AVX1-NEXT:  .LBB5_5: # %cond.store3
> > +; AVX1-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> > +; AVX1-NEXT:    testb $8, %al
> > +; AVX1-NEXT:    je .LBB5_8
> > +; AVX1-NEXT:  .LBB5_7: # %cond.store5
> > +; AVX1-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> > +; AVX1-NEXT:    vzeroupper
> > +; AVX1-NEXT:    retq
> > +;
> > +; AVX2-LABEL: truncstore_v4i64_v4i8:
> > +; AVX2:       # %bb.0:
> > +; AVX2-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > +; AVX2-NEXT:    vextracti128 $1, %ymm0, %xmm3
> > +; AVX2-NEXT:    vmovdqa {{.*#+}} xmm4 = <0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u>
> > +; AVX2-NEXT:    vpshufb %xmm4, %xmm3, %xmm3
> > +; AVX2-NEXT:    vpshufb %xmm4, %xmm0, %xmm0
> > +; AVX2-NEXT:    vpunpcklwd {{.*#+}} xmm0 = xmm0[0],xmm3[0],xmm0[1],xmm3[1],xmm0[2],xmm3[2],xmm0[3],xmm3[3]
> > +; AVX2-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
> > +; AVX2-NEXT:    vmovmskps %xmm1, %eax
> > +; AVX2-NEXT:    xorl $15, %eax
> > +; AVX2-NEXT:    testb $1, %al
> > +; AVX2-NEXT:    jne .LBB5_1
> > +; AVX2-NEXT:  # %bb.2: # %else
> > +; AVX2-NEXT:    testb $2, %al
> > +; AVX2-NEXT:    jne .LBB5_3
> > +; AVX2-NEXT:  .LBB5_4: # %else2
> > +; AVX2-NEXT:    testb $4, %al
> > +; AVX2-NEXT:    jne .LBB5_5
> > +; AVX2-NEXT:  .LBB5_6: # %else4
> > +; AVX2-NEXT:    testb $8, %al
> > +; AVX2-NEXT:    jne .LBB5_7
> > +; AVX2-NEXT:  .LBB5_8: # %else6
> > +; AVX2-NEXT:    vzeroupper
> > +; AVX2-NEXT:    retq
> > +; AVX2-NEXT:  .LBB5_1: # %cond.store
> > +; AVX2-NEXT:    vpextrb $0, %xmm0, (%rdi)
> > +; AVX2-NEXT:    testb $2, %al
> > +; AVX2-NEXT:    je .LBB5_4
> > +; AVX2-NEXT:  .LBB5_3: # %cond.store1
> > +; AVX2-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> > +; AVX2-NEXT:    testb $4, %al
> > +; AVX2-NEXT:    je .LBB5_6
> > +; AVX2-NEXT:  .LBB5_5: # %cond.store3
> > +; AVX2-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> > +; AVX2-NEXT:    testb $8, %al
> > +; AVX2-NEXT:    je .LBB5_8
> > +; AVX2-NEXT:  .LBB5_7: # %cond.store5
> > +; AVX2-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> > +; AVX2-NEXT:    vzeroupper
> > +; AVX2-NEXT:    retq
> >  ;
> >  ; AVX512F-LABEL: truncstore_v4i64_v4i8:
> >  ; AVX512F:       # %bb.0:
> >  ; AVX512F-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> >  ; AVX512F-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
> >  ; AVX512F-NEXT:    vptestmd %zmm1, %zmm1, %k0
> > -; AVX512F-NEXT:    vpmovqd %zmm0, %ymm0
> > +; AVX512F-NEXT:    vpmovqb %zmm0, %xmm0
> >  ; AVX512F-NEXT:    kmovw %k0, %eax
> >  ; AVX512F-NEXT:    testb $1, %al
> >  ; AVX512F-NEXT:    jne .LBB5_1
> > @@ -1464,15 +1569,15 @@ define void @truncstore_v4i64_v4i8(<4 x
> >  ; AVX512F-NEXT:    testb $2, %al
> >  ; AVX512F-NEXT:    je .LBB5_4
> >  ; AVX512F-NEXT:  .LBB5_3: # %cond.store1
> > -; AVX512F-NEXT:    vpextrb $4, %xmm0, 1(%rdi)
> > +; AVX512F-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX512F-NEXT:    testb $4, %al
> >  ; AVX512F-NEXT:    je .LBB5_6
> >  ; AVX512F-NEXT:  .LBB5_5: # %cond.store3
> > -; AVX512F-NEXT:    vpextrb $8, %xmm0, 2(%rdi)
> > +; AVX512F-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> >  ; AVX512F-NEXT:    testb $8, %al
> >  ; AVX512F-NEXT:    je .LBB5_8
> >  ; AVX512F-NEXT:  .LBB5_7: # %cond.store5
> > -; AVX512F-NEXT:    vpextrb $12, %xmm0, 3(%rdi)
> > +; AVX512F-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> >  ; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> >  ;
> > @@ -1481,10 +1586,9 @@ define void @truncstore_v4i64_v4i8(<4 x
> >  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> >  ; AVX512BW-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
> >  ; AVX512BW-NEXT:    vptestmd %zmm1, %zmm1, %k0
> > -; AVX512BW-NEXT:    vpmovqd %zmm0, %ymm0
> > -; AVX512BW-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
> >  ; AVX512BW-NEXT:    kshiftlq $60, %k0, %k0
> >  ; AVX512BW-NEXT:    kshiftrq $60, %k0, %k1
> > +; AVX512BW-NEXT:    vpmovqb %zmm0, %xmm0
> >  ; AVX512BW-NEXT:    vmovdqu8 %zmm0, (%rdi) {%k1}
> >  ; AVX512BW-NEXT:    vzeroupper
> >  ; AVX512BW-NEXT:    retq
> > @@ -1505,6 +1609,7 @@ define void @truncstore_v2i64_v2i32(<2 x
> >  ; SSE2-LABEL: truncstore_v2i64_v2i32:
> >  ; SSE2:       # %bb.0:
> >  ; SSE2-NEXT:    pxor %xmm2, %xmm2
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> >  ; SSE2-NEXT:    pcmpeqd %xmm1, %xmm2
> >  ; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm2[1,0,3,2]
> >  ; SSE2-NEXT:    pand %xmm2, %xmm1
> > @@ -1522,13 +1627,14 @@ define void @truncstore_v2i64_v2i32(<2 x
> >  ; SSE2-NEXT:    testb $2, %al
> >  ; SSE2-NEXT:    je .LBB6_4
> >  ; SSE2-NEXT:  .LBB6_3: # %cond.store1
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[2,3,0,1]
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,1,2,3]
> >  ; SSE2-NEXT:    movd %xmm0, 4(%rdi)
> >  ; SSE2-NEXT:    retq
> >  ;
> >  ; SSE4-LABEL: truncstore_v2i64_v2i32:
> >  ; SSE4:       # %bb.0:
> >  ; SSE4-NEXT:    pxor %xmm2, %xmm2
> > +; SSE4-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> >  ; SSE4-NEXT:    pcmpeqq %xmm1, %xmm2
> >  ; SSE4-NEXT:    movmskpd %xmm2, %eax
> >  ; SSE4-NEXT:    xorl $3, %eax
> > @@ -1540,11 +1646,11 @@ define void @truncstore_v2i64_v2i32(<2 x
> >  ; SSE4-NEXT:  .LBB6_4: # %else2
> >  ; SSE4-NEXT:    retq
> >  ; SSE4-NEXT:  .LBB6_1: # %cond.store
> > -; SSE4-NEXT:    movss %xmm0, (%rdi)
> > +; SSE4-NEXT:    movd %xmm0, (%rdi)
> >  ; SSE4-NEXT:    testb $2, %al
> >  ; SSE4-NEXT:    je .LBB6_4
> >  ; SSE4-NEXT:  .LBB6_3: # %cond.store1
> > -; SSE4-NEXT:    extractps $2, %xmm0, 4(%rdi)
> > +; SSE4-NEXT:    pextrd $1, %xmm0, 4(%rdi)
> >  ; SSE4-NEXT:    retq
> >  ;
> >  ; AVX1-LABEL: truncstore_v2i64_v2i32:
> > @@ -1573,9 +1679,9 @@ define void @truncstore_v2i64_v2i32(<2 x
> >  ; AVX512F:       # %bb.0:
> >  ; AVX512F-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> >  ; AVX512F-NEXT:    vptestmq %zmm1, %zmm1, %k0
> > -; AVX512F-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> >  ; AVX512F-NEXT:    kshiftlw $14, %k0, %k0
> >  ; AVX512F-NEXT:    kshiftrw $14, %k0, %k1
> > +; AVX512F-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> >  ; AVX512F-NEXT:    vmovdqu32 %zmm0, (%rdi) {%k1}
> >  ; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> > @@ -1590,9 +1696,9 @@ define void @truncstore_v2i64_v2i32(<2 x
> >  ; AVX512BW:       # %bb.0:
> >  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> >  ; AVX512BW-NEXT:    vptestmq %zmm1, %zmm1, %k0
> > -; AVX512BW-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> >  ; AVX512BW-NEXT:    kshiftlw $14, %k0, %k0
> >  ; AVX512BW-NEXT:    kshiftrw $14, %k0, %k1
> > +; AVX512BW-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> >  ; AVX512BW-NEXT:    vmovdqu32 %zmm0, (%rdi) {%k1}
> >  ; AVX512BW-NEXT:    vzeroupper
> >  ; AVX512BW-NEXT:    retq
> > @@ -1606,6 +1712,8 @@ define void @truncstore_v2i64_v2i16(<2 x
> >  ; SSE2-LABEL: truncstore_v2i64_v2i16:
> >  ; SSE2:       # %bb.0:
> >  ; SSE2-NEXT:    pxor %xmm2, %xmm2
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > +; SSE2-NEXT:    pshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> >  ; SSE2-NEXT:    pcmpeqd %xmm1, %xmm2
> >  ; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm2[1,0,3,2]
> >  ; SSE2-NEXT:    pand %xmm2, %xmm1
> > @@ -1624,13 +1732,15 @@ define void @truncstore_v2i64_v2i16(<2 x
> >  ; SSE2-NEXT:    testb $2, %al
> >  ; SSE2-NEXT:    je .LBB7_4
> >  ; SSE2-NEXT:  .LBB7_3: # %cond.store1
> > -; SSE2-NEXT:    pextrw $4, %xmm0, %eax
> > +; SSE2-NEXT:    pextrw $1, %xmm0, %eax
> >  ; SSE2-NEXT:    movw %ax, 2(%rdi)
> >  ; SSE2-NEXT:    retq
> >  ;
> >  ; SSE4-LABEL: truncstore_v2i64_v2i16:
> >  ; SSE4:       # %bb.0:
> >  ; SSE4-NEXT:    pxor %xmm2, %xmm2
> > +; SSE4-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > +; SSE4-NEXT:    pshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> >  ; SSE4-NEXT:    pcmpeqq %xmm1, %xmm2
> >  ; SSE4-NEXT:    movmskpd %xmm2, %eax
> >  ; SSE4-NEXT:    xorl $3, %eax
> > @@ -1646,12 +1756,14 @@ define void @truncstore_v2i64_v2i16(<2 x
> >  ; SSE4-NEXT:    testb $2, %al
> >  ; SSE4-NEXT:    je .LBB7_4
> >  ; SSE4-NEXT:  .LBB7_3: # %cond.store1
> > -; SSE4-NEXT:    pextrw $4, %xmm0, 2(%rdi)
> > +; SSE4-NEXT:    pextrw $1, %xmm0, 2(%rdi)
> >  ; SSE4-NEXT:    retq
> >  ;
> >  ; AVX-LABEL: truncstore_v2i64_v2i16:
> >  ; AVX:       # %bb.0:
> >  ; AVX-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > +; AVX-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > +; AVX-NEXT:    vpshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> >  ; AVX-NEXT:    vpcmpeqq %xmm2, %xmm1, %xmm1
> >  ; AVX-NEXT:    vmovmskpd %xmm1, %eax
> >  ; AVX-NEXT:    xorl $3, %eax
> > @@ -1667,13 +1779,15 @@ define void @truncstore_v2i64_v2i16(<2 x
> >  ; AVX-NEXT:    testb $2, %al
> >  ; AVX-NEXT:    je .LBB7_4
> >  ; AVX-NEXT:  .LBB7_3: # %cond.store1
> > -; AVX-NEXT:    vpextrw $4, %xmm0, 2(%rdi)
> > +; AVX-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
> >  ; AVX-NEXT:    retq
> >  ;
> >  ; AVX512F-LABEL: truncstore_v2i64_v2i16:
> >  ; AVX512F:       # %bb.0:
> >  ; AVX512F-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> >  ; AVX512F-NEXT:    vptestmq %zmm1, %zmm1, %k0
> > +; AVX512F-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > +; AVX512F-NEXT:    vpshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> >  ; AVX512F-NEXT:    kmovw %k0, %eax
> >  ; AVX512F-NEXT:    testb $1, %al
> >  ; AVX512F-NEXT:    jne .LBB7_1
> > @@ -1688,7 +1802,7 @@ define void @truncstore_v2i64_v2i16(<2 x
> >  ; AVX512F-NEXT:    testb $2, %al
> >  ; AVX512F-NEXT:    je .LBB7_4
> >  ; AVX512F-NEXT:  .LBB7_3: # %cond.store1
> > -; AVX512F-NEXT:    vpextrw $4, %xmm0, 2(%rdi)
> > +; AVX512F-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
> >  ; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> >  ;
> > @@ -1696,10 +1810,10 @@ define void @truncstore_v2i64_v2i16(<2 x
> >  ; AVX512BW:       # %bb.0:
> >  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> >  ; AVX512BW-NEXT:    vptestmq %zmm1, %zmm1, %k0
> > -; AVX512BW-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > -; AVX512BW-NEXT:    vpshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> >  ; AVX512BW-NEXT:    kshiftld $30, %k0, %k0
> >  ; AVX512BW-NEXT:    kshiftrd $30, %k0, %k1
> > +; AVX512BW-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > +; AVX512BW-NEXT:    vpshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> >  ; AVX512BW-NEXT:    vmovdqu16 %zmm0, (%rdi) {%k1}
> >  ; AVX512BW-NEXT:    vzeroupper
> >  ; AVX512BW-NEXT:    retq
> > @@ -1719,12 +1833,17 @@ define void @truncstore_v2i64_v2i8(<2 x
> >  ; SSE2-LABEL: truncstore_v2i64_v2i8:
> >  ; SSE2:       # %bb.0:
> >  ; SSE2-NEXT:    pxor %xmm2, %xmm2
> > +; SSE2-NEXT:    pand {{.*}}(%rip), %xmm0
> > +; SSE2-NEXT:    packuswb %xmm0, %xmm0
> > +; SSE2-NEXT:    packuswb %xmm0, %xmm0
> > +; SSE2-NEXT:    packuswb %xmm0, %xmm0
> >  ; SSE2-NEXT:    pcmpeqd %xmm1, %xmm2
> >  ; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm2[1,0,3,2]
> >  ; SSE2-NEXT:    pand %xmm2, %xmm1
> >  ; SSE2-NEXT:    movmskpd %xmm1, %eax
> >  ; SSE2-NEXT:    xorl $3, %eax
> >  ; SSE2-NEXT:    testb $1, %al
> > +; SSE2-NEXT:    movd %xmm0, %ecx
> >  ; SSE2-NEXT:    jne .LBB8_1
> >  ; SSE2-NEXT:  # %bb.2: # %else
> >  ; SSE2-NEXT:    testb $2, %al
> > @@ -1732,18 +1851,17 @@ define void @truncstore_v2i64_v2i8(<2 x
> >  ; SSE2-NEXT:  .LBB8_4: # %else2
> >  ; SSE2-NEXT:    retq
> >  ; SSE2-NEXT:  .LBB8_1: # %cond.store
> > -; SSE2-NEXT:    movd %xmm0, %ecx
> >  ; SSE2-NEXT:    movb %cl, (%rdi)
> >  ; SSE2-NEXT:    testb $2, %al
> >  ; SSE2-NEXT:    je .LBB8_4
> >  ; SSE2-NEXT:  .LBB8_3: # %cond.store1
> > -; SSE2-NEXT:    pextrw $4, %xmm0, %eax
> > -; SSE2-NEXT:    movb %al, 1(%rdi)
> > +; SSE2-NEXT:    movb %ch, 1(%rdi)
> >  ; SSE2-NEXT:    retq
> >  ;
> >  ; SSE4-LABEL: truncstore_v2i64_v2i8:
> >  ; SSE4:       # %bb.0:
> >  ; SSE4-NEXT:    pxor %xmm2, %xmm2
> > +; SSE4-NEXT:    pshufb {{.*#+}} xmm0 = xmm0[0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u]
> >  ; SSE4-NEXT:    pcmpeqq %xmm1, %xmm2
> >  ; SSE4-NEXT:    movmskpd %xmm2, %eax
> >  ; SSE4-NEXT:    xorl $3, %eax
> > @@ -1759,12 +1877,13 @@ define void @truncstore_v2i64_v2i8(<2 x
> >  ; SSE4-NEXT:    testb $2, %al
> >  ; SSE4-NEXT:    je .LBB8_4
> >  ; SSE4-NEXT:  .LBB8_3: # %cond.store1
> > -; SSE4-NEXT:    pextrb $8, %xmm0, 1(%rdi)
> > +; SSE4-NEXT:    pextrb $1, %xmm0, 1(%rdi)
> >  ; SSE4-NEXT:    retq
> >  ;
> >  ; AVX-LABEL: truncstore_v2i64_v2i8:
> >  ; AVX:       # %bb.0:
> >  ; AVX-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > +; AVX-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u]
> >  ; AVX-NEXT:    vpcmpeqq %xmm2, %xmm1, %xmm1
> >  ; AVX-NEXT:    vmovmskpd %xmm1, %eax
> >  ; AVX-NEXT:    xorl $3, %eax
> > @@ -1780,13 +1899,14 @@ define void @truncstore_v2i64_v2i8(<2 x
> >  ; AVX-NEXT:    testb $2, %al
> >  ; AVX-NEXT:    je .LBB8_4
> >  ; AVX-NEXT:  .LBB8_3: # %cond.store1
> > -; AVX-NEXT:    vpextrb $8, %xmm0, 1(%rdi)
> > +; AVX-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX-NEXT:    retq
> >  ;
> >  ; AVX512F-LABEL: truncstore_v2i64_v2i8:
> >  ; AVX512F:       # %bb.0:
> >  ; AVX512F-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> >  ; AVX512F-NEXT:    vptestmq %zmm1, %zmm1, %k0
> > +; AVX512F-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u]
> >  ; AVX512F-NEXT:    kmovw %k0, %eax
> >  ; AVX512F-NEXT:    testb $1, %al
> >  ; AVX512F-NEXT:    jne .LBB8_1
> > @@ -1801,7 +1921,7 @@ define void @truncstore_v2i64_v2i8(<2 x
> >  ; AVX512F-NEXT:    testb $2, %al
> >  ; AVX512F-NEXT:    je .LBB8_4
> >  ; AVX512F-NEXT:  .LBB8_3: # %cond.store1
> > -; AVX512F-NEXT:    vpextrb $8, %xmm0, 1(%rdi)
> > +; AVX512F-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> >  ;
> > @@ -1809,9 +1929,9 @@ define void @truncstore_v2i64_v2i8(<2 x
> >  ; AVX512BW:       # %bb.0:
> >  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> >  ; AVX512BW-NEXT:    vptestmq %zmm1, %zmm1, %k0
> > -; AVX512BW-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u]
> >  ; AVX512BW-NEXT:    kshiftlq $62, %k0, %k0
> >  ; AVX512BW-NEXT:    kshiftrq $62, %k0, %k1
> > +; AVX512BW-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u]
> >  ; AVX512BW-NEXT:    vmovdqu8 %zmm0, (%rdi) {%k1}
> >  ; AVX512BW-NEXT:    vzeroupper
> >  ; AVX512BW-NEXT:    retq
> > @@ -3593,11 +3713,11 @@ define void @truncstore_v8i32_v8i8(<8 x
> >  ; SSE2-LABEL: truncstore_v8i32_v8i8:
> >  ; SSE2:       # %bb.0:
> >  ; SSE2-NEXT:    pxor %xmm4, %xmm4
> > -; SSE2-NEXT:    pslld $16, %xmm1
> > -; SSE2-NEXT:    psrad $16, %xmm1
> > -; SSE2-NEXT:    pslld $16, %xmm0
> > -; SSE2-NEXT:    psrad $16, %xmm0
> > -; SSE2-NEXT:    packssdw %xmm1, %xmm0
> > +; SSE2-NEXT:    movdqa {{.*#+}} xmm5 = [255,0,0,0,255,0,0,0,255,0,0,0,255,0,0,0]
> > +; SSE2-NEXT:    pand %xmm5, %xmm1
> > +; SSE2-NEXT:    pand %xmm5, %xmm0
> > +; SSE2-NEXT:    packuswb %xmm1, %xmm0
> > +; SSE2-NEXT:    packuswb %xmm0, %xmm0
> >  ; SSE2-NEXT:    pcmpeqd %xmm4, %xmm3
> >  ; SSE2-NEXT:    pcmpeqd %xmm1, %xmm1
> >  ; SSE2-NEXT:    pxor %xmm1, %xmm3
> > @@ -3617,17 +3737,26 @@ define void @truncstore_v8i32_v8i8(<8 x
> >  ; SSE2-NEXT:    jne .LBB12_5
> >  ; SSE2-NEXT:  .LBB12_6: # %else4
> >  ; SSE2-NEXT:    testb $8, %al
> > -; SSE2-NEXT:    jne .LBB12_7
> > +; SSE2-NEXT:    je .LBB12_8
> > +; SSE2-NEXT:  .LBB12_7: # %cond.store5
> > +; SSE2-NEXT:    shrl $24, %ecx
> > +; SSE2-NEXT:    movb %cl, 3(%rdi)
> >  ; SSE2-NEXT:  .LBB12_8: # %else6
> >  ; SSE2-NEXT:    testb $16, %al
> > -; SSE2-NEXT:    jne .LBB12_9
> > +; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> > +; SSE2-NEXT:    je .LBB12_10
> > +; SSE2-NEXT:  # %bb.9: # %cond.store7
> > +; SSE2-NEXT:    movb %cl, 4(%rdi)
> >  ; SSE2-NEXT:  .LBB12_10: # %else8
> >  ; SSE2-NEXT:    testb $32, %al
> > -; SSE2-NEXT:    jne .LBB12_11
> > +; SSE2-NEXT:    je .LBB12_12
> > +; SSE2-NEXT:  # %bb.11: # %cond.store9
> > +; SSE2-NEXT:    movb %ch, 5(%rdi)
> >  ; SSE2-NEXT:  .LBB12_12: # %else10
> >  ; SSE2-NEXT:    testb $64, %al
> > +; SSE2-NEXT:    pextrw $3, %xmm0, %ecx
> >  ; SSE2-NEXT:    jne .LBB12_13
> > -; SSE2-NEXT:  .LBB12_14: # %else12
> > +; SSE2-NEXT:  # %bb.14: # %else12
> >  ; SSE2-NEXT:    testb $-128, %al
> >  ; SSE2-NEXT:    jne .LBB12_15
> >  ; SSE2-NEXT:  .LBB12_16: # %else14
> > @@ -3637,47 +3766,31 @@ define void @truncstore_v8i32_v8i8(<8 x
> >  ; SSE2-NEXT:    testb $2, %al
> >  ; SSE2-NEXT:    je .LBB12_4
> >  ; SSE2-NEXT:  .LBB12_3: # %cond.store1
> > -; SSE2-NEXT:    shrl $16, %ecx
> > -; SSE2-NEXT:    movb %cl, 1(%rdi)
> > +; SSE2-NEXT:    movb %ch, 1(%rdi)
> >  ; SSE2-NEXT:    testb $4, %al
> >  ; SSE2-NEXT:    je .LBB12_6
> >  ; SSE2-NEXT:  .LBB12_5: # %cond.store3
> > -; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> > -; SSE2-NEXT:    movb %cl, 2(%rdi)
> > +; SSE2-NEXT:    movl %ecx, %edx
> > +; SSE2-NEXT:    shrl $16, %edx
> > +; SSE2-NEXT:    movb %dl, 2(%rdi)
> >  ; SSE2-NEXT:    testb $8, %al
> > -; SSE2-NEXT:    je .LBB12_8
> > -; SSE2-NEXT:  .LBB12_7: # %cond.store5
> > -; SSE2-NEXT:    pextrw $3, %xmm0, %ecx
> > -; SSE2-NEXT:    movb %cl, 3(%rdi)
> > -; SSE2-NEXT:    testb $16, %al
> > -; SSE2-NEXT:    je .LBB12_10
> > -; SSE2-NEXT:  .LBB12_9: # %cond.store7
> > -; SSE2-NEXT:    pextrw $4, %xmm0, %ecx
> > -; SSE2-NEXT:    movb %cl, 4(%rdi)
> > -; SSE2-NEXT:    testb $32, %al
> > -; SSE2-NEXT:    je .LBB12_12
> > -; SSE2-NEXT:  .LBB12_11: # %cond.store9
> > -; SSE2-NEXT:    pextrw $5, %xmm0, %ecx
> > -; SSE2-NEXT:    movb %cl, 5(%rdi)
> > -; SSE2-NEXT:    testb $64, %al
> > -; SSE2-NEXT:    je .LBB12_14
> > +; SSE2-NEXT:    jne .LBB12_7
> > +; SSE2-NEXT:    jmp .LBB12_8
> >  ; SSE2-NEXT:  .LBB12_13: # %cond.store11
> > -; SSE2-NEXT:    pextrw $6, %xmm0, %ecx
> >  ; SSE2-NEXT:    movb %cl, 6(%rdi)
> >  ; SSE2-NEXT:    testb $-128, %al
> >  ; SSE2-NEXT:    je .LBB12_16
> >  ; SSE2-NEXT:  .LBB12_15: # %cond.store13
> > -; SSE2-NEXT:    pextrw $7, %xmm0, %eax
> > -; SSE2-NEXT:    movb %al, 7(%rdi)
> > +; SSE2-NEXT:    movb %ch, 7(%rdi)
> >  ; SSE2-NEXT:    retq
> >  ;
> >  ; SSE4-LABEL: truncstore_v8i32_v8i8:
> >  ; SSE4:       # %bb.0:
> >  ; SSE4-NEXT:    pxor %xmm4, %xmm4
> > -; SSE4-NEXT:    movdqa {{.*#+}} xmm5 = [0,1,4,5,8,9,12,13,8,9,12,13,12,13,14,15]
> > +; SSE4-NEXT:    movdqa {{.*#+}} xmm5 = <0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u>
> >  ; SSE4-NEXT:    pshufb %xmm5, %xmm1
> >  ; SSE4-NEXT:    pshufb %xmm5, %xmm0
> > -; SSE4-NEXT:    punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]
> > +; SSE4-NEXT:    punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
> >  ; SSE4-NEXT:    pcmpeqd %xmm4, %xmm3
> >  ; SSE4-NEXT:    pcmpeqd %xmm1, %xmm1
> >  ; SSE4-NEXT:    pxor %xmm1, %xmm3
> > @@ -3716,40 +3829,40 @@ define void @truncstore_v8i32_v8i8(<8 x
> >  ; SSE4-NEXT:    testb $2, %al
> >  ; SSE4-NEXT:    je .LBB12_4
> >  ; SSE4-NEXT:  .LBB12_3: # %cond.store1
> > -; SSE4-NEXT:    pextrb $2, %xmm0, 1(%rdi)
> > +; SSE4-NEXT:    pextrb $1, %xmm0, 1(%rdi)
> >  ; SSE4-NEXT:    testb $4, %al
> >  ; SSE4-NEXT:    je .LBB12_6
> >  ; SSE4-NEXT:  .LBB12_5: # %cond.store3
> > -; SSE4-NEXT:    pextrb $4, %xmm0, 2(%rdi)
> > +; SSE4-NEXT:    pextrb $2, %xmm0, 2(%rdi)
> >  ; SSE4-NEXT:    testb $8, %al
> >  ; SSE4-NEXT:    je .LBB12_8
> >  ; SSE4-NEXT:  .LBB12_7: # %cond.store5
> > -; SSE4-NEXT:    pextrb $6, %xmm0, 3(%rdi)
> > +; SSE4-NEXT:    pextrb $3, %xmm0, 3(%rdi)
> >  ; SSE4-NEXT:    testb $16, %al
> >  ; SSE4-NEXT:    je .LBB12_10
> >  ; SSE4-NEXT:  .LBB12_9: # %cond.store7
> > -; SSE4-NEXT:    pextrb $8, %xmm0, 4(%rdi)
> > +; SSE4-NEXT:    pextrb $4, %xmm0, 4(%rdi)
> >  ; SSE4-NEXT:    testb $32, %al
> >  ; SSE4-NEXT:    je .LBB12_12
> >  ; SSE4-NEXT:  .LBB12_11: # %cond.store9
> > -; SSE4-NEXT:    pextrb $10, %xmm0, 5(%rdi)
> > +; SSE4-NEXT:    pextrb $5, %xmm0, 5(%rdi)
> >  ; SSE4-NEXT:    testb $64, %al
> >  ; SSE4-NEXT:    je .LBB12_14
> >  ; SSE4-NEXT:  .LBB12_13: # %cond.store11
> > -; SSE4-NEXT:    pextrb $12, %xmm0, 6(%rdi)
> > +; SSE4-NEXT:    pextrb $6, %xmm0, 6(%rdi)
> >  ; SSE4-NEXT:    testb $-128, %al
> >  ; SSE4-NEXT:    je .LBB12_16
> >  ; SSE4-NEXT:  .LBB12_15: # %cond.store13
> > -; SSE4-NEXT:    pextrb $14, %xmm0, 7(%rdi)
> > +; SSE4-NEXT:    pextrb $7, %xmm0, 7(%rdi)
> >  ; SSE4-NEXT:    retq
> >  ;
> >  ; AVX1-LABEL: truncstore_v8i32_v8i8:
> >  ; AVX1:       # %bb.0:
> >  ; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm2
> > -; AVX1-NEXT:    vmovdqa {{.*#+}} xmm3 = [0,1,4,5,8,9,12,13,8,9,12,13,12,13,14,15]
> > +; AVX1-NEXT:    vmovdqa {{.*#+}} xmm3 = <0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u>
> >  ; AVX1-NEXT:    vpshufb %xmm3, %xmm2, %xmm2
> >  ; AVX1-NEXT:    vpshufb %xmm3, %xmm0, %xmm0
> > -; AVX1-NEXT:    vpunpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]
> > +; AVX1-NEXT:    vpunpckldq {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1]
> >  ; AVX1-NEXT:    vextractf128 $1, %ymm1, %xmm2
> >  ; AVX1-NEXT:    vpxor %xmm3, %xmm3, %xmm3
> >  ; AVX1-NEXT:    vpcmpeqd %xmm3, %xmm2, %xmm2
> > @@ -3788,39 +3901,42 @@ define void @truncstore_v8i32_v8i8(<8 x
> >  ; AVX1-NEXT:    testb $2, %al
> >  ; AVX1-NEXT:    je .LBB12_4
> >  ; AVX1-NEXT:  .LBB12_3: # %cond.store1
> > -; AVX1-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> > +; AVX1-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX1-NEXT:    testb $4, %al
> >  ; AVX1-NEXT:    je .LBB12_6
> >  ; AVX1-NEXT:  .LBB12_5: # %cond.store3
> > -; AVX1-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> > +; AVX1-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> >  ; AVX1-NEXT:    testb $8, %al
> >  ; AVX1-NEXT:    je .LBB12_8
> >  ; AVX1-NEXT:  .LBB12_7: # %cond.store5
> > -; AVX1-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> > +; AVX1-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> >  ; AVX1-NEXT:    testb $16, %al
> >  ; AVX1-NEXT:    je .LBB12_10
> >  ; AVX1-NEXT:  .LBB12_9: # %cond.store7
> > -; AVX1-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> > +; AVX1-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
> >  ; AVX1-NEXT:    testb $32, %al
> >  ; AVX1-NEXT:    je .LBB12_12
> >  ; AVX1-NEXT:  .LBB12_11: # %cond.store9
> > -; AVX1-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> > +; AVX1-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
> >  ; AVX1-NEXT:    testb $64, %al
> >  ; AVX1-NEXT:    je .LBB12_14
> >  ; AVX1-NEXT:  .LBB12_13: # %cond.store11
> > -; AVX1-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> > +; AVX1-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
> >  ; AVX1-NEXT:    testb $-128, %al
> >  ; AVX1-NEXT:    je .LBB12_16
> >  ; AVX1-NEXT:  .LBB12_15: # %cond.store13
> > -; AVX1-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> > +; AVX1-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
> >  ; AVX1-NEXT:    vzeroupper
> >  ; AVX1-NEXT:    retq
> >  ;
> >  ; AVX2-LABEL: truncstore_v8i32_v8i8:
> >  ; AVX2:       # %bb.0:
> >  ; AVX2-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > -; AVX2-NEXT:    vpshufb {{.*#+}} ymm0 = ymm0[0,1,4,5,8,9,12,13,8,9,12,13,12,13,14,15,16,17,20,21,24,25,28,29,24,25,28,29,28,29,30,31]
> > -; AVX2-NEXT:    vpermq {{.*#+}} ymm0 = ymm0[0,2,2,3]
> > +; AVX2-NEXT:    vextracti128 $1, %ymm0, %xmm3
> > +; AVX2-NEXT:    vmovdqa {{.*#+}} xmm4 = <0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u>
> > +; AVX2-NEXT:    vpshufb %xmm4, %xmm3, %xmm3
> > +; AVX2-NEXT:    vpshufb %xmm4, %xmm0, %xmm0
> > +; AVX2-NEXT:    vpunpckldq {{.*#+}} xmm0 = xmm0[0],xmm3[0],xmm0[1],xmm3[1]
> >  ; AVX2-NEXT:    vpcmpeqd %ymm2, %ymm1, %ymm1
> >  ; AVX2-NEXT:    vmovmskps %ymm1, %eax
> >  ; AVX2-NEXT:    notl %eax
> > @@ -3855,31 +3971,31 @@ define void @truncstore_v8i32_v8i8(<8 x
> >  ; AVX2-NEXT:    testb $2, %al
> >  ; AVX2-NEXT:    je .LBB12_4
> >  ; AVX2-NEXT:  .LBB12_3: # %cond.store1
> > -; AVX2-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> > +; AVX2-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX2-NEXT:    testb $4, %al
> >  ; AVX2-NEXT:    je .LBB12_6
> >  ; AVX2-NEXT:  .LBB12_5: # %cond.store3
> > -; AVX2-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> > +; AVX2-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> >  ; AVX2-NEXT:    testb $8, %al
> >  ; AVX2-NEXT:    je .LBB12_8
> >  ; AVX2-NEXT:  .LBB12_7: # %cond.store5
> > -; AVX2-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> > +; AVX2-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> >  ; AVX2-NEXT:    testb $16, %al
> >  ; AVX2-NEXT:    je .LBB12_10
> >  ; AVX2-NEXT:  .LBB12_9: # %cond.store7
> > -; AVX2-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> > +; AVX2-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
> >  ; AVX2-NEXT:    testb $32, %al
> >  ; AVX2-NEXT:    je .LBB12_12
> >  ; AVX2-NEXT:  .LBB12_11: # %cond.store9
> > -; AVX2-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> > +; AVX2-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
> >  ; AVX2-NEXT:    testb $64, %al
> >  ; AVX2-NEXT:    je .LBB12_14
> >  ; AVX2-NEXT:  .LBB12_13: # %cond.store11
> > -; AVX2-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> > +; AVX2-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
> >  ; AVX2-NEXT:    testb $-128, %al
> >  ; AVX2-NEXT:    je .LBB12_16
> >  ; AVX2-NEXT:  .LBB12_15: # %cond.store13
> > -; AVX2-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> > +; AVX2-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
> >  ; AVX2-NEXT:    vzeroupper
> >  ; AVX2-NEXT:    retq
> >  ;
> > @@ -3888,7 +4004,7 @@ define void @truncstore_v8i32_v8i8(<8 x
> >  ; AVX512F-NEXT:    # kill: def $ymm1 killed $ymm1 def $zmm1
> >  ; AVX512F-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
> >  ; AVX512F-NEXT:    vptestmd %zmm1, %zmm1, %k0
> > -; AVX512F-NEXT:    vpmovdw %zmm0, %ymm0
> > +; AVX512F-NEXT:    vpmovdb %zmm0, %xmm0
> >  ; AVX512F-NEXT:    kmovw %k0, %eax
> >  ; AVX512F-NEXT:    testb $1, %al
> >  ; AVX512F-NEXT:    jne .LBB12_1
> > @@ -3921,31 +4037,31 @@ define void @truncstore_v8i32_v8i8(<8 x
> >  ; AVX512F-NEXT:    testb $2, %al
> >  ; AVX512F-NEXT:    je .LBB12_4
> >  ; AVX512F-NEXT:  .LBB12_3: # %cond.store1
> > -; AVX512F-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> > +; AVX512F-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX512F-NEXT:    testb $4, %al
> >  ; AVX512F-NEXT:    je .LBB12_6
> >  ; AVX512F-NEXT:  .LBB12_5: # %cond.store3
> > -; AVX512F-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> > +; AVX512F-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> >  ; AVX512F-NEXT:    testb $8, %al
> >  ; AVX512F-NEXT:    je .LBB12_8
> >  ; AVX512F-NEXT:  .LBB12_7: # %cond.store5
> > -; AVX512F-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> > +; AVX512F-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> >  ; AVX512F-NEXT:    testb $16, %al
> >  ; AVX512F-NEXT:    je .LBB12_10
> >  ; AVX512F-NEXT:  .LBB12_9: # %cond.store7
> > -; AVX512F-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> > +; AVX512F-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
> >  ; AVX512F-NEXT:    testb $32, %al
> >  ; AVX512F-NEXT:    je .LBB12_12
> >  ; AVX512F-NEXT:  .LBB12_11: # %cond.store9
> > -; AVX512F-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> > +; AVX512F-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
> >  ; AVX512F-NEXT:    testb $64, %al
> >  ; AVX512F-NEXT:    je .LBB12_14
> >  ; AVX512F-NEXT:  .LBB12_13: # %cond.store11
> > -; AVX512F-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> > +; AVX512F-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
> >  ; AVX512F-NEXT:    testb $-128, %al
> >  ; AVX512F-NEXT:    je .LBB12_16
> >  ; AVX512F-NEXT:  .LBB12_15: # %cond.store13
> > -; AVX512F-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> > +; AVX512F-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
> >  ; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> >  ;
> > @@ -3954,10 +4070,9 @@ define void @truncstore_v8i32_v8i8(<8 x
> >  ; AVX512BW-NEXT:    # kill: def $ymm1 killed $ymm1 def $zmm1
> >  ; AVX512BW-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
> >  ; AVX512BW-NEXT:    vptestmd %zmm1, %zmm1, %k0
> > -; AVX512BW-NEXT:    vpmovdw %zmm0, %ymm0
> > -; AVX512BW-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u]
> >  ; AVX512BW-NEXT:    kshiftlq $56, %k0, %k0
> >  ; AVX512BW-NEXT:    kshiftrq $56, %k0, %k1
> > +; AVX512BW-NEXT:    vpmovdb %zmm0, %xmm0
> >  ; AVX512BW-NEXT:    vmovdqu8 %zmm0, (%rdi) {%k1}
> >  ; AVX512BW-NEXT:    vzeroupper
> >  ; AVX512BW-NEXT:    retq
> > @@ -3978,6 +4093,9 @@ define void @truncstore_v4i32_v4i16(<4 x
> >  ; SSE2-LABEL: truncstore_v4i32_v4i16:
> >  ; SSE2:       # %bb.0:
> >  ; SSE2-NEXT:    pxor %xmm2, %xmm2
> > +; SSE2-NEXT:    pshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> > +; SSE2-NEXT:    pshufhw {{.*#+}} xmm0 = xmm0[0,1,2,3,4,6,6,7]
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> >  ; SSE2-NEXT:    pcmpeqd %xmm1, %xmm2
> >  ; SSE2-NEXT:    movmskps %xmm2, %eax
> >  ; SSE2-NEXT:    xorl $15, %eax
> > @@ -4000,23 +4118,24 @@ define void @truncstore_v4i32_v4i16(<4 x
> >  ; SSE2-NEXT:    testb $2, %al
> >  ; SSE2-NEXT:    je .LBB13_4
> >  ; SSE2-NEXT:  .LBB13_3: # %cond.store1
> > -; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> > +; SSE2-NEXT:    pextrw $1, %xmm0, %ecx
> >  ; SSE2-NEXT:    movw %cx, 2(%rdi)
> >  ; SSE2-NEXT:    testb $4, %al
> >  ; SSE2-NEXT:    je .LBB13_6
> >  ; SSE2-NEXT:  .LBB13_5: # %cond.store3
> > -; SSE2-NEXT:    pextrw $4, %xmm0, %ecx
> > +; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> >  ; SSE2-NEXT:    movw %cx, 4(%rdi)
> >  ; SSE2-NEXT:    testb $8, %al
> >  ; SSE2-NEXT:    je .LBB13_8
> >  ; SSE2-NEXT:  .LBB13_7: # %cond.store5
> > -; SSE2-NEXT:    pextrw $6, %xmm0, %eax
> > +; SSE2-NEXT:    pextrw $3, %xmm0, %eax
> >  ; SSE2-NEXT:    movw %ax, 6(%rdi)
> >  ; SSE2-NEXT:    retq
> >  ;
> >  ; SSE4-LABEL: truncstore_v4i32_v4i16:
> >  ; SSE4:       # %bb.0:
> >  ; SSE4-NEXT:    pxor %xmm2, %xmm2
> > +; SSE4-NEXT:    pshufb {{.*#+}} xmm0 = xmm0[0,1,4,5,8,9,12,13,8,9,12,13,12,13,14,15]
> >  ; SSE4-NEXT:    pcmpeqd %xmm1, %xmm2
> >  ; SSE4-NEXT:    movmskps %xmm2, %eax
> >  ; SSE4-NEXT:    xorl $15, %eax
> > @@ -4038,20 +4157,21 @@ define void @truncstore_v4i32_v4i16(<4 x
> >  ; SSE4-NEXT:    testb $2, %al
> >  ; SSE4-NEXT:    je .LBB13_4
> >  ; SSE4-NEXT:  .LBB13_3: # %cond.store1
> > -; SSE4-NEXT:    pextrw $2, %xmm0, 2(%rdi)
> > +; SSE4-NEXT:    pextrw $1, %xmm0, 2(%rdi)
> >  ; SSE4-NEXT:    testb $4, %al
> >  ; SSE4-NEXT:    je .LBB13_6
> >  ; SSE4-NEXT:  .LBB13_5: # %cond.store3
> > -; SSE4-NEXT:    pextrw $4, %xmm0, 4(%rdi)
> > +; SSE4-NEXT:    pextrw $2, %xmm0, 4(%rdi)
> >  ; SSE4-NEXT:    testb $8, %al
> >  ; SSE4-NEXT:    je .LBB13_8
> >  ; SSE4-NEXT:  .LBB13_7: # %cond.store5
> > -; SSE4-NEXT:    pextrw $6, %xmm0, 6(%rdi)
> > +; SSE4-NEXT:    pextrw $3, %xmm0, 6(%rdi)
> >  ; SSE4-NEXT:    retq
> >  ;
> >  ; AVX-LABEL: truncstore_v4i32_v4i16:
> >  ; AVX:       # %bb.0:
> >  ; AVX-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > +; AVX-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,1,4,5,8,9,12,13,8,9,12,13,12,13,14,15]
> >  ; AVX-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
> >  ; AVX-NEXT:    vmovmskps %xmm1, %eax
> >  ; AVX-NEXT:    xorl $15, %eax
> > @@ -4073,21 +4193,22 @@ define void @truncstore_v4i32_v4i16(<4 x
> >  ; AVX-NEXT:    testb $2, %al
> >  ; AVX-NEXT:    je .LBB13_4
> >  ; AVX-NEXT:  .LBB13_3: # %cond.store1
> > -; AVX-NEXT:    vpextrw $2, %xmm0, 2(%rdi)
> > +; AVX-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
> >  ; AVX-NEXT:    testb $4, %al
> >  ; AVX-NEXT:    je .LBB13_6
> >  ; AVX-NEXT:  .LBB13_5: # %cond.store3
> > -; AVX-NEXT:    vpextrw $4, %xmm0, 4(%rdi)
> > +; AVX-NEXT:    vpextrw $2, %xmm0, 4(%rdi)
> >  ; AVX-NEXT:    testb $8, %al
> >  ; AVX-NEXT:    je .LBB13_8
> >  ; AVX-NEXT:  .LBB13_7: # %cond.store5
> > -; AVX-NEXT:    vpextrw $6, %xmm0, 6(%rdi)
> > +; AVX-NEXT:    vpextrw $3, %xmm0, 6(%rdi)
> >  ; AVX-NEXT:    retq
> >  ;
> >  ; AVX512F-LABEL: truncstore_v4i32_v4i16:
> >  ; AVX512F:       # %bb.0:
> >  ; AVX512F-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> >  ; AVX512F-NEXT:    vptestmd %zmm1, %zmm1, %k0
> > +; AVX512F-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,1,4,5,8,9,12,13,8,9,12,13,12,13,14,15]
> >  ; AVX512F-NEXT:    kmovw %k0, %eax
> >  ; AVX512F-NEXT:    testb $1, %al
> >  ; AVX512F-NEXT:    jne .LBB13_1
> > @@ -4108,15 +4229,15 @@ define void @truncstore_v4i32_v4i16(<4 x
> >  ; AVX512F-NEXT:    testb $2, %al
> >  ; AVX512F-NEXT:    je .LBB13_4
> >  ; AVX512F-NEXT:  .LBB13_3: # %cond.store1
> > -; AVX512F-NEXT:    vpextrw $2, %xmm0, 2(%rdi)
> > +; AVX512F-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
> >  ; AVX512F-NEXT:    testb $4, %al
> >  ; AVX512F-NEXT:    je .LBB13_6
> >  ; AVX512F-NEXT:  .LBB13_5: # %cond.store3
> > -; AVX512F-NEXT:    vpextrw $4, %xmm0, 4(%rdi)
> > +; AVX512F-NEXT:    vpextrw $2, %xmm0, 4(%rdi)
> >  ; AVX512F-NEXT:    testb $8, %al
> >  ; AVX512F-NEXT:    je .LBB13_8
> >  ; AVX512F-NEXT:  .LBB13_7: # %cond.store5
> > -; AVX512F-NEXT:    vpextrw $6, %xmm0, 6(%rdi)
> > +; AVX512F-NEXT:    vpextrw $3, %xmm0, 6(%rdi)
> >  ; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> >  ;
> > @@ -4124,9 +4245,9 @@ define void @truncstore_v4i32_v4i16(<4 x
> >  ; AVX512BW:       # %bb.0:
> >  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> >  ; AVX512BW-NEXT:    vptestmd %zmm1, %zmm1, %k0
> > -; AVX512BW-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,1,4,5,8,9,12,13,8,9,12,13,12,13,14,15]
> >  ; AVX512BW-NEXT:    kshiftld $28, %k0, %k0
> >  ; AVX512BW-NEXT:    kshiftrd $28, %k0, %k1
> > +; AVX512BW-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,1,4,5,8,9,12,13,8,9,12,13,12,13,14,15]
> >  ; AVX512BW-NEXT:    vmovdqu16 %zmm0, (%rdi) {%k1}
> >  ; AVX512BW-NEXT:    vzeroupper
> >  ; AVX512BW-NEXT:    retq
> > @@ -4146,45 +4267,49 @@ define void @truncstore_v4i32_v4i8(<4 x
> >  ; SSE2-LABEL: truncstore_v4i32_v4i8:
> >  ; SSE2:       # %bb.0:
> >  ; SSE2-NEXT:    pxor %xmm2, %xmm2
> > +; SSE2-NEXT:    pand {{.*}}(%rip), %xmm0
> > +; SSE2-NEXT:    packuswb %xmm0, %xmm0
> > +; SSE2-NEXT:    packuswb %xmm0, %xmm0
> >  ; SSE2-NEXT:    pcmpeqd %xmm1, %xmm2
> > -; SSE2-NEXT:    movmskps %xmm2, %eax
> > -; SSE2-NEXT:    xorl $15, %eax
> > -; SSE2-NEXT:    testb $1, %al
> > +; SSE2-NEXT:    movmskps %xmm2, %ecx
> > +; SSE2-NEXT:    xorl $15, %ecx
> > +; SSE2-NEXT:    testb $1, %cl
> > +; SSE2-NEXT:    movd %xmm0, %eax
> >  ; SSE2-NEXT:    jne .LBB14_1
> >  ; SSE2-NEXT:  # %bb.2: # %else
> > -; SSE2-NEXT:    testb $2, %al
> > +; SSE2-NEXT:    testb $2, %cl
> >  ; SSE2-NEXT:    jne .LBB14_3
> >  ; SSE2-NEXT:  .LBB14_4: # %else2
> > -; SSE2-NEXT:    testb $4, %al
> > +; SSE2-NEXT:    testb $4, %cl
> >  ; SSE2-NEXT:    jne .LBB14_5
> >  ; SSE2-NEXT:  .LBB14_6: # %else4
> > -; SSE2-NEXT:    testb $8, %al
> > +; SSE2-NEXT:    testb $8, %cl
> >  ; SSE2-NEXT:    jne .LBB14_7
> >  ; SSE2-NEXT:  .LBB14_8: # %else6
> >  ; SSE2-NEXT:    retq
> >  ; SSE2-NEXT:  .LBB14_1: # %cond.store
> > -; SSE2-NEXT:    movd %xmm0, %ecx
> > -; SSE2-NEXT:    movb %cl, (%rdi)
> > -; SSE2-NEXT:    testb $2, %al
> > +; SSE2-NEXT:    movb %al, (%rdi)
> > +; SSE2-NEXT:    testb $2, %cl
> >  ; SSE2-NEXT:    je .LBB14_4
> >  ; SSE2-NEXT:  .LBB14_3: # %cond.store1
> > -; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> > -; SSE2-NEXT:    movb %cl, 1(%rdi)
> > -; SSE2-NEXT:    testb $4, %al
> > +; SSE2-NEXT:    movb %ah, 1(%rdi)
> > +; SSE2-NEXT:    testb $4, %cl
> >  ; SSE2-NEXT:    je .LBB14_6
> >  ; SSE2-NEXT:  .LBB14_5: # %cond.store3
> > -; SSE2-NEXT:    pextrw $4, %xmm0, %ecx
> > -; SSE2-NEXT:    movb %cl, 2(%rdi)
> > -; SSE2-NEXT:    testb $8, %al
> > +; SSE2-NEXT:    movl %eax, %edx
> > +; SSE2-NEXT:    shrl $16, %edx
> > +; SSE2-NEXT:    movb %dl, 2(%rdi)
> > +; SSE2-NEXT:    testb $8, %cl
> >  ; SSE2-NEXT:    je .LBB14_8
> >  ; SSE2-NEXT:  .LBB14_7: # %cond.store5
> > -; SSE2-NEXT:    pextrw $6, %xmm0, %eax
> > +; SSE2-NEXT:    shrl $24, %eax
> >  ; SSE2-NEXT:    movb %al, 3(%rdi)
> >  ; SSE2-NEXT:    retq
> >  ;
> >  ; SSE4-LABEL: truncstore_v4i32_v4i8:
> >  ; SSE4:       # %bb.0:
> >  ; SSE4-NEXT:    pxor %xmm2, %xmm2
> > +; SSE4-NEXT:    pshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
> >  ; SSE4-NEXT:    pcmpeqd %xmm1, %xmm2
> >  ; SSE4-NEXT:    movmskps %xmm2, %eax
> >  ; SSE4-NEXT:    xorl $15, %eax
> > @@ -4206,20 +4331,21 @@ define void @truncstore_v4i32_v4i8(<4 x
> >  ; SSE4-NEXT:    testb $2, %al
> >  ; SSE4-NEXT:    je .LBB14_4
> >  ; SSE4-NEXT:  .LBB14_3: # %cond.store1
> > -; SSE4-NEXT:    pextrb $4, %xmm0, 1(%rdi)
> > +; SSE4-NEXT:    pextrb $1, %xmm0, 1(%rdi)
> >  ; SSE4-NEXT:    testb $4, %al
> >  ; SSE4-NEXT:    je .LBB14_6
> >  ; SSE4-NEXT:  .LBB14_5: # %cond.store3
> > -; SSE4-NEXT:    pextrb $8, %xmm0, 2(%rdi)
> > +; SSE4-NEXT:    pextrb $2, %xmm0, 2(%rdi)
> >  ; SSE4-NEXT:    testb $8, %al
> >  ; SSE4-NEXT:    je .LBB14_8
> >  ; SSE4-NEXT:  .LBB14_7: # %cond.store5
> > -; SSE4-NEXT:    pextrb $12, %xmm0, 3(%rdi)
> > +; SSE4-NEXT:    pextrb $3, %xmm0, 3(%rdi)
> >  ; SSE4-NEXT:    retq
> >  ;
> >  ; AVX-LABEL: truncstore_v4i32_v4i8:
> >  ; AVX:       # %bb.0:
> >  ; AVX-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > +; AVX-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
> >  ; AVX-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
> >  ; AVX-NEXT:    vmovmskps %xmm1, %eax
> >  ; AVX-NEXT:    xorl $15, %eax
> > @@ -4241,21 +4367,22 @@ define void @truncstore_v4i32_v4i8(<4 x
> >  ; AVX-NEXT:    testb $2, %al
> >  ; AVX-NEXT:    je .LBB14_4
> >  ; AVX-NEXT:  .LBB14_3: # %cond.store1
> > -; AVX-NEXT:    vpextrb $4, %xmm0, 1(%rdi)
> > +; AVX-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX-NEXT:    testb $4, %al
> >  ; AVX-NEXT:    je .LBB14_6
> >  ; AVX-NEXT:  .LBB14_5: # %cond.store3
> > -; AVX-NEXT:    vpextrb $8, %xmm0, 2(%rdi)
> > +; AVX-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> >  ; AVX-NEXT:    testb $8, %al
> >  ; AVX-NEXT:    je .LBB14_8
> >  ; AVX-NEXT:  .LBB14_7: # %cond.store5
> > -; AVX-NEXT:    vpextrb $12, %xmm0, 3(%rdi)
> > +; AVX-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> >  ; AVX-NEXT:    retq
> >  ;
> >  ; AVX512F-LABEL: truncstore_v4i32_v4i8:
> >  ; AVX512F:       # %bb.0:
> >  ; AVX512F-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> >  ; AVX512F-NEXT:    vptestmd %zmm1, %zmm1, %k0
> > +; AVX512F-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
> >  ; AVX512F-NEXT:    kmovw %k0, %eax
> >  ; AVX512F-NEXT:    testb $1, %al
> >  ; AVX512F-NEXT:    jne .LBB14_1
> > @@ -4276,15 +4403,15 @@ define void @truncstore_v4i32_v4i8(<4 x
> >  ; AVX512F-NEXT:    testb $2, %al
> >  ; AVX512F-NEXT:    je .LBB14_4
> >  ; AVX512F-NEXT:  .LBB14_3: # %cond.store1
> > -; AVX512F-NEXT:    vpextrb $4, %xmm0, 1(%rdi)
> > +; AVX512F-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX512F-NEXT:    testb $4, %al
> >  ; AVX512F-NEXT:    je .LBB14_6
> >  ; AVX512F-NEXT:  .LBB14_5: # %cond.store3
> > -; AVX512F-NEXT:    vpextrb $8, %xmm0, 2(%rdi)
> > +; AVX512F-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> >  ; AVX512F-NEXT:    testb $8, %al
> >  ; AVX512F-NEXT:    je .LBB14_8
> >  ; AVX512F-NEXT:  .LBB14_7: # %cond.store5
> > -; AVX512F-NEXT:    vpextrb $12, %xmm0, 3(%rdi)
> > +; AVX512F-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> >  ; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> >  ;
> > @@ -4292,9 +4419,9 @@ define void @truncstore_v4i32_v4i8(<4 x
> >  ; AVX512BW:       # %bb.0:
> >  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> >  ; AVX512BW-NEXT:    vptestmd %zmm1, %zmm1, %k0
> > -; AVX512BW-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
> >  ; AVX512BW-NEXT:    kshiftlq $60, %k0, %k0
> >  ; AVX512BW-NEXT:    kshiftrq $60, %k0, %k1
> > +; AVX512BW-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
> >  ; AVX512BW-NEXT:    vmovdqu8 %zmm0, (%rdi) {%k1}
> >  ; AVX512BW-NEXT:    vzeroupper
> >  ; AVX512BW-NEXT:    retq
> > @@ -6147,6 +6274,8 @@ define void @truncstore_v8i16_v8i8(<8 x
> >  ; SSE2-LABEL: truncstore_v8i16_v8i8:
> >  ; SSE2:       # %bb.0:
> >  ; SSE2-NEXT:    pxor %xmm2, %xmm2
> > +; SSE2-NEXT:    pand {{.*}}(%rip), %xmm0
> > +; SSE2-NEXT:    packuswb %xmm0, %xmm0
> >  ; SSE2-NEXT:    pcmpeqw %xmm1, %xmm2
> >  ; SSE2-NEXT:    pcmpeqd %xmm1, %xmm1
> >  ; SSE2-NEXT:    pxor %xmm2, %xmm1
> > @@ -6163,17 +6292,26 @@ define void @truncstore_v8i16_v8i8(<8 x
> >  ; SSE2-NEXT:    jne .LBB17_5
> >  ; SSE2-NEXT:  .LBB17_6: # %else4
> >  ; SSE2-NEXT:    testb $8, %al
> > -; SSE2-NEXT:    jne .LBB17_7
> > +; SSE2-NEXT:    je .LBB17_8
> > +; SSE2-NEXT:  .LBB17_7: # %cond.store5
> > +; SSE2-NEXT:    shrl $24, %ecx
> > +; SSE2-NEXT:    movb %cl, 3(%rdi)
> >  ; SSE2-NEXT:  .LBB17_8: # %else6
> >  ; SSE2-NEXT:    testb $16, %al
> > -; SSE2-NEXT:    jne .LBB17_9
> > +; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> > +; SSE2-NEXT:    je .LBB17_10
> > +; SSE2-NEXT:  # %bb.9: # %cond.store7
> > +; SSE2-NEXT:    movb %cl, 4(%rdi)
> >  ; SSE2-NEXT:  .LBB17_10: # %else8
> >  ; SSE2-NEXT:    testb $32, %al
> > -; SSE2-NEXT:    jne .LBB17_11
> > +; SSE2-NEXT:    je .LBB17_12
> > +; SSE2-NEXT:  # %bb.11: # %cond.store9
> > +; SSE2-NEXT:    movb %ch, 5(%rdi)
> >  ; SSE2-NEXT:  .LBB17_12: # %else10
> >  ; SSE2-NEXT:    testb $64, %al
> > +; SSE2-NEXT:    pextrw $3, %xmm0, %ecx
> >  ; SSE2-NEXT:    jne .LBB17_13
> > -; SSE2-NEXT:  .LBB17_14: # %else12
> > +; SSE2-NEXT:  # %bb.14: # %else12
> >  ; SSE2-NEXT:    testb $-128, %al
> >  ; SSE2-NEXT:    jne .LBB17_15
> >  ; SSE2-NEXT:  .LBB17_16: # %else14
> > @@ -6183,43 +6321,28 @@ define void @truncstore_v8i16_v8i8(<8 x
> >  ; SSE2-NEXT:    testb $2, %al
> >  ; SSE2-NEXT:    je .LBB17_4
> >  ; SSE2-NEXT:  .LBB17_3: # %cond.store1
> > -; SSE2-NEXT:    shrl $16, %ecx
> > -; SSE2-NEXT:    movb %cl, 1(%rdi)
> > +; SSE2-NEXT:    movb %ch, 1(%rdi)
> >  ; SSE2-NEXT:    testb $4, %al
> >  ; SSE2-NEXT:    je .LBB17_6
> >  ; SSE2-NEXT:  .LBB17_5: # %cond.store3
> > -; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> > -; SSE2-NEXT:    movb %cl, 2(%rdi)
> > +; SSE2-NEXT:    movl %ecx, %edx
> > +; SSE2-NEXT:    shrl $16, %edx
> > +; SSE2-NEXT:    movb %dl, 2(%rdi)
> >  ; SSE2-NEXT:    testb $8, %al
> > -; SSE2-NEXT:    je .LBB17_8
> > -; SSE2-NEXT:  .LBB17_7: # %cond.store5
> > -; SSE2-NEXT:    pextrw $3, %xmm0, %ecx
> > -; SSE2-NEXT:    movb %cl, 3(%rdi)
> > -; SSE2-NEXT:    testb $16, %al
> > -; SSE2-NEXT:    je .LBB17_10
> > -; SSE2-NEXT:  .LBB17_9: # %cond.store7
> > -; SSE2-NEXT:    pextrw $4, %xmm0, %ecx
> > -; SSE2-NEXT:    movb %cl, 4(%rdi)
> > -; SSE2-NEXT:    testb $32, %al
> > -; SSE2-NEXT:    je .LBB17_12
> > -; SSE2-NEXT:  .LBB17_11: # %cond.store9
> > -; SSE2-NEXT:    pextrw $5, %xmm0, %ecx
> > -; SSE2-NEXT:    movb %cl, 5(%rdi)
> > -; SSE2-NEXT:    testb $64, %al
> > -; SSE2-NEXT:    je .LBB17_14
> > +; SSE2-NEXT:    jne .LBB17_7
> > +; SSE2-NEXT:    jmp .LBB17_8
> >  ; SSE2-NEXT:  .LBB17_13: # %cond.store11
> > -; SSE2-NEXT:    pextrw $6, %xmm0, %ecx
> >  ; SSE2-NEXT:    movb %cl, 6(%rdi)
> >  ; SSE2-NEXT:    testb $-128, %al
> >  ; SSE2-NEXT:    je .LBB17_16
> >  ; SSE2-NEXT:  .LBB17_15: # %cond.store13
> > -; SSE2-NEXT:    pextrw $7, %xmm0, %eax
> > -; SSE2-NEXT:    movb %al, 7(%rdi)
> > +; SSE2-NEXT:    movb %ch, 7(%rdi)
> >  ; SSE2-NEXT:    retq
> >  ;
> >  ; SSE4-LABEL: truncstore_v8i16_v8i8:
> >  ; SSE4:       # %bb.0:
> >  ; SSE4-NEXT:    pxor %xmm2, %xmm2
> > +; SSE4-NEXT:    pshufb {{.*#+}} xmm0 = xmm0[0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u]
> >  ; SSE4-NEXT:    pcmpeqw %xmm1, %xmm2
> >  ; SSE4-NEXT:    pcmpeqd %xmm1, %xmm1
> >  ; SSE4-NEXT:    pxor %xmm2, %xmm1
> > @@ -6255,36 +6378,37 @@ define void @truncstore_v8i16_v8i8(<8 x
> >  ; SSE4-NEXT:    testb $2, %al
> >  ; SSE4-NEXT:    je .LBB17_4
> >  ; SSE4-NEXT:  .LBB17_3: # %cond.store1
> > -; SSE4-NEXT:    pextrb $2, %xmm0, 1(%rdi)
> > +; SSE4-NEXT:    pextrb $1, %xmm0, 1(%rdi)
> >  ; SSE4-NEXT:    testb $4, %al
> >  ; SSE4-NEXT:    je .LBB17_6
> >  ; SSE4-NEXT:  .LBB17_5: # %cond.store3
> > -; SSE4-NEXT:    pextrb $4, %xmm0, 2(%rdi)
> > +; SSE4-NEXT:    pextrb $2, %xmm0, 2(%rdi)
> >  ; SSE4-NEXT:    testb $8, %al
> >  ; SSE4-NEXT:    je .LBB17_8
> >  ; SSE4-NEXT:  .LBB17_7: # %cond.store5
> > -; SSE4-NEXT:    pextrb $6, %xmm0, 3(%rdi)
> > +; SSE4-NEXT:    pextrb $3, %xmm0, 3(%rdi)
> >  ; SSE4-NEXT:    testb $16, %al
> >  ; SSE4-NEXT:    je .LBB17_10
> >  ; SSE4-NEXT:  .LBB17_9: # %cond.store7
> > -; SSE4-NEXT:    pextrb $8, %xmm0, 4(%rdi)
> > +; SSE4-NEXT:    pextrb $4, %xmm0, 4(%rdi)
> >  ; SSE4-NEXT:    testb $32, %al
> >  ; SSE4-NEXT:    je .LBB17_12
> >  ; SSE4-NEXT:  .LBB17_11: # %cond.store9
> > -; SSE4-NEXT:    pextrb $10, %xmm0, 5(%rdi)
> > +; SSE4-NEXT:    pextrb $5, %xmm0, 5(%rdi)
> >  ; SSE4-NEXT:    testb $64, %al
> >  ; SSE4-NEXT:    je .LBB17_14
> >  ; SSE4-NEXT:  .LBB17_13: # %cond.store11
> > -; SSE4-NEXT:    pextrb $12, %xmm0, 6(%rdi)
> > +; SSE4-NEXT:    pextrb $6, %xmm0, 6(%rdi)
> >  ; SSE4-NEXT:    testb $-128, %al
> >  ; SSE4-NEXT:    je .LBB17_16
> >  ; SSE4-NEXT:  .LBB17_15: # %cond.store13
> > -; SSE4-NEXT:    pextrb $14, %xmm0, 7(%rdi)
> > +; SSE4-NEXT:    pextrb $7, %xmm0, 7(%rdi)
> >  ; SSE4-NEXT:    retq
> >  ;
> >  ; AVX-LABEL: truncstore_v8i16_v8i8:
> >  ; AVX:       # %bb.0:
> >  ; AVX-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > +; AVX-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u]
> >  ; AVX-NEXT:    vpcmpeqw %xmm2, %xmm1, %xmm1
> >  ; AVX-NEXT:    vpcmpeqd %xmm2, %xmm2, %xmm2
> >  ; AVX-NEXT:    vpxor %xmm2, %xmm1, %xmm1
> > @@ -6320,31 +6444,31 @@ define void @truncstore_v8i16_v8i8(<8 x
> >  ; AVX-NEXT:    testb $2, %al
> >  ; AVX-NEXT:    je .LBB17_4
> >  ; AVX-NEXT:  .LBB17_3: # %cond.store1
> > -; AVX-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> > +; AVX-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX-NEXT:    testb $4, %al
> >  ; AVX-NEXT:    je .LBB17_6
> >  ; AVX-NEXT:  .LBB17_5: # %cond.store3
> > -; AVX-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> > +; AVX-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> >  ; AVX-NEXT:    testb $8, %al
> >  ; AVX-NEXT:    je .LBB17_8
> >  ; AVX-NEXT:  .LBB17_7: # %cond.store5
> > -; AVX-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> > +; AVX-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> >  ; AVX-NEXT:    testb $16, %al
> >  ; AVX-NEXT:    je .LBB17_10
> >  ; AVX-NEXT:  .LBB17_9: # %cond.store7
> > -; AVX-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> > +; AVX-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
> >  ; AVX-NEXT:    testb $32, %al
> >  ; AVX-NEXT:    je .LBB17_12
> >  ; AVX-NEXT:  .LBB17_11: # %cond.store9
> > -; AVX-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> > +; AVX-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
> >  ; AVX-NEXT:    testb $64, %al
> >  ; AVX-NEXT:    je .LBB17_14
> >  ; AVX-NEXT:  .LBB17_13: # %cond.store11
> > -; AVX-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> > +; AVX-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
> >  ; AVX-NEXT:    testb $-128, %al
> >  ; AVX-NEXT:    je .LBB17_16
> >  ; AVX-NEXT:  .LBB17_15: # %cond.store13
> > -; AVX-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> > +; AVX-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
> >  ; AVX-NEXT:    retq
> >  ;
> >  ; AVX512F-LABEL: truncstore_v8i16_v8i8:
> > @@ -6354,6 +6478,7 @@ define void @truncstore_v8i16_v8i8(<8 x
> >  ; AVX512F-NEXT:    vpternlogq $15, %zmm1, %zmm1, %zmm1
> >  ; AVX512F-NEXT:    vpmovsxwq %xmm1, %zmm1
> >  ; AVX512F-NEXT:    vptestmq %zmm1, %zmm1, %k0
> > +; AVX512F-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u]
> >  ; AVX512F-NEXT:    kmovw %k0, %eax
> >  ; AVX512F-NEXT:    testb $1, %al
> >  ; AVX512F-NEXT:    jne .LBB17_1
> > @@ -6386,31 +6511,31 @@ define void @truncstore_v8i16_v8i8(<8 x
> >  ; AVX512F-NEXT:    testb $2, %al
> >  ; AVX512F-NEXT:    je .LBB17_4
> >  ; AVX512F-NEXT:  .LBB17_3: # %cond.store1
> > -; AVX512F-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> > +; AVX512F-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX512F-NEXT:    testb $4, %al
> >  ; AVX512F-NEXT:    je .LBB17_6
> >  ; AVX512F-NEXT:  .LBB17_5: # %cond.store3
> > -; AVX512F-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> > +; AVX512F-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> >  ; AVX512F-NEXT:    testb $8, %al
> >  ; AVX512F-NEXT:    je .LBB17_8
> >  ; AVX512F-NEXT:  .LBB17_7: # %cond.store5
> > -; AVX512F-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> > +; AVX512F-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> >  ; AVX512F-NEXT:    testb $16, %al
> >  ; AVX512F-NEXT:    je .LBB17_10
> >  ; AVX512F-NEXT:  .LBB17_9: # %cond.store7
> > -; AVX512F-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> > +; AVX512F-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
> >  ; AVX512F-NEXT:    testb $32, %al
> >  ; AVX512F-NEXT:    je .LBB17_12
> >  ; AVX512F-NEXT:  .LBB17_11: # %cond.store9
> > -; AVX512F-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> > +; AVX512F-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
> >  ; AVX512F-NEXT:    testb $64, %al
> >  ; AVX512F-NEXT:    je .LBB17_14
> >  ; AVX512F-NEXT:  .LBB17_13: # %cond.store11
> > -; AVX512F-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> > +; AVX512F-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
> >  ; AVX512F-NEXT:    testb $-128, %al
> >  ; AVX512F-NEXT:    je .LBB17_16
> >  ; AVX512F-NEXT:  .LBB17_15: # %cond.store13
> > -; AVX512F-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> > +; AVX512F-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
> >  ; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> >  ;
> > @@ -6418,9 +6543,9 @@ define void @truncstore_v8i16_v8i8(<8 x
> >  ; AVX512BW:       # %bb.0:
> >  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> >  ; AVX512BW-NEXT:    vptestmw %zmm1, %zmm1, %k0
> > -; AVX512BW-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u]
> >  ; AVX512BW-NEXT:    kshiftlq $56, %k0, %k0
> >  ; AVX512BW-NEXT:    kshiftrq $56, %k0, %k1
> > +; AVX512BW-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u]
> >  ; AVX512BW-NEXT:    vmovdqu8 %zmm0, (%rdi) {%k1}
> >  ; AVX512BW-NEXT:    vzeroupper
> >  ; AVX512BW-NEXT:    retq
> >
> > Modified: llvm/trunk/test/CodeGen/X86/masked_store_trunc_ssat.ll
> > URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/masked_store_trunc_ssat.ll?rev=368183&r1=368182&r2=368183&view=diff
> > ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/masked_store_trunc_ssat.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/masked_store_trunc_ssat.ll Wed Aug  7 09:24:26 2019
> > @@ -948,7 +948,7 @@ define void @truncstore_v8i64_v8i8(<8 x
> >  ; SSE2-NEXT:    pxor %xmm8, %xmm8
> >  ; SSE2-NEXT:    movdqa {{.*#+}} xmm9 = [127,127]
> >  ; SSE2-NEXT:    movdqa {{.*#+}} xmm11 = [2147483648,2147483648]
> > -; SSE2-NEXT:    movdqa %xmm2, %xmm6
> > +; SSE2-NEXT:    movdqa %xmm3, %xmm6
> >  ; SSE2-NEXT:    pxor %xmm11, %xmm6
> >  ; SSE2-NEXT:    movdqa {{.*#+}} xmm10 = [2147483775,2147483775]
> >  ; SSE2-NEXT:    movdqa %xmm10, %xmm7
> > @@ -959,23 +959,10 @@ define void @truncstore_v8i64_v8i8(<8 x
> >  ; SSE2-NEXT:    pand %xmm12, %xmm6
> >  ; SSE2-NEXT:    pshufd {{.*#+}} xmm13 = xmm7[1,1,3,3]
> >  ; SSE2-NEXT:    por %xmm6, %xmm13
> > -; SSE2-NEXT:    pand %xmm13, %xmm2
> > +; SSE2-NEXT:    pand %xmm13, %xmm3
> >  ; SSE2-NEXT:    pandn %xmm9, %xmm13
> > -; SSE2-NEXT:    por %xmm2, %xmm13
> > -; SSE2-NEXT:    movdqa %xmm3, %xmm2
> > -; SSE2-NEXT:    pxor %xmm11, %xmm2
> > -; SSE2-NEXT:    movdqa %xmm10, %xmm6
> > -; SSE2-NEXT:    pcmpgtd %xmm2, %xmm6
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm12 = xmm6[0,0,2,2]
> > -; SSE2-NEXT:    pcmpeqd %xmm10, %xmm2
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm7 = xmm2[1,1,3,3]
> > -; SSE2-NEXT:    pand %xmm12, %xmm7
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm2 = xmm6[1,1,3,3]
> > -; SSE2-NEXT:    por %xmm7, %xmm2
> > -; SSE2-NEXT:    pand %xmm2, %xmm3
> > -; SSE2-NEXT:    pandn %xmm9, %xmm2
> > -; SSE2-NEXT:    por %xmm3, %xmm2
> > -; SSE2-NEXT:    movdqa %xmm0, %xmm3
> > +; SSE2-NEXT:    por %xmm3, %xmm13
> > +; SSE2-NEXT:    movdqa %xmm2, %xmm3
> >  ; SSE2-NEXT:    pxor %xmm11, %xmm3
> >  ; SSE2-NEXT:    movdqa %xmm10, %xmm6
> >  ; SSE2-NEXT:    pcmpgtd %xmm3, %xmm6
> > @@ -985,78 +972,97 @@ define void @truncstore_v8i64_v8i8(<8 x
> >  ; SSE2-NEXT:    pand %xmm12, %xmm7
> >  ; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm6[1,1,3,3]
> >  ; SSE2-NEXT:    por %xmm7, %xmm3
> > -; SSE2-NEXT:    pand %xmm3, %xmm0
> > +; SSE2-NEXT:    pand %xmm3, %xmm2
> >  ; SSE2-NEXT:    pandn %xmm9, %xmm3
> > -; SSE2-NEXT:    por %xmm0, %xmm3
> > -; SSE2-NEXT:    movdqa %xmm1, %xmm0
> > -; SSE2-NEXT:    pxor %xmm11, %xmm0
> > +; SSE2-NEXT:    por %xmm2, %xmm3
> > +; SSE2-NEXT:    movdqa %xmm1, %xmm2
> > +; SSE2-NEXT:    pxor %xmm11, %xmm2
> > +; SSE2-NEXT:    movdqa %xmm10, %xmm6
> > +; SSE2-NEXT:    pcmpgtd %xmm2, %xmm6
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm12 = xmm6[0,0,2,2]
> > +; SSE2-NEXT:    pcmpeqd %xmm10, %xmm2
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm7 = xmm2[1,1,3,3]
> > +; SSE2-NEXT:    pand %xmm12, %xmm7
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm2 = xmm6[1,1,3,3]
> > +; SSE2-NEXT:    por %xmm7, %xmm2
> > +; SSE2-NEXT:    pand %xmm2, %xmm1
> > +; SSE2-NEXT:    pandn %xmm9, %xmm2
> > +; SSE2-NEXT:    por %xmm1, %xmm2
> > +; SSE2-NEXT:    movdqa %xmm0, %xmm1
> > +; SSE2-NEXT:    pxor %xmm11, %xmm1
> >  ; SSE2-NEXT:    movdqa %xmm10, %xmm6
> > -; SSE2-NEXT:    pcmpgtd %xmm0, %xmm6
> > +; SSE2-NEXT:    pcmpgtd %xmm1, %xmm6
> >  ; SSE2-NEXT:    pshufd {{.*#+}} xmm7 = xmm6[0,0,2,2]
> > -; SSE2-NEXT:    pcmpeqd %xmm10, %xmm0
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,1,3,3]
> > -; SSE2-NEXT:    pand %xmm7, %xmm0
> > +; SSE2-NEXT:    pcmpeqd %xmm10, %xmm1
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
> > +; SSE2-NEXT:    pand %xmm7, %xmm1
> >  ; SSE2-NEXT:    pshufd {{.*#+}} xmm6 = xmm6[1,1,3,3]
> > -; SSE2-NEXT:    por %xmm0, %xmm6
> > -; SSE2-NEXT:    pand %xmm6, %xmm1
> > -; SSE2-NEXT:    pandn %xmm9, %xmm6
> >  ; SSE2-NEXT:    por %xmm1, %xmm6
> > +; SSE2-NEXT:    pand %xmm6, %xmm0
> > +; SSE2-NEXT:    pandn %xmm9, %xmm6
> > +; SSE2-NEXT:    por %xmm0, %xmm6
> >  ; SSE2-NEXT:    movdqa {{.*#+}} xmm9 = [18446744073709551488,18446744073709551488]
> >  ; SSE2-NEXT:    movdqa %xmm6, %xmm0
> >  ; SSE2-NEXT:    pxor %xmm11, %xmm0
> >  ; SSE2-NEXT:    movdqa {{.*#+}} xmm10 = [18446744071562067840,18446744071562067840]
> > -; SSE2-NEXT:    movdqa %xmm0, %xmm7
> > -; SSE2-NEXT:    pcmpgtd %xmm10, %xmm7
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm7[0,0,2,2]
> > -; SSE2-NEXT:    pcmpeqd %xmm10, %xmm0
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,1,3,3]
> > -; SSE2-NEXT:    pand %xmm1, %xmm0
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm7[1,1,3,3]
> > -; SSE2-NEXT:    por %xmm0, %xmm1
> > -; SSE2-NEXT:    pand %xmm1, %xmm6
> > -; SSE2-NEXT:    pandn %xmm9, %xmm1
> > -; SSE2-NEXT:    por %xmm6, %xmm1
> > -; SSE2-NEXT:    movdqa %xmm3, %xmm0
> > -; SSE2-NEXT:    pxor %xmm11, %xmm0
> > -; SSE2-NEXT:    movdqa %xmm0, %xmm6
> > -; SSE2-NEXT:    pcmpgtd %xmm10, %xmm6
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm12 = xmm6[0,0,2,2]
> > +; SSE2-NEXT:    movdqa %xmm0, %xmm1
> > +; SSE2-NEXT:    pcmpgtd %xmm10, %xmm1
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm12 = xmm1[0,0,2,2]
> >  ; SSE2-NEXT:    pcmpeqd %xmm10, %xmm0
> >  ; SSE2-NEXT:    pshufd {{.*#+}} xmm7 = xmm0[1,1,3,3]
> >  ; SSE2-NEXT:    pand %xmm12, %xmm7
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm6[1,1,3,3]
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm1[1,1,3,3]
> >  ; SSE2-NEXT:    por %xmm7, %xmm0
> > -; SSE2-NEXT:    pand %xmm0, %xmm3
> > +; SSE2-NEXT:    pand %xmm0, %xmm6
> >  ; SSE2-NEXT:    pandn %xmm9, %xmm0
> > -; SSE2-NEXT:    por %xmm3, %xmm0
> > -; SSE2-NEXT:    packssdw %xmm1, %xmm0
> > +; SSE2-NEXT:    por %xmm6, %xmm0
> >  ; SSE2-NEXT:    movdqa %xmm2, %xmm1
> >  ; SSE2-NEXT:    pxor %xmm11, %xmm1
> > -; SSE2-NEXT:    movdqa %xmm1, %xmm3
> > +; SSE2-NEXT:    movdqa %xmm1, %xmm6
> > +; SSE2-NEXT:    pcmpgtd %xmm10, %xmm6
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm12 = xmm6[0,0,2,2]
> > +; SSE2-NEXT:    pcmpeqd %xmm10, %xmm1
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm7 = xmm1[1,1,3,3]
> > +; SSE2-NEXT:    pand %xmm12, %xmm7
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm6[1,1,3,3]
> > +; SSE2-NEXT:    por %xmm7, %xmm1
> > +; SSE2-NEXT:    pand %xmm1, %xmm2
> > +; SSE2-NEXT:    pandn %xmm9, %xmm1
> > +; SSE2-NEXT:    por %xmm2, %xmm1
> > +; SSE2-NEXT:    movdqa %xmm3, %xmm2
> > +; SSE2-NEXT:    pxor %xmm11, %xmm2
> > +; SSE2-NEXT:    movdqa %xmm2, %xmm6
> > +; SSE2-NEXT:    pcmpgtd %xmm10, %xmm6
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm12 = xmm6[0,0,2,2]
> > +; SSE2-NEXT:    pcmpeqd %xmm10, %xmm2
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm7 = xmm2[1,1,3,3]
> > +; SSE2-NEXT:    pand %xmm12, %xmm7
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm2 = xmm6[1,1,3,3]
> > +; SSE2-NEXT:    por %xmm7, %xmm2
> > +; SSE2-NEXT:    pand %xmm2, %xmm3
> > +; SSE2-NEXT:    pandn %xmm9, %xmm2
> > +; SSE2-NEXT:    por %xmm3, %xmm2
> > +; SSE2-NEXT:    pxor %xmm13, %xmm11
> > +; SSE2-NEXT:    movdqa %xmm11, %xmm3
> >  ; SSE2-NEXT:    pcmpgtd %xmm10, %xmm3
> >  ; SSE2-NEXT:    pshufd {{.*#+}} xmm6 = xmm3[0,0,2,2]
> > -; SSE2-NEXT:    pcmpeqd %xmm10, %xmm1
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
> > -; SSE2-NEXT:    pand %xmm6, %xmm1
> > +; SSE2-NEXT:    pcmpeqd %xmm10, %xmm11
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm7 = xmm11[1,1,3,3]
> > +; SSE2-NEXT:    pand %xmm6, %xmm7
> >  ; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm3[1,1,3,3]
> > -; SSE2-NEXT:    por %xmm1, %xmm3
> > -; SSE2-NEXT:    pand %xmm3, %xmm2
> > +; SSE2-NEXT:    por %xmm7, %xmm3
> > +; SSE2-NEXT:    pand %xmm3, %xmm13
> >  ; SSE2-NEXT:    pandn %xmm9, %xmm3
> > -; SSE2-NEXT:    por %xmm2, %xmm3
> > -; SSE2-NEXT:    pxor %xmm13, %xmm11
> > -; SSE2-NEXT:    movdqa %xmm11, %xmm1
> > -; SSE2-NEXT:    pcmpgtd %xmm10, %xmm1
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm2 = xmm1[0,0,2,2]
> > -; SSE2-NEXT:    pcmpeqd %xmm10, %xmm11
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm6 = xmm11[1,1,3,3]
> > -; SSE2-NEXT:    pand %xmm2, %xmm6
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
> > -; SSE2-NEXT:    por %xmm6, %xmm1
> > -; SSE2-NEXT:    pand %xmm1, %xmm13
> > -; SSE2-NEXT:    pandn %xmm9, %xmm1
> > -; SSE2-NEXT:    por %xmm13, %xmm1
> > -; SSE2-NEXT:    packssdw %xmm3, %xmm1
> > -; SSE2-NEXT:    packssdw %xmm1, %xmm0
> > +; SSE2-NEXT:    por %xmm13, %xmm3
> > +; SSE2-NEXT:    movdqa {{.*#+}} xmm6 = [255,0,0,0,0,0,0,0,255,0,0,0,0,0,0,0]
> > +; SSE2-NEXT:    pand %xmm6, %xmm3
> > +; SSE2-NEXT:    pand %xmm6, %xmm2
> > +; SSE2-NEXT:    packuswb %xmm3, %xmm2
> > +; SSE2-NEXT:    pand %xmm6, %xmm1
> > +; SSE2-NEXT:    pand %xmm6, %xmm0
> > +; SSE2-NEXT:    packuswb %xmm1, %xmm0
> > +; SSE2-NEXT:    packuswb %xmm2, %xmm0
> > +; SSE2-NEXT:    packuswb %xmm0, %xmm0
> >  ; SSE2-NEXT:    pcmpeqd %xmm8, %xmm5
> >  ; SSE2-NEXT:    pcmpeqd %xmm1, %xmm1
> >  ; SSE2-NEXT:    pxor %xmm1, %xmm5
> > @@ -1076,17 +1082,26 @@ define void @truncstore_v8i64_v8i8(<8 x
> >  ; SSE2-NEXT:    jne .LBB2_5
> >  ; SSE2-NEXT:  .LBB2_6: # %else4
> >  ; SSE2-NEXT:    testb $8, %al
> > -; SSE2-NEXT:    jne .LBB2_7
> > +; SSE2-NEXT:    je .LBB2_8
> > +; SSE2-NEXT:  .LBB2_7: # %cond.store5
> > +; SSE2-NEXT:    shrl $24, %ecx
> > +; SSE2-NEXT:    movb %cl, 3(%rdi)
> >  ; SSE2-NEXT:  .LBB2_8: # %else6
> >  ; SSE2-NEXT:    testb $16, %al
> > -; SSE2-NEXT:    jne .LBB2_9
> > +; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> > +; SSE2-NEXT:    je .LBB2_10
> > +; SSE2-NEXT:  # %bb.9: # %cond.store7
> > +; SSE2-NEXT:    movb %cl, 4(%rdi)
> >  ; SSE2-NEXT:  .LBB2_10: # %else8
> >  ; SSE2-NEXT:    testb $32, %al
> > -; SSE2-NEXT:    jne .LBB2_11
> > +; SSE2-NEXT:    je .LBB2_12
> > +; SSE2-NEXT:  # %bb.11: # %cond.store9
> > +; SSE2-NEXT:    movb %ch, 5(%rdi)
> >  ; SSE2-NEXT:  .LBB2_12: # %else10
> >  ; SSE2-NEXT:    testb $64, %al
> > +; SSE2-NEXT:    pextrw $3, %xmm0, %ecx
> >  ; SSE2-NEXT:    jne .LBB2_13
> > -; SSE2-NEXT:  .LBB2_14: # %else12
> > +; SSE2-NEXT:  # %bb.14: # %else12
> >  ; SSE2-NEXT:    testb $-128, %al
> >  ; SSE2-NEXT:    jne .LBB2_15
> >  ; SSE2-NEXT:  .LBB2_16: # %else14
> > @@ -1096,38 +1111,22 @@ define void @truncstore_v8i64_v8i8(<8 x
> >  ; SSE2-NEXT:    testb $2, %al
> >  ; SSE2-NEXT:    je .LBB2_4
> >  ; SSE2-NEXT:  .LBB2_3: # %cond.store1
> > -; SSE2-NEXT:    shrl $16, %ecx
> > -; SSE2-NEXT:    movb %cl, 1(%rdi)
> > +; SSE2-NEXT:    movb %ch, 1(%rdi)
> >  ; SSE2-NEXT:    testb $4, %al
> >  ; SSE2-NEXT:    je .LBB2_6
> >  ; SSE2-NEXT:  .LBB2_5: # %cond.store3
> > -; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> > -; SSE2-NEXT:    movb %cl, 2(%rdi)
> > +; SSE2-NEXT:    movl %ecx, %edx
> > +; SSE2-NEXT:    shrl $16, %edx
> > +; SSE2-NEXT:    movb %dl, 2(%rdi)
> >  ; SSE2-NEXT:    testb $8, %al
> > -; SSE2-NEXT:    je .LBB2_8
> > -; SSE2-NEXT:  .LBB2_7: # %cond.store5
> > -; SSE2-NEXT:    pextrw $3, %xmm0, %ecx
> > -; SSE2-NEXT:    movb %cl, 3(%rdi)
> > -; SSE2-NEXT:    testb $16, %al
> > -; SSE2-NEXT:    je .LBB2_10
> > -; SSE2-NEXT:  .LBB2_9: # %cond.store7
> > -; SSE2-NEXT:    pextrw $4, %xmm0, %ecx
> > -; SSE2-NEXT:    movb %cl, 4(%rdi)
> > -; SSE2-NEXT:    testb $32, %al
> > -; SSE2-NEXT:    je .LBB2_12
> > -; SSE2-NEXT:  .LBB2_11: # %cond.store9
> > -; SSE2-NEXT:    pextrw $5, %xmm0, %ecx
> > -; SSE2-NEXT:    movb %cl, 5(%rdi)
> > -; SSE2-NEXT:    testb $64, %al
> > -; SSE2-NEXT:    je .LBB2_14
> > +; SSE2-NEXT:    jne .LBB2_7
> > +; SSE2-NEXT:    jmp .LBB2_8
> >  ; SSE2-NEXT:  .LBB2_13: # %cond.store11
> > -; SSE2-NEXT:    pextrw $6, %xmm0, %ecx
> >  ; SSE2-NEXT:    movb %cl, 6(%rdi)
> >  ; SSE2-NEXT:    testb $-128, %al
> >  ; SSE2-NEXT:    je .LBB2_16
> >  ; SSE2-NEXT:  .LBB2_15: # %cond.store13
> > -; SSE2-NEXT:    pextrw $7, %xmm0, %eax
> > -; SSE2-NEXT:    movb %al, 7(%rdi)
> > +; SSE2-NEXT:    movb %ch, 7(%rdi)
> >  ; SSE2-NEXT:    retq
> >  ;
> >  ; SSE4-LABEL: truncstore_v8i64_v8i8:
> > @@ -1136,39 +1135,45 @@ define void @truncstore_v8i64_v8i8(<8 x
> >  ; SSE4-NEXT:    pxor %xmm8, %xmm8
> >  ; SSE4-NEXT:    movdqa {{.*#+}} xmm7 = [127,127]
> >  ; SSE4-NEXT:    movdqa %xmm7, %xmm0
> > -; SSE4-NEXT:    pcmpgtq %xmm2, %xmm0
> > -; SSE4-NEXT:    movdqa %xmm7, %xmm10
> > -; SSE4-NEXT:    blendvpd %xmm0, %xmm2, %xmm10
> > -; SSE4-NEXT:    movdqa %xmm7, %xmm0
> >  ; SSE4-NEXT:    pcmpgtq %xmm3, %xmm0
> > -; SSE4-NEXT:    movdqa %xmm7, %xmm2
> > -; SSE4-NEXT:    blendvpd %xmm0, %xmm3, %xmm2
> > +; SSE4-NEXT:    movdqa %xmm7, %xmm10
> > +; SSE4-NEXT:    blendvpd %xmm0, %xmm3, %xmm10
> >  ; SSE4-NEXT:    movdqa %xmm7, %xmm0
> > -; SSE4-NEXT:    pcmpgtq %xmm9, %xmm0
> > +; SSE4-NEXT:    pcmpgtq %xmm2, %xmm0
> >  ; SSE4-NEXT:    movdqa %xmm7, %xmm3
> > -; SSE4-NEXT:    blendvpd %xmm0, %xmm9, %xmm3
> > +; SSE4-NEXT:    blendvpd %xmm0, %xmm2, %xmm3
> >  ; SSE4-NEXT:    movdqa %xmm7, %xmm0
> >  ; SSE4-NEXT:    pcmpgtq %xmm1, %xmm0
> > -; SSE4-NEXT:    blendvpd %xmm0, %xmm1, %xmm7
> > +; SSE4-NEXT:    movdqa %xmm7, %xmm2
> > +; SSE4-NEXT:    blendvpd %xmm0, %xmm1, %xmm2
> > +; SSE4-NEXT:    movdqa %xmm7, %xmm0
> > +; SSE4-NEXT:    pcmpgtq %xmm9, %xmm0
> > +; SSE4-NEXT:    blendvpd %xmm0, %xmm9, %xmm7
> >  ; SSE4-NEXT:    movdqa {{.*#+}} xmm1 = [18446744073709551488,18446744073709551488]
> >  ; SSE4-NEXT:    movapd %xmm7, %xmm0
> >  ; SSE4-NEXT:    pcmpgtq %xmm1, %xmm0
> >  ; SSE4-NEXT:    movdqa %xmm1, %xmm6
> >  ; SSE4-NEXT:    blendvpd %xmm0, %xmm7, %xmm6
> > -; SSE4-NEXT:    movapd %xmm3, %xmm0
> > +; SSE4-NEXT:    movapd %xmm2, %xmm0
> >  ; SSE4-NEXT:    pcmpgtq %xmm1, %xmm0
> >  ; SSE4-NEXT:    movdqa %xmm1, %xmm7
> > -; SSE4-NEXT:    blendvpd %xmm0, %xmm3, %xmm7
> > -; SSE4-NEXT:    packssdw %xmm6, %xmm7
> > -; SSE4-NEXT:    movapd %xmm2, %xmm0
> > +; SSE4-NEXT:    blendvpd %xmm0, %xmm2, %xmm7
> > +; SSE4-NEXT:    movapd %xmm3, %xmm0
> >  ; SSE4-NEXT:    pcmpgtq %xmm1, %xmm0
> > -; SSE4-NEXT:    movdqa %xmm1, %xmm3
> > -; SSE4-NEXT:    blendvpd %xmm0, %xmm2, %xmm3
> > +; SSE4-NEXT:    movdqa %xmm1, %xmm2
> > +; SSE4-NEXT:    blendvpd %xmm0, %xmm3, %xmm2
> >  ; SSE4-NEXT:    movapd %xmm10, %xmm0
> >  ; SSE4-NEXT:    pcmpgtq %xmm1, %xmm0
> >  ; SSE4-NEXT:    blendvpd %xmm0, %xmm10, %xmm1
> > -; SSE4-NEXT:    packssdw %xmm3, %xmm1
> > -; SSE4-NEXT:    packssdw %xmm1, %xmm7
> > +; SSE4-NEXT:    movapd {{.*#+}} xmm0 = [255,0,0,0,0,0,0,0,255,0,0,0,0,0,0,0]
> > +; SSE4-NEXT:    andpd %xmm0, %xmm1
> > +; SSE4-NEXT:    andpd %xmm0, %xmm2
> > +; SSE4-NEXT:    packusdw %xmm1, %xmm2
> > +; SSE4-NEXT:    andpd %xmm0, %xmm7
> > +; SSE4-NEXT:    andpd %xmm0, %xmm6
> > +; SSE4-NEXT:    packusdw %xmm7, %xmm6
> > +; SSE4-NEXT:    packusdw %xmm2, %xmm6
> > +; SSE4-NEXT:    packuswb %xmm6, %xmm6
> >  ; SSE4-NEXT:    pcmpeqd %xmm8, %xmm5
> >  ; SSE4-NEXT:    pcmpeqd %xmm0, %xmm0
> >  ; SSE4-NEXT:    pxor %xmm0, %xmm5
> > @@ -1203,62 +1208,74 @@ define void @truncstore_v8i64_v8i8(<8 x
> >  ; SSE4-NEXT:  .LBB2_16: # %else14
> >  ; SSE4-NEXT:    retq
> >  ; SSE4-NEXT:  .LBB2_1: # %cond.store
> > -; SSE4-NEXT:    pextrb $0, %xmm7, (%rdi)
> > +; SSE4-NEXT:    pextrb $0, %xmm6, (%rdi)
> >  ; SSE4-NEXT:    testb $2, %al
> >  ; SSE4-NEXT:    je .LBB2_4
> >  ; SSE4-NEXT:  .LBB2_3: # %cond.store1
> > -; SSE4-NEXT:    pextrb $2, %xmm7, 1(%rdi)
> > +; SSE4-NEXT:    pextrb $1, %xmm6, 1(%rdi)
> >  ; SSE4-NEXT:    testb $4, %al
> >  ; SSE4-NEXT:    je .LBB2_6
> >  ; SSE4-NEXT:  .LBB2_5: # %cond.store3
> > -; SSE4-NEXT:    pextrb $4, %xmm7, 2(%rdi)
> > +; SSE4-NEXT:    pextrb $2, %xmm6, 2(%rdi)
> >  ; SSE4-NEXT:    testb $8, %al
> >  ; SSE4-NEXT:    je .LBB2_8
> >  ; SSE4-NEXT:  .LBB2_7: # %cond.store5
> > -; SSE4-NEXT:    pextrb $6, %xmm7, 3(%rdi)
> > +; SSE4-NEXT:    pextrb $3, %xmm6, 3(%rdi)
> >  ; SSE4-NEXT:    testb $16, %al
> >  ; SSE4-NEXT:    je .LBB2_10
> >  ; SSE4-NEXT:  .LBB2_9: # %cond.store7
> > -; SSE4-NEXT:    pextrb $8, %xmm7, 4(%rdi)
> > +; SSE4-NEXT:    pextrb $4, %xmm6, 4(%rdi)
> >  ; SSE4-NEXT:    testb $32, %al
> >  ; SSE4-NEXT:    je .LBB2_12
> >  ; SSE4-NEXT:  .LBB2_11: # %cond.store9
> > -; SSE4-NEXT:    pextrb $10, %xmm7, 5(%rdi)
> > +; SSE4-NEXT:    pextrb $5, %xmm6, 5(%rdi)
> >  ; SSE4-NEXT:    testb $64, %al
> >  ; SSE4-NEXT:    je .LBB2_14
> >  ; SSE4-NEXT:  .LBB2_13: # %cond.store11
> > -; SSE4-NEXT:    pextrb $12, %xmm7, 6(%rdi)
> > +; SSE4-NEXT:    pextrb $6, %xmm6, 6(%rdi)
> >  ; SSE4-NEXT:    testb $-128, %al
> >  ; SSE4-NEXT:    je .LBB2_16
> >  ; SSE4-NEXT:  .LBB2_15: # %cond.store13
> > -; SSE4-NEXT:    pextrb $14, %xmm7, 7(%rdi)
> > +; SSE4-NEXT:    pextrb $7, %xmm6, 7(%rdi)
> >  ; SSE4-NEXT:    retq
> >  ;
> >  ; AVX1-LABEL: truncstore_v8i64_v8i8:
> >  ; AVX1:       # %bb.0:
> > -; AVX1-NEXT:    vextractf128 $1, %ymm1, %xmm3
> > -; AVX1-NEXT:    vmovdqa {{.*#+}} xmm4 = [127,127]
> > -; AVX1-NEXT:    vpcmpgtq %xmm3, %xmm4, %xmm8
> > -; AVX1-NEXT:    vpcmpgtq %xmm1, %xmm4, %xmm9
> > -; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm7
> > -; AVX1-NEXT:    vpcmpgtq %xmm7, %xmm4, %xmm5
> > -; AVX1-NEXT:    vpcmpgtq %xmm0, %xmm4, %xmm6
> > -; AVX1-NEXT:    vblendvpd %xmm6, %xmm0, %xmm4, %xmm0
> > -; AVX1-NEXT:    vmovdqa {{.*#+}} xmm6 = [18446744073709551488,18446744073709551488]
> > -; AVX1-NEXT:    vpcmpgtq %xmm6, %xmm0, %xmm10
> > -; AVX1-NEXT:    vblendvpd %xmm5, %xmm7, %xmm4, %xmm5
> > -; AVX1-NEXT:    vpcmpgtq %xmm6, %xmm5, %xmm11
> > -; AVX1-NEXT:    vblendvpd %xmm9, %xmm1, %xmm4, %xmm1
> > -; AVX1-NEXT:    vpcmpgtq %xmm6, %xmm1, %xmm7
> > -; AVX1-NEXT:    vblendvpd %xmm8, %xmm3, %xmm4, %xmm3
> > -; AVX1-NEXT:    vpcmpgtq %xmm6, %xmm3, %xmm4
> > -; AVX1-NEXT:    vblendvpd %xmm4, %xmm3, %xmm6, %xmm3
> > -; AVX1-NEXT:    vblendvpd %xmm7, %xmm1, %xmm6, %xmm1
> > -; AVX1-NEXT:    vpackssdw %xmm3, %xmm1, %xmm1
> > -; AVX1-NEXT:    vblendvpd %xmm11, %xmm5, %xmm6, %xmm3
> > -; AVX1-NEXT:    vblendvpd %xmm10, %xmm0, %xmm6, %xmm0
> > -; AVX1-NEXT:    vpackssdw %xmm3, %xmm0, %xmm0
> > -; AVX1-NEXT:    vpackssdw %xmm1, %xmm0, %xmm0
> > +; AVX1-NEXT:    vmovapd {{.*#+}} ymm9 = [127,127,127,127]
> > +; AVX1-NEXT:    vextractf128 $1, %ymm1, %xmm10
> > +; AVX1-NEXT:    vmovdqa {{.*#+}} xmm5 = [127,127]
> > +; AVX1-NEXT:    vpcmpgtq %xmm10, %xmm5, %xmm6
> > +; AVX1-NEXT:    vpcmpgtq %xmm1, %xmm5, %xmm7
> > +; AVX1-NEXT:    vinsertf128 $1, %xmm6, %ymm7, %ymm8
> > +; AVX1-NEXT:    vblendvpd %ymm8, %ymm1, %ymm9, %ymm8
> > +; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm3
> > +; AVX1-NEXT:    vpcmpgtq %xmm3, %xmm5, %xmm4
> > +; AVX1-NEXT:    vpcmpgtq %xmm0, %xmm5, %xmm11
> > +; AVX1-NEXT:    vinsertf128 $1, %xmm4, %ymm11, %ymm12
> > +; AVX1-NEXT:    vblendvpd %ymm12, %ymm0, %ymm9, %ymm9
> > +; AVX1-NEXT:    vmovapd {{.*#+}} ymm12 = [18446744073709551488,18446744073709551488,18446744073709551488,18446744073709551488]
> > +; AVX1-NEXT:    vblendvpd %xmm4, %xmm3, %xmm5, %xmm3
> > +; AVX1-NEXT:    vmovdqa {{.*#+}} xmm4 = [18446744073709551488,18446744073709551488]
> > +; AVX1-NEXT:    vpcmpgtq %xmm4, %xmm3, %xmm3
> > +; AVX1-NEXT:    vblendvpd %xmm11, %xmm0, %xmm5, %xmm0
> > +; AVX1-NEXT:    vpcmpgtq %xmm4, %xmm0, %xmm0
> > +; AVX1-NEXT:    vinsertf128 $1, %xmm3, %ymm0, %ymm0
> > +; AVX1-NEXT:    vblendvpd %ymm0, %ymm9, %ymm12, %ymm0
> > +; AVX1-NEXT:    vblendvpd %xmm6, %xmm10, %xmm5, %xmm3
> > +; AVX1-NEXT:    vpcmpgtq %xmm4, %xmm3, %xmm3
> > +; AVX1-NEXT:    vblendvpd %xmm7, %xmm1, %xmm5, %xmm1
> > +; AVX1-NEXT:    vpcmpgtq %xmm4, %xmm1, %xmm1
> > +; AVX1-NEXT:    vinsertf128 $1, %xmm3, %ymm1, %ymm1
> > +; AVX1-NEXT:    vblendvpd %ymm1, %ymm8, %ymm12, %ymm1
> > +; AVX1-NEXT:    vmovapd {{.*#+}} ymm3 = [255,255,255,255]
> > +; AVX1-NEXT:    vandpd %ymm3, %ymm1, %ymm1
> > +; AVX1-NEXT:    vextractf128 $1, %ymm1, %xmm4
> > +; AVX1-NEXT:    vpackusdw %xmm4, %xmm1, %xmm1
> > +; AVX1-NEXT:    vandpd %ymm3, %ymm0, %ymm0
> > +; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm3
> > +; AVX1-NEXT:    vpackusdw %xmm3, %xmm0, %xmm0
> > +; AVX1-NEXT:    vpackusdw %xmm1, %xmm0, %xmm0
> > +; AVX1-NEXT:    vpackuswb %xmm0, %xmm0, %xmm0
> >  ; AVX1-NEXT:    vextractf128 $1, %ymm2, %xmm1
> >  ; AVX1-NEXT:    vpxor %xmm3, %xmm3, %xmm3
> >  ; AVX1-NEXT:    vpcmpeqd %xmm3, %xmm1, %xmm1
> > @@ -1297,31 +1314,31 @@ define void @truncstore_v8i64_v8i8(<8 x
> >  ; AVX1-NEXT:    testb $2, %al
> >  ; AVX1-NEXT:    je .LBB2_4
> >  ; AVX1-NEXT:  .LBB2_3: # %cond.store1
> > -; AVX1-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> > +; AVX1-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX1-NEXT:    testb $4, %al
> >  ; AVX1-NEXT:    je .LBB2_6
> >  ; AVX1-NEXT:  .LBB2_5: # %cond.store3
> > -; AVX1-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> > +; AVX1-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> >  ; AVX1-NEXT:    testb $8, %al
> >  ; AVX1-NEXT:    je .LBB2_8
> >  ; AVX1-NEXT:  .LBB2_7: # %cond.store5
> > -; AVX1-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> > +; AVX1-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> >  ; AVX1-NEXT:    testb $16, %al
> >  ; AVX1-NEXT:    je .LBB2_10
> >  ; AVX1-NEXT:  .LBB2_9: # %cond.store7
> > -; AVX1-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> > +; AVX1-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
> >  ; AVX1-NEXT:    testb $32, %al
> >  ; AVX1-NEXT:    je .LBB2_12
> >  ; AVX1-NEXT:  .LBB2_11: # %cond.store9
> > -; AVX1-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> > +; AVX1-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
> >  ; AVX1-NEXT:    testb $64, %al
> >  ; AVX1-NEXT:    je .LBB2_14
> >  ; AVX1-NEXT:  .LBB2_13: # %cond.store11
> > -; AVX1-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> > +; AVX1-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
> >  ; AVX1-NEXT:    testb $-128, %al
> >  ; AVX1-NEXT:    je .LBB2_16
> >  ; AVX1-NEXT:  .LBB2_15: # %cond.store13
> > -; AVX1-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> > +; AVX1-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
> >  ; AVX1-NEXT:    vzeroupper
> >  ; AVX1-NEXT:    retq
> >  ;
> > @@ -1329,19 +1346,26 @@ define void @truncstore_v8i64_v8i8(<8 x
> >  ; AVX2:       # %bb.0:
> >  ; AVX2-NEXT:    vpxor %xmm3, %xmm3, %xmm3
> >  ; AVX2-NEXT:    vpbroadcastq {{.*#+}} ymm4 = [127,127,127,127]
> > -; AVX2-NEXT:    vpcmpgtq %ymm0, %ymm4, %ymm5
> > -; AVX2-NEXT:    vblendvpd %ymm5, %ymm0, %ymm4, %ymm0
> >  ; AVX2-NEXT:    vpcmpgtq %ymm1, %ymm4, %ymm5
> >  ; AVX2-NEXT:    vblendvpd %ymm5, %ymm1, %ymm4, %ymm1
> > +; AVX2-NEXT:    vpcmpgtq %ymm0, %ymm4, %ymm5
> > +; AVX2-NEXT:    vblendvpd %ymm5, %ymm0, %ymm4, %ymm0
> >  ; AVX2-NEXT:    vpbroadcastq {{.*#+}} ymm4 = [18446744073709551488,18446744073709551488,18446744073709551488,18446744073709551488]
> > -; AVX2-NEXT:    vpcmpgtq %ymm4, %ymm1, %ymm5
> > -; AVX2-NEXT:    vblendvpd %ymm5, %ymm1, %ymm4, %ymm1
> >  ; AVX2-NEXT:    vpcmpgtq %ymm4, %ymm0, %ymm5
> >  ; AVX2-NEXT:    vblendvpd %ymm5, %ymm0, %ymm4, %ymm0
> > -; AVX2-NEXT:    vpackssdw %ymm1, %ymm0, %ymm0
> > -; AVX2-NEXT:    vpermq {{.*#+}} ymm0 = ymm0[0,2,1,3]
> > -; AVX2-NEXT:    vextracti128 $1, %ymm0, %xmm1
> > -; AVX2-NEXT:    vpackssdw %xmm1, %xmm0, %xmm0
> > +; AVX2-NEXT:    vpcmpgtq %ymm4, %ymm1, %ymm5
> > +; AVX2-NEXT:    vblendvpd %ymm5, %ymm1, %ymm4, %ymm1
> > +; AVX2-NEXT:    vextractf128 $1, %ymm1, %xmm4
> > +; AVX2-NEXT:    vmovdqa {{.*#+}} xmm5 = <u,u,0,8,u,u,u,u,u,u,u,u,u,u,u,u>
> > +; AVX2-NEXT:    vpshufb %xmm5, %xmm4, %xmm4
> > +; AVX2-NEXT:    vpshufb %xmm5, %xmm1, %xmm1
> > +; AVX2-NEXT:    vpunpcklwd {{.*#+}} xmm1 = xmm1[0],xmm4[0],xmm1[1],xmm4[1],xmm1[2],xmm4[2],xmm1[3],xmm4[3]
> > +; AVX2-NEXT:    vextractf128 $1, %ymm0, %xmm4
> > +; AVX2-NEXT:    vmovdqa {{.*#+}} xmm5 = <0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u>
> > +; AVX2-NEXT:    vpshufb %xmm5, %xmm4, %xmm4
> > +; AVX2-NEXT:    vpshufb %xmm5, %xmm0, %xmm0
> > +; AVX2-NEXT:    vpunpcklwd {{.*#+}} xmm0 = xmm0[0],xmm4[0],xmm0[1],xmm4[1],xmm0[2],xmm4[2],xmm0[3],xmm4[3]
> > +; AVX2-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm1[1],xmm0[2,3]
> >  ; AVX2-NEXT:    vpcmpeqd %ymm3, %ymm2, %ymm1
> >  ; AVX2-NEXT:    vmovmskps %ymm1, %eax
> >  ; AVX2-NEXT:    notl %eax
> > @@ -1376,31 +1400,31 @@ define void @truncstore_v8i64_v8i8(<8 x
> >  ; AVX2-NEXT:    testb $2, %al
> >  ; AVX2-NEXT:    je .LBB2_4
> >  ; AVX2-NEXT:  .LBB2_3: # %cond.store1
> > -; AVX2-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> > +; AVX2-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX2-NEXT:    testb $4, %al
> >  ; AVX2-NEXT:    je .LBB2_6
> >  ; AVX2-NEXT:  .LBB2_5: # %cond.store3
> > -; AVX2-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> > +; AVX2-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> >  ; AVX2-NEXT:    testb $8, %al
> >  ; AVX2-NEXT:    je .LBB2_8
> >  ; AVX2-NEXT:  .LBB2_7: # %cond.store5
> > -; AVX2-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> > +; AVX2-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> >  ; AVX2-NEXT:    testb $16, %al
> >  ; AVX2-NEXT:    je .LBB2_10
> >  ; AVX2-NEXT:  .LBB2_9: # %cond.store7
> > -; AVX2-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> > +; AVX2-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
> >  ; AVX2-NEXT:    testb $32, %al
> >  ; AVX2-NEXT:    je .LBB2_12
> >  ; AVX2-NEXT:  .LBB2_11: # %cond.store9
> > -; AVX2-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> > +; AVX2-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
> >  ; AVX2-NEXT:    testb $64, %al
> >  ; AVX2-NEXT:    je .LBB2_14
> >  ; AVX2-NEXT:  .LBB2_13: # %cond.store11
> > -; AVX2-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> > +; AVX2-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
> >  ; AVX2-NEXT:    testb $-128, %al
> >  ; AVX2-NEXT:    je .LBB2_16
> >  ; AVX2-NEXT:  .LBB2_15: # %cond.store13
> > -; AVX2-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> > +; AVX2-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
> >  ; AVX2-NEXT:    vzeroupper
> >  ; AVX2-NEXT:    retq
> >  ;
> > @@ -1410,7 +1434,7 @@ define void @truncstore_v8i64_v8i8(<8 x
> >  ; AVX512F-NEXT:    vptestmd %zmm1, %zmm1, %k0
> >  ; AVX512F-NEXT:    vpminsq {{.*}}(%rip){1to8}, %zmm0, %zmm0
> >  ; AVX512F-NEXT:    vpmaxsq {{.*}}(%rip){1to8}, %zmm0, %zmm0
> > -; AVX512F-NEXT:    vpmovqw %zmm0, %xmm0
> > +; AVX512F-NEXT:    vpmovqb %zmm0, %xmm0
> >  ; AVX512F-NEXT:    kmovw %k0, %eax
> >  ; AVX512F-NEXT:    testb $1, %al
> >  ; AVX512F-NEXT:    jne .LBB2_1
> > @@ -1443,31 +1467,31 @@ define void @truncstore_v8i64_v8i8(<8 x
> >  ; AVX512F-NEXT:    testb $2, %al
> >  ; AVX512F-NEXT:    je .LBB2_4
> >  ; AVX512F-NEXT:  .LBB2_3: # %cond.store1
> > -; AVX512F-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> > +; AVX512F-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX512F-NEXT:    testb $4, %al
> >  ; AVX512F-NEXT:    je .LBB2_6
> >  ; AVX512F-NEXT:  .LBB2_5: # %cond.store3
> > -; AVX512F-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> > +; AVX512F-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> >  ; AVX512F-NEXT:    testb $8, %al
> >  ; AVX512F-NEXT:    je .LBB2_8
> >  ; AVX512F-NEXT:  .LBB2_7: # %cond.store5
> > -; AVX512F-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> > +; AVX512F-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> >  ; AVX512F-NEXT:    testb $16, %al
> >  ; AVX512F-NEXT:    je .LBB2_10
> >  ; AVX512F-NEXT:  .LBB2_9: # %cond.store7
> > -; AVX512F-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> > +; AVX512F-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
> >  ; AVX512F-NEXT:    testb $32, %al
> >  ; AVX512F-NEXT:    je .LBB2_12
> >  ; AVX512F-NEXT:  .LBB2_11: # %cond.store9
> > -; AVX512F-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> > +; AVX512F-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
> >  ; AVX512F-NEXT:    testb $64, %al
> >  ; AVX512F-NEXT:    je .LBB2_14
> >  ; AVX512F-NEXT:  .LBB2_13: # %cond.store11
> > -; AVX512F-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> > +; AVX512F-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
> >  ; AVX512F-NEXT:    testb $-128, %al
> >  ; AVX512F-NEXT:    je .LBB2_16
> >  ; AVX512F-NEXT:  .LBB2_15: # %cond.store13
> > -; AVX512F-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> > +; AVX512F-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
> >  ; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> >  ;
> > @@ -1744,7 +1768,7 @@ define void @truncstore_v4i64_v4i16(<4 x
> >  ; SSE2-NEXT:    pxor %xmm9, %xmm9
> >  ; SSE2-NEXT:    movdqa {{.*#+}} xmm8 = [32767,32767]
> >  ; SSE2-NEXT:    movdqa {{.*#+}} xmm4 = [2147483648,2147483648]
> > -; SSE2-NEXT:    movdqa %xmm0, %xmm5
> > +; SSE2-NEXT:    movdqa %xmm1, %xmm5
> >  ; SSE2-NEXT:    pxor %xmm4, %xmm5
> >  ; SSE2-NEXT:    movdqa {{.*#+}} xmm10 = [2147516415,2147516415]
> >  ; SSE2-NEXT:    movdqa %xmm10, %xmm7
> > @@ -1755,50 +1779,54 @@ define void @truncstore_v4i64_v4i16(<4 x
> >  ; SSE2-NEXT:    pand %xmm3, %xmm6
> >  ; SSE2-NEXT:    pshufd {{.*#+}} xmm5 = xmm7[1,1,3,3]
> >  ; SSE2-NEXT:    por %xmm6, %xmm5
> > -; SSE2-NEXT:    pand %xmm5, %xmm0
> > +; SSE2-NEXT:    pand %xmm5, %xmm1
> >  ; SSE2-NEXT:    pandn %xmm8, %xmm5
> > -; SSE2-NEXT:    por %xmm0, %xmm5
> > -; SSE2-NEXT:    movdqa %xmm1, %xmm0
> > -; SSE2-NEXT:    pxor %xmm4, %xmm0
> > +; SSE2-NEXT:    por %xmm1, %xmm5
> > +; SSE2-NEXT:    movdqa %xmm0, %xmm1
> > +; SSE2-NEXT:    pxor %xmm4, %xmm1
> >  ; SSE2-NEXT:    movdqa %xmm10, %xmm3
> > -; SSE2-NEXT:    pcmpgtd %xmm0, %xmm3
> > +; SSE2-NEXT:    pcmpgtd %xmm1, %xmm3
> >  ; SSE2-NEXT:    pshufd {{.*#+}} xmm6 = xmm3[0,0,2,2]
> > -; SSE2-NEXT:    pcmpeqd %xmm10, %xmm0
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,1,3,3]
> > -; SSE2-NEXT:    pand %xmm6, %xmm0
> > +; SSE2-NEXT:    pcmpeqd %xmm10, %xmm1
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
> > +; SSE2-NEXT:    pand %xmm6, %xmm1
> >  ; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm3[1,1,3,3]
> > -; SSE2-NEXT:    por %xmm0, %xmm3
> > -; SSE2-NEXT:    pand %xmm3, %xmm1
> > -; SSE2-NEXT:    pandn %xmm8, %xmm3
> >  ; SSE2-NEXT:    por %xmm1, %xmm3
> > +; SSE2-NEXT:    pand %xmm3, %xmm0
> > +; SSE2-NEXT:    pandn %xmm8, %xmm3
> > +; SSE2-NEXT:    por %xmm0, %xmm3
> >  ; SSE2-NEXT:    movdqa {{.*#+}} xmm8 = [18446744073709518848,18446744073709518848]
> > -; SSE2-NEXT:    movdqa %xmm3, %xmm0
> > -; SSE2-NEXT:    pxor %xmm4, %xmm0
> > +; SSE2-NEXT:    movdqa %xmm3, %xmm1
> > +; SSE2-NEXT:    pxor %xmm4, %xmm1
> >  ; SSE2-NEXT:    movdqa {{.*#+}} xmm6 = [18446744071562035200,18446744071562035200]
> > -; SSE2-NEXT:    movdqa %xmm0, %xmm7
> > +; SSE2-NEXT:    movdqa %xmm1, %xmm7
> >  ; SSE2-NEXT:    pcmpgtd %xmm6, %xmm7
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm7[0,0,2,2]
> > -; SSE2-NEXT:    pcmpeqd %xmm6, %xmm0
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,1,3,3]
> > -; SSE2-NEXT:    pand %xmm1, %xmm0
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm7[1,1,3,3]
> > -; SSE2-NEXT:    por %xmm0, %xmm1
> > -; SSE2-NEXT:    pand %xmm1, %xmm3
> > -; SSE2-NEXT:    pandn %xmm8, %xmm1
> > -; SSE2-NEXT:    por %xmm3, %xmm1
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm7[0,0,2,2]
> > +; SSE2-NEXT:    pcmpeqd %xmm6, %xmm1
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
> > +; SSE2-NEXT:    pand %xmm0, %xmm1
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm7[1,1,3,3]
> > +; SSE2-NEXT:    por %xmm1, %xmm0
> > +; SSE2-NEXT:    pand %xmm0, %xmm3
> > +; SSE2-NEXT:    pandn %xmm8, %xmm0
> > +; SSE2-NEXT:    por %xmm3, %xmm0
> >  ; SSE2-NEXT:    pxor %xmm5, %xmm4
> > -; SSE2-NEXT:    movdqa %xmm4, %xmm0
> > -; SSE2-NEXT:    pcmpgtd %xmm6, %xmm0
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm0[0,0,2,2]
> > +; SSE2-NEXT:    movdqa %xmm4, %xmm1
> > +; SSE2-NEXT:    pcmpgtd %xmm6, %xmm1
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm1[0,0,2,2]
> >  ; SSE2-NEXT:    pcmpeqd %xmm6, %xmm4
> >  ; SSE2-NEXT:    pshufd {{.*#+}} xmm4 = xmm4[1,1,3,3]
> >  ; SSE2-NEXT:    pand %xmm3, %xmm4
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,1,3,3]
> > -; SSE2-NEXT:    por %xmm4, %xmm0
> > -; SSE2-NEXT:    pand %xmm0, %xmm5
> > -; SSE2-NEXT:    pandn %xmm8, %xmm0
> > -; SSE2-NEXT:    por %xmm5, %xmm0
> > -; SSE2-NEXT:    packssdw %xmm1, %xmm0
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
> > +; SSE2-NEXT:    por %xmm4, %xmm1
> > +; SSE2-NEXT:    pand %xmm1, %xmm5
> > +; SSE2-NEXT:    pandn %xmm8, %xmm1
> > +; SSE2-NEXT:    por %xmm5, %xmm1
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
> > +; SSE2-NEXT:    pshuflw {{.*#+}} xmm1 = xmm1[0,2,2,3,4,5,6,7]
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > +; SSE2-NEXT:    pshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> > +; SSE2-NEXT:    punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
> >  ; SSE2-NEXT:    pcmpeqd %xmm2, %xmm9
> >  ; SSE2-NEXT:    movmskps %xmm9, %eax
> >  ; SSE2-NEXT:    xorl $15, %eax
> > @@ -1821,17 +1849,17 @@ define void @truncstore_v4i64_v4i16(<4 x
> >  ; SSE2-NEXT:    testb $2, %al
> >  ; SSE2-NEXT:    je .LBB4_4
> >  ; SSE2-NEXT:  .LBB4_3: # %cond.store1
> > -; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> > +; SSE2-NEXT:    pextrw $1, %xmm0, %ecx
> >  ; SSE2-NEXT:    movw %cx, 2(%rdi)
> >  ; SSE2-NEXT:    testb $4, %al
> >  ; SSE2-NEXT:    je .LBB4_6
> >  ; SSE2-NEXT:  .LBB4_5: # %cond.store3
> > -; SSE2-NEXT:    pextrw $4, %xmm0, %ecx
> > +; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> >  ; SSE2-NEXT:    movw %cx, 4(%rdi)
> >  ; SSE2-NEXT:    testb $8, %al
> >  ; SSE2-NEXT:    je .LBB4_8
> >  ; SSE2-NEXT:  .LBB4_7: # %cond.store5
> > -; SSE2-NEXT:    pextrw $6, %xmm0, %eax
> > +; SSE2-NEXT:    pextrw $3, %xmm0, %eax
> >  ; SSE2-NEXT:    movw %ax, 6(%rdi)
> >  ; SSE2-NEXT:    retq
> >  ;
> > @@ -1841,12 +1869,12 @@ define void @truncstore_v4i64_v4i16(<4 x
> >  ; SSE4-NEXT:    pxor %xmm4, %xmm4
> >  ; SSE4-NEXT:    movdqa {{.*#+}} xmm5 = [32767,32767]
> >  ; SSE4-NEXT:    movdqa %xmm5, %xmm0
> > -; SSE4-NEXT:    pcmpgtq %xmm3, %xmm0
> > +; SSE4-NEXT:    pcmpgtq %xmm1, %xmm0
> >  ; SSE4-NEXT:    movdqa %xmm5, %xmm6
> > -; SSE4-NEXT:    blendvpd %xmm0, %xmm3, %xmm6
> > +; SSE4-NEXT:    blendvpd %xmm0, %xmm1, %xmm6
> >  ; SSE4-NEXT:    movdqa %xmm5, %xmm0
> > -; SSE4-NEXT:    pcmpgtq %xmm1, %xmm0
> > -; SSE4-NEXT:    blendvpd %xmm0, %xmm1, %xmm5
> > +; SSE4-NEXT:    pcmpgtq %xmm3, %xmm0
> > +; SSE4-NEXT:    blendvpd %xmm0, %xmm3, %xmm5
> >  ; SSE4-NEXT:    movdqa {{.*#+}} xmm1 = [18446744073709518848,18446744073709518848]
> >  ; SSE4-NEXT:    movapd %xmm5, %xmm0
> >  ; SSE4-NEXT:    pcmpgtq %xmm1, %xmm0
> > @@ -1855,7 +1883,11 @@ define void @truncstore_v4i64_v4i16(<4 x
> >  ; SSE4-NEXT:    movapd %xmm6, %xmm0
> >  ; SSE4-NEXT:    pcmpgtq %xmm1, %xmm0
> >  ; SSE4-NEXT:    blendvpd %xmm0, %xmm6, %xmm1
> > -; SSE4-NEXT:    packssdw %xmm3, %xmm1
> > +; SSE4-NEXT:    pshufd {{.*#+}} xmm0 = xmm1[0,2,2,3]
> > +; SSE4-NEXT:    pshuflw {{.*#+}} xmm1 = xmm0[0,2,2,3,4,5,6,7]
> > +; SSE4-NEXT:    pshufd {{.*#+}} xmm0 = xmm3[0,2,2,3]
> > +; SSE4-NEXT:    pshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> > +; SSE4-NEXT:    punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
> >  ; SSE4-NEXT:    pcmpeqd %xmm2, %xmm4
> >  ; SSE4-NEXT:    movmskps %xmm4, %eax
> >  ; SSE4-NEXT:    xorl $15, %eax
> > @@ -1873,19 +1905,19 @@ define void @truncstore_v4i64_v4i16(<4 x
> >  ; SSE4-NEXT:  .LBB4_8: # %else6
> >  ; SSE4-NEXT:    retq
> >  ; SSE4-NEXT:  .LBB4_1: # %cond.store
> > -; SSE4-NEXT:    pextrw $0, %xmm1, (%rdi)
> > +; SSE4-NEXT:    pextrw $0, %xmm0, (%rdi)
> >  ; SSE4-NEXT:    testb $2, %al
> >  ; SSE4-NEXT:    je .LBB4_4
> >  ; SSE4-NEXT:  .LBB4_3: # %cond.store1
> > -; SSE4-NEXT:    pextrw $2, %xmm1, 2(%rdi)
> > +; SSE4-NEXT:    pextrw $1, %xmm0, 2(%rdi)
> >  ; SSE4-NEXT:    testb $4, %al
> >  ; SSE4-NEXT:    je .LBB4_6
> >  ; SSE4-NEXT:  .LBB4_5: # %cond.store3
> > -; SSE4-NEXT:    pextrw $4, %xmm1, 4(%rdi)
> > +; SSE4-NEXT:    pextrw $2, %xmm0, 4(%rdi)
> >  ; SSE4-NEXT:    testb $8, %al
> >  ; SSE4-NEXT:    je .LBB4_8
> >  ; SSE4-NEXT:  .LBB4_7: # %cond.store5
> > -; SSE4-NEXT:    pextrw $6, %xmm1, 6(%rdi)
> > +; SSE4-NEXT:    pextrw $3, %xmm0, 6(%rdi)
> >  ; SSE4-NEXT:    retq
> >  ;
> >  ; AVX1-LABEL: truncstore_v4i64_v4i16:
> > @@ -1901,8 +1933,12 @@ define void @truncstore_v4i64_v4i16(<4 x
> >  ; AVX1-NEXT:    vblendvpd %xmm5, %xmm3, %xmm4, %xmm3
> >  ; AVX1-NEXT:    vpcmpgtq %xmm6, %xmm3, %xmm4
> >  ; AVX1-NEXT:    vblendvpd %xmm4, %xmm3, %xmm6, %xmm3
> > +; AVX1-NEXT:    vpermilps {{.*#+}} xmm3 = xmm3[0,2,2,3]
> > +; AVX1-NEXT:    vpshuflw {{.*#+}} xmm3 = xmm3[0,2,2,3,4,5,6,7]
> >  ; AVX1-NEXT:    vblendvpd %xmm7, %xmm0, %xmm6, %xmm0
> > -; AVX1-NEXT:    vpackssdw %xmm3, %xmm0, %xmm0
> > +; AVX1-NEXT:    vpermilps {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > +; AVX1-NEXT:    vpshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> > +; AVX1-NEXT:    vpunpckldq {{.*#+}} xmm0 = xmm0[0],xmm3[0],xmm0[1],xmm3[1]
> >  ; AVX1-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
> >  ; AVX1-NEXT:    vmovmskps %xmm1, %eax
> >  ; AVX1-NEXT:    xorl $15, %eax
> > @@ -1925,15 +1961,15 @@ define void @truncstore_v4i64_v4i16(<4 x
> >  ; AVX1-NEXT:    testb $2, %al
> >  ; AVX1-NEXT:    je .LBB4_4
> >  ; AVX1-NEXT:  .LBB4_3: # %cond.store1
> > -; AVX1-NEXT:    vpextrw $2, %xmm0, 2(%rdi)
> > +; AVX1-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
> >  ; AVX1-NEXT:    testb $4, %al
> >  ; AVX1-NEXT:    je .LBB4_6
> >  ; AVX1-NEXT:  .LBB4_5: # %cond.store3
> > -; AVX1-NEXT:    vpextrw $4, %xmm0, 4(%rdi)
> > +; AVX1-NEXT:    vpextrw $2, %xmm0, 4(%rdi)
> >  ; AVX1-NEXT:    testb $8, %al
> >  ; AVX1-NEXT:    je .LBB4_8
> >  ; AVX1-NEXT:  .LBB4_7: # %cond.store5
> > -; AVX1-NEXT:    vpextrw $6, %xmm0, 6(%rdi)
> > +; AVX1-NEXT:    vpextrw $3, %xmm0, 6(%rdi)
> >  ; AVX1-NEXT:    vzeroupper
> >  ; AVX1-NEXT:    retq
> >  ;
> > @@ -1947,7 +1983,11 @@ define void @truncstore_v4i64_v4i16(<4 x
> >  ; AVX2-NEXT:    vpcmpgtq %ymm3, %ymm0, %ymm4
> >  ; AVX2-NEXT:    vblendvpd %ymm4, %ymm0, %ymm3, %ymm0
> >  ; AVX2-NEXT:    vextractf128 $1, %ymm0, %xmm3
> > -; AVX2-NEXT:    vpackssdw %xmm3, %xmm0, %xmm0
> > +; AVX2-NEXT:    vpermilps {{.*#+}} xmm3 = xmm3[0,2,2,3]
> > +; AVX2-NEXT:    vpshuflw {{.*#+}} xmm3 = xmm3[0,2,2,3,4,5,6,7]
> > +; AVX2-NEXT:    vpermilps {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > +; AVX2-NEXT:    vpshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> > +; AVX2-NEXT:    vpunpckldq {{.*#+}} xmm0 = xmm0[0],xmm3[0],xmm0[1],xmm3[1]
> >  ; AVX2-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
> >  ; AVX2-NEXT:    vmovmskps %xmm1, %eax
> >  ; AVX2-NEXT:    xorl $15, %eax
> > @@ -1970,15 +2010,15 @@ define void @truncstore_v4i64_v4i16(<4 x
> >  ; AVX2-NEXT:    testb $2, %al
> >  ; AVX2-NEXT:    je .LBB4_4
> >  ; AVX2-NEXT:  .LBB4_3: # %cond.store1
> > -; AVX2-NEXT:    vpextrw $2, %xmm0, 2(%rdi)
> > +; AVX2-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
> >  ; AVX2-NEXT:    testb $4, %al
> >  ; AVX2-NEXT:    je .LBB4_6
> >  ; AVX2-NEXT:  .LBB4_5: # %cond.store3
> > -; AVX2-NEXT:    vpextrw $4, %xmm0, 4(%rdi)
> > +; AVX2-NEXT:    vpextrw $2, %xmm0, 4(%rdi)
> >  ; AVX2-NEXT:    testb $8, %al
> >  ; AVX2-NEXT:    je .LBB4_8
> >  ; AVX2-NEXT:  .LBB4_7: # %cond.store5
> > -; AVX2-NEXT:    vpextrw $6, %xmm0, 6(%rdi)
> > +; AVX2-NEXT:    vpextrw $3, %xmm0, 6(%rdi)
> >  ; AVX2-NEXT:    vzeroupper
> >  ; AVX2-NEXT:    retq
> >  ;
> > @@ -1991,7 +2031,7 @@ define void @truncstore_v4i64_v4i16(<4 x
> >  ; AVX512F-NEXT:    vpminsq %zmm1, %zmm0, %zmm0
> >  ; AVX512F-NEXT:    vpbroadcastq {{.*#+}} ymm1 = [18446744073709518848,18446744073709518848,18446744073709518848,18446744073709518848]
> >  ; AVX512F-NEXT:    vpmaxsq %zmm1, %zmm0, %zmm0
> > -; AVX512F-NEXT:    vpmovqd %zmm0, %ymm0
> > +; AVX512F-NEXT:    vpmovqw %zmm0, %xmm0
> >  ; AVX512F-NEXT:    kmovw %k0, %eax
> >  ; AVX512F-NEXT:    testb $1, %al
> >  ; AVX512F-NEXT:    jne .LBB4_1
> > @@ -2012,15 +2052,15 @@ define void @truncstore_v4i64_v4i16(<4 x
> >  ; AVX512F-NEXT:    testb $2, %al
> >  ; AVX512F-NEXT:    je .LBB4_4
> >  ; AVX512F-NEXT:  .LBB4_3: # %cond.store1
> > -; AVX512F-NEXT:    vpextrw $2, %xmm0, 2(%rdi)
> > +; AVX512F-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
> >  ; AVX512F-NEXT:    testb $4, %al
> >  ; AVX512F-NEXT:    je .LBB4_6
> >  ; AVX512F-NEXT:  .LBB4_5: # %cond.store3
> > -; AVX512F-NEXT:    vpextrw $4, %xmm0, 4(%rdi)
> > +; AVX512F-NEXT:    vpextrw $2, %xmm0, 4(%rdi)
> >  ; AVX512F-NEXT:    testb $8, %al
> >  ; AVX512F-NEXT:    je .LBB4_8
> >  ; AVX512F-NEXT:  .LBB4_7: # %cond.store5
> > -; AVX512F-NEXT:    vpextrw $6, %xmm0, 6(%rdi)
> > +; AVX512F-NEXT:    vpextrw $3, %xmm0, 6(%rdi)
> >  ; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> >  ;
> > @@ -2029,14 +2069,13 @@ define void @truncstore_v4i64_v4i16(<4 x
> >  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> >  ; AVX512BW-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
> >  ; AVX512BW-NEXT:    vptestmd %zmm1, %zmm1, %k0
> > +; AVX512BW-NEXT:    kshiftld $28, %k0, %k0
> > +; AVX512BW-NEXT:    kshiftrd $28, %k0, %k1
> >  ; AVX512BW-NEXT:    vpbroadcastq {{.*#+}} ymm1 = [32767,32767,32767,32767]
> >  ; AVX512BW-NEXT:    vpminsq %zmm1, %zmm0, %zmm0
> >  ; AVX512BW-NEXT:    vpbroadcastq {{.*#+}} ymm1 = [18446744073709518848,18446744073709518848,18446744073709518848,18446744073709518848]
> >  ; AVX512BW-NEXT:    vpmaxsq %zmm1, %zmm0, %zmm0
> > -; AVX512BW-NEXT:    vpmovqd %zmm0, %ymm0
> > -; AVX512BW-NEXT:    vpackssdw %xmm0, %xmm0, %xmm0
> > -; AVX512BW-NEXT:    kshiftld $28, %k0, %k0
> > -; AVX512BW-NEXT:    kshiftrd $28, %k0, %k1
> > +; AVX512BW-NEXT:    vpmovqw %zmm0, %xmm0
> >  ; AVX512BW-NEXT:    vmovdqu16 %zmm0, (%rdi) {%k1}
> >  ; AVX512BW-NEXT:    vzeroupper
> >  ; AVX512BW-NEXT:    retq
> > @@ -2065,7 +2104,7 @@ define void @truncstore_v4i64_v4i8(<4 x
> >  ; SSE2-NEXT:    pxor %xmm9, %xmm9
> >  ; SSE2-NEXT:    movdqa {{.*#+}} xmm8 = [127,127]
> >  ; SSE2-NEXT:    movdqa {{.*#+}} xmm4 = [2147483648,2147483648]
> > -; SSE2-NEXT:    movdqa %xmm0, %xmm5
> > +; SSE2-NEXT:    movdqa %xmm1, %xmm5
> >  ; SSE2-NEXT:    pxor %xmm4, %xmm5
> >  ; SSE2-NEXT:    movdqa {{.*#+}} xmm10 = [2147483775,2147483775]
> >  ; SSE2-NEXT:    movdqa %xmm10, %xmm7
> > @@ -2076,83 +2115,88 @@ define void @truncstore_v4i64_v4i8(<4 x
> >  ; SSE2-NEXT:    pand %xmm3, %xmm6
> >  ; SSE2-NEXT:    pshufd {{.*#+}} xmm5 = xmm7[1,1,3,3]
> >  ; SSE2-NEXT:    por %xmm6, %xmm5
> > -; SSE2-NEXT:    pand %xmm5, %xmm0
> > +; SSE2-NEXT:    pand %xmm5, %xmm1
> >  ; SSE2-NEXT:    pandn %xmm8, %xmm5
> > -; SSE2-NEXT:    por %xmm0, %xmm5
> > -; SSE2-NEXT:    movdqa %xmm1, %xmm0
> > -; SSE2-NEXT:    pxor %xmm4, %xmm0
> > +; SSE2-NEXT:    por %xmm1, %xmm5
> > +; SSE2-NEXT:    movdqa %xmm0, %xmm1
> > +; SSE2-NEXT:    pxor %xmm4, %xmm1
> >  ; SSE2-NEXT:    movdqa %xmm10, %xmm3
> > -; SSE2-NEXT:    pcmpgtd %xmm0, %xmm3
> > +; SSE2-NEXT:    pcmpgtd %xmm1, %xmm3
> >  ; SSE2-NEXT:    pshufd {{.*#+}} xmm6 = xmm3[0,0,2,2]
> > -; SSE2-NEXT:    pcmpeqd %xmm10, %xmm0
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,1,3,3]
> > -; SSE2-NEXT:    pand %xmm6, %xmm0
> > +; SSE2-NEXT:    pcmpeqd %xmm10, %xmm1
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
> > +; SSE2-NEXT:    pand %xmm6, %xmm1
> >  ; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm3[1,1,3,3]
> > -; SSE2-NEXT:    por %xmm0, %xmm3
> > -; SSE2-NEXT:    pand %xmm3, %xmm1
> > -; SSE2-NEXT:    pandn %xmm8, %xmm3
> >  ; SSE2-NEXT:    por %xmm1, %xmm3
> > +; SSE2-NEXT:    pand %xmm3, %xmm0
> > +; SSE2-NEXT:    pandn %xmm8, %xmm3
> > +; SSE2-NEXT:    por %xmm0, %xmm3
> >  ; SSE2-NEXT:    movdqa {{.*#+}} xmm8 = [18446744073709551488,18446744073709551488]
> >  ; SSE2-NEXT:    movdqa %xmm3, %xmm0
> >  ; SSE2-NEXT:    pxor %xmm4, %xmm0
> > -; SSE2-NEXT:    movdqa {{.*#+}} xmm6 = [18446744071562067840,18446744071562067840]
> > +; SSE2-NEXT:    movdqa {{.*#+}} xmm10 = [18446744071562067840,18446744071562067840]
> >  ; SSE2-NEXT:    movdqa %xmm0, %xmm7
> > -; SSE2-NEXT:    pcmpgtd %xmm6, %xmm7
> > +; SSE2-NEXT:    pcmpgtd %xmm10, %xmm7
> >  ; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm7[0,0,2,2]
> > -; SSE2-NEXT:    pcmpeqd %xmm6, %xmm0
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,1,3,3]
> > -; SSE2-NEXT:    pand %xmm1, %xmm0
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm7[1,1,3,3]
> > -; SSE2-NEXT:    por %xmm0, %xmm1
> > -; SSE2-NEXT:    pand %xmm1, %xmm3
> > -; SSE2-NEXT:    pandn %xmm8, %xmm1
> > -; SSE2-NEXT:    por %xmm3, %xmm1
> > +; SSE2-NEXT:    pcmpeqd %xmm10, %xmm0
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm6 = xmm0[1,1,3,3]
> > +; SSE2-NEXT:    pand %xmm1, %xmm6
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm7[1,1,3,3]
> > +; SSE2-NEXT:    por %xmm6, %xmm0
> > +; SSE2-NEXT:    pand %xmm0, %xmm3
> > +; SSE2-NEXT:    pandn %xmm8, %xmm0
> > +; SSE2-NEXT:    por %xmm3, %xmm0
> >  ; SSE2-NEXT:    pxor %xmm5, %xmm4
> > -; SSE2-NEXT:    movdqa %xmm4, %xmm0
> > -; SSE2-NEXT:    pcmpgtd %xmm6, %xmm0
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm0[0,0,2,2]
> > -; SSE2-NEXT:    pcmpeqd %xmm6, %xmm4
> > +; SSE2-NEXT:    movdqa %xmm4, %xmm1
> > +; SSE2-NEXT:    pcmpgtd %xmm10, %xmm1
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm1[0,0,2,2]
> > +; SSE2-NEXT:    pcmpeqd %xmm10, %xmm4
> >  ; SSE2-NEXT:    pshufd {{.*#+}} xmm4 = xmm4[1,1,3,3]
> >  ; SSE2-NEXT:    pand %xmm3, %xmm4
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,1,3,3]
> > -; SSE2-NEXT:    por %xmm4, %xmm0
> > -; SSE2-NEXT:    pand %xmm0, %xmm5
> > -; SSE2-NEXT:    pandn %xmm8, %xmm0
> > -; SSE2-NEXT:    por %xmm5, %xmm0
> > -; SSE2-NEXT:    packssdw %xmm1, %xmm0
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
> > +; SSE2-NEXT:    por %xmm4, %xmm1
> > +; SSE2-NEXT:    pand %xmm1, %xmm5
> > +; SSE2-NEXT:    pandn %xmm8, %xmm1
> > +; SSE2-NEXT:    por %xmm5, %xmm1
> > +; SSE2-NEXT:    movdqa {{.*#+}} xmm3 = [255,0,0,0,0,0,0,0,255,0,0,0,0,0,0,0]
> > +; SSE2-NEXT:    pand %xmm3, %xmm1
> > +; SSE2-NEXT:    pand %xmm3, %xmm0
> > +; SSE2-NEXT:    packuswb %xmm1, %xmm0
> > +; SSE2-NEXT:    packuswb %xmm0, %xmm0
> > +; SSE2-NEXT:    packuswb %xmm0, %xmm0
> >  ; SSE2-NEXT:    pcmpeqd %xmm2, %xmm9
> > -; SSE2-NEXT:    movmskps %xmm9, %eax
> > -; SSE2-NEXT:    xorl $15, %eax
> > -; SSE2-NEXT:    testb $1, %al
> > +; SSE2-NEXT:    movmskps %xmm9, %ecx
> > +; SSE2-NEXT:    xorl $15, %ecx
> > +; SSE2-NEXT:    testb $1, %cl
> > +; SSE2-NEXT:    movd %xmm0, %eax
> >  ; SSE2-NEXT:    jne .LBB5_1
> >  ; SSE2-NEXT:  # %bb.2: # %else
> > -; SSE2-NEXT:    testb $2, %al
> > +; SSE2-NEXT:    testb $2, %cl
> >  ; SSE2-NEXT:    jne .LBB5_3
> >  ; SSE2-NEXT:  .LBB5_4: # %else2
> > -; SSE2-NEXT:    testb $4, %al
> > +; SSE2-NEXT:    testb $4, %cl
> >  ; SSE2-NEXT:    jne .LBB5_5
> >  ; SSE2-NEXT:  .LBB5_6: # %else4
> > -; SSE2-NEXT:    testb $8, %al
> > +; SSE2-NEXT:    testb $8, %cl
> >  ; SSE2-NEXT:    jne .LBB5_7
> >  ; SSE2-NEXT:  .LBB5_8: # %else6
> >  ; SSE2-NEXT:    retq
> >  ; SSE2-NEXT:  .LBB5_1: # %cond.store
> > -; SSE2-NEXT:    movd %xmm0, %ecx
> > -; SSE2-NEXT:    movb %cl, (%rdi)
> > -; SSE2-NEXT:    testb $2, %al
> > +; SSE2-NEXT:    movb %al, (%rdi)
> > +; SSE2-NEXT:    testb $2, %cl
> >  ; SSE2-NEXT:    je .LBB5_4
> >  ; SSE2-NEXT:  .LBB5_3: # %cond.store1
> > -; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> > -; SSE2-NEXT:    movb %cl, 1(%rdi)
> > -; SSE2-NEXT:    testb $4, %al
> > +; SSE2-NEXT:    movb %ah, 1(%rdi)
> > +; SSE2-NEXT:    testb $4, %cl
> >  ; SSE2-NEXT:    je .LBB5_6
> >  ; SSE2-NEXT:  .LBB5_5: # %cond.store3
> > -; SSE2-NEXT:    pextrw $4, %xmm0, %ecx
> > -; SSE2-NEXT:    movb %cl, 2(%rdi)
> > -; SSE2-NEXT:    testb $8, %al
> > +; SSE2-NEXT:    movl %eax, %edx
> > +; SSE2-NEXT:    shrl $16, %edx
> > +; SSE2-NEXT:    movb %dl, 2(%rdi)
> > +; SSE2-NEXT:    testb $8, %cl
> >  ; SSE2-NEXT:    je .LBB5_8
> >  ; SSE2-NEXT:  .LBB5_7: # %cond.store5
> > -; SSE2-NEXT:    pextrw $6, %xmm0, %eax
> > +; SSE2-NEXT:    shrl $24, %eax
> >  ; SSE2-NEXT:    movb %al, 3(%rdi)
> >  ; SSE2-NEXT:    retq
> >  ;
> > @@ -2162,21 +2206,24 @@ define void @truncstore_v4i64_v4i8(<4 x
> >  ; SSE4-NEXT:    pxor %xmm4, %xmm4
> >  ; SSE4-NEXT:    movdqa {{.*#+}} xmm5 = [127,127]
> >  ; SSE4-NEXT:    movdqa %xmm5, %xmm0
> > -; SSE4-NEXT:    pcmpgtq %xmm3, %xmm0
> > +; SSE4-NEXT:    pcmpgtq %xmm1, %xmm0
> >  ; SSE4-NEXT:    movdqa %xmm5, %xmm6
> > -; SSE4-NEXT:    blendvpd %xmm0, %xmm3, %xmm6
> > +; SSE4-NEXT:    blendvpd %xmm0, %xmm1, %xmm6
> >  ; SSE4-NEXT:    movdqa %xmm5, %xmm0
> > -; SSE4-NEXT:    pcmpgtq %xmm1, %xmm0
> > -; SSE4-NEXT:    blendvpd %xmm0, %xmm1, %xmm5
> > -; SSE4-NEXT:    movdqa {{.*#+}} xmm1 = [18446744073709551488,18446744073709551488]
> > +; SSE4-NEXT:    pcmpgtq %xmm3, %xmm0
> > +; SSE4-NEXT:    blendvpd %xmm0, %xmm3, %xmm5
> > +; SSE4-NEXT:    movdqa {{.*#+}} xmm3 = [18446744073709551488,18446744073709551488]
> >  ; SSE4-NEXT:    movapd %xmm5, %xmm0
> > -; SSE4-NEXT:    pcmpgtq %xmm1, %xmm0
> > -; SSE4-NEXT:    movdqa %xmm1, %xmm3
> > -; SSE4-NEXT:    blendvpd %xmm0, %xmm5, %xmm3
> > +; SSE4-NEXT:    pcmpgtq %xmm3, %xmm0
> > +; SSE4-NEXT:    movdqa %xmm3, %xmm1
> > +; SSE4-NEXT:    blendvpd %xmm0, %xmm5, %xmm1
> >  ; SSE4-NEXT:    movapd %xmm6, %xmm0
> > -; SSE4-NEXT:    pcmpgtq %xmm1, %xmm0
> > -; SSE4-NEXT:    blendvpd %xmm0, %xmm6, %xmm1
> > -; SSE4-NEXT:    packssdw %xmm3, %xmm1
> > +; SSE4-NEXT:    pcmpgtq %xmm3, %xmm0
> > +; SSE4-NEXT:    blendvpd %xmm0, %xmm6, %xmm3
> > +; SSE4-NEXT:    movdqa {{.*#+}} xmm0 = <0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u>
> > +; SSE4-NEXT:    pshufb %xmm0, %xmm3
> > +; SSE4-NEXT:    pshufb %xmm0, %xmm1
> > +; SSE4-NEXT:    punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm3[0],xmm1[1],xmm3[1],xmm1[2],xmm3[2],xmm1[3],xmm3[3]
> >  ; SSE4-NEXT:    pcmpeqd %xmm2, %xmm4
> >  ; SSE4-NEXT:    movmskps %xmm4, %eax
> >  ; SSE4-NEXT:    xorl $15, %eax
> > @@ -2198,15 +2245,15 @@ define void @truncstore_v4i64_v4i8(<4 x
> >  ; SSE4-NEXT:    testb $2, %al
> >  ; SSE4-NEXT:    je .LBB5_4
> >  ; SSE4-NEXT:  .LBB5_3: # %cond.store1
> > -; SSE4-NEXT:    pextrb $4, %xmm1, 1(%rdi)
> > +; SSE4-NEXT:    pextrb $1, %xmm1, 1(%rdi)
> >  ; SSE4-NEXT:    testb $4, %al
> >  ; SSE4-NEXT:    je .LBB5_6
> >  ; SSE4-NEXT:  .LBB5_5: # %cond.store3
> > -; SSE4-NEXT:    pextrb $8, %xmm1, 2(%rdi)
> > +; SSE4-NEXT:    pextrb $2, %xmm1, 2(%rdi)
> >  ; SSE4-NEXT:    testb $8, %al
> >  ; SSE4-NEXT:    je .LBB5_8
> >  ; SSE4-NEXT:  .LBB5_7: # %cond.store5
> > -; SSE4-NEXT:    pextrb $12, %xmm1, 3(%rdi)
> > +; SSE4-NEXT:    pextrb $3, %xmm1, 3(%rdi)
> >  ; SSE4-NEXT:    retq
> >  ;
> >  ; AVX1-LABEL: truncstore_v4i64_v4i8:
> > @@ -2222,8 +2269,11 @@ define void @truncstore_v4i64_v4i8(<4 x
> >  ; AVX1-NEXT:    vblendvpd %xmm5, %xmm3, %xmm4, %xmm3
> >  ; AVX1-NEXT:    vpcmpgtq %xmm6, %xmm3, %xmm4
> >  ; AVX1-NEXT:    vblendvpd %xmm4, %xmm3, %xmm6, %xmm3
> > +; AVX1-NEXT:    vmovdqa {{.*#+}} xmm4 = <0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u>
> > +; AVX1-NEXT:    vpshufb %xmm4, %xmm3, %xmm3
> >  ; AVX1-NEXT:    vblendvpd %xmm7, %xmm0, %xmm6, %xmm0
> > -; AVX1-NEXT:    vpackssdw %xmm3, %xmm0, %xmm0
> > +; AVX1-NEXT:    vpshufb %xmm4, %xmm0, %xmm0
> > +; AVX1-NEXT:    vpunpcklwd {{.*#+}} xmm0 = xmm0[0],xmm3[0],xmm0[1],xmm3[1],xmm0[2],xmm3[2],xmm0[3],xmm3[3]
> >  ; AVX1-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
> >  ; AVX1-NEXT:    vmovmskps %xmm1, %eax
> >  ; AVX1-NEXT:    xorl $15, %eax
> > @@ -2246,15 +2296,15 @@ define void @truncstore_v4i64_v4i8(<4 x
> >  ; AVX1-NEXT:    testb $2, %al
> >  ; AVX1-NEXT:    je .LBB5_4
> >  ; AVX1-NEXT:  .LBB5_3: # %cond.store1
> > -; AVX1-NEXT:    vpextrb $4, %xmm0, 1(%rdi)
> > +; AVX1-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX1-NEXT:    testb $4, %al
> >  ; AVX1-NEXT:    je .LBB5_6
> >  ; AVX1-NEXT:  .LBB5_5: # %cond.store3
> > -; AVX1-NEXT:    vpextrb $8, %xmm0, 2(%rdi)
> > +; AVX1-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> >  ; AVX1-NEXT:    testb $8, %al
> >  ; AVX1-NEXT:    je .LBB5_8
> >  ; AVX1-NEXT:  .LBB5_7: # %cond.store5
> > -; AVX1-NEXT:    vpextrb $12, %xmm0, 3(%rdi)
> > +; AVX1-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> >  ; AVX1-NEXT:    vzeroupper
> >  ; AVX1-NEXT:    retq
> >  ;
> > @@ -2268,7 +2318,10 @@ define void @truncstore_v4i64_v4i8(<4 x
> >  ; AVX2-NEXT:    vpcmpgtq %ymm3, %ymm0, %ymm4
> >  ; AVX2-NEXT:    vblendvpd %ymm4, %ymm0, %ymm3, %ymm0
> >  ; AVX2-NEXT:    vextractf128 $1, %ymm0, %xmm3
> > -; AVX2-NEXT:    vpackssdw %xmm3, %xmm0, %xmm0
> > +; AVX2-NEXT:    vmovdqa {{.*#+}} xmm4 = <0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u>
> > +; AVX2-NEXT:    vpshufb %xmm4, %xmm3, %xmm3
> > +; AVX2-NEXT:    vpshufb %xmm4, %xmm0, %xmm0
> > +; AVX2-NEXT:    vpunpcklwd {{.*#+}} xmm0 = xmm0[0],xmm3[0],xmm0[1],xmm3[1],xmm0[2],xmm3[2],xmm0[3],xmm3[3]
> >  ; AVX2-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
> >  ; AVX2-NEXT:    vmovmskps %xmm1, %eax
> >  ; AVX2-NEXT:    xorl $15, %eax
> > @@ -2291,15 +2344,15 @@ define void @truncstore_v4i64_v4i8(<4 x
> >  ; AVX2-NEXT:    testb $2, %al
> >  ; AVX2-NEXT:    je .LBB5_4
> >  ; AVX2-NEXT:  .LBB5_3: # %cond.store1
> > -; AVX2-NEXT:    vpextrb $4, %xmm0, 1(%rdi)
> > +; AVX2-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX2-NEXT:    testb $4, %al
> >  ; AVX2-NEXT:    je .LBB5_6
> >  ; AVX2-NEXT:  .LBB5_5: # %cond.store3
> > -; AVX2-NEXT:    vpextrb $8, %xmm0, 2(%rdi)
> > +; AVX2-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> >  ; AVX2-NEXT:    testb $8, %al
> >  ; AVX2-NEXT:    je .LBB5_8
> >  ; AVX2-NEXT:  .LBB5_7: # %cond.store5
> > -; AVX2-NEXT:    vpextrb $12, %xmm0, 3(%rdi)
> > +; AVX2-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> >  ; AVX2-NEXT:    vzeroupper
> >  ; AVX2-NEXT:    retq
> >  ;
> > @@ -2312,7 +2365,7 @@ define void @truncstore_v4i64_v4i8(<4 x
> >  ; AVX512F-NEXT:    vpminsq %zmm1, %zmm0, %zmm0
> >  ; AVX512F-NEXT:    vpbroadcastq {{.*#+}} ymm1 = [18446744073709551488,18446744073709551488,18446744073709551488,18446744073709551488]
> >  ; AVX512F-NEXT:    vpmaxsq %zmm1, %zmm0, %zmm0
> > -; AVX512F-NEXT:    vpmovqd %zmm0, %ymm0
> > +; AVX512F-NEXT:    vpmovqb %zmm0, %xmm0
> >  ; AVX512F-NEXT:    kmovw %k0, %eax
> >  ; AVX512F-NEXT:    testb $1, %al
> >  ; AVX512F-NEXT:    jne .LBB5_1
> > @@ -2333,15 +2386,15 @@ define void @truncstore_v4i64_v4i8(<4 x
> >  ; AVX512F-NEXT:    testb $2, %al
> >  ; AVX512F-NEXT:    je .LBB5_4
> >  ; AVX512F-NEXT:  .LBB5_3: # %cond.store1
> > -; AVX512F-NEXT:    vpextrb $4, %xmm0, 1(%rdi)
> > +; AVX512F-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX512F-NEXT:    testb $4, %al
> >  ; AVX512F-NEXT:    je .LBB5_6
> >  ; AVX512F-NEXT:  .LBB5_5: # %cond.store3
> > -; AVX512F-NEXT:    vpextrb $8, %xmm0, 2(%rdi)
> > +; AVX512F-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> >  ; AVX512F-NEXT:    testb $8, %al
> >  ; AVX512F-NEXT:    je .LBB5_8
> >  ; AVX512F-NEXT:  .LBB5_7: # %cond.store5
> > -; AVX512F-NEXT:    vpextrb $12, %xmm0, 3(%rdi)
> > +; AVX512F-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> >  ; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> >  ;
> > @@ -2350,14 +2403,13 @@ define void @truncstore_v4i64_v4i8(<4 x
> >  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> >  ; AVX512BW-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
> >  ; AVX512BW-NEXT:    vptestmd %zmm1, %zmm1, %k0
> > +; AVX512BW-NEXT:    kshiftlq $60, %k0, %k0
> > +; AVX512BW-NEXT:    kshiftrq $60, %k0, %k1
> >  ; AVX512BW-NEXT:    vpbroadcastq {{.*#+}} ymm1 = [127,127,127,127]
> >  ; AVX512BW-NEXT:    vpminsq %zmm1, %zmm0, %zmm0
> >  ; AVX512BW-NEXT:    vpbroadcastq {{.*#+}} ymm1 = [18446744073709551488,18446744073709551488,18446744073709551488,18446744073709551488]
> >  ; AVX512BW-NEXT:    vpmaxsq %zmm1, %zmm0, %zmm0
> > -; AVX512BW-NEXT:    vpmovqd %zmm0, %ymm0
> > -; AVX512BW-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
> > -; AVX512BW-NEXT:    kshiftlq $60, %k0, %k0
> > -; AVX512BW-NEXT:    kshiftrq $60, %k0, %k1
> > +; AVX512BW-NEXT:    vpmovqb %zmm0, %xmm0
> >  ; AVX512BW-NEXT:    vmovdqu8 %zmm0, (%rdi) {%k1}
> >  ; AVX512BW-NEXT:    vzeroupper
> >  ; AVX512BW-NEXT:    retq
> > @@ -2405,13 +2457,14 @@ define void @truncstore_v2i64_v2i32(<2 x
> >  ; SSE2-NEXT:    pcmpgtd %xmm0, %xmm4
> >  ; SSE2-NEXT:    pshufd {{.*#+}} xmm6 = xmm4[0,0,2,2]
> >  ; SSE2-NEXT:    pcmpeqd %xmm0, %xmm3
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm3[1,1,3,3]
> > -; SSE2-NEXT:    pand %xmm6, %xmm3
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm4[1,1,3,3]
> > -; SSE2-NEXT:    por %xmm3, %xmm0
> > -; SSE2-NEXT:    pand %xmm0, %xmm5
> > -; SSE2-NEXT:    pandn {{.*}}(%rip), %xmm0
> > -; SSE2-NEXT:    por %xmm5, %xmm0
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm3[1,1,3,3]
> > +; SSE2-NEXT:    pand %xmm6, %xmm0
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm4[1,1,3,3]
> > +; SSE2-NEXT:    por %xmm0, %xmm3
> > +; SSE2-NEXT:    pand %xmm3, %xmm5
> > +; SSE2-NEXT:    pandn {{.*}}(%rip), %xmm3
> > +; SSE2-NEXT:    por %xmm5, %xmm3
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm3[0,2,2,3]
> >  ; SSE2-NEXT:    pcmpeqd %xmm1, %xmm2
> >  ; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm2[1,0,3,2]
> >  ; SSE2-NEXT:    pand %xmm2, %xmm1
> > @@ -2429,7 +2482,7 @@ define void @truncstore_v2i64_v2i32(<2 x
> >  ; SSE2-NEXT:    testb $2, %al
> >  ; SSE2-NEXT:    je .LBB6_4
> >  ; SSE2-NEXT:  .LBB6_3: # %cond.store1
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[2,3,0,1]
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,1,2,3]
> >  ; SSE2-NEXT:    movd %xmm0, 4(%rdi)
> >  ; SSE2-NEXT:    retq
> >  ;
> > @@ -2445,6 +2498,7 @@ define void @truncstore_v2i64_v2i32(<2 x
> >  ; SSE4-NEXT:    movapd %xmm4, %xmm0
> >  ; SSE4-NEXT:    pcmpgtq %xmm2, %xmm0
> >  ; SSE4-NEXT:    blendvpd %xmm0, %xmm4, %xmm2
> > +; SSE4-NEXT:    pshufd {{.*#+}} xmm0 = xmm2[0,2,2,3]
> >  ; SSE4-NEXT:    pcmpeqq %xmm1, %xmm3
> >  ; SSE4-NEXT:    movmskpd %xmm3, %eax
> >  ; SSE4-NEXT:    xorl $3, %eax
> > @@ -2456,11 +2510,11 @@ define void @truncstore_v2i64_v2i32(<2 x
> >  ; SSE4-NEXT:  .LBB6_4: # %else2
> >  ; SSE4-NEXT:    retq
> >  ; SSE4-NEXT:  .LBB6_1: # %cond.store
> > -; SSE4-NEXT:    movss %xmm2, (%rdi)
> > +; SSE4-NEXT:    movd %xmm0, (%rdi)
> >  ; SSE4-NEXT:    testb $2, %al
> >  ; SSE4-NEXT:    je .LBB6_4
> >  ; SSE4-NEXT:  .LBB6_3: # %cond.store1
> > -; SSE4-NEXT:    extractps $2, %xmm2, 4(%rdi)
> > +; SSE4-NEXT:    pextrd $1, %xmm0, 4(%rdi)
> >  ; SSE4-NEXT:    retq
> >  ;
> >  ; AVX1-LABEL: truncstore_v2i64_v2i32:
> > @@ -2469,6 +2523,7 @@ define void @truncstore_v2i64_v2i32(<2 x
> >  ; AVX1-NEXT:    vpcmpeqq %xmm2, %xmm1, %xmm1
> >  ; AVX1-NEXT:    vpcmpeqd %xmm2, %xmm2, %xmm2
> >  ; AVX1-NEXT:    vpxor %xmm2, %xmm1, %xmm1
> > +; AVX1-NEXT:    vinsertps {{.*#+}} xmm1 = xmm1[0,2],zero,zero
> >  ; AVX1-NEXT:    vmovdqa {{.*#+}} xmm2 = [2147483647,2147483647]
> >  ; AVX1-NEXT:    vpcmpgtq %xmm0, %xmm2, %xmm3
> >  ; AVX1-NEXT:    vblendvpd %xmm3, %xmm0, %xmm2, %xmm0
> > @@ -2476,7 +2531,6 @@ define void @truncstore_v2i64_v2i32(<2 x
> >  ; AVX1-NEXT:    vpcmpgtq %xmm2, %xmm0, %xmm3
> >  ; AVX1-NEXT:    vblendvpd %xmm3, %xmm0, %xmm2, %xmm0
> >  ; AVX1-NEXT:    vpermilps {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > -; AVX1-NEXT:    vinsertps {{.*#+}} xmm1 = xmm1[0,2],zero,zero
> >  ; AVX1-NEXT:    vmaskmovps %xmm0, %xmm1, (%rdi)
> >  ; AVX1-NEXT:    retq
> >  ;
> > @@ -2486,6 +2540,7 @@ define void @truncstore_v2i64_v2i32(<2 x
> >  ; AVX2-NEXT:    vpcmpeqq %xmm2, %xmm1, %xmm1
> >  ; AVX2-NEXT:    vpcmpeqd %xmm2, %xmm2, %xmm2
> >  ; AVX2-NEXT:    vpxor %xmm2, %xmm1, %xmm1
> > +; AVX2-NEXT:    vinsertps {{.*#+}} xmm1 = xmm1[0,2],zero,zero
> >  ; AVX2-NEXT:    vmovdqa {{.*#+}} xmm2 = [2147483647,2147483647]
> >  ; AVX2-NEXT:    vpcmpgtq %xmm0, %xmm2, %xmm3
> >  ; AVX2-NEXT:    vblendvpd %xmm3, %xmm0, %xmm2, %xmm0
> > @@ -2493,7 +2548,6 @@ define void @truncstore_v2i64_v2i32(<2 x
> >  ; AVX2-NEXT:    vpcmpgtq %xmm2, %xmm0, %xmm3
> >  ; AVX2-NEXT:    vblendvpd %xmm3, %xmm0, %xmm2, %xmm0
> >  ; AVX2-NEXT:    vpermilps {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > -; AVX2-NEXT:    vinsertps {{.*#+}} xmm1 = xmm1[0,2],zero,zero
> >  ; AVX2-NEXT:    vpmaskmovd %xmm0, %xmm1, (%rdi)
> >  ; AVX2-NEXT:    retq
> >  ;
> > @@ -2502,13 +2556,13 @@ define void @truncstore_v2i64_v2i32(<2 x
> >  ; AVX512F-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> >  ; AVX512F-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
> >  ; AVX512F-NEXT:    vptestmq %zmm1, %zmm1, %k0
> > +; AVX512F-NEXT:    kshiftlw $14, %k0, %k0
> > +; AVX512F-NEXT:    kshiftrw $14, %k0, %k1
> >  ; AVX512F-NEXT:    vmovdqa {{.*#+}} xmm1 = [2147483647,2147483647]
> >  ; AVX512F-NEXT:    vpminsq %zmm1, %zmm0, %zmm0
> >  ; AVX512F-NEXT:    vmovdqa {{.*#+}} xmm1 = [18446744071562067968,18446744071562067968]
> >  ; AVX512F-NEXT:    vpmaxsq %zmm1, %zmm0, %zmm0
> >  ; AVX512F-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > -; AVX512F-NEXT:    kshiftlw $14, %k0, %k0
> > -; AVX512F-NEXT:    kshiftrw $14, %k0, %k1
> >  ; AVX512F-NEXT:    vmovdqu32 %zmm0, (%rdi) {%k1}
> >  ; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> > @@ -2526,13 +2580,13 @@ define void @truncstore_v2i64_v2i32(<2 x
> >  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> >  ; AVX512BW-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
> >  ; AVX512BW-NEXT:    vptestmq %zmm1, %zmm1, %k0
> > +; AVX512BW-NEXT:    kshiftlw $14, %k0, %k0
> > +; AVX512BW-NEXT:    kshiftrw $14, %k0, %k1
> >  ; AVX512BW-NEXT:    vmovdqa {{.*#+}} xmm1 = [2147483647,2147483647]
> >  ; AVX512BW-NEXT:    vpminsq %zmm1, %zmm0, %zmm0
> >  ; AVX512BW-NEXT:    vmovdqa {{.*#+}} xmm1 = [18446744071562067968,18446744071562067968]
> >  ; AVX512BW-NEXT:    vpmaxsq %zmm1, %zmm0, %zmm0
> >  ; AVX512BW-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > -; AVX512BW-NEXT:    kshiftlw $14, %k0, %k0
> > -; AVX512BW-NEXT:    kshiftrw $14, %k0, %k1
> >  ; AVX512BW-NEXT:    vmovdqu32 %zmm0, (%rdi) {%k1}
> >  ; AVX512BW-NEXT:    vzeroupper
> >  ; AVX512BW-NEXT:    retq
> > @@ -2571,13 +2625,15 @@ define void @truncstore_v2i64_v2i16(<2 x
> >  ; SSE2-NEXT:    pcmpgtd %xmm0, %xmm4
> >  ; SSE2-NEXT:    pshufd {{.*#+}} xmm6 = xmm4[0,0,2,2]
> >  ; SSE2-NEXT:    pcmpeqd %xmm0, %xmm3
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm3[1,1,3,3]
> > -; SSE2-NEXT:    pand %xmm6, %xmm3
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm4[1,1,3,3]
> > -; SSE2-NEXT:    por %xmm3, %xmm0
> > -; SSE2-NEXT:    pand %xmm0, %xmm5
> > -; SSE2-NEXT:    pandn {{.*}}(%rip), %xmm0
> > -; SSE2-NEXT:    por %xmm5, %xmm0
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm3[1,1,3,3]
> > +; SSE2-NEXT:    pand %xmm6, %xmm0
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm4[1,1,3,3]
> > +; SSE2-NEXT:    por %xmm0, %xmm3
> > +; SSE2-NEXT:    pand %xmm3, %xmm5
> > +; SSE2-NEXT:    pandn {{.*}}(%rip), %xmm3
> > +; SSE2-NEXT:    por %xmm5, %xmm3
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm3[0,2,2,3]
> > +; SSE2-NEXT:    pshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> >  ; SSE2-NEXT:    pcmpeqd %xmm1, %xmm2
> >  ; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm2[1,0,3,2]
> >  ; SSE2-NEXT:    pand %xmm2, %xmm1
> > @@ -2596,7 +2652,7 @@ define void @truncstore_v2i64_v2i16(<2 x
> >  ; SSE2-NEXT:    testb $2, %al
> >  ; SSE2-NEXT:    je .LBB7_4
> >  ; SSE2-NEXT:  .LBB7_3: # %cond.store1
> > -; SSE2-NEXT:    pextrw $4, %xmm0, %eax
> > +; SSE2-NEXT:    pextrw $1, %xmm0, %eax
> >  ; SSE2-NEXT:    movw %ax, 2(%rdi)
> >  ; SSE2-NEXT:    retq
> >  ;
> > @@ -2612,6 +2668,8 @@ define void @truncstore_v2i64_v2i16(<2 x
> >  ; SSE4-NEXT:    movapd %xmm4, %xmm0
> >  ; SSE4-NEXT:    pcmpgtq %xmm2, %xmm0
> >  ; SSE4-NEXT:    blendvpd %xmm0, %xmm4, %xmm2
> > +; SSE4-NEXT:    pshufd {{.*#+}} xmm0 = xmm2[0,2,2,3]
> > +; SSE4-NEXT:    pshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> >  ; SSE4-NEXT:    pcmpeqq %xmm1, %xmm3
> >  ; SSE4-NEXT:    movmskpd %xmm3, %eax
> >  ; SSE4-NEXT:    xorl $3, %eax
> > @@ -2623,11 +2681,11 @@ define void @truncstore_v2i64_v2i16(<2 x
> >  ; SSE4-NEXT:  .LBB7_4: # %else2
> >  ; SSE4-NEXT:    retq
> >  ; SSE4-NEXT:  .LBB7_1: # %cond.store
> > -; SSE4-NEXT:    pextrw $0, %xmm2, (%rdi)
> > +; SSE4-NEXT:    pextrw $0, %xmm0, (%rdi)
> >  ; SSE4-NEXT:    testb $2, %al
> >  ; SSE4-NEXT:    je .LBB7_4
> >  ; SSE4-NEXT:  .LBB7_3: # %cond.store1
> > -; SSE4-NEXT:    pextrw $4, %xmm2, 2(%rdi)
> > +; SSE4-NEXT:    pextrw $1, %xmm0, 2(%rdi)
> >  ; SSE4-NEXT:    retq
> >  ;
> >  ; AVX-LABEL: truncstore_v2i64_v2i16:
> > @@ -2639,6 +2697,8 @@ define void @truncstore_v2i64_v2i16(<2 x
> >  ; AVX-NEXT:    vmovdqa {{.*#+}} xmm3 = [18446744073709518848,18446744073709518848]
> >  ; AVX-NEXT:    vpcmpgtq %xmm3, %xmm0, %xmm4
> >  ; AVX-NEXT:    vblendvpd %xmm4, %xmm0, %xmm3, %xmm0
> > +; AVX-NEXT:    vpermilps {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > +; AVX-NEXT:    vpshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> >  ; AVX-NEXT:    vpcmpeqq %xmm2, %xmm1, %xmm1
> >  ; AVX-NEXT:    vmovmskpd %xmm1, %eax
> >  ; AVX-NEXT:    xorl $3, %eax
> > @@ -2654,7 +2714,7 @@ define void @truncstore_v2i64_v2i16(<2 x
> >  ; AVX-NEXT:    testb $2, %al
> >  ; AVX-NEXT:    je .LBB7_4
> >  ; AVX-NEXT:  .LBB7_3: # %cond.store1
> > -; AVX-NEXT:    vpextrw $4, %xmm0, 2(%rdi)
> > +; AVX-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
> >  ; AVX-NEXT:    retq
> >  ;
> >  ; AVX512F-LABEL: truncstore_v2i64_v2i16:
> > @@ -2666,6 +2726,8 @@ define void @truncstore_v2i64_v2i16(<2 x
> >  ; AVX512F-NEXT:    vpminsq %zmm1, %zmm0, %zmm0
> >  ; AVX512F-NEXT:    vmovdqa {{.*#+}} xmm1 = [18446744073709518848,18446744073709518848]
> >  ; AVX512F-NEXT:    vpmaxsq %zmm1, %zmm0, %zmm0
> > +; AVX512F-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > +; AVX512F-NEXT:    vpshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> >  ; AVX512F-NEXT:    kmovw %k0, %eax
> >  ; AVX512F-NEXT:    testb $1, %al
> >  ; AVX512F-NEXT:    jne .LBB7_1
> > @@ -2680,7 +2742,7 @@ define void @truncstore_v2i64_v2i16(<2 x
> >  ; AVX512F-NEXT:    testb $2, %al
> >  ; AVX512F-NEXT:    je .LBB7_4
> >  ; AVX512F-NEXT:  .LBB7_3: # %cond.store1
> > -; AVX512F-NEXT:    vpextrw $4, %xmm0, 2(%rdi)
> > +; AVX512F-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
> >  ; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> >  ;
> > @@ -2689,14 +2751,14 @@ define void @truncstore_v2i64_v2i16(<2 x
> >  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> >  ; AVX512BW-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
> >  ; AVX512BW-NEXT:    vptestmq %zmm1, %zmm1, %k0
> > +; AVX512BW-NEXT:    kshiftld $30, %k0, %k0
> > +; AVX512BW-NEXT:    kshiftrd $30, %k0, %k1
> >  ; AVX512BW-NEXT:    vmovdqa {{.*#+}} xmm1 = [32767,32767]
> >  ; AVX512BW-NEXT:    vpminsq %zmm1, %zmm0, %zmm0
> >  ; AVX512BW-NEXT:    vmovdqa {{.*#+}} xmm1 = [18446744073709518848,18446744073709518848]
> >  ; AVX512BW-NEXT:    vpmaxsq %zmm1, %zmm0, %zmm0
> >  ; AVX512BW-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> >  ; AVX512BW-NEXT:    vpshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> > -; AVX512BW-NEXT:    kshiftld $30, %k0, %k0
> > -; AVX512BW-NEXT:    kshiftrd $30, %k0, %k1
> >  ; AVX512BW-NEXT:    vmovdqu16 %zmm0, (%rdi) {%k1}
> >  ; AVX512BW-NEXT:    vzeroupper
> >  ; AVX512BW-NEXT:    retq
> > @@ -2743,19 +2805,24 @@ define void @truncstore_v2i64_v2i8(<2 x
> >  ; SSE2-NEXT:    pcmpgtd %xmm0, %xmm4
> >  ; SSE2-NEXT:    pshufd {{.*#+}} xmm6 = xmm4[0,0,2,2]
> >  ; SSE2-NEXT:    pcmpeqd %xmm0, %xmm3
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm3[1,1,3,3]
> > -; SSE2-NEXT:    pand %xmm6, %xmm3
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm4[1,1,3,3]
> > -; SSE2-NEXT:    por %xmm3, %xmm0
> > -; SSE2-NEXT:    pand %xmm0, %xmm5
> > -; SSE2-NEXT:    pandn {{.*}}(%rip), %xmm0
> > -; SSE2-NEXT:    por %xmm5, %xmm0
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm3[1,1,3,3]
> > +; SSE2-NEXT:    pand %xmm6, %xmm0
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm4[1,1,3,3]
> > +; SSE2-NEXT:    por %xmm0, %xmm3
> > +; SSE2-NEXT:    pand %xmm3, %xmm5
> > +; SSE2-NEXT:    pandn {{.*}}(%rip), %xmm3
> > +; SSE2-NEXT:    por %xmm5, %xmm3
> > +; SSE2-NEXT:    pand {{.*}}(%rip), %xmm3
> > +; SSE2-NEXT:    packuswb %xmm3, %xmm3
> > +; SSE2-NEXT:    packuswb %xmm3, %xmm3
> > +; SSE2-NEXT:    packuswb %xmm3, %xmm3
> >  ; SSE2-NEXT:    pcmpeqd %xmm1, %xmm2
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm2[1,0,3,2]
> > -; SSE2-NEXT:    pand %xmm2, %xmm1
> > -; SSE2-NEXT:    movmskpd %xmm1, %eax
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm2[1,0,3,2]
> > +; SSE2-NEXT:    pand %xmm2, %xmm0
> > +; SSE2-NEXT:    movmskpd %xmm0, %eax
> >  ; SSE2-NEXT:    xorl $3, %eax
> >  ; SSE2-NEXT:    testb $1, %al
> > +; SSE2-NEXT:    movd %xmm3, %ecx
> >  ; SSE2-NEXT:    jne .LBB8_1
> >  ; SSE2-NEXT:  # %bb.2: # %else
> >  ; SSE2-NEXT:    testb $2, %al
> > @@ -2763,13 +2830,11 @@ define void @truncstore_v2i64_v2i8(<2 x
> >  ; SSE2-NEXT:  .LBB8_4: # %else2
> >  ; SSE2-NEXT:    retq
> >  ; SSE2-NEXT:  .LBB8_1: # %cond.store
> > -; SSE2-NEXT:    movd %xmm0, %ecx
> >  ; SSE2-NEXT:    movb %cl, (%rdi)
> >  ; SSE2-NEXT:    testb $2, %al
> >  ; SSE2-NEXT:    je .LBB8_4
> >  ; SSE2-NEXT:  .LBB8_3: # %cond.store1
> > -; SSE2-NEXT:    pextrw $4, %xmm0, %eax
> > -; SSE2-NEXT:    movb %al, 1(%rdi)
> > +; SSE2-NEXT:    movb %ch, 1(%rdi)
> >  ; SSE2-NEXT:    retq
> >  ;
> >  ; SSE4-LABEL: truncstore_v2i64_v2i8:
> > @@ -2784,6 +2849,7 @@ define void @truncstore_v2i64_v2i8(<2 x
> >  ; SSE4-NEXT:    movapd %xmm4, %xmm0
> >  ; SSE4-NEXT:    pcmpgtq %xmm2, %xmm0
> >  ; SSE4-NEXT:    blendvpd %xmm0, %xmm4, %xmm2
> > +; SSE4-NEXT:    pshufb {{.*#+}} xmm2 = xmm2[0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u]
> >  ; SSE4-NEXT:    pcmpeqq %xmm1, %xmm3
> >  ; SSE4-NEXT:    movmskpd %xmm3, %eax
> >  ; SSE4-NEXT:    xorl $3, %eax
> > @@ -2799,7 +2865,7 @@ define void @truncstore_v2i64_v2i8(<2 x
> >  ; SSE4-NEXT:    testb $2, %al
> >  ; SSE4-NEXT:    je .LBB8_4
> >  ; SSE4-NEXT:  .LBB8_3: # %cond.store1
> > -; SSE4-NEXT:    pextrb $8, %xmm2, 1(%rdi)
> > +; SSE4-NEXT:    pextrb $1, %xmm2, 1(%rdi)
> >  ; SSE4-NEXT:    retq
> >  ;
> >  ; AVX-LABEL: truncstore_v2i64_v2i8:
> > @@ -2811,6 +2877,7 @@ define void @truncstore_v2i64_v2i8(<2 x
> >  ; AVX-NEXT:    vmovdqa {{.*#+}} xmm3 = [18446744073709551488,18446744073709551488]
> >  ; AVX-NEXT:    vpcmpgtq %xmm3, %xmm0, %xmm4
> >  ; AVX-NEXT:    vblendvpd %xmm4, %xmm0, %xmm3, %xmm0
> > +; AVX-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u]
> >  ; AVX-NEXT:    vpcmpeqq %xmm2, %xmm1, %xmm1
> >  ; AVX-NEXT:    vmovmskpd %xmm1, %eax
> >  ; AVX-NEXT:    xorl $3, %eax
> > @@ -2826,7 +2893,7 @@ define void @truncstore_v2i64_v2i8(<2 x
> >  ; AVX-NEXT:    testb $2, %al
> >  ; AVX-NEXT:    je .LBB8_4
> >  ; AVX-NEXT:  .LBB8_3: # %cond.store1
> > -; AVX-NEXT:    vpextrb $8, %xmm0, 1(%rdi)
> > +; AVX-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX-NEXT:    retq
> >  ;
> >  ; AVX512F-LABEL: truncstore_v2i64_v2i8:
> > @@ -2838,6 +2905,7 @@ define void @truncstore_v2i64_v2i8(<2 x
> >  ; AVX512F-NEXT:    vpminsq %zmm1, %zmm0, %zmm0
> >  ; AVX512F-NEXT:    vmovdqa {{.*#+}} xmm1 = [18446744073709551488,18446744073709551488]
> >  ; AVX512F-NEXT:    vpmaxsq %zmm1, %zmm0, %zmm0
> > +; AVX512F-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u]
> >  ; AVX512F-NEXT:    kmovw %k0, %eax
> >  ; AVX512F-NEXT:    testb $1, %al
> >  ; AVX512F-NEXT:    jne .LBB8_1
> > @@ -2852,7 +2920,7 @@ define void @truncstore_v2i64_v2i8(<2 x
> >  ; AVX512F-NEXT:    testb $2, %al
> >  ; AVX512F-NEXT:    je .LBB8_4
> >  ; AVX512F-NEXT:  .LBB8_3: # %cond.store1
> > -; AVX512F-NEXT:    vpextrb $8, %xmm0, 1(%rdi)
> > +; AVX512F-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> >  ;
> > @@ -2861,13 +2929,13 @@ define void @truncstore_v2i64_v2i8(<2 x
> >  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> >  ; AVX512BW-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
> >  ; AVX512BW-NEXT:    vptestmq %zmm1, %zmm1, %k0
> > +; AVX512BW-NEXT:    kshiftlq $62, %k0, %k0
> > +; AVX512BW-NEXT:    kshiftrq $62, %k0, %k1
> >  ; AVX512BW-NEXT:    vmovdqa {{.*#+}} xmm1 = [127,127]
> >  ; AVX512BW-NEXT:    vpminsq %zmm1, %zmm0, %zmm0
> >  ; AVX512BW-NEXT:    vmovdqa {{.*#+}} xmm1 = [18446744073709551488,18446744073709551488]
> >  ; AVX512BW-NEXT:    vpmaxsq %zmm1, %zmm0, %zmm0
> >  ; AVX512BW-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u]
> > -; AVX512BW-NEXT:    kshiftlq $62, %k0, %k0
> > -; AVX512BW-NEXT:    kshiftrq $62, %k0, %k1
> >  ; AVX512BW-NEXT:    vmovdqu8 %zmm0, (%rdi) {%k1}
> >  ; AVX512BW-NEXT:    vzeroupper
> >  ; AVX512BW-NEXT:    retq
> > @@ -4642,29 +4710,8 @@ define void @truncstore_v8i32_v8i8(<8 x
> >  ; SSE2-LABEL: truncstore_v8i32_v8i8:
> >  ; SSE2:       # %bb.0:
> >  ; SSE2-NEXT:    pxor %xmm4, %xmm4
> > -; SSE2-NEXT:    movdqa {{.*#+}} xmm5 = [127,127,127,127]
> > -; SSE2-NEXT:    movdqa %xmm5, %xmm6
> > -; SSE2-NEXT:    pcmpgtd %xmm0, %xmm6
> > -; SSE2-NEXT:    pand %xmm6, %xmm0
> > -; SSE2-NEXT:    pandn %xmm5, %xmm6
> > -; SSE2-NEXT:    por %xmm0, %xmm6
> > -; SSE2-NEXT:    movdqa %xmm5, %xmm0
> > -; SSE2-NEXT:    pcmpgtd %xmm1, %xmm0
> > -; SSE2-NEXT:    pand %xmm0, %xmm1
> > -; SSE2-NEXT:    pandn %xmm5, %xmm0
> > -; SSE2-NEXT:    por %xmm1, %xmm0
> > -; SSE2-NEXT:    movdqa {{.*#+}} xmm1 = [4294967168,4294967168,4294967168,4294967168]
> > -; SSE2-NEXT:    movdqa %xmm0, %xmm5
> > -; SSE2-NEXT:    pcmpgtd %xmm1, %xmm5
> > -; SSE2-NEXT:    pand %xmm5, %xmm0
> > -; SSE2-NEXT:    pandn %xmm1, %xmm5
> > -; SSE2-NEXT:    por %xmm0, %xmm5
> > -; SSE2-NEXT:    movdqa %xmm6, %xmm0
> > -; SSE2-NEXT:    pcmpgtd %xmm1, %xmm0
> > -; SSE2-NEXT:    pand %xmm0, %xmm6
> > -; SSE2-NEXT:    pandn %xmm1, %xmm0
> > -; SSE2-NEXT:    por %xmm6, %xmm0
> > -; SSE2-NEXT:    packssdw %xmm5, %xmm0
> > +; SSE2-NEXT:    packssdw %xmm1, %xmm0
> > +; SSE2-NEXT:    packsswb %xmm0, %xmm0
> >  ; SSE2-NEXT:    pcmpeqd %xmm4, %xmm3
> >  ; SSE2-NEXT:    pcmpeqd %xmm1, %xmm1
> >  ; SSE2-NEXT:    pxor %xmm1, %xmm3
> > @@ -4684,17 +4731,26 @@ define void @truncstore_v8i32_v8i8(<8 x
> >  ; SSE2-NEXT:    jne .LBB12_5
> >  ; SSE2-NEXT:  .LBB12_6: # %else4
> >  ; SSE2-NEXT:    testb $8, %al
> > -; SSE2-NEXT:    jne .LBB12_7
> > +; SSE2-NEXT:    je .LBB12_8
> > +; SSE2-NEXT:  .LBB12_7: # %cond.store5
> > +; SSE2-NEXT:    shrl $24, %ecx
> > +; SSE2-NEXT:    movb %cl, 3(%rdi)
> >  ; SSE2-NEXT:  .LBB12_8: # %else6
> >  ; SSE2-NEXT:    testb $16, %al
> > -; SSE2-NEXT:    jne .LBB12_9
> > +; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> > +; SSE2-NEXT:    je .LBB12_10
> > +; SSE2-NEXT:  # %bb.9: # %cond.store7
> > +; SSE2-NEXT:    movb %cl, 4(%rdi)
> >  ; SSE2-NEXT:  .LBB12_10: # %else8
> >  ; SSE2-NEXT:    testb $32, %al
> > -; SSE2-NEXT:    jne .LBB12_11
> > +; SSE2-NEXT:    je .LBB12_12
> > +; SSE2-NEXT:  # %bb.11: # %cond.store9
> > +; SSE2-NEXT:    movb %ch, 5(%rdi)
> >  ; SSE2-NEXT:  .LBB12_12: # %else10
> >  ; SSE2-NEXT:    testb $64, %al
> > +; SSE2-NEXT:    pextrw $3, %xmm0, %ecx
> >  ; SSE2-NEXT:    jne .LBB12_13
> > -; SSE2-NEXT:  .LBB12_14: # %else12
> > +; SSE2-NEXT:  # %bb.14: # %else12
> >  ; SSE2-NEXT:    testb $-128, %al
> >  ; SSE2-NEXT:    jne .LBB12_15
> >  ; SSE2-NEXT:  .LBB12_16: # %else14
> > @@ -4704,50 +4760,29 @@ define void @truncstore_v8i32_v8i8(<8 x
> >  ; SSE2-NEXT:    testb $2, %al
> >  ; SSE2-NEXT:    je .LBB12_4
> >  ; SSE2-NEXT:  .LBB12_3: # %cond.store1
> > -; SSE2-NEXT:    shrl $16, %ecx
> > -; SSE2-NEXT:    movb %cl, 1(%rdi)
> > +; SSE2-NEXT:    movb %ch, 1(%rdi)
> >  ; SSE2-NEXT:    testb $4, %al
> >  ; SSE2-NEXT:    je .LBB12_6
> >  ; SSE2-NEXT:  .LBB12_5: # %cond.store3
> > -; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> > -; SSE2-NEXT:    movb %cl, 2(%rdi)
> > +; SSE2-NEXT:    movl %ecx, %edx
> > +; SSE2-NEXT:    shrl $16, %edx
> > +; SSE2-NEXT:    movb %dl, 2(%rdi)
> >  ; SSE2-NEXT:    testb $8, %al
> > -; SSE2-NEXT:    je .LBB12_8
> > -; SSE2-NEXT:  .LBB12_7: # %cond.store5
> > -; SSE2-NEXT:    pextrw $3, %xmm0, %ecx
> > -; SSE2-NEXT:    movb %cl, 3(%rdi)
> > -; SSE2-NEXT:    testb $16, %al
> > -; SSE2-NEXT:    je .LBB12_10
> > -; SSE2-NEXT:  .LBB12_9: # %cond.store7
> > -; SSE2-NEXT:    pextrw $4, %xmm0, %ecx
> > -; SSE2-NEXT:    movb %cl, 4(%rdi)
> > -; SSE2-NEXT:    testb $32, %al
> > -; SSE2-NEXT:    je .LBB12_12
> > -; SSE2-NEXT:  .LBB12_11: # %cond.store9
> > -; SSE2-NEXT:    pextrw $5, %xmm0, %ecx
> > -; SSE2-NEXT:    movb %cl, 5(%rdi)
> > -; SSE2-NEXT:    testb $64, %al
> > -; SSE2-NEXT:    je .LBB12_14
> > +; SSE2-NEXT:    jne .LBB12_7
> > +; SSE2-NEXT:    jmp .LBB12_8
> >  ; SSE2-NEXT:  .LBB12_13: # %cond.store11
> > -; SSE2-NEXT:    pextrw $6, %xmm0, %ecx
> >  ; SSE2-NEXT:    movb %cl, 6(%rdi)
> >  ; SSE2-NEXT:    testb $-128, %al
> >  ; SSE2-NEXT:    je .LBB12_16
> >  ; SSE2-NEXT:  .LBB12_15: # %cond.store13
> > -; SSE2-NEXT:    pextrw $7, %xmm0, %eax
> > -; SSE2-NEXT:    movb %al, 7(%rdi)
> > +; SSE2-NEXT:    movb %ch, 7(%rdi)
> >  ; SSE2-NEXT:    retq
> >  ;
> >  ; SSE4-LABEL: truncstore_v8i32_v8i8:
> >  ; SSE4:       # %bb.0:
> >  ; SSE4-NEXT:    pxor %xmm4, %xmm4
> > -; SSE4-NEXT:    movdqa {{.*#+}} xmm5 = [127,127,127,127]
> > -; SSE4-NEXT:    pminsd %xmm5, %xmm0
> > -; SSE4-NEXT:    pminsd %xmm5, %xmm1
> > -; SSE4-NEXT:    movdqa {{.*#+}} xmm5 = [4294967168,4294967168,4294967168,4294967168]
> > -; SSE4-NEXT:    pmaxsd %xmm5, %xmm1
> > -; SSE4-NEXT:    pmaxsd %xmm5, %xmm0
> >  ; SSE4-NEXT:    packssdw %xmm1, %xmm0
> > +; SSE4-NEXT:    packsswb %xmm0, %xmm0
> >  ; SSE4-NEXT:    pcmpeqd %xmm4, %xmm3
> >  ; SSE4-NEXT:    pcmpeqd %xmm1, %xmm1
> >  ; SSE4-NEXT:    pxor %xmm1, %xmm3
> > @@ -4786,43 +4821,38 @@ define void @truncstore_v8i32_v8i8(<8 x
> >  ; SSE4-NEXT:    testb $2, %al
> >  ; SSE4-NEXT:    je .LBB12_4
> >  ; SSE4-NEXT:  .LBB12_3: # %cond.store1
> > -; SSE4-NEXT:    pextrb $2, %xmm0, 1(%rdi)
> > +; SSE4-NEXT:    pextrb $1, %xmm0, 1(%rdi)
> >  ; SSE4-NEXT:    testb $4, %al
> >  ; SSE4-NEXT:    je .LBB12_6
> >  ; SSE4-NEXT:  .LBB12_5: # %cond.store3
> > -; SSE4-NEXT:    pextrb $4, %xmm0, 2(%rdi)
> > +; SSE4-NEXT:    pextrb $2, %xmm0, 2(%rdi)
> >  ; SSE4-NEXT:    testb $8, %al
> >  ; SSE4-NEXT:    je .LBB12_8
> >  ; SSE4-NEXT:  .LBB12_7: # %cond.store5
> > -; SSE4-NEXT:    pextrb $6, %xmm0, 3(%rdi)
> > +; SSE4-NEXT:    pextrb $3, %xmm0, 3(%rdi)
> >  ; SSE4-NEXT:    testb $16, %al
> >  ; SSE4-NEXT:    je .LBB12_10
> >  ; SSE4-NEXT:  .LBB12_9: # %cond.store7
> > -; SSE4-NEXT:    pextrb $8, %xmm0, 4(%rdi)
> > +; SSE4-NEXT:    pextrb $4, %xmm0, 4(%rdi)
> >  ; SSE4-NEXT:    testb $32, %al
> >  ; SSE4-NEXT:    je .LBB12_12
> >  ; SSE4-NEXT:  .LBB12_11: # %cond.store9
> > -; SSE4-NEXT:    pextrb $10, %xmm0, 5(%rdi)
> > +; SSE4-NEXT:    pextrb $5, %xmm0, 5(%rdi)
> >  ; SSE4-NEXT:    testb $64, %al
> >  ; SSE4-NEXT:    je .LBB12_14
> >  ; SSE4-NEXT:  .LBB12_13: # %cond.store11
> > -; SSE4-NEXT:    pextrb $12, %xmm0, 6(%rdi)
> > +; SSE4-NEXT:    pextrb $6, %xmm0, 6(%rdi)
> >  ; SSE4-NEXT:    testb $-128, %al
> >  ; SSE4-NEXT:    je .LBB12_16
> >  ; SSE4-NEXT:  .LBB12_15: # %cond.store13
> > -; SSE4-NEXT:    pextrb $14, %xmm0, 7(%rdi)
> > +; SSE4-NEXT:    pextrb $7, %xmm0, 7(%rdi)
> >  ; SSE4-NEXT:    retq
> >  ;
> >  ; AVX1-LABEL: truncstore_v8i32_v8i8:
> >  ; AVX1:       # %bb.0:
> > -; AVX1-NEXT:    vmovdqa {{.*#+}} xmm2 = [127,127,127,127]
> > -; AVX1-NEXT:    vpminsd %xmm2, %xmm0, %xmm3
> > -; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm0
> > -; AVX1-NEXT:    vpminsd %xmm2, %xmm0, %xmm0
> > -; AVX1-NEXT:    vmovdqa {{.*#+}} xmm2 = [4294967168,4294967168,4294967168,4294967168]
> > -; AVX1-NEXT:    vpmaxsd %xmm2, %xmm0, %xmm0
> > -; AVX1-NEXT:    vpmaxsd %xmm2, %xmm3, %xmm2
> > -; AVX1-NEXT:    vpackssdw %xmm0, %xmm2, %xmm0
> > +; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm2
> > +; AVX1-NEXT:    vpackssdw %xmm2, %xmm0, %xmm0
> > +; AVX1-NEXT:    vpacksswb %xmm0, %xmm0, %xmm0
> >  ; AVX1-NEXT:    vextractf128 $1, %ymm1, %xmm2
> >  ; AVX1-NEXT:    vpxor %xmm3, %xmm3, %xmm3
> >  ; AVX1-NEXT:    vpcmpeqd %xmm3, %xmm2, %xmm2
> > @@ -4861,43 +4891,40 @@ define void @truncstore_v8i32_v8i8(<8 x
> >  ; AVX1-NEXT:    testb $2, %al
> >  ; AVX1-NEXT:    je .LBB12_4
> >  ; AVX1-NEXT:  .LBB12_3: # %cond.store1
> > -; AVX1-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> > +; AVX1-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX1-NEXT:    testb $4, %al
> >  ; AVX1-NEXT:    je .LBB12_6
> >  ; AVX1-NEXT:  .LBB12_5: # %cond.store3
> > -; AVX1-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> > +; AVX1-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> >  ; AVX1-NEXT:    testb $8, %al
> >  ; AVX1-NEXT:    je .LBB12_8
> >  ; AVX1-NEXT:  .LBB12_7: # %cond.store5
> > -; AVX1-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> > +; AVX1-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> >  ; AVX1-NEXT:    testb $16, %al
> >  ; AVX1-NEXT:    je .LBB12_10
> >  ; AVX1-NEXT:  .LBB12_9: # %cond.store7
> > -; AVX1-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> > +; AVX1-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
> >  ; AVX1-NEXT:    testb $32, %al
> >  ; AVX1-NEXT:    je .LBB12_12
> >  ; AVX1-NEXT:  .LBB12_11: # %cond.store9
> > -; AVX1-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> > +; AVX1-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
> >  ; AVX1-NEXT:    testb $64, %al
> >  ; AVX1-NEXT:    je .LBB12_14
> >  ; AVX1-NEXT:  .LBB12_13: # %cond.store11
> > -; AVX1-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> > +; AVX1-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
> >  ; AVX1-NEXT:    testb $-128, %al
> >  ; AVX1-NEXT:    je .LBB12_16
> >  ; AVX1-NEXT:  .LBB12_15: # %cond.store13
> > -; AVX1-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> > +; AVX1-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
> >  ; AVX1-NEXT:    vzeroupper
> >  ; AVX1-NEXT:    retq
> >  ;
> >  ; AVX2-LABEL: truncstore_v8i32_v8i8:
> >  ; AVX2:       # %bb.0:
> >  ; AVX2-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > -; AVX2-NEXT:    vpbroadcastd {{.*#+}} ymm3 = [127,127,127,127,127,127,127,127]
> > -; AVX2-NEXT:    vpminsd %ymm3, %ymm0, %ymm0
> > -; AVX2-NEXT:    vpbroadcastd {{.*#+}} ymm3 = [4294967168,4294967168,4294967168,4294967168,4294967168,4294967168,4294967168,4294967168]
> > -; AVX2-NEXT:    vpmaxsd %ymm3, %ymm0, %ymm0
> >  ; AVX2-NEXT:    vextracti128 $1, %ymm0, %xmm3
> >  ; AVX2-NEXT:    vpackssdw %xmm3, %xmm0, %xmm0
> > +; AVX2-NEXT:    vpacksswb %xmm0, %xmm0, %xmm0
> >  ; AVX2-NEXT:    vpcmpeqd %ymm2, %ymm1, %ymm1
> >  ; AVX2-NEXT:    vmovmskps %ymm1, %eax
> >  ; AVX2-NEXT:    notl %eax
> > @@ -4932,31 +4959,31 @@ define void @truncstore_v8i32_v8i8(<8 x
> >  ; AVX2-NEXT:    testb $2, %al
> >  ; AVX2-NEXT:    je .LBB12_4
> >  ; AVX2-NEXT:  .LBB12_3: # %cond.store1
> > -; AVX2-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> > +; AVX2-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX2-NEXT:    testb $4, %al
> >  ; AVX2-NEXT:    je .LBB12_6
> >  ; AVX2-NEXT:  .LBB12_5: # %cond.store3
> > -; AVX2-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> > +; AVX2-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> >  ; AVX2-NEXT:    testb $8, %al
> >  ; AVX2-NEXT:    je .LBB12_8
> >  ; AVX2-NEXT:  .LBB12_7: # %cond.store5
> > -; AVX2-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> > +; AVX2-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> >  ; AVX2-NEXT:    testb $16, %al
> >  ; AVX2-NEXT:    je .LBB12_10
> >  ; AVX2-NEXT:  .LBB12_9: # %cond.store7
> > -; AVX2-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> > +; AVX2-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
> >  ; AVX2-NEXT:    testb $32, %al
> >  ; AVX2-NEXT:    je .LBB12_12
> >  ; AVX2-NEXT:  .LBB12_11: # %cond.store9
> > -; AVX2-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> > +; AVX2-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
> >  ; AVX2-NEXT:    testb $64, %al
> >  ; AVX2-NEXT:    je .LBB12_14
> >  ; AVX2-NEXT:  .LBB12_13: # %cond.store11
> > -; AVX2-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> > +; AVX2-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
> >  ; AVX2-NEXT:    testb $-128, %al
> >  ; AVX2-NEXT:    je .LBB12_16
> >  ; AVX2-NEXT:  .LBB12_15: # %cond.store13
> > -; AVX2-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> > +; AVX2-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
> >  ; AVX2-NEXT:    vzeroupper
> >  ; AVX2-NEXT:    retq
> >  ;
> > @@ -4968,7 +4995,7 @@ define void @truncstore_v8i32_v8i8(<8 x
> >  ; AVX512F-NEXT:    vpminsd %ymm1, %ymm0, %ymm0
> >  ; AVX512F-NEXT:    vpbroadcastd {{.*#+}} ymm1 = [4294967168,4294967168,4294967168,4294967168,4294967168,4294967168,4294967168,4294967168]
> >  ; AVX512F-NEXT:    vpmaxsd %ymm1, %ymm0, %ymm0
> > -; AVX512F-NEXT:    vpmovdw %zmm0, %ymm0
> > +; AVX512F-NEXT:    vpmovdb %zmm0, %xmm0
> >  ; AVX512F-NEXT:    kmovw %k0, %eax
> >  ; AVX512F-NEXT:    testb $1, %al
> >  ; AVX512F-NEXT:    jne .LBB12_1
> > @@ -5001,31 +5028,31 @@ define void @truncstore_v8i32_v8i8(<8 x
> >  ; AVX512F-NEXT:    testb $2, %al
> >  ; AVX512F-NEXT:    je .LBB12_4
> >  ; AVX512F-NEXT:  .LBB12_3: # %cond.store1
> > -; AVX512F-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> > +; AVX512F-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX512F-NEXT:    testb $4, %al
> >  ; AVX512F-NEXT:    je .LBB12_6
> >  ; AVX512F-NEXT:  .LBB12_5: # %cond.store3
> > -; AVX512F-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> > +; AVX512F-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> >  ; AVX512F-NEXT:    testb $8, %al
> >  ; AVX512F-NEXT:    je .LBB12_8
> >  ; AVX512F-NEXT:  .LBB12_7: # %cond.store5
> > -; AVX512F-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> > +; AVX512F-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> >  ; AVX512F-NEXT:    testb $16, %al
> >  ; AVX512F-NEXT:    je .LBB12_10
> >  ; AVX512F-NEXT:  .LBB12_9: # %cond.store7
> > -; AVX512F-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> > +; AVX512F-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
> >  ; AVX512F-NEXT:    testb $32, %al
> >  ; AVX512F-NEXT:    je .LBB12_12
> >  ; AVX512F-NEXT:  .LBB12_11: # %cond.store9
> > -; AVX512F-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> > +; AVX512F-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
> >  ; AVX512F-NEXT:    testb $64, %al
> >  ; AVX512F-NEXT:    je .LBB12_14
> >  ; AVX512F-NEXT:  .LBB12_13: # %cond.store11
> > -; AVX512F-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> > +; AVX512F-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
> >  ; AVX512F-NEXT:    testb $-128, %al
> >  ; AVX512F-NEXT:    je .LBB12_16
> >  ; AVX512F-NEXT:  .LBB12_15: # %cond.store13
> > -; AVX512F-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> > +; AVX512F-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
> >  ; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> >  ;
> > @@ -5033,14 +5060,13 @@ define void @truncstore_v8i32_v8i8(<8 x
> >  ; AVX512BW:       # %bb.0:
> >  ; AVX512BW-NEXT:    # kill: def $ymm1 killed $ymm1 def $zmm1
> >  ; AVX512BW-NEXT:    vptestmd %zmm1, %zmm1, %k0
> > +; AVX512BW-NEXT:    kshiftlq $56, %k0, %k0
> > +; AVX512BW-NEXT:    kshiftrq $56, %k0, %k1
> >  ; AVX512BW-NEXT:    vpbroadcastd {{.*#+}} ymm1 = [127,127,127,127,127,127,127,127]
> >  ; AVX512BW-NEXT:    vpminsd %ymm1, %ymm0, %ymm0
> >  ; AVX512BW-NEXT:    vpbroadcastd {{.*#+}} ymm1 = [4294967168,4294967168,4294967168,4294967168,4294967168,4294967168,4294967168,4294967168]
> >  ; AVX512BW-NEXT:    vpmaxsd %ymm1, %ymm0, %ymm0
> > -; AVX512BW-NEXT:    vpmovdw %zmm0, %ymm0
> > -; AVX512BW-NEXT:    vpacksswb %xmm0, %xmm0, %xmm0
> > -; AVX512BW-NEXT:    kshiftlq $56, %k0, %k0
> > -; AVX512BW-NEXT:    kshiftrq $56, %k0, %k1
> > +; AVX512BW-NEXT:    vpmovdb %zmm0, %xmm0
> >  ; AVX512BW-NEXT:    vmovdqu8 %zmm0, (%rdi) {%k1}
> >  ; AVX512BW-NEXT:    vzeroupper
> >  ; AVX512BW-NEXT:    retq
> > @@ -5067,18 +5093,7 @@ define void @truncstore_v4i32_v4i16(<4 x
> >  ; SSE2-LABEL: truncstore_v4i32_v4i16:
> >  ; SSE2:       # %bb.0:
> >  ; SSE2-NEXT:    pxor %xmm2, %xmm2
> > -; SSE2-NEXT:    movdqa {{.*#+}} xmm3 = [32767,32767,32767,32767]
> > -; SSE2-NEXT:    movdqa %xmm3, %xmm4
> > -; SSE2-NEXT:    pcmpgtd %xmm0, %xmm4
> > -; SSE2-NEXT:    pand %xmm4, %xmm0
> > -; SSE2-NEXT:    pandn %xmm3, %xmm4
> > -; SSE2-NEXT:    por %xmm0, %xmm4
> > -; SSE2-NEXT:    movdqa {{.*#+}} xmm3 = [4294934528,4294934528,4294934528,4294934528]
> > -; SSE2-NEXT:    movdqa %xmm4, %xmm0
> > -; SSE2-NEXT:    pcmpgtd %xmm3, %xmm0
> > -; SSE2-NEXT:    pand %xmm0, %xmm4
> > -; SSE2-NEXT:    pandn %xmm3, %xmm0
> > -; SSE2-NEXT:    por %xmm4, %xmm0
> > +; SSE2-NEXT:    packssdw %xmm0, %xmm0
> >  ; SSE2-NEXT:    pcmpeqd %xmm1, %xmm2
> >  ; SSE2-NEXT:    movmskps %xmm2, %eax
> >  ; SSE2-NEXT:    xorl $15, %eax
> > @@ -5101,25 +5116,24 @@ define void @truncstore_v4i32_v4i16(<4 x
> >  ; SSE2-NEXT:    testb $2, %al
> >  ; SSE2-NEXT:    je .LBB13_4
> >  ; SSE2-NEXT:  .LBB13_3: # %cond.store1
> > -; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> > +; SSE2-NEXT:    pextrw $1, %xmm0, %ecx
> >  ; SSE2-NEXT:    movw %cx, 2(%rdi)
> >  ; SSE2-NEXT:    testb $4, %al
> >  ; SSE2-NEXT:    je .LBB13_6
> >  ; SSE2-NEXT:  .LBB13_5: # %cond.store3
> > -; SSE2-NEXT:    pextrw $4, %xmm0, %ecx
> > +; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> >  ; SSE2-NEXT:    movw %cx, 4(%rdi)
> >  ; SSE2-NEXT:    testb $8, %al
> >  ; SSE2-NEXT:    je .LBB13_8
> >  ; SSE2-NEXT:  .LBB13_7: # %cond.store5
> > -; SSE2-NEXT:    pextrw $6, %xmm0, %eax
> > +; SSE2-NEXT:    pextrw $3, %xmm0, %eax
> >  ; SSE2-NEXT:    movw %ax, 6(%rdi)
> >  ; SSE2-NEXT:    retq
> >  ;
> >  ; SSE4-LABEL: truncstore_v4i32_v4i16:
> >  ; SSE4:       # %bb.0:
> >  ; SSE4-NEXT:    pxor %xmm2, %xmm2
> > -; SSE4-NEXT:    pminsd {{.*}}(%rip), %xmm0
> > -; SSE4-NEXT:    pmaxsd {{.*}}(%rip), %xmm0
> > +; SSE4-NEXT:    packssdw %xmm0, %xmm0
> >  ; SSE4-NEXT:    pcmpeqd %xmm1, %xmm2
> >  ; SSE4-NEXT:    movmskps %xmm2, %eax
> >  ; SSE4-NEXT:    xorl $15, %eax
> > @@ -5141,92 +5155,52 @@ define void @truncstore_v4i32_v4i16(<4 x
> >  ; SSE4-NEXT:    testb $2, %al
> >  ; SSE4-NEXT:    je .LBB13_4
> >  ; SSE4-NEXT:  .LBB13_3: # %cond.store1
> > -; SSE4-NEXT:    pextrw $2, %xmm0, 2(%rdi)
> > +; SSE4-NEXT:    pextrw $1, %xmm0, 2(%rdi)
> >  ; SSE4-NEXT:    testb $4, %al
> >  ; SSE4-NEXT:    je .LBB13_6
> >  ; SSE4-NEXT:  .LBB13_5: # %cond.store3
> > -; SSE4-NEXT:    pextrw $4, %xmm0, 4(%rdi)
> > +; SSE4-NEXT:    pextrw $2, %xmm0, 4(%rdi)
> >  ; SSE4-NEXT:    testb $8, %al
> >  ; SSE4-NEXT:    je .LBB13_8
> >  ; SSE4-NEXT:  .LBB13_7: # %cond.store5
> > -; SSE4-NEXT:    pextrw $6, %xmm0, 6(%rdi)
> > +; SSE4-NEXT:    pextrw $3, %xmm0, 6(%rdi)
> >  ; SSE4-NEXT:    retq
> >  ;
> > -; AVX1-LABEL: truncstore_v4i32_v4i16:
> > -; AVX1:       # %bb.0:
> > -; AVX1-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > -; AVX1-NEXT:    vpminsd {{.*}}(%rip), %xmm0, %xmm0
> > -; AVX1-NEXT:    vpmaxsd {{.*}}(%rip), %xmm0, %xmm0
> > -; AVX1-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
> > -; AVX1-NEXT:    vmovmskps %xmm1, %eax
> > -; AVX1-NEXT:    xorl $15, %eax
> > -; AVX1-NEXT:    testb $1, %al
> > -; AVX1-NEXT:    jne .LBB13_1
> > -; AVX1-NEXT:  # %bb.2: # %else
> > -; AVX1-NEXT:    testb $2, %al
> > -; AVX1-NEXT:    jne .LBB13_3
> > -; AVX1-NEXT:  .LBB13_4: # %else2
> > -; AVX1-NEXT:    testb $4, %al
> > -; AVX1-NEXT:    jne .LBB13_5
> > -; AVX1-NEXT:  .LBB13_6: # %else4
> > -; AVX1-NEXT:    testb $8, %al
> > -; AVX1-NEXT:    jne .LBB13_7
> > -; AVX1-NEXT:  .LBB13_8: # %else6
> > -; AVX1-NEXT:    retq
> > -; AVX1-NEXT:  .LBB13_1: # %cond.store
> > -; AVX1-NEXT:    vpextrw $0, %xmm0, (%rdi)
> > -; AVX1-NEXT:    testb $2, %al
> > -; AVX1-NEXT:    je .LBB13_4
> > -; AVX1-NEXT:  .LBB13_3: # %cond.store1
> > -; AVX1-NEXT:    vpextrw $2, %xmm0, 2(%rdi)
> > -; AVX1-NEXT:    testb $4, %al
> > -; AVX1-NEXT:    je .LBB13_6
> > -; AVX1-NEXT:  .LBB13_5: # %cond.store3
> > -; AVX1-NEXT:    vpextrw $4, %xmm0, 4(%rdi)
> > -; AVX1-NEXT:    testb $8, %al
> > -; AVX1-NEXT:    je .LBB13_8
> > -; AVX1-NEXT:  .LBB13_7: # %cond.store5
> > -; AVX1-NEXT:    vpextrw $6, %xmm0, 6(%rdi)
> > -; AVX1-NEXT:    retq
> > -;
> > -; AVX2-LABEL: truncstore_v4i32_v4i16:
> > -; AVX2:       # %bb.0:
> > -; AVX2-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > -; AVX2-NEXT:    vpbroadcastd {{.*#+}} xmm3 = [32767,32767,32767,32767]
> > -; AVX2-NEXT:    vpminsd %xmm3, %xmm0, %xmm0
> > -; AVX2-NEXT:    vpbroadcastd {{.*#+}} xmm3 = [4294934528,4294934528,4294934528,4294934528]
> > -; AVX2-NEXT:    vpmaxsd %xmm3, %xmm0, %xmm0
> > -; AVX2-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
> > -; AVX2-NEXT:    vmovmskps %xmm1, %eax
> > -; AVX2-NEXT:    xorl $15, %eax
> > -; AVX2-NEXT:    testb $1, %al
> > -; AVX2-NEXT:    jne .LBB13_1
> > -; AVX2-NEXT:  # %bb.2: # %else
> > -; AVX2-NEXT:    testb $2, %al
> > -; AVX2-NEXT:    jne .LBB13_3
> > -; AVX2-NEXT:  .LBB13_4: # %else2
> > -; AVX2-NEXT:    testb $4, %al
> > -; AVX2-NEXT:    jne .LBB13_5
> > -; AVX2-NEXT:  .LBB13_6: # %else4
> > -; AVX2-NEXT:    testb $8, %al
> > -; AVX2-NEXT:    jne .LBB13_7
> > -; AVX2-NEXT:  .LBB13_8: # %else6
> > -; AVX2-NEXT:    retq
> > -; AVX2-NEXT:  .LBB13_1: # %cond.store
> > -; AVX2-NEXT:    vpextrw $0, %xmm0, (%rdi)
> > -; AVX2-NEXT:    testb $2, %al
> > -; AVX2-NEXT:    je .LBB13_4
> > -; AVX2-NEXT:  .LBB13_3: # %cond.store1
> > -; AVX2-NEXT:    vpextrw $2, %xmm0, 2(%rdi)
> > -; AVX2-NEXT:    testb $4, %al
> > -; AVX2-NEXT:    je .LBB13_6
> > -; AVX2-NEXT:  .LBB13_5: # %cond.store3
> > -; AVX2-NEXT:    vpextrw $4, %xmm0, 4(%rdi)
> > -; AVX2-NEXT:    testb $8, %al
> > -; AVX2-NEXT:    je .LBB13_8
> > -; AVX2-NEXT:  .LBB13_7: # %cond.store5
> > -; AVX2-NEXT:    vpextrw $6, %xmm0, 6(%rdi)
> > -; AVX2-NEXT:    retq
> > +; AVX-LABEL: truncstore_v4i32_v4i16:
> > +; AVX:       # %bb.0:
> > +; AVX-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > +; AVX-NEXT:    vpackssdw %xmm0, %xmm0, %xmm0
> > +; AVX-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
> > +; AVX-NEXT:    vmovmskps %xmm1, %eax
> > +; AVX-NEXT:    xorl $15, %eax
> > +; AVX-NEXT:    testb $1, %al
> > +; AVX-NEXT:    jne .LBB13_1
> > +; AVX-NEXT:  # %bb.2: # %else
> > +; AVX-NEXT:    testb $2, %al
> > +; AVX-NEXT:    jne .LBB13_3
> > +; AVX-NEXT:  .LBB13_4: # %else2
> > +; AVX-NEXT:    testb $4, %al
> > +; AVX-NEXT:    jne .LBB13_5
> > +; AVX-NEXT:  .LBB13_6: # %else4
> > +; AVX-NEXT:    testb $8, %al
> > +; AVX-NEXT:    jne .LBB13_7
> > +; AVX-NEXT:  .LBB13_8: # %else6
> > +; AVX-NEXT:    retq
> > +; AVX-NEXT:  .LBB13_1: # %cond.store
> > +; AVX-NEXT:    vpextrw $0, %xmm0, (%rdi)
> > +; AVX-NEXT:    testb $2, %al
> > +; AVX-NEXT:    je .LBB13_4
> > +; AVX-NEXT:  .LBB13_3: # %cond.store1
> > +; AVX-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
> > +; AVX-NEXT:    testb $4, %al
> > +; AVX-NEXT:    je .LBB13_6
> > +; AVX-NEXT:  .LBB13_5: # %cond.store3
> > +; AVX-NEXT:    vpextrw $2, %xmm0, 4(%rdi)
> > +; AVX-NEXT:    testb $8, %al
> > +; AVX-NEXT:    je .LBB13_8
> > +; AVX-NEXT:  .LBB13_7: # %cond.store5
> > +; AVX-NEXT:    vpextrw $3, %xmm0, 6(%rdi)
> > +; AVX-NEXT:    retq
> >  ;
> >  ; AVX512F-LABEL: truncstore_v4i32_v4i16:
> >  ; AVX512F:       # %bb.0:
> > @@ -5236,6 +5210,7 @@ define void @truncstore_v4i32_v4i16(<4 x
> >  ; AVX512F-NEXT:    vpminsd %xmm1, %xmm0, %xmm0
> >  ; AVX512F-NEXT:    vpbroadcastd {{.*#+}} xmm1 = [4294934528,4294934528,4294934528,4294934528]
> >  ; AVX512F-NEXT:    vpmaxsd %xmm1, %xmm0, %xmm0
> > +; AVX512F-NEXT:    vpackssdw %xmm0, %xmm0, %xmm0
> >  ; AVX512F-NEXT:    kmovw %k0, %eax
> >  ; AVX512F-NEXT:    testb $1, %al
> >  ; AVX512F-NEXT:    jne .LBB13_1
> > @@ -5256,15 +5231,15 @@ define void @truncstore_v4i32_v4i16(<4 x
> >  ; AVX512F-NEXT:    testb $2, %al
> >  ; AVX512F-NEXT:    je .LBB13_4
> >  ; AVX512F-NEXT:  .LBB13_3: # %cond.store1
> > -; AVX512F-NEXT:    vpextrw $2, %xmm0, 2(%rdi)
> > +; AVX512F-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
> >  ; AVX512F-NEXT:    testb $4, %al
> >  ; AVX512F-NEXT:    je .LBB13_6
> >  ; AVX512F-NEXT:  .LBB13_5: # %cond.store3
> > -; AVX512F-NEXT:    vpextrw $4, %xmm0, 4(%rdi)
> > +; AVX512F-NEXT:    vpextrw $2, %xmm0, 4(%rdi)
> >  ; AVX512F-NEXT:    testb $8, %al
> >  ; AVX512F-NEXT:    je .LBB13_8
> >  ; AVX512F-NEXT:  .LBB13_7: # %cond.store5
> > -; AVX512F-NEXT:    vpextrw $6, %xmm0, 6(%rdi)
> > +; AVX512F-NEXT:    vpextrw $3, %xmm0, 6(%rdi)
> >  ; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> >  ;
> > @@ -5272,13 +5247,13 @@ define void @truncstore_v4i32_v4i16(<4 x
> >  ; AVX512BW:       # %bb.0:
> >  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> >  ; AVX512BW-NEXT:    vptestmd %zmm1, %zmm1, %k0
> > +; AVX512BW-NEXT:    kshiftld $28, %k0, %k0
> > +; AVX512BW-NEXT:    kshiftrd $28, %k0, %k1
> >  ; AVX512BW-NEXT:    vpbroadcastd {{.*#+}} xmm1 = [32767,32767,32767,32767]
> >  ; AVX512BW-NEXT:    vpminsd %xmm1, %xmm0, %xmm0
> >  ; AVX512BW-NEXT:    vpbroadcastd {{.*#+}} xmm1 = [4294934528,4294934528,4294934528,4294934528]
> >  ; AVX512BW-NEXT:    vpmaxsd %xmm1, %xmm0, %xmm0
> >  ; AVX512BW-NEXT:    vpackssdw %xmm0, %xmm0, %xmm0
> > -; AVX512BW-NEXT:    kshiftld $28, %k0, %k0
> > -; AVX512BW-NEXT:    kshiftrd $28, %k0, %k1
> >  ; AVX512BW-NEXT:    vmovdqu16 %zmm0, (%rdi) {%k1}
> >  ; AVX512BW-NEXT:    vzeroupper
> >  ; AVX512BW-NEXT:    retq
> > @@ -5310,45 +5285,48 @@ define void @truncstore_v4i32_v4i8(<4 x
> >  ; SSE2-NEXT:    pand %xmm4, %xmm0
> >  ; SSE2-NEXT:    pandn %xmm3, %xmm4
> >  ; SSE2-NEXT:    por %xmm0, %xmm4
> > -; SSE2-NEXT:    movdqa {{.*#+}} xmm3 = [4294967168,4294967168,4294967168,4294967168]
> > -; SSE2-NEXT:    movdqa %xmm4, %xmm0
> > -; SSE2-NEXT:    pcmpgtd %xmm3, %xmm0
> > -; SSE2-NEXT:    pand %xmm0, %xmm4
> > -; SSE2-NEXT:    pandn %xmm3, %xmm0
> > -; SSE2-NEXT:    por %xmm4, %xmm0
> > +; SSE2-NEXT:    movdqa {{.*#+}} xmm0 = [4294967168,4294967168,4294967168,4294967168]
> > +; SSE2-NEXT:    movdqa %xmm4, %xmm3
> > +; SSE2-NEXT:    pcmpgtd %xmm0, %xmm3
> > +; SSE2-NEXT:    pand %xmm3, %xmm4
> > +; SSE2-NEXT:    pandn %xmm0, %xmm3
> > +; SSE2-NEXT:    por %xmm4, %xmm3
> > +; SSE2-NEXT:    pand {{.*}}(%rip), %xmm3
> > +; SSE2-NEXT:    packuswb %xmm3, %xmm3
> > +; SSE2-NEXT:    packuswb %xmm3, %xmm3
> >  ; SSE2-NEXT:    pcmpeqd %xmm1, %xmm2
> > -; SSE2-NEXT:    movmskps %xmm2, %eax
> > -; SSE2-NEXT:    xorl $15, %eax
> > -; SSE2-NEXT:    testb $1, %al
> > +; SSE2-NEXT:    movmskps %xmm2, %ecx
> > +; SSE2-NEXT:    xorl $15, %ecx
> > +; SSE2-NEXT:    testb $1, %cl
> > +; SSE2-NEXT:    movd %xmm3, %eax
> >  ; SSE2-NEXT:    jne .LBB14_1
> >  ; SSE2-NEXT:  # %bb.2: # %else
> > -; SSE2-NEXT:    testb $2, %al
> > +; SSE2-NEXT:    testb $2, %cl
> >  ; SSE2-NEXT:    jne .LBB14_3
> >  ; SSE2-NEXT:  .LBB14_4: # %else2
> > -; SSE2-NEXT:    testb $4, %al
> > +; SSE2-NEXT:    testb $4, %cl
> >  ; SSE2-NEXT:    jne .LBB14_5
> >  ; SSE2-NEXT:  .LBB14_6: # %else4
> > -; SSE2-NEXT:    testb $8, %al
> > +; SSE2-NEXT:    testb $8, %cl
> >  ; SSE2-NEXT:    jne .LBB14_7
> >  ; SSE2-NEXT:  .LBB14_8: # %else6
> >  ; SSE2-NEXT:    retq
> >  ; SSE2-NEXT:  .LBB14_1: # %cond.store
> > -; SSE2-NEXT:    movd %xmm0, %ecx
> > -; SSE2-NEXT:    movb %cl, (%rdi)
> > -; SSE2-NEXT:    testb $2, %al
> > +; SSE2-NEXT:    movb %al, (%rdi)
> > +; SSE2-NEXT:    testb $2, %cl
> >  ; SSE2-NEXT:    je .LBB14_4
> >  ; SSE2-NEXT:  .LBB14_3: # %cond.store1
> > -; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> > -; SSE2-NEXT:    movb %cl, 1(%rdi)
> > -; SSE2-NEXT:    testb $4, %al
> > +; SSE2-NEXT:    movb %ah, 1(%rdi)
> > +; SSE2-NEXT:    testb $4, %cl
> >  ; SSE2-NEXT:    je .LBB14_6
> >  ; SSE2-NEXT:  .LBB14_5: # %cond.store3
> > -; SSE2-NEXT:    pextrw $4, %xmm0, %ecx
> > -; SSE2-NEXT:    movb %cl, 2(%rdi)
> > -; SSE2-NEXT:    testb $8, %al
> > +; SSE2-NEXT:    movl %eax, %edx
> > +; SSE2-NEXT:    shrl $16, %edx
> > +; SSE2-NEXT:    movb %dl, 2(%rdi)
> > +; SSE2-NEXT:    testb $8, %cl
> >  ; SSE2-NEXT:    je .LBB14_8
> >  ; SSE2-NEXT:  .LBB14_7: # %cond.store5
> > -; SSE2-NEXT:    pextrw $6, %xmm0, %eax
> > +; SSE2-NEXT:    shrl $24, %eax
> >  ; SSE2-NEXT:    movb %al, 3(%rdi)
> >  ; SSE2-NEXT:    retq
> >  ;
> > @@ -5357,6 +5335,7 @@ define void @truncstore_v4i32_v4i8(<4 x
> >  ; SSE4-NEXT:    pxor %xmm2, %xmm2
> >  ; SSE4-NEXT:    pminsd {{.*}}(%rip), %xmm0
> >  ; SSE4-NEXT:    pmaxsd {{.*}}(%rip), %xmm0
> > +; SSE4-NEXT:    pshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
> >  ; SSE4-NEXT:    pcmpeqd %xmm1, %xmm2
> >  ; SSE4-NEXT:    movmskps %xmm2, %eax
> >  ; SSE4-NEXT:    xorl $15, %eax
> > @@ -5378,15 +5357,15 @@ define void @truncstore_v4i32_v4i8(<4 x
> >  ; SSE4-NEXT:    testb $2, %al
> >  ; SSE4-NEXT:    je .LBB14_4
> >  ; SSE4-NEXT:  .LBB14_3: # %cond.store1
> > -; SSE4-NEXT:    pextrb $4, %xmm0, 1(%rdi)
> > +; SSE4-NEXT:    pextrb $1, %xmm0, 1(%rdi)
> >  ; SSE4-NEXT:    testb $4, %al
> >  ; SSE4-NEXT:    je .LBB14_6
> >  ; SSE4-NEXT:  .LBB14_5: # %cond.store3
> > -; SSE4-NEXT:    pextrb $8, %xmm0, 2(%rdi)
> > +; SSE4-NEXT:    pextrb $2, %xmm0, 2(%rdi)
> >  ; SSE4-NEXT:    testb $8, %al
> >  ; SSE4-NEXT:    je .LBB14_8
> >  ; SSE4-NEXT:  .LBB14_7: # %cond.store5
> > -; SSE4-NEXT:    pextrb $12, %xmm0, 3(%rdi)
> > +; SSE4-NEXT:    pextrb $3, %xmm0, 3(%rdi)
> >  ; SSE4-NEXT:    retq
> >  ;
> >  ; AVX1-LABEL: truncstore_v4i32_v4i8:
> > @@ -5394,6 +5373,7 @@ define void @truncstore_v4i32_v4i8(<4 x
> >  ; AVX1-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> >  ; AVX1-NEXT:    vpminsd {{.*}}(%rip), %xmm0, %xmm0
> >  ; AVX1-NEXT:    vpmaxsd {{.*}}(%rip), %xmm0, %xmm0
> > +; AVX1-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
> >  ; AVX1-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
> >  ; AVX1-NEXT:    vmovmskps %xmm1, %eax
> >  ; AVX1-NEXT:    xorl $15, %eax
> > @@ -5415,15 +5395,15 @@ define void @truncstore_v4i32_v4i8(<4 x
> >  ; AVX1-NEXT:    testb $2, %al
> >  ; AVX1-NEXT:    je .LBB14_4
> >  ; AVX1-NEXT:  .LBB14_3: # %cond.store1
> > -; AVX1-NEXT:    vpextrb $4, %xmm0, 1(%rdi)
> > +; AVX1-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX1-NEXT:    testb $4, %al
> >  ; AVX1-NEXT:    je .LBB14_6
> >  ; AVX1-NEXT:  .LBB14_5: # %cond.store3
> > -; AVX1-NEXT:    vpextrb $8, %xmm0, 2(%rdi)
> > +; AVX1-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> >  ; AVX1-NEXT:    testb $8, %al
> >  ; AVX1-NEXT:    je .LBB14_8
> >  ; AVX1-NEXT:  .LBB14_7: # %cond.store5
> > -; AVX1-NEXT:    vpextrb $12, %xmm0, 3(%rdi)
> > +; AVX1-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> >  ; AVX1-NEXT:    retq
> >  ;
> >  ; AVX2-LABEL: truncstore_v4i32_v4i8:
> > @@ -5433,6 +5413,7 @@ define void @truncstore_v4i32_v4i8(<4 x
> >  ; AVX2-NEXT:    vpminsd %xmm3, %xmm0, %xmm0
> >  ; AVX2-NEXT:    vpbroadcastd {{.*#+}} xmm3 = [4294967168,4294967168,4294967168,4294967168]
> >  ; AVX2-NEXT:    vpmaxsd %xmm3, %xmm0, %xmm0
> > +; AVX2-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
> >  ; AVX2-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
> >  ; AVX2-NEXT:    vmovmskps %xmm1, %eax
> >  ; AVX2-NEXT:    xorl $15, %eax
> > @@ -5454,15 +5435,15 @@ define void @truncstore_v4i32_v4i8(<4 x
> >  ; AVX2-NEXT:    testb $2, %al
> >  ; AVX2-NEXT:    je .LBB14_4
> >  ; AVX2-NEXT:  .LBB14_3: # %cond.store1
> > -; AVX2-NEXT:    vpextrb $4, %xmm0, 1(%rdi)
> > +; AVX2-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX2-NEXT:    testb $4, %al
> >  ; AVX2-NEXT:    je .LBB14_6
> >  ; AVX2-NEXT:  .LBB14_5: # %cond.store3
> > -; AVX2-NEXT:    vpextrb $8, %xmm0, 2(%rdi)
> > +; AVX2-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> >  ; AVX2-NEXT:    testb $8, %al
> >  ; AVX2-NEXT:    je .LBB14_8
> >  ; AVX2-NEXT:  .LBB14_7: # %cond.store5
> > -; AVX2-NEXT:    vpextrb $12, %xmm0, 3(%rdi)
> > +; AVX2-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> >  ; AVX2-NEXT:    retq
> >  ;
> >  ; AVX512F-LABEL: truncstore_v4i32_v4i8:
> > @@ -5473,6 +5454,7 @@ define void @truncstore_v4i32_v4i8(<4 x
> >  ; AVX512F-NEXT:    vpminsd %xmm1, %xmm0, %xmm0
> >  ; AVX512F-NEXT:    vpbroadcastd {{.*#+}} xmm1 = [4294967168,4294967168,4294967168,4294967168]
> >  ; AVX512F-NEXT:    vpmaxsd %xmm1, %xmm0, %xmm0
> > +; AVX512F-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
> >  ; AVX512F-NEXT:    kmovw %k0, %eax
> >  ; AVX512F-NEXT:    testb $1, %al
> >  ; AVX512F-NEXT:    jne .LBB14_1
> > @@ -5493,15 +5475,15 @@ define void @truncstore_v4i32_v4i8(<4 x
> >  ; AVX512F-NEXT:    testb $2, %al
> >  ; AVX512F-NEXT:    je .LBB14_4
> >  ; AVX512F-NEXT:  .LBB14_3: # %cond.store1
> > -; AVX512F-NEXT:    vpextrb $4, %xmm0, 1(%rdi)
> > +; AVX512F-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX512F-NEXT:    testb $4, %al
> >  ; AVX512F-NEXT:    je .LBB14_6
> >  ; AVX512F-NEXT:  .LBB14_5: # %cond.store3
> > -; AVX512F-NEXT:    vpextrb $8, %xmm0, 2(%rdi)
> > +; AVX512F-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> >  ; AVX512F-NEXT:    testb $8, %al
> >  ; AVX512F-NEXT:    je .LBB14_8
> >  ; AVX512F-NEXT:  .LBB14_7: # %cond.store5
> > -; AVX512F-NEXT:    vpextrb $12, %xmm0, 3(%rdi)
> > +; AVX512F-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> >  ; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> >  ;
> > @@ -5509,13 +5491,13 @@ define void @truncstore_v4i32_v4i8(<4 x
> >  ; AVX512BW:       # %bb.0:
> >  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> >  ; AVX512BW-NEXT:    vptestmd %zmm1, %zmm1, %k0
> > +; AVX512BW-NEXT:    kshiftlq $60, %k0, %k0
> > +; AVX512BW-NEXT:    kshiftrq $60, %k0, %k1
> >  ; AVX512BW-NEXT:    vpbroadcastd {{.*#+}} xmm1 = [127,127,127,127]
> >  ; AVX512BW-NEXT:    vpminsd %xmm1, %xmm0, %xmm0
> >  ; AVX512BW-NEXT:    vpbroadcastd {{.*#+}} xmm1 = [4294967168,4294967168,4294967168,4294967168]
> >  ; AVX512BW-NEXT:    vpmaxsd %xmm1, %xmm0, %xmm0
> >  ; AVX512BW-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
> > -; AVX512BW-NEXT:    kshiftlq $60, %k0, %k0
> > -; AVX512BW-NEXT:    kshiftrq $60, %k0, %k1
> >  ; AVX512BW-NEXT:    vmovdqu8 %zmm0, (%rdi) {%k1}
> >  ; AVX512BW-NEXT:    vzeroupper
> >  ; AVX512BW-NEXT:    retq
> > @@ -7373,8 +7355,7 @@ define void @truncstore_v8i16_v8i8(<8 x
> >  ; SSE2-LABEL: truncstore_v8i16_v8i8:
> >  ; SSE2:       # %bb.0:
> >  ; SSE2-NEXT:    pxor %xmm2, %xmm2
> > -; SSE2-NEXT:    pminsw {{.*}}(%rip), %xmm0
> > -; SSE2-NEXT:    pmaxsw {{.*}}(%rip), %xmm0
> > +; SSE2-NEXT:    packsswb %xmm0, %xmm0
> >  ; SSE2-NEXT:    pcmpeqw %xmm1, %xmm2
> >  ; SSE2-NEXT:    pcmpeqd %xmm1, %xmm1
> >  ; SSE2-NEXT:    pxor %xmm2, %xmm1
> > @@ -7391,17 +7372,26 @@ define void @truncstore_v8i16_v8i8(<8 x
> >  ; SSE2-NEXT:    jne .LBB17_5
> >  ; SSE2-NEXT:  .LBB17_6: # %else4
> >  ; SSE2-NEXT:    testb $8, %al
> > -; SSE2-NEXT:    jne .LBB17_7
> > +; SSE2-NEXT:    je .LBB17_8
> > +; SSE2-NEXT:  .LBB17_7: # %cond.store5
> > +; SSE2-NEXT:    shrl $24, %ecx
> > +; SSE2-NEXT:    movb %cl, 3(%rdi)
> >  ; SSE2-NEXT:  .LBB17_8: # %else6
> >  ; SSE2-NEXT:    testb $16, %al
> > -; SSE2-NEXT:    jne .LBB17_9
> > +; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> > +; SSE2-NEXT:    je .LBB17_10
> > +; SSE2-NEXT:  # %bb.9: # %cond.store7
> > +; SSE2-NEXT:    movb %cl, 4(%rdi)
> >  ; SSE2-NEXT:  .LBB17_10: # %else8
> >  ; SSE2-NEXT:    testb $32, %al
> > -; SSE2-NEXT:    jne .LBB17_11
> > +; SSE2-NEXT:    je .LBB17_12
> > +; SSE2-NEXT:  # %bb.11: # %cond.store9
> > +; SSE2-NEXT:    movb %ch, 5(%rdi)
> >  ; SSE2-NEXT:  .LBB17_12: # %else10
> >  ; SSE2-NEXT:    testb $64, %al
> > +; SSE2-NEXT:    pextrw $3, %xmm0, %ecx
> >  ; SSE2-NEXT:    jne .LBB17_13
> > -; SSE2-NEXT:  .LBB17_14: # %else12
> > +; SSE2-NEXT:  # %bb.14: # %else12
> >  ; SSE2-NEXT:    testb $-128, %al
> >  ; SSE2-NEXT:    jne .LBB17_15
> >  ; SSE2-NEXT:  .LBB17_16: # %else14
> > @@ -7411,45 +7401,28 @@ define void @truncstore_v8i16_v8i8(<8 x
> >  ; SSE2-NEXT:    testb $2, %al
> >  ; SSE2-NEXT:    je .LBB17_4
> >  ; SSE2-NEXT:  .LBB17_3: # %cond.store1
> > -; SSE2-NEXT:    shrl $16, %ecx
> > -; SSE2-NEXT:    movb %cl, 1(%rdi)
> > +; SSE2-NEXT:    movb %ch, 1(%rdi)
> >  ; SSE2-NEXT:    testb $4, %al
> >  ; SSE2-NEXT:    je .LBB17_6
> >  ; SSE2-NEXT:  .LBB17_5: # %cond.store3
> > -; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> > -; SSE2-NEXT:    movb %cl, 2(%rdi)
> > +; SSE2-NEXT:    movl %ecx, %edx
> > +; SSE2-NEXT:    shrl $16, %edx
> > +; SSE2-NEXT:    movb %dl, 2(%rdi)
> >  ; SSE2-NEXT:    testb $8, %al
> > -; SSE2-NEXT:    je .LBB17_8
> > -; SSE2-NEXT:  .LBB17_7: # %cond.store5
> > -; SSE2-NEXT:    pextrw $3, %xmm0, %ecx
> > -; SSE2-NEXT:    movb %cl, 3(%rdi)
> > -; SSE2-NEXT:    testb $16, %al
> > -; SSE2-NEXT:    je .LBB17_10
> > -; SSE2-NEXT:  .LBB17_9: # %cond.store7
> > -; SSE2-NEXT:    pextrw $4, %xmm0, %ecx
> > -; SSE2-NEXT:    movb %cl, 4(%rdi)
> > -; SSE2-NEXT:    testb $32, %al
> > -; SSE2-NEXT:    je .LBB17_12
> > -; SSE2-NEXT:  .LBB17_11: # %cond.store9
> > -; SSE2-NEXT:    pextrw $5, %xmm0, %ecx
> > -; SSE2-NEXT:    movb %cl, 5(%rdi)
> > -; SSE2-NEXT:    testb $64, %al
> > -; SSE2-NEXT:    je .LBB17_14
> > +; SSE2-NEXT:    jne .LBB17_7
> > +; SSE2-NEXT:    jmp .LBB17_8
> >  ; SSE2-NEXT:  .LBB17_13: # %cond.store11
> > -; SSE2-NEXT:    pextrw $6, %xmm0, %ecx
> >  ; SSE2-NEXT:    movb %cl, 6(%rdi)
> >  ; SSE2-NEXT:    testb $-128, %al
> >  ; SSE2-NEXT:    je .LBB17_16
> >  ; SSE2-NEXT:  .LBB17_15: # %cond.store13
> > -; SSE2-NEXT:    pextrw $7, %xmm0, %eax
> > -; SSE2-NEXT:    movb %al, 7(%rdi)
> > +; SSE2-NEXT:    movb %ch, 7(%rdi)
> >  ; SSE2-NEXT:    retq
> >  ;
> >  ; SSE4-LABEL: truncstore_v8i16_v8i8:
> >  ; SSE4:       # %bb.0:
> >  ; SSE4-NEXT:    pxor %xmm2, %xmm2
> > -; SSE4-NEXT:    pminsw {{.*}}(%rip), %xmm0
> > -; SSE4-NEXT:    pmaxsw {{.*}}(%rip), %xmm0
> > +; SSE4-NEXT:    packsswb %xmm0, %xmm0
> >  ; SSE4-NEXT:    pcmpeqw %xmm1, %xmm2
> >  ; SSE4-NEXT:    pcmpeqd %xmm1, %xmm1
> >  ; SSE4-NEXT:    pxor %xmm2, %xmm1
> > @@ -7485,38 +7458,37 @@ define void @truncstore_v8i16_v8i8(<8 x
> >  ; SSE4-NEXT:    testb $2, %al
> >  ; SSE4-NEXT:    je .LBB17_4
> >  ; SSE4-NEXT:  .LBB17_3: # %cond.store1
> > -; SSE4-NEXT:    pextrb $2, %xmm0, 1(%rdi)
> > +; SSE4-NEXT:    pextrb $1, %xmm0, 1(%rdi)
> >  ; SSE4-NEXT:    testb $4, %al
> >  ; SSE4-NEXT:    je .LBB17_6
> >  ; SSE4-NEXT:  .LBB17_5: # %cond.store3
> > -; SSE4-NEXT:    pextrb $4, %xmm0, 2(%rdi)
> > +; SSE4-NEXT:    pextrb $2, %xmm0, 2(%rdi)
> >  ; SSE4-NEXT:    testb $8, %al
> >  ; SSE4-NEXT:    je .LBB17_8
> >  ; SSE4-NEXT:  .LBB17_7: # %cond.store5
> > -; SSE4-NEXT:    pextrb $6, %xmm0, 3(%rdi)
> > +; SSE4-NEXT:    pextrb $3, %xmm0, 3(%rdi)
> >  ; SSE4-NEXT:    testb $16, %al
> >  ; SSE4-NEXT:    je .LBB17_10
> >  ; SSE4-NEXT:  .LBB17_9: # %cond.store7
> > -; SSE4-NEXT:    pextrb $8, %xmm0, 4(%rdi)
> > +; SSE4-NEXT:    pextrb $4, %xmm0, 4(%rdi)
> >  ; SSE4-NEXT:    testb $32, %al
> >  ; SSE4-NEXT:    je .LBB17_12
> >  ; SSE4-NEXT:  .LBB17_11: # %cond.store9
> > -; SSE4-NEXT:    pextrb $10, %xmm0, 5(%rdi)
> > +; SSE4-NEXT:    pextrb $5, %xmm0, 5(%rdi)
> >  ; SSE4-NEXT:    testb $64, %al
> >  ; SSE4-NEXT:    je .LBB17_14
> >  ; SSE4-NEXT:  .LBB17_13: # %cond.store11
> > -; SSE4-NEXT:    pextrb $12, %xmm0, 6(%rdi)
> > +; SSE4-NEXT:    pextrb $6, %xmm0, 6(%rdi)
> >  ; SSE4-NEXT:    testb $-128, %al
> >  ; SSE4-NEXT:    je .LBB17_16
> >  ; SSE4-NEXT:  .LBB17_15: # %cond.store13
> > -; SSE4-NEXT:    pextrb $14, %xmm0, 7(%rdi)
> > +; SSE4-NEXT:    pextrb $7, %xmm0, 7(%rdi)
> >  ; SSE4-NEXT:    retq
> >  ;
> >  ; AVX-LABEL: truncstore_v8i16_v8i8:
> >  ; AVX:       # %bb.0:
> >  ; AVX-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> > -; AVX-NEXT:    vpminsw {{.*}}(%rip), %xmm0, %xmm0
> > -; AVX-NEXT:    vpmaxsw {{.*}}(%rip), %xmm0, %xmm0
> > +; AVX-NEXT:    vpacksswb %xmm0, %xmm0, %xmm0
> >  ; AVX-NEXT:    vpcmpeqw %xmm2, %xmm1, %xmm1
> >  ; AVX-NEXT:    vpcmpeqd %xmm2, %xmm2, %xmm2
> >  ; AVX-NEXT:    vpxor %xmm2, %xmm1, %xmm1
> > @@ -7552,31 +7524,31 @@ define void @truncstore_v8i16_v8i8(<8 x
> >  ; AVX-NEXT:    testb $2, %al
> >  ; AVX-NEXT:    je .LBB17_4
> >  ; AVX-NEXT:  .LBB17_3: # %cond.store1
> > -; AVX-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> > +; AVX-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX-NEXT:    testb $4, %al
> >  ; AVX-NEXT:    je .LBB17_6
> >  ; AVX-NEXT:  .LBB17_5: # %cond.store3
> > -; AVX-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> > +; AVX-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> >  ; AVX-NEXT:    testb $8, %al
> >  ; AVX-NEXT:    je .LBB17_8
> >  ; AVX-NEXT:  .LBB17_7: # %cond.store5
> > -; AVX-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> > +; AVX-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> >  ; AVX-NEXT:    testb $16, %al
> >  ; AVX-NEXT:    je .LBB17_10
> >  ; AVX-NEXT:  .LBB17_9: # %cond.store7
> > -; AVX-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> > +; AVX-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
> >  ; AVX-NEXT:    testb $32, %al
> >  ; AVX-NEXT:    je .LBB17_12
> >  ; AVX-NEXT:  .LBB17_11: # %cond.store9
> > -; AVX-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> > +; AVX-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
> >  ; AVX-NEXT:    testb $64, %al
> >  ; AVX-NEXT:    je .LBB17_14
> >  ; AVX-NEXT:  .LBB17_13: # %cond.store11
> > -; AVX-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> > +; AVX-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
> >  ; AVX-NEXT:    testb $-128, %al
> >  ; AVX-NEXT:    je .LBB17_16
> >  ; AVX-NEXT:  .LBB17_15: # %cond.store13
> > -; AVX-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> > +; AVX-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
> >  ; AVX-NEXT:    retq
> >  ;
> >  ; AVX512F-LABEL: truncstore_v8i16_v8i8:
> > @@ -7588,6 +7560,7 @@ define void @truncstore_v8i16_v8i8(<8 x
> >  ; AVX512F-NEXT:    vptestmq %zmm1, %zmm1, %k0
> >  ; AVX512F-NEXT:    vpminsw {{.*}}(%rip), %xmm0, %xmm0
> >  ; AVX512F-NEXT:    vpmaxsw {{.*}}(%rip), %xmm0, %xmm0
> > +; AVX512F-NEXT:    vpacksswb %xmm0, %xmm0, %xmm0
> >  ; AVX512F-NEXT:    kmovw %k0, %eax
> >  ; AVX512F-NEXT:    testb $1, %al
> >  ; AVX512F-NEXT:    jne .LBB17_1
> > @@ -7620,31 +7593,31 @@ define void @truncstore_v8i16_v8i8(<8 x
> >  ; AVX512F-NEXT:    testb $2, %al
> >  ; AVX512F-NEXT:    je .LBB17_4
> >  ; AVX512F-NEXT:  .LBB17_3: # %cond.store1
> > -; AVX512F-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> > +; AVX512F-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX512F-NEXT:    testb $4, %al
> >  ; AVX512F-NEXT:    je .LBB17_6
> >  ; AVX512F-NEXT:  .LBB17_5: # %cond.store3
> > -; AVX512F-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> > +; AVX512F-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> >  ; AVX512F-NEXT:    testb $8, %al
> >  ; AVX512F-NEXT:    je .LBB17_8
> >  ; AVX512F-NEXT:  .LBB17_7: # %cond.store5
> > -; AVX512F-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> > +; AVX512F-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> >  ; AVX512F-NEXT:    testb $16, %al
> >  ; AVX512F-NEXT:    je .LBB17_10
> >  ; AVX512F-NEXT:  .LBB17_9: # %cond.store7
> > -; AVX512F-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> > +; AVX512F-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
> >  ; AVX512F-NEXT:    testb $32, %al
> >  ; AVX512F-NEXT:    je .LBB17_12
> >  ; AVX512F-NEXT:  .LBB17_11: # %cond.store9
> > -; AVX512F-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> > +; AVX512F-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
> >  ; AVX512F-NEXT:    testb $64, %al
> >  ; AVX512F-NEXT:    je .LBB17_14
> >  ; AVX512F-NEXT:  .LBB17_13: # %cond.store11
> > -; AVX512F-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> > +; AVX512F-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
> >  ; AVX512F-NEXT:    testb $-128, %al
> >  ; AVX512F-NEXT:    je .LBB17_16
> >  ; AVX512F-NEXT:  .LBB17_15: # %cond.store13
> > -; AVX512F-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> > +; AVX512F-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
> >  ; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> >  ;
> > @@ -7652,11 +7625,11 @@ define void @truncstore_v8i16_v8i8(<8 x
> >  ; AVX512BW:       # %bb.0:
> >  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> >  ; AVX512BW-NEXT:    vptestmw %zmm1, %zmm1, %k0
> > +; AVX512BW-NEXT:    kshiftlq $56, %k0, %k0
> > +; AVX512BW-NEXT:    kshiftrq $56, %k0, %k1
> >  ; AVX512BW-NEXT:    vpminsw {{.*}}(%rip), %xmm0, %xmm0
> >  ; AVX512BW-NEXT:    vpmaxsw {{.*}}(%rip), %xmm0, %xmm0
> >  ; AVX512BW-NEXT:    vpacksswb %xmm0, %xmm0, %xmm0
> > -; AVX512BW-NEXT:    kshiftlq $56, %k0, %k0
> > -; AVX512BW-NEXT:    kshiftrq $56, %k0, %k1
> >  ; AVX512BW-NEXT:    vmovdqu8 %zmm0, (%rdi) {%k1}
> >  ; AVX512BW-NEXT:    vzeroupper
> >  ; AVX512BW-NEXT:    retq
> >
> > Modified: llvm/trunk/test/CodeGen/X86/masked_store_trunc_usat.ll
> > URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/masked_store_trunc_usat.ll?rev=368183&r1=368182&r2=368183&view=diff
> > ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/masked_store_trunc_usat.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/masked_store_trunc_usat.ll Wed Aug  7 09:24:26 2019
> > @@ -872,6 +872,7 @@ define void @truncstore_v8i64_v8i8(<8 x
> >  ; SSE2-NEXT:    por %xmm2, %xmm0
> >  ; SSE2-NEXT:    packuswb %xmm1, %xmm0
> >  ; SSE2-NEXT:    packuswb %xmm0, %xmm7
> > +; SSE2-NEXT:    packuswb %xmm7, %xmm7
> >  ; SSE2-NEXT:    pcmpeqd %xmm8, %xmm5
> >  ; SSE2-NEXT:    pcmpeqd %xmm0, %xmm0
> >  ; SSE2-NEXT:    pxor %xmm0, %xmm5
> > @@ -891,17 +892,26 @@ define void @truncstore_v8i64_v8i8(<8 x
> >  ; SSE2-NEXT:    jne .LBB2_5
> >  ; SSE2-NEXT:  .LBB2_6: # %else4
> >  ; SSE2-NEXT:    testb $8, %al
> > -; SSE2-NEXT:    jne .LBB2_7
> > +; SSE2-NEXT:    je .LBB2_8
> > +; SSE2-NEXT:  .LBB2_7: # %cond.store5
> > +; SSE2-NEXT:    shrl $24, %ecx
> > +; SSE2-NEXT:    movb %cl, 3(%rdi)
> >  ; SSE2-NEXT:  .LBB2_8: # %else6
> >  ; SSE2-NEXT:    testb $16, %al
> > -; SSE2-NEXT:    jne .LBB2_9
> > +; SSE2-NEXT:    pextrw $2, %xmm7, %ecx
> > +; SSE2-NEXT:    je .LBB2_10
> > +; SSE2-NEXT:  # %bb.9: # %cond.store7
> > +; SSE2-NEXT:    movb %cl, 4(%rdi)
> >  ; SSE2-NEXT:  .LBB2_10: # %else8
> >  ; SSE2-NEXT:    testb $32, %al
> > -; SSE2-NEXT:    jne .LBB2_11
> > +; SSE2-NEXT:    je .LBB2_12
> > +; SSE2-NEXT:  # %bb.11: # %cond.store9
> > +; SSE2-NEXT:    movb %ch, 5(%rdi)
> >  ; SSE2-NEXT:  .LBB2_12: # %else10
> >  ; SSE2-NEXT:    testb $64, %al
> > +; SSE2-NEXT:    pextrw $3, %xmm7, %ecx
> >  ; SSE2-NEXT:    jne .LBB2_13
> > -; SSE2-NEXT:  .LBB2_14: # %else12
> > +; SSE2-NEXT:  # %bb.14: # %else12
> >  ; SSE2-NEXT:    testb $-128, %al
> >  ; SSE2-NEXT:    jne .LBB2_15
> >  ; SSE2-NEXT:  .LBB2_16: # %else14
> > @@ -911,38 +921,22 @@ define void @truncstore_v8i64_v8i8(<8 x
> >  ; SSE2-NEXT:    testb $2, %al
> >  ; SSE2-NEXT:    je .LBB2_4
> >  ; SSE2-NEXT:  .LBB2_3: # %cond.store1
> > -; SSE2-NEXT:    shrl $16, %ecx
> > -; SSE2-NEXT:    movb %cl, 1(%rdi)
> > +; SSE2-NEXT:    movb %ch, 1(%rdi)
> >  ; SSE2-NEXT:    testb $4, %al
> >  ; SSE2-NEXT:    je .LBB2_6
> >  ; SSE2-NEXT:  .LBB2_5: # %cond.store3
> > -; SSE2-NEXT:    pextrw $2, %xmm7, %ecx
> > -; SSE2-NEXT:    movb %cl, 2(%rdi)
> > +; SSE2-NEXT:    movl %ecx, %edx
> > +; SSE2-NEXT:    shrl $16, %edx
> > +; SSE2-NEXT:    movb %dl, 2(%rdi)
> >  ; SSE2-NEXT:    testb $8, %al
> > -; SSE2-NEXT:    je .LBB2_8
> > -; SSE2-NEXT:  .LBB2_7: # %cond.store5
> > -; SSE2-NEXT:    pextrw $3, %xmm7, %ecx
> > -; SSE2-NEXT:    movb %cl, 3(%rdi)
> > -; SSE2-NEXT:    testb $16, %al
> > -; SSE2-NEXT:    je .LBB2_10
> > -; SSE2-NEXT:  .LBB2_9: # %cond.store7
> > -; SSE2-NEXT:    pextrw $4, %xmm7, %ecx
> > -; SSE2-NEXT:    movb %cl, 4(%rdi)
> > -; SSE2-NEXT:    testb $32, %al
> > -; SSE2-NEXT:    je .LBB2_12
> > -; SSE2-NEXT:  .LBB2_11: # %cond.store9
> > -; SSE2-NEXT:    pextrw $5, %xmm7, %ecx
> > -; SSE2-NEXT:    movb %cl, 5(%rdi)
> > -; SSE2-NEXT:    testb $64, %al
> > -; SSE2-NEXT:    je .LBB2_14
> > +; SSE2-NEXT:    jne .LBB2_7
> > +; SSE2-NEXT:    jmp .LBB2_8
> >  ; SSE2-NEXT:  .LBB2_13: # %cond.store11
> > -; SSE2-NEXT:    pextrw $6, %xmm7, %ecx
> >  ; SSE2-NEXT:    movb %cl, 6(%rdi)
> >  ; SSE2-NEXT:    testb $-128, %al
> >  ; SSE2-NEXT:    je .LBB2_16
> >  ; SSE2-NEXT:  .LBB2_15: # %cond.store13
> > -; SSE2-NEXT:    pextrw $7, %xmm7, %eax
> > -; SSE2-NEXT:    movb %al, 7(%rdi)
> > +; SSE2-NEXT:    movb %ch, 7(%rdi)
> >  ; SSE2-NEXT:    retq
> >  ;
> >  ; SSE4-LABEL: truncstore_v8i64_v8i8:
> > @@ -977,6 +971,7 @@ define void @truncstore_v8i64_v8i8(<8 x
> >  ; SSE4-NEXT:    blendvpd %xmm0, %xmm2, %xmm6
> >  ; SSE4-NEXT:    packusdw %xmm7, %xmm6
> >  ; SSE4-NEXT:    packusdw %xmm6, %xmm1
> > +; SSE4-NEXT:    packuswb %xmm1, %xmm1
> >  ; SSE4-NEXT:    pcmpeqd %xmm8, %xmm5
> >  ; SSE4-NEXT:    pcmpeqd %xmm0, %xmm0
> >  ; SSE4-NEXT:    pxor %xmm0, %xmm5
> > @@ -1015,31 +1010,31 @@ define void @truncstore_v8i64_v8i8(<8 x
> >  ; SSE4-NEXT:    testb $2, %al
> >  ; SSE4-NEXT:    je .LBB2_4
> >  ; SSE4-NEXT:  .LBB2_3: # %cond.store1
> > -; SSE4-NEXT:    pextrb $2, %xmm1, 1(%rdi)
> > +; SSE4-NEXT:    pextrb $1, %xmm1, 1(%rdi)
> >  ; SSE4-NEXT:    testb $4, %al
> >  ; SSE4-NEXT:    je .LBB2_6
> >  ; SSE4-NEXT:  .LBB2_5: # %cond.store3
> > -; SSE4-NEXT:    pextrb $4, %xmm1, 2(%rdi)
> > +; SSE4-NEXT:    pextrb $2, %xmm1, 2(%rdi)
> >  ; SSE4-NEXT:    testb $8, %al
> >  ; SSE4-NEXT:    je .LBB2_8
> >  ; SSE4-NEXT:  .LBB2_7: # %cond.store5
> > -; SSE4-NEXT:    pextrb $6, %xmm1, 3(%rdi)
> > +; SSE4-NEXT:    pextrb $3, %xmm1, 3(%rdi)
> >  ; SSE4-NEXT:    testb $16, %al
> >  ; SSE4-NEXT:    je .LBB2_10
> >  ; SSE4-NEXT:  .LBB2_9: # %cond.store7
> > -; SSE4-NEXT:    pextrb $8, %xmm1, 4(%rdi)
> > +; SSE4-NEXT:    pextrb $4, %xmm1, 4(%rdi)
> >  ; SSE4-NEXT:    testb $32, %al
> >  ; SSE4-NEXT:    je .LBB2_12
> >  ; SSE4-NEXT:  .LBB2_11: # %cond.store9
> > -; SSE4-NEXT:    pextrb $10, %xmm1, 5(%rdi)
> > +; SSE4-NEXT:    pextrb $5, %xmm1, 5(%rdi)
> >  ; SSE4-NEXT:    testb $64, %al
> >  ; SSE4-NEXT:    je .LBB2_14
> >  ; SSE4-NEXT:  .LBB2_13: # %cond.store11
> > -; SSE4-NEXT:    pextrb $12, %xmm1, 6(%rdi)
> > +; SSE4-NEXT:    pextrb $6, %xmm1, 6(%rdi)
> >  ; SSE4-NEXT:    testb $-128, %al
> >  ; SSE4-NEXT:    je .LBB2_16
> >  ; SSE4-NEXT:  .LBB2_15: # %cond.store13
> > -; SSE4-NEXT:    pextrb $14, %xmm1, 7(%rdi)
> > +; SSE4-NEXT:    pextrb $7, %xmm1, 7(%rdi)
> >  ; SSE4-NEXT:    retq
> >  ;
> >  ; AVX1-LABEL: truncstore_v8i64_v8i8:
> > @@ -1064,6 +1059,7 @@ define void @truncstore_v8i64_v8i8(<8 x
> >  ; AVX1-NEXT:    vblendvpd %xmm8, %xmm0, %xmm5, %xmm0
> >  ; AVX1-NEXT:    vpackusdw %xmm3, %xmm0, %xmm0
> >  ; AVX1-NEXT:    vpackusdw %xmm1, %xmm0, %xmm0
> > +; AVX1-NEXT:    vpackuswb %xmm0, %xmm0, %xmm0
> >  ; AVX1-NEXT:    vextractf128 $1, %ymm2, %xmm1
> >  ; AVX1-NEXT:    vpxor %xmm3, %xmm3, %xmm3
> >  ; AVX1-NEXT:    vpcmpeqd %xmm3, %xmm1, %xmm1
> > @@ -1102,31 +1098,31 @@ define void @truncstore_v8i64_v8i8(<8 x
> >  ; AVX1-NEXT:    testb $2, %al
> >  ; AVX1-NEXT:    je .LBB2_4
> >  ; AVX1-NEXT:  .LBB2_3: # %cond.store1
> > -; AVX1-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> > +; AVX1-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX1-NEXT:    testb $4, %al
> >  ; AVX1-NEXT:    je .LBB2_6
> >  ; AVX1-NEXT:  .LBB2_5: # %cond.store3
> > -; AVX1-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> > +; AVX1-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> >  ; AVX1-NEXT:    testb $8, %al
> >  ; AVX1-NEXT:    je .LBB2_8
> >  ; AVX1-NEXT:  .LBB2_7: # %cond.store5
> > -; AVX1-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> > +; AVX1-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> >  ; AVX1-NEXT:    testb $16, %al
> >  ; AVX1-NEXT:    je .LBB2_10
> >  ; AVX1-NEXT:  .LBB2_9: # %cond.store7
> > -; AVX1-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> > +; AVX1-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
> >  ; AVX1-NEXT:    testb $32, %al
> >  ; AVX1-NEXT:    je .LBB2_12
> >  ; AVX1-NEXT:  .LBB2_11: # %cond.store9
> > -; AVX1-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> > +; AVX1-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
> >  ; AVX1-NEXT:    testb $64, %al
> >  ; AVX1-NEXT:    je .LBB2_14
> >  ; AVX1-NEXT:  .LBB2_13: # %cond.store11
> > -; AVX1-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> > +; AVX1-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
> >  ; AVX1-NEXT:    testb $-128, %al
> >  ; AVX1-NEXT:    je .LBB2_16
> >  ; AVX1-NEXT:  .LBB2_15: # %cond.store13
> > -; AVX1-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> > +; AVX1-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
> >  ; AVX1-NEXT:    vzeroupper
> >  ; AVX1-NEXT:    retq
> >  ;
> > @@ -1135,17 +1131,24 @@ define void @truncstore_v8i64_v8i8(<8 x
> >  ; AVX2-NEXT:    vpxor %xmm3, %xmm3, %xmm3
> >  ; AVX2-NEXT:    vbroadcastsd {{.*#+}} ymm4 = [255,255,255,255]
> > -; AVX2-NEXT:    vpbroadcastq {{.*#+}} ymm5 = [9223372036854775808,9223372036854775808,9223372036854775808,9223372036854775808]
> > -; AVX2-NEXT:    vpxor %ymm5, %ymm1, %ymm6
> > +; AVX2-NEXT:    vpxor %ymm5, %ymm0, %ymm6
> >  ; AVX2-NEXT:    vpbroadcastq {{.*#+}} ymm7 = [9223372036854776063,9223372036854776063,9223372036854776063,9223372036854776063]
> >  ; AVX2-NEXT:    vpcmpgtq %ymm6, %ymm7, %ymm6
> > -; AVX2-NEXT:    vblendvpd %ymm6, %ymm1, %ymm4, %ymm1
> > -; AVX2-NEXT:    vpxor %ymm5, %ymm0, %ymm5
> > +; AVX2-NEXT:    vblendvpd %ymm6, %ymm0, %ymm4, %ymm0
> > +; AVX2-NEXT:    vpxor %ymm5, %ymm1, %ymm5
> >  ; AVX2-NEXT:    vpcmpgtq %ymm5, %ymm7, %ymm5
> > -; AVX2-NEXT:    vblendvpd %ymm5, %ymm0, %ymm4, %ymm0
> > -; AVX2-NEXT:    vpackusdw %ymm1, %ymm0, %ymm0
> > -; AVX2-NEXT:    vpermq {{.*#+}} ymm0 = ymm0[0,2,1,3]
> > -; AVX2-NEXT:    vextracti128 $1, %ymm0, %xmm1
> > -; AVX2-NEXT:    vpackusdw %xmm1, %xmm0, %xmm0
> > +; AVX2-NEXT:    vblendvpd %ymm5, %ymm1, %ymm4, %ymm1
> > +; AVX2-NEXT:    vextractf128 $1, %ymm1, %xmm4
> > +; AVX2-NEXT:    vmovdqa {{.*#+}} xmm5 = <u,u,0,8,u,u,u,u,u,u,u,u,u,u,u,u>
> > +; AVX2-NEXT:    vpshufb %xmm5, %xmm4, %xmm4
> > +; AVX2-NEXT:    vpshufb %xmm5, %xmm1, %xmm1
> > +; AVX2-NEXT:    vpunpcklwd {{.*#+}} xmm1 = xmm1[0],xmm4[0],xmm1[1],xmm4[1],xmm1[2],xmm4[2],xmm1[3],xmm4[3]
> > +; AVX2-NEXT:    vextractf128 $1, %ymm0, %xmm4
> > +; AVX2-NEXT:    vmovdqa {{.*#+}} xmm5 = <0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u>
> > +; AVX2-NEXT:    vpshufb %xmm5, %xmm4, %xmm4
> > +; AVX2-NEXT:    vpshufb %xmm5, %xmm0, %xmm0
> > +; AVX2-NEXT:    vpunpcklwd {{.*#+}} xmm0 = xmm0[0],xmm4[0],xmm0[1],xmm4[1],xmm0[2],xmm4[2],xmm0[3],xmm4[3]
> > +; AVX2-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm1[1],xmm0[2,3]
> >  ; AVX2-NEXT:    vpcmpeqd %ymm3, %ymm2, %ymm1
> >  ; AVX2-NEXT:    vmovmskps %ymm1, %eax
> >  ; AVX2-NEXT:    notl %eax
> > @@ -1180,31 +1183,31 @@ define void @truncstore_v8i64_v8i8(<8 x
> >  ; AVX2-NEXT:    testb $2, %al
> >  ; AVX2-NEXT:    je .LBB2_4
> >  ; AVX2-NEXT:  .LBB2_3: # %cond.store1
> > -; AVX2-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> > +; AVX2-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX2-NEXT:    testb $4, %al
> >  ; AVX2-NEXT:    je .LBB2_6
> >  ; AVX2-NEXT:  .LBB2_5: # %cond.store3
> > -; AVX2-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> > +; AVX2-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> >  ; AVX2-NEXT:    testb $8, %al
> >  ; AVX2-NEXT:    je .LBB2_8
> >  ; AVX2-NEXT:  .LBB2_7: # %cond.store5
> > -; AVX2-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> > +; AVX2-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> >  ; AVX2-NEXT:    testb $16, %al
> >  ; AVX2-NEXT:    je .LBB2_10
> >  ; AVX2-NEXT:  .LBB2_9: # %cond.store7
> > -; AVX2-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> > +; AVX2-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
> >  ; AVX2-NEXT:    testb $32, %al
> >  ; AVX2-NEXT:    je .LBB2_12
> >  ; AVX2-NEXT:  .LBB2_11: # %cond.store9
> > -; AVX2-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> > +; AVX2-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
> >  ; AVX2-NEXT:    testb $64, %al
> >  ; AVX2-NEXT:    je .LBB2_14
> >  ; AVX2-NEXT:  .LBB2_13: # %cond.store11
> > -; AVX2-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> > +; AVX2-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
> >  ; AVX2-NEXT:    testb $-128, %al
> >  ; AVX2-NEXT:    je .LBB2_16
> >  ; AVX2-NEXT:  .LBB2_15: # %cond.store13
> > -; AVX2-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> > +; AVX2-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
> >  ; AVX2-NEXT:    vzeroupper
> >  ; AVX2-NEXT:    retq
> >  ;
> > @@ -1213,7 +1216,7 @@ define void @truncstore_v8i64_v8i8(<8 x
> >  ; AVX512F-NEXT:    # kill: def $ymm1 killed $ymm1 def $zmm1
> >  ; AVX512F-NEXT:    vptestmd %zmm1, %zmm1, %k0
> >  ; AVX512F-NEXT:    vpminuq {{.*}}(%rip){1to8}, %zmm0, %zmm0
> > -; AVX512F-NEXT:    vpmovqw %zmm0, %xmm0
> > +; AVX512F-NEXT:    vpmovqb %zmm0, %xmm0
> >  ; AVX512F-NEXT:    kmovw %k0, %eax
> >  ; AVX512F-NEXT:    testb $1, %al
> >  ; AVX512F-NEXT:    jne .LBB2_1
> > @@ -1246,31 +1249,31 @@ define void @truncstore_v8i64_v8i8(<8 x
> >  ; AVX512F-NEXT:    testb $2, %al
> >  ; AVX512F-NEXT:    je .LBB2_4
> >  ; AVX512F-NEXT:  .LBB2_3: # %cond.store1
> > -; AVX512F-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> > +; AVX512F-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX512F-NEXT:    testb $4, %al
> >  ; AVX512F-NEXT:    je .LBB2_6
> >  ; AVX512F-NEXT:  .LBB2_5: # %cond.store3
> > -; AVX512F-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> > +; AVX512F-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> >  ; AVX512F-NEXT:    testb $8, %al
> >  ; AVX512F-NEXT:    je .LBB2_8
> >  ; AVX512F-NEXT:  .LBB2_7: # %cond.store5
> > -; AVX512F-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> > +; AVX512F-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> >  ; AVX512F-NEXT:    testb $16, %al
> >  ; AVX512F-NEXT:    je .LBB2_10
> >  ; AVX512F-NEXT:  .LBB2_9: # %cond.store7
> > -; AVX512F-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> > +; AVX512F-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
> >  ; AVX512F-NEXT:    testb $32, %al
> >  ; AVX512F-NEXT:    je .LBB2_12
> >  ; AVX512F-NEXT:  .LBB2_11: # %cond.store9
> > -; AVX512F-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> > +; AVX512F-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
> >  ; AVX512F-NEXT:    testb $64, %al
> >  ; AVX512F-NEXT:    je .LBB2_14
> >  ; AVX512F-NEXT:  .LBB2_13: # %cond.store11
> > -; AVX512F-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> > +; AVX512F-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
> >  ; AVX512F-NEXT:    testb $-128, %al
> >  ; AVX512F-NEXT:    je .LBB2_16
> >  ; AVX512F-NEXT:  .LBB2_15: # %cond.store13
> > -; AVX512F-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> > +; AVX512F-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
> >  ; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> >  ;
> > @@ -1504,7 +1507,7 @@ define void @truncstore_v4i64_v4i16(<4 x
> >  ; SSE2-NEXT:    pxor %xmm3, %xmm3
> >  ; SSE2-NEXT:    movdqa {{.*#+}} xmm8 = [65535,65535]
> >  ; SSE2-NEXT:    movdqa {{.*#+}} xmm5 = [9223372039002259456,9223372039002259456]
> > -; SSE2-NEXT:    movdqa %xmm1, %xmm6
> > +; SSE2-NEXT:    movdqa %xmm0, %xmm6
> >  ; SSE2-NEXT:    pxor %xmm5, %xmm6
> >  ; SSE2-NEXT:    movdqa {{.*#+}} xmm9 = [9223372039002324991,9223372039002324991]
> >  ; SSE2-NEXT:    movdqa %xmm9, %xmm7
> > @@ -1515,22 +1518,26 @@ define void @truncstore_v4i64_v4i16(<4 x
> >  ; SSE2-NEXT:    pand %xmm4, %xmm6
> >  ; SSE2-NEXT:    pshufd {{.*#+}} xmm4 = xmm7[1,1,3,3]
> >  ; SSE2-NEXT:    por %xmm6, %xmm4
> > -; SSE2-NEXT:    pand %xmm4, %xmm1
> > +; SSE2-NEXT:    pand %xmm4, %xmm0
> >  ; SSE2-NEXT:    pandn %xmm8, %xmm4
> > -; SSE2-NEXT:    por %xmm1, %xmm4
> > -; SSE2-NEXT:    pxor %xmm0, %xmm5
> > -; SSE2-NEXT:    movdqa %xmm9, %xmm1
> > -; SSE2-NEXT:    pcmpgtd %xmm5, %xmm1
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm6 = xmm1[0,0,2,2]
> > +; SSE2-NEXT:    por %xmm0, %xmm4
> > +; SSE2-NEXT:    pxor %xmm1, %xmm5
> > +; SSE2-NEXT:    movdqa %xmm9, %xmm0
> > +; SSE2-NEXT:    pcmpgtd %xmm5, %xmm0
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm6 = xmm0[0,0,2,2]
> >  ; SSE2-NEXT:    pcmpeqd %xmm9, %xmm5
> >  ; SSE2-NEXT:    pshufd {{.*#+}} xmm5 = xmm5[1,1,3,3]
> >  ; SSE2-NEXT:    pand %xmm6, %xmm5
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
> > -; SSE2-NEXT:    por %xmm5, %xmm1
> > -; SSE2-NEXT:    pand %xmm1, %xmm0
> > -; SSE2-NEXT:    pandn %xmm8, %xmm1
> > -; SSE2-NEXT:    por %xmm0, %xmm1
> > -; SSE2-NEXT:    shufps {{.*#+}} xmm1 = xmm1[0,2],xmm4[0,2]
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,1,3,3]
> > +; SSE2-NEXT:    por %xmm5, %xmm0
> > +; SSE2-NEXT:    pand %xmm0, %xmm1
> > +; SSE2-NEXT:    pandn %xmm8, %xmm0
> > +; SSE2-NEXT:    por %xmm1, %xmm0
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > +; SSE2-NEXT:    pshuflw {{.*#+}} xmm1 = xmm0[0,2,2,3,4,5,6,7]
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm4[0,2,2,3]
> > +; SSE2-NEXT:    pshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> > +; SSE2-NEXT:    punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
> >  ; SSE2-NEXT:    pcmpeqd %xmm2, %xmm3
> >  ; SSE2-NEXT:    movmskps %xmm3, %eax
> >  ; SSE2-NEXT:    xorl $15, %eax
> > @@ -1548,45 +1555,49 @@ define void @truncstore_v4i64_v4i16(<4 x
> >  ; SSE2-NEXT:  .LBB4_8: # %else6
> >  ; SSE2-NEXT:    retq
> >  ; SSE2-NEXT:  .LBB4_1: # %cond.store
> > -; SSE2-NEXT:    movd %xmm1, %ecx
> > +; SSE2-NEXT:    movd %xmm0, %ecx
> >  ; SSE2-NEXT:    movw %cx, (%rdi)
> >  ; SSE2-NEXT:    testb $2, %al
> >  ; SSE2-NEXT:    je .LBB4_4
> >  ; SSE2-NEXT:  .LBB4_3: # %cond.store1
> > -; SSE2-NEXT:    pextrw $2, %xmm1, %ecx
> > +; SSE2-NEXT:    pextrw $1, %xmm0, %ecx
> >  ; SSE2-NEXT:    movw %cx, 2(%rdi)
> >  ; SSE2-NEXT:    testb $4, %al
> >  ; SSE2-NEXT:    je .LBB4_6
> >  ; SSE2-NEXT:  .LBB4_5: # %cond.store3
> > -; SSE2-NEXT:    pextrw $4, %xmm1, %ecx
> > +; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> >  ; SSE2-NEXT:    movw %cx, 4(%rdi)
> >  ; SSE2-NEXT:    testb $8, %al
> >  ; SSE2-NEXT:    je .LBB4_8
> >  ; SSE2-NEXT:  .LBB4_7: # %cond.store5
> > -; SSE2-NEXT:    pextrw $6, %xmm1, %eax
> > +; SSE2-NEXT:    pextrw $3, %xmm0, %eax
> >  ; SSE2-NEXT:    movw %ax, 6(%rdi)
> >  ; SSE2-NEXT:    retq
> >  ;
> >  ; SSE4-LABEL: truncstore_v4i64_v4i16:
> >  ; SSE4:       # %bb.0:
> > -; SSE4-NEXT:    movdqa %xmm0, %xmm8
> > -; SSE4-NEXT:    pxor %xmm6, %xmm6
> > -; SSE4-NEXT:    movapd {{.*#+}} xmm5 = [65535,65535]
> > +; SSE4-NEXT:    movdqa %xmm0, %xmm5
> > +; SSE4-NEXT:    pxor %xmm8, %xmm8
> > +; SSE4-NEXT:    movapd {{.*#+}} xmm6 = [65535,65535]
> >  ; SSE4-NEXT:    movdqa {{.*#+}} xmm7 = [9223372036854775808,9223372036854775808]
> > -; SSE4-NEXT:    movdqa %xmm1, %xmm3
> > +; SSE4-NEXT:    movdqa %xmm0, %xmm3
> >  ; SSE4-NEXT:    pxor %xmm7, %xmm3
> >  ; SSE4-NEXT:    movdqa {{.*#+}} xmm4 = [9223372036854841343,9223372036854841343]
> >  ; SSE4-NEXT:    movdqa %xmm4, %xmm0
> >  ; SSE4-NEXT:    pcmpgtq %xmm3, %xmm0
> > -; SSE4-NEXT:    movapd %xmm5, %xmm3
> > -; SSE4-NEXT:    blendvpd %xmm0, %xmm1, %xmm3
> > -; SSE4-NEXT:    pxor %xmm8, %xmm7
> > +; SSE4-NEXT:    movapd %xmm6, %xmm3
> > +; SSE4-NEXT:    blendvpd %xmm0, %xmm5, %xmm3
> > +; SSE4-NEXT:    pxor %xmm1, %xmm7
> >  ; SSE4-NEXT:    pcmpgtq %xmm7, %xmm4
> >  ; SSE4-NEXT:    movdqa %xmm4, %xmm0
> > -; SSE4-NEXT:    blendvpd %xmm0, %xmm8, %xmm5
> > -; SSE4-NEXT:    packusdw %xmm3, %xmm5
> > -; SSE4-NEXT:    pcmpeqd %xmm2, %xmm6
> > -; SSE4-NEXT:    movmskps %xmm6, %eax
> > +; SSE4-NEXT:    blendvpd %xmm0, %xmm1, %xmm6
> > +; SSE4-NEXT:    pshufd {{.*#+}} xmm0 = xmm6[0,2,2,3]
> > +; SSE4-NEXT:    pshuflw {{.*#+}} xmm1 = xmm0[0,2,2,3,4,5,6,7]
> > +; SSE4-NEXT:    pshufd {{.*#+}} xmm0 = xmm3[0,2,2,3]
> > +; SSE4-NEXT:    pshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> > +; SSE4-NEXT:    punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
> > +; SSE4-NEXT:    pcmpeqd %xmm2, %xmm8
> > +; SSE4-NEXT:    movmskps %xmm8, %eax
> >  ; SSE4-NEXT:    xorl $15, %eax
> >  ; SSE4-NEXT:    testb $1, %al
> >  ; SSE4-NEXT:    jne .LBB4_1
> > @@ -1602,19 +1613,19 @@ define void @truncstore_v4i64_v4i16(<4 x
> >  ; SSE4-NEXT:  .LBB4_8: # %else6
> >  ; SSE4-NEXT:    retq
> >  ; SSE4-NEXT:  .LBB4_1: # %cond.store
> > -; SSE4-NEXT:    pextrw $0, %xmm5, (%rdi)
> > +; SSE4-NEXT:    pextrw $0, %xmm0, (%rdi)
> >  ; SSE4-NEXT:    testb $2, %al
> >  ; SSE4-NEXT:    je .LBB4_4
> >  ; SSE4-NEXT:  .LBB4_3: # %cond.store1
> > -; SSE4-NEXT:    pextrw $2, %xmm5, 2(%rdi)
> > +; SSE4-NEXT:    pextrw $1, %xmm0, 2(%rdi)
> >  ; SSE4-NEXT:    testb $4, %al
> >  ; SSE4-NEXT:    je .LBB4_6
> >  ; SSE4-NEXT:  .LBB4_5: # %cond.store3
> > -; SSE4-NEXT:    pextrw $4, %xmm5, 4(%rdi)
> > +; SSE4-NEXT:    pextrw $2, %xmm0, 4(%rdi)
> >  ; SSE4-NEXT:    testb $8, %al
> >  ; SSE4-NEXT:    je .LBB4_8
> >  ; SSE4-NEXT:  .LBB4_7: # %cond.store5
> > -; SSE4-NEXT:    pextrw $6, %xmm5, 6(%rdi)
> > +; SSE4-NEXT:    pextrw $3, %xmm0, 6(%rdi)
> >  ; SSE4-NEXT:    retq
> >  ;
> >  ; AVX1-LABEL: truncstore_v4i64_v4i16:
> > @@ -1629,8 +1640,12 @@ define void @truncstore_v4i64_v4i16(<4 x
> >  ; AVX1-NEXT:    vpcmpgtq %xmm3, %xmm5, %xmm3
> >  ; AVX1-NEXT:    vmovapd {{.*#+}} xmm5 = [65535,65535]
> >  ; AVX1-NEXT:    vblendvpd %xmm3, %xmm6, %xmm5, %xmm3
> > +; AVX1-NEXT:    vpermilps {{.*#+}} xmm3 = xmm3[0,2,2,3]
> > +; AVX1-NEXT:    vpshuflw {{.*#+}} xmm3 = xmm3[0,2,2,3,4,5,6,7]
> >  ; AVX1-NEXT:    vblendvpd %xmm4, %xmm0, %xmm5, %xmm0
> > -; AVX1-NEXT:    vpackusdw %xmm3, %xmm0, %xmm0
> > +; AVX1-NEXT:    vpermilps {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > +; AVX1-NEXT:    vpshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> > +; AVX1-NEXT:    vpunpckldq {{.*#+}} xmm0 = xmm0[0],xmm3[0],xmm0[1],xmm3[1]
> >  ; AVX1-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
> >  ; AVX1-NEXT:    vmovmskps %xmm1, %eax
> >  ; AVX1-NEXT:    xorl $15, %eax
> > @@ -1653,15 +1668,15 @@ define void @truncstore_v4i64_v4i16(<4 x
> >  ; AVX1-NEXT:    testb $2, %al
> >  ; AVX1-NEXT:    je .LBB4_4
> >  ; AVX1-NEXT:  .LBB4_3: # %cond.store1
> > -; AVX1-NEXT:    vpextrw $2, %xmm0, 2(%rdi)
> > +; AVX1-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
> >  ; AVX1-NEXT:    testb $4, %al
> >  ; AVX1-NEXT:    je .LBB4_6
> >  ; AVX1-NEXT:  .LBB4_5: # %cond.store3
> > -; AVX1-NEXT:    vpextrw $4, %xmm0, 4(%rdi)
> > +; AVX1-NEXT:    vpextrw $2, %xmm0, 4(%rdi)
> >  ; AVX1-NEXT:    testb $8, %al
> >  ; AVX1-NEXT:    je .LBB4_8
> >  ; AVX1-NEXT:  .LBB4_7: # %cond.store5
> > -; AVX1-NEXT:    vpextrw $6, %xmm0, 6(%rdi)
> > +; AVX1-NEXT:    vpextrw $3, %xmm0, 6(%rdi)
> >  ; AVX1-NEXT:    vzeroupper
> >  ; AVX1-NEXT:    retq
> >  ;
> > @@ -1675,7 +1690,11 @@ define void @truncstore_v4i64_v4i16(<4 x
> >  ; AVX2-NEXT:    vpcmpgtq %ymm4, %ymm5, %ymm4
> >  ; AVX2-NEXT:    vblendvpd %ymm4, %ymm0, %ymm3, %ymm0
> >  ; AVX2-NEXT:    vextractf128 $1, %ymm0, %xmm3
> > -; AVX2-NEXT:    vpackusdw %xmm3, %xmm0, %xmm0
> > +; AVX2-NEXT:    vpermilps {{.*#+}} xmm3 = xmm3[0,2,2,3]
> > +; AVX2-NEXT:    vpshuflw {{.*#+}} xmm3 = xmm3[0,2,2,3,4,5,6,7]
> > +; AVX2-NEXT:    vpermilps {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > +; AVX2-NEXT:    vpshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> > +; AVX2-NEXT:    vpunpckldq {{.*#+}} xmm0 = xmm0[0],xmm3[0],xmm0[1],xmm3[1]
> >  ; AVX2-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
> >  ; AVX2-NEXT:    vmovmskps %xmm1, %eax
> >  ; AVX2-NEXT:    xorl $15, %eax
> > @@ -1698,15 +1717,15 @@ define void @truncstore_v4i64_v4i16(<4 x
> >  ; AVX2-NEXT:    testb $2, %al
> >  ; AVX2-NEXT:    je .LBB4_4
> >  ; AVX2-NEXT:  .LBB4_3: # %cond.store1
> > -; AVX2-NEXT:    vpextrw $2, %xmm0, 2(%rdi)
> > +; AVX2-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
> >  ; AVX2-NEXT:    testb $4, %al
> >  ; AVX2-NEXT:    je .LBB4_6
> >  ; AVX2-NEXT:  .LBB4_5: # %cond.store3
> > -; AVX2-NEXT:    vpextrw $4, %xmm0, 4(%rdi)
> > +; AVX2-NEXT:    vpextrw $2, %xmm0, 4(%rdi)
> >  ; AVX2-NEXT:    testb $8, %al
> >  ; AVX2-NEXT:    je .LBB4_8
> >  ; AVX2-NEXT:  .LBB4_7: # %cond.store5
> > -; AVX2-NEXT:    vpextrw $6, %xmm0, 6(%rdi)
> > +; AVX2-NEXT:    vpextrw $3, %xmm0, 6(%rdi)
> >  ; AVX2-NEXT:    vzeroupper
> >  ; AVX2-NEXT:    retq
> >  ;
> > @@ -1717,7 +1736,7 @@ define void @truncstore_v4i64_v4i16(<4 x
> >  ; AVX512F-NEXT:    vptestmd %zmm1, %zmm1, %k0
> >  ; AVX512F-NEXT:    vpbroadcastq {{.*#+}} ymm1 = [65535,65535,65535,65535]
> >  ; AVX512F-NEXT:    vpminuq %zmm1, %zmm0, %zmm0
> > -; AVX512F-NEXT:    vpmovqd %zmm0, %ymm0
> > +; AVX512F-NEXT:    vpmovqw %zmm0, %xmm0
> >  ; AVX512F-NEXT:    kmovw %k0, %eax
> >  ; AVX512F-NEXT:    testb $1, %al
> >  ; AVX512F-NEXT:    jne .LBB4_1
> > @@ -1738,15 +1757,15 @@ define void @truncstore_v4i64_v4i16(<4 x
> >  ; AVX512F-NEXT:    testb $2, %al
> >  ; AVX512F-NEXT:    je .LBB4_4
> >  ; AVX512F-NEXT:  .LBB4_3: # %cond.store1
> > -; AVX512F-NEXT:    vpextrw $2, %xmm0, 2(%rdi)
> > +; AVX512F-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
> >  ; AVX512F-NEXT:    testb $4, %al
> >  ; AVX512F-NEXT:    je .LBB4_6
> >  ; AVX512F-NEXT:  .LBB4_5: # %cond.store3
> > -; AVX512F-NEXT:    vpextrw $4, %xmm0, 4(%rdi)
> > +; AVX512F-NEXT:    vpextrw $2, %xmm0, 4(%rdi)
> >  ; AVX512F-NEXT:    testb $8, %al
> >  ; AVX512F-NEXT:    je .LBB4_8
> >  ; AVX512F-NEXT:  .LBB4_7: # %cond.store5
> > -; AVX512F-NEXT:    vpextrw $6, %xmm0, 6(%rdi)
> > +; AVX512F-NEXT:    vpextrw $3, %xmm0, 6(%rdi)
> >  ; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> >  ;
> > @@ -1755,12 +1774,11 @@ define void @truncstore_v4i64_v4i16(<4 x
> >  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> >  ; AVX512BW-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
> >  ; AVX512BW-NEXT:    vptestmd %zmm1, %zmm1, %k0
> > -; AVX512BW-NEXT:    vpbroadcastq {{.*#+}} ymm1 = [65535,65535,65535,65535]
> > -; AVX512BW-NEXT:    vpminuq %zmm1, %zmm0, %zmm0
> > -; AVX512BW-NEXT:    vpmovqd %zmm0, %ymm0
> > -; AVX512BW-NEXT:    vpackusdw %xmm0, %xmm0, %xmm0
> >  ; AVX512BW-NEXT:    kshiftld $28, %k0, %k0
> >  ; AVX512BW-NEXT:    kshiftrd $28, %k0, %k1
> > +; AVX512BW-NEXT:    vpbroadcastq {{.*#+}} ymm1 = [65535,65535,65535,65535]
> > +; AVX512BW-NEXT:    vpminuq %zmm1, %zmm0, %zmm0
> > +; AVX512BW-NEXT:    vpmovqw %zmm0, %xmm0
> >  ; AVX512BW-NEXT:    vmovdqu16 %zmm0, (%rdi) {%k1}
> >  ; AVX512BW-NEXT:    vzeroupper
> >  ; AVX512BW-NEXT:    retq
> > @@ -1783,92 +1801,99 @@ define void @truncstore_v4i64_v4i16(<4 x
> >  define void @truncstore_v4i64_v4i8(<4 x i64> %x, <4 x i8>* %p, <4 x i32> %mask) {
> >  ; SSE2-LABEL: truncstore_v4i64_v4i8:
> >  ; SSE2:       # %bb.0:
> > -; SSE2-NEXT:    pxor %xmm3, %xmm3
> > +; SSE2-NEXT:    pxor %xmm9, %xmm9
> >  ; SSE2-NEXT:    movdqa {{.*#+}} xmm8 = [255,255]
> > -; SSE2-NEXT:    movdqa {{.*#+}} xmm5 = [9223372039002259456,9223372039002259456]
> > -; SSE2-NEXT:    movdqa %xmm1, %xmm6
> > -; SSE2-NEXT:    pxor %xmm5, %xmm6
> > -; SSE2-NEXT:    movdqa {{.*#+}} xmm9 = [9223372039002259711,9223372039002259711]
> > -; SSE2-NEXT:    movdqa %xmm9, %xmm7
> > -; SSE2-NEXT:    pcmpgtd %xmm6, %xmm7
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm4 = xmm7[0,0,2,2]
> > -; SSE2-NEXT:    pcmpeqd %xmm9, %xmm6
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm6 = xmm6[1,1,3,3]
> > -; SSE2-NEXT:    pand %xmm4, %xmm6
> > +; SSE2-NEXT:    movdqa {{.*#+}} xmm6 = [9223372039002259456,9223372039002259456]
> > +; SSE2-NEXT:    movdqa %xmm0, %xmm4
> > +; SSE2-NEXT:    pxor %xmm6, %xmm4
> > +; SSE2-NEXT:    movdqa {{.*#+}} xmm10 = [9223372039002259711,9223372039002259711]
> > +; SSE2-NEXT:    movdqa %xmm10, %xmm7
> > +; SSE2-NEXT:    pcmpgtd %xmm4, %xmm7
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm7[0,0,2,2]
> > +; SSE2-NEXT:    pcmpeqd %xmm10, %xmm4
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm5 = xmm4[1,1,3,3]
> > +; SSE2-NEXT:    pand %xmm3, %xmm5
> >  ; SSE2-NEXT:    pshufd {{.*#+}} xmm4 = xmm7[1,1,3,3]
> > -; SSE2-NEXT:    por %xmm6, %xmm4
> > -; SSE2-NEXT:    pand %xmm4, %xmm1
> > +; SSE2-NEXT:    por %xmm5, %xmm4
> > +; SSE2-NEXT:    pand %xmm4, %xmm0
> >  ; SSE2-NEXT:    pandn %xmm8, %xmm4
> > -; SSE2-NEXT:    por %xmm1, %xmm4
> > -; SSE2-NEXT:    pxor %xmm0, %xmm5
> > -; SSE2-NEXT:    movdqa %xmm9, %xmm1
> > -; SSE2-NEXT:    pcmpgtd %xmm5, %xmm1
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm6 = xmm1[0,0,2,2]
> > -; SSE2-NEXT:    pcmpeqd %xmm9, %xmm5
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm5 = xmm5[1,1,3,3]
> > -; SSE2-NEXT:    pand %xmm6, %xmm5
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
> > -; SSE2-NEXT:    por %xmm5, %xmm1
> > -; SSE2-NEXT:    pand %xmm1, %xmm0
> > -; SSE2-NEXT:    pandn %xmm8, %xmm1
> > -; SSE2-NEXT:    por %xmm0, %xmm1
> > -; SSE2-NEXT:    packuswb %xmm4, %xmm1
> > -; SSE2-NEXT:    pcmpeqd %xmm2, %xmm3
> > -; SSE2-NEXT:    movmskps %xmm3, %eax
> > -; SSE2-NEXT:    xorl $15, %eax
> > -; SSE2-NEXT:    testb $1, %al
> > +; SSE2-NEXT:    por %xmm0, %xmm4
> > +; SSE2-NEXT:    pxor %xmm1, %xmm6
> > +; SSE2-NEXT:    movdqa %xmm10, %xmm0
> > +; SSE2-NEXT:    pcmpgtd %xmm6, %xmm0
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm0[0,0,2,2]
> > +; SSE2-NEXT:    pcmpeqd %xmm10, %xmm6
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm5 = xmm6[1,1,3,3]
> > +; SSE2-NEXT:    pand %xmm3, %xmm5
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,1,3,3]
> > +; SSE2-NEXT:    por %xmm5, %xmm0
> > +; SSE2-NEXT:    pand %xmm0, %xmm1
> > +; SSE2-NEXT:    pandn %xmm8, %xmm0
> > +; SSE2-NEXT:    por %xmm1, %xmm0
> > +; SSE2-NEXT:    pand %xmm8, %xmm0
> > +; SSE2-NEXT:    pand %xmm8, %xmm4
> > +; SSE2-NEXT:    packuswb %xmm0, %xmm4
> > +; SSE2-NEXT:    packuswb %xmm4, %xmm4
> > +; SSE2-NEXT:    packuswb %xmm4, %xmm4
> > +; SSE2-NEXT:    pcmpeqd %xmm2, %xmm9
> > +; SSE2-NEXT:    movmskps %xmm9, %ecx
> > +; SSE2-NEXT:    xorl $15, %ecx
> > +; SSE2-NEXT:    testb $1, %cl
> > +; SSE2-NEXT:    movd %xmm4, %eax
> >  ; SSE2-NEXT:    jne .LBB5_1
> >  ; SSE2-NEXT:  # %bb.2: # %else
> > -; SSE2-NEXT:    testb $2, %al
> > +; SSE2-NEXT:    testb $2, %cl
> >  ; SSE2-NEXT:    jne .LBB5_3
> >  ; SSE2-NEXT:  .LBB5_4: # %else2
> > -; SSE2-NEXT:    testb $4, %al
> > +; SSE2-NEXT:    testb $4, %cl
> >  ; SSE2-NEXT:    jne .LBB5_5
> >  ; SSE2-NEXT:  .LBB5_6: # %else4
> > -; SSE2-NEXT:    testb $8, %al
> > +; SSE2-NEXT:    testb $8, %cl
> >  ; SSE2-NEXT:    jne .LBB5_7
> >  ; SSE2-NEXT:  .LBB5_8: # %else6
> >  ; SSE2-NEXT:    retq
> >  ; SSE2-NEXT:  .LBB5_1: # %cond.store
> > -; SSE2-NEXT:    movd %xmm1, %ecx
> > -; SSE2-NEXT:    movb %cl, (%rdi)
> > -; SSE2-NEXT:    testb $2, %al
> > +; SSE2-NEXT:    movb %al, (%rdi)
> > +; SSE2-NEXT:    testb $2, %cl
> >  ; SSE2-NEXT:    je .LBB5_4
> >  ; SSE2-NEXT:  .LBB5_3: # %cond.store1
> > -; SSE2-NEXT:    pextrw $2, %xmm1, %ecx
> > -; SSE2-NEXT:    movb %cl, 1(%rdi)
> > -; SSE2-NEXT:    testb $4, %al
> > +; SSE2-NEXT:    movb %ah, 1(%rdi)
> > +; SSE2-NEXT:    testb $4, %cl
> >  ; SSE2-NEXT:    je .LBB5_6
> >  ; SSE2-NEXT:  .LBB5_5: # %cond.store3
> > -; SSE2-NEXT:    pextrw $4, %xmm1, %ecx
> > -; SSE2-NEXT:    movb %cl, 2(%rdi)
> > -; SSE2-NEXT:    testb $8, %al
> > +; SSE2-NEXT:    movl %eax, %edx
> > +; SSE2-NEXT:    shrl $16, %edx
> > +; SSE2-NEXT:    movb %dl, 2(%rdi)
> > +; SSE2-NEXT:    testb $8, %cl
> >  ; SSE2-NEXT:    je .LBB5_8
> >  ; SSE2-NEXT:  .LBB5_7: # %cond.store5
> > -; SSE2-NEXT:    pextrw $6, %xmm1, %eax
> > +; SSE2-NEXT:    shrl $24, %eax
> >  ; SSE2-NEXT:    movb %al, 3(%rdi)
> >  ; SSE2-NEXT:    retq
> >  ;
> >  ; SSE4-LABEL: truncstore_v4i64_v4i8:
> >  ; SSE4:       # %bb.0:
> > -; SSE4-NEXT:    movdqa %xmm0, %xmm8
> > -; SSE4-NEXT:    pxor %xmm6, %xmm6
> > -; SSE4-NEXT:    movapd {{.*#+}} xmm5 = [255,255]
> > -; SSE4-NEXT:    movdqa {{.*#+}} xmm7 = [9223372036854775808,9223372036854775808]
> > -; SSE4-NEXT:    movdqa %xmm1, %xmm3
> > -; SSE4-NEXT:    pxor %xmm7, %xmm3
> > +; SSE4-NEXT:    movdqa %xmm0, %xmm3
> > +; SSE4-NEXT:    pxor %xmm8, %xmm8
> > +; SSE4-NEXT:    movapd {{.*#+}} xmm7 = [255,255]
> > +; SSE4-NEXT:    movdqa {{.*#+}} xmm6 = [9223372036854775808,9223372036854775808]
> > +; SSE4-NEXT:    movdqa %xmm0, %xmm5
> > +; SSE4-NEXT:    pxor %xmm6, %xmm5
> >  ; SSE4-NEXT:    movdqa {{.*#+}} xmm4 = [9223372036854776063,9223372036854776063]
> >  ; SSE4-NEXT:    movdqa %xmm4, %xmm0
> > -; SSE4-NEXT:    pcmpgtq %xmm3, %xmm0
> > -; SSE4-NEXT:    movapd %xmm5, %xmm3
> > -; SSE4-NEXT:    blendvpd %xmm0, %xmm1, %xmm3
> > -; SSE4-NEXT:    pxor %xmm8, %xmm7
> > -; SSE4-NEXT:    pcmpgtq %xmm7, %xmm4
> > +; SSE4-NEXT:    pcmpgtq %xmm5, %xmm0
> > +; SSE4-NEXT:    movapd %xmm7, %xmm5
> > +; SSE4-NEXT:    blendvpd %xmm0, %xmm3, %xmm5
> > +; SSE4-NEXT:    pxor %xmm1, %xmm6
> > +; SSE4-NEXT:    pcmpgtq %xmm6, %xmm4
> >  ; SSE4-NEXT:    movdqa %xmm4, %xmm0
> > -; SSE4-NEXT:    blendvpd %xmm0, %xmm8, %xmm5
> > -; SSE4-NEXT:    packusdw %xmm3, %xmm5
> > -; SSE4-NEXT:    pcmpeqd %xmm2, %xmm6
> > -; SSE4-NEXT:    movmskps %xmm6, %eax
> > +; SSE4-NEXT:    blendvpd %xmm0, %xmm1, %xmm7
> > +; SSE4-NEXT:    movdqa {{.*#+}} xmm0 = <0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u>
> > +; SSE4-NEXT:    pshufb %xmm0, %xmm7
> > +; SSE4-NEXT:    pshufb %xmm0, %xmm5
> > +; SSE4-NEXT:    punpcklwd {{.*#+}} xmm5 = xmm5[0],xmm7[0],xmm5[1],xmm7[1],xmm5[2],xmm7[2],xmm5[3],xmm7[3]
> > +; SSE4-NEXT:    pcmpeqd %xmm2, %xmm8
> > +; SSE4-NEXT:    movmskps %xmm8, %eax
> >  ; SSE4-NEXT:    xorl $15, %eax
> >  ; SSE4-NEXT:    testb $1, %al
> >  ; SSE4-NEXT:    jne .LBB5_1
> > @@ -1888,15 +1913,15 @@ define void @truncstore_v4i64_v4i8(<4 x
> >  ; SSE4-NEXT:    testb $2, %al
> >  ; SSE4-NEXT:    je .LBB5_4
> >  ; SSE4-NEXT:  .LBB5_3: # %cond.store1
> > -; SSE4-NEXT:    pextrb $4, %xmm5, 1(%rdi)
> > +; SSE4-NEXT:    pextrb $1, %xmm5, 1(%rdi)
> >  ; SSE4-NEXT:    testb $4, %al
> >  ; SSE4-NEXT:    je .LBB5_6
> >  ; SSE4-NEXT:  .LBB5_5: # %cond.store3
> > -; SSE4-NEXT:    pextrb $8, %xmm5, 2(%rdi)
> > +; SSE4-NEXT:    pextrb $2, %xmm5, 2(%rdi)
> >  ; SSE4-NEXT:    testb $8, %al
> >  ; SSE4-NEXT:    je .LBB5_8
> >  ; SSE4-NEXT:  .LBB5_7: # %cond.store5
> > -; SSE4-NEXT:    pextrb $12, %xmm5, 3(%rdi)
> > +; SSE4-NEXT:    pextrb $3, %xmm5, 3(%rdi)
> >  ; SSE4-NEXT:    retq
> >  ;
> >  ; AVX1-LABEL: truncstore_v4i64_v4i8:
> > @@ -1911,8 +1936,11 @@ define void @truncstore_v4i64_v4i8(<4 x
> >  ; AVX1-NEXT:    vpcmpgtq %xmm3, %xmm5, %xmm3
> >  ; AVX1-NEXT:    vmovapd {{.*#+}} xmm5 = [255,255]
> >  ; AVX1-NEXT:    vblendvpd %xmm3, %xmm6, %xmm5, %xmm3
> > +; AVX1-NEXT:    vmovdqa {{.*#+}} xmm6 = <0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u>
> > +; AVX1-NEXT:    vpshufb %xmm6, %xmm3, %xmm3
> >  ; AVX1-NEXT:    vblendvpd %xmm4, %xmm0, %xmm5, %xmm0
> > -; AVX1-NEXT:    vpackusdw %xmm3, %xmm0, %xmm0
> > +; AVX1-NEXT:    vpshufb %xmm6, %xmm0, %xmm0
> > +; AVX1-NEXT:    vpunpcklwd {{.*#+}} xmm0 = xmm0[0],xmm3[0],xmm0[1],xmm3[1],xmm0[2],xmm3[2],xmm0[3],xmm3[3]
> >  ; AVX1-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
> >  ; AVX1-NEXT:    vmovmskps %xmm1, %eax
> >  ; AVX1-NEXT:    xorl $15, %eax
> > @@ -1935,15 +1963,15 @@ define void @truncstore_v4i64_v4i8(<4 x
> >  ; AVX1-NEXT:    testb $2, %al
> >  ; AVX1-NEXT:    je .LBB5_4
> >  ; AVX1-NEXT:  .LBB5_3: # %cond.store1
> > -; AVX1-NEXT:    vpextrb $4, %xmm0, 1(%rdi)
> > +; AVX1-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX1-NEXT:    testb $4, %al
> >  ; AVX1-NEXT:    je .LBB5_6
> >  ; AVX1-NEXT:  .LBB5_5: # %cond.store3
> > -; AVX1-NEXT:    vpextrb $8, %xmm0, 2(%rdi)
> > +; AVX1-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> >  ; AVX1-NEXT:    testb $8, %al
> >  ; AVX1-NEXT:    je .LBB5_8
> >  ; AVX1-NEXT:  .LBB5_7: # %cond.store5
> > -; AVX1-NEXT:    vpextrb $12, %xmm0, 3(%rdi)
> > +; AVX1-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> >  ; AVX1-NEXT:    vzeroupper
> >  ; AVX1-NEXT:    retq
> >  ;
> > @@ -1957,7 +1985,10 @@ define void @truncstore_v4i64_v4i8(<4 x
> >  ; AVX2-NEXT:    vpcmpgtq %ymm4, %ymm5, %ymm4
> >  ; AVX2-NEXT:    vblendvpd %ymm4, %ymm0, %ymm3, %ymm0
> >  ; AVX2-NEXT:    vextractf128 $1, %ymm0, %xmm3
> > -; AVX2-NEXT:    vpackusdw %xmm3, %xmm0, %xmm0
> > +; AVX2-NEXT:    vmovdqa {{.*#+}} xmm4 = <0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u>
> > +; AVX2-NEXT:    vpshufb %xmm4, %xmm3, %xmm3
> > +; AVX2-NEXT:    vpshufb %xmm4, %xmm0, %xmm0
> > +; AVX2-NEXT:    vpunpcklwd {{.*#+}} xmm0 = xmm0[0],xmm3[0],xmm0[1],xmm3[1],xmm0[2],xmm3[2],xmm0[3],xmm3[3]
> >  ; AVX2-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
> >  ; AVX2-NEXT:    vmovmskps %xmm1, %eax
> >  ; AVX2-NEXT:    xorl $15, %eax
> > @@ -1980,15 +2011,15 @@ define void @truncstore_v4i64_v4i8(<4 x
> >  ; AVX2-NEXT:    testb $2, %al
> >  ; AVX2-NEXT:    je .LBB5_4
> >  ; AVX2-NEXT:  .LBB5_3: # %cond.store1
> > -; AVX2-NEXT:    vpextrb $4, %xmm0, 1(%rdi)
> > +; AVX2-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX2-NEXT:    testb $4, %al
> >  ; AVX2-NEXT:    je .LBB5_6
> >  ; AVX2-NEXT:  .LBB5_5: # %cond.store3
> > -; AVX2-NEXT:    vpextrb $8, %xmm0, 2(%rdi)
> > +; AVX2-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> >  ; AVX2-NEXT:    testb $8, %al
> >  ; AVX2-NEXT:    je .LBB5_8
> >  ; AVX2-NEXT:  .LBB5_7: # %cond.store5
> > -; AVX2-NEXT:    vpextrb $12, %xmm0, 3(%rdi)
> > +; AVX2-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> >  ; AVX2-NEXT:    vzeroupper
> >  ; AVX2-NEXT:    retq
> >  ;
> > @@ -1999,7 +2030,7 @@ define void @truncstore_v4i64_v4i8(<4 x
> >  ; AVX512F-NEXT:    vptestmd %zmm1, %zmm1, %k0
> >  ; AVX512F-NEXT:    vpbroadcastq {{.*#+}} ymm1 = [255,255,255,255]
> >  ; AVX512F-NEXT:    vpminuq %zmm1, %zmm0, %zmm0
> > -; AVX512F-NEXT:    vpmovqd %zmm0, %ymm0
> > +; AVX512F-NEXT:    vpmovqb %zmm0, %xmm0
> >  ; AVX512F-NEXT:    kmovw %k0, %eax
> >  ; AVX512F-NEXT:    testb $1, %al
> >  ; AVX512F-NEXT:    jne .LBB5_1
> > @@ -2020,15 +2051,15 @@ define void @truncstore_v4i64_v4i8(<4 x
> >  ; AVX512F-NEXT:    testb $2, %al
> >  ; AVX512F-NEXT:    je .LBB5_4
> >  ; AVX512F-NEXT:  .LBB5_3: # %cond.store1
> > -; AVX512F-NEXT:    vpextrb $4, %xmm0, 1(%rdi)
> > +; AVX512F-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX512F-NEXT:    testb $4, %al
> >  ; AVX512F-NEXT:    je .LBB5_6
> >  ; AVX512F-NEXT:  .LBB5_5: # %cond.store3
> > -; AVX512F-NEXT:    vpextrb $8, %xmm0, 2(%rdi)
> > +; AVX512F-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> >  ; AVX512F-NEXT:    testb $8, %al
> >  ; AVX512F-NEXT:    je .LBB5_8
> >  ; AVX512F-NEXT:  .LBB5_7: # %cond.store5
> > -; AVX512F-NEXT:    vpextrb $12, %xmm0, 3(%rdi)
> > +; AVX512F-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> >  ; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> >  ;
> > @@ -2037,12 +2068,11 @@ define void @truncstore_v4i64_v4i8(<4 x
> >  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> >  ; AVX512BW-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
> >  ; AVX512BW-NEXT:    vptestmd %zmm1, %zmm1, %k0
> > -; AVX512BW-NEXT:    vpbroadcastq {{.*#+}} ymm1 = [255,255,255,255]
> > -; AVX512BW-NEXT:    vpminuq %zmm1, %zmm0, %zmm0
> > -; AVX512BW-NEXT:    vpmovqd %zmm0, %ymm0
> > -; AVX512BW-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
> >  ; AVX512BW-NEXT:    kshiftlq $60, %k0, %k0
> >  ; AVX512BW-NEXT:    kshiftrq $60, %k0, %k1
> > +; AVX512BW-NEXT:    vpbroadcastq {{.*#+}} ymm1 = [255,255,255,255]
> > +; AVX512BW-NEXT:    vpminuq %zmm1, %zmm0, %zmm0
> > +; AVX512BW-NEXT:    vpmovqb %zmm0, %xmm0
> >  ; AVX512BW-NEXT:    vmovdqu8 %zmm0, (%rdi) {%k1}
> >  ; AVX512BW-NEXT:    vzeroupper
> >  ; AVX512BW-NEXT:    retq
> > @@ -2065,25 +2095,26 @@ define void @truncstore_v4i64_v4i8(<4 x
> >  define void @truncstore_v2i64_v2i32(<2 x i64> %x, <2 x i32>* %p, <2 x i64> %mask) {
> >  ; SSE2-LABEL: truncstore_v2i64_v2i32:
> >  ; SSE2:       # %bb.0:
> > -; SSE2-NEXT:    pxor %xmm3, %xmm3
> > -; SSE2-NEXT:    movdqa {{.*#+}} xmm2 = [9223372039002259456,9223372039002259456]
> > -; SSE2-NEXT:    pxor %xmm0, %xmm2
> > +; SSE2-NEXT:    pxor %xmm2, %xmm2
> > +; SSE2-NEXT:    movdqa {{.*#+}} xmm3 = [9223372039002259456,9223372039002259456]
> > +; SSE2-NEXT:    pxor %xmm0, %xmm3
> >  ; SSE2-NEXT:    movdqa {{.*#+}} xmm4 = [9223372039002259455,9223372039002259455]
> >  ; SSE2-NEXT:    movdqa %xmm4, %xmm5
> > -; SSE2-NEXT:    pcmpgtd %xmm2, %xmm5
> > +; SSE2-NEXT:    pcmpgtd %xmm3, %xmm5
> >  ; SSE2-NEXT:    pshufd {{.*#+}} xmm6 = xmm5[0,0,2,2]
> > -; SSE2-NEXT:    pcmpeqd %xmm4, %xmm2
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm4 = xmm2[1,1,3,3]
> > -; SSE2-NEXT:    pand %xmm6, %xmm4
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm2 = xmm5[1,1,3,3]
> > -; SSE2-NEXT:    por %xmm4, %xmm2
> > -; SSE2-NEXT:    pand %xmm2, %xmm0
> > -; SSE2-NEXT:    pandn {{.*}}(%rip), %xmm2
> > -; SSE2-NEXT:    por %xmm0, %xmm2
> > -; SSE2-NEXT:    pcmpeqd %xmm1, %xmm3
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm3[1,0,3,2]
> > -; SSE2-NEXT:    pand %xmm3, %xmm0
> > -; SSE2-NEXT:    movmskpd %xmm0, %eax
> > +; SSE2-NEXT:    pcmpeqd %xmm4, %xmm3
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm3[1,1,3,3]
> > +; SSE2-NEXT:    pand %xmm6, %xmm3
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm4 = xmm5[1,1,3,3]
> > +; SSE2-NEXT:    por %xmm3, %xmm4
> > +; SSE2-NEXT:    pand %xmm4, %xmm0
> > +; SSE2-NEXT:    pandn {{.*}}(%rip), %xmm4
> > +; SSE2-NEXT:    por %xmm0, %xmm4
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm4[0,2,2,3]
> > +; SSE2-NEXT:    pcmpeqd %xmm1, %xmm2
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm2[1,0,3,2]
> > +; SSE2-NEXT:    pand %xmm2, %xmm1
> > +; SSE2-NEXT:    movmskpd %xmm1, %eax
> >  ; SSE2-NEXT:    xorl $3, %eax
> >  ; SSE2-NEXT:    testb $1, %al
> >  ; SSE2-NEXT:    jne .LBB6_1
> > @@ -2093,26 +2124,27 @@ define void @truncstore_v2i64_v2i32(<2 x
> >  ; SSE2-NEXT:  .LBB6_4: # %else2
> >  ; SSE2-NEXT:    retq
> >  ; SSE2-NEXT:  .LBB6_1: # %cond.store
> > -; SSE2-NEXT:    movd %xmm2, (%rdi)
> > +; SSE2-NEXT:    movd %xmm0, (%rdi)
> >  ; SSE2-NEXT:    testb $2, %al
> >  ; SSE2-NEXT:    je .LBB6_4
> >  ; SSE2-NEXT:  .LBB6_3: # %cond.store1
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm2[2,3,0,1]
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,1,2,3]
> >  ; SSE2-NEXT:    movd %xmm0, 4(%rdi)
> >  ; SSE2-NEXT:    retq
> >  ;
> >  ; SSE4-LABEL: truncstore_v2i64_v2i32:
> >  ; SSE4:       # %bb.0:
> >  ; SSE4-NEXT:    movdqa %xmm0, %xmm2
> > -; SSE4-NEXT:    pxor %xmm4, %xmm4
> > -; SSE4-NEXT:    movapd {{.*#+}} xmm3 = [4294967295,4294967295]
> > +; SSE4-NEXT:    pxor %xmm3, %xmm3
> > +; SSE4-NEXT:    movapd {{.*#+}} xmm4 = [4294967295,4294967295]
> >  ; SSE4-NEXT:    movdqa {{.*#+}} xmm5 = [9223372036854775808,9223372036854775808]
> >  ; SSE4-NEXT:    pxor %xmm0, %xmm5
> >  ; SSE4-NEXT:    movdqa {{.*#+}} xmm0 = [9223372041149743103,9223372041149743103]
> >  ; SSE4-NEXT:    pcmpgtq %xmm5, %xmm0
> > -; SSE4-NEXT:    blendvpd %xmm0, %xmm2, %xmm3
> > -; SSE4-NEXT:    pcmpeqq %xmm1, %xmm4
> > -; SSE4-NEXT:    movmskpd %xmm4, %eax
> > +; SSE4-NEXT:    blendvpd %xmm0, %xmm2, %xmm4
> > +; SSE4-NEXT:    pshufd {{.*#+}} xmm0 = xmm4[0,2,2,3]
> > +; SSE4-NEXT:    pcmpeqq %xmm1, %xmm3
> > +; SSE4-NEXT:    movmskpd %xmm3, %eax
> >  ; SSE4-NEXT:    xorl $3, %eax
> >  ; SSE4-NEXT:    testb $1, %al
> >  ; SSE4-NEXT:    jne .LBB6_1
> > @@ -2122,11 +2154,11 @@ define void @truncstore_v2i64_v2i32(<2 x
> >  ; SSE4-NEXT:  .LBB6_4: # %else2
> >  ; SSE4-NEXT:    retq
> >  ; SSE4-NEXT:  .LBB6_1: # %cond.store
> > -; SSE4-NEXT:    movss %xmm3, (%rdi)
> > +; SSE4-NEXT:    movd %xmm0, (%rdi)
> >  ; SSE4-NEXT:    testb $2, %al
> >  ; SSE4-NEXT:    je .LBB6_4
> >  ; SSE4-NEXT:  .LBB6_3: # %cond.store1
> > -; SSE4-NEXT:    extractps $2, %xmm3, 4(%rdi)
> > +; SSE4-NEXT:    pextrd $1, %xmm0, 4(%rdi)
> >  ; SSE4-NEXT:    retq
> >  ;
> >  ; AVX1-LABEL: truncstore_v2i64_v2i32:
> > @@ -2135,12 +2167,12 @@ define void @truncstore_v2i64_v2i32(<2 x
> >  ; AVX1-NEXT:    vpcmpeqq %xmm2, %xmm1, %xmm1
> >  ; AVX1-NEXT:    vpcmpeqd %xmm2, %xmm2, %xmm2
> >  ; AVX1-NEXT:    vpxor %xmm2, %xmm1, %xmm1
> > +; AVX1-NEXT:    vinsertps {{.*#+}} xmm1 = xmm1[0,2],zero,zero
> >  ; AVX1-NEXT:    vmovapd {{.*#+}} xmm2 = [4294967295,4294967295]
> >  ; AVX1-NEXT:    vpxor {{.*}}(%rip), %xmm0, %xmm3
> >  ; AVX1-NEXT:    vmovdqa {{.*#+}} xmm4 = [9223372041149743103,9223372041149743103]
> >  ; AVX1-NEXT:    vpcmpgtq %xmm3, %xmm4, %xmm3
> >  ; AVX1-NEXT:    vblendvpd %xmm3, %xmm0, %xmm2, %xmm0
> > -; AVX1-NEXT:    vinsertps {{.*#+}} xmm1 = xmm1[0,2],zero,zero
> >  ; AVX1-NEXT:    vpermilps {{.*#+}} xmm0 = xmm0[0,2,2,3]
> >  ; AVX1-NEXT:    vmaskmovps %xmm0, %xmm1, (%rdi)
> >  ; AVX1-NEXT:    retq
> > @@ -2151,12 +2183,12 @@ define void @truncstore_v2i64_v2i32(<2 x
> >  ; AVX2-NEXT:    vpcmpeqq %xmm2, %xmm1, %xmm1
> >  ; AVX2-NEXT:    vpcmpeqd %xmm2, %xmm2, %xmm2
> >  ; AVX2-NEXT:    vpxor %xmm2, %xmm1, %xmm1
> > +; AVX2-NEXT:    vinsertps {{.*#+}} xmm1 = xmm1[0,2],zero,zero
> >  ; AVX2-NEXT:    vmovapd {{.*#+}} xmm2 = [4294967295,4294967295]
> >  ; AVX2-NEXT:    vpxor {{.*}}(%rip), %xmm0, %xmm3
> >  ; AVX2-NEXT:    vmovdqa {{.*#+}} xmm4 = [9223372041149743103,9223372041149743103]
> >  ; AVX2-NEXT:    vpcmpgtq %xmm3, %xmm4, %xmm3
> >  ; AVX2-NEXT:    vblendvpd %xmm3, %xmm0, %xmm2, %xmm0
> > -; AVX2-NEXT:    vinsertps {{.*#+}} xmm1 = xmm1[0,2],zero,zero
> >  ; AVX2-NEXT:    vpermilps {{.*#+}} xmm0 = xmm0[0,2,2,3]
> >  ; AVX2-NEXT:    vpmaskmovd %xmm0, %xmm1, (%rdi)
> >  ; AVX2-NEXT:    retq
> > @@ -2166,11 +2198,11 @@ define void @truncstore_v2i64_v2i32(<2 x
> >  ; AVX512F-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> >  ; AVX512F-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
> >  ; AVX512F-NEXT:    vptestmq %zmm1, %zmm1, %k0
> > +; AVX512F-NEXT:    kshiftlw $14, %k0, %k0
> > +; AVX512F-NEXT:    kshiftrw $14, %k0, %k1
> >  ; AVX512F-NEXT:    vmovdqa {{.*#+}} xmm1 = [4294967295,4294967295]
> >  ; AVX512F-NEXT:    vpminuq %zmm1, %zmm0, %zmm0
> >  ; AVX512F-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > -; AVX512F-NEXT:    kshiftlw $14, %k0, %k0
> > -; AVX512F-NEXT:    kshiftrw $14, %k0, %k1
> >  ; AVX512F-NEXT:    vmovdqu32 %zmm0, (%rdi) {%k1}
> >  ; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> > @@ -2187,11 +2219,11 @@ define void @truncstore_v2i64_v2i32(<2 x
> >  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> >  ; AVX512BW-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
> >  ; AVX512BW-NEXT:    vptestmq %zmm1, %zmm1, %k0
> > +; AVX512BW-NEXT:    kshiftlw $14, %k0, %k0
> > +; AVX512BW-NEXT:    kshiftrw $14, %k0, %k1
> >  ; AVX512BW-NEXT:    vmovdqa {{.*#+}} xmm1 = [4294967295,4294967295]
> >  ; AVX512BW-NEXT:    vpminuq %zmm1, %zmm0, %zmm0
> >  ; AVX512BW-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > -; AVX512BW-NEXT:    kshiftlw $14, %k0, %k0
> > -; AVX512BW-NEXT:    kshiftrw $14, %k0, %k1
> >  ; AVX512BW-NEXT:    vmovdqu32 %zmm0, (%rdi) {%k1}
> >  ; AVX512BW-NEXT:    vzeroupper
> >  ; AVX512BW-NEXT:    retq
> > @@ -2206,25 +2238,27 @@ define void @truncstore_v2i64_v2i32(<2 x
> >  define void @truncstore_v2i64_v2i16(<2 x i64> %x, <2 x i16>* %p, <2 x i64> %mask) {
> >  ; SSE2-LABEL: truncstore_v2i64_v2i16:
> >  ; SSE2:       # %bb.0:
> > -; SSE2-NEXT:    pxor %xmm3, %xmm3
> > -; SSE2-NEXT:    movdqa {{.*#+}} xmm2 = [9223372039002259456,9223372039002259456]
> > -; SSE2-NEXT:    pxor %xmm0, %xmm2
> > +; SSE2-NEXT:    pxor %xmm2, %xmm2
> > +; SSE2-NEXT:    movdqa {{.*#+}} xmm3 = [9223372039002259456,9223372039002259456]
> > +; SSE2-NEXT:    pxor %xmm0, %xmm3
> >  ; SSE2-NEXT:    movdqa {{.*#+}} xmm4 = [9223372039002324991,9223372039002324991]
> >  ; SSE2-NEXT:    movdqa %xmm4, %xmm5
> > -; SSE2-NEXT:    pcmpgtd %xmm2, %xmm5
> > +; SSE2-NEXT:    pcmpgtd %xmm3, %xmm5
> >  ; SSE2-NEXT:    pshufd {{.*#+}} xmm6 = xmm5[0,0,2,2]
> > -; SSE2-NEXT:    pcmpeqd %xmm4, %xmm2
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm4 = xmm2[1,1,3,3]
> > -; SSE2-NEXT:    pand %xmm6, %xmm4
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm2 = xmm5[1,1,3,3]
> > -; SSE2-NEXT:    por %xmm4, %xmm2
> > -; SSE2-NEXT:    pand %xmm2, %xmm0
> > -; SSE2-NEXT:    pandn {{.*}}(%rip), %xmm2
> > -; SSE2-NEXT:    por %xmm0, %xmm2
> > -; SSE2-NEXT:    pcmpeqd %xmm1, %xmm3
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm3[1,0,3,2]
> > -; SSE2-NEXT:    pand %xmm3, %xmm0
> > -; SSE2-NEXT:    movmskpd %xmm0, %eax
> > +; SSE2-NEXT:    pcmpeqd %xmm4, %xmm3
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm3[1,1,3,3]
> > +; SSE2-NEXT:    pand %xmm6, %xmm3
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm4 = xmm5[1,1,3,3]
> > +; SSE2-NEXT:    por %xmm3, %xmm4
> > +; SSE2-NEXT:    pand %xmm4, %xmm0
> > +; SSE2-NEXT:    pandn {{.*}}(%rip), %xmm4
> > +; SSE2-NEXT:    por %xmm0, %xmm4
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm4[0,2,2,3]
> > +; SSE2-NEXT:    pshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> > +; SSE2-NEXT:    pcmpeqd %xmm1, %xmm2
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm2[1,0,3,2]
> > +; SSE2-NEXT:    pand %xmm2, %xmm1
> > +; SSE2-NEXT:    movmskpd %xmm1, %eax
> >  ; SSE2-NEXT:    xorl $3, %eax
> >  ; SSE2-NEXT:    testb $1, %al
> >  ; SSE2-NEXT:    jne .LBB7_1
> > @@ -2234,27 +2268,29 @@ define void @truncstore_v2i64_v2i16(<2 x
> >  ; SSE2-NEXT:  .LBB7_4: # %else2
> >  ; SSE2-NEXT:    retq
> >  ; SSE2-NEXT:  .LBB7_1: # %cond.store
> > -; SSE2-NEXT:    movd %xmm2, %ecx
> > +; SSE2-NEXT:    movd %xmm0, %ecx
> >  ; SSE2-NEXT:    movw %cx, (%rdi)
> >  ; SSE2-NEXT:    testb $2, %al
> >  ; SSE2-NEXT:    je .LBB7_4
> >  ; SSE2-NEXT:  .LBB7_3: # %cond.store1
> > -; SSE2-NEXT:    pextrw $4, %xmm2, %eax
> > +; SSE2-NEXT:    pextrw $1, %xmm0, %eax
> >  ; SSE2-NEXT:    movw %ax, 2(%rdi)
> >  ; SSE2-NEXT:    retq
> >  ;
> >  ; SSE4-LABEL: truncstore_v2i64_v2i16:
> >  ; SSE4:       # %bb.0:
> >  ; SSE4-NEXT:    movdqa %xmm0, %xmm2
> > -; SSE4-NEXT:    pxor %xmm4, %xmm4
> > -; SSE4-NEXT:    movapd {{.*#+}} xmm3 = [65535,65535]
> > +; SSE4-NEXT:    pxor %xmm3, %xmm3
> > +; SSE4-NEXT:    movapd {{.*#+}} xmm4 = [65535,65535]
> >  ; SSE4-NEXT:    movdqa {{.*#+}} xmm5 = [9223372036854775808,9223372036854775808]
> >  ; SSE4-NEXT:    pxor %xmm0, %xmm5
> >  ; SSE4-NEXT:    movdqa {{.*#+}} xmm0 = [9223372036854841343,9223372036854841343]
> >  ; SSE4-NEXT:    pcmpgtq %xmm5, %xmm0
> > -; SSE4-NEXT:    blendvpd %xmm0, %xmm2, %xmm3
> > -; SSE4-NEXT:    pcmpeqq %xmm1, %xmm4
> > -; SSE4-NEXT:    movmskpd %xmm4, %eax
> > +; SSE4-NEXT:    blendvpd %xmm0, %xmm2, %xmm4
> > +; SSE4-NEXT:    pshufd {{.*#+}} xmm0 = xmm4[0,2,2,3]
> > +; SSE4-NEXT:    pshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> > +; SSE4-NEXT:    pcmpeqq %xmm1, %xmm3
> > +; SSE4-NEXT:    movmskpd %xmm3, %eax
> >  ; SSE4-NEXT:    xorl $3, %eax
> >  ; SSE4-NEXT:    testb $1, %al
> >  ; SSE4-NEXT:    jne .LBB7_1
> > @@ -2264,11 +2300,11 @@ define void @truncstore_v2i64_v2i16(<2 x
> >  ; SSE4-NEXT:  .LBB7_4: # %else2
> >  ; SSE4-NEXT:    retq
> >  ; SSE4-NEXT:  .LBB7_1: # %cond.store
> > -; SSE4-NEXT:    pextrw $0, %xmm3, (%rdi)
> > +; SSE4-NEXT:    pextrw $0, %xmm0, (%rdi)
> >  ; SSE4-NEXT:    testb $2, %al
> >  ; SSE4-NEXT:    je .LBB7_4
> >  ; SSE4-NEXT:  .LBB7_3: # %cond.store1
> > -; SSE4-NEXT:    pextrw $4, %xmm3, 2(%rdi)
> > +; SSE4-NEXT:    pextrw $1, %xmm0, 2(%rdi)
> >  ; SSE4-NEXT:    retq
> >  ;
> >  ; AVX-LABEL: truncstore_v2i64_v2i16:
> > @@ -2279,6 +2315,8 @@ define void @truncstore_v2i64_v2i16(<2 x
> >  ; AVX-NEXT:    vmovdqa {{.*#+}} xmm5 = [9223372036854841343,9223372036854841343]
> >  ; AVX-NEXT:    vpcmpgtq %xmm4, %xmm5, %xmm4
> >  ; AVX-NEXT:    vblendvpd %xmm4, %xmm0, %xmm3, %xmm0
> > +; AVX-NEXT:    vpermilps {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > +; AVX-NEXT:    vpshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> >  ; AVX-NEXT:    vpcmpeqq %xmm2, %xmm1, %xmm1
> >  ; AVX-NEXT:    vmovmskpd %xmm1, %eax
> >  ; AVX-NEXT:    xorl $3, %eax
> > @@ -2294,7 +2332,7 @@ define void @truncstore_v2i64_v2i16(<2 x
> >  ; AVX-NEXT:    testb $2, %al
> >  ; AVX-NEXT:    je .LBB7_4
> >  ; AVX-NEXT:  .LBB7_3: # %cond.store1
> > -; AVX-NEXT:    vpextrw $4, %xmm0, 2(%rdi)
> > +; AVX-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
> >  ; AVX-NEXT:    retq
> >  ;
> >  ; AVX512F-LABEL: truncstore_v2i64_v2i16:
> > @@ -2304,6 +2342,8 @@ define void @truncstore_v2i64_v2i16(<2 x
> >  ; AVX512F-NEXT:    vptestmq %zmm1, %zmm1, %k0
> >  ; AVX512F-NEXT:    vmovdqa {{.*#+}} xmm1 = [65535,65535]
> >  ; AVX512F-NEXT:    vpminuq %zmm1, %zmm0, %zmm0
> > +; AVX512F-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > +; AVX512F-NEXT:    vpshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> >  ; AVX512F-NEXT:    kmovw %k0, %eax
> >  ; AVX512F-NEXT:    testb $1, %al
> >  ; AVX512F-NEXT:    jne .LBB7_1
> > @@ -2318,7 +2358,7 @@ define void @truncstore_v2i64_v2i16(<2 x
> >  ; AVX512F-NEXT:    testb $2, %al
> >  ; AVX512F-NEXT:    je .LBB7_4
> >  ; AVX512F-NEXT:  .LBB7_3: # %cond.store1
> > -; AVX512F-NEXT:    vpextrw $4, %xmm0, 2(%rdi)
> > +; AVX512F-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
> >  ; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> >  ;
> > @@ -2327,12 +2367,12 @@ define void @truncstore_v2i64_v2i16(<2 x
> >  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> >  ; AVX512BW-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
> >  ; AVX512BW-NEXT:    vptestmq %zmm1, %zmm1, %k0
> > +; AVX512BW-NEXT:    kshiftld $30, %k0, %k0
> > +; AVX512BW-NEXT:    kshiftrd $30, %k0, %k1
> >  ; AVX512BW-NEXT:    vmovdqa {{.*#+}} xmm1 = [65535,65535]
> >  ; AVX512BW-NEXT:    vpminuq %zmm1, %zmm0, %zmm0
> >  ; AVX512BW-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> >  ; AVX512BW-NEXT:    vpshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> > -; AVX512BW-NEXT:    kshiftld $30, %k0, %k0
> > -; AVX512BW-NEXT:    kshiftrd $30, %k0, %k1
> >  ; AVX512BW-NEXT:    vmovdqu16 %zmm0, (%rdi) {%k1}
> >  ; AVX512BW-NEXT:    vzeroupper
> >  ; AVX512BW-NEXT:    retq
> > @@ -2354,27 +2394,32 @@ define void @truncstore_v2i64_v2i16(<2 x
> >  define void @truncstore_v2i64_v2i8(<2 x i64> %x, <2 x i8>* %p, <2 x i64> %mask) {
> >  ; SSE2-LABEL: truncstore_v2i64_v2i8:
> >  ; SSE2:       # %bb.0:
> > -; SSE2-NEXT:    pxor %xmm3, %xmm3
> > -; SSE2-NEXT:    movdqa {{.*#+}} xmm2 = [9223372039002259456,9223372039002259456]
> > -; SSE2-NEXT:    pxor %xmm0, %xmm2
> > +; SSE2-NEXT:    pxor %xmm2, %xmm2
> > +; SSE2-NEXT:    movdqa {{.*#+}} xmm3 = [9223372039002259456,9223372039002259456]
> > +; SSE2-NEXT:    pxor %xmm0, %xmm3
> >  ; SSE2-NEXT:    movdqa {{.*#+}} xmm4 = [9223372039002259711,9223372039002259711]
> >  ; SSE2-NEXT:    movdqa %xmm4, %xmm5
> > -; SSE2-NEXT:    pcmpgtd %xmm2, %xmm5
> > +; SSE2-NEXT:    pcmpgtd %xmm3, %xmm5
> >  ; SSE2-NEXT:    pshufd {{.*#+}} xmm6 = xmm5[0,0,2,2]
> > -; SSE2-NEXT:    pcmpeqd %xmm4, %xmm2
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm4 = xmm2[1,1,3,3]
> > -; SSE2-NEXT:    pand %xmm6, %xmm4
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm2 = xmm5[1,1,3,3]
> > -; SSE2-NEXT:    por %xmm4, %xmm2
> > +; SSE2-NEXT:    pcmpeqd %xmm4, %xmm3
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm3[1,1,3,3]
> > +; SSE2-NEXT:    pand %xmm6, %xmm3
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm4 = xmm5[1,1,3,3]
> > +; SSE2-NEXT:    por %xmm3, %xmm4
> > +; SSE2-NEXT:    pand %xmm4, %xmm0
> > +; SSE2-NEXT:    pandn {{.*}}(%rip), %xmm4
> > +; SSE2-NEXT:    por %xmm0, %xmm4
> > +; SSE2-NEXT:    pand {{.*}}(%rip), %xmm4
> > +; SSE2-NEXT:    packuswb %xmm4, %xmm4
> > +; SSE2-NEXT:    packuswb %xmm4, %xmm4
> > +; SSE2-NEXT:    packuswb %xmm4, %xmm4
> > +; SSE2-NEXT:    pcmpeqd %xmm1, %xmm2
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm2[1,0,3,2]
> >  ; SSE2-NEXT:    pand %xmm2, %xmm0
> > -; SSE2-NEXT:    pandn {{.*}}(%rip), %xmm2
> > -; SSE2-NEXT:    por %xmm0, %xmm2
> > -; SSE2-NEXT:    pcmpeqd %xmm1, %xmm3
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm3[1,0,3,2]
> > -; SSE2-NEXT:    pand %xmm3, %xmm0
> >  ; SSE2-NEXT:    movmskpd %xmm0, %eax
> >  ; SSE2-NEXT:    xorl $3, %eax
> >  ; SSE2-NEXT:    testb $1, %al
> > +; SSE2-NEXT:    movd %xmm4, %ecx
> >  ; SSE2-NEXT:    jne .LBB8_1
> >  ; SSE2-NEXT:  # %bb.2: # %else
> >  ; SSE2-NEXT:    testb $2, %al
> > @@ -2382,13 +2427,11 @@ define void @truncstore_v2i64_v2i8(<2 x
> >  ; SSE2-NEXT:  .LBB8_4: # %else2
> >  ; SSE2-NEXT:    retq
> >  ; SSE2-NEXT:  .LBB8_1: # %cond.store
> > -; SSE2-NEXT:    movd %xmm2, %ecx
> >  ; SSE2-NEXT:    movb %cl, (%rdi)
> >  ; SSE2-NEXT:    testb $2, %al
> >  ; SSE2-NEXT:    je .LBB8_4
> >  ; SSE2-NEXT:  .LBB8_3: # %cond.store1
> > -; SSE2-NEXT:    pextrw $4, %xmm2, %eax
> > -; SSE2-NEXT:    movb %al, 1(%rdi)
> > +; SSE2-NEXT:    movb %ch, 1(%rdi)
> >  ; SSE2-NEXT:    retq
> >  ;
> >  ; SSE4-LABEL: truncstore_v2i64_v2i8:
> > @@ -2401,6 +2444,7 @@ define void @truncstore_v2i64_v2i8(<2 x
> >  ; SSE4-NEXT:    movdqa {{.*#+}} xmm0 = [9223372036854776063,9223372036854776063]
> >  ; SSE4-NEXT:    pcmpgtq %xmm5, %xmm0
> >  ; SSE4-NEXT:    blendvpd %xmm0, %xmm2, %xmm3
> > +; SSE4-NEXT:    pshufb {{.*#+}} xmm3 = xmm3[0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u]
> >  ; SSE4-NEXT:    pcmpeqq %xmm1, %xmm4
> >  ; SSE4-NEXT:    movmskpd %xmm4, %eax
> >  ; SSE4-NEXT:    xorl $3, %eax
> > @@ -2416,7 +2460,7 @@ define void @truncstore_v2i64_v2i8(<2 x
> >  ; SSE4-NEXT:    testb $2, %al
> >  ; SSE4-NEXT:    je .LBB8_4
> >  ; SSE4-NEXT:  .LBB8_3: # %cond.store1
> > -; SSE4-NEXT:    pextrb $8, %xmm3, 1(%rdi)
> > +; SSE4-NEXT:    pextrb $1, %xmm3, 1(%rdi)
> >  ; SSE4-NEXT:    retq
> >  ;
> >  ; AVX-LABEL: truncstore_v2i64_v2i8:
> > @@ -2427,6 +2471,7 @@ define void @truncstore_v2i64_v2i8(<2 x
> >  ; AVX-NEXT:    vmovdqa {{.*#+}} xmm5 = [9223372036854776063,9223372036854776063]
> >  ; AVX-NEXT:    vpcmpgtq %xmm4, %xmm5, %xmm4
> >  ; AVX-NEXT:    vblendvpd %xmm4, %xmm0, %xmm3, %xmm0
> > +; AVX-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u]
> >  ; AVX-NEXT:    vpcmpeqq %xmm2, %xmm1, %xmm1
> >  ; AVX-NEXT:    vmovmskpd %xmm1, %eax
> >  ; AVX-NEXT:    xorl $3, %eax
> > @@ -2442,7 +2487,7 @@ define void @truncstore_v2i64_v2i8(<2 x
> >  ; AVX-NEXT:    testb $2, %al
> >  ; AVX-NEXT:    je .LBB8_4
> >  ; AVX-NEXT:  .LBB8_3: # %cond.store1
> > -; AVX-NEXT:    vpextrb $8, %xmm0, 1(%rdi)
> > +; AVX-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX-NEXT:    retq
> >  ;
> >  ; AVX512F-LABEL: truncstore_v2i64_v2i8:
> > @@ -2452,6 +2497,7 @@ define void @truncstore_v2i64_v2i8(<2 x
> >  ; AVX512F-NEXT:    vptestmq %zmm1, %zmm1, %k0
> >  ; AVX512F-NEXT:    vmovdqa {{.*#+}} xmm1 = [255,255]
> >  ; AVX512F-NEXT:    vpminuq %zmm1, %zmm0, %zmm0
> > +; AVX512F-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u]
> >  ; AVX512F-NEXT:    kmovw %k0, %eax
> >  ; AVX512F-NEXT:    testb $1, %al
> >  ; AVX512F-NEXT:    jne .LBB8_1
> > @@ -2466,7 +2512,7 @@ define void @truncstore_v2i64_v2i8(<2 x
> >  ; AVX512F-NEXT:    testb $2, %al
> >  ; AVX512F-NEXT:    je .LBB8_4
> >  ; AVX512F-NEXT:  .LBB8_3: # %cond.store1
> > -; AVX512F-NEXT:    vpextrb $8, %xmm0, 1(%rdi)
> > +; AVX512F-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> >  ;
> > @@ -2475,11 +2521,11 @@ define void @truncstore_v2i64_v2i8(<2 x
> >  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> >  ; AVX512BW-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
> >  ; AVX512BW-NEXT:    vptestmq %zmm1, %zmm1, %k0
> > +; AVX512BW-NEXT:    kshiftlq $62, %k0, %k0
> > +; AVX512BW-NEXT:    kshiftrq $62, %k0, %k1
> >  ; AVX512BW-NEXT:    vmovdqa {{.*#+}} xmm1 = [255,255]
> >  ; AVX512BW-NEXT:    vpminuq %zmm1, %zmm0, %zmm0
> >  ; AVX512BW-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u]
> > -; AVX512BW-NEXT:    kshiftlq $62, %k0, %k0
> > -; AVX512BW-NEXT:    kshiftrq $62, %k0, %k1
> >  ; AVX512BW-NEXT:    vmovdqu8 %zmm0, (%rdi) {%k1}
> >  ; AVX512BW-NEXT:    vzeroupper
> >  ; AVX512BW-NEXT:    retq
> > @@ -4352,6 +4398,7 @@ define void @truncstore_v8i32_v8i8(<8 x
> >  ; SSE2-NEXT:    pandn %xmm9, %xmm6
> >  ; SSE2-NEXT:    por %xmm0, %xmm6
> >  ; SSE2-NEXT:    packuswb %xmm4, %xmm6
> > +; SSE2-NEXT:    packuswb %xmm6, %xmm6
> >  ; SSE2-NEXT:    pcmpeqd %xmm8, %xmm3
> >  ; SSE2-NEXT:    pcmpeqd %xmm0, %xmm0
> >  ; SSE2-NEXT:    pxor %xmm0, %xmm3
> > @@ -4371,17 +4418,26 @@ define void @truncstore_v8i32_v8i8(<8 x
> >  ; SSE2-NEXT:    jne .LBB12_5
> >  ; SSE2-NEXT:  .LBB12_6: # %else4
> >  ; SSE2-NEXT:    testb $8, %al
> > -; SSE2-NEXT:    jne .LBB12_7
> > +; SSE2-NEXT:    je .LBB12_8
> > +; SSE2-NEXT:  .LBB12_7: # %cond.store5
> > +; SSE2-NEXT:    shrl $24, %ecx
> > +; SSE2-NEXT:    movb %cl, 3(%rdi)
> >  ; SSE2-NEXT:  .LBB12_8: # %else6
> >  ; SSE2-NEXT:    testb $16, %al
> > -; SSE2-NEXT:    jne .LBB12_9
> > +; SSE2-NEXT:    pextrw $2, %xmm6, %ecx
> > +; SSE2-NEXT:    je .LBB12_10
> > +; SSE2-NEXT:  # %bb.9: # %cond.store7
> > +; SSE2-NEXT:    movb %cl, 4(%rdi)
> >  ; SSE2-NEXT:  .LBB12_10: # %else8
> >  ; SSE2-NEXT:    testb $32, %al
> > -; SSE2-NEXT:    jne .LBB12_11
> > +; SSE2-NEXT:    je .LBB12_12
> > +; SSE2-NEXT:  # %bb.11: # %cond.store9
> > +; SSE2-NEXT:    movb %ch, 5(%rdi)
> >  ; SSE2-NEXT:  .LBB12_12: # %else10
> >  ; SSE2-NEXT:    testb $64, %al
> > +; SSE2-NEXT:    pextrw $3, %xmm6, %ecx
> >  ; SSE2-NEXT:    jne .LBB12_13
> > -; SSE2-NEXT:  .LBB12_14: # %else12
> > +; SSE2-NEXT:  # %bb.14: # %else12
> >  ; SSE2-NEXT:    testb $-128, %al
> >  ; SSE2-NEXT:    jne .LBB12_15
> >  ; SSE2-NEXT:  .LBB12_16: # %else14
> > @@ -4391,47 +4447,34 @@ define void @truncstore_v8i32_v8i8(<8 x
> >  ; SSE2-NEXT:    testb $2, %al
> >  ; SSE2-NEXT:    je .LBB12_4
> >  ; SSE2-NEXT:  .LBB12_3: # %cond.store1
> > -; SSE2-NEXT:    shrl $16, %ecx
> > -; SSE2-NEXT:    movb %cl, 1(%rdi)
> > +; SSE2-NEXT:    movb %ch, 1(%rdi)
> >  ; SSE2-NEXT:    testb $4, %al
> >  ; SSE2-NEXT:    je .LBB12_6
> >  ; SSE2-NEXT:  .LBB12_5: # %cond.store3
> > -; SSE2-NEXT:    pextrw $2, %xmm6, %ecx
> > -; SSE2-NEXT:    movb %cl, 2(%rdi)
> > +; SSE2-NEXT:    movl %ecx, %edx
> > +; SSE2-NEXT:    shrl $16, %edx
> > +; SSE2-NEXT:    movb %dl, 2(%rdi)
> >  ; SSE2-NEXT:    testb $8, %al
> > -; SSE2-NEXT:    je .LBB12_8
> > -; SSE2-NEXT:  .LBB12_7: # %cond.store5
> > -; SSE2-NEXT:    pextrw $3, %xmm6, %ecx
> > -; SSE2-NEXT:    movb %cl, 3(%rdi)
> > -; SSE2-NEXT:    testb $16, %al
> > -; SSE2-NEXT:    je .LBB12_10
> > -; SSE2-NEXT:  .LBB12_9: # %cond.store7
> > -; SSE2-NEXT:    pextrw $4, %xmm6, %ecx
> > -; SSE2-NEXT:    movb %cl, 4(%rdi)
> > -; SSE2-NEXT:    testb $32, %al
> > -; SSE2-NEXT:    je .LBB12_12
> > -; SSE2-NEXT:  .LBB12_11: # %cond.store9
> > -; SSE2-NEXT:    pextrw $5, %xmm6, %ecx
> > -; SSE2-NEXT:    movb %cl, 5(%rdi)
> > -; SSE2-NEXT:    testb $64, %al
> > -; SSE2-NEXT:    je .LBB12_14
> > +; SSE2-NEXT:    jne .LBB12_7
> > +; SSE2-NEXT:    jmp .LBB12_8
> >  ; SSE2-NEXT:  .LBB12_13: # %cond.store11
> > -; SSE2-NEXT:    pextrw $6, %xmm6, %ecx
> >  ; SSE2-NEXT:    movb %cl, 6(%rdi)
> >  ; SSE2-NEXT:    testb $-128, %al
> >  ; SSE2-NEXT:    je .LBB12_16
> >  ; SSE2-NEXT:  .LBB12_15: # %cond.store13
> > -; SSE2-NEXT:    pextrw $7, %xmm6, %eax
> > -; SSE2-NEXT:    movb %al, 7(%rdi)
> > +; SSE2-NEXT:    movb %ch, 7(%rdi)
> >  ; SSE2-NEXT:    retq
> >  ;
> >  ; SSE4-LABEL: truncstore_v8i32_v8i8:
> >  ; SSE4:       # %bb.0:
> >  ; SSE4-NEXT:    pxor %xmm4, %xmm4
> >  ; SSE4-NEXT:    movdqa {{.*#+}} xmm5 = [255,255,255,255]
> > -; SSE4-NEXT:    pminud %xmm5, %xmm1
> >  ; SSE4-NEXT:    pminud %xmm5, %xmm0
> > -; SSE4-NEXT:    packusdw %xmm1, %xmm0
> > +; SSE4-NEXT:    pminud %xmm5, %xmm1
> > +; SSE4-NEXT:    movdqa {{.*#+}} xmm5 = <0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u>
> > +; SSE4-NEXT:    pshufb %xmm5, %xmm1
> > +; SSE4-NEXT:    pshufb %xmm5, %xmm0
> > +; SSE4-NEXT:    punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
> >  ; SSE4-NEXT:    pcmpeqd %xmm4, %xmm3
> >  ; SSE4-NEXT:    pcmpeqd %xmm1, %xmm1
> >  ; SSE4-NEXT:    pxor %xmm1, %xmm3
> > @@ -4470,40 +4513,43 @@ define void @truncstore_v8i32_v8i8(<8 x
> >  ; SSE4-NEXT:    testb $2, %al
> >  ; SSE4-NEXT:    je .LBB12_4
> >  ; SSE4-NEXT:  .LBB12_3: # %cond.store1
> > -; SSE4-NEXT:    pextrb $2, %xmm0, 1(%rdi)
> > +; SSE4-NEXT:    pextrb $1, %xmm0, 1(%rdi)
> >  ; SSE4-NEXT:    testb $4, %al
> >  ; SSE4-NEXT:    je .LBB12_6
> >  ; SSE4-NEXT:  .LBB12_5: # %cond.store3
> > -; SSE4-NEXT:    pextrb $4, %xmm0, 2(%rdi)
> > +; SSE4-NEXT:    pextrb $2, %xmm0, 2(%rdi)
> >  ; SSE4-NEXT:    testb $8, %al
> >  ; SSE4-NEXT:    je .LBB12_8
> >  ; SSE4-NEXT:  .LBB12_7: # %cond.store5
> > -; SSE4-NEXT:    pextrb $6, %xmm0, 3(%rdi)
> > +; SSE4-NEXT:    pextrb $3, %xmm0, 3(%rdi)
> >  ; SSE4-NEXT:    testb $16, %al
> >  ; SSE4-NEXT:    je .LBB12_10
> >  ; SSE4-NEXT:  .LBB12_9: # %cond.store7
> > -; SSE4-NEXT:    pextrb $8, %xmm0, 4(%rdi)
> > +; SSE4-NEXT:    pextrb $4, %xmm0, 4(%rdi)
> >  ; SSE4-NEXT:    testb $32, %al
> >  ; SSE4-NEXT:    je .LBB12_12
> >  ; SSE4-NEXT:  .LBB12_11: # %cond.store9
> > -; SSE4-NEXT:    pextrb $10, %xmm0, 5(%rdi)
> > +; SSE4-NEXT:    pextrb $5, %xmm0, 5(%rdi)
> >  ; SSE4-NEXT:    testb $64, %al
> >  ; SSE4-NEXT:    je .LBB12_14
> >  ; SSE4-NEXT:  .LBB12_13: # %cond.store11
> > -; SSE4-NEXT:    pextrb $12, %xmm0, 6(%rdi)
> > +; SSE4-NEXT:    pextrb $6, %xmm0, 6(%rdi)
> >  ; SSE4-NEXT:    testb $-128, %al
> >  ; SSE4-NEXT:    je .LBB12_16
> >  ; SSE4-NEXT:  .LBB12_15: # %cond.store13
> > -; SSE4-NEXT:    pextrb $14, %xmm0, 7(%rdi)
> > +; SSE4-NEXT:    pextrb $7, %xmm0, 7(%rdi)
> >  ; SSE4-NEXT:    retq
> >  ;
> >  ; AVX1-LABEL: truncstore_v8i32_v8i8:
> >  ; AVX1:       # %bb.0:
> > -; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm2
> > -; AVX1-NEXT:    vmovdqa {{.*#+}} xmm3 = [255,255,255,255]
> > -; AVX1-NEXT:    vpminud %xmm3, %xmm2, %xmm2
> > -; AVX1-NEXT:    vpminud %xmm3, %xmm0, %xmm0
> > -; AVX1-NEXT:    vpackusdw %xmm2, %xmm0, %xmm0
> > +; AVX1-NEXT:    vmovdqa {{.*#+}} xmm2 = [255,255,255,255]
> > +; AVX1-NEXT:    vpminud %xmm2, %xmm0, %xmm3
> > +; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm0
> > +; AVX1-NEXT:    vpminud %xmm2, %xmm0, %xmm0
> > +; AVX1-NEXT:    vmovdqa {{.*#+}} xmm2 = <0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u>
> > +; AVX1-NEXT:    vpshufb %xmm2, %xmm0, %xmm0
> > +; AVX1-NEXT:    vpshufb %xmm2, %xmm3, %xmm2
> > +; AVX1-NEXT:    vpunpckldq {{.*#+}} xmm0 = xmm2[0],xmm0[0],xmm2[1],xmm0[1]
> >  ; AVX1-NEXT:    vextractf128 $1, %ymm1, %xmm2
> >  ; AVX1-NEXT:    vpxor %xmm3, %xmm3, %xmm3
> >  ; AVX1-NEXT:    vpcmpeqd %xmm3, %xmm2, %xmm2
> > @@ -4542,31 +4588,31 @@ define void @truncstore_v8i32_v8i8(<8 x
> >  ; AVX1-NEXT:    testb $2, %al
> >  ; AVX1-NEXT:    je .LBB12_4
> >  ; AVX1-NEXT:  .LBB12_3: # %cond.store1
> > -; AVX1-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> > +; AVX1-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX1-NEXT:    testb $4, %al
> >  ; AVX1-NEXT:    je .LBB12_6
> >  ; AVX1-NEXT:  .LBB12_5: # %cond.store3
> > -; AVX1-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> > +; AVX1-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> >  ; AVX1-NEXT:    testb $8, %al
> >  ; AVX1-NEXT:    je .LBB12_8
> >  ; AVX1-NEXT:  .LBB12_7: # %cond.store5
> > -; AVX1-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> > +; AVX1-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> >  ; AVX1-NEXT:    testb $16, %al
> >  ; AVX1-NEXT:    je .LBB12_10
> >  ; AVX1-NEXT:  .LBB12_9: # %cond.store7
> > -; AVX1-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> > +; AVX1-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
> >  ; AVX1-NEXT:    testb $32, %al
> >  ; AVX1-NEXT:    je .LBB12_12
> >  ; AVX1-NEXT:  .LBB12_11: # %cond.store9
> > -; AVX1-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> > +; AVX1-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
> >  ; AVX1-NEXT:    testb $64, %al
> >  ; AVX1-NEXT:    je .LBB12_14
> >  ; AVX1-NEXT:  .LBB12_13: # %cond.store11
> > -; AVX1-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> > +; AVX1-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
> >  ; AVX1-NEXT:    testb $-128, %al
> >  ; AVX1-NEXT:    je .LBB12_16
> >  ; AVX1-NEXT:  .LBB12_15: # %cond.store13
> > -; AVX1-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> > +; AVX1-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
> >  ; AVX1-NEXT:    vzeroupper
> >  ; AVX1-NEXT:    retq
> >  ;
> > @@ -4576,7 +4622,10 @@ define void @truncstore_v8i32_v8i8(<8 x
> >  ; AVX2-NEXT:    vpbroadcastd {{.*#+}} ymm3 = [255,255,255,255,255,255,255,255]
> >  ; AVX2-NEXT:    vpminud %ymm3, %ymm0, %ymm0
> >  ; AVX2-NEXT:    vextracti128 $1, %ymm0, %xmm3
> > -; AVX2-NEXT:    vpackusdw %xmm3, %xmm0, %xmm0
> > +; AVX2-NEXT:    vmovdqa {{.*#+}} xmm4 = <0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u>
> > +; AVX2-NEXT:    vpshufb %xmm4, %xmm3, %xmm3
> > +; AVX2-NEXT:    vpshufb %xmm4, %xmm0, %xmm0
> > +; AVX2-NEXT:    vpunpckldq {{.*#+}} xmm0 = xmm0[0],xmm3[0],xmm0[1],xmm3[1]
> >  ; AVX2-NEXT:    vpcmpeqd %ymm2, %ymm1, %ymm1
> >  ; AVX2-NEXT:    vmovmskps %ymm1, %eax
> >  ; AVX2-NEXT:    notl %eax
> > @@ -4611,31 +4660,31 @@ define void @truncstore_v8i32_v8i8(<8 x
> >  ; AVX2-NEXT:    testb $2, %al
> >  ; AVX2-NEXT:    je .LBB12_4
> >  ; AVX2-NEXT:  .LBB12_3: # %cond.store1
> > -; AVX2-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> > +; AVX2-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX2-NEXT:    testb $4, %al
> >  ; AVX2-NEXT:    je .LBB12_6
> >  ; AVX2-NEXT:  .LBB12_5: # %cond.store3
> > -; AVX2-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> > +; AVX2-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> >  ; AVX2-NEXT:    testb $8, %al
> >  ; AVX2-NEXT:    je .LBB12_8
> >  ; AVX2-NEXT:  .LBB12_7: # %cond.store5
> > -; AVX2-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> > +; AVX2-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> >  ; AVX2-NEXT:    testb $16, %al
> >  ; AVX2-NEXT:    je .LBB12_10
> >  ; AVX2-NEXT:  .LBB12_9: # %cond.store7
> > -; AVX2-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> > +; AVX2-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
> >  ; AVX2-NEXT:    testb $32, %al
> >  ; AVX2-NEXT:    je .LBB12_12
> >  ; AVX2-NEXT:  .LBB12_11: # %cond.store9
> > -; AVX2-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> > +; AVX2-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
> >  ; AVX2-NEXT:    testb $64, %al
> >  ; AVX2-NEXT:    je .LBB12_14
> >  ; AVX2-NEXT:  .LBB12_13: # %cond.store11
> > -; AVX2-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> > +; AVX2-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
> >  ; AVX2-NEXT:    testb $-128, %al
> >  ; AVX2-NEXT:    je .LBB12_16
> >  ; AVX2-NEXT:  .LBB12_15: # %cond.store13
> > -; AVX2-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> > +; AVX2-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
> >  ; AVX2-NEXT:    vzeroupper
> >  ; AVX2-NEXT:    retq
> >  ;
> > @@ -4645,7 +4694,7 @@ define void @truncstore_v8i32_v8i8(<8 x
> >  ; AVX512F-NEXT:    vptestmd %zmm1, %zmm1, %k0
> >  ; AVX512F-NEXT:    vpbroadcastd {{.*#+}} ymm1 = [255,255,255,255,255,255,255,255]
> >  ; AVX512F-NEXT:    vpminud %ymm1, %ymm0, %ymm0
> > -; AVX512F-NEXT:    vpmovdw %zmm0, %ymm0
> > +; AVX512F-NEXT:    vpmovdb %zmm0, %xmm0
> >  ; AVX512F-NEXT:    kmovw %k0, %eax
> >  ; AVX512F-NEXT:    testb $1, %al
> >  ; AVX512F-NEXT:    jne .LBB12_1
> > @@ -4678,31 +4727,31 @@ define void @truncstore_v8i32_v8i8(<8 x
> >  ; AVX512F-NEXT:    testb $2, %al
> >  ; AVX512F-NEXT:    je .LBB12_4
> >  ; AVX512F-NEXT:  .LBB12_3: # %cond.store1
> > -; AVX512F-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> > +; AVX512F-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX512F-NEXT:    testb $4, %al
> >  ; AVX512F-NEXT:    je .LBB12_6
> >  ; AVX512F-NEXT:  .LBB12_5: # %cond.store3
> > -; AVX512F-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> > +; AVX512F-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> >  ; AVX512F-NEXT:    testb $8, %al
> >  ; AVX512F-NEXT:    je .LBB12_8
> >  ; AVX512F-NEXT:  .LBB12_7: # %cond.store5
> > -; AVX512F-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> > +; AVX512F-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> >  ; AVX512F-NEXT:    testb $16, %al
> >  ; AVX512F-NEXT:    je .LBB12_10
> >  ; AVX512F-NEXT:  .LBB12_9: # %cond.store7
> > -; AVX512F-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> > +; AVX512F-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
> >  ; AVX512F-NEXT:    testb $32, %al
> >  ; AVX512F-NEXT:    je .LBB12_12
> >  ; AVX512F-NEXT:  .LBB12_11: # %cond.store9
> > -; AVX512F-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> > +; AVX512F-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
> >  ; AVX512F-NEXT:    testb $64, %al
> >  ; AVX512F-NEXT:    je .LBB12_14
> >  ; AVX512F-NEXT:  .LBB12_13: # %cond.store11
> > -; AVX512F-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> > +; AVX512F-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
> >  ; AVX512F-NEXT:    testb $-128, %al
> >  ; AVX512F-NEXT:    je .LBB12_16
> >  ; AVX512F-NEXT:  .LBB12_15: # %cond.store13
> > -; AVX512F-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> > +; AVX512F-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
> >  ; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> >  ;
> > @@ -4710,12 +4759,11 @@ define void @truncstore_v8i32_v8i8(<8 x
> >  ; AVX512BW:       # %bb.0:
> >  ; AVX512BW-NEXT:    # kill: def $ymm1 killed $ymm1 def $zmm1
> >  ; AVX512BW-NEXT:    vptestmd %zmm1, %zmm1, %k0
> > -; AVX512BW-NEXT:    vpbroadcastd {{.*#+}} ymm1 = [255,255,255,255,255,255,255,255]
> > -; AVX512BW-NEXT:    vpminud %ymm1, %ymm0, %ymm0
> > -; AVX512BW-NEXT:    vpmovdw %zmm0, %ymm0
> > -; AVX512BW-NEXT:    vpackuswb %xmm0, %xmm0, %xmm0
> >  ; AVX512BW-NEXT:    kshiftlq $56, %k0, %k0
> >  ; AVX512BW-NEXT:    kshiftrq $56, %k0, %k1
> > +; AVX512BW-NEXT:    vpbroadcastd {{.*#+}} ymm1 = [255,255,255,255,255,255,255,255]
> > +; AVX512BW-NEXT:    vpminud %ymm1, %ymm0, %ymm0
> > +; AVX512BW-NEXT:    vpmovdb %zmm0, %xmm0
> >  ; AVX512BW-NEXT:    vmovdqu8 %zmm0, (%rdi) {%k1}
> >  ; AVX512BW-NEXT:    vzeroupper
> >  ; AVX512BW-NEXT:    retq
> > @@ -4738,16 +4786,19 @@ define void @truncstore_v8i32_v8i8(<8 x
> >  define void @truncstore_v4i32_v4i16(<4 x i32> %x, <4 x i16>* %p, <4 x i32> %mask) {
> >  ; SSE2-LABEL: truncstore_v4i32_v4i16:
> >  ; SSE2:       # %bb.0:
> > -; SSE2-NEXT:    pxor %xmm3, %xmm3
> > -; SSE2-NEXT:    movdqa {{.*#+}} xmm4 = [2147483648,2147483648,2147483648,2147483648]
> > -; SSE2-NEXT:    pxor %xmm0, %xmm4
> > -; SSE2-NEXT:    movdqa {{.*#+}} xmm2 = [2147549183,2147549183,2147549183,2147549183]
> > -; SSE2-NEXT:    pcmpgtd %xmm4, %xmm2
> > -; SSE2-NEXT:    pand %xmm2, %xmm0
> > -; SSE2-NEXT:    pandn {{.*}}(%rip), %xmm2
> > -; SSE2-NEXT:    por %xmm0, %xmm2
> > -; SSE2-NEXT:    pcmpeqd %xmm1, %xmm3
> > -; SSE2-NEXT:    movmskps %xmm3, %eax
> > +; SSE2-NEXT:    pxor %xmm2, %xmm2
> > +; SSE2-NEXT:    movdqa {{.*#+}} xmm3 = [2147483648,2147483648,2147483648,2147483648]
> > +; SSE2-NEXT:    pxor %xmm0, %xmm3
> > +; SSE2-NEXT:    movdqa {{.*#+}} xmm4 = [2147549183,2147549183,2147549183,2147549183]
> > +; SSE2-NEXT:    pcmpgtd %xmm3, %xmm4
> > +; SSE2-NEXT:    pand %xmm4, %xmm0
> > +; SSE2-NEXT:    pandn {{.*}}(%rip), %xmm4
> > +; SSE2-NEXT:    por %xmm0, %xmm4
> > +; SSE2-NEXT:    pshuflw {{.*#+}} xmm0 = xmm4[0,2,2,3,4,5,6,7]
> > +; SSE2-NEXT:    pshufhw {{.*#+}} xmm0 = xmm0[0,1,2,3,4,6,6,7]
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > +; SSE2-NEXT:    pcmpeqd %xmm1, %xmm2
> > +; SSE2-NEXT:    movmskps %xmm2, %eax
> >  ; SSE2-NEXT:    xorl $15, %eax
> >  ; SSE2-NEXT:    testb $1, %al
> >  ; SSE2-NEXT:    jne .LBB13_1
> > @@ -4763,22 +4814,22 @@ define void @truncstore_v4i32_v4i16(<4 x
> >  ; SSE2-NEXT:  .LBB13_8: # %else6
> >  ; SSE2-NEXT:    retq
> >  ; SSE2-NEXT:  .LBB13_1: # %cond.store
> > -; SSE2-NEXT:    movd %xmm2, %ecx
> > +; SSE2-NEXT:    movd %xmm0, %ecx
> >  ; SSE2-NEXT:    movw %cx, (%rdi)
> >  ; SSE2-NEXT:    testb $2, %al
> >  ; SSE2-NEXT:    je .LBB13_4
> >  ; SSE2-NEXT:  .LBB13_3: # %cond.store1
> > -; SSE2-NEXT:    pextrw $2, %xmm2, %ecx
> > +; SSE2-NEXT:    pextrw $1, %xmm0, %ecx
> >  ; SSE2-NEXT:    movw %cx, 2(%rdi)
> >  ; SSE2-NEXT:    testb $4, %al
> >  ; SSE2-NEXT:    je .LBB13_6
> >  ; SSE2-NEXT:  .LBB13_5: # %cond.store3
> > -; SSE2-NEXT:    pextrw $4, %xmm2, %ecx
> > +; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> >  ; SSE2-NEXT:    movw %cx, 4(%rdi)
> >  ; SSE2-NEXT:    testb $8, %al
> >  ; SSE2-NEXT:    je .LBB13_8
> >  ; SSE2-NEXT:  .LBB13_7: # %cond.store5
> > -; SSE2-NEXT:    pextrw $6, %xmm2, %eax
> > +; SSE2-NEXT:    pextrw $3, %xmm0, %eax
> >  ; SSE2-NEXT:    movw %ax, 6(%rdi)
> >  ; SSE2-NEXT:    retq
> >  ;
> > @@ -4786,6 +4837,7 @@ define void @truncstore_v4i32_v4i16(<4 x
> >  ; SSE4:       # %bb.0:
> >  ; SSE4-NEXT:    pxor %xmm2, %xmm2
> >  ; SSE4-NEXT:    pminud {{.*}}(%rip), %xmm0
> > +; SSE4-NEXT:    packusdw %xmm0, %xmm0
> >  ; SSE4-NEXT:    pcmpeqd %xmm1, %xmm2
> >  ; SSE4-NEXT:    movmskps %xmm2, %eax
> >  ; SSE4-NEXT:    xorl $15, %eax
> > @@ -4807,21 +4859,22 @@ define void @truncstore_v4i32_v4i16(<4 x
> >  ; SSE4-NEXT:    testb $2, %al
> >  ; SSE4-NEXT:    je .LBB13_4
> >  ; SSE4-NEXT:  .LBB13_3: # %cond.store1
> > -; SSE4-NEXT:    pextrw $2, %xmm0, 2(%rdi)
> > +; SSE4-NEXT:    pextrw $1, %xmm0, 2(%rdi)
> >  ; SSE4-NEXT:    testb $4, %al
> >  ; SSE4-NEXT:    je .LBB13_6
> >  ; SSE4-NEXT:  .LBB13_5: # %cond.store3
> > -; SSE4-NEXT:    pextrw $4, %xmm0, 4(%rdi)
> > +; SSE4-NEXT:    pextrw $2, %xmm0, 4(%rdi)
> >  ; SSE4-NEXT:    testb $8, %al
> >  ; SSE4-NEXT:    je .LBB13_8
> >  ; SSE4-NEXT:  .LBB13_7: # %cond.store5
> > -; SSE4-NEXT:    pextrw $6, %xmm0, 6(%rdi)
> > +; SSE4-NEXT:    pextrw $3, %xmm0, 6(%rdi)
> >  ; SSE4-NEXT:    retq
> >  ;
> >  ; AVX1-LABEL: truncstore_v4i32_v4i16:
> >  ; AVX1:       # %bb.0:
> >  ; AVX1-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> >  ; AVX1-NEXT:    vpminud {{.*}}(%rip), %xmm0, %xmm0
> > +; AVX1-NEXT:    vpackusdw %xmm0, %xmm0, %xmm0
> >  ; AVX1-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
> >  ; AVX1-NEXT:    vmovmskps %xmm1, %eax
> >  ; AVX1-NEXT:    xorl $15, %eax
> > @@ -4843,15 +4896,15 @@ define void @truncstore_v4i32_v4i16(<4 x
> >  ; AVX1-NEXT:    testb $2, %al
> >  ; AVX1-NEXT:    je .LBB13_4
> >  ; AVX1-NEXT:  .LBB13_3: # %cond.store1
> > -; AVX1-NEXT:    vpextrw $2, %xmm0, 2(%rdi)
> > +; AVX1-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
> >  ; AVX1-NEXT:    testb $4, %al
> >  ; AVX1-NEXT:    je .LBB13_6
> >  ; AVX1-NEXT:  .LBB13_5: # %cond.store3
> > -; AVX1-NEXT:    vpextrw $4, %xmm0, 4(%rdi)
> > +; AVX1-NEXT:    vpextrw $2, %xmm0, 4(%rdi)
> >  ; AVX1-NEXT:    testb $8, %al
> >  ; AVX1-NEXT:    je .LBB13_8
> >  ; AVX1-NEXT:  .LBB13_7: # %cond.store5
> > -; AVX1-NEXT:    vpextrw $6, %xmm0, 6(%rdi)
> > +; AVX1-NEXT:    vpextrw $3, %xmm0, 6(%rdi)
> >  ; AVX1-NEXT:    retq
> >  ;
> >  ; AVX2-LABEL: truncstore_v4i32_v4i16:
> > @@ -4859,6 +4912,7 @@ define void @truncstore_v4i32_v4i16(<4 x
> >  ; AVX2-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> >  ; AVX2-NEXT:    vpbroadcastd {{.*#+}} xmm3 = [65535,65535,65535,65535]
> >  ; AVX2-NEXT:    vpminud %xmm3, %xmm0, %xmm0
> > +; AVX2-NEXT:    vpackusdw %xmm0, %xmm0, %xmm0
> >  ; AVX2-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
> >  ; AVX2-NEXT:    vmovmskps %xmm1, %eax
> >  ; AVX2-NEXT:    xorl $15, %eax
> > @@ -4880,15 +4934,15 @@ define void @truncstore_v4i32_v4i16(<4 x
> >  ; AVX2-NEXT:    testb $2, %al
> >  ; AVX2-NEXT:    je .LBB13_4
> >  ; AVX2-NEXT:  .LBB13_3: # %cond.store1
> > -; AVX2-NEXT:    vpextrw $2, %xmm0, 2(%rdi)
> > +; AVX2-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
> >  ; AVX2-NEXT:    testb $4, %al
> >  ; AVX2-NEXT:    je .LBB13_6
> >  ; AVX2-NEXT:  .LBB13_5: # %cond.store3
> > -; AVX2-NEXT:    vpextrw $4, %xmm0, 4(%rdi)
> > +; AVX2-NEXT:    vpextrw $2, %xmm0, 4(%rdi)
> >  ; AVX2-NEXT:    testb $8, %al
> >  ; AVX2-NEXT:    je .LBB13_8
> >  ; AVX2-NEXT:  .LBB13_7: # %cond.store5
> > -; AVX2-NEXT:    vpextrw $6, %xmm0, 6(%rdi)
> > +; AVX2-NEXT:    vpextrw $3, %xmm0, 6(%rdi)
> >  ; AVX2-NEXT:    retq
> >  ;
> >  ; AVX512F-LABEL: truncstore_v4i32_v4i16:
> > @@ -4897,6 +4951,7 @@ define void @truncstore_v4i32_v4i16(<4 x
> >  ; AVX512F-NEXT:    vptestmd %zmm1, %zmm1, %k0
> >  ; AVX512F-NEXT:    vpbroadcastd {{.*#+}} xmm1 = [65535,65535,65535,65535]
> >  ; AVX512F-NEXT:    vpminud %xmm1, %xmm0, %xmm0
> > +; AVX512F-NEXT:    vpackusdw %xmm0, %xmm0, %xmm0
> >  ; AVX512F-NEXT:    kmovw %k0, %eax
> >  ; AVX512F-NEXT:    testb $1, %al
> >  ; AVX512F-NEXT:    jne .LBB13_1
> > @@ -4917,15 +4972,15 @@ define void @truncstore_v4i32_v4i16(<4 x
> >  ; AVX512F-NEXT:    testb $2, %al
> >  ; AVX512F-NEXT:    je .LBB13_4
> >  ; AVX512F-NEXT:  .LBB13_3: # %cond.store1
> > -; AVX512F-NEXT:    vpextrw $2, %xmm0, 2(%rdi)
> > +; AVX512F-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
> >  ; AVX512F-NEXT:    testb $4, %al
> >  ; AVX512F-NEXT:    je .LBB13_6
> >  ; AVX512F-NEXT:  .LBB13_5: # %cond.store3
> > -; AVX512F-NEXT:    vpextrw $4, %xmm0, 4(%rdi)
> > +; AVX512F-NEXT:    vpextrw $2, %xmm0, 4(%rdi)
> >  ; AVX512F-NEXT:    testb $8, %al
> >  ; AVX512F-NEXT:    je .LBB13_8
> >  ; AVX512F-NEXT:  .LBB13_7: # %cond.store5
> > -; AVX512F-NEXT:    vpextrw $6, %xmm0, 6(%rdi)
> > +; AVX512F-NEXT:    vpextrw $3, %xmm0, 6(%rdi)
> >  ; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> >  ;
> > @@ -4933,11 +4988,11 @@ define void @truncstore_v4i32_v4i16(<4 x
> >  ; AVX512BW:       # %bb.0:
> >  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> >  ; AVX512BW-NEXT:    vptestmd %zmm1, %zmm1, %k0
> > +; AVX512BW-NEXT:    kshiftld $28, %k0, %k0
> > +; AVX512BW-NEXT:    kshiftrd $28, %k0, %k1
> >  ; AVX512BW-NEXT:    vpbroadcastd {{.*#+}} xmm1 = [65535,65535,65535,65535]
> >  ; AVX512BW-NEXT:    vpminud %xmm1, %xmm0, %xmm0
> >  ; AVX512BW-NEXT:    vpackusdw %xmm0, %xmm0, %xmm0
> > -; AVX512BW-NEXT:    kshiftld $28, %k0, %k0
> > -; AVX512BW-NEXT:    kshiftrd $28, %k0, %k1
> >  ; AVX512BW-NEXT:    vmovdqu16 %zmm0, (%rdi) {%k1}
> >  ; AVX512BW-NEXT:    vzeroupper
> >  ; AVX512BW-NEXT:    retq
> > @@ -4959,47 +5014,50 @@ define void @truncstore_v4i32_v4i16(<4 x
> >  define void @truncstore_v4i32_v4i8(<4 x i32> %x, <4 x i8>* %p, <4 x i32> %mask) {
> >  ; SSE2-LABEL: truncstore_v4i32_v4i8:
> >  ; SSE2:       # %bb.0:
> > -; SSE2-NEXT:    pxor %xmm3, %xmm3
> > -; SSE2-NEXT:    movdqa {{.*#+}} xmm4 = [2147483648,2147483648,2147483648,2147483648]
> > -; SSE2-NEXT:    pxor %xmm0, %xmm4
> > -; SSE2-NEXT:    movdqa {{.*#+}} xmm2 = [2147483903,2147483903,2147483903,2147483903]
> > -; SSE2-NEXT:    pcmpgtd %xmm4, %xmm2
> > -; SSE2-NEXT:    pand %xmm2, %xmm0
> > -; SSE2-NEXT:    pandn {{.*}}(%rip), %xmm2
> > -; SSE2-NEXT:    por %xmm0, %xmm2
> > -; SSE2-NEXT:    pcmpeqd %xmm1, %xmm3
> > -; SSE2-NEXT:    movmskps %xmm3, %eax
> > -; SSE2-NEXT:    xorl $15, %eax
> > -; SSE2-NEXT:    testb $1, %al
> > +; SSE2-NEXT:    pxor %xmm2, %xmm2
> > +; SSE2-NEXT:    movdqa {{.*#+}} xmm3 = [2147483648,2147483648,2147483648,2147483648]
> > +; SSE2-NEXT:    pxor %xmm0, %xmm3
> > +; SSE2-NEXT:    movdqa {{.*#+}} xmm4 = [2147483903,2147483903,2147483903,2147483903]
> > +; SSE2-NEXT:    pcmpgtd %xmm3, %xmm4
> > +; SSE2-NEXT:    pand %xmm4, %xmm0
> > +; SSE2-NEXT:    pandn {{.*}}(%rip), %xmm4
> > +; SSE2-NEXT:    por %xmm0, %xmm4
> > +; SSE2-NEXT:    pand {{.*}}(%rip), %xmm4
> > +; SSE2-NEXT:    packuswb %xmm4, %xmm4
> > +; SSE2-NEXT:    packuswb %xmm4, %xmm4
> > +; SSE2-NEXT:    pcmpeqd %xmm1, %xmm2
> > +; SSE2-NEXT:    movmskps %xmm2, %ecx
> > +; SSE2-NEXT:    xorl $15, %ecx
> > +; SSE2-NEXT:    testb $1, %cl
> > +; SSE2-NEXT:    movd %xmm4, %eax
> >  ; SSE2-NEXT:    jne .LBB14_1
> >  ; SSE2-NEXT:  # %bb.2: # %else
> > -; SSE2-NEXT:    testb $2, %al
> > +; SSE2-NEXT:    testb $2, %cl
> >  ; SSE2-NEXT:    jne .LBB14_3
> >  ; SSE2-NEXT:  .LBB14_4: # %else2
> > -; SSE2-NEXT:    testb $4, %al
> > +; SSE2-NEXT:    testb $4, %cl
> >  ; SSE2-NEXT:    jne .LBB14_5
> >  ; SSE2-NEXT:  .LBB14_6: # %else4
> > -; SSE2-NEXT:    testb $8, %al
> > +; SSE2-NEXT:    testb $8, %cl
> >  ; SSE2-NEXT:    jne .LBB14_7
> >  ; SSE2-NEXT:  .LBB14_8: # %else6
> >  ; SSE2-NEXT:    retq
> >  ; SSE2-NEXT:  .LBB14_1: # %cond.store
> > -; SSE2-NEXT:    movd %xmm2, %ecx
> > -; SSE2-NEXT:    movb %cl, (%rdi)
> > -; SSE2-NEXT:    testb $2, %al
> > +; SSE2-NEXT:    movb %al, (%rdi)
> > +; SSE2-NEXT:    testb $2, %cl
> >  ; SSE2-NEXT:    je .LBB14_4
> >  ; SSE2-NEXT:  .LBB14_3: # %cond.store1
> > -; SSE2-NEXT:    pextrw $2, %xmm2, %ecx
> > -; SSE2-NEXT:    movb %cl, 1(%rdi)
> > -; SSE2-NEXT:    testb $4, %al
> > +; SSE2-NEXT:    movb %ah, 1(%rdi)
> > +; SSE2-NEXT:    testb $4, %cl
> >  ; SSE2-NEXT:    je .LBB14_6
> >  ; SSE2-NEXT:  .LBB14_5: # %cond.store3
> > -; SSE2-NEXT:    pextrw $4, %xmm2, %ecx
> > -; SSE2-NEXT:    movb %cl, 2(%rdi)
> > -; SSE2-NEXT:    testb $8, %al
> > +; SSE2-NEXT:    movl %eax, %edx
> > +; SSE2-NEXT:    shrl $16, %edx
> > +; SSE2-NEXT:    movb %dl, 2(%rdi)
> > +; SSE2-NEXT:    testb $8, %cl
> >  ; SSE2-NEXT:    je .LBB14_8
> >  ; SSE2-NEXT:  .LBB14_7: # %cond.store5
> > -; SSE2-NEXT:    pextrw $6, %xmm2, %eax
> > +; SSE2-NEXT:    shrl $24, %eax
> >  ; SSE2-NEXT:    movb %al, 3(%rdi)
> >  ; SSE2-NEXT:    retq
> >  ;
> > @@ -5007,6 +5065,7 @@ define void @truncstore_v4i32_v4i8(<4 x
> >  ; SSE4:       # %bb.0:
> >  ; SSE4-NEXT:    pxor %xmm2, %xmm2
> >  ; SSE4-NEXT:    pminud {{.*}}(%rip), %xmm0
> > +; SSE4-NEXT:    pshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
> >  ; SSE4-NEXT:    pcmpeqd %xmm1, %xmm2
> >  ; SSE4-NEXT:    movmskps %xmm2, %eax
> >  ; SSE4-NEXT:    xorl $15, %eax
> > @@ -5028,21 +5087,22 @@ define void @truncstore_v4i32_v4i8(<4 x
> >  ; SSE4-NEXT:    testb $2, %al
> >  ; SSE4-NEXT:    je .LBB14_4
> >  ; SSE4-NEXT:  .LBB14_3: # %cond.store1
> > -; SSE4-NEXT:    pextrb $4, %xmm0, 1(%rdi)
> > +; SSE4-NEXT:    pextrb $1, %xmm0, 1(%rdi)
> >  ; SSE4-NEXT:    testb $4, %al
> >  ; SSE4-NEXT:    je .LBB14_6
> >  ; SSE4-NEXT:  .LBB14_5: # %cond.store3
> > -; SSE4-NEXT:    pextrb $8, %xmm0, 2(%rdi)
> > +; SSE4-NEXT:    pextrb $2, %xmm0, 2(%rdi)
> >  ; SSE4-NEXT:    testb $8, %al
> >  ; SSE4-NEXT:    je .LBB14_8
> >  ; SSE4-NEXT:  .LBB14_7: # %cond.store5
> > -; SSE4-NEXT:    pextrb $12, %xmm0, 3(%rdi)
> > +; SSE4-NEXT:    pextrb $3, %xmm0, 3(%rdi)
> >  ; SSE4-NEXT:    retq
> >  ;
> >  ; AVX1-LABEL: truncstore_v4i32_v4i8:
> >  ; AVX1:       # %bb.0:
> >  ; AVX1-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> >  ; AVX1-NEXT:    vpminud {{.*}}(%rip), %xmm0, %xmm0
> > +; AVX1-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
> >  ; AVX1-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
> >  ; AVX1-NEXT:    vmovmskps %xmm1, %eax
> >  ; AVX1-NEXT:    xorl $15, %eax
> > @@ -5064,15 +5124,15 @@ define void @truncstore_v4i32_v4i8(<4 x
> >  ; AVX1-NEXT:    testb $2, %al
> >  ; AVX1-NEXT:    je .LBB14_4
> >  ; AVX1-NEXT:  .LBB14_3: # %cond.store1
> > -; AVX1-NEXT:    vpextrb $4, %xmm0, 1(%rdi)
> > +; AVX1-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX1-NEXT:    testb $4, %al
> >  ; AVX1-NEXT:    je .LBB14_6
> >  ; AVX1-NEXT:  .LBB14_5: # %cond.store3
> > -; AVX1-NEXT:    vpextrb $8, %xmm0, 2(%rdi)
> > +; AVX1-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> >  ; AVX1-NEXT:    testb $8, %al
> >  ; AVX1-NEXT:    je .LBB14_8
> >  ; AVX1-NEXT:  .LBB14_7: # %cond.store5
> > -; AVX1-NEXT:    vpextrb $12, %xmm0, 3(%rdi)
> > +; AVX1-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> >  ; AVX1-NEXT:    retq
> >  ;
> >  ; AVX2-LABEL: truncstore_v4i32_v4i8:
> > @@ -5080,6 +5140,7 @@ define void @truncstore_v4i32_v4i8(<4 x
> >  ; AVX2-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> >  ; AVX2-NEXT:    vpbroadcastd {{.*#+}} xmm3 = [255,255,255,255]
> >  ; AVX2-NEXT:    vpminud %xmm3, %xmm0, %xmm0
> > +; AVX2-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
> >  ; AVX2-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
> >  ; AVX2-NEXT:    vmovmskps %xmm1, %eax
> >  ; AVX2-NEXT:    xorl $15, %eax
> > @@ -5101,15 +5162,15 @@ define void @truncstore_v4i32_v4i8(<4 x
> >  ; AVX2-NEXT:    testb $2, %al
> >  ; AVX2-NEXT:    je .LBB14_4
> >  ; AVX2-NEXT:  .LBB14_3: # %cond.store1
> > -; AVX2-NEXT:    vpextrb $4, %xmm0, 1(%rdi)
> > +; AVX2-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX2-NEXT:    testb $4, %al
> >  ; AVX2-NEXT:    je .LBB14_6
> >  ; AVX2-NEXT:  .LBB14_5: # %cond.store3
> > -; AVX2-NEXT:    vpextrb $8, %xmm0, 2(%rdi)
> > +; AVX2-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> >  ; AVX2-NEXT:    testb $8, %al
> >  ; AVX2-NEXT:    je .LBB14_8
> >  ; AVX2-NEXT:  .LBB14_7: # %cond.store5
> > -; AVX2-NEXT:    vpextrb $12, %xmm0, 3(%rdi)
> > +; AVX2-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> >  ; AVX2-NEXT:    retq
> >  ;
> >  ; AVX512F-LABEL: truncstore_v4i32_v4i8:
> > @@ -5118,6 +5179,7 @@ define void @truncstore_v4i32_v4i8(<4 x
> >  ; AVX512F-NEXT:    vptestmd %zmm1, %zmm1, %k0
> >  ; AVX512F-NEXT:    vpbroadcastd {{.*#+}} xmm1 = [255,255,255,255]
> >  ; AVX512F-NEXT:    vpminud %xmm1, %xmm0, %xmm0
> > +; AVX512F-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
> >  ; AVX512F-NEXT:    kmovw %k0, %eax
> >  ; AVX512F-NEXT:    testb $1, %al
> >  ; AVX512F-NEXT:    jne .LBB14_1
> > @@ -5138,15 +5200,15 @@ define void @truncstore_v4i32_v4i8(<4 x
> >  ; AVX512F-NEXT:    testb $2, %al
> >  ; AVX512F-NEXT:    je .LBB14_4
> >  ; AVX512F-NEXT:  .LBB14_3: # %cond.store1
> > -; AVX512F-NEXT:    vpextrb $4, %xmm0, 1(%rdi)
> > +; AVX512F-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX512F-NEXT:    testb $4, %al
> >  ; AVX512F-NEXT:    je .LBB14_6
> >  ; AVX512F-NEXT:  .LBB14_5: # %cond.store3
> > -; AVX512F-NEXT:    vpextrb $8, %xmm0, 2(%rdi)
> > +; AVX512F-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> >  ; AVX512F-NEXT:    testb $8, %al
> >  ; AVX512F-NEXT:    je .LBB14_8
> >  ; AVX512F-NEXT:  .LBB14_7: # %cond.store5
> > -; AVX512F-NEXT:    vpextrb $12, %xmm0, 3(%rdi)
> > +; AVX512F-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> >  ; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> >  ;
> > @@ -5154,11 +5216,11 @@ define void @truncstore_v4i32_v4i8(<4 x
> >  ; AVX512BW:       # %bb.0:
> >  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> >  ; AVX512BW-NEXT:    vptestmd %zmm1, %zmm1, %k0
> > +; AVX512BW-NEXT:    kshiftlq $60, %k0, %k0
> > +; AVX512BW-NEXT:    kshiftrq $60, %k0, %k1
> >  ; AVX512BW-NEXT:    vpbroadcastd {{.*#+}} xmm1 = [255,255,255,255]
> >  ; AVX512BW-NEXT:    vpminud %xmm1, %xmm0, %xmm0
> >  ; AVX512BW-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
> > -; AVX512BW-NEXT:    kshiftlq $60, %k0, %k0
> > -; AVX512BW-NEXT:    kshiftrq $60, %k0, %k1
> >  ; AVX512BW-NEXT:    vmovdqu8 %zmm0, (%rdi) {%k1}
> >  ; AVX512BW-NEXT:    vzeroupper
> >  ; AVX512BW-NEXT:    retq
> > @@ -7041,10 +7103,10 @@ define void @truncstore_v8i16_v8i8(<8 x
> >  ; SSE2-LABEL: truncstore_v8i16_v8i8:
> >  ; SSE2:       # %bb.0:
> >  ; SSE2-NEXT:    pxor %xmm2, %xmm2
> > -; SSE2-NEXT:    movdqa {{.*#+}} xmm3 = [32768,32768,32768,32768,32768,32768,32768,32768]
> > -; SSE2-NEXT:    pxor %xmm3, %xmm0
> > +; SSE2-NEXT:    pxor {{.*}}(%rip), %xmm0
> >  ; SSE2-NEXT:    pminsw {{.*}}(%rip), %xmm0
> > -; SSE2-NEXT:    pxor %xmm3, %xmm0
> > +; SSE2-NEXT:    pand {{.*}}(%rip), %xmm0
> > +; SSE2-NEXT:    packuswb %xmm0, %xmm0
> >  ; SSE2-NEXT:    pcmpeqw %xmm1, %xmm2
> >  ; SSE2-NEXT:    pcmpeqd %xmm1, %xmm1
> >  ; SSE2-NEXT:    pxor %xmm2, %xmm1
> > @@ -7061,17 +7123,26 @@ define void @truncstore_v8i16_v8i8(<8 x
> >  ; SSE2-NEXT:    jne .LBB17_5
> >  ; SSE2-NEXT:  .LBB17_6: # %else4
> >  ; SSE2-NEXT:    testb $8, %al
> > -; SSE2-NEXT:    jne .LBB17_7
> > +; SSE2-NEXT:    je .LBB17_8
> > +; SSE2-NEXT:  .LBB17_7: # %cond.store5
> > +; SSE2-NEXT:    shrl $24, %ecx
> > +; SSE2-NEXT:    movb %cl, 3(%rdi)
> >  ; SSE2-NEXT:  .LBB17_8: # %else6
> >  ; SSE2-NEXT:    testb $16, %al
> > -; SSE2-NEXT:    jne .LBB17_9
> > +; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> > +; SSE2-NEXT:    je .LBB17_10
> > +; SSE2-NEXT:  # %bb.9: # %cond.store7
> > +; SSE2-NEXT:    movb %cl, 4(%rdi)
> >  ; SSE2-NEXT:  .LBB17_10: # %else8
> >  ; SSE2-NEXT:    testb $32, %al
> > -; SSE2-NEXT:    jne .LBB17_11
> > +; SSE2-NEXT:    je .LBB17_12
> > +; SSE2-NEXT:  # %bb.11: # %cond.store9
> > +; SSE2-NEXT:    movb %ch, 5(%rdi)
> >  ; SSE2-NEXT:  .LBB17_12: # %else10
> >  ; SSE2-NEXT:    testb $64, %al
> > +; SSE2-NEXT:    pextrw $3, %xmm0, %ecx
> >  ; SSE2-NEXT:    jne .LBB17_13
> > -; SSE2-NEXT:  .LBB17_14: # %else12
> > +; SSE2-NEXT:  # %bb.14: # %else12
> >  ; SSE2-NEXT:    testb $-128, %al
> >  ; SSE2-NEXT:    jne .LBB17_15
> >  ; SSE2-NEXT:  .LBB17_16: # %else14
> > @@ -7081,44 +7152,29 @@ define void @truncstore_v8i16_v8i8(<8 x
> >  ; SSE2-NEXT:    testb $2, %al
> >  ; SSE2-NEXT:    je .LBB17_4
> >  ; SSE2-NEXT:  .LBB17_3: # %cond.store1
> > -; SSE2-NEXT:    shrl $16, %ecx
> > -; SSE2-NEXT:    movb %cl, 1(%rdi)
> > +; SSE2-NEXT:    movb %ch, 1(%rdi)
> >  ; SSE2-NEXT:    testb $4, %al
> >  ; SSE2-NEXT:    je .LBB17_6
> >  ; SSE2-NEXT:  .LBB17_5: # %cond.store3
> > -; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> > -; SSE2-NEXT:    movb %cl, 2(%rdi)
> > +; SSE2-NEXT:    movl %ecx, %edx
> > +; SSE2-NEXT:    shrl $16, %edx
> > +; SSE2-NEXT:    movb %dl, 2(%rdi)
> >  ; SSE2-NEXT:    testb $8, %al
> > -; SSE2-NEXT:    je .LBB17_8
> > -; SSE2-NEXT:  .LBB17_7: # %cond.store5
> > -; SSE2-NEXT:    pextrw $3, %xmm0, %ecx
> > -; SSE2-NEXT:    movb %cl, 3(%rdi)
> > -; SSE2-NEXT:    testb $16, %al
> > -; SSE2-NEXT:    je .LBB17_10
> > -; SSE2-NEXT:  .LBB17_9: # %cond.store7
> > -; SSE2-NEXT:    pextrw $4, %xmm0, %ecx
> > -; SSE2-NEXT:    movb %cl, 4(%rdi)
> > -; SSE2-NEXT:    testb $32, %al
> > -; SSE2-NEXT:    je .LBB17_12
> > -; SSE2-NEXT:  .LBB17_11: # %cond.store9
> > -; SSE2-NEXT:    pextrw $5, %xmm0, %ecx
> > -; SSE2-NEXT:    movb %cl, 5(%rdi)
> > -; SSE2-NEXT:    testb $64, %al
> > -; SSE2-NEXT:    je .LBB17_14
> > +; SSE2-NEXT:    jne .LBB17_7
> > +; SSE2-NEXT:    jmp .LBB17_8
> >  ; SSE2-NEXT:  .LBB17_13: # %cond.store11
> > -; SSE2-NEXT:    pextrw $6, %xmm0, %ecx
> >  ; SSE2-NEXT:    movb %cl, 6(%rdi)
> >  ; SSE2-NEXT:    testb $-128, %al
> >  ; SSE2-NEXT:    je .LBB17_16
> >  ; SSE2-NEXT:  .LBB17_15: # %cond.store13
> > -; SSE2-NEXT:    pextrw $7, %xmm0, %eax
> > -; SSE2-NEXT:    movb %al, 7(%rdi)
> > +; SSE2-NEXT:    movb %ch, 7(%rdi)
> >  ; SSE2-NEXT:    retq
> >  ;
> >  ; SSE4-LABEL: truncstore_v8i16_v8i8:
> >  ; SSE4:       # %bb.0:
> >  ; SSE4-NEXT:    pxor %xmm2, %xmm2
> >  ; SSE4-NEXT:    pminuw {{.*}}(%rip), %xmm0
> > +; SSE4-NEXT:    packuswb %xmm0, %xmm0
> >  ; SSE4-NEXT:    pcmpeqw %xmm1, %xmm2
> >  ; SSE4-NEXT:    pcmpeqd %xmm1, %xmm1
> >  ; SSE4-NEXT:    pxor %xmm2, %xmm1
> > @@ -7154,37 +7210,38 @@ define void @truncstore_v8i16_v8i8(<8 x
> >  ; SSE4-NEXT:    testb $2, %al
> >  ; SSE4-NEXT:    je .LBB17_4
> >  ; SSE4-NEXT:  .LBB17_3: # %cond.store1
> > -; SSE4-NEXT:    pextrb $2, %xmm0, 1(%rdi)
> > +; SSE4-NEXT:    pextrb $1, %xmm0, 1(%rdi)
> >  ; SSE4-NEXT:    testb $4, %al
> >  ; SSE4-NEXT:    je .LBB17_6
> >  ; SSE4-NEXT:  .LBB17_5: # %cond.store3
> > -; SSE4-NEXT:    pextrb $4, %xmm0, 2(%rdi)
> > +; SSE4-NEXT:    pextrb $2, %xmm0, 2(%rdi)
> >  ; SSE4-NEXT:    testb $8, %al
> >  ; SSE4-NEXT:    je .LBB17_8
> >  ; SSE4-NEXT:  .LBB17_7: # %cond.store5
> > -; SSE4-NEXT:    pextrb $6, %xmm0, 3(%rdi)
> > +; SSE4-NEXT:    pextrb $3, %xmm0, 3(%rdi)
> >  ; SSE4-NEXT:    testb $16, %al
> >  ; SSE4-NEXT:    je .LBB17_10
> >  ; SSE4-NEXT:  .LBB17_9: # %cond.store7
> > -; SSE4-NEXT:    pextrb $8, %xmm0, 4(%rdi)
> > +; SSE4-NEXT:    pextrb $4, %xmm0, 4(%rdi)
> >  ; SSE4-NEXT:    testb $32, %al
> >  ; SSE4-NEXT:    je .LBB17_12
> >  ; SSE4-NEXT:  .LBB17_11: # %cond.store9
> > -; SSE4-NEXT:    pextrb $10, %xmm0, 5(%rdi)
> > +; SSE4-NEXT:    pextrb $5, %xmm0, 5(%rdi)
> >  ; SSE4-NEXT:    testb $64, %al
> >  ; SSE4-NEXT:    je .LBB17_14
> >  ; SSE4-NEXT:  .LBB17_13: # %cond.store11
> > -; SSE4-NEXT:    pextrb $12, %xmm0, 6(%rdi)
> > +; SSE4-NEXT:    pextrb $6, %xmm0, 6(%rdi)
> >  ; SSE4-NEXT:    testb $-128, %al
> >  ; SSE4-NEXT:    je .LBB17_16
> >  ; SSE4-NEXT:  .LBB17_15: # %cond.store13
> > -; SSE4-NEXT:    pextrb $14, %xmm0, 7(%rdi)
> > +; SSE4-NEXT:    pextrb $7, %xmm0, 7(%rdi)
> >  ; SSE4-NEXT:    retq
> >  ;
> >  ; AVX-LABEL: truncstore_v8i16_v8i8:
> >  ; AVX:       # %bb.0:
> >  ; AVX-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> >  ; AVX-NEXT:    vpminuw {{.*}}(%rip), %xmm0, %xmm0
> > +; AVX-NEXT:    vpackuswb %xmm0, %xmm0, %xmm0
> >  ; AVX-NEXT:    vpcmpeqw %xmm2, %xmm1, %xmm1
> >  ; AVX-NEXT:    vpcmpeqd %xmm2, %xmm2, %xmm2
> >  ; AVX-NEXT:    vpxor %xmm2, %xmm1, %xmm1
> > @@ -7220,31 +7277,31 @@ define void @truncstore_v8i16_v8i8(<8 x
> >  ; AVX-NEXT:    testb $2, %al
> >  ; AVX-NEXT:    je .LBB17_4
> >  ; AVX-NEXT:  .LBB17_3: # %cond.store1
> > -; AVX-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> > +; AVX-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX-NEXT:    testb $4, %al
> >  ; AVX-NEXT:    je .LBB17_6
> >  ; AVX-NEXT:  .LBB17_5: # %cond.store3
> > -; AVX-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> > +; AVX-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> >  ; AVX-NEXT:    testb $8, %al
> >  ; AVX-NEXT:    je .LBB17_8
> >  ; AVX-NEXT:  .LBB17_7: # %cond.store5
> > -; AVX-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> > +; AVX-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> >  ; AVX-NEXT:    testb $16, %al
> >  ; AVX-NEXT:    je .LBB17_10
> >  ; AVX-NEXT:  .LBB17_9: # %cond.store7
> > -; AVX-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> > +; AVX-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
> >  ; AVX-NEXT:    testb $32, %al
> >  ; AVX-NEXT:    je .LBB17_12
> >  ; AVX-NEXT:  .LBB17_11: # %cond.store9
> > -; AVX-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> > +; AVX-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
> >  ; AVX-NEXT:    testb $64, %al
> >  ; AVX-NEXT:    je .LBB17_14
> >  ; AVX-NEXT:  .LBB17_13: # %cond.store11
> > -; AVX-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> > +; AVX-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
> >  ; AVX-NEXT:    testb $-128, %al
> >  ; AVX-NEXT:    je .LBB17_16
> >  ; AVX-NEXT:  .LBB17_15: # %cond.store13
> > -; AVX-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> > +; AVX-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
> >  ; AVX-NEXT:    retq
> >  ;
> >  ; AVX512F-LABEL: truncstore_v8i16_v8i8:
> > @@ -7255,6 +7312,7 @@ define void @truncstore_v8i16_v8i8(<8 x
> >  ; AVX512F-NEXT:    vpmovsxwq %xmm1, %zmm1
> >  ; AVX512F-NEXT:    vptestmq %zmm1, %zmm1, %k0
> >  ; AVX512F-NEXT:    vpminuw {{.*}}(%rip), %xmm0, %xmm0
> > +; AVX512F-NEXT:    vpackuswb %xmm0, %xmm0, %xmm0
> >  ; AVX512F-NEXT:    kmovw %k0, %eax
> >  ; AVX512F-NEXT:    testb $1, %al
> >  ; AVX512F-NEXT:    jne .LBB17_1
> > @@ -7287,31 +7345,31 @@ define void @truncstore_v8i16_v8i8(<8 x
> >  ; AVX512F-NEXT:    testb $2, %al
> >  ; AVX512F-NEXT:    je .LBB17_4
> >  ; AVX512F-NEXT:  .LBB17_3: # %cond.store1
> > -; AVX512F-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> > +; AVX512F-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> >  ; AVX512F-NEXT:    testb $4, %al
> >  ; AVX512F-NEXT:    je .LBB17_6
> >  ; AVX512F-NEXT:  .LBB17_5: # %cond.store3
> > -; AVX512F-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> > +; AVX512F-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> >  ; AVX512F-NEXT:    testb $8, %al
> >  ; AVX512F-NEXT:    je .LBB17_8
> >  ; AVX512F-NEXT:  .LBB17_7: # %cond.store5
> > -; AVX512F-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> > +; AVX512F-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> >  ; AVX512F-NEXT:    testb $16, %al
> >  ; AVX512F-NEXT:    je .LBB17_10
> >  ; AVX512F-NEXT:  .LBB17_9: # %cond.store7
> > -; AVX512F-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> > +; AVX512F-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
> >  ; AVX512F-NEXT:    testb $32, %al
> >  ; AVX512F-NEXT:    je .LBB17_12
> >  ; AVX512F-NEXT:  .LBB17_11: # %cond.store9
> > -; AVX512F-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> > +; AVX512F-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
> >  ; AVX512F-NEXT:    testb $64, %al
> >  ; AVX512F-NEXT:    je .LBB17_14
> >  ; AVX512F-NEXT:  .LBB17_13: # %cond.store11
> > -; AVX512F-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> > +; AVX512F-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
> >  ; AVX512F-NEXT:    testb $-128, %al
> >  ; AVX512F-NEXT:    je .LBB17_16
> >  ; AVX512F-NEXT:  .LBB17_15: # %cond.store13
> > -; AVX512F-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> > +; AVX512F-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
> >  ; AVX512F-NEXT:    vzeroupper
> >  ; AVX512F-NEXT:    retq
> >  ;
> > @@ -7319,10 +7377,10 @@ define void @truncstore_v8i16_v8i8(<8 x
> >  ; AVX512BW:       # %bb.0:
> >  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> >  ; AVX512BW-NEXT:    vptestmw %zmm1, %zmm1, %k0
> > -; AVX512BW-NEXT:    vpminuw {{.*}}(%rip), %xmm0, %xmm0
> > -; AVX512BW-NEXT:    vpackuswb %xmm0, %xmm0, %xmm0
> >  ; AVX512BW-NEXT:    kshiftlq $56, %k0, %k0
> >  ; AVX512BW-NEXT:    kshiftrq $56, %k0, %k1
> > +; AVX512BW-NEXT:    vpminuw {{.*}}(%rip), %xmm0, %xmm0
> > +; AVX512BW-NEXT:    vpackuswb %xmm0, %xmm0, %xmm0
> >  ; AVX512BW-NEXT:    vmovdqu8 %zmm0, (%rdi) {%k1}
> >  ; AVX512BW-NEXT:    vzeroupper
> >  ; AVX512BW-NEXT:    retq
> >
> > Modified: llvm/trunk/test/CodeGen/X86/merge-consecutive-loads-256.ll
> > URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/merge-consecutive-loads-256.ll?rev=368183&r1=368182&r2=368183&view=diff
> > ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/merge-consecutive-loads-256.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/merge-consecutive-loads-256.ll Wed Aug  7 09:24:26 2019
> > @@ -676,18 +676,18 @@ define <16 x i16> @merge_16i16_i16_0uu3z
> >  define <2 x i8> @PR42846(<2 x i8>* %j, <2 x i8> %k) {
> >  ; AVX-LABEL: PR42846:
> >  ; AVX:       # %bb.0:
> > -; AVX-NEXT:    vmovdqa {{.*}}(%rip), %ymm1
> > -; AVX-NEXT:    vpmovzxbq {{.*#+}} xmm0 = xmm1[0],zero,zero,zero,zero,zero,zero,zero,xmm1[1],zero,zero,zero,zero,zero,zero,zero
> > -; AVX-NEXT:    vpextrw $0, %xmm1, (%rdi)
> > +; AVX-NEXT:    vmovdqa {{.*}}(%rip), %ymm0
> > +; AVX-NEXT:    vpextrw $0, %xmm0, (%rdi)
> > +; AVX-NEXT:    # kill: def $xmm0 killed $xmm0 killed $ymm0
> >  ; AVX-NEXT:    vzeroupper
> >  ; AVX-NEXT:    retq
> >  ;
> >  ; X32-AVX-LABEL: PR42846:
> >  ; X32-AVX:       # %bb.0:
> >  ; X32-AVX-NEXT:    movl {{[0-9]+}}(%esp), %eax
> > -; X32-AVX-NEXT:    vmovdqa l, %ymm1
> > -; X32-AVX-NEXT:    vpmovzxbq {{.*#+}} xmm0 = xmm1[0],zero,zero,zero,zero,zero,zero,zero,xmm1[1],zero,zero,zero,zero,zero,zero,zero
> > -; X32-AVX-NEXT:    vpextrw $0, %xmm1, (%eax)
> > +; X32-AVX-NEXT:    vmovdqa l, %ymm0
> > +; X32-AVX-NEXT:    vpextrw $0, %xmm0, (%eax)
> > +; X32-AVX-NEXT:    # kill: def $xmm0 killed $xmm0 killed $ymm0
> >  ; X32-AVX-NEXT:    vzeroupper
> >  ; X32-AVX-NEXT:    retl
> >    %t0 = load volatile <32 x i8>, <32 x i8>* @l, align 32
> >
> > Modified: llvm/trunk/test/CodeGen/X86/mmx-arg-passing-x86-64.ll
> > URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/mmx-arg-passing-x86-64.ll?rev=368183&r1=368182&r2=368183&view=diff
> > ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/mmx-arg-passing-x86-64.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/mmx-arg-passing-x86-64.ll Wed Aug  7 09:24:26 2019
> > @@ -22,13 +22,12 @@ define void @t3() nounwind  {
> >  define void @t4(x86_mmx %v1, x86_mmx %v2) nounwind  {
> >  ; X86-64-LABEL: t4:
> >  ; X86-64:       ## %bb.0:
> > -; X86-64-NEXT:    movdq2q %xmm1, %mm0
> > -; X86-64-NEXT:    movq %mm0, -{{[0-9]+}}(%rsp)
> >  ; X86-64-NEXT:    movdq2q %xmm0, %mm0
> >  ; X86-64-NEXT:    movq %mm0, -{{[0-9]+}}(%rsp)
> > -; X86-64-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
> > -; X86-64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> > -; X86-64-NEXT:    paddb %xmm1, %xmm0
> > +; X86-64-NEXT:    movdq2q %xmm1, %mm0
> > +; X86-64-NEXT:    movq %mm0, -{{[0-9]+}}(%rsp)
> > +; X86-64-NEXT:    movdqa -{{[0-9]+}}(%rsp), %xmm0
> > +; X86-64-NEXT:    paddb -{{[0-9]+}}(%rsp), %xmm0
> >  ; X86-64-NEXT:    movb $1, %al
> >  ; X86-64-NEXT:    jmp _pass_v8qi ## TAILCALL
> >    %v1a = bitcast x86_mmx %v1 to <8 x i8>
> >
> > Modified: llvm/trunk/test/CodeGen/X86/mmx-arith.ll
> > URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/mmx-arith.ll?rev=368183&r1=368182&r2=368183&view=diff
> > ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/mmx-arith.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/mmx-arith.ll Wed Aug  7 09:24:26 2019
> > @@ -13,8 +13,8 @@ define void @test0(x86_mmx* %A, x86_mmx*
> >  ; X32-NEXT:    .cfi_offset %ebp, -8
> >  ; X32-NEXT:    movl %esp, %ebp
> >  ; X32-NEXT:    .cfi_def_cfa_register %ebp
> > -; X32-NEXT:    andl $-8, %esp
> > -; X32-NEXT:    subl $16, %esp
> > +; X32-NEXT:    andl $-16, %esp
> > +; X32-NEXT:    subl $48, %esp
> >  ; X32-NEXT:    movl 12(%ebp), %ecx
> >  ; X32-NEXT:    movl 8(%ebp), %eax
> >  ; X32-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> > @@ -26,7 +26,7 @@ define void @test0(x86_mmx* %A, x86_mmx*
> >  ; X32-NEXT:    movq %mm0, (%eax)
> >  ; X32-NEXT:    paddusb (%ecx), %mm0
> >  ; X32-NEXT:    movq %mm0, {{[0-9]+}}(%esp)
> > -; X32-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> > +; X32-NEXT:    movdqa {{[0-9]+}}(%esp), %xmm0
> >  ; X32-NEXT:    movq %mm0, (%eax)
> >  ; X32-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
> >  ; X32-NEXT:    psubb %xmm1, %xmm0
> > @@ -36,37 +36,24 @@ define void @test0(x86_mmx* %A, x86_mmx*
> >  ; X32-NEXT:    movq %mm0, (%eax)
> >  ; X32-NEXT:    psubusb (%ecx), %mm0
> >  ; X32-NEXT:    movq %mm0, (%esp)
> > -; X32-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> > -; X32-NEXT:    punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
> > +; X32-NEXT:    movdqa (%esp), %xmm0
> >  ; X32-NEXT:    movq %mm0, (%eax)
> >  ; X32-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
> >  ; X32-NEXT:    punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3],xmm1[4],xmm0[4],xmm1[5],xmm0[5],xmm1[6],xmm0[6],xmm1[7],xmm0[7]
> > -; X32-NEXT:    pmullw %xmm0, %xmm1
> > -; X32-NEXT:    movdqa {{.*#+}} xmm0 = [255,0,255,0,255,0,255,0,255,0,255,0,255,0,255,0]
> > -; X32-NEXT:    movdqa %xmm1, %xmm2
> > -; X32-NEXT:    pand %xmm0, %xmm2
> > -; X32-NEXT:    packuswb %xmm0, %xmm2
> > -; X32-NEXT:    movq %xmm2, (%eax)
> > -; X32-NEXT:    movq {{.*#+}} xmm2 = mem[0],zero
> > -; X32-NEXT:    punpcklbw {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1],xmm2[2],xmm0[2],xmm2[3],xmm0[3],xmm2[4],xmm0[4],xmm2[5],xmm0[5],xmm2[6],xmm0[6],xmm2[7],xmm0[7]
> > -; X32-NEXT:    pand %xmm1, %xmm2
> > -; X32-NEXT:    movdqa %xmm2, %xmm1
> > +; X32-NEXT:    punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
> > +; X32-NEXT:    pmullw %xmm1, %xmm0
> > +; X32-NEXT:    pand {{\.LCPI.*}}, %xmm0
> > +; X32-NEXT:    packuswb %xmm0, %xmm0
> > +; X32-NEXT:    movq %xmm0, (%eax)
> > +; X32-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
> >  ; X32-NEXT:    pand %xmm0, %xmm1
> > -; X32-NEXT:    packuswb %xmm0, %xmm1
> >  ; X32-NEXT:    movq %xmm1, (%eax)
> > +; X32-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> > +; X32-NEXT:    por %xmm1, %xmm0
> > +; X32-NEXT:    movq %xmm0, (%eax)
> >  ; X32-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
> > -; X32-NEXT:    punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3],xmm1[4],xmm0[4],xmm1[5],xmm0[5],xmm1[6],xmm0[6],xmm1[7],xmm0[7]
> > -; X32-NEXT:    por %xmm2, %xmm1
> > -; X32-NEXT:    movdqa %xmm1, %xmm2
> > -; X32-NEXT:    pand %xmm0, %xmm2
> > -; X32-NEXT:    packuswb %xmm0, %xmm2
> > -; X32-NEXT:    movq %xmm2, (%eax)
> > -; X32-NEXT:    movq {{.*#+}} xmm2 = mem[0],zero
> > -; X32-NEXT:    punpcklbw {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1],xmm2[2],xmm0[2],xmm2[3],xmm0[3],xmm2[4],xmm0[4],xmm2[5],xmm0[5],xmm2[6],xmm0[6],xmm2[7],xmm0[7]
> > -; X32-NEXT:    pxor %xmm1, %xmm2
> > -; X32-NEXT:    pand %xmm0, %xmm2
> > -; X32-NEXT:    packuswb %xmm0, %xmm2
> > -; X32-NEXT:    movq %xmm2, (%eax)
> > +; X32-NEXT:    pxor %xmm0, %xmm1
> > +; X32-NEXT:    movq %xmm1, (%eax)
> >  ; X32-NEXT:    emms
> >  ; X32-NEXT:    movl %ebp, %esp
> >  ; X32-NEXT:    popl %ebp
> > @@ -84,7 +71,7 @@ define void @test0(x86_mmx* %A, x86_mmx*
> >  ; X64-NEXT:    movq %mm0, (%rdi)
> >  ; X64-NEXT:    paddusb (%rsi), %mm0
> >  ; X64-NEXT:    movq %mm0, -{{[0-9]+}}(%rsp)
> > -; X64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> > +; X64-NEXT:    movdqa -{{[0-9]+}}(%rsp), %xmm0
> >  ; X64-NEXT:    movq %mm0, (%rdi)
> >  ; X64-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
> >  ; X64-NEXT:    psubb %xmm1, %xmm0
> > @@ -94,37 +81,24 @@ define void @test0(x86_mmx* %A, x86_mmx*
> >  ; X64-NEXT:    movq %mm0, (%rdi)
> >  ; X64-NEXT:    psubusb (%rsi), %mm0
> >  ; X64-NEXT:    movq %mm0, -{{[0-9]+}}(%rsp)
> > -; X64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> > -; X64-NEXT:    punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
> > +; X64-NEXT:    movdqa -{{[0-9]+}}(%rsp), %xmm0
> >  ; X64-NEXT:    movq %mm0, (%rdi)
> >  ; X64-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
> >  ; X64-NEXT:    punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3],xmm1[4],xmm0[4],xmm1[5],xmm0[5],xmm1[6],xmm0[6],xmm1[7],xmm0[7]
> > -; X64-NEXT:    pmullw %xmm0, %xmm1
> > -; X64-NEXT:    movdqa {{.*#+}} xmm0 = [255,0,255,0,255,0,255,0,255,0,255,0,255,0,255,0]
> > -; X64-NEXT:    movdqa %xmm1, %xmm2
> > -; X64-NEXT:    pand %xmm0, %xmm2
> > -; X64-NEXT:    packuswb %xmm0, %xmm2
> > -; X64-NEXT:    movq %xmm2, (%rdi)
> > -; X64-NEXT:    movq {{.*#+}} xmm2 = mem[0],zero
> > -; X64-NEXT:    punpcklbw {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1],xmm2[2],xmm0[2],xmm2[3],xmm0[3],xmm2[4],xmm0[4],xmm2[5],xmm0[5],xmm2[6],xmm0[6],xmm2[7],xmm0[7]
> > -; X64-NEXT:    pand %xmm1, %xmm2
> > -; X64-NEXT:    movdqa %xmm2, %xmm1
> > +; X64-NEXT:    punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
> > +; X64-NEXT:    pmullw %xmm1, %xmm0
> > +; X64-NEXT:    pand {{.*}}(%rip), %xmm0
> > +; X64-NEXT:    packuswb %xmm0, %xmm0
> > +; X64-NEXT:    movq %xmm0, (%rdi)
> > +; X64-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
> >  ; X64-NEXT:    pand %xmm0, %xmm1
> > -; X64-NEXT:    packuswb %xmm0, %xmm1
> >  ; X64-NEXT:    movq %xmm1, (%rdi)
> > +; X64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> > +; X64-NEXT:    por %xmm1, %xmm0
> > +; X64-NEXT:    movq %xmm0, (%rdi)
> >  ; X64-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
> > -; X64-NEXT:    punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3],xmm1[4],xmm0[4],xmm1[5],xmm0[5],xmm1[6],xmm0[6],xmm1[7],xmm0[7]
> > -; X64-NEXT:    por %xmm2, %xmm1
> > -; X64-NEXT:    movdqa %xmm1, %xmm2
> > -; X64-NEXT:    pand %xmm0, %xmm2
> > -; X64-NEXT:    packuswb %xmm0, %xmm2
> > -; X64-NEXT:    movq %xmm2, (%rdi)
> > -; X64-NEXT:    movq {{.*#+}} xmm2 = mem[0],zero
> > -; X64-NEXT:    punpcklbw {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1],xmm2[2],xmm0[2],xmm2[3],xmm0[3],xmm2[4],xmm0[4],xmm2[5],xmm0[5],xmm2[6],xmm0[6],xmm2[7],xmm0[7]
> > -; X64-NEXT:    pxor %xmm1, %xmm2
> > -; X64-NEXT:    pand %xmm0, %xmm2
> > -; X64-NEXT:    packuswb %xmm0, %xmm2
> > -; X64-NEXT:    movq %xmm2, (%rdi)
> > +; X64-NEXT:    pxor %xmm0, %xmm1
> > +; X64-NEXT:    movq %xmm1, (%rdi)
> >  ; X64-NEXT:    emms
> >  ; X64-NEXT:    retq
> >  entry:
> > @@ -182,66 +156,56 @@ entry:
> >  define void @test1(x86_mmx* %A, x86_mmx* %B) {
> >  ; X32-LABEL: test1:
> >  ; X32:       # %bb.0: # %entry
> > -; X32-NEXT:    movl {{[0-9]+}}(%esp), %ecx
> >  ; X32-NEXT:    movl {{[0-9]+}}(%esp), %eax
> > -; X32-NEXT:    movsd {{.*#+}} xmm0 = mem[0],zero
> > -; X32-NEXT:    shufps {{.*#+}} xmm0 = xmm0[0,1,1,3]
> > -; X32-NEXT:    movsd {{.*#+}} xmm1 = mem[0],zero
> > -; X32-NEXT:    shufps {{.*#+}} xmm1 = xmm1[0,1,1,3]
> > -; X32-NEXT:    paddq %xmm0, %xmm1
> > -; X32-NEXT:    pshufd {{.*#+}} xmm0 = xmm1[0,2,2,3]
> > -; X32-NEXT:    movq %xmm0, (%eax)
> > -; X32-NEXT:    movsd {{.*#+}} xmm0 = mem[0],zero
> > -; X32-NEXT:    shufps {{.*#+}} xmm0 = xmm0[0,1,1,3]
> > -; X32-NEXT:    pmuludq %xmm1, %xmm0
> > -; X32-NEXT:    pshufd {{.*#+}} xmm1 = xmm0[0,2,2,3]
> > -; X32-NEXT:    movq %xmm1, (%eax)
> > -; X32-NEXT:    movsd {{.*#+}} xmm1 = mem[0],zero
> > -; X32-NEXT:    shufps {{.*#+}} xmm1 = xmm1[0,1,1,3]
> > -; X32-NEXT:    andps %xmm0, %xmm1
> > -; X32-NEXT:    pshufd {{.*#+}} xmm0 = xmm1[0,2,2,3]
> > -; X32-NEXT:    movq %xmm0, (%eax)
> > -; X32-NEXT:    movsd {{.*#+}} xmm0 = mem[0],zero
> > -; X32-NEXT:    shufps {{.*#+}} xmm0 = xmm0[0,1,1,3]
> > -; X32-NEXT:    orps %xmm1, %xmm0
> > -; X32-NEXT:    pshufd {{.*#+}} xmm1 = xmm0[0,2,2,3]
> > -; X32-NEXT:    movq %xmm1, (%eax)
> > -; X32-NEXT:    movsd {{.*#+}} xmm1 = mem[0],zero
> > -; X32-NEXT:    shufps {{.*#+}} xmm1 = xmm1[0,1,1,3]
> > -; X32-NEXT:    xorps %xmm0, %xmm1
> > -; X32-NEXT:    pshufd {{.*#+}} xmm0 = xmm1[0,2,2,3]
> > -; X32-NEXT:    movq %xmm0, (%eax)
> > +; X32-NEXT:    movl {{[0-9]+}}(%esp), %ecx
> > +; X32-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> > +; X32-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
> > +; X32-NEXT:    paddd %xmm0, %xmm1
> > +; X32-NEXT:    movq %xmm1, (%ecx)
> > +; X32-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> > +; X32-NEXT:    pshufd {{.*#+}} xmm2 = xmm1[1,1,3,3]
> > +; X32-NEXT:    pmuludq %xmm0, %xmm1
> > +; X32-NEXT:    shufps {{.*#+}} xmm0 = xmm0[1,1,2,3]
> > +; X32-NEXT:    pmuludq %xmm0, %xmm2
> > +; X32-NEXT:    pshufd {{.*#+}} xmm0 = xmm2[0,2,2,3]
> > +; X32-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
> > +; X32-NEXT:    punpckldq {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1]
> > +; X32-NEXT:    movq %xmm1, (%ecx)
> > +; X32-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> > +; X32-NEXT:    pand %xmm1, %xmm0
> > +; X32-NEXT:    movq %xmm0, (%ecx)
> > +; X32-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
> > +; X32-NEXT:    por %xmm0, %xmm1
> > +; X32-NEXT:    movq %xmm1, (%ecx)
> > +; X32-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> > +; X32-NEXT:    pxor %xmm1, %xmm0
> > +; X32-NEXT:    movq %xmm0, (%ecx)
> >  ; X32-NEXT:    emms
> >  ; X32-NEXT:    retl
> >  ;
> >  ; X64-LABEL: test1:
> >  ; X64:       # %bb.0: # %entry
> >  ; X64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> > -; X64-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
> >  ; X64-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
> > -; X64-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[0,1,1,3]
> > -; X64-NEXT:    paddq %xmm0, %xmm1
> > -; X64-NEXT:    pshufd {{.*#+}} xmm0 = xmm1[0,2,2,3]
> > -; X64-NEXT:    movq %xmm0, (%rdi)
> > -; X64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> > -; X64-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
> > -; X64-NEXT:    pmuludq %xmm1, %xmm0
> > -; X64-NEXT:    pshufd {{.*#+}} xmm1 = xmm0[0,2,2,3]
> > +; X64-NEXT:    paddd %xmm0, %xmm1
> >  ; X64-NEXT:    movq %xmm1, (%rdi)
> > -; X64-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
> > -; X64-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[0,1,1,3]
> > -; X64-NEXT:    pand %xmm0, %xmm1
> > -; X64-NEXT:    pshufd {{.*#+}} xmm0 = xmm1[0,2,2,3]
> > -; X64-NEXT:    movq %xmm0, (%rdi)
> >  ; X64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> > -; X64-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
> > -; X64-NEXT:    por %xmm1, %xmm0
> > -; X64-NEXT:    pshufd {{.*#+}} xmm1 = xmm0[0,2,2,3]
> > +; X64-NEXT:    pshufd {{.*#+}} xmm2 = xmm1[1,1,3,3]
> > +; X64-NEXT:    pmuludq %xmm0, %xmm1
> > +; X64-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
> > +; X64-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,1,3,3]
> > +; X64-NEXT:    pmuludq %xmm2, %xmm0
> > +; X64-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > +; X64-NEXT:    punpckldq {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1]
> >  ; X64-NEXT:    movq %xmm1, (%rdi)
> > +; X64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> > +; X64-NEXT:    pand %xmm1, %xmm0
> > +; X64-NEXT:    movq %xmm0, (%rdi)
> >  ; X64-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
> > -; X64-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[0,1,1,3]
> > -; X64-NEXT:    pxor %xmm0, %xmm1
> > -; X64-NEXT:    pshufd {{.*#+}} xmm0 = xmm1[0,2,2,3]
> > +; X64-NEXT:    por %xmm0, %xmm1
> > +; X64-NEXT:    movq %xmm1, (%rdi)
> > +; X64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> > +; X64-NEXT:    pxor %xmm1, %xmm0
> >  ; X64-NEXT:    movq %xmm0, (%rdi)
> >  ; X64-NEXT:    emms
> >  ; X64-NEXT:    retq
> > @@ -294,8 +258,8 @@ define void @test2(x86_mmx* %A, x86_mmx*
> >  ; X32-NEXT:    .cfi_offset %ebp, -8
> >  ; X32-NEXT:    movl %esp, %ebp
> >  ; X32-NEXT:    .cfi_def_cfa_register %ebp
> > -; X32-NEXT:    andl $-8, %esp
> > -; X32-NEXT:    subl $24, %esp
> > +; X32-NEXT:    andl $-16, %esp
> > +; X32-NEXT:    subl $64, %esp
> >  ; X32-NEXT:    movl 12(%ebp), %ecx
> >  ; X32-NEXT:    movl 8(%ebp), %eax
> >  ; X32-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> > @@ -307,7 +271,7 @@ define void @test2(x86_mmx* %A, x86_mmx*
> >  ; X32-NEXT:    movq %mm0, (%eax)
> >  ; X32-NEXT:    paddusw (%ecx), %mm0
> >  ; X32-NEXT:    movq %mm0, {{[0-9]+}}(%esp)
> > -; X32-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> > +; X32-NEXT:    movdqa {{[0-9]+}}(%esp), %xmm0
> >  ; X32-NEXT:    movq %mm0, (%eax)
> >  ; X32-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
> >  ; X32-NEXT:    psubw %xmm1, %xmm0
> > @@ -317,40 +281,25 @@ define void @test2(x86_mmx* %A, x86_mmx*
> >  ; X32-NEXT:    movq %mm0, (%eax)
> >  ; X32-NEXT:    psubusw (%ecx), %mm0
> >  ; X32-NEXT:    movq %mm0, {{[0-9]+}}(%esp)
> > -; X32-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> >  ; X32-NEXT:    movq %mm0, (%eax)
> > -; X32-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
> > -; X32-NEXT:    pmullw %xmm0, %xmm1
> > -; X32-NEXT:    movdq2q %xmm1, %mm0
> > -; X32-NEXT:    movq %xmm1, (%eax)
> > +; X32-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> > +; X32-NEXT:    pmullw {{[0-9]+}}(%esp), %xmm0
> > +; X32-NEXT:    movdq2q %xmm0, %mm0
> > +; X32-NEXT:    movq %xmm0, (%eax)
> >  ; X32-NEXT:    pmulhw (%ecx), %mm0
> >  ; X32-NEXT:    movq %mm0, (%eax)
> >  ; X32-NEXT:    pmaddwd (%ecx), %mm0
> >  ; X32-NEXT:    movq %mm0, (%esp)
> > -; X32-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> > -; X32-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3]
> >  ; X32-NEXT:    movq %mm0, (%eax)
> > -; X32-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
> > -; X32-NEXT:    punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3]
> > -; X32-NEXT:    pand %xmm0, %xmm1
> > -; X32-NEXT:    pshuflw {{.*#+}} xmm0 = xmm1[0,2,2,3,4,5,6,7]
> > -; X32-NEXT:    pshufhw {{.*#+}} xmm0 = xmm0[0,1,2,3,4,6,6,7]
> > -; X32-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > -; X32-NEXT:    movq %xmm0, (%eax)
> > -; X32-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> > -; X32-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3]
> > -; X32-NEXT:    por %xmm1, %xmm0
> > -; X32-NEXT:    pshuflw {{.*#+}} xmm1 = xmm0[0,2,2,3,4,5,6,7]
> > -; X32-NEXT:    pshufhw {{.*#+}} xmm1 = xmm1[0,1,2,3,4,6,6,7]
> > -; X32-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
> > -; X32-NEXT:    movq %xmm1, (%eax)
> > -; X32-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
> > -; X32-NEXT:    punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3]
> > -; X32-NEXT:    pxor %xmm0, %xmm1
> > -; X32-NEXT:    pshuflw {{.*#+}} xmm0 = xmm1[0,2,2,3,4,5,6,7]
> > -; X32-NEXT:    pshufhw {{.*#+}} xmm0 = xmm0[0,1,2,3,4,6,6,7]
> > -; X32-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > -; X32-NEXT:    movq %xmm0, (%eax)
> > +; X32-NEXT:    movsd {{.*#+}} xmm0 = mem[0],zero
> > +; X32-NEXT:    andps (%esp), %xmm0
> > +; X32-NEXT:    movlps %xmm0, (%eax)
> > +; X32-NEXT:    movsd {{.*#+}} xmm1 = mem[0],zero
> > +; X32-NEXT:    orps %xmm0, %xmm1
> > +; X32-NEXT:    movlps %xmm1, (%eax)
> > +; X32-NEXT:    movsd {{.*#+}} xmm0 = mem[0],zero
> > +; X32-NEXT:    xorps %xmm1, %xmm0
> > +; X32-NEXT:    movlps %xmm0, (%eax)
> >  ; X32-NEXT:    emms
> >  ; X32-NEXT:    movl %ebp, %esp
> >  ; X32-NEXT:    popl %ebp
> > @@ -368,7 +317,7 @@ define void @test2(x86_mmx* %A, x86_mmx*
> >  ; X64-NEXT:    movq %mm0, (%rdi)
> >  ; X64-NEXT:    paddusw (%rsi), %mm0
> >  ; X64-NEXT:    movq %mm0, -{{[0-9]+}}(%rsp)
> > -; X64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> > +; X64-NEXT:    movdqa -{{[0-9]+}}(%rsp), %xmm0
> >  ; X64-NEXT:    movq %mm0, (%rdi)
> >  ; X64-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
> >  ; X64-NEXT:    psubw %xmm1, %xmm0
> > @@ -378,40 +327,25 @@ define void @test2(x86_mmx* %A, x86_mmx*
> >  ; X64-NEXT:    movq %mm0, (%rdi)
> >  ; X64-NEXT:    psubusw (%rsi), %mm0
> >  ; X64-NEXT:    movq %mm0, -{{[0-9]+}}(%rsp)
> > -; X64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> >  ; X64-NEXT:    movq %mm0, (%rdi)
> > -; X64-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
> > -; X64-NEXT:    pmullw %xmm0, %xmm1
> > -; X64-NEXT:    movdq2q %xmm1, %mm0
> > -; X64-NEXT:    movq %xmm1, (%rdi)
> > +; X64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> > +; X64-NEXT:    pmullw -{{[0-9]+}}(%rsp), %xmm0
> > +; X64-NEXT:    movdq2q %xmm0, %mm0
> > +; X64-NEXT:    movq %xmm0, (%rdi)
> >  ; X64-NEXT:    pmulhw (%rsi), %mm0
> >  ; X64-NEXT:    movq %mm0, (%rdi)
> >  ; X64-NEXT:    pmaddwd (%rsi), %mm0
> >  ; X64-NEXT:    movq %mm0, -{{[0-9]+}}(%rsp)
> > -; X64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> > -; X64-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3]
> >  ; X64-NEXT:    movq %mm0, (%rdi)
> > -; X64-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
> > -; X64-NEXT:    punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3]
> > -; X64-NEXT:    pand %xmm0, %xmm1
> > -; X64-NEXT:    pshuflw {{.*#+}} xmm0 = xmm1[0,2,2,3,4,5,6,7]
> > -; X64-NEXT:    pshufhw {{.*#+}} xmm0 = xmm0[0,1,2,3,4,6,6,7]
> > -; X64-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > -; X64-NEXT:    movq %xmm0, (%rdi)
> > -; X64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> > -; X64-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3]
> > -; X64-NEXT:    por %xmm1, %xmm0
> > -; X64-NEXT:    pshuflw {{.*#+}} xmm1 = xmm0[0,2,2,3,4,5,6,7]
> > -; X64-NEXT:    pshufhw {{.*#+}} xmm1 = xmm1[0,1,2,3,4,6,6,7]
> > -; X64-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
> > -; X64-NEXT:    movq %xmm1, (%rdi)
> > -; X64-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
> > -; X64-NEXT:    punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3]
> > -; X64-NEXT:    pxor %xmm0, %xmm1
> > -; X64-NEXT:    pshuflw {{.*#+}} xmm0 = xmm1[0,2,2,3,4,5,6,7]
> > -; X64-NEXT:    pshufhw {{.*#+}} xmm0 = xmm0[0,1,2,3,4,6,6,7]
> > -; X64-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > -; X64-NEXT:    movq %xmm0, (%rdi)
> > +; X64-NEXT:    movsd {{.*#+}} xmm0 = mem[0],zero
> > +; X64-NEXT:    andps -{{[0-9]+}}(%rsp), %xmm0
> > +; X64-NEXT:    movlps %xmm0, (%rdi)
> > +; X64-NEXT:    movsd {{.*#+}} xmm1 = mem[0],zero
> > +; X64-NEXT:    orps %xmm0, %xmm1
> > +; X64-NEXT:    movlps %xmm1, (%rdi)
> > +; X64-NEXT:    movsd {{.*#+}} xmm0 = mem[0],zero
> > +; X64-NEXT:    xorps %xmm1, %xmm0
> > +; X64-NEXT:    movlps %xmm0, (%rdi)
> >  ; X64-NEXT:    emms
> >  ; X64-NEXT:    retq
> >  entry:
> > @@ -479,45 +413,34 @@ define <1 x i64> @test3(<1 x i64>* %a, <
> >  ; X32-LABEL: test3:
> >  ; X32:       # %bb.0: # %entry
> >  ; X32-NEXT:    pushl %ebp
> > -; X32-NEXT:    movl %esp, %ebp
> >  ; X32-NEXT:    pushl %ebx
> >  ; X32-NEXT:    pushl %edi
> >  ; X32-NEXT:    pushl %esi
> > -; X32-NEXT:    andl $-8, %esp
> > -; X32-NEXT:    subl $16, %esp
> > -; X32-NEXT:    cmpl $0, 16(%ebp)
> > +; X32-NEXT:    cmpl $0, {{[0-9]+}}(%esp)
> >  ; X32-NEXT:    je .LBB3_1
> >  ; X32-NEXT:  # %bb.2: # %bb26.preheader
> > +; X32-NEXT:    movl {{[0-9]+}}(%esp), %esi
> > +; X32-NEXT:    movl {{[0-9]+}}(%esp), %edi
> >  ; X32-NEXT:    xorl %ebx, %ebx
> >  ; X32-NEXT:    xorl %eax, %eax
> >  ; X32-NEXT:    xorl %edx, %edx
> >  ; X32-NEXT:    .p2align 4, 0x90
> >  ; X32-NEXT:  .LBB3_3: # %bb26
> >  ; X32-NEXT:    # =>This Inner Loop Header: Depth=1
> > -; X32-NEXT:    movl 8(%ebp), %ecx
> > -; X32-NEXT:    movl %ecx, %esi
> > -; X32-NEXT:    movl (%ecx,%ebx,8), %ecx
> > -; X32-NEXT:    movl 4(%esi,%ebx,8), %esi
> > -; X32-NEXT:    movl 12(%ebp), %edi
> > -; X32-NEXT:    addl (%edi,%ebx,8), %ecx
> > -; X32-NEXT:    adcl 4(%edi,%ebx,8), %esi
> > -; X32-NEXT:    addl %eax, %ecx
> > -; X32-NEXT:    movl %ecx, (%esp)
> > -; X32-NEXT:    adcl %edx, %esi
> > -; X32-NEXT:    movl %esi, {{[0-9]+}}(%esp)
> > -; X32-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> > -; X32-NEXT:    movd %xmm0, %eax
> > -; X32-NEXT:    shufps {{.*#+}} xmm0 = xmm0[1,1,0,1]
> > -; X32-NEXT:    movd %xmm0, %edx
> > +; X32-NEXT:    movl (%edi,%ebx,8), %ebp
> > +; X32-NEXT:    movl 4(%edi,%ebx,8), %ecx
> > +; X32-NEXT:    addl (%esi,%ebx,8), %ebp
> > +; X32-NEXT:    adcl 4(%esi,%ebx,8), %ecx
> > +; X32-NEXT:    addl %ebp, %eax
> > +; X32-NEXT:    adcl %ecx, %edx
> >  ; X32-NEXT:    incl %ebx
> > -; X32-NEXT:    cmpl 16(%ebp), %ebx
> > +; X32-NEXT:    cmpl {{[0-9]+}}(%esp), %ebx
> >  ; X32-NEXT:    jb .LBB3_3
> >  ; X32-NEXT:    jmp .LBB3_4
> >  ; X32-NEXT:  .LBB3_1:
> >  ; X32-NEXT:    xorl %eax, %eax
> >  ; X32-NEXT:    xorl %edx, %edx
> >  ; X32-NEXT:  .LBB3_4: # %bb31
> > -; X32-NEXT:    leal -12(%ebp), %esp
> >  ; X32-NEXT:    popl %esi
> >  ; X32-NEXT:    popl %edi
> >  ; X32-NEXT:    popl %ebx
> >
> > Modified: llvm/trunk/test/CodeGen/X86/mmx-cvt.ll
> > URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/mmx-cvt.ll?rev=368183&r1=368182&r2=368183&view=diff
> > ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/mmx-cvt.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/mmx-cvt.ll Wed Aug  7 09:24:26 2019
> > @@ -296,8 +296,8 @@ define <4 x float> @sitofp_v2i32_v2f32(<
> >  ; X86:       # %bb.0:
> >  ; X86-NEXT:    pushl %ebp
> >  ; X86-NEXT:    movl %esp, %ebp
> > -; X86-NEXT:    andl $-8, %esp
> > -; X86-NEXT:    subl $8, %esp
> > +; X86-NEXT:    andl $-16, %esp
> > +; X86-NEXT:    subl $32, %esp
> >  ; X86-NEXT:    movl 8(%ebp), %eax
> >  ; X86-NEXT:    movq (%eax), %mm0
> >  ; X86-NEXT:    paddd %mm0, %mm0
> >
> > Modified: llvm/trunk/test/CodeGen/X86/mulvi32.ll
> > URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/mulvi32.ll?rev=368183&r1=368182&r2=368183&view=diff
> > ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/mulvi32.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/mulvi32.ll Wed Aug  7 09:24:26 2019
> > @@ -7,36 +7,39 @@
> >  ; PR6399
> >
> >  define <2 x i32> @_mul2xi32a(<2 x i32>, <2 x i32>) {
> > -; SSE-LABEL: _mul2xi32a:
> > -; SSE:       # %bb.0:
> > -; SSE-NEXT:    pmuludq %xmm1, %xmm0
> > -; SSE-NEXT:    retq
> > +; SSE2-LABEL: _mul2xi32a:
> > +; SSE2:       # %bb.0:
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm2 = xmm0[1,1,3,3]
> > +; SSE2-NEXT:    pmuludq %xmm1, %xmm0
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
> > +; SSE2-NEXT:    pmuludq %xmm2, %xmm1
> > +; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
> > +; SSE2-NEXT:    punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
> > +; SSE2-NEXT:    retq
> > +;
> > +; SSE42-LABEL: _mul2xi32a:
> > +; SSE42:       # %bb.0:
> > +; SSE42-NEXT:    pmulld %xmm1, %xmm0
> > +; SSE42-NEXT:    retq
> >  ;
> >  ; AVX-LABEL: _mul2xi32a:
> >  ; AVX:       # %bb.0:
> > -; AVX-NEXT:    vpmuludq %xmm1, %xmm0, %xmm0
> > +; AVX-NEXT:    vpmulld %xmm1, %xmm0, %xmm0
> >  ; AVX-NEXT:    retq
> >    %r = mul <2 x i32> %0, %1
> >    ret <2 x i32> %r
> >  }
> >
> >  define <2 x i32> @_mul2xi32b(<2 x i32>, <2 x i32>) {
> > -; SSE2-LABEL: _mul2xi32b:
> > -; SSE2:       # %bb.0:
> > -; SSE2-NEXT:    pmuludq %xmm1, %xmm0
> > -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
> > -; SSE2-NEXT:    retq
> > -;
> > -; SSE42-LABEL: _mul2xi32b:
> > -; SSE42:       # %bb.0:
> > -; SSE42-NEXT:    pmuludq %xmm1, %xmm0
> > -; SSE42-NEXT:    pmovzxdq {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero
> > -; SSE42-NEXT:    retq
> > +; SSE-LABEL: _mul2xi32b:
> > +; SSE:       # %bb.0:
> > +; SSE-NEXT:    pmuludq %xmm1, %xmm0
> > +; SSE-NEXT:    retq
> >  ;
> >  ; AVX-LABEL: _mul2xi32b:
> >  ; AVX:       # %bb.0:
> >  ; AVX-NEXT:    vpmuludq %xmm1, %xmm0, %xmm0
> > -; AVX-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero
> >  ; AVX-NEXT:    retq
> >    %factor0 = shufflevector <2 x i32> %0, <2 x i32> undef, <4 x i32> <i32 0, i32 undef, i32 2, i32 undef>
> >    %factor1 = shufflevector <2 x i32> %1, <2 x i32> undef, <4 x i32> <i32 0, i32 undef, i32 2, i32 undef>
> > @@ -153,8 +156,8 @@ define <4 x i64> @_mul4xi32toi64a(<4 x i
> >  ;
> >  ; AVX1-LABEL: _mul4xi32toi64a:
> >  ; AVX1:       # %bb.0:
> > -; AVX1-NEXT:    vpshufd {{.*#+}} xmm2 = xmm1[2,2,3,3]
> > -; AVX1-NEXT:    vpshufd {{.*#+}} xmm3 = xmm0[2,2,3,3]
> > +; AVX1-NEXT:    vpshufd {{.*#+}} xmm2 = xmm1[2,1,3,3]
> > +; AVX1-NEXT:    vpshufd {{.*#+}} xmm3 = xmm0[2,1,3,3]
> >  ; AVX1-NEXT:    vpmuludq %xmm2, %xmm3, %xmm2
> >  ; AVX1-NEXT:    vpmovzxdq {{.*#+}} xmm1 = xmm1[0],zero,xmm1[1],zero
> >  ; AVX1-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero
> >
> >
> > _______________________________________________
> > llvm-commits mailing list
> > llvm-commits at lists.llvm.org
> > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
>