[llvm] r368183 - Recommit r367901 "[X86] Enable -x86-experimental-vector-widening-legalization by default."

Eric Christopher via llvm-commits llvm-commits at lists.llvm.org
Mon Aug 19 21:55:27 PDT 2019


Hi Craig,

We're seeing rather a lot of performance regressions with this enabled
by default. Is it possible to put it back behind a command-line flag
(off by default) for the near term while we work on getting you a pile
of testcases? Some of them come from Eigen, and those will at least be
easier as you have access to that source :)

Thoughts?

Thanks!

-eric

On Wed, Aug 7, 2019 at 9:23 AM Craig Topper via llvm-commits
<llvm-commits at lists.llvm.org> wrote:
>
> Author: ctopper
> Date: Wed Aug  7 09:24:26 2019
> New Revision: 368183
>
> URL: http://llvm.org/viewvc/llvm-project?rev=368183&view=rev
> Log:
> Recommit r367901 "[X86] Enable -x86-experimental-vector-widening-legalization by default."
>
> The assert that caused this to be reverted should be fixed now.
>
> Original commit message:
>
> This patch changes our default legalization behavior for 16, 32, and
> 64 bit vectors with i8/i16/i32/i64 scalar types from promotion to
> widening. For example, v8i8 will now be widened to v16i8 instead of
> promoted to v8i16. This keeps the element widths the same and pads
> with undef elements. We believe this is a better legalization strategy.
> But it carries some issues due to the fragmented vector ISA. For
> example, i8 shifts and multiplies get widened and then later have
> to be promoted/split into vXi16 vectors.
>
> This has the potential to cause regressions so we wanted to get
> it in early in the 10.0 cycle so we have plenty of time to
> address them.
>
> Next steps will be to merge tests that explicitly test the command
> line option. And then we can remove the option and its associated
> code.
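
For readers less familiar with the two strategies described in the commit message, here is a minimal sketch of how promotion and widening differ for a narrow integer vector. It is a toy model only: `VecTy`, `RegBits`, `promote`, and `widen` are hypothetical names invented for illustration, a single 128-bit legal register width is assumed, and none of this is LLVM's actual type-legalization code.

```cpp
#include <cassert>
#include <string>

// Hypothetical toy type for illustration; this is not LLVM's MVT or EVT.
struct VecTy {
  unsigned NumElts; // number of elements
  unsigned EltBits; // width of each integer element in bits
  unsigned totalBits() const { return NumElts * EltBits; }
  std::string str() const {
    return "v" + std::to_string(NumElts) + "i" + std::to_string(EltBits);
  }
};

// Assumption: the narrowest legal vector register is 128 bits wide.
constexpr unsigned RegBits = 128;

// Promotion keeps the element count and grows each element until the
// vector fills the register (v8i8 becomes v8i16).
VecTy promote(VecTy T) {
  assert(T.totalBits() < RegBits && RegBits % T.NumElts == 0);
  return {T.NumElts, RegBits / T.NumElts};
}

// Widening keeps the element width and appends extra (undef) elements
// until the vector fills the register (v8i8 becomes v16i8).
VecTy widen(VecTy T) {
  assert(T.totalBits() < RegBits && RegBits % T.EltBits == 0);
  return {RegBits / T.EltBits, T.EltBits};
}
```

Under these assumptions, `promote(VecTy{2, 32})` gives v2i64 while `widen(VecTy{2, 32})` gives v4i32, which matches the comment change in X86TargetTransformInfo.cpp further down in the diff.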
>
> Removed:
>     llvm/trunk/test/Analysis/CostModel/X86/reduce-add-widen.ll
> Modified:
>     llvm/trunk/lib/Target/X86/X86ISelLowering.cpp
>     llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp
>     llvm/trunk/test/Analysis/CostModel/X86/alternate-shuffle-cost.ll
>     llvm/trunk/test/Analysis/CostModel/X86/arith.ll
>     llvm/trunk/test/Analysis/CostModel/X86/cast.ll
>     llvm/trunk/test/Analysis/CostModel/X86/fptosi.ll
>     llvm/trunk/test/Analysis/CostModel/X86/fptoui.ll
>     llvm/trunk/test/Analysis/CostModel/X86/masked-intrinsic-cost.ll
>     llvm/trunk/test/Analysis/CostModel/X86/reduce-add.ll
>     llvm/trunk/test/Analysis/CostModel/X86/reduce-and.ll
>     llvm/trunk/test/Analysis/CostModel/X86/reduce-mul.ll
>     llvm/trunk/test/Analysis/CostModel/X86/reduce-or.ll
>     llvm/trunk/test/Analysis/CostModel/X86/reduce-smax.ll
>     llvm/trunk/test/Analysis/CostModel/X86/reduce-smin.ll
>     llvm/trunk/test/Analysis/CostModel/X86/reduce-umax.ll
>     llvm/trunk/test/Analysis/CostModel/X86/reduce-umin.ll
>     llvm/trunk/test/Analysis/CostModel/X86/reduce-xor.ll
>     llvm/trunk/test/Analysis/CostModel/X86/shuffle-transpose.ll
>     llvm/trunk/test/Analysis/CostModel/X86/sitofp.ll
>     llvm/trunk/test/Analysis/CostModel/X86/slm-arith-costs.ll
>     llvm/trunk/test/Analysis/CostModel/X86/testshiftashr.ll
>     llvm/trunk/test/Analysis/CostModel/X86/testshiftlshr.ll
>     llvm/trunk/test/Analysis/CostModel/X86/testshiftshl.ll
>     llvm/trunk/test/Analysis/CostModel/X86/uitofp.ll
>     llvm/trunk/test/CodeGen/X86/2008-09-05-sinttofp-2xi32.ll
>     llvm/trunk/test/CodeGen/X86/2009-06-05-VZextByteShort.ll
>     llvm/trunk/test/CodeGen/X86/2011-10-19-LegelizeLoad.ll
>     llvm/trunk/test/CodeGen/X86/2011-12-28-vselecti8.ll
>     llvm/trunk/test/CodeGen/X86/2011-12-8-bitcastintprom.ll
>     llvm/trunk/test/CodeGen/X86/2012-01-18-vbitcast.ll
>     llvm/trunk/test/CodeGen/X86/2012-03-15-build_vector_wl.ll
>     llvm/trunk/test/CodeGen/X86/2012-07-10-extload64.ll
>     llvm/trunk/test/CodeGen/X86/3dnow-intrinsics.ll
>     llvm/trunk/test/CodeGen/X86/4char-promote.ll
>     llvm/trunk/test/CodeGen/X86/and-load-fold.ll
>     llvm/trunk/test/CodeGen/X86/atomic-unordered.ll
>     llvm/trunk/test/CodeGen/X86/avg.ll
>     llvm/trunk/test/CodeGen/X86/avx-cvt-2.ll
>     llvm/trunk/test/CodeGen/X86/avx-fp2int.ll
>     llvm/trunk/test/CodeGen/X86/avx2-conversions.ll
>     llvm/trunk/test/CodeGen/X86/avx2-masked-gather.ll
>     llvm/trunk/test/CodeGen/X86/avx2-vbroadcast.ll
>     llvm/trunk/test/CodeGen/X86/avx512-any_extend_load.ll
>     llvm/trunk/test/CodeGen/X86/avx512-cvt.ll
>     llvm/trunk/test/CodeGen/X86/avx512-ext.ll
>     llvm/trunk/test/CodeGen/X86/avx512-intrinsics-upgrade.ll
>     llvm/trunk/test/CodeGen/X86/avx512-mask-op.ll
>     llvm/trunk/test/CodeGen/X86/avx512-trunc.ll
>     llvm/trunk/test/CodeGen/X86/avx512-vec-cmp.ll
>     llvm/trunk/test/CodeGen/X86/avx512-vec3-crash.ll
>     llvm/trunk/test/CodeGen/X86/avx512bwvl-intrinsics-upgrade.ll
>     llvm/trunk/test/CodeGen/X86/avx512vl-intrinsics-fast-isel.ll
>     llvm/trunk/test/CodeGen/X86/avx512vl-intrinsics-upgrade.ll
>     llvm/trunk/test/CodeGen/X86/bitcast-and-setcc-128.ll
>     llvm/trunk/test/CodeGen/X86/bitcast-setcc-128.ll
>     llvm/trunk/test/CodeGen/X86/bitcast-vector-bool.ll
>     llvm/trunk/test/CodeGen/X86/bitreverse.ll
>     llvm/trunk/test/CodeGen/X86/bswap-vector.ll
>     llvm/trunk/test/CodeGen/X86/buildvec-insertvec.ll
>     llvm/trunk/test/CodeGen/X86/combine-64bit-vec-binop.ll
>     llvm/trunk/test/CodeGen/X86/combine-or.ll
>     llvm/trunk/test/CodeGen/X86/complex-fastmath.ll
>     llvm/trunk/test/CodeGen/X86/cvtv2f32.ll
>     llvm/trunk/test/CodeGen/X86/extract-concat.ll
>     llvm/trunk/test/CodeGen/X86/extract-insert.ll
>     llvm/trunk/test/CodeGen/X86/f16c-intrinsics.ll
>     llvm/trunk/test/CodeGen/X86/fold-vector-sext-zext.ll
>     llvm/trunk/test/CodeGen/X86/insertelement-shuffle.ll
>     llvm/trunk/test/CodeGen/X86/known-bits.ll
>     llvm/trunk/test/CodeGen/X86/load-partial.ll
>     llvm/trunk/test/CodeGen/X86/lower-bitcast.ll
>     llvm/trunk/test/CodeGen/X86/madd.ll
>     llvm/trunk/test/CodeGen/X86/masked_compressstore.ll
>     llvm/trunk/test/CodeGen/X86/masked_expandload.ll
>     llvm/trunk/test/CodeGen/X86/masked_gather_scatter.ll
>     llvm/trunk/test/CodeGen/X86/masked_gather_scatter_widen.ll
>     llvm/trunk/test/CodeGen/X86/masked_load.ll
>     llvm/trunk/test/CodeGen/X86/masked_store.ll
>     llvm/trunk/test/CodeGen/X86/masked_store_trunc.ll
>     llvm/trunk/test/CodeGen/X86/masked_store_trunc_ssat.ll
>     llvm/trunk/test/CodeGen/X86/masked_store_trunc_usat.ll
>     llvm/trunk/test/CodeGen/X86/merge-consecutive-loads-256.ll
>     llvm/trunk/test/CodeGen/X86/mmx-arg-passing-x86-64.ll
>     llvm/trunk/test/CodeGen/X86/mmx-arith.ll
>     llvm/trunk/test/CodeGen/X86/mmx-cvt.ll
>     llvm/trunk/test/CodeGen/X86/mulvi32.ll
>     llvm/trunk/test/CodeGen/X86/oddshuffles.ll
>     llvm/trunk/test/CodeGen/X86/oddsubvector.ll
>     llvm/trunk/test/CodeGen/X86/pmaddubsw.ll
>     llvm/trunk/test/CodeGen/X86/pmulh.ll
>     llvm/trunk/test/CodeGen/X86/pointer-vector.ll
>     llvm/trunk/test/CodeGen/X86/pr14161.ll
>     llvm/trunk/test/CodeGen/X86/pr35918.ll
>     llvm/trunk/test/CodeGen/X86/pr40994.ll
>     llvm/trunk/test/CodeGen/X86/promote-vec3.ll
>     llvm/trunk/test/CodeGen/X86/promote.ll
>     llvm/trunk/test/CodeGen/X86/psubus.ll
>     llvm/trunk/test/CodeGen/X86/ret-mmx.ll
>     llvm/trunk/test/CodeGen/X86/sad.ll
>     llvm/trunk/test/CodeGen/X86/sadd_sat_vec.ll
>     llvm/trunk/test/CodeGen/X86/scalar_widen_div.ll
>     llvm/trunk/test/CodeGen/X86/select.ll
>     llvm/trunk/test/CodeGen/X86/shift-combine.ll
>     llvm/trunk/test/CodeGen/X86/shrink_vmul.ll
>     llvm/trunk/test/CodeGen/X86/shuffle-strided-with-offset-128.ll
>     llvm/trunk/test/CodeGen/X86/shuffle-strided-with-offset-256.ll
>     llvm/trunk/test/CodeGen/X86/shuffle-strided-with-offset-512.ll
>     llvm/trunk/test/CodeGen/X86/shuffle-vs-trunc-128.ll
>     llvm/trunk/test/CodeGen/X86/shuffle-vs-trunc-256.ll
>     llvm/trunk/test/CodeGen/X86/shuffle-vs-trunc-512.ll
>     llvm/trunk/test/CodeGen/X86/slow-pmulld.ll
>     llvm/trunk/test/CodeGen/X86/sse2-intrinsics-canonical.ll
>     llvm/trunk/test/CodeGen/X86/sse2-vector-shifts.ll
>     llvm/trunk/test/CodeGen/X86/ssub_sat_vec.ll
>     llvm/trunk/test/CodeGen/X86/test-shrink-bug.ll
>     llvm/trunk/test/CodeGen/X86/trunc-ext-ld-st.ll
>     llvm/trunk/test/CodeGen/X86/trunc-subvector.ll
>     llvm/trunk/test/CodeGen/X86/uadd_sat_vec.ll
>     llvm/trunk/test/CodeGen/X86/unfold-masked-merge-vector-variablemask.ll
>     llvm/trunk/test/CodeGen/X86/usub_sat_vec.ll
>     llvm/trunk/test/CodeGen/X86/vec_cast2.ll
>     llvm/trunk/test/CodeGen/X86/vec_cast3.ll
>     llvm/trunk/test/CodeGen/X86/vec_ctbits.ll
>     llvm/trunk/test/CodeGen/X86/vec_extract-mmx.ll
>     llvm/trunk/test/CodeGen/X86/vec_fp_to_int.ll
>     llvm/trunk/test/CodeGen/X86/vec_insert-5.ll
>     llvm/trunk/test/CodeGen/X86/vec_insert-7.ll
>     llvm/trunk/test/CodeGen/X86/vec_insert-mmx.ll
>     llvm/trunk/test/CodeGen/X86/vec_int_to_fp.ll
>     llvm/trunk/test/CodeGen/X86/vec_saddo.ll
>     llvm/trunk/test/CodeGen/X86/vec_smulo.ll
>     llvm/trunk/test/CodeGen/X86/vec_ssubo.ll
>     llvm/trunk/test/CodeGen/X86/vec_uaddo.ll
>     llvm/trunk/test/CodeGen/X86/vec_umulo.ll
>     llvm/trunk/test/CodeGen/X86/vec_usubo.ll
>     llvm/trunk/test/CodeGen/X86/vector-blend.ll
>     llvm/trunk/test/CodeGen/X86/vector-ext-logic.ll
>     llvm/trunk/test/CodeGen/X86/vector-gep.ll
>     llvm/trunk/test/CodeGen/X86/vector-half-conversions.ll
>     llvm/trunk/test/CodeGen/X86/vector-idiv-v2i32.ll
>     llvm/trunk/test/CodeGen/X86/vector-narrow-binop.ll
>     llvm/trunk/test/CodeGen/X86/vector-reduce-add.ll
>     llvm/trunk/test/CodeGen/X86/vector-reduce-and-bool.ll
>     llvm/trunk/test/CodeGen/X86/vector-reduce-and.ll
>     llvm/trunk/test/CodeGen/X86/vector-reduce-mul.ll
>     llvm/trunk/test/CodeGen/X86/vector-reduce-or-bool.ll
>     llvm/trunk/test/CodeGen/X86/vector-reduce-or.ll
>     llvm/trunk/test/CodeGen/X86/vector-reduce-smax.ll
>     llvm/trunk/test/CodeGen/X86/vector-reduce-smin.ll
>     llvm/trunk/test/CodeGen/X86/vector-reduce-umax.ll
>     llvm/trunk/test/CodeGen/X86/vector-reduce-umin.ll
>     llvm/trunk/test/CodeGen/X86/vector-reduce-xor-bool.ll
>     llvm/trunk/test/CodeGen/X86/vector-reduce-xor.ll
>     llvm/trunk/test/CodeGen/X86/vector-sext.ll
>     llvm/trunk/test/CodeGen/X86/vector-shift-ashr-sub128.ll
>     llvm/trunk/test/CodeGen/X86/vector-shift-by-select-loop.ll
>     llvm/trunk/test/CodeGen/X86/vector-shift-lshr-sub128.ll
>     llvm/trunk/test/CodeGen/X86/vector-shift-shl-sub128.ll
>     llvm/trunk/test/CodeGen/X86/vector-shuffle-128-v16.ll
>     llvm/trunk/test/CodeGen/X86/vector-shuffle-combining.ll
>     llvm/trunk/test/CodeGen/X86/vector-trunc-packus.ll
>     llvm/trunk/test/CodeGen/X86/vector-trunc-ssat.ll
>     llvm/trunk/test/CodeGen/X86/vector-trunc-usat.ll
>     llvm/trunk/test/CodeGen/X86/vector-trunc.ll
>     llvm/trunk/test/CodeGen/X86/vector-truncate-combine.ll
>     llvm/trunk/test/CodeGen/X86/vector-zext.ll
>     llvm/trunk/test/CodeGen/X86/vsel-cmp-load.ll
>     llvm/trunk/test/CodeGen/X86/vselect-avx.ll
>     llvm/trunk/test/CodeGen/X86/vselect.ll
>     llvm/trunk/test/CodeGen/X86/vshift-4.ll
>     llvm/trunk/test/CodeGen/X86/widen_arith-1.ll
>     llvm/trunk/test/CodeGen/X86/widen_arith-2.ll
>     llvm/trunk/test/CodeGen/X86/widen_arith-3.ll
>     llvm/trunk/test/CodeGen/X86/widen_bitops-0.ll
>     llvm/trunk/test/CodeGen/X86/widen_cast-1.ll
>     llvm/trunk/test/CodeGen/X86/widen_cast-2.ll
>     llvm/trunk/test/CodeGen/X86/widen_cast-3.ll
>     llvm/trunk/test/CodeGen/X86/widen_cast-4.ll
>     llvm/trunk/test/CodeGen/X86/widen_cast-5.ll
>     llvm/trunk/test/CodeGen/X86/widen_cast-6.ll
>     llvm/trunk/test/CodeGen/X86/widen_compare-1.ll
>     llvm/trunk/test/CodeGen/X86/widen_conv-1.ll
>     llvm/trunk/test/CodeGen/X86/widen_conv-2.ll
>     llvm/trunk/test/CodeGen/X86/widen_conv-3.ll
>     llvm/trunk/test/CodeGen/X86/widen_conv-4.ll
>     llvm/trunk/test/CodeGen/X86/widen_load-2.ll
>     llvm/trunk/test/CodeGen/X86/widen_shuffle-1.ll
>     llvm/trunk/test/CodeGen/X86/x86-interleaved-access.ll
>     llvm/trunk/test/CodeGen/X86/x86-shifts.ll
>     llvm/trunk/test/Transforms/SLPVectorizer/X86/blending-shuffle.ll
>     llvm/trunk/test/Transforms/SLPVectorizer/X86/fptosi.ll
>     llvm/trunk/test/Transforms/SLPVectorizer/X86/fptoui.ll
>     llvm/trunk/test/Transforms/SLPVectorizer/X86/insert-element-build-vector.ll
>     llvm/trunk/test/Transforms/SLPVectorizer/X86/sitofp.ll
>     llvm/trunk/test/Transforms/SLPVectorizer/X86/uitofp.ll
>
> Modified: llvm/trunk/lib/Target/X86/X86ISelLowering.cpp
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/X86/X86ISelLowering.cpp?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/lib/Target/X86/X86ISelLowering.cpp (original)
> +++ llvm/trunk/lib/Target/X86/X86ISelLowering.cpp Wed Aug  7 09:24:26 2019
> @@ -66,7 +66,7 @@ using namespace llvm;
>  STATISTIC(NumTailCalls, "Number of tail calls");
>
>  static cl::opt<bool> ExperimentalVectorWideningLegalization(
> -    "x86-experimental-vector-widening-legalization", cl::init(false),
> +    "x86-experimental-vector-widening-legalization", cl::init(true),
>      cl::desc("Enable an experimental vector type legalization through widening "
>               "rather than promotion."),
>      cl::Hidden);
> @@ -40453,8 +40453,7 @@ static SDValue combineStore(SDNode *N, S
>    bool NoImplicitFloatOps = F.hasFnAttribute(Attribute::NoImplicitFloat);
>    bool F64IsLegal =
>        !Subtarget.useSoftFloat() && !NoImplicitFloatOps && Subtarget.hasSSE2();
> -  if (((VT.isVector() && !VT.isFloatingPoint()) ||
> -       (VT == MVT::i64 && F64IsLegal && !Subtarget.is64Bit())) &&
> +  if ((VT == MVT::i64 && F64IsLegal && !Subtarget.is64Bit()) &&
>        isa<LoadSDNode>(St->getValue()) &&
>        !cast<LoadSDNode>(St->getValue())->isVolatile() &&
>        St->getChain().hasOneUse() && !St->isVolatile()) {
>
> Modified: llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp (original)
> +++ llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp Wed Aug  7 09:24:26 2019
> @@ -887,7 +887,7 @@ int X86TTIImpl::getArithmeticInstrCost(
>  int X86TTIImpl::getShuffleCost(TTI::ShuffleKind Kind, Type *Tp, int Index,
>                                 Type *SubTp) {
>    // 64-bit packed float vectors (v2f32) are widened to type v4f32.
> -  // 64-bit packed integer vectors (v2i32) are promoted to type v2i64.
> +  // 64-bit packed integer vectors (v2i32) are widened to type v4i32.
>    std::pair<int, MVT> LT = TLI->getTypeLegalizationCost(DL, Tp);
>
>    // Treat Transpose as 2-op shuffles - there's no difference in lowering.
> @@ -2425,14 +2425,6 @@ int X86TTIImpl::getAddressComputationCos
>
>  int X86TTIImpl::getArithmeticReductionCost(unsigned Opcode, Type *ValTy,
>                                             bool IsPairwise) {
> -
> -  std::pair<int, MVT> LT = TLI->getTypeLegalizationCost(DL, ValTy);
> -
> -  MVT MTy = LT.second;
> -
> -  int ISD = TLI->InstructionOpcodeToISD(Opcode);
> -  assert(ISD && "Invalid opcode");
> -
>    // We use the Intel Architecture Code Analyzer(IACA) to measure the throughput
>    // and make it as the cost.
>
> @@ -2440,7 +2432,10 @@ int X86TTIImpl::getArithmeticReductionCo
>      { ISD::FADD,  MVT::v2f64,   2 },
>      { ISD::FADD,  MVT::v4f32,   4 },
>      { ISD::ADD,   MVT::v2i64,   2 },      // The data reported by the IACA tool is "1.6".
> +    { ISD::ADD,   MVT::v2i32,   2 }, // FIXME: chosen to be less than v4i32.
>      { ISD::ADD,   MVT::v4i32,   3 },      // The data reported by the IACA tool is "3.5".
> +    { ISD::ADD,   MVT::v2i16,   3 }, // FIXME: chosen to be less than v4i16
> +    { ISD::ADD,   MVT::v4i16,   4 }, // FIXME: chosen to be less than v8i16
>      { ISD::ADD,   MVT::v8i16,   5 },
>    };
>
> @@ -2449,8 +2444,11 @@ int X86TTIImpl::getArithmeticReductionCo
>      { ISD::FADD,  MVT::v4f64,   5 },
>      { ISD::FADD,  MVT::v8f32,   7 },
>      { ISD::ADD,   MVT::v2i64,   1 },      // The data reported by the IACA tool is "1.5".
> +    { ISD::ADD,   MVT::v2i32,   2 }, // FIXME: chosen to be less than v4i32
>      { ISD::ADD,   MVT::v4i32,   3 },      // The data reported by the IACA tool is "3.5".
>      { ISD::ADD,   MVT::v4i64,   5 },      // The data reported by the IACA tool is "4.8".
> +    { ISD::ADD,   MVT::v2i16,   3 }, // FIXME: chosen to be less than v4i16
> +    { ISD::ADD,   MVT::v4i16,   4 }, // FIXME: chosen to be less than v8i16
>      { ISD::ADD,   MVT::v8i16,   5 },
>      { ISD::ADD,   MVT::v8i32,   5 },
>    };
> @@ -2459,7 +2457,10 @@ int X86TTIImpl::getArithmeticReductionCo
>      { ISD::FADD,  MVT::v2f64,   2 },
>      { ISD::FADD,  MVT::v4f32,   4 },
>      { ISD::ADD,   MVT::v2i64,   2 },      // The data reported by the IACA tool is "1.6".
> +    { ISD::ADD,   MVT::v2i32,   2 }, // FIXME: chosen to be less than v4i32
>      { ISD::ADD,   MVT::v4i32,   3 },      // The data reported by the IACA tool is "3.3".
> +    { ISD::ADD,   MVT::v2i16,   2 },      // The data reported by the IACA tool is "4.3".
> +    { ISD::ADD,   MVT::v4i16,   3 },      // The data reported by the IACA tool is "4.3".
>      { ISD::ADD,   MVT::v8i16,   4 },      // The data reported by the IACA tool is "4.3".
>    };
>
> @@ -2468,12 +2469,47 @@ int X86TTIImpl::getArithmeticReductionCo
>      { ISD::FADD,  MVT::v4f64,   3 },
>      { ISD::FADD,  MVT::v8f32,   4 },
>      { ISD::ADD,   MVT::v2i64,   1 },      // The data reported by the IACA tool is "1.5".
> +    { ISD::ADD,   MVT::v2i32,   2 }, // FIXME: chosen to be less than v4i32
>      { ISD::ADD,   MVT::v4i32,   3 },      // The data reported by the IACA tool is "2.8".
>      { ISD::ADD,   MVT::v4i64,   3 },
> +    { ISD::ADD,   MVT::v2i16,   2 },      // The data reported by the IACA tool is "4.3".
> +    { ISD::ADD,   MVT::v4i16,   3 },      // The data reported by the IACA tool is "4.3".
>      { ISD::ADD,   MVT::v8i16,   4 },
>      { ISD::ADD,   MVT::v8i32,   5 },
>    };
>
> +  int ISD = TLI->InstructionOpcodeToISD(Opcode);
> +  assert(ISD && "Invalid opcode");
> +
> +  // Before legalizing the type, give a chance to look up illegal narrow types
> +  // in the table.
> +  // FIXME: Is there a better way to do this?
> +  EVT VT = TLI->getValueType(DL, ValTy);
> +  if (VT.isSimple()) {
> +    MVT MTy = VT.getSimpleVT();
> +    if (IsPairwise) {
> +      if (ST->hasAVX())
> +        if (const auto *Entry = CostTableLookup(AVX1CostTblPairWise, ISD, MTy))
> +          return Entry->Cost;
> +
> +      if (ST->hasSSE42())
> +        if (const auto *Entry = CostTableLookup(SSE42CostTblPairWise, ISD, MTy))
> +          return Entry->Cost;
> +    } else {
> +      if (ST->hasAVX())
> +        if (const auto *Entry = CostTableLookup(AVX1CostTblNoPairWise, ISD, MTy))
> +          return Entry->Cost;
> +
> +      if (ST->hasSSE42())
> +        if (const auto *Entry = CostTableLookup(SSE42CostTblNoPairWise, ISD, MTy))
> +          return Entry->Cost;
> +    }
> +  }
> +
> +  std::pair<int, MVT> LT = TLI->getTypeLegalizationCost(DL, ValTy);
> +
> +  MVT MTy = LT.second;
> +
>    if (IsPairwise) {
>      if (ST->hasAVX())
>        if (const auto *Entry = CostTableLookup(AVX1CostTblPairWise, ISD, MTy))
>
> Modified: llvm/trunk/test/Analysis/CostModel/X86/alternate-shuffle-cost.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/alternate-shuffle-cost.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/Analysis/CostModel/X86/alternate-shuffle-cost.ll (original)
> +++ llvm/trunk/test/Analysis/CostModel/X86/alternate-shuffle-cost.ll Wed Aug  7 09:24:26 2019
> @@ -18,9 +18,21 @@
>  ; 64-bit packed float vectors (v2f32) are widened to type v4f32.
>
>  define <2 x i32> @test_v2i32(<2 x i32> %a, <2 x i32> %b) {
> -; CHECK-LABEL: 'test_v2i32'
> -; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %1 = shufflevector <2 x i32> %a, <2 x i32> %b, <2 x i32> <i32 0, i32 3>
> -; CHECK-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i32> %1
> +; SSE2-LABEL: 'test_v2i32'
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %1 = shufflevector <2 x i32> %a, <2 x i32> %b, <2 x i32> <i32 0, i32 3>
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i32> %1
> +;
> +; SSSE3-LABEL: 'test_v2i32'
> +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %1 = shufflevector <2 x i32> %a, <2 x i32> %b, <2 x i32> <i32 0, i32 3>
> +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i32> %1
> +;
> +; SSE42-LABEL: 'test_v2i32'
> +; SSE42-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %1 = shufflevector <2 x i32> %a, <2 x i32> %b, <2 x i32> <i32 0, i32 3>
> +; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i32> %1
> +;
> +; AVX-LABEL: 'test_v2i32'
> +; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %1 = shufflevector <2 x i32> %a, <2 x i32> %b, <2 x i32> <i32 0, i32 3>
> +; AVX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i32> %1
>  ;
>  ; BTVER2-LABEL: 'test_v2i32'
>  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %1 = shufflevector <2 x i32> %a, <2 x i32> %b, <2 x i32> <i32 0, i32 3>
> @@ -56,9 +68,21 @@ define <2 x float> @test_v2f32(<2 x floa
>  }
>
>  define <2 x i32> @test_v2i32_2(<2 x i32> %a, <2 x i32> %b) {
> -; CHECK-LABEL: 'test_v2i32_2'
> -; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %1 = shufflevector <2 x i32> %a, <2 x i32> %b, <2 x i32> <i32 2, i32 1>
> -; CHECK-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i32> %1
> +; SSE2-LABEL: 'test_v2i32_2'
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %1 = shufflevector <2 x i32> %a, <2 x i32> %b, <2 x i32> <i32 2, i32 1>
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i32> %1
> +;
> +; SSSE3-LABEL: 'test_v2i32_2'
> +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %1 = shufflevector <2 x i32> %a, <2 x i32> %b, <2 x i32> <i32 2, i32 1>
> +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i32> %1
> +;
> +; SSE42-LABEL: 'test_v2i32_2'
> +; SSE42-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %1 = shufflevector <2 x i32> %a, <2 x i32> %b, <2 x i32> <i32 2, i32 1>
> +; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i32> %1
> +;
> +; AVX-LABEL: 'test_v2i32_2'
> +; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %1 = shufflevector <2 x i32> %a, <2 x i32> %b, <2 x i32> <i32 2, i32 1>
> +; AVX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i32> %1
>  ;
>  ; BTVER2-LABEL: 'test_v2i32_2'
>  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %1 = shufflevector <2 x i32> %a, <2 x i32> %b, <2 x i32> <i32 2, i32 1>
>
> Modified: llvm/trunk/test/Analysis/CostModel/X86/arith.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/arith.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/Analysis/CostModel/X86/arith.ll (original)
> +++ llvm/trunk/test/Analysis/CostModel/X86/arith.ll Wed Aug  7 09:24:26 2019
> @@ -1342,36 +1342,32 @@ define i32 @mul(i32 %arg) {
>  ; A <2 x i64> vector multiply is implemented using
>  ; 3 PMULUDQ and 2 PADDS and 4 shifts.
>  define void @mul_2i32() {
> -; SSE-LABEL: 'mul_2i32'
> -; SSE-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %A0 = mul <2 x i32> undef, undef
> -; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
> +; SSSE3-LABEL: 'mul_2i32'
> +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %A0 = mul <2 x i32> undef, undef
> +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
> +;
> +; SSE42-LABEL: 'mul_2i32'
> +; SSE42-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %A0 = mul <2 x i32> undef, undef
> +; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
>  ;
>  ; AVX-LABEL: 'mul_2i32'
> -; AVX-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %A0 = mul <2 x i32> undef, undef
> +; AVX-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %A0 = mul <2 x i32> undef, undef
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
>  ;
> -; AVX512F-LABEL: 'mul_2i32'
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %A0 = mul <2 x i32> undef, undef
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
> -;
> -; AVX512BW-LABEL: 'mul_2i32'
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %A0 = mul <2 x i32> undef, undef
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
> -;
> -; AVX512DQ-LABEL: 'mul_2i32'
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %A0 = mul <2 x i32> undef, undef
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
> +; AVX512-LABEL: 'mul_2i32'
> +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %A0 = mul <2 x i32> undef, undef
> +; AVX512-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
>  ;
>  ; SLM-LABEL: 'mul_2i32'
> -; SLM-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %A0 = mul <2 x i32> undef, undef
> +; SLM-NEXT:  Cost Model: Found an estimated cost of 11 for instruction: %A0 = mul <2 x i32> undef, undef
>  ; SLM-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
>  ;
>  ; GLM-LABEL: 'mul_2i32'
> -; GLM-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %A0 = mul <2 x i32> undef, undef
> +; GLM-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %A0 = mul <2 x i32> undef, undef
>  ; GLM-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
>  ;
>  ; BTVER2-LABEL: 'mul_2i32'
> -; BTVER2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %A0 = mul <2 x i32> undef, undef
> +; BTVER2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %A0 = mul <2 x i32> undef, undef
>  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
>  ;
>    %A0 = mul <2 x i32> undef, undef
>
> Modified: llvm/trunk/test/Analysis/CostModel/X86/cast.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/cast.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/Analysis/CostModel/X86/cast.ll (original)
> +++ llvm/trunk/test/Analysis/CostModel/X86/cast.ll Wed Aug  7 09:24:26 2019
> @@ -315,10 +315,10 @@ define void @sitofp4(<4 x i1> %a, <4 x i
>  ; SSE-LABEL: 'sitofp4'
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %A1 = sitofp <4 x i1> %a to <4 x float>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 40 for instruction: %A2 = sitofp <4 x i1> %a to <4 x double>
> -; SSE-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %B1 = sitofp <4 x i8> %b to <4 x float>
> -; SSE-NEXT:  Cost Model: Found an estimated cost of 40 for instruction: %B2 = sitofp <4 x i8> %b to <4 x double>
> -; SSE-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %C1 = sitofp <4 x i16> %c to <4 x float>
> -; SSE-NEXT:  Cost Model: Found an estimated cost of 40 for instruction: %C2 = sitofp <4 x i16> %c to <4 x double>
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %B1 = sitofp <4 x i8> %b to <4 x float>
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 160 for instruction: %B2 = sitofp <4 x i8> %b to <4 x double>
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %C1 = sitofp <4 x i16> %c to <4 x float>
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 80 for instruction: %C2 = sitofp <4 x i16> %c to <4 x double>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %D1 = sitofp <4 x i32> %d to <4 x float>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 40 for instruction: %D2 = sitofp <4 x i32> %d to <4 x double>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
> @@ -359,7 +359,7 @@ define void @sitofp4(<4 x i1> %a, <4 x i
>  define void @sitofp8(<8 x i1> %a, <8 x i8> %b, <8 x i16> %c, <8 x i32> %d) {
>  ; SSE-LABEL: 'sitofp8'
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %A1 = sitofp <8 x i1> %a to <8 x float>
> -; SSE-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %B1 = sitofp <8 x i8> %b to <8 x float>
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %B1 = sitofp <8 x i8> %b to <8 x float>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %C1 = sitofp <8 x i16> %c to <8 x float>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %D1 = sitofp <8 x i32> %d to <8 x float>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
> @@ -390,9 +390,9 @@ define void @uitofp4(<4 x i1> %a, <4 x i
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %A1 = uitofp <4 x i1> %a to <4 x float>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 40 for instruction: %A2 = uitofp <4 x i1> %a to <4 x double>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %B1 = uitofp <4 x i8> %b to <4 x float>
> -; SSE-NEXT:  Cost Model: Found an estimated cost of 40 for instruction: %B2 = uitofp <4 x i8> %b to <4 x double>
> -; SSE-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %C1 = uitofp <4 x i16> %c to <4 x float>
> -; SSE-NEXT:  Cost Model: Found an estimated cost of 40 for instruction: %C2 = uitofp <4 x i16> %c to <4 x double>
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 160 for instruction: %B2 = uitofp <4 x i8> %b to <4 x double>
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %C1 = uitofp <4 x i16> %c to <4 x float>
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 80 for instruction: %C2 = uitofp <4 x i16> %c to <4 x double>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %D1 = uitofp <4 x i32> %d to <4 x float>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 40 for instruction: %D2 = uitofp <4 x i32> %d to <4 x double>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
> @@ -433,7 +433,7 @@ define void @uitofp4(<4 x i1> %a, <4 x i
>  define void @uitofp8(<8 x i1> %a, <8 x i8> %b, <8 x i16> %c, <8 x i32> %d) {
>  ; SSE-LABEL: 'uitofp8'
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %A1 = uitofp <8 x i1> %a to <8 x float>
> -; SSE-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %B1 = uitofp <8 x i8> %b to <8 x float>
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %B1 = uitofp <8 x i8> %b to <8 x float>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %C1 = uitofp <8 x i16> %c to <8 x float>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 16 for instruction: %D1 = uitofp <8 x i32> %d to <8 x float>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
>
> Modified: llvm/trunk/test/Analysis/CostModel/X86/fptosi.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/fptosi.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/Analysis/CostModel/X86/fptosi.ll (original)
> +++ llvm/trunk/test/Analysis/CostModel/X86/fptosi.ll Wed Aug  7 09:24:26 2019
> @@ -92,35 +92,28 @@ define i32 @fptosi_double_i32(i32 %arg)
>  define i32 @fptosi_double_i16(i32 %arg) {
>  ; SSE-LABEL: 'fptosi_double_i16'
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %I16 = fptosi double undef to i16
> -; SSE-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V2I16 = fptosi <2 x double> undef to <2 x i16>
> -; SSE-NEXT:  Cost Model: Found an estimated cost of 13 for instruction: %V4I16 = fptosi <4 x double> undef to <4 x i16>
> -; SSE-NEXT:  Cost Model: Found an estimated cost of 27 for instruction: %V8I16 = fptosi <8 x double> undef to <8 x i16>
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2I16 = fptosi <2 x double> undef to <2 x i16>
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V4I16 = fptosi <4 x double> undef to <4 x i16>
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V8I16 = fptosi <8 x double> undef to <8 x i16>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX-LABEL: 'fptosi_double_i16'
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %I16 = fptosi double undef to i16
> -; AVX-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V2I16 = fptosi <2 x double> undef to <2 x i16>
> +; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2I16 = fptosi <2 x double> undef to <2 x i16>
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4I16 = fptosi <4 x double> undef to <4 x i16>
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V8I16 = fptosi <8 x double> undef to <8 x i16>
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
> -; AVX512F-LABEL: 'fptosi_double_i16'
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %I16 = fptosi double undef to i16
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V2I16 = fptosi <2 x double> undef to <2 x i16>
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4I16 = fptosi <4 x double> undef to <4 x i16>
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8I16 = fptosi <8 x double> undef to <8 x i16>
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> -;
> -; AVX512DQ-LABEL: 'fptosi_double_i16'
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %I16 = fptosi double undef to i16
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2I16 = fptosi <2 x double> undef to <2 x i16>
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4I16 = fptosi <4 x double> undef to <4 x i16>
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8I16 = fptosi <8 x double> undef to <8 x i16>
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> +; AVX512-LABEL: 'fptosi_double_i16'
> +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %I16 = fptosi double undef to i16
> +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2I16 = fptosi <2 x double> undef to <2 x i16>
> +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4I16 = fptosi <4 x double> undef to <4 x i16>
> +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8I16 = fptosi <8 x double> undef to <8 x i16>
> +; AVX512-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; BTVER2-LABEL: 'fptosi_double_i16'
>  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %I16 = fptosi double undef to i16
> -; BTVER2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V2I16 = fptosi <2 x double> undef to <2 x i16>
> +; BTVER2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2I16 = fptosi <2 x double> undef to <2 x i16>
>  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4I16 = fptosi <4 x double> undef to <4 x i16>
>  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V8I16 = fptosi <8 x double> undef to <8 x i16>
>  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> @@ -143,29 +136,22 @@ define i32 @fptosi_double_i8(i32 %arg) {
>  ; AVX-LABEL: 'fptosi_double_i8'
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %I8 = fptosi double undef to i8
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V2I8 = fptosi <2 x double> undef to <2 x i8>
> -; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4I8 = fptosi <4 x double> undef to <4 x i8>
> -; AVX-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V8I8 = fptosi <8 x double> undef to <8 x i8>
> +; AVX-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V4I8 = fptosi <4 x double> undef to <4 x i8>
> +; AVX-NEXT:  Cost Model: Found an estimated cost of 25 for instruction: %V8I8 = fptosi <8 x double> undef to <8 x i8>
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
> -; AVX512F-LABEL: 'fptosi_double_i8'
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %I8 = fptosi double undef to i8
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V2I8 = fptosi <2 x double> undef to <2 x i8>
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4I8 = fptosi <4 x double> undef to <4 x i8>
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8I8 = fptosi <8 x double> undef to <8 x i8>
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> -;
> -; AVX512DQ-LABEL: 'fptosi_double_i8'
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %I8 = fptosi double undef to i8
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2I8 = fptosi <2 x double> undef to <2 x i8>
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4I8 = fptosi <4 x double> undef to <4 x i8>
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8I8 = fptosi <8 x double> undef to <8 x i8>
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> +; AVX512-LABEL: 'fptosi_double_i8'
> +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %I8 = fptosi double undef to i8
> +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2I8 = fptosi <2 x double> undef to <2 x i8>
> +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4I8 = fptosi <4 x double> undef to <4 x i8>
> +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8I8 = fptosi <8 x double> undef to <8 x i8>
> +; AVX512-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; BTVER2-LABEL: 'fptosi_double_i8'
>  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %I8 = fptosi double undef to i8
>  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V2I8 = fptosi <2 x double> undef to <2 x i8>
> -; BTVER2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4I8 = fptosi <4 x double> undef to <4 x i8>
> -; BTVER2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V8I8 = fptosi <8 x double> undef to <8 x i8>
> +; BTVER2-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V4I8 = fptosi <4 x double> undef to <4 x i8>
> +; BTVER2-NEXT:  Cost Model: Found an estimated cost of 25 for instruction: %V8I8 = fptosi <8 x double> undef to <8 x i8>
>  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>    %I8 = fptosi double undef to i8
> @@ -285,9 +271,9 @@ define i32 @fptosi_float_i16(i32 %arg) {
>  define i32 @fptosi_float_i8(i32 %arg) {
>  ; SSE-LABEL: 'fptosi_float_i8'
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %I8 = fptosi float undef to i8
> -; SSE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4I8 = fptosi <4 x float> undef to <4 x i8>
> -; SSE-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V8I8 = fptosi <8 x float> undef to <8 x i8>
> -; SSE-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V16I8 = fptosi <16 x float> undef to <16 x i8>
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V4I8 = fptosi <4 x float> undef to <4 x i8>
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 25 for instruction: %V8I8 = fptosi <8 x float> undef to <8 x i8>
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 51 for instruction: %V16I8 = fptosi <16 x float> undef to <16 x i8>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX-LABEL: 'fptosi_float_i8'
>
> Modified: llvm/trunk/test/Analysis/CostModel/X86/fptoui.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/fptoui.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/Analysis/CostModel/X86/fptoui.ll (original)
> +++ llvm/trunk/test/Analysis/CostModel/X86/fptoui.ll Wed Aug  7 09:24:26 2019
> @@ -68,19 +68,12 @@ define i32 @fptoui_double_i32(i32 %arg)
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 33 for instruction: %V8I32 = fptoui <8 x double> undef to <8 x i32>
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
> -; AVX512F-LABEL: 'fptoui_double_i32'
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %I32 = fptoui double undef to i32
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V2I32 = fptoui <2 x double> undef to <2 x i32>
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4I32 = fptoui <4 x double> undef to <4 x i32>
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8I32 = fptoui <8 x double> undef to <8 x i32>
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> -;
> -; AVX512DQ-LABEL: 'fptoui_double_i32'
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %I32 = fptoui double undef to i32
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2I32 = fptoui <2 x double> undef to <2 x i32>
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4I32 = fptoui <4 x double> undef to <4 x i32>
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8I32 = fptoui <8 x double> undef to <8 x i32>
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> +; AVX512-LABEL: 'fptoui_double_i32'
> +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %I32 = fptoui double undef to i32
> +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2I32 = fptoui <2 x double> undef to <2 x i32>
> +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4I32 = fptoui <4 x double> undef to <4 x i32>
> +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8I32 = fptoui <8 x double> undef to <8 x i32>
> +; AVX512-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; BTVER2-LABEL: 'fptoui_double_i32'
>  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %I32 = fptoui double undef to i32
> @@ -106,30 +99,23 @@ define i32 @fptoui_double_i16(i32 %arg)
>  ;
>  ; AVX-LABEL: 'fptoui_double_i16'
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %I16 = fptoui double undef to i16
> -; AVX-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V2I16 = fptoui <2 x double> undef to <2 x i16>
> -; AVX-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V4I16 = fptoui <4 x double> undef to <4 x i16>
> -; AVX-NEXT:  Cost Model: Found an estimated cost of 25 for instruction: %V8I16 = fptoui <8 x double> undef to <8 x i16>
> +; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2I16 = fptoui <2 x double> undef to <2 x i16>
> +; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4I16 = fptoui <4 x double> undef to <4 x i16>
> +; AVX-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V8I16 = fptoui <8 x double> undef to <8 x i16>
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
> -; AVX512F-LABEL: 'fptoui_double_i16'
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %I16 = fptoui double undef to i16
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V2I16 = fptoui <2 x double> undef to <2 x i16>
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4I16 = fptoui <4 x double> undef to <4 x i16>
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V8I16 = fptoui <8 x double> undef to <8 x i16>
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> -;
> -; AVX512DQ-LABEL: 'fptoui_double_i16'
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %I16 = fptoui double undef to i16
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2I16 = fptoui <2 x double> undef to <2 x i16>
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4I16 = fptoui <4 x double> undef to <4 x i16>
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V8I16 = fptoui <8 x double> undef to <8 x i16>
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> +; AVX512-LABEL: 'fptoui_double_i16'
> +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %I16 = fptoui double undef to i16
> +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2I16 = fptoui <2 x double> undef to <2 x i16>
> +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4I16 = fptoui <4 x double> undef to <4 x i16>
> +; AVX512-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V8I16 = fptoui <8 x double> undef to <8 x i16>
> +; AVX512-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; BTVER2-LABEL: 'fptoui_double_i16'
>  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %I16 = fptoui double undef to i16
> -; BTVER2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V2I16 = fptoui <2 x double> undef to <2 x i16>
> -; BTVER2-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V4I16 = fptoui <4 x double> undef to <4 x i16>
> -; BTVER2-NEXT:  Cost Model: Found an estimated cost of 25 for instruction: %V8I16 = fptoui <8 x double> undef to <8 x i16>
> +; BTVER2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2I16 = fptoui <2 x double> undef to <2 x i16>
> +; BTVER2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4I16 = fptoui <4 x double> undef to <4 x i16>
> +; BTVER2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V8I16 = fptoui <8 x double> undef to <8 x i16>
>  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>    %I16 = fptoui double undef to i16
> @@ -154,19 +140,12 @@ define i32 @fptoui_double_i8(i32 %arg) {
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 25 for instruction: %V8I8 = fptoui <8 x double> undef to <8 x i8>
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
> -; AVX512F-LABEL: 'fptoui_double_i8'
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %I8 = fptoui double undef to i8
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V2I8 = fptoui <2 x double> undef to <2 x i8>
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4I8 = fptoui <4 x double> undef to <4 x i8>
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V8I8 = fptoui <8 x double> undef to <8 x i8>
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> -;
> -; AVX512DQ-LABEL: 'fptoui_double_i8'
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %I8 = fptoui double undef to i8
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2I8 = fptoui <2 x double> undef to <2 x i8>
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4I8 = fptoui <4 x double> undef to <4 x i8>
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V8I8 = fptoui <8 x double> undef to <8 x i8>
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> +; AVX512-LABEL: 'fptoui_double_i8'
> +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %I8 = fptoui double undef to i8
> +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2I8 = fptoui <2 x double> undef to <2 x i8>
> +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4I8 = fptoui <4 x double> undef to <4 x i8>
> +; AVX512-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V8I8 = fptoui <8 x double> undef to <8 x i8>
> +; AVX512-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; BTVER2-LABEL: 'fptoui_double_i8'
>  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %I8 = fptoui double undef to i8
> @@ -277,7 +256,7 @@ define i32 @fptoui_float_i16(i32 %arg) {
>  ;
>  ; AVX-LABEL: 'fptoui_float_i16'
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %I16 = fptoui float undef to i16
> -; AVX-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V4I16 = fptoui <4 x float> undef to <4 x i16>
> +; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4I16 = fptoui <4 x float> undef to <4 x i16>
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8I16 = fptoui <8 x float> undef to <8 x i16>
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V16I16 = fptoui <16 x float> undef to <16 x i16>
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> @@ -291,7 +270,7 @@ define i32 @fptoui_float_i16(i32 %arg) {
>  ;
>  ; BTVER2-LABEL: 'fptoui_float_i16'
>  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %I16 = fptoui float undef to i16
> -; BTVER2-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V4I16 = fptoui <4 x float> undef to <4 x i16>
> +; BTVER2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4I16 = fptoui <4 x float> undef to <4 x i16>
>  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8I16 = fptoui <8 x float> undef to <8 x i16>
>  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V16I16 = fptoui <16 x float> undef to <16 x i16>
>  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> @@ -314,8 +293,8 @@ define i32 @fptoui_float_i8(i32 %arg) {
>  ; AVX-LABEL: 'fptoui_float_i8'
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %I8 = fptoui float undef to i8
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V4I8 = fptoui <4 x float> undef to <4 x i8>
> -; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8I8 = fptoui <8 x float> undef to <8 x i8>
> -; AVX-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V16I8 = fptoui <16 x float> undef to <16 x i8>
> +; AVX-NEXT:  Cost Model: Found an estimated cost of 24 for instruction: %V8I8 = fptoui <8 x float> undef to <8 x i8>
> +; AVX-NEXT:  Cost Model: Found an estimated cost of 49 for instruction: %V16I8 = fptoui <16 x float> undef to <16 x i8>
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX512-LABEL: 'fptoui_float_i8'
> @@ -328,8 +307,8 @@ define i32 @fptoui_float_i8(i32 %arg) {
>  ; BTVER2-LABEL: 'fptoui_float_i8'
>  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %I8 = fptoui float undef to i8
>  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V4I8 = fptoui <4 x float> undef to <4 x i8>
> -; BTVER2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8I8 = fptoui <8 x float> undef to <8 x i8>
> -; BTVER2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V16I8 = fptoui <16 x float> undef to <16 x i8>
> +; BTVER2-NEXT:  Cost Model: Found an estimated cost of 24 for instruction: %V8I8 = fptoui <8 x float> undef to <8 x i8>
> +; BTVER2-NEXT:  Cost Model: Found an estimated cost of 49 for instruction: %V16I8 = fptoui <16 x float> undef to <16 x i8>
>  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>    %I8 = fptoui float undef to i8
>
> Modified: llvm/trunk/test/Analysis/CostModel/X86/masked-intrinsic-cost.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/masked-intrinsic-cost.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/Analysis/CostModel/X86/masked-intrinsic-cost.ll (original)
> +++ llvm/trunk/test/Analysis/CostModel/X86/masked-intrinsic-cost.ll Wed Aug  7 09:24:26 2019
> @@ -52,7 +52,7 @@ define i32 @masked_load() {
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V16I32 = call <16 x i32> @llvm.masked.load.v16i32.p0v16i32(<16 x i32>* undef, i32 1, <16 x i1> undef, <16 x i32> undef)
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V8I32 = call <8 x i32> @llvm.masked.load.v8i32.p0v8i32(<8 x i32>* undef, i32 1, <8 x i1> undef, <8 x i32> undef)
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V4I32 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* undef, i32 1, <4 x i1> undef, <4 x i32> undef)
> -; AVX-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V2I32 = call <2 x i32> @llvm.masked.load.v2i32.p0v2i32(<2 x i32>* undef, i32 1, <2 x i1> undef, <2 x i32> undef)
> +; AVX-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V2I32 = call <2 x i32> @llvm.masked.load.v2i32.p0v2i32(<2 x i32>* undef, i32 1, <2 x i1> undef, <2 x i32> undef)
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 128 for instruction: %V32I16 = call <32 x i16> @llvm.masked.load.v32i16.p0v32i16(<32 x i16>* undef, i32 1, <32 x i1> undef, <32 x i16> undef)
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 64 for instruction: %V16I16 = call <16 x i16> @llvm.masked.load.v16i16.p0v16i16(<16 x i16>* undef, i32 1, <16 x i1> undef, <16 x i16> undef)
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 32 for instruction: %V8I16 = call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* undef, i32 1, <8 x i1> undef, <8 x i16> undef)
> @@ -79,7 +79,7 @@ define i32 @masked_load() {
>  ; KNL-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V16I32 = call <16 x i32> @llvm.masked.load.v16i32.p0v16i32(<16 x i32>* undef, i32 1, <16 x i1> undef, <16 x i32> undef)
>  ; KNL-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8I32 = call <8 x i32> @llvm.masked.load.v8i32.p0v8i32(<8 x i32>* undef, i32 1, <8 x i1> undef, <8 x i32> undef)
>  ; KNL-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4I32 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* undef, i32 1, <4 x i1> undef, <4 x i32> undef)
> -; KNL-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2I32 = call <2 x i32> @llvm.masked.load.v2i32.p0v2i32(<2 x i32>* undef, i32 1, <2 x i1> undef, <2 x i32> undef)
> +; KNL-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V2I32 = call <2 x i32> @llvm.masked.load.v2i32.p0v2i32(<2 x i32>* undef, i32 1, <2 x i1> undef, <2 x i32> undef)
>  ; KNL-NEXT:  Cost Model: Found an estimated cost of 128 for instruction: %V32I16 = call <32 x i16> @llvm.masked.load.v32i16.p0v32i16(<32 x i16>* undef, i32 1, <32 x i1> undef, <32 x i16> undef)
>  ; KNL-NEXT:  Cost Model: Found an estimated cost of 64 for instruction: %V16I16 = call <16 x i16> @llvm.masked.load.v16i16.p0v16i16(<16 x i16>* undef, i32 1, <16 x i1> undef, <16 x i16> undef)
>  ; KNL-NEXT:  Cost Model: Found an estimated cost of 32 for instruction: %V8I16 = call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* undef, i32 1, <8 x i1> undef, <8 x i16> undef)
> @@ -106,15 +106,15 @@ define i32 @masked_load() {
>  ; SKX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V16I32 = call <16 x i32> @llvm.masked.load.v16i32.p0v16i32(<16 x i32>* undef, i32 1, <16 x i1> undef, <16 x i32> undef)
>  ; SKX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8I32 = call <8 x i32> @llvm.masked.load.v8i32.p0v8i32(<8 x i32>* undef, i32 1, <8 x i1> undef, <8 x i32> undef)
>  ; SKX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4I32 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* undef, i32 1, <4 x i1> undef, <4 x i32> undef)
> -; SKX-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2I32 = call <2 x i32> @llvm.masked.load.v2i32.p0v2i32(<2 x i32>* undef, i32 1, <2 x i1> undef, <2 x i32> undef)
> +; SKX-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V2I32 = call <2 x i32> @llvm.masked.load.v2i32.p0v2i32(<2 x i32>* undef, i32 1, <2 x i1> undef, <2 x i32> undef)
>  ; SKX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V32I16 = call <32 x i16> @llvm.masked.load.v32i16.p0v32i16(<32 x i16>* undef, i32 1, <32 x i1> undef, <32 x i16> undef)
>  ; SKX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V16I16 = call <16 x i16> @llvm.masked.load.v16i16.p0v16i16(<16 x i16>* undef, i32 1, <16 x i1> undef, <16 x i16> undef)
>  ; SKX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8I16 = call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* undef, i32 1, <8 x i1> undef, <8 x i16> undef)
> -; SKX-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V4I16 = call <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>* undef, i32 1, <4 x i1> undef, <4 x i16> undef)
> +; SKX-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %V4I16 = call <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>* undef, i32 1, <4 x i1> undef, <4 x i16> undef)
>  ; SKX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V64I8 = call <64 x i8> @llvm.masked.load.v64i8.p0v64i8(<64 x i8>* undef, i32 1, <64 x i1> undef, <64 x i8> undef)
>  ; SKX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V32I8 = call <32 x i8> @llvm.masked.load.v32i8.p0v32i8(<32 x i8>* undef, i32 1, <32 x i1> undef, <32 x i8> undef)
>  ; SKX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V16I8 = call <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>* undef, i32 1, <16 x i1> undef, <16 x i8> undef)
> -; SKX-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V8I8 = call <8 x i8> @llvm.masked.load.v8i8.p0v8i8(<8 x i8>* undef, i32 1, <8 x i1> undef, <8 x i8> undef)
> +; SKX-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %V8I8 = call <8 x i8> @llvm.masked.load.v8i8.p0v8i8(<8 x i8>* undef, i32 1, <8 x i1> undef, <8 x i8> undef)
>  ; SKX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 0
>  ;
>    %V8F64 = call <8 x double> @llvm.masked.load.v8f64.p0v8f64(<8 x double>* undef, i32 1, <8 x i1> undef, <8 x double> undef)
> @@ -194,7 +194,7 @@ define i32 @masked_store() {
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 16 for instruction: call void @llvm.masked.store.v16i32.p0v16i32(<16 x i32> undef, <16 x i32>* undef, i32 1, <16 x i1> undef)
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: call void @llvm.masked.store.v8i32.p0v8i32(<8 x i32> undef, <8 x i32>* undef, i32 1, <8 x i1> undef)
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> undef, <4 x i32>* undef, i32 1, <4 x i1> undef)
> -; AVX-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: call void @llvm.masked.store.v2i32.p0v2i32(<2 x i32> undef, <2 x i32>* undef, i32 1, <2 x i1> undef)
> +; AVX-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: call void @llvm.masked.store.v2i32.p0v2i32(<2 x i32> undef, <2 x i32>* undef, i32 1, <2 x i1> undef)
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 128 for instruction: call void @llvm.masked.store.v32i16.p0v32i16(<32 x i16> undef, <32 x i16>* undef, i32 1, <32 x i1> undef)
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 64 for instruction: call void @llvm.masked.store.v16i16.p0v16i16(<16 x i16> undef, <16 x i16>* undef, i32 1, <16 x i1> undef)
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 32 for instruction: call void @llvm.masked.store.v8i16.p0v8i16(<8 x i16> undef, <8 x i16>* undef, i32 1, <8 x i1> undef)
> @@ -221,7 +221,7 @@ define i32 @masked_store() {
>  ; KNL-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: call void @llvm.masked.store.v16i32.p0v16i32(<16 x i32> undef, <16 x i32>* undef, i32 1, <16 x i1> undef)
>  ; KNL-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: call void @llvm.masked.store.v8i32.p0v8i32(<8 x i32> undef, <8 x i32>* undef, i32 1, <8 x i1> undef)
>  ; KNL-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> undef, <4 x i32>* undef, i32 1, <4 x i1> undef)
> -; KNL-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: call void @llvm.masked.store.v2i32.p0v2i32(<2 x i32> undef, <2 x i32>* undef, i32 1, <2 x i1> undef)
> +; KNL-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: call void @llvm.masked.store.v2i32.p0v2i32(<2 x i32> undef, <2 x i32>* undef, i32 1, <2 x i1> undef)
>  ; KNL-NEXT:  Cost Model: Found an estimated cost of 128 for instruction: call void @llvm.masked.store.v32i16.p0v32i16(<32 x i16> undef, <32 x i16>* undef, i32 1, <32 x i1> undef)
>  ; KNL-NEXT:  Cost Model: Found an estimated cost of 64 for instruction: call void @llvm.masked.store.v16i16.p0v16i16(<16 x i16> undef, <16 x i16>* undef, i32 1, <16 x i1> undef)
>  ; KNL-NEXT:  Cost Model: Found an estimated cost of 32 for instruction: call void @llvm.masked.store.v8i16.p0v8i16(<8 x i16> undef, <8 x i16>* undef, i32 1, <8 x i1> undef)
> @@ -248,15 +248,15 @@ define i32 @masked_store() {
>  ; SKX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: call void @llvm.masked.store.v16i32.p0v16i32(<16 x i32> undef, <16 x i32>* undef, i32 1, <16 x i1> undef)
>  ; SKX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: call void @llvm.masked.store.v8i32.p0v8i32(<8 x i32> undef, <8 x i32>* undef, i32 1, <8 x i1> undef)
>  ; SKX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> undef, <4 x i32>* undef, i32 1, <4 x i1> undef)
> -; SKX-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: call void @llvm.masked.store.v2i32.p0v2i32(<2 x i32> undef, <2 x i32>* undef, i32 1, <2 x i1> undef)
> +; SKX-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: call void @llvm.masked.store.v2i32.p0v2i32(<2 x i32> undef, <2 x i32>* undef, i32 1, <2 x i1> undef)
>  ; SKX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: call void @llvm.masked.store.v32i16.p0v32i16(<32 x i16> undef, <32 x i16>* undef, i32 1, <32 x i1> undef)
>  ; SKX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: call void @llvm.masked.store.v16i16.p0v16i16(<16 x i16> undef, <16 x i16>* undef, i32 1, <16 x i1> undef)
>  ; SKX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: call void @llvm.masked.store.v8i16.p0v8i16(<8 x i16> undef, <8 x i16>* undef, i32 1, <8 x i1> undef)
> -; SKX-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: call void @llvm.masked.store.v4i16.p0v4i16(<4 x i16> undef, <4 x i16>* undef, i32 1, <4 x i1> undef)
> +; SKX-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: call void @llvm.masked.store.v4i16.p0v4i16(<4 x i16> undef, <4 x i16>* undef, i32 1, <4 x i1> undef)
>  ; SKX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: call void @llvm.masked.store.v64i8.p0v64i8(<64 x i8> undef, <64 x i8>* undef, i32 1, <64 x i1> undef)
>  ; SKX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: call void @llvm.masked.store.v32i8.p0v32i8(<32 x i8> undef, <32 x i8>* undef, i32 1, <32 x i1> undef)
>  ; SKX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: call void @llvm.masked.store.v16i8.p0v16i8(<16 x i8> undef, <16 x i8>* undef, i32 1, <16 x i1> undef)
> -; SKX-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: call void @llvm.masked.store.v8i8.p0v8i8(<8 x i8> undef, <8 x i8>* undef, i32 1, <8 x i1> undef)
> +; SKX-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: call void @llvm.masked.store.v8i8.p0v8i8(<8 x i8> undef, <8 x i8>* undef, i32 1, <8 x i1> undef)
>  ; SKX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 0
>  ;
>    call void @llvm.masked.store.v8f64.p0v8f64(<8 x double> undef, <8 x double>* undef, i32 1, <8 x i1> undef)
> @@ -960,15 +960,10 @@ define <8 x float> @test4(<8 x i32> %tri
>  }
>
>  define void @test5(<2 x i32> %trigger, <2 x float>* %addr, <2 x float> %val) {
> -; SSE2-LABEL: 'test5'
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: call void @llvm.masked.store.v2f32.p0v2f32(<2 x float> %val, <2 x float>* %addr, i32 4, <2 x i1> %mask)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
> -;
> -; SSE42-LABEL: 'test5'
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: call void @llvm.masked.store.v2f32.p0v2f32(<2 x float> %val, <2 x float>* %addr, i32 4, <2 x i1> %mask)
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
> +; SSE-LABEL: 'test5'
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: call void @llvm.masked.store.v2f32.p0v2f32(<2 x float> %val, <2 x float>* %addr, i32 4, <2 x i1> %mask)
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
>  ;
>  ; AVX-LABEL: 'test5'
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> @@ -986,24 +981,19 @@ define void @test5(<2 x i32> %trigger, <
>  }
>
>  define void @test6(<2 x i32> %trigger, <2 x i32>* %addr, <2 x i32> %val) {
> -; SSE2-LABEL: 'test6'
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: call void @llvm.masked.store.v2i32.p0v2i32(<2 x i32> %val, <2 x i32>* %addr, i32 4, <2 x i1> %mask)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
> -;
> -; SSE42-LABEL: 'test6'
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: call void @llvm.masked.store.v2i32.p0v2i32(<2 x i32> %val, <2 x i32>* %addr, i32 4, <2 x i1> %mask)
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
> +; SSE-LABEL: 'test6'
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: call void @llvm.masked.store.v2i32.p0v2i32(<2 x i32> %val, <2 x i32>* %addr, i32 4, <2 x i1> %mask)
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
>  ;
>  ; AVX-LABEL: 'test6'
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> -; AVX-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: call void @llvm.masked.store.v2i32.p0v2i32(<2 x i32> %val, <2 x i32>* %addr, i32 4, <2 x i1> %mask)
> +; AVX-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: call void @llvm.masked.store.v2i32.p0v2i32(<2 x i32> %val, <2 x i32>* %addr, i32 4, <2 x i1> %mask)
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
>  ;
>  ; AVX512-LABEL: 'test6'
>  ; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> -; AVX512-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: call void @llvm.masked.store.v2i32.p0v2i32(<2 x i32> %val, <2 x i32>* %addr, i32 4, <2 x i1> %mask)
> +; AVX512-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: call void @llvm.masked.store.v2i32.p0v2i32(<2 x i32> %val, <2 x i32>* %addr, i32 4, <2 x i1> %mask)
>  ; AVX512-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
>  ;
>    %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> @@ -1012,15 +1002,10 @@ define void @test6(<2 x i32> %trigger, <
>  }
>
>  define <2 x float> @test7(<2 x i32> %trigger, <2 x float>* %addr, <2 x float> %dst) {
> -; SSE2-LABEL: 'test7'
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %res = call <2 x float> @llvm.masked.load.v2f32.p0v2f32(<2 x float>* %addr, i32 4, <2 x i1> %mask, <2 x float> %dst)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <2 x float> %res
> -;
> -; SSE42-LABEL: 'test7'
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %res = call <2 x float> @llvm.masked.load.v2f32.p0v2f32(<2 x float>* %addr, i32 4, <2 x i1> %mask, <2 x float> %dst)
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <2 x float> %res
> +; SSE-LABEL: 'test7'
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %res = call <2 x float> @llvm.masked.load.v2f32.p0v2f32(<2 x float>* %addr, i32 4, <2 x i1> %mask, <2 x float> %dst)
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <2 x float> %res
>  ;
>  ; AVX-LABEL: 'test7'
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> @@ -1038,24 +1023,19 @@ define <2 x float> @test7(<2 x i32> %tri
>  }
>
>  define <2 x i32> @test8(<2 x i32> %trigger, <2 x i32>* %addr, <2 x i32> %dst) {
> -; SSE2-LABEL: 'test8'
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %res = call <2 x i32> @llvm.masked.load.v2i32.p0v2i32(<2 x i32>* %addr, i32 4, <2 x i1> %mask, <2 x i32> %dst)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i32> %res
> -;
> -; SSE42-LABEL: 'test8'
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %res = call <2 x i32> @llvm.masked.load.v2i32.p0v2i32(<2 x i32>* %addr, i32 4, <2 x i1> %mask, <2 x i32> %dst)
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i32> %res
> +; SSE-LABEL: 'test8'
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %res = call <2 x i32> @llvm.masked.load.v2i32.p0v2i32(<2 x i32>* %addr, i32 4, <2 x i1> %mask, <2 x i32> %dst)
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i32> %res
>  ;
>  ; AVX-LABEL: 'test8'
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> -; AVX-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %res = call <2 x i32> @llvm.masked.load.v2i32.p0v2i32(<2 x i32>* %addr, i32 4, <2 x i1> %mask, <2 x i32> %dst)
> +; AVX-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %res = call <2 x i32> @llvm.masked.load.v2i32.p0v2i32(<2 x i32>* %addr, i32 4, <2 x i1> %mask, <2 x i32> %dst)
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i32> %res
>  ;
>  ; AVX512-LABEL: 'test8'
>  ; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %mask = icmp eq <2 x i32> %trigger, zeroinitializer
> -; AVX512-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %res = call <2 x i32> @llvm.masked.load.v2i32.p0v2i32(<2 x i32>* %addr, i32 4, <2 x i1> %mask, <2 x i32> %dst)
> +; AVX512-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %res = call <2 x i32> @llvm.masked.load.v2i32.p0v2i32(<2 x i32>* %addr, i32 4, <2 x i1> %mask, <2 x i32> %dst)
>  ; AVX512-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i32> %res
>  ;
>    %mask = icmp eq <2 x i32> %trigger, zeroinitializer
>
> Removed: llvm/trunk/test/Analysis/CostModel/X86/reduce-add-widen.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/reduce-add-widen.ll?rev=368182&view=auto
> ==============================================================================
> --- llvm/trunk/test/Analysis/CostModel/X86/reduce-add-widen.ll (original)
> +++ llvm/trunk/test/Analysis/CostModel/X86/reduce-add-widen.ll (removed)
> @@ -1,307 +0,0 @@
> -; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py
> -; RUN: opt < %s -x86-experimental-vector-widening-legalization -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+sse2 | FileCheck %s --check-prefixes=CHECK,SSE,SSE2
> -; RUN: opt < %s -x86-experimental-vector-widening-legalization -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+ssse3 | FileCheck %s --check-prefixes=CHECK,SSE,SSSE3
> -; RUN: opt < %s -x86-experimental-vector-widening-legalization -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+sse4.2 | FileCheck %s --check-prefixes=CHECK,SSE,SSE42
> -; RUN: opt < %s -x86-experimental-vector-widening-legalization -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+avx | FileCheck %s --check-prefixes=CHECK,AVX,AVX1
> -; RUN: opt < %s -x86-experimental-vector-widening-legalization -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+avx2 | FileCheck %s --check-prefixes=CHECK,AVX,AVX2
> -; RUN: opt < %s -x86-experimental-vector-widening-legalization -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+avx512f | FileCheck %s --check-prefixes=CHECK,AVX512,AVX512F
> -; RUN: opt < %s -x86-experimental-vector-widening-legalization -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+avx512f,+avx512bw | FileCheck %s --check-prefixes=CHECK,AVX512,AVX512BW
> -; RUN: opt < %s -x86-experimental-vector-widening-legalization -cost-model -mtriple=x86_64-apple-darwin -analyze -mattr=+avx512f,+avx512dq | FileCheck %s --check-prefixes=CHECK,AVX512,AVX512DQ
> -
> -define i32 @reduce_i64(i32 %arg) {
> -; SSE2-LABEL: 'reduce_i64'
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: %V1 = call i64 @llvm.experimental.vector.reduce.add.v1i64(<1 x i64> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i64 @llvm.experimental.vector.reduce.add.v2i64(<2 x i64> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V4 = call i64 @llvm.experimental.vector.reduce.add.v4i64(<4 x i64> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V8 = call i64 @llvm.experimental.vector.reduce.add.v8i64(<8 x i64> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V16 = call i64 @llvm.experimental.vector.reduce.add.v16i64(<16 x i64> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> -;
> -; SSSE3-LABEL: 'reduce_i64'
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: %V1 = call i64 @llvm.experimental.vector.reduce.add.v1i64(<1 x i64> undef)
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i64 @llvm.experimental.vector.reduce.add.v2i64(<2 x i64> undef)
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V4 = call i64 @llvm.experimental.vector.reduce.add.v4i64(<4 x i64> undef)
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V8 = call i64 @llvm.experimental.vector.reduce.add.v8i64(<8 x i64> undef)
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V16 = call i64 @llvm.experimental.vector.reduce.add.v16i64(<16 x i64> undef)
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> -;
> -; SSE42-LABEL: 'reduce_i64'
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: %V1 = call i64 @llvm.experimental.vector.reduce.add.v1i64(<1 x i64> undef)
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V2 = call i64 @llvm.experimental.vector.reduce.add.v2i64(<2 x i64> undef)
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V4 = call i64 @llvm.experimental.vector.reduce.add.v4i64(<4 x i64> undef)
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V8 = call i64 @llvm.experimental.vector.reduce.add.v8i64(<8 x i64> undef)
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 16 for instruction: %V16 = call i64 @llvm.experimental.vector.reduce.add.v16i64(<16 x i64> undef)
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> -;
> -; AVX-LABEL: 'reduce_i64'
> -; AVX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: %V1 = call i64 @llvm.experimental.vector.reduce.add.v1i64(<1 x i64> undef)
> -; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i64 @llvm.experimental.vector.reduce.add.v2i64(<2 x i64> undef)
> -; AVX-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V4 = call i64 @llvm.experimental.vector.reduce.add.v4i64(<4 x i64> undef)
> -; AVX-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V8 = call i64 @llvm.experimental.vector.reduce.add.v8i64(<8 x i64> undef)
> -; AVX-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V16 = call i64 @llvm.experimental.vector.reduce.add.v16i64(<16 x i64> undef)
> -; AVX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> -;
> -; AVX512-LABEL: 'reduce_i64'
> -; AVX512-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: %V1 = call i64 @llvm.experimental.vector.reduce.add.v1i64(<1 x i64> undef)
> -; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i64 @llvm.experimental.vector.reduce.add.v2i64(<2 x i64> undef)
> -; AVX512-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V4 = call i64 @llvm.experimental.vector.reduce.add.v4i64(<4 x i64> undef)
> -; AVX512-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V8 = call i64 @llvm.experimental.vector.reduce.add.v8i64(<8 x i64> undef)
> -; AVX512-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V16 = call i64 @llvm.experimental.vector.reduce.add.v16i64(<16 x i64> undef)
> -; AVX512-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> -;
> -  %V1  = call i64 @llvm.experimental.vector.reduce.add.v1i64(<1 x i64> undef)
> -  %V2  = call i64 @llvm.experimental.vector.reduce.add.v2i64(<2 x i64> undef)
> -  %V4  = call i64 @llvm.experimental.vector.reduce.add.v4i64(<4 x i64> undef)
> -  %V8  = call i64 @llvm.experimental.vector.reduce.add.v8i64(<8 x i64> undef)
> -  %V16 = call i64 @llvm.experimental.vector.reduce.add.v16i64(<16 x i64> undef)
> -  ret i32 undef
> -}
> -
> -define i32 @reduce_i32(i32 %arg) {
> -; SSE2-LABEL: 'reduce_i32'
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.add.v2i32(<2 x i32> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4 = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V8 = call i32 @llvm.experimental.vector.reduce.add.v8i32(<8 x i32> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V16 = call i32 @llvm.experimental.vector.reduce.add.v16i32(<16 x i32> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V32 = call i32 @llvm.experimental.vector.reduce.add.v32i32(<32 x i32> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> -;
> -; SSSE3-LABEL: 'reduce_i32'
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.add.v2i32(<2 x i32> undef)
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4 = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> undef)
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V8 = call i32 @llvm.experimental.vector.reduce.add.v8i32(<8 x i32> undef)
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V16 = call i32 @llvm.experimental.vector.reduce.add.v16i32(<16 x i32> undef)
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V32 = call i32 @llvm.experimental.vector.reduce.add.v32i32(<32 x i32> undef)
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> -;
> -; SSE42-LABEL: 'reduce_i32'
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.add.v2i32(<2 x i32> undef)
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V4 = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> undef)
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V8 = call i32 @llvm.experimental.vector.reduce.add.v8i32(<8 x i32> undef)
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V16 = call i32 @llvm.experimental.vector.reduce.add.v16i32(<16 x i32> undef)
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 24 for instruction: %V32 = call i32 @llvm.experimental.vector.reduce.add.v32i32(<32 x i32> undef)
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> -;
> -; AVX-LABEL: 'reduce_i32'
> -; AVX-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.add.v2i32(<2 x i32> undef)
> -; AVX-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V4 = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> undef)
> -; AVX-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V8 = call i32 @llvm.experimental.vector.reduce.add.v8i32(<8 x i32> undef)
> -; AVX-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V16 = call i32 @llvm.experimental.vector.reduce.add.v16i32(<16 x i32> undef)
> -; AVX-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %V32 = call i32 @llvm.experimental.vector.reduce.add.v32i32(<32 x i32> undef)
> -; AVX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> -;
> -; AVX512-LABEL: 'reduce_i32'
> -; AVX512-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.add.v2i32(<2 x i32> undef)
> -; AVX512-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V4 = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> undef)
> -; AVX512-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V8 = call i32 @llvm.experimental.vector.reduce.add.v8i32(<8 x i32> undef)
> -; AVX512-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %V16 = call i32 @llvm.experimental.vector.reduce.add.v16i32(<16 x i32> undef)
> -; AVX512-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V32 = call i32 @llvm.experimental.vector.reduce.add.v32i32(<32 x i32> undef)
> -; AVX512-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> -;
> -  %V2  = call i32 @llvm.experimental.vector.reduce.add.v2i32(<2 x i32> undef)
> -  %V4  = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> undef)
> -  %V8  = call i32 @llvm.experimental.vector.reduce.add.v8i32(<8 x i32> undef)
> -  %V16 = call i32 @llvm.experimental.vector.reduce.add.v16i32(<16 x i32> undef)
> -  %V32 = call i32 @llvm.experimental.vector.reduce.add.v32i32(<32 x i32> undef)
> -  ret i32 undef
> -}
> -
> -define i32 @reduce_i16(i32 %arg) {
> -; SSE2-LABEL: 'reduce_i16'
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x i16> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 13 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.add.v4i16(<4 x i16> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 19 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.add.v8i16(<8 x i16> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.add.v16i16(<16 x i16> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 22 for instruction: %V32 = call i16 @llvm.experimental.vector.reduce.add.v32i16(<32 x i16> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 26 for instruction: %V64 = call i16 @llvm.experimental.vector.reduce.add.v64i16(<64 x i16> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> -;
> -; SSSE3-LABEL: 'reduce_i16'
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x i16> undef)
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.add.v4i16(<4 x i16> undef)
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.add.v8i16(<8 x i16> undef)
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.add.v16i16(<16 x i16> undef)
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V32 = call i16 @llvm.experimental.vector.reduce.add.v32i16(<32 x i16> undef)
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 14 for instruction: %V64 = call i16 @llvm.experimental.vector.reduce.add.v64i16(<64 x i16> undef)
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> -;
> -; SSE42-LABEL: 'reduce_i16'
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x i16> undef)
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.add.v4i16(<4 x i16> undef)
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.add.v8i16(<8 x i16> undef)
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.add.v16i16(<16 x i16> undef)
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 16 for instruction: %V32 = call i16 @llvm.experimental.vector.reduce.add.v32i16(<32 x i16> undef)
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 32 for instruction: %V64 = call i16 @llvm.experimental.vector.reduce.add.v64i16(<64 x i16> undef)
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> -;
> -; AVX1-LABEL: 'reduce_i16'
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x i16> undef)
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.add.v4i16(<4 x i16> undef)
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.add.v8i16(<8 x i16> undef)
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 49 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.add.v16i16(<16 x i16> undef)
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 53 for instruction: %V32 = call i16 @llvm.experimental.vector.reduce.add.v32i16(<32 x i16> undef)
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 61 for instruction: %V64 = call i16 @llvm.experimental.vector.reduce.add.v64i16(<64 x i16> undef)
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> -;
> -; AVX2-LABEL: 'reduce_i16'
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x i16> undef)
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.add.v4i16(<4 x i16> undef)
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.add.v8i16(<8 x i16> undef)
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 21 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.add.v16i16(<16 x i16> undef)
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 22 for instruction: %V32 = call i16 @llvm.experimental.vector.reduce.add.v32i16(<32 x i16> undef)
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 24 for instruction: %V64 = call i16 @llvm.experimental.vector.reduce.add.v64i16(<64 x i16> undef)
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> -;
> -; AVX512F-LABEL: 'reduce_i16'
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x i16> undef)
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.add.v4i16(<4 x i16> undef)
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.add.v8i16(<8 x i16> undef)
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 21 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.add.v16i16(<16 x i16> undef)
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 22 for instruction: %V32 = call i16 @llvm.experimental.vector.reduce.add.v32i16(<32 x i16> undef)
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 24 for instruction: %V64 = call i16 @llvm.experimental.vector.reduce.add.v64i16(<64 x i16> undef)
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> -;
> -; AVX512BW-LABEL: 'reduce_i16'
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x i16> undef)
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.add.v4i16(<4 x i16> undef)
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.add.v8i16(<8 x i16> undef)
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.add.v16i16(<16 x i16> undef)
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 11 for instruction: %V32 = call i16 @llvm.experimental.vector.reduce.add.v32i16(<32 x i16> undef)
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V64 = call i16 @llvm.experimental.vector.reduce.add.v64i16(<64 x i16> undef)
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> -;
> -; AVX512DQ-LABEL: 'reduce_i16'
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x i16> undef)
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.add.v4i16(<4 x i16> undef)
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.add.v8i16(<8 x i16> undef)
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 21 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.add.v16i16(<16 x i16> undef)
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 22 for instruction: %V32 = call i16 @llvm.experimental.vector.reduce.add.v32i16(<32 x i16> undef)
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 24 for instruction: %V64 = call i16 @llvm.experimental.vector.reduce.add.v64i16(<64 x i16> undef)
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> -;
> -  %V2  = call i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x i16> undef)
> -  %V4  = call i16 @llvm.experimental.vector.reduce.add.v4i16(<4 x i16> undef)
> -  %V8  = call i16 @llvm.experimental.vector.reduce.add.v8i16(<8 x i16> undef)
> -  %V16 = call i16 @llvm.experimental.vector.reduce.add.v16i16(<16 x i16> undef)
> -  %V32 = call i16 @llvm.experimental.vector.reduce.add.v32i16(<32 x i16> undef)
> -  %V64 = call i16 @llvm.experimental.vector.reduce.add.v64i16(<64 x i16> undef)
> -  ret i32 undef
> -}
> -
> -define i32 @reduce_i8(i32 %arg) {
> -; SSE2-LABEL: 'reduce_i8'
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x i8> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 23 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x i8> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 34 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x i8> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 45 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.add.v16i8(<16 x i8> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 46 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.add.v32i8(<32 x i8> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 48 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.add.v64i8(<64 x i8> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 52 for instruction: %V128 = call i8 @llvm.experimental.vector.reduce.add.v128i8(<128 x i8> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> -;
> -; SSSE3-LABEL: 'reduce_i8'
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x i8> undef)
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x i8> undef)
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x i8> undef)
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.add.v16i8(<16 x i8> undef)
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.add.v32i8(<32 x i8> undef)
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.add.v64i8(<64 x i8> undef)
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 16 for instruction: %V128 = call i8 @llvm.experimental.vector.reduce.add.v128i8(<128 x i8> undef)
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> -;
> -; SSE42-LABEL: 'reduce_i8'
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x i8> undef)
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x i8> undef)
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x i8> undef)
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.add.v16i8(<16 x i8> undef)
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.add.v32i8(<32 x i8> undef)
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.add.v64i8(<64 x i8> undef)
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 16 for instruction: %V128 = call i8 @llvm.experimental.vector.reduce.add.v128i8(<128 x i8> undef)
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> -;
> -; AVX1-LABEL: 'reduce_i8'
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x i8> undef)
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x i8> undef)
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x i8> undef)
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.add.v16i8(<16 x i8> undef)
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 61 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.add.v32i8(<32 x i8> undef)
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 65 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.add.v64i8(<64 x i8> undef)
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 73 for instruction: %V128 = call i8 @llvm.experimental.vector.reduce.add.v128i8(<128 x i8> undef)
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> -;
> -; AVX2-LABEL: 'reduce_i8'
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x i8> undef)
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x i8> undef)
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x i8> undef)
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.add.v16i8(<16 x i8> undef)
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 26 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.add.v32i8(<32 x i8> undef)
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 27 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.add.v64i8(<64 x i8> undef)
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 29 for instruction: %V128 = call i8 @llvm.experimental.vector.reduce.add.v128i8(<128 x i8> undef)
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> -;
> -; AVX512F-LABEL: 'reduce_i8'
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x i8> undef)
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x i8> undef)
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x i8> undef)
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.add.v16i8(<16 x i8> undef)
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 26 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.add.v32i8(<32 x i8> undef)
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 27 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.add.v64i8(<64 x i8> undef)
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 29 for instruction: %V128 = call i8 @llvm.experimental.vector.reduce.add.v128i8(<128 x i8> undef)
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> -;
> -; AVX512BW-LABEL: 'reduce_i8'
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x i8> undef)
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x i8> undef)
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x i8> undef)
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.add.v16i8(<16 x i8> undef)
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 21 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.add.v32i8(<32 x i8> undef)
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 55 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.add.v64i8(<64 x i8> undef)
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 56 for instruction: %V128 = call i8 @llvm.experimental.vector.reduce.add.v128i8(<128 x i8> undef)
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> -;
> -; AVX512DQ-LABEL: 'reduce_i8'
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x i8> undef)
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x i8> undef)
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x i8> undef)
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.add.v16i8(<16 x i8> undef)
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 26 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.add.v32i8(<32 x i8> undef)
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 27 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.add.v64i8(<64 x i8> undef)
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 29 for instruction: %V128 = call i8 @llvm.experimental.vector.reduce.add.v128i8(<128 x i8> undef)
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> -;
> -  %V2   = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x i8> undef)
> -  %V4   = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x i8> undef)
> -  %V8   = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x i8> undef)
> -  %V16  = call i8 @llvm.experimental.vector.reduce.add.v16i8(<16 x i8> undef)
> -  %V32  = call i8 @llvm.experimental.vector.reduce.add.v32i8(<32 x i8> undef)
> -  %V64  = call i8 @llvm.experimental.vector.reduce.add.v64i8(<64 x i8> undef)
> -  %V128 = call i8 @llvm.experimental.vector.reduce.add.v128i8(<128 x i8> undef)
> -  ret i32 undef
> -}
> -
> -declare i64 @llvm.experimental.vector.reduce.add.v1i64(<1 x i64>)
> -declare i64 @llvm.experimental.vector.reduce.add.v2i64(<2 x i64>)
> -declare i64 @llvm.experimental.vector.reduce.add.v4i64(<4 x i64>)
> -declare i64 @llvm.experimental.vector.reduce.add.v8i64(<8 x i64>)
> -declare i64 @llvm.experimental.vector.reduce.add.v16i64(<16 x i64>)
> -
> -declare i32 @llvm.experimental.vector.reduce.add.v2i32(<2 x i32>)
> -declare i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32>)
> -declare i32 @llvm.experimental.vector.reduce.add.v8i32(<8 x i32>)
> -declare i32 @llvm.experimental.vector.reduce.add.v16i32(<16 x i32>)
> -declare i32 @llvm.experimental.vector.reduce.add.v32i32(<32 x i32>)
> -
> -declare i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x i16>)
> -declare i16 @llvm.experimental.vector.reduce.add.v4i16(<4 x i16>)
> -declare i16 @llvm.experimental.vector.reduce.add.v8i16(<8 x i16>)
> -declare i16 @llvm.experimental.vector.reduce.add.v16i16(<16 x i16>)
> -declare i16 @llvm.experimental.vector.reduce.add.v32i16(<32 x i16>)
> -declare i16 @llvm.experimental.vector.reduce.add.v64i16(<64 x i16>)
> -
> -declare i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x i8>)
> -declare i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x i8>)
> -declare i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x i8>)
> -declare i8 @llvm.experimental.vector.reduce.add.v16i8(<16 x i8>)
> -declare i8 @llvm.experimental.vector.reduce.add.v32i8(<32 x i8>)
> -declare i8 @llvm.experimental.vector.reduce.add.v64i8(<64 x i8>)
> -declare i8 @llvm.experimental.vector.reduce.add.v128i8(<128 x i8>)
>
> Modified: llvm/trunk/test/Analysis/CostModel/X86/reduce-add.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/reduce-add.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/Analysis/CostModel/X86/reduce-add.ll (original)
> +++ llvm/trunk/test/Analysis/CostModel/X86/reduce-add.ll Wed Aug  7 09:24:26 2019
> @@ -83,7 +83,7 @@ define i32 @reduce_i32(i32 %arg) {
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX-LABEL: 'reduce_i32'
> -; AVX-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.add.v2i32(<2 x i32> undef)
> +; AVX-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.add.v2i32(<2 x i32> undef)
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V4 = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> undef)
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V8 = call i32 @llvm.experimental.vector.reduce.add.v8i32(<8 x i32> undef)
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V16 = call i32 @llvm.experimental.vector.reduce.add.v16i32(<16 x i32> undef)
> @@ -91,7 +91,7 @@ define i32 @reduce_i32(i32 %arg) {
>  ; AVX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX512-LABEL: 'reduce_i32'
> -; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.add.v2i32(<2 x i32> undef)
> +; AVX512-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.add.v2i32(<2 x i32> undef)
>  ; AVX512-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V4 = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> undef)
>  ; AVX512-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V8 = call i32 @llvm.experimental.vector.reduce.add.v8i32(<8 x i32> undef)
>  ; AVX512-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %V16 = call i32 @llvm.experimental.vector.reduce.add.v16i32(<16 x i32> undef)
> @@ -108,8 +108,8 @@ define i32 @reduce_i32(i32 %arg) {
>
>  define i32 @reduce_i16(i32 %arg) {
>  ; SSE2-LABEL: 'reduce_i16'
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x i16> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.add.v4i16(<4 x i16> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x i16> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 13 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.add.v4i16(<4 x i16> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 19 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.add.v8i16(<8 x i16> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.add.v16i16(<16 x i16> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 22 for instruction: %V32 = call i16 @llvm.experimental.vector.reduce.add.v32i16(<32 x i16> undef)
> @@ -135,7 +135,7 @@ define i32 @reduce_i16(i32 %arg) {
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX1-LABEL: 'reduce_i16'
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x i16> undef)
> +; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x i16> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.add.v4i16(<4 x i16> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.add.v8i16(<8 x i16> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 49 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.add.v16i16(<16 x i16> undef)
> @@ -144,7 +144,7 @@ define i32 @reduce_i16(i32 %arg) {
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX2-LABEL: 'reduce_i16'
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x i16> undef)
> +; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x i16> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.add.v4i16(<4 x i16> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.add.v8i16(<8 x i16> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 21 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.add.v16i16(<16 x i16> undef)
> @@ -153,7 +153,7 @@ define i32 @reduce_i16(i32 %arg) {
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX512F-LABEL: 'reduce_i16'
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x i16> undef)
> +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x i16> undef)
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.add.v4i16(<4 x i16> undef)
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.add.v8i16(<8 x i16> undef)
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 21 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.add.v16i16(<16 x i16> undef)
> @@ -162,7 +162,7 @@ define i32 @reduce_i16(i32 %arg) {
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX512BW-LABEL: 'reduce_i16'
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x i16> undef)
> +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x i16> undef)
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.add.v4i16(<4 x i16> undef)
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.add.v8i16(<8 x i16> undef)
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.add.v16i16(<16 x i16> undef)
> @@ -171,7 +171,7 @@ define i32 @reduce_i16(i32 %arg) {
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX512DQ-LABEL: 'reduce_i16'
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x i16> undef)
> +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.add.v2i16(<2 x i16> undef)
>  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.add.v4i16(<4 x i16> undef)
>  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.add.v8i16(<8 x i16> undef)
>  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 21 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.add.v16i16(<16 x i16> undef)
> @@ -190,9 +190,9 @@ define i32 @reduce_i16(i32 %arg) {
>
>  define i32 @reduce_i8(i32 %arg) {
>  ; SSE2-LABEL: 'reduce_i8'
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x i8> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x i8> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 19 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x i8> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x i8> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 23 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x i8> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 34 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x i8> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 45 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.add.v16i8(<16 x i8> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 46 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.add.v32i8(<32 x i8> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 48 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.add.v64i8(<64 x i8> undef)
> @@ -210,9 +210,9 @@ define i32 @reduce_i8(i32 %arg) {
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; SSE42-LABEL: 'reduce_i8'
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x i8> undef)
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x i8> undef)
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x i8> undef)
> +; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x i8> undef)
> +; SSE42-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x i8> undef)
> +; SSE42-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x i8> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.add.v16i8(<16 x i8> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.add.v32i8(<32 x i8> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.add.v64i8(<64 x i8> undef)
> @@ -220,9 +220,9 @@ define i32 @reduce_i8(i32 %arg) {
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX1-LABEL: 'reduce_i8'
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x i8> undef)
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x i8> undef)
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x i8> undef)
> +; AVX1-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x i8> undef)
> +; AVX1-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x i8> undef)
> +; AVX1-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x i8> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.add.v16i8(<16 x i8> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 61 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.add.v32i8(<32 x i8> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 65 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.add.v64i8(<64 x i8> undef)
> @@ -230,9 +230,9 @@ define i32 @reduce_i8(i32 %arg) {
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX2-LABEL: 'reduce_i8'
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x i8> undef)
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x i8> undef)
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x i8> undef)
> +; AVX2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x i8> undef)
> +; AVX2-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x i8> undef)
> +; AVX2-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x i8> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.add.v16i8(<16 x i8> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 26 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.add.v32i8(<32 x i8> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 27 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.add.v64i8(<64 x i8> undef)
> @@ -240,9 +240,9 @@ define i32 @reduce_i8(i32 %arg) {
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX512F-LABEL: 'reduce_i8'
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x i8> undef)
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x i8> undef)
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x i8> undef)
> +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x i8> undef)
> +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x i8> undef)
> +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x i8> undef)
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.add.v16i8(<16 x i8> undef)
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 26 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.add.v32i8(<32 x i8> undef)
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 27 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.add.v64i8(<64 x i8> undef)
> @@ -250,9 +250,9 @@ define i32 @reduce_i8(i32 %arg) {
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX512BW-LABEL: 'reduce_i8'
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x i8> undef)
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x i8> undef)
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x i8> undef)
> +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x i8> undef)
> +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x i8> undef)
> +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x i8> undef)
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.add.v16i8(<16 x i8> undef)
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 21 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.add.v32i8(<32 x i8> undef)
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 55 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.add.v64i8(<64 x i8> undef)
> @@ -260,9 +260,9 @@ define i32 @reduce_i8(i32 %arg) {
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX512DQ-LABEL: 'reduce_i8'
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x i8> undef)
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x i8> undef)
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x i8> undef)
> +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.add.v2i8(<2 x i8> undef)
> +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.add.v4i8(<4 x i8> undef)
> +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x i8> undef)
>  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.add.v16i8(<16 x i8> undef)
>  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 26 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.add.v32i8(<32 x i8> undef)
>  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 27 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.add.v64i8(<64 x i8> undef)
>
> Modified: llvm/trunk/test/Analysis/CostModel/X86/reduce-and.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/reduce-and.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/Analysis/CostModel/X86/reduce-and.ll (original)
> +++ llvm/trunk/test/Analysis/CostModel/X86/reduce-and.ll Wed Aug  7 09:24:26 2019
> @@ -92,8 +92,8 @@ define i32 @reduce_i32(i32 %arg) {
>
>  define i32 @reduce_i16(i32 %arg) {
>  ; SSE2-LABEL: 'reduce_i16'
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.and.v2i16(<2 x i16> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.and.v4i16(<4 x i16> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.and.v2i16(<2 x i16> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 13 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.and.v4i16(<4 x i16> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 19 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.and.v8i16(<8 x i16> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.and.v16i16(<16 x i16> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 22 for instruction: %V32 = call i16 @llvm.experimental.vector.reduce.and.v32i16(<32 x i16> undef)
> @@ -174,9 +174,9 @@ define i32 @reduce_i16(i32 %arg) {
>
>  define i32 @reduce_i8(i32 %arg) {
>  ; SSE2-LABEL: 'reduce_i8'
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.and.v2i8(<2 x i8> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.and.v4i8(<4 x i8> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 19 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.and.v8i8(<8 x i8> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.and.v2i8(<2 x i8> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 23 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.and.v4i8(<4 x i8> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 34 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.and.v8i8(<8 x i8> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 45 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.and.v16i8(<16 x i8> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 46 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.and.v32i8(<32 x i8> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 48 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.and.v64i8(<64 x i8> undef)
>
> Modified: llvm/trunk/test/Analysis/CostModel/X86/reduce-mul.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/reduce-mul.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/Analysis/CostModel/X86/reduce-mul.ll (original)
> +++ llvm/trunk/test/Analysis/CostModel/X86/reduce-mul.ll Wed Aug  7 09:24:26 2019
> @@ -67,7 +67,7 @@ define i32 @reduce_i64(i32 %arg) {
>
>  define i32 @reduce_i32(i32 %arg) {
>  ; SSE2-LABEL: 'reduce_i32'
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.mul.v2i32(<2 x i32> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.mul.v2i32(<2 x i32> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %V4 = call i32 @llvm.experimental.vector.reduce.mul.v4i32(<4 x i32> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 21 for instruction: %V8 = call i32 @llvm.experimental.vector.reduce.mul.v8i32(<8 x i32> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 33 for instruction: %V16 = call i32 @llvm.experimental.vector.reduce.mul.v16i32(<16 x i32> undef)
> @@ -75,7 +75,7 @@ define i32 @reduce_i32(i32 %arg) {
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; SSSE3-LABEL: 'reduce_i32'
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.mul.v2i32(<2 x i32> undef)
> +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.mul.v2i32(<2 x i32> undef)
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %V4 = call i32 @llvm.experimental.vector.reduce.mul.v4i32(<4 x i32> undef)
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 21 for instruction: %V8 = call i32 @llvm.experimental.vector.reduce.mul.v8i32(<8 x i32> undef)
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 33 for instruction: %V16 = call i32 @llvm.experimental.vector.reduce.mul.v16i32(<16 x i32> undef)
> @@ -83,7 +83,7 @@ define i32 @reduce_i32(i32 %arg) {
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; SSE42-LABEL: 'reduce_i32'
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.mul.v2i32(<2 x i32> undef)
> +; SSE42-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.mul.v2i32(<2 x i32> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V4 = call i32 @llvm.experimental.vector.reduce.mul.v4i32(<4 x i32> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %V8 = call i32 @llvm.experimental.vector.reduce.mul.v8i32(<8 x i32> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 13 for instruction: %V16 = call i32 @llvm.experimental.vector.reduce.mul.v16i32(<16 x i32> undef)
> @@ -91,7 +91,7 @@ define i32 @reduce_i32(i32 %arg) {
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX1-LABEL: 'reduce_i32'
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.mul.v2i32(<2 x i32> undef)
> +; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.mul.v2i32(<2 x i32> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V4 = call i32 @llvm.experimental.vector.reduce.mul.v4i32(<4 x i32> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 25 for instruction: %V8 = call i32 @llvm.experimental.vector.reduce.mul.v8i32(<8 x i32> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 29 for instruction: %V16 = call i32 @llvm.experimental.vector.reduce.mul.v16i32(<16 x i32> undef)
> @@ -99,36 +99,20 @@ define i32 @reduce_i32(i32 %arg) {
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX2-LABEL: 'reduce_i32'
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.mul.v2i32(<2 x i32> undef)
> +; AVX2-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.mul.v2i32(<2 x i32> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V4 = call i32 @llvm.experimental.vector.reduce.mul.v4i32(<4 x i32> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V8 = call i32 @llvm.experimental.vector.reduce.mul.v8i32(<8 x i32> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V16 = call i32 @llvm.experimental.vector.reduce.mul.v16i32(<16 x i32> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 16 for instruction: %V32 = call i32 @llvm.experimental.vector.reduce.mul.v32i32(<32 x i32> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
> -; AVX512F-LABEL: 'reduce_i32'
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.mul.v2i32(<2 x i32> undef)
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4 = call i32 @llvm.experimental.vector.reduce.mul.v4i32(<4 x i32> undef)
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V8 = call i32 @llvm.experimental.vector.reduce.mul.v8i32(<8 x i32> undef)
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %V16 = call i32 @llvm.experimental.vector.reduce.mul.v16i32(<16 x i32> undef)
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V32 = call i32 @llvm.experimental.vector.reduce.mul.v32i32(<32 x i32> undef)
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> -;
> -; AVX512BW-LABEL: 'reduce_i32'
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.mul.v2i32(<2 x i32> undef)
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4 = call i32 @llvm.experimental.vector.reduce.mul.v4i32(<4 x i32> undef)
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V8 = call i32 @llvm.experimental.vector.reduce.mul.v8i32(<8 x i32> undef)
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %V16 = call i32 @llvm.experimental.vector.reduce.mul.v16i32(<16 x i32> undef)
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V32 = call i32 @llvm.experimental.vector.reduce.mul.v32i32(<32 x i32> undef)
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> -;
> -; AVX512DQ-LABEL: 'reduce_i32'
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.mul.v2i32(<2 x i32> undef)
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4 = call i32 @llvm.experimental.vector.reduce.mul.v4i32(<4 x i32> undef)
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V8 = call i32 @llvm.experimental.vector.reduce.mul.v8i32(<8 x i32> undef)
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %V16 = call i32 @llvm.experimental.vector.reduce.mul.v16i32(<16 x i32> undef)
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V32 = call i32 @llvm.experimental.vector.reduce.mul.v32i32(<32 x i32> undef)
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> +; AVX512-LABEL: 'reduce_i32'
> +; AVX512-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.mul.v2i32(<2 x i32> undef)
> +; AVX512-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4 = call i32 @llvm.experimental.vector.reduce.mul.v4i32(<4 x i32> undef)
> +; AVX512-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V8 = call i32 @llvm.experimental.vector.reduce.mul.v8i32(<8 x i32> undef)
> +; AVX512-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %V16 = call i32 @llvm.experimental.vector.reduce.mul.v16i32(<16 x i32> undef)
> +; AVX512-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V32 = call i32 @llvm.experimental.vector.reduce.mul.v32i32(<32 x i32> undef)
> +; AVX512-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>    %V2  = call i32 @llvm.experimental.vector.reduce.mul.v2i32(<2 x i32> undef)
>    %V4  = call i32 @llvm.experimental.vector.reduce.mul.v4i32(<4 x i32> undef)
> @@ -140,8 +124,8 @@ define i32 @reduce_i32(i32 %arg) {
>
>  define i32 @reduce_i16(i32 %arg) {
>  ; SSE2-LABEL: 'reduce_i16'
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.mul.v2i16(<2 x i16> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.mul.v4i16(<4 x i16> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.mul.v2i16(<2 x i16> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 13 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.mul.v4i16(<4 x i16> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 19 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.mul.v8i16(<8 x i16> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.mul.v16i16(<16 x i16> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 22 for instruction: %V32 = call i16 @llvm.experimental.vector.reduce.mul.v32i16(<32 x i16> undef)
> @@ -149,8 +133,8 @@ define i32 @reduce_i16(i32 %arg) {
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; SSSE3-LABEL: 'reduce_i16'
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.mul.v2i16(<2 x i16> undef)
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.mul.v4i16(<4 x i16> undef)
> +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.mul.v2i16(<2 x i16> undef)
> +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.mul.v4i16(<4 x i16> undef)
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.mul.v8i16(<8 x i16> undef)
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.mul.v16i16(<16 x i16> undef)
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V32 = call i16 @llvm.experimental.vector.reduce.mul.v32i16(<32 x i16> undef)
> @@ -158,8 +142,8 @@ define i32 @reduce_i16(i32 %arg) {
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; SSE42-LABEL: 'reduce_i16'
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.mul.v2i16(<2 x i16> undef)
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.mul.v4i16(<4 x i16> undef)
> +; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.mul.v2i16(<2 x i16> undef)
> +; SSE42-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.mul.v4i16(<4 x i16> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.mul.v8i16(<8 x i16> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.mul.v16i16(<16 x i16> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V32 = call i16 @llvm.experimental.vector.reduce.mul.v32i16(<32 x i16> undef)
> @@ -167,8 +151,8 @@ define i32 @reduce_i16(i32 %arg) {
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX1-LABEL: 'reduce_i16'
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.mul.v2i16(<2 x i16> undef)
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.mul.v4i16(<4 x i16> undef)
> +; AVX1-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.mul.v2i16(<2 x i16> undef)
> +; AVX1-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.mul.v4i16(<4 x i16> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.mul.v8i16(<8 x i16> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 49 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.mul.v16i16(<16 x i16> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 53 for instruction: %V32 = call i16 @llvm.experimental.vector.reduce.mul.v32i16(<32 x i16> undef)
> @@ -176,8 +160,8 @@ define i32 @reduce_i16(i32 %arg) {
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX2-LABEL: 'reduce_i16'
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.mul.v2i16(<2 x i16> undef)
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.mul.v4i16(<4 x i16> undef)
> +; AVX2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.mul.v2i16(<2 x i16> undef)
> +; AVX2-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.mul.v4i16(<4 x i16> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.mul.v8i16(<8 x i16> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 21 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.mul.v16i16(<16 x i16> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 22 for instruction: %V32 = call i16 @llvm.experimental.vector.reduce.mul.v32i16(<32 x i16> undef)
> @@ -185,7 +169,7 @@ define i32 @reduce_i16(i32 %arg) {
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX512F-LABEL: 'reduce_i16'
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.mul.v2i16(<2 x i16> undef)
> +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.mul.v2i16(<2 x i16> undef)
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.mul.v4i16(<4 x i16> undef)
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.mul.v8i16(<8 x i16> undef)
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 21 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.mul.v16i16(<16 x i16> undef)
> @@ -194,7 +178,7 @@ define i32 @reduce_i16(i32 %arg) {
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX512BW-LABEL: 'reduce_i16'
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.mul.v2i16(<2 x i16> undef)
> +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.mul.v2i16(<2 x i16> undef)
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.mul.v4i16(<4 x i16> undef)
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.mul.v8i16(<8 x i16> undef)
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.mul.v16i16(<16 x i16> undef)
> @@ -222,9 +206,9 @@ define i32 @reduce_i16(i32 %arg) {
>
>  define i32 @reduce_i8(i32 %arg) {
>  ; SSE2-LABEL: 'reduce_i8'
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.mul.v2i8(<2 x i8> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.mul.v4i8(<4 x i8> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 19 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.mul.v8i8(<8 x i8> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 23 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.mul.v2i8(<2 x i8> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 45 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.mul.v4i8(<4 x i8> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 67 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.mul.v8i8(<8 x i8> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 89 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.mul.v16i8(<16 x i8> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 101 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.mul.v32i8(<32 x i8> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 125 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.mul.v64i8(<64 x i8> undef)
> @@ -232,9 +216,9 @@ define i32 @reduce_i8(i32 %arg) {
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; SSSE3-LABEL: 'reduce_i8'
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.mul.v2i8(<2 x i8> undef)
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.mul.v4i8(<4 x i8> undef)
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.mul.v8i8(<8 x i8> undef)
> +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 14 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.mul.v2i8(<2 x i8> undef)
> +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 27 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.mul.v4i8(<4 x i8> undef)
> +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 40 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.mul.v8i8(<8 x i8> undef)
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 53 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.mul.v16i8(<16 x i8> undef)
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 65 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.mul.v32i8(<32 x i8> undef)
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 89 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.mul.v64i8(<64 x i8> undef)
> @@ -242,9 +226,9 @@ define i32 @reduce_i8(i32 %arg) {
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; SSE42-LABEL: 'reduce_i8'
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.mul.v2i8(<2 x i8> undef)
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.mul.v4i8(<4 x i8> undef)
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.mul.v8i8(<8 x i8> undef)
> +; SSE42-NEXT:  Cost Model: Found an estimated cost of 14 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.mul.v2i8(<2 x i8> undef)
> +; SSE42-NEXT:  Cost Model: Found an estimated cost of 27 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.mul.v4i8(<4 x i8> undef)
> +; SSE42-NEXT:  Cost Model: Found an estimated cost of 40 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.mul.v8i8(<8 x i8> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 53 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.mul.v16i8(<16 x i8> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 65 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.mul.v32i8(<32 x i8> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 89 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.mul.v64i8(<64 x i8> undef)
> @@ -252,9 +236,9 @@ define i32 @reduce_i8(i32 %arg) {
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX1-LABEL: 'reduce_i8'
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.mul.v2i8(<2 x i8> undef)
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.mul.v4i8(<4 x i8> undef)
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.mul.v8i8(<8 x i8> undef)
> +; AVX1-NEXT:  Cost Model: Found an estimated cost of 14 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.mul.v2i8(<2 x i8> undef)
> +; AVX1-NEXT:  Cost Model: Found an estimated cost of 27 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.mul.v4i8(<4 x i8> undef)
> +; AVX1-NEXT:  Cost Model: Found an estimated cost of 40 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.mul.v8i8(<8 x i8> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 53 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.mul.v16i8(<16 x i8> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 171 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.mul.v32i8(<32 x i8> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 197 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.mul.v64i8(<64 x i8> undef)
> @@ -262,9 +246,9 @@ define i32 @reduce_i8(i32 %arg) {
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX2-LABEL: 'reduce_i8'
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.mul.v2i8(<2 x i8> undef)
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.mul.v4i8(<4 x i8> undef)
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.mul.v8i8(<8 x i8> undef)
> +; AVX2-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.mul.v2i8(<2 x i8> undef)
> +; AVX2-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.mul.v4i8(<4 x i8> undef)
> +; AVX2-NEXT:  Cost Model: Found an estimated cost of 25 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.mul.v8i8(<8 x i8> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 33 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.mul.v16i8(<16 x i8> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 106 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.mul.v32i8(<32 x i8> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 123 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.mul.v64i8(<64 x i8> undef)
> @@ -272,9 +256,9 @@ define i32 @reduce_i8(i32 %arg) {
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX512F-LABEL: 'reduce_i8'
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.mul.v2i8(<2 x i8> undef)
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.mul.v4i8(<4 x i8> undef)
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.mul.v8i8(<8 x i8> undef)
> +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.mul.v2i8(<2 x i8> undef)
> +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 13 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.mul.v4i8(<4 x i8> undef)
> +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 19 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.mul.v8i8(<8 x i8> undef)
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 25 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.mul.v16i8(<16 x i8> undef)
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 86 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.mul.v32i8(<32 x i8> undef)
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 99 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.mul.v64i8(<64 x i8> undef)
> @@ -282,9 +266,9 @@ define i32 @reduce_i8(i32 %arg) {
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX512BW-LABEL: 'reduce_i8'
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.mul.v2i8(<2 x i8> undef)
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.mul.v4i8(<4 x i8> undef)
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.mul.v8i8(<8 x i8> undef)
> +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.mul.v2i8(<2 x i8> undef)
> +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 11 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.mul.v4i8(<4 x i8> undef)
> +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 16 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.mul.v8i8(<8 x i8> undef)
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 21 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.mul.v16i8(<16 x i8> undef)
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 36 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.mul.v32i8(<32 x i8> undef)
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 115 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.mul.v64i8(<64 x i8> undef)
> @@ -292,9 +276,9 @@ define i32 @reduce_i8(i32 %arg) {
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX512DQ-LABEL: 'reduce_i8'
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.mul.v2i8(<2 x i8> undef)
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.mul.v4i8(<4 x i8> undef)
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.mul.v8i8(<8 x i8> undef)
> +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.mul.v2i8(<2 x i8> undef)
> +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 13 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.mul.v4i8(<4 x i8> undef)
> +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 19 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.mul.v8i8(<8 x i8> undef)
>  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 25 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.mul.v16i8(<16 x i8> undef)
>  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 86 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.mul.v32i8(<32 x i8> undef)
>  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 99 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.mul.v64i8(<64 x i8> undef)
>
> Modified: llvm/trunk/test/Analysis/CostModel/X86/reduce-or.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/reduce-or.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/Analysis/CostModel/X86/reduce-or.ll (original)
> +++ llvm/trunk/test/Analysis/CostModel/X86/reduce-or.ll Wed Aug  7 09:24:26 2019
> @@ -92,8 +92,8 @@ define i32 @reduce_i32(i32 %arg) {
>
>  define i32 @reduce_i16(i32 %arg) {
>  ; SSE2-LABEL: 'reduce_i16'
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.or.v2i16(<2 x i16> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.or.v4i16(<4 x i16> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.or.v2i16(<2 x i16> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 13 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.or.v4i16(<4 x i16> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 19 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.or.v8i16(<8 x i16> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.or.v16i16(<16 x i16> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 22 for instruction: %V32 = call i16 @llvm.experimental.vector.reduce.or.v32i16(<32 x i16> undef)
> @@ -174,9 +174,9 @@ define i32 @reduce_i16(i32 %arg) {
>
>  define i32 @reduce_i8(i32 %arg) {
>  ; SSE2-LABEL: 'reduce_i8'
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.or.v2i8(<2 x i8> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.or.v4i8(<4 x i8> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 19 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.or.v8i8(<8 x i8> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.or.v2i8(<2 x i8> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 23 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.or.v4i8(<4 x i8> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 34 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.or.v8i8(<8 x i8> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 45 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.or.v16i8(<16 x i8> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 46 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.or.v32i8(<32 x i8> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 48 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.or.v64i8(<64 x i8> undef)
>
> Modified: llvm/trunk/test/Analysis/CostModel/X86/reduce-smax.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/reduce-smax.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/Analysis/CostModel/X86/reduce-smax.ll (original)
> +++ llvm/trunk/test/Analysis/CostModel/X86/reduce-smax.ll Wed Aug  7 09:24:26 2019
> @@ -83,7 +83,7 @@ define i32 @reduce_i32(i32 %arg) {
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; SSE42-LABEL: 'reduce_i32'
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.smax.v2i32(<2 x i32> undef)
> +; SSE42-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.smax.v2i32(<2 x i32> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V4 = call i32 @llvm.experimental.vector.reduce.smax.v4i32(<4 x i32> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V8 = call i32 @llvm.experimental.vector.reduce.smax.v8i32(<8 x i32> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V16 = call i32 @llvm.experimental.vector.reduce.smax.v16i32(<16 x i32> undef)
> @@ -91,7 +91,7 @@ define i32 @reduce_i32(i32 %arg) {
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX1-LABEL: 'reduce_i32'
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.smax.v2i32(<2 x i32> undef)
> +; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.smax.v2i32(<2 x i32> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i32 @llvm.experimental.vector.reduce.smax.v4i32(<4 x i32> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V8 = call i32 @llvm.experimental.vector.reduce.smax.v8i32(<8 x i32> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V16 = call i32 @llvm.experimental.vector.reduce.smax.v16i32(<16 x i32> undef)
> @@ -99,7 +99,7 @@ define i32 @reduce_i32(i32 %arg) {
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX2-LABEL: 'reduce_i32'
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.smax.v2i32(<2 x i32> undef)
> +; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.smax.v2i32(<2 x i32> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i32 @llvm.experimental.vector.reduce.smax.v4i32(<4 x i32> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i32 @llvm.experimental.vector.reduce.smax.v8i32(<8 x i32> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V16 = call i32 @llvm.experimental.vector.reduce.smax.v16i32(<16 x i32> undef)
> @@ -107,7 +107,7 @@ define i32 @reduce_i32(i32 %arg) {
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX512-LABEL: 'reduce_i32'
> -; AVX512-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.smax.v2i32(<2 x i32> undef)
> +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.smax.v2i32(<2 x i32> undef)
>  ; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i32 @llvm.experimental.vector.reduce.smax.v4i32(<4 x i32> undef)
>  ; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i32 @llvm.experimental.vector.reduce.smax.v8i32(<8 x i32> undef)
>  ; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V16 = call i32 @llvm.experimental.vector.reduce.smax.v16i32(<16 x i32> undef)
> @@ -124,8 +124,8 @@ define i32 @reduce_i32(i32 %arg) {
>
>  define i32 @reduce_i16(i32 %arg) {
>  ; SSE2-LABEL: 'reduce_i16'
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smax.v2i16(<2 x i16> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.smax.v4i16(<4 x i16> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smax.v2i16(<2 x i16> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.smax.v4i16(<4 x i16> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.smax.v8i16(<8 x i16> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.smax.v16i16(<16 x i16> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 24 for instruction: %V32 = call i16 @llvm.experimental.vector.reduce.smax.v32i16(<32 x i16> undef)
> @@ -133,8 +133,8 @@ define i32 @reduce_i16(i32 %arg) {
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; SSSE3-LABEL: 'reduce_i16'
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smax.v2i16(<2 x i16> undef)
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.smax.v4i16(<4 x i16> undef)
> +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smax.v2i16(<2 x i16> undef)
> +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.smax.v4i16(<4 x i16> undef)
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.smax.v8i16(<8 x i16> undef)
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.smax.v16i16(<16 x i16> undef)
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 24 for instruction: %V32 = call i16 @llvm.experimental.vector.reduce.smax.v32i16(<32 x i16> undef)
> @@ -142,7 +142,7 @@ define i32 @reduce_i16(i32 %arg) {
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; SSE42-LABEL: 'reduce_i16'
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smax.v2i16(<2 x i16> undef)
> +; SSE42-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smax.v2i16(<2 x i16> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.smax.v4i16(<4 x i16> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.smax.v8i16(<8 x i16> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.smax.v16i16(<16 x i16> undef)
> @@ -151,7 +151,7 @@ define i32 @reduce_i16(i32 %arg) {
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX1-LABEL: 'reduce_i16'
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smax.v2i16(<2 x i16> undef)
> +; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smax.v2i16(<2 x i16> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.smax.v4i16(<4 x i16> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.smax.v8i16(<8 x i16> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.smax.v16i16(<16 x i16> undef)
> @@ -160,7 +160,7 @@ define i32 @reduce_i16(i32 %arg) {
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX2-LABEL: 'reduce_i16'
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smax.v2i16(<2 x i16> undef)
> +; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smax.v2i16(<2 x i16> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.smax.v4i16(<4 x i16> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.smax.v8i16(<8 x i16> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.smax.v16i16(<16 x i16> undef)
> @@ -169,7 +169,7 @@ define i32 @reduce_i16(i32 %arg) {
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX512F-LABEL: 'reduce_i16'
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smax.v2i16(<2 x i16> undef)
> +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smax.v2i16(<2 x i16> undef)
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.smax.v4i16(<4 x i16> undef)
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.smax.v8i16(<8 x i16> undef)
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.smax.v16i16(<16 x i16> undef)
> @@ -178,7 +178,7 @@ define i32 @reduce_i16(i32 %arg) {
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX512BW-LABEL: 'reduce_i16'
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smax.v2i16(<2 x i16> undef)
> +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smax.v2i16(<2 x i16> undef)
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.smax.v4i16(<4 x i16> undef)
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.smax.v8i16(<8 x i16> undef)
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.smax.v16i16(<16 x i16> undef)
> @@ -187,7 +187,7 @@ define i32 @reduce_i16(i32 %arg) {
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX512DQ-LABEL: 'reduce_i16'
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smax.v2i16(<2 x i16> undef)
> +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smax.v2i16(<2 x i16> undef)
>  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.smax.v4i16(<4 x i16> undef)
>  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.smax.v8i16(<8 x i16> undef)
>  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.smax.v16i16(<16 x i16> undef)
> @@ -206,8 +206,8 @@ define i32 @reduce_i16(i32 %arg) {
>
>  define i32 @reduce_i8(i32 %arg) {
>  ; SSE2-LABEL: 'reduce_i8'
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smax.v2i8(<2 x i8> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smax.v4i8(<4 x i8> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smax.v2i8(<2 x i8> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smax.v4i8(<4 x i8> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smax.v8i8(<8 x i8> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.smax.v16i8(<16 x i8> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.smax.v32i8(<32 x i8> undef)
> @@ -216,8 +216,8 @@ define i32 @reduce_i8(i32 %arg) {
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; SSSE3-LABEL: 'reduce_i8'
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smax.v2i8(<2 x i8> undef)
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smax.v4i8(<4 x i8> undef)
> +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smax.v2i8(<2 x i8> undef)
> +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smax.v4i8(<4 x i8> undef)
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smax.v8i8(<8 x i8> undef)
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.smax.v16i8(<16 x i8> undef)
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.smax.v32i8(<32 x i8> undef)
> @@ -226,9 +226,9 @@ define i32 @reduce_i8(i32 %arg) {
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; SSE42-LABEL: 'reduce_i8'
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smax.v2i8(<2 x i8> undef)
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smax.v4i8(<4 x i8> undef)
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smax.v8i8(<8 x i8> undef)
> +; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smax.v2i8(<2 x i8> undef)
> +; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smax.v4i8(<4 x i8> undef)
> +; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smax.v8i8(<8 x i8> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.smax.v16i8(<16 x i8> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.smax.v32i8(<32 x i8> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.smax.v64i8(<64 x i8> undef)
> @@ -236,9 +236,9 @@ define i32 @reduce_i8(i32 %arg) {
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX1-LABEL: 'reduce_i8'
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smax.v2i8(<2 x i8> undef)
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smax.v4i8(<4 x i8> undef)
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smax.v8i8(<8 x i8> undef)
> +; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smax.v2i8(<2 x i8> undef)
> +; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smax.v4i8(<4 x i8> undef)
> +; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smax.v8i8(<8 x i8> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.smax.v16i8(<16 x i8> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.smax.v32i8(<32 x i8> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.smax.v64i8(<64 x i8> undef)
> @@ -246,9 +246,9 @@ define i32 @reduce_i8(i32 %arg) {
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX2-LABEL: 'reduce_i8'
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smax.v2i8(<2 x i8> undef)
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smax.v4i8(<4 x i8> undef)
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smax.v8i8(<8 x i8> undef)
> +; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smax.v2i8(<2 x i8> undef)
> +; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smax.v4i8(<4 x i8> undef)
> +; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smax.v8i8(<8 x i8> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.smax.v16i8(<16 x i8> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.smax.v32i8(<32 x i8> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.smax.v64i8(<64 x i8> undef)
> @@ -256,9 +256,9 @@ define i32 @reduce_i8(i32 %arg) {
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX512F-LABEL: 'reduce_i8'
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smax.v2i8(<2 x i8> undef)
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smax.v4i8(<4 x i8> undef)
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smax.v8i8(<8 x i8> undef)
> +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smax.v2i8(<2 x i8> undef)
> +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smax.v4i8(<4 x i8> undef)
> +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smax.v8i8(<8 x i8> undef)
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.smax.v16i8(<16 x i8> undef)
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.smax.v32i8(<32 x i8> undef)
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.smax.v64i8(<64 x i8> undef)
> @@ -266,9 +266,9 @@ define i32 @reduce_i8(i32 %arg) {
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX512BW-LABEL: 'reduce_i8'
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smax.v2i8(<2 x i8> undef)
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smax.v4i8(<4 x i8> undef)
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smax.v8i8(<8 x i8> undef)
> +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smax.v2i8(<2 x i8> undef)
> +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smax.v4i8(<4 x i8> undef)
> +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smax.v8i8(<8 x i8> undef)
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.smax.v16i8(<16 x i8> undef)
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.smax.v32i8(<32 x i8> undef)
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 61 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.smax.v64i8(<64 x i8> undef)
> @@ -276,9 +276,9 @@ define i32 @reduce_i8(i32 %arg) {
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX512DQ-LABEL: 'reduce_i8'
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smax.v2i8(<2 x i8> undef)
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smax.v4i8(<4 x i8> undef)
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smax.v8i8(<8 x i8> undef)
> +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smax.v2i8(<2 x i8> undef)
> +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smax.v4i8(<4 x i8> undef)
> +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smax.v8i8(<8 x i8> undef)
>  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.smax.v16i8(<16 x i8> undef)
>  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.smax.v32i8(<32 x i8> undef)
>  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.smax.v64i8(<64 x i8> undef)
>
> Modified: llvm/trunk/test/Analysis/CostModel/X86/reduce-smin.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/reduce-smin.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/Analysis/CostModel/X86/reduce-smin.ll (original)
> +++ llvm/trunk/test/Analysis/CostModel/X86/reduce-smin.ll Wed Aug  7 09:24:26 2019
> @@ -83,7 +83,7 @@ define i32 @reduce_i32(i32 %arg) {
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; SSE42-LABEL: 'reduce_i32'
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.smin.v2i32(<2 x i32> undef)
> +; SSE42-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.smin.v2i32(<2 x i32> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V4 = call i32 @llvm.experimental.vector.reduce.smin.v4i32(<4 x i32> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V8 = call i32 @llvm.experimental.vector.reduce.smin.v8i32(<8 x i32> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V16 = call i32 @llvm.experimental.vector.reduce.smin.v16i32(<16 x i32> undef)
> @@ -91,7 +91,7 @@ define i32 @reduce_i32(i32 %arg) {
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX1-LABEL: 'reduce_i32'
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.smin.v2i32(<2 x i32> undef)
> +; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.smin.v2i32(<2 x i32> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i32 @llvm.experimental.vector.reduce.smin.v4i32(<4 x i32> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V8 = call i32 @llvm.experimental.vector.reduce.smin.v8i32(<8 x i32> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V16 = call i32 @llvm.experimental.vector.reduce.smin.v16i32(<16 x i32> undef)
> @@ -99,7 +99,7 @@ define i32 @reduce_i32(i32 %arg) {
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX2-LABEL: 'reduce_i32'
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.smin.v2i32(<2 x i32> undef)
> +; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.smin.v2i32(<2 x i32> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i32 @llvm.experimental.vector.reduce.smin.v4i32(<4 x i32> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i32 @llvm.experimental.vector.reduce.smin.v8i32(<8 x i32> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V16 = call i32 @llvm.experimental.vector.reduce.smin.v16i32(<16 x i32> undef)
> @@ -107,7 +107,7 @@ define i32 @reduce_i32(i32 %arg) {
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX512-LABEL: 'reduce_i32'
> -; AVX512-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.smin.v2i32(<2 x i32> undef)
> +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.smin.v2i32(<2 x i32> undef)
>  ; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i32 @llvm.experimental.vector.reduce.smin.v4i32(<4 x i32> undef)
>  ; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i32 @llvm.experimental.vector.reduce.smin.v8i32(<8 x i32> undef)
>  ; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V16 = call i32 @llvm.experimental.vector.reduce.smin.v16i32(<16 x i32> undef)
> @@ -124,8 +124,8 @@ define i32 @reduce_i32(i32 %arg) {
>
>  define i32 @reduce_i16(i32 %arg) {
>  ; SSE2-LABEL: 'reduce_i16'
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smin.v2i16(<2 x i16> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.smin.v4i16(<4 x i16> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smin.v2i16(<2 x i16> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.smin.v4i16(<4 x i16> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.smin.v8i16(<8 x i16> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.smin.v16i16(<16 x i16> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 24 for instruction: %V32 = call i16 @llvm.experimental.vector.reduce.smin.v32i16(<32 x i16> undef)
> @@ -133,8 +133,8 @@ define i32 @reduce_i16(i32 %arg) {
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; SSSE3-LABEL: 'reduce_i16'
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smin.v2i16(<2 x i16> undef)
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.smin.v4i16(<4 x i16> undef)
> +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smin.v2i16(<2 x i16> undef)
> +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.smin.v4i16(<4 x i16> undef)
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.smin.v8i16(<8 x i16> undef)
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.smin.v16i16(<16 x i16> undef)
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 24 for instruction: %V32 = call i16 @llvm.experimental.vector.reduce.smin.v32i16(<32 x i16> undef)
> @@ -142,7 +142,7 @@ define i32 @reduce_i16(i32 %arg) {
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; SSE42-LABEL: 'reduce_i16'
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smin.v2i16(<2 x i16> undef)
> +; SSE42-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smin.v2i16(<2 x i16> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.smin.v4i16(<4 x i16> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.smin.v8i16(<8 x i16> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.smin.v16i16(<16 x i16> undef)
> @@ -151,7 +151,7 @@ define i32 @reduce_i16(i32 %arg) {
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX1-LABEL: 'reduce_i16'
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smin.v2i16(<2 x i16> undef)
> +; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smin.v2i16(<2 x i16> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.smin.v4i16(<4 x i16> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.smin.v8i16(<8 x i16> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.smin.v16i16(<16 x i16> undef)
> @@ -160,7 +160,7 @@ define i32 @reduce_i16(i32 %arg) {
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX2-LABEL: 'reduce_i16'
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smin.v2i16(<2 x i16> undef)
> +; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smin.v2i16(<2 x i16> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.smin.v4i16(<4 x i16> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.smin.v8i16(<8 x i16> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.smin.v16i16(<16 x i16> undef)
> @@ -169,7 +169,7 @@ define i32 @reduce_i16(i32 %arg) {
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX512F-LABEL: 'reduce_i16'
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smin.v2i16(<2 x i16> undef)
> +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smin.v2i16(<2 x i16> undef)
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.smin.v4i16(<4 x i16> undef)
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.smin.v8i16(<8 x i16> undef)
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.smin.v16i16(<16 x i16> undef)
> @@ -178,7 +178,7 @@ define i32 @reduce_i16(i32 %arg) {
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX512BW-LABEL: 'reduce_i16'
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smin.v2i16(<2 x i16> undef)
> +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smin.v2i16(<2 x i16> undef)
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.smin.v4i16(<4 x i16> undef)
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.smin.v8i16(<8 x i16> undef)
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.smin.v16i16(<16 x i16> undef)
> @@ -187,7 +187,7 @@ define i32 @reduce_i16(i32 %arg) {
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX512DQ-LABEL: 'reduce_i16'
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smin.v2i16(<2 x i16> undef)
> +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.smin.v2i16(<2 x i16> undef)
>  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.smin.v4i16(<4 x i16> undef)
>  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.smin.v8i16(<8 x i16> undef)
>  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.smin.v16i16(<16 x i16> undef)
> @@ -206,8 +206,8 @@ define i32 @reduce_i16(i32 %arg) {
>
>  define i32 @reduce_i8(i32 %arg) {
>  ; SSE2-LABEL: 'reduce_i8'
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smin.v2i8(<2 x i8> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smin.v4i8(<4 x i8> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smin.v2i8(<2 x i8> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smin.v4i8(<4 x i8> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smin.v8i8(<8 x i8> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.smin.v16i8(<16 x i8> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.smin.v32i8(<32 x i8> undef)
> @@ -216,8 +216,8 @@ define i32 @reduce_i8(i32 %arg) {
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; SSSE3-LABEL: 'reduce_i8'
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smin.v2i8(<2 x i8> undef)
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smin.v4i8(<4 x i8> undef)
> +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smin.v2i8(<2 x i8> undef)
> +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smin.v4i8(<4 x i8> undef)
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smin.v8i8(<8 x i8> undef)
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.smin.v16i8(<16 x i8> undef)
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.smin.v32i8(<32 x i8> undef)
> @@ -226,9 +226,9 @@ define i32 @reduce_i8(i32 %arg) {
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; SSE42-LABEL: 'reduce_i8'
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smin.v2i8(<2 x i8> undef)
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smin.v4i8(<4 x i8> undef)
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smin.v8i8(<8 x i8> undef)
> +; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smin.v2i8(<2 x i8> undef)
> +; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smin.v4i8(<4 x i8> undef)
> +; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smin.v8i8(<8 x i8> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.smin.v16i8(<16 x i8> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.smin.v32i8(<32 x i8> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.smin.v64i8(<64 x i8> undef)
> @@ -236,9 +236,9 @@ define i32 @reduce_i8(i32 %arg) {
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX1-LABEL: 'reduce_i8'
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smin.v2i8(<2 x i8> undef)
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smin.v4i8(<4 x i8> undef)
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smin.v8i8(<8 x i8> undef)
> +; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smin.v2i8(<2 x i8> undef)
> +; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smin.v4i8(<4 x i8> undef)
> +; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smin.v8i8(<8 x i8> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.smin.v16i8(<16 x i8> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.smin.v32i8(<32 x i8> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.smin.v64i8(<64 x i8> undef)
> @@ -246,9 +246,9 @@ define i32 @reduce_i8(i32 %arg) {
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX2-LABEL: 'reduce_i8'
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smin.v2i8(<2 x i8> undef)
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smin.v4i8(<4 x i8> undef)
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smin.v8i8(<8 x i8> undef)
> +; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smin.v2i8(<2 x i8> undef)
> +; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smin.v4i8(<4 x i8> undef)
> +; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smin.v8i8(<8 x i8> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.smin.v16i8(<16 x i8> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.smin.v32i8(<32 x i8> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.smin.v64i8(<64 x i8> undef)
> @@ -256,9 +256,9 @@ define i32 @reduce_i8(i32 %arg) {
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX512F-LABEL: 'reduce_i8'
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smin.v2i8(<2 x i8> undef)
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smin.v4i8(<4 x i8> undef)
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smin.v8i8(<8 x i8> undef)
> +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smin.v2i8(<2 x i8> undef)
> +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smin.v4i8(<4 x i8> undef)
> +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smin.v8i8(<8 x i8> undef)
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.smin.v16i8(<16 x i8> undef)
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.smin.v32i8(<32 x i8> undef)
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.smin.v64i8(<64 x i8> undef)
> @@ -266,9 +266,9 @@ define i32 @reduce_i8(i32 %arg) {
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX512BW-LABEL: 'reduce_i8'
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smin.v2i8(<2 x i8> undef)
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smin.v4i8(<4 x i8> undef)
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smin.v8i8(<8 x i8> undef)
> +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smin.v2i8(<2 x i8> undef)
> +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smin.v4i8(<4 x i8> undef)
> +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smin.v8i8(<8 x i8> undef)
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.smin.v16i8(<16 x i8> undef)
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.smin.v32i8(<32 x i8> undef)
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 61 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.smin.v64i8(<64 x i8> undef)
> @@ -276,9 +276,9 @@ define i32 @reduce_i8(i32 %arg) {
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX512DQ-LABEL: 'reduce_i8'
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smin.v2i8(<2 x i8> undef)
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smin.v4i8(<4 x i8> undef)
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smin.v8i8(<8 x i8> undef)
> +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.smin.v2i8(<2 x i8> undef)
> +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.smin.v4i8(<4 x i8> undef)
> +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.smin.v8i8(<8 x i8> undef)
>  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.smin.v16i8(<16 x i8> undef)
>  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.smin.v32i8(<32 x i8> undef)
>  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.smin.v64i8(<64 x i8> undef)
>
> Modified: llvm/trunk/test/Analysis/CostModel/X86/reduce-umax.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/reduce-umax.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/Analysis/CostModel/X86/reduce-umax.ll (original)
> +++ llvm/trunk/test/Analysis/CostModel/X86/reduce-umax.ll Wed Aug  7 09:24:26 2019
> @@ -83,7 +83,7 @@ define i32 @reduce_i32(i32 %arg) {
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; SSE42-LABEL: 'reduce_i32'
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.umax.v2i32(<2 x i32> undef)
> +; SSE42-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.umax.v2i32(<2 x i32> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i32 @llvm.experimental.vector.reduce.umax.v4i32(<4 x i32> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V8 = call i32 @llvm.experimental.vector.reduce.umax.v8i32(<8 x i32> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V16 = call i32 @llvm.experimental.vector.reduce.umax.v16i32(<16 x i32> undef)
> @@ -91,7 +91,7 @@ define i32 @reduce_i32(i32 %arg) {
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX1-LABEL: 'reduce_i32'
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.umax.v2i32(<2 x i32> undef)
> +; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.umax.v2i32(<2 x i32> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i32 @llvm.experimental.vector.reduce.umax.v4i32(<4 x i32> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V8 = call i32 @llvm.experimental.vector.reduce.umax.v8i32(<8 x i32> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V16 = call i32 @llvm.experimental.vector.reduce.umax.v16i32(<16 x i32> undef)
> @@ -99,7 +99,7 @@ define i32 @reduce_i32(i32 %arg) {
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX2-LABEL: 'reduce_i32'
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.umax.v2i32(<2 x i32> undef)
> +; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.umax.v2i32(<2 x i32> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i32 @llvm.experimental.vector.reduce.umax.v4i32(<4 x i32> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i32 @llvm.experimental.vector.reduce.umax.v8i32(<8 x i32> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V16 = call i32 @llvm.experimental.vector.reduce.umax.v16i32(<16 x i32> undef)
> @@ -107,7 +107,7 @@ define i32 @reduce_i32(i32 %arg) {
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX512-LABEL: 'reduce_i32'
> -; AVX512-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.umax.v2i32(<2 x i32> undef)
> +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.umax.v2i32(<2 x i32> undef)
>  ; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i32 @llvm.experimental.vector.reduce.umax.v4i32(<4 x i32> undef)
>  ; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i32 @llvm.experimental.vector.reduce.umax.v8i32(<8 x i32> undef)
>  ; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V16 = call i32 @llvm.experimental.vector.reduce.umax.v16i32(<16 x i32> undef)
> @@ -124,8 +124,8 @@ define i32 @reduce_i32(i32 %arg) {
>
>  define i32 @reduce_i16(i32 %arg) {
>  ; SSE2-LABEL: 'reduce_i16'
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umax.v2i16(<2 x i16> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.umax.v4i16(<4 x i16> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umax.v2i16(<2 x i16> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.umax.v4i16(<4 x i16> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.umax.v8i16(<8 x i16> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.umax.v16i16(<16 x i16> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 16 for instruction: %V32 = call i16 @llvm.experimental.vector.reduce.umax.v32i16(<32 x i16> undef)
> @@ -133,8 +133,8 @@ define i32 @reduce_i16(i32 %arg) {
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; SSSE3-LABEL: 'reduce_i16'
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umax.v2i16(<2 x i16> undef)
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.umax.v4i16(<4 x i16> undef)
> +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umax.v2i16(<2 x i16> undef)
> +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.umax.v4i16(<4 x i16> undef)
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.umax.v8i16(<8 x i16> undef)
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.umax.v16i16(<16 x i16> undef)
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 16 for instruction: %V32 = call i16 @llvm.experimental.vector.reduce.umax.v32i16(<32 x i16> undef)
> @@ -142,7 +142,7 @@ define i32 @reduce_i16(i32 %arg) {
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; SSE42-LABEL: 'reduce_i16'
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umax.v2i16(<2 x i16> undef)
> +; SSE42-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umax.v2i16(<2 x i16> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.umax.v4i16(<4 x i16> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.umax.v8i16(<8 x i16> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.umax.v16i16(<16 x i16> undef)
> @@ -151,7 +151,7 @@ define i32 @reduce_i16(i32 %arg) {
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX1-LABEL: 'reduce_i16'
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umax.v2i16(<2 x i16> undef)
> +; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umax.v2i16(<2 x i16> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.umax.v4i16(<4 x i16> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.umax.v8i16(<8 x i16> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.umax.v16i16(<16 x i16> undef)
> @@ -160,7 +160,7 @@ define i32 @reduce_i16(i32 %arg) {
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX2-LABEL: 'reduce_i16'
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umax.v2i16(<2 x i16> undef)
> +; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umax.v2i16(<2 x i16> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.umax.v4i16(<4 x i16> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.umax.v8i16(<8 x i16> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.umax.v16i16(<16 x i16> undef)
> @@ -169,7 +169,7 @@ define i32 @reduce_i16(i32 %arg) {
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX512F-LABEL: 'reduce_i16'
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umax.v2i16(<2 x i16> undef)
> +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umax.v2i16(<2 x i16> undef)
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.umax.v4i16(<4 x i16> undef)
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.umax.v8i16(<8 x i16> undef)
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.umax.v16i16(<16 x i16> undef)
> @@ -178,7 +178,7 @@ define i32 @reduce_i16(i32 %arg) {
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX512BW-LABEL: 'reduce_i16'
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umax.v2i16(<2 x i16> undef)
> +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umax.v2i16(<2 x i16> undef)
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.umax.v4i16(<4 x i16> undef)
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.umax.v8i16(<8 x i16> undef)
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.umax.v16i16(<16 x i16> undef)
> @@ -187,7 +187,7 @@ define i32 @reduce_i16(i32 %arg) {
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX512DQ-LABEL: 'reduce_i16'
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umax.v2i16(<2 x i16> undef)
> +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umax.v2i16(<2 x i16> undef)
>  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.umax.v4i16(<4 x i16> undef)
>  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.umax.v8i16(<8 x i16> undef)
>  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.umax.v16i16(<16 x i16> undef)
> @@ -206,9 +206,9 @@ define i32 @reduce_i16(i32 %arg) {
>
>  define i32 @reduce_i8(i32 %arg) {
>  ; SSE2-LABEL: 'reduce_i8'
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umax.v2i8(<2 x i8> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umax.v4i8(<4 x i8> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umax.v8i8(<8 x i8> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umax.v2i8(<2 x i8> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umax.v4i8(<4 x i8> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umax.v8i8(<8 x i8> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.umax.v16i8(<16 x i8> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 16 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.umax.v32i8(<32 x i8> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 32 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.umax.v64i8(<64 x i8> undef)
> @@ -216,9 +216,9 @@ define i32 @reduce_i8(i32 %arg) {
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; SSSE3-LABEL: 'reduce_i8'
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umax.v2i8(<2 x i8> undef)
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umax.v4i8(<4 x i8> undef)
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umax.v8i8(<8 x i8> undef)
> +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umax.v2i8(<2 x i8> undef)
> +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umax.v4i8(<4 x i8> undef)
> +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umax.v8i8(<8 x i8> undef)
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.umax.v16i8(<16 x i8> undef)
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 16 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.umax.v32i8(<32 x i8> undef)
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 32 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.umax.v64i8(<64 x i8> undef)
> @@ -226,9 +226,9 @@ define i32 @reduce_i8(i32 %arg) {
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; SSE42-LABEL: 'reduce_i8'
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umax.v2i8(<2 x i8> undef)
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umax.v4i8(<4 x i8> undef)
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umax.v8i8(<8 x i8> undef)
> +; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umax.v2i8(<2 x i8> undef)
> +; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umax.v4i8(<4 x i8> undef)
> +; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umax.v8i8(<8 x i8> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.umax.v16i8(<16 x i8> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.umax.v32i8(<32 x i8> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.umax.v64i8(<64 x i8> undef)
> @@ -236,9 +236,9 @@ define i32 @reduce_i8(i32 %arg) {
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX1-LABEL: 'reduce_i8'
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umax.v2i8(<2 x i8> undef)
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umax.v4i8(<4 x i8> undef)
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umax.v8i8(<8 x i8> undef)
> +; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umax.v2i8(<2 x i8> undef)
> +; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umax.v4i8(<4 x i8> undef)
> +; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umax.v8i8(<8 x i8> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.umax.v16i8(<16 x i8> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.umax.v32i8(<32 x i8> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.umax.v64i8(<64 x i8> undef)
> @@ -246,9 +246,9 @@ define i32 @reduce_i8(i32 %arg) {
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX2-LABEL: 'reduce_i8'
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umax.v2i8(<2 x i8> undef)
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umax.v4i8(<4 x i8> undef)
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umax.v8i8(<8 x i8> undef)
> +; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umax.v2i8(<2 x i8> undef)
> +; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umax.v4i8(<4 x i8> undef)
> +; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umax.v8i8(<8 x i8> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.umax.v16i8(<16 x i8> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.umax.v32i8(<32 x i8> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.umax.v64i8(<64 x i8> undef)
> @@ -256,9 +256,9 @@ define i32 @reduce_i8(i32 %arg) {
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX512F-LABEL: 'reduce_i8'
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umax.v2i8(<2 x i8> undef)
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umax.v4i8(<4 x i8> undef)
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umax.v8i8(<8 x i8> undef)
> +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umax.v2i8(<2 x i8> undef)
> +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umax.v4i8(<4 x i8> undef)
> +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umax.v8i8(<8 x i8> undef)
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.umax.v16i8(<16 x i8> undef)
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.umax.v32i8(<32 x i8> undef)
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.umax.v64i8(<64 x i8> undef)
> @@ -266,9 +266,9 @@ define i32 @reduce_i8(i32 %arg) {
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX512BW-LABEL: 'reduce_i8'
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umax.v2i8(<2 x i8> undef)
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umax.v4i8(<4 x i8> undef)
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umax.v8i8(<8 x i8> undef)
> +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umax.v2i8(<2 x i8> undef)
> +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umax.v4i8(<4 x i8> undef)
> +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umax.v8i8(<8 x i8> undef)
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.umax.v16i8(<16 x i8> undef)
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.umax.v32i8(<32 x i8> undef)
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 61 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.umax.v64i8(<64 x i8> undef)
> @@ -276,9 +276,9 @@ define i32 @reduce_i8(i32 %arg) {
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX512DQ-LABEL: 'reduce_i8'
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umax.v2i8(<2 x i8> undef)
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umax.v4i8(<4 x i8> undef)
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umax.v8i8(<8 x i8> undef)
> +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umax.v2i8(<2 x i8> undef)
> +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umax.v4i8(<4 x i8> undef)
> +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umax.v8i8(<8 x i8> undef)
>  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.umax.v16i8(<16 x i8> undef)
>  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.umax.v32i8(<32 x i8> undef)
>  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.umax.v64i8(<64 x i8> undef)
>
> Modified: llvm/trunk/test/Analysis/CostModel/X86/reduce-umin.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/reduce-umin.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/Analysis/CostModel/X86/reduce-umin.ll (original)
> +++ llvm/trunk/test/Analysis/CostModel/X86/reduce-umin.ll Wed Aug  7 09:24:26 2019
> @@ -83,7 +83,7 @@ define i32 @reduce_i32(i32 %arg) {
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; SSE42-LABEL: 'reduce_i32'
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.umin.v2i32(<2 x i32> undef)
> +; SSE42-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.umin.v2i32(<2 x i32> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i32 @llvm.experimental.vector.reduce.umin.v4i32(<4 x i32> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V8 = call i32 @llvm.experimental.vector.reduce.umin.v8i32(<8 x i32> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V16 = call i32 @llvm.experimental.vector.reduce.umin.v16i32(<16 x i32> undef)
> @@ -91,7 +91,7 @@ define i32 @reduce_i32(i32 %arg) {
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX1-LABEL: 'reduce_i32'
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.umin.v2i32(<2 x i32> undef)
> +; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.umin.v2i32(<2 x i32> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i32 @llvm.experimental.vector.reduce.umin.v4i32(<4 x i32> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V8 = call i32 @llvm.experimental.vector.reduce.umin.v8i32(<8 x i32> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V16 = call i32 @llvm.experimental.vector.reduce.umin.v16i32(<16 x i32> undef)
> @@ -99,7 +99,7 @@ define i32 @reduce_i32(i32 %arg) {
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX2-LABEL: 'reduce_i32'
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.umin.v2i32(<2 x i32> undef)
> +; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.umin.v2i32(<2 x i32> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i32 @llvm.experimental.vector.reduce.umin.v4i32(<4 x i32> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i32 @llvm.experimental.vector.reduce.umin.v8i32(<8 x i32> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V16 = call i32 @llvm.experimental.vector.reduce.umin.v16i32(<16 x i32> undef)
> @@ -107,7 +107,7 @@ define i32 @reduce_i32(i32 %arg) {
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX512-LABEL: 'reduce_i32'
> -; AVX512-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.umin.v2i32(<2 x i32> undef)
> +; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i32 @llvm.experimental.vector.reduce.umin.v2i32(<2 x i32> undef)
>  ; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i32 @llvm.experimental.vector.reduce.umin.v4i32(<4 x i32> undef)
>  ; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i32 @llvm.experimental.vector.reduce.umin.v8i32(<8 x i32> undef)
>  ; AVX512-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V16 = call i32 @llvm.experimental.vector.reduce.umin.v16i32(<16 x i32> undef)
> @@ -124,8 +124,8 @@ define i32 @reduce_i32(i32 %arg) {
>
>  define i32 @reduce_i16(i32 %arg) {
>  ; SSE2-LABEL: 'reduce_i16'
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umin.v2i16(<2 x i16> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.umin.v4i16(<4 x i16> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umin.v2i16(<2 x i16> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.umin.v4i16(<4 x i16> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.umin.v8i16(<8 x i16> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.umin.v16i16(<16 x i16> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 16 for instruction: %V32 = call i16 @llvm.experimental.vector.reduce.umin.v32i16(<32 x i16> undef)
> @@ -133,8 +133,8 @@ define i32 @reduce_i16(i32 %arg) {
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; SSSE3-LABEL: 'reduce_i16'
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umin.v2i16(<2 x i16> undef)
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.umin.v4i16(<4 x i16> undef)
> +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umin.v2i16(<2 x i16> undef)
> +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.umin.v4i16(<4 x i16> undef)
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.umin.v8i16(<8 x i16> undef)
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.umin.v16i16(<16 x i16> undef)
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 16 for instruction: %V32 = call i16 @llvm.experimental.vector.reduce.umin.v32i16(<32 x i16> undef)
> @@ -142,7 +142,7 @@ define i32 @reduce_i16(i32 %arg) {
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; SSE42-LABEL: 'reduce_i16'
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umin.v2i16(<2 x i16> undef)
> +; SSE42-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umin.v2i16(<2 x i16> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.umin.v4i16(<4 x i16> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.umin.v8i16(<8 x i16> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.umin.v16i16(<16 x i16> undef)
> @@ -151,7 +151,7 @@ define i32 @reduce_i16(i32 %arg) {
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX1-LABEL: 'reduce_i16'
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umin.v2i16(<2 x i16> undef)
> +; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umin.v2i16(<2 x i16> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.umin.v4i16(<4 x i16> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.umin.v8i16(<8 x i16> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.umin.v16i16(<16 x i16> undef)
> @@ -160,7 +160,7 @@ define i32 @reduce_i16(i32 %arg) {
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX2-LABEL: 'reduce_i16'
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umin.v2i16(<2 x i16> undef)
> +; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umin.v2i16(<2 x i16> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.umin.v4i16(<4 x i16> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.umin.v8i16(<8 x i16> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.umin.v16i16(<16 x i16> undef)
> @@ -169,7 +169,7 @@ define i32 @reduce_i16(i32 %arg) {
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX512F-LABEL: 'reduce_i16'
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umin.v2i16(<2 x i16> undef)
> +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umin.v2i16(<2 x i16> undef)
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.umin.v4i16(<4 x i16> undef)
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.umin.v8i16(<8 x i16> undef)
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.umin.v16i16(<16 x i16> undef)
> @@ -178,7 +178,7 @@ define i32 @reduce_i16(i32 %arg) {
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX512BW-LABEL: 'reduce_i16'
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umin.v2i16(<2 x i16> undef)
> +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umin.v2i16(<2 x i16> undef)
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.umin.v4i16(<4 x i16> undef)
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.umin.v8i16(<8 x i16> undef)
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.umin.v16i16(<16 x i16> undef)
> @@ -187,7 +187,7 @@ define i32 @reduce_i16(i32 %arg) {
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX512DQ-LABEL: 'reduce_i16'
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umin.v2i16(<2 x i16> undef)
> +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.umin.v2i16(<2 x i16> undef)
>  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.umin.v4i16(<4 x i16> undef)
>  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.umin.v8i16(<8 x i16> undef)
>  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.umin.v16i16(<16 x i16> undef)
> @@ -206,9 +206,9 @@ define i32 @reduce_i16(i32 %arg) {
>
>  define i32 @reduce_i8(i32 %arg) {
>  ; SSE2-LABEL: 'reduce_i8'
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umin.v2i8(<2 x i8> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umin.v4i8(<4 x i8> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umin.v8i8(<8 x i8> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umin.v2i8(<2 x i8> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umin.v4i8(<4 x i8> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umin.v8i8(<8 x i8> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.umin.v16i8(<16 x i8> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 16 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.umin.v32i8(<32 x i8> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 32 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.umin.v64i8(<64 x i8> undef)
> @@ -216,9 +216,9 @@ define i32 @reduce_i8(i32 %arg) {
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; SSSE3-LABEL: 'reduce_i8'
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umin.v2i8(<2 x i8> undef)
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umin.v4i8(<4 x i8> undef)
> -; SSSE3-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umin.v8i8(<8 x i8> undef)
> +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umin.v2i8(<2 x i8> undef)
> +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umin.v4i8(<4 x i8> undef)
> +; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umin.v8i8(<8 x i8> undef)
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.umin.v16i8(<16 x i8> undef)
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 16 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.umin.v32i8(<32 x i8> undef)
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 32 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.umin.v64i8(<64 x i8> undef)
> @@ -226,9 +226,9 @@ define i32 @reduce_i8(i32 %arg) {
>  ; SSSE3-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; SSE42-LABEL: 'reduce_i8'
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umin.v2i8(<2 x i8> undef)
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umin.v4i8(<4 x i8> undef)
> -; SSE42-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umin.v8i8(<8 x i8> undef)
> +; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umin.v2i8(<2 x i8> undef)
> +; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umin.v4i8(<4 x i8> undef)
> +; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umin.v8i8(<8 x i8> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.umin.v16i8(<16 x i8> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.umin.v32i8(<32 x i8> undef)
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.umin.v64i8(<64 x i8> undef)
> @@ -236,9 +236,9 @@ define i32 @reduce_i8(i32 %arg) {
>  ; SSE42-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX1-LABEL: 'reduce_i8'
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umin.v2i8(<2 x i8> undef)
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umin.v4i8(<4 x i8> undef)
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umin.v8i8(<8 x i8> undef)
> +; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umin.v2i8(<2 x i8> undef)
> +; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umin.v4i8(<4 x i8> undef)
> +; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umin.v8i8(<8 x i8> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.umin.v16i8(<16 x i8> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.umin.v32i8(<32 x i8> undef)
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.umin.v64i8(<64 x i8> undef)
> @@ -246,9 +246,9 @@ define i32 @reduce_i8(i32 %arg) {
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX2-LABEL: 'reduce_i8'
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umin.v2i8(<2 x i8> undef)
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umin.v4i8(<4 x i8> undef)
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umin.v8i8(<8 x i8> undef)
> +; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umin.v2i8(<2 x i8> undef)
> +; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umin.v4i8(<4 x i8> undef)
> +; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umin.v8i8(<8 x i8> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.umin.v16i8(<16 x i8> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.umin.v32i8(<32 x i8> undef)
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.umin.v64i8(<64 x i8> undef)
> @@ -256,9 +256,9 @@ define i32 @reduce_i8(i32 %arg) {
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX512F-LABEL: 'reduce_i8'
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umin.v2i8(<2 x i8> undef)
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umin.v4i8(<4 x i8> undef)
> -; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umin.v8i8(<8 x i8> undef)
> +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umin.v2i8(<2 x i8> undef)
> +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umin.v4i8(<4 x i8> undef)
> +; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umin.v8i8(<8 x i8> undef)
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.umin.v16i8(<16 x i8> undef)
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.umin.v32i8(<32 x i8> undef)
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.umin.v64i8(<64 x i8> undef)
> @@ -266,9 +266,9 @@ define i32 @reduce_i8(i32 %arg) {
>  ; AVX512F-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX512BW-LABEL: 'reduce_i8'
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umin.v2i8(<2 x i8> undef)
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umin.v4i8(<4 x i8> undef)
> -; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umin.v8i8(<8 x i8> undef)
> +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umin.v2i8(<2 x i8> undef)
> +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umin.v4i8(<4 x i8> undef)
> +; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umin.v8i8(<8 x i8> undef)
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.umin.v16i8(<16 x i8> undef)
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.umin.v32i8(<32 x i8> undef)
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 61 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.umin.v64i8(<64 x i8> undef)
> @@ -276,9 +276,9 @@ define i32 @reduce_i8(i32 %arg) {
>  ; AVX512BW-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX512DQ-LABEL: 'reduce_i8'
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umin.v2i8(<2 x i8> undef)
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umin.v4i8(<4 x i8> undef)
> -; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umin.v8i8(<8 x i8> undef)
> +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.umin.v2i8(<2 x i8> undef)
> +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.umin.v4i8(<4 x i8> undef)
> +; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.umin.v8i8(<8 x i8> undef)
>  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.umin.v16i8(<16 x i8> undef)
>  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.umin.v32i8(<32 x i8> undef)
>  ; AVX512DQ-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.umin.v64i8(<64 x i8> undef)
>
> Modified: llvm/trunk/test/Analysis/CostModel/X86/reduce-xor.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/reduce-xor.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/Analysis/CostModel/X86/reduce-xor.ll (original)
> +++ llvm/trunk/test/Analysis/CostModel/X86/reduce-xor.ll Wed Aug  7 09:24:26 2019
> @@ -92,8 +92,8 @@ define i32 @reduce_i32(i32 %arg) {
>
>  define i32 @reduce_i16(i32 %arg) {
>  ; SSE2-LABEL: 'reduce_i16'
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.xor.v2i16(<2 x i16> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.xor.v4i16(<4 x i16> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V2 = call i16 @llvm.experimental.vector.reduce.xor.v2i16(<2 x i16> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 13 for instruction: %V4 = call i16 @llvm.experimental.vector.reduce.xor.v4i16(<4 x i16> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 19 for instruction: %V8 = call i16 @llvm.experimental.vector.reduce.xor.v8i16(<8 x i16> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %V16 = call i16 @llvm.experimental.vector.reduce.xor.v16i16(<16 x i16> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 22 for instruction: %V32 = call i16 @llvm.experimental.vector.reduce.xor.v32i16(<32 x i16> undef)
> @@ -174,9 +174,9 @@ define i32 @reduce_i16(i32 %arg) {
>
>  define i32 @reduce_i8(i32 %arg) {
>  ; SSE2-LABEL: 'reduce_i8'
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.xor.v2i8(<2 x i8> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.xor.v4i8(<4 x i8> undef)
> -; SSE2-NEXT:  Cost Model: Found an estimated cost of 19 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.xor.v8i8(<8 x i8> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V2 = call i8 @llvm.experimental.vector.reduce.xor.v2i8(<2 x i8> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 23 for instruction: %V4 = call i8 @llvm.experimental.vector.reduce.xor.v4i8(<4 x i8> undef)
> +; SSE2-NEXT:  Cost Model: Found an estimated cost of 34 for instruction: %V8 = call i8 @llvm.experimental.vector.reduce.xor.v8i8(<8 x i8> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 45 for instruction: %V16 = call i8 @llvm.experimental.vector.reduce.xor.v16i8(<16 x i8> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 46 for instruction: %V32 = call i8 @llvm.experimental.vector.reduce.xor.v32i8(<32 x i8> undef)
>  ; SSE2-NEXT:  Cost Model: Found an estimated cost of 48 for instruction: %V64 = call i8 @llvm.experimental.vector.reduce.xor.v64i8(<64 x i8> undef)
>
> Modified: llvm/trunk/test/Analysis/CostModel/X86/shuffle-transpose.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/shuffle-transpose.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/Analysis/CostModel/X86/shuffle-transpose.ll (original)
> +++ llvm/trunk/test/Analysis/CostModel/X86/shuffle-transpose.ll Wed Aug  7 09:24:26 2019
> @@ -123,21 +123,21 @@ define void @test_vXf32(<2 x float> %a64
>
>  define void @test_vXi32(<2 x i32> %a64, <2 x i32> %b64, <4 x i32> %a128, <4 x i32> %b128, <8 x i32> %a256, <8 x i32> %b256, <16 x i32> %a512, <16 x i32> %b512) {
>  ; SSE-LABEL: 'test_vXi32'
> -; SSE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V64 = shufflevector <2 x i32> %a64, <2 x i32> %b64, <2 x i32> <i32 0, i32 2>
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V64 = shufflevector <2 x i32> %a64, <2 x i32> %b64, <2 x i32> <i32 0, i32 2>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V128 = shufflevector <4 x i32> %a128, <4 x i32> %b128, <4 x i32> <i32 0, i32 4, i32 2, i32 6>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V256 = shufflevector <8 x i32> %a256, <8 x i32> %b256, <8 x i32> <i32 0, i32 8, i32 2, i32 10, i32 4, i32 12, i32 6, i32 14>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 56 for instruction: %V512 = shufflevector <16 x i32> %a512, <16 x i32> %b512, <16 x i32> <i32 0, i32 16, i32 2, i32 18, i32 4, i32 20, i32 6, i32 22, i32 8, i32 24, i32 10, i32 26, i32 12, i32 28, i32 14, i32 30>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
>  ;
>  ; AVX1-LABEL: 'test_vXi32'
> -; AVX1-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V64 = shufflevector <2 x i32> %a64, <2 x i32> %b64, <2 x i32> <i32 0, i32 2>
> +; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V64 = shufflevector <2 x i32> %a64, <2 x i32> %b64, <2 x i32> <i32 0, i32 2>
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V128 = shufflevector <4 x i32> %a128, <4 x i32> %b128, <4 x i32> <i32 0, i32 4, i32 2, i32 6>
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V256 = shufflevector <8 x i32> %a256, <8 x i32> %b256, <8 x i32> <i32 0, i32 8, i32 2, i32 10, i32 4, i32 12, i32 6, i32 14>
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 24 for instruction: %V512 = shufflevector <16 x i32> %a512, <16 x i32> %b512, <16 x i32> <i32 0, i32 16, i32 2, i32 18, i32 4, i32 20, i32 6, i32 22, i32 8, i32 24, i32 10, i32 26, i32 12, i32 28, i32 14, i32 30>
>  ; AVX1-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
>  ;
>  ; AVX2-LABEL: 'test_vXi32'
> -; AVX2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V64 = shufflevector <2 x i32> %a64, <2 x i32> %b64, <2 x i32> <i32 0, i32 2>
> +; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V64 = shufflevector <2 x i32> %a64, <2 x i32> %b64, <2 x i32> <i32 0, i32 2>
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V128 = shufflevector <4 x i32> %a128, <4 x i32> %b128, <4 x i32> <i32 0, i32 4, i32 2, i32 6>
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V256 = shufflevector <8 x i32> %a256, <8 x i32> %b256, <8 x i32> <i32 0, i32 8, i32 2, i32 10, i32 4, i32 12, i32 6, i32 14>
>  ; AVX2-NEXT:  Cost Model: Found an estimated cost of 18 for instruction: %V512 = shufflevector <16 x i32> %a512, <16 x i32> %b512, <16 x i32> <i32 0, i32 16, i32 2, i32 18, i32 4, i32 20, i32 6, i32 22, i32 8, i32 24, i32 10, i32 26, i32 12, i32 28, i32 14, i32 30>
> @@ -151,7 +151,7 @@ define void @test_vXi32(<2 x i32> %a64,
>  ; AVX512-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
>  ;
>  ; BTVER2-LABEL: 'test_vXi32'
> -; BTVER2-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V64 = shufflevector <2 x i32> %a64, <2 x i32> %b64, <2 x i32> <i32 0, i32 2>
> +; BTVER2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V64 = shufflevector <2 x i32> %a64, <2 x i32> %b64, <2 x i32> <i32 0, i32 2>
>  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V128 = shufflevector <4 x i32> %a128, <4 x i32> %b128, <4 x i32> <i32 0, i32 4, i32 2, i32 6>
>  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V256 = shufflevector <8 x i32> %a256, <8 x i32> %b256, <8 x i32> <i32 0, i32 8, i32 2, i32 10, i32 4, i32 12, i32 6, i32 14>
>  ; BTVER2-NEXT:  Cost Model: Found an estimated cost of 24 for instruction: %V512 = shufflevector <16 x i32> %a512, <16 x i32> %b512, <16 x i32> <i32 0, i32 16, i32 2, i32 18, i32 4, i32 20, i32 6, i32 22, i32 8, i32 24, i32 10, i32 26, i32 12, i32 28, i32 14, i32 30>
>
> Modified: llvm/trunk/test/Analysis/CostModel/X86/sitofp.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/sitofp.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/Analysis/CostModel/X86/sitofp.ll (original)
> +++ llvm/trunk/test/Analysis/CostModel/X86/sitofp.ll Wed Aug  7 09:24:26 2019
> @@ -13,9 +13,9 @@
>  define i32 @sitofp_i8_double() {
>  ; SSE-LABEL: 'sitofp_i8_double'
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %cvt_i8_f64 = sitofp i8 undef to double
> -; SSE-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %cvt_v2i8_v2f64 = sitofp <2 x i8> undef to <2 x double>
> -; SSE-NEXT:  Cost Model: Found an estimated cost of 40 for instruction: %cvt_v4i8_v4f64 = sitofp <4 x i8> undef to <4 x double>
> -; SSE-NEXT:  Cost Model: Found an estimated cost of 80 for instruction: %cvt_v8i8_v8f64 = sitofp <8 x i8> undef to <8 x double>
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 160 for instruction: %cvt_v2i8_v2f64 = sitofp <2 x i8> undef to <2 x double>
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 160 for instruction: %cvt_v4i8_v4f64 = sitofp <4 x i8> undef to <4 x double>
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 160 for instruction: %cvt_v8i8_v8f64 = sitofp <8 x i8> undef to <8 x double>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX-LABEL: 'sitofp_i8_double'
> @@ -49,8 +49,8 @@ define i32 @sitofp_i8_double() {
>  define i32 @sitofp_i16_double() {
>  ; SSE-LABEL: 'sitofp_i16_double'
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %cvt_i16_f64 = sitofp i16 undef to double
> -; SSE-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %cvt_v2i16_v2f64 = sitofp <2 x i16> undef to <2 x double>
> -; SSE-NEXT:  Cost Model: Found an estimated cost of 40 for instruction: %cvt_v4i16_v4f64 = sitofp <4 x i16> undef to <4 x double>
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 80 for instruction: %cvt_v2i16_v2f64 = sitofp <2 x i16> undef to <2 x double>
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 80 for instruction: %cvt_v4i16_v4f64 = sitofp <4 x i16> undef to <4 x double>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 80 for instruction: %cvt_v8i16_v8f64 = sitofp <8 x i16> undef to <8 x double>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
> @@ -85,7 +85,7 @@ define i32 @sitofp_i16_double() {
>  define i32 @sitofp_i32_double() {
>  ; SSE-LABEL: 'sitofp_i32_double'
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %cvt_i32_f64 = sitofp i32 undef to double
> -; SSE-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %cvt_v2i32_v2f64 = sitofp <2 x i32> undef to <2 x double>
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 40 for instruction: %cvt_v2i32_v2f64 = sitofp <2 x i32> undef to <2 x double>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 40 for instruction: %cvt_v4i32_v4f64 = sitofp <4 x i32> undef to <4 x double>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 80 for instruction: %cvt_v8i32_v8f64 = sitofp <8 x i32> undef to <8 x double>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> @@ -164,8 +164,8 @@ define i32 @sitofp_i64_double() {
>  define i32 @sitofp_i8_float() {
>  ; SSE-LABEL: 'sitofp_i8_float'
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %cvt_i8_f32 = sitofp i8 undef to float
> -; SSE-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %cvt_v4i8_v4f32 = sitofp <4 x i8> undef to <4 x float>
> -; SSE-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %cvt_v8i8_v8f32 = sitofp <8 x i8> undef to <8 x float>
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %cvt_v4i8_v4f32 = sitofp <4 x i8> undef to <4 x float>
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %cvt_v8i8_v8f32 = sitofp <8 x i8> undef to <8 x float>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %cvt_v16i8_v16f32 = sitofp <16 x i8> undef to <16 x float>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
> @@ -200,7 +200,7 @@ define i32 @sitofp_i8_float() {
>  define i32 @sitofp_i16_float() {
>  ; SSE-LABEL: 'sitofp_i16_float'
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %cvt_i16_f32 = sitofp i16 undef to float
> -; SSE-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %cvt_v4i16_v4f32 = sitofp <4 x i16> undef to <4 x float>
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %cvt_v4i16_v4f32 = sitofp <4 x i16> undef to <4 x float>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %cvt_v8i16_v8f32 = sitofp <8 x i16> undef to <8 x float>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 30 for instruction: %cvt_v16i16_v16f32 = sitofp <16 x i16> undef to <16 x float>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>
> Modified: llvm/trunk/test/Analysis/CostModel/X86/slm-arith-costs.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/slm-arith-costs.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/Analysis/CostModel/X86/slm-arith-costs.ll (original)
> +++ llvm/trunk/test/Analysis/CostModel/X86/slm-arith-costs.ll Wed Aug  7 09:24:26 2019
> @@ -47,11 +47,11 @@ entry:
>
>  define <2 x i8> @slm-costs_8_v2_mul(<2 x i8> %a, <2 x i8> %b)  {
>  ; SLM-LABEL: 'slm-costs_8_v2_mul'
> -; SLM-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %res = mul nsw <2 x i8> %a, %b
> +; SLM-NEXT:  Cost Model: Found an estimated cost of 14 for instruction: %res = mul nsw <2 x i8> %a, %b
>  ; SLM-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i8> %res
>  ;
>  ; GLM-LABEL: 'slm-costs_8_v2_mul'
> -; GLM-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %res = mul nsw <2 x i8> %a, %b
> +; GLM-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %res = mul nsw <2 x i8> %a, %b
>  ; GLM-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i8> %res
>  ;
>  entry:
> @@ -61,11 +61,11 @@ entry:
>
>  define <4 x i8> @slm-costs_8_v4_mul(<4 x i8> %a, <4 x i8> %b)  {
>  ; SLM-LABEL: 'slm-costs_8_v4_mul'
> -; SLM-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %res = mul nsw <4 x i8> %a, %b
> +; SLM-NEXT:  Cost Model: Found an estimated cost of 14 for instruction: %res = mul nsw <4 x i8> %a, %b
>  ; SLM-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i8> %res
>  ;
>  ; GLM-LABEL: 'slm-costs_8_v4_mul'
> -; GLM-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %res = mul nsw <4 x i8> %a, %b
> +; GLM-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %res = mul nsw <4 x i8> %a, %b
>  ; GLM-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i8> %res
>  ;
>  entry:
> @@ -177,11 +177,11 @@ entry:
>
>  define <8 x i8> @slm-costs_8_v8_mul(<8 x i8> %a, <8 x i8> %b)  {
>  ; SLM-LABEL: 'slm-costs_8_v8_mul'
> -; SLM-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %res = mul nsw <8 x i8> %a, %b
> +; SLM-NEXT:  Cost Model: Found an estimated cost of 14 for instruction: %res = mul nsw <8 x i8> %a, %b
>  ; SLM-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i8> %res
>  ;
>  ; GLM-LABEL: 'slm-costs_8_v8_mul'
> -; GLM-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %res = mul nsw <8 x i8> %a, %b
> +; GLM-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %res = mul nsw <8 x i8> %a, %b
>  ; GLM-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i8> %res
>  ;
>  entry:
> @@ -216,11 +216,11 @@ entry:
>
>  define <2 x i16> @slm-costs_16_v2_mul(<2 x i16> %a, <2 x i16> %b)  {
>  ; SLM-LABEL: 'slm-costs_16_v2_mul'
> -; SLM-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %res = mul nsw <2 x i16> %a, %b
> +; SLM-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %res = mul nsw <2 x i16> %a, %b
>  ; SLM-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i16> %res
>  ;
>  ; GLM-LABEL: 'slm-costs_16_v2_mul'
> -; GLM-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %res = mul nsw <2 x i16> %a, %b
> +; GLM-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %res = mul nsw <2 x i16> %a, %b
>  ; GLM-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i16> %res
>  ;
>  entry:
> @@ -230,11 +230,11 @@ entry:
>
>  define <4 x i16> @slm-costs_16_v4_mul(<4 x i16> %a, <4 x i16> %b)  {
>  ; SLM-LABEL: 'slm-costs_16_v4_mul'
> -; SLM-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %res = mul nsw <4 x i16> %a, %b
> +; SLM-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %res = mul nsw <4 x i16> %a, %b
>  ; SLM-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i16> %res
>  ;
>  ; GLM-LABEL: 'slm-costs_16_v4_mul'
> -; GLM-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %res = mul nsw <4 x i16> %a, %b
> +; GLM-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %res = mul nsw <4 x i16> %a, %b
>  ; GLM-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i16> %res
>  ;
>  entry:
> @@ -385,11 +385,11 @@ entry:
>
>  define <2 x i32> @slm-costs_32_v2_mul(<2 x i32> %a, <2 x i32> %b)  {
>  ; SLM-LABEL: 'slm-costs_32_v2_mul'
> -; SLM-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %res = mul nsw <2 x i32> %a, %b
> +; SLM-NEXT:  Cost Model: Found an estimated cost of 11 for instruction: %res = mul nsw <2 x i32> %a, %b
>  ; SLM-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i32> %res
>  ;
>  ; GLM-LABEL: 'slm-costs_32_v2_mul'
> -; GLM-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %res = mul nsw <2 x i32> %a, %b
> +; GLM-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %res = mul nsw <2 x i32> %a, %b
>  ; GLM-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i32> %res
>  ;
>  entry:
>
> Modified: llvm/trunk/test/Analysis/CostModel/X86/testshiftashr.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/testshiftashr.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/Analysis/CostModel/X86/testshiftashr.ll (original)
> +++ llvm/trunk/test/Analysis/CostModel/X86/testshiftashr.ll Wed Aug  7 09:24:26 2019
> @@ -5,9 +5,9 @@
>  define %shifttype @shift2i16(%shifttype %a, %shifttype %b) {
>  entry:
>    ; SSE2-LABEL: shift2i16
> -  ; SSE2: cost of 12 {{.*}} ashr
> +  ; SSE2: cost of 32 {{.*}} ashr
>    ; SSE2-CODEGEN-LABEL: shift2i16
> -  ; SSE2-CODEGEN: psrlq
> +  ; SSE2-CODEGEN: psraw
>
>    %0 = ashr %shifttype %a , %b
>    ret %shifttype %0
> @@ -17,9 +17,9 @@ entry:
>  define %shifttype4i16 @shift4i16(%shifttype4i16 %a, %shifttype4i16 %b) {
>  entry:
>    ; SSE2-LABEL: shift4i16
> -  ; SSE2: cost of 16 {{.*}} ashr
> +  ; SSE2: cost of 32 {{.*}} ashr
>    ; SSE2-CODEGEN-LABEL: shift4i16
> -  ; SSE2-CODEGEN: psrad
> +  ; SSE2-CODEGEN: psraw
>
>    %0 = ashr %shifttype4i16 %a , %b
>    ret %shifttype4i16 %0
> @@ -65,9 +65,9 @@ entry:
>  define %shifttype2i32 @shift2i32(%shifttype2i32 %a, %shifttype2i32 %b) {
>  entry:
>    ; SSE2-LABEL: shift2i32
> -  ; SSE2: cost of 12 {{.*}} ashr
> +  ; SSE2: cost of 16 {{.*}} ashr
>    ; SSE2-CODEGEN-LABEL: shift2i32
> -  ; SSE2-CODEGEN: psrlq
> +  ; SSE2-CODEGEN: psrad
>
>    %0 = ashr %shifttype2i32 %a , %b
>    ret %shifttype2i32 %0
> @@ -185,9 +185,9 @@ entry:
>  define %shifttype2i8 @shift2i8(%shifttype2i8 %a, %shifttype2i8 %b) {
>  entry:
>    ; SSE2-LABEL: shift2i8
> -  ; SSE2: cost of 12 {{.*}} ashr
> +  ; SSE2: cost of 54 {{.*}} ashr
>    ; SSE2-CODEGEN-LABEL: shift2i8
> -  ; SSE2-CODEGEN: psrlq
> +  ; SSE2-CODEGEN: psrlw
>
>    %0 = ashr %shifttype2i8 %a , %b
>    ret %shifttype2i8 %0
> @@ -197,9 +197,9 @@ entry:
>  define %shifttype4i8 @shift4i8(%shifttype4i8 %a, %shifttype4i8 %b) {
>  entry:
>    ; SSE2-LABEL: shift4i8
> -  ; SSE2: cost of 16 {{.*}} ashr
> +  ; SSE2: cost of 54 {{.*}} ashr
>    ; SSE2-CODEGEN-LABEL: shift4i8
> -  ; SSE2-CODEGEN: psrad
> +  ; SSE2-CODEGEN: psraw
>
>    %0 = ashr %shifttype4i8 %a , %b
>    ret %shifttype4i8 %0
> @@ -209,7 +209,7 @@ entry:
>  define %shifttype8i8 @shift8i8(%shifttype8i8 %a, %shifttype8i8 %b) {
>  entry:
>    ; SSE2-LABEL: shift8i8
> -  ; SSE2: cost of 32 {{.*}} ashr
> +  ; SSE2: cost of 54 {{.*}} ashr
>    ; SSE2-CODEGEN-LABEL: shift8i8
>    ; SSE2-CODEGEN: psraw
>
> @@ -247,9 +247,9 @@ entry:
>  define %shifttypec @shift2i16const(%shifttypec %a, %shifttypec %b) {
>  entry:
>    ; SSE2-LABEL: shift2i16const
> -  ; SSE2: cost of 4 {{.*}} ashr
> +  ; SSE2: cost of 1 {{.*}} ashr
>    ; SSE2-CODEGEN-LABEL: shift2i16const
> -  ; SSE2-CODEGEN: psrad $3
> +  ; SSE2-CODEGEN: psraw $3
>
>    %0 = ashr %shifttypec %a , <i16 3, i16 3>
>    ret %shifttypec %0
> @@ -261,7 +261,7 @@ entry:
>    ; SSE2-LABEL: shift4i16const
>    ; SSE2: cost of 1 {{.*}} ashr
>    ; SSE2-CODEGEN-LABEL: shift4i16const
> -  ; SSE2-CODEGEN: psrad $19
> +  ; SSE2-CODEGEN: psraw $3
>
>    %0 = ashr %shifttypec4i16 %a , <i16 3, i16 3, i16 3, i16 3>
>    ret %shifttypec4i16 %0
> @@ -320,7 +320,7 @@ entry:
>  define %shifttypec2i32 @shift2i32c(%shifttypec2i32 %a, %shifttypec2i32 %b) {
>  entry:
>    ; SSE2-LABEL: shift2i32c
> -  ; SSE2: cost of 4 {{.*}} ashr
> +  ; SSE2: cost of 1 {{.*}} ashr
>    ; SSE2-CODEGEN-LABEL: shift2i32c
>    ; SSE2-CODEGEN: psrad $3
>
> @@ -464,7 +464,7 @@ entry:
>    ; SSE2-LABEL: shift2i8c
>    ; SSE2: cost of 4 {{.*}} ashr
>    ; SSE2-CODEGEN-LABEL: shift2i8c
> -  ; SSE2-CODEGEN: psrad $3
> +  ; SSE2-CODEGEN: psrlw $3
>
>    %0 = ashr %shifttypec2i8 %a , <i8 3, i8 3>
>    ret %shifttypec2i8 %0
> @@ -474,9 +474,9 @@ entry:
>  define %shifttypec4i8 @shift4i8c(%shifttypec4i8 %a, %shifttypec4i8 %b) {
>  entry:
>    ; SSE2-LABEL: shift4i8c
> -  ; SSE2: cost of 1 {{.*}} ashr
> +  ; SSE2: cost of 4 {{.*}} ashr
>    ; SSE2-CODEGEN-LABEL: shift4i8c
> -  ; SSE2-CODEGEN: psrad $27
> +  ; SSE2-CODEGEN: psrlw $3
>
>    %0 = ashr %shifttypec4i8 %a , <i8 3, i8 3, i8 3, i8 3>
>    ret %shifttypec4i8 %0
> @@ -486,9 +486,9 @@ entry:
>  define %shifttypec8i8 @shift8i8c(%shifttypec8i8 %a, %shifttypec8i8 %b) {
>  entry:
>    ; SSE2-LABEL: shift8i8c
> -  ; SSE2: cost of 1 {{.*}} ashr
> +  ; SSE2: cost of 4 {{.*}} ashr
>    ; SSE2-CODEGEN-LABEL: shift8i8c
> -  ; SSE2-CODEGEN: psraw $11
> +  ; SSE2-CODEGEN: psrlw $3
>
>    %0 = ashr %shifttypec8i8 %a , <i8 3, i8 3, i8 3, i8 3,
>                                   i8 3, i8 3, i8 3, i8 3>
>
> Modified: llvm/trunk/test/Analysis/CostModel/X86/testshiftlshr.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/testshiftlshr.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/Analysis/CostModel/X86/testshiftlshr.ll (original)
> +++ llvm/trunk/test/Analysis/CostModel/X86/testshiftlshr.ll Wed Aug  7 09:24:26 2019
> @@ -5,9 +5,9 @@
>  define %shifttype @shift2i16(%shifttype %a, %shifttype %b) {
>  entry:
>    ; SSE2-LABEL: shift2i16
> -  ; SSE2: cost of 4 {{.*}} lshr
> +  ; SSE2: cost of 32 {{.*}} lshr
>    ; SSE2-CODEGEN-LABEL: shift2i16
> -  ; SSE2-CODEGEN: psrlq
> +  ; SSE2-CODEGEN: psrlw
>
>    %0 = lshr %shifttype %a , %b
>    ret %shifttype %0
> @@ -17,9 +17,9 @@ entry:
>  define %shifttype4i16 @shift4i16(%shifttype4i16 %a, %shifttype4i16 %b) {
>  entry:
>    ; SSE2-LABEL: shift4i16
> -  ; SSE2: cost of 16 {{.*}} lshr
> +  ; SSE2: cost of 32 {{.*}} lshr
>    ; SSE2-CODEGEN-LABEL: shift4i16
> -  ; SSE2-CODEGEN: psrld
> +  ; SSE2-CODEGEN: psrlw
>
>    %0 = lshr %shifttype4i16 %a , %b
>    ret %shifttype4i16 %0
> @@ -65,9 +65,9 @@ entry:
>  define %shifttype2i32 @shift2i32(%shifttype2i32 %a, %shifttype2i32 %b) {
>  entry:
>    ; SSE2-LABEL: shift2i32
> -  ; SSE2: cost of 4 {{.*}} lshr
> +  ; SSE2: cost of 16 {{.*}} lshr
>    ; SSE2-CODEGEN-LABEL: shift2i32
> -  ; SSE2-CODEGEN: psrlq
> +  ; SSE2-CODEGEN: psrld
>
>    %0 = lshr %shifttype2i32 %a , %b
>    ret %shifttype2i32 %0
> @@ -185,9 +185,9 @@ entry:
>  define %shifttype2i8 @shift2i8(%shifttype2i8 %a, %shifttype2i8 %b) {
>  entry:
>    ; SSE2-LABEL: shift2i8
> -  ; SSE2: cost of 4 {{.*}} lshr
> +  ; SSE2: cost of 26 {{.*}} lshr
>    ; SSE2-CODEGEN-LABEL: shift2i8
> -  ; SSE2-CODEGEN: psrlq
> +  ; SSE2-CODEGEN: psrlw
>
>    %0 = lshr %shifttype2i8 %a , %b
>    ret %shifttype2i8 %0
> @@ -197,9 +197,9 @@ entry:
>  define %shifttype4i8 @shift4i8(%shifttype4i8 %a, %shifttype4i8 %b) {
>  entry:
>    ; SSE2-LABEL: shift4i8
> -  ; SSE2: cost of 16 {{.*}} lshr
> +  ; SSE2: cost of 26 {{.*}} lshr
>    ; SSE2-CODEGEN-LABEL: shift4i8
> -  ; SSE2-CODEGEN: psrld
> +  ; SSE2-CODEGEN: psrlw
>
>    %0 = lshr %shifttype4i8 %a , %b
>    ret %shifttype4i8 %0
> @@ -209,7 +209,7 @@ entry:
>  define %shifttype8i8 @shift8i8(%shifttype8i8 %a, %shifttype8i8 %b) {
>  entry:
>    ; SSE2-LABEL: shift8i8
> -  ; SSE2: cost of 32 {{.*}} lshr
> +  ; SSE2: cost of 26 {{.*}} lshr
>    ; SSE2-CODEGEN-LABEL: shift8i8
>    ; SSE2-CODEGEN: psrlw
>
> @@ -249,7 +249,7 @@ entry:
>    ; SSE2-LABEL: shift2i16const
>    ; SSE2: cost of 1 {{.*}} lshr
>    ; SSE2-CODEGEN-LABEL: shift2i16const
> -  ; SSE2-CODEGEN: psrlq $3
> +  ; SSE2-CODEGEN: psrlw $3
>
>    %0 = lshr %shifttypec %a , <i16 3, i16 3>
>    ret %shifttypec %0
> @@ -261,7 +261,7 @@ entry:
>    ; SSE2-LABEL: shift4i16const
>    ; SSE2: cost of 1 {{.*}} lshr
>    ; SSE2-CODEGEN-LABEL: shift4i16const
> -  ; SSE2-CODEGEN: psrld $3
> +  ; SSE2-CODEGEN: psrlw $3
>
>    %0 = lshr %shifttypec4i16 %a , <i16 3, i16 3, i16 3, i16 3>
>    ret %shifttypec4i16 %0
> @@ -322,7 +322,7 @@ entry:
>    ; SSE2-LABEL: shift2i32c
>    ; SSE2: cost of 1 {{.*}} lshr
>    ; SSE2-CODEGEN-LABEL: shift2i32c
> -  ; SSE2-CODEGEN: psrlq $3
> +  ; SSE2-CODEGEN: psrld $3
>
>    %0 = lshr %shifttypec2i32 %a , <i32 3, i32 3>
>    ret %shifttypec2i32 %0
> @@ -461,9 +461,9 @@ entry:
>  define %shifttypec2i8 @shift2i8c(%shifttypec2i8 %a, %shifttypec2i8 %b) {
>  entry:
>    ; SSE2-LABEL: shift2i8c
> -  ; SSE2: cost of 1 {{.*}} lshr
> +  ; SSE2: cost of 2 {{.*}} lshr
>    ; SSE2-CODEGEN-LABEL: shift2i8c
> -  ; SSE2-CODEGEN: psrlq $3
> +  ; SSE2-CODEGEN: psrlw $3
>
>    %0 = lshr %shifttypec2i8 %a , <i8 3, i8 3>
>    ret %shifttypec2i8 %0
> @@ -473,9 +473,9 @@ entry:
>  define %shifttypec4i8 @shift4i8c(%shifttypec4i8 %a, %shifttypec4i8 %b) {
>  entry:
>    ; SSE2-LABEL: shift4i8c
> -  ; SSE2: cost of 1 {{.*}} lshr
> +  ; SSE2: cost of 2 {{.*}} lshr
>    ; SSE2-CODEGEN-LABEL: shift4i8c
> -  ; SSE2-CODEGEN: psrld $3
> +  ; SSE2-CODEGEN: psrlw $3
>
>    %0 = lshr %shifttypec4i8 %a , <i8 3, i8 3, i8 3, i8 3>
>    ret %shifttypec4i8 %0
> @@ -485,7 +485,7 @@ entry:
>  define %shifttypec8i8 @shift8i8c(%shifttypec8i8 %a, %shifttypec8i8 %b) {
>  entry:
>    ; SSE2-LABEL: shift8i8c
> -  ; SSE2: cost of 1 {{.*}} lshr
> +  ; SSE2: cost of 2 {{.*}} lshr
>    ; SSE2-CODEGEN-LABEL: shift8i8c
>    ; SSE2-CODEGEN: psrlw $3
>
>
> Modified: llvm/trunk/test/Analysis/CostModel/X86/testshiftshl.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/testshiftshl.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/Analysis/CostModel/X86/testshiftshl.ll (original)
> +++ llvm/trunk/test/Analysis/CostModel/X86/testshiftshl.ll Wed Aug  7 09:24:26 2019
> @@ -5,9 +5,9 @@
>  define %shifttype @shift2i16(%shifttype %a, %shifttype %b) {
>  entry:
>    ; SSE2-LABEL: shift2i16
> -  ; SSE2: cost of 4 {{.*}} shl
> +  ; SSE2: cost of 32 {{.*}} shl
>    ; SSE2-CODEGEN-LABEL: shift2i16
> -  ; SSE2-CODEGEN: psllq
> +  ; SSE2-CODEGEN: pmullw
>
>    %0 = shl %shifttype %a , %b
>    ret %shifttype %0
> @@ -17,9 +17,9 @@ entry:
>  define %shifttype4i16 @shift4i16(%shifttype4i16 %a, %shifttype4i16 %b) {
>  entry:
>    ; SSE2-LABEL: shift4i16
> -  ; SSE2: cost of 10 {{.*}} shl
> +  ; SSE2: cost of 32 {{.*}} shl
>    ; SSE2-CODEGEN-LABEL: shift4i16
> -  ; SSE2-CODEGEN: pmuludq
> +  ; SSE2-CODEGEN: pmullw
>
>    %0 = shl %shifttype4i16 %a , %b
>    ret %shifttype4i16 %0
> @@ -65,9 +65,9 @@ entry:
>  define %shifttype2i32 @shift2i32(%shifttype2i32 %a, %shifttype2i32 %b) {
>  entry:
>    ; SSE2-LABEL: shift2i32
> -  ; SSE2: cost of 4 {{.*}} shl
> +  ; SSE2: cost of 10 {{.*}} shl
>    ; SSE2-CODEGEN-LABEL: shift2i32
> -  ; SSE2-CODEGEN: psllq
> +  ; SSE2-CODEGEN: pmuludq
>
>    %0 = shl %shifttype2i32 %a , %b
>    ret %shifttype2i32 %0
> @@ -185,9 +185,9 @@ entry:
>  define %shifttype2i8 @shift2i8(%shifttype2i8 %a, %shifttype2i8 %b) {
>  entry:
>    ; SSE2-LABEL: shift2i8
> -  ; SSE2: cost of 4 {{.*}} shl
> +  ; SSE2: cost of 26 {{.*}} shl
>    ; SSE2-CODEGEN-LABEL: shift2i8
> -  ; SSE2-CODEGEN: psllq
> +  ; SSE2-CODEGEN: psllw
>
>    %0 = shl %shifttype2i8 %a , %b
>    ret %shifttype2i8 %0
> @@ -197,9 +197,9 @@ entry:
>  define %shifttype4i8 @shift4i8(%shifttype4i8 %a, %shifttype4i8 %b) {
>  entry:
>    ; SSE2-LABEL: shift4i8
> -  ; SSE2: cost of 10 {{.*}} shl
> +  ; SSE2: cost of 26 {{.*}} shl
>    ; SSE2-CODEGEN-LABEL: shift4i8
> -  ; SSE2-CODEGEN: pmuludq
> +  ; SSE2-CODEGEN: psllw
>
>    %0 = shl %shifttype4i8 %a , %b
>    ret %shifttype4i8 %0
> @@ -209,9 +209,9 @@ entry:
>  define %shifttype8i8 @shift8i8(%shifttype8i8 %a, %shifttype8i8 %b) {
>  entry:
>    ; SSE2-LABEL: shift8i8
> -  ; SSE2: cost of 32 {{.*}} shl
> +  ; SSE2: cost of 26 {{.*}} shl
>    ; SSE2-CODEGEN-LABEL: shift8i8
> -  ; SSE2-CODEGEN: pmullw
> +  ; SSE2-CODEGEN: psllw
>
>    %0 = shl %shifttype8i8 %a , %b
>    ret %shifttype8i8 %0
> @@ -249,7 +249,7 @@ entry:
>    ; SSE2-LABEL: shift2i16const
>    ; SSE2: cost of 1 {{.*}} shl
>    ; SSE2-CODEGEN-LABEL: shift2i16const
> -  ; SSE2-CODEGEN: psllq $3
> +  ; SSE2-CODEGEN: psllw $3
>
>    %0 = shl %shifttypec %a , <i16 3, i16 3>
>    ret %shifttypec %0
> @@ -261,7 +261,7 @@ entry:
>    ; SSE2-LABEL: shift4i16const
>    ; SSE2: cost of 1 {{.*}} shl
>    ; SSE2-CODEGEN-LABEL: shift4i16const
> -  ; SSE2-CODEGEN: pslld $3
> +  ; SSE2-CODEGEN: psllw $3
>
>    %0 = shl %shifttypec4i16 %a , <i16 3, i16 3, i16 3, i16 3>
>    ret %shifttypec4i16 %0
> @@ -322,7 +322,7 @@ entry:
>    ; SSE2-LABEL: shift2i32c
>    ; SSE2: cost of 1 {{.*}} shl
>    ; SSE2-CODEGEN-LABEL: shift2i32c
> -  ; SSE2-CODEGEN: psllq $3
> +  ; SSE2-CODEGEN: pslld $3
>
>    %0 = shl %shifttypec2i32 %a , <i32 3, i32 3>
>    ret %shifttypec2i32 %0
> @@ -461,9 +461,9 @@ entry:
>  define %shifttypec2i8 @shift2i8c(%shifttypec2i8 %a, %shifttypec2i8 %b) {
>  entry:
>    ; SSE2-LABEL: shift2i8c
> -  ; SSE2: cost of 1 {{.*}} shl
> +  ; SSE2: cost of 2 {{.*}} shl
>    ; SSE2-CODEGEN-LABEL: shift2i8c
> -  ; SSE2-CODEGEN: psllq $3
> +  ; SSE2-CODEGEN: psllw $3
>
>    %0 = shl %shifttypec2i8 %a , <i8 3, i8 3>
>    ret %shifttypec2i8 %0
> @@ -473,9 +473,9 @@ entry:
>  define %shifttypec4i8 @shift4i8c(%shifttypec4i8 %a, %shifttypec4i8 %b) {
>  entry:
>    ; SSE2-LABEL: shift4i8c
> -  ; SSE2: cost of 1 {{.*}} shl
> +  ; SSE2: cost of 2 {{.*}} shl
>    ; SSE2-CODEGEN-LABEL: shift4i8c
> -  ; SSE2-CODEGEN: pslld $3
> +  ; SSE2-CODEGEN: psllw $3
>
>    %0 = shl %shifttypec4i8 %a , <i8 3, i8 3, i8 3, i8 3>
>    ret %shifttypec4i8 %0
> @@ -485,7 +485,7 @@ entry:
>  define %shifttypec8i8 @shift8i8c(%shifttypec8i8 %a, %shifttypec8i8 %b) {
>  entry:
>    ; SSE2-LABEL: shift8i8c
> -  ; SSE2: cost of 1 {{.*}} shl
> +  ; SSE2: cost of 2 {{.*}} shl
>    ; SSE2-CODEGEN-LABEL: shift8i8c
>    ; SSE2-CODEGEN: psllw $3
>
>
> Modified: llvm/trunk/test/Analysis/CostModel/X86/uitofp.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Analysis/CostModel/X86/uitofp.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/Analysis/CostModel/X86/uitofp.ll (original)
> +++ llvm/trunk/test/Analysis/CostModel/X86/uitofp.ll Wed Aug  7 09:24:26 2019
> @@ -13,9 +13,9 @@
>  define i32 @uitofp_i8_double() {
>  ; SSE-LABEL: 'uitofp_i8_double'
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %cvt_i8_f64 = uitofp i8 undef to double
> -; SSE-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %cvt_v2i8_v2f64 = uitofp <2 x i8> undef to <2 x double>
> -; SSE-NEXT:  Cost Model: Found an estimated cost of 40 for instruction: %cvt_v4i8_v4f64 = uitofp <4 x i8> undef to <4 x double>
> -; SSE-NEXT:  Cost Model: Found an estimated cost of 80 for instruction: %cvt_v8i8_v8f64 = uitofp <8 x i8> undef to <8 x double>
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 160 for instruction: %cvt_v2i8_v2f64 = uitofp <2 x i8> undef to <2 x double>
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 160 for instruction: %cvt_v4i8_v4f64 = uitofp <4 x i8> undef to <4 x double>
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 160 for instruction: %cvt_v8i8_v8f64 = uitofp <8 x i8> undef to <8 x double>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
>  ; AVX-LABEL: 'uitofp_i8_double'
> @@ -49,8 +49,8 @@ define i32 @uitofp_i8_double() {
>  define i32 @uitofp_i16_double() {
>  ; SSE-LABEL: 'uitofp_i16_double'
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %cvt_i16_f64 = uitofp i16 undef to double
> -; SSE-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %cvt_v2i16_v2f64 = uitofp <2 x i16> undef to <2 x double>
> -; SSE-NEXT:  Cost Model: Found an estimated cost of 40 for instruction: %cvt_v4i16_v4f64 = uitofp <4 x i16> undef to <4 x double>
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 80 for instruction: %cvt_v2i16_v2f64 = uitofp <2 x i16> undef to <2 x double>
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 80 for instruction: %cvt_v4i16_v4f64 = uitofp <4 x i16> undef to <4 x double>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 80 for instruction: %cvt_v8i16_v8f64 = uitofp <8 x i16> undef to <8 x double>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
> @@ -85,7 +85,7 @@ define i32 @uitofp_i16_double() {
>  define i32 @uitofp_i32_double() {
>  ; SSE-LABEL: 'uitofp_i32_double'
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %cvt_i32_f64 = uitofp i32 undef to double
> -; SSE-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %cvt_v2i32_v2f64 = uitofp <2 x i32> undef to <2 x double>
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 40 for instruction: %cvt_v2i32_v2f64 = uitofp <2 x i32> undef to <2 x double>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 40 for instruction: %cvt_v4i32_v4f64 = uitofp <4 x i32> undef to <4 x double>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 80 for instruction: %cvt_v8i32_v8f64 = uitofp <8 x i32> undef to <8 x double>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
> @@ -165,7 +165,7 @@ define i32 @uitofp_i8_float() {
>  ; SSE-LABEL: 'uitofp_i8_float'
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %cvt_i8_f32 = uitofp i8 undef to float
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %cvt_v4i8_v4f32 = uitofp <4 x i8> undef to <4 x float>
> -; SSE-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %cvt_v8i8_v8f32 = uitofp <8 x i8> undef to <8 x float>
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %cvt_v8i8_v8f32 = uitofp <8 x i8> undef to <8 x float>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %cvt_v16i8_v16f32 = uitofp <16 x i8> undef to <16 x float>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>  ;
> @@ -200,7 +200,7 @@ define i32 @uitofp_i8_float() {
>  define i32 @uitofp_i16_float() {
>  ; SSE-LABEL: 'uitofp_i16_float'
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %cvt_i16_f32 = uitofp i16 undef to float
> -; SSE-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %cvt_v4i16_v4f32 = uitofp <4 x i16> undef to <4 x float>
> +; SSE-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %cvt_v4i16_v4f32 = uitofp <4 x i16> undef to <4 x float>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %cvt_v8i16_v8f32 = uitofp <8 x i16> undef to <8 x float>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 30 for instruction: %cvt_v16i16_v16f32 = uitofp <16 x i16> undef to <16 x float>
>  ; SSE-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 undef
>
> Modified: llvm/trunk/test/CodeGen/X86/2008-09-05-sinttofp-2xi32.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/2008-09-05-sinttofp-2xi32.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/2008-09-05-sinttofp-2xi32.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/2008-09-05-sinttofp-2xi32.ll Wed Aug  7 09:24:26 2019
> @@ -7,7 +7,6 @@
>  define <2 x double> @a(<2 x i32> %x) nounwind {
>  ; CHECK-LABEL: a:
>  ; CHECK:       # %bb.0: # %entry
> -; CHECK-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
>  ; CHECK-NEXT:    cvtdq2pd %xmm0, %xmm0
>  ; CHECK-NEXT:    retl
>  entry:
> @@ -19,7 +18,6 @@ define <2 x i32> @b(<2 x double> %x) nou
>  ; CHECK-LABEL: b:
>  ; CHECK:       # %bb.0: # %entry
>  ; CHECK-NEXT:    cvttpd2dq %xmm0, %xmm0
> -; CHECK-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
>  ; CHECK-NEXT:    retl
>  entry:
>    %y = fptosi <2 x double> %x to <2 x i32>
>
> Modified: llvm/trunk/test/CodeGen/X86/2009-06-05-VZextByteShort.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/2009-06-05-VZextByteShort.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/2009-06-05-VZextByteShort.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/2009-06-05-VZextByteShort.ll Wed Aug  7 09:24:26 2019
> @@ -7,6 +7,7 @@ define <4 x i16> @a(i32* %x1) nounwind {
>  ; CHECK-NEXT:    movl {{[0-9]+}}(%esp), %eax
>  ; CHECK-NEXT:    movl (%eax), %eax
>  ; CHECK-NEXT:    shrl %eax
> +; CHECK-NEXT:    movzwl %ax, %eax
>  ; CHECK-NEXT:    movd %eax, %xmm0
>  ; CHECK-NEXT:    retl
>
> @@ -40,7 +41,7 @@ define <8 x i8> @c(i32* %x1) nounwind {
>  ; CHECK-NEXT:    movl {{[0-9]+}}(%esp), %eax
>  ; CHECK-NEXT:    movl (%eax), %eax
>  ; CHECK-NEXT:    shrl %eax
> -; CHECK-NEXT:    movzwl %ax, %eax
> +; CHECK-NEXT:    movzbl %al, %eax
>  ; CHECK-NEXT:    movd %eax, %xmm0
>  ; CHECK-NEXT:    retl
>
>
> Modified: llvm/trunk/test/CodeGen/X86/2011-10-19-LegelizeLoad.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/2011-10-19-LegelizeLoad.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/2011-10-19-LegelizeLoad.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/2011-10-19-LegelizeLoad.ll Wed Aug  7 09:24:26 2019
> @@ -17,19 +17,23 @@ target triple = "x86_64-unknown-linux-gn
>  define i32 @main() nounwind uwtable {
>  ; CHECK-LABEL: main:
>  ; CHECK:       # %bb.0: # %entry
> -; CHECK-NEXT:    pmovsxbq {{.*}}(%rip), %xmm0
> -; CHECK-NEXT:    pmovsxbq {{.*}}(%rip), %xmm1
> -; CHECK-NEXT:    pextrq $1, %xmm1, %rax
> -; CHECK-NEXT:    pextrq $1, %xmm0, %rcx
> -; CHECK-NEXT:    cqto
> -; CHECK-NEXT:    idivq %rcx
> -; CHECK-NEXT:    movq %rax, %xmm2
> -; CHECK-NEXT:    movq %xmm1, %rax
> -; CHECK-NEXT:    movq %xmm0, %rcx
> -; CHECK-NEXT:    cqto
> -; CHECK-NEXT:    idivq %rcx
> -; CHECK-NEXT:    movq %rax, %xmm0
> -; CHECK-NEXT:    punpcklbw {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3],xmm0[4],xmm2[4],xmm0[5],xmm2[5],xmm0[6],xmm2[6],xmm0[7],xmm2[7]
> +; CHECK-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> +; CHECK-NEXT:    pextrb $1, %xmm0, %eax
> +; CHECK-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
> +; CHECK-NEXT:    pextrb $1, %xmm1, %ecx
> +; CHECK-NEXT:    # kill: def $al killed $al killed $eax
> +; CHECK-NEXT:    cbtw
> +; CHECK-NEXT:    idivb %cl
> +; CHECK-NEXT:    movl %eax, %ecx
> +; CHECK-NEXT:    pextrb $0, %xmm0, %eax
> +; CHECK-NEXT:    # kill: def $al killed $al killed $eax
> +; CHECK-NEXT:    cbtw
> +; CHECK-NEXT:    pextrb $0, %xmm1, %edx
> +; CHECK-NEXT:    idivb %dl
> +; CHECK-NEXT:    movzbl %cl, %ecx
> +; CHECK-NEXT:    movzbl %al, %eax
> +; CHECK-NEXT:    movd %eax, %xmm0
> +; CHECK-NEXT:    pinsrb $1, %ecx, %xmm0
>  ; CHECK-NEXT:    pextrw $0, %xmm0, {{.*}}(%rip)
>  ; CHECK-NEXT:    xorl %eax, %eax
>  ; CHECK-NEXT:    retq
>
> Modified: llvm/trunk/test/CodeGen/X86/2011-12-28-vselecti8.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/2011-12-28-vselecti8.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/2011-12-28-vselecti8.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/2011-12-28-vselecti8.ll Wed Aug  7 09:24:26 2019
> @@ -18,10 +18,11 @@ target triple = "x86_64-apple-darwin11.2
>  define void @foo8(float* nocapture %RET) nounwind {
>  ; CHECK-LABEL: foo8:
>  ; CHECK:       ## %bb.0: ## %allocas
> -; CHECK-NEXT:    movaps {{.*#+}} xmm0 = [1.0E+2,2.0E+0,1.0E+2,4.0E+0]
> -; CHECK-NEXT:    movaps {{.*#+}} xmm1 = [1.0E+2,6.0E+0,1.0E+2,8.0E+0]
> -; CHECK-NEXT:    movups %xmm1, 16(%rdi)
> -; CHECK-NEXT:    movups %xmm0, (%rdi)
> +; CHECK-NEXT:    pmovzxbd {{.*#+}} xmm0 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
> +; CHECK-NEXT:    cvtdq2ps %xmm0, %xmm0
> +; CHECK-NEXT:    movaps {{.*#+}} xmm1 = [1.0E+2,2.0E+0,1.0E+2,4.0E+0]
> +; CHECK-NEXT:    movups %xmm1, (%rdi)
> +; CHECK-NEXT:    movups %xmm0, 16(%rdi)
>  ; CHECK-NEXT:    retq
>  allocas:
>    %resultvec.i = select <8 x i1> <i1 false, i1 true, i1 false, i1 true, i1 false, i1 true, i1 false, i1 true>, <8 x i8> <i8 1, i8 2, i8 3, i8 4, i8 5, i8 6, i8 7, i8 8>, <8 x i8> <i8 100, i8 100, i8 100, i8 100, i8 100, i8 100, i8 100, i8 100>
>
> Modified: llvm/trunk/test/CodeGen/X86/2011-12-8-bitcastintprom.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/2011-12-8-bitcastintprom.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/2011-12-8-bitcastintprom.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/2011-12-8-bitcastintprom.ll Wed Aug  7 09:24:26 2019
> @@ -6,16 +6,12 @@
>  define void @prom_bug(<4 x i8> %t, i16* %p) {
>  ; SSE2-LABEL: prom_bug:
>  ; SSE2:       ## %bb.0:
> -; SSE2-NEXT:    pand {{.*}}(%rip), %xmm0
> -; SSE2-NEXT:    packuswb %xmm0, %xmm0
> -; SSE2-NEXT:    packuswb %xmm0, %xmm0
> -; SSE2-NEXT:    pextrw $0, %xmm0, %eax
> +; SSE2-NEXT:    movd %xmm0, %eax
>  ; SSE2-NEXT:    movw %ax, (%rdi)
>  ; SSE2-NEXT:    retq
>  ;
>  ; SSE41-LABEL: prom_bug:
>  ; SSE41:       ## %bb.0:
> -; SSE41-NEXT:    pshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
>  ; SSE41-NEXT:    pextrw $0, %xmm0, (%rdi)
>  ; SSE41-NEXT:    retq
>    %r = bitcast <4 x i8> %t to <2 x i16>
>
> Modified: llvm/trunk/test/CodeGen/X86/2012-01-18-vbitcast.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/2012-01-18-vbitcast.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/2012-01-18-vbitcast.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/2012-01-18-vbitcast.ll Wed Aug  7 09:24:26 2019
> @@ -4,9 +4,8 @@
>  define <2 x i32> @vcast(<2 x float> %a, <2 x float> %b) {
>  ; CHECK-LABEL: vcast:
>  ; CHECK:       # %bb.0:
> -; CHECK-NEXT:    pmovzxdq {{.*#+}} xmm0 = mem[0],zero,mem[1],zero
> -; CHECK-NEXT:    pmovzxdq {{.*#+}} xmm1 = mem[0],zero,mem[1],zero
> -; CHECK-NEXT:    psubq %xmm1, %xmm0
> +; CHECK-NEXT:    movdqa (%rcx), %xmm0
> +; CHECK-NEXT:    psubd (%rdx), %xmm0
>  ; CHECK-NEXT:    retq
>    %af = bitcast <2 x float> %a to <2 x i32>
>    %bf = bitcast <2 x float> %b to <2 x i32>
>
> Modified: llvm/trunk/test/CodeGen/X86/2012-03-15-build_vector_wl.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/2012-03-15-build_vector_wl.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/2012-03-15-build_vector_wl.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/2012-03-15-build_vector_wl.ll Wed Aug  7 09:24:26 2019
> @@ -4,7 +4,6 @@
>  define <4 x i8> @build_vector_again(<16 x i8> %in) nounwind readnone {
>  ; CHECK-LABEL: build_vector_again:
>  ; CHECK:       ## %bb.0: ## %entry
> -; CHECK-NEXT:    vpmovzxbd {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero
>  ; CHECK-NEXT:    retq
>  entry:
>    %out = shufflevector <16 x i8> %in, <16 x i8> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
>
> Modified: llvm/trunk/test/CodeGen/X86/2012-07-10-extload64.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/2012-07-10-extload64.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/2012-07-10-extload64.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/2012-07-10-extload64.ll Wed Aug  7 09:24:26 2019
> @@ -33,7 +33,7 @@ define <2 x i32> @load_64(<2 x i32>* %pt
>  ; CHECK-LABEL: load_64:
>  ; CHECK:       # %bb.0: # %BB
>  ; CHECK-NEXT:    movl {{[0-9]+}}(%esp), %eax
> -; CHECK-NEXT:    pmovzxdq {{.*#+}} xmm0 = mem[0],zero,mem[1],zero
> +; CHECK-NEXT:    movsd {{.*#+}} xmm0 = mem[0],zero
>  ; CHECK-NEXT:    retl
>  BB:
>    %t = load <2 x i32>, <2 x i32>* %ptr
>
> Modified: llvm/trunk/test/CodeGen/X86/3dnow-intrinsics.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/3dnow-intrinsics.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/3dnow-intrinsics.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/3dnow-intrinsics.ll Wed Aug  7 09:24:26 2019
> @@ -14,8 +14,7 @@ define <8 x i8> @test_pavgusb(x86_mmx %a
>  ; X64:       # %bb.0: # %entry
>  ; X64-NEXT:    pavgusb %mm1, %mm0
>  ; X64-NEXT:    movq %mm0, -{{[0-9]+}}(%rsp)
> -; X64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> -; X64-NEXT:    punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
> +; X64-NEXT:    movaps -{{[0-9]+}}(%rsp), %xmm0
>  ; X64-NEXT:    retq
>  entry:
>    %0 = bitcast x86_mmx %a.coerce to <8 x i8>
> @@ -52,8 +51,7 @@ define <2 x i32> @test_pf2id(<2 x float>
>  ; X64-NEXT:    movdq2q %xmm0, %mm0
>  ; X64-NEXT:    pf2id %mm0, %mm0
>  ; X64-NEXT:    movq %mm0, -{{[0-9]+}}(%rsp)
> -; X64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> -; X64-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
> +; X64-NEXT:    movaps -{{[0-9]+}}(%rsp), %xmm0
>  ; X64-NEXT:    retq
>  entry:
>    %0 = bitcast <2 x float> %a to x86_mmx
> @@ -169,8 +167,7 @@ define <2 x i32> @test_pfcmpeq(<2 x floa
>  ; X64-NEXT:    movdq2q %xmm0, %mm1
>  ; X64-NEXT:    pfcmpeq %mm0, %mm1
>  ; X64-NEXT:    movq %mm1, -{{[0-9]+}}(%rsp)
> -; X64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> -; X64-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
> +; X64-NEXT:    movaps -{{[0-9]+}}(%rsp), %xmm0
>  ; X64-NEXT:    retq
>  entry:
>    %0 = bitcast <2 x float> %a to x86_mmx
> @@ -209,8 +206,7 @@ define <2 x i32> @test_pfcmpge(<2 x floa
>  ; X64-NEXT:    movdq2q %xmm0, %mm1
>  ; X64-NEXT:    pfcmpge %mm0, %mm1
>  ; X64-NEXT:    movq %mm1, -{{[0-9]+}}(%rsp)
> -; X64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> -; X64-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
> +; X64-NEXT:    movaps -{{[0-9]+}}(%rsp), %xmm0
>  ; X64-NEXT:    retq
>  entry:
>    %0 = bitcast <2 x float> %a to x86_mmx
> @@ -249,8 +245,7 @@ define <2 x i32> @test_pfcmpgt(<2 x floa
>  ; X64-NEXT:    movdq2q %xmm0, %mm1
>  ; X64-NEXT:    pfcmpgt %mm0, %mm1
>  ; X64-NEXT:    movq %mm1, -{{[0-9]+}}(%rsp)
> -; X64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> -; X64-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
> +; X64-NEXT:    movaps -{{[0-9]+}}(%rsp), %xmm0
>  ; X64-NEXT:    retq
>  entry:
>    %0 = bitcast <2 x float> %a to x86_mmx
> @@ -685,8 +680,7 @@ define <4 x i16> @test_pmulhrw(x86_mmx %
>  ; X64:       # %bb.0: # %entry
>  ; X64-NEXT:    pmulhrw %mm1, %mm0
>  ; X64-NEXT:    movq %mm0, -{{[0-9]+}}(%rsp)
> -; X64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> -; X64-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3]
> +; X64-NEXT:    movaps -{{[0-9]+}}(%rsp), %xmm0
>  ; X64-NEXT:    retq
>  entry:
>    %0 = bitcast x86_mmx %a.coerce to <4 x i16>
> @@ -723,8 +717,7 @@ define <2 x i32> @test_pf2iw(<2 x float>
>  ; X64-NEXT:    movdq2q %xmm0, %mm0
>  ; X64-NEXT:    pf2iw %mm0, %mm0
>  ; X64-NEXT:    movq %mm0, -{{[0-9]+}}(%rsp)
> -; X64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> -; X64-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
> +; X64-NEXT:    movaps -{{[0-9]+}}(%rsp), %xmm0
>  ; X64-NEXT:    retq
>  entry:
>    %0 = bitcast <2 x float> %a to x86_mmx
> @@ -896,12 +889,10 @@ define <2 x i32> @test_pswapdsi(<2 x i32
>  ;
>  ; X64-LABEL: test_pswapdsi:
>  ; X64:       # %bb.0: # %entry
> -; X64-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
>  ; X64-NEXT:    movdq2q %xmm0, %mm0
>  ; X64-NEXT:    pswapd %mm0, %mm0 # mm0 = mm0[1,0]
>  ; X64-NEXT:    movq %mm0, -{{[0-9]+}}(%rsp)
> -; X64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> -; X64-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
> +; X64-NEXT:    movaps -{{[0-9]+}}(%rsp), %xmm0
>  ; X64-NEXT:    retq
>  entry:
>    %0 = bitcast <2 x i32> %a to x86_mmx
>
> Modified: llvm/trunk/test/CodeGen/X86/4char-promote.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/4char-promote.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/4char-promote.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/4char-promote.ll Wed Aug  7 09:24:26 2019
> @@ -7,8 +7,11 @@ target triple = "x86_64-apple-darwin"
>  define <4 x i8> @foo(<4 x i8> %x, <4 x i8> %y) {
>  ; CHECK-LABEL: foo:
>  ; CHECK:       ## %bb.0: ## %entry
> -; CHECK-NEXT:    pmulld %xmm0, %xmm1
> -; CHECK-NEXT:    paddd %xmm1, %xmm0
> +; CHECK-NEXT:    pmovzxbw {{.*#+}} xmm1 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero
> +; CHECK-NEXT:    pmovzxbw {{.*#+}} xmm2 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
> +; CHECK-NEXT:    pmullw %xmm1, %xmm2
> +; CHECK-NEXT:    pshufb {{.*#+}} xmm2 = xmm2[0,2,4,6,u,u,u,u,u,u,u,u,u,u,u,u]
> +; CHECK-NEXT:    paddb %xmm2, %xmm0
>  ; CHECK-NEXT:    retq
>  entry:
>   %binop = mul <4 x i8> %x, %y
>
> Modified: llvm/trunk/test/CodeGen/X86/and-load-fold.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/and-load-fold.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/and-load-fold.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/and-load-fold.ll Wed Aug  7 09:24:26 2019
> @@ -6,10 +6,8 @@
>  define i8 @foo(<4 x i8>* %V) {
>  ; CHECK-LABEL: foo:
>  ; CHECK:       # %bb.0:
> -; CHECK-NEXT:    movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
> -; CHECK-NEXT:    pextrw $1, %xmm0, %eax
> +; CHECK-NEXT:    movb 2(%rdi), %al
>  ; CHECK-NEXT:    andb $95, %al
> -; CHECK-NEXT:    # kill: def $al killed $al killed $eax
>  ; CHECK-NEXT:    retq
>    %Vp = bitcast <4 x i8>* %V to <3 x i8>*
>    %V3i8 = load <3 x i8>, <3 x i8>* %Vp, align 4
>
> Modified: llvm/trunk/test/CodeGen/X86/atomic-unordered.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/atomic-unordered.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/atomic-unordered.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/atomic-unordered.ll Wed Aug  7 09:24:26 2019
> @@ -460,7 +460,7 @@ define void @vec_store(i32* %p0, <2 x i3
>  ; CHECK-O0-LABEL: vec_store:
>  ; CHECK-O0:       # %bb.0:
>  ; CHECK-O0-NEXT:    vmovd %xmm0, %eax
> -; CHECK-O0-NEXT:    vpextrd $2, %xmm0, %ecx
> +; CHECK-O0-NEXT:    vpextrd $1, %xmm0, %ecx
>  ; CHECK-O0-NEXT:    movl %eax, (%rdi)
>  ; CHECK-O0-NEXT:    movl %ecx, 4(%rdi)
>  ; CHECK-O0-NEXT:    retq
> @@ -468,7 +468,7 @@ define void @vec_store(i32* %p0, <2 x i3
>  ; CHECK-O3-LABEL: vec_store:
>  ; CHECK-O3:       # %bb.0:
>  ; CHECK-O3-NEXT:    vmovd %xmm0, %eax
> -; CHECK-O3-NEXT:    vpextrd $2, %xmm0, %ecx
> +; CHECK-O3-NEXT:    vpextrd $1, %xmm0, %ecx
>  ; CHECK-O3-NEXT:    movl %eax, (%rdi)
>  ; CHECK-O3-NEXT:    movl %ecx, 4(%rdi)
>  ; CHECK-O3-NEXT:    retq
> @@ -485,7 +485,7 @@ define void @vec_store_unaligned(i32* %p
>  ; CHECK-O0-LABEL: vec_store_unaligned:
>  ; CHECK-O0:       # %bb.0:
>  ; CHECK-O0-NEXT:    vmovd %xmm0, %eax
> -; CHECK-O0-NEXT:    vpextrd $2, %xmm0, %ecx
> +; CHECK-O0-NEXT:    vpextrd $1, %xmm0, %ecx
>  ; CHECK-O0-NEXT:    movl %eax, (%rdi)
>  ; CHECK-O0-NEXT:    movl %ecx, 4(%rdi)
>  ; CHECK-O0-NEXT:    retq
> @@ -493,7 +493,7 @@ define void @vec_store_unaligned(i32* %p
>  ; CHECK-O3-LABEL: vec_store_unaligned:
>  ; CHECK-O3:       # %bb.0:
>  ; CHECK-O3-NEXT:    vmovd %xmm0, %eax
> -; CHECK-O3-NEXT:    vpextrd $2, %xmm0, %ecx
> +; CHECK-O3-NEXT:    vpextrd $1, %xmm0, %ecx
>  ; CHECK-O3-NEXT:    movl %eax, (%rdi)
>  ; CHECK-O3-NEXT:    movl %ecx, 4(%rdi)
>  ; CHECK-O3-NEXT:    retq
>
> Modified: llvm/trunk/test/CodeGen/X86/avg.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/avg.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/avg.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/avg.ll Wed Aug  7 09:24:26 2019
> @@ -378,63 +378,65 @@ define void @avg_v48i8(<48 x i8>* %a, <4
>  ; AVX2-LABEL: avg_v48i8:
>  ; AVX2:       # %bb.0:
>  ; AVX2-NEXT:    vmovdqa (%rdi), %xmm0
> -; AVX2-NEXT:    vmovdqa 32(%rdi), %xmm1
> -; AVX2-NEXT:    vpshufd {{.*#+}} xmm2 = xmm0[2,3,0,1]
> -; AVX2-NEXT:    vpmovzxbd {{.*#+}} ymm2 = xmm2[0],zero,zero,zero,xmm2[1],zero,zero,zero,xmm2[2],zero,zero,zero,xmm2[3],zero,zero,zero,xmm2[4],zero,zero,zero,xmm2[5],zero,zero,zero,xmm2[6],zero,zero,zero,xmm2[7],zero,zero,zero
> -; AVX2-NEXT:    vpmovzxbd {{.*#+}} ymm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero
> -; AVX2-NEXT:    vpbroadcastq 24(%rdi), %xmm3
> +; AVX2-NEXT:    vmovdqa 16(%rdi), %xmm1
> +; AVX2-NEXT:    vmovdqa 32(%rdi), %xmm2
> +; AVX2-NEXT:    vpshufd {{.*#+}} xmm3 = xmm0[2,3,0,1]
>  ; AVX2-NEXT:    vpmovzxbd {{.*#+}} ymm3 = xmm3[0],zero,zero,zero,xmm3[1],zero,zero,zero,xmm3[2],zero,zero,zero,xmm3[3],zero,zero,zero,xmm3[4],zero,zero,zero,xmm3[5],zero,zero,zero,xmm3[6],zero,zero,zero,xmm3[7],zero,zero,zero
> -; AVX2-NEXT:    vpmovzxbd {{.*#+}} ymm4 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
> -; AVX2-NEXT:    vpshufd {{.*#+}} xmm5 = xmm1[2,3,0,1]
> -; AVX2-NEXT:    vpmovzxbd {{.*#+}} ymm5 = xmm5[0],zero,zero,zero,xmm5[1],zero,zero,zero,xmm5[2],zero,zero,zero,xmm5[3],zero,zero,zero,xmm5[4],zero,zero,zero,xmm5[5],zero,zero,zero,xmm5[6],zero,zero,zero,xmm5[7],zero,zero,zero
> -; AVX2-NEXT:    vpmovzxbd {{.*#+}} ymm8 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero,xmm1[4],zero,zero,zero,xmm1[5],zero,zero,zero,xmm1[6],zero,zero,zero,xmm1[7],zero,zero,zero
> -; AVX2-NEXT:    vmovdqa (%rsi), %xmm6
> -; AVX2-NEXT:    vmovdqa 32(%rsi), %xmm7
> -; AVX2-NEXT:    vpshufd {{.*#+}} xmm1 = xmm6[2,3,0,1]
> -; AVX2-NEXT:    vpmovzxbd {{.*#+}} ymm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero,xmm1[4],zero,zero,zero,xmm1[5],zero,zero,zero,xmm1[6],zero,zero,zero,xmm1[7],zero,zero,zero
> -; AVX2-NEXT:    vpaddd %ymm1, %ymm2, %ymm1
> -; AVX2-NEXT:    vpmovzxbd {{.*#+}} ymm2 = xmm6[0],zero,zero,zero,xmm6[1],zero,zero,zero,xmm6[2],zero,zero,zero,xmm6[3],zero,zero,zero,xmm6[4],zero,zero,zero,xmm6[5],zero,zero,zero,xmm6[6],zero,zero,zero,xmm6[7],zero,zero,zero
> -; AVX2-NEXT:    vpaddd %ymm2, %ymm0, %ymm0
> -; AVX2-NEXT:    vpbroadcastq 24(%rsi), %xmm2
> -; AVX2-NEXT:    vpmovzxbd {{.*#+}} ymm2 = xmm2[0],zero,zero,zero,xmm2[1],zero,zero,zero,xmm2[2],zero,zero,zero,xmm2[3],zero,zero,zero,xmm2[4],zero,zero,zero,xmm2[5],zero,zero,zero,xmm2[6],zero,zero,zero,xmm2[7],zero,zero,zero
> -; AVX2-NEXT:    vpaddd %ymm2, %ymm3, %ymm2
> -; AVX2-NEXT:    vpmovzxbd {{.*#+}} ymm3 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
> -; AVX2-NEXT:    vpaddd %ymm3, %ymm4, %ymm3
> -; AVX2-NEXT:    vpshufd {{.*#+}} xmm4 = xmm7[2,3,0,1]
> +; AVX2-NEXT:    vpmovzxbd {{.*#+}} ymm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero
> +; AVX2-NEXT:    vpshufd {{.*#+}} xmm4 = xmm1[2,3,0,1]
>  ; AVX2-NEXT:    vpmovzxbd {{.*#+}} ymm4 = xmm4[0],zero,zero,zero,xmm4[1],zero,zero,zero,xmm4[2],zero,zero,zero,xmm4[3],zero,zero,zero,xmm4[4],zero,zero,zero,xmm4[5],zero,zero,zero,xmm4[6],zero,zero,zero,xmm4[7],zero,zero,zero
> -; AVX2-NEXT:    vpaddd %ymm4, %ymm5, %ymm4
> +; AVX2-NEXT:    vpmovzxbd {{.*#+}} ymm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero,xmm1[4],zero,zero,zero,xmm1[5],zero,zero,zero,xmm1[6],zero,zero,zero,xmm1[7],zero,zero,zero
> +; AVX2-NEXT:    vpshufd {{.*#+}} xmm5 = xmm2[2,3,0,1]
> +; AVX2-NEXT:    vpmovzxbd {{.*#+}} ymm9 = xmm5[0],zero,zero,zero,xmm5[1],zero,zero,zero,xmm5[2],zero,zero,zero,xmm5[3],zero,zero,zero,xmm5[4],zero,zero,zero,xmm5[5],zero,zero,zero,xmm5[6],zero,zero,zero,xmm5[7],zero,zero,zero
> +; AVX2-NEXT:    vpmovzxbd {{.*#+}} ymm8 = xmm2[0],zero,zero,zero,xmm2[1],zero,zero,zero,xmm2[2],zero,zero,zero,xmm2[3],zero,zero,zero,xmm2[4],zero,zero,zero,xmm2[5],zero,zero,zero,xmm2[6],zero,zero,zero,xmm2[7],zero,zero,zero
> +; AVX2-NEXT:    vmovdqa (%rsi), %xmm6
> +; AVX2-NEXT:    vmovdqa 16(%rsi), %xmm7
> +; AVX2-NEXT:    vmovdqa 32(%rsi), %xmm2
> +; AVX2-NEXT:    vpshufd {{.*#+}} xmm5 = xmm6[2,3,0,1]
> +; AVX2-NEXT:    vpmovzxbd {{.*#+}} ymm5 = xmm5[0],zero,zero,zero,xmm5[1],zero,zero,zero,xmm5[2],zero,zero,zero,xmm5[3],zero,zero,zero,xmm5[4],zero,zero,zero,xmm5[5],zero,zero,zero,xmm5[6],zero,zero,zero,xmm5[7],zero,zero,zero
> +; AVX2-NEXT:    vpaddd %ymm5, %ymm3, %ymm3
> +; AVX2-NEXT:    vpmovzxbd {{.*#+}} ymm5 = xmm6[0],zero,zero,zero,xmm6[1],zero,zero,zero,xmm6[2],zero,zero,zero,xmm6[3],zero,zero,zero,xmm6[4],zero,zero,zero,xmm6[5],zero,zero,zero,xmm6[6],zero,zero,zero,xmm6[7],zero,zero,zero
> +; AVX2-NEXT:    vpaddd %ymm5, %ymm0, %ymm0
> +; AVX2-NEXT:    vpshufd {{.*#+}} xmm5 = xmm7[2,3,0,1]
> +; AVX2-NEXT:    vpmovzxbd {{.*#+}} ymm5 = xmm5[0],zero,zero,zero,xmm5[1],zero,zero,zero,xmm5[2],zero,zero,zero,xmm5[3],zero,zero,zero,xmm5[4],zero,zero,zero,xmm5[5],zero,zero,zero,xmm5[6],zero,zero,zero,xmm5[7],zero,zero,zero
> +; AVX2-NEXT:    vpaddd %ymm5, %ymm4, %ymm4
>  ; AVX2-NEXT:    vpmovzxbd {{.*#+}} ymm5 = xmm7[0],zero,zero,zero,xmm7[1],zero,zero,zero,xmm7[2],zero,zero,zero,xmm7[3],zero,zero,zero,xmm7[4],zero,zero,zero,xmm7[5],zero,zero,zero,xmm7[6],zero,zero,zero,xmm7[7],zero,zero,zero
> -; AVX2-NEXT:    vpaddd %ymm5, %ymm8, %ymm5
> +; AVX2-NEXT:    vpaddd %ymm5, %ymm1, %ymm1
> +; AVX2-NEXT:    vpshufd {{.*#+}} xmm5 = xmm2[2,3,0,1]
> +; AVX2-NEXT:    vpmovzxbd {{.*#+}} ymm5 = xmm5[0],zero,zero,zero,xmm5[1],zero,zero,zero,xmm5[2],zero,zero,zero,xmm5[3],zero,zero,zero,xmm5[4],zero,zero,zero,xmm5[5],zero,zero,zero,xmm5[6],zero,zero,zero,xmm5[7],zero,zero,zero
> +; AVX2-NEXT:    vpaddd %ymm5, %ymm9, %ymm5
> +; AVX2-NEXT:    vpmovzxbd {{.*#+}} ymm2 = xmm2[0],zero,zero,zero,xmm2[1],zero,zero,zero,xmm2[2],zero,zero,zero,xmm2[3],zero,zero,zero,xmm2[4],zero,zero,zero,xmm2[5],zero,zero,zero,xmm2[6],zero,zero,zero,xmm2[7],zero,zero,zero
> +; AVX2-NEXT:    vpaddd %ymm2, %ymm8, %ymm2
>  ; AVX2-NEXT:    vpcmpeqd %ymm6, %ymm6, %ymm6
> -; AVX2-NEXT:    vpsubd %ymm6, %ymm1, %ymm1
> -; AVX2-NEXT:    vpsubd %ymm6, %ymm0, %ymm0
> -; AVX2-NEXT:    vpsubd %ymm6, %ymm2, %ymm2
>  ; AVX2-NEXT:    vpsubd %ymm6, %ymm3, %ymm3
> +; AVX2-NEXT:    vpsubd %ymm6, %ymm0, %ymm0
>  ; AVX2-NEXT:    vpsubd %ymm6, %ymm4, %ymm4
> +; AVX2-NEXT:    vpsubd %ymm6, %ymm1, %ymm1
>  ; AVX2-NEXT:    vpsubd %ymm6, %ymm5, %ymm5
> +; AVX2-NEXT:    vpsubd %ymm6, %ymm2, %ymm2
> +; AVX2-NEXT:    vpsrld $1, %ymm2, %ymm2
>  ; AVX2-NEXT:    vpsrld $1, %ymm5, %ymm5
> +; AVX2-NEXT:    vpsrld $1, %ymm1, %ymm1
>  ; AVX2-NEXT:    vpsrld $1, %ymm4, %ymm4
> -; AVX2-NEXT:    vpsrld $1, %ymm3, %ymm3
> -; AVX2-NEXT:    vpsrld $1, %ymm2, %ymm2
>  ; AVX2-NEXT:    vpsrld $1, %ymm0, %ymm0
> -; AVX2-NEXT:    vpsrld $1, %ymm1, %ymm1
> -; AVX2-NEXT:    vperm2i128 {{.*#+}} ymm6 = ymm0[2,3],ymm1[2,3]
> -; AVX2-NEXT:    vinserti128 $1, %xmm1, %ymm0, %ymm0
> +; AVX2-NEXT:    vpsrld $1, %ymm3, %ymm3
> +; AVX2-NEXT:    vperm2i128 {{.*#+}} ymm6 = ymm0[2,3],ymm3[2,3]
> +; AVX2-NEXT:    vinserti128 $1, %xmm3, %ymm0, %ymm0
>  ; AVX2-NEXT:    vpackusdw %ymm6, %ymm0, %ymm0
> -; AVX2-NEXT:    vmovdqa {{.*#+}} ymm1 = [255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255]
> -; AVX2-NEXT:    vpand %ymm1, %ymm0, %ymm0
> -; AVX2-NEXT:    vperm2i128 {{.*#+}} ymm6 = ymm3[2,3],ymm2[2,3]
> -; AVX2-NEXT:    vinserti128 $1, %xmm2, %ymm3, %ymm2
> -; AVX2-NEXT:    vpackusdw %ymm6, %ymm2, %ymm2
> -; AVX2-NEXT:    vpand %ymm1, %ymm2, %ymm2
> -; AVX2-NEXT:    vinserti128 $1, %xmm2, %ymm0, %ymm3
> +; AVX2-NEXT:    vmovdqa {{.*#+}} ymm3 = [255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255]
> +; AVX2-NEXT:    vpand %ymm3, %ymm0, %ymm0
> +; AVX2-NEXT:    vperm2i128 {{.*#+}} ymm6 = ymm1[2,3],ymm4[2,3]
> +; AVX2-NEXT:    vinserti128 $1, %xmm4, %ymm1, %ymm1
> +; AVX2-NEXT:    vpackusdw %ymm6, %ymm1, %ymm1
> +; AVX2-NEXT:    vpand %ymm3, %ymm1, %ymm1
> +; AVX2-NEXT:    vinserti128 $1, %xmm1, %ymm0, %ymm4
>  ; AVX2-NEXT:    vextracti128 $1, %ymm0, %xmm0
> -; AVX2-NEXT:    vpblendd {{.*#+}} ymm0 = ymm0[0,1,2,3],ymm2[4,5,6,7]
> -; AVX2-NEXT:    vpackuswb %ymm0, %ymm3, %ymm0
> -; AVX2-NEXT:    vperm2i128 {{.*#+}} ymm2 = ymm5[2,3],ymm4[2,3]
> -; AVX2-NEXT:    vinserti128 $1, %xmm4, %ymm5, %ymm3
> -; AVX2-NEXT:    vpackusdw %ymm2, %ymm3, %ymm2
> -; AVX2-NEXT:    vpand %ymm1, %ymm2, %ymm1
> +; AVX2-NEXT:    vpblendd {{.*#+}} ymm0 = ymm0[0,1,2,3],ymm1[4,5,6,7]
> +; AVX2-NEXT:    vpackuswb %ymm0, %ymm4, %ymm0
> +; AVX2-NEXT:    vperm2i128 {{.*#+}} ymm1 = ymm2[2,3],ymm5[2,3]
> +; AVX2-NEXT:    vinserti128 $1, %xmm5, %ymm2, %ymm2
> +; AVX2-NEXT:    vpackusdw %ymm1, %ymm2, %ymm1
> +; AVX2-NEXT:    vpand %ymm3, %ymm1, %ymm1
>  ; AVX2-NEXT:    vextracti128 $1, %ymm1, %xmm2
>  ; AVX2-NEXT:    vpackuswb %xmm2, %xmm1, %xmm1
>  ; AVX2-NEXT:    vmovdqu %xmm1, (%rax)
> @@ -1897,118 +1899,178 @@ define void @not_avg_v16i8_wide_constant
>  ; SSE2-NEXT:    pushq %r13
>  ; SSE2-NEXT:    pushq %r12
>  ; SSE2-NEXT:    pushq %rbx
> -; SSE2-NEXT:    movaps (%rdi), %xmm0
> -; SSE2-NEXT:    movaps (%rsi), %xmm1
> -; SSE2-NEXT:    movaps %xmm0, -{{[0-9]+}}(%rsp)
> +; SSE2-NEXT:    movaps (%rdi), %xmm1
> +; SSE2-NEXT:    movaps (%rsi), %xmm0
> +; SSE2-NEXT:    movaps %xmm1, -{{[0-9]+}}(%rsp)
>  ; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %eax
>  ; SSE2-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> -; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %r13d
>  ; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %eax
>  ; SSE2-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
>  ; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %eax
>  ; SSE2-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> -; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %r14d
> -; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %r15d
> +; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %eax
> +; SSE2-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %eax
> +; SSE2-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %r13d
>  ; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %r12d
> +; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %r15d
>  ; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %r11d
>  ; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %r10d
>  ; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %r9d
> +; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %r8d
> +; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %edx
>  ; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %ecx
>  ; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %eax
> -; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %edi
> +; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %esi
> +; SSE2-NEXT:    movaps %xmm0, -{{[0-9]+}}(%rsp)
>  ; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %ebp
> +; SSE2-NEXT:    addq %r11, %rbp
> +; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %r14d
> +; SSE2-NEXT:    addq %r10, %r14
>  ; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %ebx
> +; SSE2-NEXT:    addq %r9, %rbx
> +; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %r11d
> +; SSE2-NEXT:    addq %r8, %r11
> +; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %r10d
> +; SSE2-NEXT:    addq %rdx, %r10
> +; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %r8d
> +; SSE2-NEXT:    addq %rcx, %r8
> +; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %edi
> +; SSE2-NEXT:    addq %rax, %rdi
>  ; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %edx
> -; SSE2-NEXT:    movaps %xmm1, -{{[0-9]+}}(%rsp)
> +; SSE2-NEXT:    addq %rsi, %rdx
>  ; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %esi
> -; SSE2-NEXT:    leal -1(%rdx,%rsi), %edx
> -; SSE2-NEXT:    movl %edx, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill
> -; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %edx
> -; SSE2-NEXT:    leal -1(%rbx,%rdx), %edx
> -; SSE2-NEXT:    movl %edx, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill
> -; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %edx
> -; SSE2-NEXT:    leal -1(%rbp,%rdx), %edx
> -; SSE2-NEXT:    movl %edx, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill
> -; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %edx
> -; SSE2-NEXT:    leal -1(%rdi,%rdx), %r8d
> -; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %edx
> -; SSE2-NEXT:    leal -1(%rax,%rdx), %edi
> -; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %eax
> -; SSE2-NEXT:    leal -1(%rcx,%rax), %edx
> -; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %eax
> -; SSE2-NEXT:    leal -1(%r9,%rax), %ecx
> +; SSE2-NEXT:    leaq -1(%r15,%rsi), %rax
> +; SSE2-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
>  ; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %esi
> -; SSE2-NEXT:    leal -1(%r10,%rsi), %eax
> +; SSE2-NEXT:    leaq -1(%r12,%rsi), %rax
> +; SSE2-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
>  ; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %esi
> -; SSE2-NEXT:    leaq -1(%r11,%rsi), %rsi
> -; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %ebx
> -; SSE2-NEXT:    leaq -1(%r12,%rbx), %r12
> -; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %ebx
> -; SSE2-NEXT:    leaq -1(%r15,%rbx), %r15
> -; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %ebx
> -; SSE2-NEXT:    leaq -1(%r14,%rbx), %r14
> -; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %ebx
> -; SSE2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rbp # 8-byte Reload
> -; SSE2-NEXT:    leaq -1(%rbp,%rbx), %r11
> -; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %ebx
> -; SSE2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rbp # 8-byte Reload
> -; SSE2-NEXT:    leaq -1(%rbp,%rbx), %r10
> -; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %ebx
> -; SSE2-NEXT:    leaq -1(%r13,%rbx), %r9
> -; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %ebx
> -; SSE2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %r13 # 8-byte Reload
> -; SSE2-NEXT:    leaq -1(%r13,%rbx), %rbx
> -; SSE2-NEXT:    shrl %eax
> -; SSE2-NEXT:    movd %eax, %xmm8
> -; SSE2-NEXT:    shrl %ecx
> -; SSE2-NEXT:    movd %ecx, %xmm15
> -; SSE2-NEXT:    shrl %edx
> -; SSE2-NEXT:    movd %edx, %xmm9
> -; SSE2-NEXT:    shrl %edi
> -; SSE2-NEXT:    movd %edi, %xmm2
> -; SSE2-NEXT:    shrl %r8d
> -; SSE2-NEXT:    movd %r8d, %xmm10
> -; SSE2-NEXT:    movl {{[-0-9]+}}(%r{{[sb]}}p), %eax # 4-byte Reload
> -; SSE2-NEXT:    shrl %eax
> -; SSE2-NEXT:    movd %eax, %xmm6
> -; SSE2-NEXT:    movl {{[-0-9]+}}(%r{{[sb]}}p), %eax # 4-byte Reload
> -; SSE2-NEXT:    shrl %eax
> -; SSE2-NEXT:    movd %eax, %xmm11
> -; SSE2-NEXT:    movl {{[-0-9]+}}(%r{{[sb]}}p), %eax # 4-byte Reload
> -; SSE2-NEXT:    shrl %eax
> -; SSE2-NEXT:    movd %eax, %xmm4
> -; SSE2-NEXT:    shrq %rsi
> -; SSE2-NEXT:    movd %esi, %xmm12
> -; SSE2-NEXT:    shrq %r12
> -; SSE2-NEXT:    movd %r12d, %xmm3
> -; SSE2-NEXT:    shrq %r15
> -; SSE2-NEXT:    movd %r15d, %xmm13
> -; SSE2-NEXT:    shrq %r14
> -; SSE2-NEXT:    movd %r14d, %xmm7
> -; SSE2-NEXT:    shrq %r11
> -; SSE2-NEXT:    movd %r11d, %xmm14
> -; SSE2-NEXT:    shrq %r10
> -; SSE2-NEXT:    movd %r10d, %xmm5
> -; SSE2-NEXT:    shrq %r9
> -; SSE2-NEXT:    movd %r9d, %xmm0
> -; SSE2-NEXT:    shrq %rbx
> -; SSE2-NEXT:    movd %ebx, %xmm1
> +; SSE2-NEXT:    leaq -1(%r13,%rsi), %rax
> +; SSE2-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %esi
> +; SSE2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> +; SSE2-NEXT:    leaq -1(%rax,%rsi), %rax
> +; SSE2-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %esi
> +; SSE2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> +; SSE2-NEXT:    leaq -1(%rax,%rsi), %rax
> +; SSE2-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %esi
> +; SSE2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> +; SSE2-NEXT:    leaq -1(%rax,%rsi), %rax
> +; SSE2-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %esi
> +; SSE2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> +; SSE2-NEXT:    leaq -1(%rax,%rsi), %rsi
> +; SSE2-NEXT:    movq %rsi, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %esi
> +; SSE2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> +; SSE2-NEXT:    leaq -1(%rax,%rsi), %rsi
> +; SSE2-NEXT:    movq %rsi, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; SSE2-NEXT:    addq $-1, %rbp
> +; SSE2-NEXT:    movl $0, %r9d
> +; SSE2-NEXT:    adcq $-1, %r9
> +; SSE2-NEXT:    addq $-1, %r14
> +; SSE2-NEXT:    movl $0, %esi
> +; SSE2-NEXT:    adcq $-1, %rsi
> +; SSE2-NEXT:    addq $-1, %rbx
> +; SSE2-NEXT:    movl $0, %eax
> +; SSE2-NEXT:    adcq $-1, %rax
> +; SSE2-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; SSE2-NEXT:    addq $-1, %r11
> +; SSE2-NEXT:    movl $0, %r12d
> +; SSE2-NEXT:    adcq $-1, %r12
> +; SSE2-NEXT:    addq $-1, %r10
> +; SSE2-NEXT:    movl $0, %r13d
> +; SSE2-NEXT:    adcq $-1, %r13
> +; SSE2-NEXT:    addq $-1, %r8
> +; SSE2-NEXT:    movl $0, %r15d
> +; SSE2-NEXT:    adcq $-1, %r15
> +; SSE2-NEXT:    addq $-1, %rdi
> +; SSE2-NEXT:    movl $0, %ecx
> +; SSE2-NEXT:    adcq $-1, %rcx
> +; SSE2-NEXT:    addq $-1, %rdx
> +; SSE2-NEXT:    movl $0, %eax
> +; SSE2-NEXT:    adcq $-1, %rax
> +; SSE2-NEXT:    shldq $63, %rdx, %rax
> +; SSE2-NEXT:    shldq $63, %rdi, %rcx
> +; SSE2-NEXT:    movq %rcx, %rdx
> +; SSE2-NEXT:    shldq $63, %r8, %r15
> +; SSE2-NEXT:    shldq $63, %r10, %r13
> +; SSE2-NEXT:    shldq $63, %r11, %r12
> +; SSE2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rdi # 8-byte Reload
> +; SSE2-NEXT:    shldq $63, %rbx, %rdi
> +; SSE2-NEXT:    shldq $63, %r14, %rsi
> +; SSE2-NEXT:    shldq $63, %rbp, %r9
> +; SSE2-NEXT:    movq %r9, %xmm8
> +; SSE2-NEXT:    movq %rsi, %xmm15
> +; SSE2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rcx # 8-byte Reload
> +; SSE2-NEXT:    shrq %rcx
> +; SSE2-NEXT:    movq %rcx, %xmm9
> +; SSE2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rcx # 8-byte Reload
> +; SSE2-NEXT:    shrq %rcx
> +; SSE2-NEXT:    movq %rcx, %xmm2
> +; SSE2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rcx # 8-byte Reload
> +; SSE2-NEXT:    shrq %rcx
> +; SSE2-NEXT:    movq %rcx, %xmm10
> +; SSE2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rcx # 8-byte Reload
> +; SSE2-NEXT:    shrq %rcx
> +; SSE2-NEXT:    movq %rcx, %xmm4
> +; SSE2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rcx # 8-byte Reload
> +; SSE2-NEXT:    shrq %rcx
> +; SSE2-NEXT:    movq %rcx, %xmm11
> +; SSE2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rcx # 8-byte Reload
> +; SSE2-NEXT:    shrq %rcx
> +; SSE2-NEXT:    movq %rcx, %xmm7
> +; SSE2-NEXT:    movq %rdi, %xmm12
> +; SSE2-NEXT:    movq %r12, %xmm0
> +; SSE2-NEXT:    movq %r13, %xmm13
> +; SSE2-NEXT:    movq %r15, %xmm6
> +; SSE2-NEXT:    movq %rdx, %xmm14
> +; SSE2-NEXT:    movq %rax, %xmm5
> +; SSE2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> +; SSE2-NEXT:    shrq %rax
> +; SSE2-NEXT:    movq %rax, %xmm3
> +; SSE2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> +; SSE2-NEXT:    shrq %rax
> +; SSE2-NEXT:    movq %rax, %xmm1
>  ; SSE2-NEXT:    punpcklbw {{.*#+}} xmm15 = xmm15[0],xmm8[0],xmm15[1],xmm8[1],xmm15[2],xmm8[2],xmm15[3],xmm8[3],xmm15[4],xmm8[4],xmm15[5],xmm8[5],xmm15[6],xmm8[6],xmm15[7],xmm8[7]
>  ; SSE2-NEXT:    punpcklbw {{.*#+}} xmm2 = xmm2[0],xmm9[0],xmm2[1],xmm9[1],xmm2[2],xmm9[2],xmm2[3],xmm9[3],xmm2[4],xmm9[4],xmm2[5],xmm9[5],xmm2[6],xmm9[6],xmm2[7],xmm9[7]
> -; SSE2-NEXT:    punpcklwd {{.*#+}} xmm2 = xmm2[0],xmm15[0],xmm2[1],xmm15[1],xmm2[2],xmm15[2],xmm2[3],xmm15[3]
> -; SSE2-NEXT:    punpcklbw {{.*#+}} xmm6 = xmm6[0],xmm10[0],xmm6[1],xmm10[1],xmm6[2],xmm10[2],xmm6[3],xmm10[3],xmm6[4],xmm10[4],xmm6[5],xmm10[5],xmm6[6],xmm10[6],xmm6[7],xmm10[7]
> -; SSE2-NEXT:    punpcklbw {{.*#+}} xmm4 = xmm4[0],xmm11[0],xmm4[1],xmm11[1],xmm4[2],xmm11[2],xmm4[3],xmm11[3],xmm4[4],xmm11[4],xmm4[5],xmm11[5],xmm4[6],xmm11[6],xmm4[7],xmm11[7]
> -; SSE2-NEXT:    punpcklwd {{.*#+}} xmm4 = xmm4[0],xmm6[0],xmm4[1],xmm6[1],xmm4[2],xmm6[2],xmm4[3],xmm6[3]
> -; SSE2-NEXT:    punpckldq {{.*#+}} xmm4 = xmm4[0],xmm2[0],xmm4[1],xmm2[1]
> -; SSE2-NEXT:    punpcklbw {{.*#+}} xmm3 = xmm3[0],xmm12[0],xmm3[1],xmm12[1],xmm3[2],xmm12[2],xmm3[3],xmm12[3],xmm3[4],xmm12[4],xmm3[5],xmm12[5],xmm3[6],xmm12[6],xmm3[7],xmm12[7]
> -; SSE2-NEXT:    punpcklbw {{.*#+}} xmm7 = xmm7[0],xmm13[0],xmm7[1],xmm13[1],xmm7[2],xmm13[2],xmm7[3],xmm13[3],xmm7[4],xmm13[4],xmm7[5],xmm13[5],xmm7[6],xmm13[6],xmm7[7],xmm13[7]
> -; SSE2-NEXT:    punpcklwd {{.*#+}} xmm7 = xmm7[0],xmm3[0],xmm7[1],xmm3[1],xmm7[2],xmm3[2],xmm7[3],xmm3[3]
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm8 = xmm15[0,1,2,0]
> +; SSE2-NEXT:    pand {{.*}}(%rip), %xmm8
> +; SSE2-NEXT:    pslldq {{.*#+}} xmm2 = zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,xmm2[0,1]
> +; SSE2-NEXT:    por %xmm8, %xmm2
> +; SSE2-NEXT:    punpcklbw {{.*#+}} xmm4 = xmm4[0],xmm10[0],xmm4[1],xmm10[1],xmm4[2],xmm10[2],xmm4[3],xmm10[3],xmm4[4],xmm10[4],xmm4[5],xmm10[5],xmm4[6],xmm10[6],xmm4[7],xmm10[7]
> +; SSE2-NEXT:    pslldq {{.*#+}} xmm4 = zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,xmm4[0,1,2,3,4,5]
> +; SSE2-NEXT:    punpcklbw {{.*#+}} xmm7 = xmm7[0],xmm11[0],xmm7[1],xmm11[1],xmm7[2],xmm11[2],xmm7[3],xmm11[3],xmm7[4],xmm11[4],xmm7[5],xmm11[5],xmm7[6],xmm11[6],xmm7[7],xmm11[7]
> +; SSE2-NEXT:    movdqa {{.*#+}} xmm8 = [65535,65535,65535,65535,65535,0,65535,65535]
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm7 = xmm7[0,1,0,1]
> +; SSE2-NEXT:    pand %xmm8, %xmm7
> +; SSE2-NEXT:    pandn %xmm4, %xmm8
> +; SSE2-NEXT:    por %xmm7, %xmm8
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm4 = xmm8[0,1,2,2]
> +; SSE2-NEXT:    punpckhdq {{.*#+}} xmm4 = xmm4[2],xmm2[2],xmm4[3],xmm2[3]
> +; SSE2-NEXT:    punpcklbw {{.*#+}} xmm0 = xmm0[0],xmm12[0],xmm0[1],xmm12[1],xmm0[2],xmm12[2],xmm0[3],xmm12[3],xmm0[4],xmm12[4],xmm0[5],xmm12[5],xmm0[6],xmm12[6],xmm0[7],xmm12[7]
> +; SSE2-NEXT:    punpcklbw {{.*#+}} xmm6 = xmm6[0],xmm13[0],xmm6[1],xmm13[1],xmm6[2],xmm13[2],xmm6[3],xmm13[3],xmm6[4],xmm13[4],xmm6[5],xmm13[5],xmm6[6],xmm13[6],xmm6[7],xmm13[7]
> +; SSE2-NEXT:    movdqa {{.*#+}} xmm2 = [65535,0,65535,65535,65535,65535,65535,65535]
> +; SSE2-NEXT:    pand %xmm2, %xmm0
> +; SSE2-NEXT:    pslld $16, %xmm6
> +; SSE2-NEXT:    pandn %xmm6, %xmm2
> +; SSE2-NEXT:    por %xmm0, %xmm2
>  ; SSE2-NEXT:    punpcklbw {{.*#+}} xmm5 = xmm5[0],xmm14[0],xmm5[1],xmm14[1],xmm5[2],xmm14[2],xmm5[3],xmm14[3],xmm5[4],xmm14[4],xmm5[5],xmm14[5],xmm5[6],xmm14[6],xmm5[7],xmm14[7]
> -; SSE2-NEXT:    punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3],xmm1[4],xmm0[4],xmm1[5],xmm0[5],xmm1[6],xmm0[6],xmm1[7],xmm0[7]
> -; SSE2-NEXT:    punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm5[0],xmm1[1],xmm5[1],xmm1[2],xmm5[2],xmm1[3],xmm5[3]
> -; SSE2-NEXT:    punpckldq {{.*#+}} xmm1 = xmm1[0],xmm7[0],xmm1[1],xmm7[1]
> -; SSE2-NEXT:    punpcklqdq {{.*#+}} xmm4 = xmm4[0],xmm1[0]
> -; SSE2-NEXT:    movdqu %xmm4, (%rax)
> +; SSE2-NEXT:    psllq $48, %xmm5
> +; SSE2-NEXT:    movdqa {{.*#+}} xmm0 = [65535,65535,65535,0,65535,65535,65535,65535]
> +; SSE2-NEXT:    punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm3[0],xmm1[1],xmm3[1],xmm1[2],xmm3[2],xmm1[3],xmm3[3],xmm1[4],xmm3[4],xmm1[5],xmm3[5],xmm1[6],xmm3[6],xmm1[7],xmm3[7]
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[0,0,1,1]
> +; SSE2-NEXT:    pand %xmm0, %xmm1
> +; SSE2-NEXT:    pandn %xmm5, %xmm0
> +; SSE2-NEXT:    por %xmm1, %xmm0
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,1,2,3]
> +; SSE2-NEXT:    punpckldq {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1]
> +; SSE2-NEXT:    shufps {{.*#+}} xmm2 = xmm2[0,1],xmm4[2,3]
> +; SSE2-NEXT:    movups %xmm2, (%rax)
>  ; SSE2-NEXT:    popq %rbx
>  ; SSE2-NEXT:    popq %r12
>  ; SSE2-NEXT:    popq %r13
> @@ -2025,118 +2087,181 @@ define void @not_avg_v16i8_wide_constant
>  ; AVX1-NEXT:    pushq %r13
>  ; AVX1-NEXT:    pushq %r12
>  ; AVX1-NEXT:    pushq %rbx
> -; AVX1-NEXT:    vpmovzxbw {{.*#+}} xmm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
> -; AVX1-NEXT:    vpmovzxbw {{.*#+}} xmm3 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
> +; AVX1-NEXT:    vpmovzxbw {{.*#+}} xmm4 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
>  ; AVX1-NEXT:    vpmovzxbw {{.*#+}} xmm0 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
> -; AVX1-NEXT:    vpmovzxbw {{.*#+}} xmm5 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
> -; AVX1-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> -; AVX1-NEXT:    vpunpckhwd {{.*#+}} xmm6 = xmm3[4],xmm2[4],xmm3[5],xmm2[5],xmm3[6],xmm2[6],xmm3[7],xmm2[7]
> -; AVX1-NEXT:    vpunpckhwd {{.*#+}} xmm4 = xmm1[4],xmm2[4],xmm1[5],xmm2[5],xmm1[6],xmm2[6],xmm1[7],xmm2[7]
> -; AVX1-NEXT:    vpunpckhdq {{.*#+}} xmm7 = xmm4[2],xmm2[2],xmm4[3],xmm2[3]
> -; AVX1-NEXT:    vpextrq $1, %xmm7, %r15
> -; AVX1-NEXT:    vmovq %xmm7, %r14
> -; AVX1-NEXT:    vpmovzxdq {{.*#+}} xmm4 = xmm4[0],zero,xmm4[1],zero
> -; AVX1-NEXT:    vpextrq $1, %xmm4, %r11
> -; AVX1-NEXT:    vmovq %xmm4, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Folded Spill
> +; AVX1-NEXT:    vpmovzxbw {{.*#+}} xmm2 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
> +; AVX1-NEXT:    vpmovzxbw {{.*#+}} xmm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
> +; AVX1-NEXT:    vpmovzxwd {{.*#+}} xmm5 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero
> +; AVX1-NEXT:    vpxor %xmm3, %xmm3, %xmm3
> +; AVX1-NEXT:    vpunpckhdq {{.*#+}} xmm5 = xmm5[2],xmm3[2],xmm5[3],xmm3[3]
> +; AVX1-NEXT:    vmovq %xmm5, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Folded Spill
> +; AVX1-NEXT:    vpextrq $1, %xmm5, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Folded Spill
> +; AVX1-NEXT:    vpunpckhwd {{.*#+}} xmm5 = xmm0[4],xmm3[4],xmm0[5],xmm3[5],xmm0[6],xmm3[6],xmm0[7],xmm3[7]
> +; AVX1-NEXT:    vpunpckhdq {{.*#+}} xmm6 = xmm5[2],xmm3[2],xmm5[3],xmm3[3]
> +; AVX1-NEXT:    vmovq %xmm6, %r10
> +; AVX1-NEXT:    vpextrq $1, %xmm6, %r9
> +; AVX1-NEXT:    vpunpckhwd {{.*#+}} xmm6 = xmm4[4],xmm3[4],xmm4[5],xmm3[5],xmm4[6],xmm3[6],xmm4[7],xmm3[7]
> +; AVX1-NEXT:    vpmovzxdq {{.*#+}} xmm7 = xmm6[0],zero,xmm6[1],zero
> +; AVX1-NEXT:    vmovq %xmm7, %r8
> +; AVX1-NEXT:    vpextrq $1, %xmm7, %rdi
> +; AVX1-NEXT:    vpunpckhdq {{.*#+}} xmm6 = xmm6[2],xmm3[2],xmm6[3],xmm3[3]
> +; AVX1-NEXT:    vpextrq $1, %xmm6, %rcx
> +; AVX1-NEXT:    vmovq %xmm6, %r14
> +; AVX1-NEXT:    vpmovzxwd {{.*#+}} xmm6 = xmm4[0],zero,xmm4[1],zero,xmm4[2],zero,xmm4[3],zero
> +; AVX1-NEXT:    vpunpckhdq {{.*#+}} xmm6 = xmm6[2],xmm3[2],xmm6[3],xmm3[3]
> +; AVX1-NEXT:    vpextrq $1, %xmm6, %rax
> +; AVX1-NEXT:    vmovq %xmm6, %rbp
> +; AVX1-NEXT:    vpmovzxdq {{.*#+}} xmm5 = xmm5[0],zero,xmm5[1],zero
> +; AVX1-NEXT:    vpextrq $1, %xmm5, %r11
> +; AVX1-NEXT:    vmovq %xmm5, %r15
> +; AVX1-NEXT:    vpmovzxwq {{.*#+}} xmm8 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero
> +; AVX1-NEXT:    vpmovzxwq {{.*#+}} xmm4 = xmm4[0],zero,zero,zero,xmm4[1],zero,zero,zero
> +; AVX1-NEXT:    vpextrq $1, %xmm4, %rbx
> +; AVX1-NEXT:    vmovq %xmm4, %rdx
>  ; AVX1-NEXT:    vpmovzxwd {{.*#+}} xmm4 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero
> -; AVX1-NEXT:    vpunpckhdq {{.*#+}} xmm4 = xmm4[2],xmm2[2],xmm4[3],xmm2[3]
> -; AVX1-NEXT:    vpextrq $1, %xmm4, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Folded Spill
> -; AVX1-NEXT:    vmovq %xmm4, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Folded Spill
> -; AVX1-NEXT:    vpmovzxwd {{.*#+}} xmm4 = xmm3[0],zero,xmm3[1],zero,xmm3[2],zero,xmm3[3],zero
> -; AVX1-NEXT:    vpmovzxwq {{.*#+}} xmm7 = xmm3[0],zero,zero,zero,xmm3[1],zero,zero,zero
> -; AVX1-NEXT:    vpmovzxwq {{.*#+}} xmm8 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero
> -; AVX1-NEXT:    vpunpckhwd {{.*#+}} xmm1 = xmm5[4],xmm2[4],xmm5[5],xmm2[5],xmm5[6],xmm2[6],xmm5[7],xmm2[7]
> -; AVX1-NEXT:    vpunpckhwd {{.*#+}} xmm3 = xmm0[4],xmm2[4],xmm0[5],xmm2[5],xmm0[6],xmm2[6],xmm0[7],xmm2[7]
> -; AVX1-NEXT:    vmovd %xmm6, %ecx
> -; AVX1-NEXT:    vpextrd $1, %xmm6, %edx
> -; AVX1-NEXT:    vpextrd $2, %xmm6, %r13d
> -; AVX1-NEXT:    vpextrd $3, %xmm6, %r12d
> -; AVX1-NEXT:    vpunpckhdq {{.*#+}} xmm6 = xmm3[2],xmm2[2],xmm3[3],xmm2[3]
> -; AVX1-NEXT:    vmovd %xmm1, %ebx
> -; AVX1-NEXT:    vpextrd $1, %xmm1, %ebp
> -; AVX1-NEXT:    vpextrd $2, %xmm1, %esi
> -; AVX1-NEXT:    vpextrd $3, %xmm1, %edi
> -; AVX1-NEXT:    vpmovzxwd {{.*#+}} xmm1 = xmm5[0],zero,xmm5[1],zero,xmm5[2],zero,xmm5[3],zero
> -; AVX1-NEXT:    vpmovzxwq {{.*#+}} xmm5 = xmm5[0],zero,zero,zero,xmm5[1],zero,zero,zero
> -; AVX1-NEXT:    vmovd %xmm7, %r8d
> -; AVX1-NEXT:    leal -1(%r12,%rdi), %eax
> -; AVX1-NEXT:    movl %eax, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill
> -; AVX1-NEXT:    vpextrd $2, %xmm7, %eax
> -; AVX1-NEXT:    leal -1(%r13,%rsi), %esi
> -; AVX1-NEXT:    movl %esi, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill
> -; AVX1-NEXT:    vpextrd $2, %xmm4, %edi
> -; AVX1-NEXT:    leal -1(%rdx,%rbp), %edx
> -; AVX1-NEXT:    movl %edx, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill
> -; AVX1-NEXT:    vpextrd $3, %xmm4, %edx
> -; AVX1-NEXT:    leal -1(%rcx,%rbx), %r10d
> -; AVX1-NEXT:    vpextrd $3, %xmm1, %ecx
> -; AVX1-NEXT:    leal -1(%rdx,%rcx), %r9d
> -; AVX1-NEXT:    vpextrd $2, %xmm1, %ecx
> -; AVX1-NEXT:    leal -1(%rdi,%rcx), %edi
> -; AVX1-NEXT:    vpextrd $2, %xmm5, %ecx
> -; AVX1-NEXT:    leal -1(%rax,%rcx), %eax
> -; AVX1-NEXT:    vmovd %xmm5, %ecx
> -; AVX1-NEXT:    leal -1(%r8,%rcx), %r8d
> -; AVX1-NEXT:    vpextrq $1, %xmm6, %rdx
> -; AVX1-NEXT:    leal -1(%r15,%rdx), %r15d
> -; AVX1-NEXT:    vmovq %xmm6, %rdx
> -; AVX1-NEXT:    vpmovzxdq {{.*#+}} xmm1 = xmm3[0],zero,xmm3[1],zero
> -; AVX1-NEXT:    leal -1(%r14,%rdx), %r14d
> -; AVX1-NEXT:    vpextrq $1, %xmm1, %rdx
> -; AVX1-NEXT:    leal -1(%r11,%rdx), %edx
> -; AVX1-NEXT:    vmovq %xmm1, %rcx
> -; AVX1-NEXT:    vpmovzxwd {{.*#+}} xmm1 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero
> -; AVX1-NEXT:    vpunpckhdq {{.*#+}} xmm1 = xmm1[2],xmm2[2],xmm1[3],xmm2[3]
> -; AVX1-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rsi # 8-byte Reload
> -; AVX1-NEXT:    leal -1(%rsi,%rcx), %ecx
> -; AVX1-NEXT:    vpextrq $1, %xmm1, %rsi
> -; AVX1-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rbp # 8-byte Reload
> -; AVX1-NEXT:    leal -1(%rbp,%rsi), %esi
> -; AVX1-NEXT:    vmovq %xmm1, %rbx
> -; AVX1-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rbp # 8-byte Reload
> -; AVX1-NEXT:    leal -1(%rbp,%rbx), %ebx
> -; AVX1-NEXT:    vpextrq $1, %xmm8, %r11
> -; AVX1-NEXT:    vpmovzxwq {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero
> -; AVX1-NEXT:    vpextrq $1, %xmm0, %r12
> -; AVX1-NEXT:    leal -1(%r11,%r12), %r11d
> -; AVX1-NEXT:    vmovq %xmm8, %r12
> +; AVX1-NEXT:    vpunpckhdq {{.*#+}} xmm4 = xmm4[2],xmm3[2],xmm4[3],xmm3[3]
> +; AVX1-NEXT:    vpunpckhwd {{.*#+}} xmm7 = xmm1[4],xmm3[4],xmm1[5],xmm3[5],xmm1[6],xmm3[6],xmm1[7],xmm3[7]
> +; AVX1-NEXT:    vpunpckhdq {{.*#+}} xmm5 = xmm7[2],xmm3[2],xmm7[3],xmm3[3]
> +; AVX1-NEXT:    vpunpckhwd {{.*#+}} xmm0 = xmm2[4],xmm3[4],xmm2[5],xmm3[5],xmm2[6],xmm3[6],xmm2[7],xmm3[7]
> +; AVX1-NEXT:    vpmovzxdq {{.*#+}} xmm6 = xmm0[0],zero,xmm0[1],zero
> +; AVX1-NEXT:    vpunpckhdq {{.*#+}} xmm0 = xmm0[2],xmm3[2],xmm0[3],xmm3[3]
> +; AVX1-NEXT:    vpextrq $1, %xmm0, %rsi
> +; AVX1-NEXT:    addq %rcx, %rsi
>  ; AVX1-NEXT:    vmovq %xmm0, %r13
> -; AVX1-NEXT:    leal -1(%r12,%r13), %ebp
> -; AVX1-NEXT:    shrl %ebp
> -; AVX1-NEXT:    vmovd %ebp, %xmm0
> -; AVX1-NEXT:    shrl %r11d
> -; AVX1-NEXT:    vpinsrb $1, %r11d, %xmm0, %xmm0
> -; AVX1-NEXT:    shrl %ebx
> -; AVX1-NEXT:    vpinsrb $2, %ebx, %xmm0, %xmm0
> -; AVX1-NEXT:    shrl %esi
> -; AVX1-NEXT:    vpinsrb $3, %esi, %xmm0, %xmm0
> -; AVX1-NEXT:    shrl %ecx
> -; AVX1-NEXT:    vpinsrb $4, %ecx, %xmm0, %xmm0
> -; AVX1-NEXT:    shrl %edx
> -; AVX1-NEXT:    vpinsrb $5, %edx, %xmm0, %xmm0
> -; AVX1-NEXT:    shrl %r14d
> -; AVX1-NEXT:    vpinsrb $6, %r14d, %xmm0, %xmm0
> -; AVX1-NEXT:    shrl %r15d
> -; AVX1-NEXT:    vpinsrb $7, %r15d, %xmm0, %xmm0
> -; AVX1-NEXT:    shrl %r8d
> -; AVX1-NEXT:    vpinsrb $8, %r8d, %xmm0, %xmm0
> -; AVX1-NEXT:    shrl %eax
> -; AVX1-NEXT:    vpinsrb $9, %eax, %xmm0, %xmm0
> -; AVX1-NEXT:    shrl %edi
> -; AVX1-NEXT:    vpinsrb $10, %edi, %xmm0, %xmm0
> -; AVX1-NEXT:    shrl %r9d
> -; AVX1-NEXT:    vpinsrb $11, %r9d, %xmm0, %xmm0
> -; AVX1-NEXT:    shrl %r10d
> -; AVX1-NEXT:    vpinsrb $12, %r10d, %xmm0, %xmm0
> -; AVX1-NEXT:    movl {{[-0-9]+}}(%r{{[sb]}}p), %eax # 4-byte Reload
> -; AVX1-NEXT:    shrl %eax
> -; AVX1-NEXT:    vpinsrb $13, %eax, %xmm0, %xmm0
> -; AVX1-NEXT:    movl {{[-0-9]+}}(%r{{[sb]}}p), %eax # 4-byte Reload
> -; AVX1-NEXT:    shrl %eax
> -; AVX1-NEXT:    vpinsrb $14, %eax, %xmm0, %xmm0
> -; AVX1-NEXT:    movl {{[-0-9]+}}(%r{{[sb]}}p), %eax # 4-byte Reload
> -; AVX1-NEXT:    shrl %eax
> -; AVX1-NEXT:    vpinsrb $15, %eax, %xmm0, %xmm0
> +; AVX1-NEXT:    addq %r14, %r13
> +; AVX1-NEXT:    vpmovzxwd {{.*#+}} xmm0 = xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero
> +; AVX1-NEXT:    vpunpckhdq {{.*#+}} xmm0 = xmm0[2],xmm3[2],xmm0[3],xmm3[3]
> +; AVX1-NEXT:    vpextrq $1, %xmm0, %r12
> +; AVX1-NEXT:    addq %rax, %r12
> +; AVX1-NEXT:    vmovq %xmm0, %r14
> +; AVX1-NEXT:    addq %rbp, %r14
> +; AVX1-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm7[0],zero,xmm7[1],zero
> +; AVX1-NEXT:    vpextrq $1, %xmm0, %rbp
> +; AVX1-NEXT:    addq %r11, %rbp
> +; AVX1-NEXT:    vmovq %xmm0, %r11
> +; AVX1-NEXT:    addq %r15, %r11
> +; AVX1-NEXT:    vpmovzxwq {{.*#+}} xmm0 = xmm2[0],zero,zero,zero,xmm2[1],zero,zero,zero
> +; AVX1-NEXT:    vpextrq $1, %xmm0, %r15
> +; AVX1-NEXT:    addq %rbx, %r15
> +; AVX1-NEXT:    vmovq %xmm0, %rbx
> +; AVX1-NEXT:    addq %rdx, %rbx
> +; AVX1-NEXT:    vpextrq $1, %xmm6, %rax
> +; AVX1-NEXT:    leaq -1(%rdi,%rax), %rax
> +; AVX1-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX1-NEXT:    vmovq %xmm6, %rax
> +; AVX1-NEXT:    leaq -1(%r8,%rax), %rax
> +; AVX1-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX1-NEXT:    vpextrq $1, %xmm5, %rax
> +; AVX1-NEXT:    leaq -1(%r9,%rax), %rax
> +; AVX1-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX1-NEXT:    vmovq %xmm5, %rax
> +; AVX1-NEXT:    leaq -1(%r10,%rax), %rax
> +; AVX1-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX1-NEXT:    vpextrq $1, %xmm4, %rax
> +; AVX1-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rcx # 8-byte Reload
> +; AVX1-NEXT:    leaq -1(%rcx,%rax), %rax
> +; AVX1-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX1-NEXT:    vmovq %xmm4, %rax
> +; AVX1-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rcx # 8-byte Reload
> +; AVX1-NEXT:    leaq -1(%rcx,%rax), %rax
> +; AVX1-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX1-NEXT:    vpextrq $1, %xmm8, %rax
> +; AVX1-NEXT:    vpmovzxwq {{.*#+}} xmm0 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero
> +; AVX1-NEXT:    vpextrq $1, %xmm0, %rcx
> +; AVX1-NEXT:    leaq -1(%rax,%rcx), %rax
> +; AVX1-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX1-NEXT:    vmovq %xmm8, %rax
> +; AVX1-NEXT:    vmovq %xmm0, %rcx
> +; AVX1-NEXT:    leaq -1(%rax,%rcx), %rax
> +; AVX1-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX1-NEXT:    xorl %r10d, %r10d
> +; AVX1-NEXT:    addq $-1, %rsi
> +; AVX1-NEXT:    movq %rsi, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX1-NEXT:    movl $0, %ecx
> +; AVX1-NEXT:    adcq $-1, %rcx
> +; AVX1-NEXT:    addq $-1, %r13
> +; AVX1-NEXT:    movl $0, %eax
> +; AVX1-NEXT:    adcq $-1, %rax
> +; AVX1-NEXT:    addq $-1, %r12
> +; AVX1-NEXT:    movl $0, %edi
> +; AVX1-NEXT:    adcq $-1, %rdi
> +; AVX1-NEXT:    addq $-1, %r14
> +; AVX1-NEXT:    movl $0, %esi
> +; AVX1-NEXT:    adcq $-1, %rsi
> +; AVX1-NEXT:    addq $-1, %rbp
> +; AVX1-NEXT:    movl $0, %r9d
> +; AVX1-NEXT:    adcq $-1, %r9
> +; AVX1-NEXT:    addq $-1, %r11
> +; AVX1-NEXT:    movl $0, %r8d
> +; AVX1-NEXT:    adcq $-1, %r8
> +; AVX1-NEXT:    addq $-1, %r15
> +; AVX1-NEXT:    movl $0, %edx
> +; AVX1-NEXT:    adcq $-1, %rdx
> +; AVX1-NEXT:    addq $-1, %rbx
> +; AVX1-NEXT:    adcq $-1, %r10
> +; AVX1-NEXT:    shldq $63, %r11, %r8
> +; AVX1-NEXT:    shldq $63, %rbp, %r9
> +; AVX1-NEXT:    shldq $63, %r14, %rsi
> +; AVX1-NEXT:    shldq $63, %r12, %rdi
> +; AVX1-NEXT:    shldq $63, %r13, %rax
> +; AVX1-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rbp # 8-byte Reload
> +; AVX1-NEXT:    shldq $63, %rbp, %rcx
> +; AVX1-NEXT:    shldq $63, %rbx, %r10
> +; AVX1-NEXT:    shldq $63, %r15, %rdx
> +; AVX1-NEXT:    vmovq %rcx, %xmm8
> +; AVX1-NEXT:    vmovq %rax, %xmm9
> +; AVX1-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> +; AVX1-NEXT:    shrq %rax
> +; AVX1-NEXT:    vmovq %rax, %xmm0
> +; AVX1-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> +; AVX1-NEXT:    shrq %rax
> +; AVX1-NEXT:    vmovq %rax, %xmm11
> +; AVX1-NEXT:    vmovq %rdi, %xmm12
> +; AVX1-NEXT:    vmovq %rsi, %xmm13
> +; AVX1-NEXT:    vmovq %rdx, %xmm14
> +; AVX1-NEXT:    vmovq %r10, %xmm15
> +; AVX1-NEXT:    vmovq %r9, %xmm10
> +; AVX1-NEXT:    vmovq %r8, %xmm1
> +; AVX1-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> +; AVX1-NEXT:    shrq %rax
> +; AVX1-NEXT:    vmovq %rax, %xmm2
> +; AVX1-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> +; AVX1-NEXT:    shrq %rax
> +; AVX1-NEXT:    vmovq %rax, %xmm3
> +; AVX1-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> +; AVX1-NEXT:    shrq %rax
> +; AVX1-NEXT:    vmovq %rax, %xmm4
> +; AVX1-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> +; AVX1-NEXT:    shrq %rax
> +; AVX1-NEXT:    vmovq %rax, %xmm5
> +; AVX1-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> +; AVX1-NEXT:    shrq %rax
> +; AVX1-NEXT:    vmovq %rax, %xmm6
> +; AVX1-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> +; AVX1-NEXT:    shrq %rax
> +; AVX1-NEXT:    vmovq %rax, %xmm7
> +; AVX1-NEXT:    vpunpcklbw {{.*#+}} xmm8 = xmm9[0],xmm8[0],xmm9[1],xmm8[1],xmm9[2],xmm8[2],xmm9[3],xmm8[3],xmm9[4],xmm8[4],xmm9[5],xmm8[5],xmm9[6],xmm8[6],xmm9[7],xmm8[7]
> +; AVX1-NEXT:    vpunpcklbw {{.*#+}} xmm9 = xmm11[0],xmm0[0],xmm11[1],xmm0[1],xmm11[2],xmm0[2],xmm11[3],xmm0[3],xmm11[4],xmm0[4],xmm11[5],xmm0[5],xmm11[6],xmm0[6],xmm11[7],xmm0[7]
> +; AVX1-NEXT:    vpsllq $48, %xmm8, %xmm8
> +; AVX1-NEXT:    vpshufd {{.*#+}} xmm0 = xmm9[0,0,1,1]
> +; AVX1-NEXT:    vpblendw {{.*#+}} xmm8 = xmm0[0,1,2],xmm8[3],xmm0[4,5,6,7]
> +; AVX1-NEXT:    vpunpcklbw {{.*#+}} xmm0 = xmm13[0],xmm12[0],xmm13[1],xmm12[1],xmm13[2],xmm12[2],xmm13[3],xmm12[3],xmm13[4],xmm12[4],xmm13[5],xmm12[5],xmm13[6],xmm12[6],xmm13[7],xmm12[7]
> +; AVX1-NEXT:    vpunpcklbw {{.*#+}} xmm9 = xmm15[0],xmm14[0],xmm15[1],xmm14[1],xmm15[2],xmm14[2],xmm15[3],xmm14[3],xmm15[4],xmm14[4],xmm15[5],xmm14[5],xmm15[6],xmm14[6],xmm15[7],xmm14[7]
> +; AVX1-NEXT:    vpslld $16, %xmm0, %xmm0
> +; AVX1-NEXT:    vpblendw {{.*#+}} xmm0 = xmm9[0],xmm0[1],xmm9[2,3,4,5,6,7]
> +; AVX1-NEXT:    vpblendw {{.*#+}} xmm0 = xmm0[0,1],xmm8[2,3],xmm0[4,5,6,7]
> +; AVX1-NEXT:    vpunpcklbw {{.*#+}} xmm1 = xmm1[0],xmm10[0],xmm1[1],xmm10[1],xmm1[2],xmm10[2],xmm1[3],xmm10[3],xmm1[4],xmm10[4],xmm1[5],xmm10[5],xmm1[6],xmm10[6],xmm1[7],xmm10[7]
> +; AVX1-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[0,1,2,0]
> +; AVX1-NEXT:    vpunpcklbw {{.*#+}} xmm2 = xmm3[0],xmm2[0],xmm3[1],xmm2[1],xmm3[2],xmm2[2],xmm3[3],xmm2[3],xmm3[4],xmm2[4],xmm3[5],xmm2[5],xmm3[6],xmm2[6],xmm3[7],xmm2[7]
> +; AVX1-NEXT:    vpslldq {{.*#+}} xmm2 = zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,xmm2[0,1]
> +; AVX1-NEXT:    vpblendw {{.*#+}} xmm1 = xmm1[0,1,2,3,4,5,6],xmm2[7]
> +; AVX1-NEXT:    vpunpcklbw {{.*#+}} xmm2 = xmm5[0],xmm4[0],xmm5[1],xmm4[1],xmm5[2],xmm4[2],xmm5[3],xmm4[3],xmm5[4],xmm4[4],xmm5[5],xmm4[5],xmm5[6],xmm4[6],xmm5[7],xmm4[7]
> +; AVX1-NEXT:    vpslldq {{.*#+}} xmm2 = zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,xmm2[0,1,2,3,4,5]
> +; AVX1-NEXT:    vpunpcklbw {{.*#+}} xmm3 = xmm7[0],xmm6[0],xmm7[1],xmm6[1],xmm7[2],xmm6[2],xmm7[3],xmm6[3],xmm7[4],xmm6[4],xmm7[5],xmm6[5],xmm7[6],xmm6[6],xmm7[7],xmm6[7]
> +; AVX1-NEXT:    vpshufd {{.*#+}} xmm3 = xmm3[0,1,0,1]
> +; AVX1-NEXT:    vpblendw {{.*#+}} xmm2 = xmm3[0,1,2,3,4],xmm2[5],xmm3[6,7]
> +; AVX1-NEXT:    vpblendw {{.*#+}} xmm1 = xmm2[0,1,2,3,4,5],xmm1[6,7]
> +; AVX1-NEXT:    vpblendw {{.*#+}} xmm0 = xmm0[0,1,2,3],xmm1[4,5,6,7]
>  ; AVX1-NEXT:    vmovdqu %xmm0, (%rax)
>  ; AVX1-NEXT:    popq %rbx
>  ; AVX1-NEXT:    popq %r12
> @@ -2154,123 +2279,230 @@ define void @not_avg_v16i8_wide_constant
>  ; AVX2-NEXT:    pushq %r13
>  ; AVX2-NEXT:    pushq %r12
>  ; AVX2-NEXT:    pushq %rbx
> +; AVX2-NEXT:    subq $16, %rsp
>  ; AVX2-NEXT:    vpmovzxbw {{.*#+}} ymm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
>  ; AVX2-NEXT:    vpmovzxbw {{.*#+}} ymm0 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
> -; AVX2-NEXT:    vpmovzxwd {{.*#+}} ymm2 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero
> -; AVX2-NEXT:    vpmovzxdq {{.*#+}} ymm10 = xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero
> -; AVX2-NEXT:    vextracti128 $1, %ymm1, %xmm1
> +; AVX2-NEXT:    vextracti128 $1, %ymm1, %xmm2
> +; AVX2-NEXT:    vpmovzxwd {{.*#+}} ymm2 = xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero,xmm2[4],zero,xmm2[5],zero,xmm2[6],zero,xmm2[7],zero
> +; AVX2-NEXT:    vextracti128 $1, %ymm2, %xmm3
> +; AVX2-NEXT:    vpmovzxdq {{.*#+}} ymm3 = xmm3[0],zero,xmm3[1],zero,xmm3[2],zero,xmm3[3],zero
> +; AVX2-NEXT:    vextracti128 $1, %ymm3, %xmm4
> +; AVX2-NEXT:    vpextrq $1, %xmm4, %rbx
> +; AVX2-NEXT:    vmovq %xmm4, %rbp
> +; AVX2-NEXT:    vpextrq $1, %xmm3, %rdi
> +; AVX2-NEXT:    vmovq %xmm3, %rcx
> +; AVX2-NEXT:    vpmovzxdq {{.*#+}} ymm2 = xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero
> +; AVX2-NEXT:    vextracti128 $1, %ymm2, %xmm3
> +; AVX2-NEXT:    vpextrq $1, %xmm3, %rdx
> +; AVX2-NEXT:    vmovq %xmm3, %r9
> +; AVX2-NEXT:    vpextrq $1, %xmm2, %r13
> +; AVX2-NEXT:    vmovq %xmm2, %r12
>  ; AVX2-NEXT:    vpmovzxwd {{.*#+}} ymm1 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero
> -; AVX2-NEXT:    vextracti128 $1, %ymm1, %xmm3
> -; AVX2-NEXT:    vpmovzxdq {{.*#+}} ymm5 = xmm3[0],zero,xmm3[1],zero,xmm3[2],zero,xmm3[3],zero
> -; AVX2-NEXT:    vextracti128 $1, %ymm5, %xmm4
> -; AVX2-NEXT:    vpmovzxdq {{.*#+}} ymm9 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero
> -; AVX2-NEXT:    vextracti128 $1, %ymm9, %xmm7
> -; AVX2-NEXT:    vextracti128 $1, %ymm2, %xmm1
> -; AVX2-NEXT:    vpmovzxdq {{.*#+}} ymm1 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero
>  ; AVX2-NEXT:    vextracti128 $1, %ymm1, %xmm2
> -; AVX2-NEXT:    vpextrq $1, %xmm2, %r15
> -; AVX2-NEXT:    vmovq %xmm2, %r14
> +; AVX2-NEXT:    vpmovzxdq {{.*#+}} ymm2 = xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero
> +; AVX2-NEXT:    vextracti128 $1, %ymm2, %xmm3
> +; AVX2-NEXT:    vpextrq $1, %xmm3, %r14
> +; AVX2-NEXT:    vmovq %xmm3, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Folded Spill
> +; AVX2-NEXT:    vpextrq $1, %xmm2, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Folded Spill
> +; AVX2-NEXT:    vmovq %xmm2, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Folded Spill
> +; AVX2-NEXT:    vpmovzxdq {{.*#+}} ymm1 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero
>  ; AVX2-NEXT:    vpextrq $1, %xmm1, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Folded Spill
> -; AVX2-NEXT:    vmovq %xmm1, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Folded Spill
> -; AVX2-NEXT:    vextracti128 $1, %ymm10, %xmm1
> -; AVX2-NEXT:    vpextrq $1, %xmm1, %r13
> -; AVX2-NEXT:    vmovq %xmm1, %r11
> -; AVX2-NEXT:    vpmovzxwd {{.*#+}} ymm2 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
> -; AVX2-NEXT:    vpmovzxdq {{.*#+}} ymm11 = xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero
> -; AVX2-NEXT:    vextracti128 $1, %ymm0, %xmm0
> +; AVX2-NEXT:    vmovq %xmm1, %r10
> +; AVX2-NEXT:    vextracti128 $1, %ymm1, %xmm1
> +; AVX2-NEXT:    vextracti128 $1, %ymm0, %xmm2
> +; AVX2-NEXT:    vpmovzxwd {{.*#+}} ymm2 = xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero,xmm2[4],zero,xmm2[5],zero,xmm2[6],zero,xmm2[7],zero
> +; AVX2-NEXT:    vextracti128 $1, %ymm2, %xmm3
> +; AVX2-NEXT:    vpmovzxdq {{.*#+}} ymm3 = xmm3[0],zero,xmm3[1],zero,xmm3[2],zero,xmm3[3],zero
> +; AVX2-NEXT:    vextracti128 $1, %ymm3, %xmm4
> +; AVX2-NEXT:    vpextrq $1, %xmm4, %rax
> +; AVX2-NEXT:    addq %rbx, %rax
> +; AVX2-NEXT:    movq %rax, %rbx
> +; AVX2-NEXT:    vmovq %xmm4, %rsi
> +; AVX2-NEXT:    addq %rbp, %rsi
> +; AVX2-NEXT:    vpextrq $1, %xmm3, %rax
> +; AVX2-NEXT:    addq %rdi, %rax
> +; AVX2-NEXT:    movq %rax, %rdi
> +; AVX2-NEXT:    vmovq %xmm3, %r11
> +; AVX2-NEXT:    addq %rcx, %r11
> +; AVX2-NEXT:    vpmovzxdq {{.*#+}} ymm2 = xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero
> +; AVX2-NEXT:    vextracti128 $1, %ymm2, %xmm3
> +; AVX2-NEXT:    vpextrq $1, %xmm3, %rcx
> +; AVX2-NEXT:    addq %rdx, %rcx
> +; AVX2-NEXT:    vmovq %xmm3, %r8
> +; AVX2-NEXT:    addq %r9, %r8
> +; AVX2-NEXT:    vpextrq $1, %xmm2, %r9
> +; AVX2-NEXT:    addq %r13, %r9
> +; AVX2-NEXT:    vmovq %xmm2, %r15
> +; AVX2-NEXT:    addq %r12, %r15
>  ; AVX2-NEXT:    vpmovzxwd {{.*#+}} ymm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
> -; AVX2-NEXT:    vextracti128 $1, %ymm0, %xmm1
> -; AVX2-NEXT:    vpmovzxdq {{.*#+}} ymm8 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero
> -; AVX2-NEXT:    vextracti128 $1, %ymm8, %xmm1
> -; AVX2-NEXT:    vpmovzxdq {{.*#+}} ymm3 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero
> -; AVX2-NEXT:    vextracti128 $1, %ymm3, %xmm6
> -; AVX2-NEXT:    vextracti128 $1, %ymm2, %xmm0
> -; AVX2-NEXT:    vpmovzxdq {{.*#+}} ymm2 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero
> -; AVX2-NEXT:    vmovd %xmm9, %r12d
> -; AVX2-NEXT:    vpextrd $2, %xmm9, %r9d
> -; AVX2-NEXT:    vextracti128 $1, %ymm2, %xmm0
> -; AVX2-NEXT:    vmovd %xmm7, %ecx
> -; AVX2-NEXT:    vpextrd $2, %xmm7, %edi
> -; AVX2-NEXT:    vmovd %xmm5, %ebx
> -; AVX2-NEXT:    vpextrd $2, %xmm5, %esi
> -; AVX2-NEXT:    vmovd %xmm4, %edx
> -; AVX2-NEXT:    vpextrd $2, %xmm4, %ebp
> -; AVX2-NEXT:    vpextrd $2, %xmm1, %eax
> -; AVX2-NEXT:    leal -1(%rbp,%rax), %eax
> -; AVX2-NEXT:    movl %eax, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill
> -; AVX2-NEXT:    vmovd %xmm1, %eax
> -; AVX2-NEXT:    leal -1(%rdx,%rax), %eax
> -; AVX2-NEXT:    movl %eax, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill
> -; AVX2-NEXT:    vpextrd $2, %xmm8, %eax
> -; AVX2-NEXT:    leal -1(%rsi,%rax), %eax
> -; AVX2-NEXT:    movl %eax, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill
> -; AVX2-NEXT:    vmovd %xmm8, %eax
> -; AVX2-NEXT:    leal -1(%rbx,%rax), %r10d
> -; AVX2-NEXT:    vpextrd $2, %xmm6, %eax
> -; AVX2-NEXT:    leal -1(%rdi,%rax), %r8d
> -; AVX2-NEXT:    vmovd %xmm6, %eax
> -; AVX2-NEXT:    leal -1(%rcx,%rax), %edi
> -; AVX2-NEXT:    vpextrd $2, %xmm3, %eax
> -; AVX2-NEXT:    leal -1(%r9,%rax), %r9d
> -; AVX2-NEXT:    vmovd %xmm3, %ecx
> -; AVX2-NEXT:    leal -1(%r12,%rcx), %r12d
> -; AVX2-NEXT:    vpextrq $1, %xmm0, %rcx
> -; AVX2-NEXT:    leal -1(%r15,%rcx), %r15d
> -; AVX2-NEXT:    vmovq %xmm0, %rcx
> -; AVX2-NEXT:    leal -1(%r14,%rcx), %r14d
> -; AVX2-NEXT:    vpextrq $1, %xmm2, %rdx
> -; AVX2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> -; AVX2-NEXT:    leal -1(%rax,%rdx), %edx
> +; AVX2-NEXT:    vextracti128 $1, %ymm0, %xmm2
> +; AVX2-NEXT:    vpmovzxdq {{.*#+}} ymm2 = xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero
> +; AVX2-NEXT:    vextracti128 $1, %ymm2, %xmm3
> +; AVX2-NEXT:    vpextrq $1, %xmm3, %rax
> +; AVX2-NEXT:    addq %r14, %rax
> +; AVX2-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX2-NEXT:    vmovq %xmm3, %rax
> +; AVX2-NEXT:    addq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Folded Reload
> +; AVX2-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX2-NEXT:    vpextrq $1, %xmm2, %rax
> +; AVX2-NEXT:    addq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Folded Reload
> +; AVX2-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
>  ; AVX2-NEXT:    vmovq %xmm2, %rax
> -; AVX2-NEXT:    vextracti128 $1, %ymm11, %xmm0
> +; AVX2-NEXT:    addq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Folded Reload
> +; AVX2-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX2-NEXT:    vpmovzxdq {{.*#+}} ymm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero
> +; AVX2-NEXT:    vpextrq $1, %xmm0, %rbp
> +; AVX2-NEXT:    addq {{[-0-9]+}}(%r{{[sb]}}p), %rbp # 8-byte Folded Reload
> +; AVX2-NEXT:    vmovq %xmm0, %r12
> +; AVX2-NEXT:    addq %r10, %r12
> +; AVX2-NEXT:    vpextrq $1, %xmm1, %rax
> +; AVX2-NEXT:    vextracti128 $1, %ymm0, %xmm0
> +; AVX2-NEXT:    vpextrq $1, %xmm0, %r10
> +; AVX2-NEXT:    addq %rax, %r10
> +; AVX2-NEXT:    vmovq %xmm1, %rax
> +; AVX2-NEXT:    vmovq %xmm0, %rdx
> +; AVX2-NEXT:    addq %rax, %rdx
> +; AVX2-NEXT:    addq $-1, %rbx
> +; AVX2-NEXT:    movq %rbx, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX2-NEXT:    movl $0, %eax
> +; AVX2-NEXT:    adcq $-1, %rax
> +; AVX2-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX2-NEXT:    addq $-1, %rsi
> +; AVX2-NEXT:    movq %rsi, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX2-NEXT:    movl $0, %eax
> +; AVX2-NEXT:    adcq $-1, %rax
> +; AVX2-NEXT:    movq %rax, (%rsp) # 8-byte Spill
> +; AVX2-NEXT:    addq $-1, %rdi
> +; AVX2-NEXT:    movq %rdi, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX2-NEXT:    movl $0, %eax
> +; AVX2-NEXT:    adcq $-1, %rax
> +; AVX2-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX2-NEXT:    addq $-1, %r11
> +; AVX2-NEXT:    movq %r11, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX2-NEXT:    movl $0, %eax
> +; AVX2-NEXT:    adcq $-1, %rax
> +; AVX2-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX2-NEXT:    addq $-1, %rcx
> +; AVX2-NEXT:    movq %rcx, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX2-NEXT:    movl $0, %eax
> +; AVX2-NEXT:    adcq $-1, %rax
> +; AVX2-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX2-NEXT:    addq $-1, %r8
> +; AVX2-NEXT:    movq %r8, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX2-NEXT:    movl $0, %eax
> +; AVX2-NEXT:    adcq $-1, %rax
> +; AVX2-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX2-NEXT:    addq $-1, %r9
> +; AVX2-NEXT:    movq %r9, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX2-NEXT:    movl $0, %eax
> +; AVX2-NEXT:    adcq $-1, %rax
> +; AVX2-NEXT:    movq %rax, %rsi
> +; AVX2-NEXT:    addq $-1, %r15
> +; AVX2-NEXT:    movq %r15, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX2-NEXT:    movl $0, %r15d
> +; AVX2-NEXT:    adcq $-1, %r15
> +; AVX2-NEXT:    addq $-1, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Folded Spill
> +; AVX2-NEXT:    movl $0, %r13d
> +; AVX2-NEXT:    adcq $-1, %r13
> +; AVX2-NEXT:    addq $-1, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Folded Spill
> +; AVX2-NEXT:    movl $0, %r14d
> +; AVX2-NEXT:    adcq $-1, %r14
> +; AVX2-NEXT:    addq $-1, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Folded Spill
> +; AVX2-NEXT:    movl $0, %ebx
> +; AVX2-NEXT:    adcq $-1, %rbx
> +; AVX2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> +; AVX2-NEXT:    addq $-1, %rax
> +; AVX2-NEXT:    movl $0, %r11d
> +; AVX2-NEXT:    adcq $-1, %r11
> +; AVX2-NEXT:    addq $-1, %rbp
> +; AVX2-NEXT:    movl $0, %r9d
> +; AVX2-NEXT:    adcq $-1, %r9
> +; AVX2-NEXT:    addq $-1, %r12
> +; AVX2-NEXT:    movl $0, %r8d
> +; AVX2-NEXT:    adcq $-1, %r8
> +; AVX2-NEXT:    addq $-1, %r10
> +; AVX2-NEXT:    movl $0, %edi
> +; AVX2-NEXT:    adcq $-1, %rdi
> +; AVX2-NEXT:    addq $-1, %rdx
> +; AVX2-NEXT:    movl $0, %ecx
> +; AVX2-NEXT:    adcq $-1, %rcx
> +; AVX2-NEXT:    shldq $63, %rdx, %rcx
> +; AVX2-NEXT:    movq %rcx, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX2-NEXT:    shldq $63, %r10, %rdi
> +; AVX2-NEXT:    shldq $63, %r12, %r8
> +; AVX2-NEXT:    shldq $63, %rbp, %r9
> +; AVX2-NEXT:    shldq $63, %rax, %r11
> +; AVX2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rdx # 8-byte Reload
> +; AVX2-NEXT:    shldq $63, %rdx, %rbx
> +; AVX2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rdx # 8-byte Reload
> +; AVX2-NEXT:    shldq $63, %rdx, %r14
> +; AVX2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rdx # 8-byte Reload
> +; AVX2-NEXT:    shldq $63, %rdx, %r13
> +; AVX2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> +; AVX2-NEXT:    shldq $63, %rax, %r15
> +; AVX2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> +; AVX2-NEXT:    shldq $63, %rax, %rsi
> +; AVX2-NEXT:    movq %rsi, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rsi # 8-byte Reload
> +; AVX2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> +; AVX2-NEXT:    shldq $63, %rax, %rsi
> +; AVX2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %r12 # 8-byte Reload
> +; AVX2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> +; AVX2-NEXT:    shldq $63, %rax, %r12
>  ; AVX2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rcx # 8-byte Reload
> -; AVX2-NEXT:    leal -1(%rcx,%rax), %eax
> -; AVX2-NEXT:    vpextrq $1, %xmm0, %rsi
> -; AVX2-NEXT:    leal -1(%r13,%rsi), %esi
> -; AVX2-NEXT:    vmovq %xmm0, %rbx
> -; AVX2-NEXT:    leal -1(%r11,%rbx), %ebx
> -; AVX2-NEXT:    vpextrq $1, %xmm10, %rcx
> -; AVX2-NEXT:    vpextrq $1, %xmm11, %r13
> -; AVX2-NEXT:    leal -1(%rcx,%r13), %ecx
> -; AVX2-NEXT:    vmovq %xmm10, %r13
> -; AVX2-NEXT:    vmovq %xmm11, %r11
> -; AVX2-NEXT:    leaq -1(%r13,%r11), %rbp
> -; AVX2-NEXT:    shrq %rbp
> -; AVX2-NEXT:    vmovd %ebp, %xmm0
> -; AVX2-NEXT:    shrl %ecx
> -; AVX2-NEXT:    vpinsrb $1, %ecx, %xmm0, %xmm0
> -; AVX2-NEXT:    shrl %ebx
> -; AVX2-NEXT:    vpinsrb $2, %ebx, %xmm0, %xmm0
> -; AVX2-NEXT:    shrl %esi
> -; AVX2-NEXT:    vpinsrb $3, %esi, %xmm0, %xmm0
> -; AVX2-NEXT:    shrl %eax
> -; AVX2-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0
> -; AVX2-NEXT:    shrl %edx
> -; AVX2-NEXT:    vpinsrb $5, %edx, %xmm0, %xmm0
> -; AVX2-NEXT:    shrl %r14d
> -; AVX2-NEXT:    vpinsrb $6, %r14d, %xmm0, %xmm0
> -; AVX2-NEXT:    shrl %r15d
> -; AVX2-NEXT:    vpinsrb $7, %r15d, %xmm0, %xmm0
> -; AVX2-NEXT:    shrl %r12d
> -; AVX2-NEXT:    vpinsrb $8, %r12d, %xmm0, %xmm0
> -; AVX2-NEXT:    shrl %r9d
> -; AVX2-NEXT:    vpinsrb $9, %r9d, %xmm0, %xmm0
> -; AVX2-NEXT:    shrl %edi
> -; AVX2-NEXT:    vpinsrb $10, %edi, %xmm0, %xmm0
> -; AVX2-NEXT:    shrl %r8d
> -; AVX2-NEXT:    vpinsrb $11, %r8d, %xmm0, %xmm0
> -; AVX2-NEXT:    shrl %r10d
> -; AVX2-NEXT:    vpinsrb $12, %r10d, %xmm0, %xmm0
> -; AVX2-NEXT:    movl {{[-0-9]+}}(%r{{[sb]}}p), %eax # 4-byte Reload
> -; AVX2-NEXT:    shrl %eax
> -; AVX2-NEXT:    vpinsrb $13, %eax, %xmm0, %xmm0
> -; AVX2-NEXT:    movl {{[-0-9]+}}(%r{{[sb]}}p), %eax # 4-byte Reload
> -; AVX2-NEXT:    shrl %eax
> -; AVX2-NEXT:    vpinsrb $14, %eax, %xmm0, %xmm0
> -; AVX2-NEXT:    movl {{[-0-9]+}}(%r{{[sb]}}p), %eax # 4-byte Reload
> -; AVX2-NEXT:    shrl %eax
> -; AVX2-NEXT:    vpinsrb $15, %eax, %xmm0, %xmm0
> +; AVX2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> +; AVX2-NEXT:    shldq $63, %rax, %rcx
> +; AVX2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %r10 # 8-byte Reload
> +; AVX2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> +; AVX2-NEXT:    shldq $63, %rax, %r10
> +; AVX2-NEXT:    movq (%rsp), %rax # 8-byte Reload
> +; AVX2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rdx # 8-byte Reload
> +; AVX2-NEXT:    shldq $63, %rdx, %rax
> +; AVX2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rdx # 8-byte Reload
> +; AVX2-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rbp # 8-byte Reload
> +; AVX2-NEXT:    shldq $63, %rdx, %rbp
> +; AVX2-NEXT:    vmovq %rbp, %xmm8
> +; AVX2-NEXT:    vmovq %rax, %xmm9
> +; AVX2-NEXT:    vmovq %r10, %xmm0
> +; AVX2-NEXT:    vmovq %rcx, %xmm1
> +; AVX2-NEXT:    vmovq %r12, %xmm12
> +; AVX2-NEXT:    vmovq %rsi, %xmm13
> +; AVX2-NEXT:    vmovq {{[-0-9]+}}(%r{{[sb]}}p), %xmm14 # 8-byte Folded Reload
> +; AVX2-NEXT:    # xmm14 = mem[0],zero
> +; AVX2-NEXT:    vmovq %r15, %xmm15
> +; AVX2-NEXT:    vmovq %r13, %xmm10
> +; AVX2-NEXT:    vmovq %r14, %xmm11
> +; AVX2-NEXT:    vmovq %rbx, %xmm2
> +; AVX2-NEXT:    vmovq %r11, %xmm3
> +; AVX2-NEXT:    vmovq %r9, %xmm4
> +; AVX2-NEXT:    vmovq %r8, %xmm5
> +; AVX2-NEXT:    vmovq %rdi, %xmm6
> +; AVX2-NEXT:    vmovq {{[-0-9]+}}(%r{{[sb]}}p), %xmm7 # 8-byte Folded Reload
> +; AVX2-NEXT:    # xmm7 = mem[0],zero
> +; AVX2-NEXT:    vpunpcklbw {{.*#+}} xmm8 = xmm9[0],xmm8[0],xmm9[1],xmm8[1],xmm9[2],xmm8[2],xmm9[3],xmm8[3],xmm9[4],xmm8[4],xmm9[5],xmm8[5],xmm9[6],xmm8[6],xmm9[7],xmm8[7]
> +; AVX2-NEXT:    vpunpcklbw {{.*#+}} xmm9 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3],xmm1[4],xmm0[4],xmm1[5],xmm0[5],xmm1[6],xmm0[6],xmm1[7],xmm0[7]
> +; AVX2-NEXT:    vpbroadcastw %xmm8, %xmm8
> +; AVX2-NEXT:    vpbroadcastw %xmm9, %xmm0
> +; AVX2-NEXT:    vpblendw {{.*#+}} xmm8 = xmm0[0,1,2,3,4,5,6],xmm8[7]
> +; AVX2-NEXT:    vpunpcklbw {{.*#+}} xmm0 = xmm13[0],xmm12[0],xmm13[1],xmm12[1],xmm13[2],xmm12[2],xmm13[3],xmm12[3],xmm13[4],xmm12[4],xmm13[5],xmm12[5],xmm13[6],xmm12[6],xmm13[7],xmm12[7]
> +; AVX2-NEXT:    vpunpcklbw {{.*#+}} xmm9 = xmm15[0],xmm14[0],xmm15[1],xmm14[1],xmm15[2],xmm14[2],xmm15[3],xmm14[3],xmm15[4],xmm14[4],xmm15[5],xmm14[5],xmm15[6],xmm14[6],xmm15[7],xmm14[7]
> +; AVX2-NEXT:    vpbroadcastw %xmm0, %xmm0
> +; AVX2-NEXT:    vpbroadcastw %xmm9, %xmm1
> +; AVX2-NEXT:    vpblendw {{.*#+}} xmm0 = xmm1[0,1,2,3,4],xmm0[5],xmm1[6,7]
> +; AVX2-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0,1,2],xmm8[3]
> +; AVX2-NEXT:    vpunpcklbw {{.*#+}} xmm1 = xmm11[0],xmm10[0],xmm11[1],xmm10[1],xmm11[2],xmm10[2],xmm11[3],xmm10[3],xmm11[4],xmm10[4],xmm11[5],xmm10[5],xmm11[6],xmm10[6],xmm11[7],xmm10[7]
> +; AVX2-NEXT:    vpunpcklbw {{.*#+}} xmm2 = xmm3[0],xmm2[0],xmm3[1],xmm2[1],xmm3[2],xmm2[2],xmm3[3],xmm2[3],xmm3[4],xmm2[4],xmm3[5],xmm2[5],xmm3[6],xmm2[6],xmm3[7],xmm2[7]
> +; AVX2-NEXT:    vpbroadcastw %xmm1, %xmm1
> +; AVX2-NEXT:    vpbroadcastw %xmm2, %xmm2
> +; AVX2-NEXT:    vpblendw {{.*#+}} xmm1 = xmm2[0,1,2],xmm1[3],xmm2[4,5,6,7]
> +; AVX2-NEXT:    vpunpcklbw {{.*#+}} xmm2 = xmm5[0],xmm4[0],xmm5[1],xmm4[1],xmm5[2],xmm4[2],xmm5[3],xmm4[3],xmm5[4],xmm4[4],xmm5[5],xmm4[5],xmm5[6],xmm4[6],xmm5[7],xmm4[7]
> +; AVX2-NEXT:    vpunpcklbw {{.*#+}} xmm3 = xmm7[0],xmm6[0],xmm7[1],xmm6[1],xmm7[2],xmm6[2],xmm7[3],xmm6[3],xmm7[4],xmm6[4],xmm7[5],xmm6[5],xmm7[6],xmm6[6],xmm7[7],xmm6[7]
> +; AVX2-NEXT:    vpbroadcastw %xmm3, %xmm3
> +; AVX2-NEXT:    vpblendw {{.*#+}} xmm2 = xmm2[0],xmm3[1],xmm2[2,3,4,5,6,7]
> +; AVX2-NEXT:    vpblendd {{.*#+}} xmm1 = xmm2[0],xmm1[1],xmm2[2,3]
> +; AVX2-NEXT:    vpblendd {{.*#+}} xmm0 = xmm1[0,1],xmm0[2,3]
>  ; AVX2-NEXT:    vmovdqu %xmm0, (%rax)
> +; AVX2-NEXT:    addq $16, %rsp
>  ; AVX2-NEXT:    popq %rbx
>  ; AVX2-NEXT:    popq %r12
>  ; AVX2-NEXT:    popq %r13
> @@ -2280,139 +2512,414 @@ define void @not_avg_v16i8_wide_constant
>  ; AVX2-NEXT:    vzeroupper
>  ; AVX2-NEXT:    retq
>  ;
> -; AVX512-LABEL: not_avg_v16i8_wide_constants:
> -; AVX512:       # %bb.0:
> -; AVX512-NEXT:    pushq %rbp
> -; AVX512-NEXT:    pushq %r15
> -; AVX512-NEXT:    pushq %r14
> -; AVX512-NEXT:    pushq %r13
> -; AVX512-NEXT:    pushq %r12
> -; AVX512-NEXT:    pushq %rbx
> -; AVX512-NEXT:    vpmovzxbw {{.*#+}} ymm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
> -; AVX512-NEXT:    vpmovzxbw {{.*#+}} ymm0 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
> -; AVX512-NEXT:    vpmovzxwd {{.*#+}} ymm2 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero
> -; AVX512-NEXT:    vpmovzxdq {{.*#+}} ymm10 = xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero
> -; AVX512-NEXT:    vextracti128 $1, %ymm1, %xmm1
> -; AVX512-NEXT:    vpmovzxwd {{.*#+}} ymm1 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero
> -; AVX512-NEXT:    vextracti128 $1, %ymm1, %xmm3
> -; AVX512-NEXT:    vpmovzxdq {{.*#+}} ymm5 = xmm3[0],zero,xmm3[1],zero,xmm3[2],zero,xmm3[3],zero
> -; AVX512-NEXT:    vextracti128 $1, %ymm5, %xmm4
> -; AVX512-NEXT:    vpmovzxdq {{.*#+}} ymm9 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero
> -; AVX512-NEXT:    vextracti128 $1, %ymm9, %xmm7
> -; AVX512-NEXT:    vextracti128 $1, %ymm2, %xmm1
> -; AVX512-NEXT:    vpmovzxdq {{.*#+}} ymm1 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero
> -; AVX512-NEXT:    vextracti128 $1, %ymm1, %xmm2
> -; AVX512-NEXT:    vpextrq $1, %xmm2, %r15
> -; AVX512-NEXT:    vmovq %xmm2, %r14
> -; AVX512-NEXT:    vpextrq $1, %xmm1, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Folded Spill
> -; AVX512-NEXT:    vmovq %xmm1, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Folded Spill
> -; AVX512-NEXT:    vextracti128 $1, %ymm10, %xmm1
> -; AVX512-NEXT:    vpextrq $1, %xmm1, %r13
> -; AVX512-NEXT:    vmovq %xmm1, %r11
> -; AVX512-NEXT:    vpmovzxwd {{.*#+}} ymm2 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
> -; AVX512-NEXT:    vpmovzxdq {{.*#+}} ymm11 = xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero
> -; AVX512-NEXT:    vextracti128 $1, %ymm0, %xmm0
> -; AVX512-NEXT:    vpmovzxwd {{.*#+}} ymm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
> -; AVX512-NEXT:    vextracti128 $1, %ymm0, %xmm1
> -; AVX512-NEXT:    vpmovzxdq {{.*#+}} ymm8 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero
> -; AVX512-NEXT:    vextracti128 $1, %ymm8, %xmm1
> -; AVX512-NEXT:    vpmovzxdq {{.*#+}} ymm3 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero
> -; AVX512-NEXT:    vextracti128 $1, %ymm3, %xmm6
> -; AVX512-NEXT:    vextracti128 $1, %ymm2, %xmm0
> -; AVX512-NEXT:    vpmovzxdq {{.*#+}} ymm2 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero
> -; AVX512-NEXT:    vmovd %xmm9, %r12d
> -; AVX512-NEXT:    vpextrd $2, %xmm9, %r9d
> -; AVX512-NEXT:    vextracti128 $1, %ymm2, %xmm0
> -; AVX512-NEXT:    vmovd %xmm7, %ecx
> -; AVX512-NEXT:    vpextrd $2, %xmm7, %edi
> -; AVX512-NEXT:    vmovd %xmm5, %ebx
> -; AVX512-NEXT:    vpextrd $2, %xmm5, %esi
> -; AVX512-NEXT:    vmovd %xmm4, %edx
> -; AVX512-NEXT:    vpextrd $2, %xmm4, %ebp
> -; AVX512-NEXT:    vpextrd $2, %xmm1, %eax
> -; AVX512-NEXT:    leal -1(%rbp,%rax), %eax
> -; AVX512-NEXT:    movl %eax, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill
> -; AVX512-NEXT:    vmovd %xmm1, %eax
> -; AVX512-NEXT:    leal -1(%rdx,%rax), %eax
> -; AVX512-NEXT:    movl %eax, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill
> -; AVX512-NEXT:    vpextrd $2, %xmm8, %eax
> -; AVX512-NEXT:    leal -1(%rsi,%rax), %eax
> -; AVX512-NEXT:    movl %eax, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill
> -; AVX512-NEXT:    vmovd %xmm8, %eax
> -; AVX512-NEXT:    leal -1(%rbx,%rax), %r10d
> -; AVX512-NEXT:    vpextrd $2, %xmm6, %eax
> -; AVX512-NEXT:    leal -1(%rdi,%rax), %r8d
> -; AVX512-NEXT:    vmovd %xmm6, %eax
> -; AVX512-NEXT:    leal -1(%rcx,%rax), %edi
> -; AVX512-NEXT:    vpextrd $2, %xmm3, %eax
> -; AVX512-NEXT:    leal -1(%r9,%rax), %r9d
> -; AVX512-NEXT:    vmovd %xmm3, %ecx
> -; AVX512-NEXT:    leal -1(%r12,%rcx), %r12d
> -; AVX512-NEXT:    vpextrq $1, %xmm0, %rcx
> -; AVX512-NEXT:    leal -1(%r15,%rcx), %r15d
> -; AVX512-NEXT:    vmovq %xmm0, %rcx
> -; AVX512-NEXT:    leal -1(%r14,%rcx), %r14d
> -; AVX512-NEXT:    vpextrq $1, %xmm2, %rdx
> -; AVX512-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> -; AVX512-NEXT:    leal -1(%rax,%rdx), %edx
> -; AVX512-NEXT:    vmovq %xmm2, %rax
> -; AVX512-NEXT:    vextracti128 $1, %ymm11, %xmm0
> -; AVX512-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rcx # 8-byte Reload
> -; AVX512-NEXT:    leal -1(%rcx,%rax), %eax
> -; AVX512-NEXT:    vpextrq $1, %xmm0, %rsi
> -; AVX512-NEXT:    leal -1(%r13,%rsi), %esi
> -; AVX512-NEXT:    vmovq %xmm0, %rbx
> -; AVX512-NEXT:    leal -1(%r11,%rbx), %ebx
> -; AVX512-NEXT:    vpextrq $1, %xmm10, %rcx
> -; AVX512-NEXT:    vpextrq $1, %xmm11, %r13
> -; AVX512-NEXT:    leal -1(%rcx,%r13), %ecx
> -; AVX512-NEXT:    vmovq %xmm10, %r13
> -; AVX512-NEXT:    vmovq %xmm11, %r11
> -; AVX512-NEXT:    leaq -1(%r13,%r11), %rbp
> -; AVX512-NEXT:    shrq %rbp
> -; AVX512-NEXT:    vmovd %ebp, %xmm0
> -; AVX512-NEXT:    shrl %ecx
> -; AVX512-NEXT:    vpinsrb $1, %ecx, %xmm0, %xmm0
> -; AVX512-NEXT:    shrl %ebx
> -; AVX512-NEXT:    vpinsrb $2, %ebx, %xmm0, %xmm0
> -; AVX512-NEXT:    shrl %esi
> -; AVX512-NEXT:    vpinsrb $3, %esi, %xmm0, %xmm0
> -; AVX512-NEXT:    shrl %eax
> -; AVX512-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0
> -; AVX512-NEXT:    shrl %edx
> -; AVX512-NEXT:    vpinsrb $5, %edx, %xmm0, %xmm0
> -; AVX512-NEXT:    shrl %r14d
> -; AVX512-NEXT:    vpinsrb $6, %r14d, %xmm0, %xmm0
> -; AVX512-NEXT:    shrl %r15d
> -; AVX512-NEXT:    vpinsrb $7, %r15d, %xmm0, %xmm0
> -; AVX512-NEXT:    shrl %r12d
> -; AVX512-NEXT:    vpinsrb $8, %r12d, %xmm0, %xmm0
> -; AVX512-NEXT:    shrl %r9d
> -; AVX512-NEXT:    vpinsrb $9, %r9d, %xmm0, %xmm0
> -; AVX512-NEXT:    shrl %edi
> -; AVX512-NEXT:    vpinsrb $10, %edi, %xmm0, %xmm0
> -; AVX512-NEXT:    shrl %r8d
> -; AVX512-NEXT:    vpinsrb $11, %r8d, %xmm0, %xmm0
> -; AVX512-NEXT:    shrl %r10d
> -; AVX512-NEXT:    vpinsrb $12, %r10d, %xmm0, %xmm0
> -; AVX512-NEXT:    movl {{[-0-9]+}}(%r{{[sb]}}p), %eax # 4-byte Reload
> -; AVX512-NEXT:    shrl %eax
> -; AVX512-NEXT:    vpinsrb $13, %eax, %xmm0, %xmm0
> -; AVX512-NEXT:    movl {{[-0-9]+}}(%r{{[sb]}}p), %eax # 4-byte Reload
> -; AVX512-NEXT:    shrl %eax
> -; AVX512-NEXT:    vpinsrb $14, %eax, %xmm0, %xmm0
> -; AVX512-NEXT:    movl {{[-0-9]+}}(%r{{[sb]}}p), %eax # 4-byte Reload
> -; AVX512-NEXT:    shrl %eax
> -; AVX512-NEXT:    vpinsrb $15, %eax, %xmm0, %xmm0
> -; AVX512-NEXT:    vmovdqu %xmm0, (%rax)
> -; AVX512-NEXT:    popq %rbx
> -; AVX512-NEXT:    popq %r12
> -; AVX512-NEXT:    popq %r13
> -; AVX512-NEXT:    popq %r14
> -; AVX512-NEXT:    popq %r15
> -; AVX512-NEXT:    popq %rbp
> -; AVX512-NEXT:    vzeroupper
> -; AVX512-NEXT:    retq
> +; AVX512F-LABEL: not_avg_v16i8_wide_constants:
> +; AVX512F:       # %bb.0:
> +; AVX512F-NEXT:    pushq %rbp
> +; AVX512F-NEXT:    pushq %r15
> +; AVX512F-NEXT:    pushq %r14
> +; AVX512F-NEXT:    pushq %r13
> +; AVX512F-NEXT:    pushq %r12
> +; AVX512F-NEXT:    pushq %rbx
> +; AVX512F-NEXT:    vpmovzxbw {{.*#+}} ymm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
> +; AVX512F-NEXT:    vpmovzxbw {{.*#+}} ymm3 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
> +; AVX512F-NEXT:    vpmovzxwd {{.*#+}} ymm2 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero
> +; AVX512F-NEXT:    vpmovzxdq {{.*#+}} ymm0 = xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero
> +; AVX512F-NEXT:    vextracti128 $1, %ymm1, %xmm1
> +; AVX512F-NEXT:    vpmovzxwd {{.*#+}} ymm1 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero
> +; AVX512F-NEXT:    vextracti128 $1, %ymm1, %xmm4
> +; AVX512F-NEXT:    vpmovzxdq {{.*#+}} ymm4 = xmm4[0],zero,xmm4[1],zero,xmm4[2],zero,xmm4[3],zero
> +; AVX512F-NEXT:    vextracti128 $1, %ymm4, %xmm5
> +; AVX512F-NEXT:    vpextrq $1, %xmm5, %rdx
> +; AVX512F-NEXT:    vmovq %xmm5, %rcx
> +; AVX512F-NEXT:    vpextrq $1, %xmm4, %rax
> +; AVX512F-NEXT:    vmovq %xmm4, %rbx
> +; AVX512F-NEXT:    vpmovzxdq {{.*#+}} ymm1 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero
> +; AVX512F-NEXT:    vextracti128 $1, %ymm1, %xmm4
> +; AVX512F-NEXT:    vpextrq $1, %xmm4, %rdi
> +; AVX512F-NEXT:    vmovq %xmm4, %rsi
> +; AVX512F-NEXT:    vpextrq $1, %xmm1, %r13
> +; AVX512F-NEXT:    vmovq %xmm1, %r15
> +; AVX512F-NEXT:    vextracti128 $1, %ymm2, %xmm1
> +; AVX512F-NEXT:    vpmovzxdq {{.*#+}} ymm1 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero
> +; AVX512F-NEXT:    vextracti128 $1, %ymm1, %xmm2
> +; AVX512F-NEXT:    vpextrq $1, %xmm2, %r12
> +; AVX512F-NEXT:    vmovq %xmm2, %r14
> +; AVX512F-NEXT:    vpextrq $1, %xmm1, %r11
> +; AVX512F-NEXT:    vmovq %xmm1, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Folded Spill
> +; AVX512F-NEXT:    vextracti128 $1, %ymm0, %xmm1
> +; AVX512F-NEXT:    vpextrq $1, %xmm1, %r10
> +; AVX512F-NEXT:    vmovq %xmm1, %r9
> +; AVX512F-NEXT:    vpmovzxwd {{.*#+}} ymm2 = xmm3[0],zero,xmm3[1],zero,xmm3[2],zero,xmm3[3],zero,xmm3[4],zero,xmm3[5],zero,xmm3[6],zero,xmm3[7],zero
> +; AVX512F-NEXT:    vpmovzxdq {{.*#+}} ymm1 = xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero
> +; AVX512F-NEXT:    vextracti128 $1, %ymm3, %xmm3
> +; AVX512F-NEXT:    vpmovzxwd {{.*#+}} ymm3 = xmm3[0],zero,xmm3[1],zero,xmm3[2],zero,xmm3[3],zero,xmm3[4],zero,xmm3[5],zero,xmm3[6],zero,xmm3[7],zero
> +; AVX512F-NEXT:    vextracti128 $1, %ymm3, %xmm4
> +; AVX512F-NEXT:    vpmovzxdq {{.*#+}} ymm4 = xmm4[0],zero,xmm4[1],zero,xmm4[2],zero,xmm4[3],zero
> +; AVX512F-NEXT:    vextracti128 $1, %ymm4, %xmm5
> +; AVX512F-NEXT:    vpextrq $1, %xmm5, %rbp
> +; AVX512F-NEXT:    leal -1(%rdx,%rbp), %edx
> +; AVX512F-NEXT:    movl %edx, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill
> +; AVX512F-NEXT:    vmovq %xmm5, %rbp
> +; AVX512F-NEXT:    leal -1(%rcx,%rbp), %ecx
> +; AVX512F-NEXT:    movl %ecx, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill
> +; AVX512F-NEXT:    vpextrq $1, %xmm4, %rbp
> +; AVX512F-NEXT:    leal -1(%rax,%rbp), %eax
> +; AVX512F-NEXT:    movl %eax, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill
> +; AVX512F-NEXT:    vmovq %xmm4, %rbp
> +; AVX512F-NEXT:    vpmovzxdq {{.*#+}} ymm3 = xmm3[0],zero,xmm3[1],zero,xmm3[2],zero,xmm3[3],zero
> +; AVX512F-NEXT:    vextracti128 $1, %ymm3, %xmm4
> +; AVX512F-NEXT:    leal -1(%rbx,%rbp), %r8d
> +; AVX512F-NEXT:    vpextrq $1, %xmm4, %rbp
> +; AVX512F-NEXT:    leal -1(%rdi,%rbp), %edi
> +; AVX512F-NEXT:    vmovq %xmm4, %rbp
> +; AVX512F-NEXT:    leal -1(%rsi,%rbp), %esi
> +; AVX512F-NEXT:    vpextrq $1, %xmm3, %rbp
> +; AVX512F-NEXT:    leal -1(%r13,%rbp), %r13d
> +; AVX512F-NEXT:    vmovq %xmm3, %rbp
> +; AVX512F-NEXT:    vextracti128 $1, %ymm2, %xmm2
> +; AVX512F-NEXT:    vpmovzxdq {{.*#+}} ymm2 = xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero
> +; AVX512F-NEXT:    vextracti128 $1, %ymm2, %xmm3
> +; AVX512F-NEXT:    leal -1(%r15,%rbp), %r15d
> +; AVX512F-NEXT:    vpextrq $1, %xmm3, %rbp
> +; AVX512F-NEXT:    leal -1(%r12,%rbp), %r12d
> +; AVX512F-NEXT:    vmovq %xmm3, %rbp
> +; AVX512F-NEXT:    leal -1(%r14,%rbp), %r14d
> +; AVX512F-NEXT:    vpextrq $1, %xmm2, %rdx
> +; AVX512F-NEXT:    leal -1(%r11,%rdx), %r11d
> +; AVX512F-NEXT:    vmovq %xmm2, %rbp
> +; AVX512F-NEXT:    vextracti128 $1, %ymm1, %xmm2
> +; AVX512F-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> +; AVX512F-NEXT:    leal -1(%rax,%rbp), %ebp
> +; AVX512F-NEXT:    vpextrq $1, %xmm2, %rcx
> +; AVX512F-NEXT:    leal -1(%r10,%rcx), %ecx
> +; AVX512F-NEXT:    vmovq %xmm2, %rax
> +; AVX512F-NEXT:    leal -1(%r9,%rax), %eax
> +; AVX512F-NEXT:    vpextrq $1, %xmm0, %rdx
> +; AVX512F-NEXT:    vpextrq $1, %xmm1, %r10
> +; AVX512F-NEXT:    leal -1(%rdx,%r10), %edx
> +; AVX512F-NEXT:    vmovq %xmm0, %r10
> +; AVX512F-NEXT:    vmovq %xmm1, %r9
> +; AVX512F-NEXT:    leaq -1(%r10,%r9), %rbx
> +; AVX512F-NEXT:    shrq %rbx
> +; AVX512F-NEXT:    vmovd %ebx, %xmm0
> +; AVX512F-NEXT:    shrl %edx
> +; AVX512F-NEXT:    vpinsrb $1, %edx, %xmm0, %xmm0
> +; AVX512F-NEXT:    shrl %eax
> +; AVX512F-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0
> +; AVX512F-NEXT:    shrl %ecx
> +; AVX512F-NEXT:    vpinsrb $3, %ecx, %xmm0, %xmm0
> +; AVX512F-NEXT:    shrl %ebp
> +; AVX512F-NEXT:    vpinsrb $4, %ebp, %xmm0, %xmm0
> +; AVX512F-NEXT:    shrl %r11d
> +; AVX512F-NEXT:    vpinsrb $5, %r11d, %xmm0, %xmm0
> +; AVX512F-NEXT:    shrl %r14d
> +; AVX512F-NEXT:    vpinsrb $6, %r14d, %xmm0, %xmm0
> +; AVX512F-NEXT:    shrl %r12d
> +; AVX512F-NEXT:    vpinsrb $7, %r12d, %xmm0, %xmm0
> +; AVX512F-NEXT:    shrl %r15d
> +; AVX512F-NEXT:    vpinsrb $8, %r15d, %xmm0, %xmm0
> +; AVX512F-NEXT:    shrl %r13d
> +; AVX512F-NEXT:    vpinsrb $9, %r13d, %xmm0, %xmm0
> +; AVX512F-NEXT:    shrl %esi
> +; AVX512F-NEXT:    vpinsrb $10, %esi, %xmm0, %xmm0
> +; AVX512F-NEXT:    shrl %edi
> +; AVX512F-NEXT:    vpinsrb $11, %edi, %xmm0, %xmm0
> +; AVX512F-NEXT:    shrl %r8d
> +; AVX512F-NEXT:    vpinsrb $12, %r8d, %xmm0, %xmm0
> +; AVX512F-NEXT:    movl {{[-0-9]+}}(%r{{[sb]}}p), %eax # 4-byte Reload
> +; AVX512F-NEXT:    shrl %eax
> +; AVX512F-NEXT:    vpinsrb $13, %eax, %xmm0, %xmm0
> +; AVX512F-NEXT:    movl {{[-0-9]+}}(%r{{[sb]}}p), %eax # 4-byte Reload
> +; AVX512F-NEXT:    shrl %eax
> +; AVX512F-NEXT:    vpinsrb $14, %eax, %xmm0, %xmm0
> +; AVX512F-NEXT:    movl {{[-0-9]+}}(%r{{[sb]}}p), %eax # 4-byte Reload
> +; AVX512F-NEXT:    shrl %eax
> +; AVX512F-NEXT:    vpinsrb $15, %eax, %xmm0, %xmm0
> +; AVX512F-NEXT:    vmovdqu %xmm0, (%rax)
> +; AVX512F-NEXT:    popq %rbx
> +; AVX512F-NEXT:    popq %r12
> +; AVX512F-NEXT:    popq %r13
> +; AVX512F-NEXT:    popq %r14
> +; AVX512F-NEXT:    popq %r15
> +; AVX512F-NEXT:    popq %rbp
> +; AVX512F-NEXT:    vzeroupper
> +; AVX512F-NEXT:    retq
> +;
> +; AVX512BW-LABEL: not_avg_v16i8_wide_constants:
> +; AVX512BW:       # %bb.0:
> +; AVX512BW-NEXT:    pushq %rbp
> +; AVX512BW-NEXT:    pushq %r15
> +; AVX512BW-NEXT:    pushq %r14
> +; AVX512BW-NEXT:    pushq %r13
> +; AVX512BW-NEXT:    pushq %r12
> +; AVX512BW-NEXT:    pushq %rbx
> +; AVX512BW-NEXT:    subq $24, %rsp
> +; AVX512BW-NEXT:    vpmovzxbw {{.*#+}} ymm0 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
> +; AVX512BW-NEXT:    vpmovzxbw {{.*#+}} ymm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
> +; AVX512BW-NEXT:    vpmovzxwd {{.*#+}} ymm2 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
> +; AVX512BW-NEXT:    vpmovzxdq {{.*#+}} ymm3 = xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero
> +; AVX512BW-NEXT:    vextracti128 $1, %ymm3, %xmm4
> +; AVX512BW-NEXT:    vmovq %xmm4, %rbx
> +; AVX512BW-NEXT:    vpextrq $1, %xmm4, %rbp
> +; AVX512BW-NEXT:    vmovq %xmm3, %rdi
> +; AVX512BW-NEXT:    vpextrq $1, %xmm3, %rsi
> +; AVX512BW-NEXT:    vextracti128 $1, %ymm2, %xmm2
> +; AVX512BW-NEXT:    vpmovzxdq {{.*#+}} ymm2 = xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero
> +; AVX512BW-NEXT:    vextracti128 $1, %ymm2, %xmm3
> +; AVX512BW-NEXT:    vmovq %xmm3, %rdx
> +; AVX512BW-NEXT:    vpextrq $1, %xmm3, %r15
> +; AVX512BW-NEXT:    vmovq %xmm2, %r8
> +; AVX512BW-NEXT:    vpextrq $1, %xmm2, %r14
> +; AVX512BW-NEXT:    vextracti128 $1, %ymm0, %xmm0
> +; AVX512BW-NEXT:    vpmovzxwd {{.*#+}} ymm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
> +; AVX512BW-NEXT:    vpmovzxdq {{.*#+}} ymm2 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero
> +; AVX512BW-NEXT:    vextracti128 $1, %ymm2, %xmm3
> +; AVX512BW-NEXT:    vmovq %xmm3, %r9
> +; AVX512BW-NEXT:    vpextrq $1, %xmm3, %r10
> +; AVX512BW-NEXT:    vmovq %xmm2, %r11
> +; AVX512BW-NEXT:    vpextrq $1, %xmm2, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Folded Spill
> +; AVX512BW-NEXT:    vextracti128 $1, %ymm0, %xmm0
> +; AVX512BW-NEXT:    vpmovzxdq {{.*#+}} ymm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero
> +; AVX512BW-NEXT:    vextracti128 $1, %ymm0, %xmm2
> +; AVX512BW-NEXT:    vmovq %xmm2, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Folded Spill
> +; AVX512BW-NEXT:    vpextrq $1, %xmm2, %r13
> +; AVX512BW-NEXT:    vpmovzxwd {{.*#+}} ymm2 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero
> +; AVX512BW-NEXT:    vpmovzxdq {{.*#+}} ymm3 = xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero
> +; AVX512BW-NEXT:    vextracti128 $1, %ymm3, %xmm4
> +; AVX512BW-NEXT:    vmovq %xmm4, %rax
> +; AVX512BW-NEXT:    addq %rbx, %rax
> +; AVX512BW-NEXT:    movq %rax, %rbx
> +; AVX512BW-NEXT:    vpextrq $1, %xmm4, %rax
> +; AVX512BW-NEXT:    addq %rbp, %rax
> +; AVX512BW-NEXT:    movq %rax, %rbp
> +; AVX512BW-NEXT:    vmovq %xmm3, %rcx
> +; AVX512BW-NEXT:    addq %rdi, %rcx
> +; AVX512BW-NEXT:    vpextrq $1, %xmm3, %r12
> +; AVX512BW-NEXT:    addq %rsi, %r12
> +; AVX512BW-NEXT:    vextracti128 $1, %ymm2, %xmm2
> +; AVX512BW-NEXT:    vpmovzxdq {{.*#+}} ymm2 = xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero
> +; AVX512BW-NEXT:    vextracti128 $1, %ymm2, %xmm3
> +; AVX512BW-NEXT:    vmovq %xmm3, %rax
> +; AVX512BW-NEXT:    addq %rdx, %rax
> +; AVX512BW-NEXT:    movq %rax, %rdx
> +; AVX512BW-NEXT:    vpextrq $1, %xmm3, %rax
> +; AVX512BW-NEXT:    addq %r15, %rax
> +; AVX512BW-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX512BW-NEXT:    vmovq %xmm2, %rax
> +; AVX512BW-NEXT:    addq %r8, %rax
> +; AVX512BW-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX512BW-NEXT:    vpextrq $1, %xmm2, %rax
> +; AVX512BW-NEXT:    addq %r14, %rax
> +; AVX512BW-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX512BW-NEXT:    vextracti128 $1, %ymm1, %xmm1
> +; AVX512BW-NEXT:    vpmovzxwd {{.*#+}} ymm1 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero
> +; AVX512BW-NEXT:    vpmovzxdq {{.*#+}} ymm2 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero
> +; AVX512BW-NEXT:    vextracti128 $1, %ymm2, %xmm3
> +; AVX512BW-NEXT:    vmovq %xmm3, %rax
> +; AVX512BW-NEXT:    addq %r9, %rax
> +; AVX512BW-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX512BW-NEXT:    vpextrq $1, %xmm3, %rax
> +; AVX512BW-NEXT:    addq %r10, %rax
> +; AVX512BW-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX512BW-NEXT:    vmovq %xmm2, %rax
> +; AVX512BW-NEXT:    addq %r11, %rax
> +; AVX512BW-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX512BW-NEXT:    vpextrq $1, %xmm2, %r14
> +; AVX512BW-NEXT:    addq {{[-0-9]+}}(%r{{[sb]}}p), %r14 # 8-byte Folded Reload
> +; AVX512BW-NEXT:    vextracti128 $1, %ymm1, %xmm1
> +; AVX512BW-NEXT:    vpmovzxdq {{.*#+}} ymm1 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero
> +; AVX512BW-NEXT:    vextracti128 $1, %ymm1, %xmm2
> +; AVX512BW-NEXT:    vmovq %xmm2, %r10
> +; AVX512BW-NEXT:    addq {{[-0-9]+}}(%r{{[sb]}}p), %r10 # 8-byte Folded Reload
> +; AVX512BW-NEXT:    vpextrq $1, %xmm2, %r9
> +; AVX512BW-NEXT:    addq %r13, %r9
> +; AVX512BW-NEXT:    vmovq %xmm0, %rax
> +; AVX512BW-NEXT:    vmovq %xmm1, %r8
> +; AVX512BW-NEXT:    addq %rax, %r8
> +; AVX512BW-NEXT:    vpextrq $1, %xmm0, %rdi
> +; AVX512BW-NEXT:    vpextrq $1, %xmm1, %rsi
> +; AVX512BW-NEXT:    addq %rdi, %rsi
> +; AVX512BW-NEXT:    addq $-1, %rbx
> +; AVX512BW-NEXT:    movq %rbx, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX512BW-NEXT:    movl $0, %r15d
> +; AVX512BW-NEXT:    adcq $-1, %r15
> +; AVX512BW-NEXT:    addq $-1, %rbp
> +; AVX512BW-NEXT:    movq %rbp, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX512BW-NEXT:    movl $0, %ebx
> +; AVX512BW-NEXT:    adcq $-1, %rbx
> +; AVX512BW-NEXT:    addq $-1, %rcx
> +; AVX512BW-NEXT:    movq %rcx, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX512BW-NEXT:    movl $0, %r11d
> +; AVX512BW-NEXT:    adcq $-1, %r11
> +; AVX512BW-NEXT:    addq $-1, %r12
> +; AVX512BW-NEXT:    movq %r12, (%rsp) # 8-byte Spill
> +; AVX512BW-NEXT:    movl $0, %edi
> +; AVX512BW-NEXT:    adcq $-1, %rdi
> +; AVX512BW-NEXT:    addq $-1, %rdx
> +; AVX512BW-NEXT:    movq %rdx, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX512BW-NEXT:    movl $0, %eax
> +; AVX512BW-NEXT:    adcq $-1, %rax
> +; AVX512BW-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX512BW-NEXT:    addq $-1, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Folded Spill
> +; AVX512BW-NEXT:    movl $0, %eax
> +; AVX512BW-NEXT:    adcq $-1, %rax
> +; AVX512BW-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX512BW-NEXT:    addq $-1, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Folded Spill
> +; AVX512BW-NEXT:    movl $0, %r13d
> +; AVX512BW-NEXT:    adcq $-1, %r13
> +; AVX512BW-NEXT:    addq $-1, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Folded Spill
> +; AVX512BW-NEXT:    movl $0, %r12d
> +; AVX512BW-NEXT:    adcq $-1, %r12
> +; AVX512BW-NEXT:    addq $-1, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Folded Spill
> +; AVX512BW-NEXT:    movl $0, %eax
> +; AVX512BW-NEXT:    adcq $-1, %rax
> +; AVX512BW-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX512BW-NEXT:    addq $-1, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Folded Spill
> +; AVX512BW-NEXT:    movl $0, %eax
> +; AVX512BW-NEXT:    adcq $-1, %rax
> +; AVX512BW-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX512BW-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rcx # 8-byte Reload
> +; AVX512BW-NEXT:    addq $-1, %rcx
> +; AVX512BW-NEXT:    movl $0, %eax
> +; AVX512BW-NEXT:    adcq $-1, %rax
> +; AVX512BW-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX512BW-NEXT:    addq $-1, %r14
> +; AVX512BW-NEXT:    movl $0, %eax
> +; AVX512BW-NEXT:    adcq $-1, %rax
> +; AVX512BW-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX512BW-NEXT:    addq $-1, %r10
> +; AVX512BW-NEXT:    movl $0, %eax
> +; AVX512BW-NEXT:    adcq $-1, %rax
> +; AVX512BW-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX512BW-NEXT:    addq $-1, %r9
> +; AVX512BW-NEXT:    movl $0, %edx
> +; AVX512BW-NEXT:    adcq $-1, %rdx
> +; AVX512BW-NEXT:    addq $-1, %r8
> +; AVX512BW-NEXT:    movl $0, %eax
> +; AVX512BW-NEXT:    adcq $-1, %rax
> +; AVX512BW-NEXT:    addq $-1, %rsi
> +; AVX512BW-NEXT:    movl $0, %ebp
> +; AVX512BW-NEXT:    adcq $-1, %rbp
> +; AVX512BW-NEXT:    shldq $63, %rsi, %rbp
> +; AVX512BW-NEXT:    movq %rbp, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX512BW-NEXT:    shldq $63, %r8, %rax
> +; AVX512BW-NEXT:    movq %rax, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
> +; AVX512BW-NEXT:    shldq $63, %r9, %rdx
> +; AVX512BW-NEXT:    movq %rdx, %rbp
> +; AVX512BW-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %r8 # 8-byte Reload
> +; AVX512BW-NEXT:    shldq $63, %r10, %r8
> +; AVX512BW-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %r10 # 8-byte Reload
> +; AVX512BW-NEXT:    shldq $63, %r14, %r10
> +; AVX512BW-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %r9 # 8-byte Reload
> +; AVX512BW-NEXT:    shldq $63, %rcx, %r9
> +; AVX512BW-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %r14 # 8-byte Reload
> +; AVX512BW-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> +; AVX512BW-NEXT:    shldq $63, %rax, %r14
> +; AVX512BW-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> +; AVX512BW-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rsi # 8-byte Reload
> +; AVX512BW-NEXT:    shldq $63, %rax, %rsi
> +; AVX512BW-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> +; AVX512BW-NEXT:    shldq $63, %rax, %r12
> +; AVX512BW-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> +; AVX512BW-NEXT:    shldq $63, %rax, %r13
> +; AVX512BW-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> +; AVX512BW-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rdx # 8-byte Reload
> +; AVX512BW-NEXT:    shldq $63, %rax, %rdx
> +; AVX512BW-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rcx # 8-byte Reload
> +; AVX512BW-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> +; AVX512BW-NEXT:    shldq $63, %rax, %rcx
> +; AVX512BW-NEXT:    movq (%rsp), %rax # 8-byte Reload
> +; AVX512BW-NEXT:    shldq $63, %rax, %rdi
> +; AVX512BW-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> +; AVX512BW-NEXT:    shldq $63, %rax, %r11
> +; AVX512BW-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> +; AVX512BW-NEXT:    shldq $63, %rax, %rbx
> +; AVX512BW-NEXT:    movq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Reload
> +; AVX512BW-NEXT:    shldq $63, %rax, %r15
> +; AVX512BW-NEXT:    vmovq %r15, %xmm0
> +; AVX512BW-NEXT:    vmovq %rbx, %xmm1
> +; AVX512BW-NEXT:    vmovq %r11, %xmm2
> +; AVX512BW-NEXT:    vinserti128 $1, %xmm1, %ymm0, %ymm0
> +; AVX512BW-NEXT:    vmovq %rdi, %xmm1
> +; AVX512BW-NEXT:    vinserti128 $1, %xmm1, %ymm2, %ymm1
> +; AVX512BW-NEXT:    vinserti64x4 $1, %ymm0, %zmm1, %zmm0
> +; AVX512BW-NEXT:    vextracti128 $1, %ymm0, %xmm1
> +; AVX512BW-NEXT:    vpextrb $0, %xmm0, %eax
> +; AVX512BW-NEXT:    vmovd %eax, %xmm2
> +; AVX512BW-NEXT:    vpextrb $0, %xmm1, %eax
> +; AVX512BW-NEXT:    vpinsrb $1, %eax, %xmm2, %xmm1
> +; AVX512BW-NEXT:    vextracti32x4 $2, %zmm0, %xmm2
> +; AVX512BW-NEXT:    vpextrb $0, %xmm2, %eax
> +; AVX512BW-NEXT:    vpinsrb $2, %eax, %xmm1, %xmm1
> +; AVX512BW-NEXT:    vextracti32x4 $3, %zmm0, %xmm0
> +; AVX512BW-NEXT:    vpextrb $0, %xmm0, %eax
> +; AVX512BW-NEXT:    vpinsrb $3, %eax, %xmm1, %xmm0
> +; AVX512BW-NEXT:    vmovq %rcx, %xmm1
> +; AVX512BW-NEXT:    vmovq %rdx, %xmm2
> +; AVX512BW-NEXT:    vmovq %r13, %xmm3
> +; AVX512BW-NEXT:    vinserti128 $1, %xmm2, %ymm1, %ymm1
> +; AVX512BW-NEXT:    vmovq %r12, %xmm2
> +; AVX512BW-NEXT:    vinserti128 $1, %xmm2, %ymm3, %ymm2
> +; AVX512BW-NEXT:    vinserti64x4 $1, %ymm1, %zmm2, %zmm1
> +; AVX512BW-NEXT:    vpextrb $0, %xmm1, %eax
> +; AVX512BW-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0
> +; AVX512BW-NEXT:    vextracti128 $1, %ymm1, %xmm2
> +; AVX512BW-NEXT:    vpextrb $0, %xmm2, %eax
> +; AVX512BW-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0
> +; AVX512BW-NEXT:    vextracti32x4 $2, %zmm1, %xmm2
> +; AVX512BW-NEXT:    vpextrb $0, %xmm2, %eax
> +; AVX512BW-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0
> +; AVX512BW-NEXT:    vextracti32x4 $3, %zmm1, %xmm1
> +; AVX512BW-NEXT:    vpextrb $0, %xmm1, %eax
> +; AVX512BW-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0
> +; AVX512BW-NEXT:    vmovq %rsi, %xmm1
> +; AVX512BW-NEXT:    vmovq %r14, %xmm2
> +; AVX512BW-NEXT:    vmovq %r9, %xmm3
> +; AVX512BW-NEXT:    vinserti128 $1, %xmm2, %ymm1, %ymm1
> +; AVX512BW-NEXT:    vmovq %r10, %xmm2
> +; AVX512BW-NEXT:    vinserti128 $1, %xmm2, %ymm3, %ymm2
> +; AVX512BW-NEXT:    vinserti64x4 $1, %ymm1, %zmm2, %zmm1
> +; AVX512BW-NEXT:    vpextrb $0, %xmm1, %eax
> +; AVX512BW-NEXT:    vpinsrb $8, %eax, %xmm0, %xmm0
> +; AVX512BW-NEXT:    vextracti128 $1, %ymm1, %xmm2
> +; AVX512BW-NEXT:    vpextrb $0, %xmm2, %eax
> +; AVX512BW-NEXT:    vpinsrb $9, %eax, %xmm0, %xmm0
> +; AVX512BW-NEXT:    vextracti32x4 $2, %zmm1, %xmm2
> +; AVX512BW-NEXT:    vpextrb $0, %xmm2, %eax
> +; AVX512BW-NEXT:    vpinsrb $10, %eax, %xmm0, %xmm0
> +; AVX512BW-NEXT:    vextracti32x4 $3, %zmm1, %xmm1
> +; AVX512BW-NEXT:    vpextrb $0, %xmm1, %eax
> +; AVX512BW-NEXT:    vpinsrb $11, %eax, %xmm0, %xmm0
> +; AVX512BW-NEXT:    vmovq %r8, %xmm1
> +; AVX512BW-NEXT:    vmovq %rbp, %xmm2
> +; AVX512BW-NEXT:    vmovq {{[-0-9]+}}(%r{{[sb]}}p), %xmm3 # 8-byte Folded Reload
> +; AVX512BW-NEXT:    # xmm3 = mem[0],zero
> +; AVX512BW-NEXT:    vinserti128 $1, %xmm2, %ymm1, %ymm1
> +; AVX512BW-NEXT:    vmovq {{[-0-9]+}}(%r{{[sb]}}p), %xmm2 # 8-byte Folded Reload
> +; AVX512BW-NEXT:    # xmm2 = mem[0],zero
> +; AVX512BW-NEXT:    vinserti128 $1, %xmm2, %ymm3, %ymm2
> +; AVX512BW-NEXT:    vinserti64x4 $1, %ymm1, %zmm2, %zmm1
> +; AVX512BW-NEXT:    vpextrb $0, %xmm1, %eax
> +; AVX512BW-NEXT:    vpinsrb $12, %eax, %xmm0, %xmm0
> +; AVX512BW-NEXT:    vextracti128 $1, %ymm1, %xmm2
> +; AVX512BW-NEXT:    vpextrb $0, %xmm2, %eax
> +; AVX512BW-NEXT:    vpinsrb $13, %eax, %xmm0, %xmm0
> +; AVX512BW-NEXT:    vextracti32x4 $2, %zmm1, %xmm2
> +; AVX512BW-NEXT:    vpextrb $0, %xmm2, %eax
> +; AVX512BW-NEXT:    vpinsrb $14, %eax, %xmm0, %xmm0
> +; AVX512BW-NEXT:    vextracti32x4 $3, %zmm1, %xmm1
> +; AVX512BW-NEXT:    vpextrb $0, %xmm1, %eax
> +; AVX512BW-NEXT:    vpinsrb $15, %eax, %xmm0, %xmm0
> +; AVX512BW-NEXT:    vmovdqu %xmm0, (%rax)
> +; AVX512BW-NEXT:    addq $24, %rsp
> +; AVX512BW-NEXT:    popq %rbx
> +; AVX512BW-NEXT:    popq %r12
> +; AVX512BW-NEXT:    popq %r13
> +; AVX512BW-NEXT:    popq %r14
> +; AVX512BW-NEXT:    popq %r15
> +; AVX512BW-NEXT:    popq %rbp
> +; AVX512BW-NEXT:    vzeroupper
> +; AVX512BW-NEXT:    retq
>    %1 = load <16 x i8>, <16 x i8>* %a
>    %2 = load <16 x i8>, <16 x i8>* %b
>    %3 = zext <16 x i8> %1 to <16 x i128>
>
> Modified: llvm/trunk/test/CodeGen/X86/avx-cvt-2.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/avx-cvt-2.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/avx-cvt-2.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/avx-cvt-2.ll Wed Aug  7 09:24:26 2019
> @@ -40,7 +40,7 @@ define void @fptoui8(%f32vec_t %a, %i8ve
>  ; CHECK:       # %bb.0:
>  ; CHECK-NEXT:    vcvttps2dq %ymm0, %ymm0
>  ; CHECK-NEXT:    vextractf128 $1, %ymm0, %xmm1
> -; CHECK-NEXT:    vpackusdw %xmm1, %xmm0, %xmm0
> +; CHECK-NEXT:    vpackssdw %xmm1, %xmm0, %xmm0
>  ; CHECK-NEXT:    vpackuswb %xmm0, %xmm0, %xmm0
>  ; CHECK-NEXT:    vmovq %xmm0, (%rdi)
>  ; CHECK-NEXT:    vzeroupper
>
> Modified: llvm/trunk/test/CodeGen/X86/avx-fp2int.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/avx-fp2int.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/avx-fp2int.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/avx-fp2int.ll Wed Aug  7 09:24:26 2019
> @@ -7,6 +7,7 @@ define <4 x i8> @test1(<4 x double> %d)
>  ; CHECK-LABEL: test1:
>  ; CHECK:       ## %bb.0:
>  ; CHECK-NEXT:    vcvttpd2dq %ymm0, %xmm0
> +; CHECK-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
>  ; CHECK-NEXT:    vzeroupper
>  ; CHECK-NEXT:    retl
>    %c = fptoui <4 x double> %d to <4 x i8>
> @@ -16,6 +17,7 @@ define <4 x i8> @test2(<4 x double> %d)
>  ; CHECK-LABEL: test2:
>  ; CHECK:       ## %bb.0:
>  ; CHECK-NEXT:    vcvttpd2dq %ymm0, %xmm0
> +; CHECK-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
>  ; CHECK-NEXT:    vzeroupper
>  ; CHECK-NEXT:    retl
>    %c = fptosi <4 x double> %d to <4 x i8>
>
> Modified: llvm/trunk/test/CodeGen/X86/avx2-conversions.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/avx2-conversions.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/avx2-conversions.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/avx2-conversions.ll Wed Aug  7 09:24:26 2019
> @@ -117,14 +117,12 @@ define <8 x i32> @zext8(<8 x i16> %A) no
>  define <8 x i32> @zext_8i8_8i32(<8 x i8> %A) nounwind {
>  ; X32-LABEL: zext_8i8_8i32:
>  ; X32:       # %bb.0:
> -; X32-NEXT:    vpand {{\.LCPI.*}}, %xmm0, %xmm0
> -; X32-NEXT:    vpmovzxwd {{.*#+}} ymm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
> +; X32-NEXT:    vpmovzxbd {{.*#+}} ymm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero
>  ; X32-NEXT:    retl
>  ;
>  ; X64-LABEL: zext_8i8_8i32:
>  ; X64:       # %bb.0:
> -; X64-NEXT:    vpand {{.*}}(%rip), %xmm0, %xmm0
> -; X64-NEXT:    vpmovzxwd {{.*#+}} ymm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
> +; X64-NEXT:    vpmovzxbd {{.*#+}} ymm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero
>  ; X64-NEXT:    retq
>    %B = zext <8 x i8> %A to <8 x i32>
>    ret <8 x i32>%B
>
> Modified: llvm/trunk/test/CodeGen/X86/avx2-masked-gather.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/avx2-masked-gather.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/avx2-masked-gather.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/avx2-masked-gather.ll Wed Aug  7 09:24:26 2019
> @@ -9,23 +9,21 @@ declare <2 x i32> @llvm.masked.gather.v2
>  define <2 x i32> @masked_gather_v2i32(<2 x i32*>* %ptr, <2 x i1> %masks, <2 x i32> %passthro) {
>  ; X86-LABEL: masked_gather_v2i32:
>  ; X86:       # %bb.0: # %entry
> -; X86-NEXT:    movl {{[0-9]+}}(%esp), %eax
> -; X86-NEXT:    vmovq {{.*#+}} xmm2 = mem[0],zero
> -; X86-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
>  ; X86-NEXT:    vinsertps {{.*#+}} xmm0 = xmm0[0,2],zero,zero
>  ; X86-NEXT:    vpslld $31, %xmm0, %xmm0
> +; X86-NEXT:    movl {{[0-9]+}}(%esp), %eax
> +; X86-NEXT:    vmovq {{.*#+}} xmm2 = mem[0],zero
>  ; X86-NEXT:    vpgatherdd %xmm0, (,%xmm2), %xmm1
> -; X86-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm1[0],zero,xmm1[1],zero
> +; X86-NEXT:    vmovdqa %xmm1, %xmm0
>  ; X86-NEXT:    retl
>  ;
>  ; X64-LABEL: masked_gather_v2i32:
>  ; X64:       # %bb.0: # %entry
>  ; X64-NEXT:    vmovdqa (%rdi), %xmm2
> -; X64-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
>  ; X64-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
>  ; X64-NEXT:    vpslld $31, %xmm0, %xmm0
>  ; X64-NEXT:    vpgatherqd %xmm0, (,%xmm2), %xmm1
> -; X64-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm1[0],zero,xmm1[1],zero
> +; X64-NEXT:    vmovdqa %xmm1, %xmm0
>  ; X64-NEXT:    retq
>  ;
>  ; NOGATHER-LABEL: masked_gather_v2i32:
> @@ -43,14 +41,12 @@ define <2 x i32> @masked_gather_v2i32(<2
>  ; NOGATHER-NEXT:    retq
>  ; NOGATHER-NEXT:  .LBB0_1: # %cond.load
>  ; NOGATHER-NEXT:    vmovq %xmm2, %rcx
> -; NOGATHER-NEXT:    movl (%rcx), %ecx
> -; NOGATHER-NEXT:    vpinsrq $0, %rcx, %xmm1, %xmm1
> +; NOGATHER-NEXT:    vpinsrd $0, (%rcx), %xmm1, %xmm1
>  ; NOGATHER-NEXT:    testb $2, %al
>  ; NOGATHER-NEXT:    je .LBB0_4
>  ; NOGATHER-NEXT:  .LBB0_3: # %cond.load1
>  ; NOGATHER-NEXT:    vpextrq $1, %xmm2, %rax
> -; NOGATHER-NEXT:    movl (%rax), %eax
> -; NOGATHER-NEXT:    vpinsrq $1, %rax, %xmm1, %xmm1
> +; NOGATHER-NEXT:    vpinsrd $1, (%rax), %xmm1, %xmm1
>  ; NOGATHER-NEXT:    vmovdqa %xmm1, %xmm0
>  ; NOGATHER-NEXT:    retq
>  entry:
> @@ -62,11 +58,10 @@ entry:
>  define <4 x i32> @masked_gather_v2i32_concat(<2 x i32*>* %ptr, <2 x i1> %masks, <2 x i32> %passthro) {
>  ; X86-LABEL: masked_gather_v2i32_concat:
>  ; X86:       # %bb.0: # %entry
> -; X86-NEXT:    movl {{[0-9]+}}(%esp), %eax
> -; X86-NEXT:    vmovq {{.*#+}} xmm2 = mem[0],zero
> -; X86-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
>  ; X86-NEXT:    vinsertps {{.*#+}} xmm0 = xmm0[0,2],zero,zero
>  ; X86-NEXT:    vpslld $31, %xmm0, %xmm0
> +; X86-NEXT:    movl {{[0-9]+}}(%esp), %eax
> +; X86-NEXT:    vmovq {{.*#+}} xmm2 = mem[0],zero
>  ; X86-NEXT:    vpgatherdd %xmm0, (,%xmm2), %xmm1
>  ; X86-NEXT:    vmovdqa %xmm1, %xmm0
>  ; X86-NEXT:    retl
> @@ -74,7 +69,6 @@ define <4 x i32> @masked_gather_v2i32_co
>  ; X64-LABEL: masked_gather_v2i32_concat:
>  ; X64:       # %bb.0: # %entry
>  ; X64-NEXT:    vmovdqa (%rdi), %xmm2
> -; X64-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
>  ; X64-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
>  ; X64-NEXT:    vpslld $31, %xmm0, %xmm0
>  ; X64-NEXT:    vpgatherqd %xmm0, (,%xmm2), %xmm1
> @@ -92,19 +86,17 @@ define <4 x i32> @masked_gather_v2i32_co
>  ; NOGATHER-NEXT:    testb $2, %al
>  ; NOGATHER-NEXT:    jne .LBB1_3
>  ; NOGATHER-NEXT:  .LBB1_4: # %else2
> -; NOGATHER-NEXT:    vpshufd {{.*#+}} xmm0 = xmm1[0,2,2,3]
> +; NOGATHER-NEXT:    vmovdqa %xmm1, %xmm0
>  ; NOGATHER-NEXT:    retq
>  ; NOGATHER-NEXT:  .LBB1_1: # %cond.load
>  ; NOGATHER-NEXT:    vmovq %xmm2, %rcx
> -; NOGATHER-NEXT:    movl (%rcx), %ecx
> -; NOGATHER-NEXT:    vpinsrq $0, %rcx, %xmm1, %xmm1
> +; NOGATHER-NEXT:    vpinsrd $0, (%rcx), %xmm1, %xmm1
>  ; NOGATHER-NEXT:    testb $2, %al
>  ; NOGATHER-NEXT:    je .LBB1_4
>  ; NOGATHER-NEXT:  .LBB1_3: # %cond.load1
>  ; NOGATHER-NEXT:    vpextrq $1, %xmm2, %rax
> -; NOGATHER-NEXT:    movl (%rax), %eax
> -; NOGATHER-NEXT:    vpinsrq $1, %rax, %xmm1, %xmm1
> -; NOGATHER-NEXT:    vpshufd {{.*#+}} xmm0 = xmm1[0,2,2,3]
> +; NOGATHER-NEXT:    vpinsrd $1, (%rax), %xmm1, %xmm1
> +; NOGATHER-NEXT:    vmovdqa %xmm1, %xmm0
>  ; NOGATHER-NEXT:    retq
>  entry:
>    %ld  = load <2 x i32*>, <2 x i32*>* %ptr
> @@ -714,10 +706,10 @@ declare <2 x i64> @llvm.masked.gather.v2
>  define <2 x i64> @masked_gather_v2i64(<2 x i64*>* %ptr, <2 x i1> %masks, <2 x i64> %passthro) {
>  ; X86-LABEL: masked_gather_v2i64:
>  ; X86:       # %bb.0: # %entry
> -; X86-NEXT:    movl {{[0-9]+}}(%esp), %eax
> -; X86-NEXT:    vpmovsxdq (%eax), %xmm2
>  ; X86-NEXT:    vpsllq $63, %xmm0, %xmm0
> -; X86-NEXT:    vpgatherqq %xmm0, (,%xmm2), %xmm1
> +; X86-NEXT:    movl {{[0-9]+}}(%esp), %eax
> +; X86-NEXT:    vmovq {{.*#+}} xmm2 = mem[0],zero
> +; X86-NEXT:    vpgatherdq %xmm0, (,%xmm2), %xmm1
>  ; X86-NEXT:    vmovdqa %xmm1, %xmm0
>  ; X86-NEXT:    retl
>  ;
> @@ -763,10 +755,10 @@ declare <2 x double> @llvm.masked.gather
>  define <2 x double> @masked_gather_v2double(<2 x double*>* %ptr, <2 x i1> %masks, <2 x double> %passthro) {
>  ; X86-LABEL: masked_gather_v2double:
>  ; X86:       # %bb.0: # %entry
> -; X86-NEXT:    movl {{[0-9]+}}(%esp), %eax
> -; X86-NEXT:    vpmovsxdq (%eax), %xmm2
>  ; X86-NEXT:    vpsllq $63, %xmm0, %xmm0
> -; X86-NEXT:    vgatherqpd %xmm0, (,%xmm2), %xmm1
> +; X86-NEXT:    movl {{[0-9]+}}(%esp), %eax
> +; X86-NEXT:    vmovsd {{.*#+}} xmm2 = mem[0],zero
> +; X86-NEXT:    vgatherdpd %xmm0, (,%xmm2), %xmm1
>  ; X86-NEXT:    vmovapd %xmm1, %xmm0
>  ; X86-NEXT:    retl
>  ;
>
> Modified: llvm/trunk/test/CodeGen/X86/avx2-vbroadcast.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/avx2-vbroadcast.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/avx2-vbroadcast.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/avx2-vbroadcast.ll Wed Aug  7 09:24:26 2019
> @@ -657,12 +657,12 @@ define <4 x float> @_e2(float* %ptr) nou
>  define <8 x i8> @_e4(i8* %ptr) nounwind uwtable readnone ssp {
>  ; X32-LABEL: _e4:
>  ; X32:       ## %bb.0:
> -; X32-NEXT:    vmovaps {{.*#+}} xmm0 = [52,52,52,52,52,52,52,52]
> +; X32-NEXT:    vmovaps {{.*#+}} xmm0 = <52,52,52,52,52,52,52,52,u,u,u,u,u,u,u,u>
>  ; X32-NEXT:    retl
>  ;
>  ; X64-LABEL: _e4:
>  ; X64:       ## %bb.0:
> -; X64-NEXT:    vmovaps {{.*#+}} xmm0 = [52,52,52,52,52,52,52,52]
> +; X64-NEXT:    vmovaps {{.*#+}} xmm0 = <52,52,52,52,52,52,52,52,u,u,u,u,u,u,u,u>
>  ; X64-NEXT:    retq
>    %vecinit0.i = insertelement <8 x i8> undef, i8       52, i32 0
>    %vecinit1.i = insertelement <8 x i8> %vecinit0.i, i8 52, i32 1
>
> Modified: llvm/trunk/test/CodeGen/X86/avx512-any_extend_load.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/avx512-any_extend_load.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/avx512-any_extend_load.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/avx512-any_extend_load.ll Wed Aug  7 09:24:26 2019
> @@ -4,13 +4,25 @@
>
>
>  define void @any_extend_load_v8i64(<8 x i8> * %ptr) {
> -; ALL-LABEL: any_extend_load_v8i64:
> -; ALL:       # %bb.0:
> -; ALL-NEXT:    vpmovzxbq {{.*#+}} zmm0 = mem[0],zero,zero,zero,zero,zero,zero,zero,mem[1],zero,zero,zero,zero,zero,zero,zero,mem[2],zero,zero,zero,zero,zero,zero,zero,mem[3],zero,zero,zero,zero,zero,zero,zero,mem[4],zero,zero,zero,zero,zero,zero,zero,mem[5],zero,zero,zero,zero,zero,zero,zero,mem[6],zero,zero,zero,zero,zero,zero,zero,mem[7],zero,zero,zero,zero,zero,zero,zero
> -; ALL-NEXT:    vpaddq {{.*}}(%rip){1to8}, %zmm0, %zmm0
> -; ALL-NEXT:    vpmovqb %zmm0, (%rdi)
> -; ALL-NEXT:    vzeroupper
> -; ALL-NEXT:    retq
> +; KNL-LABEL: any_extend_load_v8i64:
> +; KNL:       # %bb.0:
> +; KNL-NEXT:    vmovq {{.*#+}} xmm0 = mem[0],zero
> +; KNL-NEXT:    vpmovzxbq {{.*#+}} ymm1 = xmm0[0],zero,zero,zero,zero,zero,zero,zero,xmm0[1],zero,zero,zero,zero,zero,zero,zero,xmm0[2],zero,zero,zero,zero,zero,zero,zero,xmm0[3],zero,zero,zero,zero,zero,zero,zero
> +; KNL-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[1,1,2,3]
> +; KNL-NEXT:    vpmovzxbq {{.*#+}} ymm0 = xmm0[0],zero,zero,zero,zero,zero,zero,zero,xmm0[1],zero,zero,zero,zero,zero,zero,zero,xmm0[2],zero,zero,zero,zero,zero,zero,zero,xmm0[3],zero,zero,zero,zero,zero,zero,zero
> +; KNL-NEXT:    vinserti64x4 $1, %ymm0, %zmm1, %zmm0
> +; KNL-NEXT:    vpaddq {{.*}}(%rip){1to8}, %zmm0, %zmm0
> +; KNL-NEXT:    vpmovqb %zmm0, (%rdi)
> +; KNL-NEXT:    vzeroupper
> +; KNL-NEXT:    retq
> +;
> +; SKX-LABEL: any_extend_load_v8i64:
> +; SKX:       # %bb.0:
> +; SKX-NEXT:    vpmovzxbq {{.*#+}} zmm0 = mem[0],zero,zero,zero,zero,zero,zero,zero,mem[1],zero,zero,zero,zero,zero,zero,zero,mem[2],zero,zero,zero,zero,zero,zero,zero,mem[3],zero,zero,zero,zero,zero,zero,zero,mem[4],zero,zero,zero,zero,zero,zero,zero,mem[5],zero,zero,zero,zero,zero,zero,zero,mem[6],zero,zero,zero,zero,zero,zero,zero,mem[7],zero,zero,zero,zero,zero,zero,zero
> +; SKX-NEXT:    vpaddq {{.*}}(%rip){1to8}, %zmm0, %zmm0
> +; SKX-NEXT:    vpmovqb %zmm0, (%rdi)
> +; SKX-NEXT:    vzeroupper
> +; SKX-NEXT:    retq
>    %wide.load = load <8 x i8>, <8 x i8>* %ptr, align 1
>    %1 = zext <8 x i8> %wide.load to <8 x i64>
>    %2 = add nuw nsw <8 x i64> %1, <i64 4, i64 4, i64 4, i64 4, i64 4, i64 4, i64 4, i64 4>
> @@ -23,10 +35,12 @@ define void @any_extend_load_v8i64(<8 x
>  define void @any_extend_load_v8i32(<8 x i8> * %ptr) {
>  ; KNL-LABEL: any_extend_load_v8i32:
>  ; KNL:       # %bb.0:
> -; KNL-NEXT:    vpmovzxbw {{.*#+}} xmm0 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
> -; KNL-NEXT:    vpaddw {{.*}}(%rip), %xmm0, %xmm0
> -; KNL-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u]
> +; KNL-NEXT:    vpmovzxbd {{.*#+}} ymm0 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
> +; KNL-NEXT:    vpbroadcastd {{.*#+}} ymm1 = [4,4,4,4,4,4,4,4]
> +; KNL-NEXT:    vpaddd %ymm1, %ymm0, %ymm0
> +; KNL-NEXT:    vpmovdb %zmm0, %xmm0
>  ; KNL-NEXT:    vmovq %xmm0, (%rdi)
> +; KNL-NEXT:    vzeroupper
>  ; KNL-NEXT:    retq
>  ;
>  ; SKX-LABEL: any_extend_load_v8i32:
>
> Modified: llvm/trunk/test/CodeGen/X86/avx512-cvt.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/avx512-cvt.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/avx512-cvt.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/avx512-cvt.ll Wed Aug  7 09:24:26 2019
> @@ -513,15 +513,14 @@ define <8 x i8> @f64to8uc(<8 x double> %
>  ; NOVL-LABEL: f64to8uc:
>  ; NOVL:       # %bb.0:
>  ; NOVL-NEXT:    vcvttpd2dq %zmm0, %ymm0
> -; NOVL-NEXT:    vpmovdw %zmm0, %ymm0
> -; NOVL-NEXT:    # kill: def $xmm0 killed $xmm0 killed $ymm0
> +; NOVL-NEXT:    vpmovdb %zmm0, %xmm0
>  ; NOVL-NEXT:    vzeroupper
>  ; NOVL-NEXT:    retq
>  ;
>  ; VL-LABEL: f64to8uc:
>  ; VL:       # %bb.0:
>  ; VL-NEXT:    vcvttpd2dq %zmm0, %ymm0
> -; VL-NEXT:    vpmovdw %ymm0, %xmm0
> +; VL-NEXT:    vpmovdb %ymm0, %xmm0
>  ; VL-NEXT:    vzeroupper
>  ; VL-NEXT:    retq
>    %res = fptoui <8 x double> %f to <8 x i8>
> @@ -657,15 +656,14 @@ define <8 x i8> @f64to8sc(<8 x double> %
>  ; NOVL-LABEL: f64to8sc:
>  ; NOVL:       # %bb.0:
>  ; NOVL-NEXT:    vcvttpd2dq %zmm0, %ymm0
> -; NOVL-NEXT:    vpmovdw %zmm0, %ymm0
> -; NOVL-NEXT:    # kill: def $xmm0 killed $xmm0 killed $ymm0
> +; NOVL-NEXT:    vpmovdb %zmm0, %xmm0
>  ; NOVL-NEXT:    vzeroupper
>  ; NOVL-NEXT:    retq
>  ;
>  ; VL-LABEL: f64to8sc:
>  ; VL:       # %bb.0:
>  ; VL-NEXT:    vcvttpd2dq %zmm0, %ymm0
> -; VL-NEXT:    vpmovdw %ymm0, %xmm0
> +; VL-NEXT:    vpmovdb %ymm0, %xmm0
>  ; VL-NEXT:    vzeroupper
>  ; VL-NEXT:    retq
>    %res = fptosi <8 x double> %f to <8 x i8>
> @@ -1557,9 +1555,7 @@ define <8 x double> @ssto16f64(<8 x i16>
>  define <8 x double> @scto8f64(<8 x i8> %a) {
>  ; ALL-LABEL: scto8f64:
>  ; ALL:       # %bb.0:
> -; ALL-NEXT:    vpmovzxwd {{.*#+}} ymm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
> -; ALL-NEXT:    vpslld $24, %ymm0, %ymm0
> -; ALL-NEXT:    vpsrad $24, %ymm0, %ymm0
> +; ALL-NEXT:    vpmovsxbd %xmm0, %ymm0
>  ; ALL-NEXT:    vcvtdq2pd %ymm0, %zmm0
>  ; ALL-NEXT:    retq
>    %1 = sitofp <8 x i8> %a to <8 x double>
> @@ -1724,13 +1720,30 @@ define <2 x float> @sbto2f32(<2 x float>
>  }
>
>  define <2 x double> @sbto2f64(<2 x double> %a) {
> -; ALL-LABEL: sbto2f64:
> -; ALL:       # %bb.0:
> -; ALL-NEXT:    vxorpd %xmm1, %xmm1, %xmm1
> -; ALL-NEXT:    vcmpltpd %xmm0, %xmm1, %xmm0
> -; ALL-NEXT:    vpermilps {{.*#+}} xmm0 = xmm0[0,2,2,3]
> -; ALL-NEXT:    vcvtdq2pd %xmm0, %xmm0
> -; ALL-NEXT:    retq
> +; NOVL-LABEL: sbto2f64:
> +; NOVL:       # %bb.0:
> +; NOVL-NEXT:    vxorpd %xmm1, %xmm1, %xmm1
> +; NOVL-NEXT:    vcmpltpd %xmm0, %xmm1, %xmm0
> +; NOVL-NEXT:    vpermilps {{.*#+}} xmm0 = xmm0[0,2,2,3]
> +; NOVL-NEXT:    vcvtdq2pd %xmm0, %xmm0
> +; NOVL-NEXT:    retq
> +;
> +; VLDQ-LABEL: sbto2f64:
> +; VLDQ:       # %bb.0:
> +; VLDQ-NEXT:    vxorpd %xmm1, %xmm1, %xmm1
> +; VLDQ-NEXT:    vcmpltpd %xmm0, %xmm1, %k0
> +; VLDQ-NEXT:    vpmovm2d %k0, %xmm0
> +; VLDQ-NEXT:    vcvtdq2pd %xmm0, %xmm0
> +; VLDQ-NEXT:    retq
> +;
> +; VLNODQ-LABEL: sbto2f64:
> +; VLNODQ:       # %bb.0:
> +; VLNODQ-NEXT:    vxorpd %xmm1, %xmm1, %xmm1
> +; VLNODQ-NEXT:    vcmpltpd %xmm0, %xmm1, %k1
> +; VLNODQ-NEXT:    vpcmpeqd %xmm0, %xmm0, %xmm0
> +; VLNODQ-NEXT:    vmovdqa32 %xmm0, %xmm0 {%k1} {z}
> +; VLNODQ-NEXT:    vcvtdq2pd %xmm0, %xmm0
> +; VLNODQ-NEXT:    retq
>    %cmpres = fcmp ogt <2 x double> %a, zeroinitializer
>    %1 = sitofp <2 x i1> %cmpres to <2 x double>
>    ret <2 x double> %1
> @@ -1749,8 +1762,7 @@ define <16 x float> @ucto16f32(<16 x i8>
>  define <8 x double> @ucto8f64(<8 x i8> %a) {
>  ; ALL-LABEL: ucto8f64:
>  ; ALL:       # %bb.0:
> -; ALL-NEXT:    vpand {{.*}}(%rip), %xmm0, %xmm0
> -; ALL-NEXT:    vpmovzxwd {{.*#+}} ymm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
> +; ALL-NEXT:    vpmovzxbd {{.*#+}} ymm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero
>  ; ALL-NEXT:    vcvtdq2pd %ymm0, %zmm0
>  ; ALL-NEXT:    retq
>    %b = uitofp <8 x i8> %a to <8 x double>
> @@ -1993,29 +2005,42 @@ define <4 x double> @ubto4f64(<4 x i32>
>  }
>
>  define <2 x float> @ubto2f32(<2 x i32> %a) {
> -; ALL-LABEL: ubto2f32:
> -; ALL:       # %bb.0:
> -; ALL-NEXT:    vpxor %xmm1, %xmm1, %xmm1
> -; ALL-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm1[1],xmm0[2],xmm1[3]
> -; ALL-NEXT:    vpcmpeqq %xmm1, %xmm0, %xmm0
> -; ALL-NEXT:    vpandn {{.*}}(%rip), %xmm0, %xmm0
> -; ALL-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> -; ALL-NEXT:    retq
> +; NOVL-LABEL: ubto2f32:
> +; NOVL:       # %bb.0:
> +; NOVL-NEXT:    vpxor %xmm1, %xmm1, %xmm1
> +; NOVL-NEXT:    vpcmpeqd %xmm1, %xmm0, %xmm0
> +; NOVL-NEXT:    vpbroadcastd {{.*#+}} xmm1 = [1065353216,1065353216,1065353216,1065353216]
> +; NOVL-NEXT:    vpandn %xmm1, %xmm0, %xmm0
> +; NOVL-NEXT:    retq
> +;
> +; VL-LABEL: ubto2f32:
> +; VL:       # %bb.0:
> +; VL-NEXT:    vpxor %xmm1, %xmm1, %xmm1
> +; VL-NEXT:    vpcmpeqd %xmm1, %xmm0, %xmm0
> +; VL-NEXT:    vpandnd {{.*}}(%rip){1to4}, %xmm0, %xmm0
> +; VL-NEXT:    retq
>    %mask = icmp ne <2 x i32> %a, zeroinitializer
>    %1 = uitofp <2 x i1> %mask to <2 x float>
>    ret <2 x float> %1
>  }
>
>  define <2 x double> @ubto2f64(<2 x i32> %a) {
> -; ALL-LABEL: ubto2f64:
> -; ALL:       # %bb.0:
> -; ALL-NEXT:    vpxor %xmm1, %xmm1, %xmm1
> -; ALL-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm1[1],xmm0[2],xmm1[3]
> -; ALL-NEXT:    vpcmpeqq %xmm1, %xmm0, %xmm0
> -; ALL-NEXT:    vpandn {{.*}}(%rip), %xmm0, %xmm0
> -; ALL-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> -; ALL-NEXT:    vcvtdq2pd %xmm0, %xmm0
> -; ALL-NEXT:    retq
> +; NOVL-LABEL: ubto2f64:
> +; NOVL:       # %bb.0:
> +; NOVL-NEXT:    vpxor %xmm1, %xmm1, %xmm1
> +; NOVL-NEXT:    vpcmpeqd %xmm1, %xmm0, %xmm0
> +; NOVL-NEXT:    vpbroadcastd {{.*#+}} xmm1 = [1,1,1,1]
> +; NOVL-NEXT:    vpandn %xmm1, %xmm0, %xmm0
> +; NOVL-NEXT:    vcvtdq2pd %xmm0, %xmm0
> +; NOVL-NEXT:    retq
> +;
> +; VL-LABEL: ubto2f64:
> +; VL:       # %bb.0:
> +; VL-NEXT:    vpxor %xmm1, %xmm1, %xmm1
> +; VL-NEXT:    vpcmpeqd %xmm1, %xmm0, %xmm0
> +; VL-NEXT:    vpandnd {{.*}}(%rip){1to4}, %xmm0, %xmm0
> +; VL-NEXT:    vcvtdq2pd %xmm0, %xmm0
> +; VL-NEXT:    retq
>    %mask = icmp ne <2 x i32> %a, zeroinitializer
>    %1 = uitofp <2 x i1> %mask to <2 x double>
>    ret <2 x double> %1
>
> Modified: llvm/trunk/test/CodeGen/X86/avx512-ext.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/avx512-ext.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/avx512-ext.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/avx512-ext.ll Wed Aug  7 09:24:26 2019
> @@ -2134,28 +2134,53 @@ define <32 x i8> @zext_32xi1_to_32xi8(<3
>  }
>
>  define <4 x i32> @zext_4xi1_to_4x32(<4 x i8> %x, <4 x i8> %y) #0 {
> -; ALL-LABEL: zext_4xi1_to_4x32:
> -; ALL:       # %bb.0:
> -; ALL-NEXT:    vpbroadcastd {{.*#+}} xmm2 = [255,255,255,255]
> -; ALL-NEXT:    vpand %xmm2, %xmm1, %xmm1
> -; ALL-NEXT:    vpand %xmm2, %xmm0, %xmm0
> -; ALL-NEXT:    vpcmpeqd %xmm1, %xmm0, %xmm0
> -; ALL-NEXT:    vpsrld $31, %xmm0, %xmm0
> -; ALL-NEXT:    retq
> +; KNL-LABEL: zext_4xi1_to_4x32:
> +; KNL:       # %bb.0:
> +; KNL-NEXT:    vpcmpeqb %xmm1, %xmm0, %xmm0
> +; KNL-NEXT:    vpmovzxbd {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero
> +; KNL-NEXT:    vpbroadcastd {{.*#+}} xmm1 = [1,1,1,1]
> +; KNL-NEXT:    vpand %xmm1, %xmm0, %xmm0
> +; KNL-NEXT:    retq
> +;
> +; SKX-LABEL: zext_4xi1_to_4x32:
> +; SKX:       # %bb.0:
> +; SKX-NEXT:    vpcmpeqb %xmm1, %xmm0, %k0
> +; SKX-NEXT:    vpmovm2d %k0, %xmm0
> +; SKX-NEXT:    vpsrld $31, %xmm0, %xmm0
> +; SKX-NEXT:    retq
> +;
> +; AVX512DQNOBW-LABEL: zext_4xi1_to_4x32:
> +; AVX512DQNOBW:       # %bb.0:
> +; AVX512DQNOBW-NEXT:    vpcmpeqb %xmm1, %xmm0, %xmm0
> +; AVX512DQNOBW-NEXT:    vpmovzxbd {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero
> +; AVX512DQNOBW-NEXT:    vpandd {{.*}}(%rip){1to4}, %xmm0, %xmm0
> +; AVX512DQNOBW-NEXT:    retq
>    %mask = icmp eq <4 x i8> %x, %y
>    %1 = zext <4 x i1> %mask to <4 x i32>
>    ret <4 x i32> %1
>  }
>
>  define <2 x i64> @zext_2xi1_to_2xi64(<2 x i8> %x, <2 x i8> %y) #0 {
> -; ALL-LABEL: zext_2xi1_to_2xi64:
> -; ALL:       # %bb.0:
> -; ALL-NEXT:    vpbroadcastq {{.*#+}} xmm2 = [255,255]
> -; ALL-NEXT:    vpand %xmm2, %xmm1, %xmm1
> -; ALL-NEXT:    vpand %xmm2, %xmm0, %xmm0
> -; ALL-NEXT:    vpcmpeqq %xmm1, %xmm0, %xmm0
> -; ALL-NEXT:    vpsrlq $63, %xmm0, %xmm0
> -; ALL-NEXT:    retq
> +; KNL-LABEL: zext_2xi1_to_2xi64:
> +; KNL:       # %bb.0:
> +; KNL-NEXT:    vpcmpeqb %xmm1, %xmm0, %xmm0
> +; KNL-NEXT:    vpmovzxbq {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,zero,zero,zero,zero,xmm0[1],zero,zero,zero,zero,zero,zero,zero
> +; KNL-NEXT:    vpand {{.*}}(%rip), %xmm0, %xmm0
> +; KNL-NEXT:    retq
> +;
> +; SKX-LABEL: zext_2xi1_to_2xi64:
> +; SKX:       # %bb.0:
> +; SKX-NEXT:    vpcmpeqb %xmm1, %xmm0, %k0
> +; SKX-NEXT:    vpmovm2q %k0, %xmm0
> +; SKX-NEXT:    vpsrlq $63, %xmm0, %xmm0
> +; SKX-NEXT:    retq
> +;
> +; AVX512DQNOBW-LABEL: zext_2xi1_to_2xi64:
> +; AVX512DQNOBW:       # %bb.0:
> +; AVX512DQNOBW-NEXT:    vpcmpeqb %xmm1, %xmm0, %xmm0
> +; AVX512DQNOBW-NEXT:    vpmovzxbq {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,zero,zero,zero,zero,xmm0[1],zero,zero,zero,zero,zero,zero,zero
> +; AVX512DQNOBW-NEXT:    vpand {{.*}}(%rip), %xmm0, %xmm0
> +; AVX512DQNOBW-NEXT:    retq
>    %mask = icmp eq <2 x i8> %x, %y
>    %1 = zext <2 x i1> %mask to <2 x i64>
>    ret <2 x i64> %1
>
> Modified: llvm/trunk/test/CodeGen/X86/avx512-intrinsics-upgrade.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/avx512-intrinsics-upgrade.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/avx512-intrinsics-upgrade.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/avx512-intrinsics-upgrade.ll Wed Aug  7 09:24:26 2019
> @@ -5478,19 +5478,19 @@ define <8 x i8> @test_cmp_q_512(<8 x i64
>  ; CHECK-NEXT:    vpcmpgtq %zmm1, %zmm0, %k5 ## encoding: [0x62,0xf2,0xfd,0x48,0x37,0xe9]
>  ; CHECK-NEXT:    kmovw %k0, %eax ## encoding: [0xc5,0xf8,0x93,0xc0]
>  ; CHECK-NEXT:    vpxor %xmm0, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xef,0xc0]
> -; CHECK-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xc4,0xc0,0x00]
> +; CHECK-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x00]
>  ; CHECK-NEXT:    kmovw %k1, %eax ## encoding: [0xc5,0xf8,0x93,0xc1]
> -; CHECK-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xc4,0xc0,0x01]
> +; CHECK-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x01]
>  ; CHECK-NEXT:    kmovw %k2, %eax ## encoding: [0xc5,0xf8,0x93,0xc2]
> -; CHECK-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xc4,0xc0,0x02]
> +; CHECK-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x02]
>  ; CHECK-NEXT:    kmovw %k3, %eax ## encoding: [0xc5,0xf8,0x93,0xc3]
> -; CHECK-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xc4,0xc0,0x04]
> +; CHECK-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x04]
>  ; CHECK-NEXT:    kmovw %k4, %eax ## encoding: [0xc5,0xf8,0x93,0xc4]
> -; CHECK-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xc4,0xc0,0x05]
> +; CHECK-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x05]
>  ; CHECK-NEXT:    kmovw %k5, %eax ## encoding: [0xc5,0xf8,0x93,0xc5]
> -; CHECK-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xc4,0xc0,0x06]
> +; CHECK-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x06]
>  ; CHECK-NEXT:    movl $255, %eax ## encoding: [0xb8,0xff,0x00,0x00,0x00]
> -; CHECK-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xc4,0xc0,0x07]
> +; CHECK-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x07]
>  ; CHECK-NEXT:    vzeroupper ## encoding: [0xc5,0xf8,0x77]
>  ; CHECK-NEXT:    ret{{[l|q]}} ## encoding: [0xc3]
>    %res0 = call i8 @llvm.x86.avx512.mask.cmp.q.512(<8 x i64> %a0, <8 x i64> %a1, i32 0, i8 -1)
> @@ -5515,7 +5515,7 @@ define <8 x i8> @test_cmp_q_512(<8 x i64
>  define <8 x i8> @test_mask_cmp_q_512(<8 x i64> %a0, <8 x i64> %a1, i8 %mask) {
>  ; X86-LABEL: test_mask_cmp_q_512:
>  ; X86:       ## %bb.0:
> -; X86-NEXT:    movzwl {{[0-9]+}}(%esp), %eax ## encoding: [0x0f,0xb7,0x44,0x24,0x04]
> +; X86-NEXT:    movl {{[0-9]+}}(%esp), %eax ## encoding: [0x8b,0x44,0x24,0x04]
>  ; X86-NEXT:    kmovw %eax, %k1 ## encoding: [0xc5,0xf8,0x92,0xc8]
>  ; X86-NEXT:    vpcmpeqq %zmm1, %zmm0, %k0 {%k1} ## encoding: [0x62,0xf2,0xfd,0x49,0x29,0xc1]
>  ; X86-NEXT:    vpcmpgtq %zmm0, %zmm1, %k2 {%k1} ## encoding: [0x62,0xf2,0xf5,0x49,0x37,0xd0]
> @@ -5525,18 +5525,18 @@ define <8 x i8> @test_mask_cmp_q_512(<8
>  ; X86-NEXT:    vpcmpgtq %zmm1, %zmm0, %k1 {%k1} ## encoding: [0x62,0xf2,0xfd,0x49,0x37,0xc9]
>  ; X86-NEXT:    kmovw %k0, %ecx ## encoding: [0xc5,0xf8,0x93,0xc8]
>  ; X86-NEXT:    vpxor %xmm0, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xef,0xc0]
> -; X86-NEXT:    vpinsrw $0, %ecx, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xc4,0xc1,0x00]
> +; X86-NEXT:    vpinsrb $0, %ecx, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x00]
>  ; X86-NEXT:    kmovw %k2, %ecx ## encoding: [0xc5,0xf8,0x93,0xca]
> -; X86-NEXT:    vpinsrw $1, %ecx, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xc4,0xc1,0x01]
> +; X86-NEXT:    vpinsrb $1, %ecx, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x01]
>  ; X86-NEXT:    kmovw %k3, %ecx ## encoding: [0xc5,0xf8,0x93,0xcb]
> -; X86-NEXT:    vpinsrw $2, %ecx, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xc4,0xc1,0x02]
> +; X86-NEXT:    vpinsrb $2, %ecx, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x02]
>  ; X86-NEXT:    kmovw %k4, %ecx ## encoding: [0xc5,0xf8,0x93,0xcc]
> -; X86-NEXT:    vpinsrw $4, %ecx, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xc4,0xc1,0x04]
> +; X86-NEXT:    vpinsrb $4, %ecx, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x04]
>  ; X86-NEXT:    kmovw %k5, %ecx ## encoding: [0xc5,0xf8,0x93,0xcd]
> -; X86-NEXT:    vpinsrw $5, %ecx, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xc4,0xc1,0x05]
> +; X86-NEXT:    vpinsrb $5, %ecx, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x05]
>  ; X86-NEXT:    kmovw %k1, %ecx ## encoding: [0xc5,0xf8,0x93,0xc9]
> -; X86-NEXT:    vpinsrw $6, %ecx, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xc4,0xc1,0x06]
> -; X86-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xc4,0xc0,0x07]
> +; X86-NEXT:    vpinsrb $6, %ecx, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x06]
> +; X86-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x07]
>  ; X86-NEXT:    vzeroupper ## encoding: [0xc5,0xf8,0x77]
>  ; X86-NEXT:    retl ## encoding: [0xc3]
>  ;
> @@ -5551,18 +5551,18 @@ define <8 x i8> @test_mask_cmp_q_512(<8
>  ; X64-NEXT:    vpcmpgtq %zmm1, %zmm0, %k1 {%k1} ## encoding: [0x62,0xf2,0xfd,0x49,0x37,0xc9]
>  ; X64-NEXT:    kmovw %k0, %eax ## encoding: [0xc5,0xf8,0x93,0xc0]
>  ; X64-NEXT:    vpxor %xmm0, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xef,0xc0]
> -; X64-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xc4,0xc0,0x00]
> +; X64-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x00]
>  ; X64-NEXT:    kmovw %k2, %eax ## encoding: [0xc5,0xf8,0x93,0xc2]
> -; X64-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xc4,0xc0,0x01]
> +; X64-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x01]
>  ; X64-NEXT:    kmovw %k3, %eax ## encoding: [0xc5,0xf8,0x93,0xc3]
> -; X64-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xc4,0xc0,0x02]
> +; X64-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x02]
>  ; X64-NEXT:    kmovw %k4, %eax ## encoding: [0xc5,0xf8,0x93,0xc4]
> -; X64-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xc4,0xc0,0x04]
> +; X64-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x04]
>  ; X64-NEXT:    kmovw %k5, %eax ## encoding: [0xc5,0xf8,0x93,0xc5]
> -; X64-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xc4,0xc0,0x05]
> +; X64-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x05]
>  ; X64-NEXT:    kmovw %k1, %eax ## encoding: [0xc5,0xf8,0x93,0xc1]
> -; X64-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xc4,0xc0,0x06]
> -; X64-NEXT:    vpinsrw $7, %edi, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xc4,0xc7,0x07]
> +; X64-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x06]
> +; X64-NEXT:    vpinsrb $7, %edi, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x20,0xc7,0x07]
>  ; X64-NEXT:    vzeroupper ## encoding: [0xc5,0xf8,0x77]
>  ; X64-NEXT:    retq ## encoding: [0xc3]
>    %res0 = call i8 @llvm.x86.avx512.mask.cmp.q.512(<8 x i64> %a0, <8 x i64> %a1, i32 0, i8 %mask)
> @@ -5597,19 +5597,19 @@ define <8 x i8> @test_ucmp_q_512(<8 x i6
>  ; CHECK-NEXT:    vpcmpnleuq %zmm1, %zmm0, %k5 ## encoding: [0x62,0xf3,0xfd,0x48,0x1e,0xe9,0x06]
>  ; CHECK-NEXT:    kmovw %k0, %eax ## encoding: [0xc5,0xf8,0x93,0xc0]
>  ; CHECK-NEXT:    vpxor %xmm0, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xef,0xc0]
> -; CHECK-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xc4,0xc0,0x00]
> +; CHECK-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x00]
>  ; CHECK-NEXT:    kmovw %k1, %eax ## encoding: [0xc5,0xf8,0x93,0xc1]
> -; CHECK-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xc4,0xc0,0x01]
> +; CHECK-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x01]
>  ; CHECK-NEXT:    kmovw %k2, %eax ## encoding: [0xc5,0xf8,0x93,0xc2]
> -; CHECK-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xc4,0xc0,0x02]
> +; CHECK-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x02]
>  ; CHECK-NEXT:    kmovw %k3, %eax ## encoding: [0xc5,0xf8,0x93,0xc3]
> -; CHECK-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xc4,0xc0,0x04]
> +; CHECK-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x04]
>  ; CHECK-NEXT:    kmovw %k4, %eax ## encoding: [0xc5,0xf8,0x93,0xc4]
> -; CHECK-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xc4,0xc0,0x05]
> +; CHECK-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x05]
>  ; CHECK-NEXT:    kmovw %k5, %eax ## encoding: [0xc5,0xf8,0x93,0xc5]
> -; CHECK-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xc4,0xc0,0x06]
> +; CHECK-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x06]
>  ; CHECK-NEXT:    movl $255, %eax ## encoding: [0xb8,0xff,0x00,0x00,0x00]
> -; CHECK-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xc4,0xc0,0x07]
> +; CHECK-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x07]
>  ; CHECK-NEXT:    vzeroupper ## encoding: [0xc5,0xf8,0x77]
>  ; CHECK-NEXT:    ret{{[l|q]}} ## encoding: [0xc3]
>    %res0 = call i8 @llvm.x86.avx512.mask.ucmp.q.512(<8 x i64> %a0, <8 x i64> %a1, i32 0, i8 -1)
> @@ -5634,7 +5634,7 @@ define <8 x i8> @test_ucmp_q_512(<8 x i6
>  define <8 x i8> @test_mask_ucmp_q_512(<8 x i64> %a0, <8 x i64> %a1, i8 %mask) {
>  ; X86-LABEL: test_mask_ucmp_q_512:
>  ; X86:       ## %bb.0:
> -; X86-NEXT:    movzwl {{[0-9]+}}(%esp), %eax ## encoding: [0x0f,0xb7,0x44,0x24,0x04]
> +; X86-NEXT:    movl {{[0-9]+}}(%esp), %eax ## encoding: [0x8b,0x44,0x24,0x04]
>  ; X86-NEXT:    kmovw %eax, %k1 ## encoding: [0xc5,0xf8,0x92,0xc8]
>  ; X86-NEXT:    vpcmpeqq %zmm1, %zmm0, %k0 {%k1} ## encoding: [0x62,0xf2,0xfd,0x49,0x29,0xc1]
>  ; X86-NEXT:    vpcmpltuq %zmm1, %zmm0, %k2 {%k1} ## encoding: [0x62,0xf3,0xfd,0x49,0x1e,0xd1,0x01]
> @@ -5644,18 +5644,18 @@ define <8 x i8> @test_mask_ucmp_q_512(<8
>  ; X86-NEXT:    vpcmpnleuq %zmm1, %zmm0, %k1 {%k1} ## encoding: [0x62,0xf3,0xfd,0x49,0x1e,0xc9,0x06]
>  ; X86-NEXT:    kmovw %k0, %ecx ## encoding: [0xc5,0xf8,0x93,0xc8]
>  ; X86-NEXT:    vpxor %xmm0, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xef,0xc0]
> -; X86-NEXT:    vpinsrw $0, %ecx, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xc4,0xc1,0x00]
> +; X86-NEXT:    vpinsrb $0, %ecx, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x00]
>  ; X86-NEXT:    kmovw %k2, %ecx ## encoding: [0xc5,0xf8,0x93,0xca]
> -; X86-NEXT:    vpinsrw $1, %ecx, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xc4,0xc1,0x01]
> +; X86-NEXT:    vpinsrb $1, %ecx, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x01]
>  ; X86-NEXT:    kmovw %k3, %ecx ## encoding: [0xc5,0xf8,0x93,0xcb]
> -; X86-NEXT:    vpinsrw $2, %ecx, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xc4,0xc1,0x02]
> +; X86-NEXT:    vpinsrb $2, %ecx, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x02]
>  ; X86-NEXT:    kmovw %k4, %ecx ## encoding: [0xc5,0xf8,0x93,0xcc]
> -; X86-NEXT:    vpinsrw $4, %ecx, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xc4,0xc1,0x04]
> +; X86-NEXT:    vpinsrb $4, %ecx, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x04]
>  ; X86-NEXT:    kmovw %k5, %ecx ## encoding: [0xc5,0xf8,0x93,0xcd]
> -; X86-NEXT:    vpinsrw $5, %ecx, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xc4,0xc1,0x05]
> +; X86-NEXT:    vpinsrb $5, %ecx, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x05]
>  ; X86-NEXT:    kmovw %k1, %ecx ## encoding: [0xc5,0xf8,0x93,0xc9]
> -; X86-NEXT:    vpinsrw $6, %ecx, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xc4,0xc1,0x06]
> -; X86-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xc4,0xc0,0x07]
> +; X86-NEXT:    vpinsrb $6, %ecx, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x06]
> +; X86-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x07]
>  ; X86-NEXT:    vzeroupper ## encoding: [0xc5,0xf8,0x77]
>  ; X86-NEXT:    retl ## encoding: [0xc3]
>  ;
> @@ -5670,18 +5670,18 @@ define <8 x i8> @test_mask_ucmp_q_512(<8
>  ; X64-NEXT:    vpcmpnleuq %zmm1, %zmm0, %k1 {%k1} ## encoding: [0x62,0xf3,0xfd,0x49,0x1e,0xc9,0x06]
>  ; X64-NEXT:    kmovw %k0, %eax ## encoding: [0xc5,0xf8,0x93,0xc0]
>  ; X64-NEXT:    vpxor %xmm0, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xef,0xc0]
> -; X64-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xc4,0xc0,0x00]
> +; X64-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x00]
>  ; X64-NEXT:    kmovw %k2, %eax ## encoding: [0xc5,0xf8,0x93,0xc2]
> -; X64-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xc4,0xc0,0x01]
> +; X64-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x01]
>  ; X64-NEXT:    kmovw %k3, %eax ## encoding: [0xc5,0xf8,0x93,0xc3]
> -; X64-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xc4,0xc0,0x02]
> +; X64-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x02]
>  ; X64-NEXT:    kmovw %k4, %eax ## encoding: [0xc5,0xf8,0x93,0xc4]
> -; X64-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xc4,0xc0,0x04]
> +; X64-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x04]
>  ; X64-NEXT:    kmovw %k5, %eax ## encoding: [0xc5,0xf8,0x93,0xc5]
> -; X64-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xc4,0xc0,0x05]
> +; X64-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x05]
>  ; X64-NEXT:    kmovw %k1, %eax ## encoding: [0xc5,0xf8,0x93,0xc1]
> -; X64-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xc4,0xc0,0x06]
> -; X64-NEXT:    vpinsrw $7, %edi, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xc4,0xc7,0x07]
> +; X64-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x06]
> +; X64-NEXT:    vpinsrb $7, %edi, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x20,0xc7,0x07]
>  ; X64-NEXT:    vzeroupper ## encoding: [0xc5,0xf8,0x77]
>  ; X64-NEXT:    retq ## encoding: [0xc3]
>    %res0 = call i8 @llvm.x86.avx512.mask.ucmp.q.512(<8 x i64> %a0, <8 x i64> %a1, i32 0, i8 %mask)
>
> Modified: llvm/trunk/test/CodeGen/X86/avx512-mask-op.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/avx512-mask-op.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/avx512-mask-op.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/avx512-mask-op.ll Wed Aug  7 09:24:26 2019
> @@ -2296,21 +2296,22 @@ define <2 x i16> @load_2i1(<2 x i1>* %a)
>  ; KNL-LABEL: load_2i1:
>  ; KNL:       ## %bb.0:
>  ; KNL-NEXT:    kmovw (%rdi), %k1
> -; KNL-NEXT:    vpternlogq $255, %zmm0, %zmm0, %zmm0 {%k1} {z}
> -; KNL-NEXT:    ## kill: def $xmm0 killed $xmm0 killed $zmm0
> +; KNL-NEXT:    vpternlogd $255, %zmm0, %zmm0, %zmm0 {%k1} {z}
> +; KNL-NEXT:    vpmovdw %zmm0, %ymm0
> +; KNL-NEXT:    ## kill: def $xmm0 killed $xmm0 killed $ymm0
>  ; KNL-NEXT:    vzeroupper
>  ; KNL-NEXT:    retq
>  ;
>  ; SKX-LABEL: load_2i1:
>  ; SKX:       ## %bb.0:
>  ; SKX-NEXT:    kmovb (%rdi), %k0
> -; SKX-NEXT:    vpmovm2q %k0, %xmm0
> +; SKX-NEXT:    vpmovm2w %k0, %xmm0
>  ; SKX-NEXT:    retq
>  ;
>  ; AVX512BW-LABEL: load_2i1:
>  ; AVX512BW:       ## %bb.0:
> -; AVX512BW-NEXT:    kmovw (%rdi), %k1
> -; AVX512BW-NEXT:    vpternlogq $255, %zmm0, %zmm0, %zmm0 {%k1} {z}
> +; AVX512BW-NEXT:    kmovw (%rdi), %k0
> +; AVX512BW-NEXT:    vpmovm2w %k0, %zmm0
>  ; AVX512BW-NEXT:    ## kill: def $xmm0 killed $xmm0 killed $zmm0
>  ; AVX512BW-NEXT:    vzeroupper
>  ; AVX512BW-NEXT:    retq
> @@ -2318,8 +2319,9 @@ define <2 x i16> @load_2i1(<2 x i1>* %a)
>  ; AVX512DQ-LABEL: load_2i1:
>  ; AVX512DQ:       ## %bb.0:
>  ; AVX512DQ-NEXT:    kmovb (%rdi), %k0
> -; AVX512DQ-NEXT:    vpmovm2q %k0, %zmm0
> -; AVX512DQ-NEXT:    ## kill: def $xmm0 killed $xmm0 killed $zmm0
> +; AVX512DQ-NEXT:    vpmovm2d %k0, %zmm0
> +; AVX512DQ-NEXT:    vpmovdw %zmm0, %ymm0
> +; AVX512DQ-NEXT:    ## kill: def $xmm0 killed $xmm0 killed $ymm0
>  ; AVX512DQ-NEXT:    vzeroupper
>  ; AVX512DQ-NEXT:    retq
>  ;
> @@ -2327,7 +2329,7 @@ define <2 x i16> @load_2i1(<2 x i1>* %a)
>  ; X86:       ## %bb.0:
>  ; X86-NEXT:    movl {{[0-9]+}}(%esp), %eax
>  ; X86-NEXT:    kmovb (%eax), %k0
> -; X86-NEXT:    vpmovm2q %k0, %xmm0
> +; X86-NEXT:    vpmovm2w %k0, %xmm0
>  ; X86-NEXT:    retl
>    %b = load <2 x i1>, <2 x i1>* %a
>    %c = sext <2 x i1> %b to <2 x i16>
> @@ -2339,20 +2341,21 @@ define <4 x i16> @load_4i1(<4 x i1>* %a)
>  ; KNL:       ## %bb.0:
>  ; KNL-NEXT:    kmovw (%rdi), %k1
>  ; KNL-NEXT:    vpternlogd $255, %zmm0, %zmm0, %zmm0 {%k1} {z}
> -; KNL-NEXT:    ## kill: def $xmm0 killed $xmm0 killed $zmm0
> +; KNL-NEXT:    vpmovdw %zmm0, %ymm0
> +; KNL-NEXT:    ## kill: def $xmm0 killed $xmm0 killed $ymm0
>  ; KNL-NEXT:    vzeroupper
>  ; KNL-NEXT:    retq
>  ;
>  ; SKX-LABEL: load_4i1:
>  ; SKX:       ## %bb.0:
>  ; SKX-NEXT:    kmovb (%rdi), %k0
> -; SKX-NEXT:    vpmovm2d %k0, %xmm0
> +; SKX-NEXT:    vpmovm2w %k0, %xmm0
>  ; SKX-NEXT:    retq
>  ;
>  ; AVX512BW-LABEL: load_4i1:
>  ; AVX512BW:       ## %bb.0:
> -; AVX512BW-NEXT:    kmovw (%rdi), %k1
> -; AVX512BW-NEXT:    vpternlogd $255, %zmm0, %zmm0, %zmm0 {%k1} {z}
> +; AVX512BW-NEXT:    kmovw (%rdi), %k0
> +; AVX512BW-NEXT:    vpmovm2w %k0, %zmm0
>  ; AVX512BW-NEXT:    ## kill: def $xmm0 killed $xmm0 killed $zmm0
>  ; AVX512BW-NEXT:    vzeroupper
>  ; AVX512BW-NEXT:    retq
> @@ -2361,7 +2364,8 @@ define <4 x i16> @load_4i1(<4 x i1>* %a)
>  ; AVX512DQ:       ## %bb.0:
>  ; AVX512DQ-NEXT:    kmovb (%rdi), %k0
>  ; AVX512DQ-NEXT:    vpmovm2d %k0, %zmm0
> -; AVX512DQ-NEXT:    ## kill: def $xmm0 killed $xmm0 killed $zmm0
> +; AVX512DQ-NEXT:    vpmovdw %zmm0, %ymm0
> +; AVX512DQ-NEXT:    ## kill: def $xmm0 killed $xmm0 killed $ymm0
>  ; AVX512DQ-NEXT:    vzeroupper
>  ; AVX512DQ-NEXT:    retq
>  ;
> @@ -2369,7 +2373,7 @@ define <4 x i16> @load_4i1(<4 x i1>* %a)
>  ; X86:       ## %bb.0:
>  ; X86-NEXT:    movl {{[0-9]+}}(%esp), %eax
>  ; X86-NEXT:    kmovb (%eax), %k0
> -; X86-NEXT:    vpmovm2d %k0, %xmm0
> +; X86-NEXT:    vpmovm2w %k0, %xmm0
>  ; X86-NEXT:    retl
>    %b = load <4 x i1>, <4 x i1>* %a
>    %c = sext <4 x i1> %b to <4 x i16>
>
> Modified: llvm/trunk/test/CodeGen/X86/avx512-trunc.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/avx512-trunc.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/avx512-trunc.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/avx512-trunc.ll Wed Aug  7 09:24:26 2019
> @@ -36,7 +36,7 @@ define <16 x i16> @trunc_v16i32_to_v16i1
>  define <8 x i8> @trunc_qb_512(<8 x i64> %i) #0 {
>  ; ALL-LABEL: trunc_qb_512:
>  ; ALL:       ## %bb.0:
> -; ALL-NEXT:    vpmovqw %zmm0, %xmm0
> +; ALL-NEXT:    vpmovqb %zmm0, %xmm0
>  ; ALL-NEXT:    vzeroupper
>  ; ALL-NEXT:    retq
>    %x = trunc <8 x i64> %i to <8 x i8>
> @@ -58,14 +58,13 @@ define <4 x i8> @trunc_qb_256(<4 x i64>
>  ; KNL-LABEL: trunc_qb_256:
>  ; KNL:       ## %bb.0:
>  ; KNL-NEXT:    ## kill: def $ymm0 killed $ymm0 def $zmm0
> -; KNL-NEXT:    vpmovqd %zmm0, %ymm0
> -; KNL-NEXT:    ## kill: def $xmm0 killed $xmm0 killed $ymm0
> +; KNL-NEXT:    vpmovqb %zmm0, %xmm0
>  ; KNL-NEXT:    vzeroupper
>  ; KNL-NEXT:    retq
>  ;
>  ; SKX-LABEL: trunc_qb_256:
>  ; SKX:       ## %bb.0:
> -; SKX-NEXT:    vpmovqd %ymm0, %xmm0
> +; SKX-NEXT:    vpmovqb %ymm0, %xmm0
>  ; SKX-NEXT:    vzeroupper
>  ; SKX-NEXT:    retq
>    %x = trunc <4 x i64> %i to <4 x i8>
> @@ -76,8 +75,7 @@ define void @trunc_qb_256_mem(<4 x i64>
>  ; KNL-LABEL: trunc_qb_256_mem:
>  ; KNL:       ## %bb.0:
>  ; KNL-NEXT:    ## kill: def $ymm0 killed $ymm0 def $zmm0
> -; KNL-NEXT:    vpmovqd %zmm0, %ymm0
> -; KNL-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
> +; KNL-NEXT:    vpmovqb %zmm0, %xmm0
>  ; KNL-NEXT:    vmovd %xmm0, (%rdi)
>  ; KNL-NEXT:    vzeroupper
>  ; KNL-NEXT:    retq
> @@ -95,6 +93,7 @@ define void @trunc_qb_256_mem(<4 x i64>
>  define <2 x i8> @trunc_qb_128(<2 x i64> %i) #0 {
>  ; ALL-LABEL: trunc_qb_128:
>  ; ALL:       ## %bb.0:
> +; ALL-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u]
>  ; ALL-NEXT:    retq
>    %x = trunc <2 x i64> %i to <2 x i8>
>    ret <2 x i8> %x
> @@ -141,14 +140,13 @@ define <4 x i16> @trunc_qw_256(<4 x i64>
>  ; KNL-LABEL: trunc_qw_256:
>  ; KNL:       ## %bb.0:
>  ; KNL-NEXT:    ## kill: def $ymm0 killed $ymm0 def $zmm0
> -; KNL-NEXT:    vpmovqd %zmm0, %ymm0
> -; KNL-NEXT:    ## kill: def $xmm0 killed $xmm0 killed $ymm0
> +; KNL-NEXT:    vpmovqw %zmm0, %xmm0
>  ; KNL-NEXT:    vzeroupper
>  ; KNL-NEXT:    retq
>  ;
>  ; SKX-LABEL: trunc_qw_256:
>  ; SKX:       ## %bb.0:
> -; SKX-NEXT:    vpmovqd %ymm0, %xmm0
> +; SKX-NEXT:    vpmovqw %ymm0, %xmm0
>  ; SKX-NEXT:    vzeroupper
>  ; SKX-NEXT:    retq
>    %x = trunc <4 x i64> %i to <4 x i16>
> @@ -159,8 +157,7 @@ define void @trunc_qw_256_mem(<4 x i64>
>  ; KNL-LABEL: trunc_qw_256_mem:
>  ; KNL:       ## %bb.0:
>  ; KNL-NEXT:    ## kill: def $ymm0 killed $ymm0 def $zmm0
> -; KNL-NEXT:    vpmovqd %zmm0, %ymm0
> -; KNL-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,1,4,5,8,9,12,13,8,9,12,13,12,13,14,15]
> +; KNL-NEXT:    vpmovqw %zmm0, %xmm0
>  ; KNL-NEXT:    vmovq %xmm0, (%rdi)
>  ; KNL-NEXT:    vzeroupper
>  ; KNL-NEXT:    retq
> @@ -176,9 +173,16 @@ define void @trunc_qw_256_mem(<4 x i64>
>  }
>
>  define <2 x i16> @trunc_qw_128(<2 x i64> %i) #0 {
> -; ALL-LABEL: trunc_qw_128:
> -; ALL:       ## %bb.0:
> -; ALL-NEXT:    retq
> +; KNL-LABEL: trunc_qw_128:
> +; KNL:       ## %bb.0:
> +; KNL-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> +; KNL-NEXT:    vpshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> +; KNL-NEXT:    retq
> +;
> +; SKX-LABEL: trunc_qw_128:
> +; SKX:       ## %bb.0:
> +; SKX-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,1,8,9,8,9,10,11,8,9,10,11,12,13,14,15]
> +; SKX-NEXT:    retq
>    %x = trunc <2 x i64> %i to <2 x i16>
>    ret <2 x i16> %x
>  }
> @@ -260,6 +264,7 @@ define void @trunc_qd_256_mem(<4 x i64>
>  define <2 x i32> @trunc_qd_128(<2 x i64> %i) #0 {
>  ; ALL-LABEL: trunc_qd_128:
>  ; ALL:       ## %bb.0:
> +; ALL-NEXT:    vpermilps {{.*#+}} xmm0 = xmm0[0,2,2,3]
>  ; ALL-NEXT:    retq
>    %x = trunc <2 x i64> %i to <2 x i32>
>    ret <2 x i32> %x
> @@ -306,14 +311,13 @@ define <8 x i8> @trunc_db_256(<8 x i32>
>  ; KNL-LABEL: trunc_db_256:
>  ; KNL:       ## %bb.0:
>  ; KNL-NEXT:    ## kill: def $ymm0 killed $ymm0 def $zmm0
> -; KNL-NEXT:    vpmovdw %zmm0, %ymm0
> -; KNL-NEXT:    ## kill: def $xmm0 killed $xmm0 killed $ymm0
> +; KNL-NEXT:    vpmovdb %zmm0, %xmm0
>  ; KNL-NEXT:    vzeroupper
>  ; KNL-NEXT:    retq
>  ;
>  ; SKX-LABEL: trunc_db_256:
>  ; SKX:       ## %bb.0:
> -; SKX-NEXT:    vpmovdw %ymm0, %xmm0
> +; SKX-NEXT:    vpmovdb %ymm0, %xmm0
>  ; SKX-NEXT:    vzeroupper
>  ; SKX-NEXT:    retq
>    %x = trunc <8 x i32> %i to <8 x i8>
> @@ -324,8 +328,7 @@ define void @trunc_db_256_mem(<8 x i32>
>  ; KNL-LABEL: trunc_db_256_mem:
>  ; KNL:       ## %bb.0:
>  ; KNL-NEXT:    ## kill: def $ymm0 killed $ymm0 def $zmm0
> -; KNL-NEXT:    vpmovdw %zmm0, %ymm0
> -; KNL-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u]
> +; KNL-NEXT:    vpmovdb %zmm0, %xmm0
>  ; KNL-NEXT:    vmovq %xmm0, (%rdi)
>  ; KNL-NEXT:    vzeroupper
>  ; KNL-NEXT:    retq
> @@ -343,6 +346,7 @@ define void @trunc_db_256_mem(<8 x i32>
>  define <4 x i8> @trunc_db_128(<4 x i32> %i) #0 {
>  ; ALL-LABEL: trunc_db_128:
>  ; ALL:       ## %bb.0:
> +; ALL-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
>  ; ALL-NEXT:    retq
>    %x = trunc <4 x i32> %i to <4 x i8>
>    ret <4 x i8> %x
> @@ -513,6 +517,7 @@ define void @trunc_wb_256_mem(<16 x i16>
>  define <8 x i8> @trunc_wb_128(<8 x i16> %i) #0 {
>  ; ALL-LABEL: trunc_wb_128:
>  ; ALL:       ## %bb.0:
> +; ALL-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u]
>  ; ALL-NEXT:    retq
>    %x = trunc <8 x i16> %i to <8 x i8>
>    ret <8 x i8> %x
> @@ -691,6 +696,7 @@ define <8 x i8> @usat_trunc_wb_128(<8 x
>  ; ALL-LABEL: usat_trunc_wb_128:
>  ; ALL:       ## %bb.0:
>  ; ALL-NEXT:    vpminuw {{.*}}(%rip), %xmm0, %xmm0
> +; ALL-NEXT:    vpackuswb %xmm0, %xmm0, %xmm0
>  ; ALL-NEXT:    retq
>    %x3 = icmp ult <8 x i16> %i, <i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255>
>    %x5 = select <8 x i1> %x3, <8 x i16> %i, <8 x i16> <i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255>
> @@ -716,16 +722,14 @@ define <16 x i8> @usat_trunc_db_256(<8 x
>  ; KNL:       ## %bb.0:
>  ; KNL-NEXT:    vpbroadcastd {{.*#+}} ymm1 = [255,255,255,255,255,255,255,255]
>  ; KNL-NEXT:    vpminud %ymm1, %ymm0, %ymm0
> -; KNL-NEXT:    vpmovdw %zmm0, %ymm0
> -; KNL-NEXT:    vpackuswb %xmm0, %xmm0, %xmm0
> +; KNL-NEXT:    vpmovdb %zmm0, %xmm0
>  ; KNL-NEXT:    vzeroupper
>  ; KNL-NEXT:    retq
>  ;
>  ; SKX-LABEL: usat_trunc_db_256:
>  ; SKX:       ## %bb.0:
>  ; SKX-NEXT:    vpminud {{.*}}(%rip){1to8}, %ymm0, %ymm0
> -; SKX-NEXT:    vpmovdw %ymm0, %xmm0
> -; SKX-NEXT:    vpackuswb %xmm0, %xmm0, %xmm0
> +; SKX-NEXT:    vpmovdb %ymm0, %xmm0
>  ; SKX-NEXT:    vzeroupper
>  ; SKX-NEXT:    retq
>    %tmp1 = icmp ult <8 x i32> %x, <i32 255, i32 255, i32 255, i32 255, i32 255, i32 255, i32 255, i32 255>
>
> Modified: llvm/trunk/test/CodeGen/X86/avx512-vec-cmp.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/avx512-vec-cmp.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/avx512-vec-cmp.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/avx512-vec-cmp.ll Wed Aug  7 09:24:26 2019
> @@ -886,22 +886,14 @@ define <8 x double> @test43(<8 x double>
>  define <4 x i32> @test44(<4 x i16> %x, <4 x i16> %y) #0 {
>  ; AVX512-LABEL: test44:
>  ; AVX512:       ## %bb.0:
> -; AVX512-NEXT:    vpxor %xmm2, %xmm2, %xmm2 ## encoding: [0xc5,0xe9,0xef,0xd2]
> -; AVX512-NEXT:    vpblendw $170, %xmm2, %xmm1, %xmm1 ## encoding: [0xc4,0xe3,0x71,0x0e,0xca,0xaa]
> -; AVX512-NEXT:    ## xmm1 = xmm1[0],xmm2[1],xmm1[2],xmm2[3],xmm1[4],xmm2[5],xmm1[6],xmm2[7]
> -; AVX512-NEXT:    vpblendw $170, %xmm2, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x0e,0xc2,0xaa]
> -; AVX512-NEXT:    ## xmm0 = xmm0[0],xmm2[1],xmm0[2],xmm2[3],xmm0[4],xmm2[5],xmm0[6],xmm2[7]
> -; AVX512-NEXT:    vpcmpeqd %xmm1, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0x76,0xc1]
> +; AVX512-NEXT:    vpcmpeqw %xmm1, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0x75,0xc1]
> +; AVX512-NEXT:    vpmovsxwd %xmm0, %xmm0 ## encoding: [0xc4,0xe2,0x79,0x23,0xc0]
>  ; AVX512-NEXT:    retq ## encoding: [0xc3]
>  ;
>  ; SKX-LABEL: test44:
>  ; SKX:       ## %bb.0:
> -; SKX-NEXT:    vpxor %xmm2, %xmm2, %xmm2 ## EVEX TO VEX Compression encoding: [0xc5,0xe9,0xef,0xd2]
> -; SKX-NEXT:    vpblendw $170, %xmm2, %xmm1, %xmm1 ## encoding: [0xc4,0xe3,0x71,0x0e,0xca,0xaa]
> -; SKX-NEXT:    ## xmm1 = xmm1[0],xmm2[1],xmm1[2],xmm2[3],xmm1[4],xmm2[5],xmm1[6],xmm2[7]
> -; SKX-NEXT:    vpblendw $170, %xmm2, %xmm0, %xmm0 ## encoding: [0xc4,0xe3,0x79,0x0e,0xc2,0xaa]
> -; SKX-NEXT:    ## xmm0 = xmm0[0],xmm2[1],xmm0[2],xmm2[3],xmm0[4],xmm2[5],xmm0[6],xmm2[7]
> -; SKX-NEXT:    vpcmpeqd %xmm1, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0x76,0xc1]
> +; SKX-NEXT:    vpcmpeqw %xmm1, %xmm0, %k0 ## encoding: [0x62,0xf1,0x7d,0x08,0x75,0xc1]
> +; SKX-NEXT:    vpmovm2d %k0, %xmm0 ## encoding: [0x62,0xf2,0x7e,0x08,0x38,0xc0]
>  ; SKX-NEXT:    retq ## encoding: [0xc3]
>    %mask = icmp eq <4 x i16> %x, %y
>    %1 = sext <4 x i1> %mask to <4 x i32>
> @@ -911,23 +903,17 @@ define <4 x i32> @test44(<4 x i16> %x, <
>  define <2 x i64> @test45(<2 x i16> %x, <2 x i16> %y) #0 {
>  ; AVX512-LABEL: test45:
>  ; AVX512:       ## %bb.0:
> -; AVX512-NEXT:    vpxor %xmm2, %xmm2, %xmm2 ## encoding: [0xc5,0xe9,0xef,0xd2]
> -; AVX512-NEXT:    vpblendw $17, %xmm1, %xmm2, %xmm1 ## encoding: [0xc4,0xe3,0x69,0x0e,0xc9,0x11]
> -; AVX512-NEXT:    ## xmm1 = xmm1[0],xmm2[1,2,3],xmm1[4],xmm2[5,6,7]
> -; AVX512-NEXT:    vpblendw $17, %xmm0, %xmm2, %xmm0 ## encoding: [0xc4,0xe3,0x69,0x0e,0xc0,0x11]
> -; AVX512-NEXT:    ## xmm0 = xmm0[0],xmm2[1,2,3],xmm0[4],xmm2[5,6,7]
> -; AVX512-NEXT:    vpcmpeqq %xmm1, %xmm0, %xmm0 ## encoding: [0xc4,0xe2,0x79,0x29,0xc1]
> -; AVX512-NEXT:    vpsrlq $63, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0x73,0xd0,0x3f]
> +; AVX512-NEXT:    vpcmpeqw %xmm1, %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0x75,0xc1]
> +; AVX512-NEXT:    vpmovzxwq %xmm0, %xmm0 ## encoding: [0xc4,0xe2,0x79,0x34,0xc0]
> +; AVX512-NEXT:    ## xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero
> +; AVX512-NEXT:    vpand {{.*}}(%rip), %xmm0, %xmm0 ## encoding: [0xc5,0xf9,0xdb,0x05,A,A,A,A]
> +; AVX512-NEXT:    ## fixup A - offset: 4, value: LCPI46_0-4, kind: reloc_riprel_4byte
>  ; AVX512-NEXT:    retq ## encoding: [0xc3]
>  ;
>  ; SKX-LABEL: test45:
>  ; SKX:       ## %bb.0:
> -; SKX-NEXT:    vpxor %xmm2, %xmm2, %xmm2 ## EVEX TO VEX Compression encoding: [0xc5,0xe9,0xef,0xd2]
> -; SKX-NEXT:    vpblendw $17, %xmm1, %xmm2, %xmm1 ## encoding: [0xc4,0xe3,0x69,0x0e,0xc9,0x11]
> -; SKX-NEXT:    ## xmm1 = xmm1[0],xmm2[1,2,3],xmm1[4],xmm2[5,6,7]
> -; SKX-NEXT:    vpblendw $17, %xmm0, %xmm2, %xmm0 ## encoding: [0xc4,0xe3,0x69,0x0e,0xc0,0x11]
> -; SKX-NEXT:    ## xmm0 = xmm0[0],xmm2[1,2,3],xmm0[4],xmm2[5,6,7]
> -; SKX-NEXT:    vpcmpeqq %xmm1, %xmm0, %xmm0 ## encoding: [0xc4,0xe2,0x79,0x29,0xc1]
> +; SKX-NEXT:    vpcmpeqw %xmm1, %xmm0, %k0 ## encoding: [0x62,0xf1,0x7d,0x08,0x75,0xc1]
> +; SKX-NEXT:    vpmovm2q %k0, %xmm0 ## encoding: [0x62,0xf2,0xfe,0x08,0x38,0xc0]
>  ; SKX-NEXT:    vpsrlq $63, %xmm0, %xmm0 ## EVEX TO VEX Compression encoding: [0xc5,0xf9,0x73,0xd0,0x3f]
>  ; SKX-NEXT:    retq ## encoding: [0xc3]
>    %mask = icmp eq <2 x i16> %x, %y
>
> Modified: llvm/trunk/test/CodeGen/X86/avx512-vec3-crash.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/avx512-vec3-crash.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/avx512-vec3-crash.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/avx512-vec3-crash.ll Wed Aug  7 09:24:26 2019
> @@ -6,19 +6,15 @@ define <3 x i8 > @foo(<3 x i8>%x, <3 x i
>  ; CHECK-LABEL: foo:
>  ; CHECK:       # %bb.0:
>  ; CHECK-NEXT:    vmovd %edi, %xmm0
> -; CHECK-NEXT:    vpinsrd $1, %esi, %xmm0, %xmm0
> -; CHECK-NEXT:    vpinsrd $2, %edx, %xmm0, %xmm0
> -; CHECK-NEXT:    vpslld $24, %xmm0, %xmm0
> +; CHECK-NEXT:    vpinsrb $1, %esi, %xmm0, %xmm0
> +; CHECK-NEXT:    vpinsrb $2, %edx, %xmm0, %xmm0
>  ; CHECK-NEXT:    vmovd %ecx, %xmm1
> -; CHECK-NEXT:    vpinsrd $1, %r8d, %xmm1, %xmm1
> -; CHECK-NEXT:    vpsrad $24, %xmm0, %xmm0
> -; CHECK-NEXT:    vpinsrd $2, %r9d, %xmm1, %xmm1
> -; CHECK-NEXT:    vpslld $24, %xmm1, %xmm1
> -; CHECK-NEXT:    vpsrad $24, %xmm1, %xmm1
> -; CHECK-NEXT:    vpcmpgtd %xmm0, %xmm1, %xmm0
> +; CHECK-NEXT:    vpinsrb $1, %r8d, %xmm1, %xmm1
> +; CHECK-NEXT:    vpinsrb $2, %r9d, %xmm1, %xmm1
> +; CHECK-NEXT:    vpcmpgtb %xmm0, %xmm1, %xmm0
>  ; CHECK-NEXT:    vpextrb $0, %xmm0, %eax
> -; CHECK-NEXT:    vpextrb $4, %xmm0, %edx
> -; CHECK-NEXT:    vpextrb $8, %xmm0, %ecx
> +; CHECK-NEXT:    vpextrb $1, %xmm0, %edx
> +; CHECK-NEXT:    vpextrb $2, %xmm0, %ecx
>  ; CHECK-NEXT:    # kill: def $al killed $al killed $eax
>  ; CHECK-NEXT:    # kill: def $dl killed $dl killed $edx
>  ; CHECK-NEXT:    # kill: def $cl killed $cl killed $ecx
>
> Modified: llvm/trunk/test/CodeGen/X86/avx512bwvl-intrinsics-upgrade.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/avx512bwvl-intrinsics-upgrade.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/avx512bwvl-intrinsics-upgrade.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/avx512bwvl-intrinsics-upgrade.ll Wed Aug  7 09:24:26 2019
> @@ -5133,19 +5133,19 @@ define <8 x i8> @test_cmp_w_128(<8 x i16
>  ; CHECK-NEXT:    vpcmpgtw %xmm1, %xmm0, %k5 # encoding: [0x62,0xf1,0x7d,0x08,0x65,0xe9]
>  ; CHECK-NEXT:    kmovd %k0, %eax # encoding: [0xc5,0xfb,0x93,0xc0]
>  ; CHECK-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xef,0xc0]
> -; CHECK-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xc4,0xc0,0x00]
> +; CHECK-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x00]
>  ; CHECK-NEXT:    kmovd %k1, %eax # encoding: [0xc5,0xfb,0x93,0xc1]
> -; CHECK-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xc4,0xc0,0x01]
> +; CHECK-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x01]
>  ; CHECK-NEXT:    kmovd %k2, %eax # encoding: [0xc5,0xfb,0x93,0xc2]
> -; CHECK-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xc4,0xc0,0x02]
> +; CHECK-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x02]
>  ; CHECK-NEXT:    kmovd %k3, %eax # encoding: [0xc5,0xfb,0x93,0xc3]
> -; CHECK-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xc4,0xc0,0x04]
> +; CHECK-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x04]
>  ; CHECK-NEXT:    kmovd %k4, %eax # encoding: [0xc5,0xfb,0x93,0xc4]
> -; CHECK-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xc4,0xc0,0x05]
> +; CHECK-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x05]
>  ; CHECK-NEXT:    kmovd %k5, %eax # encoding: [0xc5,0xfb,0x93,0xc5]
> -; CHECK-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xc4,0xc0,0x06]
> +; CHECK-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x06]
>  ; CHECK-NEXT:    movl $255, %eax # encoding: [0xb8,0xff,0x00,0x00,0x00]
> -; CHECK-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xc4,0xc0,0x07]
> +; CHECK-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x07]
>  ; CHECK-NEXT:    ret{{[l|q]}} # encoding: [0xc3]
>    %res0 = call i8 @llvm.x86.avx512.mask.cmp.w.128(<8 x i16> %a0, <8 x i16> %a1, i32 0, i8 -1)
>    %vec0 = insertelement <8 x i8> undef, i8 %res0, i32 0
> @@ -5169,7 +5169,7 @@ define <8 x i8> @test_cmp_w_128(<8 x i16
>  define <8 x i8> @test_mask_cmp_w_128(<8 x i16> %a0, <8 x i16> %a1, i8 %mask) {
>  ; X86-LABEL: test_mask_cmp_w_128:
>  ; X86:       # %bb.0:
> -; X86-NEXT:    movzwl {{[0-9]+}}(%esp), %eax # encoding: [0x0f,0xb7,0x44,0x24,0x04]
> +; X86-NEXT:    movl {{[0-9]+}}(%esp), %eax # encoding: [0x8b,0x44,0x24,0x04]
>  ; X86-NEXT:    kmovd %eax, %k1 # encoding: [0xc5,0xfb,0x92,0xc8]
>  ; X86-NEXT:    vpcmpeqw %xmm1, %xmm0, %k0 {%k1} # encoding: [0x62,0xf1,0x7d,0x09,0x75,0xc1]
>  ; X86-NEXT:    vpcmpgtw %xmm0, %xmm1, %k2 {%k1} # encoding: [0x62,0xf1,0x75,0x09,0x65,0xd0]
> @@ -5179,18 +5179,18 @@ define <8 x i8> @test_mask_cmp_w_128(<8
>  ; X86-NEXT:    vpcmpgtw %xmm1, %xmm0, %k1 {%k1} # encoding: [0x62,0xf1,0x7d,0x09,0x65,0xc9]
>  ; X86-NEXT:    kmovd %k0, %ecx # encoding: [0xc5,0xfb,0x93,0xc8]
>  ; X86-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xef,0xc0]
> -; X86-NEXT:    vpinsrw $0, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xc4,0xc1,0x00]
> +; X86-NEXT:    vpinsrb $0, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x00]
>  ; X86-NEXT:    kmovd %k2, %ecx # encoding: [0xc5,0xfb,0x93,0xca]
> -; X86-NEXT:    vpinsrw $1, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xc4,0xc1,0x01]
> +; X86-NEXT:    vpinsrb $1, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x01]
>  ; X86-NEXT:    kmovd %k3, %ecx # encoding: [0xc5,0xfb,0x93,0xcb]
> -; X86-NEXT:    vpinsrw $2, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xc4,0xc1,0x02]
> +; X86-NEXT:    vpinsrb $2, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x02]
>  ; X86-NEXT:    kmovd %k4, %ecx # encoding: [0xc5,0xfb,0x93,0xcc]
> -; X86-NEXT:    vpinsrw $4, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xc4,0xc1,0x04]
> +; X86-NEXT:    vpinsrb $4, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x04]
>  ; X86-NEXT:    kmovd %k5, %ecx # encoding: [0xc5,0xfb,0x93,0xcd]
> -; X86-NEXT:    vpinsrw $5, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xc4,0xc1,0x05]
> +; X86-NEXT:    vpinsrb $5, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x05]
>  ; X86-NEXT:    kmovd %k1, %ecx # encoding: [0xc5,0xfb,0x93,0xc9]
> -; X86-NEXT:    vpinsrw $6, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xc4,0xc1,0x06]
> -; X86-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xc4,0xc0,0x07]
> +; X86-NEXT:    vpinsrb $6, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x06]
> +; X86-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x07]
>  ; X86-NEXT:    retl # encoding: [0xc3]
>  ;
>  ; X64-LABEL: test_mask_cmp_w_128:
> @@ -5204,18 +5204,18 @@ define <8 x i8> @test_mask_cmp_w_128(<8
>  ; X64-NEXT:    vpcmpgtw %xmm1, %xmm0, %k1 {%k1} # encoding: [0x62,0xf1,0x7d,0x09,0x65,0xc9]
>  ; X64-NEXT:    kmovd %k0, %eax # encoding: [0xc5,0xfb,0x93,0xc0]
>  ; X64-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xef,0xc0]
> -; X64-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xc4,0xc0,0x00]
> +; X64-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x00]
>  ; X64-NEXT:    kmovd %k2, %eax # encoding: [0xc5,0xfb,0x93,0xc2]
> -; X64-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xc4,0xc0,0x01]
> +; X64-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x01]
>  ; X64-NEXT:    kmovd %k3, %eax # encoding: [0xc5,0xfb,0x93,0xc3]
> -; X64-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xc4,0xc0,0x02]
> +; X64-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x02]
>  ; X64-NEXT:    kmovd %k4, %eax # encoding: [0xc5,0xfb,0x93,0xc4]
> -; X64-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xc4,0xc0,0x04]
> +; X64-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x04]
>  ; X64-NEXT:    kmovd %k5, %eax # encoding: [0xc5,0xfb,0x93,0xc5]
> -; X64-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xc4,0xc0,0x05]
> +; X64-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x05]
>  ; X64-NEXT:    kmovd %k1, %eax # encoding: [0xc5,0xfb,0x93,0xc1]
> -; X64-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xc4,0xc0,0x06]
> -; X64-NEXT:    vpinsrw $7, %edi, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xc4,0xc7,0x07]
> +; X64-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x06]
> +; X64-NEXT:    vpinsrb $7, %edi, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x20,0xc7,0x07]
>  ; X64-NEXT:    retq # encoding: [0xc3]
>    %res0 = call i8 @llvm.x86.avx512.mask.cmp.w.128(<8 x i16> %a0, <8 x i16> %a1, i32 0, i8 %mask)
>    %vec0 = insertelement <8 x i8> undef, i8 %res0, i32 0
> @@ -5249,19 +5249,19 @@ define <8 x i8> @test_ucmp_w_128(<8 x i1
>  ; CHECK-NEXT:    vpcmpnleuw %xmm1, %xmm0, %k5 # encoding: [0x62,0xf3,0xfd,0x08,0x3e,0xe9,0x06]
>  ; CHECK-NEXT:    kmovd %k0, %eax # encoding: [0xc5,0xfb,0x93,0xc0]
>  ; CHECK-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xef,0xc0]
> -; CHECK-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xc4,0xc0,0x00]
> +; CHECK-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x00]
>  ; CHECK-NEXT:    kmovd %k1, %eax # encoding: [0xc5,0xfb,0x93,0xc1]
> -; CHECK-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xc4,0xc0,0x01]
> +; CHECK-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x01]
>  ; CHECK-NEXT:    kmovd %k2, %eax # encoding: [0xc5,0xfb,0x93,0xc2]
> -; CHECK-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xc4,0xc0,0x02]
> +; CHECK-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x02]
>  ; CHECK-NEXT:    kmovd %k3, %eax # encoding: [0xc5,0xfb,0x93,0xc3]
> -; CHECK-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xc4,0xc0,0x04]
> +; CHECK-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x04]
>  ; CHECK-NEXT:    kmovd %k4, %eax # encoding: [0xc5,0xfb,0x93,0xc4]
> -; CHECK-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xc4,0xc0,0x05]
> +; CHECK-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x05]
>  ; CHECK-NEXT:    kmovd %k5, %eax # encoding: [0xc5,0xfb,0x93,0xc5]
> -; CHECK-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xc4,0xc0,0x06]
> +; CHECK-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x06]
>  ; CHECK-NEXT:    movl $255, %eax # encoding: [0xb8,0xff,0x00,0x00,0x00]
> -; CHECK-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xc4,0xc0,0x07]
> +; CHECK-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x07]
>  ; CHECK-NEXT:    ret{{[l|q]}} # encoding: [0xc3]
>    %res0 = call i8 @llvm.x86.avx512.mask.ucmp.w.128(<8 x i16> %a0, <8 x i16> %a1, i32 0, i8 -1)
>    %vec0 = insertelement <8 x i8> undef, i8 %res0, i32 0
> @@ -5285,7 +5285,7 @@ define <8 x i8> @test_ucmp_w_128(<8 x i1
>  define <8 x i8> @test_mask_ucmp_w_128(<8 x i16> %a0, <8 x i16> %a1, i8 %mask) {
>  ; X86-LABEL: test_mask_ucmp_w_128:
>  ; X86:       # %bb.0:
> -; X86-NEXT:    movzwl {{[0-9]+}}(%esp), %eax # encoding: [0x0f,0xb7,0x44,0x24,0x04]
> +; X86-NEXT:    movl {{[0-9]+}}(%esp), %eax # encoding: [0x8b,0x44,0x24,0x04]
>  ; X86-NEXT:    kmovd %eax, %k1 # encoding: [0xc5,0xfb,0x92,0xc8]
>  ; X86-NEXT:    vpcmpeqw %xmm1, %xmm0, %k0 {%k1} # encoding: [0x62,0xf1,0x7d,0x09,0x75,0xc1]
>  ; X86-NEXT:    vpcmpltuw %xmm1, %xmm0, %k2 {%k1} # encoding: [0x62,0xf3,0xfd,0x09,0x3e,0xd1,0x01]
> @@ -5295,18 +5295,18 @@ define <8 x i8> @test_mask_ucmp_w_128(<8
>  ; X86-NEXT:    vpcmpnleuw %xmm1, %xmm0, %k1 {%k1} # encoding: [0x62,0xf3,0xfd,0x09,0x3e,0xc9,0x06]
>  ; X86-NEXT:    kmovd %k0, %ecx # encoding: [0xc5,0xfb,0x93,0xc8]
>  ; X86-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xef,0xc0]
> -; X86-NEXT:    vpinsrw $0, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xc4,0xc1,0x00]
> +; X86-NEXT:    vpinsrb $0, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x00]
>  ; X86-NEXT:    kmovd %k2, %ecx # encoding: [0xc5,0xfb,0x93,0xca]
> -; X86-NEXT:    vpinsrw $1, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xc4,0xc1,0x01]
> +; X86-NEXT:    vpinsrb $1, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x01]
>  ; X86-NEXT:    kmovd %k3, %ecx # encoding: [0xc5,0xfb,0x93,0xcb]
> -; X86-NEXT:    vpinsrw $2, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xc4,0xc1,0x02]
> +; X86-NEXT:    vpinsrb $2, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x02]
>  ; X86-NEXT:    kmovd %k4, %ecx # encoding: [0xc5,0xfb,0x93,0xcc]
> -; X86-NEXT:    vpinsrw $4, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xc4,0xc1,0x04]
> +; X86-NEXT:    vpinsrb $4, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x04]
>  ; X86-NEXT:    kmovd %k5, %ecx # encoding: [0xc5,0xfb,0x93,0xcd]
> -; X86-NEXT:    vpinsrw $5, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xc4,0xc1,0x05]
> +; X86-NEXT:    vpinsrb $5, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x05]
>  ; X86-NEXT:    kmovd %k1, %ecx # encoding: [0xc5,0xfb,0x93,0xc9]
> -; X86-NEXT:    vpinsrw $6, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xc4,0xc1,0x06]
> -; X86-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xc4,0xc0,0x07]
> +; X86-NEXT:    vpinsrb $6, %ecx, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x06]
> +; X86-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x07]
>  ; X86-NEXT:    retl # encoding: [0xc3]
>  ;
>  ; X64-LABEL: test_mask_ucmp_w_128:
> @@ -5320,18 +5320,18 @@ define <8 x i8> @test_mask_ucmp_w_128(<8
>  ; X64-NEXT:    vpcmpnleuw %xmm1, %xmm0, %k1 {%k1} # encoding: [0x62,0xf3,0xfd,0x09,0x3e,0xc9,0x06]
>  ; X64-NEXT:    kmovd %k0, %eax # encoding: [0xc5,0xfb,0x93,0xc0]
>  ; X64-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xef,0xc0]
> -; X64-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xc4,0xc0,0x00]
> +; X64-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x00]
>  ; X64-NEXT:    kmovd %k2, %eax # encoding: [0xc5,0xfb,0x93,0xc2]
> -; X64-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xc4,0xc0,0x01]
> +; X64-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x01]
>  ; X64-NEXT:    kmovd %k3, %eax # encoding: [0xc5,0xfb,0x93,0xc3]
> -; X64-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xc4,0xc0,0x02]
> +; X64-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x02]
>  ; X64-NEXT:    kmovd %k4, %eax # encoding: [0xc5,0xfb,0x93,0xc4]
> -; X64-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xc4,0xc0,0x04]
> +; X64-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x04]
>  ; X64-NEXT:    kmovd %k5, %eax # encoding: [0xc5,0xfb,0x93,0xc5]
> -; X64-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xc4,0xc0,0x05]
> +; X64-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x05]
>  ; X64-NEXT:    kmovd %k1, %eax # encoding: [0xc5,0xfb,0x93,0xc1]
> -; X64-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xc4,0xc0,0x06]
> -; X64-NEXT:    vpinsrw $7, %edi, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xc4,0xc7,0x07]
> +; X64-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x06]
> +; X64-NEXT:    vpinsrb $7, %edi, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x20,0xc7,0x07]
>  ; X64-NEXT:    retq # encoding: [0xc3]
>    %res0 = call i8 @llvm.x86.avx512.mask.ucmp.w.128(<8 x i16> %a0, <8 x i16> %a1, i32 0, i8 %mask)
>    %vec0 = insertelement <8 x i8> undef, i8 %res0, i32 0
>
> Modified: llvm/trunk/test/CodeGen/X86/avx512vl-intrinsics-fast-isel.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/avx512vl-intrinsics-fast-isel.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/avx512vl-intrinsics-fast-isel.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/avx512vl-intrinsics-fast-isel.ll Wed Aug  7 09:24:26 2019
> @@ -3326,6 +3326,8 @@ define <2 x i64> @test_mm256_cvtepi64_ep
>  ; CHECK-LABEL: test_mm256_cvtepi64_epi8:
>  ; CHECK:       # %bb.0: # %entry
>  ; CHECK-NEXT:    vpmovqb %ymm0, %xmm0
> +; CHECK-NEXT:    vpxor %xmm1, %xmm1, %xmm1
> +; CHECK-NEXT:    vpblendw {{.*#+}} xmm0 = xmm0[0,1],xmm1[2,3,4,5,6,7]
>  ; CHECK-NEXT:    vzeroupper
>  ; CHECK-NEXT:    ret{{[l|q]}}
>  entry:
> @@ -3339,6 +3341,7 @@ define <2 x i64> @test_mm256_cvtepi64_ep
>  ; CHECK-LABEL: test_mm256_cvtepi64_epi16:
>  ; CHECK:       # %bb.0: # %entry
>  ; CHECK-NEXT:    vpmovqw %ymm0, %xmm0
> +; CHECK-NEXT:    vmovq {{.*#+}} xmm0 = xmm0[0],zero
>  ; CHECK-NEXT:    vzeroupper
>  ; CHECK-NEXT:    ret{{[l|q]}}
>  entry:
> @@ -3352,6 +3355,7 @@ define <2 x i64> @test_mm256_cvtepi32_ep
>  ; CHECK-LABEL: test_mm256_cvtepi32_epi8:
>  ; CHECK:       # %bb.0: # %entry
>  ; CHECK-NEXT:    vpmovdb %ymm0, %xmm0
> +; CHECK-NEXT:    vmovq {{.*#+}} xmm0 = xmm0[0],zero
>  ; CHECK-NEXT:    vzeroupper
>  ; CHECK-NEXT:    ret{{[l|q]}}
>  entry:
>
> Modified: llvm/trunk/test/CodeGen/X86/avx512vl-intrinsics-upgrade.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/avx512vl-intrinsics-upgrade.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/avx512vl-intrinsics-upgrade.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/avx512vl-intrinsics-upgrade.ll Wed Aug  7 09:24:26 2019
> @@ -8069,19 +8069,19 @@ define <8 x i8> @test_cmp_d_256(<8 x i32
>  ; CHECK-NEXT:    vpcmpgtd %ymm1, %ymm0, %k5 # encoding: [0x62,0xf1,0x7d,0x28,0x66,0xe9]
>  ; CHECK-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
>  ; CHECK-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xef,0xc0]
> -; CHECK-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x00]
> +; CHECK-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x00]
>  ; CHECK-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> -; CHECK-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x01]
> +; CHECK-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x01]
>  ; CHECK-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> -; CHECK-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x02]
> +; CHECK-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x02]
>  ; CHECK-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> -; CHECK-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x04]
> +; CHECK-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x04]
>  ; CHECK-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> -; CHECK-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x05]
> +; CHECK-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x05]
>  ; CHECK-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> -; CHECK-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x06]
> +; CHECK-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x06]
>  ; CHECK-NEXT:    movl $255, %eax # encoding: [0xb8,0xff,0x00,0x00,0x00]
> -; CHECK-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x07]
> +; CHECK-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x07]
>  ; CHECK-NEXT:    vzeroupper # encoding: [0xc5,0xf8,0x77]
>  ; CHECK-NEXT:    ret{{[l|q]}} # encoding: [0xc3]
>    %res0 = call i8 @llvm.x86.avx512.mask.cmp.d.256(<8 x i32> %a0, <8 x i32> %a1, i32 0, i8 -1)
> @@ -8106,7 +8106,7 @@ define <8 x i8> @test_cmp_d_256(<8 x i32
>  define <8 x i8> @test_mask_cmp_d_256(<8 x i32> %a0, <8 x i32> %a1, i8 %mask) {
>  ; X86-LABEL: test_mask_cmp_d_256:
>  ; X86:       # %bb.0:
> -; X86-NEXT:    movzwl {{[0-9]+}}(%esp), %eax # encoding: [0x0f,0xb7,0x44,0x24,0x04]
> +; X86-NEXT:    movl {{[0-9]+}}(%esp), %eax # encoding: [0x8b,0x44,0x24,0x04]
>  ; X86-NEXT:    kmovw %eax, %k1 # encoding: [0xc5,0xf8,0x92,0xc8]
>  ; X86-NEXT:    vpcmpeqd %ymm1, %ymm0, %k0 {%k1} # encoding: [0x62,0xf1,0x7d,0x29,0x76,0xc1]
>  ; X86-NEXT:    vpcmpgtd %ymm0, %ymm1, %k2 {%k1} # encoding: [0x62,0xf1,0x75,0x29,0x66,0xd0]
> @@ -8116,18 +8116,18 @@ define <8 x i8> @test_mask_cmp_d_256(<8
>  ; X86-NEXT:    vpcmpgtd %ymm1, %ymm0, %k1 {%k1} # encoding: [0x62,0xf1,0x7d,0x29,0x66,0xc9]
>  ; X86-NEXT:    kmovw %k0, %ecx # encoding: [0xc5,0xf8,0x93,0xc8]
>  ; X86-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xef,0xc0]
> -; X86-NEXT:    vpinsrw $0, %ecx, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc1,0x00]
> +; X86-NEXT:    vpinsrb $0, %ecx, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x00]
>  ; X86-NEXT:    kmovw %k2, %ecx # encoding: [0xc5,0xf8,0x93,0xca]
> -; X86-NEXT:    vpinsrw $1, %ecx, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc1,0x01]
> +; X86-NEXT:    vpinsrb $1, %ecx, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x01]
>  ; X86-NEXT:    kmovw %k3, %ecx # encoding: [0xc5,0xf8,0x93,0xcb]
> -; X86-NEXT:    vpinsrw $2, %ecx, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc1,0x02]
> +; X86-NEXT:    vpinsrb $2, %ecx, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x02]
>  ; X86-NEXT:    kmovw %k4, %ecx # encoding: [0xc5,0xf8,0x93,0xcc]
> -; X86-NEXT:    vpinsrw $4, %ecx, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc1,0x04]
> +; X86-NEXT:    vpinsrb $4, %ecx, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x04]
>  ; X86-NEXT:    kmovw %k5, %ecx # encoding: [0xc5,0xf8,0x93,0xcd]
> -; X86-NEXT:    vpinsrw $5, %ecx, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc1,0x05]
> +; X86-NEXT:    vpinsrb $5, %ecx, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x05]
>  ; X86-NEXT:    kmovw %k1, %ecx # encoding: [0xc5,0xf8,0x93,0xc9]
> -; X86-NEXT:    vpinsrw $6, %ecx, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc1,0x06]
> -; X86-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x07]
> +; X86-NEXT:    vpinsrb $6, %ecx, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x06]
> +; X86-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x07]
>  ; X86-NEXT:    vzeroupper # encoding: [0xc5,0xf8,0x77]
>  ; X86-NEXT:    retl # encoding: [0xc3]
>  ;
> @@ -8142,18 +8142,18 @@ define <8 x i8> @test_mask_cmp_d_256(<8
>  ; X64-NEXT:    vpcmpgtd %ymm1, %ymm0, %k1 {%k1} # encoding: [0x62,0xf1,0x7d,0x29,0x66,0xc9]
>  ; X64-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
>  ; X64-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xef,0xc0]
> -; X64-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x00]
> +; X64-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x00]
>  ; X64-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> -; X64-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x01]
> +; X64-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x01]
>  ; X64-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> -; X64-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x02]
> +; X64-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x02]
>  ; X64-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> -; X64-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x04]
> +; X64-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x04]
>  ; X64-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> -; X64-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x05]
> +; X64-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x05]
>  ; X64-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> -; X64-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x06]
> -; X64-NEXT:    vpinsrw $7, %edi, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc7,0x07]
> +; X64-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x06]
> +; X64-NEXT:    vpinsrb $7, %edi, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc7,0x07]
>  ; X64-NEXT:    vzeroupper # encoding: [0xc5,0xf8,0x77]
>  ; X64-NEXT:    retq # encoding: [0xc3]
>    %res0 = call i8 @llvm.x86.avx512.mask.cmp.d.256(<8 x i32> %a0, <8 x i32> %a1, i32 0, i8 %mask)
> @@ -8188,19 +8188,19 @@ define <8 x i8> @test_ucmp_d_256(<8 x i3
>  ; CHECK-NEXT:    vpcmpnleud %ymm1, %ymm0, %k5 # encoding: [0x62,0xf3,0x7d,0x28,0x1e,0xe9,0x06]
>  ; CHECK-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
>  ; CHECK-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xef,0xc0]
> -; CHECK-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x00]
> +; CHECK-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x00]
>  ; CHECK-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> -; CHECK-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x01]
> +; CHECK-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x01]
>  ; CHECK-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> -; CHECK-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x02]
> +; CHECK-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x02]
>  ; CHECK-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> -; CHECK-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x04]
> +; CHECK-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x04]
>  ; CHECK-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> -; CHECK-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x05]
> +; CHECK-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x05]
>  ; CHECK-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> -; CHECK-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x06]
> +; CHECK-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x06]
>  ; CHECK-NEXT:    movl $255, %eax # encoding: [0xb8,0xff,0x00,0x00,0x00]
> -; CHECK-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x07]
> +; CHECK-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x07]
>  ; CHECK-NEXT:    vzeroupper # encoding: [0xc5,0xf8,0x77]
>  ; CHECK-NEXT:    ret{{[l|q]}} # encoding: [0xc3]
>    %res0 = call i8 @llvm.x86.avx512.mask.ucmp.d.256(<8 x i32> %a0, <8 x i32> %a1, i32 0, i8 -1)
> @@ -8225,7 +8225,7 @@ define <8 x i8> @test_ucmp_d_256(<8 x i3
>  define <8 x i8> @test_mask_ucmp_d_256(<8 x i32> %a0, <8 x i32> %a1, i8 %mask) {
>  ; X86-LABEL: test_mask_ucmp_d_256:
>  ; X86:       # %bb.0:
> -; X86-NEXT:    movzwl {{[0-9]+}}(%esp), %eax # encoding: [0x0f,0xb7,0x44,0x24,0x04]
> +; X86-NEXT:    movl {{[0-9]+}}(%esp), %eax # encoding: [0x8b,0x44,0x24,0x04]
>  ; X86-NEXT:    kmovw %eax, %k1 # encoding: [0xc5,0xf8,0x92,0xc8]
>  ; X86-NEXT:    vpcmpeqd %ymm1, %ymm0, %k0 {%k1} # encoding: [0x62,0xf1,0x7d,0x29,0x76,0xc1]
>  ; X86-NEXT:    vpcmpltud %ymm1, %ymm0, %k2 {%k1} # encoding: [0x62,0xf3,0x7d,0x29,0x1e,0xd1,0x01]
> @@ -8235,18 +8235,18 @@ define <8 x i8> @test_mask_ucmp_d_256(<8
>  ; X86-NEXT:    vpcmpnleud %ymm1, %ymm0, %k1 {%k1} # encoding: [0x62,0xf3,0x7d,0x29,0x1e,0xc9,0x06]
>  ; X86-NEXT:    kmovw %k0, %ecx # encoding: [0xc5,0xf8,0x93,0xc8]
>  ; X86-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xef,0xc0]
> -; X86-NEXT:    vpinsrw $0, %ecx, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc1,0x00]
> +; X86-NEXT:    vpinsrb $0, %ecx, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x00]
>  ; X86-NEXT:    kmovw %k2, %ecx # encoding: [0xc5,0xf8,0x93,0xca]
> -; X86-NEXT:    vpinsrw $1, %ecx, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc1,0x01]
> +; X86-NEXT:    vpinsrb $1, %ecx, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x01]
>  ; X86-NEXT:    kmovw %k3, %ecx # encoding: [0xc5,0xf8,0x93,0xcb]
> -; X86-NEXT:    vpinsrw $2, %ecx, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc1,0x02]
> +; X86-NEXT:    vpinsrb $2, %ecx, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x02]
>  ; X86-NEXT:    kmovw %k4, %ecx # encoding: [0xc5,0xf8,0x93,0xcc]
> -; X86-NEXT:    vpinsrw $4, %ecx, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc1,0x04]
> +; X86-NEXT:    vpinsrb $4, %ecx, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x04]
>  ; X86-NEXT:    kmovw %k5, %ecx # encoding: [0xc5,0xf8,0x93,0xcd]
> -; X86-NEXT:    vpinsrw $5, %ecx, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc1,0x05]
> +; X86-NEXT:    vpinsrb $5, %ecx, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x05]
>  ; X86-NEXT:    kmovw %k1, %ecx # encoding: [0xc5,0xf8,0x93,0xc9]
> -; X86-NEXT:    vpinsrw $6, %ecx, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc1,0x06]
> -; X86-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x07]
> +; X86-NEXT:    vpinsrb $6, %ecx, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc1,0x06]
> +; X86-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x07]
>  ; X86-NEXT:    vzeroupper # encoding: [0xc5,0xf8,0x77]
>  ; X86-NEXT:    retl # encoding: [0xc3]
>  ;
> @@ -8261,18 +8261,18 @@ define <8 x i8> @test_mask_ucmp_d_256(<8
>  ; X64-NEXT:    vpcmpnleud %ymm1, %ymm0, %k1 {%k1} # encoding: [0x62,0xf3,0x7d,0x29,0x1e,0xc9,0x06]
>  ; X64-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
>  ; X64-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xef,0xc0]
> -; X64-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x00]
> +; X64-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x00]
>  ; X64-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> -; X64-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x01]
> +; X64-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x01]
>  ; X64-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> -; X64-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x02]
> +; X64-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x02]
>  ; X64-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> -; X64-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x04]
> +; X64-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x04]
>  ; X64-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> -; X64-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x05]
> +; X64-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x05]
>  ; X64-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> -; X64-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x06]
> -; X64-NEXT:    vpinsrw $7, %edi, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc7,0x07]
> +; X64-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x06]
> +; X64-NEXT:    vpinsrb $7, %edi, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc7,0x07]
>  ; X64-NEXT:    vzeroupper # encoding: [0xc5,0xf8,0x77]
>  ; X64-NEXT:    retq # encoding: [0xc3]
>    %res0 = call i8 @llvm.x86.avx512.mask.ucmp.d.256(<8 x i32> %a0, <8 x i32> %a1, i32 0, i8 %mask)
> @@ -8307,19 +8307,19 @@ define <8 x i8> @test_cmp_q_256(<4 x i64
>  ; CHECK-NEXT:    vpcmpgtq %ymm1, %ymm0, %k5 # encoding: [0x62,0xf2,0xfd,0x28,0x37,0xe9]
>  ; CHECK-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
>  ; CHECK-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xef,0xc0]
> -; CHECK-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x00]
> +; CHECK-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x00]
>  ; CHECK-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> -; CHECK-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x01]
> +; CHECK-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x01]
>  ; CHECK-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> -; CHECK-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x02]
> +; CHECK-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x02]
>  ; CHECK-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> -; CHECK-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x04]
> +; CHECK-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x04]
>  ; CHECK-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> -; CHECK-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x05]
> +; CHECK-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x05]
>  ; CHECK-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> -; CHECK-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x06]
> +; CHECK-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x06]
>  ; CHECK-NEXT:    movl $15, %eax # encoding: [0xb8,0x0f,0x00,0x00,0x00]
> -; CHECK-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x07]
> +; CHECK-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x07]
>  ; CHECK-NEXT:    vzeroupper # encoding: [0xc5,0xf8,0x77]
>  ; CHECK-NEXT:    ret{{[l|q]}} # encoding: [0xc3]
>    %res0 = call i8 @llvm.x86.avx512.mask.cmp.q.256(<4 x i64> %a0, <4 x i64> %a1, i32 0, i8 -1)
> @@ -8356,19 +8356,19 @@ define <8 x i8> @test_mask_cmp_q_256(<4
>  ; X86-NEXT:    kshiftrw $12, %k2, %k2 # encoding: [0xc4,0xe3,0xf9,0x30,0xd2,0x0c]
>  ; X86-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
>  ; X86-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xef,0xc0]
> -; X86-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x00]
> +; X86-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x00]
>  ; X86-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> -; X86-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x01]
> +; X86-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x01]
>  ; X86-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> -; X86-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x02]
> +; X86-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x02]
>  ; X86-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> -; X86-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x04]
> +; X86-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x04]
>  ; X86-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> -; X86-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x05]
> +; X86-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x05]
>  ; X86-NEXT:    kmovw %k6, %eax # encoding: [0xc5,0xf8,0x93,0xc6]
> -; X86-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x06]
> +; X86-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x06]
>  ; X86-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> -; X86-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x07]
> +; X86-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x07]
>  ; X86-NEXT:    vzeroupper # encoding: [0xc5,0xf8,0x77]
>  ; X86-NEXT:    retl # encoding: [0xc3]
>  ;
> @@ -8385,19 +8385,19 @@ define <8 x i8> @test_mask_cmp_q_256(<4
>  ; X64-NEXT:    kshiftrw $12, %k2, %k2 # encoding: [0xc4,0xe3,0xf9,0x30,0xd2,0x0c]
>  ; X64-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
>  ; X64-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xef,0xc0]
> -; X64-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x00]
> +; X64-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x00]
>  ; X64-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> -; X64-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x01]
> +; X64-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x01]
>  ; X64-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> -; X64-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x02]
> +; X64-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x02]
>  ; X64-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> -; X64-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x04]
> +; X64-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x04]
>  ; X64-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> -; X64-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x05]
> +; X64-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x05]
>  ; X64-NEXT:    kmovw %k6, %eax # encoding: [0xc5,0xf8,0x93,0xc6]
> -; X64-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x06]
> +; X64-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x06]
>  ; X64-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> -; X64-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x07]
> +; X64-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x07]
>  ; X64-NEXT:    vzeroupper # encoding: [0xc5,0xf8,0x77]
>  ; X64-NEXT:    retq # encoding: [0xc3]
>    %res0 = call i8 @llvm.x86.avx512.mask.cmp.q.256(<4 x i64> %a0, <4 x i64> %a1, i32 0, i8 %mask)
> @@ -8432,19 +8432,19 @@ define <8 x i8> @test_ucmp_q_256(<4 x i6
>  ; CHECK-NEXT:    vpcmpnleuq %ymm1, %ymm0, %k5 # encoding: [0x62,0xf3,0xfd,0x28,0x1e,0xe9,0x06]
>  ; CHECK-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
>  ; CHECK-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xef,0xc0]
> -; CHECK-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x00]
> +; CHECK-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x00]
>  ; CHECK-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> -; CHECK-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x01]
> +; CHECK-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x01]
>  ; CHECK-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> -; CHECK-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x02]
> +; CHECK-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x02]
>  ; CHECK-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> -; CHECK-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x04]
> +; CHECK-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x04]
>  ; CHECK-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> -; CHECK-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x05]
> +; CHECK-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x05]
>  ; CHECK-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> -; CHECK-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x06]
> +; CHECK-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x06]
>  ; CHECK-NEXT:    movl $15, %eax # encoding: [0xb8,0x0f,0x00,0x00,0x00]
> -; CHECK-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x07]
> +; CHECK-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x07]
>  ; CHECK-NEXT:    vzeroupper # encoding: [0xc5,0xf8,0x77]
>  ; CHECK-NEXT:    ret{{[l|q]}} # encoding: [0xc3]
>    %res0 = call i8 @llvm.x86.avx512.mask.ucmp.q.256(<4 x i64> %a0, <4 x i64> %a1, i32 0, i8 -1)
> @@ -8481,19 +8481,19 @@ define <8 x i8> @test_mask_ucmp_q_256(<4
>  ; X86-NEXT:    kshiftrw $12, %k2, %k2 # encoding: [0xc4,0xe3,0xf9,0x30,0xd2,0x0c]
>  ; X86-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
>  ; X86-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xef,0xc0]
> -; X86-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x00]
> +; X86-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x00]
>  ; X86-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> -; X86-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x01]
> +; X86-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x01]
>  ; X86-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> -; X86-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x02]
> +; X86-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x02]
>  ; X86-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> -; X86-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x04]
> +; X86-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x04]
>  ; X86-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> -; X86-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x05]
> +; X86-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x05]
>  ; X86-NEXT:    kmovw %k6, %eax # encoding: [0xc5,0xf8,0x93,0xc6]
> -; X86-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x06]
> +; X86-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x06]
>  ; X86-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> -; X86-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x07]
> +; X86-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x07]
>  ; X86-NEXT:    vzeroupper # encoding: [0xc5,0xf8,0x77]
>  ; X86-NEXT:    retl # encoding: [0xc3]
>  ;
> @@ -8510,19 +8510,19 @@ define <8 x i8> @test_mask_ucmp_q_256(<4
>  ; X64-NEXT:    kshiftrw $12, %k2, %k2 # encoding: [0xc4,0xe3,0xf9,0x30,0xd2,0x0c]
>  ; X64-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
>  ; X64-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xef,0xc0]
> -; X64-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x00]
> +; X64-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x00]
>  ; X64-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> -; X64-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x01]
> +; X64-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x01]
>  ; X64-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> -; X64-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x02]
> +; X64-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x02]
>  ; X64-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> -; X64-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x04]
> +; X64-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x04]
>  ; X64-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> -; X64-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x05]
> +; X64-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x05]
>  ; X64-NEXT:    kmovw %k6, %eax # encoding: [0xc5,0xf8,0x93,0xc6]
> -; X64-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x06]
> +; X64-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x06]
>  ; X64-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> -; X64-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x07]
> +; X64-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x07]
>  ; X64-NEXT:    vzeroupper # encoding: [0xc5,0xf8,0x77]
>  ; X64-NEXT:    retq # encoding: [0xc3]
>    %res0 = call i8 @llvm.x86.avx512.mask.ucmp.q.256(<4 x i64> %a0, <4 x i64> %a1, i32 0, i8 %mask)
> @@ -8557,19 +8557,19 @@ define <8 x i8> @test_cmp_d_128(<4 x i32
>  ; CHECK-NEXT:    vpcmpgtd %xmm1, %xmm0, %k5 # encoding: [0x62,0xf1,0x7d,0x08,0x66,0xe9]
>  ; CHECK-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
>  ; CHECK-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xef,0xc0]
> -; CHECK-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x00]
> +; CHECK-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x00]
>  ; CHECK-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> -; CHECK-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x01]
> +; CHECK-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x01]
>  ; CHECK-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> -; CHECK-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x02]
> +; CHECK-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x02]
>  ; CHECK-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> -; CHECK-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x04]
> +; CHECK-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x04]
>  ; CHECK-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> -; CHECK-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x05]
> +; CHECK-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x05]
>  ; CHECK-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> -; CHECK-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x06]
> +; CHECK-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x06]
>  ; CHECK-NEXT:    movl $15, %eax # encoding: [0xb8,0x0f,0x00,0x00,0x00]
> -; CHECK-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x07]
> +; CHECK-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x07]
>  ; CHECK-NEXT:    ret{{[l|q]}} # encoding: [0xc3]
>    %res0 = call i8 @llvm.x86.avx512.mask.cmp.d.128(<4 x i32> %a0, <4 x i32> %a1, i32 0, i8 -1)
>    %vec0 = insertelement <8 x i8> undef, i8 %res0, i32 0
> @@ -8605,19 +8605,19 @@ define <8 x i8> @test_mask_cmp_d_128(<4
>  ; X86-NEXT:    kshiftrw $12, %k2, %k2 # encoding: [0xc4,0xe3,0xf9,0x30,0xd2,0x0c]
>  ; X86-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
>  ; X86-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xef,0xc0]
> -; X86-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x00]
> +; X86-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x00]
>  ; X86-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> -; X86-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x01]
> +; X86-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x01]
>  ; X86-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> -; X86-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x02]
> +; X86-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x02]
>  ; X86-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> -; X86-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x04]
> +; X86-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x04]
>  ; X86-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> -; X86-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x05]
> +; X86-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x05]
>  ; X86-NEXT:    kmovw %k6, %eax # encoding: [0xc5,0xf8,0x93,0xc6]
> -; X86-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x06]
> +; X86-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x06]
>  ; X86-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> -; X86-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x07]
> +; X86-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x07]
>  ; X86-NEXT:    retl # encoding: [0xc3]
>  ;
>  ; X64-LABEL: test_mask_cmp_d_128:
> @@ -8633,19 +8633,19 @@ define <8 x i8> @test_mask_cmp_d_128(<4
>  ; X64-NEXT:    kshiftrw $12, %k2, %k2 # encoding: [0xc4,0xe3,0xf9,0x30,0xd2,0x0c]
>  ; X64-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
>  ; X64-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xef,0xc0]
> -; X64-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x00]
> +; X64-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x00]
>  ; X64-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> -; X64-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x01]
> +; X64-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x01]
>  ; X64-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> -; X64-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x02]
> +; X64-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x02]
>  ; X64-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> -; X64-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x04]
> +; X64-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x04]
>  ; X64-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> -; X64-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x05]
> +; X64-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x05]
>  ; X64-NEXT:    kmovw %k6, %eax # encoding: [0xc5,0xf8,0x93,0xc6]
> -; X64-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x06]
> +; X64-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x06]
>  ; X64-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> -; X64-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x07]
> +; X64-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x07]
>  ; X64-NEXT:    retq # encoding: [0xc3]
>    %res0 = call i8 @llvm.x86.avx512.mask.cmp.d.128(<4 x i32> %a0, <4 x i32> %a1, i32 0, i8 %mask)
>    %vec0 = insertelement <8 x i8> undef, i8 %res0, i32 0
> @@ -8679,19 +8679,19 @@ define <8 x i8> @test_ucmp_d_128(<4 x i3
>  ; CHECK-NEXT:    vpcmpnleud %xmm1, %xmm0, %k5 # encoding: [0x62,0xf3,0x7d,0x08,0x1e,0xe9,0x06]
>  ; CHECK-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
>  ; CHECK-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xef,0xc0]
> -; CHECK-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x00]
> +; CHECK-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x00]
>  ; CHECK-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> -; CHECK-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x01]
> +; CHECK-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x01]
>  ; CHECK-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> -; CHECK-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x02]
> +; CHECK-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x02]
>  ; CHECK-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> -; CHECK-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x04]
> +; CHECK-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x04]
>  ; CHECK-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> -; CHECK-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x05]
> +; CHECK-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x05]
>  ; CHECK-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> -; CHECK-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x06]
> +; CHECK-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x06]
>  ; CHECK-NEXT:    movl $15, %eax # encoding: [0xb8,0x0f,0x00,0x00,0x00]
> -; CHECK-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x07]
> +; CHECK-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x07]
>  ; CHECK-NEXT:    ret{{[l|q]}} # encoding: [0xc3]
>    %res0 = call i8 @llvm.x86.avx512.mask.ucmp.d.128(<4 x i32> %a0, <4 x i32> %a1, i32 0, i8 -1)
>    %vec0 = insertelement <8 x i8> undef, i8 %res0, i32 0
> @@ -8727,19 +8727,19 @@ define <8 x i8> @test_mask_ucmp_d_128(<4
>  ; X86-NEXT:    kshiftrw $12, %k2, %k2 # encoding: [0xc4,0xe3,0xf9,0x30,0xd2,0x0c]
>  ; X86-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
>  ; X86-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xef,0xc0]
> -; X86-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x00]
> +; X86-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x00]
>  ; X86-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> -; X86-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x01]
> +; X86-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x01]
>  ; X86-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> -; X86-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x02]
> +; X86-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x02]
>  ; X86-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> -; X86-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x04]
> +; X86-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x04]
>  ; X86-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> -; X86-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x05]
> +; X86-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x05]
>  ; X86-NEXT:    kmovw %k6, %eax # encoding: [0xc5,0xf8,0x93,0xc6]
> -; X86-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x06]
> +; X86-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x06]
>  ; X86-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> -; X86-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x07]
> +; X86-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x07]
>  ; X86-NEXT:    retl # encoding: [0xc3]
>  ;
>  ; X64-LABEL: test_mask_ucmp_d_128:
> @@ -8755,19 +8755,19 @@ define <8 x i8> @test_mask_ucmp_d_128(<4
>  ; X64-NEXT:    kshiftrw $12, %k2, %k2 # encoding: [0xc4,0xe3,0xf9,0x30,0xd2,0x0c]
>  ; X64-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
>  ; X64-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xef,0xc0]
> -; X64-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x00]
> +; X64-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x00]
>  ; X64-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> -; X64-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x01]
> +; X64-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x01]
>  ; X64-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> -; X64-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x02]
> +; X64-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x02]
>  ; X64-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> -; X64-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x04]
> +; X64-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x04]
>  ; X64-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> -; X64-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x05]
> +; X64-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x05]
>  ; X64-NEXT:    kmovw %k6, %eax # encoding: [0xc5,0xf8,0x93,0xc6]
> -; X64-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x06]
> +; X64-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x06]
>  ; X64-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> -; X64-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x07]
> +; X64-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x07]
>  ; X64-NEXT:    retq # encoding: [0xc3]
>    %res0 = call i8 @llvm.x86.avx512.mask.ucmp.d.128(<4 x i32> %a0, <4 x i32> %a1, i32 0, i8 %mask)
>    %vec0 = insertelement <8 x i8> undef, i8 %res0, i32 0
> @@ -8801,19 +8801,19 @@ define <8 x i8> @test_cmp_q_128(<2 x i64
>  ; CHECK-NEXT:    vpcmpgtq %xmm1, %xmm0, %k5 # encoding: [0x62,0xf2,0xfd,0x08,0x37,0xe9]
>  ; CHECK-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
>  ; CHECK-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xef,0xc0]
> -; CHECK-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x00]
> +; CHECK-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x00]
>  ; CHECK-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> -; CHECK-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x01]
> +; CHECK-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x01]
>  ; CHECK-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> -; CHECK-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x02]
> +; CHECK-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x02]
>  ; CHECK-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> -; CHECK-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x04]
> +; CHECK-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x04]
>  ; CHECK-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> -; CHECK-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x05]
> +; CHECK-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x05]
>  ; CHECK-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> -; CHECK-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x06]
> +; CHECK-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x06]
>  ; CHECK-NEXT:    movl $3, %eax # encoding: [0xb8,0x03,0x00,0x00,0x00]
> -; CHECK-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x07]
> +; CHECK-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x07]
>  ; CHECK-NEXT:    ret{{[l|q]}} # encoding: [0xc3]
>    %res0 = call i8 @llvm.x86.avx512.mask.cmp.q.128(<2 x i64> %a0, <2 x i64> %a1, i32 0, i8 -1)
>    %vec0 = insertelement <8 x i8> undef, i8 %res0, i32 0
> @@ -8849,19 +8849,19 @@ define <8 x i8> @test_mask_cmp_q_128(<2
>  ; X86-NEXT:    kshiftrw $14, %k2, %k2 # encoding: [0xc4,0xe3,0xf9,0x30,0xd2,0x0e]
>  ; X86-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
>  ; X86-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xef,0xc0]
> -; X86-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x00]
> +; X86-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x00]
>  ; X86-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> -; X86-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x01]
> +; X86-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x01]
>  ; X86-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> -; X86-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x02]
> +; X86-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x02]
>  ; X86-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> -; X86-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x04]
> +; X86-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x04]
>  ; X86-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> -; X86-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x05]
> +; X86-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x05]
>  ; X86-NEXT:    kmovw %k6, %eax # encoding: [0xc5,0xf8,0x93,0xc6]
> -; X86-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x06]
> +; X86-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x06]
>  ; X86-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> -; X86-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x07]
> +; X86-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x07]
>  ; X86-NEXT:    retl # encoding: [0xc3]
>  ;
>  ; X64-LABEL: test_mask_cmp_q_128:
> @@ -8877,19 +8877,19 @@ define <8 x i8> @test_mask_cmp_q_128(<2
>  ; X64-NEXT:    kshiftrw $14, %k2, %k2 # encoding: [0xc4,0xe3,0xf9,0x30,0xd2,0x0e]
>  ; X64-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
>  ; X64-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xef,0xc0]
> -; X64-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x00]
> +; X64-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x00]
>  ; X64-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> -; X64-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x01]
> +; X64-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x01]
>  ; X64-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> -; X64-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x02]
> +; X64-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x02]
>  ; X64-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> -; X64-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x04]
> +; X64-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x04]
>  ; X64-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> -; X64-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x05]
> +; X64-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x05]
>  ; X64-NEXT:    kmovw %k6, %eax # encoding: [0xc5,0xf8,0x93,0xc6]
> -; X64-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x06]
> +; X64-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x06]
>  ; X64-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> -; X64-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x07]
> +; X64-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x07]
>  ; X64-NEXT:    retq # encoding: [0xc3]
>    %res0 = call i8 @llvm.x86.avx512.mask.cmp.q.128(<2 x i64> %a0, <2 x i64> %a1, i32 0, i8 %mask)
>    %vec0 = insertelement <8 x i8> undef, i8 %res0, i32 0
> @@ -8923,19 +8923,19 @@ define <8 x i8> @test_ucmp_q_128(<2 x i6
>  ; CHECK-NEXT:    vpcmpnleuq %xmm1, %xmm0, %k5 # encoding: [0x62,0xf3,0xfd,0x08,0x1e,0xe9,0x06]
>  ; CHECK-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
>  ; CHECK-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xef,0xc0]
> -; CHECK-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x00]
> +; CHECK-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x00]
>  ; CHECK-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> -; CHECK-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x01]
> +; CHECK-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x01]
>  ; CHECK-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> -; CHECK-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x02]
> +; CHECK-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x02]
>  ; CHECK-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> -; CHECK-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x04]
> +; CHECK-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x04]
>  ; CHECK-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> -; CHECK-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x05]
> +; CHECK-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x05]
>  ; CHECK-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> -; CHECK-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x06]
> +; CHECK-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x06]
>  ; CHECK-NEXT:    movl $3, %eax # encoding: [0xb8,0x03,0x00,0x00,0x00]
> -; CHECK-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x07]
> +; CHECK-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x07]
>  ; CHECK-NEXT:    ret{{[l|q]}} # encoding: [0xc3]
>    %res0 = call i8 @llvm.x86.avx512.mask.ucmp.q.128(<2 x i64> %a0, <2 x i64> %a1, i32 0, i8 -1)
>    %vec0 = insertelement <8 x i8> undef, i8 %res0, i32 0
> @@ -8971,19 +8971,19 @@ define <8 x i8> @test_mask_ucmp_q_128(<2
>  ; X86-NEXT:    kshiftrw $14, %k2, %k2 # encoding: [0xc4,0xe3,0xf9,0x30,0xd2,0x0e]
>  ; X86-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
>  ; X86-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xef,0xc0]
> -; X86-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x00]
> +; X86-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x00]
>  ; X86-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> -; X86-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x01]
> +; X86-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x01]
>  ; X86-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> -; X86-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x02]
> +; X86-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x02]
>  ; X86-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> -; X86-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x04]
> +; X86-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x04]
>  ; X86-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> -; X86-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x05]
> +; X86-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x05]
>  ; X86-NEXT:    kmovw %k6, %eax # encoding: [0xc5,0xf8,0x93,0xc6]
> -; X86-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x06]
> +; X86-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x06]
>  ; X86-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> -; X86-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x07]
> +; X86-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x07]
>  ; X86-NEXT:    retl # encoding: [0xc3]
>  ;
>  ; X64-LABEL: test_mask_ucmp_q_128:
> @@ -8999,19 +8999,19 @@ define <8 x i8> @test_mask_ucmp_q_128(<2
>  ; X64-NEXT:    kshiftrw $14, %k2, %k2 # encoding: [0xc4,0xe3,0xf9,0x30,0xd2,0x0e]
>  ; X64-NEXT:    kmovw %k0, %eax # encoding: [0xc5,0xf8,0x93,0xc0]
>  ; X64-NEXT:    vpxor %xmm0, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc5,0xf9,0xef,0xc0]
> -; X64-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x00]
> +; X64-NEXT:    vpinsrb $0, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x00]
>  ; X64-NEXT:    kmovw %k1, %eax # encoding: [0xc5,0xf8,0x93,0xc1]
> -; X64-NEXT:    vpinsrw $1, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x01]
> +; X64-NEXT:    vpinsrb $1, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x01]
>  ; X64-NEXT:    kmovw %k3, %eax # encoding: [0xc5,0xf8,0x93,0xc3]
> -; X64-NEXT:    vpinsrw $2, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x02]
> +; X64-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x02]
>  ; X64-NEXT:    kmovw %k4, %eax # encoding: [0xc5,0xf8,0x93,0xc4]
> -; X64-NEXT:    vpinsrw $4, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x04]
> +; X64-NEXT:    vpinsrb $4, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x04]
>  ; X64-NEXT:    kmovw %k5, %eax # encoding: [0xc5,0xf8,0x93,0xc5]
> -; X64-NEXT:    vpinsrw $5, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x05]
> +; X64-NEXT:    vpinsrb $5, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x05]
>  ; X64-NEXT:    kmovw %k6, %eax # encoding: [0xc5,0xf8,0x93,0xc6]
> -; X64-NEXT:    vpinsrw $6, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x06]
> +; X64-NEXT:    vpinsrb $6, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x06]
>  ; X64-NEXT:    kmovw %k2, %eax # encoding: [0xc5,0xf8,0x93,0xc2]
> -; X64-NEXT:    vpinsrw $7, %eax, %xmm0, %xmm0 # encoding: [0xc5,0xf9,0xc4,0xc0,0x07]
> +; X64-NEXT:    vpinsrb $7, %eax, %xmm0, %xmm0 # encoding: [0xc4,0xe3,0x79,0x20,0xc0,0x07]
>  ; X64-NEXT:    retq # encoding: [0xc3]
>    %res0 = call i8 @llvm.x86.avx512.mask.ucmp.q.128(<2 x i64> %a0, <2 x i64> %a1, i32 0, i8 %mask)
>    %vec0 = insertelement <8 x i8> undef, i8 %res0, i32 0
>
> Modified: llvm/trunk/test/CodeGen/X86/bitcast-and-setcc-128.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/bitcast-and-setcc-128.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/bitcast-and-setcc-128.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/bitcast-and-setcc-128.ll Wed Aug  7 09:24:26 2019
> @@ -178,144 +178,63 @@ define i16 @v16i8(<16 x i8> %a, <16 x i8
>  }
>
>  define i2 @v2i8(<2 x i8> %a, <2 x i8> %b, <2 x i8> %c, <2 x i8> %d) {
> -; SSE2-SSSE3-LABEL: v2i8:
> -; SSE2-SSSE3:       # %bb.0:
> -; SSE2-SSSE3-NEXT:    psllq $56, %xmm2
> -; SSE2-SSSE3-NEXT:    movdqa %xmm2, %xmm4
> -; SSE2-SSSE3-NEXT:    psrad $31, %xmm4
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm4 = xmm4[1,3,2,3]
> -; SSE2-SSSE3-NEXT:    psrad $24, %xmm2
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm2 = xmm2[1,3,2,3]
> -; SSE2-SSSE3-NEXT:    punpckldq {{.*#+}} xmm2 = xmm2[0],xmm4[0],xmm2[1],xmm4[1]
> -; SSE2-SSSE3-NEXT:    psllq $56, %xmm3
> -; SSE2-SSSE3-NEXT:    movdqa %xmm3, %xmm4
> -; SSE2-SSSE3-NEXT:    psrad $31, %xmm4
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm4 = xmm4[1,3,2,3]
> -; SSE2-SSSE3-NEXT:    psrad $24, %xmm3
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm3 = xmm3[1,3,2,3]
> -; SSE2-SSSE3-NEXT:    punpckldq {{.*#+}} xmm3 = xmm3[0],xmm4[0],xmm3[1],xmm4[1]
> -; SSE2-SSSE3-NEXT:    psllq $56, %xmm0
> -; SSE2-SSSE3-NEXT:    movdqa %xmm0, %xmm4
> -; SSE2-SSSE3-NEXT:    psrad $31, %xmm4
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm4 = xmm4[1,3,2,3]
> -; SSE2-SSSE3-NEXT:    psrad $24, %xmm0
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,3,2,3]
> -; SSE2-SSSE3-NEXT:    punpckldq {{.*#+}} xmm0 = xmm0[0],xmm4[0],xmm0[1],xmm4[1]
> -; SSE2-SSSE3-NEXT:    psllq $56, %xmm1
> -; SSE2-SSSE3-NEXT:    movdqa %xmm1, %xmm4
> -; SSE2-SSSE3-NEXT:    psrad $31, %xmm4
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm4 = xmm4[1,3,2,3]
> -; SSE2-SSSE3-NEXT:    psrad $24, %xmm1
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[1,3,2,3]
> -; SSE2-SSSE3-NEXT:    punpckldq {{.*#+}} xmm1 = xmm1[0],xmm4[0],xmm1[1],xmm4[1]
> -; SSE2-SSSE3-NEXT:    movdqa {{.*#+}} xmm4 = [2147483648,2147483648]
> -; SSE2-SSSE3-NEXT:    pxor %xmm4, %xmm1
> -; SSE2-SSSE3-NEXT:    pxor %xmm4, %xmm0
> -; SSE2-SSSE3-NEXT:    movdqa %xmm0, %xmm5
> -; SSE2-SSSE3-NEXT:    pcmpeqd %xmm1, %xmm5
> -; SSE2-SSSE3-NEXT:    pcmpgtd %xmm1, %xmm0
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm1 = xmm0[0,0,2,2]
> -; SSE2-SSSE3-NEXT:    pand %xmm5, %xmm1
> -; SSE2-SSSE3-NEXT:    por %xmm0, %xmm1
> -; SSE2-SSSE3-NEXT:    pxor %xmm4, %xmm3
> -; SSE2-SSSE3-NEXT:    pxor %xmm4, %xmm2
> -; SSE2-SSSE3-NEXT:    movdqa %xmm2, %xmm0
> -; SSE2-SSSE3-NEXT:    pcmpeqd %xmm3, %xmm0
> -; SSE2-SSSE3-NEXT:    pcmpgtd %xmm3, %xmm2
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm3 = xmm2[0,0,2,2]
> -; SSE2-SSSE3-NEXT:    pand %xmm0, %xmm3
> -; SSE2-SSSE3-NEXT:    por %xmm2, %xmm3
> -; SSE2-SSSE3-NEXT:    pand %xmm1, %xmm3
> -; SSE2-SSSE3-NEXT:    movmskpd %xmm3, %eax
> -; SSE2-SSSE3-NEXT:    # kill: def $al killed $al killed $eax
> -; SSE2-SSSE3-NEXT:    retq
> +; SSE2-LABEL: v2i8:
> +; SSE2:       # %bb.0:
> +; SSE2-NEXT:    pcmpgtb %xmm1, %xmm0
> +; SSE2-NEXT:    punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
> +; SSE2-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3]
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,0,1,1]
> +; SSE2-NEXT:    pcmpgtb %xmm3, %xmm2
> +; SSE2-NEXT:    punpcklbw {{.*#+}} xmm2 = xmm2[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
> +; SSE2-NEXT:    punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3]
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[0,0,1,1]
> +; SSE2-NEXT:    pand %xmm0, %xmm1
> +; SSE2-NEXT:    movmskpd %xmm1, %eax
> +; SSE2-NEXT:    # kill: def $al killed $al killed $eax
> +; SSE2-NEXT:    retq
> +;
> +; SSSE3-LABEL: v2i8:
> +; SSSE3:       # %bb.0:
> +; SSSE3-NEXT:    pcmpgtb %xmm1, %xmm0
> +; SSSE3-NEXT:    movdqa {{.*#+}} xmm1 = <u,u,0,0,u,u,0,0,u,u,1,1,u,u,1,1>
> +; SSSE3-NEXT:    pshufb %xmm1, %xmm0
> +; SSSE3-NEXT:    pcmpgtb %xmm3, %xmm2
> +; SSSE3-NEXT:    pshufb %xmm1, %xmm2
> +; SSSE3-NEXT:    pand %xmm0, %xmm2
> +; SSSE3-NEXT:    movmskpd %xmm2, %eax
> +; SSSE3-NEXT:    # kill: def $al killed $al killed $eax
> +; SSSE3-NEXT:    retq
>  ;
> -; AVX1-LABEL: v2i8:
> -; AVX1:       # %bb.0:
> -; AVX1-NEXT:    vpsllq $56, %xmm3, %xmm3
> -; AVX1-NEXT:    vpsrad $31, %xmm3, %xmm4
> -; AVX1-NEXT:    vpsrad $24, %xmm3, %xmm3
> -; AVX1-NEXT:    vpshufd {{.*#+}} xmm3 = xmm3[1,1,3,3]
> -; AVX1-NEXT:    vpblendw {{.*#+}} xmm3 = xmm3[0,1],xmm4[2,3],xmm3[4,5],xmm4[6,7]
> -; AVX1-NEXT:    vpsllq $56, %xmm2, %xmm2
> -; AVX1-NEXT:    vpsrad $31, %xmm2, %xmm4
> -; AVX1-NEXT:    vpsrad $24, %xmm2, %xmm2
> -; AVX1-NEXT:    vpshufd {{.*#+}} xmm2 = xmm2[1,1,3,3]
> -; AVX1-NEXT:    vpblendw {{.*#+}} xmm2 = xmm2[0,1],xmm4[2,3],xmm2[4,5],xmm4[6,7]
> -; AVX1-NEXT:    vpcmpgtq %xmm3, %xmm2, %xmm2
> -; AVX1-NEXT:    vpsllq $56, %xmm1, %xmm1
> -; AVX1-NEXT:    vpsrad $31, %xmm1, %xmm3
> -; AVX1-NEXT:    vpsrad $24, %xmm1, %xmm1
> -; AVX1-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
> -; AVX1-NEXT:    vpblendw {{.*#+}} xmm1 = xmm1[0,1],xmm3[2,3],xmm1[4,5],xmm3[6,7]
> -; AVX1-NEXT:    vpsllq $56, %xmm0, %xmm0
> -; AVX1-NEXT:    vpsrad $31, %xmm0, %xmm3
> -; AVX1-NEXT:    vpsrad $24, %xmm0, %xmm0
> -; AVX1-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[1,1,3,3]
> -; AVX1-NEXT:    vpblendw {{.*#+}} xmm0 = xmm0[0,1],xmm3[2,3],xmm0[4,5],xmm3[6,7]
> -; AVX1-NEXT:    vpcmpgtq %xmm1, %xmm0, %xmm0
> -; AVX1-NEXT:    vpand %xmm2, %xmm0, %xmm0
> -; AVX1-NEXT:    vmovmskpd %xmm0, %eax
> -; AVX1-NEXT:    # kill: def $al killed $al killed $eax
> -; AVX1-NEXT:    retq
> -;
> -; AVX2-LABEL: v2i8:
> -; AVX2:       # %bb.0:
> -; AVX2-NEXT:    vpsllq $56, %xmm3, %xmm3
> -; AVX2-NEXT:    vpsrad $31, %xmm3, %xmm4
> -; AVX2-NEXT:    vpsrad $24, %xmm3, %xmm3
> -; AVX2-NEXT:    vpshufd {{.*#+}} xmm3 = xmm3[1,1,3,3]
> -; AVX2-NEXT:    vpblendd {{.*#+}} xmm3 = xmm3[0],xmm4[1],xmm3[2],xmm4[3]
> -; AVX2-NEXT:    vpsllq $56, %xmm2, %xmm2
> -; AVX2-NEXT:    vpsrad $31, %xmm2, %xmm4
> -; AVX2-NEXT:    vpsrad $24, %xmm2, %xmm2
> -; AVX2-NEXT:    vpshufd {{.*#+}} xmm2 = xmm2[1,1,3,3]
> -; AVX2-NEXT:    vpblendd {{.*#+}} xmm2 = xmm2[0],xmm4[1],xmm2[2],xmm4[3]
> -; AVX2-NEXT:    vpcmpgtq %xmm3, %xmm2, %xmm2
> -; AVX2-NEXT:    vpsllq $56, %xmm1, %xmm1
> -; AVX2-NEXT:    vpsrad $31, %xmm1, %xmm3
> -; AVX2-NEXT:    vpsrad $24, %xmm1, %xmm1
> -; AVX2-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
> -; AVX2-NEXT:    vpblendd {{.*#+}} xmm1 = xmm1[0],xmm3[1],xmm1[2],xmm3[3]
> -; AVX2-NEXT:    vpsllq $56, %xmm0, %xmm0
> -; AVX2-NEXT:    vpsrad $31, %xmm0, %xmm3
> -; AVX2-NEXT:    vpsrad $24, %xmm0, %xmm0
> -; AVX2-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[1,1,3,3]
> -; AVX2-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm3[1],xmm0[2],xmm3[3]
> -; AVX2-NEXT:    vpcmpgtq %xmm1, %xmm0, %xmm0
> -; AVX2-NEXT:    vpand %xmm2, %xmm0, %xmm0
> -; AVX2-NEXT:    vmovmskpd %xmm0, %eax
> -; AVX2-NEXT:    # kill: def $al killed $al killed $eax
> -; AVX2-NEXT:    retq
> +; AVX12-LABEL: v2i8:
> +; AVX12:       # %bb.0:
> +; AVX12-NEXT:    vpcmpgtb %xmm1, %xmm0, %xmm0
> +; AVX12-NEXT:    vpmovsxbq %xmm0, %xmm0
> +; AVX12-NEXT:    vpcmpgtb %xmm3, %xmm2, %xmm1
> +; AVX12-NEXT:    vpmovsxbq %xmm1, %xmm1
> +; AVX12-NEXT:    vpand %xmm1, %xmm0, %xmm0
> +; AVX12-NEXT:    vmovmskpd %xmm0, %eax
> +; AVX12-NEXT:    # kill: def $al killed $al killed $eax
> +; AVX12-NEXT:    retq
>  ;
>  ; AVX512F-LABEL: v2i8:
>  ; AVX512F:       # %bb.0:
> -; AVX512F-NEXT:    vpsllq $56, %xmm3, %xmm3
> -; AVX512F-NEXT:    vpsraq $56, %xmm3, %xmm3
> -; AVX512F-NEXT:    vpsllq $56, %xmm2, %xmm2
> -; AVX512F-NEXT:    vpsraq $56, %xmm2, %xmm2
> -; AVX512F-NEXT:    vpsllq $56, %xmm1, %xmm1
> -; AVX512F-NEXT:    vpsraq $56, %xmm1, %xmm1
> -; AVX512F-NEXT:    vpsllq $56, %xmm0, %xmm0
> -; AVX512F-NEXT:    vpsraq $56, %xmm0, %xmm0
> -; AVX512F-NEXT:    vpcmpgtq %xmm1, %xmm0, %k1
> -; AVX512F-NEXT:    vpcmpgtq %xmm3, %xmm2, %k0 {%k1}
> +; AVX512F-NEXT:    vpcmpgtb %xmm1, %xmm0, %xmm0
> +; AVX512F-NEXT:    vpmovsxbd %xmm0, %zmm0
> +; AVX512F-NEXT:    vptestmd %zmm0, %zmm0, %k0
> +; AVX512F-NEXT:    vpcmpgtb %xmm3, %xmm2, %xmm0
> +; AVX512F-NEXT:    vpmovsxbd %xmm0, %zmm0
> +; AVX512F-NEXT:    vptestmd %zmm0, %zmm0, %k1
> +; AVX512F-NEXT:    kandw %k1, %k0, %k0
>  ; AVX512F-NEXT:    kmovw %k0, %eax
>  ; AVX512F-NEXT:    # kill: def $al killed $al killed $eax
> +; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
>  ;
>  ; AVX512BW-LABEL: v2i8:
>  ; AVX512BW:       # %bb.0:
> -; AVX512BW-NEXT:    vpsllq $56, %xmm3, %xmm3
> -; AVX512BW-NEXT:    vpsraq $56, %xmm3, %xmm3
> -; AVX512BW-NEXT:    vpsllq $56, %xmm2, %xmm2
> -; AVX512BW-NEXT:    vpsraq $56, %xmm2, %xmm2
> -; AVX512BW-NEXT:    vpsllq $56, %xmm1, %xmm1
> -; AVX512BW-NEXT:    vpsraq $56, %xmm1, %xmm1
> -; AVX512BW-NEXT:    vpsllq $56, %xmm0, %xmm0
> -; AVX512BW-NEXT:    vpsraq $56, %xmm0, %xmm0
> -; AVX512BW-NEXT:    vpcmpgtq %xmm1, %xmm0, %k1
> -; AVX512BW-NEXT:    vpcmpgtq %xmm3, %xmm2, %k0 {%k1}
> +; AVX512BW-NEXT:    vpcmpgtb %xmm1, %xmm0, %k0
> +; AVX512BW-NEXT:    vpcmpgtb %xmm3, %xmm2, %k1
> +; AVX512BW-NEXT:    kandw %k1, %k0, %k0
>  ; AVX512BW-NEXT:    kmovd %k0, %eax
>  ; AVX512BW-NEXT:    # kill: def $al killed $al killed $eax
>  ; AVX512BW-NEXT:    retq
> @@ -329,142 +248,47 @@ define i2 @v2i8(<2 x i8> %a, <2 x i8> %b
>  define i2 @v2i16(<2 x i16> %a, <2 x i16> %b, <2 x i16> %c, <2 x i16> %d) {
>  ; SSE2-SSSE3-LABEL: v2i16:
>  ; SSE2-SSSE3:       # %bb.0:
> -; SSE2-SSSE3-NEXT:    psllq $48, %xmm2
> -; SSE2-SSSE3-NEXT:    movdqa %xmm2, %xmm4
> -; SSE2-SSSE3-NEXT:    psrad $31, %xmm4
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm4 = xmm4[1,3,2,3]
> -; SSE2-SSSE3-NEXT:    psrad $16, %xmm2
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm2 = xmm2[1,3,2,3]
> -; SSE2-SSSE3-NEXT:    punpckldq {{.*#+}} xmm2 = xmm2[0],xmm4[0],xmm2[1],xmm4[1]
> -; SSE2-SSSE3-NEXT:    psllq $48, %xmm3
> -; SSE2-SSSE3-NEXT:    movdqa %xmm3, %xmm4
> -; SSE2-SSSE3-NEXT:    psrad $31, %xmm4
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm4 = xmm4[1,3,2,3]
> -; SSE2-SSSE3-NEXT:    psrad $16, %xmm3
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm3 = xmm3[1,3,2,3]
> -; SSE2-SSSE3-NEXT:    punpckldq {{.*#+}} xmm3 = xmm3[0],xmm4[0],xmm3[1],xmm4[1]
> -; SSE2-SSSE3-NEXT:    psllq $48, %xmm0
> -; SSE2-SSSE3-NEXT:    movdqa %xmm0, %xmm4
> -; SSE2-SSSE3-NEXT:    psrad $31, %xmm4
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm4 = xmm4[1,3,2,3]
> -; SSE2-SSSE3-NEXT:    psrad $16, %xmm0
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,3,2,3]
> -; SSE2-SSSE3-NEXT:    punpckldq {{.*#+}} xmm0 = xmm0[0],xmm4[0],xmm0[1],xmm4[1]
> -; SSE2-SSSE3-NEXT:    psllq $48, %xmm1
> -; SSE2-SSSE3-NEXT:    movdqa %xmm1, %xmm4
> -; SSE2-SSSE3-NEXT:    psrad $31, %xmm4
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm4 = xmm4[1,3,2,3]
> -; SSE2-SSSE3-NEXT:    psrad $16, %xmm1
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[1,3,2,3]
> -; SSE2-SSSE3-NEXT:    punpckldq {{.*#+}} xmm1 = xmm1[0],xmm4[0],xmm1[1],xmm4[1]
> -; SSE2-SSSE3-NEXT:    movdqa {{.*#+}} xmm4 = [2147483648,2147483648]
> -; SSE2-SSSE3-NEXT:    pxor %xmm4, %xmm1
> -; SSE2-SSSE3-NEXT:    pxor %xmm4, %xmm0
> -; SSE2-SSSE3-NEXT:    movdqa %xmm0, %xmm5
> -; SSE2-SSSE3-NEXT:    pcmpeqd %xmm1, %xmm5
> -; SSE2-SSSE3-NEXT:    pcmpgtd %xmm1, %xmm0
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm1 = xmm0[0,0,2,2]
> -; SSE2-SSSE3-NEXT:    pand %xmm5, %xmm1
> -; SSE2-SSSE3-NEXT:    por %xmm0, %xmm1
> -; SSE2-SSSE3-NEXT:    pxor %xmm4, %xmm3
> -; SSE2-SSSE3-NEXT:    pxor %xmm4, %xmm2
> -; SSE2-SSSE3-NEXT:    movdqa %xmm2, %xmm0
> -; SSE2-SSSE3-NEXT:    pcmpeqd %xmm3, %xmm0
> -; SSE2-SSSE3-NEXT:    pcmpgtd %xmm3, %xmm2
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm3 = xmm2[0,0,2,2]
> -; SSE2-SSSE3-NEXT:    pand %xmm0, %xmm3
> -; SSE2-SSSE3-NEXT:    por %xmm2, %xmm3
> -; SSE2-SSSE3-NEXT:    pand %xmm1, %xmm3
> -; SSE2-SSSE3-NEXT:    movmskpd %xmm3, %eax
> +; SSE2-SSSE3-NEXT:    pcmpgtw %xmm1, %xmm0
> +; SSE2-SSSE3-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3]
> +; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,0,1,1]
> +; SSE2-SSSE3-NEXT:    pcmpgtw %xmm3, %xmm2
> +; SSE2-SSSE3-NEXT:    punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3]
> +; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[0,0,1,1]
> +; SSE2-SSSE3-NEXT:    pand %xmm0, %xmm1
> +; SSE2-SSSE3-NEXT:    movmskpd %xmm1, %eax
>  ; SSE2-SSSE3-NEXT:    # kill: def $al killed $al killed $eax
>  ; SSE2-SSSE3-NEXT:    retq
>  ;
> -; AVX1-LABEL: v2i16:
> -; AVX1:       # %bb.0:
> -; AVX1-NEXT:    vpsllq $48, %xmm3, %xmm3
> -; AVX1-NEXT:    vpsrad $31, %xmm3, %xmm4
> -; AVX1-NEXT:    vpsrad $16, %xmm3, %xmm3
> -; AVX1-NEXT:    vpshufd {{.*#+}} xmm3 = xmm3[1,1,3,3]
> -; AVX1-NEXT:    vpblendw {{.*#+}} xmm3 = xmm3[0,1],xmm4[2,3],xmm3[4,5],xmm4[6,7]
> -; AVX1-NEXT:    vpsllq $48, %xmm2, %xmm2
> -; AVX1-NEXT:    vpsrad $31, %xmm2, %xmm4
> -; AVX1-NEXT:    vpsrad $16, %xmm2, %xmm2
> -; AVX1-NEXT:    vpshufd {{.*#+}} xmm2 = xmm2[1,1,3,3]
> -; AVX1-NEXT:    vpblendw {{.*#+}} xmm2 = xmm2[0,1],xmm4[2,3],xmm2[4,5],xmm4[6,7]
> -; AVX1-NEXT:    vpcmpgtq %xmm3, %xmm2, %xmm2
> -; AVX1-NEXT:    vpsllq $48, %xmm1, %xmm1
> -; AVX1-NEXT:    vpsrad $31, %xmm1, %xmm3
> -; AVX1-NEXT:    vpsrad $16, %xmm1, %xmm1
> -; AVX1-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
> -; AVX1-NEXT:    vpblendw {{.*#+}} xmm1 = xmm1[0,1],xmm3[2,3],xmm1[4,5],xmm3[6,7]
> -; AVX1-NEXT:    vpsllq $48, %xmm0, %xmm0
> -; AVX1-NEXT:    vpsrad $31, %xmm0, %xmm3
> -; AVX1-NEXT:    vpsrad $16, %xmm0, %xmm0
> -; AVX1-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[1,1,3,3]
> -; AVX1-NEXT:    vpblendw {{.*#+}} xmm0 = xmm0[0,1],xmm3[2,3],xmm0[4,5],xmm3[6,7]
> -; AVX1-NEXT:    vpcmpgtq %xmm1, %xmm0, %xmm0
> -; AVX1-NEXT:    vpand %xmm2, %xmm0, %xmm0
> -; AVX1-NEXT:    vmovmskpd %xmm0, %eax
> -; AVX1-NEXT:    # kill: def $al killed $al killed $eax
> -; AVX1-NEXT:    retq
> -;
> -; AVX2-LABEL: v2i16:
> -; AVX2:       # %bb.0:
> -; AVX2-NEXT:    vpsllq $48, %xmm3, %xmm3
> -; AVX2-NEXT:    vpsrad $31, %xmm3, %xmm4
> -; AVX2-NEXT:    vpsrad $16, %xmm3, %xmm3
> -; AVX2-NEXT:    vpshufd {{.*#+}} xmm3 = xmm3[1,1,3,3]
> -; AVX2-NEXT:    vpblendd {{.*#+}} xmm3 = xmm3[0],xmm4[1],xmm3[2],xmm4[3]
> -; AVX2-NEXT:    vpsllq $48, %xmm2, %xmm2
> -; AVX2-NEXT:    vpsrad $31, %xmm2, %xmm4
> -; AVX2-NEXT:    vpsrad $16, %xmm2, %xmm2
> -; AVX2-NEXT:    vpshufd {{.*#+}} xmm2 = xmm2[1,1,3,3]
> -; AVX2-NEXT:    vpblendd {{.*#+}} xmm2 = xmm2[0],xmm4[1],xmm2[2],xmm4[3]
> -; AVX2-NEXT:    vpcmpgtq %xmm3, %xmm2, %xmm2
> -; AVX2-NEXT:    vpsllq $48, %xmm1, %xmm1
> -; AVX2-NEXT:    vpsrad $31, %xmm1, %xmm3
> -; AVX2-NEXT:    vpsrad $16, %xmm1, %xmm1
> -; AVX2-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
> -; AVX2-NEXT:    vpblendd {{.*#+}} xmm1 = xmm1[0],xmm3[1],xmm1[2],xmm3[3]
> -; AVX2-NEXT:    vpsllq $48, %xmm0, %xmm0
> -; AVX2-NEXT:    vpsrad $31, %xmm0, %xmm3
> -; AVX2-NEXT:    vpsrad $16, %xmm0, %xmm0
> -; AVX2-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[1,1,3,3]
> -; AVX2-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm3[1],xmm0[2],xmm3[3]
> -; AVX2-NEXT:    vpcmpgtq %xmm1, %xmm0, %xmm0
> -; AVX2-NEXT:    vpand %xmm2, %xmm0, %xmm0
> -; AVX2-NEXT:    vmovmskpd %xmm0, %eax
> -; AVX2-NEXT:    # kill: def $al killed $al killed $eax
> -; AVX2-NEXT:    retq
> +; AVX12-LABEL: v2i16:
> +; AVX12:       # %bb.0:
> +; AVX12-NEXT:    vpcmpgtw %xmm1, %xmm0, %xmm0
> +; AVX12-NEXT:    vpmovsxwq %xmm0, %xmm0
> +; AVX12-NEXT:    vpcmpgtw %xmm3, %xmm2, %xmm1
> +; AVX12-NEXT:    vpmovsxwq %xmm1, %xmm1
> +; AVX12-NEXT:    vpand %xmm1, %xmm0, %xmm0
> +; AVX12-NEXT:    vmovmskpd %xmm0, %eax
> +; AVX12-NEXT:    # kill: def $al killed $al killed $eax
> +; AVX12-NEXT:    retq
>  ;
>  ; AVX512F-LABEL: v2i16:
>  ; AVX512F:       # %bb.0:
> -; AVX512F-NEXT:    vpsllq $48, %xmm3, %xmm3
> -; AVX512F-NEXT:    vpsraq $48, %xmm3, %xmm3
> -; AVX512F-NEXT:    vpsllq $48, %xmm2, %xmm2
> -; AVX512F-NEXT:    vpsraq $48, %xmm2, %xmm2
> -; AVX512F-NEXT:    vpsllq $48, %xmm1, %xmm1
> -; AVX512F-NEXT:    vpsraq $48, %xmm1, %xmm1
> -; AVX512F-NEXT:    vpsllq $48, %xmm0, %xmm0
> -; AVX512F-NEXT:    vpsraq $48, %xmm0, %xmm0
> -; AVX512F-NEXT:    vpcmpgtq %xmm1, %xmm0, %k1
> -; AVX512F-NEXT:    vpcmpgtq %xmm3, %xmm2, %k0 {%k1}
> +; AVX512F-NEXT:    vpcmpgtw %xmm1, %xmm0, %xmm0
> +; AVX512F-NEXT:    vpmovsxwd %xmm0, %ymm0
> +; AVX512F-NEXT:    vptestmd %ymm0, %ymm0, %k0
> +; AVX512F-NEXT:    vpcmpgtw %xmm3, %xmm2, %xmm0
> +; AVX512F-NEXT:    vpmovsxwd %xmm0, %ymm0
> +; AVX512F-NEXT:    vptestmd %ymm0, %ymm0, %k1
> +; AVX512F-NEXT:    kandw %k1, %k0, %k0
>  ; AVX512F-NEXT:    kmovw %k0, %eax
>  ; AVX512F-NEXT:    # kill: def $al killed $al killed $eax
> +; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
>  ;
>  ; AVX512BW-LABEL: v2i16:
>  ; AVX512BW:       # %bb.0:
> -; AVX512BW-NEXT:    vpsllq $48, %xmm3, %xmm3
> -; AVX512BW-NEXT:    vpsraq $48, %xmm3, %xmm3
> -; AVX512BW-NEXT:    vpsllq $48, %xmm2, %xmm2
> -; AVX512BW-NEXT:    vpsraq $48, %xmm2, %xmm2
> -; AVX512BW-NEXT:    vpsllq $48, %xmm1, %xmm1
> -; AVX512BW-NEXT:    vpsraq $48, %xmm1, %xmm1
> -; AVX512BW-NEXT:    vpsllq $48, %xmm0, %xmm0
> -; AVX512BW-NEXT:    vpsraq $48, %xmm0, %xmm0
> -; AVX512BW-NEXT:    vpcmpgtq %xmm1, %xmm0, %k1
> -; AVX512BW-NEXT:    vpcmpgtq %xmm3, %xmm2, %k0 {%k1}
> +; AVX512BW-NEXT:    vpcmpgtw %xmm1, %xmm0, %k0
> +; AVX512BW-NEXT:    vpcmpgtw %xmm3, %xmm2, %k1
> +; AVX512BW-NEXT:    kandw %k1, %k0, %k0
>  ; AVX512BW-NEXT:    kmovd %k0, %eax
>  ; AVX512BW-NEXT:    # kill: def $al killed $al killed $eax
>  ; AVX512BW-NEXT:    retq
> @@ -478,118 +302,40 @@ define i2 @v2i16(<2 x i16> %a, <2 x i16>
>  define i2 @v2i32(<2 x i32> %a, <2 x i32> %b, <2 x i32> %c, <2 x i32> %d) {
>  ; SSE2-SSSE3-LABEL: v2i32:
>  ; SSE2-SSSE3:       # %bb.0:
> -; SSE2-SSSE3-NEXT:    psllq $32, %xmm2
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm4 = xmm2[1,3,2,3]
> -; SSE2-SSSE3-NEXT:    psrad $31, %xmm2
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm2 = xmm2[1,3,2,3]
> -; SSE2-SSSE3-NEXT:    punpckldq {{.*#+}} xmm4 = xmm4[0],xmm2[0],xmm4[1],xmm2[1]
> -; SSE2-SSSE3-NEXT:    psllq $32, %xmm3
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm2 = xmm3[1,3,2,3]
> -; SSE2-SSSE3-NEXT:    psrad $31, %xmm3
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm3 = xmm3[1,3,2,3]
> -; SSE2-SSSE3-NEXT:    punpckldq {{.*#+}} xmm2 = xmm2[0],xmm3[0],xmm2[1],xmm3[1]
> -; SSE2-SSSE3-NEXT:    psllq $32, %xmm0
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm3 = xmm0[1,3,2,3]
> -; SSE2-SSSE3-NEXT:    psrad $31, %xmm0
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,3,2,3]
> -; SSE2-SSSE3-NEXT:    punpckldq {{.*#+}} xmm3 = xmm3[0],xmm0[0],xmm3[1],xmm0[1]
> -; SSE2-SSSE3-NEXT:    psllq $32, %xmm1
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm0 = xmm1[1,3,2,3]
> -; SSE2-SSSE3-NEXT:    psrad $31, %xmm1
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[1,3,2,3]
> -; SSE2-SSSE3-NEXT:    punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
> -; SSE2-SSSE3-NEXT:    movdqa {{.*#+}} xmm1 = [2147483648,2147483648]
> -; SSE2-SSSE3-NEXT:    pxor %xmm1, %xmm0
> -; SSE2-SSSE3-NEXT:    pxor %xmm1, %xmm3
> -; SSE2-SSSE3-NEXT:    movdqa %xmm3, %xmm5
> -; SSE2-SSSE3-NEXT:    pcmpeqd %xmm0, %xmm5
> -; SSE2-SSSE3-NEXT:    pcmpgtd %xmm0, %xmm3
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm0 = xmm3[0,0,2,2]
> -; SSE2-SSSE3-NEXT:    pand %xmm5, %xmm0
> -; SSE2-SSSE3-NEXT:    por %xmm3, %xmm0
> -; SSE2-SSSE3-NEXT:    pxor %xmm1, %xmm2
> -; SSE2-SSSE3-NEXT:    pxor %xmm1, %xmm4
> -; SSE2-SSSE3-NEXT:    movdqa %xmm4, %xmm1
> -; SSE2-SSSE3-NEXT:    pcmpeqd %xmm2, %xmm1
> -; SSE2-SSSE3-NEXT:    pcmpgtd %xmm2, %xmm4
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm2 = xmm4[0,0,2,2]
> -; SSE2-SSSE3-NEXT:    pand %xmm1, %xmm2
> -; SSE2-SSSE3-NEXT:    por %xmm4, %xmm2
> -; SSE2-SSSE3-NEXT:    pand %xmm0, %xmm2
> -; SSE2-SSSE3-NEXT:    movmskpd %xmm2, %eax
> +; SSE2-SSSE3-NEXT:    pcmpgtd %xmm1, %xmm0
> +; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,0,1,1]
> +; SSE2-SSSE3-NEXT:    pcmpgtd %xmm3, %xmm2
> +; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm1 = xmm2[0,0,1,1]
> +; SSE2-SSSE3-NEXT:    pand %xmm0, %xmm1
> +; SSE2-SSSE3-NEXT:    movmskpd %xmm1, %eax
>  ; SSE2-SSSE3-NEXT:    # kill: def $al killed $al killed $eax
>  ; SSE2-SSSE3-NEXT:    retq
>  ;
> -; AVX1-LABEL: v2i32:
> -; AVX1:       # %bb.0:
> -; AVX1-NEXT:    vpsllq $32, %xmm3, %xmm4
> -; AVX1-NEXT:    vpsrad $31, %xmm4, %xmm4
> -; AVX1-NEXT:    vpblendw {{.*#+}} xmm3 = xmm3[0,1],xmm4[2,3],xmm3[4,5],xmm4[6,7]
> -; AVX1-NEXT:    vpsllq $32, %xmm2, %xmm4
> -; AVX1-NEXT:    vpsrad $31, %xmm4, %xmm4
> -; AVX1-NEXT:    vpblendw {{.*#+}} xmm2 = xmm2[0,1],xmm4[2,3],xmm2[4,5],xmm4[6,7]
> -; AVX1-NEXT:    vpcmpgtq %xmm3, %xmm2, %xmm2
> -; AVX1-NEXT:    vpsllq $32, %xmm1, %xmm3
> -; AVX1-NEXT:    vpsrad $31, %xmm3, %xmm3
> -; AVX1-NEXT:    vpblendw {{.*#+}} xmm1 = xmm1[0,1],xmm3[2,3],xmm1[4,5],xmm3[6,7]
> -; AVX1-NEXT:    vpsllq $32, %xmm0, %xmm3
> -; AVX1-NEXT:    vpsrad $31, %xmm3, %xmm3
> -; AVX1-NEXT:    vpblendw {{.*#+}} xmm0 = xmm0[0,1],xmm3[2,3],xmm0[4,5],xmm3[6,7]
> -; AVX1-NEXT:    vpcmpgtq %xmm1, %xmm0, %xmm0
> -; AVX1-NEXT:    vpand %xmm2, %xmm0, %xmm0
> -; AVX1-NEXT:    vmovmskpd %xmm0, %eax
> -; AVX1-NEXT:    # kill: def $al killed $al killed $eax
> -; AVX1-NEXT:    retq
> -;
> -; AVX2-LABEL: v2i32:
> -; AVX2:       # %bb.0:
> -; AVX2-NEXT:    vpsllq $32, %xmm3, %xmm4
> -; AVX2-NEXT:    vpsrad $31, %xmm4, %xmm4
> -; AVX2-NEXT:    vpblendd {{.*#+}} xmm3 = xmm3[0],xmm4[1],xmm3[2],xmm4[3]
> -; AVX2-NEXT:    vpsllq $32, %xmm2, %xmm4
> -; AVX2-NEXT:    vpsrad $31, %xmm4, %xmm4
> -; AVX2-NEXT:    vpblendd {{.*#+}} xmm2 = xmm2[0],xmm4[1],xmm2[2],xmm4[3]
> -; AVX2-NEXT:    vpcmpgtq %xmm3, %xmm2, %xmm2
> -; AVX2-NEXT:    vpsllq $32, %xmm1, %xmm3
> -; AVX2-NEXT:    vpsrad $31, %xmm3, %xmm3
> -; AVX2-NEXT:    vpblendd {{.*#+}} xmm1 = xmm1[0],xmm3[1],xmm1[2],xmm3[3]
> -; AVX2-NEXT:    vpsllq $32, %xmm0, %xmm3
> -; AVX2-NEXT:    vpsrad $31, %xmm3, %xmm3
> -; AVX2-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm3[1],xmm0[2],xmm3[3]
> -; AVX2-NEXT:    vpcmpgtq %xmm1, %xmm0, %xmm0
> -; AVX2-NEXT:    vpand %xmm2, %xmm0, %xmm0
> -; AVX2-NEXT:    vmovmskpd %xmm0, %eax
> -; AVX2-NEXT:    # kill: def $al killed $al killed $eax
> -; AVX2-NEXT:    retq
> +; AVX12-LABEL: v2i32:
> +; AVX12:       # %bb.0:
> +; AVX12-NEXT:    vpcmpgtd %xmm1, %xmm0, %xmm0
> +; AVX12-NEXT:    vpmovsxdq %xmm0, %xmm0
> +; AVX12-NEXT:    vpcmpgtd %xmm3, %xmm2, %xmm1
> +; AVX12-NEXT:    vpmovsxdq %xmm1, %xmm1
> +; AVX12-NEXT:    vpand %xmm1, %xmm0, %xmm0
> +; AVX12-NEXT:    vmovmskpd %xmm0, %eax
> +; AVX12-NEXT:    # kill: def $al killed $al killed $eax
> +; AVX12-NEXT:    retq
>  ;
>  ; AVX512F-LABEL: v2i32:
>  ; AVX512F:       # %bb.0:
> -; AVX512F-NEXT:    vpsllq $32, %xmm3, %xmm3
> -; AVX512F-NEXT:    vpsraq $32, %xmm3, %xmm3
> -; AVX512F-NEXT:    vpsllq $32, %xmm2, %xmm2
> -; AVX512F-NEXT:    vpsraq $32, %xmm2, %xmm2
> -; AVX512F-NEXT:    vpsllq $32, %xmm1, %xmm1
> -; AVX512F-NEXT:    vpsraq $32, %xmm1, %xmm1
> -; AVX512F-NEXT:    vpsllq $32, %xmm0, %xmm0
> -; AVX512F-NEXT:    vpsraq $32, %xmm0, %xmm0
> -; AVX512F-NEXT:    vpcmpgtq %xmm1, %xmm0, %k1
> -; AVX512F-NEXT:    vpcmpgtq %xmm3, %xmm2, %k0 {%k1}
> +; AVX512F-NEXT:    vpcmpgtd %xmm1, %xmm0, %k0
> +; AVX512F-NEXT:    vpcmpgtd %xmm3, %xmm2, %k1
> +; AVX512F-NEXT:    kandw %k1, %k0, %k0
>  ; AVX512F-NEXT:    kmovw %k0, %eax
>  ; AVX512F-NEXT:    # kill: def $al killed $al killed $eax
>  ; AVX512F-NEXT:    retq
>  ;
>  ; AVX512BW-LABEL: v2i32:
>  ; AVX512BW:       # %bb.0:
> -; AVX512BW-NEXT:    vpsllq $32, %xmm3, %xmm3
> -; AVX512BW-NEXT:    vpsraq $32, %xmm3, %xmm3
> -; AVX512BW-NEXT:    vpsllq $32, %xmm2, %xmm2
> -; AVX512BW-NEXT:    vpsraq $32, %xmm2, %xmm2
> -; AVX512BW-NEXT:    vpsllq $32, %xmm1, %xmm1
> -; AVX512BW-NEXT:    vpsraq $32, %xmm1, %xmm1
> -; AVX512BW-NEXT:    vpsllq $32, %xmm0, %xmm0
> -; AVX512BW-NEXT:    vpsraq $32, %xmm0, %xmm0
> -; AVX512BW-NEXT:    vpcmpgtq %xmm1, %xmm0, %k1
> -; AVX512BW-NEXT:    vpcmpgtq %xmm3, %xmm2, %k0 {%k1}
> +; AVX512BW-NEXT:    vpcmpgtd %xmm1, %xmm0, %k0
> +; AVX512BW-NEXT:    vpcmpgtd %xmm3, %xmm2, %k1
> +; AVX512BW-NEXT:    kandw %k1, %k0, %k0
>  ; AVX512BW-NEXT:    kmovd %k0, %eax
>  ; AVX512BW-NEXT:    # kill: def $al killed $al killed $eax
>  ; AVX512BW-NEXT:    retq
> @@ -700,66 +446,47 @@ define i2 @v2f64(<2 x double> %a, <2 x d
>  define i4 @v4i8(<4 x i8> %a, <4 x i8> %b, <4 x i8> %c, <4 x i8> %d) {
>  ; SSE2-SSSE3-LABEL: v4i8:
>  ; SSE2-SSSE3:       # %bb.0:
> -; SSE2-SSSE3-NEXT:    pslld $24, %xmm3
> -; SSE2-SSSE3-NEXT:    psrad $24, %xmm3
> -; SSE2-SSSE3-NEXT:    pslld $24, %xmm2
> -; SSE2-SSSE3-NEXT:    psrad $24, %xmm2
> -; SSE2-SSSE3-NEXT:    pcmpgtd %xmm3, %xmm2
> -; SSE2-SSSE3-NEXT:    pslld $24, %xmm1
> -; SSE2-SSSE3-NEXT:    psrad $24, %xmm1
> -; SSE2-SSSE3-NEXT:    pslld $24, %xmm0
> -; SSE2-SSSE3-NEXT:    psrad $24, %xmm0
> -; SSE2-SSSE3-NEXT:    pcmpgtd %xmm1, %xmm0
> -; SSE2-SSSE3-NEXT:    pand %xmm2, %xmm0
> -; SSE2-SSSE3-NEXT:    movmskps %xmm0, %eax
> +; SSE2-SSSE3-NEXT:    pcmpgtb %xmm1, %xmm0
> +; SSE2-SSSE3-NEXT:    punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
> +; SSE2-SSSE3-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3]
> +; SSE2-SSSE3-NEXT:    pcmpgtb %xmm3, %xmm2
> +; SSE2-SSSE3-NEXT:    punpcklbw {{.*#+}} xmm2 = xmm2[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
> +; SSE2-SSSE3-NEXT:    punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3]
> +; SSE2-SSSE3-NEXT:    pand %xmm0, %xmm1
> +; SSE2-SSSE3-NEXT:    movmskps %xmm1, %eax
>  ; SSE2-SSSE3-NEXT:    # kill: def $al killed $al killed $eax
>  ; SSE2-SSSE3-NEXT:    retq
>  ;
>  ; AVX12-LABEL: v4i8:
>  ; AVX12:       # %bb.0:
> -; AVX12-NEXT:    vpslld $24, %xmm3, %xmm3
> -; AVX12-NEXT:    vpsrad $24, %xmm3, %xmm3
> -; AVX12-NEXT:    vpslld $24, %xmm2, %xmm2
> -; AVX12-NEXT:    vpsrad $24, %xmm2, %xmm2
> -; AVX12-NEXT:    vpcmpgtd %xmm3, %xmm2, %xmm2
> -; AVX12-NEXT:    vpslld $24, %xmm1, %xmm1
> -; AVX12-NEXT:    vpsrad $24, %xmm1, %xmm1
> -; AVX12-NEXT:    vpslld $24, %xmm0, %xmm0
> -; AVX12-NEXT:    vpsrad $24, %xmm0, %xmm0
> -; AVX12-NEXT:    vpcmpgtd %xmm1, %xmm0, %xmm0
> -; AVX12-NEXT:    vpand %xmm2, %xmm0, %xmm0
> +; AVX12-NEXT:    vpcmpgtb %xmm1, %xmm0, %xmm0
> +; AVX12-NEXT:    vpmovsxbd %xmm0, %xmm0
> +; AVX12-NEXT:    vpcmpgtb %xmm3, %xmm2, %xmm1
> +; AVX12-NEXT:    vpmovsxbd %xmm1, %xmm1
> +; AVX12-NEXT:    vpand %xmm1, %xmm0, %xmm0
>  ; AVX12-NEXT:    vmovmskps %xmm0, %eax
>  ; AVX12-NEXT:    # kill: def $al killed $al killed $eax
>  ; AVX12-NEXT:    retq
>  ;
>  ; AVX512F-LABEL: v4i8:
>  ; AVX512F:       # %bb.0:
> -; AVX512F-NEXT:    vpslld $24, %xmm3, %xmm3
> -; AVX512F-NEXT:    vpsrad $24, %xmm3, %xmm3
> -; AVX512F-NEXT:    vpslld $24, %xmm2, %xmm2
> -; AVX512F-NEXT:    vpsrad $24, %xmm2, %xmm2
> -; AVX512F-NEXT:    vpslld $24, %xmm1, %xmm1
> -; AVX512F-NEXT:    vpsrad $24, %xmm1, %xmm1
> -; AVX512F-NEXT:    vpslld $24, %xmm0, %xmm0
> -; AVX512F-NEXT:    vpsrad $24, %xmm0, %xmm0
> -; AVX512F-NEXT:    vpcmpgtd %xmm1, %xmm0, %k1
> -; AVX512F-NEXT:    vpcmpgtd %xmm3, %xmm2, %k0 {%k1}
> +; AVX512F-NEXT:    vpcmpgtb %xmm1, %xmm0, %xmm0
> +; AVX512F-NEXT:    vpmovsxbd %xmm0, %zmm0
> +; AVX512F-NEXT:    vptestmd %zmm0, %zmm0, %k0
> +; AVX512F-NEXT:    vpcmpgtb %xmm3, %xmm2, %xmm0
> +; AVX512F-NEXT:    vpmovsxbd %xmm0, %zmm0
> +; AVX512F-NEXT:    vptestmd %zmm0, %zmm0, %k1
> +; AVX512F-NEXT:    kandw %k1, %k0, %k0
>  ; AVX512F-NEXT:    kmovw %k0, %eax
>  ; AVX512F-NEXT:    # kill: def $al killed $al killed $eax
> +; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
>  ;
>  ; AVX512BW-LABEL: v4i8:
>  ; AVX512BW:       # %bb.0:
> -; AVX512BW-NEXT:    vpslld $24, %xmm3, %xmm3
> -; AVX512BW-NEXT:    vpsrad $24, %xmm3, %xmm3
> -; AVX512BW-NEXT:    vpslld $24, %xmm2, %xmm2
> -; AVX512BW-NEXT:    vpsrad $24, %xmm2, %xmm2
> -; AVX512BW-NEXT:    vpslld $24, %xmm1, %xmm1
> -; AVX512BW-NEXT:    vpsrad $24, %xmm1, %xmm1
> -; AVX512BW-NEXT:    vpslld $24, %xmm0, %xmm0
> -; AVX512BW-NEXT:    vpsrad $24, %xmm0, %xmm0
> -; AVX512BW-NEXT:    vpcmpgtd %xmm1, %xmm0, %k1
> -; AVX512BW-NEXT:    vpcmpgtd %xmm3, %xmm2, %k0 {%k1}
> +; AVX512BW-NEXT:    vpcmpgtb %xmm1, %xmm0, %k0
> +; AVX512BW-NEXT:    vpcmpgtb %xmm3, %xmm2, %k1
> +; AVX512BW-NEXT:    kandw %k1, %k0, %k0
>  ; AVX512BW-NEXT:    kmovd %k0, %eax
>  ; AVX512BW-NEXT:    # kill: def $al killed $al killed $eax
>  ; AVX512BW-NEXT:    retq
> @@ -773,66 +500,45 @@ define i4 @v4i8(<4 x i8> %a, <4 x i8> %b
>  define i4 @v4i16(<4 x i16> %a, <4 x i16> %b, <4 x i16> %c, <4 x i16> %d) {
>  ; SSE2-SSSE3-LABEL: v4i16:
>  ; SSE2-SSSE3:       # %bb.0:
> -; SSE2-SSSE3-NEXT:    pslld $16, %xmm3
> -; SSE2-SSSE3-NEXT:    psrad $16, %xmm3
> -; SSE2-SSSE3-NEXT:    pslld $16, %xmm2
> -; SSE2-SSSE3-NEXT:    psrad $16, %xmm2
> -; SSE2-SSSE3-NEXT:    pcmpgtd %xmm3, %xmm2
> -; SSE2-SSSE3-NEXT:    pslld $16, %xmm1
> -; SSE2-SSSE3-NEXT:    psrad $16, %xmm1
> -; SSE2-SSSE3-NEXT:    pslld $16, %xmm0
> -; SSE2-SSSE3-NEXT:    psrad $16, %xmm0
> -; SSE2-SSSE3-NEXT:    pcmpgtd %xmm1, %xmm0
> -; SSE2-SSSE3-NEXT:    pand %xmm2, %xmm0
> -; SSE2-SSSE3-NEXT:    movmskps %xmm0, %eax
> +; SSE2-SSSE3-NEXT:    pcmpgtw %xmm1, %xmm0
> +; SSE2-SSSE3-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3]
> +; SSE2-SSSE3-NEXT:    pcmpgtw %xmm3, %xmm2
> +; SSE2-SSSE3-NEXT:    punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3]
> +; SSE2-SSSE3-NEXT:    pand %xmm0, %xmm1
> +; SSE2-SSSE3-NEXT:    movmskps %xmm1, %eax
>  ; SSE2-SSSE3-NEXT:    # kill: def $al killed $al killed $eax
>  ; SSE2-SSSE3-NEXT:    retq
>  ;
>  ; AVX12-LABEL: v4i16:
>  ; AVX12:       # %bb.0:
> -; AVX12-NEXT:    vpslld $16, %xmm3, %xmm3
> -; AVX12-NEXT:    vpsrad $16, %xmm3, %xmm3
> -; AVX12-NEXT:    vpslld $16, %xmm2, %xmm2
> -; AVX12-NEXT:    vpsrad $16, %xmm2, %xmm2
> -; AVX12-NEXT:    vpcmpgtd %xmm3, %xmm2, %xmm2
> -; AVX12-NEXT:    vpslld $16, %xmm1, %xmm1
> -; AVX12-NEXT:    vpsrad $16, %xmm1, %xmm1
> -; AVX12-NEXT:    vpslld $16, %xmm0, %xmm0
> -; AVX12-NEXT:    vpsrad $16, %xmm0, %xmm0
> -; AVX12-NEXT:    vpcmpgtd %xmm1, %xmm0, %xmm0
> -; AVX12-NEXT:    vpand %xmm2, %xmm0, %xmm0
> +; AVX12-NEXT:    vpcmpgtw %xmm1, %xmm0, %xmm0
> +; AVX12-NEXT:    vpmovsxwd %xmm0, %xmm0
> +; AVX12-NEXT:    vpcmpgtw %xmm3, %xmm2, %xmm1
> +; AVX12-NEXT:    vpmovsxwd %xmm1, %xmm1
> +; AVX12-NEXT:    vpand %xmm1, %xmm0, %xmm0
>  ; AVX12-NEXT:    vmovmskps %xmm0, %eax
>  ; AVX12-NEXT:    # kill: def $al killed $al killed $eax
>  ; AVX12-NEXT:    retq
>  ;
>  ; AVX512F-LABEL: v4i16:
>  ; AVX512F:       # %bb.0:
> -; AVX512F-NEXT:    vpslld $16, %xmm3, %xmm3
> -; AVX512F-NEXT:    vpsrad $16, %xmm3, %xmm3
> -; AVX512F-NEXT:    vpslld $16, %xmm2, %xmm2
> -; AVX512F-NEXT:    vpsrad $16, %xmm2, %xmm2
> -; AVX512F-NEXT:    vpslld $16, %xmm1, %xmm1
> -; AVX512F-NEXT:    vpsrad $16, %xmm1, %xmm1
> -; AVX512F-NEXT:    vpslld $16, %xmm0, %xmm0
> -; AVX512F-NEXT:    vpsrad $16, %xmm0, %xmm0
> -; AVX512F-NEXT:    vpcmpgtd %xmm1, %xmm0, %k1
> -; AVX512F-NEXT:    vpcmpgtd %xmm3, %xmm2, %k0 {%k1}
> +; AVX512F-NEXT:    vpcmpgtw %xmm1, %xmm0, %xmm0
> +; AVX512F-NEXT:    vpmovsxwd %xmm0, %ymm0
> +; AVX512F-NEXT:    vptestmd %ymm0, %ymm0, %k0
> +; AVX512F-NEXT:    vpcmpgtw %xmm3, %xmm2, %xmm0
> +; AVX512F-NEXT:    vpmovsxwd %xmm0, %ymm0
> +; AVX512F-NEXT:    vptestmd %ymm0, %ymm0, %k1
> +; AVX512F-NEXT:    kandw %k1, %k0, %k0
>  ; AVX512F-NEXT:    kmovw %k0, %eax
>  ; AVX512F-NEXT:    # kill: def $al killed $al killed $eax
> +; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
>  ;
>  ; AVX512BW-LABEL: v4i16:
>  ; AVX512BW:       # %bb.0:
> -; AVX512BW-NEXT:    vpslld $16, %xmm3, %xmm3
> -; AVX512BW-NEXT:    vpsrad $16, %xmm3, %xmm3
> -; AVX512BW-NEXT:    vpslld $16, %xmm2, %xmm2
> -; AVX512BW-NEXT:    vpsrad $16, %xmm2, %xmm2
> -; AVX512BW-NEXT:    vpslld $16, %xmm1, %xmm1
> -; AVX512BW-NEXT:    vpsrad $16, %xmm1, %xmm1
> -; AVX512BW-NEXT:    vpslld $16, %xmm0, %xmm0
> -; AVX512BW-NEXT:    vpsrad $16, %xmm0, %xmm0
> -; AVX512BW-NEXT:    vpcmpgtd %xmm1, %xmm0, %k1
> -; AVX512BW-NEXT:    vpcmpgtd %xmm3, %xmm2, %k0 {%k1}
> +; AVX512BW-NEXT:    vpcmpgtw %xmm1, %xmm0, %k0
> +; AVX512BW-NEXT:    vpcmpgtw %xmm3, %xmm2, %k1
> +; AVX512BW-NEXT:    kandw %k1, %k0, %k0
>  ; AVX512BW-NEXT:    kmovd %k0, %eax
>  ; AVX512BW-NEXT:    # kill: def $al killed $al killed $eax
>  ; AVX512BW-NEXT:    retq
> @@ -846,35 +552,23 @@ define i4 @v4i16(<4 x i16> %a, <4 x i16>
>  define i8 @v8i8(<8 x i8> %a, <8 x i8> %b, <8 x i8> %c, <8 x i8> %d) {
>  ; SSE2-SSSE3-LABEL: v8i8:
>  ; SSE2-SSSE3:       # %bb.0:
> -; SSE2-SSSE3-NEXT:    psllw $8, %xmm3
> -; SSE2-SSSE3-NEXT:    psraw $8, %xmm3
> -; SSE2-SSSE3-NEXT:    psllw $8, %xmm2
> -; SSE2-SSSE3-NEXT:    psraw $8, %xmm2
> -; SSE2-SSSE3-NEXT:    pcmpgtw %xmm3, %xmm2
> -; SSE2-SSSE3-NEXT:    psllw $8, %xmm1
> -; SSE2-SSSE3-NEXT:    psraw $8, %xmm1
> -; SSE2-SSSE3-NEXT:    psllw $8, %xmm0
> -; SSE2-SSSE3-NEXT:    psraw $8, %xmm0
> -; SSE2-SSSE3-NEXT:    pcmpgtw %xmm1, %xmm0
> -; SSE2-SSSE3-NEXT:    pand %xmm2, %xmm0
> -; SSE2-SSSE3-NEXT:    packsswb %xmm0, %xmm0
> -; SSE2-SSSE3-NEXT:    pmovmskb %xmm0, %eax
> +; SSE2-SSSE3-NEXT:    pcmpgtb %xmm1, %xmm0
> +; SSE2-SSSE3-NEXT:    punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
> +; SSE2-SSSE3-NEXT:    pcmpgtb %xmm3, %xmm2
> +; SSE2-SSSE3-NEXT:    punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3],xmm1[4],xmm2[4],xmm1[5],xmm2[5],xmm1[6],xmm2[6],xmm1[7],xmm2[7]
> +; SSE2-SSSE3-NEXT:    pand %xmm0, %xmm1
> +; SSE2-SSSE3-NEXT:    packsswb %xmm0, %xmm1
> +; SSE2-SSSE3-NEXT:    pmovmskb %xmm1, %eax
>  ; SSE2-SSSE3-NEXT:    # kill: def $al killed $al killed $eax
>  ; SSE2-SSSE3-NEXT:    retq
>  ;
>  ; AVX12-LABEL: v8i8:
>  ; AVX12:       # %bb.0:
> -; AVX12-NEXT:    vpsllw $8, %xmm3, %xmm3
> -; AVX12-NEXT:    vpsraw $8, %xmm3, %xmm3
> -; AVX12-NEXT:    vpsllw $8, %xmm2, %xmm2
> -; AVX12-NEXT:    vpsraw $8, %xmm2, %xmm2
> -; AVX12-NEXT:    vpcmpgtw %xmm3, %xmm2, %xmm2
> -; AVX12-NEXT:    vpsllw $8, %xmm1, %xmm1
> -; AVX12-NEXT:    vpsraw $8, %xmm1, %xmm1
> -; AVX12-NEXT:    vpsllw $8, %xmm0, %xmm0
> -; AVX12-NEXT:    vpsraw $8, %xmm0, %xmm0
> -; AVX12-NEXT:    vpcmpgtw %xmm1, %xmm0, %xmm0
> -; AVX12-NEXT:    vpand %xmm2, %xmm0, %xmm0
> +; AVX12-NEXT:    vpcmpgtb %xmm1, %xmm0, %xmm0
> +; AVX12-NEXT:    vpmovsxbw %xmm0, %xmm0
> +; AVX12-NEXT:    vpcmpgtb %xmm3, %xmm2, %xmm1
> +; AVX12-NEXT:    vpmovsxbw %xmm1, %xmm1
> +; AVX12-NEXT:    vpand %xmm1, %xmm0, %xmm0
>  ; AVX12-NEXT:    vpacksswb %xmm0, %xmm0, %xmm0
>  ; AVX12-NEXT:    vpmovmskb %xmm0, %eax
>  ; AVX12-NEXT:    # kill: def $al killed $al killed $eax
> @@ -882,19 +576,13 @@ define i8 @v8i8(<8 x i8> %a, <8 x i8> %b
>  ;
>  ; AVX512F-LABEL: v8i8:
>  ; AVX512F:       # %bb.0:
> -; AVX512F-NEXT:    vpsllw $8, %xmm3, %xmm3
> -; AVX512F-NEXT:    vpsraw $8, %xmm3, %xmm3
> -; AVX512F-NEXT:    vpsllw $8, %xmm2, %xmm2
> -; AVX512F-NEXT:    vpsraw $8, %xmm2, %xmm2
> -; AVX512F-NEXT:    vpcmpgtw %xmm3, %xmm2, %xmm2
> -; AVX512F-NEXT:    vpsllw $8, %xmm1, %xmm1
> -; AVX512F-NEXT:    vpsraw $8, %xmm1, %xmm1
> -; AVX512F-NEXT:    vpsllw $8, %xmm0, %xmm0
> -; AVX512F-NEXT:    vpsraw $8, %xmm0, %xmm0
> -; AVX512F-NEXT:    vpcmpgtw %xmm1, %xmm0, %xmm0
> -; AVX512F-NEXT:    vpand %xmm2, %xmm0, %xmm0
> -; AVX512F-NEXT:    vpmovsxwd %xmm0, %ymm0
> -; AVX512F-NEXT:    vptestmd %ymm0, %ymm0, %k0
> +; AVX512F-NEXT:    vpcmpgtb %xmm1, %xmm0, %xmm0
> +; AVX512F-NEXT:    vpmovsxbd %xmm0, %zmm0
> +; AVX512F-NEXT:    vptestmd %zmm0, %zmm0, %k0
> +; AVX512F-NEXT:    vpcmpgtb %xmm3, %xmm2, %xmm0
> +; AVX512F-NEXT:    vpmovsxbd %xmm0, %zmm0
> +; AVX512F-NEXT:    vptestmd %zmm0, %zmm0, %k1
> +; AVX512F-NEXT:    kandw %k1, %k0, %k0
>  ; AVX512F-NEXT:    kmovw %k0, %eax
>  ; AVX512F-NEXT:    # kill: def $al killed $al killed $eax
>  ; AVX512F-NEXT:    vzeroupper
> @@ -902,16 +590,9 @@ define i8 @v8i8(<8 x i8> %a, <8 x i8> %b
>  ;
>  ; AVX512BW-LABEL: v8i8:
>  ; AVX512BW:       # %bb.0:
> -; AVX512BW-NEXT:    vpsllw $8, %xmm3, %xmm3
> -; AVX512BW-NEXT:    vpsraw $8, %xmm3, %xmm3
> -; AVX512BW-NEXT:    vpsllw $8, %xmm2, %xmm2
> -; AVX512BW-NEXT:    vpsraw $8, %xmm2, %xmm2
> -; AVX512BW-NEXT:    vpsllw $8, %xmm1, %xmm1
> -; AVX512BW-NEXT:    vpsraw $8, %xmm1, %xmm1
> -; AVX512BW-NEXT:    vpsllw $8, %xmm0, %xmm0
> -; AVX512BW-NEXT:    vpsraw $8, %xmm0, %xmm0
> -; AVX512BW-NEXT:    vpcmpgtw %xmm1, %xmm0, %k1
> -; AVX512BW-NEXT:    vpcmpgtw %xmm3, %xmm2, %k0 {%k1}
> +; AVX512BW-NEXT:    vpcmpgtb %xmm1, %xmm0, %k0
> +; AVX512BW-NEXT:    vpcmpgtb %xmm3, %xmm2, %k1
> +; AVX512BW-NEXT:    kandw %k1, %k0, %k0
>  ; AVX512BW-NEXT:    kmovd %k0, %eax
>  ; AVX512BW-NEXT:    # kill: def $al killed $al killed $eax
>  ; AVX512BW-NEXT:    retq
>
> Modified: llvm/trunk/test/CodeGen/X86/bitcast-setcc-128.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/bitcast-setcc-128.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/bitcast-setcc-128.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/bitcast-setcc-128.ll Wed Aug  7 09:24:26 2019
> @@ -144,87 +144,45 @@ define i16 @v16i8(<16 x i8> %a, <16 x i8
>  }
>
>  define i2 @v2i8(<2 x i8> %a, <2 x i8> %b) {
> -; SSE2-SSSE3-LABEL: v2i8:
> -; SSE2-SSSE3:       # %bb.0:
> -; SSE2-SSSE3-NEXT:    psllq $56, %xmm0
> -; SSE2-SSSE3-NEXT:    movdqa %xmm0, %xmm2
> -; SSE2-SSSE3-NEXT:    psrad $31, %xmm2
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm2 = xmm2[1,3,2,3]
> -; SSE2-SSSE3-NEXT:    psrad $24, %xmm0
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,3,2,3]
> -; SSE2-SSSE3-NEXT:    punpckldq {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1]
> -; SSE2-SSSE3-NEXT:    psllq $56, %xmm1
> -; SSE2-SSSE3-NEXT:    movdqa %xmm1, %xmm2
> -; SSE2-SSSE3-NEXT:    psrad $31, %xmm2
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm2 = xmm2[1,3,2,3]
> -; SSE2-SSSE3-NEXT:    psrad $24, %xmm1
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[1,3,2,3]
> -; SSE2-SSSE3-NEXT:    punpckldq {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1]
> -; SSE2-SSSE3-NEXT:    movdqa {{.*#+}} xmm2 = [2147483648,2147483648]
> -; SSE2-SSSE3-NEXT:    pxor %xmm2, %xmm1
> -; SSE2-SSSE3-NEXT:    pxor %xmm2, %xmm0
> -; SSE2-SSSE3-NEXT:    movdqa %xmm0, %xmm2
> -; SSE2-SSSE3-NEXT:    pcmpeqd %xmm1, %xmm2
> -; SSE2-SSSE3-NEXT:    pcmpgtd %xmm1, %xmm0
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm1 = xmm0[0,0,2,2]
> -; SSE2-SSSE3-NEXT:    pand %xmm2, %xmm1
> -; SSE2-SSSE3-NEXT:    por %xmm0, %xmm1
> -; SSE2-SSSE3-NEXT:    movmskpd %xmm1, %eax
> -; SSE2-SSSE3-NEXT:    # kill: def $al killed $al killed $eax
> -; SSE2-SSSE3-NEXT:    retq
> +; SSE2-LABEL: v2i8:
> +; SSE2:       # %bb.0:
> +; SSE2-NEXT:    pcmpgtb %xmm1, %xmm0
> +; SSE2-NEXT:    punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
> +; SSE2-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3]
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,0,1,1]
> +; SSE2-NEXT:    movmskpd %xmm0, %eax
> +; SSE2-NEXT:    # kill: def $al killed $al killed $eax
> +; SSE2-NEXT:    retq
> +;
> +; SSSE3-LABEL: v2i8:
> +; SSSE3:       # %bb.0:
> +; SSSE3-NEXT:    pcmpgtb %xmm1, %xmm0
> +; SSSE3-NEXT:    pshufb {{.*#+}} xmm0 = xmm0[u,u,0,0,u,u,0,0,u,u,1,1,u,u,1,1]
> +; SSSE3-NEXT:    movmskpd %xmm0, %eax
> +; SSSE3-NEXT:    # kill: def $al killed $al killed $eax
> +; SSSE3-NEXT:    retq
>  ;
> -; AVX1-LABEL: v2i8:
> -; AVX1:       # %bb.0:
> -; AVX1-NEXT:    vpsllq $56, %xmm1, %xmm1
> -; AVX1-NEXT:    vpsrad $31, %xmm1, %xmm2
> -; AVX1-NEXT:    vpsrad $24, %xmm1, %xmm1
> -; AVX1-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
> -; AVX1-NEXT:    vpblendw {{.*#+}} xmm1 = xmm1[0,1],xmm2[2,3],xmm1[4,5],xmm2[6,7]
> -; AVX1-NEXT:    vpsllq $56, %xmm0, %xmm0
> -; AVX1-NEXT:    vpsrad $31, %xmm0, %xmm2
> -; AVX1-NEXT:    vpsrad $24, %xmm0, %xmm0
> -; AVX1-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[1,1,3,3]
> -; AVX1-NEXT:    vpblendw {{.*#+}} xmm0 = xmm0[0,1],xmm2[2,3],xmm0[4,5],xmm2[6,7]
> -; AVX1-NEXT:    vpcmpgtq %xmm1, %xmm0, %xmm0
> -; AVX1-NEXT:    vmovmskpd %xmm0, %eax
> -; AVX1-NEXT:    # kill: def $al killed $al killed $eax
> -; AVX1-NEXT:    retq
> -;
> -; AVX2-LABEL: v2i8:
> -; AVX2:       # %bb.0:
> -; AVX2-NEXT:    vpsllq $56, %xmm1, %xmm1
> -; AVX2-NEXT:    vpsrad $31, %xmm1, %xmm2
> -; AVX2-NEXT:    vpsrad $24, %xmm1, %xmm1
> -; AVX2-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
> -; AVX2-NEXT:    vpblendd {{.*#+}} xmm1 = xmm1[0],xmm2[1],xmm1[2],xmm2[3]
> -; AVX2-NEXT:    vpsllq $56, %xmm0, %xmm0
> -; AVX2-NEXT:    vpsrad $31, %xmm0, %xmm2
> -; AVX2-NEXT:    vpsrad $24, %xmm0, %xmm0
> -; AVX2-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[1,1,3,3]
> -; AVX2-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm2[1],xmm0[2],xmm2[3]
> -; AVX2-NEXT:    vpcmpgtq %xmm1, %xmm0, %xmm0
> -; AVX2-NEXT:    vmovmskpd %xmm0, %eax
> -; AVX2-NEXT:    # kill: def $al killed $al killed $eax
> -; AVX2-NEXT:    retq
> +; AVX12-LABEL: v2i8:
> +; AVX12:       # %bb.0:
> +; AVX12-NEXT:    vpcmpgtb %xmm1, %xmm0, %xmm0
> +; AVX12-NEXT:    vpmovsxbq %xmm0, %xmm0
> +; AVX12-NEXT:    vmovmskpd %xmm0, %eax
> +; AVX12-NEXT:    # kill: def $al killed $al killed $eax
> +; AVX12-NEXT:    retq
>  ;
>  ; AVX512F-LABEL: v2i8:
>  ; AVX512F:       # %bb.0:
> -; AVX512F-NEXT:    vpsllq $56, %xmm1, %xmm1
> -; AVX512F-NEXT:    vpsraq $56, %xmm1, %xmm1
> -; AVX512F-NEXT:    vpsllq $56, %xmm0, %xmm0
> -; AVX512F-NEXT:    vpsraq $56, %xmm0, %xmm0
> -; AVX512F-NEXT:    vpcmpgtq %xmm1, %xmm0, %k0
> +; AVX512F-NEXT:    vpcmpgtb %xmm1, %xmm0, %xmm0
> +; AVX512F-NEXT:    vpmovsxbd %xmm0, %zmm0
> +; AVX512F-NEXT:    vptestmd %zmm0, %zmm0, %k0
>  ; AVX512F-NEXT:    kmovw %k0, %eax
>  ; AVX512F-NEXT:    # kill: def $al killed $al killed $eax
> +; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
>  ;
>  ; AVX512BW-LABEL: v2i8:
>  ; AVX512BW:       # %bb.0:
> -; AVX512BW-NEXT:    vpsllq $56, %xmm1, %xmm1
> -; AVX512BW-NEXT:    vpsraq $56, %xmm1, %xmm1
> -; AVX512BW-NEXT:    vpsllq $56, %xmm0, %xmm0
> -; AVX512BW-NEXT:    vpsraq $56, %xmm0, %xmm0
> -; AVX512BW-NEXT:    vpcmpgtq %xmm1, %xmm0, %k0
> +; AVX512BW-NEXT:    vpcmpgtb %xmm1, %xmm0, %k0
>  ; AVX512BW-NEXT:    kmovd %k0, %eax
>  ; AVX512BW-NEXT:    # kill: def $al killed $al killed $eax
>  ; AVX512BW-NEXT:    retq
> @@ -236,85 +194,34 @@ define i2 @v2i8(<2 x i8> %a, <2 x i8> %b
>  define i2 @v2i16(<2 x i16> %a, <2 x i16> %b) {
>  ; SSE2-SSSE3-LABEL: v2i16:
>  ; SSE2-SSSE3:       # %bb.0:
> -; SSE2-SSSE3-NEXT:    psllq $48, %xmm0
> -; SSE2-SSSE3-NEXT:    movdqa %xmm0, %xmm2
> -; SSE2-SSSE3-NEXT:    psrad $31, %xmm2
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm2 = xmm2[1,3,2,3]
> -; SSE2-SSSE3-NEXT:    psrad $16, %xmm0
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,3,2,3]
> -; SSE2-SSSE3-NEXT:    punpckldq {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1]
> -; SSE2-SSSE3-NEXT:    psllq $48, %xmm1
> -; SSE2-SSSE3-NEXT:    movdqa %xmm1, %xmm2
> -; SSE2-SSSE3-NEXT:    psrad $31, %xmm2
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm2 = xmm2[1,3,2,3]
> -; SSE2-SSSE3-NEXT:    psrad $16, %xmm1
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[1,3,2,3]
> -; SSE2-SSSE3-NEXT:    punpckldq {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1]
> -; SSE2-SSSE3-NEXT:    movdqa {{.*#+}} xmm2 = [2147483648,2147483648]
> -; SSE2-SSSE3-NEXT:    pxor %xmm2, %xmm1
> -; SSE2-SSSE3-NEXT:    pxor %xmm2, %xmm0
> -; SSE2-SSSE3-NEXT:    movdqa %xmm0, %xmm2
> -; SSE2-SSSE3-NEXT:    pcmpeqd %xmm1, %xmm2
> -; SSE2-SSSE3-NEXT:    pcmpgtd %xmm1, %xmm0
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm1 = xmm0[0,0,2,2]
> -; SSE2-SSSE3-NEXT:    pand %xmm2, %xmm1
> -; SSE2-SSSE3-NEXT:    por %xmm0, %xmm1
> -; SSE2-SSSE3-NEXT:    movmskpd %xmm1, %eax
> +; SSE2-SSSE3-NEXT:    pcmpgtw %xmm1, %xmm0
> +; SSE2-SSSE3-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3]
> +; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,0,1,1]
> +; SSE2-SSSE3-NEXT:    movmskpd %xmm0, %eax
>  ; SSE2-SSSE3-NEXT:    # kill: def $al killed $al killed $eax
>  ; SSE2-SSSE3-NEXT:    retq
>  ;
> -; AVX1-LABEL: v2i16:
> -; AVX1:       # %bb.0:
> -; AVX1-NEXT:    vpsllq $48, %xmm1, %xmm1
> -; AVX1-NEXT:    vpsrad $31, %xmm1, %xmm2
> -; AVX1-NEXT:    vpsrad $16, %xmm1, %xmm1
> -; AVX1-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
> -; AVX1-NEXT:    vpblendw {{.*#+}} xmm1 = xmm1[0,1],xmm2[2,3],xmm1[4,5],xmm2[6,7]
> -; AVX1-NEXT:    vpsllq $48, %xmm0, %xmm0
> -; AVX1-NEXT:    vpsrad $31, %xmm0, %xmm2
> -; AVX1-NEXT:    vpsrad $16, %xmm0, %xmm0
> -; AVX1-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[1,1,3,3]
> -; AVX1-NEXT:    vpblendw {{.*#+}} xmm0 = xmm0[0,1],xmm2[2,3],xmm0[4,5],xmm2[6,7]
> -; AVX1-NEXT:    vpcmpgtq %xmm1, %xmm0, %xmm0
> -; AVX1-NEXT:    vmovmskpd %xmm0, %eax
> -; AVX1-NEXT:    # kill: def $al killed $al killed $eax
> -; AVX1-NEXT:    retq
> -;
> -; AVX2-LABEL: v2i16:
> -; AVX2:       # %bb.0:
> -; AVX2-NEXT:    vpsllq $48, %xmm1, %xmm1
> -; AVX2-NEXT:    vpsrad $31, %xmm1, %xmm2
> -; AVX2-NEXT:    vpsrad $16, %xmm1, %xmm1
> -; AVX2-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
> -; AVX2-NEXT:    vpblendd {{.*#+}} xmm1 = xmm1[0],xmm2[1],xmm1[2],xmm2[3]
> -; AVX2-NEXT:    vpsllq $48, %xmm0, %xmm0
> -; AVX2-NEXT:    vpsrad $31, %xmm0, %xmm2
> -; AVX2-NEXT:    vpsrad $16, %xmm0, %xmm0
> -; AVX2-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[1,1,3,3]
> -; AVX2-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm2[1],xmm0[2],xmm2[3]
> -; AVX2-NEXT:    vpcmpgtq %xmm1, %xmm0, %xmm0
> -; AVX2-NEXT:    vmovmskpd %xmm0, %eax
> -; AVX2-NEXT:    # kill: def $al killed $al killed $eax
> -; AVX2-NEXT:    retq
> +; AVX12-LABEL: v2i16:
> +; AVX12:       # %bb.0:
> +; AVX12-NEXT:    vpcmpgtw %xmm1, %xmm0, %xmm0
> +; AVX12-NEXT:    vpmovsxwq %xmm0, %xmm0
> +; AVX12-NEXT:    vmovmskpd %xmm0, %eax
> +; AVX12-NEXT:    # kill: def $al killed $al killed $eax
> +; AVX12-NEXT:    retq
>  ;
>  ; AVX512F-LABEL: v2i16:
>  ; AVX512F:       # %bb.0:
> -; AVX512F-NEXT:    vpsllq $48, %xmm1, %xmm1
> -; AVX512F-NEXT:    vpsraq $48, %xmm1, %xmm1
> -; AVX512F-NEXT:    vpsllq $48, %xmm0, %xmm0
> -; AVX512F-NEXT:    vpsraq $48, %xmm0, %xmm0
> -; AVX512F-NEXT:    vpcmpgtq %xmm1, %xmm0, %k0
> +; AVX512F-NEXT:    vpcmpgtw %xmm1, %xmm0, %xmm0
> +; AVX512F-NEXT:    vpmovsxwd %xmm0, %ymm0
> +; AVX512F-NEXT:    vptestmd %ymm0, %ymm0, %k0
>  ; AVX512F-NEXT:    kmovw %k0, %eax
>  ; AVX512F-NEXT:    # kill: def $al killed $al killed $eax
> +; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
>  ;
>  ; AVX512BW-LABEL: v2i16:
>  ; AVX512BW:       # %bb.0:
> -; AVX512BW-NEXT:    vpsllq $48, %xmm1, %xmm1
> -; AVX512BW-NEXT:    vpsraq $48, %xmm1, %xmm1
> -; AVX512BW-NEXT:    vpsllq $48, %xmm0, %xmm0
> -; AVX512BW-NEXT:    vpsraq $48, %xmm0, %xmm0
> -; AVX512BW-NEXT:    vpcmpgtq %xmm1, %xmm0, %k0
> +; AVX512BW-NEXT:    vpcmpgtw %xmm1, %xmm0, %k0
>  ; AVX512BW-NEXT:    kmovd %k0, %eax
>  ; AVX512BW-NEXT:    # kill: def $al killed $al killed $eax
>  ; AVX512BW-NEXT:    retq
> @@ -326,73 +233,30 @@ define i2 @v2i16(<2 x i16> %a, <2 x i16>
>  define i2 @v2i32(<2 x i32> %a, <2 x i32> %b) {
>  ; SSE2-SSSE3-LABEL: v2i32:
>  ; SSE2-SSSE3:       # %bb.0:
> -; SSE2-SSSE3-NEXT:    psllq $32, %xmm0
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm2 = xmm0[1,3,2,3]
> -; SSE2-SSSE3-NEXT:    psrad $31, %xmm0
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,3,2,3]
> -; SSE2-SSSE3-NEXT:    punpckldq {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1]
> -; SSE2-SSSE3-NEXT:    psllq $32, %xmm1
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm0 = xmm1[1,3,2,3]
> -; SSE2-SSSE3-NEXT:    psrad $31, %xmm1
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[1,3,2,3]
> -; SSE2-SSSE3-NEXT:    punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
> -; SSE2-SSSE3-NEXT:    movdqa {{.*#+}} xmm1 = [2147483648,2147483648]
> -; SSE2-SSSE3-NEXT:    pxor %xmm1, %xmm0
> -; SSE2-SSSE3-NEXT:    pxor %xmm1, %xmm2
> -; SSE2-SSSE3-NEXT:    movdqa %xmm2, %xmm1
> -; SSE2-SSSE3-NEXT:    pcmpeqd %xmm0, %xmm1
> -; SSE2-SSSE3-NEXT:    pcmpgtd %xmm0, %xmm2
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm0 = xmm2[0,0,2,2]
> -; SSE2-SSSE3-NEXT:    pand %xmm1, %xmm0
> -; SSE2-SSSE3-NEXT:    por %xmm2, %xmm0
> +; SSE2-SSSE3-NEXT:    pcmpgtd %xmm1, %xmm0
> +; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,0,1,1]
>  ; SSE2-SSSE3-NEXT:    movmskpd %xmm0, %eax
>  ; SSE2-SSSE3-NEXT:    # kill: def $al killed $al killed $eax
>  ; SSE2-SSSE3-NEXT:    retq
>  ;
> -; AVX1-LABEL: v2i32:
> -; AVX1:       # %bb.0:
> -; AVX1-NEXT:    vpsllq $32, %xmm1, %xmm2
> -; AVX1-NEXT:    vpsrad $31, %xmm2, %xmm2
> -; AVX1-NEXT:    vpblendw {{.*#+}} xmm1 = xmm1[0,1],xmm2[2,3],xmm1[4,5],xmm2[6,7]
> -; AVX1-NEXT:    vpsllq $32, %xmm0, %xmm2
> -; AVX1-NEXT:    vpsrad $31, %xmm2, %xmm2
> -; AVX1-NEXT:    vpblendw {{.*#+}} xmm0 = xmm0[0,1],xmm2[2,3],xmm0[4,5],xmm2[6,7]
> -; AVX1-NEXT:    vpcmpgtq %xmm1, %xmm0, %xmm0
> -; AVX1-NEXT:    vmovmskpd %xmm0, %eax
> -; AVX1-NEXT:    # kill: def $al killed $al killed $eax
> -; AVX1-NEXT:    retq
> -;
> -; AVX2-LABEL: v2i32:
> -; AVX2:       # %bb.0:
> -; AVX2-NEXT:    vpsllq $32, %xmm1, %xmm2
> -; AVX2-NEXT:    vpsrad $31, %xmm2, %xmm2
> -; AVX2-NEXT:    vpblendd {{.*#+}} xmm1 = xmm1[0],xmm2[1],xmm1[2],xmm2[3]
> -; AVX2-NEXT:    vpsllq $32, %xmm0, %xmm2
> -; AVX2-NEXT:    vpsrad $31, %xmm2, %xmm2
> -; AVX2-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm2[1],xmm0[2],xmm2[3]
> -; AVX2-NEXT:    vpcmpgtq %xmm1, %xmm0, %xmm0
> -; AVX2-NEXT:    vmovmskpd %xmm0, %eax
> -; AVX2-NEXT:    # kill: def $al killed $al killed $eax
> -; AVX2-NEXT:    retq
> +; AVX12-LABEL: v2i32:
> +; AVX12:       # %bb.0:
> +; AVX12-NEXT:    vpcmpgtd %xmm1, %xmm0, %xmm0
> +; AVX12-NEXT:    vpmovsxdq %xmm0, %xmm0
> +; AVX12-NEXT:    vmovmskpd %xmm0, %eax
> +; AVX12-NEXT:    # kill: def $al killed $al killed $eax
> +; AVX12-NEXT:    retq
>  ;
>  ; AVX512F-LABEL: v2i32:
>  ; AVX512F:       # %bb.0:
> -; AVX512F-NEXT:    vpsllq $32, %xmm1, %xmm1
> -; AVX512F-NEXT:    vpsraq $32, %xmm1, %xmm1
> -; AVX512F-NEXT:    vpsllq $32, %xmm0, %xmm0
> -; AVX512F-NEXT:    vpsraq $32, %xmm0, %xmm0
> -; AVX512F-NEXT:    vpcmpgtq %xmm1, %xmm0, %k0
> +; AVX512F-NEXT:    vpcmpgtd %xmm1, %xmm0, %k0
>  ; AVX512F-NEXT:    kmovw %k0, %eax
>  ; AVX512F-NEXT:    # kill: def $al killed $al killed $eax
>  ; AVX512F-NEXT:    retq
>  ;
>  ; AVX512BW-LABEL: v2i32:
>  ; AVX512BW:       # %bb.0:
> -; AVX512BW-NEXT:    vpsllq $32, %xmm1, %xmm1
> -; AVX512BW-NEXT:    vpsraq $32, %xmm1, %xmm1
> -; AVX512BW-NEXT:    vpsllq $32, %xmm0, %xmm0
> -; AVX512BW-NEXT:    vpsraq $32, %xmm0, %xmm0
> -; AVX512BW-NEXT:    vpcmpgtq %xmm1, %xmm0, %k0
> +; AVX512BW-NEXT:    vpcmpgtd %xmm1, %xmm0, %k0
>  ; AVX512BW-NEXT:    kmovd %k0, %eax
>  ; AVX512BW-NEXT:    # kill: def $al killed $al killed $eax
>  ; AVX512BW-NEXT:    retq
> @@ -478,44 +342,34 @@ define i2 @v2f64(<2 x double> %a, <2 x d
>  define i4 @v4i8(<4 x i8> %a, <4 x i8> %b) {
>  ; SSE2-SSSE3-LABEL: v4i8:
>  ; SSE2-SSSE3:       # %bb.0:
> -; SSE2-SSSE3-NEXT:    pslld $24, %xmm1
> -; SSE2-SSSE3-NEXT:    psrad $24, %xmm1
> -; SSE2-SSSE3-NEXT:    pslld $24, %xmm0
> -; SSE2-SSSE3-NEXT:    psrad $24, %xmm0
> -; SSE2-SSSE3-NEXT:    pcmpgtd %xmm1, %xmm0
> +; SSE2-SSSE3-NEXT:    pcmpgtb %xmm1, %xmm0
> +; SSE2-SSSE3-NEXT:    punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
> +; SSE2-SSSE3-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3]
>  ; SSE2-SSSE3-NEXT:    movmskps %xmm0, %eax
>  ; SSE2-SSSE3-NEXT:    # kill: def $al killed $al killed $eax
>  ; SSE2-SSSE3-NEXT:    retq
>  ;
>  ; AVX12-LABEL: v4i8:
>  ; AVX12:       # %bb.0:
> -; AVX12-NEXT:    vpslld $24, %xmm1, %xmm1
> -; AVX12-NEXT:    vpsrad $24, %xmm1, %xmm1
> -; AVX12-NEXT:    vpslld $24, %xmm0, %xmm0
> -; AVX12-NEXT:    vpsrad $24, %xmm0, %xmm0
> -; AVX12-NEXT:    vpcmpgtd %xmm1, %xmm0, %xmm0
> +; AVX12-NEXT:    vpcmpgtb %xmm1, %xmm0, %xmm0
> +; AVX12-NEXT:    vpmovsxbd %xmm0, %xmm0
>  ; AVX12-NEXT:    vmovmskps %xmm0, %eax
>  ; AVX12-NEXT:    # kill: def $al killed $al killed $eax
>  ; AVX12-NEXT:    retq
>  ;
>  ; AVX512F-LABEL: v4i8:
>  ; AVX512F:       # %bb.0:
> -; AVX512F-NEXT:    vpslld $24, %xmm1, %xmm1
> -; AVX512F-NEXT:    vpsrad $24, %xmm1, %xmm1
> -; AVX512F-NEXT:    vpslld $24, %xmm0, %xmm0
> -; AVX512F-NEXT:    vpsrad $24, %xmm0, %xmm0
> -; AVX512F-NEXT:    vpcmpgtd %xmm1, %xmm0, %k0
> +; AVX512F-NEXT:    vpcmpgtb %xmm1, %xmm0, %xmm0
> +; AVX512F-NEXT:    vpmovsxbd %xmm0, %zmm0
> +; AVX512F-NEXT:    vptestmd %zmm0, %zmm0, %k0
>  ; AVX512F-NEXT:    kmovw %k0, %eax
>  ; AVX512F-NEXT:    # kill: def $al killed $al killed $eax
> +; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
>  ;
>  ; AVX512BW-LABEL: v4i8:
>  ; AVX512BW:       # %bb.0:
> -; AVX512BW-NEXT:    vpslld $24, %xmm1, %xmm1
> -; AVX512BW-NEXT:    vpsrad $24, %xmm1, %xmm1
> -; AVX512BW-NEXT:    vpslld $24, %xmm0, %xmm0
> -; AVX512BW-NEXT:    vpsrad $24, %xmm0, %xmm0
> -; AVX512BW-NEXT:    vpcmpgtd %xmm1, %xmm0, %k0
> +; AVX512BW-NEXT:    vpcmpgtb %xmm1, %xmm0, %k0
>  ; AVX512BW-NEXT:    kmovd %k0, %eax
>  ; AVX512BW-NEXT:    # kill: def $al killed $al killed $eax
>  ; AVX512BW-NEXT:    retq
> @@ -527,44 +381,33 @@ define i4 @v4i8(<4 x i8> %a, <4 x i8> %b
>  define i4 @v4i16(<4 x i16> %a, <4 x i16> %b) {
>  ; SSE2-SSSE3-LABEL: v4i16:
>  ; SSE2-SSSE3:       # %bb.0:
> -; SSE2-SSSE3-NEXT:    pslld $16, %xmm1
> -; SSE2-SSSE3-NEXT:    psrad $16, %xmm1
> -; SSE2-SSSE3-NEXT:    pslld $16, %xmm0
> -; SSE2-SSSE3-NEXT:    psrad $16, %xmm0
> -; SSE2-SSSE3-NEXT:    pcmpgtd %xmm1, %xmm0
> +; SSE2-SSSE3-NEXT:    pcmpgtw %xmm1, %xmm0
> +; SSE2-SSSE3-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3]
>  ; SSE2-SSSE3-NEXT:    movmskps %xmm0, %eax
>  ; SSE2-SSSE3-NEXT:    # kill: def $al killed $al killed $eax
>  ; SSE2-SSSE3-NEXT:    retq
>  ;
>  ; AVX12-LABEL: v4i16:
>  ; AVX12:       # %bb.0:
> -; AVX12-NEXT:    vpslld $16, %xmm1, %xmm1
> -; AVX12-NEXT:    vpsrad $16, %xmm1, %xmm1
> -; AVX12-NEXT:    vpslld $16, %xmm0, %xmm0
> -; AVX12-NEXT:    vpsrad $16, %xmm0, %xmm0
> -; AVX12-NEXT:    vpcmpgtd %xmm1, %xmm0, %xmm0
> +; AVX12-NEXT:    vpcmpgtw %xmm1, %xmm0, %xmm0
> +; AVX12-NEXT:    vpmovsxwd %xmm0, %xmm0
>  ; AVX12-NEXT:    vmovmskps %xmm0, %eax
>  ; AVX12-NEXT:    # kill: def $al killed $al killed $eax
>  ; AVX12-NEXT:    retq
>  ;
>  ; AVX512F-LABEL: v4i16:
>  ; AVX512F:       # %bb.0:
> -; AVX512F-NEXT:    vpslld $16, %xmm1, %xmm1
> -; AVX512F-NEXT:    vpsrad $16, %xmm1, %xmm1
> -; AVX512F-NEXT:    vpslld $16, %xmm0, %xmm0
> -; AVX512F-NEXT:    vpsrad $16, %xmm0, %xmm0
> -; AVX512F-NEXT:    vpcmpgtd %xmm1, %xmm0, %k0
> +; AVX512F-NEXT:    vpcmpgtw %xmm1, %xmm0, %xmm0
> +; AVX512F-NEXT:    vpmovsxwd %xmm0, %ymm0
> +; AVX512F-NEXT:    vptestmd %ymm0, %ymm0, %k0
>  ; AVX512F-NEXT:    kmovw %k0, %eax
>  ; AVX512F-NEXT:    # kill: def $al killed $al killed $eax
> +; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
>  ;
>  ; AVX512BW-LABEL: v4i16:
>  ; AVX512BW:       # %bb.0:
> -; AVX512BW-NEXT:    vpslld $16, %xmm1, %xmm1
> -; AVX512BW-NEXT:    vpsrad $16, %xmm1, %xmm1
> -; AVX512BW-NEXT:    vpslld $16, %xmm0, %xmm0
> -; AVX512BW-NEXT:    vpsrad $16, %xmm0, %xmm0
> -; AVX512BW-NEXT:    vpcmpgtd %xmm1, %xmm0, %k0
> +; AVX512BW-NEXT:    vpcmpgtw %xmm1, %xmm0, %k0
>  ; AVX512BW-NEXT:    kmovd %k0, %eax
>  ; AVX512BW-NEXT:    # kill: def $al killed $al killed $eax
>  ; AVX512BW-NEXT:    retq
> @@ -576,11 +419,8 @@ define i4 @v4i16(<4 x i16> %a, <4 x i16>
>  define i8 @v8i8(<8 x i8> %a, <8 x i8> %b) {
>  ; SSE2-SSSE3-LABEL: v8i8:
>  ; SSE2-SSSE3:       # %bb.0:
> -; SSE2-SSSE3-NEXT:    psllw $8, %xmm1
> -; SSE2-SSSE3-NEXT:    psraw $8, %xmm1
> -; SSE2-SSSE3-NEXT:    psllw $8, %xmm0
> -; SSE2-SSSE3-NEXT:    psraw $8, %xmm0
> -; SSE2-SSSE3-NEXT:    pcmpgtw %xmm1, %xmm0
> +; SSE2-SSSE3-NEXT:    pcmpgtb %xmm1, %xmm0
> +; SSE2-SSSE3-NEXT:    punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
>  ; SSE2-SSSE3-NEXT:    packsswb %xmm0, %xmm0
>  ; SSE2-SSSE3-NEXT:    pmovmskb %xmm0, %eax
>  ; SSE2-SSSE3-NEXT:    # kill: def $al killed $al killed $eax
> @@ -588,11 +428,8 @@ define i8 @v8i8(<8 x i8> %a, <8 x i8> %b
>  ;
>  ; AVX12-LABEL: v8i8:
>  ; AVX12:       # %bb.0:
> -; AVX12-NEXT:    vpsllw $8, %xmm1, %xmm1
> -; AVX12-NEXT:    vpsraw $8, %xmm1, %xmm1
> -; AVX12-NEXT:    vpsllw $8, %xmm0, %xmm0
> -; AVX12-NEXT:    vpsraw $8, %xmm0, %xmm0
> -; AVX12-NEXT:    vpcmpgtw %xmm1, %xmm0, %xmm0
> +; AVX12-NEXT:    vpcmpgtb %xmm1, %xmm0, %xmm0
> +; AVX12-NEXT:    vpmovsxbw %xmm0, %xmm0
>  ; AVX12-NEXT:    vpacksswb %xmm0, %xmm0, %xmm0
>  ; AVX12-NEXT:    vpmovmskb %xmm0, %eax
>  ; AVX12-NEXT:    # kill: def $al killed $al killed $eax
> @@ -600,13 +437,9 @@ define i8 @v8i8(<8 x i8> %a, <8 x i8> %b
>  ;
>  ; AVX512F-LABEL: v8i8:
>  ; AVX512F:       # %bb.0:
> -; AVX512F-NEXT:    vpsllw $8, %xmm1, %xmm1
> -; AVX512F-NEXT:    vpsraw $8, %xmm1, %xmm1
> -; AVX512F-NEXT:    vpsllw $8, %xmm0, %xmm0
> -; AVX512F-NEXT:    vpsraw $8, %xmm0, %xmm0
> -; AVX512F-NEXT:    vpcmpgtw %xmm1, %xmm0, %xmm0
> -; AVX512F-NEXT:    vpmovsxwd %xmm0, %ymm0
> -; AVX512F-NEXT:    vptestmd %ymm0, %ymm0, %k0
> +; AVX512F-NEXT:    vpcmpgtb %xmm1, %xmm0, %xmm0
> +; AVX512F-NEXT:    vpmovsxbd %xmm0, %zmm0
> +; AVX512F-NEXT:    vptestmd %zmm0, %zmm0, %k0
>  ; AVX512F-NEXT:    kmovw %k0, %eax
>  ; AVX512F-NEXT:    # kill: def $al killed $al killed $eax
>  ; AVX512F-NEXT:    vzeroupper
> @@ -614,11 +447,7 @@ define i8 @v8i8(<8 x i8> %a, <8 x i8> %b
>  ;
>  ; AVX512BW-LABEL: v8i8:
>  ; AVX512BW:       # %bb.0:
> -; AVX512BW-NEXT:    vpsllw $8, %xmm1, %xmm1
> -; AVX512BW-NEXT:    vpsraw $8, %xmm1, %xmm1
> -; AVX512BW-NEXT:    vpsllw $8, %xmm0, %xmm0
> -; AVX512BW-NEXT:    vpsraw $8, %xmm0, %xmm0
> -; AVX512BW-NEXT:    vpcmpgtw %xmm1, %xmm0, %k0
> +; AVX512BW-NEXT:    vpcmpgtb %xmm1, %xmm0, %k0
>  ; AVX512BW-NEXT:    kmovd %k0, %eax
>  ; AVX512BW-NEXT:    # kill: def $al killed $al killed $eax
>  ; AVX512BW-NEXT:    retq
>
> Modified: llvm/trunk/test/CodeGen/X86/bitcast-vector-bool.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/bitcast-vector-bool.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/bitcast-vector-bool.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/bitcast-vector-bool.ll Wed Aug  7 09:24:26 2019
> @@ -151,27 +151,14 @@ define i4 @bitcast_v8i16_to_v2i4(<8 x i1
>  }
>
>  define i8 @bitcast_v16i8_to_v2i8(<16 x i8> %a0) nounwind {
> -; SSE2-LABEL: bitcast_v16i8_to_v2i8:
> -; SSE2:       # %bb.0:
> -; SSE2-NEXT:    pmovmskb %xmm0, %eax
> -; SSE2-NEXT:    movd %eax, %xmm0
> -; SSE2-NEXT:    punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
> -; SSE2-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3]
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
> -; SSE2-NEXT:    movdqa %xmm0, -{{[0-9]+}}(%rsp)
> -; SSE2-NEXT:    movb -{{[0-9]+}}(%rsp), %al
> -; SSE2-NEXT:    addb -{{[0-9]+}}(%rsp), %al
> -; SSE2-NEXT:    retq
> -;
> -; SSSE3-LABEL: bitcast_v16i8_to_v2i8:
> -; SSSE3:       # %bb.0:
> -; SSSE3-NEXT:    pmovmskb %xmm0, %eax
> -; SSSE3-NEXT:    movd %eax, %xmm0
> -; SSSE3-NEXT:    pshufb {{.*#+}} xmm0 = xmm0[0,u,u,u,u,u,u,u,1,u,u,u,u,u,u,u]
> -; SSSE3-NEXT:    movdqa %xmm0, -{{[0-9]+}}(%rsp)
> -; SSSE3-NEXT:    movb -{{[0-9]+}}(%rsp), %al
> -; SSSE3-NEXT:    addb -{{[0-9]+}}(%rsp), %al
> -; SSSE3-NEXT:    retq
> +; SSE2-SSSE3-LABEL: bitcast_v16i8_to_v2i8:
> +; SSE2-SSSE3:       # %bb.0:
> +; SSE2-SSSE3-NEXT:    pmovmskb %xmm0, %eax
> +; SSE2-SSSE3-NEXT:    movd %eax, %xmm0
> +; SSE2-SSSE3-NEXT:    movdqa %xmm0, -{{[0-9]+}}(%rsp)
> +; SSE2-SSSE3-NEXT:    movb -{{[0-9]+}}(%rsp), %al
> +; SSE2-SSSE3-NEXT:    addb -{{[0-9]+}}(%rsp), %al
> +; SSE2-SSSE3-NEXT:    retq
>  ;
>  ; AVX12-LABEL: bitcast_v16i8_to_v2i8:
>  ; AVX12:       # %bb.0:
> @@ -187,7 +174,7 @@ define i8 @bitcast_v16i8_to_v2i8(<16 x i
>  ; AVX512:       # %bb.0:
>  ; AVX512-NEXT:    vpmovb2m %xmm0, %k0
>  ; AVX512-NEXT:    kmovw %k0, -{{[0-9]+}}(%rsp)
> -; AVX512-NEXT:    vmovd {{.*#+}} xmm0 = mem[0],zero,zero,zero
> +; AVX512-NEXT:    vmovdqa -{{[0-9]+}}(%rsp), %xmm0
>  ; AVX512-NEXT:    vpextrb $0, %xmm0, %ecx
>  ; AVX512-NEXT:    vpextrb $1, %xmm0, %eax
>  ; AVX512-NEXT:    addb %cl, %al
> @@ -318,29 +305,15 @@ define i4 @bitcast_v8i32_to_v2i4(<8 x i3
>  }
>
>  define i8 @bitcast_v16i16_to_v2i8(<16 x i16> %a0) nounwind {
> -; SSE2-LABEL: bitcast_v16i16_to_v2i8:
> -; SSE2:       # %bb.0:
> -; SSE2-NEXT:    packsswb %xmm1, %xmm0
> -; SSE2-NEXT:    pmovmskb %xmm0, %eax
> -; SSE2-NEXT:    movd %eax, %xmm0
> -; SSE2-NEXT:    punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
> -; SSE2-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3]
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
> -; SSE2-NEXT:    movdqa %xmm0, -{{[0-9]+}}(%rsp)
> -; SSE2-NEXT:    movb -{{[0-9]+}}(%rsp), %al
> -; SSE2-NEXT:    addb -{{[0-9]+}}(%rsp), %al
> -; SSE2-NEXT:    retq
> -;
> -; SSSE3-LABEL: bitcast_v16i16_to_v2i8:
> -; SSSE3:       # %bb.0:
> -; SSSE3-NEXT:    packsswb %xmm1, %xmm0
> -; SSSE3-NEXT:    pmovmskb %xmm0, %eax
> -; SSSE3-NEXT:    movd %eax, %xmm0
> -; SSSE3-NEXT:    pshufb {{.*#+}} xmm0 = xmm0[0,u,u,u,u,u,u,u,1,u,u,u,u,u,u,u]
> -; SSSE3-NEXT:    movdqa %xmm0, -{{[0-9]+}}(%rsp)
> -; SSSE3-NEXT:    movb -{{[0-9]+}}(%rsp), %al
> -; SSSE3-NEXT:    addb -{{[0-9]+}}(%rsp), %al
> -; SSSE3-NEXT:    retq
> +; SSE2-SSSE3-LABEL: bitcast_v16i16_to_v2i8:
> +; SSE2-SSSE3:       # %bb.0:
> +; SSE2-SSSE3-NEXT:    packsswb %xmm1, %xmm0
> +; SSE2-SSSE3-NEXT:    pmovmskb %xmm0, %eax
> +; SSE2-SSSE3-NEXT:    movd %eax, %xmm0
> +; SSE2-SSSE3-NEXT:    movdqa %xmm0, -{{[0-9]+}}(%rsp)
> +; SSE2-SSSE3-NEXT:    movb -{{[0-9]+}}(%rsp), %al
> +; SSE2-SSSE3-NEXT:    addb -{{[0-9]+}}(%rsp), %al
> +; SSE2-SSSE3-NEXT:    retq
>  ;
>  ; AVX1-LABEL: bitcast_v16i16_to_v2i8:
>  ; AVX1:       # %bb.0:
> @@ -374,7 +347,7 @@ define i8 @bitcast_v16i16_to_v2i8(<16 x
>  ; AVX512:       # %bb.0:
>  ; AVX512-NEXT:    vpmovw2m %ymm0, %k0
>  ; AVX512-NEXT:    kmovw %k0, -{{[0-9]+}}(%rsp)
> -; AVX512-NEXT:    vmovd {{.*#+}} xmm0 = mem[0],zero,zero,zero
> +; AVX512-NEXT:    vmovdqa -{{[0-9]+}}(%rsp), %xmm0
>  ; AVX512-NEXT:    vpextrb $0, %xmm0, %ecx
>  ; AVX512-NEXT:    vpextrb $1, %xmm0, %eax
>  ; AVX512-NEXT:    addb %cl, %al
> @@ -392,12 +365,10 @@ define i8 @bitcast_v16i16_to_v2i8(<16 x
>  define i16 @bitcast_v32i8_to_v2i16(<32 x i8> %a0) nounwind {
>  ; SSE2-SSSE3-LABEL: bitcast_v32i8_to_v2i16:
>  ; SSE2-SSSE3:       # %bb.0:
> -; SSE2-SSSE3-NEXT:    pmovmskb %xmm0, %eax
> -; SSE2-SSSE3-NEXT:    pmovmskb %xmm1, %ecx
> -; SSE2-SSSE3-NEXT:    shll $16, %ecx
> -; SSE2-SSSE3-NEXT:    orl %eax, %ecx
> -; SSE2-SSSE3-NEXT:    movd %ecx, %xmm0
> -; SSE2-SSSE3-NEXT:    pextrw $0, %xmm0, %ecx
> +; SSE2-SSSE3-NEXT:    pmovmskb %xmm0, %ecx
> +; SSE2-SSSE3-NEXT:    pmovmskb %xmm1, %eax
> +; SSE2-SSSE3-NEXT:    shll $16, %eax
> +; SSE2-SSSE3-NEXT:    movd %eax, %xmm0
>  ; SSE2-SSSE3-NEXT:    pextrw $1, %xmm0, %eax
>  ; SSE2-SSSE3-NEXT:    addl %ecx, %eax
>  ; SSE2-SSSE3-NEXT:    # kill: def $ax killed $ax killed $eax
> @@ -411,7 +382,6 @@ define i16 @bitcast_v32i8_to_v2i16(<32 x
>  ; AVX1-NEXT:    shll $16, %ecx
>  ; AVX1-NEXT:    orl %eax, %ecx
>  ; AVX1-NEXT:    vmovd %ecx, %xmm0
> -; AVX1-NEXT:    vpextrw $0, %xmm0, %ecx
>  ; AVX1-NEXT:    vpextrw $1, %xmm0, %eax
>  ; AVX1-NEXT:    addl %ecx, %eax
>  ; AVX1-NEXT:    # kill: def $ax killed $ax killed $eax
> @@ -420,9 +390,8 @@ define i16 @bitcast_v32i8_to_v2i16(<32 x
>  ;
>  ; AVX2-LABEL: bitcast_v32i8_to_v2i16:
>  ; AVX2:       # %bb.0:
> -; AVX2-NEXT:    vpmovmskb %ymm0, %eax
> -; AVX2-NEXT:    vmovd %eax, %xmm0
> -; AVX2-NEXT:    vpextrw $0, %xmm0, %ecx
> +; AVX2-NEXT:    vpmovmskb %ymm0, %ecx
> +; AVX2-NEXT:    vmovd %ecx, %xmm0
>  ; AVX2-NEXT:    vpextrw $1, %xmm0, %eax
>  ; AVX2-NEXT:    addl %ecx, %eax
>  ; AVX2-NEXT:    # kill: def $ax killed $ax killed $eax
> @@ -437,8 +406,8 @@ define i16 @bitcast_v32i8_to_v2i16(<32 x
>  ; AVX512-NEXT:    subq $32, %rsp
>  ; AVX512-NEXT:    vpmovb2m %ymm0, %k0
>  ; AVX512-NEXT:    kmovd %k0, (%rsp)
> -; AVX512-NEXT:    vmovd {{.*#+}} xmm0 = mem[0],zero,zero,zero
> -; AVX512-NEXT:    vpextrw $0, %xmm0, %ecx
> +; AVX512-NEXT:    vmovdqa (%rsp), %xmm0
> +; AVX512-NEXT:    vmovd %xmm0, %ecx
>  ; AVX512-NEXT:    vpextrw $1, %xmm0, %eax
>  ; AVX512-NEXT:    addl %ecx, %eax
>  ; AVX512-NEXT:    # kill: def $ax killed $ax killed $eax
> @@ -579,33 +548,17 @@ define i4 @bitcast_v8i64_to_v2i4(<8 x i6
>  }
>
>  define i8 @bitcast_v16i32_to_v2i8(<16 x i32> %a0) nounwind {
> -; SSE2-LABEL: bitcast_v16i32_to_v2i8:
> -; SSE2:       # %bb.0:
> -; SSE2-NEXT:    packssdw %xmm3, %xmm2
> -; SSE2-NEXT:    packssdw %xmm1, %xmm0
> -; SSE2-NEXT:    packsswb %xmm2, %xmm0
> -; SSE2-NEXT:    pmovmskb %xmm0, %eax
> -; SSE2-NEXT:    movd %eax, %xmm0
> -; SSE2-NEXT:    punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
> -; SSE2-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3]
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
> -; SSE2-NEXT:    movdqa %xmm0, -{{[0-9]+}}(%rsp)
> -; SSE2-NEXT:    movb -{{[0-9]+}}(%rsp), %al
> -; SSE2-NEXT:    addb -{{[0-9]+}}(%rsp), %al
> -; SSE2-NEXT:    retq
> -;
> -; SSSE3-LABEL: bitcast_v16i32_to_v2i8:
> -; SSSE3:       # %bb.0:
> -; SSSE3-NEXT:    packssdw %xmm3, %xmm2
> -; SSSE3-NEXT:    packssdw %xmm1, %xmm0
> -; SSSE3-NEXT:    packsswb %xmm2, %xmm0
> -; SSSE3-NEXT:    pmovmskb %xmm0, %eax
> -; SSSE3-NEXT:    movd %eax, %xmm0
> -; SSSE3-NEXT:    pshufb {{.*#+}} xmm0 = xmm0[0,u,u,u,u,u,u,u,1,u,u,u,u,u,u,u]
> -; SSSE3-NEXT:    movdqa %xmm0, -{{[0-9]+}}(%rsp)
> -; SSSE3-NEXT:    movb -{{[0-9]+}}(%rsp), %al
> -; SSSE3-NEXT:    addb -{{[0-9]+}}(%rsp), %al
> -; SSSE3-NEXT:    retq
> +; SSE2-SSSE3-LABEL: bitcast_v16i32_to_v2i8:
> +; SSE2-SSSE3:       # %bb.0:
> +; SSE2-SSSE3-NEXT:    packssdw %xmm3, %xmm2
> +; SSE2-SSSE3-NEXT:    packssdw %xmm1, %xmm0
> +; SSE2-SSSE3-NEXT:    packsswb %xmm2, %xmm0
> +; SSE2-SSSE3-NEXT:    pmovmskb %xmm0, %eax
> +; SSE2-SSSE3-NEXT:    movd %eax, %xmm0
> +; SSE2-SSSE3-NEXT:    movdqa %xmm0, -{{[0-9]+}}(%rsp)
> +; SSE2-SSSE3-NEXT:    movb -{{[0-9]+}}(%rsp), %al
> +; SSE2-SSSE3-NEXT:    addb -{{[0-9]+}}(%rsp), %al
> +; SSE2-SSSE3-NEXT:    retq
>  ;
>  ; AVX1-LABEL: bitcast_v16i32_to_v2i8:
>  ; AVX1:       # %bb.0:
> @@ -646,7 +599,7 @@ define i8 @bitcast_v16i32_to_v2i8(<16 x
>  ; AVX512-NEXT:    vpxor %xmm1, %xmm1, %xmm1
>  ; AVX512-NEXT:    vpcmpgtd %zmm0, %zmm1, %k0
>  ; AVX512-NEXT:    kmovw %k0, -{{[0-9]+}}(%rsp)
> -; AVX512-NEXT:    vmovd {{.*#+}} xmm0 = mem[0],zero,zero,zero
> +; AVX512-NEXT:    vmovdqa -{{[0-9]+}}(%rsp), %xmm0
>  ; AVX512-NEXT:    vpextrb $0, %xmm0, %ecx
>  ; AVX512-NEXT:    vpextrb $1, %xmm0, %eax
>  ; AVX512-NEXT:    addb %cl, %al
> @@ -665,13 +618,11 @@ define i16 @bitcast_v32i16_to_v2i16(<32
>  ; SSE2-SSSE3-LABEL: bitcast_v32i16_to_v2i16:
>  ; SSE2-SSSE3:       # %bb.0:
>  ; SSE2-SSSE3-NEXT:    packsswb %xmm1, %xmm0
> -; SSE2-SSSE3-NEXT:    pmovmskb %xmm0, %eax
> +; SSE2-SSSE3-NEXT:    pmovmskb %xmm0, %ecx
>  ; SSE2-SSSE3-NEXT:    packsswb %xmm3, %xmm2
> -; SSE2-SSSE3-NEXT:    pmovmskb %xmm2, %ecx
> -; SSE2-SSSE3-NEXT:    shll $16, %ecx
> -; SSE2-SSSE3-NEXT:    orl %eax, %ecx
> -; SSE2-SSSE3-NEXT:    movd %ecx, %xmm0
> -; SSE2-SSSE3-NEXT:    pextrw $0, %xmm0, %ecx
> +; SSE2-SSSE3-NEXT:    pmovmskb %xmm2, %eax
> +; SSE2-SSSE3-NEXT:    shll $16, %eax
> +; SSE2-SSSE3-NEXT:    movd %eax, %xmm0
>  ; SSE2-SSSE3-NEXT:    pextrw $1, %xmm0, %eax
>  ; SSE2-SSSE3-NEXT:    addl %ecx, %eax
>  ; SSE2-SSSE3-NEXT:    # kill: def $ax killed $ax killed $eax
> @@ -688,7 +639,6 @@ define i16 @bitcast_v32i16_to_v2i16(<32
>  ; AVX1-NEXT:    shll $16, %ecx
>  ; AVX1-NEXT:    orl %eax, %ecx
>  ; AVX1-NEXT:    vmovd %ecx, %xmm0
> -; AVX1-NEXT:    vpextrw $0, %xmm0, %ecx
>  ; AVX1-NEXT:    vpextrw $1, %xmm0, %eax
>  ; AVX1-NEXT:    addl %ecx, %eax
>  ; AVX1-NEXT:    # kill: def $ax killed $ax killed $eax
> @@ -699,9 +649,8 @@ define i16 @bitcast_v32i16_to_v2i16(<32
>  ; AVX2:       # %bb.0:
>  ; AVX2-NEXT:    vpacksswb %ymm1, %ymm0, %ymm0
>  ; AVX2-NEXT:    vpermq {{.*#+}} ymm0 = ymm0[0,2,1,3]
> -; AVX2-NEXT:    vpmovmskb %ymm0, %eax
> -; AVX2-NEXT:    vmovd %eax, %xmm0
> -; AVX2-NEXT:    vpextrw $0, %xmm0, %ecx
> +; AVX2-NEXT:    vpmovmskb %ymm0, %ecx
> +; AVX2-NEXT:    vmovd %ecx, %xmm0
>  ; AVX2-NEXT:    vpextrw $1, %xmm0, %eax
>  ; AVX2-NEXT:    addl %ecx, %eax
>  ; AVX2-NEXT:    # kill: def $ax killed $ax killed $eax
> @@ -716,8 +665,8 @@ define i16 @bitcast_v32i16_to_v2i16(<32
>  ; AVX512-NEXT:    subq $32, %rsp
>  ; AVX512-NEXT:    vpmovw2m %zmm0, %k0
>  ; AVX512-NEXT:    kmovd %k0, (%rsp)
> -; AVX512-NEXT:    vmovd {{.*#+}} xmm0 = mem[0],zero,zero,zero
> -; AVX512-NEXT:    vpextrw $0, %xmm0, %ecx
> +; AVX512-NEXT:    vmovdqa (%rsp), %xmm0
> +; AVX512-NEXT:    vmovd %xmm0, %ecx
>  ; AVX512-NEXT:    vpextrw $1, %xmm0, %eax
>  ; AVX512-NEXT:    addl %ecx, %eax
>  ; AVX512-NEXT:    # kill: def $ax killed $ax killed $eax
> @@ -984,9 +933,9 @@ define i32 @bitcast_v64i8_to_v2i32(<64 x
>  ; SSE2-SSSE3-NEXT:    orl %ecx, %edx
>  ; SSE2-SSSE3-NEXT:    orl %eax, %edx
>  ; SSE2-SSSE3-NEXT:    movw %dx, -{{[0-9]+}}(%rsp)
> -; SSE2-SSSE3-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> +; SSE2-SSSE3-NEXT:    movdqa -{{[0-9]+}}(%rsp), %xmm0
>  ; SSE2-SSSE3-NEXT:    movd %xmm0, %ecx
> -; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,3,0,1]
> +; SSE2-SSSE3-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,1,2,3]
>  ; SSE2-SSSE3-NEXT:    movd %xmm0, %eax
>  ; SSE2-SSSE3-NEXT:    addl %ecx, %eax
>  ; SSE2-SSSE3-NEXT:    retq
> @@ -1246,7 +1195,7 @@ define i32 @bitcast_v64i8_to_v2i32(<64 x
>  ; AVX1-NEXT:    orl %ecx, %edx
>  ; AVX1-NEXT:    orl %eax, %edx
>  ; AVX1-NEXT:    movl %edx, -{{[0-9]+}}(%rsp)
> -; AVX1-NEXT:    vmovq {{.*#+}} xmm0 = mem[0],zero
> +; AVX1-NEXT:    vmovdqa -{{[0-9]+}}(%rsp), %xmm0
>  ; AVX1-NEXT:    vmovd %xmm0, %ecx
>  ; AVX1-NEXT:    vpextrd $1, %xmm0, %eax
>  ; AVX1-NEXT:    addl %ecx, %eax
> @@ -1506,7 +1455,7 @@ define i32 @bitcast_v64i8_to_v2i32(<64 x
>  ; AVX2-NEXT:    orl %ecx, %edx
>  ; AVX2-NEXT:    orl %eax, %edx
>  ; AVX2-NEXT:    movl %edx, -{{[0-9]+}}(%rsp)
> -; AVX2-NEXT:    vmovq {{.*#+}} xmm0 = mem[0],zero
> +; AVX2-NEXT:    vmovdqa -{{[0-9]+}}(%rsp), %xmm0
>  ; AVX2-NEXT:    vmovd %xmm0, %ecx
>  ; AVX2-NEXT:    vpextrd $1, %xmm0, %eax
>  ; AVX2-NEXT:    addl %ecx, %eax
> @@ -1517,7 +1466,7 @@ define i32 @bitcast_v64i8_to_v2i32(<64 x
>  ; AVX512:       # %bb.0:
>  ; AVX512-NEXT:    vpmovb2m %zmm0, %k0
>  ; AVX512-NEXT:    kmovq %k0, -{{[0-9]+}}(%rsp)
> -; AVX512-NEXT:    vmovq {{.*#+}} xmm0 = mem[0],zero
> +; AVX512-NEXT:    vmovdqa -{{[0-9]+}}(%rsp), %xmm0
>  ; AVX512-NEXT:    vmovd %xmm0, %ecx
>  ; AVX512-NEXT:    vpextrd $1, %xmm0, %eax
>  ; AVX512-NEXT:    addl %ecx, %eax
>
> Modified: llvm/trunk/test/CodeGen/X86/bitreverse.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/bitreverse.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/bitreverse.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/bitreverse.ll Wed Aug  7 09:24:26 2019
> @@ -55,13 +55,11 @@ define <2 x i16> @test_bitreverse_v2i16(
>  ; X64-NEXT:    pxor %xmm1, %xmm1
>  ; X64-NEXT:    movdqa %xmm0, %xmm2
>  ; X64-NEXT:    punpckhbw {{.*#+}} xmm2 = xmm2[8],xmm1[8],xmm2[9],xmm1[9],xmm2[10],xmm1[10],xmm2[11],xmm1[11],xmm2[12],xmm1[12],xmm2[13],xmm1[13],xmm2[14],xmm1[14],xmm2[15],xmm1[15]
> -; X64-NEXT:    pshufd {{.*#+}} xmm2 = xmm2[2,3,0,1]
> -; X64-NEXT:    pshuflw {{.*#+}} xmm2 = xmm2[3,2,1,0,4,5,6,7]
> -; X64-NEXT:    pshufhw {{.*#+}} xmm2 = xmm2[0,1,2,3,7,6,5,4]
> +; X64-NEXT:    pshuflw {{.*#+}} xmm2 = xmm2[1,0,3,2,4,5,6,7]
> +; X64-NEXT:    pshufhw {{.*#+}} xmm2 = xmm2[0,1,2,3,5,4,7,6]
>  ; X64-NEXT:    punpcklbw {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
> -; X64-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[2,3,0,1]
> -; X64-NEXT:    pshuflw {{.*#+}} xmm0 = xmm0[3,2,1,0,4,5,6,7]
> -; X64-NEXT:    pshufhw {{.*#+}} xmm0 = xmm0[0,1,2,3,7,6,5,4]
> +; X64-NEXT:    pshuflw {{.*#+}} xmm0 = xmm0[1,0,3,2,4,5,6,7]
> +; X64-NEXT:    pshufhw {{.*#+}} xmm0 = xmm0[0,1,2,3,5,4,7,6]
>  ; X64-NEXT:    packuswb %xmm2, %xmm0
>  ; X64-NEXT:    movdqa %xmm0, %xmm1
>  ; X64-NEXT:    psllw $4, %xmm1
> @@ -81,7 +79,6 @@ define <2 x i16> @test_bitreverse_v2i16(
>  ; X64-NEXT:    pand {{.*}}(%rip), %xmm0
>  ; X64-NEXT:    psrlw $1, %xmm0
>  ; X64-NEXT:    por %xmm1, %xmm0
> -; X64-NEXT:    psrlq $48, %xmm0
>  ; X64-NEXT:    retq
>    %b = call <2 x i16> @llvm.bitreverse.v2i16(<2 x i16> %a)
>    ret <2 x i16> %b
> @@ -410,7 +407,7 @@ define <2 x i16> @fold_v2i16() {
>  ;
>  ; X64-LABEL: fold_v2i16:
>  ; X64:       # %bb.0:
> -; X64-NEXT:    movaps {{.*#+}} xmm0 = [61440,240]
> +; X64-NEXT:    movaps {{.*#+}} xmm0 = <61440,240,u,u,u,u,u,u>
>  ; X64-NEXT:    retq
>    %b = call <2 x i16> @llvm.bitreverse.v2i16(<2 x i16> <i16 15, i16 3840>)
>    ret <2 x i16> %b
>
> Modified: llvm/trunk/test/CodeGen/X86/bswap-vector.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/bswap-vector.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/bswap-vector.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/bswap-vector.ll Wed Aug  7 09:24:26 2019
> @@ -291,23 +291,22 @@ define <4 x i16> @test7(<4 x i16> %v) {
>  ; CHECK-NOSSSE3-NEXT:    pxor %xmm1, %xmm1
>  ; CHECK-NOSSSE3-NEXT:    movdqa %xmm0, %xmm2
>  ; CHECK-NOSSSE3-NEXT:    punpckhbw {{.*#+}} xmm2 = xmm2[8],xmm1[8],xmm2[9],xmm1[9],xmm2[10],xmm1[10],xmm2[11],xmm1[11],xmm2[12],xmm1[12],xmm2[13],xmm1[13],xmm2[14],xmm1[14],xmm2[15],xmm1[15]
> -; CHECK-NOSSSE3-NEXT:    pshuflw {{.*#+}} xmm2 = xmm2[3,2,1,0,4,5,6,7]
> -; CHECK-NOSSSE3-NEXT:    pshufhw {{.*#+}} xmm2 = xmm2[0,1,2,3,7,6,5,4]
> +; CHECK-NOSSSE3-NEXT:    pshuflw {{.*#+}} xmm2 = xmm2[1,0,3,2,4,5,6,7]
> +; CHECK-NOSSSE3-NEXT:    pshufhw {{.*#+}} xmm2 = xmm2[0,1,2,3,5,4,7,6]
>  ; CHECK-NOSSSE3-NEXT:    punpcklbw {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
> -; CHECK-NOSSSE3-NEXT:    pshuflw {{.*#+}} xmm0 = xmm0[3,2,1,0,4,5,6,7]
> -; CHECK-NOSSSE3-NEXT:    pshufhw {{.*#+}} xmm0 = xmm0[0,1,2,3,7,6,5,4]
> +; CHECK-NOSSSE3-NEXT:    pshuflw {{.*#+}} xmm0 = xmm0[1,0,3,2,4,5,6,7]
> +; CHECK-NOSSSE3-NEXT:    pshufhw {{.*#+}} xmm0 = xmm0[0,1,2,3,5,4,7,6]
>  ; CHECK-NOSSSE3-NEXT:    packuswb %xmm2, %xmm0
> -; CHECK-NOSSSE3-NEXT:    psrld $16, %xmm0
>  ; CHECK-NOSSSE3-NEXT:    retq
>  ;
>  ; CHECK-SSSE3-LABEL: test7:
>  ; CHECK-SSSE3:       # %bb.0: # %entry
> -; CHECK-SSSE3-NEXT:    pshufb {{.*#+}} xmm0 = xmm0[1,0],zero,zero,xmm0[5,4],zero,zero,xmm0[9,8],zero,zero,xmm0[13,12],zero,zero
> +; CHECK-SSSE3-NEXT:    pshufb {{.*#+}} xmm0 = xmm0[1,0,3,2,5,4,7,6,9,8,11,10,13,12,15,14]
>  ; CHECK-SSSE3-NEXT:    retq
>  ;
>  ; CHECK-AVX-LABEL: test7:
>  ; CHECK-AVX:       # %bb.0: # %entry
> -; CHECK-AVX-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[1,0],zero,zero,xmm0[5,4],zero,zero,xmm0[9,8],zero,zero,xmm0[13,12],zero,zero
> +; CHECK-AVX-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[1,0,3,2,5,4,7,6,9,8,11,10,13,12,15,14]
>  ; CHECK-AVX-NEXT:    retq
>  ;
>  ; CHECK-WIDE-AVX-LABEL: test7:
>
> Modified: llvm/trunk/test/CodeGen/X86/buildvec-insertvec.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/buildvec-insertvec.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/buildvec-insertvec.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/buildvec-insertvec.ll Wed Aug  7 09:24:26 2019
> @@ -6,22 +6,29 @@ define void @foo(<3 x float> %in, <4 x i
>  ; SSE2-LABEL: foo:
>  ; SSE2:       # %bb.0:
>  ; SSE2-NEXT:    cvttps2dq %xmm0, %xmm0
> -; SSE2-NEXT:    movl $255, %eax
> -; SSE2-NEXT:    movd %eax, %xmm1
> -; SSE2-NEXT:    shufps {{.*#+}} xmm1 = xmm1[0,0],xmm0[2,0]
> -; SSE2-NEXT:    shufps {{.*#+}} xmm0 = xmm0[0,1],xmm1[2,0]
> -; SSE2-NEXT:    andps {{.*}}(%rip), %xmm0
> -; SSE2-NEXT:    packuswb %xmm0, %xmm0
> -; SSE2-NEXT:    packuswb %xmm0, %xmm0
> +; SSE2-NEXT:    movaps %xmm0, -{{[0-9]+}}(%rsp)
> +; SSE2-NEXT:    movzbl -{{[0-9]+}}(%rsp), %eax
> +; SSE2-NEXT:    movl -{{[0-9]+}}(%rsp), %ecx
> +; SSE2-NEXT:    shll $8, %ecx
> +; SSE2-NEXT:    orl %eax, %ecx
> +; SSE2-NEXT:    movd %ecx, %xmm0
> +; SSE2-NEXT:    movl $65280, %eax # imm = 0xFF00
> +; SSE2-NEXT:    orl -{{[0-9]+}}(%rsp), %eax
> +; SSE2-NEXT:    pinsrw $1, %eax, %xmm0
>  ; SSE2-NEXT:    movd %xmm0, (%rdi)
>  ; SSE2-NEXT:    retq
>  ;
>  ; SSE41-LABEL: foo:
>  ; SSE41:       # %bb.0:
>  ; SSE41-NEXT:    cvttps2dq %xmm0, %xmm0
> +; SSE41-NEXT:    pextrb $8, %xmm0, %eax
> +; SSE41-NEXT:    pextrb $4, %xmm0, %ecx
> +; SSE41-NEXT:    pextrb $0, %xmm0, %edx
> +; SSE41-NEXT:    movd %edx, %xmm0
> +; SSE41-NEXT:    pinsrb $1, %ecx, %xmm0
> +; SSE41-NEXT:    pinsrb $2, %eax, %xmm0
>  ; SSE41-NEXT:    movl $255, %eax
> -; SSE41-NEXT:    pinsrd $3, %eax, %xmm0
> -; SSE41-NEXT:    pshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
> +; SSE41-NEXT:    pinsrb $3, %eax, %xmm0
>  ; SSE41-NEXT:    movd %xmm0, (%rdi)
>  ; SSE41-NEXT:    retq
>    %t0 = fptoui <3 x float> %in to <3 x i8>
>
> Modified: llvm/trunk/test/CodeGen/X86/combine-64bit-vec-binop.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/combine-64bit-vec-binop.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/combine-64bit-vec-binop.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/combine-64bit-vec-binop.ll Wed Aug  7 09:24:26 2019
> @@ -101,9 +101,9 @@ define double @test2_mul(double %A, doub
>  define double @test3_mul(double %A, double %B) {
>  ; SSE41-LABEL: test3_mul:
>  ; SSE41:       # %bb.0:
> -; SSE41-NEXT:    pmovzxbw {{.*#+}} xmm2 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
> -; SSE41-NEXT:    pmovzxbw {{.*#+}} xmm0 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero
> -; SSE41-NEXT:    pmullw %xmm2, %xmm0
> +; SSE41-NEXT:    pmovzxbw {{.*#+}} xmm1 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero
> +; SSE41-NEXT:    pmovzxbw {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
> +; SSE41-NEXT:    pmullw %xmm1, %xmm0
>  ; SSE41-NEXT:    pshufb {{.*#+}} xmm0 = xmm0[0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u]
>  ; SSE41-NEXT:    retq
>    %1 = bitcast double %A to <8 x i8>
>
> Modified: llvm/trunk/test/CodeGen/X86/combine-or.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/combine-or.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/combine-or.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/combine-or.ll Wed Aug  7 09:24:26 2019
> @@ -362,7 +362,7 @@ define <4 x float> @test25(<4 x float> %
>  define <4 x i8> @test_crash(<4 x i8> %a, <4 x i8> %b) {
>  ; CHECK-LABEL: test_crash:
>  ; CHECK:       # %bb.0:
> -; CHECK-NEXT:    blendps {{.*#+}} xmm0 = xmm1[0,1],xmm0[2,3]
> +; CHECK-NEXT:    pblendw {{.*#+}} xmm0 = xmm1[0],xmm0[1],xmm1[2,3,4,5,6,7]
>  ; CHECK-NEXT:    retq
>    %shuf1 = shufflevector <4 x i8> %a, <4 x i8> zeroinitializer, <4 x i32><i32 4, i32 4, i32 2, i32 3>
>    %shuf2 = shufflevector <4 x i8> %b, <4 x i8> zeroinitializer, <4 x i32><i32 0, i32 1, i32 4, i32 4>
>
> Modified: llvm/trunk/test/CodeGen/X86/complex-fastmath.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/complex-fastmath.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/complex-fastmath.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/complex-fastmath.ll Wed Aug  7 09:24:26 2019
> @@ -39,7 +39,7 @@ define <2 x float> @complex_square_f32(<
>  ; FMA-NEXT:    vaddss %xmm0, %xmm0, %xmm2
>  ; FMA-NEXT:    vmulss %xmm2, %xmm1, %xmm2
>  ; FMA-NEXT:    vmulss %xmm1, %xmm1, %xmm1
> -; FMA-NEXT:    vfmsub231ss %xmm0, %xmm0, %xmm1
> +; FMA-NEXT:    vfmsub231ss {{.*#+}} xmm1 = (xmm0 * xmm0) - xmm1
>  ; FMA-NEXT:    vinsertps {{.*#+}} xmm0 = xmm1[0],xmm2[0],xmm1[2,3]
>  ; FMA-NEXT:    retq
>    %2 = extractelement <2 x float> %0, i32 0
> @@ -85,7 +85,7 @@ define <2 x double> @complex_square_f64(
>  ; FMA-NEXT:    vaddsd %xmm0, %xmm0, %xmm2
>  ; FMA-NEXT:    vmulsd %xmm2, %xmm1, %xmm2
>  ; FMA-NEXT:    vmulsd %xmm1, %xmm1, %xmm1
> -; FMA-NEXT:    vfmsub231sd %xmm0, %xmm0, %xmm1
> +; FMA-NEXT:    vfmsub231sd {{.*#+}} xmm1 = (xmm0 * xmm0) - xmm1
>  ; FMA-NEXT:    vunpcklpd {{.*#+}} xmm0 = xmm1[0],xmm2[0]
>  ; FMA-NEXT:    retq
>    %2 = extractelement <2 x double> %0, i32 0
> @@ -137,9 +137,9 @@ define <2 x float> @complex_mul_f32(<2 x
>  ; FMA-NEXT:    vmovshdup {{.*#+}} xmm2 = xmm0[1,1,3,3]
>  ; FMA-NEXT:    vmovshdup {{.*#+}} xmm3 = xmm1[1,1,3,3]
>  ; FMA-NEXT:    vmulss %xmm2, %xmm1, %xmm4
> -; FMA-NEXT:    vfmadd231ss %xmm0, %xmm3, %xmm4
> +; FMA-NEXT:    vfmadd231ss {{.*#+}} xmm4 = (xmm3 * xmm0) + xmm4
>  ; FMA-NEXT:    vmulss %xmm2, %xmm3, %xmm2
> -; FMA-NEXT:    vfmsub231ss %xmm0, %xmm1, %xmm2
> +; FMA-NEXT:    vfmsub231ss {{.*#+}} xmm2 = (xmm1 * xmm0) - xmm2
>  ; FMA-NEXT:    vinsertps {{.*#+}} xmm0 = xmm2[0],xmm4[0],xmm2[2,3]
>  ; FMA-NEXT:    retq
>    %3 = extractelement <2 x float> %0, i32 0
> @@ -192,9 +192,9 @@ define <2 x double> @complex_mul_f64(<2
>  ; FMA-NEXT:    vpermilpd {{.*#+}} xmm2 = xmm0[1,0]
>  ; FMA-NEXT:    vpermilpd {{.*#+}} xmm3 = xmm1[1,0]
>  ; FMA-NEXT:    vmulsd %xmm2, %xmm1, %xmm4
> -; FMA-NEXT:    vfmadd231sd %xmm0, %xmm3, %xmm4
> +; FMA-NEXT:    vfmadd231sd {{.*#+}} xmm4 = (xmm3 * xmm0) + xmm4
>  ; FMA-NEXT:    vmulsd %xmm2, %xmm3, %xmm2
> -; FMA-NEXT:    vfmsub231sd %xmm0, %xmm1, %xmm2
> +; FMA-NEXT:    vfmsub231sd {{.*#+}} xmm2 = (xmm1 * xmm0) - xmm2
>  ; FMA-NEXT:    vunpcklpd {{.*#+}} xmm0 = xmm2[0],xmm4[0]
>  ; FMA-NEXT:    retq
>    %3 = extractelement <2 x double> %0, i32 0
>
> Modified: llvm/trunk/test/CodeGen/X86/cvtv2f32.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/cvtv2f32.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/cvtv2f32.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/cvtv2f32.ll Wed Aug  7 09:24:26 2019
> @@ -42,11 +42,9 @@ define <2 x float> @uitofp_2i32_cvt_buil
>  define <2 x float> @uitofp_2i32_buildvector_cvt(i32 %x, i32 %y, <2 x float> %v) {
>  ; X32-LABEL: uitofp_2i32_buildvector_cvt:
>  ; X32:       # %bb.0:
> -; X32-NEXT:    movss {{.*#+}} xmm1 = mem[0],zero,zero,zero
> -; X32-NEXT:    movss {{.*#+}} xmm2 = mem[0],zero,zero,zero
> -; X32-NEXT:    unpcklpd {{.*#+}} xmm2 = xmm2[0],xmm1[0]
> -; X32-NEXT:    movapd {{.*#+}} xmm1 = [4.503599627370496E+15,4.503599627370496E+15]
> -; X32-NEXT:    orpd %xmm1, %xmm2
> +; X32-NEXT:    movdqa {{.*#+}} xmm1 = [4.503599627370496E+15,4.503599627370496E+15]
> +; X32-NEXT:    pmovzxdq {{.*#+}} xmm2 = mem[0],zero,mem[1],zero
> +; X32-NEXT:    por %xmm1, %xmm2
>  ; X32-NEXT:    subpd %xmm1, %xmm2
>  ; X32-NEXT:    cvtpd2ps %xmm2, %xmm1
>  ; X32-NEXT:    mulps %xmm1, %xmm0
> @@ -54,13 +52,13 @@ define <2 x float> @uitofp_2i32_buildvec
>  ;
>  ; X64-LABEL: uitofp_2i32_buildvector_cvt:
>  ; X64:       # %bb.0:
> -; X64-NEXT:    movd %esi, %xmm1
> -; X64-NEXT:    movd %edi, %xmm2
> -; X64-NEXT:    punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm1[0]
> -; X64-NEXT:    movdqa {{.*#+}} xmm1 = [4.503599627370496E+15,4.503599627370496E+15]
> -; X64-NEXT:    por %xmm1, %xmm2
> -; X64-NEXT:    subpd %xmm1, %xmm2
> -; X64-NEXT:    cvtpd2ps %xmm2, %xmm1
> +; X64-NEXT:    movd %edi, %xmm1
> +; X64-NEXT:    pinsrd $1, %esi, %xmm1
> +; X64-NEXT:    pmovzxdq {{.*#+}} xmm1 = xmm1[0],zero,xmm1[1],zero
> +; X64-NEXT:    movdqa {{.*#+}} xmm2 = [4.503599627370496E+15,4.503599627370496E+15]
> +; X64-NEXT:    por %xmm2, %xmm1
> +; X64-NEXT:    subpd %xmm2, %xmm1
> +; X64-NEXT:    cvtpd2ps %xmm1, %xmm1
>  ; X64-NEXT:    mulps %xmm1, %xmm0
>  ; X64-NEXT:    retq
>    %t1 = insertelement <2 x i32> undef, i32 %x, i32 0
> @@ -73,23 +71,21 @@ define <2 x float> @uitofp_2i32_buildvec
>  define <2 x float> @uitofp_2i32_legalized(<2 x i32> %in, <2 x float> %v) {
>  ; X32-LABEL: uitofp_2i32_legalized:
>  ; X32:       # %bb.0:
> -; X32-NEXT:    xorps %xmm2, %xmm2
> -; X32-NEXT:    blendps {{.*#+}} xmm2 = xmm0[0],xmm2[1],xmm0[2],xmm2[3]
> -; X32-NEXT:    movaps {{.*#+}} xmm0 = [4.503599627370496E+15,4.503599627370496E+15]
> -; X32-NEXT:    orps %xmm0, %xmm2
> -; X32-NEXT:    subpd %xmm0, %xmm2
> -; X32-NEXT:    cvtpd2ps %xmm2, %xmm0
> +; X32-NEXT:    pmovzxdq {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero
> +; X32-NEXT:    movdqa {{.*#+}} xmm2 = [4.503599627370496E+15,4.503599627370496E+15]
> +; X32-NEXT:    por %xmm2, %xmm0
> +; X32-NEXT:    subpd %xmm2, %xmm0
> +; X32-NEXT:    cvtpd2ps %xmm0, %xmm0
>  ; X32-NEXT:    mulps %xmm1, %xmm0
>  ; X32-NEXT:    retl
>  ;
>  ; X64-LABEL: uitofp_2i32_legalized:
>  ; X64:       # %bb.0:
> -; X64-NEXT:    xorps %xmm2, %xmm2
> -; X64-NEXT:    blendps {{.*#+}} xmm2 = xmm0[0],xmm2[1],xmm0[2],xmm2[3]
> -; X64-NEXT:    movaps {{.*#+}} xmm0 = [4.503599627370496E+15,4.503599627370496E+15]
> -; X64-NEXT:    orps %xmm0, %xmm2
> -; X64-NEXT:    subpd %xmm0, %xmm2
> -; X64-NEXT:    cvtpd2ps %xmm2, %xmm0
> +; X64-NEXT:    pmovzxdq {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero
> +; X64-NEXT:    movdqa {{.*#+}} xmm2 = [4.503599627370496E+15,4.503599627370496E+15]
> +; X64-NEXT:    por %xmm2, %xmm0
> +; X64-NEXT:    subpd %xmm2, %xmm0
> +; X64-NEXT:    cvtpd2ps %xmm0, %xmm0
>  ; X64-NEXT:    mulps %xmm1, %xmm0
>  ; X64-NEXT:    retq
>    %t1 = uitofp <2 x i32> %in to <2 x float>
>
> Modified: llvm/trunk/test/CodeGen/X86/extract-concat.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/extract-concat.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/extract-concat.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/extract-concat.ll Wed Aug  7 09:24:26 2019
> @@ -5,9 +5,14 @@ define void @foo(<4 x float> %in, <4 x i
>  ; CHECK-LABEL: foo:
>  ; CHECK:       # %bb.0:
>  ; CHECK-NEXT:    cvttps2dq %xmm0, %xmm0
> +; CHECK-NEXT:    pextrb $8, %xmm0, %eax
> +; CHECK-NEXT:    pextrb $4, %xmm0, %ecx
> +; CHECK-NEXT:    pextrb $0, %xmm0, %edx
> +; CHECK-NEXT:    movd %edx, %xmm0
> +; CHECK-NEXT:    pinsrb $1, %ecx, %xmm0
> +; CHECK-NEXT:    pinsrb $2, %eax, %xmm0
>  ; CHECK-NEXT:    movl $255, %eax
> -; CHECK-NEXT:    pinsrd $3, %eax, %xmm0
> -; CHECK-NEXT:    pshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
> +; CHECK-NEXT:    pinsrb $3, %eax, %xmm0
>  ; CHECK-NEXT:    movd %xmm0, (%rdi)
>  ; CHECK-NEXT:    retq
>    %t0 = fptosi <4 x float> %in to <4 x i32>
>
> Modified: llvm/trunk/test/CodeGen/X86/extract-insert.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/extract-insert.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/extract-insert.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/extract-insert.ll Wed Aug  7 09:24:26 2019
> @@ -31,12 +31,10 @@ define i8 @extractelt_bitcast(i32 %x) no
>  define i8 @extractelt_bitcast_extra_use(i32 %x, <4 x i8>* %p) nounwind {
>  ; X86-LABEL: extractelt_bitcast_extra_use:
>  ; X86:       # %bb.0:
> -; X86-NEXT:    pushl %eax
>  ; X86-NEXT:    movl {{[0-9]+}}(%esp), %eax
>  ; X86-NEXT:    movl {{[0-9]+}}(%esp), %ecx
>  ; X86-NEXT:    movl %eax, (%ecx)
>  ; X86-NEXT:    # kill: def $al killed $al killed $eax
> -; X86-NEXT:    popl %ecx
>  ; X86-NEXT:    retl
>  ;
>  ; X64-LABEL: extractelt_bitcast_extra_use:
>
> Modified: llvm/trunk/test/CodeGen/X86/f16c-intrinsics.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/f16c-intrinsics.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/f16c-intrinsics.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/f16c-intrinsics.ll Wed Aug  7 09:24:26 2019
> @@ -268,14 +268,12 @@ define void @test_x86_vcvtps2ph_128_m(<4
>  ; X32-AVX512VL-LABEL: test_x86_vcvtps2ph_128_m:
>  ; X32-AVX512VL:       # %bb.0: # %entry
>  ; X32-AVX512VL-NEXT:    movl {{[0-9]+}}(%esp), %eax # encoding: [0x8b,0x44,0x24,0x04]
> -; X32-AVX512VL-NEXT:    vcvtps2ph $3, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x1d,0xc0,0x03]
> -; X32-AVX512VL-NEXT:    vmovlps %xmm0, (%eax) # EVEX TO VEX Compression encoding: [0xc5,0xf8,0x13,0x00]
> +; X32-AVX512VL-NEXT:    vcvtps2ph $3, %xmm0, (%eax) # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x1d,0x00,0x03]
>  ; X32-AVX512VL-NEXT:    retl # encoding: [0xc3]
>  ;
>  ; X64-AVX512VL-LABEL: test_x86_vcvtps2ph_128_m:
>  ; X64-AVX512VL:       # %bb.0: # %entry
> -; X64-AVX512VL-NEXT:    vcvtps2ph $3, %xmm0, %xmm0 # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x1d,0xc0,0x03]
> -; X64-AVX512VL-NEXT:    vmovlps %xmm0, (%rdi) # EVEX TO VEX Compression encoding: [0xc5,0xf8,0x13,0x07]
> +; X64-AVX512VL-NEXT:    vcvtps2ph $3, %xmm0, (%rdi) # EVEX TO VEX Compression encoding: [0xc4,0xe3,0x79,0x1d,0x07,0x03]
>  ; X64-AVX512VL-NEXT:    retq # encoding: [0xc3]
>  entry:
>    %0 = tail call <8 x i16> @llvm.x86.vcvtps2ph.128(<4 x float> %a, i32 3)
>
> Modified: llvm/trunk/test/CodeGen/X86/fold-vector-sext-zext.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/fold-vector-sext-zext.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/fold-vector-sext-zext.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/fold-vector-sext-zext.ll Wed Aug  7 09:24:26 2019
> @@ -11,12 +11,12 @@
>  define <4 x i16> @test_sext_4i8_4i16() {
>  ; X32-LABEL: test_sext_4i8_4i16:
>  ; X32:       # %bb.0:
> -; X32-NEXT:    vmovaps {{.*#+}} xmm0 = [0,4294967295,2,4294967293]
> +; X32-NEXT:    vmovaps {{.*#+}} xmm0 = <0,65535,2,65533,u,u,u,u>
>  ; X32-NEXT:    retl
>  ;
>  ; X64-LABEL: test_sext_4i8_4i16:
>  ; X64:       # %bb.0:
> -; X64-NEXT:    vmovaps {{.*#+}} xmm0 = [0,4294967295,2,4294967293]
> +; X64-NEXT:    vmovaps {{.*#+}} xmm0 = <0,65535,2,65533,u,u,u,u>
>  ; X64-NEXT:    retq
>    %1 = insertelement <4 x i8> undef, i8 0, i32 0
>    %2 = insertelement <4 x i8> %1, i8 -1, i32 1
> @@ -29,12 +29,12 @@ define <4 x i16> @test_sext_4i8_4i16() {
>  define <4 x i16> @test_sext_4i8_4i16_undef() {
>  ; X32-LABEL: test_sext_4i8_4i16_undef:
>  ; X32:       # %bb.0:
> -; X32-NEXT:    vmovaps {{.*#+}} xmm0 = <u,4294967295,u,4294967293>
> +; X32-NEXT:    vmovaps {{.*#+}} xmm0 = <u,65535,u,65533,u,u,u,u>
>  ; X32-NEXT:    retl
>  ;
>  ; X64-LABEL: test_sext_4i8_4i16_undef:
>  ; X64:       # %bb.0:
> -; X64-NEXT:    vmovaps {{.*#+}} xmm0 = <u,4294967295,u,4294967293>
> +; X64-NEXT:    vmovaps {{.*#+}} xmm0 = <u,65535,u,65533,u,u,u,u>
>  ; X64-NEXT:    retq
>    %1 = insertelement <4 x i8> undef, i8 undef, i32 0
>    %2 = insertelement <4 x i8> %1, i8 -1, i32 1
> @@ -207,12 +207,12 @@ define <8 x i32> @test_sext_8i8_8i32_und
>  define <4 x i16> @test_zext_4i8_4i16() {
>  ; X32-LABEL: test_zext_4i8_4i16:
>  ; X32:       # %bb.0:
> -; X32-NEXT:    vmovaps {{.*#+}} xmm0 = [0,255,2,253]
> +; X32-NEXT:    vmovaps {{.*#+}} xmm0 = <0,255,2,253,u,u,u,u>
>  ; X32-NEXT:    retl
>  ;
>  ; X64-LABEL: test_zext_4i8_4i16:
>  ; X64:       # %bb.0:
> -; X64-NEXT:    vmovaps {{.*#+}} xmm0 = [0,255,2,253]
> +; X64-NEXT:    vmovaps {{.*#+}} xmm0 = <0,255,2,253,u,u,u,u>
>  ; X64-NEXT:    retq
>    %1 = insertelement <4 x i8> undef, i8 0, i32 0
>    %2 = insertelement <4 x i8> %1, i8 -1, i32 1
> @@ -261,12 +261,12 @@ define <4 x i64> @test_zext_4i8_4i64() {
>  define <4 x i16> @test_zext_4i8_4i16_undef() {
>  ; X32-LABEL: test_zext_4i8_4i16_undef:
>  ; X32:       # %bb.0:
> -; X32-NEXT:    vmovaps {{.*#+}} xmm0 = [0,255,0,253]
> +; X32-NEXT:    vmovaps {{.*#+}} xmm0 = <0,255,0,253,u,u,u,u>
>  ; X32-NEXT:    retl
>  ;
>  ; X64-LABEL: test_zext_4i8_4i16_undef:
>  ; X64:       # %bb.0:
> -; X64-NEXT:    vmovaps {{.*#+}} xmm0 = [0,255,0,253]
> +; X64-NEXT:    vmovaps {{.*#+}} xmm0 = <0,255,0,253,u,u,u,u>
>  ; X64-NEXT:    retq
>    %1 = insertelement <4 x i8> undef, i8 undef, i32 0
>    %2 = insertelement <4 x i8> %1, i8 -1, i32 1
>
> Modified: llvm/trunk/test/CodeGen/X86/insertelement-shuffle.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/insertelement-shuffle.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/insertelement-shuffle.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/insertelement-shuffle.ll Wed Aug  7 09:24:26 2019
> @@ -30,18 +30,10 @@ define <8 x float> @insert_subvector_256
>  define <8 x i64> @insert_subvector_512(i32 %x0, i32 %x1, <8 x i64> %v) nounwind {
>  ; X86_AVX256-LABEL: insert_subvector_512:
>  ; X86_AVX256:       # %bb.0:
> -; X86_AVX256-NEXT:    pushl %ebp
> -; X86_AVX256-NEXT:    movl %esp, %ebp
> -; X86_AVX256-NEXT:    andl $-8, %esp
> -; X86_AVX256-NEXT:    subl $8, %esp
> -; X86_AVX256-NEXT:    vmovsd {{.*#+}} xmm2 = mem[0],zero
> -; X86_AVX256-NEXT:    vmovlps %xmm2, (%esp)
>  ; X86_AVX256-NEXT:    vextracti128 $1, %ymm0, %xmm2
> -; X86_AVX256-NEXT:    vpinsrd $0, (%esp), %xmm2, %xmm2
> +; X86_AVX256-NEXT:    vpinsrd $0, {{[0-9]+}}(%esp), %xmm2, %xmm2
>  ; X86_AVX256-NEXT:    vpinsrd $1, {{[0-9]+}}(%esp), %xmm2, %xmm2
>  ; X86_AVX256-NEXT:    vinserti128 $1, %xmm2, %ymm0, %ymm0
> -; X86_AVX256-NEXT:    movl %ebp, %esp
> -; X86_AVX256-NEXT:    popl %ebp
>  ; X86_AVX256-NEXT:    retl
>  ;
>  ; X64_AVX256-LABEL: insert_subvector_512:
>
> Modified: llvm/trunk/test/CodeGen/X86/known-bits.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/known-bits.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/known-bits.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/known-bits.ll Wed Aug  7 09:24:26 2019
> @@ -5,100 +5,44 @@
>  define void @knownbits_zext_in_reg(i8*) nounwind {
>  ; X32-LABEL: knownbits_zext_in_reg:
>  ; X32:       # %bb.0: # %BB
> -; X32-NEXT:    pushl %ebp
>  ; X32-NEXT:    pushl %ebx
> -; X32-NEXT:    pushl %edi
> -; X32-NEXT:    pushl %esi
> -; X32-NEXT:    subl $16, %esp
>  ; X32-NEXT:    movl {{[0-9]+}}(%esp), %eax
>  ; X32-NEXT:    movzbl (%eax), %ecx
>  ; X32-NEXT:    imull $101, %ecx, %eax
>  ; X32-NEXT:    shrl $14, %eax
> -; X32-NEXT:    imull $177, %ecx, %ecx
> -; X32-NEXT:    shrl $14, %ecx
> -; X32-NEXT:    movzbl %al, %eax
> -; X32-NEXT:    vpxor %xmm0, %xmm0, %xmm0
> -; X32-NEXT:    vpinsrd $1, %eax, %xmm0, %xmm1
> -; X32-NEXT:    vbroadcastss {{.*#+}} xmm2 = [3.57331108E-43,3.57331108E-43,3.57331108E-43,3.57331108E-43]
> -; X32-NEXT:    vpand %xmm2, %xmm1, %xmm1
> -; X32-NEXT:    movzbl %cl, %eax
> -; X32-NEXT:    vpinsrd $1, %eax, %xmm0, %xmm0
> -; X32-NEXT:    vpand %xmm2, %xmm0, %xmm0
> -; X32-NEXT:    vpextrd $1, %xmm1, {{[-0-9]+}}(%e{{[sb]}}p) # 4-byte Folded Spill
> -; X32-NEXT:    vpextrd $1, %xmm0, {{[-0-9]+}}(%e{{[sb]}}p) # 4-byte Folded Spill
> -; X32-NEXT:    xorl %ecx, %ecx
> -; X32-NEXT:    vmovd %xmm1, {{[-0-9]+}}(%e{{[sb]}}p) # 4-byte Folded Spill
> -; X32-NEXT:    vmovd %xmm0, (%esp) # 4-byte Folded Spill
> -; X32-NEXT:    vpextrd $2, %xmm1, %edi
> -; X32-NEXT:    vpextrd $2, %xmm0, %esi
> -; X32-NEXT:    vpextrd $3, %xmm1, %ebx
> -; X32-NEXT:    vpextrd $3, %xmm0, %ebp
> +; X32-NEXT:    imull $177, %ecx, %edx
> +; X32-NEXT:    shrl $14, %edx
> +; X32-NEXT:    movzbl %al, %ecx
> +; X32-NEXT:    xorl %ebx, %ebx
>  ; X32-NEXT:    .p2align 4, 0x90
>  ; X32-NEXT:  .LBB0_1: # %CF
>  ; X32-NEXT:    # =>This Loop Header: Depth=1
>  ; X32-NEXT:    # Child Loop BB0_2 Depth 2
> -; X32-NEXT:    movl {{[-0-9]+}}(%e{{[sb]}}p), %eax # 4-byte Reload
> -; X32-NEXT:    xorl %edx, %edx
> -; X32-NEXT:    divl {{[-0-9]+}}(%e{{[sb]}}p) # 4-byte Folded Reload
> -; X32-NEXT:    movl {{[-0-9]+}}(%e{{[sb]}}p), %eax # 4-byte Reload
> -; X32-NEXT:    xorl %edx, %edx
> -; X32-NEXT:    divl (%esp) # 4-byte Folded Reload
> -; X32-NEXT:    movl %edi, %eax
> -; X32-NEXT:    xorl %edx, %edx
> -; X32-NEXT:    divl %esi
> -; X32-NEXT:    movl %ebx, %eax
> -; X32-NEXT:    xorl %edx, %edx
> -; X32-NEXT:    divl %ebp
> +; X32-NEXT:    movl %ecx, %eax
> +; X32-NEXT:    divb %dl
>  ; X32-NEXT:    .p2align 4, 0x90
>  ; X32-NEXT:  .LBB0_2: # %CF237
>  ; X32-NEXT:    # Parent Loop BB0_1 Depth=1
>  ; X32-NEXT:    # => This Inner Loop Header: Depth=2
> -; X32-NEXT:    testb %cl, %cl
> +; X32-NEXT:    testb %bl, %bl
>  ; X32-NEXT:    jne .LBB0_2
>  ; X32-NEXT:    jmp .LBB0_1
>  ;
>  ; X64-LABEL: knownbits_zext_in_reg:
>  ; X64:       # %bb.0: # %BB
> -; X64-NEXT:    pushq %rbp
> -; X64-NEXT:    pushq %rbx
>  ; X64-NEXT:    movzbl (%rdi), %eax
>  ; X64-NEXT:    imull $101, %eax, %ecx
>  ; X64-NEXT:    shrl $14, %ecx
> -; X64-NEXT:    imull $177, %eax, %eax
> -; X64-NEXT:    shrl $14, %eax
> +; X64-NEXT:    imull $177, %eax, %edx
> +; X64-NEXT:    shrl $14, %edx
>  ; X64-NEXT:    movzbl %cl, %ecx
> -; X64-NEXT:    vpxor %xmm0, %xmm0, %xmm0
> -; X64-NEXT:    vpinsrd $1, %ecx, %xmm0, %xmm1
> -; X64-NEXT:    vbroadcastss {{.*#+}} xmm2 = [3.57331108E-43,3.57331108E-43,3.57331108E-43,3.57331108E-43]
> -; X64-NEXT:    vpand %xmm2, %xmm1, %xmm1
> -; X64-NEXT:    movzbl %al, %eax
> -; X64-NEXT:    vpinsrd $1, %eax, %xmm0, %xmm0
> -; X64-NEXT:    vpand %xmm2, %xmm0, %xmm0
> -; X64-NEXT:    vpextrd $1, %xmm1, %r8d
> -; X64-NEXT:    vpextrd $1, %xmm0, %r9d
>  ; X64-NEXT:    xorl %esi, %esi
> -; X64-NEXT:    vmovd %xmm1, %r10d
> -; X64-NEXT:    vmovd %xmm0, %r11d
> -; X64-NEXT:    vpextrd $2, %xmm1, %edi
> -; X64-NEXT:    vpextrd $2, %xmm0, %ebx
> -; X64-NEXT:    vpextrd $3, %xmm1, %ecx
> -; X64-NEXT:    vpextrd $3, %xmm0, %ebp
>  ; X64-NEXT:    .p2align 4, 0x90
>  ; X64-NEXT:  .LBB0_1: # %CF
>  ; X64-NEXT:    # =>This Loop Header: Depth=1
>  ; X64-NEXT:    # Child Loop BB0_2 Depth 2
> -; X64-NEXT:    movl %r8d, %eax
> -; X64-NEXT:    xorl %edx, %edx
> -; X64-NEXT:    divl %r9d
> -; X64-NEXT:    movl %r10d, %eax
> -; X64-NEXT:    xorl %edx, %edx
> -; X64-NEXT:    divl %r11d
> -; X64-NEXT:    movl %edi, %eax
> -; X64-NEXT:    xorl %edx, %edx
> -; X64-NEXT:    divl %ebx
>  ; X64-NEXT:    movl %ecx, %eax
> -; X64-NEXT:    xorl %edx, %edx
> -; X64-NEXT:    divl %ebp
> +; X64-NEXT:    divb %dl
>  ; X64-NEXT:    .p2align 4, 0x90
>  ; X64-NEXT:  .LBB0_2: # %CF237
>  ; X64-NEXT:    # Parent Loop BB0_1 Depth=1
>
> Modified: llvm/trunk/test/CodeGen/X86/load-partial.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/load-partial.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/load-partial.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/load-partial.ll Wed Aug  7 09:24:26 2019
> @@ -145,18 +145,8 @@ define i32 @load_partial_illegal_type()
>  ; SSE2:       # %bb.0:
>  ; SSE2-NEXT:    movzwl {{.*}}(%rip), %eax
>  ; SSE2-NEXT:    movd %eax, %xmm0
> -; SSE2-NEXT:    movdqa %xmm0, %xmm1
> -; SSE2-NEXT:    punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3],xmm1[4],xmm0[4],xmm1[5],xmm0[5],xmm1[6],xmm0[6],xmm1[7],xmm0[7]
> -; SSE2-NEXT:    punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3]
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[1,1,0,3]
> -; SSE2-NEXT:    punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
> -; SSE2-NEXT:    movl $2, %eax
> -; SSE2-NEXT:    movd %eax, %xmm1
> -; SSE2-NEXT:    shufps {{.*#+}} xmm1 = xmm1[0,0],xmm0[3,0]
> -; SSE2-NEXT:    shufps {{.*#+}} xmm0 = xmm0[0,1],xmm1[0,2]
> -; SSE2-NEXT:    andps {{.*}}(%rip), %xmm0
> -; SSE2-NEXT:    packuswb %xmm0, %xmm0
> -; SSE2-NEXT:    packuswb %xmm0, %xmm0
> +; SSE2-NEXT:    pand {{.*}}(%rip), %xmm0
> +; SSE2-NEXT:    por {{.*}}(%rip), %xmm0
>  ; SSE2-NEXT:    movd %xmm0, %eax
>  ; SSE2-NEXT:    retq
>  ;
> @@ -164,7 +154,8 @@ define i32 @load_partial_illegal_type()
>  ; SSSE3:       # %bb.0:
>  ; SSSE3-NEXT:    movzwl {{.*}}(%rip), %eax
>  ; SSSE3-NEXT:    movd %eax, %xmm0
> -; SSSE3-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0],mem[0],xmm0[1],mem[1],xmm0[2],mem[2],xmm0[3],mem[3]
> +; SSSE3-NEXT:    pshufb {{.*#+}} xmm0 = xmm0[0,1],zero,xmm0[3,4,5,6,7,8,9,10,11,12,13,14,15]
> +; SSSE3-NEXT:    por {{.*}}(%rip), %xmm0
>  ; SSSE3-NEXT:    movd %xmm0, %eax
>  ; SSSE3-NEXT:    retq
>  ;
> @@ -172,10 +163,8 @@ define i32 @load_partial_illegal_type()
>  ; SSE41:       # %bb.0:
>  ; SSE41-NEXT:    movzwl {{.*}}(%rip), %eax
>  ; SSE41-NEXT:    movd %eax, %xmm0
> -; SSE41-NEXT:    pshufb {{.*#+}} xmm0 = xmm0[0,1,2,3,1],zero,zero,zero,xmm0[u,u,u,u,u,u,u,u]
>  ; SSE41-NEXT:    movl $2, %eax
> -; SSE41-NEXT:    pinsrd $2, %eax, %xmm0
> -; SSE41-NEXT:    pshufb {{.*#+}} xmm0 = xmm0[0,4,8,u,u,u,u,u,u,u,u,u,u,u,u,u]
> +; SSE41-NEXT:    pinsrb $2, %eax, %xmm0
>  ; SSE41-NEXT:    movd %xmm0, %eax
>  ; SSE41-NEXT:    retq
>  ;
> @@ -183,10 +172,8 @@ define i32 @load_partial_illegal_type()
>  ; AVX:       # %bb.0:
>  ; AVX-NEXT:    movzwl {{.*}}(%rip), %eax
>  ; AVX-NEXT:    vmovd %eax, %xmm0
> -; AVX-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,1,2,3,1],zero,zero,zero,xmm0[u,u,u,u,u,u,u,u]
>  ; AVX-NEXT:    movl $2, %eax
> -; AVX-NEXT:    vpinsrd $2, %eax, %xmm0, %xmm0
> -; AVX-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,4,8,u,u,u,u,u,u,u,u,u,u,u,u,u]
> +; AVX-NEXT:    vpinsrb $2, %eax, %xmm0, %xmm0
>  ; AVX-NEXT:    vmovd %xmm0, %eax
>  ; AVX-NEXT:    retq
>    %1 = load <2 x i8>, <2 x i8>* bitcast (i8* @h to <2 x i8>*), align 1
>
> Modified: llvm/trunk/test/CodeGen/X86/lower-bitcast.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/lower-bitcast.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/lower-bitcast.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/lower-bitcast.ll Wed Aug  7 09:24:26 2019
> @@ -9,9 +9,7 @@
>  define double @test1(double %A) {
>  ; CHECK-LABEL: test1:
>  ; CHECK:       # %bb.0:
> -; CHECK-NEXT:    shufps {{.*#+}} xmm0 = xmm0[0,1,1,3]
>  ; CHECK-NEXT:    paddd {{.*}}(%rip), %xmm0
> -; CHECK-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
>  ; CHECK-NEXT:    retq
>  ;
>  ; CHECK-WIDE-LABEL: test1:
> @@ -68,9 +66,7 @@ define i64 @test4(i64 %A) {
>  ; CHECK-LABEL: test4:
>  ; CHECK:       # %bb.0:
>  ; CHECK-NEXT:    movq %rdi, %xmm0
> -; CHECK-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
>  ; CHECK-NEXT:    paddd {{.*}}(%rip), %xmm0
> -; CHECK-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
>  ; CHECK-NEXT:    movq %xmm0, %rax
>  ; CHECK-NEXT:    retq
>  ;
> @@ -108,9 +104,7 @@ define double @test5(double %A) {
>  define double @test6(double %A) {
>  ; CHECK-LABEL: test6:
>  ; CHECK:       # %bb.0:
> -; CHECK-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3]
>  ; CHECK-NEXT:    paddw {{.*}}(%rip), %xmm0
> -; CHECK-NEXT:    pshufb {{.*#+}} xmm0 = xmm0[0,1,4,5,8,9,12,13,8,9,12,13,12,13,14,15]
>  ; CHECK-NEXT:    retq
>  ;
>  ; CHECK-WIDE-LABEL: test6:
> @@ -147,9 +141,7 @@ define double @test7(double %A, double %
>  define double @test8(double %A) {
>  ; CHECK-LABEL: test8:
>  ; CHECK:       # %bb.0:
> -; CHECK-NEXT:    punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
>  ; CHECK-NEXT:    paddb {{.*}}(%rip), %xmm0
> -; CHECK-NEXT:    pshufb {{.*#+}} xmm0 = xmm0[0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u]
>  ; CHECK-NEXT:    retq
>  ;
>  ; CHECK-WIDE-LABEL: test8:
>
> Modified: llvm/trunk/test/CodeGen/X86/madd.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/madd.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/madd.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/madd.ll Wed Aug  7 09:24:26 2019
> @@ -1876,26 +1876,12 @@ define <4 x i32> @larger_mul(<16 x i16>
>  ;
>  ; AVX1-LABEL: larger_mul:
>  ; AVX1:       # %bb.0:
> -; AVX1-NEXT:    vpmovsxwd %xmm0, %xmm2
> -; AVX1-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[2,3,0,1]
> -; AVX1-NEXT:    vpmovsxwd %xmm0, %xmm0
> -; AVX1-NEXT:    vpackssdw %xmm0, %xmm2, %xmm0
> -; AVX1-NEXT:    vpmovsxwd %xmm1, %xmm2
> -; AVX1-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[2,3,0,1]
> -; AVX1-NEXT:    vpmovsxwd %xmm1, %xmm1
> -; AVX1-NEXT:    vpackssdw %xmm1, %xmm2, %xmm1
>  ; AVX1-NEXT:    vpmaddwd %xmm1, %xmm0, %xmm0
>  ; AVX1-NEXT:    vzeroupper
>  ; AVX1-NEXT:    retq
>  ;
>  ; AVX2-LABEL: larger_mul:
>  ; AVX2:       # %bb.0:
> -; AVX2-NEXT:    vpmovsxwd %xmm0, %ymm0
> -; AVX2-NEXT:    vpmovsxwd %xmm1, %ymm1
> -; AVX2-NEXT:    vextracti128 $1, %ymm1, %xmm2
> -; AVX2-NEXT:    vpackssdw %xmm2, %xmm1, %xmm1
> -; AVX2-NEXT:    vextracti128 $1, %ymm0, %xmm2
> -; AVX2-NEXT:    vpackssdw %xmm2, %xmm0, %xmm0
>  ; AVX2-NEXT:    vpmaddwd %xmm1, %xmm0, %xmm0
>  ; AVX2-NEXT:    vzeroupper
>  ; AVX2-NEXT:    retq
> @@ -2597,29 +2583,29 @@ define <4 x i32> @pmaddwd_bad_indices(<8
>  ; SSE2:       # %bb.0:
>  ; SSE2-NEXT:    movdqa (%rdi), %xmm0
>  ; SSE2-NEXT:    movdqa (%rsi), %xmm1
> -; SSE2-NEXT:    pshuflw {{.*#+}} xmm2 = xmm1[0,2,2,3,4,5,6,7]
> -; SSE2-NEXT:    pshufhw {{.*#+}} xmm2 = xmm2[0,1,2,3,4,6,6,7]
> +; SSE2-NEXT:    pshuflw {{.*#+}} xmm2 = xmm0[2,1,2,3,4,5,6,7]
> +; SSE2-NEXT:    pshufhw {{.*#+}} xmm2 = xmm2[0,1,2,3,6,5,6,7]
>  ; SSE2-NEXT:    pshufd {{.*#+}} xmm2 = xmm2[0,2,2,3]
> -; SSE2-NEXT:    pshuflw {{.*#+}} xmm3 = xmm0[2,1,2,3,4,5,6,7]
> -; SSE2-NEXT:    pshufhw {{.*#+}} xmm3 = xmm3[0,1,2,3,6,5,6,7]
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm3[0,2,2,3]
> -; SSE2-NEXT:    pshuflw {{.*#+}} xmm3 = xmm3[1,0,3,2,4,5,6,7]
> -; SSE2-NEXT:    movdqa %xmm3, %xmm4
> -; SSE2-NEXT:    pmulhw %xmm2, %xmm4
> -; SSE2-NEXT:    pmullw %xmm2, %xmm3
> -; SSE2-NEXT:    punpcklwd {{.*#+}} xmm3 = xmm3[0],xmm4[0],xmm3[1],xmm4[1],xmm3[2],xmm4[2],xmm3[3],xmm4[3]
> +; SSE2-NEXT:    pshuflw {{.*#+}} xmm2 = xmm2[1,0,3,2,4,5,6,7]
>  ; SSE2-NEXT:    pshuflw {{.*#+}} xmm0 = xmm0[0,3,2,3,4,5,6,7]
>  ; SSE2-NEXT:    pshufhw {{.*#+}} xmm0 = xmm0[0,1,2,3,4,7,6,7]
>  ; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> +; SSE2-NEXT:    pshuflw {{.*#+}} xmm3 = xmm1[0,2,2,3,4,5,6,7]
> +; SSE2-NEXT:    pshufhw {{.*#+}} xmm3 = xmm3[0,1,2,3,4,6,6,7]
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm3[0,2,2,3]
>  ; SSE2-NEXT:    pshuflw {{.*#+}} xmm1 = xmm1[3,1,2,3,4,5,6,7]
>  ; SSE2-NEXT:    pshufhw {{.*#+}} xmm1 = xmm1[0,1,2,3,7,5,6,7]
>  ; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
>  ; SSE2-NEXT:    pshuflw {{.*#+}} xmm1 = xmm1[1,0,3,2,4,5,6,7]
> -; SSE2-NEXT:    movdqa %xmm0, %xmm2
> -; SSE2-NEXT:    pmulhw %xmm1, %xmm2
> +; SSE2-NEXT:    movdqa %xmm2, %xmm4
> +; SSE2-NEXT:    pmulhw %xmm3, %xmm4
> +; SSE2-NEXT:    pmullw %xmm3, %xmm2
> +; SSE2-NEXT:    punpcklwd {{.*#+}} xmm2 = xmm2[0],xmm4[0],xmm2[1],xmm4[1],xmm2[2],xmm4[2],xmm2[3],xmm4[3]
> +; SSE2-NEXT:    movdqa %xmm0, %xmm3
> +; SSE2-NEXT:    pmulhw %xmm1, %xmm3
>  ; SSE2-NEXT:    pmullw %xmm1, %xmm0
> -; SSE2-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3]
> -; SSE2-NEXT:    paddd %xmm3, %xmm0
> +; SSE2-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm3[0],xmm0[1],xmm3[1],xmm0[2],xmm3[2],xmm0[3],xmm3[3]
> +; SSE2-NEXT:    paddd %xmm2, %xmm0
>  ; SSE2-NEXT:    retq
>  ;
>  ; AVX-LABEL: pmaddwd_bad_indices:
> @@ -2627,13 +2613,13 @@ define <4 x i32> @pmaddwd_bad_indices(<8
>  ; AVX-NEXT:    vmovdqa (%rdi), %xmm0
>  ; AVX-NEXT:    vmovdqa (%rsi), %xmm1
>  ; AVX-NEXT:    vpshufb {{.*#+}} xmm2 = xmm0[2,3,4,5,10,11,12,13,12,13,10,11,12,13,14,15]
> -; AVX-NEXT:    vpmovsxwd %xmm2, %xmm2
> +; AVX-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,1,6,7,8,9,14,15,8,9,14,15,12,13,14,15]
>  ; AVX-NEXT:    vpshufb {{.*#+}} xmm3 = xmm1[0,1,4,5,8,9,12,13,8,9,12,13,12,13,14,15]
> +; AVX-NEXT:    vpshufb {{.*#+}} xmm1 = xmm1[2,3,6,7,10,11,14,15,14,15,10,11,12,13,14,15]
> +; AVX-NEXT:    vpmovsxwd %xmm2, %xmm2
>  ; AVX-NEXT:    vpmovsxwd %xmm3, %xmm3
>  ; AVX-NEXT:    vpmulld %xmm3, %xmm2, %xmm2
> -; AVX-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,1,6,7,8,9,14,15,8,9,14,15,12,13,14,15]
>  ; AVX-NEXT:    vpmovsxwd %xmm0, %xmm0
> -; AVX-NEXT:    vpshufb {{.*#+}} xmm1 = xmm1[2,3,6,7,10,11,14,15,14,15,10,11,12,13,14,15]
>  ; AVX-NEXT:    vpmovsxwd %xmm1, %xmm1
>  ; AVX-NEXT:    vpmulld %xmm1, %xmm0, %xmm0
>  ; AVX-NEXT:    vpaddd %xmm0, %xmm2, %xmm0
>
> Modified: llvm/trunk/test/CodeGen/X86/masked_compressstore.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/masked_compressstore.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/masked_compressstore.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/masked_compressstore.ll Wed Aug  7 09:24:26 2019
> @@ -603,11 +603,9 @@ define void @compressstore_v16f64_v16i1(
>  define void @compressstore_v2f32_v2i32(float* %base, <2 x float> %V, <2 x i32> %trigger) {
>  ; SSE2-LABEL: compressstore_v2f32_v2i32:
>  ; SSE2:       ## %bb.0:
> -; SSE2-NEXT:    pand {{.*}}(%rip), %xmm1
>  ; SSE2-NEXT:    pxor %xmm2, %xmm2
>  ; SSE2-NEXT:    pcmpeqd %xmm1, %xmm2
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm2[1,0,3,2]
> -; SSE2-NEXT:    pand %xmm2, %xmm1
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm2[0,0,1,1]
>  ; SSE2-NEXT:    movmskpd %xmm1, %eax
>  ; SSE2-NEXT:    testb $1, %al
>  ; SSE2-NEXT:    jne LBB2_1
> @@ -629,8 +627,8 @@ define void @compressstore_v2f32_v2i32(f
>  ; SSE42-LABEL: compressstore_v2f32_v2i32:
>  ; SSE42:       ## %bb.0:
>  ; SSE42-NEXT:    pxor %xmm2, %xmm2
> -; SSE42-NEXT:    pblendw {{.*#+}} xmm1 = xmm1[0,1],xmm2[2,3],xmm1[4,5],xmm2[6,7]
> -; SSE42-NEXT:    pcmpeqq %xmm2, %xmm1
> +; SSE42-NEXT:    pcmpeqd %xmm1, %xmm2
> +; SSE42-NEXT:    pmovsxdq %xmm2, %xmm1
>  ; SSE42-NEXT:    movmskpd %xmm1, %eax
>  ; SSE42-NEXT:    testb $1, %al
>  ; SSE42-NEXT:    jne LBB2_1
> @@ -648,69 +646,54 @@ define void @compressstore_v2f32_v2i32(f
>  ; SSE42-NEXT:    extractps $1, %xmm0, (%rdi)
>  ; SSE42-NEXT:    retq
>  ;
> -; AVX1-LABEL: compressstore_v2f32_v2i32:
> -; AVX1:       ## %bb.0:
> -; AVX1-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> -; AVX1-NEXT:    vpblendw {{.*#+}} xmm1 = xmm1[0,1],xmm2[2,3],xmm1[4,5],xmm2[6,7]
> -; AVX1-NEXT:    vpcmpeqq %xmm2, %xmm1, %xmm1
> -; AVX1-NEXT:    vmovmskpd %xmm1, %eax
> -; AVX1-NEXT:    testb $1, %al
> -; AVX1-NEXT:    jne LBB2_1
> -; AVX1-NEXT:  ## %bb.2: ## %else
> -; AVX1-NEXT:    testb $2, %al
> -; AVX1-NEXT:    jne LBB2_3
> -; AVX1-NEXT:  LBB2_4: ## %else2
> -; AVX1-NEXT:    retq
> -; AVX1-NEXT:  LBB2_1: ## %cond.store
> -; AVX1-NEXT:    vmovss %xmm0, (%rdi)
> -; AVX1-NEXT:    addq $4, %rdi
> -; AVX1-NEXT:    testb $2, %al
> -; AVX1-NEXT:    je LBB2_4
> -; AVX1-NEXT:  LBB2_3: ## %cond.store1
> -; AVX1-NEXT:    vextractps $1, %xmm0, (%rdi)
> -; AVX1-NEXT:    retq
> -;
> -; AVX2-LABEL: compressstore_v2f32_v2i32:
> -; AVX2:       ## %bb.0:
> -; AVX2-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> -; AVX2-NEXT:    vpblendd {{.*#+}} xmm1 = xmm1[0],xmm2[1],xmm1[2],xmm2[3]
> -; AVX2-NEXT:    vpcmpeqq %xmm2, %xmm1, %xmm1
> -; AVX2-NEXT:    vmovmskpd %xmm1, %eax
> -; AVX2-NEXT:    testb $1, %al
> -; AVX2-NEXT:    jne LBB2_1
> -; AVX2-NEXT:  ## %bb.2: ## %else
> -; AVX2-NEXT:    testb $2, %al
> -; AVX2-NEXT:    jne LBB2_3
> -; AVX2-NEXT:  LBB2_4: ## %else2
> -; AVX2-NEXT:    retq
> -; AVX2-NEXT:  LBB2_1: ## %cond.store
> -; AVX2-NEXT:    vmovss %xmm0, (%rdi)
> -; AVX2-NEXT:    addq $4, %rdi
> -; AVX2-NEXT:    testb $2, %al
> -; AVX2-NEXT:    je LBB2_4
> -; AVX2-NEXT:  LBB2_3: ## %cond.store1
> -; AVX2-NEXT:    vextractps $1, %xmm0, (%rdi)
> -; AVX2-NEXT:    retq
> +; AVX1OR2-LABEL: compressstore_v2f32_v2i32:
> +; AVX1OR2:       ## %bb.0:
> +; AVX1OR2-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> +; AVX1OR2-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
> +; AVX1OR2-NEXT:    vpmovsxdq %xmm1, %xmm1
> +; AVX1OR2-NEXT:    vmovmskpd %xmm1, %eax
> +; AVX1OR2-NEXT:    testb $1, %al
> +; AVX1OR2-NEXT:    jne LBB2_1
> +; AVX1OR2-NEXT:  ## %bb.2: ## %else
> +; AVX1OR2-NEXT:    testb $2, %al
> +; AVX1OR2-NEXT:    jne LBB2_3
> +; AVX1OR2-NEXT:  LBB2_4: ## %else2
> +; AVX1OR2-NEXT:    retq
> +; AVX1OR2-NEXT:  LBB2_1: ## %cond.store
> +; AVX1OR2-NEXT:    vmovss %xmm0, (%rdi)
> +; AVX1OR2-NEXT:    addq $4, %rdi
> +; AVX1OR2-NEXT:    testb $2, %al
> +; AVX1OR2-NEXT:    je LBB2_4
> +; AVX1OR2-NEXT:  LBB2_3: ## %cond.store1
> +; AVX1OR2-NEXT:    vextractps $1, %xmm0, (%rdi)
> +; AVX1OR2-NEXT:    retq
>  ;
>  ; AVX512F-LABEL: compressstore_v2f32_v2i32:
>  ; AVX512F:       ## %bb.0:
> +; AVX512F-NEXT:    ## kill: def $xmm1 killed $xmm1 def $zmm1
>  ; AVX512F-NEXT:    ## kill: def $xmm0 killed $xmm0 def $zmm0
> -; AVX512F-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> -; AVX512F-NEXT:    vpblendd {{.*#+}} xmm1 = xmm1[0],xmm2[1],xmm1[2],xmm2[3]
> -; AVX512F-NEXT:    vptestnmq %zmm1, %zmm1, %k0
> +; AVX512F-NEXT:    vptestnmd %zmm1, %zmm1, %k0
>  ; AVX512F-NEXT:    kshiftlw $14, %k0, %k0
>  ; AVX512F-NEXT:    kshiftrw $14, %k0, %k1
>  ; AVX512F-NEXT:    vcompressps %zmm0, (%rdi) {%k1}
>  ; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
>  ;
> -; AVX512VL-LABEL: compressstore_v2f32_v2i32:
> -; AVX512VL:       ## %bb.0:
> -; AVX512VL-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> -; AVX512VL-NEXT:    vpblendd {{.*#+}} xmm1 = xmm1[0],xmm2[1],xmm1[2],xmm2[3]
> -; AVX512VL-NEXT:    vptestnmq %xmm1, %xmm1, %k1
> -; AVX512VL-NEXT:    vcompressps %xmm0, (%rdi) {%k1}
> -; AVX512VL-NEXT:    retq
> +; AVX512VLDQ-LABEL: compressstore_v2f32_v2i32:
> +; AVX512VLDQ:       ## %bb.0:
> +; AVX512VLDQ-NEXT:    vptestnmd %xmm1, %xmm1, %k0
> +; AVX512VLDQ-NEXT:    kshiftlb $6, %k0, %k0
> +; AVX512VLDQ-NEXT:    kshiftrb $6, %k0, %k1
> +; AVX512VLDQ-NEXT:    vcompressps %xmm0, (%rdi) {%k1}
> +; AVX512VLDQ-NEXT:    retq
> +;
> +; AVX512VLBW-LABEL: compressstore_v2f32_v2i32:
> +; AVX512VLBW:       ## %bb.0:
> +; AVX512VLBW-NEXT:    vptestnmd %xmm1, %xmm1, %k0
> +; AVX512VLBW-NEXT:    kshiftlw $14, %k0, %k0
> +; AVX512VLBW-NEXT:    kshiftrw $14, %k0, %k1
> +; AVX512VLBW-NEXT:    vcompressps %xmm0, (%rdi) {%k1}
> +; AVX512VLBW-NEXT:    retq
>    %mask = icmp eq <2 x i32> %trigger, zeroinitializer
>    call void @llvm.masked.compressstore.v2f32(<2 x float> %V, float* %base, <2 x i1> %mask)
>    ret void
>
> Modified: llvm/trunk/test/CodeGen/X86/masked_expandload.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/masked_expandload.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/masked_expandload.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/masked_expandload.ll Wed Aug  7 09:24:26 2019
> @@ -1117,11 +1117,9 @@ define <16 x double> @expandload_v16f64_
>  define <2 x float> @expandload_v2f32_v2i1(float* %base, <2 x float> %src0, <2 x i32> %trigger) {
>  ; SSE2-LABEL: expandload_v2f32_v2i1:
>  ; SSE2:       ## %bb.0:
> -; SSE2-NEXT:    pand {{.*}}(%rip), %xmm1
>  ; SSE2-NEXT:    pxor %xmm2, %xmm2
>  ; SSE2-NEXT:    pcmpeqd %xmm1, %xmm2
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm2[1,0,3,2]
> -; SSE2-NEXT:    pand %xmm2, %xmm1
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm2[0,0,1,1]
>  ; SSE2-NEXT:    movmskpd %xmm1, %eax
>  ; SSE2-NEXT:    testb $1, %al
>  ; SSE2-NEXT:    jne LBB4_1
> @@ -1146,8 +1144,8 @@ define <2 x float> @expandload_v2f32_v2i
>  ; SSE42-LABEL: expandload_v2f32_v2i1:
>  ; SSE42:       ## %bb.0:
>  ; SSE42-NEXT:    pxor %xmm2, %xmm2
> -; SSE42-NEXT:    pblendw {{.*#+}} xmm1 = xmm1[0,1],xmm2[2,3],xmm1[4,5],xmm2[6,7]
> -; SSE42-NEXT:    pcmpeqq %xmm2, %xmm1
> +; SSE42-NEXT:    pcmpeqd %xmm1, %xmm2
> +; SSE42-NEXT:    pmovsxdq %xmm2, %xmm1
>  ; SSE42-NEXT:    movmskpd %xmm1, %eax
>  ; SSE42-NEXT:    testb $1, %al
>  ; SSE42-NEXT:    jne LBB4_1
> @@ -1166,58 +1164,34 @@ define <2 x float> @expandload_v2f32_v2i
>  ; SSE42-NEXT:    insertps {{.*#+}} xmm0 = xmm0[0],mem[0],xmm0[2,3]
>  ; SSE42-NEXT:    retq
>  ;
> -; AVX1-LABEL: expandload_v2f32_v2i1:
> -; AVX1:       ## %bb.0:
> -; AVX1-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> -; AVX1-NEXT:    vpblendw {{.*#+}} xmm1 = xmm1[0,1],xmm2[2,3],xmm1[4,5],xmm2[6,7]
> -; AVX1-NEXT:    vpcmpeqq %xmm2, %xmm1, %xmm1
> -; AVX1-NEXT:    vmovmskpd %xmm1, %eax
> -; AVX1-NEXT:    testb $1, %al
> -; AVX1-NEXT:    jne LBB4_1
> -; AVX1-NEXT:  ## %bb.2: ## %else
> -; AVX1-NEXT:    testb $2, %al
> -; AVX1-NEXT:    jne LBB4_3
> -; AVX1-NEXT:  LBB4_4: ## %else2
> -; AVX1-NEXT:    retq
> -; AVX1-NEXT:  LBB4_1: ## %cond.load
> -; AVX1-NEXT:    vmovss {{.*#+}} xmm1 = mem[0],zero,zero,zero
> -; AVX1-NEXT:    vblendps {{.*#+}} xmm0 = xmm1[0],xmm0[1,2,3]
> -; AVX1-NEXT:    addq $4, %rdi
> -; AVX1-NEXT:    testb $2, %al
> -; AVX1-NEXT:    je LBB4_4
> -; AVX1-NEXT:  LBB4_3: ## %cond.load1
> -; AVX1-NEXT:    vinsertps {{.*#+}} xmm0 = xmm0[0],mem[0],xmm0[2,3]
> -; AVX1-NEXT:    retq
> -;
> -; AVX2-LABEL: expandload_v2f32_v2i1:
> -; AVX2:       ## %bb.0:
> -; AVX2-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> -; AVX2-NEXT:    vpblendd {{.*#+}} xmm1 = xmm1[0],xmm2[1],xmm1[2],xmm2[3]
> -; AVX2-NEXT:    vpcmpeqq %xmm2, %xmm1, %xmm1
> -; AVX2-NEXT:    vmovmskpd %xmm1, %eax
> -; AVX2-NEXT:    testb $1, %al
> -; AVX2-NEXT:    jne LBB4_1
> -; AVX2-NEXT:  ## %bb.2: ## %else
> -; AVX2-NEXT:    testb $2, %al
> -; AVX2-NEXT:    jne LBB4_3
> -; AVX2-NEXT:  LBB4_4: ## %else2
> -; AVX2-NEXT:    retq
> -; AVX2-NEXT:  LBB4_1: ## %cond.load
> -; AVX2-NEXT:    vmovss {{.*#+}} xmm1 = mem[0],zero,zero,zero
> -; AVX2-NEXT:    vblendps {{.*#+}} xmm0 = xmm1[0],xmm0[1,2,3]
> -; AVX2-NEXT:    addq $4, %rdi
> -; AVX2-NEXT:    testb $2, %al
> -; AVX2-NEXT:    je LBB4_4
> -; AVX2-NEXT:  LBB4_3: ## %cond.load1
> -; AVX2-NEXT:    vinsertps {{.*#+}} xmm0 = xmm0[0],mem[0],xmm0[2,3]
> -; AVX2-NEXT:    retq
> +; AVX1OR2-LABEL: expandload_v2f32_v2i1:
> +; AVX1OR2:       ## %bb.0:
> +; AVX1OR2-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> +; AVX1OR2-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
> +; AVX1OR2-NEXT:    vpmovsxdq %xmm1, %xmm1
> +; AVX1OR2-NEXT:    vmovmskpd %xmm1, %eax
> +; AVX1OR2-NEXT:    testb $1, %al
> +; AVX1OR2-NEXT:    jne LBB4_1
> +; AVX1OR2-NEXT:  ## %bb.2: ## %else
> +; AVX1OR2-NEXT:    testb $2, %al
> +; AVX1OR2-NEXT:    jne LBB4_3
> +; AVX1OR2-NEXT:  LBB4_4: ## %else2
> +; AVX1OR2-NEXT:    retq
> +; AVX1OR2-NEXT:  LBB4_1: ## %cond.load
> +; AVX1OR2-NEXT:    vmovss {{.*#+}} xmm1 = mem[0],zero,zero,zero
> +; AVX1OR2-NEXT:    vblendps {{.*#+}} xmm0 = xmm1[0],xmm0[1,2,3]
> +; AVX1OR2-NEXT:    addq $4, %rdi
> +; AVX1OR2-NEXT:    testb $2, %al
> +; AVX1OR2-NEXT:    je LBB4_4
> +; AVX1OR2-NEXT:  LBB4_3: ## %cond.load1
> +; AVX1OR2-NEXT:    vinsertps {{.*#+}} xmm0 = xmm0[0],mem[0],xmm0[2,3]
> +; AVX1OR2-NEXT:    retq
>  ;
>  ; AVX512F-LABEL: expandload_v2f32_v2i1:
>  ; AVX512F:       ## %bb.0:
> +; AVX512F-NEXT:    ## kill: def $xmm1 killed $xmm1 def $zmm1
>  ; AVX512F-NEXT:    ## kill: def $xmm0 killed $xmm0 def $zmm0
> -; AVX512F-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> -; AVX512F-NEXT:    vpblendd {{.*#+}} xmm1 = xmm1[0],xmm2[1],xmm1[2],xmm2[3]
> -; AVX512F-NEXT:    vptestnmq %zmm1, %zmm1, %k0
> +; AVX512F-NEXT:    vptestnmd %zmm1, %zmm1, %k0
>  ; AVX512F-NEXT:    kshiftlw $14, %k0, %k0
>  ; AVX512F-NEXT:    kshiftrw $14, %k0, %k1
>  ; AVX512F-NEXT:    vexpandps (%rdi), %zmm0 {%k1}
> @@ -1225,13 +1199,21 @@ define <2 x float> @expandload_v2f32_v2i
>  ; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
>  ;
> -; AVX512VL-LABEL: expandload_v2f32_v2i1:
> -; AVX512VL:       ## %bb.0:
> -; AVX512VL-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> -; AVX512VL-NEXT:    vpblendd {{.*#+}} xmm1 = xmm1[0],xmm2[1],xmm1[2],xmm2[3]
> -; AVX512VL-NEXT:    vptestnmq %xmm1, %xmm1, %k1
> -; AVX512VL-NEXT:    vexpandps (%rdi), %xmm0 {%k1}
> -; AVX512VL-NEXT:    retq
> +; AVX512VLDQ-LABEL: expandload_v2f32_v2i1:
> +; AVX512VLDQ:       ## %bb.0:
> +; AVX512VLDQ-NEXT:    vptestnmd %xmm1, %xmm1, %k0
> +; AVX512VLDQ-NEXT:    kshiftlb $6, %k0, %k0
> +; AVX512VLDQ-NEXT:    kshiftrb $6, %k0, %k1
> +; AVX512VLDQ-NEXT:    vexpandps (%rdi), %xmm0 {%k1}
> +; AVX512VLDQ-NEXT:    retq
> +;
> +; AVX512VLBW-LABEL: expandload_v2f32_v2i1:
> +; AVX512VLBW:       ## %bb.0:
> +; AVX512VLBW-NEXT:    vptestnmd %xmm1, %xmm1, %k0
> +; AVX512VLBW-NEXT:    kshiftlw $14, %k0, %k0
> +; AVX512VLBW-NEXT:    kshiftrw $14, %k0, %k1
> +; AVX512VLBW-NEXT:    vexpandps (%rdi), %xmm0 {%k1}
> +; AVX512VLBW-NEXT:    retq
>    %mask = icmp eq <2 x i32> %trigger, zeroinitializer
>    %res = call <2 x float> @llvm.masked.expandload.v2f32(float* %base, <2 x i1> %mask, <2 x float> %src0)
>    ret <2 x float> %res
>
> Modified: llvm/trunk/test/CodeGen/X86/masked_gather_scatter.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/masked_gather_scatter.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/masked_gather_scatter.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/masked_gather_scatter.ll Wed Aug  7 09:24:26 2019
> @@ -915,13 +915,12 @@ define <2 x double> @test17(double* %bas
>  ; KNL_64-LABEL: test17:
>  ; KNL_64:       # %bb.0:
>  ; KNL_64-NEXT:    # kill: def $xmm2 killed $xmm2 def $zmm2
> -; KNL_64-NEXT:    vpsllq $32, %xmm0, %xmm0
> -; KNL_64-NEXT:    vpsraq $32, %zmm0, %zmm0
> +; KNL_64-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
>  ; KNL_64-NEXT:    vpsllq $63, %xmm1, %xmm1
>  ; KNL_64-NEXT:    vptestmq %zmm1, %zmm1, %k0
>  ; KNL_64-NEXT:    kshiftlw $14, %k0, %k0
>  ; KNL_64-NEXT:    kshiftrw $14, %k0, %k1
> -; KNL_64-NEXT:    vgatherqpd (%rdi,%zmm0,8), %zmm2 {%k1}
> +; KNL_64-NEXT:    vgatherdpd (%rdi,%ymm0,8), %zmm2 {%k1}
>  ; KNL_64-NEXT:    vmovapd %xmm2, %xmm0
>  ; KNL_64-NEXT:    vzeroupper
>  ; KNL_64-NEXT:    retq
> @@ -929,36 +928,31 @@ define <2 x double> @test17(double* %bas
>  ; KNL_32-LABEL: test17:
>  ; KNL_32:       # %bb.0:
>  ; KNL_32-NEXT:    # kill: def $xmm2 killed $xmm2 def $zmm2
> -; KNL_32-NEXT:    vpsllq $32, %xmm0, %xmm0
> -; KNL_32-NEXT:    vpsraq $32, %zmm0, %zmm0
> +; KNL_32-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
>  ; KNL_32-NEXT:    vpsllq $63, %xmm1, %xmm1
>  ; KNL_32-NEXT:    vptestmq %zmm1, %zmm1, %k0
>  ; KNL_32-NEXT:    kshiftlw $14, %k0, %k0
>  ; KNL_32-NEXT:    kshiftrw $14, %k0, %k1
>  ; KNL_32-NEXT:    movl {{[0-9]+}}(%esp), %eax
> -; KNL_32-NEXT:    vgatherqpd (%eax,%zmm0,8), %zmm2 {%k1}
> +; KNL_32-NEXT:    vgatherdpd (%eax,%ymm0,8), %zmm2 {%k1}
>  ; KNL_32-NEXT:    vmovapd %xmm2, %xmm0
>  ; KNL_32-NEXT:    vzeroupper
>  ; KNL_32-NEXT:    retl
>  ;
>  ; SKX-LABEL: test17:
>  ; SKX:       # %bb.0:
> -; SKX-NEXT:    vpsllq $32, %xmm0, %xmm0
> -; SKX-NEXT:    vpsraq $32, %xmm0, %xmm0
>  ; SKX-NEXT:    vpsllq $63, %xmm1, %xmm1
>  ; SKX-NEXT:    vpmovq2m %xmm1, %k1
> -; SKX-NEXT:    vgatherqpd (%rdi,%xmm0,8), %xmm2 {%k1}
> +; SKX-NEXT:    vgatherdpd (%rdi,%xmm0,8), %xmm2 {%k1}
>  ; SKX-NEXT:    vmovapd %xmm2, %xmm0
>  ; SKX-NEXT:    retq
>  ;
>  ; SKX_32-LABEL: test17:
>  ; SKX_32:       # %bb.0:
> -; SKX_32-NEXT:    vpsllq $32, %xmm0, %xmm0
> -; SKX_32-NEXT:    vpsraq $32, %xmm0, %xmm0
>  ; SKX_32-NEXT:    vpsllq $63, %xmm1, %xmm1
>  ; SKX_32-NEXT:    vpmovq2m %xmm1, %k1
>  ; SKX_32-NEXT:    movl {{[0-9]+}}(%esp), %eax
> -; SKX_32-NEXT:    vgatherqpd (%eax,%xmm0,8), %xmm2 {%k1}
> +; SKX_32-NEXT:    vgatherdpd (%eax,%xmm0,8), %xmm2 {%k1}
>  ; SKX_32-NEXT:    vmovapd %xmm2, %xmm0
>  ; SKX_32-NEXT:    retl
>
> @@ -1080,8 +1074,8 @@ define void @test20(<2 x float>%a1, <2 x
>  ;
>  ; KNL_32-LABEL: test20:
>  ; KNL_32:       # %bb.0:
> +; KNL_32-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
>  ; KNL_32-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
> -; KNL_32-NEXT:    vpermilps {{.*#+}} xmm1 = xmm1[0,2,2,3]
>  ; KNL_32-NEXT:    vpsllq $63, %xmm2, %xmm2
>  ; KNL_32-NEXT:    vptestmq %zmm2, %zmm2, %k0
>  ; KNL_32-NEXT:    kshiftlw $14, %k0, %k0
> @@ -1099,7 +1093,6 @@ define void @test20(<2 x float>%a1, <2 x
>  ;
>  ; SKX_32-LABEL: test20:
>  ; SKX_32:       # %bb.0:
> -; SKX_32-NEXT:    vpermilps {{.*#+}} xmm1 = xmm1[0,2,2,3]
>  ; SKX_32-NEXT:    vpsllq $63, %xmm2, %xmm2
>  ; SKX_32-NEXT:    vpmovq2m %xmm2, %k1
>  ; SKX_32-NEXT:    vscatterdps %xmm0, (,%xmm1) {%k1}
> @@ -1113,9 +1106,9 @@ define void @test21(<2 x i32>%a1, <2 x i
>  ; KNL_64-LABEL: test21:
>  ; KNL_64:       # %bb.0:
>  ; KNL_64-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> +; KNL_64-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
>  ; KNL_64-NEXT:    vpsllq $63, %xmm2, %xmm2
>  ; KNL_64-NEXT:    vptestmq %zmm2, %zmm2, %k0
> -; KNL_64-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
>  ; KNL_64-NEXT:    kshiftlw $14, %k0, %k0
>  ; KNL_64-NEXT:    kshiftrw $14, %k0, %k1
>  ; KNL_64-NEXT:    vpscatterqd %ymm0, (,%zmm1) {%k1}
> @@ -1124,10 +1117,10 @@ define void @test21(<2 x i32>%a1, <2 x i
>  ;
>  ; KNL_32-LABEL: test21:
>  ; KNL_32:       # %bb.0:
> +; KNL_32-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> +; KNL_32-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
>  ; KNL_32-NEXT:    vpsllq $63, %xmm2, %xmm2
>  ; KNL_32-NEXT:    vptestmq %zmm2, %zmm2, %k0
> -; KNL_32-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> -; KNL_32-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
>  ; KNL_32-NEXT:    kshiftlw $14, %k0, %k0
>  ; KNL_32-NEXT:    kshiftrw $14, %k0, %k1
>  ; KNL_32-NEXT:    vpscatterdd %zmm0, (,%zmm1) {%k1}
> @@ -1138,7 +1131,6 @@ define void @test21(<2 x i32>%a1, <2 x i
>  ; SKX:       # %bb.0:
>  ; SKX-NEXT:    vpsllq $63, %xmm2, %xmm2
>  ; SKX-NEXT:    vpmovq2m %xmm2, %k1
> -; SKX-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
>  ; SKX-NEXT:    vpscatterqd %xmm0, (,%xmm1) {%k1}
>  ; SKX-NEXT:    retq
>  ;
> @@ -1146,8 +1138,6 @@ define void @test21(<2 x i32>%a1, <2 x i
>  ; SKX_32:       # %bb.0:
>  ; SKX_32-NEXT:    vpsllq $63, %xmm2, %xmm2
>  ; SKX_32-NEXT:    vpmovq2m %xmm2, %k1
> -; SKX_32-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> -; SKX_32-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
>  ; SKX_32-NEXT:    vpscatterdd %xmm0, (,%xmm1) {%k1}
>  ; SKX_32-NEXT:    retl
>    call void @llvm.masked.scatter.v2i32.v2p0i32(<2 x i32> %a1, <2 x i32*> %ptr, i32 4, <2 x i1> %mask)
> @@ -1161,7 +1151,7 @@ define <2 x float> @test22(float* %base,
>  ; KNL_64-LABEL: test22:
>  ; KNL_64:       # %bb.0:
>  ; KNL_64-NEXT:    # kill: def $xmm2 killed $xmm2 def $zmm2
> -; KNL_64-NEXT:    vpermilps {{.*#+}} xmm0 = xmm0[0,2,2,3]
> +; KNL_64-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
>  ; KNL_64-NEXT:    vpsllq $63, %xmm1, %xmm1
>  ; KNL_64-NEXT:    vptestmq %zmm1, %zmm1, %k0
>  ; KNL_64-NEXT:    kshiftlw $14, %k0, %k0
> @@ -1174,7 +1164,7 @@ define <2 x float> @test22(float* %base,
>  ; KNL_32-LABEL: test22:
>  ; KNL_32:       # %bb.0:
>  ; KNL_32-NEXT:    # kill: def $xmm2 killed $xmm2 def $zmm2
> -; KNL_32-NEXT:    vpermilps {{.*#+}} xmm0 = xmm0[0,2,2,3]
> +; KNL_32-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
>  ; KNL_32-NEXT:    vpsllq $63, %xmm1, %xmm1
>  ; KNL_32-NEXT:    vptestmq %zmm1, %zmm1, %k0
>  ; KNL_32-NEXT:    kshiftlw $14, %k0, %k0
> @@ -1187,7 +1177,6 @@ define <2 x float> @test22(float* %base,
>  ;
>  ; SKX-LABEL: test22:
>  ; SKX:       # %bb.0:
> -; SKX-NEXT:    vpermilps {{.*#+}} xmm0 = xmm0[0,2,2,3]
>  ; SKX-NEXT:    vpsllq $63, %xmm1, %xmm1
>  ; SKX-NEXT:    vpmovq2m %xmm1, %k1
>  ; SKX-NEXT:    vgatherdps (%rdi,%xmm0,4), %xmm2 {%k1}
> @@ -1196,7 +1185,6 @@ define <2 x float> @test22(float* %base,
>  ;
>  ; SKX_32-LABEL: test22:
>  ; SKX_32:       # %bb.0:
> -; SKX_32-NEXT:    vpermilps {{.*#+}} xmm0 = xmm0[0,2,2,3]
>  ; SKX_32-NEXT:    vpsllq $63, %xmm1, %xmm1
>  ; SKX_32-NEXT:    vpmovq2m %xmm1, %k1
>  ; SKX_32-NEXT:    movl {{[0-9]+}}(%esp), %eax
> @@ -1264,28 +1252,28 @@ declare <2 x i64> @llvm.masked.gather.v2
>  define <2 x i32> @test23(i32* %base, <2 x i32> %ind, <2 x i1> %mask, <2 x i32> %src0) {
>  ; KNL_64-LABEL: test23:
>  ; KNL_64:       # %bb.0:
> +; KNL_64-NEXT:    # kill: def $xmm2 killed $xmm2 def $zmm2
> +; KNL_64-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
>  ; KNL_64-NEXT:    vpsllq $63, %xmm1, %xmm1
>  ; KNL_64-NEXT:    vptestmq %zmm1, %zmm1, %k0
> -; KNL_64-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> -; KNL_64-NEXT:    vpshufd {{.*#+}} xmm1 = xmm2[0,2,2,3]
>  ; KNL_64-NEXT:    kshiftlw $14, %k0, %k0
>  ; KNL_64-NEXT:    kshiftrw $14, %k0, %k1
> -; KNL_64-NEXT:    vpgatherdd (%rdi,%zmm0,4), %zmm1 {%k1}
> -; KNL_64-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm1[0],zero,xmm1[1],zero
> +; KNL_64-NEXT:    vpgatherdd (%rdi,%zmm0,4), %zmm2 {%k1}
> +; KNL_64-NEXT:    vmovdqa %xmm2, %xmm0
>  ; KNL_64-NEXT:    vzeroupper
>  ; KNL_64-NEXT:    retq
>  ;
>  ; KNL_32-LABEL: test23:
>  ; KNL_32:       # %bb.0:
> +; KNL_32-NEXT:    # kill: def $xmm2 killed $xmm2 def $zmm2
> +; KNL_32-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
>  ; KNL_32-NEXT:    vpsllq $63, %xmm1, %xmm1
>  ; KNL_32-NEXT:    vptestmq %zmm1, %zmm1, %k0
> -; KNL_32-NEXT:    movl {{[0-9]+}}(%esp), %eax
> -; KNL_32-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> -; KNL_32-NEXT:    vpshufd {{.*#+}} xmm1 = xmm2[0,2,2,3]
>  ; KNL_32-NEXT:    kshiftlw $14, %k0, %k0
>  ; KNL_32-NEXT:    kshiftrw $14, %k0, %k1
> -; KNL_32-NEXT:    vpgatherdd (%eax,%zmm0,4), %zmm1 {%k1}
> -; KNL_32-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm1[0],zero,xmm1[1],zero
> +; KNL_32-NEXT:    movl {{[0-9]+}}(%esp), %eax
> +; KNL_32-NEXT:    vpgatherdd (%eax,%zmm0,4), %zmm2 {%k1}
> +; KNL_32-NEXT:    vmovdqa %xmm2, %xmm0
>  ; KNL_32-NEXT:    vzeroupper
>  ; KNL_32-NEXT:    retl
>  ;
> @@ -1293,10 +1281,8 @@ define <2 x i32> @test23(i32* %base, <2
>  ; SKX:       # %bb.0:
>  ; SKX-NEXT:    vpsllq $63, %xmm1, %xmm1
>  ; SKX-NEXT:    vpmovq2m %xmm1, %k1
> -; SKX-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> -; SKX-NEXT:    vpshufd {{.*#+}} xmm1 = xmm2[0,2,2,3]
> -; SKX-NEXT:    vpgatherdd (%rdi,%xmm0,4), %xmm1 {%k1}
> -; SKX-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm1[0],zero,xmm1[1],zero
> +; SKX-NEXT:    vpgatherdd (%rdi,%xmm0,4), %xmm2 {%k1}
> +; SKX-NEXT:    vmovdqa %xmm2, %xmm0
>  ; SKX-NEXT:    retq
>  ;
>  ; SKX_32-LABEL: test23:
> @@ -1304,10 +1290,8 @@ define <2 x i32> @test23(i32* %base, <2
>  ; SKX_32-NEXT:    vpsllq $63, %xmm1, %xmm1
>  ; SKX_32-NEXT:    vpmovq2m %xmm1, %k1
>  ; SKX_32-NEXT:    movl {{[0-9]+}}(%esp), %eax
> -; SKX_32-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> -; SKX_32-NEXT:    vpshufd {{.*#+}} xmm1 = xmm2[0,2,2,3]
> -; SKX_32-NEXT:    vpgatherdd (%eax,%xmm0,4), %xmm1 {%k1}
> -; SKX_32-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm1[0],zero,xmm1[1],zero
> +; SKX_32-NEXT:    vpgatherdd (%eax,%xmm0,4), %xmm2 {%k1}
> +; SKX_32-NEXT:    vmovdqa %xmm2, %xmm0
>  ; SKX_32-NEXT:    retl
>    %sext_ind = sext <2 x i32> %ind to <2 x i64>
>    %gep.random = getelementptr i32, i32* %base, <2 x i64> %sext_ind
> @@ -1318,28 +1302,28 @@ define <2 x i32> @test23(i32* %base, <2
>  define <2 x i32> @test23b(i32* %base, <2 x i64> %ind, <2 x i1> %mask, <2 x i32> %src0) {
>  ; KNL_64-LABEL: test23b:
>  ; KNL_64:       # %bb.0:
> +; KNL_64-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
>  ; KNL_64-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
>  ; KNL_64-NEXT:    vpsllq $63, %xmm1, %xmm1
>  ; KNL_64-NEXT:    vptestmq %zmm1, %zmm1, %k0
> -; KNL_64-NEXT:    vpshufd {{.*#+}} xmm1 = xmm2[0,2,2,3]
>  ; KNL_64-NEXT:    kshiftlw $14, %k0, %k0
>  ; KNL_64-NEXT:    kshiftrw $14, %k0, %k1
> -; KNL_64-NEXT:    vpgatherqd (%rdi,%zmm0,4), %ymm1 {%k1}
> -; KNL_64-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm1[0],zero,xmm1[1],zero
> +; KNL_64-NEXT:    vpgatherqd (%rdi,%zmm0,4), %ymm2 {%k1}
> +; KNL_64-NEXT:    vmovdqa %xmm2, %xmm0
>  ; KNL_64-NEXT:    vzeroupper
>  ; KNL_64-NEXT:    retq
>  ;
>  ; KNL_32-LABEL: test23b:
>  ; KNL_32:       # %bb.0:
> +; KNL_32-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
>  ; KNL_32-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
>  ; KNL_32-NEXT:    vpsllq $63, %xmm1, %xmm1
>  ; KNL_32-NEXT:    vptestmq %zmm1, %zmm1, %k0
> -; KNL_32-NEXT:    movl {{[0-9]+}}(%esp), %eax
> -; KNL_32-NEXT:    vpshufd {{.*#+}} xmm1 = xmm2[0,2,2,3]
>  ; KNL_32-NEXT:    kshiftlw $14, %k0, %k0
>  ; KNL_32-NEXT:    kshiftrw $14, %k0, %k1
> -; KNL_32-NEXT:    vpgatherqd (%eax,%zmm0,4), %ymm1 {%k1}
> -; KNL_32-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm1[0],zero,xmm1[1],zero
> +; KNL_32-NEXT:    movl {{[0-9]+}}(%esp), %eax
> +; KNL_32-NEXT:    vpgatherqd (%eax,%zmm0,4), %ymm2 {%k1}
> +; KNL_32-NEXT:    vmovdqa %xmm2, %xmm0
>  ; KNL_32-NEXT:    vzeroupper
>  ; KNL_32-NEXT:    retl
>  ;
> @@ -1347,9 +1331,8 @@ define <2 x i32> @test23b(i32* %base, <2
>  ; SKX:       # %bb.0:
>  ; SKX-NEXT:    vpsllq $63, %xmm1, %xmm1
>  ; SKX-NEXT:    vpmovq2m %xmm1, %k1
> -; SKX-NEXT:    vpshufd {{.*#+}} xmm1 = xmm2[0,2,2,3]
> -; SKX-NEXT:    vpgatherqd (%rdi,%xmm0,4), %xmm1 {%k1}
> -; SKX-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm1[0],zero,xmm1[1],zero
> +; SKX-NEXT:    vpgatherqd (%rdi,%xmm0,4), %xmm2 {%k1}
> +; SKX-NEXT:    vmovdqa %xmm2, %xmm0
>  ; SKX-NEXT:    retq
>  ;
>  ; SKX_32-LABEL: test23b:
> @@ -1357,9 +1340,8 @@ define <2 x i32> @test23b(i32* %base, <2
>  ; SKX_32-NEXT:    vpsllq $63, %xmm1, %xmm1
>  ; SKX_32-NEXT:    vpmovq2m %xmm1, %k1
>  ; SKX_32-NEXT:    movl {{[0-9]+}}(%esp), %eax
> -; SKX_32-NEXT:    vpshufd {{.*#+}} xmm1 = xmm2[0,2,2,3]
> -; SKX_32-NEXT:    vpgatherqd (%eax,%xmm0,4), %xmm1 {%k1}
> -; SKX_32-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm1[0],zero,xmm1[1],zero
> +; SKX_32-NEXT:    vpgatherqd (%eax,%xmm0,4), %xmm2 {%k1}
> +; SKX_32-NEXT:    vmovdqa %xmm2, %xmm0
>  ; SKX_32-NEXT:    retl
>    %gep.random = getelementptr i32, i32* %base, <2 x i64> %ind
>    %res = call <2 x i32> @llvm.masked.gather.v2i32.v2p0i32(<2 x i32*> %gep.random, i32 4, <2 x i1> %mask, <2 x i32> %src0)
> @@ -1369,22 +1351,22 @@ define <2 x i32> @test23b(i32* %base, <2
>  define <2 x i32> @test24(i32* %base, <2 x i32> %ind) {
>  ; KNL_64-LABEL: test24:
>  ; KNL_64:       # %bb.0:
> -; KNL_64-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> +; KNL_64-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
>  ; KNL_64-NEXT:    movw $3, %ax
>  ; KNL_64-NEXT:    kmovw %eax, %k1
>  ; KNL_64-NEXT:    vpgatherdd (%rdi,%zmm0,4), %zmm1 {%k1}
> -; KNL_64-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm1[0],zero,xmm1[1],zero
> +; KNL_64-NEXT:    vmovdqa %xmm1, %xmm0
>  ; KNL_64-NEXT:    vzeroupper
>  ; KNL_64-NEXT:    retq
>  ;
>  ; KNL_32-LABEL: test24:
>  ; KNL_32:       # %bb.0:
> +; KNL_32-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
>  ; KNL_32-NEXT:    movl {{[0-9]+}}(%esp), %eax
> -; KNL_32-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
>  ; KNL_32-NEXT:    movw $3, %cx
>  ; KNL_32-NEXT:    kmovw %ecx, %k1
>  ; KNL_32-NEXT:    vpgatherdd (%eax,%zmm0,4), %zmm1 {%k1}
> -; KNL_32-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm1[0],zero,xmm1[1],zero
> +; KNL_32-NEXT:    vmovdqa %xmm1, %xmm0
>  ; KNL_32-NEXT:    vzeroupper
>  ; KNL_32-NEXT:    retl
>  ;
> @@ -1392,9 +1374,8 @@ define <2 x i32> @test24(i32* %base, <2
>  ; SKX:       # %bb.0:
>  ; SKX-NEXT:    movb $3, %al
>  ; SKX-NEXT:    kmovw %eax, %k1
> -; SKX-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
>  ; SKX-NEXT:    vpgatherdd (%rdi,%xmm0,4), %xmm1 {%k1}
> -; SKX-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm1[0],zero,xmm1[1],zero
> +; SKX-NEXT:    vmovdqa %xmm1, %xmm0
>  ; SKX-NEXT:    retq
>  ;
>  ; SKX_32-LABEL: test24:
> @@ -1402,9 +1383,8 @@ define <2 x i32> @test24(i32* %base, <2
>  ; SKX_32-NEXT:    movl {{[0-9]+}}(%esp), %eax
>  ; SKX_32-NEXT:    movb $3, %cl
>  ; SKX_32-NEXT:    kmovw %ecx, %k1
> -; SKX_32-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
>  ; SKX_32-NEXT:    vpgatherdd (%eax,%xmm0,4), %xmm1 {%k1}
> -; SKX_32-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm1[0],zero,xmm1[1],zero
> +; SKX_32-NEXT:    vmovdqa %xmm1, %xmm0
>  ; SKX_32-NEXT:    retl
>    %sext_ind = sext <2 x i32> %ind to <2 x i64>
>    %gep.random = getelementptr i32, i32* %base, <2 x i64> %sext_ind
> @@ -1416,13 +1396,12 @@ define <2 x i64> @test25(i64* %base, <2
>  ; KNL_64-LABEL: test25:
>  ; KNL_64:       # %bb.0:
>  ; KNL_64-NEXT:    # kill: def $xmm2 killed $xmm2 def $zmm2
> -; KNL_64-NEXT:    vpsllq $32, %xmm0, %xmm0
> -; KNL_64-NEXT:    vpsraq $32, %zmm0, %zmm0
> +; KNL_64-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
>  ; KNL_64-NEXT:    vpsllq $63, %xmm1, %xmm1
>  ; KNL_64-NEXT:    vptestmq %zmm1, %zmm1, %k0
>  ; KNL_64-NEXT:    kshiftlw $14, %k0, %k0
>  ; KNL_64-NEXT:    kshiftrw $14, %k0, %k1
> -; KNL_64-NEXT:    vpgatherqq (%rdi,%zmm0,8), %zmm2 {%k1}
> +; KNL_64-NEXT:    vpgatherdq (%rdi,%ymm0,8), %zmm2 {%k1}
>  ; KNL_64-NEXT:    vmovdqa %xmm2, %xmm0
>  ; KNL_64-NEXT:    vzeroupper
>  ; KNL_64-NEXT:    retq
> @@ -1430,36 +1409,31 @@ define <2 x i64> @test25(i64* %base, <2
>  ; KNL_32-LABEL: test25:
>  ; KNL_32:       # %bb.0:
>  ; KNL_32-NEXT:    # kill: def $xmm2 killed $xmm2 def $zmm2
> -; KNL_32-NEXT:    vpsllq $32, %xmm0, %xmm0
> -; KNL_32-NEXT:    vpsraq $32, %zmm0, %zmm0
> +; KNL_32-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
>  ; KNL_32-NEXT:    vpsllq $63, %xmm1, %xmm1
>  ; KNL_32-NEXT:    vptestmq %zmm1, %zmm1, %k0
>  ; KNL_32-NEXT:    kshiftlw $14, %k0, %k0
>  ; KNL_32-NEXT:    kshiftrw $14, %k0, %k1
>  ; KNL_32-NEXT:    movl {{[0-9]+}}(%esp), %eax
> -; KNL_32-NEXT:    vpgatherqq (%eax,%zmm0,8), %zmm2 {%k1}
> +; KNL_32-NEXT:    vpgatherdq (%eax,%ymm0,8), %zmm2 {%k1}
>  ; KNL_32-NEXT:    vmovdqa %xmm2, %xmm0
>  ; KNL_32-NEXT:    vzeroupper
>  ; KNL_32-NEXT:    retl
>  ;
>  ; SKX-LABEL: test25:
>  ; SKX:       # %bb.0:
> -; SKX-NEXT:    vpsllq $32, %xmm0, %xmm0
> -; SKX-NEXT:    vpsraq $32, %xmm0, %xmm0
>  ; SKX-NEXT:    vpsllq $63, %xmm1, %xmm1
>  ; SKX-NEXT:    vpmovq2m %xmm1, %k1
> -; SKX-NEXT:    vpgatherqq (%rdi,%xmm0,8), %xmm2 {%k1}
> +; SKX-NEXT:    vpgatherdq (%rdi,%xmm0,8), %xmm2 {%k1}
>  ; SKX-NEXT:    vmovdqa %xmm2, %xmm0
>  ; SKX-NEXT:    retq
>  ;
>  ; SKX_32-LABEL: test25:
>  ; SKX_32:       # %bb.0:
> -; SKX_32-NEXT:    vpsllq $32, %xmm0, %xmm0
> -; SKX_32-NEXT:    vpsraq $32, %xmm0, %xmm0
>  ; SKX_32-NEXT:    vpsllq $63, %xmm1, %xmm1
>  ; SKX_32-NEXT:    vpmovq2m %xmm1, %k1
>  ; SKX_32-NEXT:    movl {{[0-9]+}}(%esp), %eax
> -; SKX_32-NEXT:    vpgatherqq (%eax,%xmm0,8), %xmm2 {%k1}
> +; SKX_32-NEXT:    vpgatherdq (%eax,%xmm0,8), %xmm2 {%k1}
>  ; SKX_32-NEXT:    vmovdqa %xmm2, %xmm0
>  ; SKX_32-NEXT:    retl
>    %sext_ind = sext <2 x i32> %ind to <2 x i64>
> @@ -1472,11 +1446,10 @@ define <2 x i64> @test26(i64* %base, <2
>  ; KNL_64-LABEL: test26:
>  ; KNL_64:       # %bb.0:
>  ; KNL_64-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> -; KNL_64-NEXT:    vpsllq $32, %xmm0, %xmm0
> -; KNL_64-NEXT:    vpsraq $32, %zmm0, %zmm0
> +; KNL_64-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
>  ; KNL_64-NEXT:    movb $3, %al
>  ; KNL_64-NEXT:    kmovw %eax, %k1
> -; KNL_64-NEXT:    vpgatherqq (%rdi,%zmm0,8), %zmm1 {%k1}
> +; KNL_64-NEXT:    vpgatherdq (%rdi,%ymm0,8), %zmm1 {%k1}
>  ; KNL_64-NEXT:    vmovdqa %xmm1, %xmm0
>  ; KNL_64-NEXT:    vzeroupper
>  ; KNL_64-NEXT:    retq
> @@ -1484,32 +1457,27 @@ define <2 x i64> @test26(i64* %base, <2
>  ; KNL_32-LABEL: test26:
>  ; KNL_32:       # %bb.0:
>  ; KNL_32-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> -; KNL_32-NEXT:    vpsllq $32, %xmm0, %xmm0
> -; KNL_32-NEXT:    vpsraq $32, %zmm0, %zmm0
> +; KNL_32-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
>  ; KNL_32-NEXT:    movl {{[0-9]+}}(%esp), %eax
>  ; KNL_32-NEXT:    movb $3, %cl
>  ; KNL_32-NEXT:    kmovw %ecx, %k1
> -; KNL_32-NEXT:    vpgatherqq (%eax,%zmm0,8), %zmm1 {%k1}
> +; KNL_32-NEXT:    vpgatherdq (%eax,%ymm0,8), %zmm1 {%k1}
>  ; KNL_32-NEXT:    vmovdqa %xmm1, %xmm0
>  ; KNL_32-NEXT:    vzeroupper
>  ; KNL_32-NEXT:    retl
>  ;
>  ; SKX-LABEL: test26:
>  ; SKX:       # %bb.0:
> -; SKX-NEXT:    vpsllq $32, %xmm0, %xmm0
> -; SKX-NEXT:    vpsraq $32, %xmm0, %xmm0
>  ; SKX-NEXT:    kxnorw %k0, %k0, %k1
> -; SKX-NEXT:    vpgatherqq (%rdi,%xmm0,8), %xmm1 {%k1}
> +; SKX-NEXT:    vpgatherdq (%rdi,%xmm0,8), %xmm1 {%k1}
>  ; SKX-NEXT:    vmovdqa %xmm1, %xmm0
>  ; SKX-NEXT:    retq
>  ;
>  ; SKX_32-LABEL: test26:
>  ; SKX_32:       # %bb.0:
> -; SKX_32-NEXT:    vpsllq $32, %xmm0, %xmm0
> -; SKX_32-NEXT:    vpsraq $32, %xmm0, %xmm0
>  ; SKX_32-NEXT:    movl {{[0-9]+}}(%esp), %eax
>  ; SKX_32-NEXT:    kxnorw %k0, %k0, %k1
> -; SKX_32-NEXT:    vpgatherqq (%eax,%xmm0,8), %xmm1 {%k1}
> +; SKX_32-NEXT:    vpgatherdq (%eax,%xmm0,8), %xmm1 {%k1}
>  ; SKX_32-NEXT:    vmovdqa %xmm1, %xmm0
>  ; SKX_32-NEXT:    retl
>    %sext_ind = sext <2 x i32> %ind to <2 x i64>
> @@ -1522,40 +1490,40 @@ define <2 x i64> @test26(i64* %base, <2
>  define <2 x float> @test27(float* %base, <2 x i32> %ind) {
>  ; KNL_64-LABEL: test27:
>  ; KNL_64:       # %bb.0:
> -; KNL_64-NEXT:    vpermilps {{.*#+}} xmm1 = xmm0[0,2,2,3]
> +; KNL_64-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
>  ; KNL_64-NEXT:    movw $3, %ax
>  ; KNL_64-NEXT:    kmovw %eax, %k1
> -; KNL_64-NEXT:    vgatherdps (%rdi,%zmm1,4), %zmm0 {%k1}
> -; KNL_64-NEXT:    # kill: def $xmm0 killed $xmm0 killed $zmm0
> +; KNL_64-NEXT:    vgatherdps (%rdi,%zmm0,4), %zmm1 {%k1}
> +; KNL_64-NEXT:    vmovaps %xmm1, %xmm0
>  ; KNL_64-NEXT:    vzeroupper
>  ; KNL_64-NEXT:    retq
>  ;
>  ; KNL_32-LABEL: test27:
>  ; KNL_32:       # %bb.0:
> -; KNL_32-NEXT:    vpermilps {{.*#+}} xmm1 = xmm0[0,2,2,3]
> +; KNL_32-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
>  ; KNL_32-NEXT:    movl {{[0-9]+}}(%esp), %eax
>  ; KNL_32-NEXT:    movw $3, %cx
>  ; KNL_32-NEXT:    kmovw %ecx, %k1
> -; KNL_32-NEXT:    vgatherdps (%eax,%zmm1,4), %zmm0 {%k1}
> -; KNL_32-NEXT:    # kill: def $xmm0 killed $xmm0 killed $zmm0
> +; KNL_32-NEXT:    vgatherdps (%eax,%zmm0,4), %zmm1 {%k1}
> +; KNL_32-NEXT:    vmovaps %xmm1, %xmm0
>  ; KNL_32-NEXT:    vzeroupper
>  ; KNL_32-NEXT:    retl
>  ;
>  ; SKX-LABEL: test27:
>  ; SKX:       # %bb.0:
> -; SKX-NEXT:    vpermilps {{.*#+}} xmm1 = xmm0[0,2,2,3]
>  ; SKX-NEXT:    movb $3, %al
>  ; SKX-NEXT:    kmovw %eax, %k1
> -; SKX-NEXT:    vgatherdps (%rdi,%xmm1,4), %xmm0 {%k1}
> +; SKX-NEXT:    vgatherdps (%rdi,%xmm0,4), %xmm1 {%k1}
> +; SKX-NEXT:    vmovaps %xmm1, %xmm0
>  ; SKX-NEXT:    retq
>  ;
>  ; SKX_32-LABEL: test27:
>  ; SKX_32:       # %bb.0:
> -; SKX_32-NEXT:    vpermilps {{.*#+}} xmm1 = xmm0[0,2,2,3]
>  ; SKX_32-NEXT:    movl {{[0-9]+}}(%esp), %eax
>  ; SKX_32-NEXT:    movb $3, %cl
>  ; SKX_32-NEXT:    kmovw %ecx, %k1
> -; SKX_32-NEXT:    vgatherdps (%eax,%xmm1,4), %xmm0 {%k1}
> +; SKX_32-NEXT:    vgatherdps (%eax,%xmm0,4), %xmm1 {%k1}
> +; SKX_32-NEXT:    vmovaps %xmm1, %xmm0
>  ; SKX_32-NEXT:    retl
>    %sext_ind = sext <2 x i32> %ind to <2 x i64>
>    %gep.random = getelementptr float, float* %base, <2 x i64> %sext_ind
> @@ -1568,7 +1536,7 @@ define void @test28(<2 x i32>%a1, <2 x i
>  ; KNL_64-LABEL: test28:
>  ; KNL_64:       # %bb.0:
>  ; KNL_64-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> -; KNL_64-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> +; KNL_64-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
>  ; KNL_64-NEXT:    movb $3, %al
>  ; KNL_64-NEXT:    kmovw %eax, %k1
>  ; KNL_64-NEXT:    vpscatterqd %ymm0, (,%zmm1) {%k1}
> @@ -1577,8 +1545,8 @@ define void @test28(<2 x i32>%a1, <2 x i
>  ;
>  ; KNL_32-LABEL: test28:
>  ; KNL_32:       # %bb.0:
> -; KNL_32-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> -; KNL_32-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
> +; KNL_32-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> +; KNL_32-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
>  ; KNL_32-NEXT:    movw $3, %ax
>  ; KNL_32-NEXT:    kmovw %eax, %k1
>  ; KNL_32-NEXT:    vpscatterdd %zmm0, (,%zmm1) {%k1}
> @@ -1587,7 +1555,6 @@ define void @test28(<2 x i32>%a1, <2 x i
>  ;
>  ; SKX-LABEL: test28:
>  ; SKX:       # %bb.0:
> -; SKX-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
>  ; SKX-NEXT:    kxnorw %k0, %k0, %k1
>  ; SKX-NEXT:    vpscatterqd %xmm0, (,%xmm1) {%k1}
>  ; SKX-NEXT:    retq
> @@ -1596,8 +1563,6 @@ define void @test28(<2 x i32>%a1, <2 x i
>  ; SKX_32:       # %bb.0:
>  ; SKX_32-NEXT:    movb $3, %al
>  ; SKX_32-NEXT:    kmovw %eax, %k1
> -; SKX_32-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> -; SKX_32-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
>  ; SKX_32-NEXT:    vpscatterdd %xmm0, (,%xmm1) {%k1}
>  ; SKX_32-NEXT:    retl
>    call void @llvm.masked.scatter.v2i32.v2p0i32(<2 x i32> %a1, <2 x i32*> %ptr, i32 4, <2 x i1> <i1 true, i1 true>)
> @@ -2673,9 +2638,7 @@ define <16 x float> @sext_i8_index(float
>  define <8 x float> @sext_v8i8_index(float* %base, <8 x i8> %ind) {
>  ; KNL_64-LABEL: sext_v8i8_index:
>  ; KNL_64:       # %bb.0:
> -; KNL_64-NEXT:    vpmovzxwd {{.*#+}} ymm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
> -; KNL_64-NEXT:    vpslld $24, %ymm0, %ymm0
> -; KNL_64-NEXT:    vpsrad $24, %ymm0, %ymm1
> +; KNL_64-NEXT:    vpmovsxbd %xmm0, %ymm1
>  ; KNL_64-NEXT:    movw $255, %ax
>  ; KNL_64-NEXT:    kmovw %eax, %k1
>  ; KNL_64-NEXT:    vgatherdps (%rdi,%zmm1,4), %zmm0 {%k1}
> @@ -2684,10 +2647,8 @@ define <8 x float> @sext_v8i8_index(floa
>  ;
>  ; KNL_32-LABEL: sext_v8i8_index:
>  ; KNL_32:       # %bb.0:
> -; KNL_32-NEXT:    vpmovzxwd {{.*#+}} ymm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
>  ; KNL_32-NEXT:    movl {{[0-9]+}}(%esp), %eax
> -; KNL_32-NEXT:    vpslld $24, %ymm0, %ymm0
> -; KNL_32-NEXT:    vpsrad $24, %ymm0, %ymm1
> +; KNL_32-NEXT:    vpmovsxbd %xmm0, %ymm1
>  ; KNL_32-NEXT:    movw $255, %cx
>  ; KNL_32-NEXT:    kmovw %ecx, %k1
>  ; KNL_32-NEXT:    vgatherdps (%eax,%zmm1,4), %zmm0 {%k1}
> @@ -2696,20 +2657,16 @@ define <8 x float> @sext_v8i8_index(floa
>  ;
>  ; SKX-LABEL: sext_v8i8_index:
>  ; SKX:       # %bb.0:
> -; SKX-NEXT:    vpmovzxwd {{.*#+}} ymm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
> +; SKX-NEXT:    vpmovsxbd %xmm0, %ymm1
>  ; SKX-NEXT:    kxnorw %k0, %k0, %k1
> -; SKX-NEXT:    vpslld $24, %ymm0, %ymm0
> -; SKX-NEXT:    vpsrad $24, %ymm0, %ymm1
>  ; SKX-NEXT:    vgatherdps (%rdi,%ymm1,4), %ymm0 {%k1}
>  ; SKX-NEXT:    retq
>  ;
>  ; SKX_32-LABEL: sext_v8i8_index:
>  ; SKX_32:       # %bb.0:
> -; SKX_32-NEXT:    vpmovzxwd {{.*#+}} ymm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
>  ; SKX_32-NEXT:    movl {{[0-9]+}}(%esp), %eax
> +; SKX_32-NEXT:    vpmovsxbd %xmm0, %ymm1
>  ; SKX_32-NEXT:    kxnorw %k0, %k0, %k1
> -; SKX_32-NEXT:    vpslld $24, %ymm0, %ymm0
> -; SKX_32-NEXT:    vpsrad $24, %ymm0, %ymm1
>  ; SKX_32-NEXT:    vgatherdps (%eax,%ymm1,4), %ymm0 {%k1}
>  ; SKX_32-NEXT:    retl
>
> @@ -2725,28 +2682,26 @@ declare <8 x float> @llvm.masked.gather.
>  define void @test_scatter_2i32_index(<2 x double> %a1, double* %base, <2 x i32> %ind, <2 x i1> %mask) {
>  ; KNL_64-LABEL: test_scatter_2i32_index:
>  ; KNL_64:       # %bb.0:
> +; KNL_64-NEXT:    # kill: def $xmm1 killed $xmm1 def $ymm1
>  ; KNL_64-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
> -; KNL_64-NEXT:    vpsllq $32, %xmm1, %xmm1
> -; KNL_64-NEXT:    vpsraq $32, %zmm1, %zmm1
>  ; KNL_64-NEXT:    vpsllq $63, %xmm2, %xmm2
>  ; KNL_64-NEXT:    vptestmq %zmm2, %zmm2, %k0
>  ; KNL_64-NEXT:    kshiftlw $14, %k0, %k0
>  ; KNL_64-NEXT:    kshiftrw $14, %k0, %k1
> -; KNL_64-NEXT:    vscatterqpd %zmm0, (%rdi,%zmm1,8) {%k1}
> +; KNL_64-NEXT:    vscatterdpd %zmm0, (%rdi,%ymm1,8) {%k1}
>  ; KNL_64-NEXT:    vzeroupper
>  ; KNL_64-NEXT:    retq
>  ;
>  ; KNL_32-LABEL: test_scatter_2i32_index:
>  ; KNL_32:       # %bb.0:
> +; KNL_32-NEXT:    # kill: def $xmm1 killed $xmm1 def $ymm1
>  ; KNL_32-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
> -; KNL_32-NEXT:    vpsllq $32, %xmm1, %xmm1
> -; KNL_32-NEXT:    vpsraq $32, %zmm1, %zmm1
>  ; KNL_32-NEXT:    vpsllq $63, %xmm2, %xmm2
>  ; KNL_32-NEXT:    vptestmq %zmm2, %zmm2, %k0
>  ; KNL_32-NEXT:    kshiftlw $14, %k0, %k0
>  ; KNL_32-NEXT:    kshiftrw $14, %k0, %k1
>  ; KNL_32-NEXT:    movl {{[0-9]+}}(%esp), %eax
> -; KNL_32-NEXT:    vscatterqpd %zmm0, (%eax,%zmm1,8) {%k1}
> +; KNL_32-NEXT:    vscatterdpd %zmm0, (%eax,%ymm1,8) {%k1}
>  ; KNL_32-NEXT:    vzeroupper
>  ; KNL_32-NEXT:    retl
>  ;
> @@ -2754,19 +2709,15 @@ define void @test_scatter_2i32_index(<2
>  ; SKX:       # %bb.0:
>  ; SKX-NEXT:    vpsllq $63, %xmm2, %xmm2
>  ; SKX-NEXT:    vpmovq2m %xmm2, %k1
> -; SKX-NEXT:    vpsllq $32, %xmm1, %xmm1
> -; SKX-NEXT:    vpsraq $32, %xmm1, %xmm1
> -; SKX-NEXT:    vscatterqpd %xmm0, (%rdi,%xmm1,8) {%k1}
> +; SKX-NEXT:    vscatterdpd %xmm0, (%rdi,%xmm1,8) {%k1}
>  ; SKX-NEXT:    retq
>  ;
>  ; SKX_32-LABEL: test_scatter_2i32_index:
>  ; SKX_32:       # %bb.0:
>  ; SKX_32-NEXT:    vpsllq $63, %xmm2, %xmm2
>  ; SKX_32-NEXT:    vpmovq2m %xmm2, %k1
> -; SKX_32-NEXT:    vpsllq $32, %xmm1, %xmm1
> -; SKX_32-NEXT:    vpsraq $32, %xmm1, %xmm1
>  ; SKX_32-NEXT:    movl {{[0-9]+}}(%esp), %eax
> -; SKX_32-NEXT:    vscatterqpd %xmm0, (%eax,%xmm1,8) {%k1}
> +; SKX_32-NEXT:    vscatterdpd %xmm0, (%eax,%xmm1,8) {%k1}
>  ; SKX_32-NEXT:    retl
>    %gep = getelementptr double, double *%base, <2 x i32> %ind
>    call void @llvm.masked.scatter.v2f64.v2p0f64(<2 x double> %a1, <2 x double*> %gep, i32 4, <2 x i1> %mask)
>
> Modified: llvm/trunk/test/CodeGen/X86/masked_gather_scatter_widen.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/masked_gather_scatter_widen.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/masked_gather_scatter_widen.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/masked_gather_scatter_widen.ll Wed Aug  7 09:24:26 2019
> @@ -30,24 +30,21 @@ define <2 x double> @test_gather_v2i32_i
>  ;
>  ; PROMOTE_SKX-LABEL: test_gather_v2i32_index:
>  ; PROMOTE_SKX:       # %bb.0:
> -; PROMOTE_SKX-NEXT:    vpsllq $32, %xmm0, %xmm0
> -; PROMOTE_SKX-NEXT:    vpsraq $32, %xmm0, %xmm0
>  ; PROMOTE_SKX-NEXT:    vpsllq $63, %xmm1, %xmm1
>  ; PROMOTE_SKX-NEXT:    vpmovq2m %xmm1, %k1
> -; PROMOTE_SKX-NEXT:    vgatherqpd (%rdi,%xmm0,8), %xmm2 {%k1}
> +; PROMOTE_SKX-NEXT:    vgatherdpd (%rdi,%xmm0,8), %xmm2 {%k1}
>  ; PROMOTE_SKX-NEXT:    vmovapd %xmm2, %xmm0
>  ; PROMOTE_SKX-NEXT:    retq
>  ;
>  ; PROMOTE_KNL-LABEL: test_gather_v2i32_index:
>  ; PROMOTE_KNL:       # %bb.0:
>  ; PROMOTE_KNL-NEXT:    # kill: def $xmm2 killed $xmm2 def $zmm2
> -; PROMOTE_KNL-NEXT:    vpsllq $32, %xmm0, %xmm0
> -; PROMOTE_KNL-NEXT:    vpsraq $32, %zmm0, %zmm0
> +; PROMOTE_KNL-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
>  ; PROMOTE_KNL-NEXT:    vpsllq $63, %xmm1, %xmm1
>  ; PROMOTE_KNL-NEXT:    vptestmq %zmm1, %zmm1, %k0
>  ; PROMOTE_KNL-NEXT:    kshiftlw $14, %k0, %k0
>  ; PROMOTE_KNL-NEXT:    kshiftrw $14, %k0, %k1
> -; PROMOTE_KNL-NEXT:    vgatherqpd (%rdi,%zmm0,8), %zmm2 {%k1}
> +; PROMOTE_KNL-NEXT:    vgatherdpd (%rdi,%ymm0,8), %zmm2 {%k1}
>  ; PROMOTE_KNL-NEXT:    vmovapd %xmm2, %xmm0
>  ; PROMOTE_KNL-NEXT:    vzeroupper
>  ; PROMOTE_KNL-NEXT:    retq
> @@ -61,11 +58,8 @@ define <2 x double> @test_gather_v2i32_i
>  ;
>  ; PROMOTE_AVX2-LABEL: test_gather_v2i32_index:
>  ; PROMOTE_AVX2:       # %bb.0:
> -; PROMOTE_AVX2-NEXT:    vpsllq $32, %xmm0, %xmm3
> -; PROMOTE_AVX2-NEXT:    vpsrad $31, %xmm3, %xmm3
> -; PROMOTE_AVX2-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm3[1],xmm0[2],xmm3[3]
>  ; PROMOTE_AVX2-NEXT:    vpsllq $63, %xmm1, %xmm1
> -; PROMOTE_AVX2-NEXT:    vgatherqpd %xmm1, (%rdi,%xmm0,8), %xmm2
> +; PROMOTE_AVX2-NEXT:    vgatherdpd %xmm1, (%rdi,%xmm0,8), %xmm2
>  ; PROMOTE_AVX2-NEXT:    vmovapd %xmm2, %xmm0
>  ; PROMOTE_AVX2-NEXT:    retq
>    %gep.random = getelementptr double, double* %base, <2 x i32> %ind
> @@ -97,21 +91,18 @@ define void @test_scatter_v2i32_index(<2
>  ; PROMOTE_SKX:       # %bb.0:
>  ; PROMOTE_SKX-NEXT:    vpsllq $63, %xmm2, %xmm2
>  ; PROMOTE_SKX-NEXT:    vpmovq2m %xmm2, %k1
> -; PROMOTE_SKX-NEXT:    vpsllq $32, %xmm1, %xmm1
> -; PROMOTE_SKX-NEXT:    vpsraq $32, %xmm1, %xmm1
> -; PROMOTE_SKX-NEXT:    vscatterqpd %xmm0, (%rdi,%xmm1,8) {%k1}
> +; PROMOTE_SKX-NEXT:    vscatterdpd %xmm0, (%rdi,%xmm1,8) {%k1}
>  ; PROMOTE_SKX-NEXT:    retq
>  ;
>  ; PROMOTE_KNL-LABEL: test_scatter_v2i32_index:
>  ; PROMOTE_KNL:       # %bb.0:
> +; PROMOTE_KNL-NEXT:    # kill: def $xmm1 killed $xmm1 def $ymm1
>  ; PROMOTE_KNL-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
> -; PROMOTE_KNL-NEXT:    vpsllq $32, %xmm1, %xmm1
> -; PROMOTE_KNL-NEXT:    vpsraq $32, %zmm1, %zmm1
>  ; PROMOTE_KNL-NEXT:    vpsllq $63, %xmm2, %xmm2
>  ; PROMOTE_KNL-NEXT:    vptestmq %zmm2, %zmm2, %k0
>  ; PROMOTE_KNL-NEXT:    kshiftlw $14, %k0, %k0
>  ; PROMOTE_KNL-NEXT:    kshiftrw $14, %k0, %k1
> -; PROMOTE_KNL-NEXT:    vscatterqpd %zmm0, (%rdi,%zmm1,8) {%k1}
> +; PROMOTE_KNL-NEXT:    vscatterdpd %zmm0, (%rdi,%ymm1,8) {%k1}
>  ; PROMOTE_KNL-NEXT:    vzeroupper
>  ; PROMOTE_KNL-NEXT:    retq
>  ;
> @@ -143,9 +134,7 @@ define void @test_scatter_v2i32_index(<2
>  ;
>  ; PROMOTE_AVX2-LABEL: test_scatter_v2i32_index:
>  ; PROMOTE_AVX2:       # %bb.0:
> -; PROMOTE_AVX2-NEXT:    vpsllq $32, %xmm1, %xmm3
> -; PROMOTE_AVX2-NEXT:    vpsrad $31, %xmm3, %xmm3
> -; PROMOTE_AVX2-NEXT:    vpblendd {{.*#+}} xmm1 = xmm1[0],xmm3[1],xmm1[2],xmm3[3]
> +; PROMOTE_AVX2-NEXT:    vpmovsxdq %xmm1, %xmm1
>  ; PROMOTE_AVX2-NEXT:    vpsllq $3, %xmm1, %xmm1
>  ; PROMOTE_AVX2-NEXT:    vmovq %rdi, %xmm3
>  ; PROMOTE_AVX2-NEXT:    vpbroadcastq %xmm3, %xmm3
> @@ -199,21 +188,20 @@ define <2 x i32> @test_gather_v2i32_data
>  ; PROMOTE_SKX:       # %bb.0:
>  ; PROMOTE_SKX-NEXT:    vpsllq $63, %xmm1, %xmm1
>  ; PROMOTE_SKX-NEXT:    vpmovq2m %xmm1, %k1
> -; PROMOTE_SKX-NEXT:    vpshufd {{.*#+}} xmm1 = xmm2[0,2,2,3]
> -; PROMOTE_SKX-NEXT:    vpgatherqd (,%xmm0), %xmm1 {%k1}
> -; PROMOTE_SKX-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm1[0],zero,xmm1[1],zero
> +; PROMOTE_SKX-NEXT:    vpgatherqd (,%xmm0), %xmm2 {%k1}
> +; PROMOTE_SKX-NEXT:    vmovdqa %xmm2, %xmm0
>  ; PROMOTE_SKX-NEXT:    retq
>  ;
>  ; PROMOTE_KNL-LABEL: test_gather_v2i32_data:
>  ; PROMOTE_KNL:       # %bb.0:
> +; PROMOTE_KNL-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
>  ; PROMOTE_KNL-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
>  ; PROMOTE_KNL-NEXT:    vpsllq $63, %xmm1, %xmm1
>  ; PROMOTE_KNL-NEXT:    vptestmq %zmm1, %zmm1, %k0
> -; PROMOTE_KNL-NEXT:    vpshufd {{.*#+}} xmm1 = xmm2[0,2,2,3]
>  ; PROMOTE_KNL-NEXT:    kshiftlw $14, %k0, %k0
>  ; PROMOTE_KNL-NEXT:    kshiftrw $14, %k0, %k1
> -; PROMOTE_KNL-NEXT:    vpgatherqd (,%zmm0), %ymm1 {%k1}
> -; PROMOTE_KNL-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm1[0],zero,xmm1[1],zero
> +; PROMOTE_KNL-NEXT:    vpgatherqd (,%zmm0), %ymm2 {%k1}
> +; PROMOTE_KNL-NEXT:    vmovdqa %xmm2, %xmm0
>  ; PROMOTE_KNL-NEXT:    vzeroupper
>  ; PROMOTE_KNL-NEXT:    retq
>  ;
> @@ -227,11 +215,10 @@ define <2 x i32> @test_gather_v2i32_data
>  ;
>  ; PROMOTE_AVX2-LABEL: test_gather_v2i32_data:
>  ; PROMOTE_AVX2:       # %bb.0:
> -; PROMOTE_AVX2-NEXT:    vpshufd {{.*#+}} xmm2 = xmm2[0,2,2,3]
>  ; PROMOTE_AVX2-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
>  ; PROMOTE_AVX2-NEXT:    vpslld $31, %xmm1, %xmm1
>  ; PROMOTE_AVX2-NEXT:    vpgatherqd %xmm1, (,%xmm0), %xmm2
> -; PROMOTE_AVX2-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm2[0],zero,xmm2[1],zero
> +; PROMOTE_AVX2-NEXT:    vmovdqa %xmm2, %xmm0
>  ; PROMOTE_AVX2-NEXT:    retq
>    %res = call <2 x i32> @llvm.masked.gather.v2i32.v2p0i32(<2 x i32*> %ptr, i32 4, <2 x i1> %mask, <2 x i32> %src0)
>    ret <2 x i32>%res
> @@ -261,16 +248,15 @@ define void @test_scatter_v2i32_data(<2
>  ; PROMOTE_SKX:       # %bb.0:
>  ; PROMOTE_SKX-NEXT:    vpsllq $63, %xmm2, %xmm2
>  ; PROMOTE_SKX-NEXT:    vpmovq2m %xmm2, %k1
> -; PROMOTE_SKX-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
>  ; PROMOTE_SKX-NEXT:    vpscatterqd %xmm0, (,%xmm1) {%k1}
>  ; PROMOTE_SKX-NEXT:    retq
>  ;
>  ; PROMOTE_KNL-LABEL: test_scatter_v2i32_data:
>  ; PROMOTE_KNL:       # %bb.0:
>  ; PROMOTE_KNL-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> +; PROMOTE_KNL-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
>  ; PROMOTE_KNL-NEXT:    vpsllq $63, %xmm2, %xmm2
>  ; PROMOTE_KNL-NEXT:    vptestmq %zmm2, %zmm2, %k0
> -; PROMOTE_KNL-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
>  ; PROMOTE_KNL-NEXT:    kshiftlw $14, %k0, %k0
>  ; PROMOTE_KNL-NEXT:    kshiftrw $14, %k0, %k1
>  ; PROMOTE_KNL-NEXT:    vpscatterqd %ymm0, (,%zmm1) {%k1}
> @@ -316,7 +302,7 @@ define void @test_scatter_v2i32_data(<2
>  ; PROMOTE_AVX2-NEXT:    je .LBB3_4
>  ; PROMOTE_AVX2-NEXT:  .LBB3_3: # %cond.store1
>  ; PROMOTE_AVX2-NEXT:    vpextrq $1, %xmm1, %rax
> -; PROMOTE_AVX2-NEXT:    vextractps $2, %xmm0, (%rax)
> +; PROMOTE_AVX2-NEXT:    vextractps $1, %xmm0, (%rax)
>  ; PROMOTE_AVX2-NEXT:    retq
>    call void @llvm.masked.scatter.v2i32.v2p0i32(<2 x i32> %a1, <2 x i32*> %ptr, i32 4, <2 x i1> %mask)
>    ret void
> @@ -348,22 +334,20 @@ define <2 x i32> @test_gather_v2i32_data
>  ; PROMOTE_SKX:       # %bb.0:
>  ; PROMOTE_SKX-NEXT:    vpsllq $63, %xmm1, %xmm1
>  ; PROMOTE_SKX-NEXT:    vpmovq2m %xmm1, %k1
> -; PROMOTE_SKX-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> -; PROMOTE_SKX-NEXT:    vpshufd {{.*#+}} xmm1 = xmm2[0,2,2,3]
> -; PROMOTE_SKX-NEXT:    vpgatherdd (%rdi,%xmm0,4), %xmm1 {%k1}
> -; PROMOTE_SKX-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm1[0],zero,xmm1[1],zero
> +; PROMOTE_SKX-NEXT:    vpgatherdd (%rdi,%xmm0,4), %xmm2 {%k1}
> +; PROMOTE_SKX-NEXT:    vmovdqa %xmm2, %xmm0
>  ; PROMOTE_SKX-NEXT:    retq
>  ;
>  ; PROMOTE_KNL-LABEL: test_gather_v2i32_data_index:
>  ; PROMOTE_KNL:       # %bb.0:
> +; PROMOTE_KNL-NEXT:    # kill: def $xmm2 killed $xmm2 def $zmm2
> +; PROMOTE_KNL-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
>  ; PROMOTE_KNL-NEXT:    vpsllq $63, %xmm1, %xmm1
>  ; PROMOTE_KNL-NEXT:    vptestmq %zmm1, %zmm1, %k0
> -; PROMOTE_KNL-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> -; PROMOTE_KNL-NEXT:    vpshufd {{.*#+}} xmm1 = xmm2[0,2,2,3]
>  ; PROMOTE_KNL-NEXT:    kshiftlw $14, %k0, %k0
>  ; PROMOTE_KNL-NEXT:    kshiftrw $14, %k0, %k1
> -; PROMOTE_KNL-NEXT:    vpgatherdd (%rdi,%zmm0,4), %zmm1 {%k1}
> -; PROMOTE_KNL-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm1[0],zero,xmm1[1],zero
> +; PROMOTE_KNL-NEXT:    vpgatherdd (%rdi,%zmm0,4), %zmm2 {%k1}
> +; PROMOTE_KNL-NEXT:    vmovdqa %xmm2, %xmm0
>  ; PROMOTE_KNL-NEXT:    vzeroupper
>  ; PROMOTE_KNL-NEXT:    retq
>  ;
> @@ -377,12 +361,10 @@ define <2 x i32> @test_gather_v2i32_data
>  ;
>  ; PROMOTE_AVX2-LABEL: test_gather_v2i32_data_index:
>  ; PROMOTE_AVX2:       # %bb.0:
> -; PROMOTE_AVX2-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> -; PROMOTE_AVX2-NEXT:    vpshufd {{.*#+}} xmm2 = xmm2[0,2,2,3]
>  ; PROMOTE_AVX2-NEXT:    vinsertps {{.*#+}} xmm1 = xmm1[0,2],zero,zero
>  ; PROMOTE_AVX2-NEXT:    vpslld $31, %xmm1, %xmm1
>  ; PROMOTE_AVX2-NEXT:    vpgatherdd %xmm1, (%rdi,%xmm0,4), %xmm2
> -; PROMOTE_AVX2-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm2[0],zero,xmm2[1],zero
> +; PROMOTE_AVX2-NEXT:    vmovdqa %xmm2, %xmm0
>  ; PROMOTE_AVX2-NEXT:    retq
>    %gep.random = getelementptr i32, i32* %base, <2 x i32> %ind
>    %res = call <2 x i32> @llvm.masked.gather.v2i32.v2p0i32(<2 x i32*> %gep.random, i32 4, <2 x i1> %mask, <2 x i32> %src0)
> @@ -413,17 +395,15 @@ define void @test_scatter_v2i32_data_ind
>  ; PROMOTE_SKX:       # %bb.0:
>  ; PROMOTE_SKX-NEXT:    vpsllq $63, %xmm2, %xmm2
>  ; PROMOTE_SKX-NEXT:    vpmovq2m %xmm2, %k1
> -; PROMOTE_SKX-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> -; PROMOTE_SKX-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
>  ; PROMOTE_SKX-NEXT:    vpscatterdd %xmm0, (%rdi,%xmm1,4) {%k1}
>  ; PROMOTE_SKX-NEXT:    retq
>  ;
>  ; PROMOTE_KNL-LABEL: test_scatter_v2i32_data_index:
>  ; PROMOTE_KNL:       # %bb.0:
> +; PROMOTE_KNL-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
> +; PROMOTE_KNL-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
>  ; PROMOTE_KNL-NEXT:    vpsllq $63, %xmm2, %xmm2
>  ; PROMOTE_KNL-NEXT:    vptestmq %zmm2, %zmm2, %k0
> -; PROMOTE_KNL-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> -; PROMOTE_KNL-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
>  ; PROMOTE_KNL-NEXT:    kshiftlw $14, %k0, %k0
>  ; PROMOTE_KNL-NEXT:    kshiftrw $14, %k0, %k1
>  ; PROMOTE_KNL-NEXT:    vpscatterdd %zmm0, (%rdi,%zmm1,4) {%k1}
> @@ -458,9 +438,7 @@ define void @test_scatter_v2i32_data_ind
>  ;
>  ; PROMOTE_AVX2-LABEL: test_scatter_v2i32_data_index:
>  ; PROMOTE_AVX2:       # %bb.0:
> -; PROMOTE_AVX2-NEXT:    vpsllq $32, %xmm1, %xmm3
> -; PROMOTE_AVX2-NEXT:    vpsrad $31, %xmm3, %xmm3
> -; PROMOTE_AVX2-NEXT:    vpblendd {{.*#+}} xmm1 = xmm1[0],xmm3[1],xmm1[2],xmm3[3]
> +; PROMOTE_AVX2-NEXT:    vpmovsxdq %xmm1, %xmm1
>  ; PROMOTE_AVX2-NEXT:    vpsllq $2, %xmm1, %xmm1
>  ; PROMOTE_AVX2-NEXT:    vmovq %rdi, %xmm3
>  ; PROMOTE_AVX2-NEXT:    vpbroadcastq %xmm3, %xmm3
> @@ -481,7 +459,7 @@ define void @test_scatter_v2i32_data_ind
>  ; PROMOTE_AVX2-NEXT:    je .LBB5_4
>  ; PROMOTE_AVX2-NEXT:  .LBB5_3: # %cond.store1
>  ; PROMOTE_AVX2-NEXT:    vpextrq $1, %xmm1, %rax
> -; PROMOTE_AVX2-NEXT:    vextractps $2, %xmm0, (%rax)
> +; PROMOTE_AVX2-NEXT:    vextractps $1, %xmm0, (%rax)
>  ; PROMOTE_AVX2-NEXT:    retq
>    %gep = getelementptr i32, i32 *%base, <2 x i32> %ind
>    call void @llvm.masked.scatter.v2i32.v2p0i32(<2 x i32> %a1, <2 x i32*> %gep, i32 4, <2 x i1> %mask)
>
> Modified: llvm/trunk/test/CodeGen/X86/masked_load.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/masked_load.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/masked_load.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/masked_load.ll Wed Aug  7 09:24:26 2019
> @@ -458,38 +458,40 @@ define <8 x double> @load_v8f64_v8i16(<8
>  ;
>  ; AVX1-LABEL: load_v8f64_v8i16:
>  ; AVX1:       ## %bb.0:
> -; AVX1-NEXT:    vpxor %xmm3, %xmm3, %xmm3
> -; AVX1-NEXT:    vpunpckhwd {{.*#+}} xmm4 = xmm0[4],xmm3[4],xmm0[5],xmm3[5],xmm0[6],xmm3[6],xmm0[7],xmm3[7]
> -; AVX1-NEXT:    vpcmpeqd %xmm3, %xmm4, %xmm4
> -; AVX1-NEXT:    vpmovsxdq %xmm4, %xmm5
> -; AVX1-NEXT:    vpshufd {{.*#+}} xmm4 = xmm4[2,3,0,1]
> -; AVX1-NEXT:    vpmovsxdq %xmm4, %xmm4
> -; AVX1-NEXT:    vinsertf128 $1, %xmm4, %ymm5, %ymm4
> -; AVX1-NEXT:    vpmovzxwd {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero
> -; AVX1-NEXT:    vpcmpeqd %xmm3, %xmm0, %xmm0
> -; AVX1-NEXT:    vpmovsxdq %xmm0, %xmm3
> +; AVX1-NEXT:    vpshufd {{.*#+}} xmm3 = xmm0[2,3,0,1]
> +; AVX1-NEXT:    vpxor %xmm4, %xmm4, %xmm4
> +; AVX1-NEXT:    vpcmpeqw %xmm4, %xmm3, %xmm3
> +; AVX1-NEXT:    vpmovsxwd %xmm3, %xmm3
> +; AVX1-NEXT:    vpmovsxdq %xmm3, %xmm5
> +; AVX1-NEXT:    vpshufd {{.*#+}} xmm3 = xmm3[2,3,0,1]
> +; AVX1-NEXT:    vpmovsxdq %xmm3, %xmm3
> +; AVX1-NEXT:    vinsertf128 $1, %xmm3, %ymm5, %ymm3
> +; AVX1-NEXT:    vpcmpeqw %xmm4, %xmm0, %xmm0
> +; AVX1-NEXT:    vpmovsxwd %xmm0, %xmm0
> +; AVX1-NEXT:    vpmovsxdq %xmm0, %xmm4
>  ; AVX1-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[2,3,0,1]
>  ; AVX1-NEXT:    vpmovsxdq %xmm0, %xmm0
> -; AVX1-NEXT:    vinsertf128 $1, %xmm0, %ymm3, %ymm0
> -; AVX1-NEXT:    vmaskmovpd (%rdi), %ymm0, %ymm3
> -; AVX1-NEXT:    vblendvpd %ymm0, %ymm3, %ymm1, %ymm0
> -; AVX1-NEXT:    vmaskmovpd 32(%rdi), %ymm4, %ymm1
> -; AVX1-NEXT:    vblendvpd %ymm4, %ymm1, %ymm2, %ymm1
> +; AVX1-NEXT:    vinsertf128 $1, %xmm0, %ymm4, %ymm0
> +; AVX1-NEXT:    vmaskmovpd (%rdi), %ymm0, %ymm4
> +; AVX1-NEXT:    vblendvpd %ymm0, %ymm4, %ymm1, %ymm0
> +; AVX1-NEXT:    vmaskmovpd 32(%rdi), %ymm3, %ymm1
> +; AVX1-NEXT:    vblendvpd %ymm3, %ymm1, %ymm2, %ymm1
>  ; AVX1-NEXT:    retq
>  ;
>  ; AVX2-LABEL: load_v8f64_v8i16:
>  ; AVX2:       ## %bb.0:
> -; AVX2-NEXT:    vpxor %xmm3, %xmm3, %xmm3
> -; AVX2-NEXT:    vpunpckhwd {{.*#+}} xmm4 = xmm0[4],xmm3[4],xmm0[5],xmm3[5],xmm0[6],xmm3[6],xmm0[7],xmm3[7]
> -; AVX2-NEXT:    vpcmpeqd %xmm3, %xmm4, %xmm4
> -; AVX2-NEXT:    vpmovsxdq %xmm4, %ymm4
> -; AVX2-NEXT:    vpmovzxwd {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero
> -; AVX2-NEXT:    vpcmpeqd %xmm3, %xmm0, %xmm0
> +; AVX2-NEXT:    vpshufd {{.*#+}} xmm3 = xmm0[2,3,0,1]
> +; AVX2-NEXT:    vpxor %xmm4, %xmm4, %xmm4
> +; AVX2-NEXT:    vpcmpeqw %xmm4, %xmm3, %xmm3
> +; AVX2-NEXT:    vpmovsxwd %xmm3, %xmm3
> +; AVX2-NEXT:    vpmovsxdq %xmm3, %ymm3
> +; AVX2-NEXT:    vpcmpeqw %xmm4, %xmm0, %xmm0
> +; AVX2-NEXT:    vpmovsxwd %xmm0, %xmm0
>  ; AVX2-NEXT:    vpmovsxdq %xmm0, %ymm0
> -; AVX2-NEXT:    vmaskmovpd (%rdi), %ymm0, %ymm3
> -; AVX2-NEXT:    vblendvpd %ymm0, %ymm3, %ymm1, %ymm0
> -; AVX2-NEXT:    vmaskmovpd 32(%rdi), %ymm4, %ymm1
> -; AVX2-NEXT:    vblendvpd %ymm4, %ymm1, %ymm2, %ymm1
> +; AVX2-NEXT:    vmaskmovpd (%rdi), %ymm0, %ymm4
> +; AVX2-NEXT:    vblendvpd %ymm0, %ymm4, %ymm1, %ymm0
> +; AVX2-NEXT:    vmaskmovpd 32(%rdi), %ymm3, %ymm1
> +; AVX2-NEXT:    vblendvpd %ymm3, %ymm1, %ymm2, %ymm1
>  ; AVX2-NEXT:    retq
>  ;
>  ; AVX512F-LABEL: load_v8f64_v8i16:
> @@ -723,11 +725,9 @@ define <8 x double> @load_v8f64_v8i64(<8
>  define <2 x float> @load_v2f32_v2i32(<2 x i32> %trigger, <2 x float>* %addr, <2 x float> %dst) {
>  ; SSE2-LABEL: load_v2f32_v2i32:
>  ; SSE2:       ## %bb.0:
> -; SSE2-NEXT:    pand {{.*}}(%rip), %xmm0
>  ; SSE2-NEXT:    pxor %xmm2, %xmm2
>  ; SSE2-NEXT:    pcmpeqd %xmm0, %xmm2
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm2[1,0,3,2]
> -; SSE2-NEXT:    pand %xmm2, %xmm0
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm2[0,0,1,1]
>  ; SSE2-NEXT:    movmskpd %xmm0, %eax
>  ; SSE2-NEXT:    testb $1, %al
>  ; SSE2-NEXT:    jne LBB7_1
> @@ -753,8 +753,8 @@ define <2 x float> @load_v2f32_v2i32(<2
>  ; SSE42-LABEL: load_v2f32_v2i32:
>  ; SSE42:       ## %bb.0:
>  ; SSE42-NEXT:    pxor %xmm2, %xmm2
> -; SSE42-NEXT:    pblendw {{.*#+}} xmm0 = xmm0[0,1],xmm2[2,3],xmm0[4,5],xmm2[6,7]
> -; SSE42-NEXT:    pcmpeqq %xmm2, %xmm0
> +; SSE42-NEXT:    pcmpeqd %xmm0, %xmm2
> +; SSE42-NEXT:    pmovsxdq %xmm2, %xmm0
>  ; SSE42-NEXT:    movmskpd %xmm0, %eax
>  ; SSE42-NEXT:    testb $1, %al
>  ; SSE42-NEXT:    jne LBB7_1
> @@ -774,32 +774,20 @@ define <2 x float> @load_v2f32_v2i32(<2
>  ; SSE42-NEXT:    movaps %xmm1, %xmm0
>  ; SSE42-NEXT:    retq
>  ;
> -; AVX1-LABEL: load_v2f32_v2i32:
> -; AVX1:       ## %bb.0:
> -; AVX1-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> -; AVX1-NEXT:    vpblendw {{.*#+}} xmm0 = xmm0[0,1],xmm2[2,3],xmm0[4,5],xmm2[6,7]
> -; AVX1-NEXT:    vpcmpeqq %xmm2, %xmm0, %xmm0
> -; AVX1-NEXT:    vinsertps {{.*#+}} xmm0 = xmm0[0,2],zero,zero
> -; AVX1-NEXT:    vmaskmovps (%rdi), %xmm0, %xmm2
> -; AVX1-NEXT:    vblendvps %xmm0, %xmm2, %xmm1, %xmm0
> -; AVX1-NEXT:    retq
> -;
> -; AVX2-LABEL: load_v2f32_v2i32:
> -; AVX2:       ## %bb.0:
> -; AVX2-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> -; AVX2-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm2[1],xmm0[2],xmm2[3]
> -; AVX2-NEXT:    vpcmpeqq %xmm2, %xmm0, %xmm0
> -; AVX2-NEXT:    vinsertps {{.*#+}} xmm0 = xmm0[0,2],zero,zero
> -; AVX2-NEXT:    vmaskmovps (%rdi), %xmm0, %xmm2
> -; AVX2-NEXT:    vblendvps %xmm0, %xmm2, %xmm1, %xmm0
> -; AVX2-NEXT:    retq
> +; AVX1OR2-LABEL: load_v2f32_v2i32:
> +; AVX1OR2:       ## %bb.0:
> +; AVX1OR2-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> +; AVX1OR2-NEXT:    vpcmpeqd %xmm2, %xmm0, %xmm0
> +; AVX1OR2-NEXT:    vmovq {{.*#+}} xmm0 = xmm0[0],zero
> +; AVX1OR2-NEXT:    vmaskmovps (%rdi), %xmm0, %xmm2
> +; AVX1OR2-NEXT:    vblendvps %xmm0, %xmm2, %xmm1, %xmm0
> +; AVX1OR2-NEXT:    retq
>  ;
>  ; AVX512F-LABEL: load_v2f32_v2i32:
>  ; AVX512F:       ## %bb.0:
>  ; AVX512F-NEXT:    ## kill: def $xmm1 killed $xmm1 def $zmm1
> -; AVX512F-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> -; AVX512F-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm2[1],xmm0[2],xmm2[3]
> -; AVX512F-NEXT:    vptestnmq %zmm0, %zmm0, %k0
> +; AVX512F-NEXT:    ## kill: def $xmm0 killed $xmm0 def $zmm0
> +; AVX512F-NEXT:    vptestnmd %zmm0, %zmm0, %k0
>  ; AVX512F-NEXT:    kshiftlw $14, %k0, %k0
>  ; AVX512F-NEXT:    kshiftrw $14, %k0, %k1
>  ; AVX512F-NEXT:    vblendmps (%rdi), %zmm1, %zmm0 {%k1}
> @@ -807,13 +795,21 @@ define <2 x float> @load_v2f32_v2i32(<2
>  ; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
>  ;
> -; AVX512VL-LABEL: load_v2f32_v2i32:
> -; AVX512VL:       ## %bb.0:
> -; AVX512VL-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> -; AVX512VL-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm2[1],xmm0[2],xmm2[3]
> -; AVX512VL-NEXT:    vptestnmq %xmm0, %xmm0, %k1
> -; AVX512VL-NEXT:    vblendmps (%rdi), %xmm1, %xmm0 {%k1}
> -; AVX512VL-NEXT:    retq
> +; AVX512VLDQ-LABEL: load_v2f32_v2i32:
> +; AVX512VLDQ:       ## %bb.0:
> +; AVX512VLDQ-NEXT:    vptestnmd %xmm0, %xmm0, %k0
> +; AVX512VLDQ-NEXT:    kshiftlb $6, %k0, %k0
> +; AVX512VLDQ-NEXT:    kshiftrb $6, %k0, %k1
> +; AVX512VLDQ-NEXT:    vblendmps (%rdi), %xmm1, %xmm0 {%k1}
> +; AVX512VLDQ-NEXT:    retq
> +;
> +; AVX512VLBW-LABEL: load_v2f32_v2i32:
> +; AVX512VLBW:       ## %bb.0:
> +; AVX512VLBW-NEXT:    vptestnmd %xmm0, %xmm0, %k0
> +; AVX512VLBW-NEXT:    kshiftlw $14, %k0, %k0
> +; AVX512VLBW-NEXT:    kshiftrw $14, %k0, %k1
> +; AVX512VLBW-NEXT:    vblendmps (%rdi), %xmm1, %xmm0 {%k1}
> +; AVX512VLBW-NEXT:    retq
>    %mask = icmp eq <2 x i32> %trigger, zeroinitializer
>    %res = call <2 x float> @llvm.masked.load.v2f32.p0v2f32(<2 x float>* %addr, i32 4, <2 x i1> %mask, <2 x float> %dst)
>    ret <2 x float> %res
> @@ -822,11 +818,9 @@ define <2 x float> @load_v2f32_v2i32(<2
>  define <2 x float> @load_v2f32_v2i32_undef(<2 x i32> %trigger, <2 x float>* %addr) {
>  ; SSE2-LABEL: load_v2f32_v2i32_undef:
>  ; SSE2:       ## %bb.0:
> -; SSE2-NEXT:    pand {{.*}}(%rip), %xmm0
>  ; SSE2-NEXT:    pxor %xmm1, %xmm1
>  ; SSE2-NEXT:    pcmpeqd %xmm0, %xmm1
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm1[1,0,3,2]
> -; SSE2-NEXT:    pand %xmm1, %xmm0
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm1[0,0,1,1]
>  ; SSE2-NEXT:    movmskpd %xmm0, %eax
>  ; SSE2-NEXT:    testb $1, %al
>  ; SSE2-NEXT:    ## implicit-def: $xmm0
> @@ -850,8 +844,8 @@ define <2 x float> @load_v2f32_v2i32_und
>  ; SSE42-LABEL: load_v2f32_v2i32_undef:
>  ; SSE42:       ## %bb.0:
>  ; SSE42-NEXT:    pxor %xmm1, %xmm1
> -; SSE42-NEXT:    pblendw {{.*#+}} xmm0 = xmm0[0,1],xmm1[2,3],xmm0[4,5],xmm1[6,7]
> -; SSE42-NEXT:    pcmpeqq %xmm1, %xmm0
> +; SSE42-NEXT:    pcmpeqd %xmm0, %xmm1
> +; SSE42-NEXT:    pmovsxdq %xmm1, %xmm0
>  ; SSE42-NEXT:    movmskpd %xmm0, %eax
>  ; SSE42-NEXT:    testb $1, %al
>  ; SSE42-NEXT:    ## implicit-def: $xmm0
> @@ -869,29 +863,18 @@ define <2 x float> @load_v2f32_v2i32_und
>  ; SSE42-NEXT:    insertps {{.*#+}} xmm0 = xmm0[0],mem[0],xmm0[2,3]
>  ; SSE42-NEXT:    retq
>  ;
> -; AVX1-LABEL: load_v2f32_v2i32_undef:
> -; AVX1:       ## %bb.0:
> -; AVX1-NEXT:    vpxor %xmm1, %xmm1, %xmm1
> -; AVX1-NEXT:    vpblendw {{.*#+}} xmm0 = xmm0[0,1],xmm1[2,3],xmm0[4,5],xmm1[6,7]
> -; AVX1-NEXT:    vpcmpeqq %xmm1, %xmm0, %xmm0
> -; AVX1-NEXT:    vinsertps {{.*#+}} xmm0 = xmm0[0,2],zero,zero
> -; AVX1-NEXT:    vmaskmovps (%rdi), %xmm0, %xmm0
> -; AVX1-NEXT:    retq
> -;
> -; AVX2-LABEL: load_v2f32_v2i32_undef:
> -; AVX2:       ## %bb.0:
> -; AVX2-NEXT:    vpxor %xmm1, %xmm1, %xmm1
> -; AVX2-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm1[1],xmm0[2],xmm1[3]
> -; AVX2-NEXT:    vpcmpeqq %xmm1, %xmm0, %xmm0
> -; AVX2-NEXT:    vinsertps {{.*#+}} xmm0 = xmm0[0,2],zero,zero
> -; AVX2-NEXT:    vmaskmovps (%rdi), %xmm0, %xmm0
> -; AVX2-NEXT:    retq
> +; AVX1OR2-LABEL: load_v2f32_v2i32_undef:
> +; AVX1OR2:       ## %bb.0:
> +; AVX1OR2-NEXT:    vpxor %xmm1, %xmm1, %xmm1
> +; AVX1OR2-NEXT:    vpcmpeqd %xmm1, %xmm0, %xmm0
> +; AVX1OR2-NEXT:    vmovq {{.*#+}} xmm0 = xmm0[0],zero
> +; AVX1OR2-NEXT:    vmaskmovps (%rdi), %xmm0, %xmm0
> +; AVX1OR2-NEXT:    retq
>  ;
>  ; AVX512F-LABEL: load_v2f32_v2i32_undef:
>  ; AVX512F:       ## %bb.0:
> -; AVX512F-NEXT:    vpxor %xmm1, %xmm1, %xmm1
> -; AVX512F-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm1[1],xmm0[2],xmm1[3]
> -; AVX512F-NEXT:    vptestnmq %zmm0, %zmm0, %k0
> +; AVX512F-NEXT:    ## kill: def $xmm0 killed $xmm0 def $zmm0
> +; AVX512F-NEXT:    vptestnmd %zmm0, %zmm0, %k0
>  ; AVX512F-NEXT:    kshiftlw $14, %k0, %k0
>  ; AVX512F-NEXT:    kshiftrw $14, %k0, %k1
>  ; AVX512F-NEXT:    vmovups (%rdi), %zmm0 {%k1} {z}
> @@ -899,13 +882,21 @@ define <2 x float> @load_v2f32_v2i32_und
>  ; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
>  ;
> -; AVX512VL-LABEL: load_v2f32_v2i32_undef:
> -; AVX512VL:       ## %bb.0:
> -; AVX512VL-NEXT:    vpxor %xmm1, %xmm1, %xmm1
> -; AVX512VL-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm1[1],xmm0[2],xmm1[3]
> -; AVX512VL-NEXT:    vptestnmq %xmm0, %xmm0, %k1
> -; AVX512VL-NEXT:    vmovups (%rdi), %xmm0 {%k1} {z}
> -; AVX512VL-NEXT:    retq
> +; AVX512VLDQ-LABEL: load_v2f32_v2i32_undef:
> +; AVX512VLDQ:       ## %bb.0:
> +; AVX512VLDQ-NEXT:    vptestnmd %xmm0, %xmm0, %k0
> +; AVX512VLDQ-NEXT:    kshiftlb $6, %k0, %k0
> +; AVX512VLDQ-NEXT:    kshiftrb $6, %k0, %k1
> +; AVX512VLDQ-NEXT:    vmovups (%rdi), %xmm0 {%k1} {z}
> +; AVX512VLDQ-NEXT:    retq
> +;
> +; AVX512VLBW-LABEL: load_v2f32_v2i32_undef:
> +; AVX512VLBW:       ## %bb.0:
> +; AVX512VLBW-NEXT:    vptestnmd %xmm0, %xmm0, %k0
> +; AVX512VLBW-NEXT:    kshiftlw $14, %k0, %k0
> +; AVX512VLBW-NEXT:    kshiftrw $14, %k0, %k1
> +; AVX512VLBW-NEXT:    vmovups (%rdi), %xmm0 {%k1} {z}
> +; AVX512VLBW-NEXT:    retq
>    %mask = icmp eq <2 x i32> %trigger, zeroinitializer
>    %res = call <2 x float> @llvm.masked.load.v2f32.p0v2f32(<2 x float>* %addr, i32 4, <2 x i1> %mask, <2 x float>undef)
>    ret <2 x float> %res
> @@ -1792,38 +1783,40 @@ define <8 x i64> @load_v8i64_v8i16(<8 x
>  ;
>  ; AVX1-LABEL: load_v8i64_v8i16:
>  ; AVX1:       ## %bb.0:
> -; AVX1-NEXT:    vpxor %xmm3, %xmm3, %xmm3
> -; AVX1-NEXT:    vpunpckhwd {{.*#+}} xmm4 = xmm0[4],xmm3[4],xmm0[5],xmm3[5],xmm0[6],xmm3[6],xmm0[7],xmm3[7]
> -; AVX1-NEXT:    vpcmpeqd %xmm3, %xmm4, %xmm4
> -; AVX1-NEXT:    vpmovsxdq %xmm4, %xmm5
> -; AVX1-NEXT:    vpshufd {{.*#+}} xmm4 = xmm4[2,3,0,1]
> -; AVX1-NEXT:    vpmovsxdq %xmm4, %xmm4
> -; AVX1-NEXT:    vinsertf128 $1, %xmm4, %ymm5, %ymm4
> -; AVX1-NEXT:    vpmovzxwd {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero
> -; AVX1-NEXT:    vpcmpeqd %xmm3, %xmm0, %xmm0
> -; AVX1-NEXT:    vpmovsxdq %xmm0, %xmm3
> +; AVX1-NEXT:    vpshufd {{.*#+}} xmm3 = xmm0[2,3,0,1]
> +; AVX1-NEXT:    vpxor %xmm4, %xmm4, %xmm4
> +; AVX1-NEXT:    vpcmpeqw %xmm4, %xmm3, %xmm3
> +; AVX1-NEXT:    vpmovsxwd %xmm3, %xmm3
> +; AVX1-NEXT:    vpmovsxdq %xmm3, %xmm5
> +; AVX1-NEXT:    vpshufd {{.*#+}} xmm3 = xmm3[2,3,0,1]
> +; AVX1-NEXT:    vpmovsxdq %xmm3, %xmm3
> +; AVX1-NEXT:    vinsertf128 $1, %xmm3, %ymm5, %ymm3
> +; AVX1-NEXT:    vpcmpeqw %xmm4, %xmm0, %xmm0
> +; AVX1-NEXT:    vpmovsxwd %xmm0, %xmm0
> +; AVX1-NEXT:    vpmovsxdq %xmm0, %xmm4
>  ; AVX1-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[2,3,0,1]
>  ; AVX1-NEXT:    vpmovsxdq %xmm0, %xmm0
> -; AVX1-NEXT:    vinsertf128 $1, %xmm0, %ymm3, %ymm0
> -; AVX1-NEXT:    vmaskmovpd (%rdi), %ymm0, %ymm3
> -; AVX1-NEXT:    vblendvpd %ymm0, %ymm3, %ymm1, %ymm0
> -; AVX1-NEXT:    vmaskmovpd 32(%rdi), %ymm4, %ymm1
> -; AVX1-NEXT:    vblendvpd %ymm4, %ymm1, %ymm2, %ymm1
> +; AVX1-NEXT:    vinsertf128 $1, %xmm0, %ymm4, %ymm0
> +; AVX1-NEXT:    vmaskmovpd (%rdi), %ymm0, %ymm4
> +; AVX1-NEXT:    vblendvpd %ymm0, %ymm4, %ymm1, %ymm0
> +; AVX1-NEXT:    vmaskmovpd 32(%rdi), %ymm3, %ymm1
> +; AVX1-NEXT:    vblendvpd %ymm3, %ymm1, %ymm2, %ymm1
>  ; AVX1-NEXT:    retq
>  ;
>  ; AVX2-LABEL: load_v8i64_v8i16:
>  ; AVX2:       ## %bb.0:
> -; AVX2-NEXT:    vpxor %xmm3, %xmm3, %xmm3
> -; AVX2-NEXT:    vpunpckhwd {{.*#+}} xmm4 = xmm0[4],xmm3[4],xmm0[5],xmm3[5],xmm0[6],xmm3[6],xmm0[7],xmm3[7]
> -; AVX2-NEXT:    vpcmpeqd %xmm3, %xmm4, %xmm4
> -; AVX2-NEXT:    vpmovsxdq %xmm4, %ymm4
> -; AVX2-NEXT:    vpmovzxwd {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero
> -; AVX2-NEXT:    vpcmpeqd %xmm3, %xmm0, %xmm0
> +; AVX2-NEXT:    vpshufd {{.*#+}} xmm3 = xmm0[2,3,0,1]
> +; AVX2-NEXT:    vpxor %xmm4, %xmm4, %xmm4
> +; AVX2-NEXT:    vpcmpeqw %xmm4, %xmm3, %xmm3
> +; AVX2-NEXT:    vpmovsxwd %xmm3, %xmm3
> +; AVX2-NEXT:    vpmovsxdq %xmm3, %ymm3
> +; AVX2-NEXT:    vpcmpeqw %xmm4, %xmm0, %xmm0
> +; AVX2-NEXT:    vpmovsxwd %xmm0, %xmm0
>  ; AVX2-NEXT:    vpmovsxdq %xmm0, %ymm0
> -; AVX2-NEXT:    vpmaskmovq (%rdi), %ymm0, %ymm3
> -; AVX2-NEXT:    vblendvpd %ymm0, %ymm3, %ymm1, %ymm0
> -; AVX2-NEXT:    vpmaskmovq 32(%rdi), %ymm4, %ymm1
> -; AVX2-NEXT:    vblendvpd %ymm4, %ymm1, %ymm2, %ymm1
> +; AVX2-NEXT:    vpmaskmovq (%rdi), %ymm0, %ymm4
> +; AVX2-NEXT:    vblendvpd %ymm0, %ymm4, %ymm1, %ymm0
> +; AVX2-NEXT:    vpmaskmovq 32(%rdi), %ymm3, %ymm1
> +; AVX2-NEXT:    vblendvpd %ymm3, %ymm1, %ymm2, %ymm1
>  ; AVX2-NEXT:    retq
>  ;
>  ; AVX512F-LABEL: load_v8i64_v8i16:
> @@ -2061,11 +2054,9 @@ define <8 x i64> @load_v8i64_v8i64(<8 x
>  define <2 x i32> @load_v2i32_v2i32(<2 x i32> %trigger, <2 x i32>* %addr, <2 x i32> %dst) {
>  ; SSE2-LABEL: load_v2i32_v2i32:
>  ; SSE2:       ## %bb.0:
> -; SSE2-NEXT:    pand {{.*}}(%rip), %xmm0
>  ; SSE2-NEXT:    pxor %xmm2, %xmm2
>  ; SSE2-NEXT:    pcmpeqd %xmm0, %xmm2
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm2[1,0,3,2]
> -; SSE2-NEXT:    pand %xmm2, %xmm0
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm2[0,0,1,1]
>  ; SSE2-NEXT:    movmskpd %xmm0, %eax
>  ; SSE2-NEXT:    testb $1, %al
>  ; SSE2-NEXT:    jne LBB17_1
> @@ -2073,26 +2064,26 @@ define <2 x i32> @load_v2i32_v2i32(<2 x
>  ; SSE2-NEXT:    testb $2, %al
>  ; SSE2-NEXT:    jne LBB17_3
>  ; SSE2-NEXT:  LBB17_4: ## %else2
> -; SSE2-NEXT:    movapd %xmm1, %xmm0
> +; SSE2-NEXT:    movaps %xmm1, %xmm0
>  ; SSE2-NEXT:    retq
>  ; SSE2-NEXT:  LBB17_1: ## %cond.load
> -; SSE2-NEXT:    movl (%rdi), %ecx
> -; SSE2-NEXT:    movq %rcx, %xmm0
> -; SSE2-NEXT:    movsd {{.*#+}} xmm1 = xmm0[0],xmm1[1]
> +; SSE2-NEXT:    movss {{.*#+}} xmm0 = mem[0],zero,zero,zero
> +; SSE2-NEXT:    movss {{.*#+}} xmm1 = xmm0[0],xmm1[1,2,3]
>  ; SSE2-NEXT:    testb $2, %al
>  ; SSE2-NEXT:    je LBB17_4
>  ; SSE2-NEXT:  LBB17_3: ## %cond.load1
> -; SSE2-NEXT:    movl 4(%rdi), %eax
> -; SSE2-NEXT:    movq %rax, %xmm0
> -; SSE2-NEXT:    unpcklpd {{.*#+}} xmm1 = xmm1[0],xmm0[0]
> -; SSE2-NEXT:    movapd %xmm1, %xmm0
> +; SSE2-NEXT:    movss {{.*#+}} xmm0 = mem[0],zero,zero,zero
> +; SSE2-NEXT:    shufps {{.*#+}} xmm0 = xmm0[0,0],xmm1[0,0]
> +; SSE2-NEXT:    shufps {{.*#+}} xmm0 = xmm0[2,0],xmm1[2,3]
> +; SSE2-NEXT:    movaps %xmm0, %xmm1
> +; SSE2-NEXT:    movaps %xmm1, %xmm0
>  ; SSE2-NEXT:    retq
>  ;
>  ; SSE42-LABEL: load_v2i32_v2i32:
>  ; SSE42:       ## %bb.0:
>  ; SSE42-NEXT:    pxor %xmm2, %xmm2
> -; SSE42-NEXT:    pblendw {{.*#+}} xmm0 = xmm0[0,1],xmm2[2,3],xmm0[4,5],xmm2[6,7]
> -; SSE42-NEXT:    pcmpeqq %xmm2, %xmm0
> +; SSE42-NEXT:    pcmpeqd %xmm0, %xmm2
> +; SSE42-NEXT:    pmovsxdq %xmm2, %xmm0
>  ; SSE42-NEXT:    movmskpd %xmm0, %eax
>  ; SSE42-NEXT:    testb $1, %al
>  ; SSE42-NEXT:    jne LBB17_1
> @@ -2103,62 +2094,59 @@ define <2 x i32> @load_v2i32_v2i32(<2 x
>  ; SSE42-NEXT:    movdqa %xmm1, %xmm0
>  ; SSE42-NEXT:    retq
>  ; SSE42-NEXT:  LBB17_1: ## %cond.load
> -; SSE42-NEXT:    movl (%rdi), %ecx
> -; SSE42-NEXT:    pinsrq $0, %rcx, %xmm1
> +; SSE42-NEXT:    pinsrd $0, (%rdi), %xmm1
>  ; SSE42-NEXT:    testb $2, %al
>  ; SSE42-NEXT:    je LBB17_4
>  ; SSE42-NEXT:  LBB17_3: ## %cond.load1
> -; SSE42-NEXT:    movl 4(%rdi), %eax
> -; SSE42-NEXT:    pinsrq $1, %rax, %xmm1
> +; SSE42-NEXT:    pinsrd $1, 4(%rdi), %xmm1
>  ; SSE42-NEXT:    movdqa %xmm1, %xmm0
>  ; SSE42-NEXT:    retq
>  ;
>  ; AVX1-LABEL: load_v2i32_v2i32:
>  ; AVX1:       ## %bb.0:
>  ; AVX1-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> -; AVX1-NEXT:    vpblendw {{.*#+}} xmm0 = xmm0[0,1],xmm2[2,3],xmm0[4,5],xmm2[6,7]
> -; AVX1-NEXT:    vpcmpeqq %xmm2, %xmm0, %xmm0
> -; AVX1-NEXT:    vinsertps {{.*#+}} xmm0 = xmm0[0,2],zero,zero
> +; AVX1-NEXT:    vpcmpeqd %xmm2, %xmm0, %xmm0
> +; AVX1-NEXT:    vmovq {{.*#+}} xmm0 = xmm0[0],zero
>  ; AVX1-NEXT:    vmaskmovps (%rdi), %xmm0, %xmm2
> -; AVX1-NEXT:    vpermilps {{.*#+}} xmm1 = xmm1[0,2,2,3]
>  ; AVX1-NEXT:    vblendvps %xmm0, %xmm2, %xmm1, %xmm0
> -; AVX1-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero
>  ; AVX1-NEXT:    retq
>  ;
>  ; AVX2-LABEL: load_v2i32_v2i32:
>  ; AVX2:       ## %bb.0:
>  ; AVX2-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> -; AVX2-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm2[1],xmm0[2],xmm2[3]
> -; AVX2-NEXT:    vpcmpeqq %xmm2, %xmm0, %xmm0
> -; AVX2-NEXT:    vinsertps {{.*#+}} xmm0 = xmm0[0,2],zero,zero
> +; AVX2-NEXT:    vpcmpeqd %xmm2, %xmm0, %xmm0
> +; AVX2-NEXT:    vmovq {{.*#+}} xmm0 = xmm0[0],zero
>  ; AVX2-NEXT:    vpmaskmovd (%rdi), %xmm0, %xmm2
> -; AVX2-NEXT:    vpermilps {{.*#+}} xmm1 = xmm1[0,2,2,3]
>  ; AVX2-NEXT:    vblendvps %xmm0, %xmm2, %xmm1, %xmm0
> -; AVX2-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero
>  ; AVX2-NEXT:    retq
>  ;
>  ; AVX512F-LABEL: load_v2i32_v2i32:
>  ; AVX512F:       ## %bb.0:
> -; AVX512F-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> -; AVX512F-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm2[1],xmm0[2],xmm2[3]
> -; AVX512F-NEXT:    vptestnmq %zmm0, %zmm0, %k0
> -; AVX512F-NEXT:    vpshufd {{.*#+}} xmm0 = xmm1[0,2,2,3]
> +; AVX512F-NEXT:    ## kill: def $xmm1 killed $xmm1 def $zmm1
> +; AVX512F-NEXT:    ## kill: def $xmm0 killed $xmm0 def $zmm0
> +; AVX512F-NEXT:    vptestnmd %zmm0, %zmm0, %k0
>  ; AVX512F-NEXT:    kshiftlw $14, %k0, %k0
>  ; AVX512F-NEXT:    kshiftrw $14, %k0, %k1
> -; AVX512F-NEXT:    vmovdqu32 (%rdi), %zmm0 {%k1}
> -; AVX512F-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero
> +; AVX512F-NEXT:    vpblendmd (%rdi), %zmm1, %zmm0 {%k1}
> +; AVX512F-NEXT:    ## kill: def $xmm0 killed $xmm0 killed $zmm0
>  ; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
>  ;
> -; AVX512VL-LABEL: load_v2i32_v2i32:
> -; AVX512VL:       ## %bb.0:
> -; AVX512VL-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> -; AVX512VL-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm2[1],xmm0[2],xmm2[3]
> -; AVX512VL-NEXT:    vptestnmq %xmm0, %xmm0, %k1
> -; AVX512VL-NEXT:    vpshufd {{.*#+}} xmm0 = xmm1[0,2,2,3]
> -; AVX512VL-NEXT:    vmovdqu32 (%rdi), %xmm0 {%k1}
> -; AVX512VL-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero
> -; AVX512VL-NEXT:    retq
> +; AVX512VLDQ-LABEL: load_v2i32_v2i32:
> +; AVX512VLDQ:       ## %bb.0:
> +; AVX512VLDQ-NEXT:    vptestnmd %xmm0, %xmm0, %k0
> +; AVX512VLDQ-NEXT:    kshiftlb $6, %k0, %k0
> +; AVX512VLDQ-NEXT:    kshiftrb $6, %k0, %k1
> +; AVX512VLDQ-NEXT:    vpblendmd (%rdi), %xmm1, %xmm0 {%k1}
> +; AVX512VLDQ-NEXT:    retq
> +;
> +; AVX512VLBW-LABEL: load_v2i32_v2i32:
> +; AVX512VLBW:       ## %bb.0:
> +; AVX512VLBW-NEXT:    vptestnmd %xmm0, %xmm0, %k0
> +; AVX512VLBW-NEXT:    kshiftlw $14, %k0, %k0
> +; AVX512VLBW-NEXT:    kshiftrw $14, %k0, %k1
> +; AVX512VLBW-NEXT:    vpblendmd (%rdi), %xmm1, %xmm0 {%k1}
> +; AVX512VLBW-NEXT:    retq
>    %mask = icmp eq <2 x i32> %trigger, zeroinitializer
>    %res = call <2 x i32> @llvm.masked.load.v2i32.p0v2i32(<2 x i32>* %addr, i32 4, <2 x i1> %mask, <2 x i32> %dst)
>    ret <2 x i32> %res
>
> Modified: llvm/trunk/test/CodeGen/X86/masked_store.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/masked_store.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/masked_store.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/masked_store.ll Wed Aug  7 09:24:26 2019
> @@ -165,11 +165,9 @@ define void @store_v4f64_v4i64(<4 x i64>
>  define void @store_v2f32_v2i32(<2 x i32> %trigger, <2 x float>* %addr, <2 x float> %val) {
>  ; SSE2-LABEL: store_v2f32_v2i32:
>  ; SSE2:       ## %bb.0:
> -; SSE2-NEXT:    pand {{.*}}(%rip), %xmm0
>  ; SSE2-NEXT:    pxor %xmm2, %xmm2
>  ; SSE2-NEXT:    pcmpeqd %xmm0, %xmm2
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm2[1,0,3,2]
> -; SSE2-NEXT:    pand %xmm2, %xmm0
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm2[0,0,1,1]
>  ; SSE2-NEXT:    movmskpd %xmm0, %eax
>  ; SSE2-NEXT:    testb $1, %al
>  ; SSE2-NEXT:    jne LBB3_1
> @@ -190,8 +188,8 @@ define void @store_v2f32_v2i32(<2 x i32>
>  ; SSE4-LABEL: store_v2f32_v2i32:
>  ; SSE4:       ## %bb.0:
>  ; SSE4-NEXT:    pxor %xmm2, %xmm2
> -; SSE4-NEXT:    pblendw {{.*#+}} xmm0 = xmm0[0,1],xmm2[2,3],xmm0[4,5],xmm2[6,7]
> -; SSE4-NEXT:    pcmpeqq %xmm2, %xmm0
> +; SSE4-NEXT:    pcmpeqd %xmm0, %xmm2
> +; SSE4-NEXT:    pmovsxdq %xmm2, %xmm0
>  ; SSE4-NEXT:    movmskpd %xmm0, %eax
>  ; SSE4-NEXT:    testb $1, %al
>  ; SSE4-NEXT:    jne LBB3_1
> @@ -208,43 +206,40 @@ define void @store_v2f32_v2i32(<2 x i32>
>  ; SSE4-NEXT:    extractps $1, %xmm1, 4(%rdi)
>  ; SSE4-NEXT:    retq
>  ;
> -; AVX1-LABEL: store_v2f32_v2i32:
> -; AVX1:       ## %bb.0:
> -; AVX1-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> -; AVX1-NEXT:    vpblendw {{.*#+}} xmm0 = xmm0[0,1],xmm2[2,3],xmm0[4,5],xmm2[6,7]
> -; AVX1-NEXT:    vpcmpeqq %xmm2, %xmm0, %xmm0
> -; AVX1-NEXT:    vinsertps {{.*#+}} xmm0 = xmm0[0,2],zero,zero
> -; AVX1-NEXT:    vmaskmovps %xmm1, %xmm0, (%rdi)
> -; AVX1-NEXT:    retq
> -;
> -; AVX2-LABEL: store_v2f32_v2i32:
> -; AVX2:       ## %bb.0:
> -; AVX2-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> -; AVX2-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm2[1],xmm0[2],xmm2[3]
> -; AVX2-NEXT:    vpcmpeqq %xmm2, %xmm0, %xmm0
> -; AVX2-NEXT:    vinsertps {{.*#+}} xmm0 = xmm0[0,2],zero,zero
> -; AVX2-NEXT:    vmaskmovps %xmm1, %xmm0, (%rdi)
> -; AVX2-NEXT:    retq
> +; AVX1OR2-LABEL: store_v2f32_v2i32:
> +; AVX1OR2:       ## %bb.0:
> +; AVX1OR2-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> +; AVX1OR2-NEXT:    vpcmpeqd %xmm2, %xmm0, %xmm0
> +; AVX1OR2-NEXT:    vmovq {{.*#+}} xmm0 = xmm0[0],zero
> +; AVX1OR2-NEXT:    vmaskmovps %xmm1, %xmm0, (%rdi)
> +; AVX1OR2-NEXT:    retq
>  ;
>  ; AVX512F-LABEL: store_v2f32_v2i32:
>  ; AVX512F:       ## %bb.0:
>  ; AVX512F-NEXT:    ## kill: def $xmm1 killed $xmm1 def $zmm1
> -; AVX512F-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> -; AVX512F-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm2[1],xmm0[2],xmm2[3]
> -; AVX512F-NEXT:    vptestnmq %zmm0, %zmm0, %k0
> +; AVX512F-NEXT:    ## kill: def $xmm0 killed $xmm0 def $zmm0
> +; AVX512F-NEXT:    vptestnmd %zmm0, %zmm0, %k0
>  ; AVX512F-NEXT:    kshiftlw $14, %k0, %k0
>  ; AVX512F-NEXT:    kshiftrw $14, %k0, %k1
>  ; AVX512F-NEXT:    vmovups %zmm1, (%rdi) {%k1}
>  ; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
>  ;
> -; AVX512VL-LABEL: store_v2f32_v2i32:
> -; AVX512VL:       ## %bb.0:
> -; AVX512VL-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> -; AVX512VL-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm2[1],xmm0[2],xmm2[3]
> -; AVX512VL-NEXT:    vptestnmq %xmm0, %xmm0, %k1
> -; AVX512VL-NEXT:    vmovups %xmm1, (%rdi) {%k1}
> -; AVX512VL-NEXT:    retq
> +; AVX512VLDQ-LABEL: store_v2f32_v2i32:
> +; AVX512VLDQ:       ## %bb.0:
> +; AVX512VLDQ-NEXT:    vptestnmd %xmm0, %xmm0, %k0
> +; AVX512VLDQ-NEXT:    kshiftlb $6, %k0, %k0
> +; AVX512VLDQ-NEXT:    kshiftrb $6, %k0, %k1
> +; AVX512VLDQ-NEXT:    vmovups %xmm1, (%rdi) {%k1}
> +; AVX512VLDQ-NEXT:    retq
> +;
> +; AVX512VLBW-LABEL: store_v2f32_v2i32:
> +; AVX512VLBW:       ## %bb.0:
> +; AVX512VLBW-NEXT:    vptestnmd %xmm0, %xmm0, %k0
> +; AVX512VLBW-NEXT:    kshiftlw $14, %k0, %k0
> +; AVX512VLBW-NEXT:    kshiftrw $14, %k0, %k1
> +; AVX512VLBW-NEXT:    vmovups %xmm1, (%rdi) {%k1}
> +; AVX512VLBW-NEXT:    retq
>    %mask = icmp eq <2 x i32> %trigger, zeroinitializer
>    call void @llvm.masked.store.v2f32.p0v2f32(<2 x float> %val, <2 x float>* %addr, i32 4, <2 x i1> %mask)
>    ret void
> @@ -1046,11 +1041,9 @@ define void @store_v1i32_v1i32(<1 x i32>
>  define void @store_v2i32_v2i32(<2 x i32> %trigger, <2 x i32>* %addr, <2 x i32> %val) {
>  ; SSE2-LABEL: store_v2i32_v2i32:
>  ; SSE2:       ## %bb.0:
> -; SSE2-NEXT:    pand {{.*}}(%rip), %xmm0
>  ; SSE2-NEXT:    pxor %xmm2, %xmm2
>  ; SSE2-NEXT:    pcmpeqd %xmm0, %xmm2
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm2[1,0,3,2]
> -; SSE2-NEXT:    pand %xmm2, %xmm0
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm2[0,0,1,1]
>  ; SSE2-NEXT:    movmskpd %xmm0, %eax
>  ; SSE2-NEXT:    testb $1, %al
>  ; SSE2-NEXT:    jne LBB10_1
> @@ -1064,15 +1057,15 @@ define void @store_v2i32_v2i32(<2 x i32>
>  ; SSE2-NEXT:    testb $2, %al
>  ; SSE2-NEXT:    je LBB10_4
>  ; SSE2-NEXT:  LBB10_3: ## %cond.store1
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm1[2,3,0,1]
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm1[1,1,2,3]
>  ; SSE2-NEXT:    movd %xmm0, 4(%rdi)
>  ; SSE2-NEXT:    retq
>  ;
>  ; SSE4-LABEL: store_v2i32_v2i32:
>  ; SSE4:       ## %bb.0:
>  ; SSE4-NEXT:    pxor %xmm2, %xmm2
> -; SSE4-NEXT:    pblendw {{.*#+}} xmm0 = xmm0[0,1],xmm2[2,3],xmm0[4,5],xmm2[6,7]
> -; SSE4-NEXT:    pcmpeqq %xmm2, %xmm0
> +; SSE4-NEXT:    pcmpeqd %xmm0, %xmm2
> +; SSE4-NEXT:    pmovsxdq %xmm2, %xmm0
>  ; SSE4-NEXT:    movmskpd %xmm0, %eax
>  ; SSE4-NEXT:    testb $1, %al
>  ; SSE4-NEXT:    jne LBB10_1
> @@ -1086,48 +1079,51 @@ define void @store_v2i32_v2i32(<2 x i32>
>  ; SSE4-NEXT:    testb $2, %al
>  ; SSE4-NEXT:    je LBB10_4
>  ; SSE4-NEXT:  LBB10_3: ## %cond.store1
> -; SSE4-NEXT:    extractps $2, %xmm1, 4(%rdi)
> +; SSE4-NEXT:    extractps $1, %xmm1, 4(%rdi)
>  ; SSE4-NEXT:    retq
>  ;
>  ; AVX1-LABEL: store_v2i32_v2i32:
>  ; AVX1:       ## %bb.0:
>  ; AVX1-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> -; AVX1-NEXT:    vpblendw {{.*#+}} xmm0 = xmm0[0,1],xmm2[2,3],xmm0[4,5],xmm2[6,7]
> -; AVX1-NEXT:    vpcmpeqq %xmm2, %xmm0, %xmm0
> -; AVX1-NEXT:    vinsertps {{.*#+}} xmm0 = xmm0[0,2],zero,zero
> -; AVX1-NEXT:    vpermilps {{.*#+}} xmm1 = xmm1[0,2,2,3]
> +; AVX1-NEXT:    vpcmpeqd %xmm2, %xmm0, %xmm0
> +; AVX1-NEXT:    vmovq {{.*#+}} xmm0 = xmm0[0],zero
>  ; AVX1-NEXT:    vmaskmovps %xmm1, %xmm0, (%rdi)
>  ; AVX1-NEXT:    retq
>  ;
>  ; AVX2-LABEL: store_v2i32_v2i32:
>  ; AVX2:       ## %bb.0:
>  ; AVX2-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> -; AVX2-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm2[1],xmm0[2],xmm2[3]
> -; AVX2-NEXT:    vpcmpeqq %xmm2, %xmm0, %xmm0
> -; AVX2-NEXT:    vinsertps {{.*#+}} xmm0 = xmm0[0,2],zero,zero
> -; AVX2-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
> +; AVX2-NEXT:    vpcmpeqd %xmm2, %xmm0, %xmm0
> +; AVX2-NEXT:    vmovq {{.*#+}} xmm0 = xmm0[0],zero
>  ; AVX2-NEXT:    vpmaskmovd %xmm1, %xmm0, (%rdi)
>  ; AVX2-NEXT:    retq
>  ;
>  ; AVX512F-LABEL: store_v2i32_v2i32:
>  ; AVX512F:       ## %bb.0:
> -; AVX512F-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> -; AVX512F-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm2[1],xmm0[2],xmm2[3]
> -; AVX512F-NEXT:    vptestnmq %zmm0, %zmm0, %k0
> -; AVX512F-NEXT:    vpshufd {{.*#+}} xmm0 = xmm1[0,2,2,3]
> +; AVX512F-NEXT:    ## kill: def $xmm1 killed $xmm1 def $zmm1
> +; AVX512F-NEXT:    ## kill: def $xmm0 killed $xmm0 def $zmm0
> +; AVX512F-NEXT:    vptestnmd %zmm0, %zmm0, %k0
>  ; AVX512F-NEXT:    kshiftlw $14, %k0, %k0
>  ; AVX512F-NEXT:    kshiftrw $14, %k0, %k1
> -; AVX512F-NEXT:    vmovdqu32 %zmm0, (%rdi) {%k1}
> +; AVX512F-NEXT:    vmovdqu32 %zmm1, (%rdi) {%k1}
>  ; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
>  ;
> -; AVX512VL-LABEL: store_v2i32_v2i32:
> -; AVX512VL:       ## %bb.0:
> -; AVX512VL-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> -; AVX512VL-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm2[1],xmm0[2],xmm2[3]
> -; AVX512VL-NEXT:    vptestnmq %xmm0, %xmm0, %k1
> -; AVX512VL-NEXT:    vpmovqd %xmm1, (%rdi) {%k1}
> -; AVX512VL-NEXT:    retq
> +; AVX512VLDQ-LABEL: store_v2i32_v2i32:
> +; AVX512VLDQ:       ## %bb.0:
> +; AVX512VLDQ-NEXT:    vptestnmd %xmm0, %xmm0, %k0
> +; AVX512VLDQ-NEXT:    kshiftlb $6, %k0, %k0
> +; AVX512VLDQ-NEXT:    kshiftrb $6, %k0, %k1
> +; AVX512VLDQ-NEXT:    vmovdqu32 %xmm1, (%rdi) {%k1}
> +; AVX512VLDQ-NEXT:    retq
> +;
> +; AVX512VLBW-LABEL: store_v2i32_v2i32:
> +; AVX512VLBW:       ## %bb.0:
> +; AVX512VLBW-NEXT:    vptestnmd %xmm0, %xmm0, %k0
> +; AVX512VLBW-NEXT:    kshiftlw $14, %k0, %k0
> +; AVX512VLBW-NEXT:    kshiftrw $14, %k0, %k1
> +; AVX512VLBW-NEXT:    vmovdqu32 %xmm1, (%rdi) {%k1}
> +; AVX512VLBW-NEXT:    retq
>    %mask = icmp eq <2 x i32> %trigger, zeroinitializer
>    call void @llvm.masked.store.v2i32.p0v2i32(<2 x i32> %val, <2 x i32>* %addr, i32 4, <2 x i1> %mask)
>    ret void
>
> Modified: llvm/trunk/test/CodeGen/X86/masked_store_trunc.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/masked_store_trunc.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/masked_store_trunc.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/masked_store_trunc.ll Wed Aug  7 09:24:26 2019
> @@ -615,17 +615,15 @@ define void @truncstore_v8i64_v8i8(<8 x
>  ; SSE2-LABEL: truncstore_v8i64_v8i8:
>  ; SSE2:       # %bb.0:
>  ; SSE2-NEXT:    pxor %xmm6, %xmm6
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
> -; SSE2-NEXT:    pshuflw {{.*#+}} xmm1 = xmm1[0,2,2,3,4,5,6,7]
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> -; SSE2-NEXT:    pshuflw {{.*#+}} xmm7 = xmm0[0,2,2,3,4,5,6,7]
> -; SSE2-NEXT:    punpckldq {{.*#+}} xmm7 = xmm7[0],xmm1[0],xmm7[1],xmm1[1]
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm3[0,2,2,3]
> -; SSE2-NEXT:    pshuflw {{.*#+}} xmm1 = xmm0[0,1,0,2,4,5,6,7]
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm2[0,2,2,3]
> -; SSE2-NEXT:    pshuflw {{.*#+}} xmm0 = xmm0[0,1,0,2,4,5,6,7]
> -; SSE2-NEXT:    punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
> -; SSE2-NEXT:    movsd {{.*#+}} xmm0 = xmm7[0],xmm0[1]
> +; SSE2-NEXT:    movdqa {{.*#+}} xmm7 = [255,0,0,0,0,0,0,0,255,0,0,0,0,0,0,0]
> +; SSE2-NEXT:    pand %xmm7, %xmm3
> +; SSE2-NEXT:    pand %xmm7, %xmm2
> +; SSE2-NEXT:    packuswb %xmm3, %xmm2
> +; SSE2-NEXT:    pand %xmm7, %xmm1
> +; SSE2-NEXT:    pand %xmm7, %xmm0
> +; SSE2-NEXT:    packuswb %xmm1, %xmm0
> +; SSE2-NEXT:    packuswb %xmm2, %xmm0
> +; SSE2-NEXT:    packuswb %xmm0, %xmm0
>  ; SSE2-NEXT:    pcmpeqd %xmm6, %xmm5
>  ; SSE2-NEXT:    pcmpeqd %xmm1, %xmm1
>  ; SSE2-NEXT:    pxor %xmm1, %xmm5
> @@ -645,17 +643,26 @@ define void @truncstore_v8i64_v8i8(<8 x
>  ; SSE2-NEXT:    jne .LBB2_5
>  ; SSE2-NEXT:  .LBB2_6: # %else4
>  ; SSE2-NEXT:    testb $8, %al
> -; SSE2-NEXT:    jne .LBB2_7
> +; SSE2-NEXT:    je .LBB2_8
> +; SSE2-NEXT:  .LBB2_7: # %cond.store5
> +; SSE2-NEXT:    shrl $24, %ecx
> +; SSE2-NEXT:    movb %cl, 3(%rdi)
>  ; SSE2-NEXT:  .LBB2_8: # %else6
>  ; SSE2-NEXT:    testb $16, %al
> -; SSE2-NEXT:    jne .LBB2_9
> +; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> +; SSE2-NEXT:    je .LBB2_10
> +; SSE2-NEXT:  # %bb.9: # %cond.store7
> +; SSE2-NEXT:    movb %cl, 4(%rdi)
>  ; SSE2-NEXT:  .LBB2_10: # %else8
>  ; SSE2-NEXT:    testb $32, %al
> -; SSE2-NEXT:    jne .LBB2_11
> +; SSE2-NEXT:    je .LBB2_12
> +; SSE2-NEXT:  # %bb.11: # %cond.store9
> +; SSE2-NEXT:    movb %ch, 5(%rdi)
>  ; SSE2-NEXT:  .LBB2_12: # %else10
>  ; SSE2-NEXT:    testb $64, %al
> +; SSE2-NEXT:    pextrw $3, %xmm0, %ecx
>  ; SSE2-NEXT:    jne .LBB2_13
> -; SSE2-NEXT:  .LBB2_14: # %else12
> +; SSE2-NEXT:  # %bb.14: # %else12
>  ; SSE2-NEXT:    testb $-128, %al
>  ; SSE2-NEXT:    jne .LBB2_15
>  ; SSE2-NEXT:  .LBB2_16: # %else14
> @@ -665,50 +672,36 @@ define void @truncstore_v8i64_v8i8(<8 x
>  ; SSE2-NEXT:    testb $2, %al
>  ; SSE2-NEXT:    je .LBB2_4
>  ; SSE2-NEXT:  .LBB2_3: # %cond.store1
> -; SSE2-NEXT:    shrl $16, %ecx
> -; SSE2-NEXT:    movb %cl, 1(%rdi)
> +; SSE2-NEXT:    movb %ch, 1(%rdi)
>  ; SSE2-NEXT:    testb $4, %al
>  ; SSE2-NEXT:    je .LBB2_6
>  ; SSE2-NEXT:  .LBB2_5: # %cond.store3
> -; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> -; SSE2-NEXT:    movb %cl, 2(%rdi)
> +; SSE2-NEXT:    movl %ecx, %edx
> +; SSE2-NEXT:    shrl $16, %edx
> +; SSE2-NEXT:    movb %dl, 2(%rdi)
>  ; SSE2-NEXT:    testb $8, %al
> -; SSE2-NEXT:    je .LBB2_8
> -; SSE2-NEXT:  .LBB2_7: # %cond.store5
> -; SSE2-NEXT:    pextrw $3, %xmm0, %ecx
> -; SSE2-NEXT:    movb %cl, 3(%rdi)
> -; SSE2-NEXT:    testb $16, %al
> -; SSE2-NEXT:    je .LBB2_10
> -; SSE2-NEXT:  .LBB2_9: # %cond.store7
> -; SSE2-NEXT:    pextrw $4, %xmm0, %ecx
> -; SSE2-NEXT:    movb %cl, 4(%rdi)
> -; SSE2-NEXT:    testb $32, %al
> -; SSE2-NEXT:    je .LBB2_12
> -; SSE2-NEXT:  .LBB2_11: # %cond.store9
> -; SSE2-NEXT:    pextrw $5, %xmm0, %ecx
> -; SSE2-NEXT:    movb %cl, 5(%rdi)
> -; SSE2-NEXT:    testb $64, %al
> -; SSE2-NEXT:    je .LBB2_14
> +; SSE2-NEXT:    jne .LBB2_7
> +; SSE2-NEXT:    jmp .LBB2_8
>  ; SSE2-NEXT:  .LBB2_13: # %cond.store11
> -; SSE2-NEXT:    pextrw $6, %xmm0, %ecx
>  ; SSE2-NEXT:    movb %cl, 6(%rdi)
>  ; SSE2-NEXT:    testb $-128, %al
>  ; SSE2-NEXT:    je .LBB2_16
>  ; SSE2-NEXT:  .LBB2_15: # %cond.store13
> -; SSE2-NEXT:    pextrw $7, %xmm0, %eax
> -; SSE2-NEXT:    movb %al, 7(%rdi)
> +; SSE2-NEXT:    movb %ch, 7(%rdi)
>  ; SSE2-NEXT:    retq
>  ;
>  ; SSE4-LABEL: truncstore_v8i64_v8i8:
>  ; SSE4:       # %bb.0:
>  ; SSE4-NEXT:    pxor %xmm6, %xmm6
> -; SSE4-NEXT:    pblendw {{.*#+}} xmm3 = xmm3[0],xmm6[1,2,3],xmm3[4],xmm6[5,6,7]
> -; SSE4-NEXT:    pblendw {{.*#+}} xmm2 = xmm2[0],xmm6[1,2,3],xmm2[4],xmm6[5,6,7]
> +; SSE4-NEXT:    movdqa {{.*#+}} xmm7 = [255,0,0,0,0,0,0,0,255,0,0,0,0,0,0,0]
> +; SSE4-NEXT:    pand %xmm7, %xmm3
> +; SSE4-NEXT:    pand %xmm7, %xmm2
>  ; SSE4-NEXT:    packusdw %xmm3, %xmm2
> -; SSE4-NEXT:    pblendw {{.*#+}} xmm1 = xmm1[0],xmm6[1,2,3],xmm1[4],xmm6[5,6,7]
> -; SSE4-NEXT:    pblendw {{.*#+}} xmm0 = xmm0[0],xmm6[1,2,3],xmm0[4],xmm6[5,6,7]
> +; SSE4-NEXT:    pand %xmm7, %xmm1
> +; SSE4-NEXT:    pand %xmm7, %xmm0
>  ; SSE4-NEXT:    packusdw %xmm1, %xmm0
>  ; SSE4-NEXT:    packusdw %xmm2, %xmm0
> +; SSE4-NEXT:    packuswb %xmm0, %xmm0
>  ; SSE4-NEXT:    pcmpeqd %xmm6, %xmm5
>  ; SSE4-NEXT:    pcmpeqd %xmm1, %xmm1
>  ; SSE4-NEXT:    pxor %xmm1, %xmm5
> @@ -747,36 +740,36 @@ define void @truncstore_v8i64_v8i8(<8 x
>  ; SSE4-NEXT:    testb $2, %al
>  ; SSE4-NEXT:    je .LBB2_4
>  ; SSE4-NEXT:  .LBB2_3: # %cond.store1
> -; SSE4-NEXT:    pextrb $2, %xmm0, 1(%rdi)
> +; SSE4-NEXT:    pextrb $1, %xmm0, 1(%rdi)
>  ; SSE4-NEXT:    testb $4, %al
>  ; SSE4-NEXT:    je .LBB2_6
>  ; SSE4-NEXT:  .LBB2_5: # %cond.store3
> -; SSE4-NEXT:    pextrb $4, %xmm0, 2(%rdi)
> +; SSE4-NEXT:    pextrb $2, %xmm0, 2(%rdi)
>  ; SSE4-NEXT:    testb $8, %al
>  ; SSE4-NEXT:    je .LBB2_8
>  ; SSE4-NEXT:  .LBB2_7: # %cond.store5
> -; SSE4-NEXT:    pextrb $6, %xmm0, 3(%rdi)
> +; SSE4-NEXT:    pextrb $3, %xmm0, 3(%rdi)
>  ; SSE4-NEXT:    testb $16, %al
>  ; SSE4-NEXT:    je .LBB2_10
>  ; SSE4-NEXT:  .LBB2_9: # %cond.store7
> -; SSE4-NEXT:    pextrb $8, %xmm0, 4(%rdi)
> +; SSE4-NEXT:    pextrb $4, %xmm0, 4(%rdi)
>  ; SSE4-NEXT:    testb $32, %al
>  ; SSE4-NEXT:    je .LBB2_12
>  ; SSE4-NEXT:  .LBB2_11: # %cond.store9
> -; SSE4-NEXT:    pextrb $10, %xmm0, 5(%rdi)
> +; SSE4-NEXT:    pextrb $5, %xmm0, 5(%rdi)
>  ; SSE4-NEXT:    testb $64, %al
>  ; SSE4-NEXT:    je .LBB2_14
>  ; SSE4-NEXT:  .LBB2_13: # %cond.store11
> -; SSE4-NEXT:    pextrb $12, %xmm0, 6(%rdi)
> +; SSE4-NEXT:    pextrb $6, %xmm0, 6(%rdi)
>  ; SSE4-NEXT:    testb $-128, %al
>  ; SSE4-NEXT:    je .LBB2_16
>  ; SSE4-NEXT:  .LBB2_15: # %cond.store13
> -; SSE4-NEXT:    pextrb $14, %xmm0, 7(%rdi)
> +; SSE4-NEXT:    pextrb $7, %xmm0, 7(%rdi)
>  ; SSE4-NEXT:    retq
>  ;
>  ; AVX1-LABEL: truncstore_v8i64_v8i8:
>  ; AVX1:       # %bb.0:
> -; AVX1-NEXT:    vmovaps {{.*#+}} ymm3 = [65535,65535,65535,65535]
> +; AVX1-NEXT:    vmovaps {{.*#+}} ymm3 = [255,255,255,255]
>  ; AVX1-NEXT:    vandps %ymm3, %ymm1, %ymm1
>  ; AVX1-NEXT:    vextractf128 $1, %ymm1, %xmm4
>  ; AVX1-NEXT:    vpackusdw %xmm4, %xmm1, %xmm1
> @@ -784,6 +777,7 @@ define void @truncstore_v8i64_v8i8(<8 x
>  ; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm3
>  ; AVX1-NEXT:    vpackusdw %xmm3, %xmm0, %xmm0
>  ; AVX1-NEXT:    vpackusdw %xmm1, %xmm0, %xmm0
> +; AVX1-NEXT:    vpackuswb %xmm0, %xmm0, %xmm0
>  ; AVX1-NEXT:    vextractf128 $1, %ymm2, %xmm1
>  ; AVX1-NEXT:    vpxor %xmm3, %xmm3, %xmm3
>  ; AVX1-NEXT:    vpcmpeqd %xmm3, %xmm1, %xmm1
> @@ -822,44 +816,48 @@ define void @truncstore_v8i64_v8i8(<8 x
>  ; AVX1-NEXT:    testb $2, %al
>  ; AVX1-NEXT:    je .LBB2_4
>  ; AVX1-NEXT:  .LBB2_3: # %cond.store1
> -; AVX1-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> +; AVX1-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX1-NEXT:    testb $4, %al
>  ; AVX1-NEXT:    je .LBB2_6
>  ; AVX1-NEXT:  .LBB2_5: # %cond.store3
> -; AVX1-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> +; AVX1-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
>  ; AVX1-NEXT:    testb $8, %al
>  ; AVX1-NEXT:    je .LBB2_8
>  ; AVX1-NEXT:  .LBB2_7: # %cond.store5
> -; AVX1-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> +; AVX1-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
>  ; AVX1-NEXT:    testb $16, %al
>  ; AVX1-NEXT:    je .LBB2_10
>  ; AVX1-NEXT:  .LBB2_9: # %cond.store7
> -; AVX1-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> +; AVX1-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
>  ; AVX1-NEXT:    testb $32, %al
>  ; AVX1-NEXT:    je .LBB2_12
>  ; AVX1-NEXT:  .LBB2_11: # %cond.store9
> -; AVX1-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> +; AVX1-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
>  ; AVX1-NEXT:    testb $64, %al
>  ; AVX1-NEXT:    je .LBB2_14
>  ; AVX1-NEXT:  .LBB2_13: # %cond.store11
> -; AVX1-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> +; AVX1-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
>  ; AVX1-NEXT:    testb $-128, %al
>  ; AVX1-NEXT:    je .LBB2_16
>  ; AVX1-NEXT:  .LBB2_15: # %cond.store13
> -; AVX1-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> +; AVX1-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
>  ; AVX1-NEXT:    vzeroupper
>  ; AVX1-NEXT:    retq
>  ;
>  ; AVX2-LABEL: truncstore_v8i64_v8i8:
>  ; AVX2:       # %bb.0:
>  ; AVX2-NEXT:    vpxor %xmm3, %xmm3, %xmm3
> -; AVX2-NEXT:    vextractf128 $1, %ymm1, %xmm4
> -; AVX2-NEXT:    vshufps {{.*#+}} xmm1 = xmm1[0,2],xmm4[0,2]
> -; AVX2-NEXT:    vextractf128 $1, %ymm0, %xmm4
> -; AVX2-NEXT:    vshufps {{.*#+}} xmm0 = xmm0[0,2],xmm4[0,2]
> -; AVX2-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
> -; AVX2-NEXT:    vpshufb {{.*#+}} ymm0 = ymm0[0,1,4,5,8,9,12,13,8,9,12,13,12,13,14,15,16,17,20,21,24,25,28,29,24,25,28,29,28,29,30,31]
> -; AVX2-NEXT:    vpermq {{.*#+}} ymm0 = ymm0[0,2,2,3]
> +; AVX2-NEXT:    vextracti128 $1, %ymm1, %xmm4
> +; AVX2-NEXT:    vmovdqa {{.*#+}} xmm5 = <u,u,0,8,u,u,u,u,u,u,u,u,u,u,u,u>
> +; AVX2-NEXT:    vpshufb %xmm5, %xmm4, %xmm4
> +; AVX2-NEXT:    vpshufb %xmm5, %xmm1, %xmm1
> +; AVX2-NEXT:    vpunpcklwd {{.*#+}} xmm1 = xmm1[0],xmm4[0],xmm1[1],xmm4[1],xmm1[2],xmm4[2],xmm1[3],xmm4[3]
> +; AVX2-NEXT:    vextracti128 $1, %ymm0, %xmm4
> +; AVX2-NEXT:    vmovdqa {{.*#+}} xmm5 = <0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u>
> +; AVX2-NEXT:    vpshufb %xmm5, %xmm4, %xmm4
> +; AVX2-NEXT:    vpshufb %xmm5, %xmm0, %xmm0
> +; AVX2-NEXT:    vpunpcklwd {{.*#+}} xmm0 = xmm0[0],xmm4[0],xmm0[1],xmm4[1],xmm0[2],xmm4[2],xmm0[3],xmm4[3]
> +; AVX2-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm1[1],xmm0[2,3]
>  ; AVX2-NEXT:    vpcmpeqd %ymm3, %ymm2, %ymm1
>  ; AVX2-NEXT:    vmovmskps %ymm1, %eax
>  ; AVX2-NEXT:    notl %eax
> @@ -894,31 +892,31 @@ define void @truncstore_v8i64_v8i8(<8 x
>  ; AVX2-NEXT:    testb $2, %al
>  ; AVX2-NEXT:    je .LBB2_4
>  ; AVX2-NEXT:  .LBB2_3: # %cond.store1
> -; AVX2-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> +; AVX2-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX2-NEXT:    testb $4, %al
>  ; AVX2-NEXT:    je .LBB2_6
>  ; AVX2-NEXT:  .LBB2_5: # %cond.store3
> -; AVX2-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> +; AVX2-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
>  ; AVX2-NEXT:    testb $8, %al
>  ; AVX2-NEXT:    je .LBB2_8
>  ; AVX2-NEXT:  .LBB2_7: # %cond.store5
> -; AVX2-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> +; AVX2-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
>  ; AVX2-NEXT:    testb $16, %al
>  ; AVX2-NEXT:    je .LBB2_10
>  ; AVX2-NEXT:  .LBB2_9: # %cond.store7
> -; AVX2-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> +; AVX2-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
>  ; AVX2-NEXT:    testb $32, %al
>  ; AVX2-NEXT:    je .LBB2_12
>  ; AVX2-NEXT:  .LBB2_11: # %cond.store9
> -; AVX2-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> +; AVX2-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
>  ; AVX2-NEXT:    testb $64, %al
>  ; AVX2-NEXT:    je .LBB2_14
>  ; AVX2-NEXT:  .LBB2_13: # %cond.store11
> -; AVX2-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> +; AVX2-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
>  ; AVX2-NEXT:    testb $-128, %al
>  ; AVX2-NEXT:    je .LBB2_16
>  ; AVX2-NEXT:  .LBB2_15: # %cond.store13
> -; AVX2-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> +; AVX2-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
>  ; AVX2-NEXT:    vzeroupper
>  ; AVX2-NEXT:    retq
>  ;
> @@ -926,7 +924,7 @@ define void @truncstore_v8i64_v8i8(<8 x
>  ; AVX512F:       # %bb.0:
>  ; AVX512F-NEXT:    # kill: def $ymm1 killed $ymm1 def $zmm1
>  ; AVX512F-NEXT:    vptestmd %zmm1, %zmm1, %k0
> -; AVX512F-NEXT:    vpmovqw %zmm0, %xmm0
> +; AVX512F-NEXT:    vpmovqb %zmm0, %xmm0
>  ; AVX512F-NEXT:    kmovw %k0, %eax
>  ; AVX512F-NEXT:    testb $1, %al
>  ; AVX512F-NEXT:    jne .LBB2_1
> @@ -959,31 +957,31 @@ define void @truncstore_v8i64_v8i8(<8 x
>  ; AVX512F-NEXT:    testb $2, %al
>  ; AVX512F-NEXT:    je .LBB2_4
>  ; AVX512F-NEXT:  .LBB2_3: # %cond.store1
> -; AVX512F-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> +; AVX512F-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX512F-NEXT:    testb $4, %al
>  ; AVX512F-NEXT:    je .LBB2_6
>  ; AVX512F-NEXT:  .LBB2_5: # %cond.store3
> -; AVX512F-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> +; AVX512F-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
>  ; AVX512F-NEXT:    testb $8, %al
>  ; AVX512F-NEXT:    je .LBB2_8
>  ; AVX512F-NEXT:  .LBB2_7: # %cond.store5
> -; AVX512F-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> +; AVX512F-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
>  ; AVX512F-NEXT:    testb $16, %al
>  ; AVX512F-NEXT:    je .LBB2_10
>  ; AVX512F-NEXT:  .LBB2_9: # %cond.store7
> -; AVX512F-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> +; AVX512F-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
>  ; AVX512F-NEXT:    testb $32, %al
>  ; AVX512F-NEXT:    je .LBB2_12
>  ; AVX512F-NEXT:  .LBB2_11: # %cond.store9
> -; AVX512F-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> +; AVX512F-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
>  ; AVX512F-NEXT:    testb $64, %al
>  ; AVX512F-NEXT:    je .LBB2_14
>  ; AVX512F-NEXT:  .LBB2_13: # %cond.store11
> -; AVX512F-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> +; AVX512F-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
>  ; AVX512F-NEXT:    testb $-128, %al
>  ; AVX512F-NEXT:    je .LBB2_16
>  ; AVX512F-NEXT:  .LBB2_15: # %cond.store13
> -; AVX512F-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> +; AVX512F-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
>  ; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
>  ;
> @@ -1147,7 +1145,11 @@ define void @truncstore_v4i64_v4i16(<4 x
>  ; SSE2-LABEL: truncstore_v4i64_v4i16:
>  ; SSE2:       # %bb.0:
>  ; SSE2-NEXT:    pxor %xmm3, %xmm3
> -; SSE2-NEXT:    shufps {{.*#+}} xmm0 = xmm0[0,2],xmm1[0,2]
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
> +; SSE2-NEXT:    pshuflw {{.*#+}} xmm1 = xmm1[0,2,2,3,4,5,6,7]
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> +; SSE2-NEXT:    pshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> +; SSE2-NEXT:    punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
>  ; SSE2-NEXT:    pcmpeqd %xmm2, %xmm3
>  ; SSE2-NEXT:    movmskps %xmm3, %eax
>  ; SSE2-NEXT:    xorl $15, %eax
> @@ -1170,24 +1172,28 @@ define void @truncstore_v4i64_v4i16(<4 x
>  ; SSE2-NEXT:    testb $2, %al
>  ; SSE2-NEXT:    je .LBB4_4
>  ; SSE2-NEXT:  .LBB4_3: # %cond.store1
> -; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> +; SSE2-NEXT:    pextrw $1, %xmm0, %ecx
>  ; SSE2-NEXT:    movw %cx, 2(%rdi)
>  ; SSE2-NEXT:    testb $4, %al
>  ; SSE2-NEXT:    je .LBB4_6
>  ; SSE2-NEXT:  .LBB4_5: # %cond.store3
> -; SSE2-NEXT:    pextrw $4, %xmm0, %ecx
> +; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
>  ; SSE2-NEXT:    movw %cx, 4(%rdi)
>  ; SSE2-NEXT:    testb $8, %al
>  ; SSE2-NEXT:    je .LBB4_8
>  ; SSE2-NEXT:  .LBB4_7: # %cond.store5
> -; SSE2-NEXT:    pextrw $6, %xmm0, %eax
> +; SSE2-NEXT:    pextrw $3, %xmm0, %eax
>  ; SSE2-NEXT:    movw %ax, 6(%rdi)
>  ; SSE2-NEXT:    retq
>  ;
>  ; SSE4-LABEL: truncstore_v4i64_v4i16:
>  ; SSE4:       # %bb.0:
>  ; SSE4-NEXT:    pxor %xmm3, %xmm3
> -; SSE4-NEXT:    shufps {{.*#+}} xmm0 = xmm0[0,2],xmm1[0,2]
> +; SSE4-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
> +; SSE4-NEXT:    pshuflw {{.*#+}} xmm1 = xmm1[0,2,2,3,4,5,6,7]
> +; SSE4-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> +; SSE4-NEXT:    pshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> +; SSE4-NEXT:    punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
>  ; SSE4-NEXT:    pcmpeqd %xmm2, %xmm3
>  ; SSE4-NEXT:    movmskps %xmm3, %eax
>  ; SSE4-NEXT:    xorl $15, %eax
> @@ -1209,62 +1215,109 @@ define void @truncstore_v4i64_v4i16(<4 x
>  ; SSE4-NEXT:    testb $2, %al
>  ; SSE4-NEXT:    je .LBB4_4
>  ; SSE4-NEXT:  .LBB4_3: # %cond.store1
> -; SSE4-NEXT:    pextrw $2, %xmm0, 2(%rdi)
> +; SSE4-NEXT:    pextrw $1, %xmm0, 2(%rdi)
>  ; SSE4-NEXT:    testb $4, %al
>  ; SSE4-NEXT:    je .LBB4_6
>  ; SSE4-NEXT:  .LBB4_5: # %cond.store3
> -; SSE4-NEXT:    pextrw $4, %xmm0, 4(%rdi)
> +; SSE4-NEXT:    pextrw $2, %xmm0, 4(%rdi)
>  ; SSE4-NEXT:    testb $8, %al
>  ; SSE4-NEXT:    je .LBB4_8
>  ; SSE4-NEXT:  .LBB4_7: # %cond.store5
> -; SSE4-NEXT:    pextrw $6, %xmm0, 6(%rdi)
> +; SSE4-NEXT:    pextrw $3, %xmm0, 6(%rdi)
>  ; SSE4-NEXT:    retq
>  ;
> -; AVX-LABEL: truncstore_v4i64_v4i16:
> -; AVX:       # %bb.0:
> -; AVX-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> -; AVX-NEXT:    vextractf128 $1, %ymm0, %xmm3
> -; AVX-NEXT:    vshufps {{.*#+}} xmm0 = xmm0[0,2],xmm3[0,2]
> -; AVX-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
> -; AVX-NEXT:    vmovmskps %xmm1, %eax
> -; AVX-NEXT:    xorl $15, %eax
> -; AVX-NEXT:    testb $1, %al
> -; AVX-NEXT:    jne .LBB4_1
> -; AVX-NEXT:  # %bb.2: # %else
> -; AVX-NEXT:    testb $2, %al
> -; AVX-NEXT:    jne .LBB4_3
> -; AVX-NEXT:  .LBB4_4: # %else2
> -; AVX-NEXT:    testb $4, %al
> -; AVX-NEXT:    jne .LBB4_5
> -; AVX-NEXT:  .LBB4_6: # %else4
> -; AVX-NEXT:    testb $8, %al
> -; AVX-NEXT:    jne .LBB4_7
> -; AVX-NEXT:  .LBB4_8: # %else6
> -; AVX-NEXT:    vzeroupper
> -; AVX-NEXT:    retq
> -; AVX-NEXT:  .LBB4_1: # %cond.store
> -; AVX-NEXT:    vpextrw $0, %xmm0, (%rdi)
> -; AVX-NEXT:    testb $2, %al
> -; AVX-NEXT:    je .LBB4_4
> -; AVX-NEXT:  .LBB4_3: # %cond.store1
> -; AVX-NEXT:    vpextrw $2, %xmm0, 2(%rdi)
> -; AVX-NEXT:    testb $4, %al
> -; AVX-NEXT:    je .LBB4_6
> -; AVX-NEXT:  .LBB4_5: # %cond.store3
> -; AVX-NEXT:    vpextrw $4, %xmm0, 4(%rdi)
> -; AVX-NEXT:    testb $8, %al
> -; AVX-NEXT:    je .LBB4_8
> -; AVX-NEXT:  .LBB4_7: # %cond.store5
> -; AVX-NEXT:    vpextrw $6, %xmm0, 6(%rdi)
> -; AVX-NEXT:    vzeroupper
> -; AVX-NEXT:    retq
> +; AVX1-LABEL: truncstore_v4i64_v4i16:
> +; AVX1:       # %bb.0:
> +; AVX1-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> +; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm3
> +; AVX1-NEXT:    vpshufd {{.*#+}} xmm3 = xmm3[0,2,2,3]
> +; AVX1-NEXT:    vpshuflw {{.*#+}} xmm3 = xmm3[0,2,2,3,4,5,6,7]
> +; AVX1-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> +; AVX1-NEXT:    vpshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> +; AVX1-NEXT:    vpunpckldq {{.*#+}} xmm0 = xmm0[0],xmm3[0],xmm0[1],xmm3[1]
> +; AVX1-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
> +; AVX1-NEXT:    vmovmskps %xmm1, %eax
> +; AVX1-NEXT:    xorl $15, %eax
> +; AVX1-NEXT:    testb $1, %al
> +; AVX1-NEXT:    jne .LBB4_1
> +; AVX1-NEXT:  # %bb.2: # %else
> +; AVX1-NEXT:    testb $2, %al
> +; AVX1-NEXT:    jne .LBB4_3
> +; AVX1-NEXT:  .LBB4_4: # %else2
> +; AVX1-NEXT:    testb $4, %al
> +; AVX1-NEXT:    jne .LBB4_5
> +; AVX1-NEXT:  .LBB4_6: # %else4
> +; AVX1-NEXT:    testb $8, %al
> +; AVX1-NEXT:    jne .LBB4_7
> +; AVX1-NEXT:  .LBB4_8: # %else6
> +; AVX1-NEXT:    vzeroupper
> +; AVX1-NEXT:    retq
> +; AVX1-NEXT:  .LBB4_1: # %cond.store
> +; AVX1-NEXT:    vpextrw $0, %xmm0, (%rdi)
> +; AVX1-NEXT:    testb $2, %al
> +; AVX1-NEXT:    je .LBB4_4
> +; AVX1-NEXT:  .LBB4_3: # %cond.store1
> +; AVX1-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
> +; AVX1-NEXT:    testb $4, %al
> +; AVX1-NEXT:    je .LBB4_6
> +; AVX1-NEXT:  .LBB4_5: # %cond.store3
> +; AVX1-NEXT:    vpextrw $2, %xmm0, 4(%rdi)
> +; AVX1-NEXT:    testb $8, %al
> +; AVX1-NEXT:    je .LBB4_8
> +; AVX1-NEXT:  .LBB4_7: # %cond.store5
> +; AVX1-NEXT:    vpextrw $3, %xmm0, 6(%rdi)
> +; AVX1-NEXT:    vzeroupper
> +; AVX1-NEXT:    retq
> +;
> +; AVX2-LABEL: truncstore_v4i64_v4i16:
> +; AVX2:       # %bb.0:
> +; AVX2-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> +; AVX2-NEXT:    vextracti128 $1, %ymm0, %xmm3
> +; AVX2-NEXT:    vpshufd {{.*#+}} xmm3 = xmm3[0,2,2,3]
> +; AVX2-NEXT:    vpshuflw {{.*#+}} xmm3 = xmm3[0,2,2,3,4,5,6,7]
> +; AVX2-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> +; AVX2-NEXT:    vpshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> +; AVX2-NEXT:    vpunpckldq {{.*#+}} xmm0 = xmm0[0],xmm3[0],xmm0[1],xmm3[1]
> +; AVX2-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
> +; AVX2-NEXT:    vmovmskps %xmm1, %eax
> +; AVX2-NEXT:    xorl $15, %eax
> +; AVX2-NEXT:    testb $1, %al
> +; AVX2-NEXT:    jne .LBB4_1
> +; AVX2-NEXT:  # %bb.2: # %else
> +; AVX2-NEXT:    testb $2, %al
> +; AVX2-NEXT:    jne .LBB4_3
> +; AVX2-NEXT:  .LBB4_4: # %else2
> +; AVX2-NEXT:    testb $4, %al
> +; AVX2-NEXT:    jne .LBB4_5
> +; AVX2-NEXT:  .LBB4_6: # %else4
> +; AVX2-NEXT:    testb $8, %al
> +; AVX2-NEXT:    jne .LBB4_7
> +; AVX2-NEXT:  .LBB4_8: # %else6
> +; AVX2-NEXT:    vzeroupper
> +; AVX2-NEXT:    retq
> +; AVX2-NEXT:  .LBB4_1: # %cond.store
> +; AVX2-NEXT:    vpextrw $0, %xmm0, (%rdi)
> +; AVX2-NEXT:    testb $2, %al
> +; AVX2-NEXT:    je .LBB4_4
> +; AVX2-NEXT:  .LBB4_3: # %cond.store1
> +; AVX2-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
> +; AVX2-NEXT:    testb $4, %al
> +; AVX2-NEXT:    je .LBB4_6
> +; AVX2-NEXT:  .LBB4_5: # %cond.store3
> +; AVX2-NEXT:    vpextrw $2, %xmm0, 4(%rdi)
> +; AVX2-NEXT:    testb $8, %al
> +; AVX2-NEXT:    je .LBB4_8
> +; AVX2-NEXT:  .LBB4_7: # %cond.store5
> +; AVX2-NEXT:    vpextrw $3, %xmm0, 6(%rdi)
> +; AVX2-NEXT:    vzeroupper
> +; AVX2-NEXT:    retq
>  ;
>  ; AVX512F-LABEL: truncstore_v4i64_v4i16:
>  ; AVX512F:       # %bb.0:
>  ; AVX512F-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
>  ; AVX512F-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
>  ; AVX512F-NEXT:    vptestmd %zmm1, %zmm1, %k0
> -; AVX512F-NEXT:    vpmovqd %zmm0, %ymm0
> +; AVX512F-NEXT:    vpmovqw %zmm0, %xmm0
>  ; AVX512F-NEXT:    kmovw %k0, %eax
>  ; AVX512F-NEXT:    testb $1, %al
>  ; AVX512F-NEXT:    jne .LBB4_1
> @@ -1285,15 +1338,15 @@ define void @truncstore_v4i64_v4i16(<4 x
>  ; AVX512F-NEXT:    testb $2, %al
>  ; AVX512F-NEXT:    je .LBB4_4
>  ; AVX512F-NEXT:  .LBB4_3: # %cond.store1
> -; AVX512F-NEXT:    vpextrw $2, %xmm0, 2(%rdi)
> +; AVX512F-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
>  ; AVX512F-NEXT:    testb $4, %al
>  ; AVX512F-NEXT:    je .LBB4_6
>  ; AVX512F-NEXT:  .LBB4_5: # %cond.store3
> -; AVX512F-NEXT:    vpextrw $4, %xmm0, 4(%rdi)
> +; AVX512F-NEXT:    vpextrw $2, %xmm0, 4(%rdi)
>  ; AVX512F-NEXT:    testb $8, %al
>  ; AVX512F-NEXT:    je .LBB4_8
>  ; AVX512F-NEXT:  .LBB4_7: # %cond.store5
> -; AVX512F-NEXT:    vpextrw $6, %xmm0, 6(%rdi)
> +; AVX512F-NEXT:    vpextrw $3, %xmm0, 6(%rdi)
>  ; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
>  ;
> @@ -1302,10 +1355,9 @@ define void @truncstore_v4i64_v4i16(<4 x
>  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
>  ; AVX512BW-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
>  ; AVX512BW-NEXT:    vptestmd %zmm1, %zmm1, %k0
> -; AVX512BW-NEXT:    vpmovqd %zmm0, %ymm0
> -; AVX512BW-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,1,4,5,8,9,12,13,8,9,12,13,12,13,14,15]
>  ; AVX512BW-NEXT:    kshiftld $28, %k0, %k0
>  ; AVX512BW-NEXT:    kshiftrd $28, %k0, %k1
> +; AVX512BW-NEXT:    vpmovqw %zmm0, %xmm0
>  ; AVX512BW-NEXT:    vmovdqu16 %zmm0, (%rdi) {%k1}
>  ; AVX512BW-NEXT:    vzeroupper
>  ; AVX512BW-NEXT:    retq
> @@ -1326,47 +1378,55 @@ define void @truncstore_v4i64_v4i8(<4 x
>  ; SSE2-LABEL: truncstore_v4i64_v4i8:
>  ; SSE2:       # %bb.0:
>  ; SSE2-NEXT:    pxor %xmm3, %xmm3
> -; SSE2-NEXT:    shufps {{.*#+}} xmm0 = xmm0[0,2],xmm1[0,2]
> +; SSE2-NEXT:    movdqa {{.*#+}} xmm4 = [255,0,0,0,0,0,0,0,255,0,0,0,0,0,0,0]
> +; SSE2-NEXT:    pand %xmm4, %xmm1
> +; SSE2-NEXT:    pand %xmm4, %xmm0
> +; SSE2-NEXT:    packuswb %xmm1, %xmm0
> +; SSE2-NEXT:    packuswb %xmm0, %xmm0
> +; SSE2-NEXT:    packuswb %xmm0, %xmm0
>  ; SSE2-NEXT:    pcmpeqd %xmm2, %xmm3
> -; SSE2-NEXT:    movmskps %xmm3, %eax
> -; SSE2-NEXT:    xorl $15, %eax
> -; SSE2-NEXT:    testb $1, %al
> +; SSE2-NEXT:    movmskps %xmm3, %ecx
> +; SSE2-NEXT:    xorl $15, %ecx
> +; SSE2-NEXT:    testb $1, %cl
> +; SSE2-NEXT:    movd %xmm0, %eax
>  ; SSE2-NEXT:    jne .LBB5_1
>  ; SSE2-NEXT:  # %bb.2: # %else
> -; SSE2-NEXT:    testb $2, %al
> +; SSE2-NEXT:    testb $2, %cl
>  ; SSE2-NEXT:    jne .LBB5_3
>  ; SSE2-NEXT:  .LBB5_4: # %else2
> -; SSE2-NEXT:    testb $4, %al
> +; SSE2-NEXT:    testb $4, %cl
>  ; SSE2-NEXT:    jne .LBB5_5
>  ; SSE2-NEXT:  .LBB5_6: # %else4
> -; SSE2-NEXT:    testb $8, %al
> +; SSE2-NEXT:    testb $8, %cl
>  ; SSE2-NEXT:    jne .LBB5_7
>  ; SSE2-NEXT:  .LBB5_8: # %else6
>  ; SSE2-NEXT:    retq
>  ; SSE2-NEXT:  .LBB5_1: # %cond.store
> -; SSE2-NEXT:    movd %xmm0, %ecx
> -; SSE2-NEXT:    movb %cl, (%rdi)
> -; SSE2-NEXT:    testb $2, %al
> +; SSE2-NEXT:    movb %al, (%rdi)
> +; SSE2-NEXT:    testb $2, %cl
>  ; SSE2-NEXT:    je .LBB5_4
>  ; SSE2-NEXT:  .LBB5_3: # %cond.store1
> -; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> -; SSE2-NEXT:    movb %cl, 1(%rdi)
> -; SSE2-NEXT:    testb $4, %al
> +; SSE2-NEXT:    movb %ah, 1(%rdi)
> +; SSE2-NEXT:    testb $4, %cl
>  ; SSE2-NEXT:    je .LBB5_6
>  ; SSE2-NEXT:  .LBB5_5: # %cond.store3
> -; SSE2-NEXT:    pextrw $4, %xmm0, %ecx
> -; SSE2-NEXT:    movb %cl, 2(%rdi)
> -; SSE2-NEXT:    testb $8, %al
> +; SSE2-NEXT:    movl %eax, %edx
> +; SSE2-NEXT:    shrl $16, %edx
> +; SSE2-NEXT:    movb %dl, 2(%rdi)
> +; SSE2-NEXT:    testb $8, %cl
>  ; SSE2-NEXT:    je .LBB5_8
>  ; SSE2-NEXT:  .LBB5_7: # %cond.store5
> -; SSE2-NEXT:    pextrw $6, %xmm0, %eax
> +; SSE2-NEXT:    shrl $24, %eax
>  ; SSE2-NEXT:    movb %al, 3(%rdi)
>  ; SSE2-NEXT:    retq
>  ;
>  ; SSE4-LABEL: truncstore_v4i64_v4i8:
>  ; SSE4:       # %bb.0:
>  ; SSE4-NEXT:    pxor %xmm3, %xmm3
> -; SSE4-NEXT:    shufps {{.*#+}} xmm0 = xmm0[0,2],xmm1[0,2]
> +; SSE4-NEXT:    movdqa {{.*#+}} xmm4 = <0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u>
> +; SSE4-NEXT:    pshufb %xmm4, %xmm1
> +; SSE4-NEXT:    pshufb %xmm4, %xmm0
> +; SSE4-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3]
>  ; SSE4-NEXT:    pcmpeqd %xmm2, %xmm3
>  ; SSE4-NEXT:    movmskps %xmm3, %eax
>  ; SSE4-NEXT:    xorl $15, %eax
> @@ -1388,62 +1448,107 @@ define void @truncstore_v4i64_v4i8(<4 x
>  ; SSE4-NEXT:    testb $2, %al
>  ; SSE4-NEXT:    je .LBB5_4
>  ; SSE4-NEXT:  .LBB5_3: # %cond.store1
> -; SSE4-NEXT:    pextrb $4, %xmm0, 1(%rdi)
> +; SSE4-NEXT:    pextrb $1, %xmm0, 1(%rdi)
>  ; SSE4-NEXT:    testb $4, %al
>  ; SSE4-NEXT:    je .LBB5_6
>  ; SSE4-NEXT:  .LBB5_5: # %cond.store3
> -; SSE4-NEXT:    pextrb $8, %xmm0, 2(%rdi)
> +; SSE4-NEXT:    pextrb $2, %xmm0, 2(%rdi)
>  ; SSE4-NEXT:    testb $8, %al
>  ; SSE4-NEXT:    je .LBB5_8
>  ; SSE4-NEXT:  .LBB5_7: # %cond.store5
> -; SSE4-NEXT:    pextrb $12, %xmm0, 3(%rdi)
> +; SSE4-NEXT:    pextrb $3, %xmm0, 3(%rdi)
>  ; SSE4-NEXT:    retq
>  ;
> -; AVX-LABEL: truncstore_v4i64_v4i8:
> -; AVX:       # %bb.0:
> -; AVX-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> -; AVX-NEXT:    vextractf128 $1, %ymm0, %xmm3
> -; AVX-NEXT:    vshufps {{.*#+}} xmm0 = xmm0[0,2],xmm3[0,2]
> -; AVX-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
> -; AVX-NEXT:    vmovmskps %xmm1, %eax
> -; AVX-NEXT:    xorl $15, %eax
> -; AVX-NEXT:    testb $1, %al
> -; AVX-NEXT:    jne .LBB5_1
> -; AVX-NEXT:  # %bb.2: # %else
> -; AVX-NEXT:    testb $2, %al
> -; AVX-NEXT:    jne .LBB5_3
> -; AVX-NEXT:  .LBB5_4: # %else2
> -; AVX-NEXT:    testb $4, %al
> -; AVX-NEXT:    jne .LBB5_5
> -; AVX-NEXT:  .LBB5_6: # %else4
> -; AVX-NEXT:    testb $8, %al
> -; AVX-NEXT:    jne .LBB5_7
> -; AVX-NEXT:  .LBB5_8: # %else6
> -; AVX-NEXT:    vzeroupper
> -; AVX-NEXT:    retq
> -; AVX-NEXT:  .LBB5_1: # %cond.store
> -; AVX-NEXT:    vpextrb $0, %xmm0, (%rdi)
> -; AVX-NEXT:    testb $2, %al
> -; AVX-NEXT:    je .LBB5_4
> -; AVX-NEXT:  .LBB5_3: # %cond.store1
> -; AVX-NEXT:    vpextrb $4, %xmm0, 1(%rdi)
> -; AVX-NEXT:    testb $4, %al
> -; AVX-NEXT:    je .LBB5_6
> -; AVX-NEXT:  .LBB5_5: # %cond.store3
> -; AVX-NEXT:    vpextrb $8, %xmm0, 2(%rdi)
> -; AVX-NEXT:    testb $8, %al
> -; AVX-NEXT:    je .LBB5_8
> -; AVX-NEXT:  .LBB5_7: # %cond.store5
> -; AVX-NEXT:    vpextrb $12, %xmm0, 3(%rdi)
> -; AVX-NEXT:    vzeroupper
> -; AVX-NEXT:    retq
> +; AVX1-LABEL: truncstore_v4i64_v4i8:
> +; AVX1:       # %bb.0:
> +; AVX1-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> +; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm3
> +; AVX1-NEXT:    vmovdqa {{.*#+}} xmm4 = <0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u>
> +; AVX1-NEXT:    vpshufb %xmm4, %xmm3, %xmm3
> +; AVX1-NEXT:    vpshufb %xmm4, %xmm0, %xmm0
> +; AVX1-NEXT:    vpunpcklwd {{.*#+}} xmm0 = xmm0[0],xmm3[0],xmm0[1],xmm3[1],xmm0[2],xmm3[2],xmm0[3],xmm3[3]
> +; AVX1-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
> +; AVX1-NEXT:    vmovmskps %xmm1, %eax
> +; AVX1-NEXT:    xorl $15, %eax
> +; AVX1-NEXT:    testb $1, %al
> +; AVX1-NEXT:    jne .LBB5_1
> +; AVX1-NEXT:  # %bb.2: # %else
> +; AVX1-NEXT:    testb $2, %al
> +; AVX1-NEXT:    jne .LBB5_3
> +; AVX1-NEXT:  .LBB5_4: # %else2
> +; AVX1-NEXT:    testb $4, %al
> +; AVX1-NEXT:    jne .LBB5_5
> +; AVX1-NEXT:  .LBB5_6: # %else4
> +; AVX1-NEXT:    testb $8, %al
> +; AVX1-NEXT:    jne .LBB5_7
> +; AVX1-NEXT:  .LBB5_8: # %else6
> +; AVX1-NEXT:    vzeroupper
> +; AVX1-NEXT:    retq
> +; AVX1-NEXT:  .LBB5_1: # %cond.store
> +; AVX1-NEXT:    vpextrb $0, %xmm0, (%rdi)
> +; AVX1-NEXT:    testb $2, %al
> +; AVX1-NEXT:    je .LBB5_4
> +; AVX1-NEXT:  .LBB5_3: # %cond.store1
> +; AVX1-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> +; AVX1-NEXT:    testb $4, %al
> +; AVX1-NEXT:    je .LBB5_6
> +; AVX1-NEXT:  .LBB5_5: # %cond.store3
> +; AVX1-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> +; AVX1-NEXT:    testb $8, %al
> +; AVX1-NEXT:    je .LBB5_8
> +; AVX1-NEXT:  .LBB5_7: # %cond.store5
> +; AVX1-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> +; AVX1-NEXT:    vzeroupper
> +; AVX1-NEXT:    retq
> +;
> +; AVX2-LABEL: truncstore_v4i64_v4i8:
> +; AVX2:       # %bb.0:
> +; AVX2-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> +; AVX2-NEXT:    vextracti128 $1, %ymm0, %xmm3
> +; AVX2-NEXT:    vmovdqa {{.*#+}} xmm4 = <0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u>
> +; AVX2-NEXT:    vpshufb %xmm4, %xmm3, %xmm3
> +; AVX2-NEXT:    vpshufb %xmm4, %xmm0, %xmm0
> +; AVX2-NEXT:    vpunpcklwd {{.*#+}} xmm0 = xmm0[0],xmm3[0],xmm0[1],xmm3[1],xmm0[2],xmm3[2],xmm0[3],xmm3[3]
> +; AVX2-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
> +; AVX2-NEXT:    vmovmskps %xmm1, %eax
> +; AVX2-NEXT:    xorl $15, %eax
> +; AVX2-NEXT:    testb $1, %al
> +; AVX2-NEXT:    jne .LBB5_1
> +; AVX2-NEXT:  # %bb.2: # %else
> +; AVX2-NEXT:    testb $2, %al
> +; AVX2-NEXT:    jne .LBB5_3
> +; AVX2-NEXT:  .LBB5_4: # %else2
> +; AVX2-NEXT:    testb $4, %al
> +; AVX2-NEXT:    jne .LBB5_5
> +; AVX2-NEXT:  .LBB5_6: # %else4
> +; AVX2-NEXT:    testb $8, %al
> +; AVX2-NEXT:    jne .LBB5_7
> +; AVX2-NEXT:  .LBB5_8: # %else6
> +; AVX2-NEXT:    vzeroupper
> +; AVX2-NEXT:    retq
> +; AVX2-NEXT:  .LBB5_1: # %cond.store
> +; AVX2-NEXT:    vpextrb $0, %xmm0, (%rdi)
> +; AVX2-NEXT:    testb $2, %al
> +; AVX2-NEXT:    je .LBB5_4
> +; AVX2-NEXT:  .LBB5_3: # %cond.store1
> +; AVX2-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
> +; AVX2-NEXT:    testb $4, %al
> +; AVX2-NEXT:    je .LBB5_6
> +; AVX2-NEXT:  .LBB5_5: # %cond.store3
> +; AVX2-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
> +; AVX2-NEXT:    testb $8, %al
> +; AVX2-NEXT:    je .LBB5_8
> +; AVX2-NEXT:  .LBB5_7: # %cond.store5
> +; AVX2-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
> +; AVX2-NEXT:    vzeroupper
> +; AVX2-NEXT:    retq
>  ;
>  ; AVX512F-LABEL: truncstore_v4i64_v4i8:
>  ; AVX512F:       # %bb.0:
>  ; AVX512F-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
>  ; AVX512F-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
>  ; AVX512F-NEXT:    vptestmd %zmm1, %zmm1, %k0
> -; AVX512F-NEXT:    vpmovqd %zmm0, %ymm0
> +; AVX512F-NEXT:    vpmovqb %zmm0, %xmm0
>  ; AVX512F-NEXT:    kmovw %k0, %eax
>  ; AVX512F-NEXT:    testb $1, %al
>  ; AVX512F-NEXT:    jne .LBB5_1
> @@ -1464,15 +1569,15 @@ define void @truncstore_v4i64_v4i8(<4 x
>  ; AVX512F-NEXT:    testb $2, %al
>  ; AVX512F-NEXT:    je .LBB5_4
>  ; AVX512F-NEXT:  .LBB5_3: # %cond.store1
> -; AVX512F-NEXT:    vpextrb $4, %xmm0, 1(%rdi)
> +; AVX512F-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX512F-NEXT:    testb $4, %al
>  ; AVX512F-NEXT:    je .LBB5_6
>  ; AVX512F-NEXT:  .LBB5_5: # %cond.store3
> -; AVX512F-NEXT:    vpextrb $8, %xmm0, 2(%rdi)
> +; AVX512F-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
>  ; AVX512F-NEXT:    testb $8, %al
>  ; AVX512F-NEXT:    je .LBB5_8
>  ; AVX512F-NEXT:  .LBB5_7: # %cond.store5
> -; AVX512F-NEXT:    vpextrb $12, %xmm0, 3(%rdi)
> +; AVX512F-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
>  ; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
>  ;
> @@ -1481,10 +1586,9 @@ define void @truncstore_v4i64_v4i8(<4 x
>  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
>  ; AVX512BW-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
>  ; AVX512BW-NEXT:    vptestmd %zmm1, %zmm1, %k0
> -; AVX512BW-NEXT:    vpmovqd %zmm0, %ymm0
> -; AVX512BW-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
>  ; AVX512BW-NEXT:    kshiftlq $60, %k0, %k0
>  ; AVX512BW-NEXT:    kshiftrq $60, %k0, %k1
> +; AVX512BW-NEXT:    vpmovqb %zmm0, %xmm0
>  ; AVX512BW-NEXT:    vmovdqu8 %zmm0, (%rdi) {%k1}
>  ; AVX512BW-NEXT:    vzeroupper
>  ; AVX512BW-NEXT:    retq
> @@ -1505,6 +1609,7 @@ define void @truncstore_v2i64_v2i32(<2 x
>  ; SSE2-LABEL: truncstore_v2i64_v2i32:
>  ; SSE2:       # %bb.0:
>  ; SSE2-NEXT:    pxor %xmm2, %xmm2
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
>  ; SSE2-NEXT:    pcmpeqd %xmm1, %xmm2
>  ; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm2[1,0,3,2]
>  ; SSE2-NEXT:    pand %xmm2, %xmm1
> @@ -1522,13 +1627,14 @@ define void @truncstore_v2i64_v2i32(<2 x
>  ; SSE2-NEXT:    testb $2, %al
>  ; SSE2-NEXT:    je .LBB6_4
>  ; SSE2-NEXT:  .LBB6_3: # %cond.store1
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[2,3,0,1]
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,1,2,3]
>  ; SSE2-NEXT:    movd %xmm0, 4(%rdi)
>  ; SSE2-NEXT:    retq
>  ;
>  ; SSE4-LABEL: truncstore_v2i64_v2i32:
>  ; SSE4:       # %bb.0:
>  ; SSE4-NEXT:    pxor %xmm2, %xmm2
> +; SSE4-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
>  ; SSE4-NEXT:    pcmpeqq %xmm1, %xmm2
>  ; SSE4-NEXT:    movmskpd %xmm2, %eax
>  ; SSE4-NEXT:    xorl $3, %eax
> @@ -1540,11 +1646,11 @@ define void @truncstore_v2i64_v2i32(<2 x
>  ; SSE4-NEXT:  .LBB6_4: # %else2
>  ; SSE4-NEXT:    retq
>  ; SSE4-NEXT:  .LBB6_1: # %cond.store
> -; SSE4-NEXT:    movss %xmm0, (%rdi)
> +; SSE4-NEXT:    movd %xmm0, (%rdi)
>  ; SSE4-NEXT:    testb $2, %al
>  ; SSE4-NEXT:    je .LBB6_4
>  ; SSE4-NEXT:  .LBB6_3: # %cond.store1
> -; SSE4-NEXT:    extractps $2, %xmm0, 4(%rdi)
> +; SSE4-NEXT:    pextrd $1, %xmm0, 4(%rdi)
>  ; SSE4-NEXT:    retq
>  ;
>  ; AVX1-LABEL: truncstore_v2i64_v2i32:
> @@ -1573,9 +1679,9 @@ define void @truncstore_v2i64_v2i32(<2 x
>  ; AVX512F:       # %bb.0:
>  ; AVX512F-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
>  ; AVX512F-NEXT:    vptestmq %zmm1, %zmm1, %k0
> -; AVX512F-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
>  ; AVX512F-NEXT:    kshiftlw $14, %k0, %k0
>  ; AVX512F-NEXT:    kshiftrw $14, %k0, %k1
> +; AVX512F-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
>  ; AVX512F-NEXT:    vmovdqu32 %zmm0, (%rdi) {%k1}
>  ; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
> @@ -1590,9 +1696,9 @@ define void @truncstore_v2i64_v2i32(<2 x
>  ; AVX512BW:       # %bb.0:
>  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
>  ; AVX512BW-NEXT:    vptestmq %zmm1, %zmm1, %k0
> -; AVX512BW-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
>  ; AVX512BW-NEXT:    kshiftlw $14, %k0, %k0
>  ; AVX512BW-NEXT:    kshiftrw $14, %k0, %k1
> +; AVX512BW-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
>  ; AVX512BW-NEXT:    vmovdqu32 %zmm0, (%rdi) {%k1}
>  ; AVX512BW-NEXT:    vzeroupper
>  ; AVX512BW-NEXT:    retq
> @@ -1606,6 +1712,8 @@ define void @truncstore_v2i64_v2i16(<2 x
>  ; SSE2-LABEL: truncstore_v2i64_v2i16:
>  ; SSE2:       # %bb.0:
>  ; SSE2-NEXT:    pxor %xmm2, %xmm2
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> +; SSE2-NEXT:    pshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
>  ; SSE2-NEXT:    pcmpeqd %xmm1, %xmm2
>  ; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm2[1,0,3,2]
>  ; SSE2-NEXT:    pand %xmm2, %xmm1
> @@ -1624,13 +1732,15 @@ define void @truncstore_v2i64_v2i16(<2 x
>  ; SSE2-NEXT:    testb $2, %al
>  ; SSE2-NEXT:    je .LBB7_4
>  ; SSE2-NEXT:  .LBB7_3: # %cond.store1
> -; SSE2-NEXT:    pextrw $4, %xmm0, %eax
> +; SSE2-NEXT:    pextrw $1, %xmm0, %eax
>  ; SSE2-NEXT:    movw %ax, 2(%rdi)
>  ; SSE2-NEXT:    retq
>  ;
>  ; SSE4-LABEL: truncstore_v2i64_v2i16:
>  ; SSE4:       # %bb.0:
>  ; SSE4-NEXT:    pxor %xmm2, %xmm2
> +; SSE4-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> +; SSE4-NEXT:    pshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
>  ; SSE4-NEXT:    pcmpeqq %xmm1, %xmm2
>  ; SSE4-NEXT:    movmskpd %xmm2, %eax
>  ; SSE4-NEXT:    xorl $3, %eax
> @@ -1646,12 +1756,14 @@ define void @truncstore_v2i64_v2i16(<2 x
>  ; SSE4-NEXT:    testb $2, %al
>  ; SSE4-NEXT:    je .LBB7_4
>  ; SSE4-NEXT:  .LBB7_3: # %cond.store1
> -; SSE4-NEXT:    pextrw $4, %xmm0, 2(%rdi)
> +; SSE4-NEXT:    pextrw $1, %xmm0, 2(%rdi)
>  ; SSE4-NEXT:    retq
>  ;
>  ; AVX-LABEL: truncstore_v2i64_v2i16:
>  ; AVX:       # %bb.0:
>  ; AVX-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> +; AVX-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> +; AVX-NEXT:    vpshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
>  ; AVX-NEXT:    vpcmpeqq %xmm2, %xmm1, %xmm1
>  ; AVX-NEXT:    vmovmskpd %xmm1, %eax
>  ; AVX-NEXT:    xorl $3, %eax
> @@ -1667,13 +1779,15 @@ define void @truncstore_v2i64_v2i16(<2 x
>  ; AVX-NEXT:    testb $2, %al
>  ; AVX-NEXT:    je .LBB7_4
>  ; AVX-NEXT:  .LBB7_3: # %cond.store1
> -; AVX-NEXT:    vpextrw $4, %xmm0, 2(%rdi)
> +; AVX-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
>  ; AVX-NEXT:    retq
>  ;
>  ; AVX512F-LABEL: truncstore_v2i64_v2i16:
>  ; AVX512F:       # %bb.0:
>  ; AVX512F-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
>  ; AVX512F-NEXT:    vptestmq %zmm1, %zmm1, %k0
> +; AVX512F-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> +; AVX512F-NEXT:    vpshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
>  ; AVX512F-NEXT:    kmovw %k0, %eax
>  ; AVX512F-NEXT:    testb $1, %al
>  ; AVX512F-NEXT:    jne .LBB7_1
> @@ -1688,7 +1802,7 @@ define void @truncstore_v2i64_v2i16(<2 x
>  ; AVX512F-NEXT:    testb $2, %al
>  ; AVX512F-NEXT:    je .LBB7_4
>  ; AVX512F-NEXT:  .LBB7_3: # %cond.store1
> -; AVX512F-NEXT:    vpextrw $4, %xmm0, 2(%rdi)
> +; AVX512F-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
>  ; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
>  ;
> @@ -1696,10 +1810,10 @@ define void @truncstore_v2i64_v2i16(<2 x
>  ; AVX512BW:       # %bb.0:
>  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
>  ; AVX512BW-NEXT:    vptestmq %zmm1, %zmm1, %k0
> -; AVX512BW-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> -; AVX512BW-NEXT:    vpshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
>  ; AVX512BW-NEXT:    kshiftld $30, %k0, %k0
>  ; AVX512BW-NEXT:    kshiftrd $30, %k0, %k1
> +; AVX512BW-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> +; AVX512BW-NEXT:    vpshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
>  ; AVX512BW-NEXT:    vmovdqu16 %zmm0, (%rdi) {%k1}
>  ; AVX512BW-NEXT:    vzeroupper
>  ; AVX512BW-NEXT:    retq
> @@ -1719,12 +1833,17 @@ define void @truncstore_v2i64_v2i8(<2 x
>  ; SSE2-LABEL: truncstore_v2i64_v2i8:
>  ; SSE2:       # %bb.0:
>  ; SSE2-NEXT:    pxor %xmm2, %xmm2
> +; SSE2-NEXT:    pand {{.*}}(%rip), %xmm0
> +; SSE2-NEXT:    packuswb %xmm0, %xmm0
> +; SSE2-NEXT:    packuswb %xmm0, %xmm0
> +; SSE2-NEXT:    packuswb %xmm0, %xmm0
>  ; SSE2-NEXT:    pcmpeqd %xmm1, %xmm2
>  ; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm2[1,0,3,2]
>  ; SSE2-NEXT:    pand %xmm2, %xmm1
>  ; SSE2-NEXT:    movmskpd %xmm1, %eax
>  ; SSE2-NEXT:    xorl $3, %eax
>  ; SSE2-NEXT:    testb $1, %al
> +; SSE2-NEXT:    movd %xmm0, %ecx
>  ; SSE2-NEXT:    jne .LBB8_1
>  ; SSE2-NEXT:  # %bb.2: # %else
>  ; SSE2-NEXT:    testb $2, %al
> @@ -1732,18 +1851,17 @@ define void @truncstore_v2i64_v2i8(<2 x
>  ; SSE2-NEXT:  .LBB8_4: # %else2
>  ; SSE2-NEXT:    retq
>  ; SSE2-NEXT:  .LBB8_1: # %cond.store
> -; SSE2-NEXT:    movd %xmm0, %ecx
>  ; SSE2-NEXT:    movb %cl, (%rdi)
>  ; SSE2-NEXT:    testb $2, %al
>  ; SSE2-NEXT:    je .LBB8_4
>  ; SSE2-NEXT:  .LBB8_3: # %cond.store1
> -; SSE2-NEXT:    pextrw $4, %xmm0, %eax
> -; SSE2-NEXT:    movb %al, 1(%rdi)
> +; SSE2-NEXT:    movb %ch, 1(%rdi)
>  ; SSE2-NEXT:    retq
>  ;
>  ; SSE4-LABEL: truncstore_v2i64_v2i8:
>  ; SSE4:       # %bb.0:
>  ; SSE4-NEXT:    pxor %xmm2, %xmm2
> +; SSE4-NEXT:    pshufb {{.*#+}} xmm0 = xmm0[0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u]
>  ; SSE4-NEXT:    pcmpeqq %xmm1, %xmm2
>  ; SSE4-NEXT:    movmskpd %xmm2, %eax
>  ; SSE4-NEXT:    xorl $3, %eax
> @@ -1759,12 +1877,13 @@ define void @truncstore_v2i64_v2i8(<2 x
>  ; SSE4-NEXT:    testb $2, %al
>  ; SSE4-NEXT:    je .LBB8_4
>  ; SSE4-NEXT:  .LBB8_3: # %cond.store1
> -; SSE4-NEXT:    pextrb $8, %xmm0, 1(%rdi)
> +; SSE4-NEXT:    pextrb $1, %xmm0, 1(%rdi)
>  ; SSE4-NEXT:    retq
>  ;
>  ; AVX-LABEL: truncstore_v2i64_v2i8:
>  ; AVX:       # %bb.0:
>  ; AVX-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> +; AVX-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u]
>  ; AVX-NEXT:    vpcmpeqq %xmm2, %xmm1, %xmm1
>  ; AVX-NEXT:    vmovmskpd %xmm1, %eax
>  ; AVX-NEXT:    xorl $3, %eax
> @@ -1780,13 +1899,14 @@ define void @truncstore_v2i64_v2i8(<2 x
>  ; AVX-NEXT:    testb $2, %al
>  ; AVX-NEXT:    je .LBB8_4
>  ; AVX-NEXT:  .LBB8_3: # %cond.store1
> -; AVX-NEXT:    vpextrb $8, %xmm0, 1(%rdi)
> +; AVX-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX-NEXT:    retq
>  ;
>  ; AVX512F-LABEL: truncstore_v2i64_v2i8:
>  ; AVX512F:       # %bb.0:
>  ; AVX512F-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
>  ; AVX512F-NEXT:    vptestmq %zmm1, %zmm1, %k0
> +; AVX512F-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u]
>  ; AVX512F-NEXT:    kmovw %k0, %eax
>  ; AVX512F-NEXT:    testb $1, %al
>  ; AVX512F-NEXT:    jne .LBB8_1
> @@ -1801,7 +1921,7 @@ define void @truncstore_v2i64_v2i8(<2 x
>  ; AVX512F-NEXT:    testb $2, %al
>  ; AVX512F-NEXT:    je .LBB8_4
>  ; AVX512F-NEXT:  .LBB8_3: # %cond.store1
> -; AVX512F-NEXT:    vpextrb $8, %xmm0, 1(%rdi)
> +; AVX512F-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
>  ;
> @@ -1809,9 +1929,9 @@ define void @truncstore_v2i64_v2i8(<2 x
>  ; AVX512BW:       # %bb.0:
>  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
>  ; AVX512BW-NEXT:    vptestmq %zmm1, %zmm1, %k0
> -; AVX512BW-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u]
>  ; AVX512BW-NEXT:    kshiftlq $62, %k0, %k0
>  ; AVX512BW-NEXT:    kshiftrq $62, %k0, %k1
> +; AVX512BW-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u]
>  ; AVX512BW-NEXT:    vmovdqu8 %zmm0, (%rdi) {%k1}
>  ; AVX512BW-NEXT:    vzeroupper
>  ; AVX512BW-NEXT:    retq
> @@ -3593,11 +3713,11 @@ define void @truncstore_v8i32_v8i8(<8 x
>  ; SSE2-LABEL: truncstore_v8i32_v8i8:
>  ; SSE2:       # %bb.0:
>  ; SSE2-NEXT:    pxor %xmm4, %xmm4
> -; SSE2-NEXT:    pslld $16, %xmm1
> -; SSE2-NEXT:    psrad $16, %xmm1
> -; SSE2-NEXT:    pslld $16, %xmm0
> -; SSE2-NEXT:    psrad $16, %xmm0
> -; SSE2-NEXT:    packssdw %xmm1, %xmm0
> +; SSE2-NEXT:    movdqa {{.*#+}} xmm5 = [255,0,0,0,255,0,0,0,255,0,0,0,255,0,0,0]
> +; SSE2-NEXT:    pand %xmm5, %xmm1
> +; SSE2-NEXT:    pand %xmm5, %xmm0
> +; SSE2-NEXT:    packuswb %xmm1, %xmm0
> +; SSE2-NEXT:    packuswb %xmm0, %xmm0
>  ; SSE2-NEXT:    pcmpeqd %xmm4, %xmm3
>  ; SSE2-NEXT:    pcmpeqd %xmm1, %xmm1
>  ; SSE2-NEXT:    pxor %xmm1, %xmm3
> @@ -3617,17 +3737,26 @@ define void @truncstore_v8i32_v8i8(<8 x
>  ; SSE2-NEXT:    jne .LBB12_5
>  ; SSE2-NEXT:  .LBB12_6: # %else4
>  ; SSE2-NEXT:    testb $8, %al
> -; SSE2-NEXT:    jne .LBB12_7
> +; SSE2-NEXT:    je .LBB12_8
> +; SSE2-NEXT:  .LBB12_7: # %cond.store5
> +; SSE2-NEXT:    shrl $24, %ecx
> +; SSE2-NEXT:    movb %cl, 3(%rdi)
>  ; SSE2-NEXT:  .LBB12_8: # %else6
>  ; SSE2-NEXT:    testb $16, %al
> -; SSE2-NEXT:    jne .LBB12_9
> +; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> +; SSE2-NEXT:    je .LBB12_10
> +; SSE2-NEXT:  # %bb.9: # %cond.store7
> +; SSE2-NEXT:    movb %cl, 4(%rdi)
>  ; SSE2-NEXT:  .LBB12_10: # %else8
>  ; SSE2-NEXT:    testb $32, %al
> -; SSE2-NEXT:    jne .LBB12_11
> +; SSE2-NEXT:    je .LBB12_12
> +; SSE2-NEXT:  # %bb.11: # %cond.store9
> +; SSE2-NEXT:    movb %ch, 5(%rdi)
>  ; SSE2-NEXT:  .LBB12_12: # %else10
>  ; SSE2-NEXT:    testb $64, %al
> +; SSE2-NEXT:    pextrw $3, %xmm0, %ecx
>  ; SSE2-NEXT:    jne .LBB12_13
> -; SSE2-NEXT:  .LBB12_14: # %else12
> +; SSE2-NEXT:  # %bb.14: # %else12
>  ; SSE2-NEXT:    testb $-128, %al
>  ; SSE2-NEXT:    jne .LBB12_15
>  ; SSE2-NEXT:  .LBB12_16: # %else14
> @@ -3637,47 +3766,31 @@ define void @truncstore_v8i32_v8i8(<8 x
>  ; SSE2-NEXT:    testb $2, %al
>  ; SSE2-NEXT:    je .LBB12_4
>  ; SSE2-NEXT:  .LBB12_3: # %cond.store1
> -; SSE2-NEXT:    shrl $16, %ecx
> -; SSE2-NEXT:    movb %cl, 1(%rdi)
> +; SSE2-NEXT:    movb %ch, 1(%rdi)
>  ; SSE2-NEXT:    testb $4, %al
>  ; SSE2-NEXT:    je .LBB12_6
>  ; SSE2-NEXT:  .LBB12_5: # %cond.store3
> -; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> -; SSE2-NEXT:    movb %cl, 2(%rdi)
> +; SSE2-NEXT:    movl %ecx, %edx
> +; SSE2-NEXT:    shrl $16, %edx
> +; SSE2-NEXT:    movb %dl, 2(%rdi)
>  ; SSE2-NEXT:    testb $8, %al
> -; SSE2-NEXT:    je .LBB12_8
> -; SSE2-NEXT:  .LBB12_7: # %cond.store5
> -; SSE2-NEXT:    pextrw $3, %xmm0, %ecx
> -; SSE2-NEXT:    movb %cl, 3(%rdi)
> -; SSE2-NEXT:    testb $16, %al
> -; SSE2-NEXT:    je .LBB12_10
> -; SSE2-NEXT:  .LBB12_9: # %cond.store7
> -; SSE2-NEXT:    pextrw $4, %xmm0, %ecx
> -; SSE2-NEXT:    movb %cl, 4(%rdi)
> -; SSE2-NEXT:    testb $32, %al
> -; SSE2-NEXT:    je .LBB12_12
> -; SSE2-NEXT:  .LBB12_11: # %cond.store9
> -; SSE2-NEXT:    pextrw $5, %xmm0, %ecx
> -; SSE2-NEXT:    movb %cl, 5(%rdi)
> -; SSE2-NEXT:    testb $64, %al
> -; SSE2-NEXT:    je .LBB12_14
> +; SSE2-NEXT:    jne .LBB12_7
> +; SSE2-NEXT:    jmp .LBB12_8
>  ; SSE2-NEXT:  .LBB12_13: # %cond.store11
> -; SSE2-NEXT:    pextrw $6, %xmm0, %ecx
>  ; SSE2-NEXT:    movb %cl, 6(%rdi)
>  ; SSE2-NEXT:    testb $-128, %al
>  ; SSE2-NEXT:    je .LBB12_16
>  ; SSE2-NEXT:  .LBB12_15: # %cond.store13
> -; SSE2-NEXT:    pextrw $7, %xmm0, %eax
> -; SSE2-NEXT:    movb %al, 7(%rdi)
> +; SSE2-NEXT:    movb %ch, 7(%rdi)
>  ; SSE2-NEXT:    retq
>  ;
>  ; SSE4-LABEL: truncstore_v8i32_v8i8:
>  ; SSE4:       # %bb.0:
>  ; SSE4-NEXT:    pxor %xmm4, %xmm4
> -; SSE4-NEXT:    movdqa {{.*#+}} xmm5 = [0,1,4,5,8,9,12,13,8,9,12,13,12,13,14,15]
> +; SSE4-NEXT:    movdqa {{.*#+}} xmm5 = <0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u>
>  ; SSE4-NEXT:    pshufb %xmm5, %xmm1
>  ; SSE4-NEXT:    pshufb %xmm5, %xmm0
> -; SSE4-NEXT:    punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]
> +; SSE4-NEXT:    punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
>  ; SSE4-NEXT:    pcmpeqd %xmm4, %xmm3
>  ; SSE4-NEXT:    pcmpeqd %xmm1, %xmm1
>  ; SSE4-NEXT:    pxor %xmm1, %xmm3
> @@ -3716,40 +3829,40 @@ define void @truncstore_v8i32_v8i8(<8 x
>  ; SSE4-NEXT:    testb $2, %al
>  ; SSE4-NEXT:    je .LBB12_4
>  ; SSE4-NEXT:  .LBB12_3: # %cond.store1
> -; SSE4-NEXT:    pextrb $2, %xmm0, 1(%rdi)
> +; SSE4-NEXT:    pextrb $1, %xmm0, 1(%rdi)
>  ; SSE4-NEXT:    testb $4, %al
>  ; SSE4-NEXT:    je .LBB12_6
>  ; SSE4-NEXT:  .LBB12_5: # %cond.store3
> -; SSE4-NEXT:    pextrb $4, %xmm0, 2(%rdi)
> +; SSE4-NEXT:    pextrb $2, %xmm0, 2(%rdi)
>  ; SSE4-NEXT:    testb $8, %al
>  ; SSE4-NEXT:    je .LBB12_8
>  ; SSE4-NEXT:  .LBB12_7: # %cond.store5
> -; SSE4-NEXT:    pextrb $6, %xmm0, 3(%rdi)
> +; SSE4-NEXT:    pextrb $3, %xmm0, 3(%rdi)
>  ; SSE4-NEXT:    testb $16, %al
>  ; SSE4-NEXT:    je .LBB12_10
>  ; SSE4-NEXT:  .LBB12_9: # %cond.store7
> -; SSE4-NEXT:    pextrb $8, %xmm0, 4(%rdi)
> +; SSE4-NEXT:    pextrb $4, %xmm0, 4(%rdi)
>  ; SSE4-NEXT:    testb $32, %al
>  ; SSE4-NEXT:    je .LBB12_12
>  ; SSE4-NEXT:  .LBB12_11: # %cond.store9
> -; SSE4-NEXT:    pextrb $10, %xmm0, 5(%rdi)
> +; SSE4-NEXT:    pextrb $5, %xmm0, 5(%rdi)
>  ; SSE4-NEXT:    testb $64, %al
>  ; SSE4-NEXT:    je .LBB12_14
>  ; SSE4-NEXT:  .LBB12_13: # %cond.store11
> -; SSE4-NEXT:    pextrb $12, %xmm0, 6(%rdi)
> +; SSE4-NEXT:    pextrb $6, %xmm0, 6(%rdi)
>  ; SSE4-NEXT:    testb $-128, %al
>  ; SSE4-NEXT:    je .LBB12_16
>  ; SSE4-NEXT:  .LBB12_15: # %cond.store13
> -; SSE4-NEXT:    pextrb $14, %xmm0, 7(%rdi)
> +; SSE4-NEXT:    pextrb $7, %xmm0, 7(%rdi)
>  ; SSE4-NEXT:    retq
>  ;
>  ; AVX1-LABEL: truncstore_v8i32_v8i8:
>  ; AVX1:       # %bb.0:
>  ; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm2
> -; AVX1-NEXT:    vmovdqa {{.*#+}} xmm3 = [0,1,4,5,8,9,12,13,8,9,12,13,12,13,14,15]
> +; AVX1-NEXT:    vmovdqa {{.*#+}} xmm3 = <0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u>
>  ; AVX1-NEXT:    vpshufb %xmm3, %xmm2, %xmm2
>  ; AVX1-NEXT:    vpshufb %xmm3, %xmm0, %xmm0
> -; AVX1-NEXT:    vpunpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]
> +; AVX1-NEXT:    vpunpckldq {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1]
>  ; AVX1-NEXT:    vextractf128 $1, %ymm1, %xmm2
>  ; AVX1-NEXT:    vpxor %xmm3, %xmm3, %xmm3
>  ; AVX1-NEXT:    vpcmpeqd %xmm3, %xmm2, %xmm2
> @@ -3788,39 +3901,42 @@ define void @truncstore_v8i32_v8i8(<8 x
>  ; AVX1-NEXT:    testb $2, %al
>  ; AVX1-NEXT:    je .LBB12_4
>  ; AVX1-NEXT:  .LBB12_3: # %cond.store1
> -; AVX1-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> +; AVX1-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX1-NEXT:    testb $4, %al
>  ; AVX1-NEXT:    je .LBB12_6
>  ; AVX1-NEXT:  .LBB12_5: # %cond.store3
> -; AVX1-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> +; AVX1-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
>  ; AVX1-NEXT:    testb $8, %al
>  ; AVX1-NEXT:    je .LBB12_8
>  ; AVX1-NEXT:  .LBB12_7: # %cond.store5
> -; AVX1-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> +; AVX1-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
>  ; AVX1-NEXT:    testb $16, %al
>  ; AVX1-NEXT:    je .LBB12_10
>  ; AVX1-NEXT:  .LBB12_9: # %cond.store7
> -; AVX1-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> +; AVX1-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
>  ; AVX1-NEXT:    testb $32, %al
>  ; AVX1-NEXT:    je .LBB12_12
>  ; AVX1-NEXT:  .LBB12_11: # %cond.store9
> -; AVX1-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> +; AVX1-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
>  ; AVX1-NEXT:    testb $64, %al
>  ; AVX1-NEXT:    je .LBB12_14
>  ; AVX1-NEXT:  .LBB12_13: # %cond.store11
> -; AVX1-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> +; AVX1-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
>  ; AVX1-NEXT:    testb $-128, %al
>  ; AVX1-NEXT:    je .LBB12_16
>  ; AVX1-NEXT:  .LBB12_15: # %cond.store13
> -; AVX1-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> +; AVX1-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
>  ; AVX1-NEXT:    vzeroupper
>  ; AVX1-NEXT:    retq
>  ;
>  ; AVX2-LABEL: truncstore_v8i32_v8i8:
>  ; AVX2:       # %bb.0:
>  ; AVX2-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> -; AVX2-NEXT:    vpshufb {{.*#+}} ymm0 = ymm0[0,1,4,5,8,9,12,13,8,9,12,13,12,13,14,15,16,17,20,21,24,25,28,29,24,25,28,29,28,29,30,31]
> -; AVX2-NEXT:    vpermq {{.*#+}} ymm0 = ymm0[0,2,2,3]
> +; AVX2-NEXT:    vextracti128 $1, %ymm0, %xmm3
> +; AVX2-NEXT:    vmovdqa {{.*#+}} xmm4 = <0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u>
> +; AVX2-NEXT:    vpshufb %xmm4, %xmm3, %xmm3
> +; AVX2-NEXT:    vpshufb %xmm4, %xmm0, %xmm0
> +; AVX2-NEXT:    vpunpckldq {{.*#+}} xmm0 = xmm0[0],xmm3[0],xmm0[1],xmm3[1]
>  ; AVX2-NEXT:    vpcmpeqd %ymm2, %ymm1, %ymm1
>  ; AVX2-NEXT:    vmovmskps %ymm1, %eax
>  ; AVX2-NEXT:    notl %eax
> @@ -3855,31 +3971,31 @@ define void @truncstore_v8i32_v8i8(<8 x
>  ; AVX2-NEXT:    testb $2, %al
>  ; AVX2-NEXT:    je .LBB12_4
>  ; AVX2-NEXT:  .LBB12_3: # %cond.store1
> -; AVX2-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> +; AVX2-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX2-NEXT:    testb $4, %al
>  ; AVX2-NEXT:    je .LBB12_6
>  ; AVX2-NEXT:  .LBB12_5: # %cond.store3
> -; AVX2-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> +; AVX2-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
>  ; AVX2-NEXT:    testb $8, %al
>  ; AVX2-NEXT:    je .LBB12_8
>  ; AVX2-NEXT:  .LBB12_7: # %cond.store5
> -; AVX2-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> +; AVX2-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
>  ; AVX2-NEXT:    testb $16, %al
>  ; AVX2-NEXT:    je .LBB12_10
>  ; AVX2-NEXT:  .LBB12_9: # %cond.store7
> -; AVX2-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> +; AVX2-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
>  ; AVX2-NEXT:    testb $32, %al
>  ; AVX2-NEXT:    je .LBB12_12
>  ; AVX2-NEXT:  .LBB12_11: # %cond.store9
> -; AVX2-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> +; AVX2-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
>  ; AVX2-NEXT:    testb $64, %al
>  ; AVX2-NEXT:    je .LBB12_14
>  ; AVX2-NEXT:  .LBB12_13: # %cond.store11
> -; AVX2-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> +; AVX2-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
>  ; AVX2-NEXT:    testb $-128, %al
>  ; AVX2-NEXT:    je .LBB12_16
>  ; AVX2-NEXT:  .LBB12_15: # %cond.store13
> -; AVX2-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> +; AVX2-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
>  ; AVX2-NEXT:    vzeroupper
>  ; AVX2-NEXT:    retq
>  ;
> @@ -3888,7 +4004,7 @@ define void @truncstore_v8i32_v8i8(<8 x
>  ; AVX512F-NEXT:    # kill: def $ymm1 killed $ymm1 def $zmm1
>  ; AVX512F-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
>  ; AVX512F-NEXT:    vptestmd %zmm1, %zmm1, %k0
> -; AVX512F-NEXT:    vpmovdw %zmm0, %ymm0
> +; AVX512F-NEXT:    vpmovdb %zmm0, %xmm0
>  ; AVX512F-NEXT:    kmovw %k0, %eax
>  ; AVX512F-NEXT:    testb $1, %al
>  ; AVX512F-NEXT:    jne .LBB12_1
> @@ -3921,31 +4037,31 @@ define void @truncstore_v8i32_v8i8(<8 x
>  ; AVX512F-NEXT:    testb $2, %al
>  ; AVX512F-NEXT:    je .LBB12_4
>  ; AVX512F-NEXT:  .LBB12_3: # %cond.store1
> -; AVX512F-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> +; AVX512F-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX512F-NEXT:    testb $4, %al
>  ; AVX512F-NEXT:    je .LBB12_6
>  ; AVX512F-NEXT:  .LBB12_5: # %cond.store3
> -; AVX512F-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> +; AVX512F-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
>  ; AVX512F-NEXT:    testb $8, %al
>  ; AVX512F-NEXT:    je .LBB12_8
>  ; AVX512F-NEXT:  .LBB12_7: # %cond.store5
> -; AVX512F-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> +; AVX512F-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
>  ; AVX512F-NEXT:    testb $16, %al
>  ; AVX512F-NEXT:    je .LBB12_10
>  ; AVX512F-NEXT:  .LBB12_9: # %cond.store7
> -; AVX512F-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> +; AVX512F-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
>  ; AVX512F-NEXT:    testb $32, %al
>  ; AVX512F-NEXT:    je .LBB12_12
>  ; AVX512F-NEXT:  .LBB12_11: # %cond.store9
> -; AVX512F-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> +; AVX512F-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
>  ; AVX512F-NEXT:    testb $64, %al
>  ; AVX512F-NEXT:    je .LBB12_14
>  ; AVX512F-NEXT:  .LBB12_13: # %cond.store11
> -; AVX512F-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> +; AVX512F-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
>  ; AVX512F-NEXT:    testb $-128, %al
>  ; AVX512F-NEXT:    je .LBB12_16
>  ; AVX512F-NEXT:  .LBB12_15: # %cond.store13
> -; AVX512F-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> +; AVX512F-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
>  ; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
>  ;
> @@ -3954,10 +4070,9 @@ define void @truncstore_v8i32_v8i8(<8 x
>  ; AVX512BW-NEXT:    # kill: def $ymm1 killed $ymm1 def $zmm1
>  ; AVX512BW-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
>  ; AVX512BW-NEXT:    vptestmd %zmm1, %zmm1, %k0
> -; AVX512BW-NEXT:    vpmovdw %zmm0, %ymm0
> -; AVX512BW-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u]
>  ; AVX512BW-NEXT:    kshiftlq $56, %k0, %k0
>  ; AVX512BW-NEXT:    kshiftrq $56, %k0, %k1
> +; AVX512BW-NEXT:    vpmovdb %zmm0, %xmm0
>  ; AVX512BW-NEXT:    vmovdqu8 %zmm0, (%rdi) {%k1}
>  ; AVX512BW-NEXT:    vzeroupper
>  ; AVX512BW-NEXT:    retq
> @@ -3978,6 +4093,9 @@ define void @truncstore_v4i32_v4i16(<4 x
>  ; SSE2-LABEL: truncstore_v4i32_v4i16:
>  ; SSE2:       # %bb.0:
>  ; SSE2-NEXT:    pxor %xmm2, %xmm2
> +; SSE2-NEXT:    pshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> +; SSE2-NEXT:    pshufhw {{.*#+}} xmm0 = xmm0[0,1,2,3,4,6,6,7]
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
>  ; SSE2-NEXT:    pcmpeqd %xmm1, %xmm2
>  ; SSE2-NEXT:    movmskps %xmm2, %eax
>  ; SSE2-NEXT:    xorl $15, %eax
> @@ -4000,23 +4118,24 @@ define void @truncstore_v4i32_v4i16(<4 x
>  ; SSE2-NEXT:    testb $2, %al
>  ; SSE2-NEXT:    je .LBB13_4
>  ; SSE2-NEXT:  .LBB13_3: # %cond.store1
> -; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> +; SSE2-NEXT:    pextrw $1, %xmm0, %ecx
>  ; SSE2-NEXT:    movw %cx, 2(%rdi)
>  ; SSE2-NEXT:    testb $4, %al
>  ; SSE2-NEXT:    je .LBB13_6
>  ; SSE2-NEXT:  .LBB13_5: # %cond.store3
> -; SSE2-NEXT:    pextrw $4, %xmm0, %ecx
> +; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
>  ; SSE2-NEXT:    movw %cx, 4(%rdi)
>  ; SSE2-NEXT:    testb $8, %al
>  ; SSE2-NEXT:    je .LBB13_8
>  ; SSE2-NEXT:  .LBB13_7: # %cond.store5
> -; SSE2-NEXT:    pextrw $6, %xmm0, %eax
> +; SSE2-NEXT:    pextrw $3, %xmm0, %eax
>  ; SSE2-NEXT:    movw %ax, 6(%rdi)
>  ; SSE2-NEXT:    retq
>  ;
>  ; SSE4-LABEL: truncstore_v4i32_v4i16:
>  ; SSE4:       # %bb.0:
>  ; SSE4-NEXT:    pxor %xmm2, %xmm2
> +; SSE4-NEXT:    pshufb {{.*#+}} xmm0 = xmm0[0,1,4,5,8,9,12,13,8,9,12,13,12,13,14,15]
>  ; SSE4-NEXT:    pcmpeqd %xmm1, %xmm2
>  ; SSE4-NEXT:    movmskps %xmm2, %eax
>  ; SSE4-NEXT:    xorl $15, %eax
> @@ -4038,20 +4157,21 @@ define void @truncstore_v4i32_v4i16(<4 x
>  ; SSE4-NEXT:    testb $2, %al
>  ; SSE4-NEXT:    je .LBB13_4
>  ; SSE4-NEXT:  .LBB13_3: # %cond.store1
> -; SSE4-NEXT:    pextrw $2, %xmm0, 2(%rdi)
> +; SSE4-NEXT:    pextrw $1, %xmm0, 2(%rdi)
>  ; SSE4-NEXT:    testb $4, %al
>  ; SSE4-NEXT:    je .LBB13_6
>  ; SSE4-NEXT:  .LBB13_5: # %cond.store3
> -; SSE4-NEXT:    pextrw $4, %xmm0, 4(%rdi)
> +; SSE4-NEXT:    pextrw $2, %xmm0, 4(%rdi)
>  ; SSE4-NEXT:    testb $8, %al
>  ; SSE4-NEXT:    je .LBB13_8
>  ; SSE4-NEXT:  .LBB13_7: # %cond.store5
> -; SSE4-NEXT:    pextrw $6, %xmm0, 6(%rdi)
> +; SSE4-NEXT:    pextrw $3, %xmm0, 6(%rdi)
>  ; SSE4-NEXT:    retq
>  ;
>  ; AVX-LABEL: truncstore_v4i32_v4i16:
>  ; AVX:       # %bb.0:
>  ; AVX-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> +; AVX-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,1,4,5,8,9,12,13,8,9,12,13,12,13,14,15]
>  ; AVX-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
>  ; AVX-NEXT:    vmovmskps %xmm1, %eax
>  ; AVX-NEXT:    xorl $15, %eax
> @@ -4073,21 +4193,22 @@ define void @truncstore_v4i32_v4i16(<4 x
>  ; AVX-NEXT:    testb $2, %al
>  ; AVX-NEXT:    je .LBB13_4
>  ; AVX-NEXT:  .LBB13_3: # %cond.store1
> -; AVX-NEXT:    vpextrw $2, %xmm0, 2(%rdi)
> +; AVX-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
>  ; AVX-NEXT:    testb $4, %al
>  ; AVX-NEXT:    je .LBB13_6
>  ; AVX-NEXT:  .LBB13_5: # %cond.store3
> -; AVX-NEXT:    vpextrw $4, %xmm0, 4(%rdi)
> +; AVX-NEXT:    vpextrw $2, %xmm0, 4(%rdi)
>  ; AVX-NEXT:    testb $8, %al
>  ; AVX-NEXT:    je .LBB13_8
>  ; AVX-NEXT:  .LBB13_7: # %cond.store5
> -; AVX-NEXT:    vpextrw $6, %xmm0, 6(%rdi)
> +; AVX-NEXT:    vpextrw $3, %xmm0, 6(%rdi)
>  ; AVX-NEXT:    retq
>  ;
>  ; AVX512F-LABEL: truncstore_v4i32_v4i16:
>  ; AVX512F:       # %bb.0:
>  ; AVX512F-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
>  ; AVX512F-NEXT:    vptestmd %zmm1, %zmm1, %k0
> +; AVX512F-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,1,4,5,8,9,12,13,8,9,12,13,12,13,14,15]
>  ; AVX512F-NEXT:    kmovw %k0, %eax
>  ; AVX512F-NEXT:    testb $1, %al
>  ; AVX512F-NEXT:    jne .LBB13_1
> @@ -4108,15 +4229,15 @@ define void @truncstore_v4i32_v4i16(<4 x
>  ; AVX512F-NEXT:    testb $2, %al
>  ; AVX512F-NEXT:    je .LBB13_4
>  ; AVX512F-NEXT:  .LBB13_3: # %cond.store1
> -; AVX512F-NEXT:    vpextrw $2, %xmm0, 2(%rdi)
> +; AVX512F-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
>  ; AVX512F-NEXT:    testb $4, %al
>  ; AVX512F-NEXT:    je .LBB13_6
>  ; AVX512F-NEXT:  .LBB13_5: # %cond.store3
> -; AVX512F-NEXT:    vpextrw $4, %xmm0, 4(%rdi)
> +; AVX512F-NEXT:    vpextrw $2, %xmm0, 4(%rdi)
>  ; AVX512F-NEXT:    testb $8, %al
>  ; AVX512F-NEXT:    je .LBB13_8
>  ; AVX512F-NEXT:  .LBB13_7: # %cond.store5
> -; AVX512F-NEXT:    vpextrw $6, %xmm0, 6(%rdi)
> +; AVX512F-NEXT:    vpextrw $3, %xmm0, 6(%rdi)
>  ; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
>  ;
> @@ -4124,9 +4245,9 @@ define void @truncstore_v4i32_v4i16(<4 x
>  ; AVX512BW:       # %bb.0:
>  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
>  ; AVX512BW-NEXT:    vptestmd %zmm1, %zmm1, %k0
> -; AVX512BW-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,1,4,5,8,9,12,13,8,9,12,13,12,13,14,15]
>  ; AVX512BW-NEXT:    kshiftld $28, %k0, %k0
>  ; AVX512BW-NEXT:    kshiftrd $28, %k0, %k1
> +; AVX512BW-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,1,4,5,8,9,12,13,8,9,12,13,12,13,14,15]
>  ; AVX512BW-NEXT:    vmovdqu16 %zmm0, (%rdi) {%k1}
>  ; AVX512BW-NEXT:    vzeroupper
>  ; AVX512BW-NEXT:    retq
> @@ -4146,45 +4267,49 @@ define void @truncstore_v4i32_v4i8(<4 x
>  ; SSE2-LABEL: truncstore_v4i32_v4i8:
>  ; SSE2:       # %bb.0:
>  ; SSE2-NEXT:    pxor %xmm2, %xmm2
> +; SSE2-NEXT:    pand {{.*}}(%rip), %xmm0
> +; SSE2-NEXT:    packuswb %xmm0, %xmm0
> +; SSE2-NEXT:    packuswb %xmm0, %xmm0
>  ; SSE2-NEXT:    pcmpeqd %xmm1, %xmm2
> -; SSE2-NEXT:    movmskps %xmm2, %eax
> -; SSE2-NEXT:    xorl $15, %eax
> -; SSE2-NEXT:    testb $1, %al
> +; SSE2-NEXT:    movmskps %xmm2, %ecx
> +; SSE2-NEXT:    xorl $15, %ecx
> +; SSE2-NEXT:    testb $1, %cl
> +; SSE2-NEXT:    movd %xmm0, %eax
>  ; SSE2-NEXT:    jne .LBB14_1
>  ; SSE2-NEXT:  # %bb.2: # %else
> -; SSE2-NEXT:    testb $2, %al
> +; SSE2-NEXT:    testb $2, %cl
>  ; SSE2-NEXT:    jne .LBB14_3
>  ; SSE2-NEXT:  .LBB14_4: # %else2
> -; SSE2-NEXT:    testb $4, %al
> +; SSE2-NEXT:    testb $4, %cl
>  ; SSE2-NEXT:    jne .LBB14_5
>  ; SSE2-NEXT:  .LBB14_6: # %else4
> -; SSE2-NEXT:    testb $8, %al
> +; SSE2-NEXT:    testb $8, %cl
>  ; SSE2-NEXT:    jne .LBB14_7
>  ; SSE2-NEXT:  .LBB14_8: # %else6
>  ; SSE2-NEXT:    retq
>  ; SSE2-NEXT:  .LBB14_1: # %cond.store
> -; SSE2-NEXT:    movd %xmm0, %ecx
> -; SSE2-NEXT:    movb %cl, (%rdi)
> -; SSE2-NEXT:    testb $2, %al
> +; SSE2-NEXT:    movb %al, (%rdi)
> +; SSE2-NEXT:    testb $2, %cl
>  ; SSE2-NEXT:    je .LBB14_4
>  ; SSE2-NEXT:  .LBB14_3: # %cond.store1
> -; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> -; SSE2-NEXT:    movb %cl, 1(%rdi)
> -; SSE2-NEXT:    testb $4, %al
> +; SSE2-NEXT:    movb %ah, 1(%rdi)
> +; SSE2-NEXT:    testb $4, %cl
>  ; SSE2-NEXT:    je .LBB14_6
>  ; SSE2-NEXT:  .LBB14_5: # %cond.store3
> -; SSE2-NEXT:    pextrw $4, %xmm0, %ecx
> -; SSE2-NEXT:    movb %cl, 2(%rdi)
> -; SSE2-NEXT:    testb $8, %al
> +; SSE2-NEXT:    movl %eax, %edx
> +; SSE2-NEXT:    shrl $16, %edx
> +; SSE2-NEXT:    movb %dl, 2(%rdi)
> +; SSE2-NEXT:    testb $8, %cl
>  ; SSE2-NEXT:    je .LBB14_8
>  ; SSE2-NEXT:  .LBB14_7: # %cond.store5
> -; SSE2-NEXT:    pextrw $6, %xmm0, %eax
> +; SSE2-NEXT:    shrl $24, %eax
>  ; SSE2-NEXT:    movb %al, 3(%rdi)
>  ; SSE2-NEXT:    retq
>  ;
>  ; SSE4-LABEL: truncstore_v4i32_v4i8:
>  ; SSE4:       # %bb.0:
>  ; SSE4-NEXT:    pxor %xmm2, %xmm2
> +; SSE4-NEXT:    pshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
>  ; SSE4-NEXT:    pcmpeqd %xmm1, %xmm2
>  ; SSE4-NEXT:    movmskps %xmm2, %eax
>  ; SSE4-NEXT:    xorl $15, %eax
> @@ -4206,20 +4331,21 @@ define void @truncstore_v4i32_v4i8(<4 x
>  ; SSE4-NEXT:    testb $2, %al
>  ; SSE4-NEXT:    je .LBB14_4
>  ; SSE4-NEXT:  .LBB14_3: # %cond.store1
> -; SSE4-NEXT:    pextrb $4, %xmm0, 1(%rdi)
> +; SSE4-NEXT:    pextrb $1, %xmm0, 1(%rdi)
>  ; SSE4-NEXT:    testb $4, %al
>  ; SSE4-NEXT:    je .LBB14_6
>  ; SSE4-NEXT:  .LBB14_5: # %cond.store3
> -; SSE4-NEXT:    pextrb $8, %xmm0, 2(%rdi)
> +; SSE4-NEXT:    pextrb $2, %xmm0, 2(%rdi)
>  ; SSE4-NEXT:    testb $8, %al
>  ; SSE4-NEXT:    je .LBB14_8
>  ; SSE4-NEXT:  .LBB14_7: # %cond.store5
> -; SSE4-NEXT:    pextrb $12, %xmm0, 3(%rdi)
> +; SSE4-NEXT:    pextrb $3, %xmm0, 3(%rdi)
>  ; SSE4-NEXT:    retq
>  ;
>  ; AVX-LABEL: truncstore_v4i32_v4i8:
>  ; AVX:       # %bb.0:
>  ; AVX-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> +; AVX-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
>  ; AVX-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
>  ; AVX-NEXT:    vmovmskps %xmm1, %eax
>  ; AVX-NEXT:    xorl $15, %eax
> @@ -4241,21 +4367,22 @@ define void @truncstore_v4i32_v4i8(<4 x
>  ; AVX-NEXT:    testb $2, %al
>  ; AVX-NEXT:    je .LBB14_4
>  ; AVX-NEXT:  .LBB14_3: # %cond.store1
> -; AVX-NEXT:    vpextrb $4, %xmm0, 1(%rdi)
> +; AVX-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX-NEXT:    testb $4, %al
>  ; AVX-NEXT:    je .LBB14_6
>  ; AVX-NEXT:  .LBB14_5: # %cond.store3
> -; AVX-NEXT:    vpextrb $8, %xmm0, 2(%rdi)
> +; AVX-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
>  ; AVX-NEXT:    testb $8, %al
>  ; AVX-NEXT:    je .LBB14_8
>  ; AVX-NEXT:  .LBB14_7: # %cond.store5
> -; AVX-NEXT:    vpextrb $12, %xmm0, 3(%rdi)
> +; AVX-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
>  ; AVX-NEXT:    retq
>  ;
>  ; AVX512F-LABEL: truncstore_v4i32_v4i8:
>  ; AVX512F:       # %bb.0:
>  ; AVX512F-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
>  ; AVX512F-NEXT:    vptestmd %zmm1, %zmm1, %k0
> +; AVX512F-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
>  ; AVX512F-NEXT:    kmovw %k0, %eax
>  ; AVX512F-NEXT:    testb $1, %al
>  ; AVX512F-NEXT:    jne .LBB14_1
> @@ -4276,15 +4403,15 @@ define void @truncstore_v4i32_v4i8(<4 x
>  ; AVX512F-NEXT:    testb $2, %al
>  ; AVX512F-NEXT:    je .LBB14_4
>  ; AVX512F-NEXT:  .LBB14_3: # %cond.store1
> -; AVX512F-NEXT:    vpextrb $4, %xmm0, 1(%rdi)
> +; AVX512F-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX512F-NEXT:    testb $4, %al
>  ; AVX512F-NEXT:    je .LBB14_6
>  ; AVX512F-NEXT:  .LBB14_5: # %cond.store3
> -; AVX512F-NEXT:    vpextrb $8, %xmm0, 2(%rdi)
> +; AVX512F-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
>  ; AVX512F-NEXT:    testb $8, %al
>  ; AVX512F-NEXT:    je .LBB14_8
>  ; AVX512F-NEXT:  .LBB14_7: # %cond.store5
> -; AVX512F-NEXT:    vpextrb $12, %xmm0, 3(%rdi)
> +; AVX512F-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
>  ; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
>  ;
> @@ -4292,9 +4419,9 @@ define void @truncstore_v4i32_v4i8(<4 x
>  ; AVX512BW:       # %bb.0:
>  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
>  ; AVX512BW-NEXT:    vptestmd %zmm1, %zmm1, %k0
> -; AVX512BW-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
>  ; AVX512BW-NEXT:    kshiftlq $60, %k0, %k0
>  ; AVX512BW-NEXT:    kshiftrq $60, %k0, %k1
> +; AVX512BW-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
>  ; AVX512BW-NEXT:    vmovdqu8 %zmm0, (%rdi) {%k1}
>  ; AVX512BW-NEXT:    vzeroupper
>  ; AVX512BW-NEXT:    retq
> @@ -6147,6 +6274,8 @@ define void @truncstore_v8i16_v8i8(<8 x
>  ; SSE2-LABEL: truncstore_v8i16_v8i8:
>  ; SSE2:       # %bb.0:
>  ; SSE2-NEXT:    pxor %xmm2, %xmm2
> +; SSE2-NEXT:    pand {{.*}}(%rip), %xmm0
> +; SSE2-NEXT:    packuswb %xmm0, %xmm0
>  ; SSE2-NEXT:    pcmpeqw %xmm1, %xmm2
>  ; SSE2-NEXT:    pcmpeqd %xmm1, %xmm1
>  ; SSE2-NEXT:    pxor %xmm2, %xmm1
> @@ -6163,17 +6292,26 @@ define void @truncstore_v8i16_v8i8(<8 x
>  ; SSE2-NEXT:    jne .LBB17_5
>  ; SSE2-NEXT:  .LBB17_6: # %else4
>  ; SSE2-NEXT:    testb $8, %al
> -; SSE2-NEXT:    jne .LBB17_7
> +; SSE2-NEXT:    je .LBB17_8
> +; SSE2-NEXT:  .LBB17_7: # %cond.store5
> +; SSE2-NEXT:    shrl $24, %ecx
> +; SSE2-NEXT:    movb %cl, 3(%rdi)
>  ; SSE2-NEXT:  .LBB17_8: # %else6
>  ; SSE2-NEXT:    testb $16, %al
> -; SSE2-NEXT:    jne .LBB17_9
> +; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> +; SSE2-NEXT:    je .LBB17_10
> +; SSE2-NEXT:  # %bb.9: # %cond.store7
> +; SSE2-NEXT:    movb %cl, 4(%rdi)
>  ; SSE2-NEXT:  .LBB17_10: # %else8
>  ; SSE2-NEXT:    testb $32, %al
> -; SSE2-NEXT:    jne .LBB17_11
> +; SSE2-NEXT:    je .LBB17_12
> +; SSE2-NEXT:  # %bb.11: # %cond.store9
> +; SSE2-NEXT:    movb %ch, 5(%rdi)
>  ; SSE2-NEXT:  .LBB17_12: # %else10
>  ; SSE2-NEXT:    testb $64, %al
> +; SSE2-NEXT:    pextrw $3, %xmm0, %ecx
>  ; SSE2-NEXT:    jne .LBB17_13
> -; SSE2-NEXT:  .LBB17_14: # %else12
> +; SSE2-NEXT:  # %bb.14: # %else12
>  ; SSE2-NEXT:    testb $-128, %al
>  ; SSE2-NEXT:    jne .LBB17_15
>  ; SSE2-NEXT:  .LBB17_16: # %else14
> @@ -6183,43 +6321,28 @@ define void @truncstore_v8i16_v8i8(<8 x
>  ; SSE2-NEXT:    testb $2, %al
>  ; SSE2-NEXT:    je .LBB17_4
>  ; SSE2-NEXT:  .LBB17_3: # %cond.store1
> -; SSE2-NEXT:    shrl $16, %ecx
> -; SSE2-NEXT:    movb %cl, 1(%rdi)
> +; SSE2-NEXT:    movb %ch, 1(%rdi)
>  ; SSE2-NEXT:    testb $4, %al
>  ; SSE2-NEXT:    je .LBB17_6
>  ; SSE2-NEXT:  .LBB17_5: # %cond.store3
> -; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> -; SSE2-NEXT:    movb %cl, 2(%rdi)
> +; SSE2-NEXT:    movl %ecx, %edx
> +; SSE2-NEXT:    shrl $16, %edx
> +; SSE2-NEXT:    movb %dl, 2(%rdi)
>  ; SSE2-NEXT:    testb $8, %al
> -; SSE2-NEXT:    je .LBB17_8
> -; SSE2-NEXT:  .LBB17_7: # %cond.store5
> -; SSE2-NEXT:    pextrw $3, %xmm0, %ecx
> -; SSE2-NEXT:    movb %cl, 3(%rdi)
> -; SSE2-NEXT:    testb $16, %al
> -; SSE2-NEXT:    je .LBB17_10
> -; SSE2-NEXT:  .LBB17_9: # %cond.store7
> -; SSE2-NEXT:    pextrw $4, %xmm0, %ecx
> -; SSE2-NEXT:    movb %cl, 4(%rdi)
> -; SSE2-NEXT:    testb $32, %al
> -; SSE2-NEXT:    je .LBB17_12
> -; SSE2-NEXT:  .LBB17_11: # %cond.store9
> -; SSE2-NEXT:    pextrw $5, %xmm0, %ecx
> -; SSE2-NEXT:    movb %cl, 5(%rdi)
> -; SSE2-NEXT:    testb $64, %al
> -; SSE2-NEXT:    je .LBB17_14
> +; SSE2-NEXT:    jne .LBB17_7
> +; SSE2-NEXT:    jmp .LBB17_8
>  ; SSE2-NEXT:  .LBB17_13: # %cond.store11
> -; SSE2-NEXT:    pextrw $6, %xmm0, %ecx
>  ; SSE2-NEXT:    movb %cl, 6(%rdi)
>  ; SSE2-NEXT:    testb $-128, %al
>  ; SSE2-NEXT:    je .LBB17_16
>  ; SSE2-NEXT:  .LBB17_15: # %cond.store13
> -; SSE2-NEXT:    pextrw $7, %xmm0, %eax
> -; SSE2-NEXT:    movb %al, 7(%rdi)
> +; SSE2-NEXT:    movb %ch, 7(%rdi)
>  ; SSE2-NEXT:    retq
>  ;
>  ; SSE4-LABEL: truncstore_v8i16_v8i8:
>  ; SSE4:       # %bb.0:
>  ; SSE4-NEXT:    pxor %xmm2, %xmm2
> +; SSE4-NEXT:    pshufb {{.*#+}} xmm0 = xmm0[0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u]
>  ; SSE4-NEXT:    pcmpeqw %xmm1, %xmm2
>  ; SSE4-NEXT:    pcmpeqd %xmm1, %xmm1
>  ; SSE4-NEXT:    pxor %xmm2, %xmm1
> @@ -6255,36 +6378,37 @@ define void @truncstore_v8i16_v8i8(<8 x
>  ; SSE4-NEXT:    testb $2, %al
>  ; SSE4-NEXT:    je .LBB17_4
>  ; SSE4-NEXT:  .LBB17_3: # %cond.store1
> -; SSE4-NEXT:    pextrb $2, %xmm0, 1(%rdi)
> +; SSE4-NEXT:    pextrb $1, %xmm0, 1(%rdi)
>  ; SSE4-NEXT:    testb $4, %al
>  ; SSE4-NEXT:    je .LBB17_6
>  ; SSE4-NEXT:  .LBB17_5: # %cond.store3
> -; SSE4-NEXT:    pextrb $4, %xmm0, 2(%rdi)
> +; SSE4-NEXT:    pextrb $2, %xmm0, 2(%rdi)
>  ; SSE4-NEXT:    testb $8, %al
>  ; SSE4-NEXT:    je .LBB17_8
>  ; SSE4-NEXT:  .LBB17_7: # %cond.store5
> -; SSE4-NEXT:    pextrb $6, %xmm0, 3(%rdi)
> +; SSE4-NEXT:    pextrb $3, %xmm0, 3(%rdi)
>  ; SSE4-NEXT:    testb $16, %al
>  ; SSE4-NEXT:    je .LBB17_10
>  ; SSE4-NEXT:  .LBB17_9: # %cond.store7
> -; SSE4-NEXT:    pextrb $8, %xmm0, 4(%rdi)
> +; SSE4-NEXT:    pextrb $4, %xmm0, 4(%rdi)
>  ; SSE4-NEXT:    testb $32, %al
>  ; SSE4-NEXT:    je .LBB17_12
>  ; SSE4-NEXT:  .LBB17_11: # %cond.store9
> -; SSE4-NEXT:    pextrb $10, %xmm0, 5(%rdi)
> +; SSE4-NEXT:    pextrb $5, %xmm0, 5(%rdi)
>  ; SSE4-NEXT:    testb $64, %al
>  ; SSE4-NEXT:    je .LBB17_14
>  ; SSE4-NEXT:  .LBB17_13: # %cond.store11
> -; SSE4-NEXT:    pextrb $12, %xmm0, 6(%rdi)
> +; SSE4-NEXT:    pextrb $6, %xmm0, 6(%rdi)
>  ; SSE4-NEXT:    testb $-128, %al
>  ; SSE4-NEXT:    je .LBB17_16
>  ; SSE4-NEXT:  .LBB17_15: # %cond.store13
> -; SSE4-NEXT:    pextrb $14, %xmm0, 7(%rdi)
> +; SSE4-NEXT:    pextrb $7, %xmm0, 7(%rdi)
>  ; SSE4-NEXT:    retq
>  ;
>  ; AVX-LABEL: truncstore_v8i16_v8i8:
>  ; AVX:       # %bb.0:
>  ; AVX-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> +; AVX-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u]
>  ; AVX-NEXT:    vpcmpeqw %xmm2, %xmm1, %xmm1
>  ; AVX-NEXT:    vpcmpeqd %xmm2, %xmm2, %xmm2
>  ; AVX-NEXT:    vpxor %xmm2, %xmm1, %xmm1
> @@ -6320,31 +6444,31 @@ define void @truncstore_v8i16_v8i8(<8 x
>  ; AVX-NEXT:    testb $2, %al
>  ; AVX-NEXT:    je .LBB17_4
>  ; AVX-NEXT:  .LBB17_3: # %cond.store1
> -; AVX-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> +; AVX-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX-NEXT:    testb $4, %al
>  ; AVX-NEXT:    je .LBB17_6
>  ; AVX-NEXT:  .LBB17_5: # %cond.store3
> -; AVX-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> +; AVX-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
>  ; AVX-NEXT:    testb $8, %al
>  ; AVX-NEXT:    je .LBB17_8
>  ; AVX-NEXT:  .LBB17_7: # %cond.store5
> -; AVX-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> +; AVX-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
>  ; AVX-NEXT:    testb $16, %al
>  ; AVX-NEXT:    je .LBB17_10
>  ; AVX-NEXT:  .LBB17_9: # %cond.store7
> -; AVX-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> +; AVX-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
>  ; AVX-NEXT:    testb $32, %al
>  ; AVX-NEXT:    je .LBB17_12
>  ; AVX-NEXT:  .LBB17_11: # %cond.store9
> -; AVX-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> +; AVX-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
>  ; AVX-NEXT:    testb $64, %al
>  ; AVX-NEXT:    je .LBB17_14
>  ; AVX-NEXT:  .LBB17_13: # %cond.store11
> -; AVX-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> +; AVX-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
>  ; AVX-NEXT:    testb $-128, %al
>  ; AVX-NEXT:    je .LBB17_16
>  ; AVX-NEXT:  .LBB17_15: # %cond.store13
> -; AVX-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> +; AVX-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
>  ; AVX-NEXT:    retq
>  ;
>  ; AVX512F-LABEL: truncstore_v8i16_v8i8:
> @@ -6354,6 +6478,7 @@ define void @truncstore_v8i16_v8i8(<8 x
>  ; AVX512F-NEXT:    vpternlogq $15, %zmm1, %zmm1, %zmm1
>  ; AVX512F-NEXT:    vpmovsxwq %xmm1, %zmm1
>  ; AVX512F-NEXT:    vptestmq %zmm1, %zmm1, %k0
> +; AVX512F-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u]
>  ; AVX512F-NEXT:    kmovw %k0, %eax
>  ; AVX512F-NEXT:    testb $1, %al
>  ; AVX512F-NEXT:    jne .LBB17_1
> @@ -6386,31 +6511,31 @@ define void @truncstore_v8i16_v8i8(<8 x
>  ; AVX512F-NEXT:    testb $2, %al
>  ; AVX512F-NEXT:    je .LBB17_4
>  ; AVX512F-NEXT:  .LBB17_3: # %cond.store1
> -; AVX512F-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> +; AVX512F-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX512F-NEXT:    testb $4, %al
>  ; AVX512F-NEXT:    je .LBB17_6
>  ; AVX512F-NEXT:  .LBB17_5: # %cond.store3
> -; AVX512F-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> +; AVX512F-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
>  ; AVX512F-NEXT:    testb $8, %al
>  ; AVX512F-NEXT:    je .LBB17_8
>  ; AVX512F-NEXT:  .LBB17_7: # %cond.store5
> -; AVX512F-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> +; AVX512F-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
>  ; AVX512F-NEXT:    testb $16, %al
>  ; AVX512F-NEXT:    je .LBB17_10
>  ; AVX512F-NEXT:  .LBB17_9: # %cond.store7
> -; AVX512F-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> +; AVX512F-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
>  ; AVX512F-NEXT:    testb $32, %al
>  ; AVX512F-NEXT:    je .LBB17_12
>  ; AVX512F-NEXT:  .LBB17_11: # %cond.store9
> -; AVX512F-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> +; AVX512F-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
>  ; AVX512F-NEXT:    testb $64, %al
>  ; AVX512F-NEXT:    je .LBB17_14
>  ; AVX512F-NEXT:  .LBB17_13: # %cond.store11
> -; AVX512F-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> +; AVX512F-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
>  ; AVX512F-NEXT:    testb $-128, %al
>  ; AVX512F-NEXT:    je .LBB17_16
>  ; AVX512F-NEXT:  .LBB17_15: # %cond.store13
> -; AVX512F-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> +; AVX512F-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
>  ; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
>  ;
> @@ -6418,9 +6543,9 @@ define void @truncstore_v8i16_v8i8(<8 x
>  ; AVX512BW:       # %bb.0:
>  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
>  ; AVX512BW-NEXT:    vptestmw %zmm1, %zmm1, %k0
> -; AVX512BW-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u]
>  ; AVX512BW-NEXT:    kshiftlq $56, %k0, %k0
>  ; AVX512BW-NEXT:    kshiftrq $56, %k0, %k1
> +; AVX512BW-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u]
>  ; AVX512BW-NEXT:    vmovdqu8 %zmm0, (%rdi) {%k1}
>  ; AVX512BW-NEXT:    vzeroupper
>  ; AVX512BW-NEXT:    retq
>
> Modified: llvm/trunk/test/CodeGen/X86/masked_store_trunc_ssat.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/masked_store_trunc_ssat.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/masked_store_trunc_ssat.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/masked_store_trunc_ssat.ll Wed Aug  7 09:24:26 2019
> @@ -948,7 +948,7 @@ define void @truncstore_v8i64_v8i8(<8 x
>  ; SSE2-NEXT:    pxor %xmm8, %xmm8
>  ; SSE2-NEXT:    movdqa {{.*#+}} xmm9 = [127,127]
>  ; SSE2-NEXT:    movdqa {{.*#+}} xmm11 = [2147483648,2147483648]
> -; SSE2-NEXT:    movdqa %xmm2, %xmm6
> +; SSE2-NEXT:    movdqa %xmm3, %xmm6
>  ; SSE2-NEXT:    pxor %xmm11, %xmm6
>  ; SSE2-NEXT:    movdqa {{.*#+}} xmm10 = [2147483775,2147483775]
>  ; SSE2-NEXT:    movdqa %xmm10, %xmm7
> @@ -959,23 +959,10 @@ define void @truncstore_v8i64_v8i8(<8 x
>  ; SSE2-NEXT:    pand %xmm12, %xmm6
>  ; SSE2-NEXT:    pshufd {{.*#+}} xmm13 = xmm7[1,1,3,3]
>  ; SSE2-NEXT:    por %xmm6, %xmm13
> -; SSE2-NEXT:    pand %xmm13, %xmm2
> +; SSE2-NEXT:    pand %xmm13, %xmm3
>  ; SSE2-NEXT:    pandn %xmm9, %xmm13
> -; SSE2-NEXT:    por %xmm2, %xmm13
> -; SSE2-NEXT:    movdqa %xmm3, %xmm2
> -; SSE2-NEXT:    pxor %xmm11, %xmm2
> -; SSE2-NEXT:    movdqa %xmm10, %xmm6
> -; SSE2-NEXT:    pcmpgtd %xmm2, %xmm6
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm12 = xmm6[0,0,2,2]
> -; SSE2-NEXT:    pcmpeqd %xmm10, %xmm2
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm7 = xmm2[1,1,3,3]
> -; SSE2-NEXT:    pand %xmm12, %xmm7
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm2 = xmm6[1,1,3,3]
> -; SSE2-NEXT:    por %xmm7, %xmm2
> -; SSE2-NEXT:    pand %xmm2, %xmm3
> -; SSE2-NEXT:    pandn %xmm9, %xmm2
> -; SSE2-NEXT:    por %xmm3, %xmm2
> -; SSE2-NEXT:    movdqa %xmm0, %xmm3
> +; SSE2-NEXT:    por %xmm3, %xmm13
> +; SSE2-NEXT:    movdqa %xmm2, %xmm3
>  ; SSE2-NEXT:    pxor %xmm11, %xmm3
>  ; SSE2-NEXT:    movdqa %xmm10, %xmm6
>  ; SSE2-NEXT:    pcmpgtd %xmm3, %xmm6
> @@ -985,78 +972,97 @@ define void @truncstore_v8i64_v8i8(<8 x
>  ; SSE2-NEXT:    pand %xmm12, %xmm7
>  ; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm6[1,1,3,3]
>  ; SSE2-NEXT:    por %xmm7, %xmm3
> -; SSE2-NEXT:    pand %xmm3, %xmm0
> +; SSE2-NEXT:    pand %xmm3, %xmm2
>  ; SSE2-NEXT:    pandn %xmm9, %xmm3
> -; SSE2-NEXT:    por %xmm0, %xmm3
> -; SSE2-NEXT:    movdqa %xmm1, %xmm0
> -; SSE2-NEXT:    pxor %xmm11, %xmm0
> +; SSE2-NEXT:    por %xmm2, %xmm3
> +; SSE2-NEXT:    movdqa %xmm1, %xmm2
> +; SSE2-NEXT:    pxor %xmm11, %xmm2
> +; SSE2-NEXT:    movdqa %xmm10, %xmm6
> +; SSE2-NEXT:    pcmpgtd %xmm2, %xmm6
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm12 = xmm6[0,0,2,2]
> +; SSE2-NEXT:    pcmpeqd %xmm10, %xmm2
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm7 = xmm2[1,1,3,3]
> +; SSE2-NEXT:    pand %xmm12, %xmm7
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm2 = xmm6[1,1,3,3]
> +; SSE2-NEXT:    por %xmm7, %xmm2
> +; SSE2-NEXT:    pand %xmm2, %xmm1
> +; SSE2-NEXT:    pandn %xmm9, %xmm2
> +; SSE2-NEXT:    por %xmm1, %xmm2
> +; SSE2-NEXT:    movdqa %xmm0, %xmm1
> +; SSE2-NEXT:    pxor %xmm11, %xmm1
>  ; SSE2-NEXT:    movdqa %xmm10, %xmm6
> -; SSE2-NEXT:    pcmpgtd %xmm0, %xmm6
> +; SSE2-NEXT:    pcmpgtd %xmm1, %xmm6
>  ; SSE2-NEXT:    pshufd {{.*#+}} xmm7 = xmm6[0,0,2,2]
> -; SSE2-NEXT:    pcmpeqd %xmm10, %xmm0
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,1,3,3]
> -; SSE2-NEXT:    pand %xmm7, %xmm0
> +; SSE2-NEXT:    pcmpeqd %xmm10, %xmm1
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
> +; SSE2-NEXT:    pand %xmm7, %xmm1
>  ; SSE2-NEXT:    pshufd {{.*#+}} xmm6 = xmm6[1,1,3,3]
> -; SSE2-NEXT:    por %xmm0, %xmm6
> -; SSE2-NEXT:    pand %xmm6, %xmm1
> -; SSE2-NEXT:    pandn %xmm9, %xmm6
>  ; SSE2-NEXT:    por %xmm1, %xmm6
> +; SSE2-NEXT:    pand %xmm6, %xmm0
> +; SSE2-NEXT:    pandn %xmm9, %xmm6
> +; SSE2-NEXT:    por %xmm0, %xmm6
>  ; SSE2-NEXT:    movdqa {{.*#+}} xmm9 = [18446744073709551488,18446744073709551488]
>  ; SSE2-NEXT:    movdqa %xmm6, %xmm0
>  ; SSE2-NEXT:    pxor %xmm11, %xmm0
>  ; SSE2-NEXT:    movdqa {{.*#+}} xmm10 = [18446744071562067840,18446744071562067840]
> -; SSE2-NEXT:    movdqa %xmm0, %xmm7
> -; SSE2-NEXT:    pcmpgtd %xmm10, %xmm7
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm7[0,0,2,2]
> -; SSE2-NEXT:    pcmpeqd %xmm10, %xmm0
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,1,3,3]
> -; SSE2-NEXT:    pand %xmm1, %xmm0
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm7[1,1,3,3]
> -; SSE2-NEXT:    por %xmm0, %xmm1
> -; SSE2-NEXT:    pand %xmm1, %xmm6
> -; SSE2-NEXT:    pandn %xmm9, %xmm1
> -; SSE2-NEXT:    por %xmm6, %xmm1
> -; SSE2-NEXT:    movdqa %xmm3, %xmm0
> -; SSE2-NEXT:    pxor %xmm11, %xmm0
> -; SSE2-NEXT:    movdqa %xmm0, %xmm6
> -; SSE2-NEXT:    pcmpgtd %xmm10, %xmm6
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm12 = xmm6[0,0,2,2]
> +; SSE2-NEXT:    movdqa %xmm0, %xmm1
> +; SSE2-NEXT:    pcmpgtd %xmm10, %xmm1
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm12 = xmm1[0,0,2,2]
>  ; SSE2-NEXT:    pcmpeqd %xmm10, %xmm0
>  ; SSE2-NEXT:    pshufd {{.*#+}} xmm7 = xmm0[1,1,3,3]
>  ; SSE2-NEXT:    pand %xmm12, %xmm7
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm6[1,1,3,3]
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm1[1,1,3,3]
>  ; SSE2-NEXT:    por %xmm7, %xmm0
> -; SSE2-NEXT:    pand %xmm0, %xmm3
> +; SSE2-NEXT:    pand %xmm0, %xmm6
>  ; SSE2-NEXT:    pandn %xmm9, %xmm0
> -; SSE2-NEXT:    por %xmm3, %xmm0
> -; SSE2-NEXT:    packssdw %xmm1, %xmm0
> +; SSE2-NEXT:    por %xmm6, %xmm0
>  ; SSE2-NEXT:    movdqa %xmm2, %xmm1
>  ; SSE2-NEXT:    pxor %xmm11, %xmm1
> -; SSE2-NEXT:    movdqa %xmm1, %xmm3
> +; SSE2-NEXT:    movdqa %xmm1, %xmm6
> +; SSE2-NEXT:    pcmpgtd %xmm10, %xmm6
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm12 = xmm6[0,0,2,2]
> +; SSE2-NEXT:    pcmpeqd %xmm10, %xmm1
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm7 = xmm1[1,1,3,3]
> +; SSE2-NEXT:    pand %xmm12, %xmm7
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm6[1,1,3,3]
> +; SSE2-NEXT:    por %xmm7, %xmm1
> +; SSE2-NEXT:    pand %xmm1, %xmm2
> +; SSE2-NEXT:    pandn %xmm9, %xmm1
> +; SSE2-NEXT:    por %xmm2, %xmm1
> +; SSE2-NEXT:    movdqa %xmm3, %xmm2
> +; SSE2-NEXT:    pxor %xmm11, %xmm2
> +; SSE2-NEXT:    movdqa %xmm2, %xmm6
> +; SSE2-NEXT:    pcmpgtd %xmm10, %xmm6
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm12 = xmm6[0,0,2,2]
> +; SSE2-NEXT:    pcmpeqd %xmm10, %xmm2
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm7 = xmm2[1,1,3,3]
> +; SSE2-NEXT:    pand %xmm12, %xmm7
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm2 = xmm6[1,1,3,3]
> +; SSE2-NEXT:    por %xmm7, %xmm2
> +; SSE2-NEXT:    pand %xmm2, %xmm3
> +; SSE2-NEXT:    pandn %xmm9, %xmm2
> +; SSE2-NEXT:    por %xmm3, %xmm2
> +; SSE2-NEXT:    pxor %xmm13, %xmm11
> +; SSE2-NEXT:    movdqa %xmm11, %xmm3
>  ; SSE2-NEXT:    pcmpgtd %xmm10, %xmm3
>  ; SSE2-NEXT:    pshufd {{.*#+}} xmm6 = xmm3[0,0,2,2]
> -; SSE2-NEXT:    pcmpeqd %xmm10, %xmm1
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
> -; SSE2-NEXT:    pand %xmm6, %xmm1
> +; SSE2-NEXT:    pcmpeqd %xmm10, %xmm11
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm7 = xmm11[1,1,3,3]
> +; SSE2-NEXT:    pand %xmm6, %xmm7
>  ; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm3[1,1,3,3]
> -; SSE2-NEXT:    por %xmm1, %xmm3
> -; SSE2-NEXT:    pand %xmm3, %xmm2
> +; SSE2-NEXT:    por %xmm7, %xmm3
> +; SSE2-NEXT:    pand %xmm3, %xmm13
>  ; SSE2-NEXT:    pandn %xmm9, %xmm3
> -; SSE2-NEXT:    por %xmm2, %xmm3
> -; SSE2-NEXT:    pxor %xmm13, %xmm11
> -; SSE2-NEXT:    movdqa %xmm11, %xmm1
> -; SSE2-NEXT:    pcmpgtd %xmm10, %xmm1
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm2 = xmm1[0,0,2,2]
> -; SSE2-NEXT:    pcmpeqd %xmm10, %xmm11
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm6 = xmm11[1,1,3,3]
> -; SSE2-NEXT:    pand %xmm2, %xmm6
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
> -; SSE2-NEXT:    por %xmm6, %xmm1
> -; SSE2-NEXT:    pand %xmm1, %xmm13
> -; SSE2-NEXT:    pandn %xmm9, %xmm1
> -; SSE2-NEXT:    por %xmm13, %xmm1
> -; SSE2-NEXT:    packssdw %xmm3, %xmm1
> -; SSE2-NEXT:    packssdw %xmm1, %xmm0
> +; SSE2-NEXT:    por %xmm13, %xmm3
> +; SSE2-NEXT:    movdqa {{.*#+}} xmm6 = [255,0,0,0,0,0,0,0,255,0,0,0,0,0,0,0]
> +; SSE2-NEXT:    pand %xmm6, %xmm3
> +; SSE2-NEXT:    pand %xmm6, %xmm2
> +; SSE2-NEXT:    packuswb %xmm3, %xmm2
> +; SSE2-NEXT:    pand %xmm6, %xmm1
> +; SSE2-NEXT:    pand %xmm6, %xmm0
> +; SSE2-NEXT:    packuswb %xmm1, %xmm0
> +; SSE2-NEXT:    packuswb %xmm2, %xmm0
> +; SSE2-NEXT:    packuswb %xmm0, %xmm0
>  ; SSE2-NEXT:    pcmpeqd %xmm8, %xmm5
>  ; SSE2-NEXT:    pcmpeqd %xmm1, %xmm1
>  ; SSE2-NEXT:    pxor %xmm1, %xmm5
> @@ -1076,17 +1082,26 @@ define void @truncstore_v8i64_v8i8(<8 x
>  ; SSE2-NEXT:    jne .LBB2_5
>  ; SSE2-NEXT:  .LBB2_6: # %else4
>  ; SSE2-NEXT:    testb $8, %al
> -; SSE2-NEXT:    jne .LBB2_7
> +; SSE2-NEXT:    je .LBB2_8
> +; SSE2-NEXT:  .LBB2_7: # %cond.store5
> +; SSE2-NEXT:    shrl $24, %ecx
> +; SSE2-NEXT:    movb %cl, 3(%rdi)
>  ; SSE2-NEXT:  .LBB2_8: # %else6
>  ; SSE2-NEXT:    testb $16, %al
> -; SSE2-NEXT:    jne .LBB2_9
> +; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> +; SSE2-NEXT:    je .LBB2_10
> +; SSE2-NEXT:  # %bb.9: # %cond.store7
> +; SSE2-NEXT:    movb %cl, 4(%rdi)
>  ; SSE2-NEXT:  .LBB2_10: # %else8
>  ; SSE2-NEXT:    testb $32, %al
> -; SSE2-NEXT:    jne .LBB2_11
> +; SSE2-NEXT:    je .LBB2_12
> +; SSE2-NEXT:  # %bb.11: # %cond.store9
> +; SSE2-NEXT:    movb %ch, 5(%rdi)
>  ; SSE2-NEXT:  .LBB2_12: # %else10
>  ; SSE2-NEXT:    testb $64, %al
> +; SSE2-NEXT:    pextrw $3, %xmm0, %ecx
>  ; SSE2-NEXT:    jne .LBB2_13
> -; SSE2-NEXT:  .LBB2_14: # %else12
> +; SSE2-NEXT:  # %bb.14: # %else12
>  ; SSE2-NEXT:    testb $-128, %al
>  ; SSE2-NEXT:    jne .LBB2_15
>  ; SSE2-NEXT:  .LBB2_16: # %else14
> @@ -1096,38 +1111,22 @@ define void @truncstore_v8i64_v8i8(<8 x
>  ; SSE2-NEXT:    testb $2, %al
>  ; SSE2-NEXT:    je .LBB2_4
>  ; SSE2-NEXT:  .LBB2_3: # %cond.store1
> -; SSE2-NEXT:    shrl $16, %ecx
> -; SSE2-NEXT:    movb %cl, 1(%rdi)
> +; SSE2-NEXT:    movb %ch, 1(%rdi)
>  ; SSE2-NEXT:    testb $4, %al
>  ; SSE2-NEXT:    je .LBB2_6
>  ; SSE2-NEXT:  .LBB2_5: # %cond.store3
> -; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> -; SSE2-NEXT:    movb %cl, 2(%rdi)
> +; SSE2-NEXT:    movl %ecx, %edx
> +; SSE2-NEXT:    shrl $16, %edx
> +; SSE2-NEXT:    movb %dl, 2(%rdi)
>  ; SSE2-NEXT:    testb $8, %al
> -; SSE2-NEXT:    je .LBB2_8
> -; SSE2-NEXT:  .LBB2_7: # %cond.store5
> -; SSE2-NEXT:    pextrw $3, %xmm0, %ecx
> -; SSE2-NEXT:    movb %cl, 3(%rdi)
> -; SSE2-NEXT:    testb $16, %al
> -; SSE2-NEXT:    je .LBB2_10
> -; SSE2-NEXT:  .LBB2_9: # %cond.store7
> -; SSE2-NEXT:    pextrw $4, %xmm0, %ecx
> -; SSE2-NEXT:    movb %cl, 4(%rdi)
> -; SSE2-NEXT:    testb $32, %al
> -; SSE2-NEXT:    je .LBB2_12
> -; SSE2-NEXT:  .LBB2_11: # %cond.store9
> -; SSE2-NEXT:    pextrw $5, %xmm0, %ecx
> -; SSE2-NEXT:    movb %cl, 5(%rdi)
> -; SSE2-NEXT:    testb $64, %al
> -; SSE2-NEXT:    je .LBB2_14
> +; SSE2-NEXT:    jne .LBB2_7
> +; SSE2-NEXT:    jmp .LBB2_8
>  ; SSE2-NEXT:  .LBB2_13: # %cond.store11
> -; SSE2-NEXT:    pextrw $6, %xmm0, %ecx
>  ; SSE2-NEXT:    movb %cl, 6(%rdi)
>  ; SSE2-NEXT:    testb $-128, %al
>  ; SSE2-NEXT:    je .LBB2_16
>  ; SSE2-NEXT:  .LBB2_15: # %cond.store13
> -; SSE2-NEXT:    pextrw $7, %xmm0, %eax
> -; SSE2-NEXT:    movb %al, 7(%rdi)
> +; SSE2-NEXT:    movb %ch, 7(%rdi)
>  ; SSE2-NEXT:    retq
>  ;
>  ; SSE4-LABEL: truncstore_v8i64_v8i8:
> @@ -1136,39 +1135,45 @@ define void @truncstore_v8i64_v8i8(<8 x
>  ; SSE4-NEXT:    pxor %xmm8, %xmm8
>  ; SSE4-NEXT:    movdqa {{.*#+}} xmm7 = [127,127]
>  ; SSE4-NEXT:    movdqa %xmm7, %xmm0
> -; SSE4-NEXT:    pcmpgtq %xmm2, %xmm0
> -; SSE4-NEXT:    movdqa %xmm7, %xmm10
> -; SSE4-NEXT:    blendvpd %xmm0, %xmm2, %xmm10
> -; SSE4-NEXT:    movdqa %xmm7, %xmm0
>  ; SSE4-NEXT:    pcmpgtq %xmm3, %xmm0
> -; SSE4-NEXT:    movdqa %xmm7, %xmm2
> -; SSE4-NEXT:    blendvpd %xmm0, %xmm3, %xmm2
> +; SSE4-NEXT:    movdqa %xmm7, %xmm10
> +; SSE4-NEXT:    blendvpd %xmm0, %xmm3, %xmm10
>  ; SSE4-NEXT:    movdqa %xmm7, %xmm0
> -; SSE4-NEXT:    pcmpgtq %xmm9, %xmm0
> +; SSE4-NEXT:    pcmpgtq %xmm2, %xmm0
>  ; SSE4-NEXT:    movdqa %xmm7, %xmm3
> -; SSE4-NEXT:    blendvpd %xmm0, %xmm9, %xmm3
> +; SSE4-NEXT:    blendvpd %xmm0, %xmm2, %xmm3
>  ; SSE4-NEXT:    movdqa %xmm7, %xmm0
>  ; SSE4-NEXT:    pcmpgtq %xmm1, %xmm0
> -; SSE4-NEXT:    blendvpd %xmm0, %xmm1, %xmm7
> +; SSE4-NEXT:    movdqa %xmm7, %xmm2
> +; SSE4-NEXT:    blendvpd %xmm0, %xmm1, %xmm2
> +; SSE4-NEXT:    movdqa %xmm7, %xmm0
> +; SSE4-NEXT:    pcmpgtq %xmm9, %xmm0
> +; SSE4-NEXT:    blendvpd %xmm0, %xmm9, %xmm7
>  ; SSE4-NEXT:    movdqa {{.*#+}} xmm1 = [18446744073709551488,18446744073709551488]
>  ; SSE4-NEXT:    movapd %xmm7, %xmm0
>  ; SSE4-NEXT:    pcmpgtq %xmm1, %xmm0
>  ; SSE4-NEXT:    movdqa %xmm1, %xmm6
>  ; SSE4-NEXT:    blendvpd %xmm0, %xmm7, %xmm6
> -; SSE4-NEXT:    movapd %xmm3, %xmm0
> +; SSE4-NEXT:    movapd %xmm2, %xmm0
>  ; SSE4-NEXT:    pcmpgtq %xmm1, %xmm0
>  ; SSE4-NEXT:    movdqa %xmm1, %xmm7
> -; SSE4-NEXT:    blendvpd %xmm0, %xmm3, %xmm7
> -; SSE4-NEXT:    packssdw %xmm6, %xmm7
> -; SSE4-NEXT:    movapd %xmm2, %xmm0
> +; SSE4-NEXT:    blendvpd %xmm0, %xmm2, %xmm7
> +; SSE4-NEXT:    movapd %xmm3, %xmm0
>  ; SSE4-NEXT:    pcmpgtq %xmm1, %xmm0
> -; SSE4-NEXT:    movdqa %xmm1, %xmm3
> -; SSE4-NEXT:    blendvpd %xmm0, %xmm2, %xmm3
> +; SSE4-NEXT:    movdqa %xmm1, %xmm2
> +; SSE4-NEXT:    blendvpd %xmm0, %xmm3, %xmm2
>  ; SSE4-NEXT:    movapd %xmm10, %xmm0
>  ; SSE4-NEXT:    pcmpgtq %xmm1, %xmm0
>  ; SSE4-NEXT:    blendvpd %xmm0, %xmm10, %xmm1
> -; SSE4-NEXT:    packssdw %xmm3, %xmm1
> -; SSE4-NEXT:    packssdw %xmm1, %xmm7
> +; SSE4-NEXT:    movapd {{.*#+}} xmm0 = [255,0,0,0,0,0,0,0,255,0,0,0,0,0,0,0]
> +; SSE4-NEXT:    andpd %xmm0, %xmm1
> +; SSE4-NEXT:    andpd %xmm0, %xmm2
> +; SSE4-NEXT:    packusdw %xmm1, %xmm2
> +; SSE4-NEXT:    andpd %xmm0, %xmm7
> +; SSE4-NEXT:    andpd %xmm0, %xmm6
> +; SSE4-NEXT:    packusdw %xmm7, %xmm6
> +; SSE4-NEXT:    packusdw %xmm2, %xmm6
> +; SSE4-NEXT:    packuswb %xmm6, %xmm6
>  ; SSE4-NEXT:    pcmpeqd %xmm8, %xmm5
>  ; SSE4-NEXT:    pcmpeqd %xmm0, %xmm0
>  ; SSE4-NEXT:    pxor %xmm0, %xmm5
> @@ -1203,62 +1208,74 @@ define void @truncstore_v8i64_v8i8(<8 x
>  ; SSE4-NEXT:  .LBB2_16: # %else14
>  ; SSE4-NEXT:    retq
>  ; SSE4-NEXT:  .LBB2_1: # %cond.store
> -; SSE4-NEXT:    pextrb $0, %xmm7, (%rdi)
> +; SSE4-NEXT:    pextrb $0, %xmm6, (%rdi)
>  ; SSE4-NEXT:    testb $2, %al
>  ; SSE4-NEXT:    je .LBB2_4
>  ; SSE4-NEXT:  .LBB2_3: # %cond.store1
> -; SSE4-NEXT:    pextrb $2, %xmm7, 1(%rdi)
> +; SSE4-NEXT:    pextrb $1, %xmm6, 1(%rdi)
>  ; SSE4-NEXT:    testb $4, %al
>  ; SSE4-NEXT:    je .LBB2_6
>  ; SSE4-NEXT:  .LBB2_5: # %cond.store3
> -; SSE4-NEXT:    pextrb $4, %xmm7, 2(%rdi)
> +; SSE4-NEXT:    pextrb $2, %xmm6, 2(%rdi)
>  ; SSE4-NEXT:    testb $8, %al
>  ; SSE4-NEXT:    je .LBB2_8
>  ; SSE4-NEXT:  .LBB2_7: # %cond.store5
> -; SSE4-NEXT:    pextrb $6, %xmm7, 3(%rdi)
> +; SSE4-NEXT:    pextrb $3, %xmm6, 3(%rdi)
>  ; SSE4-NEXT:    testb $16, %al
>  ; SSE4-NEXT:    je .LBB2_10
>  ; SSE4-NEXT:  .LBB2_9: # %cond.store7
> -; SSE4-NEXT:    pextrb $8, %xmm7, 4(%rdi)
> +; SSE4-NEXT:    pextrb $4, %xmm6, 4(%rdi)
>  ; SSE4-NEXT:    testb $32, %al
>  ; SSE4-NEXT:    je .LBB2_12
>  ; SSE4-NEXT:  .LBB2_11: # %cond.store9
> -; SSE4-NEXT:    pextrb $10, %xmm7, 5(%rdi)
> +; SSE4-NEXT:    pextrb $5, %xmm6, 5(%rdi)
>  ; SSE4-NEXT:    testb $64, %al
>  ; SSE4-NEXT:    je .LBB2_14
>  ; SSE4-NEXT:  .LBB2_13: # %cond.store11
> -; SSE4-NEXT:    pextrb $12, %xmm7, 6(%rdi)
> +; SSE4-NEXT:    pextrb $6, %xmm6, 6(%rdi)
>  ; SSE4-NEXT:    testb $-128, %al
>  ; SSE4-NEXT:    je .LBB2_16
>  ; SSE4-NEXT:  .LBB2_15: # %cond.store13
> -; SSE4-NEXT:    pextrb $14, %xmm7, 7(%rdi)
> +; SSE4-NEXT:    pextrb $7, %xmm6, 7(%rdi)
>  ; SSE4-NEXT:    retq
>  ;
>  ; AVX1-LABEL: truncstore_v8i64_v8i8:
>  ; AVX1:       # %bb.0:
> -; AVX1-NEXT:    vextractf128 $1, %ymm1, %xmm3
> -; AVX1-NEXT:    vmovdqa {{.*#+}} xmm4 = [127,127]
> -; AVX1-NEXT:    vpcmpgtq %xmm3, %xmm4, %xmm8
> -; AVX1-NEXT:    vpcmpgtq %xmm1, %xmm4, %xmm9
> -; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm7
> -; AVX1-NEXT:    vpcmpgtq %xmm7, %xmm4, %xmm5
> -; AVX1-NEXT:    vpcmpgtq %xmm0, %xmm4, %xmm6
> -; AVX1-NEXT:    vblendvpd %xmm6, %xmm0, %xmm4, %xmm0
> -; AVX1-NEXT:    vmovdqa {{.*#+}} xmm6 = [18446744073709551488,18446744073709551488]
> -; AVX1-NEXT:    vpcmpgtq %xmm6, %xmm0, %xmm10
> -; AVX1-NEXT:    vblendvpd %xmm5, %xmm7, %xmm4, %xmm5
> -; AVX1-NEXT:    vpcmpgtq %xmm6, %xmm5, %xmm11
> -; AVX1-NEXT:    vblendvpd %xmm9, %xmm1, %xmm4, %xmm1
> -; AVX1-NEXT:    vpcmpgtq %xmm6, %xmm1, %xmm7
> -; AVX1-NEXT:    vblendvpd %xmm8, %xmm3, %xmm4, %xmm3
> -; AVX1-NEXT:    vpcmpgtq %xmm6, %xmm3, %xmm4
> -; AVX1-NEXT:    vblendvpd %xmm4, %xmm3, %xmm6, %xmm3
> -; AVX1-NEXT:    vblendvpd %xmm7, %xmm1, %xmm6, %xmm1
> -; AVX1-NEXT:    vpackssdw %xmm3, %xmm1, %xmm1
> -; AVX1-NEXT:    vblendvpd %xmm11, %xmm5, %xmm6, %xmm3
> -; AVX1-NEXT:    vblendvpd %xmm10, %xmm0, %xmm6, %xmm0
> -; AVX1-NEXT:    vpackssdw %xmm3, %xmm0, %xmm0
> -; AVX1-NEXT:    vpackssdw %xmm1, %xmm0, %xmm0
> +; AVX1-NEXT:    vmovapd {{.*#+}} ymm9 = [127,127,127,127]
> +; AVX1-NEXT:    vextractf128 $1, %ymm1, %xmm10
> +; AVX1-NEXT:    vmovdqa {{.*#+}} xmm5 = [127,127]
> +; AVX1-NEXT:    vpcmpgtq %xmm10, %xmm5, %xmm6
> +; AVX1-NEXT:    vpcmpgtq %xmm1, %xmm5, %xmm7
> +; AVX1-NEXT:    vinsertf128 $1, %xmm6, %ymm7, %ymm8
> +; AVX1-NEXT:    vblendvpd %ymm8, %ymm1, %ymm9, %ymm8
> +; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm3
> +; AVX1-NEXT:    vpcmpgtq %xmm3, %xmm5, %xmm4
> +; AVX1-NEXT:    vpcmpgtq %xmm0, %xmm5, %xmm11
> +; AVX1-NEXT:    vinsertf128 $1, %xmm4, %ymm11, %ymm12
> +; AVX1-NEXT:    vblendvpd %ymm12, %ymm0, %ymm9, %ymm9
> +; AVX1-NEXT:    vmovapd {{.*#+}} ymm12 = [18446744073709551488,18446744073709551488,18446744073709551488,18446744073709551488]
> +; AVX1-NEXT:    vblendvpd %xmm4, %xmm3, %xmm5, %xmm3
> +; AVX1-NEXT:    vmovdqa {{.*#+}} xmm4 = [18446744073709551488,18446744073709551488]
> +; AVX1-NEXT:    vpcmpgtq %xmm4, %xmm3, %xmm3
> +; AVX1-NEXT:    vblendvpd %xmm11, %xmm0, %xmm5, %xmm0
> +; AVX1-NEXT:    vpcmpgtq %xmm4, %xmm0, %xmm0
> +; AVX1-NEXT:    vinsertf128 $1, %xmm3, %ymm0, %ymm0
> +; AVX1-NEXT:    vblendvpd %ymm0, %ymm9, %ymm12, %ymm0
> +; AVX1-NEXT:    vblendvpd %xmm6, %xmm10, %xmm5, %xmm3
> +; AVX1-NEXT:    vpcmpgtq %xmm4, %xmm3, %xmm3
> +; AVX1-NEXT:    vblendvpd %xmm7, %xmm1, %xmm5, %xmm1
> +; AVX1-NEXT:    vpcmpgtq %xmm4, %xmm1, %xmm1
> +; AVX1-NEXT:    vinsertf128 $1, %xmm3, %ymm1, %ymm1
> +; AVX1-NEXT:    vblendvpd %ymm1, %ymm8, %ymm12, %ymm1
> +; AVX1-NEXT:    vmovapd {{.*#+}} ymm3 = [255,255,255,255]
> +; AVX1-NEXT:    vandpd %ymm3, %ymm1, %ymm1
> +; AVX1-NEXT:    vextractf128 $1, %ymm1, %xmm4
> +; AVX1-NEXT:    vpackusdw %xmm4, %xmm1, %xmm1
> +; AVX1-NEXT:    vandpd %ymm3, %ymm0, %ymm0
> +; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm3
> +; AVX1-NEXT:    vpackusdw %xmm3, %xmm0, %xmm0
> +; AVX1-NEXT:    vpackusdw %xmm1, %xmm0, %xmm0
> +; AVX1-NEXT:    vpackuswb %xmm0, %xmm0, %xmm0
>  ; AVX1-NEXT:    vextractf128 $1, %ymm2, %xmm1
>  ; AVX1-NEXT:    vpxor %xmm3, %xmm3, %xmm3
>  ; AVX1-NEXT:    vpcmpeqd %xmm3, %xmm1, %xmm1
> @@ -1297,31 +1314,31 @@ define void @truncstore_v8i64_v8i8(<8 x
>  ; AVX1-NEXT:    testb $2, %al
>  ; AVX1-NEXT:    je .LBB2_4
>  ; AVX1-NEXT:  .LBB2_3: # %cond.store1
> -; AVX1-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> +; AVX1-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX1-NEXT:    testb $4, %al
>  ; AVX1-NEXT:    je .LBB2_6
>  ; AVX1-NEXT:  .LBB2_5: # %cond.store3
> -; AVX1-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> +; AVX1-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
>  ; AVX1-NEXT:    testb $8, %al
>  ; AVX1-NEXT:    je .LBB2_8
>  ; AVX1-NEXT:  .LBB2_7: # %cond.store5
> -; AVX1-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> +; AVX1-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
>  ; AVX1-NEXT:    testb $16, %al
>  ; AVX1-NEXT:    je .LBB2_10
>  ; AVX1-NEXT:  .LBB2_9: # %cond.store7
> -; AVX1-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> +; AVX1-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
>  ; AVX1-NEXT:    testb $32, %al
>  ; AVX1-NEXT:    je .LBB2_12
>  ; AVX1-NEXT:  .LBB2_11: # %cond.store9
> -; AVX1-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> +; AVX1-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
>  ; AVX1-NEXT:    testb $64, %al
>  ; AVX1-NEXT:    je .LBB2_14
>  ; AVX1-NEXT:  .LBB2_13: # %cond.store11
> -; AVX1-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> +; AVX1-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
>  ; AVX1-NEXT:    testb $-128, %al
>  ; AVX1-NEXT:    je .LBB2_16
>  ; AVX1-NEXT:  .LBB2_15: # %cond.store13
> -; AVX1-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> +; AVX1-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
>  ; AVX1-NEXT:    vzeroupper
>  ; AVX1-NEXT:    retq
>  ;
> @@ -1329,19 +1346,26 @@ define void @truncstore_v8i64_v8i8(<8 x
>  ; AVX2:       # %bb.0:
>  ; AVX2-NEXT:    vpxor %xmm3, %xmm3, %xmm3
>  ; AVX2-NEXT:    vpbroadcastq {{.*#+}} ymm4 = [127,127,127,127]
> -; AVX2-NEXT:    vpcmpgtq %ymm0, %ymm4, %ymm5
> -; AVX2-NEXT:    vblendvpd %ymm5, %ymm0, %ymm4, %ymm0
>  ; AVX2-NEXT:    vpcmpgtq %ymm1, %ymm4, %ymm5
>  ; AVX2-NEXT:    vblendvpd %ymm5, %ymm1, %ymm4, %ymm1
> +; AVX2-NEXT:    vpcmpgtq %ymm0, %ymm4, %ymm5
> +; AVX2-NEXT:    vblendvpd %ymm5, %ymm0, %ymm4, %ymm0
>  ; AVX2-NEXT:    vpbroadcastq {{.*#+}} ymm4 = [18446744073709551488,18446744073709551488,18446744073709551488,18446744073709551488]
> -; AVX2-NEXT:    vpcmpgtq %ymm4, %ymm1, %ymm5
> -; AVX2-NEXT:    vblendvpd %ymm5, %ymm1, %ymm4, %ymm1
>  ; AVX2-NEXT:    vpcmpgtq %ymm4, %ymm0, %ymm5
>  ; AVX2-NEXT:    vblendvpd %ymm5, %ymm0, %ymm4, %ymm0
> -; AVX2-NEXT:    vpackssdw %ymm1, %ymm0, %ymm0
> -; AVX2-NEXT:    vpermq {{.*#+}} ymm0 = ymm0[0,2,1,3]
> -; AVX2-NEXT:    vextracti128 $1, %ymm0, %xmm1
> -; AVX2-NEXT:    vpackssdw %xmm1, %xmm0, %xmm0
> +; AVX2-NEXT:    vpcmpgtq %ymm4, %ymm1, %ymm5
> +; AVX2-NEXT:    vblendvpd %ymm5, %ymm1, %ymm4, %ymm1
> +; AVX2-NEXT:    vextractf128 $1, %ymm1, %xmm4
> +; AVX2-NEXT:    vmovdqa {{.*#+}} xmm5 = <u,u,0,8,u,u,u,u,u,u,u,u,u,u,u,u>
> +; AVX2-NEXT:    vpshufb %xmm5, %xmm4, %xmm4
> +; AVX2-NEXT:    vpshufb %xmm5, %xmm1, %xmm1
> +; AVX2-NEXT:    vpunpcklwd {{.*#+}} xmm1 = xmm1[0],xmm4[0],xmm1[1],xmm4[1],xmm1[2],xmm4[2],xmm1[3],xmm4[3]
> +; AVX2-NEXT:    vextractf128 $1, %ymm0, %xmm4
> +; AVX2-NEXT:    vmovdqa {{.*#+}} xmm5 = <0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u>
> +; AVX2-NEXT:    vpshufb %xmm5, %xmm4, %xmm4
> +; AVX2-NEXT:    vpshufb %xmm5, %xmm0, %xmm0
> +; AVX2-NEXT:    vpunpcklwd {{.*#+}} xmm0 = xmm0[0],xmm4[0],xmm0[1],xmm4[1],xmm0[2],xmm4[2],xmm0[3],xmm4[3]
> +; AVX2-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm1[1],xmm0[2,3]
>  ; AVX2-NEXT:    vpcmpeqd %ymm3, %ymm2, %ymm1
>  ; AVX2-NEXT:    vmovmskps %ymm1, %eax
>  ; AVX2-NEXT:    notl %eax
> @@ -1376,31 +1400,31 @@ define void @truncstore_v8i64_v8i8(<8 x
>  ; AVX2-NEXT:    testb $2, %al
>  ; AVX2-NEXT:    je .LBB2_4
>  ; AVX2-NEXT:  .LBB2_3: # %cond.store1
> -; AVX2-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> +; AVX2-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX2-NEXT:    testb $4, %al
>  ; AVX2-NEXT:    je .LBB2_6
>  ; AVX2-NEXT:  .LBB2_5: # %cond.store3
> -; AVX2-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> +; AVX2-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
>  ; AVX2-NEXT:    testb $8, %al
>  ; AVX2-NEXT:    je .LBB2_8
>  ; AVX2-NEXT:  .LBB2_7: # %cond.store5
> -; AVX2-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> +; AVX2-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
>  ; AVX2-NEXT:    testb $16, %al
>  ; AVX2-NEXT:    je .LBB2_10
>  ; AVX2-NEXT:  .LBB2_9: # %cond.store7
> -; AVX2-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> +; AVX2-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
>  ; AVX2-NEXT:    testb $32, %al
>  ; AVX2-NEXT:    je .LBB2_12
>  ; AVX2-NEXT:  .LBB2_11: # %cond.store9
> -; AVX2-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> +; AVX2-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
>  ; AVX2-NEXT:    testb $64, %al
>  ; AVX2-NEXT:    je .LBB2_14
>  ; AVX2-NEXT:  .LBB2_13: # %cond.store11
> -; AVX2-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> +; AVX2-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
>  ; AVX2-NEXT:    testb $-128, %al
>  ; AVX2-NEXT:    je .LBB2_16
>  ; AVX2-NEXT:  .LBB2_15: # %cond.store13
> -; AVX2-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> +; AVX2-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
>  ; AVX2-NEXT:    vzeroupper
>  ; AVX2-NEXT:    retq
>  ;
> @@ -1410,7 +1434,7 @@ define void @truncstore_v8i64_v8i8(<8 x
>  ; AVX512F-NEXT:    vptestmd %zmm1, %zmm1, %k0
>  ; AVX512F-NEXT:    vpminsq {{.*}}(%rip){1to8}, %zmm0, %zmm0
>  ; AVX512F-NEXT:    vpmaxsq {{.*}}(%rip){1to8}, %zmm0, %zmm0
> -; AVX512F-NEXT:    vpmovqw %zmm0, %xmm0
> +; AVX512F-NEXT:    vpmovqb %zmm0, %xmm0
>  ; AVX512F-NEXT:    kmovw %k0, %eax
>  ; AVX512F-NEXT:    testb $1, %al
>  ; AVX512F-NEXT:    jne .LBB2_1
> @@ -1443,31 +1467,31 @@ define void @truncstore_v8i64_v8i8(<8 x
>  ; AVX512F-NEXT:    testb $2, %al
>  ; AVX512F-NEXT:    je .LBB2_4
>  ; AVX512F-NEXT:  .LBB2_3: # %cond.store1
> -; AVX512F-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> +; AVX512F-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX512F-NEXT:    testb $4, %al
>  ; AVX512F-NEXT:    je .LBB2_6
>  ; AVX512F-NEXT:  .LBB2_5: # %cond.store3
> -; AVX512F-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> +; AVX512F-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
>  ; AVX512F-NEXT:    testb $8, %al
>  ; AVX512F-NEXT:    je .LBB2_8
>  ; AVX512F-NEXT:  .LBB2_7: # %cond.store5
> -; AVX512F-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> +; AVX512F-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
>  ; AVX512F-NEXT:    testb $16, %al
>  ; AVX512F-NEXT:    je .LBB2_10
>  ; AVX512F-NEXT:  .LBB2_9: # %cond.store7
> -; AVX512F-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> +; AVX512F-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
>  ; AVX512F-NEXT:    testb $32, %al
>  ; AVX512F-NEXT:    je .LBB2_12
>  ; AVX512F-NEXT:  .LBB2_11: # %cond.store9
> -; AVX512F-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> +; AVX512F-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
>  ; AVX512F-NEXT:    testb $64, %al
>  ; AVX512F-NEXT:    je .LBB2_14
>  ; AVX512F-NEXT:  .LBB2_13: # %cond.store11
> -; AVX512F-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> +; AVX512F-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
>  ; AVX512F-NEXT:    testb $-128, %al
>  ; AVX512F-NEXT:    je .LBB2_16
>  ; AVX512F-NEXT:  .LBB2_15: # %cond.store13
> -; AVX512F-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> +; AVX512F-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
>  ; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
>  ;
> @@ -1744,7 +1768,7 @@ define void @truncstore_v4i64_v4i16(<4 x
>  ; SSE2-NEXT:    pxor %xmm9, %xmm9
>  ; SSE2-NEXT:    movdqa {{.*#+}} xmm8 = [32767,32767]
>  ; SSE2-NEXT:    movdqa {{.*#+}} xmm4 = [2147483648,2147483648]
> -; SSE2-NEXT:    movdqa %xmm0, %xmm5
> +; SSE2-NEXT:    movdqa %xmm1, %xmm5
>  ; SSE2-NEXT:    pxor %xmm4, %xmm5
>  ; SSE2-NEXT:    movdqa {{.*#+}} xmm10 = [2147516415,2147516415]
>  ; SSE2-NEXT:    movdqa %xmm10, %xmm7
> @@ -1755,50 +1779,54 @@ define void @truncstore_v4i64_v4i16(<4 x
>  ; SSE2-NEXT:    pand %xmm3, %xmm6
>  ; SSE2-NEXT:    pshufd {{.*#+}} xmm5 = xmm7[1,1,3,3]
>  ; SSE2-NEXT:    por %xmm6, %xmm5
> -; SSE2-NEXT:    pand %xmm5, %xmm0
> +; SSE2-NEXT:    pand %xmm5, %xmm1
>  ; SSE2-NEXT:    pandn %xmm8, %xmm5
> -; SSE2-NEXT:    por %xmm0, %xmm5
> -; SSE2-NEXT:    movdqa %xmm1, %xmm0
> -; SSE2-NEXT:    pxor %xmm4, %xmm0
> +; SSE2-NEXT:    por %xmm1, %xmm5
> +; SSE2-NEXT:    movdqa %xmm0, %xmm1
> +; SSE2-NEXT:    pxor %xmm4, %xmm1
>  ; SSE2-NEXT:    movdqa %xmm10, %xmm3
> -; SSE2-NEXT:    pcmpgtd %xmm0, %xmm3
> +; SSE2-NEXT:    pcmpgtd %xmm1, %xmm3
>  ; SSE2-NEXT:    pshufd {{.*#+}} xmm6 = xmm3[0,0,2,2]
> -; SSE2-NEXT:    pcmpeqd %xmm10, %xmm0
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,1,3,3]
> -; SSE2-NEXT:    pand %xmm6, %xmm0
> +; SSE2-NEXT:    pcmpeqd %xmm10, %xmm1
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
> +; SSE2-NEXT:    pand %xmm6, %xmm1
>  ; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm3[1,1,3,3]
> -; SSE2-NEXT:    por %xmm0, %xmm3
> -; SSE2-NEXT:    pand %xmm3, %xmm1
> -; SSE2-NEXT:    pandn %xmm8, %xmm3
>  ; SSE2-NEXT:    por %xmm1, %xmm3
> +; SSE2-NEXT:    pand %xmm3, %xmm0
> +; SSE2-NEXT:    pandn %xmm8, %xmm3
> +; SSE2-NEXT:    por %xmm0, %xmm3
>  ; SSE2-NEXT:    movdqa {{.*#+}} xmm8 = [18446744073709518848,18446744073709518848]
> -; SSE2-NEXT:    movdqa %xmm3, %xmm0
> -; SSE2-NEXT:    pxor %xmm4, %xmm0
> +; SSE2-NEXT:    movdqa %xmm3, %xmm1
> +; SSE2-NEXT:    pxor %xmm4, %xmm1
>  ; SSE2-NEXT:    movdqa {{.*#+}} xmm6 = [18446744071562035200,18446744071562035200]
> -; SSE2-NEXT:    movdqa %xmm0, %xmm7
> +; SSE2-NEXT:    movdqa %xmm1, %xmm7
>  ; SSE2-NEXT:    pcmpgtd %xmm6, %xmm7
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm7[0,0,2,2]
> -; SSE2-NEXT:    pcmpeqd %xmm6, %xmm0
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,1,3,3]
> -; SSE2-NEXT:    pand %xmm1, %xmm0
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm7[1,1,3,3]
> -; SSE2-NEXT:    por %xmm0, %xmm1
> -; SSE2-NEXT:    pand %xmm1, %xmm3
> -; SSE2-NEXT:    pandn %xmm8, %xmm1
> -; SSE2-NEXT:    por %xmm3, %xmm1
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm7[0,0,2,2]
> +; SSE2-NEXT:    pcmpeqd %xmm6, %xmm1
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
> +; SSE2-NEXT:    pand %xmm0, %xmm1
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm7[1,1,3,3]
> +; SSE2-NEXT:    por %xmm1, %xmm0
> +; SSE2-NEXT:    pand %xmm0, %xmm3
> +; SSE2-NEXT:    pandn %xmm8, %xmm0
> +; SSE2-NEXT:    por %xmm3, %xmm0
>  ; SSE2-NEXT:    pxor %xmm5, %xmm4
> -; SSE2-NEXT:    movdqa %xmm4, %xmm0
> -; SSE2-NEXT:    pcmpgtd %xmm6, %xmm0
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm0[0,0,2,2]
> +; SSE2-NEXT:    movdqa %xmm4, %xmm1
> +; SSE2-NEXT:    pcmpgtd %xmm6, %xmm1
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm1[0,0,2,2]
>  ; SSE2-NEXT:    pcmpeqd %xmm6, %xmm4
>  ; SSE2-NEXT:    pshufd {{.*#+}} xmm4 = xmm4[1,1,3,3]
>  ; SSE2-NEXT:    pand %xmm3, %xmm4
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,1,3,3]
> -; SSE2-NEXT:    por %xmm4, %xmm0
> -; SSE2-NEXT:    pand %xmm0, %xmm5
> -; SSE2-NEXT:    pandn %xmm8, %xmm0
> -; SSE2-NEXT:    por %xmm5, %xmm0
> -; SSE2-NEXT:    packssdw %xmm1, %xmm0
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
> +; SSE2-NEXT:    por %xmm4, %xmm1
> +; SSE2-NEXT:    pand %xmm1, %xmm5
> +; SSE2-NEXT:    pandn %xmm8, %xmm1
> +; SSE2-NEXT:    por %xmm5, %xmm1
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
> +; SSE2-NEXT:    pshuflw {{.*#+}} xmm1 = xmm1[0,2,2,3,4,5,6,7]
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> +; SSE2-NEXT:    pshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> +; SSE2-NEXT:    punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
>  ; SSE2-NEXT:    pcmpeqd %xmm2, %xmm9
>  ; SSE2-NEXT:    movmskps %xmm9, %eax
>  ; SSE2-NEXT:    xorl $15, %eax
> @@ -1821,17 +1849,17 @@ define void @truncstore_v4i64_v4i16(<4 x
>  ; SSE2-NEXT:    testb $2, %al
>  ; SSE2-NEXT:    je .LBB4_4
>  ; SSE2-NEXT:  .LBB4_3: # %cond.store1
> -; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> +; SSE2-NEXT:    pextrw $1, %xmm0, %ecx
>  ; SSE2-NEXT:    movw %cx, 2(%rdi)
>  ; SSE2-NEXT:    testb $4, %al
>  ; SSE2-NEXT:    je .LBB4_6
>  ; SSE2-NEXT:  .LBB4_5: # %cond.store3
> -; SSE2-NEXT:    pextrw $4, %xmm0, %ecx
> +; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
>  ; SSE2-NEXT:    movw %cx, 4(%rdi)
>  ; SSE2-NEXT:    testb $8, %al
>  ; SSE2-NEXT:    je .LBB4_8
>  ; SSE2-NEXT:  .LBB4_7: # %cond.store5
> -; SSE2-NEXT:    pextrw $6, %xmm0, %eax
> +; SSE2-NEXT:    pextrw $3, %xmm0, %eax
>  ; SSE2-NEXT:    movw %ax, 6(%rdi)
>  ; SSE2-NEXT:    retq
>  ;
> @@ -1841,12 +1869,12 @@ define void @truncstore_v4i64_v4i16(<4 x
>  ; SSE4-NEXT:    pxor %xmm4, %xmm4
>  ; SSE4-NEXT:    movdqa {{.*#+}} xmm5 = [32767,32767]
>  ; SSE4-NEXT:    movdqa %xmm5, %xmm0
> -; SSE4-NEXT:    pcmpgtq %xmm3, %xmm0
> +; SSE4-NEXT:    pcmpgtq %xmm1, %xmm0
>  ; SSE4-NEXT:    movdqa %xmm5, %xmm6
> -; SSE4-NEXT:    blendvpd %xmm0, %xmm3, %xmm6
> +; SSE4-NEXT:    blendvpd %xmm0, %xmm1, %xmm6
>  ; SSE4-NEXT:    movdqa %xmm5, %xmm0
> -; SSE4-NEXT:    pcmpgtq %xmm1, %xmm0
> -; SSE4-NEXT:    blendvpd %xmm0, %xmm1, %xmm5
> +; SSE4-NEXT:    pcmpgtq %xmm3, %xmm0
> +; SSE4-NEXT:    blendvpd %xmm0, %xmm3, %xmm5
>  ; SSE4-NEXT:    movdqa {{.*#+}} xmm1 = [18446744073709518848,18446744073709518848]
>  ; SSE4-NEXT:    movapd %xmm5, %xmm0
>  ; SSE4-NEXT:    pcmpgtq %xmm1, %xmm0
> @@ -1855,7 +1883,11 @@ define void @truncstore_v4i64_v4i16(<4 x
>  ; SSE4-NEXT:    movapd %xmm6, %xmm0
>  ; SSE4-NEXT:    pcmpgtq %xmm1, %xmm0
>  ; SSE4-NEXT:    blendvpd %xmm0, %xmm6, %xmm1
> -; SSE4-NEXT:    packssdw %xmm3, %xmm1
> +; SSE4-NEXT:    pshufd {{.*#+}} xmm0 = xmm1[0,2,2,3]
> +; SSE4-NEXT:    pshuflw {{.*#+}} xmm1 = xmm0[0,2,2,3,4,5,6,7]
> +; SSE4-NEXT:    pshufd {{.*#+}} xmm0 = xmm3[0,2,2,3]
> +; SSE4-NEXT:    pshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> +; SSE4-NEXT:    punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
>  ; SSE4-NEXT:    pcmpeqd %xmm2, %xmm4
>  ; SSE4-NEXT:    movmskps %xmm4, %eax
>  ; SSE4-NEXT:    xorl $15, %eax
> @@ -1873,19 +1905,19 @@ define void @truncstore_v4i64_v4i16(<4 x
>  ; SSE4-NEXT:  .LBB4_8: # %else6
>  ; SSE4-NEXT:    retq
>  ; SSE4-NEXT:  .LBB4_1: # %cond.store
> -; SSE4-NEXT:    pextrw $0, %xmm1, (%rdi)
> +; SSE4-NEXT:    pextrw $0, %xmm0, (%rdi)
>  ; SSE4-NEXT:    testb $2, %al
>  ; SSE4-NEXT:    je .LBB4_4
>  ; SSE4-NEXT:  .LBB4_3: # %cond.store1
> -; SSE4-NEXT:    pextrw $2, %xmm1, 2(%rdi)
> +; SSE4-NEXT:    pextrw $1, %xmm0, 2(%rdi)
>  ; SSE4-NEXT:    testb $4, %al
>  ; SSE4-NEXT:    je .LBB4_6
>  ; SSE4-NEXT:  .LBB4_5: # %cond.store3
> -; SSE4-NEXT:    pextrw $4, %xmm1, 4(%rdi)
> +; SSE4-NEXT:    pextrw $2, %xmm0, 4(%rdi)
>  ; SSE4-NEXT:    testb $8, %al
>  ; SSE4-NEXT:    je .LBB4_8
>  ; SSE4-NEXT:  .LBB4_7: # %cond.store5
> -; SSE4-NEXT:    pextrw $6, %xmm1, 6(%rdi)
> +; SSE4-NEXT:    pextrw $3, %xmm0, 6(%rdi)
>  ; SSE4-NEXT:    retq
>  ;
>  ; AVX1-LABEL: truncstore_v4i64_v4i16:
> @@ -1901,8 +1933,12 @@ define void @truncstore_v4i64_v4i16(<4 x
>  ; AVX1-NEXT:    vblendvpd %xmm5, %xmm3, %xmm4, %xmm3
>  ; AVX1-NEXT:    vpcmpgtq %xmm6, %xmm3, %xmm4
>  ; AVX1-NEXT:    vblendvpd %xmm4, %xmm3, %xmm6, %xmm3
> +; AVX1-NEXT:    vpermilps {{.*#+}} xmm3 = xmm3[0,2,2,3]
> +; AVX1-NEXT:    vpshuflw {{.*#+}} xmm3 = xmm3[0,2,2,3,4,5,6,7]
>  ; AVX1-NEXT:    vblendvpd %xmm7, %xmm0, %xmm6, %xmm0
> -; AVX1-NEXT:    vpackssdw %xmm3, %xmm0, %xmm0
> +; AVX1-NEXT:    vpermilps {{.*#+}} xmm0 = xmm0[0,2,2,3]
> +; AVX1-NEXT:    vpshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> +; AVX1-NEXT:    vpunpckldq {{.*#+}} xmm0 = xmm0[0],xmm3[0],xmm0[1],xmm3[1]
>  ; AVX1-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
>  ; AVX1-NEXT:    vmovmskps %xmm1, %eax
>  ; AVX1-NEXT:    xorl $15, %eax
> @@ -1925,15 +1961,15 @@ define void @truncstore_v4i64_v4i16(<4 x
>  ; AVX1-NEXT:    testb $2, %al
>  ; AVX1-NEXT:    je .LBB4_4
>  ; AVX1-NEXT:  .LBB4_3: # %cond.store1
> -; AVX1-NEXT:    vpextrw $2, %xmm0, 2(%rdi)
> +; AVX1-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
>  ; AVX1-NEXT:    testb $4, %al
>  ; AVX1-NEXT:    je .LBB4_6
>  ; AVX1-NEXT:  .LBB4_5: # %cond.store3
> -; AVX1-NEXT:    vpextrw $4, %xmm0, 4(%rdi)
> +; AVX1-NEXT:    vpextrw $2, %xmm0, 4(%rdi)
>  ; AVX1-NEXT:    testb $8, %al
>  ; AVX1-NEXT:    je .LBB4_8
>  ; AVX1-NEXT:  .LBB4_7: # %cond.store5
> -; AVX1-NEXT:    vpextrw $6, %xmm0, 6(%rdi)
> +; AVX1-NEXT:    vpextrw $3, %xmm0, 6(%rdi)
>  ; AVX1-NEXT:    vzeroupper
>  ; AVX1-NEXT:    retq
>  ;
> @@ -1947,7 +1983,11 @@ define void @truncstore_v4i64_v4i16(<4 x
>  ; AVX2-NEXT:    vpcmpgtq %ymm3, %ymm0, %ymm4
>  ; AVX2-NEXT:    vblendvpd %ymm4, %ymm0, %ymm3, %ymm0
>  ; AVX2-NEXT:    vextractf128 $1, %ymm0, %xmm3
> -; AVX2-NEXT:    vpackssdw %xmm3, %xmm0, %xmm0
> +; AVX2-NEXT:    vpermilps {{.*#+}} xmm3 = xmm3[0,2,2,3]
> +; AVX2-NEXT:    vpshuflw {{.*#+}} xmm3 = xmm3[0,2,2,3,4,5,6,7]
> +; AVX2-NEXT:    vpermilps {{.*#+}} xmm0 = xmm0[0,2,2,3]
> +; AVX2-NEXT:    vpshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> +; AVX2-NEXT:    vpunpckldq {{.*#+}} xmm0 = xmm0[0],xmm3[0],xmm0[1],xmm3[1]
>  ; AVX2-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
>  ; AVX2-NEXT:    vmovmskps %xmm1, %eax
>  ; AVX2-NEXT:    xorl $15, %eax
> @@ -1970,15 +2010,15 @@ define void @truncstore_v4i64_v4i16(<4 x
>  ; AVX2-NEXT:    testb $2, %al
>  ; AVX2-NEXT:    je .LBB4_4
>  ; AVX2-NEXT:  .LBB4_3: # %cond.store1
> -; AVX2-NEXT:    vpextrw $2, %xmm0, 2(%rdi)
> +; AVX2-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
>  ; AVX2-NEXT:    testb $4, %al
>  ; AVX2-NEXT:    je .LBB4_6
>  ; AVX2-NEXT:  .LBB4_5: # %cond.store3
> -; AVX2-NEXT:    vpextrw $4, %xmm0, 4(%rdi)
> +; AVX2-NEXT:    vpextrw $2, %xmm0, 4(%rdi)
>  ; AVX2-NEXT:    testb $8, %al
>  ; AVX2-NEXT:    je .LBB4_8
>  ; AVX2-NEXT:  .LBB4_7: # %cond.store5
> -; AVX2-NEXT:    vpextrw $6, %xmm0, 6(%rdi)
> +; AVX2-NEXT:    vpextrw $3, %xmm0, 6(%rdi)
>  ; AVX2-NEXT:    vzeroupper
>  ; AVX2-NEXT:    retq
>  ;
> @@ -1991,7 +2031,7 @@ define void @truncstore_v4i64_v4i16(<4 x
>  ; AVX512F-NEXT:    vpminsq %zmm1, %zmm0, %zmm0
>  ; AVX512F-NEXT:    vpbroadcastq {{.*#+}} ymm1 = [18446744073709518848,18446744073709518848,18446744073709518848,18446744073709518848]
>  ; AVX512F-NEXT:    vpmaxsq %zmm1, %zmm0, %zmm0
> -; AVX512F-NEXT:    vpmovqd %zmm0, %ymm0
> +; AVX512F-NEXT:    vpmovqw %zmm0, %xmm0
>  ; AVX512F-NEXT:    kmovw %k0, %eax
>  ; AVX512F-NEXT:    testb $1, %al
>  ; AVX512F-NEXT:    jne .LBB4_1
> @@ -2012,15 +2052,15 @@ define void @truncstore_v4i64_v4i16(<4 x
>  ; AVX512F-NEXT:    testb $2, %al
>  ; AVX512F-NEXT:    je .LBB4_4
>  ; AVX512F-NEXT:  .LBB4_3: # %cond.store1
> -; AVX512F-NEXT:    vpextrw $2, %xmm0, 2(%rdi)
> +; AVX512F-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
>  ; AVX512F-NEXT:    testb $4, %al
>  ; AVX512F-NEXT:    je .LBB4_6
>  ; AVX512F-NEXT:  .LBB4_5: # %cond.store3
> -; AVX512F-NEXT:    vpextrw $4, %xmm0, 4(%rdi)
> +; AVX512F-NEXT:    vpextrw $2, %xmm0, 4(%rdi)
>  ; AVX512F-NEXT:    testb $8, %al
>  ; AVX512F-NEXT:    je .LBB4_8
>  ; AVX512F-NEXT:  .LBB4_7: # %cond.store5
> -; AVX512F-NEXT:    vpextrw $6, %xmm0, 6(%rdi)
> +; AVX512F-NEXT:    vpextrw $3, %xmm0, 6(%rdi)
>  ; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
>  ;
> @@ -2029,14 +2069,13 @@ define void @truncstore_v4i64_v4i16(<4 x
>  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
>  ; AVX512BW-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
>  ; AVX512BW-NEXT:    vptestmd %zmm1, %zmm1, %k0
> +; AVX512BW-NEXT:    kshiftld $28, %k0, %k0
> +; AVX512BW-NEXT:    kshiftrd $28, %k0, %k1
>  ; AVX512BW-NEXT:    vpbroadcastq {{.*#+}} ymm1 = [32767,32767,32767,32767]
>  ; AVX512BW-NEXT:    vpminsq %zmm1, %zmm0, %zmm0
>  ; AVX512BW-NEXT:    vpbroadcastq {{.*#+}} ymm1 = [18446744073709518848,18446744073709518848,18446744073709518848,18446744073709518848]
>  ; AVX512BW-NEXT:    vpmaxsq %zmm1, %zmm0, %zmm0
> -; AVX512BW-NEXT:    vpmovqd %zmm0, %ymm0
> -; AVX512BW-NEXT:    vpackssdw %xmm0, %xmm0, %xmm0
> -; AVX512BW-NEXT:    kshiftld $28, %k0, %k0
> -; AVX512BW-NEXT:    kshiftrd $28, %k0, %k1
> +; AVX512BW-NEXT:    vpmovqw %zmm0, %xmm0
>  ; AVX512BW-NEXT:    vmovdqu16 %zmm0, (%rdi) {%k1}
>  ; AVX512BW-NEXT:    vzeroupper
>  ; AVX512BW-NEXT:    retq
> @@ -2065,7 +2104,7 @@ define void @truncstore_v4i64_v4i8(<4 x
>  ; SSE2-NEXT:    pxor %xmm9, %xmm9
>  ; SSE2-NEXT:    movdqa {{.*#+}} xmm8 = [127,127]
>  ; SSE2-NEXT:    movdqa {{.*#+}} xmm4 = [2147483648,2147483648]
> -; SSE2-NEXT:    movdqa %xmm0, %xmm5
> +; SSE2-NEXT:    movdqa %xmm1, %xmm5
>  ; SSE2-NEXT:    pxor %xmm4, %xmm5
>  ; SSE2-NEXT:    movdqa {{.*#+}} xmm10 = [2147483775,2147483775]
>  ; SSE2-NEXT:    movdqa %xmm10, %xmm7
> @@ -2076,83 +2115,88 @@ define void @truncstore_v4i64_v4i8(<4 x
>  ; SSE2-NEXT:    pand %xmm3, %xmm6
>  ; SSE2-NEXT:    pshufd {{.*#+}} xmm5 = xmm7[1,1,3,3]
>  ; SSE2-NEXT:    por %xmm6, %xmm5
> -; SSE2-NEXT:    pand %xmm5, %xmm0
> +; SSE2-NEXT:    pand %xmm5, %xmm1
>  ; SSE2-NEXT:    pandn %xmm8, %xmm5
> -; SSE2-NEXT:    por %xmm0, %xmm5
> -; SSE2-NEXT:    movdqa %xmm1, %xmm0
> -; SSE2-NEXT:    pxor %xmm4, %xmm0
> +; SSE2-NEXT:    por %xmm1, %xmm5
> +; SSE2-NEXT:    movdqa %xmm0, %xmm1
> +; SSE2-NEXT:    pxor %xmm4, %xmm1
>  ; SSE2-NEXT:    movdqa %xmm10, %xmm3
> -; SSE2-NEXT:    pcmpgtd %xmm0, %xmm3
> +; SSE2-NEXT:    pcmpgtd %xmm1, %xmm3
>  ; SSE2-NEXT:    pshufd {{.*#+}} xmm6 = xmm3[0,0,2,2]
> -; SSE2-NEXT:    pcmpeqd %xmm10, %xmm0
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,1,3,3]
> -; SSE2-NEXT:    pand %xmm6, %xmm0
> +; SSE2-NEXT:    pcmpeqd %xmm10, %xmm1
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
> +; SSE2-NEXT:    pand %xmm6, %xmm1
>  ; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm3[1,1,3,3]
> -; SSE2-NEXT:    por %xmm0, %xmm3
> -; SSE2-NEXT:    pand %xmm3, %xmm1
> -; SSE2-NEXT:    pandn %xmm8, %xmm3
>  ; SSE2-NEXT:    por %xmm1, %xmm3
> +; SSE2-NEXT:    pand %xmm3, %xmm0
> +; SSE2-NEXT:    pandn %xmm8, %xmm3
> +; SSE2-NEXT:    por %xmm0, %xmm3
>  ; SSE2-NEXT:    movdqa {{.*#+}} xmm8 = [18446744073709551488,18446744073709551488]
>  ; SSE2-NEXT:    movdqa %xmm3, %xmm0
>  ; SSE2-NEXT:    pxor %xmm4, %xmm0
> -; SSE2-NEXT:    movdqa {{.*#+}} xmm6 = [18446744071562067840,18446744071562067840]
> +; SSE2-NEXT:    movdqa {{.*#+}} xmm10 = [18446744071562067840,18446744071562067840]
>  ; SSE2-NEXT:    movdqa %xmm0, %xmm7
> -; SSE2-NEXT:    pcmpgtd %xmm6, %xmm7
> +; SSE2-NEXT:    pcmpgtd %xmm10, %xmm7
>  ; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm7[0,0,2,2]
> -; SSE2-NEXT:    pcmpeqd %xmm6, %xmm0
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,1,3,3]
> -; SSE2-NEXT:    pand %xmm1, %xmm0
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm7[1,1,3,3]
> -; SSE2-NEXT:    por %xmm0, %xmm1
> -; SSE2-NEXT:    pand %xmm1, %xmm3
> -; SSE2-NEXT:    pandn %xmm8, %xmm1
> -; SSE2-NEXT:    por %xmm3, %xmm1
> +; SSE2-NEXT:    pcmpeqd %xmm10, %xmm0
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm6 = xmm0[1,1,3,3]
> +; SSE2-NEXT:    pand %xmm1, %xmm6
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm7[1,1,3,3]
> +; SSE2-NEXT:    por %xmm6, %xmm0
> +; SSE2-NEXT:    pand %xmm0, %xmm3
> +; SSE2-NEXT:    pandn %xmm8, %xmm0
> +; SSE2-NEXT:    por %xmm3, %xmm0
>  ; SSE2-NEXT:    pxor %xmm5, %xmm4
> -; SSE2-NEXT:    movdqa %xmm4, %xmm0
> -; SSE2-NEXT:    pcmpgtd %xmm6, %xmm0
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm0[0,0,2,2]
> -; SSE2-NEXT:    pcmpeqd %xmm6, %xmm4
> +; SSE2-NEXT:    movdqa %xmm4, %xmm1
> +; SSE2-NEXT:    pcmpgtd %xmm10, %xmm1
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm1[0,0,2,2]
> +; SSE2-NEXT:    pcmpeqd %xmm10, %xmm4
>  ; SSE2-NEXT:    pshufd {{.*#+}} xmm4 = xmm4[1,1,3,3]
>  ; SSE2-NEXT:    pand %xmm3, %xmm4
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,1,3,3]
> -; SSE2-NEXT:    por %xmm4, %xmm0
> -; SSE2-NEXT:    pand %xmm0, %xmm5
> -; SSE2-NEXT:    pandn %xmm8, %xmm0
> -; SSE2-NEXT:    por %xmm5, %xmm0
> -; SSE2-NEXT:    packssdw %xmm1, %xmm0
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
> +; SSE2-NEXT:    por %xmm4, %xmm1
> +; SSE2-NEXT:    pand %xmm1, %xmm5
> +; SSE2-NEXT:    pandn %xmm8, %xmm1
> +; SSE2-NEXT:    por %xmm5, %xmm1
> +; SSE2-NEXT:    movdqa {{.*#+}} xmm3 = [255,0,0,0,0,0,0,0,255,0,0,0,0,0,0,0]
> +; SSE2-NEXT:    pand %xmm3, %xmm1
> +; SSE2-NEXT:    pand %xmm3, %xmm0
> +; SSE2-NEXT:    packuswb %xmm1, %xmm0
> +; SSE2-NEXT:    packuswb %xmm0, %xmm0
> +; SSE2-NEXT:    packuswb %xmm0, %xmm0
>  ; SSE2-NEXT:    pcmpeqd %xmm2, %xmm9
> -; SSE2-NEXT:    movmskps %xmm9, %eax
> -; SSE2-NEXT:    xorl $15, %eax
> -; SSE2-NEXT:    testb $1, %al
> +; SSE2-NEXT:    movmskps %xmm9, %ecx
> +; SSE2-NEXT:    xorl $15, %ecx
> +; SSE2-NEXT:    testb $1, %cl
> +; SSE2-NEXT:    movd %xmm0, %eax
>  ; SSE2-NEXT:    jne .LBB5_1
>  ; SSE2-NEXT:  # %bb.2: # %else
> -; SSE2-NEXT:    testb $2, %al
> +; SSE2-NEXT:    testb $2, %cl
>  ; SSE2-NEXT:    jne .LBB5_3
>  ; SSE2-NEXT:  .LBB5_4: # %else2
> -; SSE2-NEXT:    testb $4, %al
> +; SSE2-NEXT:    testb $4, %cl
>  ; SSE2-NEXT:    jne .LBB5_5
>  ; SSE2-NEXT:  .LBB5_6: # %else4
> -; SSE2-NEXT:    testb $8, %al
> +; SSE2-NEXT:    testb $8, %cl
>  ; SSE2-NEXT:    jne .LBB5_7
>  ; SSE2-NEXT:  .LBB5_8: # %else6
>  ; SSE2-NEXT:    retq
>  ; SSE2-NEXT:  .LBB5_1: # %cond.store
> -; SSE2-NEXT:    movd %xmm0, %ecx
> -; SSE2-NEXT:    movb %cl, (%rdi)
> -; SSE2-NEXT:    testb $2, %al
> +; SSE2-NEXT:    movb %al, (%rdi)
> +; SSE2-NEXT:    testb $2, %cl
>  ; SSE2-NEXT:    je .LBB5_4
>  ; SSE2-NEXT:  .LBB5_3: # %cond.store1
> -; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> -; SSE2-NEXT:    movb %cl, 1(%rdi)
> -; SSE2-NEXT:    testb $4, %al
> +; SSE2-NEXT:    movb %ah, 1(%rdi)
> +; SSE2-NEXT:    testb $4, %cl
>  ; SSE2-NEXT:    je .LBB5_6
>  ; SSE2-NEXT:  .LBB5_5: # %cond.store3
> -; SSE2-NEXT:    pextrw $4, %xmm0, %ecx
> -; SSE2-NEXT:    movb %cl, 2(%rdi)
> -; SSE2-NEXT:    testb $8, %al
> +; SSE2-NEXT:    movl %eax, %edx
> +; SSE2-NEXT:    shrl $16, %edx
> +; SSE2-NEXT:    movb %dl, 2(%rdi)
> +; SSE2-NEXT:    testb $8, %cl
>  ; SSE2-NEXT:    je .LBB5_8
>  ; SSE2-NEXT:  .LBB5_7: # %cond.store5
> -; SSE2-NEXT:    pextrw $6, %xmm0, %eax
> +; SSE2-NEXT:    shrl $24, %eax
>  ; SSE2-NEXT:    movb %al, 3(%rdi)
>  ; SSE2-NEXT:    retq
>  ;
> @@ -2162,21 +2206,24 @@ define void @truncstore_v4i64_v4i8(<4 x
>  ; SSE4-NEXT:    pxor %xmm4, %xmm4
>  ; SSE4-NEXT:    movdqa {{.*#+}} xmm5 = [127,127]
>  ; SSE4-NEXT:    movdqa %xmm5, %xmm0
> -; SSE4-NEXT:    pcmpgtq %xmm3, %xmm0
> +; SSE4-NEXT:    pcmpgtq %xmm1, %xmm0
>  ; SSE4-NEXT:    movdqa %xmm5, %xmm6
> -; SSE4-NEXT:    blendvpd %xmm0, %xmm3, %xmm6
> +; SSE4-NEXT:    blendvpd %xmm0, %xmm1, %xmm6
>  ; SSE4-NEXT:    movdqa %xmm5, %xmm0
> -; SSE4-NEXT:    pcmpgtq %xmm1, %xmm0
> -; SSE4-NEXT:    blendvpd %xmm0, %xmm1, %xmm5
> -; SSE4-NEXT:    movdqa {{.*#+}} xmm1 = [18446744073709551488,18446744073709551488]
> +; SSE4-NEXT:    pcmpgtq %xmm3, %xmm0
> +; SSE4-NEXT:    blendvpd %xmm0, %xmm3, %xmm5
> +; SSE4-NEXT:    movdqa {{.*#+}} xmm3 = [18446744073709551488,18446744073709551488]
>  ; SSE4-NEXT:    movapd %xmm5, %xmm0
> -; SSE4-NEXT:    pcmpgtq %xmm1, %xmm0
> -; SSE4-NEXT:    movdqa %xmm1, %xmm3
> -; SSE4-NEXT:    blendvpd %xmm0, %xmm5, %xmm3
> +; SSE4-NEXT:    pcmpgtq %xmm3, %xmm0
> +; SSE4-NEXT:    movdqa %xmm3, %xmm1
> +; SSE4-NEXT:    blendvpd %xmm0, %xmm5, %xmm1
>  ; SSE4-NEXT:    movapd %xmm6, %xmm0
> -; SSE4-NEXT:    pcmpgtq %xmm1, %xmm0
> -; SSE4-NEXT:    blendvpd %xmm0, %xmm6, %xmm1
> -; SSE4-NEXT:    packssdw %xmm3, %xmm1
> +; SSE4-NEXT:    pcmpgtq %xmm3, %xmm0
> +; SSE4-NEXT:    blendvpd %xmm0, %xmm6, %xmm3
> +; SSE4-NEXT:    movdqa {{.*#+}} xmm0 = <0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u>
> +; SSE4-NEXT:    pshufb %xmm0, %xmm3
> +; SSE4-NEXT:    pshufb %xmm0, %xmm1
> +; SSE4-NEXT:    punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm3[0],xmm1[1],xmm3[1],xmm1[2],xmm3[2],xmm1[3],xmm3[3]
>  ; SSE4-NEXT:    pcmpeqd %xmm2, %xmm4
>  ; SSE4-NEXT:    movmskps %xmm4, %eax
>  ; SSE4-NEXT:    xorl $15, %eax
> @@ -2198,15 +2245,15 @@ define void @truncstore_v4i64_v4i8(<4 x
>  ; SSE4-NEXT:    testb $2, %al
>  ; SSE4-NEXT:    je .LBB5_4
>  ; SSE4-NEXT:  .LBB5_3: # %cond.store1
> -; SSE4-NEXT:    pextrb $4, %xmm1, 1(%rdi)
> +; SSE4-NEXT:    pextrb $1, %xmm1, 1(%rdi)
>  ; SSE4-NEXT:    testb $4, %al
>  ; SSE4-NEXT:    je .LBB5_6
>  ; SSE4-NEXT:  .LBB5_5: # %cond.store3
> -; SSE4-NEXT:    pextrb $8, %xmm1, 2(%rdi)
> +; SSE4-NEXT:    pextrb $2, %xmm1, 2(%rdi)
>  ; SSE4-NEXT:    testb $8, %al
>  ; SSE4-NEXT:    je .LBB5_8
>  ; SSE4-NEXT:  .LBB5_7: # %cond.store5
> -; SSE4-NEXT:    pextrb $12, %xmm1, 3(%rdi)
> +; SSE4-NEXT:    pextrb $3, %xmm1, 3(%rdi)
>  ; SSE4-NEXT:    retq
>  ;
>  ; AVX1-LABEL: truncstore_v4i64_v4i8:
> @@ -2222,8 +2269,11 @@ define void @truncstore_v4i64_v4i8(<4 x
>  ; AVX1-NEXT:    vblendvpd %xmm5, %xmm3, %xmm4, %xmm3
>  ; AVX1-NEXT:    vpcmpgtq %xmm6, %xmm3, %xmm4
>  ; AVX1-NEXT:    vblendvpd %xmm4, %xmm3, %xmm6, %xmm3
> +; AVX1-NEXT:    vmovdqa {{.*#+}} xmm4 = <0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u>
> +; AVX1-NEXT:    vpshufb %xmm4, %xmm3, %xmm3
>  ; AVX1-NEXT:    vblendvpd %xmm7, %xmm0, %xmm6, %xmm0
> -; AVX1-NEXT:    vpackssdw %xmm3, %xmm0, %xmm0
> +; AVX1-NEXT:    vpshufb %xmm4, %xmm0, %xmm0
> +; AVX1-NEXT:    vpunpcklwd {{.*#+}} xmm0 = xmm0[0],xmm3[0],xmm0[1],xmm3[1],xmm0[2],xmm3[2],xmm0[3],xmm3[3]
>  ; AVX1-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
>  ; AVX1-NEXT:    vmovmskps %xmm1, %eax
>  ; AVX1-NEXT:    xorl $15, %eax
> @@ -2246,15 +2296,15 @@ define void @truncstore_v4i64_v4i8(<4 x
>  ; AVX1-NEXT:    testb $2, %al
>  ; AVX1-NEXT:    je .LBB5_4
>  ; AVX1-NEXT:  .LBB5_3: # %cond.store1
> -; AVX1-NEXT:    vpextrb $4, %xmm0, 1(%rdi)
> +; AVX1-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX1-NEXT:    testb $4, %al
>  ; AVX1-NEXT:    je .LBB5_6
>  ; AVX1-NEXT:  .LBB5_5: # %cond.store3
> -; AVX1-NEXT:    vpextrb $8, %xmm0, 2(%rdi)
> +; AVX1-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
>  ; AVX1-NEXT:    testb $8, %al
>  ; AVX1-NEXT:    je .LBB5_8
>  ; AVX1-NEXT:  .LBB5_7: # %cond.store5
> -; AVX1-NEXT:    vpextrb $12, %xmm0, 3(%rdi)
> +; AVX1-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
>  ; AVX1-NEXT:    vzeroupper
>  ; AVX1-NEXT:    retq
>  ;
> @@ -2268,7 +2318,10 @@ define void @truncstore_v4i64_v4i8(<4 x
>  ; AVX2-NEXT:    vpcmpgtq %ymm3, %ymm0, %ymm4
>  ; AVX2-NEXT:    vblendvpd %ymm4, %ymm0, %ymm3, %ymm0
>  ; AVX2-NEXT:    vextractf128 $1, %ymm0, %xmm3
> -; AVX2-NEXT:    vpackssdw %xmm3, %xmm0, %xmm0
> +; AVX2-NEXT:    vmovdqa {{.*#+}} xmm4 = <0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u>
> +; AVX2-NEXT:    vpshufb %xmm4, %xmm3, %xmm3
> +; AVX2-NEXT:    vpshufb %xmm4, %xmm0, %xmm0
> +; AVX2-NEXT:    vpunpcklwd {{.*#+}} xmm0 = xmm0[0],xmm3[0],xmm0[1],xmm3[1],xmm0[2],xmm3[2],xmm0[3],xmm3[3]
>  ; AVX2-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
>  ; AVX2-NEXT:    vmovmskps %xmm1, %eax
>  ; AVX2-NEXT:    xorl $15, %eax
> @@ -2291,15 +2344,15 @@ define void @truncstore_v4i64_v4i8(<4 x
>  ; AVX2-NEXT:    testb $2, %al
>  ; AVX2-NEXT:    je .LBB5_4
>  ; AVX2-NEXT:  .LBB5_3: # %cond.store1
> -; AVX2-NEXT:    vpextrb $4, %xmm0, 1(%rdi)
> +; AVX2-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX2-NEXT:    testb $4, %al
>  ; AVX2-NEXT:    je .LBB5_6
>  ; AVX2-NEXT:  .LBB5_5: # %cond.store3
> -; AVX2-NEXT:    vpextrb $8, %xmm0, 2(%rdi)
> +; AVX2-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
>  ; AVX2-NEXT:    testb $8, %al
>  ; AVX2-NEXT:    je .LBB5_8
>  ; AVX2-NEXT:  .LBB5_7: # %cond.store5
> -; AVX2-NEXT:    vpextrb $12, %xmm0, 3(%rdi)
> +; AVX2-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
>  ; AVX2-NEXT:    vzeroupper
>  ; AVX2-NEXT:    retq
>  ;
> @@ -2312,7 +2365,7 @@ define void @truncstore_v4i64_v4i8(<4 x
>  ; AVX512F-NEXT:    vpminsq %zmm1, %zmm0, %zmm0
>  ; AVX512F-NEXT:    vpbroadcastq {{.*#+}} ymm1 = [18446744073709551488,18446744073709551488,18446744073709551488,18446744073709551488]
>  ; AVX512F-NEXT:    vpmaxsq %zmm1, %zmm0, %zmm0
> -; AVX512F-NEXT:    vpmovqd %zmm0, %ymm0
> +; AVX512F-NEXT:    vpmovqb %zmm0, %xmm0
>  ; AVX512F-NEXT:    kmovw %k0, %eax
>  ; AVX512F-NEXT:    testb $1, %al
>  ; AVX512F-NEXT:    jne .LBB5_1
> @@ -2333,15 +2386,15 @@ define void @truncstore_v4i64_v4i8(<4 x
>  ; AVX512F-NEXT:    testb $2, %al
>  ; AVX512F-NEXT:    je .LBB5_4
>  ; AVX512F-NEXT:  .LBB5_3: # %cond.store1
> -; AVX512F-NEXT:    vpextrb $4, %xmm0, 1(%rdi)
> +; AVX512F-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX512F-NEXT:    testb $4, %al
>  ; AVX512F-NEXT:    je .LBB5_6
>  ; AVX512F-NEXT:  .LBB5_5: # %cond.store3
> -; AVX512F-NEXT:    vpextrb $8, %xmm0, 2(%rdi)
> +; AVX512F-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
>  ; AVX512F-NEXT:    testb $8, %al
>  ; AVX512F-NEXT:    je .LBB5_8
>  ; AVX512F-NEXT:  .LBB5_7: # %cond.store5
> -; AVX512F-NEXT:    vpextrb $12, %xmm0, 3(%rdi)
> +; AVX512F-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
>  ; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
>  ;
> @@ -2350,14 +2403,13 @@ define void @truncstore_v4i64_v4i8(<4 x
>  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
>  ; AVX512BW-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
>  ; AVX512BW-NEXT:    vptestmd %zmm1, %zmm1, %k0
> +; AVX512BW-NEXT:    kshiftlq $60, %k0, %k0
> +; AVX512BW-NEXT:    kshiftrq $60, %k0, %k1
>  ; AVX512BW-NEXT:    vpbroadcastq {{.*#+}} ymm1 = [127,127,127,127]
>  ; AVX512BW-NEXT:    vpminsq %zmm1, %zmm0, %zmm0
>  ; AVX512BW-NEXT:    vpbroadcastq {{.*#+}} ymm1 = [18446744073709551488,18446744073709551488,18446744073709551488,18446744073709551488]
>  ; AVX512BW-NEXT:    vpmaxsq %zmm1, %zmm0, %zmm0
> -; AVX512BW-NEXT:    vpmovqd %zmm0, %ymm0
> -; AVX512BW-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
> -; AVX512BW-NEXT:    kshiftlq $60, %k0, %k0
> -; AVX512BW-NEXT:    kshiftrq $60, %k0, %k1
> +; AVX512BW-NEXT:    vpmovqb %zmm0, %xmm0
>  ; AVX512BW-NEXT:    vmovdqu8 %zmm0, (%rdi) {%k1}
>  ; AVX512BW-NEXT:    vzeroupper
>  ; AVX512BW-NEXT:    retq
> @@ -2405,13 +2457,14 @@ define void @truncstore_v2i64_v2i32(<2 x
>  ; SSE2-NEXT:    pcmpgtd %xmm0, %xmm4
>  ; SSE2-NEXT:    pshufd {{.*#+}} xmm6 = xmm4[0,0,2,2]
>  ; SSE2-NEXT:    pcmpeqd %xmm0, %xmm3
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm3[1,1,3,3]
> -; SSE2-NEXT:    pand %xmm6, %xmm3
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm4[1,1,3,3]
> -; SSE2-NEXT:    por %xmm3, %xmm0
> -; SSE2-NEXT:    pand %xmm0, %xmm5
> -; SSE2-NEXT:    pandn {{.*}}(%rip), %xmm0
> -; SSE2-NEXT:    por %xmm5, %xmm0
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm3[1,1,3,3]
> +; SSE2-NEXT:    pand %xmm6, %xmm0
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm4[1,1,3,3]
> +; SSE2-NEXT:    por %xmm0, %xmm3
> +; SSE2-NEXT:    pand %xmm3, %xmm5
> +; SSE2-NEXT:    pandn {{.*}}(%rip), %xmm3
> +; SSE2-NEXT:    por %xmm5, %xmm3
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm3[0,2,2,3]
>  ; SSE2-NEXT:    pcmpeqd %xmm1, %xmm2
>  ; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm2[1,0,3,2]
>  ; SSE2-NEXT:    pand %xmm2, %xmm1
> @@ -2429,7 +2482,7 @@ define void @truncstore_v2i64_v2i32(<2 x
>  ; SSE2-NEXT:    testb $2, %al
>  ; SSE2-NEXT:    je .LBB6_4
>  ; SSE2-NEXT:  .LBB6_3: # %cond.store1
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[2,3,0,1]
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,1,2,3]
>  ; SSE2-NEXT:    movd %xmm0, 4(%rdi)
>  ; SSE2-NEXT:    retq
>  ;
> @@ -2445,6 +2498,7 @@ define void @truncstore_v2i64_v2i32(<2 x
>  ; SSE4-NEXT:    movapd %xmm4, %xmm0
>  ; SSE4-NEXT:    pcmpgtq %xmm2, %xmm0
>  ; SSE4-NEXT:    blendvpd %xmm0, %xmm4, %xmm2
> +; SSE4-NEXT:    pshufd {{.*#+}} xmm0 = xmm2[0,2,2,3]
>  ; SSE4-NEXT:    pcmpeqq %xmm1, %xmm3
>  ; SSE4-NEXT:    movmskpd %xmm3, %eax
>  ; SSE4-NEXT:    xorl $3, %eax
> @@ -2456,11 +2510,11 @@ define void @truncstore_v2i64_v2i32(<2 x
>  ; SSE4-NEXT:  .LBB6_4: # %else2
>  ; SSE4-NEXT:    retq
>  ; SSE4-NEXT:  .LBB6_1: # %cond.store
> -; SSE4-NEXT:    movss %xmm2, (%rdi)
> +; SSE4-NEXT:    movd %xmm0, (%rdi)
>  ; SSE4-NEXT:    testb $2, %al
>  ; SSE4-NEXT:    je .LBB6_4
>  ; SSE4-NEXT:  .LBB6_3: # %cond.store1
> -; SSE4-NEXT:    extractps $2, %xmm2, 4(%rdi)
> +; SSE4-NEXT:    pextrd $1, %xmm0, 4(%rdi)
>  ; SSE4-NEXT:    retq
>  ;
>  ; AVX1-LABEL: truncstore_v2i64_v2i32:
> @@ -2469,6 +2523,7 @@ define void @truncstore_v2i64_v2i32(<2 x
>  ; AVX1-NEXT:    vpcmpeqq %xmm2, %xmm1, %xmm1
>  ; AVX1-NEXT:    vpcmpeqd %xmm2, %xmm2, %xmm2
>  ; AVX1-NEXT:    vpxor %xmm2, %xmm1, %xmm1
> +; AVX1-NEXT:    vinsertps {{.*#+}} xmm1 = xmm1[0,2],zero,zero
>  ; AVX1-NEXT:    vmovdqa {{.*#+}} xmm2 = [2147483647,2147483647]
>  ; AVX1-NEXT:    vpcmpgtq %xmm0, %xmm2, %xmm3
>  ; AVX1-NEXT:    vblendvpd %xmm3, %xmm0, %xmm2, %xmm0
> @@ -2476,7 +2531,6 @@ define void @truncstore_v2i64_v2i32(<2 x
>  ; AVX1-NEXT:    vpcmpgtq %xmm2, %xmm0, %xmm3
>  ; AVX1-NEXT:    vblendvpd %xmm3, %xmm0, %xmm2, %xmm0
>  ; AVX1-NEXT:    vpermilps {{.*#+}} xmm0 = xmm0[0,2,2,3]
> -; AVX1-NEXT:    vinsertps {{.*#+}} xmm1 = xmm1[0,2],zero,zero
>  ; AVX1-NEXT:    vmaskmovps %xmm0, %xmm1, (%rdi)
>  ; AVX1-NEXT:    retq
>  ;
> @@ -2486,6 +2540,7 @@ define void @truncstore_v2i64_v2i32(<2 x
>  ; AVX2-NEXT:    vpcmpeqq %xmm2, %xmm1, %xmm1
>  ; AVX2-NEXT:    vpcmpeqd %xmm2, %xmm2, %xmm2
>  ; AVX2-NEXT:    vpxor %xmm2, %xmm1, %xmm1
> +; AVX2-NEXT:    vinsertps {{.*#+}} xmm1 = xmm1[0,2],zero,zero
>  ; AVX2-NEXT:    vmovdqa {{.*#+}} xmm2 = [2147483647,2147483647]
>  ; AVX2-NEXT:    vpcmpgtq %xmm0, %xmm2, %xmm3
>  ; AVX2-NEXT:    vblendvpd %xmm3, %xmm0, %xmm2, %xmm0
> @@ -2493,7 +2548,6 @@ define void @truncstore_v2i64_v2i32(<2 x
>  ; AVX2-NEXT:    vpcmpgtq %xmm2, %xmm0, %xmm3
>  ; AVX2-NEXT:    vblendvpd %xmm3, %xmm0, %xmm2, %xmm0
>  ; AVX2-NEXT:    vpermilps {{.*#+}} xmm0 = xmm0[0,2,2,3]
> -; AVX2-NEXT:    vinsertps {{.*#+}} xmm1 = xmm1[0,2],zero,zero
>  ; AVX2-NEXT:    vpmaskmovd %xmm0, %xmm1, (%rdi)
>  ; AVX2-NEXT:    retq
>  ;
> @@ -2502,13 +2556,13 @@ define void @truncstore_v2i64_v2i32(<2 x
>  ; AVX512F-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
>  ; AVX512F-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
>  ; AVX512F-NEXT:    vptestmq %zmm1, %zmm1, %k0
> +; AVX512F-NEXT:    kshiftlw $14, %k0, %k0
> +; AVX512F-NEXT:    kshiftrw $14, %k0, %k1
>  ; AVX512F-NEXT:    vmovdqa {{.*#+}} xmm1 = [2147483647,2147483647]
>  ; AVX512F-NEXT:    vpminsq %zmm1, %zmm0, %zmm0
>  ; AVX512F-NEXT:    vmovdqa {{.*#+}} xmm1 = [18446744071562067968,18446744071562067968]
>  ; AVX512F-NEXT:    vpmaxsq %zmm1, %zmm0, %zmm0
>  ; AVX512F-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> -; AVX512F-NEXT:    kshiftlw $14, %k0, %k0
> -; AVX512F-NEXT:    kshiftrw $14, %k0, %k1
>  ; AVX512F-NEXT:    vmovdqu32 %zmm0, (%rdi) {%k1}
>  ; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
> @@ -2526,13 +2580,13 @@ define void @truncstore_v2i64_v2i32(<2 x
>  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
>  ; AVX512BW-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
>  ; AVX512BW-NEXT:    vptestmq %zmm1, %zmm1, %k0
> +; AVX512BW-NEXT:    kshiftlw $14, %k0, %k0
> +; AVX512BW-NEXT:    kshiftrw $14, %k0, %k1
>  ; AVX512BW-NEXT:    vmovdqa {{.*#+}} xmm1 = [2147483647,2147483647]
>  ; AVX512BW-NEXT:    vpminsq %zmm1, %zmm0, %zmm0
>  ; AVX512BW-NEXT:    vmovdqa {{.*#+}} xmm1 = [18446744071562067968,18446744071562067968]
>  ; AVX512BW-NEXT:    vpmaxsq %zmm1, %zmm0, %zmm0
>  ; AVX512BW-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> -; AVX512BW-NEXT:    kshiftlw $14, %k0, %k0
> -; AVX512BW-NEXT:    kshiftrw $14, %k0, %k1
>  ; AVX512BW-NEXT:    vmovdqu32 %zmm0, (%rdi) {%k1}
>  ; AVX512BW-NEXT:    vzeroupper
>  ; AVX512BW-NEXT:    retq
> @@ -2571,13 +2625,15 @@ define void @truncstore_v2i64_v2i16(<2 x
>  ; SSE2-NEXT:    pcmpgtd %xmm0, %xmm4
>  ; SSE2-NEXT:    pshufd {{.*#+}} xmm6 = xmm4[0,0,2,2]
>  ; SSE2-NEXT:    pcmpeqd %xmm0, %xmm3
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm3[1,1,3,3]
> -; SSE2-NEXT:    pand %xmm6, %xmm3
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm4[1,1,3,3]
> -; SSE2-NEXT:    por %xmm3, %xmm0
> -; SSE2-NEXT:    pand %xmm0, %xmm5
> -; SSE2-NEXT:    pandn {{.*}}(%rip), %xmm0
> -; SSE2-NEXT:    por %xmm5, %xmm0
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm3[1,1,3,3]
> +; SSE2-NEXT:    pand %xmm6, %xmm0
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm4[1,1,3,3]
> +; SSE2-NEXT:    por %xmm0, %xmm3
> +; SSE2-NEXT:    pand %xmm3, %xmm5
> +; SSE2-NEXT:    pandn {{.*}}(%rip), %xmm3
> +; SSE2-NEXT:    por %xmm5, %xmm3
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm3[0,2,2,3]
> +; SSE2-NEXT:    pshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
>  ; SSE2-NEXT:    pcmpeqd %xmm1, %xmm2
>  ; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm2[1,0,3,2]
>  ; SSE2-NEXT:    pand %xmm2, %xmm1
> @@ -2596,7 +2652,7 @@ define void @truncstore_v2i64_v2i16(<2 x
>  ; SSE2-NEXT:    testb $2, %al
>  ; SSE2-NEXT:    je .LBB7_4
>  ; SSE2-NEXT:  .LBB7_3: # %cond.store1
> -; SSE2-NEXT:    pextrw $4, %xmm0, %eax
> +; SSE2-NEXT:    pextrw $1, %xmm0, %eax
>  ; SSE2-NEXT:    movw %ax, 2(%rdi)
>  ; SSE2-NEXT:    retq
>  ;
> @@ -2612,6 +2668,8 @@ define void @truncstore_v2i64_v2i16(<2 x
>  ; SSE4-NEXT:    movapd %xmm4, %xmm0
>  ; SSE4-NEXT:    pcmpgtq %xmm2, %xmm0
>  ; SSE4-NEXT:    blendvpd %xmm0, %xmm4, %xmm2
> +; SSE4-NEXT:    pshufd {{.*#+}} xmm0 = xmm2[0,2,2,3]
> +; SSE4-NEXT:    pshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
>  ; SSE4-NEXT:    pcmpeqq %xmm1, %xmm3
>  ; SSE4-NEXT:    movmskpd %xmm3, %eax
>  ; SSE4-NEXT:    xorl $3, %eax
> @@ -2623,11 +2681,11 @@ define void @truncstore_v2i64_v2i16(<2 x
>  ; SSE4-NEXT:  .LBB7_4: # %else2
>  ; SSE4-NEXT:    retq
>  ; SSE4-NEXT:  .LBB7_1: # %cond.store
> -; SSE4-NEXT:    pextrw $0, %xmm2, (%rdi)
> +; SSE4-NEXT:    pextrw $0, %xmm0, (%rdi)
>  ; SSE4-NEXT:    testb $2, %al
>  ; SSE4-NEXT:    je .LBB7_4
>  ; SSE4-NEXT:  .LBB7_3: # %cond.store1
> -; SSE4-NEXT:    pextrw $4, %xmm2, 2(%rdi)
> +; SSE4-NEXT:    pextrw $1, %xmm0, 2(%rdi)
>  ; SSE4-NEXT:    retq
>  ;
>  ; AVX-LABEL: truncstore_v2i64_v2i16:
> @@ -2639,6 +2697,8 @@ define void @truncstore_v2i64_v2i16(<2 x
>  ; AVX-NEXT:    vmovdqa {{.*#+}} xmm3 = [18446744073709518848,18446744073709518848]
>  ; AVX-NEXT:    vpcmpgtq %xmm3, %xmm0, %xmm4
>  ; AVX-NEXT:    vblendvpd %xmm4, %xmm0, %xmm3, %xmm0
> +; AVX-NEXT:    vpermilps {{.*#+}} xmm0 = xmm0[0,2,2,3]
> +; AVX-NEXT:    vpshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
>  ; AVX-NEXT:    vpcmpeqq %xmm2, %xmm1, %xmm1
>  ; AVX-NEXT:    vmovmskpd %xmm1, %eax
>  ; AVX-NEXT:    xorl $3, %eax
> @@ -2654,7 +2714,7 @@ define void @truncstore_v2i64_v2i16(<2 x
>  ; AVX-NEXT:    testb $2, %al
>  ; AVX-NEXT:    je .LBB7_4
>  ; AVX-NEXT:  .LBB7_3: # %cond.store1
> -; AVX-NEXT:    vpextrw $4, %xmm0, 2(%rdi)
> +; AVX-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
>  ; AVX-NEXT:    retq
>  ;
>  ; AVX512F-LABEL: truncstore_v2i64_v2i16:
> @@ -2666,6 +2726,8 @@ define void @truncstore_v2i64_v2i16(<2 x
>  ; AVX512F-NEXT:    vpminsq %zmm1, %zmm0, %zmm0
>  ; AVX512F-NEXT:    vmovdqa {{.*#+}} xmm1 = [18446744073709518848,18446744073709518848]
>  ; AVX512F-NEXT:    vpmaxsq %zmm1, %zmm0, %zmm0
> +; AVX512F-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> +; AVX512F-NEXT:    vpshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
>  ; AVX512F-NEXT:    kmovw %k0, %eax
>  ; AVX512F-NEXT:    testb $1, %al
>  ; AVX512F-NEXT:    jne .LBB7_1
> @@ -2680,7 +2742,7 @@ define void @truncstore_v2i64_v2i16(<2 x
>  ; AVX512F-NEXT:    testb $2, %al
>  ; AVX512F-NEXT:    je .LBB7_4
>  ; AVX512F-NEXT:  .LBB7_3: # %cond.store1
> -; AVX512F-NEXT:    vpextrw $4, %xmm0, 2(%rdi)
> +; AVX512F-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
>  ; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
>  ;
> @@ -2689,14 +2751,14 @@ define void @truncstore_v2i64_v2i16(<2 x
>  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
>  ; AVX512BW-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
>  ; AVX512BW-NEXT:    vptestmq %zmm1, %zmm1, %k0
> +; AVX512BW-NEXT:    kshiftld $30, %k0, %k0
> +; AVX512BW-NEXT:    kshiftrd $30, %k0, %k1
>  ; AVX512BW-NEXT:    vmovdqa {{.*#+}} xmm1 = [32767,32767]
>  ; AVX512BW-NEXT:    vpminsq %zmm1, %zmm0, %zmm0
>  ; AVX512BW-NEXT:    vmovdqa {{.*#+}} xmm1 = [18446744073709518848,18446744073709518848]
>  ; AVX512BW-NEXT:    vpmaxsq %zmm1, %zmm0, %zmm0
>  ; AVX512BW-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
>  ; AVX512BW-NEXT:    vpshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> -; AVX512BW-NEXT:    kshiftld $30, %k0, %k0
> -; AVX512BW-NEXT:    kshiftrd $30, %k0, %k1
>  ; AVX512BW-NEXT:    vmovdqu16 %zmm0, (%rdi) {%k1}
>  ; AVX512BW-NEXT:    vzeroupper
>  ; AVX512BW-NEXT:    retq
> @@ -2743,19 +2805,24 @@ define void @truncstore_v2i64_v2i8(<2 x
>  ; SSE2-NEXT:    pcmpgtd %xmm0, %xmm4
>  ; SSE2-NEXT:    pshufd {{.*#+}} xmm6 = xmm4[0,0,2,2]
>  ; SSE2-NEXT:    pcmpeqd %xmm0, %xmm3
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm3[1,1,3,3]
> -; SSE2-NEXT:    pand %xmm6, %xmm3
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm4[1,1,3,3]
> -; SSE2-NEXT:    por %xmm3, %xmm0
> -; SSE2-NEXT:    pand %xmm0, %xmm5
> -; SSE2-NEXT:    pandn {{.*}}(%rip), %xmm0
> -; SSE2-NEXT:    por %xmm5, %xmm0
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm3[1,1,3,3]
> +; SSE2-NEXT:    pand %xmm6, %xmm0
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm4[1,1,3,3]
> +; SSE2-NEXT:    por %xmm0, %xmm3
> +; SSE2-NEXT:    pand %xmm3, %xmm5
> +; SSE2-NEXT:    pandn {{.*}}(%rip), %xmm3
> +; SSE2-NEXT:    por %xmm5, %xmm3
> +; SSE2-NEXT:    pand {{.*}}(%rip), %xmm3
> +; SSE2-NEXT:    packuswb %xmm3, %xmm3
> +; SSE2-NEXT:    packuswb %xmm3, %xmm3
> +; SSE2-NEXT:    packuswb %xmm3, %xmm3
>  ; SSE2-NEXT:    pcmpeqd %xmm1, %xmm2
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm2[1,0,3,2]
> -; SSE2-NEXT:    pand %xmm2, %xmm1
> -; SSE2-NEXT:    movmskpd %xmm1, %eax
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm2[1,0,3,2]
> +; SSE2-NEXT:    pand %xmm2, %xmm0
> +; SSE2-NEXT:    movmskpd %xmm0, %eax
>  ; SSE2-NEXT:    xorl $3, %eax
>  ; SSE2-NEXT:    testb $1, %al
> +; SSE2-NEXT:    movd %xmm3, %ecx
>  ; SSE2-NEXT:    jne .LBB8_1
>  ; SSE2-NEXT:  # %bb.2: # %else
>  ; SSE2-NEXT:    testb $2, %al
> @@ -2763,13 +2830,11 @@ define void @truncstore_v2i64_v2i8(<2 x
>  ; SSE2-NEXT:  .LBB8_4: # %else2
>  ; SSE2-NEXT:    retq
>  ; SSE2-NEXT:  .LBB8_1: # %cond.store
> -; SSE2-NEXT:    movd %xmm0, %ecx
>  ; SSE2-NEXT:    movb %cl, (%rdi)
>  ; SSE2-NEXT:    testb $2, %al
>  ; SSE2-NEXT:    je .LBB8_4
>  ; SSE2-NEXT:  .LBB8_3: # %cond.store1
> -; SSE2-NEXT:    pextrw $4, %xmm0, %eax
> -; SSE2-NEXT:    movb %al, 1(%rdi)
> +; SSE2-NEXT:    movb %ch, 1(%rdi)
>  ; SSE2-NEXT:    retq
>  ;
>  ; SSE4-LABEL: truncstore_v2i64_v2i8:
> @@ -2784,6 +2849,7 @@ define void @truncstore_v2i64_v2i8(<2 x
>  ; SSE4-NEXT:    movapd %xmm4, %xmm0
>  ; SSE4-NEXT:    pcmpgtq %xmm2, %xmm0
>  ; SSE4-NEXT:    blendvpd %xmm0, %xmm4, %xmm2
> +; SSE4-NEXT:    pshufb {{.*#+}} xmm2 = xmm2[0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u]
>  ; SSE4-NEXT:    pcmpeqq %xmm1, %xmm3
>  ; SSE4-NEXT:    movmskpd %xmm3, %eax
>  ; SSE4-NEXT:    xorl $3, %eax
> @@ -2799,7 +2865,7 @@ define void @truncstore_v2i64_v2i8(<2 x
>  ; SSE4-NEXT:    testb $2, %al
>  ; SSE4-NEXT:    je .LBB8_4
>  ; SSE4-NEXT:  .LBB8_3: # %cond.store1
> -; SSE4-NEXT:    pextrb $8, %xmm2, 1(%rdi)
> +; SSE4-NEXT:    pextrb $1, %xmm2, 1(%rdi)
>  ; SSE4-NEXT:    retq
>  ;
>  ; AVX-LABEL: truncstore_v2i64_v2i8:
> @@ -2811,6 +2877,7 @@ define void @truncstore_v2i64_v2i8(<2 x
>  ; AVX-NEXT:    vmovdqa {{.*#+}} xmm3 = [18446744073709551488,18446744073709551488]
>  ; AVX-NEXT:    vpcmpgtq %xmm3, %xmm0, %xmm4
>  ; AVX-NEXT:    vblendvpd %xmm4, %xmm0, %xmm3, %xmm0
> +; AVX-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u]
>  ; AVX-NEXT:    vpcmpeqq %xmm2, %xmm1, %xmm1
>  ; AVX-NEXT:    vmovmskpd %xmm1, %eax
>  ; AVX-NEXT:    xorl $3, %eax
> @@ -2826,7 +2893,7 @@ define void @truncstore_v2i64_v2i8(<2 x
>  ; AVX-NEXT:    testb $2, %al
>  ; AVX-NEXT:    je .LBB8_4
>  ; AVX-NEXT:  .LBB8_3: # %cond.store1
> -; AVX-NEXT:    vpextrb $8, %xmm0, 1(%rdi)
> +; AVX-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX-NEXT:    retq
>  ;
>  ; AVX512F-LABEL: truncstore_v2i64_v2i8:
> @@ -2838,6 +2905,7 @@ define void @truncstore_v2i64_v2i8(<2 x
>  ; AVX512F-NEXT:    vpminsq %zmm1, %zmm0, %zmm0
>  ; AVX512F-NEXT:    vmovdqa {{.*#+}} xmm1 = [18446744073709551488,18446744073709551488]
>  ; AVX512F-NEXT:    vpmaxsq %zmm1, %zmm0, %zmm0
> +; AVX512F-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u]
>  ; AVX512F-NEXT:    kmovw %k0, %eax
>  ; AVX512F-NEXT:    testb $1, %al
>  ; AVX512F-NEXT:    jne .LBB8_1
> @@ -2852,7 +2920,7 @@ define void @truncstore_v2i64_v2i8(<2 x
>  ; AVX512F-NEXT:    testb $2, %al
>  ; AVX512F-NEXT:    je .LBB8_4
>  ; AVX512F-NEXT:  .LBB8_3: # %cond.store1
> -; AVX512F-NEXT:    vpextrb $8, %xmm0, 1(%rdi)
> +; AVX512F-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
>  ;
> @@ -2861,13 +2929,13 @@ define void @truncstore_v2i64_v2i8(<2 x
>  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
>  ; AVX512BW-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
>  ; AVX512BW-NEXT:    vptestmq %zmm1, %zmm1, %k0
> +; AVX512BW-NEXT:    kshiftlq $62, %k0, %k0
> +; AVX512BW-NEXT:    kshiftrq $62, %k0, %k1
>  ; AVX512BW-NEXT:    vmovdqa {{.*#+}} xmm1 = [127,127]
>  ; AVX512BW-NEXT:    vpminsq %zmm1, %zmm0, %zmm0
>  ; AVX512BW-NEXT:    vmovdqa {{.*#+}} xmm1 = [18446744073709551488,18446744073709551488]
>  ; AVX512BW-NEXT:    vpmaxsq %zmm1, %zmm0, %zmm0
>  ; AVX512BW-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u]
> -; AVX512BW-NEXT:    kshiftlq $62, %k0, %k0
> -; AVX512BW-NEXT:    kshiftrq $62, %k0, %k1
>  ; AVX512BW-NEXT:    vmovdqu8 %zmm0, (%rdi) {%k1}
>  ; AVX512BW-NEXT:    vzeroupper
>  ; AVX512BW-NEXT:    retq
> @@ -4642,29 +4710,8 @@ define void @truncstore_v8i32_v8i8(<8 x
>  ; SSE2-LABEL: truncstore_v8i32_v8i8:
>  ; SSE2:       # %bb.0:
>  ; SSE2-NEXT:    pxor %xmm4, %xmm4
> -; SSE2-NEXT:    movdqa {{.*#+}} xmm5 = [127,127,127,127]
> -; SSE2-NEXT:    movdqa %xmm5, %xmm6
> -; SSE2-NEXT:    pcmpgtd %xmm0, %xmm6
> -; SSE2-NEXT:    pand %xmm6, %xmm0
> -; SSE2-NEXT:    pandn %xmm5, %xmm6
> -; SSE2-NEXT:    por %xmm0, %xmm6
> -; SSE2-NEXT:    movdqa %xmm5, %xmm0
> -; SSE2-NEXT:    pcmpgtd %xmm1, %xmm0
> -; SSE2-NEXT:    pand %xmm0, %xmm1
> -; SSE2-NEXT:    pandn %xmm5, %xmm0
> -; SSE2-NEXT:    por %xmm1, %xmm0
> -; SSE2-NEXT:    movdqa {{.*#+}} xmm1 = [4294967168,4294967168,4294967168,4294967168]
> -; SSE2-NEXT:    movdqa %xmm0, %xmm5
> -; SSE2-NEXT:    pcmpgtd %xmm1, %xmm5
> -; SSE2-NEXT:    pand %xmm5, %xmm0
> -; SSE2-NEXT:    pandn %xmm1, %xmm5
> -; SSE2-NEXT:    por %xmm0, %xmm5
> -; SSE2-NEXT:    movdqa %xmm6, %xmm0
> -; SSE2-NEXT:    pcmpgtd %xmm1, %xmm0
> -; SSE2-NEXT:    pand %xmm0, %xmm6
> -; SSE2-NEXT:    pandn %xmm1, %xmm0
> -; SSE2-NEXT:    por %xmm6, %xmm0
> -; SSE2-NEXT:    packssdw %xmm5, %xmm0
> +; SSE2-NEXT:    packssdw %xmm1, %xmm0
> +; SSE2-NEXT:    packsswb %xmm0, %xmm0
>  ; SSE2-NEXT:    pcmpeqd %xmm4, %xmm3
>  ; SSE2-NEXT:    pcmpeqd %xmm1, %xmm1
>  ; SSE2-NEXT:    pxor %xmm1, %xmm3
> @@ -4684,17 +4731,26 @@ define void @truncstore_v8i32_v8i8(<8 x
>  ; SSE2-NEXT:    jne .LBB12_5
>  ; SSE2-NEXT:  .LBB12_6: # %else4
>  ; SSE2-NEXT:    testb $8, %al
> -; SSE2-NEXT:    jne .LBB12_7
> +; SSE2-NEXT:    je .LBB12_8
> +; SSE2-NEXT:  .LBB12_7: # %cond.store5
> +; SSE2-NEXT:    shrl $24, %ecx
> +; SSE2-NEXT:    movb %cl, 3(%rdi)
>  ; SSE2-NEXT:  .LBB12_8: # %else6
>  ; SSE2-NEXT:    testb $16, %al
> -; SSE2-NEXT:    jne .LBB12_9
> +; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> +; SSE2-NEXT:    je .LBB12_10
> +; SSE2-NEXT:  # %bb.9: # %cond.store7
> +; SSE2-NEXT:    movb %cl, 4(%rdi)
>  ; SSE2-NEXT:  .LBB12_10: # %else8
>  ; SSE2-NEXT:    testb $32, %al
> -; SSE2-NEXT:    jne .LBB12_11
> +; SSE2-NEXT:    je .LBB12_12
> +; SSE2-NEXT:  # %bb.11: # %cond.store9
> +; SSE2-NEXT:    movb %ch, 5(%rdi)
>  ; SSE2-NEXT:  .LBB12_12: # %else10
>  ; SSE2-NEXT:    testb $64, %al
> +; SSE2-NEXT:    pextrw $3, %xmm0, %ecx
>  ; SSE2-NEXT:    jne .LBB12_13
> -; SSE2-NEXT:  .LBB12_14: # %else12
> +; SSE2-NEXT:  # %bb.14: # %else12
>  ; SSE2-NEXT:    testb $-128, %al
>  ; SSE2-NEXT:    jne .LBB12_15
>  ; SSE2-NEXT:  .LBB12_16: # %else14
> @@ -4704,50 +4760,29 @@ define void @truncstore_v8i32_v8i8(<8 x
>  ; SSE2-NEXT:    testb $2, %al
>  ; SSE2-NEXT:    je .LBB12_4
>  ; SSE2-NEXT:  .LBB12_3: # %cond.store1
> -; SSE2-NEXT:    shrl $16, %ecx
> -; SSE2-NEXT:    movb %cl, 1(%rdi)
> +; SSE2-NEXT:    movb %ch, 1(%rdi)
>  ; SSE2-NEXT:    testb $4, %al
>  ; SSE2-NEXT:    je .LBB12_6
>  ; SSE2-NEXT:  .LBB12_5: # %cond.store3
> -; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> -; SSE2-NEXT:    movb %cl, 2(%rdi)
> +; SSE2-NEXT:    movl %ecx, %edx
> +; SSE2-NEXT:    shrl $16, %edx
> +; SSE2-NEXT:    movb %dl, 2(%rdi)
>  ; SSE2-NEXT:    testb $8, %al
> -; SSE2-NEXT:    je .LBB12_8
> -; SSE2-NEXT:  .LBB12_7: # %cond.store5
> -; SSE2-NEXT:    pextrw $3, %xmm0, %ecx
> -; SSE2-NEXT:    movb %cl, 3(%rdi)
> -; SSE2-NEXT:    testb $16, %al
> -; SSE2-NEXT:    je .LBB12_10
> -; SSE2-NEXT:  .LBB12_9: # %cond.store7
> -; SSE2-NEXT:    pextrw $4, %xmm0, %ecx
> -; SSE2-NEXT:    movb %cl, 4(%rdi)
> -; SSE2-NEXT:    testb $32, %al
> -; SSE2-NEXT:    je .LBB12_12
> -; SSE2-NEXT:  .LBB12_11: # %cond.store9
> -; SSE2-NEXT:    pextrw $5, %xmm0, %ecx
> -; SSE2-NEXT:    movb %cl, 5(%rdi)
> -; SSE2-NEXT:    testb $64, %al
> -; SSE2-NEXT:    je .LBB12_14
> +; SSE2-NEXT:    jne .LBB12_7
> +; SSE2-NEXT:    jmp .LBB12_8
>  ; SSE2-NEXT:  .LBB12_13: # %cond.store11
> -; SSE2-NEXT:    pextrw $6, %xmm0, %ecx
>  ; SSE2-NEXT:    movb %cl, 6(%rdi)
>  ; SSE2-NEXT:    testb $-128, %al
>  ; SSE2-NEXT:    je .LBB12_16
>  ; SSE2-NEXT:  .LBB12_15: # %cond.store13
> -; SSE2-NEXT:    pextrw $7, %xmm0, %eax
> -; SSE2-NEXT:    movb %al, 7(%rdi)
> +; SSE2-NEXT:    movb %ch, 7(%rdi)
>  ; SSE2-NEXT:    retq
>  ;
>  ; SSE4-LABEL: truncstore_v8i32_v8i8:
>  ; SSE4:       # %bb.0:
>  ; SSE4-NEXT:    pxor %xmm4, %xmm4
> -; SSE4-NEXT:    movdqa {{.*#+}} xmm5 = [127,127,127,127]
> -; SSE4-NEXT:    pminsd %xmm5, %xmm0
> -; SSE4-NEXT:    pminsd %xmm5, %xmm1
> -; SSE4-NEXT:    movdqa {{.*#+}} xmm5 = [4294967168,4294967168,4294967168,4294967168]
> -; SSE4-NEXT:    pmaxsd %xmm5, %xmm1
> -; SSE4-NEXT:    pmaxsd %xmm5, %xmm0
>  ; SSE4-NEXT:    packssdw %xmm1, %xmm0
> +; SSE4-NEXT:    packsswb %xmm0, %xmm0
>  ; SSE4-NEXT:    pcmpeqd %xmm4, %xmm3
>  ; SSE4-NEXT:    pcmpeqd %xmm1, %xmm1
>  ; SSE4-NEXT:    pxor %xmm1, %xmm3
> @@ -4786,43 +4821,38 @@ define void @truncstore_v8i32_v8i8(<8 x
>  ; SSE4-NEXT:    testb $2, %al
>  ; SSE4-NEXT:    je .LBB12_4
>  ; SSE4-NEXT:  .LBB12_3: # %cond.store1
> -; SSE4-NEXT:    pextrb $2, %xmm0, 1(%rdi)
> +; SSE4-NEXT:    pextrb $1, %xmm0, 1(%rdi)
>  ; SSE4-NEXT:    testb $4, %al
>  ; SSE4-NEXT:    je .LBB12_6
>  ; SSE4-NEXT:  .LBB12_5: # %cond.store3
> -; SSE4-NEXT:    pextrb $4, %xmm0, 2(%rdi)
> +; SSE4-NEXT:    pextrb $2, %xmm0, 2(%rdi)
>  ; SSE4-NEXT:    testb $8, %al
>  ; SSE4-NEXT:    je .LBB12_8
>  ; SSE4-NEXT:  .LBB12_7: # %cond.store5
> -; SSE4-NEXT:    pextrb $6, %xmm0, 3(%rdi)
> +; SSE4-NEXT:    pextrb $3, %xmm0, 3(%rdi)
>  ; SSE4-NEXT:    testb $16, %al
>  ; SSE4-NEXT:    je .LBB12_10
>  ; SSE4-NEXT:  .LBB12_9: # %cond.store7
> -; SSE4-NEXT:    pextrb $8, %xmm0, 4(%rdi)
> +; SSE4-NEXT:    pextrb $4, %xmm0, 4(%rdi)
>  ; SSE4-NEXT:    testb $32, %al
>  ; SSE4-NEXT:    je .LBB12_12
>  ; SSE4-NEXT:  .LBB12_11: # %cond.store9
> -; SSE4-NEXT:    pextrb $10, %xmm0, 5(%rdi)
> +; SSE4-NEXT:    pextrb $5, %xmm0, 5(%rdi)
>  ; SSE4-NEXT:    testb $64, %al
>  ; SSE4-NEXT:    je .LBB12_14
>  ; SSE4-NEXT:  .LBB12_13: # %cond.store11
> -; SSE4-NEXT:    pextrb $12, %xmm0, 6(%rdi)
> +; SSE4-NEXT:    pextrb $6, %xmm0, 6(%rdi)
>  ; SSE4-NEXT:    testb $-128, %al
>  ; SSE4-NEXT:    je .LBB12_16
>  ; SSE4-NEXT:  .LBB12_15: # %cond.store13
> -; SSE4-NEXT:    pextrb $14, %xmm0, 7(%rdi)
> +; SSE4-NEXT:    pextrb $7, %xmm0, 7(%rdi)
>  ; SSE4-NEXT:    retq
>  ;
>  ; AVX1-LABEL: truncstore_v8i32_v8i8:
>  ; AVX1:       # %bb.0:
> -; AVX1-NEXT:    vmovdqa {{.*#+}} xmm2 = [127,127,127,127]
> -; AVX1-NEXT:    vpminsd %xmm2, %xmm0, %xmm3
> -; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm0
> -; AVX1-NEXT:    vpminsd %xmm2, %xmm0, %xmm0
> -; AVX1-NEXT:    vmovdqa {{.*#+}} xmm2 = [4294967168,4294967168,4294967168,4294967168]
> -; AVX1-NEXT:    vpmaxsd %xmm2, %xmm0, %xmm0
> -; AVX1-NEXT:    vpmaxsd %xmm2, %xmm3, %xmm2
> -; AVX1-NEXT:    vpackssdw %xmm0, %xmm2, %xmm0
> +; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm2
> +; AVX1-NEXT:    vpackssdw %xmm2, %xmm0, %xmm0
> +; AVX1-NEXT:    vpacksswb %xmm0, %xmm0, %xmm0
>  ; AVX1-NEXT:    vextractf128 $1, %ymm1, %xmm2
>  ; AVX1-NEXT:    vpxor %xmm3, %xmm3, %xmm3
>  ; AVX1-NEXT:    vpcmpeqd %xmm3, %xmm2, %xmm2
> @@ -4861,43 +4891,40 @@ define void @truncstore_v8i32_v8i8(<8 x
>  ; AVX1-NEXT:    testb $2, %al
>  ; AVX1-NEXT:    je .LBB12_4
>  ; AVX1-NEXT:  .LBB12_3: # %cond.store1
> -; AVX1-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> +; AVX1-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX1-NEXT:    testb $4, %al
>  ; AVX1-NEXT:    je .LBB12_6
>  ; AVX1-NEXT:  .LBB12_5: # %cond.store3
> -; AVX1-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> +; AVX1-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
>  ; AVX1-NEXT:    testb $8, %al
>  ; AVX1-NEXT:    je .LBB12_8
>  ; AVX1-NEXT:  .LBB12_7: # %cond.store5
> -; AVX1-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> +; AVX1-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
>  ; AVX1-NEXT:    testb $16, %al
>  ; AVX1-NEXT:    je .LBB12_10
>  ; AVX1-NEXT:  .LBB12_9: # %cond.store7
> -; AVX1-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> +; AVX1-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
>  ; AVX1-NEXT:    testb $32, %al
>  ; AVX1-NEXT:    je .LBB12_12
>  ; AVX1-NEXT:  .LBB12_11: # %cond.store9
> -; AVX1-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> +; AVX1-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
>  ; AVX1-NEXT:    testb $64, %al
>  ; AVX1-NEXT:    je .LBB12_14
>  ; AVX1-NEXT:  .LBB12_13: # %cond.store11
> -; AVX1-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> +; AVX1-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
>  ; AVX1-NEXT:    testb $-128, %al
>  ; AVX1-NEXT:    je .LBB12_16
>  ; AVX1-NEXT:  .LBB12_15: # %cond.store13
> -; AVX1-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> +; AVX1-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
>  ; AVX1-NEXT:    vzeroupper
>  ; AVX1-NEXT:    retq
>  ;
>  ; AVX2-LABEL: truncstore_v8i32_v8i8:
>  ; AVX2:       # %bb.0:
>  ; AVX2-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> -; AVX2-NEXT:    vpbroadcastd {{.*#+}} ymm3 = [127,127,127,127,127,127,127,127]
> -; AVX2-NEXT:    vpminsd %ymm3, %ymm0, %ymm0
> -; AVX2-NEXT:    vpbroadcastd {{.*#+}} ymm3 = [4294967168,4294967168,4294967168,4294967168,4294967168,4294967168,4294967168,4294967168]
> -; AVX2-NEXT:    vpmaxsd %ymm3, %ymm0, %ymm0
>  ; AVX2-NEXT:    vextracti128 $1, %ymm0, %xmm3
>  ; AVX2-NEXT:    vpackssdw %xmm3, %xmm0, %xmm0
> +; AVX2-NEXT:    vpacksswb %xmm0, %xmm0, %xmm0
>  ; AVX2-NEXT:    vpcmpeqd %ymm2, %ymm1, %ymm1
>  ; AVX2-NEXT:    vmovmskps %ymm1, %eax
>  ; AVX2-NEXT:    notl %eax
> @@ -4932,31 +4959,31 @@ define void @truncstore_v8i32_v8i8(<8 x
>  ; AVX2-NEXT:    testb $2, %al
>  ; AVX2-NEXT:    je .LBB12_4
>  ; AVX2-NEXT:  .LBB12_3: # %cond.store1
> -; AVX2-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> +; AVX2-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX2-NEXT:    testb $4, %al
>  ; AVX2-NEXT:    je .LBB12_6
>  ; AVX2-NEXT:  .LBB12_5: # %cond.store3
> -; AVX2-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> +; AVX2-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
>  ; AVX2-NEXT:    testb $8, %al
>  ; AVX2-NEXT:    je .LBB12_8
>  ; AVX2-NEXT:  .LBB12_7: # %cond.store5
> -; AVX2-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> +; AVX2-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
>  ; AVX2-NEXT:    testb $16, %al
>  ; AVX2-NEXT:    je .LBB12_10
>  ; AVX2-NEXT:  .LBB12_9: # %cond.store7
> -; AVX2-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> +; AVX2-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
>  ; AVX2-NEXT:    testb $32, %al
>  ; AVX2-NEXT:    je .LBB12_12
>  ; AVX2-NEXT:  .LBB12_11: # %cond.store9
> -; AVX2-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> +; AVX2-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
>  ; AVX2-NEXT:    testb $64, %al
>  ; AVX2-NEXT:    je .LBB12_14
>  ; AVX2-NEXT:  .LBB12_13: # %cond.store11
> -; AVX2-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> +; AVX2-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
>  ; AVX2-NEXT:    testb $-128, %al
>  ; AVX2-NEXT:    je .LBB12_16
>  ; AVX2-NEXT:  .LBB12_15: # %cond.store13
> -; AVX2-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> +; AVX2-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
>  ; AVX2-NEXT:    vzeroupper
>  ; AVX2-NEXT:    retq
>  ;
> @@ -4968,7 +4995,7 @@ define void @truncstore_v8i32_v8i8(<8 x
>  ; AVX512F-NEXT:    vpminsd %ymm1, %ymm0, %ymm0
>  ; AVX512F-NEXT:    vpbroadcastd {{.*#+}} ymm1 = [4294967168,4294967168,4294967168,4294967168,4294967168,4294967168,4294967168,4294967168]
>  ; AVX512F-NEXT:    vpmaxsd %ymm1, %ymm0, %ymm0
> -; AVX512F-NEXT:    vpmovdw %zmm0, %ymm0
> +; AVX512F-NEXT:    vpmovdb %zmm0, %xmm0
>  ; AVX512F-NEXT:    kmovw %k0, %eax
>  ; AVX512F-NEXT:    testb $1, %al
>  ; AVX512F-NEXT:    jne .LBB12_1
> @@ -5001,31 +5028,31 @@ define void @truncstore_v8i32_v8i8(<8 x
>  ; AVX512F-NEXT:    testb $2, %al
>  ; AVX512F-NEXT:    je .LBB12_4
>  ; AVX512F-NEXT:  .LBB12_3: # %cond.store1
> -; AVX512F-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> +; AVX512F-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX512F-NEXT:    testb $4, %al
>  ; AVX512F-NEXT:    je .LBB12_6
>  ; AVX512F-NEXT:  .LBB12_5: # %cond.store3
> -; AVX512F-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> +; AVX512F-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
>  ; AVX512F-NEXT:    testb $8, %al
>  ; AVX512F-NEXT:    je .LBB12_8
>  ; AVX512F-NEXT:  .LBB12_7: # %cond.store5
> -; AVX512F-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> +; AVX512F-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
>  ; AVX512F-NEXT:    testb $16, %al
>  ; AVX512F-NEXT:    je .LBB12_10
>  ; AVX512F-NEXT:  .LBB12_9: # %cond.store7
> -; AVX512F-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> +; AVX512F-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
>  ; AVX512F-NEXT:    testb $32, %al
>  ; AVX512F-NEXT:    je .LBB12_12
>  ; AVX512F-NEXT:  .LBB12_11: # %cond.store9
> -; AVX512F-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> +; AVX512F-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
>  ; AVX512F-NEXT:    testb $64, %al
>  ; AVX512F-NEXT:    je .LBB12_14
>  ; AVX512F-NEXT:  .LBB12_13: # %cond.store11
> -; AVX512F-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> +; AVX512F-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
>  ; AVX512F-NEXT:    testb $-128, %al
>  ; AVX512F-NEXT:    je .LBB12_16
>  ; AVX512F-NEXT:  .LBB12_15: # %cond.store13
> -; AVX512F-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> +; AVX512F-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
>  ; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
>  ;
> @@ -5033,14 +5060,13 @@ define void @truncstore_v8i32_v8i8(<8 x
>  ; AVX512BW:       # %bb.0:
>  ; AVX512BW-NEXT:    # kill: def $ymm1 killed $ymm1 def $zmm1
>  ; AVX512BW-NEXT:    vptestmd %zmm1, %zmm1, %k0
> +; AVX512BW-NEXT:    kshiftlq $56, %k0, %k0
> +; AVX512BW-NEXT:    kshiftrq $56, %k0, %k1
>  ; AVX512BW-NEXT:    vpbroadcastd {{.*#+}} ymm1 = [127,127,127,127,127,127,127,127]
>  ; AVX512BW-NEXT:    vpminsd %ymm1, %ymm0, %ymm0
>  ; AVX512BW-NEXT:    vpbroadcastd {{.*#+}} ymm1 = [4294967168,4294967168,4294967168,4294967168,4294967168,4294967168,4294967168,4294967168]
>  ; AVX512BW-NEXT:    vpmaxsd %ymm1, %ymm0, %ymm0
> -; AVX512BW-NEXT:    vpmovdw %zmm0, %ymm0
> -; AVX512BW-NEXT:    vpacksswb %xmm0, %xmm0, %xmm0
> -; AVX512BW-NEXT:    kshiftlq $56, %k0, %k0
> -; AVX512BW-NEXT:    kshiftrq $56, %k0, %k1
> +; AVX512BW-NEXT:    vpmovdb %zmm0, %xmm0
>  ; AVX512BW-NEXT:    vmovdqu8 %zmm0, (%rdi) {%k1}
>  ; AVX512BW-NEXT:    vzeroupper
>  ; AVX512BW-NEXT:    retq
> @@ -5067,18 +5093,7 @@ define void @truncstore_v4i32_v4i16(<4 x
>  ; SSE2-LABEL: truncstore_v4i32_v4i16:
>  ; SSE2:       # %bb.0:
>  ; SSE2-NEXT:    pxor %xmm2, %xmm2
> -; SSE2-NEXT:    movdqa {{.*#+}} xmm3 = [32767,32767,32767,32767]
> -; SSE2-NEXT:    movdqa %xmm3, %xmm4
> -; SSE2-NEXT:    pcmpgtd %xmm0, %xmm4
> -; SSE2-NEXT:    pand %xmm4, %xmm0
> -; SSE2-NEXT:    pandn %xmm3, %xmm4
> -; SSE2-NEXT:    por %xmm0, %xmm4
> -; SSE2-NEXT:    movdqa {{.*#+}} xmm3 = [4294934528,4294934528,4294934528,4294934528]
> -; SSE2-NEXT:    movdqa %xmm4, %xmm0
> -; SSE2-NEXT:    pcmpgtd %xmm3, %xmm0
> -; SSE2-NEXT:    pand %xmm0, %xmm4
> -; SSE2-NEXT:    pandn %xmm3, %xmm0
> -; SSE2-NEXT:    por %xmm4, %xmm0
> +; SSE2-NEXT:    packssdw %xmm0, %xmm0
>  ; SSE2-NEXT:    pcmpeqd %xmm1, %xmm2
>  ; SSE2-NEXT:    movmskps %xmm2, %eax
>  ; SSE2-NEXT:    xorl $15, %eax
> @@ -5101,25 +5116,24 @@ define void @truncstore_v4i32_v4i16(<4 x
>  ; SSE2-NEXT:    testb $2, %al
>  ; SSE2-NEXT:    je .LBB13_4
>  ; SSE2-NEXT:  .LBB13_3: # %cond.store1
> -; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> +; SSE2-NEXT:    pextrw $1, %xmm0, %ecx
>  ; SSE2-NEXT:    movw %cx, 2(%rdi)
>  ; SSE2-NEXT:    testb $4, %al
>  ; SSE2-NEXT:    je .LBB13_6
>  ; SSE2-NEXT:  .LBB13_5: # %cond.store3
> -; SSE2-NEXT:    pextrw $4, %xmm0, %ecx
> +; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
>  ; SSE2-NEXT:    movw %cx, 4(%rdi)
>  ; SSE2-NEXT:    testb $8, %al
>  ; SSE2-NEXT:    je .LBB13_8
>  ; SSE2-NEXT:  .LBB13_7: # %cond.store5
> -; SSE2-NEXT:    pextrw $6, %xmm0, %eax
> +; SSE2-NEXT:    pextrw $3, %xmm0, %eax
>  ; SSE2-NEXT:    movw %ax, 6(%rdi)
>  ; SSE2-NEXT:    retq
>  ;
>  ; SSE4-LABEL: truncstore_v4i32_v4i16:
>  ; SSE4:       # %bb.0:
>  ; SSE4-NEXT:    pxor %xmm2, %xmm2
> -; SSE4-NEXT:    pminsd {{.*}}(%rip), %xmm0
> -; SSE4-NEXT:    pmaxsd {{.*}}(%rip), %xmm0
> +; SSE4-NEXT:    packssdw %xmm0, %xmm0
>  ; SSE4-NEXT:    pcmpeqd %xmm1, %xmm2
>  ; SSE4-NEXT:    movmskps %xmm2, %eax
>  ; SSE4-NEXT:    xorl $15, %eax
> @@ -5141,92 +5155,52 @@ define void @truncstore_v4i32_v4i16(<4 x
>  ; SSE4-NEXT:    testb $2, %al
>  ; SSE4-NEXT:    je .LBB13_4
>  ; SSE4-NEXT:  .LBB13_3: # %cond.store1
> -; SSE4-NEXT:    pextrw $2, %xmm0, 2(%rdi)
> +; SSE4-NEXT:    pextrw $1, %xmm0, 2(%rdi)
>  ; SSE4-NEXT:    testb $4, %al
>  ; SSE4-NEXT:    je .LBB13_6
>  ; SSE4-NEXT:  .LBB13_5: # %cond.store3
> -; SSE4-NEXT:    pextrw $4, %xmm0, 4(%rdi)
> +; SSE4-NEXT:    pextrw $2, %xmm0, 4(%rdi)
>  ; SSE4-NEXT:    testb $8, %al
>  ; SSE4-NEXT:    je .LBB13_8
>  ; SSE4-NEXT:  .LBB13_7: # %cond.store5
> -; SSE4-NEXT:    pextrw $6, %xmm0, 6(%rdi)
> +; SSE4-NEXT:    pextrw $3, %xmm0, 6(%rdi)
>  ; SSE4-NEXT:    retq
>  ;
> -; AVX1-LABEL: truncstore_v4i32_v4i16:
> -; AVX1:       # %bb.0:
> -; AVX1-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> -; AVX1-NEXT:    vpminsd {{.*}}(%rip), %xmm0, %xmm0
> -; AVX1-NEXT:    vpmaxsd {{.*}}(%rip), %xmm0, %xmm0
> -; AVX1-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
> -; AVX1-NEXT:    vmovmskps %xmm1, %eax
> -; AVX1-NEXT:    xorl $15, %eax
> -; AVX1-NEXT:    testb $1, %al
> -; AVX1-NEXT:    jne .LBB13_1
> -; AVX1-NEXT:  # %bb.2: # %else
> -; AVX1-NEXT:    testb $2, %al
> -; AVX1-NEXT:    jne .LBB13_3
> -; AVX1-NEXT:  .LBB13_4: # %else2
> -; AVX1-NEXT:    testb $4, %al
> -; AVX1-NEXT:    jne .LBB13_5
> -; AVX1-NEXT:  .LBB13_6: # %else4
> -; AVX1-NEXT:    testb $8, %al
> -; AVX1-NEXT:    jne .LBB13_7
> -; AVX1-NEXT:  .LBB13_8: # %else6
> -; AVX1-NEXT:    retq
> -; AVX1-NEXT:  .LBB13_1: # %cond.store
> -; AVX1-NEXT:    vpextrw $0, %xmm0, (%rdi)
> -; AVX1-NEXT:    testb $2, %al
> -; AVX1-NEXT:    je .LBB13_4
> -; AVX1-NEXT:  .LBB13_3: # %cond.store1
> -; AVX1-NEXT:    vpextrw $2, %xmm0, 2(%rdi)
> -; AVX1-NEXT:    testb $4, %al
> -; AVX1-NEXT:    je .LBB13_6
> -; AVX1-NEXT:  .LBB13_5: # %cond.store3
> -; AVX1-NEXT:    vpextrw $4, %xmm0, 4(%rdi)
> -; AVX1-NEXT:    testb $8, %al
> -; AVX1-NEXT:    je .LBB13_8
> -; AVX1-NEXT:  .LBB13_7: # %cond.store5
> -; AVX1-NEXT:    vpextrw $6, %xmm0, 6(%rdi)
> -; AVX1-NEXT:    retq
> -;
> -; AVX2-LABEL: truncstore_v4i32_v4i16:
> -; AVX2:       # %bb.0:
> -; AVX2-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> -; AVX2-NEXT:    vpbroadcastd {{.*#+}} xmm3 = [32767,32767,32767,32767]
> -; AVX2-NEXT:    vpminsd %xmm3, %xmm0, %xmm0
> -; AVX2-NEXT:    vpbroadcastd {{.*#+}} xmm3 = [4294934528,4294934528,4294934528,4294934528]
> -; AVX2-NEXT:    vpmaxsd %xmm3, %xmm0, %xmm0
> -; AVX2-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
> -; AVX2-NEXT:    vmovmskps %xmm1, %eax
> -; AVX2-NEXT:    xorl $15, %eax
> -; AVX2-NEXT:    testb $1, %al
> -; AVX2-NEXT:    jne .LBB13_1
> -; AVX2-NEXT:  # %bb.2: # %else
> -; AVX2-NEXT:    testb $2, %al
> -; AVX2-NEXT:    jne .LBB13_3
> -; AVX2-NEXT:  .LBB13_4: # %else2
> -; AVX2-NEXT:    testb $4, %al
> -; AVX2-NEXT:    jne .LBB13_5
> -; AVX2-NEXT:  .LBB13_6: # %else4
> -; AVX2-NEXT:    testb $8, %al
> -; AVX2-NEXT:    jne .LBB13_7
> -; AVX2-NEXT:  .LBB13_8: # %else6
> -; AVX2-NEXT:    retq
> -; AVX2-NEXT:  .LBB13_1: # %cond.store
> -; AVX2-NEXT:    vpextrw $0, %xmm0, (%rdi)
> -; AVX2-NEXT:    testb $2, %al
> -; AVX2-NEXT:    je .LBB13_4
> -; AVX2-NEXT:  .LBB13_3: # %cond.store1
> -; AVX2-NEXT:    vpextrw $2, %xmm0, 2(%rdi)
> -; AVX2-NEXT:    testb $4, %al
> -; AVX2-NEXT:    je .LBB13_6
> -; AVX2-NEXT:  .LBB13_5: # %cond.store3
> -; AVX2-NEXT:    vpextrw $4, %xmm0, 4(%rdi)
> -; AVX2-NEXT:    testb $8, %al
> -; AVX2-NEXT:    je .LBB13_8
> -; AVX2-NEXT:  .LBB13_7: # %cond.store5
> -; AVX2-NEXT:    vpextrw $6, %xmm0, 6(%rdi)
> -; AVX2-NEXT:    retq
> +; AVX-LABEL: truncstore_v4i32_v4i16:
> +; AVX:       # %bb.0:
> +; AVX-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> +; AVX-NEXT:    vpackssdw %xmm0, %xmm0, %xmm0
> +; AVX-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
> +; AVX-NEXT:    vmovmskps %xmm1, %eax
> +; AVX-NEXT:    xorl $15, %eax
> +; AVX-NEXT:    testb $1, %al
> +; AVX-NEXT:    jne .LBB13_1
> +; AVX-NEXT:  # %bb.2: # %else
> +; AVX-NEXT:    testb $2, %al
> +; AVX-NEXT:    jne .LBB13_3
> +; AVX-NEXT:  .LBB13_4: # %else2
> +; AVX-NEXT:    testb $4, %al
> +; AVX-NEXT:    jne .LBB13_5
> +; AVX-NEXT:  .LBB13_6: # %else4
> +; AVX-NEXT:    testb $8, %al
> +; AVX-NEXT:    jne .LBB13_7
> +; AVX-NEXT:  .LBB13_8: # %else6
> +; AVX-NEXT:    retq
> +; AVX-NEXT:  .LBB13_1: # %cond.store
> +; AVX-NEXT:    vpextrw $0, %xmm0, (%rdi)
> +; AVX-NEXT:    testb $2, %al
> +; AVX-NEXT:    je .LBB13_4
> +; AVX-NEXT:  .LBB13_3: # %cond.store1
> +; AVX-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
> +; AVX-NEXT:    testb $4, %al
> +; AVX-NEXT:    je .LBB13_6
> +; AVX-NEXT:  .LBB13_5: # %cond.store3
> +; AVX-NEXT:    vpextrw $2, %xmm0, 4(%rdi)
> +; AVX-NEXT:    testb $8, %al
> +; AVX-NEXT:    je .LBB13_8
> +; AVX-NEXT:  .LBB13_7: # %cond.store5
> +; AVX-NEXT:    vpextrw $3, %xmm0, 6(%rdi)
> +; AVX-NEXT:    retq
>  ;
>  ; AVX512F-LABEL: truncstore_v4i32_v4i16:
>  ; AVX512F:       # %bb.0:
> @@ -5236,6 +5210,7 @@ define void @truncstore_v4i32_v4i16(<4 x
>  ; AVX512F-NEXT:    vpminsd %xmm1, %xmm0, %xmm0
>  ; AVX512F-NEXT:    vpbroadcastd {{.*#+}} xmm1 = [4294934528,4294934528,4294934528,4294934528]
>  ; AVX512F-NEXT:    vpmaxsd %xmm1, %xmm0, %xmm0
> +; AVX512F-NEXT:    vpackssdw %xmm0, %xmm0, %xmm0
>  ; AVX512F-NEXT:    kmovw %k0, %eax
>  ; AVX512F-NEXT:    testb $1, %al
>  ; AVX512F-NEXT:    jne .LBB13_1
> @@ -5256,15 +5231,15 @@ define void @truncstore_v4i32_v4i16(<4 x
>  ; AVX512F-NEXT:    testb $2, %al
>  ; AVX512F-NEXT:    je .LBB13_4
>  ; AVX512F-NEXT:  .LBB13_3: # %cond.store1
> -; AVX512F-NEXT:    vpextrw $2, %xmm0, 2(%rdi)
> +; AVX512F-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
>  ; AVX512F-NEXT:    testb $4, %al
>  ; AVX512F-NEXT:    je .LBB13_6
>  ; AVX512F-NEXT:  .LBB13_5: # %cond.store3
> -; AVX512F-NEXT:    vpextrw $4, %xmm0, 4(%rdi)
> +; AVX512F-NEXT:    vpextrw $2, %xmm0, 4(%rdi)
>  ; AVX512F-NEXT:    testb $8, %al
>  ; AVX512F-NEXT:    je .LBB13_8
>  ; AVX512F-NEXT:  .LBB13_7: # %cond.store5
> -; AVX512F-NEXT:    vpextrw $6, %xmm0, 6(%rdi)
> +; AVX512F-NEXT:    vpextrw $3, %xmm0, 6(%rdi)
>  ; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
>  ;
> @@ -5272,13 +5247,13 @@ define void @truncstore_v4i32_v4i16(<4 x
>  ; AVX512BW:       # %bb.0:
>  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
>  ; AVX512BW-NEXT:    vptestmd %zmm1, %zmm1, %k0
> +; AVX512BW-NEXT:    kshiftld $28, %k0, %k0
> +; AVX512BW-NEXT:    kshiftrd $28, %k0, %k1
>  ; AVX512BW-NEXT:    vpbroadcastd {{.*#+}} xmm1 = [32767,32767,32767,32767]
>  ; AVX512BW-NEXT:    vpminsd %xmm1, %xmm0, %xmm0
>  ; AVX512BW-NEXT:    vpbroadcastd {{.*#+}} xmm1 = [4294934528,4294934528,4294934528,4294934528]
>  ; AVX512BW-NEXT:    vpmaxsd %xmm1, %xmm0, %xmm0
>  ; AVX512BW-NEXT:    vpackssdw %xmm0, %xmm0, %xmm0
> -; AVX512BW-NEXT:    kshiftld $28, %k0, %k0
> -; AVX512BW-NEXT:    kshiftrd $28, %k0, %k1
>  ; AVX512BW-NEXT:    vmovdqu16 %zmm0, (%rdi) {%k1}
>  ; AVX512BW-NEXT:    vzeroupper
>  ; AVX512BW-NEXT:    retq
> @@ -5310,45 +5285,48 @@ define void @truncstore_v4i32_v4i8(<4 x
>  ; SSE2-NEXT:    pand %xmm4, %xmm0
>  ; SSE2-NEXT:    pandn %xmm3, %xmm4
>  ; SSE2-NEXT:    por %xmm0, %xmm4
> -; SSE2-NEXT:    movdqa {{.*#+}} xmm3 = [4294967168,4294967168,4294967168,4294967168]
> -; SSE2-NEXT:    movdqa %xmm4, %xmm0
> -; SSE2-NEXT:    pcmpgtd %xmm3, %xmm0
> -; SSE2-NEXT:    pand %xmm0, %xmm4
> -; SSE2-NEXT:    pandn %xmm3, %xmm0
> -; SSE2-NEXT:    por %xmm4, %xmm0
> +; SSE2-NEXT:    movdqa {{.*#+}} xmm0 = [4294967168,4294967168,4294967168,4294967168]
> +; SSE2-NEXT:    movdqa %xmm4, %xmm3
> +; SSE2-NEXT:    pcmpgtd %xmm0, %xmm3
> +; SSE2-NEXT:    pand %xmm3, %xmm4
> +; SSE2-NEXT:    pandn %xmm0, %xmm3
> +; SSE2-NEXT:    por %xmm4, %xmm3
> +; SSE2-NEXT:    pand {{.*}}(%rip), %xmm3
> +; SSE2-NEXT:    packuswb %xmm3, %xmm3
> +; SSE2-NEXT:    packuswb %xmm3, %xmm3
>  ; SSE2-NEXT:    pcmpeqd %xmm1, %xmm2
> -; SSE2-NEXT:    movmskps %xmm2, %eax
> -; SSE2-NEXT:    xorl $15, %eax
> -; SSE2-NEXT:    testb $1, %al
> +; SSE2-NEXT:    movmskps %xmm2, %ecx
> +; SSE2-NEXT:    xorl $15, %ecx
> +; SSE2-NEXT:    testb $1, %cl
> +; SSE2-NEXT:    movd %xmm3, %eax
>  ; SSE2-NEXT:    jne .LBB14_1
>  ; SSE2-NEXT:  # %bb.2: # %else
> -; SSE2-NEXT:    testb $2, %al
> +; SSE2-NEXT:    testb $2, %cl
>  ; SSE2-NEXT:    jne .LBB14_3
>  ; SSE2-NEXT:  .LBB14_4: # %else2
> -; SSE2-NEXT:    testb $4, %al
> +; SSE2-NEXT:    testb $4, %cl
>  ; SSE2-NEXT:    jne .LBB14_5
>  ; SSE2-NEXT:  .LBB14_6: # %else4
> -; SSE2-NEXT:    testb $8, %al
> +; SSE2-NEXT:    testb $8, %cl
>  ; SSE2-NEXT:    jne .LBB14_7
>  ; SSE2-NEXT:  .LBB14_8: # %else6
>  ; SSE2-NEXT:    retq
>  ; SSE2-NEXT:  .LBB14_1: # %cond.store
> -; SSE2-NEXT:    movd %xmm0, %ecx
> -; SSE2-NEXT:    movb %cl, (%rdi)
> -; SSE2-NEXT:    testb $2, %al
> +; SSE2-NEXT:    movb %al, (%rdi)
> +; SSE2-NEXT:    testb $2, %cl
>  ; SSE2-NEXT:    je .LBB14_4
>  ; SSE2-NEXT:  .LBB14_3: # %cond.store1
> -; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> -; SSE2-NEXT:    movb %cl, 1(%rdi)
> -; SSE2-NEXT:    testb $4, %al
> +; SSE2-NEXT:    movb %ah, 1(%rdi)
> +; SSE2-NEXT:    testb $4, %cl
>  ; SSE2-NEXT:    je .LBB14_6
>  ; SSE2-NEXT:  .LBB14_5: # %cond.store3
> -; SSE2-NEXT:    pextrw $4, %xmm0, %ecx
> -; SSE2-NEXT:    movb %cl, 2(%rdi)
> -; SSE2-NEXT:    testb $8, %al
> +; SSE2-NEXT:    movl %eax, %edx
> +; SSE2-NEXT:    shrl $16, %edx
> +; SSE2-NEXT:    movb %dl, 2(%rdi)
> +; SSE2-NEXT:    testb $8, %cl
>  ; SSE2-NEXT:    je .LBB14_8
>  ; SSE2-NEXT:  .LBB14_7: # %cond.store5
> -; SSE2-NEXT:    pextrw $6, %xmm0, %eax
> +; SSE2-NEXT:    shrl $24, %eax
>  ; SSE2-NEXT:    movb %al, 3(%rdi)
>  ; SSE2-NEXT:    retq
>  ;
> @@ -5357,6 +5335,7 @@ define void @truncstore_v4i32_v4i8(<4 x
>  ; SSE4-NEXT:    pxor %xmm2, %xmm2
>  ; SSE4-NEXT:    pminsd {{.*}}(%rip), %xmm0
>  ; SSE4-NEXT:    pmaxsd {{.*}}(%rip), %xmm0
> +; SSE4-NEXT:    pshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
>  ; SSE4-NEXT:    pcmpeqd %xmm1, %xmm2
>  ; SSE4-NEXT:    movmskps %xmm2, %eax
>  ; SSE4-NEXT:    xorl $15, %eax
> @@ -5378,15 +5357,15 @@ define void @truncstore_v4i32_v4i8(<4 x
>  ; SSE4-NEXT:    testb $2, %al
>  ; SSE4-NEXT:    je .LBB14_4
>  ; SSE4-NEXT:  .LBB14_3: # %cond.store1
> -; SSE4-NEXT:    pextrb $4, %xmm0, 1(%rdi)
> +; SSE4-NEXT:    pextrb $1, %xmm0, 1(%rdi)
>  ; SSE4-NEXT:    testb $4, %al
>  ; SSE4-NEXT:    je .LBB14_6
>  ; SSE4-NEXT:  .LBB14_5: # %cond.store3
> -; SSE4-NEXT:    pextrb $8, %xmm0, 2(%rdi)
> +; SSE4-NEXT:    pextrb $2, %xmm0, 2(%rdi)
>  ; SSE4-NEXT:    testb $8, %al
>  ; SSE4-NEXT:    je .LBB14_8
>  ; SSE4-NEXT:  .LBB14_7: # %cond.store5
> -; SSE4-NEXT:    pextrb $12, %xmm0, 3(%rdi)
> +; SSE4-NEXT:    pextrb $3, %xmm0, 3(%rdi)
>  ; SSE4-NEXT:    retq
>  ;
>  ; AVX1-LABEL: truncstore_v4i32_v4i8:
> @@ -5394,6 +5373,7 @@ define void @truncstore_v4i32_v4i8(<4 x
>  ; AVX1-NEXT:    vpxor %xmm2, %xmm2, %xmm2
>  ; AVX1-NEXT:    vpminsd {{.*}}(%rip), %xmm0, %xmm0
>  ; AVX1-NEXT:    vpmaxsd {{.*}}(%rip), %xmm0, %xmm0
> +; AVX1-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
>  ; AVX1-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
>  ; AVX1-NEXT:    vmovmskps %xmm1, %eax
>  ; AVX1-NEXT:    xorl $15, %eax
> @@ -5415,15 +5395,15 @@ define void @truncstore_v4i32_v4i8(<4 x
>  ; AVX1-NEXT:    testb $2, %al
>  ; AVX1-NEXT:    je .LBB14_4
>  ; AVX1-NEXT:  .LBB14_3: # %cond.store1
> -; AVX1-NEXT:    vpextrb $4, %xmm0, 1(%rdi)
> +; AVX1-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX1-NEXT:    testb $4, %al
>  ; AVX1-NEXT:    je .LBB14_6
>  ; AVX1-NEXT:  .LBB14_5: # %cond.store3
> -; AVX1-NEXT:    vpextrb $8, %xmm0, 2(%rdi)
> +; AVX1-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
>  ; AVX1-NEXT:    testb $8, %al
>  ; AVX1-NEXT:    je .LBB14_8
>  ; AVX1-NEXT:  .LBB14_7: # %cond.store5
> -; AVX1-NEXT:    vpextrb $12, %xmm0, 3(%rdi)
> +; AVX1-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
>  ; AVX1-NEXT:    retq
>  ;
>  ; AVX2-LABEL: truncstore_v4i32_v4i8:
> @@ -5433,6 +5413,7 @@ define void @truncstore_v4i32_v4i8(<4 x
>  ; AVX2-NEXT:    vpminsd %xmm3, %xmm0, %xmm0
>  ; AVX2-NEXT:    vpbroadcastd {{.*#+}} xmm3 = [4294967168,4294967168,4294967168,4294967168]
>  ; AVX2-NEXT:    vpmaxsd %xmm3, %xmm0, %xmm0
> +; AVX2-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
>  ; AVX2-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
>  ; AVX2-NEXT:    vmovmskps %xmm1, %eax
>  ; AVX2-NEXT:    xorl $15, %eax
> @@ -5454,15 +5435,15 @@ define void @truncstore_v4i32_v4i8(<4 x
>  ; AVX2-NEXT:    testb $2, %al
>  ; AVX2-NEXT:    je .LBB14_4
>  ; AVX2-NEXT:  .LBB14_3: # %cond.store1
> -; AVX2-NEXT:    vpextrb $4, %xmm0, 1(%rdi)
> +; AVX2-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX2-NEXT:    testb $4, %al
>  ; AVX2-NEXT:    je .LBB14_6
>  ; AVX2-NEXT:  .LBB14_5: # %cond.store3
> -; AVX2-NEXT:    vpextrb $8, %xmm0, 2(%rdi)
> +; AVX2-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
>  ; AVX2-NEXT:    testb $8, %al
>  ; AVX2-NEXT:    je .LBB14_8
>  ; AVX2-NEXT:  .LBB14_7: # %cond.store5
> -; AVX2-NEXT:    vpextrb $12, %xmm0, 3(%rdi)
> +; AVX2-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
>  ; AVX2-NEXT:    retq
>  ;
>  ; AVX512F-LABEL: truncstore_v4i32_v4i8:
> @@ -5473,6 +5454,7 @@ define void @truncstore_v4i32_v4i8(<4 x
>  ; AVX512F-NEXT:    vpminsd %xmm1, %xmm0, %xmm0
>  ; AVX512F-NEXT:    vpbroadcastd {{.*#+}} xmm1 = [4294967168,4294967168,4294967168,4294967168]
>  ; AVX512F-NEXT:    vpmaxsd %xmm1, %xmm0, %xmm0
> +; AVX512F-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
>  ; AVX512F-NEXT:    kmovw %k0, %eax
>  ; AVX512F-NEXT:    testb $1, %al
>  ; AVX512F-NEXT:    jne .LBB14_1
> @@ -5493,15 +5475,15 @@ define void @truncstore_v4i32_v4i8(<4 x
>  ; AVX512F-NEXT:    testb $2, %al
>  ; AVX512F-NEXT:    je .LBB14_4
>  ; AVX512F-NEXT:  .LBB14_3: # %cond.store1
> -; AVX512F-NEXT:    vpextrb $4, %xmm0, 1(%rdi)
> +; AVX512F-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX512F-NEXT:    testb $4, %al
>  ; AVX512F-NEXT:    je .LBB14_6
>  ; AVX512F-NEXT:  .LBB14_5: # %cond.store3
> -; AVX512F-NEXT:    vpextrb $8, %xmm0, 2(%rdi)
> +; AVX512F-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
>  ; AVX512F-NEXT:    testb $8, %al
>  ; AVX512F-NEXT:    je .LBB14_8
>  ; AVX512F-NEXT:  .LBB14_7: # %cond.store5
> -; AVX512F-NEXT:    vpextrb $12, %xmm0, 3(%rdi)
> +; AVX512F-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
>  ; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
>  ;
> @@ -5509,13 +5491,13 @@ define void @truncstore_v4i32_v4i8(<4 x
>  ; AVX512BW:       # %bb.0:
>  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
>  ; AVX512BW-NEXT:    vptestmd %zmm1, %zmm1, %k0
> +; AVX512BW-NEXT:    kshiftlq $60, %k0, %k0
> +; AVX512BW-NEXT:    kshiftrq $60, %k0, %k1
>  ; AVX512BW-NEXT:    vpbroadcastd {{.*#+}} xmm1 = [127,127,127,127]
>  ; AVX512BW-NEXT:    vpminsd %xmm1, %xmm0, %xmm0
>  ; AVX512BW-NEXT:    vpbroadcastd {{.*#+}} xmm1 = [4294967168,4294967168,4294967168,4294967168]
>  ; AVX512BW-NEXT:    vpmaxsd %xmm1, %xmm0, %xmm0
>  ; AVX512BW-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
> -; AVX512BW-NEXT:    kshiftlq $60, %k0, %k0
> -; AVX512BW-NEXT:    kshiftrq $60, %k0, %k1
>  ; AVX512BW-NEXT:    vmovdqu8 %zmm0, (%rdi) {%k1}
>  ; AVX512BW-NEXT:    vzeroupper
>  ; AVX512BW-NEXT:    retq
> @@ -7373,8 +7355,7 @@ define void @truncstore_v8i16_v8i8(<8 x
>  ; SSE2-LABEL: truncstore_v8i16_v8i8:
>  ; SSE2:       # %bb.0:
>  ; SSE2-NEXT:    pxor %xmm2, %xmm2
> -; SSE2-NEXT:    pminsw {{.*}}(%rip), %xmm0
> -; SSE2-NEXT:    pmaxsw {{.*}}(%rip), %xmm0
> +; SSE2-NEXT:    packsswb %xmm0, %xmm0
>  ; SSE2-NEXT:    pcmpeqw %xmm1, %xmm2
>  ; SSE2-NEXT:    pcmpeqd %xmm1, %xmm1
>  ; SSE2-NEXT:    pxor %xmm2, %xmm1
> @@ -7391,17 +7372,26 @@ define void @truncstore_v8i16_v8i8(<8 x
>  ; SSE2-NEXT:    jne .LBB17_5
>  ; SSE2-NEXT:  .LBB17_6: # %else4
>  ; SSE2-NEXT:    testb $8, %al
> -; SSE2-NEXT:    jne .LBB17_7
> +; SSE2-NEXT:    je .LBB17_8
> +; SSE2-NEXT:  .LBB17_7: # %cond.store5
> +; SSE2-NEXT:    shrl $24, %ecx
> +; SSE2-NEXT:    movb %cl, 3(%rdi)
>  ; SSE2-NEXT:  .LBB17_8: # %else6
>  ; SSE2-NEXT:    testb $16, %al
> -; SSE2-NEXT:    jne .LBB17_9
> +; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> +; SSE2-NEXT:    je .LBB17_10
> +; SSE2-NEXT:  # %bb.9: # %cond.store7
> +; SSE2-NEXT:    movb %cl, 4(%rdi)
>  ; SSE2-NEXT:  .LBB17_10: # %else8
>  ; SSE2-NEXT:    testb $32, %al
> -; SSE2-NEXT:    jne .LBB17_11
> +; SSE2-NEXT:    je .LBB17_12
> +; SSE2-NEXT:  # %bb.11: # %cond.store9
> +; SSE2-NEXT:    movb %ch, 5(%rdi)
>  ; SSE2-NEXT:  .LBB17_12: # %else10
>  ; SSE2-NEXT:    testb $64, %al
> +; SSE2-NEXT:    pextrw $3, %xmm0, %ecx
>  ; SSE2-NEXT:    jne .LBB17_13
> -; SSE2-NEXT:  .LBB17_14: # %else12
> +; SSE2-NEXT:  # %bb.14: # %else12
>  ; SSE2-NEXT:    testb $-128, %al
>  ; SSE2-NEXT:    jne .LBB17_15
>  ; SSE2-NEXT:  .LBB17_16: # %else14
> @@ -7411,45 +7401,28 @@ define void @truncstore_v8i16_v8i8(<8 x
>  ; SSE2-NEXT:    testb $2, %al
>  ; SSE2-NEXT:    je .LBB17_4
>  ; SSE2-NEXT:  .LBB17_3: # %cond.store1
> -; SSE2-NEXT:    shrl $16, %ecx
> -; SSE2-NEXT:    movb %cl, 1(%rdi)
> +; SSE2-NEXT:    movb %ch, 1(%rdi)
>  ; SSE2-NEXT:    testb $4, %al
>  ; SSE2-NEXT:    je .LBB17_6
>  ; SSE2-NEXT:  .LBB17_5: # %cond.store3
> -; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> -; SSE2-NEXT:    movb %cl, 2(%rdi)
> +; SSE2-NEXT:    movl %ecx, %edx
> +; SSE2-NEXT:    shrl $16, %edx
> +; SSE2-NEXT:    movb %dl, 2(%rdi)
>  ; SSE2-NEXT:    testb $8, %al
> -; SSE2-NEXT:    je .LBB17_8
> -; SSE2-NEXT:  .LBB17_7: # %cond.store5
> -; SSE2-NEXT:    pextrw $3, %xmm0, %ecx
> -; SSE2-NEXT:    movb %cl, 3(%rdi)
> -; SSE2-NEXT:    testb $16, %al
> -; SSE2-NEXT:    je .LBB17_10
> -; SSE2-NEXT:  .LBB17_9: # %cond.store7
> -; SSE2-NEXT:    pextrw $4, %xmm0, %ecx
> -; SSE2-NEXT:    movb %cl, 4(%rdi)
> -; SSE2-NEXT:    testb $32, %al
> -; SSE2-NEXT:    je .LBB17_12
> -; SSE2-NEXT:  .LBB17_11: # %cond.store9
> -; SSE2-NEXT:    pextrw $5, %xmm0, %ecx
> -; SSE2-NEXT:    movb %cl, 5(%rdi)
> -; SSE2-NEXT:    testb $64, %al
> -; SSE2-NEXT:    je .LBB17_14
> +; SSE2-NEXT:    jne .LBB17_7
> +; SSE2-NEXT:    jmp .LBB17_8
>  ; SSE2-NEXT:  .LBB17_13: # %cond.store11
> -; SSE2-NEXT:    pextrw $6, %xmm0, %ecx
>  ; SSE2-NEXT:    movb %cl, 6(%rdi)
>  ; SSE2-NEXT:    testb $-128, %al
>  ; SSE2-NEXT:    je .LBB17_16
>  ; SSE2-NEXT:  .LBB17_15: # %cond.store13
> -; SSE2-NEXT:    pextrw $7, %xmm0, %eax
> -; SSE2-NEXT:    movb %al, 7(%rdi)
> +; SSE2-NEXT:    movb %ch, 7(%rdi)
>  ; SSE2-NEXT:    retq
>  ;
>  ; SSE4-LABEL: truncstore_v8i16_v8i8:
>  ; SSE4:       # %bb.0:
>  ; SSE4-NEXT:    pxor %xmm2, %xmm2
> -; SSE4-NEXT:    pminsw {{.*}}(%rip), %xmm0
> -; SSE4-NEXT:    pmaxsw {{.*}}(%rip), %xmm0
> +; SSE4-NEXT:    packsswb %xmm0, %xmm0
>  ; SSE4-NEXT:    pcmpeqw %xmm1, %xmm2
>  ; SSE4-NEXT:    pcmpeqd %xmm1, %xmm1
>  ; SSE4-NEXT:    pxor %xmm2, %xmm1
> @@ -7485,38 +7458,37 @@ define void @truncstore_v8i16_v8i8(<8 x
>  ; SSE4-NEXT:    testb $2, %al
>  ; SSE4-NEXT:    je .LBB17_4
>  ; SSE4-NEXT:  .LBB17_3: # %cond.store1
> -; SSE4-NEXT:    pextrb $2, %xmm0, 1(%rdi)
> +; SSE4-NEXT:    pextrb $1, %xmm0, 1(%rdi)
>  ; SSE4-NEXT:    testb $4, %al
>  ; SSE4-NEXT:    je .LBB17_6
>  ; SSE4-NEXT:  .LBB17_5: # %cond.store3
> -; SSE4-NEXT:    pextrb $4, %xmm0, 2(%rdi)
> +; SSE4-NEXT:    pextrb $2, %xmm0, 2(%rdi)
>  ; SSE4-NEXT:    testb $8, %al
>  ; SSE4-NEXT:    je .LBB17_8
>  ; SSE4-NEXT:  .LBB17_7: # %cond.store5
> -; SSE4-NEXT:    pextrb $6, %xmm0, 3(%rdi)
> +; SSE4-NEXT:    pextrb $3, %xmm0, 3(%rdi)
>  ; SSE4-NEXT:    testb $16, %al
>  ; SSE4-NEXT:    je .LBB17_10
>  ; SSE4-NEXT:  .LBB17_9: # %cond.store7
> -; SSE4-NEXT:    pextrb $8, %xmm0, 4(%rdi)
> +; SSE4-NEXT:    pextrb $4, %xmm0, 4(%rdi)
>  ; SSE4-NEXT:    testb $32, %al
>  ; SSE4-NEXT:    je .LBB17_12
>  ; SSE4-NEXT:  .LBB17_11: # %cond.store9
> -; SSE4-NEXT:    pextrb $10, %xmm0, 5(%rdi)
> +; SSE4-NEXT:    pextrb $5, %xmm0, 5(%rdi)
>  ; SSE4-NEXT:    testb $64, %al
>  ; SSE4-NEXT:    je .LBB17_14
>  ; SSE4-NEXT:  .LBB17_13: # %cond.store11
> -; SSE4-NEXT:    pextrb $12, %xmm0, 6(%rdi)
> +; SSE4-NEXT:    pextrb $6, %xmm0, 6(%rdi)
>  ; SSE4-NEXT:    testb $-128, %al
>  ; SSE4-NEXT:    je .LBB17_16
>  ; SSE4-NEXT:  .LBB17_15: # %cond.store13
> -; SSE4-NEXT:    pextrb $14, %xmm0, 7(%rdi)
> +; SSE4-NEXT:    pextrb $7, %xmm0, 7(%rdi)
>  ; SSE4-NEXT:    retq
>  ;
>  ; AVX-LABEL: truncstore_v8i16_v8i8:
>  ; AVX:       # %bb.0:
>  ; AVX-NEXT:    vpxor %xmm2, %xmm2, %xmm2
> -; AVX-NEXT:    vpminsw {{.*}}(%rip), %xmm0, %xmm0
> -; AVX-NEXT:    vpmaxsw {{.*}}(%rip), %xmm0, %xmm0
> +; AVX-NEXT:    vpacksswb %xmm0, %xmm0, %xmm0
>  ; AVX-NEXT:    vpcmpeqw %xmm2, %xmm1, %xmm1
>  ; AVX-NEXT:    vpcmpeqd %xmm2, %xmm2, %xmm2
>  ; AVX-NEXT:    vpxor %xmm2, %xmm1, %xmm1
> @@ -7552,31 +7524,31 @@ define void @truncstore_v8i16_v8i8(<8 x
>  ; AVX-NEXT:    testb $2, %al
>  ; AVX-NEXT:    je .LBB17_4
>  ; AVX-NEXT:  .LBB17_3: # %cond.store1
> -; AVX-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> +; AVX-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX-NEXT:    testb $4, %al
>  ; AVX-NEXT:    je .LBB17_6
>  ; AVX-NEXT:  .LBB17_5: # %cond.store3
> -; AVX-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> +; AVX-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
>  ; AVX-NEXT:    testb $8, %al
>  ; AVX-NEXT:    je .LBB17_8
>  ; AVX-NEXT:  .LBB17_7: # %cond.store5
> -; AVX-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> +; AVX-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
>  ; AVX-NEXT:    testb $16, %al
>  ; AVX-NEXT:    je .LBB17_10
>  ; AVX-NEXT:  .LBB17_9: # %cond.store7
> -; AVX-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> +; AVX-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
>  ; AVX-NEXT:    testb $32, %al
>  ; AVX-NEXT:    je .LBB17_12
>  ; AVX-NEXT:  .LBB17_11: # %cond.store9
> -; AVX-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> +; AVX-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
>  ; AVX-NEXT:    testb $64, %al
>  ; AVX-NEXT:    je .LBB17_14
>  ; AVX-NEXT:  .LBB17_13: # %cond.store11
> -; AVX-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> +; AVX-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
>  ; AVX-NEXT:    testb $-128, %al
>  ; AVX-NEXT:    je .LBB17_16
>  ; AVX-NEXT:  .LBB17_15: # %cond.store13
> -; AVX-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> +; AVX-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
>  ; AVX-NEXT:    retq
>  ;
>  ; AVX512F-LABEL: truncstore_v8i16_v8i8:
> @@ -7588,6 +7560,7 @@ define void @truncstore_v8i16_v8i8(<8 x
>  ; AVX512F-NEXT:    vptestmq %zmm1, %zmm1, %k0
>  ; AVX512F-NEXT:    vpminsw {{.*}}(%rip), %xmm0, %xmm0
>  ; AVX512F-NEXT:    vpmaxsw {{.*}}(%rip), %xmm0, %xmm0
> +; AVX512F-NEXT:    vpacksswb %xmm0, %xmm0, %xmm0
>  ; AVX512F-NEXT:    kmovw %k0, %eax
>  ; AVX512F-NEXT:    testb $1, %al
>  ; AVX512F-NEXT:    jne .LBB17_1
> @@ -7620,31 +7593,31 @@ define void @truncstore_v8i16_v8i8(<8 x
>  ; AVX512F-NEXT:    testb $2, %al
>  ; AVX512F-NEXT:    je .LBB17_4
>  ; AVX512F-NEXT:  .LBB17_3: # %cond.store1
> -; AVX512F-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> +; AVX512F-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX512F-NEXT:    testb $4, %al
>  ; AVX512F-NEXT:    je .LBB17_6
>  ; AVX512F-NEXT:  .LBB17_5: # %cond.store3
> -; AVX512F-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> +; AVX512F-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
>  ; AVX512F-NEXT:    testb $8, %al
>  ; AVX512F-NEXT:    je .LBB17_8
>  ; AVX512F-NEXT:  .LBB17_7: # %cond.store5
> -; AVX512F-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> +; AVX512F-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
>  ; AVX512F-NEXT:    testb $16, %al
>  ; AVX512F-NEXT:    je .LBB17_10
>  ; AVX512F-NEXT:  .LBB17_9: # %cond.store7
> -; AVX512F-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> +; AVX512F-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
>  ; AVX512F-NEXT:    testb $32, %al
>  ; AVX512F-NEXT:    je .LBB17_12
>  ; AVX512F-NEXT:  .LBB17_11: # %cond.store9
> -; AVX512F-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> +; AVX512F-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
>  ; AVX512F-NEXT:    testb $64, %al
>  ; AVX512F-NEXT:    je .LBB17_14
>  ; AVX512F-NEXT:  .LBB17_13: # %cond.store11
> -; AVX512F-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> +; AVX512F-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
>  ; AVX512F-NEXT:    testb $-128, %al
>  ; AVX512F-NEXT:    je .LBB17_16
>  ; AVX512F-NEXT:  .LBB17_15: # %cond.store13
> -; AVX512F-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> +; AVX512F-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
>  ; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
>  ;
> @@ -7652,11 +7625,11 @@ define void @truncstore_v8i16_v8i8(<8 x
>  ; AVX512BW:       # %bb.0:
>  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
>  ; AVX512BW-NEXT:    vptestmw %zmm1, %zmm1, %k0
> +; AVX512BW-NEXT:    kshiftlq $56, %k0, %k0
> +; AVX512BW-NEXT:    kshiftrq $56, %k0, %k1
>  ; AVX512BW-NEXT:    vpminsw {{.*}}(%rip), %xmm0, %xmm0
>  ; AVX512BW-NEXT:    vpmaxsw {{.*}}(%rip), %xmm0, %xmm0
>  ; AVX512BW-NEXT:    vpacksswb %xmm0, %xmm0, %xmm0
> -; AVX512BW-NEXT:    kshiftlq $56, %k0, %k0
> -; AVX512BW-NEXT:    kshiftrq $56, %k0, %k1
>  ; AVX512BW-NEXT:    vmovdqu8 %zmm0, (%rdi) {%k1}
>  ; AVX512BW-NEXT:    vzeroupper
>  ; AVX512BW-NEXT:    retq
>
> Modified: llvm/trunk/test/CodeGen/X86/masked_store_trunc_usat.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/masked_store_trunc_usat.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/masked_store_trunc_usat.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/masked_store_trunc_usat.ll Wed Aug  7 09:24:26 2019
> @@ -872,6 +872,7 @@ define void @truncstore_v8i64_v8i8(<8 x
>  ; SSE2-NEXT:    por %xmm2, %xmm0
>  ; SSE2-NEXT:    packuswb %xmm1, %xmm0
>  ; SSE2-NEXT:    packuswb %xmm0, %xmm7
> +; SSE2-NEXT:    packuswb %xmm7, %xmm7
>  ; SSE2-NEXT:    pcmpeqd %xmm8, %xmm5
>  ; SSE2-NEXT:    pcmpeqd %xmm0, %xmm0
>  ; SSE2-NEXT:    pxor %xmm0, %xmm5
> @@ -891,17 +892,26 @@ define void @truncstore_v8i64_v8i8(<8 x
>  ; SSE2-NEXT:    jne .LBB2_5
>  ; SSE2-NEXT:  .LBB2_6: # %else4
>  ; SSE2-NEXT:    testb $8, %al
> -; SSE2-NEXT:    jne .LBB2_7
> +; SSE2-NEXT:    je .LBB2_8
> +; SSE2-NEXT:  .LBB2_7: # %cond.store5
> +; SSE2-NEXT:    shrl $24, %ecx
> +; SSE2-NEXT:    movb %cl, 3(%rdi)
>  ; SSE2-NEXT:  .LBB2_8: # %else6
>  ; SSE2-NEXT:    testb $16, %al
> -; SSE2-NEXT:    jne .LBB2_9
> +; SSE2-NEXT:    pextrw $2, %xmm7, %ecx
> +; SSE2-NEXT:    je .LBB2_10
> +; SSE2-NEXT:  # %bb.9: # %cond.store7
> +; SSE2-NEXT:    movb %cl, 4(%rdi)
>  ; SSE2-NEXT:  .LBB2_10: # %else8
>  ; SSE2-NEXT:    testb $32, %al
> -; SSE2-NEXT:    jne .LBB2_11
> +; SSE2-NEXT:    je .LBB2_12
> +; SSE2-NEXT:  # %bb.11: # %cond.store9
> +; SSE2-NEXT:    movb %ch, 5(%rdi)
>  ; SSE2-NEXT:  .LBB2_12: # %else10
>  ; SSE2-NEXT:    testb $64, %al
> +; SSE2-NEXT:    pextrw $3, %xmm7, %ecx
>  ; SSE2-NEXT:    jne .LBB2_13
> -; SSE2-NEXT:  .LBB2_14: # %else12
> +; SSE2-NEXT:  # %bb.14: # %else12
>  ; SSE2-NEXT:    testb $-128, %al
>  ; SSE2-NEXT:    jne .LBB2_15
>  ; SSE2-NEXT:  .LBB2_16: # %else14
> @@ -911,38 +921,22 @@ define void @truncstore_v8i64_v8i8(<8 x
>  ; SSE2-NEXT:    testb $2, %al
>  ; SSE2-NEXT:    je .LBB2_4
>  ; SSE2-NEXT:  .LBB2_3: # %cond.store1
> -; SSE2-NEXT:    shrl $16, %ecx
> -; SSE2-NEXT:    movb %cl, 1(%rdi)
> +; SSE2-NEXT:    movb %ch, 1(%rdi)
>  ; SSE2-NEXT:    testb $4, %al
>  ; SSE2-NEXT:    je .LBB2_6
>  ; SSE2-NEXT:  .LBB2_5: # %cond.store3
> -; SSE2-NEXT:    pextrw $2, %xmm7, %ecx
> -; SSE2-NEXT:    movb %cl, 2(%rdi)
> +; SSE2-NEXT:    movl %ecx, %edx
> +; SSE2-NEXT:    shrl $16, %edx
> +; SSE2-NEXT:    movb %dl, 2(%rdi)
>  ; SSE2-NEXT:    testb $8, %al
> -; SSE2-NEXT:    je .LBB2_8
> -; SSE2-NEXT:  .LBB2_7: # %cond.store5
> -; SSE2-NEXT:    pextrw $3, %xmm7, %ecx
> -; SSE2-NEXT:    movb %cl, 3(%rdi)
> -; SSE2-NEXT:    testb $16, %al
> -; SSE2-NEXT:    je .LBB2_10
> -; SSE2-NEXT:  .LBB2_9: # %cond.store7
> -; SSE2-NEXT:    pextrw $4, %xmm7, %ecx
> -; SSE2-NEXT:    movb %cl, 4(%rdi)
> -; SSE2-NEXT:    testb $32, %al
> -; SSE2-NEXT:    je .LBB2_12
> -; SSE2-NEXT:  .LBB2_11: # %cond.store9
> -; SSE2-NEXT:    pextrw $5, %xmm7, %ecx
> -; SSE2-NEXT:    movb %cl, 5(%rdi)
> -; SSE2-NEXT:    testb $64, %al
> -; SSE2-NEXT:    je .LBB2_14
> +; SSE2-NEXT:    jne .LBB2_7
> +; SSE2-NEXT:    jmp .LBB2_8
>  ; SSE2-NEXT:  .LBB2_13: # %cond.store11
> -; SSE2-NEXT:    pextrw $6, %xmm7, %ecx
>  ; SSE2-NEXT:    movb %cl, 6(%rdi)
>  ; SSE2-NEXT:    testb $-128, %al
>  ; SSE2-NEXT:    je .LBB2_16
>  ; SSE2-NEXT:  .LBB2_15: # %cond.store13
> -; SSE2-NEXT:    pextrw $7, %xmm7, %eax
> -; SSE2-NEXT:    movb %al, 7(%rdi)
> +; SSE2-NEXT:    movb %ch, 7(%rdi)
>  ; SSE2-NEXT:    retq
>  ;
>  ; SSE4-LABEL: truncstore_v8i64_v8i8:
> @@ -977,6 +971,7 @@ define void @truncstore_v8i64_v8i8(<8 x
>  ; SSE4-NEXT:    blendvpd %xmm0, %xmm2, %xmm6
>  ; SSE4-NEXT:    packusdw %xmm7, %xmm6
>  ; SSE4-NEXT:    packusdw %xmm6, %xmm1
> +; SSE4-NEXT:    packuswb %xmm1, %xmm1
>  ; SSE4-NEXT:    pcmpeqd %xmm8, %xmm5
>  ; SSE4-NEXT:    pcmpeqd %xmm0, %xmm0
>  ; SSE4-NEXT:    pxor %xmm0, %xmm5
> @@ -1015,31 +1010,31 @@ define void @truncstore_v8i64_v8i8(<8 x
>  ; SSE4-NEXT:    testb $2, %al
>  ; SSE4-NEXT:    je .LBB2_4
>  ; SSE4-NEXT:  .LBB2_3: # %cond.store1
> -; SSE4-NEXT:    pextrb $2, %xmm1, 1(%rdi)
> +; SSE4-NEXT:    pextrb $1, %xmm1, 1(%rdi)
>  ; SSE4-NEXT:    testb $4, %al
>  ; SSE4-NEXT:    je .LBB2_6
>  ; SSE4-NEXT:  .LBB2_5: # %cond.store3
> -; SSE4-NEXT:    pextrb $4, %xmm1, 2(%rdi)
> +; SSE4-NEXT:    pextrb $2, %xmm1, 2(%rdi)
>  ; SSE4-NEXT:    testb $8, %al
>  ; SSE4-NEXT:    je .LBB2_8
>  ; SSE4-NEXT:  .LBB2_7: # %cond.store5
> -; SSE4-NEXT:    pextrb $6, %xmm1, 3(%rdi)
> +; SSE4-NEXT:    pextrb $3, %xmm1, 3(%rdi)
>  ; SSE4-NEXT:    testb $16, %al
>  ; SSE4-NEXT:    je .LBB2_10
>  ; SSE4-NEXT:  .LBB2_9: # %cond.store7
> -; SSE4-NEXT:    pextrb $8, %xmm1, 4(%rdi)
> +; SSE4-NEXT:    pextrb $4, %xmm1, 4(%rdi)
>  ; SSE4-NEXT:    testb $32, %al
>  ; SSE4-NEXT:    je .LBB2_12
>  ; SSE4-NEXT:  .LBB2_11: # %cond.store9
> -; SSE4-NEXT:    pextrb $10, %xmm1, 5(%rdi)
> +; SSE4-NEXT:    pextrb $5, %xmm1, 5(%rdi)
>  ; SSE4-NEXT:    testb $64, %al
>  ; SSE4-NEXT:    je .LBB2_14
>  ; SSE4-NEXT:  .LBB2_13: # %cond.store11
> -; SSE4-NEXT:    pextrb $12, %xmm1, 6(%rdi)
> +; SSE4-NEXT:    pextrb $6, %xmm1, 6(%rdi)
>  ; SSE4-NEXT:    testb $-128, %al
>  ; SSE4-NEXT:    je .LBB2_16
>  ; SSE4-NEXT:  .LBB2_15: # %cond.store13
> -; SSE4-NEXT:    pextrb $14, %xmm1, 7(%rdi)
> +; SSE4-NEXT:    pextrb $7, %xmm1, 7(%rdi)
>  ; SSE4-NEXT:    retq
>  ;
>  ; AVX1-LABEL: truncstore_v8i64_v8i8:
> @@ -1064,6 +1059,7 @@ define void @truncstore_v8i64_v8i8(<8 x
>  ; AVX1-NEXT:    vblendvpd %xmm8, %xmm0, %xmm5, %xmm0
>  ; AVX1-NEXT:    vpackusdw %xmm3, %xmm0, %xmm0
>  ; AVX1-NEXT:    vpackusdw %xmm1, %xmm0, %xmm0
> +; AVX1-NEXT:    vpackuswb %xmm0, %xmm0, %xmm0
>  ; AVX1-NEXT:    vextractf128 $1, %ymm2, %xmm1
>  ; AVX1-NEXT:    vpxor %xmm3, %xmm3, %xmm3
>  ; AVX1-NEXT:    vpcmpeqd %xmm3, %xmm1, %xmm1
> @@ -1102,31 +1098,31 @@ define void @truncstore_v8i64_v8i8(<8 x
>  ; AVX1-NEXT:    testb $2, %al
>  ; AVX1-NEXT:    je .LBB2_4
>  ; AVX1-NEXT:  .LBB2_3: # %cond.store1
> -; AVX1-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> +; AVX1-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX1-NEXT:    testb $4, %al
>  ; AVX1-NEXT:    je .LBB2_6
>  ; AVX1-NEXT:  .LBB2_5: # %cond.store3
> -; AVX1-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> +; AVX1-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
>  ; AVX1-NEXT:    testb $8, %al
>  ; AVX1-NEXT:    je .LBB2_8
>  ; AVX1-NEXT:  .LBB2_7: # %cond.store5
> -; AVX1-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> +; AVX1-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
>  ; AVX1-NEXT:    testb $16, %al
>  ; AVX1-NEXT:    je .LBB2_10
>  ; AVX1-NEXT:  .LBB2_9: # %cond.store7
> -; AVX1-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> +; AVX1-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
>  ; AVX1-NEXT:    testb $32, %al
>  ; AVX1-NEXT:    je .LBB2_12
>  ; AVX1-NEXT:  .LBB2_11: # %cond.store9
> -; AVX1-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> +; AVX1-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
>  ; AVX1-NEXT:    testb $64, %al
>  ; AVX1-NEXT:    je .LBB2_14
>  ; AVX1-NEXT:  .LBB2_13: # %cond.store11
> -; AVX1-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> +; AVX1-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
>  ; AVX1-NEXT:    testb $-128, %al
>  ; AVX1-NEXT:    je .LBB2_16
>  ; AVX1-NEXT:  .LBB2_15: # %cond.store13
> -; AVX1-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> +; AVX1-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
>  ; AVX1-NEXT:    vzeroupper
>  ; AVX1-NEXT:    retq
>  ;
> @@ -1135,17 +1131,24 @@ define void @truncstore_v8i64_v8i8(<8 x
>  ; AVX2-NEXT:    vpxor %xmm3, %xmm3, %xmm3
>  ; AVX2-NEXT:    vbroadcastsd {{.*#+}} ymm4 = [255,255,255,255]
>  ; AVX2-NEXT:    vpbroadcastq {{.*#+}} ymm5 = [9223372036854775808,9223372036854775808,9223372036854775808,9223372036854775808]
> -; AVX2-NEXT:    vpxor %ymm5, %ymm1, %ymm6
> +; AVX2-NEXT:    vpxor %ymm5, %ymm0, %ymm6
>  ; AVX2-NEXT:    vpbroadcastq {{.*#+}} ymm7 = [9223372036854776063,9223372036854776063,9223372036854776063,9223372036854776063]
>  ; AVX2-NEXT:    vpcmpgtq %ymm6, %ymm7, %ymm6
> -; AVX2-NEXT:    vblendvpd %ymm6, %ymm1, %ymm4, %ymm1
> -; AVX2-NEXT:    vpxor %ymm5, %ymm0, %ymm5
> +; AVX2-NEXT:    vblendvpd %ymm6, %ymm0, %ymm4, %ymm0
> +; AVX2-NEXT:    vpxor %ymm5, %ymm1, %ymm5
>  ; AVX2-NEXT:    vpcmpgtq %ymm5, %ymm7, %ymm5
> -; AVX2-NEXT:    vblendvpd %ymm5, %ymm0, %ymm4, %ymm0
> -; AVX2-NEXT:    vpackusdw %ymm1, %ymm0, %ymm0
> -; AVX2-NEXT:    vpermq {{.*#+}} ymm0 = ymm0[0,2,1,3]
> -; AVX2-NEXT:    vextracti128 $1, %ymm0, %xmm1
> -; AVX2-NEXT:    vpackusdw %xmm1, %xmm0, %xmm0
> +; AVX2-NEXT:    vblendvpd %ymm5, %ymm1, %ymm4, %ymm1
> +; AVX2-NEXT:    vextractf128 $1, %ymm1, %xmm4
> +; AVX2-NEXT:    vmovdqa {{.*#+}} xmm5 = <u,u,0,8,u,u,u,u,u,u,u,u,u,u,u,u>
> +; AVX2-NEXT:    vpshufb %xmm5, %xmm4, %xmm4
> +; AVX2-NEXT:    vpshufb %xmm5, %xmm1, %xmm1
> +; AVX2-NEXT:    vpunpcklwd {{.*#+}} xmm1 = xmm1[0],xmm4[0],xmm1[1],xmm4[1],xmm1[2],xmm4[2],xmm1[3],xmm4[3]
> +; AVX2-NEXT:    vextractf128 $1, %ymm0, %xmm4
> +; AVX2-NEXT:    vmovdqa {{.*#+}} xmm5 = <0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u>
> +; AVX2-NEXT:    vpshufb %xmm5, %xmm4, %xmm4
> +; AVX2-NEXT:    vpshufb %xmm5, %xmm0, %xmm0
> +; AVX2-NEXT:    vpunpcklwd {{.*#+}} xmm0 = xmm0[0],xmm4[0],xmm0[1],xmm4[1],xmm0[2],xmm4[2],xmm0[3],xmm4[3]
> +; AVX2-NEXT:    vpblendd {{.*#+}} xmm0 = xmm0[0],xmm1[1],xmm0[2,3]
>  ; AVX2-NEXT:    vpcmpeqd %ymm3, %ymm2, %ymm1
>  ; AVX2-NEXT:    vmovmskps %ymm1, %eax
>  ; AVX2-NEXT:    notl %eax
> @@ -1180,31 +1183,31 @@ define void @truncstore_v8i64_v8i8(<8 x
>  ; AVX2-NEXT:    testb $2, %al
>  ; AVX2-NEXT:    je .LBB2_4
>  ; AVX2-NEXT:  .LBB2_3: # %cond.store1
> -; AVX2-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> +; AVX2-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX2-NEXT:    testb $4, %al
>  ; AVX2-NEXT:    je .LBB2_6
>  ; AVX2-NEXT:  .LBB2_5: # %cond.store3
> -; AVX2-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> +; AVX2-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
>  ; AVX2-NEXT:    testb $8, %al
>  ; AVX2-NEXT:    je .LBB2_8
>  ; AVX2-NEXT:  .LBB2_7: # %cond.store5
> -; AVX2-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> +; AVX2-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
>  ; AVX2-NEXT:    testb $16, %al
>  ; AVX2-NEXT:    je .LBB2_10
>  ; AVX2-NEXT:  .LBB2_9: # %cond.store7
> -; AVX2-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> +; AVX2-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
>  ; AVX2-NEXT:    testb $32, %al
>  ; AVX2-NEXT:    je .LBB2_12
>  ; AVX2-NEXT:  .LBB2_11: # %cond.store9
> -; AVX2-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> +; AVX2-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
>  ; AVX2-NEXT:    testb $64, %al
>  ; AVX2-NEXT:    je .LBB2_14
>  ; AVX2-NEXT:  .LBB2_13: # %cond.store11
> -; AVX2-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> +; AVX2-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
>  ; AVX2-NEXT:    testb $-128, %al
>  ; AVX2-NEXT:    je .LBB2_16
>  ; AVX2-NEXT:  .LBB2_15: # %cond.store13
> -; AVX2-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> +; AVX2-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
>  ; AVX2-NEXT:    vzeroupper
>  ; AVX2-NEXT:    retq
>  ;
> @@ -1213,7 +1216,7 @@ define void @truncstore_v8i64_v8i8(<8 x
>  ; AVX512F-NEXT:    # kill: def $ymm1 killed $ymm1 def $zmm1
>  ; AVX512F-NEXT:    vptestmd %zmm1, %zmm1, %k0
>  ; AVX512F-NEXT:    vpminuq {{.*}}(%rip){1to8}, %zmm0, %zmm0
> -; AVX512F-NEXT:    vpmovqw %zmm0, %xmm0
> +; AVX512F-NEXT:    vpmovqb %zmm0, %xmm0
>  ; AVX512F-NEXT:    kmovw %k0, %eax
>  ; AVX512F-NEXT:    testb $1, %al
>  ; AVX512F-NEXT:    jne .LBB2_1
> @@ -1246,31 +1249,31 @@ define void @truncstore_v8i64_v8i8(<8 x
>  ; AVX512F-NEXT:    testb $2, %al
>  ; AVX512F-NEXT:    je .LBB2_4
>  ; AVX512F-NEXT:  .LBB2_3: # %cond.store1
> -; AVX512F-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> +; AVX512F-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX512F-NEXT:    testb $4, %al
>  ; AVX512F-NEXT:    je .LBB2_6
>  ; AVX512F-NEXT:  .LBB2_5: # %cond.store3
> -; AVX512F-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> +; AVX512F-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
>  ; AVX512F-NEXT:    testb $8, %al
>  ; AVX512F-NEXT:    je .LBB2_8
>  ; AVX512F-NEXT:  .LBB2_7: # %cond.store5
> -; AVX512F-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> +; AVX512F-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
>  ; AVX512F-NEXT:    testb $16, %al
>  ; AVX512F-NEXT:    je .LBB2_10
>  ; AVX512F-NEXT:  .LBB2_9: # %cond.store7
> -; AVX512F-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> +; AVX512F-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
>  ; AVX512F-NEXT:    testb $32, %al
>  ; AVX512F-NEXT:    je .LBB2_12
>  ; AVX512F-NEXT:  .LBB2_11: # %cond.store9
> -; AVX512F-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> +; AVX512F-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
>  ; AVX512F-NEXT:    testb $64, %al
>  ; AVX512F-NEXT:    je .LBB2_14
>  ; AVX512F-NEXT:  .LBB2_13: # %cond.store11
> -; AVX512F-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> +; AVX512F-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
>  ; AVX512F-NEXT:    testb $-128, %al
>  ; AVX512F-NEXT:    je .LBB2_16
>  ; AVX512F-NEXT:  .LBB2_15: # %cond.store13
> -; AVX512F-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> +; AVX512F-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
>  ; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
>  ;
> @@ -1504,7 +1507,7 @@ define void @truncstore_v4i64_v4i16(<4 x
>  ; SSE2-NEXT:    pxor %xmm3, %xmm3
>  ; SSE2-NEXT:    movdqa {{.*#+}} xmm8 = [65535,65535]
>  ; SSE2-NEXT:    movdqa {{.*#+}} xmm5 = [9223372039002259456,9223372039002259456]
> -; SSE2-NEXT:    movdqa %xmm1, %xmm6
> +; SSE2-NEXT:    movdqa %xmm0, %xmm6
>  ; SSE2-NEXT:    pxor %xmm5, %xmm6
>  ; SSE2-NEXT:    movdqa {{.*#+}} xmm9 = [9223372039002324991,9223372039002324991]
>  ; SSE2-NEXT:    movdqa %xmm9, %xmm7
> @@ -1515,22 +1518,26 @@ define void @truncstore_v4i64_v4i16(<4 x
>  ; SSE2-NEXT:    pand %xmm4, %xmm6
>  ; SSE2-NEXT:    pshufd {{.*#+}} xmm4 = xmm7[1,1,3,3]
>  ; SSE2-NEXT:    por %xmm6, %xmm4
> -; SSE2-NEXT:    pand %xmm4, %xmm1
> +; SSE2-NEXT:    pand %xmm4, %xmm0
>  ; SSE2-NEXT:    pandn %xmm8, %xmm4
> -; SSE2-NEXT:    por %xmm1, %xmm4
> -; SSE2-NEXT:    pxor %xmm0, %xmm5
> -; SSE2-NEXT:    movdqa %xmm9, %xmm1
> -; SSE2-NEXT:    pcmpgtd %xmm5, %xmm1
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm6 = xmm1[0,0,2,2]
> +; SSE2-NEXT:    por %xmm0, %xmm4
> +; SSE2-NEXT:    pxor %xmm1, %xmm5
> +; SSE2-NEXT:    movdqa %xmm9, %xmm0
> +; SSE2-NEXT:    pcmpgtd %xmm5, %xmm0
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm6 = xmm0[0,0,2,2]
>  ; SSE2-NEXT:    pcmpeqd %xmm9, %xmm5
>  ; SSE2-NEXT:    pshufd {{.*#+}} xmm5 = xmm5[1,1,3,3]
>  ; SSE2-NEXT:    pand %xmm6, %xmm5
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
> -; SSE2-NEXT:    por %xmm5, %xmm1
> -; SSE2-NEXT:    pand %xmm1, %xmm0
> -; SSE2-NEXT:    pandn %xmm8, %xmm1
> -; SSE2-NEXT:    por %xmm0, %xmm1
> -; SSE2-NEXT:    shufps {{.*#+}} xmm1 = xmm1[0,2],xmm4[0,2]
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,1,3,3]
> +; SSE2-NEXT:    por %xmm5, %xmm0
> +; SSE2-NEXT:    pand %xmm0, %xmm1
> +; SSE2-NEXT:    pandn %xmm8, %xmm0
> +; SSE2-NEXT:    por %xmm1, %xmm0
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> +; SSE2-NEXT:    pshuflw {{.*#+}} xmm1 = xmm0[0,2,2,3,4,5,6,7]
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm4[0,2,2,3]
> +; SSE2-NEXT:    pshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> +; SSE2-NEXT:    punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
>  ; SSE2-NEXT:    pcmpeqd %xmm2, %xmm3
>  ; SSE2-NEXT:    movmskps %xmm3, %eax
>  ; SSE2-NEXT:    xorl $15, %eax
> @@ -1548,45 +1555,49 @@ define void @truncstore_v4i64_v4i16(<4 x
>  ; SSE2-NEXT:  .LBB4_8: # %else6
>  ; SSE2-NEXT:    retq
>  ; SSE2-NEXT:  .LBB4_1: # %cond.store
> -; SSE2-NEXT:    movd %xmm1, %ecx
> +; SSE2-NEXT:    movd %xmm0, %ecx
>  ; SSE2-NEXT:    movw %cx, (%rdi)
>  ; SSE2-NEXT:    testb $2, %al
>  ; SSE2-NEXT:    je .LBB4_4
>  ; SSE2-NEXT:  .LBB4_3: # %cond.store1
> -; SSE2-NEXT:    pextrw $2, %xmm1, %ecx
> +; SSE2-NEXT:    pextrw $1, %xmm0, %ecx
>  ; SSE2-NEXT:    movw %cx, 2(%rdi)
>  ; SSE2-NEXT:    testb $4, %al
>  ; SSE2-NEXT:    je .LBB4_6
>  ; SSE2-NEXT:  .LBB4_5: # %cond.store3
> -; SSE2-NEXT:    pextrw $4, %xmm1, %ecx
> +; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
>  ; SSE2-NEXT:    movw %cx, 4(%rdi)
>  ; SSE2-NEXT:    testb $8, %al
>  ; SSE2-NEXT:    je .LBB4_8
>  ; SSE2-NEXT:  .LBB4_7: # %cond.store5
> -; SSE2-NEXT:    pextrw $6, %xmm1, %eax
> +; SSE2-NEXT:    pextrw $3, %xmm0, %eax
>  ; SSE2-NEXT:    movw %ax, 6(%rdi)
>  ; SSE2-NEXT:    retq
>  ;
>  ; SSE4-LABEL: truncstore_v4i64_v4i16:
>  ; SSE4:       # %bb.0:
> -; SSE4-NEXT:    movdqa %xmm0, %xmm8
> -; SSE4-NEXT:    pxor %xmm6, %xmm6
> -; SSE4-NEXT:    movapd {{.*#+}} xmm5 = [65535,65535]
> +; SSE4-NEXT:    movdqa %xmm0, %xmm5
> +; SSE4-NEXT:    pxor %xmm8, %xmm8
> +; SSE4-NEXT:    movapd {{.*#+}} xmm6 = [65535,65535]
>  ; SSE4-NEXT:    movdqa {{.*#+}} xmm7 = [9223372036854775808,9223372036854775808]
> -; SSE4-NEXT:    movdqa %xmm1, %xmm3
> +; SSE4-NEXT:    movdqa %xmm0, %xmm3
>  ; SSE4-NEXT:    pxor %xmm7, %xmm3
>  ; SSE4-NEXT:    movdqa {{.*#+}} xmm4 = [9223372036854841343,9223372036854841343]
>  ; SSE4-NEXT:    movdqa %xmm4, %xmm0
>  ; SSE4-NEXT:    pcmpgtq %xmm3, %xmm0
> -; SSE4-NEXT:    movapd %xmm5, %xmm3
> -; SSE4-NEXT:    blendvpd %xmm0, %xmm1, %xmm3
> -; SSE4-NEXT:    pxor %xmm8, %xmm7
> +; SSE4-NEXT:    movapd %xmm6, %xmm3
> +; SSE4-NEXT:    blendvpd %xmm0, %xmm5, %xmm3
> +; SSE4-NEXT:    pxor %xmm1, %xmm7
>  ; SSE4-NEXT:    pcmpgtq %xmm7, %xmm4
>  ; SSE4-NEXT:    movdqa %xmm4, %xmm0
> -; SSE4-NEXT:    blendvpd %xmm0, %xmm8, %xmm5
> -; SSE4-NEXT:    packusdw %xmm3, %xmm5
> -; SSE4-NEXT:    pcmpeqd %xmm2, %xmm6
> -; SSE4-NEXT:    movmskps %xmm6, %eax
> +; SSE4-NEXT:    blendvpd %xmm0, %xmm1, %xmm6
> +; SSE4-NEXT:    pshufd {{.*#+}} xmm0 = xmm6[0,2,2,3]
> +; SSE4-NEXT:    pshuflw {{.*#+}} xmm1 = xmm0[0,2,2,3,4,5,6,7]
> +; SSE4-NEXT:    pshufd {{.*#+}} xmm0 = xmm3[0,2,2,3]
> +; SSE4-NEXT:    pshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> +; SSE4-NEXT:    punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
> +; SSE4-NEXT:    pcmpeqd %xmm2, %xmm8
> +; SSE4-NEXT:    movmskps %xmm8, %eax
>  ; SSE4-NEXT:    xorl $15, %eax
>  ; SSE4-NEXT:    testb $1, %al
>  ; SSE4-NEXT:    jne .LBB4_1
> @@ -1602,19 +1613,19 @@ define void @truncstore_v4i64_v4i16(<4 x
>  ; SSE4-NEXT:  .LBB4_8: # %else6
>  ; SSE4-NEXT:    retq
>  ; SSE4-NEXT:  .LBB4_1: # %cond.store
> -; SSE4-NEXT:    pextrw $0, %xmm5, (%rdi)
> +; SSE4-NEXT:    pextrw $0, %xmm0, (%rdi)
>  ; SSE4-NEXT:    testb $2, %al
>  ; SSE4-NEXT:    je .LBB4_4
>  ; SSE4-NEXT:  .LBB4_3: # %cond.store1
> -; SSE4-NEXT:    pextrw $2, %xmm5, 2(%rdi)
> +; SSE4-NEXT:    pextrw $1, %xmm0, 2(%rdi)
>  ; SSE4-NEXT:    testb $4, %al
>  ; SSE4-NEXT:    je .LBB4_6
>  ; SSE4-NEXT:  .LBB4_5: # %cond.store3
> -; SSE4-NEXT:    pextrw $4, %xmm5, 4(%rdi)
> +; SSE4-NEXT:    pextrw $2, %xmm0, 4(%rdi)
>  ; SSE4-NEXT:    testb $8, %al
>  ; SSE4-NEXT:    je .LBB4_8
>  ; SSE4-NEXT:  .LBB4_7: # %cond.store5
> -; SSE4-NEXT:    pextrw $6, %xmm5, 6(%rdi)
> +; SSE4-NEXT:    pextrw $3, %xmm0, 6(%rdi)
>  ; SSE4-NEXT:    retq
>  ;
>  ; AVX1-LABEL: truncstore_v4i64_v4i16:
> @@ -1629,8 +1640,12 @@ define void @truncstore_v4i64_v4i16(<4 x
>  ; AVX1-NEXT:    vpcmpgtq %xmm3, %xmm5, %xmm3
>  ; AVX1-NEXT:    vmovapd {{.*#+}} xmm5 = [65535,65535]
>  ; AVX1-NEXT:    vblendvpd %xmm3, %xmm6, %xmm5, %xmm3
> +; AVX1-NEXT:    vpermilps {{.*#+}} xmm3 = xmm3[0,2,2,3]
> +; AVX1-NEXT:    vpshuflw {{.*#+}} xmm3 = xmm3[0,2,2,3,4,5,6,7]
>  ; AVX1-NEXT:    vblendvpd %xmm4, %xmm0, %xmm5, %xmm0
> -; AVX1-NEXT:    vpackusdw %xmm3, %xmm0, %xmm0
> +; AVX1-NEXT:    vpermilps {{.*#+}} xmm0 = xmm0[0,2,2,3]
> +; AVX1-NEXT:    vpshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> +; AVX1-NEXT:    vpunpckldq {{.*#+}} xmm0 = xmm0[0],xmm3[0],xmm0[1],xmm3[1]
>  ; AVX1-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
>  ; AVX1-NEXT:    vmovmskps %xmm1, %eax
>  ; AVX1-NEXT:    xorl $15, %eax
> @@ -1653,15 +1668,15 @@ define void @truncstore_v4i64_v4i16(<4 x
>  ; AVX1-NEXT:    testb $2, %al
>  ; AVX1-NEXT:    je .LBB4_4
>  ; AVX1-NEXT:  .LBB4_3: # %cond.store1
> -; AVX1-NEXT:    vpextrw $2, %xmm0, 2(%rdi)
> +; AVX1-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
>  ; AVX1-NEXT:    testb $4, %al
>  ; AVX1-NEXT:    je .LBB4_6
>  ; AVX1-NEXT:  .LBB4_5: # %cond.store3
> -; AVX1-NEXT:    vpextrw $4, %xmm0, 4(%rdi)
> +; AVX1-NEXT:    vpextrw $2, %xmm0, 4(%rdi)
>  ; AVX1-NEXT:    testb $8, %al
>  ; AVX1-NEXT:    je .LBB4_8
>  ; AVX1-NEXT:  .LBB4_7: # %cond.store5
> -; AVX1-NEXT:    vpextrw $6, %xmm0, 6(%rdi)
> +; AVX1-NEXT:    vpextrw $3, %xmm0, 6(%rdi)
>  ; AVX1-NEXT:    vzeroupper
>  ; AVX1-NEXT:    retq
>  ;
> @@ -1675,7 +1690,11 @@ define void @truncstore_v4i64_v4i16(<4 x
>  ; AVX2-NEXT:    vpcmpgtq %ymm4, %ymm5, %ymm4
>  ; AVX2-NEXT:    vblendvpd %ymm4, %ymm0, %ymm3, %ymm0
>  ; AVX2-NEXT:    vextractf128 $1, %ymm0, %xmm3
> -; AVX2-NEXT:    vpackusdw %xmm3, %xmm0, %xmm0
> +; AVX2-NEXT:    vpermilps {{.*#+}} xmm3 = xmm3[0,2,2,3]
> +; AVX2-NEXT:    vpshuflw {{.*#+}} xmm3 = xmm3[0,2,2,3,4,5,6,7]
> +; AVX2-NEXT:    vpermilps {{.*#+}} xmm0 = xmm0[0,2,2,3]
> +; AVX2-NEXT:    vpshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> +; AVX2-NEXT:    vpunpckldq {{.*#+}} xmm0 = xmm0[0],xmm3[0],xmm0[1],xmm3[1]
>  ; AVX2-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
>  ; AVX2-NEXT:    vmovmskps %xmm1, %eax
>  ; AVX2-NEXT:    xorl $15, %eax
> @@ -1698,15 +1717,15 @@ define void @truncstore_v4i64_v4i16(<4 x
>  ; AVX2-NEXT:    testb $2, %al
>  ; AVX2-NEXT:    je .LBB4_4
>  ; AVX2-NEXT:  .LBB4_3: # %cond.store1
> -; AVX2-NEXT:    vpextrw $2, %xmm0, 2(%rdi)
> +; AVX2-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
>  ; AVX2-NEXT:    testb $4, %al
>  ; AVX2-NEXT:    je .LBB4_6
>  ; AVX2-NEXT:  .LBB4_5: # %cond.store3
> -; AVX2-NEXT:    vpextrw $4, %xmm0, 4(%rdi)
> +; AVX2-NEXT:    vpextrw $2, %xmm0, 4(%rdi)
>  ; AVX2-NEXT:    testb $8, %al
>  ; AVX2-NEXT:    je .LBB4_8
>  ; AVX2-NEXT:  .LBB4_7: # %cond.store5
> -; AVX2-NEXT:    vpextrw $6, %xmm0, 6(%rdi)
> +; AVX2-NEXT:    vpextrw $3, %xmm0, 6(%rdi)
>  ; AVX2-NEXT:    vzeroupper
>  ; AVX2-NEXT:    retq
>  ;
> @@ -1717,7 +1736,7 @@ define void @truncstore_v4i64_v4i16(<4 x
>  ; AVX512F-NEXT:    vptestmd %zmm1, %zmm1, %k0
>  ; AVX512F-NEXT:    vpbroadcastq {{.*#+}} ymm1 = [65535,65535,65535,65535]
>  ; AVX512F-NEXT:    vpminuq %zmm1, %zmm0, %zmm0
> -; AVX512F-NEXT:    vpmovqd %zmm0, %ymm0
> +; AVX512F-NEXT:    vpmovqw %zmm0, %xmm0
>  ; AVX512F-NEXT:    kmovw %k0, %eax
>  ; AVX512F-NEXT:    testb $1, %al
>  ; AVX512F-NEXT:    jne .LBB4_1
> @@ -1738,15 +1757,15 @@ define void @truncstore_v4i64_v4i16(<4 x
>  ; AVX512F-NEXT:    testb $2, %al
>  ; AVX512F-NEXT:    je .LBB4_4
>  ; AVX512F-NEXT:  .LBB4_3: # %cond.store1
> -; AVX512F-NEXT:    vpextrw $2, %xmm0, 2(%rdi)
> +; AVX512F-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
>  ; AVX512F-NEXT:    testb $4, %al
>  ; AVX512F-NEXT:    je .LBB4_6
>  ; AVX512F-NEXT:  .LBB4_5: # %cond.store3
> -; AVX512F-NEXT:    vpextrw $4, %xmm0, 4(%rdi)
> +; AVX512F-NEXT:    vpextrw $2, %xmm0, 4(%rdi)
>  ; AVX512F-NEXT:    testb $8, %al
>  ; AVX512F-NEXT:    je .LBB4_8
>  ; AVX512F-NEXT:  .LBB4_7: # %cond.store5
> -; AVX512F-NEXT:    vpextrw $6, %xmm0, 6(%rdi)
> +; AVX512F-NEXT:    vpextrw $3, %xmm0, 6(%rdi)
>  ; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
>  ;
> @@ -1755,12 +1774,11 @@ define void @truncstore_v4i64_v4i16(<4 x
>  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
>  ; AVX512BW-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
>  ; AVX512BW-NEXT:    vptestmd %zmm1, %zmm1, %k0
> -; AVX512BW-NEXT:    vpbroadcastq {{.*#+}} ymm1 = [65535,65535,65535,65535]
> -; AVX512BW-NEXT:    vpminuq %zmm1, %zmm0, %zmm0
> -; AVX512BW-NEXT:    vpmovqd %zmm0, %ymm0
> -; AVX512BW-NEXT:    vpackusdw %xmm0, %xmm0, %xmm0
>  ; AVX512BW-NEXT:    kshiftld $28, %k0, %k0
>  ; AVX512BW-NEXT:    kshiftrd $28, %k0, %k1
> +; AVX512BW-NEXT:    vpbroadcastq {{.*#+}} ymm1 = [65535,65535,65535,65535]
> +; AVX512BW-NEXT:    vpminuq %zmm1, %zmm0, %zmm0
> +; AVX512BW-NEXT:    vpmovqw %zmm0, %xmm0
>  ; AVX512BW-NEXT:    vmovdqu16 %zmm0, (%rdi) {%k1}
>  ; AVX512BW-NEXT:    vzeroupper
>  ; AVX512BW-NEXT:    retq
> @@ -1783,92 +1801,99 @@ define void @truncstore_v4i64_v4i16(<4 x
>  define void @truncstore_v4i64_v4i8(<4 x i64> %x, <4 x i8>* %p, <4 x i32> %mask) {
>  ; SSE2-LABEL: truncstore_v4i64_v4i8:
>  ; SSE2:       # %bb.0:
> -; SSE2-NEXT:    pxor %xmm3, %xmm3
> +; SSE2-NEXT:    pxor %xmm9, %xmm9
>  ; SSE2-NEXT:    movdqa {{.*#+}} xmm8 = [255,255]
> -; SSE2-NEXT:    movdqa {{.*#+}} xmm5 = [9223372039002259456,9223372039002259456]
> -; SSE2-NEXT:    movdqa %xmm1, %xmm6
> -; SSE2-NEXT:    pxor %xmm5, %xmm6
> -; SSE2-NEXT:    movdqa {{.*#+}} xmm9 = [9223372039002259711,9223372039002259711]
> -; SSE2-NEXT:    movdqa %xmm9, %xmm7
> -; SSE2-NEXT:    pcmpgtd %xmm6, %xmm7
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm4 = xmm7[0,0,2,2]
> -; SSE2-NEXT:    pcmpeqd %xmm9, %xmm6
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm6 = xmm6[1,1,3,3]
> -; SSE2-NEXT:    pand %xmm4, %xmm6
> +; SSE2-NEXT:    movdqa {{.*#+}} xmm6 = [9223372039002259456,9223372039002259456]
> +; SSE2-NEXT:    movdqa %xmm0, %xmm4
> +; SSE2-NEXT:    pxor %xmm6, %xmm4
> +; SSE2-NEXT:    movdqa {{.*#+}} xmm10 = [9223372039002259711,9223372039002259711]
> +; SSE2-NEXT:    movdqa %xmm10, %xmm7
> +; SSE2-NEXT:    pcmpgtd %xmm4, %xmm7
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm7[0,0,2,2]
> +; SSE2-NEXT:    pcmpeqd %xmm10, %xmm4
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm5 = xmm4[1,1,3,3]
> +; SSE2-NEXT:    pand %xmm3, %xmm5
>  ; SSE2-NEXT:    pshufd {{.*#+}} xmm4 = xmm7[1,1,3,3]
> -; SSE2-NEXT:    por %xmm6, %xmm4
> -; SSE2-NEXT:    pand %xmm4, %xmm1
> +; SSE2-NEXT:    por %xmm5, %xmm4
> +; SSE2-NEXT:    pand %xmm4, %xmm0
>  ; SSE2-NEXT:    pandn %xmm8, %xmm4
> -; SSE2-NEXT:    por %xmm1, %xmm4
> -; SSE2-NEXT:    pxor %xmm0, %xmm5
> -; SSE2-NEXT:    movdqa %xmm9, %xmm1
> -; SSE2-NEXT:    pcmpgtd %xmm5, %xmm1
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm6 = xmm1[0,0,2,2]
> -; SSE2-NEXT:    pcmpeqd %xmm9, %xmm5
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm5 = xmm5[1,1,3,3]
> -; SSE2-NEXT:    pand %xmm6, %xmm5
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
> -; SSE2-NEXT:    por %xmm5, %xmm1
> -; SSE2-NEXT:    pand %xmm1, %xmm0
> -; SSE2-NEXT:    pandn %xmm8, %xmm1
> -; SSE2-NEXT:    por %xmm0, %xmm1
> -; SSE2-NEXT:    packuswb %xmm4, %xmm1
> -; SSE2-NEXT:    pcmpeqd %xmm2, %xmm3
> -; SSE2-NEXT:    movmskps %xmm3, %eax
> -; SSE2-NEXT:    xorl $15, %eax
> -; SSE2-NEXT:    testb $1, %al
> +; SSE2-NEXT:    por %xmm0, %xmm4
> +; SSE2-NEXT:    pxor %xmm1, %xmm6
> +; SSE2-NEXT:    movdqa %xmm10, %xmm0
> +; SSE2-NEXT:    pcmpgtd %xmm6, %xmm0
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm0[0,0,2,2]
> +; SSE2-NEXT:    pcmpeqd %xmm10, %xmm6
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm5 = xmm6[1,1,3,3]
> +; SSE2-NEXT:    pand %xmm3, %xmm5
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,1,3,3]
> +; SSE2-NEXT:    por %xmm5, %xmm0
> +; SSE2-NEXT:    pand %xmm0, %xmm1
> +; SSE2-NEXT:    pandn %xmm8, %xmm0
> +; SSE2-NEXT:    por %xmm1, %xmm0
> +; SSE2-NEXT:    pand %xmm8, %xmm0
> +; SSE2-NEXT:    pand %xmm8, %xmm4
> +; SSE2-NEXT:    packuswb %xmm0, %xmm4
> +; SSE2-NEXT:    packuswb %xmm4, %xmm4
> +; SSE2-NEXT:    packuswb %xmm4, %xmm4
> +; SSE2-NEXT:    pcmpeqd %xmm2, %xmm9
> +; SSE2-NEXT:    movmskps %xmm9, %ecx
> +; SSE2-NEXT:    xorl $15, %ecx
> +; SSE2-NEXT:    testb $1, %cl
> +; SSE2-NEXT:    movd %xmm4, %eax
>  ; SSE2-NEXT:    jne .LBB5_1
>  ; SSE2-NEXT:  # %bb.2: # %else
> -; SSE2-NEXT:    testb $2, %al
> +; SSE2-NEXT:    testb $2, %cl
>  ; SSE2-NEXT:    jne .LBB5_3
>  ; SSE2-NEXT:  .LBB5_4: # %else2
> -; SSE2-NEXT:    testb $4, %al
> +; SSE2-NEXT:    testb $4, %cl
>  ; SSE2-NEXT:    jne .LBB5_5
>  ; SSE2-NEXT:  .LBB5_6: # %else4
> -; SSE2-NEXT:    testb $8, %al
> +; SSE2-NEXT:    testb $8, %cl
>  ; SSE2-NEXT:    jne .LBB5_7
>  ; SSE2-NEXT:  .LBB5_8: # %else6
>  ; SSE2-NEXT:    retq
>  ; SSE2-NEXT:  .LBB5_1: # %cond.store
> -; SSE2-NEXT:    movd %xmm1, %ecx
> -; SSE2-NEXT:    movb %cl, (%rdi)
> -; SSE2-NEXT:    testb $2, %al
> +; SSE2-NEXT:    movb %al, (%rdi)
> +; SSE2-NEXT:    testb $2, %cl
>  ; SSE2-NEXT:    je .LBB5_4
>  ; SSE2-NEXT:  .LBB5_3: # %cond.store1
> -; SSE2-NEXT:    pextrw $2, %xmm1, %ecx
> -; SSE2-NEXT:    movb %cl, 1(%rdi)
> -; SSE2-NEXT:    testb $4, %al
> +; SSE2-NEXT:    movb %ah, 1(%rdi)
> +; SSE2-NEXT:    testb $4, %cl
>  ; SSE2-NEXT:    je .LBB5_6
>  ; SSE2-NEXT:  .LBB5_5: # %cond.store3
> -; SSE2-NEXT:    pextrw $4, %xmm1, %ecx
> -; SSE2-NEXT:    movb %cl, 2(%rdi)
> -; SSE2-NEXT:    testb $8, %al
> +; SSE2-NEXT:    movl %eax, %edx
> +; SSE2-NEXT:    shrl $16, %edx
> +; SSE2-NEXT:    movb %dl, 2(%rdi)
> +; SSE2-NEXT:    testb $8, %cl
>  ; SSE2-NEXT:    je .LBB5_8
>  ; SSE2-NEXT:  .LBB5_7: # %cond.store5
> -; SSE2-NEXT:    pextrw $6, %xmm1, %eax
> +; SSE2-NEXT:    shrl $24, %eax
>  ; SSE2-NEXT:    movb %al, 3(%rdi)
>  ; SSE2-NEXT:    retq
>  ;
>  ; SSE4-LABEL: truncstore_v4i64_v4i8:
>  ; SSE4:       # %bb.0:
> -; SSE4-NEXT:    movdqa %xmm0, %xmm8
> -; SSE4-NEXT:    pxor %xmm6, %xmm6
> -; SSE4-NEXT:    movapd {{.*#+}} xmm5 = [255,255]
> -; SSE4-NEXT:    movdqa {{.*#+}} xmm7 = [9223372036854775808,9223372036854775808]
> -; SSE4-NEXT:    movdqa %xmm1, %xmm3
> -; SSE4-NEXT:    pxor %xmm7, %xmm3
> +; SSE4-NEXT:    movdqa %xmm0, %xmm3
> +; SSE4-NEXT:    pxor %xmm8, %xmm8
> +; SSE4-NEXT:    movapd {{.*#+}} xmm7 = [255,255]
> +; SSE4-NEXT:    movdqa {{.*#+}} xmm6 = [9223372036854775808,9223372036854775808]
> +; SSE4-NEXT:    movdqa %xmm0, %xmm5
> +; SSE4-NEXT:    pxor %xmm6, %xmm5
>  ; SSE4-NEXT:    movdqa {{.*#+}} xmm4 = [9223372036854776063,9223372036854776063]
>  ; SSE4-NEXT:    movdqa %xmm4, %xmm0
> -; SSE4-NEXT:    pcmpgtq %xmm3, %xmm0
> -; SSE4-NEXT:    movapd %xmm5, %xmm3
> -; SSE4-NEXT:    blendvpd %xmm0, %xmm1, %xmm3
> -; SSE4-NEXT:    pxor %xmm8, %xmm7
> -; SSE4-NEXT:    pcmpgtq %xmm7, %xmm4
> +; SSE4-NEXT:    pcmpgtq %xmm5, %xmm0
> +; SSE4-NEXT:    movapd %xmm7, %xmm5
> +; SSE4-NEXT:    blendvpd %xmm0, %xmm3, %xmm5
> +; SSE4-NEXT:    pxor %xmm1, %xmm6
> +; SSE4-NEXT:    pcmpgtq %xmm6, %xmm4
>  ; SSE4-NEXT:    movdqa %xmm4, %xmm0
> -; SSE4-NEXT:    blendvpd %xmm0, %xmm8, %xmm5
> -; SSE4-NEXT:    packusdw %xmm3, %xmm5
> -; SSE4-NEXT:    pcmpeqd %xmm2, %xmm6
> -; SSE4-NEXT:    movmskps %xmm6, %eax
> +; SSE4-NEXT:    blendvpd %xmm0, %xmm1, %xmm7
> +; SSE4-NEXT:    movdqa {{.*#+}} xmm0 = <0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u>
> +; SSE4-NEXT:    pshufb %xmm0, %xmm7
> +; SSE4-NEXT:    pshufb %xmm0, %xmm5
> +; SSE4-NEXT:    punpcklwd {{.*#+}} xmm5 = xmm5[0],xmm7[0],xmm5[1],xmm7[1],xmm5[2],xmm7[2],xmm5[3],xmm7[3]
> +; SSE4-NEXT:    pcmpeqd %xmm2, %xmm8
> +; SSE4-NEXT:    movmskps %xmm8, %eax
>  ; SSE4-NEXT:    xorl $15, %eax
>  ; SSE4-NEXT:    testb $1, %al
>  ; SSE4-NEXT:    jne .LBB5_1
> @@ -1888,15 +1913,15 @@ define void @truncstore_v4i64_v4i8(<4 x
>  ; SSE4-NEXT:    testb $2, %al
>  ; SSE4-NEXT:    je .LBB5_4
>  ; SSE4-NEXT:  .LBB5_3: # %cond.store1
> -; SSE4-NEXT:    pextrb $4, %xmm5, 1(%rdi)
> +; SSE4-NEXT:    pextrb $1, %xmm5, 1(%rdi)
>  ; SSE4-NEXT:    testb $4, %al
>  ; SSE4-NEXT:    je .LBB5_6
>  ; SSE4-NEXT:  .LBB5_5: # %cond.store3
> -; SSE4-NEXT:    pextrb $8, %xmm5, 2(%rdi)
> +; SSE4-NEXT:    pextrb $2, %xmm5, 2(%rdi)
>  ; SSE4-NEXT:    testb $8, %al
>  ; SSE4-NEXT:    je .LBB5_8
>  ; SSE4-NEXT:  .LBB5_7: # %cond.store5
> -; SSE4-NEXT:    pextrb $12, %xmm5, 3(%rdi)
> +; SSE4-NEXT:    pextrb $3, %xmm5, 3(%rdi)
>  ; SSE4-NEXT:    retq
>  ;
>  ; AVX1-LABEL: truncstore_v4i64_v4i8:
> @@ -1911,8 +1936,11 @@ define void @truncstore_v4i64_v4i8(<4 x
>  ; AVX1-NEXT:    vpcmpgtq %xmm3, %xmm5, %xmm3
>  ; AVX1-NEXT:    vmovapd {{.*#+}} xmm5 = [255,255]
>  ; AVX1-NEXT:    vblendvpd %xmm3, %xmm6, %xmm5, %xmm3
> +; AVX1-NEXT:    vmovdqa {{.*#+}} xmm6 = <0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u>
> +; AVX1-NEXT:    vpshufb %xmm6, %xmm3, %xmm3
>  ; AVX1-NEXT:    vblendvpd %xmm4, %xmm0, %xmm5, %xmm0
> -; AVX1-NEXT:    vpackusdw %xmm3, %xmm0, %xmm0
> +; AVX1-NEXT:    vpshufb %xmm6, %xmm0, %xmm0
> +; AVX1-NEXT:    vpunpcklwd {{.*#+}} xmm0 = xmm0[0],xmm3[0],xmm0[1],xmm3[1],xmm0[2],xmm3[2],xmm0[3],xmm3[3]
>  ; AVX1-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
>  ; AVX1-NEXT:    vmovmskps %xmm1, %eax
>  ; AVX1-NEXT:    xorl $15, %eax
> @@ -1935,15 +1963,15 @@ define void @truncstore_v4i64_v4i8(<4 x
>  ; AVX1-NEXT:    testb $2, %al
>  ; AVX1-NEXT:    je .LBB5_4
>  ; AVX1-NEXT:  .LBB5_3: # %cond.store1
> -; AVX1-NEXT:    vpextrb $4, %xmm0, 1(%rdi)
> +; AVX1-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX1-NEXT:    testb $4, %al
>  ; AVX1-NEXT:    je .LBB5_6
>  ; AVX1-NEXT:  .LBB5_5: # %cond.store3
> -; AVX1-NEXT:    vpextrb $8, %xmm0, 2(%rdi)
> +; AVX1-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
>  ; AVX1-NEXT:    testb $8, %al
>  ; AVX1-NEXT:    je .LBB5_8
>  ; AVX1-NEXT:  .LBB5_7: # %cond.store5
> -; AVX1-NEXT:    vpextrb $12, %xmm0, 3(%rdi)
> +; AVX1-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
>  ; AVX1-NEXT:    vzeroupper
>  ; AVX1-NEXT:    retq
>  ;
> @@ -1957,7 +1985,10 @@ define void @truncstore_v4i64_v4i8(<4 x
>  ; AVX2-NEXT:    vpcmpgtq %ymm4, %ymm5, %ymm4
>  ; AVX2-NEXT:    vblendvpd %ymm4, %ymm0, %ymm3, %ymm0
>  ; AVX2-NEXT:    vextractf128 $1, %ymm0, %xmm3
> -; AVX2-NEXT:    vpackusdw %xmm3, %xmm0, %xmm0
> +; AVX2-NEXT:    vmovdqa {{.*#+}} xmm4 = <0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u>
> +; AVX2-NEXT:    vpshufb %xmm4, %xmm3, %xmm3
> +; AVX2-NEXT:    vpshufb %xmm4, %xmm0, %xmm0
> +; AVX2-NEXT:    vpunpcklwd {{.*#+}} xmm0 = xmm0[0],xmm3[0],xmm0[1],xmm3[1],xmm0[2],xmm3[2],xmm0[3],xmm3[3]
>  ; AVX2-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
>  ; AVX2-NEXT:    vmovmskps %xmm1, %eax
>  ; AVX2-NEXT:    xorl $15, %eax
> @@ -1980,15 +2011,15 @@ define void @truncstore_v4i64_v4i8(<4 x
>  ; AVX2-NEXT:    testb $2, %al
>  ; AVX2-NEXT:    je .LBB5_4
>  ; AVX2-NEXT:  .LBB5_3: # %cond.store1
> -; AVX2-NEXT:    vpextrb $4, %xmm0, 1(%rdi)
> +; AVX2-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX2-NEXT:    testb $4, %al
>  ; AVX2-NEXT:    je .LBB5_6
>  ; AVX2-NEXT:  .LBB5_5: # %cond.store3
> -; AVX2-NEXT:    vpextrb $8, %xmm0, 2(%rdi)
> +; AVX2-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
>  ; AVX2-NEXT:    testb $8, %al
>  ; AVX2-NEXT:    je .LBB5_8
>  ; AVX2-NEXT:  .LBB5_7: # %cond.store5
> -; AVX2-NEXT:    vpextrb $12, %xmm0, 3(%rdi)
> +; AVX2-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
>  ; AVX2-NEXT:    vzeroupper
>  ; AVX2-NEXT:    retq
>  ;
> @@ -1999,7 +2030,7 @@ define void @truncstore_v4i64_v4i8(<4 x
>  ; AVX512F-NEXT:    vptestmd %zmm1, %zmm1, %k0
>  ; AVX512F-NEXT:    vpbroadcastq {{.*#+}} ymm1 = [255,255,255,255]
>  ; AVX512F-NEXT:    vpminuq %zmm1, %zmm0, %zmm0
> -; AVX512F-NEXT:    vpmovqd %zmm0, %ymm0
> +; AVX512F-NEXT:    vpmovqb %zmm0, %xmm0
>  ; AVX512F-NEXT:    kmovw %k0, %eax
>  ; AVX512F-NEXT:    testb $1, %al
>  ; AVX512F-NEXT:    jne .LBB5_1
> @@ -2020,15 +2051,15 @@ define void @truncstore_v4i64_v4i8(<4 x
>  ; AVX512F-NEXT:    testb $2, %al
>  ; AVX512F-NEXT:    je .LBB5_4
>  ; AVX512F-NEXT:  .LBB5_3: # %cond.store1
> -; AVX512F-NEXT:    vpextrb $4, %xmm0, 1(%rdi)
> +; AVX512F-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX512F-NEXT:    testb $4, %al
>  ; AVX512F-NEXT:    je .LBB5_6
>  ; AVX512F-NEXT:  .LBB5_5: # %cond.store3
> -; AVX512F-NEXT:    vpextrb $8, %xmm0, 2(%rdi)
> +; AVX512F-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
>  ; AVX512F-NEXT:    testb $8, %al
>  ; AVX512F-NEXT:    je .LBB5_8
>  ; AVX512F-NEXT:  .LBB5_7: # %cond.store5
> -; AVX512F-NEXT:    vpextrb $12, %xmm0, 3(%rdi)
> +; AVX512F-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
>  ; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
>  ;
> @@ -2037,12 +2068,11 @@ define void @truncstore_v4i64_v4i8(<4 x
>  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
>  ; AVX512BW-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
>  ; AVX512BW-NEXT:    vptestmd %zmm1, %zmm1, %k0
> -; AVX512BW-NEXT:    vpbroadcastq {{.*#+}} ymm1 = [255,255,255,255]
> -; AVX512BW-NEXT:    vpminuq %zmm1, %zmm0, %zmm0
> -; AVX512BW-NEXT:    vpmovqd %zmm0, %ymm0
> -; AVX512BW-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
>  ; AVX512BW-NEXT:    kshiftlq $60, %k0, %k0
>  ; AVX512BW-NEXT:    kshiftrq $60, %k0, %k1
> +; AVX512BW-NEXT:    vpbroadcastq {{.*#+}} ymm1 = [255,255,255,255]
> +; AVX512BW-NEXT:    vpminuq %zmm1, %zmm0, %zmm0
> +; AVX512BW-NEXT:    vpmovqb %zmm0, %xmm0
>  ; AVX512BW-NEXT:    vmovdqu8 %zmm0, (%rdi) {%k1}
>  ; AVX512BW-NEXT:    vzeroupper
>  ; AVX512BW-NEXT:    retq
> @@ -2065,25 +2095,26 @@ define void @truncstore_v4i64_v4i8(<4 x
>  define void @truncstore_v2i64_v2i32(<2 x i64> %x, <2 x i32>* %p, <2 x i64> %mask) {
>  ; SSE2-LABEL: truncstore_v2i64_v2i32:
>  ; SSE2:       # %bb.0:
> -; SSE2-NEXT:    pxor %xmm3, %xmm3
> -; SSE2-NEXT:    movdqa {{.*#+}} xmm2 = [9223372039002259456,9223372039002259456]
> -; SSE2-NEXT:    pxor %xmm0, %xmm2
> +; SSE2-NEXT:    pxor %xmm2, %xmm2
> +; SSE2-NEXT:    movdqa {{.*#+}} xmm3 = [9223372039002259456,9223372039002259456]
> +; SSE2-NEXT:    pxor %xmm0, %xmm3
>  ; SSE2-NEXT:    movdqa {{.*#+}} xmm4 = [9223372039002259455,9223372039002259455]
>  ; SSE2-NEXT:    movdqa %xmm4, %xmm5
> -; SSE2-NEXT:    pcmpgtd %xmm2, %xmm5
> +; SSE2-NEXT:    pcmpgtd %xmm3, %xmm5
>  ; SSE2-NEXT:    pshufd {{.*#+}} xmm6 = xmm5[0,0,2,2]
> -; SSE2-NEXT:    pcmpeqd %xmm4, %xmm2
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm4 = xmm2[1,1,3,3]
> -; SSE2-NEXT:    pand %xmm6, %xmm4
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm2 = xmm5[1,1,3,3]
> -; SSE2-NEXT:    por %xmm4, %xmm2
> -; SSE2-NEXT:    pand %xmm2, %xmm0
> -; SSE2-NEXT:    pandn {{.*}}(%rip), %xmm2
> -; SSE2-NEXT:    por %xmm0, %xmm2
> -; SSE2-NEXT:    pcmpeqd %xmm1, %xmm3
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm3[1,0,3,2]
> -; SSE2-NEXT:    pand %xmm3, %xmm0
> -; SSE2-NEXT:    movmskpd %xmm0, %eax
> +; SSE2-NEXT:    pcmpeqd %xmm4, %xmm3
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm3[1,1,3,3]
> +; SSE2-NEXT:    pand %xmm6, %xmm3
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm4 = xmm5[1,1,3,3]
> +; SSE2-NEXT:    por %xmm3, %xmm4
> +; SSE2-NEXT:    pand %xmm4, %xmm0
> +; SSE2-NEXT:    pandn {{.*}}(%rip), %xmm4
> +; SSE2-NEXT:    por %xmm0, %xmm4
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm4[0,2,2,3]
> +; SSE2-NEXT:    pcmpeqd %xmm1, %xmm2
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm2[1,0,3,2]
> +; SSE2-NEXT:    pand %xmm2, %xmm1
> +; SSE2-NEXT:    movmskpd %xmm1, %eax
>  ; SSE2-NEXT:    xorl $3, %eax
>  ; SSE2-NEXT:    testb $1, %al
>  ; SSE2-NEXT:    jne .LBB6_1
> @@ -2093,26 +2124,27 @@ define void @truncstore_v2i64_v2i32(<2 x
>  ; SSE2-NEXT:  .LBB6_4: # %else2
>  ; SSE2-NEXT:    retq
>  ; SSE2-NEXT:  .LBB6_1: # %cond.store
> -; SSE2-NEXT:    movd %xmm2, (%rdi)
> +; SSE2-NEXT:    movd %xmm0, (%rdi)
>  ; SSE2-NEXT:    testb $2, %al
>  ; SSE2-NEXT:    je .LBB6_4
>  ; SSE2-NEXT:  .LBB6_3: # %cond.store1
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm2[2,3,0,1]
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,1,2,3]
>  ; SSE2-NEXT:    movd %xmm0, 4(%rdi)
>  ; SSE2-NEXT:    retq
>  ;
>  ; SSE4-LABEL: truncstore_v2i64_v2i32:
>  ; SSE4:       # %bb.0:
>  ; SSE4-NEXT:    movdqa %xmm0, %xmm2
> -; SSE4-NEXT:    pxor %xmm4, %xmm4
> -; SSE4-NEXT:    movapd {{.*#+}} xmm3 = [4294967295,4294967295]
> +; SSE4-NEXT:    pxor %xmm3, %xmm3
> +; SSE4-NEXT:    movapd {{.*#+}} xmm4 = [4294967295,4294967295]
>  ; SSE4-NEXT:    movdqa {{.*#+}} xmm5 = [9223372036854775808,9223372036854775808]
>  ; SSE4-NEXT:    pxor %xmm0, %xmm5
>  ; SSE4-NEXT:    movdqa {{.*#+}} xmm0 = [9223372041149743103,9223372041149743103]
>  ; SSE4-NEXT:    pcmpgtq %xmm5, %xmm0
> -; SSE4-NEXT:    blendvpd %xmm0, %xmm2, %xmm3
> -; SSE4-NEXT:    pcmpeqq %xmm1, %xmm4
> -; SSE4-NEXT:    movmskpd %xmm4, %eax
> +; SSE4-NEXT:    blendvpd %xmm0, %xmm2, %xmm4
> +; SSE4-NEXT:    pshufd {{.*#+}} xmm0 = xmm4[0,2,2,3]
> +; SSE4-NEXT:    pcmpeqq %xmm1, %xmm3
> +; SSE4-NEXT:    movmskpd %xmm3, %eax
>  ; SSE4-NEXT:    xorl $3, %eax
>  ; SSE4-NEXT:    testb $1, %al
>  ; SSE4-NEXT:    jne .LBB6_1
> @@ -2122,11 +2154,11 @@ define void @truncstore_v2i64_v2i32(<2 x
>  ; SSE4-NEXT:  .LBB6_4: # %else2
>  ; SSE4-NEXT:    retq
>  ; SSE4-NEXT:  .LBB6_1: # %cond.store
> -; SSE4-NEXT:    movss %xmm3, (%rdi)
> +; SSE4-NEXT:    movd %xmm0, (%rdi)
>  ; SSE4-NEXT:    testb $2, %al
>  ; SSE4-NEXT:    je .LBB6_4
>  ; SSE4-NEXT:  .LBB6_3: # %cond.store1
> -; SSE4-NEXT:    extractps $2, %xmm3, 4(%rdi)
> +; SSE4-NEXT:    pextrd $1, %xmm0, 4(%rdi)
>  ; SSE4-NEXT:    retq
>  ;
>  ; AVX1-LABEL: truncstore_v2i64_v2i32:
> @@ -2135,12 +2167,12 @@ define void @truncstore_v2i64_v2i32(<2 x
>  ; AVX1-NEXT:    vpcmpeqq %xmm2, %xmm1, %xmm1
>  ; AVX1-NEXT:    vpcmpeqd %xmm2, %xmm2, %xmm2
>  ; AVX1-NEXT:    vpxor %xmm2, %xmm1, %xmm1
> +; AVX1-NEXT:    vinsertps {{.*#+}} xmm1 = xmm1[0,2],zero,zero
>  ; AVX1-NEXT:    vmovapd {{.*#+}} xmm2 = [4294967295,4294967295]
>  ; AVX1-NEXT:    vpxor {{.*}}(%rip), %xmm0, %xmm3
>  ; AVX1-NEXT:    vmovdqa {{.*#+}} xmm4 = [9223372041149743103,9223372041149743103]
>  ; AVX1-NEXT:    vpcmpgtq %xmm3, %xmm4, %xmm3
>  ; AVX1-NEXT:    vblendvpd %xmm3, %xmm0, %xmm2, %xmm0
> -; AVX1-NEXT:    vinsertps {{.*#+}} xmm1 = xmm1[0,2],zero,zero
>  ; AVX1-NEXT:    vpermilps {{.*#+}} xmm0 = xmm0[0,2,2,3]
>  ; AVX1-NEXT:    vmaskmovps %xmm0, %xmm1, (%rdi)
>  ; AVX1-NEXT:    retq
> @@ -2151,12 +2183,12 @@ define void @truncstore_v2i64_v2i32(<2 x
>  ; AVX2-NEXT:    vpcmpeqq %xmm2, %xmm1, %xmm1
>  ; AVX2-NEXT:    vpcmpeqd %xmm2, %xmm2, %xmm2
>  ; AVX2-NEXT:    vpxor %xmm2, %xmm1, %xmm1
> +; AVX2-NEXT:    vinsertps {{.*#+}} xmm1 = xmm1[0,2],zero,zero
>  ; AVX2-NEXT:    vmovapd {{.*#+}} xmm2 = [4294967295,4294967295]
>  ; AVX2-NEXT:    vpxor {{.*}}(%rip), %xmm0, %xmm3
>  ; AVX2-NEXT:    vmovdqa {{.*#+}} xmm4 = [9223372041149743103,9223372041149743103]
>  ; AVX2-NEXT:    vpcmpgtq %xmm3, %xmm4, %xmm3
>  ; AVX2-NEXT:    vblendvpd %xmm3, %xmm0, %xmm2, %xmm0
> -; AVX2-NEXT:    vinsertps {{.*#+}} xmm1 = xmm1[0,2],zero,zero
>  ; AVX2-NEXT:    vpermilps {{.*#+}} xmm0 = xmm0[0,2,2,3]
>  ; AVX2-NEXT:    vpmaskmovd %xmm0, %xmm1, (%rdi)
>  ; AVX2-NEXT:    retq
> @@ -2166,11 +2198,11 @@ define void @truncstore_v2i64_v2i32(<2 x
>  ; AVX512F-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
>  ; AVX512F-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
>  ; AVX512F-NEXT:    vptestmq %zmm1, %zmm1, %k0
> +; AVX512F-NEXT:    kshiftlw $14, %k0, %k0
> +; AVX512F-NEXT:    kshiftrw $14, %k0, %k1
>  ; AVX512F-NEXT:    vmovdqa {{.*#+}} xmm1 = [4294967295,4294967295]
>  ; AVX512F-NEXT:    vpminuq %zmm1, %zmm0, %zmm0
>  ; AVX512F-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> -; AVX512F-NEXT:    kshiftlw $14, %k0, %k0
> -; AVX512F-NEXT:    kshiftrw $14, %k0, %k1
>  ; AVX512F-NEXT:    vmovdqu32 %zmm0, (%rdi) {%k1}
>  ; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
> @@ -2187,11 +2219,11 @@ define void @truncstore_v2i64_v2i32(<2 x
>  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
>  ; AVX512BW-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
>  ; AVX512BW-NEXT:    vptestmq %zmm1, %zmm1, %k0
> +; AVX512BW-NEXT:    kshiftlw $14, %k0, %k0
> +; AVX512BW-NEXT:    kshiftrw $14, %k0, %k1
>  ; AVX512BW-NEXT:    vmovdqa {{.*#+}} xmm1 = [4294967295,4294967295]
>  ; AVX512BW-NEXT:    vpminuq %zmm1, %zmm0, %zmm0
>  ; AVX512BW-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> -; AVX512BW-NEXT:    kshiftlw $14, %k0, %k0
> -; AVX512BW-NEXT:    kshiftrw $14, %k0, %k1
>  ; AVX512BW-NEXT:    vmovdqu32 %zmm0, (%rdi) {%k1}
>  ; AVX512BW-NEXT:    vzeroupper
>  ; AVX512BW-NEXT:    retq
> @@ -2206,25 +2238,27 @@ define void @truncstore_v2i64_v2i32(<2 x
>  define void @truncstore_v2i64_v2i16(<2 x i64> %x, <2 x i16>* %p, <2 x i64> %mask) {
>  ; SSE2-LABEL: truncstore_v2i64_v2i16:
>  ; SSE2:       # %bb.0:
> -; SSE2-NEXT:    pxor %xmm3, %xmm3
> -; SSE2-NEXT:    movdqa {{.*#+}} xmm2 = [9223372039002259456,9223372039002259456]
> -; SSE2-NEXT:    pxor %xmm0, %xmm2
> +; SSE2-NEXT:    pxor %xmm2, %xmm2
> +; SSE2-NEXT:    movdqa {{.*#+}} xmm3 = [9223372039002259456,9223372039002259456]
> +; SSE2-NEXT:    pxor %xmm0, %xmm3
>  ; SSE2-NEXT:    movdqa {{.*#+}} xmm4 = [9223372039002324991,9223372039002324991]
>  ; SSE2-NEXT:    movdqa %xmm4, %xmm5
> -; SSE2-NEXT:    pcmpgtd %xmm2, %xmm5
> +; SSE2-NEXT:    pcmpgtd %xmm3, %xmm5
>  ; SSE2-NEXT:    pshufd {{.*#+}} xmm6 = xmm5[0,0,2,2]
> -; SSE2-NEXT:    pcmpeqd %xmm4, %xmm2
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm4 = xmm2[1,1,3,3]
> -; SSE2-NEXT:    pand %xmm6, %xmm4
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm2 = xmm5[1,1,3,3]
> -; SSE2-NEXT:    por %xmm4, %xmm2
> -; SSE2-NEXT:    pand %xmm2, %xmm0
> -; SSE2-NEXT:    pandn {{.*}}(%rip), %xmm2
> -; SSE2-NEXT:    por %xmm0, %xmm2
> -; SSE2-NEXT:    pcmpeqd %xmm1, %xmm3
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm3[1,0,3,2]
> -; SSE2-NEXT:    pand %xmm3, %xmm0
> -; SSE2-NEXT:    movmskpd %xmm0, %eax
> +; SSE2-NEXT:    pcmpeqd %xmm4, %xmm3
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm3[1,1,3,3]
> +; SSE2-NEXT:    pand %xmm6, %xmm3
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm4 = xmm5[1,1,3,3]
> +; SSE2-NEXT:    por %xmm3, %xmm4
> +; SSE2-NEXT:    pand %xmm4, %xmm0
> +; SSE2-NEXT:    pandn {{.*}}(%rip), %xmm4
> +; SSE2-NEXT:    por %xmm0, %xmm4
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm4[0,2,2,3]
> +; SSE2-NEXT:    pshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> +; SSE2-NEXT:    pcmpeqd %xmm1, %xmm2
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm2[1,0,3,2]
> +; SSE2-NEXT:    pand %xmm2, %xmm1
> +; SSE2-NEXT:    movmskpd %xmm1, %eax
>  ; SSE2-NEXT:    xorl $3, %eax
>  ; SSE2-NEXT:    testb $1, %al
>  ; SSE2-NEXT:    jne .LBB7_1
> @@ -2234,27 +2268,29 @@ define void @truncstore_v2i64_v2i16(<2 x
>  ; SSE2-NEXT:  .LBB7_4: # %else2
>  ; SSE2-NEXT:    retq
>  ; SSE2-NEXT:  .LBB7_1: # %cond.store
> -; SSE2-NEXT:    movd %xmm2, %ecx
> +; SSE2-NEXT:    movd %xmm0, %ecx
>  ; SSE2-NEXT:    movw %cx, (%rdi)
>  ; SSE2-NEXT:    testb $2, %al
>  ; SSE2-NEXT:    je .LBB7_4
>  ; SSE2-NEXT:  .LBB7_3: # %cond.store1
> -; SSE2-NEXT:    pextrw $4, %xmm2, %eax
> +; SSE2-NEXT:    pextrw $1, %xmm0, %eax
>  ; SSE2-NEXT:    movw %ax, 2(%rdi)
>  ; SSE2-NEXT:    retq
>  ;
>  ; SSE4-LABEL: truncstore_v2i64_v2i16:
>  ; SSE4:       # %bb.0:
>  ; SSE4-NEXT:    movdqa %xmm0, %xmm2
> -; SSE4-NEXT:    pxor %xmm4, %xmm4
> -; SSE4-NEXT:    movapd {{.*#+}} xmm3 = [65535,65535]
> +; SSE4-NEXT:    pxor %xmm3, %xmm3
> +; SSE4-NEXT:    movapd {{.*#+}} xmm4 = [65535,65535]
>  ; SSE4-NEXT:    movdqa {{.*#+}} xmm5 = [9223372036854775808,9223372036854775808]
>  ; SSE4-NEXT:    pxor %xmm0, %xmm5
>  ; SSE4-NEXT:    movdqa {{.*#+}} xmm0 = [9223372036854841343,9223372036854841343]
>  ; SSE4-NEXT:    pcmpgtq %xmm5, %xmm0
> -; SSE4-NEXT:    blendvpd %xmm0, %xmm2, %xmm3
> -; SSE4-NEXT:    pcmpeqq %xmm1, %xmm4
> -; SSE4-NEXT:    movmskpd %xmm4, %eax
> +; SSE4-NEXT:    blendvpd %xmm0, %xmm2, %xmm4
> +; SSE4-NEXT:    pshufd {{.*#+}} xmm0 = xmm4[0,2,2,3]
> +; SSE4-NEXT:    pshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> +; SSE4-NEXT:    pcmpeqq %xmm1, %xmm3
> +; SSE4-NEXT:    movmskpd %xmm3, %eax
>  ; SSE4-NEXT:    xorl $3, %eax
>  ; SSE4-NEXT:    testb $1, %al
>  ; SSE4-NEXT:    jne .LBB7_1
> @@ -2264,11 +2300,11 @@ define void @truncstore_v2i64_v2i16(<2 x
>  ; SSE4-NEXT:  .LBB7_4: # %else2
>  ; SSE4-NEXT:    retq
>  ; SSE4-NEXT:  .LBB7_1: # %cond.store
> -; SSE4-NEXT:    pextrw $0, %xmm3, (%rdi)
> +; SSE4-NEXT:    pextrw $0, %xmm0, (%rdi)
>  ; SSE4-NEXT:    testb $2, %al
>  ; SSE4-NEXT:    je .LBB7_4
>  ; SSE4-NEXT:  .LBB7_3: # %cond.store1
> -; SSE4-NEXT:    pextrw $4, %xmm3, 2(%rdi)
> +; SSE4-NEXT:    pextrw $1, %xmm0, 2(%rdi)
>  ; SSE4-NEXT:    retq
>  ;
>  ; AVX-LABEL: truncstore_v2i64_v2i16:
> @@ -2279,6 +2315,8 @@ define void @truncstore_v2i64_v2i16(<2 x
>  ; AVX-NEXT:    vmovdqa {{.*#+}} xmm5 = [9223372036854841343,9223372036854841343]
>  ; AVX-NEXT:    vpcmpgtq %xmm4, %xmm5, %xmm4
>  ; AVX-NEXT:    vblendvpd %xmm4, %xmm0, %xmm3, %xmm0
> +; AVX-NEXT:    vpermilps {{.*#+}} xmm0 = xmm0[0,2,2,3]
> +; AVX-NEXT:    vpshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
>  ; AVX-NEXT:    vpcmpeqq %xmm2, %xmm1, %xmm1
>  ; AVX-NEXT:    vmovmskpd %xmm1, %eax
>  ; AVX-NEXT:    xorl $3, %eax
> @@ -2294,7 +2332,7 @@ define void @truncstore_v2i64_v2i16(<2 x
>  ; AVX-NEXT:    testb $2, %al
>  ; AVX-NEXT:    je .LBB7_4
>  ; AVX-NEXT:  .LBB7_3: # %cond.store1
> -; AVX-NEXT:    vpextrw $4, %xmm0, 2(%rdi)
> +; AVX-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
>  ; AVX-NEXT:    retq
>  ;
>  ; AVX512F-LABEL: truncstore_v2i64_v2i16:
> @@ -2304,6 +2342,8 @@ define void @truncstore_v2i64_v2i16(<2 x
>  ; AVX512F-NEXT:    vptestmq %zmm1, %zmm1, %k0
>  ; AVX512F-NEXT:    vmovdqa {{.*#+}} xmm1 = [65535,65535]
>  ; AVX512F-NEXT:    vpminuq %zmm1, %zmm0, %zmm0
> +; AVX512F-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> +; AVX512F-NEXT:    vpshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
>  ; AVX512F-NEXT:    kmovw %k0, %eax
>  ; AVX512F-NEXT:    testb $1, %al
>  ; AVX512F-NEXT:    jne .LBB7_1
> @@ -2318,7 +2358,7 @@ define void @truncstore_v2i64_v2i16(<2 x
>  ; AVX512F-NEXT:    testb $2, %al
>  ; AVX512F-NEXT:    je .LBB7_4
>  ; AVX512F-NEXT:  .LBB7_3: # %cond.store1
> -; AVX512F-NEXT:    vpextrw $4, %xmm0, 2(%rdi)
> +; AVX512F-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
>  ; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
>  ;
> @@ -2327,12 +2367,12 @@ define void @truncstore_v2i64_v2i16(<2 x
>  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
>  ; AVX512BW-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
>  ; AVX512BW-NEXT:    vptestmq %zmm1, %zmm1, %k0
> +; AVX512BW-NEXT:    kshiftld $30, %k0, %k0
> +; AVX512BW-NEXT:    kshiftrd $30, %k0, %k1
>  ; AVX512BW-NEXT:    vmovdqa {{.*#+}} xmm1 = [65535,65535]
>  ; AVX512BW-NEXT:    vpminuq %zmm1, %zmm0, %zmm0
>  ; AVX512BW-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
>  ; AVX512BW-NEXT:    vpshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
> -; AVX512BW-NEXT:    kshiftld $30, %k0, %k0
> -; AVX512BW-NEXT:    kshiftrd $30, %k0, %k1
>  ; AVX512BW-NEXT:    vmovdqu16 %zmm0, (%rdi) {%k1}
>  ; AVX512BW-NEXT:    vzeroupper
>  ; AVX512BW-NEXT:    retq
> @@ -2354,27 +2394,32 @@ define void @truncstore_v2i64_v2i16(<2 x
>  define void @truncstore_v2i64_v2i8(<2 x i64> %x, <2 x i8>* %p, <2 x i64> %mask) {
>  ; SSE2-LABEL: truncstore_v2i64_v2i8:
>  ; SSE2:       # %bb.0:
> -; SSE2-NEXT:    pxor %xmm3, %xmm3
> -; SSE2-NEXT:    movdqa {{.*#+}} xmm2 = [9223372039002259456,9223372039002259456]
> -; SSE2-NEXT:    pxor %xmm0, %xmm2
> +; SSE2-NEXT:    pxor %xmm2, %xmm2
> +; SSE2-NEXT:    movdqa {{.*#+}} xmm3 = [9223372039002259456,9223372039002259456]
> +; SSE2-NEXT:    pxor %xmm0, %xmm3
>  ; SSE2-NEXT:    movdqa {{.*#+}} xmm4 = [9223372039002259711,9223372039002259711]
>  ; SSE2-NEXT:    movdqa %xmm4, %xmm5
> -; SSE2-NEXT:    pcmpgtd %xmm2, %xmm5
> +; SSE2-NEXT:    pcmpgtd %xmm3, %xmm5
>  ; SSE2-NEXT:    pshufd {{.*#+}} xmm6 = xmm5[0,0,2,2]
> -; SSE2-NEXT:    pcmpeqd %xmm4, %xmm2
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm4 = xmm2[1,1,3,3]
> -; SSE2-NEXT:    pand %xmm6, %xmm4
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm2 = xmm5[1,1,3,3]
> -; SSE2-NEXT:    por %xmm4, %xmm2
> +; SSE2-NEXT:    pcmpeqd %xmm4, %xmm3
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm3[1,1,3,3]
> +; SSE2-NEXT:    pand %xmm6, %xmm3
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm4 = xmm5[1,1,3,3]
> +; SSE2-NEXT:    por %xmm3, %xmm4
> +; SSE2-NEXT:    pand %xmm4, %xmm0
> +; SSE2-NEXT:    pandn {{.*}}(%rip), %xmm4
> +; SSE2-NEXT:    por %xmm0, %xmm4
> +; SSE2-NEXT:    pand {{.*}}(%rip), %xmm4
> +; SSE2-NEXT:    packuswb %xmm4, %xmm4
> +; SSE2-NEXT:    packuswb %xmm4, %xmm4
> +; SSE2-NEXT:    packuswb %xmm4, %xmm4
> +; SSE2-NEXT:    pcmpeqd %xmm1, %xmm2
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm2[1,0,3,2]
>  ; SSE2-NEXT:    pand %xmm2, %xmm0
> -; SSE2-NEXT:    pandn {{.*}}(%rip), %xmm2
> -; SSE2-NEXT:    por %xmm0, %xmm2
> -; SSE2-NEXT:    pcmpeqd %xmm1, %xmm3
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm3[1,0,3,2]
> -; SSE2-NEXT:    pand %xmm3, %xmm0
>  ; SSE2-NEXT:    movmskpd %xmm0, %eax
>  ; SSE2-NEXT:    xorl $3, %eax
>  ; SSE2-NEXT:    testb $1, %al
> +; SSE2-NEXT:    movd %xmm4, %ecx
>  ; SSE2-NEXT:    jne .LBB8_1
>  ; SSE2-NEXT:  # %bb.2: # %else
>  ; SSE2-NEXT:    testb $2, %al
> @@ -2382,13 +2427,11 @@ define void @truncstore_v2i64_v2i8(<2 x
>  ; SSE2-NEXT:  .LBB8_4: # %else2
>  ; SSE2-NEXT:    retq
>  ; SSE2-NEXT:  .LBB8_1: # %cond.store
> -; SSE2-NEXT:    movd %xmm2, %ecx
>  ; SSE2-NEXT:    movb %cl, (%rdi)
>  ; SSE2-NEXT:    testb $2, %al
>  ; SSE2-NEXT:    je .LBB8_4
>  ; SSE2-NEXT:  .LBB8_3: # %cond.store1
> -; SSE2-NEXT:    pextrw $4, %xmm2, %eax
> -; SSE2-NEXT:    movb %al, 1(%rdi)
> +; SSE2-NEXT:    movb %ch, 1(%rdi)
>  ; SSE2-NEXT:    retq
>  ;
>  ; SSE4-LABEL: truncstore_v2i64_v2i8:
> @@ -2401,6 +2444,7 @@ define void @truncstore_v2i64_v2i8(<2 x
>  ; SSE4-NEXT:    movdqa {{.*#+}} xmm0 = [9223372036854776063,9223372036854776063]
>  ; SSE4-NEXT:    pcmpgtq %xmm5, %xmm0
>  ; SSE4-NEXT:    blendvpd %xmm0, %xmm2, %xmm3
> +; SSE4-NEXT:    pshufb {{.*#+}} xmm3 = xmm3[0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u]
>  ; SSE4-NEXT:    pcmpeqq %xmm1, %xmm4
>  ; SSE4-NEXT:    movmskpd %xmm4, %eax
>  ; SSE4-NEXT:    xorl $3, %eax
> @@ -2416,7 +2460,7 @@ define void @truncstore_v2i64_v2i8(<2 x
>  ; SSE4-NEXT:    testb $2, %al
>  ; SSE4-NEXT:    je .LBB8_4
>  ; SSE4-NEXT:  .LBB8_3: # %cond.store1
> -; SSE4-NEXT:    pextrb $8, %xmm3, 1(%rdi)
> +; SSE4-NEXT:    pextrb $1, %xmm3, 1(%rdi)
>  ; SSE4-NEXT:    retq
>  ;
>  ; AVX-LABEL: truncstore_v2i64_v2i8:
> @@ -2427,6 +2471,7 @@ define void @truncstore_v2i64_v2i8(<2 x
>  ; AVX-NEXT:    vmovdqa {{.*#+}} xmm5 = [9223372036854776063,9223372036854776063]
>  ; AVX-NEXT:    vpcmpgtq %xmm4, %xmm5, %xmm4
>  ; AVX-NEXT:    vblendvpd %xmm4, %xmm0, %xmm3, %xmm0
> +; AVX-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u]
>  ; AVX-NEXT:    vpcmpeqq %xmm2, %xmm1, %xmm1
>  ; AVX-NEXT:    vmovmskpd %xmm1, %eax
>  ; AVX-NEXT:    xorl $3, %eax
> @@ -2442,7 +2487,7 @@ define void @truncstore_v2i64_v2i8(<2 x
>  ; AVX-NEXT:    testb $2, %al
>  ; AVX-NEXT:    je .LBB8_4
>  ; AVX-NEXT:  .LBB8_3: # %cond.store1
> -; AVX-NEXT:    vpextrb $8, %xmm0, 1(%rdi)
> +; AVX-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX-NEXT:    retq
>  ;
>  ; AVX512F-LABEL: truncstore_v2i64_v2i8:
> @@ -2452,6 +2497,7 @@ define void @truncstore_v2i64_v2i8(<2 x
>  ; AVX512F-NEXT:    vptestmq %zmm1, %zmm1, %k0
>  ; AVX512F-NEXT:    vmovdqa {{.*#+}} xmm1 = [255,255]
>  ; AVX512F-NEXT:    vpminuq %zmm1, %zmm0, %zmm0
> +; AVX512F-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u]
>  ; AVX512F-NEXT:    kmovw %k0, %eax
>  ; AVX512F-NEXT:    testb $1, %al
>  ; AVX512F-NEXT:    jne .LBB8_1
> @@ -2466,7 +2512,7 @@ define void @truncstore_v2i64_v2i8(<2 x
>  ; AVX512F-NEXT:    testb $2, %al
>  ; AVX512F-NEXT:    je .LBB8_4
>  ; AVX512F-NEXT:  .LBB8_3: # %cond.store1
> -; AVX512F-NEXT:    vpextrb $8, %xmm0, 1(%rdi)
> +; AVX512F-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
>  ;
> @@ -2475,11 +2521,11 @@ define void @truncstore_v2i64_v2i8(<2 x
>  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
>  ; AVX512BW-NEXT:    # kill: def $xmm0 killed $xmm0 def $zmm0
>  ; AVX512BW-NEXT:    vptestmq %zmm1, %zmm1, %k0
> +; AVX512BW-NEXT:    kshiftlq $62, %k0, %k0
> +; AVX512BW-NEXT:    kshiftrq $62, %k0, %k1
>  ; AVX512BW-NEXT:    vmovdqa {{.*#+}} xmm1 = [255,255]
>  ; AVX512BW-NEXT:    vpminuq %zmm1, %zmm0, %zmm0
>  ; AVX512BW-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u]
> -; AVX512BW-NEXT:    kshiftlq $62, %k0, %k0
> -; AVX512BW-NEXT:    kshiftrq $62, %k0, %k1
>  ; AVX512BW-NEXT:    vmovdqu8 %zmm0, (%rdi) {%k1}
>  ; AVX512BW-NEXT:    vzeroupper
>  ; AVX512BW-NEXT:    retq
> @@ -4352,6 +4398,7 @@ define void @truncstore_v8i32_v8i8(<8 x
>  ; SSE2-NEXT:    pandn %xmm9, %xmm6
>  ; SSE2-NEXT:    por %xmm0, %xmm6
>  ; SSE2-NEXT:    packuswb %xmm4, %xmm6
> +; SSE2-NEXT:    packuswb %xmm6, %xmm6
>  ; SSE2-NEXT:    pcmpeqd %xmm8, %xmm3
>  ; SSE2-NEXT:    pcmpeqd %xmm0, %xmm0
>  ; SSE2-NEXT:    pxor %xmm0, %xmm3
> @@ -4371,17 +4418,26 @@ define void @truncstore_v8i32_v8i8(<8 x
>  ; SSE2-NEXT:    jne .LBB12_5
>  ; SSE2-NEXT:  .LBB12_6: # %else4
>  ; SSE2-NEXT:    testb $8, %al
> -; SSE2-NEXT:    jne .LBB12_7
> +; SSE2-NEXT:    je .LBB12_8
> +; SSE2-NEXT:  .LBB12_7: # %cond.store5
> +; SSE2-NEXT:    shrl $24, %ecx
> +; SSE2-NEXT:    movb %cl, 3(%rdi)
>  ; SSE2-NEXT:  .LBB12_8: # %else6
>  ; SSE2-NEXT:    testb $16, %al
> -; SSE2-NEXT:    jne .LBB12_9
> +; SSE2-NEXT:    pextrw $2, %xmm6, %ecx
> +; SSE2-NEXT:    je .LBB12_10
> +; SSE2-NEXT:  # %bb.9: # %cond.store7
> +; SSE2-NEXT:    movb %cl, 4(%rdi)
>  ; SSE2-NEXT:  .LBB12_10: # %else8
>  ; SSE2-NEXT:    testb $32, %al
> -; SSE2-NEXT:    jne .LBB12_11
> +; SSE2-NEXT:    je .LBB12_12
> +; SSE2-NEXT:  # %bb.11: # %cond.store9
> +; SSE2-NEXT:    movb %ch, 5(%rdi)
>  ; SSE2-NEXT:  .LBB12_12: # %else10
>  ; SSE2-NEXT:    testb $64, %al
> +; SSE2-NEXT:    pextrw $3, %xmm6, %ecx
>  ; SSE2-NEXT:    jne .LBB12_13
> -; SSE2-NEXT:  .LBB12_14: # %else12
> +; SSE2-NEXT:  # %bb.14: # %else12
>  ; SSE2-NEXT:    testb $-128, %al
>  ; SSE2-NEXT:    jne .LBB12_15
>  ; SSE2-NEXT:  .LBB12_16: # %else14
> @@ -4391,47 +4447,34 @@ define void @truncstore_v8i32_v8i8(<8 x
>  ; SSE2-NEXT:    testb $2, %al
>  ; SSE2-NEXT:    je .LBB12_4
>  ; SSE2-NEXT:  .LBB12_3: # %cond.store1
> -; SSE2-NEXT:    shrl $16, %ecx
> -; SSE2-NEXT:    movb %cl, 1(%rdi)
> +; SSE2-NEXT:    movb %ch, 1(%rdi)
>  ; SSE2-NEXT:    testb $4, %al
>  ; SSE2-NEXT:    je .LBB12_6
>  ; SSE2-NEXT:  .LBB12_5: # %cond.store3
> -; SSE2-NEXT:    pextrw $2, %xmm6, %ecx
> -; SSE2-NEXT:    movb %cl, 2(%rdi)
> +; SSE2-NEXT:    movl %ecx, %edx
> +; SSE2-NEXT:    shrl $16, %edx
> +; SSE2-NEXT:    movb %dl, 2(%rdi)
>  ; SSE2-NEXT:    testb $8, %al
> -; SSE2-NEXT:    je .LBB12_8
> -; SSE2-NEXT:  .LBB12_7: # %cond.store5
> -; SSE2-NEXT:    pextrw $3, %xmm6, %ecx
> -; SSE2-NEXT:    movb %cl, 3(%rdi)
> -; SSE2-NEXT:    testb $16, %al
> -; SSE2-NEXT:    je .LBB12_10
> -; SSE2-NEXT:  .LBB12_9: # %cond.store7
> -; SSE2-NEXT:    pextrw $4, %xmm6, %ecx
> -; SSE2-NEXT:    movb %cl, 4(%rdi)
> -; SSE2-NEXT:    testb $32, %al
> -; SSE2-NEXT:    je .LBB12_12
> -; SSE2-NEXT:  .LBB12_11: # %cond.store9
> -; SSE2-NEXT:    pextrw $5, %xmm6, %ecx
> -; SSE2-NEXT:    movb %cl, 5(%rdi)
> -; SSE2-NEXT:    testb $64, %al
> -; SSE2-NEXT:    je .LBB12_14
> +; SSE2-NEXT:    jne .LBB12_7
> +; SSE2-NEXT:    jmp .LBB12_8
>  ; SSE2-NEXT:  .LBB12_13: # %cond.store11
> -; SSE2-NEXT:    pextrw $6, %xmm6, %ecx
>  ; SSE2-NEXT:    movb %cl, 6(%rdi)
>  ; SSE2-NEXT:    testb $-128, %al
>  ; SSE2-NEXT:    je .LBB12_16
>  ; SSE2-NEXT:  .LBB12_15: # %cond.store13
> -; SSE2-NEXT:    pextrw $7, %xmm6, %eax
> -; SSE2-NEXT:    movb %al, 7(%rdi)
> +; SSE2-NEXT:    movb %ch, 7(%rdi)
>  ; SSE2-NEXT:    retq
>  ;
>  ; SSE4-LABEL: truncstore_v8i32_v8i8:
>  ; SSE4:       # %bb.0:
>  ; SSE4-NEXT:    pxor %xmm4, %xmm4
>  ; SSE4-NEXT:    movdqa {{.*#+}} xmm5 = [255,255,255,255]
> -; SSE4-NEXT:    pminud %xmm5, %xmm1
>  ; SSE4-NEXT:    pminud %xmm5, %xmm0
> -; SSE4-NEXT:    packusdw %xmm1, %xmm0
> +; SSE4-NEXT:    pminud %xmm5, %xmm1
> +; SSE4-NEXT:    movdqa {{.*#+}} xmm5 = <0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u>
> +; SSE4-NEXT:    pshufb %xmm5, %xmm1
> +; SSE4-NEXT:    pshufb %xmm5, %xmm0
> +; SSE4-NEXT:    punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
>  ; SSE4-NEXT:    pcmpeqd %xmm4, %xmm3
>  ; SSE4-NEXT:    pcmpeqd %xmm1, %xmm1
>  ; SSE4-NEXT:    pxor %xmm1, %xmm3
> @@ -4470,40 +4513,43 @@ define void @truncstore_v8i32_v8i8(<8 x
>  ; SSE4-NEXT:    testb $2, %al
>  ; SSE4-NEXT:    je .LBB12_4
>  ; SSE4-NEXT:  .LBB12_3: # %cond.store1
> -; SSE4-NEXT:    pextrb $2, %xmm0, 1(%rdi)
> +; SSE4-NEXT:    pextrb $1, %xmm0, 1(%rdi)
>  ; SSE4-NEXT:    testb $4, %al
>  ; SSE4-NEXT:    je .LBB12_6
>  ; SSE4-NEXT:  .LBB12_5: # %cond.store3
> -; SSE4-NEXT:    pextrb $4, %xmm0, 2(%rdi)
> +; SSE4-NEXT:    pextrb $2, %xmm0, 2(%rdi)
>  ; SSE4-NEXT:    testb $8, %al
>  ; SSE4-NEXT:    je .LBB12_8
>  ; SSE4-NEXT:  .LBB12_7: # %cond.store5
> -; SSE4-NEXT:    pextrb $6, %xmm0, 3(%rdi)
> +; SSE4-NEXT:    pextrb $3, %xmm0, 3(%rdi)
>  ; SSE4-NEXT:    testb $16, %al
>  ; SSE4-NEXT:    je .LBB12_10
>  ; SSE4-NEXT:  .LBB12_9: # %cond.store7
> -; SSE4-NEXT:    pextrb $8, %xmm0, 4(%rdi)
> +; SSE4-NEXT:    pextrb $4, %xmm0, 4(%rdi)
>  ; SSE4-NEXT:    testb $32, %al
>  ; SSE4-NEXT:    je .LBB12_12
>  ; SSE4-NEXT:  .LBB12_11: # %cond.store9
> -; SSE4-NEXT:    pextrb $10, %xmm0, 5(%rdi)
> +; SSE4-NEXT:    pextrb $5, %xmm0, 5(%rdi)
>  ; SSE4-NEXT:    testb $64, %al
>  ; SSE4-NEXT:    je .LBB12_14
>  ; SSE4-NEXT:  .LBB12_13: # %cond.store11
> -; SSE4-NEXT:    pextrb $12, %xmm0, 6(%rdi)
> +; SSE4-NEXT:    pextrb $6, %xmm0, 6(%rdi)
>  ; SSE4-NEXT:    testb $-128, %al
>  ; SSE4-NEXT:    je .LBB12_16
>  ; SSE4-NEXT:  .LBB12_15: # %cond.store13
> -; SSE4-NEXT:    pextrb $14, %xmm0, 7(%rdi)
> +; SSE4-NEXT:    pextrb $7, %xmm0, 7(%rdi)
>  ; SSE4-NEXT:    retq
>  ;
>  ; AVX1-LABEL: truncstore_v8i32_v8i8:
>  ; AVX1:       # %bb.0:
> -; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm2
> -; AVX1-NEXT:    vmovdqa {{.*#+}} xmm3 = [255,255,255,255]
> -; AVX1-NEXT:    vpminud %xmm3, %xmm2, %xmm2
> -; AVX1-NEXT:    vpminud %xmm3, %xmm0, %xmm0
> -; AVX1-NEXT:    vpackusdw %xmm2, %xmm0, %xmm0
> +; AVX1-NEXT:    vmovdqa {{.*#+}} xmm2 = [255,255,255,255]
> +; AVX1-NEXT:    vpminud %xmm2, %xmm0, %xmm3
> +; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm0
> +; AVX1-NEXT:    vpminud %xmm2, %xmm0, %xmm0
> +; AVX1-NEXT:    vmovdqa {{.*#+}} xmm2 = <0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u>
> +; AVX1-NEXT:    vpshufb %xmm2, %xmm0, %xmm0
> +; AVX1-NEXT:    vpshufb %xmm2, %xmm3, %xmm2
> +; AVX1-NEXT:    vpunpckldq {{.*#+}} xmm0 = xmm2[0],xmm0[0],xmm2[1],xmm0[1]
>  ; AVX1-NEXT:    vextractf128 $1, %ymm1, %xmm2
>  ; AVX1-NEXT:    vpxor %xmm3, %xmm3, %xmm3
>  ; AVX1-NEXT:    vpcmpeqd %xmm3, %xmm2, %xmm2
> @@ -4542,31 +4588,31 @@ define void @truncstore_v8i32_v8i8(<8 x
>  ; AVX1-NEXT:    testb $2, %al
>  ; AVX1-NEXT:    je .LBB12_4
>  ; AVX1-NEXT:  .LBB12_3: # %cond.store1
> -; AVX1-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> +; AVX1-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX1-NEXT:    testb $4, %al
>  ; AVX1-NEXT:    je .LBB12_6
>  ; AVX1-NEXT:  .LBB12_5: # %cond.store3
> -; AVX1-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> +; AVX1-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
>  ; AVX1-NEXT:    testb $8, %al
>  ; AVX1-NEXT:    je .LBB12_8
>  ; AVX1-NEXT:  .LBB12_7: # %cond.store5
> -; AVX1-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> +; AVX1-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
>  ; AVX1-NEXT:    testb $16, %al
>  ; AVX1-NEXT:    je .LBB12_10
>  ; AVX1-NEXT:  .LBB12_9: # %cond.store7
> -; AVX1-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> +; AVX1-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
>  ; AVX1-NEXT:    testb $32, %al
>  ; AVX1-NEXT:    je .LBB12_12
>  ; AVX1-NEXT:  .LBB12_11: # %cond.store9
> -; AVX1-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> +; AVX1-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
>  ; AVX1-NEXT:    testb $64, %al
>  ; AVX1-NEXT:    je .LBB12_14
>  ; AVX1-NEXT:  .LBB12_13: # %cond.store11
> -; AVX1-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> +; AVX1-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
>  ; AVX1-NEXT:    testb $-128, %al
>  ; AVX1-NEXT:    je .LBB12_16
>  ; AVX1-NEXT:  .LBB12_15: # %cond.store13
> -; AVX1-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> +; AVX1-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
>  ; AVX1-NEXT:    vzeroupper
>  ; AVX1-NEXT:    retq
>  ;
> @@ -4576,7 +4622,10 @@ define void @truncstore_v8i32_v8i8(<8 x
>  ; AVX2-NEXT:    vpbroadcastd {{.*#+}} ymm3 = [255,255,255,255,255,255,255,255]
>  ; AVX2-NEXT:    vpminud %ymm3, %ymm0, %ymm0
>  ; AVX2-NEXT:    vextracti128 $1, %ymm0, %xmm3
> -; AVX2-NEXT:    vpackusdw %xmm3, %xmm0, %xmm0
> +; AVX2-NEXT:    vmovdqa {{.*#+}} xmm4 = <0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u>
> +; AVX2-NEXT:    vpshufb %xmm4, %xmm3, %xmm3
> +; AVX2-NEXT:    vpshufb %xmm4, %xmm0, %xmm0
> +; AVX2-NEXT:    vpunpckldq {{.*#+}} xmm0 = xmm0[0],xmm3[0],xmm0[1],xmm3[1]
>  ; AVX2-NEXT:    vpcmpeqd %ymm2, %ymm1, %ymm1
>  ; AVX2-NEXT:    vmovmskps %ymm1, %eax
>  ; AVX2-NEXT:    notl %eax
> @@ -4611,31 +4660,31 @@ define void @truncstore_v8i32_v8i8(<8 x
>  ; AVX2-NEXT:    testb $2, %al
>  ; AVX2-NEXT:    je .LBB12_4
>  ; AVX2-NEXT:  .LBB12_3: # %cond.store1
> -; AVX2-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> +; AVX2-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX2-NEXT:    testb $4, %al
>  ; AVX2-NEXT:    je .LBB12_6
>  ; AVX2-NEXT:  .LBB12_5: # %cond.store3
> -; AVX2-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> +; AVX2-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
>  ; AVX2-NEXT:    testb $8, %al
>  ; AVX2-NEXT:    je .LBB12_8
>  ; AVX2-NEXT:  .LBB12_7: # %cond.store5
> -; AVX2-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> +; AVX2-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
>  ; AVX2-NEXT:    testb $16, %al
>  ; AVX2-NEXT:    je .LBB12_10
>  ; AVX2-NEXT:  .LBB12_9: # %cond.store7
> -; AVX2-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> +; AVX2-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
>  ; AVX2-NEXT:    testb $32, %al
>  ; AVX2-NEXT:    je .LBB12_12
>  ; AVX2-NEXT:  .LBB12_11: # %cond.store9
> -; AVX2-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> +; AVX2-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
>  ; AVX2-NEXT:    testb $64, %al
>  ; AVX2-NEXT:    je .LBB12_14
>  ; AVX2-NEXT:  .LBB12_13: # %cond.store11
> -; AVX2-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> +; AVX2-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
>  ; AVX2-NEXT:    testb $-128, %al
>  ; AVX2-NEXT:    je .LBB12_16
>  ; AVX2-NEXT:  .LBB12_15: # %cond.store13
> -; AVX2-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> +; AVX2-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
>  ; AVX2-NEXT:    vzeroupper
>  ; AVX2-NEXT:    retq
>  ;
> @@ -4645,7 +4694,7 @@ define void @truncstore_v8i32_v8i8(<8 x
>  ; AVX512F-NEXT:    vptestmd %zmm1, %zmm1, %k0
>  ; AVX512F-NEXT:    vpbroadcastd {{.*#+}} ymm1 = [255,255,255,255,255,255,255,255]
>  ; AVX512F-NEXT:    vpminud %ymm1, %ymm0, %ymm0
> -; AVX512F-NEXT:    vpmovdw %zmm0, %ymm0
> +; AVX512F-NEXT:    vpmovdb %zmm0, %xmm0
>  ; AVX512F-NEXT:    kmovw %k0, %eax
>  ; AVX512F-NEXT:    testb $1, %al
>  ; AVX512F-NEXT:    jne .LBB12_1
> @@ -4678,31 +4727,31 @@ define void @truncstore_v8i32_v8i8(<8 x
>  ; AVX512F-NEXT:    testb $2, %al
>  ; AVX512F-NEXT:    je .LBB12_4
>  ; AVX512F-NEXT:  .LBB12_3: # %cond.store1
> -; AVX512F-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> +; AVX512F-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX512F-NEXT:    testb $4, %al
>  ; AVX512F-NEXT:    je .LBB12_6
>  ; AVX512F-NEXT:  .LBB12_5: # %cond.store3
> -; AVX512F-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> +; AVX512F-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
>  ; AVX512F-NEXT:    testb $8, %al
>  ; AVX512F-NEXT:    je .LBB12_8
>  ; AVX512F-NEXT:  .LBB12_7: # %cond.store5
> -; AVX512F-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> +; AVX512F-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
>  ; AVX512F-NEXT:    testb $16, %al
>  ; AVX512F-NEXT:    je .LBB12_10
>  ; AVX512F-NEXT:  .LBB12_9: # %cond.store7
> -; AVX512F-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> +; AVX512F-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
>  ; AVX512F-NEXT:    testb $32, %al
>  ; AVX512F-NEXT:    je .LBB12_12
>  ; AVX512F-NEXT:  .LBB12_11: # %cond.store9
> -; AVX512F-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> +; AVX512F-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
>  ; AVX512F-NEXT:    testb $64, %al
>  ; AVX512F-NEXT:    je .LBB12_14
>  ; AVX512F-NEXT:  .LBB12_13: # %cond.store11
> -; AVX512F-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> +; AVX512F-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
>  ; AVX512F-NEXT:    testb $-128, %al
>  ; AVX512F-NEXT:    je .LBB12_16
>  ; AVX512F-NEXT:  .LBB12_15: # %cond.store13
> -; AVX512F-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> +; AVX512F-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
>  ; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
>  ;
> @@ -4710,12 +4759,11 @@ define void @truncstore_v8i32_v8i8(<8 x
>  ; AVX512BW:       # %bb.0:
>  ; AVX512BW-NEXT:    # kill: def $ymm1 killed $ymm1 def $zmm1
>  ; AVX512BW-NEXT:    vptestmd %zmm1, %zmm1, %k0
> -; AVX512BW-NEXT:    vpbroadcastd {{.*#+}} ymm1 = [255,255,255,255,255,255,255,255]
> -; AVX512BW-NEXT:    vpminud %ymm1, %ymm0, %ymm0
> -; AVX512BW-NEXT:    vpmovdw %zmm0, %ymm0
> -; AVX512BW-NEXT:    vpackuswb %xmm0, %xmm0, %xmm0
>  ; AVX512BW-NEXT:    kshiftlq $56, %k0, %k0
>  ; AVX512BW-NEXT:    kshiftrq $56, %k0, %k1
> +; AVX512BW-NEXT:    vpbroadcastd {{.*#+}} ymm1 = [255,255,255,255,255,255,255,255]
> +; AVX512BW-NEXT:    vpminud %ymm1, %ymm0, %ymm0
> +; AVX512BW-NEXT:    vpmovdb %zmm0, %xmm0
>  ; AVX512BW-NEXT:    vmovdqu8 %zmm0, (%rdi) {%k1}
>  ; AVX512BW-NEXT:    vzeroupper
>  ; AVX512BW-NEXT:    retq
> @@ -4738,16 +4786,19 @@ define void @truncstore_v8i32_v8i8(<8 x
>  define void @truncstore_v4i32_v4i16(<4 x i32> %x, <4 x i16>* %p, <4 x i32> %mask) {
>  ; SSE2-LABEL: truncstore_v4i32_v4i16:
>  ; SSE2:       # %bb.0:
> -; SSE2-NEXT:    pxor %xmm3, %xmm3
> -; SSE2-NEXT:    movdqa {{.*#+}} xmm4 = [2147483648,2147483648,2147483648,2147483648]
> -; SSE2-NEXT:    pxor %xmm0, %xmm4
> -; SSE2-NEXT:    movdqa {{.*#+}} xmm2 = [2147549183,2147549183,2147549183,2147549183]
> -; SSE2-NEXT:    pcmpgtd %xmm4, %xmm2
> -; SSE2-NEXT:    pand %xmm2, %xmm0
> -; SSE2-NEXT:    pandn {{.*}}(%rip), %xmm2
> -; SSE2-NEXT:    por %xmm0, %xmm2
> -; SSE2-NEXT:    pcmpeqd %xmm1, %xmm3
> -; SSE2-NEXT:    movmskps %xmm3, %eax
> +; SSE2-NEXT:    pxor %xmm2, %xmm2
> +; SSE2-NEXT:    movdqa {{.*#+}} xmm3 = [2147483648,2147483648,2147483648,2147483648]
> +; SSE2-NEXT:    pxor %xmm0, %xmm3
> +; SSE2-NEXT:    movdqa {{.*#+}} xmm4 = [2147549183,2147549183,2147549183,2147549183]
> +; SSE2-NEXT:    pcmpgtd %xmm3, %xmm4
> +; SSE2-NEXT:    pand %xmm4, %xmm0
> +; SSE2-NEXT:    pandn {{.*}}(%rip), %xmm4
> +; SSE2-NEXT:    por %xmm0, %xmm4
> +; SSE2-NEXT:    pshuflw {{.*#+}} xmm0 = xmm4[0,2,2,3,4,5,6,7]
> +; SSE2-NEXT:    pshufhw {{.*#+}} xmm0 = xmm0[0,1,2,3,4,6,6,7]
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> +; SSE2-NEXT:    pcmpeqd %xmm1, %xmm2
> +; SSE2-NEXT:    movmskps %xmm2, %eax
>  ; SSE2-NEXT:    xorl $15, %eax
>  ; SSE2-NEXT:    testb $1, %al
>  ; SSE2-NEXT:    jne .LBB13_1
> @@ -4763,22 +4814,22 @@ define void @truncstore_v4i32_v4i16(<4 x
>  ; SSE2-NEXT:  .LBB13_8: # %else6
>  ; SSE2-NEXT:    retq
>  ; SSE2-NEXT:  .LBB13_1: # %cond.store
> -; SSE2-NEXT:    movd %xmm2, %ecx
> +; SSE2-NEXT:    movd %xmm0, %ecx
>  ; SSE2-NEXT:    movw %cx, (%rdi)
>  ; SSE2-NEXT:    testb $2, %al
>  ; SSE2-NEXT:    je .LBB13_4
>  ; SSE2-NEXT:  .LBB13_3: # %cond.store1
> -; SSE2-NEXT:    pextrw $2, %xmm2, %ecx
> +; SSE2-NEXT:    pextrw $1, %xmm0, %ecx
>  ; SSE2-NEXT:    movw %cx, 2(%rdi)
>  ; SSE2-NEXT:    testb $4, %al
>  ; SSE2-NEXT:    je .LBB13_6
>  ; SSE2-NEXT:  .LBB13_5: # %cond.store3
> -; SSE2-NEXT:    pextrw $4, %xmm2, %ecx
> +; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
>  ; SSE2-NEXT:    movw %cx, 4(%rdi)
>  ; SSE2-NEXT:    testb $8, %al
>  ; SSE2-NEXT:    je .LBB13_8
>  ; SSE2-NEXT:  .LBB13_7: # %cond.store5
> -; SSE2-NEXT:    pextrw $6, %xmm2, %eax
> +; SSE2-NEXT:    pextrw $3, %xmm0, %eax
>  ; SSE2-NEXT:    movw %ax, 6(%rdi)
>  ; SSE2-NEXT:    retq
>  ;
> @@ -4786,6 +4837,7 @@ define void @truncstore_v4i32_v4i16(<4 x
>  ; SSE4:       # %bb.0:
>  ; SSE4-NEXT:    pxor %xmm2, %xmm2
>  ; SSE4-NEXT:    pminud {{.*}}(%rip), %xmm0
> +; SSE4-NEXT:    packusdw %xmm0, %xmm0
>  ; SSE4-NEXT:    pcmpeqd %xmm1, %xmm2
>  ; SSE4-NEXT:    movmskps %xmm2, %eax
>  ; SSE4-NEXT:    xorl $15, %eax
> @@ -4807,21 +4859,22 @@ define void @truncstore_v4i32_v4i16(<4 x
>  ; SSE4-NEXT:    testb $2, %al
>  ; SSE4-NEXT:    je .LBB13_4
>  ; SSE4-NEXT:  .LBB13_3: # %cond.store1
> -; SSE4-NEXT:    pextrw $2, %xmm0, 2(%rdi)
> +; SSE4-NEXT:    pextrw $1, %xmm0, 2(%rdi)
>  ; SSE4-NEXT:    testb $4, %al
>  ; SSE4-NEXT:    je .LBB13_6
>  ; SSE4-NEXT:  .LBB13_5: # %cond.store3
> -; SSE4-NEXT:    pextrw $4, %xmm0, 4(%rdi)
> +; SSE4-NEXT:    pextrw $2, %xmm0, 4(%rdi)
>  ; SSE4-NEXT:    testb $8, %al
>  ; SSE4-NEXT:    je .LBB13_8
>  ; SSE4-NEXT:  .LBB13_7: # %cond.store5
> -; SSE4-NEXT:    pextrw $6, %xmm0, 6(%rdi)
> +; SSE4-NEXT:    pextrw $3, %xmm0, 6(%rdi)
>  ; SSE4-NEXT:    retq
>  ;
>  ; AVX1-LABEL: truncstore_v4i32_v4i16:
>  ; AVX1:       # %bb.0:
>  ; AVX1-NEXT:    vpxor %xmm2, %xmm2, %xmm2
>  ; AVX1-NEXT:    vpminud {{.*}}(%rip), %xmm0, %xmm0
> +; AVX1-NEXT:    vpackusdw %xmm0, %xmm0, %xmm0
>  ; AVX1-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
>  ; AVX1-NEXT:    vmovmskps %xmm1, %eax
>  ; AVX1-NEXT:    xorl $15, %eax
> @@ -4843,15 +4896,15 @@ define void @truncstore_v4i32_v4i16(<4 x
>  ; AVX1-NEXT:    testb $2, %al
>  ; AVX1-NEXT:    je .LBB13_4
>  ; AVX1-NEXT:  .LBB13_3: # %cond.store1
> -; AVX1-NEXT:    vpextrw $2, %xmm0, 2(%rdi)
> +; AVX1-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
>  ; AVX1-NEXT:    testb $4, %al
>  ; AVX1-NEXT:    je .LBB13_6
>  ; AVX1-NEXT:  .LBB13_5: # %cond.store3
> -; AVX1-NEXT:    vpextrw $4, %xmm0, 4(%rdi)
> +; AVX1-NEXT:    vpextrw $2, %xmm0, 4(%rdi)
>  ; AVX1-NEXT:    testb $8, %al
>  ; AVX1-NEXT:    je .LBB13_8
>  ; AVX1-NEXT:  .LBB13_7: # %cond.store5
> -; AVX1-NEXT:    vpextrw $6, %xmm0, 6(%rdi)
> +; AVX1-NEXT:    vpextrw $3, %xmm0, 6(%rdi)
>  ; AVX1-NEXT:    retq
>  ;
>  ; AVX2-LABEL: truncstore_v4i32_v4i16:
> @@ -4859,6 +4912,7 @@ define void @truncstore_v4i32_v4i16(<4 x
>  ; AVX2-NEXT:    vpxor %xmm2, %xmm2, %xmm2
>  ; AVX2-NEXT:    vpbroadcastd {{.*#+}} xmm3 = [65535,65535,65535,65535]
>  ; AVX2-NEXT:    vpminud %xmm3, %xmm0, %xmm0
> +; AVX2-NEXT:    vpackusdw %xmm0, %xmm0, %xmm0
>  ; AVX2-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
>  ; AVX2-NEXT:    vmovmskps %xmm1, %eax
>  ; AVX2-NEXT:    xorl $15, %eax
> @@ -4880,15 +4934,15 @@ define void @truncstore_v4i32_v4i16(<4 x
>  ; AVX2-NEXT:    testb $2, %al
>  ; AVX2-NEXT:    je .LBB13_4
>  ; AVX2-NEXT:  .LBB13_3: # %cond.store1
> -; AVX2-NEXT:    vpextrw $2, %xmm0, 2(%rdi)
> +; AVX2-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
>  ; AVX2-NEXT:    testb $4, %al
>  ; AVX2-NEXT:    je .LBB13_6
>  ; AVX2-NEXT:  .LBB13_5: # %cond.store3
> -; AVX2-NEXT:    vpextrw $4, %xmm0, 4(%rdi)
> +; AVX2-NEXT:    vpextrw $2, %xmm0, 4(%rdi)
>  ; AVX2-NEXT:    testb $8, %al
>  ; AVX2-NEXT:    je .LBB13_8
>  ; AVX2-NEXT:  .LBB13_7: # %cond.store5
> -; AVX2-NEXT:    vpextrw $6, %xmm0, 6(%rdi)
> +; AVX2-NEXT:    vpextrw $3, %xmm0, 6(%rdi)
>  ; AVX2-NEXT:    retq
>  ;
>  ; AVX512F-LABEL: truncstore_v4i32_v4i16:
> @@ -4897,6 +4951,7 @@ define void @truncstore_v4i32_v4i16(<4 x
>  ; AVX512F-NEXT:    vptestmd %zmm1, %zmm1, %k0
>  ; AVX512F-NEXT:    vpbroadcastd {{.*#+}} xmm1 = [65535,65535,65535,65535]
>  ; AVX512F-NEXT:    vpminud %xmm1, %xmm0, %xmm0
> +; AVX512F-NEXT:    vpackusdw %xmm0, %xmm0, %xmm0
>  ; AVX512F-NEXT:    kmovw %k0, %eax
>  ; AVX512F-NEXT:    testb $1, %al
>  ; AVX512F-NEXT:    jne .LBB13_1
> @@ -4917,15 +4972,15 @@ define void @truncstore_v4i32_v4i16(<4 x
>  ; AVX512F-NEXT:    testb $2, %al
>  ; AVX512F-NEXT:    je .LBB13_4
>  ; AVX512F-NEXT:  .LBB13_3: # %cond.store1
> -; AVX512F-NEXT:    vpextrw $2, %xmm0, 2(%rdi)
> +; AVX512F-NEXT:    vpextrw $1, %xmm0, 2(%rdi)
>  ; AVX512F-NEXT:    testb $4, %al
>  ; AVX512F-NEXT:    je .LBB13_6
>  ; AVX512F-NEXT:  .LBB13_5: # %cond.store3
> -; AVX512F-NEXT:    vpextrw $4, %xmm0, 4(%rdi)
> +; AVX512F-NEXT:    vpextrw $2, %xmm0, 4(%rdi)
>  ; AVX512F-NEXT:    testb $8, %al
>  ; AVX512F-NEXT:    je .LBB13_8
>  ; AVX512F-NEXT:  .LBB13_7: # %cond.store5
> -; AVX512F-NEXT:    vpextrw $6, %xmm0, 6(%rdi)
> +; AVX512F-NEXT:    vpextrw $3, %xmm0, 6(%rdi)
>  ; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
>  ;
> @@ -4933,11 +4988,11 @@ define void @truncstore_v4i32_v4i16(<4 x
>  ; AVX512BW:       # %bb.0:
>  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
>  ; AVX512BW-NEXT:    vptestmd %zmm1, %zmm1, %k0
> +; AVX512BW-NEXT:    kshiftld $28, %k0, %k0
> +; AVX512BW-NEXT:    kshiftrd $28, %k0, %k1
>  ; AVX512BW-NEXT:    vpbroadcastd {{.*#+}} xmm1 = [65535,65535,65535,65535]
>  ; AVX512BW-NEXT:    vpminud %xmm1, %xmm0, %xmm0
>  ; AVX512BW-NEXT:    vpackusdw %xmm0, %xmm0, %xmm0
> -; AVX512BW-NEXT:    kshiftld $28, %k0, %k0
> -; AVX512BW-NEXT:    kshiftrd $28, %k0, %k1
>  ; AVX512BW-NEXT:    vmovdqu16 %zmm0, (%rdi) {%k1}
>  ; AVX512BW-NEXT:    vzeroupper
>  ; AVX512BW-NEXT:    retq
> @@ -4959,47 +5014,50 @@ define void @truncstore_v4i32_v4i16(<4 x
>  define void @truncstore_v4i32_v4i8(<4 x i32> %x, <4 x i8>* %p, <4 x i32> %mask) {
>  ; SSE2-LABEL: truncstore_v4i32_v4i8:
>  ; SSE2:       # %bb.0:
> -; SSE2-NEXT:    pxor %xmm3, %xmm3
> -; SSE2-NEXT:    movdqa {{.*#+}} xmm4 = [2147483648,2147483648,2147483648,2147483648]
> -; SSE2-NEXT:    pxor %xmm0, %xmm4
> -; SSE2-NEXT:    movdqa {{.*#+}} xmm2 = [2147483903,2147483903,2147483903,2147483903]
> -; SSE2-NEXT:    pcmpgtd %xmm4, %xmm2
> -; SSE2-NEXT:    pand %xmm2, %xmm0
> -; SSE2-NEXT:    pandn {{.*}}(%rip), %xmm2
> -; SSE2-NEXT:    por %xmm0, %xmm2
> -; SSE2-NEXT:    pcmpeqd %xmm1, %xmm3
> -; SSE2-NEXT:    movmskps %xmm3, %eax
> -; SSE2-NEXT:    xorl $15, %eax
> -; SSE2-NEXT:    testb $1, %al
> +; SSE2-NEXT:    pxor %xmm2, %xmm2
> +; SSE2-NEXT:    movdqa {{.*#+}} xmm3 = [2147483648,2147483648,2147483648,2147483648]
> +; SSE2-NEXT:    pxor %xmm0, %xmm3
> +; SSE2-NEXT:    movdqa {{.*#+}} xmm4 = [2147483903,2147483903,2147483903,2147483903]
> +; SSE2-NEXT:    pcmpgtd %xmm3, %xmm4
> +; SSE2-NEXT:    pand %xmm4, %xmm0
> +; SSE2-NEXT:    pandn {{.*}}(%rip), %xmm4
> +; SSE2-NEXT:    por %xmm0, %xmm4
> +; SSE2-NEXT:    pand {{.*}}(%rip), %xmm4
> +; SSE2-NEXT:    packuswb %xmm4, %xmm4
> +; SSE2-NEXT:    packuswb %xmm4, %xmm4
> +; SSE2-NEXT:    pcmpeqd %xmm1, %xmm2
> +; SSE2-NEXT:    movmskps %xmm2, %ecx
> +; SSE2-NEXT:    xorl $15, %ecx
> +; SSE2-NEXT:    testb $1, %cl
> +; SSE2-NEXT:    movd %xmm4, %eax
>  ; SSE2-NEXT:    jne .LBB14_1
>  ; SSE2-NEXT:  # %bb.2: # %else
> -; SSE2-NEXT:    testb $2, %al
> +; SSE2-NEXT:    testb $2, %cl
>  ; SSE2-NEXT:    jne .LBB14_3
>  ; SSE2-NEXT:  .LBB14_4: # %else2
> -; SSE2-NEXT:    testb $4, %al
> +; SSE2-NEXT:    testb $4, %cl
>  ; SSE2-NEXT:    jne .LBB14_5
>  ; SSE2-NEXT:  .LBB14_6: # %else4
> -; SSE2-NEXT:    testb $8, %al
> +; SSE2-NEXT:    testb $8, %cl
>  ; SSE2-NEXT:    jne .LBB14_7
>  ; SSE2-NEXT:  .LBB14_8: # %else6
>  ; SSE2-NEXT:    retq
>  ; SSE2-NEXT:  .LBB14_1: # %cond.store
> -; SSE2-NEXT:    movd %xmm2, %ecx
> -; SSE2-NEXT:    movb %cl, (%rdi)
> -; SSE2-NEXT:    testb $2, %al
> +; SSE2-NEXT:    movb %al, (%rdi)
> +; SSE2-NEXT:    testb $2, %cl
>  ; SSE2-NEXT:    je .LBB14_4
>  ; SSE2-NEXT:  .LBB14_3: # %cond.store1
> -; SSE2-NEXT:    pextrw $2, %xmm2, %ecx
> -; SSE2-NEXT:    movb %cl, 1(%rdi)
> -; SSE2-NEXT:    testb $4, %al
> +; SSE2-NEXT:    movb %ah, 1(%rdi)
> +; SSE2-NEXT:    testb $4, %cl
>  ; SSE2-NEXT:    je .LBB14_6
>  ; SSE2-NEXT:  .LBB14_5: # %cond.store3
> -; SSE2-NEXT:    pextrw $4, %xmm2, %ecx
> -; SSE2-NEXT:    movb %cl, 2(%rdi)
> -; SSE2-NEXT:    testb $8, %al
> +; SSE2-NEXT:    movl %eax, %edx
> +; SSE2-NEXT:    shrl $16, %edx
> +; SSE2-NEXT:    movb %dl, 2(%rdi)
> +; SSE2-NEXT:    testb $8, %cl
>  ; SSE2-NEXT:    je .LBB14_8
>  ; SSE2-NEXT:  .LBB14_7: # %cond.store5
> -; SSE2-NEXT:    pextrw $6, %xmm2, %eax
> +; SSE2-NEXT:    shrl $24, %eax
>  ; SSE2-NEXT:    movb %al, 3(%rdi)
>  ; SSE2-NEXT:    retq
>  ;
> @@ -5007,6 +5065,7 @@ define void @truncstore_v4i32_v4i8(<4 x
>  ; SSE4:       # %bb.0:
>  ; SSE4-NEXT:    pxor %xmm2, %xmm2
>  ; SSE4-NEXT:    pminud {{.*}}(%rip), %xmm0
> +; SSE4-NEXT:    pshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
>  ; SSE4-NEXT:    pcmpeqd %xmm1, %xmm2
>  ; SSE4-NEXT:    movmskps %xmm2, %eax
>  ; SSE4-NEXT:    xorl $15, %eax
> @@ -5028,21 +5087,22 @@ define void @truncstore_v4i32_v4i8(<4 x
>  ; SSE4-NEXT:    testb $2, %al
>  ; SSE4-NEXT:    je .LBB14_4
>  ; SSE4-NEXT:  .LBB14_3: # %cond.store1
> -; SSE4-NEXT:    pextrb $4, %xmm0, 1(%rdi)
> +; SSE4-NEXT:    pextrb $1, %xmm0, 1(%rdi)
>  ; SSE4-NEXT:    testb $4, %al
>  ; SSE4-NEXT:    je .LBB14_6
>  ; SSE4-NEXT:  .LBB14_5: # %cond.store3
> -; SSE4-NEXT:    pextrb $8, %xmm0, 2(%rdi)
> +; SSE4-NEXT:    pextrb $2, %xmm0, 2(%rdi)
>  ; SSE4-NEXT:    testb $8, %al
>  ; SSE4-NEXT:    je .LBB14_8
>  ; SSE4-NEXT:  .LBB14_7: # %cond.store5
> -; SSE4-NEXT:    pextrb $12, %xmm0, 3(%rdi)
> +; SSE4-NEXT:    pextrb $3, %xmm0, 3(%rdi)
>  ; SSE4-NEXT:    retq
>  ;
>  ; AVX1-LABEL: truncstore_v4i32_v4i8:
>  ; AVX1:       # %bb.0:
>  ; AVX1-NEXT:    vpxor %xmm2, %xmm2, %xmm2
>  ; AVX1-NEXT:    vpminud {{.*}}(%rip), %xmm0, %xmm0
> +; AVX1-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
>  ; AVX1-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
>  ; AVX1-NEXT:    vmovmskps %xmm1, %eax
>  ; AVX1-NEXT:    xorl $15, %eax
> @@ -5064,15 +5124,15 @@ define void @truncstore_v4i32_v4i8(<4 x
>  ; AVX1-NEXT:    testb $2, %al
>  ; AVX1-NEXT:    je .LBB14_4
>  ; AVX1-NEXT:  .LBB14_3: # %cond.store1
> -; AVX1-NEXT:    vpextrb $4, %xmm0, 1(%rdi)
> +; AVX1-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX1-NEXT:    testb $4, %al
>  ; AVX1-NEXT:    je .LBB14_6
>  ; AVX1-NEXT:  .LBB14_5: # %cond.store3
> -; AVX1-NEXT:    vpextrb $8, %xmm0, 2(%rdi)
> +; AVX1-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
>  ; AVX1-NEXT:    testb $8, %al
>  ; AVX1-NEXT:    je .LBB14_8
>  ; AVX1-NEXT:  .LBB14_7: # %cond.store5
> -; AVX1-NEXT:    vpextrb $12, %xmm0, 3(%rdi)
> +; AVX1-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
>  ; AVX1-NEXT:    retq
>  ;
>  ; AVX2-LABEL: truncstore_v4i32_v4i8:
> @@ -5080,6 +5140,7 @@ define void @truncstore_v4i32_v4i8(<4 x
>  ; AVX2-NEXT:    vpxor %xmm2, %xmm2, %xmm2
>  ; AVX2-NEXT:    vpbroadcastd {{.*#+}} xmm3 = [255,255,255,255]
>  ; AVX2-NEXT:    vpminud %xmm3, %xmm0, %xmm0
> +; AVX2-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
>  ; AVX2-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
>  ; AVX2-NEXT:    vmovmskps %xmm1, %eax
>  ; AVX2-NEXT:    xorl $15, %eax
> @@ -5101,15 +5162,15 @@ define void @truncstore_v4i32_v4i8(<4 x
>  ; AVX2-NEXT:    testb $2, %al
>  ; AVX2-NEXT:    je .LBB14_4
>  ; AVX2-NEXT:  .LBB14_3: # %cond.store1
> -; AVX2-NEXT:    vpextrb $4, %xmm0, 1(%rdi)
> +; AVX2-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX2-NEXT:    testb $4, %al
>  ; AVX2-NEXT:    je .LBB14_6
>  ; AVX2-NEXT:  .LBB14_5: # %cond.store3
> -; AVX2-NEXT:    vpextrb $8, %xmm0, 2(%rdi)
> +; AVX2-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
>  ; AVX2-NEXT:    testb $8, %al
>  ; AVX2-NEXT:    je .LBB14_8
>  ; AVX2-NEXT:  .LBB14_7: # %cond.store5
> -; AVX2-NEXT:    vpextrb $12, %xmm0, 3(%rdi)
> +; AVX2-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
>  ; AVX2-NEXT:    retq
>  ;
>  ; AVX512F-LABEL: truncstore_v4i32_v4i8:
> @@ -5118,6 +5179,7 @@ define void @truncstore_v4i32_v4i8(<4 x
>  ; AVX512F-NEXT:    vptestmd %zmm1, %zmm1, %k0
>  ; AVX512F-NEXT:    vpbroadcastd {{.*#+}} xmm1 = [255,255,255,255]
>  ; AVX512F-NEXT:    vpminud %xmm1, %xmm0, %xmm0
> +; AVX512F-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
>  ; AVX512F-NEXT:    kmovw %k0, %eax
>  ; AVX512F-NEXT:    testb $1, %al
>  ; AVX512F-NEXT:    jne .LBB14_1
> @@ -5138,15 +5200,15 @@ define void @truncstore_v4i32_v4i8(<4 x
>  ; AVX512F-NEXT:    testb $2, %al
>  ; AVX512F-NEXT:    je .LBB14_4
>  ; AVX512F-NEXT:  .LBB14_3: # %cond.store1
> -; AVX512F-NEXT:    vpextrb $4, %xmm0, 1(%rdi)
> +; AVX512F-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX512F-NEXT:    testb $4, %al
>  ; AVX512F-NEXT:    je .LBB14_6
>  ; AVX512F-NEXT:  .LBB14_5: # %cond.store3
> -; AVX512F-NEXT:    vpextrb $8, %xmm0, 2(%rdi)
> +; AVX512F-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
>  ; AVX512F-NEXT:    testb $8, %al
>  ; AVX512F-NEXT:    je .LBB14_8
>  ; AVX512F-NEXT:  .LBB14_7: # %cond.store5
> -; AVX512F-NEXT:    vpextrb $12, %xmm0, 3(%rdi)
> +; AVX512F-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
>  ; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
>  ;
> @@ -5154,11 +5216,11 @@ define void @truncstore_v4i32_v4i8(<4 x
>  ; AVX512BW:       # %bb.0:
>  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
>  ; AVX512BW-NEXT:    vptestmd %zmm1, %zmm1, %k0
> +; AVX512BW-NEXT:    kshiftlq $60, %k0, %k0
> +; AVX512BW-NEXT:    kshiftrq $60, %k0, %k1
>  ; AVX512BW-NEXT:    vpbroadcastd {{.*#+}} xmm1 = [255,255,255,255]
>  ; AVX512BW-NEXT:    vpminud %xmm1, %xmm0, %xmm0
>  ; AVX512BW-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
> -; AVX512BW-NEXT:    kshiftlq $60, %k0, %k0
> -; AVX512BW-NEXT:    kshiftrq $60, %k0, %k1
>  ; AVX512BW-NEXT:    vmovdqu8 %zmm0, (%rdi) {%k1}
>  ; AVX512BW-NEXT:    vzeroupper
>  ; AVX512BW-NEXT:    retq
> @@ -7041,10 +7103,10 @@ define void @truncstore_v8i16_v8i8(<8 x
>  ; SSE2-LABEL: truncstore_v8i16_v8i8:
>  ; SSE2:       # %bb.0:
>  ; SSE2-NEXT:    pxor %xmm2, %xmm2
> -; SSE2-NEXT:    movdqa {{.*#+}} xmm3 = [32768,32768,32768,32768,32768,32768,32768,32768]
> -; SSE2-NEXT:    pxor %xmm3, %xmm0
> +; SSE2-NEXT:    pxor {{.*}}(%rip), %xmm0
>  ; SSE2-NEXT:    pminsw {{.*}}(%rip), %xmm0
> -; SSE2-NEXT:    pxor %xmm3, %xmm0
> +; SSE2-NEXT:    pand {{.*}}(%rip), %xmm0
> +; SSE2-NEXT:    packuswb %xmm0, %xmm0
>  ; SSE2-NEXT:    pcmpeqw %xmm1, %xmm2
>  ; SSE2-NEXT:    pcmpeqd %xmm1, %xmm1
>  ; SSE2-NEXT:    pxor %xmm2, %xmm1
> @@ -7061,17 +7123,26 @@ define void @truncstore_v8i16_v8i8(<8 x
>  ; SSE2-NEXT:    jne .LBB17_5
>  ; SSE2-NEXT:  .LBB17_6: # %else4
>  ; SSE2-NEXT:    testb $8, %al
> -; SSE2-NEXT:    jne .LBB17_7
> +; SSE2-NEXT:    je .LBB17_8
> +; SSE2-NEXT:  .LBB17_7: # %cond.store5
> +; SSE2-NEXT:    shrl $24, %ecx
> +; SSE2-NEXT:    movb %cl, 3(%rdi)
>  ; SSE2-NEXT:  .LBB17_8: # %else6
>  ; SSE2-NEXT:    testb $16, %al
> -; SSE2-NEXT:    jne .LBB17_9
> +; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> +; SSE2-NEXT:    je .LBB17_10
> +; SSE2-NEXT:  # %bb.9: # %cond.store7
> +; SSE2-NEXT:    movb %cl, 4(%rdi)
>  ; SSE2-NEXT:  .LBB17_10: # %else8
>  ; SSE2-NEXT:    testb $32, %al
> -; SSE2-NEXT:    jne .LBB17_11
> +; SSE2-NEXT:    je .LBB17_12
> +; SSE2-NEXT:  # %bb.11: # %cond.store9
> +; SSE2-NEXT:    movb %ch, 5(%rdi)
>  ; SSE2-NEXT:  .LBB17_12: # %else10
>  ; SSE2-NEXT:    testb $64, %al
> +; SSE2-NEXT:    pextrw $3, %xmm0, %ecx
>  ; SSE2-NEXT:    jne .LBB17_13
> -; SSE2-NEXT:  .LBB17_14: # %else12
> +; SSE2-NEXT:  # %bb.14: # %else12
>  ; SSE2-NEXT:    testb $-128, %al
>  ; SSE2-NEXT:    jne .LBB17_15
>  ; SSE2-NEXT:  .LBB17_16: # %else14
> @@ -7081,44 +7152,29 @@ define void @truncstore_v8i16_v8i8(<8 x
>  ; SSE2-NEXT:    testb $2, %al
>  ; SSE2-NEXT:    je .LBB17_4
>  ; SSE2-NEXT:  .LBB17_3: # %cond.store1
> -; SSE2-NEXT:    shrl $16, %ecx
> -; SSE2-NEXT:    movb %cl, 1(%rdi)
> +; SSE2-NEXT:    movb %ch, 1(%rdi)
>  ; SSE2-NEXT:    testb $4, %al
>  ; SSE2-NEXT:    je .LBB17_6
>  ; SSE2-NEXT:  .LBB17_5: # %cond.store3
> -; SSE2-NEXT:    pextrw $2, %xmm0, %ecx
> -; SSE2-NEXT:    movb %cl, 2(%rdi)
> +; SSE2-NEXT:    movl %ecx, %edx
> +; SSE2-NEXT:    shrl $16, %edx
> +; SSE2-NEXT:    movb %dl, 2(%rdi)
>  ; SSE2-NEXT:    testb $8, %al
> -; SSE2-NEXT:    je .LBB17_8
> -; SSE2-NEXT:  .LBB17_7: # %cond.store5
> -; SSE2-NEXT:    pextrw $3, %xmm0, %ecx
> -; SSE2-NEXT:    movb %cl, 3(%rdi)
> -; SSE2-NEXT:    testb $16, %al
> -; SSE2-NEXT:    je .LBB17_10
> -; SSE2-NEXT:  .LBB17_9: # %cond.store7
> -; SSE2-NEXT:    pextrw $4, %xmm0, %ecx
> -; SSE2-NEXT:    movb %cl, 4(%rdi)
> -; SSE2-NEXT:    testb $32, %al
> -; SSE2-NEXT:    je .LBB17_12
> -; SSE2-NEXT:  .LBB17_11: # %cond.store9
> -; SSE2-NEXT:    pextrw $5, %xmm0, %ecx
> -; SSE2-NEXT:    movb %cl, 5(%rdi)
> -; SSE2-NEXT:    testb $64, %al
> -; SSE2-NEXT:    je .LBB17_14
> +; SSE2-NEXT:    jne .LBB17_7
> +; SSE2-NEXT:    jmp .LBB17_8
>  ; SSE2-NEXT:  .LBB17_13: # %cond.store11
> -; SSE2-NEXT:    pextrw $6, %xmm0, %ecx
>  ; SSE2-NEXT:    movb %cl, 6(%rdi)
>  ; SSE2-NEXT:    testb $-128, %al
>  ; SSE2-NEXT:    je .LBB17_16
>  ; SSE2-NEXT:  .LBB17_15: # %cond.store13
> -; SSE2-NEXT:    pextrw $7, %xmm0, %eax
> -; SSE2-NEXT:    movb %al, 7(%rdi)
> +; SSE2-NEXT:    movb %ch, 7(%rdi)
>  ; SSE2-NEXT:    retq
>  ;
>  ; SSE4-LABEL: truncstore_v8i16_v8i8:
>  ; SSE4:       # %bb.0:
>  ; SSE4-NEXT:    pxor %xmm2, %xmm2
>  ; SSE4-NEXT:    pminuw {{.*}}(%rip), %xmm0
> +; SSE4-NEXT:    packuswb %xmm0, %xmm0
>  ; SSE4-NEXT:    pcmpeqw %xmm1, %xmm2
>  ; SSE4-NEXT:    pcmpeqd %xmm1, %xmm1
>  ; SSE4-NEXT:    pxor %xmm2, %xmm1
> @@ -7154,37 +7210,38 @@ define void @truncstore_v8i16_v8i8(<8 x
>  ; SSE4-NEXT:    testb $2, %al
>  ; SSE4-NEXT:    je .LBB17_4
>  ; SSE4-NEXT:  .LBB17_3: # %cond.store1
> -; SSE4-NEXT:    pextrb $2, %xmm0, 1(%rdi)
> +; SSE4-NEXT:    pextrb $1, %xmm0, 1(%rdi)
>  ; SSE4-NEXT:    testb $4, %al
>  ; SSE4-NEXT:    je .LBB17_6
>  ; SSE4-NEXT:  .LBB17_5: # %cond.store3
> -; SSE4-NEXT:    pextrb $4, %xmm0, 2(%rdi)
> +; SSE4-NEXT:    pextrb $2, %xmm0, 2(%rdi)
>  ; SSE4-NEXT:    testb $8, %al
>  ; SSE4-NEXT:    je .LBB17_8
>  ; SSE4-NEXT:  .LBB17_7: # %cond.store5
> -; SSE4-NEXT:    pextrb $6, %xmm0, 3(%rdi)
> +; SSE4-NEXT:    pextrb $3, %xmm0, 3(%rdi)
>  ; SSE4-NEXT:    testb $16, %al
>  ; SSE4-NEXT:    je .LBB17_10
>  ; SSE4-NEXT:  .LBB17_9: # %cond.store7
> -; SSE4-NEXT:    pextrb $8, %xmm0, 4(%rdi)
> +; SSE4-NEXT:    pextrb $4, %xmm0, 4(%rdi)
>  ; SSE4-NEXT:    testb $32, %al
>  ; SSE4-NEXT:    je .LBB17_12
>  ; SSE4-NEXT:  .LBB17_11: # %cond.store9
> -; SSE4-NEXT:    pextrb $10, %xmm0, 5(%rdi)
> +; SSE4-NEXT:    pextrb $5, %xmm0, 5(%rdi)
>  ; SSE4-NEXT:    testb $64, %al
>  ; SSE4-NEXT:    je .LBB17_14
>  ; SSE4-NEXT:  .LBB17_13: # %cond.store11
> -; SSE4-NEXT:    pextrb $12, %xmm0, 6(%rdi)
> +; SSE4-NEXT:    pextrb $6, %xmm0, 6(%rdi)
>  ; SSE4-NEXT:    testb $-128, %al
>  ; SSE4-NEXT:    je .LBB17_16
>  ; SSE4-NEXT:  .LBB17_15: # %cond.store13
> -; SSE4-NEXT:    pextrb $14, %xmm0, 7(%rdi)
> +; SSE4-NEXT:    pextrb $7, %xmm0, 7(%rdi)
>  ; SSE4-NEXT:    retq
>  ;
>  ; AVX-LABEL: truncstore_v8i16_v8i8:
>  ; AVX:       # %bb.0:
>  ; AVX-NEXT:    vpxor %xmm2, %xmm2, %xmm2
>  ; AVX-NEXT:    vpminuw {{.*}}(%rip), %xmm0, %xmm0
> +; AVX-NEXT:    vpackuswb %xmm0, %xmm0, %xmm0
>  ; AVX-NEXT:    vpcmpeqw %xmm2, %xmm1, %xmm1
>  ; AVX-NEXT:    vpcmpeqd %xmm2, %xmm2, %xmm2
>  ; AVX-NEXT:    vpxor %xmm2, %xmm1, %xmm1
> @@ -7220,31 +7277,31 @@ define void @truncstore_v8i16_v8i8(<8 x
>  ; AVX-NEXT:    testb $2, %al
>  ; AVX-NEXT:    je .LBB17_4
>  ; AVX-NEXT:  .LBB17_3: # %cond.store1
> -; AVX-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> +; AVX-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX-NEXT:    testb $4, %al
>  ; AVX-NEXT:    je .LBB17_6
>  ; AVX-NEXT:  .LBB17_5: # %cond.store3
> -; AVX-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> +; AVX-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
>  ; AVX-NEXT:    testb $8, %al
>  ; AVX-NEXT:    je .LBB17_8
>  ; AVX-NEXT:  .LBB17_7: # %cond.store5
> -; AVX-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> +; AVX-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
>  ; AVX-NEXT:    testb $16, %al
>  ; AVX-NEXT:    je .LBB17_10
>  ; AVX-NEXT:  .LBB17_9: # %cond.store7
> -; AVX-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> +; AVX-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
>  ; AVX-NEXT:    testb $32, %al
>  ; AVX-NEXT:    je .LBB17_12
>  ; AVX-NEXT:  .LBB17_11: # %cond.store9
> -; AVX-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> +; AVX-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
>  ; AVX-NEXT:    testb $64, %al
>  ; AVX-NEXT:    je .LBB17_14
>  ; AVX-NEXT:  .LBB17_13: # %cond.store11
> -; AVX-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> +; AVX-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
>  ; AVX-NEXT:    testb $-128, %al
>  ; AVX-NEXT:    je .LBB17_16
>  ; AVX-NEXT:  .LBB17_15: # %cond.store13
> -; AVX-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> +; AVX-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
>  ; AVX-NEXT:    retq
>  ;
>  ; AVX512F-LABEL: truncstore_v8i16_v8i8:
> @@ -7255,6 +7312,7 @@ define void @truncstore_v8i16_v8i8(<8 x
>  ; AVX512F-NEXT:    vpmovsxwq %xmm1, %zmm1
>  ; AVX512F-NEXT:    vptestmq %zmm1, %zmm1, %k0
>  ; AVX512F-NEXT:    vpminuw {{.*}}(%rip), %xmm0, %xmm0
> +; AVX512F-NEXT:    vpackuswb %xmm0, %xmm0, %xmm0
>  ; AVX512F-NEXT:    kmovw %k0, %eax
>  ; AVX512F-NEXT:    testb $1, %al
>  ; AVX512F-NEXT:    jne .LBB17_1
> @@ -7287,31 +7345,31 @@ define void @truncstore_v8i16_v8i8(<8 x
>  ; AVX512F-NEXT:    testb $2, %al
>  ; AVX512F-NEXT:    je .LBB17_4
>  ; AVX512F-NEXT:  .LBB17_3: # %cond.store1
> -; AVX512F-NEXT:    vpextrb $2, %xmm0, 1(%rdi)
> +; AVX512F-NEXT:    vpextrb $1, %xmm0, 1(%rdi)
>  ; AVX512F-NEXT:    testb $4, %al
>  ; AVX512F-NEXT:    je .LBB17_6
>  ; AVX512F-NEXT:  .LBB17_5: # %cond.store3
> -; AVX512F-NEXT:    vpextrb $4, %xmm0, 2(%rdi)
> +; AVX512F-NEXT:    vpextrb $2, %xmm0, 2(%rdi)
>  ; AVX512F-NEXT:    testb $8, %al
>  ; AVX512F-NEXT:    je .LBB17_8
>  ; AVX512F-NEXT:  .LBB17_7: # %cond.store5
> -; AVX512F-NEXT:    vpextrb $6, %xmm0, 3(%rdi)
> +; AVX512F-NEXT:    vpextrb $3, %xmm0, 3(%rdi)
>  ; AVX512F-NEXT:    testb $16, %al
>  ; AVX512F-NEXT:    je .LBB17_10
>  ; AVX512F-NEXT:  .LBB17_9: # %cond.store7
> -; AVX512F-NEXT:    vpextrb $8, %xmm0, 4(%rdi)
> +; AVX512F-NEXT:    vpextrb $4, %xmm0, 4(%rdi)
>  ; AVX512F-NEXT:    testb $32, %al
>  ; AVX512F-NEXT:    je .LBB17_12
>  ; AVX512F-NEXT:  .LBB17_11: # %cond.store9
> -; AVX512F-NEXT:    vpextrb $10, %xmm0, 5(%rdi)
> +; AVX512F-NEXT:    vpextrb $5, %xmm0, 5(%rdi)
>  ; AVX512F-NEXT:    testb $64, %al
>  ; AVX512F-NEXT:    je .LBB17_14
>  ; AVX512F-NEXT:  .LBB17_13: # %cond.store11
> -; AVX512F-NEXT:    vpextrb $12, %xmm0, 6(%rdi)
> +; AVX512F-NEXT:    vpextrb $6, %xmm0, 6(%rdi)
>  ; AVX512F-NEXT:    testb $-128, %al
>  ; AVX512F-NEXT:    je .LBB17_16
>  ; AVX512F-NEXT:  .LBB17_15: # %cond.store13
> -; AVX512F-NEXT:    vpextrb $14, %xmm0, 7(%rdi)
> +; AVX512F-NEXT:    vpextrb $7, %xmm0, 7(%rdi)
>  ; AVX512F-NEXT:    vzeroupper
>  ; AVX512F-NEXT:    retq
>  ;
> @@ -7319,10 +7377,10 @@ define void @truncstore_v8i16_v8i8(<8 x
>  ; AVX512BW:       # %bb.0:
>  ; AVX512BW-NEXT:    # kill: def $xmm1 killed $xmm1 def $zmm1
>  ; AVX512BW-NEXT:    vptestmw %zmm1, %zmm1, %k0
> -; AVX512BW-NEXT:    vpminuw {{.*}}(%rip), %xmm0, %xmm0
> -; AVX512BW-NEXT:    vpackuswb %xmm0, %xmm0, %xmm0
>  ; AVX512BW-NEXT:    kshiftlq $56, %k0, %k0
>  ; AVX512BW-NEXT:    kshiftrq $56, %k0, %k1
> +; AVX512BW-NEXT:    vpminuw {{.*}}(%rip), %xmm0, %xmm0
> +; AVX512BW-NEXT:    vpackuswb %xmm0, %xmm0, %xmm0
>  ; AVX512BW-NEXT:    vmovdqu8 %zmm0, (%rdi) {%k1}
>  ; AVX512BW-NEXT:    vzeroupper
>  ; AVX512BW-NEXT:    retq
>
> Modified: llvm/trunk/test/CodeGen/X86/merge-consecutive-loads-256.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/merge-consecutive-loads-256.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/merge-consecutive-loads-256.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/merge-consecutive-loads-256.ll Wed Aug  7 09:24:26 2019
> @@ -676,18 +676,18 @@ define <16 x i16> @merge_16i16_i16_0uu3z
>  define <2 x i8> @PR42846(<2 x i8>* %j, <2 x i8> %k) {
>  ; AVX-LABEL: PR42846:
>  ; AVX:       # %bb.0:
> -; AVX-NEXT:    vmovdqa {{.*}}(%rip), %ymm1
> -; AVX-NEXT:    vpmovzxbq {{.*#+}} xmm0 = xmm1[0],zero,zero,zero,zero,zero,zero,zero,xmm1[1],zero,zero,zero,zero,zero,zero,zero
> -; AVX-NEXT:    vpextrw $0, %xmm1, (%rdi)
> +; AVX-NEXT:    vmovdqa {{.*}}(%rip), %ymm0
> +; AVX-NEXT:    vpextrw $0, %xmm0, (%rdi)
> +; AVX-NEXT:    # kill: def $xmm0 killed $xmm0 killed $ymm0
>  ; AVX-NEXT:    vzeroupper
>  ; AVX-NEXT:    retq
>  ;
>  ; X32-AVX-LABEL: PR42846:
>  ; X32-AVX:       # %bb.0:
>  ; X32-AVX-NEXT:    movl {{[0-9]+}}(%esp), %eax
> -; X32-AVX-NEXT:    vmovdqa l, %ymm1
> -; X32-AVX-NEXT:    vpmovzxbq {{.*#+}} xmm0 = xmm1[0],zero,zero,zero,zero,zero,zero,zero,xmm1[1],zero,zero,zero,zero,zero,zero,zero
> -; X32-AVX-NEXT:    vpextrw $0, %xmm1, (%eax)
> +; X32-AVX-NEXT:    vmovdqa l, %ymm0
> +; X32-AVX-NEXT:    vpextrw $0, %xmm0, (%eax)
> +; X32-AVX-NEXT:    # kill: def $xmm0 killed $xmm0 killed $ymm0
>  ; X32-AVX-NEXT:    vzeroupper
>  ; X32-AVX-NEXT:    retl
>    %t0 = load volatile <32 x i8>, <32 x i8>* @l, align 32
>
> Modified: llvm/trunk/test/CodeGen/X86/mmx-arg-passing-x86-64.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/mmx-arg-passing-x86-64.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/mmx-arg-passing-x86-64.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/mmx-arg-passing-x86-64.ll Wed Aug  7 09:24:26 2019
> @@ -22,13 +22,12 @@ define void @t3() nounwind  {
>  define void @t4(x86_mmx %v1, x86_mmx %v2) nounwind  {
>  ; X86-64-LABEL: t4:
>  ; X86-64:       ## %bb.0:
> -; X86-64-NEXT:    movdq2q %xmm1, %mm0
> -; X86-64-NEXT:    movq %mm0, -{{[0-9]+}}(%rsp)
>  ; X86-64-NEXT:    movdq2q %xmm0, %mm0
>  ; X86-64-NEXT:    movq %mm0, -{{[0-9]+}}(%rsp)
> -; X86-64-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
> -; X86-64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> -; X86-64-NEXT:    paddb %xmm1, %xmm0
> +; X86-64-NEXT:    movdq2q %xmm1, %mm0
> +; X86-64-NEXT:    movq %mm0, -{{[0-9]+}}(%rsp)
> +; X86-64-NEXT:    movdqa -{{[0-9]+}}(%rsp), %xmm0
> +; X86-64-NEXT:    paddb -{{[0-9]+}}(%rsp), %xmm0
>  ; X86-64-NEXT:    movb $1, %al
>  ; X86-64-NEXT:    jmp _pass_v8qi ## TAILCALL
>    %v1a = bitcast x86_mmx %v1 to <8 x i8>
>
> Modified: llvm/trunk/test/CodeGen/X86/mmx-arith.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/mmx-arith.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/mmx-arith.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/mmx-arith.ll Wed Aug  7 09:24:26 2019
> @@ -13,8 +13,8 @@ define void @test0(x86_mmx* %A, x86_mmx*
>  ; X32-NEXT:    .cfi_offset %ebp, -8
>  ; X32-NEXT:    movl %esp, %ebp
>  ; X32-NEXT:    .cfi_def_cfa_register %ebp
> -; X32-NEXT:    andl $-8, %esp
> -; X32-NEXT:    subl $16, %esp
> +; X32-NEXT:    andl $-16, %esp
> +; X32-NEXT:    subl $48, %esp
>  ; X32-NEXT:    movl 12(%ebp), %ecx
>  ; X32-NEXT:    movl 8(%ebp), %eax
>  ; X32-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> @@ -26,7 +26,7 @@ define void @test0(x86_mmx* %A, x86_mmx*
>  ; X32-NEXT:    movq %mm0, (%eax)
>  ; X32-NEXT:    paddusb (%ecx), %mm0
>  ; X32-NEXT:    movq %mm0, {{[0-9]+}}(%esp)
> -; X32-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> +; X32-NEXT:    movdqa {{[0-9]+}}(%esp), %xmm0
>  ; X32-NEXT:    movq %mm0, (%eax)
>  ; X32-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
>  ; X32-NEXT:    psubb %xmm1, %xmm0
> @@ -36,37 +36,24 @@ define void @test0(x86_mmx* %A, x86_mmx*
>  ; X32-NEXT:    movq %mm0, (%eax)
>  ; X32-NEXT:    psubusb (%ecx), %mm0
>  ; X32-NEXT:    movq %mm0, (%esp)
> -; X32-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> -; X32-NEXT:    punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
> +; X32-NEXT:    movdqa (%esp), %xmm0
>  ; X32-NEXT:    movq %mm0, (%eax)
>  ; X32-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
>  ; X32-NEXT:    punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3],xmm1[4],xmm0[4],xmm1[5],xmm0[5],xmm1[6],xmm0[6],xmm1[7],xmm0[7]
> -; X32-NEXT:    pmullw %xmm0, %xmm1
> -; X32-NEXT:    movdqa {{.*#+}} xmm0 = [255,0,255,0,255,0,255,0,255,0,255,0,255,0,255,0]
> -; X32-NEXT:    movdqa %xmm1, %xmm2
> -; X32-NEXT:    pand %xmm0, %xmm2
> -; X32-NEXT:    packuswb %xmm0, %xmm2
> -; X32-NEXT:    movq %xmm2, (%eax)
> -; X32-NEXT:    movq {{.*#+}} xmm2 = mem[0],zero
> -; X32-NEXT:    punpcklbw {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1],xmm2[2],xmm0[2],xmm2[3],xmm0[3],xmm2[4],xmm0[4],xmm2[5],xmm0[5],xmm2[6],xmm0[6],xmm2[7],xmm0[7]
> -; X32-NEXT:    pand %xmm1, %xmm2
> -; X32-NEXT:    movdqa %xmm2, %xmm1
> +; X32-NEXT:    punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
> +; X32-NEXT:    pmullw %xmm1, %xmm0
> +; X32-NEXT:    pand {{\.LCPI.*}}, %xmm0
> +; X32-NEXT:    packuswb %xmm0, %xmm0
> +; X32-NEXT:    movq %xmm0, (%eax)
> +; X32-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
>  ; X32-NEXT:    pand %xmm0, %xmm1
> -; X32-NEXT:    packuswb %xmm0, %xmm1
>  ; X32-NEXT:    movq %xmm1, (%eax)
> +; X32-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> +; X32-NEXT:    por %xmm1, %xmm0
> +; X32-NEXT:    movq %xmm0, (%eax)
>  ; X32-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
> -; X32-NEXT:    punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3],xmm1[4],xmm0[4],xmm1[5],xmm0[5],xmm1[6],xmm0[6],xmm1[7],xmm0[7]
> -; X32-NEXT:    por %xmm2, %xmm1
> -; X32-NEXT:    movdqa %xmm1, %xmm2
> -; X32-NEXT:    pand %xmm0, %xmm2
> -; X32-NEXT:    packuswb %xmm0, %xmm2
> -; X32-NEXT:    movq %xmm2, (%eax)
> -; X32-NEXT:    movq {{.*#+}} xmm2 = mem[0],zero
> -; X32-NEXT:    punpcklbw {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1],xmm2[2],xmm0[2],xmm2[3],xmm0[3],xmm2[4],xmm0[4],xmm2[5],xmm0[5],xmm2[6],xmm0[6],xmm2[7],xmm0[7]
> -; X32-NEXT:    pxor %xmm1, %xmm2
> -; X32-NEXT:    pand %xmm0, %xmm2
> -; X32-NEXT:    packuswb %xmm0, %xmm2
> -; X32-NEXT:    movq %xmm2, (%eax)
> +; X32-NEXT:    pxor %xmm0, %xmm1
> +; X32-NEXT:    movq %xmm1, (%eax)
>  ; X32-NEXT:    emms
>  ; X32-NEXT:    movl %ebp, %esp
>  ; X32-NEXT:    popl %ebp
> @@ -84,7 +71,7 @@ define void @test0(x86_mmx* %A, x86_mmx*
>  ; X64-NEXT:    movq %mm0, (%rdi)
>  ; X64-NEXT:    paddusb (%rsi), %mm0
>  ; X64-NEXT:    movq %mm0, -{{[0-9]+}}(%rsp)
> -; X64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> +; X64-NEXT:    movdqa -{{[0-9]+}}(%rsp), %xmm0
>  ; X64-NEXT:    movq %mm0, (%rdi)
>  ; X64-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
>  ; X64-NEXT:    psubb %xmm1, %xmm0
> @@ -94,37 +81,24 @@ define void @test0(x86_mmx* %A, x86_mmx*
>  ; X64-NEXT:    movq %mm0, (%rdi)
>  ; X64-NEXT:    psubusb (%rsi), %mm0
>  ; X64-NEXT:    movq %mm0, -{{[0-9]+}}(%rsp)
> -; X64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> -; X64-NEXT:    punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
> +; X64-NEXT:    movdqa -{{[0-9]+}}(%rsp), %xmm0
>  ; X64-NEXT:    movq %mm0, (%rdi)
>  ; X64-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
>  ; X64-NEXT:    punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3],xmm1[4],xmm0[4],xmm1[5],xmm0[5],xmm1[6],xmm0[6],xmm1[7],xmm0[7]
> -; X64-NEXT:    pmullw %xmm0, %xmm1
> -; X64-NEXT:    movdqa {{.*#+}} xmm0 = [255,0,255,0,255,0,255,0,255,0,255,0,255,0,255,0]
> -; X64-NEXT:    movdqa %xmm1, %xmm2
> -; X64-NEXT:    pand %xmm0, %xmm2
> -; X64-NEXT:    packuswb %xmm0, %xmm2
> -; X64-NEXT:    movq %xmm2, (%rdi)
> -; X64-NEXT:    movq {{.*#+}} xmm2 = mem[0],zero
> -; X64-NEXT:    punpcklbw {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1],xmm2[2],xmm0[2],xmm2[3],xmm0[3],xmm2[4],xmm0[4],xmm2[5],xmm0[5],xmm2[6],xmm0[6],xmm2[7],xmm0[7]
> -; X64-NEXT:    pand %xmm1, %xmm2
> -; X64-NEXT:    movdqa %xmm2, %xmm1
> +; X64-NEXT:    punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
> +; X64-NEXT:    pmullw %xmm1, %xmm0
> +; X64-NEXT:    pand {{.*}}(%rip), %xmm0
> +; X64-NEXT:    packuswb %xmm0, %xmm0
> +; X64-NEXT:    movq %xmm0, (%rdi)
> +; X64-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
>  ; X64-NEXT:    pand %xmm0, %xmm1
> -; X64-NEXT:    packuswb %xmm0, %xmm1
>  ; X64-NEXT:    movq %xmm1, (%rdi)
> +; X64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> +; X64-NEXT:    por %xmm1, %xmm0
> +; X64-NEXT:    movq %xmm0, (%rdi)
>  ; X64-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
> -; X64-NEXT:    punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3],xmm1[4],xmm0[4],xmm1[5],xmm0[5],xmm1[6],xmm0[6],xmm1[7],xmm0[7]
> -; X64-NEXT:    por %xmm2, %xmm1
> -; X64-NEXT:    movdqa %xmm1, %xmm2
> -; X64-NEXT:    pand %xmm0, %xmm2
> -; X64-NEXT:    packuswb %xmm0, %xmm2
> -; X64-NEXT:    movq %xmm2, (%rdi)
> -; X64-NEXT:    movq {{.*#+}} xmm2 = mem[0],zero
> -; X64-NEXT:    punpcklbw {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1],xmm2[2],xmm0[2],xmm2[3],xmm0[3],xmm2[4],xmm0[4],xmm2[5],xmm0[5],xmm2[6],xmm0[6],xmm2[7],xmm0[7]
> -; X64-NEXT:    pxor %xmm1, %xmm2
> -; X64-NEXT:    pand %xmm0, %xmm2
> -; X64-NEXT:    packuswb %xmm0, %xmm2
> -; X64-NEXT:    movq %xmm2, (%rdi)
> +; X64-NEXT:    pxor %xmm0, %xmm1
> +; X64-NEXT:    movq %xmm1, (%rdi)
>  ; X64-NEXT:    emms
>  ; X64-NEXT:    retq
>  entry:
> @@ -182,66 +156,56 @@ entry:
>  define void @test1(x86_mmx* %A, x86_mmx* %B) {
>  ; X32-LABEL: test1:
>  ; X32:       # %bb.0: # %entry
> -; X32-NEXT:    movl {{[0-9]+}}(%esp), %ecx
>  ; X32-NEXT:    movl {{[0-9]+}}(%esp), %eax
> -; X32-NEXT:    movsd {{.*#+}} xmm0 = mem[0],zero
> -; X32-NEXT:    shufps {{.*#+}} xmm0 = xmm0[0,1,1,3]
> -; X32-NEXT:    movsd {{.*#+}} xmm1 = mem[0],zero
> -; X32-NEXT:    shufps {{.*#+}} xmm1 = xmm1[0,1,1,3]
> -; X32-NEXT:    paddq %xmm0, %xmm1
> -; X32-NEXT:    pshufd {{.*#+}} xmm0 = xmm1[0,2,2,3]
> -; X32-NEXT:    movq %xmm0, (%eax)
> -; X32-NEXT:    movsd {{.*#+}} xmm0 = mem[0],zero
> -; X32-NEXT:    shufps {{.*#+}} xmm0 = xmm0[0,1,1,3]
> -; X32-NEXT:    pmuludq %xmm1, %xmm0
> -; X32-NEXT:    pshufd {{.*#+}} xmm1 = xmm0[0,2,2,3]
> -; X32-NEXT:    movq %xmm1, (%eax)
> -; X32-NEXT:    movsd {{.*#+}} xmm1 = mem[0],zero
> -; X32-NEXT:    shufps {{.*#+}} xmm1 = xmm1[0,1,1,3]
> -; X32-NEXT:    andps %xmm0, %xmm1
> -; X32-NEXT:    pshufd {{.*#+}} xmm0 = xmm1[0,2,2,3]
> -; X32-NEXT:    movq %xmm0, (%eax)
> -; X32-NEXT:    movsd {{.*#+}} xmm0 = mem[0],zero
> -; X32-NEXT:    shufps {{.*#+}} xmm0 = xmm0[0,1,1,3]
> -; X32-NEXT:    orps %xmm1, %xmm0
> -; X32-NEXT:    pshufd {{.*#+}} xmm1 = xmm0[0,2,2,3]
> -; X32-NEXT:    movq %xmm1, (%eax)
> -; X32-NEXT:    movsd {{.*#+}} xmm1 = mem[0],zero
> -; X32-NEXT:    shufps {{.*#+}} xmm1 = xmm1[0,1,1,3]
> -; X32-NEXT:    xorps %xmm0, %xmm1
> -; X32-NEXT:    pshufd {{.*#+}} xmm0 = xmm1[0,2,2,3]
> -; X32-NEXT:    movq %xmm0, (%eax)
> +; X32-NEXT:    movl {{[0-9]+}}(%esp), %ecx
> +; X32-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> +; X32-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
> +; X32-NEXT:    paddd %xmm0, %xmm1
> +; X32-NEXT:    movq %xmm1, (%ecx)
> +; X32-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> +; X32-NEXT:    pshufd {{.*#+}} xmm2 = xmm1[1,1,3,3]
> +; X32-NEXT:    pmuludq %xmm0, %xmm1
> +; X32-NEXT:    shufps {{.*#+}} xmm0 = xmm0[1,1,2,3]
> +; X32-NEXT:    pmuludq %xmm0, %xmm2
> +; X32-NEXT:    pshufd {{.*#+}} xmm0 = xmm2[0,2,2,3]
> +; X32-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
> +; X32-NEXT:    punpckldq {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1]
> +; X32-NEXT:    movq %xmm1, (%ecx)
> +; X32-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> +; X32-NEXT:    pand %xmm1, %xmm0
> +; X32-NEXT:    movq %xmm0, (%ecx)
> +; X32-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
> +; X32-NEXT:    por %xmm0, %xmm1
> +; X32-NEXT:    movq %xmm1, (%ecx)
> +; X32-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> +; X32-NEXT:    pxor %xmm1, %xmm0
> +; X32-NEXT:    movq %xmm0, (%ecx)
>  ; X32-NEXT:    emms
>  ; X32-NEXT:    retl
>  ;
>  ; X64-LABEL: test1:
>  ; X64:       # %bb.0: # %entry
>  ; X64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> -; X64-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
>  ; X64-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
> -; X64-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[0,1,1,3]
> -; X64-NEXT:    paddq %xmm0, %xmm1
> -; X64-NEXT:    pshufd {{.*#+}} xmm0 = xmm1[0,2,2,3]
> -; X64-NEXT:    movq %xmm0, (%rdi)
> -; X64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> -; X64-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
> -; X64-NEXT:    pmuludq %xmm1, %xmm0
> -; X64-NEXT:    pshufd {{.*#+}} xmm1 = xmm0[0,2,2,3]
> +; X64-NEXT:    paddd %xmm0, %xmm1
>  ; X64-NEXT:    movq %xmm1, (%rdi)
> -; X64-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
> -; X64-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[0,1,1,3]
> -; X64-NEXT:    pand %xmm0, %xmm1
> -; X64-NEXT:    pshufd {{.*#+}} xmm0 = xmm1[0,2,2,3]
> -; X64-NEXT:    movq %xmm0, (%rdi)
>  ; X64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> -; X64-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
> -; X64-NEXT:    por %xmm1, %xmm0
> -; X64-NEXT:    pshufd {{.*#+}} xmm1 = xmm0[0,2,2,3]
> +; X64-NEXT:    pshufd {{.*#+}} xmm2 = xmm1[1,1,3,3]
> +; X64-NEXT:    pmuludq %xmm0, %xmm1
> +; X64-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
> +; X64-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,1,3,3]
> +; X64-NEXT:    pmuludq %xmm2, %xmm0
> +; X64-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> +; X64-NEXT:    punpckldq {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1]
>  ; X64-NEXT:    movq %xmm1, (%rdi)
> +; X64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> +; X64-NEXT:    pand %xmm1, %xmm0
> +; X64-NEXT:    movq %xmm0, (%rdi)
>  ; X64-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
> -; X64-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[0,1,1,3]
> -; X64-NEXT:    pxor %xmm0, %xmm1
> -; X64-NEXT:    pshufd {{.*#+}} xmm0 = xmm1[0,2,2,3]
> +; X64-NEXT:    por %xmm0, %xmm1
> +; X64-NEXT:    movq %xmm1, (%rdi)
> +; X64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> +; X64-NEXT:    pxor %xmm1, %xmm0
>  ; X64-NEXT:    movq %xmm0, (%rdi)
>  ; X64-NEXT:    emms
>  ; X64-NEXT:    retq
> @@ -294,8 +258,8 @@ define void @test2(x86_mmx* %A, x86_mmx*
>  ; X32-NEXT:    .cfi_offset %ebp, -8
>  ; X32-NEXT:    movl %esp, %ebp
>  ; X32-NEXT:    .cfi_def_cfa_register %ebp
> -; X32-NEXT:    andl $-8, %esp
> -; X32-NEXT:    subl $24, %esp
> +; X32-NEXT:    andl $-16, %esp
> +; X32-NEXT:    subl $64, %esp
>  ; X32-NEXT:    movl 12(%ebp), %ecx
>  ; X32-NEXT:    movl 8(%ebp), %eax
>  ; X32-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> @@ -307,7 +271,7 @@ define void @test2(x86_mmx* %A, x86_mmx*
>  ; X32-NEXT:    movq %mm0, (%eax)
>  ; X32-NEXT:    paddusw (%ecx), %mm0
>  ; X32-NEXT:    movq %mm0, {{[0-9]+}}(%esp)
> -; X32-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> +; X32-NEXT:    movdqa {{[0-9]+}}(%esp), %xmm0
>  ; X32-NEXT:    movq %mm0, (%eax)
>  ; X32-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
>  ; X32-NEXT:    psubw %xmm1, %xmm0
> @@ -317,40 +281,25 @@ define void @test2(x86_mmx* %A, x86_mmx*
>  ; X32-NEXT:    movq %mm0, (%eax)
>  ; X32-NEXT:    psubusw (%ecx), %mm0
>  ; X32-NEXT:    movq %mm0, {{[0-9]+}}(%esp)
> -; X32-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
>  ; X32-NEXT:    movq %mm0, (%eax)
> -; X32-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
> -; X32-NEXT:    pmullw %xmm0, %xmm1
> -; X32-NEXT:    movdq2q %xmm1, %mm0
> -; X32-NEXT:    movq %xmm1, (%eax)
> +; X32-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> +; X32-NEXT:    pmullw {{[0-9]+}}(%esp), %xmm0
> +; X32-NEXT:    movdq2q %xmm0, %mm0
> +; X32-NEXT:    movq %xmm0, (%eax)
>  ; X32-NEXT:    pmulhw (%ecx), %mm0
>  ; X32-NEXT:    movq %mm0, (%eax)
>  ; X32-NEXT:    pmaddwd (%ecx), %mm0
>  ; X32-NEXT:    movq %mm0, (%esp)
> -; X32-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> -; X32-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3]
>  ; X32-NEXT:    movq %mm0, (%eax)
> -; X32-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
> -; X32-NEXT:    punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3]
> -; X32-NEXT:    pand %xmm0, %xmm1
> -; X32-NEXT:    pshuflw {{.*#+}} xmm0 = xmm1[0,2,2,3,4,5,6,7]
> -; X32-NEXT:    pshufhw {{.*#+}} xmm0 = xmm0[0,1,2,3,4,6,6,7]
> -; X32-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> -; X32-NEXT:    movq %xmm0, (%eax)
> -; X32-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> -; X32-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3]
> -; X32-NEXT:    por %xmm1, %xmm0
> -; X32-NEXT:    pshuflw {{.*#+}} xmm1 = xmm0[0,2,2,3,4,5,6,7]
> -; X32-NEXT:    pshufhw {{.*#+}} xmm1 = xmm1[0,1,2,3,4,6,6,7]
> -; X32-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
> -; X32-NEXT:    movq %xmm1, (%eax)
> -; X32-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
> -; X32-NEXT:    punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3]
> -; X32-NEXT:    pxor %xmm0, %xmm1
> -; X32-NEXT:    pshuflw {{.*#+}} xmm0 = xmm1[0,2,2,3,4,5,6,7]
> -; X32-NEXT:    pshufhw {{.*#+}} xmm0 = xmm0[0,1,2,3,4,6,6,7]
> -; X32-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> -; X32-NEXT:    movq %xmm0, (%eax)
> +; X32-NEXT:    movsd {{.*#+}} xmm0 = mem[0],zero
> +; X32-NEXT:    andps (%esp), %xmm0
> +; X32-NEXT:    movlps %xmm0, (%eax)
> +; X32-NEXT:    movsd {{.*#+}} xmm1 = mem[0],zero
> +; X32-NEXT:    orps %xmm0, %xmm1
> +; X32-NEXT:    movlps %xmm1, (%eax)
> +; X32-NEXT:    movsd {{.*#+}} xmm0 = mem[0],zero
> +; X32-NEXT:    xorps %xmm1, %xmm0
> +; X32-NEXT:    movlps %xmm0, (%eax)
>  ; X32-NEXT:    emms
>  ; X32-NEXT:    movl %ebp, %esp
>  ; X32-NEXT:    popl %ebp
> @@ -368,7 +317,7 @@ define void @test2(x86_mmx* %A, x86_mmx*
>  ; X64-NEXT:    movq %mm0, (%rdi)
>  ; X64-NEXT:    paddusw (%rsi), %mm0
>  ; X64-NEXT:    movq %mm0, -{{[0-9]+}}(%rsp)
> -; X64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> +; X64-NEXT:    movdqa -{{[0-9]+}}(%rsp), %xmm0
>  ; X64-NEXT:    movq %mm0, (%rdi)
>  ; X64-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
>  ; X64-NEXT:    psubw %xmm1, %xmm0
> @@ -378,40 +327,25 @@ define void @test2(x86_mmx* %A, x86_mmx*
>  ; X64-NEXT:    movq %mm0, (%rdi)
>  ; X64-NEXT:    psubusw (%rsi), %mm0
>  ; X64-NEXT:    movq %mm0, -{{[0-9]+}}(%rsp)
> -; X64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
>  ; X64-NEXT:    movq %mm0, (%rdi)
> -; X64-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
> -; X64-NEXT:    pmullw %xmm0, %xmm1
> -; X64-NEXT:    movdq2q %xmm1, %mm0
> -; X64-NEXT:    movq %xmm1, (%rdi)
> +; X64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> +; X64-NEXT:    pmullw -{{[0-9]+}}(%rsp), %xmm0
> +; X64-NEXT:    movdq2q %xmm0, %mm0
> +; X64-NEXT:    movq %xmm0, (%rdi)
>  ; X64-NEXT:    pmulhw (%rsi), %mm0
>  ; X64-NEXT:    movq %mm0, (%rdi)
>  ; X64-NEXT:    pmaddwd (%rsi), %mm0
>  ; X64-NEXT:    movq %mm0, -{{[0-9]+}}(%rsp)
> -; X64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> -; X64-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3]
>  ; X64-NEXT:    movq %mm0, (%rdi)
> -; X64-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
> -; X64-NEXT:    punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3]
> -; X64-NEXT:    pand %xmm0, %xmm1
> -; X64-NEXT:    pshuflw {{.*#+}} xmm0 = xmm1[0,2,2,3,4,5,6,7]
> -; X64-NEXT:    pshufhw {{.*#+}} xmm0 = xmm0[0,1,2,3,4,6,6,7]
> -; X64-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> -; X64-NEXT:    movq %xmm0, (%rdi)
> -; X64-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> -; X64-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3]
> -; X64-NEXT:    por %xmm1, %xmm0
> -; X64-NEXT:    pshuflw {{.*#+}} xmm1 = xmm0[0,2,2,3,4,5,6,7]
> -; X64-NEXT:    pshufhw {{.*#+}} xmm1 = xmm1[0,1,2,3,4,6,6,7]
> -; X64-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
> -; X64-NEXT:    movq %xmm1, (%rdi)
> -; X64-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
> -; X64-NEXT:    punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3]
> -; X64-NEXT:    pxor %xmm0, %xmm1
> -; X64-NEXT:    pshuflw {{.*#+}} xmm0 = xmm1[0,2,2,3,4,5,6,7]
> -; X64-NEXT:    pshufhw {{.*#+}} xmm0 = xmm0[0,1,2,3,4,6,6,7]
> -; X64-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> -; X64-NEXT:    movq %xmm0, (%rdi)
> +; X64-NEXT:    movsd {{.*#+}} xmm0 = mem[0],zero
> +; X64-NEXT:    andps -{{[0-9]+}}(%rsp), %xmm0
> +; X64-NEXT:    movlps %xmm0, (%rdi)
> +; X64-NEXT:    movsd {{.*#+}} xmm1 = mem[0],zero
> +; X64-NEXT:    orps %xmm0, %xmm1
> +; X64-NEXT:    movlps %xmm1, (%rdi)
> +; X64-NEXT:    movsd {{.*#+}} xmm0 = mem[0],zero
> +; X64-NEXT:    xorps %xmm1, %xmm0
> +; X64-NEXT:    movlps %xmm0, (%rdi)
>  ; X64-NEXT:    emms
>  ; X64-NEXT:    retq
>  entry:
> @@ -479,45 +413,34 @@ define <1 x i64> @test3(<1 x i64>* %a, <
>  ; X32-LABEL: test3:
>  ; X32:       # %bb.0: # %entry
>  ; X32-NEXT:    pushl %ebp
> -; X32-NEXT:    movl %esp, %ebp
>  ; X32-NEXT:    pushl %ebx
>  ; X32-NEXT:    pushl %edi
>  ; X32-NEXT:    pushl %esi
> -; X32-NEXT:    andl $-8, %esp
> -; X32-NEXT:    subl $16, %esp
> -; X32-NEXT:    cmpl $0, 16(%ebp)
> +; X32-NEXT:    cmpl $0, {{[0-9]+}}(%esp)
>  ; X32-NEXT:    je .LBB3_1
>  ; X32-NEXT:  # %bb.2: # %bb26.preheader
> +; X32-NEXT:    movl {{[0-9]+}}(%esp), %esi
> +; X32-NEXT:    movl {{[0-9]+}}(%esp), %edi
>  ; X32-NEXT:    xorl %ebx, %ebx
>  ; X32-NEXT:    xorl %eax, %eax
>  ; X32-NEXT:    xorl %edx, %edx
>  ; X32-NEXT:    .p2align 4, 0x90
>  ; X32-NEXT:  .LBB3_3: # %bb26
>  ; X32-NEXT:    # =>This Inner Loop Header: Depth=1
> -; X32-NEXT:    movl 8(%ebp), %ecx
> -; X32-NEXT:    movl %ecx, %esi
> -; X32-NEXT:    movl (%ecx,%ebx,8), %ecx
> -; X32-NEXT:    movl 4(%esi,%ebx,8), %esi
> -; X32-NEXT:    movl 12(%ebp), %edi
> -; X32-NEXT:    addl (%edi,%ebx,8), %ecx
> -; X32-NEXT:    adcl 4(%edi,%ebx,8), %esi
> -; X32-NEXT:    addl %eax, %ecx
> -; X32-NEXT:    movl %ecx, (%esp)
> -; X32-NEXT:    adcl %edx, %esi
> -; X32-NEXT:    movl %esi, {{[0-9]+}}(%esp)
> -; X32-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
> -; X32-NEXT:    movd %xmm0, %eax
> -; X32-NEXT:    shufps {{.*#+}} xmm0 = xmm0[1,1,0,1]
> -; X32-NEXT:    movd %xmm0, %edx
> +; X32-NEXT:    movl (%edi,%ebx,8), %ebp
> +; X32-NEXT:    movl 4(%edi,%ebx,8), %ecx
> +; X32-NEXT:    addl (%esi,%ebx,8), %ebp
> +; X32-NEXT:    adcl 4(%esi,%ebx,8), %ecx
> +; X32-NEXT:    addl %ebp, %eax
> +; X32-NEXT:    adcl %ecx, %edx
>  ; X32-NEXT:    incl %ebx
> -; X32-NEXT:    cmpl 16(%ebp), %ebx
> +; X32-NEXT:    cmpl {{[0-9]+}}(%esp), %ebx
>  ; X32-NEXT:    jb .LBB3_3
>  ; X32-NEXT:    jmp .LBB3_4
>  ; X32-NEXT:  .LBB3_1:
>  ; X32-NEXT:    xorl %eax, %eax
>  ; X32-NEXT:    xorl %edx, %edx
>  ; X32-NEXT:  .LBB3_4: # %bb31
> -; X32-NEXT:    leal -12(%ebp), %esp
>  ; X32-NEXT:    popl %esi
>  ; X32-NEXT:    popl %edi
>  ; X32-NEXT:    popl %ebx
>
> Modified: llvm/trunk/test/CodeGen/X86/mmx-cvt.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/mmx-cvt.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/mmx-cvt.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/mmx-cvt.ll Wed Aug  7 09:24:26 2019
> @@ -296,8 +296,8 @@ define <4 x float> @sitofp_v2i32_v2f32(<
>  ; X86:       # %bb.0:
>  ; X86-NEXT:    pushl %ebp
>  ; X86-NEXT:    movl %esp, %ebp
> -; X86-NEXT:    andl $-8, %esp
> -; X86-NEXT:    subl $8, %esp
> +; X86-NEXT:    andl $-16, %esp
> +; X86-NEXT:    subl $32, %esp
>  ; X86-NEXT:    movl 8(%ebp), %eax
>  ; X86-NEXT:    movq (%eax), %mm0
>  ; X86-NEXT:    paddd %mm0, %mm0
>
> Modified: llvm/trunk/test/CodeGen/X86/mulvi32.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/mulvi32.ll?rev=368183&r1=368182&r2=368183&view=diff
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/mulvi32.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/mulvi32.ll Wed Aug  7 09:24:26 2019
> @@ -7,36 +7,39 @@
>  ; PR6399
>
>  define <2 x i32> @_mul2xi32a(<2 x i32>, <2 x i32>) {
> -; SSE-LABEL: _mul2xi32a:
> -; SSE:       # %bb.0:
> -; SSE-NEXT:    pmuludq %xmm1, %xmm0
> -; SSE-NEXT:    retq
> +; SSE2-LABEL: _mul2xi32a:
> +; SSE2:       # %bb.0:
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm2 = xmm0[1,1,3,3]
> +; SSE2-NEXT:    pmuludq %xmm1, %xmm0
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
> +; SSE2-NEXT:    pmuludq %xmm2, %xmm1
> +; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
> +; SSE2-NEXT:    punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
> +; SSE2-NEXT:    retq
> +;
> +; SSE42-LABEL: _mul2xi32a:
> +; SSE42:       # %bb.0:
> +; SSE42-NEXT:    pmulld %xmm1, %xmm0
> +; SSE42-NEXT:    retq
>  ;
>  ; AVX-LABEL: _mul2xi32a:
>  ; AVX:       # %bb.0:
> -; AVX-NEXT:    vpmuludq %xmm1, %xmm0, %xmm0
> +; AVX-NEXT:    vpmulld %xmm1, %xmm0, %xmm0
>  ; AVX-NEXT:    retq
>    %r = mul <2 x i32> %0, %1
>    ret <2 x i32> %r
>  }
>
>  define <2 x i32> @_mul2xi32b(<2 x i32>, <2 x i32>) {
> -; SSE2-LABEL: _mul2xi32b:
> -; SSE2:       # %bb.0:
> -; SSE2-NEXT:    pmuludq %xmm1, %xmm0
> -; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,1,1,3]
> -; SSE2-NEXT:    retq
> -;
> -; SSE42-LABEL: _mul2xi32b:
> -; SSE42:       # %bb.0:
> -; SSE42-NEXT:    pmuludq %xmm1, %xmm0
> -; SSE42-NEXT:    pmovzxdq {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero
> -; SSE42-NEXT:    retq
> +; SSE-LABEL: _mul2xi32b:
> +; SSE:       # %bb.0:
> +; SSE-NEXT:    pmuludq %xmm1, %xmm0
> +; SSE-NEXT:    retq
>  ;
>  ; AVX-LABEL: _mul2xi32b:
>  ; AVX:       # %bb.0:
>  ; AVX-NEXT:    vpmuludq %xmm1, %xmm0, %xmm0
> -; AVX-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero
>  ; AVX-NEXT:    retq
>    %factor0 = shufflevector <2 x i32> %0, <2 x i32> undef, <4 x i32> <i32 0, i32 undef, i32 2, i32 undef>
>    %factor1 = shufflevector <2 x i32> %1, <2 x i32> undef, <4 x i32> <i32 0, i32 undef, i32 2, i32 undef>
> @@ -153,8 +156,8 @@ define <4 x i64> @_mul4xi32toi64a(<4 x i
>  ;
>  ; AVX1-LABEL: _mul4xi32toi64a:
>  ; AVX1:       # %bb.0:
> -; AVX1-NEXT:    vpshufd {{.*#+}} xmm2 = xmm1[2,2,3,3]
> -; AVX1-NEXT:    vpshufd {{.*#+}} xmm3 = xmm0[2,2,3,3]
> +; AVX1-NEXT:    vpshufd {{.*#+}} xmm2 = xmm1[2,1,3,3]
> +; AVX1-NEXT:    vpshufd {{.*#+}} xmm3 = xmm0[2,1,3,3]
>  ; AVX1-NEXT:    vpmuludq %xmm2, %xmm3, %xmm2
>  ; AVX1-NEXT:    vpmovzxdq {{.*#+}} xmm1 = xmm1[0],zero,xmm1[1],zero
>  ; AVX1-NEXT:    vpmovzxdq {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero
>
>

