[llvm] 7f6bbb3 - [RISCV][TTI] Reduce cost of a build_vector pattern (#108419)

via llvm-commits llvm-commits at lists.llvm.org
Fri Sep 20 08:34:40 PDT 2024


Author: Philip Reames
Date: 2024-09-20T08:34:36-07:00
New Revision: 7f6bbb3c4f2c5f03d2abd3a79c85f1e0e4e03e05

URL: https://github.com/llvm/llvm-project/commit/7f6bbb3c4f2c5f03d2abd3a79c85f1e0e4e03e05
DIFF: https://github.com/llvm/llvm-project/commit/7f6bbb3c4f2c5f03d2abd3a79c85f1e0e4e03e05.diff

LOG: [RISCV][TTI] Reduce cost of a build_vector pattern (#108419)

This change is actually two related changes, but they're very hard to
meaningfully separate: the second balances the first, yet doesn't do
much good on its own.

First, we can reduce the cost of a build_vector pattern. Our current
costing for this defers to generic insertelement costing, which isn't
unreasonable, but also isn't correct. While inserting N elements
requires N-1 slides and N vmv.s.x instructions, doing the full
build_vector requires only N vslide1down. (Note that there are other
cases our build vector lowering can handle more cheaply; this is simply
the easiest upper bound, which appears to be "good enough" for SLP
costing purposes.)
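
For concreteness, here is a hedged sketch of the arithmetic behind the
new bound. This is mine, not code from the patch, and the unit costs
are hypothetical:

  // Illustrative cost arithmetic only - not the patch's implementation.
  #include <cassert>

  constexpr int VMVSXCost = 1;       // vmv.s.x: scalar into element 0
  constexpr int VSlideCost = 1;      // vslideup.vi: slide into place
  constexpr int VSlide1DownCost = 1; // vslide1down.vx: shift in one elt

  // Generic insertelement-based model: N scalar moves plus N-1 slides.
  int insertEltCost(int NumElts) {
    return NumElts * VMVSXCost + (NumElts - 1) * VSlideCost;
  }

  // The bound this patch adds: one vslide1down.vx per element.
  int buildVectorCost(int NumElts) { return NumElts * VSlide1DownCost; }

  int main() {
    assert(insertEltCost(4) == 7 && buildVectorCost(4) == 4);
    assert(insertEltCost(2) == 3 && buildVectorCost(2) == 2);
    return 0;
  }

The deltas line up with the frem test updates below: <2 x float> drops
from 7 to 6 and <4 x float> from 15 to 12, while <8 x float> (larger
than m1) is unchanged.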

Second, we need to tell SLP that calls don't preserve vector registers.
Without this, SLP will vectorize scalar code which performs e.g. 4 x
float @exp calls as two <2 x float> @exp intrinsic calls. Oddly, the
costing works out such that this is in fact the optimal choice - except
that we don't actually have a <2 x float> @exp, so we unroll during DAG
lowering. This would be fine (or at least cost neutral) except that the
libcall for the scalar @exp clobbers all vector registers. So the net
effect is that we added a bunch of spills that SLP had no idea about.
Thankfully, AArch64 has a similar problem, and has already taught SLP
how to reason about spill cost once the right TTI hook is implemented.
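
As a hedged illustration of the scalar pattern in question (a made-up
snippet, not one of the patch's tests), consider four independent expf
calls that SLP could previously choose to rewrite as two <2 x float>
@exp intrinsic calls:

  // Hypothetical example: each std::exp call becomes an expf libcall,
  // which clobbers all vector registers under the default calling
  // convention, so vector values live across it must be spilled.
  #include <cmath>

  void exp4(float *out, const float *in) {
    out[0] = std::exp(in[0]);
    out[1] = std::exp(in[1]);
    out[2] = std::exp(in[2]);
    out[3] = std::exp(in[3]);
  }

With the new getCostOfKeepingLiveOverCall hook, the spills and reloads
around each libcall now show up in SLP's cost comparison.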

Now, for some implications...

The SLP solution for spill costing has some inaccuracies. In particular,
it basically just guesses whether an intrinsic will be lowered to a call
or not, and can be wrong in both directions. It also has no mechanism to
differentiate by calling convention.
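
To make "guesses" concrete, here is a minimal sketch of the kind of
heuristic involved. It is illustrative only (the name and the intrinsic
list are mine), not SLP's actual code:

  // Illustrative only - a static guess at whether a call survives as a
  // real call. It can miss intrinsics that become libcalls, and it can
  // flag intrinsics that lower inline.
  #include "llvm/IR/IntrinsicInst.h"
  using namespace llvm;

  static bool probablyLowersToCall(const CallBase &CB) {
    const auto *II = dyn_cast<IntrinsicInst>(&CB);
    if (!II)
      return true; // Plain calls always clobber registers per the ABI.
    switch (II->getIntrinsicID()) {
    case Intrinsic::fabs: // Representative "known cheap" intrinsics;
    case Intrinsic::sqrt: // the real list is itself a guess.
      return false;
    default:
      return true;
    }
  }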

This has the effect of making partial vectorization (i.e. starting in
scalar) more profitable. In practice, the major effect is to make it
more likely that SLP will vectorize part of a tree in an intersecting
forest, and then vectorize the remaining tree once those uses have been
removed.

This also has the effect of biasing us slightly away from strided or
indexed loads during vectorization: because the scalar cost is more
accurately modeled, these instructions look relatively less profitable.

Added: 
    

Modified: 
    llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
    llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
    llvm/test/Analysis/CostModel/RISCV/arith-fp.ll
    llvm/test/Analysis/CostModel/RISCV/fixed-vector-gather.ll
    llvm/test/Analysis/CostModel/RISCV/fp-min-max-abs.ll
    llvm/test/Analysis/CostModel/RISCV/fp-sqrt-pow.ll
    llvm/test/Analysis/CostModel/RISCV/fp-trig-log-exp.ll
    llvm/test/Analysis/CostModel/RISCV/gep.ll
    llvm/test/Analysis/CostModel/RISCV/int-sat-math.ll
    llvm/test/Analysis/CostModel/RISCV/rvv-intrinsics.ll
    llvm/test/Transforms/SLPVectorizer/RISCV/complex-loads.ll
    llvm/test/Transforms/SLPVectorizer/RISCV/mixed-extracts-types.ll
    llvm/test/Transforms/SLPVectorizer/RISCV/remarks-insert-into-small-vector.ll
    llvm/test/Transforms/SLPVectorizer/RISCV/vec3-base.ll

Removed: 
    


################################################################################
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
index d4513a576e7b85..595475f37a7a69 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
@@ -616,6 +616,40 @@ InstructionCost RISCVTTIImpl::getShuffleCost(TTI::ShuffleKind Kind,
   return BaseT::getShuffleCost(Kind, Tp, Mask, CostKind, Index, SubTp);
 }
 
+static bool isM1OrSmaller(MVT VT) {
+  RISCVII::VLMUL LMUL = RISCVTargetLowering::getLMUL(VT);
+  return (LMUL == RISCVII::VLMUL::LMUL_F8 || LMUL == RISCVII::VLMUL::LMUL_F4 ||
+          LMUL == RISCVII::VLMUL::LMUL_F2 || LMUL == RISCVII::VLMUL::LMUL_1);
+}
+
+InstructionCost RISCVTTIImpl::getScalarizationOverhead(
+    VectorType *Ty, const APInt &DemandedElts, bool Insert, bool Extract,
+    TTI::TargetCostKind CostKind) {
+  if (isa<ScalableVectorType>(Ty))
+    return InstructionCost::getInvalid();
+
+  // A build_vector (which is m1 sized or smaller) can be done in no
+  // worse than one vslide1down.vx per element in the type.  We could
+  // in theory do an explode_vector in the inverse manner, but our
+  // lowering today does not have a first class node for this pattern.
+  InstructionCost Cost = BaseT::getScalarizationOverhead(
+      Ty, DemandedElts, Insert, Extract, CostKind);
+  std::pair<InstructionCost, MVT> LT = getTypeLegalizationCost(Ty);
+  if (Insert && !Extract && LT.first.isValid() && LT.second.isVector() &&
+      Ty->getScalarSizeInBits() != 1) {
+    assert(LT.second.isFixedLengthVector());
+    MVT ContainerVT = TLI->getContainerForFixedLengthVector(LT.second);
+    if (isM1OrSmaller(ContainerVT)) {
+      InstructionCost BV =
+          cast<FixedVectorType>(Ty)->getNumElements() *
+          getRISCVInstructionCost(RISCV::VSLIDE1DOWN_VX, LT.second, CostKind);
+      if (BV < Cost)
+        Cost = BV;
+    }
+  }
+  return Cost;
+}
+
 InstructionCost
 RISCVTTIImpl::getMaskedMemoryOpCost(unsigned Opcode, Type *Src, Align Alignment,
                                     unsigned AddressSpace,
@@ -767,6 +801,23 @@ InstructionCost RISCVTTIImpl::getStridedMemoryOpCost(
   return NumLoads * MemOpCost;
 }
 
+InstructionCost
+RISCVTTIImpl::getCostOfKeepingLiveOverCall(ArrayRef<Type *> Tys) {
+  // FIXME: This is a property of the default vector convention, not
+  // all possible calling conventions.  Fixing that will require
+  // some TTI API and SLP rework.
+  InstructionCost Cost = 0;
+  TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;
+  for (auto *Ty : Tys) {
+    if (!Ty->isVectorTy())
+      continue;
+    Align A = DL.getPrefTypeAlign(Ty);
+    Cost += getMemoryOpCost(Instruction::Store, Ty, A, 0, CostKind) +
+            getMemoryOpCost(Instruction::Load, Ty, A, 0, CostKind);
+  }
+  return Cost;
+}
+
 // Currently, these represent both throughput and codesize costs
 // for the respective intrinsics.  The costs in this table are simply
 // instruction counts with the following adjustments made:

diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
index 308fbc55b2d595..f16c4fc0eed023 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
@@ -149,6 +149,11 @@ class RISCVTTIImpl : public BasicTTIImplBase<RISCVTTIImpl> {
                                  ArrayRef<const Value *> Args = {},
                                  const Instruction *CxtI = nullptr);
 
+  InstructionCost getScalarizationOverhead(VectorType *Ty,
+                                           const APInt &DemandedElts,
+                                           bool Insert, bool Extract,
+                                           TTI::TargetCostKind CostKind);
+
   InstructionCost getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
                                         TTI::TargetCostKind CostKind);
 
@@ -169,6 +174,8 @@ class RISCVTTIImpl : public BasicTTIImplBase<RISCVTTIImpl> {
                                          TTI::TargetCostKind CostKind,
                                          const Instruction *I);
 
+  InstructionCost getCostOfKeepingLiveOverCall(ArrayRef<Type *> Tys);
+
   InstructionCost getCastInstrCost(unsigned Opcode, Type *Dst, Type *Src,
                                    TTI::CastContextHint CCH,
                                    TTI::TargetCostKind CostKind,

diff --git a/llvm/test/Analysis/CostModel/RISCV/arith-fp.ll b/llvm/test/Analysis/CostModel/RISCV/arith-fp.ll
index e92b2b3c50b898..ff2609b5cc4e0a 100644
--- a/llvm/test/Analysis/CostModel/RISCV/arith-fp.ll
+++ b/llvm/test/Analysis/CostModel/RISCV/arith-fp.ll
@@ -361,8 +361,8 @@ define void @frem() {
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %F32 = frem float undef, undef
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %F64 = frem double undef, undef
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V1F32 = frem <1 x float> undef, undef
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V2F32 = frem <2 x float> undef, undef
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %V4F32 = frem <4 x float> undef, undef
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V2F32 = frem <2 x float> undef, undef
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V4F32 = frem <4 x float> undef, undef
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 31 for instruction: %V8F32 = frem <8 x float> undef, undef
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 63 for instruction: %V16F32 = frem <16 x float> undef, undef
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %NXV1F32 = frem <vscale x 1 x float> undef, undef
@@ -371,7 +371,7 @@ define void @frem() {
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %NXV8F32 = frem <vscale x 8 x float> undef, undef
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %NXV16F32 = frem <vscale x 16 x float> undef, undef
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V1F64 = frem <1 x double> undef, undef
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V2F64 = frem <2 x double> undef, undef
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V2F64 = frem <2 x double> undef, undef
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %V4F64 = frem <4 x double> undef, undef
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 31 for instruction: %V8F64 = frem <8 x double> undef, undef
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %NXV1F64 = frem <vscale x 1 x double> undef, undef
@@ -412,9 +412,9 @@ define void @frem_f16() {
 ; ZVFH-LABEL: 'frem_f16'
 ; ZVFH-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %F16 = frem half undef, undef
 ; ZVFH-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V1F16 = frem <1 x half> undef, undef
-; ZVFH-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V2F16 = frem <2 x half> undef, undef
-; ZVFH-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %V4F16 = frem <4 x half> undef, undef
-; ZVFH-NEXT:  Cost Model: Found an estimated cost of 31 for instruction: %V8F16 = frem <8 x half> undef, undef
+; ZVFH-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V2F16 = frem <2 x half> undef, undef
+; ZVFH-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V4F16 = frem <4 x half> undef, undef
+; ZVFH-NEXT:  Cost Model: Found an estimated cost of 24 for instruction: %V8F16 = frem <8 x half> undef, undef
 ; ZVFH-NEXT:  Cost Model: Found an estimated cost of 63 for instruction: %V16F16 = frem <16 x half> undef, undef
 ; ZVFH-NEXT:  Cost Model: Found an estimated cost of 127 for instruction: %V32F16 = frem <32 x half> undef, undef
 ; ZVFH-NEXT:  Cost Model: Invalid cost for instruction: %NXV1F16 = frem <vscale x 1 x half> undef, undef
@@ -428,9 +428,9 @@ define void @frem_f16() {
 ; ZVFHMIN-LABEL: 'frem_f16'
 ; ZVFHMIN-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %F16 = frem half undef, undef
 ; ZVFHMIN-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V1F16 = frem <1 x half> undef, undef
-; ZVFHMIN-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V2F16 = frem <2 x half> undef, undef
-; ZVFHMIN-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %V4F16 = frem <4 x half> undef, undef
-; ZVFHMIN-NEXT:  Cost Model: Found an estimated cost of 31 for instruction: %V8F16 = frem <8 x half> undef, undef
+; ZVFHMIN-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %V2F16 = frem <2 x half> undef, undef
+; ZVFHMIN-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %V4F16 = frem <4 x half> undef, undef
+; ZVFHMIN-NEXT:  Cost Model: Found an estimated cost of 24 for instruction: %V8F16 = frem <8 x half> undef, undef
 ; ZVFHMIN-NEXT:  Cost Model: Found an estimated cost of 63 for instruction: %V16F16 = frem <16 x half> undef, undef
 ; ZVFHMIN-NEXT:  Cost Model: Found an estimated cost of 127 for instruction: %V32F16 = frem <32 x half> undef, undef
 ; ZVFHMIN-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %NXV1F16 = frem <vscale x 1 x half> undef, undef
@@ -620,9 +620,9 @@ define void @fcopysign_f16() {
 ; ZVFHMIN-LABEL: 'fcopysign_f16'
 ; ZVFHMIN-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %F16 = call half @llvm.copysign.f16(half undef, half undef)
 ; ZVFHMIN-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V1F16 = call <1 x half> @llvm.copysign.v1f16(<1 x half> undef, <1 x half> undef)
-; ZVFHMIN-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %V2F16 = call <2 x half> @llvm.copysign.v2f16(<2 x half> undef, <2 x half> undef)
-; ZVFHMIN-NEXT:  Cost Model: Found an estimated cost of 11 for instruction: %V4F16 = call <4 x half> @llvm.copysign.v4f16(<4 x half> undef, <4 x half> undef)
-; ZVFHMIN-NEXT:  Cost Model: Found an estimated cost of 23 for instruction: %V8F16 = call <8 x half> @llvm.copysign.v8f16(<8 x half> undef, <8 x half> undef)
+; ZVFHMIN-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V2F16 = call <2 x half> @llvm.copysign.v2f16(<2 x half> undef, <2 x half> undef)
+; ZVFHMIN-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V4F16 = call <4 x half> @llvm.copysign.v4f16(<4 x half> undef, <4 x half> undef)
+; ZVFHMIN-NEXT:  Cost Model: Found an estimated cost of 16 for instruction: %V8F16 = call <8 x half> @llvm.copysign.v8f16(<8 x half> undef, <8 x half> undef)
 ; ZVFHMIN-NEXT:  Cost Model: Found an estimated cost of 47 for instruction: %V16F16 = call <16 x half> @llvm.copysign.v16f16(<16 x half> undef, <16 x half> undef)
 ; ZVFHMIN-NEXT:  Cost Model: Found an estimated cost of 95 for instruction: %V32F16 = call <32 x half> @llvm.copysign.v32f16(<32 x half> undef, <32 x half> undef)
 ; ZVFHMIN-NEXT:  Cost Model: Invalid cost for instruction: %NXV1F16 = call <vscale x 1 x half> @llvm.copysign.nxv1f16(<vscale x 1 x half> undef, <vscale x 1 x half> undef)

diff --git a/llvm/test/Analysis/CostModel/RISCV/fixed-vector-gather.ll b/llvm/test/Analysis/CostModel/RISCV/fixed-vector-gather.ll
index 094916971ada79..b0bbb23a56f3fe 100644
--- a/llvm/test/Analysis/CostModel/RISCV/fixed-vector-gather.ll
+++ b/llvm/test/Analysis/CostModel/RISCV/fixed-vector-gather.ll
@@ -42,35 +42,35 @@ define i32 @masked_gather() {
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V4I8 = call <4 x i8> @llvm.masked.gather.v4i8.v4p0(<4 x ptr> undef, i32 1, <4 x i1> undef, <4 x i8> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V2I8 = call <2 x i8> @llvm.masked.gather.v2i8.v2p0(<2 x ptr> undef, i32 1, <2 x i1> undef, <2 x i8> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V1I8 = call <1 x i8> @llvm.masked.gather.v1i8.v1p0(<1 x ptr> undef, i32 1, <1 x i1> undef, <1 x i8> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 32 for instruction: %V8F64.u = call <8 x double> @llvm.masked.gather.v8f64.v8p0(<8 x ptr> undef, i32 2, <8 x i1> undef, <8 x double> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 16 for instruction: %V4F64.u = call <4 x double> @llvm.masked.gather.v4f64.v4p0(<4 x ptr> undef, i32 2, <4 x i1> undef, <4 x double> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V2F64.u = call <2 x double> @llvm.masked.gather.v2f64.v2p0(<2 x ptr> undef, i32 2, <2 x i1> undef, <2 x double> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 28 for instruction: %V8F64.u = call <8 x double> @llvm.masked.gather.v8f64.v8p0(<8 x ptr> undef, i32 2, <8 x i1> undef, <8 x double> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 14 for instruction: %V4F64.u = call <4 x double> @llvm.masked.gather.v4f64.v4p0(<4 x ptr> undef, i32 2, <4 x i1> undef, <4 x double> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V2F64.u = call <2 x double> @llvm.masked.gather.v2f64.v2p0(<2 x ptr> undef, i32 2, <2 x i1> undef, <2 x double> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V1F64.u = call <1 x double> @llvm.masked.gather.v1f64.v1p0(<1 x ptr> undef, i32 2, <1 x i1> undef, <1 x double> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 68 for instruction: %V16F32.u = call <16 x float> @llvm.masked.gather.v16f32.v16p0(<16 x ptr> undef, i32 2, <16 x i1> undef, <16 x float> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 34 for instruction: %V8F32.u = call <8 x float> @llvm.masked.gather.v8f32.v8p0(<8 x ptr> undef, i32 2, <8 x i1> undef, <8 x float> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %V4F32.u = call <4 x float> @llvm.masked.gather.v4f32.v4p0(<4 x ptr> undef, i32 2, <4 x i1> undef, <4 x float> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V2F32.u = call <2 x float> @llvm.masked.gather.v2f32.v2p0(<2 x ptr> undef, i32 2, <2 x i1> undef, <2 x float> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 56 for instruction: %V16F32.u = call <16 x float> @llvm.masked.gather.v16f32.v16p0(<16 x ptr> undef, i32 2, <16 x i1> undef, <16 x float> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 28 for instruction: %V8F32.u = call <8 x float> @llvm.masked.gather.v8f32.v8p0(<8 x ptr> undef, i32 2, <8 x i1> undef, <8 x float> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 14 for instruction: %V4F32.u = call <4 x float> @llvm.masked.gather.v4f32.v4p0(<4 x ptr> undef, i32 2, <4 x i1> undef, <4 x float> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V2F32.u = call <2 x float> @llvm.masked.gather.v2f32.v2p0(<2 x ptr> undef, i32 2, <2 x i1> undef, <2 x float> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V1F32.u = call <1 x float> @llvm.masked.gather.v1f32.v1p0(<1 x ptr> undef, i32 2, <1 x i1> undef, <1 x float> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 140 for instruction: %V32F16.u = call <32 x half> @llvm.masked.gather.v32f16.v32p0(<32 x ptr> undef, i32 1, <32 x i1> undef, <32 x half> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 70 for instruction: %V16F16.u = call <16 x half> @llvm.masked.gather.v16f16.v16p0(<16 x ptr> undef, i32 1, <16 x i1> undef, <16 x half> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 35 for instruction: %V8F16.u = call <8 x half> @llvm.masked.gather.v8f16.v8p0(<8 x ptr> undef, i32 1, <8 x i1> undef, <8 x half> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %V4F16.u = call <4 x half> @llvm.masked.gather.v4f16.v4p0(<4 x ptr> undef, i32 1, <4 x i1> undef, <4 x half> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V2F16.u = call <2 x half> @llvm.masked.gather.v2f16.v2p0(<2 x ptr> undef, i32 1, <2 x i1> undef, <2 x half> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 112 for instruction: %V32F16.u = call <32 x half> @llvm.masked.gather.v32f16.v32p0(<32 x ptr> undef, i32 1, <32 x i1> undef, <32 x half> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 56 for instruction: %V16F16.u = call <16 x half> @llvm.masked.gather.v16f16.v16p0(<16 x ptr> undef, i32 1, <16 x i1> undef, <16 x half> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 28 for instruction: %V8F16.u = call <8 x half> @llvm.masked.gather.v8f16.v8p0(<8 x ptr> undef, i32 1, <8 x i1> undef, <8 x half> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 14 for instruction: %V4F16.u = call <4 x half> @llvm.masked.gather.v4f16.v4p0(<4 x ptr> undef, i32 1, <4 x i1> undef, <4 x half> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V2F16.u = call <2 x half> @llvm.masked.gather.v2f16.v2p0(<2 x ptr> undef, i32 1, <2 x i1> undef, <2 x half> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V1F16.u = call <1 x half> @llvm.masked.gather.v1f16.v1p0(<1 x ptr> undef, i32 1, <1 x i1> undef, <1 x half> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 32 for instruction: %V8I64.u = call <8 x i64> @llvm.masked.gather.v8i64.v8p0(<8 x ptr> undef, i32 4, <8 x i1> undef, <8 x i64> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 16 for instruction: %V4I64.u = call <4 x i64> @llvm.masked.gather.v4i64.v4p0(<4 x ptr> undef, i32 4, <4 x i1> undef, <4 x i64> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V2I64.u = call <2 x i64> @llvm.masked.gather.v2i64.v2p0(<2 x ptr> undef, i32 4, <2 x i1> undef, <2 x i64> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 28 for instruction: %V8I64.u = call <8 x i64> @llvm.masked.gather.v8i64.v8p0(<8 x ptr> undef, i32 4, <8 x i1> undef, <8 x i64> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 14 for instruction: %V4I64.u = call <4 x i64> @llvm.masked.gather.v4i64.v4p0(<4 x ptr> undef, i32 4, <4 x i1> undef, <4 x i64> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V2I64.u = call <2 x i64> @llvm.masked.gather.v2i64.v2p0(<2 x ptr> undef, i32 4, <2 x i1> undef, <2 x i64> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V1I64.u = call <1 x i64> @llvm.masked.gather.v1i64.v1p0(<1 x ptr> undef, i32 4, <1 x i1> undef, <1 x i64> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 68 for instruction: %V16I32.u = call <16 x i32> @llvm.masked.gather.v16i32.v16p0(<16 x ptr> undef, i32 1, <16 x i1> undef, <16 x i32> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 34 for instruction: %V8I32.u = call <8 x i32> @llvm.masked.gather.v8i32.v8p0(<8 x ptr> undef, i32 1, <8 x i1> undef, <8 x i32> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %V4I32.u = call <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr> undef, i32 1, <4 x i1> undef, <4 x i32> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V2I32.u = call <2 x i32> @llvm.masked.gather.v2i32.v2p0(<2 x ptr> undef, i32 1, <2 x i1> undef, <2 x i32> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 56 for instruction: %V16I32.u = call <16 x i32> @llvm.masked.gather.v16i32.v16p0(<16 x ptr> undef, i32 1, <16 x i1> undef, <16 x i32> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 28 for instruction: %V8I32.u = call <8 x i32> @llvm.masked.gather.v8i32.v8p0(<8 x ptr> undef, i32 1, <8 x i1> undef, <8 x i32> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 14 for instruction: %V4I32.u = call <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr> undef, i32 1, <4 x i1> undef, <4 x i32> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V2I32.u = call <2 x i32> @llvm.masked.gather.v2i32.v2p0(<2 x ptr> undef, i32 1, <2 x i1> undef, <2 x i32> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V1I32.u = call <1 x i32> @llvm.masked.gather.v1i32.v1p0(<1 x ptr> undef, i32 1, <1 x i1> undef, <1 x i32> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 140 for instruction: %V32I16.u = call <32 x i16> @llvm.masked.gather.v32i16.v32p0(<32 x ptr> undef, i32 1, <32 x i1> undef, <32 x i16> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 70 for instruction: %V16I16.u = call <16 x i16> @llvm.masked.gather.v16i16.v16p0(<16 x ptr> undef, i32 1, <16 x i1> undef, <16 x i16> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 35 for instruction: %V8I16.u = call <8 x i16> @llvm.masked.gather.v8i16.v8p0(<8 x ptr> undef, i32 1, <8 x i1> undef, <8 x i16> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 17 for instruction: %V4I16.u = call <4 x i16> @llvm.masked.gather.v4i16.v4p0(<4 x ptr> undef, i32 1, <4 x i1> undef, <4 x i16> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %V2I16.u = call <2 x i16> @llvm.masked.gather.v2i16.v2p0(<2 x ptr> undef, i32 1, <2 x i1> undef, <2 x i16> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 112 for instruction: %V32I16.u = call <32 x i16> @llvm.masked.gather.v32i16.v32p0(<32 x ptr> undef, i32 1, <32 x i1> undef, <32 x i16> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 56 for instruction: %V16I16.u = call <16 x i16> @llvm.masked.gather.v16i16.v16p0(<16 x ptr> undef, i32 1, <16 x i1> undef, <16 x i16> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 28 for instruction: %V8I16.u = call <8 x i16> @llvm.masked.gather.v8i16.v8p0(<8 x ptr> undef, i32 1, <8 x i1> undef, <8 x i16> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 14 for instruction: %V4I16.u = call <4 x i16> @llvm.masked.gather.v4i16.v4p0(<4 x ptr> undef, i32 1, <4 x i1> undef, <4 x i16> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %V2I16.u = call <2 x i16> @llvm.masked.gather.v2i16.v2p0(<2 x ptr> undef, i32 1, <2 x i1> undef, <2 x i16> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %V1I16.u = call <1 x i16> @llvm.masked.gather.v1i16.v1p0(<1 x ptr> undef, i32 1, <1 x i1> undef, <1 x i16> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret i32 0
 ;

diff --git a/llvm/test/Analysis/CostModel/RISCV/fp-min-max-abs.ll b/llvm/test/Analysis/CostModel/RISCV/fp-min-max-abs.ll
index 64dc264f12735b..6e4061a42bf9b8 100644
--- a/llvm/test/Analysis/CostModel/RISCV/fp-min-max-abs.ll
+++ b/llvm/test/Analysis/CostModel/RISCV/fp-min-max-abs.ll
@@ -473,9 +473,9 @@ define void @copysign_f16() {
 ;
 ; ZVFHMIN-LABEL: 'copysign_f16'
 ; ZVFHMIN-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %1 = call half @llvm.copysign.f16(half undef, half undef)
-; ZVFHMIN-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %2 = call <2 x half> @llvm.copysign.v2f16(<2 x half> undef, <2 x half> undef)
-; ZVFHMIN-NEXT:  Cost Model: Found an estimated cost of 11 for instruction: %3 = call <4 x half> @llvm.copysign.v4f16(<4 x half> undef, <4 x half> undef)
-; ZVFHMIN-NEXT:  Cost Model: Found an estimated cost of 23 for instruction: %4 = call <8 x half> @llvm.copysign.v8f16(<8 x half> undef, <8 x half> undef)
+; ZVFHMIN-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %2 = call <2 x half> @llvm.copysign.v2f16(<2 x half> undef, <2 x half> undef)
+; ZVFHMIN-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %3 = call <4 x half> @llvm.copysign.v4f16(<4 x half> undef, <4 x half> undef)
+; ZVFHMIN-NEXT:  Cost Model: Found an estimated cost of 16 for instruction: %4 = call <8 x half> @llvm.copysign.v8f16(<8 x half> undef, <8 x half> undef)
 ; ZVFHMIN-NEXT:  Cost Model: Found an estimated cost of 47 for instruction: %5 = call <16 x half> @llvm.copysign.v16f16(<16 x half> undef, <16 x half> undef)
 ; ZVFHMIN-NEXT:  Cost Model: Invalid cost for instruction: %6 = call <vscale x 1 x half> @llvm.copysign.nxv1f16(<vscale x 1 x half> undef, <vscale x 1 x half> undef)
 ; ZVFHMIN-NEXT:  Cost Model: Invalid cost for instruction: %7 = call <vscale x 2 x half> @llvm.copysign.nxv2f16(<vscale x 2 x half> undef, <vscale x 2 x half> undef)

diff --git a/llvm/test/Analysis/CostModel/RISCV/fp-sqrt-pow.ll b/llvm/test/Analysis/CostModel/RISCV/fp-sqrt-pow.ll
index 0555482c5b2414..78acba832dbbf2 100644
--- a/llvm/test/Analysis/CostModel/RISCV/fp-sqrt-pow.ll
+++ b/llvm/test/Analysis/CostModel/RISCV/fp-sqrt-pow.ll
@@ -96,8 +96,8 @@ declare <vscale x 8 x double> @llvm.sqrt.nvx8f64(<vscale x 8 x double>)
 define void @pow() {
 ; CHECK-LABEL: 'pow'
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %1 = call float @llvm.pow.f32(float undef, float undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 23 for instruction: %2 = call <2 x float> @llvm.pow.v2f32(<2 x float> undef, <2 x float> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 47 for instruction: %3 = call <4 x float> @llvm.pow.v4f32(<4 x float> undef, <4 x float> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 22 for instruction: %2 = call <2 x float> @llvm.pow.v2f32(<2 x float> undef, <2 x float> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 44 for instruction: %3 = call <4 x float> @llvm.pow.v4f32(<4 x float> undef, <4 x float> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 95 for instruction: %4 = call <8 x float> @llvm.pow.v8f32(<8 x float> undef, <8 x float> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 191 for instruction: %5 = call <16 x float> @llvm.pow.v16f32(<16 x float> undef, <16 x float> undef)
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %6 = call <vscale x 1 x float> @llvm.pow.nxv1f32(<vscale x 1 x float> undef, <vscale x 1 x float> undef)
@@ -106,7 +106,7 @@ define void @pow() {
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %9 = call <vscale x 8 x float> @llvm.pow.nxv8f32(<vscale x 8 x float> undef, <vscale x 8 x float> undef)
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %10 = call <vscale x 16 x float> @llvm.pow.nxv16f32(<vscale x 16 x float> undef, <vscale x 16 x float> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %11 = call double @llvm.pow.f64(double undef, double undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 23 for instruction: %12 = call <2 x double> @llvm.pow.v2f64(<2 x double> undef, <2 x double> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 22 for instruction: %12 = call <2 x double> @llvm.pow.v2f64(<2 x double> undef, <2 x double> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 47 for instruction: %13 = call <4 x double> @llvm.pow.v4f64(<4 x double> undef, <4 x double> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 95 for instruction: %14 = call <8 x double> @llvm.pow.v8f64(<8 x double> undef, <8 x double> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 191 for instruction: %15 = call <16 x double> @llvm.pow.v16f64(<16 x double> undef, <16 x double> undef)

diff --git a/llvm/test/Analysis/CostModel/RISCV/fp-trig-log-exp.ll b/llvm/test/Analysis/CostModel/RISCV/fp-trig-log-exp.ll
index 12562608322d50..af779116a62780 100644
--- a/llvm/test/Analysis/CostModel/RISCV/fp-trig-log-exp.ll
+++ b/llvm/test/Analysis/CostModel/RISCV/fp-trig-log-exp.ll
@@ -4,8 +4,8 @@
 define void @sin() {
 ; CHECK-LABEL: 'sin'
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %1 = call float @llvm.sin.f32(float undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 23 for instruction: %2 = call <2 x float> @llvm.sin.v2f32(<2 x float> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 47 for instruction: %3 = call <4 x float> @llvm.sin.v4f32(<4 x float> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 22 for instruction: %2 = call <2 x float> @llvm.sin.v2f32(<2 x float> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 44 for instruction: %3 = call <4 x float> @llvm.sin.v4f32(<4 x float> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 95 for instruction: %4 = call <8 x float> @llvm.sin.v8f32(<8 x float> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 191 for instruction: %5 = call <16 x float> @llvm.sin.v16f32(<16 x float> undef)
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %6 = call <vscale x 1 x float> @llvm.sin.nxv1f32(<vscale x 1 x float> undef)
@@ -14,7 +14,7 @@ define void @sin() {
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %9 = call <vscale x 8 x float> @llvm.sin.nxv8f32(<vscale x 8 x float> undef)
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %10 = call <vscale x 16 x float> @llvm.sin.nxv16f32(<vscale x 16 x float> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %11 = call double @llvm.sin.f64(double undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 23 for instruction: %12 = call <2 x double> @llvm.sin.v2f64(<2 x double> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 22 for instruction: %12 = call <2 x double> @llvm.sin.v2f64(<2 x double> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 47 for instruction: %13 = call <4 x double> @llvm.sin.v4f64(<4 x double> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 95 for instruction: %14 = call <8 x double> @llvm.sin.v8f64(<8 x double> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 191 for instruction: %15 = call <16 x double> @llvm.sin.v16f64(<16 x double> undef)
@@ -49,8 +49,8 @@ define void @sin() {
 define void @cos() {
 ; CHECK-LABEL: 'cos'
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %1 = call float @llvm.cos.f32(float undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 23 for instruction: %2 = call <2 x float> @llvm.cos.v2f32(<2 x float> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 47 for instruction: %3 = call <4 x float> @llvm.cos.v4f32(<4 x float> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 22 for instruction: %2 = call <2 x float> @llvm.cos.v2f32(<2 x float> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 44 for instruction: %3 = call <4 x float> @llvm.cos.v4f32(<4 x float> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 95 for instruction: %4 = call <8 x float> @llvm.cos.v8f32(<8 x float> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 191 for instruction: %5 = call <16 x float> @llvm.cos.v16f32(<16 x float> undef)
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %6 = call <vscale x 1 x float> @llvm.cos.nxv1f32(<vscale x 1 x float> undef)
@@ -59,7 +59,7 @@ define void @cos() {
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %9 = call <vscale x 8 x float> @llvm.cos.nxv8f32(<vscale x 8 x float> undef)
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %10 = call <vscale x 16 x float> @llvm.cos.nxv16f32(<vscale x 16 x float> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %11 = call double @llvm.cos.f64(double undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 23 for instruction: %12 = call <2 x double> @llvm.cos.v2f64(<2 x double> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 22 for instruction: %12 = call <2 x double> @llvm.cos.v2f64(<2 x double> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 47 for instruction: %13 = call <4 x double> @llvm.cos.v4f64(<4 x double> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 95 for instruction: %14 = call <8 x double> @llvm.cos.v8f64(<8 x double> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 191 for instruction: %15 = call <16 x double> @llvm.cos.v16f64(<16 x double> undef)
@@ -94,8 +94,8 @@ define void @cos() {
 define void @exp() {
 ; CHECK-LABEL: 'exp'
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %1 = call float @llvm.exp.f32(float undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 23 for instruction: %2 = call <2 x float> @llvm.exp.v2f32(<2 x float> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 47 for instruction: %3 = call <4 x float> @llvm.exp.v4f32(<4 x float> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 22 for instruction: %2 = call <2 x float> @llvm.exp.v2f32(<2 x float> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 44 for instruction: %3 = call <4 x float> @llvm.exp.v4f32(<4 x float> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 95 for instruction: %4 = call <8 x float> @llvm.exp.v8f32(<8 x float> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 191 for instruction: %5 = call <16 x float> @llvm.exp.v16f32(<16 x float> undef)
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %6 = call <vscale x 1 x float> @llvm.exp.nxv1f32(<vscale x 1 x float> undef)
@@ -104,7 +104,7 @@ define void @exp() {
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %9 = call <vscale x 8 x float> @llvm.exp.nxv8f32(<vscale x 8 x float> undef)
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %10 = call <vscale x 16 x float> @llvm.exp.nxv16f32(<vscale x 16 x float> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %11 = call double @llvm.exp.f64(double undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 23 for instruction: %12 = call <2 x double> @llvm.exp.v2f64(<2 x double> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 22 for instruction: %12 = call <2 x double> @llvm.exp.v2f64(<2 x double> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 47 for instruction: %13 = call <4 x double> @llvm.exp.v4f64(<4 x double> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 95 for instruction: %14 = call <8 x double> @llvm.exp.v8f64(<8 x double> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 191 for instruction: %15 = call <16 x double> @llvm.exp.v16f64(<16 x double> undef)
@@ -139,8 +139,8 @@ define void @exp() {
 define void @exp2() {
 ; CHECK-LABEL: 'exp2'
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %1 = call float @llvm.exp2.f32(float undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 23 for instruction: %2 = call <2 x float> @llvm.exp2.v2f32(<2 x float> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 47 for instruction: %3 = call <4 x float> @llvm.exp2.v4f32(<4 x float> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 22 for instruction: %2 = call <2 x float> @llvm.exp2.v2f32(<2 x float> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 44 for instruction: %3 = call <4 x float> @llvm.exp2.v4f32(<4 x float> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 95 for instruction: %4 = call <8 x float> @llvm.exp2.v8f32(<8 x float> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 191 for instruction: %5 = call <16 x float> @llvm.exp2.v16f32(<16 x float> undef)
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %6 = call <vscale x 1 x float> @llvm.exp2.nxv1f32(<vscale x 1 x float> undef)
@@ -149,7 +149,7 @@ define void @exp2() {
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %9 = call <vscale x 8 x float> @llvm.exp2.nxv8f32(<vscale x 8 x float> undef)
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %10 = call <vscale x 16 x float> @llvm.exp2.nxv16f32(<vscale x 16 x float> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %11 = call double @llvm.exp2.f64(double undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 23 for instruction: %12 = call <2 x double> @llvm.exp2.v2f64(<2 x double> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 22 for instruction: %12 = call <2 x double> @llvm.exp2.v2f64(<2 x double> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 47 for instruction: %13 = call <4 x double> @llvm.exp2.v4f64(<4 x double> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 95 for instruction: %14 = call <8 x double> @llvm.exp2.v8f64(<8 x double> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 191 for instruction: %15 = call <16 x double> @llvm.exp2.v16f64(<16 x double> undef)
@@ -184,8 +184,8 @@ define void @exp2() {
 define void @log() {
 ; CHECK-LABEL: 'log'
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %1 = call float @llvm.log.f32(float undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 23 for instruction: %2 = call <2 x float> @llvm.log.v2f32(<2 x float> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 47 for instruction: %3 = call <4 x float> @llvm.log.v4f32(<4 x float> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 22 for instruction: %2 = call <2 x float> @llvm.log.v2f32(<2 x float> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 44 for instruction: %3 = call <4 x float> @llvm.log.v4f32(<4 x float> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 95 for instruction: %4 = call <8 x float> @llvm.log.v8f32(<8 x float> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 191 for instruction: %5 = call <16 x float> @llvm.log.v16f32(<16 x float> undef)
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %6 = call <vscale x 1 x float> @llvm.log.nxv1f32(<vscale x 1 x float> undef)
@@ -194,7 +194,7 @@ define void @log() {
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %9 = call <vscale x 8 x float> @llvm.log.nxv8f32(<vscale x 8 x float> undef)
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %10 = call <vscale x 16 x float> @llvm.log.nxv16f32(<vscale x 16 x float> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %11 = call double @llvm.log.f64(double undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 23 for instruction: %12 = call <2 x double> @llvm.log.v2f64(<2 x double> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 22 for instruction: %12 = call <2 x double> @llvm.log.v2f64(<2 x double> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 47 for instruction: %13 = call <4 x double> @llvm.log.v4f64(<4 x double> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 95 for instruction: %14 = call <8 x double> @llvm.log.v8f64(<8 x double> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 191 for instruction: %15 = call <16 x double> @llvm.log.v16f64(<16 x double> undef)
@@ -229,8 +229,8 @@ define void @log() {
 define void @log10() {
 ; CHECK-LABEL: 'log10'
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %1 = call float @llvm.log10.f32(float undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 23 for instruction: %2 = call <2 x float> @llvm.log10.v2f32(<2 x float> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 47 for instruction: %3 = call <4 x float> @llvm.log10.v4f32(<4 x float> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 22 for instruction: %2 = call <2 x float> @llvm.log10.v2f32(<2 x float> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 44 for instruction: %3 = call <4 x float> @llvm.log10.v4f32(<4 x float> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 95 for instruction: %4 = call <8 x float> @llvm.log10.v8f32(<8 x float> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 191 for instruction: %5 = call <16 x float> @llvm.log10.v16f32(<16 x float> undef)
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %6 = call <vscale x 1 x float> @llvm.log10.nxv1f32(<vscale x 1 x float> undef)
@@ -239,7 +239,7 @@ define void @log10() {
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %9 = call <vscale x 8 x float> @llvm.log10.nxv8f32(<vscale x 8 x float> undef)
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %10 = call <vscale x 16 x float> @llvm.log10.nxv16f32(<vscale x 16 x float> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %11 = call double @llvm.log10.f64(double undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 23 for instruction: %12 = call <2 x double> @llvm.log10.v2f64(<2 x double> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 22 for instruction: %12 = call <2 x double> @llvm.log10.v2f64(<2 x double> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 47 for instruction: %13 = call <4 x double> @llvm.log10.v4f64(<4 x double> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 95 for instruction: %14 = call <8 x double> @llvm.log10.v8f64(<8 x double> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 191 for instruction: %15 = call <16 x double> @llvm.log10.v16f64(<16 x double> undef)
@@ -274,8 +274,8 @@ define void @log10() {
 define void @log2() {
 ; CHECK-LABEL: 'log2'
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %1 = call float @llvm.log2.f32(float undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 23 for instruction: %2 = call <2 x float> @llvm.log2.v2f32(<2 x float> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 47 for instruction: %3 = call <4 x float> @llvm.log2.v4f32(<4 x float> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 22 for instruction: %2 = call <2 x float> @llvm.log2.v2f32(<2 x float> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 44 for instruction: %3 = call <4 x float> @llvm.log2.v4f32(<4 x float> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 95 for instruction: %4 = call <8 x float> @llvm.log2.v8f32(<8 x float> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 191 for instruction: %5 = call <16 x float> @llvm.log2.v16f32(<16 x float> undef)
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %6 = call <vscale x 1 x float> @llvm.log2.nxv1f32(<vscale x 1 x float> undef)
@@ -284,7 +284,7 @@ define void @log2() {
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %9 = call <vscale x 8 x float> @llvm.log2.nxv8f32(<vscale x 8 x float> undef)
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %10 = call <vscale x 16 x float> @llvm.log2.nxv16f32(<vscale x 16 x float> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %11 = call double @llvm.log2.f64(double undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 23 for instruction: %12 = call <2 x double> @llvm.log2.v2f64(<2 x double> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 22 for instruction: %12 = call <2 x double> @llvm.log2.v2f64(<2 x double> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 47 for instruction: %13 = call <4 x double> @llvm.log2.v4f64(<4 x double> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 95 for instruction: %14 = call <8 x double> @llvm.log2.v8f64(<8 x double> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 191 for instruction: %15 = call <16 x double> @llvm.log2.v16f64(<16 x double> undef)

diff --git a/llvm/test/Analysis/CostModel/RISCV/gep.ll b/llvm/test/Analysis/CostModel/RISCV/gep.ll
index c95ba52ae7ffc8..44c0b5bb71070e 100644
--- a/llvm/test/Analysis/CostModel/RISCV/gep.ll
+++ b/llvm/test/Analysis/CostModel/RISCV/gep.ll
@@ -268,7 +268,7 @@ define void @non_foldable_vector_uses(ptr %base, <2 x ptr> %base.vec) {
 ; RVI-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %3 = getelementptr i8, <2 x ptr> %base.vec, <2 x i32> <i32 42, i32 43>
 ; RVI-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %x3 = call <2 x i8> @llvm.masked.gather.v2i8.v2p0(<2 x ptr> %3, i32 1, <2 x i1> undef, <2 x i8> undef)
 ; RVI-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %4 = getelementptr i8, ptr %base, i32 42
-; RVI-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %x4 = call <2 x i8> @llvm.masked.expandload.v2i8(ptr %4, <2 x i1> undef, <2 x i8> undef)
+; RVI-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %x4 = call <2 x i8> @llvm.masked.expandload.v2i8(ptr %4, <2 x i1> undef, <2 x i8> undef)
 ; RVI-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %5 = getelementptr i8, ptr %base, i32 42
 ; RVI-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %x5 = call <2 x i8> @llvm.vp.load.v2i8.p0(ptr %5, <2 x i1> undef, i32 undef)
 ; RVI-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %6 = getelementptr i8, ptr %base, i32 42
@@ -338,7 +338,7 @@ define void @foldable_vector_uses(ptr %base, <2 x ptr> %base.vec) {
 ; RVI-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: %3 = getelementptr i8, <2 x ptr> %base.vec, <2 x i32> zeroinitializer
 ; RVI-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %x3 = call <2 x i8> @llvm.masked.gather.v2i8.v2p0(<2 x ptr> %3, i32 1, <2 x i1> undef, <2 x i8> undef)
 ; RVI-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: %4 = getelementptr i8, ptr %base, i32 0
-; RVI-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %x4 = call <2 x i8> @llvm.masked.expandload.v2i8(ptr %4, <2 x i1> undef, <2 x i8> undef)
+; RVI-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %x4 = call <2 x i8> @llvm.masked.expandload.v2i8(ptr %4, <2 x i1> undef, <2 x i8> undef)
 ; RVI-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: %5 = getelementptr i8, ptr %base, i32 0
 ; RVI-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %x5 = call <2 x i8> @llvm.vp.load.v2i8.p0(ptr %5, <2 x i1> undef, i32 undef)
 ; RVI-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: %6 = getelementptr i8, ptr %base, i32 0

diff --git a/llvm/test/Analysis/CostModel/RISCV/int-sat-math.ll b/llvm/test/Analysis/CostModel/RISCV/int-sat-math.ll
index 398a41d2d190b0..185fcc9ce8b33c 100644
--- a/llvm/test/Analysis/CostModel/RISCV/int-sat-math.ll
+++ b/llvm/test/Analysis/CostModel/RISCV/int-sat-math.ll
@@ -312,26 +312,26 @@ define void @ssub.sat() {
 define void @ushl.sat() {
 ; CHECK-LABEL: 'ushl.sat'
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %1 = call i8 @llvm.ushl.sat.i8(i8 undef, i8 undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %2 = call <2 x i8> @llvm.ushl.sat.v2i8(<2 x i8> undef, <2 x i8> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 11 for instruction: %3 = call <4 x i8> @llvm.ushl.sat.v4i8(<4 x i8> undef, <4 x i8> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 23 for instruction: %4 = call <8 x i8> @llvm.ushl.sat.v8i8(<8 x i8> undef, <8 x i8> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 47 for instruction: %5 = call <16 x i8> @llvm.ushl.sat.v16i8(<16 x i8> undef, <16 x i8> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %2 = call <2 x i8> @llvm.ushl.sat.v2i8(<2 x i8> undef, <2 x i8> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %3 = call <4 x i8> @llvm.ushl.sat.v4i8(<4 x i8> undef, <4 x i8> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 16 for instruction: %4 = call <8 x i8> @llvm.ushl.sat.v8i8(<8 x i8> undef, <8 x i8> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 32 for instruction: %5 = call <16 x i8> @llvm.ushl.sat.v16i8(<16 x i8> undef, <16 x i8> undef)
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %6 = call <vscale x 2 x i8> @llvm.ushl.sat.nxv2i8(<vscale x 2 x i8> undef, <vscale x 2 x i8> undef)
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %7 = call <vscale x 4 x i8> @llvm.ushl.sat.nxv4i8(<vscale x 4 x i8> undef, <vscale x 4 x i8> undef)
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %8 = call <vscale x 8 x i8> @llvm.ushl.sat.nxv8i8(<vscale x 8 x i8> undef, <vscale x 8 x i8> undef)
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %9 = call <vscale x 16 x i8> @llvm.ushl.sat.nxv16i8(<vscale x 16 x i8> undef, <vscale x 16 x i8> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %10 = call i16 @llvm.ushl.sat.i16(i16 undef, i16 undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %11 = call <2 x i16> @llvm.ushl.sat.v2i16(<2 x i16> undef, <2 x i16> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 11 for instruction: %12 = call <4 x i16> @llvm.ushl.sat.v4i16(<4 x i16> undef, <4 x i16> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 23 for instruction: %13 = call <8 x i16> @llvm.ushl.sat.v8i16(<8 x i16> undef, <8 x i16> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %11 = call <2 x i16> @llvm.ushl.sat.v2i16(<2 x i16> undef, <2 x i16> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %12 = call <4 x i16> @llvm.ushl.sat.v4i16(<4 x i16> undef, <4 x i16> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 16 for instruction: %13 = call <8 x i16> @llvm.ushl.sat.v8i16(<8 x i16> undef, <8 x i16> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 47 for instruction: %14 = call <16 x i16> @llvm.ushl.sat.v16i16(<16 x i16> undef, <16 x i16> undef)
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %15 = call <vscale x 2 x i16> @llvm.ushl.sat.nxv2i16(<vscale x 2 x i16> undef, <vscale x 2 x i16> undef)
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %16 = call <vscale x 4 x i16> @llvm.ushl.sat.nxv4i16(<vscale x 4 x i16> undef, <vscale x 4 x i16> undef)
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %17 = call <vscale x 8 x i16> @llvm.ushl.sat.nxv8i16(<vscale x 8 x i16> undef, <vscale x 8 x i16> undef)
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %18 = call <vscale x 16 x i16> @llvm.ushl.sat.nxv16i16(<vscale x 16 x i16> undef, <vscale x 16 x i16> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %19 = call i32 @llvm.ushl.sat.i32(i32 undef, i32 undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %20 = call <2 x i32> @llvm.ushl.sat.v2i32(<2 x i32> undef, <2 x i32> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 11 for instruction: %21 = call <4 x i32> @llvm.ushl.sat.v4i32(<4 x i32> undef, <4 x i32> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %20 = call <2 x i32> @llvm.ushl.sat.v2i32(<2 x i32> undef, <2 x i32> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %21 = call <4 x i32> @llvm.ushl.sat.v4i32(<4 x i32> undef, <4 x i32> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 23 for instruction: %22 = call <8 x i32> @llvm.ushl.sat.v8i32(<8 x i32> undef, <8 x i32> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 47 for instruction: %23 = call <16 x i32> @llvm.ushl.sat.v16i32(<16 x i32> undef, <16 x i32> undef)
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %24 = call <vscale x 2 x i32> @llvm.ushl.sat.nxv2i32(<vscale x 2 x i32> undef, <vscale x 2 x i32> undef)
@@ -339,7 +339,7 @@ define void @ushl.sat() {
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %26 = call <vscale x 8 x i32> @llvm.ushl.sat.nxv8i32(<vscale x 8 x i32> undef, <vscale x 8 x i32> undef)
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %27 = call <vscale x 16 x i32> @llvm.ushl.sat.nxv16i32(<vscale x 16 x i32> undef, <vscale x 16 x i32> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %28 = call i64 @llvm.ushl.sat.i64(i64 undef, i64 undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %29 = call <2 x i64> @llvm.ushl.sat.v2i64(<2 x i64> undef, <2 x i64> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %29 = call <2 x i64> @llvm.ushl.sat.v2i64(<2 x i64> undef, <2 x i64> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 11 for instruction: %30 = call <4 x i64> @llvm.ushl.sat.v4i64(<4 x i64> undef, <4 x i64> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 23 for instruction: %31 = call <8 x i64> @llvm.ushl.sat.v8i64(<8 x i64> undef, <8 x i64> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 47 for instruction: %32 = call <16 x i64> @llvm.ushl.sat.v16i64(<16 x i64> undef, <16 x i64> undef)
@@ -389,26 +389,26 @@ define void @ushl.sat() {
 define void @sshl.sat() {
 ; CHECK-LABEL: 'sshl.sat'
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %1 = call i8 @llvm.sshl.sat.i8(i8 undef, i8 undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %2 = call <2 x i8> @llvm.sshl.sat.v2i8(<2 x i8> undef, <2 x i8> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 11 for instruction: %3 = call <4 x i8> @llvm.sshl.sat.v4i8(<4 x i8> undef, <4 x i8> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 23 for instruction: %4 = call <8 x i8> @llvm.sshl.sat.v8i8(<8 x i8> undef, <8 x i8> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 47 for instruction: %5 = call <16 x i8> @llvm.sshl.sat.v16i8(<16 x i8> undef, <16 x i8> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %2 = call <2 x i8> @llvm.sshl.sat.v2i8(<2 x i8> undef, <2 x i8> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %3 = call <4 x i8> @llvm.sshl.sat.v4i8(<4 x i8> undef, <4 x i8> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 16 for instruction: %4 = call <8 x i8> @llvm.sshl.sat.v8i8(<8 x i8> undef, <8 x i8> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 32 for instruction: %5 = call <16 x i8> @llvm.sshl.sat.v16i8(<16 x i8> undef, <16 x i8> undef)
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %6 = call <vscale x 2 x i8> @llvm.sshl.sat.nxv2i8(<vscale x 2 x i8> undef, <vscale x 2 x i8> undef)
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %7 = call <vscale x 4 x i8> @llvm.sshl.sat.nxv4i8(<vscale x 4 x i8> undef, <vscale x 4 x i8> undef)
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %8 = call <vscale x 8 x i8> @llvm.sshl.sat.nxv8i8(<vscale x 8 x i8> undef, <vscale x 8 x i8> undef)
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %9 = call <vscale x 16 x i8> @llvm.sshl.sat.nxv16i8(<vscale x 16 x i8> undef, <vscale x 16 x i8> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %10 = call i16 @llvm.sshl.sat.i16(i16 undef, i16 undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %11 = call <2 x i16> @llvm.sshl.sat.v2i16(<2 x i16> undef, <2 x i16> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 11 for instruction: %12 = call <4 x i16> @llvm.sshl.sat.v4i16(<4 x i16> undef, <4 x i16> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 23 for instruction: %13 = call <8 x i16> @llvm.sshl.sat.v8i16(<8 x i16> undef, <8 x i16> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %11 = call <2 x i16> @llvm.sshl.sat.v2i16(<2 x i16> undef, <2 x i16> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %12 = call <4 x i16> @llvm.sshl.sat.v4i16(<4 x i16> undef, <4 x i16> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 16 for instruction: %13 = call <8 x i16> @llvm.sshl.sat.v8i16(<8 x i16> undef, <8 x i16> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 47 for instruction: %14 = call <16 x i16> @llvm.sshl.sat.v16i16(<16 x i16> undef, <16 x i16> undef)
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %15 = call <vscale x 2 x i16> @llvm.sshl.sat.nxv2i16(<vscale x 2 x i16> undef, <vscale x 2 x i16> undef)
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %16 = call <vscale x 4 x i16> @llvm.sshl.sat.nxv4i16(<vscale x 4 x i16> undef, <vscale x 4 x i16> undef)
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %17 = call <vscale x 8 x i16> @llvm.sshl.sat.nxv8i16(<vscale x 8 x i16> undef, <vscale x 8 x i16> undef)
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %18 = call <vscale x 16 x i16> @llvm.sshl.sat.nxv16i16(<vscale x 16 x i16> undef, <vscale x 16 x i16> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %19 = call i32 @llvm.sshl.sat.i32(i32 undef, i32 undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %20 = call <2 x i32> @llvm.sshl.sat.v2i32(<2 x i32> undef, <2 x i32> undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 11 for instruction: %21 = call <4 x i32> @llvm.sshl.sat.v4i32(<4 x i32> undef, <4 x i32> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %20 = call <2 x i32> @llvm.sshl.sat.v2i32(<2 x i32> undef, <2 x i32> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %21 = call <4 x i32> @llvm.sshl.sat.v4i32(<4 x i32> undef, <4 x i32> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 23 for instruction: %22 = call <8 x i32> @llvm.sshl.sat.v8i32(<8 x i32> undef, <8 x i32> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 47 for instruction: %23 = call <16 x i32> @llvm.sshl.sat.v16i32(<16 x i32> undef, <16 x i32> undef)
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %24 = call <vscale x 2 x i32> @llvm.sshl.sat.nxv2i32(<vscale x 2 x i32> undef, <vscale x 2 x i32> undef)
@@ -416,7 +416,7 @@ define void @sshl.sat() {
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %26 = call <vscale x 8 x i32> @llvm.sshl.sat.nxv8i32(<vscale x 8 x i32> undef, <vscale x 8 x i32> undef)
 ; CHECK-NEXT:  Cost Model: Invalid cost for instruction: %27 = call <vscale x 16 x i32> @llvm.sshl.sat.nxv16i32(<vscale x 16 x i32> undef, <vscale x 16 x i32> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %28 = call i64 @llvm.sshl.sat.i64(i64 undef, i64 undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %29 = call <2 x i64> @llvm.sshl.sat.v2i64(<2 x i64> undef, <2 x i64> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %29 = call <2 x i64> @llvm.sshl.sat.v2i64(<2 x i64> undef, <2 x i64> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 11 for instruction: %30 = call <4 x i64> @llvm.sshl.sat.v4i64(<4 x i64> undef, <4 x i64> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 23 for instruction: %31 = call <8 x i64> @llvm.sshl.sat.v8i64(<8 x i64> undef, <8 x i64> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 47 for instruction: %32 = call <16 x i64> @llvm.sshl.sat.v16i64(<16 x i64> undef, <16 x i64> undef)
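
[Note: the ushl.sat/sshl.sat deltas in these hunks all shrink by exactly N-1
for the fixed <N x iM> cases that changed (5->4, 11->8, 23->16, 47->32), which
matches costing the build_vector portion of the scalarized fallback as N
vslide1down instead of a generic insertelement chain. The cases that kept
their old costs (<16 x i16>, <8 x i32>, <4 x i64> and wider) are presumably
the ones that no longer fit a single vector register at the tests' default
VLEN. A minimal sketch of the pattern whose costing changed -- the function
name and element type here are illustrative, not taken from the tests:

define <4 x i32> @build_vector_sketch(i32 %a, i32 %b, i32 %c, i32 %d) {
  %v0 = insertelement <4 x i32> poison, i32 %a, i32 0
  %v1 = insertelement <4 x i32> %v0, i32 %b, i32 1
  %v2 = insertelement <4 x i32> %v1, i32 %c, i32 2
  %v3 = insertelement <4 x i32> %v2, i32 %d, i32 3
  ret <4 x i32> %v3
}

Under the old model this chain would be costed as 2*4-1 = 7 generic inserts;
the new model assumes it lowers to four vslide1down.vx and costs it 4.]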

diff --git a/llvm/test/Analysis/CostModel/RISCV/rvv-intrinsics.ll b/llvm/test/Analysis/CostModel/RISCV/rvv-intrinsics.ll
index 6fe65d97d7448f..49e09a4e5fead1 100644
--- a/llvm/test/Analysis/CostModel/RISCV/rvv-intrinsics.ll
+++ b/llvm/test/Analysis/CostModel/RISCV/rvv-intrinsics.ll
@@ -746,15 +746,15 @@ define void @abs() {
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
 ;
 ; TYPEBASED-LABEL: 'abs'
-; TYPEBASED-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %1 = call <2 x i8> @llvm.vp.abs.v2i8(<2 x i8> undef, i1 false, <2 x i1> undef, i32 undef)
+; TYPEBASED-NEXT:  Cost Model: Found an estimated cost of 14 for instruction: %1 = call <2 x i8> @llvm.vp.abs.v2i8(<2 x i8> undef, i1 false, <2 x i1> undef, i32 undef)
 ; TYPEBASED-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %2 = call <2 x i8> @llvm.abs.v2i8(<2 x i8> undef, i1 false)
-; TYPEBASED-NEXT:  Cost Model: Found an estimated cost of 33 for instruction: %3 = call <4 x i8> @llvm.vp.abs.v4i8(<4 x i8> undef, i1 false, <4 x i1> undef, i32 undef)
+; TYPEBASED-NEXT:  Cost Model: Found an estimated cost of 30 for instruction: %3 = call <4 x i8> @llvm.vp.abs.v4i8(<4 x i8> undef, i1 false, <4 x i1> undef, i32 undef)
 ; TYPEBASED-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %4 = call <4 x i8> @llvm.abs.v4i8(<4 x i8> undef, i1 false)
-; TYPEBASED-NEXT:  Cost Model: Found an estimated cost of 69 for instruction: %5 = call <8 x i8> @llvm.vp.abs.v8i8(<8 x i8> undef, i1 false, <8 x i1> undef, i32 undef)
+; TYPEBASED-NEXT:  Cost Model: Found an estimated cost of 62 for instruction: %5 = call <8 x i8> @llvm.vp.abs.v8i8(<8 x i8> undef, i1 false, <8 x i1> undef, i32 undef)
 ; TYPEBASED-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %6 = call <8 x i8> @llvm.abs.v8i8(<8 x i8> undef, i1 false)
-; TYPEBASED-NEXT:  Cost Model: Found an estimated cost of 141 for instruction: %7 = call <16 x i8> @llvm.vp.abs.v16i8(<16 x i8> undef, i1 false, <16 x i1> undef, i32 undef)
+; TYPEBASED-NEXT:  Cost Model: Found an estimated cost of 126 for instruction: %7 = call <16 x i8> @llvm.vp.abs.v16i8(<16 x i8> undef, i1 false, <16 x i1> undef, i32 undef)
 ; TYPEBASED-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %8 = call <16 x i8> @llvm.abs.v16i8(<16 x i8> undef, i1 false)
-; TYPEBASED-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %9 = call <2 x i64> @llvm.vp.abs.v2i64(<2 x i64> undef, i1 false, <2 x i1> undef, i32 undef)
+; TYPEBASED-NEXT:  Cost Model: Found an estimated cost of 14 for instruction: %9 = call <2 x i64> @llvm.vp.abs.v2i64(<2 x i64> undef, i1 false, <2 x i1> undef, i32 undef)
 ; TYPEBASED-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %10 = call <2 x i64> @llvm.abs.v2i64(<2 x i64> undef, i1 false)
 ; TYPEBASED-NEXT:  Cost Model: Found an estimated cost of 33 for instruction: %11 = call <4 x i64> @llvm.vp.abs.v4i64(<4 x i64> undef, i1 false, <4 x i1> undef, i32 undef)
 ; TYPEBASED-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %12 = call <4 x i64> @llvm.abs.v4i64(<4 x i64> undef, i1 false)
@@ -1052,15 +1052,15 @@ define void @strided_load() {
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
 ;
 ; TYPEBASED-LABEL: 'strided_load'
-; TYPEBASED-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %t0 = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr undef, i64 undef, <2 x i1> undef, i32 undef)
-; TYPEBASED-NEXT:  Cost Model: Found an estimated cost of 26 for instruction: %t2 = call <4 x i8> @llvm.experimental.vp.strided.load.v4i8.p0.i64(ptr undef, i64 undef, <4 x i1> undef, i32 undef)
-; TYPEBASED-NEXT:  Cost Model: Found an estimated cost of 54 for instruction: %t4 = call <8 x i8> @llvm.experimental.vp.strided.load.v8i8.p0.i64(ptr undef, i64 undef, <8 x i1> undef, i32 undef)
-; TYPEBASED-NEXT:  Cost Model: Found an estimated cost of 110 for instruction: %t6 = call <16 x i8> @llvm.experimental.vp.strided.load.v16i8.p0.i64(ptr undef, i64 undef, <16 x i1> undef, i32 undef)
-; TYPEBASED-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %t8.a = call <2 x i64> @llvm.experimental.vp.strided.load.v2i64.p0.i64(ptr align 8 undef, i64 undef, <2 x i1> undef, i32 undef)
+; TYPEBASED-NEXT:  Cost Model: Found an estimated cost of 11 for instruction: %t0 = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr undef, i64 undef, <2 x i1> undef, i32 undef)
+; TYPEBASED-NEXT:  Cost Model: Found an estimated cost of 23 for instruction: %t2 = call <4 x i8> @llvm.experimental.vp.strided.load.v4i8.p0.i64(ptr undef, i64 undef, <4 x i1> undef, i32 undef)
+; TYPEBASED-NEXT:  Cost Model: Found an estimated cost of 47 for instruction: %t4 = call <8 x i8> @llvm.experimental.vp.strided.load.v8i8.p0.i64(ptr undef, i64 undef, <8 x i1> undef, i32 undef)
+; TYPEBASED-NEXT:  Cost Model: Found an estimated cost of 95 for instruction: %t6 = call <16 x i8> @llvm.experimental.vp.strided.load.v16i8.p0.i64(ptr undef, i64 undef, <16 x i1> undef, i32 undef)
+; TYPEBASED-NEXT:  Cost Model: Found an estimated cost of 11 for instruction: %t8.a = call <2 x i64> @llvm.experimental.vp.strided.load.v2i64.p0.i64(ptr align 8 undef, i64 undef, <2 x i1> undef, i32 undef)
 ; TYPEBASED-NEXT:  Cost Model: Found an estimated cost of 26 for instruction: %t10.a = call <4 x i64> @llvm.experimental.vp.strided.load.v4i64.p0.i64(ptr align 8 undef, i64 undef, <4 x i1> undef, i32 undef)
 ; TYPEBASED-NEXT:  Cost Model: Found an estimated cost of 54 for instruction: %t13.a = call <8 x i64> @llvm.experimental.vp.strided.load.v8i64.p0.i64(ptr align 8 undef, i64 undef, <8 x i1> undef, i32 undef)
 ; TYPEBASED-NEXT:  Cost Model: Found an estimated cost of 110 for instruction: %t15.a = call <16 x i64> @llvm.experimental.vp.strided.load.v16i64.p0.i64(ptr align 8 undef, i64 undef, <16 x i1> undef, i32 undef)
-; TYPEBASED-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %t8 = call <2 x i64> @llvm.experimental.vp.strided.load.v2i64.p0.i64(ptr undef, i64 undef, <2 x i1> undef, i32 undef)
+; TYPEBASED-NEXT:  Cost Model: Found an estimated cost of 11 for instruction: %t8 = call <2 x i64> @llvm.experimental.vp.strided.load.v2i64.p0.i64(ptr undef, i64 undef, <2 x i1> undef, i32 undef)
 ; TYPEBASED-NEXT:  Cost Model: Found an estimated cost of 26 for instruction: %t10 = call <4 x i64> @llvm.experimental.vp.strided.load.v4i64.p0.i64(ptr undef, i64 undef, <4 x i1> undef, i32 undef)
 ; TYPEBASED-NEXT:  Cost Model: Found an estimated cost of 54 for instruction: %t13 = call <8 x i64> @llvm.experimental.vp.strided.load.v8i64.p0.i64(ptr undef, i64 undef, <8 x i1> undef, i32 undef)
 ; TYPEBASED-NEXT:  Cost Model: Found an estimated cost of 110 for instruction: %t15 = call <16 x i64> @llvm.experimental.vp.strided.load.v16i64.p0.i64(ptr undef, i64 undef, <16 x i1> undef, i32 undef)
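
[Note: the TYPEBASED deltas above follow the same N-1 pattern: vp.abs goes
15->14, 33->30, 69->62, 141->126 (and <2 x i64> 15->14, while <4 x i64> keeps
33), and the i8 strided loads go 12->11, 26->23, 54->47, 110->95, i.e. a drop
of 1, 3, 7, 15 for 2, 4, 8, 16 lanes. The type-based fallback scalarizes and
then rebuilds the vector, and only the rebuild got cheaper. As a worked
example for <16 x i8>:

  new cost = old cost - (N - 1) = 110 - 15 = 95

The <4 x i64> through <16 x i64> strided loads keep 26/54/110, consistent
with the multi-register cases being unaffected here as well.]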

diff --git a/llvm/test/Transforms/SLPVectorizer/RISCV/complex-loads.ll b/llvm/test/Transforms/SLPVectorizer/RISCV/complex-loads.ll
index 01c842edd88e41..803af4d166b213 100644
--- a/llvm/test/Transforms/SLPVectorizer/RISCV/complex-loads.ll
+++ b/llvm/test/Transforms/SLPVectorizer/RISCV/complex-loads.ll
@@ -24,12 +24,27 @@ define i32 @test(ptr %pix1, ptr %pix2, i64 %idx.ext, i64 %idx.ext63, ptr %add.pt
 ; CHECK-NEXT:    [[ARRAYIDX3_1:%.*]] = getelementptr i8, ptr [[ADD_PTR3]], i64 4
 ; CHECK-NEXT:    [[ARRAYIDX5_1:%.*]] = getelementptr i8, ptr [[ADD_PTR644]], i64 4
 ; CHECK-NEXT:    [[ARRAYIDX8_1:%.*]] = getelementptr i8, ptr [[ADD_PTR3]], i64 1
-; CHECK-NEXT:    [[ARRAYIDX22_1:%.*]] = getelementptr i8, ptr [[ADD_PTR644]], i64 1
-; CHECK-NEXT:    [[ARRAYIDX25_1:%.*]] = getelementptr i8, ptr [[ADD_PTR3]], i64 5
-; CHECK-NEXT:    [[ARRAYIDX27_1:%.*]] = getelementptr i8, ptr [[ADD_PTR644]], i64 5
-; CHECK-NEXT:    [[ARRAYIDX32_1:%.*]] = getelementptr i8, ptr [[ADD_PTR3]], i64 3
+; CHECK-NEXT:    [[ARRAYIDX32_1:%.*]] = getelementptr i8, ptr [[ADD_PTR644]], i64 2
 ; CHECK-NEXT:    [[TMP14:%.*]] = load i8, ptr [[ARRAYIDX32_1]], align 1
-; CHECK-NEXT:    [[CONV33_1:%.*]] = zext i8 [[TMP14]] to i32
+; CHECK-NEXT:    [[ARRAYIDX25_1:%.*]] = getelementptr i8, ptr [[ADD_PTR3]], i64 6
+; CHECK-NEXT:    [[TMP4:%.*]] = load i8, ptr [[ARRAYIDX25_1]], align 1
+; CHECK-NEXT:    [[ARRAYIDX27_1:%.*]] = getelementptr i8, ptr [[ADD_PTR644]], i64 6
+; CHECK-NEXT:    [[TMP5:%.*]] = load i8, ptr [[ARRAYIDX27_1]], align 1
+; CHECK-NEXT:    [[TMP6:%.*]] = insertelement <2 x i8> poison, i8 [[TMP4]], i32 0
+; CHECK-NEXT:    [[TMP7:%.*]] = insertelement <2 x i8> [[TMP6]], i8 [[TMP5]], i32 1
+; CHECK-NEXT:    [[TMP8:%.*]] = extractelement <2 x i8> [[TMP7]], i32 0
+; CHECK-NEXT:    [[TMP9:%.*]] = zext i8 [[TMP8]] to i32
+; CHECK-NEXT:    [[ARRAYIDX32_2:%.*]] = getelementptr i8, ptr [[ADD_PTR3]], i64 3
+; CHECK-NEXT:    [[ARRAYIDX34_1:%.*]] = getelementptr i8, ptr [[ADD_PTR644]], i64 3
+; CHECK-NEXT:    [[TMP18:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX32_2]], i64 4, <2 x i1> <i1 true, i1 true>, i32 2)
+; CHECK-NEXT:    [[TMP20:%.*]] = zext <2 x i8> [[TMP18]] to <2 x i16>
+; CHECK-NEXT:    [[TMP12:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX34_1]], i64 4, <2 x i1> <i1 true, i1 true>, i32 2)
+; CHECK-NEXT:    [[TMP13:%.*]] = zext <2 x i8> [[TMP12]] to <2 x i16>
+; CHECK-NEXT:    [[TMP28:%.*]] = sub <2 x i16> [[TMP20]], [[TMP13]]
+; CHECK-NEXT:    [[TMP15:%.*]] = extractelement <2 x i16> [[TMP28]], i32 1
+; CHECK-NEXT:    [[TMP16:%.*]] = sext i16 [[TMP15]] to i32
+; CHECK-NEXT:    [[TMP17:%.*]] = extractelement <2 x i16> [[TMP28]], i32 0
+; CHECK-NEXT:    [[CONV33_1:%.*]] = sext i16 [[TMP17]] to i32
 ; CHECK-NEXT:    [[ADD_PTR_1:%.*]] = getelementptr i8, ptr [[ADD_PTR]], i64 [[IDX_EXT]]
 ; CHECK-NEXT:    [[ADD_PTR64_1:%.*]] = getelementptr i8, ptr [[ADD_PTR64]], i64 [[IDX_EXT63]]
 ; CHECK-NEXT:    [[ARRAYIDX3_2:%.*]] = getelementptr i8, ptr [[ADD_PTR_1]], i64 4
@@ -38,40 +53,39 @@ define i32 @test(ptr %pix1, ptr %pix2, i64 %idx.ext, i64 %idx.ext63, ptr %add.pt
 ; CHECK-NEXT:    [[ARRAYIDX10_2:%.*]] = getelementptr i8, ptr [[ADD_PTR64_1]], i64 1
 ; CHECK-NEXT:    [[ARRAYIDX13_2:%.*]] = getelementptr i8, ptr [[ADD_PTR_1]], i64 5
 ; CHECK-NEXT:    [[ARRAYIDX15_2:%.*]] = getelementptr i8, ptr [[ADD_PTR64_1]], i64 5
-; CHECK-NEXT:    [[TMP4:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ADD_PTR_1]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
-; CHECK-NEXT:    [[TMP16:%.*]] = zext <2 x i8> [[TMP4]] to <2 x i32>
-; CHECK-NEXT:    [[TMP6:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ADD_PTR64_1]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
-; CHECK-NEXT:    [[TMP7:%.*]] = zext <2 x i8> [[TMP6]] to <2 x i32>
-; CHECK-NEXT:    [[TMP8:%.*]] = sub <2 x i32> [[TMP16]], [[TMP7]]
-; CHECK-NEXT:    [[TMP9:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX3_2]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
-; CHECK-NEXT:    [[TMP13:%.*]] = zext <2 x i8> [[TMP9]] to <2 x i32>
+; CHECK-NEXT:    [[TMP19:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ADD_PTR_1]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
+; CHECK-NEXT:    [[TMP192:%.*]] = zext <2 x i8> [[TMP19]] to <2 x i32>
+; CHECK-NEXT:    [[TMP21:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ADD_PTR64_1]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
+; CHECK-NEXT:    [[TMP22:%.*]] = zext <2 x i8> [[TMP21]] to <2 x i32>
+; CHECK-NEXT:    [[TMP23:%.*]] = sub <2 x i32> [[TMP192]], [[TMP22]]
+; CHECK-NEXT:    [[TMP29:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX3_2]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
+; CHECK-NEXT:    [[TMP40:%.*]] = zext <2 x i8> [[TMP29]] to <2 x i32>
 ; CHECK-NEXT:    [[TMP26:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX5_2]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
-; CHECK-NEXT:    [[TMP12:%.*]] = zext <2 x i8> [[TMP26]] to <2 x i32>
-; CHECK-NEXT:    [[TMP24:%.*]] = sub <2 x i32> [[TMP13]], [[TMP12]]
+; CHECK-NEXT:    [[TMP27:%.*]] = zext <2 x i8> [[TMP26]] to <2 x i32>
+; CHECK-NEXT:    [[TMP24:%.*]] = sub <2 x i32> [[TMP40]], [[TMP27]]
 ; CHECK-NEXT:    [[TMP25:%.*]] = shl <2 x i32> [[TMP24]], <i32 16, i32 16>
-; CHECK-NEXT:    [[TMP15:%.*]] = add <2 x i32> [[TMP25]], [[TMP8]]
-; CHECK-NEXT:    [[TMP28:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX8_2]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
-; CHECK-NEXT:    [[TMP17:%.*]] = zext <2 x i8> [[TMP28]] to <2 x i32>
-; CHECK-NEXT:    [[TMP18:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX10_2]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
-; CHECK-NEXT:    [[TMP19:%.*]] = zext <2 x i8> [[TMP18]] to <2 x i32>
-; CHECK-NEXT:    [[TMP20:%.*]] = sub <2 x i32> [[TMP17]], [[TMP19]]
-; CHECK-NEXT:    [[TMP21:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX13_2]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
-; CHECK-NEXT:    [[TMP22:%.*]] = zext <2 x i8> [[TMP21]] to <2 x i32>
-; CHECK-NEXT:    [[TMP23:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX15_2]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
-; CHECK-NEXT:    [[TMP30:%.*]] = zext <2 x i8> [[TMP23]] to <2 x i32>
-; CHECK-NEXT:    [[TMP36:%.*]] = sub <2 x i32> [[TMP22]], [[TMP30]]
+; CHECK-NEXT:    [[TMP30:%.*]] = add <2 x i32> [[TMP25]], [[TMP23]]
+; CHECK-NEXT:    [[TMP31:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX8_2]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
+; CHECK-NEXT:    [[TMP32:%.*]] = zext <2 x i8> [[TMP31]] to <2 x i32>
+; CHECK-NEXT:    [[TMP33:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX10_2]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
+; CHECK-NEXT:    [[TMP49:%.*]] = zext <2 x i8> [[TMP33]] to <2 x i32>
+; CHECK-NEXT:    [[TMP35:%.*]] = sub <2 x i32> [[TMP32]], [[TMP49]]
+; CHECK-NEXT:    [[TMP50:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX13_2]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
+; CHECK-NEXT:    [[TMP51:%.*]] = zext <2 x i8> [[TMP50]] to <2 x i32>
+; CHECK-NEXT:    [[TMP38:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX15_2]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
+; CHECK-NEXT:    [[TMP39:%.*]] = zext <2 x i8> [[TMP38]] to <2 x i32>
+; CHECK-NEXT:    [[TMP36:%.*]] = sub <2 x i32> [[TMP51]], [[TMP39]]
 ; CHECK-NEXT:    [[TMP37:%.*]] = shl <2 x i32> [[TMP36]], <i32 16, i32 16>
-; CHECK-NEXT:    [[TMP27:%.*]] = add <2 x i32> [[TMP37]], [[TMP20]]
-; CHECK-NEXT:    [[TMP38:%.*]] = add <2 x i32> [[TMP27]], [[TMP15]]
-; CHECK-NEXT:    [[TMP29:%.*]] = sub <2 x i32> [[TMP15]], [[TMP27]]
-; CHECK-NEXT:    [[SUB45_2:%.*]] = extractelement <2 x i32> [[TMP38]], i32 0
-; CHECK-NEXT:    [[SUB47_2:%.*]] = extractelement <2 x i32> [[TMP38]], i32 1
-; CHECK-NEXT:    [[ADD48_2:%.*]] = add i32 [[SUB47_2]], [[SUB45_2]]
-; CHECK-NEXT:    [[SUB51_2:%.*]] = sub i32 [[SUB45_2]], [[SUB47_2]]
-; CHECK-NEXT:    [[TMP32:%.*]] = extractelement <2 x i32> [[TMP29]], i32 0
-; CHECK-NEXT:    [[TMP34:%.*]] = extractelement <2 x i32> [[TMP29]], i32 1
-; CHECK-NEXT:    [[ADD55_2:%.*]] = add i32 [[TMP34]], [[TMP32]]
-; CHECK-NEXT:    [[SUB59_2:%.*]] = sub i32 [[TMP32]], [[TMP34]]
+; CHECK-NEXT:    [[TMP42:%.*]] = add <2 x i32> [[TMP37]], [[TMP35]]
+; CHECK-NEXT:    [[TMP56:%.*]] = add <2 x i32> [[TMP42]], [[TMP30]]
+; CHECK-NEXT:    [[TMP60:%.*]] = sub <2 x i32> [[TMP30]], [[TMP42]]
+; CHECK-NEXT:    [[TMP73:%.*]] = extractelement <2 x i32> [[TMP56]], i32 0
+; CHECK-NEXT:    [[TMP34:%.*]] = extractelement <2 x i32> [[TMP56]], i32 1
+; CHECK-NEXT:    [[ADD48_2:%.*]] = add i32 [[TMP34]], [[TMP73]]
+; CHECK-NEXT:    [[SUB51_2:%.*]] = sub i32 [[TMP73]], [[TMP34]]
+; CHECK-NEXT:    [[TMP47:%.*]] = extractelement <2 x i32> [[TMP60]], i32 0
+; CHECK-NEXT:    [[TMP48:%.*]] = extractelement <2 x i32> [[TMP60]], i32 1
+; CHECK-NEXT:    [[TMP68:%.*]] = sub i32 [[TMP47]], [[TMP48]]
 ; CHECK-NEXT:    [[ARRAYIDX3_3:%.*]] = getelementptr i8, ptr null, i64 4
 ; CHECK-NEXT:    [[ARRAYIDX5_3:%.*]] = getelementptr i8, ptr null, i64 4
 ; CHECK-NEXT:    [[ARRAYIDX8_3:%.*]] = getelementptr i8, ptr null, i64 1
@@ -80,216 +94,248 @@ define i32 @test(ptr %pix1, ptr %pix2, i64 %idx.ext, i64 %idx.ext63, ptr %add.pt
 ; CHECK-NEXT:    [[ARRAYIDX15_3:%.*]] = getelementptr i8, ptr null, i64 5
 ; CHECK-NEXT:    [[TMP43:%.*]] = load i8, ptr null, align 1
 ; CHECK-NEXT:    [[TMP53:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 null, i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
-; CHECK-NEXT:    [[TMP58:%.*]] = zext <2 x i8> [[TMP53]] to <2 x i32>
+; CHECK-NEXT:    [[TMP52:%.*]] = zext <2 x i8> [[TMP53]] to <2 x i32>
 ; CHECK-NEXT:    [[TMP54:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 null, i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
-; CHECK-NEXT:    [[TMP39:%.*]] = zext <2 x i8> [[TMP54]] to <2 x i32>
-; CHECK-NEXT:    [[TMP40:%.*]] = sub <2 x i32> [[TMP58]], [[TMP39]]
+; CHECK-NEXT:    [[TMP61:%.*]] = zext <2 x i8> [[TMP54]] to <2 x i32>
+; CHECK-NEXT:    [[TMP55:%.*]] = sub <2 x i32> [[TMP52]], [[TMP61]]
 ; CHECK-NEXT:    [[TMP41:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX3_3]], i64 -4, <2 x i1> <i1 true, i1 true>, i32 2)
-; CHECK-NEXT:    [[TMP42:%.*]] = zext <2 x i8> [[TMP41]] to <2 x i32>
-; CHECK-NEXT:    [[TMP59:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX5_3]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
-; CHECK-NEXT:    [[TMP62:%.*]] = zext <2 x i8> [[TMP59]] to <2 x i32>
-; CHECK-NEXT:    [[TMP45:%.*]] = sub <2 x i32> [[TMP42]], [[TMP62]]
+; CHECK-NEXT:    [[TMP57:%.*]] = zext <2 x i8> [[TMP41]] to <2 x i32>
+; CHECK-NEXT:    [[TMP58:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX5_3]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
+; CHECK-NEXT:    [[TMP59:%.*]] = zext <2 x i8> [[TMP58]] to <2 x i32>
+; CHECK-NEXT:    [[TMP45:%.*]] = sub <2 x i32> [[TMP57]], [[TMP59]]
 ; CHECK-NEXT:    [[TMP46:%.*]] = shl <2 x i32> [[TMP45]], <i32 16, i32 16>
-; CHECK-NEXT:    [[TMP68:%.*]] = add <2 x i32> [[TMP46]], [[TMP40]]
-; CHECK-NEXT:    [[TMP48:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX8_3]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
-; CHECK-NEXT:    [[TMP49:%.*]] = zext <2 x i8> [[TMP48]] to <2 x i32>
-; CHECK-NEXT:    [[TMP50:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX10_3]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
-; CHECK-NEXT:    [[TMP51:%.*]] = zext <2 x i8> [[TMP50]] to <2 x i32>
-; CHECK-NEXT:    [[TMP52:%.*]] = sub <2 x i32> [[TMP49]], [[TMP51]]
+; CHECK-NEXT:    [[TMP62:%.*]] = add <2 x i32> [[TMP46]], [[TMP55]]
+; CHECK-NEXT:    [[TMP63:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX8_3]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
+; CHECK-NEXT:    [[TMP74:%.*]] = zext <2 x i8> [[TMP63]] to <2 x i32>
+; CHECK-NEXT:    [[TMP79:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX10_3]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
+; CHECK-NEXT:    [[TMP82:%.*]] = zext <2 x i8> [[TMP79]] to <2 x i32>
+; CHECK-NEXT:    [[TMP85:%.*]] = sub <2 x i32> [[TMP74]], [[TMP82]]
 ; CHECK-NEXT:    [[TMP64:%.*]] = insertelement <2 x i8> poison, i8 [[TMP44]], i32 0
 ; CHECK-NEXT:    [[TMP65:%.*]] = insertelement <2 x i8> [[TMP64]], i8 [[TMP43]], i32 1
-; CHECK-NEXT:    [[TMP55:%.*]] = zext <2 x i8> [[TMP65]] to <2 x i32>
-; CHECK-NEXT:    [[TMP56:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX15_3]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
-; CHECK-NEXT:    [[TMP57:%.*]] = zext <2 x i8> [[TMP56]] to <2 x i32>
-; CHECK-NEXT:    [[TMP69:%.*]] = sub <2 x i32> [[TMP55]], [[TMP57]]
+; CHECK-NEXT:    [[TMP94:%.*]] = zext <2 x i8> [[TMP65]] to <2 x i32>
+; CHECK-NEXT:    [[TMP103:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX15_3]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
+; CHECK-NEXT:    [[TMP72:%.*]] = zext <2 x i8> [[TMP103]] to <2 x i32>
+; CHECK-NEXT:    [[TMP69:%.*]] = sub <2 x i32> [[TMP94]], [[TMP72]]
 ; CHECK-NEXT:    [[TMP70:%.*]] = shl <2 x i32> [[TMP69]], <i32 16, i32 16>
-; CHECK-NEXT:    [[TMP60:%.*]] = add <2 x i32> [[TMP70]], [[TMP52]]
-; CHECK-NEXT:    [[TMP47:%.*]] = add <2 x i32> [[TMP60]], [[TMP68]]
-; CHECK-NEXT:    [[TMP33:%.*]] = sub <2 x i32> [[TMP68]], [[TMP60]]
-; CHECK-NEXT:    [[TMP61:%.*]] = extractelement <2 x i32> [[TMP47]], i32 0
-; CHECK-NEXT:    [[TMP79:%.*]] = extractelement <2 x i32> [[TMP47]], i32 1
-; CHECK-NEXT:    [[ADD48_3:%.*]] = add i32 [[TMP79]], [[TMP61]]
-; CHECK-NEXT:    [[SUB51_3:%.*]] = sub i32 [[TMP61]], [[TMP79]]
-; CHECK-NEXT:    [[TMP63:%.*]] = extractelement <2 x i32> [[TMP33]], i32 0
-; CHECK-NEXT:    [[TMP71:%.*]] = extractelement <2 x i32> [[TMP33]], i32 1
-; CHECK-NEXT:    [[ADD55_3:%.*]] = add i32 [[TMP71]], [[TMP63]]
-; CHECK-NEXT:    [[SUB59_3:%.*]] = sub i32 [[TMP63]], [[TMP71]]
-; CHECK-NEXT:    [[ADD95:%.*]] = add i32 [[ADD48_3]], [[ADD48_2]]
-; CHECK-NEXT:    [[SUB102:%.*]] = sub i32 [[ADD48_2]], [[ADD48_3]]
-; CHECK-NEXT:    [[TMP77:%.*]] = extractelement <2 x i32> [[TMP58]], i32 0
+; CHECK-NEXT:    [[TMP75:%.*]] = add <2 x i32> [[TMP70]], [[TMP85]]
+; CHECK-NEXT:    [[TMP76:%.*]] = add <2 x i32> [[TMP75]], [[TMP62]]
+; CHECK-NEXT:    [[TMP106:%.*]] = sub <2 x i32> [[TMP62]], [[TMP75]]
+; CHECK-NEXT:    [[TMP78:%.*]] = extractelement <2 x i32> [[TMP76]], i32 0
+; CHECK-NEXT:    [[TMP71:%.*]] = extractelement <2 x i32> [[TMP76]], i32 1
+; CHECK-NEXT:    [[ADD48_3:%.*]] = add i32 [[TMP71]], [[TMP78]]
+; CHECK-NEXT:    [[SUB51_3:%.*]] = sub i32 [[TMP78]], [[TMP71]]
+; CHECK-NEXT:    [[TMP80:%.*]] = extractelement <2 x i32> [[TMP106]], i32 0
+; CHECK-NEXT:    [[TMP81:%.*]] = extractelement <2 x i32> [[TMP106]], i32 1
+; CHECK-NEXT:    [[ADD55_3:%.*]] = add i32 [[TMP81]], [[TMP80]]
+; CHECK-NEXT:    [[TMP107:%.*]] = sub i32 [[TMP80]], [[TMP81]]
+; CHECK-NEXT:    [[TMP77:%.*]] = extractelement <2 x i32> [[TMP52]], i32 0
 ; CHECK-NEXT:    [[SHR_I49_3:%.*]] = lshr i32 [[TMP77]], 15
 ; CHECK-NEXT:    [[AND_I50_3:%.*]] = and i32 [[SHR_I49_3]], 65537
 ; CHECK-NEXT:    [[MUL_I51_3:%.*]] = mul i32 [[AND_I50_3]], 65535
-; CHECK-NEXT:    [[SHR_I_1:%.*]] = lshr i32 [[SUB47_2]], 15
+; CHECK-NEXT:    [[SHR_I_1:%.*]] = lshr i32 [[TMP34]], 15
 ; CHECK-NEXT:    [[AND_I_1:%.*]] = and i32 [[SHR_I_1]], 65537
 ; CHECK-NEXT:    [[MUL_I_1:%.*]] = mul i32 [[AND_I_1]], 65535
-; CHECK-NEXT:    [[ADD94_1:%.*]] = add i32 [[ADD55_3]], [[ADD55_2]]
-; CHECK-NEXT:    [[SUB102_1:%.*]] = sub i32 [[ADD55_2]], [[ADD55_3]]
-; CHECK-NEXT:    [[TMP107:%.*]] = extractelement <2 x i32> [[TMP16]], i32 0
-; CHECK-NEXT:    [[SHR_I49_1:%.*]] = lshr i32 [[TMP107]], 15
+; CHECK-NEXT:    [[TMP83:%.*]] = extractelement <2 x i32> [[TMP32]], i32 0
+; CHECK-NEXT:    [[SHR_I_2:%.*]] = lshr i32 [[TMP83]], 15
+; CHECK-NEXT:    [[AND_I_2:%.*]] = and i32 [[SHR_I_2]], 65537
+; CHECK-NEXT:    [[MUL_I_2:%.*]] = mul i32 [[AND_I_2]], 65535
+; CHECK-NEXT:    [[TMP84:%.*]] = extractelement <2 x i32> [[TMP192]], i32 0
+; CHECK-NEXT:    [[SHR_I49_1:%.*]] = lshr i32 [[TMP84]], 15
 ; CHECK-NEXT:    [[AND_I50_1:%.*]] = and i32 [[SHR_I49_1]], 65537
-; CHECK-NEXT:    [[MUL_I51_1:%.*]] = mul i32 [[AND_I50_1]], 65535
-; CHECK-NEXT:    [[ADD94_4:%.*]] = add i32 [[SUB51_3]], [[SUB51_2]]
-; CHECK-NEXT:    [[SUB102_2:%.*]] = sub i32 [[SUB51_2]], [[SUB51_3]]
-; CHECK-NEXT:    [[SHR_I49_6:%.*]] = lshr i32 [[CONV_1]], 15
-; CHECK-NEXT:    [[AND_I50_6:%.*]] = and i32 [[SHR_I49_6]], 65537
-; CHECK-NEXT:    [[MUL_I51_6:%.*]] = mul i32 [[AND_I50_6]], 65535
-; CHECK-NEXT:    [[ADD94_5:%.*]] = add i32 [[SUB59_3]], [[SUB59_2]]
-; CHECK-NEXT:    [[SUB102_3:%.*]] = sub i32 [[SUB59_2]], [[SUB59_3]]
+; CHECK-NEXT:    [[ADD94_2:%.*]] = mul i32 [[AND_I50_1]], 65535
+; CHECK-NEXT:    [[ADD94_6:%.*]] = add i32 [[SUB51_3]], [[SUB51_2]]
+; CHECK-NEXT:    [[SHR_I49_2:%.*]] = lshr i32 [[CONV_1]], 15
+; CHECK-NEXT:    [[AND_I50_2:%.*]] = and i32 [[SHR_I49_2]], 65537
+; CHECK-NEXT:    [[MUL_I51_2:%.*]] = mul i32 [[AND_I50_2]], 65535
+; CHECK-NEXT:    [[ADD94_5:%.*]] = add i32 [[TMP107]], [[TMP68]]
+; CHECK-NEXT:    [[SUB102_3:%.*]] = sub i32 [[TMP68]], [[TMP107]]
 ; CHECK-NEXT:    [[SHR_I49_4:%.*]] = lshr i32 [[CONV1]], 15
 ; CHECK-NEXT:    [[AND_I50_4:%.*]] = and i32 [[SHR_I49_4]], 65537
 ; CHECK-NEXT:    [[MUL_I51_4:%.*]] = mul i32 [[AND_I50_4]], 65535
 ; CHECK-NEXT:    [[TMP66:%.*]] = load <2 x i8>, ptr [[ARRAYIDX8]], align 1
 ; CHECK-NEXT:    [[TMP102:%.*]] = zext <2 x i8> [[TMP66]] to <2 x i32>
 ; CHECK-NEXT:    [[TMP67:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[PIX2]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
-; CHECK-NEXT:    [[TMP72:%.*]] = zext <2 x i8> [[TMP67]] to <2 x i32>
-; CHECK-NEXT:    [[TMP73:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[TMP1]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
-; CHECK-NEXT:    [[TMP74:%.*]] = zext <2 x i8> [[TMP73]] to <2 x i32>
-; CHECK-NEXT:    [[TMP75:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX5]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
-; CHECK-NEXT:    [[TMP76:%.*]] = zext <2 x i8> [[TMP75]] to <2 x i32>
-; CHECK-NEXT:    [[TMP87:%.*]] = sub <2 x i32> [[TMP74]], [[TMP76]]
+; CHECK-NEXT:    [[TMP108:%.*]] = zext <2 x i8> [[TMP67]] to <2 x i32>
+; CHECK-NEXT:    [[TMP89:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[TMP1]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
+; CHECK-NEXT:    [[TMP90:%.*]] = zext <2 x i8> [[TMP89]] to <2 x i32>
+; CHECK-NEXT:    [[TMP91:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX5]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
+; CHECK-NEXT:    [[TMP109:%.*]] = zext <2 x i8> [[TMP91]] to <2 x i32>
+; CHECK-NEXT:    [[TMP87:%.*]] = sub <2 x i32> [[TMP90]], [[TMP109]]
 ; CHECK-NEXT:    [[TMP88:%.*]] = shl <2 x i32> [[TMP87]], <i32 16, i32 16>
-; CHECK-NEXT:    [[TMP85:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX22]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
-; CHECK-NEXT:    [[TMP80:%.*]] = zext <2 x i8> [[TMP85]] to <2 x i32>
-; CHECK-NEXT:    [[TMP81:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX25]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
-; CHECK-NEXT:    [[TMP82:%.*]] = zext <2 x i8> [[TMP81]] to <2 x i32>
-; CHECK-NEXT:    [[TMP83:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX27]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
-; CHECK-NEXT:    [[TMP84:%.*]] = zext <2 x i8> [[TMP83]] to <2 x i32>
-; CHECK-NEXT:    [[TMP95:%.*]] = sub <2 x i32> [[TMP82]], [[TMP84]]
+; CHECK-NEXT:    [[TMP111:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX22]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
+; CHECK-NEXT:    [[TMP112:%.*]] = zext <2 x i8> [[TMP111]] to <2 x i32>
+; CHECK-NEXT:    [[TMP120:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX25]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
+; CHECK-NEXT:    [[TMP121:%.*]] = zext <2 x i8> [[TMP120]] to <2 x i32>
+; CHECK-NEXT:    [[TMP131:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX27]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
+; CHECK-NEXT:    [[TMP100:%.*]] = zext <2 x i8> [[TMP131]] to <2 x i32>
+; CHECK-NEXT:    [[TMP95:%.*]] = sub <2 x i32> [[TMP121]], [[TMP100]]
 ; CHECK-NEXT:    [[TMP96:%.*]] = shl <2 x i32> [[TMP95]], <i32 16, i32 16>
 ; CHECK-NEXT:    [[TMP97:%.*]] = insertelement <2 x i32> [[TMP102]], i32 [[CONV33]], i32 1
-; CHECK-NEXT:    [[TMP89:%.*]] = sub <2 x i32> [[TMP97]], [[TMP80]]
-; CHECK-NEXT:    [[TMP105:%.*]] = add <2 x i32> [[TMP96]], [[TMP89]]
+; CHECK-NEXT:    [[TMP132:%.*]] = sub <2 x i32> [[TMP97]], [[TMP112]]
+; CHECK-NEXT:    [[TMP105:%.*]] = add <2 x i32> [[TMP96]], [[TMP132]]
 ; CHECK-NEXT:    [[TMP86:%.*]] = insertelement <2 x i32> [[TMP102]], i32 [[CONV1]], i32 0
-; CHECK-NEXT:    [[TMP100:%.*]] = sub <2 x i32> [[TMP86]], [[TMP72]]
-; CHECK-NEXT:    [[TMP92:%.*]] = add <2 x i32> [[TMP88]], [[TMP100]]
+; CHECK-NEXT:    [[TMP133:%.*]] = sub <2 x i32> [[TMP86]], [[TMP108]]
+; CHECK-NEXT:    [[TMP92:%.*]] = add <2 x i32> [[TMP88]], [[TMP133]]
 ; CHECK-NEXT:    [[TMP93:%.*]] = shufflevector <2 x i32> [[TMP105]], <2 x i32> [[TMP92]], <2 x i32> <i32 0, i32 2>
-; CHECK-NEXT:    [[TMP91:%.*]] = add <2 x i32> [[TMP105]], [[TMP92]]
 ; CHECK-NEXT:    [[TMP101:%.*]] = sub <2 x i32> [[TMP92]], [[TMP105]]
-; CHECK-NEXT:    [[TMP94:%.*]] = extractelement <2 x i32> [[TMP91]], i32 0
-; CHECK-NEXT:    [[SUB47:%.*]] = extractelement <2 x i32> [[TMP91]], i32 1
-; CHECK-NEXT:    [[ADD78:%.*]] = add i32 [[SUB47]], [[TMP94]]
-; CHECK-NEXT:    [[SUB51:%.*]] = sub i32 [[TMP94]], [[SUB47]]
-; CHECK-NEXT:    [[TMP98:%.*]] = extractelement <2 x i32> [[TMP101]], i32 0
 ; CHECK-NEXT:    [[TMP99:%.*]] = extractelement <2 x i32> [[TMP101]], i32 1
-; CHECK-NEXT:    [[ADD55:%.*]] = add i32 [[TMP99]], [[TMP98]]
-; CHECK-NEXT:    [[SUB59:%.*]] = sub i32 [[TMP98]], [[TMP99]]
-; CHECK-NEXT:    [[SHR_I59:%.*]] = lshr i32 [[SUB47]], 15
-; CHECK-NEXT:    [[AND_I60:%.*]] = and i32 [[SHR_I59]], 65537
-; CHECK-NEXT:    [[MUL_I61:%.*]] = mul i32 [[AND_I60]], 65535
 ; CHECK-NEXT:    [[SHR_I59_1:%.*]] = lshr i32 [[TMP99]], 15
 ; CHECK-NEXT:    [[AND_I60_1:%.*]] = and i32 [[SHR_I59_1]], 65537
 ; CHECK-NEXT:    [[MUL_I61_1:%.*]] = mul i32 [[AND_I60_1]], 65535
 ; CHECK-NEXT:    [[TMP104:%.*]] = load <2 x i8>, ptr [[ARRAYIDX8_1]], align 1
-; CHECK-NEXT:    [[TMP110:%.*]] = zext <2 x i8> [[TMP104]] to <2 x i32>
-; CHECK-NEXT:    [[TMP108:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ADD_PTR644]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
-; CHECK-NEXT:    [[TMP103:%.*]] = zext <2 x i8> [[TMP108]] to <2 x i32>
-; CHECK-NEXT:    [[TMP109:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX3_1]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
-; CHECK-NEXT:    [[TMP116:%.*]] = zext <2 x i8> [[TMP109]] to <2 x i32>
-; CHECK-NEXT:    [[TMP106:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX5_1]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
-; CHECK-NEXT:    [[TMP118:%.*]] = zext <2 x i8> [[TMP106]] to <2 x i32>
-; CHECK-NEXT:    [[TMP124:%.*]] = sub <2 x i32> [[TMP116]], [[TMP118]]
-; CHECK-NEXT:    [[TMP125:%.*]] = shl <2 x i32> [[TMP124]], <i32 16, i32 16>
-; CHECK-NEXT:    [[TMP121:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX22_1]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
-; CHECK-NEXT:    [[TMP111:%.*]] = zext <2 x i8> [[TMP121]] to <2 x i32>
-; CHECK-NEXT:    [[TMP112:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX25_1]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
-; CHECK-NEXT:    [[TMP113:%.*]] = zext <2 x i8> [[TMP112]] to <2 x i32>
-; CHECK-NEXT:    [[TMP114:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX27_1]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
+; CHECK-NEXT:    [[TMP113:%.*]] = zext <2 x i8> [[TMP104]] to <2 x i32>
+; CHECK-NEXT:    [[TMP114:%.*]] = load <2 x i8>, ptr [[ADD_PTR644]], align 1
 ; CHECK-NEXT:    [[TMP115:%.*]] = zext <2 x i8> [[TMP114]] to <2 x i32>
-; CHECK-NEXT:    [[TMP135:%.*]] = sub <2 x i32> [[TMP113]], [[TMP115]]
+; CHECK-NEXT:    [[TMP116:%.*]] = load <2 x i8>, ptr [[ARRAYIDX3_1]], align 1
+; CHECK-NEXT:    [[TMP117:%.*]] = zext <2 x i8> [[TMP116]] to <2 x i32>
+; CHECK-NEXT:    [[TMP118:%.*]] = load <2 x i8>, ptr [[ARRAYIDX5_1]], align 1
+; CHECK-NEXT:    [[TMP119:%.*]] = zext <2 x i8> [[TMP118]] to <2 x i32>
+; CHECK-NEXT:    [[TMP124:%.*]] = sub <2 x i32> [[TMP117]], [[TMP119]]
+; CHECK-NEXT:    [[TMP125:%.*]] = shl <2 x i32> [[TMP124]], <i32 16, i32 16>
+; CHECK-NEXT:    [[TMP122:%.*]] = shufflevector <2 x i32> [[TMP113]], <2 x i32> poison, <2 x i32> <i32 poison, i32 0>
+; CHECK-NEXT:    [[TMP123:%.*]] = insertelement <2 x i32> [[TMP122]], i32 [[CONV_1]], i32 0
+; CHECK-NEXT:    [[TMP134:%.*]] = sub <2 x i32> [[TMP123]], [[TMP115]]
+; CHECK-NEXT:    [[TMP156:%.*]] = add <2 x i32> [[TMP125]], [[TMP134]]
+; CHECK-NEXT:    [[TMP126:%.*]] = shufflevector <2 x i8> [[TMP7]], <2 x i8> poison, <2 x i32> <i32 1, i32 poison>
+; CHECK-NEXT:    [[TMP127:%.*]] = insertelement <2 x i8> [[TMP126]], i8 [[TMP14]], i32 1
+; CHECK-NEXT:    [[TMP128:%.*]] = zext <2 x i8> [[TMP127]] to <2 x i32>
+; CHECK-NEXT:    [[TMP129:%.*]] = insertelement <2 x i32> [[TMP113]], i32 [[TMP9]], i32 0
+; CHECK-NEXT:    [[TMP130:%.*]] = sub <2 x i32> [[TMP129]], [[TMP128]]
+; CHECK-NEXT:    [[TMP135:%.*]] = insertelement <2 x i32> [[TMP130]], i32 [[TMP16]], i32 1
 ; CHECK-NEXT:    [[TMP136:%.*]] = shl <2 x i32> [[TMP135]], <i32 16, i32 16>
+; CHECK-NEXT:    [[TMP110:%.*]] = shufflevector <2 x i32> [[TMP130]], <2 x i32> poison, <2 x i32> <i32 1, i32 poison>
 ; CHECK-NEXT:    [[TMP137:%.*]] = insertelement <2 x i32> [[TMP110]], i32 [[CONV33_1]], i32 1
-; CHECK-NEXT:    [[TMP119:%.*]] = sub <2 x i32> [[TMP137]], [[TMP111]]
-; CHECK-NEXT:    [[TMP120:%.*]] = add <2 x i32> [[TMP136]], [[TMP119]]
-; CHECK-NEXT:    [[TMP117:%.*]] = insertelement <2 x i32> [[TMP110]], i32 [[CONV_1]], i32 0
-; CHECK-NEXT:    [[TMP122:%.*]] = sub <2 x i32> [[TMP117]], [[TMP103]]
-; CHECK-NEXT:    [[TMP123:%.*]] = add <2 x i32> [[TMP125]], [[TMP122]]
-; CHECK-NEXT:    [[TMP143:%.*]] = add <2 x i32> [[TMP120]], [[TMP123]]
-; CHECK-NEXT:    [[TMP156:%.*]] = sub <2 x i32> [[TMP123]], [[TMP120]]
-; CHECK-NEXT:    [[TMP145:%.*]] = extractelement <2 x i32> [[TMP143]], i32 0
-; CHECK-NEXT:    [[TMP146:%.*]] = extractelement <2 x i32> [[TMP143]], i32 1
-; CHECK-NEXT:    [[ADD94:%.*]] = add i32 [[TMP146]], [[TMP145]]
-; CHECK-NEXT:    [[SUB51_1:%.*]] = sub i32 [[TMP145]], [[TMP146]]
-; CHECK-NEXT:    [[TMP180:%.*]] = extractelement <2 x i32> [[TMP156]], i32 0
+; CHECK-NEXT:    [[TMP155:%.*]] = add <2 x i32> [[TMP136]], [[TMP137]]
+; CHECK-NEXT:    [[TMP139:%.*]] = extractelement <2 x i32> [[TMP156]], i32 0
 ; CHECK-NEXT:    [[TMP142:%.*]] = extractelement <2 x i32> [[TMP156]], i32 1
-; CHECK-NEXT:    [[ADD55_1:%.*]] = add i32 [[TMP142]], [[TMP180]]
-; CHECK-NEXT:    [[SUB59_1:%.*]] = sub i32 [[TMP180]], [[TMP142]]
+; CHECK-NEXT:    [[SUB45_1:%.*]] = sub i32 [[TMP139]], [[TMP142]]
+; CHECK-NEXT:    [[TMP138:%.*]] = extractelement <2 x i32> [[TMP155]], i32 0
+; CHECK-NEXT:    [[TMP171:%.*]] = extractelement <2 x i32> [[TMP155]], i32 1
+; CHECK-NEXT:    [[SUB47_1:%.*]] = sub i32 [[TMP138]], [[TMP171]]
+; CHECK-NEXT:    [[TMP140:%.*]] = shufflevector <2 x i32> [[TMP156]], <2 x i32> [[TMP105]], <2 x i32> <i32 1, i32 2>
+; CHECK-NEXT:    [[TMP153:%.*]] = shufflevector <2 x i32> [[TMP156]], <2 x i32> [[TMP92]], <2 x i32> <i32 0, i32 2>
+; CHECK-NEXT:    [[TMP157:%.*]] = add <2 x i32> [[TMP140]], [[TMP153]]
+; CHECK-NEXT:    [[TMP143:%.*]] = shufflevector <2 x i32> [[TMP105]], <2 x i32> [[TMP155]], <2 x i32> <i32 3, i32 1>
+; CHECK-NEXT:    [[TMP144:%.*]] = shufflevector <2 x i32> [[TMP92]], <2 x i32> [[TMP155]], <2 x i32> <i32 2, i32 1>
+; CHECK-NEXT:    [[TMP145:%.*]] = add <2 x i32> [[TMP143]], [[TMP144]]
+; CHECK-NEXT:    [[TMP98:%.*]] = extractelement <2 x i32> [[TMP157]], i32 1
+; CHECK-NEXT:    [[TMP146:%.*]] = extractelement <2 x i32> [[TMP145]], i32 1
 ; CHECK-NEXT:    [[SHR_I54_1:%.*]] = lshr i32 [[TMP146]], 15
 ; CHECK-NEXT:    [[AND_I55_1:%.*]] = and i32 [[SHR_I54_1]], 65537
 ; CHECK-NEXT:    [[MUL_I56_1:%.*]] = mul i32 [[AND_I55_1]], 65535
-; CHECK-NEXT:    [[TMP147:%.*]] = lshr <2 x i32> [[TMP110]], <i32 15, i32 15>
+; CHECK-NEXT:    [[TMP164:%.*]] = add <2 x i32> [[TMP145]], [[TMP157]]
+; CHECK-NEXT:    [[TMP141:%.*]] = sub <2 x i32> [[TMP157]], [[TMP145]]
+; CHECK-NEXT:    [[TMP165:%.*]] = insertelement <2 x i32> [[TMP101]], i32 [[SUB47_1]], i32 0
+; CHECK-NEXT:    [[TMP151:%.*]] = shufflevector <2 x i32> [[TMP101]], <2 x i32> poison, <2 x i32> <i32 poison, i32 0>
+; CHECK-NEXT:    [[TMP152:%.*]] = insertelement <2 x i32> [[TMP151]], i32 [[SUB45_1]], i32 0
+; CHECK-NEXT:    [[TMP180:%.*]] = add <2 x i32> [[TMP165]], [[TMP152]]
+; CHECK-NEXT:    [[TMP154:%.*]] = sub <2 x i32> [[TMP152]], [[TMP165]]
+; CHECK-NEXT:    [[TMP166:%.*]] = extractelement <2 x i32> [[TMP145]], i32 0
+; CHECK-NEXT:    [[SHR_I54:%.*]] = lshr i32 [[TMP166]], 15
+; CHECK-NEXT:    [[AND_I55:%.*]] = and i32 [[SHR_I54]], 65537
+; CHECK-NEXT:    [[MUL_I56:%.*]] = mul i32 [[AND_I55]], 65535
+; CHECK-NEXT:    [[SHR_I54_2:%.*]] = lshr i32 [[SUB47_1]], 15
+; CHECK-NEXT:    [[AND_I55_2:%.*]] = and i32 [[SHR_I54_2]], 65537
+; CHECK-NEXT:    [[MUL_I56_2:%.*]] = mul i32 [[AND_I55_2]], 65535
+; CHECK-NEXT:    [[TMP147:%.*]] = lshr <2 x i32> [[TMP113]], <i32 15, i32 15>
 ; CHECK-NEXT:    [[TMP148:%.*]] = and <2 x i32> [[TMP147]], <i32 65537, i32 65537>
 ; CHECK-NEXT:    [[TMP149:%.*]] = mul <2 x i32> [[TMP148]], <i32 65535, i32 65535>
-; CHECK-NEXT:    [[ADD79:%.*]] = add i32 [[ADD94]], [[ADD78]]
-; CHECK-NEXT:    [[SUB104:%.*]] = sub i32 [[ADD78]], [[ADD94]]
+; CHECK-NEXT:    [[TMP167:%.*]] = shufflevector <2 x i32> [[TMP164]], <2 x i32> poison, <2 x i32> <i32 poison, i32 0>
+; CHECK-NEXT:    [[TMP160:%.*]] = insertelement <2 x i32> [[TMP167]], i32 [[ADD48_3]], i32 0
+; CHECK-NEXT:    [[TMP161:%.*]] = insertelement <2 x i32> [[TMP164]], i32 [[ADD48_2]], i32 0
+; CHECK-NEXT:    [[TMP162:%.*]] = add <2 x i32> [[TMP160]], [[TMP161]]
+; CHECK-NEXT:    [[TMP163:%.*]] = sub <2 x i32> [[TMP161]], [[TMP160]]
+; CHECK-NEXT:    [[ADD95:%.*]] = extractelement <2 x i32> [[TMP162]], i32 0
+; CHECK-NEXT:    [[ADD79:%.*]] = extractelement <2 x i32> [[TMP162]], i32 1
 ; CHECK-NEXT:    [[ADD103:%.*]] = add i32 [[ADD95]], [[ADD79]]
 ; CHECK-NEXT:    [[SUB105:%.*]] = sub i32 [[ADD79]], [[ADD95]]
-; CHECK-NEXT:    [[ADD105:%.*]] = add i32 [[SUB102]], [[SUB104]]
-; CHECK-NEXT:    [[SUB106:%.*]] = sub i32 [[SUB104]], [[SUB102]]
+; CHECK-NEXT:    [[ADD94:%.*]] = extractelement <2 x i32> [[TMP163]], i32 0
+; CHECK-NEXT:    [[ADD78:%.*]] = extractelement <2 x i32> [[TMP163]], i32 1
+; CHECK-NEXT:    [[ADD105:%.*]] = add i32 [[ADD94]], [[ADD78]]
+; CHECK-NEXT:    [[SUB104:%.*]] = sub i32 [[ADD78]], [[ADD94]]
 ; CHECK-NEXT:    [[ADD_I:%.*]] = add i32 [[MUL_I51_3]], [[ADD103]]
 ; CHECK-NEXT:    [[XOR_I:%.*]] = xor i32 [[ADD_I]], [[TMP77]]
 ; CHECK-NEXT:    [[ADD_I52:%.*]] = add i32 [[MUL_I_1]], [[ADD105]]
-; CHECK-NEXT:    [[XOR_I53:%.*]] = xor i32 [[ADD_I52]], [[SUB47_2]]
-; CHECK-NEXT:    [[ADD_I57:%.*]] = add i32 [[MUL_I56_1]], [[SUB105]]
-; CHECK-NEXT:    [[XOR_I58:%.*]] = xor i32 [[ADD_I57]], [[TMP146]]
-; CHECK-NEXT:    [[ADD_I62:%.*]] = add i32 [[MUL_I61]], [[SUB106]]
-; CHECK-NEXT:    [[XOR_I63:%.*]] = xor i32 [[ADD_I62]], [[SUB47]]
+; CHECK-NEXT:    [[XOR_I53:%.*]] = xor i32 [[ADD_I52]], [[TMP34]]
+; CHECK-NEXT:    [[ADD_I57:%.*]] = add i32 [[MUL_I56]], [[SUB105]]
+; CHECK-NEXT:    [[XOR_I58:%.*]] = xor i32 [[ADD_I57]], [[TMP166]]
+; CHECK-NEXT:    [[ADD_I62:%.*]] = add i32 [[MUL_I56_1]], [[SUB104]]
+; CHECK-NEXT:    [[XOR_I63:%.*]] = xor i32 [[ADD_I62]], [[TMP146]]
 ; CHECK-NEXT:    [[ADD110:%.*]] = add i32 [[XOR_I53]], [[XOR_I]]
 ; CHECK-NEXT:    [[ADD112:%.*]] = add i32 [[ADD110]], [[XOR_I58]]
 ; CHECK-NEXT:    [[ADD113:%.*]] = add i32 [[ADD112]], [[XOR_I63]]
-; CHECK-NEXT:    [[ADD78_1:%.*]] = add i32 [[ADD55_1]], [[ADD55]]
-; CHECK-NEXT:    [[SUB86_1:%.*]] = sub i32 [[ADD55]], [[ADD55_1]]
-; CHECK-NEXT:    [[ADD105_1:%.*]] = add i32 [[SUB102_1]], [[SUB86_1]]
-; CHECK-NEXT:    [[SUB106_1:%.*]] = sub i32 [[SUB86_1]], [[SUB102_1]]
-; CHECK-NEXT:    [[ADD_I52_1:%.*]] = add i32 [[MUL_I51_1]], [[ADD105_1]]
-; CHECK-NEXT:    [[XOR_I53_1:%.*]] = xor i32 [[ADD_I52_1]], [[TMP107]]
-; CHECK-NEXT:    [[TMP129:%.*]] = shufflevector <2 x i32> [[TMP17]], <2 x i32> [[TMP156]], <2 x i32> <i32 0, i32 3>
-; CHECK-NEXT:    [[TMP130:%.*]] = lshr <2 x i32> [[TMP129]], <i32 15, i32 15>
-; CHECK-NEXT:    [[TMP131:%.*]] = and <2 x i32> [[TMP130]], <i32 65537, i32 65537>
-; CHECK-NEXT:    [[TMP132:%.*]] = mul <2 x i32> [[TMP131]], <i32 65535, i32 65535>
-; CHECK-NEXT:    [[TMP133:%.*]] = insertelement <2 x i32> poison, i32 [[ADD78_1]], i32 0
-; CHECK-NEXT:    [[TMP144:%.*]] = shufflevector <2 x i32> [[TMP133]], <2 x i32> poison, <2 x i32> zeroinitializer
-; CHECK-NEXT:    [[TMP151:%.*]] = insertelement <2 x i32> poison, i32 [[ADD94_1]], i32 0
-; CHECK-NEXT:    [[TMP152:%.*]] = shufflevector <2 x i32> [[TMP151]], <2 x i32> poison, <2 x i32> zeroinitializer
-; CHECK-NEXT:    [[TMP153:%.*]] = add <2 x i32> [[TMP144]], [[TMP152]]
-; CHECK-NEXT:    [[TMP138:%.*]] = sub <2 x i32> [[TMP144]], [[TMP152]]
-; CHECK-NEXT:    [[TMP139:%.*]] = shufflevector <2 x i32> [[TMP153]], <2 x i32> [[TMP138]], <2 x i32> <i32 0, i32 3>
-; CHECK-NEXT:    [[TMP140:%.*]] = add <2 x i32> [[TMP132]], [[TMP139]]
-; CHECK-NEXT:    [[TMP141:%.*]] = xor <2 x i32> [[TMP140]], [[TMP129]]
+; CHECK-NEXT:    [[TMP168:%.*]] = extractelement <2 x i32> [[TMP180]], i32 0
+; CHECK-NEXT:    [[TMP169:%.*]] = extractelement <2 x i32> [[TMP180]], i32 1
+; CHECK-NEXT:    [[SUB102:%.*]] = add i32 [[TMP168]], [[TMP169]]
+; CHECK-NEXT:    [[ADD55_2:%.*]] = add i32 [[TMP48]], [[TMP47]]
+; CHECK-NEXT:    [[TMP170:%.*]] = insertelement <2 x i32> [[TMP180]], i32 [[ADD55_2]], i32 0
+; CHECK-NEXT:    [[TMP173:%.*]] = shufflevector <2 x i32> [[TMP180]], <2 x i32> poison, <2 x i32> <i32 poison, i32 0>
+; CHECK-NEXT:    [[TMP172:%.*]] = insertelement <2 x i32> [[TMP173]], i32 [[ADD55_3]], i32 0
+; CHECK-NEXT:    [[TMP175:%.*]] = sub <2 x i32> [[TMP170]], [[TMP172]]
+; CHECK-NEXT:    [[TMP174:%.*]] = extractelement <2 x i32> [[TMP175]], i32 0
+; CHECK-NEXT:    [[TMP159:%.*]] = extractelement <2 x i32> [[TMP175]], i32 1
+; CHECK-NEXT:    [[ADD94_4:%.*]] = add i32 [[TMP174]], [[TMP159]]
+; CHECK-NEXT:    [[ADD94_1:%.*]] = add i32 [[ADD55_3]], [[ADD55_2]]
+; CHECK-NEXT:    [[TMP244:%.*]] = insertelement <2 x i32> poison, i32 [[ADD94_2]], i32 0
+; CHECK-NEXT:    [[TMP177:%.*]] = insertelement <2 x i32> [[TMP244]], i32 [[ADD55_3]], i32 1
+; CHECK-NEXT:    [[TMP197:%.*]] = insertelement <2 x i32> poison, i32 [[ADD94_4]], i32 0
+; CHECK-NEXT:    [[TMP179:%.*]] = insertelement <2 x i32> [[TMP197]], i32 [[ADD55_2]], i32 1
+; CHECK-NEXT:    [[TMP181:%.*]] = add <2 x i32> [[TMP177]], [[TMP179]]
+; CHECK-NEXT:    [[SUB104_1:%.*]] = sub i32 [[SUB102]], [[ADD94_1]]
+; CHECK-NEXT:    [[SUB106_1:%.*]] = sub i32 [[TMP159]], [[TMP174]]
+; CHECK-NEXT:    [[TMP193:%.*]] = insertelement <2 x i32> [[TMP192]], i32 [[SUB102]], i32 1
+; CHECK-NEXT:    [[TMP182:%.*]] = xor <2 x i32> [[TMP181]], [[TMP193]]
+; CHECK-NEXT:    [[TMP183:%.*]] = add <2 x i32> [[TMP181]], [[TMP193]]
+; CHECK-NEXT:    [[TMP184:%.*]] = shufflevector <2 x i32> [[TMP182]], <2 x i32> [[TMP183]], <2 x i32> <i32 0, i32 3>
+; CHECK-NEXT:    [[TMP185:%.*]] = insertelement <2 x i32> poison, i32 [[ADD113]], i32 0
+; CHECK-NEXT:    [[TMP186:%.*]] = insertelement <2 x i32> [[TMP185]], i32 [[MUL_I_2]], i32 1
+; CHECK-NEXT:    [[TMP202:%.*]] = add <2 x i32> [[TMP184]], [[TMP186]]
+; CHECK-NEXT:    [[TMP203:%.*]] = extractelement <2 x i32> [[TMP202]], i32 1
+; CHECK-NEXT:    [[TMP189:%.*]] = shufflevector <2 x i32> [[TMP202]], <2 x i32> [[TMP32]], <2 x i32> <i32 1, i32 2>
+; CHECK-NEXT:    [[XOR_I_1:%.*]] = xor i32 [[TMP203]], [[TMP83]]
+; CHECK-NEXT:    [[ADD_I57_1:%.*]] = add i32 [[MUL_I56_2]], [[SUB104_1]]
+; CHECK-NEXT:    [[XOR_I58_1:%.*]] = xor i32 [[ADD_I57_1]], [[SUB47_1]]
 ; CHECK-NEXT:    [[ADD_I62_1:%.*]] = add i32 [[MUL_I61_1]], [[SUB106_1]]
 ; CHECK-NEXT:    [[XOR_I63_1:%.*]] = xor i32 [[ADD_I62_1]], [[TMP99]]
-; CHECK-NEXT:    [[ADD108_1:%.*]] = add i32 [[XOR_I53_1]], [[ADD113]]
-; CHECK-NEXT:    [[TMP154:%.*]] = extractelement <2 x i32> [[TMP141]], i32 0
-; CHECK-NEXT:    [[ADD110_1:%.*]] = add i32 [[ADD108_1]], [[TMP154]]
-; CHECK-NEXT:    [[TMP155:%.*]] = extractelement <2 x i32> [[TMP141]], i32 1
-; CHECK-NEXT:    [[ADD112_1:%.*]] = add i32 [[ADD110_1]], [[TMP155]]
+; CHECK-NEXT:    [[TMP190:%.*]] = extractelement <2 x i32> [[TMP202]], i32 0
+; CHECK-NEXT:    [[ADD110_1:%.*]] = add i32 [[TMP190]], [[XOR_I_1]]
+; CHECK-NEXT:    [[ADD112_1:%.*]] = add i32 [[ADD110_1]], [[XOR_I58_1]]
 ; CHECK-NEXT:    [[ADD113_1:%.*]] = add i32 [[ADD112_1]], [[XOR_I63_1]]
-; CHECK-NEXT:    [[ADD94_2:%.*]] = add i32 [[SUB51_1]], [[SUB51]]
-; CHECK-NEXT:    [[SUB86_2:%.*]] = sub i32 [[SUB51]], [[SUB51_1]]
-; CHECK-NEXT:    [[TMP244:%.*]] = insertelement <2 x i32> poison, i32 [[ADD94_2]], i32 0
-; CHECK-NEXT:    [[TMP245:%.*]] = shufflevector <2 x i32> [[TMP244]], <2 x i32> poison, <2 x i32> zeroinitializer
-; CHECK-NEXT:    [[TMP197:%.*]] = insertelement <2 x i32> poison, i32 [[ADD94_4]], i32 0
-; CHECK-NEXT:    [[TMP198:%.*]] = shufflevector <2 x i32> [[TMP197]], <2 x i32> poison, <2 x i32> zeroinitializer
-; CHECK-NEXT:    [[TMP207:%.*]] = add <2 x i32> [[TMP245]], [[TMP198]]
-; CHECK-NEXT:    [[TMP208:%.*]] = sub <2 x i32> [[TMP245]], [[TMP198]]
-; CHECK-NEXT:    [[TMP209:%.*]] = shufflevector <2 x i32> [[TMP207]], <2 x i32> [[TMP208]], <2 x i32> <i32 0, i32 3>
-; CHECK-NEXT:    [[ADD105_2:%.*]] = add i32 [[SUB102_2]], [[SUB86_2]]
-; CHECK-NEXT:    [[SUB106_2:%.*]] = sub i32 [[SUB86_2]], [[SUB102_2]]
-; CHECK-NEXT:    [[ADD_I52_2:%.*]] = add i32 [[MUL_I51_6]], [[ADD105_2]]
+; CHECK-NEXT:    [[TMP191:%.*]] = extractelement <2 x i32> [[TMP141]], i32 0
+; CHECK-NEXT:    [[TMP205:%.*]] = extractelement <2 x i32> [[TMP141]], i32 1
+; CHECK-NEXT:    [[ADD78_2:%.*]] = add i32 [[TMP191]], [[TMP205]]
+; CHECK-NEXT:    [[TMP196:%.*]] = insertelement <2 x i32> [[TMP141]], i32 [[SUB51_2]], i32 0
+; CHECK-NEXT:    [[TMP194:%.*]] = shufflevector <2 x i32> [[TMP141]], <2 x i32> poison, <2 x i32> <i32 poison, i32 0>
+; CHECK-NEXT:    [[TMP195:%.*]] = insertelement <2 x i32> [[TMP194]], i32 [[SUB51_3]], i32 0
+; CHECK-NEXT:    [[TMP206:%.*]] = sub <2 x i32> [[TMP196]], [[TMP195]]
+; CHECK-NEXT:    [[TMP201:%.*]] = insertelement <2 x i32> poison, i32 [[ADD78_2]], i32 0
+; CHECK-NEXT:    [[TMP198:%.*]] = shufflevector <2 x i32> [[TMP201]], <2 x i32> poison, <2 x i32> zeroinitializer
+; CHECK-NEXT:    [[TMP199:%.*]] = insertelement <2 x i32> poison, i32 [[ADD94_6]], i32 0
+; CHECK-NEXT:    [[TMP200:%.*]] = shufflevector <2 x i32> [[TMP199]], <2 x i32> poison, <2 x i32> zeroinitializer
+; CHECK-NEXT:    [[TMP225:%.*]] = add <2 x i32> [[TMP198]], [[TMP200]]
+; CHECK-NEXT:    [[TMP226:%.*]] = sub <2 x i32> [[TMP198]], [[TMP200]]
+; CHECK-NEXT:    [[TMP227:%.*]] = shufflevector <2 x i32> [[TMP225]], <2 x i32> [[TMP226]], <2 x i32> <i32 0, i32 3>
+; CHECK-NEXT:    [[TMP204:%.*]] = extractelement <2 x i32> [[TMP206]], i32 0
+; CHECK-NEXT:    [[TMP212:%.*]] = extractelement <2 x i32> [[TMP206]], i32 1
+; CHECK-NEXT:    [[ADD105_2:%.*]] = add i32 [[TMP204]], [[TMP212]]
+; CHECK-NEXT:    [[SUB106_2:%.*]] = sub i32 [[TMP212]], [[TMP204]]
+; CHECK-NEXT:    [[ADD_I52_2:%.*]] = add i32 [[MUL_I51_2]], [[ADD105_2]]
 ; CHECK-NEXT:    [[XOR_I53_2:%.*]] = xor i32 [[ADD_I52_2]], [[CONV_1]]
-; CHECK-NEXT:    [[TMP134:%.*]] = add <2 x i32> [[TMP149]], [[TMP209]]
-; CHECK-NEXT:    [[TMP213:%.*]] = xor <2 x i32> [[TMP134]], [[TMP110]]
-; CHECK-NEXT:    [[SHR_I59_2:%.*]] = lshr i32 [[TMP94]], 15
+; CHECK-NEXT:    [[TMP207:%.*]] = add <2 x i32> [[TMP149]], [[TMP227]]
+; CHECK-NEXT:    [[TMP213:%.*]] = xor <2 x i32> [[TMP207]], [[TMP113]]
+; CHECK-NEXT:    [[SHR_I59_2:%.*]] = lshr i32 [[TMP98]], 15
 ; CHECK-NEXT:    [[AND_I60_2:%.*]] = and i32 [[SHR_I59_2]], 65537
 ; CHECK-NEXT:    [[MUL_I61_2:%.*]] = mul i32 [[AND_I60_2]], 65535
 ; CHECK-NEXT:    [[ADD_I62_2:%.*]] = add i32 [[MUL_I61_2]], [[SUB106_2]]
-; CHECK-NEXT:    [[XOR_I63_2:%.*]] = xor i32 [[ADD_I62_2]], [[TMP94]]
+; CHECK-NEXT:    [[XOR_I63_2:%.*]] = xor i32 [[ADD_I62_2]], [[TMP98]]
 ; CHECK-NEXT:    [[ADD108_2:%.*]] = add i32 [[XOR_I53_2]], [[ADD113_1]]
-; CHECK-NEXT:    [[TMP157:%.*]] = extractelement <2 x i32> [[TMP213]], i32 0
-; CHECK-NEXT:    [[ADD110_2:%.*]] = add i32 [[ADD108_2]], [[TMP157]]
-; CHECK-NEXT:    [[TMP158:%.*]] = extractelement <2 x i32> [[TMP213]], i32 1
-; CHECK-NEXT:    [[ADD112_2:%.*]] = add i32 [[ADD110_2]], [[TMP158]]
+; CHECK-NEXT:    [[TMP208:%.*]] = extractelement <2 x i32> [[TMP213]], i32 0
+; CHECK-NEXT:    [[ADD110_2:%.*]] = add i32 [[ADD108_2]], [[TMP208]]
+; CHECK-NEXT:    [[TMP209:%.*]] = extractelement <2 x i32> [[TMP213]], i32 1
+; CHECK-NEXT:    [[ADD112_2:%.*]] = add i32 [[ADD110_2]], [[TMP209]]
 ; CHECK-NEXT:    [[ADD113_2:%.*]] = add i32 [[ADD112_2]], [[XOR_I63_2]]
+; CHECK-NEXT:    [[SUB59_1:%.*]] = extractelement <2 x i32> [[TMP154]], i32 0
+; CHECK-NEXT:    [[SUB59:%.*]] = extractelement <2 x i32> [[TMP154]], i32 1
 ; CHECK-NEXT:    [[ADD94_3:%.*]] = add i32 [[SUB59_1]], [[SUB59]]
 ; CHECK-NEXT:    [[SUB86_3:%.*]] = sub i32 [[SUB59]], [[SUB59_1]]
 ; CHECK-NEXT:    [[TMP223:%.*]] = insertelement <2 x i32> poison, i32 [[ADD94_3]], i32 0
@@ -314,10 +360,10 @@ define i32 @test(ptr %pix1, ptr %pix2, i64 %idx.ext, i64 %idx.ext63, ptr %add.pt
 ; CHECK-NEXT:    [[ADD_I62_3:%.*]] = add i32 [[MUL_I61_3]], [[SUB106_3]]
 ; CHECK-NEXT:    [[XOR_I63_3:%.*]] = xor i32 [[ADD_I62_3]], [[CONV33]]
 ; CHECK-NEXT:    [[ADD108_3:%.*]] = add i32 [[XOR_I53_3]], [[ADD113_2]]
-; CHECK-NEXT:    [[TMP235:%.*]] = extractelement <2 x i32> [[TMP234]], i32 0
-; CHECK-NEXT:    [[ADD110_3:%.*]] = add i32 [[ADD108_3]], [[TMP235]]
-; CHECK-NEXT:    [[TMP236:%.*]] = extractelement <2 x i32> [[TMP234]], i32 1
-; CHECK-NEXT:    [[ADD112_3:%.*]] = add i32 [[ADD110_3]], [[TMP236]]
+; CHECK-NEXT:    [[TMP228:%.*]] = extractelement <2 x i32> [[TMP234]], i32 0
+; CHECK-NEXT:    [[ADD110_3:%.*]] = add i32 [[ADD108_3]], [[TMP228]]
+; CHECK-NEXT:    [[TMP229:%.*]] = extractelement <2 x i32> [[TMP234]], i32 1
+; CHECK-NEXT:    [[ADD112_3:%.*]] = add i32 [[ADD110_3]], [[TMP229]]
 ; CHECK-NEXT:    [[ADD113_3:%.*]] = add i32 [[ADD112_3]], [[XOR_I63_3]]
 ; CHECK-NEXT:    ret i32 [[ADD113_3]]
 ;
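
[Note: the complex-loads.ll churn in these hunks is largely SSA renumbering
from a different SLP tree shape rather than a semantic change: with the
cheaper build_vector cost and calls now modeled as clobbering vector
registers, SLP picks different 2-wide subtrees, so (judging from the CHECK
lines) some @llvm.experimental.vp.strided.load calls become plain <2 x i8>
loads and several scalar extract/add pairs move into vector add/sub
operations.]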
@@ -342,52 +388,69 @@ define i32 @test(ptr %pix1, ptr %pix2, i64 %idx.ext, i64 %idx.ext63, ptr %add.pt
 ; THR15-NEXT:    [[ARRAYIDX3_1:%.*]] = getelementptr i8, ptr [[ADD_PTR3]], i64 4
 ; THR15-NEXT:    [[ARRAYIDX5_1:%.*]] = getelementptr i8, ptr [[ADD_PTR644]], i64 4
 ; THR15-NEXT:    [[ARRAYIDX8_1:%.*]] = getelementptr i8, ptr [[ADD_PTR3]], i64 1
-; THR15-NEXT:    [[ARRAYIDX22_1:%.*]] = getelementptr i8, ptr [[ADD_PTR644]], i64 1
-; THR15-NEXT:    [[ARRAYIDX13_1:%.*]] = getelementptr i8, ptr [[ADD_PTR3]], i64 5
-; THR15-NEXT:    [[ARRAYIDX27_1:%.*]] = getelementptr i8, ptr [[ADD_PTR644]], i64 5
-; THR15-NEXT:    [[ARRAYIDX32_1:%.*]] = getelementptr i8, ptr [[ADD_PTR3]], i64 3
+; THR15-NEXT:    [[ARRAYIDX32_1:%.*]] = getelementptr i8, ptr [[ADD_PTR644]], i64 2
 ; THR15-NEXT:    [[TMP3:%.*]] = load i8, ptr [[ARRAYIDX32_1]], align 1
-; THR15-NEXT:    [[CONV33_1:%.*]] = zext i8 [[TMP3]] to i32
+; THR15-NEXT:    [[ARRAYIDX25_1:%.*]] = getelementptr i8, ptr [[ADD_PTR3]], i64 6
+; THR15-NEXT:    [[TMP4:%.*]] = load i8, ptr [[ARRAYIDX25_1]], align 1
+; THR15-NEXT:    [[ARRAYIDX27_1:%.*]] = getelementptr i8, ptr [[ADD_PTR644]], i64 6
+; THR15-NEXT:    [[TMP5:%.*]] = load i8, ptr [[ARRAYIDX27_1]], align 1
+; THR15-NEXT:    [[TMP6:%.*]] = insertelement <2 x i8> poison, i8 [[TMP4]], i32 0
+; THR15-NEXT:    [[TMP7:%.*]] = insertelement <2 x i8> [[TMP6]], i8 [[TMP5]], i32 1
+; THR15-NEXT:    [[TMP8:%.*]] = extractelement <2 x i8> [[TMP7]], i32 0
+; THR15-NEXT:    [[TMP9:%.*]] = zext i8 [[TMP8]] to i32
+; THR15-NEXT:    [[ARRAYIDX32_2:%.*]] = getelementptr i8, ptr [[ADD_PTR3]], i64 3
+; THR15-NEXT:    [[ARRAYIDX34_1:%.*]] = getelementptr i8, ptr [[ADD_PTR644]], i64 3
+; THR15-NEXT:    [[TMP10:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX32_2]], i64 4, <2 x i1> <i1 true, i1 true>, i32 2)
+; THR15-NEXT:    [[TMP11:%.*]] = zext <2 x i8> [[TMP10]] to <2 x i16>
+; THR15-NEXT:    [[TMP12:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX34_1]], i64 4, <2 x i1> <i1 true, i1 true>, i32 2)
+; THR15-NEXT:    [[TMP20:%.*]] = zext <2 x i8> [[TMP12]] to <2 x i16>
+; THR15-NEXT:    [[TMP40:%.*]] = sub <2 x i16> [[TMP11]], [[TMP20]]
+; THR15-NEXT:    [[TMP46:%.*]] = extractelement <2 x i16> [[TMP40]], i32 1
+; THR15-NEXT:    [[TMP16:%.*]] = sext i16 [[TMP46]] to i32
+; THR15-NEXT:    [[SHL42_1:%.*]] = shl i32 [[TMP16]], 16
+; THR15-NEXT:    [[TMP17:%.*]] = extractelement <2 x i16> [[TMP40]], i32 0
+; THR15-NEXT:    [[TMP18:%.*]] = sext i16 [[TMP17]] to i32
+; THR15-NEXT:    [[ADD43_1:%.*]] = add i32 [[SHL42_1]], [[TMP18]]
 ; THR15-NEXT:    [[ADD_PTR_1:%.*]] = getelementptr i8, ptr [[ADD_PTR]], i64 [[IDX_EXT]]
 ; THR15-NEXT:    [[ADD_PTR64_1:%.*]] = getelementptr i8, ptr [[ADD_PTR64]], i64 [[IDX_EXT63]]
 ; THR15-NEXT:    [[ARRAYIDX3_2:%.*]] = getelementptr i8, ptr [[ADD_PTR_1]], i64 4
 ; THR15-NEXT:    [[ARRAYIDX5_2:%.*]] = getelementptr i8, ptr [[ADD_PTR64_1]], i64 4
-; THR15-NEXT:    [[TMP4:%.*]] = load <2 x i8>, ptr [[ADD_PTR_1]], align 1
-; THR15-NEXT:    [[TMP66:%.*]] = zext <2 x i8> [[TMP4]] to <2 x i32>
-; THR15-NEXT:    [[TMP6:%.*]] = load <2 x i8>, ptr [[ADD_PTR64_1]], align 1
-; THR15-NEXT:    [[TMP7:%.*]] = zext <2 x i8> [[TMP6]] to <2 x i32>
-; THR15-NEXT:    [[TMP8:%.*]] = sub <2 x i32> [[TMP66]], [[TMP7]]
-; THR15-NEXT:    [[TMP9:%.*]] = load <2 x i8>, ptr [[ARRAYIDX3_2]], align 1
-; THR15-NEXT:    [[TMP10:%.*]] = zext <2 x i8> [[TMP9]] to <2 x i32>
-; THR15-NEXT:    [[TMP11:%.*]] = load <2 x i8>, ptr [[ARRAYIDX5_2]], align 1
-; THR15-NEXT:    [[TMP12:%.*]] = zext <2 x i8> [[TMP11]] to <2 x i32>
-; THR15-NEXT:    [[TMP13:%.*]] = sub <2 x i32> [[TMP10]], [[TMP12]]
+; THR15-NEXT:    [[TMP19:%.*]] = load <2 x i8>, ptr [[ADD_PTR_1]], align 1
+; THR15-NEXT:    [[TMP66:%.*]] = zext <2 x i8> [[TMP19]] to <2 x i32>
+; THR15-NEXT:    [[TMP21:%.*]] = load <2 x i8>, ptr [[ADD_PTR64_1]], align 1
+; THR15-NEXT:    [[TMP22:%.*]] = zext <2 x i8> [[TMP21]] to <2 x i32>
+; THR15-NEXT:    [[TMP23:%.*]] = sub <2 x i32> [[TMP66]], [[TMP22]]
+; THR15-NEXT:    [[TMP24:%.*]] = load <2 x i8>, ptr [[ARRAYIDX3_2]], align 1
+; THR15-NEXT:    [[TMP28:%.*]] = zext <2 x i8> [[TMP24]] to <2 x i32>
+; THR15-NEXT:    [[TMP30:%.*]] = load <2 x i8>, ptr [[ARRAYIDX5_2]], align 1
+; THR15-NEXT:    [[TMP47:%.*]] = zext <2 x i8> [[TMP30]] to <2 x i32>
+; THR15-NEXT:    [[TMP13:%.*]] = sub <2 x i32> [[TMP28]], [[TMP47]]
 ; THR15-NEXT:    [[TMP14:%.*]] = shl <2 x i32> [[TMP13]], <i32 16, i32 16>
-; THR15-NEXT:    [[TMP15:%.*]] = add <2 x i32> [[TMP14]], [[TMP8]]
+; THR15-NEXT:    [[TMP15:%.*]] = add <2 x i32> [[TMP14]], [[TMP23]]
 ; THR15-NEXT:    [[ARRAYIDX20_2:%.*]] = getelementptr i8, ptr [[ADD_PTR_1]], i64 2
 ; THR15-NEXT:    [[ARRAYIDX22_2:%.*]] = getelementptr i8, ptr [[ADD_PTR64_1]], i64 2
 ; THR15-NEXT:    [[ARRAYIDX25_2:%.*]] = getelementptr i8, ptr [[ADD_PTR_1]], i64 6
 ; THR15-NEXT:    [[ARRAYIDX27_2:%.*]] = getelementptr i8, ptr [[ADD_PTR64_1]], i64 6
-; THR15-NEXT:    [[TMP16:%.*]] = load <2 x i8>, ptr [[ARRAYIDX20_2]], align 1
-; THR15-NEXT:    [[TMP17:%.*]] = zext <2 x i8> [[TMP16]] to <2 x i32>
-; THR15-NEXT:    [[TMP18:%.*]] = load <2 x i8>, ptr [[ARRAYIDX22_2]], align 1
-; THR15-NEXT:    [[TMP19:%.*]] = zext <2 x i8> [[TMP18]] to <2 x i32>
-; THR15-NEXT:    [[TMP20:%.*]] = sub <2 x i32> [[TMP17]], [[TMP19]]
-; THR15-NEXT:    [[TMP21:%.*]] = load <2 x i8>, ptr [[ARRAYIDX25_2]], align 1
-; THR15-NEXT:    [[TMP22:%.*]] = zext <2 x i8> [[TMP21]] to <2 x i32>
-; THR15-NEXT:    [[TMP23:%.*]] = load <2 x i8>, ptr [[ARRAYIDX27_2]], align 1
-; THR15-NEXT:    [[TMP24:%.*]] = zext <2 x i8> [[TMP23]] to <2 x i32>
-; THR15-NEXT:    [[TMP25:%.*]] = sub <2 x i32> [[TMP22]], [[TMP24]]
+; THR15-NEXT:    [[TMP49:%.*]] = load <2 x i8>, ptr [[ARRAYIDX20_2]], align 1
+; THR15-NEXT:    [[TMP59:%.*]] = zext <2 x i8> [[TMP49]] to <2 x i32>
+; THR15-NEXT:    [[TMP33:%.*]] = load <2 x i8>, ptr [[ARRAYIDX22_2]], align 1
+; THR15-NEXT:    [[TMP78:%.*]] = zext <2 x i8> [[TMP33]] to <2 x i32>
+; THR15-NEXT:    [[TMP35:%.*]] = sub <2 x i32> [[TMP59]], [[TMP78]]
+; THR15-NEXT:    [[TMP36:%.*]] = load <2 x i8>, ptr [[ARRAYIDX25_2]], align 1
+; THR15-NEXT:    [[TMP80:%.*]] = zext <2 x i8> [[TMP36]] to <2 x i32>
+; THR15-NEXT:    [[TMP38:%.*]] = load <2 x i8>, ptr [[ARRAYIDX27_2]], align 1
+; THR15-NEXT:    [[TMP39:%.*]] = zext <2 x i8> [[TMP38]] to <2 x i32>
+; THR15-NEXT:    [[TMP25:%.*]] = sub <2 x i32> [[TMP80]], [[TMP39]]
 ; THR15-NEXT:    [[TMP26:%.*]] = shl <2 x i32> [[TMP25]], <i32 16, i32 16>
-; THR15-NEXT:    [[TMP27:%.*]] = add <2 x i32> [[TMP26]], [[TMP20]]
-; THR15-NEXT:    [[TMP28:%.*]] = extractelement <2 x i32> [[TMP15]], i32 0
+; THR15-NEXT:    [[TMP27:%.*]] = add <2 x i32> [[TMP26]], [[TMP35]]
+; THR15-NEXT:    [[TMP83:%.*]] = extractelement <2 x i32> [[TMP15]], i32 0
 ; THR15-NEXT:    [[TMP29:%.*]] = extractelement <2 x i32> [[TMP15]], i32 1
-; THR15-NEXT:    [[ADD44_2:%.*]] = add i32 [[TMP29]], [[TMP28]]
-; THR15-NEXT:    [[SUB45_2:%.*]] = sub i32 [[TMP28]], [[TMP29]]
-; THR15-NEXT:    [[TMP30:%.*]] = extractelement <2 x i32> [[TMP27]], i32 0
+; THR15-NEXT:    [[ADD44_2:%.*]] = add i32 [[TMP29]], [[TMP83]]
+; THR15-NEXT:    [[SUB45_2:%.*]] = sub i32 [[TMP83]], [[TMP29]]
+; THR15-NEXT:    [[TMP87:%.*]] = extractelement <2 x i32> [[TMP27]], i32 0
 ; THR15-NEXT:    [[TMP31:%.*]] = extractelement <2 x i32> [[TMP27]], i32 1
-; THR15-NEXT:    [[ADD46_2:%.*]] = add i32 [[TMP31]], [[TMP30]]
-; THR15-NEXT:    [[SUB47_2:%.*]] = sub i32 [[TMP30]], [[TMP31]]
+; THR15-NEXT:    [[ADD46_2:%.*]] = add i32 [[TMP31]], [[TMP87]]
+; THR15-NEXT:    [[SUB47_2:%.*]] = sub i32 [[TMP87]], [[TMP31]]
 ; THR15-NEXT:    [[ADD48_2:%.*]] = add i32 [[ADD46_2]], [[ADD44_2]]
 ; THR15-NEXT:    [[SUB51_2:%.*]] = sub i32 [[ADD44_2]], [[ADD46_2]]
 ; THR15-NEXT:    [[ADD55_2:%.*]] = add i32 [[SUB47_2]], [[SUB45_2]]
@@ -395,58 +458,62 @@ define i32 @test(ptr %pix1, ptr %pix2, i64 %idx.ext, i64 %idx.ext63, ptr %add.pt
 ; THR15-NEXT:    [[ARRAYIDX3_3:%.*]] = getelementptr i8, ptr null, i64 4
 ; THR15-NEXT:    [[ARRAYIDX5_3:%.*]] = getelementptr i8, ptr null, i64 4
 ; THR15-NEXT:    [[TMP32:%.*]] = load <2 x i8>, ptr null, align 1
-; THR15-NEXT:    [[TMP33:%.*]] = zext <2 x i8> [[TMP32]] to <2 x i32>
+; THR15-NEXT:    [[TMP48:%.*]] = zext <2 x i8> [[TMP32]] to <2 x i32>
 ; THR15-NEXT:    [[TMP34:%.*]] = load <2 x i8>, ptr null, align 1
-; THR15-NEXT:    [[TMP35:%.*]] = zext <2 x i8> [[TMP34]] to <2 x i32>
-; THR15-NEXT:    [[TMP36:%.*]] = sub <2 x i32> [[TMP33]], [[TMP35]]
+; THR15-NEXT:    [[TMP50:%.*]] = zext <2 x i8> [[TMP34]] to <2 x i32>
+; THR15-NEXT:    [[TMP93:%.*]] = sub <2 x i32> [[TMP48]], [[TMP50]]
 ; THR15-NEXT:    [[TMP37:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX3_3]], i64 -4, <2 x i1> <i1 true, i1 true>, i32 2)
-; THR15-NEXT:    [[TMP38:%.*]] = zext <2 x i8> [[TMP37]] to <2 x i32>
-; THR15-NEXT:    [[TMP39:%.*]] = load <2 x i8>, ptr [[ARRAYIDX5_3]], align 1
-; THR15-NEXT:    [[TMP40:%.*]] = zext <2 x i8> [[TMP39]] to <2 x i32>
-; THR15-NEXT:    [[TMP41:%.*]] = sub <2 x i32> [[TMP38]], [[TMP40]]
+; THR15-NEXT:    [[TMP53:%.*]] = zext <2 x i8> [[TMP37]] to <2 x i32>
+; THR15-NEXT:    [[TMP54:%.*]] = load <2 x i8>, ptr [[ARRAYIDX5_3]], align 1
+; THR15-NEXT:    [[TMP55:%.*]] = zext <2 x i8> [[TMP54]] to <2 x i32>
+; THR15-NEXT:    [[TMP41:%.*]] = sub <2 x i32> [[TMP53]], [[TMP55]]
 ; THR15-NEXT:    [[TMP42:%.*]] = shl <2 x i32> [[TMP41]], <i32 16, i32 16>
-; THR15-NEXT:    [[TMP43:%.*]] = add <2 x i32> [[TMP42]], [[TMP36]]
+; THR15-NEXT:    [[TMP43:%.*]] = add <2 x i32> [[TMP42]], [[TMP93]]
 ; THR15-NEXT:    [[ARRAYIDX20_3:%.*]] = getelementptr i8, ptr null, i64 2
 ; THR15-NEXT:    [[ARRAYIDX22_3:%.*]] = getelementptr i8, ptr null, i64 2
 ; THR15-NEXT:    [[TMP44:%.*]] = load i8, ptr null, align 1
 ; THR15-NEXT:    [[ARRAYIDX27_3:%.*]] = getelementptr i8, ptr null, i64 6
 ; THR15-NEXT:    [[TMP45:%.*]] = load i8, ptr null, align 1
-; THR15-NEXT:    [[TMP46:%.*]] = load <2 x i8>, ptr [[ARRAYIDX20_3]], align 1
-; THR15-NEXT:    [[TMP47:%.*]] = zext <2 x i8> [[TMP46]] to <2 x i32>
-; THR15-NEXT:    [[TMP48:%.*]] = load <2 x i8>, ptr [[ARRAYIDX22_3]], align 1
-; THR15-NEXT:    [[TMP49:%.*]] = zext <2 x i8> [[TMP48]] to <2 x i32>
-; THR15-NEXT:    [[TMP50:%.*]] = sub <2 x i32> [[TMP47]], [[TMP49]]
+; THR15-NEXT:    [[TMP61:%.*]] = load <2 x i8>, ptr [[ARRAYIDX20_3]], align 1
+; THR15-NEXT:    [[TMP98:%.*]] = zext <2 x i8> [[TMP61]] to <2 x i32>
+; THR15-NEXT:    [[TMP99:%.*]] = load <2 x i8>, ptr [[ARRAYIDX22_3]], align 1
+; THR15-NEXT:    [[TMP101:%.*]] = zext <2 x i8> [[TMP99]] to <2 x i32>
+; THR15-NEXT:    [[TMP65:%.*]] = sub <2 x i32> [[TMP98]], [[TMP101]]
 ; THR15-NEXT:    [[TMP51:%.*]] = insertelement <2 x i8> poison, i8 [[TMP44]], i32 0
 ; THR15-NEXT:    [[TMP52:%.*]] = insertelement <2 x i8> [[TMP51]], i8 [[TMP45]], i32 1
-; THR15-NEXT:    [[TMP53:%.*]] = zext <2 x i8> [[TMP52]] to <2 x i32>
-; THR15-NEXT:    [[TMP54:%.*]] = load <2 x i8>, ptr [[ARRAYIDX27_3]], align 1
-; THR15-NEXT:    [[TMP55:%.*]] = zext <2 x i8> [[TMP54]] to <2 x i32>
-; THR15-NEXT:    [[TMP56:%.*]] = sub <2 x i32> [[TMP53]], [[TMP55]]
+; THR15-NEXT:    [[TMP68:%.*]] = zext <2 x i8> [[TMP52]] to <2 x i32>
+; THR15-NEXT:    [[TMP102:%.*]] = load <2 x i8>, ptr [[ARRAYIDX27_3]], align 1
+; THR15-NEXT:    [[TMP70:%.*]] = zext <2 x i8> [[TMP102]] to <2 x i32>
+; THR15-NEXT:    [[TMP56:%.*]] = sub <2 x i32> [[TMP68]], [[TMP70]]
 ; THR15-NEXT:    [[TMP57:%.*]] = shl <2 x i32> [[TMP56]], <i32 16, i32 16>
-; THR15-NEXT:    [[TMP58:%.*]] = add <2 x i32> [[TMP57]], [[TMP50]]
-; THR15-NEXT:    [[TMP59:%.*]] = extractelement <2 x i32> [[TMP43]], i32 0
+; THR15-NEXT:    [[TMP58:%.*]] = add <2 x i32> [[TMP57]], [[TMP65]]
+; THR15-NEXT:    [[TMP104:%.*]] = extractelement <2 x i32> [[TMP43]], i32 0
 ; THR15-NEXT:    [[TMP60:%.*]] = extractelement <2 x i32> [[TMP43]], i32 1
-; THR15-NEXT:    [[ADD44_3:%.*]] = add i32 [[TMP60]], [[TMP59]]
-; THR15-NEXT:    [[SUB45_3:%.*]] = sub i32 [[TMP59]], [[TMP60]]
-; THR15-NEXT:    [[TMP61:%.*]] = extractelement <2 x i32> [[TMP58]], i32 0
+; THR15-NEXT:    [[ADD44_3:%.*]] = add i32 [[TMP60]], [[TMP104]]
+; THR15-NEXT:    [[SUB45_3:%.*]] = sub i32 [[TMP104]], [[TMP60]]
+; THR15-NEXT:    [[TMP76:%.*]] = extractelement <2 x i32> [[TMP58]], i32 0
 ; THR15-NEXT:    [[TMP62:%.*]] = extractelement <2 x i32> [[TMP58]], i32 1
-; THR15-NEXT:    [[ADD46_3:%.*]] = add i32 [[TMP62]], [[TMP61]]
-; THR15-NEXT:    [[SUB47_3:%.*]] = sub i32 [[TMP61]], [[TMP62]]
+; THR15-NEXT:    [[ADD46_3:%.*]] = add i32 [[TMP62]], [[TMP76]]
+; THR15-NEXT:    [[SUB47_3:%.*]] = sub i32 [[TMP76]], [[TMP62]]
 ; THR15-NEXT:    [[ADD48_3:%.*]] = add i32 [[ADD46_3]], [[ADD44_3]]
 ; THR15-NEXT:    [[SUB51_3:%.*]] = sub i32 [[ADD44_3]], [[ADD46_3]]
 ; THR15-NEXT:    [[ADD55_3:%.*]] = add i32 [[SUB47_3]], [[SUB45_3]]
 ; THR15-NEXT:    [[SUB59_3:%.*]] = sub i32 [[SUB45_3]], [[SUB47_3]]
 ; THR15-NEXT:    [[ADD94:%.*]] = add i32 [[ADD48_3]], [[ADD48_2]]
 ; THR15-NEXT:    [[SUB102:%.*]] = sub i32 [[ADD48_2]], [[ADD48_3]]
-; THR15-NEXT:    [[TMP63:%.*]] = extractelement <2 x i32> [[TMP33]], i32 0
+; THR15-NEXT:    [[TMP63:%.*]] = extractelement <2 x i32> [[TMP48]], i32 0
 ; THR15-NEXT:    [[SHR_I:%.*]] = lshr i32 [[TMP63]], 15
 ; THR15-NEXT:    [[AND_I:%.*]] = and i32 [[SHR_I]], 65537
 ; THR15-NEXT:    [[MUL_I:%.*]] = mul i32 [[AND_I]], 65535
 ; THR15-NEXT:    [[SHR_I49:%.*]] = lshr i32 [[ADD46_2]], 15
 ; THR15-NEXT:    [[AND_I50:%.*]] = and i32 [[SHR_I49]], 65537
 ; THR15-NEXT:    [[MUL_I51:%.*]] = mul i32 [[AND_I50]], 65535
-; THR15-NEXT:    [[ADD55_1:%.*]] = add i32 [[ADD55_3]], [[ADD55_2]]
+; THR15-NEXT:    [[ADD94_1:%.*]] = add i32 [[ADD55_3]], [[ADD55_2]]
 ; THR15-NEXT:    [[SUB102_1:%.*]] = sub i32 [[ADD55_2]], [[ADD55_3]]
+; THR15-NEXT:    [[TMP105:%.*]] = extractelement <2 x i32> [[TMP66]], i32 1
+; THR15-NEXT:    [[SHR_I_1:%.*]] = lshr i32 [[TMP105]], 15
+; THR15-NEXT:    [[AND_I_1:%.*]] = and i32 [[SHR_I_1]], 65537
+; THR15-NEXT:    [[MUL_I_1:%.*]] = mul i32 [[AND_I_1]], 65535
 ; THR15-NEXT:    [[TMP64:%.*]] = extractelement <2 x i32> [[TMP66]], i32 0
 ; THR15-NEXT:    [[SHR_I49_2:%.*]] = lshr i32 [[TMP64]], 15
 ; THR15-NEXT:    [[AND_I50_2:%.*]] = and i32 [[SHR_I49_2]], 65537
@@ -461,30 +528,30 @@ define i32 @test(ptr %pix1, ptr %pix2, i64 %idx.ext, i64 %idx.ext63, ptr %add.pt
 ; THR15-NEXT:    [[SHR_I49_4:%.*]] = lshr i32 [[CONV]], 15
 ; THR15-NEXT:    [[AND_I50_4:%.*]] = and i32 [[SHR_I49_4]], 65537
 ; THR15-NEXT:    [[MUL_I51_4:%.*]] = mul i32 [[AND_I50_4]], 65535
-; THR15-NEXT:    [[TMP65:%.*]] = load <2 x i8>, ptr [[ARRAYIDX8]], align 1
-; THR15-NEXT:    [[TMP74:%.*]] = zext <2 x i8> [[TMP65]] to <2 x i32>
+; THR15-NEXT:    [[TMP81:%.*]] = load <2 x i8>, ptr [[ARRAYIDX8]], align 1
+; THR15-NEXT:    [[TMP74:%.*]] = zext <2 x i8> [[TMP81]] to <2 x i32>
 ; THR15-NEXT:    [[TMP67:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[PIX2]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
-; THR15-NEXT:    [[TMP68:%.*]] = zext <2 x i8> [[TMP67]] to <2 x i32>
+; THR15-NEXT:    [[TMP107:%.*]] = zext <2 x i8> [[TMP67]] to <2 x i32>
 ; THR15-NEXT:    [[TMP69:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX3]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
-; THR15-NEXT:    [[TMP70:%.*]] = zext <2 x i8> [[TMP69]] to <2 x i32>
+; THR15-NEXT:    [[TMP108:%.*]] = zext <2 x i8> [[TMP69]] to <2 x i32>
 ; THR15-NEXT:    [[TMP71:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX5]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
-; THR15-NEXT:    [[TMP81:%.*]] = zext <2 x i8> [[TMP71]] to <2 x i32>
-; THR15-NEXT:    [[TMP72:%.*]] = sub <2 x i32> [[TMP70]], [[TMP81]]
+; THR15-NEXT:    [[TMP109:%.*]] = zext <2 x i8> [[TMP71]] to <2 x i32>
+; THR15-NEXT:    [[TMP72:%.*]] = sub <2 x i32> [[TMP108]], [[TMP109]]
 ; THR15-NEXT:    [[TMP73:%.*]] = shl <2 x i32> [[TMP72]], <i32 16, i32 16>
 ; THR15-NEXT:    [[TMP75:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX22]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
-; THR15-NEXT:    [[TMP76:%.*]] = zext <2 x i8> [[TMP75]] to <2 x i32>
+; THR15-NEXT:    [[TMP111:%.*]] = zext <2 x i8> [[TMP75]] to <2 x i32>
 ; THR15-NEXT:    [[TMP82:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX25]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
-; THR15-NEXT:    [[TMP78:%.*]] = zext <2 x i8> [[TMP82]] to <2 x i32>
+; THR15-NEXT:    [[TMP94:%.*]] = zext <2 x i8> [[TMP82]] to <2 x i32>
 ; THR15-NEXT:    [[TMP79:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX27]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
-; THR15-NEXT:    [[TMP80:%.*]] = zext <2 x i8> [[TMP79]] to <2 x i32>
-; THR15-NEXT:    [[TMP84:%.*]] = sub <2 x i32> [[TMP78]], [[TMP80]]
+; THR15-NEXT:    [[TMP96:%.*]] = zext <2 x i8> [[TMP79]] to <2 x i32>
+; THR15-NEXT:    [[TMP84:%.*]] = sub <2 x i32> [[TMP94]], [[TMP96]]
 ; THR15-NEXT:    [[TMP85:%.*]] = shl <2 x i32> [[TMP84]], <i32 16, i32 16>
 ; THR15-NEXT:    [[TMP86:%.*]] = insertelement <2 x i32> [[TMP74]], i32 [[CONV33]], i32 1
-; THR15-NEXT:    [[TMP93:%.*]] = sub <2 x i32> [[TMP86]], [[TMP76]]
-; THR15-NEXT:    [[TMP88:%.*]] = add <2 x i32> [[TMP85]], [[TMP93]]
+; THR15-NEXT:    [[TMP100:%.*]] = sub <2 x i32> [[TMP86]], [[TMP111]]
+; THR15-NEXT:    [[TMP88:%.*]] = add <2 x i32> [[TMP85]], [[TMP100]]
 ; THR15-NEXT:    [[TMP92:%.*]] = insertelement <2 x i32> [[TMP74]], i32 [[CONV]], i32 0
-; THR15-NEXT:    [[TMP87:%.*]] = sub <2 x i32> [[TMP92]], [[TMP68]]
-; THR15-NEXT:    [[TMP95:%.*]] = add <2 x i32> [[TMP73]], [[TMP87]]
+; THR15-NEXT:    [[TMP120:%.*]] = sub <2 x i32> [[TMP92]], [[TMP107]]
+; THR15-NEXT:    [[TMP95:%.*]] = add <2 x i32> [[TMP73]], [[TMP120]]
 ; THR15-NEXT:    [[TMP97:%.*]] = shufflevector <2 x i32> [[TMP88]], <2 x i32> [[TMP95]], <2 x i32> <i32 0, i32 2>
 ; THR15-NEXT:    [[TMP77:%.*]] = add <2 x i32> [[TMP88]], [[TMP95]]
 ; THR15-NEXT:    [[TMP91:%.*]] = sub <2 x i32> [[TMP95]], [[TMP88]]
@@ -492,54 +559,62 @@ define i32 @test(ptr %pix1, ptr %pix2, i64 %idx.ext, i64 %idx.ext63, ptr %add.pt
 ; THR15-NEXT:    [[TMP90:%.*]] = extractelement <2 x i32> [[TMP77]], i32 1
 ; THR15-NEXT:    [[ADD48:%.*]] = add i32 [[TMP90]], [[TMP89]]
 ; THR15-NEXT:    [[SUB51:%.*]] = sub i32 [[TMP89]], [[TMP90]]
-; THR15-NEXT:    [[TMP94:%.*]] = extractelement <2 x i32> [[TMP91]], i32 0
+; THR15-NEXT:    [[TMP110:%.*]] = extractelement <2 x i32> [[TMP91]], i32 0
 ; THR15-NEXT:    [[SUB47:%.*]] = extractelement <2 x i32> [[TMP91]], i32 1
-; THR15-NEXT:    [[ADD56:%.*]] = add i32 [[SUB47]], [[TMP94]]
-; THR15-NEXT:    [[SUB59:%.*]] = sub i32 [[TMP94]], [[SUB47]]
+; THR15-NEXT:    [[ADD55:%.*]] = add i32 [[SUB47]], [[TMP110]]
 ; THR15-NEXT:    [[SHR_I59:%.*]] = lshr i32 [[TMP90]], 15
 ; THR15-NEXT:    [[AND_I60:%.*]] = and i32 [[SHR_I59]], 65537
 ; THR15-NEXT:    [[MUL_I61:%.*]] = mul i32 [[AND_I60]], 65535
 ; THR15-NEXT:    [[SHR_I59_1:%.*]] = lshr i32 [[SUB47]], 15
 ; THR15-NEXT:    [[AND_I60_1:%.*]] = and i32 [[SHR_I59_1]], 65537
 ; THR15-NEXT:    [[MUL_I61_1:%.*]] = mul i32 [[AND_I60_1]], 65535
-; THR15-NEXT:    [[TMP96:%.*]] = load <2 x i8>, ptr [[ARRAYIDX8_1]], align 1
-; THR15-NEXT:    [[TMP103:%.*]] = zext <2 x i8> [[TMP96]] to <2 x i32>
-; THR15-NEXT:    [[TMP98:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ADD_PTR644]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
-; THR15-NEXT:    [[TMP99:%.*]] = zext <2 x i8> [[TMP98]] to <2 x i32>
-; THR15-NEXT:    [[TMP100:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX3_1]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
-; THR15-NEXT:    [[TMP104:%.*]] = zext <2 x i8> [[TMP100]] to <2 x i32>
-; THR15-NEXT:    [[TMP105:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX5_1]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
-; THR15-NEXT:    [[TMP112:%.*]] = zext <2 x i8> [[TMP105]] to <2 x i32>
-; THR15-NEXT:    [[TMP101:%.*]] = sub <2 x i32> [[TMP104]], [[TMP112]]
-; THR15-NEXT:    [[TMP102:%.*]] = shl <2 x i32> [[TMP101]], <i32 16, i32 16>
-; THR15-NEXT:    [[TMP120:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX22_1]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
-; THR15-NEXT:    [[TMP107:%.*]] = zext <2 x i8> [[TMP120]] to <2 x i32>
-; THR15-NEXT:    [[TMP108:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX13_1]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
-; THR15-NEXT:    [[TMP109:%.*]] = zext <2 x i8> [[TMP108]] to <2 x i32>
-; THR15-NEXT:    [[TMP110:%.*]] = call <2 x i8> @llvm.experimental.vp.strided.load.v2i8.p0.i64(ptr align 1 [[ARRAYIDX27_1]], i64 2, <2 x i1> <i1 true, i1 true>, i32 2)
-; THR15-NEXT:    [[TMP111:%.*]] = zext <2 x i8> [[TMP110]] to <2 x i32>
-; THR15-NEXT:    [[TMP113:%.*]] = sub <2 x i32> [[TMP109]], [[TMP111]]
+; THR15-NEXT:    [[TMP112:%.*]] = load <2 x i8>, ptr [[ARRAYIDX8_1]], align 1
+; THR15-NEXT:    [[TMP130:%.*]] = zext <2 x i8> [[TMP112]] to <2 x i32>
+; THR15-NEXT:    [[TMP129:%.*]] = load <2 x i8>, ptr [[ADD_PTR644]], align 1
+; THR15-NEXT:    [[TMP115:%.*]] = zext <2 x i8> [[TMP129]] to <2 x i32>
+; THR15-NEXT:    [[TMP116:%.*]] = load <2 x i8>, ptr [[ARRAYIDX3_1]], align 1
+; THR15-NEXT:    [[TMP117:%.*]] = zext <2 x i8> [[TMP116]] to <2 x i32>
+; THR15-NEXT:    [[TMP131:%.*]] = load <2 x i8>, ptr [[ARRAYIDX5_1]], align 1
+; THR15-NEXT:    [[TMP132:%.*]] = zext <2 x i8> [[TMP131]] to <2 x i32>
+; THR15-NEXT:    [[TMP113:%.*]] = sub <2 x i32> [[TMP117]], [[TMP132]]
 ; THR15-NEXT:    [[TMP114:%.*]] = shl <2 x i32> [[TMP113]], <i32 16, i32 16>
-; THR15-NEXT:    [[TMP115:%.*]] = insertelement <2 x i32> [[TMP103]], i32 [[CONV33_1]], i32 1
-; THR15-NEXT:    [[TMP117:%.*]] = sub <2 x i32> [[TMP115]], [[TMP107]]
-; THR15-NEXT:    [[TMP116:%.*]] = add <2 x i32> [[TMP114]], [[TMP117]]
+; THR15-NEXT:    [[TMP103:%.*]] = shufflevector <2 x i32> [[TMP130]], <2 x i32> poison, <2 x i32> <i32 poison, i32 0>
 ; THR15-NEXT:    [[TMP126:%.*]] = insertelement <2 x i32> [[TMP103]], i32 [[CONV_1]], i32 0
-; THR15-NEXT:    [[TMP127:%.*]] = sub <2 x i32> [[TMP126]], [[TMP99]]
-; THR15-NEXT:    [[TMP128:%.*]] = add <2 x i32> [[TMP102]], [[TMP127]]
-; THR15-NEXT:    [[TMP106:%.*]] = add <2 x i32> [[TMP116]], [[TMP128]]
-; THR15-NEXT:    [[TMP121:%.*]] = sub <2 x i32> [[TMP128]], [[TMP116]]
+; THR15-NEXT:    [[TMP134:%.*]] = sub <2 x i32> [[TMP126]], [[TMP115]]
+; THR15-NEXT:    [[TMP121:%.*]] = add <2 x i32> [[TMP114]], [[TMP134]]
+; THR15-NEXT:    [[TMP145:%.*]] = shufflevector <2 x i8> [[TMP7]], <2 x i8> poison, <2 x i32> <i32 1, i32 poison>
+; THR15-NEXT:    [[TMP127:%.*]] = insertelement <2 x i8> [[TMP145]], i8 [[TMP3]], i32 1
+; THR15-NEXT:    [[TMP128:%.*]] = zext <2 x i8> [[TMP127]] to <2 x i32>
+; THR15-NEXT:    [[TMP146:%.*]] = insertelement <2 x i32> [[TMP130]], i32 [[TMP9]], i32 0
+; THR15-NEXT:    [[TMP106:%.*]] = sub <2 x i32> [[TMP146]], [[TMP128]]
 ; THR15-NEXT:    [[TMP118:%.*]] = extractelement <2 x i32> [[TMP106]], i32 0
+; THR15-NEXT:    [[SHL30_1:%.*]] = shl i32 [[TMP118]], 16
 ; THR15-NEXT:    [[TMP119:%.*]] = extractelement <2 x i32> [[TMP106]], i32 1
-; THR15-NEXT:    [[ADD48_1:%.*]] = add i32 [[TMP119]], [[TMP118]]
-; THR15-NEXT:    [[SUB51_1:%.*]] = sub i32 [[TMP118]], [[TMP119]]
-; THR15-NEXT:    [[TMP129:%.*]] = extractelement <2 x i32> [[TMP121]], i32 0
+; THR15-NEXT:    [[ADD31_1:%.*]] = add i32 [[SHL30_1]], [[TMP119]]
+; THR15-NEXT:    [[TMP133:%.*]] = extractelement <2 x i32> [[TMP121]], i32 0
 ; THR15-NEXT:    [[TMP125:%.*]] = extractelement <2 x i32> [[TMP121]], i32 1
-; THR15-NEXT:    [[ADD55_4:%.*]] = add i32 [[TMP125]], [[TMP129]]
-; THR15-NEXT:    [[SUB59_1:%.*]] = sub i32 [[TMP129]], [[TMP125]]
-; THR15-NEXT:    [[SHR_I54_1:%.*]] = lshr i32 [[TMP119]], 15
+; THR15-NEXT:    [[SUB45_1:%.*]] = sub i32 [[TMP133]], [[TMP125]]
+; THR15-NEXT:    [[TMP135:%.*]] = shufflevector <2 x i32> [[TMP121]], <2 x i32> poison, <2 x i32> <i32 1, i32 poison>
+; THR15-NEXT:    [[TMP136:%.*]] = insertelement <2 x i32> [[TMP135]], i32 [[ADD43_1]], i32 1
+; THR15-NEXT:    [[TMP137:%.*]] = insertelement <2 x i32> [[TMP121]], i32 [[ADD31_1]], i32 1
+; THR15-NEXT:    [[TMP138:%.*]] = add <2 x i32> [[TMP136]], [[TMP137]]
+; THR15-NEXT:    [[SUB47_1:%.*]] = sub i32 [[ADD31_1]], [[ADD43_1]]
+; THR15-NEXT:    [[TMP139:%.*]] = extractelement <2 x i32> [[TMP138]], i32 0
+; THR15-NEXT:    [[TMP140:%.*]] = extractelement <2 x i32> [[TMP138]], i32 1
+; THR15-NEXT:    [[ADD48_1:%.*]] = add i32 [[TMP140]], [[TMP139]]
+; THR15-NEXT:    [[SUB51_1:%.*]] = sub i32 [[TMP139]], [[TMP140]]
+; THR15-NEXT:    [[ADD55_1:%.*]] = add i32 [[SUB47_1]], [[SUB45_1]]
+; THR15-NEXT:    [[TMP141:%.*]] = shufflevector <2 x i32> [[TMP91]], <2 x i32> poison, <2 x i32> <i32 poison, i32 0>
+; THR15-NEXT:    [[TMP142:%.*]] = insertelement <2 x i32> [[TMP141]], i32 [[SUB45_1]], i32 0
+; THR15-NEXT:    [[TMP143:%.*]] = insertelement <2 x i32> [[TMP91]], i32 [[SUB47_1]], i32 0
+; THR15-NEXT:    [[TMP144:%.*]] = sub <2 x i32> [[TMP142]], [[TMP143]]
+; THR15-NEXT:    [[SHR_I54_1:%.*]] = lshr i32 [[TMP140]], 15
 ; THR15-NEXT:    [[AND_I55_1:%.*]] = and i32 [[SHR_I54_1]], 65537
 ; THR15-NEXT:    [[MUL_I56_1:%.*]] = mul i32 [[AND_I55_1]], 65535
-; THR15-NEXT:    [[TMP122:%.*]] = lshr <2 x i32> [[TMP103]], <i32 15, i32 15>
+; THR15-NEXT:    [[SHR_I54_2:%.*]] = lshr i32 [[SUB47_1]], 15
+; THR15-NEXT:    [[AND_I55_2:%.*]] = and i32 [[SHR_I54_2]], 65537
+; THR15-NEXT:    [[MUL_I56_2:%.*]] = mul i32 [[AND_I55_2]], 65535
+; THR15-NEXT:    [[TMP122:%.*]] = lshr <2 x i32> [[TMP130]], <i32 15, i32 15>
 ; THR15-NEXT:    [[TMP123:%.*]] = and <2 x i32> [[TMP122]], <i32 65537, i32 65537>
 ; THR15-NEXT:    [[TMP124:%.*]] = mul <2 x i32> [[TMP123]], <i32 65535, i32 65535>
 ; THR15-NEXT:    [[ADD78:%.*]] = add i32 [[ADD48_1]], [[ADD48]]
@@ -553,38 +628,29 @@ define i32 @test(ptr %pix1, ptr %pix2, i64 %idx.ext, i64 %idx.ext63, ptr %add.pt
 ; THR15-NEXT:    [[ADD_I52:%.*]] = add i32 [[MUL_I51]], [[ADD105]]
 ; THR15-NEXT:    [[XOR_I53:%.*]] = xor i32 [[ADD_I52]], [[ADD46_2]]
 ; THR15-NEXT:    [[ADD_I57:%.*]] = add i32 [[MUL_I56_1]], [[SUB104]]
-; THR15-NEXT:    [[XOR_I58:%.*]] = xor i32 [[ADD_I57]], [[TMP119]]
+; THR15-NEXT:    [[XOR_I58:%.*]] = xor i32 [[ADD_I57]], [[TMP140]]
 ; THR15-NEXT:    [[ADD_I62:%.*]] = add i32 [[MUL_I61]], [[SUB106]]
 ; THR15-NEXT:    [[XOR_I63:%.*]] = xor i32 [[ADD_I62]], [[TMP90]]
 ; THR15-NEXT:    [[ADD110:%.*]] = add i32 [[XOR_I53]], [[XOR_I]]
 ; THR15-NEXT:    [[ADD112:%.*]] = add i32 [[ADD110]], [[XOR_I58]]
 ; THR15-NEXT:    [[ADD113:%.*]] = add i32 [[ADD112]], [[XOR_I63]]
-; THR15-NEXT:    [[ADD55:%.*]] = add i32 [[ADD55_4]], [[ADD56]]
-; THR15-NEXT:    [[SUB86_1:%.*]] = sub i32 [[ADD56]], [[ADD55_4]]
+; THR15-NEXT:    [[ADD78_1:%.*]] = add i32 [[ADD55_1]], [[ADD55]]
+; THR15-NEXT:    [[SUB86_1:%.*]] = sub i32 [[ADD55]], [[ADD55_1]]
+; THR15-NEXT:    [[ADD103_1:%.*]] = add i32 [[ADD94_1]], [[ADD78_1]]
+; THR15-NEXT:    [[SUB104_1:%.*]] = sub i32 [[ADD78_1]], [[ADD94_1]]
 ; THR15-NEXT:    [[ADD105_1:%.*]] = add i32 [[SUB102_1]], [[SUB86_1]]
 ; THR15-NEXT:    [[SUB106_1:%.*]] = sub i32 [[SUB86_1]], [[SUB102_1]]
+; THR15-NEXT:    [[ADD_I_1:%.*]] = add i32 [[MUL_I_1]], [[ADD103_1]]
+; THR15-NEXT:    [[XOR_I_1:%.*]] = xor i32 [[ADD_I_1]], [[TMP105]]
 ; THR15-NEXT:    [[ADD_I52_1:%.*]] = add i32 [[MUL_I51_2]], [[ADD105_1]]
 ; THR15-NEXT:    [[XOR_I53_1:%.*]] = xor i32 [[ADD_I52_1]], [[TMP64]]
-; THR15-NEXT:    [[TMP5:%.*]] = shufflevector <2 x i32> [[TMP66]], <2 x i32> [[TMP121]], <2 x i32> <i32 1, i32 3>
-; THR15-NEXT:    [[TMP132:%.*]] = lshr <2 x i32> [[TMP5]], <i32 15, i32 15>
-; THR15-NEXT:    [[TMP133:%.*]] = and <2 x i32> [[TMP132]], <i32 65537, i32 65537>
-; THR15-NEXT:    [[TMP134:%.*]] = mul <2 x i32> [[TMP133]], <i32 65535, i32 65535>
-; THR15-NEXT:    [[TMP135:%.*]] = insertelement <2 x i32> poison, i32 [[ADD55]], i32 0
-; THR15-NEXT:    [[TMP136:%.*]] = shufflevector <2 x i32> [[TMP135]], <2 x i32> poison, <2 x i32> zeroinitializer
-; THR15-NEXT:    [[TMP137:%.*]] = insertelement <2 x i32> poison, i32 [[ADD55_1]], i32 0
-; THR15-NEXT:    [[TMP138:%.*]] = shufflevector <2 x i32> [[TMP137]], <2 x i32> poison, <2 x i32> zeroinitializer
-; THR15-NEXT:    [[TMP140:%.*]] = add <2 x i32> [[TMP136]], [[TMP138]]
-; THR15-NEXT:    [[TMP139:%.*]] = sub <2 x i32> [[TMP136]], [[TMP138]]
-; THR15-NEXT:    [[TMP144:%.*]] = shufflevector <2 x i32> [[TMP140]], <2 x i32> [[TMP139]], <2 x i32> <i32 0, i32 3>
-; THR15-NEXT:    [[TMP148:%.*]] = add <2 x i32> [[TMP134]], [[TMP144]]
-; THR15-NEXT:    [[TMP149:%.*]] = xor <2 x i32> [[TMP148]], [[TMP5]]
+; THR15-NEXT:    [[ADD_I57_1:%.*]] = add i32 [[MUL_I56_2]], [[SUB104_1]]
+; THR15-NEXT:    [[XOR_I58_1:%.*]] = xor i32 [[ADD_I57_1]], [[SUB47_1]]
 ; THR15-NEXT:    [[ADD_I62_1:%.*]] = add i32 [[MUL_I61_1]], [[SUB106_1]]
 ; THR15-NEXT:    [[XOR_I63_1:%.*]] = xor i32 [[ADD_I62_1]], [[SUB47]]
 ; THR15-NEXT:    [[ADD108_1:%.*]] = add i32 [[XOR_I53_1]], [[ADD113]]
-; THR15-NEXT:    [[TMP150:%.*]] = extractelement <2 x i32> [[TMP149]], i32 0
-; THR15-NEXT:    [[ADD110_1:%.*]] = add i32 [[ADD108_1]], [[TMP150]]
-; THR15-NEXT:    [[TMP151:%.*]] = extractelement <2 x i32> [[TMP149]], i32 1
-; THR15-NEXT:    [[ADD112_1:%.*]] = add i32 [[ADD110_1]], [[TMP151]]
+; THR15-NEXT:    [[ADD110_1:%.*]] = add i32 [[ADD108_1]], [[XOR_I_1]]
+; THR15-NEXT:    [[ADD112_1:%.*]] = add i32 [[ADD110_1]], [[XOR_I58_1]]
 ; THR15-NEXT:    [[ADD113_1:%.*]] = add i32 [[ADD112_1]], [[XOR_I63_1]]
 ; THR15-NEXT:    [[ADD78_2:%.*]] = add i32 [[SUB51_1]], [[SUB51]]
 ; THR15-NEXT:    [[SUB86_2:%.*]] = sub i32 [[SUB51]], [[SUB51_1]]
@@ -599,8 +665,8 @@ define i32 @test(ptr %pix1, ptr %pix2, i64 %idx.ext, i64 %idx.ext63, ptr %add.pt
 ; THR15-NEXT:    [[SUB106_2:%.*]] = sub i32 [[SUB86_2]], [[SUB102_2]]
 ; THR15-NEXT:    [[ADD_I52_2:%.*]] = add i32 [[MUL_I51_3]], [[ADD105_2]]
 ; THR15-NEXT:    [[XOR_I53_2:%.*]] = xor i32 [[ADD_I52_2]], [[CONV_1]]
-; THR15-NEXT:    [[TMP159:%.*]] = add <2 x i32> [[TMP124]], [[TMP158]]
-; THR15-NEXT:    [[TMP160:%.*]] = xor <2 x i32> [[TMP159]], [[TMP103]]
+; THR15-NEXT:    [[TMP177:%.*]] = add <2 x i32> [[TMP124]], [[TMP158]]
+; THR15-NEXT:    [[TMP160:%.*]] = xor <2 x i32> [[TMP177]], [[TMP130]]
 ; THR15-NEXT:    [[SHR_I59_2:%.*]] = lshr i32 [[TMP89]], 15
 ; THR15-NEXT:    [[AND_I60_2:%.*]] = and i32 [[SHR_I59_2]], 65537
 ; THR15-NEXT:    [[MUL_I61_2:%.*]] = mul i32 [[AND_I60_2]], 65535
@@ -612,8 +678,10 @@ define i32 @test(ptr %pix1, ptr %pix2, i64 %idx.ext, i64 %idx.ext63, ptr %add.pt
 ; THR15-NEXT:    [[TMP162:%.*]] = extractelement <2 x i32> [[TMP160]], i32 1
 ; THR15-NEXT:    [[ADD112_2:%.*]] = add i32 [[ADD110_2]], [[TMP162]]
 ; THR15-NEXT:    [[ADD113_2:%.*]] = add i32 [[ADD112_2]], [[XOR_I63_2]]
-; THR15-NEXT:    [[ADD78_3:%.*]] = add i32 [[SUB59_1]], [[SUB59]]
-; THR15-NEXT:    [[SUB86_3:%.*]] = sub i32 [[SUB59]], [[SUB59_1]]
+; THR15-NEXT:    [[TMP159:%.*]] = extractelement <2 x i32> [[TMP144]], i32 0
+; THR15-NEXT:    [[TMP178:%.*]] = extractelement <2 x i32> [[TMP144]], i32 1
+; THR15-NEXT:    [[ADD78_3:%.*]] = add i32 [[TMP159]], [[TMP178]]
+; THR15-NEXT:    [[SUB86_3:%.*]] = sub i32 [[TMP178]], [[TMP159]]
 ; THR15-NEXT:    [[TMP163:%.*]] = insertelement <2 x i32> poison, i32 [[ADD78_3]], i32 0
 ; THR15-NEXT:    [[TMP164:%.*]] = shufflevector <2 x i32> [[TMP163]], <2 x i32> poison, <2 x i32> zeroinitializer
 ; THR15-NEXT:    [[TMP165:%.*]] = insertelement <2 x i32> poison, i32 [[ADD94_3]], i32 0

diff  --git a/llvm/test/Transforms/SLPVectorizer/RISCV/mixed-extracts-types.ll b/llvm/test/Transforms/SLPVectorizer/RISCV/mixed-extracts-types.ll
index 125fe69820d5c1..56555efb063edf 100644
--- a/llvm/test/Transforms/SLPVectorizer/RISCV/mixed-extracts-types.ll
+++ b/llvm/test/Transforms/SLPVectorizer/RISCV/mixed-extracts-types.ll
@@ -6,15 +6,15 @@ define i32 @test() {
 ; CHECK-SAME: ) #[[ATTR0:[0-9]+]] {
 ; CHECK-NEXT:  entry:
 ; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <vscale x 4 x i8> zeroinitializer, i64 0
-; CHECK-NEXT:    [[CONV5:%.*]] = sext i8 [[VECTOR_RECUR_EXTRACT]] to i32
-; CHECK-NEXT:    store i32 [[CONV5]], ptr getelementptr ([0 x i32], ptr null, i64 0, i64 -14), align 4
 ; CHECK-NEXT:    [[TMP0:%.*]] = load <2 x i8>, ptr getelementptr ([9 x i8], ptr null, i64 -2, i64 5), align 1
 ; CHECK-NEXT:    [[TMP3:%.*]] = load i8, ptr getelementptr ([9 x i8], ptr null, i64 -2, i64 5), align 1
 ; CHECK-NEXT:    [[TMP1:%.*]] = icmp ne <2 x i8> [[TMP0]], zeroinitializer
 ; CHECK-NEXT:    [[TMP2:%.*]] = zext <2 x i1> [[TMP1]] to <2 x i16>
 ; CHECK-NEXT:    store <2 x i16> [[TMP2]], ptr getelementptr ([0 x i16], ptr null, i64 0, i64 -14), align 2
-; CHECK-NEXT:    [[CONV5_1:%.*]] = sext i8 [[TMP3]] to i32
-; CHECK-NEXT:    store i32 [[CONV5_1]], ptr getelementptr ([0 x i32], ptr null, i64 0, i64 -13), align 4
+; CHECK-NEXT:    [[TMP4:%.*]] = insertelement <2 x i8> poison, i8 [[VECTOR_RECUR_EXTRACT]], i32 0
+; CHECK-NEXT:    [[TMP5:%.*]] = insertelement <2 x i8> [[TMP4]], i8 [[TMP3]], i32 1
+; CHECK-NEXT:    [[TMP6:%.*]] = sext <2 x i8> [[TMP5]] to <2 x i32>
+; CHECK-NEXT:    store <2 x i32> [[TMP6]], ptr getelementptr ([0 x i32], ptr null, i64 0, i64 -14), align 4
 ; CHECK-NEXT:    ret i32 0
 ;
 entry:

diff  --git a/llvm/test/Transforms/SLPVectorizer/RISCV/remarks-insert-into-small-vector.ll b/llvm/test/Transforms/SLPVectorizer/RISCV/remarks-insert-into-small-vector.ll
index de1eecd98eeb3f..bb806be15c71ca 100644
--- a/llvm/test/Transforms/SLPVectorizer/RISCV/remarks-insert-into-small-vector.ll
+++ b/llvm/test/Transforms/SLPVectorizer/RISCV/remarks-insert-into-small-vector.ll
@@ -8,7 +8,7 @@
 ; YAML-NEXT:  Function:        test
 ; YAML-NEXT:  Args:
 ; YAML-NEXT:  - String:          'Stores SLP vectorized with cost '
-; YAML-NEXT:  - Cost:            '3'
+; YAML-NEXT:  - Cost:            '2'
 ; YAML-NEXT:  - String:          ' and with tree size '
 ; YAML-NEXT:  - TreeSize:        '7'
 

diff  --git a/llvm/test/Transforms/SLPVectorizer/RISCV/vec3-base.ll b/llvm/test/Transforms/SLPVectorizer/RISCV/vec3-base.ll
index 6dd9242989b627..308d0e27f1ea89 100644
--- a/llvm/test/Transforms/SLPVectorizer/RISCV/vec3-base.ll
+++ b/llvm/test/Transforms/SLPVectorizer/RISCV/vec3-base.ll
@@ -794,13 +794,21 @@ entry:
 }
 
 define float @reduce_fadd_after_fmul_of_buildvec(float %a, float %b, float %c) {
-; CHECK-LABEL: @reduce_fadd_after_fmul_of_buildvec(
-; CHECK-NEXT:    [[MUL_0:%.*]] = fmul fast float [[A:%.*]], 1.000000e+01
-; CHECK-NEXT:    [[MUL_1:%.*]] = fmul fast float [[B:%.*]], 1.000000e+01
-; CHECK-NEXT:    [[MUL_2:%.*]] = fmul fast float [[C:%.*]], 1.000000e+01
-; CHECK-NEXT:    [[ADD_0:%.*]] = fadd fast float [[MUL_0]], [[MUL_1]]
-; CHECK-NEXT:    [[ADD_1:%.*]] = fadd fast float [[ADD_0]], [[MUL_2]]
-; CHECK-NEXT:    ret float [[ADD_1]]
+; NON-POW2-LABEL: @reduce_fadd_after_fmul_of_buildvec(
+; NON-POW2-NEXT:    [[TMP1:%.*]] = insertelement <3 x float> poison, float [[A:%.*]], i32 0
+; NON-POW2-NEXT:    [[TMP2:%.*]] = insertelement <3 x float> [[TMP1]], float [[B:%.*]], i32 1
+; NON-POW2-NEXT:    [[TMP3:%.*]] = insertelement <3 x float> [[TMP2]], float [[C:%.*]], i32 2
+; NON-POW2-NEXT:    [[TMP4:%.*]] = fmul fast <3 x float> [[TMP3]], <float 1.000000e+01, float 1.000000e+01, float 1.000000e+01>
+; NON-POW2-NEXT:    [[TMP5:%.*]] = call fast float @llvm.vector.reduce.fadd.v3f32(float 0.000000e+00, <3 x float> [[TMP4]])
+; NON-POW2-NEXT:    ret float [[TMP5]]
+;
+; POW2-ONLY-LABEL: @reduce_fadd_after_fmul_of_buildvec(
+; POW2-ONLY-NEXT:    [[MUL_0:%.*]] = fmul fast float [[A:%.*]], 1.000000e+01
+; POW2-ONLY-NEXT:    [[MUL_1:%.*]] = fmul fast float [[B:%.*]], 1.000000e+01
+; POW2-ONLY-NEXT:    [[MUL_2:%.*]] = fmul fast float [[C:%.*]], 1.000000e+01
+; POW2-ONLY-NEXT:    [[ADD_0:%.*]] = fadd fast float [[MUL_0]], [[MUL_1]]
+; POW2-ONLY-NEXT:    [[ADD_1:%.*]] = fadd fast float [[ADD_0]], [[MUL_2]]
+; POW2-ONLY-NEXT:    ret float [[ADD_1]]
 ;
   %mul.0 = fmul fast float %a, 10.0
   %mul.1 = fmul fast float %b, 10.0

