[llvm] [LV] Modify the tie-breaker logic of `preferscalable` in isMoreProfitable(). (PR #121682)
via llvm-commits
llvm-commits at lists.llvm.org
Sun Jan 5 00:34:45 PST 2025
https://github.com/Tingwei0512 created https://github.com/llvm/llvm-project/pull/121682
The current tie-breaker logic leads to inconsistent behavior in certain scenarios.
Here's an example:
Assume `TTI.preferFixedOverScalableIfEqualCost()` returns false.
When
- A's VF = 1, cost = 4
- B's VF = 4, cost = 4
Decision: Not to vectorize;
but when
- A's VF = 1, cost = 4
- B's VF = vscale × 2, cost = 4
Decision: Vectorize.
Both vector candidates tie the scalar plan on cost, yet only the scalable one wins, because the old tie-breaker relaxes the cost comparison from `<` to `<=` only for a scalable candidate compared against a fixed-width (or scalar) one.
To address this inconsistency, we modify the logic so that `preferFixedOverScalableIfEqualCost()` is checked only when A is scalable and B is not; in all other cases, a cost tie is now resolved in favor of A. This change provides more opportunities for loop vectorization.
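For illustration, here is a minimal standalone sketch of the old and new tie-breakers (the `Candidate` struct, the `isMoreProfitableOld`/`isMoreProfitableNew` helpers, and the plain `unsigned` costs are simplified stand-ins for LLVM's `VectorizationFactor`/`InstructionCost`, not the actual sources):

```cpp
// Standalone sketch of the tie-breaker change; simplified stand-ins for
// LLVM's VectorizationFactor/InstructionCost, not the actual implementation.
#include <cassert>

struct Candidate {
  bool Scalable;  // true for VFs like vscale x 2
  unsigned Cost;  // simplified cost
};

// Old behavior: ties only favor A when A is scalable and B is fixed-width.
bool isMoreProfitableOld(const Candidate &A, const Candidate &B,
                         bool PreferFixedIfEqual) {
  bool PreferScalable = !PreferFixedIfEqual && A.Scalable && !B.Scalable;
  return PreferScalable ? A.Cost <= B.Cost : A.Cost < B.Cost;
}

// New behavior: ties favor A by default; preferFixedOverScalableIfEqualCost()
// is consulted only for the scalable-A-vs-fixed-B case.
bool isMoreProfitableNew(const Candidate &A, const Candidate &B,
                         bool PreferFixedIfEqual) {
  bool PreferScalable = true;
  if (A.Scalable && !B.Scalable)
    PreferScalable = !PreferFixedIfEqual;
  return PreferScalable ? A.Cost <= B.Cost : A.Cost < B.Cost;
}

int main() {
  Candidate Scalar{false, 4}, Fixed{false, 4}, Scal{true, 4};
  // Old: the tied fixed-width plan loses to the scalar plan...
  assert(!isMoreProfitableOld(Fixed, Scalar, /*PreferFixedIfEqual=*/false));
  // ...while the tied scalable plan wins -- the inconsistency.
  assert(isMoreProfitableOld(Scal, Scalar, /*PreferFixedIfEqual=*/false));
  // New: both tied vector plans beat the scalar plan.
  assert(isMoreProfitableNew(Fixed, Scalar, /*PreferFixedIfEqual=*/false));
  assert(isMoreProfitableNew(Scal, Scalar, /*PreferFixedIfEqual=*/false));
  return 0;
}
```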
@fhahn @david-arm
>From 04aeed577178b185e95f365fb24ae71d42454681 Mon Sep 17 00:00:00 2001
From: Tingwei0512 <tingwewe at gmail.com>
Date: Sat, 4 Jan 2025 17:20:42 +0800
Subject: [PATCH 1/2] Fix: Modify the tie-breaker logic of PreferScalable
---
.../Transforms/Vectorize/LoopVectorize.cpp | 8 +-
.../LoopVectorize/AArch64/call-costs.ll | 91 ++++-
.../AArch64/conditional-branches-cost.ll | 30 +-
.../AArch64/force-target-instruction-cost.ll | 10 +-
.../AArch64/induction-costs-sve.ll | 289 ++++++++++------
.../LoopVectorize/AArch64/induction-costs.ll | 18 +-
.../AArch64/reduction-recurrence-costs-sve.ll | 161 +++++----
.../AArch64/uniform-args-call-variants.ll | 131 ++++---
.../LoopVectorize/RISCV/low-trip-count.ll | 48 +--
.../LoopVectorize/RISCV/short-trip-count.ll | 16 +-
.../LoopVectorize/X86/conversion-cost.ll | 20 +-
.../LoopVectorize/X86/cost-model.ll | 327 +++++++++++++-----
.../LoopVectorize/X86/float-induction-x86.ll | 44 +--
.../LoopVectorize/X86/interleave-cost.ll | 48 +--
.../X86/invariant-store-vectorization.ll | 21 +-
.../X86/limit-vf-by-tripcount.ll | 16 +-
.../LoopVectorize/X86/predicate-switch.ll | 46 ++-
.../LoopVectorize/X86/strided_load_cost.ll | 124 +++++--
18 files changed, 980 insertions(+), 468 deletions(-)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 7ef5295bb12763..294b5c2c8911bc 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -4373,8 +4373,12 @@ bool LoopVectorizationPlanner::isMoreProfitable(
// Assume vscale may be larger than 1 (or the value being tuned for),
// so that scalable vectorization is slightly favorable over fixed-width
// vectorization.
- bool PreferScalable = !TTI.preferFixedOverScalableIfEqualCost() &&
- A.Width.isScalable() && !B.Width.isScalable();
+
+ // Only check preferFixedOverScalableIfEqualCost() when A is scalable
+ // and B isn't.
+ bool PreferScalable = true;
+ if (A.Width.isScalable() && !B.Width.isScalable())
+ PreferScalable = !TTI.preferFixedOverScalableIfEqualCost();
auto CmpFn = [PreferScalable](const InstructionCost &LHS,
const InstructionCost &RHS) {
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/call-costs.ll b/llvm/test/Transforms/LoopVectorize/AArch64/call-costs.ll
index 4f050877bd1316..db57bda04b790a 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/call-costs.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/call-costs.ll
@@ -127,24 +127,99 @@ define void @call_scalarized(ptr noalias %src, ptr noalias %dst) {
; CHECK-LABEL: define void @call_scalarized(
; CHECK-SAME: ptr noalias [[SRC:%.*]], ptr noalias [[DST:%.*]]) {
; CHECK-NEXT: [[ENTRY:.*]]:
+; CHECK-NEXT: br i1 false, label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK: [[VECTOR_PH]]:
+; CHECK-NEXT: br label %[[VECTOR_BODY:.*]]
+; CHECK: [[VECTOR_BODY]]:
+; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[PRED_STORE_CONTINUE8:.*]] ]
+; CHECK-NEXT: [[OFFSET_IDX:%.*]] = sub i64 100, [[INDEX]]
+; CHECK-NEXT: [[IV:%.*]] = add i64 [[OFFSET_IDX]], 0
+; CHECK-NEXT: [[IV_NEXT:%.*]] = add i64 [[IV]], -1
+; CHECK-NEXT: [[GEP_SRC:%.*]] = getelementptr double, ptr [[SRC]], i64 [[IV_NEXT]]
+; CHECK-NEXT: [[TMP3:%.*]] = getelementptr double, ptr [[GEP_SRC]], i32 0
+; CHECK-NEXT: [[TMP4:%.*]] = getelementptr double, ptr [[TMP3]], i32 -1
+; CHECK-NEXT: [[TMP5:%.*]] = getelementptr double, ptr [[GEP_SRC]], i32 -2
+; CHECK-NEXT: [[TMP6:%.*]] = getelementptr double, ptr [[TMP5]], i32 -1
+; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <2 x double>, ptr [[TMP4]], align 8
+; CHECK-NEXT: [[REVERSE:%.*]] = shufflevector <2 x double> [[WIDE_LOAD]], <2 x double> poison, <2 x i32> <i32 1, i32 0>
+; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <2 x double>, ptr [[TMP6]], align 8
+; CHECK-NEXT: [[REVERSE2:%.*]] = shufflevector <2 x double> [[WIDE_LOAD1]], <2 x double> poison, <2 x i32> <i32 1, i32 0>
+; CHECK-NEXT: [[TMP7:%.*]] = fcmp une <2 x double> [[REVERSE]], splat (double 4.000000e+00)
+; CHECK-NEXT: [[TMP8:%.*]] = fcmp une <2 x double> [[REVERSE2]], splat (double 4.000000e+00)
+; CHECK-NEXT: [[TMP9:%.*]] = fcmp ugt <2 x double> [[REVERSE]], zeroinitializer
+; CHECK-NEXT: [[TMP10:%.*]] = fcmp ugt <2 x double> [[REVERSE2]], zeroinitializer
+; CHECK-NEXT: [[TMP11:%.*]] = or <2 x i1> [[TMP7]], [[TMP9]]
+; CHECK-NEXT: [[TMP12:%.*]] = or <2 x i1> [[TMP8]], [[TMP10]]
+; CHECK-NEXT: [[TMP13:%.*]] = xor <2 x i1> [[TMP11]], splat (i1 true)
+; CHECK-NEXT: [[TMP14:%.*]] = xor <2 x i1> [[TMP12]], splat (i1 true)
+; CHECK-NEXT: [[TMP15:%.*]] = extractelement <2 x i1> [[TMP13]], i32 0
+; CHECK-NEXT: br i1 [[TMP15]], label %[[PRED_STORE_IF:.*]], label %[[PRED_STORE_CONTINUE:.*]]
+; CHECK: [[PRED_STORE_IF]]:
+; CHECK-NEXT: [[TMP16:%.*]] = add i64 [[IV]], -1
+; CHECK-NEXT: [[TMP17:%.*]] = getelementptr double, ptr [[DST]], i64 [[TMP16]]
+; CHECK-NEXT: [[TMP18:%.*]] = extractelement <2 x double> [[REVERSE]], i32 0
+; CHECK-NEXT: [[TMP19:%.*]] = call double @llvm.sqrt.f64(double [[TMP18]])
+; CHECK-NEXT: store double [[TMP19]], ptr [[TMP17]], align 8
+; CHECK-NEXT: br label %[[PRED_STORE_CONTINUE]]
+; CHECK: [[PRED_STORE_CONTINUE]]:
+; CHECK-NEXT: [[TMP20:%.*]] = extractelement <2 x i1> [[TMP13]], i32 1
+; CHECK-NEXT: br i1 [[TMP20]], label %[[PRED_STORE_IF3:.*]], label %[[PRED_STORE_CONTINUE4:.*]]
+; CHECK: [[PRED_STORE_IF3]]:
+; CHECK-NEXT: [[TMP21:%.*]] = add i64 [[OFFSET_IDX]], -1
+; CHECK-NEXT: [[TMP22:%.*]] = add i64 [[TMP21]], -1
+; CHECK-NEXT: [[TMP23:%.*]] = getelementptr double, ptr [[DST]], i64 [[TMP22]]
+; CHECK-NEXT: [[TMP24:%.*]] = extractelement <2 x double> [[REVERSE]], i32 1
+; CHECK-NEXT: [[TMP25:%.*]] = call double @llvm.sqrt.f64(double [[TMP24]])
+; CHECK-NEXT: store double [[TMP25]], ptr [[TMP23]], align 8
+; CHECK-NEXT: br label %[[PRED_STORE_CONTINUE4]]
+; CHECK: [[PRED_STORE_CONTINUE4]]:
+; CHECK-NEXT: [[TMP26:%.*]] = extractelement <2 x i1> [[TMP14]], i32 0
+; CHECK-NEXT: br i1 [[TMP26]], label %[[PRED_STORE_IF5:.*]], label %[[PRED_STORE_CONTINUE6:.*]]
+; CHECK: [[PRED_STORE_IF5]]:
+; CHECK-NEXT: [[TMP27:%.*]] = add i64 [[OFFSET_IDX]], -2
+; CHECK-NEXT: [[TMP28:%.*]] = add i64 [[TMP27]], -1
+; CHECK-NEXT: [[TMP29:%.*]] = getelementptr double, ptr [[DST]], i64 [[TMP28]]
+; CHECK-NEXT: [[TMP30:%.*]] = extractelement <2 x double> [[REVERSE2]], i32 0
+; CHECK-NEXT: [[TMP31:%.*]] = call double @llvm.sqrt.f64(double [[TMP30]])
+; CHECK-NEXT: store double [[TMP31]], ptr [[TMP29]], align 8
+; CHECK-NEXT: br label %[[PRED_STORE_CONTINUE6]]
+; CHECK: [[PRED_STORE_CONTINUE6]]:
+; CHECK-NEXT: [[TMP32:%.*]] = extractelement <2 x i1> [[TMP14]], i32 1
+; CHECK-NEXT: br i1 [[TMP32]], label %[[PRED_STORE_IF7:.*]], label %[[PRED_STORE_CONTINUE8]]
+; CHECK: [[PRED_STORE_IF7]]:
+; CHECK-NEXT: [[TMP33:%.*]] = add i64 [[OFFSET_IDX]], -3
+; CHECK-NEXT: [[TMP34:%.*]] = add i64 [[TMP33]], -1
+; CHECK-NEXT: [[TMP35:%.*]] = getelementptr double, ptr [[DST]], i64 [[TMP34]]
+; CHECK-NEXT: [[TMP36:%.*]] = extractelement <2 x double> [[REVERSE2]], i32 1
+; CHECK-NEXT: [[TMP37:%.*]] = call double @llvm.sqrt.f64(double [[TMP36]])
+; CHECK-NEXT: store double [[TMP37]], ptr [[TMP35]], align 8
+; CHECK-NEXT: br label %[[PRED_STORE_CONTINUE8]]
+; CHECK: [[PRED_STORE_CONTINUE8]]:
+; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4
+; CHECK-NEXT: [[TMP38:%.*]] = icmp eq i64 [[INDEX_NEXT]], 100
+; CHECK-NEXT: br i1 [[TMP38]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP6:![0-9]+]]
+; CHECK: [[MIDDLE_BLOCK]]:
+; CHECK-NEXT: br i1 true, label %[[EXIT:.*]], label %[[SCALAR_PH]]
+; CHECK: [[SCALAR_PH]]:
+; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 0, %[[MIDDLE_BLOCK]] ], [ 100, %[[ENTRY]] ]
; CHECK-NEXT: br label %[[LOOP_HEADER:.*]]
; CHECK: [[LOOP_HEADER]]:
-; CHECK-NEXT: [[IV:%.*]] = phi i64 [ 100, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP_LATCH:.*]] ]
-; CHECK-NEXT: [[IV_NEXT]] = add i64 [[IV]], -1
-; CHECK-NEXT: [[GEP_SRC:%.*]] = getelementptr double, ptr [[SRC]], i64 [[IV_NEXT]]
-; CHECK-NEXT: [[L:%.*]] = load double, ptr [[GEP_SRC]], align 8
+; CHECK-NEXT: [[IV1:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[IV_NEXT1:%.*]], %[[LOOP_LATCH:.*]] ]
+; CHECK-NEXT: [[IV_NEXT1]] = add i64 [[IV1]], -1
+; CHECK-NEXT: [[GEP_SRC1:%.*]] = getelementptr double, ptr [[SRC]], i64 [[IV_NEXT1]]
+; CHECK-NEXT: [[L:%.*]] = load double, ptr [[GEP_SRC1]], align 8
; CHECK-NEXT: [[CMP295:%.*]] = fcmp une double [[L]], 4.000000e+00
; CHECK-NEXT: [[CMP299:%.*]] = fcmp ugt double [[L]], 0.000000e+00
; CHECK-NEXT: [[OR_COND:%.*]] = or i1 [[CMP295]], [[CMP299]]
; CHECK-NEXT: br i1 [[OR_COND]], label %[[LOOP_LATCH]], label %[[THEN:.*]]
; CHECK: [[THEN]]:
; CHECK-NEXT: [[SQRT:%.*]] = call double @llvm.sqrt.f64(double [[L]])
-; CHECK-NEXT: [[GEP_DST:%.*]] = getelementptr double, ptr [[DST]], i64 [[IV_NEXT]]
+; CHECK-NEXT: [[GEP_DST:%.*]] = getelementptr double, ptr [[DST]], i64 [[IV_NEXT1]]
; CHECK-NEXT: store double [[SQRT]], ptr [[GEP_DST]], align 8
; CHECK-NEXT: br label %[[LOOP_LATCH]]
; CHECK: [[LOOP_LATCH]]:
-; CHECK-NEXT: [[TOBOOL_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], 0
-; CHECK-NEXT: br i1 [[TOBOOL_NOT]], label %[[EXIT:.*]], label %[[LOOP_HEADER]]
+; CHECK-NEXT: [[TOBOOL_NOT:%.*]] = icmp eq i64 [[IV_NEXT1]], 0
+; CHECK-NEXT: br i1 [[TOBOOL_NOT]], label %[[EXIT]], label %[[LOOP_HEADER]], !llvm.loop [[LOOP7:![0-9]+]]
; CHECK: [[EXIT]]:
; CHECK-NEXT: ret void
;
@@ -235,4 +310,6 @@ declare i64 @llvm.fshl.i64(i64, i64, i64)
; CHECK: [[LOOP3]] = distinct !{[[LOOP3]], [[META2]], [[META1]]}
; CHECK: [[LOOP4]] = distinct !{[[LOOP4]], [[META1]], [[META2]]}
; CHECK: [[LOOP5]] = distinct !{[[LOOP5]], [[META2]], [[META1]]}
+; CHECK: [[LOOP6]] = distinct !{[[LOOP6]], [[META1]], [[META2]]}
+; CHECK: [[LOOP7]] = distinct !{[[LOOP7]], [[META2]], [[META1]]}
;.
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll b/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll
index caa98d766a8c34..b2145dae0cc448 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll
@@ -84,37 +84,33 @@ define void @loop_dependent_cond(ptr %src, ptr noalias %dst, i64 %N) {
; DEFAULT-NEXT: [[TMP1:%.*]] = add i64 [[INDEX]], 0
; DEFAULT-NEXT: [[TMP3:%.*]] = getelementptr double, ptr [[SRC]], i64 [[TMP1]]
; DEFAULT-NEXT: [[TMP5:%.*]] = getelementptr double, ptr [[TMP3]], i32 0
-; DEFAULT-NEXT: [[TMP6:%.*]] = getelementptr double, ptr [[TMP3]], i32 2
-; DEFAULT-NEXT: [[WIDE_LOAD:%.*]] = load <2 x double>, ptr [[TMP5]], align 8
-; DEFAULT-NEXT: [[WIDE_LOAD1:%.*]] = load <2 x double>, ptr [[TMP6]], align 8
-; DEFAULT-NEXT: [[TMP7:%.*]] = call <2 x double> @llvm.fabs.v2f64(<2 x double> [[WIDE_LOAD]])
-; DEFAULT-NEXT: [[TMP8:%.*]] = call <2 x double> @llvm.fabs.v2f64(<2 x double> [[WIDE_LOAD1]])
-; DEFAULT-NEXT: [[TMP9:%.*]] = fcmp ogt <2 x double> [[TMP7]], splat (double 1.000000e+00)
-; DEFAULT-NEXT: [[TMP10:%.*]] = fcmp ogt <2 x double> [[TMP8]], splat (double 1.000000e+00)
-; DEFAULT-NEXT: [[TMP11:%.*]] = extractelement <2 x i1> [[TMP9]], i32 0
+; DEFAULT-NEXT: [[WIDE_LOAD:%.*]] = load <4 x double>, ptr [[TMP5]], align 8
+; DEFAULT-NEXT: [[TMP4:%.*]] = call <4 x double> @llvm.fabs.v4f64(<4 x double> [[WIDE_LOAD]])
+; DEFAULT-NEXT: [[TMP6:%.*]] = fcmp ogt <4 x double> [[TMP4]], splat (double 1.000000e+00)
+; DEFAULT-NEXT: [[TMP11:%.*]] = extractelement <4 x i1> [[TMP6]], i32 0
; DEFAULT-NEXT: br i1 [[TMP11]], label [[PRED_STORE_IF:%.*]], label [[PRED_STORE_CONTINUE:%.*]]
; DEFAULT: pred.store.if:
; DEFAULT-NEXT: store i32 0, ptr [[DST]], align 4
; DEFAULT-NEXT: br label [[PRED_STORE_CONTINUE]]
; DEFAULT: pred.store.continue:
-; DEFAULT-NEXT: [[TMP12:%.*]] = extractelement <2 x i1> [[TMP9]], i32 1
+; DEFAULT-NEXT: [[TMP12:%.*]] = extractelement <4 x i1> [[TMP6]], i32 1
; DEFAULT-NEXT: br i1 [[TMP12]], label [[PRED_STORE_IF2:%.*]], label [[PRED_STORE_CONTINUE3:%.*]]
-; DEFAULT: pred.store.if2:
+; DEFAULT: pred.store.if1:
; DEFAULT-NEXT: store i32 0, ptr [[DST]], align 4
; DEFAULT-NEXT: br label [[PRED_STORE_CONTINUE3]]
-; DEFAULT: pred.store.continue3:
-; DEFAULT-NEXT: [[TMP13:%.*]] = extractelement <2 x i1> [[TMP10]], i32 0
+; DEFAULT: pred.store.continue2:
+; DEFAULT-NEXT: [[TMP13:%.*]] = extractelement <4 x i1> [[TMP6]], i32 2
; DEFAULT-NEXT: br i1 [[TMP13]], label [[PRED_STORE_IF4:%.*]], label [[PRED_STORE_CONTINUE5:%.*]]
-; DEFAULT: pred.store.if4:
+; DEFAULT: pred.store.if3:
; DEFAULT-NEXT: store i32 0, ptr [[DST]], align 4
; DEFAULT-NEXT: br label [[PRED_STORE_CONTINUE5]]
-; DEFAULT: pred.store.continue5:
-; DEFAULT-NEXT: [[TMP14:%.*]] = extractelement <2 x i1> [[TMP10]], i32 1
+; DEFAULT: pred.store.continue4:
+; DEFAULT-NEXT: [[TMP14:%.*]] = extractelement <4 x i1> [[TMP6]], i32 3
; DEFAULT-NEXT: br i1 [[TMP14]], label [[PRED_STORE_IF6:%.*]], label [[PRED_STORE_CONTINUE7]]
-; DEFAULT: pred.store.if6:
+; DEFAULT: pred.store.if5:
; DEFAULT-NEXT: store i32 0, ptr [[DST]], align 4
; DEFAULT-NEXT: br label [[PRED_STORE_CONTINUE7]]
-; DEFAULT: pred.store.continue7:
+; DEFAULT: pred.store.continue6:
; DEFAULT-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4
; DEFAULT-NEXT: [[TMP15:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
; DEFAULT-NEXT: br i1 [[TMP15]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/force-target-instruction-cost.ll b/llvm/test/Transforms/LoopVectorize/AArch64/force-target-instruction-cost.ll
index 08a6001431903d..8a28be1af324a3 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/force-target-instruction-cost.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/force-target-instruction-cost.ll
@@ -67,7 +67,7 @@ define void @test_iv_cost(ptr %ptr.start, i8 %a, i64 %b) {
; CHECK-NEXT: [[C:%.*]] = icmp eq i64 [[START]], 0
; CHECK-NEXT: br i1 [[C]], label %[[EXIT:.*]], label %[[ITER_CHECK:.*]]
; CHECK: [[ITER_CHECK]]:
-; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[START]], 4
+; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[START]], 8
; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label %[[VEC_EPILOG_SCALAR_PH:.*]], label %[[VECTOR_MAIN_LOOP_ITER_CHECK:.*]]
; CHECK: [[VECTOR_MAIN_LOOP_ITER_CHECK]]:
; CHECK-NEXT: [[MIN_ITERS_CHECK1:%.*]] = icmp ult i64 [[START]], 32
@@ -94,11 +94,11 @@ define void @test_iv_cost(ptr %ptr.start, i8 %a, i64 %b) {
; CHECK-NEXT: [[IND_END:%.*]] = sub i64 [[START]], [[N_VEC]]
; CHECK-NEXT: [[IND_END2:%.*]] = getelementptr i8, ptr [[PTR_START]], i64 [[N_VEC]]
; CHECK-NEXT: [[N_VEC_REMAINING:%.*]] = sub i64 [[START]], [[N_VEC]]
-; CHECK-NEXT: [[MIN_EPILOG_ITERS_CHECK:%.*]] = icmp ult i64 [[N_VEC_REMAINING]], 4
+; CHECK-NEXT: [[MIN_EPILOG_ITERS_CHECK:%.*]] = icmp ult i64 [[N_VEC_REMAINING]], 8
; CHECK-NEXT: br i1 [[MIN_EPILOG_ITERS_CHECK]], label %[[VEC_EPILOG_SCALAR_PH]], label %[[VEC_EPILOG_PH]]
; CHECK: [[VEC_EPILOG_PH]]:
; CHECK-NEXT: [[VEC_EPILOG_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], %[[VEC_EPILOG_ITER_CHECK]] ], [ 0, %[[VECTOR_MAIN_LOOP_ITER_CHECK]] ]
-; CHECK-NEXT: [[N_MOD_VF2:%.*]] = urem i64 [[START]], 4
+; CHECK-NEXT: [[N_MOD_VF2:%.*]] = urem i64 [[START]], 8
; CHECK-NEXT: [[N_VEC3:%.*]] = sub i64 [[START]], [[N_MOD_VF2]]
; CHECK-NEXT: [[IND_END1:%.*]] = sub i64 [[START]], [[N_VEC3]]
; CHECK-NEXT: [[IND_END5:%.*]] = getelementptr i8, ptr [[PTR_START]], i64 [[N_VEC3]]
@@ -108,8 +108,8 @@ define void @test_iv_cost(ptr %ptr.start, i8 %a, i64 %b) {
; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0
; CHECK-NEXT: [[NEXT_GEP:%.*]] = getelementptr i8, ptr [[PTR_START]], i64 [[TMP0]]
; CHECK-NEXT: [[TMP2:%.*]] = getelementptr i8, ptr [[NEXT_GEP]], i32 0
-; CHECK-NEXT: store <4 x i8> zeroinitializer, ptr [[TMP2]], align 1
-; CHECK-NEXT: [[INDEX_NEXT10]] = add nuw i64 [[INDEX]], 4
+; CHECK-NEXT: store <8 x i8> zeroinitializer, ptr [[TMP2]], align 1
+; CHECK-NEXT: [[INDEX_NEXT10]] = add nuw i64 [[INDEX]], 8
; CHECK-NEXT: [[TMP7:%.*]] = icmp eq i64 [[INDEX_NEXT10]], [[N_VEC3]]
; CHECK-NEXT: br i1 [[TMP7]], label %[[VEC_EPILOG_MIDDLE_BLOCK:.*]], label %[[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP5:![0-9]+]]
; CHECK: [[VEC_EPILOG_MIDDLE_BLOCK]]:
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/induction-costs-sve.ll b/llvm/test/Transforms/LoopVectorize/AArch64/induction-costs-sve.ll
index 56a468ed1310b5..0361bf180bdcd0 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/induction-costs-sve.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/induction-costs-sve.ll
@@ -8,69 +8,113 @@ target triple = "arm64-apple-macosx14.0.0"
define void @iv_casts(ptr %dst, ptr %src, i32 %x, i64 %N) #0 {
; DEFAULT-LABEL: define void @iv_casts(
; DEFAULT-SAME: ptr [[DST:%.*]], ptr [[SRC:%.*]], i32 [[X:%.*]], i64 [[N:%.*]]) #[[ATTR0:[0-9]+]] {
-; DEFAULT-NEXT: entry:
+; DEFAULT-NEXT: iter.check:
; DEFAULT-NEXT: [[SRC2:%.*]] = ptrtoint ptr [[SRC]] to i64
; DEFAULT-NEXT: [[DST1:%.*]] = ptrtoint ptr [[DST]] to i64
; DEFAULT-NEXT: [[TMP0:%.*]] = add i64 [[N]], 1
; DEFAULT-NEXT: [[TMP1:%.*]] = call i64 @llvm.vscale.i64()
-; DEFAULT-NEXT: [[TMP2:%.*]] = mul i64 [[TMP1]], 16
+; DEFAULT-NEXT: [[TMP2:%.*]] = mul i64 [[TMP1]], 8
; DEFAULT-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP0]], [[TMP2]]
; DEFAULT-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH:%.*]], label [[VECTOR_MEMCHECK:%.*]]
; DEFAULT: vector.memcheck:
; DEFAULT-NEXT: [[TMP3:%.*]] = call i64 @llvm.vscale.i64()
-; DEFAULT-NEXT: [[TMP4:%.*]] = mul i64 [[TMP3]], 8
+; DEFAULT-NEXT: [[TMP4:%.*]] = mul i64 [[TMP3]], 16
; DEFAULT-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 2
; DEFAULT-NEXT: [[TMP6:%.*]] = sub i64 [[DST1]], [[SRC2]]
; DEFAULT-NEXT: [[DIFF_CHECK:%.*]] = icmp ult i64 [[TMP6]], [[TMP5]]
; DEFAULT-NEXT: br i1 [[DIFF_CHECK]], label [[VEC_EPILOG_SCALAR_PH]], label [[VECTOR_PH:%.*]]
+; DEFAULT: vector.main.loop.iter.check:
+; DEFAULT-NEXT: [[TMP7:%.*]] = call i64 @llvm.vscale.i64()
+; DEFAULT-NEXT: [[TMP8:%.*]] = mul i64 [[TMP7]], 32
+; DEFAULT-NEXT: [[MIN_ITERS_CHECK3:%.*]] = icmp ult i64 [[TMP0]], [[TMP8]]
+; DEFAULT-NEXT: br i1 [[MIN_ITERS_CHECK3]], label [[VEC_EPILOG_PH:%.*]], label [[VECTOR_PH1:%.*]]
; DEFAULT: vector.ph:
; DEFAULT-NEXT: [[TMP9:%.*]] = call i64 @llvm.vscale.i64()
-; DEFAULT-NEXT: [[TMP10:%.*]] = mul i64 [[TMP9]], 16
+; DEFAULT-NEXT: [[TMP10:%.*]] = mul i64 [[TMP9]], 32
; DEFAULT-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[TMP0]], [[TMP10]]
; DEFAULT-NEXT: [[N_VEC:%.*]] = sub i64 [[TMP0]], [[N_MOD_VF]]
; DEFAULT-NEXT: [[TMP11:%.*]] = call i64 @llvm.vscale.i64()
-; DEFAULT-NEXT: [[TMP12:%.*]] = mul i64 [[TMP11]], 16
-; DEFAULT-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 8 x i32> poison, i32 [[X]], i64 0
-; DEFAULT-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 8 x i32> [[BROADCAST_SPLATINSERT]], <vscale x 8 x i32> poison, <vscale x 8 x i32> zeroinitializer
-; DEFAULT-NEXT: [[TMP13:%.*]] = trunc <vscale x 8 x i32> [[BROADCAST_SPLAT]] to <vscale x 8 x i16>
+; DEFAULT-NEXT: [[TMP12:%.*]] = mul i64 [[TMP11]], 32
+; DEFAULT-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 16 x i32> poison, i32 [[X]], i64 0
+; DEFAULT-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 16 x i32> [[BROADCAST_SPLATINSERT]], <vscale x 16 x i32> poison, <vscale x 16 x i32> zeroinitializer
+; DEFAULT-NEXT: [[TMP15:%.*]] = trunc <vscale x 16 x i32> [[BROADCAST_SPLAT]] to <vscale x 16 x i16>
; DEFAULT-NEXT: br label [[VECTOR_BODY:%.*]]
; DEFAULT: vector.body:
-; DEFAULT-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+; DEFAULT-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH1]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; DEFAULT-NEXT: [[TMP14:%.*]] = add i64 [[INDEX]], 0
; DEFAULT-NEXT: [[TMP20:%.*]] = getelementptr i8, ptr [[SRC]], i64 [[TMP14]]
; DEFAULT-NEXT: [[TMP22:%.*]] = getelementptr i8, ptr [[TMP20]], i32 0
; DEFAULT-NEXT: [[TMP23:%.*]] = call i64 @llvm.vscale.i64()
-; DEFAULT-NEXT: [[TMP24:%.*]] = mul i64 [[TMP23]], 8
+; DEFAULT-NEXT: [[TMP24:%.*]] = mul i64 [[TMP23]], 16
; DEFAULT-NEXT: [[TMP25:%.*]] = getelementptr i8, ptr [[TMP20]], i64 [[TMP24]]
-; DEFAULT-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 8 x i8>, ptr [[TMP22]], align 1
-; DEFAULT-NEXT: [[WIDE_LOAD4:%.*]] = load <vscale x 8 x i8>, ptr [[TMP25]], align 1
-; DEFAULT-NEXT: [[TMP26:%.*]] = zext <vscale x 8 x i8> [[WIDE_LOAD]] to <vscale x 8 x i16>
+; DEFAULT-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 16 x i8>, ptr [[TMP22]], align 1
+; DEFAULT-NEXT: [[WIDE_LOAD5:%.*]] = load <vscale x 16 x i8>, ptr [[TMP25]], align 1
+; DEFAULT-NEXT: [[TMP44:%.*]] = zext <vscale x 16 x i8> [[WIDE_LOAD]] to <vscale x 16 x i16>
+; DEFAULT-NEXT: [[TMP21:%.*]] = zext <vscale x 16 x i8> [[WIDE_LOAD5]] to <vscale x 16 x i16>
+; DEFAULT-NEXT: [[TMP48:%.*]] = mul <vscale x 16 x i16> [[TMP44]], [[TMP15]]
+; DEFAULT-NEXT: [[TMP49:%.*]] = mul <vscale x 16 x i16> [[TMP21]], [[TMP15]]
+; DEFAULT-NEXT: [[TMP50:%.*]] = zext <vscale x 16 x i8> [[WIDE_LOAD]] to <vscale x 16 x i16>
+; DEFAULT-NEXT: [[TMP51:%.*]] = zext <vscale x 16 x i8> [[WIDE_LOAD5]] to <vscale x 16 x i16>
+; DEFAULT-NEXT: [[TMP26:%.*]] = or <vscale x 16 x i16> [[TMP48]], [[TMP50]]
+; DEFAULT-NEXT: [[TMP52:%.*]] = or <vscale x 16 x i16> [[TMP49]], [[TMP51]]
+; DEFAULT-NEXT: [[TMP28:%.*]] = lshr <vscale x 16 x i16> [[TMP26]], trunc (<vscale x 16 x i32> splat (i32 1) to <vscale x 16 x i16>)
+; DEFAULT-NEXT: [[TMP53:%.*]] = lshr <vscale x 16 x i16> [[TMP52]], trunc (<vscale x 16 x i32> splat (i32 1) to <vscale x 16 x i16>)
+; DEFAULT-NEXT: [[TMP30:%.*]] = trunc <vscale x 16 x i16> [[TMP28]] to <vscale x 16 x i8>
+; DEFAULT-NEXT: [[TMP54:%.*]] = trunc <vscale x 16 x i16> [[TMP53]] to <vscale x 16 x i8>
+; DEFAULT-NEXT: [[TMP32:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP14]]
+; DEFAULT-NEXT: [[TMP55:%.*]] = getelementptr i8, ptr [[TMP32]], i32 0
+; DEFAULT-NEXT: [[TMP34:%.*]] = call i64 @llvm.vscale.i64()
+; DEFAULT-NEXT: [[TMP57:%.*]] = mul i64 [[TMP34]], 16
+; DEFAULT-NEXT: [[TMP36:%.*]] = getelementptr i8, ptr [[TMP32]], i64 [[TMP57]]
+; DEFAULT-NEXT: store <vscale x 16 x i8> [[TMP30]], ptr [[TMP55]], align 1
+; DEFAULT-NEXT: store <vscale x 16 x i8> [[TMP54]], ptr [[TMP36]], align 1
+; DEFAULT-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP12]]
+; DEFAULT-NEXT: [[TMP58:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; DEFAULT-NEXT: br i1 [[TMP58]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; DEFAULT: middle.block:
+; DEFAULT-NEXT: [[CMP_N1:%.*]] = icmp eq i64 [[TMP0]], [[N_VEC]]
+; DEFAULT-NEXT: br i1 [[CMP_N1]], label [[EXIT:%.*]], label [[VEC_EPILOG_ITER_CHECK:%.*]]
+; DEFAULT: vec.epilog.iter.check:
+; DEFAULT-NEXT: [[N_VEC_REMAINING:%.*]] = sub i64 [[TMP0]], [[N_VEC]]
+; DEFAULT-NEXT: [[TMP59:%.*]] = call i64 @llvm.vscale.i64()
+; DEFAULT-NEXT: [[TMP39:%.*]] = mul i64 [[TMP59]], 8
+; DEFAULT-NEXT: [[MIN_EPILOG_ITERS_CHECK:%.*]] = icmp ult i64 [[N_VEC_REMAINING]], [[TMP39]]
+; DEFAULT-NEXT: br i1 [[MIN_EPILOG_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH]], label [[VEC_EPILOG_PH]]
+; DEFAULT: vec.epilog.ph:
+; DEFAULT-NEXT: [[VEC_EPILOG_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[VECTOR_PH]] ]
+; DEFAULT-NEXT: [[TMP60:%.*]] = call i64 @llvm.vscale.i64()
+; DEFAULT-NEXT: [[TMP41:%.*]] = mul i64 [[TMP60]], 8
+; DEFAULT-NEXT: [[N_MOD_VF5:%.*]] = urem i64 [[TMP0]], [[TMP41]]
+; DEFAULT-NEXT: [[N_VEC6:%.*]] = sub i64 [[TMP0]], [[N_MOD_VF5]]
+; DEFAULT-NEXT: [[TMP42:%.*]] = call i64 @llvm.vscale.i64()
+; DEFAULT-NEXT: [[TMP43:%.*]] = mul i64 [[TMP42]], 8
+; DEFAULT-NEXT: [[BROADCAST_SPLATINSERT7:%.*]] = insertelement <vscale x 8 x i32> poison, i32 [[X]], i64 0
+; DEFAULT-NEXT: [[BROADCAST_SPLAT8:%.*]] = shufflevector <vscale x 8 x i32> [[BROADCAST_SPLATINSERT7]], <vscale x 8 x i32> poison, <vscale x 8 x i32> zeroinitializer
+; DEFAULT-NEXT: [[TMP13:%.*]] = trunc <vscale x 8 x i32> [[BROADCAST_SPLAT8]] to <vscale x 8 x i16>
+; DEFAULT-NEXT: br label [[VEC_EPILOG_VECTOR_BODY:%.*]]
+; DEFAULT: vec.epilog.vector.body:
+; DEFAULT-NEXT: [[INDEX9:%.*]] = phi i64 [ [[VEC_EPILOG_RESUME_VAL]], [[VEC_EPILOG_PH]] ], [ [[INDEX_NEXT11:%.*]], [[VEC_EPILOG_VECTOR_BODY]] ]
+; DEFAULT-NEXT: [[TMP45:%.*]] = add i64 [[INDEX9]], 0
+; DEFAULT-NEXT: [[TMP46:%.*]] = getelementptr i8, ptr [[SRC]], i64 [[TMP45]]
+; DEFAULT-NEXT: [[TMP47:%.*]] = getelementptr i8, ptr [[TMP46]], i32 0
+; DEFAULT-NEXT: [[WIDE_LOAD4:%.*]] = load <vscale x 8 x i8>, ptr [[TMP47]], align 1
; DEFAULT-NEXT: [[TMP27:%.*]] = zext <vscale x 8 x i8> [[WIDE_LOAD4]] to <vscale x 8 x i16>
-; DEFAULT-NEXT: [[TMP28:%.*]] = mul <vscale x 8 x i16> [[TMP26]], [[TMP13]]
; DEFAULT-NEXT: [[TMP29:%.*]] = mul <vscale x 8 x i16> [[TMP27]], [[TMP13]]
-; DEFAULT-NEXT: [[TMP30:%.*]] = zext <vscale x 8 x i8> [[WIDE_LOAD]] to <vscale x 8 x i16>
; DEFAULT-NEXT: [[TMP31:%.*]] = zext <vscale x 8 x i8> [[WIDE_LOAD4]] to <vscale x 8 x i16>
-; DEFAULT-NEXT: [[TMP32:%.*]] = or <vscale x 8 x i16> [[TMP28]], [[TMP30]]
; DEFAULT-NEXT: [[TMP33:%.*]] = or <vscale x 8 x i16> [[TMP29]], [[TMP31]]
-; DEFAULT-NEXT: [[TMP34:%.*]] = lshr <vscale x 8 x i16> [[TMP32]], trunc (<vscale x 8 x i32> splat (i32 1) to <vscale x 8 x i16>)
; DEFAULT-NEXT: [[TMP35:%.*]] = lshr <vscale x 8 x i16> [[TMP33]], trunc (<vscale x 8 x i32> splat (i32 1) to <vscale x 8 x i16>)
-; DEFAULT-NEXT: [[TMP36:%.*]] = trunc <vscale x 8 x i16> [[TMP34]] to <vscale x 8 x i8>
; DEFAULT-NEXT: [[TMP37:%.*]] = trunc <vscale x 8 x i16> [[TMP35]] to <vscale x 8 x i8>
-; DEFAULT-NEXT: [[TMP38:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP14]]
+; DEFAULT-NEXT: [[TMP38:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP45]]
; DEFAULT-NEXT: [[TMP40:%.*]] = getelementptr i8, ptr [[TMP38]], i32 0
-; DEFAULT-NEXT: [[TMP41:%.*]] = call i64 @llvm.vscale.i64()
-; DEFAULT-NEXT: [[TMP42:%.*]] = mul i64 [[TMP41]], 8
-; DEFAULT-NEXT: [[TMP43:%.*]] = getelementptr i8, ptr [[TMP38]], i64 [[TMP42]]
-; DEFAULT-NEXT: store <vscale x 8 x i8> [[TMP36]], ptr [[TMP40]], align 1
-; DEFAULT-NEXT: store <vscale x 8 x i8> [[TMP37]], ptr [[TMP43]], align 1
-; DEFAULT-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP12]]
-; DEFAULT-NEXT: [[TMP44:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
-; DEFAULT-NEXT: br i1 [[TMP44]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
-; DEFAULT: middle.block:
-; DEFAULT-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[TMP0]], [[N_VEC]]
-; DEFAULT-NEXT: br i1 [[CMP_N]], label [[EXIT:%.*]], label [[VEC_EPILOG_SCALAR_PH]]
-; DEFAULT: scalar.ph:
-; DEFAULT-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[VECTOR_MEMCHECK]] ], [ 0, [[ENTRY:%.*]] ]
+; DEFAULT-NEXT: store <vscale x 8 x i8> [[TMP37]], ptr [[TMP40]], align 1
+; DEFAULT-NEXT: [[INDEX_NEXT11]] = add nuw i64 [[INDEX9]], [[TMP43]]
+; DEFAULT-NEXT: [[TMP56:%.*]] = icmp eq i64 [[INDEX_NEXT11]], [[N_VEC6]]
+; DEFAULT-NEXT: br i1 [[TMP56]], label [[VEC_EPILOG_MIDDLE_BLOCK:%.*]], label [[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]
+; DEFAULT: vec.epilog.middle.block:
+; DEFAULT-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[TMP0]], [[N_VEC6]]
+; DEFAULT-NEXT: br i1 [[CMP_N]], label [[EXIT]], label [[VEC_EPILOG_SCALAR_PH]]
+; DEFAULT: vec.epilog.scalar.ph:
+; DEFAULT-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC6]], [[VEC_EPILOG_MIDDLE_BLOCK]] ], [ 0, [[VECTOR_MEMCHECK]] ], [ 0, [[ITER_CHECK:%.*]] ], [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ]
; DEFAULT-NEXT: br label [[LOOP:%.*]]
; DEFAULT: loop:
; DEFAULT-NEXT: [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[VEC_EPILOG_SCALAR_PH]] ], [ [[IV_NEXT:%.*]], [[LOOP]] ]
@@ -86,7 +130,7 @@ define void @iv_casts(ptr %dst, ptr %src, i32 %x, i64 %N) #0 {
; DEFAULT-NEXT: [[GEP_DST:%.*]] = getelementptr i8, ptr [[DST]], i64 [[IV]]
; DEFAULT-NEXT: store i8 [[CONV36_US]], ptr [[GEP_DST]], align 1
; DEFAULT-NEXT: [[EC:%.*]] = icmp eq i64 [[IV]], [[N]]
-; DEFAULT-NEXT: br i1 [[EC]], label [[EXIT]], label [[LOOP]], !llvm.loop [[LOOP3:![0-9]+]]
+; DEFAULT-NEXT: br i1 [[EC]], label [[EXIT]], label [[LOOP]], !llvm.loop [[LOOP4:![0-9]+]]
; DEFAULT: exit:
; DEFAULT-NEXT: ret void
;
@@ -99,49 +143,49 @@ define void @iv_casts(ptr %dst, ptr %src, i32 %x, i64 %N) #0 {
; PRED-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_MEMCHECK:%.*]]
; PRED: vector.memcheck:
; PRED-NEXT: [[TMP1:%.*]] = call i64 @llvm.vscale.i64()
-; PRED-NEXT: [[TMP2:%.*]] = mul i64 [[TMP1]], 8
+; PRED-NEXT: [[TMP2:%.*]] = mul i64 [[TMP1]], 16
; PRED-NEXT: [[TMP3:%.*]] = sub i64 [[DST1]], [[SRC2]]
; PRED-NEXT: [[DIFF_CHECK:%.*]] = icmp ult i64 [[TMP3]], [[TMP2]]
; PRED-NEXT: br i1 [[DIFF_CHECK]], label [[SCALAR_PH]], label [[VECTOR_PH:%.*]]
; PRED: vector.ph:
; PRED-NEXT: [[TMP4:%.*]] = call i64 @llvm.vscale.i64()
-; PRED-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 8
+; PRED-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 16
; PRED-NEXT: [[TMP8:%.*]] = sub i64 [[TMP5]], 1
; PRED-NEXT: [[N_RND_UP:%.*]] = add i64 [[TMP0]], [[TMP8]]
; PRED-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP5]]
; PRED-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
; PRED-NEXT: [[TMP9:%.*]] = call i64 @llvm.vscale.i64()
-; PRED-NEXT: [[TMP10:%.*]] = mul i64 [[TMP9]], 8
+; PRED-NEXT: [[TMP10:%.*]] = mul i64 [[TMP9]], 16
; PRED-NEXT: [[TMP11:%.*]] = call i64 @llvm.vscale.i64()
-; PRED-NEXT: [[TMP12:%.*]] = mul i64 [[TMP11]], 8
+; PRED-NEXT: [[TMP12:%.*]] = mul i64 [[TMP11]], 16
; PRED-NEXT: [[TMP13:%.*]] = sub i64 [[TMP0]], [[TMP12]]
; PRED-NEXT: [[TMP14:%.*]] = icmp ugt i64 [[TMP0]], [[TMP12]]
; PRED-NEXT: [[TMP15:%.*]] = select i1 [[TMP14]], i64 [[TMP13]], i64 0
-; PRED-NEXT: [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 8 x i1> @llvm.get.active.lane.mask.nxv8i1.i64(i64 0, i64 [[TMP0]])
-; PRED-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 8 x i32> poison, i32 [[X]], i64 0
-; PRED-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 8 x i32> [[BROADCAST_SPLATINSERT]], <vscale x 8 x i32> poison, <vscale x 8 x i32> zeroinitializer
-; PRED-NEXT: [[TMP16:%.*]] = trunc <vscale x 8 x i32> [[BROADCAST_SPLAT]] to <vscale x 8 x i16>
+; PRED-NEXT: [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 [[TMP0]])
+; PRED-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 16 x i32> poison, i32 [[X]], i64 0
+; PRED-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 16 x i32> [[BROADCAST_SPLATINSERT]], <vscale x 16 x i32> poison, <vscale x 16 x i32> zeroinitializer
+; PRED-NEXT: [[TMP16:%.*]] = trunc <vscale x 16 x i32> [[BROADCAST_SPLAT]] to <vscale x 16 x i16>
; PRED-NEXT: br label [[VECTOR_BODY:%.*]]
; PRED: vector.body:
; PRED-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
-; PRED-NEXT: [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 8 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], [[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], [[VECTOR_BODY]] ]
+; PRED-NEXT: [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 16 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], [[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], [[VECTOR_BODY]] ]
; PRED-NEXT: [[TMP17:%.*]] = add i64 [[INDEX]], 0
; PRED-NEXT: [[TMP18:%.*]] = getelementptr i8, ptr [[SRC]], i64 [[TMP17]]
; PRED-NEXT: [[TMP19:%.*]] = getelementptr i8, ptr [[TMP18]], i32 0
-; PRED-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 8 x i8> @llvm.masked.load.nxv8i8.p0(ptr [[TMP19]], i32 1, <vscale x 8 x i1> [[ACTIVE_LANE_MASK]], <vscale x 8 x i8> poison)
-; PRED-NEXT: [[TMP20:%.*]] = zext <vscale x 8 x i8> [[WIDE_MASKED_LOAD]] to <vscale x 8 x i16>
-; PRED-NEXT: [[TMP21:%.*]] = mul <vscale x 8 x i16> [[TMP20]], [[TMP16]]
-; PRED-NEXT: [[TMP22:%.*]] = zext <vscale x 8 x i8> [[WIDE_MASKED_LOAD]] to <vscale x 8 x i16>
-; PRED-NEXT: [[TMP23:%.*]] = or <vscale x 8 x i16> [[TMP21]], [[TMP22]]
-; PRED-NEXT: [[TMP24:%.*]] = lshr <vscale x 8 x i16> [[TMP23]], trunc (<vscale x 8 x i32> splat (i32 1) to <vscale x 8 x i16>)
-; PRED-NEXT: [[TMP25:%.*]] = trunc <vscale x 8 x i16> [[TMP24]] to <vscale x 8 x i8>
+; PRED-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP19]], i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], <vscale x 16 x i8> poison)
+; PRED-NEXT: [[TMP24:%.*]] = zext <vscale x 16 x i8> [[WIDE_MASKED_LOAD]] to <vscale x 16 x i16>
+; PRED-NEXT: [[TMP25:%.*]] = mul <vscale x 16 x i16> [[TMP24]], [[TMP16]]
+; PRED-NEXT: [[TMP20:%.*]] = zext <vscale x 16 x i8> [[WIDE_MASKED_LOAD]] to <vscale x 16 x i16>
+; PRED-NEXT: [[TMP21:%.*]] = or <vscale x 16 x i16> [[TMP25]], [[TMP20]]
+; PRED-NEXT: [[TMP22:%.*]] = lshr <vscale x 16 x i16> [[TMP21]], trunc (<vscale x 16 x i32> splat (i32 1) to <vscale x 16 x i16>)
+; PRED-NEXT: [[TMP23:%.*]] = trunc <vscale x 16 x i16> [[TMP22]] to <vscale x 16 x i8>
; PRED-NEXT: [[TMP26:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP17]]
; PRED-NEXT: [[TMP27:%.*]] = getelementptr i8, ptr [[TMP26]], i32 0
-; PRED-NEXT: call void @llvm.masked.store.nxv8i8.p0(<vscale x 8 x i8> [[TMP25]], ptr [[TMP27]], i32 1, <vscale x 8 x i1> [[ACTIVE_LANE_MASK]])
+; PRED-NEXT: call void @llvm.masked.store.nxv16i8.p0(<vscale x 16 x i8> [[TMP23]], ptr [[TMP27]], i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]])
; PRED-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP10]]
-; PRED-NEXT: [[ACTIVE_LANE_MASK_NEXT]] = call <vscale x 8 x i1> @llvm.get.active.lane.mask.nxv8i1.i64(i64 [[INDEX]], i64 [[TMP15]])
-; PRED-NEXT: [[TMP28:%.*]] = xor <vscale x 8 x i1> [[ACTIVE_LANE_MASK_NEXT]], splat (i1 true)
-; PRED-NEXT: [[TMP29:%.*]] = extractelement <vscale x 8 x i1> [[TMP28]], i32 0
+; PRED-NEXT: [[ACTIVE_LANE_MASK_NEXT]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 [[INDEX]], i64 [[TMP15]])
+; PRED-NEXT: [[TMP28:%.*]] = xor <vscale x 16 x i1> [[ACTIVE_LANE_MASK_NEXT]], splat (i1 true)
+; PRED-NEXT: [[TMP29:%.*]] = extractelement <vscale x 16 x i1> [[TMP28]], i32 0
; PRED-NEXT: br i1 [[TMP29]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
; PRED: middle.block:
; PRED-NEXT: br i1 true, label [[EXIT:%.*]], label [[SCALAR_PH]]
@@ -233,7 +277,7 @@ define void @iv_trunc(i32 %x, ptr %dst, i64 %N) #0 {
; DEFAULT-NEXT: store i32 1, ptr [[TMP21]], align 4
; DEFAULT-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 2
; DEFAULT-NEXT: [[TMP22:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
-; DEFAULT-NEXT: br i1 [[TMP22]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
+; DEFAULT-NEXT: br i1 [[TMP22]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP5:![0-9]+]]
; DEFAULT: middle.block:
; DEFAULT-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[TMP0]], [[N_VEC]]
; DEFAULT-NEXT: br i1 [[CMP_N]], label [[EXIT:%.*]], label [[SCALAR_PH]]
@@ -249,7 +293,7 @@ define void @iv_trunc(i32 %x, ptr %dst, i64 %N) #0 {
; DEFAULT-NEXT: store i32 1, ptr [[GEP]], align 4
; DEFAULT-NEXT: [[IV_NEXT]] = add i64 [[IV]], 1
; DEFAULT-NEXT: [[EC:%.*]] = icmp eq i64 [[IV]], [[N]]
-; DEFAULT-NEXT: br i1 [[EC]], label [[EXIT]], label [[FOR_BODY]], !llvm.loop [[LOOP5:![0-9]+]]
+; DEFAULT-NEXT: br i1 [[EC]], label [[EXIT]], label [[FOR_BODY]], !llvm.loop [[LOOP6:![0-9]+]]
; DEFAULT: exit:
; DEFAULT-NEXT: ret void
;
@@ -277,44 +321,60 @@ define void @iv_trunc(i32 %x, ptr %dst, i64 %N) #0 {
; PRED-NEXT: [[TMP12:%.*]] = or i1 [[TMP8]], [[TMP11]]
; PRED-NEXT: br i1 [[TMP12]], label [[SCALAR_PH]], label [[VECTOR_PH:%.*]]
; PRED: vector.ph:
-; PRED-NEXT: [[N_RND_UP:%.*]] = add i64 [[TMP0]], 1
-; PRED-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], 2
+; PRED-NEXT: [[N_RND_UP:%.*]] = add i64 [[TMP0]], 3
+; PRED-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], 4
; PRED-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
-; PRED-NEXT: [[TMP13:%.*]] = sub i64 [[TMP0]], 2
-; PRED-NEXT: [[TMP14:%.*]] = icmp ugt i64 [[TMP0]], 2
+; PRED-NEXT: [[TMP13:%.*]] = sub i64 [[TMP0]], 4
+; PRED-NEXT: [[TMP14:%.*]] = icmp ugt i64 [[TMP0]], 4
; PRED-NEXT: [[TMP15:%.*]] = select i1 [[TMP14]], i64 [[TMP13]], i64 0
-; PRED-NEXT: [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <2 x i1> @llvm.get.active.lane.mask.v2i1.i64(i64 0, i64 [[TMP0]])
-; PRED-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <2 x i32> poison, i32 [[MUL_X]], i64 0
-; PRED-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <2 x i32> [[BROADCAST_SPLATINSERT]], <2 x i32> poison, <2 x i32> zeroinitializer
+; PRED-NEXT: [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i64(i64 0, i64 [[TMP0]])
+; PRED-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i32> poison, i32 [[MUL_X]], i64 0
+; PRED-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT]], <4 x i32> poison, <4 x i32> zeroinitializer
; PRED-NEXT: br label [[VECTOR_BODY:%.*]]
; PRED: vector.body:
; PRED-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[PRED_STORE_CONTINUE2:%.*]] ]
-; PRED-NEXT: [[ACTIVE_LANE_MASK:%.*]] = phi <2 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], [[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], [[PRED_STORE_CONTINUE2]] ]
-; PRED-NEXT: [[VEC_IND:%.*]] = phi <2 x i32> [ <i32 0, i32 1>, [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[PRED_STORE_CONTINUE2]] ]
-; PRED-NEXT: [[TMP16:%.*]] = mul <2 x i32> [[BROADCAST_SPLAT]], [[VEC_IND]]
-; PRED-NEXT: [[TMP17:%.*]] = zext <2 x i32> [[TMP16]] to <2 x i64>
-; PRED-NEXT: [[TMP18:%.*]] = extractelement <2 x i1> [[ACTIVE_LANE_MASK]], i32 0
+; PRED-NEXT: [[ACTIVE_LANE_MASK:%.*]] = phi <4 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], [[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], [[PRED_STORE_CONTINUE2]] ]
+; PRED-NEXT: [[VEC_IND:%.*]] = phi <4 x i32> [ <i32 0, i32 1, i32 2, i32 3>, [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[PRED_STORE_CONTINUE2]] ]
+; PRED-NEXT: [[TMP16:%.*]] = mul <4 x i32> [[BROADCAST_SPLAT]], [[VEC_IND]]
+; PRED-NEXT: [[TMP17:%.*]] = zext <4 x i32> [[TMP16]] to <4 x i64>
+; PRED-NEXT: [[TMP18:%.*]] = extractelement <4 x i1> [[ACTIVE_LANE_MASK]], i32 0
; PRED-NEXT: br i1 [[TMP18]], label [[PRED_STORE_IF:%.*]], label [[PRED_STORE_CONTINUE:%.*]]
; PRED: pred.store.if:
-; PRED-NEXT: [[TMP19:%.*]] = extractelement <2 x i64> [[TMP17]], i32 0
+; PRED-NEXT: [[TMP19:%.*]] = extractelement <4 x i64> [[TMP17]], i32 0
; PRED-NEXT: [[TMP20:%.*]] = getelementptr i32, ptr [[DST]], i64 [[TMP19]]
; PRED-NEXT: store i32 1, ptr [[TMP20]], align 4
; PRED-NEXT: br label [[PRED_STORE_CONTINUE]]
; PRED: pred.store.continue:
-; PRED-NEXT: [[TMP21:%.*]] = extractelement <2 x i1> [[ACTIVE_LANE_MASK]], i32 1
-; PRED-NEXT: br i1 [[TMP21]], label [[PRED_STORE_IF1:%.*]], label [[PRED_STORE_CONTINUE2]]
+; PRED-NEXT: [[TMP21:%.*]] = extractelement <4 x i1> [[ACTIVE_LANE_MASK]], i32 1
+; PRED-NEXT: br i1 [[TMP21]], label [[PRED_STORE_IF1:%.*]], label [[PRED_STORE_CONTINUE3:%.*]]
; PRED: pred.store.if1:
-; PRED-NEXT: [[TMP22:%.*]] = extractelement <2 x i64> [[TMP17]], i32 1
+; PRED-NEXT: [[TMP22:%.*]] = extractelement <4 x i64> [[TMP17]], i32 1
; PRED-NEXT: [[TMP23:%.*]] = getelementptr i32, ptr [[DST]], i64 [[TMP22]]
; PRED-NEXT: store i32 1, ptr [[TMP23]], align 4
-; PRED-NEXT: br label [[PRED_STORE_CONTINUE2]]
+; PRED-NEXT: br label [[PRED_STORE_CONTINUE3]]
; PRED: pred.store.continue2:
-; PRED-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 2
-; PRED-NEXT: [[ACTIVE_LANE_MASK_NEXT]] = call <2 x i1> @llvm.get.active.lane.mask.v2i1.i64(i64 [[INDEX]], i64 [[TMP15]])
-; PRED-NEXT: [[TMP24:%.*]] = xor <2 x i1> [[ACTIVE_LANE_MASK_NEXT]], splat (i1 true)
-; PRED-NEXT: [[VEC_IND_NEXT]] = add <2 x i32> [[VEC_IND]], splat (i32 2)
-; PRED-NEXT: [[TMP25:%.*]] = extractelement <2 x i1> [[TMP24]], i32 0
-; PRED-NEXT: br i1 [[TMP25]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
+; PRED-NEXT: [[TMP24:%.*]] = extractelement <4 x i1> [[ACTIVE_LANE_MASK]], i32 2
+; PRED-NEXT: br i1 [[TMP24]], label [[PRED_STORE_IF3:%.*]], label [[PRED_STORE_CONTINUE4:%.*]]
+; PRED: pred.store.if3:
+; PRED-NEXT: [[TMP25:%.*]] = extractelement <4 x i64> [[TMP17]], i32 2
+; PRED-NEXT: [[TMP26:%.*]] = getelementptr i32, ptr [[DST]], i64 [[TMP25]]
+; PRED-NEXT: store i32 1, ptr [[TMP26]], align 4
+; PRED-NEXT: br label [[PRED_STORE_CONTINUE4]]
+; PRED: pred.store.continue4:
+; PRED-NEXT: [[TMP27:%.*]] = extractelement <4 x i1> [[ACTIVE_LANE_MASK]], i32 3
+; PRED-NEXT: br i1 [[TMP27]], label [[PRED_STORE_IF5:%.*]], label [[PRED_STORE_CONTINUE2]]
+; PRED: pred.store.if5:
+; PRED-NEXT: [[TMP28:%.*]] = extractelement <4 x i64> [[TMP17]], i32 3
+; PRED-NEXT: [[TMP29:%.*]] = getelementptr i32, ptr [[DST]], i64 [[TMP28]]
+; PRED-NEXT: store i32 1, ptr [[TMP29]], align 4
+; PRED-NEXT: br label [[PRED_STORE_CONTINUE2]]
+; PRED: pred.store.continue6:
+; PRED-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 4
+; PRED-NEXT: [[ACTIVE_LANE_MASK_NEXT]] = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i64(i64 [[INDEX]], i64 [[TMP15]])
+; PRED-NEXT: [[TMP30:%.*]] = xor <4 x i1> [[ACTIVE_LANE_MASK_NEXT]], splat (i1 true)
+; PRED-NEXT: [[VEC_IND_NEXT]] = add <4 x i32> [[VEC_IND]], splat (i32 4)
+; PRED-NEXT: [[TMP31:%.*]] = extractelement <4 x i1> [[TMP30]], i32 0
+; PRED-NEXT: br i1 [[TMP31]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
; PRED: middle.block:
; PRED-NEXT: br i1 true, label [[EXIT:%.*]], label [[SCALAR_PH]]
; PRED: scalar.ph:
@@ -402,7 +462,7 @@ define void @trunc_ivs_and_store(i32 %x, ptr %dst, i64 %N) #0 {
; DEFAULT-NEXT: store i32 [[TMP15]], ptr [[TMP24]], align 4
; DEFAULT-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 2
; DEFAULT-NEXT: [[TMP25:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
-; DEFAULT-NEXT: br i1 [[TMP25]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP6:![0-9]+]]
+; DEFAULT-NEXT: br i1 [[TMP25]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP7:![0-9]+]]
; DEFAULT: middle.block:
; DEFAULT-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[TMP0]], [[N_VEC]]
; DEFAULT-NEXT: br i1 [[CMP_N]], label [[EXIT:%.*]], label [[SCALAR_PH]]
@@ -421,7 +481,7 @@ define void @trunc_ivs_and_store(i32 %x, ptr %dst, i64 %N) #0 {
; DEFAULT-NEXT: store i32 [[IV_2]], ptr [[GEP]], align 4
; DEFAULT-NEXT: [[IV_1_NEXT]] = add i64 [[IV_1]], 1
; DEFAULT-NEXT: [[EXITCOND_3_NOT:%.*]] = icmp eq i64 [[IV_1]], [[N]]
-; DEFAULT-NEXT: br i1 [[EXITCOND_3_NOT]], label [[EXIT]], label [[LOOP]], !llvm.loop [[LOOP7:![0-9]+]]
+; DEFAULT-NEXT: br i1 [[EXITCOND_3_NOT]], label [[EXIT]], label [[LOOP]], !llvm.loop [[LOOP8:![0-9]+]]
; DEFAULT: exit:
; DEFAULT-NEXT: ret void
;
@@ -600,7 +660,7 @@ define void @ivs_trunc_and_ext(i32 %x, ptr %dst, i64 %N) #0 {
; DEFAULT-NEXT: store i32 [[TMP14]], ptr [[TMP23]], align 4
; DEFAULT-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 2
; DEFAULT-NEXT: [[TMP24:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
-; DEFAULT-NEXT: br i1 [[TMP24]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP8:![0-9]+]]
+; DEFAULT-NEXT: br i1 [[TMP24]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP9:![0-9]+]]
; DEFAULT: middle.block:
; DEFAULT-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[TMP0]], [[N_VEC]]
; DEFAULT-NEXT: br i1 [[CMP_N]], label [[EXIT:%.*]], label [[SCALAR_PH]]
@@ -619,7 +679,7 @@ define void @ivs_trunc_and_ext(i32 %x, ptr %dst, i64 %N) #0 {
; DEFAULT-NEXT: store i32 [[IV_2]], ptr [[GEP]], align 4
; DEFAULT-NEXT: [[IV_1_NEXT]] = add i64 [[IV_1]], 1
; DEFAULT-NEXT: [[EC:%.*]] = icmp eq i64 [[IV_1]], [[N]]
-; DEFAULT-NEXT: br i1 [[EC]], label [[EXIT]], label [[LOOP]], !llvm.loop [[LOOP9:![0-9]+]]
+; DEFAULT-NEXT: br i1 [[EC]], label [[EXIT]], label [[LOOP]], !llvm.loop [[LOOP10:![0-9]+]]
; DEFAULT: exit:
; DEFAULT-NEXT: ret void
;
@@ -781,7 +841,7 @@ define void @exit_cond_zext_iv(ptr %dst, i64 %N) {
; DEFAULT-NEXT: store i32 0, ptr [[TMP10]], align 8
; DEFAULT-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 2
; DEFAULT-NEXT: [[TMP11:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
-; DEFAULT-NEXT: br i1 [[TMP11]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP10:![0-9]+]]
+; DEFAULT-NEXT: br i1 [[TMP11]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP11:![0-9]+]]
; DEFAULT: middle.block:
; DEFAULT-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[UMAX1]], [[N_VEC]]
; DEFAULT-NEXT: br i1 [[CMP_N]], label [[EXIT:%.*]], label [[SCALAR_PH]]
@@ -797,7 +857,7 @@ define void @exit_cond_zext_iv(ptr %dst, i64 %N) {
; DEFAULT-NEXT: [[IV_1_NEXT]] = add i32 [[IV_1]], 1
; DEFAULT-NEXT: [[IV_EXT]] = zext i32 [[IV_1_NEXT]] to i64
; DEFAULT-NEXT: [[C:%.*]] = icmp ult i64 [[IV_EXT]], [[N]]
-; DEFAULT-NEXT: br i1 [[C]], label [[LOOP]], label [[EXIT]], !llvm.loop [[LOOP11:![0-9]+]]
+; DEFAULT-NEXT: br i1 [[C]], label [[LOOP]], label [[EXIT]], !llvm.loop [[LOOP12:![0-9]+]]
; DEFAULT: exit:
; DEFAULT-NEXT: ret void
;
@@ -816,21 +876,21 @@ define void @exit_cond_zext_iv(ptr %dst, i64 %N) {
; PRED-NEXT: [[TMP6:%.*]] = or i1 [[TMP4]], [[TMP5]]
; PRED-NEXT: br i1 [[TMP6]], label [[SCALAR_PH]], label [[VECTOR_PH:%.*]]
; PRED: vector.ph:
-; PRED-NEXT: [[N_RND_UP:%.*]] = add i64 [[UMAX1]], 1
-; PRED-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], 2
+; PRED-NEXT: [[N_RND_UP:%.*]] = add i64 [[UMAX1]], 3
+; PRED-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], 4
; PRED-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
; PRED-NEXT: [[TRIP_COUNT_MINUS_1:%.*]] = sub i64 [[UMAX1]], 1
; PRED-NEXT: [[IND_END:%.*]] = trunc i64 [[N_VEC]] to i32
-; PRED-NEXT: [[BROADCAST_SPLATINSERT2:%.*]] = insertelement <2 x i64> poison, i64 [[TRIP_COUNT_MINUS_1]], i64 0
-; PRED-NEXT: [[BROADCAST_SPLAT3:%.*]] = shufflevector <2 x i64> [[BROADCAST_SPLATINSERT2]], <2 x i64> poison, <2 x i32> zeroinitializer
+; PRED-NEXT: [[BROADCAST_SPLATINSERT2:%.*]] = insertelement <4 x i64> poison, i64 [[TRIP_COUNT_MINUS_1]], i64 0
+; PRED-NEXT: [[BROADCAST_SPLAT3:%.*]] = shufflevector <4 x i64> [[BROADCAST_SPLATINSERT2]], <4 x i64> poison, <4 x i32> zeroinitializer
; PRED-NEXT: br label [[VECTOR_BODY:%.*]]
; PRED: vector.body:
; PRED-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[PRED_STORE_CONTINUE5:%.*]] ]
-; PRED-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <2 x i64> poison, i64 [[INDEX]], i64 0
-; PRED-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <2 x i64> [[BROADCAST_SPLATINSERT]], <2 x i64> poison, <2 x i32> zeroinitializer
-; PRED-NEXT: [[VEC_IV:%.*]] = add <2 x i64> [[BROADCAST_SPLAT]], <i64 0, i64 1>
-; PRED-NEXT: [[TMP7:%.*]] = icmp ule <2 x i64> [[VEC_IV]], [[BROADCAST_SPLAT3]]
-; PRED-NEXT: [[TMP8:%.*]] = extractelement <2 x i1> [[TMP7]], i32 0
+; PRED-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i64> poison, i64 [[INDEX]], i64 0
+; PRED-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i64> [[BROADCAST_SPLATINSERT]], <4 x i64> poison, <4 x i32> zeroinitializer
+; PRED-NEXT: [[VEC_IV:%.*]] = add <4 x i64> [[BROADCAST_SPLAT]], <i64 0, i64 1, i64 2, i64 3>
+; PRED-NEXT: [[TMP7:%.*]] = icmp ule <4 x i64> [[VEC_IV]], [[BROADCAST_SPLAT3]]
+; PRED-NEXT: [[TMP8:%.*]] = extractelement <4 x i1> [[TMP7]], i32 0
; PRED-NEXT: br i1 [[TMP8]], label [[PRED_STORE_IF:%.*]], label [[PRED_STORE_CONTINUE:%.*]]
; PRED: pred.store.if:
; PRED-NEXT: [[TMP9:%.*]] = add i64 [[INDEX]], 0
@@ -838,15 +898,31 @@ define void @exit_cond_zext_iv(ptr %dst, i64 %N) {
; PRED-NEXT: store i32 0, ptr [[TMP10]], align 8
; PRED-NEXT: br label [[PRED_STORE_CONTINUE]]
; PRED: pred.store.continue:
-; PRED-NEXT: [[TMP11:%.*]] = extractelement <2 x i1> [[TMP7]], i32 1
-; PRED-NEXT: br i1 [[TMP11]], label [[PRED_STORE_IF4:%.*]], label [[PRED_STORE_CONTINUE5]]
+; PRED-NEXT: [[TMP11:%.*]] = extractelement <4 x i1> [[TMP7]], i32 1
+; PRED-NEXT: br i1 [[TMP11]], label [[PRED_STORE_IF4:%.*]], label [[PRED_STORE_CONTINUE6:%.*]]
; PRED: pred.store.if4:
; PRED-NEXT: [[TMP12:%.*]] = add i64 [[INDEX]], 1
; PRED-NEXT: [[TMP13:%.*]] = getelementptr { [100 x i32], i32, i32 }, ptr [[DST]], i64 [[TMP12]], i32 2
; PRED-NEXT: store i32 0, ptr [[TMP13]], align 8
-; PRED-NEXT: br label [[PRED_STORE_CONTINUE5]]
+; PRED-NEXT: br label [[PRED_STORE_CONTINUE6]]
; PRED: pred.store.continue5:
-; PRED-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 2
+; PRED-NEXT: [[TMP20:%.*]] = extractelement <4 x i1> [[TMP7]], i32 2
+; PRED-NEXT: br i1 [[TMP20]], label [[PRED_STORE_IF6:%.*]], label [[PRED_STORE_CONTINUE7:%.*]]
+; PRED: pred.store.if6:
+; PRED-NEXT: [[TMP15:%.*]] = add i64 [[INDEX]], 2
+; PRED-NEXT: [[TMP16:%.*]] = getelementptr { [100 x i32], i32, i32 }, ptr [[DST]], i64 [[TMP15]], i32 2
+; PRED-NEXT: store i32 0, ptr [[TMP16]], align 8
+; PRED-NEXT: br label [[PRED_STORE_CONTINUE7]]
+; PRED: pred.store.continue7:
+; PRED-NEXT: [[TMP17:%.*]] = extractelement <4 x i1> [[TMP7]], i32 3
+; PRED-NEXT: br i1 [[TMP17]], label [[PRED_STORE_IF8:%.*]], label [[PRED_STORE_CONTINUE5]]
+; PRED: pred.store.if8:
+; PRED-NEXT: [[TMP18:%.*]] = add i64 [[INDEX]], 3
+; PRED-NEXT: [[TMP19:%.*]] = getelementptr { [100 x i32], i32, i32 }, ptr [[DST]], i64 [[TMP18]], i32 2
+; PRED-NEXT: store i32 0, ptr [[TMP19]], align 8
+; PRED-NEXT: br label [[PRED_STORE_CONTINUE5]]
+; PRED: pred.store.continue9:
+; PRED-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 4
; PRED-NEXT: [[TMP14:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
; PRED-NEXT: br i1 [[TMP14]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP10:![0-9]+]]
; PRED: middle.block:
@@ -890,15 +966,16 @@ attributes #0 = { "target-features"="+sve" }
; DEFAULT: [[LOOP0]] = distinct !{[[LOOP0]], [[META1:![0-9]+]], [[META2:![0-9]+]]}
; DEFAULT: [[META1]] = !{!"llvm.loop.isvectorized", i32 1}
; DEFAULT: [[META2]] = !{!"llvm.loop.unroll.runtime.disable"}
-; DEFAULT: [[LOOP3]] = distinct !{[[LOOP3]], [[META1]]}
-; DEFAULT: [[LOOP4]] = distinct !{[[LOOP4]], [[META1]], [[META2]]}
-; DEFAULT: [[LOOP5]] = distinct !{[[LOOP5]], [[META1]]}
-; DEFAULT: [[LOOP6]] = distinct !{[[LOOP6]], [[META1]], [[META2]]}
-; DEFAULT: [[LOOP7]] = distinct !{[[LOOP7]], [[META1]]}
-; DEFAULT: [[LOOP8]] = distinct !{[[LOOP8]], [[META1]], [[META2]]}
-; DEFAULT: [[LOOP9]] = distinct !{[[LOOP9]], [[META1]]}
-; DEFAULT: [[LOOP10]] = distinct !{[[LOOP10]], [[META1]], [[META2]]}
-; DEFAULT: [[LOOP11]] = distinct !{[[LOOP11]], [[META1]]}
+; DEFAULT: [[LOOP3]] = distinct !{[[LOOP3]], [[META1]], [[META2]]}
+; DEFAULT: [[LOOP4]] = distinct !{[[LOOP4]], [[META1]]}
+; DEFAULT: [[LOOP5]] = distinct !{[[LOOP5]], [[META1]], [[META2]]}
+; DEFAULT: [[LOOP6]] = distinct !{[[LOOP6]], [[META1]]}
+; DEFAULT: [[LOOP7]] = distinct !{[[LOOP7]], [[META1]], [[META2]]}
+; DEFAULT: [[LOOP8]] = distinct !{[[LOOP8]], [[META1]]}
+; DEFAULT: [[LOOP9]] = distinct !{[[LOOP9]], [[META1]], [[META2]]}
+; DEFAULT: [[LOOP10]] = distinct !{[[LOOP10]], [[META1]]}
+; DEFAULT: [[LOOP11]] = distinct !{[[LOOP11]], [[META1]], [[META2]]}
+; DEFAULT: [[LOOP12]] = distinct !{[[LOOP12]], [[META1]]}
;.
; PRED: [[LOOP0]] = distinct !{[[LOOP0]], [[META1:![0-9]+]], [[META2:![0-9]+]]}
; PRED: [[META1]] = !{!"llvm.loop.isvectorized", i32 1}
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/induction-costs.ll b/llvm/test/Transforms/LoopVectorize/AArch64/induction-costs.ll
index f9cc195e367021..b4979057721bf6 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/induction-costs.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/induction-costs.ll
@@ -96,29 +96,29 @@ define i64 @pointer_induction_only(ptr %start, ptr %end) {
; CHECK-NEXT: [[TMP0:%.*]] = sub i64 [[END1]], [[START2]]
; CHECK-NEXT: [[TMP1:%.*]] = lshr i64 [[TMP0]], 2
; CHECK-NEXT: [[TMP2:%.*]] = add nuw nsw i64 [[TMP1]], 1
-; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP2]], 4
+; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP2]], 8
; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
; CHECK: vector.ph:
-; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[TMP2]], 4
+; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[TMP2]], 8
; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[TMP2]], [[N_MOD_VF]]
; CHECK-NEXT: [[TMP3:%.*]] = mul i64 [[N_VEC]], 4
; CHECK-NEXT: [[IND_END:%.*]] = getelementptr i8, ptr [[START]], i64 [[TMP3]]
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:
; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
-; CHECK-NEXT: [[VECTOR_RECUR:%.*]] = phi <2 x i64> [ <i64 poison, i64 0>, [[VECTOR_PH]] ], [ [[TMP9:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT: [[VECTOR_RECUR:%.*]] = phi <4 x i64> [ <i64 poison, i64 poison, i64 poison, i64 0>, [[VECTOR_PH]] ], [ [[TMP7:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[OFFSET_IDX:%.*]] = mul i64 [[INDEX]], 4
; CHECK-NEXT: [[TMP4:%.*]] = add i64 [[OFFSET_IDX]], 0
; CHECK-NEXT: [[NEXT_GEP:%.*]] = getelementptr i8, ptr [[START]], i64 [[TMP4]]
-; CHECK-NEXT: [[TMP7:%.*]] = getelementptr i32, ptr [[NEXT_GEP]], i32 2
-; CHECK-NEXT: [[WIDE_LOAD4:%.*]] = load <2 x i32>, ptr [[TMP7]], align 1
-; CHECK-NEXT: [[TMP9]] = zext <2 x i32> [[WIDE_LOAD4]] to <2 x i64>
-; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4
+; CHECK-NEXT: [[TMP6:%.*]] = getelementptr i32, ptr [[NEXT_GEP]], i32 4
+; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i32>, ptr [[TMP6]], align 1
+; CHECK-NEXT: [[TMP7]] = zext <4 x i32> [[WIDE_LOAD]] to <4 x i64>
+; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 8
; CHECK-NEXT: [[TMP12:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
; CHECK-NEXT: br i1 [[TMP12]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
; CHECK: middle.block:
-; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT_FOR_PHI:%.*]] = extractelement <2 x i64> [[TMP9]], i32 0
-; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <2 x i64> [[TMP9]], i32 1
+; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT_FOR_PHI:%.*]] = extractelement <4 x i64> [[TMP7]], i32 2
+; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <4 x i64> [[TMP7]], i32 3
; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[TMP2]], [[N_VEC]]
; CHECK-NEXT: br i1 [[CMP_N]], label [[EXIT:%.*]], label [[SCALAR_PH]]
; CHECK: scalar.ph:
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/reduction-recurrence-costs-sve.ll b/llvm/test/Transforms/LoopVectorize/AArch64/reduction-recurrence-costs-sve.ll
index 3d4f7e0e4924bc..6dab97e54087f8 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/reduction-recurrence-costs-sve.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/reduction-recurrence-costs-sve.ll
@@ -345,41 +345,41 @@ define i16 @reduce_udiv(ptr %src, i16 %x, i64 %N) #0 {
; DEFAULT-NEXT: entry:
; DEFAULT-NEXT: [[TMP0:%.*]] = add i64 [[N]], 1
; DEFAULT-NEXT: [[TMP1:%.*]] = call i64 @llvm.vscale.i64()
-; DEFAULT-NEXT: [[TMP2:%.*]] = mul i64 [[TMP1]], 8
+; DEFAULT-NEXT: [[TMP2:%.*]] = mul i64 [[TMP1]], 16
; DEFAULT-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP0]], [[TMP2]]
; DEFAULT-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
; DEFAULT: vector.ph:
; DEFAULT-NEXT: [[TMP3:%.*]] = call i64 @llvm.vscale.i64()
-; DEFAULT-NEXT: [[TMP4:%.*]] = mul i64 [[TMP3]], 8
+; DEFAULT-NEXT: [[TMP4:%.*]] = mul i64 [[TMP3]], 16
; DEFAULT-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[TMP0]], [[TMP4]]
; DEFAULT-NEXT: [[N_VEC:%.*]] = sub i64 [[TMP0]], [[N_MOD_VF]]
; DEFAULT-NEXT: [[TMP5:%.*]] = call i64 @llvm.vscale.i64()
-; DEFAULT-NEXT: [[TMP6:%.*]] = mul i64 [[TMP5]], 8
-; DEFAULT-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i16> poison, i16 [[X]], i64 0
-; DEFAULT-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i16> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i16> poison, <vscale x 4 x i32> zeroinitializer
+; DEFAULT-NEXT: [[TMP6:%.*]] = mul i64 [[TMP5]], 16
+; DEFAULT-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 8 x i16> poison, i16 [[X]], i64 0
+; DEFAULT-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 8 x i16> [[BROADCAST_SPLATINSERT]], <vscale x 8 x i16> poison, <vscale x 8 x i32> zeroinitializer
; DEFAULT-NEXT: br label [[VECTOR_BODY:%.*]]
; DEFAULT: vector.body:
; DEFAULT-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
-; DEFAULT-NEXT: [[VEC_PHI:%.*]] = phi <vscale x 4 x i16> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP21:%.*]], [[VECTOR_BODY]] ]
-; DEFAULT-NEXT: [[VEC_PHI1:%.*]] = phi <vscale x 4 x i16> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP22:%.*]], [[VECTOR_BODY]] ]
+; DEFAULT-NEXT: [[VEC_PHI:%.*]] = phi <vscale x 8 x i16> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP19:%.*]], [[VECTOR_BODY]] ]
+; DEFAULT-NEXT: [[VEC_PHI1:%.*]] = phi <vscale x 8 x i16> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP20:%.*]], [[VECTOR_BODY]] ]
; DEFAULT-NEXT: [[TMP7:%.*]] = add i64 [[INDEX]], 0
; DEFAULT-NEXT: [[TMP13:%.*]] = getelementptr i16, ptr [[SRC]], i64 [[TMP7]]
; DEFAULT-NEXT: [[TMP15:%.*]] = getelementptr i16, ptr [[TMP13]], i32 0
; DEFAULT-NEXT: [[TMP16:%.*]] = call i64 @llvm.vscale.i64()
-; DEFAULT-NEXT: [[TMP17:%.*]] = mul i64 [[TMP16]], 4
+; DEFAULT-NEXT: [[TMP17:%.*]] = mul i64 [[TMP16]], 8
; DEFAULT-NEXT: [[TMP18:%.*]] = getelementptr i16, ptr [[TMP13]], i64 [[TMP17]]
-; DEFAULT-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 4 x i16>, ptr [[TMP15]], align 2
-; DEFAULT-NEXT: [[WIDE_LOAD2:%.*]] = load <vscale x 4 x i16>, ptr [[TMP18]], align 2
-; DEFAULT-NEXT: [[TMP19:%.*]] = udiv <vscale x 4 x i16> [[WIDE_LOAD]], [[BROADCAST_SPLAT]]
-; DEFAULT-NEXT: [[TMP20:%.*]] = udiv <vscale x 4 x i16> [[WIDE_LOAD2]], [[BROADCAST_SPLAT]]
-; DEFAULT-NEXT: [[TMP21]] = or <vscale x 4 x i16> [[TMP19]], [[VEC_PHI]]
-; DEFAULT-NEXT: [[TMP22]] = or <vscale x 4 x i16> [[TMP20]], [[VEC_PHI1]]
+; DEFAULT-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 8 x i16>, ptr [[TMP15]], align 2
+; DEFAULT-NEXT: [[WIDE_LOAD2:%.*]] = load <vscale x 8 x i16>, ptr [[TMP18]], align 2
+; DEFAULT-NEXT: [[TMP21:%.*]] = udiv <vscale x 8 x i16> [[WIDE_LOAD]], [[BROADCAST_SPLAT]]
+; DEFAULT-NEXT: [[TMP14:%.*]] = udiv <vscale x 8 x i16> [[WIDE_LOAD2]], [[BROADCAST_SPLAT]]
+; DEFAULT-NEXT: [[TMP19]] = or <vscale x 8 x i16> [[TMP21]], [[VEC_PHI]]
+; DEFAULT-NEXT: [[TMP20]] = or <vscale x 8 x i16> [[TMP14]], [[VEC_PHI1]]
; DEFAULT-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP6]]
; DEFAULT-NEXT: [[TMP23:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
; DEFAULT-NEXT: br i1 [[TMP23]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
; DEFAULT: middle.block:
-; DEFAULT-NEXT: [[BIN_RDX:%.*]] = or <vscale x 4 x i16> [[TMP22]], [[TMP21]]
-; DEFAULT-NEXT: [[TMP24:%.*]] = call i16 @llvm.vector.reduce.or.nxv4i16(<vscale x 4 x i16> [[BIN_RDX]])
+; DEFAULT-NEXT: [[BIN_RDX:%.*]] = or <vscale x 8 x i16> [[TMP20]], [[TMP19]]
+; DEFAULT-NEXT: [[TMP24:%.*]] = call i16 @llvm.vector.reduce.or.nxv8i16(<vscale x 8 x i16> [[BIN_RDX]])
; DEFAULT-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[TMP0]], [[N_VEC]]
; DEFAULT-NEXT: br i1 [[CMP_N]], label [[EXIT:%.*]], label [[SCALAR_PH]]
; DEFAULT: scalar.ph:
@@ -402,49 +402,89 @@ define i16 @reduce_udiv(ptr %src, i16 %x, i64 %N) #0 {
;
; VSCALEFORTUNING2-LABEL: define i16 @reduce_udiv(
; VSCALEFORTUNING2-SAME: ptr [[SRC:%.*]], i16 [[X:%.*]], i64 [[N:%.*]]) #[[ATTR0]] {
-; VSCALEFORTUNING2-NEXT: entry:
+; VSCALEFORTUNING2-NEXT: iter.check:
; VSCALEFORTUNING2-NEXT: [[TMP0:%.*]] = add i64 [[N]], 1
; VSCALEFORTUNING2-NEXT: [[TMP1:%.*]] = call i64 @llvm.vscale.i64()
-; VSCALEFORTUNING2-NEXT: [[TMP2:%.*]] = mul i64 [[TMP1]], 8
+; VSCALEFORTUNING2-NEXT: [[TMP2:%.*]] = mul i64 [[TMP1]], 4
; VSCALEFORTUNING2-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP0]], [[TMP2]]
; VSCALEFORTUNING2-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
+; VSCALEFORTUNING2: vector.main.loop.iter.check:
+; VSCALEFORTUNING2-NEXT: [[TMP6:%.*]] = call i64 @llvm.vscale.i64()
+; VSCALEFORTUNING2-NEXT: [[TMP13:%.*]] = mul i64 [[TMP6]], 16
+; VSCALEFORTUNING2-NEXT: [[MIN_ITERS_CHECK1:%.*]] = icmp ult i64 [[TMP0]], [[TMP13]]
+; VSCALEFORTUNING2-NEXT: br i1 [[MIN_ITERS_CHECK1]], label [[VEC_EPILOG_PH:%.*]], label [[VECTOR_PH1:%.*]]
; VSCALEFORTUNING2: vector.ph:
; VSCALEFORTUNING2-NEXT: [[TMP3:%.*]] = call i64 @llvm.vscale.i64()
-; VSCALEFORTUNING2-NEXT: [[TMP4:%.*]] = mul i64 [[TMP3]], 8
+; VSCALEFORTUNING2-NEXT: [[TMP4:%.*]] = mul i64 [[TMP3]], 16
; VSCALEFORTUNING2-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[TMP0]], [[TMP4]]
; VSCALEFORTUNING2-NEXT: [[N_VEC:%.*]] = sub i64 [[TMP0]], [[N_MOD_VF]]
; VSCALEFORTUNING2-NEXT: [[TMP5:%.*]] = call i64 @llvm.vscale.i64()
-; VSCALEFORTUNING2-NEXT: [[TMP6:%.*]] = mul i64 [[TMP5]], 8
-; VSCALEFORTUNING2-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i16> poison, i16 [[X]], i64 0
-; VSCALEFORTUNING2-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i16> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i16> poison, <vscale x 4 x i32> zeroinitializer
+; VSCALEFORTUNING2-NEXT: [[TMP31:%.*]] = mul i64 [[TMP5]], 16
+; VSCALEFORTUNING2-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 8 x i16> poison, i16 [[X]], i64 0
+; VSCALEFORTUNING2-NEXT: [[BROADCAST_SPLAT1:%.*]] = shufflevector <vscale x 8 x i16> [[BROADCAST_SPLATINSERT]], <vscale x 8 x i16> poison, <vscale x 8 x i32> zeroinitializer
; VSCALEFORTUNING2-NEXT: br label [[VECTOR_BODY:%.*]]
; VSCALEFORTUNING2: vector.body:
-; VSCALEFORTUNING2-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
-; VSCALEFORTUNING2-NEXT: [[VEC_PHI:%.*]] = phi <vscale x 4 x i16> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP15:%.*]], [[VECTOR_BODY]] ]
-; VSCALEFORTUNING2-NEXT: [[VEC_PHI1:%.*]] = phi <vscale x 4 x i16> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP16:%.*]], [[VECTOR_BODY]] ]
+; VSCALEFORTUNING2-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH1]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+; VSCALEFORTUNING2-NEXT: [[VEC_PHI:%.*]] = phi <vscale x 8 x i16> [ zeroinitializer, [[VECTOR_PH1]] ], [ [[TMP17:%.*]], [[VECTOR_BODY]] ]
+; VSCALEFORTUNING2-NEXT: [[VEC_PHI2:%.*]] = phi <vscale x 8 x i16> [ zeroinitializer, [[VECTOR_PH1]] ], [ [[TMP18:%.*]], [[VECTOR_BODY]] ]
; VSCALEFORTUNING2-NEXT: [[TMP7:%.*]] = add i64 [[INDEX]], 0
; VSCALEFORTUNING2-NEXT: [[TMP8:%.*]] = getelementptr i16, ptr [[SRC]], i64 [[TMP7]]
; VSCALEFORTUNING2-NEXT: [[TMP9:%.*]] = getelementptr i16, ptr [[TMP8]], i32 0
; VSCALEFORTUNING2-NEXT: [[TMP10:%.*]] = call i64 @llvm.vscale.i64()
-; VSCALEFORTUNING2-NEXT: [[TMP11:%.*]] = mul i64 [[TMP10]], 4
+; VSCALEFORTUNING2-NEXT: [[TMP11:%.*]] = mul i64 [[TMP10]], 8
; VSCALEFORTUNING2-NEXT: [[TMP12:%.*]] = getelementptr i16, ptr [[TMP8]], i64 [[TMP11]]
-; VSCALEFORTUNING2-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 4 x i16>, ptr [[TMP9]], align 2
-; VSCALEFORTUNING2-NEXT: [[WIDE_LOAD2:%.*]] = load <vscale x 4 x i16>, ptr [[TMP12]], align 2
-; VSCALEFORTUNING2-NEXT: [[TMP13:%.*]] = udiv <vscale x 4 x i16> [[WIDE_LOAD]], [[BROADCAST_SPLAT]]
+; VSCALEFORTUNING2-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 8 x i16>, ptr [[TMP9]], align 2
+; VSCALEFORTUNING2-NEXT: [[WIDE_LOAD3:%.*]] = load <vscale x 8 x i16>, ptr [[TMP12]], align 2
+; VSCALEFORTUNING2-NEXT: [[TMP15:%.*]] = udiv <vscale x 8 x i16> [[WIDE_LOAD]], [[BROADCAST_SPLAT1]]
+; VSCALEFORTUNING2-NEXT: [[TMP32:%.*]] = udiv <vscale x 8 x i16> [[WIDE_LOAD3]], [[BROADCAST_SPLAT1]]
+; VSCALEFORTUNING2-NEXT: [[TMP17]] = or <vscale x 8 x i16> [[TMP15]], [[VEC_PHI]]
+; VSCALEFORTUNING2-NEXT: [[TMP18]] = or <vscale x 8 x i16> [[TMP32]], [[VEC_PHI2]]
+; VSCALEFORTUNING2-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP31]]
+; VSCALEFORTUNING2-NEXT: [[TMP19:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; VSCALEFORTUNING2-NEXT: br i1 [[TMP19]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
+; VSCALEFORTUNING2: middle.block:
+; VSCALEFORTUNING2-NEXT: [[BIN_RDX:%.*]] = or <vscale x 8 x i16> [[TMP18]], [[TMP17]]
+; VSCALEFORTUNING2-NEXT: [[TMP20:%.*]] = call i16 @llvm.vector.reduce.or.nxv8i16(<vscale x 8 x i16> [[BIN_RDX]])
+; VSCALEFORTUNING2-NEXT: [[CMP_N1:%.*]] = icmp eq i64 [[TMP0]], [[N_VEC]]
+; VSCALEFORTUNING2-NEXT: br i1 [[CMP_N1]], label [[EXIT:%.*]], label [[VEC_EPILOG_ITER_CHECK:%.*]]
+; VSCALEFORTUNING2: vec.epilog.iter.check:
+; VSCALEFORTUNING2-NEXT: [[N_VEC_REMAINING:%.*]] = sub i64 [[TMP0]], [[N_VEC]]
+; VSCALEFORTUNING2-NEXT: [[TMP21:%.*]] = call i64 @llvm.vscale.i64()
+; VSCALEFORTUNING2-NEXT: [[TMP22:%.*]] = mul i64 [[TMP21]], 4
+; VSCALEFORTUNING2-NEXT: [[MIN_EPILOG_ITERS_CHECK:%.*]] = icmp ult i64 [[N_VEC_REMAINING]], [[TMP22]]
+; VSCALEFORTUNING2-NEXT: br i1 [[MIN_EPILOG_ITERS_CHECK]], label [[SCALAR_PH]], label [[VEC_EPILOG_PH]]
+; VSCALEFORTUNING2: vec.epilog.ph:
+; VSCALEFORTUNING2-NEXT: [[VEC_EPILOG_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[VECTOR_PH]] ]
+; VSCALEFORTUNING2-NEXT: [[BC_MERGE_RDX1:%.*]] = phi i16 [ [[TMP20]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[VECTOR_PH]] ]
+; VSCALEFORTUNING2-NEXT: [[TMP23:%.*]] = call i64 @llvm.vscale.i64()
+; VSCALEFORTUNING2-NEXT: [[TMP24:%.*]] = mul i64 [[TMP23]], 4
+; VSCALEFORTUNING2-NEXT: [[N_MOD_VF4:%.*]] = urem i64 [[TMP0]], [[TMP24]]
+; VSCALEFORTUNING2-NEXT: [[N_VEC5:%.*]] = sub i64 [[TMP0]], [[N_MOD_VF4]]
+; VSCALEFORTUNING2-NEXT: [[TMP25:%.*]] = call i64 @llvm.vscale.i64()
+; VSCALEFORTUNING2-NEXT: [[TMP26:%.*]] = mul i64 [[TMP25]], 4
+; VSCALEFORTUNING2-NEXT: [[TMP27:%.*]] = insertelement <vscale x 4 x i16> zeroinitializer, i16 [[BC_MERGE_RDX1]], i32 0
+; VSCALEFORTUNING2-NEXT: [[BROADCAST_SPLATINSERT9:%.*]] = insertelement <vscale x 4 x i16> poison, i16 [[X]], i64 0
+; VSCALEFORTUNING2-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i16> [[BROADCAST_SPLATINSERT9]], <vscale x 4 x i16> poison, <vscale x 4 x i32> zeroinitializer
+; VSCALEFORTUNING2-NEXT: br label [[VEC_EPILOG_VECTOR_BODY:%.*]]
+; VSCALEFORTUNING2: vec.epilog.vector.body:
+; VSCALEFORTUNING2-NEXT: [[INDEX6:%.*]] = phi i64 [ [[VEC_EPILOG_RESUME_VAL]], [[VEC_EPILOG_PH]] ], [ [[INDEX_NEXT11:%.*]], [[VEC_EPILOG_VECTOR_BODY]] ]
+; VSCALEFORTUNING2-NEXT: [[VEC_PHI1:%.*]] = phi <vscale x 4 x i16> [ [[TMP27]], [[VEC_EPILOG_PH]] ], [ [[TMP16:%.*]], [[VEC_EPILOG_VECTOR_BODY]] ]
+; VSCALEFORTUNING2-NEXT: [[TMP28:%.*]] = add i64 [[INDEX6]], 0
+; VSCALEFORTUNING2-NEXT: [[TMP29:%.*]] = getelementptr i16, ptr [[SRC]], i64 [[TMP28]]
+; VSCALEFORTUNING2-NEXT: [[TMP30:%.*]] = getelementptr i16, ptr [[TMP29]], i32 0
+; VSCALEFORTUNING2-NEXT: [[WIDE_LOAD2:%.*]] = load <vscale x 4 x i16>, ptr [[TMP30]], align 2
; VSCALEFORTUNING2-NEXT: [[TMP14:%.*]] = udiv <vscale x 4 x i16> [[WIDE_LOAD2]], [[BROADCAST_SPLAT]]
-; VSCALEFORTUNING2-NEXT: [[TMP15]] = or <vscale x 4 x i16> [[TMP13]], [[VEC_PHI]]
; VSCALEFORTUNING2-NEXT: [[TMP16]] = or <vscale x 4 x i16> [[TMP14]], [[VEC_PHI1]]
-; VSCALEFORTUNING2-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP6]]
-; VSCALEFORTUNING2-NEXT: [[TMP17:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
-; VSCALEFORTUNING2-NEXT: br i1 [[TMP17]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
-; VSCALEFORTUNING2: middle.block:
-; VSCALEFORTUNING2-NEXT: [[BIN_RDX:%.*]] = or <vscale x 4 x i16> [[TMP16]], [[TMP15]]
-; VSCALEFORTUNING2-NEXT: [[TMP18:%.*]] = call i16 @llvm.vector.reduce.or.nxv4i16(<vscale x 4 x i16> [[BIN_RDX]])
-; VSCALEFORTUNING2-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[TMP0]], [[N_VEC]]
-; VSCALEFORTUNING2-NEXT: br i1 [[CMP_N]], label [[EXIT:%.*]], label [[SCALAR_PH]]
-; VSCALEFORTUNING2: scalar.ph:
-; VSCALEFORTUNING2-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
-; VSCALEFORTUNING2-NEXT: [[BC_MERGE_RDX:%.*]] = phi i16 [ [[TMP18]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY]] ]
+; VSCALEFORTUNING2-NEXT: [[INDEX_NEXT11]] = add nuw i64 [[INDEX6]], [[TMP26]]
+; VSCALEFORTUNING2-NEXT: [[TMP33:%.*]] = icmp eq i64 [[INDEX_NEXT11]], [[N_VEC5]]
+; VSCALEFORTUNING2-NEXT: br i1 [[TMP33]], label [[VEC_EPILOG_MIDDLE_BLOCK:%.*]], label [[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP5:![0-9]+]]
+; VSCALEFORTUNING2: vec.epilog.middle.block:
+; VSCALEFORTUNING2-NEXT: [[TMP34:%.*]] = call i16 @llvm.vector.reduce.or.nxv4i16(<vscale x 4 x i16> [[TMP16]])
+; VSCALEFORTUNING2-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[TMP0]], [[N_VEC5]]
+; VSCALEFORTUNING2-NEXT: br i1 [[CMP_N]], label [[EXIT]], label [[SCALAR_PH]]
+; VSCALEFORTUNING2: vec.epilog.scalar.ph:
+; VSCALEFORTUNING2-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC5]], [[VEC_EPILOG_MIDDLE_BLOCK]] ], [ 0, [[ITER_CHECK:%.*]] ], [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ]
+; VSCALEFORTUNING2-NEXT: [[BC_MERGE_RDX:%.*]] = phi i16 [ [[TMP34]], [[VEC_EPILOG_MIDDLE_BLOCK]] ], [ 0, [[ITER_CHECK]] ], [ [[TMP20]], [[VEC_EPILOG_ITER_CHECK]] ]
; VSCALEFORTUNING2-NEXT: br label [[LOOP:%.*]]
; VSCALEFORTUNING2: loop:
; VSCALEFORTUNING2-NEXT: [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], [[LOOP]] ]
@@ -455,9 +495,9 @@ define i16 @reduce_udiv(ptr %src, i16 %x, i64 %N) #0 {
; VSCALEFORTUNING2-NEXT: [[RED_NEXT]] = or i16 [[DIV]], [[RED]]
; VSCALEFORTUNING2-NEXT: [[IV_NEXT]] = add i64 [[IV]], 1
; VSCALEFORTUNING2-NEXT: [[EC:%.*]] = icmp eq i64 [[IV]], [[N]]
-; VSCALEFORTUNING2-NEXT: br i1 [[EC]], label [[EXIT]], label [[LOOP]], !llvm.loop [[LOOP5:![0-9]+]]
+; VSCALEFORTUNING2-NEXT: br i1 [[EC]], label [[EXIT]], label [[LOOP]], !llvm.loop [[LOOP6:![0-9]+]]
; VSCALEFORTUNING2: exit:
-; VSCALEFORTUNING2-NEXT: [[RED_NEXT_LCSSA:%.*]] = phi i16 [ [[RED_NEXT]], [[LOOP]] ], [ [[TMP18]], [[MIDDLE_BLOCK]] ]
+; VSCALEFORTUNING2-NEXT: [[RED_NEXT_LCSSA:%.*]] = phi i16 [ [[RED_NEXT]], [[LOOP]] ], [ [[TMP20]], [[MIDDLE_BLOCK]] ], [ [[TMP34]], [[VEC_EPILOG_MIDDLE_BLOCK]] ]
; VSCALEFORTUNING2-NEXT: ret i16 [[RED_NEXT_LCSSA]]
;
; PRED-LABEL: define i16 @reduce_udiv(
@@ -467,40 +507,40 @@ define i16 @reduce_udiv(ptr %src, i16 %x, i64 %N) #0 {
; PRED-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
; PRED: vector.ph:
; PRED-NEXT: [[TMP1:%.*]] = call i64 @llvm.vscale.i64()
-; PRED-NEXT: [[TMP2:%.*]] = mul i64 [[TMP1]], 4
+; PRED-NEXT: [[TMP2:%.*]] = mul i64 [[TMP1]], 8
; PRED-NEXT: [[TMP5:%.*]] = sub i64 [[TMP2]], 1
; PRED-NEXT: [[N_RND_UP:%.*]] = add i64 [[TMP0]], [[TMP5]]
; PRED-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP2]]
; PRED-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
; PRED-NEXT: [[TMP6:%.*]] = call i64 @llvm.vscale.i64()
-; PRED-NEXT: [[TMP7:%.*]] = mul i64 [[TMP6]], 4
+; PRED-NEXT: [[TMP7:%.*]] = mul i64 [[TMP6]], 8
; PRED-NEXT: [[TMP8:%.*]] = call i64 @llvm.vscale.i64()
-; PRED-NEXT: [[TMP9:%.*]] = mul i64 [[TMP8]], 4
+; PRED-NEXT: [[TMP9:%.*]] = mul i64 [[TMP8]], 8
; PRED-NEXT: [[TMP10:%.*]] = sub i64 [[TMP0]], [[TMP9]]
; PRED-NEXT: [[TMP11:%.*]] = icmp ugt i64 [[TMP0]], [[TMP9]]
; PRED-NEXT: [[TMP12:%.*]] = select i1 [[TMP11]], i64 [[TMP10]], i64 0
-; PRED-NEXT: [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[TMP0]])
-; PRED-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i16> poison, i16 [[X]], i64 0
-; PRED-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i16> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i16> poison, <vscale x 4 x i32> zeroinitializer
+; PRED-NEXT: [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 8 x i1> @llvm.get.active.lane.mask.nxv8i1.i64(i64 0, i64 [[TMP0]])
+; PRED-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 8 x i16> poison, i16 [[X]], i64 0
+; PRED-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 8 x i16> [[BROADCAST_SPLATINSERT]], <vscale x 8 x i16> poison, <vscale x 8 x i32> zeroinitializer
; PRED-NEXT: br label [[VECTOR_BODY:%.*]]
; PRED: vector.body:
; PRED-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
-; PRED-NEXT: [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 4 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], [[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], [[VECTOR_BODY]] ]
-; PRED-NEXT: [[VEC_PHI:%.*]] = phi <vscale x 4 x i16> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP16:%.*]], [[VECTOR_BODY]] ]
+; PRED-NEXT: [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 8 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], [[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], [[VECTOR_BODY]] ]
+; PRED-NEXT: [[VEC_PHI:%.*]] = phi <vscale x 8 x i16> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP16:%.*]], [[VECTOR_BODY]] ]
; PRED-NEXT: [[TMP13:%.*]] = add i64 [[INDEX]], 0
; PRED-NEXT: [[TMP14:%.*]] = getelementptr i16, ptr [[SRC]], i64 [[TMP13]]
; PRED-NEXT: [[TMP15:%.*]] = getelementptr i16, ptr [[TMP14]], i32 0
-; PRED-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 4 x i16> @llvm.masked.load.nxv4i16.p0(ptr [[TMP15]], i32 2, <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], <vscale x 4 x i16> poison)
-; PRED-NEXT: [[TMP19:%.*]] = udiv <vscale x 4 x i16> [[WIDE_MASKED_LOAD]], [[BROADCAST_SPLAT]]
-; PRED-NEXT: [[TMP20:%.*]] = or <vscale x 4 x i16> [[TMP19]], [[VEC_PHI]]
-; PRED-NEXT: [[TMP16]] = select <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], <vscale x 4 x i16> [[TMP20]], <vscale x 4 x i16> [[VEC_PHI]]
+; PRED-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 8 x i16> @llvm.masked.load.nxv8i16.p0(ptr [[TMP15]], i32 2, <vscale x 8 x i1> [[ACTIVE_LANE_MASK]], <vscale x 8 x i16> poison)
+; PRED-NEXT: [[TMP19:%.*]] = udiv <vscale x 8 x i16> [[WIDE_MASKED_LOAD]], [[BROADCAST_SPLAT]]
+; PRED-NEXT: [[TMP20:%.*]] = or <vscale x 8 x i16> [[TMP19]], [[VEC_PHI]]
+; PRED-NEXT: [[TMP16]] = select <vscale x 8 x i1> [[ACTIVE_LANE_MASK]], <vscale x 8 x i16> [[TMP20]], <vscale x 8 x i16> [[VEC_PHI]]
; PRED-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP7]]
-; PRED-NEXT: [[ACTIVE_LANE_MASK_NEXT]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX]], i64 [[TMP12]])
-; PRED-NEXT: [[TMP17:%.*]] = xor <vscale x 4 x i1> [[ACTIVE_LANE_MASK_NEXT]], splat (i1 true)
-; PRED-NEXT: [[TMP18:%.*]] = extractelement <vscale x 4 x i1> [[TMP17]], i32 0
+; PRED-NEXT: [[ACTIVE_LANE_MASK_NEXT]] = call <vscale x 8 x i1> @llvm.get.active.lane.mask.nxv8i1.i64(i64 [[INDEX]], i64 [[TMP12]])
+; PRED-NEXT: [[TMP17:%.*]] = xor <vscale x 8 x i1> [[ACTIVE_LANE_MASK_NEXT]], splat (i1 true)
+; PRED-NEXT: [[TMP18:%.*]] = extractelement <vscale x 8 x i1> [[TMP17]], i32 0
; PRED-NEXT: br i1 [[TMP18]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
; PRED: middle.block:
-; PRED-NEXT: [[TMP22:%.*]] = call i16 @llvm.vector.reduce.or.nxv4i16(<vscale x 4 x i16> [[TMP16]])
+; PRED-NEXT: [[TMP22:%.*]] = call i16 @llvm.vector.reduce.or.nxv8i16(<vscale x 8 x i16> [[TMP16]])
; PRED-NEXT: br i1 true, label [[EXIT:%.*]], label [[SCALAR_PH]]
; PRED: scalar.ph:
; PRED-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
@@ -550,7 +590,8 @@ attributes #0 = { "target-features"="+sve" }
; VSCALEFORTUNING2: [[META2]] = !{!"llvm.loop.unroll.runtime.disable"}
; VSCALEFORTUNING2: [[LOOP3]] = distinct !{[[LOOP3]], [[META2]], [[META1]]}
; VSCALEFORTUNING2: [[LOOP4]] = distinct !{[[LOOP4]], [[META1]], [[META2]]}
-; VSCALEFORTUNING2: [[LOOP5]] = distinct !{[[LOOP5]], [[META2]], [[META1]]}
+; VSCALEFORTUNING2: [[LOOP5]] = distinct !{[[LOOP5]], [[META1]], [[META2]]}
+; VSCALEFORTUNING2: [[LOOP6]] = distinct !{[[LOOP6]], [[META2]], [[META1]]}
;.
; PRED: [[LOOP0]] = distinct !{[[LOOP0]], [[META1:![0-9]+]], [[META2:![0-9]+]]}
; PRED: [[META1]] = !{!"llvm.loop.isvectorized", i32 1}
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/uniform-args-call-variants.ll b/llvm/test/Transforms/LoopVectorize/AArch64/uniform-args-call-variants.ll
index ce8492cd77362f..dad87826020faa 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/uniform-args-call-variants.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/uniform-args-call-variants.ll
@@ -183,55 +183,106 @@ define void @test_uniform_not_invariant(ptr noalias %dst, ptr readonly %src, i64
; CHECK-LABEL: define void @test_uniform_not_invariant
; CHECK-SAME: (ptr noalias [[DST:%.*]], ptr readonly [[SRC:%.*]], i64 [[N:%.*]]) #[[ATTR0]] {
; CHECK-NEXT: entry:
-; CHECK-NEXT: br label [[FOR_BODY:%.*]]
-; CHECK: for.body:
-; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]
-; CHECK-NEXT: [[GEPSRC:%.*]] = getelementptr double, ptr [[SRC]], i64 [[INDVARS_IV]]
-; CHECK-NEXT: [[DATA:%.*]] = load double, ptr [[GEPSRC]], align 8
-; CHECK-NEXT: [[CALL:%.*]] = call double @foo(double [[DATA]], i64 [[INDVARS_IV]]) #[[ATTR5:[0-9]+]]
-; CHECK-NEXT: [[GEPDST:%.*]] = getelementptr inbounds nuw double, ptr [[DST]], i64 [[INDVARS_IV]]
-; CHECK-NEXT: store double [[CALL]], ptr [[GEPDST]], align 8
-; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
-; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], [[N]]
-; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP:%.*]], label [[FOR_BODY]]
+; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.usub.sat.i64(i64 [[N]], i64 2)
+; CHECK-NEXT: [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <2 x i1> @llvm.get.active.lane.mask.v2i1.i64(i64 0, i64 [[N]])
+; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
+; CHECK: vector.body:
+; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDEX_NEXT:%.*]], [[PRED_CALL_CONTINUE2:%.*]] ]
+; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = phi <2 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], [[ENTRY]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], [[PRED_CALL_CONTINUE2]] ]
+; CHECK-NEXT: [[TMP1:%.*]] = getelementptr double, ptr [[SRC]], i64 [[INDEX]]
+; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <2 x double> @llvm.masked.load.v2f64.p0(ptr [[TMP1]], i32 8, <2 x i1> [[ACTIVE_LANE_MASK]], <2 x double> poison)
+; CHECK-NEXT: [[TMP2:%.*]] = extractelement <2 x i1> [[ACTIVE_LANE_MASK]], i64 0
+; CHECK-NEXT: br i1 [[TMP2]], label [[PRED_CALL_IF:%.*]], label [[PRED_CALL_CONTINUE:%.*]]
+; CHECK: pred.call.if:
+; CHECK-NEXT: [[TMP3:%.*]] = extractelement <2 x double> [[WIDE_MASKED_LOAD]], i64 0
+; CHECK-NEXT: [[TMP4:%.*]] = call double @foo(double [[TMP3]], i64 [[INDEX]]) #[[ATTR5:[0-9]+]]
+; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x double> poison, double [[TMP4]], i64 0
+; CHECK-NEXT: br label [[PRED_CALL_CONTINUE]]
+; CHECK: pred.call.continue:
+; CHECK-NEXT: [[TMP6:%.*]] = phi <2 x double> [ poison, [[VECTOR_BODY]] ], [ [[TMP5]], [[PRED_CALL_IF]] ]
+; CHECK-NEXT: [[TMP7:%.*]] = extractelement <2 x i1> [[ACTIVE_LANE_MASK]], i64 1
+; CHECK-NEXT: br i1 [[TMP7]], label [[PRED_CALL_IF1:%.*]], label [[PRED_CALL_CONTINUE2]]
+; CHECK: pred.call.if1:
+; CHECK-NEXT: [[TMP8:%.*]] = or disjoint i64 [[INDEX]], 1
+; CHECK-NEXT: [[TMP9:%.*]] = extractelement <2 x double> [[WIDE_MASKED_LOAD]], i64 1
+; CHECK-NEXT: [[TMP10:%.*]] = call double @foo(double [[TMP9]], i64 [[TMP8]]) #[[ATTR5]]
+; CHECK-NEXT: [[TMP11:%.*]] = insertelement <2 x double> [[TMP6]], double [[TMP10]], i64 1
+; CHECK-NEXT: br label [[PRED_CALL_CONTINUE2]]
+; CHECK: pred.call.continue2:
+; CHECK-NEXT: [[TMP12:%.*]] = phi <2 x double> [ [[TMP6]], [[PRED_CALL_CONTINUE]] ], [ [[TMP11]], [[PRED_CALL_IF1]] ]
+; CHECK-NEXT: [[TMP13:%.*]] = getelementptr inbounds double, ptr [[DST]], i64 [[INDEX]]
+; CHECK-NEXT: call void @llvm.masked.store.v2f64.p0(<2 x double> [[TMP12]], ptr [[TMP13]], i32 8, <2 x i1> [[ACTIVE_LANE_MASK]])
+; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 2
+; CHECK-NEXT: [[ACTIVE_LANE_MASK_NEXT]] = call <2 x i1> @llvm.get.active.lane.mask.v2i1.i64(i64 [[INDEX]], i64 [[TMP0]])
+; CHECK-NEXT: [[TMP14:%.*]] = extractelement <2 x i1> [[ACTIVE_LANE_MASK_NEXT]], i64 0
+; CHECK-NEXT: br i1 [[TMP14]], label [[VECTOR_BODY]], label [[FOR_COND_CLEANUP:%.*]], !llvm.loop [[LOOP4:![0-9]+]]
; CHECK: for.cond.cleanup:
; CHECK-NEXT: ret void
;
; INTERLEAVE-LABEL: define void @test_uniform_not_invariant
; INTERLEAVE-SAME: (ptr noalias [[DST:%.*]], ptr readonly [[SRC:%.*]], i64 [[N:%.*]]) #[[ATTR0]] {
; INTERLEAVE-NEXT: entry:
-; INTERLEAVE-NEXT: [[TMP0:%.*]] = call i64 @llvm.usub.sat.i64(i64 [[N]], i64 2)
-; INTERLEAVE-NEXT: [[ACTIVE_LANE_MASK_ENTRY:%.*]] = icmp ne i64 [[N]], 0
-; INTERLEAVE-NEXT: [[ACTIVE_LANE_MASK_ENTRY1:%.*]] = icmp ugt i64 [[N]], 1
+; INTERLEAVE-NEXT: [[TMP0:%.*]] = call i64 @llvm.usub.sat.i64(i64 [[N]], i64 4)
+; INTERLEAVE-NEXT: [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <2 x i1> @llvm.get.active.lane.mask.v2i1.i64(i64 0, i64 [[N]])
+; INTERLEAVE-NEXT: [[ACTIVE_LANE_MASK_ENTRY1:%.*]] = call <2 x i1> @llvm.get.active.lane.mask.v2i1.i64(i64 2, i64 [[N]])
; INTERLEAVE-NEXT: br label [[VECTOR_BODY:%.*]]
; INTERLEAVE: vector.body:
-; INTERLEAVE-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDEX_NEXT:%.*]], [[PRED_STORE_CONTINUE4:%.*]] ]
-; INTERLEAVE-NEXT: [[ACTIVE_LANE_MASK:%.*]] = phi i1 [ [[ACTIVE_LANE_MASK_ENTRY]], [[ENTRY]] ], [ true, [[PRED_STORE_CONTINUE4]] ]
-; INTERLEAVE-NEXT: [[ACTIVE_LANE_MASK2:%.*]] = phi i1 [ [[ACTIVE_LANE_MASK_ENTRY1]], [[ENTRY]] ], [ [[ACTIVE_LANE_MASK_NEXT5:%.*]], [[PRED_STORE_CONTINUE4]] ]
-; INTERLEAVE-NEXT: br i1 [[ACTIVE_LANE_MASK]], label [[PRED_STORE_IF:%.*]], label [[PRED_STORE_CONTINUE:%.*]]
-; INTERLEAVE: pred.store.if:
+; INTERLEAVE-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDEX_NEXT:%.*]], [[PRED_CALL_CONTINUE9:%.*]] ]
+; INTERLEAVE-NEXT: [[ACTIVE_LANE_MASK:%.*]] = phi <2 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], [[ENTRY]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], [[PRED_CALL_CONTINUE9]] ]
+; INTERLEAVE-NEXT: [[ACTIVE_LANE_MASK2:%.*]] = phi <2 x i1> [ [[ACTIVE_LANE_MASK_ENTRY1]], [[ENTRY]] ], [ [[ACTIVE_LANE_MASK_NEXT10:%.*]], [[PRED_CALL_CONTINUE9]] ]
; INTERLEAVE-NEXT: [[TMP1:%.*]] = getelementptr double, ptr [[SRC]], i64 [[INDEX]]
-; INTERLEAVE-NEXT: [[TMP2:%.*]] = load double, ptr [[TMP1]], align 8
-; INTERLEAVE-NEXT: [[TMP3:%.*]] = call double @foo(double [[TMP2]], i64 [[INDEX]]) #[[ATTR5:[0-9]+]]
-; INTERLEAVE-NEXT: [[TMP4:%.*]] = getelementptr inbounds double, ptr [[DST]], i64 [[INDEX]]
-; INTERLEAVE-NEXT: store double [[TMP3]], ptr [[TMP4]], align 8
-; INTERLEAVE-NEXT: br label [[PRED_STORE_CONTINUE]]
-; INTERLEAVE: pred.store.continue:
-; INTERLEAVE-NEXT: br i1 [[ACTIVE_LANE_MASK2]], label [[PRED_STORE_IF3:%.*]], label [[PRED_STORE_CONTINUE4]]
-; INTERLEAVE: pred.store.if3:
-; INTERLEAVE-NEXT: [[TMP5:%.*]] = or disjoint i64 [[INDEX]], 1
-; INTERLEAVE-NEXT: [[TMP6:%.*]] = getelementptr double, ptr [[SRC]], i64 [[TMP5]]
-; INTERLEAVE-NEXT: [[TMP7:%.*]] = load double, ptr [[TMP6]], align 8
-; INTERLEAVE-NEXT: [[TMP8:%.*]] = call double @foo(double [[TMP7]], i64 [[TMP5]]) #[[ATTR5]]
-; INTERLEAVE-NEXT: [[TMP9:%.*]] = getelementptr inbounds double, ptr [[DST]], i64 [[TMP5]]
-; INTERLEAVE-NEXT: store double [[TMP8]], ptr [[TMP9]], align 8
-; INTERLEAVE-NEXT: br label [[PRED_STORE_CONTINUE4]]
-; INTERLEAVE: pred.store.continue4:
-; INTERLEAVE-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 2
-; INTERLEAVE-NEXT: [[TMP10:%.*]] = or disjoint i64 [[INDEX]], 1
-; INTERLEAVE-NEXT: [[ACTIVE_LANE_MASK_NEXT:%.*]] = icmp ult i64 [[INDEX]], [[TMP0]]
-; INTERLEAVE-NEXT: [[ACTIVE_LANE_MASK_NEXT5]] = icmp ult i64 [[TMP10]], [[TMP0]]
-; INTERLEAVE-NEXT: br i1 [[ACTIVE_LANE_MASK_NEXT]], label [[VECTOR_BODY]], label [[FOR_COND_CLEANUP:%.*]], !llvm.loop [[LOOP4:![0-9]+]]
+; INTERLEAVE-NEXT: [[TMP2:%.*]] = getelementptr i8, ptr [[TMP1]], i64 16
+; INTERLEAVE-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <2 x double> @llvm.masked.load.v2f64.p0(ptr [[TMP1]], i32 8, <2 x i1> [[ACTIVE_LANE_MASK]], <2 x double> poison)
+; INTERLEAVE-NEXT: [[WIDE_MASKED_LOAD3:%.*]] = call <2 x double> @llvm.masked.load.v2f64.p0(ptr [[TMP2]], i32 8, <2 x i1> [[ACTIVE_LANE_MASK2]], <2 x double> poison)
+; INTERLEAVE-NEXT: [[TMP3:%.*]] = extractelement <2 x i1> [[ACTIVE_LANE_MASK]], i64 0
+; INTERLEAVE-NEXT: br i1 [[TMP3]], label [[PRED_CALL_IF:%.*]], label [[PRED_CALL_CONTINUE:%.*]]
+; INTERLEAVE: pred.call.if:
+; INTERLEAVE-NEXT: [[TMP4:%.*]] = extractelement <2 x double> [[WIDE_MASKED_LOAD]], i64 0
+; INTERLEAVE-NEXT: [[TMP5:%.*]] = call double @foo(double [[TMP4]], i64 [[INDEX]]) #[[ATTR5:[0-9]+]]
+; INTERLEAVE-NEXT: [[TMP6:%.*]] = insertelement <2 x double> poison, double [[TMP5]], i64 0
+; INTERLEAVE-NEXT: br label [[PRED_CALL_CONTINUE]]
+; INTERLEAVE: pred.call.continue:
+; INTERLEAVE-NEXT: [[TMP7:%.*]] = phi <2 x double> [ poison, [[VECTOR_BODY]] ], [ [[TMP6]], [[PRED_CALL_IF]] ]
+; INTERLEAVE-NEXT: [[TMP8:%.*]] = extractelement <2 x i1> [[ACTIVE_LANE_MASK]], i64 1
+; INTERLEAVE-NEXT: br i1 [[TMP8]], label [[PRED_CALL_IF4:%.*]], label [[PRED_CALL_CONTINUE5:%.*]]
+; INTERLEAVE: pred.call.if4:
+; INTERLEAVE-NEXT: [[TMP9:%.*]] = or disjoint i64 [[INDEX]], 1
+; INTERLEAVE-NEXT: [[TMP10:%.*]] = extractelement <2 x double> [[WIDE_MASKED_LOAD]], i64 1
+; INTERLEAVE-NEXT: [[TMP11:%.*]] = call double @foo(double [[TMP10]], i64 [[TMP9]]) #[[ATTR5]]
+; INTERLEAVE-NEXT: [[TMP12:%.*]] = insertelement <2 x double> [[TMP7]], double [[TMP11]], i64 1
+; INTERLEAVE-NEXT: br label [[PRED_CALL_CONTINUE5]]
+; INTERLEAVE: pred.call.continue5:
+; INTERLEAVE-NEXT: [[TMP13:%.*]] = phi <2 x double> [ [[TMP7]], [[PRED_CALL_CONTINUE]] ], [ [[TMP12]], [[PRED_CALL_IF4]] ]
+; INTERLEAVE-NEXT: [[TMP14:%.*]] = extractelement <2 x i1> [[ACTIVE_LANE_MASK2]], i64 0
+; INTERLEAVE-NEXT: br i1 [[TMP14]], label [[PRED_CALL_IF6:%.*]], label [[PRED_CALL_CONTINUE7:%.*]]
+; INTERLEAVE: pred.call.if6:
+; INTERLEAVE-NEXT: [[TMP15:%.*]] = or disjoint i64 [[INDEX]], 2
+; INTERLEAVE-NEXT: [[TMP16:%.*]] = extractelement <2 x double> [[WIDE_MASKED_LOAD3]], i64 0
+; INTERLEAVE-NEXT: [[TMP17:%.*]] = call double @foo(double [[TMP16]], i64 [[TMP15]]) #[[ATTR5]]
+; INTERLEAVE-NEXT: [[TMP18:%.*]] = insertelement <2 x double> poison, double [[TMP17]], i64 0
+; INTERLEAVE-NEXT: br label [[PRED_CALL_CONTINUE7]]
+; INTERLEAVE: pred.call.continue7:
+; INTERLEAVE-NEXT: [[TMP19:%.*]] = phi <2 x double> [ poison, [[PRED_CALL_CONTINUE5]] ], [ [[TMP18]], [[PRED_CALL_IF6]] ]
+; INTERLEAVE-NEXT: [[TMP20:%.*]] = extractelement <2 x i1> [[ACTIVE_LANE_MASK2]], i64 1
+; INTERLEAVE-NEXT: br i1 [[TMP20]], label [[PRED_CALL_IF8:%.*]], label [[PRED_CALL_CONTINUE9]]
+; INTERLEAVE: pred.call.if8:
+; INTERLEAVE-NEXT: [[TMP21:%.*]] = or disjoint i64 [[INDEX]], 3
+; INTERLEAVE-NEXT: [[TMP22:%.*]] = extractelement <2 x double> [[WIDE_MASKED_LOAD3]], i64 1
+; INTERLEAVE-NEXT: [[TMP23:%.*]] = call double @foo(double [[TMP22]], i64 [[TMP21]]) #[[ATTR5]]
+; INTERLEAVE-NEXT: [[TMP24:%.*]] = insertelement <2 x double> [[TMP19]], double [[TMP23]], i64 1
+; INTERLEAVE-NEXT: br label [[PRED_CALL_CONTINUE9]]
+; INTERLEAVE: pred.call.continue9:
+; INTERLEAVE-NEXT: [[TMP25:%.*]] = phi <2 x double> [ [[TMP19]], [[PRED_CALL_CONTINUE7]] ], [ [[TMP24]], [[PRED_CALL_IF8]] ]
+; INTERLEAVE-NEXT: [[TMP26:%.*]] = getelementptr inbounds double, ptr [[DST]], i64 [[INDEX]]
+; INTERLEAVE-NEXT: [[TMP27:%.*]] = getelementptr inbounds nuw i8, ptr [[TMP26]], i64 16
+; INTERLEAVE-NEXT: call void @llvm.masked.store.v2f64.p0(<2 x double> [[TMP13]], ptr [[TMP26]], i32 8, <2 x i1> [[ACTIVE_LANE_MASK]])
+; INTERLEAVE-NEXT: call void @llvm.masked.store.v2f64.p0(<2 x double> [[TMP25]], ptr [[TMP27]], i32 8, <2 x i1> [[ACTIVE_LANE_MASK2]])
+; INTERLEAVE-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 4
+; INTERLEAVE-NEXT: [[TMP28:%.*]] = or disjoint i64 [[INDEX]], 2
+; INTERLEAVE-NEXT: [[ACTIVE_LANE_MASK_NEXT]] = call <2 x i1> @llvm.get.active.lane.mask.v2i1.i64(i64 [[INDEX]], i64 [[TMP0]])
+; INTERLEAVE-NEXT: [[ACTIVE_LANE_MASK_NEXT10]] = call <2 x i1> @llvm.get.active.lane.mask.v2i1.i64(i64 [[TMP28]], i64 [[TMP0]])
+; INTERLEAVE-NEXT: [[TMP29:%.*]] = extractelement <2 x i1> [[ACTIVE_LANE_MASK_NEXT]], i64 0
+; INTERLEAVE-NEXT: br i1 [[TMP29]], label [[VECTOR_BODY]], label [[FOR_COND_CLEANUP:%.*]], !llvm.loop [[LOOP4:![0-9]+]]
; INTERLEAVE: for.cond.cleanup:
; INTERLEAVE-NEXT: ret void
;
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/low-trip-count.ll b/llvm/test/Transforms/LoopVectorize/RISCV/low-trip-count.ll
index 5c5600b9cfdf8e..0966505cd21bd9 100644
--- a/llvm/test/Transforms/LoopVectorize/RISCV/low-trip-count.ll
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/low-trip-count.ll
@@ -48,27 +48,27 @@ define void @trip3_i8(ptr noalias nocapture noundef %dst, ptr noalias nocapture
; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
; CHECK: vector.ph:
; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
-; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 2
+; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 16
; CHECK-NEXT: [[TMP4:%.*]] = sub i64 [[TMP1]], 1
; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 3, [[TMP4]]
; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP1]]
; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
; CHECK-NEXT: [[TMP5:%.*]] = call i64 @llvm.vscale.i64()
-; CHECK-NEXT: [[TMP6:%.*]] = mul i64 [[TMP5]], 2
+; CHECK-NEXT: [[TMP6:%.*]] = mul i64 [[TMP5]], 16
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:
; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[TMP7:%.*]] = add i64 [[INDEX]], 0
-; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i64(i64 [[TMP7]], i64 3)
+; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 [[TMP7]], i64 3)
; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds i8, ptr [[SRC:%.*]], i64 [[TMP7]]
; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds i8, ptr [[TMP8]], i32 0
-; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 2 x i8> @llvm.masked.load.nxv2i8.p0(ptr [[TMP9]], i32 1, <vscale x 2 x i1> [[ACTIVE_LANE_MASK]], <vscale x 2 x i8> poison)
-; CHECK-NEXT: [[TMP10:%.*]] = shl <vscale x 2 x i8> [[WIDE_MASKED_LOAD]], splat (i8 1)
+; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP9]], i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], <vscale x 16 x i8> poison)
+; CHECK-NEXT: [[TMP10:%.*]] = shl <vscale x 16 x i8> [[WIDE_MASKED_LOAD]], splat (i8 1)
; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds i8, ptr [[DST:%.*]], i64 [[TMP7]]
; CHECK-NEXT: [[TMP12:%.*]] = getelementptr inbounds i8, ptr [[TMP11]], i32 0
-; CHECK-NEXT: [[WIDE_MASKED_LOAD1:%.*]] = call <vscale x 2 x i8> @llvm.masked.load.nxv2i8.p0(ptr [[TMP12]], i32 1, <vscale x 2 x i1> [[ACTIVE_LANE_MASK]], <vscale x 2 x i8> poison)
-; CHECK-NEXT: [[TMP13:%.*]] = add <vscale x 2 x i8> [[TMP10]], [[WIDE_MASKED_LOAD1]]
-; CHECK-NEXT: call void @llvm.masked.store.nxv2i8.p0(<vscale x 2 x i8> [[TMP13]], ptr [[TMP12]], i32 1, <vscale x 2 x i1> [[ACTIVE_LANE_MASK]])
+; CHECK-NEXT: [[WIDE_MASKED_LOAD1:%.*]] = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP12]], i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], <vscale x 16 x i8> poison)
+; CHECK-NEXT: [[TMP13:%.*]] = add <vscale x 16 x i8> [[TMP10]], [[WIDE_MASKED_LOAD1]]
+; CHECK-NEXT: call void @llvm.masked.store.nxv16i8.p0(<vscale x 16 x i8> [[TMP13]], ptr [[TMP12]], i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]])
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP6]]
; CHECK-NEXT: br i1 true, label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
; CHECK: middle.block:
@@ -117,27 +117,27 @@ define void @trip5_i8(ptr noalias nocapture noundef %dst, ptr noalias nocapture
; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
; CHECK: vector.ph:
; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
-; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 4
+; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 16
; CHECK-NEXT: [[TMP4:%.*]] = sub i64 [[TMP1]], 1
; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 5, [[TMP4]]
; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP1]]
; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
; CHECK-NEXT: [[TMP5:%.*]] = call i64 @llvm.vscale.i64()
-; CHECK-NEXT: [[TMP6:%.*]] = mul i64 [[TMP5]], 4
+; CHECK-NEXT: [[TMP6:%.*]] = mul i64 [[TMP5]], 16
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:
; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[TMP7:%.*]] = add i64 [[INDEX]], 0
-; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[TMP7]], i64 5)
+; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 [[TMP7]], i64 5)
; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds i8, ptr [[SRC:%.*]], i64 [[TMP7]]
; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds i8, ptr [[TMP8]], i32 0
-; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 4 x i8> @llvm.masked.load.nxv4i8.p0(ptr [[TMP9]], i32 1, <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], <vscale x 4 x i8> poison)
-; CHECK-NEXT: [[TMP10:%.*]] = shl <vscale x 4 x i8> [[WIDE_MASKED_LOAD]], splat (i8 1)
+; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP9]], i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], <vscale x 16 x i8> poison)
+; CHECK-NEXT: [[TMP10:%.*]] = shl <vscale x 16 x i8> [[WIDE_MASKED_LOAD]], splat (i8 1)
; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds i8, ptr [[DST:%.*]], i64 [[TMP7]]
; CHECK-NEXT: [[TMP12:%.*]] = getelementptr inbounds i8, ptr [[TMP11]], i32 0
-; CHECK-NEXT: [[WIDE_MASKED_LOAD1:%.*]] = call <vscale x 4 x i8> @llvm.masked.load.nxv4i8.p0(ptr [[TMP12]], i32 1, <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], <vscale x 4 x i8> poison)
-; CHECK-NEXT: [[TMP13:%.*]] = add <vscale x 4 x i8> [[TMP10]], [[WIDE_MASKED_LOAD1]]
-; CHECK-NEXT: call void @llvm.masked.store.nxv4i8.p0(<vscale x 4 x i8> [[TMP13]], ptr [[TMP12]], i32 1, <vscale x 4 x i1> [[ACTIVE_LANE_MASK]])
+; CHECK-NEXT: [[WIDE_MASKED_LOAD1:%.*]] = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP12]], i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], <vscale x 16 x i8> poison)
+; CHECK-NEXT: [[TMP13:%.*]] = add <vscale x 16 x i8> [[TMP10]], [[WIDE_MASKED_LOAD1]]
+; CHECK-NEXT: call void @llvm.masked.store.nxv16i8.p0(<vscale x 16 x i8> [[TMP13]], ptr [[TMP12]], i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]])
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP6]]
; CHECK-NEXT: br i1 true, label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
; CHECK: middle.block:
@@ -186,27 +186,27 @@ define void @trip8_i8(ptr noalias nocapture noundef %dst, ptr noalias nocapture
; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
; CHECK: vector.ph:
; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
-; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 4
+; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 8
; CHECK-NEXT: [[TMP4:%.*]] = sub i64 [[TMP1]], 1
; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 8, [[TMP4]]
; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP1]]
; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
; CHECK-NEXT: [[TMP5:%.*]] = call i64 @llvm.vscale.i64()
-; CHECK-NEXT: [[TMP6:%.*]] = mul i64 [[TMP5]], 4
+; CHECK-NEXT: [[TMP6:%.*]] = mul i64 [[TMP5]], 8
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:
; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[TMP7:%.*]] = add i64 [[INDEX]], 0
-; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[TMP7]], i64 8)
+; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 8 x i1> @llvm.get.active.lane.mask.nxv8i1.i64(i64 [[TMP7]], i64 8)
; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds i8, ptr [[SRC:%.*]], i64 [[TMP7]]
; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds i8, ptr [[TMP8]], i32 0
-; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 4 x i8> @llvm.masked.load.nxv4i8.p0(ptr [[TMP9]], i32 1, <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], <vscale x 4 x i8> poison)
-; CHECK-NEXT: [[TMP10:%.*]] = shl <vscale x 4 x i8> [[WIDE_MASKED_LOAD]], splat (i8 1)
+; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 8 x i8> @llvm.masked.load.nxv8i8.p0(ptr [[TMP9]], i32 1, <vscale x 8 x i1> [[ACTIVE_LANE_MASK]], <vscale x 8 x i8> poison)
+; CHECK-NEXT: [[TMP10:%.*]] = shl <vscale x 8 x i8> [[WIDE_MASKED_LOAD]], splat (i8 1)
; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds i8, ptr [[DST:%.*]], i64 [[TMP7]]
; CHECK-NEXT: [[TMP12:%.*]] = getelementptr inbounds i8, ptr [[TMP11]], i32 0
-; CHECK-NEXT: [[WIDE_MASKED_LOAD1:%.*]] = call <vscale x 4 x i8> @llvm.masked.load.nxv4i8.p0(ptr [[TMP12]], i32 1, <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], <vscale x 4 x i8> poison)
-; CHECK-NEXT: [[TMP13:%.*]] = add <vscale x 4 x i8> [[TMP10]], [[WIDE_MASKED_LOAD1]]
-; CHECK-NEXT: call void @llvm.masked.store.nxv4i8.p0(<vscale x 4 x i8> [[TMP13]], ptr [[TMP12]], i32 1, <vscale x 4 x i1> [[ACTIVE_LANE_MASK]])
+; CHECK-NEXT: [[WIDE_MASKED_LOAD1:%.*]] = call <vscale x 8 x i8> @llvm.masked.load.nxv8i8.p0(ptr [[TMP12]], i32 1, <vscale x 8 x i1> [[ACTIVE_LANE_MASK]], <vscale x 8 x i8> poison)
+; CHECK-NEXT: [[TMP13:%.*]] = add <vscale x 8 x i8> [[TMP10]], [[WIDE_MASKED_LOAD1]]
+; CHECK-NEXT: call void @llvm.masked.store.nxv8i8.p0(<vscale x 8 x i8> [[TMP13]], ptr [[TMP12]], i32 1, <vscale x 8 x i1> [[ACTIVE_LANE_MASK]])
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP6]]
; CHECK-NEXT: br i1 true, label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP6:![0-9]+]]
; CHECK: middle.block:
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/short-trip-count.ll b/llvm/test/Transforms/LoopVectorize/RISCV/short-trip-count.ll
index 375278eea38f97..ac233479cb2de8 100644
--- a/llvm/test/Transforms/LoopVectorize/RISCV/short-trip-count.ll
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/short-trip-count.ll
@@ -7,22 +7,24 @@ define void @small_trip_count_min_vlen_128(ptr nocapture %a) nounwind vscale_ran
; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
; CHECK: vector.ph:
; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.vscale.i32()
-; CHECK-NEXT: [[TMP1:%.*]] = sub i32 [[TMP0]], 1
+; CHECK-NEXT: [[TMP6:%.*]] = mul i32 [[TMP0]], 2
+; CHECK-NEXT: [[TMP1:%.*]] = sub i32 [[TMP6]], 1
; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 4, [[TMP1]]
-; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i32 [[N_RND_UP]], [[TMP0]]
+; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i32 [[N_RND_UP]], [[TMP6]]
; CHECK-NEXT: [[N_VEC:%.*]] = sub i32 [[N_RND_UP]], [[N_MOD_VF]]
; CHECK-NEXT: [[TMP2:%.*]] = call i32 @llvm.vscale.i32()
+; CHECK-NEXT: [[TMP7:%.*]] = mul i32 [[TMP2]], 2
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:
; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[TMP3:%.*]] = add i32 [[INDEX]], 0
-; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 1 x i1> @llvm.get.active.lane.mask.nxv1i1.i32(i32 [[TMP3]], i32 4)
+; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i32(i32 [[TMP3]], i32 4)
; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds i32, ptr [[A:%.*]], i32 [[TMP3]]
; CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds i32, ptr [[TMP4]], i32 0
-; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 1 x i32> @llvm.masked.load.nxv1i32.p0(ptr [[TMP5]], i32 4, <vscale x 1 x i1> [[ACTIVE_LANE_MASK]], <vscale x 1 x i32> poison)
-; CHECK-NEXT: [[TMP6:%.*]] = add nsw <vscale x 1 x i32> [[WIDE_MASKED_LOAD]], splat (i32 1)
-; CHECK-NEXT: call void @llvm.masked.store.nxv1i32.p0(<vscale x 1 x i32> [[TMP6]], ptr [[TMP5]], i32 4, <vscale x 1 x i1> [[ACTIVE_LANE_MASK]])
-; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i32 [[INDEX]], [[TMP2]]
+; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 2 x i32> @llvm.masked.load.nxv2i32.p0(ptr [[TMP5]], i32 4, <vscale x 2 x i1> [[ACTIVE_LANE_MASK]], <vscale x 2 x i32> poison)
+; CHECK-NEXT: [[TMP8:%.*]] = add nsw <vscale x 2 x i32> [[WIDE_MASKED_LOAD]], splat (i32 1)
+; CHECK-NEXT: call void @llvm.masked.store.nxv2i32.p0(<vscale x 2 x i32> [[TMP8]], ptr [[TMP5]], i32 4, <vscale x 2 x i1> [[ACTIVE_LANE_MASK]])
+; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i32 [[INDEX]], [[TMP7]]
; CHECK-NEXT: br i1 true, label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
; CHECK: middle.block:
; CHECK-NEXT: br i1 true, label [[EXIT:%.*]], label [[SCALAR_PH]]
diff --git a/llvm/test/Transforms/LoopVectorize/X86/conversion-cost.ll b/llvm/test/Transforms/LoopVectorize/X86/conversion-cost.ll
index 15bdbea612a705..be143a19e9111b 100644
--- a/llvm/test/Transforms/LoopVectorize/X86/conversion-cost.ll
+++ b/llvm/test/Transforms/LoopVectorize/X86/conversion-cost.ll
@@ -11,7 +11,7 @@ define i32 @conversion_cost1(i32 %n, ptr nocapture %A, ptr nocapture %B) nounwin
; CHECK: iter.check:
; CHECK-NEXT: [[TMP2:%.*]] = add i32 [[N]], -3
; CHECK-NEXT: [[TMP3:%.*]] = zext i32 [[TMP2]] to i64
-; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP3]], 4
+; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP3]], 8
; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH:%.*]], label [[VECTOR_MAIN_LOOP_ITER_CHECK:%.*]]
; CHECK: vector.main.loop.iter.check:
; CHECK-NEXT: [[MIN_ITERS_CHECK1:%.*]] = icmp ult i64 [[TMP3]], 32
@@ -39,29 +39,29 @@ define i32 @conversion_cost1(i32 %n, ptr nocapture %A, ptr nocapture %B) nounwin
; CHECK: vec.epilog.iter.check:
; CHECK-NEXT: [[IND_END5:%.*]] = add i64 3, [[N_VEC]]
; CHECK-NEXT: [[N_VEC_REMAINING:%.*]] = sub i64 [[TMP3]], [[N_VEC]]
-; CHECK-NEXT: [[MIN_EPILOG_ITERS_CHECK:%.*]] = icmp ult i64 [[N_VEC_REMAINING]], 4
+; CHECK-NEXT: [[MIN_EPILOG_ITERS_CHECK:%.*]] = icmp ult i64 [[N_VEC_REMAINING]], 8
; CHECK-NEXT: br i1 [[MIN_EPILOG_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH]], label [[VEC_EPILOG_PH]]
; CHECK: vec.epilog.ph:
; CHECK-NEXT: [[VEC_EPILOG_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[VECTOR_MAIN_LOOP_ITER_CHECK]] ]
; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[IND_END]], [[VEC_EPILOG_ITER_CHECK]] ], [ 3, [[VECTOR_MAIN_LOOP_ITER_CHECK]] ]
-; CHECK-NEXT: [[N_MOD_VF2:%.*]] = urem i64 [[TMP3]], 4
+; CHECK-NEXT: [[N_MOD_VF2:%.*]] = urem i64 [[TMP3]], 8
; CHECK-NEXT: [[N_VEC3:%.*]] = sub i64 [[TMP3]], [[N_MOD_VF2]]
; CHECK-NEXT: [[IND_END4:%.*]] = add i64 3, [[N_VEC3]]
; CHECK-NEXT: [[TMP8:%.*]] = trunc i64 [[BC_RESUME_VAL]] to i8
-; CHECK-NEXT: [[DOTSPLATINSERT:%.*]] = insertelement <4 x i8> poison, i8 [[TMP8]], i64 0
-; CHECK-NEXT: [[DOTSPLAT:%.*]] = shufflevector <4 x i8> [[DOTSPLATINSERT]], <4 x i8> poison, <4 x i32> zeroinitializer
-; CHECK-NEXT: [[INDUCTION:%.*]] = add <4 x i8> [[DOTSPLAT]], <i8 0, i8 1, i8 2, i8 3>
+; CHECK-NEXT: [[DOTSPLATINSERT:%.*]] = insertelement <8 x i8> poison, i8 [[TMP8]], i64 0
+; CHECK-NEXT: [[DOTSPLAT:%.*]] = shufflevector <8 x i8> [[DOTSPLATINSERT]], <8 x i8> poison, <8 x i32> zeroinitializer
+; CHECK-NEXT: [[INDUCTION:%.*]] = add <8 x i8> [[DOTSPLAT]], <i8 0, i8 1, i8 2, i8 3, i8 4, i8 5, i8 6, i8 7>
; CHECK-NEXT: br label [[VEC_EPILOG_VECTOR_BODY:%.*]]
; CHECK: vec.epilog.vector.body:
; CHECK-NEXT: [[INDEX7:%.*]] = phi i64 [ [[VEC_EPILOG_RESUME_VAL]], [[VEC_EPILOG_PH]] ], [ [[INDEX_NEXT11:%.*]], [[VEC_EPILOG_VECTOR_BODY]] ]
-; CHECK-NEXT: [[VEC_IND8:%.*]] = phi <4 x i8> [ [[INDUCTION]], [[VEC_EPILOG_PH]] ], [ [[VEC_IND_NEXT9:%.*]], [[VEC_EPILOG_VECTOR_BODY]] ]
+; CHECK-NEXT: [[VEC_IND5:%.*]] = phi <8 x i8> [ [[INDUCTION]], [[VEC_EPILOG_PH]] ], [ [[VEC_IND_NEXT6:%.*]], [[VEC_EPILOG_VECTOR_BODY]] ]
; CHECK-NEXT: [[OFFSET_IDX10:%.*]] = add i64 3, [[INDEX7]]
; CHECK-NEXT: [[TMP9:%.*]] = add i64 [[OFFSET_IDX10]], 0
; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[TMP9]]
; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds i8, ptr [[TMP10]], i32 0
-; CHECK-NEXT: store <4 x i8> [[VEC_IND8]], ptr [[TMP11]], align 1
-; CHECK-NEXT: [[INDEX_NEXT11]] = add nuw i64 [[INDEX7]], 4
-; CHECK-NEXT: [[VEC_IND_NEXT9]] = add <4 x i8> [[VEC_IND8]], splat (i8 4)
+; CHECK-NEXT: store <8 x i8> [[VEC_IND5]], ptr [[TMP11]], align 1
+; CHECK-NEXT: [[INDEX_NEXT11]] = add nuw i64 [[INDEX7]], 8
+; CHECK-NEXT: [[VEC_IND_NEXT6]] = add <8 x i8> [[VEC_IND5]], splat (i8 8)
; CHECK-NEXT: [[TMP12:%.*]] = icmp eq i64 [[INDEX_NEXT11]], [[N_VEC3]]
; CHECK-NEXT: br i1 [[TMP12]], label [[VEC_EPILOG_MIDDLE_BLOCK:%.*]], label [[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]
; CHECK: vec.epilog.middle.block:
diff --git a/llvm/test/Transforms/LoopVectorize/X86/cost-model.ll b/llvm/test/Transforms/LoopVectorize/X86/cost-model.ll
index 5c0aeb526e50c9..88a0600e6ad8fb 100644
--- a/llvm/test/Transforms/LoopVectorize/X86/cost-model.ll
+++ b/llvm/test/Transforms/LoopVectorize/X86/cost-model.ll
@@ -77,18 +77,18 @@ define float @PR27826(ptr nocapture readonly %a, ptr nocapture readonly %b, i32
; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP2]], 4
; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH:%.*]], label [[VECTOR_MAIN_LOOP_ITER_CHECK:%.*]]
; CHECK: vector.main.loop.iter.check:
-; CHECK-NEXT: [[MIN_ITERS_CHECK1:%.*]] = icmp ult i64 [[TMP2]], 16
+; CHECK-NEXT: [[MIN_ITERS_CHECK1:%.*]] = icmp ult i64 [[TMP2]], 32
; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK1]], label [[VEC_EPILOG_PH:%.*]], label [[VECTOR_PH:%.*]]
; CHECK: vector.ph:
-; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[TMP2]], 16
+; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[TMP2]], 32
; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[TMP2]], [[N_MOD_VF]]
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:
; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
-; CHECK-NEXT: [[VEC_PHI:%.*]] = phi <4 x float> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP119:%.*]], [[VECTOR_BODY]] ]
-; CHECK-NEXT: [[VEC_PHI2:%.*]] = phi <4 x float> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP120:%.*]], [[VECTOR_BODY]] ]
-; CHECK-NEXT: [[VEC_PHI3:%.*]] = phi <4 x float> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP121:%.*]], [[VECTOR_BODY]] ]
-; CHECK-NEXT: [[VEC_PHI4:%.*]] = phi <4 x float> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP122:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT: [[VEC_PHI:%.*]] = phi <8 x float> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP232:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT: [[VEC_PHI2:%.*]] = phi <8 x float> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP233:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT: [[VEC_PHI3:%.*]] = phi <8 x float> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP234:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT: [[VEC_PHI4:%.*]] = phi <8 x float> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP235:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[OFFSET_IDX:%.*]] = mul i64 [[INDEX]], 32
; CHECK-NEXT: [[TMP3:%.*]] = add i64 [[OFFSET_IDX]], 0
; CHECK-NEXT: [[TMP4:%.*]] = add i64 [[OFFSET_IDX]], 32
@@ -106,6 +106,22 @@ define float @PR27826(ptr nocapture readonly %a, ptr nocapture readonly %b, i32
; CHECK-NEXT: [[TMP16:%.*]] = add i64 [[OFFSET_IDX]], 416
; CHECK-NEXT: [[TMP17:%.*]] = add i64 [[OFFSET_IDX]], 448
; CHECK-NEXT: [[TMP18:%.*]] = add i64 [[OFFSET_IDX]], 480
+; CHECK-NEXT: [[TMP39:%.*]] = add i64 [[OFFSET_IDX]], 512
+; CHECK-NEXT: [[TMP40:%.*]] = add i64 [[OFFSET_IDX]], 544
+; CHECK-NEXT: [[TMP41:%.*]] = add i64 [[OFFSET_IDX]], 576
+; CHECK-NEXT: [[TMP42:%.*]] = add i64 [[OFFSET_IDX]], 608
+; CHECK-NEXT: [[TMP47:%.*]] = add i64 [[OFFSET_IDX]], 640
+; CHECK-NEXT: [[TMP48:%.*]] = add i64 [[OFFSET_IDX]], 672
+; CHECK-NEXT: [[TMP49:%.*]] = add i64 [[OFFSET_IDX]], 704
+; CHECK-NEXT: [[TMP50:%.*]] = add i64 [[OFFSET_IDX]], 736
+; CHECK-NEXT: [[TMP87:%.*]] = add i64 [[OFFSET_IDX]], 768
+; CHECK-NEXT: [[TMP88:%.*]] = add i64 [[OFFSET_IDX]], 800
+; CHECK-NEXT: [[TMP89:%.*]] = add i64 [[OFFSET_IDX]], 832
+; CHECK-NEXT: [[TMP90:%.*]] = add i64 [[OFFSET_IDX]], 864
+; CHECK-NEXT: [[TMP164:%.*]] = add i64 [[OFFSET_IDX]], 896
+; CHECK-NEXT: [[TMP165:%.*]] = add i64 [[OFFSET_IDX]], 928
+; CHECK-NEXT: [[TMP166:%.*]] = add i64 [[OFFSET_IDX]], 960
+; CHECK-NEXT: [[TMP167:%.*]] = add i64 [[OFFSET_IDX]], 992
; CHECK-NEXT: [[TMP19:%.*]] = getelementptr inbounds float, ptr [[A:%.*]], i64 [[TMP3]]
; CHECK-NEXT: [[TMP20:%.*]] = getelementptr inbounds float, ptr [[A]], i64 [[TMP4]]
; CHECK-NEXT: [[TMP21:%.*]] = getelementptr inbounds float, ptr [[A]], i64 [[TMP5]]
@@ -122,38 +138,86 @@ define float @PR27826(ptr nocapture readonly %a, ptr nocapture readonly %b, i32
; CHECK-NEXT: [[TMP32:%.*]] = getelementptr inbounds float, ptr [[A]], i64 [[TMP16]]
; CHECK-NEXT: [[TMP33:%.*]] = getelementptr inbounds float, ptr [[A]], i64 [[TMP17]]
; CHECK-NEXT: [[TMP34:%.*]] = getelementptr inbounds float, ptr [[A]], i64 [[TMP18]]
+; CHECK-NEXT: [[TMP168:%.*]] = getelementptr inbounds float, ptr [[A]], i64 [[TMP39]]
+; CHECK-NEXT: [[TMP169:%.*]] = getelementptr inbounds float, ptr [[A]], i64 [[TMP40]]
+; CHECK-NEXT: [[TMP170:%.*]] = getelementptr inbounds float, ptr [[A]], i64 [[TMP41]]
+; CHECK-NEXT: [[TMP55:%.*]] = getelementptr inbounds float, ptr [[A]], i64 [[TMP42]]
+; CHECK-NEXT: [[TMP56:%.*]] = getelementptr inbounds float, ptr [[A]], i64 [[TMP47]]
+; CHECK-NEXT: [[TMP57:%.*]] = getelementptr inbounds float, ptr [[A]], i64 [[TMP48]]
+; CHECK-NEXT: [[TMP58:%.*]] = getelementptr inbounds float, ptr [[A]], i64 [[TMP49]]
+; CHECK-NEXT: [[TMP171:%.*]] = getelementptr inbounds float, ptr [[A]], i64 [[TMP50]]
+; CHECK-NEXT: [[TMP180:%.*]] = getelementptr inbounds float, ptr [[A]], i64 [[TMP87]]
+; CHECK-NEXT: [[TMP181:%.*]] = getelementptr inbounds float, ptr [[A]], i64 [[TMP88]]
+; CHECK-NEXT: [[TMP182:%.*]] = getelementptr inbounds float, ptr [[A]], i64 [[TMP89]]
+; CHECK-NEXT: [[TMP63:%.*]] = getelementptr inbounds float, ptr [[A]], i64 [[TMP90]]
+; CHECK-NEXT: [[TMP64:%.*]] = getelementptr inbounds float, ptr [[A]], i64 [[TMP164]]
+; CHECK-NEXT: [[TMP65:%.*]] = getelementptr inbounds float, ptr [[A]], i64 [[TMP165]]
+; CHECK-NEXT: [[TMP66:%.*]] = getelementptr inbounds float, ptr [[A]], i64 [[TMP166]]
+; CHECK-NEXT: [[TMP183:%.*]] = getelementptr inbounds float, ptr [[A]], i64 [[TMP167]]
; CHECK-NEXT: [[TMP35:%.*]] = load float, ptr [[TMP19]], align 4
; CHECK-NEXT: [[TMP36:%.*]] = load float, ptr [[TMP20]], align 4
; CHECK-NEXT: [[TMP37:%.*]] = load float, ptr [[TMP21]], align 4
; CHECK-NEXT: [[TMP38:%.*]] = load float, ptr [[TMP22]], align 4
-; CHECK-NEXT: [[TMP39:%.*]] = insertelement <4 x float> poison, float [[TMP35]], i32 0
-; CHECK-NEXT: [[TMP40:%.*]] = insertelement <4 x float> [[TMP39]], float [[TMP36]], i32 1
-; CHECK-NEXT: [[TMP41:%.*]] = insertelement <4 x float> [[TMP40]], float [[TMP37]], i32 2
-; CHECK-NEXT: [[TMP42:%.*]] = insertelement <4 x float> [[TMP41]], float [[TMP38]], i32 3
; CHECK-NEXT: [[TMP43:%.*]] = load float, ptr [[TMP23]], align 4
; CHECK-NEXT: [[TMP44:%.*]] = load float, ptr [[TMP24]], align 4
; CHECK-NEXT: [[TMP45:%.*]] = load float, ptr [[TMP25]], align 4
; CHECK-NEXT: [[TMP46:%.*]] = load float, ptr [[TMP26]], align 4
-; CHECK-NEXT: [[TMP47:%.*]] = insertelement <4 x float> poison, float [[TMP43]], i32 0
-; CHECK-NEXT: [[TMP48:%.*]] = insertelement <4 x float> [[TMP47]], float [[TMP44]], i32 1
-; CHECK-NEXT: [[TMP49:%.*]] = insertelement <4 x float> [[TMP48]], float [[TMP45]], i32 2
-; CHECK-NEXT: [[TMP50:%.*]] = insertelement <4 x float> [[TMP49]], float [[TMP46]], i32 3
+; CHECK-NEXT: [[TMP184:%.*]] = insertelement <8 x float> poison, float [[TMP35]], i32 0
+; CHECK-NEXT: [[TMP185:%.*]] = insertelement <8 x float> [[TMP184]], float [[TMP36]], i32 1
+; CHECK-NEXT: [[TMP186:%.*]] = insertelement <8 x float> [[TMP185]], float [[TMP37]], i32 2
+; CHECK-NEXT: [[TMP187:%.*]] = insertelement <8 x float> [[TMP186]], float [[TMP38]], i32 3
+; CHECK-NEXT: [[TMP236:%.*]] = insertelement <8 x float> [[TMP187]], float [[TMP43]], i32 4
+; CHECK-NEXT: [[TMP237:%.*]] = insertelement <8 x float> [[TMP236]], float [[TMP44]], i32 5
+; CHECK-NEXT: [[TMP238:%.*]] = insertelement <8 x float> [[TMP237]], float [[TMP45]], i32 6
+; CHECK-NEXT: [[TMP239:%.*]] = insertelement <8 x float> [[TMP238]], float [[TMP46]], i32 7
; CHECK-NEXT: [[TMP51:%.*]] = load float, ptr [[TMP27]], align 4
; CHECK-NEXT: [[TMP52:%.*]] = load float, ptr [[TMP28]], align 4
; CHECK-NEXT: [[TMP53:%.*]] = load float, ptr [[TMP29]], align 4
; CHECK-NEXT: [[TMP54:%.*]] = load float, ptr [[TMP30]], align 4
-; CHECK-NEXT: [[TMP55:%.*]] = insertelement <4 x float> poison, float [[TMP51]], i32 0
-; CHECK-NEXT: [[TMP56:%.*]] = insertelement <4 x float> [[TMP55]], float [[TMP52]], i32 1
-; CHECK-NEXT: [[TMP57:%.*]] = insertelement <4 x float> [[TMP56]], float [[TMP53]], i32 2
-; CHECK-NEXT: [[TMP58:%.*]] = insertelement <4 x float> [[TMP57]], float [[TMP54]], i32 3
; CHECK-NEXT: [[TMP59:%.*]] = load float, ptr [[TMP31]], align 4
; CHECK-NEXT: [[TMP60:%.*]] = load float, ptr [[TMP32]], align 4
; CHECK-NEXT: [[TMP61:%.*]] = load float, ptr [[TMP33]], align 4
; CHECK-NEXT: [[TMP62:%.*]] = load float, ptr [[TMP34]], align 4
-; CHECK-NEXT: [[TMP63:%.*]] = insertelement <4 x float> poison, float [[TMP59]], i32 0
-; CHECK-NEXT: [[TMP64:%.*]] = insertelement <4 x float> [[TMP63]], float [[TMP60]], i32 1
-; CHECK-NEXT: [[TMP65:%.*]] = insertelement <4 x float> [[TMP64]], float [[TMP61]], i32 2
-; CHECK-NEXT: [[TMP66:%.*]] = insertelement <4 x float> [[TMP65]], float [[TMP62]], i32 3
+; CHECK-NEXT: [[TMP240:%.*]] = insertelement <8 x float> poison, float [[TMP51]], i32 0
+; CHECK-NEXT: [[TMP241:%.*]] = insertelement <8 x float> [[TMP240]], float [[TMP52]], i32 1
+; CHECK-NEXT: [[TMP242:%.*]] = insertelement <8 x float> [[TMP241]], float [[TMP53]], i32 2
+; CHECK-NEXT: [[TMP95:%.*]] = insertelement <8 x float> [[TMP242]], float [[TMP54]], i32 3
+; CHECK-NEXT: [[TMP96:%.*]] = insertelement <8 x float> [[TMP95]], float [[TMP59]], i32 4
+; CHECK-NEXT: [[TMP97:%.*]] = insertelement <8 x float> [[TMP96]], float [[TMP60]], i32 5
+; CHECK-NEXT: [[TMP98:%.*]] = insertelement <8 x float> [[TMP97]], float [[TMP61]], i32 6
+; CHECK-NEXT: [[TMP243:%.*]] = insertelement <8 x float> [[TMP98]], float [[TMP62]], i32 7
+; CHECK-NEXT: [[TMP244:%.*]] = load float, ptr [[TMP168]], align 4
+; CHECK-NEXT: [[TMP245:%.*]] = load float, ptr [[TMP169]], align 4
+; CHECK-NEXT: [[TMP246:%.*]] = load float, ptr [[TMP170]], align 4
+; CHECK-NEXT: [[TMP103:%.*]] = load float, ptr [[TMP55]], align 4
+; CHECK-NEXT: [[TMP104:%.*]] = load float, ptr [[TMP56]], align 4
+; CHECK-NEXT: [[TMP105:%.*]] = load float, ptr [[TMP57]], align 4
+; CHECK-NEXT: [[TMP106:%.*]] = load float, ptr [[TMP58]], align 4
+; CHECK-NEXT: [[TMP247:%.*]] = load float, ptr [[TMP171]], align 4
+; CHECK-NEXT: [[TMP248:%.*]] = insertelement <8 x float> poison, float [[TMP244]], i32 0
+; CHECK-NEXT: [[TMP249:%.*]] = insertelement <8 x float> [[TMP248]], float [[TMP245]], i32 1
+; CHECK-NEXT: [[TMP250:%.*]] = insertelement <8 x float> [[TMP249]], float [[TMP246]], i32 2
+; CHECK-NEXT: [[TMP111:%.*]] = insertelement <8 x float> [[TMP250]], float [[TMP103]], i32 3
+; CHECK-NEXT: [[TMP112:%.*]] = insertelement <8 x float> [[TMP111]], float [[TMP104]], i32 4
+; CHECK-NEXT: [[TMP113:%.*]] = insertelement <8 x float> [[TMP112]], float [[TMP105]], i32 5
+; CHECK-NEXT: [[TMP114:%.*]] = insertelement <8 x float> [[TMP113]], float [[TMP106]], i32 6
+; CHECK-NEXT: [[TMP115:%.*]] = insertelement <8 x float> [[TMP114]], float [[TMP247]], i32 7
+; CHECK-NEXT: [[TMP116:%.*]] = load float, ptr [[TMP180]], align 4
+; CHECK-NEXT: [[TMP117:%.*]] = load float, ptr [[TMP181]], align 4
+; CHECK-NEXT: [[TMP118:%.*]] = load float, ptr [[TMP182]], align 4
+; CHECK-NEXT: [[TMP119:%.*]] = load float, ptr [[TMP63]], align 4
+; CHECK-NEXT: [[TMP120:%.*]] = load float, ptr [[TMP64]], align 4
+; CHECK-NEXT: [[TMP121:%.*]] = load float, ptr [[TMP65]], align 4
+; CHECK-NEXT: [[TMP122:%.*]] = load float, ptr [[TMP66]], align 4
+; CHECK-NEXT: [[TMP251:%.*]] = load float, ptr [[TMP183]], align 4
+; CHECK-NEXT: [[TMP252:%.*]] = insertelement <8 x float> poison, float [[TMP116]], i32 0
+; CHECK-NEXT: [[TMP253:%.*]] = insertelement <8 x float> [[TMP252]], float [[TMP117]], i32 1
+; CHECK-NEXT: [[TMP254:%.*]] = insertelement <8 x float> [[TMP253]], float [[TMP118]], i32 2
+; CHECK-NEXT: [[TMP255:%.*]] = insertelement <8 x float> [[TMP254]], float [[TMP119]], i32 3
+; CHECK-NEXT: [[TMP256:%.*]] = insertelement <8 x float> [[TMP255]], float [[TMP120]], i32 4
+; CHECK-NEXT: [[TMP257:%.*]] = insertelement <8 x float> [[TMP256]], float [[TMP121]], i32 5
+; CHECK-NEXT: [[TMP258:%.*]] = insertelement <8 x float> [[TMP257]], float [[TMP122]], i32 6
+; CHECK-NEXT: [[TMP259:%.*]] = insertelement <8 x float> [[TMP258]], float [[TMP251]], i32 7
; CHECK-NEXT: [[TMP67:%.*]] = getelementptr inbounds float, ptr [[B:%.*]], i64 [[TMP3]]
; CHECK-NEXT: [[TMP68:%.*]] = getelementptr inbounds float, ptr [[B]], i64 [[TMP4]]
; CHECK-NEXT: [[TMP69:%.*]] = getelementptr inbounds float, ptr [[B]], i64 [[TMP5]]
@@ -170,54 +234,102 @@ define float @PR27826(ptr nocapture readonly %a, ptr nocapture readonly %b, i32
; CHECK-NEXT: [[TMP80:%.*]] = getelementptr inbounds float, ptr [[B]], i64 [[TMP16]]
; CHECK-NEXT: [[TMP81:%.*]] = getelementptr inbounds float, ptr [[B]], i64 [[TMP17]]
; CHECK-NEXT: [[TMP82:%.*]] = getelementptr inbounds float, ptr [[B]], i64 [[TMP18]]
+; CHECK-NEXT: [[TMP260:%.*]] = getelementptr inbounds float, ptr [[B]], i64 [[TMP39]]
+; CHECK-NEXT: [[TMP261:%.*]] = getelementptr inbounds float, ptr [[B]], i64 [[TMP40]]
+; CHECK-NEXT: [[TMP262:%.*]] = getelementptr inbounds float, ptr [[B]], i64 [[TMP41]]
+; CHECK-NEXT: [[TMP263:%.*]] = getelementptr inbounds float, ptr [[B]], i64 [[TMP42]]
+; CHECK-NEXT: [[TMP264:%.*]] = getelementptr inbounds float, ptr [[B]], i64 [[TMP47]]
+; CHECK-NEXT: [[TMP265:%.*]] = getelementptr inbounds float, ptr [[B]], i64 [[TMP48]]
+; CHECK-NEXT: [[TMP266:%.*]] = getelementptr inbounds float, ptr [[B]], i64 [[TMP49]]
+; CHECK-NEXT: [[TMP267:%.*]] = getelementptr inbounds float, ptr [[B]], i64 [[TMP50]]
+; CHECK-NEXT: [[TMP268:%.*]] = getelementptr inbounds float, ptr [[B]], i64 [[TMP87]]
+; CHECK-NEXT: [[TMP269:%.*]] = getelementptr inbounds float, ptr [[B]], i64 [[TMP88]]
+; CHECK-NEXT: [[TMP158:%.*]] = getelementptr inbounds float, ptr [[B]], i64 [[TMP89]]
+; CHECK-NEXT: [[TMP159:%.*]] = getelementptr inbounds float, ptr [[B]], i64 [[TMP90]]
+; CHECK-NEXT: [[TMP160:%.*]] = getelementptr inbounds float, ptr [[B]], i64 [[TMP164]]
+; CHECK-NEXT: [[TMP161:%.*]] = getelementptr inbounds float, ptr [[B]], i64 [[TMP165]]
+; CHECK-NEXT: [[TMP162:%.*]] = getelementptr inbounds float, ptr [[B]], i64 [[TMP166]]
+; CHECK-NEXT: [[TMP163:%.*]] = getelementptr inbounds float, ptr [[B]], i64 [[TMP167]]
; CHECK-NEXT: [[TMP83:%.*]] = load float, ptr [[TMP67]], align 4
; CHECK-NEXT: [[TMP84:%.*]] = load float, ptr [[TMP68]], align 4
; CHECK-NEXT: [[TMP85:%.*]] = load float, ptr [[TMP69]], align 4
; CHECK-NEXT: [[TMP86:%.*]] = load float, ptr [[TMP70]], align 4
-; CHECK-NEXT: [[TMP87:%.*]] = insertelement <4 x float> poison, float [[TMP83]], i32 0
-; CHECK-NEXT: [[TMP88:%.*]] = insertelement <4 x float> [[TMP87]], float [[TMP84]], i32 1
-; CHECK-NEXT: [[TMP89:%.*]] = insertelement <4 x float> [[TMP88]], float [[TMP85]], i32 2
-; CHECK-NEXT: [[TMP90:%.*]] = insertelement <4 x float> [[TMP89]], float [[TMP86]], i32 3
; CHECK-NEXT: [[TMP91:%.*]] = load float, ptr [[TMP71]], align 4
; CHECK-NEXT: [[TMP92:%.*]] = load float, ptr [[TMP72]], align 4
; CHECK-NEXT: [[TMP93:%.*]] = load float, ptr [[TMP73]], align 4
; CHECK-NEXT: [[TMP94:%.*]] = load float, ptr [[TMP74]], align 4
-; CHECK-NEXT: [[TMP95:%.*]] = insertelement <4 x float> poison, float [[TMP91]], i32 0
-; CHECK-NEXT: [[TMP96:%.*]] = insertelement <4 x float> [[TMP95]], float [[TMP92]], i32 1
-; CHECK-NEXT: [[TMP97:%.*]] = insertelement <4 x float> [[TMP96]], float [[TMP93]], i32 2
-; CHECK-NEXT: [[TMP98:%.*]] = insertelement <4 x float> [[TMP97]], float [[TMP94]], i32 3
+; CHECK-NEXT: [[TMP172:%.*]] = insertelement <8 x float> poison, float [[TMP83]], i32 0
+; CHECK-NEXT: [[TMP173:%.*]] = insertelement <8 x float> [[TMP172]], float [[TMP84]], i32 1
+; CHECK-NEXT: [[TMP174:%.*]] = insertelement <8 x float> [[TMP173]], float [[TMP85]], i32 2
+; CHECK-NEXT: [[TMP175:%.*]] = insertelement <8 x float> [[TMP174]], float [[TMP86]], i32 3
+; CHECK-NEXT: [[TMP176:%.*]] = insertelement <8 x float> [[TMP175]], float [[TMP91]], i32 4
+; CHECK-NEXT: [[TMP177:%.*]] = insertelement <8 x float> [[TMP176]], float [[TMP92]], i32 5
+; CHECK-NEXT: [[TMP178:%.*]] = insertelement <8 x float> [[TMP177]], float [[TMP93]], i32 6
+; CHECK-NEXT: [[TMP179:%.*]] = insertelement <8 x float> [[TMP178]], float [[TMP94]], i32 7
; CHECK-NEXT: [[TMP99:%.*]] = load float, ptr [[TMP75]], align 4
; CHECK-NEXT: [[TMP100:%.*]] = load float, ptr [[TMP76]], align 4
; CHECK-NEXT: [[TMP101:%.*]] = load float, ptr [[TMP77]], align 4
; CHECK-NEXT: [[TMP102:%.*]] = load float, ptr [[TMP78]], align 4
-; CHECK-NEXT: [[TMP103:%.*]] = insertelement <4 x float> poison, float [[TMP99]], i32 0
-; CHECK-NEXT: [[TMP104:%.*]] = insertelement <4 x float> [[TMP103]], float [[TMP100]], i32 1
-; CHECK-NEXT: [[TMP105:%.*]] = insertelement <4 x float> [[TMP104]], float [[TMP101]], i32 2
-; CHECK-NEXT: [[TMP106:%.*]] = insertelement <4 x float> [[TMP105]], float [[TMP102]], i32 3
; CHECK-NEXT: [[TMP107:%.*]] = load float, ptr [[TMP79]], align 4
; CHECK-NEXT: [[TMP108:%.*]] = load float, ptr [[TMP80]], align 4
; CHECK-NEXT: [[TMP109:%.*]] = load float, ptr [[TMP81]], align 4
; CHECK-NEXT: [[TMP110:%.*]] = load float, ptr [[TMP82]], align 4
-; CHECK-NEXT: [[TMP111:%.*]] = insertelement <4 x float> poison, float [[TMP107]], i32 0
-; CHECK-NEXT: [[TMP112:%.*]] = insertelement <4 x float> [[TMP111]], float [[TMP108]], i32 1
-; CHECK-NEXT: [[TMP113:%.*]] = insertelement <4 x float> [[TMP112]], float [[TMP109]], i32 2
-; CHECK-NEXT: [[TMP114:%.*]] = insertelement <4 x float> [[TMP113]], float [[TMP110]], i32 3
-; CHECK-NEXT: [[TMP115:%.*]] = fadd fast <4 x float> [[TMP42]], [[VEC_PHI]]
-; CHECK-NEXT: [[TMP116:%.*]] = fadd fast <4 x float> [[TMP50]], [[VEC_PHI2]]
-; CHECK-NEXT: [[TMP117:%.*]] = fadd fast <4 x float> [[TMP58]], [[VEC_PHI3]]
-; CHECK-NEXT: [[TMP118:%.*]] = fadd fast <4 x float> [[TMP66]], [[VEC_PHI4]]
-; CHECK-NEXT: [[TMP119]] = fadd fast <4 x float> [[TMP115]], [[TMP90]]
-; CHECK-NEXT: [[TMP120]] = fadd fast <4 x float> [[TMP116]], [[TMP98]]
-; CHECK-NEXT: [[TMP121]] = fadd fast <4 x float> [[TMP117]], [[TMP106]]
-; CHECK-NEXT: [[TMP122]] = fadd fast <4 x float> [[TMP118]], [[TMP114]]
-; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 16
+; CHECK-NEXT: [[TMP188:%.*]] = insertelement <8 x float> poison, float [[TMP99]], i32 0
+; CHECK-NEXT: [[TMP189:%.*]] = insertelement <8 x float> [[TMP188]], float [[TMP100]], i32 1
+; CHECK-NEXT: [[TMP190:%.*]] = insertelement <8 x float> [[TMP189]], float [[TMP101]], i32 2
+; CHECK-NEXT: [[TMP191:%.*]] = insertelement <8 x float> [[TMP190]], float [[TMP102]], i32 3
+; CHECK-NEXT: [[TMP192:%.*]] = insertelement <8 x float> [[TMP191]], float [[TMP107]], i32 4
+; CHECK-NEXT: [[TMP193:%.*]] = insertelement <8 x float> [[TMP192]], float [[TMP108]], i32 5
+; CHECK-NEXT: [[TMP194:%.*]] = insertelement <8 x float> [[TMP193]], float [[TMP109]], i32 6
+; CHECK-NEXT: [[TMP195:%.*]] = insertelement <8 x float> [[TMP194]], float [[TMP110]], i32 7
+; CHECK-NEXT: [[TMP196:%.*]] = load float, ptr [[TMP260]], align 4
+; CHECK-NEXT: [[TMP197:%.*]] = load float, ptr [[TMP261]], align 4
+; CHECK-NEXT: [[TMP198:%.*]] = load float, ptr [[TMP262]], align 4
+; CHECK-NEXT: [[TMP199:%.*]] = load float, ptr [[TMP263]], align 4
+; CHECK-NEXT: [[TMP200:%.*]] = load float, ptr [[TMP264]], align 4
+; CHECK-NEXT: [[TMP201:%.*]] = load float, ptr [[TMP265]], align 4
+; CHECK-NEXT: [[TMP202:%.*]] = load float, ptr [[TMP266]], align 4
+; CHECK-NEXT: [[TMP203:%.*]] = load float, ptr [[TMP267]], align 4
+; CHECK-NEXT: [[TMP204:%.*]] = insertelement <8 x float> poison, float [[TMP196]], i32 0
+; CHECK-NEXT: [[TMP205:%.*]] = insertelement <8 x float> [[TMP204]], float [[TMP197]], i32 1
+; CHECK-NEXT: [[TMP206:%.*]] = insertelement <8 x float> [[TMP205]], float [[TMP198]], i32 2
+; CHECK-NEXT: [[TMP207:%.*]] = insertelement <8 x float> [[TMP206]], float [[TMP199]], i32 3
+; CHECK-NEXT: [[TMP208:%.*]] = insertelement <8 x float> [[TMP207]], float [[TMP200]], i32 4
+; CHECK-NEXT: [[TMP209:%.*]] = insertelement <8 x float> [[TMP208]], float [[TMP201]], i32 5
+; CHECK-NEXT: [[TMP210:%.*]] = insertelement <8 x float> [[TMP209]], float [[TMP202]], i32 6
+; CHECK-NEXT: [[TMP211:%.*]] = insertelement <8 x float> [[TMP210]], float [[TMP203]], i32 7
+; CHECK-NEXT: [[TMP212:%.*]] = load float, ptr [[TMP268]], align 4
+; CHECK-NEXT: [[TMP213:%.*]] = load float, ptr [[TMP269]], align 4
+; CHECK-NEXT: [[TMP214:%.*]] = load float, ptr [[TMP158]], align 4
+; CHECK-NEXT: [[TMP215:%.*]] = load float, ptr [[TMP159]], align 4
+; CHECK-NEXT: [[TMP216:%.*]] = load float, ptr [[TMP160]], align 4
+; CHECK-NEXT: [[TMP217:%.*]] = load float, ptr [[TMP161]], align 4
+; CHECK-NEXT: [[TMP218:%.*]] = load float, ptr [[TMP162]], align 4
+; CHECK-NEXT: [[TMP219:%.*]] = load float, ptr [[TMP163]], align 4
+; CHECK-NEXT: [[TMP220:%.*]] = insertelement <8 x float> poison, float [[TMP212]], i32 0
+; CHECK-NEXT: [[TMP221:%.*]] = insertelement <8 x float> [[TMP220]], float [[TMP213]], i32 1
+; CHECK-NEXT: [[TMP222:%.*]] = insertelement <8 x float> [[TMP221]], float [[TMP214]], i32 2
+; CHECK-NEXT: [[TMP223:%.*]] = insertelement <8 x float> [[TMP222]], float [[TMP215]], i32 3
+; CHECK-NEXT: [[TMP224:%.*]] = insertelement <8 x float> [[TMP223]], float [[TMP216]], i32 4
+; CHECK-NEXT: [[TMP225:%.*]] = insertelement <8 x float> [[TMP224]], float [[TMP217]], i32 5
+; CHECK-NEXT: [[TMP226:%.*]] = insertelement <8 x float> [[TMP225]], float [[TMP218]], i32 6
+; CHECK-NEXT: [[TMP227:%.*]] = insertelement <8 x float> [[TMP226]], float [[TMP219]], i32 7
+; CHECK-NEXT: [[TMP228:%.*]] = fadd fast <8 x float> [[TMP239]], [[VEC_PHI]]
+; CHECK-NEXT: [[TMP229:%.*]] = fadd fast <8 x float> [[TMP243]], [[VEC_PHI2]]
+; CHECK-NEXT: [[TMP230:%.*]] = fadd fast <8 x float> [[TMP115]], [[VEC_PHI3]]
+; CHECK-NEXT: [[TMP231:%.*]] = fadd fast <8 x float> [[TMP259]], [[VEC_PHI4]]
+; CHECK-NEXT: [[TMP232]] = fadd fast <8 x float> [[TMP228]], [[TMP179]]
+; CHECK-NEXT: [[TMP233]] = fadd fast <8 x float> [[TMP229]], [[TMP195]]
+; CHECK-NEXT: [[TMP234]] = fadd fast <8 x float> [[TMP230]], [[TMP211]]
+; CHECK-NEXT: [[TMP235]] = fadd fast <8 x float> [[TMP231]], [[TMP227]]
+; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 32
; CHECK-NEXT: [[TMP123:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
; CHECK-NEXT: br i1 [[TMP123]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
; CHECK: middle.block:
-; CHECK-NEXT: [[BIN_RDX:%.*]] = fadd fast <4 x float> [[TMP120]], [[TMP119]]
-; CHECK-NEXT: [[BIN_RDX5:%.*]] = fadd fast <4 x float> [[TMP121]], [[BIN_RDX]]
-; CHECK-NEXT: [[BIN_RDX6:%.*]] = fadd fast <4 x float> [[TMP122]], [[BIN_RDX5]]
-; CHECK-NEXT: [[TMP124:%.*]] = call fast float @llvm.vector.reduce.fadd.v4f32(float 0.000000e+00, <4 x float> [[BIN_RDX6]])
+; CHECK-NEXT: [[BIN_RDX:%.*]] = fadd fast <8 x float> [[TMP233]], [[TMP232]]
+; CHECK-NEXT: [[BIN_RDX5:%.*]] = fadd fast <8 x float> [[TMP234]], [[BIN_RDX]]
+; CHECK-NEXT: [[BIN_RDX6:%.*]] = fadd fast <8 x float> [[TMP235]], [[BIN_RDX5]]
+; CHECK-NEXT: [[TMP124:%.*]] = call fast float @llvm.vector.reduce.fadd.v8f32(float 0.000000e+00, <8 x float> [[BIN_RDX6]])
; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[TMP2]], [[N_VEC]]
; CHECK-NEXT: br i1 [[CMP_N]], label [[LOOPEXIT:%.*]], label [[VEC_EPILOG_ITER_CHECK:%.*]]
; CHECK: vec.epilog.iter.check:
@@ -879,31 +991,38 @@ exit:
define i64 @cost_assume(ptr %end, i64 %N) {
; CHECK-LABEL: @cost_assume(
-; CHECK-NEXT: entry:
+; CHECK-NEXT: iter.check:
; CHECK-NEXT: [[END1:%.*]] = ptrtoint ptr [[END:%.*]] to i64
; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[END1]], -9
; CHECK-NEXT: [[TMP1:%.*]] = udiv i64 [[TMP0]], 9
; CHECK-NEXT: [[TMP2:%.*]] = add nuw nsw i64 [[TMP1]], 1
-; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP2]], 8
+; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP2]], 2
; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
+; CHECK: vector.main.loop.iter.check:
+; CHECK-NEXT: [[MIN_ITERS_CHECK2:%.*]] = icmp ult i64 [[TMP2]], 16
+; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK2]], label [[VEC_EPILOG_PH:%.*]], label [[VECTOR_PH1:%.*]]
; CHECK: vector.ph:
-; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[TMP2]], 8
+; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[TMP2]], 16
; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[TMP2]], [[N_MOD_VF]]
-; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <2 x i64> poison, i64 [[N:%.*]], i64 0
-; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <2 x i64> [[BROADCAST_SPLATINSERT]], <2 x i64> poison, <2 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP3:%.*]] = icmp ne <2 x i64> [[BROADCAST_SPLAT]], zeroinitializer
+; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i64> poison, i64 [[N:%.*]], i64 0
+; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i64> [[BROADCAST_SPLATINSERT]], <4 x i64> poison, <4 x i32> zeroinitializer
+; CHECK-NEXT: [[TMP3:%.*]] = icmp ne <4 x i64> [[BROADCAST_SPLAT]], zeroinitializer
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:
-; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
-; CHECK-NEXT: [[VEC_PHI:%.*]] = phi <2 x i64> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP7:%.*]], [[VECTOR_BODY]] ]
-; CHECK-NEXT: [[VEC_PHI2:%.*]] = phi <2 x i64> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP8:%.*]], [[VECTOR_BODY]] ]
-; CHECK-NEXT: [[VEC_PHI3:%.*]] = phi <2 x i64> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP9:%.*]], [[VECTOR_BODY]] ]
-; CHECK-NEXT: [[VEC_PHI4:%.*]] = phi <2 x i64> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP10:%.*]], [[VECTOR_BODY]] ]
-; CHECK-NEXT: [[TMP7]] = add <2 x i64> [[VEC_PHI]], splat (i64 1)
-; CHECK-NEXT: [[TMP8]] = add <2 x i64> [[VEC_PHI2]], splat (i64 1)
-; CHECK-NEXT: [[TMP9]] = add <2 x i64> [[VEC_PHI3]], splat (i64 1)
-; CHECK-NEXT: [[TMP10]] = add <2 x i64> [[VEC_PHI4]], splat (i64 1)
-; CHECK-NEXT: [[TMP11:%.*]] = extractelement <2 x i1> [[TMP3]], i32 0
+; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH1]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT: [[VEC_PHI:%.*]] = phi <4 x i64> [ zeroinitializer, [[VECTOR_PH1]] ], [ [[TMP4:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT: [[VEC_PHI3:%.*]] = phi <4 x i64> [ zeroinitializer, [[VECTOR_PH1]] ], [ [[TMP5:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT: [[VEC_PHI4:%.*]] = phi <4 x i64> [ zeroinitializer, [[VECTOR_PH1]] ], [ [[TMP6:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT: [[VEC_PHI5:%.*]] = phi <4 x i64> [ zeroinitializer, [[VECTOR_PH1]] ], [ [[TMP7:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT: [[TMP4]] = add <4 x i64> [[VEC_PHI]], splat (i64 1)
+; CHECK-NEXT: [[TMP5]] = add <4 x i64> [[VEC_PHI3]], splat (i64 1)
+; CHECK-NEXT: [[TMP6]] = add <4 x i64> [[VEC_PHI4]], splat (i64 1)
+; CHECK-NEXT: [[TMP7]] = add <4 x i64> [[VEC_PHI5]], splat (i64 1)
+; CHECK-NEXT: [[TMP11:%.*]] = extractelement <4 x i1> [[TMP3]], i32 0
+; CHECK-NEXT: tail call void @llvm.assume(i1 [[TMP11]])
+; CHECK-NEXT: tail call void @llvm.assume(i1 [[TMP11]])
+; CHECK-NEXT: tail call void @llvm.assume(i1 [[TMP11]])
+; CHECK-NEXT: tail call void @llvm.assume(i1 [[TMP11]])
; CHECK-NEXT: tail call void @llvm.assume(i1 [[TMP11]])
; CHECK-NEXT: tail call void @llvm.assume(i1 [[TMP11]])
; CHECK-NEXT: tail call void @llvm.assume(i1 [[TMP11]])
@@ -912,32 +1031,64 @@ define i64 @cost_assume(ptr %end, i64 %N) {
; CHECK-NEXT: tail call void @llvm.assume(i1 [[TMP11]])
; CHECK-NEXT: tail call void @llvm.assume(i1 [[TMP11]])
; CHECK-NEXT: tail call void @llvm.assume(i1 [[TMP11]])
-; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 8
+; CHECK-NEXT: tail call void @llvm.assume(i1 [[TMP11]])
+; CHECK-NEXT: tail call void @llvm.assume(i1 [[TMP11]])
+; CHECK-NEXT: tail call void @llvm.assume(i1 [[TMP11]])
+; CHECK-NEXT: tail call void @llvm.assume(i1 [[TMP11]])
+; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 16
; CHECK-NEXT: [[TMP13:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
; CHECK-NEXT: br i1 [[TMP13]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP20:![0-9]+]]
; CHECK: middle.block:
-; CHECK-NEXT: [[BIN_RDX:%.*]] = add <2 x i64> [[TMP8]], [[TMP7]]
-; CHECK-NEXT: [[BIN_RDX5:%.*]] = add <2 x i64> [[TMP9]], [[BIN_RDX]]
-; CHECK-NEXT: [[BIN_RDX6:%.*]] = add <2 x i64> [[TMP10]], [[BIN_RDX5]]
-; CHECK-NEXT: [[TMP14:%.*]] = call i64 @llvm.vector.reduce.add.v2i64(<2 x i64> [[BIN_RDX6]])
+; CHECK-NEXT: [[BIN_RDX:%.*]] = add <4 x i64> [[TMP5]], [[TMP4]]
+; CHECK-NEXT: [[BIN_RDX6:%.*]] = add <4 x i64> [[TMP6]], [[BIN_RDX]]
+; CHECK-NEXT: [[BIN_RDX7:%.*]] = add <4 x i64> [[TMP7]], [[BIN_RDX6]]
+; CHECK-NEXT: [[TMP10:%.*]] = call i64 @llvm.vector.reduce.add.v4i64(<4 x i64> [[BIN_RDX7]])
; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[TMP2]], [[N_VEC]]
-; CHECK-NEXT: br i1 [[CMP_N]], label [[EXIT:%.*]], label [[SCALAR_PH]]
-; CHECK: scalar.ph:
-; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
-; CHECK-NEXT: [[BC_MERGE_RDX:%.*]] = phi i64 [ [[TMP14]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY]] ]
+; CHECK-NEXT: br i1 [[CMP_N]], label [[EXIT:%.*]], label [[VEC_EPILOG_ITER_CHECK:%.*]]
+; CHECK: vec.epilog.iter.check:
+; CHECK-NEXT: [[N_VEC_REMAINING:%.*]] = sub i64 [[TMP2]], [[N_VEC]]
+; CHECK-NEXT: [[MIN_EPILOG_ITERS_CHECK:%.*]] = icmp ult i64 [[N_VEC_REMAINING]], 2
+; CHECK-NEXT: br i1 [[MIN_EPILOG_ITERS_CHECK]], label [[SCALAR_PH]], label [[VEC_EPILOG_PH]]
+; CHECK: vec.epilog.ph:
+; CHECK-NEXT: [[VEC_EPILOG_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[VECTOR_PH]] ]
+; CHECK-NEXT: [[BC_MERGE_RDX:%.*]] = phi i64 [ [[TMP10]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[VECTOR_PH]] ]
+; CHECK-NEXT: [[N_MOD_VF8:%.*]] = urem i64 [[TMP2]], 2
+; CHECK-NEXT: [[N_VEC9:%.*]] = sub i64 [[TMP2]], [[N_MOD_VF8]]
+; CHECK-NEXT: [[BROADCAST_SPLATINSERT10:%.*]] = insertelement <2 x i64> poison, i64 [[N]], i64 0
+; CHECK-NEXT: [[BROADCAST_SPLAT11:%.*]] = shufflevector <2 x i64> [[BROADCAST_SPLATINSERT10]], <2 x i64> poison, <2 x i32> zeroinitializer
+; CHECK-NEXT: [[TMP17:%.*]] = icmp ne <2 x i64> [[BROADCAST_SPLAT11]], zeroinitializer
+; CHECK-NEXT: [[TMP18:%.*]] = insertelement <2 x i64> zeroinitializer, i64 [[BC_MERGE_RDX]], i32 0
; CHECK-NEXT: br label [[LOOP:%.*]]
+; CHECK: vec.epilog.vector.body:
+; CHECK-NEXT: [[INDEX12:%.*]] = phi i64 [ [[VEC_EPILOG_RESUME_VAL]], [[VEC_EPILOG_PH]] ], [ [[INDEX_NEXT14:%.*]], [[LOOP]] ]
+; CHECK-NEXT: [[VEC_PHI13:%.*]] = phi <2 x i64> [ [[TMP18]], [[VEC_EPILOG_PH]] ], [ [[TMP19:%.*]], [[LOOP]] ]
+; CHECK-NEXT: [[TMP19]] = add <2 x i64> [[VEC_PHI13]], splat (i64 1)
+; CHECK-NEXT: [[TMP14:%.*]] = extractelement <2 x i1> [[TMP17]], i32 0
+; CHECK-NEXT: tail call void @llvm.assume(i1 [[TMP14]])
+; CHECK-NEXT: tail call void @llvm.assume(i1 [[TMP14]])
+; CHECK-NEXT: [[INDEX_NEXT14]] = add nuw i64 [[INDEX12]], 2
+; CHECK-NEXT: [[TMP20:%.*]] = icmp eq i64 [[INDEX_NEXT14]], [[N_VEC9]]
+; CHECK-NEXT: br i1 [[TMP20]], label [[VEC_EPILOG_MIDDLE_BLOCK:%.*]], label [[LOOP]], !llvm.loop [[LOOP21:![0-9]+]]
+; CHECK: vec.epilog.middle.block:
+; CHECK-NEXT: [[TMP16:%.*]] = call i64 @llvm.vector.reduce.add.v2i64(<2 x i64> [[TMP19]])
+; CHECK-NEXT: [[CMP_N15:%.*]] = icmp eq i64 [[TMP2]], [[N_VEC9]]
+; CHECK-NEXT: br i1 [[CMP_N15]], label [[EXIT]], label [[SCALAR_PH]]
+; CHECK: vec.epilog.scalar.ph:
+; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC9]], [[VEC_EPILOG_MIDDLE_BLOCK]] ], [ 0, [[ITER_CHECK:%.*]] ], [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ]
+; CHECK-NEXT: [[BC_MERGE_RDX16:%.*]] = phi i64 [ [[TMP16]], [[VEC_EPILOG_MIDDLE_BLOCK]] ], [ 0, [[ITER_CHECK]] ], [ [[TMP10]], [[VEC_EPILOG_ITER_CHECK]] ]
+; CHECK-NEXT: br label [[LOOP1:%.*]]
; CHECK: loop:
-; CHECK-NEXT: [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], [[LOOP]] ]
-; CHECK-NEXT: [[TMP15:%.*]] = phi i64 [ [[BC_MERGE_RDX]], [[SCALAR_PH]] ], [ [[TMP12:%.*]], [[LOOP]] ]
+; CHECK-NEXT: [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], [[LOOP1]] ]
+; CHECK-NEXT: [[TMP15:%.*]] = phi i64 [ [[BC_MERGE_RDX16]], [[SCALAR_PH]] ], [ [[TMP12:%.*]], [[LOOP1]] ]
; CHECK-NEXT: [[TMP12]] = add i64 [[TMP15]], 1
; CHECK-NEXT: [[IV_NEXT]] = add nsw i64 [[IV]], 1
; CHECK-NEXT: [[C:%.*]] = icmp ne i64 [[N]], 0
; CHECK-NEXT: tail call void @llvm.assume(i1 [[C]])
; CHECK-NEXT: [[GEP:%.*]] = getelementptr nusw [9 x i8], ptr null, i64 [[IV_NEXT]]
; CHECK-NEXT: [[EC:%.*]] = icmp eq ptr [[GEP]], [[END]]
-; CHECK-NEXT: br i1 [[EC]], label [[EXIT]], label [[LOOP]], !llvm.loop [[LOOP21:![0-9]+]]
+; CHECK-NEXT: br i1 [[EC]], label [[EXIT]], label [[LOOP1]], !llvm.loop [[LOOP22:![0-9]+]]
; CHECK: exit:
-; CHECK-NEXT: [[DOTLCSSA:%.*]] = phi i64 [ [[TMP12]], [[LOOP]] ], [ [[TMP14]], [[MIDDLE_BLOCK]] ]
+; CHECK-NEXT: [[DOTLCSSA:%.*]] = phi i64 [ [[TMP12]], [[LOOP1]] ], [ [[TMP10]], [[MIDDLE_BLOCK]] ], [ [[TMP16]], [[VEC_EPILOG_MIDDLE_BLOCK]] ]
; CHECK-NEXT: ret i64 [[DOTLCSSA]]
;
entry:
@@ -979,7 +1130,7 @@ define void @reduction_store(ptr noalias %src, ptr %dst, i1 %x) #2 {
; CHECK-NEXT: [[TMP12]] = and <4 x i32> [[VEC_PHI1]], [[TMP2]]
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 8
; CHECK-NEXT: [[TMP9:%.*]] = icmp eq i32 [[INDEX_NEXT]], 24
-; CHECK-NEXT: br i1 [[TMP9]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP22:![0-9]+]]
+; CHECK-NEXT: br i1 [[TMP9]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP23:![0-9]+]]
; CHECK: middle.block:
; CHECK-NEXT: [[BIN_RDX:%.*]] = and <4 x i32> [[TMP12]], [[TMP11]]
; CHECK-NEXT: [[TMP10:%.*]] = call i32 @llvm.vector.reduce.and.v4i32(<4 x i32> [[BIN_RDX]])
@@ -1003,7 +1154,7 @@ define void @reduction_store(ptr noalias %src, ptr %dst, i1 %x) #2 {
; CHECK-NEXT: store i32 [[RED_NEXT]], ptr [[DST]], align 4
; CHECK-NEXT: [[IV_NEXT]] = add i32 [[IV]], 1
; CHECK-NEXT: [[EC:%.*]] = icmp eq i32 [[IV]], 29
-; CHECK-NEXT: br i1 [[EC]], label [[EXIT]], label [[LOOP]], !llvm.loop [[LOOP23:![0-9]+]]
+; CHECK-NEXT: br i1 [[EC]], label [[EXIT]], label [[LOOP]], !llvm.loop [[LOOP24:![0-9]+]]
; CHECK: exit:
; CHECK-NEXT: ret void
;
@@ -1053,7 +1204,7 @@ define i64 @live_in_known_1_via_scev() {
; CHECK-NEXT: [[TMP1:%.*]] = select <4 x i1> [[TMP0]], <4 x i64> [[VEC_PHI]], <4 x i64> [[VEC_PHI]]
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 4
; CHECK-NEXT: [[TMP2:%.*]] = icmp eq i32 [[INDEX_NEXT]], 8
-; CHECK-NEXT: br i1 [[TMP2]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP24:![0-9]+]]
+; CHECK-NEXT: br i1 [[TMP2]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP25:![0-9]+]]
; CHECK: middle.block:
; CHECK-NEXT: [[TMP3:%.*]] = call i64 @llvm.vector.reduce.mul.v4i64(<4 x i64> [[TMP1]])
; CHECK-NEXT: br i1 true, label [[EXIT:%.*]], label [[SCALAR_PH]]
@@ -1067,7 +1218,7 @@ define i64 @live_in_known_1_via_scev() {
; CHECK-NEXT: [[RED_MUL]] = mul nsw i64 [[RED]], [[P_EXT]]
; CHECK-NEXT: [[IV_NEXT]] = add nuw nsw i32 [[IV]], 1
; CHECK-NEXT: [[EC:%.*]] = icmp eq i32 [[IV_NEXT]], [[N]]
-; CHECK-NEXT: br i1 [[EC]], label [[EXIT]], label [[LOOP]], !llvm.loop [[LOOP25:![0-9]+]]
+; CHECK-NEXT: br i1 [[EC]], label [[EXIT]], label [[LOOP]], !llvm.loop [[LOOP26:![0-9]+]]
; CHECK: exit:
; CHECK-NEXT: [[RES:%.*]] = phi i64 [ [[RED_MUL]], [[LOOP]] ], [ [[TMP3]], [[MIDDLE_BLOCK]] ]
; CHECK-NEXT: ret i64 [[RES]]
@@ -1114,7 +1265,7 @@ define i64 @cost_loop_invariant_recipes(i1 %x, i64 %y) {
; CHECK-NEXT: [[VEC_PHI:%.*]] = phi <2 x i64> [ splat (i64 1), [[VECTOR_PH]] ], [ [[TMP3:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[TMP3]] = mul <2 x i64> [[TMP2]], [[VEC_PHI]]
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 2
-; CHECK-NEXT: br i1 true, label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP26:![0-9]+]]
+; CHECK-NEXT: br i1 true, label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP27:![0-9]+]]
; CHECK: middle.block:
; CHECK-NEXT: [[TMP4:%.*]] = call i64 @llvm.vector.reduce.mul.v2i64(<2 x i64> [[TMP3]])
; CHECK-NEXT: br i1 true, label [[EXIT:%.*]], label [[SCALAR_PH]]
@@ -1131,7 +1282,7 @@ define i64 @cost_loop_invariant_recipes(i1 %x, i64 %y) {
; CHECK-NEXT: [[RED_MUL]] = mul i64 [[SHL]], [[RED]]
; CHECK-NEXT: [[IV_NEXT_I_I_I]] = add i64 [[IV]], 1
; CHECK-NEXT: [[EC:%.*]] = icmp eq i64 [[IV]], 1
-; CHECK-NEXT: br i1 [[EC]], label [[EXIT]], label [[LOOP]], !llvm.loop [[LOOP27:![0-9]+]]
+; CHECK-NEXT: br i1 [[EC]], label [[EXIT]], label [[LOOP]], !llvm.loop [[LOOP28:![0-9]+]]
; CHECK: exit:
; CHECK-NEXT: [[RED_MUL_LCSSA:%.*]] = phi i64 [ [[RED_MUL]], [[LOOP]] ], [ [[TMP4]], [[MIDDLE_BLOCK]] ]
; CHECK-NEXT: ret i64 [[RED_MUL_LCSSA]]
@@ -1178,7 +1329,7 @@ define i32 @narrowed_reduction(ptr %a, i1 %cmp) #0 {
; CHECK-NEXT: [[TMP7]] = zext <16 x i1> [[TMP5]] to <16 x i32>
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 32
; CHECK-NEXT: [[TMP8:%.*]] = icmp eq i32 [[INDEX_NEXT]], 0
-; CHECK-NEXT: br i1 [[TMP8]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP28:![0-9]+]]
+; CHECK-NEXT: br i1 [[TMP8]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP29:![0-9]+]]
; CHECK: middle.block:
; CHECK-NEXT: [[TMP9:%.*]] = trunc <16 x i32> [[TMP6]] to <16 x i1>
; CHECK-NEXT: [[TMP10:%.*]] = trunc <16 x i32> [[TMP7]] to <16 x i1>
@@ -1197,7 +1348,7 @@ define i32 @narrowed_reduction(ptr %a, i1 %cmp) #0 {
; CHECK-NEXT: [[OR]] = or i32 [[AND]], [[CONV]]
; CHECK-NEXT: [[INC]] = add i32 [[IV]], 1
; CHECK-NEXT: [[EC:%.*]] = icmp eq i32 [[IV]], 0
-; CHECK-NEXT: br i1 [[EC]], label [[EXIT]], label [[LOOP]], !llvm.loop [[LOOP29:![0-9]+]]
+; CHECK-NEXT: br i1 [[EC]], label [[EXIT]], label [[LOOP]], !llvm.loop [[LOOP30:![0-9]+]]
; CHECK: exit:
; CHECK-NEXT: [[OR_LCSSA:%.*]] = phi i32 [ [[OR]], [[LOOP]] ], [ [[TMP12]], [[MIDDLE_BLOCK]] ]
; CHECK-NEXT: ret i32 [[OR_LCSSA]]
diff --git a/llvm/test/Transforms/LoopVectorize/X86/float-induction-x86.ll b/llvm/test/Transforms/LoopVectorize/X86/float-induction-x86.ll
index fc6059d036cd07..87bbf68b4e2e84 100644
--- a/llvm/test/Transforms/LoopVectorize/X86/float-induction-x86.ll
+++ b/llvm/test/Transforms/LoopVectorize/X86/float-induction-x86.ll
@@ -21,7 +21,7 @@ define void @fp_iv_loop1(ptr noalias nocapture %A, i32 %N) #0 {
; AUTO_VEC-NEXT: br i1 [[CMP4]], label [[ITER_CHECK:%.*]], label [[FOR_END:%.*]]
; AUTO_VEC: iter.check:
; AUTO_VEC-NEXT: [[ZEXT:%.*]] = zext nneg i32 [[N]] to i64
-; AUTO_VEC-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[N]], 4
+; AUTO_VEC-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[N]], 8
; AUTO_VEC-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[FOR_BODY:%.*]], label [[VECTOR_MAIN_LOOP_ITER_CHECK:%.*]]
; AUTO_VEC: vector.main.loop.iter.check:
; AUTO_VEC-NEXT: [[MIN_ITERS_CHECK1:%.*]] = icmp ult i32 [[N]], 32
@@ -57,27 +57,27 @@ define void @fp_iv_loop1(ptr noalias nocapture %A, i32 %N) #0 {
; AUTO_VEC-NEXT: [[DOTCAST7:%.*]] = uitofp nneg i64 [[N_VEC]] to float
; AUTO_VEC-NEXT: [[TMP6:%.*]] = fmul fast float [[DOTCAST7]], 5.000000e-01
; AUTO_VEC-NEXT: [[IND_END8:%.*]] = fadd fast float [[TMP6]], 1.000000e+00
-; AUTO_VEC-NEXT: [[N_VEC_REMAINING:%.*]] = and i64 [[ZEXT]], 28
+; AUTO_VEC-NEXT: [[N_VEC_REMAINING:%.*]] = and i64 [[ZEXT]], 24
; AUTO_VEC-NEXT: [[MIN_EPILOG_ITERS_CHECK:%.*]] = icmp eq i64 [[N_VEC_REMAINING]], 0
; AUTO_VEC-NEXT: br i1 [[MIN_EPILOG_ITERS_CHECK]], label [[FOR_BODY]], label [[VEC_EPILOG_PH]]
; AUTO_VEC: vec.epilog.ph:
; AUTO_VEC-NEXT: [[VEC_EPILOG_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[VECTOR_MAIN_LOOP_ITER_CHECK]] ]
; AUTO_VEC-NEXT: [[BC_RESUME_VAL:%.*]] = phi float [ [[IND_END]], [[VEC_EPILOG_ITER_CHECK]] ], [ 1.000000e+00, [[VECTOR_MAIN_LOOP_ITER_CHECK]] ]
-; AUTO_VEC-NEXT: [[N_VEC3:%.*]] = and i64 [[ZEXT]], 2147483644
+; AUTO_VEC-NEXT: [[N_VEC3:%.*]] = and i64 [[ZEXT]], 2147483640
; AUTO_VEC-NEXT: [[DOTCAST5:%.*]] = uitofp nneg i64 [[N_VEC3]] to float
; AUTO_VEC-NEXT: [[TMP7:%.*]] = fmul fast float [[DOTCAST5]], 5.000000e-01
; AUTO_VEC-NEXT: [[IND_END6:%.*]] = fadd fast float [[TMP7]], 1.000000e+00
-; AUTO_VEC-NEXT: [[DOTSPLATINSERT:%.*]] = insertelement <4 x float> poison, float [[BC_RESUME_VAL]], i64 0
-; AUTO_VEC-NEXT: [[DOTSPLAT:%.*]] = shufflevector <4 x float> [[DOTSPLATINSERT]], <4 x float> poison, <4 x i32> zeroinitializer
-; AUTO_VEC-NEXT: [[INDUCTION:%.*]] = fadd fast <4 x float> [[DOTSPLAT]], <float 0.000000e+00, float 5.000000e-01, float 1.000000e+00, float 1.500000e+00>
+; AUTO_VEC-NEXT: [[DOTSPLATINSERT:%.*]] = insertelement <8 x float> poison, float [[BC_RESUME_VAL]], i64 0
+; AUTO_VEC-NEXT: [[DOTSPLAT:%.*]] = shufflevector <8 x float> [[DOTSPLATINSERT]], <8 x float> poison, <8 x i32> zeroinitializer
+; AUTO_VEC-NEXT: [[INDUCTION:%.*]] = fadd fast <8 x float> [[DOTSPLAT]], <float 0.000000e+00, float 5.000000e-01, float 1.000000e+00, float 1.500000e+00, float 2.000000e+00, float 2.500000e+00, float 3.000000e+00, float 3.500000e+00>
; AUTO_VEC-NEXT: br label [[VEC_EPILOG_VECTOR_BODY:%.*]]
; AUTO_VEC: vec.epilog.vector.body:
; AUTO_VEC-NEXT: [[INDEX10:%.*]] = phi i64 [ [[VEC_EPILOG_RESUME_VAL]], [[VEC_EPILOG_PH]] ], [ [[INDEX_NEXT13:%.*]], [[VEC_EPILOG_VECTOR_BODY]] ]
-; AUTO_VEC-NEXT: [[VEC_IND11:%.*]] = phi <4 x float> [ [[INDUCTION]], [[VEC_EPILOG_PH]] ], [ [[VEC_IND_NEXT12:%.*]], [[VEC_EPILOG_VECTOR_BODY]] ]
+; AUTO_VEC-NEXT: [[VEC_IND7:%.*]] = phi <8 x float> [ [[INDUCTION]], [[VEC_EPILOG_PH]] ], [ [[VEC_IND_NEXT8:%.*]], [[VEC_EPILOG_VECTOR_BODY]] ]
; AUTO_VEC-NEXT: [[TMP8:%.*]] = getelementptr inbounds nuw float, ptr [[A]], i64 [[INDEX10]]
-; AUTO_VEC-NEXT: store <4 x float> [[VEC_IND11]], ptr [[TMP8]], align 4
-; AUTO_VEC-NEXT: [[INDEX_NEXT13]] = add nuw i64 [[INDEX10]], 4
-; AUTO_VEC-NEXT: [[VEC_IND_NEXT12]] = fadd fast <4 x float> [[VEC_IND11]], splat (float 2.000000e+00)
+; AUTO_VEC-NEXT: store <8 x float> [[VEC_IND7]], ptr [[TMP8]], align 4
+; AUTO_VEC-NEXT: [[INDEX_NEXT13]] = add nuw i64 [[INDEX10]], 8
+; AUTO_VEC-NEXT: [[VEC_IND_NEXT8]] = fadd fast <8 x float> [[VEC_IND7]], splat (float 4.000000e+00)
; AUTO_VEC-NEXT: [[TMP9:%.*]] = icmp eq i64 [[INDEX_NEXT13]], [[N_VEC3]]
; AUTO_VEC-NEXT: br i1 [[TMP9]], label [[VEC_EPILOG_MIDDLE_BLOCK:%.*]], label [[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]
; AUTO_VEC: vec.epilog.middle.block:
@@ -393,7 +393,7 @@ define void @fadd_reassoc_FMF(ptr nocapture %p, i32 %N) {
; AUTO_VEC-NEXT: br i1 [[CMP_NOT11]], label [[FOR_COND_CLEANUP:%.*]], label [[ITER_CHECK:%.*]]
; AUTO_VEC: iter.check:
; AUTO_VEC-NEXT: [[TMP0:%.*]] = zext i32 [[N]] to i64
-; AUTO_VEC-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[N]], 4
+; AUTO_VEC-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[N]], 8
; AUTO_VEC-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[FOR_BODY:%.*]], label [[VECTOR_MAIN_LOOP_ITER_CHECK:%.*]]
; AUTO_VEC: vector.main.loop.iter.check:
; AUTO_VEC-NEXT: [[MIN_ITERS_CHECK1:%.*]] = icmp ult i32 [[N]], 32
@@ -437,29 +437,29 @@ define void @fadd_reassoc_FMF(ptr nocapture %p, i32 %N) {
; AUTO_VEC-NEXT: [[DOTCAST10:%.*]] = uitofp nneg i64 [[N_VEC]] to float
; AUTO_VEC-NEXT: [[TMP11:%.*]] = fmul reassoc float [[DOTCAST10]], 4.200000e+01
; AUTO_VEC-NEXT: [[IND_END11:%.*]] = fadd reassoc float [[TMP11]], 1.000000e+00
-; AUTO_VEC-NEXT: [[N_VEC_REMAINING:%.*]] = and i64 [[TMP0]], 28
+; AUTO_VEC-NEXT: [[N_VEC_REMAINING:%.*]] = and i64 [[TMP0]], 24
; AUTO_VEC-NEXT: [[MIN_EPILOG_ITERS_CHECK:%.*]] = icmp eq i64 [[N_VEC_REMAINING]], 0
; AUTO_VEC-NEXT: br i1 [[MIN_EPILOG_ITERS_CHECK]], label [[FOR_BODY]], label [[VEC_EPILOG_PH]]
; AUTO_VEC: vec.epilog.ph:
; AUTO_VEC-NEXT: [[VEC_EPILOG_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[VECTOR_MAIN_LOOP_ITER_CHECK]] ]
; AUTO_VEC-NEXT: [[BC_RESUME_VAL:%.*]] = phi float [ [[IND_END]], [[VEC_EPILOG_ITER_CHECK]] ], [ 1.000000e+00, [[VECTOR_MAIN_LOOP_ITER_CHECK]] ]
-; AUTO_VEC-NEXT: [[N_VEC6:%.*]] = and i64 [[TMP0]], 4294967292
+; AUTO_VEC-NEXT: [[N_VEC6:%.*]] = and i64 [[TMP0]], 4294967288
; AUTO_VEC-NEXT: [[DOTCAST8:%.*]] = uitofp nneg i64 [[N_VEC6]] to float
; AUTO_VEC-NEXT: [[TMP12:%.*]] = fmul reassoc float [[DOTCAST8]], 4.200000e+01
; AUTO_VEC-NEXT: [[IND_END9:%.*]] = fadd reassoc float [[TMP12]], 1.000000e+00
-; AUTO_VEC-NEXT: [[DOTSPLATINSERT:%.*]] = insertelement <4 x float> poison, float [[BC_RESUME_VAL]], i64 0
-; AUTO_VEC-NEXT: [[DOTSPLAT:%.*]] = shufflevector <4 x float> [[DOTSPLATINSERT]], <4 x float> poison, <4 x i32> zeroinitializer
-; AUTO_VEC-NEXT: [[INDUCTION:%.*]] = fadd reassoc <4 x float> [[DOTSPLAT]], <float 0.000000e+00, float 4.200000e+01, float 8.400000e+01, float 1.260000e+02>
+; AUTO_VEC-NEXT: [[DOTSPLATINSERT:%.*]] = insertelement <8 x float> poison, float [[BC_RESUME_VAL]], i64 0
+; AUTO_VEC-NEXT: [[DOTSPLAT:%.*]] = shufflevector <8 x float> [[DOTSPLATINSERT]], <8 x float> poison, <8 x i32> zeroinitializer
+; AUTO_VEC-NEXT: [[INDUCTION:%.*]] = fadd reassoc <8 x float> [[DOTSPLAT]], <float 0.000000e+00, float 4.200000e+01, float 8.400000e+01, float 1.260000e+02, float 1.680000e+02, float 2.100000e+02, float 2.520000e+02, float 2.940000e+02>
; AUTO_VEC-NEXT: br label [[VEC_EPILOG_VECTOR_BODY:%.*]]
; AUTO_VEC: vec.epilog.vector.body:
; AUTO_VEC-NEXT: [[INDEX13:%.*]] = phi i64 [ [[VEC_EPILOG_RESUME_VAL]], [[VEC_EPILOG_PH]] ], [ [[INDEX_NEXT17:%.*]], [[VEC_EPILOG_VECTOR_BODY]] ]
-; AUTO_VEC-NEXT: [[VEC_IND14:%.*]] = phi <4 x float> [ [[INDUCTION]], [[VEC_EPILOG_PH]] ], [ [[VEC_IND_NEXT15:%.*]], [[VEC_EPILOG_VECTOR_BODY]] ]
+; AUTO_VEC-NEXT: [[VEC_IND10:%.*]] = phi <8 x float> [ [[INDUCTION]], [[VEC_EPILOG_PH]] ], [ [[VEC_IND_NEXT11:%.*]], [[VEC_EPILOG_VECTOR_BODY]] ]
; AUTO_VEC-NEXT: [[TMP13:%.*]] = getelementptr inbounds nuw float, ptr [[P]], i64 [[INDEX13]]
-; AUTO_VEC-NEXT: [[WIDE_LOAD16:%.*]] = load <4 x float>, ptr [[TMP13]], align 4
-; AUTO_VEC-NEXT: [[TMP14:%.*]] = fadd reassoc <4 x float> [[VEC_IND14]], [[WIDE_LOAD16]]
-; AUTO_VEC-NEXT: store <4 x float> [[TMP14]], ptr [[TMP13]], align 4
-; AUTO_VEC-NEXT: [[INDEX_NEXT17]] = add nuw i64 [[INDEX13]], 4
-; AUTO_VEC-NEXT: [[VEC_IND_NEXT15]] = fadd reassoc <4 x float> [[VEC_IND14]], splat (float 1.680000e+02)
+; AUTO_VEC-NEXT: [[WIDE_LOAD12:%.*]] = load <8 x float>, ptr [[TMP13]], align 4
+; AUTO_VEC-NEXT: [[TMP17:%.*]] = fadd reassoc <8 x float> [[VEC_IND10]], [[WIDE_LOAD12]]
+; AUTO_VEC-NEXT: store <8 x float> [[TMP17]], ptr [[TMP13]], align 4
+; AUTO_VEC-NEXT: [[INDEX_NEXT17]] = add nuw i64 [[INDEX13]], 8
+; AUTO_VEC-NEXT: [[VEC_IND_NEXT11]] = fadd reassoc <8 x float> [[VEC_IND10]], splat (float 3.360000e+02)
; AUTO_VEC-NEXT: [[TMP15:%.*]] = icmp eq i64 [[INDEX_NEXT17]], [[N_VEC6]]
; AUTO_VEC-NEXT: br i1 [[TMP15]], label [[VEC_EPILOG_MIDDLE_BLOCK:%.*]], label [[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP11:![0-9]+]]
; AUTO_VEC: vec.epilog.middle.block:
diff --git a/llvm/test/Transforms/LoopVectorize/X86/interleave-cost.ll b/llvm/test/Transforms/LoopVectorize/X86/interleave-cost.ll
index 5c9375eb1d17f4..5510ca51152f47 100644
--- a/llvm/test/Transforms/LoopVectorize/X86/interleave-cost.ll
+++ b/llvm/test/Transforms/LoopVectorize/X86/interleave-cost.ll
@@ -185,7 +185,7 @@ define void @geps_feeding_interleave_groups_with_reuse(ptr %arg, i64 %arg1, ptr
; CHECK-SAME: ptr [[ARG:%.*]], i64 [[ARG1:%.*]], ptr [[ARG2:%.*]]) #[[ATTR0:[0-9]+]] {
; CHECK-NEXT: [[ENTRY:.*]]:
; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[ARG1]], 1
-; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP0]], 54
+; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP0]], 56
; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_SCEVCHECK:.*]]
; CHECK: [[VECTOR_SCEVCHECK]]:
; CHECK-NEXT: [[SCEVGEP:%.*]] = getelementptr i8, ptr [[ARG2]], i64 8
@@ -235,7 +235,7 @@ define void @geps_feeding_interleave_groups_with_reuse(ptr %arg, i64 %arg1, ptr
; CHECK-NEXT: [[FOUND_CONFLICT:%.*]] = and i1 [[BOUND0]], [[BOUND1]]
; CHECK-NEXT: br i1 [[FOUND_CONFLICT]], label %[[SCALAR_PH]], label %[[VECTOR_PH:.*]]
; CHECK: [[VECTOR_PH]]:
-; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[TMP0]], 2
+; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[TMP0]], 4
; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[TMP0]], [[N_MOD_VF]]
; CHECK-NEXT: br label %[[VECTOR_BODY:.*]]
; CHECK: [[VECTOR_BODY]]:
@@ -245,29 +245,29 @@ define void @geps_feeding_interleave_groups_with_reuse(ptr %arg, i64 %arg1, ptr
; CHECK-NEXT: [[TMP26:%.*]] = getelementptr i8, ptr [[ARG]], i64 [[TMP25]]
; CHECK-NEXT: [[TMP27:%.*]] = shl i64 [[TMP24]], 4
; CHECK-NEXT: [[TMP28:%.*]] = getelementptr i8, ptr [[ARG2]], i64 [[TMP27]]
-; CHECK-NEXT: [[WIDE_VEC:%.*]] = load <16 x float>, ptr [[TMP26]], align 4
-; CHECK-NEXT: [[STRIDED_VEC:%.*]] = shufflevector <16 x float> [[WIDE_VEC]], <16 x float> poison, <2 x i32> <i32 0, i32 8>
-; CHECK-NEXT: [[STRIDED_VEC14:%.*]] = shufflevector <16 x float> [[WIDE_VEC]], <16 x float> poison, <2 x i32> <i32 1, i32 9>
-; CHECK-NEXT: [[STRIDED_VEC15:%.*]] = shufflevector <16 x float> [[WIDE_VEC]], <16 x float> poison, <2 x i32> <i32 2, i32 10>
-; CHECK-NEXT: [[STRIDED_VEC16:%.*]] = shufflevector <16 x float> [[WIDE_VEC]], <16 x float> poison, <2 x i32> <i32 3, i32 11>
-; CHECK-NEXT: [[STRIDED_VEC17:%.*]] = shufflevector <16 x float> [[WIDE_VEC]], <16 x float> poison, <2 x i32> <i32 4, i32 12>
-; CHECK-NEXT: [[STRIDED_VEC18:%.*]] = shufflevector <16 x float> [[WIDE_VEC]], <16 x float> poison, <2 x i32> <i32 5, i32 13>
-; CHECK-NEXT: [[STRIDED_VEC19:%.*]] = shufflevector <16 x float> [[WIDE_VEC]], <16 x float> poison, <2 x i32> <i32 6, i32 14>
-; CHECK-NEXT: [[STRIDED_VEC20:%.*]] = shufflevector <16 x float> [[WIDE_VEC]], <16 x float> poison, <2 x i32> <i32 7, i32 15>
-; CHECK-NEXT: [[TMP30:%.*]] = fadd <2 x float> [[STRIDED_VEC]], [[STRIDED_VEC17]]
-; CHECK-NEXT: [[TMP31:%.*]] = fmul <2 x float> [[TMP30]], zeroinitializer
-; CHECK-NEXT: [[TMP32:%.*]] = fadd <2 x float> [[STRIDED_VEC14]], [[STRIDED_VEC18]]
-; CHECK-NEXT: [[TMP33:%.*]] = fmul <2 x float> [[TMP32]], zeroinitializer
-; CHECK-NEXT: [[TMP34:%.*]] = fadd <2 x float> [[STRIDED_VEC15]], [[STRIDED_VEC19]]
-; CHECK-NEXT: [[TMP35:%.*]] = fmul <2 x float> [[TMP34]], zeroinitializer
-; CHECK-NEXT: [[TMP36:%.*]] = fadd <2 x float> [[STRIDED_VEC16]], [[STRIDED_VEC20]]
-; CHECK-NEXT: [[TMP37:%.*]] = fmul <2 x float> [[TMP36]], zeroinitializer
-; CHECK-NEXT: [[TMP40:%.*]] = shufflevector <2 x float> [[TMP31]], <2 x float> [[TMP33]], <4 x i32> <i32 0, i32 1, i32 2, i32 3>
-; CHECK-NEXT: [[TMP41:%.*]] = shufflevector <2 x float> [[TMP35]], <2 x float> [[TMP37]], <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+; CHECK-NEXT: [[WIDE_VEC:%.*]] = load <32 x float>, ptr [[TMP26]], align 4
+; CHECK-NEXT: [[STRIDED_VEC:%.*]] = shufflevector <32 x float> [[WIDE_VEC]], <32 x float> poison, <4 x i32> <i32 0, i32 8, i32 16, i32 24>
+; CHECK-NEXT: [[STRIDED_VEC14:%.*]] = shufflevector <32 x float> [[WIDE_VEC]], <32 x float> poison, <4 x i32> <i32 1, i32 9, i32 17, i32 25>
+; CHECK-NEXT: [[STRIDED_VEC15:%.*]] = shufflevector <32 x float> [[WIDE_VEC]], <32 x float> poison, <4 x i32> <i32 2, i32 10, i32 18, i32 26>
+; CHECK-NEXT: [[STRIDED_VEC16:%.*]] = shufflevector <32 x float> [[WIDE_VEC]], <32 x float> poison, <4 x i32> <i32 3, i32 11, i32 19, i32 27>
+; CHECK-NEXT: [[STRIDED_VEC17:%.*]] = shufflevector <32 x float> [[WIDE_VEC]], <32 x float> poison, <4 x i32> <i32 4, i32 12, i32 20, i32 28>
+; CHECK-NEXT: [[STRIDED_VEC18:%.*]] = shufflevector <32 x float> [[WIDE_VEC]], <32 x float> poison, <4 x i32> <i32 5, i32 13, i32 21, i32 29>
+; CHECK-NEXT: [[STRIDED_VEC19:%.*]] = shufflevector <32 x float> [[WIDE_VEC]], <32 x float> poison, <4 x i32> <i32 6, i32 14, i32 22, i32 30>
+; CHECK-NEXT: [[STRIDED_VEC20:%.*]] = shufflevector <32 x float> [[WIDE_VEC]], <32 x float> poison, <4 x i32> <i32 7, i32 15, i32 23, i32 31>
+; CHECK-NEXT: [[TMP29:%.*]] = fadd <4 x float> [[STRIDED_VEC]], [[STRIDED_VEC17]]
+; CHECK-NEXT: [[TMP40:%.*]] = fmul <4 x float> [[TMP29]], zeroinitializer
+; CHECK-NEXT: [[TMP31:%.*]] = fadd <4 x float> [[STRIDED_VEC14]], [[STRIDED_VEC18]]
+; CHECK-NEXT: [[TMP41:%.*]] = fmul <4 x float> [[TMP31]], zeroinitializer
+; CHECK-NEXT: [[TMP33:%.*]] = fadd <4 x float> [[STRIDED_VEC15]], [[STRIDED_VEC19]]
+; CHECK-NEXT: [[TMP34:%.*]] = fmul <4 x float> [[TMP33]], zeroinitializer
+; CHECK-NEXT: [[TMP35:%.*]] = fadd <4 x float> [[STRIDED_VEC16]], [[STRIDED_VEC20]]
+; CHECK-NEXT: [[TMP36:%.*]] = fmul <4 x float> [[TMP35]], zeroinitializer
; CHECK-NEXT: [[TMP42:%.*]] = shufflevector <4 x float> [[TMP40]], <4 x float> [[TMP41]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
-; CHECK-NEXT: [[INTERLEAVED_VEC:%.*]] = shufflevector <8 x float> [[TMP42]], <8 x float> poison, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 1, i32 3, i32 5, i32 7>
-; CHECK-NEXT: store <8 x float> [[INTERLEAVED_VEC]], ptr [[TMP28]], align 4
-; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 2
+; CHECK-NEXT: [[TMP38:%.*]] = shufflevector <4 x float> [[TMP34]], <4 x float> [[TMP36]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+; CHECK-NEXT: [[TMP39:%.*]] = shufflevector <8 x float> [[TMP42]], <8 x float> [[TMP38]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+; CHECK-NEXT: [[INTERLEAVED_VEC:%.*]] = shufflevector <16 x float> [[TMP39]], <16 x float> poison, <16 x i32> <i32 0, i32 4, i32 8, i32 12, i32 1, i32 5, i32 9, i32 13, i32 2, i32 6, i32 10, i32 14, i32 3, i32 7, i32 11, i32 15>
+; CHECK-NEXT: store <16 x float> [[INTERLEAVED_VEC]], ptr [[TMP28]], align 4
+; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4
; CHECK-NEXT: [[TMP43:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
; CHECK-NEXT: br i1 [[TMP43]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
; CHECK: [[MIDDLE_BLOCK]]:
diff --git a/llvm/test/Transforms/LoopVectorize/X86/invariant-store-vectorization.ll b/llvm/test/Transforms/LoopVectorize/X86/invariant-store-vectorization.ll
index adfffccb6bcac0..365ded4ad39aab 100644
--- a/llvm/test/Transforms/LoopVectorize/X86/invariant-store-vectorization.ll
+++ b/llvm/test/Transforms/LoopVectorize/X86/invariant-store-vectorization.ll
@@ -386,15 +386,22 @@ for.end: ; preds = %for.body
define void @test_store_of_final_reduction_value(i64 %x, ptr %dst) {
; CHECK-LABEL: @test_store_of_final_reduction_value(
; CHECK-NEXT: entry:
+; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
+; CHECK: vector.ph:
+; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <2 x i64> poison, i64 [[X:%.*]], i64 0
+; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <2 x i64> [[BROADCAST_SPLATINSERT]], <2 x i64> poison, <2 x i32> zeroinitializer
; CHECK-NEXT: br label [[LOOP:%.*]]
-; CHECK: loop:
-; CHECK-NEXT: [[IV4:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[IV_NEXT:%.*]], [[LOOP]] ]
-; CHECK-NEXT: [[RED:%.*]] = phi i64 [ 0, [[ENTRY]] ], [ [[RED_NEXT:%.*]], [[LOOP]] ]
-; CHECK-NEXT: [[RED_NEXT]] = mul i64 [[RED]], [[X:%.*]]
+; CHECK: vector.body:
+; CHECK-NEXT: br i1 true, label [[MIDDLE_BLOCK:%.*]], label [[LOOP]], !llvm.loop [[LOOP47:![0-9]+]]
+; CHECK: middle.block:
+; CHECK-NEXT: [[TMP0:%.*]] = mul nuw <2 x i64> [[BROADCAST_SPLAT]], <i64 0, i64 1>
+; CHECK-NEXT: [[RED_NEXT:%.*]] = call i64 @llvm.vector.reduce.mul.v2i64(<2 x i64> [[TMP0]])
; CHECK-NEXT: store i64 [[RED_NEXT]], ptr [[DST:%.*]], align 8
-; CHECK-NEXT: [[IV_NEXT]] = add i64 [[IV4]], 1
-; CHECK-NEXT: [[EC:%.*]] = icmp eq i64 [[IV4]], 1
-; CHECK-NEXT: br i1 [[EC]], label [[EXIT:%.*]], label [[LOOP]]
+; CHECK-NEXT: br i1 true, label [[EXIT:%.*]], label [[SCALAR_PH]]
+; CHECK: scalar.ph:
+; CHECK-NEXT: br label [[LOOP1:%.*]]
+; CHECK: loop:
+; CHECK-NEXT: br i1 poison, label [[EXIT]], label [[LOOP1]], !llvm.loop [[LOOP48:![0-9]+]]
; CHECK: exit:
; CHECK-NEXT: ret void
;
diff --git a/llvm/test/Transforms/LoopVectorize/X86/limit-vf-by-tripcount.ll b/llvm/test/Transforms/LoopVectorize/X86/limit-vf-by-tripcount.ll
index 50414cc29312c0..7434335b8934ef 100644
--- a/llvm/test/Transforms/LoopVectorize/X86/limit-vf-by-tripcount.ll
+++ b/llvm/test/Transforms/LoopVectorize/X86/limit-vf-by-tripcount.ll
@@ -216,22 +216,10 @@ define void @test_tc_20(ptr noalias %src, ptr noalias %dst) {
; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0
; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds i8, ptr [[SRC:%.*]], i64 [[TMP0]]
; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i8, ptr [[TMP1]], i32 0
-; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i8, ptr [[TMP1]], i32 4
-; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds i8, ptr [[TMP1]], i32 8
-; CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds i8, ptr [[TMP1]], i32 12
-; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i8>, ptr [[TMP2]], align 64
-; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <4 x i8>, ptr [[TMP3]], align 64
-; CHECK-NEXT: [[WIDE_LOAD2:%.*]] = load <4 x i8>, ptr [[TMP4]], align 64
-; CHECK-NEXT: [[WIDE_LOAD3:%.*]] = load <4 x i8>, ptr [[TMP5]], align 64
+; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <16 x i8>, ptr [[TMP2]], align 64
; CHECK-NEXT: [[TMP6:%.*]] = getelementptr inbounds i8, ptr [[DST:%.*]], i64 [[TMP0]]
; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds i8, ptr [[TMP6]], i32 0
-; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds i8, ptr [[TMP6]], i32 4
-; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds i8, ptr [[TMP6]], i32 8
-; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds i8, ptr [[TMP6]], i32 12
-; CHECK-NEXT: store <4 x i8> [[WIDE_LOAD]], ptr [[TMP7]], align 64
-; CHECK-NEXT: store <4 x i8> [[WIDE_LOAD1]], ptr [[TMP8]], align 64
-; CHECK-NEXT: store <4 x i8> [[WIDE_LOAD2]], ptr [[TMP9]], align 64
-; CHECK-NEXT: store <4 x i8> [[WIDE_LOAD3]], ptr [[TMP10]], align 64
+; CHECK-NEXT: store <16 x i8> [[WIDE_LOAD]], ptr [[TMP7]], align 64
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 16
; CHECK-NEXT: [[TMP11:%.*]] = icmp eq i64 [[INDEX_NEXT]], 16
; CHECK-NEXT: br i1 [[TMP11]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP10:![0-9]+]]
diff --git a/llvm/test/Transforms/LoopVectorize/X86/predicate-switch.ll b/llvm/test/Transforms/LoopVectorize/X86/predicate-switch.ll
index baacad482f9626..8026ada1482739 100644
--- a/llvm/test/Transforms/LoopVectorize/X86/predicate-switch.ll
+++ b/llvm/test/Transforms/LoopVectorize/X86/predicate-switch.ll
@@ -806,9 +806,45 @@ define void @switch4_default_common_dest_with_case(ptr %start, ptr %end) {
; COST-LABEL: define void @switch4_default_common_dest_with_case(
; COST-SAME: ptr [[START:%.*]], ptr [[END:%.*]]) #[[ATTR0]] {
; COST-NEXT: [[ENTRY:.*]]:
+; COST-NEXT: [[START2:%.*]] = ptrtoint ptr [[START]] to i64
+; COST-NEXT: [[END1:%.*]] = ptrtoint ptr [[END]] to i64
+; COST-NEXT: [[TMP0:%.*]] = add i64 [[END1]], -8
+; COST-NEXT: [[TMP1:%.*]] = sub i64 [[TMP0]], [[START2]]
+; COST-NEXT: [[TMP2:%.*]] = lshr i64 [[TMP1]], 3
+; COST-NEXT: [[TMP3:%.*]] = add nuw nsw i64 [[TMP2]], 1
+; COST-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP3]], 4
+; COST-NEXT: br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; COST: [[VECTOR_PH]]:
+; COST-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[TMP3]], 4
+; COST-NEXT: [[N_VEC:%.*]] = sub i64 [[TMP3]], [[N_MOD_VF]]
+; COST-NEXT: [[TMP4:%.*]] = mul i64 [[N_VEC]], 8
+; COST-NEXT: [[TMP5:%.*]] = getelementptr i8, ptr [[START]], i64 [[TMP4]]
+; COST-NEXT: br label %[[VECTOR_BODY:.*]]
+; COST: [[VECTOR_BODY]]:
+; COST-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; COST-NEXT: [[OFFSET_IDX:%.*]] = mul i64 [[INDEX]], 8
+; COST-NEXT: [[TMP6:%.*]] = add i64 [[OFFSET_IDX]], 0
+; COST-NEXT: [[NEXT_GEP:%.*]] = getelementptr i8, ptr [[START]], i64 [[TMP6]]
+; COST-NEXT: [[TMP7:%.*]] = getelementptr i64, ptr [[NEXT_GEP]], i32 0
+; COST-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i64>, ptr [[TMP7]], align 1
+; COST-NEXT: [[TMP8:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD]], splat (i64 -12)
+; COST-NEXT: [[TMP9:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD]], splat (i64 13)
+; COST-NEXT: [[TMP10:%.*]] = or <4 x i1> [[TMP8]], [[TMP9]]
+; COST-NEXT: [[TMP11:%.*]] = xor <4 x i1> [[TMP10]], splat (i1 true)
+; COST-NEXT: call void @llvm.masked.store.v4i64.p0(<4 x i64> zeroinitializer, ptr [[TMP7]], i32 1, <4 x i1> [[TMP9]])
+; COST-NEXT: call void @llvm.masked.store.v4i64.p0(<4 x i64> splat (i64 42), ptr [[TMP7]], i32 1, <4 x i1> [[TMP8]])
+; COST-NEXT: call void @llvm.masked.store.v4i64.p0(<4 x i64> splat (i64 2), ptr [[TMP7]], i32 1, <4 x i1> [[TMP11]])
+; COST-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4
+; COST-NEXT: [[TMP12:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; COST-NEXT: br i1 [[TMP12]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP8:![0-9]+]]
+; COST: [[MIDDLE_BLOCK]]:
+; COST-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[TMP3]], [[N_VEC]]
+; COST-NEXT: br i1 [[CMP_N]], label %[[EXIT:.*]], label %[[SCALAR_PH]]
+; COST: [[SCALAR_PH]]:
+; COST-NEXT: [[BC_RESUME_VAL:%.*]] = phi ptr [ [[TMP5]], %[[MIDDLE_BLOCK]] ], [ [[START]], %[[ENTRY]] ]
; COST-NEXT: br label %[[LOOP_HEADER:.*]]
; COST: [[LOOP_HEADER]]:
-; COST-NEXT: [[PTR_IV:%.*]] = phi ptr [ [[START]], %[[ENTRY]] ], [ [[PTR_IV_NEXT:%.*]], %[[LOOP_LATCH:.*]] ]
+; COST-NEXT: [[PTR_IV:%.*]] = phi ptr [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[PTR_IV_NEXT:%.*]], %[[LOOP_LATCH:.*]] ]
; COST-NEXT: [[L:%.*]] = load i64, ptr [[PTR_IV]], align 1
; COST-NEXT: switch i64 [[L]], label %[[DEFAULT:.*]] [
; COST-NEXT: i64 -12, label %[[IF_THEN_1:.*]]
@@ -827,7 +863,7 @@ define void @switch4_default_common_dest_with_case(ptr %start, ptr %end) {
; COST: [[LOOP_LATCH]]:
; COST-NEXT: [[PTR_IV_NEXT]] = getelementptr inbounds i64, ptr [[PTR_IV]], i64 1
; COST-NEXT: [[EC:%.*]] = icmp eq ptr [[PTR_IV_NEXT]], [[END]]
-; COST-NEXT: br i1 [[EC]], label %[[EXIT:.*]], label %[[LOOP_HEADER]]
+; COST-NEXT: br i1 [[EC]], label %[[EXIT]], label %[[LOOP_HEADER]], !llvm.loop [[LOOP9:![0-9]+]]
; COST: [[EXIT]]:
; COST-NEXT: ret void
;
@@ -977,7 +1013,7 @@ define void @switch_under_br_default_common_dest_with_case(ptr %start, ptr %end,
; COST-NEXT: call void @llvm.masked.store.v4i64.p0(<4 x i64> splat (i64 2), ptr [[TMP6]], i32 1, <4 x i1> [[TMP14]])
; COST-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4
; COST-NEXT: [[TMP16:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
-; COST-NEXT: br i1 [[TMP16]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP8:![0-9]+]]
+; COST-NEXT: br i1 [[TMP16]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP10:![0-9]+]]
; COST: [[MIDDLE_BLOCK]]:
; COST-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[TMP3]], [[N_VEC]]
; COST-NEXT: br i1 [[CMP_N]], label %[[EXIT:.*]], label %[[SCALAR_PH]]
@@ -1007,7 +1043,7 @@ define void @switch_under_br_default_common_dest_with_case(ptr %start, ptr %end,
; COST: [[LOOP_LATCH]]:
; COST-NEXT: [[PTR_IV_NEXT]] = getelementptr inbounds i64, ptr [[PTR_IV]], i64 1
; COST-NEXT: [[EC:%.*]] = icmp eq ptr [[PTR_IV_NEXT]], [[END]]
-; COST-NEXT: br i1 [[EC]], label %[[EXIT]], label %[[LOOP_HEADER]], !llvm.loop [[LOOP9:![0-9]+]]
+; COST-NEXT: br i1 [[EC]], label %[[EXIT]], label %[[LOOP_HEADER]], !llvm.loop [[LOOP11:![0-9]+]]
; COST: [[EXIT]]:
; COST-NEXT: ret void
;
@@ -1459,6 +1495,8 @@ exit:
; COST: [[LOOP7]] = distinct !{[[LOOP7]], [[META2]], [[META1]]}
; COST: [[LOOP8]] = distinct !{[[LOOP8]], [[META1]], [[META2]]}
; COST: [[LOOP9]] = distinct !{[[LOOP9]], [[META2]], [[META1]]}
+; COST: [[LOOP10]] = distinct !{[[LOOP10]], [[META1]], [[META2]]}
+; COST: [[LOOP11]] = distinct !{[[LOOP11]], [[META2]], [[META1]]}
;.
; FORCED: [[LOOP0]] = distinct !{[[LOOP0]], [[META1:![0-9]+]], [[META2:![0-9]+]]}
; FORCED: [[META1]] = !{!"llvm.loop.isvectorized", i32 1}
diff --git a/llvm/test/Transforms/LoopVectorize/X86/strided_load_cost.ll b/llvm/test/Transforms/LoopVectorize/X86/strided_load_cost.ll
index d316befb9548d2..02f54c6097c0bc 100644
--- a/llvm/test/Transforms/LoopVectorize/X86/strided_load_cost.ll
+++ b/llvm/test/Transforms/LoopVectorize/X86/strided_load_cost.ll
@@ -604,6 +604,22 @@ define void @test(ptr %A, ptr noalias %B) #0 {
; MAX-BW-NEXT: [[TMP13:%.*]] = add i64 [[OFFSET_IDX]], 26
; MAX-BW-NEXT: [[TMP14:%.*]] = add i64 [[OFFSET_IDX]], 28
; MAX-BW-NEXT: [[TMP15:%.*]] = add i64 [[OFFSET_IDX]], 30
+; MAX-BW-NEXT: [[TMP33:%.*]] = add i64 [[OFFSET_IDX]], 32
+; MAX-BW-NEXT: [[TMP34:%.*]] = add i64 [[OFFSET_IDX]], 34
+; MAX-BW-NEXT: [[TMP35:%.*]] = add i64 [[OFFSET_IDX]], 36
+; MAX-BW-NEXT: [[TMP69:%.*]] = add i64 [[OFFSET_IDX]], 38
+; MAX-BW-NEXT: [[TMP70:%.*]] = add i64 [[OFFSET_IDX]], 40
+; MAX-BW-NEXT: [[TMP71:%.*]] = add i64 [[OFFSET_IDX]], 42
+; MAX-BW-NEXT: [[TMP72:%.*]] = add i64 [[OFFSET_IDX]], 44
+; MAX-BW-NEXT: [[TMP73:%.*]] = add i64 [[OFFSET_IDX]], 46
+; MAX-BW-NEXT: [[TMP74:%.*]] = add i64 [[OFFSET_IDX]], 48
+; MAX-BW-NEXT: [[TMP75:%.*]] = add i64 [[OFFSET_IDX]], 50
+; MAX-BW-NEXT: [[TMP76:%.*]] = add i64 [[OFFSET_IDX]], 52
+; MAX-BW-NEXT: [[TMP77:%.*]] = add i64 [[OFFSET_IDX]], 54
+; MAX-BW-NEXT: [[TMP78:%.*]] = add i64 [[OFFSET_IDX]], 56
+; MAX-BW-NEXT: [[TMP79:%.*]] = add i64 [[OFFSET_IDX]], 58
+; MAX-BW-NEXT: [[TMP80:%.*]] = add i64 [[OFFSET_IDX]], 60
+; MAX-BW-NEXT: [[TMP81:%.*]] = add i64 [[OFFSET_IDX]], 62
; MAX-BW-NEXT: [[TMP16:%.*]] = add nuw nsw i64 [[TMP0]], 0
; MAX-BW-NEXT: [[TMP17:%.*]] = add nuw nsw i64 [[TMP1]], 0
; MAX-BW-NEXT: [[TMP18:%.*]] = add nuw nsw i64 [[TMP2]], 0
@@ -620,12 +636,28 @@ define void @test(ptr %A, ptr noalias %B) #0 {
; MAX-BW-NEXT: [[TMP29:%.*]] = add nuw nsw i64 [[TMP13]], 0
; MAX-BW-NEXT: [[TMP30:%.*]] = add nuw nsw i64 [[TMP14]], 0
; MAX-BW-NEXT: [[TMP31:%.*]] = add nuw nsw i64 [[TMP15]], 0
+; MAX-BW-NEXT: [[TMP82:%.*]] = add nuw nsw i64 [[TMP33]], 0
+; MAX-BW-NEXT: [[TMP99:%.*]] = add nuw nsw i64 [[TMP34]], 0
+; MAX-BW-NEXT: [[TMP100:%.*]] = add nuw nsw i64 [[TMP35]], 0
+; MAX-BW-NEXT: [[TMP101:%.*]] = add nuw nsw i64 [[TMP69]], 0
+; MAX-BW-NEXT: [[TMP102:%.*]] = add nuw nsw i64 [[TMP70]], 0
+; MAX-BW-NEXT: [[TMP103:%.*]] = add nuw nsw i64 [[TMP71]], 0
+; MAX-BW-NEXT: [[TMP104:%.*]] = add nuw nsw i64 [[TMP72]], 0
+; MAX-BW-NEXT: [[TMP105:%.*]] = add nuw nsw i64 [[TMP73]], 0
+; MAX-BW-NEXT: [[TMP106:%.*]] = add nuw nsw i64 [[TMP74]], 0
+; MAX-BW-NEXT: [[TMP107:%.*]] = add nuw nsw i64 [[TMP75]], 0
+; MAX-BW-NEXT: [[TMP108:%.*]] = add nuw nsw i64 [[TMP76]], 0
+; MAX-BW-NEXT: [[TMP109:%.*]] = add nuw nsw i64 [[TMP77]], 0
+; MAX-BW-NEXT: [[TMP110:%.*]] = add nuw nsw i64 [[TMP78]], 0
+; MAX-BW-NEXT: [[TMP111:%.*]] = add nuw nsw i64 [[TMP79]], 0
+; MAX-BW-NEXT: [[TMP112:%.*]] = add nuw nsw i64 [[TMP80]], 0
+; MAX-BW-NEXT: [[TMP113:%.*]] = add nuw nsw i64 [[TMP81]], 0
; MAX-BW-NEXT: [[TMP32:%.*]] = getelementptr inbounds [1024 x i32], ptr [[A:%.*]], i64 0, i64 [[TMP16]]
-; MAX-BW-NEXT: [[WIDE_VEC:%.*]] = load <32 x i32>, ptr [[TMP32]], align 4
-; MAX-BW-NEXT: [[STRIDED_VEC:%.*]] = shufflevector <32 x i32> [[WIDE_VEC]], <32 x i32> poison, <16 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14, i32 16, i32 18, i32 20, i32 22, i32 24, i32 26, i32 28, i32 30>
-; MAX-BW-NEXT: [[STRIDED_VEC1:%.*]] = shufflevector <32 x i32> [[WIDE_VEC]], <32 x i32> poison, <16 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15, i32 17, i32 19, i32 21, i32 23, i32 25, i32 27, i32 29, i32 31>
-; MAX-BW-NEXT: [[TMP34:%.*]] = add <16 x i32> [[STRIDED_VEC]], [[STRIDED_VEC1]]
-; MAX-BW-NEXT: [[TMP35:%.*]] = trunc <16 x i32> [[TMP34]] to <16 x i8>
+; MAX-BW-NEXT: [[WIDE_VEC:%.*]] = load <64 x i32>, ptr [[TMP32]], align 4
+; MAX-BW-NEXT: [[STRIDED_VEC:%.*]] = shufflevector <64 x i32> [[WIDE_VEC]], <64 x i32> poison, <32 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14, i32 16, i32 18, i32 20, i32 22, i32 24, i32 26, i32 28, i32 30, i32 32, i32 34, i32 36, i32 38, i32 40, i32 42, i32 44, i32 46, i32 48, i32 50, i32 52, i32 54, i32 56, i32 58, i32 60, i32 62>
+; MAX-BW-NEXT: [[STRIDED_VEC1:%.*]] = shufflevector <64 x i32> [[WIDE_VEC]], <64 x i32> poison, <32 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15, i32 17, i32 19, i32 21, i32 23, i32 25, i32 27, i32 29, i32 31, i32 33, i32 35, i32 37, i32 39, i32 41, i32 43, i32 45, i32 47, i32 49, i32 51, i32 53, i32 55, i32 57, i32 59, i32 61, i32 63>
+; MAX-BW-NEXT: [[TMP114:%.*]] = add <32 x i32> [[STRIDED_VEC]], [[STRIDED_VEC1]]
+; MAX-BW-NEXT: [[TMP131:%.*]] = trunc <32 x i32> [[TMP114]] to <32 x i8>
; MAX-BW-NEXT: [[TMP36:%.*]] = getelementptr inbounds [1024 x i8], ptr [[B:%.*]], i64 0, i64 [[TMP16]]
; MAX-BW-NEXT: [[TMP37:%.*]] = getelementptr inbounds [1024 x i8], ptr [[B]], i64 0, i64 [[TMP17]]
; MAX-BW-NEXT: [[TMP38:%.*]] = getelementptr inbounds [1024 x i8], ptr [[B]], i64 0, i64 [[TMP18]]
@@ -642,39 +674,87 @@ define void @test(ptr %A, ptr noalias %B) #0 {
; MAX-BW-NEXT: [[TMP49:%.*]] = getelementptr inbounds [1024 x i8], ptr [[B]], i64 0, i64 [[TMP29]]
; MAX-BW-NEXT: [[TMP50:%.*]] = getelementptr inbounds [1024 x i8], ptr [[B]], i64 0, i64 [[TMP30]]
; MAX-BW-NEXT: [[TMP51:%.*]] = getelementptr inbounds [1024 x i8], ptr [[B]], i64 0, i64 [[TMP31]]
-; MAX-BW-NEXT: [[TMP52:%.*]] = extractelement <16 x i8> [[TMP35]], i32 0
+; MAX-BW-NEXT: [[TMP83:%.*]] = getelementptr inbounds [1024 x i8], ptr [[B]], i64 0, i64 [[TMP82]]
+; MAX-BW-NEXT: [[TMP84:%.*]] = getelementptr inbounds [1024 x i8], ptr [[B]], i64 0, i64 [[TMP99]]
+; MAX-BW-NEXT: [[TMP85:%.*]] = getelementptr inbounds [1024 x i8], ptr [[B]], i64 0, i64 [[TMP100]]
+; MAX-BW-NEXT: [[TMP86:%.*]] = getelementptr inbounds [1024 x i8], ptr [[B]], i64 0, i64 [[TMP101]]
+; MAX-BW-NEXT: [[TMP87:%.*]] = getelementptr inbounds [1024 x i8], ptr [[B]], i64 0, i64 [[TMP102]]
+; MAX-BW-NEXT: [[TMP88:%.*]] = getelementptr inbounds [1024 x i8], ptr [[B]], i64 0, i64 [[TMP103]]
+; MAX-BW-NEXT: [[TMP89:%.*]] = getelementptr inbounds [1024 x i8], ptr [[B]], i64 0, i64 [[TMP104]]
+; MAX-BW-NEXT: [[TMP90:%.*]] = getelementptr inbounds [1024 x i8], ptr [[B]], i64 0, i64 [[TMP105]]
+; MAX-BW-NEXT: [[TMP91:%.*]] = getelementptr inbounds [1024 x i8], ptr [[B]], i64 0, i64 [[TMP106]]
+; MAX-BW-NEXT: [[TMP92:%.*]] = getelementptr inbounds [1024 x i8], ptr [[B]], i64 0, i64 [[TMP107]]
+; MAX-BW-NEXT: [[TMP93:%.*]] = getelementptr inbounds [1024 x i8], ptr [[B]], i64 0, i64 [[TMP108]]
+; MAX-BW-NEXT: [[TMP94:%.*]] = getelementptr inbounds [1024 x i8], ptr [[B]], i64 0, i64 [[TMP109]]
+; MAX-BW-NEXT: [[TMP95:%.*]] = getelementptr inbounds [1024 x i8], ptr [[B]], i64 0, i64 [[TMP110]]
+; MAX-BW-NEXT: [[TMP96:%.*]] = getelementptr inbounds [1024 x i8], ptr [[B]], i64 0, i64 [[TMP111]]
+; MAX-BW-NEXT: [[TMP97:%.*]] = getelementptr inbounds [1024 x i8], ptr [[B]], i64 0, i64 [[TMP112]]
+; MAX-BW-NEXT: [[TMP98:%.*]] = getelementptr inbounds [1024 x i8], ptr [[B]], i64 0, i64 [[TMP113]]
+; MAX-BW-NEXT: [[TMP52:%.*]] = extractelement <32 x i8> [[TMP131]], i32 0
; MAX-BW-NEXT: store i8 [[TMP52]], ptr [[TMP36]], align 1
-; MAX-BW-NEXT: [[TMP53:%.*]] = extractelement <16 x i8> [[TMP35]], i32 1
+; MAX-BW-NEXT: [[TMP53:%.*]] = extractelement <32 x i8> [[TMP131]], i32 1
; MAX-BW-NEXT: store i8 [[TMP53]], ptr [[TMP37]], align 1
-; MAX-BW-NEXT: [[TMP54:%.*]] = extractelement <16 x i8> [[TMP35]], i32 2
+; MAX-BW-NEXT: [[TMP54:%.*]] = extractelement <32 x i8> [[TMP131]], i32 2
; MAX-BW-NEXT: store i8 [[TMP54]], ptr [[TMP38]], align 1
-; MAX-BW-NEXT: [[TMP55:%.*]] = extractelement <16 x i8> [[TMP35]], i32 3
+; MAX-BW-NEXT: [[TMP55:%.*]] = extractelement <32 x i8> [[TMP131]], i32 3
; MAX-BW-NEXT: store i8 [[TMP55]], ptr [[TMP39]], align 1
-; MAX-BW-NEXT: [[TMP56:%.*]] = extractelement <16 x i8> [[TMP35]], i32 4
+; MAX-BW-NEXT: [[TMP56:%.*]] = extractelement <32 x i8> [[TMP131]], i32 4
; MAX-BW-NEXT: store i8 [[TMP56]], ptr [[TMP40]], align 1
-; MAX-BW-NEXT: [[TMP57:%.*]] = extractelement <16 x i8> [[TMP35]], i32 5
+; MAX-BW-NEXT: [[TMP57:%.*]] = extractelement <32 x i8> [[TMP131]], i32 5
; MAX-BW-NEXT: store i8 [[TMP57]], ptr [[TMP41]], align 1
-; MAX-BW-NEXT: [[TMP58:%.*]] = extractelement <16 x i8> [[TMP35]], i32 6
+; MAX-BW-NEXT: [[TMP58:%.*]] = extractelement <32 x i8> [[TMP131]], i32 6
; MAX-BW-NEXT: store i8 [[TMP58]], ptr [[TMP42]], align 1
-; MAX-BW-NEXT: [[TMP59:%.*]] = extractelement <16 x i8> [[TMP35]], i32 7
+; MAX-BW-NEXT: [[TMP59:%.*]] = extractelement <32 x i8> [[TMP131]], i32 7
; MAX-BW-NEXT: store i8 [[TMP59]], ptr [[TMP43]], align 1
-; MAX-BW-NEXT: [[TMP60:%.*]] = extractelement <16 x i8> [[TMP35]], i32 8
+; MAX-BW-NEXT: [[TMP60:%.*]] = extractelement <32 x i8> [[TMP131]], i32 8
; MAX-BW-NEXT: store i8 [[TMP60]], ptr [[TMP44]], align 1
-; MAX-BW-NEXT: [[TMP61:%.*]] = extractelement <16 x i8> [[TMP35]], i32 9
+; MAX-BW-NEXT: [[TMP61:%.*]] = extractelement <32 x i8> [[TMP131]], i32 9
; MAX-BW-NEXT: store i8 [[TMP61]], ptr [[TMP45]], align 1
-; MAX-BW-NEXT: [[TMP62:%.*]] = extractelement <16 x i8> [[TMP35]], i32 10
+; MAX-BW-NEXT: [[TMP62:%.*]] = extractelement <32 x i8> [[TMP131]], i32 10
; MAX-BW-NEXT: store i8 [[TMP62]], ptr [[TMP46]], align 1
-; MAX-BW-NEXT: [[TMP63:%.*]] = extractelement <16 x i8> [[TMP35]], i32 11
+; MAX-BW-NEXT: [[TMP63:%.*]] = extractelement <32 x i8> [[TMP131]], i32 11
; MAX-BW-NEXT: store i8 [[TMP63]], ptr [[TMP47]], align 1
-; MAX-BW-NEXT: [[TMP64:%.*]] = extractelement <16 x i8> [[TMP35]], i32 12
+; MAX-BW-NEXT: [[TMP64:%.*]] = extractelement <32 x i8> [[TMP131]], i32 12
; MAX-BW-NEXT: store i8 [[TMP64]], ptr [[TMP48]], align 1
-; MAX-BW-NEXT: [[TMP65:%.*]] = extractelement <16 x i8> [[TMP35]], i32 13
+; MAX-BW-NEXT: [[TMP65:%.*]] = extractelement <32 x i8> [[TMP131]], i32 13
; MAX-BW-NEXT: store i8 [[TMP65]], ptr [[TMP49]], align 1
-; MAX-BW-NEXT: [[TMP66:%.*]] = extractelement <16 x i8> [[TMP35]], i32 14
+; MAX-BW-NEXT: [[TMP66:%.*]] = extractelement <32 x i8> [[TMP131]], i32 14
; MAX-BW-NEXT: store i8 [[TMP66]], ptr [[TMP50]], align 1
-; MAX-BW-NEXT: [[TMP67:%.*]] = extractelement <16 x i8> [[TMP35]], i32 15
+; MAX-BW-NEXT: [[TMP67:%.*]] = extractelement <32 x i8> [[TMP131]], i32 15
; MAX-BW-NEXT: store i8 [[TMP67]], ptr [[TMP51]], align 1
-; MAX-BW-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 16
+; MAX-BW-NEXT: [[TMP115:%.*]] = extractelement <32 x i8> [[TMP131]], i32 16
+; MAX-BW-NEXT: store i8 [[TMP115]], ptr [[TMP83]], align 1
+; MAX-BW-NEXT: [[TMP116:%.*]] = extractelement <32 x i8> [[TMP131]], i32 17
+; MAX-BW-NEXT: store i8 [[TMP116]], ptr [[TMP84]], align 1
+; MAX-BW-NEXT: [[TMP117:%.*]] = extractelement <32 x i8> [[TMP131]], i32 18
+; MAX-BW-NEXT: store i8 [[TMP117]], ptr [[TMP85]], align 1
+; MAX-BW-NEXT: [[TMP118:%.*]] = extractelement <32 x i8> [[TMP131]], i32 19
+; MAX-BW-NEXT: store i8 [[TMP118]], ptr [[TMP86]], align 1
+; MAX-BW-NEXT: [[TMP119:%.*]] = extractelement <32 x i8> [[TMP131]], i32 20
+; MAX-BW-NEXT: store i8 [[TMP119]], ptr [[TMP87]], align 1
+; MAX-BW-NEXT: [[TMP120:%.*]] = extractelement <32 x i8> [[TMP131]], i32 21
+; MAX-BW-NEXT: store i8 [[TMP120]], ptr [[TMP88]], align 1
+; MAX-BW-NEXT: [[TMP121:%.*]] = extractelement <32 x i8> [[TMP131]], i32 22
+; MAX-BW-NEXT: store i8 [[TMP121]], ptr [[TMP89]], align 1
+; MAX-BW-NEXT: [[TMP122:%.*]] = extractelement <32 x i8> [[TMP131]], i32 23
+; MAX-BW-NEXT: store i8 [[TMP122]], ptr [[TMP90]], align 1
+; MAX-BW-NEXT: [[TMP123:%.*]] = extractelement <32 x i8> [[TMP131]], i32 24
+; MAX-BW-NEXT: store i8 [[TMP123]], ptr [[TMP91]], align 1
+; MAX-BW-NEXT: [[TMP124:%.*]] = extractelement <32 x i8> [[TMP131]], i32 25
+; MAX-BW-NEXT: store i8 [[TMP124]], ptr [[TMP92]], align 1
+; MAX-BW-NEXT: [[TMP125:%.*]] = extractelement <32 x i8> [[TMP131]], i32 26
+; MAX-BW-NEXT: store i8 [[TMP125]], ptr [[TMP93]], align 1
+; MAX-BW-NEXT: [[TMP126:%.*]] = extractelement <32 x i8> [[TMP131]], i32 27
+; MAX-BW-NEXT: store i8 [[TMP126]], ptr [[TMP94]], align 1
+; MAX-BW-NEXT: [[TMP127:%.*]] = extractelement <32 x i8> [[TMP131]], i32 28
+; MAX-BW-NEXT: store i8 [[TMP127]], ptr [[TMP95]], align 1
+; MAX-BW-NEXT: [[TMP128:%.*]] = extractelement <32 x i8> [[TMP131]], i32 29
+; MAX-BW-NEXT: store i8 [[TMP128]], ptr [[TMP96]], align 1
+; MAX-BW-NEXT: [[TMP129:%.*]] = extractelement <32 x i8> [[TMP131]], i32 30
+; MAX-BW-NEXT: store i8 [[TMP129]], ptr [[TMP97]], align 1
+; MAX-BW-NEXT: [[TMP130:%.*]] = extractelement <32 x i8> [[TMP131]], i32 31
+; MAX-BW-NEXT: store i8 [[TMP130]], ptr [[TMP98]], align 1
+; MAX-BW-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 32
; MAX-BW-NEXT: [[TMP68:%.*]] = icmp eq i64 [[INDEX_NEXT]], 512
; MAX-BW-NEXT: br i1 [[TMP68]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP10:![0-9]+]]
; MAX-BW: middle.block:
>From ff0c81198c92fab0393e73179e80cddd64d472e7 Mon Sep 17 00:00:00 2001
From: Tingwei0512 <tingwewe at gmail.com>
Date: Sat, 4 Jan 2025 17:54:53 +0800
Subject: [PATCH 2/2] Fix: Modify the tie breaker logic of preferscalable
Co-authored-by: Tomlord1122 <r12944044 at ntu.edu.tw>
---
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 294b5c2c8911bc..7589a2dae06a14 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -4375,7 +4375,7 @@ bool LoopVectorizationPlanner::isMoreProfitable(
// vectorization.
// Only check preferFixedOverScalableIfEqualCost() when A is scalable
- // and B isn't.
+ // but B isn't.
bool PreferScalable = true;
if (A.Width.isScalable() && !B.Width.isScalable())
PreferScalable = !TTI.preferFixedOverScalableIfEqualCost();
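For reviewers skimming the diffs, the end state of the tie-breaker can be summarized with a small standalone sketch. This is a simplified model, not the actual LLVM implementation: `Candidate`, its fields, and the bool parameter below are illustrative stand-ins for LLVM's VectorizationFactor and the TTI hook, and the costs are assumed to be already normalized per lane (the real isMoreProfitable() additionally folds an estimated vscale into scalable widths and cross-multiplies cost by width).

#include <iostream>

// Illustrative stand-in for LLVM's (VF, cost) candidate pair.
struct Candidate {
  unsigned Width; // VF, counted per vscale when Scalable is true
  bool Scalable;  // true for a "vscale x Width" plan
  unsigned Cost;  // assumed already normalized per lane
};

// Models isMoreProfitable(A, B) after this patch: ties favor A by
// default, and the TTI hook is consulted only when A is scalable
// and B is fixed-width.
bool isMoreProfitable(const Candidate &A, const Candidate &B,
                      bool PreferFixedOverScalableIfEqualCost) {
  bool PreferScalable = true;
  if (A.Scalable && !B.Scalable)
    PreferScalable = !PreferFixedOverScalableIfEqualCost;
  return PreferScalable ? A.Cost <= B.Cost : A.Cost < B.Cost;
}

int main() {
  Candidate Scalar{/*Width=*/1, /*Scalable=*/false, /*Cost=*/4};
  Candidate Fixed4{/*Width=*/4, /*Scalable=*/false, /*Cost=*/4};
  Candidate Scalable2{/*Width=*/2, /*Scalable=*/true, /*Cost=*/4};
  // Both equal-cost ties from the description at the top of this PR
  // now resolve the same way (vectorize); previously only the
  // scalable one did.
  std::cout << isMoreProfitable(Fixed4, Scalar, false) << "\n";    // 1
  std::cout << isMoreProfitable(Scalable2, Scalar, false) << "\n"; // 1
}

This also accounts for the X86 test churn in the first patch: with equal-cost ties now resolved in favor of the candidate under consideration, fixed-width plans settle on the later-considered (here larger) VF, which is why the MAX-BW run above moves from VF 16 to VF 32, widening the interleaved load to <64 x i32> and the induction step to 32.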