[llvm] [LV] Add support for partial alias masking with tail folding (PR #182457)
Benjamin Maxwell via llvm-commits
llvm-commits at lists.llvm.org
Wed Mar 4 08:45:46 PST 2026
https://github.com/MacDue updated https://github.com/llvm/llvm-project/pull/182457
>From a9879d8f133ff8ef49ebe24628d01fd7cc2fee40 Mon Sep 17 00:00:00 2001
From: Benjamin Maxwell <benjamin.maxwell at arm.com>
Date: Thu, 19 Feb 2026 15:29:32 +0000
Subject: [PATCH 1/2] [LV] Add support for partial alias masking with tail
folding
This patch adds basic support for partial alias masking, which allows
entering the vector loop even when there is aliasing within a single
vector iteration. It does this by clamping the runtime VF to the number
of lanes that can safely execute given the distance between the pointers.
This allows the runtime VF to be anywhere from 2 to the "static" VF.
Conceptually, this transform looks like:
```
// `c` and `b` may alias.
for (int i = 0; i < n; i++) {
  c[i] = a[i] + b[i];
}
```
->
```
svbool_t alias_mask = loop.dependence.war.mask(b, c);
int num_active = num_active_lanes(alias_mask);
if (num_active >= 2) {
  for (int i = 0; i < n; i += num_active) {
    // ... vector loop masked with `alias_mask`
  }
}
// ... scalar tail
```
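At the IR level, the materialized check block and the clamped step look
roughly like the sketch below (simplified from the AArch64 `alias_mask`
test added in this patch; value names are illustrative and the types
assume a VF of vscale x 16 with i8 elements):
```
vector.clamped.vf.check:
  ; Each active lane of the WAR mask is safe to execute vectorized given the
  ; distance between %b and %c (element size 1 byte here).
  %alias.mask = call <vscale x 16 x i1> @llvm.loop.dependence.war.mask.nxv16i1(ptr %b, ptr %c, i64 1)
  ; The clamped (runtime) VF is the number of active lanes in the alias mask.
  %mask.zext = zext <vscale x 16 x i1> %alias.mask to <vscale x 16 x i32>
  %popcount = call i32 @llvm.vector.reduce.add.nxv16i32(<vscale x 16 x i32> %mask.zext)
  %num.active.lanes = zext i32 %popcount to i64
  ; Bypass the vector loop if the clamped VF is scalar, or if stepping the IV
  ; by it could overflow, i.e. (UMax - %n) < ClampedVF.
  %vf.is.scalar = icmp ule i64 %num.active.lanes, 1
  %dist.to.max = sub i64 -1, %n
  %vf.step.overflow = icmp ult i64 %dist.to.max, %num.active.lanes
  %bypass = or i1 %vf.is.scalar, %vf.step.overflow
  br i1 %bypass, label %scalar.ph, label %vector.ph

vector.body:
  ; The header mask becomes (active lane mask & alias mask), and the induction
  ; variable steps by the clamped VF instead of VF * UF.
  %mask = and <vscale x 16 x i1> %active.lane.mask, %alias.mask
  ; ... masked loads/stores use %mask ...
  %index.next = add i64 %index, %num.active.lanes
```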
This initial patch has a number of limitations:
- The loop must be tail-folded
  * We intend to follow up with full alias-masking support for loops
    without tail-folding
- The mask and transform are only valid for IC = 1
  * Some recipes may not handle the "ClampedVF" correctly at IC > 1
  * Note: On AArch64, we also only have native alias mask instructions
    for IC = 1
- Reverse iteration is not supported
  * The mask reversal logic is not correct for the alias mask
    (or clamped ALM)
- First-order recurrences are not supported
  * The `splice.right` is not lowered correctly for clamped VFs
- This style of vectorization is not enabled by default and is not yet
  costed
  * It can be enabled with `-force-partial-aliasing-vectorization`
    (see the example invocation after this list)
  * When enabled, alias masking is used instead of the standard diff
    checks (when legal to do so)
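For reference, the new AArch64 tests enable the transform with an
invocation like the following (taken from the RUN line in
`llvm/test/Transforms/LoopVectorize/AArch64/alias-mask.ll`; `input.ll`
stands in for the test file):
```
opt -S -mtriple=aarch64-unknown-linux-gnu -mattr=+sve2 -passes=loop-vectorize \
    -force-partial-aliasing-vectorization \
    -prefer-predicate-over-epilogue=predicate-dont-vectorize input.ll
```
Tail folding is requested via `-prefer-predicate-over-epilogue=predicate-dont-vectorize`,
matching the tail-folding requirement above.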
This PR supersedes #100579 (closes #100579).
---
llvm/lib/Analysis/VectorUtils.cpp | 2 +
.../Vectorize/LoopVectorizationPlanner.h | 7 +
.../Transforms/Vectorize/LoopVectorize.cpp | 124 ++++-
llvm/lib/Transforms/Vectorize/VPlan.h | 5 +-
.../Transforms/Vectorize/VPlanAnalysis.cpp | 2 +
.../Vectorize/VPlanConstruction.cpp | 10 +-
.../lib/Transforms/Vectorize/VPlanRecipes.cpp | 21 +-
.../Transforms/Vectorize/VPlanTransforms.cpp | 67 +++
.../Transforms/Vectorize/VPlanTransforms.h | 10 +
llvm/lib/Transforms/Vectorize/VPlanUtils.cpp | 18 +-
.../LoopVectorize/AArch64/alias-mask.ll | 472 ++++++++++++++++++
.../RISCV/alias-mask-force-evl.ll | 64 +++
.../AArch64/vplan-printing-alias-mask.ll | 92 ++++
.../VPlan/vplan-printing-alias-mask.ll | 93 ++++
.../LoopVectorize/VPlan/vplan-printing.ll | 12 +-
.../alias-mask-negative-tests.ll | 83 +++
.../Transforms/LoopVectorize/alias-mask.ll | 382 ++++++++++++++
.../LoopVectorize/pointer-induction.ll | 10 +-
.../reuse-lcssa-phi-scev-expansion.ll | 12 +-
19 files changed, 1457 insertions(+), 29 deletions(-)
create mode 100644 llvm/test/Transforms/LoopVectorize/AArch64/alias-mask.ll
create mode 100644 llvm/test/Transforms/LoopVectorize/RISCV/alias-mask-force-evl.ll
create mode 100644 llvm/test/Transforms/LoopVectorize/VPlan/AArch64/vplan-printing-alias-mask.ll
create mode 100644 llvm/test/Transforms/LoopVectorize/VPlan/vplan-printing-alias-mask.ll
create mode 100644 llvm/test/Transforms/LoopVectorize/alias-mask-negative-tests.ll
create mode 100644 llvm/test/Transforms/LoopVectorize/alias-mask.ll
diff --git a/llvm/lib/Analysis/VectorUtils.cpp b/llvm/lib/Analysis/VectorUtils.cpp
index d4083c49626fe..e3cf650ddb76b 100644
--- a/llvm/lib/Analysis/VectorUtils.cpp
+++ b/llvm/lib/Analysis/VectorUtils.cpp
@@ -170,6 +170,8 @@ bool llvm::isVectorIntrinsicWithScalarOpAtArg(Intrinsic::ID ID,
return (ScalarOpdIdx == 2);
case Intrinsic::experimental_vp_splice:
return ScalarOpdIdx == 2 || ScalarOpdIdx == 4;
+ case Intrinsic::loop_dependence_war_mask:
+ return true;
default:
return false;
}
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h b/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
index 8368349e63cee..d666c159a699c 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
@@ -675,6 +675,13 @@ class LoopVectorizationPlanner {
void attachRuntimeChecks(VPlan &Plan, GeneratedRTChecks &RTChecks,
bool HasBranchWeights) const;
+ /// Materializes the alias mask within a check block before the loop. The
+ /// vector loop will only be entered if the clamped VF from the alias mask
+ /// is not scalar. Returns the clamped VF.
+ VPValue *materializeAliasMask(VPlan &Plan,
+ ArrayRef<PointerDiffInfo> DiffChecks,
+ bool HasBranchWeights);
+
#ifndef NDEBUG
/// \return The most profitable vectorization factor for the available VPlans
/// and the cost of that VF.
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 91c7f1680aac2..7c9156c7d38b8 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -170,6 +170,8 @@ STATISTIC(LoopsVectorized, "Number of loops vectorized");
STATISTIC(LoopsAnalyzed, "Number of loops analyzed for vectorization");
STATISTIC(LoopsEpilogueVectorized, "Number of epilogues vectorized");
STATISTIC(LoopsEarlyExitVectorized, "Number of early exit loops vectorized");
+STATISTIC(LoopsPartialAliasVectorized,
+ "Number of partial aliasing loops vectorized");
static cl::opt<bool> EnableEpilogueVectorization(
"enable-epilogue-vectorization", cl::init(true), cl::Hidden,
@@ -205,6 +207,10 @@ static cl::opt<bool> ForceTargetSupportsMaskedMemoryOps(
cl::desc("Assume the target supports masked memory operations (used for "
"testing)."));
+static cl::opt<bool> ForcePartialAliasingVectorization(
+ "force-partial-aliasing-vectorization", cl::init(false), cl::Hidden,
+ cl::desc("Replace pointer diff checks with alias masks."));
+
// Option prefer-predicate-over-epilogue indicates that an epilogue is undesired,
// that predication is preferred, and this lists all options. I.e., the
// vectorizer will try to fold the tail-loop (epilogue) into the vector body
@@ -1382,6 +1388,33 @@ class LoopVectorizationCostModel {
return getTailFoldingStyle() != TailFoldingStyle::None;
}
+ void tryToEnablePartialAliasMasking() {
+ assert(foldTailByMasking() && "Expected tail folding to be enabled!");
+ assert(!foldTailWithEVL() &&
+ "Did not expect to enable alias masking with EVL!");
+ // Note: FixedOrderRecurrences are not supported yet as we cannot handle
+ // the required `splice.right` with the alias-mask.
+ if (!ForcePartialAliasingVectorization ||
+ !Legal->getFixedOrderRecurrences().empty())
+ return;
+
+ const RuntimePointerChecking *Checks = Legal->getRuntimePointerChecking();
+ if (!Checks)
+ return;
+
+ if (auto DiffChecks = Checks->getDiffChecks()) {
+ // We have diff checks. We can use an alias mask.
+ IsPartialAliasMaskingEnabled = !DiffChecks->empty();
+ }
+ }
+
+ void disablePartialAliasMaskingIfEnabled() {
+ IsPartialAliasMaskingEnabled = false;
+ }
+
+ /// Returns true if all loop blocks should have partial aliases masked.
+ bool maskPartialAliasing() const { return IsPartialAliasMaskingEnabled; }
+
/// Returns true if the use of wide lane masks is requested and the loop is
/// using tail-folding with a lane mask for control flow.
bool useWideActiveLaneMask() const {
@@ -1603,6 +1636,9 @@ class LoopVectorizationCostModel {
/// Control finally chosen tail folding style.
TailFoldingStyle ChosenTailFoldingStyle = TailFoldingStyle::None;
+ /// True if partial alias masking is enabled.
+ bool IsPartialAliasMaskingEnabled = false;
+
/// true if scalable vectorization is supported and enabled.
std::optional<bool> IsScalableVectorizationAllowed;
@@ -1824,14 +1860,18 @@ class GeneratedRTChecks {
/// The kind of cost that we are calculating
TTI::TargetCostKind CostKind;
+ /// True if the loop is alias-masked (which allows us to omit diff checks).
+ bool LoopUsesAliasMasking = false;
+
public:
GeneratedRTChecks(PredicatedScalarEvolution &PSE, DominatorTree *DT,
LoopInfo *LI, TargetTransformInfo *TTI,
- TTI::TargetCostKind CostKind)
+ TTI::TargetCostKind CostKind, bool LoopUsesAliasMasking)
: DT(DT), LI(LI), TTI(TTI),
SCEVExp(*PSE.getSE(), "scev.check", /*PreserveLCSSA=*/false),
MemCheckExp(*PSE.getSE(), "scev.check", /*PreserveLCSSA=*/false),
- PSE(PSE), CostKind(CostKind) {}
+ PSE(PSE), CostKind(CostKind),
+ LoopUsesAliasMasking(LoopUsesAliasMasking) {}
/// Generate runtime checks in SCEVCheckBlock and MemCheckBlock, so we can
/// accurately estimate the cost of the runtime checks. The blocks are
@@ -1884,7 +1924,7 @@ class GeneratedRTChecks {
}
const auto &RtPtrChecking = *LAI.getRuntimePointerChecking();
- if (RtPtrChecking.Need) {
+ if (RtPtrChecking.Need && !LoopUsesAliasMasking) {
auto *Pred = SCEVCheckBlock ? SCEVCheckBlock : Preheader;
MemCheckBlock = SplitBlock(Pred, Pred->getTerminator(), DT, LI, nullptr,
"vector.memcheck");
@@ -3072,10 +3112,17 @@ bool LoopVectorizationCostModel::memoryInstructionCanBeWidened(
auto *Ptr = getLoadStorePointerOperand(I);
auto *ScalarTy = getLoadStoreType(I);
+ int Stride = Legal->isConsecutivePtr(ScalarTy, Ptr);
// In order to be widened, the pointer should be consecutive, first of all.
- if (!Legal->isConsecutivePtr(ScalarTy, Ptr))
+ if (!Stride)
return false;
+ // Currently, we can't handle alias masking in reverse. Reversing the alias
+ // mask is not correct (or necessary). When combined with tail-folding, the
+ // ALM should only be reversed where the alias mask is true.
+ if (Stride < 0)
+ disablePartialAliasMaskingIfEnabled();
+
// If the instruction is a store located in a predicated block, it will be
// scalarized.
if (isScalarWithPredication(I, VF))
@@ -3731,6 +3778,8 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
assert(ContainsScalableVF && "Expected scalable vector factor.");
MaxFactors.FixedVF = ElementCount::getFixed(1);
+ } else {
+ tryToEnablePartialAliasMasking();
}
return MaxFactors;
}
@@ -4445,6 +4494,13 @@ VectorizationFactor LoopVectorizationPlanner::selectEpilogueVectorizationFactor(
return Result;
}
+ if (CM.maskPartialAliasing()) {
+ LLVM_DEBUG(
+ dbgs()
+ << "LEV: Epilogue vectorization not supported with alias masking");
+ return Result;
+ }
+
// Not really a cost consideration, but check for unsupported cases here to
// simplify the logic.
if (!isCandidateForEpilogueVectorization(MainLoopVF)) {
@@ -7434,6 +7490,14 @@ DenseMap<const SCEV *, Value *> LoopVectorizationPlanner::executePlan(
// compactness.
attachRuntimeChecks(BestVPlan, ILV.RTChecks, HasBranchWeights);
+ VPValue *ClampedVF = nullptr;
+ if (CM.maskPartialAliasing()) {
+ ClampedVF = materializeAliasMask(
+ BestVPlan, *CM.Legal->getRuntimePointerChecking()->getDiffChecks(),
+ HasBranchWeights);
+ ++LoopsPartialAliasVectorized;
+ }
+
// Retrieving VectorPH now when it's easier while VPlan still has Regions.
VPBasicBlock *VectorPH = cast<VPBasicBlock>(BestVPlan.getVectorPreheader());
@@ -7470,6 +7534,9 @@ DenseMap<const SCEV *, Value *> LoopVectorizationPlanner::executePlan(
VPlanTransforms::materializeVectorTripCount(
BestVPlan, VectorPH, CM.foldTailByMasking(),
CM.requiresScalarEpilogue(BestVF.isVector()));
+ // Do a late fix-up of the VF to replace any additional users of the VF
+ // introduced since the alias mask was materialized.
+ VPlanTransforms::fixupVFUsersForClampedVF(BestVPlan, ClampedVF);
VPlanTransforms::materializeFactors(BestVPlan, VectorPH, BestVF);
VPlanTransforms::cse(BestVPlan);
VPlanTransforms::simplifyRecipes(BestVPlan);
@@ -8682,6 +8749,38 @@ void LoopVectorizationPlanner::attachRuntimeChecks(
}
}
+VPValue *LoopVectorizationPlanner::materializeAliasMask(
+ VPlan &Plan, ArrayRef<PointerDiffInfo> DiffChecks, bool HasBranchWeights) {
+ VPBasicBlock *ClampedVFCheck =
+ Plan.createVPBasicBlock("vector.clamped.vf.check");
+ VPValue *ClampedVF = VPlanTransforms::materializeAliasMask(
+ Plan, ClampedVFCheck,
+ *CM.Legal->getRuntimePointerChecking()->getDiffChecks());
+ VPBuilder Builder(ClampedVFCheck);
+ DebugLoc DL = DebugLoc::getCompilerGenerated();
+ Type *TCTy = VPTypeAnalysis(Plan).inferScalarType(Plan.getTripCount());
+
+ // Check the "ClampedVF" from the alias mask is not scalar.
+ VPValue *IsScalar =
+ Builder.createICmp(CmpInst::ICMP_ULE, ClampedVF,
+ Plan.getConstantInt(TCTy, 1), DL, "vf.is.scalar");
+
+ VPValue *TripCount = Plan.getTripCount();
+ VPValue *MaxUIntTripCount =
+ Plan.getConstantInt(cast<IntegerType>(TCTy)->getMask());
+ VPValue *DistanceToMax = Builder.createSub(MaxUIntTripCount, TripCount);
+
+ // For tail-folding: Don't execute the vector loop if (UMax - n) < ClampedVF.
+ VPValue *TripCountCheck = Builder.createICmp(
+ ICmpInst::ICMP_ULT, DistanceToMax, ClampedVF, DL, "vf.step.overflow");
+
+ VPValue *Cond = Builder.createOr(IsScalar, TripCountCheck, DL);
+ VPlanTransforms::attachCheckBlock(Plan, Cond, ClampedVFCheck,
+ HasBranchWeights);
+ VPlanTransforms::fixupVFUsersForClampedVF(Plan, ClampedVF);
+ return ClampedVF;
+}
+
void LoopVectorizationPlanner::addMinimumIterationCheck(
VPlan &Plan, ElementCount VF, unsigned UF,
ElementCount MinProfitableTripCount) const {
@@ -8786,7 +8885,8 @@ static bool processLoopInVPlanNativePath(
VPlan &BestPlan = LVP.getPlanFor(VF.Width);
{
- GeneratedRTChecks Checks(PSE, DT, LI, TTI, CM.CostKind);
+ GeneratedRTChecks Checks(PSE, DT, LI, TTI, CM.CostKind,
+ CM.maskPartialAliasing());
InnerLoopVectorizer LB(L, PSE, LI, DT, TTI, AC, VF.Width, /*UF=*/1, &CM,
Checks, BestPlan);
LLVM_DEBUG(dbgs() << "Vectorizing outer loop in \""
@@ -9657,7 +9757,8 @@ bool LoopVectorizePass::processLoop(Loop *L) {
if (ORE->allowExtraAnalysis(LV_NAME))
LVP.emitInvalidCostRemarks(ORE);
- GeneratedRTChecks Checks(PSE, DT, LI, TTI, CM.CostKind);
+ GeneratedRTChecks Checks(PSE, DT, LI, TTI, CM.CostKind,
+ CM.maskPartialAliasing());
if (LVP.hasPlanWithVF(VF.Width)) {
// Select the interleave count.
IC = LVP.selectInterleaveCount(LVP.getPlanFor(VF.Width), VF.Width, VF.Cost);
@@ -9776,6 +9877,17 @@ bool LoopVectorizePass::processLoop(Loop *L) {
IC = 1;
}
+ if (CM.maskPartialAliasing()) {
+ LLVM_DEBUG(
+ dbgs()
+ << "LV: Not interleaving due to partial aliasing vectorization.\n");
+ IntDiagMsg = {
+ "PartialAliasingVectorization",
+ "Unable to interleave due to partial aliasing vectorization."};
+ InterleaveLoop = false;
+ IC = 1;
+ }
+
// Emit diagnostic messages, if any.
const char *VAPassName = Hints.vectorizeAnalysisPassName();
if (!VectorizeLoop && !InterleaveLoop) {
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index 41eef2a368343..1118babe87e51 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -1245,8 +1245,9 @@ class LLVM_ABI_FOR_TEST VPInstruction : public VPRecipeWithIRFlags,
// part if it is scalar. In the latter case, the recipe will be removed
// during unrolling.
ExtractPenultimateElement,
- LogicalAnd, // Non-poison propagating logical And.
- LogicalOr, // Non-poison propagating logical Or.
+ LogicalAnd, // Non-poison propagating logical And.
+ LogicalOr, // Non-poison propagating logical Or.
+ NumActiveLanes, // Counts the number of active lanes in a mask.
// Add an offset in bytes (second operand) to a base pointer (first
// operand). Only generates scalar values (either for the first lane only or
// for all lanes, depending on its uses).
diff --git a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
index 998e48d411f50..5d6e7adb09dc9 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
@@ -150,6 +150,8 @@ Type *VPTypeAnalysis::inferScalarTypeForRecipe(const VPInstruction *R) {
return inferScalarType(R->getOperand(0));
case Instruction::ExtractValue:
return cast<ExtractValueInst>(R->getUnderlyingValue())->getType();
+ case VPInstruction::NumActiveLanes:
+ return Type::getInt64Ty(Ctx);
default:
break;
}
diff --git a/llvm/lib/Transforms/Vectorize/VPlanConstruction.cpp b/llvm/lib/Transforms/Vectorize/VPlanConstruction.cpp
index 83907fb96dbd2..0ac60755d1160 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanConstruction.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanConstruction.cpp
@@ -1030,13 +1030,19 @@ static void addBypassBranch(VPlan &Plan, VPBasicBlock *CheckBlockVPBB,
}
}
+void VPlanTransforms::attachCheckBlock(VPlan &Plan, VPValue *Cond,
+ VPBasicBlock *CheckBlock,
+ bool AddBranchWeights) {
+ insertCheckBlockBeforeVectorLoop(Plan, CheckBlock);
+ addBypassBranch(Plan, CheckBlock, Cond, AddBranchWeights);
+}
+
void VPlanTransforms::attachCheckBlock(VPlan &Plan, Value *Cond,
BasicBlock *CheckBlock,
bool AddBranchWeights) {
VPValue *CondVPV = Plan.getOrAddLiveIn(Cond);
VPBasicBlock *CheckBlockVPBB = Plan.createVPIRBasicBlock(CheckBlock);
- insertCheckBlockBeforeVectorLoop(Plan, CheckBlockVPBB);
- addBypassBranch(Plan, CheckBlockVPBB, CondVPV, AddBranchWeights);
+ attachCheckBlock(Plan, CondVPV, CheckBlockVPBB, AddBranchWeights);
}
void VPlanTransforms::addMinimumIterationCheck(
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index 26183f15306f1..083b98296283a 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -464,6 +464,7 @@ unsigned VPInstruction::getNumOperandsForOpcode() const {
case VPInstruction::ResumeForEpilogue:
case VPInstruction::Reverse:
case VPInstruction::Unpack:
+ case VPInstruction::NumActiveLanes:
return 1;
case Instruction::ICmp:
case Instruction::FCmp:
@@ -611,6 +612,20 @@ Value *VPInstruction::generate(VPTransformState &State) {
{PredTy, ScalarTC->getType()},
{VIVElem0, ScalarTC}, nullptr, Name);
}
+ case VPInstruction::NumActiveLanes: {
+ Value *Op = State.get(getOperand(0));
+ auto *VecTy = cast<VectorType>(Op->getType());
+ assert(VecTy->getScalarSizeInBits() == 1 &&
+ "NumActiveLanes only implemented for i1 vectors");
+
+ Value *ZExt = Builder.CreateCast(
+ Instruction::ZExt, Op,
+ VectorType::get(Builder.getInt32Ty(), VecTy->getElementCount()));
+ Value *Count =
+ Builder.CreateUnaryIntrinsic(Intrinsic::vector_reduce_add, ZExt);
+ return Builder.CreateCast(Instruction::ZExt, Count, Builder.getInt64Ty(),
+ "num.active.lanes");
+ }
case VPInstruction::FirstOrderRecurrenceSplice: {
// Generate code to combine the previous and current values in vector v3.
//
@@ -1273,7 +1288,8 @@ bool VPInstruction::isVectorToScalar() const {
getOpcode() == VPInstruction::ComputeAnyOfResult ||
getOpcode() == VPInstruction::ExtractLastActive ||
getOpcode() == VPInstruction::ComputeReductionResult ||
- getOpcode() == VPInstruction::AnyOf;
+ getOpcode() == VPInstruction::AnyOf ||
+ getOpcode() == VPInstruction::NumActiveLanes;
}
bool VPInstruction::isSingleScalar() const {
@@ -1553,6 +1569,9 @@ void VPInstruction::printRecipe(raw_ostream &O, const Twine &Indent,
case VPInstruction::ExtractLastActive:
O << "extract-last-active";
break;
+ case VPInstruction::NumActiveLanes:
+ O << "num-active-lanes";
+ break;
default:
O << Instruction::getOpcodeName(getOpcode());
}
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
index 11b73f1dcbda8..796a296d1d589 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
@@ -5093,6 +5093,73 @@ void VPlanTransforms::materializeFactors(VPlan &Plan, VPBasicBlock *VectorPH,
VFxUF.replaceAllUsesWith(MulByUF);
}
+VPValue *
+VPlanTransforms::materializeAliasMask(VPlan &Plan, VPBasicBlock *AliasCheck,
+ ArrayRef<PointerDiffInfo> DiffChecks) {
+
+ VPBuilder Builder(AliasCheck, AliasCheck->begin());
+ Type *I1Ty = IntegerType::getInt1Ty(Plan.getContext());
+ Type *I64Ty = IntegerType::getInt64Ty(Plan.getContext());
+ Type *PtrTy = PointerType::getUnqual(Plan.getContext());
+
+ VPValue *AliasMask = nullptr;
+ for (PointerDiffInfo Check : DiffChecks) {
+ VPValue *Src = vputils::getOrCreateVPValueForSCEVExpr(Plan, Check.SrcStart);
+ VPValue *Sink =
+ vputils::getOrCreateVPValueForSCEVExpr(Plan, Check.SinkStart);
+
+ VPValue *SrcPtr =
+ Builder.createScalarCast(Instruction::CastOps::IntToPtr, Src, PtrTy,
+ DebugLoc::getCompilerGenerated());
+ VPValue *SinkPtr =
+ Builder.createScalarCast(Instruction::CastOps::IntToPtr, Sink, PtrTy,
+ DebugLoc::getCompilerGenerated());
+
+ VPWidenIntrinsicRecipe *WARMask = new VPWidenIntrinsicRecipe(
+ Intrinsic::loop_dependence_war_mask,
+ {SrcPtr, SinkPtr, Plan.getConstantInt(I64Ty, Check.AccessSize)}, I1Ty);
+ Builder.insert(WARMask);
+
+ if (AliasMask)
+ AliasMask = Builder.createAnd(AliasMask, WARMask);
+ else
+ AliasMask = WARMask;
+ }
+
+ Type *IVTy = VPTypeAnalysis(Plan).inferScalarType(Plan.getTripCount());
+ VPValue *NumActive =
+ Builder.createNaryOp(VPInstruction::NumActiveLanes, {AliasMask});
+ VPValue *ClampedVF = Builder.createScalarZExtOrTrunc(
+ NumActive, IVTy, I64Ty, DebugLoc::getCompilerGenerated());
+
+ // Find the existing header mask.
+ VPSingleDefRecipe *HeaderMask = vputils::findHeaderMask(Plan);
+ auto *HeaderMaskDef = HeaderMask->getDefiningRecipe();
+ if (HeaderMaskDef->isPhi())
+ Builder.setInsertPoint(&*HeaderMaskDef->getParent()->getFirstNonPhi());
+ else
+ Builder = VPBuilder::getToInsertAfter(HeaderMaskDef);
+
+ // Update all existing users of the header mask to "HeaderMask & AliasMask".
+ auto *ClampedHeaderMask = Builder.createAnd(HeaderMask, AliasMask);
+ HeaderMask->replaceUsesWithIf(ClampedHeaderMask, [&](VPUser &U, unsigned) {
+ return dyn_cast<VPInstruction>(&U) != ClampedHeaderMask;
+ });
+
+ return ClampedVF;
+}
+
+void VPlanTransforms::fixupVFUsersForClampedVF(VPlan &Plan,
+ VPValue *ClampedVF) {
+ if (!ClampedVF)
+ return;
+
+ assert(Plan.getConcreteUF() == 1 &&
+ "Clamped VF not support with interleaving");
+ Plan.getVF().replaceAllUsesWith(ClampedVF);
+ Plan.getVFxUF().replaceAllUsesWith(ClampedVF);
+}
+
DenseMap<const SCEV *, Value *>
VPlanTransforms::expandSCEVs(VPlan &Plan, ScalarEvolution &SE) {
SCEVExpander Expander(SE, "induction", /*PreserveLCSSA=*/false);
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
index 16f7ae2daeb5e..eb6e688447758 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
@@ -180,6 +180,8 @@ struct VPlanTransforms {
/// Wrap runtime check block \p CheckBlock in a VPIRBB and \p Cond in a
/// VPValue and connect the block to \p Plan, using the VPValue as branch
/// condition.
+ static void attachCheckBlock(VPlan &Plan, VPValue *Cond,
+ VPBasicBlock *CheckBlock, bool AddBranchWeights);
static void attachCheckBlock(VPlan &Plan, Value *Cond, BasicBlock *CheckBlock,
bool AddBranchWeights);
@@ -422,6 +424,14 @@ struct VPlanTransforms {
static void materializeFactors(VPlan &Plan, VPBasicBlock *VectorPH,
ElementCount VF);
+ /// Materializes the alias mask within the \p AliasCheck block. Updates the
+ /// header mask of the loop to use it. Returns the clamped VF.
+ static VPValue *materializeAliasMask(VPlan &Plan, VPBasicBlock *AliasCheck,
+ ArrayRef<PointerDiffInfo> DiffChecks);
+
+ /// Replaces all users of the VF and VFxUF with the runtime clamped VF.
+ static void fixupVFUsersForClampedVF(VPlan &Plan, VPValue *ClampedVF);
+
/// Expand VPExpandSCEVRecipes in \p Plan's entry block. Each
/// VPExpandSCEVRecipe is replaced with a live-in wrapping the expanded IR
/// value. A mapping from SCEV expressions to their expanded IR value is
diff --git a/llvm/lib/Transforms/Vectorize/VPlanUtils.cpp b/llvm/lib/Transforms/Vectorize/VPlanUtils.cpp
index 821a4f7911bb8..db0329dfa16b1 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanUtils.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanUtils.cpp
@@ -46,7 +46,8 @@ VPValue *vputils::getOrCreateVPValueForSCEVExpr(VPlan &Plan, const SCEV *Expr) {
if (U && !isa<Instruction>(U->getValue()))
return Plan.getOrAddLiveIn(U->getValue());
auto *Expanded = new VPExpandSCEVRecipe(Expr);
- Plan.getEntry()->appendRecipe(Expanded);
+ VPBasicBlock *EntryVPBB = Plan.getEntry();
+ Plan.getEntry()->insert(Expanded, EntryVPBB->getFirstNonPhi());
return Expanded;
}
@@ -78,6 +79,12 @@ bool vputils::isHeaderMask(const VPValue *V, const VPlan &Plan) {
return true;
}
+ // For plans with forced tail folding, the header mask may not be an ALM.
+ if (match(V, m_ICmp(m_VPValue(A),
+ m_Broadcast(m_Specific(Plan.getBackedgeTakenCount()))))) {
+ return IsWideCanonicalIV(A);
+ }
+
return match(V, m_ICmp(m_VPValue(A), m_VPValue(B))) && IsWideCanonicalIV(A) &&
B == Plan.getBackedgeTakenCount();
}
@@ -607,6 +614,15 @@ VPSingleDefRecipe *vputils::findHeaderMask(VPlan &Plan) {
HeaderMask = VPI;
}
}
+
+ for (VPRecipeBase &R : LoopRegion->getEntryBasicBlock()->phis()) {
+ auto *Def = cast<VPSingleDefRecipe>(&R);
+ if (vputils::isHeaderMask(Def, Plan)) {
+ assert(!HeaderMask && "Multiple header masks found?");
+ HeaderMask = Def;
+ }
+ }
+
return HeaderMask;
}
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/alias-mask.ll b/llvm/test/Transforms/LoopVectorize/AArch64/alias-mask.ll
new file mode 100644
index 0000000000000..65b02eb9c79e1
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/alias-mask.ll
@@ -0,0 +1,472 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --check-globals none --filter-out-after "^scalar.ph:" --version 5
+; RUN: opt -S -mtriple=aarch64-unknown-linux-gnu -mattr=+sve2 -passes=loop-vectorize -force-partial-aliasing-vectorization -prefer-predicate-over-epilogue=predicate-dont-vectorize %s | FileCheck %s --check-prefix=CHECK-TF
+
+define void @alias_mask(ptr noalias %a, ptr %b, ptr %c, i64 %n) {
+; CHECK-TF-LABEL: define void @alias_mask(
+; CHECK-TF-SAME: ptr noalias [[A:%.*]], ptr [[B:%.*]], ptr [[C:%.*]], i64 [[N:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-TF-NEXT: [[ENTRY:.*:]]
+; CHECK-TF-NEXT: [[B2:%.*]] = ptrtoaddr ptr [[B]] to i64
+; CHECK-TF-NEXT: [[C1:%.*]] = ptrtoaddr ptr [[C]] to i64
+; CHECK-TF-NEXT: [[CMP11:%.*]] = icmp sgt i64 [[N]], 0
+; CHECK-TF-NEXT: br i1 [[CMP11]], label %[[FOR_BODY_PREHEADER:.*]], [[EXIT:label %.*]]
+; CHECK-TF: [[FOR_BODY_PREHEADER]]:
+; CHECK-TF-NEXT: br label %[[VECTOR_CLAMPED_VF_CHECK:.*]]
+; CHECK-TF: [[VECTOR_CLAMPED_VF_CHECK]]:
+; CHECK-TF-NEXT: [[TMP0:%.*]] = inttoptr i64 [[B2]] to ptr
+; CHECK-TF-NEXT: [[TMP1:%.*]] = inttoptr i64 [[C1]] to ptr
+; CHECK-TF-NEXT: [[ALIAS_MASK:%.*]] = call <vscale x 16 x i1> @llvm.loop.dependence.war.mask.nxv16i1(ptr [[TMP0]], ptr [[TMP1]], i64 1)
+; CHECK-TF-NEXT: [[TMP3:%.*]] = zext <vscale x 16 x i1> [[ALIAS_MASK]] to <vscale x 16 x i32>
+; CHECK-TF-NEXT: [[TMP4:%.*]] = call i32 @llvm.vector.reduce.add.nxv16i32(<vscale x 16 x i32> [[TMP3]])
+; CHECK-TF-NEXT: [[NUM_ACTIVE_LANES:%.*]] = zext i32 [[TMP4]] to i64
+; CHECK-TF-NEXT: [[VF_IS_SCALAR:%.*]] = icmp ule i64 [[NUM_ACTIVE_LANES]], 1
+; CHECK-TF-NEXT: [[TMP5:%.*]] = sub i64 -1, [[N]]
+; CHECK-TF-NEXT: [[VF_STEP_OVERFLOW:%.*]] = icmp ult i64 [[TMP5]], [[NUM_ACTIVE_LANES]]
+; CHECK-TF-NEXT: [[TMP6:%.*]] = or i1 [[VF_IS_SCALAR]], [[VF_STEP_OVERFLOW]]
+; CHECK-TF-NEXT: br i1 [[TMP6]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK-TF: [[VECTOR_PH]]:
+; CHECK-TF-NEXT: [[TMP7:%.*]] = sub i64 [[N]], [[NUM_ACTIVE_LANES]]
+; CHECK-TF-NEXT: [[TMP8:%.*]] = icmp ugt i64 [[N]], [[NUM_ACTIVE_LANES]]
+; CHECK-TF-NEXT: [[TMP9:%.*]] = select i1 [[TMP8]], i64 [[TMP7]], i64 0
+; CHECK-TF-NEXT: [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 [[N]])
+; CHECK-TF-NEXT: br label %[[VECTOR_BODY:.*]]
+; CHECK-TF: [[VECTOR_BODY]]:
+; CHECK-TF-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-TF-NEXT: [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 16 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], %[[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-TF-NEXT: [[TMP10:%.*]] = and <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], [[ALIAS_MASK]]
+; CHECK-TF-NEXT: [[TMP11:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[INDEX]]
+; CHECK-TF-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr align 1 [[TMP11]], <vscale x 16 x i1> [[TMP10]], <vscale x 16 x i8> poison)
+; CHECK-TF-NEXT: [[TMP12:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[INDEX]]
+; CHECK-TF-NEXT: [[WIDE_MASKED_LOAD3:%.*]] = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr align 1 [[TMP12]], <vscale x 16 x i1> [[TMP10]], <vscale x 16 x i8> poison)
+; CHECK-TF-NEXT: [[TMP13:%.*]] = select <vscale x 16 x i1> [[TMP10]], <vscale x 16 x i8> [[WIDE_MASKED_LOAD]], <vscale x 16 x i8> splat (i8 1)
+; CHECK-TF-NEXT: [[TMP14:%.*]] = sdiv <vscale x 16 x i8> [[WIDE_MASKED_LOAD3]], [[TMP13]]
+; CHECK-TF-NEXT: [[TMP15:%.*]] = getelementptr inbounds i8, ptr [[C]], i64 [[INDEX]]
+; CHECK-TF-NEXT: call void @llvm.masked.store.nxv16i8.p0(<vscale x 16 x i8> [[TMP14]], ptr align 1 [[TMP15]], <vscale x 16 x i1> [[TMP10]])
+; CHECK-TF-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[NUM_ACTIVE_LANES]]
+; CHECK-TF-NEXT: [[ACTIVE_LANE_MASK_NEXT]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 [[INDEX]], i64 [[TMP9]])
+; CHECK-TF-NEXT: [[TMP16:%.*]] = extractelement <vscale x 16 x i1> [[ACTIVE_LANE_MASK_NEXT]], i32 0
+; CHECK-TF-NEXT: [[TMP17:%.*]] = xor i1 [[TMP16]], true
+; CHECK-TF-NEXT: br i1 [[TMP17]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK-TF: [[MIDDLE_BLOCK]]:
+; CHECK-TF-NEXT: br [[EXIT_LOOPEXIT:label %.*]]
+; CHECK-TF: [[SCALAR_PH]]:
+;
+
+entry:
+ %cmp11 = icmp sgt i64 %n, 0
+ br i1 %cmp11, label %for.body, label %exit
+
+for.body: ; preds = %for.body.preheader, %for.body
+ %iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
+ %gep.a = getelementptr inbounds i8, ptr %a, i64 %iv
+ %load.a = load i8, ptr %gep.a, align 1
+ %gep.b = getelementptr inbounds i8, ptr %b, i64 %iv
+ %load.b = load i8, ptr %gep.b, align 1
+ %div = sdiv i8 %load.b, %load.a
+ %gep.c = getelementptr inbounds i8, ptr %c, i64 %iv
+ store i8 %div, ptr %gep.c, align 1
+ %iv.next = add nuw nsw i64 %iv, 1
+ %exitcond.not = icmp eq i64 %iv.next, %n
+ br i1 %exitcond.not, label %exit, label %for.body
+
+exit: ; preds = %for.body, %entry
+ ret void
+}
+
+; Note: This test could emit an `llvm.loop.dependence.raw.mask` to avoid creating
+; a dependency between the store and the load, but it is not necessary for
+; correctness.
+define i32 @alias_mask_read_after_write(ptr noalias %a, ptr %b, ptr %c, i64 %n) {
+; CHECK-TF-LABEL: define i32 @alias_mask_read_after_write(
+; CHECK-TF-SAME: ptr noalias [[A:%.*]], ptr [[B:%.*]], ptr [[C:%.*]], i64 [[N:%.*]]) #[[ATTR0]] {
+; CHECK-TF-NEXT: [[ENTRY:.*:]]
+; CHECK-TF-NEXT: [[C2:%.*]] = ptrtoaddr ptr [[C]] to i64
+; CHECK-TF-NEXT: [[B1:%.*]] = ptrtoaddr ptr [[B]] to i64
+; CHECK-TF-NEXT: [[CMP19:%.*]] = icmp sgt i64 [[N]], 0
+; CHECK-TF-NEXT: br i1 [[CMP19]], label %[[FOR_BODY_PREHEADER:.*]], [[EXIT:label %.*]]
+; CHECK-TF: [[FOR_BODY_PREHEADER]]:
+; CHECK-TF-NEXT: br label %[[VECTOR_CLAMPED_VF_CHECK:.*]]
+; CHECK-TF: [[VECTOR_CLAMPED_VF_CHECK]]:
+; CHECK-TF-NEXT: [[TMP0:%.*]] = inttoptr i64 [[C2]] to ptr
+; CHECK-TF-NEXT: [[TMP1:%.*]] = inttoptr i64 [[B1]] to ptr
+; CHECK-TF-NEXT: [[ALIAS_MASK:%.*]] = call <vscale x 4 x i1> @llvm.loop.dependence.war.mask.nxv4i1(ptr [[TMP0]], ptr [[TMP1]], i64 4)
+; CHECK-TF-NEXT: [[TMP3:%.*]] = zext <vscale x 4 x i1> [[ALIAS_MASK]] to <vscale x 4 x i32>
+; CHECK-TF-NEXT: [[TMP4:%.*]] = call i32 @llvm.vector.reduce.add.nxv4i32(<vscale x 4 x i32> [[TMP3]])
+; CHECK-TF-NEXT: [[NUM_ACTIVE_LANES:%.*]] = zext i32 [[TMP4]] to i64
+; CHECK-TF-NEXT: [[VF_IS_SCALAR:%.*]] = icmp ule i64 [[NUM_ACTIVE_LANES]], 1
+; CHECK-TF-NEXT: [[TMP5:%.*]] = sub i64 -1, [[N]]
+; CHECK-TF-NEXT: [[VF_STEP_OVERFLOW:%.*]] = icmp ult i64 [[TMP5]], [[NUM_ACTIVE_LANES]]
+; CHECK-TF-NEXT: [[TMP6:%.*]] = or i1 [[VF_IS_SCALAR]], [[VF_STEP_OVERFLOW]]
+; CHECK-TF-NEXT: br i1 [[TMP6]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK-TF: [[VECTOR_PH]]:
+; CHECK-TF-NEXT: [[TMP7:%.*]] = sub i64 [[N]], [[NUM_ACTIVE_LANES]]
+; CHECK-TF-NEXT: [[TMP8:%.*]] = icmp ugt i64 [[N]], [[NUM_ACTIVE_LANES]]
+; CHECK-TF-NEXT: [[TMP9:%.*]] = select i1 [[TMP8]], i64 [[TMP7]], i64 0
+; CHECK-TF-NEXT: [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[N]])
+; CHECK-TF-NEXT: br label %[[VECTOR_BODY:.*]]
+; CHECK-TF: [[VECTOR_BODY]]:
+; CHECK-TF-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-TF-NEXT: [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 4 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], %[[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-TF-NEXT: [[VEC_PHI:%.*]] = phi <vscale x 4 x i32> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP16:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-TF-NEXT: [[TMP10:%.*]] = and <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], [[ALIAS_MASK]]
+; CHECK-TF-NEXT: [[TMP11:%.*]] = getelementptr inbounds i32, ptr [[A]], i64 [[INDEX]]
+; CHECK-TF-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0(ptr align 2 [[TMP11]], <vscale x 4 x i1> [[TMP10]], <vscale x 4 x i32> poison)
+; CHECK-TF-NEXT: [[TMP12:%.*]] = getelementptr inbounds i32, ptr [[C]], i64 [[INDEX]]
+; CHECK-TF-NEXT: call void @llvm.masked.store.nxv4i32.p0(<vscale x 4 x i32> [[WIDE_MASKED_LOAD]], ptr align 2 [[TMP12]], <vscale x 4 x i1> [[TMP10]])
+; CHECK-TF-NEXT: [[TMP13:%.*]] = getelementptr inbounds i32, ptr [[B]], i64 [[INDEX]]
+; CHECK-TF-NEXT: [[WIDE_MASKED_LOAD3:%.*]] = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0(ptr align 2 [[TMP13]], <vscale x 4 x i1> [[TMP10]], <vscale x 4 x i32> poison)
+; CHECK-TF-NEXT: [[TMP14:%.*]] = add <vscale x 4 x i32> [[WIDE_MASKED_LOAD]], [[VEC_PHI]]
+; CHECK-TF-NEXT: [[TMP15:%.*]] = add <vscale x 4 x i32> [[TMP14]], [[WIDE_MASKED_LOAD3]]
+; CHECK-TF-NEXT: [[TMP16]] = select <vscale x 4 x i1> [[TMP10]], <vscale x 4 x i32> [[TMP15]], <vscale x 4 x i32> [[VEC_PHI]]
+; CHECK-TF-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[NUM_ACTIVE_LANES]]
+; CHECK-TF-NEXT: [[ACTIVE_LANE_MASK_NEXT]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX]], i64 [[TMP9]])
+; CHECK-TF-NEXT: [[TMP17:%.*]] = extractelement <vscale x 4 x i1> [[ACTIVE_LANE_MASK_NEXT]], i32 0
+; CHECK-TF-NEXT: [[TMP18:%.*]] = xor i1 [[TMP17]], true
+; CHECK-TF-NEXT: br i1 [[TMP18]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
+; CHECK-TF: [[MIDDLE_BLOCK]]:
+; CHECK-TF-NEXT: [[TMP19:%.*]] = call i32 @llvm.vector.reduce.add.nxv4i32(<vscale x 4 x i32> [[TMP16]])
+; CHECK-TF-NEXT: br [[EXIT_LOOPEXIT:label %.*]]
+; CHECK-TF: [[SCALAR_PH]]:
+;
+
+
+entry:
+ %cmp19 = icmp sgt i64 %n, 0
+ br i1 %cmp19, label %for.body, label %exit
+
+for.body: ; preds = %entry, %for.body
+ %iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
+ %accum = phi i32 [ 0, %entry ], [ %add2, %for.body ]
+ %gep.a = getelementptr inbounds i32, ptr %a, i64 %iv
+ %load.a = load i32, ptr %gep.a, align 2
+ %gep.c = getelementptr inbounds i32, ptr %c, i64 %iv
+ store i32 %load.a, ptr %gep.c, align 2
+ %gep.b = getelementptr inbounds i32, ptr %b, i64 %iv
+ %load.b = load i32, ptr %gep.b, align 2
+ %add = add i32 %load.a, %accum
+ %add2 = add i32 %add, %load.b
+ %iv.next = add nuw nsw i64 %iv, 1
+ %exitcond.not = icmp eq i64 %iv.next, %n
+ br i1 %exitcond.not, label %exit, label %for.body
+
+exit: ; preds = %entry, %for.body
+ %result = phi i32 [ 0, %entry ], [ %add2, %for.body ]
+ ret i32 %result
+}
+
+define void @alias_mask_multiple(ptr %a, ptr %b, ptr %c, i64 %n) {
+; CHECK-TF-LABEL: define void @alias_mask_multiple(
+; CHECK-TF-SAME: ptr [[A:%.*]], ptr [[B:%.*]], ptr [[C:%.*]], i64 [[N:%.*]]) #[[ATTR0]] {
+; CHECK-TF-NEXT: [[ENTRY:.*:]]
+; CHECK-TF-NEXT: [[A3:%.*]] = ptrtoaddr ptr [[A]] to i64
+; CHECK-TF-NEXT: [[B2:%.*]] = ptrtoaddr ptr [[B]] to i64
+; CHECK-TF-NEXT: [[C1:%.*]] = ptrtoaddr ptr [[C]] to i64
+; CHECK-TF-NEXT: [[CMP11:%.*]] = icmp sgt i64 [[N]], 0
+; CHECK-TF-NEXT: br i1 [[CMP11]], label %[[FOR_BODY_PREHEADER:.*]], [[EXIT:label %.*]]
+; CHECK-TF: [[FOR_BODY_PREHEADER]]:
+; CHECK-TF-NEXT: br label %[[VECTOR_CLAMPED_VF_CHECK:.*]]
+; CHECK-TF: [[VECTOR_CLAMPED_VF_CHECK]]:
+; CHECK-TF-NEXT: [[TMP0:%.*]] = inttoptr i64 [[A3]] to ptr
+; CHECK-TF-NEXT: [[TMP1:%.*]] = inttoptr i64 [[C1]] to ptr
+; CHECK-TF-NEXT: [[TMP2:%.*]] = call <vscale x 16 x i1> @llvm.loop.dependence.war.mask.nxv16i1(ptr [[TMP0]], ptr [[TMP1]], i64 1)
+; CHECK-TF-NEXT: [[TMP3:%.*]] = inttoptr i64 [[B2]] to ptr
+; CHECK-TF-NEXT: [[TMP4:%.*]] = inttoptr i64 [[C1]] to ptr
+; CHECK-TF-NEXT: [[TMP5:%.*]] = call <vscale x 16 x i1> @llvm.loop.dependence.war.mask.nxv16i1(ptr [[TMP3]], ptr [[TMP4]], i64 1)
+; CHECK-TF-NEXT: [[ALIAS_MASK:%.*]] = and <vscale x 16 x i1> [[TMP2]], [[TMP5]]
+; CHECK-TF-NEXT: [[TMP7:%.*]] = zext <vscale x 16 x i1> [[ALIAS_MASK]] to <vscale x 16 x i32>
+; CHECK-TF-NEXT: [[TMP8:%.*]] = call i32 @llvm.vector.reduce.add.nxv16i32(<vscale x 16 x i32> [[TMP7]])
+; CHECK-TF-NEXT: [[NUM_ACTIVE_LANES:%.*]] = zext i32 [[TMP8]] to i64
+; CHECK-TF-NEXT: [[VF_IS_SCALAR:%.*]] = icmp ule i64 [[NUM_ACTIVE_LANES]], 1
+; CHECK-TF-NEXT: [[TMP9:%.*]] = sub i64 -1, [[N]]
+; CHECK-TF-NEXT: [[VF_STEP_OVERFLOW:%.*]] = icmp ult i64 [[TMP9]], [[NUM_ACTIVE_LANES]]
+; CHECK-TF-NEXT: [[TMP10:%.*]] = or i1 [[VF_IS_SCALAR]], [[VF_STEP_OVERFLOW]]
+; CHECK-TF-NEXT: br i1 [[TMP10]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK-TF: [[VECTOR_PH]]:
+; CHECK-TF-NEXT: [[TMP11:%.*]] = sub i64 [[N]], [[NUM_ACTIVE_LANES]]
+; CHECK-TF-NEXT: [[TMP12:%.*]] = icmp ugt i64 [[N]], [[NUM_ACTIVE_LANES]]
+; CHECK-TF-NEXT: [[TMP13:%.*]] = select i1 [[TMP12]], i64 [[TMP11]], i64 0
+; CHECK-TF-NEXT: [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 [[N]])
+; CHECK-TF-NEXT: br label %[[VECTOR_BODY:.*]]
+; CHECK-TF: [[VECTOR_BODY]]:
+; CHECK-TF-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-TF-NEXT: [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 16 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], %[[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-TF-NEXT: [[TMP14:%.*]] = and <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], [[ALIAS_MASK]]
+; CHECK-TF-NEXT: [[TMP15:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[INDEX]]
+; CHECK-TF-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr align 1 [[TMP15]], <vscale x 16 x i1> [[TMP14]], <vscale x 16 x i8> poison)
+; CHECK-TF-NEXT: [[TMP16:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[INDEX]]
+; CHECK-TF-NEXT: [[WIDE_MASKED_LOAD4:%.*]] = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr align 1 [[TMP16]], <vscale x 16 x i1> [[TMP14]], <vscale x 16 x i8> poison)
+; CHECK-TF-NEXT: [[TMP17:%.*]] = add <vscale x 16 x i8> [[WIDE_MASKED_LOAD4]], [[WIDE_MASKED_LOAD]]
+; CHECK-TF-NEXT: [[TMP18:%.*]] = getelementptr inbounds i8, ptr [[C]], i64 [[INDEX]]
+; CHECK-TF-NEXT: call void @llvm.masked.store.nxv16i8.p0(<vscale x 16 x i8> [[TMP17]], ptr align 1 [[TMP18]], <vscale x 16 x i1> [[TMP14]])
+; CHECK-TF-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[NUM_ACTIVE_LANES]]
+; CHECK-TF-NEXT: [[ACTIVE_LANE_MASK_NEXT]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 [[INDEX]], i64 [[TMP13]])
+; CHECK-TF-NEXT: [[TMP19:%.*]] = extractelement <vscale x 16 x i1> [[ACTIVE_LANE_MASK_NEXT]], i32 0
+; CHECK-TF-NEXT: [[TMP20:%.*]] = xor i1 [[TMP19]], true
+; CHECK-TF-NEXT: br i1 [[TMP20]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP6:![0-9]+]]
+; CHECK-TF: [[MIDDLE_BLOCK]]:
+; CHECK-TF-NEXT: br [[EXIT_LOOPEXIT:label %.*]]
+; CHECK-TF: [[SCALAR_PH]]:
+;
+
+entry:
+ %cmp11 = icmp sgt i64 %n, 0
+ br i1 %cmp11, label %for.body, label %exit
+
+for.body: ; preds = %for.body.preheader, %for.body
+ %iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
+ %gep.a = getelementptr inbounds i8, ptr %a, i64 %iv
+ %load.a = load i8, ptr %gep.a, align 1
+ %gep.b = getelementptr inbounds i8, ptr %b, i64 %iv
+ %load.b = load i8, ptr %gep.b, align 1
+ %add = add i8 %load.b, %load.a
+ %gep.c = getelementptr inbounds i8, ptr %c, i64 %iv
+ store i8 %add, ptr %gep.c, align 1
+ %iv.next = add nuw nsw i64 %iv, 1
+ %exitcond.not = icmp eq i64 %iv.next, %n
+ br i1 %exitcond.not, label %exit, label %for.body
+
+exit: ; preds = %for.body, %entry
+ ret void
+}
+
+; Checks using a scalar outside the loop, which requires extracting the last
+; active element.
+define i8 @alias_masking_exit_value(ptr %ptrA, ptr %ptrB) {
+; CHECK-TF-LABEL: define i8 @alias_masking_exit_value(
+; CHECK-TF-SAME: ptr [[PTRA:%.*]], ptr [[PTRB:%.*]]) #[[ATTR0]] {
+; CHECK-TF-NEXT: [[ENTRY:.*:]]
+; CHECK-TF-NEXT: [[PTRA2:%.*]] = ptrtoaddr ptr [[PTRA]] to i64
+; CHECK-TF-NEXT: [[PTRB1:%.*]] = ptrtoaddr ptr [[PTRB]] to i64
+; CHECK-TF-NEXT: br label %[[VECTOR_CLAMPED_VF_CHECK:.*]]
+; CHECK-TF: [[VECTOR_CLAMPED_VF_CHECK]]:
+; CHECK-TF-NEXT: [[TMP0:%.*]] = inttoptr i64 [[PTRA2]] to ptr
+; CHECK-TF-NEXT: [[TMP1:%.*]] = inttoptr i64 [[PTRB1]] to ptr
+; CHECK-TF-NEXT: [[ALIAS_MASK:%.*]] = call <vscale x 16 x i1> @llvm.loop.dependence.war.mask.nxv16i1(ptr [[TMP0]], ptr [[TMP1]], i64 1)
+; CHECK-TF-NEXT: [[TMP3:%.*]] = zext <vscale x 16 x i1> [[ALIAS_MASK]] to <vscale x 16 x i32>
+; CHECK-TF-NEXT: [[TMP4:%.*]] = call i32 @llvm.vector.reduce.add.nxv16i32(<vscale x 16 x i32> [[TMP3]])
+; CHECK-TF-NEXT: [[NUM_ACTIVE_LANES:%.*]] = zext i32 [[TMP4]] to i64
+; CHECK-TF-NEXT: [[TMP5:%.*]] = trunc i64 [[NUM_ACTIVE_LANES]] to i32
+; CHECK-TF-NEXT: [[TMP6:%.*]] = trunc i32 [[TMP5]] to i8
+; CHECK-TF-NEXT: [[TMP7:%.*]] = mul i8 1, [[TMP6]]
+; CHECK-TF-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 16 x i8> poison, i8 [[TMP7]], i64 0
+; CHECK-TF-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 16 x i8> [[BROADCAST_SPLATINSERT]], <vscale x 16 x i8> poison, <vscale x 16 x i32> zeroinitializer
+; CHECK-TF-NEXT: [[VF_IS_SCALAR:%.*]] = icmp ule i32 [[TMP5]], 1
+; CHECK-TF-NEXT: [[VF_STEP_OVERFLOW:%.*]] = icmp ult i32 -1001, [[TMP5]]
+; CHECK-TF-NEXT: [[TMP8:%.*]] = or i1 [[VF_IS_SCALAR]], [[VF_STEP_OVERFLOW]]
+; CHECK-TF-NEXT: br i1 [[TMP8]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK-TF: [[VECTOR_PH]]:
+; CHECK-TF-NEXT: [[TMP9:%.*]] = sub i32 1000, [[TMP5]]
+; CHECK-TF-NEXT: [[TMP10:%.*]] = icmp ugt i32 1000, [[TMP5]]
+; CHECK-TF-NEXT: [[TMP11:%.*]] = select i1 [[TMP10]], i32 [[TMP9]], i32 0
+; CHECK-TF-NEXT: [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i32(i32 0, i32 1000)
+; CHECK-TF-NEXT: [[TMP12:%.*]] = call <vscale x 16 x i8> @llvm.stepvector.nxv16i8()
+; CHECK-TF-NEXT: br label %[[VECTOR_BODY:.*]]
+; CHECK-TF: [[VECTOR_BODY]]:
+; CHECK-TF-NEXT: [[INDEX:%.*]] = phi i32 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-TF-NEXT: [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 16 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], %[[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-TF-NEXT: [[VEC_IND:%.*]] = phi <vscale x 16 x i8> [ [[TMP12]], %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-TF-NEXT: [[TMP13:%.*]] = and <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], [[ALIAS_MASK]]
+; CHECK-TF-NEXT: [[TMP14:%.*]] = getelementptr inbounds i8, ptr [[PTRA]], i32 [[INDEX]]
+; CHECK-TF-NEXT: [[TMP15:%.*]] = getelementptr inbounds i8, ptr [[PTRB]], i32 [[INDEX]]
+; CHECK-TF-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr align 1 [[TMP14]], <vscale x 16 x i1> [[TMP13]], <vscale x 16 x i8> poison)
+; CHECK-TF-NEXT: [[TMP16:%.*]] = add <vscale x 16 x i8> [[VEC_IND]], [[WIDE_MASKED_LOAD]]
+; CHECK-TF-NEXT: call void @llvm.masked.store.nxv16i8.p0(<vscale x 16 x i8> [[TMP16]], ptr align 1 [[TMP15]], <vscale x 16 x i1> [[TMP13]])
+; CHECK-TF-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], [[TMP5]]
+; CHECK-TF-NEXT: [[ACTIVE_LANE_MASK_NEXT]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i32(i32 [[INDEX]], i32 [[TMP11]])
+; CHECK-TF-NEXT: [[TMP17:%.*]] = extractelement <vscale x 16 x i1> [[ACTIVE_LANE_MASK_NEXT]], i32 0
+; CHECK-TF-NEXT: [[TMP18:%.*]] = xor i1 [[TMP17]], true
+; CHECK-TF-NEXT: [[VEC_IND_NEXT]] = add <vscale x 16 x i8> [[VEC_IND]], [[BROADCAST_SPLAT]]
+; CHECK-TF-NEXT: br i1 [[TMP18]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP8:![0-9]+]]
+; CHECK-TF: [[MIDDLE_BLOCK]]:
+; CHECK-TF-NEXT: [[TMP19:%.*]] = xor <vscale x 16 x i1> [[TMP13]], splat (i1 true)
+; CHECK-TF-NEXT: [[FIRST_INACTIVE_LANE:%.*]] = call i64 @llvm.experimental.cttz.elts.i64.nxv16i1(<vscale x 16 x i1> [[TMP19]], i1 false)
+; CHECK-TF-NEXT: [[LAST_ACTIVE_LANE:%.*]] = sub i64 [[FIRST_INACTIVE_LANE]], 1
+; CHECK-TF-NEXT: [[TMP20:%.*]] = extractelement <vscale x 16 x i8> [[TMP16]], i64 [[LAST_ACTIVE_LANE]]
+; CHECK-TF-NEXT: br [[EXIT:label %.*]]
+; CHECK-TF: [[SCALAR_PH]]:
+;
+entry:
+ br label %loop
+
+loop:
+ %iv = phi i32 [ 0, %entry ], [ %iv.next, %loop ]
+ %gepA = getelementptr inbounds i8, ptr %ptrA, i32 %iv
+ %gepB = getelementptr inbounds i8, ptr %ptrB, i32 %iv
+ %loadA = load i8, ptr %gepA
+ %iv.trunc = trunc i32 %iv to i8
+ %add = add i8 %iv.trunc, %loadA
+ store i8 %add, ptr %gepB
+ %iv.next = add nsw i32 %iv, 1
+ %ec = icmp eq i32 %iv.next, 1000
+ br i1 %ec, label %exit, label %loop
+
+exit:
+ %exit.value = phi i8 [ %add, %loop ]
+ ret i8 %exit.value
+}
+
+; Unsupported: Reversing the alias mask is not correct.
+define void @alias_mask_reverse_iterate(ptr noalias %ptrA, ptr %ptrB, ptr %ptrC, i64 %n) {
+; CHECK-TF-LABEL: define void @alias_mask_reverse_iterate(
+; CHECK-TF-SAME: ptr noalias [[PTRA:%.*]], ptr [[PTRB:%.*]], ptr [[PTRC:%.*]], i64 [[N:%.*]]) #[[ATTR0]] {
+; CHECK-TF-NEXT: [[ENTRY:.*:]]
+; CHECK-TF-NEXT: [[PTRC2:%.*]] = ptrtoaddr ptr [[PTRC]] to i64
+; CHECK-TF-NEXT: [[PTRB1:%.*]] = ptrtoaddr ptr [[PTRB]] to i64
+; CHECK-TF-NEXT: [[IV_START:%.*]] = add i64 [[N]], -1
+; CHECK-TF-NEXT: br label %[[VECTOR_MEMCHECK:.*]]
+; CHECK-TF: [[VECTOR_MEMCHECK]]:
+; CHECK-TF-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-TF-NEXT: [[TMP1:%.*]] = mul nuw i64 [[TMP0]], 16
+; CHECK-TF-NEXT: [[TMP2:%.*]] = sub i64 [[PTRB1]], [[PTRC2]]
+; CHECK-TF-NEXT: [[DIFF_CHECK:%.*]] = icmp ult i64 [[TMP2]], [[TMP1]]
+; CHECK-TF-NEXT: br i1 [[DIFF_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK-TF: [[VECTOR_PH]]:
+; CHECK-TF-NEXT: [[TMP3:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-TF-NEXT: [[TMP4:%.*]] = shl nuw i64 [[TMP3]], 4
+; CHECK-TF-NEXT: [[TMP5:%.*]] = sub i64 [[IV_START]], [[TMP4]]
+; CHECK-TF-NEXT: [[TMP6:%.*]] = icmp ugt i64 [[IV_START]], [[TMP4]]
+; CHECK-TF-NEXT: [[TMP7:%.*]] = select i1 [[TMP6]], i64 [[TMP5]], i64 0
+; CHECK-TF-NEXT: [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 [[IV_START]])
+; CHECK-TF-NEXT: br label %[[VECTOR_BODY:.*]]
+; CHECK-TF: [[VECTOR_BODY]]:
+; CHECK-TF-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-TF-NEXT: [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 16 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], %[[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-TF-NEXT: [[OFFSET_IDX:%.*]] = sub i64 [[IV_START]], [[INDEX]]
+; CHECK-TF-NEXT: [[TMP8:%.*]] = getelementptr inbounds i8, ptr [[PTRA]], i64 [[OFFSET_IDX]]
+; CHECK-TF-NEXT: [[TMP9:%.*]] = sub nuw nsw i64 [[TMP4]], 1
+; CHECK-TF-NEXT: [[TMP10:%.*]] = mul i64 [[TMP9]], -1
+; CHECK-TF-NEXT: [[TMP11:%.*]] = getelementptr i8, ptr [[TMP8]], i64 [[TMP10]]
+; CHECK-TF-NEXT: [[REVERSE:%.*]] = call <vscale x 16 x i1> @llvm.vector.reverse.nxv16i1(<vscale x 16 x i1> [[ACTIVE_LANE_MASK]])
+; CHECK-TF-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr align 1 [[TMP11]], <vscale x 16 x i1> [[REVERSE]], <vscale x 16 x i8> poison)
+; CHECK-TF-NEXT: [[REVERSE3:%.*]] = call <vscale x 16 x i8> @llvm.vector.reverse.nxv16i8(<vscale x 16 x i8> [[WIDE_MASKED_LOAD]])
+; CHECK-TF-NEXT: [[TMP12:%.*]] = getelementptr inbounds i8, ptr [[PTRB]], i64 [[OFFSET_IDX]]
+; CHECK-TF-NEXT: [[TMP13:%.*]] = getelementptr i8, ptr [[TMP12]], i64 [[TMP10]]
+; CHECK-TF-NEXT: [[REVERSE4:%.*]] = call <vscale x 16 x i1> @llvm.vector.reverse.nxv16i1(<vscale x 16 x i1> [[ACTIVE_LANE_MASK]])
+; CHECK-TF-NEXT: [[WIDE_MASKED_LOAD5:%.*]] = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr align 1 [[TMP13]], <vscale x 16 x i1> [[REVERSE4]], <vscale x 16 x i8> poison)
+; CHECK-TF-NEXT: [[REVERSE6:%.*]] = call <vscale x 16 x i8> @llvm.vector.reverse.nxv16i8(<vscale x 16 x i8> [[WIDE_MASKED_LOAD5]])
+; CHECK-TF-NEXT: [[TMP14:%.*]] = add <vscale x 16 x i8> [[REVERSE6]], [[REVERSE3]]
+; CHECK-TF-NEXT: [[TMP15:%.*]] = getelementptr inbounds i8, ptr [[PTRC]], i64 [[OFFSET_IDX]]
+; CHECK-TF-NEXT: [[TMP16:%.*]] = getelementptr i8, ptr [[TMP15]], i64 [[TMP10]]
+; CHECK-TF-NEXT: [[REVERSE7:%.*]] = call <vscale x 16 x i8> @llvm.vector.reverse.nxv16i8(<vscale x 16 x i8> [[TMP14]])
+; CHECK-TF-NEXT: [[REVERSE8:%.*]] = call <vscale x 16 x i1> @llvm.vector.reverse.nxv16i1(<vscale x 16 x i1> [[ACTIVE_LANE_MASK]])
+; CHECK-TF-NEXT: call void @llvm.masked.store.nxv16i8.p0(<vscale x 16 x i8> [[REVERSE7]], ptr align 1 [[TMP16]], <vscale x 16 x i1> [[REVERSE8]])
+; CHECK-TF-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP4]]
+; CHECK-TF-NEXT: [[ACTIVE_LANE_MASK_NEXT]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 [[INDEX]], i64 [[TMP7]])
+; CHECK-TF-NEXT: [[TMP17:%.*]] = extractelement <vscale x 16 x i1> [[ACTIVE_LANE_MASK_NEXT]], i32 0
+; CHECK-TF-NEXT: [[TMP18:%.*]] = xor i1 [[TMP17]], true
+; CHECK-TF-NEXT: br i1 [[TMP18]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP10:![0-9]+]]
+; CHECK-TF: [[MIDDLE_BLOCK]]:
+; CHECK-TF-NEXT: br [[EXIT:label %.*]]
+; CHECK-TF: [[SCALAR_PH]]:
+;
+entry:
+ %iv.start = add nsw i64 %n, -1
+ br label %loop
+
+loop:
+ %iv = phi i64 [ %iv.start, %entry ], [ %iv.next, %loop ]
+ %gep.A = getelementptr inbounds i8, ptr %ptrA, i64 %iv
+ %loadA = load i8, ptr %gep.A, align 1
+ %gep.B = getelementptr inbounds i8, ptr %ptrB, i64 %iv
+ %loadB = load i8, ptr %gep.B, align 1
+ %add = add i8 %loadB, %loadA
+ %gep.C = getelementptr inbounds i8, ptr %ptrC, i64 %iv
+ store i8 %add, ptr %gep.C, align 1
+ %iv.next = add nsw i64 %iv, -1
+ %ec = icmp eq i64 %iv.next, 0
+ br i1 %ec, label %exit, label %loop
+
+exit:
+ ret void
+}
+
+; Test taken from: scalable-first-order-recurrence.ll. Check we don't use
+; an alias-mask with first-order recurrences, as we cannot handle the
+; splice.right with the alias-mask/clamped VF yet.
+define i32 @recurrence_1(ptr nocapture readonly %a, ptr nocapture %b, i32 %n) {
+; CHECK-TF-LABEL: define i32 @recurrence_1(
+; CHECK-TF-SAME: ptr readonly captures(none) [[A:%.*]], ptr captures(none) [[B:%.*]], i32 [[N:%.*]]) #[[ATTR0]] {
+; CHECK-TF-NEXT: [[ENTRY:.*:]]
+; CHECK-TF-NEXT: [[A2:%.*]] = ptrtoaddr ptr [[A]] to i64
+; CHECK-TF-NEXT: [[B1:%.*]] = ptrtoaddr ptr [[B]] to i64
+; CHECK-TF-NEXT: br label %[[FOR_PREHEADER:.*]]
+; CHECK-TF: [[FOR_PREHEADER]]:
+; CHECK-TF-NEXT: [[PRE_LOAD:%.*]] = load i32, ptr [[A]], align 4
+; CHECK-TF-NEXT: [[TMP0:%.*]] = add i32 [[N]], -1
+; CHECK-TF-NEXT: [[TMP1:%.*]] = zext i32 [[TMP0]] to i64
+; CHECK-TF-NEXT: [[TMP2:%.*]] = add nuw nsw i64 [[TMP1]], 1
+; CHECK-TF-NEXT: br label %[[VECTOR_MEMCHECK:.*]]
+; CHECK-TF: [[VECTOR_MEMCHECK]]:
+; CHECK-TF-NEXT: [[TMP3:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-TF-NEXT: [[TMP4:%.*]] = mul nuw i64 [[TMP3]], 4
+; CHECK-TF-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 4
+; CHECK-TF-NEXT: [[TMP6:%.*]] = add i64 [[B1]], -4
+; CHECK-TF-NEXT: [[TMP7:%.*]] = sub i64 [[TMP6]], [[A2]]
+; CHECK-TF-NEXT: [[DIFF_CHECK:%.*]] = icmp ult i64 [[TMP7]], [[TMP5]]
+; CHECK-TF-NEXT: br i1 [[DIFF_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK-TF: [[VECTOR_PH]]:
+; CHECK-TF-NEXT: [[TMP8:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-TF-NEXT: [[TMP9:%.*]] = shl nuw i64 [[TMP8]], 2
+; CHECK-TF-NEXT: [[TMP10:%.*]] = sub i64 [[TMP2]], [[TMP9]]
+; CHECK-TF-NEXT: [[TMP11:%.*]] = icmp ugt i64 [[TMP2]], [[TMP9]]
+; CHECK-TF-NEXT: [[TMP12:%.*]] = select i1 [[TMP11]], i64 [[TMP10]], i64 0
+; CHECK-TF-NEXT: [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[TMP2]])
+; CHECK-TF-NEXT: [[TMP13:%.*]] = call i32 @llvm.vscale.i32()
+; CHECK-TF-NEXT: [[TMP14:%.*]] = mul nuw i32 [[TMP13]], 4
+; CHECK-TF-NEXT: [[TMP15:%.*]] = sub i32 [[TMP14]], 1
+; CHECK-TF-NEXT: [[VECTOR_RECUR_INIT:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[PRE_LOAD]], i32 [[TMP15]]
+; CHECK-TF-NEXT: br label %[[VECTOR_BODY:.*]]
+; CHECK-TF: [[VECTOR_BODY]]:
+; CHECK-TF-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-TF-NEXT: [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 4 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], %[[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-TF-NEXT: [[VECTOR_RECUR:%.*]] = phi <vscale x 4 x i32> [ [[VECTOR_RECUR_INIT]], %[[VECTOR_PH]] ], [ [[WIDE_MASKED_LOAD:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-TF-NEXT: [[TMP16:%.*]] = add nuw nsw i64 [[INDEX]], 1
+; CHECK-TF-NEXT: [[TMP17:%.*]] = getelementptr inbounds i32, ptr [[A]], i64 [[TMP16]]
+; CHECK-TF-NEXT: [[WIDE_MASKED_LOAD]] = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0(ptr align 4 [[TMP17]], <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], <vscale x 4 x i32> poison)
+; CHECK-TF-NEXT: [[TMP18:%.*]] = call <vscale x 4 x i32> @llvm.vector.splice.right.nxv4i32(<vscale x 4 x i32> [[VECTOR_RECUR]], <vscale x 4 x i32> [[WIDE_MASKED_LOAD]], i32 1)
+; CHECK-TF-NEXT: [[TMP19:%.*]] = getelementptr inbounds i32, ptr [[B]], i64 [[INDEX]]
+; CHECK-TF-NEXT: [[TMP20:%.*]] = add <vscale x 4 x i32> [[WIDE_MASKED_LOAD]], [[TMP18]]
+; CHECK-TF-NEXT: call void @llvm.masked.store.nxv4i32.p0(<vscale x 4 x i32> [[TMP20]], ptr align 4 [[TMP19]], <vscale x 4 x i1> [[ACTIVE_LANE_MASK]])
+; CHECK-TF-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP9]]
+; CHECK-TF-NEXT: [[ACTIVE_LANE_MASK_NEXT]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX]], i64 [[TMP12]])
+; CHECK-TF-NEXT: [[TMP21:%.*]] = extractelement <vscale x 4 x i1> [[ACTIVE_LANE_MASK_NEXT]], i32 0
+; CHECK-TF-NEXT: [[TMP22:%.*]] = xor i1 [[TMP21]], true
+; CHECK-TF-NEXT: br i1 [[TMP22]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP12:![0-9]+]]
+; CHECK-TF: [[MIDDLE_BLOCK]]:
+; CHECK-TF-NEXT: [[TMP23:%.*]] = xor <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], splat (i1 true)
+; CHECK-TF-NEXT: [[FIRST_INACTIVE_LANE:%.*]] = call i64 @llvm.experimental.cttz.elts.i64.nxv4i1(<vscale x 4 x i1> [[TMP23]], i1 false)
+; CHECK-TF-NEXT: [[LAST_ACTIVE_LANE:%.*]] = sub i64 [[FIRST_INACTIVE_LANE]], 1
+; CHECK-TF-NEXT: [[TMP24:%.*]] = sub i64 [[LAST_ACTIVE_LANE]], 1
+; CHECK-TF-NEXT: [[TMP25:%.*]] = extractelement <vscale x 4 x i32> [[WIDE_MASKED_LOAD]], i64 [[TMP24]]
+; CHECK-TF-NEXT: [[TMP26:%.*]] = call i32 @llvm.vscale.i32()
+; CHECK-TF-NEXT: [[TMP27:%.*]] = mul nuw i32 [[TMP26]], 4
+; CHECK-TF-NEXT: [[TMP28:%.*]] = sub i32 [[TMP27]], 1
+; CHECK-TF-NEXT: [[TMP29:%.*]] = extractelement <vscale x 4 x i32> [[VECTOR_RECUR]], i32 [[TMP28]]
+; CHECK-TF-NEXT: [[TMP30:%.*]] = icmp eq i64 [[LAST_ACTIVE_LANE]], 0
+; CHECK-TF-NEXT: [[TMP31:%.*]] = select i1 [[TMP30]], i32 [[TMP29]], i32 [[TMP25]]
+; CHECK-TF-NEXT: br [[FOR_EXIT:label %.*]]
+; CHECK-TF: [[SCALAR_PH]]:
+;
+
+entry:
+ br label %for.preheader
+
+for.preheader:
+ %pre_load = load i32, ptr %a
+ br label %scalar.body
+
+scalar.body:
+ %0 = phi i32 [ %pre_load, %for.preheader ], [ %1, %scalar.body ]
+ %indvars.iv = phi i64 [ 0, %for.preheader ], [ %indvars.iv.next, %scalar.body ]
+ %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+ %arrayidx32 = getelementptr inbounds i32, ptr %a, i64 %indvars.iv.next
+ %1 = load i32, ptr %arrayidx32
+ %arrayidx34 = getelementptr inbounds i32, ptr %b, i64 %indvars.iv
+ %add35 = add i32 %1, %0
+ store i32 %add35, ptr %arrayidx34
+ %lftr.wideiv = trunc i64 %indvars.iv.next to i32
+ %exitcond = icmp eq i32 %lftr.wideiv, %n
+ br i1 %exitcond, label %for.exit, label %scalar.body
+
+for.exit:
+ ret i32 %0
+}
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/alias-mask-force-evl.ll b/llvm/test/Transforms/LoopVectorize/RISCV/alias-mask-force-evl.ll
new file mode 100644
index 0000000000000..88f88f5ffdb6c
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/alias-mask-force-evl.ll
@@ -0,0 +1,64 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --check-globals none --filter-out-after "^scalar.ph:" --version 5
+; RUN: opt -S -mattr=+v -mtriple riscv64 -force-partial-aliasing-vectorization -prefer-predicate-over-epilogue=predicate-dont-vectorize -force-tail-folding-style=data-with-evl -passes=loop-vectorize %s | FileCheck %s
+
+; Note: Alias masks are not supported with EVL at the moment.
+
+define void @alias_mask(ptr noalias %a, ptr %b, ptr %c, i64 %n) {
+; CHECK-LABEL: define void @alias_mask(
+; CHECK-SAME: ptr noalias [[A:%.*]], ptr [[B:%.*]], ptr [[C:%.*]], i64 [[N:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT: [[ENTRY:.*:]]
+; CHECK-NEXT: [[B2:%.*]] = ptrtoaddr ptr [[B]] to i64
+; CHECK-NEXT: [[C1:%.*]] = ptrtoaddr ptr [[C]] to i64
+; CHECK-NEXT: [[CMP11:%.*]] = icmp sgt i64 [[N]], 0
+; CHECK-NEXT: br i1 [[CMP11]], label %[[FOR_BODY_PREHEADER:.*]], [[EXIT:label %.*]]
+; CHECK: [[FOR_BODY_PREHEADER]]:
+; CHECK-NEXT: br label %[[VECTOR_MEMCHECK:.*]]
+; CHECK: [[VECTOR_MEMCHECK]]:
+; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP1:%.*]] = mul nuw i64 [[TMP0]], 16
+; CHECK-NEXT: [[TMP2:%.*]] = sub i64 [[C1]], [[B2]]
+; CHECK-NEXT: [[DIFF_CHECK:%.*]] = icmp ult i64 [[TMP2]], [[TMP1]]
+; CHECK-NEXT: br i1 [[DIFF_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK: [[VECTOR_PH]]:
+; CHECK-NEXT: br label %[[VECTOR_BODY:.*]]
+; CHECK: [[VECTOR_BODY]]:
+; CHECK-NEXT: [[CURRENT_ITERATION_IV:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[CURRENT_ITERATION_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT: [[AVL:%.*]] = phi i64 [ [[N]], %[[VECTOR_PH]] ], [ [[AVL_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT: [[TMP3:%.*]] = call i32 @llvm.experimental.get.vector.length.i64(i64 [[AVL]], i32 16, i1 true)
+; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[CURRENT_ITERATION_IV]]
+; CHECK-NEXT: [[VP_OP_LOAD:%.*]] = call <vscale x 16 x i8> @llvm.vp.load.nxv16i8.p0(ptr align 1 [[TMP4]], <vscale x 16 x i1> splat (i1 true), i32 [[TMP3]])
+; CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[CURRENT_ITERATION_IV]]
+; CHECK-NEXT: [[VP_OP_LOAD3:%.*]] = call <vscale x 16 x i8> @llvm.vp.load.nxv16i8.p0(ptr align 1 [[TMP5]], <vscale x 16 x i1> splat (i1 true), i32 [[TMP3]])
+; CHECK-NEXT: [[TMP6:%.*]] = call <vscale x 16 x i8> @llvm.vp.merge.nxv16i8(<vscale x 16 x i1> splat (i1 true), <vscale x 16 x i8> [[VP_OP_LOAD]], <vscale x 16 x i8> splat (i8 1), i32 [[TMP3]])
+; CHECK-NEXT: [[TMP7:%.*]] = sdiv <vscale x 16 x i8> [[VP_OP_LOAD3]], [[TMP6]]
+; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds i8, ptr [[C]], i64 [[CURRENT_ITERATION_IV]]
+; CHECK-NEXT: call void @llvm.vp.store.nxv16i8.p0(<vscale x 16 x i8> [[TMP7]], ptr align 1 [[TMP8]], <vscale x 16 x i1> splat (i1 true), i32 [[TMP3]])
+; CHECK-NEXT: [[TMP9:%.*]] = zext i32 [[TMP3]] to i64
+; CHECK-NEXT: [[CURRENT_ITERATION_NEXT]] = add i64 [[TMP9]], [[CURRENT_ITERATION_IV]]
+; CHECK-NEXT: [[AVL_NEXT]] = sub nuw i64 [[AVL]], [[TMP9]]
+; CHECK-NEXT: [[TMP10:%.*]] = icmp eq i64 [[AVL_NEXT]], 0
+; CHECK-NEXT: br i1 [[TMP10]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK: [[MIDDLE_BLOCK]]:
+; CHECK-NEXT: br [[EXIT_LOOPEXIT:label %.*]]
+; CHECK: [[SCALAR_PH]]:
+;
+entry:
+ %cmp11 = icmp sgt i64 %n, 0
+ br i1 %cmp11, label %for.body, label %exit
+
+for.body:
+ %iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
+ %gep.a = getelementptr inbounds i8, ptr %a, i64 %iv
+ %load.a = load i8, ptr %gep.a, align 1
+ %gep.b = getelementptr inbounds i8, ptr %b, i64 %iv
+ %load.b = load i8, ptr %gep.b, align 1
+ %div = sdiv i8 %load.b, %load.a
+ %gep.c = getelementptr inbounds i8, ptr %c, i64 %iv
+ store i8 %div, ptr %gep.c, align 1
+ %iv.next = add nuw nsw i64 %iv, 1
+ %exitcond.not = icmp eq i64 %iv.next, %n
+ br i1 %exitcond.not, label %exit, label %for.body
+
+exit:
+ ret void
+}
diff --git a/llvm/test/Transforms/LoopVectorize/VPlan/AArch64/vplan-printing-alias-mask.ll b/llvm/test/Transforms/LoopVectorize/VPlan/AArch64/vplan-printing-alias-mask.ll
new file mode 100644
index 0000000000000..a12226b316a66
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/VPlan/AArch64/vplan-printing-alias-mask.ll
@@ -0,0 +1,92 @@
+; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py UTC_ARGS: --version 6
+; RUN: opt -passes=loop-vectorize -mattr=+sve2 -force-partial-aliasing-vectorization -prefer-predicate-over-epilogue=predicate-dont-vectorize -disable-output -vplan-print-after="printFinalVPlan$" -S %s 2>&1 | FileCheck --check-prefixes=FINAL %s
+
+target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
+target triple = "aarch64-unknown-linux-gnu"
+
+define void @alias_mask(ptr noalias %a, ptr %b, ptr %c, i64 %n) {
+; FINAL-LABEL: VPlan for loop in 'alias_mask'
+; FINAL: VPlan 'Final VPlan for VF={vscale x 1,vscale x 2,vscale x 4,vscale x 8,vscale x 16},UF={1}' {
+; FINAL-NEXT: Live-in ir<%n> = original trip-count
+; FINAL-EMPTY:
+; FINAL-NEXT: ir-bb<entry>:
+; FINAL-NEXT: IR %b2 = ptrtoaddr ptr %b to i64
+; FINAL-NEXT: IR %c1 = ptrtoaddr ptr %c to i64
+; FINAL-NEXT: Successor(s): vector.clamped.vf.check
+; FINAL-EMPTY:
+; FINAL-NEXT: vector.clamped.vf.check:
+; FINAL-NEXT: EMIT-SCALAR vp<[[VP2:%[0-9]+]]> = inttoptr ir<%b2> to ptr
+; FINAL-NEXT: EMIT-SCALAR vp<[[VP3:%[0-9]+]]> = inttoptr ir<%c1> to ptr
+; FINAL-NEXT: WIDEN-INTRINSIC vp<[[VP4:%[0-9]+]]> = call llvm.loop.dependence.war.mask(vp<[[VP2]]>, vp<[[VP3]]>, ir<1>)
+; FINAL-NEXT: EMIT vp<[[VP5:%[0-9]+]]> = num-active-lanes vp<[[VP4]]>
+; FINAL-NEXT: EMIT vp<%vf.is.scalar> = icmp ule vp<[[VP5]]>, ir<1>
+; FINAL-NEXT: EMIT vp<[[VP6:%[0-9]+]]> = sub ir<-1>, ir<%n>
+; FINAL-NEXT: EMIT vp<%vf.step.overflow> = icmp ult vp<[[VP6]]>, vp<[[VP5]]>
+; FINAL-NEXT: EMIT vp<[[VP7:%[0-9]+]]> = or vp<%vf.is.scalar>, vp<%vf.step.overflow>
+; FINAL-NEXT: EMIT branch-on-cond vp<[[VP7]]>
+; FINAL-NEXT: Successor(s): ir-bb<scalar.ph>, vector.ph
+; FINAL-EMPTY:
+; FINAL-NEXT: vector.ph:
+; FINAL-NEXT: EMIT vp<[[VP9:%[0-9]+]]> = TC > VF ? TC - VF : 0 ir<%n>, vp<[[VP5]]>
+; FINAL-NEXT: EMIT vp<%active.lane.mask.entry> = active lane mask ir<0>, ir<%n>, ir<1>
+; FINAL-NEXT: Successor(s): vector.body
+; FINAL-EMPTY:
+; FINAL-NEXT: vector.body:
+; FINAL-NEXT: EMIT-SCALAR vp<%index> = phi [ ir<0>, vector.ph ], [ vp<%index.next>, vector.body ]
+; FINAL-NEXT: ACTIVE-LANE-MASK-PHI vp<[[VP10:%[0-9]+]]> = phi vp<%active.lane.mask.entry>, vp<%active.lane.mask.next>
+; FINAL-NEXT: EMIT vp<[[VP11:%[0-9]+]]> = and vp<[[VP10]]>, vp<[[VP4]]>
+; FINAL-NEXT: CLONE ir<%ptr.a> = getelementptr inbounds ir<%a>, vp<%index>
+; FINAL-NEXT: WIDEN ir<%ld.a> = load ir<%ptr.a>, vp<[[VP11]]>
+; FINAL-NEXT: CLONE ir<%ptr.b> = getelementptr inbounds ir<%b>, vp<%index>
+; FINAL-NEXT: WIDEN ir<%ld.b> = load ir<%ptr.b>, vp<[[VP11]]>
+; FINAL-NEXT: WIDEN ir<%add> = add ir<%ld.b>, ir<%ld.a>
+; FINAL-NEXT: CLONE ir<%ptr.c> = getelementptr inbounds ir<%c>, vp<%index>
+; FINAL-NEXT: WIDEN store ir<%ptr.c>, ir<%add>, vp<[[VP11]]>
+; FINAL-NEXT: EMIT vp<%index.next> = add vp<%index>, vp<[[VP5]]>
+; FINAL-NEXT: EMIT vp<%active.lane.mask.next> = active lane mask vp<%index>, vp<[[VP9]]>, ir<1>
+; FINAL-NEXT: EMIT vp<[[VP12:%[0-9]+]]> = not vp<%active.lane.mask.next>
+; FINAL-NEXT: EMIT branch-on-cond vp<[[VP12]]>
+; FINAL-NEXT: Successor(s): middle.block, vector.body
+; FINAL-EMPTY:
+; FINAL-NEXT: middle.block:
+; FINAL-NEXT: Successor(s): ir-bb<exit>
+; FINAL-EMPTY:
+; FINAL-NEXT: ir-bb<exit>:
+; FINAL-NEXT: No successors
+; FINAL-EMPTY:
+; FINAL-NEXT: ir-bb<scalar.ph>:
+; FINAL-NEXT: Successor(s): ir-bb<for.body>
+; FINAL-EMPTY:
+; FINAL-NEXT: ir-bb<for.body>:
+; FINAL-NEXT: IR %iv = phi i64 [ 0, %scalar.ph ], [ %iv.next, %for.body ] (extra operand: ir<0> from ir-bb<scalar.ph>)
+; FINAL-NEXT: IR %ptr.a = getelementptr inbounds i8, ptr %a, i64 %iv
+; FINAL-NEXT: IR %ld.a = load i8, ptr %ptr.a, align 1
+; FINAL-NEXT: IR %ptr.b = getelementptr inbounds i8, ptr %b, i64 %iv
+; FINAL-NEXT: IR %ld.b = load i8, ptr %ptr.b, align 1
+; FINAL-NEXT: IR %add = add i8 %ld.b, %ld.a
+; FINAL-NEXT: IR %ptr.c = getelementptr inbounds i8, ptr %c, i64 %iv
+; FINAL-NEXT: IR store i8 %add, ptr %ptr.c, align 1
+; FINAL-NEXT: IR %iv.next = add nuw nsw i64 %iv, 1
+; FINAL-NEXT: IR %exitcond.not = icmp eq i64 %iv.next, %n
+; FINAL-NEXT: No successors
+; FINAL-NEXT: }
+;
+entry:
+ br label %for.body
+
+for.body:
+ %iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
+ %ptr.a = getelementptr inbounds i8, ptr %a, i64 %iv
+ %ld.a = load i8, ptr %ptr.a, align 1
+ %ptr.b = getelementptr inbounds i8, ptr %b, i64 %iv
+ %ld.b = load i8, ptr %ptr.b, align 1
+ %add = add i8 %ld.b, %ld.a
+ %ptr.c = getelementptr inbounds i8, ptr %c, i64 %iv
+ store i8 %add, ptr %ptr.c, align 1
+ %iv.next = add nuw nsw i64 %iv, 1
+ %exitcond.not = icmp eq i64 %iv.next, %n
+ br i1 %exitcond.not, label %exit, label %for.body
+
+exit:
+ ret void
+}
diff --git a/llvm/test/Transforms/LoopVectorize/VPlan/vplan-printing-alias-mask.ll b/llvm/test/Transforms/LoopVectorize/VPlan/vplan-printing-alias-mask.ll
new file mode 100644
index 0000000000000..808412d918e70
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/VPlan/vplan-printing-alias-mask.ll
@@ -0,0 +1,93 @@
+; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py UTC_ARGS: --version 6
+; RUN: opt -passes=loop-vectorize -force-vector-width=4 -force-partial-aliasing-vectorization -force-target-supports-masked-memory-ops -prefer-predicate-over-epilogue=predicate-dont-vectorize -disable-output -vplan-print-after="printFinalVPlan$" -S %s 2>&1 | FileCheck --check-prefixes=FINAL %s
+
+define void @alias_mask(ptr noalias %a, ptr %b, ptr %c, i64 %n) {
+; FINAL-LABEL: VPlan for loop in 'alias_mask'
+; FINAL: VPlan 'Final VPlan for VF={4},UF={1}' {
+; FINAL-NEXT: Live-in ir<%n> = original trip-count
+; FINAL-EMPTY:
+; FINAL-NEXT: ir-bb<entry>:
+; FINAL-NEXT: IR %b2 = ptrtoaddr ptr %b to i64
+; FINAL-NEXT: IR %c1 = ptrtoaddr ptr %c to i64
+; FINAL-NEXT: Successor(s): vector.clamped.vf.check
+; FINAL-EMPTY:
+; FINAL-NEXT: vector.clamped.vf.check:
+; FINAL-NEXT: EMIT-SCALAR vp<[[VP2:%[0-9]+]]> = inttoptr ir<%b2> to ptr
+; FINAL-NEXT: EMIT-SCALAR vp<[[VP3:%[0-9]+]]> = inttoptr ir<%c1> to ptr
+; FINAL-NEXT: WIDEN-INTRINSIC vp<[[VP4:%[0-9]+]]> = call llvm.loop.dependence.war.mask(vp<[[VP2]]>, vp<[[VP3]]>, ir<1>)
+; FINAL-NEXT: EMIT vp<[[VP5:%[0-9]+]]> = num-active-lanes vp<[[VP4]]>
+; FINAL-NEXT: EMIT vp<%vf.is.scalar> = icmp ule vp<[[VP5]]>, ir<1>
+; FINAL-NEXT: EMIT vp<[[VP6:%[0-9]+]]> = sub ir<-1>, ir<%n>
+; FINAL-NEXT: EMIT vp<%vf.step.overflow> = icmp ult vp<[[VP6]]>, vp<[[VP5]]>
+; FINAL-NEXT: EMIT vp<[[VP7:%[0-9]+]]> = or vp<%vf.is.scalar>, vp<%vf.step.overflow>
+; FINAL-NEXT: EMIT branch-on-cond vp<[[VP7]]>
+; FINAL-NEXT: Successor(s): ir-bb<scalar.ph>, vector.ph
+; FINAL-EMPTY:
+; FINAL-NEXT: vector.ph:
+; FINAL-NEXT: EMIT vp<[[VP9:%[0-9]+]]> = sub vp<[[VP5]]>, ir<1>
+; FINAL-NEXT: EMIT vp<%n.rnd.up> = add ir<%n>, vp<[[VP9]]>
+; FINAL-NEXT: EMIT vp<%n.mod.vf> = urem vp<%n.rnd.up>, vp<[[VP5]]>
+; FINAL-NEXT: EMIT vp<%n.vec> = sub vp<%n.rnd.up>, vp<%n.mod.vf>
+; FINAL-NEXT: EMIT vp<%trip.count.minus.1> = sub ir<%n>, ir<1>
+; FINAL-NEXT: EMIT vp<[[VP10:%[0-9]+]]> = broadcast vp<%trip.count.minus.1>
+; FINAL-NEXT: Successor(s): vector.body
+; FINAL-EMPTY:
+; FINAL-NEXT: vector.body:
+; FINAL-NEXT: EMIT-SCALAR vp<%index> = phi [ ir<0>, vector.ph ], [ vp<%index.next>, vector.body ]
+; FINAL-NEXT: EMIT vp<[[VP11:%[0-9]+]]> = WIDEN-CANONICAL-INDUCTION vp<%index>
+; FINAL-NEXT: EMIT vp<[[VP12:%[0-9]+]]> = icmp ule vp<[[VP11]]>, vp<[[VP10]]>
+; FINAL-NEXT: EMIT vp<[[VP13:%[0-9]+]]> = and vp<[[VP12]]>, vp<[[VP4]]>
+; FINAL-NEXT: CLONE ir<%ptr.a> = getelementptr inbounds ir<%a>, vp<%index>
+; FINAL-NEXT: WIDEN ir<%ld.a> = load ir<%ptr.a>, vp<[[VP13]]>
+; FINAL-NEXT: CLONE ir<%ptr.b> = getelementptr inbounds ir<%b>, vp<%index>
+; FINAL-NEXT: WIDEN ir<%ld.b> = load ir<%ptr.b>, vp<[[VP13]]>
+; FINAL-NEXT: WIDEN ir<%add> = add ir<%ld.b>, ir<%ld.a>
+; FINAL-NEXT: CLONE ir<%ptr.c> = getelementptr inbounds ir<%c>, vp<%index>
+; FINAL-NEXT: WIDEN store ir<%ptr.c>, ir<%add>, vp<[[VP13]]>
+; FINAL-NEXT: EMIT vp<%index.next> = add vp<%index>, vp<[[VP5]]>
+; FINAL-NEXT: EMIT vp<[[VP14:%[0-9]+]]> = icmp eq vp<%index.next>, vp<%n.vec>
+; FINAL-NEXT: EMIT branch-on-cond vp<[[VP14]]>
+; FINAL-NEXT: Successor(s): middle.block, vector.body
+; FINAL-EMPTY:
+; FINAL-NEXT: middle.block:
+; FINAL-NEXT: Successor(s): ir-bb<exit>
+; FINAL-EMPTY:
+; FINAL-NEXT: ir-bb<exit>:
+; FINAL-NEXT: No successors
+; FINAL-EMPTY:
+; FINAL-NEXT: ir-bb<scalar.ph>:
+; FINAL-NEXT: Successor(s): ir-bb<for.body>
+; FINAL-EMPTY:
+; FINAL-NEXT: ir-bb<for.body>:
+; FINAL-NEXT: IR %iv = phi i64 [ 0, %scalar.ph ], [ %iv.next, %for.body ] (extra operand: ir<0> from ir-bb<scalar.ph>)
+; FINAL-NEXT: IR %ptr.a = getelementptr inbounds i8, ptr %a, i64 %iv
+; FINAL-NEXT: IR %ld.a = load i8, ptr %ptr.a, align 1
+; FINAL-NEXT: IR %ptr.b = getelementptr inbounds i8, ptr %b, i64 %iv
+; FINAL-NEXT: IR %ld.b = load i8, ptr %ptr.b, align 1
+; FINAL-NEXT: IR %add = add i8 %ld.b, %ld.a
+; FINAL-NEXT: IR %ptr.c = getelementptr inbounds i8, ptr %c, i64 %iv
+; FINAL-NEXT: IR store i8 %add, ptr %ptr.c, align 1
+; FINAL-NEXT: IR %iv.next = add nuw nsw i64 %iv, 1
+; FINAL-NEXT: IR %exitcond.not = icmp eq i64 %iv.next, %n
+; FINAL-NEXT: No successors
+; FINAL-NEXT: }
+;
+entry:
+ br label %for.body
+
+for.body:
+ %iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
+ %ptr.a = getelementptr inbounds i8, ptr %a, i64 %iv
+ %ld.a = load i8, ptr %ptr.a, align 1
+ %ptr.b = getelementptr inbounds i8, ptr %b, i64 %iv
+ %ld.b = load i8, ptr %ptr.b, align 1
+ %add = add i8 %ld.b, %ld.a
+ %ptr.c = getelementptr inbounds i8, ptr %c, i64 %iv
+ store i8 %add, ptr %ptr.c, align 1
+ %iv.next = add nuw nsw i64 %iv, 1
+ %exitcond.not = icmp eq i64 %iv.next, %n
+ br i1 %exitcond.not, label %exit, label %for.body
+
+exit:
+ ret void
+}
diff --git a/llvm/test/Transforms/LoopVectorize/VPlan/vplan-printing.ll b/llvm/test/Transforms/LoopVectorize/VPlan/vplan-printing.ll
index 0d923183e251a..2a77ae3609ad7 100644
--- a/llvm/test/Transforms/LoopVectorize/VPlan/vplan-printing.ll
+++ b/llvm/test/Transforms/LoopVectorize/VPlan/vplan-printing.ll
@@ -478,12 +478,12 @@ define void @print_expand_scev(i64 %y, ptr %ptr) {
; CHECK-NEXT: Live-in vp<[[VP2:%[0-9]+]]> = vector-trip-count
; CHECK-NEXT: vp<[[VP3:%[0-9]+]]> = original trip-count
; CHECK-EMPTY:
-; CHECK-NEXT: ir-bb<entry>:
-; CHECK-NEXT: IR %div = udiv i64 %y, 492802768830814060
-; CHECK-NEXT: IR %inc = add i64 %div, 1
-; CHECK-NEXT: EMIT vp<[[VP3]]> = EXPAND SCEV (1 + ((15 + (%y /u 492802768830814060))<nuw><nsw> /u (1 + (%y /u 492802768830814060))<nuw><nsw>))<nuw><nsw>
-; CHECK-NEXT: EMIT vp<[[VP4:%[0-9]+]]> = EXPAND SCEV (1 + (%y /u 492802768830814060))<nuw><nsw>
-; CHECK-NEXT: Successor(s): scalar.ph, vector.ph
+; CHECK-NEXT: ir-bb<entry>:
+; CHECK-NEXT: EMIT vp<[[VP4:%.+]]> = EXPAND SCEV (1 + (%y /u 492802768830814060))<nuw><nsw>
+; CHECK-NEXT: EMIT vp<[[VP3]]> = EXPAND SCEV (1 + ((15 + (%y /u 492802768830814060))<nuw><nsw> /u (1 + (%y /u 492802768830814060))<nuw><nsw>))<nuw><nsw>
+; CHECK-NEXT: IR %div = udiv i64 %y, 492802768830814060
+; CHECK-NEXT: IR %inc = add i64 %div, 1
+; CHECK-NEXT: Successor(s): scalar.ph, vector.ph
; CHECK-EMPTY:
; CHECK-NEXT: vector.ph:
; CHECK-NEXT: vp<[[VP5:%[0-9]+]]> = DERIVED-IV ir<0> + vp<[[VP2]]> * vp<[[VP4]]>
diff --git a/llvm/test/Transforms/LoopVectorize/alias-mask-negative-tests.ll b/llvm/test/Transforms/LoopVectorize/alias-mask-negative-tests.ll
new file mode 100644
index 0000000000000..d33b68f62e2dc
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/alias-mask-negative-tests.ll
@@ -0,0 +1,83 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --check-globals none --filter-out-after "^scalar.ph:" --version 5
+; RUN: opt -S -force-partial-aliasing-vectorization -force-target-supports-masked-memory-ops -prefer-predicate-over-epilogue=predicate-dont-vectorize -force-vector-width=4 -passes=loop-vectorize %s | FileCheck %s
+
+; Note: First order recurrences are not supported with alias-masking.
+define i32 @first_order_recurrence(ptr nocapture readonly %a, ptr nocapture %b, i32 %n) {
+; CHECK-LABEL: define i32 @first_order_recurrence(
+; CHECK-SAME: ptr readonly captures(none) [[A:%.*]], ptr captures(none) [[B:%.*]], i32 [[N:%.*]]) {
+; CHECK-NEXT: [[ENTRY:.*:]]
+; CHECK-NEXT: [[A2:%.*]] = ptrtoaddr ptr [[A]] to i64
+; CHECK-NEXT: [[B1:%.*]] = ptrtoaddr ptr [[B]] to i64
+; CHECK-NEXT: br label %[[FOR_PREHEADER:.*]]
+; CHECK: [[FOR_PREHEADER]]:
+; CHECK-NEXT: [[PRE_LOAD:%.*]] = load i32, ptr [[A]], align 4
+; CHECK-NEXT: [[TMP0:%.*]] = add i32 [[N]], -1
+; CHECK-NEXT: [[TMP1:%.*]] = zext i32 [[TMP0]] to i64
+; CHECK-NEXT: [[TMP2:%.*]] = add nuw nsw i64 [[TMP1]], 1
+; CHECK-NEXT: br label %[[VECTOR_MEMCHECK:.*]]
+; CHECK: [[VECTOR_MEMCHECK]]:
+; CHECK-NEXT: [[TMP3:%.*]] = add i64 [[B1]], -4
+; CHECK-NEXT: [[TMP4:%.*]] = sub i64 [[TMP3]], [[A2]]
+; CHECK-NEXT: [[DIFF_CHECK:%.*]] = icmp ult i64 [[TMP4]], 16
+; CHECK-NEXT: br i1 [[DIFF_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK: [[VECTOR_PH]]:
+; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[TMP2]], 3
+; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], 4
+; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
+; CHECK-NEXT: [[TRIP_COUNT_MINUS_1:%.*]] = sub i64 [[TMP2]], 1
+; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i64> poison, i64 [[TRIP_COUNT_MINUS_1]], i64 0
+; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i64> [[BROADCAST_SPLATINSERT]], <4 x i64> poison, <4 x i32> zeroinitializer
+; CHECK-NEXT: [[VECTOR_RECUR_INIT:%.*]] = insertelement <4 x i32> poison, i32 [[PRE_LOAD]], i32 3
+; CHECK-NEXT: br label %[[VECTOR_BODY:.*]]
+; CHECK: [[VECTOR_BODY]]:
+; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT: [[VECTOR_RECUR:%.*]] = phi <4 x i32> [ [[VECTOR_RECUR_INIT]], %[[VECTOR_PH]] ], [ [[WIDE_MASKED_LOAD:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT: [[BROADCAST_SPLATINSERT3:%.*]] = insertelement <4 x i64> poison, i64 [[INDEX]], i64 0
+; CHECK-NEXT: [[BROADCAST_SPLAT4:%.*]] = shufflevector <4 x i64> [[BROADCAST_SPLATINSERT3]], <4 x i64> poison, <4 x i32> zeroinitializer
+; CHECK-NEXT: [[VEC_IV:%.*]] = add <4 x i64> [[BROADCAST_SPLAT4]], <i64 0, i64 1, i64 2, i64 3>
+; CHECK-NEXT: [[TMP5:%.*]] = icmp ule <4 x i64> [[VEC_IV]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT: [[TMP6:%.*]] = add nuw nsw i64 [[INDEX]], 1
+; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds i32, ptr [[A]], i64 [[TMP6]]
+; CHECK-NEXT: [[WIDE_MASKED_LOAD]] = call <4 x i32> @llvm.masked.load.v4i32.p0(ptr align 4 [[TMP7]], <4 x i1> [[TMP5]], <4 x i32> poison)
+; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <4 x i32> [[VECTOR_RECUR]], <4 x i32> [[WIDE_MASKED_LOAD]], <4 x i32> <i32 3, i32 4, i32 5, i32 6>
+; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds i32, ptr [[B]], i64 [[INDEX]]
+; CHECK-NEXT: [[TMP10:%.*]] = add <4 x i32> [[WIDE_MASKED_LOAD]], [[TMP8]]
+; CHECK-NEXT: call void @llvm.masked.store.v4i32.p0(<4 x i32> [[TMP10]], ptr align 4 [[TMP9]], <4 x i1> [[TMP5]])
+; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 4
+; CHECK-NEXT: [[TMP11:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT: br i1 [[TMP11]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK: [[MIDDLE_BLOCK]]:
+; CHECK-NEXT: [[TMP12:%.*]] = xor <4 x i1> [[TMP5]], splat (i1 true)
+; CHECK-NEXT: [[FIRST_INACTIVE_LANE:%.*]] = call i64 @llvm.experimental.cttz.elts.i64.v4i1(<4 x i1> [[TMP12]], i1 false)
+; CHECK-NEXT: [[LAST_ACTIVE_LANE:%.*]] = sub i64 [[FIRST_INACTIVE_LANE]], 1
+; CHECK-NEXT: [[TMP13:%.*]] = sub i64 [[LAST_ACTIVE_LANE]], 1
+; CHECK-NEXT: [[TMP14:%.*]] = extractelement <4 x i32> [[WIDE_MASKED_LOAD]], i64 [[TMP13]]
+; CHECK-NEXT: [[TMP15:%.*]] = extractelement <4 x i32> [[VECTOR_RECUR]], i32 3
+; CHECK-NEXT: [[TMP16:%.*]] = icmp eq i64 [[LAST_ACTIVE_LANE]], 0
+; CHECK-NEXT: [[TMP17:%.*]] = select i1 [[TMP16]], i32 [[TMP15]], i32 [[TMP14]]
+; CHECK-NEXT: br [[FOR_EXIT:label %.*]]
+; CHECK: [[SCALAR_PH]]:
+;
+entry:
+ br label %for.preheader
+
+for.preheader:
+ %pre_load = load i32, ptr %a
+ br label %scalar.body
+
+scalar.body:
+ %0 = phi i32 [ %pre_load, %for.preheader ], [ %1, %scalar.body ]
+ %indvars.iv = phi i64 [ 0, %for.preheader ], [ %indvars.iv.next, %scalar.body ]
+ %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+ %arrayidx32 = getelementptr inbounds i32, ptr %a, i64 %indvars.iv.next
+ %1 = load i32, ptr %arrayidx32
+ %arrayidx34 = getelementptr inbounds i32, ptr %b, i64 %indvars.iv
+ %add35 = add i32 %1, %0
+ store i32 %add35, ptr %arrayidx34
+ %lftr.wideiv = trunc i64 %indvars.iv.next to i32
+ %exitcond = icmp eq i32 %lftr.wideiv, %n
+ br i1 %exitcond, label %for.exit, label %scalar.body
+
+for.exit:
+ ret i32 %0
+}
diff --git a/llvm/test/Transforms/LoopVectorize/alias-mask.ll b/llvm/test/Transforms/LoopVectorize/alias-mask.ll
new file mode 100644
index 0000000000000..2ec99bb9d3213
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/alias-mask.ll
@@ -0,0 +1,382 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --check-globals none --filter-out-after "^scalar.ph:" --version 5
+; RUN: opt -S -force-partial-aliasing-vectorization -force-target-supports-masked-memory-ops -prefer-predicate-over-epilogue=predicate-dont-vectorize -force-vector-width=8 -passes=loop-vectorize %s | FileCheck %s
+; RUN: opt -S -force-partial-aliasing-vectorization -force-target-supports-masked-memory-ops -prefer-predicate-over-epilogue=predicate-dont-vectorize -force-vector-interleave=2 -force-vector-width=8 -passes=loop-vectorize %s | FileCheck %s
+; RUN: opt -S -force-partial-aliasing-vectorization -force-target-supports-masked-memory-ops -prefer-predicate-over-epilogue=predicate-dont-vectorize -epilogue-vectorization-force-VF=2 -force-vector-interleave=2 -force-vector-width=8 -passes=loop-vectorize %s | FileCheck %s
+
+; Note: -force-vector-interleave and -epilogue-vectorization-force-VF do not
+; change the results, as alias-masking is not supported with interleaving or
+; epilogue vectorization.
+
+define void @alias_mask(ptr noalias %a, ptr %b, ptr %c, i64 %n) {
+; CHECK-LABEL: define void @alias_mask(
+; CHECK-SAME: ptr noalias [[A:%.*]], ptr [[B:%.*]], ptr [[C:%.*]], i64 [[N:%.*]]) {
+; CHECK-NEXT: [[ENTRY:.*:]]
+; CHECK-NEXT: [[B2:%.*]] = ptrtoaddr ptr [[B]] to i64
+; CHECK-NEXT: [[C1:%.*]] = ptrtoaddr ptr [[C]] to i64
+; CHECK-NEXT: [[CMP11:%.*]] = icmp sgt i64 [[N]], 0
+; CHECK-NEXT: br i1 [[CMP11]], label %[[FOR_BODY_PREHEADER:.*]], [[EXIT:label %.*]]
+; CHECK: [[FOR_BODY_PREHEADER]]:
+; CHECK-NEXT: br label %[[VECTOR_CLAMPED_VF_CHECK:.*]]
+; CHECK: [[VECTOR_CLAMPED_VF_CHECK]]:
+; CHECK-NEXT: [[TMP0:%.*]] = inttoptr i64 [[B2]] to ptr
+; CHECK-NEXT: [[TMP1:%.*]] = inttoptr i64 [[C1]] to ptr
+; CHECK-NEXT: [[ALIAS_MASK:%.*]] = call <8 x i1> @llvm.loop.dependence.war.mask.v8i1(ptr [[TMP0]], ptr [[TMP1]], i64 1)
+; CHECK-NEXT: [[TMP3:%.*]] = zext <8 x i1> [[ALIAS_MASK]] to <8 x i32>
+; CHECK-NEXT: [[TMP4:%.*]] = call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> [[TMP3]])
+; CHECK-NEXT: [[NUM_ACTIVE_LANES:%.*]] = zext i32 [[TMP4]] to i64
+; CHECK-NEXT: [[VF_IS_SCALAR:%.*]] = icmp ule i64 [[NUM_ACTIVE_LANES]], 1
+; CHECK-NEXT: [[TMP5:%.*]] = sub i64 -1, [[N]]
+; CHECK-NEXT: [[VF_STEP_OVERFLOW:%.*]] = icmp ult i64 [[TMP5]], [[NUM_ACTIVE_LANES]]
+; CHECK-NEXT: [[TMP6:%.*]] = or i1 [[VF_IS_SCALAR]], [[VF_STEP_OVERFLOW]]
+; CHECK-NEXT: br i1 [[TMP6]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK: [[VECTOR_PH]]:
+; CHECK-NEXT: [[TMP7:%.*]] = sub i64 [[NUM_ACTIVE_LANES]], 1
+; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[N]], [[TMP7]]
+; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[NUM_ACTIVE_LANES]]
+; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
+; CHECK-NEXT: [[TRIP_COUNT_MINUS_1:%.*]] = sub i64 [[N]], 1
+; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <8 x i64> poison, i64 [[TRIP_COUNT_MINUS_1]], i64 0
+; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <8 x i64> [[BROADCAST_SPLATINSERT]], <8 x i64> poison, <8 x i32> zeroinitializer
+; CHECK-NEXT: br label %[[VECTOR_BODY:.*]]
+; CHECK: [[VECTOR_BODY]]:
+; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT: [[BROADCAST_SPLATINSERT3:%.*]] = insertelement <8 x i64> poison, i64 [[INDEX]], i64 0
+; CHECK-NEXT: [[BROADCAST_SPLAT4:%.*]] = shufflevector <8 x i64> [[BROADCAST_SPLATINSERT3]], <8 x i64> poison, <8 x i32> zeroinitializer
+; CHECK-NEXT: [[VEC_IV:%.*]] = add <8 x i64> [[BROADCAST_SPLAT4]], <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>
+; CHECK-NEXT: [[TMP8:%.*]] = icmp ule <8 x i64> [[VEC_IV]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT: [[TMP9:%.*]] = and <8 x i1> [[TMP8]], [[ALIAS_MASK]]
+; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[INDEX]]
+; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <8 x i8> @llvm.masked.load.v8i8.p0(ptr align 1 [[TMP10]], <8 x i1> [[TMP9]], <8 x i8> poison)
+; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[INDEX]]
+; CHECK-NEXT: [[WIDE_MASKED_LOAD5:%.*]] = call <8 x i8> @llvm.masked.load.v8i8.p0(ptr align 1 [[TMP11]], <8 x i1> [[TMP9]], <8 x i8> poison)
+; CHECK-NEXT: [[TMP12:%.*]] = select <8 x i1> [[TMP9]], <8 x i8> [[WIDE_MASKED_LOAD]], <8 x i8> splat (i8 1)
+; CHECK-NEXT: [[TMP13:%.*]] = sdiv <8 x i8> [[WIDE_MASKED_LOAD5]], [[TMP12]]
+; CHECK-NEXT: [[TMP14:%.*]] = getelementptr inbounds i8, ptr [[C]], i64 [[INDEX]]
+; CHECK-NEXT: call void @llvm.masked.store.v8i8.p0(<8 x i8> [[TMP13]], ptr align 1 [[TMP14]], <8 x i1> [[TMP9]])
+; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[NUM_ACTIVE_LANES]]
+; CHECK-NEXT: [[TMP15:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT: br i1 [[TMP15]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK: [[MIDDLE_BLOCK]]:
+; CHECK-NEXT: br [[EXIT_LOOPEXIT:label %.*]]
+; CHECK: [[SCALAR_PH]]:
+;
+entry:
+ %cmp11 = icmp sgt i64 %n, 0
+ br i1 %cmp11, label %for.body, label %exit
+
+for.body:
+ %iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
+ %gep.a = getelementptr inbounds i8, ptr %a, i64 %iv
+ %load.a = load i8, ptr %gep.a, align 1
+ %gep.b = getelementptr inbounds i8, ptr %b, i64 %iv
+ %load.b = load i8, ptr %gep.b, align 1
+ %div = sdiv i8 %load.b, %load.a
+ %gep.c = getelementptr inbounds i8, ptr %c, i64 %iv
+ store i8 %div, ptr %gep.c, align 1
+ %iv.next = add nuw nsw i64 %iv, 1
+ %exitcond.not = icmp eq i64 %iv.next, %n
+ br i1 %exitcond.not, label %exit, label %for.body
+
+exit:
+ ret void
+}
+
+; Alias mask created by combining multiple dependence masks.
+define void @alias_mask_multiple(ptr %a, ptr %b, ptr %c, i64 %n) {
+; CHECK-LABEL: define void @alias_mask_multiple(
+; CHECK-SAME: ptr [[A:%.*]], ptr [[B:%.*]], ptr [[C:%.*]], i64 [[N:%.*]]) {
+; CHECK-NEXT: [[ENTRY:.*:]]
+; CHECK-NEXT: [[A3:%.*]] = ptrtoaddr ptr [[A]] to i64
+; CHECK-NEXT: [[B2:%.*]] = ptrtoaddr ptr [[B]] to i64
+; CHECK-NEXT: [[C1:%.*]] = ptrtoaddr ptr [[C]] to i64
+; CHECK-NEXT: [[CMP11:%.*]] = icmp sgt i64 [[N]], 0
+; CHECK-NEXT: br i1 [[CMP11]], label %[[FOR_BODY_PREHEADER:.*]], [[EXIT:label %.*]]
+; CHECK: [[FOR_BODY_PREHEADER]]:
+; CHECK-NEXT: br label %[[VECTOR_CLAMPED_VF_CHECK:.*]]
+; CHECK: [[VECTOR_CLAMPED_VF_CHECK]]:
+; CHECK-NEXT: [[TMP0:%.*]] = inttoptr i64 [[A3]] to ptr
+; CHECK-NEXT: [[TMP1:%.*]] = inttoptr i64 [[C1]] to ptr
+; CHECK-NEXT: [[TMP2:%.*]] = call <8 x i1> @llvm.loop.dependence.war.mask.v8i1(ptr [[TMP0]], ptr [[TMP1]], i64 1)
+; CHECK-NEXT: [[TMP3:%.*]] = inttoptr i64 [[B2]] to ptr
+; CHECK-NEXT: [[TMP4:%.*]] = inttoptr i64 [[C1]] to ptr
+; CHECK-NEXT: [[TMP5:%.*]] = call <8 x i1> @llvm.loop.dependence.war.mask.v8i1(ptr [[TMP3]], ptr [[TMP4]], i64 1)
+; CHECK-NEXT: [[ALIAS_MASK:%.*]] = and <8 x i1> [[TMP2]], [[TMP5]]
+; CHECK-NEXT: [[TMP7:%.*]] = zext <8 x i1> [[ALIAS_MASK]] to <8 x i32>
+; CHECK-NEXT: [[TMP8:%.*]] = call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> [[TMP7]])
+; CHECK-NEXT: [[NUM_ACTIVE_LANES:%.*]] = zext i32 [[TMP8]] to i64
+; CHECK-NEXT: [[VF_IS_SCALAR:%.*]] = icmp ule i64 [[NUM_ACTIVE_LANES]], 1
+; CHECK-NEXT: [[TMP9:%.*]] = sub i64 -1, [[N]]
+; CHECK-NEXT: [[VF_STEP_OVERFLOW:%.*]] = icmp ult i64 [[TMP9]], [[NUM_ACTIVE_LANES]]
+; CHECK-NEXT: [[TMP10:%.*]] = or i1 [[VF_IS_SCALAR]], [[VF_STEP_OVERFLOW]]
+; CHECK-NEXT: br i1 [[TMP10]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK: [[VECTOR_PH]]:
+; CHECK-NEXT: [[TMP11:%.*]] = sub i64 [[NUM_ACTIVE_LANES]], 1
+; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[N]], [[TMP11]]
+; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[NUM_ACTIVE_LANES]]
+; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
+; CHECK-NEXT: [[TRIP_COUNT_MINUS_1:%.*]] = sub i64 [[N]], 1
+; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <8 x i64> poison, i64 [[TRIP_COUNT_MINUS_1]], i64 0
+; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <8 x i64> [[BROADCAST_SPLATINSERT]], <8 x i64> poison, <8 x i32> zeroinitializer
+; CHECK-NEXT: br label %[[VECTOR_BODY:.*]]
+; CHECK: [[VECTOR_BODY]]:
+; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT: [[BROADCAST_SPLATINSERT4:%.*]] = insertelement <8 x i64> poison, i64 [[INDEX]], i64 0
+; CHECK-NEXT: [[BROADCAST_SPLAT5:%.*]] = shufflevector <8 x i64> [[BROADCAST_SPLATINSERT4]], <8 x i64> poison, <8 x i32> zeroinitializer
+; CHECK-NEXT: [[VEC_IV:%.*]] = add <8 x i64> [[BROADCAST_SPLAT5]], <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>
+; CHECK-NEXT: [[TMP12:%.*]] = icmp ule <8 x i64> [[VEC_IV]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT: [[TMP13:%.*]] = and <8 x i1> [[TMP12]], [[ALIAS_MASK]]
+; CHECK-NEXT: [[TMP14:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[INDEX]]
+; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <8 x i8> @llvm.masked.load.v8i8.p0(ptr align 1 [[TMP14]], <8 x i1> [[TMP13]], <8 x i8> poison)
+; CHECK-NEXT: [[TMP15:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[INDEX]]
+; CHECK-NEXT: [[WIDE_MASKED_LOAD6:%.*]] = call <8 x i8> @llvm.masked.load.v8i8.p0(ptr align 1 [[TMP15]], <8 x i1> [[TMP13]], <8 x i8> poison)
+; CHECK-NEXT: [[TMP16:%.*]] = add <8 x i8> [[WIDE_MASKED_LOAD6]], [[WIDE_MASKED_LOAD]]
+; CHECK-NEXT: [[TMP17:%.*]] = getelementptr inbounds i8, ptr [[C]], i64 [[INDEX]]
+; CHECK-NEXT: call void @llvm.masked.store.v8i8.p0(<8 x i8> [[TMP16]], ptr align 1 [[TMP17]], <8 x i1> [[TMP13]])
+; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[NUM_ACTIVE_LANES]]
+; CHECK-NEXT: [[TMP18:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT: br i1 [[TMP18]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
+; CHECK: [[MIDDLE_BLOCK]]:
+; CHECK-NEXT: br [[EXIT_LOOPEXIT:label %.*]]
+; CHECK: [[SCALAR_PH]]:
+;
+entry:
+ %cmp11 = icmp sgt i64 %n, 0
+ br i1 %cmp11, label %for.body, label %exit
+
+for.body:
+ %iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
+ %gep.a = getelementptr inbounds i8, ptr %a, i64 %iv
+ %load.a = load i8, ptr %gep.a, align 1
+ %gep.b = getelementptr inbounds i8, ptr %b, i64 %iv
+ %load.b = load i8, ptr %gep.b, align 1
+ %add = add i8 %load.b, %load.a
+ %gep.c = getelementptr inbounds i8, ptr %c, i64 %iv
+ store i8 %add, ptr %gep.c, align 1
+ %iv.next = add nuw nsw i64 %iv, 1
+ %exitcond.not = icmp eq i64 %iv.next, %n
+ br i1 %exitcond.not, label %exit, label %for.body
+
+exit:
+ ret void
+}
+
+; Alias masking + a simple add reduction.
+define i32 @alias_mask_with_reduction(ptr noalias %a, ptr %b, ptr %c, i64 %n) {
+; CHECK-LABEL: define i32 @alias_mask_with_reduction(
+; CHECK-SAME: ptr noalias [[A:%.*]], ptr [[B:%.*]], ptr [[C:%.*]], i64 [[N:%.*]]) {
+; CHECK-NEXT: [[ENTRY:.*:]]
+; CHECK-NEXT: [[B2:%.*]] = ptrtoaddr ptr [[B]] to i64
+; CHECK-NEXT: [[C1:%.*]] = ptrtoaddr ptr [[C]] to i64
+; CHECK-NEXT: br label %[[VECTOR_CLAMPED_VF_CHECK:.*]]
+; CHECK: [[VECTOR_CLAMPED_VF_CHECK]]:
+; CHECK-NEXT: [[TMP0:%.*]] = inttoptr i64 [[B2]] to ptr
+; CHECK-NEXT: [[TMP1:%.*]] = inttoptr i64 [[C1]] to ptr
+; CHECK-NEXT: [[ALIAS_MASK:%.*]] = call <8 x i1> @llvm.loop.dependence.war.mask.v8i1(ptr [[TMP0]], ptr [[TMP1]], i64 1)
+; CHECK-NEXT: [[TMP3:%.*]] = zext <8 x i1> [[ALIAS_MASK]] to <8 x i32>
+; CHECK-NEXT: [[TMP4:%.*]] = call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> [[TMP3]])
+; CHECK-NEXT: [[NUM_ACTIVE_LANES:%.*]] = zext i32 [[TMP4]] to i64
+; CHECK-NEXT: [[VF_IS_SCALAR:%.*]] = icmp ule i64 [[NUM_ACTIVE_LANES]], 1
+; CHECK-NEXT: [[TMP5:%.*]] = sub i64 -1, [[N]]
+; CHECK-NEXT: [[VF_STEP_OVERFLOW:%.*]] = icmp ult i64 [[TMP5]], [[NUM_ACTIVE_LANES]]
+; CHECK-NEXT: [[TMP6:%.*]] = or i1 [[VF_IS_SCALAR]], [[VF_STEP_OVERFLOW]]
+; CHECK-NEXT: br i1 [[TMP6]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK: [[VECTOR_PH]]:
+; CHECK-NEXT: [[TMP7:%.*]] = sub i64 [[NUM_ACTIVE_LANES]], 1
+; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[N]], [[TMP7]]
+; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[NUM_ACTIVE_LANES]]
+; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
+; CHECK-NEXT: [[TRIP_COUNT_MINUS_1:%.*]] = sub i64 [[N]], 1
+; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <8 x i64> poison, i64 [[TRIP_COUNT_MINUS_1]], i64 0
+; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <8 x i64> [[BROADCAST_SPLATINSERT]], <8 x i64> poison, <8 x i32> zeroinitializer
+; CHECK-NEXT: br label %[[VECTOR_BODY:.*]]
+; CHECK: [[VECTOR_BODY]]:
+; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT: [[VEC_PHI:%.*]] = phi <8 x i32> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP15:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT: [[BROADCAST_SPLATINSERT3:%.*]] = insertelement <8 x i64> poison, i64 [[INDEX]], i64 0
+; CHECK-NEXT: [[BROADCAST_SPLAT4:%.*]] = shufflevector <8 x i64> [[BROADCAST_SPLATINSERT3]], <8 x i64> poison, <8 x i32> zeroinitializer
+; CHECK-NEXT: [[VEC_IV:%.*]] = add <8 x i64> [[BROADCAST_SPLAT4]], <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>
+; CHECK-NEXT: [[TMP8:%.*]] = icmp ule <8 x i64> [[VEC_IV]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT: [[TMP9:%.*]] = and <8 x i1> [[TMP8]], [[ALIAS_MASK]]
+; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds nuw i8, ptr [[A]], i64 [[INDEX]]
+; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <8 x i8> @llvm.masked.load.v8i8.p0(ptr align 1 [[TMP10]], <8 x i1> [[TMP9]], <8 x i8> poison)
+; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds nuw i8, ptr [[B]], i64 [[INDEX]]
+; CHECK-NEXT: [[WIDE_MASKED_LOAD5:%.*]] = call <8 x i8> @llvm.masked.load.v8i8.p0(ptr align 1 [[TMP11]], <8 x i1> [[TMP9]], <8 x i8> poison)
+; CHECK-NEXT: [[TMP12:%.*]] = add <8 x i8> [[WIDE_MASKED_LOAD5]], [[WIDE_MASKED_LOAD]]
+; CHECK-NEXT: [[TMP13:%.*]] = getelementptr inbounds nuw i8, ptr [[C]], i64 [[INDEX]]
+; CHECK-NEXT: call void @llvm.masked.store.v8i8.p0(<8 x i8> [[TMP12]], ptr align 1 [[TMP13]], <8 x i1> [[TMP9]])
+; CHECK-NEXT: [[TMP14:%.*]] = zext <8 x i8> [[TMP12]] to <8 x i32>
+; CHECK-NEXT: [[TMP15]] = add <8 x i32> [[VEC_PHI]], [[TMP14]]
+; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[NUM_ACTIVE_LANES]]
+; CHECK-NEXT: [[TMP16:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT: br i1 [[TMP16]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP6:![0-9]+]]
+; CHECK: [[MIDDLE_BLOCK]]:
+; CHECK-NEXT: [[TMP17:%.*]] = select <8 x i1> [[TMP9]], <8 x i32> [[TMP15]], <8 x i32> [[VEC_PHI]]
+; CHECK-NEXT: [[TMP18:%.*]] = call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> [[TMP17]])
+; CHECK-NEXT: br [[EXIT:label %.*]]
+; CHECK: [[SCALAR_PH]]:
+;
+entry:
+ br label %for.body
+
+for.body:
+ %iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
+ %reduce = phi i32 [ 0, %entry ], [ %reduce.next, %for.body ]
+ %ptr.a = getelementptr inbounds nuw i8, ptr %a, i64 %iv
+ %ld.a = load i8, ptr %ptr.a, align 1
+ %ptr.b = getelementptr inbounds nuw i8, ptr %b, i64 %iv
+ %ld.b = load i8, ptr %ptr.b, align 1
+ %add = add i8 %ld.b, %ld.a
+ %ptr.c = getelementptr inbounds nuw i8, ptr %c, i64 %iv
+ store i8 %add, ptr %ptr.c, align 1
+ %ext.add = zext i8 %add to i32
+ %reduce.next = add nuw nsw i32 %reduce, %ext.add
+ %iv.next = add nuw nsw i64 %iv, 1
+ %exitcond.not = icmp eq i64 %iv.next, %n
+ br i1 %exitcond.not, label %exit, label %for.body
+
+exit:
+ ret i32 %reduce.next
+}
+
+define void @alias_mask_non_default_address_space(ptr addrspace(1) noalias %a, ptr addrspace(1) %b, ptr addrspace(1) %c, i64 %n) {
+; CHECK-LABEL: define void @alias_mask_non_default_address_space(
+; CHECK-SAME: ptr addrspace(1) noalias [[A:%.*]], ptr addrspace(1) [[B:%.*]], ptr addrspace(1) [[C:%.*]], i64 [[N:%.*]]) {
+; CHECK-NEXT: [[ENTRY:.*:]]
+; CHECK-NEXT: [[B2:%.*]] = ptrtoaddr ptr addrspace(1) [[B]] to i64
+; CHECK-NEXT: [[C1:%.*]] = ptrtoaddr ptr addrspace(1) [[C]] to i64
+; CHECK-NEXT: [[CMP11:%.*]] = icmp sgt i64 [[N]], 0
+; CHECK-NEXT: br i1 [[CMP11]], label %[[FOR_BODY_PREHEADER:.*]], [[EXIT:label %.*]]
+; CHECK: [[FOR_BODY_PREHEADER]]:
+; CHECK-NEXT: br label %[[VECTOR_CLAMPED_VF_CHECK:.*]]
+; CHECK: [[VECTOR_CLAMPED_VF_CHECK]]:
+; CHECK-NEXT: [[TMP0:%.*]] = inttoptr i64 [[B2]] to ptr
+; CHECK-NEXT: [[TMP1:%.*]] = inttoptr i64 [[C1]] to ptr
+; CHECK-NEXT: [[ALIAS_MASK:%.*]] = call <8 x i1> @llvm.loop.dependence.war.mask.v8i1(ptr [[TMP0]], ptr [[TMP1]], i64 1)
+; CHECK-NEXT: [[TMP3:%.*]] = zext <8 x i1> [[ALIAS_MASK]] to <8 x i32>
+; CHECK-NEXT: [[TMP4:%.*]] = call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> [[TMP3]])
+; CHECK-NEXT: [[NUM_ACTIVE_LANES:%.*]] = zext i32 [[TMP4]] to i64
+; CHECK-NEXT: [[VF_IS_SCALAR:%.*]] = icmp ule i64 [[NUM_ACTIVE_LANES]], 1
+; CHECK-NEXT: [[TMP5:%.*]] = sub i64 -1, [[N]]
+; CHECK-NEXT: [[VF_STEP_OVERFLOW:%.*]] = icmp ult i64 [[TMP5]], [[NUM_ACTIVE_LANES]]
+; CHECK-NEXT: [[TMP6:%.*]] = or i1 [[VF_IS_SCALAR]], [[VF_STEP_OVERFLOW]]
+; CHECK-NEXT: br i1 [[TMP6]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK: [[VECTOR_PH]]:
+; CHECK-NEXT: [[TMP7:%.*]] = sub i64 [[NUM_ACTIVE_LANES]], 1
+; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[N]], [[TMP7]]
+; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[NUM_ACTIVE_LANES]]
+; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
+; CHECK-NEXT: [[TRIP_COUNT_MINUS_1:%.*]] = sub i64 [[N]], 1
+; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <8 x i64> poison, i64 [[TRIP_COUNT_MINUS_1]], i64 0
+; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <8 x i64> [[BROADCAST_SPLATINSERT]], <8 x i64> poison, <8 x i32> zeroinitializer
+; CHECK-NEXT: br label %[[VECTOR_BODY:.*]]
+; CHECK: [[VECTOR_BODY]]:
+; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT: [[BROADCAST_SPLATINSERT3:%.*]] = insertelement <8 x i64> poison, i64 [[INDEX]], i64 0
+; CHECK-NEXT: [[BROADCAST_SPLAT4:%.*]] = shufflevector <8 x i64> [[BROADCAST_SPLATINSERT3]], <8 x i64> poison, <8 x i32> zeroinitializer
+; CHECK-NEXT: [[VEC_IV:%.*]] = add <8 x i64> [[BROADCAST_SPLAT4]], <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>
+; CHECK-NEXT: [[TMP8:%.*]] = icmp ule <8 x i64> [[VEC_IV]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT: [[TMP9:%.*]] = and <8 x i1> [[TMP8]], [[ALIAS_MASK]]
+; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[A]], i64 [[INDEX]]
+; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <8 x i8> @llvm.masked.load.v8i8.p1(ptr addrspace(1) align 1 [[TMP10]], <8 x i1> [[TMP9]], <8 x i8> poison)
+; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[B]], i64 [[INDEX]]
+; CHECK-NEXT: [[WIDE_MASKED_LOAD5:%.*]] = call <8 x i8> @llvm.masked.load.v8i8.p1(ptr addrspace(1) align 1 [[TMP11]], <8 x i1> [[TMP9]], <8 x i8> poison)
+; CHECK-NEXT: [[TMP12:%.*]] = select <8 x i1> [[TMP9]], <8 x i8> [[WIDE_MASKED_LOAD]], <8 x i8> splat (i8 1)
+; CHECK-NEXT: [[TMP13:%.*]] = sdiv <8 x i8> [[WIDE_MASKED_LOAD5]], [[TMP12]]
+; CHECK-NEXT: [[TMP14:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[C]], i64 [[INDEX]]
+; CHECK-NEXT: call void @llvm.masked.store.v8i8.p1(<8 x i8> [[TMP13]], ptr addrspace(1) align 1 [[TMP14]], <8 x i1> [[TMP9]])
+; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[NUM_ACTIVE_LANES]]
+; CHECK-NEXT: [[TMP15:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT: br i1 [[TMP15]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP8:![0-9]+]]
+; CHECK: [[MIDDLE_BLOCK]]:
+; CHECK-NEXT: br [[EXIT_LOOPEXIT:label %.*]]
+; CHECK: [[SCALAR_PH]]:
+;
+entry:
+ %cmp11 = icmp sgt i64 %n, 0
+ br i1 %cmp11, label %for.body, label %exit
+
+for.body:
+ %iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
+ %gep.a = getelementptr inbounds i8, ptr addrspace(1) %a, i64 %iv
+ %load.a = load i8, ptr addrspace(1) %gep.a, align 1
+ %gep.b = getelementptr inbounds i8, ptr addrspace(1) %b, i64 %iv
+ %load.b = load i8, ptr addrspace(1) %gep.b, align 1
+ %div = sdiv i8 %load.b, %load.a
+ %gep.c = getelementptr inbounds i8, ptr addrspace(1) %c, i64 %iv
+ store i8 %div, ptr addrspace(1) %gep.c, align 1
+ %iv.next = add nuw nsw i64 %iv, 1
+ %exitcond.not = icmp eq i64 %iv.next, %n
+ br i1 %exitcond.not, label %exit, label %for.body
+
+exit:
+ ret void
+}
+
+; Test the alias mask with a known trip count that would take a single iteration at the full VF.
+define void @alias_mask_known_trip_count(ptr noalias %a, ptr %b, ptr %c) {
+; CHECK-LABEL: define void @alias_mask_known_trip_count(
+; CHECK-SAME: ptr noalias [[A:%.*]], ptr [[B:%.*]], ptr [[C:%.*]]) {
+; CHECK-NEXT: [[ENTRY:.*:]]
+; CHECK-NEXT: [[B2:%.*]] = ptrtoaddr ptr [[B]] to i64
+; CHECK-NEXT: [[C1:%.*]] = ptrtoaddr ptr [[C]] to i64
+; CHECK-NEXT: br label %[[VECTOR_CLAMPED_VF_CHECK:.*]]
+; CHECK: [[VECTOR_CLAMPED_VF_CHECK]]:
+; CHECK-NEXT: [[TMP0:%.*]] = inttoptr i64 [[B2]] to ptr
+; CHECK-NEXT: [[TMP1:%.*]] = inttoptr i64 [[C1]] to ptr
+; CHECK-NEXT: [[ALIAS_MASK:%.*]] = call <8 x i1> @llvm.loop.dependence.war.mask.v8i1(ptr [[TMP0]], ptr [[TMP1]], i64 1)
+; CHECK-NEXT: [[TMP3:%.*]] = zext <8 x i1> [[ALIAS_MASK]] to <8 x i32>
+; CHECK-NEXT: [[TMP4:%.*]] = call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> [[TMP3]])
+; CHECK-NEXT: [[NUM_ACTIVE_LANES:%.*]] = zext i32 [[TMP4]] to i64
+; CHECK-NEXT: [[VF_IS_SCALAR:%.*]] = icmp ule i64 [[NUM_ACTIVE_LANES]], 1
+; CHECK-NEXT: [[VF_STEP_OVERFLOW:%.*]] = icmp ult i64 -8, [[NUM_ACTIVE_LANES]]
+; CHECK-NEXT: [[TMP5:%.*]] = or i1 [[VF_IS_SCALAR]], [[VF_STEP_OVERFLOW]]
+; CHECK-NEXT: br i1 [[TMP5]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK: [[VECTOR_PH]]:
+; CHECK-NEXT: [[TMP6:%.*]] = sub i64 [[NUM_ACTIVE_LANES]], 1
+; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 7, [[TMP6]]
+; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[NUM_ACTIVE_LANES]]
+; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
+; CHECK-NEXT: br label %[[VECTOR_BODY:.*]]
+; CHECK: [[VECTOR_BODY]]:
+; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <8 x i64> poison, i64 [[INDEX]], i64 0
+; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <8 x i64> [[BROADCAST_SPLATINSERT]], <8 x i64> poison, <8 x i32> zeroinitializer
+; CHECK-NEXT: [[VEC_IV:%.*]] = add <8 x i64> [[BROADCAST_SPLAT]], <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>
+; CHECK-NEXT: [[TMP7:%.*]] = icmp ule <8 x i64> [[VEC_IV]], splat (i64 6)
+; CHECK-NEXT: [[TMP8:%.*]] = and <8 x i1> [[TMP7]], [[ALIAS_MASK]]
+; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds nuw i8, ptr [[A]], i64 [[INDEX]]
+; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <8 x i8> @llvm.masked.load.v8i8.p0(ptr align 1 [[TMP9]], <8 x i1> [[TMP8]], <8 x i8> poison)
+; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds nuw i8, ptr [[B]], i64 [[INDEX]]
+; CHECK-NEXT: [[WIDE_MASKED_LOAD3:%.*]] = call <8 x i8> @llvm.masked.load.v8i8.p0(ptr align 1 [[TMP10]], <8 x i1> [[TMP8]], <8 x i8> poison)
+; CHECK-NEXT: [[TMP11:%.*]] = add <8 x i8> [[WIDE_MASKED_LOAD3]], [[WIDE_MASKED_LOAD]]
+; CHECK-NEXT: [[TMP12:%.*]] = getelementptr inbounds nuw i8, ptr [[C]], i64 [[INDEX]]
+; CHECK-NEXT: call void @llvm.masked.store.v8i8.p0(<8 x i8> [[TMP11]], ptr align 1 [[TMP12]], <8 x i1> [[TMP8]])
+; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[NUM_ACTIVE_LANES]]
+; CHECK-NEXT: [[TMP13:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT: br i1 [[TMP13]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP10:![0-9]+]]
+; CHECK: [[MIDDLE_BLOCK]]:
+; CHECK-NEXT: br [[EXIT:label %.*]]
+; CHECK: [[SCALAR_PH]]:
+;
+entry:
+ br label %for.body
+
+for.body:
+ %iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
+ %ptr.a = getelementptr inbounds nuw i8, ptr %a, i64 %iv
+ %load.a = load i8, ptr %ptr.a, align 1
+ %ptr.b = getelementptr inbounds nuw i8, ptr %b, i64 %iv
+ %load.b = load i8, ptr %ptr.b, align 1
+ %add = add i8 %load.b, %load.a
+ %ptr.c = getelementptr inbounds nuw i8, ptr %c, i64 %iv
+ store i8 %add, ptr %ptr.c, align 1
+ %iv.next = add nuw nsw i64 %iv, 1
+ %exitcond.not = icmp eq i64 %iv.next, 7
+ br i1 %exitcond.not, label %exit, label %for.body, !llvm.loop !2
+
+exit:
+ ret void
+}
+
+!2 = distinct !{!2, !3}
+!3 = !{!"llvm.loop.vectorize.enable", i1 true}
diff --git a/llvm/test/Transforms/LoopVectorize/pointer-induction.ll b/llvm/test/Transforms/LoopVectorize/pointer-induction.ll
index d5088fe60ee9f..e87cb5b27d5e1 100644
--- a/llvm/test/Transforms/LoopVectorize/pointer-induction.ll
+++ b/llvm/test/Transforms/LoopVectorize/pointer-induction.ll
@@ -437,8 +437,8 @@ define i64 @ivopt_widen_ptr_indvar_1(ptr noalias %a, i64 %stride, i64 %n) {
;
; STRIDED-LABEL: @ivopt_widen_ptr_indvar_1(
; STRIDED-NEXT: entry:
-; STRIDED-NEXT: [[TMP0:%.*]] = add i64 [[N:%.*]], 1
; STRIDED-NEXT: [[TMP1:%.*]] = shl i64 [[STRIDE:%.*]], 3
+; STRIDED-NEXT: [[TMP0:%.*]] = add i64 [[N:%.*]], 1
; STRIDED-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP0]], 4
; STRIDED-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
; STRIDED: vector.ph:
@@ -522,8 +522,8 @@ define i64 @ivopt_widen_ptr_indvar_2(ptr noalias %a, i64 %stride, i64 %n) {
;
; STRIDED-LABEL: @ivopt_widen_ptr_indvar_2(
; STRIDED-NEXT: entry:
-; STRIDED-NEXT: [[TMP0:%.*]] = add i64 [[N:%.*]], 1
; STRIDED-NEXT: [[TMP1:%.*]] = shl i64 [[STRIDE:%.*]], 3
+; STRIDED-NEXT: [[TMP0:%.*]] = add i64 [[N:%.*]], 1
; STRIDED-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP0]], 4
; STRIDED-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
; STRIDED: vector.ph:
@@ -629,8 +629,8 @@ define i64 @ivopt_widen_ptr_indvar_3(ptr noalias %a, i64 %stride, i64 %n) {
;
; STRIDED-LABEL: @ivopt_widen_ptr_indvar_3(
; STRIDED-NEXT: entry:
-; STRIDED-NEXT: [[TMP0:%.*]] = add i64 [[N:%.*]], 1
; STRIDED-NEXT: [[TMP1:%.*]] = shl i64 [[STRIDE:%.*]], 3
+; STRIDED-NEXT: [[TMP0:%.*]] = add i64 [[N:%.*]], 1
; STRIDED-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP0]], 4
; STRIDED-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
; STRIDED: vector.ph:
@@ -711,10 +711,10 @@ define void @strided_ptr_iv_runtime_stride(ptr %pIn, ptr %pOut, i32 %nCols, i32
; STRIDED-NEXT: entry:
; STRIDED-NEXT: [[PIN2:%.*]] = ptrtoaddr ptr [[PIN:%.*]] to i64
; STRIDED-NEXT: [[POUT1:%.*]] = ptrtoaddr ptr [[POUT:%.*]] to i64
-; STRIDED-NEXT: [[TMP0:%.*]] = zext i32 [[NCOLS:%.*]] to i64
-; STRIDED-NEXT: [[UMAX:%.*]] = call i64 @llvm.umax.i64(i64 [[TMP0]], i64 1)
; STRIDED-NEXT: [[TMP1:%.*]] = sext i32 [[STRIDE:%.*]] to i64
; STRIDED-NEXT: [[TMP2:%.*]] = shl nsw i64 [[TMP1]], 2
+; STRIDED-NEXT: [[TMP10:%.*]] = zext i32 [[NCOLS:%.*]] to i64
+; STRIDED-NEXT: [[UMAX:%.*]] = call i64 @llvm.umax.i64(i64 [[TMP10]], i64 1)
; STRIDED-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[UMAX]], 4
; STRIDED-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_SCEVCHECK:%.*]]
; STRIDED: vector.scevcheck:
diff --git a/llvm/test/Transforms/LoopVectorize/reuse-lcssa-phi-scev-expansion.ll b/llvm/test/Transforms/LoopVectorize/reuse-lcssa-phi-scev-expansion.ll
index 55c73cb0928ff..c97fc36ac76d1 100644
--- a/llvm/test/Transforms/LoopVectorize/reuse-lcssa-phi-scev-expansion.ll
+++ b/llvm/test/Transforms/LoopVectorize/reuse-lcssa-phi-scev-expansion.ll
@@ -205,10 +205,15 @@ define void @expand_diff_scev_unknown(ptr %dst, i1 %invar.c, i32 %step) mustprog
; CHECK-NEXT: br i1 [[INVAR_C]], label %[[LOOP_2_PREHEADER:.*]], label %[[LOOP_1]]
; CHECK: [[LOOP_2_PREHEADER]]:
; CHECK-NEXT: [[IV_1_LCSSA:%.*]] = phi i32 [ [[IV_1]], %[[LOOP_1]] ]
+; CHECK-NEXT: [[TMP0:%.*]] = sub i32 2, [[STEP]]
+; CHECK-NEXT: [[TMP12:%.*]] = add i32 [[IV_1_LCSSA]], [[TMP0]]
+; CHECK-NEXT: [[SMAX1:%.*]] = call i32 @llvm.smax.i32(i32 [[TMP12]], i32 0)
+; CHECK-NEXT: [[TMP3:%.*]] = mul i32 [[INDVAR]], -1
+; CHECK-NEXT: [[TMP14:%.*]] = add i32 [[TMP3]], -1
+; CHECK-NEXT: [[TMP15:%.*]] = add i32 [[SMAX1]], [[TMP14]]
; CHECK-NEXT: [[TMP1:%.*]] = add i32 [[IV_1_LCSSA]], [[STEP]]
; CHECK-NEXT: [[SMAX:%.*]] = call i32 @llvm.smax.i32(i32 [[TMP1]], i32 0)
; CHECK-NEXT: [[TMP2:%.*]] = mul i32 [[STEP]], -2
-; CHECK-NEXT: [[TMP3:%.*]] = mul i32 [[INDVAR]], -1
; CHECK-NEXT: [[TMP4:%.*]] = add i32 [[TMP3]], [[TMP2]]
; CHECK-NEXT: [[TMP5:%.*]] = add i32 [[SMAX]], [[TMP4]]
; CHECK-NEXT: [[UMIN:%.*]] = call i32 @llvm.umin.i32(i32 [[TMP5]], i32 1)
@@ -217,11 +222,6 @@ define void @expand_diff_scev_unknown(ptr %dst, i1 %invar.c, i32 %step) mustprog
; CHECK-NEXT: [[UMAX:%.*]] = call i32 @llvm.umax.i32(i32 [[STEP]], i32 1)
; CHECK-NEXT: [[TMP8:%.*]] = udiv i32 [[TMP7]], [[UMAX]]
; CHECK-NEXT: [[TMP9:%.*]] = add i32 [[TMP6]], [[TMP8]]
-; CHECK-NEXT: [[TMP16:%.*]] = sub i32 2, [[STEP]]
-; CHECK-NEXT: [[TMP12:%.*]] = add i32 [[IV_1_LCSSA]], [[TMP16]]
-; CHECK-NEXT: [[SMAX1:%.*]] = call i32 @llvm.smax.i32(i32 [[TMP12]], i32 0)
-; CHECK-NEXT: [[TMP14:%.*]] = add i32 [[TMP3]], -1
-; CHECK-NEXT: [[TMP15:%.*]] = add i32 [[SMAX1]], [[TMP14]]
; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[TMP15]], 2
; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_SCEVCHECK:.*]]
; CHECK: [[VECTOR_SCEVCHECK]]:
>From 3b06dbe0bbe65d3008a3d5f6913d15a5552898eb Mon Sep 17 00:00:00 2001
From: Benjamin Maxwell <benjamin.maxwell at arm.com>
Date: Wed, 4 Mar 2026 16:44:33 +0000
Subject: [PATCH 2/2] Rebase tests
---
.../LoopVectorize/AArch64/alias-mask.ll | 30 ++++---------------
.../AArch64/vplan-printing-alias-mask.ll | 17 +++++------
2 files changed, 14 insertions(+), 33 deletions(-)
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/alias-mask.ll b/llvm/test/Transforms/LoopVectorize/AArch64/alias-mask.ll
index 65b02eb9c79e1..c0cce7013ec8f 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/alias-mask.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/alias-mask.ll
@@ -24,9 +24,6 @@ define void @alias_mask(ptr noalias %a, ptr %b, ptr %c, i64 %n) {
; CHECK-TF-NEXT: [[TMP6:%.*]] = or i1 [[VF_IS_SCALAR]], [[VF_STEP_OVERFLOW]]
; CHECK-TF-NEXT: br i1 [[TMP6]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
; CHECK-TF: [[VECTOR_PH]]:
-; CHECK-TF-NEXT: [[TMP7:%.*]] = sub i64 [[N]], [[NUM_ACTIVE_LANES]]
-; CHECK-TF-NEXT: [[TMP8:%.*]] = icmp ugt i64 [[N]], [[NUM_ACTIVE_LANES]]
-; CHECK-TF-NEXT: [[TMP9:%.*]] = select i1 [[TMP8]], i64 [[TMP7]], i64 0
; CHECK-TF-NEXT: [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 [[N]])
; CHECK-TF-NEXT: br label %[[VECTOR_BODY:.*]]
; CHECK-TF: [[VECTOR_BODY]]:
@@ -42,7 +39,7 @@ define void @alias_mask(ptr noalias %a, ptr %b, ptr %c, i64 %n) {
; CHECK-TF-NEXT: [[TMP15:%.*]] = getelementptr inbounds i8, ptr [[C]], i64 [[INDEX]]
; CHECK-TF-NEXT: call void @llvm.masked.store.nxv16i8.p0(<vscale x 16 x i8> [[TMP14]], ptr align 1 [[TMP15]], <vscale x 16 x i1> [[TMP10]])
; CHECK-TF-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[NUM_ACTIVE_LANES]]
-; CHECK-TF-NEXT: [[ACTIVE_LANE_MASK_NEXT]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 [[INDEX]], i64 [[TMP9]])
+; CHECK-TF-NEXT: [[ACTIVE_LANE_MASK_NEXT]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 [[INDEX_NEXT]], i64 [[N]])
; CHECK-TF-NEXT: [[TMP16:%.*]] = extractelement <vscale x 16 x i1> [[ACTIVE_LANE_MASK_NEXT]], i32 0
; CHECK-TF-NEXT: [[TMP17:%.*]] = xor i1 [[TMP16]], true
; CHECK-TF-NEXT: br i1 [[TMP17]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
@@ -98,9 +95,6 @@ define i32 @alias_mask_read_after_write(ptr noalias %a, ptr %b, ptr %c, i64 %n)
; CHECK-TF-NEXT: [[TMP6:%.*]] = or i1 [[VF_IS_SCALAR]], [[VF_STEP_OVERFLOW]]
; CHECK-TF-NEXT: br i1 [[TMP6]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
; CHECK-TF: [[VECTOR_PH]]:
-; CHECK-TF-NEXT: [[TMP7:%.*]] = sub i64 [[N]], [[NUM_ACTIVE_LANES]]
-; CHECK-TF-NEXT: [[TMP8:%.*]] = icmp ugt i64 [[N]], [[NUM_ACTIVE_LANES]]
-; CHECK-TF-NEXT: [[TMP9:%.*]] = select i1 [[TMP8]], i64 [[TMP7]], i64 0
; CHECK-TF-NEXT: [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[N]])
; CHECK-TF-NEXT: br label %[[VECTOR_BODY:.*]]
; CHECK-TF: [[VECTOR_BODY]]:
@@ -118,7 +112,7 @@ define i32 @alias_mask_read_after_write(ptr noalias %a, ptr %b, ptr %c, i64 %n)
; CHECK-TF-NEXT: [[TMP15:%.*]] = add <vscale x 4 x i32> [[TMP14]], [[WIDE_MASKED_LOAD3]]
; CHECK-TF-NEXT: [[TMP16]] = select <vscale x 4 x i1> [[TMP10]], <vscale x 4 x i32> [[TMP15]], <vscale x 4 x i32> [[VEC_PHI]]
; CHECK-TF-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[NUM_ACTIVE_LANES]]
-; CHECK-TF-NEXT: [[ACTIVE_LANE_MASK_NEXT]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX]], i64 [[TMP9]])
+; CHECK-TF-NEXT: [[ACTIVE_LANE_MASK_NEXT]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX_NEXT]], i64 [[N]])
; CHECK-TF-NEXT: [[TMP17:%.*]] = extractelement <vscale x 4 x i1> [[ACTIVE_LANE_MASK_NEXT]], i32 0
; CHECK-TF-NEXT: [[TMP18:%.*]] = xor i1 [[TMP17]], true
; CHECK-TF-NEXT: br i1 [[TMP18]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
@@ -181,9 +175,6 @@ define void @alias_mask_multiple(ptr %a, ptr %b, ptr %c, i64 %n) {
; CHECK-TF-NEXT: [[TMP10:%.*]] = or i1 [[VF_IS_SCALAR]], [[VF_STEP_OVERFLOW]]
; CHECK-TF-NEXT: br i1 [[TMP10]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
; CHECK-TF: [[VECTOR_PH]]:
-; CHECK-TF-NEXT: [[TMP11:%.*]] = sub i64 [[N]], [[NUM_ACTIVE_LANES]]
-; CHECK-TF-NEXT: [[TMP12:%.*]] = icmp ugt i64 [[N]], [[NUM_ACTIVE_LANES]]
-; CHECK-TF-NEXT: [[TMP13:%.*]] = select i1 [[TMP12]], i64 [[TMP11]], i64 0
; CHECK-TF-NEXT: [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 [[N]])
; CHECK-TF-NEXT: br label %[[VECTOR_BODY:.*]]
; CHECK-TF: [[VECTOR_BODY]]:
@@ -198,7 +189,7 @@ define void @alias_mask_multiple(ptr %a, ptr %b, ptr %c, i64 %n) {
; CHECK-TF-NEXT: [[TMP18:%.*]] = getelementptr inbounds i8, ptr [[C]], i64 [[INDEX]]
; CHECK-TF-NEXT: call void @llvm.masked.store.nxv16i8.p0(<vscale x 16 x i8> [[TMP17]], ptr align 1 [[TMP18]], <vscale x 16 x i1> [[TMP14]])
; CHECK-TF-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[NUM_ACTIVE_LANES]]
-; CHECK-TF-NEXT: [[ACTIVE_LANE_MASK_NEXT]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 [[INDEX]], i64 [[TMP13]])
+; CHECK-TF-NEXT: [[ACTIVE_LANE_MASK_NEXT]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 [[INDEX_NEXT]], i64 [[N]])
; CHECK-TF-NEXT: [[TMP19:%.*]] = extractelement <vscale x 16 x i1> [[ACTIVE_LANE_MASK_NEXT]], i32 0
; CHECK-TF-NEXT: [[TMP20:%.*]] = xor i1 [[TMP19]], true
; CHECK-TF-NEXT: br i1 [[TMP20]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP6:![0-9]+]]
@@ -254,9 +245,6 @@ define i8 @alias_masking_exit_value(ptr %ptrA, ptr %ptrB) {
; CHECK-TF-NEXT: [[TMP8:%.*]] = or i1 [[VF_IS_SCALAR]], [[VF_STEP_OVERFLOW]]
; CHECK-TF-NEXT: br i1 [[TMP8]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
; CHECK-TF: [[VECTOR_PH]]:
-; CHECK-TF-NEXT: [[TMP9:%.*]] = sub i32 1000, [[TMP5]]
-; CHECK-TF-NEXT: [[TMP10:%.*]] = icmp ugt i32 1000, [[TMP5]]
-; CHECK-TF-NEXT: [[TMP11:%.*]] = select i1 [[TMP10]], i32 [[TMP9]], i32 0
; CHECK-TF-NEXT: [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i32(i32 0, i32 1000)
; CHECK-TF-NEXT: [[TMP12:%.*]] = call <vscale x 16 x i8> @llvm.stepvector.nxv16i8()
; CHECK-TF-NEXT: br label %[[VECTOR_BODY:.*]]
@@ -271,7 +259,7 @@ define i8 @alias_masking_exit_value(ptr %ptrA, ptr %ptrB) {
; CHECK-TF-NEXT: [[TMP16:%.*]] = add <vscale x 16 x i8> [[VEC_IND]], [[WIDE_MASKED_LOAD]]
; CHECK-TF-NEXT: call void @llvm.masked.store.nxv16i8.p0(<vscale x 16 x i8> [[TMP16]], ptr align 1 [[TMP15]], <vscale x 16 x i1> [[TMP13]])
; CHECK-TF-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], [[TMP5]]
-; CHECK-TF-NEXT: [[ACTIVE_LANE_MASK_NEXT]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i32(i32 [[INDEX]], i32 [[TMP11]])
+; CHECK-TF-NEXT: [[ACTIVE_LANE_MASK_NEXT]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i32(i32 [[INDEX_NEXT]], i32 1000)
; CHECK-TF-NEXT: [[TMP17:%.*]] = extractelement <vscale x 16 x i1> [[ACTIVE_LANE_MASK_NEXT]], i32 0
; CHECK-TF-NEXT: [[TMP18:%.*]] = xor i1 [[TMP17]], true
; CHECK-TF-NEXT: [[VEC_IND_NEXT]] = add <vscale x 16 x i8> [[VEC_IND]], [[BROADCAST_SPLAT]]
@@ -322,9 +310,6 @@ define void @alias_mask_reverse_iterate(ptr noalias %ptrA, ptr %ptrB, ptr %ptrC,
; CHECK-TF: [[VECTOR_PH]]:
; CHECK-TF-NEXT: [[TMP3:%.*]] = call i64 @llvm.vscale.i64()
; CHECK-TF-NEXT: [[TMP4:%.*]] = shl nuw i64 [[TMP3]], 4
-; CHECK-TF-NEXT: [[TMP5:%.*]] = sub i64 [[IV_START]], [[TMP4]]
-; CHECK-TF-NEXT: [[TMP6:%.*]] = icmp ugt i64 [[IV_START]], [[TMP4]]
-; CHECK-TF-NEXT: [[TMP7:%.*]] = select i1 [[TMP6]], i64 [[TMP5]], i64 0
; CHECK-TF-NEXT: [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 [[IV_START]])
; CHECK-TF-NEXT: br label %[[VECTOR_BODY:.*]]
; CHECK-TF: [[VECTOR_BODY]]:
@@ -350,7 +335,7 @@ define void @alias_mask_reverse_iterate(ptr noalias %ptrA, ptr %ptrB, ptr %ptrC,
; CHECK-TF-NEXT: [[REVERSE8:%.*]] = call <vscale x 16 x i1> @llvm.vector.reverse.nxv16i1(<vscale x 16 x i1> [[ACTIVE_LANE_MASK]])
; CHECK-TF-NEXT: call void @llvm.masked.store.nxv16i8.p0(<vscale x 16 x i8> [[REVERSE7]], ptr align 1 [[TMP16]], <vscale x 16 x i1> [[REVERSE8]])
; CHECK-TF-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP4]]
-; CHECK-TF-NEXT: [[ACTIVE_LANE_MASK_NEXT]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 [[INDEX]], i64 [[TMP7]])
+; CHECK-TF-NEXT: [[ACTIVE_LANE_MASK_NEXT]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 [[INDEX_NEXT]], i64 [[IV_START]])
; CHECK-TF-NEXT: [[TMP17:%.*]] = extractelement <vscale x 16 x i1> [[ACTIVE_LANE_MASK_NEXT]], i32 0
; CHECK-TF-NEXT: [[TMP18:%.*]] = xor i1 [[TMP17]], true
; CHECK-TF-NEXT: br i1 [[TMP18]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP10:![0-9]+]]
@@ -406,9 +391,6 @@ define i32 @recurrence_1(ptr nocapture readonly %a, ptr nocapture %b, i32 %n) {
; CHECK-TF: [[VECTOR_PH]]:
; CHECK-TF-NEXT: [[TMP8:%.*]] = call i64 @llvm.vscale.i64()
; CHECK-TF-NEXT: [[TMP9:%.*]] = shl nuw i64 [[TMP8]], 2
-; CHECK-TF-NEXT: [[TMP10:%.*]] = sub i64 [[TMP2]], [[TMP9]]
-; CHECK-TF-NEXT: [[TMP11:%.*]] = icmp ugt i64 [[TMP2]], [[TMP9]]
-; CHECK-TF-NEXT: [[TMP12:%.*]] = select i1 [[TMP11]], i64 [[TMP10]], i64 0
; CHECK-TF-NEXT: [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[TMP2]])
; CHECK-TF-NEXT: [[TMP13:%.*]] = call i32 @llvm.vscale.i32()
; CHECK-TF-NEXT: [[TMP14:%.*]] = mul nuw i32 [[TMP13]], 4
@@ -427,7 +409,7 @@ define i32 @recurrence_1(ptr nocapture readonly %a, ptr nocapture %b, i32 %n) {
; CHECK-TF-NEXT: [[TMP20:%.*]] = add <vscale x 4 x i32> [[WIDE_MASKED_LOAD]], [[TMP18]]
; CHECK-TF-NEXT: call void @llvm.masked.store.nxv4i32.p0(<vscale x 4 x i32> [[TMP20]], ptr align 4 [[TMP19]], <vscale x 4 x i1> [[ACTIVE_LANE_MASK]])
; CHECK-TF-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP9]]
-; CHECK-TF-NEXT: [[ACTIVE_LANE_MASK_NEXT]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX]], i64 [[TMP12]])
+; CHECK-TF-NEXT: [[ACTIVE_LANE_MASK_NEXT]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX_NEXT]], i64 [[TMP2]])
; CHECK-TF-NEXT: [[TMP21:%.*]] = extractelement <vscale x 4 x i1> [[ACTIVE_LANE_MASK_NEXT]], i32 0
; CHECK-TF-NEXT: [[TMP22:%.*]] = xor i1 [[TMP21]], true
; CHECK-TF-NEXT: br i1 [[TMP22]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP12:![0-9]+]]
diff --git a/llvm/test/Transforms/LoopVectorize/VPlan/AArch64/vplan-printing-alias-mask.ll b/llvm/test/Transforms/LoopVectorize/VPlan/AArch64/vplan-printing-alias-mask.ll
index a12226b316a66..eb884af700b83 100644
--- a/llvm/test/Transforms/LoopVectorize/VPlan/AArch64/vplan-printing-alias-mask.ll
+++ b/llvm/test/Transforms/LoopVectorize/VPlan/AArch64/vplan-printing-alias-mask.ll
@@ -27,25 +27,24 @@ define void @alias_mask(ptr noalias %a, ptr %b, ptr %c, i64 %n) {
; FINAL-NEXT: Successor(s): ir-bb<scalar.ph>, vector.ph
; FINAL-EMPTY:
; FINAL-NEXT: vector.ph:
-; FINAL-NEXT: EMIT vp<[[VP9:%[0-9]+]]> = TC > VF ? TC - VF : 0 ir<%n>, vp<[[VP5]]>
; FINAL-NEXT: EMIT vp<%active.lane.mask.entry> = active lane mask ir<0>, ir<%n>, ir<1>
; FINAL-NEXT: Successor(s): vector.body
; FINAL-EMPTY:
; FINAL-NEXT: vector.body:
; FINAL-NEXT: EMIT-SCALAR vp<%index> = phi [ ir<0>, vector.ph ], [ vp<%index.next>, vector.body ]
-; FINAL-NEXT: ACTIVE-LANE-MASK-PHI vp<[[VP10:%[0-9]+]]> = phi vp<%active.lane.mask.entry>, vp<%active.lane.mask.next>
-; FINAL-NEXT: EMIT vp<[[VP11:%[0-9]+]]> = and vp<[[VP10]]>, vp<[[VP4]]>
+; FINAL-NEXT: ACTIVE-LANE-MASK-PHI vp<[[VP9:%[0-9]+]]> = phi vp<%active.lane.mask.entry>, vp<%active.lane.mask.next>
+; FINAL-NEXT: EMIT vp<[[VP10:%[0-9]+]]> = and vp<[[VP9]]>, vp<[[VP4]]>
; FINAL-NEXT: CLONE ir<%ptr.a> = getelementptr inbounds ir<%a>, vp<%index>
-; FINAL-NEXT: WIDEN ir<%ld.a> = load ir<%ptr.a>, vp<[[VP11]]>
+; FINAL-NEXT: WIDEN ir<%ld.a> = load ir<%ptr.a>, vp<[[VP10]]>
; FINAL-NEXT: CLONE ir<%ptr.b> = getelementptr inbounds ir<%b>, vp<%index>
-; FINAL-NEXT: WIDEN ir<%ld.b> = load ir<%ptr.b>, vp<[[VP11]]>
+; FINAL-NEXT: WIDEN ir<%ld.b> = load ir<%ptr.b>, vp<[[VP10]]>
; FINAL-NEXT: WIDEN ir<%add> = add ir<%ld.b>, ir<%ld.a>
; FINAL-NEXT: CLONE ir<%ptr.c> = getelementptr inbounds ir<%c>, vp<%index>
-; FINAL-NEXT: WIDEN store ir<%ptr.c>, ir<%add>, vp<[[VP11]]>
+; FINAL-NEXT: WIDEN store ir<%ptr.c>, ir<%add>, vp<[[VP10]]>
; FINAL-NEXT: EMIT vp<%index.next> = add vp<%index>, vp<[[VP5]]>
-; FINAL-NEXT: EMIT vp<%active.lane.mask.next> = active lane mask vp<%index>, vp<[[VP9]]>, ir<1>
-; FINAL-NEXT: EMIT vp<[[VP12:%[0-9]+]]> = not vp<%active.lane.mask.next>
-; FINAL-NEXT: EMIT branch-on-cond vp<[[VP12]]>
+; FINAL-NEXT: EMIT vp<%active.lane.mask.next> = active lane mask vp<%index.next>, ir<%n>, ir<1>
+; FINAL-NEXT: EMIT vp<[[VP11:%[0-9]+]]> = not vp<%active.lane.mask.next>
+; FINAL-NEXT: EMIT branch-on-cond vp<[[VP11]]>
; FINAL-NEXT: Successor(s): middle.block, vector.body
; FINAL-EMPTY:
; FINAL-NEXT: middle.block: