[llvm] d65cc85 - [SLP]Do not schedule instructions with constants/argument/phi operands and external users.
Philip Reames via llvm-commits
llvm-commits at lists.llvm.org
Sat Mar 19 10:07:12 PDT 2022
Ok, this is definitely wrong. But so is the existing code. I plan on
fixing the generic case shortly, but I'm going to leave your special
case to you to fix or revert. I don't understand the invariants of this
patch enough to be comfortable making a fix.
Here's a test case for the special case you added (also committed in
bdbcca61):
; Variant of test10 with block-invariant operands to the udivs
; FIXME: This is wrong, we're hoisting a faulting udiv above an infinite loop.
define void @test11(i64 %x, i64 %y, i64* %b, i64* %c) {
; CHECK-LABEL: @test11(
; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x i64> poison, i64
[[X:%.*]], i32 0
; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x i64> [[TMP1]], i64
[[Y:%.*]], i32 1
; CHECK-NEXT: [[TMP3:%.*]] = udiv <2 x i64> <i64 200, i64 200>, [[TMP2]]
; CHECK-NEXT: [[TMP4:%.*]] = extractelement <2 x i64> [[TMP3]], i32 0
; CHECK-NEXT: store i64 [[TMP4]], i64* [[B:%.*]], align 4
; CHECK-NEXT: [[TMP5:%.*]] = call i64 @may_inf_loop_ro()
; CHECK-NEXT: [[CA2:%.*]] = getelementptr i64, i64* [[C:%.*]], i32 1
; CHECK-NEXT: [[TMP6:%.*]] = bitcast i64* [[C]] to <2 x i64>*
; CHECK-NEXT: [[TMP7:%.*]] = load <2 x i64>, <2 x i64>* [[TMP6]], align 4
; CHECK-NEXT: [[TMP8:%.*]] = add <2 x i64> [[TMP3]], [[TMP7]]
; CHECK-NEXT: [[B2:%.*]] = getelementptr i64, i64* [[B]], i32 1
; CHECK-NEXT: [[TMP9:%.*]] = bitcast i64* [[B]] to <2 x i64>*
; CHECK-NEXT: store <2 x i64> [[TMP8]], <2 x i64>* [[TMP9]], align 4
; CHECK-NEXT: ret void
;
%u1 = udiv i64 200, %x
store i64 %u1, i64* %b
call i64 @may_inf_loop_ro()
%u2 = udiv i64 200, %y
%c1 = load i64, i64* %c
%ca2 = getelementptr i64, i64* %c, i32 1
%c2 = load i64, i64* %ca2
%add1 = add i64 %u1, %c1
%add2 = add i64 %u2, %c2
store i64 %add1, i64* %b
%b2 = getelementptr i64, i64* %b, i32 1
store i64 %add2, i64* %b2
ret void
}
On 3/18/22 13:27, Philip Reames wrote:
> I added a comment to the existing code in 1093949cf which more fully
> explains the missing dependency and hidden assumption.
>
> I am not 100% sure your code has the same problem. I'd suggest
> exploring combinations such as a potentially faulting udiv following a
> readnone infinite loop call with block-invariant operands. I don't
> have a particular test case for you because massaging the code into a
> form that actually reorders is quite involved. I tried, but did not manage to
> create one with a few minutes of trying.
>
> Philip
>
> On 3/18/22 10:26, Philip Reames via llvm-commits wrote:
>> FYI, I'm pretty sure this patch is wrong. The case which I believe it
>> gets wrong involves a bundle containing a readonly call which is not
>> guaranteed to return (i.e., it may contain an infinite loop). If I'm
>> reading the code correctly, it may reorder such a call earlier in the
>> basic block - including reordering of two such calls in the process.
>>
>> This is the same bug which existed in D118538 which is why I noticed it.
>>
>> If this case isn't possible for some reason, please add test coverage
>> and clarify comments as to why.
>>
>> Philip
>>
>> On 3/17/22 11:04, Alexey Bataev via llvm-commits wrote:
>>> Author: Alexey Bataev
>>> Date: 2022-03-17T11:03:45-07:00
>>> New Revision: d65cc8597792ab04142cd2214c46c5c167191bcd
>>>
>>> URL:
>>> https://github.com/llvm/llvm-project/commit/d65cc8597792ab04142cd2214c46c5c167191bcd
>>> DIFF:
>>> https://github.com/llvm/llvm-project/commit/d65cc8597792ab04142cd2214c46c5c167191bcd.diff
>>>
>>> LOG: [SLP]Do not schedule instructions with constants/argument/phi
>>> operands and external users.
>>>
>>> No need to schedule entry nodes whose instructions are not memory
>>> read/write instructions and whose operands are either constants,
>>> arguments, phis, or instructions from other blocks, or whose users
>>> are phis or in other blocks.
>>> The resulting vector instructions can be placed at the beginning of
>>> the basic block without scheduling (if the operands do not need to
>>> be scheduled) or at the end of the block (if the users are outside
>>> of the block).
>>> This may save some compile time and scheduling resources.
>>>
>>> Differential Revision: https://reviews.llvm.org/D121121
>>>
>>> Added:
>>>
>>> Modified:
>>> llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
>>> llvm/test/Transforms/SLPVectorizer/AArch64/gather-reduce.ll
>>> llvm/test/Transforms/SLPVectorizer/AArch64/gather-root.ll
>>> llvm/test/Transforms/SLPVectorizer/AArch64/spillcost-di.ll
>>> llvm/test/Transforms/SLPVectorizer/X86/PR35628_2.ll
>>> llvm/test/Transforms/SLPVectorizer/X86/PR40310.ll
>>> llvm/test/Transforms/SLPVectorizer/X86/crash_exceed_scheduling.ll
>>> llvm/test/Transforms/SLPVectorizer/X86/opaque-ptr.ll
>>> llvm/test/Transforms/SLPVectorizer/X86/pr47629-inseltpoison.ll
>>> llvm/test/Transforms/SLPVectorizer/X86/pr47629.ll
>>> llvm/test/Transforms/SLPVectorizer/X86/shrink_after_reorder2.ll
>>> llvm/test/Transforms/SLPVectorizer/X86/vectorize-widest-phis.ll
>>> llvm/test/Transforms/SLPVectorizer/slp-max-phi-size.ll
>>>
>>> Removed:
>>>
>>>
>>> ################################################################################
>>>
>>> diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
>>> b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
>>> index 48382a12fcf3c..9ab31198adaab 100644
>>> --- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
>>> +++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
>>> @@ -776,6 +776,57 @@ static void
>>> reorderScalars(SmallVectorImpl<Value *> &Scalars,
>>> Scalars[Mask[I]] = Prev[I];
>>> }
>>> +/// Checks if the provided value does not require scheduling. It
>>> does not
>>> +/// require scheduling if this is not an instruction or it is an
>>> instruction
>>> +/// that does not read/write memory and all operands are either not
>>> instructions
>>> +/// or phi nodes or instructions from different blocks.
>>> +static bool areAllOperandsNonInsts(Value *V) {
>>> + auto *I = dyn_cast<Instruction>(V);
>>> + if (!I)
>>> + return true;
>>> + return !I->mayReadOrWriteMemory() && all_of(I->operands(),
>>> [I](Value *V) {
>>> + auto *IO = dyn_cast<Instruction>(V);
>>> + if (!IO)
>>> + return true;
>>> + return isa<PHINode>(IO) || IO->getParent() != I->getParent();
>>> + });
>>> +}
>>> +
>>> +/// Checks if the provided value does not require scheduling. It
>>> does not
>>> +/// require scheduling if this is not an instruction or it is an
>>> instruction
>>> +/// that does not read/write memory and all users are phi nodes or
>>> instructions
>>> +/// from the different blocks.
>>> +static bool isUsedOutsideBlock(Value *V) {
>>> + auto *I = dyn_cast<Instruction>(V);
>>> + if (!I)
>>> + return true;
>>> + // Limits the number of uses to save compile time.
>>> + constexpr int UsesLimit = 8;
>>> + return !I->mayReadOrWriteMemory() &&
>>> !I->hasNUsesOrMore(UsesLimit) &&
>>> + all_of(I->users(), [I](User *U) {
>>> + auto *IU = dyn_cast<Instruction>(U);
>>> + if (!IU)
>>> + return true;
>>> + return IU->getParent() != I->getParent() ||
>>> isa<PHINode>(IU);
>>> + });
>>> +}
>>> +
>>> +/// Checks if the specified value does not require scheduling. It
>>> does not
>>> +/// require scheduling if all operands and all users do not need to
>>> be scheduled
>>> +/// in the current basic block.
>>> +static bool doesNotNeedToBeScheduled(Value *V) {
>>> + return areAllOperandsNonInsts(V) && isUsedOutsideBlock(V);
>>> +}
>>> +
>>> +/// Checks if the specified array of instructions does not require
>>> scheduling.
>>> +/// It is so if all either instructions have operands that do not
>>> require
>>> +/// scheduling or their users do not require scheduling since they
>>> are phis or
>>> +/// in other basic blocks.
>>> +static bool doesNotNeedToSchedule(ArrayRef<Value *> VL) {
>>> + return !VL.empty() &&
>>> + (all_of(VL, isUsedOutsideBlock) || all_of(VL,
>>> areAllOperandsNonInsts));
>>> +}
>>> +
>>> namespace slpvectorizer {
>>> /// Bottom Up SLP Vectorizer.
>>> @@ -2359,15 +2410,21 @@ class BoUpSLP {
>>> ScalarToTreeEntry[V] = Last;
>>> }
>>> // Update the scheduler bundle to point to this TreeEntry.
>>> - unsigned Lane = 0;
>>> - for (ScheduleData *BundleMember = Bundle.getValue();
>>> BundleMember;
>>> - BundleMember = BundleMember->NextInBundle) {
>>> - BundleMember->TE = Last;
>>> - BundleMember->Lane = Lane;
>>> - ++Lane;
>>> - }
>>> - assert((!Bundle.getValue() || Lane == VL.size()) &&
>>> + ScheduleData *BundleMember = Bundle.getValue();
>>> + assert((BundleMember || isa<PHINode>(S.MainOp) ||
>>> + isVectorLikeInstWithConstOps(S.MainOp) ||
>>> + doesNotNeedToSchedule(VL)) &&
>>> "Bundle and VL out of sync");
>>> + if (BundleMember) {
>>> + for (Value *V : VL) {
>>> + if (doesNotNeedToBeScheduled(V))
>>> + continue;
>>> + assert(BundleMember && "Unexpected end of bundle.");
>>> + BundleMember->TE = Last;
>>> + BundleMember = BundleMember->NextInBundle;
>>> + }
>>> + }
>>> + assert(!BundleMember && "Bundle and VL out of sync");
>>> } else {
>>> MustGather.insert(VL.begin(), VL.end());
>>> }
>>> @@ -2504,7 +2561,6 @@ class BoUpSLP {
>>> clearDependencies();
>>> OpValue = OpVal;
>>> TE = nullptr;
>>> - Lane = -1;
>>> }
>>> /// Verify basic self consistency properties
>>> @@ -2544,7 +2600,7 @@ class BoUpSLP {
>>> /// Returns true if it represents an instruction bundle and
>>> not only a
>>> /// single instruction.
>>> bool isPartOfBundle() const {
>>> - return NextInBundle != nullptr || FirstInBundle != this;
>>> + return NextInBundle != nullptr || FirstInBundle != this || TE;
>>> }
>>> /// Returns true if it is ready for scheduling, i.e. it has
>>> no more
>>> @@ -2649,9 +2705,6 @@ class BoUpSLP {
>>> /// Note that this is negative as long as Dependencies is not
>>> calculated.
>>> int UnscheduledDeps = InvalidDeps;
>>> - /// The lane of this node in the TreeEntry.
>>> - int Lane = -1;
>>> -
>>> /// True if this instruction is scheduled (or considered as
>>> scheduled in the
>>> /// dry-run).
>>> bool IsScheduled = false;
>>> @@ -2669,6 +2722,21 @@ class BoUpSLP {
>>> friend struct DOTGraphTraits<BoUpSLP *>;
>>> /// Contains all scheduling data for a basic block.
>>> + /// It does not schedule instructions, which are not memory
>>> read/write
>>> + /// instructions and their operands are either constants, or
>>> arguments, or
>>> + /// phis, or instructions from others blocks, or their users are
>>> phis or from
>>> + /// the other blocks. The resulting vector instructions can be
>>> placed at the
>>> + /// beginning of the basic block without scheduling (if operands
>>> does not need
>>> + /// to be scheduled) or at the end of the block (if users are
>>> outside of the
>>> + /// block). It allows to save some compile time and memory used
>>> by the
>>> + /// compiler.
>>> + /// ScheduleData is assigned for each instruction in between the
>>> boundaries of
>>> + /// the tree entry, even for those, which are not part of the
>>> graph. It is
>>> + /// required to correctly follow the dependencies between the
>>> instructions and
>>> + /// their correct scheduling. The ScheduleData is not allocated
>>> for the
>>> + /// instructions, which do not require scheduling, like phis,
>>> nodes with
>>> + /// extractelements/insertelements only or nodes with
>>> instructions, with
>>> + /// uses/operands outside of the block.
>>> struct BlockScheduling {
>>> BlockScheduling(BasicBlock *BB)
>>> : BB(BB), ChunkSize(BB->size()), ChunkPos(ChunkSize) {}
>>> @@ -2696,7 +2764,7 @@ class BoUpSLP {
>>> if (BB != I->getParent())
>>> // Avoid lookup if can't possibly be in map.
>>> return nullptr;
>>> - ScheduleData *SD = ScheduleDataMap[I];
>>> + ScheduleData *SD = ScheduleDataMap.lookup(I);
>>> if (SD && isInSchedulingRegion(SD))
>>> return SD;
>>> return nullptr;
>>> @@ -2713,7 +2781,7 @@ class BoUpSLP {
>>> return getScheduleData(V);
>>> auto I = ExtraScheduleDataMap.find(V);
>>> if (I != ExtraScheduleDataMap.end()) {
>>> - ScheduleData *SD = I->second[Key];
>>> + ScheduleData *SD = I->second.lookup(Key);
>>> if (SD && isInSchedulingRegion(SD))
>>> return SD;
>>> }
>>> @@ -2735,7 +2803,7 @@ class BoUpSLP {
>>> BundleMember = BundleMember->NextInBundle) {
>>> if (BundleMember->Inst != BundleMember->OpValue)
>>> continue;
>>> -
>>> +
>>> // Handle the def-use chain dependencies.
>>> // Decrement the unscheduled counter and insert to ready
>>> list if ready.
>>> @@ -2760,7 +2828,9 @@ class BoUpSLP {
>>> // reordered during buildTree(). We therefore need to get
>>> its operands
>>> // through the TreeEntry.
>>> if (TreeEntry *TE = BundleMember->TE) {
>>> - int Lane = BundleMember->Lane;
>>> + // Need to search for the lane since the tree entry can
>>> be reordered.
>>> + int Lane = std::distance(TE->Scalars.begin(),
>>> + find(TE->Scalars,
>>> BundleMember->Inst));
>>> assert(Lane >= 0 && "Lane not set");
>>> // Since vectorization tree is being built recursively
>>> this assertion
>>> @@ -2769,7 +2839,7 @@ class BoUpSLP {
>>> // where their second (immediate) operand is not added.
>>> Since
>>> // immediates do not affect scheduler behavior this is
>>> considered
>>> // okay.
>>> - auto *In = TE->getMainOp();
>>> + auto *In = BundleMember->Inst;
>>> assert(In &&
>>> (isa<ExtractValueInst>(In) ||
>>> isa<ExtractElementInst>(In) ||
>>> In->getNumOperands() == TE->getNumOperands()) &&
>>> @@ -2814,7 +2884,8 @@ class BoUpSLP {
>>> for (auto *I = ScheduleStart; I != ScheduleEnd; I =
>>> I->getNextNode()) {
>>> auto *SD = getScheduleData(I);
>>> - assert(SD && "primary scheduledata must exist in window");
>>> + if (!SD)
>>> + continue;
>>> assert(isInSchedulingRegion(SD) &&
>>> "primary schedule data not in window?");
>>> assert(isInSchedulingRegion(SD->FirstInBundle) &&
>>> @@ -3856,6 +3927,22 @@ static LoadsState
>>> canVectorizeLoads(ArrayRef<Value *> VL, const Value *VL0,
>>> return LoadsState::Gather;
>>> }
>>> +/// \return true if the specified list of values has only one
>>> instruction that
>>> +/// requires scheduling, false otherwise.
>>> +static bool needToScheduleSingleInstruction(ArrayRef<Value *> VL) {
>>> + Value *NeedsScheduling = nullptr;
>>> + for (Value *V : VL) {
>>> + if (doesNotNeedToBeScheduled(V))
>>> + continue;
>>> + if (!NeedsScheduling) {
>>> + NeedsScheduling = V;
>>> + continue;
>>> + }
>>> + return false;
>>> + }
>>> + return NeedsScheduling;
>>> +}
>>> +
>>> void BoUpSLP::buildTree_rec(ArrayRef<Value *> VL, unsigned Depth,
>>> const EdgeInfo &UserTreeIdx) {
>>> assert((allConstant(VL) || allSameType(VL)) && "Invalid types!");
>>> @@ -6396,6 +6483,44 @@ void BoUpSLP::setInsertPointAfterBundle(const
>>> TreeEntry *E) {
>>> return !E->isOpcodeOrAlt(I) || I->getParent() == BB;
>>> }));
>>> + auto &&FindLastInst = [E, Front]() {
>>> + Instruction *LastInst = Front;
>>> + for (Value *V : E->Scalars) {
>>> + auto *I = dyn_cast<Instruction>(V);
>>> + if (!I)
>>> + continue;
>>> + if (LastInst->comesBefore(I))
>>> + LastInst = I;
>>> + }
>>> + return LastInst;
>>> + };
>>> +
>>> + auto &&FindFirstInst = [E, Front]() {
>>> + Instruction *FirstInst = Front;
>>> + for (Value *V : E->Scalars) {
>>> + auto *I = dyn_cast<Instruction>(V);
>>> + if (!I)
>>> + continue;
>>> + if (I->comesBefore(FirstInst))
>>> + FirstInst = I;
>>> + }
>>> + return FirstInst;
>>> + };
>>> +
>>> + // Set the insert point to the beginning of the basic block if
>>> the entry
>>> + // should not be scheduled.
>>> + if (E->State != TreeEntry::NeedToGather &&
>>> + doesNotNeedToSchedule(E->Scalars)) {
>>> + BasicBlock::iterator InsertPt;
>>> + if (all_of(E->Scalars, isUsedOutsideBlock))
>>> + InsertPt = FindLastInst()->getIterator();
>>> + else
>>> + InsertPt = FindFirstInst()->getIterator();
>>> + Builder.SetInsertPoint(BB, InsertPt);
>>> + Builder.SetCurrentDebugLocation(Front->getDebugLoc());
>>> + return;
>>> + }
>>> +
>>> // The last instruction in the bundle in program order.
>>> Instruction *LastInst = nullptr;
>>> @@ -6404,8 +6529,10 @@ void
>>> BoUpSLP::setInsertPointAfterBundle(const TreeEntry *E) {
>>> // VL.back() and iterate over schedule data until we reach the
>>> end of the
>>> // bundle. The end of the bundle is marked by null ScheduleData.
>>> if (BlocksSchedules.count(BB)) {
>>> - auto *Bundle =
>>> - BlocksSchedules[BB]->getScheduleData(E->isOneOf(E->Scalars.back()));
>>> + Value *V = E->isOneOf(E->Scalars.back());
>>> + if (doesNotNeedToBeScheduled(V))
>>> + V = *find_if_not(E->Scalars, doesNotNeedToBeScheduled);
>>> + auto *Bundle = BlocksSchedules[BB]->getScheduleData(V);
>>> if (Bundle && Bundle->isPartOfBundle())
>>> for (; Bundle; Bundle = Bundle->NextInBundle)
>>> if (Bundle->OpValue == Bundle->Inst)
>>> @@ -6430,15 +6557,8 @@ void BoUpSLP::setInsertPointAfterBundle(const
>>> TreeEntry *E) {
>>> // not ideal. However, this should be exceedingly rare since it
>>> requires that
>>> // we both exit early from buildTree_rec and that the bundle be
>>> out-of-order
>>> // (causing us to iterate all the way to the end of the block).
>>> - if (!LastInst) {
>>> - SmallPtrSet<Value *, 16> Bundle(E->Scalars.begin(),
>>> E->Scalars.end());
>>> - for (auto &I : make_range(BasicBlock::iterator(Front),
>>> BB->end())) {
>>> - if (Bundle.erase(&I) && E->isOpcodeOrAlt(&I))
>>> - LastInst = &I;
>>> - if (Bundle.empty())
>>> - break;
>>> - }
>>> - }
>>> + if (!LastInst)
>>> + LastInst = FindLastInst();
>>> assert(LastInst && "Failed to find last instruction in bundle");
>>> // Set the insertion point after the last instruction in the
>>> bundle. Set the
>>> @@ -7631,9 +7751,11 @@ void BoUpSLP::optimizeGatherSequence() {
>>> BoUpSLP::ScheduleData *
>>> BoUpSLP::BlockScheduling::buildBundle(ArrayRef<Value *> VL) {
>>> - ScheduleData *Bundle = nullptr;
>>> + ScheduleData *Bundle = nullptr;
>>> ScheduleData *PrevInBundle = nullptr;
>>> for (Value *V : VL) {
>>> + if (doesNotNeedToBeScheduled(V))
>>> + continue;
>>> ScheduleData *BundleMember = getScheduleData(V);
>>> assert(BundleMember &&
>>> "no ScheduleData for bundle member "
>>> @@ -7661,7 +7783,8 @@
>>> BoUpSLP::BlockScheduling::tryScheduleBundle(ArrayRef<Value *> VL,
>>> BoUpSLP *SLP,
>>> const
>>> InstructionsState &S) {
>>> // No need to schedule PHIs, insertelement, extractelement and
>>> extractvalue
>>> // instructions.
>>> - if (isa<PHINode>(S.OpValue) ||
>>> isVectorLikeInstWithConstOps(S.OpValue))
>>> + if (isa<PHINode>(S.OpValue) ||
>>> isVectorLikeInstWithConstOps(S.OpValue) ||
>>> + doesNotNeedToSchedule(VL))
>>> return nullptr;
>>> // Initialize the instruction bundle.
>>> @@ -7707,6 +7830,8 @@
>>> BoUpSLP::BlockScheduling::tryScheduleBundle(ArrayRef<Value *> VL,
>>> BoUpSLP *SLP,
>>> // Make sure that the scheduling region contains all
>>> // instructions of the bundle.
>>> for (Value *V : VL) {
>>> + if (doesNotNeedToBeScheduled(V))
>>> + continue;
>>> if (!extendSchedulingRegion(V, S)) {
>>> // If the scheduling region got new instructions at the
>>> lower end (or it
>>> // is a new region for the first bundle). This makes it
>>> necessary to
>>> @@ -7721,6 +7846,8 @@
>>> BoUpSLP::BlockScheduling::tryScheduleBundle(ArrayRef<Value *> VL,
>>> BoUpSLP *SLP,
>>> bool ReSchedule = false;
>>> for (Value *V : VL) {
>>> + if (doesNotNeedToBeScheduled(V))
>>> + continue;
>>> ScheduleData *BundleMember = getScheduleData(V);
>>> assert(BundleMember &&
>>> "no ScheduleData for bundle member (maybe not in same
>>> basic block)");
>>> @@ -7750,14 +7877,18 @@
>>> BoUpSLP::BlockScheduling::tryScheduleBundle(ArrayRef<Value *> VL,
>>> BoUpSLP *SLP,
>>> void BoUpSLP::BlockScheduling::cancelScheduling(ArrayRef<Value
>>> *> VL,
>>> Value *OpValue) {
>>> - if (isa<PHINode>(OpValue) || isVectorLikeInstWithConstOps(OpValue))
>>> + if (isa<PHINode>(OpValue) ||
>>> isVectorLikeInstWithConstOps(OpValue) ||
>>> + doesNotNeedToSchedule(VL))
>>> return;
>>> + if (doesNotNeedToBeScheduled(OpValue))
>>> + OpValue = *find_if_not(VL, doesNotNeedToBeScheduled);
>>> ScheduleData *Bundle = getScheduleData(OpValue);
>>> LLVM_DEBUG(dbgs() << "SLP: cancel scheduling of " << *Bundle <<
>>> "\n");
>>> assert(!Bundle->IsScheduled &&
>>> "Can't cancel bundle which is already scheduled");
>>> - assert(Bundle->isSchedulingEntity() && Bundle->isPartOfBundle() &&
>>> + assert(Bundle->isSchedulingEntity() &&
>>> + (Bundle->isPartOfBundle() ||
>>> needToScheduleSingleInstruction(VL)) &&
>>> "tried to unbundle something which is not a bundle");
>>> // Remove the bundle from the ready list.
>>> @@ -7771,6 +7902,7 @@ void
>>> BoUpSLP::BlockScheduling::cancelScheduling(ArrayRef<Value *> VL,
>>> BundleMember->FirstInBundle = BundleMember;
>>> ScheduleData *Next = BundleMember->NextInBundle;
>>> BundleMember->NextInBundle = nullptr;
>>> + BundleMember->TE = nullptr;
>>> if (BundleMember->unscheduledDepsInBundle() == 0) {
>>> ReadyInsts.insert(BundleMember);
>>> }
>>> @@ -7794,6 +7926,7 @@ bool
>>> BoUpSLP::BlockScheduling::extendSchedulingRegion(Value *V,
>>> Instruction *I = dyn_cast<Instruction>(V);
>>> assert(I && "bundle member must be an instruction");
>>> assert(!isa<PHINode>(I) && !isVectorLikeInstWithConstOps(I) &&
>>> + !doesNotNeedToBeScheduled(I) &&
>>> "phi nodes/insertelements/extractelements/extractvalues
>>> don't need to "
>>> "be scheduled");
>>> auto &&CheckScheduleForI = [this, &S](Instruction *I) -> bool {
>>> @@ -7870,7 +8003,10 @@ void
>>> BoUpSLP::BlockScheduling::initScheduleData(Instruction *FromI,
>>> ScheduleData
>>> *NextLoadStore) {
>>> ScheduleData *CurrentLoadStore = PrevLoadStore;
>>> for (Instruction *I = FromI; I != ToI; I = I->getNextNode()) {
>>> - ScheduleData *SD = ScheduleDataMap[I];
>>> + // No need to allocate data for non-schedulable instructions.
>>> + if (doesNotNeedToBeScheduled(I))
>>> + continue;
>>> + ScheduleData *SD = ScheduleDataMap.lookup(I);
>>> if (!SD) {
>>> SD = allocateScheduleDataChunks();
>>> ScheduleDataMap[I] = SD;
>>> @@ -8054,8 +8190,10 @@ void BoUpSLP::scheduleBlock(BlockScheduling
>>> *BS) {
>>> for (auto *I = BS->ScheduleStart; I != BS->ScheduleEnd;
>>> I = I->getNextNode()) {
>>> BS->doForAllOpcodes(I, [this, &Idx, &NumToSchedule,
>>> BS](ScheduleData *SD) {
>>> + TreeEntry *SDTE = getTreeEntry(SD->Inst);
>>> assert((isVectorLikeInstWithConstOps(SD->Inst) ||
>>> - SD->isPartOfBundle() == (getTreeEntry(SD->Inst) !=
>>> nullptr)) &&
>>> + SD->isPartOfBundle() ==
>>> + (SDTE && !doesNotNeedToSchedule(SDTE->Scalars))) &&
>>> "scheduler and vectorizer bundle mismatch");
>>> SD->FirstInBundle->SchedulingPriority = Idx++;
>>> if (SD->isSchedulingEntity()) {
>>>
>>> diff --git
>>> a/llvm/test/Transforms/SLPVectorizer/AArch64/gather-reduce.ll
>>> b/llvm/test/Transforms/SLPVectorizer/AArch64/gather-reduce.ll
>>> index 536f72a73684e..ec7b03af83f8b 100644
>>> --- a/llvm/test/Transforms/SLPVectorizer/AArch64/gather-reduce.ll
>>> +++ b/llvm/test/Transforms/SLPVectorizer/AArch64/gather-reduce.ll
>>> @@ -36,6 +36,7 @@ define i32 @gather_reduce_8x16_i32(i16* nocapture
>>> readonly %a, i16* nocapture re
>>> ; GENERIC-NEXT: [[I_0103:%.*]] = phi i32 [ [[INC:%.*]],
>>> [[FOR_BODY]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
>>> ; GENERIC-NEXT: [[SUM_0102:%.*]] = phi i32 [ [[ADD66]],
>>> [[FOR_BODY]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
>>> ; GENERIC-NEXT: [[A_ADDR_0101:%.*]] = phi i16* [
>>> [[INCDEC_PTR58:%.*]], [[FOR_BODY]] ], [ [[A:%.*]],
>>> [[FOR_BODY_PREHEADER]] ]
>>> +; GENERIC-NEXT: [[INCDEC_PTR58]] = getelementptr inbounds i16,
>>> i16* [[A_ADDR_0101]], i64 8
>>> ; GENERIC-NEXT: [[TMP0:%.*]] = bitcast i16* [[A_ADDR_0101]] to
>>> <8 x i16>*
>>> ; GENERIC-NEXT: [[TMP1:%.*]] = load <8 x i16>, <8 x i16>*
>>> [[TMP0]], align 2
>>> ; GENERIC-NEXT: [[TMP2:%.*]] = zext <8 x i16> [[TMP1]] to <8 x
>>> i32>
>>> @@ -85,7 +86,6 @@ define i32 @gather_reduce_8x16_i32(i16* nocapture
>>> readonly %a, i16* nocapture re
>>> ; GENERIC-NEXT: [[TMP27:%.*]] = load i16, i16* [[ARRAYIDX55]],
>>> align 2
>>> ; GENERIC-NEXT: [[CONV56:%.*]] = zext i16 [[TMP27]] to i32
>>> ; GENERIC-NEXT: [[ADD57:%.*]] = add nsw i32 [[ADD48]], [[CONV56]]
>>> -; GENERIC-NEXT: [[INCDEC_PTR58]] = getelementptr inbounds i16,
>>> i16* [[A_ADDR_0101]], i64 8
>>> ; GENERIC-NEXT: [[TMP28:%.*]] = extractelement <8 x i32>
>>> [[TMP6]], i64 7
>>> ; GENERIC-NEXT: [[TMP29:%.*]] = sext i32 [[TMP28]] to i64
>>> ; GENERIC-NEXT: [[ARRAYIDX64:%.*]] = getelementptr inbounds
>>> i16, i16* [[G]], i64 [[TMP29]]
>>> @@ -111,6 +111,7 @@ define i32 @gather_reduce_8x16_i32(i16*
>>> nocapture readonly %a, i16* nocapture re
>>> ; KRYO-NEXT: [[I_0103:%.*]] = phi i32 [ [[INC:%.*]],
>>> [[FOR_BODY]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
>>> ; KRYO-NEXT: [[SUM_0102:%.*]] = phi i32 [ [[ADD66]],
>>> [[FOR_BODY]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
>>> ; KRYO-NEXT: [[A_ADDR_0101:%.*]] = phi i16* [
>>> [[INCDEC_PTR58:%.*]], [[FOR_BODY]] ], [ [[A:%.*]],
>>> [[FOR_BODY_PREHEADER]] ]
>>> +; KRYO-NEXT: [[INCDEC_PTR58]] = getelementptr inbounds i16, i16*
>>> [[A_ADDR_0101]], i64 8
>>> ; KRYO-NEXT: [[TMP0:%.*]] = bitcast i16* [[A_ADDR_0101]] to <8
>>> x i16>*
>>> ; KRYO-NEXT: [[TMP1:%.*]] = load <8 x i16>, <8 x i16>*
>>> [[TMP0]], align 2
>>> ; KRYO-NEXT: [[TMP2:%.*]] = zext <8 x i16> [[TMP1]] to <8 x i32>
>>> @@ -160,7 +161,6 @@ define i32 @gather_reduce_8x16_i32(i16*
>>> nocapture readonly %a, i16* nocapture re
>>> ; KRYO-NEXT: [[TMP27:%.*]] = load i16, i16* [[ARRAYIDX55]],
>>> align 2
>>> ; KRYO-NEXT: [[CONV56:%.*]] = zext i16 [[TMP27]] to i32
>>> ; KRYO-NEXT: [[ADD57:%.*]] = add nsw i32 [[ADD48]], [[CONV56]]
>>> -; KRYO-NEXT: [[INCDEC_PTR58]] = getelementptr inbounds i16, i16*
>>> [[A_ADDR_0101]], i64 8
>>> ; KRYO-NEXT: [[TMP28:%.*]] = extractelement <8 x i32> [[TMP6]],
>>> i64 7
>>> ; KRYO-NEXT: [[TMP29:%.*]] = sext i32 [[TMP28]] to i64
>>> ; KRYO-NEXT: [[ARRAYIDX64:%.*]] = getelementptr inbounds i16,
>>> i16* [[G]], i64 [[TMP29]]
>>> @@ -297,6 +297,7 @@ define i32 @gather_reduce_8x16_i64(i16*
>>> nocapture readonly %a, i16* nocapture re
>>> ; GENERIC-NEXT: [[I_0103:%.*]] = phi i32 [ [[INC:%.*]],
>>> [[FOR_BODY]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
>>> ; GENERIC-NEXT: [[SUM_0102:%.*]] = phi i32 [ [[ADD66]],
>>> [[FOR_BODY]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
>>> ; GENERIC-NEXT: [[A_ADDR_0101:%.*]] = phi i16* [
>>> [[INCDEC_PTR58:%.*]], [[FOR_BODY]] ], [ [[A:%.*]],
>>> [[FOR_BODY_PREHEADER]] ]
>>> +; GENERIC-NEXT: [[INCDEC_PTR58]] = getelementptr inbounds i16,
>>> i16* [[A_ADDR_0101]], i64 8
>>> ; GENERIC-NEXT: [[TMP0:%.*]] = bitcast i16* [[A_ADDR_0101]] to
>>> <8 x i16>*
>>> ; GENERIC-NEXT: [[TMP1:%.*]] = load <8 x i16>, <8 x i16>*
>>> [[TMP0]], align 2
>>> ; GENERIC-NEXT: [[TMP2:%.*]] = zext <8 x i16> [[TMP1]] to <8 x
>>> i32>
>>> @@ -346,7 +347,6 @@ define i32 @gather_reduce_8x16_i64(i16*
>>> nocapture readonly %a, i16* nocapture re
>>> ; GENERIC-NEXT: [[TMP27:%.*]] = load i16, i16* [[ARRAYIDX55]],
>>> align 2
>>> ; GENERIC-NEXT: [[CONV56:%.*]] = zext i16 [[TMP27]] to i32
>>> ; GENERIC-NEXT: [[ADD57:%.*]] = add nsw i32 [[ADD48]], [[CONV56]]
>>> -; GENERIC-NEXT: [[INCDEC_PTR58]] = getelementptr inbounds i16,
>>> i16* [[A_ADDR_0101]], i64 8
>>> ; GENERIC-NEXT: [[TMP28:%.*]] = extractelement <8 x i32>
>>> [[TMP6]], i64 7
>>> ; GENERIC-NEXT: [[TMP29:%.*]] = sext i32 [[TMP28]] to i64
>>> ; GENERIC-NEXT: [[ARRAYIDX64:%.*]] = getelementptr inbounds
>>> i16, i16* [[G]], i64 [[TMP29]]
>>> @@ -372,6 +372,7 @@ define i32 @gather_reduce_8x16_i64(i16*
>>> nocapture readonly %a, i16* nocapture re
>>> ; KRYO-NEXT: [[I_0103:%.*]] = phi i32 [ [[INC:%.*]],
>>> [[FOR_BODY]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
>>> ; KRYO-NEXT: [[SUM_0102:%.*]] = phi i32 [ [[ADD66]],
>>> [[FOR_BODY]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
>>> ; KRYO-NEXT: [[A_ADDR_0101:%.*]] = phi i16* [
>>> [[INCDEC_PTR58:%.*]], [[FOR_BODY]] ], [ [[A:%.*]],
>>> [[FOR_BODY_PREHEADER]] ]
>>> +; KRYO-NEXT: [[INCDEC_PTR58]] = getelementptr inbounds i16, i16*
>>> [[A_ADDR_0101]], i64 8
>>> ; KRYO-NEXT: [[TMP0:%.*]] = bitcast i16* [[A_ADDR_0101]] to <8
>>> x i16>*
>>> ; KRYO-NEXT: [[TMP1:%.*]] = load <8 x i16>, <8 x i16>*
>>> [[TMP0]], align 2
>>> ; KRYO-NEXT: [[TMP2:%.*]] = zext <8 x i16> [[TMP1]] to <8 x i32>
>>> @@ -421,7 +422,6 @@ define i32 @gather_reduce_8x16_i64(i16*
>>> nocapture readonly %a, i16* nocapture re
>>> ; KRYO-NEXT: [[TMP27:%.*]] = load i16, i16* [[ARRAYIDX55]],
>>> align 2
>>> ; KRYO-NEXT: [[CONV56:%.*]] = zext i16 [[TMP27]] to i32
>>> ; KRYO-NEXT: [[ADD57:%.*]] = add nsw i32 [[ADD48]], [[CONV56]]
>>> -; KRYO-NEXT: [[INCDEC_PTR58]] = getelementptr inbounds i16, i16*
>>> [[A_ADDR_0101]], i64 8
>>> ; KRYO-NEXT: [[TMP28:%.*]] = extractelement <8 x i32> [[TMP6]],
>>> i64 7
>>> ; KRYO-NEXT: [[TMP29:%.*]] = sext i32 [[TMP28]] to i64
>>> ; KRYO-NEXT: [[ARRAYIDX64:%.*]] = getelementptr inbounds i16,
>>> i16* [[G]], i64 [[TMP29]]
>>>
>>> diff --git
>>> a/llvm/test/Transforms/SLPVectorizer/AArch64/gather-root.ll
>>> b/llvm/test/Transforms/SLPVectorizer/AArch64/gather-root.ll
>>> index e9c502b6982cd..01d743fcbfe97 100644
>>> --- a/llvm/test/Transforms/SLPVectorizer/AArch64/gather-root.ll
>>> +++ b/llvm/test/Transforms/SLPVectorizer/AArch64/gather-root.ll
>>> @@ -35,41 +35,14 @@ define void @PR28330(i32 %n) {
>>> ;
>>> ; MAX-COST-LABEL: @PR28330(
>>> ; MAX-COST-NEXT: entry:
>>> -; MAX-COST-NEXT: [[P0:%.*]] = load i8, i8* getelementptr
>>> inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 1), align 1
>>> -; MAX-COST-NEXT: [[P1:%.*]] = icmp eq i8 [[P0]], 0
>>> -; MAX-COST-NEXT: [[P2:%.*]] = load i8, i8* getelementptr
>>> inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 2), align 2
>>> -; MAX-COST-NEXT: [[P3:%.*]] = icmp eq i8 [[P2]], 0
>>> -; MAX-COST-NEXT: [[P4:%.*]] = load i8, i8* getelementptr
>>> inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 3), align 1
>>> -; MAX-COST-NEXT: [[P5:%.*]] = icmp eq i8 [[P4]], 0
>>> -; MAX-COST-NEXT: [[P6:%.*]] = load i8, i8* getelementptr
>>> inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 4), align 4
>>> -; MAX-COST-NEXT: [[P7:%.*]] = icmp eq i8 [[P6]], 0
>>> -; MAX-COST-NEXT: [[P8:%.*]] = load i8, i8* getelementptr
>>> inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 5), align 1
>>> -; MAX-COST-NEXT: [[P9:%.*]] = icmp eq i8 [[P8]], 0
>>> -; MAX-COST-NEXT: [[P10:%.*]] = load i8, i8* getelementptr
>>> inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 6), align 2
>>> -; MAX-COST-NEXT: [[P11:%.*]] = icmp eq i8 [[P10]], 0
>>> -; MAX-COST-NEXT: [[P12:%.*]] = load i8, i8* getelementptr
>>> inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 7), align 1
>>> -; MAX-COST-NEXT: [[P13:%.*]] = icmp eq i8 [[P12]], 0
>>> -; MAX-COST-NEXT: [[P14:%.*]] = load i8, i8* getelementptr
>>> inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 8), align 8
>>> -; MAX-COST-NEXT: [[P15:%.*]] = icmp eq i8 [[P14]], 0
>>> +; MAX-COST-NEXT: [[TMP0:%.*]] = load <8 x i8>, <8 x i8>* bitcast
>>> (i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 1)
>>> to <8 x i8>*), align 1
>>> +; MAX-COST-NEXT: [[TMP1:%.*]] = icmp eq <8 x i8> [[TMP0]],
>>> zeroinitializer
>>> ; MAX-COST-NEXT: br label [[FOR_BODY:%.*]]
>>> ; MAX-COST: for.body:
>>> -; MAX-COST-NEXT: [[P17:%.*]] = phi i32 [ [[P34:%.*]],
>>> [[FOR_BODY]] ], [ 0, [[ENTRY:%.*]] ]
>>> -; MAX-COST-NEXT: [[P19:%.*]] = select i1 [[P1]], i32 -720, i32 -80
>>> -; MAX-COST-NEXT: [[P20:%.*]] = add i32 [[P17]], [[P19]]
>>> -; MAX-COST-NEXT: [[P21:%.*]] = select i1 [[P3]], i32 -720, i32 -80
>>> -; MAX-COST-NEXT: [[P22:%.*]] = add i32 [[P20]], [[P21]]
>>> -; MAX-COST-NEXT: [[P23:%.*]] = select i1 [[P5]], i32 -720, i32 -80
>>> -; MAX-COST-NEXT: [[P24:%.*]] = add i32 [[P22]], [[P23]]
>>> -; MAX-COST-NEXT: [[P25:%.*]] = select i1 [[P7]], i32 -720, i32 -80
>>> -; MAX-COST-NEXT: [[P26:%.*]] = add i32 [[P24]], [[P25]]
>>> -; MAX-COST-NEXT: [[P27:%.*]] = select i1 [[P9]], i32 -720, i32 -80
>>> -; MAX-COST-NEXT: [[P28:%.*]] = add i32 [[P26]], [[P27]]
>>> -; MAX-COST-NEXT: [[P29:%.*]] = select i1 [[P11]], i32 -720, i32 -80
>>> -; MAX-COST-NEXT: [[P30:%.*]] = add i32 [[P28]], [[P29]]
>>> -; MAX-COST-NEXT: [[P31:%.*]] = select i1 [[P13]], i32 -720, i32 -80
>>> -; MAX-COST-NEXT: [[P32:%.*]] = add i32 [[P30]], [[P31]]
>>> -; MAX-COST-NEXT: [[P33:%.*]] = select i1 [[P15]], i32 -720, i32 -80
>>> -; MAX-COST-NEXT: [[P34]] = add i32 [[P32]], [[P33]]
>>> +; MAX-COST-NEXT: [[P17:%.*]] = phi i32 [ [[OP_EXTRA:%.*]],
>>> [[FOR_BODY]] ], [ 0, [[ENTRY:%.*]] ]
>>> +; MAX-COST-NEXT: [[TMP2:%.*]] = select <8 x i1> [[TMP1]], <8 x
>>> i32> <i32 -720, i32 -720, i32 -720, i32 -720, i32 -720, i32 -720,
>>> i32 -720, i32 -720>, <8 x i32> <i32 -80, i32 -80, i32 -80, i32 -80,
>>> i32 -80, i32 -80, i32 -80, i32 -80>
>>> +; MAX-COST-NEXT: [[TMP3:%.*]] = call i32
>>> @llvm.vector.reduce.add.v8i32(<8 x i32> [[TMP2]])
>>> +; MAX-COST-NEXT: [[OP_EXTRA]] = add i32 [[TMP3]], [[P17]]
>>> ; MAX-COST-NEXT: br label [[FOR_BODY]]
>>> ;
>>> entry:
>>> @@ -139,30 +112,14 @@ define void @PR32038(i32 %n) {
>>> ;
>>> ; MAX-COST-LABEL: @PR32038(
>>> ; MAX-COST-NEXT: entry:
>>> -; MAX-COST-NEXT: [[TMP0:%.*]] = load <4 x i8>, <4 x i8>* bitcast
>>> (i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 1)
>>> to <4 x i8>*), align 1
>>> -; MAX-COST-NEXT: [[TMP1:%.*]] = icmp eq <4 x i8> [[TMP0]],
>>> zeroinitializer
>>> -; MAX-COST-NEXT: [[P8:%.*]] = load i8, i8* getelementptr
>>> inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 5), align 1
>>> -; MAX-COST-NEXT: [[P9:%.*]] = icmp eq i8 [[P8]], 0
>>> -; MAX-COST-NEXT: [[P10:%.*]] = load i8, i8* getelementptr
>>> inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 6), align 2
>>> -; MAX-COST-NEXT: [[P11:%.*]] = icmp eq i8 [[P10]], 0
>>> -; MAX-COST-NEXT: [[P12:%.*]] = load i8, i8* getelementptr
>>> inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 7), align 1
>>> -; MAX-COST-NEXT: [[P13:%.*]] = icmp eq i8 [[P12]], 0
>>> -; MAX-COST-NEXT: [[P14:%.*]] = load i8, i8* getelementptr
>>> inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 8), align 8
>>> -; MAX-COST-NEXT: [[P15:%.*]] = icmp eq i8 [[P14]], 0
>>> +; MAX-COST-NEXT: [[TMP0:%.*]] = load <8 x i8>, <8 x i8>* bitcast
>>> (i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 1)
>>> to <8 x i8>*), align 1
>>> +; MAX-COST-NEXT: [[TMP1:%.*]] = icmp eq <8 x i8> [[TMP0]],
>>> zeroinitializer
>>> ; MAX-COST-NEXT: br label [[FOR_BODY:%.*]]
>>> ; MAX-COST: for.body:
>>> -; MAX-COST-NEXT: [[P17:%.*]] = phi i32 [ [[P34:%.*]],
>>> [[FOR_BODY]] ], [ 0, [[ENTRY:%.*]] ]
>>> -; MAX-COST-NEXT: [[TMP2:%.*]] = select <4 x i1> [[TMP1]], <4 x
>>> i32> <i32 -720, i32 -720, i32 -720, i32 -720>, <4 x i32> <i32 -80,
>>> i32 -80, i32 -80, i32 -80>
>>> -; MAX-COST-NEXT: [[P27:%.*]] = select i1 [[P9]], i32 -720, i32 -80
>>> -; MAX-COST-NEXT: [[P29:%.*]] = select i1 [[P11]], i32 -720, i32 -80
>>> -; MAX-COST-NEXT: [[TMP3:%.*]] = call i32
>>> @llvm.vector.reduce.add.v4i32(<4 x i32> [[TMP2]])
>>> -; MAX-COST-NEXT: [[TMP4:%.*]] = add i32 [[TMP3]], [[P27]]
>>> -; MAX-COST-NEXT: [[TMP5:%.*]] = add i32 [[TMP4]], [[P29]]
>>> -; MAX-COST-NEXT: [[OP_EXTRA:%.*]] = add i32 [[TMP5]], -5
>>> -; MAX-COST-NEXT: [[P31:%.*]] = select i1 [[P13]], i32 -720, i32 -80
>>> -; MAX-COST-NEXT: [[P32:%.*]] = add i32 [[OP_EXTRA]], [[P31]]
>>> -; MAX-COST-NEXT: [[P33:%.*]] = select i1 [[P15]], i32 -720, i32 -80
>>> -; MAX-COST-NEXT: [[P34]] = add i32 [[P32]], [[P33]]
>>> +; MAX-COST-NEXT: [[P17:%.*]] = phi i32 [ [[OP_EXTRA:%.*]],
>>> [[FOR_BODY]] ], [ 0, [[ENTRY:%.*]] ]
>>> +; MAX-COST-NEXT: [[TMP2:%.*]] = select <8 x i1> [[TMP1]], <8 x
>>> i32> <i32 -720, i32 -720, i32 -720, i32 -720, i32 -720, i32 -720,
>>> i32 -720, i32 -720>, <8 x i32> <i32 -80, i32 -80, i32 -80, i32 -80,
>>> i32 -80, i32 -80, i32 -80, i32 -80>
>>> +; MAX-COST-NEXT: [[TMP3:%.*]] = call i32
>>> @llvm.vector.reduce.add.v8i32(<8 x i32> [[TMP2]])
>>> +; MAX-COST-NEXT: [[OP_EXTRA]] = add i32 [[TMP3]], -5
>>> ; MAX-COST-NEXT: br label [[FOR_BODY]]
>>> ;
>>> entry:
>>>
>>> diff --git
>>> a/llvm/test/Transforms/SLPVectorizer/AArch64/spillcost-di.ll
>>> b/llvm/test/Transforms/SLPVectorizer/AArch64/spillcost-di.ll
>>> index 39f2f885bc26b..c1451090d23c0 100644
>>> --- a/llvm/test/Transforms/SLPVectorizer/AArch64/spillcost-di.ll
>>> +++ b/llvm/test/Transforms/SLPVectorizer/AArch64/spillcost-di.ll
>>> @@ -14,14 +14,14 @@ define void @patatino(i64 %n, i64 %i, %struct.S*
>>> %p) !dbg !7 {
>>> ; CHECK-NEXT: call void @llvm.dbg.value(metadata %struct.S*
>>> [[P:%.*]], metadata [[META20:![0-9]+]], metadata !DIExpression()),
>>> !dbg [[DBG25:![0-9]+]]
>>> ; CHECK-NEXT: [[X1:%.*]] = getelementptr inbounds
>>> [[STRUCT_S:%.*]], %struct.S* [[P]], i64 [[N]], i32 0, !dbg
>>> [[DBG26:![0-9]+]]
>>> ; CHECK-NEXT: call void @llvm.dbg.value(metadata i64 undef,
>>> metadata [[META21:![0-9]+]], metadata !DIExpression()), !dbg
>>> [[DBG27:![0-9]+]]
>>> -; CHECK-NEXT: [[Y3:%.*]] = getelementptr inbounds [[STRUCT_S]],
>>> %struct.S* [[P]], i64 [[N]], i32 1, !dbg [[DBG28:![0-9]+]]
>>> +; CHECK-NEXT: call void @llvm.dbg.value(metadata i64 undef,
>>> metadata [[META22:![0-9]+]], metadata !DIExpression()), !dbg
>>> [[DBG28:![0-9]+]]
>>> +; CHECK-NEXT: [[Y3:%.*]] = getelementptr inbounds [[STRUCT_S]],
>>> %struct.S* [[P]], i64 [[N]], i32 1, !dbg [[DBG29:![0-9]+]]
>>> ; CHECK-NEXT: [[TMP0:%.*]] = bitcast i64* [[X1]] to <2 x i64>*,
>>> !dbg [[DBG26]]
>>> -; CHECK-NEXT: [[TMP1:%.*]] = load <2 x i64>, <2 x i64>*
>>> [[TMP0]], align 8, !dbg [[DBG26]], !tbaa [[TBAA29:![0-9]+]]
>>> -; CHECK-NEXT: call void @llvm.dbg.value(metadata i64 undef,
>>> metadata [[META22:![0-9]+]], metadata !DIExpression()), !dbg
>>> [[DBG33:![0-9]+]]
>>> +; CHECK-NEXT: [[TMP1:%.*]] = load <2 x i64>, <2 x i64>*
>>> [[TMP0]], align 8, !dbg [[DBG26]], !tbaa [[TBAA30:![0-9]+]]
>>> ; CHECK-NEXT: [[X5:%.*]] = getelementptr inbounds [[STRUCT_S]],
>>> %struct.S* [[P]], i64 [[I]], i32 0, !dbg [[DBG34:![0-9]+]]
>>> ; CHECK-NEXT: [[Y7:%.*]] = getelementptr inbounds [[STRUCT_S]],
>>> %struct.S* [[P]], i64 [[I]], i32 1, !dbg [[DBG35:![0-9]+]]
>>> ; CHECK-NEXT: [[TMP2:%.*]] = bitcast i64* [[X5]] to <2 x i64>*,
>>> !dbg [[DBG36:![0-9]+]]
>>> -; CHECK-NEXT: store <2 x i64> [[TMP1]], <2 x i64>* [[TMP2]],
>>> align 8, !dbg [[DBG36]], !tbaa [[TBAA29]]
>>> +; CHECK-NEXT: store <2 x i64> [[TMP1]], <2 x i64>* [[TMP2]],
>>> align 8, !dbg [[DBG36]], !tbaa [[TBAA30]]
>>> ; CHECK-NEXT: ret void, !dbg [[DBG37:![0-9]+]]
>>> ;
>>> entry:
>>>
>>> diff --git a/llvm/test/Transforms/SLPVectorizer/X86/PR35628_2.ll
>>> b/llvm/test/Transforms/SLPVectorizer/X86/PR35628_2.ll
>>> index 7f51dcae484ca..d15494e092c25 100644
>>> --- a/llvm/test/Transforms/SLPVectorizer/X86/PR35628_2.ll
>>> +++ b/llvm/test/Transforms/SLPVectorizer/X86/PR35628_2.ll
>>> @@ -9,11 +9,11 @@ define void @test() #0 {
>>> ; CHECK: loop:
>>> ; CHECK-NEXT: [[DUMMY_PHI:%.*]] = phi i64 [ 1, [[ENTRY:%.*]] ],
>>> [ [[OP_EXTRA1:%.*]], [[LOOP]] ]
>>> ; CHECK-NEXT: [[TMP0:%.*]] = phi i64 [ 2, [[ENTRY]] ], [
>>> [[TMP3:%.*]], [[LOOP]] ]
>>> -; CHECK-NEXT: [[DUMMY_ADD:%.*]] = add i16 0, 0
>>> ; CHECK-NEXT: [[TMP1:%.*]] = insertelement <4 x i64> poison,
>>> i64 [[TMP0]], i32 0
>>> ; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x i64>
>>> [[TMP1]], <4 x i64> poison, <4 x i32> zeroinitializer
>>> ; CHECK-NEXT: [[TMP2:%.*]] = add <4 x i64> [[SHUFFLE]], <i64 3,
>>> i64 2, i64 1, i64 0>
>>> ; CHECK-NEXT: [[TMP3]] = extractelement <4 x i64> [[TMP2]], i32 3
>>> +; CHECK-NEXT: [[DUMMY_ADD:%.*]] = add i16 0, 0
>>> ; CHECK-NEXT: [[TMP4:%.*]] = extractelement <4 x i64> [[TMP2]],
>>> i32 0
>>> ; CHECK-NEXT: [[DUMMY_SHL:%.*]] = shl i64 [[TMP4]], 32
>>> ; CHECK-NEXT: [[TMP5:%.*]] = add <4 x i64> <i64 1, i64 1, i64
>>> 1, i64 1>, [[TMP2]]
>>>
>>> diff --git a/llvm/test/Transforms/SLPVectorizer/X86/PR40310.ll
>>> b/llvm/test/Transforms/SLPVectorizer/X86/PR40310.ll
>>> index 7ab610f994264..f878bda14ad84 100644
>>> --- a/llvm/test/Transforms/SLPVectorizer/X86/PR40310.ll
>>> +++ b/llvm/test/Transforms/SLPVectorizer/X86/PR40310.ll
>>> @@ -10,10 +10,10 @@ define void @mainTest(i32 %param, i32 * %vals,
>>> i32 %len) {
>>> ; CHECK-NEXT: [[TMP1:%.*]] = phi <2 x i32> [ [[TMP7:%.*]],
>>> [[BCI_15]] ], [ [[TMP0]], [[BCI_15_PREHEADER:%.*]] ]
>>> ; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x i32>
>>> [[TMP1]], <2 x i32> poison, <16 x i32> <i32 0, i32 0, i32 0, i32 0,
>>> i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32
>>> 0, i32 0, i32 1>
>>> ; CHECK-NEXT: [[TMP2:%.*]] = extractelement <16 x i32>
>>> [[SHUFFLE]], i32 0
>>> -; CHECK-NEXT: [[TMP3:%.*]] = extractelement <16 x i32>
>>> [[SHUFFLE]], i32 15
>>> -; CHECK-NEXT: store atomic i32 [[TMP3]], i32* [[VALS:%.*]]
>>> unordered, align 4
>>> -; CHECK-NEXT: [[TMP4:%.*]] = add <16 x i32> [[SHUFFLE]], <i32
>>> 15, i32 14, i32 13, i32 12, i32 11, i32 10, i32 9, i32 8, i32 7, i32
>>> 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 -1>
>>> -; CHECK-NEXT: [[TMP5:%.*]] = call i32
>>> @llvm.vector.reduce.and.v16i32(<16 x i32> [[TMP4]])
>>> +; CHECK-NEXT: [[TMP3:%.*]] = add <16 x i32> [[SHUFFLE]], <i32
>>> 15, i32 14, i32 13, i32 12, i32 11, i32 10, i32 9, i32 8, i32 7, i32
>>> 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 -1>
>>> +; CHECK-NEXT: [[TMP4:%.*]] = extractelement <16 x i32>
>>> [[SHUFFLE]], i32 15
>>> +; CHECK-NEXT: store atomic i32 [[TMP4]], i32* [[VALS:%.*]]
>>> unordered, align 4
>>> +; CHECK-NEXT: [[TMP5:%.*]] = call i32
>>> @llvm.vector.reduce.and.v16i32(<16 x i32> [[TMP3]])
>>> ; CHECK-NEXT: [[OP_EXTRA:%.*]] = and i32 [[TMP5]], [[TMP2]]
>>> ; CHECK-NEXT: [[V44:%.*]] = add i32 [[TMP2]], 16
>>> ; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x i32> poison,
>>> i32 [[V44]], i32 0
>>>
>>> diff --git
>>> a/llvm/test/Transforms/SLPVectorizer/X86/crash_exceed_scheduling.ll
>>> b/llvm/test/Transforms/SLPVectorizer/X86/crash_exceed_scheduling.ll
>>> index de371d8895c7d..94739340c8b5a 100644
>>> --- a/llvm/test/Transforms/SLPVectorizer/X86/crash_exceed_scheduling.ll
>>> +++ b/llvm/test/Transforms/SLPVectorizer/X86/crash_exceed_scheduling.ll
>>> @@ -29,10 +29,10 @@ define void @exceed(double %0, double %1) {
>>> ; CHECK-NEXT: [[IXX22:%.*]] = fsub double undef, undef
>>> ; CHECK-NEXT: [[TMP8:%.*]] = extractelement <2 x double>
>>> [[TMP6]], i32 0
>>> ; CHECK-NEXT: [[IX2:%.*]] = fmul double [[TMP8]], [[TMP8]]
>>> -; CHECK-NEXT: [[TMP9:%.*]] = insertelement <2 x double>
>>> [[TMP2]], double [[TMP1]], i32 1
>>> -; CHECK-NEXT: [[TMP10:%.*]] = fadd fast <2 x double> [[TMP6]],
>>> [[TMP9]]
>>> -; CHECK-NEXT: [[TMP11:%.*]] = fadd fast <2 x double> [[TMP3]],
>>> [[TMP5]]
>>> -; CHECK-NEXT: [[TMP12:%.*]] = fmul fast <2 x double> [[TMP10]],
>>> [[TMP11]]
>>> +; CHECK-NEXT: [[TMP9:%.*]] = fadd fast <2 x double> [[TMP3]],
>>> [[TMP5]]
>>> +; CHECK-NEXT: [[TMP10:%.*]] = insertelement <2 x double>
>>> [[TMP2]], double [[TMP1]], i32 1
>>> +; CHECK-NEXT: [[TMP11:%.*]] = fadd fast <2 x double> [[TMP6]],
>>> [[TMP10]]
>>> +; CHECK-NEXT: [[TMP12:%.*]] = fmul fast <2 x double> [[TMP11]],
>>> [[TMP9]]
>>> ; CHECK-NEXT: [[IXX101:%.*]] = fsub double undef, undef
>>> ; CHECK-NEXT: [[TMP13:%.*]] = insertelement <2 x double>
>>> poison, double [[TMP1]], i32 1
>>> ; CHECK-NEXT: [[TMP14:%.*]] = insertelement <2 x double>
>>> [[TMP13]], double [[TMP7]], i32 0
>>>
>>> diff --git a/llvm/test/Transforms/SLPVectorizer/X86/opaque-ptr.ll
>>> b/llvm/test/Transforms/SLPVectorizer/X86/opaque-ptr.ll
>>> index 80cb197982d48..8dc4a8936b722 100644
>>> --- a/llvm/test/Transforms/SLPVectorizer/X86/opaque-ptr.ll
>>> +++ b/llvm/test/Transforms/SLPVectorizer/X86/opaque-ptr.ll
>>> @@ -58,10 +58,10 @@ define void @test(ptr %r, ptr %p, ptr %q) #0 {
>>> define void @test2(i64* %a, i64* %b) {
>>> ; CHECK-LABEL: @test2(
>>> -; CHECK-NEXT: [[A2:%.*]] = getelementptr inbounds i64, ptr
>>> [[A:%.*]], i64 2
>>> -; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x ptr> poison, ptr
>>> [[A]], i32 0
>>> +; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x ptr> poison, ptr
>>> [[A:%.*]], i32 0
>>> ; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x ptr> [[TMP1]],
>>> ptr [[B:%.*]], i32 1
>>> ; CHECK-NEXT: [[TMP3:%.*]] = getelementptr i64, <2 x ptr>
>>> [[TMP2]], <2 x i64> <i64 1, i64 3>
>>> +; CHECK-NEXT: [[A2:%.*]] = getelementptr inbounds i64, ptr
>>> [[A]], i64 2
>>> ; CHECK-NEXT: [[TMP4:%.*]] = ptrtoint <2 x ptr> [[TMP3]] to <2
>>> x i64>
>>> ; CHECK-NEXT: [[TMP5:%.*]] = extractelement <2 x ptr> [[TMP3]],
>>> i32 0
>>> ; CHECK-NEXT: [[TMP6:%.*]] = load <2 x i64>, ptr [[TMP5]], align 8
>>>
>>> diff --git
>>> a/llvm/test/Transforms/SLPVectorizer/X86/pr47629-inseltpoison.ll
>>> b/llvm/test/Transforms/SLPVectorizer/X86/pr47629-inseltpoison.ll
>>> index f6dd7526e6e76..35a6c63d29b6c 100644
>>> --- a/llvm/test/Transforms/SLPVectorizer/X86/pr47629-inseltpoison.ll
>>> +++ b/llvm/test/Transforms/SLPVectorizer/X86/pr47629-inseltpoison.ll
>>> @@ -749,47 +749,47 @@ define void @gather_load_div(float* noalias
>>> nocapture %0, float* noalias nocaptu
>>> ; AVX2-NEXT: ret void
>>> ;
>>> ; AVX512F-LABEL: @gather_load_div(
>>> -; AVX512F-NEXT: [[TMP3:%.*]] = insertelement <4 x float*>
>>> poison, float* [[TMP1:%.*]], i64 0
>>> -; AVX512F-NEXT: [[SHUFFLE1:%.*]] = shufflevector <4 x float*>
>>> [[TMP3]], <4 x float*> poison, <4 x i32> zeroinitializer
>>> -; AVX512F-NEXT: [[TMP4:%.*]] = getelementptr float, <4 x float*>
>>> [[SHUFFLE1]], <4 x i64> <i64 10, i64 3, i64 14, i64 17>
>>> -; AVX512F-NEXT: [[TMP5:%.*]] = insertelement <2 x float*>
>>> poison, float* [[TMP1]], i64 0
>>> -; AVX512F-NEXT: [[TMP6:%.*]] = shufflevector <2 x float*>
>>> [[TMP5]], <2 x float*> poison, <2 x i32> zeroinitializer
>>> -; AVX512F-NEXT: [[TMP7:%.*]] = getelementptr float, <2 x float*>
>>> [[TMP6]], <2 x i64> <i64 8, i64 5>
>>> -; AVX512F-NEXT: [[TMP8:%.*]] = getelementptr inbounds float,
>>> float* [[TMP1]], i64 20
>>> -; AVX512F-NEXT: [[TMP9:%.*]] = insertelement <8 x float*>
>>> poison, float* [[TMP1]], i64 0
>>> -; AVX512F-NEXT: [[TMP10:%.*]] = shufflevector <4 x float*>
>>> [[TMP4]], <4 x float*> poison, <8 x i32> <i32 0, i32 1, i32 2, i32
>>> 3, i32 undef, i32 undef, i32 undef, i32 undef>
>>> -; AVX512F-NEXT: [[TMP11:%.*]] = shufflevector <8 x float*>
>>> [[TMP9]], <8 x float*> [[TMP10]], <8 x i32> <i32 0, i32 8, i32 9,
>>> i32 10, i32 11, i32 undef, i32 undef, i32 undef>
>>> -; AVX512F-NEXT: [[TMP12:%.*]] = shufflevector <2 x float*>
>>> [[TMP7]], <2 x float*> poison, <8 x i32> <i32 0, i32 1, i32 undef,
>>> i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
>>> -; AVX512F-NEXT: [[TMP13:%.*]] = shufflevector <8 x float*>
>>> [[TMP11]], <8 x float*> [[TMP12]], <8 x i32> <i32 0, i32 1, i32 2,
>>> i32 3, i32 4, i32 8, i32 9, i32 undef>
>>> -; AVX512F-NEXT: [[TMP14:%.*]] = insertelement <8 x float*>
>>> [[TMP13]], float* [[TMP8]], i64 7
>>> -; AVX512F-NEXT: [[TMP15:%.*]] = call <8 x float>
>>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP14]], i32 4, <8
>>> x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1
>>> true, i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>> -; AVX512F-NEXT: [[SHUFFLE:%.*]] = shufflevector <8 x float*>
>>> [[TMP9]], <8 x float*> poison, <8 x i32> zeroinitializer
>>> -; AVX512F-NEXT: [[TMP16:%.*]] = getelementptr float, <8 x
>>> float*> [[SHUFFLE]], <8 x i64> <i64 4, i64 13, i64 11, i64 44, i64
>>> 33, i64 30, i64 27, i64 23>
>>> -; AVX512F-NEXT: [[TMP17:%.*]] = call <8 x float>
>>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP16]], i32 4, <8
>>> x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1
>>> true, i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>> -; AVX512F-NEXT: [[TMP18:%.*]] = fdiv <8 x float> [[TMP15]],
>>> [[TMP17]]
>>> +; AVX512F-NEXT: [[TMP3:%.*]] = insertelement <8 x float*>
>>> poison, float* [[TMP1:%.*]], i64 0
>>> +; AVX512F-NEXT: [[SHUFFLE:%.*]] = shufflevector <8 x float*>
>>> [[TMP3]], <8 x float*> poison, <8 x i32> zeroinitializer
>>> +; AVX512F-NEXT: [[TMP4:%.*]] = getelementptr float, <8 x float*>
>>> [[SHUFFLE]], <8 x i64> <i64 4, i64 13, i64 11, i64 44, i64 33, i64
>>> 30, i64 27, i64 23>
>>> +; AVX512F-NEXT: [[TMP5:%.*]] = insertelement <4 x float*>
>>> poison, float* [[TMP1]], i64 0
>>> +; AVX512F-NEXT: [[SHUFFLE1:%.*]] = shufflevector <4 x float*>
>>> [[TMP5]], <4 x float*> poison, <4 x i32> zeroinitializer
>>> +; AVX512F-NEXT: [[TMP6:%.*]] = getelementptr float, <4 x float*>
>>> [[SHUFFLE1]], <4 x i64> <i64 10, i64 3, i64 14, i64 17>
>>> +; AVX512F-NEXT: [[TMP7:%.*]] = insertelement <2 x float*>
>>> poison, float* [[TMP1]], i64 0
>>> +; AVX512F-NEXT: [[TMP8:%.*]] = shufflevector <2 x float*>
>>> [[TMP7]], <2 x float*> poison, <2 x i32> zeroinitializer
>>> +; AVX512F-NEXT: [[TMP9:%.*]] = getelementptr float, <2 x float*>
>>> [[TMP8]], <2 x i64> <i64 8, i64 5>
>>> +; AVX512F-NEXT: [[TMP10:%.*]] = getelementptr inbounds float,
>>> float* [[TMP1]], i64 20
>>> +; AVX512F-NEXT: [[TMP11:%.*]] = shufflevector <4 x float*>
>>> [[TMP6]], <4 x float*> poison, <8 x i32> <i32 0, i32 1, i32 2, i32
>>> 3, i32 undef, i32 undef, i32 undef, i32 undef>
>>> +; AVX512F-NEXT: [[TMP12:%.*]] = shufflevector <8 x float*>
>>> [[TMP3]], <8 x float*> [[TMP11]], <8 x i32> <i32 0, i32 8, i32 9,
>>> i32 10, i32 11, i32 undef, i32 undef, i32 undef>
>>> +; AVX512F-NEXT: [[TMP13:%.*]] = shufflevector <2 x float*>
>>> [[TMP9]], <2 x float*> poison, <8 x i32> <i32 0, i32 1, i32 undef,
>>> i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
>>> +; AVX512F-NEXT: [[TMP14:%.*]] = shufflevector <8 x float*>
>>> [[TMP12]], <8 x float*> [[TMP13]], <8 x i32> <i32 0, i32 1, i32 2,
>>> i32 3, i32 4, i32 8, i32 9, i32 undef>
>>> +; AVX512F-NEXT: [[TMP15:%.*]] = insertelement <8 x float*>
>>> [[TMP14]], float* [[TMP10]], i64 7
>>> +; AVX512F-NEXT: [[TMP16:%.*]] = call <8 x float>
>>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP15]], i32 4, <8
>>> x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1
>>> true, i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>> +; AVX512F-NEXT: [[TMP17:%.*]] = call <8 x float>
>>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP4]], i32 4, <8 x
>>> i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true,
>>> i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>> +; AVX512F-NEXT: [[TMP18:%.*]] = fdiv <8 x float> [[TMP16]],
>>> [[TMP17]]
>>> ; AVX512F-NEXT: [[TMP19:%.*]] = bitcast float* [[TMP0:%.*]] to
>>> <8 x float>*
>>> ; AVX512F-NEXT: store <8 x float> [[TMP18]], <8 x float>*
>>> [[TMP19]], align 4, !tbaa [[TBAA0]]
>>> ; AVX512F-NEXT: ret void
>>> ;
>>> ; AVX512VL-LABEL: @gather_load_div(
>>> -; AVX512VL-NEXT: [[TMP3:%.*]] = insertelement <4 x float*>
>>> poison, float* [[TMP1:%.*]], i64 0
>>> -; AVX512VL-NEXT: [[SHUFFLE1:%.*]] = shufflevector <4 x float*>
>>> [[TMP3]], <4 x float*> poison, <4 x i32> zeroinitializer
>>> -; AVX512VL-NEXT: [[TMP4:%.*]] = getelementptr float, <4 x
>>> float*> [[SHUFFLE1]], <4 x i64> <i64 10, i64 3, i64 14, i64 17>
>>> -; AVX512VL-NEXT: [[TMP5:%.*]] = insertelement <2 x float*>
>>> poison, float* [[TMP1]], i64 0
>>> -; AVX512VL-NEXT: [[TMP6:%.*]] = shufflevector <2 x float*>
>>> [[TMP5]], <2 x float*> poison, <2 x i32> zeroinitializer
>>> -; AVX512VL-NEXT: [[TMP7:%.*]] = getelementptr float, <2 x
>>> float*> [[TMP6]], <2 x i64> <i64 8, i64 5>
>>> -; AVX512VL-NEXT: [[TMP8:%.*]] = getelementptr inbounds float,
>>> float* [[TMP1]], i64 20
>>> -; AVX512VL-NEXT: [[TMP9:%.*]] = insertelement <8 x float*>
>>> poison, float* [[TMP1]], i64 0
>>> -; AVX512VL-NEXT: [[TMP10:%.*]] = shufflevector <4 x float*>
>>> [[TMP4]], <4 x float*> poison, <8 x i32> <i32 0, i32 1, i32 2, i32
>>> 3, i32 undef, i32 undef, i32 undef, i32 undef>
>>> -; AVX512VL-NEXT: [[TMP11:%.*]] = shufflevector <8 x float*>
>>> [[TMP9]], <8 x float*> [[TMP10]], <8 x i32> <i32 0, i32 8, i32 9,
>>> i32 10, i32 11, i32 undef, i32 undef, i32 undef>
>>> -; AVX512VL-NEXT: [[TMP12:%.*]] = shufflevector <2 x float*>
>>> [[TMP7]], <2 x float*> poison, <8 x i32> <i32 0, i32 1, i32 undef,
>>> i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
>>> -; AVX512VL-NEXT: [[TMP13:%.*]] = shufflevector <8 x float*>
>>> [[TMP11]], <8 x float*> [[TMP12]], <8 x i32> <i32 0, i32 1, i32 2,
>>> i32 3, i32 4, i32 8, i32 9, i32 undef>
>>> -; AVX512VL-NEXT: [[TMP14:%.*]] = insertelement <8 x float*>
>>> [[TMP13]], float* [[TMP8]], i64 7
>>> -; AVX512VL-NEXT: [[TMP15:%.*]] = call <8 x float>
>>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP14]], i32 4, <8
>>> x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1
>>> true, i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>> -; AVX512VL-NEXT: [[SHUFFLE:%.*]] = shufflevector <8 x float*>
>>> [[TMP9]], <8 x float*> poison, <8 x i32> zeroinitializer
>>> -; AVX512VL-NEXT: [[TMP16:%.*]] = getelementptr float, <8 x
>>> float*> [[SHUFFLE]], <8 x i64> <i64 4, i64 13, i64 11, i64 44, i64
>>> 33, i64 30, i64 27, i64 23>
>>> -; AVX512VL-NEXT: [[TMP17:%.*]] = call <8 x float>
>>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP16]], i32 4, <8
>>> x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1
>>> true, i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>> -; AVX512VL-NEXT: [[TMP18:%.*]] = fdiv <8 x float> [[TMP15]],
>>> [[TMP17]]
>>> +; AVX512VL-NEXT: [[TMP3:%.*]] = insertelement <8 x float*>
>>> poison, float* [[TMP1:%.*]], i64 0
>>> +; AVX512VL-NEXT: [[SHUFFLE:%.*]] = shufflevector <8 x float*>
>>> [[TMP3]], <8 x float*> poison, <8 x i32> zeroinitializer
>>> +; AVX512VL-NEXT: [[TMP4:%.*]] = getelementptr float, <8 x
>>> float*> [[SHUFFLE]], <8 x i64> <i64 4, i64 13, i64 11, i64 44, i64
>>> 33, i64 30, i64 27, i64 23>
>>> +; AVX512VL-NEXT: [[TMP5:%.*]] = insertelement <4 x float*>
>>> poison, float* [[TMP1]], i64 0
>>> +; AVX512VL-NEXT: [[SHUFFLE1:%.*]] = shufflevector <4 x float*>
>>> [[TMP5]], <4 x float*> poison, <4 x i32> zeroinitializer
>>> +; AVX512VL-NEXT: [[TMP6:%.*]] = getelementptr float, <4 x
>>> float*> [[SHUFFLE1]], <4 x i64> <i64 10, i64 3, i64 14, i64 17>
>>> +; AVX512VL-NEXT: [[TMP7:%.*]] = insertelement <2 x float*>
>>> poison, float* [[TMP1]], i64 0
>>> +; AVX512VL-NEXT: [[TMP8:%.*]] = shufflevector <2 x float*>
>>> [[TMP7]], <2 x float*> poison, <2 x i32> zeroinitializer
>>> +; AVX512VL-NEXT: [[TMP9:%.*]] = getelementptr float, <2 x
>>> float*> [[TMP8]], <2 x i64> <i64 8, i64 5>
>>> +; AVX512VL-NEXT: [[TMP10:%.*]] = getelementptr inbounds float,
>>> float* [[TMP1]], i64 20
>>> +; AVX512VL-NEXT: [[TMP11:%.*]] = shufflevector <4 x float*>
>>> [[TMP6]], <4 x float*> poison, <8 x i32> <i32 0, i32 1, i32 2, i32
>>> 3, i32 undef, i32 undef, i32 undef, i32 undef>
>>> +; AVX512VL-NEXT: [[TMP12:%.*]] = shufflevector <8 x float*>
>>> [[TMP3]], <8 x float*> [[TMP11]], <8 x i32> <i32 0, i32 8, i32 9,
>>> i32 10, i32 11, i32 undef, i32 undef, i32 undef>
>>> +; AVX512VL-NEXT: [[TMP13:%.*]] = shufflevector <2 x float*>
>>> [[TMP9]], <2 x float*> poison, <8 x i32> <i32 0, i32 1, i32 undef,
>>> i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
>>> +; AVX512VL-NEXT: [[TMP14:%.*]] = shufflevector <8 x float*>
>>> [[TMP12]], <8 x float*> [[TMP13]], <8 x i32> <i32 0, i32 1, i32 2,
>>> i32 3, i32 4, i32 8, i32 9, i32 undef>
>>> +; AVX512VL-NEXT: [[TMP15:%.*]] = insertelement <8 x float*>
>>> [[TMP14]], float* [[TMP10]], i64 7
>>> +; AVX512VL-NEXT: [[TMP16:%.*]] = call <8 x float>
>>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP15]], i32 4, <8
>>> x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1
>>> true, i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>> +; AVX512VL-NEXT: [[TMP17:%.*]] = call <8 x float>
>>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP4]], i32 4, <8 x
>>> i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true,
>>> i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>> +; AVX512VL-NEXT: [[TMP18:%.*]] = fdiv <8 x float> [[TMP16]],
>>> [[TMP17]]
>>> ; AVX512VL-NEXT: [[TMP19:%.*]] = bitcast float* [[TMP0:%.*]] to
>>> <8 x float>*
>>> ; AVX512VL-NEXT: store <8 x float> [[TMP18]], <8 x float>*
>>> [[TMP19]], align 4, !tbaa [[TBAA0]]
>>> ; AVX512VL-NEXT: ret void
>>>
>>> diff --git a/llvm/test/Transforms/SLPVectorizer/X86/pr47629.ll
>>> b/llvm/test/Transforms/SLPVectorizer/X86/pr47629.ll
>>> index fd1c612a0696e..47f4391fd3b21 100644
>>> --- a/llvm/test/Transforms/SLPVectorizer/X86/pr47629.ll
>>> +++ b/llvm/test/Transforms/SLPVectorizer/X86/pr47629.ll
>>> @@ -749,47 +749,47 @@ define void @gather_load_div(float* noalias
>>> nocapture %0, float* noalias nocaptu
>>> ; AVX2-NEXT: ret void
>>> ;
>>> ; AVX512F-LABEL: @gather_load_div(
>>> -; AVX512F-NEXT: [[TMP3:%.*]] = insertelement <4 x float*>
>>> poison, float* [[TMP1:%.*]], i64 0
>>> -; AVX512F-NEXT: [[SHUFFLE1:%.*]] = shufflevector <4 x float*>
>>> [[TMP3]], <4 x float*> poison, <4 x i32> zeroinitializer
>>> -; AVX512F-NEXT: [[TMP4:%.*]] = getelementptr float, <4 x float*>
>>> [[SHUFFLE1]], <4 x i64> <i64 10, i64 3, i64 14, i64 17>
>>> -; AVX512F-NEXT: [[TMP5:%.*]] = insertelement <2 x float*>
>>> poison, float* [[TMP1]], i64 0
>>> -; AVX512F-NEXT: [[TMP6:%.*]] = shufflevector <2 x float*>
>>> [[TMP5]], <2 x float*> poison, <2 x i32> zeroinitializer
>>> -; AVX512F-NEXT: [[TMP7:%.*]] = getelementptr float, <2 x float*>
>>> [[TMP6]], <2 x i64> <i64 8, i64 5>
>>> -; AVX512F-NEXT: [[TMP8:%.*]] = getelementptr inbounds float,
>>> float* [[TMP1]], i64 20
>>> -; AVX512F-NEXT: [[TMP9:%.*]] = insertelement <8 x float*>
>>> poison, float* [[TMP1]], i64 0
>>> -; AVX512F-NEXT: [[TMP10:%.*]] = shufflevector <4 x float*>
>>> [[TMP4]], <4 x float*> poison, <8 x i32> <i32 0, i32 1, i32 2, i32
>>> 3, i32 undef, i32 undef, i32 undef, i32 undef>
>>> -; AVX512F-NEXT: [[TMP11:%.*]] = shufflevector <8 x float*>
>>> [[TMP9]], <8 x float*> [[TMP10]], <8 x i32> <i32 0, i32 8, i32 9,
>>> i32 10, i32 11, i32 undef, i32 undef, i32 undef>
>>> -; AVX512F-NEXT: [[TMP12:%.*]] = shufflevector <2 x float*>
>>> [[TMP7]], <2 x float*> poison, <8 x i32> <i32 0, i32 1, i32 undef,
>>> i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
>>> -; AVX512F-NEXT: [[TMP13:%.*]] = shufflevector <8 x float*>
>>> [[TMP11]], <8 x float*> [[TMP12]], <8 x i32> <i32 0, i32 1, i32 2,
>>> i32 3, i32 4, i32 8, i32 9, i32 undef>
>>> -; AVX512F-NEXT: [[TMP14:%.*]] = insertelement <8 x float*>
>>> [[TMP13]], float* [[TMP8]], i64 7
>>> -; AVX512F-NEXT: [[TMP15:%.*]] = call <8 x float>
>>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP14]], i32 4, <8
>>> x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1
>>> true, i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>> -; AVX512F-NEXT: [[SHUFFLE:%.*]] = shufflevector <8 x float*>
>>> [[TMP9]], <8 x float*> poison, <8 x i32> zeroinitializer
>>> -; AVX512F-NEXT: [[TMP16:%.*]] = getelementptr float, <8 x
>>> float*> [[SHUFFLE]], <8 x i64> <i64 4, i64 13, i64 11, i64 44, i64
>>> 33, i64 30, i64 27, i64 23>
>>> -; AVX512F-NEXT: [[TMP17:%.*]] = call <8 x float>
>>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP16]], i32 4, <8
>>> x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1
>>> true, i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>> -; AVX512F-NEXT: [[TMP18:%.*]] = fdiv <8 x float> [[TMP15]],
>>> [[TMP17]]
>>> +; AVX512F-NEXT: [[TMP3:%.*]] = insertelement <8 x float*>
>>> poison, float* [[TMP1:%.*]], i64 0
>>> +; AVX512F-NEXT: [[SHUFFLE:%.*]] = shufflevector <8 x float*>
>>> [[TMP3]], <8 x float*> poison, <8 x i32> zeroinitializer
>>> +; AVX512F-NEXT: [[TMP4:%.*]] = getelementptr float, <8 x float*>
>>> [[SHUFFLE]], <8 x i64> <i64 4, i64 13, i64 11, i64 44, i64 33, i64
>>> 30, i64 27, i64 23>
>>> +; AVX512F-NEXT: [[TMP5:%.*]] = insertelement <4 x float*>
>>> poison, float* [[TMP1]], i64 0
>>> +; AVX512F-NEXT: [[SHUFFLE1:%.*]] = shufflevector <4 x float*>
>>> [[TMP5]], <4 x float*> poison, <4 x i32> zeroinitializer
>>> +; AVX512F-NEXT: [[TMP6:%.*]] = getelementptr float, <4 x float*>
>>> [[SHUFFLE1]], <4 x i64> <i64 10, i64 3, i64 14, i64 17>
>>> +; AVX512F-NEXT: [[TMP7:%.*]] = insertelement <2 x float*>
>>> poison, float* [[TMP1]], i64 0
>>> +; AVX512F-NEXT: [[TMP8:%.*]] = shufflevector <2 x float*>
>>> [[TMP7]], <2 x float*> poison, <2 x i32> zeroinitializer
>>> +; AVX512F-NEXT: [[TMP9:%.*]] = getelementptr float, <2 x float*>
>>> [[TMP8]], <2 x i64> <i64 8, i64 5>
>>> +; AVX512F-NEXT: [[TMP10:%.*]] = getelementptr inbounds float,
>>> float* [[TMP1]], i64 20
>>> +; AVX512F-NEXT: [[TMP11:%.*]] = shufflevector <4 x float*>
>>> [[TMP6]], <4 x float*> poison, <8 x i32> <i32 0, i32 1, i32 2, i32
>>> 3, i32 undef, i32 undef, i32 undef, i32 undef>
>>> +; AVX512F-NEXT: [[TMP12:%.*]] = shufflevector <8 x float*>
>>> [[TMP3]], <8 x float*> [[TMP11]], <8 x i32> <i32 0, i32 8, i32 9,
>>> i32 10, i32 11, i32 undef, i32 undef, i32 undef>
>>> +; AVX512F-NEXT: [[TMP13:%.*]] = shufflevector <2 x float*>
>>> [[TMP9]], <2 x float*> poison, <8 x i32> <i32 0, i32 1, i32 undef,
>>> i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
>>> +; AVX512F-NEXT: [[TMP14:%.*]] = shufflevector <8 x float*>
>>> [[TMP12]], <8 x float*> [[TMP13]], <8 x i32> <i32 0, i32 1, i32 2,
>>> i32 3, i32 4, i32 8, i32 9, i32 undef>
>>> +; AVX512F-NEXT: [[TMP15:%.*]] = insertelement <8 x float*>
>>> [[TMP14]], float* [[TMP10]], i64 7
>>> +; AVX512F-NEXT: [[TMP16:%.*]] = call <8 x float>
>>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP15]], i32 4, <8
>>> x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1
>>> true, i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>> +; AVX512F-NEXT: [[TMP17:%.*]] = call <8 x float>
>>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP4]], i32 4, <8 x
>>> i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true,
>>> i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>> +; AVX512F-NEXT: [[TMP18:%.*]] = fdiv <8 x float> [[TMP16]],
>>> [[TMP17]]
>>> ; AVX512F-NEXT: [[TMP19:%.*]] = bitcast float* [[TMP0:%.*]] to
>>> <8 x float>*
>>> ; AVX512F-NEXT: store <8 x float> [[TMP18]], <8 x float>*
>>> [[TMP19]], align 4, !tbaa [[TBAA0]]
>>> ; AVX512F-NEXT: ret void
>>> ;
>>> ; AVX512VL-LABEL: @gather_load_div(
>>> -; AVX512VL-NEXT: [[TMP3:%.*]] = insertelement <4 x float*>
>>> poison, float* [[TMP1:%.*]], i64 0
>>> -; AVX512VL-NEXT: [[SHUFFLE1:%.*]] = shufflevector <4 x float*>
>>> [[TMP3]], <4 x float*> poison, <4 x i32> zeroinitializer
>>> -; AVX512VL-NEXT: [[TMP4:%.*]] = getelementptr float, <4 x
>>> float*> [[SHUFFLE1]], <4 x i64> <i64 10, i64 3, i64 14, i64 17>
>>> -; AVX512VL-NEXT: [[TMP5:%.*]] = insertelement <2 x float*>
>>> poison, float* [[TMP1]], i64 0
>>> -; AVX512VL-NEXT: [[TMP6:%.*]] = shufflevector <2 x float*>
>>> [[TMP5]], <2 x float*> poison, <2 x i32> zeroinitializer
>>> -; AVX512VL-NEXT: [[TMP7:%.*]] = getelementptr float, <2 x
>>> float*> [[TMP6]], <2 x i64> <i64 8, i64 5>
>>> -; AVX512VL-NEXT: [[TMP8:%.*]] = getelementptr inbounds float,
>>> float* [[TMP1]], i64 20
>>> -; AVX512VL-NEXT: [[TMP9:%.*]] = insertelement <8 x float*>
>>> poison, float* [[TMP1]], i64 0
>>> -; AVX512VL-NEXT: [[TMP10:%.*]] = shufflevector <4 x float*>
>>> [[TMP4]], <4 x float*> poison, <8 x i32> <i32 0, i32 1, i32 2, i32
>>> 3, i32 undef, i32 undef, i32 undef, i32 undef>
>>> -; AVX512VL-NEXT: [[TMP11:%.*]] = shufflevector <8 x float*>
>>> [[TMP9]], <8 x float*> [[TMP10]], <8 x i32> <i32 0, i32 8, i32 9,
>>> i32 10, i32 11, i32 undef, i32 undef, i32 undef>
>>> -; AVX512VL-NEXT: [[TMP12:%.*]] = shufflevector <2 x float*>
>>> [[TMP7]], <2 x float*> poison, <8 x i32> <i32 0, i32 1, i32 undef,
>>> i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
>>> -; AVX512VL-NEXT: [[TMP13:%.*]] = shufflevector <8 x float*>
>>> [[TMP11]], <8 x float*> [[TMP12]], <8 x i32> <i32 0, i32 1, i32 2,
>>> i32 3, i32 4, i32 8, i32 9, i32 undef>
>>> -; AVX512VL-NEXT: [[TMP14:%.*]] = insertelement <8 x float*>
>>> [[TMP13]], float* [[TMP8]], i64 7
>>> -; AVX512VL-NEXT: [[TMP15:%.*]] = call <8 x float>
>>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP14]], i32 4, <8
>>> x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1
>>> true, i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>> -; AVX512VL-NEXT: [[SHUFFLE:%.*]] = shufflevector <8 x float*>
>>> [[TMP9]], <8 x float*> poison, <8 x i32> zeroinitializer
>>> -; AVX512VL-NEXT: [[TMP16:%.*]] = getelementptr float, <8 x
>>> float*> [[SHUFFLE]], <8 x i64> <i64 4, i64 13, i64 11, i64 44, i64
>>> 33, i64 30, i64 27, i64 23>
>>> -; AVX512VL-NEXT: [[TMP17:%.*]] = call <8 x float>
>>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP16]], i32 4, <8
>>> x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1
>>> true, i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>> -; AVX512VL-NEXT: [[TMP18:%.*]] = fdiv <8 x float> [[TMP15]],
>>> [[TMP17]]
>>> +; AVX512VL-NEXT: [[TMP3:%.*]] = insertelement <8 x float*>
>>> poison, float* [[TMP1:%.*]], i64 0
>>> +; AVX512VL-NEXT: [[SHUFFLE:%.*]] = shufflevector <8 x float*>
>>> [[TMP3]], <8 x float*> poison, <8 x i32> zeroinitializer
>>> +; AVX512VL-NEXT: [[TMP4:%.*]] = getelementptr float, <8 x
>>> float*> [[SHUFFLE]], <8 x i64> <i64 4, i64 13, i64 11, i64 44, i64
>>> 33, i64 30, i64 27, i64 23>
>>> +; AVX512VL-NEXT: [[TMP5:%.*]] = insertelement <4 x float*>
>>> poison, float* [[TMP1]], i64 0
>>> +; AVX512VL-NEXT: [[SHUFFLE1:%.*]] = shufflevector <4 x float*>
>>> [[TMP5]], <4 x float*> poison, <4 x i32> zeroinitializer
>>> +; AVX512VL-NEXT: [[TMP6:%.*]] = getelementptr float, <4 x
>>> float*> [[SHUFFLE1]], <4 x i64> <i64 10, i64 3, i64 14, i64 17>
>>> +; AVX512VL-NEXT: [[TMP7:%.*]] = insertelement <2 x float*>
>>> poison, float* [[TMP1]], i64 0
>>> +; AVX512VL-NEXT: [[TMP8:%.*]] = shufflevector <2 x float*>
>>> [[TMP7]], <2 x float*> poison, <2 x i32> zeroinitializer
>>> +; AVX512VL-NEXT: [[TMP9:%.*]] = getelementptr float, <2 x
>>> float*> [[TMP8]], <2 x i64> <i64 8, i64 5>
>>> +; AVX512VL-NEXT: [[TMP10:%.*]] = getelementptr inbounds float,
>>> float* [[TMP1]], i64 20
>>> +; AVX512VL-NEXT: [[TMP11:%.*]] = shufflevector <4 x float*>
>>> [[TMP6]], <4 x float*> poison, <8 x i32> <i32 0, i32 1, i32 2, i32
>>> 3, i32 undef, i32 undef, i32 undef, i32 undef>
>>> +; AVX512VL-NEXT: [[TMP12:%.*]] = shufflevector <8 x float*>
>>> [[TMP3]], <8 x float*> [[TMP11]], <8 x i32> <i32 0, i32 8, i32 9,
>>> i32 10, i32 11, i32 undef, i32 undef, i32 undef>
>>> +; AVX512VL-NEXT: [[TMP13:%.*]] = shufflevector <2 x float*>
>>> [[TMP9]], <2 x float*> poison, <8 x i32> <i32 0, i32 1, i32 undef,
>>> i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
>>> +; AVX512VL-NEXT: [[TMP14:%.*]] = shufflevector <8 x float*>
>>> [[TMP12]], <8 x float*> [[TMP13]], <8 x i32> <i32 0, i32 1, i32 2,
>>> i32 3, i32 4, i32 8, i32 9, i32 undef>
>>> +; AVX512VL-NEXT: [[TMP15:%.*]] = insertelement <8 x float*>
>>> [[TMP14]], float* [[TMP10]], i64 7
>>> +; AVX512VL-NEXT: [[TMP16:%.*]] = call <8 x float>
>>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP15]], i32 4, <8
>>> x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1
>>> true, i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>> +; AVX512VL-NEXT: [[TMP17:%.*]] = call <8 x float>
>>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP4]], i32 4, <8 x
>>> i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true,
>>> i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>> +; AVX512VL-NEXT: [[TMP18:%.*]] = fdiv <8 x float> [[TMP16]],
>>> [[TMP17]]
>>> ; AVX512VL-NEXT: [[TMP19:%.*]] = bitcast float* [[TMP0:%.*]] to
>>> <8 x float>*
>>> ; AVX512VL-NEXT: store <8 x float> [[TMP18]], <8 x float>*
>>> [[TMP19]], align 4, !tbaa [[TBAA0]]
>>> ; AVX512VL-NEXT: ret void
>>>
>>> diff --git
>>> a/llvm/test/Transforms/SLPVectorizer/X86/shrink_after_reorder2.ll
>>> b/llvm/test/Transforms/SLPVectorizer/X86/shrink_after_reorder2.ll
>>> index a4a388e9d095c..6946ab292cdf5 100644
>>> --- a/llvm/test/Transforms/SLPVectorizer/X86/shrink_after_reorder2.ll
>>> +++ b/llvm/test/Transforms/SLPVectorizer/X86/shrink_after_reorder2.ll
>>> @@ -21,11 +21,11 @@ define void @foo(%class.e* %this, %struct.a* %p,
>>> i32 %add7) {
>>> ; CHECK-NEXT: i32 2, label [[SW_BB]]
>>> ; CHECK-NEXT: ]
>>> ; CHECK: sw.bb:
>>> -; CHECK-NEXT: [[TMP2:%.*]] = bitcast i32* [[G]] to <2 x i32>*
>>> -; CHECK-NEXT: [[TMP3:%.*]] = load <2 x i32>, <2 x i32>*
>>> [[TMP2]], align 4
>>> ; CHECK-NEXT: [[SHRINK_SHUFFLE:%.*]] = shufflevector <4 x i32>
>>> [[SHUFFLE]], <4 x i32> poison, <2 x i32> <i32 2, i32 0>
>>> -; CHECK-NEXT: [[TMP4:%.*]] = xor <2 x i32> [[SHRINK_SHUFFLE]],
>>> <i32 -1, i32 -1>
>>> -; CHECK-NEXT: [[TMP5:%.*]] = add <2 x i32> [[TMP3]], [[TMP4]]
>>> +; CHECK-NEXT: [[TMP2:%.*]] = xor <2 x i32> [[SHRINK_SHUFFLE]],
>>> <i32 -1, i32 -1>
>>> +; CHECK-NEXT: [[TMP3:%.*]] = bitcast i32* [[G]] to <2 x i32>*
>>> +; CHECK-NEXT: [[TMP4:%.*]] = load <2 x i32>, <2 x i32>*
>>> [[TMP3]], align 4
>>> +; CHECK-NEXT: [[TMP5:%.*]] = add <2 x i32> [[TMP4]], [[TMP2]]
>>> ; CHECK-NEXT: br label [[SW_EPILOG]]
>>> ; CHECK: sw.epilog:
>>> ; CHECK-NEXT: [[TMP6:%.*]] = phi <2 x i32> [ undef,
>>> [[ENTRY:%.*]] ], [ [[TMP5]], [[SW_BB]] ]
>>>
>>> diff --git
>>> a/llvm/test/Transforms/SLPVectorizer/X86/vectorize-widest-phis.ll
>>> b/llvm/test/Transforms/SLPVectorizer/X86/vectorize-widest-phis.ll
>>> index 87709a87b3692..109c27e4f4f4e 100644
>>> --- a/llvm/test/Transforms/SLPVectorizer/X86/vectorize-widest-phis.ll
>>> +++ b/llvm/test/Transforms/SLPVectorizer/X86/vectorize-widest-phis.ll
>>> @@ -16,8 +16,8 @@ define void @foo() {
>>> ; CHECK-NEXT: [[TMP3:%.*]] = load double, double* undef, align 8
>>> ; CHECK-NEXT: br i1 undef, label [[BB3]], label [[BB4:%.*]]
>>> ; CHECK: bb4:
>>> -; CHECK-NEXT: [[CONV2:%.*]] = uitofp i16 undef to double
>>> ; CHECK-NEXT: [[TMP4:%.*]] = fpext <4 x float> [[TMP2]] to <4 x
>>> double>
>>> +; CHECK-NEXT: [[CONV2:%.*]] = uitofp i16 undef to double
>>> ; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x double> <double
>>> undef, double poison>, double [[TMP3]], i32 1
>>> ; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x double> <double
>>> undef, double poison>, double [[CONV2]], i32 1
>>> ; CHECK-NEXT: [[TMP7:%.*]] = fsub <2 x double> [[TMP5]], [[TMP6]]
>>>
>>> diff --git a/llvm/test/Transforms/SLPVectorizer/slp-max-phi-size.ll
>>> b/llvm/test/Transforms/SLPVectorizer/slp-max-phi-size.ll
>>> index 33ba97921e878..da18a937a6477 100644
>>> --- a/llvm/test/Transforms/SLPVectorizer/slp-max-phi-size.ll
>>> +++ b/llvm/test/Transforms/SLPVectorizer/slp-max-phi-size.ll
>>> @@ -133,27 +133,27 @@ define void @phi_float32(half %hval, float
>>> %fval) {
>>> ; MAX256-NEXT: br label [[BB1:%.*]]
>>> ; MAX256: bb1:
>>> ; MAX256-NEXT: [[I:%.*]] = fpext half [[HVAL:%.*]] to float
>>> -; MAX256-NEXT: [[I3:%.*]] = fpext half [[HVAL]] to float
>>> -; MAX256-NEXT: [[I6:%.*]] = fpext half [[HVAL]] to float
>>> -; MAX256-NEXT: [[I9:%.*]] = fpext half [[HVAL]] to float
>>> ; MAX256-NEXT: [[TMP0:%.*]] = insertelement <8 x float> poison,
>>> float [[I]], i32 0
>>> ; MAX256-NEXT: [[SHUFFLE11:%.*]] = shufflevector <8 x float>
>>> [[TMP0]], <8 x float> poison, <8 x i32> zeroinitializer
>>> ; MAX256-NEXT: [[TMP1:%.*]] = insertelement <8 x float> poison,
>>> float [[FVAL:%.*]], i32 0
>>> ; MAX256-NEXT: [[SHUFFLE12:%.*]] = shufflevector <8 x float>
>>> [[TMP1]], <8 x float> poison, <8 x i32> zeroinitializer
>>> ; MAX256-NEXT: [[TMP2:%.*]] = fmul <8 x float> [[SHUFFLE11]],
>>> [[SHUFFLE12]]
>>> -; MAX256-NEXT: [[TMP3:%.*]] = fadd <8 x float> zeroinitializer,
>>> [[TMP2]]
>>> -; MAX256-NEXT: [[TMP4:%.*]] = insertelement <8 x float> poison,
>>> float [[I3]], i32 0
>>> -; MAX256-NEXT: [[SHUFFLE:%.*]] = shufflevector <8 x float>
>>> [[TMP4]], <8 x float> poison, <8 x i32> zeroinitializer
>>> -; MAX256-NEXT: [[TMP5:%.*]] = fmul <8 x float> [[SHUFFLE]],
>>> [[SHUFFLE12]]
>>> -; MAX256-NEXT: [[TMP6:%.*]] = fadd <8 x float> zeroinitializer,
>>> [[TMP5]]
>>> -; MAX256-NEXT: [[TMP7:%.*]] = insertelement <8 x float> poison,
>>> float [[I6]], i32 0
>>> -; MAX256-NEXT: [[SHUFFLE5:%.*]] = shufflevector <8 x float>
>>> [[TMP7]], <8 x float> poison, <8 x i32> zeroinitializer
>>> -; MAX256-NEXT: [[TMP8:%.*]] = fmul <8 x float> [[SHUFFLE5]],
>>> [[SHUFFLE12]]
>>> -; MAX256-NEXT: [[TMP9:%.*]] = fadd <8 x float> zeroinitializer,
>>> [[TMP8]]
>>> -; MAX256-NEXT: [[TMP10:%.*]] = insertelement <8 x float> poison,
>>> float [[I9]], i32 0
>>> -; MAX256-NEXT: [[SHUFFLE8:%.*]] = shufflevector <8 x float>
>>> [[TMP10]], <8 x float> poison, <8 x i32> zeroinitializer
>>> -; MAX256-NEXT: [[TMP11:%.*]] = fmul <8 x float> [[SHUFFLE8]],
>>> [[SHUFFLE12]]
>>> -; MAX256-NEXT: [[TMP12:%.*]] = fadd <8 x float> zeroinitializer,
>>> [[TMP11]]
>>> +; MAX256-NEXT: [[I3:%.*]] = fpext half [[HVAL]] to float
>>> +; MAX256-NEXT: [[TMP3:%.*]] = insertelement <8 x float> poison,
>>> float [[I3]], i32 0
>>> +; MAX256-NEXT: [[SHUFFLE:%.*]] = shufflevector <8 x float>
>>> [[TMP3]], <8 x float> poison, <8 x i32> zeroinitializer
>>> +; MAX256-NEXT: [[TMP4:%.*]] = fmul <8 x float> [[SHUFFLE]],
>>> [[SHUFFLE12]]
>>> +; MAX256-NEXT: [[I6:%.*]] = fpext half [[HVAL]] to float
>>> +; MAX256-NEXT: [[TMP5:%.*]] = insertelement <8 x float> poison,
>>> float [[I6]], i32 0
>>> +; MAX256-NEXT: [[SHUFFLE5:%.*]] = shufflevector <8 x float>
>>> [[TMP5]], <8 x float> poison, <8 x i32> zeroinitializer
>>> +; MAX256-NEXT: [[TMP6:%.*]] = fmul <8 x float> [[SHUFFLE5]],
>>> [[SHUFFLE12]]
>>> +; MAX256-NEXT: [[I9:%.*]] = fpext half [[HVAL]] to float
>>> +; MAX256-NEXT: [[TMP7:%.*]] = insertelement <8 x float> poison,
>>> float [[I9]], i32 0
>>> +; MAX256-NEXT: [[SHUFFLE8:%.*]] = shufflevector <8 x float>
>>> [[TMP7]], <8 x float> poison, <8 x i32> zeroinitializer
>>> +; MAX256-NEXT: [[TMP8:%.*]] = fmul <8 x float> [[SHUFFLE8]],
>>> [[SHUFFLE12]]
>>> +; MAX256-NEXT: [[TMP9:%.*]] = fadd <8 x float> zeroinitializer,
>>> [[TMP2]]
>>> +; MAX256-NEXT: [[TMP10:%.*]] = fadd <8 x float> zeroinitializer,
>>> [[TMP4]]
>>> +; MAX256-NEXT: [[TMP11:%.*]] = fadd <8 x float> zeroinitializer,
>>> [[TMP6]]
>>> +; MAX256-NEXT: [[TMP12:%.*]] = fadd <8 x float> zeroinitializer,
>>> [[TMP8]]
>>> ; MAX256-NEXT: switch i32 undef, label [[BB5:%.*]] [
>>> ; MAX256-NEXT: i32 0, label [[BB2:%.*]]
>>> ; MAX256-NEXT: i32 1, label [[BB3:%.*]]
>>> @@ -166,10 +166,10 @@ define void @phi_float32(half %hval, float
>>> %fval) {
>>> ; MAX256: bb5:
>>> ; MAX256-NEXT: br label [[BB2]]
>>> ; MAX256: bb2:
>>> -; MAX256-NEXT: [[TMP13:%.*]] = phi <8 x float> [ [[TMP6]],
>>> [[BB3]] ], [ [[SHUFFLE12]], [[BB4]] ], [ [[SHUFFLE12]], [[BB5]] ], [
>>> [[SHUFFLE12]], [[BB1]] ]
>>> -; MAX256-NEXT: [[TMP14:%.*]] = phi <8 x float> [ [[TMP9]],
>>> [[BB3]] ], [ [[SHUFFLE12]], [[BB4]] ], [ [[TMP9]], [[BB5]] ], [
>>> [[TMP9]], [[BB1]] ]
>>> +; MAX256-NEXT: [[TMP13:%.*]] = phi <8 x float> [ [[TMP10]],
>>> [[BB3]] ], [ [[SHUFFLE12]], [[BB4]] ], [ [[SHUFFLE12]], [[BB5]] ], [
>>> [[SHUFFLE12]], [[BB1]] ]
>>> +; MAX256-NEXT: [[TMP14:%.*]] = phi <8 x float> [ [[TMP11]],
>>> [[BB3]] ], [ [[SHUFFLE12]], [[BB4]] ], [ [[TMP11]], [[BB5]] ], [
>>> [[TMP11]], [[BB1]] ]
>>> ; MAX256-NEXT: [[TMP15:%.*]] = phi <8 x float> [ [[TMP12]],
>>> [[BB3]] ], [ [[TMP12]], [[BB4]] ], [ [[SHUFFLE12]], [[BB5]] ], [
>>> [[TMP12]], [[BB1]] ]
>>> -; MAX256-NEXT: [[TMP16:%.*]] = phi <8 x float> [ [[TMP3]],
>>> [[BB3]] ], [ [[TMP3]], [[BB4]] ], [ [[TMP3]], [[BB5]] ], [
>>> [[SHUFFLE12]], [[BB1]] ]
>>> +; MAX256-NEXT: [[TMP16:%.*]] = phi <8 x float> [ [[TMP9]],
>>> [[BB3]] ], [ [[TMP9]], [[BB4]] ], [ [[TMP9]], [[BB5]] ], [
>>> [[SHUFFLE12]], [[BB1]] ]
>>> ; MAX256-NEXT: [[TMP17:%.*]] = extractelement <8 x float>
>>> [[TMP14]], i32 7
>>> ; MAX256-NEXT: store float [[TMP17]], float* undef, align 4
>>> ; MAX256-NEXT: ret void
>>> @@ -179,27 +179,27 @@ define void @phi_float32(half %hval, float
>>> %fval) {
>>> ; MAX1024-NEXT: br label [[BB1:%.*]]
>>> ; MAX1024: bb1:
>>> ; MAX1024-NEXT: [[I:%.*]] = fpext half [[HVAL:%.*]] to float
>>> -; MAX1024-NEXT: [[I3:%.*]] = fpext half [[HVAL]] to float
>>> -; MAX1024-NEXT: [[I6:%.*]] = fpext half [[HVAL]] to float
>>> -; MAX1024-NEXT: [[I9:%.*]] = fpext half [[HVAL]] to float
>>> ; MAX1024-NEXT: [[TMP0:%.*]] = insertelement <8 x float>
>>> poison, float [[I]], i32 0
>>> ; MAX1024-NEXT: [[SHUFFLE11:%.*]] = shufflevector <8 x float>
>>> [[TMP0]], <8 x float> poison, <8 x i32> zeroinitializer
>>> ; MAX1024-NEXT: [[TMP1:%.*]] = insertelement <8 x float>
>>> poison, float [[FVAL:%.*]], i32 0
>>> ; MAX1024-NEXT: [[SHUFFLE12:%.*]] = shufflevector <8 x float>
>>> [[TMP1]], <8 x float> poison, <8 x i32> zeroinitializer
>>> ; MAX1024-NEXT: [[TMP2:%.*]] = fmul <8 x float> [[SHUFFLE11]],
>>> [[SHUFFLE12]]
>>> -; MAX1024-NEXT: [[TMP3:%.*]] = fadd <8 x float> zeroinitializer,
>>> [[TMP2]]
>>> -; MAX1024-NEXT: [[TMP4:%.*]] = insertelement <8 x float> poison,
>>> float [[I3]], i32 0
>>> -; MAX1024-NEXT: [[SHUFFLE:%.*]] = shufflevector <8 x float>
>>> [[TMP4]], <8 x float> poison, <8 x i32> zeroinitializer
>>> -; MAX1024-NEXT: [[TMP5:%.*]] = fmul <8 x float> [[SHUFFLE]],
>>> [[SHUFFLE12]]
>>> -; MAX1024-NEXT: [[TMP6:%.*]] = fadd <8 x float> zeroinitializer,
>>> [[TMP5]]
>>> -; MAX1024-NEXT: [[TMP7:%.*]] = insertelement <8 x float> poison,
>>> float [[I6]], i32 0
>>> -; MAX1024-NEXT: [[SHUFFLE5:%.*]] = shufflevector <8 x float>
>>> [[TMP7]], <8 x float> poison, <8 x i32> zeroinitializer
>>> -; MAX1024-NEXT: [[TMP8:%.*]] = fmul <8 x float> [[SHUFFLE5]],
>>> [[SHUFFLE12]]
>>> -; MAX1024-NEXT: [[TMP9:%.*]] = fadd <8 x float> zeroinitializer,
>>> [[TMP8]]
>>> -; MAX1024-NEXT: [[TMP10:%.*]] = insertelement <8 x float>
>>> poison, float [[I9]], i32 0
>>> -; MAX1024-NEXT: [[SHUFFLE8:%.*]] = shufflevector <8 x float>
>>> [[TMP10]], <8 x float> poison, <8 x i32> zeroinitializer
>>> -; MAX1024-NEXT: [[TMP11:%.*]] = fmul <8 x float> [[SHUFFLE8]],
>>> [[SHUFFLE12]]
>>> -; MAX1024-NEXT: [[TMP12:%.*]] = fadd <8 x float>
>>> zeroinitializer, [[TMP11]]
>>> +; MAX1024-NEXT: [[I3:%.*]] = fpext half [[HVAL]] to float
>>> +; MAX1024-NEXT: [[TMP3:%.*]] = insertelement <8 x float> poison,
>>> float [[I3]], i32 0
>>> +; MAX1024-NEXT: [[SHUFFLE:%.*]] = shufflevector <8 x float>
>>> [[TMP3]], <8 x float> poison, <8 x i32> zeroinitializer
>>> +; MAX1024-NEXT: [[TMP4:%.*]] = fmul <8 x float> [[SHUFFLE]],
>>> [[SHUFFLE12]]
>>> +; MAX1024-NEXT: [[I6:%.*]] = fpext half [[HVAL]] to float
>>> +; MAX1024-NEXT: [[TMP5:%.*]] = insertelement <8 x float> poison,
>>> float [[I6]], i32 0
>>> +; MAX1024-NEXT: [[SHUFFLE5:%.*]] = shufflevector <8 x float>
>>> [[TMP5]], <8 x float> poison, <8 x i32> zeroinitializer
>>> +; MAX1024-NEXT: [[TMP6:%.*]] = fmul <8 x float> [[SHUFFLE5]],
>>> [[SHUFFLE12]]
>>> +; MAX1024-NEXT: [[I9:%.*]] = fpext half [[HVAL]] to float
>>> +; MAX1024-NEXT: [[TMP7:%.*]] = insertelement <8 x float> poison,
>>> float [[I9]], i32 0
>>> +; MAX1024-NEXT: [[SHUFFLE8:%.*]] = shufflevector <8 x float>
>>> [[TMP7]], <8 x float> poison, <8 x i32> zeroinitializer
>>> +; MAX1024-NEXT: [[TMP8:%.*]] = fmul <8 x float> [[SHUFFLE8]],
>>> [[SHUFFLE12]]
>>> +; MAX1024-NEXT: [[TMP9:%.*]] = fadd <8 x float> zeroinitializer,
>>> [[TMP2]]
>>> +; MAX1024-NEXT: [[TMP10:%.*]] = fadd <8 x float>
>>> zeroinitializer, [[TMP4]]
>>> +; MAX1024-NEXT: [[TMP11:%.*]] = fadd <8 x float>
>>> zeroinitializer, [[TMP6]]
>>> +; MAX1024-NEXT: [[TMP12:%.*]] = fadd <8 x float>
>>> zeroinitializer, [[TMP8]]
>>> ; MAX1024-NEXT: switch i32 undef, label [[BB5:%.*]] [
>>> ; MAX1024-NEXT: i32 0, label [[BB2:%.*]]
>>> ; MAX1024-NEXT: i32 1, label [[BB3:%.*]]
>>> @@ -212,10 +212,10 @@ define void @phi_float32(half %hval, float
>>> %fval) {
>>> ; MAX1024: bb5:
>>> ; MAX1024-NEXT: br label [[BB2]]
>>> ; MAX1024: bb2:
>>> -; MAX1024-NEXT: [[TMP13:%.*]] = phi <8 x float> [ [[TMP6]],
>>> [[BB3]] ], [ [[SHUFFLE12]], [[BB4]] ], [ [[SHUFFLE12]], [[BB5]] ], [
>>> [[SHUFFLE12]], [[BB1]] ]
>>> -; MAX1024-NEXT: [[TMP14:%.*]] = phi <8 x float> [ [[TMP9]],
>>> [[BB3]] ], [ [[SHUFFLE12]], [[BB4]] ], [ [[TMP9]], [[BB5]] ], [
>>> [[TMP9]], [[BB1]] ]
>>> +; MAX1024-NEXT: [[TMP13:%.*]] = phi <8 x float> [ [[TMP10]],
>>> [[BB3]] ], [ [[SHUFFLE12]], [[BB4]] ], [ [[SHUFFLE12]], [[BB5]] ], [
>>> [[SHUFFLE12]], [[BB1]] ]
>>> +; MAX1024-NEXT: [[TMP14:%.*]] = phi <8 x float> [ [[TMP11]],
>>> [[BB3]] ], [ [[SHUFFLE12]], [[BB4]] ], [ [[TMP11]], [[BB5]] ], [
>>> [[TMP11]], [[BB1]] ]
>>> ; MAX1024-NEXT: [[TMP15:%.*]] = phi <8 x float> [ [[TMP12]],
>>> [[BB3]] ], [ [[TMP12]], [[BB4]] ], [ [[SHUFFLE12]], [[BB5]] ], [
>>> [[TMP12]], [[BB1]] ]
>>> -; MAX1024-NEXT: [[TMP16:%.*]] = phi <8 x float> [ [[TMP3]],
>>> [[BB3]] ], [ [[TMP3]], [[BB4]] ], [ [[TMP3]], [[BB5]] ], [
>>> [[SHUFFLE12]], [[BB1]] ]
>>> +; MAX1024-NEXT: [[TMP16:%.*]] = phi <8 x float> [ [[TMP9]],
>>> [[BB3]] ], [ [[TMP9]], [[BB4]] ], [ [[TMP9]], [[BB5]] ], [
>>> [[SHUFFLE12]], [[BB1]] ]
>>> ; MAX1024-NEXT: [[TMP17:%.*]] = extractelement <8 x float>
>>> [[TMP14]], i32 7
>>> ; MAX1024-NEXT: store float [[TMP17]], float* undef, align 4
>>> ; MAX1024-NEXT: ret void
>>>
>>>
>>> _______________________________________________
>>> llvm-commits mailing list
>>> llvm-commits at lists.llvm.org
>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits