[llvm] d65cc85 - [SLP]Do not schedule instructions with constants/argument/phi operands and external users.

Philip Reames via llvm-commits llvm-commits at lists.llvm.org
Fri Mar 18 13:27:57 PDT 2022


I added a comment to the existing code in 1093949cf which more fully 
explains the missing dependency and hidden assumption.

I am not 100% sure your code has the same problem.  I'd suggest 
exploring combinations such as a potentially faulting udiv following a 
readnone call (one which may contain an infinite loop) with 
block-invariant operands.  I don't have a particular test case for you 
because massaging the code into actually reordering is quite involved; 
I tried, but did not manage to create one within a few minutes of 
trying.
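
To make the pattern concrete, here is a rough sketch of the shape of IR 
I have in mind (the function and value names are made up, and this is 
only an illustration of the pattern, not a reproducer I've confirmed 
actually triggers the reordering):

declare i32 @maybe_loops(i32) readnone

define <2 x i32> @probe(i32 %a, i32 %b, i32 %c, i32 %d) {
entry:
  %q0 = udiv i32 %a, %c                  ; may fault if %c == 0
  %call = call i32 @maybe_loops(i32 %a)  ; readnone, may never return
  %q1 = udiv i32 %b, %d                  ; may fault if %d == 0
  %v0 = insertelement <2 x i32> poison, i32 %q0, i32 0
  %v1 = insertelement <2 x i32> %v0, i32 %q1, i32 1
  ret <2 x i32> %v1
}

Both udivs have only argument operands, so a bundle of them would 
appear to need no scheduling; if the vectorized udiv were then emitted 
at %q0's position, %q1's potentially faulting division would be hoisted 
above a call that may never return.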

Philip

On 3/18/22 10:26, Philip Reames via llvm-commits wrote:
> FYI, I'm pretty sure this patch is wrong.  The case which I believe it 
> gets wrong involves a bundle containing a readonly call which is not 
> guaranteed to return (i.e., it may contain an infinite loop).  If I'm 
> reading the code correctly, it may reorder such a call earlier in the 
> basic block - including reordering two such calls in the process.
>
> This is the same bug which existed in D118538, which is why I noticed it.
>
> If this case isn't possible for some reason, please add test coverage 
> and clarify comments as to why.
>
> Philip
>
> On 3/17/22 11:04, Alexey Bataev via llvm-commits wrote:
>> Author: Alexey Bataev
>> Date: 2022-03-17T11:03:45-07:00
>> New Revision: d65cc8597792ab04142cd2214c46c5c167191bcd
>>
>> URL: 
>> https://github.com/llvm/llvm-project/commit/d65cc8597792ab04142cd2214c46c5c167191bcd
>> DIFF: 
>> https://github.com/llvm/llvm-project/commit/d65cc8597792ab04142cd2214c46c5c167191bcd.diff
>>
>> LOG: [SLP]Do not schedule instructions with constants/argument/phi 
>> operands and external users.
>>
>> No need to schedule entry nodes whose instructions are not memory
>> read/write instructions and whose operands are all constants,
>> arguments, phis, or instructions from other blocks, or whose users are
>> all phis or in other blocks.
>> The resulting vector instructions can be placed at the beginning of the
>> basic block without scheduling (if the operands do not need to be
>> scheduled) or at the end of the block (if the users are outside of the
>> block).
>> It may save some compile time and scheduling resources.
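
As a concrete illustration of such an entry (a hypothetical sketch, not 
one of the tests touched by the patch): both adds below use only 
argument and constant operands and are used only by phis in another 
block, so under the new predicates the bundle needs no ScheduleData and 
the vectorized add can simply be emitted at the beginning or end of 
%body.

define <2 x i32> @no_scheduling_needed(i32 %a, i32 %b) {
entry:
  br label %body

body:
  %x = add i32 %a, 1           ; operands: argument + constant
  %y = add i32 %b, 2
  br label %exit

exit:
  %px = phi i32 [ %x, %body ]  ; only users are phis in another block
  %py = phi i32 [ %y, %body ]
  %v0 = insertelement <2 x i32> poison, i32 %px, i32 0
  %v1 = insertelement <2 x i32> %v0, i32 %py, i32 1
  ret <2 x i32> %v1
}
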
>>
>> Differential Revision: https://reviews.llvm.org/D121121
>>
>> Added:
>>
>> Modified:
>>      llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
>>      llvm/test/Transforms/SLPVectorizer/AArch64/gather-reduce.ll
>>      llvm/test/Transforms/SLPVectorizer/AArch64/gather-root.ll
>>      llvm/test/Transforms/SLPVectorizer/AArch64/spillcost-di.ll
>>      llvm/test/Transforms/SLPVectorizer/X86/PR35628_2.ll
>>      llvm/test/Transforms/SLPVectorizer/X86/PR40310.ll
>> llvm/test/Transforms/SLPVectorizer/X86/crash_exceed_scheduling.ll
>>      llvm/test/Transforms/SLPVectorizer/X86/opaque-ptr.ll
>> llvm/test/Transforms/SLPVectorizer/X86/pr47629-inseltpoison.ll
>>      llvm/test/Transforms/SLPVectorizer/X86/pr47629.ll
>> llvm/test/Transforms/SLPVectorizer/X86/shrink_after_reorder2.ll
>> llvm/test/Transforms/SLPVectorizer/X86/vectorize-widest-phis.ll
>>      llvm/test/Transforms/SLPVectorizer/slp-max-phi-size.ll
>>
>> Removed:
>>
>>
>> ################################################################################ 
>>
>> diff  --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp 
>> b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
>> index 48382a12fcf3c..9ab31198adaab 100644
>> --- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
>> +++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
>> @@ -776,6 +776,57 @@ static void reorderScalars(SmallVectorImpl<Value 
>> *> &Scalars,
>>         Scalars[Mask[I]] = Prev[I];
>>   }
>>   +/// Checks if the provided value does not require scheduling. It 
>> does not
>> +/// require scheduling if this is not an instruction or it is an 
>> instruction
>> +/// that does not read/write memory and all operands are either not 
>> instructions
>> +/// or phi nodes or instructions from different blocks.
>> +static bool areAllOperandsNonInsts(Value *V) {
>> +  auto *I = dyn_cast<Instruction>(V);
>> +  if (!I)
>> +    return true;
>> +  return !I->mayReadOrWriteMemory() && all_of(I->operands(), 
>> [I](Value *V) {
>> +    auto *IO = dyn_cast<Instruction>(V);
>> +    if (!IO)
>> +      return true;
>> +    return isa<PHINode>(IO) || IO->getParent() != I->getParent();
>> +  });
>> +}
>> +
>> +/// Checks if the provided value does not require scheduling. It 
>> does not
>> +/// require scheduling if this is not an instruction or it is an 
>> instruction
>> +/// that does not read/write memory and all users are phi nodes or 
>> instructions
>> +/// from the different blocks.
>> +static bool isUsedOutsideBlock(Value *V) {
>> +  auto *I = dyn_cast<Instruction>(V);
>> +  if (!I)
>> +    return true;
>> +  // Limits the number of uses to save compile time.
>> +  constexpr int UsesLimit = 8;
>> +  return !I->mayReadOrWriteMemory() && !I->hasNUsesOrMore(UsesLimit) &&
>> +         all_of(I->users(), [I](User *U) {
>> +           auto *IU = dyn_cast<Instruction>(U);
>> +           if (!IU)
>> +             return true;
>> +           return IU->getParent() != I->getParent() || 
>> isa<PHINode>(IU);
>> +         });
>> +}
>> +
>> +/// Checks if the specified value does not require scheduling. It 
>> does not
>> +/// require scheduling if all operands and all users do not need to 
>> be scheduled
>> +/// in the current basic block.
>> +static bool doesNotNeedToBeScheduled(Value *V) {
>> +  return areAllOperandsNonInsts(V) && isUsedOutsideBlock(V);
>> +}
>> +
>> +/// Checks if the specified array of instructions does not require 
>> scheduling.
>> +/// It is so if all either instructions have operands that do not 
>> require
>> +/// scheduling or their users do not require scheduling since they 
>> are phis or
>> +/// in other basic blocks.
>> +static bool doesNotNeedToSchedule(ArrayRef<Value *> VL) {
>> +  return !VL.empty() &&
>> +         (all_of(VL, isUsedOutsideBlock) || all_of(VL, 
>> areAllOperandsNonInsts));
>> +}
>> +
>>   namespace slpvectorizer {
>>     /// Bottom Up SLP Vectorizer.
>> @@ -2359,15 +2410,21 @@ class BoUpSLP {
>>           ScalarToTreeEntry[V] = Last;
>>         }
>>         // Update the scheduler bundle to point to this TreeEntry.
>> -      unsigned Lane = 0;
>> -      for (ScheduleData *BundleMember = Bundle.getValue(); 
>> BundleMember;
>> -           BundleMember = BundleMember->NextInBundle) {
>> -        BundleMember->TE = Last;
>> -        BundleMember->Lane = Lane;
>> -        ++Lane;
>> -      }
>> -      assert((!Bundle.getValue() || Lane == VL.size()) &&
>> +      ScheduleData *BundleMember = Bundle.getValue();
>> +      assert((BundleMember || isa<PHINode>(S.MainOp) ||
>> +              isVectorLikeInstWithConstOps(S.MainOp) ||
>> +              doesNotNeedToSchedule(VL)) &&
>>                "Bundle and VL out of sync");
>> +      if (BundleMember) {
>> +        for (Value *V : VL) {
>> +          if (doesNotNeedToBeScheduled(V))
>> +            continue;
>> +          assert(BundleMember && "Unexpected end of bundle.");
>> +          BundleMember->TE = Last;
>> +          BundleMember = BundleMember->NextInBundle;
>> +        }
>> +      }
>> +      assert(!BundleMember && "Bundle and VL out of sync");
>>       } else {
>>         MustGather.insert(VL.begin(), VL.end());
>>       }
>> @@ -2504,7 +2561,6 @@ class BoUpSLP {
>>         clearDependencies();
>>         OpValue = OpVal;
>>         TE = nullptr;
>> -      Lane = -1;
>>       }
>>         /// Verify basic self consistency properties
>> @@ -2544,7 +2600,7 @@ class BoUpSLP {
>>       /// Returns true if it represents an instruction bundle and not 
>> only a
>>       /// single instruction.
>>       bool isPartOfBundle() const {
>> -      return NextInBundle != nullptr || FirstInBundle != this;
>> +      return NextInBundle != nullptr || FirstInBundle != this || TE;
>>       }
>>         /// Returns true if it is ready for scheduling, i.e. it has 
>> no more
>> @@ -2649,9 +2705,6 @@ class BoUpSLP {
>>       /// Note that this is negative as long as Dependencies is not 
>> calculated.
>>       int UnscheduledDeps = InvalidDeps;
>>   -    /// The lane of this node in the TreeEntry.
>> -    int Lane = -1;
>> -
>>       /// True if this instruction is scheduled (or considered as 
>> scheduled in the
>>       /// dry-run).
>>       bool IsScheduled = false;
>> @@ -2669,6 +2722,21 @@ class BoUpSLP {
>>     friend struct DOTGraphTraits<BoUpSLP *>;
>>       /// Contains all scheduling data for a basic block.
>> +  /// It does not schedules instructions, which are not memory 
>> read/write
>> +  /// instructions and their operands are either constants, or 
>> arguments, or
>> +  /// phis, or instructions from others blocks, or their users are 
>> phis or from
>> +  /// the other blocks. The resulting vector instructions can be 
>> placed at the
>> +  /// beginning of the basic block without scheduling (if operands 
>> does not need
>> +  /// to be scheduled) or at the end of the block (if users are 
>> outside of the
>> +  /// block). It allows to save some compile time and memory used by 
>> the
>> +  /// compiler.
>> +  /// ScheduleData is assigned for each instruction in between the 
>> boundaries of
>> +  /// the tree entry, even for those, which are not part of the 
>> graph. It is
>> +  /// required to correctly follow the dependencies between the 
>> instructions and
>> +  /// their correct scheduling. The ScheduleData is not allocated 
>> for the
>> +  /// instructions, which do not require scheduling, like phis, 
>> nodes with
>> +  /// extractelements/insertelements only or nodes with 
>> instructions, with
>> +  /// uses/operands outside of the block.
>>     struct BlockScheduling {
>>       BlockScheduling(BasicBlock *BB)
>>           : BB(BB), ChunkSize(BB->size()), ChunkPos(ChunkSize) {}
>> @@ -2696,7 +2764,7 @@ class BoUpSLP {
>>         if (BB != I->getParent())
>>           // Avoid lookup if can't possibly be in map.
>>           return nullptr;
>> -      ScheduleData *SD = ScheduleDataMap[I];
>> +      ScheduleData *SD = ScheduleDataMap.lookup(I);
>>         if (SD && isInSchedulingRegion(SD))
>>           return SD;
>>         return nullptr;
>> @@ -2713,7 +2781,7 @@ class BoUpSLP {
>>           return getScheduleData(V);
>>         auto I = ExtraScheduleDataMap.find(V);
>>         if (I != ExtraScheduleDataMap.end()) {
>> -        ScheduleData *SD = I->second[Key];
>> +        ScheduleData *SD = I->second.lookup(Key);
>>           if (SD && isInSchedulingRegion(SD))
>>             return SD;
>>         }
>> @@ -2735,7 +2803,7 @@ class BoUpSLP {
>>              BundleMember = BundleMember->NextInBundle) {
>>           if (BundleMember->Inst != BundleMember->OpValue)
>>             continue;
>> -
>> +
>>           // Handle the def-use chain dependencies.
>>             // Decrement the unscheduled counter and insert to ready 
>> list if ready.
>> @@ -2760,7 +2828,9 @@ class BoUpSLP {
>>           // reordered during buildTree(). We therefore need to get 
>> its operands
>>           // through the TreeEntry.
>>           if (TreeEntry *TE = BundleMember->TE) {
>> -          int Lane = BundleMember->Lane;
>> +          // Need to search for the lane since the tree entry can be 
>> reordered.
>> +          int Lane = std::distance(TE->Scalars.begin(),
>> +                                   find(TE->Scalars, 
>> BundleMember->Inst));
>>             assert(Lane >= 0 && "Lane not set");
>>               // Since vectorization tree is being built recursively 
>> this assertion
>> @@ -2769,7 +2839,7 @@ class BoUpSLP {
>>             // where their second (immediate) operand is not added. 
>> Since
>>             // immediates do not affect scheduler behavior this is 
>> considered
>>             // okay.
>> -          auto *In = TE->getMainOp();
>> +          auto *In = BundleMember->Inst;
>>             assert(In &&
>>                    (isa<ExtractValueInst>(In) || 
>> isa<ExtractElementInst>(In) ||
>>                     In->getNumOperands() == TE->getNumOperands()) &&
>> @@ -2814,7 +2884,8 @@ class BoUpSLP {
>>           for (auto *I = ScheduleStart; I != ScheduleEnd; I = 
>> I->getNextNode()) {
>>           auto *SD = getScheduleData(I);
>> -        assert(SD && "primary scheduledata must exist in window");
>> +        if (!SD)
>> +          continue;
>>           assert(isInSchedulingRegion(SD) &&
>>                  "primary schedule data not in window?");
>>           assert(isInSchedulingRegion(SD->FirstInBundle) &&
>> @@ -3856,6 +3927,22 @@ static LoadsState 
>> canVectorizeLoads(ArrayRef<Value *> VL, const Value *VL0,
>>     return LoadsState::Gather;
>>   }
>>   +/// \return true if the specified list of values has only one 
>> instruction that
>> +/// requires scheduling, false otherwise.
>> +static bool needToScheduleSingleInstruction(ArrayRef<Value *> VL) {
>> +  Value *NeedsScheduling = nullptr;
>> +  for (Value *V : VL) {
>> +    if (doesNotNeedToBeScheduled(V))
>> +      continue;
>> +    if (!NeedsScheduling) {
>> +      NeedsScheduling = V;
>> +      continue;
>> +    }
>> +    return false;
>> +  }
>> +  return NeedsScheduling;
>> +}
>> +
>>   void BoUpSLP::buildTree_rec(ArrayRef<Value *> VL, unsigned Depth,
>>                               const EdgeInfo &UserTreeIdx) {
>>     assert((allConstant(VL) || allSameType(VL)) && "Invalid types!");
>> @@ -6396,6 +6483,44 @@ void BoUpSLP::setInsertPointAfterBundle(const 
>> TreeEntry *E) {
>>       return !E->isOpcodeOrAlt(I) || I->getParent() == BB;
>>     }));
>>   +  auto &&FindLastInst = [E, Front]() {
>> +    Instruction *LastInst = Front;
>> +    for (Value *V : E->Scalars) {
>> +      auto *I = dyn_cast<Instruction>(V);
>> +      if (!I)
>> +        continue;
>> +      if (LastInst->comesBefore(I))
>> +        LastInst = I;
>> +    }
>> +    return LastInst;
>> +  };
>> +
>> +  auto &&FindFirstInst = [E, Front]() {
>> +    Instruction *FirstInst = Front;
>> +    for (Value *V : E->Scalars) {
>> +      auto *I = dyn_cast<Instruction>(V);
>> +      if (!I)
>> +        continue;
>> +      if (I->comesBefore(FirstInst))
>> +        FirstInst = I;
>> +    }
>> +    return FirstInst;
>> +  };
>> +
>> +  // Set the insert point to the beginning of the basic block if the 
>> entry
>> +  // should not be scheduled.
>> +  if (E->State != TreeEntry::NeedToGather &&
>> +      doesNotNeedToSchedule(E->Scalars)) {
>> +    BasicBlock::iterator InsertPt;
>> +    if (all_of(E->Scalars, isUsedOutsideBlock))
>> +      InsertPt = FindLastInst()->getIterator();
>> +    else
>> +      InsertPt = FindFirstInst()->getIterator();
>> +    Builder.SetInsertPoint(BB, InsertPt);
>> +    Builder.SetCurrentDebugLocation(Front->getDebugLoc());
>> +    return;
>> +  }
>> +
>>     // The last instruction in the bundle in program order.
>>     Instruction *LastInst = nullptr;
>>   @@ -6404,8 +6529,10 @@ void 
>> BoUpSLP::setInsertPointAfterBundle(const TreeEntry *E) {
>>     // VL.back() and iterate over schedule data until we reach the 
>> end of the
>>     // bundle. The end of the bundle is marked by null ScheduleData.
>>     if (BlocksSchedules.count(BB)) {
>> -    auto *Bundle =
>> - BlocksSchedules[BB]->getScheduleData(E->isOneOf(E->Scalars.back()));
>> +    Value *V = E->isOneOf(E->Scalars.back());
>> +    if (doesNotNeedToBeScheduled(V))
>> +      V = *find_if_not(E->Scalars, doesNotNeedToBeScheduled);
>> +    auto *Bundle = BlocksSchedules[BB]->getScheduleData(V);
>>       if (Bundle && Bundle->isPartOfBundle())
>>         for (; Bundle; Bundle = Bundle->NextInBundle)
>>           if (Bundle->OpValue == Bundle->Inst)
>> @@ -6430,15 +6557,8 @@ void BoUpSLP::setInsertPointAfterBundle(const 
>> TreeEntry *E) {
>>     // not ideal. However, this should be exceedingly rare since it 
>> requires that
>>     // we both exit early from buildTree_rec and that the bundle be 
>> out-of-order
>>     // (causing us to iterate all the way to the end of the block).
>> -  if (!LastInst) {
>> -    SmallPtrSet<Value *, 16> Bundle(E->Scalars.begin(), 
>> E->Scalars.end());
>> -    for (auto &I : make_range(BasicBlock::iterator(Front), 
>> BB->end())) {
>> -      if (Bundle.erase(&I) && E->isOpcodeOrAlt(&I))
>> -        LastInst = &I;
>> -      if (Bundle.empty())
>> -        break;
>> -    }
>> -  }
>> +  if (!LastInst)
>> +    LastInst = FindLastInst();
>>     assert(LastInst && "Failed to find last instruction in bundle");
>>       // Set the insertion point after the last instruction in the 
>> bundle. Set the
>> @@ -7631,9 +7751,11 @@ void BoUpSLP::optimizeGatherSequence() {
>>     BoUpSLP::ScheduleData *
>>   BoUpSLP::BlockScheduling::buildBundle(ArrayRef<Value *> VL) {
>> -  ScheduleData *Bundle = nullptr;
>> +  ScheduleData *Bundle = nullptr;
>>     ScheduleData *PrevInBundle = nullptr;
>>     for (Value *V : VL) {
>> +    if (doesNotNeedToBeScheduled(V))
>> +      continue;
>>       ScheduleData *BundleMember = getScheduleData(V);
>>       assert(BundleMember &&
>>              "no ScheduleData for bundle member "
>> @@ -7661,7 +7783,8 @@ 
>> BoUpSLP::BlockScheduling::tryScheduleBundle(ArrayRef<Value *> VL, 
>> BoUpSLP *SLP,
>>                                               const InstructionsState 
>> &S) {
>>     // No need to schedule PHIs, insertelement, extractelement and 
>> extractvalue
>>     // instructions.
>> -  if (isa<PHINode>(S.OpValue) || 
>> isVectorLikeInstWithConstOps(S.OpValue))
>> +  if (isa<PHINode>(S.OpValue) || 
>> isVectorLikeInstWithConstOps(S.OpValue) ||
>> +      doesNotNeedToSchedule(VL))
>>       return nullptr;
>>       // Initialize the instruction bundle.
>> @@ -7707,6 +7830,8 @@ 
>> BoUpSLP::BlockScheduling::tryScheduleBundle(ArrayRef<Value *> VL, 
>> BoUpSLP *SLP,
>>     // Make sure that the scheduling region contains all
>>     // instructions of the bundle.
>>     for (Value *V : VL) {
>> +    if (doesNotNeedToBeScheduled(V))
>> +      continue;
>>       if (!extendSchedulingRegion(V, S)) {
>>         // If the scheduling region got new instructions at the lower 
>> end (or it
>>         // is a new region for the first bundle). This makes it 
>> necessary to
>> @@ -7721,6 +7846,8 @@ 
>> BoUpSLP::BlockScheduling::tryScheduleBundle(ArrayRef<Value *> VL, 
>> BoUpSLP *SLP,
>>       bool ReSchedule = false;
>>     for (Value *V : VL) {
>> +    if (doesNotNeedToBeScheduled(V))
>> +      continue;
>>       ScheduleData *BundleMember = getScheduleData(V);
>>       assert(BundleMember &&
>>              "no ScheduleData for bundle member (maybe not in same 
>> basic block)");
>> @@ -7750,14 +7877,18 @@ 
>> BoUpSLP::BlockScheduling::tryScheduleBundle(ArrayRef<Value *> VL, 
>> BoUpSLP *SLP,
>>     void BoUpSLP::BlockScheduling::cancelScheduling(ArrayRef<Value *> 
>> VL,
>>                                                   Value *OpValue) {
>> -  if (isa<PHINode>(OpValue) || isVectorLikeInstWithConstOps(OpValue))
>> +  if (isa<PHINode>(OpValue) || isVectorLikeInstWithConstOps(OpValue) ||
>> +      doesNotNeedToSchedule(VL))
>>       return;
>>   +  if (doesNotNeedToBeScheduled(OpValue))
>> +    OpValue = *find_if_not(VL, doesNotNeedToBeScheduled);
>>     ScheduleData *Bundle = getScheduleData(OpValue);
>>     LLVM_DEBUG(dbgs() << "SLP:  cancel scheduling of " << *Bundle << 
>> "\n");
>>     assert(!Bundle->IsScheduled &&
>>            "Can't cancel bundle which is already scheduled");
>> -  assert(Bundle->isSchedulingEntity() && Bundle->isPartOfBundle() &&
>> +  assert(Bundle->isSchedulingEntity() &&
>> +         (Bundle->isPartOfBundle() || 
>> needToScheduleSingleInstruction(VL)) &&
>>            "tried to unbundle something which is not a bundle");
>>       // Remove the bundle from the ready list.
>> @@ -7771,6 +7902,7 @@ void 
>> BoUpSLP::BlockScheduling::cancelScheduling(ArrayRef<Value *> VL,
>>       BundleMember->FirstInBundle = BundleMember;
>>       ScheduleData *Next = BundleMember->NextInBundle;
>>       BundleMember->NextInBundle = nullptr;
>> +    BundleMember->TE = nullptr;
>>       if (BundleMember->unscheduledDepsInBundle() == 0) {
>>         ReadyInsts.insert(BundleMember);
>>       }
>> @@ -7794,6 +7926,7 @@ bool 
>> BoUpSLP::BlockScheduling::extendSchedulingRegion(Value *V,
>>     Instruction *I = dyn_cast<Instruction>(V);
>>     assert(I && "bundle member must be an instruction");
>>     assert(!isa<PHINode>(I) && !isVectorLikeInstWithConstOps(I) &&
>> +         !doesNotNeedToBeScheduled(I) &&
>>            "phi nodes/insertelements/extractelements/extractvalues 
>> don't need to "
>>            "be scheduled");
>>     auto &&CheckScheduleForI = [this, &S](Instruction *I) -> bool {
>> @@ -7870,7 +8003,10 @@ void 
>> BoUpSLP::BlockScheduling::initScheduleData(Instruction *FromI,
>>                                                   ScheduleData 
>> *NextLoadStore) {
>>     ScheduleData *CurrentLoadStore = PrevLoadStore;
>>     for (Instruction *I = FromI; I != ToI; I = I->getNextNode()) {
>> -    ScheduleData *SD = ScheduleDataMap[I];
>> +    // No need to allocate data for non-schedulable instructions.
>> +    if (doesNotNeedToBeScheduled(I))
>> +      continue;
>> +    ScheduleData *SD = ScheduleDataMap.lookup(I);
>>       if (!SD) {
>>         SD = allocateScheduleDataChunks();
>>         ScheduleDataMap[I] = SD;
>> @@ -8054,8 +8190,10 @@ void BoUpSLP::scheduleBlock(BlockScheduling 
>> *BS) {
>>     for (auto *I = BS->ScheduleStart; I != BS->ScheduleEnd;
>>          I = I->getNextNode()) {
>>       BS->doForAllOpcodes(I, [this, &Idx, &NumToSchedule, 
>> BS](ScheduleData *SD) {
>> +      TreeEntry *SDTE = getTreeEntry(SD->Inst);
>>         assert((isVectorLikeInstWithConstOps(SD->Inst) ||
>> -              SD->isPartOfBundle() == (getTreeEntry(SD->Inst) != 
>> nullptr)) &&
>> +              SD->isPartOfBundle() ==
>> +                  (SDTE && !doesNotNeedToSchedule(SDTE->Scalars))) &&
>>                "scheduler and vectorizer bundle mismatch");
>>         SD->FirstInBundle->SchedulingPriority = Idx++;
>>         if (SD->isSchedulingEntity()) {
>>
>> diff  --git 
>> a/llvm/test/Transforms/SLPVectorizer/AArch64/gather-reduce.ll 
>> b/llvm/test/Transforms/SLPVectorizer/AArch64/gather-reduce.ll
>> index 536f72a73684e..ec7b03af83f8b 100644
>> --- a/llvm/test/Transforms/SLPVectorizer/AArch64/gather-reduce.ll
>> +++ b/llvm/test/Transforms/SLPVectorizer/AArch64/gather-reduce.ll
>> @@ -36,6 +36,7 @@ define i32 @gather_reduce_8x16_i32(i16* nocapture 
>> readonly %a, i16* nocapture re
>>   ; GENERIC-NEXT:    [[I_0103:%.*]] = phi i32 [ [[INC:%.*]], 
>> [[FOR_BODY]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
>>   ; GENERIC-NEXT:    [[SUM_0102:%.*]] = phi i32 [ [[ADD66]], 
>> [[FOR_BODY]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
>>   ; GENERIC-NEXT:    [[A_ADDR_0101:%.*]] = phi i16* [ 
>> [[INCDEC_PTR58:%.*]], [[FOR_BODY]] ], [ [[A:%.*]], 
>> [[FOR_BODY_PREHEADER]] ]
>> +; GENERIC-NEXT:    [[INCDEC_PTR58]] = getelementptr inbounds i16, 
>> i16* [[A_ADDR_0101]], i64 8
>>   ; GENERIC-NEXT:    [[TMP0:%.*]] = bitcast i16* [[A_ADDR_0101]] to 
>> <8 x i16>*
>>   ; GENERIC-NEXT:    [[TMP1:%.*]] = load <8 x i16>, <8 x i16>* 
>> [[TMP0]], align 2
>>   ; GENERIC-NEXT:    [[TMP2:%.*]] = zext <8 x i16> [[TMP1]] to <8 x i32>
>> @@ -85,7 +86,6 @@ define i32 @gather_reduce_8x16_i32(i16* nocapture 
>> readonly %a, i16* nocapture re
>>   ; GENERIC-NEXT:    [[TMP27:%.*]] = load i16, i16* [[ARRAYIDX55]], 
>> align 2
>>   ; GENERIC-NEXT:    [[CONV56:%.*]] = zext i16 [[TMP27]] to i32
>>   ; GENERIC-NEXT:    [[ADD57:%.*]] = add nsw i32 [[ADD48]], [[CONV56]]
>> -; GENERIC-NEXT:    [[INCDEC_PTR58]] = getelementptr inbounds i16, 
>> i16* [[A_ADDR_0101]], i64 8
>>   ; GENERIC-NEXT:    [[TMP28:%.*]] = extractelement <8 x i32> 
>> [[TMP6]], i64 7
>>   ; GENERIC-NEXT:    [[TMP29:%.*]] = sext i32 [[TMP28]] to i64
>>   ; GENERIC-NEXT:    [[ARRAYIDX64:%.*]] = getelementptr inbounds i16, 
>> i16* [[G]], i64 [[TMP29]]
>> @@ -111,6 +111,7 @@ define i32 @gather_reduce_8x16_i32(i16* nocapture 
>> readonly %a, i16* nocapture re
>>   ; KRYO-NEXT:    [[I_0103:%.*]] = phi i32 [ [[INC:%.*]], 
>> [[FOR_BODY]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
>>   ; KRYO-NEXT:    [[SUM_0102:%.*]] = phi i32 [ [[ADD66]], 
>> [[FOR_BODY]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
>>   ; KRYO-NEXT:    [[A_ADDR_0101:%.*]] = phi i16* [ 
>> [[INCDEC_PTR58:%.*]], [[FOR_BODY]] ], [ [[A:%.*]], 
>> [[FOR_BODY_PREHEADER]] ]
>> +; KRYO-NEXT:    [[INCDEC_PTR58]] = getelementptr inbounds i16, i16* 
>> [[A_ADDR_0101]], i64 8
>>   ; KRYO-NEXT:    [[TMP0:%.*]] = bitcast i16* [[A_ADDR_0101]] to <8 x 
>> i16>*
>>   ; KRYO-NEXT:    [[TMP1:%.*]] = load <8 x i16>, <8 x i16>* [[TMP0]], 
>> align 2
>>   ; KRYO-NEXT:    [[TMP2:%.*]] = zext <8 x i16> [[TMP1]] to <8 x i32>
>> @@ -160,7 +161,6 @@ define i32 @gather_reduce_8x16_i32(i16* nocapture 
>> readonly %a, i16* nocapture re
>>   ; KRYO-NEXT:    [[TMP27:%.*]] = load i16, i16* [[ARRAYIDX55]], align 2
>>   ; KRYO-NEXT:    [[CONV56:%.*]] = zext i16 [[TMP27]] to i32
>>   ; KRYO-NEXT:    [[ADD57:%.*]] = add nsw i32 [[ADD48]], [[CONV56]]
>> -; KRYO-NEXT:    [[INCDEC_PTR58]] = getelementptr inbounds i16, i16* 
>> [[A_ADDR_0101]], i64 8
>>   ; KRYO-NEXT:    [[TMP28:%.*]] = extractelement <8 x i32> [[TMP6]], 
>> i64 7
>>   ; KRYO-NEXT:    [[TMP29:%.*]] = sext i32 [[TMP28]] to i64
>>   ; KRYO-NEXT:    [[ARRAYIDX64:%.*]] = getelementptr inbounds i16, 
>> i16* [[G]], i64 [[TMP29]]
>> @@ -297,6 +297,7 @@ define i32 @gather_reduce_8x16_i64(i16* nocapture 
>> readonly %a, i16* nocapture re
>>   ; GENERIC-NEXT:    [[I_0103:%.*]] = phi i32 [ [[INC:%.*]], 
>> [[FOR_BODY]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
>>   ; GENERIC-NEXT:    [[SUM_0102:%.*]] = phi i32 [ [[ADD66]], 
>> [[FOR_BODY]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
>>   ; GENERIC-NEXT:    [[A_ADDR_0101:%.*]] = phi i16* [ 
>> [[INCDEC_PTR58:%.*]], [[FOR_BODY]] ], [ [[A:%.*]], 
>> [[FOR_BODY_PREHEADER]] ]
>> +; GENERIC-NEXT:    [[INCDEC_PTR58]] = getelementptr inbounds i16, 
>> i16* [[A_ADDR_0101]], i64 8
>>   ; GENERIC-NEXT:    [[TMP0:%.*]] = bitcast i16* [[A_ADDR_0101]] to 
>> <8 x i16>*
>>   ; GENERIC-NEXT:    [[TMP1:%.*]] = load <8 x i16>, <8 x i16>* 
>> [[TMP0]], align 2
>>   ; GENERIC-NEXT:    [[TMP2:%.*]] = zext <8 x i16> [[TMP1]] to <8 x i32>
>> @@ -346,7 +347,6 @@ define i32 @gather_reduce_8x16_i64(i16* nocapture 
>> readonly %a, i16* nocapture re
>>   ; GENERIC-NEXT:    [[TMP27:%.*]] = load i16, i16* [[ARRAYIDX55]], 
>> align 2
>>   ; GENERIC-NEXT:    [[CONV56:%.*]] = zext i16 [[TMP27]] to i32
>>   ; GENERIC-NEXT:    [[ADD57:%.*]] = add nsw i32 [[ADD48]], [[CONV56]]
>> -; GENERIC-NEXT:    [[INCDEC_PTR58]] = getelementptr inbounds i16, 
>> i16* [[A_ADDR_0101]], i64 8
>>   ; GENERIC-NEXT:    [[TMP28:%.*]] = extractelement <8 x i32> 
>> [[TMP6]], i64 7
>>   ; GENERIC-NEXT:    [[TMP29:%.*]] = sext i32 [[TMP28]] to i64
>>   ; GENERIC-NEXT:    [[ARRAYIDX64:%.*]] = getelementptr inbounds i16, 
>> i16* [[G]], i64 [[TMP29]]
>> @@ -372,6 +372,7 @@ define i32 @gather_reduce_8x16_i64(i16* nocapture 
>> readonly %a, i16* nocapture re
>>   ; KRYO-NEXT:    [[I_0103:%.*]] = phi i32 [ [[INC:%.*]], 
>> [[FOR_BODY]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
>>   ; KRYO-NEXT:    [[SUM_0102:%.*]] = phi i32 [ [[ADD66]], 
>> [[FOR_BODY]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
>>   ; KRYO-NEXT:    [[A_ADDR_0101:%.*]] = phi i16* [ 
>> [[INCDEC_PTR58:%.*]], [[FOR_BODY]] ], [ [[A:%.*]], 
>> [[FOR_BODY_PREHEADER]] ]
>> +; KRYO-NEXT:    [[INCDEC_PTR58]] = getelementptr inbounds i16, i16* 
>> [[A_ADDR_0101]], i64 8
>>   ; KRYO-NEXT:    [[TMP0:%.*]] = bitcast i16* [[A_ADDR_0101]] to <8 x 
>> i16>*
>>   ; KRYO-NEXT:    [[TMP1:%.*]] = load <8 x i16>, <8 x i16>* [[TMP0]], 
>> align 2
>>   ; KRYO-NEXT:    [[TMP2:%.*]] = zext <8 x i16> [[TMP1]] to <8 x i32>
>> @@ -421,7 +422,6 @@ define i32 @gather_reduce_8x16_i64(i16* nocapture 
>> readonly %a, i16* nocapture re
>>   ; KRYO-NEXT:    [[TMP27:%.*]] = load i16, i16* [[ARRAYIDX55]], align 2
>>   ; KRYO-NEXT:    [[CONV56:%.*]] = zext i16 [[TMP27]] to i32
>>   ; KRYO-NEXT:    [[ADD57:%.*]] = add nsw i32 [[ADD48]], [[CONV56]]
>> -; KRYO-NEXT:    [[INCDEC_PTR58]] = getelementptr inbounds i16, i16* 
>> [[A_ADDR_0101]], i64 8
>>   ; KRYO-NEXT:    [[TMP28:%.*]] = extractelement <8 x i32> [[TMP6]], 
>> i64 7
>>   ; KRYO-NEXT:    [[TMP29:%.*]] = sext i32 [[TMP28]] to i64
>>   ; KRYO-NEXT:    [[ARRAYIDX64:%.*]] = getelementptr inbounds i16, 
>> i16* [[G]], i64 [[TMP29]]
>>
>> diff  --git 
>> a/llvm/test/Transforms/SLPVectorizer/AArch64/gather-root.ll 
>> b/llvm/test/Transforms/SLPVectorizer/AArch64/gather-root.ll
>> index e9c502b6982cd..01d743fcbfe97 100644
>> --- a/llvm/test/Transforms/SLPVectorizer/AArch64/gather-root.ll
>> +++ b/llvm/test/Transforms/SLPVectorizer/AArch64/gather-root.ll
>> @@ -35,41 +35,14 @@ define void @PR28330(i32 %n) {
>>   ;
>>   ; MAX-COST-LABEL: @PR28330(
>>   ; MAX-COST-NEXT:  entry:
>> -; MAX-COST-NEXT:    [[P0:%.*]] = load i8, i8* getelementptr inbounds 
>> ([80 x i8], [80 x i8]* @a, i64 0, i64 1), align 1
>> -; MAX-COST-NEXT:    [[P1:%.*]] = icmp eq i8 [[P0]], 0
>> -; MAX-COST-NEXT:    [[P2:%.*]] = load i8, i8* getelementptr inbounds 
>> ([80 x i8], [80 x i8]* @a, i64 0, i64 2), align 2
>> -; MAX-COST-NEXT:    [[P3:%.*]] = icmp eq i8 [[P2]], 0
>> -; MAX-COST-NEXT:    [[P4:%.*]] = load i8, i8* getelementptr inbounds 
>> ([80 x i8], [80 x i8]* @a, i64 0, i64 3), align 1
>> -; MAX-COST-NEXT:    [[P5:%.*]] = icmp eq i8 [[P4]], 0
>> -; MAX-COST-NEXT:    [[P6:%.*]] = load i8, i8* getelementptr inbounds 
>> ([80 x i8], [80 x i8]* @a, i64 0, i64 4), align 4
>> -; MAX-COST-NEXT:    [[P7:%.*]] = icmp eq i8 [[P6]], 0
>> -; MAX-COST-NEXT:    [[P8:%.*]] = load i8, i8* getelementptr inbounds 
>> ([80 x i8], [80 x i8]* @a, i64 0, i64 5), align 1
>> -; MAX-COST-NEXT:    [[P9:%.*]] = icmp eq i8 [[P8]], 0
>> -; MAX-COST-NEXT:    [[P10:%.*]] = load i8, i8* getelementptr 
>> inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 6), align 2
>> -; MAX-COST-NEXT:    [[P11:%.*]] = icmp eq i8 [[P10]], 0
>> -; MAX-COST-NEXT:    [[P12:%.*]] = load i8, i8* getelementptr 
>> inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 7), align 1
>> -; MAX-COST-NEXT:    [[P13:%.*]] = icmp eq i8 [[P12]], 0
>> -; MAX-COST-NEXT:    [[P14:%.*]] = load i8, i8* getelementptr 
>> inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 8), align 8
>> -; MAX-COST-NEXT:    [[P15:%.*]] = icmp eq i8 [[P14]], 0
>> +; MAX-COST-NEXT:    [[TMP0:%.*]] = load <8 x i8>, <8 x i8>* bitcast 
>> (i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 1) 
>> to <8 x i8>*), align 1
>> +; MAX-COST-NEXT:    [[TMP1:%.*]] = icmp eq <8 x i8> [[TMP0]], 
>> zeroinitializer
>>   ; MAX-COST-NEXT:    br label [[FOR_BODY:%.*]]
>>   ; MAX-COST:       for.body:
>> -; MAX-COST-NEXT:    [[P17:%.*]] = phi i32 [ [[P34:%.*]], 
>> [[FOR_BODY]] ], [ 0, [[ENTRY:%.*]] ]
>> -; MAX-COST-NEXT:    [[P19:%.*]] = select i1 [[P1]], i32 -720, i32 -80
>> -; MAX-COST-NEXT:    [[P20:%.*]] = add i32 [[P17]], [[P19]]
>> -; MAX-COST-NEXT:    [[P21:%.*]] = select i1 [[P3]], i32 -720, i32 -80
>> -; MAX-COST-NEXT:    [[P22:%.*]] = add i32 [[P20]], [[P21]]
>> -; MAX-COST-NEXT:    [[P23:%.*]] = select i1 [[P5]], i32 -720, i32 -80
>> -; MAX-COST-NEXT:    [[P24:%.*]] = add i32 [[P22]], [[P23]]
>> -; MAX-COST-NEXT:    [[P25:%.*]] = select i1 [[P7]], i32 -720, i32 -80
>> -; MAX-COST-NEXT:    [[P26:%.*]] = add i32 [[P24]], [[P25]]
>> -; MAX-COST-NEXT:    [[P27:%.*]] = select i1 [[P9]], i32 -720, i32 -80
>> -; MAX-COST-NEXT:    [[P28:%.*]] = add i32 [[P26]], [[P27]]
>> -; MAX-COST-NEXT:    [[P29:%.*]] = select i1 [[P11]], i32 -720, i32 -80
>> -; MAX-COST-NEXT:    [[P30:%.*]] = add i32 [[P28]], [[P29]]
>> -; MAX-COST-NEXT:    [[P31:%.*]] = select i1 [[P13]], i32 -720, i32 -80
>> -; MAX-COST-NEXT:    [[P32:%.*]] = add i32 [[P30]], [[P31]]
>> -; MAX-COST-NEXT:    [[P33:%.*]] = select i1 [[P15]], i32 -720, i32 -80
>> -; MAX-COST-NEXT:    [[P34]] = add i32 [[P32]], [[P33]]
>> +; MAX-COST-NEXT:    [[P17:%.*]] = phi i32 [ [[OP_EXTRA:%.*]], 
>> [[FOR_BODY]] ], [ 0, [[ENTRY:%.*]] ]
>> +; MAX-COST-NEXT:    [[TMP2:%.*]] = select <8 x i1> [[TMP1]], <8 x 
>> i32> <i32 -720, i32 -720, i32 -720, i32 -720, i32 -720, i32 -720, i32 
>> -720, i32 -720>, <8 x i32> <i32 -80, i32 -80, i32 -80, i32 -80, i32 
>> -80, i32 -80, i32 -80, i32 -80>
>> +; MAX-COST-NEXT:    [[TMP3:%.*]] = call i32 
>> @llvm.vector.reduce.add.v8i32(<8 x i32> [[TMP2]])
>> +; MAX-COST-NEXT:    [[OP_EXTRA]] = add i32 [[TMP3]], [[P17]]
>>   ; MAX-COST-NEXT:    br label [[FOR_BODY]]
>>   ;
>>   entry:
>> @@ -139,30 +112,14 @@ define void @PR32038(i32 %n) {
>>   ;
>>   ; MAX-COST-LABEL: @PR32038(
>>   ; MAX-COST-NEXT:  entry:
>> -; MAX-COST-NEXT:    [[TMP0:%.*]] = load <4 x i8>, <4 x i8>* bitcast 
>> (i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 1) 
>> to <4 x i8>*), align 1
>> -; MAX-COST-NEXT:    [[TMP1:%.*]] = icmp eq <4 x i8> [[TMP0]], 
>> zeroinitializer
>> -; MAX-COST-NEXT:    [[P8:%.*]] = load i8, i8* getelementptr inbounds 
>> ([80 x i8], [80 x i8]* @a, i64 0, i64 5), align 1
>> -; MAX-COST-NEXT:    [[P9:%.*]] = icmp eq i8 [[P8]], 0
>> -; MAX-COST-NEXT:    [[P10:%.*]] = load i8, i8* getelementptr 
>> inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 6), align 2
>> -; MAX-COST-NEXT:    [[P11:%.*]] = icmp eq i8 [[P10]], 0
>> -; MAX-COST-NEXT:    [[P12:%.*]] = load i8, i8* getelementptr 
>> inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 7), align 1
>> -; MAX-COST-NEXT:    [[P13:%.*]] = icmp eq i8 [[P12]], 0
>> -; MAX-COST-NEXT:    [[P14:%.*]] = load i8, i8* getelementptr 
>> inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 8), align 8
>> -; MAX-COST-NEXT:    [[P15:%.*]] = icmp eq i8 [[P14]], 0
>> +; MAX-COST-NEXT:    [[TMP0:%.*]] = load <8 x i8>, <8 x i8>* bitcast 
>> (i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 1) 
>> to <8 x i8>*), align 1
>> +; MAX-COST-NEXT:    [[TMP1:%.*]] = icmp eq <8 x i8> [[TMP0]], 
>> zeroinitializer
>>   ; MAX-COST-NEXT:    br label [[FOR_BODY:%.*]]
>>   ; MAX-COST:       for.body:
>> -; MAX-COST-NEXT:    [[P17:%.*]] = phi i32 [ [[P34:%.*]], 
>> [[FOR_BODY]] ], [ 0, [[ENTRY:%.*]] ]
>> -; MAX-COST-NEXT:    [[TMP2:%.*]] = select <4 x i1> [[TMP1]], <4 x 
>> i32> <i32 -720, i32 -720, i32 -720, i32 -720>, <4 x i32> <i32 -80, 
>> i32 -80, i32 -80, i32 -80>
>> -; MAX-COST-NEXT:    [[P27:%.*]] = select i1 [[P9]], i32 -720, i32 -80
>> -; MAX-COST-NEXT:    [[P29:%.*]] = select i1 [[P11]], i32 -720, i32 -80
>> -; MAX-COST-NEXT:    [[TMP3:%.*]] = call i32 
>> @llvm.vector.reduce.add.v4i32(<4 x i32> [[TMP2]])
>> -; MAX-COST-NEXT:    [[TMP4:%.*]] = add i32 [[TMP3]], [[P27]]
>> -; MAX-COST-NEXT:    [[TMP5:%.*]] = add i32 [[TMP4]], [[P29]]
>> -; MAX-COST-NEXT:    [[OP_EXTRA:%.*]] = add i32 [[TMP5]], -5
>> -; MAX-COST-NEXT:    [[P31:%.*]] = select i1 [[P13]], i32 -720, i32 -80
>> -; MAX-COST-NEXT:    [[P32:%.*]] = add i32 [[OP_EXTRA]], [[P31]]
>> -; MAX-COST-NEXT:    [[P33:%.*]] = select i1 [[P15]], i32 -720, i32 -80
>> -; MAX-COST-NEXT:    [[P34]] = add i32 [[P32]], [[P33]]
>> +; MAX-COST-NEXT:    [[P17:%.*]] = phi i32 [ [[OP_EXTRA:%.*]], 
>> [[FOR_BODY]] ], [ 0, [[ENTRY:%.*]] ]
>> +; MAX-COST-NEXT:    [[TMP2:%.*]] = select <8 x i1> [[TMP1]], <8 x 
>> i32> <i32 -720, i32 -720, i32 -720, i32 -720, i32 -720, i32 -720, i32 
>> -720, i32 -720>, <8 x i32> <i32 -80, i32 -80, i32 -80, i32 -80, i32 
>> -80, i32 -80, i32 -80, i32 -80>
>> +; MAX-COST-NEXT:    [[TMP3:%.*]] = call i32 
>> @llvm.vector.reduce.add.v8i32(<8 x i32> [[TMP2]])
>> +; MAX-COST-NEXT:    [[OP_EXTRA]] = add i32 [[TMP3]], -5
>>   ; MAX-COST-NEXT:    br label [[FOR_BODY]]
>>   ;
>>   entry:
>>
>> diff  --git 
>> a/llvm/test/Transforms/SLPVectorizer/AArch64/spillcost-di.ll 
>> b/llvm/test/Transforms/SLPVectorizer/AArch64/spillcost-di.ll
>> index 39f2f885bc26b..c1451090d23c0 100644
>> --- a/llvm/test/Transforms/SLPVectorizer/AArch64/spillcost-di.ll
>> +++ b/llvm/test/Transforms/SLPVectorizer/AArch64/spillcost-di.ll
>> @@ -14,14 +14,14 @@ define void @patatino(i64 %n, i64 %i, %struct.S* 
>> %p) !dbg !7 {
>>   ; CHECK-NEXT:    call void @llvm.dbg.value(metadata %struct.S* 
>> [[P:%.*]], metadata [[META20:![0-9]+]], metadata !DIExpression()), 
>> !dbg [[DBG25:![0-9]+]]
>>   ; CHECK-NEXT:    [[X1:%.*]] = getelementptr inbounds 
>> [[STRUCT_S:%.*]], %struct.S* [[P]], i64 [[N]], i32 0, !dbg 
>> [[DBG26:![0-9]+]]
>>   ; CHECK-NEXT:    call void @llvm.dbg.value(metadata i64 undef, 
>> metadata [[META21:![0-9]+]], metadata !DIExpression()), !dbg 
>> [[DBG27:![0-9]+]]
>> -; CHECK-NEXT:    [[Y3:%.*]] = getelementptr inbounds [[STRUCT_S]], 
>> %struct.S* [[P]], i64 [[N]], i32 1, !dbg [[DBG28:![0-9]+]]
>> +; CHECK-NEXT:    call void @llvm.dbg.value(metadata i64 undef, 
>> metadata [[META22:![0-9]+]], metadata !DIExpression()), !dbg 
>> [[DBG28:![0-9]+]]
>> +; CHECK-NEXT:    [[Y3:%.*]] = getelementptr inbounds [[STRUCT_S]], 
>> %struct.S* [[P]], i64 [[N]], i32 1, !dbg [[DBG29:![0-9]+]]
>>   ; CHECK-NEXT:    [[TMP0:%.*]] = bitcast i64* [[X1]] to <2 x i64>*, 
>> !dbg [[DBG26]]
>> -; CHECK-NEXT:    [[TMP1:%.*]] = load <2 x i64>, <2 x i64>* [[TMP0]], 
>> align 8, !dbg [[DBG26]], !tbaa [[TBAA29:![0-9]+]]
>> -; CHECK-NEXT:    call void @llvm.dbg.value(metadata i64 undef, 
>> metadata [[META22:![0-9]+]], metadata !DIExpression()), !dbg 
>> [[DBG33:![0-9]+]]
>> +; CHECK-NEXT:    [[TMP1:%.*]] = load <2 x i64>, <2 x i64>* [[TMP0]], 
>> align 8, !dbg [[DBG26]], !tbaa [[TBAA30:![0-9]+]]
>>   ; CHECK-NEXT:    [[X5:%.*]] = getelementptr inbounds [[STRUCT_S]], 
>> %struct.S* [[P]], i64 [[I]], i32 0, !dbg [[DBG34:![0-9]+]]
>>   ; CHECK-NEXT:    [[Y7:%.*]] = getelementptr inbounds [[STRUCT_S]], 
>> %struct.S* [[P]], i64 [[I]], i32 1, !dbg [[DBG35:![0-9]+]]
>>   ; CHECK-NEXT:    [[TMP2:%.*]] = bitcast i64* [[X5]] to <2 x i64>*, 
>> !dbg [[DBG36:![0-9]+]]
>> -; CHECK-NEXT:    store <2 x i64> [[TMP1]], <2 x i64>* [[TMP2]], 
>> align 8, !dbg [[DBG36]], !tbaa [[TBAA29]]
>> +; CHECK-NEXT:    store <2 x i64> [[TMP1]], <2 x i64>* [[TMP2]], 
>> align 8, !dbg [[DBG36]], !tbaa [[TBAA30]]
>>   ; CHECK-NEXT:    ret void, !dbg [[DBG37:![0-9]+]]
>>   ;
>>   entry:
>>
>> diff  --git a/llvm/test/Transforms/SLPVectorizer/X86/PR35628_2.ll 
>> b/llvm/test/Transforms/SLPVectorizer/X86/PR35628_2.ll
>> index 7f51dcae484ca..d15494e092c25 100644
>> --- a/llvm/test/Transforms/SLPVectorizer/X86/PR35628_2.ll
>> +++ b/llvm/test/Transforms/SLPVectorizer/X86/PR35628_2.ll
>> @@ -9,11 +9,11 @@ define void @test() #0 {
>>   ; CHECK:       loop:
>>   ; CHECK-NEXT:    [[DUMMY_PHI:%.*]] = phi i64 [ 1, [[ENTRY:%.*]] ], 
>> [ [[OP_EXTRA1:%.*]], [[LOOP]] ]
>>   ; CHECK-NEXT:    [[TMP0:%.*]] = phi i64 [ 2, [[ENTRY]] ], [ 
>> [[TMP3:%.*]], [[LOOP]] ]
>> -; CHECK-NEXT:    [[DUMMY_ADD:%.*]] = add i16 0, 0
>>   ; CHECK-NEXT:    [[TMP1:%.*]] = insertelement <4 x i64> poison, i64 
>> [[TMP0]], i32 0
>>   ; CHECK-NEXT:    [[SHUFFLE:%.*]] = shufflevector <4 x i64> 
>> [[TMP1]], <4 x i64> poison, <4 x i32> zeroinitializer
>>   ; CHECK-NEXT:    [[TMP2:%.*]] = add <4 x i64> [[SHUFFLE]], <i64 3, 
>> i64 2, i64 1, i64 0>
>>   ; CHECK-NEXT:    [[TMP3]] = extractelement <4 x i64> [[TMP2]], i32 3
>> +; CHECK-NEXT:    [[DUMMY_ADD:%.*]] = add i16 0, 0
>>   ; CHECK-NEXT:    [[TMP4:%.*]] = extractelement <4 x i64> [[TMP2]], 
>> i32 0
>>   ; CHECK-NEXT:    [[DUMMY_SHL:%.*]] = shl i64 [[TMP4]], 32
>>   ; CHECK-NEXT:    [[TMP5:%.*]] = add <4 x i64> <i64 1, i64 1, i64 1, 
>> i64 1>, [[TMP2]]
>>
>> diff  --git a/llvm/test/Transforms/SLPVectorizer/X86/PR40310.ll 
>> b/llvm/test/Transforms/SLPVectorizer/X86/PR40310.ll
>> index 7ab610f994264..f878bda14ad84 100644
>> --- a/llvm/test/Transforms/SLPVectorizer/X86/PR40310.ll
>> +++ b/llvm/test/Transforms/SLPVectorizer/X86/PR40310.ll
>> @@ -10,10 +10,10 @@ define void @mainTest(i32 %param, i32 * %vals, 
>> i32 %len) {
>>   ; CHECK-NEXT:    [[TMP1:%.*]] = phi <2 x i32> [ [[TMP7:%.*]], 
>> [[BCI_15]] ], [ [[TMP0]], [[BCI_15_PREHEADER:%.*]] ]
>>   ; CHECK-NEXT:    [[SHUFFLE:%.*]] = shufflevector <2 x i32> 
>> [[TMP1]], <2 x i32> poison, <16 x i32> <i32 0, i32 0, i32 0, i32 0, 
>> i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, 
>> i32 0, i32 1>
>>   ; CHECK-NEXT:    [[TMP2:%.*]] = extractelement <16 x i32> 
>> [[SHUFFLE]], i32 0
>> -; CHECK-NEXT:    [[TMP3:%.*]] = extractelement <16 x i32> 
>> [[SHUFFLE]], i32 15
>> -; CHECK-NEXT:    store atomic i32 [[TMP3]], i32* [[VALS:%.*]] 
>> unordered, align 4
>> -; CHECK-NEXT:    [[TMP4:%.*]] = add <16 x i32> [[SHUFFLE]], <i32 15, 
>> i32 14, i32 13, i32 12, i32 11, i32 10, i32 9, i32 8, i32 7, i32 6, 
>> i32 5, i32 4, i32 3, i32 2, i32 1, i32 -1>
>> -; CHECK-NEXT:    [[TMP5:%.*]] = call i32 
>> @llvm.vector.reduce.and.v16i32(<16 x i32> [[TMP4]])
>> +; CHECK-NEXT:    [[TMP3:%.*]] = add <16 x i32> [[SHUFFLE]], <i32 15, 
>> i32 14, i32 13, i32 12, i32 11, i32 10, i32 9, i32 8, i32 7, i32 6, 
>> i32 5, i32 4, i32 3, i32 2, i32 1, i32 -1>
>> +; CHECK-NEXT:    [[TMP4:%.*]] = extractelement <16 x i32> 
>> [[SHUFFLE]], i32 15
>> +; CHECK-NEXT:    store atomic i32 [[TMP4]], i32* [[VALS:%.*]] 
>> unordered, align 4
>> +; CHECK-NEXT:    [[TMP5:%.*]] = call i32 
>> @llvm.vector.reduce.and.v16i32(<16 x i32> [[TMP3]])
>>   ; CHECK-NEXT:    [[OP_EXTRA:%.*]] = and i32 [[TMP5]], [[TMP2]]
>>   ; CHECK-NEXT:    [[V44:%.*]] = add i32 [[TMP2]], 16
>>   ; CHECK-NEXT:    [[TMP6:%.*]] = insertelement <2 x i32> poison, i32 
>> [[V44]], i32 0
>>
>> diff  --git 
>> a/llvm/test/Transforms/SLPVectorizer/X86/crash_exceed_scheduling.ll 
>> b/llvm/test/Transforms/SLPVectorizer/X86/crash_exceed_scheduling.ll
>> index de371d8895c7d..94739340c8b5a 100644
>> --- a/llvm/test/Transforms/SLPVectorizer/X86/crash_exceed_scheduling.ll
>> +++ b/llvm/test/Transforms/SLPVectorizer/X86/crash_exceed_scheduling.ll
>> @@ -29,10 +29,10 @@ define void @exceed(double %0, double %1) {
>>   ; CHECK-NEXT:    [[IXX22:%.*]] = fsub double undef, undef
>>   ; CHECK-NEXT:    [[TMP8:%.*]] = extractelement <2 x double> 
>> [[TMP6]], i32 0
>>   ; CHECK-NEXT:    [[IX2:%.*]] = fmul double [[TMP8]], [[TMP8]]
>> -; CHECK-NEXT:    [[TMP9:%.*]] = insertelement <2 x double> [[TMP2]], 
>> double [[TMP1]], i32 1
>> -; CHECK-NEXT:    [[TMP10:%.*]] = fadd fast <2 x double> [[TMP6]], 
>> [[TMP9]]
>> -; CHECK-NEXT:    [[TMP11:%.*]] = fadd fast <2 x double> [[TMP3]], 
>> [[TMP5]]
>> -; CHECK-NEXT:    [[TMP12:%.*]] = fmul fast <2 x double> [[TMP10]], 
>> [[TMP11]]
>> +; CHECK-NEXT:    [[TMP9:%.*]] = fadd fast <2 x double> [[TMP3]], 
>> [[TMP5]]
>> +; CHECK-NEXT:    [[TMP10:%.*]] = insertelement <2 x double> 
>> [[TMP2]], double [[TMP1]], i32 1
>> +; CHECK-NEXT:    [[TMP11:%.*]] = fadd fast <2 x double> [[TMP6]], 
>> [[TMP10]]
>> +; CHECK-NEXT:    [[TMP12:%.*]] = fmul fast <2 x double> [[TMP11]], 
>> [[TMP9]]
>>   ; CHECK-NEXT:    [[IXX101:%.*]] = fsub double undef, undef
>>   ; CHECK-NEXT:    [[TMP13:%.*]] = insertelement <2 x double> poison, 
>> double [[TMP1]], i32 1
>>   ; CHECK-NEXT:    [[TMP14:%.*]] = insertelement <2 x double> 
>> [[TMP13]], double [[TMP7]], i32 0
>>
>> diff  --git a/llvm/test/Transforms/SLPVectorizer/X86/opaque-ptr.ll 
>> b/llvm/test/Transforms/SLPVectorizer/X86/opaque-ptr.ll
>> index 80cb197982d48..8dc4a8936b722 100644
>> --- a/llvm/test/Transforms/SLPVectorizer/X86/opaque-ptr.ll
>> +++ b/llvm/test/Transforms/SLPVectorizer/X86/opaque-ptr.ll
>> @@ -58,10 +58,10 @@ define void @test(ptr %r, ptr %p, ptr %q) #0 {
>>     define void @test2(i64* %a, i64* %b) {
>>   ; CHECK-LABEL: @test2(
>> -; CHECK-NEXT:    [[A2:%.*]] = getelementptr inbounds i64, ptr 
>> [[A:%.*]], i64 2
>> -; CHECK-NEXT:    [[TMP1:%.*]] = insertelement <2 x ptr> poison, ptr 
>> [[A]], i32 0
>> +; CHECK-NEXT:    [[TMP1:%.*]] = insertelement <2 x ptr> poison, ptr 
>> [[A:%.*]], i32 0
>>   ; CHECK-NEXT:    [[TMP2:%.*]] = insertelement <2 x ptr> [[TMP1]], 
>> ptr [[B:%.*]], i32 1
>>   ; CHECK-NEXT:    [[TMP3:%.*]] = getelementptr i64, <2 x ptr> 
>> [[TMP2]], <2 x i64> <i64 1, i64 3>
>> +; CHECK-NEXT:    [[A2:%.*]] = getelementptr inbounds i64, ptr [[A]], 
>> i64 2
>>   ; CHECK-NEXT:    [[TMP4:%.*]] = ptrtoint <2 x ptr> [[TMP3]] to <2 x 
>> i64>
>>   ; CHECK-NEXT:    [[TMP5:%.*]] = extractelement <2 x ptr> [[TMP3]], 
>> i32 0
>>   ; CHECK-NEXT:    [[TMP6:%.*]] = load <2 x i64>, ptr [[TMP5]], align 8
>>
>> diff  --git 
>> a/llvm/test/Transforms/SLPVectorizer/X86/pr47629-inseltpoison.ll 
>> b/llvm/test/Transforms/SLPVectorizer/X86/pr47629-inseltpoison.ll
>> index f6dd7526e6e76..35a6c63d29b6c 100644
>> --- a/llvm/test/Transforms/SLPVectorizer/X86/pr47629-inseltpoison.ll
>> +++ b/llvm/test/Transforms/SLPVectorizer/X86/pr47629-inseltpoison.ll
>> @@ -749,47 +749,47 @@ define void @gather_load_div(float* noalias 
>> nocapture %0, float* noalias nocaptu
>>   ; AVX2-NEXT:    ret void
>>   ;
>>   ; AVX512F-LABEL: @gather_load_div(
>> -; AVX512F-NEXT:    [[TMP3:%.*]] = insertelement <4 x float*> poison, 
>> float* [[TMP1:%.*]], i64 0
>> -; AVX512F-NEXT:    [[SHUFFLE1:%.*]] = shufflevector <4 x float*> 
>> [[TMP3]], <4 x float*> poison, <4 x i32> zeroinitializer
>> -; AVX512F-NEXT:    [[TMP4:%.*]] = getelementptr float, <4 x float*> 
>> [[SHUFFLE1]], <4 x i64> <i64 10, i64 3, i64 14, i64 17>
>> -; AVX512F-NEXT:    [[TMP5:%.*]] = insertelement <2 x float*> poison, 
>> float* [[TMP1]], i64 0
>> -; AVX512F-NEXT:    [[TMP6:%.*]] = shufflevector <2 x float*> 
>> [[TMP5]], <2 x float*> poison, <2 x i32> zeroinitializer
>> -; AVX512F-NEXT:    [[TMP7:%.*]] = getelementptr float, <2 x float*> 
>> [[TMP6]], <2 x i64> <i64 8, i64 5>
>> -; AVX512F-NEXT:    [[TMP8:%.*]] = getelementptr inbounds float, 
>> float* [[TMP1]], i64 20
>> -; AVX512F-NEXT:    [[TMP9:%.*]] = insertelement <8 x float*> poison, 
>> float* [[TMP1]], i64 0
>> -; AVX512F-NEXT:    [[TMP10:%.*]] = shufflevector <4 x float*> 
>> [[TMP4]], <4 x float*> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, 
>> i32 undef, i32 undef, i32 undef, i32 undef>
>> -; AVX512F-NEXT:    [[TMP11:%.*]] = shufflevector <8 x float*> 
>> [[TMP9]], <8 x float*> [[TMP10]], <8 x i32> <i32 0, i32 8, i32 9, i32 
>> 10, i32 11, i32 undef, i32 undef, i32 undef>
>> -; AVX512F-NEXT:    [[TMP12:%.*]] = shufflevector <2 x float*> 
>> [[TMP7]], <2 x float*> poison, <8 x i32> <i32 0, i32 1, i32 undef, 
>> i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
>> -; AVX512F-NEXT:    [[TMP13:%.*]] = shufflevector <8 x float*> 
>> [[TMP11]], <8 x float*> [[TMP12]], <8 x i32> <i32 0, i32 1, i32 2, 
>> i32 3, i32 4, i32 8, i32 9, i32 undef>
>> -; AVX512F-NEXT:    [[TMP14:%.*]] = insertelement <8 x float*> 
>> [[TMP13]], float* [[TMP8]], i64 7
>> -; AVX512F-NEXT:    [[TMP15:%.*]] = call <8 x float> 
>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP14]], i32 4, <8 x 
>> i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, 
>> i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>> -; AVX512F-NEXT:    [[SHUFFLE:%.*]] = shufflevector <8 x float*> 
>> [[TMP9]], <8 x float*> poison, <8 x i32> zeroinitializer
>> -; AVX512F-NEXT:    [[TMP16:%.*]] = getelementptr float, <8 x float*> 
>> [[SHUFFLE]], <8 x i64> <i64 4, i64 13, i64 11, i64 44, i64 33, i64 
>> 30, i64 27, i64 23>
>> -; AVX512F-NEXT:    [[TMP17:%.*]] = call <8 x float> 
>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP16]], i32 4, <8 x 
>> i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, 
>> i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>> -; AVX512F-NEXT:    [[TMP18:%.*]] = fdiv <8 x float> [[TMP15]], 
>> [[TMP17]]
>> +; AVX512F-NEXT:    [[TMP3:%.*]] = insertelement <8 x float*> poison, 
>> float* [[TMP1:%.*]], i64 0
>> +; AVX512F-NEXT:    [[SHUFFLE:%.*]] = shufflevector <8 x float*> 
>> [[TMP3]], <8 x float*> poison, <8 x i32> zeroinitializer
>> +; AVX512F-NEXT:    [[TMP4:%.*]] = getelementptr float, <8 x float*> 
>> [[SHUFFLE]], <8 x i64> <i64 4, i64 13, i64 11, i64 44, i64 33, i64 
>> 30, i64 27, i64 23>
>> +; AVX512F-NEXT:    [[TMP5:%.*]] = insertelement <4 x float*> poison, 
>> float* [[TMP1]], i64 0
>> +; AVX512F-NEXT:    [[SHUFFLE1:%.*]] = shufflevector <4 x float*> 
>> [[TMP5]], <4 x float*> poison, <4 x i32> zeroinitializer
>> +; AVX512F-NEXT:    [[TMP6:%.*]] = getelementptr float, <4 x float*> 
>> [[SHUFFLE1]], <4 x i64> <i64 10, i64 3, i64 14, i64 17>
>> +; AVX512F-NEXT:    [[TMP7:%.*]] = insertelement <2 x float*> poison, 
>> float* [[TMP1]], i64 0
>> +; AVX512F-NEXT:    [[TMP8:%.*]] = shufflevector <2 x float*> 
>> [[TMP7]], <2 x float*> poison, <2 x i32> zeroinitializer
>> +; AVX512F-NEXT:    [[TMP9:%.*]] = getelementptr float, <2 x float*> 
>> [[TMP8]], <2 x i64> <i64 8, i64 5>
>> +; AVX512F-NEXT:    [[TMP10:%.*]] = getelementptr inbounds float, 
>> float* [[TMP1]], i64 20
>> +; AVX512F-NEXT:    [[TMP11:%.*]] = shufflevector <4 x float*> 
>> [[TMP6]], <4 x float*> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, 
>> i32 undef, i32 undef, i32 undef, i32 undef>
>> +; AVX512F-NEXT:    [[TMP12:%.*]] = shufflevector <8 x float*> 
>> [[TMP3]], <8 x float*> [[TMP11]], <8 x i32> <i32 0, i32 8, i32 9, i32 
>> 10, i32 11, i32 undef, i32 undef, i32 undef>
>> +; AVX512F-NEXT:    [[TMP13:%.*]] = shufflevector <2 x float*> 
>> [[TMP9]], <2 x float*> poison, <8 x i32> <i32 0, i32 1, i32 undef, 
>> i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
>> +; AVX512F-NEXT:    [[TMP14:%.*]] = shufflevector <8 x float*> 
>> [[TMP12]], <8 x float*> [[TMP13]], <8 x i32> <i32 0, i32 1, i32 2, 
>> i32 3, i32 4, i32 8, i32 9, i32 undef>
>> +; AVX512F-NEXT:    [[TMP15:%.*]] = insertelement <8 x float*> 
>> [[TMP14]], float* [[TMP10]], i64 7
>> +; AVX512F-NEXT:    [[TMP16:%.*]] = call <8 x float> 
>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP15]], i32 4, <8 x 
>> i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, 
>> i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>> +; AVX512F-NEXT:    [[TMP17:%.*]] = call <8 x float> 
>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP4]], i32 4, <8 x 
>> i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, 
>> i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>> +; AVX512F-NEXT:    [[TMP18:%.*]] = fdiv <8 x float> [[TMP16]], 
>> [[TMP17]]
>>   ; AVX512F-NEXT:    [[TMP19:%.*]] = bitcast float* [[TMP0:%.*]] to 
>> <8 x float>*
>>   ; AVX512F-NEXT:    store <8 x float> [[TMP18]], <8 x float>* 
>> [[TMP19]], align 4, !tbaa [[TBAA0]]
>>   ; AVX512F-NEXT:    ret void
>>   ;
>>   ; AVX512VL-LABEL: @gather_load_div(
>> -; AVX512VL-NEXT:    [[TMP3:%.*]] = insertelement <4 x float*> 
>> poison, float* [[TMP1:%.*]], i64 0
>> -; AVX512VL-NEXT:    [[SHUFFLE1:%.*]] = shufflevector <4 x float*> 
>> [[TMP3]], <4 x float*> poison, <4 x i32> zeroinitializer
>> -; AVX512VL-NEXT:    [[TMP4:%.*]] = getelementptr float, <4 x float*> 
>> [[SHUFFLE1]], <4 x i64> <i64 10, i64 3, i64 14, i64 17>
>> -; AVX512VL-NEXT:    [[TMP5:%.*]] = insertelement <2 x float*> 
>> poison, float* [[TMP1]], i64 0
>> -; AVX512VL-NEXT:    [[TMP6:%.*]] = shufflevector <2 x float*> 
>> [[TMP5]], <2 x float*> poison, <2 x i32> zeroinitializer
>> -; AVX512VL-NEXT:    [[TMP7:%.*]] = getelementptr float, <2 x float*> 
>> [[TMP6]], <2 x i64> <i64 8, i64 5>
>> -; AVX512VL-NEXT:    [[TMP8:%.*]] = getelementptr inbounds float, 
>> float* [[TMP1]], i64 20
>> -; AVX512VL-NEXT:    [[TMP9:%.*]] = insertelement <8 x float*> 
>> poison, float* [[TMP1]], i64 0
>> -; AVX512VL-NEXT:    [[TMP10:%.*]] = shufflevector <4 x float*> 
>> [[TMP4]], <4 x float*> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, 
>> i32 undef, i32 undef, i32 undef, i32 undef>
>> -; AVX512VL-NEXT:    [[TMP11:%.*]] = shufflevector <8 x float*> 
>> [[TMP9]], <8 x float*> [[TMP10]], <8 x i32> <i32 0, i32 8, i32 9, i32 
>> 10, i32 11, i32 undef, i32 undef, i32 undef>
>> -; AVX512VL-NEXT:    [[TMP12:%.*]] = shufflevector <2 x float*> 
>> [[TMP7]], <2 x float*> poison, <8 x i32> <i32 0, i32 1, i32 undef, 
>> i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
>> -; AVX512VL-NEXT:    [[TMP13:%.*]] = shufflevector <8 x float*> 
>> [[TMP11]], <8 x float*> [[TMP12]], <8 x i32> <i32 0, i32 1, i32 2, 
>> i32 3, i32 4, i32 8, i32 9, i32 undef>
>> -; AVX512VL-NEXT:    [[TMP14:%.*]] = insertelement <8 x float*> 
>> [[TMP13]], float* [[TMP8]], i64 7
>> -; AVX512VL-NEXT:    [[TMP15:%.*]] = call <8 x float> 
>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP14]], i32 4, <8 x 
>> i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, 
>> i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>> -; AVX512VL-NEXT:    [[SHUFFLE:%.*]] = shufflevector <8 x float*> 
>> [[TMP9]], <8 x float*> poison, <8 x i32> zeroinitializer
>> -; AVX512VL-NEXT:    [[TMP16:%.*]] = getelementptr float, <8 x 
>> float*> [[SHUFFLE]], <8 x i64> <i64 4, i64 13, i64 11, i64 44, i64 
>> 33, i64 30, i64 27, i64 23>
>> -; AVX512VL-NEXT:    [[TMP17:%.*]] = call <8 x float> 
>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP16]], i32 4, <8 x 
>> i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, 
>> i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>> -; AVX512VL-NEXT:    [[TMP18:%.*]] = fdiv <8 x float> [[TMP15]], 
>> [[TMP17]]
>> +; AVX512VL-NEXT:    [[TMP3:%.*]] = insertelement <8 x float*> 
>> poison, float* [[TMP1:%.*]], i64 0
>> +; AVX512VL-NEXT:    [[SHUFFLE:%.*]] = shufflevector <8 x float*> 
>> [[TMP3]], <8 x float*> poison, <8 x i32> zeroinitializer
>> +; AVX512VL-NEXT:    [[TMP4:%.*]] = getelementptr float, <8 x float*> 
>> [[SHUFFLE]], <8 x i64> <i64 4, i64 13, i64 11, i64 44, i64 33, i64 
>> 30, i64 27, i64 23>
>> +; AVX512VL-NEXT:    [[TMP5:%.*]] = insertelement <4 x float*> 
>> poison, float* [[TMP1]], i64 0
>> +; AVX512VL-NEXT:    [[SHUFFLE1:%.*]] = shufflevector <4 x float*> 
>> [[TMP5]], <4 x float*> poison, <4 x i32> zeroinitializer
>> +; AVX512VL-NEXT:    [[TMP6:%.*]] = getelementptr float, <4 x float*> 
>> [[SHUFFLE1]], <4 x i64> <i64 10, i64 3, i64 14, i64 17>
>> +; AVX512VL-NEXT:    [[TMP7:%.*]] = insertelement <2 x float*> 
>> poison, float* [[TMP1]], i64 0
>> +; AVX512VL-NEXT:    [[TMP8:%.*]] = shufflevector <2 x float*> 
>> [[TMP7]], <2 x float*> poison, <2 x i32> zeroinitializer
>> +; AVX512VL-NEXT:    [[TMP9:%.*]] = getelementptr float, <2 x float*> 
>> [[TMP8]], <2 x i64> <i64 8, i64 5>
>> +; AVX512VL-NEXT:    [[TMP10:%.*]] = getelementptr inbounds float, 
>> float* [[TMP1]], i64 20
>> +; AVX512VL-NEXT:    [[TMP11:%.*]] = shufflevector <4 x float*> 
>> [[TMP6]], <4 x float*> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, 
>> i32 undef, i32 undef, i32 undef, i32 undef>
>> +; AVX512VL-NEXT:    [[TMP12:%.*]] = shufflevector <8 x float*> 
>> [[TMP3]], <8 x float*> [[TMP11]], <8 x i32> <i32 0, i32 8, i32 9, i32 
>> 10, i32 11, i32 undef, i32 undef, i32 undef>
>> +; AVX512VL-NEXT:    [[TMP13:%.*]] = shufflevector <2 x float*> 
>> [[TMP9]], <2 x float*> poison, <8 x i32> <i32 0, i32 1, i32 undef, 
>> i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
>> +; AVX512VL-NEXT:    [[TMP14:%.*]] = shufflevector <8 x float*> 
>> [[TMP12]], <8 x float*> [[TMP13]], <8 x i32> <i32 0, i32 1, i32 2, 
>> i32 3, i32 4, i32 8, i32 9, i32 undef>
>> +; AVX512VL-NEXT:    [[TMP15:%.*]] = insertelement <8 x float*> 
>> [[TMP14]], float* [[TMP10]], i64 7
>> +; AVX512VL-NEXT:    [[TMP16:%.*]] = call <8 x float> 
>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP15]], i32 4, <8 x 
>> i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, 
>> i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>> +; AVX512VL-NEXT:    [[TMP17:%.*]] = call <8 x float> 
>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP4]], i32 4, <8 x 
>> i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, 
>> i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>> +; AVX512VL-NEXT:    [[TMP18:%.*]] = fdiv <8 x float> [[TMP16]], 
>> [[TMP17]]
>>   ; AVX512VL-NEXT:    [[TMP19:%.*]] = bitcast float* [[TMP0:%.*]] to 
>> <8 x float>*
>>   ; AVX512VL-NEXT:    store <8 x float> [[TMP18]], <8 x float>* 
>> [[TMP19]], align 4, !tbaa [[TBAA0]]
>>   ; AVX512VL-NEXT:    ret void
>>
>> diff  --git a/llvm/test/Transforms/SLPVectorizer/X86/pr47629.ll 
>> b/llvm/test/Transforms/SLPVectorizer/X86/pr47629.ll
>> index fd1c612a0696e..47f4391fd3b21 100644
>> --- a/llvm/test/Transforms/SLPVectorizer/X86/pr47629.ll
>> +++ b/llvm/test/Transforms/SLPVectorizer/X86/pr47629.ll
>> @@ -749,47 +749,47 @@ define void @gather_load_div(float* noalias 
>> nocapture %0, float* noalias nocaptu
>>   ; AVX2-NEXT:    ret void
>>   ;
>>   ; AVX512F-LABEL: @gather_load_div(
>> -; AVX512F-NEXT:    [[TMP3:%.*]] = insertelement <4 x float*> poison, 
>> float* [[TMP1:%.*]], i64 0
>> -; AVX512F-NEXT:    [[SHUFFLE1:%.*]] = shufflevector <4 x float*> 
>> [[TMP3]], <4 x float*> poison, <4 x i32> zeroinitializer
>> -; AVX512F-NEXT:    [[TMP4:%.*]] = getelementptr float, <4 x float*> 
>> [[SHUFFLE1]], <4 x i64> <i64 10, i64 3, i64 14, i64 17>
>> -; AVX512F-NEXT:    [[TMP5:%.*]] = insertelement <2 x float*> poison, 
>> float* [[TMP1]], i64 0
>> -; AVX512F-NEXT:    [[TMP6:%.*]] = shufflevector <2 x float*> 
>> [[TMP5]], <2 x float*> poison, <2 x i32> zeroinitializer
>> -; AVX512F-NEXT:    [[TMP7:%.*]] = getelementptr float, <2 x float*> 
>> [[TMP6]], <2 x i64> <i64 8, i64 5>
>> -; AVX512F-NEXT:    [[TMP8:%.*]] = getelementptr inbounds float, 
>> float* [[TMP1]], i64 20
>> -; AVX512F-NEXT:    [[TMP9:%.*]] = insertelement <8 x float*> poison, 
>> float* [[TMP1]], i64 0
>> -; AVX512F-NEXT:    [[TMP10:%.*]] = shufflevector <4 x float*> 
>> [[TMP4]], <4 x float*> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, 
>> i32 undef, i32 undef, i32 undef, i32 undef>
>> -; AVX512F-NEXT:    [[TMP11:%.*]] = shufflevector <8 x float*> 
>> [[TMP9]], <8 x float*> [[TMP10]], <8 x i32> <i32 0, i32 8, i32 9, i32 
>> 10, i32 11, i32 undef, i32 undef, i32 undef>
>> -; AVX512F-NEXT:    [[TMP12:%.*]] = shufflevector <2 x float*> 
>> [[TMP7]], <2 x float*> poison, <8 x i32> <i32 0, i32 1, i32 undef, 
>> i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
>> -; AVX512F-NEXT:    [[TMP13:%.*]] = shufflevector <8 x float*> 
>> [[TMP11]], <8 x float*> [[TMP12]], <8 x i32> <i32 0, i32 1, i32 2, 
>> i32 3, i32 4, i32 8, i32 9, i32 undef>
>> -; AVX512F-NEXT:    [[TMP14:%.*]] = insertelement <8 x float*> 
>> [[TMP13]], float* [[TMP8]], i64 7
>> -; AVX512F-NEXT:    [[TMP15:%.*]] = call <8 x float> 
>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP14]], i32 4, <8 x 
>> i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, 
>> i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>> -; AVX512F-NEXT:    [[SHUFFLE:%.*]] = shufflevector <8 x float*> 
>> [[TMP9]], <8 x float*> poison, <8 x i32> zeroinitializer
>> -; AVX512F-NEXT:    [[TMP16:%.*]] = getelementptr float, <8 x float*> 
>> [[SHUFFLE]], <8 x i64> <i64 4, i64 13, i64 11, i64 44, i64 33, i64 
>> 30, i64 27, i64 23>
>> -; AVX512F-NEXT:    [[TMP17:%.*]] = call <8 x float> 
>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP16]], i32 4, <8 x 
>> i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, 
>> i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>> -; AVX512F-NEXT:    [[TMP18:%.*]] = fdiv <8 x float> [[TMP15]], 
>> [[TMP17]]
>> +; AVX512F-NEXT:    [[TMP3:%.*]] = insertelement <8 x float*> poison, 
>> float* [[TMP1:%.*]], i64 0
>> +; AVX512F-NEXT:    [[SHUFFLE:%.*]] = shufflevector <8 x float*> 
>> [[TMP3]], <8 x float*> poison, <8 x i32> zeroinitializer
>> +; AVX512F-NEXT:    [[TMP4:%.*]] = getelementptr float, <8 x float*> 
>> [[SHUFFLE]], <8 x i64> <i64 4, i64 13, i64 11, i64 44, i64 33, i64 
>> 30, i64 27, i64 23>
>> +; AVX512F-NEXT:    [[TMP5:%.*]] = insertelement <4 x float*> poison, 
>> float* [[TMP1]], i64 0
>> +; AVX512F-NEXT:    [[SHUFFLE1:%.*]] = shufflevector <4 x float*> 
>> [[TMP5]], <4 x float*> poison, <4 x i32> zeroinitializer
>> +; AVX512F-NEXT:    [[TMP6:%.*]] = getelementptr float, <4 x float*> 
>> [[SHUFFLE1]], <4 x i64> <i64 10, i64 3, i64 14, i64 17>
>> +; AVX512F-NEXT:    [[TMP7:%.*]] = insertelement <2 x float*> poison, 
>> float* [[TMP1]], i64 0
>> +; AVX512F-NEXT:    [[TMP8:%.*]] = shufflevector <2 x float*> 
>> [[TMP7]], <2 x float*> poison, <2 x i32> zeroinitializer
>> +; AVX512F-NEXT:    [[TMP9:%.*]] = getelementptr float, <2 x float*> 
>> [[TMP8]], <2 x i64> <i64 8, i64 5>
>> +; AVX512F-NEXT:    [[TMP10:%.*]] = getelementptr inbounds float, 
>> float* [[TMP1]], i64 20
>> +; AVX512F-NEXT:    [[TMP11:%.*]] = shufflevector <4 x float*> 
>> [[TMP6]], <4 x float*> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, 
>> i32 undef, i32 undef, i32 undef, i32 undef>
>> +; AVX512F-NEXT:    [[TMP12:%.*]] = shufflevector <8 x float*> 
>> [[TMP3]], <8 x float*> [[TMP11]], <8 x i32> <i32 0, i32 8, i32 9, i32 
>> 10, i32 11, i32 undef, i32 undef, i32 undef>
>> +; AVX512F-NEXT:    [[TMP13:%.*]] = shufflevector <2 x float*> 
>> [[TMP9]], <2 x float*> poison, <8 x i32> <i32 0, i32 1, i32 undef, 
>> i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
>> +; AVX512F-NEXT:    [[TMP14:%.*]] = shufflevector <8 x float*> 
>> [[TMP12]], <8 x float*> [[TMP13]], <8 x i32> <i32 0, i32 1, i32 2, 
>> i32 3, i32 4, i32 8, i32 9, i32 undef>
>> +; AVX512F-NEXT:    [[TMP15:%.*]] = insertelement <8 x float*> 
>> [[TMP14]], float* [[TMP10]], i64 7
>> +; AVX512F-NEXT:    [[TMP16:%.*]] = call <8 x float> 
>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP15]], i32 4, <8 x 
>> i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, 
>> i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>> +; AVX512F-NEXT:    [[TMP17:%.*]] = call <8 x float> 
>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP4]], i32 4, <8 x 
>> i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, 
>> i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>> +; AVX512F-NEXT:    [[TMP18:%.*]] = fdiv <8 x float> [[TMP16]], 
>> [[TMP17]]
>>   ; AVX512F-NEXT:    [[TMP19:%.*]] = bitcast float* [[TMP0:%.*]] to 
>> <8 x float>*
>>   ; AVX512F-NEXT:    store <8 x float> [[TMP18]], <8 x float>* 
>> [[TMP19]], align 4, !tbaa [[TBAA0]]
>>   ; AVX512F-NEXT:    ret void
>>   ;
>>   ; AVX512VL-LABEL: @gather_load_div(
>> -; AVX512VL-NEXT:    [[TMP3:%.*]] = insertelement <4 x float*> 
>> poison, float* [[TMP1:%.*]], i64 0
>> -; AVX512VL-NEXT:    [[SHUFFLE1:%.*]] = shufflevector <4 x float*> 
>> [[TMP3]], <4 x float*> poison, <4 x i32> zeroinitializer
>> -; AVX512VL-NEXT:    [[TMP4:%.*]] = getelementptr float, <4 x float*> 
>> [[SHUFFLE1]], <4 x i64> <i64 10, i64 3, i64 14, i64 17>
>> -; AVX512VL-NEXT:    [[TMP5:%.*]] = insertelement <2 x float*> 
>> poison, float* [[TMP1]], i64 0
>> -; AVX512VL-NEXT:    [[TMP6:%.*]] = shufflevector <2 x float*> 
>> [[TMP5]], <2 x float*> poison, <2 x i32> zeroinitializer
>> -; AVX512VL-NEXT:    [[TMP7:%.*]] = getelementptr float, <2 x float*> 
>> [[TMP6]], <2 x i64> <i64 8, i64 5>
>> -; AVX512VL-NEXT:    [[TMP8:%.*]] = getelementptr inbounds float, 
>> float* [[TMP1]], i64 20
>> -; AVX512VL-NEXT:    [[TMP9:%.*]] = insertelement <8 x float*> 
>> poison, float* [[TMP1]], i64 0
>> -; AVX512VL-NEXT:    [[TMP10:%.*]] = shufflevector <4 x float*> 
>> [[TMP4]], <4 x float*> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, 
>> i32 undef, i32 undef, i32 undef, i32 undef>
>> -; AVX512VL-NEXT:    [[TMP11:%.*]] = shufflevector <8 x float*> 
>> [[TMP9]], <8 x float*> [[TMP10]], <8 x i32> <i32 0, i32 8, i32 9, i32 
>> 10, i32 11, i32 undef, i32 undef, i32 undef>
>> -; AVX512VL-NEXT:    [[TMP12:%.*]] = shufflevector <2 x float*> 
>> [[TMP7]], <2 x float*> poison, <8 x i32> <i32 0, i32 1, i32 undef, 
>> i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
>> -; AVX512VL-NEXT:    [[TMP13:%.*]] = shufflevector <8 x float*> 
>> [[TMP11]], <8 x float*> [[TMP12]], <8 x i32> <i32 0, i32 1, i32 2, 
>> i32 3, i32 4, i32 8, i32 9, i32 undef>
>> -; AVX512VL-NEXT:    [[TMP14:%.*]] = insertelement <8 x float*> 
>> [[TMP13]], float* [[TMP8]], i64 7
>> -; AVX512VL-NEXT:    [[TMP15:%.*]] = call <8 x float> 
>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP14]], i32 4, <8 x 
>> i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, 
>> i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>> -; AVX512VL-NEXT:    [[SHUFFLE:%.*]] = shufflevector <8 x float*> 
>> [[TMP9]], <8 x float*> poison, <8 x i32> zeroinitializer
>> -; AVX512VL-NEXT:    [[TMP16:%.*]] = getelementptr float, <8 x 
>> float*> [[SHUFFLE]], <8 x i64> <i64 4, i64 13, i64 11, i64 44, i64 
>> 33, i64 30, i64 27, i64 23>
>> -; AVX512VL-NEXT:    [[TMP17:%.*]] = call <8 x float> 
>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP16]], i32 4, <8 x 
>> i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, 
>> i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>> -; AVX512VL-NEXT:    [[TMP18:%.*]] = fdiv <8 x float> [[TMP15]], 
>> [[TMP17]]
>> +; AVX512VL-NEXT:    [[TMP3:%.*]] = insertelement <8 x float*> 
>> poison, float* [[TMP1:%.*]], i64 0
>> +; AVX512VL-NEXT:    [[SHUFFLE:%.*]] = shufflevector <8 x float*> 
>> [[TMP3]], <8 x float*> poison, <8 x i32> zeroinitializer
>> +; AVX512VL-NEXT:    [[TMP4:%.*]] = getelementptr float, <8 x float*> 
>> [[SHUFFLE]], <8 x i64> <i64 4, i64 13, i64 11, i64 44, i64 33, i64 
>> 30, i64 27, i64 23>
>> +; AVX512VL-NEXT:    [[TMP5:%.*]] = insertelement <4 x float*> 
>> poison, float* [[TMP1]], i64 0
>> +; AVX512VL-NEXT:    [[SHUFFLE1:%.*]] = shufflevector <4 x float*> 
>> [[TMP5]], <4 x float*> poison, <4 x i32> zeroinitializer
>> +; AVX512VL-NEXT:    [[TMP6:%.*]] = getelementptr float, <4 x float*> 
>> [[SHUFFLE1]], <4 x i64> <i64 10, i64 3, i64 14, i64 17>
>> +; AVX512VL-NEXT:    [[TMP7:%.*]] = insertelement <2 x float*> 
>> poison, float* [[TMP1]], i64 0
>> +; AVX512VL-NEXT:    [[TMP8:%.*]] = shufflevector <2 x float*> 
>> [[TMP7]], <2 x float*> poison, <2 x i32> zeroinitializer
>> +; AVX512VL-NEXT:    [[TMP9:%.*]] = getelementptr float, <2 x float*> 
>> [[TMP8]], <2 x i64> <i64 8, i64 5>
>> +; AVX512VL-NEXT:    [[TMP10:%.*]] = getelementptr inbounds float, 
>> float* [[TMP1]], i64 20
>> +; AVX512VL-NEXT:    [[TMP11:%.*]] = shufflevector <4 x float*> 
>> [[TMP6]], <4 x float*> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, 
>> i32 undef, i32 undef, i32 undef, i32 undef>
>> +; AVX512VL-NEXT:    [[TMP12:%.*]] = shufflevector <8 x float*> 
>> [[TMP3]], <8 x float*> [[TMP11]], <8 x i32> <i32 0, i32 8, i32 9, i32 
>> 10, i32 11, i32 undef, i32 undef, i32 undef>
>> +; AVX512VL-NEXT:    [[TMP13:%.*]] = shufflevector <2 x float*> 
>> [[TMP9]], <2 x float*> poison, <8 x i32> <i32 0, i32 1, i32 undef, 
>> i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
>> +; AVX512VL-NEXT:    [[TMP14:%.*]] = shufflevector <8 x float*> 
>> [[TMP12]], <8 x float*> [[TMP13]], <8 x i32> <i32 0, i32 1, i32 2, 
>> i32 3, i32 4, i32 8, i32 9, i32 undef>
>> +; AVX512VL-NEXT:    [[TMP15:%.*]] = insertelement <8 x float*> 
>> [[TMP14]], float* [[TMP10]], i64 7
>> +; AVX512VL-NEXT:    [[TMP16:%.*]] = call <8 x float> 
>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP15]], i32 4, <8 x 
>> i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, 
>> i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>> +; AVX512VL-NEXT:    [[TMP17:%.*]] = call <8 x float> 
>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP4]], i32 4, <8 x 
>> i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, 
>> i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>> +; AVX512VL-NEXT:    [[TMP18:%.*]] = fdiv <8 x float> [[TMP16]], 
>> [[TMP17]]
>>   ; AVX512VL-NEXT:    [[TMP19:%.*]] = bitcast float* [[TMP0:%.*]] to 
>> <8 x float>*
>>   ; AVX512VL-NEXT:    store <8 x float> [[TMP18]], <8 x float>* 
>> [[TMP19]], align 4, !tbaa [[TBAA0]]
>>   ; AVX512VL-NEXT:    ret void
>>
>> diff  --git 
>> a/llvm/test/Transforms/SLPVectorizer/X86/shrink_after_reorder2.ll 
>> b/llvm/test/Transforms/SLPVectorizer/X86/shrink_after_reorder2.ll
>> index a4a388e9d095c..6946ab292cdf5 100644
>> --- a/llvm/test/Transforms/SLPVectorizer/X86/shrink_after_reorder2.ll
>> +++ b/llvm/test/Transforms/SLPVectorizer/X86/shrink_after_reorder2.ll
>> @@ -21,11 +21,11 @@ define void @foo(%class.e* %this, %struct.a* %p, 
>> i32 %add7) {
>>   ; CHECK-NEXT:    i32 2, label [[SW_BB]]
>>   ; CHECK-NEXT:    ]
>>   ; CHECK:       sw.bb:
>> -; CHECK-NEXT:    [[TMP2:%.*]] = bitcast i32* [[G]] to <2 x i32>*
>> -; CHECK-NEXT:    [[TMP3:%.*]] = load <2 x i32>, <2 x i32>* [[TMP2]], 
>> align 4
>>   ; CHECK-NEXT:    [[SHRINK_SHUFFLE:%.*]] = shufflevector <4 x i32> 
>> [[SHUFFLE]], <4 x i32> poison, <2 x i32> <i32 2, i32 0>
>> -; CHECK-NEXT:    [[TMP4:%.*]] = xor <2 x i32> [[SHRINK_SHUFFLE]], 
>> <i32 -1, i32 -1>
>> -; CHECK-NEXT:    [[TMP5:%.*]] = add <2 x i32> [[TMP3]], [[TMP4]]
>> +; CHECK-NEXT:    [[TMP2:%.*]] = xor <2 x i32> [[SHRINK_SHUFFLE]], 
>> <i32 -1, i32 -1>
>> +; CHECK-NEXT:    [[TMP3:%.*]] = bitcast i32* [[G]] to <2 x i32>*
>> +; CHECK-NEXT:    [[TMP4:%.*]] = load <2 x i32>, <2 x i32>* [[TMP3]], 
>> align 4
>> +; CHECK-NEXT:    [[TMP5:%.*]] = add <2 x i32> [[TMP4]], [[TMP2]]
>>   ; CHECK-NEXT:    br label [[SW_EPILOG]]
>>   ; CHECK:       sw.epilog:
>>   ; CHECK-NEXT:    [[TMP6:%.*]] = phi <2 x i32> [ undef, 
>> [[ENTRY:%.*]] ], [ [[TMP5]], [[SW_BB]] ]
>>
>> diff  --git 
>> a/llvm/test/Transforms/SLPVectorizer/X86/vectorize-widest-phis.ll 
>> b/llvm/test/Transforms/SLPVectorizer/X86/vectorize-widest-phis.ll
>> index 87709a87b3692..109c27e4f4f4e 100644
>> --- a/llvm/test/Transforms/SLPVectorizer/X86/vectorize-widest-phis.ll
>> +++ b/llvm/test/Transforms/SLPVectorizer/X86/vectorize-widest-phis.ll
>> @@ -16,8 +16,8 @@ define void @foo() {
>>   ; CHECK-NEXT:    [[TMP3:%.*]] = load double, double* undef, align 8
>>   ; CHECK-NEXT:    br i1 undef, label [[BB3]], label [[BB4:%.*]]
>>   ; CHECK:       bb4:
>> -; CHECK-NEXT:    [[CONV2:%.*]] = uitofp i16 undef to double
>>   ; CHECK-NEXT:    [[TMP4:%.*]] = fpext <4 x float> [[TMP2]] to <4 x 
>> double>
>> +; CHECK-NEXT:    [[CONV2:%.*]] = uitofp i16 undef to double
>>   ; CHECK-NEXT:    [[TMP5:%.*]] = insertelement <2 x double> <double 
>> undef, double poison>, double [[TMP3]], i32 1
>>   ; CHECK-NEXT:    [[TMP6:%.*]] = insertelement <2 x double> <double 
>> undef, double poison>, double [[CONV2]], i32 1
>>   ; CHECK-NEXT:    [[TMP7:%.*]] = fsub <2 x double> [[TMP5]], [[TMP6]]
>>
>> diff  --git a/llvm/test/Transforms/SLPVectorizer/slp-max-phi-size.ll 
>> b/llvm/test/Transforms/SLPVectorizer/slp-max-phi-size.ll
>> index 33ba97921e878..da18a937a6477 100644
>> --- a/llvm/test/Transforms/SLPVectorizer/slp-max-phi-size.ll
>> +++ b/llvm/test/Transforms/SLPVectorizer/slp-max-phi-size.ll
>> @@ -133,27 +133,27 @@ define void @phi_float32(half %hval, float 
>> %fval) {
>>   ; MAX256-NEXT:    br label [[BB1:%.*]]
>>   ; MAX256:       bb1:
>>   ; MAX256-NEXT:    [[I:%.*]] = fpext half [[HVAL:%.*]] to float
>> -; MAX256-NEXT:    [[I3:%.*]] = fpext half [[HVAL]] to float
>> -; MAX256-NEXT:    [[I6:%.*]] = fpext half [[HVAL]] to float
>> -; MAX256-NEXT:    [[I9:%.*]] = fpext half [[HVAL]] to float
>>   ; MAX256-NEXT:    [[TMP0:%.*]] = insertelement <8 x float> poison, 
>> float [[I]], i32 0
>>   ; MAX256-NEXT:    [[SHUFFLE11:%.*]] = shufflevector <8 x float> 
>> [[TMP0]], <8 x float> poison, <8 x i32> zeroinitializer
>>   ; MAX256-NEXT:    [[TMP1:%.*]] = insertelement <8 x float> poison, 
>> float [[FVAL:%.*]], i32 0
>>   ; MAX256-NEXT:    [[SHUFFLE12:%.*]] = shufflevector <8 x float> 
>> [[TMP1]], <8 x float> poison, <8 x i32> zeroinitializer
>>   ; MAX256-NEXT:    [[TMP2:%.*]] = fmul <8 x float> [[SHUFFLE11]], 
>> [[SHUFFLE12]]
>> -; MAX256-NEXT:    [[TMP3:%.*]] = fadd <8 x float> zeroinitializer, 
>> [[TMP2]]
>> -; MAX256-NEXT:    [[TMP4:%.*]] = insertelement <8 x float> poison, 
>> float [[I3]], i32 0
>> -; MAX256-NEXT:    [[SHUFFLE:%.*]] = shufflevector <8 x float> 
>> [[TMP4]], <8 x float> poison, <8 x i32> zeroinitializer
>> -; MAX256-NEXT:    [[TMP5:%.*]] = fmul <8 x float> [[SHUFFLE]], 
>> [[SHUFFLE12]]
>> -; MAX256-NEXT:    [[TMP6:%.*]] = fadd <8 x float> zeroinitializer, 
>> [[TMP5]]
>> -; MAX256-NEXT:    [[TMP7:%.*]] = insertelement <8 x float> poison, 
>> float [[I6]], i32 0
>> -; MAX256-NEXT:    [[SHUFFLE5:%.*]] = shufflevector <8 x float> 
>> [[TMP7]], <8 x float> poison, <8 x i32> zeroinitializer
>> -; MAX256-NEXT:    [[TMP8:%.*]] = fmul <8 x float> [[SHUFFLE5]], 
>> [[SHUFFLE12]]
>> -; MAX256-NEXT:    [[TMP9:%.*]] = fadd <8 x float> zeroinitializer, 
>> [[TMP8]]
>> -; MAX256-NEXT:    [[TMP10:%.*]] = insertelement <8 x float> poison, 
>> float [[I9]], i32 0
>> -; MAX256-NEXT:    [[SHUFFLE8:%.*]] = shufflevector <8 x float> 
>> [[TMP10]], <8 x float> poison, <8 x i32> zeroinitializer
>> -; MAX256-NEXT:    [[TMP11:%.*]] = fmul <8 x float> [[SHUFFLE8]], 
>> [[SHUFFLE12]]
>> -; MAX256-NEXT:    [[TMP12:%.*]] = fadd <8 x float> zeroinitializer, 
>> [[TMP11]]
>> +; MAX256-NEXT:    [[I3:%.*]] = fpext half [[HVAL]] to float
>> +; MAX256-NEXT:    [[TMP3:%.*]] = insertelement <8 x float> poison, 
>> float [[I3]], i32 0
>> +; MAX256-NEXT:    [[SHUFFLE:%.*]] = shufflevector <8 x float> 
>> [[TMP3]], <8 x float> poison, <8 x i32> zeroinitializer
>> +; MAX256-NEXT:    [[TMP4:%.*]] = fmul <8 x float> [[SHUFFLE]], 
>> [[SHUFFLE12]]
>> +; MAX256-NEXT:    [[I6:%.*]] = fpext half [[HVAL]] to float
>> +; MAX256-NEXT:    [[TMP5:%.*]] = insertelement <8 x float> poison, 
>> float [[I6]], i32 0
>> +; MAX256-NEXT:    [[SHUFFLE5:%.*]] = shufflevector <8 x float> 
>> [[TMP5]], <8 x float> poison, <8 x i32> zeroinitializer
>> +; MAX256-NEXT:    [[TMP6:%.*]] = fmul <8 x float> [[SHUFFLE5]], 
>> [[SHUFFLE12]]
>> +; MAX256-NEXT:    [[I9:%.*]] = fpext half [[HVAL]] to float
>> +; MAX256-NEXT:    [[TMP7:%.*]] = insertelement <8 x float> poison, 
>> float [[I9]], i32 0
>> +; MAX256-NEXT:    [[SHUFFLE8:%.*]] = shufflevector <8 x float> 
>> [[TMP7]], <8 x float> poison, <8 x i32> zeroinitializer
>> +; MAX256-NEXT:    [[TMP8:%.*]] = fmul <8 x float> [[SHUFFLE8]], 
>> [[SHUFFLE12]]
>> +; MAX256-NEXT:    [[TMP9:%.*]] = fadd <8 x float> zeroinitializer, 
>> [[TMP2]]
>> +; MAX256-NEXT:    [[TMP10:%.*]] = fadd <8 x float> zeroinitializer, 
>> [[TMP4]]
>> +; MAX256-NEXT:    [[TMP11:%.*]] = fadd <8 x float> zeroinitializer, 
>> [[TMP6]]
>> +; MAX256-NEXT:    [[TMP12:%.*]] = fadd <8 x float> zeroinitializer, 
>> [[TMP8]]
>>   ; MAX256-NEXT:    switch i32 undef, label [[BB5:%.*]] [
>>   ; MAX256-NEXT:    i32 0, label [[BB2:%.*]]
>>   ; MAX256-NEXT:    i32 1, label [[BB3:%.*]]
>> @@ -166,10 +166,10 @@ define void @phi_float32(half %hval, float 
>> %fval) {
>>   ; MAX256:       bb5:
>>   ; MAX256-NEXT:    br label [[BB2]]
>>   ; MAX256:       bb2:
>> -; MAX256-NEXT:    [[TMP13:%.*]] = phi <8 x float> [ [[TMP6]], 
>> [[BB3]] ], [ [[SHUFFLE12]], [[BB4]] ], [ [[SHUFFLE12]], [[BB5]] ], [ 
>> [[SHUFFLE12]], [[BB1]] ]
>> -; MAX256-NEXT:    [[TMP14:%.*]] = phi <8 x float> [ [[TMP9]], 
>> [[BB3]] ], [ [[SHUFFLE12]], [[BB4]] ], [ [[TMP9]], [[BB5]] ], [ 
>> [[TMP9]], [[BB1]] ]
>> +; MAX256-NEXT:    [[TMP13:%.*]] = phi <8 x float> [ [[TMP10]], 
>> [[BB3]] ], [ [[SHUFFLE12]], [[BB4]] ], [ [[SHUFFLE12]], [[BB5]] ], [ 
>> [[SHUFFLE12]], [[BB1]] ]
>> +; MAX256-NEXT:    [[TMP14:%.*]] = phi <8 x float> [ [[TMP11]], 
>> [[BB3]] ], [ [[SHUFFLE12]], [[BB4]] ], [ [[TMP11]], [[BB5]] ], [ 
>> [[TMP11]], [[BB1]] ]
>>   ; MAX256-NEXT:    [[TMP15:%.*]] = phi <8 x float> [ [[TMP12]], 
>> [[BB3]] ], [ [[TMP12]], [[BB4]] ], [ [[SHUFFLE12]], [[BB5]] ], [ 
>> [[TMP12]], [[BB1]] ]
>> -; MAX256-NEXT:    [[TMP16:%.*]] = phi <8 x float> [ [[TMP3]], 
>> [[BB3]] ], [ [[TMP3]], [[BB4]] ], [ [[TMP3]], [[BB5]] ], [ 
>> [[SHUFFLE12]], [[BB1]] ]
>> +; MAX256-NEXT:    [[TMP16:%.*]] = phi <8 x float> [ [[TMP9]], 
>> [[BB3]] ], [ [[TMP9]], [[BB4]] ], [ [[TMP9]], [[BB5]] ], [ 
>> [[SHUFFLE12]], [[BB1]] ]
>>   ; MAX256-NEXT:    [[TMP17:%.*]] = extractelement <8 x float> 
>> [[TMP14]], i32 7
>>   ; MAX256-NEXT:    store float [[TMP17]], float* undef, align 4
>>   ; MAX256-NEXT:    ret void
>> @@ -179,27 +179,27 @@ define void @phi_float32(half %hval, float 
>> %fval) {
>>   ; MAX1024-NEXT:    br label [[BB1:%.*]]
>>   ; MAX1024:       bb1:
>>   ; MAX1024-NEXT:    [[I:%.*]] = fpext half [[HVAL:%.*]] to float
>> -; MAX1024-NEXT:    [[I3:%.*]] = fpext half [[HVAL]] to float
>> -; MAX1024-NEXT:    [[I6:%.*]] = fpext half [[HVAL]] to float
>> -; MAX1024-NEXT:    [[I9:%.*]] = fpext half [[HVAL]] to float
>>   ; MAX1024-NEXT:    [[TMP0:%.*]] = insertelement <8 x float> poison, 
>> float [[I]], i32 0
>>   ; MAX1024-NEXT:    [[SHUFFLE11:%.*]] = shufflevector <8 x float> 
>> [[TMP0]], <8 x float> poison, <8 x i32> zeroinitializer
>>   ; MAX1024-NEXT:    [[TMP1:%.*]] = insertelement <8 x float> poison, 
>> float [[FVAL:%.*]], i32 0
>>   ; MAX1024-NEXT:    [[SHUFFLE12:%.*]] = shufflevector <8 x float> 
>> [[TMP1]], <8 x float> poison, <8 x i32> zeroinitializer
>>   ; MAX1024-NEXT:    [[TMP2:%.*]] = fmul <8 x float> [[SHUFFLE11]], 
>> [[SHUFFLE12]]
>> -; MAX1024-NEXT:    [[TMP3:%.*]] = fadd <8 x float> zeroinitializer, 
>> [[TMP2]]
>> -; MAX1024-NEXT:    [[TMP4:%.*]] = insertelement <8 x float> poison, 
>> float [[I3]], i32 0
>> -; MAX1024-NEXT:    [[SHUFFLE:%.*]] = shufflevector <8 x float> 
>> [[TMP4]], <8 x float> poison, <8 x i32> zeroinitializer
>> -; MAX1024-NEXT:    [[TMP5:%.*]] = fmul <8 x float> [[SHUFFLE]], 
>> [[SHUFFLE12]]
>> -; MAX1024-NEXT:    [[TMP6:%.*]] = fadd <8 x float> zeroinitializer, 
>> [[TMP5]]
>> -; MAX1024-NEXT:    [[TMP7:%.*]] = insertelement <8 x float> poison, 
>> float [[I6]], i32 0
>> -; MAX1024-NEXT:    [[SHUFFLE5:%.*]] = shufflevector <8 x float> 
>> [[TMP7]], <8 x float> poison, <8 x i32> zeroinitializer
>> -; MAX1024-NEXT:    [[TMP8:%.*]] = fmul <8 x float> [[SHUFFLE5]], 
>> [[SHUFFLE12]]
>> -; MAX1024-NEXT:    [[TMP9:%.*]] = fadd <8 x float> zeroinitializer, 
>> [[TMP8]]
>> -; MAX1024-NEXT:    [[TMP10:%.*]] = insertelement <8 x float> poison, 
>> float [[I9]], i32 0
>> -; MAX1024-NEXT:    [[SHUFFLE8:%.*]] = shufflevector <8 x float> 
>> [[TMP10]], <8 x float> poison, <8 x i32> zeroinitializer
>> -; MAX1024-NEXT:    [[TMP11:%.*]] = fmul <8 x float> [[SHUFFLE8]], 
>> [[SHUFFLE12]]
>> -; MAX1024-NEXT:    [[TMP12:%.*]] = fadd <8 x float> zeroinitializer, 
>> [[TMP11]]
>> +; MAX1024-NEXT:    [[I3:%.*]] = fpext half [[HVAL]] to float
>> +; MAX1024-NEXT:    [[TMP3:%.*]] = insertelement <8 x float> poison, 
>> float [[I3]], i32 0
>> +; MAX1024-NEXT:    [[SHUFFLE:%.*]] = shufflevector <8 x float> 
>> [[TMP3]], <8 x float> poison, <8 x i32> zeroinitializer
>> +; MAX1024-NEXT:    [[TMP4:%.*]] = fmul <8 x float> [[SHUFFLE]], 
>> [[SHUFFLE12]]
>> +; MAX1024-NEXT:    [[I6:%.*]] = fpext half [[HVAL]] to float
>> +; MAX1024-NEXT:    [[TMP5:%.*]] = insertelement <8 x float> poison, 
>> float [[I6]], i32 0
>> +; MAX1024-NEXT:    [[SHUFFLE5:%.*]] = shufflevector <8 x float> 
>> [[TMP5]], <8 x float> poison, <8 x i32> zeroinitializer
>> +; MAX1024-NEXT:    [[TMP6:%.*]] = fmul <8 x float> [[SHUFFLE5]], 
>> [[SHUFFLE12]]
>> +; MAX1024-NEXT:    [[I9:%.*]] = fpext half [[HVAL]] to float
>> +; MAX1024-NEXT:    [[TMP7:%.*]] = insertelement <8 x float> poison, 
>> float [[I9]], i32 0
>> +; MAX1024-NEXT:    [[SHUFFLE8:%.*]] = shufflevector <8 x float> 
>> [[TMP7]], <8 x float> poison, <8 x i32> zeroinitializer
>> +; MAX1024-NEXT:    [[TMP8:%.*]] = fmul <8 x float> [[SHUFFLE8]], 
>> [[SHUFFLE12]]
>> +; MAX1024-NEXT:    [[TMP9:%.*]] = fadd <8 x float> zeroinitializer, 
>> [[TMP2]]
>> +; MAX1024-NEXT:    [[TMP10:%.*]] = fadd <8 x float> zeroinitializer, 
>> [[TMP4]]
>> +; MAX1024-NEXT:    [[TMP11:%.*]] = fadd <8 x float> zeroinitializer, 
>> [[TMP6]]
>> +; MAX1024-NEXT:    [[TMP12:%.*]] = fadd <8 x float> zeroinitializer, 
>> [[TMP8]]
>>   ; MAX1024-NEXT:    switch i32 undef, label [[BB5:%.*]] [
>>   ; MAX1024-NEXT:    i32 0, label [[BB2:%.*]]
>>   ; MAX1024-NEXT:    i32 1, label [[BB3:%.*]]
>> @@ -212,10 +212,10 @@ define void @phi_float32(half %hval, float 
>> %fval) {
>>   ; MAX1024:       bb5:
>>   ; MAX1024-NEXT:    br label [[BB2]]
>>   ; MAX1024:       bb2:
>> -; MAX1024-NEXT:    [[TMP13:%.*]] = phi <8 x float> [ [[TMP6]], 
>> [[BB3]] ], [ [[SHUFFLE12]], [[BB4]] ], [ [[SHUFFLE12]], [[BB5]] ], [ 
>> [[SHUFFLE12]], [[BB1]] ]
>> -; MAX1024-NEXT:    [[TMP14:%.*]] = phi <8 x float> [ [[TMP9]], 
>> [[BB3]] ], [ [[SHUFFLE12]], [[BB4]] ], [ [[TMP9]], [[BB5]] ], [ 
>> [[TMP9]], [[BB1]] ]
>> +; MAX1024-NEXT:    [[TMP13:%.*]] = phi <8 x float> [ [[TMP10]], 
>> [[BB3]] ], [ [[SHUFFLE12]], [[BB4]] ], [ [[SHUFFLE12]], [[BB5]] ], [ 
>> [[SHUFFLE12]], [[BB1]] ]
>> +; MAX1024-NEXT:    [[TMP14:%.*]] = phi <8 x float> [ [[TMP11]], 
>> [[BB3]] ], [ [[SHUFFLE12]], [[BB4]] ], [ [[TMP11]], [[BB5]] ], [ 
>> [[TMP11]], [[BB1]] ]
>>   ; MAX1024-NEXT:    [[TMP15:%.*]] = phi <8 x float> [ [[TMP12]], 
>> [[BB3]] ], [ [[TMP12]], [[BB4]] ], [ [[SHUFFLE12]], [[BB5]] ], [ 
>> [[TMP12]], [[BB1]] ]
>> -; MAX1024-NEXT:    [[TMP16:%.*]] = phi <8 x float> [ [[TMP3]], 
>> [[BB3]] ], [ [[TMP3]], [[BB4]] ], [ [[TMP3]], [[BB5]] ], [ 
>> [[SHUFFLE12]], [[BB1]] ]
>> +; MAX1024-NEXT:    [[TMP16:%.*]] = phi <8 x float> [ [[TMP9]], 
>> [[BB3]] ], [ [[TMP9]], [[BB4]] ], [ [[TMP9]], [[BB5]] ], [ 
>> [[SHUFFLE12]], [[BB1]] ]
>>   ; MAX1024-NEXT:    [[TMP17:%.*]] = extractelement <8 x float> 
>> [[TMP14]], i32 7
>>   ; MAX1024-NEXT:    store float [[TMP17]], float* undef, align 4
>>   ; MAX1024-NEXT:    ret void
>>
>>

