[llvm] d65cc85 - [SLP]Do not schedule instructions with constants/argument/phi operands and external users.
Philip Reames via llvm-commits
llvm-commits at lists.llvm.org
Sat Mar 19 10:07:12 PDT 2022
Ok, this is definitely wrong. But so is the existing code. I plan on
fixing the generic case shortly, but I'm going to leave your special
case to you to fix or revert. I don't understand the invariants of this
patch enough to be comfortable making a fix.
Here's a test case for the special case you added (also committed in
bdbcca61):
; Variant of test10 with block-invariant operands to the udivs
; FIXME: This is wrong, we're hoisting a faulting udiv above an infinite loop.
define void @test11(i64 %x, i64 %y, i64* %b, i64* %c) {
; CHECK-LABEL: @test11(
; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x i64> poison, i64
[[X:%.*]], i32 0
; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x i64> [[TMP1]], i64
[[Y:%.*]], i32 1
; CHECK-NEXT: [[TMP3:%.*]] = udiv <2 x i64> <i64 200, i64 200>, [[TMP2]]
; CHECK-NEXT: [[TMP4:%.*]] = extractelement <2 x i64> [[TMP3]], i32 0
; CHECK-NEXT: store i64 [[TMP4]], i64* [[B:%.*]], align 4
; CHECK-NEXT: [[TMP5:%.*]] = call i64 @may_inf_loop_ro()
; CHECK-NEXT: [[CA2:%.*]] = getelementptr i64, i64* [[C:%.*]], i32 1
; CHECK-NEXT: [[TMP6:%.*]] = bitcast i64* [[C]] to <2 x i64>*
; CHECK-NEXT: [[TMP7:%.*]] = load <2 x i64>, <2 x i64>* [[TMP6]], align 4
; CHECK-NEXT: [[TMP8:%.*]] = add <2 x i64> [[TMP3]], [[TMP7]]
; CHECK-NEXT: [[B2:%.*]] = getelementptr i64, i64* [[B]], i32 1
; CHECK-NEXT: [[TMP9:%.*]] = bitcast i64* [[B]] to <2 x i64>*
; CHECK-NEXT: store <2 x i64> [[TMP8]], <2 x i64>* [[TMP9]], align 4
; CHECK-NEXT: ret void
;
%u1 = udiv i64 200, %x
store i64 %u1, i64* %b
call i64 @may_inf_loop_ro()
%u2 = udiv i64 200, %y
%c1 = load i64, i64* %c
%ca2 = getelementptr i64, i64* %c, i32 1
%c2 = load i64, i64* %ca2
%add1 = add i64 %u1, %c1
%add2 = add i64 %u2, %c2
store i64 %add1, i64* %b
%b2 = getelementptr i64, i64* %b, i32 1
store i64 %add2, i64* %b2
ret void
}
On 3/18/22 13:27, Philip Reames wrote:
> I added a comment to the existing code in 1093949cf which more fully
> explains the missing dependency and hidden assumption.
>
> I am not 100% sure your code has the same problem. I'd suggest
> exploring combinations such as a potentially faulting udiv following a
> readnone infinite loop call with block-invariant operands. I don't
> have a particular test case for you because massaging the code into a
> form that actually reorders is quite involved. I tried, but did not manage to
> create one with a few minutes of trying.
>
> Philip
>
> On 3/18/22 10:26, Philip Reames via llvm-commits wrote:
>> FYI, I'm pretty sure this patch is wrong. The case which I believe it
>> gets wrong involves a bundle containing a readonly call which is not
>> guaranteed to return (i.e., it may contain an infinite loop). If I'm
>> reading the code correctly, it may reorder such a call earlier in the
>> basic block - including reordering of two such calls in the process.
>>
>> This is the same bug which existed in D118538 which is why I noticed it.
>>
>> If this case isn't possible for some reason, please add test coverage
>> and clarify comments as to why.
>>
>> Philip
>>
>> On 3/17/22 11:04, Alexey Bataev via llvm-commits wrote:
>>> Author: Alexey Bataev
>>> Date: 2022-03-17T11:03:45-07:00
>>> New Revision: d65cc8597792ab04142cd2214c46c5c167191bcd
>>>
>>> URL:
>>> https://github.com/llvm/llvm-project/commit/d65cc8597792ab04142cd2214c46c5c167191bcd
>>> DIFF:
>>> https://github.com/llvm/llvm-project/commit/d65cc8597792ab04142cd2214c46c5c167191bcd.diff
>>>
>>> LOG: [SLP]Do not schedule instructions with constants/argument/phi
>>> operands and external users.
>>>
>>> No need to schedule entry nodes whose instructions are not memory
>>> read/write instructions and whose operands are either constants,
>>> arguments, phis, or instructions from other blocks, or whose users
>>> are phis or in other blocks.
>>> The resulting vector instructions can be placed at the beginning of
>>> the basic block without scheduling (if the operands do not need to
>>> be scheduled) or at the end of the block (if the users are outside
>>> of the block).
>>> This may save some compile time and scheduling resources.
>>>
>>> Differential Revision: https://reviews.llvm.org/D121121
>>>
>>> Added:
>>>
>>> Modified:
>>> llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
>>> llvm/test/Transforms/SLPVectorizer/AArch64/gather-reduce.ll
>>> llvm/test/Transforms/SLPVectorizer/AArch64/gather-root.ll
>>> llvm/test/Transforms/SLPVectorizer/AArch64/spillcost-di.ll
>>> llvm/test/Transforms/SLPVectorizer/X86/PR35628_2.ll
>>> llvm/test/Transforms/SLPVectorizer/X86/PR40310.ll
>>> llvm/test/Transforms/SLPVectorizer/X86/crash_exceed_scheduling.ll
>>> llvm/test/Transforms/SLPVectorizer/X86/opaque-ptr.ll
>>> llvm/test/Transforms/SLPVectorizer/X86/pr47629-inseltpoison.ll
>>> llvm/test/Transforms/SLPVectorizer/X86/pr47629.ll
>>> llvm/test/Transforms/SLPVectorizer/X86/shrink_after_reorder2.ll
>>> llvm/test/Transforms/SLPVectorizer/X86/vectorize-widest-phis.ll
>>> llvm/test/Transforms/SLPVectorizer/slp-max-phi-size.ll
>>>
>>> Removed:
>>>
>>>
>>> ################################################################################
>>>
>>> diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
>>> b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
>>> index 48382a12fcf3c..9ab31198adaab 100644
>>> --- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
>>> +++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
>>> @@ -776,6 +776,57 @@ static void
>>> reorderScalars(SmallVectorImpl<Value *> &Scalars,
>>> Scalars[Mask[I]] = Prev[I];
>>> }
>>> +/// Checks if the provided value does not require scheduling. It
>>> does not
>>> +/// require scheduling if this is not an instruction or it is an
>>> instruction
>>> +/// that does not read/write memory and all operands are either not
>>> instructions
>>> +/// or phi nodes or instructions from different blocks.
>>> +static bool areAllOperandsNonInsts(Value *V) {
>>> + auto *I = dyn_cast<Instruction>(V);
>>> + if (!I)
>>> + return true;
>>> + return !I->mayReadOrWriteMemory() && all_of(I->operands(),
>>> [I](Value *V) {
>>> + auto *IO = dyn_cast<Instruction>(V);
>>> + if (!IO)
>>> + return true;
>>> + return isa<PHINode>(IO) || IO->getParent() != I->getParent();
>>> + });
>>> +}
>>> +
>>> +/// Checks if the provided value does not require scheduling. It
>>> does not
>>> +/// require scheduling if this is not an instruction or it is an
>>> instruction
>>> +/// that does not read/write memory and all users are phi nodes or
>>> instructions
>>> +/// from the different blocks.
>>> +static bool isUsedOutsideBlock(Value *V) {
>>> + auto *I = dyn_cast<Instruction>(V);
>>> + if (!I)
>>> + return true;
>>> + // Limits the number of uses to save compile time.
>>> + constexpr int UsesLimit = 8;
>>> + return !I->mayReadOrWriteMemory() &&
>>> !I->hasNUsesOrMore(UsesLimit) &&
>>> + all_of(I->users(), [I](User *U) {
>>> + auto *IU = dyn_cast<Instruction>(U);
>>> + if (!IU)
>>> + return true;
>>> + return IU->getParent() != I->getParent() ||
>>> isa<PHINode>(IU);
>>> + });
>>> +}
>>> +
>>> +/// Checks if the specified value does not require scheduling. It
>>> does not
>>> +/// require scheduling if all operands and all users do not need to
>>> be scheduled
>>> +/// in the current basic block.
>>> +static bool doesNotNeedToBeScheduled(Value *V) {
>>> + return areAllOperandsNonInsts(V) && isUsedOutsideBlock(V);
>>> +}
>>> +
>>> +/// Checks if the specified array of instructions does not require
>>> scheduling.
>>> +/// It is so if all either instructions have operands that do not
>>> require
>>> +/// scheduling or their users do not require scheduling since they
>>> are phis or
>>> +/// in other basic blocks.
>>> +static bool doesNotNeedToSchedule(ArrayRef<Value *> VL) {
>>> + return !VL.empty() &&
>>> + (all_of(VL, isUsedOutsideBlock) || all_of(VL,
>>> areAllOperandsNonInsts));
>>> +}
>>> +
>>> namespace slpvectorizer {
>>> /// Bottom Up SLP Vectorizer.
>>> @@ -2359,15 +2410,21 @@ class BoUpSLP {
>>> ScalarToTreeEntry[V] = Last;
>>> }
>>> // Update the scheduler bundle to point to this TreeEntry.
>>> - unsigned Lane = 0;
>>> - for (ScheduleData *BundleMember = Bundle.getValue();
>>> BundleMember;
>>> - BundleMember = BundleMember->NextInBundle) {
>>> - BundleMember->TE = Last;
>>> - BundleMember->Lane = Lane;
>>> - ++Lane;
>>> - }
>>> - assert((!Bundle.getValue() || Lane == VL.size()) &&
>>> + ScheduleData *BundleMember = Bundle.getValue();
>>> + assert((BundleMember || isa<PHINode>(S.MainOp) ||
>>> + isVectorLikeInstWithConstOps(S.MainOp) ||
>>> + doesNotNeedToSchedule(VL)) &&
>>> "Bundle and VL out of sync");
>>> + if (BundleMember) {
>>> + for (Value *V : VL) {
>>> + if (doesNotNeedToBeScheduled(V))
>>> + continue;
>>> + assert(BundleMember && "Unexpected end of bundle.");
>>> + BundleMember->TE = Last;
>>> + BundleMember = BundleMember->NextInBundle;
>>> + }
>>> + }
>>> + assert(!BundleMember && "Bundle and VL out of sync");
>>> } else {
>>> MustGather.insert(VL.begin(), VL.end());
>>> }
>>> @@ -2504,7 +2561,6 @@ class BoUpSLP {
>>> clearDependencies();
>>> OpValue = OpVal;
>>> TE = nullptr;
>>> - Lane = -1;
>>> }
>>> /// Verify basic self consistency properties
>>> @@ -2544,7 +2600,7 @@ class BoUpSLP {
>>> /// Returns true if it represents an instruction bundle and
>>> not only a
>>> /// single instruction.
>>> bool isPartOfBundle() const {
>>> - return NextInBundle != nullptr || FirstInBundle != this;
>>> + return NextInBundle != nullptr || FirstInBundle != this || TE;
>>> }
>>> /// Returns true if it is ready for scheduling, i.e. it has
>>> no more
>>> @@ -2649,9 +2705,6 @@ class BoUpSLP {
>>> /// Note that this is negative as long as Dependencies is not
>>> calculated.
>>> int UnscheduledDeps = InvalidDeps;
>>> - /// The lane of this node in the TreeEntry.
>>> - int Lane = -1;
>>> -
>>> /// True if this instruction is scheduled (or considered as
>>> scheduled in the
>>> /// dry-run).
>>> bool IsScheduled = false;
>>> @@ -2669,6 +2722,21 @@ class BoUpSLP {
>>> friend struct DOTGraphTraits<BoUpSLP *>;
>>> /// Contains all scheduling data for a basic block.
>>> + /// It does not schedule instructions, which are not memory
>>> read/write
>>> + /// instructions and their operands are either constants, or
>>> arguments, or
>>> + /// phis, or instructions from others blocks, or their users are
>>> phis or from
>>> + /// the other blocks. The resulting vector instructions can be
>>> placed at the
>>> + /// beginning of the basic block without scheduling (if operands
>>> does not need
>>> + /// to be scheduled) or at the end of the block (if users are
>>> outside of the
>>> + /// block). It allows to save some compile time and memory used
>>> by the
>>> + /// compiler.
>>> + /// ScheduleData is assigned for each instruction in between the
>>> boundaries of
>>> + /// the tree entry, even for those, which are not part of the
>>> graph. It is
>>> + /// required to correctly follow the dependencies between the
>>> instructions and
>>> + /// their correct scheduling. The ScheduleData is not allocated
>>> for the
>>> + /// instructions, which do not require scheduling, like phis,
>>> nodes with
>>> + /// extractelements/insertelements only or nodes with
>>> instructions, with
>>> + /// uses/operands outside of the block.
>>> struct BlockScheduling {
>>> BlockScheduling(BasicBlock *BB)
>>> : BB(BB), ChunkSize(BB->size()), ChunkPos(ChunkSize) {}
>>> @@ -2696,7 +2764,7 @@ class BoUpSLP {
>>> if (BB != I->getParent())
>>> // Avoid lookup if can't possibly be in map.
>>> return nullptr;
>>> - ScheduleData *SD = ScheduleDataMap[I];
>>> + ScheduleData *SD = ScheduleDataMap.lookup(I);
>>> if (SD && isInSchedulingRegion(SD))
>>> return SD;
>>> return nullptr;
>>> @@ -2713,7 +2781,7 @@ class BoUpSLP {
>>> return getScheduleData(V);
>>> auto I = ExtraScheduleDataMap.find(V);
>>> if (I != ExtraScheduleDataMap.end()) {
>>> - ScheduleData *SD = I->second[Key];
>>> + ScheduleData *SD = I->second.lookup(Key);
>>> if (SD && isInSchedulingRegion(SD))
>>> return SD;
>>> }
>>> @@ -2735,7 +2803,7 @@ class BoUpSLP {
>>> BundleMember = BundleMember->NextInBundle) {
>>> if (BundleMember->Inst != BundleMember->OpValue)
>>> continue;
>>> -
>>> +
>>> // Handle the def-use chain dependencies.
>>> // Decrement the unscheduled counter and insert to ready
>>> list if ready.
>>> @@ -2760,7 +2828,9 @@ class BoUpSLP {
>>> // reordered during buildTree(). We therefore need to get
>>> its operands
>>> // through the TreeEntry.
>>> if (TreeEntry *TE = BundleMember->TE) {
>>> - int Lane = BundleMember->Lane;
>>> + // Need to search for the lane since the tree entry can
>>> be reordered.
>>> + int Lane = std::distance(TE->Scalars.begin(),
>>> + find(TE->Scalars,
>>> BundleMember->Inst));
>>> assert(Lane >= 0 && "Lane not set");
>>> // Since vectorization tree is being built recursively
>>> this assertion
>>> @@ -2769,7 +2839,7 @@ class BoUpSLP {
>>> // where their second (immediate) operand is not added.
>>> Since
>>> // immediates do not affect scheduler behavior this is
>>> considered
>>> // okay.
>>> - auto *In = TE->getMainOp();
>>> + auto *In = BundleMember->Inst;
>>> assert(In &&
>>> (isa<ExtractValueInst>(In) ||
>>> isa<ExtractElementInst>(In) ||
>>> In->getNumOperands() == TE->getNumOperands()) &&
>>> @@ -2814,7 +2884,8 @@ class BoUpSLP {
>>> for (auto *I = ScheduleStart; I != ScheduleEnd; I =
>>> I->getNextNode()) {
>>> auto *SD = getScheduleData(I);
>>> - assert(SD && "primary scheduledata must exist in window");
>>> + if (!SD)
>>> + continue;
>>> assert(isInSchedulingRegion(SD) &&
>>> "primary schedule data not in window?");
>>> assert(isInSchedulingRegion(SD->FirstInBundle) &&
>>> @@ -3856,6 +3927,22 @@ static LoadsState
>>> canVectorizeLoads(ArrayRef<Value *> VL, const Value *VL0,
>>> return LoadsState::Gather;
>>> }
>>> +/// \return true if the specified list of values has only one
>>> instruction that
>>> +/// requires scheduling, false otherwise.
>>> +static bool needToScheduleSingleInstruction(ArrayRef<Value *> VL) {
>>> + Value *NeedsScheduling = nullptr;
>>> + for (Value *V : VL) {
>>> + if (doesNotNeedToBeScheduled(V))
>>> + continue;
>>> + if (!NeedsScheduling) {
>>> + NeedsScheduling = V;
>>> + continue;
>>> + }
>>> + return false;
>>> + }
>>> + return NeedsScheduling;
>>> +}
>>> +
>>> void BoUpSLP::buildTree_rec(ArrayRef<Value *> VL, unsigned Depth,
>>> const EdgeInfo &UserTreeIdx) {
>>> assert((allConstant(VL) || allSameType(VL)) && "Invalid types!");
>>> @@ -6396,6 +6483,44 @@ void BoUpSLP::setInsertPointAfterBundle(const
>>> TreeEntry *E) {
>>> return !E->isOpcodeOrAlt(I) || I->getParent() == BB;
>>> }));
>>> + auto &&FindLastInst = [E, Front]() {
>>> + Instruction *LastInst = Front;
>>> + for (Value *V : E->Scalars) {
>>> + auto *I = dyn_cast<Instruction>(V);
>>> + if (!I)
>>> + continue;
>>> + if (LastInst->comesBefore(I))
>>> + LastInst = I;
>>> + }
>>> + return LastInst;
>>> + };
>>> +
>>> + auto &&FindFirstInst = [E, Front]() {
>>> + Instruction *FirstInst = Front;
>>> + for (Value *V : E->Scalars) {
>>> + auto *I = dyn_cast<Instruction>(V);
>>> + if (!I)
>>> + continue;
>>> + if (I->comesBefore(FirstInst))
>>> + FirstInst = I;
>>> + }
>>> + return FirstInst;
>>> + };
>>> +
>>> + // Set the insert point to the beginning of the basic block if
>>> the entry
>>> + // should not be scheduled.
>>> + if (E->State != TreeEntry::NeedToGather &&
>>> + doesNotNeedToSchedule(E->Scalars)) {
>>> + BasicBlock::iterator InsertPt;
>>> + if (all_of(E->Scalars, isUsedOutsideBlock))
>>> + InsertPt = FindLastInst()->getIterator();
>>> + else
>>> + InsertPt = FindFirstInst()->getIterator();
>>> + Builder.SetInsertPoint(BB, InsertPt);
>>> + Builder.SetCurrentDebugLocation(Front->getDebugLoc());
>>> + return;
>>> + }
>>> +
>>> // The last instruction in the bundle in program order.
>>> Instruction *LastInst = nullptr;
>>> @@ -6404,8 +6529,10 @@ void
>>> BoUpSLP::setInsertPointAfterBundle(const TreeEntry *E) {
>>> // VL.back() and iterate over schedule data until we reach the
>>> end of the
>>> // bundle. The end of the bundle is marked by null ScheduleData.
>>> if (BlocksSchedules.count(BB)) {
>>> - auto *Bundle =
>>> - BlocksSchedules[BB]->getScheduleData(E->isOneOf(E->Scalars.back()));
>>> + Value *V = E->isOneOf(E->Scalars.back());
>>> + if (doesNotNeedToBeScheduled(V))
>>> + V = *find_if_not(E->Scalars, doesNotNeedToBeScheduled);
>>> + auto *Bundle = BlocksSchedules[BB]->getScheduleData(V);
>>> if (Bundle && Bundle->isPartOfBundle())
>>> for (; Bundle; Bundle = Bundle->NextInBundle)
>>> if (Bundle->OpValue == Bundle->Inst)
>>> @@ -6430,15 +6557,8 @@ void BoUpSLP::setInsertPointAfterBundle(const
>>> TreeEntry *E) {
>>> // not ideal. However, this should be exceedingly rare since it
>>> requires that
>>> // we both exit early from buildTree_rec and that the bundle be
>>> out-of-order
>>> // (causing us to iterate all the way to the end of the block).
>>> - if (!LastInst) {
>>> - SmallPtrSet<Value *, 16> Bundle(E->Scalars.begin(),
>>> E->Scalars.end());
>>> - for (auto &I : make_range(BasicBlock::iterator(Front),
>>> BB->end())) {
>>> - if (Bundle.erase(&I) && E->isOpcodeOrAlt(&I))
>>> - LastInst = &I;
>>> - if (Bundle.empty())
>>> - break;
>>> - }
>>> - }
>>> + if (!LastInst)
>>> + LastInst = FindLastInst();
>>> assert(LastInst && "Failed to find last instruction in bundle");
>>> // Set the insertion point after the last instruction in the
>>> bundle. Set the
>>> @@ -7631,9 +7751,11 @@ void BoUpSLP::optimizeGatherSequence() {
>>> BoUpSLP::ScheduleData *
>>> BoUpSLP::BlockScheduling::buildBundle(ArrayRef<Value *> VL) {
>>> - ScheduleData *Bundle = nullptr;
>>> + ScheduleData *Bundle = nullptr;
>>> ScheduleData *PrevInBundle = nullptr;
>>> for (Value *V : VL) {
>>> + if (doesNotNeedToBeScheduled(V))
>>> + continue;
>>> ScheduleData *BundleMember = getScheduleData(V);
>>> assert(BundleMember &&
>>> "no ScheduleData for bundle member "
>>> @@ -7661,7 +7783,8 @@
>>> BoUpSLP::BlockScheduling::tryScheduleBundle(ArrayRef<Value *> VL,
>>> BoUpSLP *SLP,
>>> const
>>> InstructionsState &S) {
>>> // No need to schedule PHIs, insertelement, extractelement and
>>> extractvalue
>>> // instructions.
>>> - if (isa<PHINode>(S.OpValue) ||
>>> isVectorLikeInstWithConstOps(S.OpValue))
>>> + if (isa<PHINode>(S.OpValue) ||
>>> isVectorLikeInstWithConstOps(S.OpValue) ||
>>> + doesNotNeedToSchedule(VL))
>>> return nullptr;
>>> // Initialize the instruction bundle.
>>> @@ -7707,6 +7830,8 @@
>>> BoUpSLP::BlockScheduling::tryScheduleBundle(ArrayRef<Value *> VL,
>>> BoUpSLP *SLP,
>>> // Make sure that the scheduling region contains all
>>> // instructions of the bundle.
>>> for (Value *V : VL) {
>>> + if (doesNotNeedToBeScheduled(V))
>>> + continue;
>>> if (!extendSchedulingRegion(V, S)) {
>>> // If the scheduling region got new instructions at the
>>> lower end (or it
>>> // is a new region for the first bundle). This makes it
>>> necessary to
>>> @@ -7721,6 +7846,8 @@
>>> BoUpSLP::BlockScheduling::tryScheduleBundle(ArrayRef<Value *> VL,
>>> BoUpSLP *SLP,
>>> bool ReSchedule = false;
>>> for (Value *V : VL) {
>>> + if (doesNotNeedToBeScheduled(V))
>>> + continue;
>>> ScheduleData *BundleMember = getScheduleData(V);
>>> assert(BundleMember &&
>>> "no ScheduleData for bundle member (maybe not in same
>>> basic block)");
>>> @@ -7750,14 +7877,18 @@
>>> BoUpSLP::BlockScheduling::tryScheduleBundle(ArrayRef<Value *> VL,
>>> BoUpSLP *SLP,
>>> void BoUpSLP::BlockScheduling::cancelScheduling(ArrayRef<Value
>>> *> VL,
>>> Value *OpValue) {
>>> - if (isa<PHINode>(OpValue) || isVectorLikeInstWithConstOps(OpValue))
>>> + if (isa<PHINode>(OpValue) ||
>>> isVectorLikeInstWithConstOps(OpValue) ||
>>> + doesNotNeedToSchedule(VL))
>>> return;
>>> + if (doesNotNeedToBeScheduled(OpValue))
>>> + OpValue = *find_if_not(VL, doesNotNeedToBeScheduled);
>>> ScheduleData *Bundle = getScheduleData(OpValue);
>>> LLVM_DEBUG(dbgs() << "SLP: cancel scheduling of " << *Bundle <<
>>> "\n");
>>> assert(!Bundle->IsScheduled &&
>>> "Can't cancel bundle which is already scheduled");
>>> - assert(Bundle->isSchedulingEntity() && Bundle->isPartOfBundle() &&
>>> + assert(Bundle->isSchedulingEntity() &&
>>> + (Bundle->isPartOfBundle() ||
>>> needToScheduleSingleInstruction(VL)) &&
>>> "tried to unbundle something which is not a bundle");
>>> // Remove the bundle from the ready list.
>>> @@ -7771,6 +7902,7 @@ void
>>> BoUpSLP::BlockScheduling::cancelScheduling(ArrayRef<Value *> VL,
>>> BundleMember->FirstInBundle = BundleMember;
>>> ScheduleData *Next = BundleMember->NextInBundle;
>>> BundleMember->NextInBundle = nullptr;
>>> + BundleMember->TE = nullptr;
>>> if (BundleMember->unscheduledDepsInBundle() == 0) {
>>> ReadyInsts.insert(BundleMember);
>>> }
>>> @@ -7794,6 +7926,7 @@ bool
>>> BoUpSLP::BlockScheduling::extendSchedulingRegion(Value *V,
>>> Instruction *I = dyn_cast<Instruction>(V);
>>> assert(I && "bundle member must be an instruction");
>>> assert(!isa<PHINode>(I) && !isVectorLikeInstWithConstOps(I) &&
>>> + !doesNotNeedToBeScheduled(I) &&
>>> "phi nodes/insertelements/extractelements/extractvalues
>>> don't need to "
>>> "be scheduled");
>>> auto &&CheckScheduleForI = [this, &S](Instruction *I) -> bool {
>>> @@ -7870,7 +8003,10 @@ void
>>> BoUpSLP::BlockScheduling::initScheduleData(Instruction *FromI,
>>> ScheduleData
>>> *NextLoadStore) {
>>> ScheduleData *CurrentLoadStore = PrevLoadStore;
>>> for (Instruction *I = FromI; I != ToI; I = I->getNextNode()) {
>>> - ScheduleData *SD = ScheduleDataMap[I];
>>> + // No need to allocate data for non-schedulable instructions.
>>> + if (doesNotNeedToBeScheduled(I))
>>> + continue;
>>> + ScheduleData *SD = ScheduleDataMap.lookup(I);
>>> if (!SD) {
>>> SD = allocateScheduleDataChunks();
>>> ScheduleDataMap[I] = SD;
>>> @@ -8054,8 +8190,10 @@ void BoUpSLP::scheduleBlock(BlockScheduling
>>> *BS) {
>>> for (auto *I = BS->ScheduleStart; I != BS->ScheduleEnd;
>>> I = I->getNextNode()) {
>>> BS->doForAllOpcodes(I, [this, &Idx, &NumToSchedule,
>>> BS](ScheduleData *SD) {
>>> + TreeEntry *SDTE = getTreeEntry(SD->Inst);
>>> assert((isVectorLikeInstWithConstOps(SD->Inst) ||
>>> - SD->isPartOfBundle() == (getTreeEntry(SD->Inst) !=
>>> nullptr)) &&
>>> + SD->isPartOfBundle() ==
>>> + (SDTE && !doesNotNeedToSchedule(SDTE->Scalars))) &&
>>> "scheduler and vectorizer bundle mismatch");
>>> SD->FirstInBundle->SchedulingPriority = Idx++;
>>> if (SD->isSchedulingEntity()) {
>>>
>>> diff --git
>>> a/llvm/test/Transforms/SLPVectorizer/AArch64/gather-reduce.ll
>>> b/llvm/test/Transforms/SLPVectorizer/AArch64/gather-reduce.ll
>>> index 536f72a73684e..ec7b03af83f8b 100644
>>> --- a/llvm/test/Transforms/SLPVectorizer/AArch64/gather-reduce.ll
>>> +++ b/llvm/test/Transforms/SLPVectorizer/AArch64/gather-reduce.ll
>>> @@ -36,6 +36,7 @@ define i32 @gather_reduce_8x16_i32(i16* nocapture
>>> readonly %a, i16* nocapture re
>>> ; GENERIC-NEXT: [[I_0103:%.*]] = phi i32 [ [[INC:%.*]],
>>> [[FOR_BODY]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
>>> ; GENERIC-NEXT: [[SUM_0102:%.*]] = phi i32 [ [[ADD66]],
>>> [[FOR_BODY]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
>>> ; GENERIC-NEXT: [[A_ADDR_0101:%.*]] = phi i16* [
>>> [[INCDEC_PTR58:%.*]], [[FOR_BODY]] ], [ [[A:%.*]],
>>> [[FOR_BODY_PREHEADER]] ]
>>> +; GENERIC-NEXT: [[INCDEC_PTR58]] = getelementptr inbounds i16,
>>> i16* [[A_ADDR_0101]], i64 8
>>> ; GENERIC-NEXT: [[TMP0:%.*]] = bitcast i16* [[A_ADDR_0101]] to
>>> <8 x i16>*
>>> ; GENERIC-NEXT: [[TMP1:%.*]] = load <8 x i16>, <8 x i16>*
>>> [[TMP0]], align 2
>>> ; GENERIC-NEXT: [[TMP2:%.*]] = zext <8 x i16> [[TMP1]] to <8 x
>>> i32>
>>> @@ -85,7 +86,6 @@ define i32 @gather_reduce_8x16_i32(i16* nocapture
>>> readonly %a, i16* nocapture re
>>> ; GENERIC-NEXT: [[TMP27:%.*]] = load i16, i16* [[ARRAYIDX55]],
>>> align 2
>>> ; GENERIC-NEXT: [[CONV56:%.*]] = zext i16 [[TMP27]] to i32
>>> ; GENERIC-NEXT: [[ADD57:%.*]] = add nsw i32 [[ADD48]], [[CONV56]]
>>> -; GENERIC-NEXT: [[INCDEC_PTR58]] = getelementptr inbounds i16,
>>> i16* [[A_ADDR_0101]], i64 8
>>> ; GENERIC-NEXT: [[TMP28:%.*]] = extractelement <8 x i32>
>>> [[TMP6]], i64 7
>>> ; GENERIC-NEXT: [[TMP29:%.*]] = sext i32 [[TMP28]] to i64
>>> ; GENERIC-NEXT: [[ARRAYIDX64:%.*]] = getelementptr inbounds
>>> i16, i16* [[G]], i64 [[TMP29]]
>>> @@ -111,6 +111,7 @@ define i32 @gather_reduce_8x16_i32(i16*
>>> nocapture readonly %a, i16* nocapture re
>>> ; KRYO-NEXT: [[I_0103:%.*]] = phi i32 [ [[INC:%.*]],
>>> [[FOR_BODY]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
>>> ; KRYO-NEXT: [[SUM_0102:%.*]] = phi i32 [ [[ADD66]],
>>> [[FOR_BODY]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
>>> ; KRYO-NEXT: [[A_ADDR_0101:%.*]] = phi i16* [
>>> [[INCDEC_PTR58:%.*]], [[FOR_BODY]] ], [ [[A:%.*]],
>>> [[FOR_BODY_PREHEADER]] ]
>>> +; KRYO-NEXT: [[INCDEC_PTR58]] = getelementptr inbounds i16, i16*
>>> [[A_ADDR_0101]], i64 8
>>> ; KRYO-NEXT: [[TMP0:%.*]] = bitcast i16* [[A_ADDR_0101]] to <8
>>> x i16>*
>>> ; KRYO-NEXT: [[TMP1:%.*]] = load <8 x i16>, <8 x i16>*
>>> [[TMP0]], align 2
>>> ; KRYO-NEXT: [[TMP2:%.*]] = zext <8 x i16> [[TMP1]] to <8 x i32>
>>> @@ -160,7 +161,6 @@ define i32 @gather_reduce_8x16_i32(i16*
>>> nocapture readonly %a, i16* nocapture re
>>> ; KRYO-NEXT: [[TMP27:%.*]] = load i16, i16* [[ARRAYIDX55]],
>>> align 2
>>> ; KRYO-NEXT: [[CONV56:%.*]] = zext i16 [[TMP27]] to i32
>>> ; KRYO-NEXT: [[ADD57:%.*]] = add nsw i32 [[ADD48]], [[CONV56]]
>>> -; KRYO-NEXT: [[INCDEC_PTR58]] = getelementptr inbounds i16, i16*
>>> [[A_ADDR_0101]], i64 8
>>> ; KRYO-NEXT: [[TMP28:%.*]] = extractelement <8 x i32> [[TMP6]],
>>> i64 7
>>> ; KRYO-NEXT: [[TMP29:%.*]] = sext i32 [[TMP28]] to i64
>>> ; KRYO-NEXT: [[ARRAYIDX64:%.*]] = getelementptr inbounds i16,
>>> i16* [[G]], i64 [[TMP29]]
>>> @@ -297,6 +297,7 @@ define i32 @gather_reduce_8x16_i64(i16*
>>> nocapture readonly %a, i16* nocapture re
>>> ; GENERIC-NEXT: [[I_0103:%.*]] = phi i32 [ [[INC:%.*]],
>>> [[FOR_BODY]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
>>> ; GENERIC-NEXT: [[SUM_0102:%.*]] = phi i32 [ [[ADD66]],
>>> [[FOR_BODY]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
>>> ; GENERIC-NEXT: [[A_ADDR_0101:%.*]] = phi i16* [
>>> [[INCDEC_PTR58:%.*]], [[FOR_BODY]] ], [ [[A:%.*]],
>>> [[FOR_BODY_PREHEADER]] ]
>>> +; GENERIC-NEXT: [[INCDEC_PTR58]] = getelementptr inbounds i16,
>>> i16* [[A_ADDR_0101]], i64 8
>>> ; GENERIC-NEXT: [[TMP0:%.*]] = bitcast i16* [[A_ADDR_0101]] to
>>> <8 x i16>*
>>> ; GENERIC-NEXT: [[TMP1:%.*]] = load <8 x i16>, <8 x i16>*
>>> [[TMP0]], align 2
>>> ; GENERIC-NEXT: [[TMP2:%.*]] = zext <8 x i16> [[TMP1]] to <8 x
>>> i32>
>>> @@ -346,7 +347,6 @@ define i32 @gather_reduce_8x16_i64(i16*
>>> nocapture readonly %a, i16* nocapture re
>>> ; GENERIC-NEXT: [[TMP27:%.*]] = load i16, i16* [[ARRAYIDX55]],
>>> align 2
>>> ; GENERIC-NEXT: [[CONV56:%.*]] = zext i16 [[TMP27]] to i32
>>> ; GENERIC-NEXT: [[ADD57:%.*]] = add nsw i32 [[ADD48]], [[CONV56]]
>>> -; GENERIC-NEXT: [[INCDEC_PTR58]] = getelementptr inbounds i16,
>>> i16* [[A_ADDR_0101]], i64 8
>>> ; GENERIC-NEXT: [[TMP28:%.*]] = extractelement <8 x i32>
>>> [[TMP6]], i64 7
>>> ; GENERIC-NEXT: [[TMP29:%.*]] = sext i32 [[TMP28]] to i64
>>> ; GENERIC-NEXT: [[ARRAYIDX64:%.*]] = getelementptr inbounds
>>> i16, i16* [[G]], i64 [[TMP29]]
>>> @@ -372,6 +372,7 @@ define i32 @gather_reduce_8x16_i64(i16*
>>> nocapture readonly %a, i16* nocapture re
>>> ; KRYO-NEXT: [[I_0103:%.*]] = phi i32 [ [[INC:%.*]],
>>> [[FOR_BODY]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
>>> ; KRYO-NEXT: [[SUM_0102:%.*]] = phi i32 [ [[ADD66]],
>>> [[FOR_BODY]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
>>> ; KRYO-NEXT: [[A_ADDR_0101:%.*]] = phi i16* [
>>> [[INCDEC_PTR58:%.*]], [[FOR_BODY]] ], [ [[A:%.*]],
>>> [[FOR_BODY_PREHEADER]] ]
>>> +; KRYO-NEXT: [[INCDEC_PTR58]] = getelementptr inbounds i16, i16*
>>> [[A_ADDR_0101]], i64 8
>>> ; KRYO-NEXT: [[TMP0:%.*]] = bitcast i16* [[A_ADDR_0101]] to <8
>>> x i16>*
>>> ; KRYO-NEXT: [[TMP1:%.*]] = load <8 x i16>, <8 x i16>*
>>> [[TMP0]], align 2
>>> ; KRYO-NEXT: [[TMP2:%.*]] = zext <8 x i16> [[TMP1]] to <8 x i32>
>>> @@ -421,7 +422,6 @@ define i32 @gather_reduce_8x16_i64(i16*
>>> nocapture readonly %a, i16* nocapture re
>>> ; KRYO-NEXT: [[TMP27:%.*]] = load i16, i16* [[ARRAYIDX55]],
>>> align 2
>>> ; KRYO-NEXT: [[CONV56:%.*]] = zext i16 [[TMP27]] to i32
>>> ; KRYO-NEXT: [[ADD57:%.*]] = add nsw i32 [[ADD48]], [[CONV56]]
>>> -; KRYO-NEXT: [[INCDEC_PTR58]] = getelementptr inbounds i16, i16*
>>> [[A_ADDR_0101]], i64 8
>>> ; KRYO-NEXT: [[TMP28:%.*]] = extractelement <8 x i32> [[TMP6]],
>>> i64 7
>>> ; KRYO-NEXT: [[TMP29:%.*]] = sext i32 [[TMP28]] to i64
>>> ; KRYO-NEXT: [[ARRAYIDX64:%.*]] = getelementptr inbounds i16,
>>> i16* [[G]], i64 [[TMP29]]
>>>
>>> diff --git
>>> a/llvm/test/Transforms/SLPVectorizer/AArch64/gather-root.ll
>>> b/llvm/test/Transforms/SLPVectorizer/AArch64/gather-root.ll
>>> index e9c502b6982cd..01d743fcbfe97 100644
>>> --- a/llvm/test/Transforms/SLPVectorizer/AArch64/gather-root.ll
>>> +++ b/llvm/test/Transforms/SLPVectorizer/AArch64/gather-root.ll
>>> @@ -35,41 +35,14 @@ define void @PR28330(i32 %n) {
>>> ;
>>> ; MAX-COST-LABEL: @PR28330(
>>> ; MAX-COST-NEXT: entry:
>>> -; MAX-COST-NEXT: [[P0:%.*]] = load i8, i8* getelementptr
>>> inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 1), align 1
>>> -; MAX-COST-NEXT: [[P1:%.*]] = icmp eq i8 [[P0]], 0
>>> -; MAX-COST-NEXT: [[P2:%.*]] = load i8, i8* getelementptr
>>> inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 2), align 2
>>> -; MAX-COST-NEXT: [[P3:%.*]] = icmp eq i8 [[P2]], 0
>>> -; MAX-COST-NEXT: [[P4:%.*]] = load i8, i8* getelementptr
>>> inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 3), align 1
>>> -; MAX-COST-NEXT: [[P5:%.*]] = icmp eq i8 [[P4]], 0
>>> -; MAX-COST-NEXT: [[P6:%.*]] = load i8, i8* getelementptr
>>> inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 4), align 4
>>> -; MAX-COST-NEXT: [[P7:%.*]] = icmp eq i8 [[P6]], 0
>>> -; MAX-COST-NEXT: [[P8:%.*]] = load i8, i8* getelementptr
>>> inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 5), align 1
>>> -; MAX-COST-NEXT: [[P9:%.*]] = icmp eq i8 [[P8]], 0
>>> -; MAX-COST-NEXT: [[P10:%.*]] = load i8, i8* getelementptr
>>> inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 6), align 2
>>> -; MAX-COST-NEXT: [[P11:%.*]] = icmp eq i8 [[P10]], 0
>>> -; MAX-COST-NEXT: [[P12:%.*]] = load i8, i8* getelementptr
>>> inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 7), align 1
>>> -; MAX-COST-NEXT: [[P13:%.*]] = icmp eq i8 [[P12]], 0
>>> -; MAX-COST-NEXT: [[P14:%.*]] = load i8, i8* getelementptr
>>> inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 8), align 8
>>> -; MAX-COST-NEXT: [[P15:%.*]] = icmp eq i8 [[P14]], 0
>>> +; MAX-COST-NEXT: [[TMP0:%.*]] = load <8 x i8>, <8 x i8>* bitcast
>>> (i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 1)
>>> to <8 x i8>*), align 1
>>> +; MAX-COST-NEXT: [[TMP1:%.*]] = icmp eq <8 x i8> [[TMP0]],
>>> zeroinitializer
>>> ; MAX-COST-NEXT: br label [[FOR_BODY:%.*]]
>>> ; MAX-COST: for.body:
>>> -; MAX-COST-NEXT: [[P17:%.*]] = phi i32 [ [[P34:%.*]],
>>> [[FOR_BODY]] ], [ 0, [[ENTRY:%.*]] ]
>>> -; MAX-COST-NEXT: [[P19:%.*]] = select i1 [[P1]], i32 -720, i32 -80
>>> -; MAX-COST-NEXT: [[P20:%.*]] = add i32 [[P17]], [[P19]]
>>> -; MAX-COST-NEXT: [[P21:%.*]] = select i1 [[P3]], i32 -720, i32 -80
>>> -; MAX-COST-NEXT: [[P22:%.*]] = add i32 [[P20]], [[P21]]
>>> -; MAX-COST-NEXT: [[P23:%.*]] = select i1 [[P5]], i32 -720, i32 -80
>>> -; MAX-COST-NEXT: [[P24:%.*]] = add i32 [[P22]], [[P23]]
>>> -; MAX-COST-NEXT: [[P25:%.*]] = select i1 [[P7]], i32 -720, i32 -80
>>> -; MAX-COST-NEXT: [[P26:%.*]] = add i32 [[P24]], [[P25]]
>>> -; MAX-COST-NEXT: [[P27:%.*]] = select i1 [[P9]], i32 -720, i32 -80
>>> -; MAX-COST-NEXT: [[P28:%.*]] = add i32 [[P26]], [[P27]]
>>> -; MAX-COST-NEXT: [[P29:%.*]] = select i1 [[P11]], i32 -720, i32 -80
>>> -; MAX-COST-NEXT: [[P30:%.*]] = add i32 [[P28]], [[P29]]
>>> -; MAX-COST-NEXT: [[P31:%.*]] = select i1 [[P13]], i32 -720, i32 -80
>>> -; MAX-COST-NEXT: [[P32:%.*]] = add i32 [[P30]], [[P31]]
>>> -; MAX-COST-NEXT: [[P33:%.*]] = select i1 [[P15]], i32 -720, i32 -80
>>> -; MAX-COST-NEXT: [[P34]] = add i32 [[P32]], [[P33]]
>>> +; MAX-COST-NEXT: [[P17:%.*]] = phi i32 [ [[OP_EXTRA:%.*]],
>>> [[FOR_BODY]] ], [ 0, [[ENTRY:%.*]] ]
>>> +; MAX-COST-NEXT: [[TMP2:%.*]] = select <8 x i1> [[TMP1]], <8 x
>>> i32> <i32 -720, i32 -720, i32 -720, i32 -720, i32 -720, i32 -720,
>>> i32 -720, i32 -720>, <8 x i32> <i32 -80, i32 -80, i32 -80, i32 -80,
>>> i32 -80, i32 -80, i32 -80, i32 -80>
>>> +; MAX-COST-NEXT: [[TMP3:%.*]] = call i32
>>> @llvm.vector.reduce.add.v8i32(<8 x i32> [[TMP2]])
>>> +; MAX-COST-NEXT: [[OP_EXTRA]] = add i32 [[TMP3]], [[P17]]
>>> ; MAX-COST-NEXT: br label [[FOR_BODY]]
>>> ;
>>> entry:
>>> @@ -139,30 +112,14 @@ define void @PR32038(i32 %n) {
>>> ;
>>> ; MAX-COST-LABEL: @PR32038(
>>> ; MAX-COST-NEXT: entry:
>>> -; MAX-COST-NEXT: [[TMP0:%.*]] = load <4 x i8>, <4 x i8>* bitcast
>>> (i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 1)
>>> to <4 x i8>*), align 1
>>> -; MAX-COST-NEXT: [[TMP1:%.*]] = icmp eq <4 x i8> [[TMP0]],
>>> zeroinitializer
>>> -; MAX-COST-NEXT: [[P8:%.*]] = load i8, i8* getelementptr
>>> inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 5), align 1
>>> -; MAX-COST-NEXT: [[P9:%.*]] = icmp eq i8 [[P8]], 0
>>> -; MAX-COST-NEXT: [[P10:%.*]] = load i8, i8* getelementptr
>>> inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 6), align 2
>>> -; MAX-COST-NEXT: [[P11:%.*]] = icmp eq i8 [[P10]], 0
>>> -; MAX-COST-NEXT: [[P12:%.*]] = load i8, i8* getelementptr
>>> inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 7), align 1
>>> -; MAX-COST-NEXT: [[P13:%.*]] = icmp eq i8 [[P12]], 0
>>> -; MAX-COST-NEXT: [[P14:%.*]] = load i8, i8* getelementptr
>>> inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 8), align 8
>>> -; MAX-COST-NEXT: [[P15:%.*]] = icmp eq i8 [[P14]], 0
>>> +; MAX-COST-NEXT: [[TMP0:%.*]] = load <8 x i8>, <8 x i8>* bitcast
>>> (i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 1)
>>> to <8 x i8>*), align 1
>>> +; MAX-COST-NEXT: [[TMP1:%.*]] = icmp eq <8 x i8> [[TMP0]],
>>> zeroinitializer
>>> ; MAX-COST-NEXT: br label [[FOR_BODY:%.*]]
>>> ; MAX-COST: for.body:
>>> -; MAX-COST-NEXT: [[P17:%.*]] = phi i32 [ [[P34:%.*]],
>>> [[FOR_BODY]] ], [ 0, [[ENTRY:%.*]] ]
>>> -; MAX-COST-NEXT: [[TMP2:%.*]] = select <4 x i1> [[TMP1]], <4 x
>>> i32> <i32 -720, i32 -720, i32 -720, i32 -720>, <4 x i32> <i32 -80,
>>> i32 -80, i32 -80, i32 -80>
>>> -; MAX-COST-NEXT: [[P27:%.*]] = select i1 [[P9]], i32 -720, i32 -80
>>> -; MAX-COST-NEXT: [[P29:%.*]] = select i1 [[P11]], i32 -720, i32 -80
>>> -; MAX-COST-NEXT: [[TMP3:%.*]] = call i32
>>> @llvm.vector.reduce.add.v4i32(<4 x i32> [[TMP2]])
>>> -; MAX-COST-NEXT: [[TMP4:%.*]] = add i32 [[TMP3]], [[P27]]
>>> -; MAX-COST-NEXT: [[TMP5:%.*]] = add i32 [[TMP4]], [[P29]]
>>> -; MAX-COST-NEXT: [[OP_EXTRA:%.*]] = add i32 [[TMP5]], -5
>>> -; MAX-COST-NEXT: [[P31:%.*]] = select i1 [[P13]], i32 -720, i32 -80
>>> -; MAX-COST-NEXT: [[P32:%.*]] = add i32 [[OP_EXTRA]], [[P31]]
>>> -; MAX-COST-NEXT: [[P33:%.*]] = select i1 [[P15]], i32 -720, i32 -80
>>> -; MAX-COST-NEXT: [[P34]] = add i32 [[P32]], [[P33]]
>>> +; MAX-COST-NEXT: [[P17:%.*]] = phi i32 [ [[OP_EXTRA:%.*]],
>>> [[FOR_BODY]] ], [ 0, [[ENTRY:%.*]] ]
>>> +; MAX-COST-NEXT: [[TMP2:%.*]] = select <8 x i1> [[TMP1]], <8 x
>>> i32> <i32 -720, i32 -720, i32 -720, i32 -720, i32 -720, i32 -720,
>>> i32 -720, i32 -720>, <8 x i32> <i32 -80, i32 -80, i32 -80, i32 -80,
>>> i32 -80, i32 -80, i32 -80, i32 -80>
>>> +; MAX-COST-NEXT: [[TMP3:%.*]] = call i32
>>> @llvm.vector.reduce.add.v8i32(<8 x i32> [[TMP2]])
>>> +; MAX-COST-NEXT: [[OP_EXTRA]] = add i32 [[TMP3]], -5
>>> ; MAX-COST-NEXT: br label [[FOR_BODY]]
>>> ;
>>> entry:
>>>
>>> diff --git
>>> a/llvm/test/Transforms/SLPVectorizer/AArch64/spillcost-di.ll
>>> b/llvm/test/Transforms/SLPVectorizer/AArch64/spillcost-di.ll
>>> index 39f2f885bc26b..c1451090d23c0 100644
>>> --- a/llvm/test/Transforms/SLPVectorizer/AArch64/spillcost-di.ll
>>> +++ b/llvm/test/Transforms/SLPVectorizer/AArch64/spillcost-di.ll
>>> @@ -14,14 +14,14 @@ define void @patatino(i64 %n, i64 %i, %struct.S*
>>> %p) !dbg !7 {
>>> ; CHECK-NEXT: call void @llvm.dbg.value(metadata %struct.S*
>>> [[P:%.*]], metadata [[META20:![0-9]+]], metadata !DIExpression()),
>>> !dbg [[DBG25:![0-9]+]]
>>> ; CHECK-NEXT: [[X1:%.*]] = getelementptr inbounds
>>> [[STRUCT_S:%.*]], %struct.S* [[P]], i64 [[N]], i32 0, !dbg
>>> [[DBG26:![0-9]+]]
>>> ; CHECK-NEXT: call void @llvm.dbg.value(metadata i64 undef,
>>> metadata [[META21:![0-9]+]], metadata !DIExpression()), !dbg
>>> [[DBG27:![0-9]+]]
>>> -; CHECK-NEXT: [[Y3:%.*]] = getelementptr inbounds [[STRUCT_S]],
>>> %struct.S* [[P]], i64 [[N]], i32 1, !dbg [[DBG28:![0-9]+]]
>>> +; CHECK-NEXT: call void @llvm.dbg.value(metadata i64 undef,
>>> metadata [[META22:![0-9]+]], metadata !DIExpression()), !dbg
>>> [[DBG28:![0-9]+]]
>>> +; CHECK-NEXT: [[Y3:%.*]] = getelementptr inbounds [[STRUCT_S]],
>>> %struct.S* [[P]], i64 [[N]], i32 1, !dbg [[DBG29:![0-9]+]]
>>> ; CHECK-NEXT: [[TMP0:%.*]] = bitcast i64* [[X1]] to <2 x i64>*,
>>> !dbg [[DBG26]]
>>> -; CHECK-NEXT: [[TMP1:%.*]] = load <2 x i64>, <2 x i64>*
>>> [[TMP0]], align 8, !dbg [[DBG26]], !tbaa [[TBAA29:![0-9]+]]
>>> -; CHECK-NEXT: call void @llvm.dbg.value(metadata i64 undef,
>>> metadata [[META22:![0-9]+]], metadata !DIExpression()), !dbg
>>> [[DBG33:![0-9]+]]
>>> +; CHECK-NEXT: [[TMP1:%.*]] = load <2 x i64>, <2 x i64>*
>>> [[TMP0]], align 8, !dbg [[DBG26]], !tbaa [[TBAA30:![0-9]+]]
>>> ; CHECK-NEXT: [[X5:%.*]] = getelementptr inbounds [[STRUCT_S]],
>>> %struct.S* [[P]], i64 [[I]], i32 0, !dbg [[DBG34:![0-9]+]]
>>> ; CHECK-NEXT: [[Y7:%.*]] = getelementptr inbounds [[STRUCT_S]],
>>> %struct.S* [[P]], i64 [[I]], i32 1, !dbg [[DBG35:![0-9]+]]
>>> ; CHECK-NEXT: [[TMP2:%.*]] = bitcast i64* [[X5]] to <2 x i64>*,
>>> !dbg [[DBG36:![0-9]+]]
>>> -; CHECK-NEXT: store <2 x i64> [[TMP1]], <2 x i64>* [[TMP2]],
>>> align 8, !dbg [[DBG36]], !tbaa [[TBAA29]]
>>> +; CHECK-NEXT: store <2 x i64> [[TMP1]], <2 x i64>* [[TMP2]],
>>> align 8, !dbg [[DBG36]], !tbaa [[TBAA30]]
>>> ; CHECK-NEXT: ret void, !dbg [[DBG37:![0-9]+]]
>>> ;
>>> entry:
>>>
>>> diff --git a/llvm/test/Transforms/SLPVectorizer/X86/PR35628_2.ll
>>> b/llvm/test/Transforms/SLPVectorizer/X86/PR35628_2.ll
>>> index 7f51dcae484ca..d15494e092c25 100644
>>> --- a/llvm/test/Transforms/SLPVectorizer/X86/PR35628_2.ll
>>> +++ b/llvm/test/Transforms/SLPVectorizer/X86/PR35628_2.ll
>>> @@ -9,11 +9,11 @@ define void @test() #0 {
>>> ; CHECK: loop:
>>> ; CHECK-NEXT: [[DUMMY_PHI:%.*]] = phi i64 [ 1, [[ENTRY:%.*]] ],
>>> [ [[OP_EXTRA1:%.*]], [[LOOP]] ]
>>> ; CHECK-NEXT: [[TMP0:%.*]] = phi i64 [ 2, [[ENTRY]] ], [
>>> [[TMP3:%.*]], [[LOOP]] ]
>>> -; CHECK-NEXT: [[DUMMY_ADD:%.*]] = add i16 0, 0
>>> ; CHECK-NEXT: [[TMP1:%.*]] = insertelement <4 x i64> poison,
>>> i64 [[TMP0]], i32 0
>>> ; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x i64>
>>> [[TMP1]], <4 x i64> poison, <4 x i32> zeroinitializer
>>> ; CHECK-NEXT: [[TMP2:%.*]] = add <4 x i64> [[SHUFFLE]], <i64 3,
>>> i64 2, i64 1, i64 0>
>>> ; CHECK-NEXT: [[TMP3]] = extractelement <4 x i64> [[TMP2]], i32 3
>>> +; CHECK-NEXT: [[DUMMY_ADD:%.*]] = add i16 0, 0
>>> ; CHECK-NEXT: [[TMP4:%.*]] = extractelement <4 x i64> [[TMP2]],
>>> i32 0
>>> ; CHECK-NEXT: [[DUMMY_SHL:%.*]] = shl i64 [[TMP4]], 32
>>> ; CHECK-NEXT: [[TMP5:%.*]] = add <4 x i64> <i64 1, i64 1, i64
>>> 1, i64 1>, [[TMP2]]
>>>
>>> diff --git a/llvm/test/Transforms/SLPVectorizer/X86/PR40310.ll
>>> b/llvm/test/Transforms/SLPVectorizer/X86/PR40310.ll
>>> index 7ab610f994264..f878bda14ad84 100644
>>> --- a/llvm/test/Transforms/SLPVectorizer/X86/PR40310.ll
>>> +++ b/llvm/test/Transforms/SLPVectorizer/X86/PR40310.ll
>>> @@ -10,10 +10,10 @@ define void @mainTest(i32 %param, i32 * %vals,
>>> i32 %len) {
>>> ; CHECK-NEXT: [[TMP1:%.*]] = phi <2 x i32> [ [[TMP7:%.*]],
>>> [[BCI_15]] ], [ [[TMP0]], [[BCI_15_PREHEADER:%.*]] ]
>>> ; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x i32>
>>> [[TMP1]], <2 x i32> poison, <16 x i32> <i32 0, i32 0, i32 0, i32 0,
>>> i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32
>>> 0, i32 0, i32 1>
>>> ; CHECK-NEXT: [[TMP2:%.*]] = extractelement <16 x i32>
>>> [[SHUFFLE]], i32 0
>>> -; CHECK-NEXT: [[TMP3:%.*]] = extractelement <16 x i32>
>>> [[SHUFFLE]], i32 15
>>> -; CHECK-NEXT: store atomic i32 [[TMP3]], i32* [[VALS:%.*]]
>>> unordered, align 4
>>> -; CHECK-NEXT: [[TMP4:%.*]] = add <16 x i32> [[SHUFFLE]], <i32
>>> 15, i32 14, i32 13, i32 12, i32 11, i32 10, i32 9, i32 8, i32 7, i32
>>> 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 -1>
>>> -; CHECK-NEXT: [[TMP5:%.*]] = call i32
>>> @llvm.vector.reduce.and.v16i32(<16 x i32> [[TMP4]])
>>> +; CHECK-NEXT: [[TMP3:%.*]] = add <16 x i32> [[SHUFFLE]], <i32
>>> 15, i32 14, i32 13, i32 12, i32 11, i32 10, i32 9, i32 8, i32 7, i32
>>> 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 -1>
>>> +; CHECK-NEXT: [[TMP4:%.*]] = extractelement <16 x i32>
>>> [[SHUFFLE]], i32 15
>>> +; CHECK-NEXT: store atomic i32 [[TMP4]], i32* [[VALS:%.*]]
>>> unordered, align 4
>>> +; CHECK-NEXT: [[TMP5:%.*]] = call i32
>>> @llvm.vector.reduce.and.v16i32(<16 x i32> [[TMP3]])
>>> ; CHECK-NEXT: [[OP_EXTRA:%.*]] = and i32 [[TMP5]], [[TMP2]]
>>> ; CHECK-NEXT: [[V44:%.*]] = add i32 [[TMP2]], 16
>>> ; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x i32> poison,
>>> i32 [[V44]], i32 0
>>>
>>> diff --git
>>> a/llvm/test/Transforms/SLPVectorizer/X86/crash_exceed_scheduling.ll
>>> b/llvm/test/Transforms/SLPVectorizer/X86/crash_exceed_scheduling.ll
>>> index de371d8895c7d..94739340c8b5a 100644
>>> --- a/llvm/test/Transforms/SLPVectorizer/X86/crash_exceed_scheduling.ll
>>> +++ b/llvm/test/Transforms/SLPVectorizer/X86/crash_exceed_scheduling.ll
>>> @@ -29,10 +29,10 @@ define void @exceed(double %0, double %1) {
>>> ; CHECK-NEXT: [[IXX22:%.*]] = fsub double undef, undef
>>> ; CHECK-NEXT: [[TMP8:%.*]] = extractelement <2 x double>
>>> [[TMP6]], i32 0
>>> ; CHECK-NEXT: [[IX2:%.*]] = fmul double [[TMP8]], [[TMP8]]
>>> -; CHECK-NEXT: [[TMP9:%.*]] = insertelement <2 x double>
>>> [[TMP2]], double [[TMP1]], i32 1
>>> -; CHECK-NEXT: [[TMP10:%.*]] = fadd fast <2 x double> [[TMP6]],
>>> [[TMP9]]
>>> -; CHECK-NEXT: [[TMP11:%.*]] = fadd fast <2 x double> [[TMP3]],
>>> [[TMP5]]
>>> -; CHECK-NEXT: [[TMP12:%.*]] = fmul fast <2 x double> [[TMP10]],
>>> [[TMP11]]
>>> +; CHECK-NEXT: [[TMP9:%.*]] = fadd fast <2 x double> [[TMP3]],
>>> [[TMP5]]
>>> +; CHECK-NEXT: [[TMP10:%.*]] = insertelement <2 x double>
>>> [[TMP2]], double [[TMP1]], i32 1
>>> +; CHECK-NEXT: [[TMP11:%.*]] = fadd fast <2 x double> [[TMP6]],
>>> [[TMP10]]
>>> +; CHECK-NEXT: [[TMP12:%.*]] = fmul fast <2 x double> [[TMP11]],
>>> [[TMP9]]
>>> ; CHECK-NEXT: [[IXX101:%.*]] = fsub double undef, undef
>>> ; CHECK-NEXT: [[TMP13:%.*]] = insertelement <2 x double>
>>> poison, double [[TMP1]], i32 1
>>> ; CHECK-NEXT: [[TMP14:%.*]] = insertelement <2 x double>
>>> [[TMP13]], double [[TMP7]], i32 0
>>>
>>> diff --git a/llvm/test/Transforms/SLPVectorizer/X86/opaque-ptr.ll
>>> b/llvm/test/Transforms/SLPVectorizer/X86/opaque-ptr.ll
>>> index 80cb197982d48..8dc4a8936b722 100644
>>> --- a/llvm/test/Transforms/SLPVectorizer/X86/opaque-ptr.ll
>>> +++ b/llvm/test/Transforms/SLPVectorizer/X86/opaque-ptr.ll
>>> @@ -58,10 +58,10 @@ define void @test(ptr %r, ptr %p, ptr %q) #0 {
>>> define void @test2(i64* %a, i64* %b) {
>>> ; CHECK-LABEL: @test2(
>>> -; CHECK-NEXT: [[A2:%.*]] = getelementptr inbounds i64, ptr
>>> [[A:%.*]], i64 2
>>> -; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x ptr> poison, ptr
>>> [[A]], i32 0
>>> +; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x ptr> poison, ptr
>>> [[A:%.*]], i32 0
>>> ; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x ptr> [[TMP1]],
>>> ptr [[B:%.*]], i32 1
>>> ; CHECK-NEXT: [[TMP3:%.*]] = getelementptr i64, <2 x ptr>
>>> [[TMP2]], <2 x i64> <i64 1, i64 3>
>>> +; CHECK-NEXT: [[A2:%.*]] = getelementptr inbounds i64, ptr
>>> [[A]], i64 2
>>> ; CHECK-NEXT: [[TMP4:%.*]] = ptrtoint <2 x ptr> [[TMP3]] to <2
>>> x i64>
>>> ; CHECK-NEXT: [[TMP5:%.*]] = extractelement <2 x ptr> [[TMP3]],
>>> i32 0
>>> ; CHECK-NEXT: [[TMP6:%.*]] = load <2 x i64>, ptr [[TMP5]], align 8
>>>
>>> diff --git
>>> a/llvm/test/Transforms/SLPVectorizer/X86/pr47629-inseltpoison.ll
>>> b/llvm/test/Transforms/SLPVectorizer/X86/pr47629-inseltpoison.ll
>>> index f6dd7526e6e76..35a6c63d29b6c 100644
>>> --- a/llvm/test/Transforms/SLPVectorizer/X86/pr47629-inseltpoison.ll
>>> +++ b/llvm/test/Transforms/SLPVectorizer/X86/pr47629-inseltpoison.ll
>>> @@ -749,47 +749,47 @@ define void @gather_load_div(float* noalias
>>> nocapture %0, float* noalias nocaptu
>>> ; AVX2-NEXT: ret void
>>> ;
>>> ; AVX512F-LABEL: @gather_load_div(
>>> -; AVX512F-NEXT: [[TMP3:%.*]] = insertelement <4 x float*>
>>> poison, float* [[TMP1:%.*]], i64 0
>>> -; AVX512F-NEXT: [[SHUFFLE1:%.*]] = shufflevector <4 x float*>
>>> [[TMP3]], <4 x float*> poison, <4 x i32> zeroinitializer
>>> -; AVX512F-NEXT: [[TMP4:%.*]] = getelementptr float, <4 x float*>
>>> [[SHUFFLE1]], <4 x i64> <i64 10, i64 3, i64 14, i64 17>
>>> -; AVX512F-NEXT: [[TMP5:%.*]] = insertelement <2 x float*>
>>> poison, float* [[TMP1]], i64 0
>>> -; AVX512F-NEXT: [[TMP6:%.*]] = shufflevector <2 x float*>
>>> [[TMP5]], <2 x float*> poison, <2 x i32> zeroinitializer
>>> -; AVX512F-NEXT: [[TMP7:%.*]] = getelementptr float, <2 x float*>
>>> [[TMP6]], <2 x i64> <i64 8, i64 5>
>>> -; AVX512F-NEXT: [[TMP8:%.*]] = getelementptr inbounds float,
>>> float* [[TMP1]], i64 20
>>> -; AVX512F-NEXT: [[TMP9:%.*]] = insertelement <8 x float*>
>>> poison, float* [[TMP1]], i64 0
>>> -; AVX512F-NEXT: [[TMP10:%.*]] = shufflevector <4 x float*>
>>> [[TMP4]], <4 x float*> poison, <8 x i32> <i32 0, i32 1, i32 2, i32
>>> 3, i32 undef, i32 undef, i32 undef, i32 undef>
>>> -; AVX512F-NEXT: [[TMP11:%.*]] = shufflevector <8 x float*>
>>> [[TMP9]], <8 x float*> [[TMP10]], <8 x i32> <i32 0, i32 8, i32 9,
>>> i32 10, i32 11, i32 undef, i32 undef, i32 undef>
>>> -; AVX512F-NEXT: [[TMP12:%.*]] = shufflevector <2 x float*>
>>> [[TMP7]], <2 x float*> poison, <8 x i32> <i32 0, i32 1, i32 undef,
>>> i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
>>> -; AVX512F-NEXT: [[TMP13:%.*]] = shufflevector <8 x float*>
>>> [[TMP11]], <8 x float*> [[TMP12]], <8 x i32> <i32 0, i32 1, i32 2,
>>> i32 3, i32 4, i32 8, i32 9, i32 undef>
>>> -; AVX512F-NEXT: [[TMP14:%.*]] = insertelement <8 x float*>
>>> [[TMP13]], float* [[TMP8]], i64 7
>>> -; AVX512F-NEXT: [[TMP15:%.*]] = call <8 x float>
>>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP14]], i32 4, <8
>>> x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1
>>> true, i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>> -; AVX512F-NEXT: [[SHUFFLE:%.*]] = shufflevector <8 x float*>
>>> [[TMP9]], <8 x float*> poison, <8 x i32> zeroinitializer
>>> -; AVX512F-NEXT: [[TMP16:%.*]] = getelementptr float, <8 x
>>> float*> [[SHUFFLE]], <8 x i64> <i64 4, i64 13, i64 11, i64 44, i64
>>> 33, i64 30, i64 27, i64 23>
>>> -; AVX512F-NEXT: [[TMP17:%.*]] = call <8 x float>
>>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP16]], i32 4, <8
>>> x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1
>>> true, i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>> -; AVX512F-NEXT: [[TMP18:%.*]] = fdiv <8 x float> [[TMP15]],
>>> [[TMP17]]
>>> +; AVX512F-NEXT: [[TMP3:%.*]] = insertelement <8 x float*>
>>> poison, float* [[TMP1:%.*]], i64 0
>>> +; AVX512F-NEXT: [[SHUFFLE:%.*]] = shufflevector <8 x float*>
>>> [[TMP3]], <8 x float*> poison, <8 x i32> zeroinitializer
>>> +; AVX512F-NEXT: [[TMP4:%.*]] = getelementptr float, <8 x float*>
>>> [[SHUFFLE]], <8 x i64> <i64 4, i64 13, i64 11, i64 44, i64 33, i64
>>> 30, i64 27, i64 23>
>>> +; AVX512F-NEXT: [[TMP5:%.*]] = insertelement <4 x float*>
>>> poison, float* [[TMP1]], i64 0
>>> +; AVX512F-NEXT: [[SHUFFLE1:%.*]] = shufflevector <4 x float*>
>>> [[TMP5]], <4 x float*> poison, <4 x i32> zeroinitializer
>>> +; AVX512F-NEXT: [[TMP6:%.*]] = getelementptr float, <4 x float*>
>>> [[SHUFFLE1]], <4 x i64> <i64 10, i64 3, i64 14, i64 17>
>>> +; AVX512F-NEXT: [[TMP7:%.*]] = insertelement <2 x float*>
>>> poison, float* [[TMP1]], i64 0
>>> +; AVX512F-NEXT: [[TMP8:%.*]] = shufflevector <2 x float*>
>>> [[TMP7]], <2 x float*> poison, <2 x i32> zeroinitializer
>>> +; AVX512F-NEXT: [[TMP9:%.*]] = getelementptr float, <2 x float*>
>>> [[TMP8]], <2 x i64> <i64 8, i64 5>
>>> +; AVX512F-NEXT: [[TMP10:%.*]] = getelementptr inbounds float,
>>> float* [[TMP1]], i64 20
>>> +; AVX512F-NEXT: [[TMP11:%.*]] = shufflevector <4 x float*>
>>> [[TMP6]], <4 x float*> poison, <8 x i32> <i32 0, i32 1, i32 2, i32
>>> 3, i32 undef, i32 undef, i32 undef, i32 undef>
>>> +; AVX512F-NEXT: [[TMP12:%.*]] = shufflevector <8 x float*>
>>> [[TMP3]], <8 x float*> [[TMP11]], <8 x i32> <i32 0, i32 8, i32 9,
>>> i32 10, i32 11, i32 undef, i32 undef, i32 undef>
>>> +; AVX512F-NEXT: [[TMP13:%.*]] = shufflevector <2 x float*>
>>> [[TMP9]], <2 x float*> poison, <8 x i32> <i32 0, i32 1, i32 undef,
>>> i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
>>> +; AVX512F-NEXT: [[TMP14:%.*]] = shufflevector <8 x float*>
>>> [[TMP12]], <8 x float*> [[TMP13]], <8 x i32> <i32 0, i32 1, i32 2,
>>> i32 3, i32 4, i32 8, i32 9, i32 undef>
>>> +; AVX512F-NEXT: [[TMP15:%.*]] = insertelement <8 x float*>
>>> [[TMP14]], float* [[TMP10]], i64 7
>>> +; AVX512F-NEXT: [[TMP16:%.*]] = call <8 x float>
>>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP15]], i32 4, <8
>>> x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1
>>> true, i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>> +; AVX512F-NEXT: [[TMP17:%.*]] = call <8 x float>
>>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP4]], i32 4, <8 x
>>> i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true,
>>> i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>> +; AVX512F-NEXT: [[TMP18:%.*]] = fdiv <8 x float> [[TMP16]],
>>> [[TMP17]]
>>> ; AVX512F-NEXT: [[TMP19:%.*]] = bitcast float* [[TMP0:%.*]] to
>>> <8 x float>*
>>> ; AVX512F-NEXT: store <8 x float> [[TMP18]], <8 x float>*
>>> [[TMP19]], align 4, !tbaa [[TBAA0]]
>>> ; AVX512F-NEXT: ret void
>>> ;
>>> ; AVX512VL-LABEL: @gather_load_div(
>>> -; AVX512VL-NEXT: [[TMP3:%.*]] = insertelement <4 x float*>
>>> poison, float* [[TMP1:%.*]], i64 0
>>> -; AVX512VL-NEXT: [[SHUFFLE1:%.*]] = shufflevector <4 x float*>
>>> [[TMP3]], <4 x float*> poison, <4 x i32> zeroinitializer
>>> -; AVX512VL-NEXT: [[TMP4:%.*]] = getelementptr float, <4 x
>>> float*> [[SHUFFLE1]], <4 x i64> <i64 10, i64 3, i64 14, i64 17>
>>> -; AVX512VL-NEXT: [[TMP5:%.*]] = insertelement <2 x float*>
>>> poison, float* [[TMP1]], i64 0
>>> -; AVX512VL-NEXT: [[TMP6:%.*]] = shufflevector <2 x float*>
>>> [[TMP5]], <2 x float*> poison, <2 x i32> zeroinitializer
>>> -; AVX512VL-NEXT: [[TMP7:%.*]] = getelementptr float, <2 x
>>> float*> [[TMP6]], <2 x i64> <i64 8, i64 5>
>>> -; AVX512VL-NEXT: [[TMP8:%.*]] = getelementptr inbounds float,
>>> float* [[TMP1]], i64 20
>>> -; AVX512VL-NEXT: [[TMP9:%.*]] = insertelement <8 x float*>
>>> poison, float* [[TMP1]], i64 0
>>> -; AVX512VL-NEXT: [[TMP10:%.*]] = shufflevector <4 x float*>
>>> [[TMP4]], <4 x float*> poison, <8 x i32> <i32 0, i32 1, i32 2, i32
>>> 3, i32 undef, i32 undef, i32 undef, i32 undef>
>>> -; AVX512VL-NEXT: [[TMP11:%.*]] = shufflevector <8 x float*>
>>> [[TMP9]], <8 x float*> [[TMP10]], <8 x i32> <i32 0, i32 8, i32 9,
>>> i32 10, i32 11, i32 undef, i32 undef, i32 undef>
>>> -; AVX512VL-NEXT: [[TMP12:%.*]] = shufflevector <2 x float*>
>>> [[TMP7]], <2 x float*> poison, <8 x i32> <i32 0, i32 1, i32 undef,
>>> i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
>>> -; AVX512VL-NEXT: [[TMP13:%.*]] = shufflevector <8 x float*>
>>> [[TMP11]], <8 x float*> [[TMP12]], <8 x i32> <i32 0, i32 1, i32 2,
>>> i32 3, i32 4, i32 8, i32 9, i32 undef>
>>> -; AVX512VL-NEXT: [[TMP14:%.*]] = insertelement <8 x float*>
>>> [[TMP13]], float* [[TMP8]], i64 7
>>> -; AVX512VL-NEXT: [[TMP15:%.*]] = call <8 x float>
>>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP14]], i32 4, <8
>>> x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1
>>> true, i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>> -; AVX512VL-NEXT: [[SHUFFLE:%.*]] = shufflevector <8 x float*>
>>> [[TMP9]], <8 x float*> poison, <8 x i32> zeroinitializer
>>> -; AVX512VL-NEXT: [[TMP16:%.*]] = getelementptr float, <8 x
>>> float*> [[SHUFFLE]], <8 x i64> <i64 4, i64 13, i64 11, i64 44, i64
>>> 33, i64 30, i64 27, i64 23>
>>> -; AVX512VL-NEXT: [[TMP17:%.*]] = call <8 x float>
>>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP16]], i32 4, <8
>>> x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1
>>> true, i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>> -; AVX512VL-NEXT: [[TMP18:%.*]] = fdiv <8 x float> [[TMP15]],
>>> [[TMP17]]
>>> +; AVX512VL-NEXT: [[TMP3:%.*]] = insertelement <8 x float*>
>>> poison, float* [[TMP1:%.*]], i64 0
>>> +; AVX512VL-NEXT: [[SHUFFLE:%.*]] = shufflevector <8 x float*>
>>> [[TMP3]], <8 x float*> poison, <8 x i32> zeroinitializer
>>> +; AVX512VL-NEXT: [[TMP4:%.*]] = getelementptr float, <8 x
>>> float*> [[SHUFFLE]], <8 x i64> <i64 4, i64 13, i64 11, i64 44, i64
>>> 33, i64 30, i64 27, i64 23>
>>> +; AVX512VL-NEXT: [[TMP5:%.*]] = insertelement <4 x float*>
>>> poison, float* [[TMP1]], i64 0
>>> +; AVX512VL-NEXT: [[SHUFFLE1:%.*]] = shufflevector <4 x float*>
>>> [[TMP5]], <4 x float*> poison, <4 x i32> zeroinitializer
>>> +; AVX512VL-NEXT: [[TMP6:%.*]] = getelementptr float, <4 x
>>> float*> [[SHUFFLE1]], <4 x i64> <i64 10, i64 3, i64 14, i64 17>
>>> +; AVX512VL-NEXT: [[TMP7:%.*]] = insertelement <2 x float*>
>>> poison, float* [[TMP1]], i64 0
>>> +; AVX512VL-NEXT: [[TMP8:%.*]] = shufflevector <2 x float*>
>>> [[TMP7]], <2 x float*> poison, <2 x i32> zeroinitializer
>>> +; AVX512VL-NEXT: [[TMP9:%.*]] = getelementptr float, <2 x
>>> float*> [[TMP8]], <2 x i64> <i64 8, i64 5>
>>> +; AVX512VL-NEXT: [[TMP10:%.*]] = getelementptr inbounds float,
>>> float* [[TMP1]], i64 20
>>> +; AVX512VL-NEXT: [[TMP11:%.*]] = shufflevector <4 x float*>
>>> [[TMP6]], <4 x float*> poison, <8 x i32> <i32 0, i32 1, i32 2, i32
>>> 3, i32 undef, i32 undef, i32 undef, i32 undef>
>>> +; AVX512VL-NEXT: [[TMP12:%.*]] = shufflevector <8 x float*>
>>> [[TMP3]], <8 x float*> [[TMP11]], <8 x i32> <i32 0, i32 8, i32 9,
>>> i32 10, i32 11, i32 undef, i32 undef, i32 undef>
>>> +; AVX512VL-NEXT: [[TMP13:%.*]] = shufflevector <2 x float*>
>>> [[TMP9]], <2 x float*> poison, <8 x i32> <i32 0, i32 1, i32 undef,
>>> i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
>>> +; AVX512VL-NEXT: [[TMP14:%.*]] = shufflevector <8 x float*>
>>> [[TMP12]], <8 x float*> [[TMP13]], <8 x i32> <i32 0, i32 1, i32 2,
>>> i32 3, i32 4, i32 8, i32 9, i32 undef>
>>> +; AVX512VL-NEXT: [[TMP15:%.*]] = insertelement <8 x float*>
>>> [[TMP14]], float* [[TMP10]], i64 7
>>> +; AVX512VL-NEXT: [[TMP16:%.*]] = call <8 x float>
>>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP15]], i32 4, <8
>>> x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1
>>> true, i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>> +; AVX512VL-NEXT: [[TMP17:%.*]] = call <8 x float>
>>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP4]], i32 4, <8 x
>>> i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true,
>>> i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>> +; AVX512VL-NEXT: [[TMP18:%.*]] = fdiv <8 x float> [[TMP16]],
>>> [[TMP17]]
>>> ; AVX512VL-NEXT: [[TMP19:%.*]] = bitcast float* [[TMP0:%.*]] to
>>> <8 x float>*
>>> ; AVX512VL-NEXT: store <8 x float> [[TMP18]], <8 x float>*
>>> [[TMP19]], align 4, !tbaa [[TBAA0]]
>>> ; AVX512VL-NEXT: ret void
>>>
>>> diff --git a/llvm/test/Transforms/SLPVectorizer/X86/pr47629.ll
>>> b/llvm/test/Transforms/SLPVectorizer/X86/pr47629.ll
>>> index fd1c612a0696e..47f4391fd3b21 100644
>>> --- a/llvm/test/Transforms/SLPVectorizer/X86/pr47629.ll
>>> +++ b/llvm/test/Transforms/SLPVectorizer/X86/pr47629.ll
>>> @@ -749,47 +749,47 @@ define void @gather_load_div(float* noalias
>>> nocapture %0, float* noalias nocaptu
>>> ; AVX2-NEXT: ret void
>>> ;
>>> ; AVX512F-LABEL: @gather_load_div(
>>> -; AVX512F-NEXT: [[TMP3:%.*]] = insertelement <4 x float*>
>>> poison, float* [[TMP1:%.*]], i64 0
>>> -; AVX512F-NEXT: [[SHUFFLE1:%.*]] = shufflevector <4 x float*>
>>> [[TMP3]], <4 x float*> poison, <4 x i32> zeroinitializer
>>> -; AVX512F-NEXT: [[TMP4:%.*]] = getelementptr float, <4 x float*>
>>> [[SHUFFLE1]], <4 x i64> <i64 10, i64 3, i64 14, i64 17>
>>> -; AVX512F-NEXT: [[TMP5:%.*]] = insertelement <2 x float*>
>>> poison, float* [[TMP1]], i64 0
>>> -; AVX512F-NEXT: [[TMP6:%.*]] = shufflevector <2 x float*>
>>> [[TMP5]], <2 x float*> poison, <2 x i32> zeroinitializer
>>> -; AVX512F-NEXT: [[TMP7:%.*]] = getelementptr float, <2 x float*>
>>> [[TMP6]], <2 x i64> <i64 8, i64 5>
>>> -; AVX512F-NEXT: [[TMP8:%.*]] = getelementptr inbounds float,
>>> float* [[TMP1]], i64 20
>>> -; AVX512F-NEXT: [[TMP9:%.*]] = insertelement <8 x float*>
>>> poison, float* [[TMP1]], i64 0
>>> -; AVX512F-NEXT: [[TMP10:%.*]] = shufflevector <4 x float*>
>>> [[TMP4]], <4 x float*> poison, <8 x i32> <i32 0, i32 1, i32 2, i32
>>> 3, i32 undef, i32 undef, i32 undef, i32 undef>
>>> -; AVX512F-NEXT: [[TMP11:%.*]] = shufflevector <8 x float*>
>>> [[TMP9]], <8 x float*> [[TMP10]], <8 x i32> <i32 0, i32 8, i32 9,
>>> i32 10, i32 11, i32 undef, i32 undef, i32 undef>
>>> -; AVX512F-NEXT: [[TMP12:%.*]] = shufflevector <2 x float*>
>>> [[TMP7]], <2 x float*> poison, <8 x i32> <i32 0, i32 1, i32 undef,
>>> i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
>>> -; AVX512F-NEXT: [[TMP13:%.*]] = shufflevector <8 x float*>
>>> [[TMP11]], <8 x float*> [[TMP12]], <8 x i32> <i32 0, i32 1, i32 2,
>>> i32 3, i32 4, i32 8, i32 9, i32 undef>
>>> -; AVX512F-NEXT: [[TMP14:%.*]] = insertelement <8 x float*>
>>> [[TMP13]], float* [[TMP8]], i64 7
>>> -; AVX512F-NEXT: [[TMP15:%.*]] = call <8 x float>
>>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP14]], i32 4, <8
>>> x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1
>>> true, i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>> -; AVX512F-NEXT: [[SHUFFLE:%.*]] = shufflevector <8 x float*>
>>> [[TMP9]], <8 x float*> poison, <8 x i32> zeroinitializer
>>> -; AVX512F-NEXT: [[TMP16:%.*]] = getelementptr float, <8 x
>>> float*> [[SHUFFLE]], <8 x i64> <i64 4, i64 13, i64 11, i64 44, i64
>>> 33, i64 30, i64 27, i64 23>
>>> -; AVX512F-NEXT: [[TMP17:%.*]] = call <8 x float>
>>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP16]], i32 4, <8
>>> x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1
>>> true, i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>> -; AVX512F-NEXT: [[TMP18:%.*]] = fdiv <8 x float> [[TMP15]],
>>> [[TMP17]]
>>> +; AVX512F-NEXT: [[TMP3:%.*]] = insertelement <8 x float*>
>>> poison, float* [[TMP1:%.*]], i64 0
>>> +; AVX512F-NEXT: [[SHUFFLE:%.*]] = shufflevector <8 x float*>
>>> [[TMP3]], <8 x float*> poison, <8 x i32> zeroinitializer
>>> +; AVX512F-NEXT: [[TMP4:%.*]] = getelementptr float, <8 x float*>
>>> [[SHUFFLE]], <8 x i64> <i64 4, i64 13, i64 11, i64 44, i64 33, i64
>>> 30, i64 27, i64 23>
>>> +; AVX512F-NEXT: [[TMP5:%.*]] = insertelement <4 x float*>
>>> poison, float* [[TMP1]], i64 0
>>> +; AVX512F-NEXT: [[SHUFFLE1:%.*]] = shufflevector <4 x float*>
>>> [[TMP5]], <4 x float*> poison, <4 x i32> zeroinitializer
>>> +; AVX512F-NEXT: [[TMP6:%.*]] = getelementptr float, <4 x float*>
>>> [[SHUFFLE1]], <4 x i64> <i64 10, i64 3, i64 14, i64 17>
>>> +; AVX512F-NEXT: [[TMP7:%.*]] = insertelement <2 x float*>
>>> poison, float* [[TMP1]], i64 0
>>> +; AVX512F-NEXT: [[TMP8:%.*]] = shufflevector <2 x float*>
>>> [[TMP7]], <2 x float*> poison, <2 x i32> zeroinitializer
>>> +; AVX512F-NEXT: [[TMP9:%.*]] = getelementptr float, <2 x float*>
>>> [[TMP8]], <2 x i64> <i64 8, i64 5>
>>> +; AVX512F-NEXT: [[TMP10:%.*]] = getelementptr inbounds float,
>>> float* [[TMP1]], i64 20
>>> +; AVX512F-NEXT: [[TMP11:%.*]] = shufflevector <4 x float*>
>>> [[TMP6]], <4 x float*> poison, <8 x i32> <i32 0, i32 1, i32 2, i32
>>> 3, i32 undef, i32 undef, i32 undef, i32 undef>
>>> +; AVX512F-NEXT: [[TMP12:%.*]] = shufflevector <8 x float*>
>>> [[TMP3]], <8 x float*> [[TMP11]], <8 x i32> <i32 0, i32 8, i32 9,
>>> i32 10, i32 11, i32 undef, i32 undef, i32 undef>
>>> +; AVX512F-NEXT: [[TMP13:%.*]] = shufflevector <2 x float*>
>>> [[TMP9]], <2 x float*> poison, <8 x i32> <i32 0, i32 1, i32 undef,
>>> i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
>>> +; AVX512F-NEXT: [[TMP14:%.*]] = shufflevector <8 x float*>
>>> [[TMP12]], <8 x float*> [[TMP13]], <8 x i32> <i32 0, i32 1, i32 2,
>>> i32 3, i32 4, i32 8, i32 9, i32 undef>
>>> +; AVX512F-NEXT: [[TMP15:%.*]] = insertelement <8 x float*>
>>> [[TMP14]], float* [[TMP10]], i64 7
>>> +; AVX512F-NEXT: [[TMP16:%.*]] = call <8 x float>
>>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP15]], i32 4, <8
>>> x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1
>>> true, i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>> +; AVX512F-NEXT: [[TMP17:%.*]] = call <8 x float>
>>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP4]], i32 4, <8 x
>>> i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true,
>>> i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>> +; AVX512F-NEXT: [[TMP18:%.*]] = fdiv <8 x float> [[TMP16]],
>>> [[TMP17]]
>>> ; AVX512F-NEXT: [[TMP19:%.*]] = bitcast float* [[TMP0:%.*]] to
>>> <8 x float>*
>>> ; AVX512F-NEXT: store <8 x float> [[TMP18]], <8 x float>*
>>> [[TMP19]], align 4, !tbaa [[TBAA0]]
>>> ; AVX512F-NEXT: ret void
>>> ;
>>> ; AVX512VL-LABEL: @gather_load_div(
>>> -; AVX512VL-NEXT: [[TMP3:%.*]] = insertelement <4 x float*>
>>> poison, float* [[TMP1:%.*]], i64 0
>>> -; AVX512VL-NEXT: [[SHUFFLE1:%.*]] = shufflevector <4 x float*>
>>> [[TMP3]], <4 x float*> poison, <4 x i32> zeroinitializer
>>> -; AVX512VL-NEXT: [[TMP4:%.*]] = getelementptr float, <4 x
>>> float*> [[SHUFFLE1]], <4 x i64> <i64 10, i64 3, i64 14, i64 17>
>>> -; AVX512VL-NEXT: [[TMP5:%.*]] = insertelement <2 x float*>
>>> poison, float* [[TMP1]], i64 0
>>> -; AVX512VL-NEXT: [[TMP6:%.*]] = shufflevector <2 x float*>
>>> [[TMP5]], <2 x float*> poison, <2 x i32> zeroinitializer
>>> -; AVX512VL-NEXT: [[TMP7:%.*]] = getelementptr float, <2 x
>>> float*> [[TMP6]], <2 x i64> <i64 8, i64 5>
>>> -; AVX512VL-NEXT: [[TMP8:%.*]] = getelementptr inbounds float,
>>> float* [[TMP1]], i64 20
>>> -; AVX512VL-NEXT: [[TMP9:%.*]] = insertelement <8 x float*>
>>> poison, float* [[TMP1]], i64 0
>>> -; AVX512VL-NEXT: [[TMP10:%.*]] = shufflevector <4 x float*>
>>> [[TMP4]], <4 x float*> poison, <8 x i32> <i32 0, i32 1, i32 2, i32
>>> 3, i32 undef, i32 undef, i32 undef, i32 undef>
>>> -; AVX512VL-NEXT: [[TMP11:%.*]] = shufflevector <8 x float*>
>>> [[TMP9]], <8 x float*> [[TMP10]], <8 x i32> <i32 0, i32 8, i32 9,
>>> i32 10, i32 11, i32 undef, i32 undef, i32 undef>
>>> -; AVX512VL-NEXT: [[TMP12:%.*]] = shufflevector <2 x float*>
>>> [[TMP7]], <2 x float*> poison, <8 x i32> <i32 0, i32 1, i32 undef,
>>> i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
>>> -; AVX512VL-NEXT: [[TMP13:%.*]] = shufflevector <8 x float*>
>>> [[TMP11]], <8 x float*> [[TMP12]], <8 x i32> <i32 0, i32 1, i32 2,
>>> i32 3, i32 4, i32 8, i32 9, i32 undef>
>>> -; AVX512VL-NEXT: [[TMP14:%.*]] = insertelement <8 x float*>
>>> [[TMP13]], float* [[TMP8]], i64 7
>>> -; AVX512VL-NEXT: [[TMP15:%.*]] = call <8 x float>
>>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP14]], i32 4, <8
>>> x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1
>>> true, i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>> -; AVX512VL-NEXT: [[SHUFFLE:%.*]] = shufflevector <8 x float*>
>>> [[TMP9]], <8 x float*> poison, <8 x i32> zeroinitializer
>>> -; AVX512VL-NEXT: [[TMP16:%.*]] = getelementptr float, <8 x
>>> float*> [[SHUFFLE]], <8 x i64> <i64 4, i64 13, i64 11, i64 44, i64
>>> 33, i64 30, i64 27, i64 23>
>>> -; AVX512VL-NEXT: [[TMP17:%.*]] = call <8 x float>
>>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP16]], i32 4, <8
>>> x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1
>>> true, i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>> -; AVX512VL-NEXT: [[TMP18:%.*]] = fdiv <8 x float> [[TMP15]],
>>> [[TMP17]]
>>> +; AVX512VL-NEXT: [[TMP3:%.*]] = insertelement <8 x float*>
>>> poison, float* [[TMP1:%.*]], i64 0
>>> +; AVX512VL-NEXT: [[SHUFFLE:%.*]] = shufflevector <8 x float*>
>>> [[TMP3]], <8 x float*> poison, <8 x i32> zeroinitializer
>>> +; AVX512VL-NEXT: [[TMP4:%.*]] = getelementptr float, <8 x
>>> float*> [[SHUFFLE]], <8 x i64> <i64 4, i64 13, i64 11, i64 44, i64
>>> 33, i64 30, i64 27, i64 23>
>>> +; AVX512VL-NEXT: [[TMP5:%.*]] = insertelement <4 x float*>
>>> poison, float* [[TMP1]], i64 0
>>> +; AVX512VL-NEXT: [[SHUFFLE1:%.*]] = shufflevector <4 x float*>
>>> [[TMP5]], <4 x float*> poison, <4 x i32> zeroinitializer
>>> +; AVX512VL-NEXT: [[TMP6:%.*]] = getelementptr float, <4 x
>>> float*> [[SHUFFLE1]], <4 x i64> <i64 10, i64 3, i64 14, i64 17>
>>> +; AVX512VL-NEXT: [[TMP7:%.*]] = insertelement <2 x float*>
>>> poison, float* [[TMP1]], i64 0
>>> +; AVX512VL-NEXT: [[TMP8:%.*]] = shufflevector <2 x float*>
>>> [[TMP7]], <2 x float*> poison, <2 x i32> zeroinitializer
>>> +; AVX512VL-NEXT: [[TMP9:%.*]] = getelementptr float, <2 x
>>> float*> [[TMP8]], <2 x i64> <i64 8, i64 5>
>>> +; AVX512VL-NEXT: [[TMP10:%.*]] = getelementptr inbounds float,
>>> float* [[TMP1]], i64 20
>>> +; AVX512VL-NEXT: [[TMP11:%.*]] = shufflevector <4 x float*>
>>> [[TMP6]], <4 x float*> poison, <8 x i32> <i32 0, i32 1, i32 2, i32
>>> 3, i32 undef, i32 undef, i32 undef, i32 undef>
>>> +; AVX512VL-NEXT: [[TMP12:%.*]] = shufflevector <8 x float*>
>>> [[TMP3]], <8 x float*> [[TMP11]], <8 x i32> <i32 0, i32 8, i32 9,
>>> i32 10, i32 11, i32 undef, i32 undef, i32 undef>
>>> +; AVX512VL-NEXT: [[TMP13:%.*]] = shufflevector <2 x float*>
>>> [[TMP9]], <2 x float*> poison, <8 x i32> <i32 0, i32 1, i32 undef,
>>> i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
>>> +; AVX512VL-NEXT: [[TMP14:%.*]] = shufflevector <8 x float*>
>>> [[TMP12]], <8 x float*> [[TMP13]], <8 x i32> <i32 0, i32 1, i32 2,
>>> i32 3, i32 4, i32 8, i32 9, i32 undef>
>>> +; AVX512VL-NEXT: [[TMP15:%.*]] = insertelement <8 x float*>
>>> [[TMP14]], float* [[TMP10]], i64 7
>>> +; AVX512VL-NEXT: [[TMP16:%.*]] = call <8 x float>
>>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP15]], i32 4, <8
>>> x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1
>>> true, i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>> +; AVX512VL-NEXT: [[TMP17:%.*]] = call <8 x float>
>>> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP4]], i32 4, <8 x
>>> i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true,
>>> i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>> +; AVX512VL-NEXT: [[TMP18:%.*]] = fdiv <8 x float> [[TMP16]],
>>> [[TMP17]]
>>> ; AVX512VL-NEXT: [[TMP19:%.*]] = bitcast float* [[TMP0:%.*]] to
>>> <8 x float>*
>>> ; AVX512VL-NEXT: store <8 x float> [[TMP18]], <8 x float>*
>>> [[TMP19]], align 4, !tbaa [[TBAA0]]
>>> ; AVX512VL-NEXT: ret void
>>>
>>> diff --git
>>> a/llvm/test/Transforms/SLPVectorizer/X86/shrink_after_reorder2.ll
>>> b/llvm/test/Transforms/SLPVectorizer/X86/shrink_after_reorder2.ll
>>> index a4a388e9d095c..6946ab292cdf5 100644
>>> --- a/llvm/test/Transforms/SLPVectorizer/X86/shrink_after_reorder2.ll
>>> +++ b/llvm/test/Transforms/SLPVectorizer/X86/shrink_after_reorder2.ll
>>> @@ -21,11 +21,11 @@ define void @foo(%class.e* %this, %struct.a* %p,
>>> i32 %add7) {
>>> ; CHECK-NEXT: i32 2, label [[SW_BB]]
>>> ; CHECK-NEXT: ]
>>> ; CHECK: sw.bb:
>>> -; CHECK-NEXT: [[TMP2:%.*]] = bitcast i32* [[G]] to <2 x i32>*
>>> -; CHECK-NEXT: [[TMP3:%.*]] = load <2 x i32>, <2 x i32>*
>>> [[TMP2]], align 4
>>> ; CHECK-NEXT: [[SHRINK_SHUFFLE:%.*]] = shufflevector <4 x i32>
>>> [[SHUFFLE]], <4 x i32> poison, <2 x i32> <i32 2, i32 0>
>>> -; CHECK-NEXT: [[TMP4:%.*]] = xor <2 x i32> [[SHRINK_SHUFFLE]],
>>> <i32 -1, i32 -1>
>>> -; CHECK-NEXT: [[TMP5:%.*]] = add <2 x i32> [[TMP3]], [[TMP4]]
>>> +; CHECK-NEXT: [[TMP2:%.*]] = xor <2 x i32> [[SHRINK_SHUFFLE]],
>>> <i32 -1, i32 -1>
>>> +; CHECK-NEXT: [[TMP3:%.*]] = bitcast i32* [[G]] to <2 x i32>*
>>> +; CHECK-NEXT: [[TMP4:%.*]] = load <2 x i32>, <2 x i32>*
>>> [[TMP3]], align 4
>>> +; CHECK-NEXT: [[TMP5:%.*]] = add <2 x i32> [[TMP4]], [[TMP2]]
>>> ; CHECK-NEXT: br label [[SW_EPILOG]]
>>> ; CHECK: sw.epilog:
>>> ; CHECK-NEXT: [[TMP6:%.*]] = phi <2 x i32> [ undef,
>>> [[ENTRY:%.*]] ], [ [[TMP5]], [[SW_BB]] ]
>>>
>>> diff --git
>>> a/llvm/test/Transforms/SLPVectorizer/X86/vectorize-widest-phis.ll
>>> b/llvm/test/Transforms/SLPVectorizer/X86/vectorize-widest-phis.ll
>>> index 87709a87b3692..109c27e4f4f4e 100644
>>> --- a/llvm/test/Transforms/SLPVectorizer/X86/vectorize-widest-phis.ll
>>> +++ b/llvm/test/Transforms/SLPVectorizer/X86/vectorize-widest-phis.ll
>>> @@ -16,8 +16,8 @@ define void @foo() {
>>> ; CHECK-NEXT: [[TMP3:%.*]] = load double, double* undef, align 8
>>> ; CHECK-NEXT: br i1 undef, label [[BB3]], label [[BB4:%.*]]
>>> ; CHECK: bb4:
>>> -; CHECK-NEXT: [[CONV2:%.*]] = uitofp i16 undef to double
>>> ; CHECK-NEXT: [[TMP4:%.*]] = fpext <4 x float> [[TMP2]] to <4 x
>>> double>
>>> +; CHECK-NEXT: [[CONV2:%.*]] = uitofp i16 undef to double
>>> ; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x double> <double
>>> undef, double poison>, double [[TMP3]], i32 1
>>> ; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x double> <double
>>> undef, double poison>, double [[CONV2]], i32 1
>>> ; CHECK-NEXT: [[TMP7:%.*]] = fsub <2 x double> [[TMP5]], [[TMP6]]
>>>
>>> diff --git a/llvm/test/Transforms/SLPVectorizer/slp-max-phi-size.ll
>>> b/llvm/test/Transforms/SLPVectorizer/slp-max-phi-size.ll
>>> index 33ba97921e878..da18a937a6477 100644
>>> --- a/llvm/test/Transforms/SLPVectorizer/slp-max-phi-size.ll
>>> +++ b/llvm/test/Transforms/SLPVectorizer/slp-max-phi-size.ll
>>> @@ -133,27 +133,27 @@ define void @phi_float32(half %hval, float
>>> %fval) {
>>> ; MAX256-NEXT: br label [[BB1:%.*]]
>>> ; MAX256: bb1:
>>> ; MAX256-NEXT: [[I:%.*]] = fpext half [[HVAL:%.*]] to float
>>> -; MAX256-NEXT: [[I3:%.*]] = fpext half [[HVAL]] to float
>>> -; MAX256-NEXT: [[I6:%.*]] = fpext half [[HVAL]] to float
>>> -; MAX256-NEXT: [[I9:%.*]] = fpext half [[HVAL]] to float
>>> ; MAX256-NEXT: [[TMP0:%.*]] = insertelement <8 x float> poison,
>>> float [[I]], i32 0
>>> ; MAX256-NEXT: [[SHUFFLE11:%.*]] = shufflevector <8 x float>
>>> [[TMP0]], <8 x float> poison, <8 x i32> zeroinitializer
>>> ; MAX256-NEXT: [[TMP1:%.*]] = insertelement <8 x float> poison,
>>> float [[FVAL:%.*]], i32 0
>>> ; MAX256-NEXT: [[SHUFFLE12:%.*]] = shufflevector <8 x float>
>>> [[TMP1]], <8 x float> poison, <8 x i32> zeroinitializer
>>> ; MAX256-NEXT: [[TMP2:%.*]] = fmul <8 x float> [[SHUFFLE11]],
>>> [[SHUFFLE12]]
>>> -; MAX256-NEXT: [[TMP3:%.*]] = fadd <8 x float> zeroinitializer,
>>> [[TMP2]]
>>> -; MAX256-NEXT: [[TMP4:%.*]] = insertelement <8 x float> poison,
>>> float [[I3]], i32 0
>>> -; MAX256-NEXT: [[SHUFFLE:%.*]] = shufflevector <8 x float>
>>> [[TMP4]], <8 x float> poison, <8 x i32> zeroinitializer
>>> -; MAX256-NEXT: [[TMP5:%.*]] = fmul <8 x float> [[SHUFFLE]],
>>> [[SHUFFLE12]]
>>> -; MAX256-NEXT: [[TMP6:%.*]] = fadd <8 x float> zeroinitializer,
>>> [[TMP5]]
>>> -; MAX256-NEXT: [[TMP7:%.*]] = insertelement <8 x float> poison,
>>> float [[I6]], i32 0
>>> -; MAX256-NEXT: [[SHUFFLE5:%.*]] = shufflevector <8 x float>
>>> [[TMP7]], <8 x float> poison, <8 x i32> zeroinitializer
>>> -; MAX256-NEXT: [[TMP8:%.*]] = fmul <8 x float> [[SHUFFLE5]],
>>> [[SHUFFLE12]]
>>> -; MAX256-NEXT: [[TMP9:%.*]] = fadd <8 x float> zeroinitializer,
>>> [[TMP8]]
>>> -; MAX256-NEXT: [[TMP10:%.*]] = insertelement <8 x float> poison,
>>> float [[I9]], i32 0
>>> -; MAX256-NEXT: [[SHUFFLE8:%.*]] = shufflevector <8 x float>
>>> [[TMP10]], <8 x float> poison, <8 x i32> zeroinitializer
>>> -; MAX256-NEXT: [[TMP11:%.*]] = fmul <8 x float> [[SHUFFLE8]],
>>> [[SHUFFLE12]]
>>> -; MAX256-NEXT: [[TMP12:%.*]] = fadd <8 x float> zeroinitializer,
>>> [[TMP11]]
>>> +; MAX256-NEXT: [[I3:%.*]] = fpext half [[HVAL]] to float
>>> +; MAX256-NEXT: [[TMP3:%.*]] = insertelement <8 x float> poison,
>>> float [[I3]], i32 0
>>> +; MAX256-NEXT: [[SHUFFLE:%.*]] = shufflevector <8 x float>
>>> [[TMP3]], <8 x float> poison, <8 x i32> zeroinitializer
>>> +; MAX256-NEXT: [[TMP4:%.*]] = fmul <8 x float> [[SHUFFLE]],
>>> [[SHUFFLE12]]
>>> +; MAX256-NEXT: [[I6:%.*]] = fpext half [[HVAL]] to float
>>> +; MAX256-NEXT: [[TMP5:%.*]] = insertelement <8 x float> poison,
>>> float [[I6]], i32 0
>>> +; MAX256-NEXT: [[SHUFFLE5:%.*]] = shufflevector <8 x float>
>>> [[TMP5]], <8 x float> poison, <8 x i32> zeroinitializer
>>> +; MAX256-NEXT: [[TMP6:%.*]] = fmul <8 x float> [[SHUFFLE5]],
>>> [[SHUFFLE12]]
>>> +; MAX256-NEXT: [[I9:%.*]] = fpext half [[HVAL]] to float
>>> +; MAX256-NEXT: [[TMP7:%.*]] = insertelement <8 x float> poison,
>>> float [[I9]], i32 0
>>> +; MAX256-NEXT: [[SHUFFLE8:%.*]] = shufflevector <8 x float>
>>> [[TMP7]], <8 x float> poison, <8 x i32> zeroinitializer
>>> +; MAX256-NEXT: [[TMP8:%.*]] = fmul <8 x float> [[SHUFFLE8]],
>>> [[SHUFFLE12]]
>>> +; MAX256-NEXT: [[TMP9:%.*]] = fadd <8 x float> zeroinitializer,
>>> [[TMP2]]
>>> +; MAX256-NEXT: [[TMP10:%.*]] = fadd <8 x float> zeroinitializer,
>>> [[TMP4]]
>>> +; MAX256-NEXT: [[TMP11:%.*]] = fadd <8 x float> zeroinitializer,
>>> [[TMP6]]
>>> +; MAX256-NEXT: [[TMP12:%.*]] = fadd <8 x float> zeroinitializer,
>>> [[TMP8]]
>>> ; MAX256-NEXT: switch i32 undef, label [[BB5:%.*]] [
>>> ; MAX256-NEXT: i32 0, label [[BB2:%.*]]
>>> ; MAX256-NEXT: i32 1, label [[BB3:%.*]]
>>> @@ -166,10 +166,10 @@ define void @phi_float32(half %hval, float
>>> %fval) {
>>> ; MAX256: bb5:
>>> ; MAX256-NEXT: br label [[BB2]]
>>> ; MAX256: bb2:
>>> -; MAX256-NEXT: [[TMP13:%.*]] = phi <8 x float> [ [[TMP6]],
>>> [[BB3]] ], [ [[SHUFFLE12]], [[BB4]] ], [ [[SHUFFLE12]], [[BB5]] ], [
>>> [[SHUFFLE12]], [[BB1]] ]
>>> -; MAX256-NEXT: [[TMP14:%.*]] = phi <8 x float> [ [[TMP9]],
>>> [[BB3]] ], [ [[SHUFFLE12]], [[BB4]] ], [ [[TMP9]], [[BB5]] ], [
>>> [[TMP9]], [[BB1]] ]
>>> +; MAX256-NEXT: [[TMP13:%.*]] = phi <8 x float> [ [[TMP10]],
>>> [[BB3]] ], [ [[SHUFFLE12]], [[BB4]] ], [ [[SHUFFLE12]], [[BB5]] ], [
>>> [[SHUFFLE12]], [[BB1]] ]
>>> +; MAX256-NEXT: [[TMP14:%.*]] = phi <8 x float> [ [[TMP11]],
>>> [[BB3]] ], [ [[SHUFFLE12]], [[BB4]] ], [ [[TMP11]], [[BB5]] ], [
>>> [[TMP11]], [[BB1]] ]
>>> ; MAX256-NEXT: [[TMP15:%.*]] = phi <8 x float> [ [[TMP12]],
>>> [[BB3]] ], [ [[TMP12]], [[BB4]] ], [ [[SHUFFLE12]], [[BB5]] ], [
>>> [[TMP12]], [[BB1]] ]
>>> -; MAX256-NEXT: [[TMP16:%.*]] = phi <8 x float> [ [[TMP3]],
>>> [[BB3]] ], [ [[TMP3]], [[BB4]] ], [ [[TMP3]], [[BB5]] ], [
>>> [[SHUFFLE12]], [[BB1]] ]
>>> +; MAX256-NEXT: [[TMP16:%.*]] = phi <8 x float> [ [[TMP9]],
>>> [[BB3]] ], [ [[TMP9]], [[BB4]] ], [ [[TMP9]], [[BB5]] ], [
>>> [[SHUFFLE12]], [[BB1]] ]
>>> ; MAX256-NEXT: [[TMP17:%.*]] = extractelement <8 x float>
>>> [[TMP14]], i32 7
>>> ; MAX256-NEXT: store float [[TMP17]], float* undef, align 4
>>> ; MAX256-NEXT: ret void
>>> @@ -179,27 +179,27 @@ define void @phi_float32(half %hval, float
>>> %fval) {
>>> ; MAX1024-NEXT: br label [[BB1:%.*]]
>>> ; MAX1024: bb1:
>>> ; MAX1024-NEXT: [[I:%.*]] = fpext half [[HVAL:%.*]] to float
>>> -; MAX1024-NEXT: [[I3:%.*]] = fpext half [[HVAL]] to float
>>> -; MAX1024-NEXT: [[I6:%.*]] = fpext half [[HVAL]] to float
>>> -; MAX1024-NEXT: [[I9:%.*]] = fpext half [[HVAL]] to float
>>> ; MAX1024-NEXT: [[TMP0:%.*]] = insertelement <8 x float>
>>> poison, float [[I]], i32 0
>>> ; MAX1024-NEXT: [[SHUFFLE11:%.*]] = shufflevector <8 x float>
>>> [[TMP0]], <8 x float> poison, <8 x i32> zeroinitializer
>>> ; MAX1024-NEXT: [[TMP1:%.*]] = insertelement <8 x float>
>>> poison, float [[FVAL:%.*]], i32 0
>>> ; MAX1024-NEXT: [[SHUFFLE12:%.*]] = shufflevector <8 x float>
>>> [[TMP1]], <8 x float> poison, <8 x i32> zeroinitializer
>>> ; MAX1024-NEXT: [[TMP2:%.*]] = fmul <8 x float> [[SHUFFLE11]],
>>> [[SHUFFLE12]]
>>> -; MAX1024-NEXT: [[TMP3:%.*]] = fadd <8 x float> zeroinitializer,
>>> [[TMP2]]
>>> -; MAX1024-NEXT: [[TMP4:%.*]] = insertelement <8 x float> poison,
>>> float [[I3]], i32 0
>>> -; MAX1024-NEXT: [[SHUFFLE:%.*]] = shufflevector <8 x float>
>>> [[TMP4]], <8 x float> poison, <8 x i32> zeroinitializer
>>> -; MAX1024-NEXT: [[TMP5:%.*]] = fmul <8 x float> [[SHUFFLE]],
>>> [[SHUFFLE12]]
>>> -; MAX1024-NEXT: [[TMP6:%.*]] = fadd <8 x float> zeroinitializer,
>>> [[TMP5]]
>>> -; MAX1024-NEXT: [[TMP7:%.*]] = insertelement <8 x float> poison,
>>> float [[I6]], i32 0
>>> -; MAX1024-NEXT: [[SHUFFLE5:%.*]] = shufflevector <8 x float>
>>> [[TMP7]], <8 x float> poison, <8 x i32> zeroinitializer
>>> -; MAX1024-NEXT: [[TMP8:%.*]] = fmul <8 x float> [[SHUFFLE5]],
>>> [[SHUFFLE12]]
>>> -; MAX1024-NEXT: [[TMP9:%.*]] = fadd <8 x float> zeroinitializer,
>>> [[TMP8]]
>>> -; MAX1024-NEXT: [[TMP10:%.*]] = insertelement <8 x float>
>>> poison, float [[I9]], i32 0
>>> -; MAX1024-NEXT: [[SHUFFLE8:%.*]] = shufflevector <8 x float>
>>> [[TMP10]], <8 x float> poison, <8 x i32> zeroinitializer
>>> -; MAX1024-NEXT: [[TMP11:%.*]] = fmul <8 x float> [[SHUFFLE8]],
>>> [[SHUFFLE12]]
>>> -; MAX1024-NEXT: [[TMP12:%.*]] = fadd <8 x float>
>>> zeroinitializer, [[TMP11]]
>>> +; MAX1024-NEXT: [[I3:%.*]] = fpext half [[HVAL]] to float
>>> +; MAX1024-NEXT: [[TMP3:%.*]] = insertelement <8 x float> poison,
>>> float [[I3]], i32 0
>>> +; MAX1024-NEXT: [[SHUFFLE:%.*]] = shufflevector <8 x float>
>>> [[TMP3]], <8 x float> poison, <8 x i32> zeroinitializer
>>> +; MAX1024-NEXT: [[TMP4:%.*]] = fmul <8 x float> [[SHUFFLE]],
>>> [[SHUFFLE12]]
>>> +; MAX1024-NEXT: [[I6:%.*]] = fpext half [[HVAL]] to float
>>> +; MAX1024-NEXT: [[TMP5:%.*]] = insertelement <8 x float> poison,
>>> float [[I6]], i32 0
>>> +; MAX1024-NEXT: [[SHUFFLE5:%.*]] = shufflevector <8 x float>
>>> [[TMP5]], <8 x float> poison, <8 x i32> zeroinitializer
>>> +; MAX1024-NEXT: [[TMP6:%.*]] = fmul <8 x float> [[SHUFFLE5]],
>>> [[SHUFFLE12]]
>>> +; MAX1024-NEXT: [[I9:%.*]] = fpext half [[HVAL]] to float
>>> +; MAX1024-NEXT: [[TMP7:%.*]] = insertelement <8 x float> poison,
>>> float [[I9]], i32 0
>>> +; MAX1024-NEXT: [[SHUFFLE8:%.*]] = shufflevector <8 x float>
>>> [[TMP7]], <8 x float> poison, <8 x i32> zeroinitializer
>>> +; MAX1024-NEXT: [[TMP8:%.*]] = fmul <8 x float> [[SHUFFLE8]],
>>> [[SHUFFLE12]]
>>> +; MAX1024-NEXT: [[TMP9:%.*]] = fadd <8 x float> zeroinitializer,
>>> [[TMP2]]
>>> +; MAX1024-NEXT: [[TMP10:%.*]] = fadd <8 x float>
>>> zeroinitializer, [[TMP4]]
>>> +; MAX1024-NEXT: [[TMP11:%.*]] = fadd <8 x float>
>>> zeroinitializer, [[TMP6]]
>>> +; MAX1024-NEXT: [[TMP12:%.*]] = fadd <8 x float>
>>> zeroinitializer, [[TMP8]]
>>> ; MAX1024-NEXT: switch i32 undef, label [[BB5:%.*]] [
>>> ; MAX1024-NEXT: i32 0, label [[BB2:%.*]]
>>> ; MAX1024-NEXT: i32 1, label [[BB3:%.*]]
>>> @@ -212,10 +212,10 @@ define void @phi_float32(half %hval, float
>>> %fval) {
>>> ; MAX1024: bb5:
>>> ; MAX1024-NEXT: br label [[BB2]]
>>> ; MAX1024: bb2:
>>> -; MAX1024-NEXT: [[TMP13:%.*]] = phi <8 x float> [ [[TMP6]],
>>> [[BB3]] ], [ [[SHUFFLE12]], [[BB4]] ], [ [[SHUFFLE12]], [[BB5]] ], [
>>> [[SHUFFLE12]], [[BB1]] ]
>>> -; MAX1024-NEXT: [[TMP14:%.*]] = phi <8 x float> [ [[TMP9]],
>>> [[BB3]] ], [ [[SHUFFLE12]], [[BB4]] ], [ [[TMP9]], [[BB5]] ], [
>>> [[TMP9]], [[BB1]] ]
>>> +; MAX1024-NEXT: [[TMP13:%.*]] = phi <8 x float> [ [[TMP10]],
>>> [[BB3]] ], [ [[SHUFFLE12]], [[BB4]] ], [ [[SHUFFLE12]], [[BB5]] ], [
>>> [[SHUFFLE12]], [[BB1]] ]
>>> +; MAX1024-NEXT: [[TMP14:%.*]] = phi <8 x float> [ [[TMP11]],
>>> [[BB3]] ], [ [[SHUFFLE12]], [[BB4]] ], [ [[TMP11]], [[BB5]] ], [
>>> [[TMP11]], [[BB1]] ]
>>> ; MAX1024-NEXT: [[TMP15:%.*]] = phi <8 x float> [ [[TMP12]],
>>> [[BB3]] ], [ [[TMP12]], [[BB4]] ], [ [[SHUFFLE12]], [[BB5]] ], [
>>> [[TMP12]], [[BB1]] ]
>>> -; MAX1024-NEXT: [[TMP16:%.*]] = phi <8 x float> [ [[TMP3]],
>>> [[BB3]] ], [ [[TMP3]], [[BB4]] ], [ [[TMP3]], [[BB5]] ], [
>>> [[SHUFFLE12]], [[BB1]] ]
>>> +; MAX1024-NEXT: [[TMP16:%.*]] = phi <8 x float> [ [[TMP9]],
>>> [[BB3]] ], [ [[TMP9]], [[BB4]] ], [ [[TMP9]], [[BB5]] ], [
>>> [[SHUFFLE12]], [[BB1]] ]
>>> ; MAX1024-NEXT: [[TMP17:%.*]] = extractelement <8 x float>
>>> [[TMP14]], i32 7
>>> ; MAX1024-NEXT: store float [[TMP17]], float* undef, align 4
>>> ; MAX1024-NEXT: ret void
>>>
>>>
>>> _______________________________________________
>>> llvm-commits mailing list
>>> llvm-commits at lists.llvm.org
>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits