[llvm] d65cc85 - [SLP]Do not schedule instructions with constants/argument/phi operands and external users.

Philip Reames via llvm-commits llvm-commits at lists.llvm.org
Sat Mar 19 13:37:21 PDT 2022


You should probably use isSafeToSpeculativelyExecute rather than 
explicitly filtering by instruction type.  I pushed 6253b77d which fixes 
the same issue in the generic scheduling code.

Philip

On 3/19/22 13:23, Alexey Bataev wrote:
> Hi, thanks for the tests, need to filter out divs/rems, will do this on Monday.
>
> Best regards,
> Alexey Bataev
>
>> On March 19, 2022, at 13:07, Philip Reames via llvm-commits <llvm-commits at lists.llvm.org> wrote:
>>
>> Ok, this is definitely wrong.  But so is the existing code.  I plan on fixing the generic case shortly, but I'm going to leave your special case to you to fix or revert.  I don't understand the invariants of this patch enough to be comfortable making a fix.
>>
>> Here's a test case for the special case you added (also committed in bdbcca61):
>>
>> ; Variant of test10 with block-invariant operands to the udivs
>> ; FIXME: This is wrong, we're hoisting a faulting udiv above an infinite loop.
>> define void @test11(i64 %x, i64 %y, i64* %b, i64* %c) {
>> ; CHECK-LABEL: @test11(
>> ; CHECK-NEXT:    [[TMP1:%.*]] = insertelement <2 x i64> poison, i64 [[X:%.*]], i32 0
>> ; CHECK-NEXT:    [[TMP2:%.*]] = insertelement <2 x i64> [[TMP1]], i64 [[Y:%.*]], i32 1
>> ; CHECK-NEXT:    [[TMP3:%.*]] = udiv <2 x i64> <i64 200, i64 200>, [[TMP2]]
>> ; CHECK-NEXT:    [[TMP4:%.*]] = extractelement <2 x i64> [[TMP3]], i32 0
>> ; CHECK-NEXT:    store i64 [[TMP4]], i64* [[B:%.*]], align 4
>> ; CHECK-NEXT:    [[TMP5:%.*]] = call i64 @may_inf_loop_ro()
>> ; CHECK-NEXT:    [[CA2:%.*]] = getelementptr i64, i64* [[C:%.*]], i32 1
>> ; CHECK-NEXT:    [[TMP6:%.*]] = bitcast i64* [[C]] to <2 x i64>*
>> ; CHECK-NEXT:    [[TMP7:%.*]] = load <2 x i64>, <2 x i64>* [[TMP6]], align 4
>> ; CHECK-NEXT:    [[TMP8:%.*]] = add <2 x i64> [[TMP3]], [[TMP7]]
>> ; CHECK-NEXT:    [[B2:%.*]] = getelementptr i64, i64* [[B]], i32 1
>> ; CHECK-NEXT:    [[TMP9:%.*]] = bitcast i64* [[B]] to <2 x i64>*
>> ; CHECK-NEXT:    store <2 x i64> [[TMP8]], <2 x i64>* [[TMP9]], align 4
>> ; CHECK-NEXT:    ret void
>> ;
>>    %u1 = udiv i64 200, %x
>>    store i64 %u1, i64* %b
>>    call i64 @may_inf_loop_ro()
>>    %u2 = udiv i64 200, %y
>>
>>    %c1 = load i64, i64* %c
>>    %ca2 = getelementptr i64, i64* %c, i32 1
>>    %c2 = load i64, i64* %ca2
>>    %add1 = add i64 %u1, %c1
>>    %add2 = add i64 %u2, %c2
>>
>>    store i64 %add1, i64* %b
>>    %b2 = getelementptr i64, i64* %b, i32 1
>>    store i64 %add2, i64* %b2
>>    ret void
>> }
>>
>>> On 3/18/22 13:27, Philip Reames wrote:
>>> I added a comment to the existing code in 1093949cf which more fully explains the missing dependency and hidden assumption.
>>>
>>> I am not 100% sure your code has the same problem.  I'd suggest exploring combinations such as a potentially faulting udiv following a readnone infinite loop call with block-invariant operands.  I don't have a particular test case for you because massaging the code into actually reordering is quite involved. I tried, but did not manage to create one with a few minutes of trying.
>>>
>>> Philip
>>>
>>>> On 3/18/22 10:26, Philip Reames via llvm-commits wrote:
>>>> FYI, I'm pretty sure this patch is wrong. The case which I believe it gets wrong involves a bundle containing a readonly call which is not guaranteed to return. (i.e. may contain an infinite loop)  If I'm reading the code correctly, it may reorder such a call earlier in the basic block - including reordering of two such calls in the process.
>>>>
>>>> This is the same bug which existed in D118538 which is why I noticed it.
>>>>
>>>> If this case isn't possible for some reason, please add test coverage and clarify comments as to why.
>>>>
>>>> Philip
>>>>
>>>> On 3/17/22 11:04, Alexey Bataev via llvm-commits wrote:
>>>>> Author: Alexey Bataev
>>>>> Date: 2022-03-17T11:03:45-07:00
>>>>> New Revision: d65cc8597792ab04142cd2214c46c5c167191bcd
>>>>>
>>>>> URL: https://github.com/llvm/llvm-project/commit/d65cc8597792ab04142cd2214c46c5c167191bcd
>>>>> DIFF: https://github.com/llvm/llvm-project/commit/d65cc8597792ab04142cd2214c46c5c167191bcd.diff
>>>>>
>>>>> LOG: [SLP]Do not schedule instructions with constants/argument/phi operands and external users.
>>>>>
>>>>> No need to schedule entry nodes where no instruction reads or writes
>>>>> memory and either all operands are constants, arguments, phis, or
>>>>> instructions from other blocks, or all users are phis or in other
>>>>> blocks.
>>>>> The resulting vector instructions can be placed at the beginning of
>>>>> the basic block without scheduling (if the operands do not need to be
>>>>> scheduled) or at the end of the block (if the users are outside of
>>>>> the block).
>>>>> This saves some compile time and scheduling resources.
>>>>>
>>>>> Differential Revision: https://reviews.llvm.org/D121121
>>>>>
>>>>> Added:
>>>>>
>>>>> Modified:
>>>>>       llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
>>>>> llvm/test/Transforms/SLPVectorizer/AArch64/gather-reduce.ll
>>>>>       llvm/test/Transforms/SLPVectorizer/AArch64/gather-root.ll
>>>>> llvm/test/Transforms/SLPVectorizer/AArch64/spillcost-di.ll
>>>>>       llvm/test/Transforms/SLPVectorizer/X86/PR35628_2.ll
>>>>>       llvm/test/Transforms/SLPVectorizer/X86/PR40310.ll
>>>>> llvm/test/Transforms/SLPVectorizer/X86/crash_exceed_scheduling.ll
>>>>>       llvm/test/Transforms/SLPVectorizer/X86/opaque-ptr.ll
>>>>> llvm/test/Transforms/SLPVectorizer/X86/pr47629-inseltpoison.ll
>>>>>       llvm/test/Transforms/SLPVectorizer/X86/pr47629.ll
>>>>> llvm/test/Transforms/SLPVectorizer/X86/shrink_after_reorder2.ll
>>>>> llvm/test/Transforms/SLPVectorizer/X86/vectorize-widest-phis.ll
>>>>>       llvm/test/Transforms/SLPVectorizer/slp-max-phi-size.ll
>>>>>
>>>>> Removed:
>>>>>
>>>>>
>>>>> ################################################################################
>>>>> diff  --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
>>>>> index 48382a12fcf3c..9ab31198adaab 100644
>>>>> --- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
>>>>> +++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
>>>>> @@ -776,6 +776,57 @@ static void reorderScalars(SmallVectorImpl<Value *> &Scalars,
>>>>>          Scalars[Mask[I]] = Prev[I];
>>>>>    }
>>>>>    +/// Checks if the provided value does not require scheduling. It does not
>>>>> +/// require scheduling if this is not an instruction or it is an instruction
>>>>> +/// that does not read/write memory and all operands are either not instructions
>>>>> +/// or phi nodes or instructions from different blocks.
>>>>> +static bool areAllOperandsNonInsts(Value *V) {
>>>>> +  auto *I = dyn_cast<Instruction>(V);
>>>>> +  if (!I)
>>>>> +    return true;
>>>>> +  return !I->mayReadOrWriteMemory() && all_of(I->operands(), [I](Value *V) {
>>>>> +    auto *IO = dyn_cast<Instruction>(V);
>>>>> +    if (!IO)
>>>>> +      return true;
>>>>> +    return isa<PHINode>(IO) || IO->getParent() != I->getParent();
>>>>> +  });
>>>>> +}
>>>>> +
>>>>> +/// Checks if the provided value does not require scheduling. It does not
>>>>> +/// require scheduling if this is not an instruction or it is an instruction
>>>>> +/// that does not read/write memory and all users are phi nodes or instructions
>>>>> +/// from different blocks.
>>>>> +static bool isUsedOutsideBlock(Value *V) {
>>>>> +  auto *I = dyn_cast<Instruction>(V);
>>>>> +  if (!I)
>>>>> +    return true;
>>>>> +  // Limits the number of uses to save compile time.
>>>>> +  constexpr int UsesLimit = 8;
>>>>> +  return !I->mayReadOrWriteMemory() && !I->hasNUsesOrMore(UsesLimit) &&
>>>>> +         all_of(I->users(), [I](User *U) {
>>>>> +           auto *IU = dyn_cast<Instruction>(U);
>>>>> +           if (!IU)
>>>>> +             return true;
>>>>> +           return IU->getParent() != I->getParent() || isa<PHINode>(IU);
>>>>> +         });
>>>>> +}
>>>>> +
>>>>> +/// Checks if the specified value does not require scheduling. It does not
>>>>> +/// require scheduling if all operands and all users do not need to be scheduled
>>>>> +/// in the current basic block.
>>>>> +static bool doesNotNeedToBeScheduled(Value *V) {
>>>>> +  return areAllOperandsNonInsts(V) && isUsedOutsideBlock(V);
>>>>> +}
>>>>> +
>>>>> +/// Checks if the specified array of instructions does not require scheduling.
>>>>> +/// It is so if either all instructions have operands that do not require
>>>>> +/// scheduling, or all their users do not require scheduling since they are
>>>>> +/// phis or in other basic blocks.
>>>>> +static bool doesNotNeedToSchedule(ArrayRef<Value *> VL) {
>>>>> +  return !VL.empty() &&
>>>>> +         (all_of(VL, isUsedOutsideBlock) || all_of(VL, areAllOperandsNonInsts));
>>>>> +}
>>>>> +
>>>>>    namespace slpvectorizer {
>>>>>      /// Bottom Up SLP Vectorizer.
>>>>> @@ -2359,15 +2410,21 @@ class BoUpSLP {
>>>>>            ScalarToTreeEntry[V] = Last;
>>>>>          }
>>>>>          // Update the scheduler bundle to point to this TreeEntry.
>>>>> -      unsigned Lane = 0;
>>>>> -      for (ScheduleData *BundleMember = Bundle.getValue(); BundleMember;
>>>>> -           BundleMember = BundleMember->NextInBundle) {
>>>>> -        BundleMember->TE = Last;
>>>>> -        BundleMember->Lane = Lane;
>>>>> -        ++Lane;
>>>>> -      }
>>>>> -      assert((!Bundle.getValue() || Lane == VL.size()) &&
>>>>> +      ScheduleData *BundleMember = Bundle.getValue();
>>>>> +      assert((BundleMember || isa<PHINode>(S.MainOp) ||
>>>>> +              isVectorLikeInstWithConstOps(S.MainOp) ||
>>>>> +              doesNotNeedToSchedule(VL)) &&
>>>>>                 "Bundle and VL out of sync");
>>>>> +      if (BundleMember) {
>>>>> +        for (Value *V : VL) {
>>>>> +          if (doesNotNeedToBeScheduled(V))
>>>>> +            continue;
>>>>> +          assert(BundleMember && "Unexpected end of bundle.");
>>>>> +          BundleMember->TE = Last;
>>>>> +          BundleMember = BundleMember->NextInBundle;
>>>>> +        }
>>>>> +      }
>>>>> +      assert(!BundleMember && "Bundle and VL out of sync");
>>>>>        } else {
>>>>>          MustGather.insert(VL.begin(), VL.end());
>>>>>        }
>>>>> @@ -2504,7 +2561,6 @@ class BoUpSLP {
>>>>>          clearDependencies();
>>>>>          OpValue = OpVal;
>>>>>          TE = nullptr;
>>>>> -      Lane = -1;
>>>>>        }
>>>>>          /// Verify basic self consistency properties
>>>>> @@ -2544,7 +2600,7 @@ class BoUpSLP {
>>>>>        /// Returns true if it represents an instruction bundle and not only a
>>>>>        /// single instruction.
>>>>>        bool isPartOfBundle() const {
>>>>> -      return NextInBundle != nullptr || FirstInBundle != this;
>>>>> +      return NextInBundle != nullptr || FirstInBundle != this || TE;
>>>>>        }
>>>>>          /// Returns true if it is ready for scheduling, i.e. it has no more
>>>>> @@ -2649,9 +2705,6 @@ class BoUpSLP {
>>>>>        /// Note that this is negative as long as Dependencies is not calculated.
>>>>>        int UnscheduledDeps = InvalidDeps;
>>>>>    -    /// The lane of this node in the TreeEntry.
>>>>> -    int Lane = -1;
>>>>> -
>>>>>        /// True if this instruction is scheduled (or considered as scheduled in the
>>>>>        /// dry-run).
>>>>>        bool IsScheduled = false;
>>>>> @@ -2669,6 +2722,21 @@ class BoUpSLP {
>>>>>      friend struct DOTGraphTraits<BoUpSLP *>;
>>>>>        /// Contains all scheduling data for a basic block.
>>>>> +  /// It does not schedule instructions that do not read/write memory
>>>>> +  /// and whose operands are either constants, or arguments, or phis, or
>>>>> +  /// instructions from other blocks, or whose users are phis or from
>>>>> +  /// other blocks. The resulting vector instructions can be placed at
>>>>> +  /// the beginning of the basic block without scheduling (if the
>>>>> +  /// operands do not need to be scheduled) or at the end of the block
>>>>> +  /// (if the users are outside of the block). This saves some compile
>>>>> +  /// time and memory used by the compiler.
>>>>> +  /// ScheduleData is assigned to each instruction in between the
>>>>> +  /// boundaries of the tree entry, even to those that are not part of
>>>>> +  /// the graph; it is required to correctly follow the dependencies
>>>>> +  /// between the instructions and to schedule them correctly.
>>>>> +  /// ScheduleData is not allocated for instructions that do not require
>>>>> +  /// scheduling, like phis, nodes with only extractelements/insertelements,
>>>>> +  /// or nodes whose uses/operands are outside of the block.
>>>>>      struct BlockScheduling {
>>>>>        BlockScheduling(BasicBlock *BB)
>>>>>            : BB(BB), ChunkSize(BB->size()), ChunkPos(ChunkSize) {}
>>>>> @@ -2696,7 +2764,7 @@ class BoUpSLP {
>>>>>          if (BB != I->getParent())
>>>>>            // Avoid lookup if can't possibly be in map.
>>>>>            return nullptr;
>>>>> -      ScheduleData *SD = ScheduleDataMap[I];
>>>>> +      ScheduleData *SD = ScheduleDataMap.lookup(I);
>>>>>          if (SD && isInSchedulingRegion(SD))
>>>>>            return SD;
>>>>>          return nullptr;
>>>>> @@ -2713,7 +2781,7 @@ class BoUpSLP {
>>>>>            return getScheduleData(V);
>>>>>          auto I = ExtraScheduleDataMap.find(V);
>>>>>          if (I != ExtraScheduleDataMap.end()) {
>>>>> -        ScheduleData *SD = I->second[Key];
>>>>> +        ScheduleData *SD = I->second.lookup(Key);
>>>>>            if (SD && isInSchedulingRegion(SD))
>>>>>              return SD;
>>>>>          }
>>>>> @@ -2735,7 +2803,7 @@ class BoUpSLP {
>>>>>               BundleMember = BundleMember->NextInBundle) {
>>>>>            if (BundleMember->Inst != BundleMember->OpValue)
>>>>>              continue;
>>>>> -
>>>>> +
>>>>>            // Handle the def-use chain dependencies.
>>>>>              // Decrement the unscheduled counter and insert to ready list if ready.
>>>>> @@ -2760,7 +2828,9 @@ class BoUpSLP {
>>>>>            // reordered during buildTree(). We therefore need to get its operands
>>>>>            // through the TreeEntry.
>>>>>            if (TreeEntry *TE = BundleMember->TE) {
>>>>> -          int Lane = BundleMember->Lane;
>>>>> +          // Need to search for the lane since the tree entry can be reordered.
>>>>> +          int Lane = std::distance(TE->Scalars.begin(),
>>>>> +                                   find(TE->Scalars, BundleMember->Inst));
>>>>>              assert(Lane >= 0 && "Lane not set");
>>>>>                // Since vectorization tree is being built recursively this assertion
>>>>> @@ -2769,7 +2839,7 @@ class BoUpSLP {
>>>>>              // where their second (immediate) operand is not added. Since
>>>>>              // immediates do not affect scheduler behavior this is considered
>>>>>              // okay.
>>>>> -          auto *In = TE->getMainOp();
>>>>> +          auto *In = BundleMember->Inst;
>>>>>              assert(In &&
>>>>>                     (isa<ExtractValueInst>(In) || isa<ExtractElementInst>(In) ||
>>>>>                      In->getNumOperands() == TE->getNumOperands()) &&
>>>>> @@ -2814,7 +2884,8 @@ class BoUpSLP {
>>>>>            for (auto *I = ScheduleStart; I != ScheduleEnd; I = I->getNextNode()) {
>>>>>            auto *SD = getScheduleData(I);
>>>>> -        assert(SD && "primary scheduledata must exist in window");
>>>>> +        if (!SD)
>>>>> +          continue;
>>>>>            assert(isInSchedulingRegion(SD) &&
>>>>>                   "primary schedule data not in window?");
>>>>>            assert(isInSchedulingRegion(SD->FirstInBundle) &&
>>>>> @@ -3856,6 +3927,22 @@ static LoadsState canVectorizeLoads(ArrayRef<Value *> VL, const Value *VL0,
>>>>>      return LoadsState::Gather;
>>>>>    }
>>>>>    +/// \return true if the specified list of values has only one instruction that
>>>>> +/// requires scheduling, false otherwise.
>>>>> +static bool needToScheduleSingleInstruction(ArrayRef<Value *> VL) {
>>>>> +  Value *NeedsScheduling = nullptr;
>>>>> +  for (Value *V : VL) {
>>>>> +    if (doesNotNeedToBeScheduled(V))
>>>>> +      continue;
>>>>> +    if (!NeedsScheduling) {
>>>>> +      NeedsScheduling = V;
>>>>> +      continue;
>>>>> +    }
>>>>> +    return false;
>>>>> +  }
>>>>> +  return NeedsScheduling;
>>>>> +}
>>>>> +
>>>>>    void BoUpSLP::buildTree_rec(ArrayRef<Value *> VL, unsigned Depth,
>>>>>                                const EdgeInfo &UserTreeIdx) {
>>>>>      assert((allConstant(VL) || allSameType(VL)) && "Invalid types!");
>>>>> @@ -6396,6 +6483,44 @@ void BoUpSLP::setInsertPointAfterBundle(const TreeEntry *E) {
>>>>>        return !E->isOpcodeOrAlt(I) || I->getParent() == BB;
>>>>>      }));
>>>>>    +  auto &&FindLastInst = [E, Front]() {
>>>>> +    Instruction *LastInst = Front;
>>>>> +    for (Value *V : E->Scalars) {
>>>>> +      auto *I = dyn_cast<Instruction>(V);
>>>>> +      if (!I)
>>>>> +        continue;
>>>>> +      if (LastInst->comesBefore(I))
>>>>> +        LastInst = I;
>>>>> +    }
>>>>> +    return LastInst;
>>>>> +  };
>>>>> +
>>>>> +  auto &&FindFirstInst = [E, Front]() {
>>>>> +    Instruction *FirstInst = Front;
>>>>> +    for (Value *V : E->Scalars) {
>>>>> +      auto *I = dyn_cast<Instruction>(V);
>>>>> +      if (!I)
>>>>> +        continue;
>>>>> +      if (I->comesBefore(FirstInst))
>>>>> +        FirstInst = I;
>>>>> +    }
>>>>> +    return FirstInst;
>>>>> +  };
>>>>> +
>>>>> +  // Set the insert point to the beginning of the basic block if the entry
>>>>> +  // should not be scheduled.
>>>>> +  if (E->State != TreeEntry::NeedToGather &&
>>>>> +      doesNotNeedToSchedule(E->Scalars)) {
>>>>> +    BasicBlock::iterator InsertPt;
>>>>> +    if (all_of(E->Scalars, isUsedOutsideBlock))
>>>>> +      InsertPt = FindLastInst()->getIterator();
>>>>> +    else
>>>>> +      InsertPt = FindFirstInst()->getIterator();
>>>>> +    Builder.SetInsertPoint(BB, InsertPt);
>>>>> +    Builder.SetCurrentDebugLocation(Front->getDebugLoc());
>>>>> +    return;
>>>>> +  }
>>>>> +
>>>>>      // The last instruction in the bundle in program order.
>>>>>      Instruction *LastInst = nullptr;
>>>>>    @@ -6404,8 +6529,10 @@ void BoUpSLP::setInsertPointAfterBundle(const TreeEntry *E) {
>>>>>      // VL.back() and iterate over schedule data until we reach the end of the
>>>>>      // bundle. The end of the bundle is marked by null ScheduleData.
>>>>>      if (BlocksSchedules.count(BB)) {
>>>>> -    auto *Bundle =
>>>>> - BlocksSchedules[BB]->getScheduleData(E->isOneOf(E->Scalars.back()));
>>>>> +    Value *V = E->isOneOf(E->Scalars.back());
>>>>> +    if (doesNotNeedToBeScheduled(V))
>>>>> +      V = *find_if_not(E->Scalars, doesNotNeedToBeScheduled);
>>>>> +    auto *Bundle = BlocksSchedules[BB]->getScheduleData(V);
>>>>>        if (Bundle && Bundle->isPartOfBundle())
>>>>>          for (; Bundle; Bundle = Bundle->NextInBundle)
>>>>>            if (Bundle->OpValue == Bundle->Inst)
>>>>> @@ -6430,15 +6557,8 @@ void BoUpSLP::setInsertPointAfterBundle(const TreeEntry *E) {
>>>>>      // not ideal. However, this should be exceedingly rare since it requires that
>>>>>      // we both exit early from buildTree_rec and that the bundle be out-of-order
>>>>>      // (causing us to iterate all the way to the end of the block).
>>>>> -  if (!LastInst) {
>>>>> -    SmallPtrSet<Value *, 16> Bundle(E->Scalars.begin(), E->Scalars.end());
>>>>> -    for (auto &I : make_range(BasicBlock::iterator(Front), BB->end())) {
>>>>> -      if (Bundle.erase(&I) && E->isOpcodeOrAlt(&I))
>>>>> -        LastInst = &I;
>>>>> -      if (Bundle.empty())
>>>>> -        break;
>>>>> -    }
>>>>> -  }
>>>>> +  if (!LastInst)
>>>>> +    LastInst = FindLastInst();
>>>>>      assert(LastInst && "Failed to find last instruction in bundle");
>>>>>        // Set the insertion point after the last instruction in the bundle. Set the
>>>>> @@ -7631,9 +7751,11 @@ void BoUpSLP::optimizeGatherSequence() {
>>>>>      BoUpSLP::ScheduleData *
>>>>>    BoUpSLP::BlockScheduling::buildBundle(ArrayRef<Value *> VL) {
>>>>> -  ScheduleData *Bundle = nullptr;
>>>>> +  ScheduleData *Bundle = nullptr;
>>>>>      ScheduleData *PrevInBundle = nullptr;
>>>>>      for (Value *V : VL) {
>>>>> +    if (doesNotNeedToBeScheduled(V))
>>>>> +      continue;
>>>>>        ScheduleData *BundleMember = getScheduleData(V);
>>>>>        assert(BundleMember &&
>>>>>               "no ScheduleData for bundle member "
>>>>> @@ -7661,7 +7783,8 @@ BoUpSLP::BlockScheduling::tryScheduleBundle(ArrayRef<Value *> VL, BoUpSLP *SLP,
>>>>>                                                const InstructionsState &S) {
>>>>>      // No need to schedule PHIs, insertelement, extractelement and extractvalue
>>>>>      // instructions.
>>>>> -  if (isa<PHINode>(S.OpValue) || isVectorLikeInstWithConstOps(S.OpValue))
>>>>> +  if (isa<PHINode>(S.OpValue) || isVectorLikeInstWithConstOps(S.OpValue) ||
>>>>> +      doesNotNeedToSchedule(VL))
>>>>>        return nullptr;
>>>>>        // Initialize the instruction bundle.
>>>>> @@ -7707,6 +7830,8 @@ BoUpSLP::BlockScheduling::tryScheduleBundle(ArrayRef<Value *> VL, BoUpSLP *SLP,
>>>>>      // Make sure that the scheduling region contains all
>>>>>      // instructions of the bundle.
>>>>>      for (Value *V : VL) {
>>>>> +    if (doesNotNeedToBeScheduled(V))
>>>>> +      continue;
>>>>>        if (!extendSchedulingRegion(V, S)) {
>>>>>          // If the scheduling region got new instructions at the lower end (or it
>>>>>          // is a new region for the first bundle). This makes it necessary to
>>>>> @@ -7721,6 +7846,8 @@ BoUpSLP::BlockScheduling::tryScheduleBundle(ArrayRef<Value *> VL, BoUpSLP *SLP,
>>>>>        bool ReSchedule = false;
>>>>>      for (Value *V : VL) {
>>>>> +    if (doesNotNeedToBeScheduled(V))
>>>>> +      continue;
>>>>>        ScheduleData *BundleMember = getScheduleData(V);
>>>>>        assert(BundleMember &&
>>>>>               "no ScheduleData for bundle member (maybe not in same basic block)");
>>>>> @@ -7750,14 +7877,18 @@ BoUpSLP::BlockScheduling::tryScheduleBundle(ArrayRef<Value *> VL, BoUpSLP *SLP,
>>>>>      void BoUpSLP::BlockScheduling::cancelScheduling(ArrayRef<Value *> VL,
>>>>>                                                    Value *OpValue) {
>>>>> -  if (isa<PHINode>(OpValue) || isVectorLikeInstWithConstOps(OpValue))
>>>>> +  if (isa<PHINode>(OpValue) || isVectorLikeInstWithConstOps(OpValue) ||
>>>>> +      doesNotNeedToSchedule(VL))
>>>>>        return;
>>>>>    +  if (doesNotNeedToBeScheduled(OpValue))
>>>>> +    OpValue = *find_if_not(VL, doesNotNeedToBeScheduled);
>>>>>      ScheduleData *Bundle = getScheduleData(OpValue);
>>>>>      LLVM_DEBUG(dbgs() << "SLP:  cancel scheduling of " << *Bundle << "\n");
>>>>>      assert(!Bundle->IsScheduled &&
>>>>>             "Can't cancel bundle which is already scheduled");
>>>>> -  assert(Bundle->isSchedulingEntity() && Bundle->isPartOfBundle() &&
>>>>> +  assert(Bundle->isSchedulingEntity() &&
>>>>> +         (Bundle->isPartOfBundle() || needToScheduleSingleInstruction(VL)) &&
>>>>>             "tried to unbundle something which is not a bundle");
>>>>>        // Remove the bundle from the ready list.
>>>>> @@ -7771,6 +7902,7 @@ void BoUpSLP::BlockScheduling::cancelScheduling(ArrayRef<Value *> VL,
>>>>>        BundleMember->FirstInBundle = BundleMember;
>>>>>        ScheduleData *Next = BundleMember->NextInBundle;
>>>>>        BundleMember->NextInBundle = nullptr;
>>>>> +    BundleMember->TE = nullptr;
>>>>>        if (BundleMember->unscheduledDepsInBundle() == 0) {
>>>>>          ReadyInsts.insert(BundleMember);
>>>>>        }
>>>>> @@ -7794,6 +7926,7 @@ bool BoUpSLP::BlockScheduling::extendSchedulingRegion(Value *V,
>>>>>      Instruction *I = dyn_cast<Instruction>(V);
>>>>>      assert(I && "bundle member must be an instruction");
>>>>>      assert(!isa<PHINode>(I) && !isVectorLikeInstWithConstOps(I) &&
>>>>> +         !doesNotNeedToBeScheduled(I) &&
>>>>>             "phi nodes/insertelements/extractelements/extractvalues don't need to "
>>>>>             "be scheduled");
>>>>>      auto &&CheckScheduleForI = [this, &S](Instruction *I) -> bool {
>>>>> @@ -7870,7 +8003,10 @@ void BoUpSLP::BlockScheduling::initScheduleData(Instruction *FromI,
>>>>>                                                    ScheduleData *NextLoadStore) {
>>>>>      ScheduleData *CurrentLoadStore = PrevLoadStore;
>>>>>      for (Instruction *I = FromI; I != ToI; I = I->getNextNode()) {
>>>>> -    ScheduleData *SD = ScheduleDataMap[I];
>>>>> +    // No need to allocate data for non-schedulable instructions.
>>>>> +    if (doesNotNeedToBeScheduled(I))
>>>>> +      continue;
>>>>> +    ScheduleData *SD = ScheduleDataMap.lookup(I);
>>>>>        if (!SD) {
>>>>>          SD = allocateScheduleDataChunks();
>>>>>          ScheduleDataMap[I] = SD;
>>>>> @@ -8054,8 +8190,10 @@ void BoUpSLP::scheduleBlock(BlockScheduling *BS) {
>>>>>      for (auto *I = BS->ScheduleStart; I != BS->ScheduleEnd;
>>>>>           I = I->getNextNode()) {
>>>>>        BS->doForAllOpcodes(I, [this, &Idx, &NumToSchedule, BS](ScheduleData *SD) {
>>>>> +      TreeEntry *SDTE = getTreeEntry(SD->Inst);
>>>>>          assert((isVectorLikeInstWithConstOps(SD->Inst) ||
>>>>> -              SD->isPartOfBundle() == (getTreeEntry(SD->Inst) != nullptr)) &&
>>>>> +              SD->isPartOfBundle() ==
>>>>> +                  (SDTE && !doesNotNeedToSchedule(SDTE->Scalars))) &&
>>>>>                 "scheduler and vectorizer bundle mismatch");
>>>>>          SD->FirstInBundle->SchedulingPriority = Idx++;
>>>>>          if (SD->isSchedulingEntity()) {
>>>>>
>>>>> diff  --git a/llvm/test/Transforms/SLPVectorizer/AArch64/gather-reduce.ll b/llvm/test/Transforms/SLPVectorizer/AArch64/gather-reduce.ll
>>>>> index 536f72a73684e..ec7b03af83f8b 100644
>>>>> --- a/llvm/test/Transforms/SLPVectorizer/AArch64/gather-reduce.ll
>>>>> +++ b/llvm/test/Transforms/SLPVectorizer/AArch64/gather-reduce.ll
>>>>> @@ -36,6 +36,7 @@ define i32 @gather_reduce_8x16_i32(i16* nocapture readonly %a, i16* nocapture re
>>>>>    ; GENERIC-NEXT:    [[I_0103:%.*]] = phi i32 [ [[INC:%.*]], [[FOR_BODY]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
>>>>>    ; GENERIC-NEXT:    [[SUM_0102:%.*]] = phi i32 [ [[ADD66]], [[FOR_BODY]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
>>>>>    ; GENERIC-NEXT:    [[A_ADDR_0101:%.*]] = phi i16* [ [[INCDEC_PTR58:%.*]], [[FOR_BODY]] ], [ [[A:%.*]], [[FOR_BODY_PREHEADER]] ]
>>>>> +; GENERIC-NEXT:    [[INCDEC_PTR58]] = getelementptr inbounds i16, i16* [[A_ADDR_0101]], i64 8
>>>>>    ; GENERIC-NEXT:    [[TMP0:%.*]] = bitcast i16* [[A_ADDR_0101]] to <8 x i16>*
>>>>>    ; GENERIC-NEXT:    [[TMP1:%.*]] = load <8 x i16>, <8 x i16>* [[TMP0]], align 2
>>>>>    ; GENERIC-NEXT:    [[TMP2:%.*]] = zext <8 x i16> [[TMP1]] to <8 x i32>
>>>>> @@ -85,7 +86,6 @@ define i32 @gather_reduce_8x16_i32(i16* nocapture readonly %a, i16* nocapture re
>>>>>    ; GENERIC-NEXT:    [[TMP27:%.*]] = load i16, i16* [[ARRAYIDX55]], align 2
>>>>>    ; GENERIC-NEXT:    [[CONV56:%.*]] = zext i16 [[TMP27]] to i32
>>>>>    ; GENERIC-NEXT:    [[ADD57:%.*]] = add nsw i32 [[ADD48]], [[CONV56]]
>>>>> -; GENERIC-NEXT:    [[INCDEC_PTR58]] = getelementptr inbounds i16, i16* [[A_ADDR_0101]], i64 8
>>>>>    ; GENERIC-NEXT:    [[TMP28:%.*]] = extractelement <8 x i32> [[TMP6]], i64 7
>>>>>    ; GENERIC-NEXT:    [[TMP29:%.*]] = sext i32 [[TMP28]] to i64
>>>>>    ; GENERIC-NEXT:    [[ARRAYIDX64:%.*]] = getelementptr inbounds i16, i16* [[G]], i64 [[TMP29]]
>>>>> @@ -111,6 +111,7 @@ define i32 @gather_reduce_8x16_i32(i16* nocapture readonly %a, i16* nocapture re
>>>>>    ; KRYO-NEXT:    [[I_0103:%.*]] = phi i32 [ [[INC:%.*]], [[FOR_BODY]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
>>>>>    ; KRYO-NEXT:    [[SUM_0102:%.*]] = phi i32 [ [[ADD66]], [[FOR_BODY]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
>>>>>    ; KRYO-NEXT:    [[A_ADDR_0101:%.*]] = phi i16* [ [[INCDEC_PTR58:%.*]], [[FOR_BODY]] ], [ [[A:%.*]], [[FOR_BODY_PREHEADER]] ]
>>>>> +; KRYO-NEXT:    [[INCDEC_PTR58]] = getelementptr inbounds i16, i16* [[A_ADDR_0101]], i64 8
>>>>>    ; KRYO-NEXT:    [[TMP0:%.*]] = bitcast i16* [[A_ADDR_0101]] to <8 x i16>*
>>>>>    ; KRYO-NEXT:    [[TMP1:%.*]] = load <8 x i16>, <8 x i16>* [[TMP0]], align 2
>>>>>    ; KRYO-NEXT:    [[TMP2:%.*]] = zext <8 x i16> [[TMP1]] to <8 x i32>
>>>>> @@ -160,7 +161,6 @@ define i32 @gather_reduce_8x16_i32(i16* nocapture readonly %a, i16* nocapture re
>>>>>    ; KRYO-NEXT:    [[TMP27:%.*]] = load i16, i16* [[ARRAYIDX55]], align 2
>>>>>    ; KRYO-NEXT:    [[CONV56:%.*]] = zext i16 [[TMP27]] to i32
>>>>>    ; KRYO-NEXT:    [[ADD57:%.*]] = add nsw i32 [[ADD48]], [[CONV56]]
>>>>> -; KRYO-NEXT:    [[INCDEC_PTR58]] = getelementptr inbounds i16, i16* [[A_ADDR_0101]], i64 8
>>>>>    ; KRYO-NEXT:    [[TMP28:%.*]] = extractelement <8 x i32> [[TMP6]], i64 7
>>>>>    ; KRYO-NEXT:    [[TMP29:%.*]] = sext i32 [[TMP28]] to i64
>>>>>    ; KRYO-NEXT:    [[ARRAYIDX64:%.*]] = getelementptr inbounds i16, i16* [[G]], i64 [[TMP29]]
>>>>> @@ -297,6 +297,7 @@ define i32 @gather_reduce_8x16_i64(i16* nocapture readonly %a, i16* nocapture re
>>>>>    ; GENERIC-NEXT:    [[I_0103:%.*]] = phi i32 [ [[INC:%.*]], [[FOR_BODY]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
>>>>>    ; GENERIC-NEXT:    [[SUM_0102:%.*]] = phi i32 [ [[ADD66]], [[FOR_BODY]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
>>>>>    ; GENERIC-NEXT:    [[A_ADDR_0101:%.*]] = phi i16* [ [[INCDEC_PTR58:%.*]], [[FOR_BODY]] ], [ [[A:%.*]], [[FOR_BODY_PREHEADER]] ]
>>>>> +; GENERIC-NEXT:    [[INCDEC_PTR58]] = getelementptr inbounds i16, i16* [[A_ADDR_0101]], i64 8
>>>>>    ; GENERIC-NEXT:    [[TMP0:%.*]] = bitcast i16* [[A_ADDR_0101]] to <8 x i16>*
>>>>>    ; GENERIC-NEXT:    [[TMP1:%.*]] = load <8 x i16>, <8 x i16>* [[TMP0]], align 2
>>>>>    ; GENERIC-NEXT:    [[TMP2:%.*]] = zext <8 x i16> [[TMP1]] to <8 x i32>
>>>>> @@ -346,7 +347,6 @@ define i32 @gather_reduce_8x16_i64(i16* nocapture readonly %a, i16* nocapture re
>>>>>    ; GENERIC-NEXT:    [[TMP27:%.*]] = load i16, i16* [[ARRAYIDX55]], align 2
>>>>>    ; GENERIC-NEXT:    [[CONV56:%.*]] = zext i16 [[TMP27]] to i32
>>>>>    ; GENERIC-NEXT:    [[ADD57:%.*]] = add nsw i32 [[ADD48]], [[CONV56]]
>>>>> -; GENERIC-NEXT:    [[INCDEC_PTR58]] = getelementptr inbounds i16, i16* [[A_ADDR_0101]], i64 8
>>>>>    ; GENERIC-NEXT:    [[TMP28:%.*]] = extractelement <8 x i32> [[TMP6]], i64 7
>>>>>    ; GENERIC-NEXT:    [[TMP29:%.*]] = sext i32 [[TMP28]] to i64
>>>>>    ; GENERIC-NEXT:    [[ARRAYIDX64:%.*]] = getelementptr inbounds i16, i16* [[G]], i64 [[TMP29]]
>>>>> @@ -372,6 +372,7 @@ define i32 @gather_reduce_8x16_i64(i16* nocapture readonly %a, i16* nocapture re
>>>>>    ; KRYO-NEXT:    [[I_0103:%.*]] = phi i32 [ [[INC:%.*]], [[FOR_BODY]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
>>>>>    ; KRYO-NEXT:    [[SUM_0102:%.*]] = phi i32 [ [[ADD66]], [[FOR_BODY]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
>>>>>    ; KRYO-NEXT:    [[A_ADDR_0101:%.*]] = phi i16* [ [[INCDEC_PTR58:%.*]], [[FOR_BODY]] ], [ [[A:%.*]], [[FOR_BODY_PREHEADER]] ]
>>>>> +; KRYO-NEXT:    [[INCDEC_PTR58]] = getelementptr inbounds i16, i16* [[A_ADDR_0101]], i64 8
>>>>>    ; KRYO-NEXT:    [[TMP0:%.*]] = bitcast i16* [[A_ADDR_0101]] to <8 x i16>*
>>>>>    ; KRYO-NEXT:    [[TMP1:%.*]] = load <8 x i16>, <8 x i16>* [[TMP0]], align 2
>>>>>    ; KRYO-NEXT:    [[TMP2:%.*]] = zext <8 x i16> [[TMP1]] to <8 x i32>
>>>>> @@ -421,7 +422,6 @@ define i32 @gather_reduce_8x16_i64(i16* nocapture readonly %a, i16* nocapture re
>>>>>    ; KRYO-NEXT:    [[TMP27:%.*]] = load i16, i16* [[ARRAYIDX55]], align 2
>>>>>    ; KRYO-NEXT:    [[CONV56:%.*]] = zext i16 [[TMP27]] to i32
>>>>>    ; KRYO-NEXT:    [[ADD57:%.*]] = add nsw i32 [[ADD48]], [[CONV56]]
>>>>> -; KRYO-NEXT:    [[INCDEC_PTR58]] = getelementptr inbounds i16, i16* [[A_ADDR_0101]], i64 8
>>>>>    ; KRYO-NEXT:    [[TMP28:%.*]] = extractelement <8 x i32> [[TMP6]], i64 7
>>>>>    ; KRYO-NEXT:    [[TMP29:%.*]] = sext i32 [[TMP28]] to i64
>>>>>    ; KRYO-NEXT:    [[ARRAYIDX64:%.*]] = getelementptr inbounds i16, i16* [[G]], i64 [[TMP29]]
>>>>>
>>>>> diff  --git a/llvm/test/Transforms/SLPVectorizer/AArch64/gather-root.ll b/llvm/test/Transforms/SLPVectorizer/AArch64/gather-root.ll
>>>>> index e9c502b6982cd..01d743fcbfe97 100644
>>>>> --- a/llvm/test/Transforms/SLPVectorizer/AArch64/gather-root.ll
>>>>> +++ b/llvm/test/Transforms/SLPVectorizer/AArch64/gather-root.ll
>>>>> @@ -35,41 +35,14 @@ define void @PR28330(i32 %n) {
>>>>>    ;
>>>>>    ; MAX-COST-LABEL: @PR28330(
>>>>>    ; MAX-COST-NEXT:  entry:
>>>>> -; MAX-COST-NEXT:    [[P0:%.*]] = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 1), align 1
>>>>> -; MAX-COST-NEXT:    [[P1:%.*]] = icmp eq i8 [[P0]], 0
>>>>> -; MAX-COST-NEXT:    [[P2:%.*]] = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 2), align 2
>>>>> -; MAX-COST-NEXT:    [[P3:%.*]] = icmp eq i8 [[P2]], 0
>>>>> -; MAX-COST-NEXT:    [[P4:%.*]] = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 3), align 1
>>>>> -; MAX-COST-NEXT:    [[P5:%.*]] = icmp eq i8 [[P4]], 0
>>>>> -; MAX-COST-NEXT:    [[P6:%.*]] = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 4), align 4
>>>>> -; MAX-COST-NEXT:    [[P7:%.*]] = icmp eq i8 [[P6]], 0
>>>>> -; MAX-COST-NEXT:    [[P8:%.*]] = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 5), align 1
>>>>> -; MAX-COST-NEXT:    [[P9:%.*]] = icmp eq i8 [[P8]], 0
>>>>> -; MAX-COST-NEXT:    [[P10:%.*]] = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 6), align 2
>>>>> -; MAX-COST-NEXT:    [[P11:%.*]] = icmp eq i8 [[P10]], 0
>>>>> -; MAX-COST-NEXT:    [[P12:%.*]] = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 7), align 1
>>>>> -; MAX-COST-NEXT:    [[P13:%.*]] = icmp eq i8 [[P12]], 0
>>>>> -; MAX-COST-NEXT:    [[P14:%.*]] = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 8), align 8
>>>>> -; MAX-COST-NEXT:    [[P15:%.*]] = icmp eq i8 [[P14]], 0
>>>>> +; MAX-COST-NEXT:    [[TMP0:%.*]] = load <8 x i8>, <8 x i8>* bitcast (i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 1) to <8 x i8>*), align 1
>>>>> +; MAX-COST-NEXT:    [[TMP1:%.*]] = icmp eq <8 x i8> [[TMP0]], zeroinitializer
>>>>>    ; MAX-COST-NEXT:    br label [[FOR_BODY:%.*]]
>>>>>    ; MAX-COST:       for.body:
>>>>> -; MAX-COST-NEXT:    [[P17:%.*]] = phi i32 [ [[P34:%.*]], [[FOR_BODY]] ], [ 0, [[ENTRY:%.*]] ]
>>>>> -; MAX-COST-NEXT:    [[P19:%.*]] = select i1 [[P1]], i32 -720, i32 -80
>>>>> -; MAX-COST-NEXT:    [[P20:%.*]] = add i32 [[P17]], [[P19]]
>>>>> -; MAX-COST-NEXT:    [[P21:%.*]] = select i1 [[P3]], i32 -720, i32 -80
>>>>> -; MAX-COST-NEXT:    [[P22:%.*]] = add i32 [[P20]], [[P21]]
>>>>> -; MAX-COST-NEXT:    [[P23:%.*]] = select i1 [[P5]], i32 -720, i32 -80
>>>>> -; MAX-COST-NEXT:    [[P24:%.*]] = add i32 [[P22]], [[P23]]
>>>>> -; MAX-COST-NEXT:    [[P25:%.*]] = select i1 [[P7]], i32 -720, i32 -80
>>>>> -; MAX-COST-NEXT:    [[P26:%.*]] = add i32 [[P24]], [[P25]]
>>>>> -; MAX-COST-NEXT:    [[P27:%.*]] = select i1 [[P9]], i32 -720, i32 -80
>>>>> -; MAX-COST-NEXT:    [[P28:%.*]] = add i32 [[P26]], [[P27]]
>>>>> -; MAX-COST-NEXT:    [[P29:%.*]] = select i1 [[P11]], i32 -720, i32 -80
>>>>> -; MAX-COST-NEXT:    [[P30:%.*]] = add i32 [[P28]], [[P29]]
>>>>> -; MAX-COST-NEXT:    [[P31:%.*]] = select i1 [[P13]], i32 -720, i32 -80
>>>>> -; MAX-COST-NEXT:    [[P32:%.*]] = add i32 [[P30]], [[P31]]
>>>>> -; MAX-COST-NEXT:    [[P33:%.*]] = select i1 [[P15]], i32 -720, i32 -80
>>>>> -; MAX-COST-NEXT:    [[P34]] = add i32 [[P32]], [[P33]]
>>>>> +; MAX-COST-NEXT:    [[P17:%.*]] = phi i32 [ [[OP_EXTRA:%.*]], [[FOR_BODY]] ], [ 0, [[ENTRY:%.*]] ]
>>>>> +; MAX-COST-NEXT:    [[TMP2:%.*]] = select <8 x i1> [[TMP1]], <8 x i32> <i32 -720, i32 -720, i32 -720, i32 -720, i32 -720, i32 -720, i32 -720, i32 -720>, <8 x i32> <i32 -80, i32 -80, i32 -80, i32 -80, i32 -80, i32 -80, i32 -80, i32 -80>
>>>>> +; MAX-COST-NEXT:    [[TMP3:%.*]] = call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> [[TMP2]])
>>>>> +; MAX-COST-NEXT:    [[OP_EXTRA]] = add i32 [[TMP3]], [[P17]]
>>>>>    ; MAX-COST-NEXT:    br label [[FOR_BODY]]
>>>>>    ;
>>>>>    entry:
>>>>> @@ -139,30 +112,14 @@ define void @PR32038(i32 %n) {
>>>>>    ;
>>>>>    ; MAX-COST-LABEL: @PR32038(
>>>>>    ; MAX-COST-NEXT:  entry:
>>>>> -; MAX-COST-NEXT:    [[TMP0:%.*]] = load <4 x i8>, <4 x i8>* bitcast (i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 1) to <4 x i8>*), align 1
>>>>> -; MAX-COST-NEXT:    [[TMP1:%.*]] = icmp eq <4 x i8> [[TMP0]], zeroinitializer
>>>>> -; MAX-COST-NEXT:    [[P8:%.*]] = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 5), align 1
>>>>> -; MAX-COST-NEXT:    [[P9:%.*]] = icmp eq i8 [[P8]], 0
>>>>> -; MAX-COST-NEXT:    [[P10:%.*]] = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 6), align 2
>>>>> -; MAX-COST-NEXT:    [[P11:%.*]] = icmp eq i8 [[P10]], 0
>>>>> -; MAX-COST-NEXT:    [[P12:%.*]] = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 7), align 1
>>>>> -; MAX-COST-NEXT:    [[P13:%.*]] = icmp eq i8 [[P12]], 0
>>>>> -; MAX-COST-NEXT:    [[P14:%.*]] = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 8), align 8
>>>>> -; MAX-COST-NEXT:    [[P15:%.*]] = icmp eq i8 [[P14]], 0
>>>>> +; MAX-COST-NEXT:    [[TMP0:%.*]] = load <8 x i8>, <8 x i8>* bitcast (i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 1) to <8 x i8>*), align 1
>>>>> +; MAX-COST-NEXT:    [[TMP1:%.*]] = icmp eq <8 x i8> [[TMP0]], zeroinitializer
>>>>>    ; MAX-COST-NEXT:    br label [[FOR_BODY:%.*]]
>>>>>    ; MAX-COST:       for.body:
>>>>> -; MAX-COST-NEXT:    [[P17:%.*]] = phi i32 [ [[P34:%.*]], [[FOR_BODY]] ], [ 0, [[ENTRY:%.*]] ]
>>>>> -; MAX-COST-NEXT:    [[TMP2:%.*]] = select <4 x i1> [[TMP1]], <4 x i32> <i32 -720, i32 -720, i32 -720, i32 -720>, <4 x i32> <i32 -80, i32 -80, i32 -80, i32 -80>
>>>>> -; MAX-COST-NEXT:    [[P27:%.*]] = select i1 [[P9]], i32 -720, i32 -80
>>>>> -; MAX-COST-NEXT:    [[P29:%.*]] = select i1 [[P11]], i32 -720, i32 -80
>>>>> -; MAX-COST-NEXT:    [[TMP3:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[TMP2]])
>>>>> -; MAX-COST-NEXT:    [[TMP4:%.*]] = add i32 [[TMP3]], [[P27]]
>>>>> -; MAX-COST-NEXT:    [[TMP5:%.*]] = add i32 [[TMP4]], [[P29]]
>>>>> -; MAX-COST-NEXT:    [[OP_EXTRA:%.*]] = add i32 [[TMP5]], -5
>>>>> -; MAX-COST-NEXT:    [[P31:%.*]] = select i1 [[P13]], i32 -720, i32 -80
>>>>> -; MAX-COST-NEXT:    [[P32:%.*]] = add i32 [[OP_EXTRA]], [[P31]]
>>>>> -; MAX-COST-NEXT:    [[P33:%.*]] = select i1 [[P15]], i32 -720, i32 -80
>>>>> -; MAX-COST-NEXT:    [[P34]] = add i32 [[P32]], [[P33]]
>>>>> +; MAX-COST-NEXT:    [[P17:%.*]] = phi i32 [ [[OP_EXTRA:%.*]], [[FOR_BODY]] ], [ 0, [[ENTRY:%.*]] ]
>>>>> +; MAX-COST-NEXT:    [[TMP2:%.*]] = select <8 x i1> [[TMP1]], <8 x i32> <i32 -720, i32 -720, i32 -720, i32 -720, i32 -720, i32 -720, i32 -720, i32 -720>, <8 x i32> <i32 -80, i32 -80, i32 -80, i32 -80, i32 -80, i32 -80, i32 -80, i32 -80>
>>>>> +; MAX-COST-NEXT:    [[TMP3:%.*]] = call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> [[TMP2]])
>>>>> +; MAX-COST-NEXT:    [[OP_EXTRA]] = add i32 [[TMP3]], -5
>>>>>    ; MAX-COST-NEXT:    br label [[FOR_BODY]]
>>>>>    ;
>>>>>    entry:
>>>>>
>>>>> diff  --git a/llvm/test/Transforms/SLPVectorizer/AArch64/spillcost-di.ll b/llvm/test/Transforms/SLPVectorizer/AArch64/spillcost-di.ll
>>>>> index 39f2f885bc26b..c1451090d23c0 100644
>>>>> --- a/llvm/test/Transforms/SLPVectorizer/AArch64/spillcost-di.ll
>>>>> +++ b/llvm/test/Transforms/SLPVectorizer/AArch64/spillcost-di.ll
>>>>> @@ -14,14 +14,14 @@ define void @patatino(i64 %n, i64 %i, %struct.S* %p) !dbg !7 {
>>>>>    ; CHECK-NEXT:    call void @llvm.dbg.value(metadata %struct.S* [[P:%.*]], metadata [[META20:![0-9]+]], metadata !DIExpression()), !dbg [[DBG25:![0-9]+]]
>>>>>    ; CHECK-NEXT:    [[X1:%.*]] = getelementptr inbounds [[STRUCT_S:%.*]], %struct.S* [[P]], i64 [[N]], i32 0, !dbg [[DBG26:![0-9]+]]
>>>>>    ; CHECK-NEXT:    call void @llvm.dbg.value(metadata i64 undef, metadata [[META21:![0-9]+]], metadata !DIExpression()), !dbg [[DBG27:![0-9]+]]
>>>>> -; CHECK-NEXT:    [[Y3:%.*]] = getelementptr inbounds [[STRUCT_S]], %struct.S* [[P]], i64 [[N]], i32 1, !dbg [[DBG28:![0-9]+]]
>>>>> +; CHECK-NEXT:    call void @llvm.dbg.value(metadata i64 undef, metadata [[META22:![0-9]+]], metadata !DIExpression()), !dbg [[DBG28:![0-9]+]]
>>>>> +; CHECK-NEXT:    [[Y3:%.*]] = getelementptr inbounds [[STRUCT_S]], %struct.S* [[P]], i64 [[N]], i32 1, !dbg [[DBG29:![0-9]+]]
>>>>>    ; CHECK-NEXT:    [[TMP0:%.*]] = bitcast i64* [[X1]] to <2 x i64>*, !dbg [[DBG26]]
>>>>> -; CHECK-NEXT:    [[TMP1:%.*]] = load <2 x i64>, <2 x i64>* [[TMP0]], align 8, !dbg [[DBG26]], !tbaa [[TBAA29:![0-9]+]]
>>>>> -; CHECK-NEXT:    call void @llvm.dbg.value(metadata i64 undef, metadata [[META22:![0-9]+]], metadata !DIExpression()), !dbg [[DBG33:![0-9]+]]
>>>>> +; CHECK-NEXT:    [[TMP1:%.*]] = load <2 x i64>, <2 x i64>* [[TMP0]], align 8, !dbg [[DBG26]], !tbaa [[TBAA30:![0-9]+]]
>>>>>    ; CHECK-NEXT:    [[X5:%.*]] = getelementptr inbounds [[STRUCT_S]], %struct.S* [[P]], i64 [[I]], i32 0, !dbg [[DBG34:![0-9]+]]
>>>>>    ; CHECK-NEXT:    [[Y7:%.*]] = getelementptr inbounds [[STRUCT_S]], %struct.S* [[P]], i64 [[I]], i32 1, !dbg [[DBG35:![0-9]+]]
>>>>>    ; CHECK-NEXT:    [[TMP2:%.*]] = bitcast i64* [[X5]] to <2 x i64>*, !dbg [[DBG36:![0-9]+]]
>>>>> -; CHECK-NEXT:    store <2 x i64> [[TMP1]], <2 x i64>* [[TMP2]], align 8, !dbg [[DBG36]], !tbaa [[TBAA29]]
>>>>> +; CHECK-NEXT:    store <2 x i64> [[TMP1]], <2 x i64>* [[TMP2]], align 8, !dbg [[DBG36]], !tbaa [[TBAA30]]
>>>>>    ; CHECK-NEXT:    ret void, !dbg [[DBG37:![0-9]+]]
>>>>>    ;
>>>>>    entry:
>>>>>
>>>>> diff  --git a/llvm/test/Transforms/SLPVectorizer/X86/PR35628_2.ll b/llvm/test/Transforms/SLPVectorizer/X86/PR35628_2.ll
>>>>> index 7f51dcae484ca..d15494e092c25 100644
>>>>> --- a/llvm/test/Transforms/SLPVectorizer/X86/PR35628_2.ll
>>>>> +++ b/llvm/test/Transforms/SLPVectorizer/X86/PR35628_2.ll
>>>>> @@ -9,11 +9,11 @@ define void @test() #0 {
>>>>>    ; CHECK:       loop:
>>>>>    ; CHECK-NEXT:    [[DUMMY_PHI:%.*]] = phi i64 [ 1, [[ENTRY:%.*]] ], [ [[OP_EXTRA1:%.*]], [[LOOP]] ]
>>>>>    ; CHECK-NEXT:    [[TMP0:%.*]] = phi i64 [ 2, [[ENTRY]] ], [ [[TMP3:%.*]], [[LOOP]] ]
>>>>> -; CHECK-NEXT:    [[DUMMY_ADD:%.*]] = add i16 0, 0
>>>>>    ; CHECK-NEXT:    [[TMP1:%.*]] = insertelement <4 x i64> poison, i64 [[TMP0]], i32 0
>>>>>    ; CHECK-NEXT:    [[SHUFFLE:%.*]] = shufflevector <4 x i64> [[TMP1]], <4 x i64> poison, <4 x i32> zeroinitializer
>>>>>    ; CHECK-NEXT:    [[TMP2:%.*]] = add <4 x i64> [[SHUFFLE]], <i64 3, i64 2, i64 1, i64 0>
>>>>>    ; CHECK-NEXT:    [[TMP3]] = extractelement <4 x i64> [[TMP2]], i32 3
>>>>> +; CHECK-NEXT:    [[DUMMY_ADD:%.*]] = add i16 0, 0
>>>>>    ; CHECK-NEXT:    [[TMP4:%.*]] = extractelement <4 x i64> [[TMP2]], i32 0
>>>>>    ; CHECK-NEXT:    [[DUMMY_SHL:%.*]] = shl i64 [[TMP4]], 32
>>>>>    ; CHECK-NEXT:    [[TMP5:%.*]] = add <4 x i64> <i64 1, i64 1, i64 1, i64 1>, [[TMP2]]
>>>>>
>>>>> diff  --git a/llvm/test/Transforms/SLPVectorizer/X86/PR40310.ll b/llvm/test/Transforms/SLPVectorizer/X86/PR40310.ll
>>>>> index 7ab610f994264..f878bda14ad84 100644
>>>>> --- a/llvm/test/Transforms/SLPVectorizer/X86/PR40310.ll
>>>>> +++ b/llvm/test/Transforms/SLPVectorizer/X86/PR40310.ll
>>>>> @@ -10,10 +10,10 @@ define void @mainTest(i32 %param, i32 * %vals, i32 %len) {
>>>>>    ; CHECK-NEXT:    [[TMP1:%.*]] = phi <2 x i32> [ [[TMP7:%.*]], [[BCI_15]] ], [ [[TMP0]], [[BCI_15_PREHEADER:%.*]] ]
>>>>>    ; CHECK-NEXT:    [[SHUFFLE:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> poison, <16 x i32> <i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 1>
>>>>>    ; CHECK-NEXT:    [[TMP2:%.*]] = extractelement <16 x i32> [[SHUFFLE]], i32 0
>>>>> -; CHECK-NEXT:    [[TMP3:%.*]] = extractelement <16 x i32> [[SHUFFLE]], i32 15
>>>>> -; CHECK-NEXT:    store atomic i32 [[TMP3]], i32* [[VALS:%.*]] unordered, align 4
>>>>> -; CHECK-NEXT:    [[TMP4:%.*]] = add <16 x i32> [[SHUFFLE]], <i32 15, i32 14, i32 13, i32 12, i32 11, i32 10, i32 9, i32 8, i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 -1>
>>>>> -; CHECK-NEXT:    [[TMP5:%.*]] = call i32 @llvm.vector.reduce.and.v16i32(<16 x i32> [[TMP4]])
>>>>> +; CHECK-NEXT:    [[TMP3:%.*]] = add <16 x i32> [[SHUFFLE]], <i32 15, i32 14, i32 13, i32 12, i32 11, i32 10, i32 9, i32 8, i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 -1>
>>>>> +; CHECK-NEXT:    [[TMP4:%.*]] = extractelement <16 x i32> [[SHUFFLE]], i32 15
>>>>> +; CHECK-NEXT:    store atomic i32 [[TMP4]], i32* [[VALS:%.*]] unordered, align 4
>>>>> +; CHECK-NEXT:    [[TMP5:%.*]] = call i32 @llvm.vector.reduce.and.v16i32(<16 x i32> [[TMP3]])
>>>>>    ; CHECK-NEXT:    [[OP_EXTRA:%.*]] = and i32 [[TMP5]], [[TMP2]]
>>>>>    ; CHECK-NEXT:    [[V44:%.*]] = add i32 [[TMP2]], 16
>>>>>    ; CHECK-NEXT:    [[TMP6:%.*]] = insertelement <2 x i32> poison, i32 [[V44]], i32 0
>>>>>
>>>>> diff  --git a/llvm/test/Transforms/SLPVectorizer/X86/crash_exceed_scheduling.ll b/llvm/test/Transforms/SLPVectorizer/X86/crash_exceed_scheduling.ll
>>>>> index de371d8895c7d..94739340c8b5a 100644
>>>>> --- a/llvm/test/Transforms/SLPVectorizer/X86/crash_exceed_scheduling.ll
>>>>> +++ b/llvm/test/Transforms/SLPVectorizer/X86/crash_exceed_scheduling.ll
>>>>> @@ -29,10 +29,10 @@ define void @exceed(double %0, double %1) {
>>>>>    ; CHECK-NEXT:    [[IXX22:%.*]] = fsub double undef, undef
>>>>>    ; CHECK-NEXT:    [[TMP8:%.*]] = extractelement <2 x double> [[TMP6]], i32 0
>>>>>    ; CHECK-NEXT:    [[IX2:%.*]] = fmul double [[TMP8]], [[TMP8]]
>>>>> -; CHECK-NEXT:    [[TMP9:%.*]] = insertelement <2 x double> [[TMP2]], double [[TMP1]], i32 1
>>>>> -; CHECK-NEXT:    [[TMP10:%.*]] = fadd fast <2 x double> [[TMP6]], [[TMP9]]
>>>>> -; CHECK-NEXT:    [[TMP11:%.*]] = fadd fast <2 x double> [[TMP3]], [[TMP5]]
>>>>> -; CHECK-NEXT:    [[TMP12:%.*]] = fmul fast <2 x double> [[TMP10]], [[TMP11]]
>>>>> +; CHECK-NEXT:    [[TMP9:%.*]] = fadd fast <2 x double> [[TMP3]], [[TMP5]]
>>>>> +; CHECK-NEXT:    [[TMP10:%.*]] = insertelement <2 x double> [[TMP2]], double [[TMP1]], i32 1
>>>>> +; CHECK-NEXT:    [[TMP11:%.*]] = fadd fast <2 x double> [[TMP6]], [[TMP10]]
>>>>> +; CHECK-NEXT:    [[TMP12:%.*]] = fmul fast <2 x double> [[TMP11]], [[TMP9]]
>>>>>    ; CHECK-NEXT:    [[IXX101:%.*]] = fsub double undef, undef
>>>>>    ; CHECK-NEXT:    [[TMP13:%.*]] = insertelement <2 x double> poison, double [[TMP1]], i32 1
>>>>>    ; CHECK-NEXT:    [[TMP14:%.*]] = insertelement <2 x double> [[TMP13]], double [[TMP7]], i32 0
>>>>>
>>>>> diff  --git a/llvm/test/Transforms/SLPVectorizer/X86/opaque-ptr.ll b/llvm/test/Transforms/SLPVectorizer/X86/opaque-ptr.ll
>>>>> index 80cb197982d48..8dc4a8936b722 100644
>>>>> --- a/llvm/test/Transforms/SLPVectorizer/X86/opaque-ptr.ll
>>>>> +++ b/llvm/test/Transforms/SLPVectorizer/X86/opaque-ptr.ll
>>>>> @@ -58,10 +58,10 @@ define void @test(ptr %r, ptr %p, ptr %q) #0 {
>>>>>      define void @test2(i64* %a, i64* %b) {
>>>>>    ; CHECK-LABEL: @test2(
>>>>> -; CHECK-NEXT:    [[A2:%.*]] = getelementptr inbounds i64, ptr [[A:%.*]], i64 2
>>>>> -; CHECK-NEXT:    [[TMP1:%.*]] = insertelement <2 x ptr> poison, ptr [[A]], i32 0
>>>>> +; CHECK-NEXT:    [[TMP1:%.*]] = insertelement <2 x ptr> poison, ptr [[A:%.*]], i32 0
>>>>>    ; CHECK-NEXT:    [[TMP2:%.*]] = insertelement <2 x ptr> [[TMP1]], ptr [[B:%.*]], i32 1
>>>>>    ; CHECK-NEXT:    [[TMP3:%.*]] = getelementptr i64, <2 x ptr> [[TMP2]], <2 x i64> <i64 1, i64 3>
>>>>> +; CHECK-NEXT:    [[A2:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 2
>>>>>    ; CHECK-NEXT:    [[TMP4:%.*]] = ptrtoint <2 x ptr> [[TMP3]] to <2 x i64>
>>>>>    ; CHECK-NEXT:    [[TMP5:%.*]] = extractelement <2 x ptr> [[TMP3]], i32 0
>>>>>    ; CHECK-NEXT:    [[TMP6:%.*]] = load <2 x i64>, ptr [[TMP5]], align 8
>>>>>
>>>>> diff  --git a/llvm/test/Transforms/SLPVectorizer/X86/pr47629-inseltpoison.ll b/llvm/test/Transforms/SLPVectorizer/X86/pr47629-inseltpoison.ll
>>>>> index f6dd7526e6e76..35a6c63d29b6c 100644
>>>>> --- a/llvm/test/Transforms/SLPVectorizer/X86/pr47629-inseltpoison.ll
>>>>> +++ b/llvm/test/Transforms/SLPVectorizer/X86/pr47629-inseltpoison.ll
>>>>> @@ -749,47 +749,47 @@ define void @gather_load_div(float* noalias nocapture %0, float* noalias nocaptu
>>>>>    ; AVX2-NEXT:    ret void
>>>>>    ;
>>>>>    ; AVX512F-LABEL: @gather_load_div(
>>>>> -; AVX512F-NEXT:    [[TMP3:%.*]] = insertelement <4 x float*> poison, float* [[TMP1:%.*]], i64 0
>>>>> -; AVX512F-NEXT:    [[SHUFFLE1:%.*]] = shufflevector <4 x float*> [[TMP3]], <4 x float*> poison, <4 x i32> zeroinitializer
>>>>> -; AVX512F-NEXT:    [[TMP4:%.*]] = getelementptr float, <4 x float*> [[SHUFFLE1]], <4 x i64> <i64 10, i64 3, i64 14, i64 17>
>>>>> -; AVX512F-NEXT:    [[TMP5:%.*]] = insertelement <2 x float*> poison, float* [[TMP1]], i64 0
>>>>> -; AVX512F-NEXT:    [[TMP6:%.*]] = shufflevector <2 x float*> [[TMP5]], <2 x float*> poison, <2 x i32> zeroinitializer
>>>>> -; AVX512F-NEXT:    [[TMP7:%.*]] = getelementptr float, <2 x float*> [[TMP6]], <2 x i64> <i64 8, i64 5>
>>>>> -; AVX512F-NEXT:    [[TMP8:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 20
>>>>> -; AVX512F-NEXT:    [[TMP9:%.*]] = insertelement <8 x float*> poison, float* [[TMP1]], i64 0
>>>>> -; AVX512F-NEXT:    [[TMP10:%.*]] = shufflevector <4 x float*> [[TMP4]], <4 x float*> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
>>>>> -; AVX512F-NEXT:    [[TMP11:%.*]] = shufflevector <8 x float*> [[TMP9]], <8 x float*> [[TMP10]], <8 x i32> <i32 0, i32 8, i32 9, i32 10, i32 11, i32 undef, i32 undef, i32 undef>
>>>>> -; AVX512F-NEXT:    [[TMP12:%.*]] = shufflevector <2 x float*> [[TMP7]], <2 x float*> poison, <8 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
>>>>> -; AVX512F-NEXT:    [[TMP13:%.*]] = shufflevector <8 x float*> [[TMP11]], <8 x float*> [[TMP12]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 8, i32 9, i32 undef>
>>>>> -; AVX512F-NEXT:    [[TMP14:%.*]] = insertelement <8 x float*> [[TMP13]], float* [[TMP8]], i64 7
>>>>> -; AVX512F-NEXT:    [[TMP15:%.*]] = call <8 x float> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP14]], i32 4, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>>>> -; AVX512F-NEXT:    [[SHUFFLE:%.*]] = shufflevector <8 x float*> [[TMP9]], <8 x float*> poison, <8 x i32> zeroinitializer
>>>>> -; AVX512F-NEXT:    [[TMP16:%.*]] = getelementptr float, <8 x float*> [[SHUFFLE]], <8 x i64> <i64 4, i64 13, i64 11, i64 44, i64 33, i64 30, i64 27, i64 23>
>>>>> -; AVX512F-NEXT:    [[TMP17:%.*]] = call <8 x float> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP16]], i32 4, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>>>> -; AVX512F-NEXT:    [[TMP18:%.*]] = fdiv <8 x float> [[TMP15]], [[TMP17]]
>>>>> +; AVX512F-NEXT:    [[TMP3:%.*]] = insertelement <8 x float*> poison, float* [[TMP1:%.*]], i64 0
>>>>> +; AVX512F-NEXT:    [[SHUFFLE:%.*]] = shufflevector <8 x float*> [[TMP3]], <8 x float*> poison, <8 x i32> zeroinitializer
>>>>> +; AVX512F-NEXT:    [[TMP4:%.*]] = getelementptr float, <8 x float*> [[SHUFFLE]], <8 x i64> <i64 4, i64 13, i64 11, i64 44, i64 33, i64 30, i64 27, i64 23>
>>>>> +; AVX512F-NEXT:    [[TMP5:%.*]] = insertelement <4 x float*> poison, float* [[TMP1]], i64 0
>>>>> +; AVX512F-NEXT:    [[SHUFFLE1:%.*]] = shufflevector <4 x float*> [[TMP5]], <4 x float*> poison, <4 x i32> zeroinitializer
>>>>> +; AVX512F-NEXT:    [[TMP6:%.*]] = getelementptr float, <4 x float*> [[SHUFFLE1]], <4 x i64> <i64 10, i64 3, i64 14, i64 17>
>>>>> +; AVX512F-NEXT:    [[TMP7:%.*]] = insertelement <2 x float*> poison, float* [[TMP1]], i64 0
>>>>> +; AVX512F-NEXT:    [[TMP8:%.*]] = shufflevector <2 x float*> [[TMP7]], <2 x float*> poison, <2 x i32> zeroinitializer
>>>>> +; AVX512F-NEXT:    [[TMP9:%.*]] = getelementptr float, <2 x float*> [[TMP8]], <2 x i64> <i64 8, i64 5>
>>>>> +; AVX512F-NEXT:    [[TMP10:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 20
>>>>> +; AVX512F-NEXT:    [[TMP11:%.*]] = shufflevector <4 x float*> [[TMP6]], <4 x float*> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
>>>>> +; AVX512F-NEXT:    [[TMP12:%.*]] = shufflevector <8 x float*> [[TMP3]], <8 x float*> [[TMP11]], <8 x i32> <i32 0, i32 8, i32 9, i32 10, i32 11, i32 undef, i32 undef, i32 undef>
>>>>> +; AVX512F-NEXT:    [[TMP13:%.*]] = shufflevector <2 x float*> [[TMP9]], <2 x float*> poison, <8 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
>>>>> +; AVX512F-NEXT:    [[TMP14:%.*]] = shufflevector <8 x float*> [[TMP12]], <8 x float*> [[TMP13]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 8, i32 9, i32 undef>
>>>>> +; AVX512F-NEXT:    [[TMP15:%.*]] = insertelement <8 x float*> [[TMP14]], float* [[TMP10]], i64 7
>>>>> +; AVX512F-NEXT:    [[TMP16:%.*]] = call <8 x float> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP15]], i32 4, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>>>> +; AVX512F-NEXT:    [[TMP17:%.*]] = call <8 x float> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP4]], i32 4, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>>>> +; AVX512F-NEXT:    [[TMP18:%.*]] = fdiv <8 x float> [[TMP16]], [[TMP17]]
>>>>>    ; AVX512F-NEXT:    [[TMP19:%.*]] = bitcast float* [[TMP0:%.*]] to <8 x float>*
>>>>>    ; AVX512F-NEXT:    store <8 x float> [[TMP18]], <8 x float>* [[TMP19]], align 4, !tbaa [[TBAA0]]
>>>>>    ; AVX512F-NEXT:    ret void
>>>>>    ;
>>>>>    ; AVX512VL-LABEL: @gather_load_div(
>>>>> -; AVX512VL-NEXT:    [[TMP3:%.*]] = insertelement <4 x float*> poison, float* [[TMP1:%.*]], i64 0
>>>>> -; AVX512VL-NEXT:    [[SHUFFLE1:%.*]] = shufflevector <4 x float*> [[TMP3]], <4 x float*> poison, <4 x i32> zeroinitializer
>>>>> -; AVX512VL-NEXT:    [[TMP4:%.*]] = getelementptr float, <4 x float*> [[SHUFFLE1]], <4 x i64> <i64 10, i64 3, i64 14, i64 17>
>>>>> -; AVX512VL-NEXT:    [[TMP5:%.*]] = insertelement <2 x float*> poison, float* [[TMP1]], i64 0
>>>>> -; AVX512VL-NEXT:    [[TMP6:%.*]] = shufflevector <2 x float*> [[TMP5]], <2 x float*> poison, <2 x i32> zeroinitializer
>>>>> -; AVX512VL-NEXT:    [[TMP7:%.*]] = getelementptr float, <2 x float*> [[TMP6]], <2 x i64> <i64 8, i64 5>
>>>>> -; AVX512VL-NEXT:    [[TMP8:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 20
>>>>> -; AVX512VL-NEXT:    [[TMP9:%.*]] = insertelement <8 x float*> poison, float* [[TMP1]], i64 0
>>>>> -; AVX512VL-NEXT:    [[TMP10:%.*]] = shufflevector <4 x float*> [[TMP4]], <4 x float*> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
>>>>> -; AVX512VL-NEXT:    [[TMP11:%.*]] = shufflevector <8 x float*> [[TMP9]], <8 x float*> [[TMP10]], <8 x i32> <i32 0, i32 8, i32 9, i32 10, i32 11, i32 undef, i32 undef, i32 undef>
>>>>> -; AVX512VL-NEXT:    [[TMP12:%.*]] = shufflevector <2 x float*> [[TMP7]], <2 x float*> poison, <8 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
>>>>> -; AVX512VL-NEXT:    [[TMP13:%.*]] = shufflevector <8 x float*> [[TMP11]], <8 x float*> [[TMP12]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 8, i32 9, i32 undef>
>>>>> -; AVX512VL-NEXT:    [[TMP14:%.*]] = insertelement <8 x float*> [[TMP13]], float* [[TMP8]], i64 7
>>>>> -; AVX512VL-NEXT:    [[TMP15:%.*]] = call <8 x float> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP14]], i32 4, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>>>> -; AVX512VL-NEXT:    [[SHUFFLE:%.*]] = shufflevector <8 x float*> [[TMP9]], <8 x float*> poison, <8 x i32> zeroinitializer
>>>>> -; AVX512VL-NEXT:    [[TMP16:%.*]] = getelementptr float, <8 x float*> [[SHUFFLE]], <8 x i64> <i64 4, i64 13, i64 11, i64 44, i64 33, i64 30, i64 27, i64 23>
>>>>> -; AVX512VL-NEXT:    [[TMP17:%.*]] = call <8 x float> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP16]], i32 4, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>>>> -; AVX512VL-NEXT:    [[TMP18:%.*]] = fdiv <8 x float> [[TMP15]], [[TMP17]]
>>>>> +; AVX512VL-NEXT:    [[TMP3:%.*]] = insertelement <8 x float*> poison, float* [[TMP1:%.*]], i64 0
>>>>> +; AVX512VL-NEXT:    [[SHUFFLE:%.*]] = shufflevector <8 x float*> [[TMP3]], <8 x float*> poison, <8 x i32> zeroinitializer
>>>>> +; AVX512VL-NEXT:    [[TMP4:%.*]] = getelementptr float, <8 x float*> [[SHUFFLE]], <8 x i64> <i64 4, i64 13, i64 11, i64 44, i64 33, i64 30, i64 27, i64 23>
>>>>> +; AVX512VL-NEXT:    [[TMP5:%.*]] = insertelement <4 x float*> poison, float* [[TMP1]], i64 0
>>>>> +; AVX512VL-NEXT:    [[SHUFFLE1:%.*]] = shufflevector <4 x float*> [[TMP5]], <4 x float*> poison, <4 x i32> zeroinitializer
>>>>> +; AVX512VL-NEXT:    [[TMP6:%.*]] = getelementptr float, <4 x float*> [[SHUFFLE1]], <4 x i64> <i64 10, i64 3, i64 14, i64 17>
>>>>> +; AVX512VL-NEXT:    [[TMP7:%.*]] = insertelement <2 x float*> poison, float* [[TMP1]], i64 0
>>>>> +; AVX512VL-NEXT:    [[TMP8:%.*]] = shufflevector <2 x float*> [[TMP7]], <2 x float*> poison, <2 x i32> zeroinitializer
>>>>> +; AVX512VL-NEXT:    [[TMP9:%.*]] = getelementptr float, <2 x float*> [[TMP8]], <2 x i64> <i64 8, i64 5>
>>>>> +; AVX512VL-NEXT:    [[TMP10:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 20
>>>>> +; AVX512VL-NEXT:    [[TMP11:%.*]] = shufflevector <4 x float*> [[TMP6]], <4 x float*> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
>>>>> +; AVX512VL-NEXT:    [[TMP12:%.*]] = shufflevector <8 x float*> [[TMP3]], <8 x float*> [[TMP11]], <8 x i32> <i32 0, i32 8, i32 9, i32 10, i32 11, i32 undef, i32 undef, i32 undef>
>>>>> +; AVX512VL-NEXT:    [[TMP13:%.*]] = shufflevector <2 x float*> [[TMP9]], <2 x float*> poison, <8 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
>>>>> +; AVX512VL-NEXT:    [[TMP14:%.*]] = shufflevector <8 x float*> [[TMP12]], <8 x float*> [[TMP13]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 8, i32 9, i32 undef>
>>>>> +; AVX512VL-NEXT:    [[TMP15:%.*]] = insertelement <8 x float*> [[TMP14]], float* [[TMP10]], i64 7
>>>>> +; AVX512VL-NEXT:    [[TMP16:%.*]] = call <8 x float> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP15]], i32 4, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>>>> +; AVX512VL-NEXT:    [[TMP17:%.*]] = call <8 x float> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP4]], i32 4, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>>>> +; AVX512VL-NEXT:    [[TMP18:%.*]] = fdiv <8 x float> [[TMP16]], [[TMP17]]
>>>>>    ; AVX512VL-NEXT:    [[TMP19:%.*]] = bitcast float* [[TMP0:%.*]] to <8 x float>*
>>>>>    ; AVX512VL-NEXT:    store <8 x float> [[TMP18]], <8 x float>* [[TMP19]], align 4, !tbaa [[TBAA0]]
>>>>>    ; AVX512VL-NEXT:    ret void
>>>>>
>>>>> diff  --git a/llvm/test/Transforms/SLPVectorizer/X86/pr47629.ll b/llvm/test/Transforms/SLPVectorizer/X86/pr47629.ll
>>>>> index fd1c612a0696e..47f4391fd3b21 100644
>>>>> --- a/llvm/test/Transforms/SLPVectorizer/X86/pr47629.ll
>>>>> +++ b/llvm/test/Transforms/SLPVectorizer/X86/pr47629.ll
>>>>> @@ -749,47 +749,47 @@ define void @gather_load_div(float* noalias nocapture %0, float* noalias nocaptu
>>>>>    ; AVX2-NEXT:    ret void
>>>>>    ;
>>>>>    ; AVX512F-LABEL: @gather_load_div(
>>>>> -; AVX512F-NEXT:    [[TMP3:%.*]] = insertelement <4 x float*> poison, float* [[TMP1:%.*]], i64 0
>>>>> -; AVX512F-NEXT:    [[SHUFFLE1:%.*]] = shufflevector <4 x float*> [[TMP3]], <4 x float*> poison, <4 x i32> zeroinitializer
>>>>> -; AVX512F-NEXT:    [[TMP4:%.*]] = getelementptr float, <4 x float*> [[SHUFFLE1]], <4 x i64> <i64 10, i64 3, i64 14, i64 17>
>>>>> -; AVX512F-NEXT:    [[TMP5:%.*]] = insertelement <2 x float*> poison, float* [[TMP1]], i64 0
>>>>> -; AVX512F-NEXT:    [[TMP6:%.*]] = shufflevector <2 x float*> [[TMP5]], <2 x float*> poison, <2 x i32> zeroinitializer
>>>>> -; AVX512F-NEXT:    [[TMP7:%.*]] = getelementptr float, <2 x float*> [[TMP6]], <2 x i64> <i64 8, i64 5>
>>>>> -; AVX512F-NEXT:    [[TMP8:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 20
>>>>> -; AVX512F-NEXT:    [[TMP9:%.*]] = insertelement <8 x float*> poison, float* [[TMP1]], i64 0
>>>>> -; AVX512F-NEXT:    [[TMP10:%.*]] = shufflevector <4 x float*> [[TMP4]], <4 x float*> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
>>>>> -; AVX512F-NEXT:    [[TMP11:%.*]] = shufflevector <8 x float*> [[TMP9]], <8 x float*> [[TMP10]], <8 x i32> <i32 0, i32 8, i32 9, i32 10, i32 11, i32 undef, i32 undef, i32 undef>
>>>>> -; AVX512F-NEXT:    [[TMP12:%.*]] = shufflevector <2 x float*> [[TMP7]], <2 x float*> poison, <8 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
>>>>> -; AVX512F-NEXT:    [[TMP13:%.*]] = shufflevector <8 x float*> [[TMP11]], <8 x float*> [[TMP12]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 8, i32 9, i32 undef>
>>>>> -; AVX512F-NEXT:    [[TMP14:%.*]] = insertelement <8 x float*> [[TMP13]], float* [[TMP8]], i64 7
>>>>> -; AVX512F-NEXT:    [[TMP15:%.*]] = call <8 x float> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP14]], i32 4, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>>>> -; AVX512F-NEXT:    [[SHUFFLE:%.*]] = shufflevector <8 x float*> [[TMP9]], <8 x float*> poison, <8 x i32> zeroinitializer
>>>>> -; AVX512F-NEXT:    [[TMP16:%.*]] = getelementptr float, <8 x float*> [[SHUFFLE]], <8 x i64> <i64 4, i64 13, i64 11, i64 44, i64 33, i64 30, i64 27, i64 23>
>>>>> -; AVX512F-NEXT:    [[TMP17:%.*]] = call <8 x float> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP16]], i32 4, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>>>> -; AVX512F-NEXT:    [[TMP18:%.*]] = fdiv <8 x float> [[TMP15]], [[TMP17]]
>>>>> +; AVX512F-NEXT:    [[TMP3:%.*]] = insertelement <8 x float*> poison, float* [[TMP1:%.*]], i64 0
>>>>> +; AVX512F-NEXT:    [[SHUFFLE:%.*]] = shufflevector <8 x float*> [[TMP3]], <8 x float*> poison, <8 x i32> zeroinitializer
>>>>> +; AVX512F-NEXT:    [[TMP4:%.*]] = getelementptr float, <8 x float*> [[SHUFFLE]], <8 x i64> <i64 4, i64 13, i64 11, i64 44, i64 33, i64 30, i64 27, i64 23>
>>>>> +; AVX512F-NEXT:    [[TMP5:%.*]] = insertelement <4 x float*> poison, float* [[TMP1]], i64 0
>>>>> +; AVX512F-NEXT:    [[SHUFFLE1:%.*]] = shufflevector <4 x float*> [[TMP5]], <4 x float*> poison, <4 x i32> zeroinitializer
>>>>> +; AVX512F-NEXT:    [[TMP6:%.*]] = getelementptr float, <4 x float*> [[SHUFFLE1]], <4 x i64> <i64 10, i64 3, i64 14, i64 17>
>>>>> +; AVX512F-NEXT:    [[TMP7:%.*]] = insertelement <2 x float*> poison, float* [[TMP1]], i64 0
>>>>> +; AVX512F-NEXT:    [[TMP8:%.*]] = shufflevector <2 x float*> [[TMP7]], <2 x float*> poison, <2 x i32> zeroinitializer
>>>>> +; AVX512F-NEXT:    [[TMP9:%.*]] = getelementptr float, <2 x float*> [[TMP8]], <2 x i64> <i64 8, i64 5>
>>>>> +; AVX512F-NEXT:    [[TMP10:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 20
>>>>> +; AVX512F-NEXT:    [[TMP11:%.*]] = shufflevector <4 x float*> [[TMP6]], <4 x float*> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
>>>>> +; AVX512F-NEXT:    [[TMP12:%.*]] = shufflevector <8 x float*> [[TMP3]], <8 x float*> [[TMP11]], <8 x i32> <i32 0, i32 8, i32 9, i32 10, i32 11, i32 undef, i32 undef, i32 undef>
>>>>> +; AVX512F-NEXT:    [[TMP13:%.*]] = shufflevector <2 x float*> [[TMP9]], <2 x float*> poison, <8 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
>>>>> +; AVX512F-NEXT:    [[TMP14:%.*]] = shufflevector <8 x float*> [[TMP12]], <8 x float*> [[TMP13]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 8, i32 9, i32 undef>
>>>>> +; AVX512F-NEXT:    [[TMP15:%.*]] = insertelement <8 x float*> [[TMP14]], float* [[TMP10]], i64 7
>>>>> +; AVX512F-NEXT:    [[TMP16:%.*]] = call <8 x float> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP15]], i32 4, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>>>> +; AVX512F-NEXT:    [[TMP17:%.*]] = call <8 x float> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP4]], i32 4, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>>>> +; AVX512F-NEXT:    [[TMP18:%.*]] = fdiv <8 x float> [[TMP16]], [[TMP17]]
>>>>>    ; AVX512F-NEXT:    [[TMP19:%.*]] = bitcast float* [[TMP0:%.*]] to <8 x float>*
>>>>>    ; AVX512F-NEXT:    store <8 x float> [[TMP18]], <8 x float>* [[TMP19]], align 4, !tbaa [[TBAA0]]
>>>>>    ; AVX512F-NEXT:    ret void
>>>>>    ;
>>>>>    ; AVX512VL-LABEL: @gather_load_div(
>>>>> -; AVX512VL-NEXT:    [[TMP3:%.*]] = insertelement <4 x float*> poison, float* [[TMP1:%.*]], i64 0
>>>>> -; AVX512VL-NEXT:    [[SHUFFLE1:%.*]] = shufflevector <4 x float*> [[TMP3]], <4 x float*> poison, <4 x i32> zeroinitializer
>>>>> -; AVX512VL-NEXT:    [[TMP4:%.*]] = getelementptr float, <4 x float*> [[SHUFFLE1]], <4 x i64> <i64 10, i64 3, i64 14, i64 17>
>>>>> -; AVX512VL-NEXT:    [[TMP5:%.*]] = insertelement <2 x float*> poison, float* [[TMP1]], i64 0
>>>>> -; AVX512VL-NEXT:    [[TMP6:%.*]] = shufflevector <2 x float*> [[TMP5]], <2 x float*> poison, <2 x i32> zeroinitializer
>>>>> -; AVX512VL-NEXT:    [[TMP7:%.*]] = getelementptr float, <2 x float*> [[TMP6]], <2 x i64> <i64 8, i64 5>
>>>>> -; AVX512VL-NEXT:    [[TMP8:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 20
>>>>> -; AVX512VL-NEXT:    [[TMP9:%.*]] = insertelement <8 x float*> poison, float* [[TMP1]], i64 0
>>>>> -; AVX512VL-NEXT:    [[TMP10:%.*]] = shufflevector <4 x float*> [[TMP4]], <4 x float*> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
>>>>> -; AVX512VL-NEXT:    [[TMP11:%.*]] = shufflevector <8 x float*> [[TMP9]], <8 x float*> [[TMP10]], <8 x i32> <i32 0, i32 8, i32 9, i32 10, i32 11, i32 undef, i32 undef, i32 undef>
>>>>> -; AVX512VL-NEXT:    [[TMP12:%.*]] = shufflevector <2 x float*> [[TMP7]], <2 x float*> poison, <8 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
>>>>> -; AVX512VL-NEXT:    [[TMP13:%.*]] = shufflevector <8 x float*> [[TMP11]], <8 x float*> [[TMP12]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 8, i32 9, i32 undef>
>>>>> -; AVX512VL-NEXT:    [[TMP14:%.*]] = insertelement <8 x float*> [[TMP13]], float* [[TMP8]], i64 7
>>>>> -; AVX512VL-NEXT:    [[TMP15:%.*]] = call <8 x float> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP14]], i32 4, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>>>> -; AVX512VL-NEXT:    [[SHUFFLE:%.*]] = shufflevector <8 x float*> [[TMP9]], <8 x float*> poison, <8 x i32> zeroinitializer
>>>>> -; AVX512VL-NEXT:    [[TMP16:%.*]] = getelementptr float, <8 x float*> [[SHUFFLE]], <8 x i64> <i64 4, i64 13, i64 11, i64 44, i64 33, i64 30, i64 27, i64 23>
>>>>> -; AVX512VL-NEXT:    [[TMP17:%.*]] = call <8 x float> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP16]], i32 4, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>>>> -; AVX512VL-NEXT:    [[TMP18:%.*]] = fdiv <8 x float> [[TMP15]], [[TMP17]]
>>>>> +; AVX512VL-NEXT:    [[TMP3:%.*]] = insertelement <8 x float*> poison, float* [[TMP1:%.*]], i64 0
>>>>> +; AVX512VL-NEXT:    [[SHUFFLE:%.*]] = shufflevector <8 x float*> [[TMP3]], <8 x float*> poison, <8 x i32> zeroinitializer
>>>>> +; AVX512VL-NEXT:    [[TMP4:%.*]] = getelementptr float, <8 x float*> [[SHUFFLE]], <8 x i64> <i64 4, i64 13, i64 11, i64 44, i64 33, i64 30, i64 27, i64 23>
>>>>> +; AVX512VL-NEXT:    [[TMP5:%.*]] = insertelement <4 x float*> poison, float* [[TMP1]], i64 0
>>>>> +; AVX512VL-NEXT:    [[SHUFFLE1:%.*]] = shufflevector <4 x float*> [[TMP5]], <4 x float*> poison, <4 x i32> zeroinitializer
>>>>> +; AVX512VL-NEXT:    [[TMP6:%.*]] = getelementptr float, <4 x float*> [[SHUFFLE1]], <4 x i64> <i64 10, i64 3, i64 14, i64 17>
>>>>> +; AVX512VL-NEXT:    [[TMP7:%.*]] = insertelement <2 x float*> poison, float* [[TMP1]], i64 0
>>>>> +; AVX512VL-NEXT:    [[TMP8:%.*]] = shufflevector <2 x float*> [[TMP7]], <2 x float*> poison, <2 x i32> zeroinitializer
>>>>> +; AVX512VL-NEXT:    [[TMP9:%.*]] = getelementptr float, <2 x float*> [[TMP8]], <2 x i64> <i64 8, i64 5>
>>>>> +; AVX512VL-NEXT:    [[TMP10:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 20
>>>>> +; AVX512VL-NEXT:    [[TMP11:%.*]] = shufflevector <4 x float*> [[TMP6]], <4 x float*> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
>>>>> +; AVX512VL-NEXT:    [[TMP12:%.*]] = shufflevector <8 x float*> [[TMP3]], <8 x float*> [[TMP11]], <8 x i32> <i32 0, i32 8, i32 9, i32 10, i32 11, i32 undef, i32 undef, i32 undef>
>>>>> +; AVX512VL-NEXT:    [[TMP13:%.*]] = shufflevector <2 x float*> [[TMP9]], <2 x float*> poison, <8 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
>>>>> +; AVX512VL-NEXT:    [[TMP14:%.*]] = shufflevector <8 x float*> [[TMP12]], <8 x float*> [[TMP13]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 8, i32 9, i32 undef>
>>>>> +; AVX512VL-NEXT:    [[TMP15:%.*]] = insertelement <8 x float*> [[TMP14]], float* [[TMP10]], i64 7
>>>>> +; AVX512VL-NEXT:    [[TMP16:%.*]] = call <8 x float> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP15]], i32 4, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>>>> +; AVX512VL-NEXT:    [[TMP17:%.*]] = call <8 x float> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP4]], i32 4, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x float> undef), !tbaa [[TBAA0]]
>>>>> +; AVX512VL-NEXT:    [[TMP18:%.*]] = fdiv <8 x float> [[TMP16]], [[TMP17]]
>>>>>    ; AVX512VL-NEXT:    [[TMP19:%.*]] = bitcast float* [[TMP0:%.*]] to <8 x float>*
>>>>>    ; AVX512VL-NEXT:    store <8 x float> [[TMP18]], <8 x float>* [[TMP19]], align 4, !tbaa [[TBAA0]]
>>>>>    ; AVX512VL-NEXT:    ret void
>>>>>
>>>>> diff --git a/llvm/test/Transforms/SLPVectorizer/X86/shrink_after_reorder2.ll b/llvm/test/Transforms/SLPVectorizer/X86/shrink_after_reorder2.ll
>>>>> index a4a388e9d095c..6946ab292cdf5 100644
>>>>> --- a/llvm/test/Transforms/SLPVectorizer/X86/shrink_after_reorder2.ll
>>>>> +++ b/llvm/test/Transforms/SLPVectorizer/X86/shrink_after_reorder2.ll
>>>>> @@ -21,11 +21,11 @@ define void @foo(%class.e* %this, %struct.a* %p, i32 %add7) {
>>>>>    ; CHECK-NEXT:    i32 2, label [[SW_BB]]
>>>>>    ; CHECK-NEXT:    ]
>>>>>    ; CHECK:       sw.bb:
>>>>> -; CHECK-NEXT:    [[TMP2:%.*]] = bitcast i32* [[G]] to <2 x i32>*
>>>>> -; CHECK-NEXT:    [[TMP3:%.*]] = load <2 x i32>, <2 x i32>* [[TMP2]], align 4
>>>>>    ; CHECK-NEXT:    [[SHRINK_SHUFFLE:%.*]] = shufflevector <4 x i32> [[SHUFFLE]], <4 x i32> poison, <2 x i32> <i32 2, i32 0>
>>>>> -; CHECK-NEXT:    [[TMP4:%.*]] = xor <2 x i32> [[SHRINK_SHUFFLE]], <i32 -1, i32 -1>
>>>>> -; CHECK-NEXT:    [[TMP5:%.*]] = add <2 x i32> [[TMP3]], [[TMP4]]
>>>>> +; CHECK-NEXT:    [[TMP2:%.*]] = xor <2 x i32> [[SHRINK_SHUFFLE]], <i32 -1, i32 -1>
>>>>> +; CHECK-NEXT:    [[TMP3:%.*]] = bitcast i32* [[G]] to <2 x i32>*
>>>>> +; CHECK-NEXT:    [[TMP4:%.*]] = load <2 x i32>, <2 x i32>* [[TMP3]], align 4
>>>>> +; CHECK-NEXT:    [[TMP5:%.*]] = add <2 x i32> [[TMP4]], [[TMP2]]
>>>>>    ; CHECK-NEXT:    br label [[SW_EPILOG]]
>>>>>    ; CHECK:       sw.epilog:
>>>>>    ; CHECK-NEXT:    [[TMP6:%.*]] = phi <2 x i32> [ undef, [[ENTRY:%.*]] ], [ [[TMP5]], [[SW_BB]] ]
>>>>>
>>>>> diff --git a/llvm/test/Transforms/SLPVectorizer/X86/vectorize-widest-phis.ll b/llvm/test/Transforms/SLPVectorizer/X86/vectorize-widest-phis.ll
>>>>> index 87709a87b3692..109c27e4f4f4e 100644
>>>>> --- a/llvm/test/Transforms/SLPVectorizer/X86/vectorize-widest-phis.ll
>>>>> +++ b/llvm/test/Transforms/SLPVectorizer/X86/vectorize-widest-phis.ll
>>>>> @@ -16,8 +16,8 @@ define void @foo() {
>>>>>    ; CHECK-NEXT:    [[TMP3:%.*]] = load double, double* undef, align 8
>>>>>    ; CHECK-NEXT:    br i1 undef, label [[BB3]], label [[BB4:%.*]]
>>>>>    ; CHECK:       bb4:
>>>>> -; CHECK-NEXT:    [[CONV2:%.*]] = uitofp i16 undef to double
>>>>>    ; CHECK-NEXT:    [[TMP4:%.*]] = fpext <4 x float> [[TMP2]] to <4 x double>
>>>>> +; CHECK-NEXT:    [[CONV2:%.*]] = uitofp i16 undef to double
>>>>>    ; CHECK-NEXT:    [[TMP5:%.*]] = insertelement <2 x double> <double undef, double poison>, double [[TMP3]], i32 1
>>>>>    ; CHECK-NEXT:    [[TMP6:%.*]] = insertelement <2 x double> <double undef, double poison>, double [[CONV2]], i32 1
>>>>>    ; CHECK-NEXT:    [[TMP7:%.*]] = fsub <2 x double> [[TMP5]], [[TMP6]]
>>>>>
>>>>> diff --git a/llvm/test/Transforms/SLPVectorizer/slp-max-phi-size.ll b/llvm/test/Transforms/SLPVectorizer/slp-max-phi-size.ll
>>>>> index 33ba97921e878..da18a937a6477 100644
>>>>> --- a/llvm/test/Transforms/SLPVectorizer/slp-max-phi-size.ll
>>>>> +++ b/llvm/test/Transforms/SLPVectorizer/slp-max-phi-size.ll
>>>>> @@ -133,27 +133,27 @@ define void @phi_float32(half %hval, float %fval) {
>>>>>    ; MAX256-NEXT:    br label [[BB1:%.*]]
>>>>>    ; MAX256:       bb1:
>>>>>    ; MAX256-NEXT:    [[I:%.*]] = fpext half [[HVAL:%.*]] to float
>>>>> -; MAX256-NEXT:    [[I3:%.*]] = fpext half [[HVAL]] to float
>>>>> -; MAX256-NEXT:    [[I6:%.*]] = fpext half [[HVAL]] to float
>>>>> -; MAX256-NEXT:    [[I9:%.*]] = fpext half [[HVAL]] to float
>>>>>    ; MAX256-NEXT:    [[TMP0:%.*]] = insertelement <8 x float> poison, float [[I]], i32 0
>>>>>    ; MAX256-NEXT:    [[SHUFFLE11:%.*]] = shufflevector <8 x float> [[TMP0]], <8 x float> poison, <8 x i32> zeroinitializer
>>>>>    ; MAX256-NEXT:    [[TMP1:%.*]] = insertelement <8 x float> poison, float [[FVAL:%.*]], i32 0
>>>>>    ; MAX256-NEXT:    [[SHUFFLE12:%.*]] = shufflevector <8 x float> [[TMP1]], <8 x float> poison, <8 x i32> zeroinitializer
>>>>>    ; MAX256-NEXT:    [[TMP2:%.*]] = fmul <8 x float> [[SHUFFLE11]], [[SHUFFLE12]]
>>>>> -; MAX256-NEXT:    [[TMP3:%.*]] = fadd <8 x float> zeroinitializer, [[TMP2]]
>>>>> -; MAX256-NEXT:    [[TMP4:%.*]] = insertelement <8 x float> poison, float [[I3]], i32 0
>>>>> -; MAX256-NEXT:    [[SHUFFLE:%.*]] = shufflevector <8 x float> [[TMP4]], <8 x float> poison, <8 x i32> zeroinitializer
>>>>> -; MAX256-NEXT:    [[TMP5:%.*]] = fmul <8 x float> [[SHUFFLE]], [[SHUFFLE12]]
>>>>> -; MAX256-NEXT:    [[TMP6:%.*]] = fadd <8 x float> zeroinitializer, [[TMP5]]
>>>>> -; MAX256-NEXT:    [[TMP7:%.*]] = insertelement <8 x float> poison, float [[I6]], i32 0
>>>>> -; MAX256-NEXT:    [[SHUFFLE5:%.*]] = shufflevector <8 x float> [[TMP7]], <8 x float> poison, <8 x i32> zeroinitializer
>>>>> -; MAX256-NEXT:    [[TMP8:%.*]] = fmul <8 x float> [[SHUFFLE5]], [[SHUFFLE12]]
>>>>> -; MAX256-NEXT:    [[TMP9:%.*]] = fadd <8 x float> zeroinitializer, [[TMP8]]
>>>>> -; MAX256-NEXT:    [[TMP10:%.*]] = insertelement <8 x float> poison, float [[I9]], i32 0
>>>>> -; MAX256-NEXT:    [[SHUFFLE8:%.*]] = shufflevector <8 x float> [[TMP10]], <8 x float> poison, <8 x i32> zeroinitializer
>>>>> -; MAX256-NEXT:    [[TMP11:%.*]] = fmul <8 x float> [[SHUFFLE8]], [[SHUFFLE12]]
>>>>> -; MAX256-NEXT:    [[TMP12:%.*]] = fadd <8 x float> zeroinitializer, [[TMP11]]
>>>>> +; MAX256-NEXT:    [[I3:%.*]] = fpext half [[HVAL]] to float
>>>>> +; MAX256-NEXT:    [[TMP3:%.*]] = insertelement <8 x float> poison, float [[I3]], i32 0
>>>>> +; MAX256-NEXT:    [[SHUFFLE:%.*]] = shufflevector <8 x float> [[TMP3]], <8 x float> poison, <8 x i32> zeroinitializer
>>>>> +; MAX256-NEXT:    [[TMP4:%.*]] = fmul <8 x float> [[SHUFFLE]], [[SHUFFLE12]]
>>>>> +; MAX256-NEXT:    [[I6:%.*]] = fpext half [[HVAL]] to float
>>>>> +; MAX256-NEXT:    [[TMP5:%.*]] = insertelement <8 x float> poison, float [[I6]], i32 0
>>>>> +; MAX256-NEXT:    [[SHUFFLE5:%.*]] = shufflevector <8 x float> [[TMP5]], <8 x float> poison, <8 x i32> zeroinitializer
>>>>> +; MAX256-NEXT:    [[TMP6:%.*]] = fmul <8 x float> [[SHUFFLE5]], [[SHUFFLE12]]
>>>>> +; MAX256-NEXT:    [[I9:%.*]] = fpext half [[HVAL]] to float
>>>>> +; MAX256-NEXT:    [[TMP7:%.*]] = insertelement <8 x float> poison, float [[I9]], i32 0
>>>>> +; MAX256-NEXT:    [[SHUFFLE8:%.*]] = shufflevector <8 x float> [[TMP7]], <8 x float> poison, <8 x i32> zeroinitializer
>>>>> +; MAX256-NEXT:    [[TMP8:%.*]] = fmul <8 x float> [[SHUFFLE8]], [[SHUFFLE12]]
>>>>> +; MAX256-NEXT:    [[TMP9:%.*]] = fadd <8 x float> zeroinitializer, [[TMP2]]
>>>>> +; MAX256-NEXT:    [[TMP10:%.*]] = fadd <8 x float> zeroinitializer, [[TMP4]]
>>>>> +; MAX256-NEXT:    [[TMP11:%.*]] = fadd <8 x float> zeroinitializer, [[TMP6]]
>>>>> +; MAX256-NEXT:    [[TMP12:%.*]] = fadd <8 x float> zeroinitializer, [[TMP8]]
>>>>>    ; MAX256-NEXT:    switch i32 undef, label [[BB5:%.*]] [
>>>>>    ; MAX256-NEXT:    i32 0, label [[BB2:%.*]]
>>>>>    ; MAX256-NEXT:    i32 1, label [[BB3:%.*]]
>>>>> @@ -166,10 +166,10 @@ define void @phi_float32(half %hval, float %fval) {
>>>>>    ; MAX256:       bb5:
>>>>>    ; MAX256-NEXT:    br label [[BB2]]
>>>>>    ; MAX256:       bb2:
>>>>> -; MAX256-NEXT:    [[TMP13:%.*]] = phi <8 x float> [ [[TMP6]], [[BB3]] ], [ [[SHUFFLE12]], [[BB4]] ], [ [[SHUFFLE12]], [[BB5]] ], [ [[SHUFFLE12]], [[BB1]] ]
>>>>> -; MAX256-NEXT:    [[TMP14:%.*]] = phi <8 x float> [ [[TMP9]], [[BB3]] ], [ [[SHUFFLE12]], [[BB4]] ], [ [[TMP9]], [[BB5]] ], [ [[TMP9]], [[BB1]] ]
>>>>> +; MAX256-NEXT:    [[TMP13:%.*]] = phi <8 x float> [ [[TMP10]], [[BB3]] ], [ [[SHUFFLE12]], [[BB4]] ], [ [[SHUFFLE12]], [[BB5]] ], [ [[SHUFFLE12]], [[BB1]] ]
>>>>> +; MAX256-NEXT:    [[TMP14:%.*]] = phi <8 x float> [ [[TMP11]], [[BB3]] ], [ [[SHUFFLE12]], [[BB4]] ], [ [[TMP11]], [[BB5]] ], [ [[TMP11]], [[BB1]] ]
>>>>>    ; MAX256-NEXT:    [[TMP15:%.*]] = phi <8 x float> [ [[TMP12]], [[BB3]] ], [ [[TMP12]], [[BB4]] ], [ [[SHUFFLE12]], [[BB5]] ], [ [[TMP12]], [[BB1]] ]
>>>>> -; MAX256-NEXT:    [[TMP16:%.*]] = phi <8 x float> [ [[TMP3]], [[BB3]] ], [ [[TMP3]], [[BB4]] ], [ [[TMP3]], [[BB5]] ], [ [[SHUFFLE12]], [[BB1]] ]
>>>>> +; MAX256-NEXT:    [[TMP16:%.*]] = phi <8 x float> [ [[TMP9]], [[BB3]] ], [ [[TMP9]], [[BB4]] ], [ [[TMP9]], [[BB5]] ], [ [[SHUFFLE12]], [[BB1]] ]
>>>>>    ; MAX256-NEXT:    [[TMP17:%.*]] = extractelement <8 x float> [[TMP14]], i32 7
>>>>>    ; MAX256-NEXT:    store float [[TMP17]], float* undef, align 4
>>>>>    ; MAX256-NEXT:    ret void
>>>>> @@ -179,27 +179,27 @@ define void @phi_float32(half %hval, float %fval) {
>>>>>    ; MAX1024-NEXT:    br label [[BB1:%.*]]
>>>>>    ; MAX1024:       bb1:
>>>>>    ; MAX1024-NEXT:    [[I:%.*]] = fpext half [[HVAL:%.*]] to float
>>>>> -; MAX1024-NEXT:    [[I3:%.*]] = fpext half [[HVAL]] to float
>>>>> -; MAX1024-NEXT:    [[I6:%.*]] = fpext half [[HVAL]] to float
>>>>> -; MAX1024-NEXT:    [[I9:%.*]] = fpext half [[HVAL]] to float
>>>>>    ; MAX1024-NEXT:    [[TMP0:%.*]] = insertelement <8 x float> poison, float [[I]], i32 0
>>>>>    ; MAX1024-NEXT:    [[SHUFFLE11:%.*]] = shufflevector <8 x float> [[TMP0]], <8 x float> poison, <8 x i32> zeroinitializer
>>>>>    ; MAX1024-NEXT:    [[TMP1:%.*]] = insertelement <8 x float> poison, float [[FVAL:%.*]], i32 0
>>>>>    ; MAX1024-NEXT:    [[SHUFFLE12:%.*]] = shufflevector <8 x float> [[TMP1]], <8 x float> poison, <8 x i32> zeroinitializer
>>>>>    ; MAX1024-NEXT:    [[TMP2:%.*]] = fmul <8 x float> [[SHUFFLE11]], [[SHUFFLE12]]
>>>>> -; MAX1024-NEXT:    [[TMP3:%.*]] = fadd <8 x float> zeroinitializer, [[TMP2]]
>>>>> -; MAX1024-NEXT:    [[TMP4:%.*]] = insertelement <8 x float> poison, float [[I3]], i32 0
>>>>> -; MAX1024-NEXT:    [[SHUFFLE:%.*]] = shufflevector <8 x float> [[TMP4]], <8 x float> poison, <8 x i32> zeroinitializer
>>>>> -; MAX1024-NEXT:    [[TMP5:%.*]] = fmul <8 x float> [[SHUFFLE]], [[SHUFFLE12]]
>>>>> -; MAX1024-NEXT:    [[TMP6:%.*]] = fadd <8 x float> zeroinitializer, [[TMP5]]
>>>>> -; MAX1024-NEXT:    [[TMP7:%.*]] = insertelement <8 x float> poison, float [[I6]], i32 0
>>>>> -; MAX1024-NEXT:    [[SHUFFLE5:%.*]] = shufflevector <8 x float> [[TMP7]], <8 x float> poison, <8 x i32> zeroinitializer
>>>>> -; MAX1024-NEXT:    [[TMP8:%.*]] = fmul <8 x float> [[SHUFFLE5]], [[SHUFFLE12]]
>>>>> -; MAX1024-NEXT:    [[TMP9:%.*]] = fadd <8 x float> zeroinitializer, [[TMP8]]
>>>>> -; MAX1024-NEXT:    [[TMP10:%.*]] = insertelement <8 x float> poison, float [[I9]], i32 0
>>>>> -; MAX1024-NEXT:    [[SHUFFLE8:%.*]] = shufflevector <8 x float> [[TMP10]], <8 x float> poison, <8 x i32> zeroinitializer
>>>>> -; MAX1024-NEXT:    [[TMP11:%.*]] = fmul <8 x float> [[SHUFFLE8]], [[SHUFFLE12]]
>>>>> -; MAX1024-NEXT:    [[TMP12:%.*]] = fadd <8 x float> zeroinitializer, [[TMP11]]
>>>>> +; MAX1024-NEXT:    [[I3:%.*]] = fpext half [[HVAL]] to float
>>>>> +; MAX1024-NEXT:    [[TMP3:%.*]] = insertelement <8 x float> poison, float [[I3]], i32 0
>>>>> +; MAX1024-NEXT:    [[SHUFFLE:%.*]] = shufflevector <8 x float> [[TMP3]], <8 x float> poison, <8 x i32> zeroinitializer
>>>>> +; MAX1024-NEXT:    [[TMP4:%.*]] = fmul <8 x float> [[SHUFFLE]], [[SHUFFLE12]]
>>>>> +; MAX1024-NEXT:    [[I6:%.*]] = fpext half [[HVAL]] to float
>>>>> +; MAX1024-NEXT:    [[TMP5:%.*]] = insertelement <8 x float> poison, float [[I6]], i32 0
>>>>> +; MAX1024-NEXT:    [[SHUFFLE5:%.*]] = shufflevector <8 x float> [[TMP5]], <8 x float> poison, <8 x i32> zeroinitializer
>>>>> +; MAX1024-NEXT:    [[TMP6:%.*]] = fmul <8 x float> [[SHUFFLE5]], [[SHUFFLE12]]
>>>>> +; MAX1024-NEXT:    [[I9:%.*]] = fpext half [[HVAL]] to float
>>>>> +; MAX1024-NEXT:    [[TMP7:%.*]] = insertelement <8 x float> poison, float [[I9]], i32 0
>>>>> +; MAX1024-NEXT:    [[SHUFFLE8:%.*]] = shufflevector <8 x float> [[TMP7]], <8 x float> poison, <8 x i32> zeroinitializer
>>>>> +; MAX1024-NEXT:    [[TMP8:%.*]] = fmul <8 x float> [[SHUFFLE8]], [[SHUFFLE12]]
>>>>> +; MAX1024-NEXT:    [[TMP9:%.*]] = fadd <8 x float> zeroinitializer, [[TMP2]]
>>>>> +; MAX1024-NEXT:    [[TMP10:%.*]] = fadd <8 x float> zeroinitializer, [[TMP4]]
>>>>> +; MAX1024-NEXT:    [[TMP11:%.*]] = fadd <8 x float> zeroinitializer, [[TMP6]]
>>>>> +; MAX1024-NEXT:    [[TMP12:%.*]] = fadd <8 x float> zeroinitializer, [[TMP8]]
>>>>>    ; MAX1024-NEXT:    switch i32 undef, label [[BB5:%.*]] [
>>>>>    ; MAX1024-NEXT:    i32 0, label [[BB2:%.*]]
>>>>>    ; MAX1024-NEXT:    i32 1, label [[BB3:%.*]]
>>>>> @@ -212,10 +212,10 @@ define void @phi_float32(half %hval, float %fval) {
>>>>>    ; MAX1024:       bb5:
>>>>>    ; MAX1024-NEXT:    br label [[BB2]]
>>>>>    ; MAX1024:       bb2:
>>>>> -; MAX1024-NEXT:    [[TMP13:%.*]] = phi <8 x float> [ [[TMP6]], [[BB3]] ], [ [[SHUFFLE12]], [[BB4]] ], [ [[SHUFFLE12]], [[BB5]] ], [ [[SHUFFLE12]], [[BB1]] ]
>>>>> -; MAX1024-NEXT:    [[TMP14:%.*]] = phi <8 x float> [ [[TMP9]], [[BB3]] ], [ [[SHUFFLE12]], [[BB4]] ], [ [[TMP9]], [[BB5]] ], [ [[TMP9]], [[BB1]] ]
>>>>> +; MAX1024-NEXT:    [[TMP13:%.*]] = phi <8 x float> [ [[TMP10]], [[BB3]] ], [ [[SHUFFLE12]], [[BB4]] ], [ [[SHUFFLE12]], [[BB5]] ], [ [[SHUFFLE12]], [[BB1]] ]
>>>>> +; MAX1024-NEXT:    [[TMP14:%.*]] = phi <8 x float> [ [[TMP11]], [[BB3]] ], [ [[SHUFFLE12]], [[BB4]] ], [ [[TMP11]], [[BB5]] ], [ [[TMP11]], [[BB1]] ]
>>>>>    ; MAX1024-NEXT:    [[TMP15:%.*]] = phi <8 x float> [ [[TMP12]], [[BB3]] ], [ [[TMP12]], [[BB4]] ], [ [[SHUFFLE12]], [[BB5]] ], [ [[TMP12]], [[BB1]] ]
>>>>> -; MAX1024-NEXT:    [[TMP16:%.*]] = phi <8 x float> [ [[TMP3]], [[BB3]] ], [ [[TMP3]], [[BB4]] ], [ [[TMP3]], [[BB5]] ], [ [[SHUFFLE12]], [[BB1]] ]
>>>>> +; MAX1024-NEXT:    [[TMP16:%.*]] = phi <8 x float> [ [[TMP9]], [[BB3]] ], [ [[TMP9]], [[BB4]] ], [ [[TMP9]], [[BB5]] ], [ [[SHUFFLE12]], [[BB1]] ]
>>>>>    ; MAX1024-NEXT:    [[TMP17:%.*]] = extractelement <8 x float> [[TMP14]], i32 7
>>>>>    ; MAX1024-NEXT:    store float [[TMP17]], float* undef, align 4
>>>>>    ; MAX1024-NEXT:    ret void
>>>>>
>>>>>
>>>>>           _______________________________________________
>>>>> llvm-commits mailing list
>>>>> llvm-commits at lists.llvm.org
>>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits


More information about the llvm-commits mailing list