[llvm] fad70c3 - [ARM] Improve WLS lowering

David Green via llvm-commits llvm-commits at lists.llvm.org
Thu Mar 11 09:56:36 PST 2021


Author: David Green
Date: 2021-03-11T17:56:19Z
New Revision: fad70c306854fd21045cad3f6d9226894eb709e6

URL: https://github.com/llvm/llvm-project/commit/fad70c306854fd21045cad3f6d9226894eb709e6
DIFF: https://github.com/llvm/llvm-project/commit/fad70c306854fd21045cad3f6d9226894eb709e6.diff

LOG: [ARM] Improve WLS lowering

Recently we improved the lowering of low overhead loops and tail
predicated loops, but concentrated first on the DLS do-style loops. This
extends those improvements to the WLS while loops, improving the
chance of lowering them successfully. To do this the lowering has to
change a little, as the instructions are terminators that produce a
value, something that needs to be treated carefully.

Lowering starts at the Hardware Loop pass, inserting a new
llvm.test.start.loop.iterations that produces both an i1 to control the
loop entry and an i32 similar to the llvm.start.loop.iterations
intrinsic added for do loops. This feeds into the loop phi, properly
gluing the values together:

  %wls = call { i32, i1 } @llvm.test.start.loop.iterations.i32(i32 %div)
  %wls0 = extractvalue { i32, i1 } %wls, 0
  %wls1 = extractvalue { i32, i1 } %wls, 1
  br i1 %wls1, label %loop.ph, label %loop.exit
...
loop:
  %lsr.iv = phi i32 [ %wls0, %loop.ph ], [ %iv.next, %loop ]
  ..
  %iv.next = call i32 @llvm.loop.decrement.reg.i32(i32 %lsr.iv, i32 1)
  %cmp = icmp ne i32 %iv.next, 0
  br i1 %cmp, label %loop, label %loop.exit
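
As a rough illustration of the IR shape above, the following standalone
C++ sketch models the two intrinsics as plain functions and runs the
guarded loop. The names testStartLoopIterations, loopDecrementReg and
runWhileLoop are hypothetical stand-ins and not LLVM APIs; the sketch
only assumes what the LangRef text in this patch states, namely that the
intrinsic returns its input unchanged together with a "count is not
zero" flag.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the { i32, i1 } result of
// llvm.test.start.loop.iterations.i32.
struct LoopSetup {
  uint32_t Count; // element 0: the trip count, passed through unchanged
  bool Enter;     // element 1: true when the trip count is not zero
};

// Stand-in for the intrinsic: no arithmetic on the operand, just the
// pass-through value and the loop-entry condition.
LoopSetup testStartLoopIterations(uint32_t TripCount) {
  return {TripCount, TripCount != 0};
}

// Stand-in for llvm.loop.decrement.reg: a plain subtraction.
uint32_t loopDecrementReg(uint32_t Counter, uint32_t Step) {
  assert(Step <= Counter && "model assumes the counter never wraps");
  return Counter - Step;
}

// Mirrors the control flow of the IR example and returns how many times
// the loop body executed.
uint32_t runWhileLoop(uint32_t TripCount) {
  LoopSetup Wls = testStartLoopIterations(TripCount);
  uint32_t BodyExecutions = 0;
  if (!Wls.Enter)
    return BodyExecutions;      // br i1 %wls1: skip straight to %loop.exit
  uint32_t LsrIV = Wls.Count;   // phi value coming from %loop.ph
  for (;;) {
    ++BodyExecutions;           // the loop body
    uint32_t IVNext = loopDecrementReg(LsrIV, 1);
    if (IVNext == 0)            // %cmp = icmp ne i32 %iv.next, 0
      break;
    LsrIV = IVNext;             // phi value coming from the back edge
  }
  return BodyExecutions;
}
```

In this model a trip count of zero never enters the loop, and any other
trip count N executes the body exactly N times, which is the guarantee
the i1 guard and the phi-fed counter provide together.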

The llvm.test.start.loop.iterations intrinsic needs to be lowered
through ISel lowering as a pair of WLSSETUP and WLS nodes, which are
converted to t2WhileLoopSetup and t2WhileLoopStart pseudos respectively.
This avoids having t2WhileLoopStart be a terminator that produces a
value, something difficult to control at that stage in the pipeline.
Instead the t2WhileLoopSetup produces the value of LR (essentially
acting as a lr = subs rn, 0) and t2WhileLoopStart consumes that lr
value (the Bcc).

These are then converted into a single t2WhileLoopStartLR at the same
point as t2DoLoopStartTP and t2LoopEndDec. If that conversion is not
possible, we revert the loop to prevent the pair from progressing
further in the pipeline. The t2WhileLoopStartLR is a single instruction
that takes a GPR and produces LR, similar to the WLS instruction.

  %1:gprlr = t2WhileLoopStartLR %0:rgpr, %bb.3
  t2B %bb.1
...
bb.2.loop:
  %2:gprlr = PHI %1:gprlr, %bb.1, %3:gprlr, %bb.2
  ...
  %3:gprlr = t2LoopEndDec %2:gprlr, %bb.2
  t2B %bb.3

The t2WhileLoopStartLR can then be treated similarly to the other low
overhead loop pseudos, eventually being lowered to a WLS provided the
branches are within range.
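
The "within range" condition can be sketched as follows. This is a
simplified, standalone model of the check in ARMLowOverheadLoops.cpp,
not the real code: wlsTargetInRange is a hypothetical name, offsets are
assumed to be byte offsets from the start of the function, and the
model ignores any PC-relative adjustment that the real
ARMBasicBlockUtils::isBBInRange may apply. Only the 4094 limit and the
"target must not be before the WLS" rule are taken from the patch.

```cpp
#include <cassert> // only needed for the checks in the usage note
#include <cstdint>

// Simplified model: a WLS can only branch forwards, and only by a
// limited distance. Anything else means the loop has to be reverted.
bool wlsTargetInRange(uint32_t StartOffset, uint32_t TargetOffset) {
  const uint32_t MaxForwardDisp = 4094; // limit used by the patch
  if (StartOffset > TargetOffset)
    return false; // the branch would go backwards
  return TargetOffset - StartOffset <= MaxForwardDisp;
}
```

For example, under this model wlsTargetInRange(0, 4094) holds, while
wlsTargetInRange(0, 4096) and wlsTargetInRange(100, 40) do not; in the
latter two cases the pass would fall back to RevertWhile.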

Differential Revision: https://reviews.llvm.org/D97729

Added: 
    

Modified: 
    llvm/docs/LangRef.rst
    llvm/include/llvm/IR/Intrinsics.td
    llvm/lib/CodeGen/HardwareLoops.cpp
    llvm/lib/Target/ARM/ARMBaseInstrInfo.cpp
    llvm/lib/Target/ARM/ARMBaseInstrInfo.h
    llvm/lib/Target/ARM/ARMBlockPlacement.cpp
    llvm/lib/Target/ARM/ARMISelDAGToDAG.cpp
    llvm/lib/Target/ARM/ARMISelLowering.cpp
    llvm/lib/Target/ARM/ARMISelLowering.h
    llvm/lib/Target/ARM/ARMInstrThumb2.td
    llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
    llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
    llvm/lib/Target/ARM/MVETPAndVPTOptimisationsPass.cpp
    llvm/lib/Target/ARM/MVETailPredUtils.h
    llvm/lib/Target/ARM/MVETailPredication.cpp
    llvm/test/CodeGen/Thumb2/LowOverheadLoops/add_reduce.mir
    llvm/test/CodeGen/Thumb2/LowOverheadLoops/biquad-cascade-default.mir
    llvm/test/CodeGen/Thumb2/LowOverheadLoops/biquad-cascade-optsize-strd-lr.mir
    llvm/test/CodeGen/Thumb2/LowOverheadLoops/biquad-cascade-optsize.mir
    llvm/test/CodeGen/Thumb2/LowOverheadLoops/branch-targets.ll
    llvm/test/CodeGen/Thumb2/LowOverheadLoops/fast-fp-loops.ll
    llvm/test/CodeGen/Thumb2/LowOverheadLoops/loop-guards.ll
    llvm/test/CodeGen/Thumb2/LowOverheadLoops/mve-float-loops.ll
    llvm/test/CodeGen/Thumb2/LowOverheadLoops/predicated-liveout.mir
    llvm/test/CodeGen/Thumb2/LowOverheadLoops/revert-non-loop.mir
    llvm/test/CodeGen/Thumb2/LowOverheadLoops/revert-while.mir
    llvm/test/CodeGen/Thumb2/LowOverheadLoops/vmaxmin_vpred_r.mir
    llvm/test/CodeGen/Thumb2/LowOverheadLoops/vmldava_in_vpt.mir
    llvm/test/CodeGen/Thumb2/LowOverheadLoops/while-loops.ll
    llvm/test/CodeGen/Thumb2/LowOverheadLoops/while-negative-offset.mir
    llvm/test/CodeGen/Thumb2/LowOverheadLoops/while.mir
    llvm/test/CodeGen/Thumb2/LowOverheadLoops/wlstp.mir
    llvm/test/CodeGen/Thumb2/block-placement.mir
    llvm/test/CodeGen/Thumb2/mve-float16regloops.ll
    llvm/test/CodeGen/Thumb2/mve-float32regloops.ll
    llvm/test/CodeGen/Thumb2/mve-postinc-distribute.ll
    llvm/test/CodeGen/Thumb2/mve-postinc-lsr.ll
    llvm/test/CodeGen/Thumb2/mve-vmaxnma-commute.ll
    llvm/test/Transforms/HardwareLoops/ARM/do-rem.ll
    llvm/test/Transforms/HardwareLoops/ARM/simple-do.ll
    llvm/test/Transforms/HardwareLoops/ARM/structure.ll
    llvm/test/Transforms/HardwareLoops/loop-guards.ll
    llvm/test/Transforms/HardwareLoops/scalar-while.ll

Removed: 
    


################################################################################
diff  --git a/llvm/docs/LangRef.rst b/llvm/docs/LangRef.rst
index 666f7d5b8c5b..d53795ef5607 100644
--- a/llvm/docs/LangRef.rst
+++ b/llvm/docs/LangRef.rst
@@ -15965,6 +15965,46 @@ set up the hardware-loop count with a target specific instruction, usually a
 move of this value to a special register or a hardware-loop instruction.
 The result is the conditional value of whether the given count is not zero.
 
+
+'``llvm.test.start.loop.iterations.*``' Intrinsic
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+This is an overloaded intrinsic.
+
+::
+
+      declare {i32, i1} @llvm.test.start.loop.iterations.i32(i32)
+      declare {i64, i1} @llvm.test.start.loop.iterations.i64(i64)
+
+Overview:
+"""""""""
+
+The '``llvm.test.start.loop.iterations.*``' intrinsics are similar to the
+'``llvm.test.set.loop.iterations.*``' and '``llvm.start.loop.iterations.*``'
+intrinsics, used to specify the hardware-loop trip count, but also produce a
+value identical to the input that can be used as the input to the loop. The
+second i1 output controls entry to a while-loop.
+
+Arguments:
+""""""""""
+
+The integer operand is the loop trip count of the hardware-loop, and thus
+not e.g. the loop back-edge taken count.
+
+Semantics:
+""""""""""
+
+The '``llvm.test.start.loop.iterations.*``' intrinsics do not perform any
+arithmetic on their operand. It's a hint to the backend that can use this to
+set up the hardware-loop count with a target specific instruction, usually a
+move of this value to a special register or a hardware-loop instruction.
+The result is a pair of the input and a conditional value of whether the
+given count is not zero.
+
+
 '``llvm.loop.decrement.reg.*``' Intrinsic
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 

diff  --git a/llvm/include/llvm/IR/Intrinsics.td b/llvm/include/llvm/IR/Intrinsics.td
index 668877c5f592..9102b97480bd 100644
--- a/llvm/include/llvm/IR/Intrinsics.td
+++ b/llvm/include/llvm/IR/Intrinsics.td
@@ -1601,6 +1601,12 @@ def int_start_loop_iterations :
 def int_test_set_loop_iterations :
   DefaultAttrsIntrinsic<[llvm_i1_ty], [llvm_anyint_ty], [IntrNoDuplicate]>;
 
+// Same as the above, but produces an extra value (the same as the input
+// operand) to be fed into the loop.
+def int_test_start_loop_iterations :
+  DefaultAttrsIntrinsic<[llvm_anyint_ty, llvm_i1_ty], [LLVMMatchType<0>],
+                        [IntrNoDuplicate]>;
+
 // Decrement loop counter by the given argument. Return false if the loop
 // should exit.
 def int_loop_decrement :

diff  --git a/llvm/lib/CodeGen/HardwareLoops.cpp b/llvm/lib/CodeGen/HardwareLoops.cpp
index 61392186f7cd..8cb04a13fd62 100644
--- a/llvm/lib/CodeGen/HardwareLoops.cpp
+++ b/llvm/lib/CodeGen/HardwareLoops.cpp
@@ -434,25 +434,32 @@ Value* HardwareLoop::InsertIterationSetup(Value *LoopCountInit) {
   IRBuilder<> Builder(BeginBB->getTerminator());
   Type *Ty = LoopCountInit->getType();
   bool UsePhi = UsePHICounter || ForceHardwareLoopPHI;
-  Intrinsic::ID ID = UseLoopGuard ? Intrinsic::test_set_loop_iterations
-                                  : (UsePhi ? Intrinsic::start_loop_iterations
-                                           : Intrinsic::set_loop_iterations);
+  Intrinsic::ID ID = UseLoopGuard
+                         ? (UsePhi ? Intrinsic::test_start_loop_iterations
+                                   : Intrinsic::test_set_loop_iterations)
+                         : (UsePhi ? Intrinsic::start_loop_iterations
+                                   : Intrinsic::set_loop_iterations);
   Function *LoopIter = Intrinsic::getDeclaration(M, ID, Ty);
-  Value *SetCount = Builder.CreateCall(LoopIter, LoopCountInit);
+  Value *LoopSetup = Builder.CreateCall(LoopIter, LoopCountInit);
 
   // Use the return value of the intrinsic to control the entry of the loop.
   if (UseLoopGuard) {
     assert((isa<BranchInst>(BeginBB->getTerminator()) &&
             cast<BranchInst>(BeginBB->getTerminator())->isConditional()) &&
            "Expected conditional branch");
+
+    Value *SetCount =
+        UsePhi ? Builder.CreateExtractValue(LoopSetup, 1) : LoopSetup;
     auto *LoopGuard = cast<BranchInst>(BeginBB->getTerminator());
     LoopGuard->setCondition(SetCount);
     if (LoopGuard->getSuccessor(0) != L->getLoopPreheader())
       LoopGuard->swapSuccessors();
   }
-  LLVM_DEBUG(dbgs() << "HWLoops: Inserted loop counter: "
-             << *SetCount << "\n");
-  return UseLoopGuard ? LoopCountInit : SetCount;
+  LLVM_DEBUG(dbgs() << "HWLoops: Inserted loop counter: " << *LoopSetup
+                    << "\n");
+  if (UsePhi && UseLoopGuard)
+    LoopSetup = Builder.CreateExtractValue(LoopSetup, 0);
+  return !UsePhi ? LoopCountInit : LoopSetup;
 }
 
 void HardwareLoop::InsertLoopDec() {

diff  --git a/llvm/lib/Target/ARM/ARMBaseInstrInfo.cpp b/llvm/lib/Target/ARM/ARMBaseInstrInfo.cpp
index 21b6caf13366..ed67b1eeb78e 100644
--- a/llvm/lib/Target/ARM/ARMBaseInstrInfo.cpp
+++ b/llvm/lib/Target/ARM/ARMBaseInstrInfo.cpp
@@ -6126,8 +6126,8 @@ ARMBaseInstrInfo::getOutliningType(MachineBasicBlock::iterator &MIT,
   // Be conservative with ARMv8.1 MVE instructions.
   if (Opc == ARM::t2BF_LabelPseudo || Opc == ARM::t2DoLoopStart ||
       Opc == ARM::t2DoLoopStartTP || Opc == ARM::t2WhileLoopStart ||
-      Opc == ARM::t2LoopDec || Opc == ARM::t2LoopEnd ||
-      Opc == ARM::t2LoopEndDec)
+      Opc == ARM::t2WhileLoopStartLR || Opc == ARM::t2LoopDec ||
+      Opc == ARM::t2LoopEnd || Opc == ARM::t2LoopEndDec)
     return outliner::InstrType::Illegal;
 
   const MCInstrDesc &MCID = MI.getDesc();

diff  --git a/llvm/lib/Target/ARM/ARMBaseInstrInfo.h b/llvm/lib/Target/ARM/ARMBaseInstrInfo.h
index c0d94b675c24..c308d8fde05d 100644
--- a/llvm/lib/Target/ARM/ARMBaseInstrInfo.h
+++ b/llvm/lib/Target/ARM/ARMBaseInstrInfo.h
@@ -646,7 +646,8 @@ static inline bool isJumpTableBranchOpcode(int Opc) {
 
 static inline bool isLowOverheadTerminatorOpcode(int Opc) {
   return Opc == ARM::t2DoLoopStartTP || Opc == ARM::t2WhileLoopStart ||
-         Opc == ARM::t2LoopEnd || Opc == ARM::t2LoopEndDec;
+         Opc == ARM::t2WhileLoopStartLR || Opc == ARM::t2LoopEnd ||
+         Opc == ARM::t2LoopEndDec;
 }
 
 static inline

diff  --git a/llvm/lib/Target/ARM/ARMBlockPlacement.cpp b/llvm/lib/Target/ARM/ARMBlockPlacement.cpp
index 581b4b9857af..155a6cb18d85 100644
--- a/llvm/lib/Target/ARM/ARMBlockPlacement.cpp
+++ b/llvm/lib/Target/ARM/ARMBlockPlacement.cpp
@@ -83,9 +83,9 @@ bool ARMBlockPlacement::runOnMachineFunction(MachineFunction &MF) {
       continue;
 
     for (auto &Terminator : Preheader->terminators()) {
-      if (Terminator.getOpcode() != ARM::t2WhileLoopStart)
+      if (Terminator.getOpcode() != ARM::t2WhileLoopStartLR)
         continue;
-      MachineBasicBlock *LoopExit = Terminator.getOperand(1).getMBB();
+      MachineBasicBlock *LoopExit = Terminator.getOperand(2).getMBB();
       // We don't want to move the function's entry block.
       if (!LoopExit->getPrevNode())
         continue;
@@ -99,7 +99,7 @@ bool ARMBlockPlacement::runOnMachineFunction(MachineFunction &MF) {
       // that were previously not backwards to become backwards
       bool CanMove = true;
       for (auto &LoopExitTerminator : LoopExit->terminators()) {
-        if (LoopExitTerminator.getOpcode() != ARM::t2WhileLoopStart)
+        if (LoopExitTerminator.getOpcode() != ARM::t2WhileLoopStartLR)
           continue;
         // An example loop structure where the LoopExit can't be moved, since
         // bb1's WLS will become backwards once it's moved after bb3 bb1: -
@@ -111,7 +111,7 @@ bool ARMBlockPlacement::runOnMachineFunction(MachineFunction &MF) {
         //      WLS bb1
         // bb4:          - Header
         MachineBasicBlock *LoopExit2 =
-            LoopExitTerminator.getOperand(1).getMBB();
+            LoopExitTerminator.getOperand(2).getMBB();
         // If the WLS from LoopExit to LoopExit2 is already backwards then
         // moving LoopExit won't affect it, so it can be moved. If LoopExit2 is
         // after the Preheader then moving will keep it as a forward branch, so

diff  --git a/llvm/lib/Target/ARM/ARMISelDAGToDAG.cpp b/llvm/lib/Target/ARM/ARMISelDAGToDAG.cpp
index 007daa01b1d0..e13a6723abd0 100644
--- a/llvm/lib/Target/ARM/ARMISelDAGToDAG.cpp
+++ b/llvm/lib/Target/ARM/ARMISelDAGToDAG.cpp
@@ -3778,13 +3778,26 @@ void ARMDAGToDAGISel::Select(SDNode *N) {
       return;
     // Other cases are autogenerated.
     break;
-  case ARMISD::WLS:
+  case ARMISD::WLSSETUP: {
+    SDNode *New = CurDAG->getMachineNode(ARM::t2WhileLoopSetup, dl, MVT::i32,
+                                         N->getOperand(0));
+    ReplaceUses(N, New);
+    CurDAG->RemoveDeadNode(N);
+    return;
+  }
+  case ARMISD::WLS: {
+    SDNode *New = CurDAG->getMachineNode(ARM::t2WhileLoopStart, dl, MVT::Other,
+                                         N->getOperand(1), N->getOperand(2),
+                                         N->getOperand(0));
+    ReplaceUses(N, New);
+    CurDAG->RemoveDeadNode(N);
+    return;
+  }
   case ARMISD::LE: {
     SDValue Ops[] = { N->getOperand(1),
                       N->getOperand(2),
                       N->getOperand(0) };
-    unsigned Opc = N->getOpcode() == ARMISD::WLS ?
-      ARM::t2WhileLoopStart : ARM::t2LoopEnd;
+    unsigned Opc = ARM::t2LoopEnd;
     SDNode *New = CurDAG->getMachineNode(Opc, dl, MVT::Other, Ops);
     ReplaceUses(N, New);
     CurDAG->RemoveDeadNode(N);

diff  --git a/llvm/lib/Target/ARM/ARMISelLowering.cpp b/llvm/lib/Target/ARM/ARMISelLowering.cpp
index 48a499ee39fc..51cb35daa5df 100644
--- a/llvm/lib/Target/ARM/ARMISelLowering.cpp
+++ b/llvm/lib/Target/ARM/ARMISelLowering.cpp
@@ -1806,6 +1806,7 @@ const char *ARMTargetLowering::getTargetNodeName(unsigned Opcode) const {
   case ARMISD::VST3LN_UPD:    return "ARMISD::VST3LN_UPD";
   case ARMISD::VST4LN_UPD:    return "ARMISD::VST4LN_UPD";
   case ARMISD::WLS:           return "ARMISD::WLS";
+  case ARMISD::WLSSETUP:      return "ARMISD::WLSSETUP";
   case ARMISD::LE:            return "ARMISD::LE";
   case ARMISD::LOOP_DEC:      return "ARMISD::LOOP_DEC";
   case ARMISD::CSINV:         return "ARMISD::CSINV";
@@ -16193,7 +16194,7 @@ static SDValue SearchLoopIntrinsic(SDValue N, ISD::CondCode &CC, int &Imm,
   }
   case ISD::INTRINSIC_W_CHAIN: {
     unsigned IntOp = cast<ConstantSDNode>(N.getOperand(1))->getZExtValue();
-    if (IntOp != Intrinsic::test_set_loop_iterations &&
+    if (IntOp != Intrinsic::test_start_loop_iterations &&
         IntOp != Intrinsic::loop_decrement_reg)
       return SDValue();
     return N;
@@ -16208,7 +16209,7 @@ static SDValue PerformHWLoopCombine(SDNode *N,
 
   // The hwloop intrinsics that we're interested are used for control-flow,
   // either for entering or exiting the loop:
-  // - test.set.loop.iterations will test whether its operand is zero. If it
+  // - test.start.loop.iterations will test whether its operand is zero. If it
   //   is zero, the proceeding branch should not enter the loop.
   // - loop.decrement.reg also tests whether its operand is zero. If it is
   //   zero, the proceeding branch should not branch back to the beginning of
@@ -16283,21 +16284,25 @@ static SDValue PerformHWLoopCombine(SDNode *N,
     DAG.ReplaceAllUsesOfValueWith(SDValue(Br, 0), NewBr);
   };
 
-  if (IntOp == Intrinsic::test_set_loop_iterations) {
+  if (IntOp == Intrinsic::test_start_loop_iterations) {
     SDValue Res;
+    SDValue Setup = DAG.getNode(ARMISD::WLSSETUP, dl, MVT::i32, Elements);
     // We expect this 'instruction' to branch when the counter is zero.
     if (IsTrueIfZero(CC, Imm)) {
-      SDValue Ops[] = { Chain, Elements, Dest };
+      SDValue Ops[] = {Chain, Setup, Dest};
       Res = DAG.getNode(ARMISD::WLS, dl, MVT::Other, Ops);
     } else {
       // The logic is the reverse of what we need for WLS, so find the other
       // basic block target: the target of the proceeding br.
       UpdateUncondBr(Br, Dest, DAG);
 
-      SDValue Ops[] = { Chain, Elements, OtherTarget };
+      SDValue Ops[] = {Chain, Setup, OtherTarget};
       Res = DAG.getNode(ARMISD::WLS, dl, MVT::Other, Ops);
     }
-    DAG.ReplaceAllUsesOfValueWith(Int.getValue(1), Int.getOperand(0));
+    // Update LR count to the new value
+    DAG.ReplaceAllUsesOfValueWith(Int.getValue(0), Setup);
+    // Update chain
+    DAG.ReplaceAllUsesOfValueWith(Int.getValue(2), Int.getOperand(0));
     return Res;
   } else {
     SDValue Size = DAG.getTargetConstant(

diff  --git a/llvm/lib/Target/ARM/ARMISelLowering.h b/llvm/lib/Target/ARM/ARMISelLowering.h
index 87b9878862f8..bedf83574425 100644
--- a/llvm/lib/Target/ARM/ARMISelLowering.h
+++ b/llvm/lib/Target/ARM/ARMISelLowering.h
@@ -130,7 +130,8 @@ class VectorType;
       WIN__CHKSTK,  // Windows' __chkstk call to do stack probing.
       WIN__DBZCHK,  // Windows' divide by zero check
 
-      WLS,          // Low-overhead loops, While Loop Start
+      WLS,          // Low-overhead loops, While Loop Start branch. See t2WhileLoopStart
+      WLSSETUP,     // Setup for the iteration count of a WLS. See t2WhileLoopSetup.
       LOOP_DEC,     // Really a part of LE, performs the sub
       LE,           // Low-overhead loops, Loop End
 

diff  --git a/llvm/lib/Target/ARM/ARMInstrThumb2.td b/llvm/lib/Target/ARM/ARMInstrThumb2.td
index 90dcc3bcac4d..f562e8952f27 100644
--- a/llvm/lib/Target/ARM/ARMInstrThumb2.td
+++ b/llvm/lib/Target/ARM/ARMInstrThumb2.td
@@ -5457,33 +5457,65 @@ def t2LE : t2LOL<(outs ), (ins lelabel_u11:$label), "le", "$label"> {
 
 let Predicates = [IsThumb2, HasV8_1MMainline, HasLOB] in {
 
+// t2DoLoopStart a pseudo for DLS hardware loops. Lowered into a DLS in
+// ARMLowOverheadLoops if possible, or reverted to a Mov if not.
 let usesCustomInserter = 1 in
 def t2DoLoopStart :
   t2PseudoInst<(outs GPRlr:$X), (ins rGPR:$elts), 4, IIC_Br,
   [(set GPRlr:$X, (int_start_loop_iterations rGPR:$elts))]>;
 
+// A pseudo for a DLSTP, created in the MVETPAndVPTOptimizationPass from a
+// t2DoLoopStart if the loops is tail predicated. Holds both the element
+// count and trip count of the loop, picking the correct one during
+// ARMLowOverheadLoops when it is converted to a DLSTP or DLS as required.
 let isTerminator = 1, hasSideEffects = 1 in
 def t2DoLoopStartTP :
   t2PseudoInst<(outs GPRlr:$X), (ins rGPR:$elts, rGPR:$count), 4, IIC_Br, []>;
 
+// Setup for a t2WhileLoopStart. A pair of t2WhileLoopSetup and t2WhileLoopStart
+// will be created post-ISel from a llvm.test.start.loop.iterations. This
+// t2WhileLoopSetup to setup LR and t2WhileLoopStart to perform the branch. Not
+// valid after reg alloc, as it should be lowered during MVETPAndVPTOptimisations
+// into a t2WhileLoopStartLR (or expanded).
+def t2WhileLoopSetup :
+  t2PseudoInst<(outs GPRlr:$lr), (ins rGPR:$elts), 4, IIC_Br, []>;
+
+// A pseudo to represent the decrement in a low overhead loop. A t2LoopDec and
+// t2LoopEnd together represent a LE instruction. Ideally these are converted
+// to a t2LoopEndDec which is lowered as a single instruction.
 let hasSideEffects = 0 in
 def t2LoopDec :
   t2PseudoInst<(outs GPRlr:$Rm), (ins GPRlr:$Rn, imm0_7:$size),
                4, IIC_Br, []>, Sched<[WriteBr]>;
 
 let isBranch = 1, isTerminator = 1, hasSideEffects = 1, Defs = [CPSR] in {
-// Set WhileLoopStart and LoopEnd to occupy 8 bytes because they may
-// get converted into t2CMP and t2Bcc.
+// The branch in a t2WhileLoopSetup/t2WhileLoopStart pair, eventually turned
+// into a t2WhileLoopStartLR that does both the LR setup and branch.
 def t2WhileLoopStart :
     t2PseudoInst<(outs),
+                 (ins GPRlr:$elts, brtarget:$target),
+                 4, IIC_Br, []>,
+                 Sched<[WriteBr]>;
+
+// WhileLoopStartLR that sets up LR and branches on zero, equivalent to WLS. It
+// is lowered in the ARMLowOverheadLoops pass providing the branches are within
+// range. WhileLoopStartLR and LoopEnd to occupy 8 bytes because they may get
+// converted into t2CMP and t2Bcc.
+def t2WhileLoopStartLR :
+    t2PseudoInst<(outs GPRlr:$lr),
                  (ins rGPR:$elts, brtarget:$target),
                  8, IIC_Br, []>,
                  Sched<[WriteBr]>;
 
+// t2LoopEnd - the branch half of a t2LoopDec/t2LoopEnd pair.
 def t2LoopEnd :
   t2PseudoInst<(outs), (ins GPRlr:$elts, brtarget:$target),
   8, IIC_Br, []>, Sched<[WriteBr]>;
 
+// The combination of a t2LoopDec and t2LoopEnd, performing both the LR
+// decrement and branch as a single instruction. Is lowered to a LE or
+// LETP in ARMLowOverheadLoops as appropriate, or converted to t2CMP/t2Bcc
+// if the branches are out of range.
 def t2LoopEndDec :
   t2PseudoInst<(outs GPRlr:$Rm), (ins GPRlr:$elts, brtarget:$target),
   8, IIC_Br, []>, Sched<[WriteBr]>;

diff  --git a/llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp b/llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
index 1924739afbd0..69cdfe269c18 100644
--- a/llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
+++ b/llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
@@ -31,7 +31,7 @@
 /// during the transform and pseudo instructions are replaced by real ones. In
 /// some cases, when we have to revert to a 'normal' loop, we have to introduce
 /// multiple instructions for a single pseudo (see RevertWhile and
-/// RevertLoopEnd). To handle this situation, t2WhileLoopStart and t2LoopEnd
+/// RevertLoopEnd). To handle this situation, t2WhileLoopStartLR and t2LoopEnd
 /// are defined to be as large as this maximum sequence of replacement
 /// instructions.
 ///
@@ -102,7 +102,7 @@ static bool shouldInspect(MachineInstr &MI) {
 }
 
 static bool isDo(MachineInstr *MI) {
-  return MI->getOpcode() != ARM::t2WhileLoopStart;
+  return MI->getOpcode() != ARM::t2WhileLoopStartLR;
 }
 
 namespace {
@@ -442,7 +442,7 @@ namespace {
     MachineOperand &getLoopStartOperand() {
       if (IsTailPredicationLegal())
         return TPNumElements;
-      return isDo(Start) ? Start->getOperand(1) : Start->getOperand(0);
+      return Start->getOperand(1);
     }
 
     unsigned getStartOpcode() const {
@@ -1064,53 +1064,20 @@ void LowOverheadLoop::Validate(ARMBasicBlockUtils *BBUtils) {
       return false;
     }
 
-    if (Start->getOpcode() == ARM::t2WhileLoopStart &&
+    if (Start->getOpcode() == ARM::t2WhileLoopStartLR &&
         (BBUtils->getOffsetOf(Start) >
-         BBUtils->getOffsetOf(Start->getOperand(1).getMBB()) ||
-         !BBUtils->isBBInRange(Start, Start->getOperand(1).getMBB(), 4094))) {
+             BBUtils->getOffsetOf(Start->getOperand(2).getMBB()) ||
+         !BBUtils->isBBInRange(Start, Start->getOperand(2).getMBB(), 4094))) {
       LLVM_DEBUG(dbgs() << "ARM Loops: WLS offset is out-of-range!\n");
       return false;
     }
     return true;
   };
 
-  // Find a suitable position to insert the loop start instruction. It needs to
-  // be able to safely define LR.
-  auto FindStartInsertionPoint = [](MachineInstr *Start, MachineInstr *Dec,
-                                    MachineBasicBlock::iterator &InsertPt,
-                                    MachineBasicBlock *&InsertBB,
-                                    ReachingDefAnalysis &RDA,
-                                    InstSet &ToRemove) {
-    // For a t2DoLoopStart it is always valid to use the start insertion point.
-    // For WLS we can define LR if LR already contains the same value.
-    if (isDo(Start) || Start->getOperand(0).getReg() == ARM::LR) {
-      InsertPt = MachineBasicBlock::iterator(Start);
-      InsertBB = Start->getParent();
-      return true;
-    }
-
-    // We've found no suitable LR def and Start doesn't use LR directly. Can we
-    // just define LR anyway?
-    if (!RDA.isSafeToDefRegAt(Start, MCRegister::from(ARM::LR)))
-      return false;
-
-    InsertPt = MachineBasicBlock::iterator(Start);
-    InsertBB = Start->getParent();
-    return true;
-  };
-
-  if (!FindStartInsertionPoint(Start, Dec, StartInsertPt, StartInsertBB, RDA,
-                               ToRemove)) {
-    LLVM_DEBUG(dbgs() << "ARM Loops: Unable to find safe insertion point.\n");
-    Revert = true;
-    return;
-  }
-  LLVM_DEBUG(if (StartInsertPt == StartInsertBB->end())
-               dbgs() << "ARM Loops: Will insert LoopStart at end of block\n";
-             else
-               dbgs() << "ARM Loops: Will insert LoopStart at "
-                      << *StartInsertPt
-            );
+  StartInsertPt = MachineBasicBlock::iterator(Start);
+  StartInsertBB = Start->getParent();
+  LLVM_DEBUG(dbgs() << "ARM Loops: Will insert LoopStart at "
+                    << *StartInsertPt);
 
   Revert = !ValidateRanges(Start, End, BBUtils, ML);
   CannotTailPredicate = !ValidateTailPredicate();
@@ -1317,6 +1284,9 @@ bool ARMLowOverheadLoops::ProcessLoop(MachineLoop *ML) {
     return false;
   }
 
+  assert(LoLoop.Start->getOpcode() != ARM::t2WhileLoopStart &&
+         "Expected t2WhileLoopStart to be removed before regalloc!");
+
   // Check that the only instruction using LoopDec is LoopEnd. This can only
   // happen when the Dec and End are separate, not a single t2LoopEndDec.
   // TODO: Check for copy chains that really have no effect.
@@ -1339,11 +1309,11 @@ bool ARMLowOverheadLoops::ProcessLoop(MachineLoop *ML) {
 // another low register.
 void ARMLowOverheadLoops::RevertWhile(MachineInstr *MI) const {
   LLVM_DEBUG(dbgs() << "ARM Loops: Reverting to cmp: " << *MI);
-  MachineBasicBlock *DestBB = MI->getOperand(1).getMBB();
+  MachineBasicBlock *DestBB = MI->getOperand(2).getMBB();
   unsigned BrOpc = BBUtils->isBBInRange(MI, DestBB, 254) ?
     ARM::tBcc : ARM::t2Bcc;
 
-  RevertWhileLoopStart(MI, TII, BrOpc);
+  RevertWhileLoopStartLR(MI, TII, BrOpc);
 }
 
 void ARMLowOverheadLoops::RevertDo(MachineInstr *MI) const {
@@ -1478,7 +1448,7 @@ MachineInstr* ARMLowOverheadLoops::ExpandLoopStart(LowOverheadLoop &LoLoop) {
     MIB.addDef(ARM::LR);
     MIB.add(Count);
     if (!isDo(Start))
-      MIB.add(Start->getOperand(1));
+      MIB.add(Start->getOperand(2));
 
     LLVM_DEBUG(dbgs() << "ARM Loops: Inserted start: " << *MIB);
     NewStart = &*MIB;
@@ -1657,7 +1627,7 @@ void ARMLowOverheadLoops::Expand(LowOverheadLoop &LoLoop) {
   };
 
   if (LoLoop.Revert) {
-    if (LoLoop.Start->getOpcode() == ARM::t2WhileLoopStart)
+    if (LoLoop.Start->getOpcode() == ARM::t2WhileLoopStartLR)
       RevertWhile(LoLoop.Start);
     else
       RevertDo(LoLoop.Start);
@@ -1728,7 +1698,7 @@ bool ARMLowOverheadLoops::RevertNonLoops() {
     Changed = true;
 
     for (auto *Start : Starts) {
-      if (Start->getOpcode() == ARM::t2WhileLoopStart)
+      if (Start->getOpcode() == ARM::t2WhileLoopStartLR)
         RevertWhile(Start);
       else
         RevertDo(Start);

diff  --git a/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp b/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
index 4b403c830374..92e523eed6be 100644
--- a/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
+++ b/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
@@ -1870,7 +1870,7 @@ bool ARMTTIImpl::isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,
       default:
         break;
       case Intrinsic::start_loop_iterations:
-      case Intrinsic::test_set_loop_iterations:
+      case Intrinsic::test_start_loop_iterations:
       case Intrinsic::loop_decrement:
       case Intrinsic::loop_decrement_reg:
         return true;

diff  --git a/llvm/lib/Target/ARM/MVETPAndVPTOptimisationsPass.cpp b/llvm/lib/Target/ARM/MVETPAndVPTOptimisationsPass.cpp
index 3336fb0935e6..c246ab3402bd 100644
--- a/llvm/lib/Target/ARM/MVETPAndVPTOptimisationsPass.cpp
+++ b/llvm/lib/Target/ARM/MVETPAndVPTOptimisationsPass.cpp
@@ -64,6 +64,7 @@ class MVETPAndVPTOptimisations : public MachineFunctionPass {
   }
 
 private:
+  bool LowerWhileLoopStart(MachineLoop *ML);
   bool MergeLoopEnd(MachineLoop *ML);
   bool ConvertTailPredLoop(MachineLoop *ML, MachineDominatorTree *DT);
   MachineInstr &ReplaceRegisterUseWithVPNOT(MachineBasicBlock &MBB,
@@ -164,7 +165,9 @@ static bool findLoopComponents(MachineLoop *ML, MachineRegisterInfo *MRI,
                           ? LoopPhi->getOperand(3).getReg()
                           : LoopPhi->getOperand(1).getReg();
   LoopStart = LookThroughCOPY(MRI->getVRegDef(StartReg), MRI);
-  if (!LoopStart || LoopStart->getOpcode() != ARM::t2DoLoopStart) {
+  if (!LoopStart || (LoopStart->getOpcode() != ARM::t2DoLoopStart &&
+                     LoopStart->getOpcode() != ARM::t2WhileLoopSetup &&
+                     LoopStart->getOpcode() != ARM::t2WhileLoopStartLR)) {
     LLVM_DEBUG(dbgs() << "  didn't find Start where we expected!\n");
     return false;
   }
@@ -173,6 +176,82 @@ static bool findLoopComponents(MachineLoop *ML, MachineRegisterInfo *MRI,
   return true;
 }
 
+static void RevertWhileLoopSetup(MachineInstr *MI, const TargetInstrInfo *TII) {
+  MachineBasicBlock *MBB = MI->getParent();
+  assert(MI->getOpcode() == ARM::t2WhileLoopSetup &&
+         "Only expected a t2WhileLoopSetup in RevertWhileLoopStart!");
+
+  // Subs
+  MachineInstrBuilder MIB =
+      BuildMI(*MBB, MI, MI->getDebugLoc(), TII->get(ARM::t2SUBri));
+  MIB.add(MI->getOperand(0));
+  MIB.add(MI->getOperand(1));
+  MIB.addImm(0);
+  MIB.addImm(ARMCC::AL);
+  MIB.addReg(ARM::NoRegister);
+  MIB.addReg(ARM::CPSR, RegState::Define);
+
+  // Attempt to find a t2WhileLoopStart and revert to a t2Bcc.
+  for (MachineInstr &I : MBB->terminators()) {
+    if (I.getOpcode() == ARM::t2WhileLoopStart) {
+      MachineInstrBuilder MIB =
+          BuildMI(*MBB, &I, I.getDebugLoc(), TII->get(ARM::t2Bcc));
+      MIB.add(MI->getOperand(1)); // branch target
+      MIB.addImm(ARMCC::EQ);
+      MIB.addReg(ARM::CPSR);
+      I.eraseFromParent();
+      break;
+    }
+  }
+
+  MI->eraseFromParent();
+}
+
+// The Hardware Loop insertion and ISel Lowering produce the pseudos for the
+// start of a while loop:
+//   %a:gprlr = t2WhileLoopSetup %Cnt
+//   t2WhileLoopStart %a, %BB
+// We want to convert those to a single instruction which, like t2LoopEndDec and
+// t2DoLoopStartTP is both a terminator and produces a value:
+//   %a:grplr: t2WhileLoopStartLR %Cnt, %BB
+//
+// Otherwise if we can't, we revert the loop. t2WhileLoopSetup and
+// t2WhileLoopStart are not valid past regalloc.
+bool MVETPAndVPTOptimisations::LowerWhileLoopStart(MachineLoop *ML) {
+  LLVM_DEBUG(dbgs() << "LowerWhileLoopStart on loop "
+                    << ML->getHeader()->getName() << "\n");
+
+  MachineInstr *LoopEnd, *LoopPhi, *LoopStart, *LoopDec;
+  if (!findLoopComponents(ML, MRI, LoopStart, LoopPhi, LoopDec, LoopEnd))
+    return false;
+
+  if (LoopStart->getOpcode() != ARM::t2WhileLoopSetup)
+    return false;
+
+  Register LR = LoopStart->getOperand(0).getReg();
+  auto WLSIt = find_if(MRI->use_nodbg_instructions(LR), [](auto &MI) {
+    return MI.getOpcode() == ARM::t2WhileLoopStart;
+  });
+  if (!MergeEndDec || WLSIt == MRI->use_instr_nodbg_end()) {
+    RevertWhileLoopSetup(LoopStart, TII);
+    RevertLoopDec(LoopDec, TII);
+    RevertLoopEnd(LoopEnd, TII);
+    return true;
+  }
+
+  MachineInstrBuilder MI =
+      BuildMI(*WLSIt->getParent(), *WLSIt, WLSIt->getDebugLoc(),
+              TII->get(ARM::t2WhileLoopStartLR), LR)
+          .add(LoopStart->getOperand(1))
+          .add(WLSIt->getOperand(1));
+  (void)MI;
+  LLVM_DEBUG(dbgs() << "Lowered WhileLoopStart into: " << *MI.getInstr());
+
+  WLSIt->eraseFromParent();
+  LoopStart->eraseFromParent();
+  return true;
+}
+
 // This function converts loops with t2LoopDec and t2LoopEnd instructions into
 // a single t2LoopEndDec instruction. To do that it needs to make sure that LR
 // will be valid to be used for the low overhead loop, which means nothing else
@@ -192,12 +271,19 @@ bool MVETPAndVPTOptimisations::MergeLoopEnd(MachineLoop *ML) {
     return false;
 
   // Check if there is an illegal instruction (a call) in the low overhead loop
-  // and if so revert it now before we get any further.
-  for (MachineBasicBlock *MBB : ML->blocks()) {
+  // and if so revert it now before we get any further. While loops also need to
+  // check the preheaders.
+  SmallPtrSet<MachineBasicBlock *, 4> MBBs(ML->block_begin(), ML->block_end());
+  if (LoopStart->getOpcode() == ARM::t2WhileLoopStartLR)
+    MBBs.insert(ML->getHeader()->pred_begin(), ML->getHeader()->pred_end());
+  for (MachineBasicBlock *MBB : MBBs) {
     for (MachineInstr &MI : *MBB) {
       if (MI.isCall()) {
         LLVM_DEBUG(dbgs() << "Found call in loop, reverting: " << MI);
-        RevertDoLoopStart(LoopStart, TII);
+        if (LoopStart->getOpcode() == ARM::t2DoLoopStart)
+          RevertDoLoopStart(LoopStart, TII);
+        else
+          RevertWhileLoopStartLR(LoopStart, TII);
         RevertLoopDec(LoopDec, TII);
         RevertLoopEnd(LoopEnd, TII);
         return true;
@@ -236,8 +322,16 @@ bool MVETPAndVPTOptimisations::MergeLoopEnd(MachineLoop *ML) {
   };
   if (!CheckUsers(PhiReg, {LoopDec}, MRI) ||
       !CheckUsers(DecReg, {LoopPhi, LoopEnd}, MRI) ||
-      !CheckUsers(StartReg, {LoopPhi}, MRI))
+      !CheckUsers(StartReg, {LoopPhi}, MRI)) {
+    // Don't leave a t2WhileLoopStartLR without the LoopDecEnd.
+    if (LoopStart->getOpcode() == ARM::t2WhileLoopStartLR) {
+      RevertWhileLoopStartLR(LoopStart, TII);
+      RevertLoopDec(LoopDec, TII);
+      RevertLoopEnd(LoopEnd, TII);
+      return true;
+    }
     return false;
+  }
 
   MRI->constrainRegClass(StartReg, &ARM::GPRlrRegClass);
   MRI->constrainRegClass(PhiReg, &ARM::GPRlrRegClass);
@@ -281,7 +375,7 @@ bool MVETPAndVPTOptimisations::ConvertTailPredLoop(MachineLoop *ML,
   MachineInstr *LoopEnd, *LoopPhi, *LoopStart, *LoopDec;
   if (!findLoopComponents(ML, MRI, LoopStart, LoopPhi, LoopDec, LoopEnd))
     return false;
-  if (LoopDec != LoopEnd)
+  if (LoopDec != LoopEnd || LoopStart->getOpcode() != ARM::t2DoLoopStart)
     return false;
 
   SmallVector<MachineInstr *, 4> VCTPs;
@@ -869,6 +963,7 @@ bool MVETPAndVPTOptimisations::runOnMachineFunction(MachineFunction &Fn) {
 
   bool Modified = false;
   for (MachineLoop *ML : MLI->getBase().getLoopsInPreorder()) {
+    Modified |= LowerWhileLoopStart(ML);
     Modified |= MergeLoopEnd(ML);
     Modified |= ConvertTailPredLoop(ML, DT);
   }

diff  --git a/llvm/lib/Target/ARM/MVETailPredUtils.h b/llvm/lib/Target/ARM/MVETailPredUtils.h
index 9ab5d92729fe..4798c6612b79 100644
--- a/llvm/lib/Target/ARM/MVETailPredUtils.h
+++ b/llvm/lib/Target/ARM/MVETailPredUtils.h
@@ -71,26 +71,31 @@ static inline bool isVCTP(const MachineInstr *MI) {
 static inline bool isLoopStart(MachineInstr &MI) {
   return MI.getOpcode() == ARM::t2DoLoopStart ||
          MI.getOpcode() == ARM::t2DoLoopStartTP ||
-         MI.getOpcode() == ARM::t2WhileLoopStart;
+         MI.getOpcode() == ARM::t2WhileLoopStart ||
+         MI.getOpcode() == ARM::t2WhileLoopStartLR;
 }
 
-// WhileLoopStart holds the exit block, so produce a cmp lr, 0 and then a
+// WhileLoopStart holds the exit block, so produce a subs Op0, Op1, 0 and then a
 // beq that branches to the exit block.
-inline void RevertWhileLoopStart(MachineInstr *MI, const TargetInstrInfo *TII,
-                        unsigned BrOpc = ARM::t2Bcc) {
+inline void RevertWhileLoopStartLR(MachineInstr *MI, const TargetInstrInfo *TII,
+                                   unsigned BrOpc = ARM::t2Bcc) {
   MachineBasicBlock *MBB = MI->getParent();
+  assert(MI->getOpcode() == ARM::t2WhileLoopStartLR &&
+         "Only expected a t2WhileLoopStartLR in RevertWhileLoopStartLR!");
 
-  // Cmp
+  // Subs
   MachineInstrBuilder MIB =
-      BuildMI(*MBB, MI, MI->getDebugLoc(), TII->get(ARM::t2CMPri));
+      BuildMI(*MBB, MI, MI->getDebugLoc(), TII->get(ARM::t2SUBri));
   MIB.add(MI->getOperand(0));
+  MIB.add(MI->getOperand(1));
   MIB.addImm(0);
   MIB.addImm(ARMCC::AL);
   MIB.addReg(ARM::NoRegister);
+  MIB.addReg(ARM::CPSR, RegState::Define);
 
   // Branch
   MIB = BuildMI(*MBB, MI, MI->getDebugLoc(), TII->get(BrOpc));
-  MIB.add(MI->getOperand(1)); // branch target
+  MIB.add(MI->getOperand(2)); // branch target
   MIB.addImm(ARMCC::EQ);      // condition code
   MIB.addReg(ARM::CPSR);
 

diff  --git a/llvm/lib/Target/ARM/MVETailPredication.cpp b/llvm/lib/Target/ARM/MVETailPredication.cpp
index b705208660df..a53f5b1e3009 100644
--- a/llvm/lib/Target/ARM/MVETailPredication.cpp
+++ b/llvm/lib/Target/ARM/MVETailPredication.cpp
@@ -156,7 +156,7 @@ bool MVETailPredication::runOnLoop(Loop *L, LPPassManager&) {
 
       Intrinsic::ID ID = Call->getIntrinsicID();
       if (ID == Intrinsic::start_loop_iterations ||
-          ID == Intrinsic::test_set_loop_iterations)
+          ID == Intrinsic::test_start_loop_iterations)
         return cast<IntrinsicInst>(&I);
     }
     return nullptr;

diff  --git a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/add_reduce.mir b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/add_reduce.mir
index 5bc82d493abb..c130d500bcc2 100644
--- a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/add_reduce.mir
+++ b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/add_reduce.mir
@@ -209,7 +209,7 @@ body:             |
     renamable $r12 = t2LDRi12 $sp, 48, 14, $noreg :: (load 4 from %fixed-stack.0, align 8)
     renamable $r5 = t2ADDri renamable $r12, 3, 14, $noreg, $noreg
     renamable $r7, dead $cpsr = tLSRri killed renamable $r5, 2, 14, $noreg
-    t2WhileLoopStart renamable $r7, %bb.3, implicit-def dead $cpsr
+    $lr = t2WhileLoopStartLR renamable $r7, %bb.3, implicit-def dead $cpsr
     tB %bb.1, 14, $noreg
 
   bb.1.for.body.lr.ph:

diff  --git a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/biquad-cascade-default.mir b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/biquad-cascade-default.mir
index 21cb00cc8bb2..f523af1a7d43 100644
--- a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/biquad-cascade-default.mir
+++ b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/biquad-cascade-default.mir
@@ -238,7 +238,7 @@ body:             |
   ; CHECK:   liveins: $r1, $r2, $r3, $r5, $r7, $r8, $r12
   ; CHECK:   $r9, $r4 = t2LDRDi8 $r3, 0, 14 /* CC::al */, $noreg :: (load 4 from %ir.i14), (load 4 from %ir.i20)
   ; CHECK:   $r6, $r0 = t2LDRDi8 $r3, 8, 14 /* CC::al */, $noreg :: (load 4 from %ir.i22), (load 4 from %ir.i24)
-  ; CHECK:   t2CMPri renamable $r8, 0, 14 /* CC::al */, $noreg, implicit-def $cpsr
+  ; CHECK:   dead $lr = t2SUBri renamable $r8, 0, 14 /* CC::al */, $noreg, def $cpsr
   ; CHECK:   tBcc %bb.1, 0 /* CC::eq */, killed $cpsr
   ; CHECK:   tB %bb.3, 14 /* CC::al */, $noreg
   ; CHECK: bb.3.bb27:
@@ -334,7 +334,7 @@ body:             |
 
     $r9, $r4 = t2LDRDi8 $r3, 0, 14 /* CC::al */, $noreg :: (load 4 from %ir.i14), (load 4 from %ir.i20)
     $r6, $r0 = t2LDRDi8 $r3, 8, 14 /* CC::al */, $noreg :: (load 4 from %ir.i22), (load 4 from %ir.i24)
-    t2WhileLoopStart renamable $r8, %bb.1, implicit-def dead $cpsr
+    $lr = t2WhileLoopStartLR renamable $r8, %bb.1, implicit-def dead $cpsr
     tB %bb.3, 14 /* CC::al */, $noreg
 
   bb.3.bb27:

diff  --git a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/biquad-cascade-optsize-strd-lr.mir b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/biquad-cascade-optsize-strd-lr.mir
index 14127639be29..6f35f122129b 100644
--- a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/biquad-cascade-optsize-strd-lr.mir
+++ b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/biquad-cascade-optsize-strd-lr.mir
@@ -30,8 +30,10 @@
     %i23 = load i32, i32* %i22, align 4
     %i24 = getelementptr inbounds i32, i32* %i14, i32 3
     %i25 = load i32, i32* %i24, align 4
-    %i26 = call i1 @llvm.test.set.loop.iterations.i32(i32 %arg3)
-    br i1 %i26, label %bb27, label %bb74
+    %i26 = call { i32, i1 } @llvm.test.start.loop.iterations.i32(i32 %arg3)
+    %i26.0 = extractvalue { i32, i1 } %i26, 0
+    %i26.1 = extractvalue { i32, i1 } %i26, 1
+    br i1 %i26.1, label %bb27, label %bb74
 
   bb27:                                             ; preds = %bb12
     %i28 = getelementptr inbounds i32, i32* %i13, i32 4
@@ -46,7 +48,7 @@
     br label %bb37
 
   bb37:                                             ; preds = %bb37, %bb27
-    %lsr.iv = phi i32 [ %lsr.iv.next, %bb37 ], [ %arg3, %bb27 ]
+    %lsr.iv = phi i32 [ %i70, %bb37 ], [ %i26.0, %bb27 ]
     %i38 = phi i32* [ %i15, %bb27 ], [ %i51, %bb37 ]
     %i39 = phi i32* [ %arg2, %bb27 ], [ %i69, %bb37 ]
     %i40 = phi i32 [ %i25, %bb27 ], [ %i41, %bb37 ]
@@ -81,7 +83,6 @@
     store i32 %i68, i32* %i39, align 4
     %i70 = call i32 @llvm.loop.decrement.reg.i32(i32 %lsr.iv, i32 1)
     %i71 = icmp ne i32 %i70, 0
-    %lsr.iv.next = add i32 %lsr.iv, -1
     br i1 %i71, label %bb37, label %bb72
 
   bb72:                                             ; preds = %bb37
@@ -115,7 +116,7 @@
     ret void
   }
 
-  declare i1 @llvm.test.set.loop.iterations.i32(i32) #1
+  declare { i32, i1 } @llvm.test.start.loop.iterations.i32(i32) #1
   declare i32 @llvm.loop.decrement.reg.i32(i32, i32) #1
 
   attributes #0 = { optsize "target-cpu"="cortex-m55" }
@@ -133,7 +134,7 @@ liveins:
   - { reg: '$r2', virtual-reg: '' }
   - { reg: '$r3', virtual-reg: '' }
 frameInfo:
-  stackSize:       76
+  stackSize:       68
   offsetAdjustment: 0
   maxAlignment:    4
   savePoint:       ''
@@ -164,37 +165,31 @@ stack:
   - { id: 7, name: '', type: spill-slot, offset: -68, size: 4, alignment: 4,
       stack-id: default, callee-saved-register: '', callee-saved-restored: true,
       debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
-  - { id: 8, name: '', type: spill-slot, offset: -72, size: 4, alignment: 4,
-      stack-id: default, callee-saved-register: '', callee-saved-restored: true,
-      debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
-  - { id: 9, name: '', type: spill-slot, offset: -76, size: 4, alignment: 4,
-      stack-id: default, callee-saved-register: '', callee-saved-restored: true,
-      debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
-  - { id: 10, name: '', type: spill-slot, offset: -4, size: 4, alignment: 4,
+  - { id: 8, name: '', type: spill-slot, offset: -4, size: 4, alignment: 4,
       stack-id: default, callee-saved-register: '$lr', callee-saved-restored: false,
       debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
-  - { id: 11, name: '', type: spill-slot, offset: -8, size: 4, alignment: 4,
+  - { id: 9, name: '', type: spill-slot, offset: -8, size: 4, alignment: 4,
       stack-id: default, callee-saved-register: '$r11', callee-saved-restored: true,
       debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
-  - { id: 12, name: '', type: spill-slot, offset: -12, size: 4, alignment: 4,
+  - { id: 10, name: '', type: spill-slot, offset: -12, size: 4, alignment: 4,
       stack-id: default, callee-saved-register: '$r10', callee-saved-restored: true,
       debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
-  - { id: 13, name: '', type: spill-slot, offset: -16, size: 4, alignment: 4,
+  - { id: 11, name: '', type: spill-slot, offset: -16, size: 4, alignment: 4,
       stack-id: default, callee-saved-register: '$r9', callee-saved-restored: true,
       debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
-  - { id: 14, name: '', type: spill-slot, offset: -20, size: 4, alignment: 4,
+  - { id: 12, name: '', type: spill-slot, offset: -20, size: 4, alignment: 4,
       stack-id: default, callee-saved-register: '$r8', callee-saved-restored: true,
       debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
-  - { id: 15, name: '', type: spill-slot, offset: -24, size: 4, alignment: 4,
+  - { id: 13, name: '', type: spill-slot, offset: -24, size: 4, alignment: 4,
       stack-id: default, callee-saved-register: '$r7', callee-saved-restored: true,
       debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
-  - { id: 16, name: '', type: spill-slot, offset: -28, size: 4, alignment: 4,
+  - { id: 14, name: '', type: spill-slot, offset: -28, size: 4, alignment: 4,
       stack-id: default, callee-saved-register: '$r6', callee-saved-restored: true,
       debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
-  - { id: 17, name: '', type: spill-slot, offset: -32, size: 4, alignment: 4,
+  - { id: 15, name: '', type: spill-slot, offset: -32, size: 4, alignment: 4,
       stack-id: default, callee-saved-register: '$r5', callee-saved-restored: true,
       debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
-  - { id: 18, name: '', type: spill-slot, offset: -36, size: 4, alignment: 4,
+  - { id: 16, name: '', type: spill-slot, offset: -36, size: 4, alignment: 4,
       stack-id: default, callee-saved-register: '$r4', callee-saved-restored: true,
       debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
 callSites:       []
@@ -216,82 +211,70 @@ body:             |
   ; CHECK:   frame-setup CFI_INSTRUCTION offset $r6, -28
   ; CHECK:   frame-setup CFI_INSTRUCTION offset $r5, -32
   ; CHECK:   frame-setup CFI_INSTRUCTION offset $r4, -36
-  ; CHECK:   $sp = frame-setup tSUBspi $sp, 10, 14 /* CC::al */, $noreg
-  ; CHECK:   frame-setup CFI_INSTRUCTION def_cfa_offset 76
-  ; CHECK:   $r7, $r5 = t2LDRDi8 $r0, 0, 14 /* CC::al */, $noreg :: (load 4 from %ir.i), (load 4 from %ir.i5)
-  ; CHECK:   $r6, $r4 = t2LDRDi8 killed $r0, 8, 14 /* CC::al */, $noreg :: (load 4 from %ir.i7), (load 4 from %ir.i10)
+  ; CHECK:   $sp = frame-setup tSUBspi $sp, 8, 14 /* CC::al */, $noreg
+  ; CHECK:   frame-setup CFI_INSTRUCTION def_cfa_offset 68
+  ; CHECK:   $r6, $r4 = t2LDRDi8 $r0, 8, 14 /* CC::al */, $noreg :: (load 4 from %ir.i7), (load 4 from %ir.i10)
+  ; CHECK:   $r7, $r5 = t2LDRDi8 killed $r0, 0, 14 /* CC::al */, $noreg :: (load 4 from %ir.i), (load 4 from %ir.i5)
   ; CHECK:   renamable $r0 = t2RSBri killed renamable $r6, 31, 14 /* CC::al */, $noreg, $noreg
-  ; CHECK:   t2STMIA $sp, 14 /* CC::al */, $noreg, killed $r0, $r2, $r3 :: (store 4 into %stack.9), (store 4 into %stack.8), (store 4 into %stack.7)
+  ; CHECK:   t2STMIA $sp, 14 /* CC::al */, $noreg, killed $r0, $r2, $r3 :: (store 4 into %stack.7), (store 4 into %stack.6), (store 4 into %stack.5)
+  ; CHECK:   $r12 = tMOVr killed $r2, 14 /* CC::al */, $noreg
+  ; CHECK:   renamable $r2 = tLDRspi $sp, 0, 14 /* CC::al */, $noreg :: (load 4 from %stack.7)
   ; CHECK: bb.1.bb12 (align 4):
   ; CHECK:   successors: %bb.2(0x40000000), %bb.5(0x40000000)
-  ; CHECK:   liveins: $r1, $r2, $r3, $r4, $r5, $r7
-  ; CHECK:   $r9, $r8 = t2LDRDi8 $r7, 0, 14 /* CC::al */, $noreg :: (load 4 from %ir.i14), (load 4 from %ir.i20)
-  ; CHECK:   renamable $lr = nuw t2ADDri renamable $r5, 20, 14 /* CC::al */, $noreg, $noreg
-  ; CHECK:   $r6, $r12 = t2LDRDi8 $r7, 8, 14 /* CC::al */, $noreg :: (load 4 from %ir.i22), (load 4 from %ir.i24)
-  ; CHECK:   t2CMPri renamable $r3, 0, 14 /* CC::al */, $noreg, implicit-def $cpsr
-  ; CHECK:   tBcc %bb.5, 0 /* CC::eq */, killed $cpsr
-  ; CHECK:   tB %bb.2, 14 /* CC::al */, $noreg
+  ; CHECK:   liveins: $r1, $r2, $r3, $r4, $r5, $r7, $r12
+  ; CHECK:   $r10, $r0 = t2LDRDi8 $r7, 0, 14 /* CC::al */, $noreg :: (load 4 from %ir.i14), (load 4 from %ir.i20)
+  ; CHECK:   $r6, $r8 = t2LDRDi8 $r7, 8, 14 /* CC::al */, $noreg :: (load 4 from %ir.i22), (load 4 from %ir.i24)
+  ; CHECK:   $lr = t2WLS renamable $r3, %bb.5
   ; CHECK: bb.2.bb27:
   ; CHECK:   successors: %bb.3(0x80000000)
-  ; CHECK:   liveins: $lr, $r1, $r2, $r3, $r4, $r5, $r6, $r7, $r8, $r9, $r12
-  ; CHECK:   t2STRDi8 killed $lr, killed $r7, $sp, 12, 14 /* CC::al */, $noreg :: (store 4 into %stack.6), (store 4 into %stack.5)
-  ; CHECK:   renamable $r0 = tLDRi renamable $r5, 0, 14 /* CC::al */, $noreg :: (load 4 from %ir.i13)
-  ; CHECK:   renamable $r10 = t2LDRi12 renamable $r5, 16, 14 /* CC::al */, $noreg :: (load 4 from %ir.i28)
-  ; CHECK:   tSTRspi killed renamable $r0, $sp, 9, 14 /* CC::al */, $noreg :: (store 4 into %stack.0)
-  ; CHECK:   renamable $r0 = tLDRi renamable $r5, 1, 14 /* CC::al */, $noreg :: (load 4 from %ir.i34)
-  ; CHECK:   tSTRspi killed renamable $r4, $sp, 5, 14 /* CC::al */, $noreg :: (store 4 into %stack.4)
-  ; CHECK:   tSTRspi killed renamable $r0, $sp, 8, 14 /* CC::al */, $noreg :: (store 4 into %stack.1)
-  ; CHECK:   renamable $r0 = tLDRi renamable $r5, 2, 14 /* CC::al */, $noreg :: (load 4 from %ir.i32)
-  ; CHECK:   tSTRspi killed renamable $r0, $sp, 7, 14 /* CC::al */, $noreg :: (store 4 into %stack.2)
-  ; CHECK:   renamable $r0 = tLDRi killed renamable $r5, 3, 14 /* CC::al */, $noreg :: (load 4 from %ir.i30)
-  ; CHECK:   tSTRspi killed renamable $r0, $sp, 6, 14 /* CC::al */, $noreg :: (store 4 into %stack.3)
-  ; CHECK:   $r0 = tMOVr killed $r3, 14 /* CC::al */, $noreg
-  ; CHECK:   renamable $r3 = tLDRspi $sp, 0, 14 /* CC::al */, $noreg :: (load 4 from %stack.9)
+  ; CHECK:   liveins: $lr, $r0, $r1, $r2, $r4, $r5, $r6, $r7, $r8, $r10, $r12
+  ; CHECK:   renamable $r3 = tLDRi renamable $r5, 0, 14 /* CC::al */, $noreg :: (load 4 from %ir.i13)
+  ; CHECK:   t2STRDi8 killed $r7, killed $r4, $sp, 12, 14 /* CC::al */, $noreg :: (store 4 into %stack.4), (store 4 into %stack.3)
+  ; CHECK:   tSTRspi killed renamable $r3, $sp, 7, 14 /* CC::al */, $noreg :: (store 4 into %stack.0)
+  ; CHECK:   renamable $r3 = tLDRi renamable $r5, 1, 14 /* CC::al */, $noreg :: (load 4 from %ir.i34)
+  ; CHECK:   renamable $r4 = tLDRi renamable $r5, 4, 14 /* CC::al */, $noreg :: (load 4 from %ir.i28)
+  ; CHECK:   tSTRspi killed renamable $r3, $sp, 6, 14 /* CC::al */, $noreg :: (store 4 into %stack.1)
+  ; CHECK:   $r9, $r3 = t2LDRDi8 $r5, 8, 14 /* CC::al */, $noreg :: (load 4 from %ir.i32), (load 4 from %ir.i30)
+  ; CHECK:   tSTRspi killed renamable $r5, $sp, 5, 14 /* CC::al */, $noreg :: (store 4 into %stack.2)
   ; CHECK: bb.3.bb37 (align 4):
   ; CHECK:   successors: %bb.3(0x7c000000), %bb.4(0x04000000)
-  ; CHECK:   liveins: $r0, $r1, $r2, $r3, $r6, $r8, $r9, $r10, $r12
-  ; CHECK:   renamable $r4 = tLDRspi $sp, 8, 14 /* CC::al */, $noreg :: (load 4 from %stack.1)
+  ; CHECK:   liveins: $lr, $r0, $r1, $r2, $r3, $r4, $r6, $r8, $r9, $r10, $r12
   ; CHECK:   $r7 = tMOVr killed $r6, 14 /* CC::al */, $noreg
-  ; CHECK:   $r5 = tMOVr $r9, 14 /* CC::al */, $noreg
-  ; CHECK:   $lr = tMOVr $r0, 14 /* CC::al */, $noreg
-  ; CHECK:   renamable $r6, renamable $r11 = t2SMULL killed $r9, killed renamable $r4, 14 /* CC::al */, $noreg
-  ; CHECK:   renamable $r4 = tLDRspi $sp, 7, 14 /* CC::al */, $noreg :: (load 4 from %stack.2)
-  ; CHECK:   renamable $r0, dead $cpsr = tSUBi8 killed renamable $r0, 1, 14 /* CC::al */, $noreg
-  ; CHECK:   renamable $r9, renamable $r1 = t2LDR_POST killed renamable $r1, 4, 14 /* CC::al */, $noreg :: (load 4 from %ir.i38)
-  ; CHECK:   dead renamable $lr = t2SUBri killed renamable $lr, 1, 14 /* CC::al */, $noreg, def $cpsr
-  ; CHECK:   renamable $r6, renamable $r11 = t2SMLAL killed renamable $r8, killed renamable $r4, killed renamable $r6, killed renamable $r11, 14 /* CC::al */, $noreg
-  ; CHECK:   renamable $r4 = tLDRspi $sp, 6, 14 /* CC::al */, $noreg :: (load 4 from %stack.3)
-  ; CHECK:   $r8 = tMOVr $r5, 14 /* CC::al */, $noreg
-  ; CHECK:   renamable $r6, renamable $r11 = t2SMLAL renamable $r7, killed renamable $r4, killed renamable $r6, killed renamable $r11, 14 /* CC::al */, $noreg
-  ; CHECK:   renamable $r4 = tLDRspi $sp, 9, 14 /* CC::al */, $noreg :: (load 4 from %stack.0)
-  ; CHECK:   renamable $r6, renamable $r11 = t2SMLAL killed renamable $r12, renamable $r10, killed renamable $r6, killed renamable $r11, 14 /* CC::al */, $noreg
-  ; CHECK:   $r12 = tMOVr $r7, 14 /* CC::al */, $noreg
-  ; CHECK:   renamable $r6, renamable $r11 = t2SMLAL renamable $r9, killed renamable $r4, killed renamable $r6, killed renamable $r11, 14 /* CC::al */, $noreg
-  ; CHECK:   early-clobber renamable $r6, dead early-clobber renamable $r11 = MVE_ASRLr killed renamable $r6, killed renamable $r11, renamable $r3, 14 /* CC::al */, $noreg
-  ; CHECK:   early-clobber renamable $r2 = t2STR_POST renamable $r6, killed renamable $r2, 4, 14 /* CC::al */, $noreg :: (store 4 into %ir.i39)
-  ; CHECK:   tBcc %bb.3, 1 /* CC::ne */, killed $cpsr
-  ; CHECK:   tB %bb.4, 14 /* CC::al */, $noreg
+  ; CHECK:   renamable $r6 = tLDRspi $sp, 6, 14 /* CC::al */, $noreg :: (load 4 from %stack.1)
+  ; CHECK:   $r5 = tMOVr $r10, 14 /* CC::al */, $noreg
+  ; CHECK:   renamable $r6, renamable $r11 = t2SMULL killed $r10, killed renamable $r6, 14 /* CC::al */, $noreg
+  ; CHECK:   renamable $r6, renamable $r11 = t2SMLAL killed renamable $r0, renamable $r9, killed renamable $r6, killed renamable $r11, 14 /* CC::al */, $noreg
+  ; CHECK:   renamable $r10, renamable $r1 = t2LDR_POST killed renamable $r1, 4, 14 /* CC::al */, $noreg :: (load 4 from %ir.i38)
+  ; CHECK:   renamable $r6, renamable $r11 = t2SMLAL renamable $r7, renamable $r3, killed renamable $r6, killed renamable $r11, 14 /* CC::al */, $noreg
+  ; CHECK:   renamable $r0 = tLDRspi $sp, 7, 14 /* CC::al */, $noreg :: (load 4 from %stack.0)
+  ; CHECK:   renamable $r6, renamable $r11 = t2SMLAL killed renamable $r8, renamable $r4, killed renamable $r6, killed renamable $r11, 14 /* CC::al */, $noreg
+  ; CHECK:   renamable $r6, renamable $r11 = t2SMLAL renamable $r10, killed renamable $r0, killed renamable $r6, killed renamable $r11, 14 /* CC::al */, $noreg
+  ; CHECK:   early-clobber renamable $r6, dead early-clobber renamable $r11 = MVE_ASRLr killed renamable $r6, killed renamable $r11, renamable $r2, 14 /* CC::al */, $noreg
+  ; CHECK:   early-clobber renamable $r12 = t2STR_POST renamable $r6, killed renamable $r12, 4, 14 /* CC::al */, $noreg :: (store 4 into %ir.i39)
+  ; CHECK:   $r8 = tMOVr $r7, 14 /* CC::al */, $noreg
+  ; CHECK:   $r0 = tMOVr $r5, 14 /* CC::al */, $noreg
+  ; CHECK:   $lr = t2LEUpdate killed renamable $lr, %bb.3
   ; CHECK: bb.4.bb72:
   ; CHECK:   successors: %bb.5(0x80000000)
-  ; CHECK:   liveins: $r5, $r6, $r7, $r9
-  ; CHECK:   $r12 = tMOVr killed $r7, 14 /* CC::al */, $noreg
-  ; CHECK:   $r7, $r4 = t2LDRDi8 $sp, 16, 14 /* CC::al */, $noreg :: (load 4 from %stack.5), (load 4 from %stack.4)
-  ; CHECK:   $lr = t2ADDri $sp, 4, 14 /* CC::al */, $noreg, $noreg
-  ; CHECK:   $r8 = tMOVr killed $r5, 14 /* CC::al */, $noreg
-  ; CHECK:   t2LDMIA killed $lr, 14 /* CC::al */, $noreg, def $r2, def $r3, def $lr :: (load 4 from %stack.8), (load 4 from %stack.7), (load 4 from %stack.6)
+  ; CHECK:   liveins: $r2, $r5, $r6, $r7, $r10
+  ; CHECK:   $r0 = tMOVr killed $r5, 14 /* CC::al */, $noreg
+  ; CHECK:   $r8 = tMOVr killed $r7, 14 /* CC::al */, $noreg
+  ; CHECK:   $r12, $r3 = t2LDRDi8 $sp, 4, 14 /* CC::al */, $noreg :: (load 4 from %stack.6), (load 4 from %stack.5)
+  ; CHECK:   renamable $r5 = tLDRspi $sp, 5, 14 /* CC::al */, $noreg :: (load 4 from %stack.2)
+  ; CHECK:   $r7, $r4 = t2LDRDi8 $sp, 12, 14 /* CC::al */, $noreg :: (load 4 from %stack.4), (load 4 from %stack.3)
   ; CHECK: bb.5.bb74:
-  ; CHECK:   successors: %bb.1(0x80000000)
-  ; CHECK:   liveins: $lr, $r2, $r3, $r4, $r6, $r7, $r8, $r9, $r12
-  ; CHECK:   t2STRDi8 killed $r9, killed $r8, $r7, 0, 14 /* CC::al */, $noreg :: (store 4 into %ir.i14), (store 4 into %ir.i81)
-  ; CHECK:   t2STRDi8 killed $r6, killed $r12, $r7, 8, 14 /* CC::al */, $noreg :: (store 4 into %ir.i84), (store 4 into %ir.i88)
+  ; CHECK:   successors: %bb.6(0x04000000), %bb.1(0x7c000000)
+  ; CHECK:   liveins: $r0, $r3, $r4, $r5, $r6, $r7, $r8, $r10, $r12, $r2
+  ; CHECK:   renamable $r5, dead $cpsr = nuw tADDi8 killed renamable $r5, 20, 14 /* CC::al */, $noreg
+  ; CHECK:   t2STRDi8 killed $r10, killed $r0, $r7, 0, 14 /* CC::al */, $noreg :: (store 4 into %ir.i14), (store 4 into %ir.i81)
+  ; CHECK:   t2STRDi8 killed $r6, killed $r8, $r7, 8, 14 /* CC::al */, $noreg :: (store 4 into %ir.i84), (store 4 into %ir.i88)
   ; CHECK:   renamable $r7, dead $cpsr = nuw tADDi8 killed renamable $r7, 16, 14 /* CC::al */, $noreg
   ; CHECK:   renamable $r4, $cpsr = tSUBi8 killed renamable $r4, 1, 14 /* CC::al */, $noreg
-  ; CHECK:   $r5 = tMOVr killed $lr, 14 /* CC::al */, $noreg
-  ; CHECK:   $r1 = tMOVr $r2, 14 /* CC::al */, $noreg
-  ; CHECK:   t2IT 0, 4, implicit-def $itstate
-  ; CHECK:   $sp = frame-destroy tADDspi $sp, 10, 0 /* CC::eq */, $cpsr, implicit $itstate
-  ; CHECK:   $sp = frame-destroy t2LDMIA_RET $sp, 0 /* CC::eq */, killed $cpsr, def $r4, def $r5, def $r6, def $r7, def $r8, def $r9, def $r10, def $r11, def $pc, implicit $sp, implicit killed $r4, implicit killed $r5, implicit killed $r7, implicit killed $itstate
-  ; CHECK:   tB %bb.1, 14 /* CC::al */, $noreg
+  ; CHECK:   $r1 = tMOVr $r12, 14 /* CC::al */, $noreg
+  ; CHECK:   tBcc %bb.1, 1 /* CC::ne */, killed $cpsr
+  ; CHECK: bb.6.bb91:
+  ; CHECK:   $sp = frame-destroy tADDspi $sp, 8, 14 /* CC::al */, $noreg
+  ; CHECK:   $sp = frame-destroy t2LDMIA_RET $sp, 14 /* CC::al */, $noreg, def $r4, def $r5, def $r6, def $r7, def $r8, def $r9, def $r10, def $r11, def $pc
   bb.0.bb:
     successors: %bb.1(0x80000000)
     liveins: $r0, $r1, $r2, $r3, $r4, $r5, $r6, $r7, $r8, $r9, $r10, $r11, $lr
@@ -307,90 +290,82 @@ body:             |
     frame-setup CFI_INSTRUCTION offset $r6, -28
     frame-setup CFI_INSTRUCTION offset $r5, -32
     frame-setup CFI_INSTRUCTION offset $r4, -36
-    $sp = frame-setup tSUBspi $sp, 10, 14 /* CC::al */, $noreg
-    frame-setup CFI_INSTRUCTION def_cfa_offset 76
-    $r7, $r5 = t2LDRDi8 $r0, 0, 14 /* CC::al */, $noreg :: (load 4 from %ir.i), (load 4 from %ir.i5)
-    $r6, $r4 = t2LDRDi8 killed $r0, 8, 14 /* CC::al */, $noreg :: (load 4 from %ir.i7), (load 4 from %ir.i10)
+    $sp = frame-setup tSUBspi $sp, 8, 14 /* CC::al */, $noreg
+    frame-setup CFI_INSTRUCTION def_cfa_offset 68
+    $r6, $r4 = t2LDRDi8 $r0, 8, 14 /* CC::al */, $noreg :: (load 4 from %ir.i7), (load 4 from %ir.i10)
+    $r7, $r5 = t2LDRDi8 killed $r0, 0, 14 /* CC::al */, $noreg :: (load 4 from %ir.i), (load 4 from %ir.i5)
     renamable $r0 = t2RSBri killed renamable $r6, 31, 14 /* CC::al */, $noreg, $noreg
-    t2STMIA $sp, 14 /* CC::al */, $noreg, killed $r0, $r2, $r3 :: (store 4 into %stack.9), (store 4 into %stack.8), (store 4 into %stack.7)
+    t2STMIA $sp, 14 /* CC::al */, $noreg, killed $r0, $r2, $r3 :: (store 4 into %stack.7), (store 4 into %stack.6), (store 4 into %stack.5)
+    $r12 = tMOVr killed $r2, 14 /* CC::al */, $noreg
+    renamable $r2 = tLDRspi $sp, 0, 14 /* CC::al */, $noreg :: (load 4 from %stack.7)
 
   bb.1.bb12 (align 4):
     successors: %bb.2(0x40000000), %bb.5(0x40000000)
-    liveins: $r1, $r2, $r3, $r4, $r5, $r7
+    liveins: $r1, $r3, $r4, $r5, $r7, $r12, $r2
 
-    $r9, $r8 = t2LDRDi8 $r7, 0, 14 /* CC::al */, $noreg :: (load 4 from %ir.i14), (load 4 from %ir.i20)
-    renamable $lr = nuw t2ADDri renamable $r5, 20, 14 /* CC::al */, $noreg, $noreg
-    $r6, $r12 = t2LDRDi8 $r7, 8, 14 /* CC::al */, $noreg :: (load 4 from %ir.i22), (load 4 from %ir.i24)
-    t2WhileLoopStart renamable $r3, %bb.5, implicit-def dead $cpsr
+    $r10, $r0 = t2LDRDi8 $r7, 0, 14 /* CC::al */, $noreg :: (load 4 from %ir.i14), (load 4 from %ir.i20)
+    $r6, $r8 = t2LDRDi8 $r7, 8, 14 /* CC::al */, $noreg :: (load 4 from %ir.i22), (load 4 from %ir.i24)
+    renamable $lr = t2WhileLoopStartLR renamable $r3, %bb.5, implicit-def dead $cpsr
     tB %bb.2, 14 /* CC::al */, $noreg
 
   bb.2.bb27:
     successors: %bb.3(0x80000000)
-    liveins: $lr, $r1, $r2, $r3, $r4, $r5, $r6, $r7, $r8, $r9, $r12
+    liveins: $lr, $r0, $r1, $r4, $r5, $r6, $r7, $r8, $r10, $r12, $r2
 
-    t2STRDi8 killed $lr, killed $r7, $sp, 12, 14 /* CC::al */, $noreg :: (store 4 into %stack.6), (store 4 into %stack.5)
-    renamable $r0 = tLDRi renamable $r5, 0, 14 /* CC::al */, $noreg :: (load 4 from %ir.i13)
-    renamable $r10 = t2LDRi12 renamable $r5, 16, 14 /* CC::al */, $noreg :: (load 4 from %ir.i28)
-    tSTRspi killed renamable $r0, $sp, 9, 14 /* CC::al */, $noreg :: (store 4 into %stack.0)
-    renamable $r0 = tLDRi renamable $r5, 1, 14 /* CC::al */, $noreg :: (load 4 from %ir.i34)
-    tSTRspi killed renamable $r4, $sp, 5, 14 /* CC::al */, $noreg :: (store 4 into %stack.4)
-    tSTRspi killed renamable $r0, $sp, 8, 14 /* CC::al */, $noreg :: (store 4 into %stack.1)
-    renamable $r0 = tLDRi renamable $r5, 2, 14 /* CC::al */, $noreg :: (load 4 from %ir.i32)
-    tSTRspi killed renamable $r0, $sp, 7, 14 /* CC::al */, $noreg :: (store 4 into %stack.2)
-    renamable $r0 = tLDRi killed renamable $r5, 3, 14 /* CC::al */, $noreg :: (load 4 from %ir.i30)
-    tSTRspi killed renamable $r0, $sp, 6, 14 /* CC::al */, $noreg :: (store 4 into %stack.3)
-    $r0 = tMOVr killed $r3, 14 /* CC::al */, $noreg
-    renamable $r3 = tLDRspi $sp, 0, 14 /* CC::al */, $noreg :: (load 4 from %stack.9)
+    renamable $r3 = tLDRi renamable $r5, 0, 14 /* CC::al */, $noreg :: (load 4 from %ir.i13)
+    t2STRDi8 killed $r7, killed $r4, $sp, 12, 14 /* CC::al */, $noreg :: (store 4 into %stack.4), (store 4 into %stack.3)
+    tSTRspi killed renamable $r3, $sp, 7, 14 /* CC::al */, $noreg :: (store 4 into %stack.0)
+    renamable $r3 = tLDRi renamable $r5, 1, 14 /* CC::al */, $noreg :: (load 4 from %ir.i34)
+    renamable $r4 = tLDRi renamable $r5, 4, 14 /* CC::al */, $noreg :: (load 4 from %ir.i28)
+    tSTRspi killed renamable $r3, $sp, 6, 14 /* CC::al */, $noreg :: (store 4 into %stack.1)
+    $r9, $r3 = t2LDRDi8 $r5, 8, 14 /* CC::al */, $noreg :: (load 4 from %ir.i32), (load 4 from %ir.i30)
+    tSTRspi killed renamable $r5, $sp, 5, 14 /* CC::al */, $noreg :: (store 4 into %stack.2)
 
   bb.3.bb37 (align 4):
     successors: %bb.3(0x7c000000), %bb.4(0x04000000)
-    liveins: $r0, $r1, $r2, $r3, $r6, $r8, $r9, $r10, $r12
+    liveins: $lr, $r0, $r1, $r2, $r3, $r4, $r6, $r8, $r9, $r10, $r12
 
-    renamable $r4 = tLDRspi $sp, 8, 14 /* CC::al */, $noreg :: (load 4 from %stack.1)
     $r7 = tMOVr killed $r6, 14 /* CC::al */, $noreg
-    $r5 = tMOVr $r9, 14 /* CC::al */, $noreg
-    $lr = tMOVr $r0, 14 /* CC::al */, $noreg
-    renamable $r6, renamable $r11 = t2SMULL killed $r9, killed renamable $r4, 14 /* CC::al */, $noreg
-    renamable $r4 = tLDRspi $sp, 7, 14 /* CC::al */, $noreg :: (load 4 from %stack.2)
-    renamable $r0, dead $cpsr = tSUBi8 killed renamable $r0, 1, 14 /* CC::al */, $noreg
-    renamable $r9, renamable $r1 = t2LDR_POST killed renamable $r1, 4, 14 /* CC::al */, $noreg :: (load 4 from %ir.i38)
-    renamable $lr = t2LoopDec killed renamable $lr, 1
-    renamable $r6, renamable $r11 = t2SMLAL killed renamable $r8, killed renamable $r4, killed renamable $r6, killed renamable $r11, 14 /* CC::al */, $noreg
-    renamable $r4 = tLDRspi $sp, 6, 14 /* CC::al */, $noreg :: (load 4 from %stack.3)
-    $r8 = tMOVr $r5, 14 /* CC::al */, $noreg
-    renamable $r6, renamable $r11 = t2SMLAL renamable $r7, killed renamable $r4, killed renamable $r6, killed renamable $r11, 14 /* CC::al */, $noreg
-    renamable $r4 = tLDRspi $sp, 9, 14 /* CC::al */, $noreg :: (load 4 from %stack.0)
-    renamable $r6, renamable $r11 = t2SMLAL killed renamable $r12, renamable $r10, killed renamable $r6, killed renamable $r11, 14 /* CC::al */, $noreg
-    $r12 = tMOVr $r7, 14 /* CC::al */, $noreg
-    renamable $r6, renamable $r11 = t2SMLAL renamable $r9, killed renamable $r4, killed renamable $r6, killed renamable $r11, 14 /* CC::al */, $noreg
-    early-clobber renamable $r6, dead early-clobber renamable $r11 = MVE_ASRLr killed renamable $r6, killed renamable $r11, renamable $r3, 14 /* CC::al */, $noreg
-    early-clobber renamable $r2 = t2STR_POST renamable $r6, killed renamable $r2, 4, 14 /* CC::al */, $noreg :: (store 4 into %ir.i39)
-    t2LoopEnd killed renamable $lr, %bb.3, implicit-def dead $cpsr
+    renamable $r6 = tLDRspi $sp, 6, 14 /* CC::al */, $noreg :: (load 4 from %stack.1)
+    $r5 = tMOVr $r10, 14 /* CC::al */, $noreg
+    renamable $r6, renamable $r11 = t2SMULL killed $r10, killed renamable $r6, 14 /* CC::al */, $noreg
+    renamable $r6, renamable $r11 = t2SMLAL killed renamable $r0, renamable $r9, killed renamable $r6, killed renamable $r11, 14 /* CC::al */, $noreg
+    renamable $r10, renamable $r1 = t2LDR_POST killed renamable $r1, 4, 14 /* CC::al */, $noreg :: (load 4 from %ir.i38)
+    renamable $r6, renamable $r11 = t2SMLAL renamable $r7, renamable $r3, killed renamable $r6, killed renamable $r11, 14 /* CC::al */, $noreg
+    renamable $r0 = tLDRspi $sp, 7, 14 /* CC::al */, $noreg :: (load 4 from %stack.0)
+    renamable $r6, renamable $r11 = t2SMLAL killed renamable $r8, renamable $r4, killed renamable $r6, killed renamable $r11, 14 /* CC::al */, $noreg
+    renamable $r6, renamable $r11 = t2SMLAL renamable $r10, killed renamable $r0, killed renamable $r6, killed renamable $r11, 14 /* CC::al */, $noreg
+    early-clobber renamable $r6, dead early-clobber renamable $r11 = MVE_ASRLr killed renamable $r6, killed renamable $r11, renamable $r2, 14 /* CC::al */, $noreg
+    early-clobber renamable $r12 = t2STR_POST renamable $r6, killed renamable $r12, 4, 14 /* CC::al */, $noreg :: (store 4 into %ir.i39)
+    $r8 = tMOVr $r7, 14 /* CC::al */, $noreg
+    $r0 = tMOVr $r5, 14 /* CC::al */, $noreg
+    renamable $lr = t2LoopEndDec killed renamable $lr, %bb.3, implicit-def dead $cpsr
     tB %bb.4, 14 /* CC::al */, $noreg
 
   bb.4.bb72:
     successors: %bb.5(0x80000000)
-    liveins: $r5, $r6, $r7, $r9
+    liveins: $r5, $r6, $r7, $r10, $r2
 
-    $r12 = tMOVr killed $r7, 14 /* CC::al */, $noreg
-    $r7, $r4 = t2LDRDi8 $sp, 16, 14 /* CC::al */, $noreg :: (load 4 from %stack.5), (load 4 from %stack.4)
-    $lr = t2ADDri $sp, 4, 14 /* CC::al */, $noreg, $noreg
-    $r8 = tMOVr killed $r5, 14 /* CC::al */, $noreg
-    t2LDMIA killed $lr, 14 /* CC::al */, $noreg, def $r2, def $r3, def $lr :: (load 4 from %stack.8), (load 4 from %stack.7), (load 4 from %stack.6)
+    $r0 = tMOVr killed $r5, 14 /* CC::al */, $noreg
+    $r8 = tMOVr killed $r7, 14 /* CC::al */, $noreg
+    $r12, $r3 = t2LDRDi8 $sp, 4, 14 /* CC::al */, $noreg :: (load 4 from %stack.6), (load 4 from %stack.5)
+    renamable $r5 = tLDRspi $sp, 5, 14 /* CC::al */, $noreg :: (load 4 from %stack.2)
+    $r7, $r4 = t2LDRDi8 $sp, 12, 14 /* CC::al */, $noreg :: (load 4 from %stack.4), (load 4 from %stack.3)
 
   bb.5.bb74:
-    successors: %bb.1(0x7c000000)
-    liveins: $lr, $r2, $r3, $r4, $r6, $r7, $r8, $r9, $r12
+    successors: %bb.6(0x04000000), %bb.1(0x7c000000)
+    liveins: $r0, $r3, $r4, $r5, $r6, $r7, $r8, $r10, $r12, $r2
 
-    t2STRDi8 killed $r9, killed $r8, $r7, 0, 14 /* CC::al */, $noreg :: (store 4 into %ir.i14), (store 4 into %ir.i81)
-    t2STRDi8 killed $r6, killed $r12, $r7, 8, 14 /* CC::al */, $noreg :: (store 4 into %ir.i84), (store 4 into %ir.i88)
+    renamable $r5, dead $cpsr = nuw tADDi8 killed renamable $r5, 20, 14 /* CC::al */, $noreg
+    t2STRDi8 killed $r10, killed $r0, $r7, 0, 14 /* CC::al */, $noreg :: (store 4 into %ir.i14), (store 4 into %ir.i81)
+    t2STRDi8 killed $r6, killed $r8, $r7, 8, 14 /* CC::al */, $noreg :: (store 4 into %ir.i84), (store 4 into %ir.i88)
     renamable $r7, dead $cpsr = nuw tADDi8 killed renamable $r7, 16, 14 /* CC::al */, $noreg
     renamable $r4, $cpsr = tSUBi8 killed renamable $r4, 1, 14 /* CC::al */, $noreg
-    $r5 = tMOVr killed $lr, 14 /* CC::al */, $noreg
-    $r1 = tMOVr $r2, 14 /* CC::al */, $noreg
-    t2IT 0, 4, implicit-def $itstate
-    $sp = frame-destroy tADDspi $sp, 10, 0 /* CC::eq */, $cpsr, implicit $itstate
-    $sp = frame-destroy t2LDMIA_RET $sp, 0 /* CC::eq */, killed $cpsr, def $r4, def $r5, def $r6, def $r7, def $r8, def $r9, def $r10, def $r11, def $pc, implicit $sp, implicit killed $r4, implicit killed $r5, implicit killed $r7, implicit killed $itstate
-    tB %bb.1, 14 /* CC::al */, $noreg
+    $r1 = tMOVr $r12, 14 /* CC::al */, $noreg
+    tBcc %bb.1, 1 /* CC::ne */, killed $cpsr
+
+  bb.6.bb91:
+    $sp = frame-destroy tADDspi $sp, 8, 14 /* CC::al */, $noreg
+    $sp = frame-destroy t2LDMIA_RET $sp, 14 /* CC::al */, $noreg, def $r4, def $r5, def $r6, def $r7, def $r8, def $r9, def $r10, def $r11, def $pc
 
 ...

diff  --git a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/biquad-cascade-optsize.mir b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/biquad-cascade-optsize.mir
index f9b625c8141e..e3a158536477 100644
--- a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/biquad-cascade-optsize.mir
+++ b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/biquad-cascade-optsize.mir
@@ -323,7 +323,7 @@ body:             |
 
     $r9, $r4 = t2LDRDi8 $r3, 0, 14 /* CC::al */, $noreg :: (load 4 from %ir.i14), (load 4 from %ir.i20)
     $r6, $r0 = t2LDRDi8 $r3, 8, 14 /* CC::al */, $noreg :: (load 4 from %ir.i22), (load 4 from %ir.i24)
-    t2WhileLoopStart renamable $r8, %bb.5, implicit-def dead $cpsr
+    $lr = t2WhileLoopStartLR renamable $r8, %bb.5, implicit-def dead $cpsr
     tB %bb.2, 14 /* CC::al */, $noreg
 
   bb.2.bb27:

diff  --git a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/branch-targets.ll b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/branch-targets.ll
index 624dd4cab949..07cf09c9dfea 100644
--- a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/branch-targets.ll
+++ b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/branch-targets.ll
@@ -402,17 +402,18 @@ for.cond.cleanup:
 }
 
 ; CHECK-MID: check_negated_xor_wls
-; CHECK-MID:   t2WhileLoopStart $r2, %bb.3
+; CHECK-MID:   $lr = t2WhileLoopStartLR killed renamable $r2
 ; CHECK-MID:   tB %bb.1
-; CHECK-MID: bb.1.while.body.preheader:
-; CHECK-MID:   $lr = t2LoopDec killed renamable $lr, 1
-; CHECK-MID:   t2LoopEnd renamable $lr, %bb.2, implicit-def dead $cpsr
-; CHECk-MID:   tB %bb.3
-; CHECK-MID: bb.3.while.end:
+; CHECK-MID: bb.1.while.body:
+; CHECK-MID:   renamable $lr = t2LoopEndDec killed renamable $lr, %bb.1
+; CHECk-MID:   tB %bb.2
+; CHECK-MID: bb.2.while.end:
 define void @check_negated_xor_wls(i16* nocapture %a, i16* nocapture readonly %b, i32 %N) {
 entry:
-  %wls = call i1 @llvm.test.set.loop.iterations.i32(i32 %N)
-  %xor = xor i1 %wls, 1
+  %wls = call {i32, i1} @llvm.test.start.loop.iterations.i32(i32 %N)
+  %wls0 = extractvalue {i32, i1} %wls, 0
+  %wls1 = extractvalue {i32, i1} %wls, 1
+  %xor = xor i1 %wls1, 1
   br i1 %xor, label %while.end, label %while.body.preheader
 
 while.body.preheader:
@@ -421,7 +422,7 @@ while.body.preheader:
 while.body:
   %a.addr.06 = phi i16* [ %incdec.ptr1, %while.body ], [ %a, %while.body.preheader ]
   %b.addr.05 = phi i16* [ %incdec.ptr, %while.body ], [ %b, %while.body.preheader ]
-  %count = phi i32 [ %N, %while.body.preheader ], [ %count.next, %while.body ]
+  %count = phi i32 [ %wls0, %while.body.preheader ], [ %count.next, %while.body ]
   %incdec.ptr = getelementptr inbounds i16, i16* %b.addr.05, i32 1
   %ld.b = load i16, i16* %b.addr.05, align 2
   %incdec.ptr1 = getelementptr inbounds i16, i16* %a.addr.06, i32 1
@@ -435,17 +436,18 @@ while.end:
 }
 
 ; CHECK-MID: check_negated_cmp_wls
-; CHECK-MID:   t2WhileLoopStart $r2, %bb.3
+; CHECK-MID:   $lr = t2WhileLoopStartLR killed renamable $r2
 ; CHECK-MID:   tB %bb.1
-; CHECK-MID: bb.1.while.body.preheader:
-; CHECK-MID:   $lr = t2LoopDec killed renamable $lr, 1
-; CHECK-MID:   t2LoopEnd renamable $lr, %bb.2
-; CHECk-MID:   tB %bb.3
-; CHECK-MID: bb.3.while.end:
+; CHECK-MID: bb.1.while.body:
+; CHECK-MID:   renamable $lr = t2LoopEndDec killed renamable $lr, %bb.1
+; CHECk-MID:   tB %bb.2
+; CHECK-MID: bb.2.while.end:
 define void @check_negated_cmp_wls(i16* nocapture %a, i16* nocapture readonly %b, i32 %N) {
 entry:
-  %wls = call i1 @llvm.test.set.loop.iterations.i32(i32 %N)
-  %cmp = icmp ne i1 %wls, 1
+  %wls = call {i32, i1} @llvm.test.start.loop.iterations.i32(i32 %N)
+  %wls0 = extractvalue {i32, i1} %wls, 0
+  %wls1 = extractvalue {i32, i1} %wls, 1
+  %cmp = icmp ne i1 %wls1, 1
   br i1 %cmp, label %while.end, label %while.body.preheader
 
 while.body.preheader:
@@ -454,7 +456,7 @@ while.body.preheader:
 while.body:
   %a.addr.06 = phi i16* [ %incdec.ptr1, %while.body ], [ %a, %while.body.preheader ]
   %b.addr.05 = phi i16* [ %incdec.ptr, %while.body ], [ %b, %while.body.preheader ]
-  %count = phi i32 [ %N, %while.body.preheader ], [ %count.next, %while.body ]
+  %count = phi i32 [ %wls0, %while.body.preheader ], [ %count.next, %while.body ]
   %incdec.ptr = getelementptr inbounds i16, i16* %b.addr.05, i32 1
   %ld.b = load i16, i16* %b.addr.05, align 2
   %incdec.ptr1 = getelementptr inbounds i16, i16* %a.addr.06, i32 1
@@ -468,11 +470,10 @@ while.end:
 }
 
 ; CHECK-MID: check_negated_reordered_wls
-; CHECK-MID:   t2WhileLoopStart killed $r2, %bb.2
+; CHECK-MID:   $lr = t2WhileLoopStartLR killed renamable $r2
 ; CHECK-MID:   tB %bb.1
 ; CHECK-MID: bb.1.while.body:
-; CHECK-MID:   $lr = t2LoopDec killed renamable $lr, 1
-; CHECK-MID:   t2LoopEnd renamable $lr, %bb.1
+; CHECK-MID:   renamable $lr = t2LoopEndDec killed renamable $lr, %bb.1
 ; CHECk-MID:   tB %bb.2
 ; CHECK-MID: bb.2.while.end:
 define void @check_negated_reordered_wls(i16* nocapture %a, i16* nocapture readonly %b, i32 %N) {
@@ -485,7 +486,7 @@ while.body.preheader:
 while.body:
   %a.addr.06 = phi i16* [ %incdec.ptr1, %while.body ], [ %a, %while.body.preheader ]
   %b.addr.05 = phi i16* [ %incdec.ptr, %while.body ], [ %b, %while.body.preheader ]
-  %count = phi i32 [ %N, %while.body.preheader ], [ %count.next, %while.body ]
+  %count = phi i32 [ %wls0, %while.body.preheader ], [ %count.next, %while.body ]
   %incdec.ptr = getelementptr inbounds i16, i16* %b.addr.05, i32 1
   %ld.b = load i16, i16* %b.addr.05, align 2
   %incdec.ptr1 = getelementptr inbounds i16, i16* %a.addr.06, i32 1
@@ -495,8 +496,10 @@ while.body:
   br i1 %cmp, label %while.body, label %while.end
 
 while:
-  %wls = call i1 @llvm.test.set.loop.iterations.i32(i32 %N)
-  %xor = xor i1 %wls, 1
+  %wls = call {i32, i1} @llvm.test.start.loop.iterations.i32(i32 %N)
+  %wls0 = extractvalue {i32, i1} %wls, 0
+  %wls1 = extractvalue {i32, i1} %wls, 1
+  %xor = xor i1 %wls1, 1
   br i1 %xor, label %while.end, label %while.body.preheader
 
 while.end:
@@ -504,5 +507,5 @@ while.end:
 }
 
 declare i32 @llvm.start.loop.iterations.i32(i32)
-declare i1 @llvm.test.set.loop.iterations.i32(i32)
+declare {i32, i1} @llvm.test.start.loop.iterations.i32(i32)
 declare i32 @llvm.loop.decrement.reg.i32(i32, i32)

diff  --git a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/fast-fp-loops.ll b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/fast-fp-loops.ll
index c9df1bb2e2eb..5f9838e3e634 100644
--- a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/fast-fp-loops.ll
+++ b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/fast-fp-loops.ll
@@ -25,7 +25,7 @@ define arm_aapcs_vfpcc void @fast_float_mul(float* nocapture %a, float* nocaptur
 ; CHECK-NEXT:    beq .LBB0_4
 ; CHECK-NEXT:  @ %bb.2: @ %for.body.preheader
 ; CHECK-NEXT:    subs r5, r3, #1
-; CHECK-NEXT:    and lr, r3, #3
+; CHECK-NEXT:    and r12, r3, #3
 ; CHECK-NEXT:    cmp r5, #3
 ; CHECK-NEXT:    bhs .LBB0_6
 ; CHECK-NEXT:  @ %bb.3:
@@ -44,7 +44,7 @@ define arm_aapcs_vfpcc void @fast_float_mul(float* nocapture %a, float* nocaptur
 ; CHECK-NEXT:    letp lr, .LBB0_5
 ; CHECK-NEXT:    b .LBB0_11
 ; CHECK-NEXT:  .LBB0_6: @ %for.body.preheader.new
-; CHECK-NEXT:    sub.w r12, r3, lr
+; CHECK-NEXT:    sub.w lr, r3, r12
 ; CHECK-NEXT:    movs r4, #0
 ; CHECK-NEXT:    movs r3, #0
 ; CHECK-NEXT:  .LBB0_7: @ %for.body
@@ -56,7 +56,7 @@ define arm_aapcs_vfpcc void @fast_float_mul(float* nocapture %a, float* nocaptur
 ; CHECK-NEXT:    vldr s0, [r5]
 ; CHECK-NEXT:    adds r4, #16
 ; CHECK-NEXT:    vldr s2, [r6]
-; CHECK-NEXT:    cmp r12, r3
+; CHECK-NEXT:    cmp lr, r3
 ; CHECK-NEXT:    vmul.f32 s0, s2, s0
 ; CHECK-NEXT:    vstr s0, [r7]
 ; CHECK-NEXT:    vldr s0, [r5, #4]
@@ -73,7 +73,7 @@ define arm_aapcs_vfpcc void @fast_float_mul(float* nocapture %a, float* nocaptur
 ; CHECK-NEXT:    vstr s0, [r7, #12]
 ; CHECK-NEXT:    bne .LBB0_7
 ; CHECK-NEXT:  .LBB0_8: @ %for.cond.cleanup.loopexit.unr-lcssa
-; CHECK-NEXT:    wls lr, lr, .LBB0_11
+; CHECK-NEXT:    wls lr, r12, .LBB0_11
 ; CHECK-NEXT:  @ %bb.9: @ %for.body.epil.preheader
 ; CHECK-NEXT:    add.w r1, r1, r3, lsl #2
 ; CHECK-NEXT:    add.w r2, r2, r3, lsl #2

diff  --git a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/loop-guards.ll b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/loop-guards.ll
index 9d46303946e9..b1efc91cdee9 100644
--- a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/loop-guards.ll
+++ b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/loop-guards.ll
@@ -179,13 +179,11 @@ if.end:                                           ; preds = %do.body, %entry
   ret void
 }
 
-; TODO: Remove the tMOVr in the preheader!
 ; CHECK: ne_trip_count
 ; CHECK: body:
 ; CHECK: bb.0.entry:
-; CHECK:   $lr = t2WLS $r3, %bb.3
+; CHECK:   $lr = t2WLS killed renamable $r3, %bb.3
 ; CHECK: bb.1.do.body.preheader:
-; CHECK:   $lr = tMOVr
 ; CHECK: bb.2.do.body:
 ; CHECK:   $lr = t2LEUpdate killed renamable $lr, %bb.2
 define void @ne_trip_count(i1 zeroext %t1, i32* nocapture %a, i32* nocapture readonly %b, i32 %N) {

diff  --git a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/mve-float-loops.ll b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/mve-float-loops.ll
index b79d33e55bb2..752486a8cb33 100644
--- a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/mve-float-loops.ll
+++ b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/mve-float-loops.ll
@@ -33,8 +33,8 @@ define arm_aapcs_vfpcc void @float_float_mul(float* nocapture readonly %a, float
 ; CHECK-NEXT:  .LBB0_4: @ %for.body.preheader22
 ; CHECK-NEXT:    mvn.w r7, r12
 ; CHECK-NEXT:    adds r4, r7, r3
-; CHECK-NEXT:    and lr, r3, #3
-; CHECK-NEXT:    wls lr, lr, .LBB0_7
+; CHECK-NEXT:    and r7, r3, #3
+; CHECK-NEXT:    wls lr, r7, .LBB0_7
 ; CHECK-NEXT:  @ %bb.5: @ %for.body.prol.preheader
 ; CHECK-NEXT:    add.w r5, r0, r12, lsl #2
 ; CHECK-NEXT:    add.w r6, r1, r12, lsl #2
@@ -246,8 +246,8 @@ define arm_aapcs_vfpcc void @float_float_add(float* nocapture readonly %a, float
 ; CHECK-NEXT:  .LBB1_4: @ %for.body.preheader22
 ; CHECK-NEXT:    mvn.w r7, r12
 ; CHECK-NEXT:    adds r4, r7, r3
-; CHECK-NEXT:    and lr, r3, #3
-; CHECK-NEXT:    wls lr, lr, .LBB1_7
+; CHECK-NEXT:    and r7, r3, #3
+; CHECK-NEXT:    wls lr, r7, .LBB1_7
 ; CHECK-NEXT:  @ %bb.5: @ %for.body.prol.preheader
 ; CHECK-NEXT:    add.w r5, r0, r12, lsl #2
 ; CHECK-NEXT:    add.w r6, r1, r12, lsl #2
@@ -459,8 +459,8 @@ define arm_aapcs_vfpcc void @float_float_sub(float* nocapture readonly %a, float
 ; CHECK-NEXT:  .LBB2_4: @ %for.body.preheader22
 ; CHECK-NEXT:    mvn.w r7, r12
 ; CHECK-NEXT:    adds r4, r7, r3
-; CHECK-NEXT:    and lr, r3, #3
-; CHECK-NEXT:    wls lr, lr, .LBB2_7
+; CHECK-NEXT:    and r7, r3, #3
+; CHECK-NEXT:    wls lr, r7, .LBB2_7
 ; CHECK-NEXT:  @ %bb.5: @ %for.body.prol.preheader
 ; CHECK-NEXT:    add.w r5, r0, r12, lsl #2
 ; CHECK-NEXT:    add.w r6, r1, r12, lsl #2
@@ -681,8 +681,8 @@ define arm_aapcs_vfpcc void @float_int_mul(float* nocapture readonly %a, i32* no
 ; CHECK-NEXT:  .LBB3_7: @ %for.body.preheader16
 ; CHECK-NEXT:    mvn.w r7, r12
 ; CHECK-NEXT:    add.w r8, r7, r3
-; CHECK-NEXT:    and lr, r3, #3
-; CHECK-NEXT:    wls lr, lr, .LBB3_10
+; CHECK-NEXT:    and r7, r3, #3
+; CHECK-NEXT:    wls lr, r7, .LBB3_10
 ; CHECK-NEXT:  @ %bb.8: @ %for.body.prol.preheader
 ; CHECK-NEXT:    add.w r5, r0, r12, lsl #2
 ; CHECK-NEXT:    add.w r6, r1, r12, lsl #2
@@ -1424,7 +1424,7 @@ define arm_aapcs_vfpcc float @half_half_mac(half* nocapture readonly %a, half* n
 ; CHECK-NEXT:    cbz r2, .LBB9_3
 ; CHECK-NEXT:  @ %bb.1: @ %for.body.preheader
 ; CHECK-NEXT:    subs r3, r2, #1
-; CHECK-NEXT:    and lr, r2, #3
+; CHECK-NEXT:    and r12, r2, #3
 ; CHECK-NEXT:    vldr s0, .LCPI9_0
 ; CHECK-NEXT:    cmp r3, #3
 ; CHECK-NEXT:    bhs .LBB9_4
@@ -1435,7 +1435,7 @@ define arm_aapcs_vfpcc float @half_half_mac(half* nocapture readonly %a, half* n
 ; CHECK-NEXT:    vldr s0, .LCPI9_0
 ; CHECK-NEXT:    b .LBB9_9
 ; CHECK-NEXT:  .LBB9_4: @ %for.body.preheader.new
-; CHECK-NEXT:    sub.w r12, r2, lr
+; CHECK-NEXT:    sub.w lr, r2, r12
 ; CHECK-NEXT:    movs r3, #0
 ; CHECK-NEXT:    movs r2, #0
 ; CHECK-NEXT:  .LBB9_5: @ %for.body
@@ -1459,7 +1459,7 @@ define arm_aapcs_vfpcc float @half_half_mac(half* nocapture readonly %a, half* n
 ; CHECK-NEXT:    vcvtb.f32.f16 s6, s6
 ; CHECK-NEXT:    adds r3, #8
 ; CHECK-NEXT:    vmul.f16 s8, s10, s8
-; CHECK-NEXT:    cmp r12, r2
+; CHECK-NEXT:    cmp lr, r2
 ; CHECK-NEXT:    vcvtb.f32.f16 s8, s8
 ; CHECK-NEXT:    vadd.f32 s0, s0, s8
 ; CHECK-NEXT:    vadd.f32 s0, s0, s6
@@ -1467,7 +1467,7 @@ define arm_aapcs_vfpcc float @half_half_mac(half* nocapture readonly %a, half* n
 ; CHECK-NEXT:    vadd.f32 s0, s0, s2
 ; CHECK-NEXT:    bne .LBB9_5
 ; CHECK-NEXT:  .LBB9_6: @ %for.cond.cleanup.loopexit.unr-lcssa
-; CHECK-NEXT:    wls lr, lr, .LBB9_9
+; CHECK-NEXT:    wls lr, r12, .LBB9_9
 ; CHECK-NEXT:  @ %bb.7: @ %for.body.epil.preheader
 ; CHECK-NEXT:    add.w r0, r0, r2, lsl #1
 ; CHECK-NEXT:    add.w r1, r1, r2, lsl #1
@@ -1576,7 +1576,7 @@ define arm_aapcs_vfpcc float @half_half_acc(half* nocapture readonly %a, half* n
 ; CHECK-NEXT:    cbz r2, .LBB10_3
 ; CHECK-NEXT:  @ %bb.1: @ %for.body.preheader
 ; CHECK-NEXT:    subs r3, r2, #1
-; CHECK-NEXT:    and lr, r2, #3
+; CHECK-NEXT:    and r12, r2, #3
 ; CHECK-NEXT:    vldr s0, .LCPI10_0
 ; CHECK-NEXT:    cmp r3, #3
 ; CHECK-NEXT:    bhs .LBB10_4
@@ -1587,7 +1587,7 @@ define arm_aapcs_vfpcc float @half_half_acc(half* nocapture readonly %a, half* n
 ; CHECK-NEXT:    vldr s0, .LCPI10_0
 ; CHECK-NEXT:    b .LBB10_9
 ; CHECK-NEXT:  .LBB10_4: @ %for.body.preheader.new
-; CHECK-NEXT:    sub.w r12, r2, lr
+; CHECK-NEXT:    sub.w lr, r2, r12
 ; CHECK-NEXT:    movs r3, #0
 ; CHECK-NEXT:    movs r2, #0
 ; CHECK-NEXT:  .LBB10_5: @ %for.body
@@ -1611,7 +1611,7 @@ define arm_aapcs_vfpcc float @half_half_acc(half* nocapture readonly %a, half* n
 ; CHECK-NEXT:    vcvtb.f32.f16 s6, s6
 ; CHECK-NEXT:    adds r3, #8
 ; CHECK-NEXT:    vadd.f16 s8, s10, s8
-; CHECK-NEXT:    cmp r12, r2
+; CHECK-NEXT:    cmp lr, r2
 ; CHECK-NEXT:    vcvtb.f32.f16 s8, s8
 ; CHECK-NEXT:    vadd.f32 s0, s0, s8
 ; CHECK-NEXT:    vadd.f32 s0, s0, s6
@@ -1619,7 +1619,7 @@ define arm_aapcs_vfpcc float @half_half_acc(half* nocapture readonly %a, half* n
 ; CHECK-NEXT:    vadd.f32 s0, s0, s2
 ; CHECK-NEXT:    bne .LBB10_5
 ; CHECK-NEXT:  .LBB10_6: @ %for.cond.cleanup.loopexit.unr-lcssa
-; CHECK-NEXT:    wls lr, lr, .LBB10_9
+; CHECK-NEXT:    wls lr, r12, .LBB10_9
 ; CHECK-NEXT:  @ %bb.7: @ %for.body.epil.preheader
 ; CHECK-NEXT:    add.w r0, r0, r2, lsl #1
 ; CHECK-NEXT:    add.w r1, r1, r2, lsl #1
@@ -1728,7 +1728,7 @@ define arm_aapcs_vfpcc float @half_short_mac(half* nocapture readonly %a, i16* n
 ; CHECK-NEXT:    cbz r2, .LBB11_3
 ; CHECK-NEXT:  @ %bb.1: @ %for.body.preheader
 ; CHECK-NEXT:    subs r3, r2, #1
-; CHECK-NEXT:    and lr, r2, #3
+; CHECK-NEXT:    and r12, r2, #3
 ; CHECK-NEXT:    vldr s0, .LCPI11_0
 ; CHECK-NEXT:    cmp r3, #3
 ; CHECK-NEXT:    bhs .LBB11_4
@@ -1739,7 +1739,7 @@ define arm_aapcs_vfpcc float @half_short_mac(half* nocapture readonly %a, i16* n
 ; CHECK-NEXT:    vldr s0, .LCPI11_0
 ; CHECK-NEXT:    b .LBB11_9
 ; CHECK-NEXT:  .LBB11_4: @ %for.body.preheader.new
-; CHECK-NEXT:    sub.w r12, r2, lr
+; CHECK-NEXT:    sub.w lr, r2, r12
 ; CHECK-NEXT:    adds r3, r1, #4
 ; CHECK-NEXT:    adds r4, r0, #4
 ; CHECK-NEXT:    movs r2, #0
@@ -1748,7 +1748,7 @@ define arm_aapcs_vfpcc float @half_short_mac(half* nocapture readonly %a, i16* n
 ; CHECK-NEXT:    ldrsh.w r5, [r3, #2]
 ; CHECK-NEXT:    vldr.16 s2, [r4, #2]
 ; CHECK-NEXT:    adds r2, #4
-; CHECK-NEXT:    cmp r12, r2
+; CHECK-NEXT:    cmp lr, r2
 ; CHECK-NEXT:    vmov s4, r5
 ; CHECK-NEXT:    ldrsh r5, [r3], #8
 ; CHECK-NEXT:    vcvt.f16.s32 s4, s4
@@ -1778,7 +1778,7 @@ define arm_aapcs_vfpcc float @half_short_mac(half* nocapture readonly %a, i16* n
 ; CHECK-NEXT:    vadd.f32 s0, s0, s2
 ; CHECK-NEXT:    bne .LBB11_5
 ; CHECK-NEXT:  .LBB11_6: @ %for.cond.cleanup.loopexit.unr-lcssa
-; CHECK-NEXT:    wls lr, lr, .LBB11_9
+; CHECK-NEXT:    wls lr, r12, .LBB11_9
 ; CHECK-NEXT:  @ %bb.7: @ %for.body.epil.preheader
 ; CHECK-NEXT:    add.w r0, r0, r2, lsl #1
 ; CHECK-NEXT:    add.w r1, r1, r2, lsl #1

diff  --git a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/predicated-liveout.mir b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/predicated-liveout.mir
index 3a098f272cc0..8d0c21c5b612 100644
--- a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/predicated-liveout.mir
+++ b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/predicated-liveout.mir
@@ -117,7 +117,7 @@ body:             |
     frame-setup CFI_INSTRUCTION offset $r7, -8
     renamable $r3, dead $cpsr = tADDi3 renamable $r2, 7, 14, $noreg
     renamable $lr = t2LSRri killed renamable $r3, 3, 14, $noreg, $noreg
-    t2WhileLoopStart renamable $lr, %bb.4, implicit-def dead $cpsr
+    $lr = t2WhileLoopStartLR renamable $lr, %bb.4, implicit-def dead $cpsr
     tB %bb.1, 14, $noreg
 
   bb.1.for.body.preheader:

diff  --git a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/revert-non-loop.mir b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/revert-non-loop.mir
index 8f195cefa50e..5c8639eaa76e 100644
--- a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/revert-non-loop.mir
+++ b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/revert-non-loop.mir
@@ -4,7 +4,7 @@
 # CHECK: bb.0.entry:
 # CHECK:   tBcc %bb.2, 3
 # CHECK: bb.1.not.preheader:
-# CHECK:   t2CMPri renamable $lr, 0, 14
+# CHECK:   $lr = t2SUBri killed renamable $lr, 0, 14
 # CHECK:   tBcc %bb.4, 0
 # CHECK:   tB %bb.2
 # CHECK: bb.3.while.body:
@@ -119,7 +119,7 @@ body:             |
     successors: %bb.2(0x40000000), %bb.4(0x40000000)
     liveins: $lr, $r0, $r1
   
-    t2WhileLoopStart renamable $lr, %bb.4, implicit-def dead $cpsr
+    $lr = t2WhileLoopStartLR killed renamable $lr, %bb.4, implicit-def dead $cpsr
     tB %bb.2, 14, $noreg
   
   bb.2.while.body.preheader:

diff  --git a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/revert-while.mir b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/revert-while.mir
index 8126e0b8257f..d4fa91bf3718 100644
--- a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/revert-while.mir
+++ b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/revert-while.mir
@@ -102,7 +102,7 @@ body:             |
   ; CHECK:   frame-setup CFI_INSTRUCTION def_cfa_offset 8
   ; CHECK:   frame-setup CFI_INSTRUCTION offset $lr, -4
   ; CHECK:   frame-setup CFI_INSTRUCTION offset $r7, -8
-  ; CHECK:   t2CMPri $r3, 0, 14 /* CC::al */, $noreg, implicit-def $cpsr
+  ; CHECK:   dead $lr = t2SUBri $r3, 0, 14 /* CC::al */, $noreg, def $cpsr
   ; CHECK:   t2Bcc %bb.3, 0 /* CC::eq */, killed $cpsr
   ; CHECK:   tB %bb.1, 14 /* CC::al */, $noreg
   ; CHECK: bb.1.do.body.preheader:
@@ -130,7 +130,7 @@ body:             |
     frame-setup CFI_INSTRUCTION def_cfa_offset 8
     frame-setup CFI_INSTRUCTION offset $lr, -4
     frame-setup CFI_INSTRUCTION offset $r7, -8
-    t2WhileLoopStart $r3, %bb.3, implicit-def dead $cpsr
+    $lr = t2WhileLoopStartLR $r3, %bb.3, implicit-def dead $cpsr
     tB %bb.1, 14, $noreg
 
   bb.1.do.body.preheader:

diff  --git a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/vmaxmin_vpred_r.mir b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/vmaxmin_vpred_r.mir
index bedb54227455..9fe20c5ed911 100644
--- a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/vmaxmin_vpred_r.mir
+++ b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/vmaxmin_vpred_r.mir
@@ -188,7 +188,7 @@ body:             |
     renamable $r12 = t2LDRi12 $sp, 44, 14, $noreg :: (load 4 from %fixed-stack.0, align 8)
     renamable $r5 = t2ADDri renamable $r12, 3, 14, $noreg, $noreg
     renamable $lr = t2LSRri killed renamable $r5, 2, 14, $noreg, $noreg
-    t2WhileLoopStart renamable $lr, %bb.3, implicit-def dead $cpsr
+    $lr = t2WhileLoopStartLR renamable $lr, %bb.3, implicit-def dead $cpsr
     tB %bb.1, 14, $noreg
 
   bb.1.for.body.lr.ph:

diff  --git a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/vmldava_in_vpt.mir b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/vmldava_in_vpt.mir
index d392baf542f6..8845cf8d7f45 100644
--- a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/vmldava_in_vpt.mir
+++ b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/vmldava_in_vpt.mir
@@ -6,8 +6,10 @@
   entry:
     %add = add i32 %block_size, 3
     %div = lshr i32 %add, 2
-    %0 = call i1 @llvm.test.set.loop.iterations.i32(i32 %div)
-    br i1 %0, label %for.body.lr.ph, label %for.cond.cleanup
+    %0 = call { i32, i1 } @llvm.test.start.loop.iterations.i32(i32 %div)
+    %wls0 = extractvalue { i32, i1 } %0, 0
+    %wls1 = extractvalue { i32, i1 } %0, 1
+    br i1 %wls1, label %for.body.lr.ph, label %for.cond.cleanup
 
   for.body.lr.ph:                                   ; preds = %entry
     %.splatinsert.i41 = insertelement <4 x i32> undef, i32 %out_activation_min, i32 0
@@ -21,7 +23,7 @@
     ret i32 %res
 
   for.body:                                         ; preds = %for.body, %for.body.lr.ph
-    %lsr.iv = phi i32 [ %lsr.iv.next, %for.body ], [ %div, %for.body.lr.ph ]
+    %lsr.iv = phi i32 [ %iv.next, %for.body ], [ %wls0, %for.body.lr.ph ]
     %input_1_vect.addr.052 = phi i8* [ %input_1_vect, %for.body.lr.ph ], [ %add.ptr, %for.body ]
     %input_2_vect.addr.051 = phi i8* [ %input_2_vect, %for.body.lr.ph ], [ %add.ptr14, %for.body ]
     %num_elements.049 = phi i32 [ %block_size, %for.body.lr.ph ], [ %sub, %for.body ]
@@ -47,9 +49,8 @@
     %add.ptr = getelementptr inbounds i8, i8* %input_1_vect.addr.052, i32 4
     %add.ptr14 = getelementptr inbounds i8, i8* %input_2_vect.addr.051, i32 4
     %sub = add i32 %num_elements.049, -4
-    %iv.next = call i32 @llvm.loop.decrement.reg.i32.i32.i32(i32 %lsr.iv, i32 1)
+    %iv.next = call i32 @llvm.loop.decrement.reg.i32(i32 %lsr.iv, i32 1)
     %cmp = icmp ne i32 %iv.next, 0
-    %lsr.iv.next = add i32 %lsr.iv, -1
     br i1 %cmp, label %for.body, label %for.cond.cleanup
   }
   declare <4 x i1> @llvm.arm.mve.vctp32(i32) #1
@@ -58,8 +59,8 @@
   declare <4 x i32> @llvm.arm.mve.min.predicated.v4i32.v4i1(<4 x i32>, <4 x i32>, i32, <4 x i1>, <4 x i32>) #1
   declare <4 x i32> @llvm.arm.mve.max.predicated.v4i32.v4i1(<4 x i32>, <4 x i32>, i32, <4 x i1>, <4 x i32>) #1
   declare i32 @llvm.arm.mve.vmldava.predicated.v4i32.v4i1(i32, i32, i32, i32, <4 x i32>, <4 x i32>, <4 x i1>) #1
-  declare i1 @llvm.test.set.loop.iterations.i32(i32) #4
-  declare i32 @llvm.loop.decrement.reg.i32.i32.i32(i32, i32) #4
+  declare { i32, i1 } @llvm.test.start.loop.iterations.i32(i32) #4
+  declare i32 @llvm.loop.decrement.reg.i32(i32, i32) #4
 ...
 ---
 name:            vmldava_in_vpt
@@ -82,7 +83,7 @@ frameInfo:
   isReturnAddressTaken: false
   hasStackMap:     false
   hasPatchPoint:   false
-  stackSize:       20
+  stackSize:       16
   offsetAdjustment: 0
   maxAlignment:    4
   adjustsStack:    false
@@ -120,117 +121,109 @@ stack:
       stack-id: default, callee-saved-register: '$lr', callee-saved-restored: false,
       debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
   - { id: 1, name: '', type: spill-slot, offset: -8, size: 4, alignment: 4,
-      stack-id: default, callee-saved-register: '$r7', callee-saved-restored: true,
-      debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
-  - { id: 2, name: '', type: spill-slot, offset: -12, size: 4, alignment: 4,
       stack-id: default, callee-saved-register: '$r6', callee-saved-restored: true,
       debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
-  - { id: 3, name: '', type: spill-slot, offset: -16, size: 4, alignment: 4,
+  - { id: 2, name: '', type: spill-slot, offset: -12, size: 4, alignment: 4,
       stack-id: default, callee-saved-register: '$r5', callee-saved-restored: true,
       debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
-  - { id: 4, name: '', type: spill-slot, offset: -20, size: 4, alignment: 4,
+  - { id: 3, name: '', type: spill-slot, offset: -16, size: 4, alignment: 4,
       stack-id: default, callee-saved-register: '$r4', callee-saved-restored: true,
       debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
 callSites:       []
+debugValueSubstitutions: []
 constants:       []
 machineFunctionInfo: {}
 body:             |
   ; CHECK-LABEL: name: vmldava_in_vpt
   ; CHECK: bb.0.entry:
   ; CHECK:   successors: %bb.1(0x40000000), %bb.3(0x40000000)
-  ; CHECK:   liveins: $lr, $r0, $r1, $r2, $r3, $r4, $r5, $r6, $r7
-  ; CHECK:   frame-setup tPUSH 14 /* CC::al */, $noreg, killed $r4, killed $r5, killed $r6, killed $r7, killed $lr, implicit-def $sp, implicit $sp
-  ; CHECK:   frame-setup CFI_INSTRUCTION def_cfa_offset 20
+  ; CHECK:   liveins: $lr, $r0, $r1, $r2, $r3, $r4, $r5, $r6
+  ; CHECK:   frame-setup tPUSH 14 /* CC::al */, $noreg, killed $r4, killed $r5, killed $r6, killed $lr, implicit-def $sp, implicit $sp
+  ; CHECK:   frame-setup CFI_INSTRUCTION def_cfa_offset 16
   ; CHECK:   frame-setup CFI_INSTRUCTION offset $lr, -4
-  ; CHECK:   frame-setup CFI_INSTRUCTION offset $r7, -8
-  ; CHECK:   frame-setup CFI_INSTRUCTION offset $r6, -12
-  ; CHECK:   frame-setup CFI_INSTRUCTION offset $r5, -16
-  ; CHECK:   frame-setup CFI_INSTRUCTION offset $r4, -20
-  ; CHECK:   renamable $r7 = tLDRspi $sp, 10, 14 /* CC::al */, $noreg :: (load 4 from %fixed-stack.5)
+  ; CHECK:   frame-setup CFI_INSTRUCTION offset $r6, -8
+  ; CHECK:   frame-setup CFI_INSTRUCTION offset $r5, -12
+  ; CHECK:   frame-setup CFI_INSTRUCTION offset $r4, -16
+  ; CHECK:   renamable $r4 = tLDRspi $sp, 9, 14 /* CC::al */, $noreg :: (load 4 from %fixed-stack.5)
   ; CHECK:   renamable $r12 = t2MOVi 0, 14 /* CC::al */, $noreg, $noreg
-  ; CHECK:   $lr = MVE_WLSTP_32 killed renamable $r7, %bb.3
+  ; CHECK:   renamable $r5, dead $cpsr = tADDi3 renamable $r4, 3, 14 /* CC::al */, $noreg
+  ; CHECK:   dead renamable $r5, dead $cpsr = tLSRri killed renamable $r5, 2, 14 /* CC::al */, $noreg
+  ; CHECK:   $lr = MVE_WLSTP_32 killed renamable $r4, %bb.3
   ; CHECK: bb.1.for.body.lr.ph:
   ; CHECK:   successors: %bb.2(0x80000000)
   ; CHECK:   liveins: $lr, $r0, $r1, $r2, $r3
-  ; CHECK:   $r5, $r12 = t2LDRDi8 $sp, 32, 14 /* CC::al */, $noreg :: (load 4 from %fixed-stack.3), (load 4 from %fixed-stack.4, align 8)
-  ; CHECK:   renamable $r4 = tLDRspi $sp, 5, 14 /* CC::al */, $noreg :: (load 4 from %fixed-stack.0, align 8)
+  ; CHECK:   renamable $r5 = tLDRspi $sp, 4, 14 /* CC::al */, $noreg :: (load 4 from %fixed-stack.0, align 8)
+  ; CHECK:   $r6, $r12 = t2LDRDi8 $sp, 28, 14 /* CC::al */, $noreg :: (load 4 from %fixed-stack.3), (load 4 from %fixed-stack.4, align 8)
   ; CHECK:   renamable $q0 = MVE_VDUP32 killed renamable $r12, 0, $noreg, undef renamable $q0
-  ; CHECK:   renamable $q1 = MVE_VDUP32 killed renamable $r5, 0, $noreg, undef renamable $q1
+  ; CHECK:   renamable $q1 = MVE_VDUP32 killed renamable $r6, 0, $noreg, undef renamable $q1
   ; CHECK:   renamable $r12 = t2MOVi 0, 14 /* CC::al */, $noreg, $noreg
   ; CHECK: bb.2.for.body:
   ; CHECK:   successors: %bb.2(0x7c000000), %bb.3(0x04000000)
-  ; CHECK:   liveins: $lr, $q0, $q1, $r0, $r1, $r2, $r3, $r4, $r12
+  ; CHECK:   liveins: $lr, $q0, $q1, $r0, $r1, $r2, $r3, $r5, $r12
   ; CHECK:   renamable $r1, renamable $q2 = MVE_VLDRWU32_post killed renamable $r1, 4, 0, $noreg :: (load 16 from %ir.input_2_cast, align 4)
   ; CHECK:   renamable $r0, renamable $q3 = MVE_VLDRWU32_post killed renamable $r0, 4, 0, $noreg :: (load 16 from %ir.input_1_cast, align 4)
   ; CHECK:   renamable $q2 = MVE_VADD_qr_i32 killed renamable $q2, renamable $r3, 0, $noreg, undef renamable $q2
   ; CHECK:   renamable $q3 = MVE_VADD_qr_i32 killed renamable $q3, renamable $r2, 0, $noreg, undef renamable $q3
-  ; CHECK:   renamable $q2 = MVE_VMULi32 killed renamable $q3, killed renamable $q2, 0, $noreg, undef renamable $q2
-  ; CHECK:   renamable $q2 = MVE_VADD_qr_i32 killed renamable $q2, renamable $r4, 0, $noreg, undef renamable $q2
-  ; CHECK:   renamable $q2 = MVE_VMAXu32 killed renamable $q2, renamable $q1, 0, $noreg, undef renamable $q2
+  ; CHECK:   renamable $q3 = MVE_VMLAS_qr_u32 killed renamable $q3, killed renamable $q2, renamable $r5, 0, $noreg
+  ; CHECK:   renamable $q2 = MVE_VMAXu32 killed renamable $q3, renamable $q1, 0, $noreg, undef renamable $q2
   ; CHECK:   renamable $q3 = MVE_VMINu32 renamable $q2, renamable $q0, 0, $noreg, undef renamable $q3
   ; CHECK:   renamable $r12 = MVE_VMLADAVas32 killed renamable $r12, killed renamable $q3, killed renamable $q2, 0, killed $noreg
   ; CHECK:   $lr = MVE_LETP killed renamable $lr, %bb.2
   ; CHECK: bb.3.for.cond.cleanup:
   ; CHECK:   liveins: $r12
   ; CHECK:   $r0 = tMOVr killed $r12, 14 /* CC::al */, $noreg
-  ; CHECK:   tPOP_RET 14 /* CC::al */, $noreg, def $r4, def $r5, def $r6, def $r7, def $pc, implicit killed $r0
+  ; CHECK:   frame-destroy tPOP_RET 14 /* CC::al */, $noreg, def $r4, def $r5, def $r6, def $pc, implicit killed $r0
   bb.0.entry:
     successors: %bb.1(0x40000000), %bb.3(0x40000000)
-    liveins: $r0, $r1, $r2, $r3, $r4, $r5, $r6, $r7, $lr
+    liveins: $r0, $r1, $r2, $r3, $r4, $r5, $r6, $lr
 
-    frame-setup tPUSH 14, $noreg, killed $r4, killed $r5, killed $r6, killed $r7, killed $lr, implicit-def $sp, implicit $sp
-    frame-setup CFI_INSTRUCTION def_cfa_offset 20
+    frame-setup tPUSH 14 /* CC::al */, $noreg, killed $r4, killed $r5, killed $r6, killed $lr, implicit-def $sp, implicit $sp
+    frame-setup CFI_INSTRUCTION def_cfa_offset 16
     frame-setup CFI_INSTRUCTION offset $lr, -4
-    frame-setup CFI_INSTRUCTION offset $r7, -8
-    frame-setup CFI_INSTRUCTION offset $r6, -12
-    frame-setup CFI_INSTRUCTION offset $r5, -16
-    frame-setup CFI_INSTRUCTION offset $r4, -20
-    renamable $r7 = tLDRspi $sp, 10, 14, $noreg :: (load 4 from %fixed-stack.0)
-    renamable $r12 = t2MOVi 0, 14, $noreg, $noreg
-    renamable $r4, dead $cpsr = tADDi3 renamable $r7, 3, 14, $noreg
-    renamable $r5, dead $cpsr = tLSRri killed renamable $r4, 2, 14, $noreg
-    t2WhileLoopStart renamable $r5, %bb.3, implicit-def dead $cpsr
-    tB %bb.1, 14, $noreg
+    frame-setup CFI_INSTRUCTION offset $r6, -8
+    frame-setup CFI_INSTRUCTION offset $r5, -12
+    frame-setup CFI_INSTRUCTION offset $r4, -16
+    renamable $r4 = tLDRspi $sp, 9, 14 /* CC::al */, $noreg :: (load 4 from %fixed-stack.0)
+    renamable $r12 = t2MOVi 0, 14 /* CC::al */, $noreg, $noreg
+    renamable $r5, dead $cpsr = tADDi3 renamable $r4, 3, 14 /* CC::al */, $noreg
+    renamable $r5, dead $cpsr = tLSRri killed renamable $r5, 2, 14 /* CC::al */, $noreg
+    renamable $lr = t2WhileLoopStartLR killed renamable $r5, %bb.3, implicit-def dead $cpsr
+    tB %bb.1, 14 /* CC::al */, $noreg
 
   bb.1.for.body.lr.ph:
     successors: %bb.2(0x80000000)
-    liveins: $r0, $r1, $r2, $r3, $r5, $r7
+    liveins: $lr, $r0, $r1, $r2, $r3, $r4
 
-    $r6 = tMOVr killed $r5, 14, $noreg
-    $r5, $r12 = t2LDRDi8 $sp, 32, 14, $noreg :: (load 4 from %fixed-stack.2), (load 4 from %fixed-stack.1, align 8)
-    renamable $r4 = tLDRspi $sp, 5, 14, $noreg :: (load 4 from %fixed-stack.5, align 8)
+    renamable $r5 = tLDRspi $sp, 4, 14 /* CC::al */, $noreg :: (load 4 from %fixed-stack.5, align 8)
+    $r6, $r12 = t2LDRDi8 $sp, 28, 14 /* CC::al */, $noreg :: (load 4 from %fixed-stack.2), (load 4 from %fixed-stack.1, align 8)
     renamable $q0 = MVE_VDUP32 killed renamable $r12, 0, $noreg, undef renamable $q0
-    renamable $q1 = MVE_VDUP32 killed renamable $r5, 0, $noreg, undef renamable $q1
-    renamable $r12 = t2MOVi 0, 14, $noreg, $noreg
+    renamable $q1 = MVE_VDUP32 killed renamable $r6, 0, $noreg, undef renamable $q1
+    renamable $r12 = t2MOVi 0, 14 /* CC::al */, $noreg, $noreg
 
   bb.2.for.body:
     successors: %bb.2(0x7c000000), %bb.3(0x04000000)
-    liveins: $q0, $q1, $r0, $r1, $r2, $r3, $r4, $r6, $r7, $r12
+    liveins: $lr, $q0, $q1, $r0, $r1, $r2, $r3, $r4, $r5, $r12
 
-    renamable $vpr = MVE_VCTP32 renamable $r7, 0, $noreg
+    renamable $vpr = MVE_VCTP32 renamable $r4, 0, $noreg
     MVE_VPST 8, implicit $vpr
     renamable $r1, renamable $q2 = MVE_VLDRWU32_post killed renamable $r1, 4, 1, renamable $vpr :: (load 16 from %ir.input_2_cast, align 4)
     MVE_VPST 8, implicit $vpr
     renamable $r0, renamable $q3 = MVE_VLDRWU32_post killed renamable $r0, 4, 1, renamable $vpr :: (load 16 from %ir.input_1_cast, align 4)
     renamable $q2 = MVE_VADD_qr_i32 killed renamable $q2, renamable $r3, 0, $noreg, undef renamable $q2
     renamable $q3 = MVE_VADD_qr_i32 killed renamable $q3, renamable $r2, 0, $noreg, undef renamable $q3
-    $lr = tMOVr $r6, 14, $noreg
-    renamable $q2 = MVE_VMULi32 killed renamable $q3, killed renamable $q2, 0, $noreg, undef renamable $q2
-    renamable $r6, dead $cpsr = tSUBi8 killed $r6, 1, 14, $noreg
-    renamable $q2 = MVE_VADD_qr_i32 killed renamable $q2, renamable $r4, 0, $noreg, undef renamable $q2
-    renamable $r7, dead $cpsr = tSUBi8 killed renamable $r7, 4, 14, $noreg
+    renamable $r4, dead $cpsr = tSUBi8 killed renamable $r4, 4, 14 /* CC::al */, $noreg
+    renamable $q3 = MVE_VMLAS_qr_u32 killed renamable $q3, killed renamable $q2, renamable $r5, 0, $noreg
     MVE_VPST 2, implicit $vpr
-    renamable $q2 = MVE_VMAXu32 killed renamable $q2, renamable $q1, 1, renamable $vpr, undef renamable $q2
+    renamable $q2 = MVE_VMAXu32 killed renamable $q3, renamable $q1, 1, renamable $vpr, undef renamable $q2
     renamable $q3 = MVE_VMINu32 renamable $q2, renamable $q0, 1, renamable $vpr, undef renamable $q3
     renamable $r12 = MVE_VMLADAVas32 killed renamable $r12, killed renamable $q3, killed renamable $q2, 1, killed renamable $vpr
-    renamable $lr = t2LoopDec killed renamable $lr, 1
-    t2LoopEnd killed renamable $lr, %bb.2, implicit-def dead $cpsr
-    tB %bb.3, 14, $noreg
+    renamable $lr = t2LoopEndDec killed renamable $lr, %bb.2, implicit-def dead $cpsr
+    tB %bb.3, 14 /* CC::al */, $noreg
 
   bb.3.for.cond.cleanup:
     liveins: $r12
 
-    $r0 = tMOVr killed $r12, 14, $noreg
-    tPOP_RET 14, $noreg, def $r4, def $r5, def $r6, def $r7, def $pc, implicit killed $r0
+    $r0 = tMOVr killed $r12, 14 /* CC::al */, $noreg
+    frame-destroy tPOP_RET 14 /* CC::al */, $noreg, def $r4, def $r5, def $r6, def $pc, implicit killed $r0
 
 ...

diff  --git a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/while-loops.ll b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/while-loops.ll
index 888337c12365..52690fc11f33 100644
--- a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/while-loops.ll
+++ b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/while-loops.ll
@@ -164,81 +164,75 @@ define dso_local i32 @b(i32* %c, i32 %d, i32 %e) "frame-pointer"="all" {
 ; CHECK-NEXT:    push.w {r8, r9, r10, r11}
 ; CHECK-NEXT:    .pad #8
 ; CHECK-NEXT:    sub sp, #8
-; CHECK-NEXT:    str r1, [sp, #4] @ 4-byte Spill
-; CHECK-NEXT:    cmp.w r1, #0
-; CHECK-NEXT:    beq .LBB2_3
-; CHECK-NEXT:    b .LBB2_1
-; CHECK-NEXT:  .LBB2_1: @ %while.body.preheader
+; CHECK-NEXT:    wls lr, r1, .LBB2_3
+; CHECK-NEXT:  @ %bb.1: @ %while.body.preheader
 ; CHECK-NEXT:    adds r1, r0, #4
-; CHECK-NEXT:    mov r3, r2
-; CHECK-NEXT:    mvn r2, #1
+; CHECK-NEXT:    mvn r3, #1
 ; CHECK-NEXT:    @ implicit-def: $r9
 ; CHECK-NEXT:    @ implicit-def: $r10
 ; CHECK-NEXT:    @ implicit-def: $r6
-; CHECK-NEXT:    @ implicit-def: $r8
-; CHECK-NEXT:    str r3, [sp] @ 4-byte Spill
+; CHECK-NEXT:    @ implicit-def: $r4
+; CHECK-NEXT:    str r2, [sp] @ 4-byte Spill
 ; CHECK-NEXT:  .LBB2_2: @ %while.body
 ; CHECK-NEXT:    @ =>This Inner Loop Header: Depth=1
-; CHECK-NEXT:    mov lr, r1
+; CHECK-NEXT:    str r1, [sp, #4] @ 4-byte Spill
+; CHECK-NEXT:    ldr r1, [sp, #4] @ 4-byte Reload
+; CHECK-NEXT:    ldr.w r8, [r10]
 ; CHECK-NEXT:    ldr r1, [r1, #-4]
+; CHECK-NEXT:    mul r11, r8, r0
+; CHECK-NEXT:    adds r0, #4
 ; CHECK-NEXT:    mul r1, r1, r9
 ; CHECK-NEXT:    adds.w r12, r1, #-2147483648
 ; CHECK-NEXT:    asr.w r5, r1, #31
-; CHECK-NEXT:    ldr.w r1, [r10]
+; CHECK-NEXT:    add.w r1, r11, #-2147483648
 ; CHECK-NEXT:    adc r5, r5, #0
-; CHECK-NEXT:    mul r11, r1, r0
-; CHECK-NEXT:    adds r0, #4
-; CHECK-NEXT:    add.w r3, r11, #-2147483648
-; CHECK-NEXT:    asrl r12, r5, r3
-; CHECK-NEXT:    smull r4, r3, r1, r12
-; CHECK-NEXT:    lsll r4, r3, #30
-; CHECK-NEXT:    asrs r5, r3, #31
-; CHECK-NEXT:    mov r4, r3
-; CHECK-NEXT:    lsll r4, r5, r1
-; CHECK-NEXT:    lsll r4, r5, #30
-; CHECK-NEXT:    ldrd r4, r11, [r2]
-; CHECK-NEXT:    asrs r3, r5, #31
+; CHECK-NEXT:    asrl r12, r5, r1
+; CHECK-NEXT:    smull r2, r1, r8, r12
+; CHECK-NEXT:    lsll r2, r1, #30
+; CHECK-NEXT:    asrs r5, r1, #31
+; CHECK-NEXT:    mov r2, r1
+; CHECK-NEXT:    lsll r2, r5, r8
+; CHECK-NEXT:    lsll r2, r5, #30
+; CHECK-NEXT:    ldrd r2, r11, [r3]
+; CHECK-NEXT:    asrs r1, r5, #31
 ; CHECK-NEXT:    mov r12, r5
-; CHECK-NEXT:    ldr.w r5, [lr]
-; CHECK-NEXT:    muls r4, r6, r4
-; CHECK-NEXT:    mul r5, r5, r9
+; CHECK-NEXT:    asrs r5, r4, #31
+; CHECK-NEXT:    muls r2, r6, r2
+; CHECK-NEXT:    adds r2, #2
+; CHECK-NEXT:    lsll r12, r1, r2
+; CHECK-NEXT:    ldr r2, [sp, #4] @ 4-byte Reload
+; CHECK-NEXT:    add.w r1, r12, #-2147483648
+; CHECK-NEXT:    ldr r2, [r2]
+; CHECK-NEXT:    mul r2, r2, r9
 ; CHECK-NEXT:    add.w r9, r9, #4
-; CHECK-NEXT:    adds r4, #2
-; CHECK-NEXT:    lsll r12, r3, r4
-; CHECK-NEXT:    asr.w r4, r8, #31
-; CHECK-NEXT:    adds.w r3, r8, r5
-; CHECK-NEXT:    add.w r12, r12, #-2147483648
-; CHECK-NEXT:    adc.w r4, r4, r5, asr #31
-; CHECK-NEXT:    smull r5, r6, r11, r6
-; CHECK-NEXT:    adds.w r3, r3, #-2147483648
-; CHECK-NEXT:    adc r3, r4, #0
-; CHECK-NEXT:    asrs r4, r3, #31
-; CHECK-NEXT:    subs r5, r3, r5
-; CHECK-NEXT:    sbcs r4, r6
-; CHECK-NEXT:    adds.w r6, r5, #-2147483648
-; CHECK-NEXT:    adc r5, r4, #0
-; CHECK-NEXT:    asrl r6, r5, r12
+; CHECK-NEXT:    adds r4, r4, r2
+; CHECK-NEXT:    adc.w r2, r5, r2, asr #31
+; CHECK-NEXT:    adds.w r5, r4, #-2147483648
+; CHECK-NEXT:    smull r6, r4, r11, r6
+; CHECK-NEXT:    adc r2, r2, #0
+; CHECK-NEXT:    asrs r5, r2, #31
+; CHECK-NEXT:    subs r6, r2, r6
+; CHECK-NEXT:    sbcs r5, r4
+; CHECK-NEXT:    adds.w r6, r6, #-2147483648
+; CHECK-NEXT:    adc r5, r5, #0
+; CHECK-NEXT:    asrl r6, r5, r1
+; CHECK-NEXT:    movs r1, #2
 ; CHECK-NEXT:    lsrl r6, r5, #2
-; CHECK-NEXT:    movs r5, #2
-; CHECK-NEXT:    str r6, [r5]
-; CHECK-NEXT:    ldr r5, [r2], #-4
-; CHECK-NEXT:    mls r1, r5, r1, r3
-; CHECK-NEXT:    adds.w r8, r1, #-2147483648
-; CHECK-NEXT:    asr.w r3, r1, #31
-; CHECK-NEXT:    adc r1, r3, #0
-; CHECK-NEXT:    ldr r3, [sp] @ 4-byte Reload
-; CHECK-NEXT:    lsrl r8, r1, #2
-; CHECK-NEXT:    rsb.w r1, r8, #0
+; CHECK-NEXT:    str r6, [r1]
+; CHECK-NEXT:    ldr r1, [r3], #-4
+; CHECK-NEXT:    mls r1, r1, r8, r2
+; CHECK-NEXT:    adds.w r4, r1, #-2147483648
+; CHECK-NEXT:    asr.w r2, r1, #31
+; CHECK-NEXT:    adc r1, r2, #0
+; CHECK-NEXT:    ldr r2, [sp] @ 4-byte Reload
+; CHECK-NEXT:    lsrl r4, r1, #2
+; CHECK-NEXT:    rsbs r1, r4, #0
 ; CHECK-NEXT:    str r1, [r10, #-4]
 ; CHECK-NEXT:    add.w r10, r10, #4
-; CHECK-NEXT:    str r1, [r3]
-; CHECK-NEXT:    mov r1, lr
-; CHECK-NEXT:    add.w r1, lr, #4
-; CHECK-NEXT:    ldr.w lr, [sp, #4] @ 4-byte Reload
-; CHECK-NEXT:    subs.w lr, lr, #1
-; CHECK-NEXT:    str.w lr, [sp, #4] @ 4-byte Spill
-; CHECK-NEXT:    bne .LBB2_2
-; CHECK-NEXT:    b .LBB2_3
+; CHECK-NEXT:    str r1, [r2]
+; CHECK-NEXT:    ldr r1, [sp, #4] @ 4-byte Reload
+; CHECK-NEXT:    adds r1, #4
+; CHECK-NEXT:    le lr, .LBB2_2
 ; CHECK-NEXT:  .LBB2_3: @ %while.end
 ; CHECK-NEXT:    add sp, #8
 ; CHECK-NEXT:    pop.w {r8, r9, r10, r11}
@@ -328,20 +322,21 @@ define void @callinpreheader(i32* noalias nocapture readonly %pAngle, i32* nocap
 ; CHECK:       @ %bb.0: @ %entry
 ; CHECK-NEXT:    .save {r4, r5, r6, lr}
 ; CHECK-NEXT:    push {r4, r5, r6, lr}
+; CHECK-NEXT:    subs r6, r2, #0
 ; CHECK-NEXT:    mov r5, r0
 ; CHECK-NEXT:    mov r4, r1
-; CHECK-NEXT:    movs r0, #0
-; CHECK-NEXT:    wls lr, r2, .LBB3_3
+; CHECK-NEXT:    mov.w r0, #0
+; CHECK-NEXT:    beq .LBB3_3
 ; CHECK-NEXT:  @ %bb.1: @ %for.body.ph
-; CHECK-NEXT:    mov r6, r2
 ; CHECK-NEXT:    bl callee
-; CHECK-NEXT:    mov lr, r6
 ; CHECK-NEXT:    movs r0, #0
 ; CHECK-NEXT:  .LBB3_2: @ %for.body
 ; CHECK-NEXT:    @ =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    ldr r1, [r5], #4
+; CHECK-NEXT:    subs r6, #1
 ; CHECK-NEXT:    add r0, r1
-; CHECK-NEXT:    le lr, .LBB3_2
+; CHECK-NEXT:    cbz r6, .LBB3_3
+; CHECK-NEXT:    le .LBB3_2
 ; CHECK-NEXT:  .LBB3_3: @ %for.cond.cleanup
 ; CHECK-NEXT:    str r0, [r4]
 ; CHECK-NEXT:    pop {r4, r5, r6, pc}

diff  --git a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/while-negative-offset.mir b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/while-negative-offset.mir
index cfafa4277311..e8a2239066be 100644
--- a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/while-negative-offset.mir
+++ b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/while-negative-offset.mir
@@ -189,7 +189,7 @@ body:             |
     successors: %bb.2(0x40000000), %bb.1(0x40000000)
   
     $r0 = tLDRspi $sp, 7, 14, $noreg :: (load 4 from %stack.0)
-    t2WhileLoopStart killed renamable $r0, %bb.1, implicit-def dead $cpsr
+    $lr = t2WhileLoopStartLR killed renamable $r0, %bb.1, implicit-def dead $cpsr
     tB %bb.2, 14, $noreg
 
 ...

diff  --git a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/while.mir b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/while.mir
index b1ad223dc3b8..f8b215072052 100644
--- a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/while.mir
+++ b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/while.mir
@@ -119,7 +119,7 @@ body:             |
     frame-setup CFI_INSTRUCTION def_cfa_offset 8
     frame-setup CFI_INSTRUCTION offset $lr, -4
     frame-setup CFI_INSTRUCTION offset $r7, -8
-    t2WhileLoopStart $r2, %bb.3, implicit-def dead $cpsr
+    $lr = t2WhileLoopStartLR $r2, %bb.3, implicit-def dead $cpsr
     tB %bb.1, 14, $noreg
 
   bb.1.while.body.preheader:

diff  --git a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/wlstp.mir b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/wlstp.mir
index 07c136ec7c37..390d510a3f94 100644
--- a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/wlstp.mir
+++ b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/wlstp.mir
@@ -228,7 +228,7 @@ body:             |
     renamable $r12 = t2BICri killed renamable $r12, 15, 14, $noreg, $noreg
     renamable $r12 = t2SUBri killed renamable $r12, 16, 14, $noreg, $noreg
     renamable $lr = nuw nsw t2ADDrs killed renamable $lr, killed renamable $r12, 35, 14, $noreg, $noreg
-    t2WhileLoopStart renamable $lr, %bb.1, implicit-def dead $cpsr
+    $lr = t2WhileLoopStartLR renamable $lr, %bb.1, implicit-def dead $cpsr
     tB %bb.3, 14, $noreg
 
   bb.1.vector.ph:
@@ -345,7 +345,7 @@ body:             |
     renamable $r12 = t2BICri killed renamable $r12, 7, 14, $noreg, $noreg
     renamable $r12 = t2SUBri killed renamable $r12, 8, 14, $noreg, $noreg
     renamable $lr = nuw nsw t2ADDrs killed renamable $lr, killed renamable $r12, 27, 14, $noreg, $noreg
-    t2WhileLoopStart renamable $lr, %bb.1, implicit-def dead $cpsr
+    $lr = t2WhileLoopStartLR renamable $lr, %bb.1, implicit-def dead $cpsr
     tB %bb.2, 14, $noreg
 
   bb.1.vector.body:
@@ -477,7 +477,7 @@ body:             |
     renamable $r3, dead $cpsr = tMOVi8 1, 14, $noreg
     renamable $lr = nuw nsw t2ADDrs killed renamable $r3, killed renamable $r12, 19, 14, $noreg, $noreg
     renamable $r12 = t2MOVi 0, 14, $noreg, $noreg
-    t2WhileLoopStart renamable $lr, %bb.1, implicit-def dead $cpsr
+    $lr = t2WhileLoopStartLR renamable $lr, %bb.1, implicit-def dead $cpsr
     tB %bb.4, 14, $noreg
 
   bb.1.vector.ph:

diff  --git a/llvm/test/CodeGen/Thumb2/block-placement.mir b/llvm/test/CodeGen/Thumb2/block-placement.mir
index ed4a0a6b493d..855895b45ee6 100644
--- a/llvm/test/CodeGen/Thumb2/block-placement.mir
+++ b/llvm/test/CodeGen/Thumb2/block-placement.mir
@@ -47,7 +47,7 @@ body:             |
   ; CHECK:   frame-destroy tPOP_RET 11 /* CC::lt */, killed $cpsr, def $r7, def $pc, implicit killed $itstate
   ; CHECK: bb.2:
   ; CHECK:   successors: %bb.3(0x80000000)
-  ; CHECK:   t2WhileLoopStart killed renamable $r0, %bb.1, implicit-def dead $cpsr
+  ; CHECK:   $lr = t2WhileLoopStartLR killed renamable $r0, %bb.1, implicit-def dead $cpsr
   ; CHECK:   tB %bb.3, 14 /* CC::al */, $noreg
   ; CHECK: bb.1:
   ; CHECK:   frame-destroy tPOP_RET 14 /* CC::al */, $noreg, def $r7, def $pc
@@ -72,7 +72,7 @@ body:             |
     successors: %bb.3(0x80000000)
     liveins: $r0, $r1, $r2
 
-    t2WhileLoopStart killed renamable $r0, %bb.1, implicit-def dead $cpsr
+    $lr = t2WhileLoopStartLR killed renamable $r0, %bb.1, implicit-def dead $cpsr
 
   bb.3:
     successors: %bb.3(0x7c000000), %bb.1(0x04000000)
@@ -97,7 +97,7 @@ body:             |
   ; CHECK:   frame-destroy tPOP_RET 14 /* CC::al */, $noreg, def $r7, def $pc
   ; CHECK: bb.2:
   ; CHECK:   successors: %bb.3(0x80000000)
-  ; CHECK:   t2WhileLoopStart killed renamable $r0, %bb.0, implicit-def dead $cpsr
+  ; CHECK:   $lr = t2WhileLoopStartLR killed renamable $r0, %bb.0, implicit-def dead $cpsr
   ; CHECK: bb.3:
   ; CHECK:   successors: %bb.3(0x7c000000), %bb.1(0x04000000)
   ; CHECK:   renamable $r0 = tLDRi renamable $r2, 0, 14 /* CC::al */, $noreg
@@ -119,7 +119,7 @@ body:             |
     successors: %bb.3(0x80000000)
     liveins: $r0, $r1, $r2
 
-    t2WhileLoopStart killed renamable $r0, %bb.0, implicit-def dead $cpsr
+    $lr = t2WhileLoopStartLR killed renamable $r0, %bb.0, implicit-def dead $cpsr
 
   bb.3:
     successors: %bb.3(0x7c000000), %bb.1(0x04000000)
@@ -144,14 +144,14 @@ body:             |
   ; CHECK:   successors: %bb.3(0x80000000)
   ; CHECK:   $lr = tMOVr $r0, 14 /* CC::al */, $noreg
   ; CHECK:   renamable $r0 = t2ADDrs killed renamable $r2, killed $r0, 18, 14 /* CC::al */, $noreg, $noreg
-  ; CHECK:   t2WhileLoopStart killed renamable $lr, %bb.1, implicit-def dead $cpsr
+  ; CHECK:   $lr = t2WhileLoopStartLR killed renamable $lr, %bb.1, implicit-def dead $cpsr
   ; CHECK:   tB %bb.3, 14 /* CC::al */, $noreg
   ; CHECK: bb.1:
   ; CHECK:   successors: %bb.4(0x80000000)
   ; CHECK:   tCMPi8 renamable $r1, 1, 14 /* CC::al */, $noreg, implicit-def $cpsr
   ; CHECK:   t2IT 11, 8, implicit-def $itstate
   ; CHECK:   frame-destroy tPOP_RET 11 /* CC::lt */, killed $cpsr, def $r7, def $pc, implicit killed $itstate
-  ; CHECK:   t2WhileLoopStart killed renamable $r1, %bb.0, implicit-def dead $cpsr
+  ; CHECK:   $lr = t2WhileLoopStartLR killed renamable $r1, %bb.0, implicit-def dead $cpsr
   ; CHECK:   t2B %bb.4, 14 /* CC::al */, $noreg
   ; CHECK: bb.3:
   ; CHECK:   successors: %bb.3(0x7c000000), %bb.1(0x04000000)
@@ -160,7 +160,7 @@ body:             |
   ; CHECK: bb.4:
   ; CHECK:   successors: %bb.5(0x80000000)
   ; CHECK:   renamable $r0 = t2ADDrs killed renamable $r3, renamable $r1, 18, 14 /* CC::al */, $noreg, $noreg
-  ; CHECK:   t2WhileLoopStart killed renamable $r1, %bb.6, implicit-def dead $cpsr
+  ; CHECK:   $lr = t2WhileLoopStartLR killed renamable $r1, %bb.6, implicit-def dead $cpsr
   ; CHECK: bb.5:
   ; CHECK:   successors: %bb.5(0x7c000000), %bb.6(0x04000000)
   ; CHECK:   renamable $lr = t2LoopEndDec killed renamable $lr, %bb.5, implicit-def dead $cpsr
@@ -182,7 +182,7 @@ body:             |
     tCMPi8 renamable $r1, 1, 14 /* CC::al */, $noreg, implicit-def $cpsr
     t2IT 11, 8, implicit-def $itstate
     frame-destroy tPOP_RET 11 /* CC::lt */, killed $cpsr, def $r7, def $pc, implicit killed $itstate
-    t2WhileLoopStart killed renamable $r1, %bb.0, implicit-def dead $cpsr
+    $lr = t2WhileLoopStartLR killed renamable $r1, %bb.0, implicit-def dead $cpsr
     t2B %bb.4, 14 /* CC::al */, $noreg
 
   bb.1:
@@ -191,7 +191,7 @@ body:             |
 
     $lr = tMOVr $r0, 14 /* CC::al */, $noreg
     renamable $r0 = t2ADDrs killed renamable $r2, killed $r0, 18, 14 /* CC::al */, $noreg, $noreg
-    t2WhileLoopStart killed renamable $lr, %bb.3, implicit-def dead $cpsr
+    $lr = t2WhileLoopStartLR killed renamable $lr, %bb.3, implicit-def dead $cpsr
 
   bb.2:
     successors: %bb.2(0x7c000000), %bb.3(0x04000000)
@@ -205,7 +205,7 @@ body:             |
     liveins: $r1, $r3
 
     renamable $r0 = t2ADDrs killed renamable $r3, renamable $r1, 18, 14 /* CC::al */, $noreg, $noreg
-    t2WhileLoopStart killed renamable $r1, %bb.6, implicit-def dead $cpsr
+    $lr = t2WhileLoopStartLR killed renamable $r1, %bb.6, implicit-def dead $cpsr
 
   bb.5:
     successors: %bb.5(0x7c000000), %bb.6(0x04000000)
@@ -232,13 +232,13 @@ body:             |
   ; CHECK:   tCMPi8 renamable $r1, 1, 14 /* CC::al */, $noreg, implicit-def $cpsr
   ; CHECK:   t2IT 11, 8, implicit-def $itstate
   ; CHECK:   frame-destroy tPOP_RET 11 /* CC::lt */, killed $cpsr, def $r7, def $pc, implicit killed $itstate
-  ; CHECK:   t2WhileLoopStart killed renamable $r1, %bb.2, implicit-def dead $cpsr
+  ; CHECK:   $lr = t2WhileLoopStartLR killed renamable $r1, %bb.2, implicit-def dead $cpsr
   ; CHECK:   t2B %bb.4, 14 /* CC::al */, $noreg
   ; CHECK: bb.2:
   ; CHECK:   successors: %bb.3(0x80000000)
   ; CHECK:   $lr = tMOVr $r0, 14 /* CC::al */, $noreg
   ; CHECK:   renamable $r0 = t2ADDrs killed renamable $r2, killed $r0, 18, 14 /* CC::al */, $noreg, $noreg
-  ; CHECK:   t2WhileLoopStart killed renamable $lr, %bb.1, implicit-def dead $cpsr
+  ; CHECK:   $lr = t2WhileLoopStartLR killed renamable $lr, %bb.1, implicit-def dead $cpsr
   ; CHECK: bb.3:
   ; CHECK:   successors: %bb.3(0x7c000000), %bb.1(0x04000000)
   ; CHECK:   renamable $lr = t2LoopEndDec killed renamable $lr, %bb.3, implicit-def dead $cpsr
@@ -246,7 +246,7 @@ body:             |
   ; CHECK: bb.4:
   ; CHECK:   successors: %bb.5(0x80000000)
   ; CHECK:   renamable $r0 = t2ADDrs killed renamable $r3, renamable $r1, 18, 14 /* CC::al */, $noreg, $noreg
-  ; CHECK:   t2WhileLoopStart killed renamable $r1, %bb.6, implicit-def dead $cpsr
+  ; CHECK:   $lr = t2WhileLoopStartLR killed renamable $r1, %bb.6, implicit-def dead $cpsr
   ; CHECK: bb.5:
   ; CHECK:   successors: %bb.5(0x7c000000), %bb.6(0x04000000)
   ; CHECK:   renamable $lr = t2LoopEndDec killed renamable $lr, %bb.5, implicit-def dead $cpsr
@@ -268,7 +268,7 @@ body:             |
     tCMPi8 renamable $r1, 1, 14 /* CC::al */, $noreg, implicit-def $cpsr
     t2IT 11, 8, implicit-def $itstate
     frame-destroy tPOP_RET 11 /* CC::lt */, killed $cpsr, def $r7, def $pc, implicit killed $itstate
-    t2WhileLoopStart killed renamable $r1, %bb.1, implicit-def dead $cpsr
+    $lr = t2WhileLoopStartLR killed renamable $r1, %bb.1, implicit-def dead $cpsr
     t2B %bb.4, 14 /* CC::al */, $noreg
 
   bb.1:
@@ -277,7 +277,7 @@ body:             |
 
     $lr = tMOVr $r0, 14 /* CC::al */, $noreg
     renamable $r0 = t2ADDrs killed renamable $r2, killed $r0, 18, 14 /* CC::al */, $noreg, $noreg
-    t2WhileLoopStart killed renamable $lr, %bb.3, implicit-def dead $cpsr
+    $lr = t2WhileLoopStartLR killed renamable $lr, %bb.3, implicit-def dead $cpsr
 
   bb.2:
     successors: %bb.2(0x7c000000), %bb.3(0x04000000)
@@ -291,7 +291,7 @@ body:             |
     liveins: $r1, $r3
 
     renamable $r0 = t2ADDrs killed renamable $r3, renamable $r1, 18, 14 /* CC::al */, $noreg, $noreg
-    t2WhileLoopStart killed renamable $r1, %bb.6, implicit-def dead $cpsr
+    $lr = t2WhileLoopStartLR killed renamable $r1, %bb.6, implicit-def dead $cpsr
 
   bb.5:
     successors: %bb.5(0x7c000000), %bb.6(0x04000000)
@@ -318,7 +318,7 @@ body:             |
   ; CHECK:   frame-destroy tPOP_RET 14 /* CC::al */, $noreg, def $r7, def $pc
   ; CHECK: bb.2:
   ; CHECK:   successors: %bb.3(0x80000000)
-  ; CHECK:   t2WhileLoopStart killed renamable $r0, %bb.1, implicit-def dead $cpsr
+  ; CHECK:   $lr = t2WhileLoopStartLR killed renamable $r0, %bb.1, implicit-def dead $cpsr
   ; CHECK: bb.3:
   ; CHECK:   successors: %bb.3(0x7c000000), %bb.1(0x04000000)
   ; CHECK:   renamable $r0 = tLDRi renamable $r2, 0, 14 /* CC::al */, $noreg
@@ -341,7 +341,7 @@ body:             |
     successors: %bb.3(0x80000000)
     liveins: $r0, $r1, $r2
 
-    t2WhileLoopStart killed renamable $r0, %bb.1, implicit-def dead $cpsr
+    $lr = t2WhileLoopStartLR killed renamable $r0, %bb.1, implicit-def dead $cpsr
 
   bb.3:
     successors: %bb.3(0x7c000000), %bb.1(0x04000000)

diff  --git a/llvm/test/CodeGen/Thumb2/mve-float16regloops.ll b/llvm/test/CodeGen/Thumb2/mve-float16regloops.ll
index eab2090e576e..6b053b8fd104 100644
--- a/llvm/test/CodeGen/Thumb2/mve-float16regloops.ll
+++ b/llvm/test/CodeGen/Thumb2/mve-float16regloops.ll
@@ -785,24 +785,24 @@ define void @arm_fir_f32_1_4_mve(%struct.arm_fir_instance_f32* nocapture readonl
 ; CHECK-NEXT:    push.w {r4, r5, r6, r7, r8, r9, r10, r11, lr}
 ; CHECK-NEXT:    .pad #16
 ; CHECK-NEXT:    sub sp, #16
-; CHECK-NEXT:    ldrh r5, [r0]
-; CHECK-NEXT:    ldr.w r9, [r0, #4]
-; CHECK-NEXT:    subs r6, r5, #1
+; CHECK-NEXT:    ldrh.w r9, [r0]
+; CHECK-NEXT:    ldr.w r10, [r0, #4]
+; CHECK-NEXT:    sub.w r6, r9, #1
 ; CHECK-NEXT:    cmp r6, #3
 ; CHECK-NEXT:    bhi .LBB15_6
 ; CHECK-NEXT:  @ %bb.1: @ %if.then
 ; CHECK-NEXT:    ldr r7, [r0, #8]
-; CHECK-NEXT:    add.w r4, r9, r6, lsl #1
-; CHECK-NEXT:    lsr.w lr, r3, #2
+; CHECK-NEXT:    add.w r4, r10, r6, lsl #1
+; CHECK-NEXT:    lsrs r5, r3, #2
 ; CHECK-NEXT:    ldrh.w r8, [r7, #6]
 ; CHECK-NEXT:    ldrh.w r12, [r7, #4]
 ; CHECK-NEXT:    ldrh r6, [r7, #2]
 ; CHECK-NEXT:    ldrh r7, [r7]
-; CHECK-NEXT:    wls lr, lr, .LBB15_5
+; CHECK-NEXT:    wls lr, r5, .LBB15_5
 ; CHECK-NEXT:  @ %bb.2: @ %while.body.lr.ph
-; CHECK-NEXT:    str r5, [sp, #12] @ 4-byte Spill
+; CHECK-NEXT:    str.w r9, [sp, #12] @ 4-byte Spill
 ; CHECK-NEXT:    bic r5, r3, #3
-; CHECK-NEXT:    add.w r10, r9, #2
+; CHECK-NEXT:    add.w r9, r10, #2
 ; CHECK-NEXT:    str r5, [sp] @ 4-byte Spill
 ; CHECK-NEXT:    add.w r5, r2, r5, lsl #1
 ; CHECK-NEXT:    str r5, [sp, #4] @ 4-byte Spill
@@ -810,71 +810,71 @@ define void @arm_fir_f32_1_4_mve(%struct.arm_fir_instance_f32* nocapture readonl
 ; CHECK-NEXT:  .LBB15_3: @ %while.body
 ; CHECK-NEXT:    @ =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    vldrw.u32 q0, [r1], #8
-; CHECK-NEXT:    sub.w r11, r10, #2
-; CHECK-NEXT:    add.w r5, r10, #2
+; CHECK-NEXT:    sub.w r11, r9, #2
+; CHECK-NEXT:    add.w r5, r9, #2
 ; CHECK-NEXT:    vstrb.8 q0, [r4], #8
 ; CHECK-NEXT:    vldrw.u32 q0, [r11]
-; CHECK-NEXT:    vldrw.u32 q1, [r10]
+; CHECK-NEXT:    vldrw.u32 q1, [r9]
 ; CHECK-NEXT:    vmul.f16 q0, q0, r7
 ; CHECK-NEXT:    vfma.f16 q0, q1, r6
 ; CHECK-NEXT:    vldrw.u32 q1, [r5]
 ; CHECK-NEXT:    vfma.f16 q0, q1, r12
-; CHECK-NEXT:    vldrw.u32 q1, [r10, #4]
-; CHECK-NEXT:    add.w r10, r10, #8
+; CHECK-NEXT:    vldrw.u32 q1, [r9, #4]
+; CHECK-NEXT:    add.w r9, r9, #8
 ; CHECK-NEXT:    vfma.f16 q0, q1, r8
 ; CHECK-NEXT:    vstrb.8 q0, [r2], #8
 ; CHECK-NEXT:    le lr, .LBB15_3
 ; CHECK-NEXT:  @ %bb.4: @ %while.end.loopexit
 ; CHECK-NEXT:    ldr r2, [sp] @ 4-byte Reload
 ; CHECK-NEXT:    ldr r1, [sp, #8] @ 4-byte Reload
-; CHECK-NEXT:    ldr r5, [sp, #12] @ 4-byte Reload
-; CHECK-NEXT:    add.w r9, r9, r2, lsl #1
+; CHECK-NEXT:    ldr.w r9, [sp, #12] @ 4-byte Reload
+; CHECK-NEXT:    add.w r10, r10, r2, lsl #1
 ; CHECK-NEXT:    add.w r1, r1, r2, lsl #1
 ; CHECK-NEXT:    ldr r2, [sp, #4] @ 4-byte Reload
 ; CHECK-NEXT:  .LBB15_5: @ %while.end
-; CHECK-NEXT:    and lr, r3, #3
+; CHECK-NEXT:    and r5, r3, #3
 ; CHECK-NEXT:    vldrw.u32 q0, [r1]
-; CHECK-NEXT:    vctp.16 lr
+; CHECK-NEXT:    vctp.16 r5
 ; CHECK-NEXT:    vpst
 ; CHECK-NEXT:    vstrht.16 q0, [r4]
-; CHECK-NEXT:    vldrw.u32 q0, [r9]
-; CHECK-NEXT:    add.w r1, r9, #2
+; CHECK-NEXT:    vldrw.u32 q0, [r10]
+; CHECK-NEXT:    add.w r1, r10, #2
 ; CHECK-NEXT:    vldrw.u32 q1, [r1]
-; CHECK-NEXT:    add.w r1, r9, #6
+; CHECK-NEXT:    add.w r1, r10, #6
 ; CHECK-NEXT:    vmul.f16 q0, q0, r7
 ; CHECK-NEXT:    vfma.f16 q0, q1, r6
-; CHECK-NEXT:    vldrw.u32 q1, [r9, #4]
+; CHECK-NEXT:    vldrw.u32 q1, [r10, #4]
 ; CHECK-NEXT:    vfma.f16 q0, q1, r12
 ; CHECK-NEXT:    vldrw.u32 q1, [r1]
 ; CHECK-NEXT:    vfma.f16 q0, q1, r8
 ; CHECK-NEXT:    vpst
 ; CHECK-NEXT:    vstrht.16 q0, [r2]
-; CHECK-NEXT:    ldr.w r9, [r0, #4]
+; CHECK-NEXT:    ldr.w r10, [r0, #4]
 ; CHECK-NEXT:  .LBB15_6: @ %if.end
-; CHECK-NEXT:    add.w r0, r9, r3, lsl #1
-; CHECK-NEXT:    lsr.w lr, r5, #2
-; CHECK-NEXT:    wls lr, lr, .LBB15_10
+; CHECK-NEXT:    add.w r0, r10, r3, lsl #1
+; CHECK-NEXT:    lsr.w r1, r9, #2
+; CHECK-NEXT:    wls lr, r1, .LBB15_10
 ; CHECK-NEXT:  @ %bb.7: @ %while.body51.preheader
-; CHECK-NEXT:    bic r2, r5, #3
+; CHECK-NEXT:    bic r2, r9, #3
 ; CHECK-NEXT:    adds r1, r2, r3
-; CHECK-NEXT:    mov r3, r9
-; CHECK-NEXT:    add.w r1, r9, r1, lsl #1
+; CHECK-NEXT:    mov r3, r10
+; CHECK-NEXT:    add.w r1, r10, r1, lsl #1
 ; CHECK-NEXT:  .LBB15_8: @ %while.body51
 ; CHECK-NEXT:    @ =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    vldrw.u32 q0, [r0], #8
 ; CHECK-NEXT:    vstrb.8 q0, [r3], #8
 ; CHECK-NEXT:    le lr, .LBB15_8
 ; CHECK-NEXT:  @ %bb.9: @ %while.end55.loopexit
-; CHECK-NEXT:    add.w r9, r9, r2, lsl #1
+; CHECK-NEXT:    add.w r10, r10, r2, lsl #1
 ; CHECK-NEXT:    mov r0, r1
 ; CHECK-NEXT:  .LBB15_10: @ %while.end55
-; CHECK-NEXT:    ands r1, r5, #3
+; CHECK-NEXT:    ands r1, r9, #3
 ; CHECK-NEXT:    beq .LBB15_12
 ; CHECK-NEXT:  @ %bb.11: @ %if.then59
 ; CHECK-NEXT:    vldrw.u32 q0, [r0]
 ; CHECK-NEXT:    vctp.16 r1
 ; CHECK-NEXT:    vpst
-; CHECK-NEXT:    vstrht.16 q0, [r9]
+; CHECK-NEXT:    vstrht.16 q0, [r10]
 ; CHECK-NEXT:  .LBB15_12: @ %if.end61
 ; CHECK-NEXT:    add sp, #16
 ; CHECK-NEXT:    pop.w {r4, r5, r6, r7, r8, r9, r10, r11, pc}
@@ -1052,36 +1052,36 @@ define void @fir(%struct.arm_fir_instance_f32* nocapture readonly %S, half* noca
 ; CHECK-NEXT:    .pad #24
 ; CHECK-NEXT:    sub sp, #24
 ; CHECK-NEXT:    cmp r3, #8
+; CHECK-NEXT:    str r1, [sp, #20] @ 4-byte Spill
 ; CHECK-NEXT:    blo.w .LBB16_12
 ; CHECK-NEXT:  @ %bb.1: @ %entry
 ; CHECK-NEXT:    lsrs.w r12, r3, #2
 ; CHECK-NEXT:    beq.w .LBB16_12
 ; CHECK-NEXT:  @ %bb.2: @ %while.body.lr.ph
 ; CHECK-NEXT:    ldrh r4, [r0]
-; CHECK-NEXT:    movs r6, #1
+; CHECK-NEXT:    movs r1, #1
 ; CHECK-NEXT:    ldrd r5, r3, [r0, #4]
 ; CHECK-NEXT:    sub.w r0, r4, #8
-; CHECK-NEXT:    and r8, r0, #7
 ; CHECK-NEXT:    add.w r7, r0, r0, lsr #29
-; CHECK-NEXT:    asr.w lr, r7, #3
-; CHECK-NEXT:    cmp.w lr, #1
+; CHECK-NEXT:    and r0, r0, #7
+; CHECK-NEXT:    asrs r6, r7, #3
+; CHECK-NEXT:    cmp r6, #1
 ; CHECK-NEXT:    it gt
-; CHECK-NEXT:    asrgt r6, r7, #3
+; CHECK-NEXT:    asrgt r1, r7, #3
 ; CHECK-NEXT:    add.w r7, r5, r4, lsl #1
-; CHECK-NEXT:    subs r7, #2
-; CHECK-NEXT:    str r7, [sp, #20] @ 4-byte Spill
+; CHECK-NEXT:    str r1, [sp] @ 4-byte Spill
+; CHECK-NEXT:    subs r1, r7, #2
 ; CHECK-NEXT:    rsbs r7, r4, #0
 ; CHECK-NEXT:    str r7, [sp, #8] @ 4-byte Spill
 ; CHECK-NEXT:    add.w r7, r3, #16
-; CHECK-NEXT:    str r6, [sp] @ 4-byte Spill
 ; CHECK-NEXT:    str r4, [sp, #12] @ 4-byte Spill
 ; CHECK-NEXT:    str r7, [sp, #4] @ 4-byte Spill
+; CHECK-NEXT:    str r0, [sp, #16] @ 4-byte Spill
 ; CHECK-NEXT:    b .LBB16_4
 ; CHECK-NEXT:  .LBB16_3: @ %while.end
 ; CHECK-NEXT:    @ in Loop: Header=BB16_4 Depth=1
 ; CHECK-NEXT:    ldr r0, [sp, #8] @ 4-byte Reload
 ; CHECK-NEXT:    subs.w r12, r12, #1
-; CHECK-NEXT:    ldr r1, [sp, #16] @ 4-byte Reload
 ; CHECK-NEXT:    vstrb.8 q0, [r2], #8
 ; CHECK-NEXT:    add.w r0, r5, r0, lsl #1
 ; CHECK-NEXT:    add.w r5, r0, #8
@@ -1090,40 +1090,39 @@ define void @fir(%struct.arm_fir_instance_f32* nocapture readonly %S, half* noca
 ; CHECK-NEXT:    @ =>This Loop Header: Depth=1
 ; CHECK-NEXT:    @ Child Loop BB16_6 Depth 2
 ; CHECK-NEXT:    @ Child Loop BB16_10 Depth 2
-; CHECK-NEXT:    vldrw.u32 q0, [r1], #8
+; CHECK-NEXT:    ldr r0, [sp, #20] @ 4-byte Reload
 ; CHECK-NEXT:    ldrh.w lr, [r3, #14]
-; CHECK-NEXT:    ldrh r0, [r3, #12]
-; CHECK-NEXT:    str r1, [sp, #16] @ 4-byte Spill
-; CHECK-NEXT:    ldr r1, [sp, #20] @ 4-byte Reload
-; CHECK-NEXT:    ldrh r4, [r3, #10]
-; CHECK-NEXT:    ldrh r7, [r3, #8]
+; CHECK-NEXT:    vldrw.u32 q0, [r0], #8
+; CHECK-NEXT:    ldrh.w r8, [r3, #12]
+; CHECK-NEXT:    ldrh r7, [r3, #10]
+; CHECK-NEXT:    ldrh r4, [r3, #8]
 ; CHECK-NEXT:    ldrh r6, [r3, #6]
 ; CHECK-NEXT:    ldrh.w r9, [r3, #4]
 ; CHECK-NEXT:    ldrh.w r11, [r3, #2]
 ; CHECK-NEXT:    ldrh.w r10, [r3]
 ; CHECK-NEXT:    vstrb.8 q0, [r1], #8
 ; CHECK-NEXT:    vldrw.u32 q0, [r5]
-; CHECK-NEXT:    str r1, [sp, #20] @ 4-byte Spill
-; CHECK-NEXT:    adds r1, r5, #2
-; CHECK-NEXT:    vldrw.u32 q1, [r1]
+; CHECK-NEXT:    str r0, [sp, #20] @ 4-byte Spill
+; CHECK-NEXT:    adds r0, r5, #2
+; CHECK-NEXT:    vldrw.u32 q1, [r0]
 ; CHECK-NEXT:    vmul.f16 q0, q0, r10
-; CHECK-NEXT:    adds r1, r5, #6
+; CHECK-NEXT:    adds r0, r5, #6
 ; CHECK-NEXT:    vfma.f16 q0, q1, r11
 ; CHECK-NEXT:    vldrw.u32 q1, [r5, #4]
 ; CHECK-NEXT:    vfma.f16 q0, q1, r9
-; CHECK-NEXT:    vldrw.u32 q1, [r1]
-; CHECK-NEXT:    add.w r1, r5, #10
+; CHECK-NEXT:    vldrw.u32 q1, [r0]
+; CHECK-NEXT:    add.w r0, r5, #10
 ; CHECK-NEXT:    vfma.f16 q0, q1, r6
 ; CHECK-NEXT:    vldrw.u32 q1, [r5, #8]
-; CHECK-NEXT:    vfma.f16 q0, q1, r7
-; CHECK-NEXT:    vldrw.u32 q1, [r1]
 ; CHECK-NEXT:    vfma.f16 q0, q1, r4
-; CHECK-NEXT:    vldrw.u32 q1, [r5, #12]
-; CHECK-NEXT:    vfma.f16 q0, q1, r0
+; CHECK-NEXT:    vldrw.u32 q1, [r0]
 ; CHECK-NEXT:    add.w r0, r5, #14
+; CHECK-NEXT:    vfma.f16 q0, q1, r7
+; CHECK-NEXT:    vldrw.u32 q1, [r5, #12]
+; CHECK-NEXT:    adds r5, #16
+; CHECK-NEXT:    vfma.f16 q0, q1, r8
 ; CHECK-NEXT:    vldrw.u32 q1, [r0]
 ; CHECK-NEXT:    ldr r0, [sp, #12] @ 4-byte Reload
-; CHECK-NEXT:    adds r5, #16
 ; CHECK-NEXT:    vfma.f16 q0, q1, lr
 ; CHECK-NEXT:    cmp r0, #16
 ; CHECK-NEXT:    blo .LBB16_7
@@ -1137,25 +1136,25 @@ define void @fir(%struct.arm_fir_instance_f32* nocapture readonly %S, half* noca
 ; CHECK-NEXT:    @ => This Inner Loop Header: Depth=2
 ; CHECK-NEXT:    ldrh r0, [r6], #16
 ; CHECK-NEXT:    vldrw.u32 q1, [r5]
-; CHECK-NEXT:    adds r1, r5, #2
+; CHECK-NEXT:    adds r4, r5, #2
 ; CHECK-NEXT:    vfma.f16 q0, q1, r0
-; CHECK-NEXT:    vldrw.u32 q1, [r1]
+; CHECK-NEXT:    vldrw.u32 q1, [r4]
 ; CHECK-NEXT:    ldrh r0, [r6, #-14]
-; CHECK-NEXT:    adds r1, r5, #6
+; CHECK-NEXT:    adds r4, r5, #6
 ; CHECK-NEXT:    vfma.f16 q0, q1, r0
 ; CHECK-NEXT:    ldrh r0, [r6, #-12]
 ; CHECK-NEXT:    vldrw.u32 q1, [r5, #4]
 ; CHECK-NEXT:    vfma.f16 q0, q1, r0
-; CHECK-NEXT:    vldrw.u32 q1, [r1]
+; CHECK-NEXT:    vldrw.u32 q1, [r4]
 ; CHECK-NEXT:    ldrh r0, [r6, #-10]
-; CHECK-NEXT:    add.w r1, r5, #10
+; CHECK-NEXT:    add.w r4, r5, #10
 ; CHECK-NEXT:    vfma.f16 q0, q1, r0
 ; CHECK-NEXT:    ldrh r0, [r6, #-8]
 ; CHECK-NEXT:    vldrw.u32 q1, [r5, #8]
 ; CHECK-NEXT:    vfma.f16 q0, q1, r0
-; CHECK-NEXT:    vldrw.u32 q1, [r1]
+; CHECK-NEXT:    vldrw.u32 q1, [r4]
 ; CHECK-NEXT:    ldrh r0, [r6, #-6]
-; CHECK-NEXT:    ldrh r1, [r6, #-2]
+; CHECK-NEXT:    ldrh r4, [r6, #-2]
 ; CHECK-NEXT:    vfma.f16 q0, q1, r0
 ; CHECK-NEXT:    ldrh r0, [r6, #-4]
 ; CHECK-NEXT:    vldrw.u32 q1, [r5, #12]
@@ -1163,32 +1162,33 @@ define void @fir(%struct.arm_fir_instance_f32* nocapture readonly %S, half* noca
 ; CHECK-NEXT:    add.w r0, r5, #14
 ; CHECK-NEXT:    vldrw.u32 q1, [r0]
 ; CHECK-NEXT:    adds r5, #16
-; CHECK-NEXT:    vfma.f16 q0, q1, r1
+; CHECK-NEXT:    vfma.f16 q0, q1, r4
 ; CHECK-NEXT:    le lr, .LBB16_6
 ; CHECK-NEXT:    b .LBB16_8
 ; CHECK-NEXT:  .LBB16_7: @ in Loop: Header=BB16_4 Depth=1
 ; CHECK-NEXT:    ldr r6, [sp, #4] @ 4-byte Reload
 ; CHECK-NEXT:  .LBB16_8: @ %for.end
 ; CHECK-NEXT:    @ in Loop: Header=BB16_4 Depth=1
-; CHECK-NEXT:    cmp.w r8, #0
+; CHECK-NEXT:    ldr r0, [sp, #16] @ 4-byte Reload
+; CHECK-NEXT:    subs.w lr, r0, #0
 ; CHECK-NEXT:    beq.w .LBB16_3
 ; CHECK-NEXT:    b .LBB16_9
 ; CHECK-NEXT:  .LBB16_9: @ %while.body76.preheader
 ; CHECK-NEXT:    @ in Loop: Header=BB16_4 Depth=1
 ; CHECK-NEXT:    mov r0, r5
-; CHECK-NEXT:    mov lr, r8
 ; CHECK-NEXT:  .LBB16_10: @ %while.body76
 ; CHECK-NEXT:    @ Parent Loop BB16_4 Depth=1
 ; CHECK-NEXT:    @ => This Inner Loop Header: Depth=2
-; CHECK-NEXT:    ldrh r1, [r6], #2
+; CHECK-NEXT:    ldrh r4, [r6], #2
 ; CHECK-NEXT:    vldrh.u16 q1, [r0], #2
+; CHECK-NEXT:    vfma.f16 q0, q1, r4
 ; CHECK-NEXT:    subs.w lr, lr, #1
-; CHECK-NEXT:    vfma.f16 q0, q1, r1
 ; CHECK-NEXT:    bne .LBB16_10
 ; CHECK-NEXT:    b .LBB16_11
 ; CHECK-NEXT:  .LBB16_11: @ %while.end.loopexit
 ; CHECK-NEXT:    @ in Loop: Header=BB16_4 Depth=1
-; CHECK-NEXT:    add.w r5, r5, r8, lsl #1
+; CHECK-NEXT:    ldr r0, [sp, #16] @ 4-byte Reload
+; CHECK-NEXT:    add.w r5, r5, r0, lsl #1
 ; CHECK-NEXT:    b .LBB16_3
 ; CHECK-NEXT:  .LBB16_12: @ %if.end
 ; CHECK-NEXT:    add sp, #24
@@ -1450,12 +1450,12 @@ define void @arm_biquad_cascade_df2T_f16(%struct.arm_biquad_cascade_df2T_instanc
 ; CHECK-NEXT:  .LBB17_3: @ %do.body
 ; CHECK-NEXT:    @ =>This Loop Header: Depth=1
 ; CHECK-NEXT:    @ Child Loop BB17_5 Depth 2
-; CHECK-NEXT:    vldrh.u16 q4, [r6]
-; CHECK-NEXT:    vldrh.u16 q3, [r6, #4]
+; CHECK-NEXT:    vldrh.u16 q3, [r6]
 ; CHECK-NEXT:    movs r5, #0
-; CHECK-NEXT:    vmov q5, q4
-; CHECK-NEXT:    vmov q6, q3
+; CHECK-NEXT:    vmov q5, q3
 ; CHECK-NEXT:    vshlc q5, r5, #16
+; CHECK-NEXT:    vldrh.u16 q4, [r6, #4]
+; CHECK-NEXT:    vmov q6, q4
 ; CHECK-NEXT:    vshlc q6, r5, #16
 ; CHECK-NEXT:    vldrh.u16 q2, [r12]
 ; CHECK-NEXT:    vmov.f32 s9, s1
@@ -1464,16 +1464,15 @@ define void @arm_biquad_cascade_df2T_f16(%struct.arm_biquad_cascade_df2T_instanc
 ; CHECK-NEXT:  @ %bb.4: @ %while.body.preheader
 ; CHECK-NEXT:    @ in Loop: Header=BB17_3 Depth=1
 ; CHECK-NEXT:    mov r5, r2
-; CHECK-NEXT:    mov lr, r9
 ; CHECK-NEXT:  .LBB17_5: @ %while.body
 ; CHECK-NEXT:    @ Parent Loop BB17_3 Depth=1
 ; CHECK-NEXT:    @ => This Inner Loop Header: Depth=2
 ; CHECK-NEXT:    ldrh r7, [r1], #4
 ; CHECK-NEXT:    vmov r4, s4
-; CHECK-NEXT:    vfma.f16 q2, q4, r7
+; CHECK-NEXT:    vfma.f16 q2, q3, r7
 ; CHECK-NEXT:    ldrh r3, [r1, #-2]
 ; CHECK-NEXT:    vmov.u16 r7, q2[0]
-; CHECK-NEXT:    vfma.f16 q2, q3, r7
+; CHECK-NEXT:    vfma.f16 q2, q4, r7
 ; CHECK-NEXT:    vmov.16 q2[3], r4
 ; CHECK-NEXT:    vfma.f16 q2, q5, r3
 ; CHECK-NEXT:    vmov.u16 r3, q2[1]
@@ -1490,9 +1489,9 @@ define void @arm_biquad_cascade_df2T_f16(%struct.arm_biquad_cascade_df2T_instanc
 ; CHECK-NEXT:  @ %bb.7: @ %if.then
 ; CHECK-NEXT:    @ in Loop: Header=BB17_3 Depth=1
 ; CHECK-NEXT:    ldrh r1, [r1]
-; CHECK-NEXT:    vfma.f16 q2, q4, r1
-; CHECK-NEXT:    vmov.u16 r1, q2[0]
 ; CHECK-NEXT:    vfma.f16 q2, q3, r1
+; CHECK-NEXT:    vmov.u16 r1, q2[0]
+; CHECK-NEXT:    vfma.f16 q2, q4, r1
 ; CHECK-NEXT:    strh r1, [r5]
 ; CHECK-NEXT:    vmovx.f16 s6, s8
 ; CHECK-NEXT:    vstr.16 s6, [r12]

diff  --git a/llvm/test/CodeGen/Thumb2/mve-float32regloops.ll b/llvm/test/CodeGen/Thumb2/mve-float32regloops.ll
index 44a152b32c0b..2fd717bf2d47 100644
--- a/llvm/test/CodeGen/Thumb2/mve-float32regloops.ll
+++ b/llvm/test/CodeGen/Thumb2/mve-float32regloops.ll
@@ -785,23 +785,23 @@ define void @arm_fir_f32_1_4_mve(%struct.arm_fir_instance_f32* nocapture readonl
 ; CHECK-NEXT:    push.w {r4, r5, r6, r7, r8, r9, r10, r11, lr}
 ; CHECK-NEXT:    .pad #8
 ; CHECK-NEXT:    sub sp, #8
-; CHECK-NEXT:    ldrh.w r9, [r0]
+; CHECK-NEXT:    ldrh.w r10, [r0]
 ; CHECK-NEXT:    mov r11, r1
 ; CHECK-NEXT:    ldr.w r12, [r0, #4]
-; CHECK-NEXT:    sub.w r1, r9, #1
+; CHECK-NEXT:    sub.w r1, r10, #1
 ; CHECK-NEXT:    cmp r1, #3
 ; CHECK-NEXT:    bhi .LBB15_6
 ; CHECK-NEXT:  @ %bb.1: @ %if.then
 ; CHECK-NEXT:    ldr r4, [r0, #8]
-; CHECK-NEXT:    lsr.w lr, r3, #2
 ; CHECK-NEXT:    ldrd r7, r6, [r4]
 ; CHECK-NEXT:    ldrd r5, r8, [r4, #8]
 ; CHECK-NEXT:    add.w r4, r12, r1, lsl #2
-; CHECK-NEXT:    wls lr, lr, .LBB15_5
+; CHECK-NEXT:    lsrs r1, r3, #2
+; CHECK-NEXT:    wls lr, r1, .LBB15_5
 ; CHECK-NEXT:  @ %bb.2: @ %while.body.lr.ph
 ; CHECK-NEXT:    bic r1, r3, #3
 ; CHECK-NEXT:    str r1, [sp] @ 4-byte Spill
-; CHECK-NEXT:    add.w r10, r12, #4
+; CHECK-NEXT:    add.w r9, r12, #4
 ; CHECK-NEXT:    add.w r1, r2, r1, lsl #2
 ; CHECK-NEXT:    str r1, [sp, #4] @ 4-byte Spill
 ; CHECK-NEXT:    mov r1, r11
@@ -809,12 +809,12 @@ define void @arm_fir_f32_1_4_mve(%struct.arm_fir_instance_f32* nocapture readonl
 ; CHECK-NEXT:    @ =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    vldrw.u32 q0, [r1], #16
 ; CHECK-NEXT:    vstrb.8 q0, [r4], #16
-; CHECK-NEXT:    vldrw.u32 q0, [r10, #-4]
-; CHECK-NEXT:    vldrw.u32 q1, [r10], #16
+; CHECK-NEXT:    vldrw.u32 q0, [r9, #-4]
+; CHECK-NEXT:    vldrw.u32 q1, [r9], #16
 ; CHECK-NEXT:    vmul.f32 q0, q0, r7
-; CHECK-NEXT:    vldrw.u32 q2, [r10, #-8]
+; CHECK-NEXT:    vldrw.u32 q2, [r9, #-8]
 ; CHECK-NEXT:    vfma.f32 q0, q1, r6
-; CHECK-NEXT:    vldrw.u32 q1, [r10, #-12]
+; CHECK-NEXT:    vldrw.u32 q1, [r9, #-12]
 ; CHECK-NEXT:    vfma.f32 q0, q1, r5
 ; CHECK-NEXT:    vfma.f32 q0, q2, r8
 ; CHECK-NEXT:    vstrb.8 q0, [r2], #16
@@ -843,10 +843,10 @@ define void @arm_fir_f32_1_4_mve(%struct.arm_fir_instance_f32* nocapture readonl
 ; CHECK-NEXT:    ldr.w r12, [r0, #4]
 ; CHECK-NEXT:  .LBB15_6: @ %if.end
 ; CHECK-NEXT:    add.w r0, r12, r3, lsl #2
-; CHECK-NEXT:    lsr.w lr, r9, #2
-; CHECK-NEXT:    wls lr, lr, .LBB15_10
+; CHECK-NEXT:    lsr.w r1, r10, #2
+; CHECK-NEXT:    wls lr, r1, .LBB15_10
 ; CHECK-NEXT:  @ %bb.7: @ %while.body51.preheader
-; CHECK-NEXT:    bic r2, r9, #3
+; CHECK-NEXT:    bic r2, r10, #3
 ; CHECK-NEXT:    adds r1, r2, r3
 ; CHECK-NEXT:    mov r3, r12
 ; CHECK-NEXT:    add.w r1, r12, r1, lsl #2
@@ -859,7 +859,7 @@ define void @arm_fir_f32_1_4_mve(%struct.arm_fir_instance_f32* nocapture readonl
 ; CHECK-NEXT:    add.w r12, r12, r2, lsl #2
 ; CHECK-NEXT:    mov r0, r1
 ; CHECK-NEXT:  .LBB15_10: @ %while.end55
-; CHECK-NEXT:    ands r1, r9, #3
+; CHECK-NEXT:    ands r1, r10, #3
 ; CHECK-NEXT:    beq .LBB15_12
 ; CHECK-NEXT:  @ %bb.11: @ %if.then59
 ; CHECK-NEXT:    vldrw.u32 q0, [r0]
@@ -1053,32 +1053,32 @@ define void @fir(%struct.arm_fir_instance_f32* nocapture readonly %S, float* noc
 ; CHECK-NEXT:    beq.w .LBB16_12
 ; CHECK-NEXT:  @ %bb.2: @ %while.body.lr.ph
 ; CHECK-NEXT:    ldrh r6, [r0]
-; CHECK-NEXT:    movs r4, #1
-; CHECK-NEXT:    ldrd r5, r10, [r0, #4]
-; CHECK-NEXT:    sub.w r3, r6, #8
-; CHECK-NEXT:    add.w r0, r3, r3, lsr #29
-; CHECK-NEXT:    asrs r7, r0, #3
+; CHECK-NEXT:    movs r5, #1
+; CHECK-NEXT:    ldrd r4, r10, [r0, #4]
+; CHECK-NEXT:    sub.w r0, r6, #8
+; CHECK-NEXT:    add.w r3, r0, r0, lsr #29
+; CHECK-NEXT:    and r0, r0, #7
+; CHECK-NEXT:    asrs r7, r3, #3
 ; CHECK-NEXT:    cmp r7, #1
 ; CHECK-NEXT:    it gt
-; CHECK-NEXT:    asrgt r4, r0, #3
-; CHECK-NEXT:    add.w r0, r5, r6, lsl #2
-; CHECK-NEXT:    sub.w r9, r0, #4
-; CHECK-NEXT:    rsbs r0, r6, #0
-; CHECK-NEXT:    str r4, [sp, #4] @ 4-byte Spill
-; CHECK-NEXT:    and r4, r3, #7
-; CHECK-NEXT:    str r0, [sp, #16] @ 4-byte Spill
-; CHECK-NEXT:    add.w r0, r10, #32
-; CHECK-NEXT:    str r6, [sp, #20] @ 4-byte Spill
-; CHECK-NEXT:    str r0, [sp, #8] @ 4-byte Spill
-; CHECK-NEXT:    str r4, [sp, #12] @ 4-byte Spill
+; CHECK-NEXT:    asrgt r5, r3, #3
+; CHECK-NEXT:    add.w r3, r4, r6, lsl #2
+; CHECK-NEXT:    sub.w r9, r3, #4
+; CHECK-NEXT:    rsbs r3, r6, #0
+; CHECK-NEXT:    str r3, [sp, #12] @ 4-byte Spill
+; CHECK-NEXT:    add.w r3, r10, #32
+; CHECK-NEXT:    str r5, [sp, #4] @ 4-byte Spill
+; CHECK-NEXT:    str r6, [sp, #16] @ 4-byte Spill
+; CHECK-NEXT:    str r3, [sp, #8] @ 4-byte Spill
+; CHECK-NEXT:    str r0, [sp, #20] @ 4-byte Spill
 ; CHECK-NEXT:    b .LBB16_4
 ; CHECK-NEXT:  .LBB16_3: @ %while.end
 ; CHECK-NEXT:    @ in Loop: Header=BB16_4 Depth=1
-; CHECK-NEXT:    ldr r0, [sp, #16] @ 4-byte Reload
+; CHECK-NEXT:    ldr r0, [sp, #12] @ 4-byte Reload
 ; CHECK-NEXT:    subs.w r12, r12, #1
 ; CHECK-NEXT:    vstrb.8 q0, [r2], #16
-; CHECK-NEXT:    add.w r0, r5, r0, lsl #2
-; CHECK-NEXT:    add.w r5, r0, #16
+; CHECK-NEXT:    add.w r0, r4, r0, lsl #2
+; CHECK-NEXT:    add.w r4, r0, #16
 ; CHECK-NEXT:    beq .LBB16_12
 ; CHECK-NEXT:  .LBB16_4: @ %while.body
 ; CHECK-NEXT:    @ =>This Loop Header: Depth=1
@@ -1087,25 +1087,25 @@ define void @fir(%struct.arm_fir_instance_f32* nocapture readonly %S, float* noc
 ; CHECK-NEXT:    add.w lr, r10, #8
 ; CHECK-NEXT:    vldrw.u32 q0, [r1], #16
 ; CHECK-NEXT:    ldrd r3, r7, [r10]
-; CHECK-NEXT:    ldm.w lr, {r0, r4, r6, lr}
+; CHECK-NEXT:    ldm.w lr, {r0, r5, r6, lr}
 ; CHECK-NEXT:    ldrd r11, r8, [r10, #24]
 ; CHECK-NEXT:    vstrb.8 q0, [r9], #16
-; CHECK-NEXT:    vldrw.u32 q0, [r5], #32
+; CHECK-NEXT:    vldrw.u32 q0, [r4], #32
 ; CHECK-NEXT:    strd r9, r1, [sp, #24] @ 8-byte Folded Spill
-; CHECK-NEXT:    vldrw.u32 q1, [r5, #-28]
+; CHECK-NEXT:    vldrw.u32 q1, [r4, #-28]
 ; CHECK-NEXT:    vmul.f32 q0, q0, r3
-; CHECK-NEXT:    vldrw.u32 q6, [r5, #-24]
-; CHECK-NEXT:    vldrw.u32 q4, [r5, #-20]
+; CHECK-NEXT:    vldrw.u32 q6, [r4, #-24]
+; CHECK-NEXT:    vldrw.u32 q4, [r4, #-20]
 ; CHECK-NEXT:    vfma.f32 q0, q1, r7
-; CHECK-NEXT:    vldrw.u32 q5, [r5, #-16]
+; CHECK-NEXT:    vldrw.u32 q5, [r4, #-16]
 ; CHECK-NEXT:    vfma.f32 q0, q6, r0
-; CHECK-NEXT:    vldrw.u32 q2, [r5, #-12]
-; CHECK-NEXT:    vfma.f32 q0, q4, r4
-; CHECK-NEXT:    vldrw.u32 q3, [r5, #-8]
+; CHECK-NEXT:    vldrw.u32 q2, [r4, #-12]
+; CHECK-NEXT:    vfma.f32 q0, q4, r5
+; CHECK-NEXT:    vldrw.u32 q3, [r4, #-8]
 ; CHECK-NEXT:    vfma.f32 q0, q5, r6
-; CHECK-NEXT:    ldr r0, [sp, #20] @ 4-byte Reload
+; CHECK-NEXT:    ldr r0, [sp, #16] @ 4-byte Reload
 ; CHECK-NEXT:    vfma.f32 q0, q2, lr
-; CHECK-NEXT:    vldrw.u32 q1, [r5, #-4]
+; CHECK-NEXT:    vldrw.u32 q1, [r4, #-4]
 ; CHECK-NEXT:    vfma.f32 q0, q3, r11
 ; CHECK-NEXT:    cmp r0, #16
 ; CHECK-NEXT:    vfma.f32 q0, q1, r8
@@ -1118,54 +1118,52 @@ define void @fir(%struct.arm_fir_instance_f32* nocapture readonly %S, float* noc
 ; CHECK-NEXT:  .LBB16_6: @ %for.body
 ; CHECK-NEXT:    @ Parent Loop BB16_4 Depth=1
 ; CHECK-NEXT:    @ => This Inner Loop Header: Depth=2
-; CHECK-NEXT:    ldm.w r7, {r0, r3, r4, r6}
-; CHECK-NEXT:    vldrw.u32 q1, [r5], #32
-; CHECK-NEXT:    add.w r11, r7, #16
-; CHECK-NEXT:    vldrw.u32 q6, [r5, #-24]
-; CHECK-NEXT:    vldrw.u32 q4, [r5, #-20]
+; CHECK-NEXT:    ldm.w r7, {r0, r3, r5, r6, r8, r11}
+; CHECK-NEXT:    vldrw.u32 q1, [r4], #32
+; CHECK-NEXT:    vldrw.u32 q6, [r4, #-24]
+; CHECK-NEXT:    vldrw.u32 q4, [r4, #-20]
 ; CHECK-NEXT:    vfma.f32 q0, q1, r0
-; CHECK-NEXT:    vldrw.u32 q1, [r5, #-28]
-; CHECK-NEXT:    ldm.w r11, {r1, r8, r11}
-; CHECK-NEXT:    vldrw.u32 q5, [r5, #-16]
+; CHECK-NEXT:    vldrw.u32 q1, [r4, #-28]
+; CHECK-NEXT:    vldrw.u32 q5, [r4, #-16]
+; CHECK-NEXT:    vldrw.u32 q2, [r4, #-12]
 ; CHECK-NEXT:    vfma.f32 q0, q1, r3
-; CHECK-NEXT:    vldrw.u32 q2, [r5, #-12]
-; CHECK-NEXT:    vfma.f32 q0, q6, r4
-; CHECK-NEXT:    vldrw.u32 q3, [r5, #-8]
+; CHECK-NEXT:    ldrd r9, r1, [r7, #24]
+; CHECK-NEXT:    vfma.f32 q0, q6, r5
+; CHECK-NEXT:    vldrw.u32 q3, [r4, #-8]
 ; CHECK-NEXT:    vfma.f32 q0, q4, r6
-; CHECK-NEXT:    ldr.w r9, [r7, #28]
-; CHECK-NEXT:    vfma.f32 q0, q5, r1
-; CHECK-NEXT:    vldrw.u32 q1, [r5, #-4]
-; CHECK-NEXT:    vfma.f32 q0, q2, r8
+; CHECK-NEXT:    vldrw.u32 q1, [r4, #-4]
+; CHECK-NEXT:    vfma.f32 q0, q5, r8
 ; CHECK-NEXT:    adds r7, #32
-; CHECK-NEXT:    vfma.f32 q0, q3, r11
-; CHECK-NEXT:    vfma.f32 q0, q1, r9
+; CHECK-NEXT:    vfma.f32 q0, q2, r11
+; CHECK-NEXT:    vfma.f32 q0, q3, r9
+; CHECK-NEXT:    vfma.f32 q0, q1, r1
 ; CHECK-NEXT:    le lr, .LBB16_6
 ; CHECK-NEXT:    b .LBB16_8
 ; CHECK-NEXT:  .LBB16_7: @ in Loop: Header=BB16_4 Depth=1
 ; CHECK-NEXT:    ldr r7, [sp, #8] @ 4-byte Reload
 ; CHECK-NEXT:  .LBB16_8: @ %for.end
 ; CHECK-NEXT:    @ in Loop: Header=BB16_4 Depth=1
-; CHECK-NEXT:    ldrd r9, r1, [sp, #24] @ 8-byte Folded Reload
-; CHECK-NEXT:    ldr r4, [sp, #12] @ 4-byte Reload
-; CHECK-NEXT:    cmp.w r4, #0
+; CHECK-NEXT:    ldr r1, [sp, #28] @ 4-byte Reload
+; CHECK-NEXT:    ldrd r0, r9, [sp, #20] @ 8-byte Folded Reload
+; CHECK-NEXT:    subs.w lr, r0, #0
 ; CHECK-NEXT:    beq .LBB16_3
 ; CHECK-NEXT:    b .LBB16_9
 ; CHECK-NEXT:  .LBB16_9: @ %while.body76.preheader
 ; CHECK-NEXT:    @ in Loop: Header=BB16_4 Depth=1
-; CHECK-NEXT:    mov r3, r5
-; CHECK-NEXT:    mov lr, r4
+; CHECK-NEXT:    mov r3, r4
 ; CHECK-NEXT:  .LBB16_10: @ %while.body76
 ; CHECK-NEXT:    @ Parent Loop BB16_4 Depth=1
 ; CHECK-NEXT:    @ => This Inner Loop Header: Depth=2
 ; CHECK-NEXT:    ldr r0, [r7], #4
 ; CHECK-NEXT:    vldrw.u32 q1, [r3], #4
-; CHECK-NEXT:    subs.w lr, lr, #1
 ; CHECK-NEXT:    vfma.f32 q0, q1, r0
+; CHECK-NEXT:    subs.w lr, lr, #1
 ; CHECK-NEXT:    bne .LBB16_10
 ; CHECK-NEXT:    b .LBB16_11
 ; CHECK-NEXT:  .LBB16_11: @ %while.end.loopexit
 ; CHECK-NEXT:    @ in Loop: Header=BB16_4 Depth=1
-; CHECK-NEXT:    add.w r5, r5, r4, lsl #2
+; CHECK-NEXT:    ldr r0, [sp, #20] @ 4-byte Reload
+; CHECK-NEXT:    add.w r4, r4, r0, lsl #2
 ; CHECK-NEXT:    b .LBB16_3
 ; CHECK-NEXT:  .LBB16_12: @ %if.end
 ; CHECK-NEXT:    add sp, #32
@@ -1660,15 +1658,15 @@ define arm_aapcs_vfpcc void @arm_biquad_cascade_df1_f32(%struct.arm_biquad_casd_
 ; CHECK-NEXT:    ldrd r12, r10, [r0]
 ; CHECK-NEXT:    @ implicit-def: $s2
 ; CHECK-NEXT:    and r7, r3, #3
-; CHECK-NEXT:    ldr.w r11, [r0, #8]
+; CHECK-NEXT:    ldr.w r9, [r0, #8]
 ; CHECK-NEXT:    lsrs r0, r3, #2
-; CHECK-NEXT:    str r0, [sp, #60] @ 4-byte Spill
+; CHECK-NEXT:    str r0, [sp, #8] @ 4-byte Spill
 ; CHECK-NEXT:    str r7, [sp, #12] @ 4-byte Spill
-; CHECK-NEXT:    str r2, [sp, #56] @ 4-byte Spill
+; CHECK-NEXT:    str r2, [sp, #60] @ 4-byte Spill
 ; CHECK-NEXT:    b .LBB19_3
 ; CHECK-NEXT:  .LBB19_1: @ in Loop: Header=BB19_3 Depth=1
 ; CHECK-NEXT:    vmov.f32 s14, s7
-; CHECK-NEXT:    ldr r2, [sp, #56] @ 4-byte Reload
+; CHECK-NEXT:    ldr r2, [sp, #60] @ 4-byte Reload
 ; CHECK-NEXT:    vmov.f32 s0, s10
 ; CHECK-NEXT:    vmov.f32 s7, s6
 ; CHECK-NEXT:  .LBB19_2: @ %if.end69
@@ -1676,7 +1674,7 @@ define arm_aapcs_vfpcc void @arm_biquad_cascade_df1_f32(%struct.arm_biquad_casd_
 ; CHECK-NEXT:    vstr s8, [r10]
 ; CHECK-NEXT:    subs.w r12, r12, #1
 ; CHECK-NEXT:    vstr s0, [r10, #4]
-; CHECK-NEXT:    add.w r11, r11, #128
+; CHECK-NEXT:    add.w r9, r9, #128
 ; CHECK-NEXT:    vstr s14, [r10, #8]
 ; CHECK-NEXT:    mov r1, r2
 ; CHECK-NEXT:    vstr s7, [r10, #12]
@@ -1687,45 +1685,45 @@ define arm_aapcs_vfpcc void @arm_biquad_cascade_df1_f32(%struct.arm_biquad_casd_
 ; CHECK-NEXT:    @ Child Loop BB19_5 Depth 2
 ; CHECK-NEXT:    vldr s7, [r10, #8]
 ; CHECK-NEXT:    mov r5, r2
-; CHECK-NEXT:    ldr r0, [sp, #60] @ 4-byte Reload
+; CHECK-NEXT:    ldr r0, [sp, #8] @ 4-byte Reload
 ; CHECK-NEXT:    vldr s8, [r10]
 ; CHECK-NEXT:    vldr s10, [r10, #4]
 ; CHECK-NEXT:    vldr s6, [r10, #12]
 ; CHECK-NEXT:    wls lr, r0, .LBB19_6
 ; CHECK-NEXT:  @ %bb.4: @ %while.body.lr.ph
 ; CHECK-NEXT:    @ in Loop: Header=BB19_3 Depth=1
-; CHECK-NEXT:    ldrd r5, lr, [sp, #56] @ 8-byte Folded Reload
+; CHECK-NEXT:    ldr r5, [sp, #60] @ 4-byte Reload
 ; CHECK-NEXT:  .LBB19_5: @ %while.body
 ; CHECK-NEXT:    @ Parent Loop BB19_3 Depth=1
 ; CHECK-NEXT:    @ => This Inner Loop Header: Depth=2
 ; CHECK-NEXT:    vmov r4, s8
 ; CHECK-NEXT:    vldr s8, [r1, #12]
-; CHECK-NEXT:    vldrw.u32 q0, [r11, #112]
-; CHECK-NEXT:    vmov r0, s10
+; CHECK-NEXT:    vldrw.u32 q0, [r9, #112]
+; CHECK-NEXT:    vmov r3, s10
 ; CHECK-NEXT:    vldr s10, [r1, #8]
 ; CHECK-NEXT:    vmov r7, s7
-; CHECK-NEXT:    vmov r9, s6
-; CHECK-NEXT:    vldrw.u32 q1, [r11]
+; CHECK-NEXT:    vmov r11, s6
+; CHECK-NEXT:    vldrw.u32 q1, [r9]
 ; CHECK-NEXT:    vstrw.32 q0, [sp, #64] @ 16-byte Spill
 ; CHECK-NEXT:    vmov r8, s8
-; CHECK-NEXT:    vldrw.u32 q0, [r11, #16]
+; CHECK-NEXT:    vldrw.u32 q0, [r9, #16]
 ; CHECK-NEXT:    ldr r6, [r1, #4]
-; CHECK-NEXT:    vldrw.u32 q7, [r11, #32]
+; CHECK-NEXT:    vldrw.u32 q7, [r9, #32]
 ; CHECK-NEXT:    vmul.f32 q1, q1, r8
-; CHECK-NEXT:    vmov r3, s10
-; CHECK-NEXT:    vldrw.u32 q3, [r11, #48]
-; CHECK-NEXT:    vfma.f32 q1, q0, r3
-; CHECK-NEXT:    ldr r3, [r1], #16
+; CHECK-NEXT:    vmov r0, s10
+; CHECK-NEXT:    vldrw.u32 q3, [r9, #48]
+; CHECK-NEXT:    vfma.f32 q1, q0, r0
+; CHECK-NEXT:    ldr r0, [r1], #16
 ; CHECK-NEXT:    vfma.f32 q1, q7, r6
-; CHECK-NEXT:    vldrw.u32 q6, [r11, #64]
-; CHECK-NEXT:    vfma.f32 q1, q3, r3
-; CHECK-NEXT:    vldrw.u32 q5, [r11, #80]
+; CHECK-NEXT:    vldrw.u32 q6, [r9, #64]
+; CHECK-NEXT:    vfma.f32 q1, q3, r0
+; CHECK-NEXT:    vldrw.u32 q5, [r9, #80]
 ; CHECK-NEXT:    vfma.f32 q1, q6, r4
-; CHECK-NEXT:    vldrw.u32 q4, [r11, #96]
-; CHECK-NEXT:    vfma.f32 q1, q5, r0
+; CHECK-NEXT:    vldrw.u32 q4, [r9, #96]
+; CHECK-NEXT:    vfma.f32 q1, q5, r3
 ; CHECK-NEXT:    vldrw.u32 q0, [sp, #64] @ 16-byte Reload
 ; CHECK-NEXT:    vfma.f32 q1, q4, r7
-; CHECK-NEXT:    vfma.f32 q1, q0, r9
+; CHECK-NEXT:    vfma.f32 q1, q0, r11
 ; CHECK-NEXT:    vmov.f32 s2, s8
 ; CHECK-NEXT:    vstrb.8 q1, [r5], #16
 ; CHECK-NEXT:    le lr, .LBB19_5
@@ -1739,25 +1737,25 @@ define arm_aapcs_vfpcc void @arm_biquad_cascade_df1_f32(%struct.arm_biquad_casd_
 ; CHECK-NEXT:    vldr s24, [r1]
 ; CHECK-NEXT:    vmov r0, s8
 ; CHECK-NEXT:    vldr s0, [r1, #4]
-; CHECK-NEXT:    vldrw.u32 q3, [r11]
+; CHECK-NEXT:    vldrw.u32 q3, [r9]
 ; CHECK-NEXT:    vldr s3, [r1, #12]
-; CHECK-NEXT:    vldrw.u32 q4, [r11, #32]
+; CHECK-NEXT:    vldrw.u32 q4, [r9, #32]
 ; CHECK-NEXT:    vldr s1, [r1, #8]
 ; CHECK-NEXT:    vmov r1, s10
-; CHECK-NEXT:    vldrw.u32 q2, [r11, #96]
+; CHECK-NEXT:    vldrw.u32 q2, [r9, #96]
 ; CHECK-NEXT:    vmov r6, s3
 ; CHECK-NEXT:    vmul.f32 q3, q3, r6
 ; CHECK-NEXT:    vmov r6, s1
 ; CHECK-NEXT:    vstrw.32 q2, [sp, #32] @ 16-byte Spill
-; CHECK-NEXT:    vldrw.u32 q2, [r11, #112]
-; CHECK-NEXT:    vldrw.u32 q5, [r11, #48]
+; CHECK-NEXT:    vldrw.u32 q2, [r9, #112]
+; CHECK-NEXT:    vldrw.u32 q5, [r9, #48]
 ; CHECK-NEXT:    vmov r4, s0
 ; CHECK-NEXT:    vstrw.32 q2, [sp, #64] @ 16-byte Spill
-; CHECK-NEXT:    vldrw.u32 q2, [r11, #80]
-; CHECK-NEXT:    vldrw.u32 q7, [r11, #64]
+; CHECK-NEXT:    vldrw.u32 q2, [r9, #80]
+; CHECK-NEXT:    vldrw.u32 q7, [r9, #64]
 ; CHECK-NEXT:    vmov r3, s24
 ; CHECK-NEXT:    vstrw.32 q2, [sp, #16] @ 16-byte Spill
-; CHECK-NEXT:    vldrw.u32 q2, [r11, #16]
+; CHECK-NEXT:    vldrw.u32 q2, [r9, #16]
 ; CHECK-NEXT:    vmov r2, s7
 ; CHECK-NEXT:    cmp r7, #1
 ; CHECK-NEXT:    vfma.f32 q3, q2, r6
@@ -1792,12 +1790,12 @@ define arm_aapcs_vfpcc void @arm_biquad_cascade_df1_f32(%struct.arm_biquad_casd_
 ; CHECK-NEXT:  .LBB19_11: @ %if.end69
 ; CHECK-NEXT:    @ in Loop: Header=BB19_3 Depth=1
 ; CHECK-NEXT:    vmov.f32 s2, s3
-; CHECK-NEXT:    ldr r2, [sp, #56] @ 4-byte Reload
+; CHECK-NEXT:    ldr r2, [sp, #60] @ 4-byte Reload
 ; CHECK-NEXT:    b .LBB19_2
 ; CHECK-NEXT:  .LBB19_12: @ %if.else64
 ; CHECK-NEXT:    @ in Loop: Header=BB19_3 Depth=1
 ; CHECK-NEXT:    vmov.f32 s7, s13
-; CHECK-NEXT:    ldr r2, [sp, #56] @ 4-byte Reload
+; CHECK-NEXT:    ldr r2, [sp, #60] @ 4-byte Reload
 ; CHECK-NEXT:    vmov.f32 s2, s3
 ; CHECK-NEXT:    vstr s14, [r5, #8]
 ; CHECK-NEXT:    vmov.f32 s8, s1
@@ -2063,7 +2061,6 @@ define void @arm_biquad_cascade_df2T_f32(%struct.arm_biquad_cascade_df2T_instanc
 ; CHECK-NEXT:    @ in Loop: Header=BB20_3 Depth=1
 ; CHECK-NEXT:    vmov q6, q1
 ; CHECK-NEXT:    mov r5, r2
-; CHECK-NEXT:    mov lr, r3
 ; CHECK-NEXT:  .LBB20_5: @ %while.body
 ; CHECK-NEXT:    @ Parent Loop BB20_3 Depth=1
 ; CHECK-NEXT:    @ => This Inner Loop Header: Depth=2

diff  --git a/llvm/test/CodeGen/Thumb2/mve-postinc-distribute.ll b/llvm/test/CodeGen/Thumb2/mve-postinc-distribute.ll
index 0eb1226f60db..39dededb5973 100644
--- a/llvm/test/CodeGen/Thumb2/mve-postinc-distribute.ll
+++ b/llvm/test/CodeGen/Thumb2/mve-postinc-distribute.ll
@@ -64,16 +64,16 @@ define void @arm_cmplx_dot_prod_q15(i16* nocapture readonly %pSrcA, i16* nocaptu
 ; CHECK-NEXT:    vldrh.u16 q0, [r0]
 ; CHECK-NEXT:    vldrh.u16 q1, [r1]
 ; CHECK-NEXT:    movs r4, #0
-; CHECK-NEXT:    lsr.w lr, r7, #3
+; CHECK-NEXT:    lsr.w r9, r7, #3
 ; CHECK-NEXT:    mov r7, r12
 ; CHECK-NEXT:    mov r11, r12
-; CHECK-NEXT:    wls lr, lr, .LBB1_4
+; CHECK-NEXT:    wls lr, r9, .LBB1_4
 ; CHECK-NEXT:  @ %bb.1: @ %while.body.preheader
+; CHECK-NEXT:    add.w r8, r0, r9, lsl #5
 ; CHECK-NEXT:    mov.w r11, #0
-; CHECK-NEXT:    add.w r8, r0, lr, lsl #5
 ; CHECK-NEXT:    adds r0, #32
 ; CHECK-NEXT:    add.w r6, r1, #32
-; CHECK-NEXT:    lsl.w r9, lr, #4
+; CHECK-NEXT:    lsl.w r9, r9, #4
 ; CHECK-NEXT:    mov r4, r11
 ; CHECK-NEXT:    movs r7, #0
 ; CHECK-NEXT:    mov r12, r11
@@ -100,9 +100,9 @@ define void @arm_cmplx_dot_prod_q15(i16* nocapture readonly %pSrcA, i16* nocaptu
 ; CHECK-NEXT:    ldr.w r8, [sp, #36]
 ; CHECK-NEXT:    mov r6, r12
 ; CHECK-NEXT:    mov r5, r7
-; CHECK-NEXT:    and lr, r2, #3
+; CHECK-NEXT:    and r2, r2, #3
 ; CHECK-NEXT:    lsrl r6, r5, #6
-; CHECK-NEXT:    wls lr, lr, .LBB1_7
+; CHECK-NEXT:    wls lr, r2, .LBB1_7
 ; CHECK-NEXT:  .LBB1_5: @ %while.body11
 ; CHECK-NEXT:    @ =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    ldrsh r9, [r0], #4

diff  --git a/llvm/test/CodeGen/Thumb2/mve-postinc-lsr.ll b/llvm/test/CodeGen/Thumb2/mve-postinc-lsr.ll
index 1b6cdfc517be..e7a75935912d 100644
--- a/llvm/test/CodeGen/Thumb2/mve-postinc-lsr.ll
+++ b/llvm/test/CodeGen/Thumb2/mve-postinc-lsr.ll
@@ -1163,14 +1163,14 @@ define arm_aapcs_vfpcc void @_Z37_arm_radix4_butterfly_inverse_f32_mvePK21arm_cf
 ; CHECK-NEXT:    bne .LBB7_6
 ; CHECK-NEXT:    b .LBB7_2
 ; CHECK-NEXT:  .LBB7_9:
-; CHECK-NEXT:    adr r0, .LCPI7_0
-; CHECK-NEXT:    vldrw.u32 q1, [r0]
-; CHECK-NEXT:    ldr r0, [sp, #20] @ 4-byte Reload
-; CHECK-NEXT:    vadd.i32 q1, q1, r0
-; CHECK-NEXT:    vldrw.u32 q2, [q1, #64]!
+; CHECK-NEXT:    adr r1, .LCPI7_0
 ; CHECK-NEXT:    ldr r0, [sp, #8] @ 4-byte Reload
-; CHECK-NEXT:    lsr.w lr, r0, #3
-; CHECK-NEXT:    wls lr, lr, .LBB7_12
+; CHECK-NEXT:    vldrw.u32 q1, [r1]
+; CHECK-NEXT:    ldr r1, [sp, #20] @ 4-byte Reload
+; CHECK-NEXT:    vadd.i32 q1, q1, r1
+; CHECK-NEXT:    lsrs r0, r0, #3
+; CHECK-NEXT:    vldrw.u32 q2, [q1, #64]!
+; CHECK-NEXT:    wls lr, r0, .LBB7_12
 ; CHECK-NEXT:  @ %bb.10:
 ; CHECK-NEXT:    vldr s0, [sp, #4] @ 4-byte Reload
 ; CHECK-NEXT:    vmov r0, s0

diff  --git a/llvm/test/CodeGen/Thumb2/mve-vmaxnma-commute.ll b/llvm/test/CodeGen/Thumb2/mve-vmaxnma-commute.ll
index 6a30d964392a..0a18279a57ef 100644
--- a/llvm/test/CodeGen/Thumb2/mve-vmaxnma-commute.ll
+++ b/llvm/test/CodeGen/Thumb2/mve-vmaxnma-commute.ll
@@ -197,8 +197,8 @@ define void @loop_absmax32(float* nocapture readonly %0, i32 %1, float* nocaptur
 ; CHECK-NEXT:    .save {r7, lr}
 ; CHECK-NEXT:    push {r7, lr}
 ; CHECK-NEXT:    vmov.i32 q0, #0x0
-; CHECK-NEXT:    lsr.w lr, r1, #3
-; CHECK-NEXT:    wls lr, lr, .LBB16_3
+; CHECK-NEXT:    lsrs r1, r1, #3
+; CHECK-NEXT:    wls lr, r1, .LBB16_3
 ; CHECK-NEXT:  @ %bb.1: @ %.preheader
 ; CHECK-NEXT:    vmov.i32 q0, #0x0
 ; CHECK-NEXT:  .LBB16_2: @ =>This Inner Loop Header: Depth=1
@@ -247,8 +247,8 @@ define void @loop_absmax32_c(float* nocapture readonly %0, i32 %1, float* nocapt
 ; CHECK-NEXT:    .save {r7, lr}
 ; CHECK-NEXT:    push {r7, lr}
 ; CHECK-NEXT:    vmov.i32 q0, #0x0
-; CHECK-NEXT:    lsr.w lr, r1, #3
-; CHECK-NEXT:    wls lr, lr, .LBB17_3
+; CHECK-NEXT:    lsrs r1, r1, #3
+; CHECK-NEXT:    wls lr, r1, .LBB17_3
 ; CHECK-NEXT:  @ %bb.1: @ %.preheader
 ; CHECK-NEXT:    vmov.i32 q0, #0x0
 ; CHECK-NEXT:  .LBB17_2: @ =>This Inner Loop Header: Depth=1
@@ -389,8 +389,8 @@ define void @loop_absmax16(half* nocapture readonly %0, i32 %1, half* nocapture
 ; CHECK-NEXT:    .save {r7, lr}
 ; CHECK-NEXT:    push {r7, lr}
 ; CHECK-NEXT:    vmov.i32 q0, #0x0
-; CHECK-NEXT:    lsr.w lr, r1, #3
-; CHECK-NEXT:    wls lr, lr, .LBB20_3
+; CHECK-NEXT:    lsrs r1, r1, #3
+; CHECK-NEXT:    wls lr, r1, .LBB20_3
 ; CHECK-NEXT:  @ %bb.1: @ %.preheader
 ; CHECK-NEXT:    vmov.i32 q0, #0x0
 ; CHECK-NEXT:  .LBB20_2: @ =>This Inner Loop Header: Depth=1
@@ -439,8 +439,8 @@ define void @loop_absmax16_c(half* nocapture readonly %0, i32 %1, half* nocaptur
 ; CHECK-NEXT:    .save {r7, lr}
 ; CHECK-NEXT:    push {r7, lr}
 ; CHECK-NEXT:    vmov.i32 q0, #0x0
-; CHECK-NEXT:    lsr.w lr, r1, #3
-; CHECK-NEXT:    wls lr, lr, .LBB21_3
+; CHECK-NEXT:    lsrs r1, r1, #3
+; CHECK-NEXT:    wls lr, r1, .LBB21_3
 ; CHECK-NEXT:  @ %bb.1: @ %.preheader
 ; CHECK-NEXT:    vmov.i32 q0, #0x0
 ; CHECK-NEXT:  .LBB21_2: @ =>This Inner Loop Header: Depth=1

diff  --git a/llvm/test/Transforms/HardwareLoops/ARM/do-rem.ll b/llvm/test/Transforms/HardwareLoops/ARM/do-rem.ll
index 74763a6f5414..da315a238546 100644
--- a/llvm/test/Transforms/HardwareLoops/ARM/do-rem.ll
+++ b/llvm/test/Transforms/HardwareLoops/ARM/do-rem.ll
@@ -4,14 +4,16 @@
 
 ; CHECK-LABEL: do_with_i32_urem
 ; CHECK: entry:
-; CHECK: [[TEST:%[^ ]+]] = call i1 @llvm.test.set.loop.iterations.i32(i32 %n)
-; CHECK: br i1 [[TEST]], label %while.body.preheader, label %while.end
+; CHECK: [[TEST:%[^ ]+]] = call { i32, i1 } @llvm.test.start.loop.iterations.i32(i32 %n)
+; CHECK: [[TEST1:%[^ ]+]] = extractvalue { i32, i1 } [[TEST]], 1
+; CHECK: [[TEST0:%[^ ]+]] = extractvalue { i32, i1 } [[TEST]], 0
+; CHECK: br i1 [[TEST1]], label %while.body.preheader, label %while.end
 
 ; CHECK: while.body.preheader:
 ; CHECK-NEXT: br label %while.body
 
 ; CHECK: while.body:
-; CHECK: [[REM:%[^ ]+]] = phi i32 [ %n, %while.body.preheader ], [ [[LOOP_DEC:%[^ ]+]], %while.body ]
+; CHECK: [[REM:%[^ ]+]] = phi i32 [ [[TEST0]], %while.body.preheader ], [ [[LOOP_DEC:%[^ ]+]], %while.body ]
 ; CHECK: [[LOOP_DEC]] = call i32 @llvm.loop.decrement.reg.i32(i32 [[REM]], i32 1)
 ; CHECK: [[CMP:%[^ ]+]] = icmp ne i32 [[LOOP_DEC]], 0
 ; CHECK: br i1 [[CMP]], label %while.body, label %while.end.loopexit
@@ -43,14 +45,16 @@ while.end:
 
 ; CHECK-LABEL: do_with_i32_srem
 ; CHECK: entry:
-; CHECK: [[TEST:%[^ ]+]] = call i1 @llvm.test.set.loop.iterations.i32(i32 %n)
-; CHECK: br i1 [[TEST]], label %while.body.preheader, label %while.end
+; CHECK: [[TEST:%[^ ]+]] = call { i32, i1 } @llvm.test.start.loop.iterations.i32(i32 %n)
+; CHECK: [[TEST1:%[^ ]+]] = extractvalue { i32, i1 } [[TEST]], 1
+; CHECK: [[TEST0:%[^ ]+]] = extractvalue { i32, i1 } [[TEST]], 0
+; CHECK: br i1 [[TEST1]], label %while.body.preheader, label %while.end
 
 ; CHECK: while.body.preheader:
 ; CHECK-NEXT: br label %while.body
 
 ; CHECK: while.body:
-; CHECK: [[REM:%[^ ]+]] = phi i32 [ %n, %while.body.preheader ], [ [[LOOP_DEC:%[^ ]+]], %while.body ]
+; CHECK: [[REM:%[^ ]+]] = phi i32 [ [[TEST0]], %while.body.preheader ], [ [[LOOP_DEC:%[^ ]+]], %while.body ]
 ; CHECK: [[LOOP_DEC]] = call i32 @llvm.loop.decrement.reg.i32(i32 [[REM]], i32 1)
 ; CHECK: [[CMP:%[^ ]+]] = icmp ne i32 [[LOOP_DEC]], 0
 ; CHECK: br i1 [[CMP]], label %while.body, label %while.end.loopexit
@@ -82,14 +86,16 @@ while.end:
 
 ; CHECK-LABEL: do_with_i32_udiv
 ; CHECK: entry:
-; CHECK: [[TEST:%[^ ]+]] = call i1 @llvm.test.set.loop.iterations.i32(i32 %n)
-; CHECK: br i1 [[TEST]], label %while.body.preheader, label %while.end
+; CHECK: [[TEST:%[^ ]+]] = call { i32, i1 } @llvm.test.start.loop.iterations.i32(i32 %n)
+; CHECK: [[TEST1:%[^ ]+]] = extractvalue { i32, i1 } [[TEST]], 1
+; CHECK: [[TEST0:%[^ ]+]] = extractvalue { i32, i1 } [[TEST]], 0
+; CHECK: br i1 [[TEST1]], label %while.body.preheader, label %while.end
 
 ; CHECK: while.body.preheader:
 ; CHECK-NEXT: br label %while.body
 
 ; CHECK: while.body:
-; CHECK: [[REM:%[^ ]+]] = phi i32 [ %n, %while.body.preheader ], [ [[LOOP_DEC:%[^ ]+]], %while.body ]
+; CHECK: [[REM:%[^ ]+]] = phi i32 [ [[TEST0]], %while.body.preheader ], [ [[LOOP_DEC:%[^ ]+]], %while.body ]
 ; CHECK: [[LOOP_DEC]] = call i32 @llvm.loop.decrement.reg.i32(i32 [[REM]], i32 1)
 ; CHECK: [[CMP:%[^ ]+]] = icmp ne i32 [[LOOP_DEC]], 0
 ; CHECK: br i1 [[CMP]], label %while.body, label %while.end.loopexit
@@ -121,14 +127,16 @@ while.end:
 
 ; CHECK-LABEL: do_with_i32_sdiv
 ; CHECK: entry:
-; CHECK: [[TEST:%[^ ]+]] = call i1 @llvm.test.set.loop.iterations.i32(i32 %n)
-; CHECK: br i1 [[TEST]], label %while.body.preheader, label %while.end
+; CHECK: [[TEST:%[^ ]+]] = call { i32, i1 } @llvm.test.start.loop.iterations.i32(i32 %n)
+; CHECK: [[TEST1:%[^ ]+]] = extractvalue { i32, i1 } [[TEST]], 1
+; CHECK: [[TEST0:%[^ ]+]] = extractvalue { i32, i1 } [[TEST]], 0
+; CHECK: br i1 [[TEST1]], label %while.body.preheader, label %while.end
 
 ; CHECK: while.body.preheader:
 ; CHECK-NEXT: br label %while.body
 
 ; CHECK: while.body:
-; CHECK: [[REM:%[^ ]+]] = phi i32 [ %n, %while.body.preheader ], [ [[LOOP_DEC:%[^ ]+]], %while.body ]
+; CHECK: [[REM:%[^ ]+]] = phi i32 [ [[TEST0]], %while.body.preheader ], [ [[LOOP_DEC:%[^ ]+]], %while.body ]
 ; CHECK: [[LOOP_DEC]] = call i32 @llvm.loop.decrement.reg.i32(i32 [[REM]], i32 1)
 ; CHECK: [[CMP:%[^ ]+]] = icmp ne i32 [[LOOP_DEC]], 0
 ; CHECK: br i1 [[CMP]], label %while.body, label %while.end.loopexit

diff  --git a/llvm/test/Transforms/HardwareLoops/ARM/simple-do.ll b/llvm/test/Transforms/HardwareLoops/ARM/simple-do.ll
index d4f8683cdae7..836da6bdc4da 100644
--- a/llvm/test/Transforms/HardwareLoops/ARM/simple-do.ll
+++ b/llvm/test/Transforms/HardwareLoops/ARM/simple-do.ll
@@ -46,13 +46,15 @@ while.end:
 
 ; CHECK-LABEL: do_inc1
 ; CHECK: entry:
-; CHECK: [[TEST:%[^ ]+]] = call i1 @llvm.test.set.loop.iterations.i32(i32 %n)
-; CHECK: br i1 [[TEST]], label %while.body.lr.ph, label %while.end
+; CHECK: [[TEST:%[^ ]+]] = call { i32, i1 } @llvm.test.start.loop.iterations.i32(i32 %n)
+; CHECK: [[TEST1:%[^ ]+]] = extractvalue { i32, i1 } [[TEST]], 1
+; CHECK: [[TEST0:%[^ ]+]] = extractvalue { i32, i1 } [[TEST]], 0
+; CHECK: br i1 [[TEST1]], label %while.body.lr.ph, label %while.end
 
 ; CHECK: while.body.lr.ph:
 ; CHECK: br label %while.body
 
-; CHECK: [[REM:%[^ ]+]] = phi i32 [ %n, %while.body.lr.ph ], [ [[LOOP_DEC:%[^ ]+]], %while.body ]
+; CHECK: [[REM:%[^ ]+]] = phi i32 [ [[TEST0]], %while.body.lr.ph ], [ [[LOOP_DEC:%[^ ]+]], %while.body ]
 ; CHECK: [[LOOP_DEC]] = call i32 @llvm.loop.decrement.reg.i32(i32 [[REM]], i32 1)
 ; CHECK: [[CMP:%[^ ]+]] = icmp ne i32 [[LOOP_DEC]], 0
 ; CHECK: br i1 [[CMP]], label %while.body, label %while.end.loopexit

diff  --git a/llvm/test/Transforms/HardwareLoops/ARM/structure.ll b/llvm/test/Transforms/HardwareLoops/ARM/structure.ll
index 88967ccd06a7..c611bc217796 100644
--- a/llvm/test/Transforms/HardwareLoops/ARM/structure.ll
+++ b/llvm/test/Transforms/HardwareLoops/ARM/structure.ll
@@ -118,14 +118,16 @@ while.end:                                        ; preds = %while.body
 }
 
 ; CHECK-LABEL: pre_existing_test_set
-; CHECK: call i1 @llvm.test.set.loop.iterations
+; CHECK: call { i32, i1 } @llvm.test.start.loop.iterations
 ; CHECK-NOT: llvm.set{{.*}}.loop.iterations
 ; CHECK: call i32 @llvm.loop.decrement.reg.i32(i32 %0, i32 1)
 ; CHECK-NOT: call i32 @llvm.loop.decrement.reg
 define i32 @pre_existing_test_set(i32 %n, i32* nocapture %p, i32* nocapture readonly %q) {
 entry:
-  %guard = call i1 @llvm.test.set.loop.iterations.i32(i32 %n)
-  br i1 %guard, label %while.preheader, label %while.end
+  %guard = call { i32, i1 } @llvm.test.start.loop.iterations.i32(i32 %n)
+  %g0 = extractvalue { i32, i1 } %guard, 0
+  %g1 = extractvalue { i32, i1 } %guard, 1
+  br i1 %g1, label %while.preheader, label %while.end
 
 while.preheader:
   br label %while.body
@@ -133,7 +135,7 @@ while.preheader:
 while.body:                                       ; preds = %while.body, %entry
   %q.addr.05 = phi i32* [ %incdec.ptr, %while.body ], [ %q, %while.preheader ]
   %p.addr.04 = phi i32* [ %incdec.ptr1, %while.body ], [ %p, %while.preheader ]
-  %0 = phi i32 [ %n, %while.preheader ], [ %2, %while.body ]
+  %0 = phi i32 [ %g0, %while.preheader ], [ %2, %while.body ]
   %incdec.ptr = getelementptr inbounds i32, i32* %q.addr.05, i32 1
   %1 = load i32, i32* %q.addr.05, align 4
   %incdec.ptr1 = getelementptr inbounds i32, i32* %p.addr.04, i32 1
@@ -261,7 +263,8 @@ exit:
 
 ; CHECK-LABEL: search
 ; CHECK: entry:
-; CHECK:   [[TEST:%[^ ]+]] = call i1 @llvm.test.set.loop.iterations.i32(i32 %N)
+; CHECK:   [[TEST1:%[^ ]+]] = call { i32, i1 } @llvm.test.start.loop.iterations.i32(i32 %N)
+; CHECK:   [[TEST:%[^ ]+]] = extractvalue { i32, i1 } [[TEST1]], 1
 ; CHECK:   br i1 [[TEST]], label %for.body.preheader, label %for.cond.cleanup
 ; CHECK: for.body.preheader:
 ; CHECK:   br label %for.body
@@ -321,7 +324,7 @@ for.inc:                                          ; preds = %sw.bb, %sw.bb1, %fo
 ; CHECK-UNROLL:     [[LOOP:.LBB[0-9_]+]]: @ %for.body
 ; CHECK-UNROLL-NOT: le lr, [[LOOP]]
 ; CHECK-UNROLL:     bne [[LOOP]]
-; CHECK-UNROLL:     wls lr, lr, [[EXIT:.LBB[0-9_]+]]
+; CHECK-UNROLL:     wls lr, r12, [[EXIT:.LBB[0-9_]+]]
 ; CHECK-UNROLL:     [[EPIL:.LBB[0-9_]+]]:
 ; CHECK-UNROLL:     le lr, [[EPIL]]
 ; CHECK-UNROLL-NEXT: [[EXIT]]
@@ -349,7 +352,7 @@ for.body:
 }
 
 ; CHECK-LABEL: unroll_inc_unsigned
-; CHECK: call i1 @llvm.test.set.loop.iterations.i32(i32 %N)
+; CHECK: call { i32, i1 } @llvm.test.start.loop.iterations.i32(i32 %N)
 ; CHECK: call i32 @llvm.loop.decrement.reg.i32(
 
 ; TODO: We should be able to support the unrolled loop body.
@@ -359,7 +362,7 @@ for.body:
 ; CHECK-UNROLL:     [[LOOP:.LBB[0-9_]+]]: @ %for.body
 ; CHECK-UNROLL-NOT: le lr, [[LOOP]]
 ; CHECK-UNROLL:     bne [[LOOP]]
-; CHECK-UNROLL:     wls lr, lr, [[EPIL_EXIT:.LBB[0-9_]+]]
+; CHECK-UNROLL:     wls lr, r12, [[EPIL_EXIT:.LBB[0-9_]+]]
 ; CHECK-UNROLL: [[EPIL:.LBB[0-9_]+]]:
 ; CHECK-UNROLL:     le lr, [[EPIL]]
 ; CHECK-UNROLL: [[EPIL_EXIT]]:
@@ -422,6 +425,6 @@ for.body:
 }
 
 declare i32 @llvm.start.loop.iterations.i32(i32) #0
-declare i1 @llvm.test.set.loop.iterations.i32(i32) #0
+declare { i32, i1 } @llvm.test.start.loop.iterations.i32(i32) #0
 declare i32 @llvm.loop.decrement.reg.i32(i32, i32) #0
 

diff  --git a/llvm/test/Transforms/HardwareLoops/loop-guards.ll b/llvm/test/Transforms/HardwareLoops/loop-guards.ll
index 840779c977b2..f1238616996e 100644
--- a/llvm/test/Transforms/HardwareLoops/loop-guards.ll
+++ b/llvm/test/Transforms/HardwareLoops/loop-guards.ll
@@ -153,7 +153,9 @@ if.end:                                           ; preds = %while.body, %entry
 ; CHECK: entry:
 ; CHECK:   br i1 %brmerge.demorgan, label %while.preheader
 ; CHECK: while.preheader:
-; CHECK:   [[TEST:%[^ ]+]] = call i1 @llvm.test.set.loop.iterations.i32(i32 %N)
+; CHECK-EXIT:   [[TEST:%[^ ]+]] = call i1 @llvm.test.set.loop.iterations.i32(i32 %N)
+; CHECK-LATCH:   [[TEST1:%[^ ]+]] = call { i32, i1 } @llvm.test.start.loop.iterations.i32(i32 %N)
+; CHECK-LATCH:  [[TEST:%[^ ]+]] = extractvalue { i32, i1 } [[TEST1]], 1
 ; CHECK:   br i1 [[TEST]], label %while.body.preheader, label %if.end
 ; CHECK: while.body.preheader:
 ; CHECK:   br label %while.body
@@ -186,7 +188,9 @@ if.end:                                           ; preds = %while.body, %while.
 ; CHECK: entry:
 ; CHECK:   br i1 %brmerge.demorgan, label %while.preheader
 ; CHECK: while.preheader:
-; CHECK:   [[TEST:%[^ ]+]] = call i1 @llvm.test.set.loop.iterations.i32(i32 %N)
+; CHECK-EXIT:   [[TEST:%[^ ]+]] = call i1 @llvm.test.set.loop.iterations.i32(i32 %N)
+; CHECK-LATCH:   [[TEST1:%[^ ]+]] = call { i32, i1 } @llvm.test.start.loop.iterations.i32(i32 %N)
+; CHECK-LATCH:  [[TEST:%[^ ]+]] = extractvalue { i32, i1 } [[TEST1]], 1
 ; CHECK:   br i1 [[TEST]], label %while.body.preheader, label %if.end
 ; CHECK: while.body.preheader:
 ; CHECK:   br label %while.body
@@ -315,7 +319,9 @@ if.end:                                           ; preds = %do.body, %entry
 ; CHECK: entry:
 ; CHECK:   br label %do.body.preheader
 ; CHECK: do.body.preheader:
-; CHECK:   [[TEST:%[^ ]+]] = call i1 @llvm.test.set.loop.iterations.i32(i32 %N)
+; CHECK-EXIT:   [[TEST:%[^ ]+]] = call i1 @llvm.test.set.loop.iterations.i32(i32 %N)
+; CHECK-LATCH:  [[TEST1:%[^ ]+]] = call { i32, i1 } @llvm.test.start.loop.iterations.i32(i32 %N)
+; CHECK-LATCH:  [[TEST:%[^ ]+]] = extractvalue { i32, i1 } [[TEST1]], 1
 ; CHECK:   br i1 [[TEST]], label %do.body.preheader1, label %if.end
 ; CHECK: do.body.preheader1:
 ; CHECK:   br label %do.body

diff  --git a/llvm/test/Transforms/HardwareLoops/scalar-while.ll b/llvm/test/Transforms/HardwareLoops/scalar-while.ll
index 91a57747cbc7..9d5695e40167 100644
--- a/llvm/test/Transforms/HardwareLoops/scalar-while.ll
+++ b/llvm/test/Transforms/HardwareLoops/scalar-while.ll
@@ -417,19 +417,21 @@ define void @while_ne(i32 %N, i32* nocapture %A) {
 ; CHECK-PHIGUARD-LABEL: @while_ne(
 ; CHECK-PHIGUARD-NEXT:  entry:
 ; CHECK-PHIGUARD-NEXT:    [[CMP:%.*]] = icmp ne i32 [[N:%.*]], 0
-; CHECK-PHIGUARD-NEXT:    [[TMP0:%.*]] = call i1 @llvm.test.set.loop.iterations.i32(i32 [[N]])
-; CHECK-PHIGUARD-NEXT:    br i1 [[TMP0]], label [[WHILE_BODY_PREHEADER:%.*]], label [[WHILE_END:%.*]]
+; CHECK-PHIGUARD-NEXT:    [[TMP0:%.*]] = call { i32, i1 } @llvm.test.start.loop.iterations.i32(i32 [[N]])
+; CHECK-PHIGUARD-NEXT:    [[TMP1:%.*]] = extractvalue { i32, i1 } [[TMP0]], 1
+; CHECK-PHIGUARD-NEXT:    [[TMP2:%.*]] = extractvalue { i32, i1 } [[TMP0]], 0
+; CHECK-PHIGUARD-NEXT:    br i1 [[TMP1]], label [[WHILE_BODY_PREHEADER:%.*]], label [[WHILE_END:%.*]]
 ; CHECK-PHIGUARD:       while.body.preheader:
 ; CHECK-PHIGUARD-NEXT:    br label [[WHILE_BODY:%.*]]
 ; CHECK-PHIGUARD:       while.body:
 ; CHECK-PHIGUARD-NEXT:    [[I_ADDR_05:%.*]] = phi i32 [ [[INC:%.*]], [[WHILE_BODY]] ], [ 0, [[WHILE_BODY_PREHEADER]] ]
-; CHECK-PHIGUARD-NEXT:    [[TMP1:%.*]] = phi i32 [ [[N]], [[WHILE_BODY_PREHEADER]] ], [ [[TMP2:%.*]], [[WHILE_BODY]] ]
+; CHECK-PHIGUARD-NEXT:    [[TMP3:%.*]] = phi i32 [ [[TMP2]], [[WHILE_BODY_PREHEADER]] ], [ [[TMP4:%.*]], [[WHILE_BODY]] ]
 ; CHECK-PHIGUARD-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[A:%.*]], i32 [[I_ADDR_05]]
 ; CHECK-PHIGUARD-NEXT:    store i32 [[I_ADDR_05]], i32* [[ARRAYIDX]], align 4
 ; CHECK-PHIGUARD-NEXT:    [[INC]] = add nuw i32 [[I_ADDR_05]], 1
-; CHECK-PHIGUARD-NEXT:    [[TMP2]] = call i32 @llvm.loop.decrement.reg.i32(i32 [[TMP1]], i32 1)
-; CHECK-PHIGUARD-NEXT:    [[TMP3:%.*]] = icmp ne i32 [[TMP2]], 0
-; CHECK-PHIGUARD-NEXT:    br i1 [[TMP3]], label [[WHILE_BODY]], label [[WHILE_END]]
+; CHECK-PHIGUARD-NEXT:    [[TMP4]] = call i32 @llvm.loop.decrement.reg.i32(i32 [[TMP3]], i32 1)
+; CHECK-PHIGUARD-NEXT:    [[TMP5:%.*]] = icmp ne i32 [[TMP4]], 0
+; CHECK-PHIGUARD-NEXT:    br i1 [[TMP5]], label [[WHILE_BODY]], label [[WHILE_END]]
 ; CHECK-PHIGUARD:       while.end:
 ; CHECK-PHIGUARD-NEXT:    ret void
 ;
@@ -523,19 +525,21 @@ define void @while_eq(i32 %N, i32* nocapture %A) {
 ; CHECK-PHIGUARD-LABEL: @while_eq(
 ; CHECK-PHIGUARD-NEXT:  entry:
 ; CHECK-PHIGUARD-NEXT:    [[CMP:%.*]] = icmp eq i32 [[N:%.*]], 0
-; CHECK-PHIGUARD-NEXT:    [[TMP0:%.*]] = call i1 @llvm.test.set.loop.iterations.i32(i32 [[N]])
-; CHECK-PHIGUARD-NEXT:    br i1 [[TMP0]], label [[WHILE_BODY_PREHEADER:%.*]], label [[WHILE_END:%.*]]
+; CHECK-PHIGUARD-NEXT:    [[TMP0:%.*]] = call { i32, i1 } @llvm.test.start.loop.iterations.i32(i32 [[N]])
+; CHECK-PHIGUARD-NEXT:    [[TMP1:%.*]] = extractvalue { i32, i1 } [[TMP0]], 1
+; CHECK-PHIGUARD-NEXT:    [[TMP2:%.*]] = extractvalue { i32, i1 } [[TMP0]], 0
+; CHECK-PHIGUARD-NEXT:    br i1 [[TMP1]], label [[WHILE_BODY_PREHEADER:%.*]], label [[WHILE_END:%.*]]
 ; CHECK-PHIGUARD:       while.body.preheader:
 ; CHECK-PHIGUARD-NEXT:    br label [[WHILE_BODY:%.*]]
 ; CHECK-PHIGUARD:       while.body:
 ; CHECK-PHIGUARD-NEXT:    [[I_ADDR_05:%.*]] = phi i32 [ [[INC:%.*]], [[WHILE_BODY]] ], [ 0, [[WHILE_BODY_PREHEADER]] ]
-; CHECK-PHIGUARD-NEXT:    [[TMP1:%.*]] = phi i32 [ [[N]], [[WHILE_BODY_PREHEADER]] ], [ [[TMP2:%.*]], [[WHILE_BODY]] ]
+; CHECK-PHIGUARD-NEXT:    [[TMP3:%.*]] = phi i32 [ [[TMP2]], [[WHILE_BODY_PREHEADER]] ], [ [[TMP4:%.*]], [[WHILE_BODY]] ]
 ; CHECK-PHIGUARD-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[A:%.*]], i32 [[I_ADDR_05]]
 ; CHECK-PHIGUARD-NEXT:    store i32 [[I_ADDR_05]], i32* [[ARRAYIDX]], align 4
 ; CHECK-PHIGUARD-NEXT:    [[INC]] = add nuw i32 [[I_ADDR_05]], 1
-; CHECK-PHIGUARD-NEXT:    [[TMP2]] = call i32 @llvm.loop.decrement.reg.i32(i32 [[TMP1]], i32 1)
-; CHECK-PHIGUARD-NEXT:    [[TMP3:%.*]] = icmp ne i32 [[TMP2]], 0
-; CHECK-PHIGUARD-NEXT:    br i1 [[TMP3]], label [[WHILE_BODY]], label [[WHILE_END]]
+; CHECK-PHIGUARD-NEXT:    [[TMP4]] = call i32 @llvm.loop.decrement.reg.i32(i32 [[TMP3]], i32 1)
+; CHECK-PHIGUARD-NEXT:    [[TMP5:%.*]] = icmp ne i32 [[TMP4]], 0
+; CHECK-PHIGUARD-NEXT:    br i1 [[TMP5]], label [[WHILE_BODY]], label [[WHILE_END]]
 ; CHECK-PHIGUARD:       while.end:
 ; CHECK-PHIGUARD-NEXT:    ret void
 ;
@@ -639,19 +643,21 @@ define void @while_preheader_eq(i32 %N, i32* nocapture %A) {
 ; CHECK-PHIGUARD-NEXT:    br label [[PREHEADER:%.*]]
 ; CHECK-PHIGUARD:       preheader:
 ; CHECK-PHIGUARD-NEXT:    [[CMP:%.*]] = icmp eq i32 [[N:%.*]], 0
-; CHECK-PHIGUARD-NEXT:    [[TMP0:%.*]] = call i1 @llvm.test.set.loop.iterations.i32(i32 [[N]])
-; CHECK-PHIGUARD-NEXT:    br i1 [[TMP0]], label [[WHILE_BODY_PREHEADER:%.*]], label [[WHILE_END:%.*]]
+; CHECK-PHIGUARD-NEXT:    [[TMP0:%.*]] = call { i32, i1 } @llvm.test.start.loop.iterations.i32(i32 [[N]])
+; CHECK-PHIGUARD-NEXT:    [[TMP1:%.*]] = extractvalue { i32, i1 } [[TMP0]], 1
+; CHECK-PHIGUARD-NEXT:    [[TMP2:%.*]] = extractvalue { i32, i1 } [[TMP0]], 0
+; CHECK-PHIGUARD-NEXT:    br i1 [[TMP1]], label [[WHILE_BODY_PREHEADER:%.*]], label [[WHILE_END:%.*]]
 ; CHECK-PHIGUARD:       while.body.preheader:
 ; CHECK-PHIGUARD-NEXT:    br label [[WHILE_BODY:%.*]]
 ; CHECK-PHIGUARD:       while.body:
 ; CHECK-PHIGUARD-NEXT:    [[I_ADDR_05:%.*]] = phi i32 [ [[INC:%.*]], [[WHILE_BODY]] ], [ 0, [[WHILE_BODY_PREHEADER]] ]
-; CHECK-PHIGUARD-NEXT:    [[TMP1:%.*]] = phi i32 [ [[N]], [[WHILE_BODY_PREHEADER]] ], [ [[TMP2:%.*]], [[WHILE_BODY]] ]
+; CHECK-PHIGUARD-NEXT:    [[TMP3:%.*]] = phi i32 [ [[TMP2]], [[WHILE_BODY_PREHEADER]] ], [ [[TMP4:%.*]], [[WHILE_BODY]] ]
 ; CHECK-PHIGUARD-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[A:%.*]], i32 [[I_ADDR_05]]
 ; CHECK-PHIGUARD-NEXT:    store i32 [[I_ADDR_05]], i32* [[ARRAYIDX]], align 4
 ; CHECK-PHIGUARD-NEXT:    [[INC]] = add nuw i32 [[I_ADDR_05]], 1
-; CHECK-PHIGUARD-NEXT:    [[TMP2]] = call i32 @llvm.loop.decrement.reg.i32(i32 [[TMP1]], i32 1)
-; CHECK-PHIGUARD-NEXT:    [[TMP3:%.*]] = icmp ne i32 [[TMP2]], 0
-; CHECK-PHIGUARD-NEXT:    br i1 [[TMP3]], label [[WHILE_BODY]], label [[WHILE_END]]
+; CHECK-PHIGUARD-NEXT:    [[TMP4]] = call i32 @llvm.loop.decrement.reg.i32(i32 [[TMP3]], i32 1)
+; CHECK-PHIGUARD-NEXT:    [[TMP5:%.*]] = icmp ne i32 [[TMP4]], 0
+; CHECK-PHIGUARD-NEXT:    br i1 [[TMP5]], label [[WHILE_BODY]], label [[WHILE_END]]
 ; CHECK-PHIGUARD:       while.end:
 ; CHECK-PHIGUARD-NEXT:    ret void
 ;


        

