[llvm] [SystemZ] Add a SystemZ specific pre-RA scheduling strategy. (PR #135076)

via llvm-commits llvm-commits at lists.llvm.org
Wed Apr 9 13:05:19 PDT 2025


llvmbot wrote:


@llvm/pr-subscribers-backend-systemz

Author: Jonas Paulsson (JonPsson1)

Changes:

Since the heavy spilling in Cactus was found to be caused by the
MachineScheduler, it has been clear that there is room for improvement in
GenericScheduler. A first attempt was made with #90181, but it did not
result in anything actually useful.

This time, a SystemZ-specific MachineSchedStrategy has been developed, guided
by experimentation and benchmarking.

Having found that GenericScheduler fails to handle liveness well despite all
the work it does with PressureSets etc., I tried a simpler approach: keep
track of live registers and schedule e.g. a load immediately if it closes a
live range. If this is always done first, it then seems to work well to also
consider latencies as a second step. The final piece that makes it all work
well is the special treatment of tiny regions, which remedies observed
problems with comparison elimination, phys-reg copies and even compile time.
These regions (with up to 10 instructions) are mostly left as they were.
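
As an illustration of the kill-first idea, here is a minimal standalone
sketch under simplified assumptions; Candidate, livenessScore and pickBest
are made-up stand-ins for this example, not names from the patch:

```
// Bottom-up: the live set holds vregs live at the current boundary.
// Scheduling an instruction closes the ranges of its live defs and opens
// a range for each use that is not yet live.
#include <cstdint>
#include <set>
#include <vector>

struct Candidate {
  std::vector<uint32_t> Defs; // Virtual registers written.
  std::vector<uint32_t> Uses; // Virtual registers read.
  unsigned Height = 0;        // Latency height in the DAG.
};

// Negative score => picking C here shrinks the number of live registers.
static int livenessScore(const Candidate &C, const std::set<uint32_t> &Live) {
  int Delta = 0;
  for (uint32_t D : C.Defs)
    if (Live.count(D))
      --Delta; // Def closes a live range: schedule it now.
  for (uint32_t U : C.Uses)
    if (!Live.count(U))
      ++Delta; // Use makes a new register live above this point.
  return Delta;
}

// Liveness decides first; latency (height) only breaks ties, matching the
// two-step ordering described above.
static const Candidate *pickBest(const std::vector<Candidate> &Cands,
                                 const std::set<uint32_t> &Live) {
  const Candidate *Best = nullptr;
  for (const Candidate &C : Cands) {
    if (!Best) {
      Best = &C;
      continue;
    }
    int SC = livenessScore(C, Live);
    int SB = livenessScore(*Best, Live);
    if (SC < SB || (SC == SB && C.Height > Best->Height))
      Best = &C;
  }
  return Best;
}
```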

SPEC Results (omitting <1% changes):

```
f538.imagick_r      -13.34%
f507.cactuBSSN_r    -11.60% (*)
f526.blender_r       -3.82%
f544.nab_r           -1.99%
f508.namd_r          -1.82%
f519.lbm_r           -1.67%

(*) Improved 5% by disabling pre-RA scheduling.

geometric mean across all benchmarks: -1.81%

Number of spill/reload instructions: -5718 (-1%)
Number of register moves:            +1778 (+0.2%)
```

Compile time at first showed some heavy regressions, caused by a large
number of LIS->getInterval() lookups in functions with many small blocks
(regions) and vregs. With the tiny-regions handling, however, this problem
went away, and compile time now seems to be on par with GenericScheduler.
If needed, there are likely things that could be sped up further.
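
One way to see why such lookups can dominate is to contrast per-candidate
interval queries with an incrementally maintained live set. A hypothetical
sketch of the cheaper alternative (Instr and LiveTracker are invented for
this example, not taken from the patch):

```
// One O(#operands) update per *scheduled* instruction, instead of one
// LiveIntervals query per *candidate comparison*.
#include <cstdint>
#include <set>
#include <vector>

struct Instr {
  std::vector<uint32_t> Defs;
  std::vector<uint32_t> Uses;
};

struct LiveTracker {
  std::set<uint32_t> Live; // Vregs live at the current bottom boundary.

  // Bottom-up update when I is scheduled: its defs end here (no longer
  // live above this point), its uses become live above it.
  void scheduled(const Instr &I) {
    for (uint32_t D : I.Defs)
      Live.erase(D);
    for (uint32_t U : I.Uses)
      Live.insert(U);
  }
};
```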



---

Patch is 211.25 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/135076.diff


58 Files Affected:

- (modified) llvm/include/llvm/CodeGen/MachineScheduler.h (+2) 
- (modified) llvm/lib/CodeGen/MachineScheduler.cpp (+26-25) 
- (modified) llvm/lib/Target/SystemZ/SystemZElimCompare.cpp (+5-37) 
- (modified) llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp (+28) 
- (modified) llvm/lib/Target/SystemZ/SystemZInstrInfo.h (+11) 
- (modified) llvm/lib/Target/SystemZ/SystemZMachineScheduler.cpp (+407-8) 
- (modified) llvm/lib/Target/SystemZ/SystemZMachineScheduler.h (+74-1) 
- (modified) llvm/lib/Target/SystemZ/SystemZTargetMachine.cpp (+19) 
- (modified) llvm/lib/Target/SystemZ/SystemZTargetMachine.h (+4) 
- (modified) llvm/test/CodeGen/SystemZ/DAGCombiner_isAlias.ll (+4-4) 
- (modified) llvm/test/CodeGen/SystemZ/alias-01.ll (+2-2) 
- (modified) llvm/test/CodeGen/SystemZ/args-06.ll (+4-4) 
- (modified) llvm/test/CodeGen/SystemZ/args-12.ll (+1-1) 
- (modified) llvm/test/CodeGen/SystemZ/atomic-load-09.ll (+2-2) 
- (modified) llvm/test/CodeGen/SystemZ/atomic-store-08.ll (+4-4) 
- (modified) llvm/test/CodeGen/SystemZ/atomic-store-09.ll (+2-2) 
- (modified) llvm/test/CodeGen/SystemZ/atomicrmw-fmax-01.ll (+4-4) 
- (modified) llvm/test/CodeGen/SystemZ/atomicrmw-fmin-01.ll (+4-4) 
- (modified) llvm/test/CodeGen/SystemZ/atomicrmw-ops-i128.ll (+3-3) 
- (modified) llvm/test/CodeGen/SystemZ/bswap-09.ll (+5-5) 
- (modified) llvm/test/CodeGen/SystemZ/bswap-10.ll (+5-5) 
- (modified) llvm/test/CodeGen/SystemZ/call-zos-vec.ll (+9-9) 
- (modified) llvm/test/CodeGen/SystemZ/dag-combine-05.ll (+10-10) 
- (modified) llvm/test/CodeGen/SystemZ/inline-asm-fp-int-casting-explicit-regs-zEC12.ll (+4-4) 
- (modified) llvm/test/CodeGen/SystemZ/inline-asm-fp-int-casting-zEC12.ll (+4-4) 
- (modified) llvm/test/CodeGen/SystemZ/int-conv-14.ll (+6-6) 
- (modified) llvm/test/CodeGen/SystemZ/int-mul-12.ll (+13-14) 
- (modified) llvm/test/CodeGen/SystemZ/int-mul-13.ll (+4-4) 
- (modified) llvm/test/CodeGen/SystemZ/int-mul-15.ll (+5-5) 
- (modified) llvm/test/CodeGen/SystemZ/int-uadd-14.ll (+19-19) 
- (modified) llvm/test/CodeGen/SystemZ/int-usub-13.ll (+25-25) 
- (modified) llvm/test/CodeGen/SystemZ/machine-combiner-reassoc-fp.ll (+96-96) 
- (modified) llvm/test/CodeGen/SystemZ/memcpy-01.ll (+2-2) 
- (modified) llvm/test/CodeGen/SystemZ/memset-01.ll (+2-2) 
- (added) llvm/test/CodeGen/SystemZ/misched-prera-biaspregs.mir (+87) 
- (added) llvm/test/CodeGen/SystemZ/misched-prera-copy-coal.mir (+31) 
- (added) llvm/test/CodeGen/SystemZ/misched-prera-latencies.mir (+167) 
- (added) llvm/test/CodeGen/SystemZ/misched-prera-loads.mir (+391) 
- (added) llvm/test/CodeGen/SystemZ/misched-prera-manystores-01.ll (+31) 
- (added) llvm/test/CodeGen/SystemZ/misched-prera-manystores-02.mir (+200) 
- (added) llvm/test/CodeGen/SystemZ/misched-prera-manystores-03.mir (+154) 
- (added) llvm/test/CodeGen/SystemZ/misched-prera-tinyregions.mir (+160) 
- (modified) llvm/test/CodeGen/SystemZ/regcoal_remat_empty_subrange.ll (+3-3) 
- (modified) llvm/test/CodeGen/SystemZ/rot-03.ll (+11-11) 
- (modified) llvm/test/CodeGen/SystemZ/shift-13.ll (+24-24) 
- (modified) llvm/test/CodeGen/SystemZ/shift-14.ll (+24-24) 
- (modified) llvm/test/CodeGen/SystemZ/shift-15.ll (+24-24) 
- (modified) llvm/test/CodeGen/SystemZ/shift-16.ll (+43-43) 
- (modified) llvm/test/CodeGen/SystemZ/shift-17.ll (+50-50) 
- (modified) llvm/test/CodeGen/SystemZ/store_nonbytesized_vecs.ll (+40-40) 
- (modified) llvm/test/CodeGen/SystemZ/vec-args-04.ll (+8-8) 
- (modified) llvm/test/CodeGen/SystemZ/vec-cmp-cmp-logic-select.ll (+151-151) 
- (renamed) llvm/test/CodeGen/SystemZ/vec-cmpsel-01.ll (+94-105) 
- (added) llvm/test/CodeGen/SystemZ/vec-cmpsel-02.ll (+70) 
- (modified) llvm/test/CodeGen/SystemZ/vec-eval.ll (+8-8) 
- (modified) llvm/test/CodeGen/SystemZ/vec-move-23.ll (+165-85) 
- (modified) llvm/test/CodeGen/SystemZ/vec-sub-01.ll (+10-10) 
- (modified) llvm/test/CodeGen/SystemZ/vector-constrained-fp-intrinsics.ll (+99-99) 


``````````diff
diff --git a/llvm/include/llvm/CodeGen/MachineScheduler.h b/llvm/include/llvm/CodeGen/MachineScheduler.h
index bc00d0b4ff852..8a819051f069a 100644
--- a/llvm/include/llvm/CodeGen/MachineScheduler.h
+++ b/llvm/include/llvm/CodeGen/MachineScheduler.h
@@ -1087,6 +1087,7 @@ class GenericSchedulerBase : public MachineSchedStrategy {
     NoCand,
     Only1,
     PhysReg,
+    LivenessReduce,
     RegExcess,
     RegCritical,
     Stall,
@@ -1218,6 +1219,7 @@ class GenericSchedulerBase : public MachineSchedStrategy {
 };
 
 // Utility functions used by heuristics in tryCandidate().
+unsigned computeRemLatency(SchedBoundary &CurrZone);
 bool tryLess(int TryVal, int CandVal,
              GenericSchedulerBase::SchedCandidate &TryCand,
              GenericSchedulerBase::SchedCandidate &Cand,
diff --git a/llvm/lib/CodeGen/MachineScheduler.cpp b/llvm/lib/CodeGen/MachineScheduler.cpp
index 97f27277aface..579f787b0af9b 100644
--- a/llvm/lib/CodeGen/MachineScheduler.cpp
+++ b/llvm/lib/CodeGen/MachineScheduler.cpp
@@ -3168,31 +3168,6 @@ initResourceDelta(const ScheduleDAGMI *DAG,
   }
 }
 
-/// Compute remaining latency. We need this both to determine whether the
-/// overall schedule has become latency-limited and whether the instructions
-/// outside this zone are resource or latency limited.
-///
-/// The "dependent" latency is updated incrementally during scheduling as the
-/// max height/depth of scheduled nodes minus the cycles since it was
-/// scheduled:
-///   DLat = max(N.depth - (CurrCycle - N.ReadyCycle)) for N in Zone
-///
-/// The "independent" latency is the max ready queue depth:
-///   ILat = max N.depth for N in Available|Pending
-///
-/// RemainingLatency is the greater of independent and dependent latency.
-///
-/// These computations are expensive, especially in DAGs with many edges, so
-/// only do them if necessary.
-static unsigned computeRemLatency(SchedBoundary &CurrZone) {
-  unsigned RemLatency = CurrZone.getDependentLatency();
-  RemLatency = std::max(RemLatency,
-                        CurrZone.findMaxLatency(CurrZone.Available.elements()));
-  RemLatency = std::max(RemLatency,
-                        CurrZone.findMaxLatency(CurrZone.Pending.elements()));
-  return RemLatency;
-}
-
 /// Returns true if the current cycle plus remaining latency is greater than
 /// the critical path in the scheduling region.
 bool GenericSchedulerBase::shouldReduceLatency(const CandPolicy &Policy,
@@ -3278,6 +3253,7 @@ const char *GenericSchedulerBase::getReasonStr(
   case NoCand:         return "NOCAND    ";
   case Only1:          return "ONLY1     ";
   case PhysReg:        return "PHYS-REG  ";
+  case LivenessReduce: return "LIVE-REDUC";
   case RegExcess:      return "REG-EXCESS";
   case RegCritical:    return "REG-CRIT  ";
   case Stall:          return "STALL     ";
@@ -3351,6 +3327,31 @@ void GenericSchedulerBase::traceCandidate(const SchedCandidate &Cand) {
 #endif
 
 namespace llvm {
+/// Compute remaining latency. We need this both to determine whether the
+/// overall schedule has become latency-limited and whether the instructions
+/// outside this zone are resource or latency limited.
+///
+/// The "dependent" latency is updated incrementally during scheduling as the
+/// max height/depth of scheduled nodes minus the cycles since it was
+/// scheduled:
+///   DLat = max(N.depth - (CurrCycle - N.ReadyCycle)) for N in Zone
+///
+/// The "independent" latency is the max ready queue depth:
+///   ILat = max N.depth for N in Available|Pending
+///
+/// RemainingLatency is the greater of independent and dependent latency.
+///
+/// These computations are expensive, especially in DAGs with many edges, so
+/// only do them if necessary.
+unsigned computeRemLatency(SchedBoundary &CurrZone) {
+  unsigned RemLatency = CurrZone.getDependentLatency();
+  RemLatency = std::max(RemLatency,
+                        CurrZone.findMaxLatency(CurrZone.Available.elements()));
+  RemLatency = std::max(RemLatency,
+                        CurrZone.findMaxLatency(CurrZone.Pending.elements()));
+  return RemLatency;
+}
+
 /// Return true if this heuristic determines order.
 /// TODO: Consider refactor return type of these functions as integer or enum,
 /// as we may need to differentiate whether TryCand is better than Cand.
diff --git a/llvm/lib/Target/SystemZ/SystemZElimCompare.cpp b/llvm/lib/Target/SystemZ/SystemZElimCompare.cpp
index 81f0014dd83f2..149a7b2f451d5 100644
--- a/llvm/lib/Target/SystemZ/SystemZElimCompare.cpp
+++ b/llvm/lib/Target/SystemZ/SystemZElimCompare.cpp
@@ -151,30 +151,6 @@ Reference SystemZElimCompare::getRegReferences(MachineInstr &MI, unsigned Reg) {
   return Ref;
 }
 
-// Return true if this is a load and test which can be optimized the
-// same way as compare instruction.
-static bool isLoadAndTestAsCmp(MachineInstr &MI) {
-  // If we during isel used a load-and-test as a compare with 0, the
-  // def operand is dead.
-  return (MI.getOpcode() == SystemZ::LTEBR ||
-          MI.getOpcode() == SystemZ::LTDBR ||
-          MI.getOpcode() == SystemZ::LTXBR) &&
-         MI.getOperand(0).isDead();
-}
-
-// Return the source register of Compare, which is the unknown value
-// being tested.
-static unsigned getCompareSourceReg(MachineInstr &Compare) {
-  unsigned reg = 0;
-  if (Compare.isCompare())
-    reg = Compare.getOperand(0).getReg();
-  else if (isLoadAndTestAsCmp(Compare))
-    reg = Compare.getOperand(1).getReg();
-  assert(reg);
-
-  return reg;
-}
-
 // Compare compares the result of MI against zero.  If MI is an addition
 // of -1 and if CCUsers is a single branch on nonzero, eliminate the addition
 // and convert the branch to a BRCT(G) or BRCTH.  Return true on success.
@@ -207,7 +183,7 @@ bool SystemZElimCompare::convertToBRCT(
   // We already know that there are no references to the register between
   // MI and Compare.  Make sure that there are also no references between
   // Compare and Branch.
-  unsigned SrcReg = getCompareSourceReg(Compare);
+  unsigned SrcReg = TII->getCompareSourceReg(Compare);
   MachineBasicBlock::iterator MBBI = Compare, MBBE = Branch;
   for (++MBBI; MBBI != MBBE; ++MBBI)
     if (getRegReferences(*MBBI, SrcReg))
@@ -254,7 +230,7 @@ bool SystemZElimCompare::convertToLoadAndTrap(
   // We already know that there are no references to the register between
   // MI and Compare.  Make sure that there are also no references between
   // Compare and Branch.
-  unsigned SrcReg = getCompareSourceReg(Compare);
+  unsigned SrcReg = TII->getCompareSourceReg(Compare);
   MachineBasicBlock::iterator MBBI = Compare, MBBE = Branch;
   for (++MBBI; MBBI != MBBE; ++MBBI)
     if (getRegReferences(*MBBI, SrcReg))
@@ -495,25 +471,17 @@ bool SystemZElimCompare::adjustCCMasksForInstr(
   return true;
 }
 
-// Return true if Compare is a comparison against zero.
-static bool isCompareZero(MachineInstr &Compare) {
-  if (isLoadAndTestAsCmp(Compare))
-    return true;
-  return Compare.getNumExplicitOperands() == 2 &&
-    Compare.getOperand(1).isImm() && Compare.getOperand(1).getImm() == 0;
-}
-
 // Try to optimize cases where comparison instruction Compare is testing
 // a value against zero.  Return true on success and if Compare should be
 // deleted as dead.  CCUsers is the list of instructions that use the CC
 // value produced by Compare.
 bool SystemZElimCompare::optimizeCompareZero(
     MachineInstr &Compare, SmallVectorImpl<MachineInstr *> &CCUsers) {
-  if (!isCompareZero(Compare))
+  if (!TII->isCompareZero(Compare))
     return false;
 
   // Search back for CC results that are based on the first operand.
-  unsigned SrcReg = getCompareSourceReg(Compare);
+  unsigned SrcReg = TII->getCompareSourceReg(Compare);
   MachineBasicBlock &MBB = *Compare.getParent();
   Reference CCRefs;
   Reference SrcRefs;
@@ -702,7 +670,7 @@ bool SystemZElimCompare::processBlock(MachineBasicBlock &MBB) {
   MachineBasicBlock::iterator MBBI = MBB.end();
   while (MBBI != MBB.begin()) {
     MachineInstr &MI = *--MBBI;
-    if (CompleteCCUsers && (MI.isCompare() || isLoadAndTestAsCmp(MI)) &&
+    if (CompleteCCUsers && (MI.isCompare() || TII->isLoadAndTestAsCmp(MI)) &&
         (optimizeCompareZero(MI, CCUsers) ||
          fuseCompareOperations(MI, CCUsers))) {
       ++MBBI;
diff --git a/llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp b/llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp
index 91a4aa9c73010..54a2adcc46198 100644
--- a/llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp
+++ b/llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp
@@ -2144,6 +2144,34 @@ unsigned SystemZInstrInfo::getFusedCompare(unsigned Opcode,
   return 0;
 }
 
+bool SystemZInstrInfo::isLoadAndTestAsCmp(const MachineInstr &MI) const {
+  // If we during isel used a load-and-test as a compare with 0, the
+  // def operand is dead.
+  return (MI.getOpcode() == SystemZ::LTEBR ||
+          MI.getOpcode() == SystemZ::LTDBR ||
+          MI.getOpcode() == SystemZ::LTXBR) &&
+    MI.getOperand(0).isDead();
+}
+
+bool SystemZInstrInfo::isCompareZero(const MachineInstr &Compare) const {
+  if (isLoadAndTestAsCmp(Compare))
+    return true;
+  return Compare.isCompare() && Compare.getNumExplicitOperands() == 2 &&
+    Compare.getOperand(1).isImm() && Compare.getOperand(1).getImm() == 0;
+}
+
+unsigned SystemZInstrInfo::
+getCompareSourceReg(const MachineInstr &Compare) const {
+  unsigned reg = 0;
+  if (Compare.isCompare())
+    reg = Compare.getOperand(0).getReg();
+  else if (isLoadAndTestAsCmp(Compare))
+    reg = Compare.getOperand(1).getReg();
+  assert(reg);
+
+  return reg;
+}
+
 bool SystemZInstrInfo::
 prepareCompareSwapOperands(MachineBasicBlock::iterator const MBBI) const {
   assert(MBBI->isCompare() && MBBI->getOperand(0).isReg() &&
diff --git a/llvm/lib/Target/SystemZ/SystemZInstrInfo.h b/llvm/lib/Target/SystemZ/SystemZInstrInfo.h
index 8b82af61e669a..2030d52becc0e 100644
--- a/llvm/lib/Target/SystemZ/SystemZInstrInfo.h
+++ b/llvm/lib/Target/SystemZ/SystemZInstrInfo.h
@@ -356,6 +356,17 @@ class SystemZInstrInfo : public SystemZGenInstrInfo {
                            SystemZII::FusedCompareType Type,
                            const MachineInstr *MI = nullptr) const;
 
+  // Return true if this is a load and test which can be optimized the
+  // same way as compare instruction.
+  bool isLoadAndTestAsCmp(const MachineInstr &MI) const;
+
+  // Return true if Compare is a comparison against zero.
+  bool isCompareZero(const MachineInstr &Compare) const;
+
+  // Return the source register of Compare, which is the unknown value
+  // being tested.
+  unsigned getCompareSourceReg(const MachineInstr &Compare) const;
+
   // Try to find all CC users of the compare instruction (MBBI) and update
   // all of them to maintain equivalent behavior after swapping the compare
   // operands. Return false if not all users can be conclusively found and
diff --git a/llvm/lib/Target/SystemZ/SystemZMachineScheduler.cpp b/llvm/lib/Target/SystemZ/SystemZMachineScheduler.cpp
index 5e2365f1dc513..85376ec70edc5 100644
--- a/llvm/lib/Target/SystemZ/SystemZMachineScheduler.cpp
+++ b/llvm/lib/Target/SystemZ/SystemZMachineScheduler.cpp
@@ -5,22 +5,421 @@
 // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 //
 //===----------------------------------------------------------------------===//
-//
-// -------------------------- Post RA scheduling ---------------------------- //
-// SystemZPostRASchedStrategy is a scheduling strategy which is plugged into
-// the MachineScheduler. It has a sorted Available set of SUs and a pickNode()
-// implementation that looks to optimize decoder grouping and balance the
-// usage of processor resources. Scheduler states are saved for the end
-// region of each MBB, so that a successor block can learn from it.
-//===----------------------------------------------------------------------===//
 
 #include "SystemZMachineScheduler.h"
+#include "llvm/CodeGen/LiveInterval.h"
+#include "llvm/CodeGen/LiveIntervals.h"
 #include "llvm/CodeGen/MachineLoopInfo.h"
 
 using namespace llvm;
 
 #define DEBUG_TYPE "machine-scheduler"
 
+/// Pre-RA scheduling ///
+
+static cl::opt<unsigned> TinyRegionLim(
+    "tiny-region-lim", cl::Hidden, cl::init(10),
+    cl::desc("Run limited pre-ra scheduling on regions of this size or "
+             "smaller. Mainly for testing."));
+
+static bool isRegDef(const MachineOperand &MO) {
+  return MO.isReg() && MO.isDef();
+}
+
+static bool isVirtRegDef(const MachineOperand &MO) {
+  return isRegDef(MO) && MO.getReg().isVirtual();
+}
+
+static bool isPhysRegDef(const MachineOperand &MO) {
+  return isRegDef(MO) && MO.getReg().isPhysical();
+}
+
+static bool isVirtRegUse(const MachineOperand &MO) {
+  return MO.isReg() && MO.isUse() && MO.readsReg() && MO.getReg().isVirtual();
+}
+
+void SystemZPreRASchedStrategy::initializePrioRegClasses(
+    const TargetRegisterInfo *TRI) {
+  for (const TargetRegisterClass *RC : TRI->regclasses()) {
+    for (MVT VT : MVT::fp_valuetypes())
+      if (TRI->isTypeLegalForClass(*RC, VT)) {
+        PrioRegClasses.insert(RC->getID());
+        break;
+      }
+
+    // On SystemZ vector and FP registers overlap: add any vector RC.
+    if (!PrioRegClasses.count(RC->getID()))
+      for (MVT VT : MVT::fp_fixedlen_vector_valuetypes())
+        if (TRI->isTypeLegalForClass(*RC, VT)) {
+          PrioRegClasses.insert(RC->getID());
+          break;
+        }
+  }
+}
+
+void SystemZPreRASchedStrategy::VRegSet::dump(std::string Msg) {
+  dbgs() << Msg.c_str();
+  bool First = true;
+  for (auto R : *this) {
+    if (!First)
+      dbgs() << ", ";
+    else
+      First = false;
+    dbgs() << "%" << R.virtRegIndex();
+  }
+  dbgs() << "\n";
+}
+
+unsigned SystemZPreRASchedStrategy::getRemLat(SchedBoundary *Zone) const {
+  if (RemLat == ~0U)
+    RemLat = computeRemLatency(*Zone);
+  return RemLat;
+}
+
+void SystemZPreRASchedStrategy::initializeStoresGroup() {
+  StoresGroup.clear();
+  FirstStoreInGroupScheduled = false;
+
+  unsigned CurrMaxDepth = 0;
+  for (unsigned Idx = DAG->SUnits.size() - 1; Idx + 1 != 0; --Idx) {
+    const SUnit *SU = &DAG->SUnits[Idx];
+    const MachineInstr *MI = SU->getInstr();
+    if (!MI->getNumOperands() || MI->isCopy())
+      continue;
+
+    bool HasVirtDef = false;
+    bool HasVirtUse = false;
+    for (unsigned I = 0; I < MI->getDesc().getNumOperands(); ++I) {
+      const MachineOperand &MO = MI->getOperand(I);
+      if (isVirtRegDef(MO) && !MO.isDead())
+        HasVirtDef = true;
+      else if (isVirtRegUse(MO) &&
+               MI->getDesc().operands()[I].OperandType != MCOI::OPERAND_MEMORY)
+        HasVirtUse = true;
+    }
+    bool IsStore = !HasVirtDef && HasVirtUse;
+
+    // Find a group of stores that all are at the bottom while avoiding
+    // regions with any additional group of lesser depth.
+    if (SU->getDepth() > CurrMaxDepth) {
+      CurrMaxDepth = SU->getDepth();
+      bool PrevGroup = StoresGroup.size() > 1;
+      StoresGroup.clear();
+      if (PrevGroup)
+        return;
+      if (IsStore)
+        StoresGroup.insert(SU);
+    }
+    else if (IsStore && !StoresGroup.empty() && SU->getDepth() == CurrMaxDepth) {
+      // The group members should all have the same opcode.
+      if ((*StoresGroup.begin())->getInstr()->getOpcode() != MI->getOpcode()) {
+        StoresGroup.clear();
+        return;
+      }
+      StoresGroup.insert(SU);
+    }
+  }
+
+  // Value of 8 handles a known regression (with group of 20).
+  // TODO: Would some other value be better?
+  if (StoresGroup.size() < 8)
+    StoresGroup.clear();
+}
+
+static int biasPhysRegExtra(const SUnit *SU) {
+  if (int Res = biasPhysReg(SU, /*isTop=*/false))
+    return Res;
+
+  // Also recognize Load Address of stack slot. There are (at least
+  // currently) no instructions here defining a physreg that uses a vreg.
+  const MachineInstr *MI = SU->getInstr();
+  if (MI->getNumOperands() && !MI->isCopy()) {
+    const MachineOperand &DefMO = MI->getOperand(0);
+    if (isPhysRegDef(DefMO))
+      return 1;
+  }
+
+  return 0;
+}
+
+int SystemZPreRASchedStrategy::
+computeSULivenessScore(SchedCandidate &C, ScheduleDAGMILive *DAG,
+                       SchedBoundary *Zone) const {
+  // Not all data deps are modelled around the SUnit - some data edges near
+  // boundaries are missing: Look directly at the MI operands instead.
+  const SUnit *SU = C.SU;
+  const MachineInstr *MI = SU->getInstr();
+  if (!MI->getNumOperands() || MI->isCopy())
+    return 0;
+
+  // Find uses of registers that are not already live (kills).
+  bool PrioKill = false;
+  bool GPRKill = false;
+  bool AddrKill = false;
+  bool HasPrioUse = false;
+  for (unsigned I = 0; I < MI->getDesc().getNumOperands(); ++I) {
+    const MachineOperand &MO = MI->getOperand(I);
+    if (!isVirtRegUse(MO))
+      continue;
+    HasPrioUse |= isPrioVirtReg(MO.getReg(), &DAG->MRI);
+    if (LiveRegs.count(MO.getReg()))
+      continue;
+    if (isPrioVirtReg(MO.getReg(), &DAG->MRI))
+      PrioKill = true;
+    else if (MI->getDesc().operands()[I].OperandType != MCOI::OPERAND_MEMORY)
+      GPRKill = true;
+    else
+      AddrKill = true;
+  }
+
+  // Find the interesting properties.
+  const MachineOperand &DefMO = MI->getOperand(0);
+  assert(!isPhysRegDef(DefMO) && "Did not expect physreg def!");
+  bool IsLoad =
+      isRegDef(DefMO) && !DefMO.isDead() && !IsRedefining[SU->NodeNum];
+  bool IsStore = (!isRegDef(DefMO) || DefMO.isDead());
+  // Prioritize FP: Ignore GPR/Addr kills with an FP def.
+  bool UsesLivePrio =
+      IsLoad && !PrioKill &&
+      (isPrioVirtReg(DefMO.getReg(), &DAG->MRI) || (!GPRKill && !AddrKill));
+  bool UsesLiveAll = !PrioKill && !GPRKill && !AddrKill;
+  bool PreservesSchedLat = SU->getHeight() <= Zone->getScheduledLatency();
+  const unsigned Cycles = 2;
+  unsigned Margin = SchedModel->getIssueWidth() * (Cycles + SU->Latency - 1);
+  bool HasDistToTop = NumLeft > Margin;
+
+  // Pull down a defining SU if it preserves the scheduled latency while not
+  // causing any (prioritized) register uses to become live. If however there
+  // will be relatively many SUs scheduled above this one and all uses are
+  // already live it should not be a problem to increase the scheduled
+  // latency given the OOO execution.
+  // TODO: Try scheduling small (DFSResult) subtrees as a unit.
+  bool SchedLow = IsLoad && ((PreservesSchedLat && UsesLivePrio) ||
+                             (HasDistToTop && UsesLiveAll));
+
+  // This handles regions with many chained stores of the same depth at the
+  // bottom in the input order (cactus). Push them upwards during scheduling.
+  bool SchedHigh = IsStore && FirstStoreInGroupScheduled &&
+                   StoresGroup.count(SU) &&
+                   (PrioKill || (!HasPrioUse && GPRKill));
+
+  if (SchedLow)
+    return -1;
+  if (SchedHigh)
+    return 1;
+  return 0;
+}
+
+bool SystemZPreRASchedStrategy::tryCandidate(SchedCandidate &Cand,
+                                             SchedCandidate &TryCand,
+                                             SchedBoundary *Zone) const {
+  assert(Zone && !Zone->isTop() && "Bottom-Up scheduling only.");
+
+  // Initialize the candidate if needed.
+  if (!Cand.isValid()) {
+    TryCand.Reason = FirstValid;
+    return true;
+  }
+
+  // Bias physreg defs and copies to their uses and definitions respectively.
+  int TryCandPRegBias = biasPhysRegExtra(TryCand.SU);
+  int CandPRegBias = biasPhysRegExtra(Cand.SU);
+  if (tryGreater(TryCandPRegBias, CandPRegBias, TryCand, Cand, PhysReg))
+    return TryCand.Reason != NoCand;
+  if (TryCandPRegBias && CandPRegBias) {
+    // Both biased same way.
+    tryGreater(TryCand.SU->NodeNum, Cand.SU->NodeNum, TryCand, Cand, NodeOrder);
+    return TryCand.Reason != NoCand;
+  }
+
+  if (TinyRegion) {
+    // Prioritize instructions that read unbuffered resources by stall cycles.
+    // TODO: Try this in bigger regions as well.
+    if (tryLess(Zone->getLatencyStallCycles(TryCand.SU),
+                Zone->getLatencyStallCycles(Cand.SU), TryCand, Cand, Stall))
+      return TryCand.Reason != NoCand;
+  } else {
+    // Look for an opportunity to reduce register liveness.
+    int TryCandScore = computeSULivenessScore(TryCand, DAG, Zone);
+    int CandScore = computeSULivenessScore(Cand, DAG, Zone);
+    if (tryLess(TryCandScore, CandScore, TryCand, Cand, LivenessReduce))
+      return TryCand.Reason != NoCand;
+
+    // Don't extend the scheduled latency.
+    if (ShouldReduceLatency && TryC...
[truncated]

``````````
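
Since the patch is truncated above, a paraphrased, simplified skeleton of
the candidate ordering visible in the tryCandidate() hunks may help:
FirstValid, then physreg bias, then stall cycles for tiny regions versus
liveness reduction for larger ones. Types and helper names below are
stand-ins, and the heuristics past the truncation point are omitted:

```
// Paraphrase of the visible tryCandidate() ordering; returns true if Try
// should replace Best (bottom-up only, as asserted in the real code).
enum Reason { NoCand, FirstValid, PhysReg, NodeOrder, Stall, LivenessReduce };

struct Cand {
  bool Valid = false;
  int PRegBias = 0;         // biasPhysRegExtra() in the real patch.
  unsigned StallCycles = 0; // getLatencyStallCycles() in the real patch.
  int LivenessScore = 0;    // computeSULivenessScore() in the real patch.
  unsigned NodeNum = 0;
  Reason Why = NoCand;
};

bool tryCandidateSketch(Cand &Best, Cand &Try, bool TinyRegion) {
  if (!Best.Valid) {
    Try.Why = FirstValid; // First candidate always wins.
    return true;
  }
  // Physreg defs and copies are biased toward their uses/definitions.
  if (Try.PRegBias != Best.PRegBias) {
    Try.Why = PhysReg;
    return Try.PRegBias > Best.PRegBias;
  }
  if (Try.PRegBias && Best.PRegBias) {
    Try.Why = NodeOrder; // Both biased the same way: keep input order.
    return Try.NodeNum > Best.NodeNum;
  }
  if (TinyRegion) {
    // Tiny regions (<= TinyRegionLim, default 10 instructions): only
    // avoid stalls on unbuffered resources; otherwise leave as-is.
    if (Try.StallCycles != Best.StallCycles) {
      Try.Why = Stall;
      return Try.StallCycles < Best.StallCycles;
    }
  } else {
    // Larger regions: reduce register liveness first, then the latency
    // heuristics (truncated in the diff above).
    if (Try.LivenessScore != Best.LivenessScore) {
      Try.Why = LivenessReduce;
      return Try.LivenessScore < Best.LivenessScore;
    }
  }
  return false;
}
```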



https://github.com/llvm/llvm-project/pull/135076

