[llvm] MachineScheduler: Improve instruction clustering (PR #137784)

via llvm-commits llvm-commits at lists.llvm.org
Tue Apr 29 03:35:51 PDT 2025


llvmbot wrote:


@llvm/pr-subscribers-backend-powerpc

Author: Ruiling, Song (ruiling)

<details>
<summary>Changes</summary>

The existing way of managing clustered nodes adds weak edges between neighbouring cluster nodes, forming a sort of ordered queue, which is later recorded as `NextClusterPred` or `NextClusterSucc` in `ScheduleDAGMI`.

But instructions are not necessarily picked in the exact order of the queue. For example, suppose we have a queue of cluster nodes A B C. If node B is picked first during scheduling, then with top-down scheduling it is very likely that only B and C get clustered, leaving A alone.

Another issue is:
```
   if (!ReorderWhileClustering && SUa->NodeNum > SUb->NodeNum)
      std::swap(SUa, SUb);
   if (!DAG->addEdge(SUb, SDep(SUa, SDep::Cluster)))
```
may break the cluster queue.

For example, suppose we want to cluster nodes 1 3 2 (in `MemOpRecords` order). For the pair (1, 3), node 1 (SUa) becomes a pred of node 3 (SUb) as usual. But for the pair (3, 2), since 3 (SUa) > 2 (SUb), the two nodes are swapped, making 2 a pred of 3. Both 1 and 2 are now preds of 3, but there is no edge between 1 and 2. Thus we get a broken cluster chain, as the sketch below shows.
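
To make the failure mode concrete, here is a minimal standalone sketch of the pairwise logic above (plain ints and a pred map stand in for `SUnit`/`SDep`; this is an illustration, not the patch itself):

```cpp
#include <cstdio>
#include <map>
#include <utility>
#include <vector>

// Simplified model of BaseMemOpClusterMutation: cluster edges are added
// pairwise over neighbouring mem ops, swapping so the lower node number is
// always the pred (the !ReorderWhileClustering case).
int main() {
  std::vector<int> MemOpOrder = {1, 3, 2}; // order as in MemOpRecords
  std::multimap<int, int> ClusterPred;     // succ -> pred
  for (size_t I = 0; I + 1 < MemOpOrder.size(); ++I) {
    int SUa = MemOpOrder[I], SUb = MemOpOrder[I + 1];
    if (SUa > SUb) // SUa->NodeNum > SUb->NodeNum
      std::swap(SUa, SUb);
    ClusterPred.insert({SUb, SUa}); // addEdge(SUb, SDep(SUa, Cluster))
  }
  // Prints 1 -> 3 and 2 -> 3: node 3 has two preds, but 1 and 2 are not
  // linked to each other, so the intended chain over {1, 2, 3} is broken.
  for (auto &[Succ, Pred] : ClusterPred)
    std::printf("cluster edge: %d -> %d\n", Pred, Succ);
}
```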

To fix both issues, this change manages each cluster as an unordered set instead. This helps improve clustering in some hard cases.
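
Concretely, the patch unions clustered nodes with `llvm::EquivalenceClasses` and stores each resulting group as a `SmallSet` (`ClusterInfo`), as the `MachineScheduler.cpp` hunk below shows. Here is a standalone sketch of the idea, with a toy union-find standing in for the LLVM ADTs:

```cpp
#include <cstdio>
#include <numeric>
#include <set>
#include <vector>

// Toy union-find standing in for llvm::EquivalenceClasses<SUnit *>.
struct UnionFind {
  std::vector<int> Parent;
  explicit UnionFind(int N) : Parent(N) {
    std::iota(Parent.begin(), Parent.end(), 0);
  }
  int find(int X) { return Parent[X] == X ? X : Parent[X] = find(Parent[X]); }
  void unionSets(int A, int B) { Parent[find(A)] = find(B); }
};

int main() {
  // Same pairs as the broken example above: (1, 3) and (3, 2).
  UnionFind Clusters(4);
  Clusters.unionSets(1, 3);
  Clusters.unionSets(3, 2);
  // Collect the equivalence class into an unordered group, as the patch
  // does with SmallSet<SUnit *, 8> (ClusterInfo) and ParentClusterIdx.
  std::set<int> Group;
  for (int N = 1; N <= 3; ++N)
    if (Clusters.find(N) == Clusters.find(1))
      Group.insert(N);
  // Membership is now a set query, independent of any pick order: all of
  // 1, 2, 3 land in one cluster, so picking any member lets the scheduler
  // prefer the remaining members next.
  for (int N : Group)
    std::printf("cluster member: %d\n", N);
}
```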

There are two major reasons for the large number of test check changes:
1. The existing implementation has some buggy behavior: the scheduler does not reset the pointer to the next cluster candidate. For example, suppose we want to cluster A and B, but after picking A we pick node C instead. In theory, the next cluster candidate should be reset at that point, because we have effectively decided not to cluster A and B; still picking B later for the Cluster reason is not logical (see the sketch after this list).

2. Since the cluster candidates are no longer ordered, they might be picked in a different order than before.
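
A rough sketch of why the set-based scheme avoids the reset bug in item 1: the cluster preference is recomputed from whatever node was just scheduled (as `GenericScheduler::schedNode` does in the diff via `DAG->getCluster(SU->ParentClusterIdx)`), so picking an unclustered node clears it automatically. The types here are simplified stand-ins, not the real scheduler API:

```cpp
#include <cstdio>
#include <set>
#include <string>

using ClusterInfo = std::set<std::string>; // stands in for SmallSet<SUnit *, 8>

// Old scheme (buggy): NextClusterSucc was only ever *set* when a cluster
// edge was released, never cleared when scheduling left the cluster.
// New scheme: recompute the active cluster from the node just scheduled,
// so it resets to null whenever an unclustered node is picked.
const ClusterInfo *updateTopCluster(const std::string &JustScheduled,
                                    const ClusterInfo &ABCluster) {
  return ABCluster.count(JustScheduled) ? &ABCluster : nullptr;
}

int main() {
  ClusterInfo AB = {"A", "B"}; // A and B were clustered; C was not
  const ClusterInfo *TopCluster = updateTopCluster("A", AB);
  std::printf("after A: %s\n", TopCluster ? "prefer cluster" : "no cluster");
  TopCluster = updateTopCluster("C", AB); // picked C instead of B
  std::printf("after C: %s\n", TopCluster ? "prefer cluster" : "no cluster");
  // After A, the {A, B} cluster is preferred; after C, the preference is
  // gone, so B is no longer picked for a stale Cluster reason.
}
```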

The most affected targets are AMDGPU, AArch64, and RISCV.

For RISCV, most changes appear to be minor instruction reorderings; I don't see any obvious regressions.

For AArch64, some combining of ldr into ldp is affected, with two cases regressed and two improved. The deeper reason is that the machine scheduler cannot cluster these loads well either before or after the change, and the later load-combine algorithm is also not smart enough.

For AMDGPU, some cases use more v_dual instructions while others are regressed; this seems less critical. The test `v_vselect_v32bf16` appears to get more buffer_load instructions claused.

---

Patch is 5.52 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/137784.diff


176 Files Affected:

- (modified) llvm/include/llvm/CodeGen/MachineScheduler.h (+6-8) 
- (modified) llvm/include/llvm/CodeGen/ScheduleDAG.h (+7) 
- (modified) llvm/include/llvm/CodeGen/ScheduleDAGInstrs.h (+10) 
- (modified) llvm/lib/CodeGen/MachineScheduler.cpp (+52-23) 
- (modified) llvm/lib/CodeGen/MacroFusion.cpp (+13) 
- (modified) llvm/lib/CodeGen/ScheduleDAG.cpp (+3) 
- (modified) llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp (+10-12) 
- (modified) llvm/lib/Target/PowerPC/PPCMachineScheduler.cpp (+10-8) 
- (modified) llvm/test/CodeGen/AArch64/argument-blocks-array-of-struct.ll (+1-2) 
- (modified) llvm/test/CodeGen/AArch64/arm64-dagcombiner-load-slicing.ll (+6-6) 
- (modified) llvm/test/CodeGen/AArch64/bcmp.ll (+4-3) 
- (modified) llvm/test/CodeGen/AArch64/expand-select.ll (+10-10) 
- (modified) llvm/test/CodeGen/AArch64/extbinopload.ll (+58-57) 
- (modified) llvm/test/CodeGen/AArch64/fcmp.ll (+17-17) 
- (modified) llvm/test/CodeGen/AArch64/fp-conversion-to-tbl.ll (+15-15) 
- (modified) llvm/test/CodeGen/AArch64/fptoi.ll (+70-70) 
- (modified) llvm/test/CodeGen/AArch64/fptoui-sat-vector.ll (+16-16) 
- (modified) llvm/test/CodeGen/AArch64/itofp.ll (+90-90) 
- (modified) llvm/test/CodeGen/AArch64/mul.ll (+12-12) 
- (modified) llvm/test/CodeGen/AArch64/nontemporal-load.ll (+9-8) 
- (modified) llvm/test/CodeGen/AArch64/nzcv-save.ll (+9-9) 
- (modified) llvm/test/CodeGen/AArch64/sve-fixed-vector-llrint.ll (+43-43) 
- (modified) llvm/test/CodeGen/AArch64/sve-fixed-vector-lrint.ll (+43-43) 
- (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-bitselect.ll (+47-47) 
- (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-fp-convert.ll (+8-8) 
- (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-fp-reduce.ll (+12-12) 
- (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-fp-vselect.ll (+12-12) 
- (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-extends.ll (+80-82) 
- (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-mulh.ll (+12-12) 
- (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-reduce.ll (+16-16) 
- (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-to-fp.ll (+74-72) 
- (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-vselect.ll (+24-24) 
- (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-permute-zip-uzp-trn.ll (+42-42) 
- (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-ptest.ll (+6-6) 
- (modified) llvm/test/CodeGen/AArch64/vec_uaddo.ll (+1-1) 
- (modified) llvm/test/CodeGen/AArch64/vec_umulo.ll (+4-4) 
- (modified) llvm/test/CodeGen/AArch64/vselect-ext.ll (+15-15) 
- (modified) llvm/test/CodeGen/AArch64/wide-scalar-shift-legalization.ll (+28-31) 
- (modified) llvm/test/CodeGen/AArch64/zext-to-tbl.ll (+55-54) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/add.vni16.ll (+27-27) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/ashr.ll (+11-12) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.ll (+30-30) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/fshl.ll (+321-314) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/fshr.ll (+292-289) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/insertelement.i16.ll (+12-11) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/insertelement.i8.ll (+10-9) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/insertelement.ll (+4-6) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll (+4-4) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/lshr.ll (+2-2) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/mul.ll (+124-125) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/saddsat.ll (+40-39) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/sdivrem.ll (+19-19) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/ssubsat.ll (+49-49) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/udiv.i64.ll (+24-24) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/udivrem.ll (+69-69) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/urem.i64.ll (+24-24) 
- (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.1024bit.ll (+18539-18522) 
- (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.128bit.ll (+14-12) 
- (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.256bit.ll (+4-4) 
- (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.320bit.ll (+134-134) 
- (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.512bit.ll (+3747-3714) 
- (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.576bit.ll (+107-127) 
- (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.704bit.ll (+173-183) 
- (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.960bit.ll (+423-414) 
- (modified) llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-cc.ll (+2-2) 
- (modified) llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-preserve-cc.ll (+2-2) 
- (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll (+6-6) 
- (modified) llvm/test/CodeGen/AMDGPU/bf16.ll (+1672-1693) 
- (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fadd.ll (+9-12) 
- (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmax.ll (+6-7) 
- (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmin.ll (+6-7) 
- (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointers-memcpy.ll (+4-4) 
- (modified) llvm/test/CodeGen/AMDGPU/call-argument-types.ll (+15-15) 
- (modified) llvm/test/CodeGen/AMDGPU/carryout-selection.ll (+1-2) 
- (modified) llvm/test/CodeGen/AMDGPU/ctlz_zero_undef.ll (+42-44) 
- (modified) llvm/test/CodeGen/AMDGPU/cttz_zero_undef.ll (+43-45) 
- (modified) llvm/test/CodeGen/AMDGPU/cvt_f32_ubyte.ll (+21-21) 
- (modified) llvm/test/CodeGen/AMDGPU/ds-alignment.ll (+42-42) 
- (modified) llvm/test/CodeGen/AMDGPU/ds_read2.ll (+69-64) 
- (modified) llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.global.ll (+3-3) 
- (modified) llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.private.ll (+3-3) 
- (modified) llvm/test/CodeGen/AMDGPU/fcopysign.f16.ll (+13-11) 
- (modified) llvm/test/CodeGen/AMDGPU/fcopysign.f32.ll (+3-4) 
- (modified) llvm/test/CodeGen/AMDGPU/fdiv.ll (+1-1) 
- (modified) llvm/test/CodeGen/AMDGPU/fmed3.ll (+2-2) 
- (modified) llvm/test/CodeGen/AMDGPU/fneg-modifier-casting.ll (+1-1) 
- (modified) llvm/test/CodeGen/AMDGPU/fp-classify.ll (+1-1) 
- (modified) llvm/test/CodeGen/AMDGPU/freeze.ll (+121-112) 
- (modified) llvm/test/CodeGen/AMDGPU/function-args-inreg.ll (+3-3) 
- (modified) llvm/test/CodeGen/AMDGPU/function-args.ll (+286-212) 
- (modified) llvm/test/CodeGen/AMDGPU/gfx-callable-argument-types.ll (+23-22) 
- (modified) llvm/test/CodeGen/AMDGPU/gfx-callable-return-types.ll (+29-32) 
- (modified) llvm/test/CodeGen/AMDGPU/global_atomics.ll (+4-4) 
- (modified) llvm/test/CodeGen/AMDGPU/half.ll (+1-1) 
- (modified) llvm/test/CodeGen/AMDGPU/i1-to-bf16.ll (+16-16) 
- (modified) llvm/test/CodeGen/AMDGPU/idiv-licm.ll (+3-4) 
- (modified) llvm/test/CodeGen/AMDGPU/idot4s.ll (+9-9) 
- (modified) llvm/test/CodeGen/AMDGPU/idot4u.ll (+12-12) 
- (modified) llvm/test/CodeGen/AMDGPU/indirect-addressing-si.ll (+32-32) 
- (modified) llvm/test/CodeGen/AMDGPU/insert_vector_elt.ll (+18-18) 
- (modified) llvm/test/CodeGen/AMDGPU/integer-mad-patterns.ll (+6-6) 
- (modified) llvm/test/CodeGen/AMDGPU/kernel-args.ll (+9-9) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.global.atomic.ordered.add.b64.ll (+1-1) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.intersect_ray.ll (+10-10) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.inverse.ballot.i64.ll (+2-2) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.permlane64.ptr.ll (+9-12) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.readfirstlane.ll (+8-8) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.writelane.ll (+2-2) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.log.ll (+4-3) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.log10.ll (+4-3) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.maximum.f16.ll (+106-106) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.maximum.f32.ll (+115-115) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.maximum.f64.ll (+290-290) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.minimum.f16.ll (+11-11) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.minimum.f32.ll (+115-115) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.minimum.f64.ll (+290-290) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.round.ll (+3-3) 
- (modified) llvm/test/CodeGen/AMDGPU/load-constant-i1.ll (+20-21) 
- (modified) llvm/test/CodeGen/AMDGPU/load-constant-i16.ll (+85-85) 
- (modified) llvm/test/CodeGen/AMDGPU/load-constant-i32.ll (+6-6) 
- (modified) llvm/test/CodeGen/AMDGPU/load-constant-i8.ll (+18-18) 
- (modified) llvm/test/CodeGen/AMDGPU/load-global-i16.ll (+1-1) 
- (modified) llvm/test/CodeGen/AMDGPU/load-global-i32.ll (+170-171) 
- (modified) llvm/test/CodeGen/AMDGPU/load-local-redundant-copies.ll (+21-21) 
- (modified) llvm/test/CodeGen/AMDGPU/load-local.128.ll (+34-34) 
- (modified) llvm/test/CodeGen/AMDGPU/load-local.96.ll (+25-25) 
- (modified) llvm/test/CodeGen/AMDGPU/lower-buffer-fat-pointers-lastuse-metadata.ll (+8-8) 
- (modified) llvm/test/CodeGen/AMDGPU/lower-buffer-fat-pointers-nontemporal-metadata.ll (+16-16) 
- (modified) llvm/test/CodeGen/AMDGPU/max.i16.ll (+3-3) 
- (modified) llvm/test/CodeGen/AMDGPU/memcpy-libcall.ll (+58-60) 
- (modified) llvm/test/CodeGen/AMDGPU/memcpy-param-combinations.ll (+38-40) 
- (modified) llvm/test/CodeGen/AMDGPU/memintrinsic-unroll.ll (+474-480) 
- (modified) llvm/test/CodeGen/AMDGPU/memmove-param-combinations.ll (+96-109) 
- (modified) llvm/test/CodeGen/AMDGPU/min.ll (+2-2) 
- (modified) llvm/test/CodeGen/AMDGPU/mul.ll (+18-18) 
- (modified) llvm/test/CodeGen/AMDGPU/narrow_math_for_and.ll (+2-2) 
- (modified) llvm/test/CodeGen/AMDGPU/or.ll (+2-2) 
- (modified) llvm/test/CodeGen/AMDGPU/permute_i8.ll (+57-57) 
- (modified) llvm/test/CodeGen/AMDGPU/pr51516.mir (+5-1) 
- (modified) llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll (+73-73) 
- (modified) llvm/test/CodeGen/AMDGPU/repeated-divisor.ll (+2-2) 
- (modified) llvm/test/CodeGen/AMDGPU/sdiv.ll (+96-96) 
- (modified) llvm/test/CodeGen/AMDGPU/select.f16.ll (+168-173) 
- (modified) llvm/test/CodeGen/AMDGPU/shl.ll (+5-5) 
- (modified) llvm/test/CodeGen/AMDGPU/splitkit-getsubrangeformask.ll (+7-7) 
- (modified) llvm/test/CodeGen/AMDGPU/sra.ll (+10-10) 
- (modified) llvm/test/CodeGen/AMDGPU/srem.ll (+6-6) 
- (modified) llvm/test/CodeGen/AMDGPU/srl.ll (+5-5) 
- (modified) llvm/test/CodeGen/AMDGPU/store-local.128.ll (+29-28) 
- (modified) llvm/test/CodeGen/AMDGPU/store-local.96.ll (+15-14) 
- (modified) llvm/test/CodeGen/AMDGPU/sub.ll (+15-15) 
- (modified) llvm/test/CodeGen/AMDGPU/udivrem.ll (+4-4) 
- (modified) llvm/test/CodeGen/PowerPC/p10-fi-elim.ll (+2-2) 
- (modified) llvm/test/CodeGen/RISCV/GlobalISel/wide-scalar-shift-by-byte-multiple-legalization.ll (+34-34) 
- (modified) llvm/test/CodeGen/RISCV/abds-neg.ll (+30-30) 
- (modified) llvm/test/CodeGen/RISCV/abds.ll (+400-400) 
- (modified) llvm/test/CodeGen/RISCV/abdu-neg.ll (+26-26) 
- (modified) llvm/test/CodeGen/RISCV/add-before-shl.ll (+10-10) 
- (modified) llvm/test/CodeGen/RISCV/fold-mem-offset.ll (+8-8) 
- (modified) llvm/test/CodeGen/RISCV/legalize-fneg.ll (+5-5) 
- (modified) llvm/test/CodeGen/RISCV/memcmp-optsize.ll (+42-42) 
- (modified) llvm/test/CodeGen/RISCV/memcmp.ll (+42-42) 
- (modified) llvm/test/CodeGen/RISCV/rv32zbb.ll (+1-1) 
- (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-elen.ll (+17-17) 
- (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-buildvec.ll (+148-148) 
- (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-gather.ll (+5-5) 
- (modified) llvm/test/CodeGen/RISCV/rvv/pr125306.ll (+8-8) 
- (modified) llvm/test/CodeGen/RISCV/scmp.ll (+1-1) 
- (modified) llvm/test/CodeGen/RISCV/srem-vector-lkk.ll (+24-24) 
- (modified) llvm/test/CodeGen/RISCV/ucmp.ll (+1-1) 
- (modified) llvm/test/CodeGen/RISCV/unaligned-load-store.ll (+16-16) 
- (modified) llvm/test/CodeGen/RISCV/urem-vector-lkk.ll (+18-18) 
- (modified) llvm/test/CodeGen/RISCV/vararg.ll (+9-9) 
- (modified) llvm/test/CodeGen/RISCV/wide-scalar-shift-by-byte-multiple-legalization.ll (+359-359) 
- (modified) llvm/test/CodeGen/RISCV/wide-scalar-shift-legalization.ll (+201-201) 
- (modified) llvm/test/CodeGen/RISCV/xtheadmempair.ll (+7-7) 


``````````diff
diff --git a/llvm/include/llvm/CodeGen/MachineScheduler.h b/llvm/include/llvm/CodeGen/MachineScheduler.h
index bc00d0b4ff852..14f3fda90ef6d 100644
--- a/llvm/include/llvm/CodeGen/MachineScheduler.h
+++ b/llvm/include/llvm/CodeGen/MachineScheduler.h
@@ -303,10 +303,6 @@ class ScheduleDAGMI : public ScheduleDAGInstrs {
   /// The bottom of the unscheduled zone.
   MachineBasicBlock::iterator CurrentBottom;
 
-  /// Record the next node in a scheduled cluster.
-  const SUnit *NextClusterPred = nullptr;
-  const SUnit *NextClusterSucc = nullptr;
-
 #if LLVM_ENABLE_ABI_BREAKING_CHECKS
   /// The number of instructions scheduled so far. Used to cut off the
   /// scheduler at the point determined by misched-cutoff.
@@ -367,10 +363,6 @@ class ScheduleDAGMI : public ScheduleDAGInstrs {
   /// live ranges and region boundary iterators.
   void moveInstruction(MachineInstr *MI, MachineBasicBlock::iterator InsertPos);
 
-  const SUnit *getNextClusterPred() const { return NextClusterPred; }
-
-  const SUnit *getNextClusterSucc() const { return NextClusterSucc; }
-
   void viewGraph(const Twine &Name, const Twine &Title) override;
   void viewGraph() override;
 
@@ -1292,6 +1284,9 @@ class GenericScheduler : public GenericSchedulerBase {
   SchedBoundary Top;
   SchedBoundary Bot;
 
+  ClusterInfo *TopCluster;
+  ClusterInfo *BotCluster;
+
   /// Candidate last picked from Top boundary.
   SchedCandidate TopCand;
   /// Candidate last picked from Bot boundary.
@@ -1332,6 +1327,9 @@ class PostGenericScheduler : public GenericSchedulerBase {
   /// Candidate last picked from Bot boundary.
   SchedCandidate BotCand;
 
+  ClusterInfo *TopCluster;
+  ClusterInfo *BotCluster;
+
 public:
   PostGenericScheduler(const MachineSchedContext *C)
       : GenericSchedulerBase(C), Top(SchedBoundary::TopQID, "TopQ"),
diff --git a/llvm/include/llvm/CodeGen/ScheduleDAG.h b/llvm/include/llvm/CodeGen/ScheduleDAG.h
index 1c8d92d149adc..a4301d11a4454 100644
--- a/llvm/include/llvm/CodeGen/ScheduleDAG.h
+++ b/llvm/include/llvm/CodeGen/ScheduleDAG.h
@@ -17,6 +17,7 @@
 
 #include "llvm/ADT/BitVector.h"
 #include "llvm/ADT/PointerIntPair.h"
+#include "llvm/ADT/SmallSet.h"
 #include "llvm/ADT/SmallVector.h"
 #include "llvm/ADT/iterator.h"
 #include "llvm/CodeGen/MachineInstr.h"
@@ -234,6 +235,10 @@ class TargetRegisterInfo;
     void dump(const TargetRegisterInfo *TRI = nullptr) const;
   };
 
+  /// Keep record of which SUnit are in the same cluster group.
+  typedef SmallSet<SUnit *, 8> ClusterInfo;
+  constexpr unsigned InvalidClusterId = ~0u;
+
   /// Scheduling unit. This is a node in the scheduling DAG.
   class SUnit {
   private:
@@ -274,6 +279,8 @@ class TargetRegisterInfo;
     unsigned TopReadyCycle = 0; ///< Cycle relative to start when node is ready.
     unsigned BotReadyCycle = 0; ///< Cycle relative to end when node is ready.
 
+    unsigned ParentClusterIdx = InvalidClusterId; ///< The parent cluster id.
+
   private:
     unsigned Depth = 0;  ///< Node depth.
     unsigned Height = 0; ///< Node height.
diff --git a/llvm/include/llvm/CodeGen/ScheduleDAGInstrs.h b/llvm/include/llvm/CodeGen/ScheduleDAGInstrs.h
index e79b03c57a1e8..6c6bd8015ee69 100644
--- a/llvm/include/llvm/CodeGen/ScheduleDAGInstrs.h
+++ b/llvm/include/llvm/CodeGen/ScheduleDAGInstrs.h
@@ -180,6 +180,8 @@ namespace llvm {
     /// case of a huge region that gets reduced).
     SUnit *BarrierChain = nullptr;
 
+    SmallVector<ClusterInfo> Clusters;
+
   public:
     /// A list of SUnits, used in Value2SUsMap, during DAG construction.
     /// Note: to gain speed it might be worth investigating an optimized
@@ -383,6 +385,14 @@ namespace llvm {
     /// equivalent edge already existed (false indicates failure).
     bool addEdge(SUnit *SuccSU, const SDep &PredDep);
 
+    /// Returns the array of the clusters.
+    SmallVector<ClusterInfo> &getClusters() { return Clusters; }
+
+    /// Get the specific cluster, return nullptr for InvalidClusterId.
+    ClusterInfo *getCluster(unsigned Idx) {
+      return Idx != InvalidClusterId ? &Clusters[Idx] : nullptr;
+    }
+
   protected:
     void initSUnits();
     void addPhysRegDataDeps(SUnit *SU, unsigned OperIdx);
diff --git a/llvm/lib/CodeGen/MachineScheduler.cpp b/llvm/lib/CodeGen/MachineScheduler.cpp
index 0c3ffb1bbaa6f..91da22612eac6 100644
--- a/llvm/lib/CodeGen/MachineScheduler.cpp
+++ b/llvm/lib/CodeGen/MachineScheduler.cpp
@@ -15,6 +15,7 @@
 #include "llvm/ADT/ArrayRef.h"
 #include "llvm/ADT/BitVector.h"
 #include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/EquivalenceClasses.h"
 #include "llvm/ADT/PriorityQueue.h"
 #include "llvm/ADT/STLExtras.h"
 #include "llvm/ADT/SmallVector.h"
@@ -844,8 +845,6 @@ void ScheduleDAGMI::releaseSucc(SUnit *SU, SDep *SuccEdge) {
 
   if (SuccEdge->isWeak()) {
     --SuccSU->WeakPredsLeft;
-    if (SuccEdge->isCluster())
-      NextClusterSucc = SuccSU;
     return;
   }
 #ifndef NDEBUG
@@ -881,8 +880,6 @@ void ScheduleDAGMI::releasePred(SUnit *SU, SDep *PredEdge) {
 
   if (PredEdge->isWeak()) {
     --PredSU->WeakSuccsLeft;
-    if (PredEdge->isCluster())
-      NextClusterPred = PredSU;
     return;
   }
 #ifndef NDEBUG
@@ -1077,11 +1074,8 @@ findRootsAndBiasEdges(SmallVectorImpl<SUnit*> &TopRoots,
 }
 
 /// Identify DAG roots and setup scheduler queues.
-void ScheduleDAGMI::initQueues(ArrayRef<SUnit*> TopRoots,
-                               ArrayRef<SUnit*> BotRoots) {
-  NextClusterSucc = nullptr;
-  NextClusterPred = nullptr;
-
+void ScheduleDAGMI::initQueues(ArrayRef<SUnit *> TopRoots,
+                               ArrayRef<SUnit *> BotRoots) {
   // Release all DAG roots for scheduling, not including EntrySU/ExitSU.
   //
   // Nodes with unreleased weak edges can still be roots.
@@ -2008,6 +2002,7 @@ void BaseMemOpClusterMutation::clusterNeighboringMemOps(
     ScheduleDAGInstrs *DAG) {
   // Keep track of the current cluster length and bytes for each SUnit.
   DenseMap<unsigned, std::pair<unsigned, unsigned>> SUnit2ClusterInfo;
+  EquivalenceClasses<SUnit *> Clusters;
 
   // At this point, `MemOpRecords` array must hold atleast two mem ops. Try to
   // cluster mem ops collected within `MemOpRecords` array.
@@ -2047,6 +2042,7 @@ void BaseMemOpClusterMutation::clusterNeighboringMemOps(
 
     SUnit *SUa = MemOpa.SU;
     SUnit *SUb = MemOpb.SU;
+
     if (!ReorderWhileClustering && SUa->NodeNum > SUb->NodeNum)
       std::swap(SUa, SUb);
 
@@ -2054,6 +2050,7 @@ void BaseMemOpClusterMutation::clusterNeighboringMemOps(
     if (!DAG->addEdge(SUb, SDep(SUa, SDep::Cluster)))
       continue;
 
+    Clusters.unionSets(SUa, SUb);
     LLVM_DEBUG(dbgs() << "Cluster ld/st SU(" << SUa->NodeNum << ") - SU("
                       << SUb->NodeNum << ")\n");
     ++NumClustered;
@@ -2093,6 +2090,21 @@ void BaseMemOpClusterMutation::clusterNeighboringMemOps(
                       << ", Curr cluster bytes: " << CurrentClusterBytes
                       << "\n");
   }
+
+  // Add cluster group information.
+  // Iterate over all of the equivalence sets.
+  auto &AllClusters = DAG->getClusters();
+  for (auto &I : Clusters) {
+    if (!I->isLeader())
+      continue;
+    ClusterInfo Group;
+    unsigned ClusterIdx = AllClusters.size();
+    for (auto *MemberI : Clusters.members(*I)) {
+      MemberI->ParentClusterIdx = ClusterIdx;
+      Group.insert(MemberI);
+    }
+    AllClusters.push_back(Group);
+  }
 }
 
 void BaseMemOpClusterMutation::collectMemOpRecords(
@@ -3456,6 +3468,9 @@ void GenericScheduler::initialize(ScheduleDAGMI *dag) {
   }
   TopCand.SU = nullptr;
   BotCand.SU = nullptr;
+
+  TopCluster = nullptr;
+  BotCluster = nullptr;
 }
 
 /// Initialize the per-region scheduling policy.
@@ -3762,13 +3777,11 @@ bool GenericScheduler::tryCandidate(SchedCandidate &Cand,
   // This is a best effort to set things up for a post-RA pass. Optimizations
   // like generating loads of multiple registers should ideally be done within
   // the scheduler pass by combining the loads during DAG postprocessing.
-  const SUnit *CandNextClusterSU =
-    Cand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  const SUnit *TryCandNextClusterSU =
-    TryCand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  if (tryGreater(TryCand.SU == TryCandNextClusterSU,
-                 Cand.SU == CandNextClusterSU,
-                 TryCand, Cand, Cluster))
+  const ClusterInfo *CandCluster = Cand.AtTop ? TopCluster : BotCluster;
+  const ClusterInfo *TryCandCluster = TryCand.AtTop ? TopCluster : BotCluster;
+  if (tryGreater(TryCandCluster && TryCandCluster->contains(TryCand.SU),
+                 CandCluster && CandCluster->contains(Cand.SU), TryCand, Cand,
+                 Cluster))
     return TryCand.Reason != NoCand;
 
   if (SameBoundary) {
@@ -4015,11 +4028,25 @@ void GenericScheduler::reschedulePhysReg(SUnit *SU, bool isTop) {
 void GenericScheduler::schedNode(SUnit *SU, bool IsTopNode) {
   if (IsTopNode) {
     SU->TopReadyCycle = std::max(SU->TopReadyCycle, Top.getCurrCycle());
+    TopCluster = DAG->getCluster(SU->ParentClusterIdx);
+    LLVM_DEBUG(if (TopCluster) {
+      dbgs() << "  Top Cluster: ";
+      for (auto *N : *TopCluster)
+        dbgs() << N->NodeNum << '\t';
+      dbgs() << "\n";
+    });
     Top.bumpNode(SU);
     if (SU->hasPhysRegUses)
       reschedulePhysReg(SU, true);
   } else {
     SU->BotReadyCycle = std::max(SU->BotReadyCycle, Bot.getCurrCycle());
+    BotCluster = DAG->getCluster(SU->ParentClusterIdx);
+    LLVM_DEBUG(if (BotCluster) {
+      dbgs() << "  Bot Cluster: ";
+      for (auto *N : *BotCluster)
+        dbgs() << N->NodeNum << '\t';
+      dbgs() << "\n";
+    });
     Bot.bumpNode(SU);
     if (SU->hasPhysRegDefs)
       reschedulePhysReg(SU, false);
@@ -4076,6 +4103,8 @@ void PostGenericScheduler::initialize(ScheduleDAGMI *Dag) {
   if (!Bot.HazardRec) {
     Bot.HazardRec = DAG->TII->CreateTargetMIHazardRecognizer(Itin, DAG);
   }
+  TopCluster = nullptr;
+  BotCluster = nullptr;
 }
 
 void PostGenericScheduler::initPolicy(MachineBasicBlock::iterator Begin,
@@ -4137,14 +4166,12 @@ bool PostGenericScheduler::tryCandidate(SchedCandidate &Cand,
     return TryCand.Reason != NoCand;
 
   // Keep clustered nodes together.
-  const SUnit *CandNextClusterSU =
-      Cand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  const SUnit *TryCandNextClusterSU =
-      TryCand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  if (tryGreater(TryCand.SU == TryCandNextClusterSU,
-                 Cand.SU == CandNextClusterSU, TryCand, Cand, Cluster))
+  const ClusterInfo *CandCluster = Cand.AtTop ? TopCluster : BotCluster;
+  const ClusterInfo *TryCandCluster = TryCand.AtTop ? TopCluster : BotCluster;
+  if (tryGreater(TryCandCluster && TryCandCluster->contains(TryCand.SU),
+                 CandCluster && CandCluster->contains(Cand.SU), TryCand, Cand,
+                 Cluster))
     return TryCand.Reason != NoCand;
-
   // Avoid critical resource consumption and balance the schedule.
   if (tryLess(TryCand.ResDelta.CritResources, Cand.ResDelta.CritResources,
               TryCand, Cand, ResourceReduce))
@@ -4329,9 +4356,11 @@ SUnit *PostGenericScheduler::pickNode(bool &IsTopNode) {
 void PostGenericScheduler::schedNode(SUnit *SU, bool IsTopNode) {
   if (IsTopNode) {
     SU->TopReadyCycle = std::max(SU->TopReadyCycle, Top.getCurrCycle());
+    TopCluster = DAG->getCluster(SU->ParentClusterIdx);
     Top.bumpNode(SU);
   } else {
     SU->BotReadyCycle = std::max(SU->BotReadyCycle, Bot.getCurrCycle());
+    BotCluster = DAG->getCluster(SU->ParentClusterIdx);
     Bot.bumpNode(SU);
   }
 }
diff --git a/llvm/lib/CodeGen/MacroFusion.cpp b/llvm/lib/CodeGen/MacroFusion.cpp
index 5bd6ca0978a4b..c614e477a9d8f 100644
--- a/llvm/lib/CodeGen/MacroFusion.cpp
+++ b/llvm/lib/CodeGen/MacroFusion.cpp
@@ -61,6 +61,11 @@ bool llvm::fuseInstructionPair(ScheduleDAGInstrs &DAG, SUnit &FirstSU,
   for (SDep &SI : SecondSU.Preds)
     if (SI.isCluster())
       return false;
+
+  unsigned FirstCluster = FirstSU.ParentClusterIdx;
+  unsigned SecondCluster = SecondSU.ParentClusterIdx;
+  assert(FirstCluster == InvalidClusterId && SecondCluster == InvalidClusterId);
+
   // Though the reachability checks above could be made more generic,
   // perhaps as part of ScheduleDAGInstrs::addEdge(), since such edges are valid,
   // the extra computation cost makes it less interesting in general cases.
@@ -70,6 +75,14 @@ bool llvm::fuseInstructionPair(ScheduleDAGInstrs &DAG, SUnit &FirstSU,
   if (!DAG.addEdge(&SecondSU, SDep(&FirstSU, SDep::Cluster)))
     return false;
 
+  auto &Clusters = DAG.getClusters();
+
+  FirstSU.ParentClusterIdx = Clusters.size();
+  SecondSU.ParentClusterIdx = Clusters.size();
+
+  SmallSet<SUnit *, 8> Cluster{{&FirstSU, &SecondSU}};
+  Clusters.emplace_back(Cluster);
+
   // TODO - If we want to chain more than two instructions, we need to create
   // artifical edges to make dependencies from the FirstSU also dependent
   // on other chained instructions, and other chained instructions also
diff --git a/llvm/lib/CodeGen/ScheduleDAG.cpp b/llvm/lib/CodeGen/ScheduleDAG.cpp
index 26857edd871e2..e630b80e33ab4 100644
--- a/llvm/lib/CodeGen/ScheduleDAG.cpp
+++ b/llvm/lib/CodeGen/ScheduleDAG.cpp
@@ -365,6 +365,9 @@ LLVM_DUMP_METHOD void ScheduleDAG::dumpNodeName(const SUnit &SU) const {
 LLVM_DUMP_METHOD void ScheduleDAG::dumpNodeAll(const SUnit &SU) const {
   dumpNode(SU);
   SU.dumpAttributes();
+  if (SU.ParentClusterIdx != InvalidClusterId)
+    dbgs() << "  Parent Cluster Index: " << SU.ParentClusterIdx << '\n';
+
   if (SU.Preds.size() > 0) {
     dbgs() << "  Predecessors:\n";
     for (const SDep &Dep : SU.Preds) {
diff --git a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
index 5678512748569..6c6c81ab2b4cc 100644
--- a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
@@ -584,12 +584,11 @@ bool GCNMaxILPSchedStrategy::tryCandidate(SchedCandidate &Cand,
   // This is a best effort to set things up for a post-RA pass. Optimizations
   // like generating loads of multiple registers should ideally be done within
   // the scheduler pass by combining the loads during DAG postprocessing.
-  const SUnit *CandNextClusterSU =
-      Cand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  const SUnit *TryCandNextClusterSU =
-      TryCand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  if (tryGreater(TryCand.SU == TryCandNextClusterSU,
-                 Cand.SU == CandNextClusterSU, TryCand, Cand, Cluster))
+  const ClusterInfo *CandCluster = Cand.AtTop ? TopCluster : BotCluster;
+  const ClusterInfo *TryCandCluster = TryCand.AtTop ? TopCluster : BotCluster;
+  if (tryGreater(TryCandCluster && TryCandCluster->contains(TryCand.SU),
+                 CandCluster && CandCluster->contains(Cand.SU), TryCand, Cand,
+                 Cluster))
     return TryCand.Reason != NoCand;
 
   // Avoid increasing the max critical pressure in the scheduled region.
@@ -659,12 +658,11 @@ bool GCNMaxMemoryClauseSchedStrategy::tryCandidate(SchedCandidate &Cand,
 
   // MaxMemoryClause-specific: We prioritize clustered instructions as we would
   // get more benefit from clausing these memory instructions.
-  const SUnit *CandNextClusterSU =
-      Cand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  const SUnit *TryCandNextClusterSU =
-      TryCand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  if (tryGreater(TryCand.SU == TryCandNextClusterSU,
-                 Cand.SU == CandNextClusterSU, TryCand, Cand, Cluster))
+  const ClusterInfo *CandCluster = Cand.AtTop ? TopCluster : BotCluster;
+  const ClusterInfo *TryCandCluster = TryCand.AtTop ? TopCluster : BotCluster;
+  if (tryGreater(TryCandCluster && TryCandCluster->contains(TryCand.SU),
+                 CandCluster && CandCluster->contains(Cand.SU), TryCand, Cand,
+                 Cluster))
     return TryCand.Reason != NoCand;
 
   // We only compare a subset of features when comparing nodes between
diff --git a/llvm/lib/Target/PowerPC/PPCMachineScheduler.cpp b/llvm/lib/Target/PowerPC/PPCMachineScheduler.cpp
index 03712879f7c49..5eb1f0128643d 100644
--- a/llvm/lib/Target/PowerPC/PPCMachineScheduler.cpp
+++ b/llvm/lib/Target/PowerPC/PPCMachineScheduler.cpp
@@ -100,12 +100,11 @@ bool PPCPreRASchedStrategy::tryCandidate(SchedCandidate &Cand,
   // This is a best effort to set things up for a post-RA pass. Optimizations
   // like generating loads of multiple registers should ideally be done within
   // the scheduler pass by combining the loads during DAG postprocessing.
-  const SUnit *CandNextClusterSU =
-      Cand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  const SUnit *TryCandNextClusterSU =
-      TryCand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
-  if (tryGreater(TryCand.SU == TryCandNextClusterSU,
-                 Cand.SU == CandNextClusterSU, TryCand, Cand, Cluster))
+  const ClusterInfo *CandCluster = Cand.AtTop ? TopCluster : BotCluster;
+  const ClusterInfo *TryCandCluster = TryCand.AtTop ? TopCluster : BotCluster;
+  if (tryGreater(TryCandCluster && TryCandCluster->contains(TryCand.SU),
+                 CandCluster && CandCluster->contains(Cand.SU), TryCand, Cand,
+                 Cluster))
     return TryCand.Reason != NoCand;
 
   if (SameBoundary) {
@@ -190,8 +189,11 @@ bool PPCPostRASchedStrategy::tryCandidate(SchedCandidate &Cand,
     return TryCand.Reason != NoCand;
 
   // Keep clustered nodes together.
-  if (tryGreater(TryCand.SU == DAG->getNextClusterSucc(),
-                 Cand.SU == DAG->getNextClusterSucc(), TryCand, Cand, Cluster))
+  const ClusterInfo *CandCluster = Cand.AtTop ? TopCluster : BotCluster;
+  const ClusterInfo *TryCandCluster = TryCand.AtTop ? TopCluster : BotCluster;
+  if (tryGreater(TryCandCluster && TryCandCluster->contains(TryCand.SU),
+                 CandCluster && CandCluster->contains(Cand.SU), TryCand, Cand,
+                 Cluster))
     return TryCand.Reason != NoCand;
 
   // Avoid critical resource consumption and balance the schedule.
diff --git a/llvm/test/CodeGen/AArch64/argument-blocks-array-of-struct.ll b/llvm/test/CodeGen/AArch64/argument-blocks-array-of-struct.ll
index b944194dae8fc..f9176bc9d3fa5 100644
--- a/llvm/test/CodeGen/AArch64/argument-blocks-array-of-struct.ll
+++ b/llvm/test/CodeGen/AArch64/argument-blocks-array-of-struct.ll
@@ -477,9 +477,8 @@ define void @callee_in_memory(%T_IN_MEMORY %a) {
 ; CHECK-NEXT:    add x8, x8, :lo12:in_memory_store
 ; CHECK-NEXT:    ldr d0, [sp, #64]
 ; CHECK-NEXT:    str d0, [x8, #64]
-; CHECK-NEXT:    ldr q0, [sp, #16]
 ; CHECK-NEXT:    str q2, [x8, #48]
-; CHECK-NEXT:    ldr q2, [sp]
+; CHECK-NEXT:    ldp q2, q0, [sp]
 ; CHECK-NEXT:    stp q0, q1, [x8, #16]
 ; CHECK-NEXT:    str q2, [x8]
 ; CHECK-NEXT:    ret
diff --git a/llvm/test/CodeGen/AArch64/arm64-dagcombiner-load-slicing.ll b/llvm/test/CodeGen/AArch64/arm64-dagcombiner-load-slicing.ll
index 7e72e8de01f4f..3bada9d5b3bb4 100644
--- a/llvm/test/CodeGen/AArch64/arm64-dagcombiner-load-slicing.ll
+++ b/llvm/test/CodeGen/AArch64/arm64-dagcombiner-load-slicing.ll
@@ -7,8 +7,8 @@
 
 ; CHECK-LABEL: @test
 ; CHECK: add [[BASE:x[0-9]+]], x0, x1, lsl #3
-; CHECK: ldp [[CPLX1_I:s[0-9]+]], [[CPLX1_R:s[0-9]+]], [[[BASE]]]
-; CHECK: ldp [[CPLX2_I:s[0-9]+]], [[CPLX2_R:s[0-9]+]], [[[BASE]], #64]
+; CHECK-DAG: ldp [[CPLX1_I:s[0-9]+]], [[CPLX1_R:s[0-9]+]], [[[BASE]]]
+; CHECK-DAG: ldp [[CPLX2_I:s[0-9]+]], [[CPLX2_R:s[0-9]+]], [[[BASE]], #64]
 ; CHECK: fadd {{s[0-9]+}}, [[CPLX2_I]], [[CPLX1_I]]
 ; CHECK: fadd {{s[0-9]+}}, [[CPLX2_R]], [[CPLX1_R]]
 ; CHECK: ret
@@ -36,8 +36,8 @@ entry:
 
 ; CHECK-LABEL: @test_int
 ; CHECK: add [[BASE:x[0-9]+]], x0, x1, lsl #3
-; CHECK: ldp [[CPLX1_I:w[0-9]+]], [[CPLX1_R:w[0-9]+]], [[[BASE]]]
-; CHECK: ldp [[CPLX2_I:w[0-9]+]], [[CPLX2_R:w[0-9]+]], [[[BASE]], #64]
+; CHECK-DAG: ldp [[CPLX1_I:w[0-9]+]], [[CPLX1_R:w[0-9]+]], [[[BASE]]]
+; CHECK-DAG: ldp [[CPLX2_I:w[0-9]+]], [[CPLX2_R:w[0-9]+]], [[[BASE]], #64]
 ; CHECK: add {{w[0-9]+}}, [[CPLX2_I]], [[CPLX1_I]]
 ; CHECK: add {{w[0-9]+}}, [[CPLX2_R]], [[CPLX1_R]]
 ; CHECK: ret
@@ -65,8 +65,8 @@ entry:
 
 ; CHECK-LABEL: @test_long
 ; CHECK: add [[BASE:x[0-9]+]], x0, x1, lsl #4
-; CHECK: ldp [[CPLX1_I:x[0-9]+]], [[CPLX1_R:x[0-9]+]], [[[BASE]]]
-; CHECK: ldp [[CPLX2_I:x[0-9]+]], [[CPLX2_R:x[0-9]+]], [[[BASE]], #128]
+; CHEC...
[truncated]

``````````

</details>


https://github.com/llvm/llvm-project/pull/137784


More information about the llvm-commits mailing list