[llvm] [MachineCopyPropagation] Detect and fix suboptimal instruction order to enable optimizations (PR #98087)

via llvm-commits llvm-commits at lists.llvm.org
Mon Jul 8 15:21:11 PDT 2024


llvmbot wrote:


@llvm/pr-subscribers-backend-aarch64
@llvm/pr-subscribers-backend-powerpc

@llvm/pr-subscribers-llvm-regalloc

Author: Gábor Spaits (spaits)

<details>
<summary>Changes</summary>


## The issue
In the Coremark benchmark, the following code can be found in core_state.c's core_bench_state function (https://github.com/eembc/coremark/blob/d5fad6bd094899101a4e5fd53af7298160ced6ab/core_state.c#L61):
```c
for (i = 0; i < NUM_CORE_STATES; i++)
{
    crc = crcu32(final_counts[i], crc);
    crc = crcu32(track_counts[i], crc);
}
```
Here is a self-contained example that also reproduces the issue:
```c
void init_var(int *v);
int chain(int c, int n);

void start() {
    int a, b, c;

    init_var(&a);
    init_var(&b);
    init_var(&c);

    int r = chain(b, a);
    r = chain(c, r);
}
```
https://godbolt.org/z/7h8E9nG5q

Clang produces the following assembly for the section between the two function calls (on AArch64/ARM64):
```asm
bl      _Z5chainii
ldr     w8, [sp, #4]
mov     w1, w0
mov     w0, w8
bl      _Z5chainii
```
While GCC produces this assembly:
```asm
bl      _Z5chainii
mov     w1, w0
ldr     w0, [sp, 28]
bl      _Z5chainii
```
As you can see, GCC does not "shuffle" the values from register to register; it performs the argument swap with only two instructions. I don't think we should move these values around so much.


This problem is also present on RISC-V, where it is even worse.
Clang generates:
```asm
call    _Z5chainii
lw      a1, 0(sp)
mv      a2, a0
mv      a0, a1
mv      a1, a2
call    _Z5chainii
```
GCC generates:
```asm
call    _Z5chainii
mv      a1,a0
lw      a0,12(sp)
call    _Z5chainii
```
See it on Godbolt as well: https://godbolt.org/z/77rncrb3b

The disassembly of Coremark also shows this suboptimal pattern:
```asm
jal	0x8000f596 <crcu32>
lw	a1, 0x24(sp)
mv	a2, a0
mv	a0, a1
mv	a1, a2
jal	0x8000f596 <crcu32>
```

## The cause
The suboptimal code is generated because copy propagation is unable to find the optimization due to the suboptimal order of instructions, which introduces unnecessary data dependencies. (Let's say there is a data dependency between MI A and MI B if there exists a register unit that is used or defined by both A and B.)
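
In case it helps reviewers, this predicate in isolation looks roughly like the following minimal sketch. It mirrors the twoMIsHaveMutualOperandRegisters helper in the diff below; the name haveDataDependency is only for illustration:
```cpp
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/MachineOperand.h"
#include "llvm/CodeGen/TargetRegisterInfo.h"
using namespace llvm;

// Two machine instructions have a data dependency if any register operand
// of one overlaps any register operand of the other.
static bool haveDataDependency(const MachineInstr &A, const MachineInstr &B,
                               const TargetRegisterInfo *TRI) {
  for (const MachineOperand &AOp : A.operands())
    for (const MachineOperand &BOp : B.operands())
      if (AOp.isReg() && BOp.isReg() &&
          TRI->regsOverlap(AOp.getReg(), BOp.getReg()))
        return true;
  return false;
}
```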

The data dependency in the examples above arises because the scheduler places loads as early as possible.
This is usually a good thing; see https://discourse.llvm.org/t/how-to-copy-propagate-physical-register-introduced-before-ra/74828/4.

After creating the current patch and checking the regression tests, I found that the issue is more general than the scheduler prioritizing loads. If you look at the test cases, you can see that the current state of the patch also enables optimizations that have nothing to do with loads, so there are other cases where unnecessary data dependencies block optimizations. (See the changes in llvm/test/CodeGen/RISCV/double-fcmp-strict.ll, llvm/test/CodeGen/AArch64/sve-vector-interleave.ll, and many more improvements not related to loads.)

## The current solution
Since the issue is more general than suboptimal data dependencies introduced by load prioritization, I think adjusting the scheduler is not enough. The issue also affects almost all targets (see the tests).

I think the proper way to solve this issue is to make machine copy propagation
"smarter": in the cases where unnecessary data dependencies block an optimization,
they are recognized and resolved by moving the instructions without changing the semantics.

I have created a new struct that records the data dependencies of instructions in a tree (see the Dependency struct in the diff below).
I have also added logic that uses this tree to find unnecessary data dependencies.

For example, take the following MIR:
```llvm
0. r1 = LOAD sp
1. r2 = COPY r5
2. r3 = ADD r1 r4
3. r1 = COPY r2
```

Let's see what the dependencies are:
Instruction 3 has a dependency on instruction 2, since instruction 2 uses the value in r1 and r1 is redefined by instruction 3. This means that instruction 2 must precede instruction 3.

Instruction 3 also has a dependency on instruction 0: both instructions define r1, so instruction 0 must precede instruction 3.

And finally, instruction 2 has a dependency on instruction 0, since the value that is in r1 at the time instruction 2 executes is put into r1 by instruction 0.

From the above, we can deduce that instruction 0 must precede instruction 2 and instruction 2 must precede instruction 3.

We can also observe that instruction 3 has a dependency on instruction 1. In instructions 1 and 3 we are just moving values around. Could we erase this shuffling and directly do r1 = COPY r5? We can't just do that, since, as seen before, instruction 2 must precede any redefinition of r1. So in the current case this optimization is blocked by instruction 2.

What if we swapped instruction 1 and instruction 2? They have no dependency on each other. Then we get this:
```llvm
0. r1 = LOAD sp
2. r3 = ADD r1 r4
1. r2 = COPY r5 ; instruction 1 and 2 switched places
3. r1 = COPY r2
```

As we can see, the required order of instructions 0, 2 and 3 is still maintained, and the semantics of the program stay the same.
We can also see that the previously blocked optimization has become available, so the machine copy propagation pass can legally make the following modification:
```llvm
0. r1 = LOAD sp
2. r3 = ADD r1 r4
1. r1 = COPY r5
3. r1 = COPY r2
```

Later on it can be recognized that one instruction (instruction 3) is now unnecessary and can be erased, so we end up with more efficient code like this:
```llvm
0. r1 = LOAD sp
2. r3 = ADD r1 r4
1. r1 = COPY r5
```

## Why the current solution is the best
Suboptimal instruction orders that cause copy propagation optimizations to be missed can appear for many different reasons on almost all targets.
For this reason, I think the best way to deal with them is a general, target-independent analysis of the data dependencies between the instructions that are relevant to copy propagation.

## Limitations
Dependency trees can be complicated. An instruction can have multiple dependencies, which can in turn have multiple dependencies, and so on.
These dependency trees can also intersect at different points. Handling fully general trees might be unnecessary, since I don't think there are many cases with such complex data dependencies in MIR. (I have only found one, in llvm/test/CodeGen/RISCV/rvv/fixed-vectors-interleaved-access-zve32x.ll, when checking the MIR during debugging.)

If we want to extend the PR to handle arbitrary trees, then the correct order of the instructions in the dependency tree can be calculated with
an in-order traversal of the dependency graph (plus some merging based on the MI positions in the basic block); see the sketch below.
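
For illustration only, here is a minimal, self-contained sketch of how such an ordering could be computed. This is a hypothetical extension, not part of the patch: it runs a post-order DFS over the MustPrecede edges (equivalent to a topological sort) and leaves out the position-based merging mentioned above.
```cpp
#include <unordered_map>
#include <unordered_set>
#include <vector>

// "Node" stands in for MachineInstr *; Graph maps each instruction to the
// instructions that must precede it (the MustPrecede edges).
using Node = int;
using Graph = std::unordered_map<Node, std::vector<Node>>;

// Emit N after everything that must precede it (post-order DFS).
static void visit(Node N, const Graph &G, std::unordered_set<Node> &Seen,
                  std::vector<Node> &Order) {
  if (!Seen.insert(N).second)
    return; // Already placed; intersecting trees are handled naturally.
  if (auto It = G.find(N); It != G.end())
    for (Node Pred : It->second)
      visit(Pred, G, Seen, Order);
  Order.push_back(N);
}

// Returns an order in which every instruction appears after all of its
// transitive MustPrecede dependencies. Assumes the graph is acyclic,
// which holds here because the edges follow program order.
std::vector<Node> dependencyOrder(const Graph &G) {
  std::unordered_set<Node> Seen;
  std::vector<Node> Order;
  for (const auto &[N, Preds] : G)
    visit(N, G, Seen, Order);
  return Order;
}
```
A real implementation would additionally iterate deterministically and break ties using the recorded MIPosition values, so that instructions without ordering constraints keep their original relative order in the basic block.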

## The current state of the PR
I have tried to check all the tests, but I could not decide whether the code is actually correct everywhere. I think it is correct in most places. I would be really glad if you could also check the tests.

There are some regression tests that are failing. These are hand-written tests whose results cannot be generated by the update_llc_test_checks or update_mir_test_checks scripts. If you see a chance that this PR may get approved and merged, I will fix those tests.

## Conclusion
I would like to ask your opinion on this topic. Is this a good direction?
I enjoy creating stuff like this, so I would be really happy to finish this work if you think it is a good direction and could be merged once it is in a consistent state.

Do you have another solution in mind to enable the optimizations in those cases where unnecessary data dependencies are blocking them?

If you think that this approach is good and I should continue, I will write tests for it and also fix the hand-written tests.

Thank you for checking the PR and giving me feedback.

---

Patch is 250.89 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/98087.diff


69 Files Affected:

- (modified) llvm/lib/CodeGen/MachineCopyPropagation.cpp (+224-14) 
- (modified) llvm/test/CodeGen/AArch64/GlobalISel/arm64-atomic.ll (+1-2) 
- (modified) llvm/test/CodeGen/AArch64/GlobalISel/merge-stores-truncating.ll (+1-2) 
- (modified) llvm/test/CodeGen/AArch64/aarch64-wide-mul.ll (+4-6) 
- (modified) llvm/test/CodeGen/AArch64/addp-shuffle.ll (+1-2) 
- (modified) llvm/test/CodeGen/AArch64/cgp-usubo.ll (+5-10) 
- (added) llvm/test/CodeGen/AArch64/machine-cp.mir (+215) 
- (modified) llvm/test/CodeGen/AArch64/neon-extmul.ll (+4-6) 
- (modified) llvm/test/CodeGen/AArch64/shufflevector.ll (+5-12) 
- (modified) llvm/test/CodeGen/AArch64/streaming-compatible-memory-ops.ll (+2-3) 
- (modified) llvm/test/CodeGen/AArch64/sve-vector-deinterleave.ll (+6-10) 
- (modified) llvm/test/CodeGen/AArch64/sve-vector-interleave.ll (+2-4) 
- (modified) llvm/test/CodeGen/AArch64/vec_umulo.ll (+4-7) 
- (modified) llvm/test/CodeGen/AArch64/vselect-ext.ll (+5-7) 
- (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_pixelshader.ll (+8-8) 
- (modified) llvm/test/CodeGen/AMDGPU/vector_shuffle.packed.ll (+85-70) 
- (modified) llvm/test/CodeGen/LoongArch/tail-calls.ll (+2-3) 
- (modified) llvm/test/CodeGen/LoongArch/vector-fp-imm.ll (+3-6) 
- (modified) llvm/test/CodeGen/Mips/llvm-ir/or.ll (+42-60) 
- (modified) llvm/test/CodeGen/Mips/llvm-ir/xor.ll (+13-18) 
- (modified) llvm/test/CodeGen/PowerPC/BreakableToken-reduced.ll (+2-3) 
- (modified) llvm/test/CodeGen/PowerPC/atomics-i128-ldst.ll (+3-5) 
- (modified) llvm/test/CodeGen/PowerPC/legalize-invert-br_cc.ll (+1-2) 
- (modified) llvm/test/CodeGen/PowerPC/lower-intrinsics-afn-mass_notail.ll (+2-4) 
- (modified) llvm/test/CodeGen/PowerPC/lower-intrinsics-fast-mass_notail.ll (+2-4) 
- (modified) llvm/test/CodeGen/PowerPC/machine-backward-cp.mir (+65-56) 
- (modified) llvm/test/CodeGen/PowerPC/stack-restore-with-setjmp.ll (+1-2) 
- (modified) llvm/test/CodeGen/RISCV/alu64.ll (+1-2) 
- (modified) llvm/test/CodeGen/RISCV/condops.ll (+4-8) 
- (modified) llvm/test/CodeGen/RISCV/double-fcmp-strict.ll (+12-24) 
- (modified) llvm/test/CodeGen/RISCV/float-fcmp-strict.ll (+6-12) 
- (modified) llvm/test/CodeGen/RISCV/half-fcmp-strict.ll (+6-12) 
- (modified) llvm/test/CodeGen/RISCV/llvm.frexp.ll (+6-12) 
- (modified) llvm/test/CodeGen/RISCV/neg-abs.ll (+4-6) 
- (modified) llvm/test/CodeGen/RISCV/rv64-statepoint-call-lowering.ll (+1-2) 
- (modified) llvm/test/CodeGen/RISCV/rvv/constant-folding-crash.ll (+2-4) 
- (modified) llvm/test/CodeGen/RISCV/rvv/vmfeq.ll (+6-12) 
- (modified) llvm/test/CodeGen/RISCV/rvv/vmfge.ll (+6-12) 
- (modified) llvm/test/CodeGen/RISCV/rvv/vmfgt.ll (+6-12) 
- (modified) llvm/test/CodeGen/RISCV/rvv/vmfle.ll (+6-12) 
- (modified) llvm/test/CodeGen/RISCV/rvv/vmflt.ll (+6-12) 
- (modified) llvm/test/CodeGen/RISCV/rvv/vmfne.ll (+6-12) 
- (modified) llvm/test/CodeGen/RISCV/rvv/vmseq.ll (+10-20) 
- (modified) llvm/test/CodeGen/RISCV/rvv/vmsge.ll (+10-20) 
- (modified) llvm/test/CodeGen/RISCV/rvv/vmsgeu.ll (+10-20) 
- (modified) llvm/test/CodeGen/RISCV/rvv/vmsgt.ll (+10-20) 
- (modified) llvm/test/CodeGen/RISCV/rvv/vmsgtu.ll (+10-20) 
- (modified) llvm/test/CodeGen/RISCV/rvv/vmsle.ll (+10-20) 
- (modified) llvm/test/CodeGen/RISCV/rvv/vmsleu.ll (+10-20) 
- (modified) llvm/test/CodeGen/RISCV/rvv/vmslt.ll (+10-20) 
- (modified) llvm/test/CodeGen/RISCV/rvv/vmsltu.ll (+10-20) 
- (modified) llvm/test/CodeGen/RISCV/rvv/vmsne.ll (+10-20) 
- (modified) llvm/test/CodeGen/RISCV/rvv/vsetvli-regression.ll (+1-2) 
- (modified) llvm/test/CodeGen/RISCV/shifts.ll (+2-4) 
- (modified) llvm/test/CodeGen/RISCV/srem-vector-lkk.ll (+11-20) 
- (modified) llvm/test/CodeGen/RISCV/tail-calls.ll (+6-8) 
- (modified) llvm/test/CodeGen/RISCV/urem-vector-lkk.ll (+13-24) 
- (modified) llvm/test/CodeGen/RISCV/wide-scalar-shift-by-byte-multiple-legalization.ll (+2-4) 
- (modified) llvm/test/CodeGen/RISCV/wide-scalar-shift-legalization.ll (+2-4) 
- (modified) llvm/test/CodeGen/SPARC/tailcall.ll (+4-6) 
- (modified) llvm/test/CodeGen/SystemZ/vector-constrained-fp-intrinsics.ll (+116-214) 
- (modified) llvm/test/CodeGen/Thumb2/mve-pred-ext.ll (+2-4) 
- (modified) llvm/test/CodeGen/Thumb2/mve-vmull-splat.ll (-4) 
- (modified) llvm/test/CodeGen/X86/avx512-intrinsics.ll (+2-4) 
- (modified) llvm/test/CodeGen/X86/matrix-multiply.ll (+2-3) 
- (modified) llvm/test/CodeGen/X86/vec_saddo.ll (+5-9) 
- (modified) llvm/test/CodeGen/X86/vec_ssubo.ll (+2-3) 
- (modified) llvm/test/CodeGen/X86/vector-shuffle-combining-avx.ll (+1-2) 
- (modified) llvm/test/CodeGen/X86/xmulo.ll (+42-84) 


``````````diff
diff --git a/llvm/lib/CodeGen/MachineCopyPropagation.cpp b/llvm/lib/CodeGen/MachineCopyPropagation.cpp
index bdc17e99d1fb07..de62f33c67cc4d 100644
--- a/llvm/lib/CodeGen/MachineCopyPropagation.cpp
+++ b/llvm/lib/CodeGen/MachineCopyPropagation.cpp
@@ -49,29 +49,36 @@
 //===----------------------------------------------------------------------===//
 
 #include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/DenseSet.h"
 #include "llvm/ADT/STLExtras.h"
 #include "llvm/ADT/SetVector.h"
 #include "llvm/ADT/SmallSet.h"
 #include "llvm/ADT/SmallVector.h"
 #include "llvm/ADT/Statistic.h"
 #include "llvm/ADT/iterator_range.h"
+#include "llvm/Analysis/BlockFrequencyInfo.h"
 #include "llvm/CodeGen/MachineBasicBlock.h"
 #include "llvm/CodeGen/MachineFunction.h"
 #include "llvm/CodeGen/MachineFunctionPass.h"
 #include "llvm/CodeGen/MachineInstr.h"
 #include "llvm/CodeGen/MachineOperand.h"
 #include "llvm/CodeGen/MachineRegisterInfo.h"
+#include "llvm/CodeGen/Register.h"
 #include "llvm/CodeGen/TargetInstrInfo.h"
 #include "llvm/CodeGen/TargetRegisterInfo.h"
 #include "llvm/CodeGen/TargetSubtargetInfo.h"
 #include "llvm/InitializePasses.h"
+#include "llvm/MC/MCRegister.h"
 #include "llvm/MC/MCRegisterInfo.h"
 #include "llvm/Pass.h"
 #include "llvm/Support/Debug.h"
 #include "llvm/Support/DebugCounter.h"
+#include "llvm/Support/ErrorHandling.h"
 #include "llvm/Support/raw_ostream.h"
+#include <algorithm>
 #include <cassert>
 #include <iterator>
+#include <utility>
 
 using namespace llvm;
 
@@ -105,7 +112,51 @@ static std::optional<DestSourcePair> isCopyInstr(const MachineInstr &MI,
   return std::nullopt;
 }
 
+static bool hasOverlaps(const MachineInstr &MI, Register &Reg,
+                        const TargetRegisterInfo *TRI) {
+  for (const MachineOperand &MO : MI.operands())
+    if (MO.isReg() && TRI->regsOverlap(Reg, MO.getReg()))
+      return true;
+  return false;
+}
+
+static bool twoMIsHaveMutualOperandRegisters(const MachineInstr &MI1,
+                                             const MachineInstr &MI2,
+                                             const TargetRegisterInfo *TRI) {
+  for (auto MI1OP : MI1.operands()) {
+    for (auto MI2OP : MI2.operands()) {
+      if (MI1OP.isReg() && MI2OP.isReg() &&
+          TRI->regsOverlap(MI1OP.getReg(), MI2OP.getReg())) {
+        return true;
+      }
+    }
+  }
+  return false;
+}
+
 class CopyTracker {
+  // When conducting backward copy propagation we may need to move
+  // instructions that are in the dependency tree; in that case the relative
+  // order of the moved instructions must not change. This variable tracks
+  // where we are in the basic block so the original relative order can be
+  // preserved.
+  int Time = 0;
+
+  // A tree representing dependencies between instructions.
+  struct Dependency {
+    Dependency() = default;
+    Dependency(MachineInstr *MI) : MI(MI) {}
+    MachineInstr *MI;
+
+    // When the MI appears. Used to keep the MIs' relative order to each
+    // other.
+    int MIPosition = -1;
+
+    // The children in the dependency tree: the instructions that share at
+    // least one register with the current instruction.
+    llvm::SmallVector<MachineInstr *> MustPrecede;
+    bool AnythingMustPrecede = false;
+  };
+
   struct CopyInfo {
     MachineInstr *MI, *LastSeenUseInCopy;
     SmallVector<MCRegister, 4> DefRegs;
@@ -113,9 +164,10 @@ class CopyTracker {
   };
 
   DenseMap<MCRegister, CopyInfo> Copies;
+  DenseMap<MachineInstr *, Dependency> Dependencies;
 
 public:
-  /// Mark all of the given registers and their subregisters as unavailable for
+  /// Mark all of the given registers and their sub registers as unavailable for
   /// copying.
   void markRegsUnavailable(ArrayRef<MCRegister> Regs,
                            const TargetRegisterInfo &TRI) {
@@ -129,6 +181,27 @@ class CopyTracker {
     }
   }
 
+  void moveBAfterA(MachineInstr *A, MachineInstr *B) {
+    llvm::MachineBasicBlock *MBB = A->getParent();
+    assert(MBB == B->getParent() &&
+           "Both instructions must be in the same MachineBasicBlock");
+    assert(Dependencies.contains(B) &&
+           "Shall contain the instruction that blocks the optimization.");
+    Dependencies[B].MIPosition = Time++;
+    MBB->remove(B);
+    MBB->insertAfter(--A->getIterator(), B);
+  }
+
+  // Add a new Dependency to the already existing ones.
+  void addDependency(Dependency B) {
+    if (Dependencies.contains(B.MI))
+      return;
+
+    B.MIPosition = Time++;
+    Dependencies.insert({B.MI, B});
+  }
+
+  /// Only called for backward propagation.
   /// Remove register from copy maps.
   void invalidateRegister(MCRegister Reg, const TargetRegisterInfo &TRI,
                           const TargetInstrInfo &TII, bool UseCopyInstr) {
@@ -222,6 +295,95 @@ class CopyTracker {
     }
   }
 
+  void setDependenciesForMI(MachineInstr *MI, const TargetRegisterInfo &TRI,
+                            const TargetInstrInfo &TII, bool UseCopyInstr) {
+    bool Blocks = !Dependencies.contains(MI);
+    Dependency b{MI};
+    Dependency *CurrentBlocker = nullptr;
+    if (!Blocks) {
+      CurrentBlocker = &Dependencies[MI];
+    } else {
+      CurrentBlocker = &b;
+    }
+
+    for (const MachineOperand &Operand : MI->operands()) {
+      if (!Operand.isReg())
+        continue;
+
+      Register OpReg = Operand.getReg();
+      MCRegister OpMCReg = OpReg.asMCReg();
+      if (!OpMCReg)
+        continue;
+
+      // Invalidate those copies that are affected by the definition or usage of
+      // the current MI.
+      for (MCRegUnit UsedOPMcRegUnit : TRI.regunits(OpMCReg)) {
+        auto CopyThatDependsOnIt = Copies.find(UsedOPMcRegUnit);
+        // Do not take debug usages into account.
+        if (CopyThatDependsOnIt != Copies.end() &&
+            !(Operand.isUse() && Operand.isDebug())) {
+          Copies.erase(CopyThatDependsOnIt);
+        }
+      }
+
+      for (std::pair<MachineInstr *, Dependency> &Dep : Dependencies) {
+        assert(
+            Dep.first == Dep.second.MI &&
+            "Inconsistent state: The key and MI of a blocker do not match\n");
+        MachineInstr *DepMI = Dep.first;
+        if (DepMI == MI)
+          continue;
+        if (Operand.isUse() && Operand.isDebug())
+          continue;
+        if (!hasOverlaps(*DepMI, OpReg, &TRI))
+          continue;
+
+        // The current instruction precedes the instruction in the dependency
+        // tree.
+        if (CurrentBlocker->MIPosition == -1 ||
+            Dep.second.MIPosition < CurrentBlocker->MIPosition) {
+          Dep.second.MustPrecede.push_back(MI);
+          Dep.second.AnythingMustPrecede = true;
+          continue;
+        }
+
+        // The current instruction is preceded by the instruction in the
+        // dependency tree. This can happen when other instructions are moved
+        // before the currently analyzed one to optimize data dependencies.
+        if (CurrentBlocker->MIPosition != -1 &&
+            Dep.second.MIPosition > CurrentBlocker->MIPosition) {
+          CurrentBlocker->MustPrecede.push_back(DepMI);
+          CurrentBlocker->AnythingMustPrecede = true;
+          continue;
+        }
+      }
+    }
+
+    if (Blocks) {
+      addDependency(*CurrentBlocker);
+    }
+  }
+
+  std::optional<llvm::SmallVector<MachineInstr *>>
+  getFirstPreviousDependencies(MachineInstr *MI,
+                               const TargetRegisterInfo &TRI) {
+    if (Dependencies.contains(MI)) {
+      auto PrevDefs = Dependencies[MI].MustPrecede;
+      if (std::all_of(PrevDefs.begin(), PrevDefs.end(), [&](auto OneUse) {
+            if (!OneUse) {
+              return false;
+            }
+            return !(Dependencies[OneUse].AnythingMustPrecede) &&
+                   (Dependencies[OneUse].MustPrecede.size() == 0);
+          })) {
+
+        return Dependencies[MI].MustPrecede;
+      }
+      return {};
+    }
+    return {{}};
+  }
+
   /// Add this copy's registers into the tracker's copy maps.
   void trackCopy(MachineInstr *MI, const TargetRegisterInfo &TRI,
                  const TargetInstrInfo &TII, bool UseCopyInstr) {
@@ -245,6 +407,9 @@ class CopyTracker {
         Copy.DefRegs.push_back(Def);
       Copy.LastSeenUseInCopy = MI;
     }
+    
+    Dependency b{MI};
+    addDependency(b);
   }
 
   bool hasAnyCopies() {
@@ -263,12 +428,16 @@ class CopyTracker {
   }
 
   MachineInstr *findCopyDefViaUnit(MCRegister RegUnit,
-                                   const TargetRegisterInfo &TRI) {
+                                   const TargetRegisterInfo &TRI,
+                                   bool CanUseLastSeenInCopy = false) {
     auto CI = Copies.find(RegUnit);
     if (CI == Copies.end())
       return nullptr;
     if (CI->second.DefRegs.size() != 1)
       return nullptr;
+    if (CanUseLastSeenInCopy)
+      return CI->second.LastSeenUseInCopy;
+
     MCRegUnit RU = *TRI.regunits(CI->second.DefRegs[0]).begin();
     return findCopyForUnit(RU, TRI, true);
   }
@@ -278,7 +447,7 @@ class CopyTracker {
                                       const TargetInstrInfo &TII,
                                       bool UseCopyInstr) {
     MCRegUnit RU = *TRI.regunits(Reg).begin();
-    MachineInstr *AvailCopy = findCopyDefViaUnit(RU, TRI);
+    MachineInstr *AvailCopy = findCopyDefViaUnit(RU, TRI, true);
 
     if (!AvailCopy)
       return nullptr;
@@ -376,6 +545,7 @@ class CopyTracker {
 
   void clear() {
     Copies.clear();
+    Dependencies.clear();
   }
 };
 
@@ -1029,6 +1199,50 @@ void MachineCopyPropagation::propagateDefs(MachineInstr &MI) {
     if (hasOverlappingMultipleDef(MI, MODef, Def))
       continue;
 
+    LLVM_DEBUG(dbgs() << "Backward copy was found\n");
+    LLVM_DEBUG(Copy->dump());
+
+    // Let's see whether the copy has any previous dependencies that have no
+    // further dependencies themselves. (So the dependency tree is one level
+    // deep.)
+    auto PreviousDependencies =
+        Tracker.getFirstPreviousDependencies(Copy, *TRI);
+
+    if (!PreviousDependencies)
+      // The dependency tree is more than one level deep.
+      continue;
+
+    LLVM_DEBUG(dbgs() << "Number of dependencies of the copy: "
+                      << PreviousDependencies->size() << "\n");
+    
+    // Check whether the current MI shares any registers with the
+    // optimization-blocking instructions.
+    bool NoDependencyWithMI =
+        std::all_of(PreviousDependencies->begin(), PreviousDependencies->end(),
+                    [&](auto *MI1) {
+                      return !twoMIsHaveMutualOperandRegisters(*MI1, MI, TRI);
+                    });
+    if (!NoDependencyWithMI)
+      continue;
+    
+    // If there isn't any relationship between the blockers and the current MI
+    // then we can start moving them.
+
+    // Add the new instruction to the list of dependencies.
+    Tracker.addDependency({&MI});
+
+    // Then move the previous instructions before the current one.
+    // Later the dependency analysis will run for the current MI; there
+    // the relationship between these moved instructions and the current
+    // MI is handled.
+    for (llvm::MachineInstr *I : llvm::reverse(*PreviousDependencies)) {
+      LLVM_DEBUG(dbgs() << "Moving ");
+      LLVM_DEBUG(I->dump());
+      LLVM_DEBUG(dbgs() << "Before ");
+      LLVM_DEBUG(MI.dump());
+      Tracker.moveBAfterA(&MI, I);
+    }
+
     LLVM_DEBUG(dbgs() << "MCP: Replacing " << printReg(MODef.getReg(), TRI)
                       << "\n     with " << printReg(Def, TRI) << "\n     in "
                       << MI << "     from " << *Copy);
@@ -1036,6 +1250,10 @@ void MachineCopyPropagation::propagateDefs(MachineInstr &MI) {
     MODef.setReg(Def);
     MODef.setIsRenamable(CopyOperands->Destination->isRenamable());
 
+    Tracker.invalidateRegister(MODef.getReg().asMCReg(), *TRI, *TII,
+                               UseCopyInstr);
+    Tracker.invalidateRegister(Def, *TRI, *TII, UseCopyInstr);
+
     LLVM_DEBUG(dbgs() << "MCP: After replacement: " << MI << "\n");
     MaybeDeadCopies.insert(Copy);
     Changed = true;
@@ -1060,10 +1278,7 @@ void MachineCopyPropagation::BackwardCopyPropagateBlock(
         // Unlike forward cp, we don't invoke propagateDefs here,
         // just let forward cp do COPY-to-COPY propagation.
         if (isBackwardPropagatableCopy(*CopyOperands, *MRI)) {
-          Tracker.invalidateRegister(SrcReg.asMCReg(), *TRI, *TII,
-                                     UseCopyInstr);
-          Tracker.invalidateRegister(DefReg.asMCReg(), *TRI, *TII,
-                                     UseCopyInstr);
+          Tracker.setDependenciesForMI(&MI, *TRI, *TII, UseCopyInstr); // Maybe it is bad to call it here
           Tracker.trackCopy(&MI, *TRI, *TII, UseCopyInstr);
           continue;
         }
@@ -1080,6 +1295,8 @@ void MachineCopyPropagation::BackwardCopyPropagateBlock(
       }
 
     propagateDefs(MI);
+    Tracker.setDependenciesForMI(&MI, *TRI, *TII, UseCopyInstr);
+
     for (const MachineOperand &MO : MI.operands()) {
       if (!MO.isReg())
         continue;
@@ -1087,10 +1304,6 @@ void MachineCopyPropagation::BackwardCopyPropagateBlock(
       if (!MO.getReg())
         continue;
 
-      if (MO.isDef())
-        Tracker.invalidateRegister(MO.getReg().asMCReg(), *TRI, *TII,
-                                   UseCopyInstr);
-
       if (MO.readsReg()) {
         if (MO.isDebug()) {
           //  Check if the register in the debug instruction is utilized
@@ -1101,9 +1314,6 @@ void MachineCopyPropagation::BackwardCopyPropagateBlock(
               CopyDbgUsers[Copy].insert(&MI);
             }
           }
-        } else {
-          Tracker.invalidateRegister(MO.getReg().asMCReg(), *TRI, *TII,
-                                     UseCopyInstr);
         }
       }
     }
diff --git a/llvm/test/CodeGen/AArch64/GlobalISel/arm64-atomic.ll b/llvm/test/CodeGen/AArch64/GlobalISel/arm64-atomic.ll
index b619aac709d985..21cc2e0e570776 100644
--- a/llvm/test/CodeGen/AArch64/GlobalISel/arm64-atomic.ll
+++ b/llvm/test/CodeGen/AArch64/GlobalISel/arm64-atomic.ll
@@ -106,11 +106,10 @@ define i32 @val_compare_and_swap_from_load(ptr %p, i32 %cmp, ptr %pnew) #0 {
 ; CHECK-OUTLINE-O1-LABEL: val_compare_and_swap_from_load:
 ; CHECK-OUTLINE-O1:       ; %bb.0:
 ; CHECK-OUTLINE-O1-NEXT:    stp x29, x30, [sp, #-16]! ; 16-byte Folded Spill
-; CHECK-OUTLINE-O1-NEXT:    ldr w8, [x2]
 ; CHECK-OUTLINE-O1-NEXT:    mov x3, x0
 ; CHECK-OUTLINE-O1-NEXT:    mov w0, w1
+; CHECK-OUTLINE-O1-NEXT:    ldr w1, [x2]
 ; CHECK-OUTLINE-O1-NEXT:    mov x2, x3
-; CHECK-OUTLINE-O1-NEXT:    mov w1, w8
 ; CHECK-OUTLINE-O1-NEXT:    bl ___aarch64_cas4_acq
 ; CHECK-OUTLINE-O1-NEXT:    ldp x29, x30, [sp], #16 ; 16-byte Folded Reload
 ; CHECK-OUTLINE-O1-NEXT:    ret
diff --git a/llvm/test/CodeGen/AArch64/GlobalISel/merge-stores-truncating.ll b/llvm/test/CodeGen/AArch64/GlobalISel/merge-stores-truncating.ll
index 7fd71b26fa1ba7..c8f8361e5ef885 100644
--- a/llvm/test/CodeGen/AArch64/GlobalISel/merge-stores-truncating.ll
+++ b/llvm/test/CodeGen/AArch64/GlobalISel/merge-stores-truncating.ll
@@ -256,9 +256,8 @@ define dso_local i32 @load_between_stores(i32 %x, ptr %p, ptr %ptr) {
 ; CHECK:       ; %bb.0:
 ; CHECK-NEXT:    strh w0, [x1]
 ; CHECK-NEXT:    lsr w9, w0, #16
-; CHECK-NEXT:    ldr w8, [x2]
+; CHECK-NEXT:    ldr w0, [x2]
 ; CHECK-NEXT:    strh w9, [x1, #2]
-; CHECK-NEXT:    mov w0, w8
 ; CHECK-NEXT:    ret
   %t1 = trunc i32 %x to i16
   %sh = lshr i32 %x, 16
diff --git a/llvm/test/CodeGen/AArch64/aarch64-wide-mul.ll b/llvm/test/CodeGen/AArch64/aarch64-wide-mul.ll
index 410c2d9021d6d5..a150a0f6ee40a2 100644
--- a/llvm/test/CodeGen/AArch64/aarch64-wide-mul.ll
+++ b/llvm/test/CodeGen/AArch64/aarch64-wide-mul.ll
@@ -131,13 +131,12 @@ entry:
 define <16 x i32> @mla_i32(<16 x i8> %a, <16 x i8> %b, <16 x i32> %c) {
 ; CHECK-SD-LABEL: mla_i32:
 ; CHECK-SD:       // %bb.0: // %entry
-; CHECK-SD-NEXT:    umull2 v7.8h, v0.16b, v1.16b
 ; CHECK-SD-NEXT:    umull v6.8h, v0.8b, v1.8b
-; CHECK-SD-NEXT:    uaddw2 v5.4s, v5.4s, v7.8h
+; CHECK-SD-NEXT:    umull2 v7.8h, v0.16b, v1.16b
 ; CHECK-SD-NEXT:    uaddw v0.4s, v2.4s, v6.4h
 ; CHECK-SD-NEXT:    uaddw2 v1.4s, v3.4s, v6.8h
+; CHECK-SD-NEXT:    uaddw2 v3.4s, v5.4s, v7.8h
 ; CHECK-SD-NEXT:    uaddw v2.4s, v4.4s, v7.4h
-; CHECK-SD-NEXT:    mov v3.16b, v5.16b
 ; CHECK-SD-NEXT:    ret
 ;
 ; CHECK-GI-LABEL: mla_i32:
@@ -170,18 +169,17 @@ define <16 x i64> @mla_i64(<16 x i8> %a, <16 x i8> %b, <16 x i64> %c) {
 ; CHECK-SD-NEXT:    umull2 v0.8h, v0.16b, v1.16b
 ; CHECK-SD-NEXT:    ldp q20, q21, [sp]
 ; CHECK-SD-NEXT:    ushll v17.4s, v16.4h, #0
+; CHECK-SD-NEXT:    ushll v18.4s, v0.4h, #0
 ; CHECK-SD-NEXT:    ushll2 v16.4s, v16.8h, #0
 ; CHECK-SD-NEXT:    ushll2 v19.4s, v0.8h, #0
-; CHECK-SD-NEXT:    ushll v18.4s, v0.4h, #0
 ; CHECK-SD-NEXT:    uaddw2 v1.2d, v3.2d, v17.4s
 ; CHECK-SD-NEXT:    uaddw v0.2d, v2.2d, v17.2s
 ; CHECK-SD-NEXT:    uaddw2 v3.2d, v5.2d, v16.4s
 ; CHECK-SD-NEXT:    uaddw v2.2d, v4.2d, v16.2s
-; CHECK-SD-NEXT:    uaddw2 v16.2d, v21.2d, v19.4s
 ; CHECK-SD-NEXT:    uaddw v4.2d, v6.2d, v18.2s
 ; CHECK-SD-NEXT:    uaddw2 v5.2d, v7.2d, v18.4s
+; CHECK-SD-NEXT:    uaddw2 v7.2d, v21.2d, v19.4s
 ; CHECK-SD-NEXT:    uaddw v6.2d, v20.2d, v19.2s
-; CHECK-SD-NEXT:    mov v7.16b, v16.16b
 ; CHECK-SD-NEXT:    ret
 ;
 ; CHECK-GI-LABEL: mla_i64:
diff --git a/llvm/test/CodeGen/AArch64/addp-shuffle.ll b/llvm/test/CodeGen/AArch64/addp-shuffle.ll
index fb96d11acc275a..3dd6068ea3c227 100644
--- a/llvm/test/CodeGen/AArch64/addp-shuffle.ll
+++ b/llvm/test/CodeGen/AArch64/addp-shuffle.ll
@@ -63,9 +63,8 @@ define <16 x i8> @deinterleave_shuffle_v32i8(<32 x i8> %a) {
 define <4 x i64> @deinterleave_shuffle_v8i64(<8 x i64> %a) {
 ; CHECK-LABEL: deinterleave_shuffle_v8i64:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    addp v2.2d, v2.2d, v3.2d
 ; CHECK-NEXT:    addp v0.2d, v0.2d, v1.2d
-; CHECK-NEXT:    mov v1.16b, v2.16b
+; CHECK-NEXT:    addp v1.2d, v2.2d, v3.2d
 ; CHECK-NEXT:    ret
   %r0 = shufflevector <8 x i64> %a, <8 x i64> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
   %r1 = shufflevector <8 x i64> %a, <8 x i64> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
diff --git a/llvm/test/CodeGen/AArch64/cgp-usubo.ll b/llvm/test/CodeGen/AArch64/cgp-usubo.ll
index d307107fc07ee6..c8b73625086644 100644
--- a/llvm/test/CodeGen/AArch64/cgp-usubo.ll
+++ b/llvm/test/CodeGen/AArch64/cgp-usubo.ll
@@ -40,9 +40,8 @@ define i1 @usubo_ugt_constant_op0_i8(i8 %x, ptr %p) nounwind {
 ; CHECK-NEXT:    mov w9, #42 // =0x2a
 ; CHECK-NEXT:    cmp w8, #42
 ; CHECK-NEXT:    sub w9, w9, w0
-; CHECK-NEXT:    cset w8, hi
+; CHECK-NEXT:    cset w0, hi
 ; CHECK-NEXT:    strb w9, [x1]
-; CHECK-NEXT:    mov w0, w8
 ; CHECK-NEXT:    ret
   %s = sub i8 42, %x
   %ov = icmp ugt i8 %x, 42
@@ -59,9 +58,8 @@ define i1 @usubo_ult_constant_op0_i16(i16 %x, ptr %p) nounwind {
 ; CHECK-NEXT:    mov w9, #43 // =0x2b
 ; CHECK-NEXT:    cmp w8, #43
 ; CHECK-NEXT:    sub w9, w9, w0
-; CHECK-NEXT:    cset w8, hi
+; CHECK-NEXT:    cset w0, hi
 ; CHECK-NEXT:    strh w9, [x1]
-; CHECK-NEXT:    mov w0, w8
 ; CHECK-NEXT:    ret
   %s = sub i16 43, %x
   %ov = icmp ult i16 43, %x
@@ -78,8 +76,7 @@ define i1 @usubo_ult_constant_op1_i16(i16 %x, ptr %p) nounwind {
 ; CHECK-NEXT:    sub w9, w0, #44
 ; CHECK-NEXT:    cmp w8, #44
 ; CHECK-NEXT:    strh w9, [x1]
-; CHECK-NEXT:    cset w8, lo
-; CHECK-NEXT:    mov w0, w8
+; CHECK-NEXT:    cset w0, lo
 ; CHECK-NEXT:    ret
   %s = add i16 %x, -44
   %ov = icmp ult i16 %x, 44
@@ -94,8 +91,7 @@ define i1 @usubo_ugt_constant_op1_i8(i8 %x, ptr %p) nounwind {
 ; CHECK-NEXT:    sub w9, w0, #45
 ; CHECK-NEXT:    cmp w8, #45
 ; CHECK-NEXT:    strb w9, [x1]
-; CHECK-NEXT:    cset w8, lo
-; CHECK-NEXT:    mov w0, w8
+; CHECK-NEXT:    cset w0, lo
 ; CHECK-NEXT:    ret
   %ov = icmp ugt i8 45, %x
   %s = add i8 %x, -45
@@ -110,9 +106,8 @@ define i1 @usubo_eq_constant1_op1_i32(i32 %x, ptr %p) nounwind {
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    cmp w0, #0
 ; CHECK-NEXT:    sub w9, w0, #1
-; CHECK-NEXT:    cset w8, eq
+; CHECK-NEXT:    cset w0, eq
 ; CHECK-NEXT:    str w9, [x1]
-; CHECK-NEXT:    mov w0, w8
 ; CHECK-NEXT:    ret
   %s = add i32 %x, -1
   %ov = icmp eq i32 %x, 0
diff --git a/llvm/test/CodeGen/AArch64/machine-cp.mir b/llvm/test/CodeGen/AArch64/machine-cp.mir
new file mode 100644
index 00000000000000..de57627d82c57c
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/machine-cp.mir
@@ -0,0 +1,215 @@
+# NOTE: Assertions have been autogene...
[truncated]

``````````

</details>


https://github.com/llvm/llvm-project/pull/98087


More information about the llvm-commits mailing list