[llvm-branch-commits] [llvm] PeepholeOpt: Fix looking for def of current copy to coalesce (PR #125533)

Matt Arsenault via llvm-branch-commits llvm-branch-commits at lists.llvm.org
Mon Feb 3 09:14:06 PST 2025


https://github.com/arsenm created https://github.com/llvm/llvm-project/pull/125533

This fixes the handling of subregister extract copies. This
will allow AMDGPU to remove its implementation of
shouldRewriteCopySrc, which exists as a 10-year-old workaround
for this bug. peephole-opt-fold-reg-sequence-subreg.mir will
show the expected improvement once the custom implementation
is removed.

The copy coalescing processing here is overly abstracted
from what is actually happening. Previously, when visiting
coalescable copy-like instructions, we would parse the
sources one at a time and then pass the def of the root
instruction into findNextSource. This meant that the first
thing the newly constructed ValueTracker would do is call
getVRegDef to find the instruction we are currently
processing. This added an unnecessary step, placed a useless
entry in the RewriteMap, and required skipping the no-op case
where getNewSource would return the original source operand.
This was a problem because, for a subregister extract,
shouldRewriteCopySrc would always report the rewrite as
useful and the use-def chain walk would abort, returning the
original operand. Instead, start the walk at the source
operand in the first place.
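
As a rough illustration, the affected pattern looks like this in
hypothetical MIR (the register classes and virtual register numbers
are purely illustrative, in the spirit of the new
peephole-opt-fold-reg-sequence-subreg.mir test):

  %0:vreg_64 = REG_SEQUENCE %1:vgpr_32, %subreg.sub0, %2:vgpr_32, %subreg.sub1
  %3:vgpr_32 = COPY %0.sub1

Previously the walk started at the def %3, immediately rediscovered
the COPY itself, and stopped at the original %0.sub1 operand, so the
copy was never rewritten to read %2 directly. Starting at the source
operand lets the walk look through the REG_SEQUENCE.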

This does not fix the confused handling in the uncoalescable
copy case, which is proving to be more difficult. Some currently
handled cases have multiple defs from a single source, and other
handled cases have 0 input operands. It would be simpler if
this were implemented with isCopyLikeInstr rather than guessing
at the operand structure as it does now.

There are some improvements and some regressions. The
regressions appear to be downstream issues for the most part. One
of the uglier regressions is in PPC, where a sequence of insert_subregs
is used to build registers. I opened #125502 to use reg_sequence instead,
which may help.
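
To illustrate the difference (generic, made-up MIR; the register class
names are placeholders rather than the actual PPC classes):

  ; building a wide register with a chain of insert_subregs
  %acc0:bigrc = IMPLICIT_DEF
  %acc1:bigrc = INSERT_SUBREG %acc0, %lo:smallrc, %subreg.sub0
  %acc2:bigrc = INSERT_SUBREG %acc1, %hi:smallrc, %subreg.sub1

  ; the same value built with a single reg_sequence
  %acc2:bigrc = REG_SEQUENCE %lo:smallrc, %subreg.sub0, %hi:smallrc, %subreg.sub1

The reg_sequence form exposes all of the inputs in a single
instruction, which should be easier for the use-def walk to look
through than a chain of partial defs.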

The worst regression is an absurd SPARC testcase using a <251 x fp128>,
which results in a very long chain of insert_subregs.

We need improved subregister handling locally in PeepholeOptimizer,
and in other passes like MachineCSE, to fix some of the other regressions.
We should handle subregister composes and fold more indexes
into insert_subreg and reg_sequence.
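
As a sketch of the subregister composition case (again hypothetical
MIR with illustrative register classes), forwarding the copy below
requires composing the extracted index with the index used by the
reg_sequence to conclude that %0.sub2 is really %2.sub0:

  %0:vreg_128 = REG_SEQUENCE %1:vreg_64, %subreg.sub0_sub1, %2:vreg_64, %subreg.sub2_sub3
  %3:vgpr_32 = COPY %0.sub2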

From a897dcf826c891f161b44efbf44ee376adf263bc Mon Sep 17 00:00:00 2001
From: Matt Arsenault <Matthew.Arsenault at amd.com>
Date: Mon, 3 Feb 2025 09:25:49 +0700
Subject: [PATCH] PeepholeOpt: Fix looking for def of current copy to coalesce

This fixes the handling of subregister extract copies. This
will allow AMDGPU to remove its implementation of
shouldRewriteCopySrc, which exists as a 10-year-old workaround
for this bug. peephole-opt-fold-reg-sequence-subreg.mir will
show the expected improvement once the custom implementation
is removed.

The copy coalescing processing here is overly abstracted
from what is actually happening. Previously, when visiting
coalescable copy-like instructions, we would parse the
sources one at a time and then pass the def of the root
instruction into findNextSource. This meant that the first
thing the newly constructed ValueTracker would do is call
getVRegDef to find the instruction we are currently
processing. This added an unnecessary step, placed a useless
entry in the RewriteMap, and required skipping the no-op case
where getNewSource would return the original source operand.
This was a problem because, for a subregister extract,
shouldRewriteCopySrc would always report the rewrite as
useful and the use-def chain walk would abort, returning the
original operand. Instead, start the walk at the source
operand in the first place.

This does not fix the confused handling in the uncoalescable
copy case, which is proving to be more difficult. Some currently
handled cases have multiple defs from a single source, and other
handled cases have 0 input operands. It would be simpler if
this were implemented with isCopyLikeInstr rather than guessing
at the operand structure as it does now.

There are some improvements and some regressions. The
regressions appear to be downstream issues for the most part. One
of the uglier regressions is in PPC, where a sequence of insert_subregs
is used to build registers. I opened #125502 to use reg_sequence instead,
which may help.

The worst regression is an absurd SPARC testcase using a <251 x fp128>,
which results in a very long chain of insert_subregs.

We need improved subregister handling locally in PeepholeOptimizer,
and in other passes like MachineCSE, to fix some of the other regressions.
We should handle subregister composes and fold more indexes
into insert_subreg and reg_sequence.
---
 llvm/lib/CodeGen/PeepholeOptimizer.cpp        |  40 +-
 .../AArch64/Atomics/aarch64-atomicrmw-lse2.ll |  60 +-
 .../AArch64/Atomics/aarch64-atomicrmw-rcpc.ll |  60 +-
 .../Atomics/aarch64-atomicrmw-rcpc3.ll        |  60 +-
 .../AArch64/Atomics/aarch64-atomicrmw-v8a.ll  |  60 +-
 .../Atomics/aarch64_be-atomicrmw-lse2.ll      |  60 +-
 .../Atomics/aarch64_be-atomicrmw-rcpc.ll      |  60 +-
 .../Atomics/aarch64_be-atomicrmw-rcpc3.ll     |  60 +-
 .../Atomics/aarch64_be-atomicrmw-v8a.ll       |  60 +-
 .../AArch64/GlobalISel/arm64-atomic.ll        |   4 -
 .../AArch64/GlobalISel/arm64-pcsections.ll    |  74 ++-
 llvm/test/CodeGen/AArch64/addsub_ext.ll       |  26 +-
 llvm/test/CodeGen/AArch64/and-mask-removal.ll |   1 -
 llvm/test/CodeGen/AArch64/arm64-ldxr-stxr.ll  |   4 -
 llvm/test/CodeGen/AArch64/arm64-vaddv.ll      |   1 -
 llvm/test/CodeGen/AArch64/arm64_32-addrs.ll   |   1 -
 llvm/test/CodeGen/AArch64/atomic-ops-msvc.ll  |  11 +-
 llvm/test/CodeGen/AArch64/atomic-ops.ll       |   4 -
 llvm/test/CodeGen/AArch64/atomicrmw-fadd.ll   |  22 +-
 llvm/test/CodeGen/AArch64/atomicrmw-fmax.ll   |  22 +-
 llvm/test/CodeGen/AArch64/atomicrmw-fmin.ll   |  22 +-
 llvm/test/CodeGen/AArch64/atomicrmw-fsub.ll   |  22 +-
 .../test/CodeGen/AArch64/atomicrmw-xchg-fp.ll |  10 +-
 llvm/test/CodeGen/AArch64/cmp-to-cmn.ll       |   8 -
 llvm/test/CodeGen/AArch64/cmpxchg-idioms.ll   |   1 -
 llvm/test/CodeGen/AArch64/extract-bits.ll     |   6 -
 .../fold-int-pow2-with-fmul-or-fdiv.ll        |   2 -
 llvm/test/CodeGen/AArch64/fsh.ll              |   2 -
 llvm/test/CodeGen/AArch64/funnel-shift.ll     |   4 -
 ...st-and-by-const-from-lshr-in-eqcmp-zero.ll |   8 -
 ...ist-and-by-const-from-shl-in-eqcmp-zero.ll |  19 +-
 llvm/test/CodeGen/AArch64/logic-shift.ll      |   9 -
 .../CodeGen/AArch64/neon-insextbitcast.ll     |   2 -
 llvm/test/CodeGen/AArch64/shift-by-signext.ll |   2 -
 llvm/test/CodeGen/AArch64/shift.ll            |   6 -
 llvm/test/CodeGen/AArch64/sink-and-fold.ll    |   1 -
 .../AArch64/sve-fixed-length-subvector.ll     |  48 +-
 ...e-streaming-mode-fixed-length-fp-to-int.ll | 152 ++---
 ...g-mode-fixed-length-permute-zip-uzp-trn.ll |  30 +-
 .../AArch64/vec-combine-compare-to-bitmask.ll |  28 +-
 .../AMDGPU/llvm.amdgcn.init.whole.wave-w32.ll |   7 +-
 llvm/test/CodeGen/AMDGPU/mul_int24.ll         |  42 +-
 .../peephole-opt-fold-reg-sequence-subreg.mir | 189 ++++++
 llvm/test/CodeGen/ARM/aes-erratum-fix.ll      |  70 +--
 .../ARM/arm-bf16-dotprod-intrinsics.ll        |   3 -
 .../ARM/armv8.2a-fp16-vector-intrinsics.ll    |   2 -
 .../CodeGen/ARM/bf16-create-get-set-dup.ll    |   1 -
 llvm/test/CodeGen/ARM/bf16-shuffle.ll         |   1 -
 .../CodeGen/ARM/minnum-maxnum-intrinsics.ll   |  22 +-
 llvm/test/CodeGen/ARM/neon-copy.ll            |   3 -
 llvm/test/CodeGen/ARM/neon-v8.1a.ll           |   8 -
 llvm/test/CodeGen/ARM/vdup.ll                 |   4 -
 llvm/test/CodeGen/ARM/vext.ll                 |  12 +-
 llvm/test/CodeGen/ARM/vmul.ll                 |   3 -
 llvm/test/CodeGen/ARM/vpadd.ll                |   8 +-
 llvm/test/CodeGen/ARM/vuzp.ll                 |  52 +-
 llvm/test/CodeGen/ARM/vzip.ll                 |  46 +-
 llvm/test/CodeGen/AVR/return.ll               |  72 +--
 llvm/test/CodeGen/BPF/is_trunc_free.ll        |   1 -
 .../aggressive-anti-dep-breaker-subreg.ll     |   2 +-
 .../CodeGen/PowerPC/mma-acc-copy-hints.ll     |  18 +-
 llvm/test/CodeGen/PowerPC/mma-acc-memops.ll   |  24 +-
 .../PowerPC/paired-vector-intrinsics.ll       |  20 +-
 .../CodeGen/PowerPC/peephole-subreg-def.mir   |  21 +-
 .../RISCV/rvv/fixed-vectors-cttz-vp.ll        | 126 ++--
 .../rvv/fixed-vectors-interleaved-access.ll   | 549 +++++++++--------
 .../RISCV/rvv/fixed-vectors-masked-scatter.ll |  60 +-
 .../CodeGen/RISCV/rvv/vector-interleave.ll    |  52 +-
 llvm/test/CodeGen/SPARC/fmuladd-soft-float.ll |  14 +-
 llvm/test/CodeGen/SPARC/fp128.ll              |  24 +-
 llvm/test/CodeGen/SPARC/fp16-promote.ll       |  32 +-
 llvm/test/CodeGen/SystemZ/int-uadd-01.ll      |  16 +-
 llvm/test/CodeGen/SystemZ/int-uadd-02.ll      |  16 +-
 llvm/test/CodeGen/SystemZ/pr60413.ll          |  40 +-
 ...ve-complex-deinterleaving-uniform-cases.ll |  80 +--
 .../CodeGen/Thumb2/mve-float32regloops.ll     |  56 +-
 llvm/test/CodeGen/Thumb2/mve-masked-ldst.ll   |  48 +-
 llvm/test/CodeGen/Thumb2/mve-shuffle.ll       |  56 +-
 llvm/test/CodeGen/Thumb2/mve-vcvt16.ll        |   7 +-
 llvm/test/CodeGen/Thumb2/mve-vld2.ll          |   7 +-
 llvm/test/CodeGen/Thumb2/mve-vld3.ll          | 552 +++++++-----------
 llvm/test/CodeGen/Thumb2/mve-vld4.ll          | 112 ++--
 llvm/test/CodeGen/Thumb2/mve-vldst4.ll        | 176 +++---
 llvm/test/CodeGen/Thumb2/mve-vst2.ll          |  35 +-
 llvm/test/CodeGen/Thumb2/mve-vst3.ll          | 541 +++++++++--------
 llvm/test/CodeGen/Thumb2/mve-vst4.ll          | 152 +++--
 llvm/test/CodeGen/VE/Scalar/select.ll         |   4 +-
 llvm/test/CodeGen/VE/Scalar/va_caller.ll      |  21 +-
 .../CodeGen/X86/AMX/amx-ldtilecfg-insert.ll   |   4 +-
 llvm/test/CodeGen/X86/avx512-calling-conv.ll  |   4 +-
 llvm/test/CodeGen/X86/avx512-ext.ll           | 392 +++++++------
 .../X86/div-rem-pair-recomposition-signed.ll  |  11 +-
 llvm/test/CodeGen/X86/fminimum-fmaximum.ll    |  70 +--
 .../CodeGen/X86/fminimumnum-fmaximumnum.ll    | 288 ++++-----
 llvm/test/CodeGen/X86/half.ll                 |  10 +-
 llvm/test/CodeGen/X86/smax.ll                 |  10 +-
 llvm/test/CodeGen/X86/smin.ll                 |  10 +-
 llvm/test/CodeGen/X86/test-shrink.ll          |   1 -
 llvm/test/CodeGen/X86/umax.ll                 |  10 +-
 llvm/test/CodeGen/X86/umin.ll                 |  10 +-
 llvm/test/CodeGen/X86/vector-compress.ll      |  98 ++--
 llvm/test/CodeGen/X86/vector-fshl-256.ll      |   1 -
 .../X86/wide-scalar-shift-legalization.ll     |  48 +-
 ...ad-of-small-alloca-with-zero-upper-half.ll |   3 +-
 .../CodeGen/X86/widen-load-of-small-alloca.ll |   3 +-
 105 files changed, 2727 insertions(+), 2725 deletions(-)
 create mode 100644 llvm/test/CodeGen/AMDGPU/peephole-opt-fold-reg-sequence-subreg.mir

diff --git a/llvm/lib/CodeGen/PeepholeOptimizer.cpp b/llvm/lib/CodeGen/PeepholeOptimizer.cpp
index e0053fb243369c..6739199a802231 100644
--- a/llvm/lib/CodeGen/PeepholeOptimizer.cpp
+++ b/llvm/lib/CodeGen/PeepholeOptimizer.cpp
@@ -465,7 +465,8 @@ class PeepholeOptimizer : private MachineFunction::Delegate {
   bool optimizeUncoalescableCopy(MachineInstr &MI,
                                  SmallPtrSetImpl<MachineInstr *> &LocalMIs);
   bool optimizeRecurrence(MachineInstr &PHI);
-  bool findNextSource(RegSubRegPair RegSubReg, RewriteMapTy &RewriteMap);
+  bool findNextSource(const TargetRegisterClass *DefRC, unsigned DefSubReg,
+                      RegSubRegPair RegSubReg, RewriteMapTy &RewriteMap);
   bool isMoveImmediate(MachineInstr &MI, SmallSet<Register, 4> &ImmDefRegs,
                        DenseMap<Register, MachineInstr *> &ImmDefMIs);
   bool foldImmediate(MachineInstr &MI, SmallSet<Register, 4> &ImmDefRegs,
@@ -1002,17 +1003,15 @@ bool PeepholeOptimizer::optimizeCondBranch(MachineInstr &MI) {
 /// share the same register file as \p Reg and \p SubReg. The client should
 /// then be capable to rewrite all intermediate PHIs to get the next source.
 /// \return False if no alternative sources are available. True otherwise.
-bool PeepholeOptimizer::findNextSource(RegSubRegPair RegSubReg,
+bool PeepholeOptimizer::findNextSource(const TargetRegisterClass *DefRC,
+                                       unsigned DefSubReg,
+                                       RegSubRegPair RegSubReg,
                                        RewriteMapTy &RewriteMap) {
   // Do not try to find a new source for a physical register.
   // So far we do not have any motivating example for doing that.
   // Thus, instead of maintaining untested code, we will revisit that if
   // that changes at some point.
   Register Reg = RegSubReg.Reg;
-  if (Reg.isPhysical())
-    return false;
-  const TargetRegisterClass *DefRC = MRI->getRegClass(Reg);
-
   SmallVector<RegSubRegPair, 4> SrcToLook;
   RegSubRegPair CurSrcPair = RegSubReg;
   SrcToLook.push_back(CurSrcPair);
@@ -1076,7 +1075,7 @@ bool PeepholeOptimizer::findNextSource(RegSubRegPair RegSubReg,
 
       // Keep following the chain if the value isn't any better yet.
       const TargetRegisterClass *SrcRC = MRI->getRegClass(CurSrcPair.Reg);
-      if (!TRI->shouldRewriteCopySrc(DefRC, RegSubReg.SubReg, SrcRC,
+      if (!TRI->shouldRewriteCopySrc(DefRC, DefSubReg, SrcRC,
                                      CurSrcPair.SubReg))
         continue;
 
@@ -1184,21 +1183,33 @@ bool PeepholeOptimizer::optimizeCoalescableCopyImpl(Rewriter &&CpyRewriter) {
   bool Changed = false;
   // Get the right rewriter for the current copy.
   // Rewrite each rewritable source.
-  RegSubRegPair Src;
+  RegSubRegPair Dst;
   RegSubRegPair TrackPair;
-  while (CpyRewriter.getNextRewritableSource(Src, TrackPair)) {
+  while (CpyRewriter.getNextRewritableSource(TrackPair, Dst)) {
+    if (Dst.Reg.isPhysical()) {
+      // Do not try to find a new source for a physical register.
+      // So far we do not have any motivating example for doing that.
+      // Thus, instead of maintaining untested code, we will revisit that if
+      // that changes at some point.
+      continue;
+    }
+
+    const TargetRegisterClass *DefRC = MRI->getRegClass(Dst.Reg);
+
     // Keep track of PHI nodes and its incoming edges when looking for sources.
     RewriteMapTy RewriteMap;
     // Try to find a more suitable source. If we failed to do so, or get the
     // actual source, move to the next source.
-    if (!findNextSource(TrackPair, RewriteMap))
+    if (!findNextSource(DefRC, Dst.SubReg, TrackPair, RewriteMap))
       continue;
 
     // Get the new source to rewrite. TODO: Only enable handling of multiple
     // sources (PHIs) once we have a motivating example and testcases for it.
     RegSubRegPair NewSrc = getNewSource(MRI, TII, TrackPair, RewriteMap,
                                         /*HandleMultipleSources=*/false);
-    if (Src.Reg == NewSrc.Reg || NewSrc.Reg == 0)
+    assert(TrackPair.Reg != NewSrc.Reg &&
+           "should not rewrite source to original value");
+    if (!NewSrc.Reg)
       continue;
 
     // Rewrite source.
@@ -1325,9 +1336,14 @@ bool PeepholeOptimizer::optimizeUncoalescableCopy(
     if (Def.Reg.isPhysical())
       return false;
 
+    // FIXME: Uncoalescable copies are treated differently by
+    // UncoalescableRewriter, and this probably should not share
+    // API. getNextRewritableSource really finds rewritable defs.
+    const TargetRegisterClass *DefRC = MRI->getRegClass(Def.Reg);
+
     // If we do not know how to rewrite this definition, there is no point
     // in trying to kill this instruction.
-    if (!findNextSource(Def, RewriteMap))
+    if (!findNextSource(DefRC, Def.SubReg, Def, RewriteMap))
       return false;
 
     RewritePairs.push_back(Def);
diff --git a/llvm/test/CodeGen/AArch64/Atomics/aarch64-atomicrmw-lse2.ll b/llvm/test/CodeGen/AArch64/Atomics/aarch64-atomicrmw-lse2.ll
index d93ef6f8b2869b..94d46148f37e3a 100644
--- a/llvm/test/CodeGen/AArch64/Atomics/aarch64-atomicrmw-lse2.ll
+++ b/llvm/test/CodeGen/AArch64/Atomics/aarch64-atomicrmw-lse2.ll
@@ -12,8 +12,8 @@ define dso_local i8 @atomicrmw_xchg_i8_aligned_monotonic(ptr %ptr, i8 %value) {
 ; -O0:    subs w8, w8, w10, uxtb
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_aligned_monotonic:
-; -O1:    ldxrb w8, [x0]
-; -O1:    stxrb w9, w1, [x0]
+; -O1:    ldxrb w0, [x8]
+; -O1:    stxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value monotonic, align 1
     ret i8 %r
 }
@@ -27,8 +27,8 @@ define dso_local i8 @atomicrmw_xchg_i8_aligned_acquire(ptr %ptr, i8 %value) {
 ; -O0:    subs w8, w8, w10, uxtb
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_aligned_acquire:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value acquire, align 1
     ret i8 %r
 }
@@ -42,8 +42,8 @@ define dso_local i8 @atomicrmw_xchg_i8_aligned_release(ptr %ptr, i8 %value) {
 ; -O0:    subs w8, w8, w10, uxtb
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_aligned_release:
-; -O1:    ldxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value release, align 1
     ret i8 %r
 }
@@ -57,8 +57,8 @@ define dso_local i8 @atomicrmw_xchg_i8_aligned_acq_rel(ptr %ptr, i8 %value) {
 ; -O0:    subs w8, w8, w10, uxtb
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_aligned_acq_rel:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value acq_rel, align 1
     ret i8 %r
 }
@@ -72,8 +72,8 @@ define dso_local i8 @atomicrmw_xchg_i8_aligned_seq_cst(ptr %ptr, i8 %value) {
 ; -O0:    subs w8, w8, w10, uxtb
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_aligned_seq_cst:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value seq_cst, align 1
     ret i8 %r
 }
@@ -86,8 +86,8 @@ define dso_local i16 @atomicrmw_xchg_i16_aligned_monotonic(ptr %ptr, i16 %value)
 ; -O0:    subs w8, w8, w9, uxth
 ;
 ; -O1-LABEL: atomicrmw_xchg_i16_aligned_monotonic:
-; -O1:    ldxrh w8, [x0]
-; -O1:    stxrh w9, w1, [x0]
+; -O1:    ldxrh w0, [x8]
+; -O1:    stxrh w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i16 %value monotonic, align 2
     ret i16 %r
 }
@@ -100,8 +100,8 @@ define dso_local i16 @atomicrmw_xchg_i16_aligned_acquire(ptr %ptr, i16 %value) {
 ; -O0:    subs w8, w8, w9, uxth
 ;
 ; -O1-LABEL: atomicrmw_xchg_i16_aligned_acquire:
-; -O1:    ldaxrh w8, [x0]
-; -O1:    stxrh w9, w1, [x0]
+; -O1:    ldaxrh w0, [x8]
+; -O1:    stxrh w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i16 %value acquire, align 2
     ret i16 %r
 }
@@ -114,8 +114,8 @@ define dso_local i16 @atomicrmw_xchg_i16_aligned_release(ptr %ptr, i16 %value) {
 ; -O0:    subs w8, w8, w9, uxth
 ;
 ; -O1-LABEL: atomicrmw_xchg_i16_aligned_release:
-; -O1:    ldxrh w8, [x0]
-; -O1:    stlxrh w9, w1, [x0]
+; -O1:    ldxrh w0, [x8]
+; -O1:    stlxrh w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i16 %value release, align 2
     ret i16 %r
 }
@@ -128,8 +128,8 @@ define dso_local i16 @atomicrmw_xchg_i16_aligned_acq_rel(ptr %ptr, i16 %value) {
 ; -O0:    subs w8, w8, w9, uxth
 ;
 ; -O1-LABEL: atomicrmw_xchg_i16_aligned_acq_rel:
-; -O1:    ldaxrh w8, [x0]
-; -O1:    stlxrh w9, w1, [x0]
+; -O1:    ldaxrh w0, [x8]
+; -O1:    stlxrh w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i16 %value acq_rel, align 2
     ret i16 %r
 }
@@ -142,8 +142,8 @@ define dso_local i16 @atomicrmw_xchg_i16_aligned_seq_cst(ptr %ptr, i16 %value) {
 ; -O0:    subs w8, w8, w9, uxth
 ;
 ; -O1-LABEL: atomicrmw_xchg_i16_aligned_seq_cst:
-; -O1:    ldaxrh w8, [x0]
-; -O1:    stlxrh w9, w1, [x0]
+; -O1:    ldaxrh w0, [x8]
+; -O1:    stlxrh w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i16 %value seq_cst, align 2
     ret i16 %r
 }
@@ -392,8 +392,8 @@ define dso_local i8 @atomicrmw_xchg_i8_unaligned_monotonic(ptr %ptr, i8 %value)
 ; -O0:    subs w8, w8, w10, uxtb
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_unaligned_monotonic:
-; -O1:    ldxrb w8, [x0]
-; -O1:    stxrb w9, w1, [x0]
+; -O1:    ldxrb w0, [x8]
+; -O1:    stxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value monotonic, align 1
     ret i8 %r
 }
@@ -407,8 +407,8 @@ define dso_local i8 @atomicrmw_xchg_i8_unaligned_acquire(ptr %ptr, i8 %value) {
 ; -O0:    subs w8, w8, w10, uxtb
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_unaligned_acquire:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value acquire, align 1
     ret i8 %r
 }
@@ -422,8 +422,8 @@ define dso_local i8 @atomicrmw_xchg_i8_unaligned_release(ptr %ptr, i8 %value) {
 ; -O0:    subs w8, w8, w10, uxtb
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_unaligned_release:
-; -O1:    ldxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value release, align 1
     ret i8 %r
 }
@@ -437,8 +437,8 @@ define dso_local i8 @atomicrmw_xchg_i8_unaligned_acq_rel(ptr %ptr, i8 %value) {
 ; -O0:    subs w8, w8, w10, uxtb
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_unaligned_acq_rel:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value acq_rel, align 1
     ret i8 %r
 }
@@ -452,8 +452,8 @@ define dso_local i8 @atomicrmw_xchg_i8_unaligned_seq_cst(ptr %ptr, i8 %value) {
 ; -O0:    subs w8, w8, w10, uxtb
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_unaligned_seq_cst:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value seq_cst, align 1
     ret i8 %r
 }
diff --git a/llvm/test/CodeGen/AArch64/Atomics/aarch64-atomicrmw-rcpc.ll b/llvm/test/CodeGen/AArch64/Atomics/aarch64-atomicrmw-rcpc.ll
index 912d87dcd2b9b6..57cfeb78b69802 100644
--- a/llvm/test/CodeGen/AArch64/Atomics/aarch64-atomicrmw-rcpc.ll
+++ b/llvm/test/CodeGen/AArch64/Atomics/aarch64-atomicrmw-rcpc.ll
@@ -12,8 +12,8 @@ define dso_local i8 @atomicrmw_xchg_i8_aligned_monotonic(ptr %ptr, i8 %value) {
 ; -O0:    subs w8, w8, w10, uxtb
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_aligned_monotonic:
-; -O1:    ldxrb w8, [x0]
-; -O1:    stxrb w9, w1, [x0]
+; -O1:    ldxrb w0, [x8]
+; -O1:    stxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value monotonic, align 1
     ret i8 %r
 }
@@ -27,8 +27,8 @@ define dso_local i8 @atomicrmw_xchg_i8_aligned_acquire(ptr %ptr, i8 %value) {
 ; -O0:    subs w8, w8, w10, uxtb
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_aligned_acquire:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value acquire, align 1
     ret i8 %r
 }
@@ -42,8 +42,8 @@ define dso_local i8 @atomicrmw_xchg_i8_aligned_release(ptr %ptr, i8 %value) {
 ; -O0:    subs w8, w8, w10, uxtb
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_aligned_release:
-; -O1:    ldxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value release, align 1
     ret i8 %r
 }
@@ -57,8 +57,8 @@ define dso_local i8 @atomicrmw_xchg_i8_aligned_acq_rel(ptr %ptr, i8 %value) {
 ; -O0:    subs w8, w8, w10, uxtb
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_aligned_acq_rel:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value acq_rel, align 1
     ret i8 %r
 }
@@ -72,8 +72,8 @@ define dso_local i8 @atomicrmw_xchg_i8_aligned_seq_cst(ptr %ptr, i8 %value) {
 ; -O0:    subs w8, w8, w10, uxtb
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_aligned_seq_cst:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value seq_cst, align 1
     ret i8 %r
 }
@@ -86,8 +86,8 @@ define dso_local i16 @atomicrmw_xchg_i16_aligned_monotonic(ptr %ptr, i16 %value)
 ; -O0:    subs w8, w8, w9, uxth
 ;
 ; -O1-LABEL: atomicrmw_xchg_i16_aligned_monotonic:
-; -O1:    ldxrh w8, [x0]
-; -O1:    stxrh w9, w1, [x0]
+; -O1:    ldxrh w0, [x8]
+; -O1:    stxrh w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i16 %value monotonic, align 2
     ret i16 %r
 }
@@ -100,8 +100,8 @@ define dso_local i16 @atomicrmw_xchg_i16_aligned_acquire(ptr %ptr, i16 %value) {
 ; -O0:    subs w8, w8, w9, uxth
 ;
 ; -O1-LABEL: atomicrmw_xchg_i16_aligned_acquire:
-; -O1:    ldaxrh w8, [x0]
-; -O1:    stxrh w9, w1, [x0]
+; -O1:    ldaxrh w0, [x8]
+; -O1:    stxrh w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i16 %value acquire, align 2
     ret i16 %r
 }
@@ -114,8 +114,8 @@ define dso_local i16 @atomicrmw_xchg_i16_aligned_release(ptr %ptr, i16 %value) {
 ; -O0:    subs w8, w8, w9, uxth
 ;
 ; -O1-LABEL: atomicrmw_xchg_i16_aligned_release:
-; -O1:    ldxrh w8, [x0]
-; -O1:    stlxrh w9, w1, [x0]
+; -O1:    ldxrh w0, [x8]
+; -O1:    stlxrh w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i16 %value release, align 2
     ret i16 %r
 }
@@ -128,8 +128,8 @@ define dso_local i16 @atomicrmw_xchg_i16_aligned_acq_rel(ptr %ptr, i16 %value) {
 ; -O0:    subs w8, w8, w9, uxth
 ;
 ; -O1-LABEL: atomicrmw_xchg_i16_aligned_acq_rel:
-; -O1:    ldaxrh w8, [x0]
-; -O1:    stlxrh w9, w1, [x0]
+; -O1:    ldaxrh w0, [x8]
+; -O1:    stlxrh w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i16 %value acq_rel, align 2
     ret i16 %r
 }
@@ -142,8 +142,8 @@ define dso_local i16 @atomicrmw_xchg_i16_aligned_seq_cst(ptr %ptr, i16 %value) {
 ; -O0:    subs w8, w8, w9, uxth
 ;
 ; -O1-LABEL: atomicrmw_xchg_i16_aligned_seq_cst:
-; -O1:    ldaxrh w8, [x0]
-; -O1:    stlxrh w9, w1, [x0]
+; -O1:    ldaxrh w0, [x8]
+; -O1:    stlxrh w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i16 %value seq_cst, align 2
     ret i16 %r
 }
@@ -392,8 +392,8 @@ define dso_local i8 @atomicrmw_xchg_i8_unaligned_monotonic(ptr %ptr, i8 %value)
 ; -O0:    subs w8, w8, w10, uxtb
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_unaligned_monotonic:
-; -O1:    ldxrb w8, [x0]
-; -O1:    stxrb w9, w1, [x0]
+; -O1:    ldxrb w0, [x8]
+; -O1:    stxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value monotonic, align 1
     ret i8 %r
 }
@@ -407,8 +407,8 @@ define dso_local i8 @atomicrmw_xchg_i8_unaligned_acquire(ptr %ptr, i8 %value) {
 ; -O0:    subs w8, w8, w10, uxtb
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_unaligned_acquire:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value acquire, align 1
     ret i8 %r
 }
@@ -422,8 +422,8 @@ define dso_local i8 @atomicrmw_xchg_i8_unaligned_release(ptr %ptr, i8 %value) {
 ; -O0:    subs w8, w8, w10, uxtb
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_unaligned_release:
-; -O1:    ldxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value release, align 1
     ret i8 %r
 }
@@ -437,8 +437,8 @@ define dso_local i8 @atomicrmw_xchg_i8_unaligned_acq_rel(ptr %ptr, i8 %value) {
 ; -O0:    subs w8, w8, w10, uxtb
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_unaligned_acq_rel:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value acq_rel, align 1
     ret i8 %r
 }
@@ -452,8 +452,8 @@ define dso_local i8 @atomicrmw_xchg_i8_unaligned_seq_cst(ptr %ptr, i8 %value) {
 ; -O0:    subs w8, w8, w10, uxtb
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_unaligned_seq_cst:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value seq_cst, align 1
     ret i8 %r
 }
diff --git a/llvm/test/CodeGen/AArch64/Atomics/aarch64-atomicrmw-rcpc3.ll b/llvm/test/CodeGen/AArch64/Atomics/aarch64-atomicrmw-rcpc3.ll
index 725558f2dcf727..28ee1a2a70c4d7 100644
--- a/llvm/test/CodeGen/AArch64/Atomics/aarch64-atomicrmw-rcpc3.ll
+++ b/llvm/test/CodeGen/AArch64/Atomics/aarch64-atomicrmw-rcpc3.ll
@@ -12,8 +12,8 @@ define dso_local i8 @atomicrmw_xchg_i8_aligned_monotonic(ptr %ptr, i8 %value) {
 ; -O0:    subs w8, w8, w10, uxtb
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_aligned_monotonic:
-; -O1:    ldxrb w8, [x0]
-; -O1:    stxrb w9, w1, [x0]
+; -O1:    ldxrb w0, [x8]
+; -O1:    stxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value monotonic, align 1
     ret i8 %r
 }
@@ -27,8 +27,8 @@ define dso_local i8 @atomicrmw_xchg_i8_aligned_acquire(ptr %ptr, i8 %value) {
 ; -O0:    subs w8, w8, w10, uxtb
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_aligned_acquire:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value acquire, align 1
     ret i8 %r
 }
@@ -42,8 +42,8 @@ define dso_local i8 @atomicrmw_xchg_i8_aligned_release(ptr %ptr, i8 %value) {
 ; -O0:    subs w8, w8, w10, uxtb
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_aligned_release:
-; -O1:    ldxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value release, align 1
     ret i8 %r
 }
@@ -57,8 +57,8 @@ define dso_local i8 @atomicrmw_xchg_i8_aligned_acq_rel(ptr %ptr, i8 %value) {
 ; -O0:    subs w8, w8, w10, uxtb
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_aligned_acq_rel:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value acq_rel, align 1
     ret i8 %r
 }
@@ -72,8 +72,8 @@ define dso_local i8 @atomicrmw_xchg_i8_aligned_seq_cst(ptr %ptr, i8 %value) {
 ; -O0:    subs w8, w8, w10, uxtb
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_aligned_seq_cst:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value seq_cst, align 1
     ret i8 %r
 }
@@ -86,8 +86,8 @@ define dso_local i16 @atomicrmw_xchg_i16_aligned_monotonic(ptr %ptr, i16 %value)
 ; -O0:    subs w8, w8, w9, uxth
 ;
 ; -O1-LABEL: atomicrmw_xchg_i16_aligned_monotonic:
-; -O1:    ldxrh w8, [x0]
-; -O1:    stxrh w9, w1, [x0]
+; -O1:    ldxrh w0, [x8]
+; -O1:    stxrh w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i16 %value monotonic, align 2
     ret i16 %r
 }
@@ -100,8 +100,8 @@ define dso_local i16 @atomicrmw_xchg_i16_aligned_acquire(ptr %ptr, i16 %value) {
 ; -O0:    subs w8, w8, w9, uxth
 ;
 ; -O1-LABEL: atomicrmw_xchg_i16_aligned_acquire:
-; -O1:    ldaxrh w8, [x0]
-; -O1:    stxrh w9, w1, [x0]
+; -O1:    ldaxrh w0, [x8]
+; -O1:    stxrh w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i16 %value acquire, align 2
     ret i16 %r
 }
@@ -114,8 +114,8 @@ define dso_local i16 @atomicrmw_xchg_i16_aligned_release(ptr %ptr, i16 %value) {
 ; -O0:    subs w8, w8, w9, uxth
 ;
 ; -O1-LABEL: atomicrmw_xchg_i16_aligned_release:
-; -O1:    ldxrh w8, [x0]
-; -O1:    stlxrh w9, w1, [x0]
+; -O1:    ldxrh w0, [x8]
+; -O1:    stlxrh w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i16 %value release, align 2
     ret i16 %r
 }
@@ -128,8 +128,8 @@ define dso_local i16 @atomicrmw_xchg_i16_aligned_acq_rel(ptr %ptr, i16 %value) {
 ; -O0:    subs w8, w8, w9, uxth
 ;
 ; -O1-LABEL: atomicrmw_xchg_i16_aligned_acq_rel:
-; -O1:    ldaxrh w8, [x0]
-; -O1:    stlxrh w9, w1, [x0]
+; -O1:    ldaxrh w0, [x8]
+; -O1:    stlxrh w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i16 %value acq_rel, align 2
     ret i16 %r
 }
@@ -142,8 +142,8 @@ define dso_local i16 @atomicrmw_xchg_i16_aligned_seq_cst(ptr %ptr, i16 %value) {
 ; -O0:    subs w8, w8, w9, uxth
 ;
 ; -O1-LABEL: atomicrmw_xchg_i16_aligned_seq_cst:
-; -O1:    ldaxrh w8, [x0]
-; -O1:    stlxrh w9, w1, [x0]
+; -O1:    ldaxrh w0, [x8]
+; -O1:    stlxrh w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i16 %value seq_cst, align 2
     ret i16 %r
 }
@@ -392,8 +392,8 @@ define dso_local i8 @atomicrmw_xchg_i8_unaligned_monotonic(ptr %ptr, i8 %value)
 ; -O0:    subs w8, w8, w10, uxtb
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_unaligned_monotonic:
-; -O1:    ldxrb w8, [x0]
-; -O1:    stxrb w9, w1, [x0]
+; -O1:    ldxrb w0, [x8]
+; -O1:    stxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value monotonic, align 1
     ret i8 %r
 }
@@ -407,8 +407,8 @@ define dso_local i8 @atomicrmw_xchg_i8_unaligned_acquire(ptr %ptr, i8 %value) {
 ; -O0:    subs w8, w8, w10, uxtb
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_unaligned_acquire:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value acquire, align 1
     ret i8 %r
 }
@@ -422,8 +422,8 @@ define dso_local i8 @atomicrmw_xchg_i8_unaligned_release(ptr %ptr, i8 %value) {
 ; -O0:    subs w8, w8, w10, uxtb
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_unaligned_release:
-; -O1:    ldxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value release, align 1
     ret i8 %r
 }
@@ -437,8 +437,8 @@ define dso_local i8 @atomicrmw_xchg_i8_unaligned_acq_rel(ptr %ptr, i8 %value) {
 ; -O0:    subs w8, w8, w10, uxtb
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_unaligned_acq_rel:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value acq_rel, align 1
     ret i8 %r
 }
@@ -452,8 +452,8 @@ define dso_local i8 @atomicrmw_xchg_i8_unaligned_seq_cst(ptr %ptr, i8 %value) {
 ; -O0:    subs w8, w8, w10, uxtb
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_unaligned_seq_cst:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value seq_cst, align 1
     ret i8 %r
 }
diff --git a/llvm/test/CodeGen/AArch64/Atomics/aarch64-atomicrmw-v8a.ll b/llvm/test/CodeGen/AArch64/Atomics/aarch64-atomicrmw-v8a.ll
index 004e433d9408a6..69220a6cf8132f 100644
--- a/llvm/test/CodeGen/AArch64/Atomics/aarch64-atomicrmw-v8a.ll
+++ b/llvm/test/CodeGen/AArch64/Atomics/aarch64-atomicrmw-v8a.ll
@@ -12,8 +12,8 @@ define dso_local i8 @atomicrmw_xchg_i8_aligned_monotonic(ptr %ptr, i8 %value) {
 ; -O0:    subs w8, w8, w10, uxtb
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_aligned_monotonic:
-; -O1:    ldxrb w8, [x0]
-; -O1:    stxrb w9, w1, [x0]
+; -O1:    ldxrb w0, [x8]
+; -O1:    stxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value monotonic, align 1
     ret i8 %r
 }
@@ -27,8 +27,8 @@ define dso_local i8 @atomicrmw_xchg_i8_aligned_acquire(ptr %ptr, i8 %value) {
 ; -O0:    subs w8, w8, w10, uxtb
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_aligned_acquire:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value acquire, align 1
     ret i8 %r
 }
@@ -42,8 +42,8 @@ define dso_local i8 @atomicrmw_xchg_i8_aligned_release(ptr %ptr, i8 %value) {
 ; -O0:    subs w8, w8, w10, uxtb
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_aligned_release:
-; -O1:    ldxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value release, align 1
     ret i8 %r
 }
@@ -57,8 +57,8 @@ define dso_local i8 @atomicrmw_xchg_i8_aligned_acq_rel(ptr %ptr, i8 %value) {
 ; -O0:    subs w8, w8, w10, uxtb
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_aligned_acq_rel:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value acq_rel, align 1
     ret i8 %r
 }
@@ -72,8 +72,8 @@ define dso_local i8 @atomicrmw_xchg_i8_aligned_seq_cst(ptr %ptr, i8 %value) {
 ; -O0:    subs w8, w8, w10, uxtb
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_aligned_seq_cst:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value seq_cst, align 1
     ret i8 %r
 }
@@ -86,8 +86,8 @@ define dso_local i16 @atomicrmw_xchg_i16_aligned_monotonic(ptr %ptr, i16 %value)
 ; -O0:    subs w8, w8, w9, uxth
 ;
 ; -O1-LABEL: atomicrmw_xchg_i16_aligned_monotonic:
-; -O1:    ldxrh w8, [x0]
-; -O1:    stxrh w9, w1, [x0]
+; -O1:    ldxrh w0, [x8]
+; -O1:    stxrh w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i16 %value monotonic, align 2
     ret i16 %r
 }
@@ -100,8 +100,8 @@ define dso_local i16 @atomicrmw_xchg_i16_aligned_acquire(ptr %ptr, i16 %value) {
 ; -O0:    subs w8, w8, w9, uxth
 ;
 ; -O1-LABEL: atomicrmw_xchg_i16_aligned_acquire:
-; -O1:    ldaxrh w8, [x0]
-; -O1:    stxrh w9, w1, [x0]
+; -O1:    ldaxrh w0, [x8]
+; -O1:    stxrh w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i16 %value acquire, align 2
     ret i16 %r
 }
@@ -114,8 +114,8 @@ define dso_local i16 @atomicrmw_xchg_i16_aligned_release(ptr %ptr, i16 %value) {
 ; -O0:    subs w8, w8, w9, uxth
 ;
 ; -O1-LABEL: atomicrmw_xchg_i16_aligned_release:
-; -O1:    ldxrh w8, [x0]
-; -O1:    stlxrh w9, w1, [x0]
+; -O1:    ldxrh w0, [x8]
+; -O1:    stlxrh w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i16 %value release, align 2
     ret i16 %r
 }
@@ -128,8 +128,8 @@ define dso_local i16 @atomicrmw_xchg_i16_aligned_acq_rel(ptr %ptr, i16 %value) {
 ; -O0:    subs w8, w8, w9, uxth
 ;
 ; -O1-LABEL: atomicrmw_xchg_i16_aligned_acq_rel:
-; -O1:    ldaxrh w8, [x0]
-; -O1:    stlxrh w9, w1, [x0]
+; -O1:    ldaxrh w0, [x8]
+; -O1:    stlxrh w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i16 %value acq_rel, align 2
     ret i16 %r
 }
@@ -142,8 +142,8 @@ define dso_local i16 @atomicrmw_xchg_i16_aligned_seq_cst(ptr %ptr, i16 %value) {
 ; -O0:    subs w8, w8, w9, uxth
 ;
 ; -O1-LABEL: atomicrmw_xchg_i16_aligned_seq_cst:
-; -O1:    ldaxrh w8, [x0]
-; -O1:    stlxrh w9, w1, [x0]
+; -O1:    ldaxrh w0, [x8]
+; -O1:    stlxrh w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i16 %value seq_cst, align 2
     ret i16 %r
 }
@@ -392,8 +392,8 @@ define dso_local i8 @atomicrmw_xchg_i8_unaligned_monotonic(ptr %ptr, i8 %value)
 ; -O0:    subs w8, w8, w10, uxtb
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_unaligned_monotonic:
-; -O1:    ldxrb w8, [x0]
-; -O1:    stxrb w9, w1, [x0]
+; -O1:    ldxrb w0, [x8]
+; -O1:    stxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value monotonic, align 1
     ret i8 %r
 }
@@ -407,8 +407,8 @@ define dso_local i8 @atomicrmw_xchg_i8_unaligned_acquire(ptr %ptr, i8 %value) {
 ; -O0:    subs w8, w8, w10, uxtb
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_unaligned_acquire:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value acquire, align 1
     ret i8 %r
 }
@@ -422,8 +422,8 @@ define dso_local i8 @atomicrmw_xchg_i8_unaligned_release(ptr %ptr, i8 %value) {
 ; -O0:    subs w8, w8, w10, uxtb
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_unaligned_release:
-; -O1:    ldxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value release, align 1
     ret i8 %r
 }
@@ -437,8 +437,8 @@ define dso_local i8 @atomicrmw_xchg_i8_unaligned_acq_rel(ptr %ptr, i8 %value) {
 ; -O0:    subs w8, w8, w10, uxtb
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_unaligned_acq_rel:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value acq_rel, align 1
     ret i8 %r
 }
@@ -452,8 +452,8 @@ define dso_local i8 @atomicrmw_xchg_i8_unaligned_seq_cst(ptr %ptr, i8 %value) {
 ; -O0:    subs w8, w8, w10, uxtb
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_unaligned_seq_cst:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value seq_cst, align 1
     ret i8 %r
 }
diff --git a/llvm/test/CodeGen/AArch64/Atomics/aarch64_be-atomicrmw-lse2.ll b/llvm/test/CodeGen/AArch64/Atomics/aarch64_be-atomicrmw-lse2.ll
index 01317e09028c35..d92899d9aa3eb8 100644
--- a/llvm/test/CodeGen/AArch64/Atomics/aarch64_be-atomicrmw-lse2.ll
+++ b/llvm/test/CodeGen/AArch64/Atomics/aarch64_be-atomicrmw-lse2.ll
@@ -12,8 +12,8 @@ define dso_local i8 @atomicrmw_xchg_i8_aligned_monotonic(ptr %ptr, i8 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_aligned_monotonic:
-; -O1:    ldxrb w8, [x0]
-; -O1:    stxrb w9, w1, [x0]
+; -O1:    ldxrb w0, [x8]
+; -O1:    stxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value monotonic, align 1
     ret i8 %r
 }
@@ -27,8 +27,8 @@ define dso_local i8 @atomicrmw_xchg_i8_aligned_acquire(ptr %ptr, i8 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_aligned_acquire:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value acquire, align 1
     ret i8 %r
 }
@@ -42,8 +42,8 @@ define dso_local i8 @atomicrmw_xchg_i8_aligned_release(ptr %ptr, i8 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_aligned_release:
-; -O1:    ldxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value release, align 1
     ret i8 %r
 }
@@ -57,8 +57,8 @@ define dso_local i8 @atomicrmw_xchg_i8_aligned_acq_rel(ptr %ptr, i8 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_aligned_acq_rel:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value acq_rel, align 1
     ret i8 %r
 }
@@ -72,8 +72,8 @@ define dso_local i8 @atomicrmw_xchg_i8_aligned_seq_cst(ptr %ptr, i8 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_aligned_seq_cst:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value seq_cst, align 1
     ret i8 %r
 }
@@ -87,8 +87,8 @@ define dso_local i16 @atomicrmw_xchg_i16_aligned_monotonic(ptr %ptr, i16 %value)
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i16_aligned_monotonic:
-; -O1:    ldxrh w8, [x0]
-; -O1:    stxrh w9, w1, [x0]
+; -O1:    ldxrh w0, [x8]
+; -O1:    stxrh w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i16 %value monotonic, align 2
     ret i16 %r
 }
@@ -102,8 +102,8 @@ define dso_local i16 @atomicrmw_xchg_i16_aligned_acquire(ptr %ptr, i16 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i16_aligned_acquire:
-; -O1:    ldaxrh w8, [x0]
-; -O1:    stxrh w9, w1, [x0]
+; -O1:    ldaxrh w0, [x8]
+; -O1:    stxrh w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i16 %value acquire, align 2
     ret i16 %r
 }
@@ -117,8 +117,8 @@ define dso_local i16 @atomicrmw_xchg_i16_aligned_release(ptr %ptr, i16 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i16_aligned_release:
-; -O1:    ldxrh w8, [x0]
-; -O1:    stlxrh w9, w1, [x0]
+; -O1:    ldxrh w0, [x8]
+; -O1:    stlxrh w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i16 %value release, align 2
     ret i16 %r
 }
@@ -132,8 +132,8 @@ define dso_local i16 @atomicrmw_xchg_i16_aligned_acq_rel(ptr %ptr, i16 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i16_aligned_acq_rel:
-; -O1:    ldaxrh w8, [x0]
-; -O1:    stlxrh w9, w1, [x0]
+; -O1:    ldaxrh w0, [x8]
+; -O1:    stlxrh w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i16 %value acq_rel, align 2
     ret i16 %r
 }
@@ -147,8 +147,8 @@ define dso_local i16 @atomicrmw_xchg_i16_aligned_seq_cst(ptr %ptr, i16 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i16_aligned_seq_cst:
-; -O1:    ldaxrh w8, [x0]
-; -O1:    stlxrh w9, w1, [x0]
+; -O1:    ldaxrh w0, [x8]
+; -O1:    stlxrh w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i16 %value seq_cst, align 2
     ret i16 %r
 }
@@ -397,8 +397,8 @@ define dso_local i8 @atomicrmw_xchg_i8_unaligned_monotonic(ptr %ptr, i8 %value)
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_unaligned_monotonic:
-; -O1:    ldxrb w8, [x0]
-; -O1:    stxrb w9, w1, [x0]
+; -O1:    ldxrb w0, [x8]
+; -O1:    stxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value monotonic, align 1
     ret i8 %r
 }
@@ -412,8 +412,8 @@ define dso_local i8 @atomicrmw_xchg_i8_unaligned_acquire(ptr %ptr, i8 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_unaligned_acquire:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value acquire, align 1
     ret i8 %r
 }
@@ -427,8 +427,8 @@ define dso_local i8 @atomicrmw_xchg_i8_unaligned_release(ptr %ptr, i8 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_unaligned_release:
-; -O1:    ldxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value release, align 1
     ret i8 %r
 }
@@ -442,8 +442,8 @@ define dso_local i8 @atomicrmw_xchg_i8_unaligned_acq_rel(ptr %ptr, i8 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_unaligned_acq_rel:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value acq_rel, align 1
     ret i8 %r
 }
@@ -457,8 +457,8 @@ define dso_local i8 @atomicrmw_xchg_i8_unaligned_seq_cst(ptr %ptr, i8 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_unaligned_seq_cst:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value seq_cst, align 1
     ret i8 %r
 }
diff --git a/llvm/test/CodeGen/AArch64/Atomics/aarch64_be-atomicrmw-rcpc.ll b/llvm/test/CodeGen/AArch64/Atomics/aarch64_be-atomicrmw-rcpc.ll
index 1bead6d694c652..c09a2c593f0f38 100644
--- a/llvm/test/CodeGen/AArch64/Atomics/aarch64_be-atomicrmw-rcpc.ll
+++ b/llvm/test/CodeGen/AArch64/Atomics/aarch64_be-atomicrmw-rcpc.ll
@@ -12,8 +12,8 @@ define dso_local i8 @atomicrmw_xchg_i8_aligned_monotonic(ptr %ptr, i8 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_aligned_monotonic:
-; -O1:    ldxrb w8, [x0]
-; -O1:    stxrb w9, w1, [x0]
+; -O1:    ldxrb w0, [x8]
+; -O1:    stxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value monotonic, align 1
     ret i8 %r
 }
@@ -27,8 +27,8 @@ define dso_local i8 @atomicrmw_xchg_i8_aligned_acquire(ptr %ptr, i8 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_aligned_acquire:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value acquire, align 1
     ret i8 %r
 }
@@ -42,8 +42,8 @@ define dso_local i8 @atomicrmw_xchg_i8_aligned_release(ptr %ptr, i8 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_aligned_release:
-; -O1:    ldxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value release, align 1
     ret i8 %r
 }
@@ -57,8 +57,8 @@ define dso_local i8 @atomicrmw_xchg_i8_aligned_acq_rel(ptr %ptr, i8 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_aligned_acq_rel:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value acq_rel, align 1
     ret i8 %r
 }
@@ -72,8 +72,8 @@ define dso_local i8 @atomicrmw_xchg_i8_aligned_seq_cst(ptr %ptr, i8 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_aligned_seq_cst:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value seq_cst, align 1
     ret i8 %r
 }
@@ -87,8 +87,8 @@ define dso_local i16 @atomicrmw_xchg_i16_aligned_monotonic(ptr %ptr, i16 %value)
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i16_aligned_monotonic:
-; -O1:    ldxrh w8, [x0]
-; -O1:    stxrh w9, w1, [x0]
+; -O1:    ldxrh w0, [x8]
+; -O1:    stxrh w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i16 %value monotonic, align 2
     ret i16 %r
 }
@@ -102,8 +102,8 @@ define dso_local i16 @atomicrmw_xchg_i16_aligned_acquire(ptr %ptr, i16 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i16_aligned_acquire:
-; -O1:    ldaxrh w8, [x0]
-; -O1:    stxrh w9, w1, [x0]
+; -O1:    ldaxrh w0, [x8]
+; -O1:    stxrh w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i16 %value acquire, align 2
     ret i16 %r
 }
@@ -117,8 +117,8 @@ define dso_local i16 @atomicrmw_xchg_i16_aligned_release(ptr %ptr, i16 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i16_aligned_release:
-; -O1:    ldxrh w8, [x0]
-; -O1:    stlxrh w9, w1, [x0]
+; -O1:    ldxrh w0, [x8]
+; -O1:    stlxrh w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i16 %value release, align 2
     ret i16 %r
 }
@@ -132,8 +132,8 @@ define dso_local i16 @atomicrmw_xchg_i16_aligned_acq_rel(ptr %ptr, i16 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i16_aligned_acq_rel:
-; -O1:    ldaxrh w8, [x0]
-; -O1:    stlxrh w9, w1, [x0]
+; -O1:    ldaxrh w0, [x8]
+; -O1:    stlxrh w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i16 %value acq_rel, align 2
     ret i16 %r
 }
@@ -147,8 +147,8 @@ define dso_local i16 @atomicrmw_xchg_i16_aligned_seq_cst(ptr %ptr, i16 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i16_aligned_seq_cst:
-; -O1:    ldaxrh w8, [x0]
-; -O1:    stlxrh w9, w1, [x0]
+; -O1:    ldaxrh w0, [x8]
+; -O1:    stlxrh w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i16 %value seq_cst, align 2
     ret i16 %r
 }
@@ -397,8 +397,8 @@ define dso_local i8 @atomicrmw_xchg_i8_unaligned_monotonic(ptr %ptr, i8 %value)
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_unaligned_monotonic:
-; -O1:    ldxrb w8, [x0]
-; -O1:    stxrb w9, w1, [x0]
+; -O1:    ldxrb w0, [x8]
+; -O1:    stxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value monotonic, align 1
     ret i8 %r
 }
@@ -412,8 +412,8 @@ define dso_local i8 @atomicrmw_xchg_i8_unaligned_acquire(ptr %ptr, i8 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_unaligned_acquire:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value acquire, align 1
     ret i8 %r
 }
@@ -427,8 +427,8 @@ define dso_local i8 @atomicrmw_xchg_i8_unaligned_release(ptr %ptr, i8 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_unaligned_release:
-; -O1:    ldxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value release, align 1
     ret i8 %r
 }
@@ -442,8 +442,8 @@ define dso_local i8 @atomicrmw_xchg_i8_unaligned_acq_rel(ptr %ptr, i8 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_unaligned_acq_rel:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value acq_rel, align 1
     ret i8 %r
 }
@@ -457,8 +457,8 @@ define dso_local i8 @atomicrmw_xchg_i8_unaligned_seq_cst(ptr %ptr, i8 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_unaligned_seq_cst:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value seq_cst, align 1
     ret i8 %r
 }
diff --git a/llvm/test/CodeGen/AArch64/Atomics/aarch64_be-atomicrmw-rcpc3.ll b/llvm/test/CodeGen/AArch64/Atomics/aarch64_be-atomicrmw-rcpc3.ll
index 51d9766f6a8f92..30166b4e741163 100644
--- a/llvm/test/CodeGen/AArch64/Atomics/aarch64_be-atomicrmw-rcpc3.ll
+++ b/llvm/test/CodeGen/AArch64/Atomics/aarch64_be-atomicrmw-rcpc3.ll
@@ -12,8 +12,8 @@ define dso_local i8 @atomicrmw_xchg_i8_aligned_monotonic(ptr %ptr, i8 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_aligned_monotonic:
-; -O1:    ldxrb w8, [x0]
-; -O1:    stxrb w9, w1, [x0]
+; -O1:    ldxrb w0, [x8]
+; -O1:    stxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value monotonic, align 1
     ret i8 %r
 }
@@ -27,8 +27,8 @@ define dso_local i8 @atomicrmw_xchg_i8_aligned_acquire(ptr %ptr, i8 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_aligned_acquire:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value acquire, align 1
     ret i8 %r
 }
@@ -42,8 +42,8 @@ define dso_local i8 @atomicrmw_xchg_i8_aligned_release(ptr %ptr, i8 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_aligned_release:
-; -O1:    ldxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value release, align 1
     ret i8 %r
 }
@@ -57,8 +57,8 @@ define dso_local i8 @atomicrmw_xchg_i8_aligned_acq_rel(ptr %ptr, i8 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_aligned_acq_rel:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value acq_rel, align 1
     ret i8 %r
 }
@@ -72,8 +72,8 @@ define dso_local i8 @atomicrmw_xchg_i8_aligned_seq_cst(ptr %ptr, i8 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_aligned_seq_cst:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value seq_cst, align 1
     ret i8 %r
 }
@@ -87,8 +87,8 @@ define dso_local i16 @atomicrmw_xchg_i16_aligned_monotonic(ptr %ptr, i16 %value)
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i16_aligned_monotonic:
-; -O1:    ldxrh w8, [x0]
-; -O1:    stxrh w9, w1, [x0]
+; -O1:    ldxrh w0, [x8]
+; -O1:    stxrh w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i16 %value monotonic, align 2
     ret i16 %r
 }
@@ -102,8 +102,8 @@ define dso_local i16 @atomicrmw_xchg_i16_aligned_acquire(ptr %ptr, i16 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i16_aligned_acquire:
-; -O1:    ldaxrh w8, [x0]
-; -O1:    stxrh w9, w1, [x0]
+; -O1:    ldaxrh w0, [x8]
+; -O1:    stxrh w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i16 %value acquire, align 2
     ret i16 %r
 }
@@ -117,8 +117,8 @@ define dso_local i16 @atomicrmw_xchg_i16_aligned_release(ptr %ptr, i16 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i16_aligned_release:
-; -O1:    ldxrh w8, [x0]
-; -O1:    stlxrh w9, w1, [x0]
+; -O1:    ldxrh w0, [x8]
+; -O1:    stlxrh w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i16 %value release, align 2
     ret i16 %r
 }
@@ -132,8 +132,8 @@ define dso_local i16 @atomicrmw_xchg_i16_aligned_acq_rel(ptr %ptr, i16 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i16_aligned_acq_rel:
-; -O1:    ldaxrh w8, [x0]
-; -O1:    stlxrh w9, w1, [x0]
+; -O1:    ldaxrh w0, [x8]
+; -O1:    stlxrh w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i16 %value acq_rel, align 2
     ret i16 %r
 }
@@ -147,8 +147,8 @@ define dso_local i16 @atomicrmw_xchg_i16_aligned_seq_cst(ptr %ptr, i16 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i16_aligned_seq_cst:
-; -O1:    ldaxrh w8, [x0]
-; -O1:    stlxrh w9, w1, [x0]
+; -O1:    ldaxrh w0, [x8]
+; -O1:    stlxrh w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i16 %value seq_cst, align 2
     ret i16 %r
 }
@@ -397,8 +397,8 @@ define dso_local i8 @atomicrmw_xchg_i8_unaligned_monotonic(ptr %ptr, i8 %value)
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_unaligned_monotonic:
-; -O1:    ldxrb w8, [x0]
-; -O1:    stxrb w9, w1, [x0]
+; -O1:    ldxrb w0, [x8]
+; -O1:    stxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value monotonic, align 1
     ret i8 %r
 }
@@ -412,8 +412,8 @@ define dso_local i8 @atomicrmw_xchg_i8_unaligned_acquire(ptr %ptr, i8 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_unaligned_acquire:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value acquire, align 1
     ret i8 %r
 }
@@ -427,8 +427,8 @@ define dso_local i8 @atomicrmw_xchg_i8_unaligned_release(ptr %ptr, i8 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_unaligned_release:
-; -O1:    ldxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value release, align 1
     ret i8 %r
 }
@@ -442,8 +442,8 @@ define dso_local i8 @atomicrmw_xchg_i8_unaligned_acq_rel(ptr %ptr, i8 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_unaligned_acq_rel:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value acq_rel, align 1
     ret i8 %r
 }
@@ -457,8 +457,8 @@ define dso_local i8 @atomicrmw_xchg_i8_unaligned_seq_cst(ptr %ptr, i8 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_unaligned_seq_cst:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value seq_cst, align 1
     ret i8 %r
 }
diff --git a/llvm/test/CodeGen/AArch64/Atomics/aarch64_be-atomicrmw-v8a.ll b/llvm/test/CodeGen/AArch64/Atomics/aarch64_be-atomicrmw-v8a.ll
index a58e5a987bb4c9..e9c1b33dc3d217 100644
--- a/llvm/test/CodeGen/AArch64/Atomics/aarch64_be-atomicrmw-v8a.ll
+++ b/llvm/test/CodeGen/AArch64/Atomics/aarch64_be-atomicrmw-v8a.ll
@@ -12,8 +12,8 @@ define dso_local i8 @atomicrmw_xchg_i8_aligned_monotonic(ptr %ptr, i8 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_aligned_monotonic:
-; -O1:    ldxrb w8, [x0]
-; -O1:    stxrb w9, w1, [x0]
+; -O1:    ldxrb w0, [x8]
+; -O1:    stxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value monotonic, align 1
     ret i8 %r
 }
@@ -27,8 +27,8 @@ define dso_local i8 @atomicrmw_xchg_i8_aligned_acquire(ptr %ptr, i8 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_aligned_acquire:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value acquire, align 1
     ret i8 %r
 }
@@ -42,8 +42,8 @@ define dso_local i8 @atomicrmw_xchg_i8_aligned_release(ptr %ptr, i8 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_aligned_release:
-; -O1:    ldxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value release, align 1
     ret i8 %r
 }
@@ -57,8 +57,8 @@ define dso_local i8 @atomicrmw_xchg_i8_aligned_acq_rel(ptr %ptr, i8 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_aligned_acq_rel:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value acq_rel, align 1
     ret i8 %r
 }
@@ -72,8 +72,8 @@ define dso_local i8 @atomicrmw_xchg_i8_aligned_seq_cst(ptr %ptr, i8 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_aligned_seq_cst:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value seq_cst, align 1
     ret i8 %r
 }
@@ -87,8 +87,8 @@ define dso_local i16 @atomicrmw_xchg_i16_aligned_monotonic(ptr %ptr, i16 %value)
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i16_aligned_monotonic:
-; -O1:    ldxrh w8, [x0]
-; -O1:    stxrh w9, w1, [x0]
+; -O1:    ldxrh w0, [x8]
+; -O1:    stxrh w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i16 %value monotonic, align 2
     ret i16 %r
 }
@@ -102,8 +102,8 @@ define dso_local i16 @atomicrmw_xchg_i16_aligned_acquire(ptr %ptr, i16 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i16_aligned_acquire:
-; -O1:    ldaxrh w8, [x0]
-; -O1:    stxrh w9, w1, [x0]
+; -O1:    ldaxrh w0, [x8]
+; -O1:    stxrh w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i16 %value acquire, align 2
     ret i16 %r
 }
@@ -117,8 +117,8 @@ define dso_local i16 @atomicrmw_xchg_i16_aligned_release(ptr %ptr, i16 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i16_aligned_release:
-; -O1:    ldxrh w8, [x0]
-; -O1:    stlxrh w9, w1, [x0]
+; -O1:    ldxrh w0, [x8]
+; -O1:    stlxrh w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i16 %value release, align 2
     ret i16 %r
 }
@@ -132,8 +132,8 @@ define dso_local i16 @atomicrmw_xchg_i16_aligned_acq_rel(ptr %ptr, i16 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i16_aligned_acq_rel:
-; -O1:    ldaxrh w8, [x0]
-; -O1:    stlxrh w9, w1, [x0]
+; -O1:    ldaxrh w0, [x8]
+; -O1:    stlxrh w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i16 %value acq_rel, align 2
     ret i16 %r
 }
@@ -147,8 +147,8 @@ define dso_local i16 @atomicrmw_xchg_i16_aligned_seq_cst(ptr %ptr, i16 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i16_aligned_seq_cst:
-; -O1:    ldaxrh w8, [x0]
-; -O1:    stlxrh w9, w1, [x0]
+; -O1:    ldaxrh w0, [x8]
+; -O1:    stlxrh w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i16 %value seq_cst, align 2
     ret i16 %r
 }
@@ -397,8 +397,8 @@ define dso_local i8 @atomicrmw_xchg_i8_unaligned_monotonic(ptr %ptr, i8 %value)
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_unaligned_monotonic:
-; -O1:    ldxrb w8, [x0]
-; -O1:    stxrb w9, w1, [x0]
+; -O1:    ldxrb w0, [x8]
+; -O1:    stxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value monotonic, align 1
     ret i8 %r
 }
@@ -412,8 +412,8 @@ define dso_local i8 @atomicrmw_xchg_i8_unaligned_acquire(ptr %ptr, i8 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_unaligned_acquire:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value acquire, align 1
     ret i8 %r
 }
@@ -427,8 +427,8 @@ define dso_local i8 @atomicrmw_xchg_i8_unaligned_release(ptr %ptr, i8 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_unaligned_release:
-; -O1:    ldxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value release, align 1
     ret i8 %r
 }
@@ -442,8 +442,8 @@ define dso_local i8 @atomicrmw_xchg_i8_unaligned_acq_rel(ptr %ptr, i8 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_unaligned_acq_rel:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value acq_rel, align 1
     ret i8 %r
 }
@@ -457,8 +457,8 @@ define dso_local i8 @atomicrmw_xchg_i8_unaligned_seq_cst(ptr %ptr, i8 %value) {
 ; -O0:    subs w9, w9, #1
 ;
 ; -O1-LABEL: atomicrmw_xchg_i8_unaligned_seq_cst:
-; -O1:    ldaxrb w8, [x0]
-; -O1:    stlxrb w9, w1, [x0]
+; -O1:    ldaxrb w0, [x8]
+; -O1:    stlxrb w9, w1, [x8]
     %r = atomicrmw xchg ptr %ptr, i8 %value seq_cst, align 1
     ret i8 %r
 }
diff --git a/llvm/test/CodeGen/AArch64/GlobalISel/arm64-atomic.ll b/llvm/test/CodeGen/AArch64/GlobalISel/arm64-atomic.ll
index de3f323891a36a..5bc041aef88bad 100644
--- a/llvm/test/CodeGen/AArch64/GlobalISel/arm64-atomic.ll
+++ b/llvm/test/CodeGen/AArch64/GlobalISel/arm64-atomic.ll
@@ -1938,7 +1938,6 @@ define i8 @atomicrmw_add_i8(ptr %ptr, i8 %rhs) {
 define i8 @atomicrmw_xchg_i8(ptr %ptr, i8 %rhs) {
 ; CHECK-NOLSE-O1-LABEL: atomicrmw_xchg_i8:
 ; CHECK-NOLSE-O1:       ; %bb.0:
-; CHECK-NOLSE-O1-NEXT:    ; kill: def $w1 killed $w1 def $x1
 ; CHECK-NOLSE-O1-NEXT:  LBB28_1: ; %atomicrmw.start
 ; CHECK-NOLSE-O1-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; CHECK-NOLSE-O1-NEXT:    ldxrb w8, [x0]
@@ -2993,7 +2992,6 @@ define i16 @atomicrmw_add_i16(ptr %ptr, i16 %rhs) {
 define i16 @atomicrmw_xchg_i16(ptr %ptr, i16 %rhs) {
 ; CHECK-NOLSE-O1-LABEL: atomicrmw_xchg_i16:
 ; CHECK-NOLSE-O1:       ; %bb.0:
-; CHECK-NOLSE-O1-NEXT:    ; kill: def $w1 killed $w1 def $x1
 ; CHECK-NOLSE-O1-NEXT:  LBB38_1: ; %atomicrmw.start
 ; CHECK-NOLSE-O1-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; CHECK-NOLSE-O1-NEXT:    ldxrh w8, [x0]
@@ -5996,7 +5994,6 @@ define { i8, i1 } @cmpxchg_i8(ptr %ptr, i8 %desired, i8 %new) {
 ; CHECK-NOLSE-O1-LABEL: cmpxchg_i8:
 ; CHECK-NOLSE-O1:       ; %bb.0:
 ; CHECK-NOLSE-O1-NEXT:    mov x8, x0
-; CHECK-NOLSE-O1-NEXT:    ; kill: def $w2 killed $w2 def $x2
 ; CHECK-NOLSE-O1-NEXT:  LBB67_1: ; %cmpxchg.start
 ; CHECK-NOLSE-O1-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; CHECK-NOLSE-O1-NEXT:    ldxrb w0, [x8]
@@ -6103,7 +6100,6 @@ define { i16, i1 } @cmpxchg_i16(ptr %ptr, i16 %desired, i16 %new) {
 ; CHECK-NOLSE-O1-LABEL: cmpxchg_i16:
 ; CHECK-NOLSE-O1:       ; %bb.0:
 ; CHECK-NOLSE-O1-NEXT:    mov x8, x0
-; CHECK-NOLSE-O1-NEXT:    ; kill: def $w2 killed $w2 def $x2
 ; CHECK-NOLSE-O1-NEXT:  LBB68_1: ; %cmpxchg.start
 ; CHECK-NOLSE-O1-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; CHECK-NOLSE-O1-NEXT:    ldxrh w0, [x8]
diff --git a/llvm/test/CodeGen/AArch64/GlobalISel/arm64-pcsections.ll b/llvm/test/CodeGen/AArch64/GlobalISel/arm64-pcsections.ll
index c6819ff39ed33e..2779e89c373fc6 100644
--- a/llvm/test/CodeGen/AArch64/GlobalISel/arm64-pcsections.ll
+++ b/llvm/test/CodeGen/AArch64/GlobalISel/arm64-pcsections.ll
@@ -727,8 +727,8 @@ define i8 @atomicrmw_add_i8(ptr %ptr, i8 %rhs) {
   ; CHECK-NEXT:   liveins: $w1, $x0
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   renamable $w8 = LDAXRB renamable $x0, implicit-def $x8, pcsections !0 :: (volatile load (s8) from %ir.ptr)
-  ; CHECK-NEXT:   $w9 = ADDWrs renamable $w8, renamable $w1, 0, implicit-def $x9, pcsections !0
-  ; CHECK-NEXT:   early-clobber renamable $w10 = STLXRB renamable $w9, renamable $x0, implicit killed $x9, pcsections !0 :: (volatile store (s8) into %ir.ptr)
+  ; CHECK-NEXT:   $w9 = ADDWrs renamable $w8, renamable $w1, 0, pcsections !0
+  ; CHECK-NEXT:   early-clobber renamable $w10 = STLXRB killed renamable $w9, renamable $x0, pcsections !0 :: (volatile store (s8) into %ir.ptr)
   ; CHECK-NEXT:   CBNZW killed renamable $w10, %bb.1, pcsections !0
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT: bb.2.atomicrmw.end:
@@ -746,11 +746,9 @@ define i8 @atomicrmw_xchg_i8(ptr %ptr, i8 %rhs) {
   ; CHECK-NEXT:   successors: %bb.1(0x80000000)
   ; CHECK-NEXT:   liveins: $w1, $x0
   ; CHECK-NEXT: {{  $}}
-  ; CHECK-NEXT:   renamable $w1 = KILL $w1, implicit-def $x1
-  ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT: bb.1.atomicrmw.start:
   ; CHECK-NEXT:   successors: %bb.1(0x7c000000), %bb.2(0x04000000)
-  ; CHECK-NEXT:   liveins: $x0, $x1
+  ; CHECK-NEXT:   liveins: $w1, $x0
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   renamable $w8 = LDXRB renamable $x0, implicit-def $x8, pcsections !0 :: (volatile load (s8) from %ir.ptr)
   ; CHECK-NEXT:   early-clobber renamable $w9 = STXRB renamable $w1, renamable $x0, pcsections !0 :: (volatile store (s8) into %ir.ptr)
@@ -776,8 +774,8 @@ define i8 @atomicrmw_sub_i8(ptr %ptr, i8 %rhs) {
   ; CHECK-NEXT:   liveins: $w1, $x0
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   renamable $w8 = LDAXRB renamable $x0, implicit-def $x8, pcsections !0 :: (volatile load (s8) from %ir.ptr)
-  ; CHECK-NEXT:   $w9 = SUBWrs renamable $w8, renamable $w1, 0, implicit-def $x9, pcsections !0
-  ; CHECK-NEXT:   early-clobber renamable $w10 = STXRB renamable $w9, renamable $x0, implicit killed $x9, pcsections !0 :: (volatile store (s8) into %ir.ptr)
+  ; CHECK-NEXT:   $w9 = SUBWrs renamable $w8, renamable $w1, 0, pcsections !0
+  ; CHECK-NEXT:   early-clobber renamable $w10 = STXRB killed renamable $w9, renamable $x0, pcsections !0 :: (volatile store (s8) into %ir.ptr)
   ; CHECK-NEXT:   CBNZW killed renamable $w10, %bb.1, pcsections !0
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT: bb.2.atomicrmw.end:
@@ -800,8 +798,8 @@ define i8 @atomicrmw_and_i8(ptr %ptr, i8 %rhs) {
   ; CHECK-NEXT:   liveins: $w1, $x0
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   renamable $w8 = LDXRB renamable $x0, implicit-def $x8, pcsections !0 :: (volatile load (s8) from %ir.ptr)
-  ; CHECK-NEXT:   $w9 = ANDWrs renamable $w8, renamable $w1, 0, implicit-def $x9, pcsections !0
-  ; CHECK-NEXT:   early-clobber renamable $w10 = STLXRB renamable $w9, renamable $x0, implicit killed $x9, pcsections !0 :: (volatile store (s8) into %ir.ptr)
+  ; CHECK-NEXT:   $w9 = ANDWrs renamable $w8, renamable $w1, 0, pcsections !0
+  ; CHECK-NEXT:   early-clobber renamable $w10 = STLXRB killed renamable $w9, renamable $x0, pcsections !0 :: (volatile store (s8) into %ir.ptr)
   ; CHECK-NEXT:   CBNZW killed renamable $w10, %bb.1, pcsections !0
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT: bb.2.atomicrmw.end:
@@ -824,8 +822,8 @@ define i8 @atomicrmw_or_i8(ptr %ptr, i8 %rhs) {
   ; CHECK-NEXT:   liveins: $w1, $x0
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   renamable $w8 = LDAXRB renamable $x0, implicit-def $x8, pcsections !0 :: (volatile load (s8) from %ir.ptr)
-  ; CHECK-NEXT:   $w9 = ORRWrs renamable $w8, renamable $w1, 0, implicit-def $x9, pcsections !0
-  ; CHECK-NEXT:   early-clobber renamable $w10 = STLXRB renamable $w9, renamable $x0, implicit killed $x9, pcsections !0 :: (volatile store (s8) into %ir.ptr)
+  ; CHECK-NEXT:   $w9 = ORRWrs renamable $w8, renamable $w1, 0, pcsections !0
+  ; CHECK-NEXT:   early-clobber renamable $w10 = STLXRB killed renamable $w9, renamable $x0, pcsections !0 :: (volatile store (s8) into %ir.ptr)
   ; CHECK-NEXT:   CBNZW killed renamable $w10, %bb.1, pcsections !0
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT: bb.2.atomicrmw.end:
@@ -848,8 +846,8 @@ define i8 @atomicrmw_xor_i8(ptr %ptr, i8 %rhs) {
   ; CHECK-NEXT:   liveins: $w1, $x0
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   renamable $w8 = LDXRB renamable $x0, implicit-def $x8, pcsections !0 :: (volatile load (s8) from %ir.ptr)
-  ; CHECK-NEXT:   $w9 = EORWrs renamable $w8, renamable $w1, 0, implicit-def $x9, pcsections !0
-  ; CHECK-NEXT:   early-clobber renamable $w10 = STXRB renamable $w9, renamable $x0, implicit killed $x9, pcsections !0 :: (volatile store (s8) into %ir.ptr)
+  ; CHECK-NEXT:   $w9 = EORWrs renamable $w8, renamable $w1, 0, pcsections !0
+  ; CHECK-NEXT:   early-clobber renamable $w10 = STXRB killed renamable $w9, renamable $x0, pcsections !0 :: (volatile store (s8) into %ir.ptr)
   ; CHECK-NEXT:   CBNZW killed renamable $w10, %bb.1, pcsections !0
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT: bb.2.atomicrmw.end:
@@ -874,8 +872,8 @@ define i8 @atomicrmw_min_i8(ptr %ptr, i8 %rhs) {
   ; CHECK-NEXT:   renamable $w8 = LDAXRB renamable $x0, implicit-def $x8, pcsections !0 :: (volatile load (s8) from %ir.ptr)
   ; CHECK-NEXT:   renamable $w9 = SBFMWri renamable $w8, 0, 7, pcsections !0
   ; CHECK-NEXT:   dead $wzr = SUBSWrx killed renamable $w9, renamable $w1, 32, implicit-def $nzcv, pcsections !0
-  ; CHECK-NEXT:   renamable $w9 = CSELWr renamable $w8, renamable $w1, 11, implicit killed $nzcv, implicit-def $x9, pcsections !0
-  ; CHECK-NEXT:   early-clobber renamable $w10 = STXRB renamable $w9, renamable $x0, implicit killed $x9, pcsections !0 :: (volatile store (s8) into %ir.ptr)
+  ; CHECK-NEXT:   renamable $w9 = CSELWr renamable $w8, renamable $w1, 11, implicit killed $nzcv, pcsections !0
+  ; CHECK-NEXT:   early-clobber renamable $w10 = STXRB killed renamable $w9, renamable $x0, pcsections !0 :: (volatile store (s8) into %ir.ptr)
   ; CHECK-NEXT:   CBNZW killed renamable $w10, %bb.1, pcsections !0
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT: bb.2.atomicrmw.end:
@@ -900,8 +898,8 @@ define i8 @atomicrmw_max_i8(ptr %ptr, i8 %rhs) {
   ; CHECK-NEXT:   renamable $w8 = LDXRB renamable $x0, implicit-def $x8, pcsections !0 :: (volatile load (s8) from %ir.ptr)
   ; CHECK-NEXT:   renamable $w9 = SBFMWri renamable $w8, 0, 7, pcsections !0
   ; CHECK-NEXT:   dead $wzr = SUBSWrx killed renamable $w9, renamable $w1, 32, implicit-def $nzcv, pcsections !0
-  ; CHECK-NEXT:   renamable $w9 = CSELWr renamable $w8, renamable $w1, 12, implicit killed $nzcv, implicit-def $x9, pcsections !0
-  ; CHECK-NEXT:   early-clobber renamable $w10 = STLXRB renamable $w9, renamable $x0, implicit killed $x9, pcsections !0 :: (volatile store (s8) into %ir.ptr)
+  ; CHECK-NEXT:   renamable $w9 = CSELWr renamable $w8, renamable $w1, 12, implicit killed $nzcv, pcsections !0
+  ; CHECK-NEXT:   early-clobber renamable $w10 = STLXRB killed renamable $w9, renamable $x0, pcsections !0 :: (volatile store (s8) into %ir.ptr)
   ; CHECK-NEXT:   CBNZW killed renamable $w10, %bb.1, pcsections !0
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT: bb.2.atomicrmw.end:
@@ -980,8 +978,8 @@ define i16 @atomicrmw_add_i16(ptr %ptr, i16 %rhs) {
   ; CHECK-NEXT:   liveins: $w1, $x0
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   renamable $w8 = LDAXRH renamable $x0, implicit-def $x8, pcsections !0 :: (volatile load (s16) from %ir.ptr)
-  ; CHECK-NEXT:   $w9 = ADDWrs renamable $w8, renamable $w1, 0, implicit-def $x9, pcsections !0
-  ; CHECK-NEXT:   early-clobber renamable $w10 = STLXRH renamable $w9, renamable $x0, implicit killed $x9, pcsections !0 :: (volatile store (s16) into %ir.ptr)
+  ; CHECK-NEXT:   $w9 = ADDWrs renamable $w8, renamable $w1, 0, pcsections !0
+  ; CHECK-NEXT:   early-clobber renamable $w10 = STLXRH killed renamable $w9, renamable $x0, pcsections !0 :: (volatile store (s16) into %ir.ptr)
   ; CHECK-NEXT:   CBNZW killed renamable $w10, %bb.1, pcsections !0
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT: bb.2.atomicrmw.end:
@@ -999,11 +997,9 @@ define i16 @atomicrmw_xchg_i16(ptr %ptr, i16 %rhs) {
   ; CHECK-NEXT:   successors: %bb.1(0x80000000)
   ; CHECK-NEXT:   liveins: $w1, $x0
   ; CHECK-NEXT: {{  $}}
-  ; CHECK-NEXT:   renamable $w1 = KILL $w1, implicit-def $x1
-  ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT: bb.1.atomicrmw.start:
   ; CHECK-NEXT:   successors: %bb.1(0x7c000000), %bb.2(0x04000000)
-  ; CHECK-NEXT:   liveins: $x0, $x1
+  ; CHECK-NEXT:   liveins: $w1, $x0
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   renamable $w8 = LDXRH renamable $x0, implicit-def $x8, pcsections !0 :: (volatile load (s16) from %ir.ptr)
   ; CHECK-NEXT:   early-clobber renamable $w9 = STXRH renamable $w1, renamable $x0, pcsections !0 :: (volatile store (s16) into %ir.ptr)
@@ -1029,8 +1025,8 @@ define i16 @atomicrmw_sub_i16(ptr %ptr, i16 %rhs) {
   ; CHECK-NEXT:   liveins: $w1, $x0
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   renamable $w8 = LDAXRH renamable $x0, implicit-def $x8, pcsections !0 :: (volatile load (s16) from %ir.ptr)
-  ; CHECK-NEXT:   $w9 = SUBWrs renamable $w8, renamable $w1, 0, implicit-def $x9, pcsections !0
-  ; CHECK-NEXT:   early-clobber renamable $w10 = STXRH renamable $w9, renamable $x0, implicit killed $x9, pcsections !0 :: (volatile store (s16) into %ir.ptr)
+  ; CHECK-NEXT:   $w9 = SUBWrs renamable $w8, renamable $w1, 0, pcsections !0
+  ; CHECK-NEXT:   early-clobber renamable $w10 = STXRH killed renamable $w9, renamable $x0, pcsections !0 :: (volatile store (s16) into %ir.ptr)
   ; CHECK-NEXT:   CBNZW killed renamable $w10, %bb.1, pcsections !0
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT: bb.2.atomicrmw.end:
@@ -1053,8 +1049,8 @@ define i16 @atomicrmw_and_i16(ptr %ptr, i16 %rhs) {
   ; CHECK-NEXT:   liveins: $w1, $x0
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   renamable $w8 = LDXRH renamable $x0, implicit-def $x8, pcsections !0 :: (volatile load (s16) from %ir.ptr)
-  ; CHECK-NEXT:   $w9 = ANDWrs renamable $w8, renamable $w1, 0, implicit-def $x9, pcsections !0
-  ; CHECK-NEXT:   early-clobber renamable $w10 = STLXRH renamable $w9, renamable $x0, implicit killed $x9, pcsections !0 :: (volatile store (s16) into %ir.ptr)
+  ; CHECK-NEXT:   $w9 = ANDWrs renamable $w8, renamable $w1, 0, pcsections !0
+  ; CHECK-NEXT:   early-clobber renamable $w10 = STLXRH killed renamable $w9, renamable $x0, pcsections !0 :: (volatile store (s16) into %ir.ptr)
   ; CHECK-NEXT:   CBNZW killed renamable $w10, %bb.1, pcsections !0
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT: bb.2.atomicrmw.end:
@@ -1077,8 +1073,8 @@ define i16 @atomicrmw_or_i16(ptr %ptr, i16 %rhs) {
   ; CHECK-NEXT:   liveins: $w1, $x0
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   renamable $w8 = LDAXRH renamable $x0, implicit-def $x8, pcsections !0 :: (volatile load (s16) from %ir.ptr)
-  ; CHECK-NEXT:   $w9 = ORRWrs renamable $w8, renamable $w1, 0, implicit-def $x9, pcsections !0
-  ; CHECK-NEXT:   early-clobber renamable $w10 = STLXRH renamable $w9, renamable $x0, implicit killed $x9, pcsections !0 :: (volatile store (s16) into %ir.ptr)
+  ; CHECK-NEXT:   $w9 = ORRWrs renamable $w8, renamable $w1, 0, pcsections !0
+  ; CHECK-NEXT:   early-clobber renamable $w10 = STLXRH killed renamable $w9, renamable $x0, pcsections !0 :: (volatile store (s16) into %ir.ptr)
   ; CHECK-NEXT:   CBNZW killed renamable $w10, %bb.1, pcsections !0
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT: bb.2.atomicrmw.end:
@@ -1101,8 +1097,8 @@ define i16 @atomicrmw_xor_i16(ptr %ptr, i16 %rhs) {
   ; CHECK-NEXT:   liveins: $w1, $x0
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   renamable $w8 = LDXRH renamable $x0, implicit-def $x8, pcsections !0 :: (volatile load (s16) from %ir.ptr)
-  ; CHECK-NEXT:   $w9 = EORWrs renamable $w8, renamable $w1, 0, implicit-def $x9, pcsections !0
-  ; CHECK-NEXT:   early-clobber renamable $w10 = STXRH renamable $w9, renamable $x0, implicit killed $x9, pcsections !0 :: (volatile store (s16) into %ir.ptr)
+  ; CHECK-NEXT:   $w9 = EORWrs renamable $w8, renamable $w1, 0, pcsections !0
+  ; CHECK-NEXT:   early-clobber renamable $w10 = STXRH killed renamable $w9, renamable $x0, pcsections !0 :: (volatile store (s16) into %ir.ptr)
   ; CHECK-NEXT:   CBNZW killed renamable $w10, %bb.1, pcsections !0
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT: bb.2.atomicrmw.end:
@@ -1127,8 +1123,8 @@ define i16 @atomicrmw_min_i16(ptr %ptr, i16 %rhs) {
   ; CHECK-NEXT:   renamable $w8 = LDAXRH renamable $x0, implicit-def $x8, pcsections !0 :: (volatile load (s16) from %ir.ptr)
   ; CHECK-NEXT:   renamable $w9 = SBFMWri renamable $w8, 0, 15, pcsections !0
   ; CHECK-NEXT:   dead $wzr = SUBSWrx killed renamable $w9, renamable $w1, 40, implicit-def $nzcv, pcsections !0
-  ; CHECK-NEXT:   renamable $w9 = CSELWr renamable $w8, renamable $w1, 11, implicit killed $nzcv, implicit-def $x9, pcsections !0
-  ; CHECK-NEXT:   early-clobber renamable $w10 = STXRH renamable $w9, renamable $x0, implicit killed $x9, pcsections !0 :: (volatile store (s16) into %ir.ptr)
+  ; CHECK-NEXT:   renamable $w9 = CSELWr renamable $w8, renamable $w1, 11, implicit killed $nzcv, pcsections !0
+  ; CHECK-NEXT:   early-clobber renamable $w10 = STXRH killed renamable $w9, renamable $x0, pcsections !0 :: (volatile store (s16) into %ir.ptr)
   ; CHECK-NEXT:   CBNZW killed renamable $w10, %bb.1, pcsections !0
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT: bb.2.atomicrmw.end:
@@ -1153,8 +1149,8 @@ define i16 @atomicrmw_max_i16(ptr %ptr, i16 %rhs) {
   ; CHECK-NEXT:   renamable $w8 = LDXRH renamable $x0, implicit-def $x8, pcsections !0 :: (volatile load (s16) from %ir.ptr)
   ; CHECK-NEXT:   renamable $w9 = SBFMWri renamable $w8, 0, 15, pcsections !0
   ; CHECK-NEXT:   dead $wzr = SUBSWrx killed renamable $w9, renamable $w1, 40, implicit-def $nzcv, pcsections !0
-  ; CHECK-NEXT:   renamable $w9 = CSELWr renamable $w8, renamable $w1, 12, implicit killed $nzcv, implicit-def $x9, pcsections !0
-  ; CHECK-NEXT:   early-clobber renamable $w10 = STLXRH renamable $w9, renamable $x0, implicit killed $x9, pcsections !0 :: (volatile store (s16) into %ir.ptr)
+  ; CHECK-NEXT:   renamable $w9 = CSELWr renamable $w8, renamable $w1, 12, implicit killed $nzcv, pcsections !0
+  ; CHECK-NEXT:   early-clobber renamable $w10 = STLXRH killed renamable $w9, renamable $x0, pcsections !0 :: (volatile store (s16) into %ir.ptr)
   ; CHECK-NEXT:   CBNZW killed renamable $w10, %bb.1, pcsections !0
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT: bb.2.atomicrmw.end:
@@ -1229,11 +1225,10 @@ define { i8, i1 } @cmpxchg_i8(ptr %ptr, i8 %desired, i8 %new) {
   ; CHECK-NEXT:   liveins: $w1, $w2, $x0
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   $x8 = ORRXrs $xzr, $x0, 0
-  ; CHECK-NEXT:   renamable $w2 = KILL $w2, implicit-def $x2
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT: bb.1.cmpxchg.start:
   ; CHECK-NEXT:   successors: %bb.2(0x7c000000), %bb.4(0x04000000)
-  ; CHECK-NEXT:   liveins: $w1, $x2, $x8
+  ; CHECK-NEXT:   liveins: $w1, $w2, $x8
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   renamable $w0 = LDXRB renamable $x8, implicit-def $x0, pcsections !0 :: (volatile load (s8) from %ir.ptr)
   ; CHECK-NEXT:   renamable $w9 = ANDWri renamable $w0, 7, pcsections !0
@@ -1242,7 +1237,7 @@ define { i8, i1 } @cmpxchg_i8(ptr %ptr, i8 %desired, i8 %new) {
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT: bb.2.cmpxchg.trystore:
   ; CHECK-NEXT:   successors: %bb.3(0x04000000), %bb.1(0x7c000000)
-  ; CHECK-NEXT:   liveins: $w1, $x0, $x2, $x8
+  ; CHECK-NEXT:   liveins: $w1, $w2, $x0, $x8
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   early-clobber renamable $w9 = STXRB renamable $w2, renamable $x8, pcsections !0 :: (volatile store (s8) into %ir.ptr)
   ; CHECK-NEXT:   CBNZW killed renamable $w9, %bb.1
@@ -1272,11 +1267,10 @@ define { i16, i1 } @cmpxchg_i16(ptr %ptr, i16 %desired, i16 %new) {
   ; CHECK-NEXT:   liveins: $w1, $w2, $x0
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   $x8 = ORRXrs $xzr, $x0, 0
-  ; CHECK-NEXT:   renamable $w2 = KILL $w2, implicit-def $x2
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT: bb.1.cmpxchg.start:
   ; CHECK-NEXT:   successors: %bb.2(0x7c000000), %bb.4(0x04000000)
-  ; CHECK-NEXT:   liveins: $w1, $x2, $x8
+  ; CHECK-NEXT:   liveins: $w1, $w2, $x8
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   renamable $w0 = LDXRH renamable $x8, implicit-def $x0, pcsections !0 :: (volatile load (s16) from %ir.ptr)
   ; CHECK-NEXT:   renamable $w9 = ANDWri renamable $w0, 15, pcsections !0
@@ -1285,7 +1279,7 @@ define { i16, i1 } @cmpxchg_i16(ptr %ptr, i16 %desired, i16 %new) {
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT: bb.2.cmpxchg.trystore:
   ; CHECK-NEXT:   successors: %bb.3(0x04000000), %bb.1(0x7c000000)
-  ; CHECK-NEXT:   liveins: $w1, $x0, $x2, $x8
+  ; CHECK-NEXT:   liveins: $w1, $w2, $x0, $x8
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   early-clobber renamable $w9 = STXRH renamable $w2, renamable $x8, pcsections !0 :: (volatile store (s16) into %ir.ptr)
   ; CHECK-NEXT:   CBNZW killed renamable $w9, %bb.1
diff --git a/llvm/test/CodeGen/AArch64/addsub_ext.ll b/llvm/test/CodeGen/AArch64/addsub_ext.ll
index 04a98bd5088803..6c3564375279e9 100644
--- a/llvm/test/CodeGen/AArch64/addsub_ext.ll
+++ b/llvm/test/CodeGen/AArch64/addsub_ext.ll
@@ -26,7 +26,6 @@ define i32 @add_z_shli8i32(i8 %v, i32 %lhs) minsize {
 define i64 @add_z_i8i64(i8 %v, i64 %lhs) minsize {
 ; CHECK-LABEL: add_z_i8i64:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    // kill: def $w0 killed $w0 def $x0
 ; CHECK-NEXT:    add x0, x1, w0, uxtb
 ; CHECK-NEXT:    ret
   %vz = zext i8 %v to i64
@@ -37,7 +36,6 @@ define i64 @add_z_i8i64(i8 %v, i64 %lhs) minsize {
 define i64 @add_z_shli8i64(i8 %v, i64 %lhs) minsize {
 ; CHECK-LABEL: add_z_shli8i64:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    // kill: def $w0 killed $w0 def $x0
 ; CHECK-NEXT:    add x0, x1, w0, uxtb #3
 ; CHECK-NEXT:    ret
   %vz = zext i8 %v to i64
@@ -70,7 +68,6 @@ define i32 @add_s_shli8i32(i8 %v, i32 %lhs) minsize {
 define i64 @add_s_i8i64(i8 %v, i64 %lhs) minsize {
 ; CHECK-LABEL: add_s_i8i64:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    // kill: def $w0 killed $w0 def $x0
 ; CHECK-NEXT:    add x0, x1, w0, sxtb
 ; CHECK-NEXT:    ret
   %vz = sext i8 %v to i64
@@ -81,7 +78,6 @@ define i64 @add_s_i8i64(i8 %v, i64 %lhs) minsize {
 define i64 @add_s_shli8i64(i8 %v, i64 %lhs) minsize {
 ; CHECK-LABEL: add_s_shli8i64:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    // kill: def $w0 killed $w0 def $x0
 ; CHECK-NEXT:    add x0, x1, w0, sxtb #3
 ; CHECK-NEXT:    ret
   %vz = sext i8 %v to i64
@@ -114,7 +110,6 @@ define i32 @add_z_shli16i32(i16 %v, i32 %lhs) minsize {
 define i64 @add_z_i16i64(i16 %v, i64 %lhs) minsize {
 ; CHECK-LABEL: add_z_i16i64:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    // kill: def $w0 killed $w0 def $x0
 ; CHECK-NEXT:    add x0, x1, w0, uxth
 ; CHECK-NEXT:    ret
   %vz = zext i16 %v to i64
@@ -125,7 +120,6 @@ define i64 @add_z_i16i64(i16 %v, i64 %lhs) minsize {
 define i64 @add_z_shli16i64(i16 %v, i64 %lhs) minsize {
 ; CHECK-LABEL: add_z_shli16i64:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    // kill: def $w0 killed $w0 def $x0
 ; CHECK-NEXT:    add x0, x1, w0, uxth #3
 ; CHECK-NEXT:    ret
   %vz = zext i16 %v to i64
@@ -179,7 +173,6 @@ define i32 @add_s_shli16i32(i16 %v, i32 %lhs) minsize {
 define i64 @add_s_i16i64(i16 %v, i64 %lhs) minsize {
 ; CHECK-LABEL: add_s_i16i64:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    // kill: def $w0 killed $w0 def $x0
 ; CHECK-NEXT:    add x0, x1, w0, sxth
 ; CHECK-NEXT:    ret
   %vz = sext i16 %v to i64
@@ -190,7 +183,6 @@ define i64 @add_s_i16i64(i16 %v, i64 %lhs) minsize {
 define i64 @add_s_shli16i64(i16 %v, i64 %lhs) minsize {
 ; CHECK-LABEL: add_s_shli16i64:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    // kill: def $w0 killed $w0 def $x0
 ; CHECK-NEXT:    add x0, x1, w0, sxth #3
 ; CHECK-NEXT:    ret
   %vz = sext i16 %v to i64
@@ -244,7 +236,6 @@ define i32 @sub_z_shli8i32(i8 %v, i32 %lhs) minsize {
 define i64 @sub_z_i8i64(i8 %v, i64 %lhs) minsize {
 ; CHECK-LABEL: sub_z_i8i64:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    // kill: def $w0 killed $w0 def $x0
 ; CHECK-NEXT:    sub x0, x1, w0, uxtb
 ; CHECK-NEXT:    ret
   %vz = zext i8 %v to i64
@@ -255,7 +246,6 @@ define i64 @sub_z_i8i64(i8 %v, i64 %lhs) minsize {
 define i64 @sub_z_shli8i64(i8 %v, i64 %lhs) minsize {
 ; CHECK-LABEL: sub_z_shli8i64:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    // kill: def $w0 killed $w0 def $x0
 ; CHECK-NEXT:    sub x0, x1, w0, uxtb #3
 ; CHECK-NEXT:    ret
   %vz = zext i8 %v to i64
@@ -288,7 +278,6 @@ define i32 @sub_s_shli8i32(i8 %v, i32 %lhs) minsize {
 define i64 @sub_s_i8i64(i8 %v, i64 %lhs) minsize {
 ; CHECK-LABEL: sub_s_i8i64:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    // kill: def $w0 killed $w0 def $x0
 ; CHECK-NEXT:    sub x0, x1, w0, sxtb
 ; CHECK-NEXT:    ret
   %vz = sext i8 %v to i64
@@ -299,7 +288,6 @@ define i64 @sub_s_i8i64(i8 %v, i64 %lhs) minsize {
 define i64 @sub_s_shli8i64(i8 %v, i64 %lhs) minsize {
 ; CHECK-LABEL: sub_s_shli8i64:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    // kill: def $w0 killed $w0 def $x0
 ; CHECK-NEXT:    sub x0, x1, w0, sxtb #3
 ; CHECK-NEXT:    ret
   %vz = sext i8 %v to i64
@@ -332,7 +320,6 @@ define i32 @sub_z_shli16i32(i16 %v, i32 %lhs) minsize {
 define i64 @sub_z_i16i64(i16 %v, i64 %lhs) minsize {
 ; CHECK-LABEL: sub_z_i16i64:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    // kill: def $w0 killed $w0 def $x0
 ; CHECK-NEXT:    sub x0, x1, w0, uxth
 ; CHECK-NEXT:    ret
   %vz = zext i16 %v to i64
@@ -343,7 +330,6 @@ define i64 @sub_z_i16i64(i16 %v, i64 %lhs) minsize {
 define i64 @sub_z_shli16i64(i16 %v, i64 %lhs) minsize {
 ; CHECK-LABEL: sub_z_shli16i64:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    // kill: def $w0 killed $w0 def $x0
 ; CHECK-NEXT:    sub x0, x1, w0, uxth #3
 ; CHECK-NEXT:    ret
   %vz = zext i16 %v to i64
@@ -397,7 +383,6 @@ define i32 @sub_s_shli16i32(i16 %v, i32 %lhs) minsize {
 define i64 @sub_s_i16i64(i16 %v, i64 %lhs) minsize {
 ; CHECK-LABEL: sub_s_i16i64:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    // kill: def $w0 killed $w0 def $x0
 ; CHECK-NEXT:    sub x0, x1, w0, sxth
 ; CHECK-NEXT:    ret
   %vz = sext i16 %v to i64
@@ -408,7 +393,6 @@ define i64 @sub_s_i16i64(i16 %v, i64 %lhs) minsize {
 define i64 @sub_s_shli16i64(i16 %v, i64 %lhs) minsize {
 ; CHECK-LABEL: sub_s_shli16i64:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    // kill: def $w0 killed $w0 def $x0
 ; CHECK-NEXT:    sub x0, x1, w0, sxth #3
 ; CHECK-NEXT:    ret
   %vz = sext i16 %v to i64
@@ -444,7 +428,7 @@ define i32 @cmp_s_i8i32(i8 %v, i32 %lhs) minsize {
 ; CHECK-NEXT:    cmp w1, w0, uxtb
 ; CHECK-NEXT:    b.ge .LBB40_2
 ; CHECK-NEXT:  // %bb.1: // %then
-; CHECK-NEXT:    mov w0, #1
+; CHECK-NEXT:    mov w0, #1 // =0x1
 ; CHECK-NEXT:    ret
 ; CHECK-NEXT:  .LBB40_2: // %end
 ; CHECK-NEXT:    mov w0, w1
@@ -461,11 +445,10 @@ end:
 define i64 @cmp_s_i8i64(i8 %v, i64 %lhs) minsize {
 ; CHECK-LABEL: cmp_s_i8i64:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    // kill: def $w0 killed $w0 def $x0
 ; CHECK-NEXT:    cmp x1, w0, sxtb
 ; CHECK-NEXT:    b.ge .LBB41_2
 ; CHECK-NEXT:  // %bb.1: // %then
-; CHECK-NEXT:    mov w0, #1
+; CHECK-NEXT:    mov w0, #1 // =0x1
 ; CHECK-NEXT:    ret
 ; CHECK-NEXT:  .LBB41_2: // %end
 ; CHECK-NEXT:    mov x0, x1
@@ -485,7 +468,7 @@ define i32 @cmp_s_i16i32(i16 %v, i32 %lhs) minsize {
 ; CHECK-NEXT:    cmp w1, w0, uxth
 ; CHECK-NEXT:    b.ge .LBB42_2
 ; CHECK-NEXT:  // %bb.1: // %then
-; CHECK-NEXT:    mov w0, #1
+; CHECK-NEXT:    mov w0, #1 // =0x1
 ; CHECK-NEXT:    ret
 ; CHECK-NEXT:  .LBB42_2: // %end
 ; CHECK-NEXT:    mov w0, w1
@@ -502,11 +485,10 @@ end:
 define i64 @cmp_s_i16i64(i16 %v, i64 %lhs) minsize {
 ; CHECK-LABEL: cmp_s_i16i64:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    // kill: def $w0 killed $w0 def $x0
 ; CHECK-NEXT:    cmp x1, w0, sxth
 ; CHECK-NEXT:    b.ge .LBB43_2
 ; CHECK-NEXT:  // %bb.1: // %then
-; CHECK-NEXT:    mov w0, #1
+; CHECK-NEXT:    mov w0, #1 // =0x1
 ; CHECK-NEXT:    ret
 ; CHECK-NEXT:  .LBB43_2: // %end
 ; CHECK-NEXT:    mov x0, x1
diff --git a/llvm/test/CodeGen/AArch64/and-mask-removal.ll b/llvm/test/CodeGen/AArch64/and-mask-removal.ll
index 09f00b3845f25f..5046c0571ad2bc 100644
--- a/llvm/test/CodeGen/AArch64/and-mask-removal.ll
+++ b/llvm/test/CodeGen/AArch64/and-mask-removal.ll
@@ -11,7 +11,6 @@ define void @new_position(i32 %pos) {
 ; CHECK-SD-LABEL: new_position:
 ; CHECK-SD:       ; %bb.0: ; %entry
 ; CHECK-SD-NEXT:    adrp x8, _board at GOTPAGE
-; CHECK-SD-NEXT:    ; kill: def $w0 killed $w0 def $x0
 ; CHECK-SD-NEXT:    ldr x8, [x8, _board at GOTPAGEOFF]
 ; CHECK-SD-NEXT:    ldrb w8, [x8, w0, sxtw]
 ; CHECK-SD-NEXT:    sub w8, w8, #1
diff --git a/llvm/test/CodeGen/AArch64/arm64-ldxr-stxr.ll b/llvm/test/CodeGen/AArch64/arm64-ldxr-stxr.ll
index d69d1b6eb4a2ab..ccd191f163b01c 100644
--- a/llvm/test/CodeGen/AArch64/arm64-ldxr-stxr.ll
+++ b/llvm/test/CodeGen/AArch64/arm64-ldxr-stxr.ll
@@ -127,7 +127,6 @@ declare i64 @llvm.aarch64.ldxr.p0(ptr) nounwind
 define dso_local i32 @test_store_i8(i32, i8 %val, ptr %addr) {
 ; CHECK-LABEL: test_store_i8:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    // kill: def $w1 killed $w1 def $x1
 ; CHECK-NEXT:    stxrb w0, w1, [x2]
 ; CHECK-NEXT:    ret
   %extval = zext i8 %val to i64
@@ -138,7 +137,6 @@ define dso_local i32 @test_store_i8(i32, i8 %val, ptr %addr) {
 define dso_local i32 @test_store_i16(i32, i16 %val, ptr %addr) {
 ; CHECK-LABEL: test_store_i16:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    // kill: def $w1 killed $w1 def $x1
 ; CHECK-NEXT:    stxrh w0, w1, [x2]
 ; CHECK-NEXT:    ret
   %extval = zext i16 %val to i64
@@ -299,7 +297,6 @@ declare i64 @llvm.aarch64.ldaxr.p0(ptr) nounwind
 define dso_local i32 @test_store_release_i8(i32, i8 %val, ptr %addr) {
 ; CHECK-LABEL: test_store_release_i8:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    // kill: def $w1 killed $w1 def $x1
 ; CHECK-NEXT:    stlxrb w0, w1, [x2]
 ; CHECK-NEXT:    ret
   %extval = zext i8 %val to i64
@@ -310,7 +307,6 @@ define dso_local i32 @test_store_release_i8(i32, i8 %val, ptr %addr) {
 define dso_local i32 @test_store_release_i16(i32, i16 %val, ptr %addr) {
 ; CHECK-LABEL: test_store_release_i16:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    // kill: def $w1 killed $w1 def $x1
 ; CHECK-NEXT:    stlxrh w0, w1, [x2]
 ; CHECK-NEXT:    ret
   %extval = zext i16 %val to i64
diff --git a/llvm/test/CodeGen/AArch64/arm64-vaddv.ll b/llvm/test/CodeGen/AArch64/arm64-vaddv.ll
index adfe28ea95893b..d60b6d70973083 100644
--- a/llvm/test/CodeGen/AArch64/arm64-vaddv.ll
+++ b/llvm/test/CodeGen/AArch64/arm64-vaddv.ll
@@ -267,7 +267,6 @@ define <1 x i64> @test_vaddv_u64_to_vec(<2 x i64> %a1) {
 ; CHECK-LABEL: test_vaddv_u64_to_vec:
 ; CHECK:       // %bb.0: // %entry
 ; CHECK-NEXT:    addp d0, v0.2d
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $q0
 ; CHECK-NEXT:    ret
 entry:
   %vaddv.i = tail call i64 @llvm.aarch64.neon.uaddv.i64.v2i64(<2 x i64> %a1)
diff --git a/llvm/test/CodeGen/AArch64/arm64_32-addrs.ll b/llvm/test/CodeGen/AArch64/arm64_32-addrs.ll
index ad073d96a14830..cc70dc446324da 100644
--- a/llvm/test/CodeGen/AArch64/arm64_32-addrs.ll
+++ b/llvm/test/CodeGen/AArch64/arm64_32-addrs.ll
@@ -43,7 +43,6 @@ define i8 @test_valid_wrap_optimizable2(ptr %base, i32 %offset) {
 ; CHECK-LABEL: test_valid_wrap_optimizable2:
 ; CHECK:       ; %bb.0:
 ; CHECK-NEXT:    mov w8, #-100 ; =0xffffff9c
-; CHECK-NEXT:    ; kill: def $w1 killed $w1 def $x1
 ; CHECK-NEXT:    ldrb w0, [x8, w1, sxtw]
 ; CHECK-NEXT:    ret
 
diff --git a/llvm/test/CodeGen/AArch64/atomic-ops-msvc.ll b/llvm/test/CodeGen/AArch64/atomic-ops-msvc.ll
index fdb14606d463b7..42cb3d4e9589de 100644
--- a/llvm/test/CodeGen/AArch64/atomic-ops-msvc.ll
+++ b/llvm/test/CodeGen/AArch64/atomic-ops-msvc.ll
@@ -374,7 +374,6 @@ define dso_local i64 @test_atomic_load_xor_i64(i64 %offset) nounwind {
 define dso_local i8 @test_atomic_load_xchg_i8(i8 %offset) nounwind {
 ; CHECK-LABEL: test_atomic_load_xchg_i8:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    // kill: def $w0 killed $w0 def $x0
 ; CHECK-NEXT:    adrp x9, var8
 ; CHECK-NEXT:    add x9, x9, :lo12:var8
 ; CHECK-NEXT:  .LBB20_1: // %atomicrmw.start
@@ -392,16 +391,16 @@ define dso_local i8 @test_atomic_load_xchg_i8(i8 %offset) nounwind {
 define dso_local i16 @test_atomic_load_xchg_i16(i16 %offset) nounwind {
 ; CHECK-LABEL: test_atomic_load_xchg_i16:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    // kill: def $w0 killed $w0 def $x0
+; CHECK-NEXT:    mov w8, w0
 ; CHECK-NEXT:    adrp x9, var16
 ; CHECK-NEXT:    add x9, x9, :lo12:var16
 ; CHECK-NEXT:  .LBB21_1: // %atomicrmw.start
 ; CHECK-NEXT:    // =>This Inner Loop Header: Depth=1
-; CHECK-NEXT:    ldaxrh w8, [x9]
-; CHECK-NEXT:    stlxrh w10, w0, [x9]
+; CHECK-NEXT:    ldaxrh w0, [x9]
+; CHECK-NEXT:    stlxrh w10, w8, [x9]
 ; CHECK-NEXT:    cbnz w10, .LBB21_1
 ; CHECK-NEXT:  // %bb.2: // %atomicrmw.end
-; CHECK-NEXT:    mov w0, w8
+; CHECK-NEXT:    // kill: def $w0 killed $w0 killed $x0
 ; CHECK-NEXT:    dmb ish
 ; CHECK-NEXT:    ret
    %old = atomicrmw xchg ptr @var16, i16 %offset seq_cst
@@ -763,7 +762,6 @@ define dso_local i64 @test_atomic_load_umax_i64(i64 %offset) nounwind {
 define dso_local i8 @test_atomic_cmpxchg_i8(i8 %wanted, i8 %new) nounwind {
 ; CHECK-LABEL: test_atomic_cmpxchg_i8:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    // kill: def $w1 killed $w1 def $x1
 ; CHECK-NEXT:    and w8, w0, #0xff
 ; CHECK-NEXT:    adrp x9, var8
 ; CHECK-NEXT:    add x9, x9, :lo12:var8
@@ -791,7 +789,6 @@ define dso_local i8 @test_atomic_cmpxchg_i8(i8 %wanted, i8 %new) nounwind {
 define dso_local i16 @test_atomic_cmpxchg_i16(i16 %wanted, i16 %new) nounwind {
 ; CHECK-LABEL: test_atomic_cmpxchg_i16:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    // kill: def $w1 killed $w1 def $x1
 ; CHECK-NEXT:    and w8, w0, #0xffff
 ; CHECK-NEXT:    adrp x9, var16
 ; CHECK-NEXT:    add x9, x9, :lo12:var16
diff --git a/llvm/test/CodeGen/AArch64/atomic-ops.ll b/llvm/test/CodeGen/AArch64/atomic-ops.ll
index 679065529090f0..d8ac89f76b321b 100644
--- a/llvm/test/CodeGen/AArch64/atomic-ops.ll
+++ b/llvm/test/CodeGen/AArch64/atomic-ops.ll
@@ -558,7 +558,6 @@ define dso_local i64 @test_atomic_load_xor_i64(i64 %offset) nounwind {
 define dso_local i8 @test_atomic_load_xchg_i8(i8 %offset) nounwind {
 ; INLINE_ATOMICS-LABEL: test_atomic_load_xchg_i8:
 ; INLINE_ATOMICS:       // %bb.0:
-; INLINE_ATOMICS-NEXT:    // kill: def $w0 killed $w0 def $x0
 ; INLINE_ATOMICS-NEXT:    adrp x9, var8
 ; INLINE_ATOMICS-NEXT:    add x9, x9, :lo12:var8
 ; INLINE_ATOMICS-NEXT:  .LBB20_1: // %atomicrmw.start
@@ -585,7 +584,6 @@ define dso_local i8 @test_atomic_load_xchg_i8(i8 %offset) nounwind {
 define dso_local i16 @test_atomic_load_xchg_i16(i16 %offset) nounwind {
 ; INLINE_ATOMICS-LABEL: test_atomic_load_xchg_i16:
 ; INLINE_ATOMICS:       // %bb.0:
-; INLINE_ATOMICS-NEXT:    // kill: def $w0 killed $w0 def $x0
 ; INLINE_ATOMICS-NEXT:    adrp x9, var16
 ; INLINE_ATOMICS-NEXT:    add x9, x9, :lo12:var16
 ; INLINE_ATOMICS-NEXT:  .LBB21_1: // %atomicrmw.start
@@ -978,7 +976,6 @@ define dso_local i64 @test_atomic_load_umax_i64(i64 %offset) nounwind {
 define dso_local i8 @test_atomic_cmpxchg_i8(i8 %wanted, i8 %new) nounwind {
 ; INLINE_ATOMICS-LABEL: test_atomic_cmpxchg_i8:
 ; INLINE_ATOMICS:       // %bb.0:
-; INLINE_ATOMICS-NEXT:    // kill: def $w1 killed $w1 def $x1
 ; INLINE_ATOMICS-NEXT:    and w8, w0, #0xff
 ; INLINE_ATOMICS-NEXT:    adrp x9, var8
 ; INLINE_ATOMICS-NEXT:    add x9, x9, :lo12:var8
@@ -1015,7 +1012,6 @@ define dso_local i8 @test_atomic_cmpxchg_i8(i8 %wanted, i8 %new) nounwind {
 define dso_local i16 @test_atomic_cmpxchg_i16(i16 %wanted, i16 %new) nounwind {
 ; INLINE_ATOMICS-LABEL: test_atomic_cmpxchg_i16:
 ; INLINE_ATOMICS:       // %bb.0:
-; INLINE_ATOMICS-NEXT:    // kill: def $w1 killed $w1 def $x1
 ; INLINE_ATOMICS-NEXT:    and w8, w0, #0xffff
 ; INLINE_ATOMICS-NEXT:    adrp x9, var16
 ; INLINE_ATOMICS-NEXT:    add x9, x9, :lo12:var16
diff --git a/llvm/test/CodeGen/AArch64/atomicrmw-fadd.ll b/llvm/test/CodeGen/AArch64/atomicrmw-fadd.ll
index fb40dfcbe101db..0c3a40d93d640d 100644
--- a/llvm/test/CodeGen/AArch64/atomicrmw-fadd.ll
+++ b/llvm/test/CodeGen/AArch64/atomicrmw-fadd.ll
@@ -476,40 +476,38 @@ define double @test_atomicrmw_fadd_f32_seq_cst_align8(ptr %ptr, double %value) #
 ; SOFTFP-NOLSE-LABEL: test_atomicrmw_fadd_f32_seq_cst_align8:
 ; SOFTFP-NOLSE:       // %bb.0:
 ; SOFTFP-NOLSE-NEXT:    stp x30, x21, [sp, #-32]! // 16-byte Folded Spill
-; SOFTFP-NOLSE-NEXT:    ldr x21, [x0]
 ; SOFTFP-NOLSE-NEXT:    stp x20, x19, [sp, #16] // 16-byte Folded Spill
 ; SOFTFP-NOLSE-NEXT:    mov x19, x0
+; SOFTFP-NOLSE-NEXT:    ldr x0, [x0]
 ; SOFTFP-NOLSE-NEXT:    mov x20, x1
 ; SOFTFP-NOLSE-NEXT:    b .LBB5_2
 ; SOFTFP-NOLSE-NEXT:  .LBB5_1: // %cmpxchg.nostore
 ; SOFTFP-NOLSE-NEXT:    // in Loop: Header=BB5_2 Depth=1
-; SOFTFP-NOLSE-NEXT:    mov w9, wzr
+; SOFTFP-NOLSE-NEXT:    mov w8, wzr
 ; SOFTFP-NOLSE-NEXT:    clrex
-; SOFTFP-NOLSE-NEXT:    mov x21, x8
-; SOFTFP-NOLSE-NEXT:    cbnz w9, .LBB5_6
+; SOFTFP-NOLSE-NEXT:    cbnz w8, .LBB5_6
 ; SOFTFP-NOLSE-NEXT:  .LBB5_2: // %atomicrmw.start
 ; SOFTFP-NOLSE-NEXT:    // =>This Loop Header: Depth=1
 ; SOFTFP-NOLSE-NEXT:    // Child Loop BB5_3 Depth 2
-; SOFTFP-NOLSE-NEXT:    mov x0, x21
 ; SOFTFP-NOLSE-NEXT:    mov x1, x20
+; SOFTFP-NOLSE-NEXT:    mov x21, x0
 ; SOFTFP-NOLSE-NEXT:    bl __adddf3
+; SOFTFP-NOLSE-NEXT:    mov x8, x0
 ; SOFTFP-NOLSE-NEXT:  .LBB5_3: // %cmpxchg.start
 ; SOFTFP-NOLSE-NEXT:    // Parent Loop BB5_2 Depth=1
 ; SOFTFP-NOLSE-NEXT:    // => This Inner Loop Header: Depth=2
-; SOFTFP-NOLSE-NEXT:    ldaxr x8, [x19]
-; SOFTFP-NOLSE-NEXT:    cmp x8, x21
+; SOFTFP-NOLSE-NEXT:    ldaxr x0, [x19]
+; SOFTFP-NOLSE-NEXT:    cmp x0, x21
 ; SOFTFP-NOLSE-NEXT:    b.ne .LBB5_1
 ; SOFTFP-NOLSE-NEXT:  // %bb.4: // %cmpxchg.trystore
 ; SOFTFP-NOLSE-NEXT:    // in Loop: Header=BB5_3 Depth=2
-; SOFTFP-NOLSE-NEXT:    stlxr w9, x0, [x19]
+; SOFTFP-NOLSE-NEXT:    stlxr w9, x8, [x19]
 ; SOFTFP-NOLSE-NEXT:    cbnz w9, .LBB5_3
 ; SOFTFP-NOLSE-NEXT:  // %bb.5: // in Loop: Header=BB5_2 Depth=1
-; SOFTFP-NOLSE-NEXT:    mov w9, #1 // =0x1
-; SOFTFP-NOLSE-NEXT:    mov x21, x8
-; SOFTFP-NOLSE-NEXT:    cbz w9, .LBB5_2
+; SOFTFP-NOLSE-NEXT:    mov w8, #1 // =0x1
+; SOFTFP-NOLSE-NEXT:    cbz w8, .LBB5_2
 ; SOFTFP-NOLSE-NEXT:  .LBB5_6: // %atomicrmw.end
 ; SOFTFP-NOLSE-NEXT:    ldp x20, x19, [sp, #16] // 16-byte Folded Reload
-; SOFTFP-NOLSE-NEXT:    mov x0, x21
 ; SOFTFP-NOLSE-NEXT:    ldp x30, x21, [sp], #32 // 16-byte Folded Reload
 ; SOFTFP-NOLSE-NEXT:    ret
   %res = atomicrmw fadd ptr %ptr, double %value seq_cst, align 8
diff --git a/llvm/test/CodeGen/AArch64/atomicrmw-fmax.ll b/llvm/test/CodeGen/AArch64/atomicrmw-fmax.ll
index 818dcf3a0b4876..24088998f36d10 100644
--- a/llvm/test/CodeGen/AArch64/atomicrmw-fmax.ll
+++ b/llvm/test/CodeGen/AArch64/atomicrmw-fmax.ll
@@ -478,40 +478,38 @@ define double @test_atomicrmw_fmax_f32_seq_cst_align8(ptr %ptr, double %value) #
 ; SOFTFP-NOLSE-LABEL: test_atomicrmw_fmax_f32_seq_cst_align8:
 ; SOFTFP-NOLSE:       // %bb.0:
 ; SOFTFP-NOLSE-NEXT:    stp x30, x21, [sp, #-32]! // 16-byte Folded Spill
-; SOFTFP-NOLSE-NEXT:    ldr x21, [x0]
 ; SOFTFP-NOLSE-NEXT:    stp x20, x19, [sp, #16] // 16-byte Folded Spill
 ; SOFTFP-NOLSE-NEXT:    mov x19, x0
+; SOFTFP-NOLSE-NEXT:    ldr x0, [x0]
 ; SOFTFP-NOLSE-NEXT:    mov x20, x1
 ; SOFTFP-NOLSE-NEXT:    b .LBB5_2
 ; SOFTFP-NOLSE-NEXT:  .LBB5_1: // %cmpxchg.nostore
 ; SOFTFP-NOLSE-NEXT:    // in Loop: Header=BB5_2 Depth=1
-; SOFTFP-NOLSE-NEXT:    mov w9, wzr
+; SOFTFP-NOLSE-NEXT:    mov w8, wzr
 ; SOFTFP-NOLSE-NEXT:    clrex
-; SOFTFP-NOLSE-NEXT:    mov x21, x8
-; SOFTFP-NOLSE-NEXT:    cbnz w9, .LBB5_6
+; SOFTFP-NOLSE-NEXT:    cbnz w8, .LBB5_6
 ; SOFTFP-NOLSE-NEXT:  .LBB5_2: // %atomicrmw.start
 ; SOFTFP-NOLSE-NEXT:    // =>This Loop Header: Depth=1
 ; SOFTFP-NOLSE-NEXT:    // Child Loop BB5_3 Depth 2
-; SOFTFP-NOLSE-NEXT:    mov x0, x21
 ; SOFTFP-NOLSE-NEXT:    mov x1, x20
+; SOFTFP-NOLSE-NEXT:    mov x21, x0
 ; SOFTFP-NOLSE-NEXT:    bl fmax
+; SOFTFP-NOLSE-NEXT:    mov x8, x0
 ; SOFTFP-NOLSE-NEXT:  .LBB5_3: // %cmpxchg.start
 ; SOFTFP-NOLSE-NEXT:    // Parent Loop BB5_2 Depth=1
 ; SOFTFP-NOLSE-NEXT:    // => This Inner Loop Header: Depth=2
-; SOFTFP-NOLSE-NEXT:    ldaxr x8, [x19]
-; SOFTFP-NOLSE-NEXT:    cmp x8, x21
+; SOFTFP-NOLSE-NEXT:    ldaxr x0, [x19]
+; SOFTFP-NOLSE-NEXT:    cmp x0, x21
 ; SOFTFP-NOLSE-NEXT:    b.ne .LBB5_1
 ; SOFTFP-NOLSE-NEXT:  // %bb.4: // %cmpxchg.trystore
 ; SOFTFP-NOLSE-NEXT:    // in Loop: Header=BB5_3 Depth=2
-; SOFTFP-NOLSE-NEXT:    stlxr w9, x0, [x19]
+; SOFTFP-NOLSE-NEXT:    stlxr w9, x8, [x19]
 ; SOFTFP-NOLSE-NEXT:    cbnz w9, .LBB5_3
 ; SOFTFP-NOLSE-NEXT:  // %bb.5: // in Loop: Header=BB5_2 Depth=1
-; SOFTFP-NOLSE-NEXT:    mov w9, #1 // =0x1
-; SOFTFP-NOLSE-NEXT:    mov x21, x8
-; SOFTFP-NOLSE-NEXT:    cbz w9, .LBB5_2
+; SOFTFP-NOLSE-NEXT:    mov w8, #1 // =0x1
+; SOFTFP-NOLSE-NEXT:    cbz w8, .LBB5_2
 ; SOFTFP-NOLSE-NEXT:  .LBB5_6: // %atomicrmw.end
 ; SOFTFP-NOLSE-NEXT:    ldp x20, x19, [sp, #16] // 16-byte Folded Reload
-; SOFTFP-NOLSE-NEXT:    mov x0, x21
 ; SOFTFP-NOLSE-NEXT:    ldp x30, x21, [sp], #32 // 16-byte Folded Reload
 ; SOFTFP-NOLSE-NEXT:    ret
   %res = atomicrmw fmax ptr %ptr, double %value seq_cst, align 8
diff --git a/llvm/test/CodeGen/AArch64/atomicrmw-fmin.ll b/llvm/test/CodeGen/AArch64/atomicrmw-fmin.ll
index b969241e8bf905..65f1f4863c1730 100644
--- a/llvm/test/CodeGen/AArch64/atomicrmw-fmin.ll
+++ b/llvm/test/CodeGen/AArch64/atomicrmw-fmin.ll
@@ -478,40 +478,38 @@ define double @test_atomicrmw_fmin_f32_seq_cst_align8(ptr %ptr, double %value) #
 ; SOFTFP-NOLSE-LABEL: test_atomicrmw_fmin_f32_seq_cst_align8:
 ; SOFTFP-NOLSE:       // %bb.0:
 ; SOFTFP-NOLSE-NEXT:    stp x30, x21, [sp, #-32]! // 16-byte Folded Spill
-; SOFTFP-NOLSE-NEXT:    ldr x21, [x0]
 ; SOFTFP-NOLSE-NEXT:    stp x20, x19, [sp, #16] // 16-byte Folded Spill
 ; SOFTFP-NOLSE-NEXT:    mov x19, x0
+; SOFTFP-NOLSE-NEXT:    ldr x0, [x0]
 ; SOFTFP-NOLSE-NEXT:    mov x20, x1
 ; SOFTFP-NOLSE-NEXT:    b .LBB5_2
 ; SOFTFP-NOLSE-NEXT:  .LBB5_1: // %cmpxchg.nostore
 ; SOFTFP-NOLSE-NEXT:    // in Loop: Header=BB5_2 Depth=1
-; SOFTFP-NOLSE-NEXT:    mov w9, wzr
+; SOFTFP-NOLSE-NEXT:    mov w8, wzr
 ; SOFTFP-NOLSE-NEXT:    clrex
-; SOFTFP-NOLSE-NEXT:    mov x21, x8
-; SOFTFP-NOLSE-NEXT:    cbnz w9, .LBB5_6
+; SOFTFP-NOLSE-NEXT:    cbnz w8, .LBB5_6
 ; SOFTFP-NOLSE-NEXT:  .LBB5_2: // %atomicrmw.start
 ; SOFTFP-NOLSE-NEXT:    // =>This Loop Header: Depth=1
 ; SOFTFP-NOLSE-NEXT:    // Child Loop BB5_3 Depth 2
-; SOFTFP-NOLSE-NEXT:    mov x0, x21
 ; SOFTFP-NOLSE-NEXT:    mov x1, x20
+; SOFTFP-NOLSE-NEXT:    mov x21, x0
 ; SOFTFP-NOLSE-NEXT:    bl fmin
+; SOFTFP-NOLSE-NEXT:    mov x8, x0
 ; SOFTFP-NOLSE-NEXT:  .LBB5_3: // %cmpxchg.start
 ; SOFTFP-NOLSE-NEXT:    // Parent Loop BB5_2 Depth=1
 ; SOFTFP-NOLSE-NEXT:    // => This Inner Loop Header: Depth=2
-; SOFTFP-NOLSE-NEXT:    ldaxr x8, [x19]
-; SOFTFP-NOLSE-NEXT:    cmp x8, x21
+; SOFTFP-NOLSE-NEXT:    ldaxr x0, [x19]
+; SOFTFP-NOLSE-NEXT:    cmp x0, x21
 ; SOFTFP-NOLSE-NEXT:    b.ne .LBB5_1
 ; SOFTFP-NOLSE-NEXT:  // %bb.4: // %cmpxchg.trystore
 ; SOFTFP-NOLSE-NEXT:    // in Loop: Header=BB5_3 Depth=2
-; SOFTFP-NOLSE-NEXT:    stlxr w9, x0, [x19]
+; SOFTFP-NOLSE-NEXT:    stlxr w9, x8, [x19]
 ; SOFTFP-NOLSE-NEXT:    cbnz w9, .LBB5_3
 ; SOFTFP-NOLSE-NEXT:  // %bb.5: // in Loop: Header=BB5_2 Depth=1
-; SOFTFP-NOLSE-NEXT:    mov w9, #1 // =0x1
-; SOFTFP-NOLSE-NEXT:    mov x21, x8
-; SOFTFP-NOLSE-NEXT:    cbz w9, .LBB5_2
+; SOFTFP-NOLSE-NEXT:    mov w8, #1 // =0x1
+; SOFTFP-NOLSE-NEXT:    cbz w8, .LBB5_2
 ; SOFTFP-NOLSE-NEXT:  .LBB5_6: // %atomicrmw.end
 ; SOFTFP-NOLSE-NEXT:    ldp x20, x19, [sp, #16] // 16-byte Folded Reload
-; SOFTFP-NOLSE-NEXT:    mov x0, x21
 ; SOFTFP-NOLSE-NEXT:    ldp x30, x21, [sp], #32 // 16-byte Folded Reload
 ; SOFTFP-NOLSE-NEXT:    ret
   %res = atomicrmw fmin ptr %ptr, double %value seq_cst, align 8
diff --git a/llvm/test/CodeGen/AArch64/atomicrmw-fsub.ll b/llvm/test/CodeGen/AArch64/atomicrmw-fsub.ll
index e603337e7a5694..0f1a2f03c98c36 100644
--- a/llvm/test/CodeGen/AArch64/atomicrmw-fsub.ll
+++ b/llvm/test/CodeGen/AArch64/atomicrmw-fsub.ll
@@ -476,40 +476,38 @@ define double @test_atomicrmw_fsub_f32_seq_cst_align8(ptr %ptr, double %value) #
 ; SOFTFP-NOLSE-LABEL: test_atomicrmw_fsub_f32_seq_cst_align8:
 ; SOFTFP-NOLSE:       // %bb.0:
 ; SOFTFP-NOLSE-NEXT:    stp x30, x21, [sp, #-32]! // 16-byte Folded Spill
-; SOFTFP-NOLSE-NEXT:    ldr x21, [x0]
 ; SOFTFP-NOLSE-NEXT:    stp x20, x19, [sp, #16] // 16-byte Folded Spill
 ; SOFTFP-NOLSE-NEXT:    mov x19, x0
+; SOFTFP-NOLSE-NEXT:    ldr x0, [x0]
 ; SOFTFP-NOLSE-NEXT:    mov x20, x1
 ; SOFTFP-NOLSE-NEXT:    b .LBB5_2
 ; SOFTFP-NOLSE-NEXT:  .LBB5_1: // %cmpxchg.nostore
 ; SOFTFP-NOLSE-NEXT:    // in Loop: Header=BB5_2 Depth=1
-; SOFTFP-NOLSE-NEXT:    mov w9, wzr
+; SOFTFP-NOLSE-NEXT:    mov w8, wzr
 ; SOFTFP-NOLSE-NEXT:    clrex
-; SOFTFP-NOLSE-NEXT:    mov x21, x8
-; SOFTFP-NOLSE-NEXT:    cbnz w9, .LBB5_6
+; SOFTFP-NOLSE-NEXT:    cbnz w8, .LBB5_6
 ; SOFTFP-NOLSE-NEXT:  .LBB5_2: // %atomicrmw.start
 ; SOFTFP-NOLSE-NEXT:    // =>This Loop Header: Depth=1
 ; SOFTFP-NOLSE-NEXT:    // Child Loop BB5_3 Depth 2
-; SOFTFP-NOLSE-NEXT:    mov x0, x21
 ; SOFTFP-NOLSE-NEXT:    mov x1, x20
+; SOFTFP-NOLSE-NEXT:    mov x21, x0
 ; SOFTFP-NOLSE-NEXT:    bl __subdf3
+; SOFTFP-NOLSE-NEXT:    mov x8, x0
 ; SOFTFP-NOLSE-NEXT:  .LBB5_3: // %cmpxchg.start
 ; SOFTFP-NOLSE-NEXT:    // Parent Loop BB5_2 Depth=1
 ; SOFTFP-NOLSE-NEXT:    // => This Inner Loop Header: Depth=2
-; SOFTFP-NOLSE-NEXT:    ldaxr x8, [x19]
-; SOFTFP-NOLSE-NEXT:    cmp x8, x21
+; SOFTFP-NOLSE-NEXT:    ldaxr x0, [x19]
+; SOFTFP-NOLSE-NEXT:    cmp x0, x21
 ; SOFTFP-NOLSE-NEXT:    b.ne .LBB5_1
 ; SOFTFP-NOLSE-NEXT:  // %bb.4: // %cmpxchg.trystore
 ; SOFTFP-NOLSE-NEXT:    // in Loop: Header=BB5_3 Depth=2
-; SOFTFP-NOLSE-NEXT:    stlxr w9, x0, [x19]
+; SOFTFP-NOLSE-NEXT:    stlxr w9, x8, [x19]
 ; SOFTFP-NOLSE-NEXT:    cbnz w9, .LBB5_3
 ; SOFTFP-NOLSE-NEXT:  // %bb.5: // in Loop: Header=BB5_2 Depth=1
-; SOFTFP-NOLSE-NEXT:    mov w9, #1 // =0x1
-; SOFTFP-NOLSE-NEXT:    mov x21, x8
-; SOFTFP-NOLSE-NEXT:    cbz w9, .LBB5_2
+; SOFTFP-NOLSE-NEXT:    mov w8, #1 // =0x1
+; SOFTFP-NOLSE-NEXT:    cbz w8, .LBB5_2
 ; SOFTFP-NOLSE-NEXT:  .LBB5_6: // %atomicrmw.end
 ; SOFTFP-NOLSE-NEXT:    ldp x20, x19, [sp, #16] // 16-byte Folded Reload
-; SOFTFP-NOLSE-NEXT:    mov x0, x21
 ; SOFTFP-NOLSE-NEXT:    ldp x30, x21, [sp], #32 // 16-byte Folded Reload
 ; SOFTFP-NOLSE-NEXT:    ret
   %res = atomicrmw fsub ptr %ptr, double %value seq_cst, align 8
diff --git a/llvm/test/CodeGen/AArch64/atomicrmw-xchg-fp.ll b/llvm/test/CodeGen/AArch64/atomicrmw-xchg-fp.ll
index 98033a8e449ffb..0e728cdb895ed2 100644
--- a/llvm/test/CodeGen/AArch64/atomicrmw-xchg-fp.ll
+++ b/llvm/test/CodeGen/AArch64/atomicrmw-xchg-fp.ll
@@ -1,4 +1,4 @@
-; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --force-update
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
 ; RUN: llc -verify-machineinstrs -mtriple=aarch64-- -O1 -fast-isel=0 -global-isel=false %s -o - | FileCheck %s -check-prefix=NOLSE
 ; RUN: llc -verify-machineinstrs -mtriple=aarch64-- -mattr=+lse -O1 -fast-isel=0 -global-isel=false %s -o - | FileCheck %s -check-prefix=LSE
 
@@ -6,14 +6,14 @@ define half @test_rmw_xchg_f16(ptr %dst, half %new) {
 ; NOLSE-LABEL: test_rmw_xchg_f16:
 ; NOLSE:       // %bb.0:
 ; NOLSE-NEXT:    // kill: def $h0 killed $h0 def $s0
-; NOLSE-NEXT:    fmov w8, s0
+; NOLSE-NEXT:    fmov w9, s0
 ; NOLSE-NEXT:  .LBB0_1: // %atomicrmw.start
 ; NOLSE-NEXT:    // =>This Inner Loop Header: Depth=1
-; NOLSE-NEXT:    ldaxrh w9, [x0]
-; NOLSE-NEXT:    stlxrh w10, w8, [x0]
+; NOLSE-NEXT:    ldaxrh w8, [x0]
+; NOLSE-NEXT:    stlxrh w10, w9, [x0]
 ; NOLSE-NEXT:    cbnz w10, .LBB0_1
 ; NOLSE-NEXT:  // %bb.2: // %atomicrmw.end
-; NOLSE-NEXT:    fmov s0, w9
+; NOLSE-NEXT:    fmov s0, w8
 ; NOLSE-NEXT:    // kill: def $h0 killed $h0 killed $s0
 ; NOLSE-NEXT:    ret
 ;
diff --git a/llvm/test/CodeGen/AArch64/cmp-to-cmn.ll b/llvm/test/CodeGen/AArch64/cmp-to-cmn.ll
index 1cc194e77b94b1..e87d43161a895e 100644
--- a/llvm/test/CodeGen/AArch64/cmp-to-cmn.ll
+++ b/llvm/test/CodeGen/AArch64/cmp-to-cmn.ll
@@ -31,7 +31,6 @@ entry:
 define i1 @test_EQ_IlsEbT(i64 %a, i16 %b) {
 ; CHECK-LABEL: test_EQ_IlsEbT:
 ; CHECK:       // %bb.0: // %entry
-; CHECK-NEXT:    // kill: def $w1 killed $w1 def $x1
 ; CHECK-NEXT:    cmn x0, w1, sxth
 ; CHECK-NEXT:    cset w0, eq
 ; CHECK-NEXT:    ret
@@ -45,7 +44,6 @@ entry:
 define i1 @test_EQ_IlcEbT(i64 %a, i8 %b) {
 ; CHECK-LABEL: test_EQ_IlcEbT:
 ; CHECK:       // %bb.0: // %entry
-; CHECK-NEXT:    // kill: def $w1 killed $w1 def $x1
 ; CHECK-NEXT:    cmn x0, w1, uxtb
 ; CHECK-NEXT:    cset w0, eq
 ; CHECK-NEXT:    ret
@@ -110,7 +108,6 @@ entry:
 define i1 @test_EQ_IslEbT(i16 %a, i64 %b) {
 ; CHECK-LABEL: test_EQ_IslEbT:
 ; CHECK:       // %bb.0: // %entry
-; CHECK-NEXT:    // kill: def $w0 killed $w0 def $x0
 ; CHECK-NEXT:    cmn x1, w0, sxth
 ; CHECK-NEXT:    cset w0, eq
 ; CHECK-NEXT:    ret
@@ -167,7 +164,6 @@ entry:
 define i1 @test_EQ_IclEbT(i8 %a, i64 %b) {
 ; CHECK-LABEL: test_EQ_IclEbT:
 ; CHECK:       // %bb.0: // %entry
-; CHECK-NEXT:    // kill: def $w0 killed $w0 def $x0
 ; CHECK-NEXT:    cmn x1, w0, uxtb
 ; CHECK-NEXT:    cset w0, eq
 ; CHECK-NEXT:    ret
@@ -249,7 +245,6 @@ entry:
 define i1 @test_NE_IlsEbT(i64 %a, i16 %b) {
 ; CHECK-LABEL: test_NE_IlsEbT:
 ; CHECK:       // %bb.0: // %entry
-; CHECK-NEXT:    // kill: def $w1 killed $w1 def $x1
 ; CHECK-NEXT:    cmn x0, w1, sxth
 ; CHECK-NEXT:    cset w0, ne
 ; CHECK-NEXT:    ret
@@ -263,7 +258,6 @@ entry:
 define i1 @test_NE_IlcEbT(i64 %a, i8 %b) {
 ; CHECK-LABEL: test_NE_IlcEbT:
 ; CHECK:       // %bb.0: // %entry
-; CHECK-NEXT:    // kill: def $w1 killed $w1 def $x1
 ; CHECK-NEXT:    cmn x0, w1, uxtb
 ; CHECK-NEXT:    cset w0, ne
 ; CHECK-NEXT:    ret
@@ -328,7 +322,6 @@ entry:
 define i1 @test_NE_IslEbT(i16 %a, i64 %b) {
 ; CHECK-LABEL: test_NE_IslEbT:
 ; CHECK:       // %bb.0: // %entry
-; CHECK-NEXT:    // kill: def $w0 killed $w0 def $x0
 ; CHECK-NEXT:    cmn x1, w0, sxth
 ; CHECK-NEXT:    cset w0, ne
 ; CHECK-NEXT:    ret
@@ -385,7 +378,6 @@ entry:
 define i1 @test_NE_IclEbT(i8 %a, i64 %b) {
 ; CHECK-LABEL: test_NE_IclEbT:
 ; CHECK:       // %bb.0: // %entry
-; CHECK-NEXT:    // kill: def $w0 killed $w0 def $x0
 ; CHECK-NEXT:    cmn x1, w0, uxtb
 ; CHECK-NEXT:    cset w0, ne
 ; CHECK-NEXT:    ret
diff --git a/llvm/test/CodeGen/AArch64/cmpxchg-idioms.ll b/llvm/test/CodeGen/AArch64/cmpxchg-idioms.ll
index 186d191444feb6..b7817ebe59b9b5 100644
--- a/llvm/test/CodeGen/AArch64/cmpxchg-idioms.ll
+++ b/llvm/test/CodeGen/AArch64/cmpxchg-idioms.ll
@@ -53,7 +53,6 @@ define i1 @test_return_bool(ptr %value, i8 %oldValue, i8 %newValue) {
 ; CHECK-LABEL: test_return_bool:
 ; CHECK:       ; %bb.0:
 ; CHECK-NEXT:    and w8, w1, #0xff
-; CHECK-NEXT:    ; kill: def $w2 killed $w2 def $x2
 ; CHECK-NEXT:  LBB1_1: ; %cmpxchg.start
 ; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    ldaxrb w9, [x0]
diff --git a/llvm/test/CodeGen/AArch64/extract-bits.ll b/llvm/test/CodeGen/AArch64/extract-bits.ll
index aaa6c7eb4a30f4..8e822d19a19b92 100644
--- a/llvm/test/CodeGen/AArch64/extract-bits.ll
+++ b/llvm/test/CodeGen/AArch64/extract-bits.ll
@@ -492,7 +492,6 @@ define i32 @bextr64_32_b1(i64 %val, i64 %numskipbits, i8 %numlowbits) nounwind {
 ; CHECK-LABEL: bextr64_32_b1:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    mov w8, #-1 // =0xffffffff
-; CHECK-NEXT:    // kill: def $w2 killed $w2 def $x2
 ; CHECK-NEXT:    lsr x9, x0, x1
 ; CHECK-NEXT:    lsl w8, w8, w2
 ; CHECK-NEXT:    bic w0, w9, w8
@@ -512,7 +511,6 @@ define i32 @bextr64_32_b2(i64 %val, i64 %numskipbits, i8 %numlowbits) nounwind {
 ; CHECK-LABEL: bextr64_32_b2:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    mov w8, #-1 // =0xffffffff
-; CHECK-NEXT:    // kill: def $w2 killed $w2 def $x2
 ; CHECK-NEXT:    lsr x9, x0, x1
 ; CHECK-NEXT:    lsl w8, w8, w2
 ; CHECK-NEXT:    bic w0, w9, w8
@@ -552,7 +550,6 @@ define i32 @bextr32_c1_indexzext(i32 %val, i8 %numskipbits, i8 %numlowbits) noun
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    mov w8, #32 // =0x20
 ; CHECK-NEXT:    mov w9, #-1 // =0xffffffff
-; CHECK-NEXT:    // kill: def $w1 killed $w1 def $x1
 ; CHECK-NEXT:    lsr w10, w0, w1
 ; CHECK-NEXT:    sub w8, w8, w2
 ; CHECK-NEXT:    lsr w8, w9, w8
@@ -592,7 +589,6 @@ define i32 @bextr32_c3_load_indexzext(ptr %w, i8 %numskipbits, i8 %numlowbits) n
 ; CHECK-NEXT:    mov w9, #32 // =0x20
 ; CHECK-NEXT:    mov w10, #-1 // =0xffffffff
 ; CHECK-NEXT:    sub w9, w9, w2
-; CHECK-NEXT:    // kill: def $w1 killed $w1 def $x1
 ; CHECK-NEXT:    lsr w8, w8, w1
 ; CHECK-NEXT:    lsr w9, w10, w9
 ; CHECK-NEXT:    and w0, w9, w8
@@ -797,7 +793,6 @@ define i32 @bextr32_d0(i32 %val, i32 %numskipbits, i32 %numlowbits) nounwind {
 define i32 @bextr32_d1_indexzext(i32 %val, i8 %numskipbits, i8 %numlowbits) nounwind {
 ; CHECK-LABEL: bextr32_d1_indexzext:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    // kill: def $w1 killed $w1 def $x1
 ; CHECK-NEXT:    lsr w8, w0, w1
 ; CHECK-NEXT:    mov w9, #32 // =0x20
 ; CHECK-NEXT:    sub w9, w9, w2
@@ -834,7 +829,6 @@ define i32 @bextr32_d3_load_indexzext(ptr %w, i8 %numskipbits, i8 %numlowbits) n
 ; CHECK-LABEL: bextr32_d3_load_indexzext:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    ldr w8, [x0]
-; CHECK-NEXT:    // kill: def $w1 killed $w1 def $x1
 ; CHECK-NEXT:    mov w9, #32 // =0x20
 ; CHECK-NEXT:    sub w9, w9, w2
 ; CHECK-NEXT:    lsr w8, w8, w1
diff --git a/llvm/test/CodeGen/AArch64/fold-int-pow2-with-fmul-or-fdiv.ll b/llvm/test/CodeGen/AArch64/fold-int-pow2-with-fmul-or-fdiv.ll
index a78addc490086d..b40c0656a60e41 100644
--- a/llvm/test/CodeGen/AArch64/fold-int-pow2-with-fmul-or-fdiv.ll
+++ b/llvm/test/CodeGen/AArch64/fold-int-pow2-with-fmul-or-fdiv.ll
@@ -224,7 +224,6 @@ define double @fmul_pow_mul_max_pow2(i16 %cnt) nounwind {
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    mov w8, #2 // =0x2
 ; CHECK-NEXT:    mov w9, #1 // =0x1
-; CHECK-NEXT:    // kill: def $w0 killed $w0 def $x0
 ; CHECK-NEXT:    fmov d1, #3.00000000
 ; CHECK-NEXT:    lsl w8, w8, w0
 ; CHECK-NEXT:    lsl w9, w9, w0
@@ -433,7 +432,6 @@ define double @fmul_pow_shl_cnt_safe(i16 %cnt) nounwind {
 ; CHECK-LABEL: fmul_pow_shl_cnt_safe:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    mov w8, #1 // =0x1
-; CHECK-NEXT:    // kill: def $w0 killed $w0 def $x0
 ; CHECK-NEXT:    lsl w8, w8, w0
 ; CHECK-NEXT:    and w8, w8, #0xffff
 ; CHECK-NEXT:    ucvtf d0, w8
diff --git a/llvm/test/CodeGen/AArch64/fsh.ll b/llvm/test/CodeGen/AArch64/fsh.ll
index 9196d511941fb3..c084813760b808 100644
--- a/llvm/test/CodeGen/AArch64/fsh.ll
+++ b/llvm/test/CodeGen/AArch64/fsh.ll
@@ -394,7 +394,6 @@ define i32 @fshl_i32(i32 %a, i32 %b, i32 %c) {
 ; CHECK-SD-LABEL: fshl_i32:
 ; CHECK-SD:       // %bb.0: // %entry
 ; CHECK-SD-NEXT:    lsr w8, w1, #1
-; CHECK-SD-NEXT:    // kill: def $w2 killed $w2 def $x2
 ; CHECK-SD-NEXT:    mvn w9, w2
 ; CHECK-SD-NEXT:    lsl w10, w0, w2
 ; CHECK-SD-NEXT:    lsr w8, w8, w9
@@ -420,7 +419,6 @@ define i32 @fshr_i32(i32 %a, i32 %b, i32 %c) {
 ; CHECK-SD-LABEL: fshr_i32:
 ; CHECK-SD:       // %bb.0: // %entry
 ; CHECK-SD-NEXT:    lsl w8, w0, #1
-; CHECK-SD-NEXT:    // kill: def $w2 killed $w2 def $x2
 ; CHECK-SD-NEXT:    mvn w9, w2
 ; CHECK-SD-NEXT:    lsr w10, w1, w2
 ; CHECK-SD-NEXT:    lsl w8, w8, w9
diff --git a/llvm/test/CodeGen/AArch64/funnel-shift.ll b/llvm/test/CodeGen/AArch64/funnel-shift.ll
index 3037a9552bc27e..e5aa360f804c1f 100644
--- a/llvm/test/CodeGen/AArch64/funnel-shift.ll
+++ b/llvm/test/CodeGen/AArch64/funnel-shift.ll
@@ -21,7 +21,6 @@ define i32 @fshl_i32(i32 %x, i32 %y, i32 %z) {
 ; CHECK-SD-LABEL: fshl_i32:
 ; CHECK-SD:       // %bb.0:
 ; CHECK-SD-NEXT:    lsr w8, w1, #1
-; CHECK-SD-NEXT:    // kill: def $w2 killed $w2 def $x2
 ; CHECK-SD-NEXT:    mvn w9, w2
 ; CHECK-SD-NEXT:    lsl w10, w0, w2
 ; CHECK-SD-NEXT:    lsr w8, w8, w9
@@ -266,7 +265,6 @@ define i32 @fshr_i32(i32 %x, i32 %y, i32 %z) {
 ; CHECK-SD-LABEL: fshr_i32:
 ; CHECK-SD:       // %bb.0:
 ; CHECK-SD-NEXT:    lsl w8, w0, #1
-; CHECK-SD-NEXT:    // kill: def $w2 killed $w2 def $x2
 ; CHECK-SD-NEXT:    mvn w9, w2
 ; CHECK-SD-NEXT:    lsr w10, w1, w2
 ; CHECK-SD-NEXT:    lsl w8, w8, w9
@@ -667,7 +665,6 @@ define i32 @or_shl_fshl_simplify(i32 %x, i32 %y, i32 %s) {
 ; CHECK-SD-LABEL: or_shl_fshl_simplify:
 ; CHECK-SD:       // %bb.0:
 ; CHECK-SD-NEXT:    lsr w8, w0, #1
-; CHECK-SD-NEXT:    // kill: def $w2 killed $w2 def $x2
 ; CHECK-SD-NEXT:    mvn w9, w2
 ; CHECK-SD-NEXT:    lsl w10, w1, w2
 ; CHECK-SD-NEXT:    lsr w8, w8, w9
@@ -696,7 +693,6 @@ define i32 @or_lshr_fshr_simplify(i32 %x, i32 %y, i32 %s) {
 ; CHECK-SD-LABEL: or_lshr_fshr_simplify:
 ; CHECK-SD:       // %bb.0:
 ; CHECK-SD-NEXT:    lsl w8, w0, #1
-; CHECK-SD-NEXT:    // kill: def $w2 killed $w2 def $x2
 ; CHECK-SD-NEXT:    mvn w9, w2
 ; CHECK-SD-NEXT:    lsr w10, w1, w2
 ; CHECK-SD-NEXT:    lsl w8, w8, w9
diff --git a/llvm/test/CodeGen/AArch64/hoist-and-by-const-from-lshr-in-eqcmp-zero.ll b/llvm/test/CodeGen/AArch64/hoist-and-by-const-from-lshr-in-eqcmp-zero.ll
index cb9f04a7fac48a..c3fdc7db2abbee 100644
--- a/llvm/test/CodeGen/AArch64/hoist-and-by-const-from-lshr-in-eqcmp-zero.ll
+++ b/llvm/test/CodeGen/AArch64/hoist-and-by-const-from-lshr-in-eqcmp-zero.ll
@@ -15,7 +15,6 @@
 define i1 @scalar_i8_signbit_eq(i8 %x, i8 %y) nounwind {
 ; CHECK-LABEL: scalar_i8_signbit_eq:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    // kill: def $w1 killed $w1 def $x1
 ; CHECK-NEXT:    lsl w8, w0, w1
 ; CHECK-NEXT:    tst w8, #0x80
 ; CHECK-NEXT:    cset w0, eq
@@ -29,7 +28,6 @@ define i1 @scalar_i8_signbit_eq(i8 %x, i8 %y) nounwind {
 define i1 @scalar_i8_lowestbit_eq(i8 %x, i8 %y) nounwind {
 ; CHECK-LABEL: scalar_i8_lowestbit_eq:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    // kill: def $w1 killed $w1 def $x1
 ; CHECK-NEXT:    lsl w8, w0, w1
 ; CHECK-NEXT:    tst w8, #0x1
 ; CHECK-NEXT:    cset w0, eq
@@ -43,7 +41,6 @@ define i1 @scalar_i8_lowestbit_eq(i8 %x, i8 %y) nounwind {
 define i1 @scalar_i8_bitsinmiddle_eq(i8 %x, i8 %y) nounwind {
 ; CHECK-LABEL: scalar_i8_bitsinmiddle_eq:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    // kill: def $w1 killed $w1 def $x1
 ; CHECK-NEXT:    lsl w8, w0, w1
 ; CHECK-NEXT:    tst w8, #0x18
 ; CHECK-NEXT:    cset w0, eq
@@ -59,7 +56,6 @@ define i1 @scalar_i8_bitsinmiddle_eq(i8 %x, i8 %y) nounwind {
 define i1 @scalar_i16_signbit_eq(i16 %x, i16 %y) nounwind {
 ; CHECK-LABEL: scalar_i16_signbit_eq:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    // kill: def $w1 killed $w1 def $x1
 ; CHECK-NEXT:    lsl w8, w0, w1
 ; CHECK-NEXT:    tst w8, #0x8000
 ; CHECK-NEXT:    cset w0, eq
@@ -73,7 +69,6 @@ define i1 @scalar_i16_signbit_eq(i16 %x, i16 %y) nounwind {
 define i1 @scalar_i16_lowestbit_eq(i16 %x, i16 %y) nounwind {
 ; CHECK-LABEL: scalar_i16_lowestbit_eq:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    // kill: def $w1 killed $w1 def $x1
 ; CHECK-NEXT:    lsl w8, w0, w1
 ; CHECK-NEXT:    tst w8, #0x1
 ; CHECK-NEXT:    cset w0, eq
@@ -87,7 +82,6 @@ define i1 @scalar_i16_lowestbit_eq(i16 %x, i16 %y) nounwind {
 define i1 @scalar_i16_bitsinmiddle_eq(i16 %x, i16 %y) nounwind {
 ; CHECK-LABEL: scalar_i16_bitsinmiddle_eq:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    // kill: def $w1 killed $w1 def $x1
 ; CHECK-NEXT:    lsl w8, w0, w1
 ; CHECK-NEXT:    tst w8, #0xff0
 ; CHECK-NEXT:    cset w0, eq
@@ -268,7 +262,6 @@ define <4 x i1> @vec_4xi32_nonsplat_undef2_eq(<4 x i32> %x, <4 x i32> %y) nounwi
 define i1 @scalar_i8_signbit_ne(i8 %x, i8 %y) nounwind {
 ; CHECK-LABEL: scalar_i8_signbit_ne:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    // kill: def $w1 killed $w1 def $x1
 ; CHECK-NEXT:    lsl w8, w0, w1
 ; CHECK-NEXT:    ubfx w0, w8, #7, #1
 ; CHECK-NEXT:    ret
@@ -325,7 +318,6 @@ define i1 @scalar_i8_signbit_eq_with_nonzero(i8 %x, i8 %y) nounwind {
 ; CHECK-LABEL: scalar_i8_signbit_eq_with_nonzero:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    mov w8, #128 // =0x80
-; CHECK-NEXT:    // kill: def $w1 killed $w1 def $x1
 ; CHECK-NEXT:    lsr w8, w8, w1
 ; CHECK-NEXT:    and w8, w8, w0
 ; CHECK-NEXT:    cmp w8, #1
diff --git a/llvm/test/CodeGen/AArch64/hoist-and-by-const-from-shl-in-eqcmp-zero.ll b/llvm/test/CodeGen/AArch64/hoist-and-by-const-from-shl-in-eqcmp-zero.ll
index 32a62453202f40..4a73b10811d293 100644
--- a/llvm/test/CodeGen/AArch64/hoist-and-by-const-from-shl-in-eqcmp-zero.ll
+++ b/llvm/test/CodeGen/AArch64/hoist-and-by-const-from-shl-in-eqcmp-zero.ll
@@ -16,7 +16,6 @@ define i1 @scalar_i8_signbit_eq(i8 %x, i8 %y) nounwind {
 ; CHECK-LABEL: scalar_i8_signbit_eq:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    and w8, w0, #0xff
-; CHECK-NEXT:    // kill: def $w1 killed $w1 def $x1
 ; CHECK-NEXT:    lsr w8, w8, w1
 ; CHECK-NEXT:    tst w8, #0x80
 ; CHECK-NEXT:    cset w0, eq
@@ -31,7 +30,6 @@ define i1 @scalar_i8_lowestbit_eq(i8 %x, i8 %y) nounwind {
 ; CHECK-LABEL: scalar_i8_lowestbit_eq:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    and w8, w0, #0xff
-; CHECK-NEXT:    // kill: def $w1 killed $w1 def $x1
 ; CHECK-NEXT:    lsr w8, w8, w1
 ; CHECK-NEXT:    tst w8, #0x1
 ; CHECK-NEXT:    cset w0, eq
@@ -46,7 +44,6 @@ define i1 @scalar_i8_bitsinmiddle_eq(i8 %x, i8 %y) nounwind {
 ; CHECK-LABEL: scalar_i8_bitsinmiddle_eq:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    and w8, w0, #0xff
-; CHECK-NEXT:    // kill: def $w1 killed $w1 def $x1
 ; CHECK-NEXT:    lsr w8, w8, w1
 ; CHECK-NEXT:    tst w8, #0x18
 ; CHECK-NEXT:    cset w0, eq
@@ -63,7 +60,6 @@ define i1 @scalar_i16_signbit_eq(i16 %x, i16 %y) nounwind {
 ; CHECK-LABEL: scalar_i16_signbit_eq:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    and w8, w0, #0xffff
-; CHECK-NEXT:    // kill: def $w1 killed $w1 def $x1
 ; CHECK-NEXT:    lsr w8, w8, w1
 ; CHECK-NEXT:    tst w8, #0x8000
 ; CHECK-NEXT:    cset w0, eq
@@ -78,7 +74,6 @@ define i1 @scalar_i16_lowestbit_eq(i16 %x, i16 %y) nounwind {
 ; CHECK-LABEL: scalar_i16_lowestbit_eq:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    and w8, w0, #0xffff
-; CHECK-NEXT:    // kill: def $w1 killed $w1 def $x1
 ; CHECK-NEXT:    lsr w8, w8, w1
 ; CHECK-NEXT:    tst w8, #0x1
 ; CHECK-NEXT:    cset w0, eq
@@ -93,7 +88,6 @@ define i1 @scalar_i16_bitsinmiddle_eq(i16 %x, i16 %y) nounwind {
 ; CHECK-LABEL: scalar_i16_bitsinmiddle_eq:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    and w8, w0, #0xffff
-; CHECK-NEXT:    // kill: def $w1 killed $w1 def $x1
 ; CHECK-NEXT:    lsr w8, w8, w1
 ; CHECK-NEXT:    tst w8, #0xff0
 ; CHECK-NEXT:    cset w0, eq
@@ -272,7 +266,6 @@ define i1 @scalar_i8_signbit_ne(i8 %x, i8 %y) nounwind {
 ; CHECK-LABEL: scalar_i8_signbit_ne:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    and w8, w0, #0xff
-; CHECK-NEXT:    // kill: def $w1 killed $w1 def $x1
 ; CHECK-NEXT:    lsr w8, w8, w1
 ; CHECK-NEXT:    lsr w0, w8, #7
 ; CHECK-NEXT:    ret
@@ -289,7 +282,7 @@ define i1 @scalar_i8_signbit_ne(i8 %x, i8 %y) nounwind {
 define i1 @scalar_i32_x_is_const_eq(i32 %y) nounwind {
 ; CHECK-LABEL: scalar_i32_x_is_const_eq:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    mov w8, #43605
+; CHECK-NEXT:    mov w8, #43605 // =0xaa55
 ; CHECK-NEXT:    movk w8, #43605, lsl #16
 ; CHECK-NEXT:    lsl w8, w8, w0
 ; CHECK-NEXT:    tst w8, #0x1
@@ -303,8 +296,8 @@ define i1 @scalar_i32_x_is_const_eq(i32 %y) nounwind {
 define i1 @scalar_i32_x_is_const2_eq(i32 %y) nounwind {
 ; CHECK-LABEL: scalar_i32_x_is_const2_eq:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    mov w8, #1
-; CHECK-NEXT:    mov w9, #43605
+; CHECK-NEXT:    mov w8, #1 // =0x1
+; CHECK-NEXT:    mov w9, #43605 // =0xaa55
 ; CHECK-NEXT:    lsl w8, w8, w0
 ; CHECK-NEXT:    movk w9, #43605, lsl #16
 ; CHECK-NEXT:    tst w8, w9
@@ -319,8 +312,7 @@ define i1 @scalar_i32_x_is_const2_eq(i32 %y) nounwind {
 define i1 @scalar_i8_bitsinmiddle_slt(i8 %x, i8 %y) nounwind {
 ; CHECK-LABEL: scalar_i8_bitsinmiddle_slt:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    mov w8, #24
-; CHECK-NEXT:    // kill: def $w1 killed $w1 def $x1
+; CHECK-NEXT:    mov w8, #24 // =0x18
 ; CHECK-NEXT:    lsl w8, w8, w1
 ; CHECK-NEXT:    and w8, w8, w0
 ; CHECK-NEXT:    ubfx w0, w8, #7, #1
@@ -334,8 +326,7 @@ define i1 @scalar_i8_bitsinmiddle_slt(i8 %x, i8 %y) nounwind {
 define i1 @scalar_i8_signbit_eq_with_nonzero(i8 %x, i8 %y) nounwind {
 ; CHECK-LABEL: scalar_i8_signbit_eq_with_nonzero:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    mov w8, #-128
-; CHECK-NEXT:    // kill: def $w1 killed $w1 def $x1
+; CHECK-NEXT:    mov w8, #-128 // =0xffffff80
 ; CHECK-NEXT:    lsl w8, w8, w1
 ; CHECK-NEXT:    and w8, w8, w0
 ; CHECK-NEXT:    and w8, w8, #0x80
diff --git a/llvm/test/CodeGen/AArch64/logic-shift.ll b/llvm/test/CodeGen/AArch64/logic-shift.ll
index 31047954401cf5..0a62f8076a1be0 100644
--- a/llvm/test/CodeGen/AArch64/logic-shift.ll
+++ b/llvm/test/CodeGen/AArch64/logic-shift.ll
@@ -5,7 +5,6 @@ define i8 @or_lshr_commute0(i8 %x0, i8 %x1, i8 %y, i8 %z) {
 ; CHECK-LABEL: or_lshr_commute0:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    orr w8, w0, w1
-; CHECK-NEXT:    // kill: def $w2 killed $w2 def $x2
 ; CHECK-NEXT:    and w8, w8, #0xff
 ; CHECK-NEXT:    lsr w8, w8, w2
 ; CHECK-NEXT:    orr w0, w8, w3
@@ -65,7 +64,6 @@ define i16 @or_ashr_commute0(i16 %x0, i16 %x1, i16 %y, i16 %z) {
 ; CHECK-LABEL: or_ashr_commute0:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    orr w8, w0, w1
-; CHECK-NEXT:    // kill: def $w2 killed $w2 def $x2
 ; CHECK-NEXT:    sxth w8, w8
 ; CHECK-NEXT:    asr w8, w8, w2
 ; CHECK-NEXT:    orr w0, w8, w3
@@ -139,7 +137,6 @@ define i8 @or_shl_commute1(i8 %x0, i8 %x1, i8 %y, i8 %z) {
 ; CHECK-LABEL: or_shl_commute1:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    orr w8, w0, w1
-; CHECK-NEXT:    // kill: def $w2 killed $w2 def $x2
 ; CHECK-NEXT:    lsl w8, w8, w2
 ; CHECK-NEXT:    orr w0, w8, w3
 ; CHECK-NEXT:    ret
@@ -233,7 +230,6 @@ define i8 @xor_lshr_commute0(i8 %x0, i8 %x1, i8 %y, i8 %z) {
 ; CHECK-LABEL: xor_lshr_commute0:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    eor w8, w0, w1
-; CHECK-NEXT:    // kill: def $w2 killed $w2 def $x2
 ; CHECK-NEXT:    and w8, w8, #0xff
 ; CHECK-NEXT:    lsr w8, w8, w2
 ; CHECK-NEXT:    eor w0, w8, w3
@@ -293,7 +289,6 @@ define i16 @xor_ashr_commute0(i16 %x0, i16 %x1, i16 %y, i16 %z) {
 ; CHECK-LABEL: xor_ashr_commute0:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    eor w8, w0, w1
-; CHECK-NEXT:    // kill: def $w2 killed $w2 def $x2
 ; CHECK-NEXT:    sxth w8, w8
 ; CHECK-NEXT:    asr w8, w8, w2
 ; CHECK-NEXT:    eor w0, w8, w3
@@ -367,7 +362,6 @@ define i8 @xor_shl_commute1(i8 %x0, i8 %x1, i8 %y, i8 %z) {
 ; CHECK-LABEL: xor_shl_commute1:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    eor w8, w0, w1
-; CHECK-NEXT:    // kill: def $w2 killed $w2 def $x2
 ; CHECK-NEXT:    lsl w8, w8, w2
 ; CHECK-NEXT:    eor w0, w8, w3
 ; CHECK-NEXT:    ret
@@ -461,7 +455,6 @@ define i8 @and_lshr_commute0(i8 %x0, i8 %x1, i8 %y, i8 %z) {
 ; CHECK-LABEL: and_lshr_commute0:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    and w8, w0, w1
-; CHECK-NEXT:    // kill: def $w2 killed $w2 def $x2
 ; CHECK-NEXT:    and w8, w8, #0xff
 ; CHECK-NEXT:    lsr w8, w8, w2
 ; CHECK-NEXT:    and w0, w8, w3
@@ -521,7 +514,6 @@ define i16 @and_ashr_commute0(i16 %x0, i16 %x1, i16 %y, i16 %z) {
 ; CHECK-LABEL: and_ashr_commute0:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    and w8, w0, w1
-; CHECK-NEXT:    // kill: def $w2 killed $w2 def $x2
 ; CHECK-NEXT:    sxth w8, w8
 ; CHECK-NEXT:    asr w8, w8, w2
 ; CHECK-NEXT:    and w0, w8, w3
@@ -595,7 +587,6 @@ define i8 @and_shl_commute1(i8 %x0, i8 %x1, i8 %y, i8 %z) {
 ; CHECK-LABEL: and_shl_commute1:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    and w8, w0, w1
-; CHECK-NEXT:    // kill: def $w2 killed $w2 def $x2
 ; CHECK-NEXT:    lsl w8, w8, w2
 ; CHECK-NEXT:    and w0, w8, w3
 ; CHECK-NEXT:    ret
diff --git a/llvm/test/CodeGen/AArch64/neon-insextbitcast.ll b/llvm/test/CodeGen/AArch64/neon-insextbitcast.ll
index 28961376d2b6d6..ebff3f13700406 100644
--- a/llvm/test/CodeGen/AArch64/neon-insextbitcast.ll
+++ b/llvm/test/CodeGen/AArch64/neon-insextbitcast.ll
@@ -163,8 +163,6 @@ entry:
 define double @test_vext_v1i64(<1 x i64> %a) {
 ; CHECK-LABEL: test_vext_v1i64:
 ; CHECK:       // %bb.0: // %entry
-; CHECK-NEXT:    // kill: def $d0 killed $d0 def $q0
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $q0
 ; CHECK-NEXT:    ret
 entry:
   %b = extractelement <1 x i64> %a, i32 0
diff --git a/llvm/test/CodeGen/AArch64/shift-by-signext.ll b/llvm/test/CodeGen/AArch64/shift-by-signext.ll
index 67e2da96084efb..3bcaee523f2e97 100644
--- a/llvm/test/CodeGen/AArch64/shift-by-signext.ll
+++ b/llvm/test/CodeGen/AArch64/shift-by-signext.ll
@@ -81,7 +81,6 @@ define i32 @n6_fshl(i32 %x, i32 %y, i8 %shamt) nounwind {
 ; CHECK-LABEL: n6_fshl:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    lsr w8, w1, #1
-; CHECK-NEXT:    // kill: def $w2 killed $w2 def $x2
 ; CHECK-NEXT:    mvn w9, w2
 ; CHECK-NEXT:    lsl w10, w0, w2
 ; CHECK-NEXT:    lsr w8, w8, w9
@@ -95,7 +94,6 @@ define i32 @n7_fshr(i32 %x, i32 %y, i8 %shamt) nounwind {
 ; CHECK-LABEL: n7_fshr:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    lsl w8, w0, #1
-; CHECK-NEXT:    // kill: def $w2 killed $w2 def $x2
 ; CHECK-NEXT:    mvn w9, w2
 ; CHECK-NEXT:    lsr w10, w1, w2
 ; CHECK-NEXT:    lsl w8, w8, w9
diff --git a/llvm/test/CodeGen/AArch64/shift.ll b/llvm/test/CodeGen/AArch64/shift.ll
index 066928687cc02d..5f24af82d83429 100644
--- a/llvm/test/CodeGen/AArch64/shift.ll
+++ b/llvm/test/CodeGen/AArch64/shift.ll
@@ -21,7 +21,6 @@ define i1 @shl_i1(i1 %0, i1 %1){
 define i8 @shl_i8(i8 %0, i8 %1){
 ; CHECK-SD-LABEL: shl_i8:
 ; CHECK-SD:       // %bb.0:
-; CHECK-SD-NEXT:    // kill: def $w1 killed $w1 def $x1
 ; CHECK-SD-NEXT:    lsl w0, w0, w1
 ; CHECK-SD-NEXT:    ret
 ;
@@ -37,7 +36,6 @@ define i8 @shl_i8(i8 %0, i8 %1){
 define i16 @shl_i16(i16 %0, i16 %1){
 ; CHECK-SD-LABEL: shl_i16:
 ; CHECK-SD:       // %bb.0:
-; CHECK-SD-NEXT:    // kill: def $w1 killed $w1 def $x1
 ; CHECK-SD-NEXT:    lsl w0, w0, w1
 ; CHECK-SD-NEXT:    ret
 ;
@@ -123,7 +121,6 @@ define i8 @ashr_i8(i8 %0, i8 %1){
 ; CHECK-SD-LABEL: ashr_i8:
 ; CHECK-SD:       // %bb.0:
 ; CHECK-SD-NEXT:    sxtb w8, w0
-; CHECK-SD-NEXT:    // kill: def $w1 killed $w1 def $x1
 ; CHECK-SD-NEXT:    asr w0, w8, w1
 ; CHECK-SD-NEXT:    ret
 ;
@@ -141,7 +138,6 @@ define i16 @ashr_i16(i16 %0, i16 %1){
 ; CHECK-SD-LABEL: ashr_i16:
 ; CHECK-SD:       // %bb.0:
 ; CHECK-SD-NEXT:    sxth w8, w0
-; CHECK-SD-NEXT:    // kill: def $w1 killed $w1 def $x1
 ; CHECK-SD-NEXT:    asr w0, w8, w1
 ; CHECK-SD-NEXT:    ret
 ;
@@ -230,7 +226,6 @@ define i8 @lshr_i8(i8 %0, i8 %1){
 ; CHECK-SD-LABEL: lshr_i8:
 ; CHECK-SD:       // %bb.0:
 ; CHECK-SD-NEXT:    and w8, w0, #0xff
-; CHECK-SD-NEXT:    // kill: def $w1 killed $w1 def $x1
 ; CHECK-SD-NEXT:    lsr w0, w8, w1
 ; CHECK-SD-NEXT:    ret
 ;
@@ -248,7 +243,6 @@ define i16 @lshr_i16(i16 %0, i16 %1){
 ; CHECK-SD-LABEL: lshr_i16:
 ; CHECK-SD:       // %bb.0:
 ; CHECK-SD-NEXT:    and w8, w0, #0xffff
-; CHECK-SD-NEXT:    // kill: def $w1 killed $w1 def $x1
 ; CHECK-SD-NEXT:    lsr w0, w8, w1
 ; CHECK-SD-NEXT:    ret
 ;
diff --git a/llvm/test/CodeGen/AArch64/sink-and-fold.ll b/llvm/test/CodeGen/AArch64/sink-and-fold.ll
index f65a08ae2acea0..4d383fefc43c7b 100644
--- a/llvm/test/CodeGen/AArch64/sink-and-fold.ll
+++ b/llvm/test/CodeGen/AArch64/sink-and-fold.ll
@@ -298,7 +298,6 @@ exit:
 define i32 @f6(i1 %c, ptr %a, i32 %i) {
 ; CHECK-LABEL: f6:
 ; CHECK:       // %bb.0: // %entry
-; CHECK-NEXT:    // kill: def $w2 killed $w2 def $x2
 ; CHECK-NEXT:    tbz w0, #0, .LBB6_2
 ; CHECK-NEXT:  // %bb.1: // %if.then
 ; CHECK-NEXT:    mov w0, wzr
diff --git a/llvm/test/CodeGen/AArch64/sve-fixed-length-subvector.ll b/llvm/test/CodeGen/AArch64/sve-fixed-length-subvector.ll
index aef19d23109b4f..3e6a7ce34a9ae5 100644
--- a/llvm/test/CodeGen/AArch64/sve-fixed-length-subvector.ll
+++ b/llvm/test/CodeGen/AArch64/sve-fixed-length-subvector.ll
@@ -48,10 +48,10 @@ define void @subvector_v32i16(ptr %in, ptr %out) #0 {
 ; VBITS_GE_256:       // %bb.0:
 ; VBITS_GE_256-NEXT:    ptrue p0.h, vl16
 ; VBITS_GE_256-NEXT:    mov x8, #16 // =0x10
-; VBITS_GE_256-NEXT:    ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
-; VBITS_GE_256-NEXT:    ld1h { z1.h }, p0/z, [x0]
-; VBITS_GE_256-NEXT:    st1h { z0.h }, p0, [x1, x8, lsl #1]
-; VBITS_GE_256-NEXT:    st1h { z1.h }, p0, [x1]
+; VBITS_GE_256-NEXT:    ld1h { z1.h }, p0/z, [x0, x8, lsl #1]
+; VBITS_GE_256-NEXT:    ld1h { z0.h }, p0/z, [x0]
+; VBITS_GE_256-NEXT:    st1h { z1.h }, p0, [x1, x8, lsl #1]
+; VBITS_GE_256-NEXT:    st1h { z0.h }, p0, [x1]
 ; VBITS_GE_256-NEXT:    ret
 ;
 ; VBITS_GE_512-LABEL: subvector_v32i16:
@@ -103,10 +103,10 @@ define void @subvector_v16i32(ptr %in, ptr %out) #0 {
 ; VBITS_GE_256:       // %bb.0:
 ; VBITS_GE_256-NEXT:    ptrue p0.s, vl8
 ; VBITS_GE_256-NEXT:    mov x8, #8 // =0x8
-; VBITS_GE_256-NEXT:    ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
-; VBITS_GE_256-NEXT:    ld1w { z1.s }, p0/z, [x0]
-; VBITS_GE_256-NEXT:    st1w { z0.s }, p0, [x1, x8, lsl #2]
-; VBITS_GE_256-NEXT:    st1w { z1.s }, p0, [x1]
+; VBITS_GE_256-NEXT:    ld1w { z1.s }, p0/z, [x0, x8, lsl #2]
+; VBITS_GE_256-NEXT:    ld1w { z0.s }, p0/z, [x0]
+; VBITS_GE_256-NEXT:    st1w { z1.s }, p0, [x1, x8, lsl #2]
+; VBITS_GE_256-NEXT:    st1w { z0.s }, p0, [x1]
 ; VBITS_GE_256-NEXT:    ret
 ;
 ; VBITS_GE_512-LABEL: subvector_v16i32:
@@ -159,10 +159,10 @@ define void @subvector_v8i64(ptr %in, ptr %out) vscale_range(2,0) #0 {
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    ptrue p0.d, vl4
 ; CHECK-NEXT:    mov x8, #4 // =0x4
-; CHECK-NEXT:    ld1d { z0.d }, p0/z, [x0, x8, lsl #3]
-; CHECK-NEXT:    ld1d { z1.d }, p0/z, [x0]
-; CHECK-NEXT:    st1d { z0.d }, p0, [x1, x8, lsl #3]
-; CHECK-NEXT:    st1d { z1.d }, p0, [x1]
+; CHECK-NEXT:    ld1d { z1.d }, p0/z, [x0, x8, lsl #3]
+; CHECK-NEXT:    ld1d { z0.d }, p0/z, [x0]
+; CHECK-NEXT:    st1d { z1.d }, p0, [x1, x8, lsl #3]
+; CHECK-NEXT:    st1d { z0.d }, p0, [x1]
 ; CHECK-NEXT:    ret
   %a = load <8 x i64>, ptr %in
   br label %bb1
@@ -236,10 +236,10 @@ define void @subvector_v32f16(ptr %in, ptr %out) #0 {
 ; VBITS_GE_256:       // %bb.0:
 ; VBITS_GE_256-NEXT:    ptrue p0.h, vl16
 ; VBITS_GE_256-NEXT:    mov x8, #16 // =0x10
-; VBITS_GE_256-NEXT:    ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
-; VBITS_GE_256-NEXT:    ld1h { z1.h }, p0/z, [x0]
-; VBITS_GE_256-NEXT:    st1h { z0.h }, p0, [x1, x8, lsl #1]
-; VBITS_GE_256-NEXT:    st1h { z1.h }, p0, [x1]
+; VBITS_GE_256-NEXT:    ld1h { z1.h }, p0/z, [x0, x8, lsl #1]
+; VBITS_GE_256-NEXT:    ld1h { z0.h }, p0/z, [x0]
+; VBITS_GE_256-NEXT:    st1h { z1.h }, p0, [x1, x8, lsl #1]
+; VBITS_GE_256-NEXT:    st1h { z0.h }, p0, [x1]
 ; VBITS_GE_256-NEXT:    ret
 ;
 ; VBITS_GE_512-LABEL: subvector_v32f16:
@@ -291,10 +291,10 @@ define void @subvector_v16f32(ptr %in, ptr %out) #0 {
 ; VBITS_GE_256:       // %bb.0:
 ; VBITS_GE_256-NEXT:    ptrue p0.s, vl8
 ; VBITS_GE_256-NEXT:    mov x8, #8 // =0x8
-; VBITS_GE_256-NEXT:    ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
-; VBITS_GE_256-NEXT:    ld1w { z1.s }, p0/z, [x0]
-; VBITS_GE_256-NEXT:    st1w { z0.s }, p0, [x1, x8, lsl #2]
-; VBITS_GE_256-NEXT:    st1w { z1.s }, p0, [x1]
+; VBITS_GE_256-NEXT:    ld1w { z1.s }, p0/z, [x0, x8, lsl #2]
+; VBITS_GE_256-NEXT:    ld1w { z0.s }, p0/z, [x0]
+; VBITS_GE_256-NEXT:    st1w { z1.s }, p0, [x1, x8, lsl #2]
+; VBITS_GE_256-NEXT:    st1w { z0.s }, p0, [x1]
 ; VBITS_GE_256-NEXT:    ret
 ;
 ; VBITS_GE_512-LABEL: subvector_v16f32:
@@ -345,10 +345,10 @@ define void @subvector_v8f64(ptr %in, ptr %out) #0 {
 ; VBITS_GE_256:       // %bb.0:
 ; VBITS_GE_256-NEXT:    ptrue p0.d, vl4
 ; VBITS_GE_256-NEXT:    mov x8, #4 // =0x4
-; VBITS_GE_256-NEXT:    ld1d { z0.d }, p0/z, [x0, x8, lsl #3]
-; VBITS_GE_256-NEXT:    ld1d { z1.d }, p0/z, [x0]
-; VBITS_GE_256-NEXT:    st1d { z0.d }, p0, [x1, x8, lsl #3]
-; VBITS_GE_256-NEXT:    st1d { z1.d }, p0, [x1]
+; VBITS_GE_256-NEXT:    ld1d { z1.d }, p0/z, [x0, x8, lsl #3]
+; VBITS_GE_256-NEXT:    ld1d { z0.d }, p0/z, [x0]
+; VBITS_GE_256-NEXT:    st1d { z1.d }, p0, [x1, x8, lsl #3]
+; VBITS_GE_256-NEXT:    st1d { z0.d }, p0, [x1]
 ; VBITS_GE_256-NEXT:    ret
 ;
 ; VBITS_GE_512-LABEL: subvector_v8f64:
diff --git a/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-fp-to-int.ll b/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-fp-to-int.ll
index b61c30af379944..94d756a36ab927 100644
--- a/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-fp-to-int.ll
+++ b/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-fp-to-int.ll
@@ -1335,50 +1335,50 @@ define <8 x i16> @fcvtzu_v8f64_v8i16(ptr %a) {
 define void @fcvtzu_v16f64_v16i16(ptr %a, ptr %b) {
 ; CHECK-LABEL: fcvtzu_v16f64_v16i16:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    ldp q5, q6, [x0, #96]
+; CHECK-NEXT:    ldp q0, q1, [x0, #96]
 ; CHECK-NEXT:    ptrue p0.d, vl2
-; CHECK-NEXT:    ldp q0, q4, [x0, #32]
-; CHECK-NEXT:    ldp q2, q7, [x0, #64]
-; CHECK-NEXT:    ldp q1, q3, [x0]
-; CHECK-NEXT:    fcvtzs z6.d, p0/m, z6.d
-; CHECK-NEXT:    fcvtzs z4.d, p0/m, z4.d
-; CHECK-NEXT:    fcvtzs z5.d, p0/m, z5.d
+; CHECK-NEXT:    ldp q2, q3, [x0, #32]
+; CHECK-NEXT:    ldp q4, q5, [x0, #64]
+; CHECK-NEXT:    ldp q6, q7, [x0]
+; CHECK-NEXT:    fcvtzs z1.d, p0/m, z1.d
+; CHECK-NEXT:    fcvtzs z3.d, p0/m, z3.d
 ; CHECK-NEXT:    fcvtzs z0.d, p0/m, z0.d
-; CHECK-NEXT:    fcvtzs z7.d, p0/m, z7.d
 ; CHECK-NEXT:    fcvtzs z2.d, p0/m, z2.d
-; CHECK-NEXT:    fcvtzs z3.d, p0/m, z3.d
-; CHECK-NEXT:    fcvtzs z1.d, p0/m, z1.d
-; CHECK-NEXT:    uzp1 z6.s, z6.s, z6.s
-; CHECK-NEXT:    uzp1 z4.s, z4.s, z4.s
-; CHECK-NEXT:    uzp1 z5.s, z5.s, z5.s
+; CHECK-NEXT:    fcvtzs z5.d, p0/m, z5.d
+; CHECK-NEXT:    fcvtzs z4.d, p0/m, z4.d
+; CHECK-NEXT:    fcvtzs z7.d, p0/m, z7.d
+; CHECK-NEXT:    fcvtzs z6.d, p0/m, z6.d
+; CHECK-NEXT:    uzp1 z1.s, z1.s, z1.s
+; CHECK-NEXT:    uzp1 z3.s, z3.s, z3.s
 ; CHECK-NEXT:    uzp1 z0.s, z0.s, z0.s
-; CHECK-NEXT:    uzp1 z7.s, z7.s, z7.s
 ; CHECK-NEXT:    uzp1 z2.s, z2.s, z2.s
-; CHECK-NEXT:    uzp1 z3.s, z3.s, z3.s
-; CHECK-NEXT:    uzp1 z1.s, z1.s, z1.s
-; CHECK-NEXT:    mov z17.s, z6.s[1]
-; CHECK-NEXT:    mov z16.s, z4.s[1]
-; CHECK-NEXT:    mov z18.s, z5.s[1]
-; CHECK-NEXT:    mov z21.s, z0.s[1]
-; CHECK-NEXT:    mov z19.s, z7.s[1]
-; CHECK-NEXT:    mov z20.s, z2.s[1]
-; CHECK-NEXT:    mov z22.s, z3.s[1]
-; CHECK-NEXT:    mov z23.s, z1.s[1]
-; CHECK-NEXT:    zip1 z6.h, z6.h, z17.h
-; CHECK-NEXT:    zip1 z4.h, z4.h, z16.h
-; CHECK-NEXT:    zip1 z5.h, z5.h, z18.h
-; CHECK-NEXT:    zip1 z0.h, z0.h, z21.h
-; CHECK-NEXT:    zip1 z7.h, z7.h, z19.h
-; CHECK-NEXT:    zip1 z2.h, z2.h, z20.h
-; CHECK-NEXT:    zip1 z3.h, z3.h, z22.h
-; CHECK-NEXT:    zip1 z1.h, z1.h, z23.h
-; CHECK-NEXT:    zip1 z5.s, z5.s, z6.s
-; CHECK-NEXT:    zip1 z0.s, z0.s, z4.s
-; CHECK-NEXT:    zip1 z2.s, z2.s, z7.s
-; CHECK-NEXT:    zip1 z1.s, z1.s, z3.s
-; CHECK-NEXT:    zip1 z2.d, z2.d, z5.d
+; CHECK-NEXT:    uzp1 z5.s, z5.s, z5.s
+; CHECK-NEXT:    uzp1 z4.s, z4.s, z4.s
+; CHECK-NEXT:    uzp1 z7.s, z7.s, z7.s
+; CHECK-NEXT:    uzp1 z6.s, z6.s, z6.s
+; CHECK-NEXT:    mov z17.s, z1.s[1]
+; CHECK-NEXT:    mov z16.s, z3.s[1]
+; CHECK-NEXT:    mov z18.s, z0.s[1]
+; CHECK-NEXT:    mov z21.s, z2.s[1]
+; CHECK-NEXT:    mov z19.s, z5.s[1]
+; CHECK-NEXT:    mov z20.s, z4.s[1]
+; CHECK-NEXT:    mov z22.s, z7.s[1]
+; CHECK-NEXT:    mov z23.s, z6.s[1]
+; CHECK-NEXT:    zip1 z1.h, z1.h, z17.h
+; CHECK-NEXT:    zip1 z3.h, z3.h, z16.h
+; CHECK-NEXT:    zip1 z0.h, z0.h, z18.h
+; CHECK-NEXT:    zip1 z2.h, z2.h, z21.h
+; CHECK-NEXT:    zip1 z5.h, z5.h, z19.h
+; CHECK-NEXT:    zip1 z4.h, z4.h, z20.h
+; CHECK-NEXT:    zip1 z7.h, z7.h, z22.h
+; CHECK-NEXT:    zip1 z6.h, z6.h, z23.h
+; CHECK-NEXT:    zip1 z0.s, z0.s, z1.s
+; CHECK-NEXT:    zip1 z2.s, z2.s, z3.s
+; CHECK-NEXT:    zip1 z1.s, z4.s, z5.s
+; CHECK-NEXT:    zip1 z3.s, z6.s, z7.s
 ; CHECK-NEXT:    zip1 z0.d, z1.d, z0.d
-; CHECK-NEXT:    stp q0, q2, [x1]
+; CHECK-NEXT:    zip1 z1.d, z3.d, z2.d
+; CHECK-NEXT:    stp q1, q0, [x1]
 ; CHECK-NEXT:    ret
 ;
 ; NONEON-NOSVE-LABEL: fcvtzu_v16f64_v16i16:
@@ -3033,50 +3033,50 @@ define <8 x i16> @fcvtzs_v8f64_v8i16(ptr %a) {
 define void @fcvtzs_v16f64_v16i16(ptr %a, ptr %b) {
 ; CHECK-LABEL: fcvtzs_v16f64_v16i16:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    ldp q5, q6, [x0, #96]
+; CHECK-NEXT:    ldp q0, q1, [x0, #96]
 ; CHECK-NEXT:    ptrue p0.d, vl2
-; CHECK-NEXT:    ldp q0, q4, [x0, #32]
-; CHECK-NEXT:    ldp q2, q7, [x0, #64]
-; CHECK-NEXT:    ldp q1, q3, [x0]
-; CHECK-NEXT:    fcvtzs z6.d, p0/m, z6.d
-; CHECK-NEXT:    fcvtzs z4.d, p0/m, z4.d
-; CHECK-NEXT:    fcvtzs z5.d, p0/m, z5.d
+; CHECK-NEXT:    ldp q2, q3, [x0, #32]
+; CHECK-NEXT:    ldp q4, q5, [x0, #64]
+; CHECK-NEXT:    ldp q6, q7, [x0]
+; CHECK-NEXT:    fcvtzs z1.d, p0/m, z1.d
+; CHECK-NEXT:    fcvtzs z3.d, p0/m, z3.d
 ; CHECK-NEXT:    fcvtzs z0.d, p0/m, z0.d
-; CHECK-NEXT:    fcvtzs z7.d, p0/m, z7.d
 ; CHECK-NEXT:    fcvtzs z2.d, p0/m, z2.d
-; CHECK-NEXT:    fcvtzs z3.d, p0/m, z3.d
-; CHECK-NEXT:    fcvtzs z1.d, p0/m, z1.d
-; CHECK-NEXT:    uzp1 z6.s, z6.s, z6.s
-; CHECK-NEXT:    uzp1 z4.s, z4.s, z4.s
-; CHECK-NEXT:    uzp1 z5.s, z5.s, z5.s
+; CHECK-NEXT:    fcvtzs z5.d, p0/m, z5.d
+; CHECK-NEXT:    fcvtzs z4.d, p0/m, z4.d
+; CHECK-NEXT:    fcvtzs z7.d, p0/m, z7.d
+; CHECK-NEXT:    fcvtzs z6.d, p0/m, z6.d
+; CHECK-NEXT:    uzp1 z1.s, z1.s, z1.s
+; CHECK-NEXT:    uzp1 z3.s, z3.s, z3.s
 ; CHECK-NEXT:    uzp1 z0.s, z0.s, z0.s
-; CHECK-NEXT:    uzp1 z7.s, z7.s, z7.s
 ; CHECK-NEXT:    uzp1 z2.s, z2.s, z2.s
-; CHECK-NEXT:    uzp1 z3.s, z3.s, z3.s
-; CHECK-NEXT:    uzp1 z1.s, z1.s, z1.s
-; CHECK-NEXT:    mov z17.s, z6.s[1]
-; CHECK-NEXT:    mov z16.s, z4.s[1]
-; CHECK-NEXT:    mov z18.s, z5.s[1]
-; CHECK-NEXT:    mov z21.s, z0.s[1]
-; CHECK-NEXT:    mov z19.s, z7.s[1]
-; CHECK-NEXT:    mov z20.s, z2.s[1]
-; CHECK-NEXT:    mov z22.s, z3.s[1]
-; CHECK-NEXT:    mov z23.s, z1.s[1]
-; CHECK-NEXT:    zip1 z6.h, z6.h, z17.h
-; CHECK-NEXT:    zip1 z4.h, z4.h, z16.h
-; CHECK-NEXT:    zip1 z5.h, z5.h, z18.h
-; CHECK-NEXT:    zip1 z0.h, z0.h, z21.h
-; CHECK-NEXT:    zip1 z7.h, z7.h, z19.h
-; CHECK-NEXT:    zip1 z2.h, z2.h, z20.h
-; CHECK-NEXT:    zip1 z3.h, z3.h, z22.h
-; CHECK-NEXT:    zip1 z1.h, z1.h, z23.h
-; CHECK-NEXT:    zip1 z5.s, z5.s, z6.s
-; CHECK-NEXT:    zip1 z0.s, z0.s, z4.s
-; CHECK-NEXT:    zip1 z2.s, z2.s, z7.s
-; CHECK-NEXT:    zip1 z1.s, z1.s, z3.s
-; CHECK-NEXT:    zip1 z2.d, z2.d, z5.d
+; CHECK-NEXT:    uzp1 z5.s, z5.s, z5.s
+; CHECK-NEXT:    uzp1 z4.s, z4.s, z4.s
+; CHECK-NEXT:    uzp1 z7.s, z7.s, z7.s
+; CHECK-NEXT:    uzp1 z6.s, z6.s, z6.s
+; CHECK-NEXT:    mov z17.s, z1.s[1]
+; CHECK-NEXT:    mov z16.s, z3.s[1]
+; CHECK-NEXT:    mov z18.s, z0.s[1]
+; CHECK-NEXT:    mov z21.s, z2.s[1]
+; CHECK-NEXT:    mov z19.s, z5.s[1]
+; CHECK-NEXT:    mov z20.s, z4.s[1]
+; CHECK-NEXT:    mov z22.s, z7.s[1]
+; CHECK-NEXT:    mov z23.s, z6.s[1]
+; CHECK-NEXT:    zip1 z1.h, z1.h, z17.h
+; CHECK-NEXT:    zip1 z3.h, z3.h, z16.h
+; CHECK-NEXT:    zip1 z0.h, z0.h, z18.h
+; CHECK-NEXT:    zip1 z2.h, z2.h, z21.h
+; CHECK-NEXT:    zip1 z5.h, z5.h, z19.h
+; CHECK-NEXT:    zip1 z4.h, z4.h, z20.h
+; CHECK-NEXT:    zip1 z7.h, z7.h, z22.h
+; CHECK-NEXT:    zip1 z6.h, z6.h, z23.h
+; CHECK-NEXT:    zip1 z0.s, z0.s, z1.s
+; CHECK-NEXT:    zip1 z2.s, z2.s, z3.s
+; CHECK-NEXT:    zip1 z1.s, z4.s, z5.s
+; CHECK-NEXT:    zip1 z3.s, z6.s, z7.s
 ; CHECK-NEXT:    zip1 z0.d, z1.d, z0.d
-; CHECK-NEXT:    stp q0, q2, [x1]
+; CHECK-NEXT:    zip1 z1.d, z3.d, z2.d
+; CHECK-NEXT:    stp q1, q0, [x1]
 ; CHECK-NEXT:    ret
 ;
 ; NONEON-NOSVE-LABEL: fcvtzs_v16f64_v16i16:
diff --git a/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-permute-zip-uzp-trn.ll b/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-permute-zip-uzp-trn.ll
index 8b296d9fbc215d..168ad6a77badb4 100644
--- a/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-permute-zip-uzp-trn.ll
+++ b/llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-permute-zip-uzp-trn.ll
@@ -151,20 +151,20 @@ define void @zip_v32i16(ptr %a, ptr %b) {
 ; CHECK-NEXT:    .cfi_offset b13, -48
 ; CHECK-NEXT:    .cfi_offset b14, -56
 ; CHECK-NEXT:    .cfi_offset b15, -64
-; CHECK-NEXT:    ldp q0, q1, [x0]
+; CHECK-NEXT:    ldp q1, q0, [x0]
 ; CHECK-NEXT:    ldp q2, q3, [x1]
-; CHECK-NEXT:    mov z5.h, z1.h[7]
-; CHECK-NEXT:    mov z7.h, z1.h[6]
-; CHECK-NEXT:    mov z17.h, z1.h[5]
+; CHECK-NEXT:    mov z5.h, z0.h[7]
+; CHECK-NEXT:    mov z7.h, z0.h[6]
+; CHECK-NEXT:    mov z17.h, z0.h[5]
 ; CHECK-NEXT:    mov z4.h, z3.h[7]
 ; CHECK-NEXT:    mov z6.h, z3.h[6]
 ; CHECK-NEXT:    mov z16.h, z3.h[5]
 ; CHECK-NEXT:    mov z20.h, z2.h[7]
-; CHECK-NEXT:    mov z21.h, z0.h[7]
+; CHECK-NEXT:    mov z21.h, z1.h[7]
 ; CHECK-NEXT:    mov z18.h, z3.h[4]
-; CHECK-NEXT:    mov z19.h, z1.h[4]
+; CHECK-NEXT:    mov z19.h, z0.h[4]
 ; CHECK-NEXT:    mov z22.h, z2.h[6]
-; CHECK-NEXT:    mov z23.h, z0.h[6]
+; CHECK-NEXT:    mov z23.h, z1.h[6]
 ; CHECK-NEXT:    zip1 z24.h, z5.h, z4.h
 ; CHECK-NEXT:    zip1 z25.h, z7.h, z6.h
 ; CHECK-NEXT:    zip1 z17.h, z17.h, z16.h
@@ -176,10 +176,10 @@ define void @zip_v32i16(ptr %a, ptr %b) {
 ; CHECK-NEXT:    zip1 z22.h, z23.h, z22.h
 ; CHECK-NEXT:    mov z23.h, z2.h[5]
 ; CHECK-NEXT:    mov z21.h, z6.h[7]
-; CHECK-NEXT:    mov z24.h, z0.h[5]
+; CHECK-NEXT:    mov z24.h, z1.h[5]
 ; CHECK-NEXT:    mov z25.h, z2.h[4]
 ; CHECK-NEXT:    mov z20.h, z7.h[7]
-; CHECK-NEXT:    mov z26.h, z0.h[4]
+; CHECK-NEXT:    mov z26.h, z1.h[4]
 ; CHECK-NEXT:    mov z27.h, z6.h[6]
 ; CHECK-NEXT:    mov z28.h, z7.h[5]
 ; CHECK-NEXT:    mov z29.h, z6.h[5]
@@ -212,22 +212,22 @@ define void @zip_v32i16(ptr %a, ptr %b) {
 ; CHECK-NEXT:    zip1 z20.s, z28.s, z27.s
 ; CHECK-NEXT:    zip1 z16.s, z22.s, z16.s
 ; CHECK-NEXT:    zip1 z21.s, z24.s, z23.s
-; CHECK-NEXT:    zip1 z1.h, z1.h, z3.h
+; CHECK-NEXT:    zip1 z0.h, z0.h, z3.h
 ; CHECK-NEXT:    zip1 z3.s, z26.s, z25.s
 ; CHECK-NEXT:    zip1 z22.s, z30.s, z29.s
 ; CHECK-NEXT:    zip1 z6.h, z6.h, z7.h
 ; CHECK-NEXT:    zip1 z7.d, z17.d, z19.d
 ; CHECK-NEXT:    zip1 z17.d, z20.d, z18.d
-; CHECK-NEXT:    zip1 z0.h, z0.h, z2.h
+; CHECK-NEXT:    zip1 z1.h, z1.h, z2.h
 ; CHECK-NEXT:    zip1 z2.h, z4.h, z5.h
 ; CHECK-NEXT:    zip1 z4.d, z21.d, z16.d
 ; CHECK-NEXT:    zip1 z3.d, z22.d, z3.d
-; CHECK-NEXT:    add z1.h, z1.h, z6.h
+; CHECK-NEXT:    add z0.h, z0.h, z6.h
 ; CHECK-NEXT:    add z5.h, z7.h, z17.h
-; CHECK-NEXT:    add z0.h, z0.h, z2.h
+; CHECK-NEXT:    add z1.h, z1.h, z2.h
 ; CHECK-NEXT:    add z2.h, z4.h, z3.h
-; CHECK-NEXT:    stp q1, q5, [x0, #32]
-; CHECK-NEXT:    stp q0, q2, [x0]
+; CHECK-NEXT:    stp q0, q5, [x0, #32]
+; CHECK-NEXT:    stp q1, q2, [x0]
 ; CHECK-NEXT:    ldp d15, d14, [sp], #64 // 16-byte Folded Reload
 ; CHECK-NEXT:    ret
 ;
diff --git a/llvm/test/CodeGen/AArch64/vec-combine-compare-to-bitmask.ll b/llvm/test/CodeGen/AArch64/vec-combine-compare-to-bitmask.ll
index c9fe258f11556d..b0a30b71506374 100644
--- a/llvm/test/CodeGen/AArch64/vec-combine-compare-to-bitmask.ll
+++ b/llvm/test/CodeGen/AArch64/vec-combine-compare-to-bitmask.ll
@@ -378,36 +378,36 @@ define i4 @convert_to_bitmask_with_unknown_type_in_long_chain(<4 x i32> %vec1, <
 ; CHECK-GI-NEXT:    .cfi_def_cfa_offset 16
 ; CHECK-GI-NEXT:    mov w8, #1 ; =0x1
 ; CHECK-GI-NEXT:    mov w9, #0 ; =0x0
-; CHECK-GI-NEXT:    cmeq.4s v5, v0, #0
+; CHECK-GI-NEXT:    cmeq.4s v0, v0, #0
 ; CHECK-GI-NEXT:    fmov s2, w8
 ; CHECK-GI-NEXT:    fmov s4, w9
 ; CHECK-GI-NEXT:    cmeq.4s v1, v1, #0
 ; CHECK-GI-NEXT:    mov.16b v3, v2
-; CHECK-GI-NEXT:    mov.16b v0, v4
+; CHECK-GI-NEXT:    mov.16b v5, v4
 ; CHECK-GI-NEXT:    mov.h v4[1], w8
-; CHECK-GI-NEXT:    bic.16b v1, v1, v5
-; CHECK-GI-NEXT:    mov.16b v5, v2
+; CHECK-GI-NEXT:    bic.16b v0, v1, v0
+; CHECK-GI-NEXT:    mov.16b v1, v2
 ; CHECK-GI-NEXT:    mov.h v2[1], w8
 ; CHECK-GI-NEXT:    mov.h v3[1], w8
-; CHECK-GI-NEXT:    mov.h v0[1], w8
 ; CHECK-GI-NEXT:    mov.h v5[1], w8
+; CHECK-GI-NEXT:    mov.h v1[1], w8
 ; CHECK-GI-NEXT:    mov.h v4[2], w8
-; CHECK-GI-NEXT:    xtn.4h v1, v1
+; CHECK-GI-NEXT:    xtn.4h v0, v0
 ; CHECK-GI-NEXT:    mov.h v2[2], w8
 ; CHECK-GI-NEXT:    mov.h v3[2], w9
-; CHECK-GI-NEXT:    mov.h v0[2], w9
 ; CHECK-GI-NEXT:    mov.h v5[2], w9
+; CHECK-GI-NEXT:    mov.h v1[2], w9
 ; CHECK-GI-NEXT:    mov.h v4[3], w9
 ; CHECK-GI-NEXT:    mov.h v2[3], w9
 ; CHECK-GI-NEXT:    mov.h v3[3], w9
-; CHECK-GI-NEXT:    mov.h v0[3], w8
 ; CHECK-GI-NEXT:    mov.h v5[3], w8
-; CHECK-GI-NEXT:    orr.8b v1, v1, v3
-; CHECK-GI-NEXT:    eor.8b v0, v1, v0
-; CHECK-GI-NEXT:    eor.8b v1, v4, v1
-; CHECK-GI-NEXT:    and.8b v0, v0, v5
-; CHECK-GI-NEXT:    orr.8b v1, v2, v1
-; CHECK-GI-NEXT:    orr.8b v0, v0, v1
+; CHECK-GI-NEXT:    mov.h v1[3], w8
+; CHECK-GI-NEXT:    orr.8b v0, v0, v3
+; CHECK-GI-NEXT:    eor.8b v3, v0, v5
+; CHECK-GI-NEXT:    eor.8b v0, v4, v0
+; CHECK-GI-NEXT:    and.8b v1, v3, v1
+; CHECK-GI-NEXT:    orr.8b v0, v2, v0
+; CHECK-GI-NEXT:    orr.8b v0, v1, v0
 ; CHECK-GI-NEXT:    ushll.4s v0, v0, #0
 ; CHECK-GI-NEXT:    mov.s w8, v0[1]
 ; CHECK-GI-NEXT:    mov.s w9, v0[2]
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.init.whole.wave-w32.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.init.whole.wave-w32.ll
index 990a6066adcd3d..1bdaa4c98127d0 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.init.whole.wave-w32.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.init.whole.wave-w32.ll
@@ -428,18 +428,17 @@ define amdgpu_cs_chain void @control_flow(<3 x i32> inreg %sgpr, ptr inreg %call
 ; DAGISEL12-NEXT:    s_or_saveexec_b32 s8, -1
 ; DAGISEL12-NEXT:    s_wait_alu 0xfffe
 ; DAGISEL12-NEXT:    v_cndmask_b32_e64 v0, 0x47, v1, s8
-; DAGISEL12-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_2)
+; DAGISEL12-NEXT:    s_delay_alu instid0(VALU_DEP_1)
 ; DAGISEL12-NEXT:    v_cmp_ne_u32_e64 s9, 0, v0
 ; DAGISEL12-NEXT:    s_mov_b32 exec_lo, s8
 ; DAGISEL12-NEXT:    v_cmp_eq_u32_e32 vcc_lo, v13, v1
-; DAGISEL12-NEXT:    v_mov_b32_e32 v11, s9
 ; DAGISEL12-NEXT:    s_or_b32 s4, vcc_lo, s4
 ; DAGISEL12-NEXT:    s_wait_alu 0xfffe
 ; DAGISEL12-NEXT:    s_and_not1_b32 exec_lo, exec_lo, s4
 ; DAGISEL12-NEXT:    s_cbranch_execnz .LBB3_2
 ; DAGISEL12-NEXT:  ; %bb.3: ; %tail.loopexit
 ; DAGISEL12-NEXT:    s_or_b32 exec_lo, exec_lo, s4
-; DAGISEL12-NEXT:    v_add_nc_u32_e32 v10, 42, v1
+; DAGISEL12-NEXT:    v_dual_mov_b32 v11, s9 :: v_dual_add_nc_u32 v10, 42, v1
 ; DAGISEL12-NEXT:  .LBB3_4: ; %Flow1
 ; DAGISEL12-NEXT:    s_wait_alu 0xfffe
 ; DAGISEL12-NEXT:    s_or_b32 exec_lo, exec_lo, s3
@@ -530,13 +529,13 @@ define amdgpu_cs_chain void @control_flow(<3 x i32> inreg %sgpr, ptr inreg %call
 ; DAGISEL10-NEXT:    v_cmp_ne_u32_e64 s9, 0, v0
 ; DAGISEL10-NEXT:    s_mov_b32 exec_lo, s8
 ; DAGISEL10-NEXT:    v_cmp_eq_u32_e32 vcc_lo, v13, v1
-; DAGISEL10-NEXT:    v_mov_b32_e32 v11, s9
 ; DAGISEL10-NEXT:    s_or_b32 s4, vcc_lo, s4
 ; DAGISEL10-NEXT:    s_andn2_b32 exec_lo, exec_lo, s4
 ; DAGISEL10-NEXT:    s_cbranch_execnz .LBB3_2
 ; DAGISEL10-NEXT:  ; %bb.3: ; %tail.loopexit
 ; DAGISEL10-NEXT:    s_or_b32 exec_lo, exec_lo, s4
 ; DAGISEL10-NEXT:    v_add_nc_u32_e32 v10, 42, v1
+; DAGISEL10-NEXT:    v_mov_b32_e32 v11, s9
 ; DAGISEL10-NEXT:  .LBB3_4: ; %Flow1
 ; DAGISEL10-NEXT:    s_or_b32 exec_lo, exec_lo, s3
 ; DAGISEL10-NEXT:    s_mov_b32 s3, exec_lo
diff --git a/llvm/test/CodeGen/AMDGPU/mul_int24.ll b/llvm/test/CodeGen/AMDGPU/mul_int24.ll
index 4302810089f0c9..5452c80a373e6f 100644
--- a/llvm/test/CodeGen/AMDGPU/mul_int24.ll
+++ b/llvm/test/CodeGen/AMDGPU/mul_int24.ll
@@ -740,7 +740,7 @@ define amdgpu_kernel void @simplify_i24_crash(ptr addrspace(1) %out, i32 %arg0,
 ; EG:       ; %bb.0: ; %bb
 ; EG-NEXT:    ALU_PUSH_BEFORE 1, @6, KC0[CB0:0-32], KC1[]
 ; EG-NEXT:    JUMP @5 POP:1
-; EG-NEXT:    ALU 14, @8, KC0[CB0:0-32], KC1[]
+; EG-NEXT:    ALU 12, @8, KC0[CB0:0-32], KC1[]
 ; EG-NEXT:    MEM_RAT_CACHELESS STORE_RAW T0.XY, T1.X, 0
 ; EG-NEXT:    POP @5 POP:1
 ; EG-NEXT:    CF_END
@@ -748,27 +748,25 @@ define amdgpu_kernel void @simplify_i24_crash(ptr addrspace(1) %out, i32 %arg0,
 ; EG-NEXT:     SETNE_INT * T0.W, KC0[2].Z, 0.0,
 ; EG-NEXT:     PRED_SETE_INT * ExecMask,PredicateBit (MASKED), PV.W, 0.0,
 ; EG-NEXT:    ALU clause starting at 8:
-; EG-NEXT:     MOV T0.X, KC0[3].Y,
-; EG-NEXT:     MOV * T1.X, KC0[2].W,
-; EG-NEXT:     LSHL T0.W, PS, literal.x,
-; EG-NEXT:     LSHL * T1.W, PV.X, literal.x,
+; EG-NEXT:     MOV T0.W, KC0[3].Y,
+; EG-NEXT:     MOV * T1.W, KC0[2].W,
+; EG-NEXT:     LSHL T1.W, PS, literal.x,
+; EG-NEXT:     LSHL * T0.W, PV.W, literal.x,
 ; EG-NEXT:    8(1.121039e-44), 0(0.000000e+00)
-; EG-NEXT:     ASHR T1.W, PS, literal.x,
-; EG-NEXT:     ASHR * T0.W, PV.W, literal.x,
+; EG-NEXT:     ASHR T0.W, PS, literal.x,
+; EG-NEXT:     ASHR * T1.W, PV.W, literal.x,
 ; EG-NEXT:    8(1.121039e-44), 0(0.000000e+00)
 ; EG-NEXT:     MOV T2.W, KC0[2].Y,
 ; EG-NEXT:     MULLO_INT * T0.X, PS, PV.W,
 ; EG-NEXT:     LSHR T1.X, PV.W, literal.x,
-; EG-NEXT:     MOV T0.Y, PS,
-; EG-NEXT:     MOV T0.W, KC0[3].X,
-; EG-NEXT:     MOV * T0.W, KC0[3].Z,
+; EG-NEXT:     MOV * T0.Y, PS,
 ; EG-NEXT:    2(2.802597e-45), 0(0.000000e+00)
 ;
 ; CM-LABEL: simplify_i24_crash:
 ; CM:       ; %bb.0: ; %bb
 ; CM-NEXT:    ALU_PUSH_BEFORE 1, @6, KC0[CB0:0-32], KC1[]
 ; CM-NEXT:    JUMP @5 POP:1
-; CM-NEXT:    ALU 17, @8, KC0[CB0:0-32], KC1[]
+; CM-NEXT:    ALU 15, @8, KC0[CB0:0-32], KC1[]
 ; CM-NEXT:    MEM_RAT_CACHELESS STORE_DWORD T0, T1.X
 ; CM-NEXT:    POP @5 POP:1
 ; CM-NEXT:    CF_END
@@ -776,23 +774,21 @@ define amdgpu_kernel void @simplify_i24_crash(ptr addrspace(1) %out, i32 %arg0,
 ; CM-NEXT:     SETNE_INT * T0.W, KC0[2].Z, 0.0,
 ; CM-NEXT:     PRED_SETE_INT * ExecMask,PredicateBit (MASKED), PV.W, 0.0,
 ; CM-NEXT:    ALU clause starting at 8:
-; CM-NEXT:     MOV * T0.X, KC0[3].Y,
-; CM-NEXT:     MOV * T1.X, KC0[2].W,
-; CM-NEXT:     LSHL T0.Z, PV.X, literal.x,
-; CM-NEXT:     LSHL * T0.W, T0.X, literal.x,
+; CM-NEXT:     MOV T0.Z, KC0[3].Y,
+; CM-NEXT:     MOV * T0.W, KC0[2].W,
+; CM-NEXT:     LSHL T1.Z, PV.W, literal.x,
+; CM-NEXT:     LSHL * T0.W, PV.Z, literal.x,
 ; CM-NEXT:    8(1.121039e-44), 0(0.000000e+00)
 ; CM-NEXT:     MOV T0.Y, KC0[2].Y,
-; CM-NEXT:     ASHR T1.Z, PV.W, literal.x,
+; CM-NEXT:     ASHR T0.Z, PV.W, literal.x,
 ; CM-NEXT:     ASHR * T0.W, PV.Z, literal.x,
 ; CM-NEXT:    8(1.121039e-44), 0(0.000000e+00)
-; CM-NEXT:     MULLO_INT T0.X, T0.W, T1.Z,
-; CM-NEXT:     MULLO_INT T0.Y (MASKED), T0.W, T1.Z,
-; CM-NEXT:     MULLO_INT T0.Z (MASKED), T0.W, T1.Z,
-; CM-NEXT:     MULLO_INT * T0.W (MASKED), T0.W, T1.Z,
+; CM-NEXT:     MULLO_INT T0.X, T0.W, T0.Z,
+; CM-NEXT:     MULLO_INT T0.Y (MASKED), T0.W, T0.Z,
+; CM-NEXT:     MULLO_INT T0.Z (MASKED), T0.W, T0.Z,
+; CM-NEXT:     MULLO_INT * T0.W (MASKED), T0.W, T0.Z,
 ; CM-NEXT:     LSHR T1.X, T0.Y, literal.x,
-; CM-NEXT:     MOV T0.Y, PV.X,
-; CM-NEXT:     MOV T0.Z, KC0[3].X,
-; CM-NEXT:     MOV * T0.W, KC0[3].Z,
+; CM-NEXT:     MOV * T0.Y, PV.X,
 ; CM-NEXT:    2(2.802597e-45), 0(0.000000e+00)
 bb:
   %cmp = icmp eq i32 %arg0, 0
diff --git a/llvm/test/CodeGen/AMDGPU/peephole-opt-fold-reg-sequence-subreg.mir b/llvm/test/CodeGen/AMDGPU/peephole-opt-fold-reg-sequence-subreg.mir
new file mode 100644
index 00000000000000..d32163bb692357
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/peephole-opt-fold-reg-sequence-subreg.mir
@@ -0,0 +1,189 @@
+# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
+# RUN: llc -mtriple=amdgcn--amdhsa -mcpu=gfx940 -run-pass=peephole-opt -o - %s | FileCheck %s
+
+---
+name:            reg_sequence_extract_subreg_sub0_from_regsequence_sub0_sub1
+tracksRegLiveness: true
+body:             |
+  bb.0:
+    liveins: $vgpr0_vgpr1
+
+    ; CHECK-LABEL: name: reg_sequence_extract_subreg_sub0_from_regsequence_sub0_sub1
+    ; CHECK: liveins: $vgpr0_vgpr1
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:vreg_64_align2 = COPY $vgpr0_vgpr1
+    ; CHECK-NEXT: [[V_MOV_B32_e32_:%[0-9]+]]:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
+    ; CHECK-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_64_align2 = REG_SEQUENCE [[COPY]].sub0, %subreg.sub0, [[V_MOV_B32_e32_]], %subreg.sub1
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:vgpr_32 = COPY [[REG_SEQUENCE]].sub0
+    ; CHECK-NEXT: S_ENDPGM 0, implicit [[COPY1]]
+    %0:vreg_64_align2 = COPY $vgpr0_vgpr1
+    %1:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
+    %2:vreg_64_align2 = REG_SEQUENCE %0.sub0, %subreg.sub0, %1, %subreg.sub1
+    %3:vgpr_32 = COPY %2.sub0
+    S_ENDPGM 0, implicit %3
+
+...
+
+---
+name:            reg_sequence_extract_subreg_sub1_from_regsequence_sub0_sub1
+tracksRegLiveness: true
+body:             |
+  bb.0:
+    liveins: $vgpr0_vgpr1
+
+    ; CHECK-LABEL: name: reg_sequence_extract_subreg_sub1_from_regsequence_sub0_sub1
+    ; CHECK: liveins: $vgpr0_vgpr1
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:vreg_64_align2 = COPY $vgpr0_vgpr1
+    ; CHECK-NEXT: [[V_MOV_B32_e32_:%[0-9]+]]:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
+    ; CHECK-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_64_align2 = REG_SEQUENCE [[COPY]].sub0, %subreg.sub0, [[V_MOV_B32_e32_]], %subreg.sub1
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:vgpr_32 = COPY [[V_MOV_B32_e32_]]
+    ; CHECK-NEXT: S_ENDPGM 0, implicit [[COPY1]]
+    %0:vreg_64_align2 = COPY $vgpr0_vgpr1
+    %1:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
+    %2:vreg_64_align2 = REG_SEQUENCE %0.sub0, %subreg.sub0, %1, %subreg.sub1
+    %3:vgpr_32 = COPY %2.sub1
+    S_ENDPGM 0, implicit %3
+
+...
+
+---
+name:            reg_sequence_extract_subreg_sub0_from_regsequence_sub1_sub0
+tracksRegLiveness: true
+body:             |
+  bb.0:
+    liveins: $vgpr0_vgpr1
+
+    ; CHECK-LABEL: name: reg_sequence_extract_subreg_sub0_from_regsequence_sub1_sub0
+    ; CHECK: liveins: $vgpr0_vgpr1
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:vreg_64_align2 = COPY $vgpr0_vgpr1
+    ; CHECK-NEXT: [[V_MOV_B32_e32_:%[0-9]+]]:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
+    ; CHECK-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_64_align2 = REG_SEQUENCE [[V_MOV_B32_e32_]], %subreg.sub1, [[COPY]].sub0, %subreg.sub0
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:vgpr_32 = COPY [[REG_SEQUENCE]].sub0
+    ; CHECK-NEXT: S_ENDPGM 0, implicit [[COPY1]]
+    %0:vreg_64_align2 = COPY $vgpr0_vgpr1
+    %1:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
+    %2:vreg_64_align2 = REG_SEQUENCE %1, %subreg.sub1, %0.sub0, %subreg.sub0
+    %3:vgpr_32 = COPY %2.sub0
+    S_ENDPGM 0, implicit %3
+
+...
+
+---
+name:            reg_sequence_extract_subreg_sub1_from_regsequence_sub1_sub0
+tracksRegLiveness: true
+body:             |
+  bb.0:
+    liveins: $vgpr0_vgpr1
+
+    ; CHECK-LABEL: name: reg_sequence_extract_subreg_sub1_from_regsequence_sub1_sub0
+    ; CHECK: liveins: $vgpr0_vgpr1
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:vreg_64_align2 = COPY $vgpr0_vgpr1
+    ; CHECK-NEXT: [[V_MOV_B32_e32_:%[0-9]+]]:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
+    ; CHECK-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_64_align2 = REG_SEQUENCE [[V_MOV_B32_e32_]], %subreg.sub1, [[COPY]].sub0, %subreg.sub0
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:vgpr_32 = COPY [[V_MOV_B32_e32_]]
+    ; CHECK-NEXT: S_ENDPGM 0, implicit [[COPY1]]
+    %0:vreg_64_align2 = COPY $vgpr0_vgpr1
+    %1:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
+    %2:vreg_64_align2 = REG_SEQUENCE %1, %subreg.sub1, %0.sub0, %subreg.sub0
+    %3:vgpr_32 = COPY %2.sub1
+    S_ENDPGM 0, implicit %3
+
+...
+
+---
+name:            reg_sequence_extract_subreg_sub0_from_vreg96
+tracksRegLiveness: true
+body:             |
+  bb.0:
+    liveins: $vgpr1_vgpr2
+
+    ; CHECK-LABEL: name: reg_sequence_extract_subreg_sub0_from_vreg96
+    ; CHECK: liveins: $vgpr1_vgpr2
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:vreg_64 = COPY $vgpr1_vgpr2
+    ; CHECK-NEXT: [[V_MOV_B32_e32_:%[0-9]+]]:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
+    ; CHECK-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_96 = REG_SEQUENCE [[COPY]].sub0, %subreg.sub0, [[V_MOV_B32_e32_]], %subreg.sub1, [[COPY]].sub1, %subreg.sub2
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:vgpr_32 = COPY [[REG_SEQUENCE]].sub0
+    ; CHECK-NEXT: S_ENDPGM 0, implicit [[COPY1]]
+    %0:vreg_64 = COPY $vgpr1_vgpr2
+    %1:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
+    %2:vreg_96 = REG_SEQUENCE %0.sub0, %subreg.sub0, %1, %subreg.sub1, %0.sub1, %subreg.sub2
+    %3:vgpr_32 = COPY %2.sub0
+    S_ENDPGM 0, implicit %3
+
+...
+
+---
+name:            reg_sequence_extract_subreg_sub1_from_vreg96
+tracksRegLiveness: true
+body:             |
+  bb.0:
+    liveins: $vgpr1_vgpr2
+
+    ; CHECK-LABEL: name: reg_sequence_extract_subreg_sub1_from_vreg96
+    ; CHECK: liveins: $vgpr1_vgpr2
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:vreg_64 = COPY $vgpr1_vgpr2
+    ; CHECK-NEXT: [[V_MOV_B32_e32_:%[0-9]+]]:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
+    ; CHECK-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_96 = REG_SEQUENCE [[COPY]].sub0, %subreg.sub0, [[V_MOV_B32_e32_]], %subreg.sub1, [[COPY]].sub1, %subreg.sub2
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:vgpr_32 = COPY [[REG_SEQUENCE]].sub0
+    ; CHECK-NEXT: S_ENDPGM 0, implicit [[COPY1]]
+    %0:vreg_64 = COPY $vgpr1_vgpr2
+    %1:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
+    %2:vreg_96 = REG_SEQUENCE %0.sub0, %subreg.sub0, %1, %subreg.sub1, %0.sub1, %subreg.sub2
+    %3:vgpr_32 = COPY %2.sub0
+    S_ENDPGM 0, implicit %3
+
+...
+
+---
+name:            reg_sequence_compose_0
+tracksRegLiveness: true
+body:             |
+  bb.0:
+    liveins: $vgpr1_vgpr2
+
+    ; CHECK-LABEL: name: reg_sequence_compose_0
+    ; CHECK: liveins: $vgpr1_vgpr2
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:vreg_64 = COPY $vgpr1_vgpr2
+    ; CHECK-NEXT: [[V_MOV_B32_e32_:%[0-9]+]]:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
+    ; CHECK-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_96 = REG_SEQUENCE [[COPY]], %subreg.sub0_sub1, [[V_MOV_B32_e32_]], %subreg.sub2
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:vgpr_32 = COPY [[REG_SEQUENCE]].sub1
+    ; CHECK-NEXT: S_ENDPGM 0, implicit [[COPY1]]
+    %0:vreg_64 = COPY $vgpr1_vgpr2
+    %1:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
+    %2:vreg_96 = REG_SEQUENCE %0, %subreg.sub0_sub1, %1, %subreg.sub2
+    %3:vgpr_32 = COPY %2.sub1
+    S_ENDPGM 0, implicit %3
+
+...
+
+---
+name:            reg_sequence_compose_1
+tracksRegLiveness: true
+body:             |
+  bb.0:
+    liveins: $vgpr1_vgpr2_vgpr3_vgpr4
+
+    ; CHECK-LABEL: name: reg_sequence_compose_1
+    ; CHECK: liveins: $vgpr1_vgpr2_vgpr3_vgpr4
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:vreg_128 = COPY $vgpr1_vgpr2_vgpr3_vgpr4
+    ; CHECK-NEXT: [[V_MOV_B32_e32_:%[0-9]+]]:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
+    ; CHECK-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_96 = REG_SEQUENCE [[COPY]].sub2_sub3, %subreg.sub0_sub1, [[COPY]].sub0, %subreg.sub2
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:vgpr_32 = COPY [[REG_SEQUENCE]].sub1
+    ; CHECK-NEXT: S_ENDPGM 0, implicit [[COPY1]]
+    %0:vreg_128 = COPY $vgpr1_vgpr2_vgpr3_vgpr4
+    %1:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
+    %2:vreg_96 = REG_SEQUENCE %0.sub2_sub3, %subreg.sub0_sub1, %0.sub0, %subreg.sub2
+    %3:vgpr_32 = COPY %2.sub1
+    S_ENDPGM 0, implicit %3
+
+...
+
+
+
diff --git a/llvm/test/CodeGen/ARM/aes-erratum-fix.ll b/llvm/test/CodeGen/ARM/aes-erratum-fix.ll
index 43c403fe6d64df..82f5bfd02a56e1 100644
--- a/llvm/test/CodeGen/ARM/aes-erratum-fix.ll
+++ b/llvm/test/CodeGen/ARM/aes-erratum-fix.ll
@@ -1447,12 +1447,13 @@ define arm_aapcs_vfpcc void @aese_setf16_cond_via_ptr(i1 zeroext %0, ptr %1, <16
 ; CHECK-CORTEX-FIX-NEXT:    .pad #24
 ; CHECK-CORTEX-FIX-NEXT:    sub sp, sp, #24
 ; CHECK-CORTEX-FIX-NEXT:    cmp r0, #0
-; CHECK-CORTEX-FIX-NEXT:    beq .LBB36_2
+; CHECK-CORTEX-FIX-NEXT:    beq .LBB36_3
 ; CHECK-CORTEX-FIX-NEXT:  @ %bb.1:
 ; CHECK-CORTEX-FIX-NEXT:    vld1.64 {d16, d17}, [r2]
 ; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r3, d16[1]
 ; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r6, d17[0]
 ; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r7, d17[2]
+; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r4, d17[3]
 ; CHECK-CORTEX-FIX-NEXT:    str r3, [sp, #20] @ 4-byte Spill
 ; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r3, d16[2]
 ; CHECK-CORTEX-FIX-NEXT:    str r3, [sp, #8] @ 4-byte Spill
@@ -1463,37 +1464,37 @@ define arm_aapcs_vfpcc void @aese_setf16_cond_via_ptr(i1 zeroext %0, ptr %1, <16
 ; CHECK-CORTEX-FIX-NEXT:    ldrh r3, [r1]
 ; CHECK-CORTEX-FIX-NEXT:    str r3, [sp, #16] @ 4-byte Spill
 ; CHECK-CORTEX-FIX-NEXT:    mov r3, r6
-; CHECK-CORTEX-FIX-NEXT:    b .LBB36_3
+; CHECK-CORTEX-FIX-NEXT:    cmp r0, #0
+; CHECK-CORTEX-FIX-NEXT:    bne .LBB36_4
 ; CHECK-CORTEX-FIX-NEXT:  .LBB36_2:
-; CHECK-CORTEX-FIX-NEXT:    add r3, r2, #8
+; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r0, d0[0]
+; CHECK-CORTEX-FIX-NEXT:    b .LBB36_5
+; CHECK-CORTEX-FIX-NEXT:  .LBB36_3:
 ; CHECK-CORTEX-FIX-NEXT:    vld1.32 {d16[0]}, [r2:32]
+; CHECK-CORTEX-FIX-NEXT:    add r3, r2, #8
 ; CHECK-CORTEX-FIX-NEXT:    add r7, r2, #4
 ; CHECK-CORTEX-FIX-NEXT:    vld1.32 {d17[0]}, [r3:32]
 ; CHECK-CORTEX-FIX-NEXT:    add r3, r2, #12
 ; CHECK-CORTEX-FIX-NEXT:    vld1.32 {d16[1]}, [r7:32]
 ; CHECK-CORTEX-FIX-NEXT:    vld1.32 {d17[1]}, [r3:32]
 ; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r3, d16[0]
+; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r7, d17[1]
+; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r4, d17[3]
 ; CHECK-CORTEX-FIX-NEXT:    str r3, [sp, #16] @ 4-byte Spill
 ; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r3, d16[1]
-; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r7, d17[1]
-; CHECK-CORTEX-FIX-NEXT:    str r3, [sp, #20] @ 4-byte Spill
-; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r3, d16[2]
 ; CHECK-CORTEX-FIX-NEXT:    str r7, [sp, #12] @ 4-byte Spill
 ; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r7, d17[2]
+; CHECK-CORTEX-FIX-NEXT:    str r3, [sp, #20] @ 4-byte Spill
+; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r3, d16[2]
 ; CHECK-CORTEX-FIX-NEXT:    str r3, [sp, #8] @ 4-byte Spill
 ; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r3, d16[3]
 ; CHECK-CORTEX-FIX-NEXT:    str r3, [sp, #4] @ 4-byte Spill
 ; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r3, d17[0]
-; CHECK-CORTEX-FIX-NEXT:  .LBB36_3:
-; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r4, d17[3]
 ; CHECK-CORTEX-FIX-NEXT:    cmp r0, #0
-; CHECK-CORTEX-FIX-NEXT:    beq .LBB36_5
-; CHECK-CORTEX-FIX-NEXT:  @ %bb.4:
+; CHECK-CORTEX-FIX-NEXT:    beq .LBB36_2
+; CHECK-CORTEX-FIX-NEXT:  .LBB36_4:
 ; CHECK-CORTEX-FIX-NEXT:    ldrh r0, [r1]
-; CHECK-CORTEX-FIX-NEXT:    b .LBB36_6
 ; CHECK-CORTEX-FIX-NEXT:  .LBB36_5:
-; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r0, d0[0]
-; CHECK-CORTEX-FIX-NEXT:  .LBB36_6:
 ; CHECK-CORTEX-FIX-NEXT:    str r0, [sp] @ 4-byte Spill
 ; CHECK-CORTEX-FIX-NEXT:    ldr r0, [sp, #8] @ 4-byte Reload
 ; CHECK-CORTEX-FIX-NEXT:    ldr r1, [sp, #4] @ 4-byte Reload
@@ -1695,8 +1696,8 @@ define arm_aapcs_vfpcc void @aese_setf16_cond_via_val(i1 zeroext %0, half %1, <1
 ; CHECK-CORTEX-FIX-NEXT:    vmov s0, lr
 ; CHECK-CORTEX-FIX-NEXT:    b .LBB37_5
 ; CHECK-CORTEX-FIX-NEXT:  .LBB37_3:
-; CHECK-CORTEX-FIX-NEXT:    add r2, r1, #8
 ; CHECK-CORTEX-FIX-NEXT:    vld1.32 {d16[0]}, [r1:32]
+; CHECK-CORTEX-FIX-NEXT:    add r2, r1, #8
 ; CHECK-CORTEX-FIX-NEXT:    add r3, r1, #4
 ; CHECK-CORTEX-FIX-NEXT:    vld1.32 {d17[0]}, [r2:32]
 ; CHECK-CORTEX-FIX-NEXT:    add r2, r1, #12
@@ -1706,11 +1707,11 @@ define arm_aapcs_vfpcc void @aese_setf16_cond_via_val(i1 zeroext %0, half %1, <1
 ; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r2, d16[0]
 ; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r7, d16[2]
 ; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r10, d16[3]
-; CHECK-CORTEX-FIX-NEXT:    str r3, [sp, #8] @ 4-byte Spill
-; CHECK-CORTEX-FIX-NEXT:    vmov s2, r2
-; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r3, d17[0]
 ; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r11, d17[2]
 ; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r6, d17[3]
+; CHECK-CORTEX-FIX-NEXT:    str r3, [sp, #8] @ 4-byte Spill
+; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r3, d17[0]
+; CHECK-CORTEX-FIX-NEXT:    vmov s2, r2
 ; CHECK-CORTEX-FIX-NEXT:    str r3, [sp, #4] @ 4-byte Spill
 ; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r3, d17[1]
 ; CHECK-CORTEX-FIX-NEXT:    str r3, [sp] @ 4-byte Spill
@@ -3600,12 +3601,13 @@ define arm_aapcs_vfpcc void @aesd_setf16_cond_via_ptr(i1 zeroext %0, ptr %1, <16
 ; CHECK-CORTEX-FIX-NEXT:    .pad #24
 ; CHECK-CORTEX-FIX-NEXT:    sub sp, sp, #24
 ; CHECK-CORTEX-FIX-NEXT:    cmp r0, #0
-; CHECK-CORTEX-FIX-NEXT:    beq .LBB82_2
+; CHECK-CORTEX-FIX-NEXT:    beq .LBB82_3
 ; CHECK-CORTEX-FIX-NEXT:  @ %bb.1:
 ; CHECK-CORTEX-FIX-NEXT:    vld1.64 {d16, d17}, [r2]
 ; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r3, d16[1]
 ; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r6, d17[0]
 ; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r7, d17[2]
+; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r4, d17[3]
 ; CHECK-CORTEX-FIX-NEXT:    str r3, [sp, #20] @ 4-byte Spill
 ; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r3, d16[2]
 ; CHECK-CORTEX-FIX-NEXT:    str r3, [sp, #8] @ 4-byte Spill
@@ -3616,37 +3618,37 @@ define arm_aapcs_vfpcc void @aesd_setf16_cond_via_ptr(i1 zeroext %0, ptr %1, <16
 ; CHECK-CORTEX-FIX-NEXT:    ldrh r3, [r1]
 ; CHECK-CORTEX-FIX-NEXT:    str r3, [sp, #16] @ 4-byte Spill
 ; CHECK-CORTEX-FIX-NEXT:    mov r3, r6
-; CHECK-CORTEX-FIX-NEXT:    b .LBB82_3
+; CHECK-CORTEX-FIX-NEXT:    cmp r0, #0
+; CHECK-CORTEX-FIX-NEXT:    bne .LBB82_4
 ; CHECK-CORTEX-FIX-NEXT:  .LBB82_2:
-; CHECK-CORTEX-FIX-NEXT:    add r3, r2, #8
+; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r0, d0[0]
+; CHECK-CORTEX-FIX-NEXT:    b .LBB82_5
+; CHECK-CORTEX-FIX-NEXT:  .LBB82_3:
 ; CHECK-CORTEX-FIX-NEXT:    vld1.32 {d16[0]}, [r2:32]
+; CHECK-CORTEX-FIX-NEXT:    add r3, r2, #8
 ; CHECK-CORTEX-FIX-NEXT:    add r7, r2, #4
 ; CHECK-CORTEX-FIX-NEXT:    vld1.32 {d17[0]}, [r3:32]
 ; CHECK-CORTEX-FIX-NEXT:    add r3, r2, #12
 ; CHECK-CORTEX-FIX-NEXT:    vld1.32 {d16[1]}, [r7:32]
 ; CHECK-CORTEX-FIX-NEXT:    vld1.32 {d17[1]}, [r3:32]
 ; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r3, d16[0]
+; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r7, d17[1]
+; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r4, d17[3]
 ; CHECK-CORTEX-FIX-NEXT:    str r3, [sp, #16] @ 4-byte Spill
 ; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r3, d16[1]
-; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r7, d17[1]
-; CHECK-CORTEX-FIX-NEXT:    str r3, [sp, #20] @ 4-byte Spill
-; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r3, d16[2]
 ; CHECK-CORTEX-FIX-NEXT:    str r7, [sp, #12] @ 4-byte Spill
 ; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r7, d17[2]
+; CHECK-CORTEX-FIX-NEXT:    str r3, [sp, #20] @ 4-byte Spill
+; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r3, d16[2]
 ; CHECK-CORTEX-FIX-NEXT:    str r3, [sp, #8] @ 4-byte Spill
 ; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r3, d16[3]
 ; CHECK-CORTEX-FIX-NEXT:    str r3, [sp, #4] @ 4-byte Spill
 ; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r3, d17[0]
-; CHECK-CORTEX-FIX-NEXT:  .LBB82_3:
-; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r4, d17[3]
 ; CHECK-CORTEX-FIX-NEXT:    cmp r0, #0
-; CHECK-CORTEX-FIX-NEXT:    beq .LBB82_5
-; CHECK-CORTEX-FIX-NEXT:  @ %bb.4:
+; CHECK-CORTEX-FIX-NEXT:    beq .LBB82_2
+; CHECK-CORTEX-FIX-NEXT:  .LBB82_4:
 ; CHECK-CORTEX-FIX-NEXT:    ldrh r0, [r1]
-; CHECK-CORTEX-FIX-NEXT:    b .LBB82_6
 ; CHECK-CORTEX-FIX-NEXT:  .LBB82_5:
-; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r0, d0[0]
-; CHECK-CORTEX-FIX-NEXT:  .LBB82_6:
 ; CHECK-CORTEX-FIX-NEXT:    str r0, [sp] @ 4-byte Spill
 ; CHECK-CORTEX-FIX-NEXT:    ldr r0, [sp, #8] @ 4-byte Reload
 ; CHECK-CORTEX-FIX-NEXT:    ldr r1, [sp, #4] @ 4-byte Reload
@@ -3848,8 +3850,8 @@ define arm_aapcs_vfpcc void @aesd_setf16_cond_via_val(i1 zeroext %0, half %1, <1
 ; CHECK-CORTEX-FIX-NEXT:    vmov s0, lr
 ; CHECK-CORTEX-FIX-NEXT:    b .LBB83_5
 ; CHECK-CORTEX-FIX-NEXT:  .LBB83_3:
-; CHECK-CORTEX-FIX-NEXT:    add r2, r1, #8
 ; CHECK-CORTEX-FIX-NEXT:    vld1.32 {d16[0]}, [r1:32]
+; CHECK-CORTEX-FIX-NEXT:    add r2, r1, #8
 ; CHECK-CORTEX-FIX-NEXT:    add r3, r1, #4
 ; CHECK-CORTEX-FIX-NEXT:    vld1.32 {d17[0]}, [r2:32]
 ; CHECK-CORTEX-FIX-NEXT:    add r2, r1, #12
@@ -3859,11 +3861,11 @@ define arm_aapcs_vfpcc void @aesd_setf16_cond_via_val(i1 zeroext %0, half %1, <1
 ; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r2, d16[0]
 ; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r7, d16[2]
 ; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r10, d16[3]
-; CHECK-CORTEX-FIX-NEXT:    str r3, [sp, #8] @ 4-byte Spill
-; CHECK-CORTEX-FIX-NEXT:    vmov s2, r2
-; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r3, d17[0]
 ; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r11, d17[2]
 ; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r6, d17[3]
+; CHECK-CORTEX-FIX-NEXT:    str r3, [sp, #8] @ 4-byte Spill
+; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r3, d17[0]
+; CHECK-CORTEX-FIX-NEXT:    vmov s2, r2
 ; CHECK-CORTEX-FIX-NEXT:    str r3, [sp, #4] @ 4-byte Spill
 ; CHECK-CORTEX-FIX-NEXT:    vmov.u16 r3, d17[1]
 ; CHECK-CORTEX-FIX-NEXT:    str r3, [sp] @ 4-byte Spill
diff --git a/llvm/test/CodeGen/ARM/arm-bf16-dotprod-intrinsics.ll b/llvm/test/CodeGen/ARM/arm-bf16-dotprod-intrinsics.ll
index 522726c4e8890a..f325fef03f936c 100644
--- a/llvm/test/CodeGen/ARM/arm-bf16-dotprod-intrinsics.ll
+++ b/llvm/test/CodeGen/ARM/arm-bf16-dotprod-intrinsics.ll
@@ -64,7 +64,6 @@ entry:
 define <4 x float> @test_vbfdotq_lane_f32(<4 x float> %r, <8 x bfloat> %a, <4 x bfloat> %b) {
 ; CHECK-LABEL: test_vbfdotq_lane_f32:
 ; CHECK:       @ %bb.0: @ %entry
-; CHECK-NEXT:    @ kill: def $d4 killed $d4 def $q2
 ; CHECK-NEXT:    vdot.bf16 q0, q1, d4[0]
 ; CHECK-NEXT:    bx lr
 entry:
@@ -108,7 +107,6 @@ entry:
 define <4 x float> @test_vbfmlalbq_lane_f32(<4 x float> %r, <8 x bfloat> %a, <4 x bfloat> %b) {
 ; CHECK-LABEL: test_vbfmlalbq_lane_f32:
 ; CHECK:       @ %bb.0: @ %entry
-; CHECK-NEXT:    @ kill: def $d4 killed $d4 def $q2
 ; CHECK-NEXT:    vfmab.bf16 q0, q1, d4[0]
 ; CHECK-NEXT:    bx lr
 entry:
@@ -131,7 +129,6 @@ entry:
 define <4 x float> @test_vbfmlaltq_lane_f32(<4 x float> %r, <8 x bfloat> %a, <4 x bfloat> %b) {
 ; CHECK-LABEL: test_vbfmlaltq_lane_f32:
 ; CHECK:       @ %bb.0: @ %entry
-; CHECK-NEXT:    @ kill: def $d4 killed $d4 def $q2
 ; CHECK-NEXT:    vfmat.bf16 q0, q1, d4[0]
 ; CHECK-NEXT:    bx lr
 entry:
diff --git a/llvm/test/CodeGen/ARM/armv8.2a-fp16-vector-intrinsics.ll b/llvm/test/CodeGen/ARM/armv8.2a-fp16-vector-intrinsics.ll
index 9570c70676dbb4..7b130f00078096 100644
--- a/llvm/test/CodeGen/ARM/armv8.2a-fp16-vector-intrinsics.ll
+++ b/llvm/test/CodeGen/ARM/armv8.2a-fp16-vector-intrinsics.ll
@@ -1167,7 +1167,6 @@ entry:
 define dso_local <8 x half> @test_vmulq_lane_f16(<8 x half> %a, <4 x half> %b) {
 ; CHECK-LABEL: test_vmulq_lane_f16:
 ; CHECK:       @ %bb.0: @ %entry
-; CHECK-NEXT:    @ kill: def $d2 killed $d2 def $q1
 ; CHECK-NEXT:    vmul.f16 q0, q0, d2[3]
 ; CHECK-NEXT:    bx lr
 entry:
@@ -1389,7 +1388,6 @@ entry:
 define dso_local <8 x half> @test_vdupq_lane_f16(<4 x half> %a) {
 ; CHECK-LABEL: test_vdupq_lane_f16:
 ; CHECK:       @ %bb.0: @ %entry
-; CHECK-NEXT:    @ kill: def $d0 killed $d0 def $q0
 ; CHECK-NEXT:    vdup.16 q0, d0[3]
 ; CHECK-NEXT:    bx lr
 entry:
diff --git a/llvm/test/CodeGen/ARM/bf16-create-get-set-dup.ll b/llvm/test/CodeGen/ARM/bf16-create-get-set-dup.ll
index b16993d9875bae..39416ccdd7765f 100644
--- a/llvm/test/CodeGen/ARM/bf16-create-get-set-dup.ll
+++ b/llvm/test/CodeGen/ARM/bf16-create-get-set-dup.ll
@@ -53,7 +53,6 @@ entry:
 define arm_aapcs_vfpcc <8 x bfloat> @test_vdupq_lane_bf16(<4 x bfloat> %v) {
 ; CHECK-LABEL: test_vdupq_lane_bf16:
 ; CHECK:       @ %bb.0: @ %entry
-; CHECK-NEXT:    @ kill: def $d0 killed $d0 def $q0
 ; CHECK-NEXT:    vdup.16 q0, d0[1]
 ; CHECK-NEXT:    bx lr
 entry:
diff --git a/llvm/test/CodeGen/ARM/bf16-shuffle.ll b/llvm/test/CodeGen/ARM/bf16-shuffle.ll
index 9968e7887f4b37..a45ad8f698b232 100644
--- a/llvm/test/CodeGen/ARM/bf16-shuffle.ll
+++ b/llvm/test/CodeGen/ARM/bf16-shuffle.ll
@@ -229,7 +229,6 @@ entry:
 define dso_local <8 x bfloat> @test_vdupq_lane_bf16(<4 x bfloat> %a) {
 ; CHECK-LABEL: test_vdupq_lane_bf16:
 ; CHECK:       @ %bb.0: @ %entry
-; CHECK-NEXT:    @ kill: def $d0 killed $d0 def $q0
 ; CHECK-NEXT:    vdup.16 q0, d0[3]
 ; CHECK-NEXT:    bx lr
 entry:
diff --git a/llvm/test/CodeGen/ARM/minnum-maxnum-intrinsics.ll b/llvm/test/CodeGen/ARM/minnum-maxnum-intrinsics.ll
index dd33b09fe83004..ec7516524ee671 100644
--- a/llvm/test/CodeGen/ARM/minnum-maxnum-intrinsics.ll
+++ b/llvm/test/CodeGen/ARM/minnum-maxnum-intrinsics.ll
@@ -610,20 +610,20 @@ define <4 x float> @fminnumv432_one_zero_intrinsic(<4 x float> %x) {
 ; ARMV7-LABEL: fminnumv432_one_zero_intrinsic:
 ; ARMV7:       @ %bb.0:
 ; ARMV7-NEXT:    vmov d1, r2, r3
-; ARMV7-NEXT:    vldr s8, .LCPI18_0
+; ARMV7-NEXT:    vldr s4, .LCPI18_0
 ; ARMV7-NEXT:    vmov d0, r0, r1
-; ARMV7-NEXT:    vmov.f32 s10, #-1.000000e+00
+; ARMV7-NEXT:    vmov.f32 s6, #-1.000000e+00
 ; ARMV7-NEXT:    vcmp.f32 s1, #0
 ; ARMV7-NEXT:    vmrs APSR_nzcv, fpscr
-; ARMV7-NEXT:    vmov.f32 s4, s3
-; ARMV7-NEXT:    vmin.f32 d6, d2, d5
-; ARMV7-NEXT:    vmin.f32 d3, d1, d5
-; ARMV7-NEXT:    vmin.f32 d2, d0, d5
-; ARMV7-NEXT:    vmovlt.f32 s8, s1
-; ARMV7-NEXT:    vmov.f32 s5, s8
-; ARMV7-NEXT:    vmov.f32 s7, s12
-; ARMV7-NEXT:    vmov r0, r1, d2
-; ARMV7-NEXT:    vmov r2, r3, d3
+; ARMV7-NEXT:    vmov.f32 s8, s3
+; ARMV7-NEXT:    vmin.f32 d7, d1, d3
+; ARMV7-NEXT:    vmin.f32 d6, d0, d3
+; ARMV7-NEXT:    vmin.f32 d4, d4, d3
+; ARMV7-NEXT:    vmovlt.f32 s4, s1
+; ARMV7-NEXT:    vmov.f32 s13, s4
+; ARMV7-NEXT:    vmov.f32 s15, s8
+; ARMV7-NEXT:    vmov r0, r1, d6
+; ARMV7-NEXT:    vmov r2, r3, d7
 ; ARMV7-NEXT:    bx lr
 ; ARMV7-NEXT:    .p2align 2
 ; ARMV7-NEXT:  @ %bb.1:
diff --git a/llvm/test/CodeGen/ARM/neon-copy.ll b/llvm/test/CodeGen/ARM/neon-copy.ll
index e356b7e2181da0..9fdadab39060fb 100644
--- a/llvm/test/CodeGen/ARM/neon-copy.ll
+++ b/llvm/test/CodeGen/ARM/neon-copy.ll
@@ -775,7 +775,6 @@ define <2 x i32> @test_vdup_lane_s32(<2 x i32> %v1) #0 {
 define <16 x i8> @test_vdupq_lane_s8(<8 x i8> %v1) #0 {
 ; CHECK-LABEL: test_vdupq_lane_s8:
 ; CHECK:       @ %bb.0:
-; CHECK-NEXT:    @ kill: def $d0 killed $d0 def $q0
 ; CHECK-NEXT:    vdup.8 q0, d0[5]
 ; CHECK-NEXT:    bx lr
   %shuffle = shufflevector <8 x i8> %v1, <8 x i8> undef, <16 x i32> <i32 5, i32 5, i32 5, i32 5, i32 5, i32 5, i32 5, i32 5, i32 5, i32 5, i32 5, i32 5, i32 5, i32 5, i32 5, i32 5>
@@ -785,7 +784,6 @@ define <16 x i8> @test_vdupq_lane_s8(<8 x i8> %v1) #0 {
 define <8 x i16> @test_vdupq_lane_s16(<4 x i16> %v1) #0 {
 ; CHECK-LABEL: test_vdupq_lane_s16:
 ; CHECK:       @ %bb.0:
-; CHECK-NEXT:    @ kill: def $d0 killed $d0 def $q0
 ; CHECK-NEXT:    vdup.16 q0, d0[2]
 ; CHECK-NEXT:    bx lr
   %shuffle = shufflevector <4 x i16> %v1, <4 x i16> undef, <8 x i32> <i32 2, i32 2, i32 2, i32 2, i32 2, i32 2, i32 2, i32 2>
@@ -795,7 +793,6 @@ define <8 x i16> @test_vdupq_lane_s16(<4 x i16> %v1) #0 {
 define <4 x i32> @test_vdupq_lane_s32(<2 x i32> %v1) #0 {
 ; CHECK-LABEL: test_vdupq_lane_s32:
 ; CHECK:       @ %bb.0:
-; CHECK-NEXT:    @ kill: def $d0 killed $d0 def $q0
 ; CHECK-NEXT:    vdup.32 q0, d0[1]
 ; CHECK-NEXT:    bx lr
   %shuffle = shufflevector <2 x i32> %v1, <2 x i32> undef, <4 x i32> <i32 1, i32 1, i32 1, i32 1>
diff --git a/llvm/test/CodeGen/ARM/neon-v8.1a.ll b/llvm/test/CodeGen/ARM/neon-v8.1a.ll
index 2b37d042e3b99d..6f4ceba1d667b7 100644
--- a/llvm/test/CodeGen/ARM/neon-v8.1a.ll
+++ b/llvm/test/CodeGen/ARM/neon-v8.1a.ll
@@ -138,7 +138,6 @@ entry:
 define arm_aapcs_vfpcc <8 x i16> @test_vqrdmulahq_lane_s16(<8 x i16> %acc, <8 x i16> %x, <4 x i16> %v) {
 ; CHECK-LABEL: test_vqrdmulahq_lane_s16:
 ; CHECK:       @ %bb.0: @ %entry
-; CHECK-NEXT:    @ kill: def $d4 killed $d4 def $q2
 ; CHECK-NEXT:    vqrdmulh.s16 q8, q1, d4[2]
 ; CHECK-NEXT:    vqadd.s16 q0, q0, q8
 ; CHECK-NEXT:    bx lr
@@ -165,7 +164,6 @@ entry:
 define arm_aapcs_vfpcc <4 x i32> @test_vqrdmulahq_lane_s32(<4 x i32> %acc,<4 x i32> %x, <2 x i32> %v) {
 ; CHECK-LABEL: test_vqrdmulahq_lane_s32:
 ; CHECK:       @ %bb.0: @ %entry
-; CHECK-NEXT:    @ kill: def $d4 killed $d4 def $q2
 ; CHECK-NEXT:    vqrdmulh.s32 q8, q1, d4[0]
 ; CHECK-NEXT:    vqadd.s32 q0, q0, q8
 ; CHECK-NEXT:    bx lr
@@ -192,7 +190,6 @@ entry:
 define arm_aapcs_vfpcc <8 x i16> @test_vqrdmulshq_lane_s16(<8 x i16> %acc, <8 x i16> %x, <4 x i16> %v) {
 ; CHECK-LABEL: test_vqrdmulshq_lane_s16:
 ; CHECK:       @ %bb.0: @ %entry
-; CHECK-NEXT:    @ kill: def $d4 killed $d4 def $q2
 ; CHECK-NEXT:    vqrdmulh.s16 q8, q1, d4[2]
 ; CHECK-NEXT:    vqsub.s16 q0, q0, q8
 ; CHECK-NEXT:    bx lr
@@ -219,7 +216,6 @@ entry:
 define arm_aapcs_vfpcc <4 x i32> @test_vqrdmulshq_lane_s32(<4 x i32> %acc,<4 x i32> %x, <2 x i32> %v) {
 ; CHECK-LABEL: test_vqrdmulshq_lane_s32:
 ; CHECK:       @ %bb.0: @ %entry
-; CHECK-NEXT:    @ kill: def $d4 killed $d4 def $q2
 ; CHECK-NEXT:    vqrdmulh.s32 q8, q1, d4[0]
 ; CHECK-NEXT:    vqsub.s32 q0, q0, q8
 ; CHECK-NEXT:    bx lr
@@ -297,7 +293,6 @@ entry:
 define arm_aapcs_vfpcc <8 x i16> @test_vqrdmlahq_lane_s16(<8 x i16> %a, <8 x i16> %b, <4 x i16> %c) {
 ; CHECK-LABEL: test_vqrdmlahq_lane_s16:
 ; CHECK:       @ %bb.0: @ %entry
-; CHECK-NEXT:    @ kill: def $d4 killed $d4 def $q2
 ; CHECK-NEXT:    vqrdmlah.s16 q0, q1, d4[3]
 ; CHECK-NEXT:    bx lr
 entry:
@@ -309,7 +304,6 @@ entry:
 define arm_aapcs_vfpcc <4 x i32> @test_vqrdmlahq_lane_s32(<4 x i32> %a, <4 x i32> %b, <2 x i32> %c) {
 ; CHECK-LABEL: test_vqrdmlahq_lane_s32:
 ; CHECK:       @ %bb.0: @ %entry
-; CHECK-NEXT:    @ kill: def $d4 killed $d4 def $q2
 ; CHECK-NEXT:    vqrdmlah.s32 q0, q1, d4[1]
 ; CHECK-NEXT:    bx lr
 entry:
@@ -383,7 +377,6 @@ entry:
 define arm_aapcs_vfpcc <8 x i16> @test_vqrdmlshq_lane_s16(<8 x i16> %a, <8 x i16> %b, <4 x i16> %c) {
 ; CHECK-LABEL: test_vqrdmlshq_lane_s16:
 ; CHECK:       @ %bb.0: @ %entry
-; CHECK-NEXT:    @ kill: def $d4 killed $d4 def $q2
 ; CHECK-NEXT:    vqrdmlsh.s16 q0, q1, d4[3]
 ; CHECK-NEXT:    bx lr
 entry:
@@ -395,7 +388,6 @@ entry:
 define arm_aapcs_vfpcc <4 x i32> @test_vqrdmlshq_lane_s32(<4 x i32> %a, <4 x i32> %b, <2 x i32> %c) {
 ; CHECK-LABEL: test_vqrdmlshq_lane_s32:
 ; CHECK:       @ %bb.0: @ %entry
-; CHECK-NEXT:    @ kill: def $d4 killed $d4 def $q2
 ; CHECK-NEXT:    vqrdmlsh.s32 q0, q1, d4[1]
 ; CHECK-NEXT:    bx lr
 entry:
diff --git a/llvm/test/CodeGen/ARM/vdup.ll b/llvm/test/CodeGen/ARM/vdup.ll
index 3f6d38c929b98a..7d0680184623ff 100644
--- a/llvm/test/CodeGen/ARM/vdup.ll
+++ b/llvm/test/CodeGen/ARM/vdup.ll
@@ -258,7 +258,6 @@ define arm_aapcs_vfpcc <2 x float> @vduplanefloat(<2 x float> %A) nounwind {
 define arm_aapcs_vfpcc <16 x i8> @vduplaneQ8(<8 x i8> %A) nounwind {
 ; CHECK-LABEL: vduplaneQ8:
 ; CHECK:       @ %bb.0:
-; CHECK-NEXT:    @ kill: def $d0 killed $d0 def $q0
 ; CHECK-NEXT:    vdup.8 q0, d0[1]
 ; CHECK-NEXT:    mov pc, lr
 	%tmp2 = shufflevector <8 x i8> %A, <8 x i8> undef, <16 x i32> < i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1 >
@@ -268,7 +267,6 @@ define arm_aapcs_vfpcc <16 x i8> @vduplaneQ8(<8 x i8> %A) nounwind {
 define arm_aapcs_vfpcc <8 x i16> @vduplaneQ16(<4 x i16> %A) nounwind {
 ; CHECK-LABEL: vduplaneQ16:
 ; CHECK:       @ %bb.0:
-; CHECK-NEXT:    @ kill: def $d0 killed $d0 def $q0
 ; CHECK-NEXT:    vdup.16 q0, d0[1]
 ; CHECK-NEXT:    mov pc, lr
 	%tmp2 = shufflevector <4 x i16> %A, <4 x i16> undef, <8 x i32> < i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1 >
@@ -278,7 +276,6 @@ define arm_aapcs_vfpcc <8 x i16> @vduplaneQ16(<4 x i16> %A) nounwind {
 define arm_aapcs_vfpcc <4 x i32> @vduplaneQ32(<2 x i32> %A) nounwind {
 ; CHECK-LABEL: vduplaneQ32:
 ; CHECK:       @ %bb.0:
-; CHECK-NEXT:    @ kill: def $d0 killed $d0 def $q0
 ; CHECK-NEXT:    vdup.32 q0, d0[1]
 ; CHECK-NEXT:    mov pc, lr
 	%tmp2 = shufflevector <2 x i32> %A, <2 x i32> undef, <4 x i32> < i32 1, i32 1, i32 1, i32 1 >
@@ -288,7 +285,6 @@ define arm_aapcs_vfpcc <4 x i32> @vduplaneQ32(<2 x i32> %A) nounwind {
 define arm_aapcs_vfpcc <4 x float> @vduplaneQfloat(<2 x float> %A) nounwind {
 ; CHECK-LABEL: vduplaneQfloat:
 ; CHECK:       @ %bb.0:
-; CHECK-NEXT:    @ kill: def $d0 killed $d0 def $q0
 ; CHECK-NEXT:    vdup.32 q0, d0[1]
 ; CHECK-NEXT:    mov pc, lr
 	%tmp2 = shufflevector <2 x float> %A, <2 x float> undef, <4 x i32> < i32 1, i32 1, i32 1, i32 1 >
diff --git a/llvm/test/CodeGen/ARM/vext.ll b/llvm/test/CodeGen/ARM/vext.ll
index 7ddf1d02834c38..46f778d3c2c918 100644
--- a/llvm/test/CodeGen/ARM/vext.ll
+++ b/llvm/test/CodeGen/ARM/vext.ll
@@ -257,22 +257,22 @@ define <8 x i16> @test_illegal(ptr %A, ptr %B) nounwind {
 ; CHECK-LABEL: test_illegal:
 ; CHECK:       @ %bb.0:
 ; CHECK-NEXT:    vld1.64 {d16, d17}, [r0]
-; CHECK-NEXT:    vorr d22, d16, d16
+; CHECK-NEXT:    vorr d21, d16, d16
 ; CHECK-NEXT:    vmov.u16 r0, d16[0]
-; CHECK-NEXT:    vorr d23, d16, d16
+; CHECK-NEXT:    vorr d22, d16, d16
 ; CHECK-NEXT:    vmov.u16 r2, d17[3]
 ; CHECK-NEXT:    vmov.u16 r3, d17[1]
 ; CHECK-NEXT:    vld1.64 {d18, d19}, [r1]
 ; CHECK-NEXT:    vmov.u16 r1, d19[1]
-; CHECK-NEXT:    vuzp.16 d22, d23
-; CHECK-NEXT:    vuzp.16 d22, d18
+; CHECK-NEXT:    vuzp.16 d21, d22
+; CHECK-NEXT:    vuzp.16 d21, d18
+; CHECK-NEXT:    vext.16 d16, d16, d18, #3
 ; CHECK-NEXT:    vmov.16 d20[0], r0
 ; CHECK-NEXT:    vmov.16 d20[1], r2
 ; CHECK-NEXT:    vmov.16 d20[2], r3
+; CHECK-NEXT:    vmov r2, r3, d16
 ; CHECK-NEXT:    vmov.16 d20[3], r1
-; CHECK-NEXT:    vext.16 d21, d16, d18, #3
 ; CHECK-NEXT:    vmov r0, r1, d20
-; CHECK-NEXT:    vmov r2, r3, d21
 ; CHECK-NEXT:    mov pc, lr
        %tmp1 = load <8 x i16>, ptr %A
        %tmp2 = load <8 x i16>, ptr %B
diff --git a/llvm/test/CodeGen/ARM/vmul.ll b/llvm/test/CodeGen/ARM/vmul.ll
index 9915e050de2a66..6f6386a81f9ab0 100644
--- a/llvm/test/CodeGen/ARM/vmul.ll
+++ b/llvm/test/CodeGen/ARM/vmul.ll
@@ -130,7 +130,6 @@ entry:
 define arm_aapcs_vfpcc <4 x float> @test_vmulQ_lanef32(<4 x float> %arg0_float32x4_t, <2 x float> %arg1_float32x2_t) nounwind readnone {
 ; CHECK-LABEL: test_vmulQ_lanef32:
 ; CHECK:       @ %bb.0: @ %entry
-; CHECK-NEXT:    @ kill: def $d2 killed $d2 def $q1
 ; CHECK-NEXT:    vmul.f32 q0, q0, d2[1]
 ; CHECK-NEXT:    bx lr
 entry:
@@ -142,7 +141,6 @@ entry:
 define arm_aapcs_vfpcc <8 x i16> @test_vmulQ_lanes16(<8 x i16> %arg0_int16x8_t, <4 x i16> %arg1_int16x4_t) nounwind readnone {
 ; CHECK-LABEL: test_vmulQ_lanes16:
 ; CHECK:       @ %bb.0: @ %entry
-; CHECK-NEXT:    @ kill: def $d2 killed $d2 def $q1
 ; CHECK-NEXT:    vmul.i16 q0, q0, d2[1]
 ; CHECK-NEXT:    bx lr
 entry:
@@ -154,7 +152,6 @@ entry:
 define arm_aapcs_vfpcc <4 x i32> @test_vmulQ_lanes32(<4 x i32> %arg0_int32x4_t, <2 x i32> %arg1_int32x2_t) nounwind readnone {
 ; CHECK-LABEL: test_vmulQ_lanes32:
 ; CHECK:       @ %bb.0: @ %entry
-; CHECK-NEXT:    @ kill: def $d2 killed $d2 def $q1
 ; CHECK-NEXT:    vmul.i32 q0, q0, d2[1]
 ; CHECK-NEXT:    bx lr
 entry:
diff --git a/llvm/test/CodeGen/ARM/vpadd.ll b/llvm/test/CodeGen/ARM/vpadd.ll
index cc1d1e6a15fe70..0597d44bc7cbf2 100644
--- a/llvm/test/CodeGen/ARM/vpadd.ll
+++ b/llvm/test/CodeGen/ARM/vpadd.ll
@@ -508,14 +508,14 @@ define <2 x i16> @fromExtendingExtractVectorElt_2i16(<8 x i16> %in) {
 ; CHECK-LABEL: fromExtendingExtractVectorElt_2i16:
 ; CHECK:       @ %bb.0:
 ; CHECK-NEXT:    vmov d16, r0, r1
-; CHECK-NEXT:    vmov.u16 r0, d16[0]
 ; CHECK-NEXT:    vmov.u16 r1, d16[1]
-; CHECK-NEXT:    vmov.u16 r3, d16[3]
+; CHECK-NEXT:    vmov.u16 r0, d16[0]
 ; CHECK-NEXT:    vmov.u16 r2, d16[2]
-; CHECK-NEXT:    vmov.32 d16[0], r0
+; CHECK-NEXT:    vmov.u16 r3, d16[3]
 ; CHECK-NEXT:    vmov.32 d17[0], r1
-; CHECK-NEXT:    vmov.32 d16[1], r2
+; CHECK-NEXT:    vmov.32 d16[0], r0
 ; CHECK-NEXT:    vmov.32 d17[1], r3
+; CHECK-NEXT:    vmov.32 d16[1], r2
 ; CHECK-NEXT:    vadd.i32 d16, d17, d16
 ; CHECK-NEXT:    vmov r0, r1, d16
 ; CHECK-NEXT:    mov pc, lr
diff --git a/llvm/test/CodeGen/ARM/vuzp.ll b/llvm/test/CodeGen/ARM/vuzp.ll
index d54446a431ee98..7e1dfba34db2e5 100644
--- a/llvm/test/CodeGen/ARM/vuzp.ll
+++ b/llvm/test/CodeGen/ARM/vuzp.ll
@@ -21,11 +21,11 @@ define <8 x i8> @vuzpi8(ptr %A, ptr %B) nounwind {
 define <16 x i8> @vuzpi8_Qres(ptr %A, ptr %B) nounwind {
 ; CHECK-LABEL: vuzpi8_Qres:
 ; CHECK:       @ %bb.0:
-; CHECK-NEXT:    vldr d17, [r1]
-; CHECK-NEXT:    vldr d16, [r0]
-; CHECK-NEXT:    vuzp.8 d16, d17
-; CHECK-NEXT:    vmov r0, r1, d16
-; CHECK-NEXT:    vmov r2, r3, d17
+; CHECK-NEXT:    vldr d16, [r1]
+; CHECK-NEXT:    vldr d17, [r0]
+; CHECK-NEXT:    vuzp.8 d17, d16
+; CHECK-NEXT:    vmov r0, r1, d17
+; CHECK-NEXT:    vmov r2, r3, d16
 ; CHECK-NEXT:    mov pc, lr
 	%tmp1 = load <8 x i8>, ptr %A
 	%tmp2 = load <8 x i8>, ptr %B
@@ -53,11 +53,11 @@ define <4 x i16> @vuzpi16(ptr %A, ptr %B) nounwind {
 define <8 x i16> @vuzpi16_Qres(ptr %A, ptr %B) nounwind {
 ; CHECK-LABEL: vuzpi16_Qres:
 ; CHECK:       @ %bb.0:
-; CHECK-NEXT:    vldr d17, [r1]
-; CHECK-NEXT:    vldr d16, [r0]
-; CHECK-NEXT:    vuzp.16 d16, d17
-; CHECK-NEXT:    vmov r0, r1, d16
-; CHECK-NEXT:    vmov r2, r3, d17
+; CHECK-NEXT:    vldr d16, [r1]
+; CHECK-NEXT:    vldr d17, [r0]
+; CHECK-NEXT:    vuzp.16 d17, d16
+; CHECK-NEXT:    vmov r0, r1, d17
+; CHECK-NEXT:    vmov r2, r3, d16
 ; CHECK-NEXT:    mov pc, lr
 	%tmp1 = load <4 x i16>, ptr %A
 	%tmp2 = load <4 x i16>, ptr %B
@@ -221,11 +221,11 @@ define <8 x i8> @vuzpi8_undef(ptr %A, ptr %B) nounwind {
 define <16 x i8> @vuzpi8_undef_Qres(ptr %A, ptr %B) nounwind {
 ; CHECK-LABEL: vuzpi8_undef_Qres:
 ; CHECK:       @ %bb.0:
-; CHECK-NEXT:    vldr d17, [r1]
-; CHECK-NEXT:    vldr d16, [r0]
-; CHECK-NEXT:    vuzp.8 d16, d17
-; CHECK-NEXT:    vmov r0, r1, d16
-; CHECK-NEXT:    vmov r2, r3, d17
+; CHECK-NEXT:    vldr d16, [r1]
+; CHECK-NEXT:    vldr d17, [r0]
+; CHECK-NEXT:    vuzp.8 d17, d16
+; CHECK-NEXT:    vmov r0, r1, d17
+; CHECK-NEXT:    vmov r2, r3, d16
 ; CHECK-NEXT:    mov pc, lr
 	%tmp1 = load <8 x i8>, ptr %A
 	%tmp2 = load <8 x i8>, ptr %B
@@ -269,11 +269,11 @@ define <16 x i16> @vuzpQi16_undef_QQres(ptr %A, ptr %B) nounwind {
 define <8 x i16> @vuzp_lower_shufflemask_undef(ptr %A, ptr %B) {
 ; CHECK-LABEL: vuzp_lower_shufflemask_undef:
 ; CHECK:       @ %bb.0: @ %entry
-; CHECK-NEXT:    vldr d17, [r1]
-; CHECK-NEXT:    vldr d18, [r0]
-; CHECK-NEXT:    vuzp.16 d18, d17
+; CHECK-NEXT:    vldr d16, [r1]
+; CHECK-NEXT:    vldr d17, [r0]
+; CHECK-NEXT:    vuzp.16 d17, d16
 ; CHECK-NEXT:    vmov r0, r1, d16
-; CHECK-NEXT:    vmov r2, r3, d17
+; CHECK-NEXT:    vmov r2, r3, d16
 ; CHECK-NEXT:    mov pc, lr
 entry:
 	%tmp1 = load <4 x i16>, ptr %A
@@ -285,13 +285,13 @@ entry:
 define <4 x i32> @vuzp_lower_shufflemask_zeroed(ptr %A, ptr %B) {
 ; CHECK-LABEL: vuzp_lower_shufflemask_zeroed:
 ; CHECK:       @ %bb.0: @ %entry
-; CHECK-NEXT:    vldr d18, [r0]
-; CHECK-NEXT:    vorr d19, d18, d18
-; CHECK-NEXT:    vldr d17, [r1]
-; CHECK-NEXT:    vtrn.32 d19, d17
-; CHECK-NEXT:    vdup.32 d16, d18[0]
-; CHECK-NEXT:    vmov r2, r3, d17
-; CHECK-NEXT:    vmov r0, r1, d16
+; CHECK-NEXT:    vldr d17, [r0]
+; CHECK-NEXT:    vorr d18, d17, d17
+; CHECK-NEXT:    vldr d16, [r1]
+; CHECK-NEXT:    vdup.32 d17, d17[0]
+; CHECK-NEXT:    vtrn.32 d18, d16
+; CHECK-NEXT:    vmov r0, r1, d17
+; CHECK-NEXT:    vmov r2, r3, d16
 ; CHECK-NEXT:    mov pc, lr
 entry:
   %tmp1 = load <2 x i32>, ptr %A
diff --git a/llvm/test/CodeGen/ARM/vzip.ll b/llvm/test/CodeGen/ARM/vzip.ll
index 68f00a23da177d..dda774abd85163 100644
--- a/llvm/test/CodeGen/ARM/vzip.ll
+++ b/llvm/test/CodeGen/ARM/vzip.ll
@@ -21,11 +21,11 @@ define <8 x i8> @vzipi8(ptr %A, ptr %B) nounwind {
 define <16 x i8> @vzipi8_Qres(ptr %A, ptr %B) nounwind {
 ; CHECK-LABEL: vzipi8_Qres:
 ; CHECK:       @ %bb.0:
-; CHECK-NEXT:    vldr d17, [r1]
-; CHECK-NEXT:    vldr d16, [r0]
-; CHECK-NEXT:    vzip.8 d16, d17
-; CHECK-NEXT:    vmov r0, r1, d16
-; CHECK-NEXT:    vmov r2, r3, d17
+; CHECK-NEXT:    vldr d16, [r1]
+; CHECK-NEXT:    vldr d17, [r0]
+; CHECK-NEXT:    vzip.8 d17, d16
+; CHECK-NEXT:    vmov r0, r1, d17
+; CHECK-NEXT:    vmov r2, r3, d16
 ; CHECK-NEXT:    mov pc, lr
 	%tmp1 = load <8 x i8>, ptr %A
 	%tmp2 = load <8 x i8>, ptr %B
@@ -53,11 +53,11 @@ define <4 x i16> @vzipi16(ptr %A, ptr %B) nounwind {
 define <8 x i16> @vzipi16_Qres(ptr %A, ptr %B) nounwind {
 ; CHECK-LABEL: vzipi16_Qres:
 ; CHECK:       @ %bb.0:
-; CHECK-NEXT:    vldr d17, [r1]
-; CHECK-NEXT:    vldr d16, [r0]
-; CHECK-NEXT:    vzip.16 d16, d17
-; CHECK-NEXT:    vmov r0, r1, d16
-; CHECK-NEXT:    vmov r2, r3, d17
+; CHECK-NEXT:    vldr d16, [r1]
+; CHECK-NEXT:    vldr d17, [r0]
+; CHECK-NEXT:    vzip.16 d17, d16
+; CHECK-NEXT:    vmov r0, r1, d17
+; CHECK-NEXT:    vmov r2, r3, d16
 ; CHECK-NEXT:    mov pc, lr
 	%tmp1 = load <4 x i16>, ptr %A
 	%tmp2 = load <4 x i16>, ptr %B
@@ -221,11 +221,11 @@ define <8 x i8> @vzipi8_undef(ptr %A, ptr %B) nounwind {
 define <16 x i8> @vzipi8_undef_Qres(ptr %A, ptr %B) nounwind {
 ; CHECK-LABEL: vzipi8_undef_Qres:
 ; CHECK:       @ %bb.0:
-; CHECK-NEXT:    vldr d17, [r1]
-; CHECK-NEXT:    vldr d16, [r0]
-; CHECK-NEXT:    vzip.8 d16, d17
-; CHECK-NEXT:    vmov r0, r1, d16
-; CHECK-NEXT:    vmov r2, r3, d17
+; CHECK-NEXT:    vldr d16, [r1]
+; CHECK-NEXT:    vldr d17, [r0]
+; CHECK-NEXT:    vzip.8 d17, d16
+; CHECK-NEXT:    vmov r0, r1, d17
+; CHECK-NEXT:    vmov r2, r3, d16
 ; CHECK-NEXT:    mov pc, lr
 	%tmp1 = load <8 x i8>, ptr %A
 	%tmp2 = load <8 x i8>, ptr %B
@@ -269,11 +269,11 @@ define <32 x i8> @vzipQi8_undef_QQres(ptr %A, ptr %B) nounwind {
 define <8 x i16> @vzip_lower_shufflemask_undef(ptr %A, ptr %B) {
 ; CHECK-LABEL: vzip_lower_shufflemask_undef:
 ; CHECK:       @ %bb.0: @ %entry
-; CHECK-NEXT:    vldr d17, [r1]
-; CHECK-NEXT:    vldr d18, [r0]
-; CHECK-NEXT:    vzip.16 d18, d17
+; CHECK-NEXT:    vldr d16, [r1]
+; CHECK-NEXT:    vldr d17, [r0]
+; CHECK-NEXT:    vzip.16 d17, d16
 ; CHECK-NEXT:    vmov r0, r1, d16
-; CHECK-NEXT:    vmov r2, r3, d17
+; CHECK-NEXT:    vmov r2, r3, d16
 ; CHECK-NEXT:    mov pc, lr
 entry:
 	%tmp1 = load <4 x i16>, ptr %A
@@ -289,10 +289,10 @@ define <8 x i16> @vzip_lower_shufflemask_undef_rev(ptr %A, ptr %B) {
 ; CHECK-LABEL: vzip_lower_shufflemask_undef_rev:
 ; CHECK:       @ %bb.0: @ %entry
 ; CHECK-NEXT:    vldr d16, [r1]
-; CHECK-NEXT:    vldr d19, [r0]
-; CHECK-NEXT:    vtrn.16 d19, d16
-; CHECK-NEXT:    vmov r0, r1, d18
-; CHECK-NEXT:    vmov r2, r3, d19
+; CHECK-NEXT:    vldr d17, [r0]
+; CHECK-NEXT:    vtrn.16 d17, d16
+; CHECK-NEXT:    vmov r0, r1, d16
+; CHECK-NEXT:    vmov r2, r3, d17
 ; CHECK-NEXT:    mov pc, lr
 entry:
   %tmp1 = load <4 x i16>, ptr %A
diff --git a/llvm/test/CodeGen/AVR/return.ll b/llvm/test/CodeGen/AVR/return.ll
index 8cb9f1427c77a5..207ad2f23a737a 100644
--- a/llvm/test/CodeGen/AVR/return.ll
+++ b/llvm/test/CodeGen/AVR/return.ll
@@ -187,26 +187,26 @@ define i64 @return64_imm() {
 ; TINY-NEXT:    st Z+, r21
 ; TINY-NEXT:    subi r30, 8
 ; TINY-NEXT:    sbci r31, 0
-; TINY-NEXT:    ldi r24, 25
-; TINY-NEXT:    ldi r25, 22
+; TINY-NEXT:    ldi r20, 25
+; TINY-NEXT:    ldi r21, 22
 ; TINY-NEXT:    subi r30, 252
 ; TINY-NEXT:    sbci r31, 255
-; TINY-NEXT:    st Z+, r24
-; TINY-NEXT:    st Z+, r25
+; TINY-NEXT:    st Z+, r20
+; TINY-NEXT:    st Z+, r21
 ; TINY-NEXT:    subi r30, 6
 ; TINY-NEXT:    sbci r31, 0
-; TINY-NEXT:    ldi r24, 104
-; TINY-NEXT:    ldi r25, 37
+; TINY-NEXT:    ldi r20, 104
+; TINY-NEXT:    ldi r21, 37
 ; TINY-NEXT:    subi r30, 254
 ; TINY-NEXT:    sbci r31, 255
-; TINY-NEXT:    st Z+, r24
-; TINY-NEXT:    st Z+, r25
+; TINY-NEXT:    st Z+, r20
+; TINY-NEXT:    st Z+, r21
 ; TINY-NEXT:    subi r30, 4
 ; TINY-NEXT:    sbci r31, 0
-; TINY-NEXT:    ldi r24, 204
-; TINY-NEXT:    ldi r25, 204
-; TINY-NEXT:    st Z+, r24
-; TINY-NEXT:    st Z+, r25
+; TINY-NEXT:    ldi r20, 204
+; TINY-NEXT:    ldi r21, 204
+; TINY-NEXT:    st Z+, r20
+; TINY-NEXT:    st Z+, r21
 ; TINY-NEXT:    ret
     ret i64 13757395258967641292
 }
@@ -243,8 +243,8 @@ define i64 @return64_arg(i64 %x) {
 ; TINY-NEXT:    in r16, 63
 ; TINY-NEXT:    subi r28, 247
 ; TINY-NEXT:    sbci r29, 255
-; TINY-NEXT:    ld r24, Y+
-; TINY-NEXT:    ld r25, Y+
+; TINY-NEXT:    ld r20, Y+
+; TINY-NEXT:    ld r21, Y+
 ; TINY-NEXT:    subi r28, 2
 ; TINY-NEXT:    sbci r29, 0
 ; TINY-NEXT:    subi r28, 9
@@ -252,15 +252,15 @@ define i64 @return64_arg(i64 %x) {
 ; TINY-NEXT:    out 63, r16
 ; TINY-NEXT:    subi r30, 252
 ; TINY-NEXT:    sbci r31, 255
-; TINY-NEXT:    st Z+, r24
-; TINY-NEXT:    st Z+, r25
+; TINY-NEXT:    st Z+, r20
+; TINY-NEXT:    st Z+, r21
 ; TINY-NEXT:    subi r30, 6
 ; TINY-NEXT:    sbci r31, 0
 ; TINY-NEXT:    in r16, 63
 ; TINY-NEXT:    subi r28, 249
 ; TINY-NEXT:    sbci r29, 255
-; TINY-NEXT:    ld r24, Y+
-; TINY-NEXT:    ld r25, Y+
+; TINY-NEXT:    ld r20, Y+
+; TINY-NEXT:    ld r21, Y+
 ; TINY-NEXT:    subi r28, 2
 ; TINY-NEXT:    sbci r29, 0
 ; TINY-NEXT:    subi r28, 7
@@ -268,22 +268,22 @@ define i64 @return64_arg(i64 %x) {
 ; TINY-NEXT:    out 63, r16
 ; TINY-NEXT:    subi r30, 254
 ; TINY-NEXT:    sbci r31, 255
-; TINY-NEXT:    st Z+, r24
-; TINY-NEXT:    st Z+, r25
+; TINY-NEXT:    st Z+, r20
+; TINY-NEXT:    st Z+, r21
 ; TINY-NEXT:    subi r30, 4
 ; TINY-NEXT:    sbci r31, 0
 ; TINY-NEXT:    in r16, 63
 ; TINY-NEXT:    subi r28, 251
 ; TINY-NEXT:    sbci r29, 255
-; TINY-NEXT:    ld r24, Y+
-; TINY-NEXT:    ld r25, Y+
+; TINY-NEXT:    ld r20, Y+
+; TINY-NEXT:    ld r21, Y+
 ; TINY-NEXT:    subi r28, 2
 ; TINY-NEXT:    sbci r29, 0
 ; TINY-NEXT:    subi r28, 5
 ; TINY-NEXT:    sbci r29, 0
 ; TINY-NEXT:    out 63, r16
-; TINY-NEXT:    st Z+, r24
-; TINY-NEXT:    st Z+, r25
+; TINY-NEXT:    st Z+, r20
+; TINY-NEXT:    st Z+, r21
 ; TINY-NEXT:    pop r29
 ; TINY-NEXT:    pop r28
 ; TINY-NEXT:    ret
@@ -336,8 +336,8 @@ define i64 @return64_arg2(i64 %x, i64 %y, i64 %z) {
 ; TINY-NEXT:    in r16, 63
 ; TINY-NEXT:    subi r28, 231
 ; TINY-NEXT:    sbci r29, 255
-; TINY-NEXT:    ld r24, Y+
-; TINY-NEXT:    ld r25, Y+
+; TINY-NEXT:    ld r20, Y+
+; TINY-NEXT:    ld r21, Y+
 ; TINY-NEXT:    subi r28, 2
 ; TINY-NEXT:    sbci r29, 0
 ; TINY-NEXT:    subi r28, 25
@@ -345,15 +345,15 @@ define i64 @return64_arg2(i64 %x, i64 %y, i64 %z) {
 ; TINY-NEXT:    out 63, r16
 ; TINY-NEXT:    subi r30, 252
 ; TINY-NEXT:    sbci r31, 255
-; TINY-NEXT:    st Z+, r24
-; TINY-NEXT:    st Z+, r25
+; TINY-NEXT:    st Z+, r20
+; TINY-NEXT:    st Z+, r21
 ; TINY-NEXT:    subi r30, 6
 ; TINY-NEXT:    sbci r31, 0
 ; TINY-NEXT:    in r16, 63
 ; TINY-NEXT:    subi r28, 233
 ; TINY-NEXT:    sbci r29, 255
-; TINY-NEXT:    ld r24, Y+
-; TINY-NEXT:    ld r25, Y+
+; TINY-NEXT:    ld r20, Y+
+; TINY-NEXT:    ld r21, Y+
 ; TINY-NEXT:    subi r28, 2
 ; TINY-NEXT:    sbci r29, 0
 ; TINY-NEXT:    subi r28, 23
@@ -361,22 +361,22 @@ define i64 @return64_arg2(i64 %x, i64 %y, i64 %z) {
 ; TINY-NEXT:    out 63, r16
 ; TINY-NEXT:    subi r30, 254
 ; TINY-NEXT:    sbci r31, 255
-; TINY-NEXT:    st Z+, r24
-; TINY-NEXT:    st Z+, r25
+; TINY-NEXT:    st Z+, r20
+; TINY-NEXT:    st Z+, r21
 ; TINY-NEXT:    subi r30, 4
 ; TINY-NEXT:    sbci r31, 0
 ; TINY-NEXT:    in r16, 63
 ; TINY-NEXT:    subi r28, 235
 ; TINY-NEXT:    sbci r29, 255
-; TINY-NEXT:    ld r24, Y+
-; TINY-NEXT:    ld r25, Y+
+; TINY-NEXT:    ld r20, Y+
+; TINY-NEXT:    ld r21, Y+
 ; TINY-NEXT:    subi r28, 2
 ; TINY-NEXT:    sbci r29, 0
 ; TINY-NEXT:    subi r28, 21
 ; TINY-NEXT:    sbci r29, 0
 ; TINY-NEXT:    out 63, r16
-; TINY-NEXT:    st Z+, r24
-; TINY-NEXT:    st Z+, r25
+; TINY-NEXT:    st Z+, r20
+; TINY-NEXT:    st Z+, r21
 ; TINY-NEXT:    pop r29
 ; TINY-NEXT:    pop r28
 ; TINY-NEXT:    ret
diff --git a/llvm/test/CodeGen/BPF/is_trunc_free.ll b/llvm/test/CodeGen/BPF/is_trunc_free.ll
index 21c8a2aebeda17..fe007319286250 100644
--- a/llvm/test/CodeGen/BPF/is_trunc_free.ll
+++ b/llvm/test/CodeGen/BPF/is_trunc_free.ll
@@ -58,7 +58,6 @@ cleanup:                                          ; preds = %entry, %if.end10
 }
 
 ; CHECK: w{{[0-9]+}} = *(u32 *)(r{{[0-9]+}} + 0)
-; CHECK-NOT: w{{[0-9]+}} = w{{[0-9]+}}
 
 declare dso_local i32 @work(ptr, i32) local_unnamed_addr #1
 
diff --git a/llvm/test/CodeGen/PowerPC/aggressive-anti-dep-breaker-subreg.ll b/llvm/test/CodeGen/PowerPC/aggressive-anti-dep-breaker-subreg.ll
index 1dac1b0f685737..0846f7250ed931 100644
--- a/llvm/test/CodeGen/PowerPC/aggressive-anti-dep-breaker-subreg.ll
+++ b/llvm/test/CodeGen/PowerPC/aggressive-anti-dep-breaker-subreg.ll
@@ -9,7 +9,7 @@ entry:
 
 lnext:
   %elementArray = load ptr, ptr %elementArrayPtr, align 8
-; CHECK: lwz [[LDREG:[0-9]+]], 140(1)                   # 4-byte Folded Reload
+; CHECK: ld [[LDREG:[0-9]+]], 120(1)                   # 8-byte Folded Reload
 ; CHECK: # implicit-def: $x[[TEMPREG:[0-9]+]]
   %element = load i32, ptr %elementArray, align 4
 ; CHECK: mr [[TEMPREG]], [[LDREG]]
diff --git a/llvm/test/CodeGen/PowerPC/mma-acc-copy-hints.ll b/llvm/test/CodeGen/PowerPC/mma-acc-copy-hints.ll
index 6b188d7c210e30..5decd9a639af85 100644
--- a/llvm/test/CodeGen/PowerPC/mma-acc-copy-hints.ll
+++ b/llvm/test/CodeGen/PowerPC/mma-acc-copy-hints.ll
@@ -26,14 +26,18 @@ define void @testMultiply(ptr nocapture noundef readonly %a, ptr nocapture nound
 ; CHECK-NEXT:    std r29, -24(r30) # 8-byte Folded Spill
 ; CHECK-NEXT:    mr r29, r5
 ; CHECK-NEXT:    bl _Z15buildVectorPairPu13__vector_pairDv16_hS0_@notoc
-; CHECK-NEXT:    xxsetaccz acc0
-; CHECK-NEXT:    xvf32gerpp acc0, v31, v30
+; CHECK-NEXT:    xxsetaccz acc1
+; CHECK-NEXT:    xvf32gerpp acc1, v31, v30
 ; CHECK-NEXT:    lxv v3, 32(r1)
-; CHECK-NEXT:    lxv v2, 48(r1)
-; CHECK-NEXT:    xvf32gerpp acc0, v3, v2
+; CHECK-NEXT:    lxv vs0, 48(r1)
+; CHECK-NEXT:    xvf32gerpp acc1, v3, vs0
 ; CHECK-NEXT:    lxv v31, -48(r30) # 16-byte Folded Reload
 ; CHECK-NEXT:    lxv v30, -64(r30) # 16-byte Folded Reload
-; CHECK-NEXT:    xxmfacc acc0
+; CHECK-NEXT:    xxmfacc acc1
+; CHECK-NEXT:    xxlor vs0, vs4, vs4
+; CHECK-NEXT:    xxlor vs1, vs5, vs5
+; CHECK-NEXT:    xxlor vs2, vs6, vs6
+; CHECK-NEXT:    xxlor vs3, vs7, vs7
 ; CHECK-NEXT:    stxv vs3, 0(r29)
 ; CHECK-NEXT:    pstxv vs2, 8(r29), 0
 ; CHECK-NEXT:    stxv vs1, 16(r29)
@@ -68,8 +72,8 @@ define void @testMultiply(ptr nocapture noundef readonly %a, ptr nocapture nound
 ; CHECK-BE-NEXT:    xxsetaccz acc1
 ; CHECK-BE-NEXT:    xvf32gerpp acc1, v31, v30
 ; CHECK-BE-NEXT:    lxv v3, 144(r1)
-; CHECK-BE-NEXT:    lxv v2, 128(r1)
-; CHECK-BE-NEXT:    xvf32gerpp acc1, v2, v3
+; CHECK-BE-NEXT:    lxv vs0, 128(r1)
+; CHECK-BE-NEXT:    xvf32gerpp acc1, vs0, v3
 ; CHECK-BE-NEXT:    lxv v31, -48(r30) # 16-byte Folded Reload
 ; CHECK-BE-NEXT:    lxv v30, -64(r30) # 16-byte Folded Reload
 ; CHECK-BE-NEXT:    xxmfacc acc1
diff --git a/llvm/test/CodeGen/PowerPC/mma-acc-memops.ll b/llvm/test/CodeGen/PowerPC/mma-acc-memops.ll
index 7c197449201faf..31ddc619d97626 100644
--- a/llvm/test/CodeGen/PowerPC/mma-acc-memops.ll
+++ b/llvm/test/CodeGen/PowerPC/mma-acc-memops.ll
@@ -376,8 +376,8 @@ define dso_local void @testLdStPair(i64 %SrcIdx, i64 %DstIdx) {
 ; LE-PAIRED-LABEL: testLdStPair:
 ; LE-PAIRED:       # %bb.0: # %entry
 ; LE-PAIRED-NEXT:    plxv v3, g@PCREL+32(0), 1
-; LE-PAIRED-NEXT:    plxv v2, g@PCREL+48(0), 1
-; LE-PAIRED-NEXT:    pstxv v2, g@PCREL+80(0), 1
+; LE-PAIRED-NEXT:    plxv vs0, g@PCREL+48(0), 1
+; LE-PAIRED-NEXT:    pstxv vs0, g@PCREL+80(0), 1
 ; LE-PAIRED-NEXT:    pstxv v3, g@PCREL+64(0), 1
 ; LE-PAIRED-NEXT:    blr
 ;
@@ -386,9 +386,9 @@ define dso_local void @testLdStPair(i64 %SrcIdx, i64 %DstIdx) {
 ; BE-PAIRED-NEXT:    addis r3, r2, g@toc@ha
 ; BE-PAIRED-NEXT:    addi r3, r3, g@toc@l
 ; BE-PAIRED-NEXT:    lxv v3, 48(r3)
-; BE-PAIRED-NEXT:    lxv v2, 32(r3)
+; BE-PAIRED-NEXT:    lxv vs0, 32(r3)
 ; BE-PAIRED-NEXT:    stxv v3, 80(r3)
-; BE-PAIRED-NEXT:    stxv v2, 64(r3)
+; BE-PAIRED-NEXT:    stxv vs0, 64(r3)
 ; BE-PAIRED-NEXT:    blr
 ;
 ; LE-PWR9-LABEL: testLdStPair:
@@ -455,9 +455,9 @@ define dso_local void @testXLdStPair(i64 %SrcIdx, i64 %DstIdx) {
 ; LE-PAIRED-NEXT:    lxvx v3, r5, r3
 ; LE-PAIRED-NEXT:    sldi r3, r4, 5
 ; LE-PAIRED-NEXT:    add r4, r5, r3
-; LE-PAIRED-NEXT:    lxv v2, 16(r6)
+; LE-PAIRED-NEXT:    lxv vs0, 16(r6)
 ; LE-PAIRED-NEXT:    stxvx v3, r5, r3
-; LE-PAIRED-NEXT:    stxv v2, 16(r4)
+; LE-PAIRED-NEXT:    stxv vs0, 16(r4)
 ; LE-PAIRED-NEXT:    blr
 ;
 ; BE-PAIRED-LABEL: testXLdStPair:
@@ -466,11 +466,11 @@ define dso_local void @testXLdStPair(i64 %SrcIdx, i64 %DstIdx) {
 ; BE-PAIRED-NEXT:    sldi r3, r3, 5
 ; BE-PAIRED-NEXT:    addi r5, r5, g@toc@l
 ; BE-PAIRED-NEXT:    add r6, r5, r3
-; BE-PAIRED-NEXT:    lxvx v2, r5, r3
+; BE-PAIRED-NEXT:    lxvx vs0, r5, r3
 ; BE-PAIRED-NEXT:    sldi r3, r4, 5
 ; BE-PAIRED-NEXT:    add r4, r5, r3
 ; BE-PAIRED-NEXT:    lxv v3, 16(r6)
-; BE-PAIRED-NEXT:    stxvx v2, r5, r3
+; BE-PAIRED-NEXT:    stxvx vs0, r5, r3
 ; BE-PAIRED-NEXT:    stxv v3, 16(r4)
 ; BE-PAIRED-NEXT:    blr
 ;
@@ -543,8 +543,8 @@ define dso_local void @testUnalignedLdStPair() {
 ; LE-PAIRED-LABEL: testUnalignedLdStPair:
 ; LE-PAIRED:       # %bb.0: # %entry
 ; LE-PAIRED-NEXT:    plxv v3, g@PCREL+11(0), 1
-; LE-PAIRED-NEXT:    plxv v2, g@PCREL+27(0), 1
-; LE-PAIRED-NEXT:    pstxv v2, g@PCREL+35(0), 1
+; LE-PAIRED-NEXT:    plxv vs0, g@PCREL+27(0), 1
+; LE-PAIRED-NEXT:    pstxv vs0, g@PCREL+35(0), 1
 ; LE-PAIRED-NEXT:    pstxv v3, g@PCREL+19(0), 1
 ; LE-PAIRED-NEXT:    blr
 ;
@@ -553,9 +553,9 @@ define dso_local void @testUnalignedLdStPair() {
 ; BE-PAIRED-NEXT:    addis r3, r2, g@toc@ha
 ; BE-PAIRED-NEXT:    addi r3, r3, g@toc@l
 ; BE-PAIRED-NEXT:    plxv v3, 27(r3), 0
-; BE-PAIRED-NEXT:    plxv v2, 11(r3), 0
+; BE-PAIRED-NEXT:    plxv vs0, 11(r3), 0
 ; BE-PAIRED-NEXT:    pstxv v3, 35(r3), 0
-; BE-PAIRED-NEXT:    pstxv v2, 19(r3), 0
+; BE-PAIRED-NEXT:    pstxv vs0, 19(r3), 0
 ; BE-PAIRED-NEXT:    blr
 ;
 ; LE-PWR9-LABEL: testUnalignedLdStPair:
diff --git a/llvm/test/CodeGen/PowerPC/paired-vector-intrinsics.ll b/llvm/test/CodeGen/PowerPC/paired-vector-intrinsics.ll
index cfa9315896af35..dd62b0d54f84b5 100644
--- a/llvm/test/CodeGen/PowerPC/paired-vector-intrinsics.ll
+++ b/llvm/test/CodeGen/PowerPC/paired-vector-intrinsics.ll
@@ -20,14 +20,14 @@ declare <256 x i1> @llvm.ppc.vsx.assemble.pair(<16 x i8>, <16 x i8>)
 define void @ass_pair(ptr %ptr, <16 x i8> %vc) {
 ; CHECK-LABEL: ass_pair:
 ; CHECK:       # %bb.0: # %entry
-; CHECK-NEXT:    stxv v2, 0(r3)
 ; CHECK-NEXT:    stxv v2, 16(r3)
+; CHECK-NEXT:    stxv v2, 0(r3)
 ; CHECK-NEXT:    blr
 ;
 ; CHECK-NOMMA-LABEL: ass_pair:
 ; CHECK-NOMMA:       # %bb.0: # %entry
-; CHECK-NOMMA-NEXT:    stxv v2, 0(r3)
 ; CHECK-NOMMA-NEXT:    stxv v2, 16(r3)
+; CHECK-NOMMA-NEXT:    stxv v2, 0(r3)
 ; CHECK-NOMMA-NEXT:    blr
 ;
 ; CHECK-BE-LABEL: ass_pair:
@@ -53,32 +53,32 @@ define void @disass_pair(ptr %ptr1, ptr %ptr2, ptr %ptr3) {
 ; CHECK-LABEL: disass_pair:
 ; CHECK:       # %bb.0: # %entry
 ; CHECK-NEXT:    lxv v3, 0(r3)
-; CHECK-NEXT:    lxv v2, 16(r3)
+; CHECK-NEXT:    lxv vs0, 16(r3)
 ; CHECK-NEXT:    stxv v3, 0(r4)
-; CHECK-NEXT:    stxv v2, 0(r5)
+; CHECK-NEXT:    stxv vs0, 0(r5)
 ; CHECK-NEXT:    blr
 ;
 ; CHECK-NOMMA-LABEL: disass_pair:
 ; CHECK-NOMMA:       # %bb.0: # %entry
 ; CHECK-NOMMA-NEXT:    lxv v3, 0(r3)
-; CHECK-NOMMA-NEXT:    lxv v2, 16(r3)
+; CHECK-NOMMA-NEXT:    lxv vs0, 16(r3)
 ; CHECK-NOMMA-NEXT:    stxv v3, 0(r4)
-; CHECK-NOMMA-NEXT:    stxv v2, 0(r5)
+; CHECK-NOMMA-NEXT:    stxv vs0, 0(r5)
 ; CHECK-NOMMA-NEXT:    blr
 ;
 ; CHECK-BE-LABEL: disass_pair:
 ; CHECK-BE:       # %bb.0: # %entry
 ; CHECK-BE-NEXT:    lxv v3, 16(r3)
-; CHECK-BE-NEXT:    lxv v2, 0(r3)
-; CHECK-BE-NEXT:    stxv v2, 0(r4)
+; CHECK-BE-NEXT:    lxv vs0, 0(r3)
+; CHECK-BE-NEXT:    stxv vs0, 0(r4)
 ; CHECK-BE-NEXT:    stxv v3, 0(r5)
 ; CHECK-BE-NEXT:    blr
 ;
 ; CHECK-BE-NOMMA-LABEL: disass_pair:
 ; CHECK-BE-NOMMA:       # %bb.0: # %entry
 ; CHECK-BE-NOMMA-NEXT:    lxv v3, 16(r3)
-; CHECK-BE-NOMMA-NEXT:    lxv v2, 0(r3)
-; CHECK-BE-NOMMA-NEXT:    stxv v2, 0(r4)
+; CHECK-BE-NOMMA-NEXT:    lxv vs0, 0(r3)
+; CHECK-BE-NOMMA-NEXT:    stxv vs0, 0(r4)
 ; CHECK-BE-NOMMA-NEXT:    stxv v3, 0(r5)
 ; CHECK-BE-NOMMA-NEXT:    blr
 entry:
diff --git a/llvm/test/CodeGen/PowerPC/peephole-subreg-def.mir b/llvm/test/CodeGen/PowerPC/peephole-subreg-def.mir
index e89e2068d92781..2a234e83f8e9a1 100644
--- a/llvm/test/CodeGen/PowerPC/peephole-subreg-def.mir
+++ b/llvm/test/CodeGen/PowerPC/peephole-subreg-def.mir
@@ -18,16 +18,17 @@ body:             |
 
     ; CHECK-LABEL: name: test_peephole_subreg_def
     ; CHECK: liveins: $x3
-    ; CHECK: [[COPY:%[0-9]+]]:g8rc_and_g8rc_nox0 = COPY $x3
-    ; CHECK: [[ADDI8_:%[0-9]+]]:g8rc = ADDI8 [[COPY]], 1
-    ; CHECK: [[EXTSW:%[0-9]+]]:g8rc_and_g8rc_nox0 = EXTSW [[ADDI8_]]
-    ; CHECK: [[LI8_:%[0-9]+]]:g8rc = LI8 0
-    ; CHECK: STB8 [[LI8_]], 0, [[EXTSW]]
-    ; CHECK: [[COPY1:%[0-9]+]]:gprc_and_gprc_nor0 = COPY [[EXTSW]].sub_32
-    ; CHECK: [[COPY2:%[0-9]+]]:gprc_and_gprc_nor0 = COPY [[COPY1]]
-    ; CHECK: [[ADDI:%[0-9]+]]:gprc = ADDI killed [[COPY2]], 1
-    ; CHECK: [[EXTSW_32_64_:%[0-9]+]]:g8rc_and_g8rc_nox0 = EXTSW_32_64 killed [[ADDI]]
-    ; CHECK: STB8 [[LI8_]], 0, killed [[EXTSW_32_64_]]
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:g8rc_and_g8rc_nox0 = COPY $x3
+    ; CHECK-NEXT: [[ADDI8_:%[0-9]+]]:g8rc = ADDI8 [[COPY]], 1
+    ; CHECK-NEXT: [[EXTSW:%[0-9]+]]:g8rc_and_g8rc_nox0 = EXTSW [[ADDI8_]]
+    ; CHECK-NEXT: [[LI8_:%[0-9]+]]:g8rc = LI8 0
+    ; CHECK-NEXT: STB8 [[LI8_]], 0, [[EXTSW]]
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:gprc_and_gprc_nor0 = COPY [[EXTSW]].sub_32
+    ; CHECK-NEXT: [[COPY2:%[0-9]+]]:gprc_and_gprc_nor0 = COPY [[EXTSW]].sub_32
+    ; CHECK-NEXT: [[ADDI:%[0-9]+]]:gprc = ADDI killed [[COPY2]], 1
+    ; CHECK-NEXT: [[EXTSW_32_64_:%[0-9]+]]:g8rc_and_g8rc_nox0 = EXTSW_32_64 killed [[ADDI]]
+    ; CHECK-NEXT: STB8 [[LI8_]], 0, killed [[EXTSW_32_64_]]
     %0:g8rc_and_g8rc_nox0 = COPY $x3
     %1:g8rc = ADDI8 %0, 1
     %2:g8rc_and_g8rc_nox0 = EXTSW %1
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-cttz-vp.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-cttz-vp.ll
index cd4b19f11d1602..5f275da1740cbe 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-cttz-vp.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-cttz-vp.ll
@@ -1735,7 +1735,8 @@ define <32 x i64> @vp_cttz_v32i64(<32 x i64> %va, <32 x i1> %m, i32 zeroext %evl
 ; RV32-NEXT:    sub sp, sp, a1
 ; RV32-NEXT:    .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x30, 0x22, 0x11, 0x30, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 48 + 48 * vlenb
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    slli a1, a1, 4
+; RV32-NEXT:    li a2, 24
+; RV32-NEXT:    mul a1, a1, a2
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 48
 ; RV32-NEXT:    vs8r.v v16, (a1) # Unknown-size Folded Spill
@@ -1770,8 +1771,7 @@ define <32 x i64> @vp_cttz_v32i64(<32 x i64> %va, <32 x i1> %m, i32 zeroext %evl
 ; RV32-NEXT:    vnot.v v8, v8, v0.t
 ; RV32-NEXT:    vand.vv v8, v8, v16, v0.t
 ; RV32-NEXT:    csrr a4, vlenb
-; RV32-NEXT:    li a5, 24
-; RV32-NEXT:    mul a4, a4, a5
+; RV32-NEXT:    slli a4, a4, 5
 ; RV32-NEXT:    add a4, sp, a4
 ; RV32-NEXT:    addi a4, a4, 48
 ; RV32-NEXT:    vs8r.v v8, (a4) # Unknown-size Folded Spill
@@ -1786,13 +1786,12 @@ define <32 x i64> @vp_cttz_v32i64(<32 x i64> %va, <32 x i1> %m, i32 zeroext %evl
 ; RV32-NEXT:    addi a3, sp, 32
 ; RV32-NEXT:    vlse64.v v8, (a3), zero
 ; RV32-NEXT:    csrr a3, vlenb
-; RV32-NEXT:    slli a3, a3, 5
+; RV32-NEXT:    slli a3, a3, 4
 ; RV32-NEXT:    add a3, sp, a3
 ; RV32-NEXT:    addi a3, a3, 48
 ; RV32-NEXT:    vs8r.v v8, (a3) # Unknown-size Folded Spill
 ; RV32-NEXT:    csrr a3, vlenb
-; RV32-NEXT:    li a4, 24
-; RV32-NEXT:    mul a3, a3, a4
+; RV32-NEXT:    slli a3, a3, 5
 ; RV32-NEXT:    add a3, sp, a3
 ; RV32-NEXT:    addi a3, a3, 48
 ; RV32-NEXT:    vl8r.v v16, (a3) # Unknown-size Folded Reload
@@ -1806,24 +1805,21 @@ define <32 x i64> @vp_cttz_v32i64(<32 x i64> %va, <32 x i1> %m, i32 zeroext %evl
 ; RV32-NEXT:    vl8r.v v16, (a3) # Unknown-size Folded Reload
 ; RV32-NEXT:    vand.vv v24, v24, v16, v0.t
 ; RV32-NEXT:    csrr a3, vlenb
-; RV32-NEXT:    li a4, 24
-; RV32-NEXT:    mul a3, a3, a4
+; RV32-NEXT:    slli a3, a3, 5
 ; RV32-NEXT:    add a3, sp, a3
 ; RV32-NEXT:    addi a3, a3, 48
 ; RV32-NEXT:    vl8r.v v16, (a3) # Unknown-size Folded Reload
 ; RV32-NEXT:    vsub.vv v24, v16, v24, v0.t
 ; RV32-NEXT:    vand.vv v16, v24, v8, v0.t
 ; RV32-NEXT:    csrr a3, vlenb
-; RV32-NEXT:    li a4, 24
-; RV32-NEXT:    mul a3, a3, a4
+; RV32-NEXT:    slli a3, a3, 5
 ; RV32-NEXT:    add a3, sp, a3
 ; RV32-NEXT:    addi a3, a3, 48
 ; RV32-NEXT:    vs8r.v v16, (a3) # Unknown-size Folded Spill
 ; RV32-NEXT:    vsrl.vi v16, v24, 2, v0.t
 ; RV32-NEXT:    vand.vv v16, v16, v8, v0.t
 ; RV32-NEXT:    csrr a3, vlenb
-; RV32-NEXT:    li a4, 24
-; RV32-NEXT:    mul a3, a3, a4
+; RV32-NEXT:    slli a3, a3, 5
 ; RV32-NEXT:    add a3, sp, a3
 ; RV32-NEXT:    addi a3, a3, 48
 ; RV32-NEXT:    vl8r.v v8, (a3) # Unknown-size Folded Reload
@@ -1841,8 +1837,7 @@ define <32 x i64> @vp_cttz_v32i64(<32 x i64> %va, <32 x i1> %m, i32 zeroext %evl
 ; RV32-NEXT:    vs8r.v v16, (a3) # Unknown-size Folded Spill
 ; RV32-NEXT:    vlse64.v v8, (a4), zero
 ; RV32-NEXT:    csrr a3, vlenb
-; RV32-NEXT:    li a4, 24
-; RV32-NEXT:    mul a3, a3, a4
+; RV32-NEXT:    slli a3, a3, 5
 ; RV32-NEXT:    add a3, sp, a3
 ; RV32-NEXT:    addi a3, a3, 48
 ; RV32-NEXT:    vs8r.v v8, (a3) # Unknown-size Folded Spill
@@ -1856,8 +1851,7 @@ define <32 x i64> @vp_cttz_v32i64(<32 x i64> %va, <32 x i1> %m, i32 zeroext %evl
 ; RV32-NEXT:    vadd.vv v8, v8, v24, v0.t
 ; RV32-NEXT:    vand.vv v16, v8, v16, v0.t
 ; RV32-NEXT:    csrr a2, vlenb
-; RV32-NEXT:    li a3, 24
-; RV32-NEXT:    mul a2, a2, a3
+; RV32-NEXT:    slli a2, a2, 5
 ; RV32-NEXT:    add a2, sp, a2
 ; RV32-NEXT:    addi a2, a2, 48
 ; RV32-NEXT:    vl8r.v v8, (a2) # Unknown-size Folded Reload
@@ -1875,7 +1869,8 @@ define <32 x i64> @vp_cttz_v32i64(<32 x i64> %va, <32 x i1> %m, i32 zeroext %evl
 ; RV32-NEXT:    and a0, a0, a3
 ; RV32-NEXT:    vmv1r.v v0, v7
 ; RV32-NEXT:    csrr a3, vlenb
-; RV32-NEXT:    slli a3, a3, 4
+; RV32-NEXT:    li a4, 24
+; RV32-NEXT:    mul a3, a3, a4
 ; RV32-NEXT:    add a3, sp, a3
 ; RV32-NEXT:    addi a3, a3, 48
 ; RV32-NEXT:    vl8r.v v16, (a3) # Unknown-size Folded Reload
@@ -1891,31 +1886,44 @@ define <32 x i64> @vp_cttz_v32i64(<32 x i64> %va, <32 x i1> %m, i32 zeroext %evl
 ; RV32-NEXT:    addi a0, a0, 48
 ; RV32-NEXT:    vl8r.v v16, (a0) # Unknown-size Folded Reload
 ; RV32-NEXT:    vand.vv v16, v24, v16, v0.t
-; RV32-NEXT:    vsub.vv v24, v8, v16, v0.t
+; RV32-NEXT:    vsub.vv v8, v8, v16, v0.t
 ; RV32-NEXT:    csrr a0, vlenb
-; RV32-NEXT:    slli a0, a0, 5
+; RV32-NEXT:    li a1, 40
+; RV32-NEXT:    mul a0, a0, a1
 ; RV32-NEXT:    add a0, sp, a0
 ; RV32-NEXT:    addi a0, a0, 48
-; RV32-NEXT:    vl8r.v v8, (a0) # Unknown-size Folded Reload
-; RV32-NEXT:    vand.vv v16, v24, v8, v0.t
+; RV32-NEXT:    vs8r.v v8, (a0) # Unknown-size Folded Spill
+; RV32-NEXT:    csrr a0, vlenb
+; RV32-NEXT:    slli a0, a0, 4
+; RV32-NEXT:    add a0, sp, a0
+; RV32-NEXT:    addi a0, a0, 48
+; RV32-NEXT:    vl8r.v v24, (a0) # Unknown-size Folded Reload
 ; RV32-NEXT:    csrr a0, vlenb
 ; RV32-NEXT:    li a1, 40
 ; RV32-NEXT:    mul a0, a0, a1
 ; RV32-NEXT:    add a0, sp, a0
 ; RV32-NEXT:    addi a0, a0, 48
-; RV32-NEXT:    vs8r.v v16, (a0) # Unknown-size Folded Spill
-; RV32-NEXT:    vsrl.vi v8, v24, 2, v0.t
+; RV32-NEXT:    vl8r.v v16, (a0) # Unknown-size Folded Reload
+; RV32-NEXT:    vand.vv v16, v16, v24, v0.t
 ; RV32-NEXT:    csrr a0, vlenb
-; RV32-NEXT:    slli a0, a0, 5
+; RV32-NEXT:    li a1, 24
+; RV32-NEXT:    mul a0, a0, a1
 ; RV32-NEXT:    add a0, sp, a0
 ; RV32-NEXT:    addi a0, a0, 48
-; RV32-NEXT:    vl8r.v v16, (a0) # Unknown-size Folded Reload
-; RV32-NEXT:    vand.vv v8, v8, v16, v0.t
+; RV32-NEXT:    vs8r.v v16, (a0) # Unknown-size Folded Spill
 ; RV32-NEXT:    csrr a0, vlenb
 ; RV32-NEXT:    li a1, 40
 ; RV32-NEXT:    mul a0, a0, a1
 ; RV32-NEXT:    add a0, sp, a0
 ; RV32-NEXT:    addi a0, a0, 48
+; RV32-NEXT:    vl8r.v v8, (a0) # Unknown-size Folded Reload
+; RV32-NEXT:    vsrl.vi v8, v8, 2, v0.t
+; RV32-NEXT:    vand.vv v8, v8, v24, v0.t
+; RV32-NEXT:    csrr a0, vlenb
+; RV32-NEXT:    li a1, 24
+; RV32-NEXT:    mul a0, a0, a1
+; RV32-NEXT:    add a0, sp, a0
+; RV32-NEXT:    addi a0, a0, 48
 ; RV32-NEXT:    vl8r.v v16, (a0) # Unknown-size Folded Reload
 ; RV32-NEXT:    vadd.vv v8, v16, v8, v0.t
 ; RV32-NEXT:    vsrl.vi v16, v8, 4, v0.t
@@ -1924,8 +1932,7 @@ define <32 x i64> @vp_cttz_v32i64(<32 x i64> %va, <32 x i1> %m, i32 zeroext %evl
 ; RV32-NEXT:    vl8r.v v16, (a0) # Unknown-size Folded Reload
 ; RV32-NEXT:    vand.vv v8, v8, v16, v0.t
 ; RV32-NEXT:    csrr a0, vlenb
-; RV32-NEXT:    li a1, 24
-; RV32-NEXT:    mul a0, a0, a1
+; RV32-NEXT:    slli a0, a0, 5
 ; RV32-NEXT:    add a0, sp, a0
 ; RV32-NEXT:    addi a0, a0, 48
 ; RV32-NEXT:    vl8r.v v16, (a0) # Unknown-size Folded Reload
@@ -3921,7 +3928,8 @@ define <32 x i64> @vp_cttz_zero_undef_v32i64(<32 x i64> %va, <32 x i1> %m, i32 z
 ; RV32-NEXT:    sub sp, sp, a1
 ; RV32-NEXT:    .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x30, 0x22, 0x11, 0x30, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 48 + 48 * vlenb
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    slli a1, a1, 4
+; RV32-NEXT:    li a2, 24
+; RV32-NEXT:    mul a1, a1, a2
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 48
 ; RV32-NEXT:    vs8r.v v16, (a1) # Unknown-size Folded Spill
@@ -3956,8 +3964,7 @@ define <32 x i64> @vp_cttz_zero_undef_v32i64(<32 x i64> %va, <32 x i1> %m, i32 z
 ; RV32-NEXT:    vnot.v v8, v8, v0.t
 ; RV32-NEXT:    vand.vv v8, v8, v16, v0.t
 ; RV32-NEXT:    csrr a4, vlenb
-; RV32-NEXT:    li a5, 24
-; RV32-NEXT:    mul a4, a4, a5
+; RV32-NEXT:    slli a4, a4, 5
 ; RV32-NEXT:    add a4, sp, a4
 ; RV32-NEXT:    addi a4, a4, 48
 ; RV32-NEXT:    vs8r.v v8, (a4) # Unknown-size Folded Spill
@@ -3972,13 +3979,12 @@ define <32 x i64> @vp_cttz_zero_undef_v32i64(<32 x i64> %va, <32 x i1> %m, i32 z
 ; RV32-NEXT:    addi a3, sp, 32
 ; RV32-NEXT:    vlse64.v v8, (a3), zero
 ; RV32-NEXT:    csrr a3, vlenb
-; RV32-NEXT:    slli a3, a3, 5
+; RV32-NEXT:    slli a3, a3, 4
 ; RV32-NEXT:    add a3, sp, a3
 ; RV32-NEXT:    addi a3, a3, 48
 ; RV32-NEXT:    vs8r.v v8, (a3) # Unknown-size Folded Spill
 ; RV32-NEXT:    csrr a3, vlenb
-; RV32-NEXT:    li a4, 24
-; RV32-NEXT:    mul a3, a3, a4
+; RV32-NEXT:    slli a3, a3, 5
 ; RV32-NEXT:    add a3, sp, a3
 ; RV32-NEXT:    addi a3, a3, 48
 ; RV32-NEXT:    vl8r.v v16, (a3) # Unknown-size Folded Reload
@@ -3992,24 +3998,21 @@ define <32 x i64> @vp_cttz_zero_undef_v32i64(<32 x i64> %va, <32 x i1> %m, i32 z
 ; RV32-NEXT:    vl8r.v v16, (a3) # Unknown-size Folded Reload
 ; RV32-NEXT:    vand.vv v24, v24, v16, v0.t
 ; RV32-NEXT:    csrr a3, vlenb
-; RV32-NEXT:    li a4, 24
-; RV32-NEXT:    mul a3, a3, a4
+; RV32-NEXT:    slli a3, a3, 5
 ; RV32-NEXT:    add a3, sp, a3
 ; RV32-NEXT:    addi a3, a3, 48
 ; RV32-NEXT:    vl8r.v v16, (a3) # Unknown-size Folded Reload
 ; RV32-NEXT:    vsub.vv v24, v16, v24, v0.t
 ; RV32-NEXT:    vand.vv v16, v24, v8, v0.t
 ; RV32-NEXT:    csrr a3, vlenb
-; RV32-NEXT:    li a4, 24
-; RV32-NEXT:    mul a3, a3, a4
+; RV32-NEXT:    slli a3, a3, 5
 ; RV32-NEXT:    add a3, sp, a3
 ; RV32-NEXT:    addi a3, a3, 48
 ; RV32-NEXT:    vs8r.v v16, (a3) # Unknown-size Folded Spill
 ; RV32-NEXT:    vsrl.vi v16, v24, 2, v0.t
 ; RV32-NEXT:    vand.vv v16, v16, v8, v0.t
 ; RV32-NEXT:    csrr a3, vlenb
-; RV32-NEXT:    li a4, 24
-; RV32-NEXT:    mul a3, a3, a4
+; RV32-NEXT:    slli a3, a3, 5
 ; RV32-NEXT:    add a3, sp, a3
 ; RV32-NEXT:    addi a3, a3, 48
 ; RV32-NEXT:    vl8r.v v8, (a3) # Unknown-size Folded Reload
@@ -4027,8 +4030,7 @@ define <32 x i64> @vp_cttz_zero_undef_v32i64(<32 x i64> %va, <32 x i1> %m, i32 z
 ; RV32-NEXT:    vs8r.v v16, (a3) # Unknown-size Folded Spill
 ; RV32-NEXT:    vlse64.v v8, (a4), zero
 ; RV32-NEXT:    csrr a3, vlenb
-; RV32-NEXT:    li a4, 24
-; RV32-NEXT:    mul a3, a3, a4
+; RV32-NEXT:    slli a3, a3, 5
 ; RV32-NEXT:    add a3, sp, a3
 ; RV32-NEXT:    addi a3, a3, 48
 ; RV32-NEXT:    vs8r.v v8, (a3) # Unknown-size Folded Spill
@@ -4042,8 +4044,7 @@ define <32 x i64> @vp_cttz_zero_undef_v32i64(<32 x i64> %va, <32 x i1> %m, i32 z
 ; RV32-NEXT:    vadd.vv v8, v8, v24, v0.t
 ; RV32-NEXT:    vand.vv v16, v8, v16, v0.t
 ; RV32-NEXT:    csrr a2, vlenb
-; RV32-NEXT:    li a3, 24
-; RV32-NEXT:    mul a2, a2, a3
+; RV32-NEXT:    slli a2, a2, 5
 ; RV32-NEXT:    add a2, sp, a2
 ; RV32-NEXT:    addi a2, a2, 48
 ; RV32-NEXT:    vl8r.v v8, (a2) # Unknown-size Folded Reload
@@ -4061,7 +4062,8 @@ define <32 x i64> @vp_cttz_zero_undef_v32i64(<32 x i64> %va, <32 x i1> %m, i32 z
 ; RV32-NEXT:    and a0, a0, a3
 ; RV32-NEXT:    vmv1r.v v0, v7
 ; RV32-NEXT:    csrr a3, vlenb
-; RV32-NEXT:    slli a3, a3, 4
+; RV32-NEXT:    li a4, 24
+; RV32-NEXT:    mul a3, a3, a4
 ; RV32-NEXT:    add a3, sp, a3
 ; RV32-NEXT:    addi a3, a3, 48
 ; RV32-NEXT:    vl8r.v v16, (a3) # Unknown-size Folded Reload
@@ -4077,31 +4079,44 @@ define <32 x i64> @vp_cttz_zero_undef_v32i64(<32 x i64> %va, <32 x i1> %m, i32 z
 ; RV32-NEXT:    addi a0, a0, 48
 ; RV32-NEXT:    vl8r.v v16, (a0) # Unknown-size Folded Reload
 ; RV32-NEXT:    vand.vv v16, v24, v16, v0.t
-; RV32-NEXT:    vsub.vv v24, v8, v16, v0.t
+; RV32-NEXT:    vsub.vv v8, v8, v16, v0.t
 ; RV32-NEXT:    csrr a0, vlenb
-; RV32-NEXT:    slli a0, a0, 5
+; RV32-NEXT:    li a1, 40
+; RV32-NEXT:    mul a0, a0, a1
 ; RV32-NEXT:    add a0, sp, a0
 ; RV32-NEXT:    addi a0, a0, 48
-; RV32-NEXT:    vl8r.v v8, (a0) # Unknown-size Folded Reload
-; RV32-NEXT:    vand.vv v16, v24, v8, v0.t
+; RV32-NEXT:    vs8r.v v8, (a0) # Unknown-size Folded Spill
+; RV32-NEXT:    csrr a0, vlenb
+; RV32-NEXT:    slli a0, a0, 4
+; RV32-NEXT:    add a0, sp, a0
+; RV32-NEXT:    addi a0, a0, 48
+; RV32-NEXT:    vl8r.v v24, (a0) # Unknown-size Folded Reload
 ; RV32-NEXT:    csrr a0, vlenb
 ; RV32-NEXT:    li a1, 40
 ; RV32-NEXT:    mul a0, a0, a1
 ; RV32-NEXT:    add a0, sp, a0
 ; RV32-NEXT:    addi a0, a0, 48
-; RV32-NEXT:    vs8r.v v16, (a0) # Unknown-size Folded Spill
-; RV32-NEXT:    vsrl.vi v8, v24, 2, v0.t
+; RV32-NEXT:    vl8r.v v16, (a0) # Unknown-size Folded Reload
+; RV32-NEXT:    vand.vv v16, v16, v24, v0.t
 ; RV32-NEXT:    csrr a0, vlenb
-; RV32-NEXT:    slli a0, a0, 5
+; RV32-NEXT:    li a1, 24
+; RV32-NEXT:    mul a0, a0, a1
 ; RV32-NEXT:    add a0, sp, a0
 ; RV32-NEXT:    addi a0, a0, 48
-; RV32-NEXT:    vl8r.v v16, (a0) # Unknown-size Folded Reload
-; RV32-NEXT:    vand.vv v8, v8, v16, v0.t
+; RV32-NEXT:    vs8r.v v16, (a0) # Unknown-size Folded Spill
 ; RV32-NEXT:    csrr a0, vlenb
 ; RV32-NEXT:    li a1, 40
 ; RV32-NEXT:    mul a0, a0, a1
 ; RV32-NEXT:    add a0, sp, a0
 ; RV32-NEXT:    addi a0, a0, 48
+; RV32-NEXT:    vl8r.v v8, (a0) # Unknown-size Folded Reload
+; RV32-NEXT:    vsrl.vi v8, v8, 2, v0.t
+; RV32-NEXT:    vand.vv v8, v8, v24, v0.t
+; RV32-NEXT:    csrr a0, vlenb
+; RV32-NEXT:    li a1, 24
+; RV32-NEXT:    mul a0, a0, a1
+; RV32-NEXT:    add a0, sp, a0
+; RV32-NEXT:    addi a0, a0, 48
 ; RV32-NEXT:    vl8r.v v16, (a0) # Unknown-size Folded Reload
 ; RV32-NEXT:    vadd.vv v8, v16, v8, v0.t
 ; RV32-NEXT:    vsrl.vi v16, v8, 4, v0.t
@@ -4110,8 +4125,7 @@ define <32 x i64> @vp_cttz_zero_undef_v32i64(<32 x i64> %va, <32 x i1> %m, i32 z
 ; RV32-NEXT:    vl8r.v v16, (a0) # Unknown-size Folded Reload
 ; RV32-NEXT:    vand.vv v8, v8, v16, v0.t
 ; RV32-NEXT:    csrr a0, vlenb
-; RV32-NEXT:    li a1, 24
-; RV32-NEXT:    mul a0, a0, a1
+; RV32-NEXT:    slli a0, a0, 5
 ; RV32-NEXT:    add a0, sp, a0
 ; RV32-NEXT:    addi a0, a0, 48
 ; RV32-NEXT:    vl8r.v v16, (a0) # Unknown-size Folded Reload
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-interleaved-access.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-interleaved-access.ll
index 0c7d7925edf39c..1516c67bf7ecc0 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-interleaved-access.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-interleaved-access.ll
@@ -183,10 +183,10 @@ define {<8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>} @load_
 ; RV32-NEXT:    addi sp, sp, -16
 ; RV32-NEXT:    .cfi_def_cfa_offset 16
 ; RV32-NEXT:    csrr a2, vlenb
-; RV32-NEXT:    li a3, 92
+; RV32-NEXT:    li a3, 88
 ; RV32-NEXT:    mul a2, a2, a3
 ; RV32-NEXT:    sub sp, sp, a2
-; RV32-NEXT:    .cfi_escape 0x0f, 0x0e, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0xdc, 0x00, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 92 * vlenb
+; RV32-NEXT:    .cfi_escape 0x0f, 0x0e, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0xd8, 0x00, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 88 * vlenb
 ; RV32-NEXT:    addi a3, a1, 256
 ; RV32-NEXT:    addi a4, a1, 128
 ; RV32-NEXT:    li a2, 32
@@ -198,15 +198,14 @@ define {<8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>} @load_
 ; RV32-NEXT:    vsetvli zero, a2, e32, m8, ta, ma
 ; RV32-NEXT:    vle32.v v16, (a1)
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li t1, 76
+; RV32-NEXT:    li t1, 72
 ; RV32-NEXT:    mul a1, a1, t1
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
 ; RV32-NEXT:    vs8r.v v16, (a1) # Unknown-size Folded Spill
 ; RV32-NEXT:    vle32.v v8, (a4)
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a4, 68
-; RV32-NEXT:    mul a1, a1, a4
+; RV32-NEXT:    slli a1, a1, 6
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
 ; RV32-NEXT:    vs8r.v v8, (a1) # Unknown-size Folded Spill
@@ -221,7 +220,7 @@ define {<8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>} @load_
 ; RV32-NEXT:    vsetivli zero, 16, e64, m8, ta, ma
 ; RV32-NEXT:    vrgatherei16.vv v24, v16, v6
 ; RV32-NEXT:    csrr a4, vlenb
-; RV32-NEXT:    li a5, 52
+; RV32-NEXT:    li a5, 48
 ; RV32-NEXT:    mul a4, a4, a5
 ; RV32-NEXT:    add a4, sp, a4
 ; RV32-NEXT:    addi a4, a4, 16
@@ -236,13 +235,13 @@ define {<8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>} @load_
 ; RV32-NEXT:    vsetivli zero, 16, e32, m8, ta, ma
 ; RV32-NEXT:    vslidedown.vi v24, v16, 16
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a3, 60
+; RV32-NEXT:    li a3, 56
 ; RV32-NEXT:    mul a1, a1, a3
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
 ; RV32-NEXT:    vs8r.v v24, (a1) # Unknown-size Folded Spill
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a3, 84
+; RV32-NEXT:    li a3, 80
 ; RV32-NEXT:    mul a1, a1, a3
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
@@ -250,14 +249,14 @@ define {<8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>} @load_
 ; RV32-NEXT:    vsetivli zero, 16, e32, m4, ta, ma
 ; RV32-NEXT:    vmerge.vvm v20, v24, v16, v0
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a3, 40
+; RV32-NEXT:    li a3, 36
 ; RV32-NEXT:    mul a1, a1, a3
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
 ; RV32-NEXT:    vs4r.v v20, (a1) # Unknown-size Folded Spill
 ; RV32-NEXT:    vmv1r.v v0, v7
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a3, 76
+; RV32-NEXT:    li a3, 72
 ; RV32-NEXT:    mul a1, a1, a3
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
@@ -267,7 +266,7 @@ define {<8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>} @load_
 ; RV32-NEXT:    vsetivli zero, 16, e64, m8, ta, ma
 ; RV32-NEXT:    vrgatherei16.vv v8, v24, v4
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a3, 44
+; RV32-NEXT:    li a3, 40
 ; RV32-NEXT:    mul a1, a1, a3
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
@@ -282,14 +281,14 @@ define {<8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>} @load_
 ; RV32-NEXT:    vle16.v v14, (a4)
 ; RV32-NEXT:    vmv.s.x v12, a3
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a3, 84
+; RV32-NEXT:    li a3, 80
 ; RV32-NEXT:    mul a1, a1, a3
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
 ; RV32-NEXT:    vl8r.v v24, (a1) # Unknown-size Folded Reload
 ; RV32-NEXT:    vmv4r.v v8, v24
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a3, 60
+; RV32-NEXT:    li a3, 56
 ; RV32-NEXT:    mul a1, a1, a3
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
@@ -304,8 +303,7 @@ define {<8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>} @load_
 ; RV32-NEXT:    vs4r.v v8, (a1) # Unknown-size Folded Spill
 ; RV32-NEXT:    vmv1r.v v0, v12
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a3, 68
-; RV32-NEXT:    mul a1, a1, a3
+; RV32-NEXT:    slli a1, a1, 6
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
 ; RV32-NEXT:    vl8r.v v24, (a1) # Unknown-size Folded Reload
@@ -314,7 +312,8 @@ define {<8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>} @load_
 ; RV32-NEXT:    vsetivli zero, 16, e64, m8, ta, ma
 ; RV32-NEXT:    vrgatherei16.vv v16, v24, v14
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    slli a1, a1, 5
+; RV32-NEXT:    li a3, 28
+; RV32-NEXT:    mul a1, a1, a3
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
 ; RV32-NEXT:    vs8r.v v16, (a1) # Unknown-size Folded Spill
@@ -330,22 +329,22 @@ define {<8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>} @load_
 ; RV32-NEXT:    addi a4, a4, 12
 ; RV32-NEXT:    addi a5, a5, 768
 ; RV32-NEXT:    addi a7, a7, -1024
-; RV32-NEXT:    vmv.s.x v1, a6
-; RV32-NEXT:    vmv.s.x v12, t0
+; RV32-NEXT:    vmv.s.x v13, a6
+; RV32-NEXT:    vmv.s.x v2, t0
 ; RV32-NEXT:    vmv.s.x v0, a1
-; RV32-NEXT:    vmv.s.x v3, a3
-; RV32-NEXT:    vmv.s.x v2, a4
-; RV32-NEXT:    vmv.s.x v13, a5
-; RV32-NEXT:    vmv.s.x v14, a7
+; RV32-NEXT:    vmv.s.x v12, a3
+; RV32-NEXT:    vmv.s.x v3, a4
+; RV32-NEXT:    vmv.s.x v14, a5
+; RV32-NEXT:    vmv.s.x v1, a7
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a3, 60
+; RV32-NEXT:    li a3, 56
 ; RV32-NEXT:    mul a1, a1, a3
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
 ; RV32-NEXT:    vl8r.v v16, (a1) # Unknown-size Folded Reload
 ; RV32-NEXT:    vmv4r.v v8, v16
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a3, 84
+; RV32-NEXT:    li a3, 80
 ; RV32-NEXT:    mul a1, a1, a3
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
@@ -354,15 +353,14 @@ define {<8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>} @load_
 ; RV32-NEXT:    vmerge.vvm v20, v8, v16, v0
 ; RV32-NEXT:    addi a1, sp, 16
 ; RV32-NEXT:    vs4r.v v20, (a1) # Unknown-size Folded Spill
-; RV32-NEXT:    vmv1r.v v0, v3
+; RV32-NEXT:    vmv1r.v v0, v12
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a3, 68
-; RV32-NEXT:    mul a1, a1, a3
+; RV32-NEXT:    slli a1, a1, 6
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
 ; RV32-NEXT:    vl8r.v v16, (a1) # Unknown-size Folded Reload
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a3, 76
+; RV32-NEXT:    li a3, 72
 ; RV32-NEXT:    mul a1, a1, a3
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
@@ -370,28 +368,23 @@ define {<8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>} @load_
 ; RV32-NEXT:    vsetvli zero, a2, e32, m8, ta, ma
 ; RV32-NEXT:    vmerge.vvm v24, v16, v24, v0
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    slli a1, a1, 4
+; RV32-NEXT:    li a3, 12
+; RV32-NEXT:    mul a1, a1, a3
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
 ; RV32-NEXT:    vs8r.v v24, (a1) # Unknown-size Folded Spill
-; RV32-NEXT:    vmv1r.v v0, v2
+; RV32-NEXT:    vmv1r.v v0, v3
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a3, 84
+; RV32-NEXT:    li a3, 80
 ; RV32-NEXT:    mul a1, a1, a3
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
 ; RV32-NEXT:    vl8r.v v24, (a1) # Unknown-size Folded Reload
 ; RV32-NEXT:    vsetivli zero, 16, e32, m4, ta, ma
-; RV32-NEXT:    vmerge.vvm v24, v8, v24, v0
-; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a3, 12
-; RV32-NEXT:    mul a1, a1, a3
-; RV32-NEXT:    add a1, sp, a1
-; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vs4r.v v24, (a1) # Unknown-size Folded Spill
-; RV32-NEXT:    vmv1r.v v0, v13
+; RV32-NEXT:    vmerge.vvm v4, v8, v24, v0
+; RV32-NEXT:    vmv1r.v v0, v14
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a3, 76
+; RV32-NEXT:    li a3, 72
 ; RV32-NEXT:    mul a1, a1, a3
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
@@ -403,45 +396,44 @@ define {<8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>} @load_
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
 ; RV32-NEXT:    vs8r.v v24, (a1) # Unknown-size Folded Spill
-; RV32-NEXT:    vmv1r.v v0, v1
+; RV32-NEXT:    vmv1r.v v0, v13
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a3, 84
+; RV32-NEXT:    li a3, 80
 ; RV32-NEXT:    mul a1, a1, a3
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
 ; RV32-NEXT:    vl8r.v v24, (a1) # Unknown-size Folded Reload
 ; RV32-NEXT:    vsetivli zero, 16, e32, m4, ta, ma
-; RV32-NEXT:    vmerge.vvm v4, v8, v24, v0
+; RV32-NEXT:    vmerge.vvm v12, v8, v24, v0
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a3, 28
+; RV32-NEXT:    li a3, 20
 ; RV32-NEXT:    mul a1, a1, a3
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vs4r.v v4, (a1) # Unknown-size Folded Spill
-; RV32-NEXT:    vmv1r.v v0, v14
+; RV32-NEXT:    vs4r.v v12, (a1) # Unknown-size Folded Spill
+; RV32-NEXT:    vmv1r.v v0, v1
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a3, 76
+; RV32-NEXT:    li a3, 72
 ; RV32-NEXT:    mul a1, a1, a3
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
 ; RV32-NEXT:    vl8r.v v16, (a1) # Unknown-size Folded Reload
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a3, 68
-; RV32-NEXT:    mul a1, a1, a3
+; RV32-NEXT:    slli a1, a1, 6
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
 ; RV32-NEXT:    vl8r.v v24, (a1) # Unknown-size Folded Reload
 ; RV32-NEXT:    vsetvli zero, a2, e32, m8, ta, ma
 ; RV32-NEXT:    vmerge.vvm v16, v24, v16, v0
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a2, 76
+; RV32-NEXT:    li a2, 72
 ; RV32-NEXT:    mul a1, a1, a2
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
 ; RV32-NEXT:    vs8r.v v16, (a1) # Unknown-size Folded Spill
-; RV32-NEXT:    vmv1r.v v0, v12
+; RV32-NEXT:    vmv1r.v v0, v2
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a2, 84
+; RV32-NEXT:    li a2, 80
 ; RV32-NEXT:    mul a1, a1, a2
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
@@ -449,190 +441,208 @@ define {<8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>} @load_
 ; RV32-NEXT:    vsetivli zero, 16, e32, m4, ta, ma
 ; RV32-NEXT:    vmerge.vvm v8, v8, v16, v0
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a2, 68
-; RV32-NEXT:    mul a1, a1, a2
+; RV32-NEXT:    slli a1, a1, 6
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
 ; RV32-NEXT:    vs4r.v v8, (a1) # Unknown-size Folded Spill
 ; RV32-NEXT:    lui a1, 32
 ; RV32-NEXT:    addi a1, a1, 4
 ; RV32-NEXT:    vsetivli zero, 4, e32, m1, ta, ma
-; RV32-NEXT:    vmv.v.x v16, a1
+; RV32-NEXT:    vmv.v.x v12, a1
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a2, 40
+; RV32-NEXT:    li a2, 36
 ; RV32-NEXT:    mul a1, a1, a2
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
 ; RV32-NEXT:    vl4r.v v8, (a1) # Unknown-size Folded Reload
 ; RV32-NEXT:    vsetivli zero, 8, e64, m4, ta, ma
-; RV32-NEXT:    vrgatherei16.vv v20, v8, v16
+; RV32-NEXT:    vrgatherei16.vv v16, v8, v12
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a2, 52
+; RV32-NEXT:    li a2, 48
 ; RV32-NEXT:    mul a1, a1, a2
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
 ; RV32-NEXT:    vl8r.v v8, (a1) # Unknown-size Folded Reload
 ; RV32-NEXT:    vsetivli zero, 12, e32, m4, tu, ma
-; RV32-NEXT:    vmv.v.v v20, v8
+; RV32-NEXT:    vmv.v.v v16, v8
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a2, 84
+; RV32-NEXT:    li a2, 80
 ; RV32-NEXT:    mul a1, a1, a2
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vs4r.v v20, (a1) # Unknown-size Folded Spill
+; RV32-NEXT:    vs4r.v v16, (a1) # Unknown-size Folded Spill
 ; RV32-NEXT:    lui a1, 48
 ; RV32-NEXT:    lui a2, %hi(.LCPI8_3)
 ; RV32-NEXT:    addi a2, a2, %lo(.LCPI8_3)
 ; RV32-NEXT:    addi a1, a1, 5
 ; RV32-NEXT:    vsetivli zero, 8, e16, m1, ta, ma
-; RV32-NEXT:    vle16.v v28, (a2)
+; RV32-NEXT:    vle16.v v24, (a2)
 ; RV32-NEXT:    vsetivli zero, 4, e32, m1, ta, ma
-; RV32-NEXT:    vmv.v.x v20, a1
+; RV32-NEXT:    vmv.v.x v25, a1
 ; RV32-NEXT:    csrr a1, vlenb
 ; RV32-NEXT:    li a2, 24
 ; RV32-NEXT:    mul a1, a1, a2
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vl4r.v v12, (a1) # Unknown-size Folded Reload
+; RV32-NEXT:    vl4r.v v8, (a1) # Unknown-size Folded Reload
 ; RV32-NEXT:    vsetivli zero, 8, e64, m4, ta, ma
-; RV32-NEXT:    vrgatherei16.vv v8, v12, v20
+; RV32-NEXT:    vrgatherei16.vv v16, v8, v25
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a2, 44
+; RV32-NEXT:    li a2, 40
 ; RV32-NEXT:    mul a1, a1, a2
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vl8r.v v16, (a1) # Unknown-size Folded Reload
+; RV32-NEXT:    vl8r.v v8, (a1) # Unknown-size Folded Reload
 ; RV32-NEXT:    vsetivli zero, 12, e32, m4, tu, ma
-; RV32-NEXT:    vmv.v.v v8, v16
+; RV32-NEXT:    vmv.v.v v16, v8
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a2, 52
+; RV32-NEXT:    li a2, 56
 ; RV32-NEXT:    mul a1, a1, a2
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vs4r.v v8, (a1) # Unknown-size Folded Spill
+; RV32-NEXT:    vs4r.v v16, (a1) # Unknown-size Folded Spill
 ; RV32-NEXT:    addi a1, sp, 16
-; RV32-NEXT:    vl4r.v v12, (a1) # Unknown-size Folded Reload
+; RV32-NEXT:    vl4r.v v8, (a1) # Unknown-size Folded Reload
 ; RV32-NEXT:    vsetivli zero, 8, e64, m4, ta, ma
-; RV32-NEXT:    vrgatherei16.vv v24, v12, v28
+; RV32-NEXT:    vrgatherei16.vv v16, v8, v24
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    slli a1, a1, 5
+; RV32-NEXT:    li a2, 28
+; RV32-NEXT:    mul a1, a1, a2
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vl8r.v v16, (a1) # Unknown-size Folded Reload
+; RV32-NEXT:    vl8r.v v8, (a1) # Unknown-size Folded Reload
 ; RV32-NEXT:    vsetivli zero, 10, e32, m4, tu, ma
-; RV32-NEXT:    vmv.v.v v24, v16
+; RV32-NEXT:    vmv.v.v v16, v8
+; RV32-NEXT:    csrr a1, vlenb
+; RV32-NEXT:    li a2, 40
+; RV32-NEXT:    mul a1, a1, a2
+; RV32-NEXT:    add a1, sp, a1
+; RV32-NEXT:    addi a1, a1, 16
+; RV32-NEXT:    vs4r.v v16, (a1) # Unknown-size Folded Spill
 ; RV32-NEXT:    lui a1, %hi(.LCPI8_4)
 ; RV32-NEXT:    addi a1, a1, %lo(.LCPI8_4)
 ; RV32-NEXT:    lui a2, %hi(.LCPI8_5)
 ; RV32-NEXT:    addi a2, a2, %lo(.LCPI8_5)
 ; RV32-NEXT:    vsetivli zero, 16, e16, m2, ta, ma
-; RV32-NEXT:    vle16.v v12, (a1)
+; RV32-NEXT:    vle16.v v26, (a1)
 ; RV32-NEXT:    vsetivli zero, 8, e16, m1, ta, ma
-; RV32-NEXT:    vle16.v v28, (a2)
+; RV32-NEXT:    vle16.v v24, (a2)
 ; RV32-NEXT:    lui a1, %hi(.LCPI8_6)
 ; RV32-NEXT:    addi a1, a1, %lo(.LCPI8_6)
 ; RV32-NEXT:    vsetivli zero, 16, e64, m8, ta, ma
-; RV32-NEXT:    vle16.v v30, (a1)
-; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    slli a1, a1, 4
-; RV32-NEXT:    add a1, sp, a1
-; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vl8r.v v0, (a1) # Unknown-size Folded Reload
-; RV32-NEXT:    vrgatherei16.vv v16, v0, v12
+; RV32-NEXT:    vle16.v v2, (a1)
 ; RV32-NEXT:    csrr a1, vlenb
 ; RV32-NEXT:    li a2, 12
 ; RV32-NEXT:    mul a1, a1, a2
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vl4r.v v20, (a1) # Unknown-size Folded Reload
+; RV32-NEXT:    vl8r.v v16, (a1) # Unknown-size Folded Reload
+; RV32-NEXT:    vrgatherei16.vv v8, v16, v26
 ; RV32-NEXT:    vsetivli zero, 8, e64, m4, ta, ma
-; RV32-NEXT:    vrgatherei16.vv v12, v20, v28
+; RV32-NEXT:    vrgatherei16.vv v20, v4, v24
 ; RV32-NEXT:    vsetivli zero, 10, e32, m4, tu, ma
-; RV32-NEXT:    vmv.v.v v12, v16
+; RV32-NEXT:    vmv.v.v v20, v8
 ; RV32-NEXT:    csrr a1, vlenb
 ; RV32-NEXT:    slli a1, a1, 2
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vl8r.v v0, (a1) # Unknown-size Folded Reload
+; RV32-NEXT:    vl8r.v v8, (a1) # Unknown-size Folded Reload
 ; RV32-NEXT:    vsetivli zero, 16, e64, m8, ta, ma
-; RV32-NEXT:    vrgatherei16.vv v16, v0, v30
+; RV32-NEXT:    vrgatherei16.vv v24, v8, v2
+; RV32-NEXT:    csrr a1, vlenb
+; RV32-NEXT:    li a2, 48
+; RV32-NEXT:    mul a1, a1, a2
+; RV32-NEXT:    add a1, sp, a1
+; RV32-NEXT:    addi a1, a1, 16
+; RV32-NEXT:    vs8r.v v24, (a1) # Unknown-size Folded Spill
 ; RV32-NEXT:    lui a1, %hi(.LCPI8_7)
 ; RV32-NEXT:    addi a1, a1, %lo(.LCPI8_7)
 ; RV32-NEXT:    lui a2, %hi(.LCPI8_8)
 ; RV32-NEXT:    addi a2, a2, %lo(.LCPI8_8)
 ; RV32-NEXT:    vsetivli zero, 8, e16, m1, ta, ma
-; RV32-NEXT:    vle16.v v20, (a1)
+; RV32-NEXT:    vle16.v v12, (a1)
 ; RV32-NEXT:    lui a1, %hi(.LCPI8_9)
 ; RV32-NEXT:    addi a1, a1, %lo(.LCPI8_9)
 ; RV32-NEXT:    vsetivli zero, 16, e16, m2, ta, ma
-; RV32-NEXT:    vle16.v v8, (a2)
+; RV32-NEXT:    vle16.v v16, (a2)
 ; RV32-NEXT:    vsetivli zero, 8, e64, m4, ta, ma
-; RV32-NEXT:    vle16.v v10, (a1)
+; RV32-NEXT:    vle16.v v18, (a1)
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a2, 28
+; RV32-NEXT:    li a2, 20
 ; RV32-NEXT:    mul a1, a1, a2
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
 ; RV32-NEXT:    vl4r.v v0, (a1) # Unknown-size Folded Reload
-; RV32-NEXT:    vrgatherei16.vv v28, v0, v20
+; RV32-NEXT:    vrgatherei16.vv v24, v0, v12
+; RV32-NEXT:    csrr a1, vlenb
+; RV32-NEXT:    li a2, 48
+; RV32-NEXT:    mul a1, a1, a2
+; RV32-NEXT:    add a1, sp, a1
+; RV32-NEXT:    addi a1, a1, 16
+; RV32-NEXT:    vl8r.v v0, (a1) # Unknown-size Folded Reload
 ; RV32-NEXT:    vsetivli zero, 10, e32, m4, tu, ma
-; RV32-NEXT:    vmv.v.v v28, v16
+; RV32-NEXT:    vmv.v.v v24, v0
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a2, 76
+; RV32-NEXT:    li a2, 72
 ; RV32-NEXT:    mul a1, a1, a2
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
 ; RV32-NEXT:    vl8r.v v0, (a1) # Unknown-size Folded Reload
 ; RV32-NEXT:    vsetivli zero, 16, e64, m8, ta, ma
-; RV32-NEXT:    vrgatherei16.vv v16, v0, v8
+; RV32-NEXT:    vrgatherei16.vv v8, v0, v16
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a2, 60
+; RV32-NEXT:    li a2, 48
 ; RV32-NEXT:    mul a1, a1, a2
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vs8r.v v16, (a1) # Unknown-size Folded Spill
+; RV32-NEXT:    vs8r.v v8, (a1) # Unknown-size Folded Spill
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a2, 68
-; RV32-NEXT:    mul a1, a1, a2
+; RV32-NEXT:    slli a1, a1, 6
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
 ; RV32-NEXT:    vl4r.v v4, (a1) # Unknown-size Folded Reload
 ; RV32-NEXT:    vsetivli zero, 8, e64, m4, ta, ma
-; RV32-NEXT:    vrgatherei16.vv v16, v4, v10
+; RV32-NEXT:    vrgatherei16.vv v8, v4, v18
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a2, 60
+; RV32-NEXT:    li a2, 48
 ; RV32-NEXT:    mul a1, a1, a2
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
 ; RV32-NEXT:    vl8r.v v0, (a1) # Unknown-size Folded Reload
 ; RV32-NEXT:    vsetivli zero, 10, e32, m4, tu, ma
-; RV32-NEXT:    vmv.v.v v16, v0
+; RV32-NEXT:    vmv.v.v v8, v0
 ; RV32-NEXT:    addi a1, a0, 320
 ; RV32-NEXT:    vsetivli zero, 16, e32, m4, ta, ma
-; RV32-NEXT:    vse32.v v16, (a1)
+; RV32-NEXT:    vse32.v v8, (a1)
 ; RV32-NEXT:    addi a1, a0, 256
-; RV32-NEXT:    vse32.v v28, (a1)
+; RV32-NEXT:    vse32.v v24, (a1)
 ; RV32-NEXT:    addi a1, a0, 192
-; RV32-NEXT:    vse32.v v12, (a1)
+; RV32-NEXT:    vse32.v v20, (a1)
 ; RV32-NEXT:    addi a1, a0, 128
-; RV32-NEXT:    vse32.v v24, (a1)
+; RV32-NEXT:    csrr a2, vlenb
+; RV32-NEXT:    li a3, 40
+; RV32-NEXT:    mul a2, a2, a3
+; RV32-NEXT:    add a2, sp, a2
+; RV32-NEXT:    addi a2, a2, 16
+; RV32-NEXT:    vl4r.v v8, (a2) # Unknown-size Folded Reload
+; RV32-NEXT:    vse32.v v8, (a1)
 ; RV32-NEXT:    addi a1, a0, 64
 ; RV32-NEXT:    csrr a2, vlenb
-; RV32-NEXT:    li a3, 52
+; RV32-NEXT:    li a3, 56
 ; RV32-NEXT:    mul a2, a2, a3
 ; RV32-NEXT:    add a2, sp, a2
 ; RV32-NEXT:    addi a2, a2, 16
 ; RV32-NEXT:    vl4r.v v8, (a2) # Unknown-size Folded Reload
 ; RV32-NEXT:    vse32.v v8, (a1)
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a2, 84
+; RV32-NEXT:    li a2, 80
 ; RV32-NEXT:    mul a1, a1, a2
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
 ; RV32-NEXT:    vl4r.v v8, (a1) # Unknown-size Folded Reload
 ; RV32-NEXT:    vse32.v v8, (a0)
 ; RV32-NEXT:    csrr a0, vlenb
-; RV32-NEXT:    li a1, 92
+; RV32-NEXT:    li a1, 88
 ; RV32-NEXT:    mul a0, a0, a1
 ; RV32-NEXT:    add sp, sp, a0
 ; RV32-NEXT:    .cfi_def_cfa sp, 16
@@ -665,145 +675,154 @@ define {<8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>} @load_
 ; RV64-NEXT:    vmv.v.x v17, a6
 ; RV64-NEXT:    addi a6, a2, 65
 ; RV64-NEXT:    vsetivli zero, 8, e64, m4, ta, ma
-; RV64-NEXT:    vrgather.vi v24, v8, 4
+; RV64-NEXT:    vrgather.vi v4, v8, 4
 ; RV64-NEXT:    vrgather.vi v20, v8, 5
 ; RV64-NEXT:    csrr a7, vlenb
-; RV64-NEXT:    li t0, 68
+; RV64-NEXT:    li t0, 84
 ; RV64-NEXT:    mul a7, a7, t0
 ; RV64-NEXT:    add a7, sp, a7
 ; RV64-NEXT:    addi a7, a7, 16
 ; RV64-NEXT:    vs4r.v v20, (a7) # Unknown-size Folded Spill
 ; RV64-NEXT:    vrgatherei16.vv v20, v8, v16
 ; RV64-NEXT:    csrr a7, vlenb
-; RV64-NEXT:    li t0, 84
-; RV64-NEXT:    mul a7, a7, t0
+; RV64-NEXT:    slli a7, a7, 6
 ; RV64-NEXT:    add a7, sp, a7
 ; RV64-NEXT:    addi a7, a7, 16
 ; RV64-NEXT:    vs4r.v v20, (a7) # Unknown-size Folded Spill
 ; RV64-NEXT:    vrgatherei16.vv v20, v8, v17
 ; RV64-NEXT:    csrr a7, vlenb
-; RV64-NEXT:    li t0, 72
+; RV64-NEXT:    li t0, 56
 ; RV64-NEXT:    mul a7, a7, t0
 ; RV64-NEXT:    add a7, sp, a7
 ; RV64-NEXT:    addi a7, a7, 16
 ; RV64-NEXT:    vs4r.v v20, (a7) # Unknown-size Folded Spill
 ; RV64-NEXT:    vrgather.vi v16, v8, 2
 ; RV64-NEXT:    csrr a7, vlenb
-; RV64-NEXT:    slli a7, a7, 6
+; RV64-NEXT:    li t0, 72
+; RV64-NEXT:    mul a7, a7, t0
 ; RV64-NEXT:    add a7, sp, a7
 ; RV64-NEXT:    addi a7, a7, 16
 ; RV64-NEXT:    vs4r.v v16, (a7) # Unknown-size Folded Spill
 ; RV64-NEXT:    vrgather.vi v16, v8, 3
 ; RV64-NEXT:    csrr a7, vlenb
-; RV64-NEXT:    li t0, 56
+; RV64-NEXT:    li t0, 68
 ; RV64-NEXT:    mul a7, a7, t0
 ; RV64-NEXT:    add a7, sp, a7
 ; RV64-NEXT:    addi a7, a7, 16
 ; RV64-NEXT:    vs4r.v v16, (a7) # Unknown-size Folded Spill
 ; RV64-NEXT:    vsetivli zero, 8, e64, m8, ta, ma
-; RV64-NEXT:    vslidedown.vi v16, v8, 8
+; RV64-NEXT:    vslidedown.vi v8, v8, 8
 ; RV64-NEXT:    csrr a7, vlenb
-; RV64-NEXT:    li t0, 48
+; RV64-NEXT:    li t0, 40
 ; RV64-NEXT:    mul a7, a7, t0
 ; RV64-NEXT:    add a7, sp, a7
 ; RV64-NEXT:    addi a7, a7, 16
-; RV64-NEXT:    vs8r.v v16, (a7) # Unknown-size Folded Spill
-; RV64-NEXT:    vmv.s.x v21, a4
+; RV64-NEXT:    vs8r.v v8, (a7) # Unknown-size Folded Spill
+; RV64-NEXT:    vmv.s.x v0, a4
+; RV64-NEXT:    csrr a4, vlenb
+; RV64-NEXT:    slli a4, a4, 5
+; RV64-NEXT:    add a4, sp, a4
+; RV64-NEXT:    addi a4, a4, 16
+; RV64-NEXT:    vs1r.v v0, (a4) # Unknown-size Folded Spill
 ; RV64-NEXT:    vsetivli zero, 16, e64, m8, ta, ma
-; RV64-NEXT:    vle64.v v8, (a1)
-; RV64-NEXT:    vle64.v v0, (a3)
+; RV64-NEXT:    vle64.v v24, (a1)
+; RV64-NEXT:    vle64.v v16, (a3)
 ; RV64-NEXT:    csrr a1, vlenb
-; RV64-NEXT:    li a3, 40
+; RV64-NEXT:    li a3, 76
 ; RV64-NEXT:    mul a1, a1, a3
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
-; RV64-NEXT:    vs8r.v v0, (a1) # Unknown-size Folded Spill
-; RV64-NEXT:    vle16.v v2, (a5)
-; RV64-NEXT:    vmv.s.x v20, a6
-; RV64-NEXT:    vmv1r.v v0, v21
-; RV64-NEXT:    vmv1r.v v7, v21
+; RV64-NEXT:    vs8r.v v16, (a1) # Unknown-size Folded Spill
+; RV64-NEXT:    vle16.v v12, (a5)
+; RV64-NEXT:    csrr a1, vlenb
+; RV64-NEXT:    slli a1, a1, 4
+; RV64-NEXT:    add a1, sp, a1
+; RV64-NEXT:    addi a1, a1, 16
+; RV64-NEXT:    vs2r.v v12, (a1) # Unknown-size Folded Spill
+; RV64-NEXT:    vmv.s.x v2, a6
 ; RV64-NEXT:    vsetivli zero, 8, e64, m4, ta, mu
-; RV64-NEXT:    vrgather.vi v24, v16, 2, v0.t
+; RV64-NEXT:    vrgather.vi v4, v8, 2, v0.t
 ; RV64-NEXT:    csrr a1, vlenb
 ; RV64-NEXT:    li a3, 60
 ; RV64-NEXT:    mul a1, a1, a3
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
-; RV64-NEXT:    vs4r.v v24, (a1) # Unknown-size Folded Spill
-; RV64-NEXT:    vmv1r.v v0, v20
+; RV64-NEXT:    vs4r.v v4, (a1) # Unknown-size Folded Spill
+; RV64-NEXT:    vmv1r.v v0, v2
+; RV64-NEXT:    vmv8r.v v8, v24
 ; RV64-NEXT:    csrr a1, vlenb
-; RV64-NEXT:    li a3, 40
+; RV64-NEXT:    li a3, 48
 ; RV64-NEXT:    mul a1, a1, a3
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
-; RV64-NEXT:    vl8r.v v16, (a1) # Unknown-size Folded Reload
+; RV64-NEXT:    vs8r.v v24, (a1) # Unknown-size Folded Spill
 ; RV64-NEXT:    vsetivli zero, 16, e64, m8, ta, ma
-; RV64-NEXT:    vmerge.vvm v24, v16, v8, v0
-; RV64-NEXT:    vmv8r.v v16, v8
+; RV64-NEXT:    vmerge.vvm v24, v16, v24, v0
 ; RV64-NEXT:    csrr a1, vlenb
-; RV64-NEXT:    li a3, 76
-; RV64-NEXT:    mul a1, a1, a3
+; RV64-NEXT:    slli a1, a1, 4
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
-; RV64-NEXT:    vs8r.v v8, (a1) # Unknown-size Folded Spill
-; RV64-NEXT:    vrgatherei16.vv v8, v24, v2
+; RV64-NEXT:    vl2r.v v16, (a1) # Unknown-size Folded Reload
+; RV64-NEXT:    vrgatherei16.vv v0, v24, v16
 ; RV64-NEXT:    csrr a1, vlenb
-; RV64-NEXT:    slli a1, a1, 5
+; RV64-NEXT:    li a3, 24
+; RV64-NEXT:    mul a1, a1, a3
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
-; RV64-NEXT:    vs8r.v v8, (a1) # Unknown-size Folded Spill
+; RV64-NEXT:    vs8r.v v0, (a1) # Unknown-size Folded Spill
 ; RV64-NEXT:    lui a1, 2
 ; RV64-NEXT:    lui a3, %hi(.LCPI8_1)
 ; RV64-NEXT:    addi a3, a3, %lo(.LCPI8_1)
 ; RV64-NEXT:    addi a1, a1, 130
-; RV64-NEXT:    vle16.v v8, (a3)
+; RV64-NEXT:    vle16.v v16, (a3)
 ; RV64-NEXT:    csrr a3, vlenb
-; RV64-NEXT:    slli a3, a3, 4
+; RV64-NEXT:    slli a3, a3, 3
 ; RV64-NEXT:    add a3, sp, a3
 ; RV64-NEXT:    addi a3, a3, 16
-; RV64-NEXT:    vs2r.v v8, (a3) # Unknown-size Folded Spill
+; RV64-NEXT:    vs2r.v v16, (a3) # Unknown-size Folded Spill
 ; RV64-NEXT:    vmv.s.x v2, a1
-; RV64-NEXT:    vmv1r.v v0, v7
-; RV64-NEXT:    addi a1, sp, 16
-; RV64-NEXT:    vs1r.v v7, (a1) # Unknown-size Folded Spill
 ; RV64-NEXT:    csrr a1, vlenb
-; RV64-NEXT:    li a3, 68
+; RV64-NEXT:    slli a1, a1, 5
+; RV64-NEXT:    add a1, sp, a1
+; RV64-NEXT:    addi a1, a1, 16
+; RV64-NEXT:    vl1r.v v0, (a1) # Unknown-size Folded Reload
+; RV64-NEXT:    csrr a1, vlenb
+; RV64-NEXT:    li a3, 84
 ; RV64-NEXT:    mul a1, a1, a3
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
 ; RV64-NEXT:    vl4r.v v24, (a1) # Unknown-size Folded Reload
 ; RV64-NEXT:    csrr a1, vlenb
-; RV64-NEXT:    li a3, 48
+; RV64-NEXT:    li a3, 40
 ; RV64-NEXT:    mul a1, a1, a3
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
-; RV64-NEXT:    vl8r.v v8, (a1) # Unknown-size Folded Reload
+; RV64-NEXT:    vl8r.v v16, (a1) # Unknown-size Folded Reload
 ; RV64-NEXT:    vsetivli zero, 8, e64, m4, ta, mu
-; RV64-NEXT:    vrgather.vi v24, v8, 3, v0.t
+; RV64-NEXT:    vrgather.vi v24, v16, 3, v0.t
 ; RV64-NEXT:    csrr a1, vlenb
-; RV64-NEXT:    li a3, 68
+; RV64-NEXT:    li a3, 84
 ; RV64-NEXT:    mul a1, a1, a3
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
 ; RV64-NEXT:    vs4r.v v24, (a1) # Unknown-size Folded Spill
 ; RV64-NEXT:    vmv1r.v v0, v2
 ; RV64-NEXT:    csrr a1, vlenb
-; RV64-NEXT:    li a3, 40
+; RV64-NEXT:    li a3, 76
 ; RV64-NEXT:    mul a1, a1, a3
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
-; RV64-NEXT:    vl8r.v v8, (a1) # Unknown-size Folded Reload
+; RV64-NEXT:    vl8r.v v16, (a1) # Unknown-size Folded Reload
 ; RV64-NEXT:    vsetivli zero, 16, e64, m8, ta, ma
-; RV64-NEXT:    vmerge.vvm v24, v8, v16, v0
+; RV64-NEXT:    vmerge.vvm v24, v16, v8, v0
 ; RV64-NEXT:    csrr a1, vlenb
-; RV64-NEXT:    slli a1, a1, 4
+; RV64-NEXT:    slli a1, a1, 3
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
-; RV64-NEXT:    vl2r.v v16, (a1) # Unknown-size Folded Reload
-; RV64-NEXT:    vrgatherei16.vv v0, v24, v16
+; RV64-NEXT:    vl2r.v v8, (a1) # Unknown-size Folded Reload
+; RV64-NEXT:    vrgatherei16.vv v0, v24, v8
 ; RV64-NEXT:    csrr a1, vlenb
-; RV64-NEXT:    li a3, 24
-; RV64-NEXT:    mul a1, a1, a3
+; RV64-NEXT:    slli a1, a1, 4
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
 ; RV64-NEXT:    vs8r.v v0, (a1) # Unknown-size Folded Spill
@@ -814,40 +833,39 @@ define {<8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>} @load_
 ; RV64-NEXT:    vmv.s.x v0, a1
 ; RV64-NEXT:    vmv.s.x v2, a3
 ; RV64-NEXT:    csrr a1, vlenb
-; RV64-NEXT:    li a3, 76
+; RV64-NEXT:    li a3, 48
 ; RV64-NEXT:    mul a1, a1, a3
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
-; RV64-NEXT:    vl8r.v v16, (a1) # Unknown-size Folded Reload
-; RV64-NEXT:    vmerge.vvm v24, v8, v16, v0
+; RV64-NEXT:    vl8r.v v24, (a1) # Unknown-size Folded Reload
+; RV64-NEXT:    vmerge.vvm v8, v16, v24, v0
+; RV64-NEXT:    addi a1, sp, 16
+; RV64-NEXT:    vs8r.v v8, (a1) # Unknown-size Folded Spill
 ; RV64-NEXT:    csrr a1, vlenb
-; RV64-NEXT:    slli a1, a1, 3
+; RV64-NEXT:    slli a1, a1, 5
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
-; RV64-NEXT:    vs8r.v v24, (a1) # Unknown-size Folded Spill
-; RV64-NEXT:    addi a1, sp, 16
 ; RV64-NEXT:    vl1r.v v7, (a1) # Unknown-size Folded Reload
 ; RV64-NEXT:    vmv1r.v v0, v7
 ; RV64-NEXT:    csrr a1, vlenb
-; RV64-NEXT:    li a3, 84
-; RV64-NEXT:    mul a1, a1, a3
+; RV64-NEXT:    slli a1, a1, 6
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
-; RV64-NEXT:    vl4r.v v24, (a1) # Unknown-size Folded Reload
+; RV64-NEXT:    vl4r.v v12, (a1) # Unknown-size Folded Reload
 ; RV64-NEXT:    csrr a1, vlenb
-; RV64-NEXT:    li a3, 48
+; RV64-NEXT:    li a3, 40
 ; RV64-NEXT:    mul a1, a1, a3
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
 ; RV64-NEXT:    vl8r.v v16, (a1) # Unknown-size Folded Reload
+; RV64-NEXT:    vmv4r.v v8, v16
 ; RV64-NEXT:    vsetivli zero, 8, e64, m4, ta, mu
-; RV64-NEXT:    vrgather.vi v24, v16, 4, v0.t
+; RV64-NEXT:    vrgather.vi v12, v16, 4, v0.t
 ; RV64-NEXT:    csrr a1, vlenb
-; RV64-NEXT:    li a3, 84
-; RV64-NEXT:    mul a1, a1, a3
+; RV64-NEXT:    slli a1, a1, 6
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
-; RV64-NEXT:    vs4r.v v24, (a1) # Unknown-size Folded Spill
+; RV64-NEXT:    vs4r.v v12, (a1) # Unknown-size Folded Spill
 ; RV64-NEXT:    vmv1r.v v0, v2
 ; RV64-NEXT:    csrr a1, vlenb
 ; RV64-NEXT:    li a3, 76
@@ -856,53 +874,48 @@ define {<8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>} @load_
 ; RV64-NEXT:    addi a1, a1, 16
 ; RV64-NEXT:    vl8r.v v16, (a1) # Unknown-size Folded Reload
 ; RV64-NEXT:    vsetivli zero, 16, e64, m8, ta, ma
-; RV64-NEXT:    vmerge.vvm v24, v8, v16, v0
+; RV64-NEXT:    vmerge.vvm v16, v16, v24, v0
 ; RV64-NEXT:    csrr a1, vlenb
-; RV64-NEXT:    slli a1, a1, 4
+; RV64-NEXT:    slli a1, a1, 3
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
-; RV64-NEXT:    vs8r.v v24, (a1) # Unknown-size Folded Spill
-; RV64-NEXT:    vmv8r.v v16, v8
+; RV64-NEXT:    vs8r.v v16, (a1) # Unknown-size Folded Spill
 ; RV64-NEXT:    vmv1r.v v0, v7
 ; RV64-NEXT:    csrr a1, vlenb
-; RV64-NEXT:    li a3, 72
-; RV64-NEXT:    mul a1, a1, a3
-; RV64-NEXT:    add a1, sp, a1
-; RV64-NEXT:    addi a1, a1, 16
-; RV64-NEXT:    vl4r.v v12, (a1) # Unknown-size Folded Reload
-; RV64-NEXT:    csrr a1, vlenb
-; RV64-NEXT:    li a3, 48
+; RV64-NEXT:    li a3, 56
 ; RV64-NEXT:    mul a1, a1, a3
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
-; RV64-NEXT:    vl8r.v v24, (a1) # Unknown-size Folded Reload
-; RV64-NEXT:    vmv4r.v v8, v24
+; RV64-NEXT:    vl4r.v v24, (a1) # Unknown-size Folded Reload
 ; RV64-NEXT:    vsetivli zero, 8, e64, m4, ta, mu
-; RV64-NEXT:    vrgather.vi v12, v24, 5, v0.t
+; RV64-NEXT:    vrgather.vi v24, v8, 5, v0.t
 ; RV64-NEXT:    csrr a1, vlenb
-; RV64-NEXT:    li a3, 72
+; RV64-NEXT:    li a3, 56
 ; RV64-NEXT:    mul a1, a1, a3
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
-; RV64-NEXT:    vs4r.v v12, (a1) # Unknown-size Folded Spill
+; RV64-NEXT:    vs4r.v v24, (a1) # Unknown-size Folded Spill
 ; RV64-NEXT:    lui a1, 96
 ; RV64-NEXT:    li a3, 192
 ; RV64-NEXT:    vmv.s.x v3, a3
 ; RV64-NEXT:    vsetivli zero, 4, e32, m1, ta, ma
-; RV64-NEXT:    vmv.v.x v12, a1
+; RV64-NEXT:    vmv.v.x v24, a1
 ; RV64-NEXT:    vmv1r.v v0, v3
 ; RV64-NEXT:    csrr a1, vlenb
-; RV64-NEXT:    slli a1, a1, 6
+; RV64-NEXT:    li a3, 72
+; RV64-NEXT:    mul a1, a1, a3
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
-; RV64-NEXT:    vl4r.v v24, (a1) # Unknown-size Folded Reload
+; RV64-NEXT:    vl4r.v v28, (a1) # Unknown-size Folded Reload
 ; RV64-NEXT:    vsetivli zero, 8, e64, m4, ta, mu
-; RV64-NEXT:    vrgatherei16.vv v24, v8, v12, v0.t
+; RV64-NEXT:    vrgatherei16.vv v28, v8, v24, v0.t
+; RV64-NEXT:    vmv4r.v v16, v8
 ; RV64-NEXT:    csrr a1, vlenb
-; RV64-NEXT:    slli a1, a1, 6
+; RV64-NEXT:    li a3, 72
+; RV64-NEXT:    mul a1, a1, a3
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
-; RV64-NEXT:    vs4r.v v24, (a1) # Unknown-size Folded Spill
+; RV64-NEXT:    vs4r.v v28, (a1) # Unknown-size Folded Spill
 ; RV64-NEXT:    lui a1, %hi(.LCPI8_2)
 ; RV64-NEXT:    addi a1, a1, %lo(.LCPI8_2)
 ; RV64-NEXT:    li a3, 1040
@@ -910,50 +923,67 @@ define {<8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>} @load_
 ; RV64-NEXT:    addi a4, a4, 1
 ; RV64-NEXT:    vmv.s.x v0, a3
 ; RV64-NEXT:    vsetivli zero, 4, e32, m1, ta, ma
-; RV64-NEXT:    vmv.v.x v12, a4
+; RV64-NEXT:    vmv.v.x v5, a4
 ; RV64-NEXT:    vsetivli zero, 16, e64, m8, ta, ma
 ; RV64-NEXT:    vle16.v v6, (a1)
-; RV64-NEXT:    vmv8r.v v24, v16
 ; RV64-NEXT:    csrr a1, vlenb
 ; RV64-NEXT:    li a3, 76
 ; RV64-NEXT:    mul a1, a1, a3
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
-; RV64-NEXT:    vl8r.v v16, (a1) # Unknown-size Folded Reload
-; RV64-NEXT:    vmerge.vvm v16, v24, v16, v0
-; RV64-NEXT:    addi a1, sp, 16
-; RV64-NEXT:    vs8r.v v16, (a1) # Unknown-size Folded Spill
+; RV64-NEXT:    vl8r.v v8, (a1) # Unknown-size Folded Reload
+; RV64-NEXT:    csrr a1, vlenb
+; RV64-NEXT:    li a3, 48
+; RV64-NEXT:    mul a1, a1, a3
+; RV64-NEXT:    add a1, sp, a1
+; RV64-NEXT:    addi a1, a1, 16
+; RV64-NEXT:    vl8r.v v24, (a1) # Unknown-size Folded Reload
+; RV64-NEXT:    vmerge.vvm v24, v8, v24, v0
+; RV64-NEXT:    csrr a1, vlenb
+; RV64-NEXT:    slli a1, a1, 5
+; RV64-NEXT:    add a1, sp, a1
+; RV64-NEXT:    addi a1, a1, 16
+; RV64-NEXT:    vs8r.v v24, (a1) # Unknown-size Folded Spill
 ; RV64-NEXT:    vmv1r.v v0, v3
 ; RV64-NEXT:    csrr a1, vlenb
-; RV64-NEXT:    li a3, 56
+; RV64-NEXT:    li a3, 68
 ; RV64-NEXT:    mul a1, a1, a3
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
-; RV64-NEXT:    vl4r.v v16, (a1) # Unknown-size Folded Reload
+; RV64-NEXT:    vl4r.v v28, (a1) # Unknown-size Folded Reload
 ; RV64-NEXT:    vsetivli zero, 8, e64, m4, ta, mu
-; RV64-NEXT:    vrgatherei16.vv v16, v8, v12, v0.t
+; RV64-NEXT:    vrgatherei16.vv v28, v16, v5, v0.t
 ; RV64-NEXT:    csrr a1, vlenb
-; RV64-NEXT:    li a3, 56
+; RV64-NEXT:    li a3, 68
 ; RV64-NEXT:    mul a1, a1, a3
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
-; RV64-NEXT:    vs4r.v v16, (a1) # Unknown-size Folded Spill
+; RV64-NEXT:    vs4r.v v28, (a1) # Unknown-size Folded Spill
 ; RV64-NEXT:    addi a1, a2, -2016
 ; RV64-NEXT:    vmv.s.x v0, a1
+; RV64-NEXT:    addi a1, sp, 16
+; RV64-NEXT:    vl8r.v v24, (a1) # Unknown-size Folded Reload
+; RV64-NEXT:    vsetivli zero, 16, e64, m8, ta, ma
+; RV64-NEXT:    vrgatherei16.vv v16, v24, v6
 ; RV64-NEXT:    csrr a1, vlenb
-; RV64-NEXT:    slli a1, a1, 3
+; RV64-NEXT:    li a2, 40
+; RV64-NEXT:    mul a1, a1, a2
+; RV64-NEXT:    add a1, sp, a1
+; RV64-NEXT:    addi a1, a1, 16
+; RV64-NEXT:    vs8r.v v16, (a1) # Unknown-size Folded Spill
+; RV64-NEXT:    csrr a1, vlenb
+; RV64-NEXT:    li a2, 48
+; RV64-NEXT:    mul a1, a1, a2
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
 ; RV64-NEXT:    vl8r.v v8, (a1) # Unknown-size Folded Reload
-; RV64-NEXT:    vsetivli zero, 16, e64, m8, ta, ma
-; RV64-NEXT:    vrgatherei16.vv v16, v8, v6
 ; RV64-NEXT:    csrr a1, vlenb
 ; RV64-NEXT:    li a2, 76
 ; RV64-NEXT:    mul a1, a1, a2
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
-; RV64-NEXT:    vl8r.v v8, (a1) # Unknown-size Folded Reload
-; RV64-NEXT:    vmerge.vvm v8, v24, v8, v0
+; RV64-NEXT:    vl8r.v v16, (a1) # Unknown-size Folded Reload
+; RV64-NEXT:    vmerge.vvm v8, v16, v8, v0
 ; RV64-NEXT:    csrr a1, vlenb
 ; RV64-NEXT:    li a2, 76
 ; RV64-NEXT:    mul a1, a1, a2
@@ -962,80 +992,94 @@ define {<8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>} @load_
 ; RV64-NEXT:    vs8r.v v8, (a1) # Unknown-size Folded Spill
 ; RV64-NEXT:    lui a1, %hi(.LCPI8_3)
 ; RV64-NEXT:    addi a1, a1, %lo(.LCPI8_3)
-; RV64-NEXT:    vle16.v v24, (a1)
+; RV64-NEXT:    vle16.v v8, (a1)
 ; RV64-NEXT:    csrr a1, vlenb
-; RV64-NEXT:    slli a1, a1, 5
+; RV64-NEXT:    li a2, 60
+; RV64-NEXT:    mul a1, a1, a2
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
-; RV64-NEXT:    vl8r.v v0, (a1) # Unknown-size Folded Reload
+; RV64-NEXT:    vl4r.v v12, (a1) # Unknown-size Folded Reload
 ; RV64-NEXT:    csrr a1, vlenb
-; RV64-NEXT:    li a2, 60
+; RV64-NEXT:    li a2, 24
 ; RV64-NEXT:    mul a1, a1, a2
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
-; RV64-NEXT:    vl4r.v v8, (a1) # Unknown-size Folded Reload
+; RV64-NEXT:    vl8r.v v0, (a1) # Unknown-size Folded Reload
 ; RV64-NEXT:    vsetivli zero, 6, e64, m4, tu, ma
-; RV64-NEXT:    vmv.v.v v8, v0
+; RV64-NEXT:    vmv.v.v v12, v0
 ; RV64-NEXT:    csrr a1, vlenb
 ; RV64-NEXT:    li a2, 60
 ; RV64-NEXT:    mul a1, a1, a2
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
-; RV64-NEXT:    vs4r.v v8, (a1) # Unknown-size Folded Spill
+; RV64-NEXT:    vs4r.v v12, (a1) # Unknown-size Folded Spill
 ; RV64-NEXT:    csrr a1, vlenb
-; RV64-NEXT:    li a2, 68
+; RV64-NEXT:    li a2, 84
 ; RV64-NEXT:    mul a1, a1, a2
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
-; RV64-NEXT:    vl4r.v v0, (a1) # Unknown-size Folded Reload
+; RV64-NEXT:    vl4r.v v12, (a1) # Unknown-size Folded Reload
 ; RV64-NEXT:    csrr a1, vlenb
-; RV64-NEXT:    li a2, 24
-; RV64-NEXT:    mul a1, a1, a2
+; RV64-NEXT:    slli a1, a1, 4
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
-; RV64-NEXT:    vl8r.v v8, (a1) # Unknown-size Folded Reload
-; RV64-NEXT:    vmv.v.v v0, v8
+; RV64-NEXT:    vl8r.v v24, (a1) # Unknown-size Folded Reload
+; RV64-NEXT:    vmv.v.v v12, v24
 ; RV64-NEXT:    csrr a1, vlenb
 ; RV64-NEXT:    li a2, 84
 ; RV64-NEXT:    mul a1, a1, a2
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
-; RV64-NEXT:    vl4r.v v8, (a1) # Unknown-size Folded Reload
-; RV64-NEXT:    vsetivli zero, 5, e64, m4, tu, ma
-; RV64-NEXT:    vmv.v.v v8, v16
+; RV64-NEXT:    vs4r.v v12, (a1) # Unknown-size Folded Spill
 ; RV64-NEXT:    csrr a1, vlenb
-; RV64-NEXT:    li a2, 84
+; RV64-NEXT:    slli a1, a1, 6
+; RV64-NEXT:    add a1, sp, a1
+; RV64-NEXT:    addi a1, a1, 16
+; RV64-NEXT:    vl4r.v v16, (a1) # Unknown-size Folded Reload
+; RV64-NEXT:    csrr a1, vlenb
+; RV64-NEXT:    li a2, 40
 ; RV64-NEXT:    mul a1, a1, a2
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
-; RV64-NEXT:    vs4r.v v8, (a1) # Unknown-size Folded Spill
+; RV64-NEXT:    vl8r.v v24, (a1) # Unknown-size Folded Reload
+; RV64-NEXT:    vsetivli zero, 5, e64, m4, tu, ma
+; RV64-NEXT:    vmv.v.v v16, v24
 ; RV64-NEXT:    csrr a1, vlenb
-; RV64-NEXT:    slli a1, a1, 4
+; RV64-NEXT:    slli a1, a1, 3
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
-; RV64-NEXT:    vl8r.v v8, (a1) # Unknown-size Folded Reload
+; RV64-NEXT:    vl8r.v v24, (a1) # Unknown-size Folded Reload
 ; RV64-NEXT:    vsetivli zero, 16, e64, m8, ta, ma
-; RV64-NEXT:    vrgatherei16.vv v16, v8, v24
+; RV64-NEXT:    vrgatherei16.vv v0, v24, v8
 ; RV64-NEXT:    lui a1, %hi(.LCPI8_4)
 ; RV64-NEXT:    addi a1, a1, %lo(.LCPI8_4)
 ; RV64-NEXT:    vle16.v v8, (a1)
 ; RV64-NEXT:    lui a1, %hi(.LCPI8_5)
 ; RV64-NEXT:    addi a1, a1, %lo(.LCPI8_5)
-; RV64-NEXT:    vle16.v v6, (a1)
+; RV64-NEXT:    vle16.v v10, (a1)
 ; RV64-NEXT:    csrr a1, vlenb
-; RV64-NEXT:    li a2, 72
+; RV64-NEXT:    slli a1, a1, 6
+; RV64-NEXT:    add a1, sp, a1
+; RV64-NEXT:    addi a1, a1, 16
+; RV64-NEXT:    vs2r.v v10, (a1) # Unknown-size Folded Spill
+; RV64-NEXT:    csrr a1, vlenb
+; RV64-NEXT:    li a2, 56
 ; RV64-NEXT:    mul a1, a1, a2
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
 ; RV64-NEXT:    vl4r.v v12, (a1) # Unknown-size Folded Reload
 ; RV64-NEXT:    vsetivli zero, 5, e64, m4, tu, ma
-; RV64-NEXT:    vmv.v.v v12, v16
-; RV64-NEXT:    addi a1, sp, 16
-; RV64-NEXT:    vl8r.v v16, (a1) # Unknown-size Folded Reload
+; RV64-NEXT:    vmv.v.v v12, v0
+; RV64-NEXT:    csrr a1, vlenb
+; RV64-NEXT:    slli a1, a1, 5
+; RV64-NEXT:    add a1, sp, a1
+; RV64-NEXT:    addi a1, a1, 16
+; RV64-NEXT:    vl8r.v v0, (a1) # Unknown-size Folded Reload
 ; RV64-NEXT:    vsetivli zero, 16, e64, m8, ta, ma
-; RV64-NEXT:    vrgatherei16.vv v24, v16, v8
+; RV64-NEXT:    vrgatherei16.vv v24, v0, v8
 ; RV64-NEXT:    csrr a1, vlenb
-; RV64-NEXT:    slli a1, a1, 6
+; RV64-NEXT:    li a2, 72
+; RV64-NEXT:    mul a1, a1, a2
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
 ; RV64-NEXT:    vl4r.v v8, (a1) # Unknown-size Folded Reload
@@ -1046,25 +1090,32 @@ define {<8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>} @load_
 ; RV64-NEXT:    mul a1, a1, a2
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
-; RV64-NEXT:    vl8r.v v16, (a1) # Unknown-size Folded Reload
+; RV64-NEXT:    vl8r.v v0, (a1) # Unknown-size Folded Reload
+; RV64-NEXT:    csrr a1, vlenb
+; RV64-NEXT:    slli a1, a1, 6
+; RV64-NEXT:    add a1, sp, a1
+; RV64-NEXT:    addi a1, a1, 16
+; RV64-NEXT:    vl2r.v v20, (a1) # Unknown-size Folded Reload
 ; RV64-NEXT:    vsetivli zero, 16, e64, m8, ta, ma
-; RV64-NEXT:    vrgatherei16.vv v24, v16, v6
+; RV64-NEXT:    vrgatherei16.vv v24, v0, v20
 ; RV64-NEXT:    csrr a1, vlenb
-; RV64-NEXT:    li a2, 56
+; RV64-NEXT:    li a2, 68
 ; RV64-NEXT:    mul a1, a1, a2
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
-; RV64-NEXT:    vl4r.v v16, (a1) # Unknown-size Folded Reload
+; RV64-NEXT:    vl4r.v v28, (a1) # Unknown-size Folded Reload
 ; RV64-NEXT:    vsetivli zero, 5, e64, m4, tu, ma
-; RV64-NEXT:    vmv.v.v v16, v24
+; RV64-NEXT:    vmv.v.v v28, v24
 ; RV64-NEXT:    addi a1, a0, 256
 ; RV64-NEXT:    vsetivli zero, 8, e64, m4, ta, ma
 ; RV64-NEXT:    vse64.v v8, (a1)
 ; RV64-NEXT:    addi a1, a0, 320
-; RV64-NEXT:    vse64.v v16, (a1)
+; RV64-NEXT:    vse64.v v28, (a1)
 ; RV64-NEXT:    addi a1, a0, 192
 ; RV64-NEXT:    vse64.v v12, (a1)
 ; RV64-NEXT:    addi a1, a0, 128
+; RV64-NEXT:    vse64.v v16, (a1)
+; RV64-NEXT:    addi a1, a0, 64
 ; RV64-NEXT:    csrr a2, vlenb
 ; RV64-NEXT:    li a3, 84
 ; RV64-NEXT:    mul a2, a2, a3
@@ -1072,8 +1123,6 @@ define {<8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>} @load_
 ; RV64-NEXT:    addi a2, a2, 16
 ; RV64-NEXT:    vl4r.v v8, (a2) # Unknown-size Folded Reload
 ; RV64-NEXT:    vse64.v v8, (a1)
-; RV64-NEXT:    addi a1, a0, 64
-; RV64-NEXT:    vse64.v v0, (a1)
 ; RV64-NEXT:    csrr a1, vlenb
 ; RV64-NEXT:    li a2, 60
 ; RV64-NEXT:    mul a1, a1, a2
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-scatter.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-scatter.ll
index 7ec47269257048..5db05c438d925b 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-scatter.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-scatter.ll
@@ -4029,7 +4029,7 @@ define void @mscatter_baseidx_zext_v8i8_v8i64(<8 x i64> %val, ptr %base, <8 x i8
 ;
 ; RV64ZVE32F-LABEL: mscatter_baseidx_zext_v8i8_v8i64:
 ; RV64ZVE32F:       # %bb.0:
-; RV64ZVE32F-NEXT:    ld a4, 40(a0)
+; RV64ZVE32F-NEXT:    ld a5, 40(a0)
 ; RV64ZVE32F-NEXT:    ld a3, 48(a0)
 ; RV64ZVE32F-NEXT:    ld a2, 56(a0)
 ; RV64ZVE32F-NEXT:    ld t1, 8(a0)
@@ -4037,8 +4037,8 @@ define void @mscatter_baseidx_zext_v8i8_v8i64(<8 x i64> %val, ptr %base, <8 x i8
 ; RV64ZVE32F-NEXT:    ld a7, 24(a0)
 ; RV64ZVE32F-NEXT:    ld a6, 32(a0)
 ; RV64ZVE32F-NEXT:    vsetivli zero, 1, e8, m1, ta, ma
-; RV64ZVE32F-NEXT:    vmv.x.s a5, v0
-; RV64ZVE32F-NEXT:    andi t2, a5, 1
+; RV64ZVE32F-NEXT:    vmv.x.s a4, v0
+; RV64ZVE32F-NEXT:    andi t2, a4, 1
 ; RV64ZVE32F-NEXT:    beqz t2, .LBB44_2
 ; RV64ZVE32F-NEXT:  # %bb.1: # %cond.store
 ; RV64ZVE32F-NEXT:    ld a0, 0(a0)
@@ -4048,7 +4048,7 @@ define void @mscatter_baseidx_zext_v8i8_v8i64(<8 x i64> %val, ptr %base, <8 x i8
 ; RV64ZVE32F-NEXT:    add t2, a1, t2
 ; RV64ZVE32F-NEXT:    sd a0, 0(t2)
 ; RV64ZVE32F-NEXT:  .LBB44_2: # %else
-; RV64ZVE32F-NEXT:    andi a0, a5, 2
+; RV64ZVE32F-NEXT:    andi a0, a4, 2
 ; RV64ZVE32F-NEXT:    beqz a0, .LBB44_4
 ; RV64ZVE32F-NEXT:  # %bb.3: # %cond.store1
 ; RV64ZVE32F-NEXT:    vsetivli zero, 1, e8, mf4, ta, ma
@@ -4061,18 +4061,18 @@ define void @mscatter_baseidx_zext_v8i8_v8i64(<8 x i64> %val, ptr %base, <8 x i8
 ; RV64ZVE32F-NEXT:  .LBB44_4: # %else2
 ; RV64ZVE32F-NEXT:    vsetivli zero, 4, e8, mf2, ta, ma
 ; RV64ZVE32F-NEXT:    vslidedown.vi v9, v8, 4
-; RV64ZVE32F-NEXT:    andi a0, a5, 4
+; RV64ZVE32F-NEXT:    andi a0, a4, 4
 ; RV64ZVE32F-NEXT:    vsetivli zero, 2, e8, mf4, ta, ma
 ; RV64ZVE32F-NEXT:    vslidedown.vi v8, v8, 2
 ; RV64ZVE32F-NEXT:    bnez a0, .LBB44_12
 ; RV64ZVE32F-NEXT:  # %bb.5: # %else4
-; RV64ZVE32F-NEXT:    andi a0, a5, 8
+; RV64ZVE32F-NEXT:    andi a0, a4, 8
 ; RV64ZVE32F-NEXT:    bnez a0, .LBB44_13
 ; RV64ZVE32F-NEXT:  .LBB44_6: # %else6
-; RV64ZVE32F-NEXT:    andi a0, a5, 16
+; RV64ZVE32F-NEXT:    andi a0, a4, 16
 ; RV64ZVE32F-NEXT:    bnez a0, .LBB44_14
 ; RV64ZVE32F-NEXT:  .LBB44_7: # %else8
-; RV64ZVE32F-NEXT:    andi a0, a5, 32
+; RV64ZVE32F-NEXT:    andi a0, a4, 32
 ; RV64ZVE32F-NEXT:    beqz a0, .LBB44_9
 ; RV64ZVE32F-NEXT:  .LBB44_8: # %cond.store9
 ; RV64ZVE32F-NEXT:    vslidedown.vi v8, v9, 1
@@ -4080,13 +4080,13 @@ define void @mscatter_baseidx_zext_v8i8_v8i64(<8 x i64> %val, ptr %base, <8 x i8
 ; RV64ZVE32F-NEXT:    andi a0, a0, 255
 ; RV64ZVE32F-NEXT:    slli a0, a0, 3
 ; RV64ZVE32F-NEXT:    add a0, a1, a0
-; RV64ZVE32F-NEXT:    sd a4, 0(a0)
+; RV64ZVE32F-NEXT:    sd a5, 0(a0)
 ; RV64ZVE32F-NEXT:  .LBB44_9: # %else10
-; RV64ZVE32F-NEXT:    andi a0, a5, 64
+; RV64ZVE32F-NEXT:    andi a0, a4, 64
 ; RV64ZVE32F-NEXT:    vslidedown.vi v8, v9, 2
 ; RV64ZVE32F-NEXT:    bnez a0, .LBB44_15
 ; RV64ZVE32F-NEXT:  # %bb.10: # %else12
-; RV64ZVE32F-NEXT:    andi a0, a5, -128
+; RV64ZVE32F-NEXT:    andi a0, a4, -128
 ; RV64ZVE32F-NEXT:    bnez a0, .LBB44_16
 ; RV64ZVE32F-NEXT:  .LBB44_11: # %else14
 ; RV64ZVE32F-NEXT:    ret
@@ -4096,7 +4096,7 @@ define void @mscatter_baseidx_zext_v8i8_v8i64(<8 x i64> %val, ptr %base, <8 x i8
 ; RV64ZVE32F-NEXT:    slli a0, a0, 3
 ; RV64ZVE32F-NEXT:    add a0, a1, a0
 ; RV64ZVE32F-NEXT:    sd t0, 0(a0)
-; RV64ZVE32F-NEXT:    andi a0, a5, 8
+; RV64ZVE32F-NEXT:    andi a0, a4, 8
 ; RV64ZVE32F-NEXT:    beqz a0, .LBB44_6
 ; RV64ZVE32F-NEXT:  .LBB44_13: # %cond.store5
 ; RV64ZVE32F-NEXT:    vslidedown.vi v8, v8, 1
@@ -4105,7 +4105,7 @@ define void @mscatter_baseidx_zext_v8i8_v8i64(<8 x i64> %val, ptr %base, <8 x i8
 ; RV64ZVE32F-NEXT:    slli a0, a0, 3
 ; RV64ZVE32F-NEXT:    add a0, a1, a0
 ; RV64ZVE32F-NEXT:    sd a7, 0(a0)
-; RV64ZVE32F-NEXT:    andi a0, a5, 16
+; RV64ZVE32F-NEXT:    andi a0, a4, 16
 ; RV64ZVE32F-NEXT:    beqz a0, .LBB44_7
 ; RV64ZVE32F-NEXT:  .LBB44_14: # %cond.store7
 ; RV64ZVE32F-NEXT:    vmv.x.s a0, v9
@@ -4113,7 +4113,7 @@ define void @mscatter_baseidx_zext_v8i8_v8i64(<8 x i64> %val, ptr %base, <8 x i8
 ; RV64ZVE32F-NEXT:    slli a0, a0, 3
 ; RV64ZVE32F-NEXT:    add a0, a1, a0
 ; RV64ZVE32F-NEXT:    sd a6, 0(a0)
-; RV64ZVE32F-NEXT:    andi a0, a5, 32
+; RV64ZVE32F-NEXT:    andi a0, a4, 32
 ; RV64ZVE32F-NEXT:    bnez a0, .LBB44_8
 ; RV64ZVE32F-NEXT:    j .LBB44_9
 ; RV64ZVE32F-NEXT:  .LBB44_15: # %cond.store11
@@ -4122,7 +4122,7 @@ define void @mscatter_baseidx_zext_v8i8_v8i64(<8 x i64> %val, ptr %base, <8 x i8
 ; RV64ZVE32F-NEXT:    slli a0, a0, 3
 ; RV64ZVE32F-NEXT:    add a0, a1, a0
 ; RV64ZVE32F-NEXT:    sd a3, 0(a0)
-; RV64ZVE32F-NEXT:    andi a0, a5, -128
+; RV64ZVE32F-NEXT:    andi a0, a4, -128
 ; RV64ZVE32F-NEXT:    beqz a0, .LBB44_11
 ; RV64ZVE32F-NEXT:  .LBB44_16: # %cond.store13
 ; RV64ZVE32F-NEXT:    vslidedown.vi v8, v8, 1
@@ -4792,7 +4792,7 @@ define void @mscatter_baseidx_zext_v8i16_v8i64(<8 x i64> %val, ptr %base, <8 x i
 ;
 ; RV64ZVE32F-LABEL: mscatter_baseidx_zext_v8i16_v8i64:
 ; RV64ZVE32F:       # %bb.0:
-; RV64ZVE32F-NEXT:    ld a4, 40(a0)
+; RV64ZVE32F-NEXT:    ld a5, 40(a0)
 ; RV64ZVE32F-NEXT:    ld a3, 48(a0)
 ; RV64ZVE32F-NEXT:    ld a2, 56(a0)
 ; RV64ZVE32F-NEXT:    ld t1, 8(a0)
@@ -4800,8 +4800,8 @@ define void @mscatter_baseidx_zext_v8i16_v8i64(<8 x i64> %val, ptr %base, <8 x i
 ; RV64ZVE32F-NEXT:    ld a7, 24(a0)
 ; RV64ZVE32F-NEXT:    ld a6, 32(a0)
 ; RV64ZVE32F-NEXT:    vsetivli zero, 1, e8, m1, ta, ma
-; RV64ZVE32F-NEXT:    vmv.x.s a5, v0
-; RV64ZVE32F-NEXT:    andi t2, a5, 1
+; RV64ZVE32F-NEXT:    vmv.x.s a4, v0
+; RV64ZVE32F-NEXT:    andi t2, a4, 1
 ; RV64ZVE32F-NEXT:    beqz t2, .LBB47_2
 ; RV64ZVE32F-NEXT:  # %bb.1: # %cond.store
 ; RV64ZVE32F-NEXT:    ld a0, 0(a0)
@@ -4812,7 +4812,7 @@ define void @mscatter_baseidx_zext_v8i16_v8i64(<8 x i64> %val, ptr %base, <8 x i
 ; RV64ZVE32F-NEXT:    add t2, a1, t2
 ; RV64ZVE32F-NEXT:    sd a0, 0(t2)
 ; RV64ZVE32F-NEXT:  .LBB47_2: # %else
-; RV64ZVE32F-NEXT:    andi a0, a5, 2
+; RV64ZVE32F-NEXT:    andi a0, a4, 2
 ; RV64ZVE32F-NEXT:    beqz a0, .LBB47_4
 ; RV64ZVE32F-NEXT:  # %bb.3: # %cond.store1
 ; RV64ZVE32F-NEXT:    vsetivli zero, 1, e16, mf2, ta, ma
@@ -4825,18 +4825,18 @@ define void @mscatter_baseidx_zext_v8i16_v8i64(<8 x i64> %val, ptr %base, <8 x i
 ; RV64ZVE32F-NEXT:  .LBB47_4: # %else2
 ; RV64ZVE32F-NEXT:    vsetivli zero, 4, e16, m1, ta, ma
 ; RV64ZVE32F-NEXT:    vslidedown.vi v9, v8, 4
-; RV64ZVE32F-NEXT:    andi a0, a5, 4
+; RV64ZVE32F-NEXT:    andi a0, a4, 4
 ; RV64ZVE32F-NEXT:    vsetivli zero, 2, e16, mf2, ta, ma
 ; RV64ZVE32F-NEXT:    vslidedown.vi v8, v8, 2
 ; RV64ZVE32F-NEXT:    bnez a0, .LBB47_12
 ; RV64ZVE32F-NEXT:  # %bb.5: # %else4
-; RV64ZVE32F-NEXT:    andi a0, a5, 8
+; RV64ZVE32F-NEXT:    andi a0, a4, 8
 ; RV64ZVE32F-NEXT:    bnez a0, .LBB47_13
 ; RV64ZVE32F-NEXT:  .LBB47_6: # %else6
-; RV64ZVE32F-NEXT:    andi a0, a5, 16
+; RV64ZVE32F-NEXT:    andi a0, a4, 16
 ; RV64ZVE32F-NEXT:    bnez a0, .LBB47_14
 ; RV64ZVE32F-NEXT:  .LBB47_7: # %else8
-; RV64ZVE32F-NEXT:    andi a0, a5, 32
+; RV64ZVE32F-NEXT:    andi a0, a4, 32
 ; RV64ZVE32F-NEXT:    beqz a0, .LBB47_9
 ; RV64ZVE32F-NEXT:  .LBB47_8: # %cond.store9
 ; RV64ZVE32F-NEXT:    vslidedown.vi v8, v9, 1
@@ -4844,13 +4844,13 @@ define void @mscatter_baseidx_zext_v8i16_v8i64(<8 x i64> %val, ptr %base, <8 x i
 ; RV64ZVE32F-NEXT:    slli a0, a0, 48
 ; RV64ZVE32F-NEXT:    srli a0, a0, 45
 ; RV64ZVE32F-NEXT:    add a0, a1, a0
-; RV64ZVE32F-NEXT:    sd a4, 0(a0)
+; RV64ZVE32F-NEXT:    sd a5, 0(a0)
 ; RV64ZVE32F-NEXT:  .LBB47_9: # %else10
-; RV64ZVE32F-NEXT:    andi a0, a5, 64
+; RV64ZVE32F-NEXT:    andi a0, a4, 64
 ; RV64ZVE32F-NEXT:    vslidedown.vi v8, v9, 2
 ; RV64ZVE32F-NEXT:    bnez a0, .LBB47_15
 ; RV64ZVE32F-NEXT:  # %bb.10: # %else12
-; RV64ZVE32F-NEXT:    andi a0, a5, -128
+; RV64ZVE32F-NEXT:    andi a0, a4, -128
 ; RV64ZVE32F-NEXT:    bnez a0, .LBB47_16
 ; RV64ZVE32F-NEXT:  .LBB47_11: # %else14
 ; RV64ZVE32F-NEXT:    ret
@@ -4860,7 +4860,7 @@ define void @mscatter_baseidx_zext_v8i16_v8i64(<8 x i64> %val, ptr %base, <8 x i
 ; RV64ZVE32F-NEXT:    srli a0, a0, 45
 ; RV64ZVE32F-NEXT:    add a0, a1, a0
 ; RV64ZVE32F-NEXT:    sd t0, 0(a0)
-; RV64ZVE32F-NEXT:    andi a0, a5, 8
+; RV64ZVE32F-NEXT:    andi a0, a4, 8
 ; RV64ZVE32F-NEXT:    beqz a0, .LBB47_6
 ; RV64ZVE32F-NEXT:  .LBB47_13: # %cond.store5
 ; RV64ZVE32F-NEXT:    vslidedown.vi v8, v8, 1
@@ -4869,7 +4869,7 @@ define void @mscatter_baseidx_zext_v8i16_v8i64(<8 x i64> %val, ptr %base, <8 x i
 ; RV64ZVE32F-NEXT:    srli a0, a0, 45
 ; RV64ZVE32F-NEXT:    add a0, a1, a0
 ; RV64ZVE32F-NEXT:    sd a7, 0(a0)
-; RV64ZVE32F-NEXT:    andi a0, a5, 16
+; RV64ZVE32F-NEXT:    andi a0, a4, 16
 ; RV64ZVE32F-NEXT:    beqz a0, .LBB47_7
 ; RV64ZVE32F-NEXT:  .LBB47_14: # %cond.store7
 ; RV64ZVE32F-NEXT:    vmv.x.s a0, v9
@@ -4877,7 +4877,7 @@ define void @mscatter_baseidx_zext_v8i16_v8i64(<8 x i64> %val, ptr %base, <8 x i
 ; RV64ZVE32F-NEXT:    srli a0, a0, 45
 ; RV64ZVE32F-NEXT:    add a0, a1, a0
 ; RV64ZVE32F-NEXT:    sd a6, 0(a0)
-; RV64ZVE32F-NEXT:    andi a0, a5, 32
+; RV64ZVE32F-NEXT:    andi a0, a4, 32
 ; RV64ZVE32F-NEXT:    bnez a0, .LBB47_8
 ; RV64ZVE32F-NEXT:    j .LBB47_9
 ; RV64ZVE32F-NEXT:  .LBB47_15: # %cond.store11
@@ -4886,7 +4886,7 @@ define void @mscatter_baseidx_zext_v8i16_v8i64(<8 x i64> %val, ptr %base, <8 x i
 ; RV64ZVE32F-NEXT:    srli a0, a0, 45
 ; RV64ZVE32F-NEXT:    add a0, a1, a0
 ; RV64ZVE32F-NEXT:    sd a3, 0(a0)
-; RV64ZVE32F-NEXT:    andi a0, a5, -128
+; RV64ZVE32F-NEXT:    andi a0, a4, -128
 ; RV64ZVE32F-NEXT:    beqz a0, .LBB47_11
 ; RV64ZVE32F-NEXT:  .LBB47_16: # %cond.store13
 ; RV64ZVE32F-NEXT:    vslidedown.vi v8, v8, 1
diff --git a/llvm/test/CodeGen/RISCV/rvv/vector-interleave.ll b/llvm/test/CodeGen/RISCV/rvv/vector-interleave.ll
index 864acb320d8fe1..53929d17bb5a1d 100644
--- a/llvm/test/CodeGen/RISCV/rvv/vector-interleave.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/vector-interleave.ll
@@ -216,11 +216,11 @@ define <vscale x 128 x i8> @vector_interleave_nxv128i8_nxv64i8(<vscale x 64 x i8
 ; ZVBB-LABEL: vector_interleave_nxv128i8_nxv64i8:
 ; ZVBB:       # %bb.0:
 ; ZVBB-NEXT:    vsetvli a0, zero, e8, m4, ta, ma
-; ZVBB-NEXT:    vmv8r.v v24, v8
-; ZVBB-NEXT:    vwsll.vi v8, v16, 8
+; ZVBB-NEXT:    vwsll.vi v24, v16, 8
 ; ZVBB-NEXT:    vwsll.vi v0, v20, 8
-; ZVBB-NEXT:    vwaddu.wv v8, v8, v24
-; ZVBB-NEXT:    vwaddu.wv v0, v0, v28
+; ZVBB-NEXT:    vwaddu.wv v24, v24, v8
+; ZVBB-NEXT:    vwaddu.wv v0, v0, v12
+; ZVBB-NEXT:    vmv8r.v v8, v24
 ; ZVBB-NEXT:    vmv8r.v v16, v0
 ; ZVBB-NEXT:    ret
   %res = call <vscale x 128 x i8> @llvm.vector.interleave2.nxv128i8(<vscale x 64 x i8> %a, <vscale x 64 x i8> %b)
@@ -243,11 +243,11 @@ define <vscale x 64 x i16> @vector_interleave_nxv64i16_nxv32i16(<vscale x 32 x i
 ; ZVBB-LABEL: vector_interleave_nxv64i16_nxv32i16:
 ; ZVBB:       # %bb.0:
 ; ZVBB-NEXT:    vsetvli a0, zero, e16, m4, ta, ma
-; ZVBB-NEXT:    vmv8r.v v24, v8
-; ZVBB-NEXT:    vwsll.vi v8, v16, 16
+; ZVBB-NEXT:    vwsll.vi v24, v16, 16
 ; ZVBB-NEXT:    vwsll.vi v0, v20, 16
-; ZVBB-NEXT:    vwaddu.wv v8, v8, v24
-; ZVBB-NEXT:    vwaddu.wv v0, v0, v28
+; ZVBB-NEXT:    vwaddu.wv v24, v24, v8
+; ZVBB-NEXT:    vwaddu.wv v0, v0, v12
+; ZVBB-NEXT:    vmv8r.v v8, v24
 ; ZVBB-NEXT:    vmv8r.v v16, v0
 ; ZVBB-NEXT:    ret
   %res = call <vscale x 64 x i16> @llvm.vector.interleave2.nxv64i16(<vscale x 32 x i16> %a, <vscale x 32 x i16> %b)
@@ -269,13 +269,13 @@ define <vscale x 32 x i32> @vector_interleave_nxv32i32_nxv16i32(<vscale x 16 x i
 ;
 ; ZVBB-LABEL: vector_interleave_nxv32i32_nxv16i32:
 ; ZVBB:       # %bb.0:
-; ZVBB-NEXT:    vsetvli a0, zero, e32, m4, ta, ma
-; ZVBB-NEXT:    vmv8r.v v24, v8
 ; ZVBB-NEXT:    li a0, 32
-; ZVBB-NEXT:    vwsll.vx v8, v16, a0
+; ZVBB-NEXT:    vsetvli a1, zero, e32, m4, ta, ma
+; ZVBB-NEXT:    vwsll.vx v24, v16, a0
 ; ZVBB-NEXT:    vwsll.vx v0, v20, a0
-; ZVBB-NEXT:    vwaddu.wv v8, v8, v24
-; ZVBB-NEXT:    vwaddu.wv v0, v0, v28
+; ZVBB-NEXT:    vwaddu.wv v24, v24, v8
+; ZVBB-NEXT:    vwaddu.wv v0, v0, v12
+; ZVBB-NEXT:    vmv8r.v v8, v24
 ; ZVBB-NEXT:    vmv8r.v v16, v0
 ; ZVBB-NEXT:    ret
   %res = call <vscale x 32 x i32> @llvm.vector.interleave2.nxv32i32(<vscale x 16 x i32> %a, <vscale x 16 x i32> %b)
@@ -588,11 +588,11 @@ define <vscale x 64 x bfloat> @vector_interleave_nxv64bf16_nxv32bf16(<vscale x 3
 ; ZVBB-LABEL: vector_interleave_nxv64bf16_nxv32bf16:
 ; ZVBB:       # %bb.0:
 ; ZVBB-NEXT:    vsetvli a0, zero, e16, m4, ta, ma
-; ZVBB-NEXT:    vmv8r.v v24, v8
-; ZVBB-NEXT:    vwsll.vi v8, v16, 16
+; ZVBB-NEXT:    vwsll.vi v24, v16, 16
 ; ZVBB-NEXT:    vwsll.vi v0, v20, 16
-; ZVBB-NEXT:    vwaddu.wv v8, v8, v24
-; ZVBB-NEXT:    vwaddu.wv v0, v0, v28
+; ZVBB-NEXT:    vwaddu.wv v24, v24, v8
+; ZVBB-NEXT:    vwaddu.wv v0, v0, v12
+; ZVBB-NEXT:    vmv8r.v v8, v24
 ; ZVBB-NEXT:    vmv8r.v v16, v0
 ; ZVBB-NEXT:    ret
   %res = call <vscale x 64 x bfloat> @llvm.vector.interleave2.nxv64bf16(<vscale x 32 x bfloat> %a, <vscale x 32 x bfloat> %b)
@@ -615,11 +615,11 @@ define <vscale x 64 x half> @vector_interleave_nxv64f16_nxv32f16(<vscale x 32 x
 ; ZVBB-LABEL: vector_interleave_nxv64f16_nxv32f16:
 ; ZVBB:       # %bb.0:
 ; ZVBB-NEXT:    vsetvli a0, zero, e16, m4, ta, ma
-; ZVBB-NEXT:    vmv8r.v v24, v8
-; ZVBB-NEXT:    vwsll.vi v8, v16, 16
+; ZVBB-NEXT:    vwsll.vi v24, v16, 16
 ; ZVBB-NEXT:    vwsll.vi v0, v20, 16
-; ZVBB-NEXT:    vwaddu.wv v8, v8, v24
-; ZVBB-NEXT:    vwaddu.wv v0, v0, v28
+; ZVBB-NEXT:    vwaddu.wv v24, v24, v8
+; ZVBB-NEXT:    vwaddu.wv v0, v0, v12
+; ZVBB-NEXT:    vmv8r.v v8, v24
 ; ZVBB-NEXT:    vmv8r.v v16, v0
 ; ZVBB-NEXT:    ret
   %res = call <vscale x 64 x half> @llvm.vector.interleave2.nxv64f16(<vscale x 32 x half> %a, <vscale x 32 x half> %b)
@@ -641,13 +641,13 @@ define <vscale x 32 x float> @vector_interleave_nxv32f32_nxv16f32(<vscale x 16 x
 ;
 ; ZVBB-LABEL: vector_interleave_nxv32f32_nxv16f32:
 ; ZVBB:       # %bb.0:
-; ZVBB-NEXT:    vsetvli a0, zero, e32, m4, ta, ma
-; ZVBB-NEXT:    vmv8r.v v24, v8
 ; ZVBB-NEXT:    li a0, 32
-; ZVBB-NEXT:    vwsll.vx v8, v16, a0
+; ZVBB-NEXT:    vsetvli a1, zero, e32, m4, ta, ma
+; ZVBB-NEXT:    vwsll.vx v24, v16, a0
 ; ZVBB-NEXT:    vwsll.vx v0, v20, a0
-; ZVBB-NEXT:    vwaddu.wv v8, v8, v24
-; ZVBB-NEXT:    vwaddu.wv v0, v0, v28
+; ZVBB-NEXT:    vwaddu.wv v24, v24, v8
+; ZVBB-NEXT:    vwaddu.wv v0, v0, v12
+; ZVBB-NEXT:    vmv8r.v v8, v24
 ; ZVBB-NEXT:    vmv8r.v v16, v0
 ; ZVBB-NEXT:    ret
   %res = call <vscale x 32 x float> @llvm.vector.interleave2.nxv32f32(<vscale x 16 x float> %a, <vscale x 16 x float> %b)
diff --git a/llvm/test/CodeGen/SPARC/fmuladd-soft-float.ll b/llvm/test/CodeGen/SPARC/fmuladd-soft-float.ll
index a9e666e3c9b4db..b2ea38f2943351 100644
--- a/llvm/test/CodeGen/SPARC/fmuladd-soft-float.ll
+++ b/llvm/test/CodeGen/SPARC/fmuladd-soft-float.ll
@@ -165,7 +165,7 @@ define <4 x float> @fmuladd_contract_v4f32(<4 x float> %a, <4 x float> %b, <4 x
 ; SOFT-FLOAT-32-NEXT:    mov %i0, %o0
 ; SOFT-FLOAT-32-NEXT:    call __mulsf3
 ; SOFT-FLOAT-32-NEXT:    mov %i4, %o1
-; SOFT-FLOAT-32-NEXT:    mov %o0, %l6
+; SOFT-FLOAT-32-NEXT:    mov %o0, %i0
 ; SOFT-FLOAT-32-NEXT:    mov %i1, %o0
 ; SOFT-FLOAT-32-NEXT:    call __mulsf3
 ; SOFT-FLOAT-32-NEXT:    mov %i5, %o1
@@ -173,26 +173,28 @@ define <4 x float> @fmuladd_contract_v4f32(<4 x float> %a, <4 x float> %b, <4 x
 ; SOFT-FLOAT-32-NEXT:    mov %i2, %o0
 ; SOFT-FLOAT-32-NEXT:    call __mulsf3
 ; SOFT-FLOAT-32-NEXT:    mov %l5, %o1
-; SOFT-FLOAT-32-NEXT:    mov %o0, %i4
+; SOFT-FLOAT-32-NEXT:    mov %o0, %i2
 ; SOFT-FLOAT-32-NEXT:    mov %i3, %o0
 ; SOFT-FLOAT-32-NEXT:    call __mulsf3
 ; SOFT-FLOAT-32-NEXT:    mov %l4, %o1
 ; SOFT-FLOAT-32-NEXT:    call __addsf3
 ; SOFT-FLOAT-32-NEXT:    mov %l3, %o1
 ; SOFT-FLOAT-32-NEXT:    mov %o0, %i3
-; SOFT-FLOAT-32-NEXT:    mov %i4, %o0
+; SOFT-FLOAT-32-NEXT:    mov %i2, %o0
 ; SOFT-FLOAT-32-NEXT:    call __addsf3
 ; SOFT-FLOAT-32-NEXT:    mov %l2, %o1
-; SOFT-FLOAT-32-NEXT:    mov %o0, %i2
+; SOFT-FLOAT-32-NEXT:    mov %o0, %i4
 ; SOFT-FLOAT-32-NEXT:    mov %i1, %o0
 ; SOFT-FLOAT-32-NEXT:    call __addsf3
 ; SOFT-FLOAT-32-NEXT:    mov %l1, %o1
 ; SOFT-FLOAT-32-NEXT:    mov %o0, %i1
-; SOFT-FLOAT-32-NEXT:    mov %l6, %o0
+; SOFT-FLOAT-32-NEXT:    mov %i0, %o0
 ; SOFT-FLOAT-32-NEXT:    call __addsf3
 ; SOFT-FLOAT-32-NEXT:    mov %l0, %o1
+; SOFT-FLOAT-32-NEXT:    ! kill: def $o0 killed $o0 def $o0_o1
+; SOFT-FLOAT-32-NEXT:    mov %o0, %i0
 ; SOFT-FLOAT-32-NEXT:    ret
-; SOFT-FLOAT-32-NEXT:    restore %g0, %o0, %o0
+; SOFT-FLOAT-32-NEXT:    restore %g0, %i4, %o2
 ;
 ; SOFT-FLOAT-64-LABEL: fmuladd_contract_v4f32:
 ; SOFT-FLOAT-64:         .cfi_startproc
diff --git a/llvm/test/CodeGen/SPARC/fp128.ll b/llvm/test/CodeGen/SPARC/fp128.ll
index 521e33399ac280..99bfb8d7427112 100644
--- a/llvm/test/CodeGen/SPARC/fp128.ll
+++ b/llvm/test/CodeGen/SPARC/fp128.ll
@@ -54,11 +54,11 @@ entry:
 
 ; CHECK-LABEL: f128_spill_large:
 ; CHECK:       sethi 4, %g1
-; CHECK:       std %f{{.+}}, [%fp+-16]
-; CHECK-NEXT:  std %f{{.+}}, [%fp+-8]
-; CHECK:       ldd [%fp+-16], %f{{.+}}
-; CHECK-NEXT:  ldd [%fp+-8], %f{{.+}}
 
+; CHECK:       std %f{{.+}}, [%[[S0:.+]]]
+; CHECK:       std %f{{.+}}, [%[[S1:.+]]]
+; CHECK-DAG:   ldd [%[[S0]]], %f{{.+}}
+; CHECK-DAG:   ldd [%[[S1]]], %f{{.+}}
 define void @f128_spill_large(ptr noalias sret(<251 x fp128>) %scalar.result, ptr byval(<251 x fp128>) %a) {
 entry:
   %0 = load <251 x fp128>, ptr %a, align 8
@@ -102,10 +102,10 @@ entry:
 
 
 ; CHECK-LABEL: f128_abs:
-; CHECK:       ldd [%o0], %f0
-; CHECK:       ldd [%o0+8], %f2
-; BE:          fabss %f0, %f0
-; EL:          fabss %f3, %f3
+; CHECK-DAG:       ldd [%o0], [[REG:%f[0-9]+]]
+; CHECK-DAG:       ldd [%o0+8], %f{{[0-9]+}}
+; BE:          fabss [[REG]], [[REG]]
+; EL:          fabss %f1, %f1
 
 define void @f128_abs(ptr noalias sret(fp128) %scalar.result, ptr byval(fp128) %a) {
 entry:
@@ -229,10 +229,10 @@ entry:
 }
 
 ; CHECK-LABEL: f128_neg:
-; CHECK:       ldd [%o0], %f0
-; CHECK:       ldd [%o0+8], %f2
-; BE:          fnegs %f0, %f0
-; EL:          fnegs %f3, %f3
+; CHECK-DAG:       ldd [%o0], [[REG:%f[0-9]+]]
+; CHECK-DAG:       ldd [%o0+8], %f{{[0-9]+}}
+; BE:          fnegs [[REG]], [[REG]]
+; EL:          fnegs [[REG]], [[REG]]
 
 define void @f128_neg(ptr noalias sret(fp128) %scalar.result, ptr byval(fp128) %a) {
 entry:
diff --git a/llvm/test/CodeGen/SPARC/fp16-promote.ll b/llvm/test/CodeGen/SPARC/fp16-promote.ll
index f09c37b790844d..a15104c7b8cff8 100644
--- a/llvm/test/CodeGen/SPARC/fp16-promote.ll
+++ b/llvm/test/CodeGen/SPARC/fp16-promote.ll
@@ -89,10 +89,10 @@ define void @test_fpextend_fp128(ptr %p, ptr %out) nounwind {
 ; V8-OPT-NEXT:    call _Q_stoq
 ; V8-OPT-NEXT:    ld [%fp+-20], %o0
 ; V8-OPT-NEXT:    unimp 16
-; V8-OPT-NEXT:    ldd [%fp+-16], %f0
-; V8-OPT-NEXT:    ldd [%fp+-8], %f2
-; V8-OPT-NEXT:    std %f2, [%i1+8]
-; V8-OPT-NEXT:    std %f0, [%i1]
+; V8-OPT-NEXT:    ldd [%fp+-8], %f0
+; V8-OPT-NEXT:    ldd [%fp+-16], %f4
+; V8-OPT-NEXT:    std %f0, [%i1+8]
+; V8-OPT-NEXT:    std %f4, [%i1]
 ; V8-OPT-NEXT:    ret
 ; V8-OPT-NEXT:    restore
 ;
@@ -133,10 +133,10 @@ define void @test_fpextend_fp128(ptr %p, ptr %out) nounwind {
 ; V9-NEXT:    call _Q_stoq
 ; V9-NEXT:    ld [%fp+-20], %o0
 ; V9-NEXT:    unimp 16
-; V9-NEXT:    ldd [%fp+-16], %f0
-; V9-NEXT:    ldd [%fp+-8], %f2
-; V9-NEXT:    std %f2, [%i1+8]
-; V9-NEXT:    std %f0, [%i1]
+; V9-NEXT:    ldd [%fp+-8], %f0
+; V9-NEXT:    ldd [%fp+-16], %f4
+; V9-NEXT:    std %f0, [%i1+8]
+; V9-NEXT:    std %f4, [%i1]
 ; V9-NEXT:    ret
 ; V9-NEXT:    restore
 ;
@@ -149,10 +149,10 @@ define void @test_fpextend_fp128(ptr %p, ptr %out) nounwind {
 ; SPARC64-NEXT:    fmovs %f0, %f3
 ; SPARC64-NEXT:    call _Qp_stoq
 ; SPARC64-NEXT:    nop
-; SPARC64-NEXT:    ldd [%fp+2031], %f0
-; SPARC64-NEXT:    ldd [%fp+2039], %f2
-; SPARC64-NEXT:    std %f2, [%i1+8]
-; SPARC64-NEXT:    std %f0, [%i1]
+; SPARC64-NEXT:    ldd [%fp+2039], %f0
+; SPARC64-NEXT:    ldd [%fp+2031], %f4
+; SPARC64-NEXT:    std %f0, [%i1+8]
+; SPARC64-NEXT:    std %f4, [%i1]
 ; SPARC64-NEXT:    ret
 ; SPARC64-NEXT:    restore
   %a = load half, ptr %p
@@ -270,8 +270,8 @@ define void @test_fptrunc_fp128(ptr %dp, ptr %p) nounwind {
 ; V8-OPT:       ! %bb.0:
 ; V8-OPT-NEXT:    save %sp, -104, %sp
 ; V8-OPT-NEXT:    ldd [%i0], %f0
-; V8-OPT-NEXT:    ldd [%i0+8], %f2
-; V8-OPT-NEXT:    std %f2, [%sp+100]
+; V8-OPT-NEXT:    ldd [%i0+8], %f4
+; V8-OPT-NEXT:    std %f4, [%sp+100]
 ; V8-OPT-NEXT:    call __trunctfhf2
 ; V8-OPT-NEXT:    std %f0, [%sp+92]
 ; V8-OPT-NEXT:    sth %o0, [%i1]
@@ -302,8 +302,8 @@ define void @test_fptrunc_fp128(ptr %dp, ptr %p) nounwind {
 ; V9:       ! %bb.0:
 ; V9-NEXT:    save %sp, -104, %sp
 ; V9-NEXT:    ldd [%i0], %f0
-; V9-NEXT:    ldd [%i0+8], %f2
-; V9-NEXT:    std %f2, [%sp+100]
+; V9-NEXT:    ldd [%i0+8], %f4
+; V9-NEXT:    std %f4, [%sp+100]
 ; V9-NEXT:    call __trunctfhf2
 ; V9-NEXT:    std %f0, [%sp+92]
 ; V9-NEXT:    sth %o0, [%i1]
diff --git a/llvm/test/CodeGen/SystemZ/int-uadd-01.ll b/llvm/test/CodeGen/SystemZ/int-uadd-01.ll
index 7d4aa9f3218485..ec36065737fbff 100644
--- a/llvm/test/CodeGen/SystemZ/int-uadd-01.ll
+++ b/llvm/test/CodeGen/SystemZ/int-uadd-01.ll
@@ -271,10 +271,10 @@ define zeroext i1 @f14(ptr %ptr0) {
 ; CHECK-NEXT:    .cfi_offset %r15, -40
 ; CHECK-NEXT:    aghi %r15, -168
 ; CHECK-NEXT:    .cfi_def_cfa_offset 328
-; CHECK-NEXT:    l %r6, 0(%r2)
-; CHECK-NEXT:    l %r13, 8(%r2)
-; CHECK-NEXT:    l %r12, 16(%r2)
-; CHECK-NEXT:    l %r7, 24(%r2)
+; CHECK-NEXT:    l %r7, 0(%r2)
+; CHECK-NEXT:    l %r6, 8(%r2)
+; CHECK-NEXT:    l %r13, 16(%r2)
+; CHECK-NEXT:    l %r12, 24(%r2)
 ; CHECK-NEXT:    l %r8, 32(%r2)
 ; CHECK-NEXT:    l %r9, 40(%r2)
 ; CHECK-NEXT:    l %r10, 48(%r2)
@@ -282,16 +282,16 @@ define zeroext i1 @f14(ptr %ptr0) {
 ; CHECK-NEXT:    mvc 160(4,%r15), 64(%r2) # 4-byte Folded Spill
 ; CHECK-NEXT:    mvc 164(4,%r15), 72(%r2) # 4-byte Folded Spill
 ; CHECK-NEXT:    brasl %r14, foo at PLT
-; CHECK-NEXT:    alr %r2, %r6
+; CHECK-NEXT:    alr %r2, %r7
 ; CHECK-NEXT:    ipm %r0
 ; CHECK-NEXT:    risbg %r0, %r0, 63, 191, 35
-; CHECK-NEXT:    alr %r2, %r13
+; CHECK-NEXT:    alr %r2, %r6
 ; CHECK-NEXT:    ipm %r1
 ; CHECK-NEXT:    rosbg %r0, %r1, 63, 63, 35
-; CHECK-NEXT:    alr %r2, %r12
+; CHECK-NEXT:    alr %r2, %r13
 ; CHECK-NEXT:    ipm %r1
 ; CHECK-NEXT:    rosbg %r0, %r1, 63, 63, 35
-; CHECK-NEXT:    alr %r2, %r7
+; CHECK-NEXT:    alr %r2, %r12
 ; CHECK-NEXT:    ipm %r1
 ; CHECK-NEXT:    rosbg %r0, %r1, 63, 63, 35
 ; CHECK-NEXT:    alr %r2, %r8
diff --git a/llvm/test/CodeGen/SystemZ/int-uadd-02.ll b/llvm/test/CodeGen/SystemZ/int-uadd-02.ll
index 46c5b4ff35fc05..ad6aa79feac474 100644
--- a/llvm/test/CodeGen/SystemZ/int-uadd-02.ll
+++ b/llvm/test/CodeGen/SystemZ/int-uadd-02.ll
@@ -215,10 +215,10 @@ define zeroext i1 @f11(ptr %ptr0) {
 ; CHECK-NEXT:    .cfi_offset %r15, -40
 ; CHECK-NEXT:    aghi %r15, -176
 ; CHECK-NEXT:    .cfi_def_cfa_offset 336
-; CHECK-NEXT:    lg %r6, 0(%r2)
-; CHECK-NEXT:    lg %r13, 16(%r2)
-; CHECK-NEXT:    lg %r12, 32(%r2)
-; CHECK-NEXT:    lg %r7, 48(%r2)
+; CHECK-NEXT:    lg %r7, 0(%r2)
+; CHECK-NEXT:    lg %r6, 16(%r2)
+; CHECK-NEXT:    lg %r13, 32(%r2)
+; CHECK-NEXT:    lg %r12, 48(%r2)
 ; CHECK-NEXT:    lg %r8, 64(%r2)
 ; CHECK-NEXT:    lg %r9, 80(%r2)
 ; CHECK-NEXT:    lg %r10, 96(%r2)
@@ -226,16 +226,16 @@ define zeroext i1 @f11(ptr %ptr0) {
 ; CHECK-NEXT:    mvc 160(8,%r15), 128(%r2) # 8-byte Folded Spill
 ; CHECK-NEXT:    mvc 168(8,%r15), 144(%r2) # 8-byte Folded Spill
 ; CHECK-NEXT:    brasl %r14, foo at PLT
-; CHECK-NEXT:    algr %r2, %r6
+; CHECK-NEXT:    algr %r2, %r7
 ; CHECK-NEXT:    ipm %r0
 ; CHECK-NEXT:    risbg %r0, %r0, 63, 191, 35
-; CHECK-NEXT:    algr %r2, %r13
+; CHECK-NEXT:    algr %r2, %r6
 ; CHECK-NEXT:    ipm %r1
 ; CHECK-NEXT:    rosbg %r0, %r1, 63, 63, 35
-; CHECK-NEXT:    algr %r2, %r12
+; CHECK-NEXT:    algr %r2, %r13
 ; CHECK-NEXT:    ipm %r1
 ; CHECK-NEXT:    rosbg %r0, %r1, 63, 63, 35
-; CHECK-NEXT:    algr %r2, %r7
+; CHECK-NEXT:    algr %r2, %r12
 ; CHECK-NEXT:    ipm %r1
 ; CHECK-NEXT:    rosbg %r0, %r1, 63, 63, 35
 ; CHECK-NEXT:    algr %r2, %r8
diff --git a/llvm/test/CodeGen/SystemZ/pr60413.ll b/llvm/test/CodeGen/SystemZ/pr60413.ll
index 62f5d49192ea9b..6dee3d303a6f28 100644
--- a/llvm/test/CodeGen/SystemZ/pr60413.ll
+++ b/llvm/test/CodeGen/SystemZ/pr60413.ll
@@ -40,38 +40,38 @@ define dso_local void @m() local_unnamed_addr #1 {
 ; CHECK-NEXT:    vn %v0, %v0, %v6
 ; CHECK-NEXT:    vn %v4, %v4, %v6
 ; CHECK-NEXT:    vperm %v1, %v1, %v1, %v5
-; CHECK-NEXT:    vn %v5, %v1, %v6
-; CHECK-NEXT:    vperm %v1, %v0, %v3, %v2
-; CHECK-NEXT:    vn %v2, %v1, %v6
-; CHECK-NEXT:    vrepif %v1, 127
-; CHECK-NEXT:    vchlf %v3, %v5, %v1
-; CHECK-NEXT:    vlgvf %r3, %v3, 1
-; CHECK-NEXT:    vlgvf %r2, %v3, 0
+; CHECK-NEXT:    vn %v1, %v1, %v6
+; CHECK-NEXT:    vperm %v2, %v0, %v3, %v2
+; CHECK-NEXT:    vn %v2, %v2, %v6
+; CHECK-NEXT:    vrepif %v3, 127
+; CHECK-NEXT:    vchlf %v1, %v1, %v3
+; CHECK-NEXT:    vlgvf %r3, %v1, 1
+; CHECK-NEXT:    vlgvf %r2, %v1, 0
 ; CHECK-NEXT:    risbg %r2, %r2, 48, 176, 15
 ; CHECK-NEXT:    rosbg %r2, %r3, 49, 49, 14
-; CHECK-NEXT:    vlgvf %r3, %v3, 2
+; CHECK-NEXT:    vlgvf %r3, %v1, 2
 ; CHECK-NEXT:    rosbg %r2, %r3, 50, 50, 13
-; CHECK-NEXT:    vlgvf %r3, %v3, 3
+; CHECK-NEXT:    vlgvf %r3, %v1, 3
 ; CHECK-NEXT:    rosbg %r2, %r3, 51, 51, 12
-; CHECK-NEXT:    vchlf %v3, %v4, %v1
-; CHECK-NEXT:    vlgvf %r3, %v3, 0
+; CHECK-NEXT:    vchlf %v1, %v4, %v3
+; CHECK-NEXT:    vlgvf %r3, %v1, 0
 ; CHECK-NEXT:    rosbg %r2, %r3, 52, 52, 11
-; CHECK-NEXT:    vlgvf %r3, %v3, 1
+; CHECK-NEXT:    vlgvf %r3, %v1, 1
 ; CHECK-NEXT:    rosbg %r2, %r3, 53, 53, 10
-; CHECK-NEXT:    vlgvf %r3, %v3, 2
+; CHECK-NEXT:    vlgvf %r3, %v1, 2
 ; CHECK-NEXT:    rosbg %r2, %r3, 54, 54, 9
-; CHECK-NEXT:    vlgvf %r3, %v3, 3
+; CHECK-NEXT:    vlgvf %r3, %v1, 3
 ; CHECK-NEXT:    rosbg %r2, %r3, 55, 55, 8
-; CHECK-NEXT:    vchlf %v2, %v2, %v1
-; CHECK-NEXT:    vlgvf %r3, %v2, 0
+; CHECK-NEXT:    vchlf %v1, %v2, %v3
+; CHECK-NEXT:    vlgvf %r3, %v1, 0
 ; CHECK-NEXT:    rosbg %r2, %r3, 56, 56, 7
-; CHECK-NEXT:    vlgvf %r3, %v2, 1
+; CHECK-NEXT:    vlgvf %r3, %v1, 1
 ; CHECK-NEXT:    rosbg %r2, %r3, 57, 57, 6
-; CHECK-NEXT:    vlgvf %r3, %v2, 2
+; CHECK-NEXT:    vlgvf %r3, %v1, 2
 ; CHECK-NEXT:    rosbg %r2, %r3, 58, 58, 5
-; CHECK-NEXT:    vlgvf %r3, %v2, 3
+; CHECK-NEXT:    vlgvf %r3, %v1, 3
 ; CHECK-NEXT:    rosbg %r2, %r3, 59, 59, 4
-; CHECK-NEXT:    vchlf %v0, %v0, %v1
+; CHECK-NEXT:    vchlf %v0, %v0, %v3
 ; CHECK-NEXT:    vlgvf %r3, %v0, 0
 ; CHECK-NEXT:    rosbg %r2, %r3, 60, 60, 3
 ; CHECK-NEXT:    vlgvf %r3, %v0, 1
diff --git a/llvm/test/CodeGen/Thumb2/mve-complex-deinterleaving-uniform-cases.ll b/llvm/test/CodeGen/Thumb2/mve-complex-deinterleaving-uniform-cases.ll
index c5f61b7fcdde55..dc67abc1be07eb 100644
--- a/llvm/test/CodeGen/Thumb2/mve-complex-deinterleaving-uniform-cases.ll
+++ b/llvm/test/CodeGen/Thumb2/mve-complex-deinterleaving-uniform-cases.ll
@@ -212,70 +212,70 @@ define arm_aapcs_vfpcc <12 x float> @abp90c12(<12 x float> %a, <12 x float> %b,
 ; CHECK-NEXT:    vpush {d8, d9, d10, d11, d12, d13, d14, d15}
 ; CHECK-NEXT:    .pad #64
 ; CHECK-NEXT:    sub sp, #64
-; CHECK-NEXT:    vldr s23, [sp, #140]
-; CHECK-NEXT:    vmov.f32 s20, s13
-; CHECK-NEXT:    vldr s22, [sp, #132]
-; CHECK-NEXT:    vmov.f32 s25, s11
-; CHECK-NEXT:    vmov.f32 s13, s10
+; CHECK-NEXT:    vmov.f32 s21, s11
+; CHECK-NEXT:    vldr s31, [sp, #140]
+; CHECK-NEXT:    vmov.f32 s20, s9
+; CHECK-NEXT:    vldr s30, [sp, #132]
+; CHECK-NEXT:    vstrw.32 q5, [sp, #32] @ 16-byte Spill
+; CHECK-NEXT:    vmov.f32 s28, s13
+; CHECK-NEXT:    vmov.f32 s23, s6
 ; CHECK-NEXT:    vldr s19, [sp, #136]
-; CHECK-NEXT:    vmov.f32 s11, s6
-; CHECK-NEXT:    vldr s18, [sp, #128]
 ; CHECK-NEXT:    vmov.f32 s6, s5
-; CHECK-NEXT:    vldr s31, [sp, #188]
-; CHECK-NEXT:    vmov.f32 s10, s4
-; CHECK-NEXT:    vldr s30, [sp, #180]
-; CHECK-NEXT:    vmov.f32 s21, s15
-; CHECK-NEXT:    vldr s29, [sp, #172]
+; CHECK-NEXT:    vldr s18, [sp, #128]
+; CHECK-NEXT:    vmov.f32 s22, s4
+; CHECK-NEXT:    vldr s27, [sp, #184]
+; CHECK-NEXT:    vmov.f32 s29, s15
+; CHECK-NEXT:    vldr s26, [sp, #176]
 ; CHECK-NEXT:    vmov.f32 s5, s3
-; CHECK-NEXT:    vldr s28, [sp, #164]
+; CHECK-NEXT:    vldr s25, [sp, #168]
 ; CHECK-NEXT:    vmov.f32 s4, s1
-; CHECK-NEXT:    vmov.f32 s24, s9
+; CHECK-NEXT:    vldr s24, [sp, #160]
 ; CHECK-NEXT:    vmov.f32 s16, s12
-; CHECK-NEXT:    vstrw.32 q6, [sp, #32] @ 16-byte Spill
-; CHECK-NEXT:    vmov.f32 s12, s8
-; CHECK-NEXT:    vldr s27, [sp, #184]
+; CHECK-NEXT:    vldr s11, [sp, #188]
 ; CHECK-NEXT:    vmov.f32 s17, s14
-; CHECK-NEXT:    vldr s26, [sp, #176]
-; CHECK-NEXT:    vmov.f32 s9, s2
-; CHECK-NEXT:    vldr s25, [sp, #168]
-; CHECK-NEXT:    vmov.f32 s8, s0
-; CHECK-NEXT:    vmul.f32 q0, q5, q1
+; CHECK-NEXT:    vldr s9, [sp, #172]
+; CHECK-NEXT:    vmov.f32 s21, s2
+; CHECK-NEXT:    vmov.f32 s20, s0
+; CHECK-NEXT:    vmul.f32 q0, q7, q1
 ; CHECK-NEXT:    vmul.f32 q1, q4, q1
 ; CHECK-NEXT:    vneg.f32 q0, q0
-; CHECK-NEXT:    vldr s24, [sp, #160]
-; CHECK-NEXT:    vfma.f32 q1, q5, q2
 ; CHECK-NEXT:    vstrw.32 q0, [sp, #16] @ 16-byte Spill
-; CHECK-NEXT:    vstrw.32 q3, [sp, #48] @ 16-byte Spill
-; CHECK-NEXT:    vsub.f32 q6, q6, q1
+; CHECK-NEXT:    vfma.f32 q1, q7, q5
+; CHECK-NEXT:    vsub.f32 q7, q6, q1
 ; CHECK-NEXT:    vldrw.u32 q1, [sp, #16] @ 16-byte Reload
+; CHECK-NEXT:    vmov.f32 s13, s10
+; CHECK-NEXT:    vldr s10, [sp, #180]
+; CHECK-NEXT:    vmov.f32 s12, s8
+; CHECK-NEXT:    vfma.f32 q1, q4, q5
+; CHECK-NEXT:    vstrw.32 q3, [sp, #48] @ 16-byte Spill
+; CHECK-NEXT:    vldr s8, [sp, #164]
 ; CHECK-NEXT:    vldr s13, [sp, #156]
-; CHECK-NEXT:    vfma.f32 q1, q4, q2
+; CHECK-NEXT:    vldrw.u32 q5, [sp, #32] @ 16-byte Reload
 ; CHECK-NEXT:    vldr s12, [sp, #148]
-; CHECK-NEXT:    vadd.f32 q1, q7, q1
-; CHECK-NEXT:    vldrw.u32 q7, [sp, #32] @ 16-byte Reload
+; CHECK-NEXT:    vadd.f32 q1, q2, q1
 ; CHECK-NEXT:    vldr s1, [sp, #152]
+; CHECK-NEXT:    vmul.f32 q2, q3, q5
 ; CHECK-NEXT:    vstrw.32 q3, [sp] @ 16-byte Spill
-; CHECK-NEXT:    vmul.f32 q2, q3, q7
 ; CHECK-NEXT:    vldr s0, [sp, #144]
 ; CHECK-NEXT:    vldrw.u32 q3, [sp, #48] @ 16-byte Reload
 ; CHECK-NEXT:    vneg.f32 q2, q2
-; CHECK-NEXT:    vldr s21, [sp, #200]
+; CHECK-NEXT:    vldr s25, [sp, #200]
 ; CHECK-NEXT:    vfma.f32 q2, q0, q3
-; CHECK-NEXT:    vmul.f32 q0, q0, q7
-; CHECK-NEXT:    vldrw.u32 q7, [sp] @ 16-byte Reload
-; CHECK-NEXT:    vldr s20, [sp, #192]
+; CHECK-NEXT:    vmul.f32 q0, q0, q5
+; CHECK-NEXT:    vldrw.u32 q5, [sp] @ 16-byte Reload
+; CHECK-NEXT:    vldr s24, [sp, #192]
 ; CHECK-NEXT:    vldr s17, [sp, #204]
 ; CHECK-NEXT:    vldr s16, [sp, #196]
-; CHECK-NEXT:    vfma.f32 q0, q7, q3
-; CHECK-NEXT:    vsub.f32 q3, q5, q0
+; CHECK-NEXT:    vfma.f32 q0, q5, q3
+; CHECK-NEXT:    vsub.f32 q3, q6, q0
 ; CHECK-NEXT:    vmov.f32 s1, s4
 ; CHECK-NEXT:    vadd.f32 q4, q4, q2
 ; CHECK-NEXT:    vmov.f32 s3, s5
 ; CHECK-NEXT:    vmov.f32 s5, s6
-; CHECK-NEXT:    vmov.f32 s0, s24
-; CHECK-NEXT:    vmov.f32 s2, s25
-; CHECK-NEXT:    vmov.f32 s4, s26
-; CHECK-NEXT:    vmov.f32 s6, s27
+; CHECK-NEXT:    vmov.f32 s0, s28
+; CHECK-NEXT:    vmov.f32 s2, s29
+; CHECK-NEXT:    vmov.f32 s4, s30
+; CHECK-NEXT:    vmov.f32 s6, s31
 ; CHECK-NEXT:    vmov.f32 s8, s12
 ; CHECK-NEXT:    vmov.f32 s9, s16
 ; CHECK-NEXT:    vmov.f32 s10, s13
diff --git a/llvm/test/CodeGen/Thumb2/mve-float32regloops.ll b/llvm/test/CodeGen/Thumb2/mve-float32regloops.ll
index eb52b5ab9fd171..28166e455aba27 100644
--- a/llvm/test/CodeGen/Thumb2/mve-float32regloops.ll
+++ b/llvm/test/CodeGen/Thumb2/mve-float32regloops.ll
@@ -1912,11 +1912,11 @@ define void @arm_biquad_cascade_df2T_f32(ptr nocapture readonly %S, ptr nocaptur
 ; CHECK-NEXT:    b .LBB20_3
 ; CHECK-NEXT:  .LBB20_1: @ %if.else
 ; CHECK-NEXT:    @ in Loop: Header=BB20_3 Depth=1
-; CHECK-NEXT:    vmov.f32 s6, s5
-; CHECK-NEXT:    vstr s4, [r6]
+; CHECK-NEXT:    vmov.f32 s14, s13
+; CHECK-NEXT:    vstr s12, [r6]
 ; CHECK-NEXT:  .LBB20_2: @ %if.end
 ; CHECK-NEXT:    @ in Loop: Header=BB20_3 Depth=1
-; CHECK-NEXT:    vstr s6, [r6, #4]
+; CHECK-NEXT:    vstr s14, [r6, #4]
 ; CHECK-NEXT:    add.w r12, r12, #20
 ; CHECK-NEXT:    adds r6, #8
 ; CHECK-NEXT:    subs r0, #1
@@ -1925,41 +1925,41 @@ define void @arm_biquad_cascade_df2T_f32(ptr nocapture readonly %S, ptr nocaptur
 ; CHECK-NEXT:  .LBB20_3: @ %do.body
 ; CHECK-NEXT:    @ =>This Loop Header: Depth=1
 ; CHECK-NEXT:    @ Child Loop BB20_5 Depth 2
-; CHECK-NEXT:    vldrw.u32 q3, [r12]
+; CHECK-NEXT:    vldrw.u32 q2, [r12]
 ; CHECK-NEXT:    movs r5, #0
-; CHECK-NEXT:    vmov q4, q3
+; CHECK-NEXT:    vmov q4, q2
 ; CHECK-NEXT:    vshlc q4, r5, #32
-; CHECK-NEXT:    vldrw.u32 q2, [r12, #8]
-; CHECK-NEXT:    vmov q5, q2
+; CHECK-NEXT:    vldrw.u32 q1, [r12, #8]
+; CHECK-NEXT:    vmov q5, q1
 ; CHECK-NEXT:    vshlc q5, r5, #32
-; CHECK-NEXT:    vldrw.u32 q1, [r6]
-; CHECK-NEXT:    vmov.f32 s6, s0
+; CHECK-NEXT:    vldrw.u32 q3, [r6]
+; CHECK-NEXT:    vmov.f32 s14, s0
 ; CHECK-NEXT:    mov r5, r2
-; CHECK-NEXT:    vmov.f32 s7, s0
+; CHECK-NEXT:    vmov.f32 s15, s0
 ; CHECK-NEXT:    wls lr, r8, .LBB20_6
 ; CHECK-NEXT:  @ %bb.4: @ %while.body.preheader
 ; CHECK-NEXT:    @ in Loop: Header=BB20_3 Depth=1
-; CHECK-NEXT:    vmov q6, q1
+; CHECK-NEXT:    vmov q6, q3
 ; CHECK-NEXT:    mov r5, r2
 ; CHECK-NEXT:  .LBB20_5: @ %while.body
 ; CHECK-NEXT:    @ Parent Loop BB20_3 Depth=1
 ; CHECK-NEXT:    @ => This Inner Loop Header: Depth=2
 ; CHECK-NEXT:    ldrd r7, r4, [r1], #8
-; CHECK-NEXT:    vfma.f32 q6, q3, r7
+; CHECK-NEXT:    vfma.f32 q6, q2, r7
 ; CHECK-NEXT:    vmov r7, s24
-; CHECK-NEXT:    vmov q1, q6
-; CHECK-NEXT:    vfma.f32 q1, q2, r7
+; CHECK-NEXT:    vmov q3, q6
+; CHECK-NEXT:    vfma.f32 q3, q1, r7
 ; CHECK-NEXT:    vstr s24, [r5]
-; CHECK-NEXT:    vmov.f32 s7, s0
-; CHECK-NEXT:    vfma.f32 q1, q4, r4
-; CHECK-NEXT:    vmov r4, s5
-; CHECK-NEXT:    vstr s5, [r5, #4]
-; CHECK-NEXT:    vfma.f32 q1, q5, r4
+; CHECK-NEXT:    vmov.f32 s15, s0
+; CHECK-NEXT:    vfma.f32 q3, q4, r4
+; CHECK-NEXT:    vmov r4, s13
+; CHECK-NEXT:    vstr s13, [r5, #4]
+; CHECK-NEXT:    vfma.f32 q3, q5, r4
 ; CHECK-NEXT:    adds r5, #8
-; CHECK-NEXT:    vmov.f32 s4, s6
-; CHECK-NEXT:    vmov.f32 s5, s7
-; CHECK-NEXT:    vmov.f32 s6, s0
-; CHECK-NEXT:    vmov q6, q1
+; CHECK-NEXT:    vmov.f32 s12, s14
+; CHECK-NEXT:    vmov.f32 s13, s15
+; CHECK-NEXT:    vmov.f32 s14, s0
+; CHECK-NEXT:    vmov q6, q3
 ; CHECK-NEXT:    le lr, .LBB20_5
 ; CHECK-NEXT:  .LBB20_6: @ %while.end
 ; CHECK-NEXT:    @ in Loop: Header=BB20_3 Depth=1
@@ -1968,11 +1968,11 @@ define void @arm_biquad_cascade_df2T_f32(ptr nocapture readonly %S, ptr nocaptur
 ; CHECK-NEXT:  @ %bb.7: @ %if.then
 ; CHECK-NEXT:    @ in Loop: Header=BB20_3 Depth=1
 ; CHECK-NEXT:    ldr r1, [r1]
-; CHECK-NEXT:    vfma.f32 q1, q3, r1
-; CHECK-NEXT:    vmov r1, s4
-; CHECK-NEXT:    vstr s4, [r5]
-; CHECK-NEXT:    vfma.f32 q1, q2, r1
-; CHECK-NEXT:    vstr s5, [r6]
+; CHECK-NEXT:    vfma.f32 q3, q2, r1
+; CHECK-NEXT:    vmov r1, s12
+; CHECK-NEXT:    vstr s12, [r5]
+; CHECK-NEXT:    vfma.f32 q3, q1, r1
+; CHECK-NEXT:    vstr s13, [r6]
 ; CHECK-NEXT:    b .LBB20_2
 ; CHECK-NEXT:  .LBB20_8: @ %do.end
 ; CHECK-NEXT:    vpop {d8, d9, d10, d11, d12, d13}
diff --git a/llvm/test/CodeGen/Thumb2/mve-masked-ldst.ll b/llvm/test/CodeGen/Thumb2/mve-masked-ldst.ll
index 4934d223209037..67910e40e77058 100644
--- a/llvm/test/CodeGen/Thumb2/mve-masked-ldst.ll
+++ b/llvm/test/CodeGen/Thumb2/mve-masked-ldst.ll
@@ -762,10 +762,10 @@ define void @foo_v4f32_v4f16(ptr %dest, ptr %mask, ptr %src) {
 ; CHECK-LE-NEXT:  .LBB18_5: @ %else8
 ; CHECK-LE-NEXT:    vmrs r2, p0
 ; CHECK-LE-NEXT:    movs r1, #0
-; CHECK-LE-NEXT:    vcvtt.f32.f16 s3, s1
-; CHECK-LE-NEXT:    vcvtb.f32.f16 s2, s1
-; CHECK-LE-NEXT:    vcvtt.f32.f16 s1, s0
+; CHECK-LE-NEXT:    vcvtt.f32.f16 s6, s0
 ; CHECK-LE-NEXT:    vcvtb.f32.f16 s0, s0
+; CHECK-LE-NEXT:    vcvtt.f32.f16 s2, s1
+; CHECK-LE-NEXT:    vcvtb.f32.f16 s4, s1
 ; CHECK-LE-NEXT:    and r3, r2, #1
 ; CHECK-LE-NEXT:    rsbs r3, r3, #0
 ; CHECK-LE-NEXT:    bfi r1, r3, #0, #1
@@ -784,15 +784,15 @@ define void @foo_v4f32_v4f16(ptr %dest, ptr %mask, ptr %src) {
 ; CHECK-LE-NEXT:    strne r2, [r0]
 ; CHECK-LE-NEXT:    lsls r2, r1, #30
 ; CHECK-LE-NEXT:    itt mi
-; CHECK-LE-NEXT:    vmovmi r2, s1
+; CHECK-LE-NEXT:    vmovmi r2, s6
 ; CHECK-LE-NEXT:    strmi r2, [r0, #4]
 ; CHECK-LE-NEXT:    lsls r2, r1, #29
 ; CHECK-LE-NEXT:    itt mi
-; CHECK-LE-NEXT:    vmovmi r2, s2
+; CHECK-LE-NEXT:    vmovmi r2, s4
 ; CHECK-LE-NEXT:    strmi r2, [r0, #8]
 ; CHECK-LE-NEXT:    lsls r1, r1, #28
 ; CHECK-LE-NEXT:    itt mi
-; CHECK-LE-NEXT:    vmovmi r1, s3
+; CHECK-LE-NEXT:    vmovmi r1, s2
 ; CHECK-LE-NEXT:    strmi r1, [r0, #12]
 ; CHECK-LE-NEXT:    add sp, #4
 ; CHECK-LE-NEXT:    pop {r7, pc}
@@ -853,10 +853,10 @@ define void @foo_v4f32_v4f16(ptr %dest, ptr %mask, ptr %src) {
 ; CHECK-BE-NEXT:  .LBB18_5: @ %else8
 ; CHECK-BE-NEXT:    vmrs r2, p0
 ; CHECK-BE-NEXT:    movs r1, #0
-; CHECK-BE-NEXT:    vcvtt.f32.f16 s3, s1
-; CHECK-BE-NEXT:    vcvtb.f32.f16 s2, s1
-; CHECK-BE-NEXT:    vcvtt.f32.f16 s1, s0
+; CHECK-BE-NEXT:    vcvtt.f32.f16 s6, s0
 ; CHECK-BE-NEXT:    vcvtb.f32.f16 s0, s0
+; CHECK-BE-NEXT:    vcvtt.f32.f16 s2, s1
+; CHECK-BE-NEXT:    vcvtb.f32.f16 s4, s1
 ; CHECK-BE-NEXT:    ubfx r3, r2, #12, #1
 ; CHECK-BE-NEXT:    rsbs r3, r3, #0
 ; CHECK-BE-NEXT:    bfi r1, r3, #0, #1
@@ -875,15 +875,15 @@ define void @foo_v4f32_v4f16(ptr %dest, ptr %mask, ptr %src) {
 ; CHECK-BE-NEXT:    strmi r2, [r0]
 ; CHECK-BE-NEXT:    lsls r2, r1, #29
 ; CHECK-BE-NEXT:    itt mi
-; CHECK-BE-NEXT:    vmovmi r2, s1
+; CHECK-BE-NEXT:    vmovmi r2, s6
 ; CHECK-BE-NEXT:    strmi r2, [r0, #4]
 ; CHECK-BE-NEXT:    lsls r2, r1, #30
 ; CHECK-BE-NEXT:    itt mi
-; CHECK-BE-NEXT:    vmovmi r2, s2
+; CHECK-BE-NEXT:    vmovmi r2, s4
 ; CHECK-BE-NEXT:    strmi r2, [r0, #8]
 ; CHECK-BE-NEXT:    lsls r1, r1, #31
 ; CHECK-BE-NEXT:    itt ne
-; CHECK-BE-NEXT:    vmovne r1, s3
+; CHECK-BE-NEXT:    vmovne r1, s2
 ; CHECK-BE-NEXT:    strne r1, [r0, #12]
 ; CHECK-BE-NEXT:    add sp, #4
 ; CHECK-BE-NEXT:    pop {r7, pc}
@@ -953,10 +953,10 @@ define void @foo_v4f32_v4f16_unaligned(ptr %dest, ptr %mask, ptr %src) {
 ; CHECK-LE-NEXT:  .LBB19_5: @ %else8
 ; CHECK-LE-NEXT:    vmrs r2, p0
 ; CHECK-LE-NEXT:    movs r1, #0
-; CHECK-LE-NEXT:    vcvtt.f32.f16 s3, s1
-; CHECK-LE-NEXT:    vcvtb.f32.f16 s2, s1
-; CHECK-LE-NEXT:    vcvtt.f32.f16 s1, s0
+; CHECK-LE-NEXT:    vcvtt.f32.f16 s6, s0
 ; CHECK-LE-NEXT:    vcvtb.f32.f16 s0, s0
+; CHECK-LE-NEXT:    vcvtt.f32.f16 s2, s1
+; CHECK-LE-NEXT:    vcvtb.f32.f16 s4, s1
 ; CHECK-LE-NEXT:    and r3, r2, #1
 ; CHECK-LE-NEXT:    rsbs r3, r3, #0
 ; CHECK-LE-NEXT:    bfi r1, r3, #0, #1
@@ -975,15 +975,15 @@ define void @foo_v4f32_v4f16_unaligned(ptr %dest, ptr %mask, ptr %src) {
 ; CHECK-LE-NEXT:    strne r2, [r0]
 ; CHECK-LE-NEXT:    lsls r2, r1, #30
 ; CHECK-LE-NEXT:    itt mi
-; CHECK-LE-NEXT:    vmovmi r2, s1
+; CHECK-LE-NEXT:    vmovmi r2, s6
 ; CHECK-LE-NEXT:    strmi r2, [r0, #4]
 ; CHECK-LE-NEXT:    lsls r2, r1, #29
 ; CHECK-LE-NEXT:    itt mi
-; CHECK-LE-NEXT:    vmovmi r2, s2
+; CHECK-LE-NEXT:    vmovmi r2, s4
 ; CHECK-LE-NEXT:    strmi r2, [r0, #8]
 ; CHECK-LE-NEXT:    lsls r1, r1, #28
 ; CHECK-LE-NEXT:    itt mi
-; CHECK-LE-NEXT:    vmovmi r1, s3
+; CHECK-LE-NEXT:    vmovmi r1, s2
 ; CHECK-LE-NEXT:    strmi r1, [r0, #12]
 ; CHECK-LE-NEXT:    add sp, #4
 ; CHECK-LE-NEXT:    pop {r7, pc}
@@ -1044,10 +1044,10 @@ define void @foo_v4f32_v4f16_unaligned(ptr %dest, ptr %mask, ptr %src) {
 ; CHECK-BE-NEXT:  .LBB19_5: @ %else8
 ; CHECK-BE-NEXT:    vmrs r2, p0
 ; CHECK-BE-NEXT:    movs r1, #0
-; CHECK-BE-NEXT:    vcvtt.f32.f16 s3, s1
-; CHECK-BE-NEXT:    vcvtb.f32.f16 s2, s1
-; CHECK-BE-NEXT:    vcvtt.f32.f16 s1, s0
+; CHECK-BE-NEXT:    vcvtt.f32.f16 s6, s0
 ; CHECK-BE-NEXT:    vcvtb.f32.f16 s0, s0
+; CHECK-BE-NEXT:    vcvtt.f32.f16 s2, s1
+; CHECK-BE-NEXT:    vcvtb.f32.f16 s4, s1
 ; CHECK-BE-NEXT:    ubfx r3, r2, #12, #1
 ; CHECK-BE-NEXT:    rsbs r3, r3, #0
 ; CHECK-BE-NEXT:    bfi r1, r3, #0, #1
@@ -1066,15 +1066,15 @@ define void @foo_v4f32_v4f16_unaligned(ptr %dest, ptr %mask, ptr %src) {
 ; CHECK-BE-NEXT:    strmi r2, [r0]
 ; CHECK-BE-NEXT:    lsls r2, r1, #29
 ; CHECK-BE-NEXT:    itt mi
-; CHECK-BE-NEXT:    vmovmi r2, s1
+; CHECK-BE-NEXT:    vmovmi r2, s6
 ; CHECK-BE-NEXT:    strmi r2, [r0, #4]
 ; CHECK-BE-NEXT:    lsls r2, r1, #30
 ; CHECK-BE-NEXT:    itt mi
-; CHECK-BE-NEXT:    vmovmi r2, s2
+; CHECK-BE-NEXT:    vmovmi r2, s4
 ; CHECK-BE-NEXT:    strmi r2, [r0, #8]
 ; CHECK-BE-NEXT:    lsls r1, r1, #31
 ; CHECK-BE-NEXT:    itt ne
-; CHECK-BE-NEXT:    vmovne r1, s3
+; CHECK-BE-NEXT:    vmovne r1, s2
 ; CHECK-BE-NEXT:    strne r1, [r0, #12]
 ; CHECK-BE-NEXT:    add sp, #4
 ; CHECK-BE-NEXT:    pop {r7, pc}
diff --git a/llvm/test/CodeGen/Thumb2/mve-shuffle.ll b/llvm/test/CodeGen/Thumb2/mve-shuffle.ll
index f4643f8c6c4a1f..33816fec69bacf 100644
--- a/llvm/test/CodeGen/Thumb2/mve-shuffle.ll
+++ b/llvm/test/CodeGen/Thumb2/mve-shuffle.ll
@@ -225,15 +225,15 @@ entry:
 define arm_aapcs_vfpcc <8 x i16> @shuffle3_i16(<8 x i16> %src) {
 ; CHECK-LABEL: shuffle3_i16:
 ; CHECK:       @ %bb.0: @ %entry
-; CHECK-NEXT:    vmov q1, q0
-; CHECK-NEXT:    vmovx.f16 s2, s5
-; CHECK-NEXT:    vmovx.f16 s0, s4
-; CHECK-NEXT:    vins.f16 s5, s4
-; CHECK-NEXT:    vins.f16 s2, s0
-; CHECK-NEXT:    vmov.f32 s3, s5
-; CHECK-NEXT:    vmovx.f16 s1, s7
-; CHECK-NEXT:    vmov.f32 s0, s6
-; CHECK-NEXT:    vins.f16 s1, s7
+; CHECK-NEXT:    vmovx.f16 s5, s3
+; CHECK-NEXT:    vmovx.f16 s6, s1
+; CHECK-NEXT:    vmovx.f16 s4, s0
+; CHECK-NEXT:    vins.f16 s1, s0
+; CHECK-NEXT:    vins.f16 s6, s4
+; CHECK-NEXT:    vins.f16 s5, s3
+; CHECK-NEXT:    vmov.f32 s4, s2
+; CHECK-NEXT:    vmov.f32 s7, s1
+; CHECK-NEXT:    vmov q0, q1
 ; CHECK-NEXT:    bx lr
 entry:
   %out = shufflevector <8 x i16> %src, <8 x i16> undef, <8 x i32> <i32 4, i32 5, i32 7, i32 6, i32 3, i32 1, i32 2, i32 0>
@@ -357,22 +357,22 @@ define arm_aapcs_vfpcc <8 x i16> @shuffle3step_i16(<32 x i16> %src) {
 ; CHECK-NEXT:    vmov.f32 s17, s4
 ; CHECK-NEXT:    vmovx.f16 s13, s3
 ; CHECK-NEXT:    vins.f16 s17, s2
-; CHECK-NEXT:    vmov.f32 s18, s7
 ; CHECK-NEXT:    vmovx.f16 s2, s8
-; CHECK-NEXT:    vmov.f32 s19, s10
+; CHECK-NEXT:    vmov.f32 s18, s7
+; CHECK-NEXT:    vmovx.f16 s14, s6
 ; CHECK-NEXT:    vins.f16 s18, s2
 ; CHECK-NEXT:    vmovx.f16 s2, s11
+; CHECK-NEXT:    vmov.f32 s19, s10
+; CHECK-NEXT:    vmovx.f16 s15, s9
 ; CHECK-NEXT:    vins.f16 s19, s2
 ; CHECK-NEXT:    vmovx.f16 s2, s1
 ; CHECK-NEXT:    vins.f16 s0, s2
 ; CHECK-NEXT:    vmovx.f16 s2, s4
 ; CHECK-NEXT:    vins.f16 s3, s2
 ; CHECK-NEXT:    vmovx.f16 s2, s7
-; CHECK-NEXT:    vmovx.f16 s4, s10
-; CHECK-NEXT:    vmovx.f16 s14, s6
-; CHECK-NEXT:    vmovx.f16 s15, s9
 ; CHECK-NEXT:    vins.f16 s6, s2
-; CHECK-NEXT:    vins.f16 s9, s4
+; CHECK-NEXT:    vmovx.f16 s2, s10
+; CHECK-NEXT:    vins.f16 s9, s2
 ; CHECK-NEXT:    vmov.f32 s1, s3
 ; CHECK-NEXT:    vins.f16 s14, s8
 ; CHECK-NEXT:    vins.f16 s15, s11
@@ -416,8 +416,8 @@ define arm_aapcs_vfpcc <8 x i16> @shuffle4step_i16(<32 x i16> %src) {
 ; CHECK-NEXT:    vmov.f32 s20, s1
 ; CHECK-NEXT:    vmovx.f16 s1, s10
 ; CHECK-NEXT:    vmov.f32 s22, s9
-; CHECK-NEXT:    vmov.f32 s23, s13
 ; CHECK-NEXT:    vmov.f32 s21, s5
+; CHECK-NEXT:    vmov.f32 s23, s13
 ; CHECK-NEXT:    vadd.i16 q4, q5, q4
 ; CHECK-NEXT:    vmovx.f16 s22, s8
 ; CHECK-NEXT:    vins.f16 s22, s1
@@ -434,9 +434,9 @@ define arm_aapcs_vfpcc <8 x i16> @shuffle4step_i16(<32 x i16> %src) {
 ; CHECK-NEXT:    vins.f16 s4, s6
 ; CHECK-NEXT:    vins.f16 s21, s1
 ; CHECK-NEXT:    vins.f16 s0, s2
-; CHECK-NEXT:    vmov.f32 s3, s12
 ; CHECK-NEXT:    vmov.f32 s1, s4
 ; CHECK-NEXT:    vmov.f32 s2, s8
+; CHECK-NEXT:    vmov.f32 s3, s12
 ; CHECK-NEXT:    vadd.i16 q0, q0, q5
 ; CHECK-NEXT:    vadd.i16 q0, q0, q4
 ; CHECK-NEXT:    vpop {d8, d9, d10, d11}
@@ -1147,15 +1147,15 @@ entry:
 define arm_aapcs_vfpcc <8 x half> @shuffle3_f16(<8 x half> %src) {
 ; CHECK-LABEL: shuffle3_f16:
 ; CHECK:       @ %bb.0: @ %entry
-; CHECK-NEXT:    vmov q1, q0
-; CHECK-NEXT:    vmovx.f16 s2, s5
-; CHECK-NEXT:    vmovx.f16 s0, s4
-; CHECK-NEXT:    vins.f16 s5, s4
-; CHECK-NEXT:    vins.f16 s2, s0
-; CHECK-NEXT:    vmov.f32 s3, s5
-; CHECK-NEXT:    vmovx.f16 s1, s7
-; CHECK-NEXT:    vmov.f32 s0, s6
-; CHECK-NEXT:    vins.f16 s1, s7
+; CHECK-NEXT:    vmovx.f16 s5, s3
+; CHECK-NEXT:    vmovx.f16 s6, s1
+; CHECK-NEXT:    vmovx.f16 s4, s0
+; CHECK-NEXT:    vins.f16 s1, s0
+; CHECK-NEXT:    vins.f16 s6, s4
+; CHECK-NEXT:    vins.f16 s5, s3
+; CHECK-NEXT:    vmov.f32 s4, s2
+; CHECK-NEXT:    vmov.f32 s7, s1
+; CHECK-NEXT:    vmov q0, q1
 ; CHECK-NEXT:    bx lr
 entry:
   %out = shufflevector <8 x half> %src, <8 x half> undef, <8 x i32> <i32 4, i32 5, i32 7, i32 6, i32 3, i32 1, i32 2, i32 0>
@@ -1348,12 +1348,12 @@ define arm_aapcs_vfpcc <8 x half> @shuffle4step_f16(<32 x half> %src) {
 ; CHECKFP-NEXT:    vins.f16 s0, s2
 ; CHECKFP-NEXT:    vmov.f32 s24, s1
 ; CHECKFP-NEXT:    vmov.f32 s26, s9
-; CHECKFP-NEXT:    vmov.f32 s27, s13
 ; CHECKFP-NEXT:    vmov.f32 s25, s5
+; CHECKFP-NEXT:    vmov.f32 s27, s13
 ; CHECKFP-NEXT:    vmov.f32 s2, s8
 ; CHECKFP-NEXT:    vadd.f16 q4, q6, q4
-; CHECKFP-NEXT:    vmov.f32 s3, s12
 ; CHECKFP-NEXT:    vmov.f32 s1, s4
+; CHECKFP-NEXT:    vmov.f32 s3, s12
 ; CHECKFP-NEXT:    vadd.f16 q0, q0, q5
 ; CHECKFP-NEXT:    vadd.f16 q0, q0, q4
 ; CHECKFP-NEXT:    vpop {d8, d9, d10, d11, d12, d13}
diff --git a/llvm/test/CodeGen/Thumb2/mve-vcvt16.ll b/llvm/test/CodeGen/Thumb2/mve-vcvt16.ll
index a5725a2a300483..cdaf4468340f73 100644
--- a/llvm/test/CodeGen/Thumb2/mve-vcvt16.ll
+++ b/llvm/test/CodeGen/Thumb2/mve-vcvt16.ll
@@ -327,11 +327,12 @@ define arm_aapcs_vfpcc <4 x float> @load_ext_4(ptr %src) {
 ; CHECK-MVE:       @ %bb.0: @ %entry
 ; CHECK-MVE-NEXT:    ldrd r0, r1, [r0]
 ; CHECK-MVE-NEXT:    vmov.32 q0[0], r0
-; CHECK-MVE-NEXT:    vmov.32 q0[1], r1
-; CHECK-MVE-NEXT:    vcvtt.f32.f16 s3, s1
-; CHECK-MVE-NEXT:    vcvtb.f32.f16 s2, s1
+; CHECK-MVE-NEXT:    vmov q1, q0
 ; CHECK-MVE-NEXT:    vcvtt.f32.f16 s1, s0
+; CHECK-MVE-NEXT:    vmov.32 q1[1], r1
 ; CHECK-MVE-NEXT:    vcvtb.f32.f16 s0, s0
+; CHECK-MVE-NEXT:    vcvtt.f32.f16 s3, s5
+; CHECK-MVE-NEXT:    vcvtb.f32.f16 s2, s5
 ; CHECK-MVE-NEXT:    bx lr
 ;
 ; CHECK-MVEFP-LABEL: load_ext_4:
diff --git a/llvm/test/CodeGen/Thumb2/mve-vld2.ll b/llvm/test/CodeGen/Thumb2/mve-vld2.ll
index 633aef46bbffcc..36a035d56bcfa6 100644
--- a/llvm/test/CodeGen/Thumb2/mve-vld2.ll
+++ b/llvm/test/CodeGen/Thumb2/mve-vld2.ll
@@ -514,11 +514,12 @@ define void @vld2_v2f16(ptr %src, ptr %dst) {
 ; CHECK-NEXT:    ldr r2, [r0]
 ; CHECK-NEXT:    ldr r0, [r0, #4]
 ; CHECK-NEXT:    vmov.32 q0[0], r2
-; CHECK-NEXT:    vmov.32 q0[1], r0
+; CHECK-NEXT:    vmov q1, q0
+; CHECK-NEXT:    vmov.32 q1[1], r0
 ; CHECK-NEXT:    vmovx.f16 s4, s0
-; CHECK-NEXT:    vmovx.f16 s2, s1
+; CHECK-NEXT:    vmovx.f16 s2, s5
 ; CHECK-NEXT:    vins.f16 s4, s2
-; CHECK-NEXT:    vins.f16 s0, s1
+; CHECK-NEXT:    vins.f16 s0, s5
 ; CHECK-NEXT:    vadd.f16 q0, q0, q1
 ; CHECK-NEXT:    vmov r0, s0
 ; CHECK-NEXT:    str r0, [r1]
diff --git a/llvm/test/CodeGen/Thumb2/mve-vld3.ll b/llvm/test/CodeGen/Thumb2/mve-vld3.ll
index ccdc996d75970e..b207ce7bdefd1b 100644
--- a/llvm/test/CodeGen/Thumb2/mve-vld3.ll
+++ b/llvm/test/CodeGen/Thumb2/mve-vld3.ll
@@ -313,11 +313,11 @@ define void @vld3_v8i16(ptr %src, ptr %dst) {
 ; CHECK-NEXT:    vmovx.f16 s2, s8
 ; CHECK-NEXT:    vins.f16 s3, s2
 ; CHECK-NEXT:    vmovx.f16 s2, s11
-; CHECK-NEXT:    vmovx.f16 s8, s14
 ; CHECK-NEXT:    vmovx.f16 s18, s10
-; CHECK-NEXT:    vmovx.f16 s19, s13
 ; CHECK-NEXT:    vins.f16 s10, s2
-; CHECK-NEXT:    vins.f16 s13, s8
+; CHECK-NEXT:    vmovx.f16 s2, s14
+; CHECK-NEXT:    vmovx.f16 s19, s13
+; CHECK-NEXT:    vins.f16 s13, s2
 ; CHECK-NEXT:    vmov.f32 s1, s3
 ; CHECK-NEXT:    vins.f16 s18, s12
 ; CHECK-NEXT:    vins.f16 s19, s15
@@ -368,55 +368,55 @@ define void @vld3_v16i16(ptr %src, ptr %dst) {
 ; CHECK-LV-NEXT:    vmovx.f16 s2, s8
 ; CHECK-LV-NEXT:    vins.f16 s3, s2
 ; CHECK-LV-NEXT:    vmovx.f16 s2, s11
-; CHECK-LV-NEXT:    vmovx.f16 s8, s14
 ; CHECK-LV-NEXT:    vmovx.f16 s18, s10
-; CHECK-LV-NEXT:    vmovx.f16 s19, s13
 ; CHECK-LV-NEXT:    vins.f16 s10, s2
-; CHECK-LV-NEXT:    vins.f16 s13, s8
+; CHECK-LV-NEXT:    vmovx.f16 s2, s14
+; CHECK-LV-NEXT:    vmovx.f16 s19, s13
+; CHECK-LV-NEXT:    vins.f16 s13, s2
 ; CHECK-LV-NEXT:    vmov.f32 s1, s3
 ; CHECK-LV-NEXT:    vins.f16 s18, s12
 ; CHECK-LV-NEXT:    vins.f16 s19, s15
-; CHECK-LV-NEXT:    vmov.f32 s3, s13
-; CHECK-LV-NEXT:    vldrw.u32 q3, [r0, #16]
 ; CHECK-LV-NEXT:    vins.f16 s17, s9
 ; CHECK-LV-NEXT:    vmov.f32 s2, s10
+; CHECK-LV-NEXT:    vmov.f32 s3, s13
+; CHECK-LV-NEXT:    vldrw.u32 q2, [r0, #16]
 ; CHECK-LV-NEXT:    vadd.i16 q0, q0, q4
-; CHECK-LV-NEXT:    vldrw.u32 q2, [r0, #32]
-; CHECK-LV-NEXT:    vadd.i16 q0, q0, q1
-; CHECK-LV-NEXT:    vmovx.f16 s6, s14
 ; CHECK-LV-NEXT:    vldrw.u32 q4, [r0]
-; CHECK-LV-NEXT:    vins.f16 s6, s8
-; CHECK-LV-NEXT:    vmov.f32 s22, s15
-; CHECK-LV-NEXT:    vmovx.f16 s8, s8
-; CHECK-LV-NEXT:    vins.f16 s22, s8
-; CHECK-LV-NEXT:    vmovx.f16 s8, s11
-; CHECK-LV-NEXT:    vmov.f32 s23, s10
-; CHECK-LV-NEXT:    vmovx.f16 s4, s16
-; CHECK-LV-NEXT:    vins.f16 s23, s8
-; CHECK-LV-NEXT:    vmovx.f16 s8, s17
-; CHECK-LV-NEXT:    vins.f16 s16, s8
-; CHECK-LV-NEXT:    vmovx.f16 s8, s12
-; CHECK-LV-NEXT:    vmovx.f16 s5, s19
-; CHECK-LV-NEXT:    vins.f16 s19, s8
-; CHECK-LV-NEXT:    vmovx.f16 s8, s15
-; CHECK-LV-NEXT:    vmovx.f16 s7, s9
-; CHECK-LV-NEXT:    vins.f16 s14, s8
-; CHECK-LV-NEXT:    vmovx.f16 s8, s10
-; CHECK-LV-NEXT:    vins.f16 s4, s18
+; CHECK-LV-NEXT:    vadd.i16 q0, q0, q1
+; CHECK-LV-NEXT:    vldrw.u32 q1, [r0, #32]
+; CHECK-LV-NEXT:    vmovx.f16 s14, s10
+; CHECK-LV-NEXT:    vmov.f32 s22, s11
+; CHECK-LV-NEXT:    vins.f16 s14, s4
+; CHECK-LV-NEXT:    vmovx.f16 s4, s4
+; CHECK-LV-NEXT:    vins.f16 s22, s4
+; CHECK-LV-NEXT:    vmovx.f16 s4, s7
+; CHECK-LV-NEXT:    vmov.f32 s23, s6
+; CHECK-LV-NEXT:    vmovx.f16 s12, s16
+; CHECK-LV-NEXT:    vins.f16 s23, s4
+; CHECK-LV-NEXT:    vmovx.f16 s4, s17
+; CHECK-LV-NEXT:    vins.f16 s16, s4
+; CHECK-LV-NEXT:    vmovx.f16 s4, s8
+; CHECK-LV-NEXT:    vmovx.f16 s13, s19
+; CHECK-LV-NEXT:    vins.f16 s19, s4
+; CHECK-LV-NEXT:    vmovx.f16 s4, s11
+; CHECK-LV-NEXT:    vmovx.f16 s15, s5
+; CHECK-LV-NEXT:    vins.f16 s10, s4
+; CHECK-LV-NEXT:    vmovx.f16 s4, s6
+; CHECK-LV-NEXT:    vins.f16 s12, s18
 ; CHECK-LV-NEXT:    vmov.f32 s20, s17
 ; CHECK-LV-NEXT:    vmovx.f16 s18, s18
-; CHECK-LV-NEXT:    vins.f16 s9, s8
-; CHECK-LV-NEXT:    vins.f16 s5, s13
+; CHECK-LV-NEXT:    vins.f16 s5, s4
+; CHECK-LV-NEXT:    vins.f16 s13, s9
 ; CHECK-LV-NEXT:    vins.f16 s20, s18
 ; CHECK-LV-NEXT:    vmov.f32 s17, s19
-; CHECK-LV-NEXT:    vins.f16 s7, s11
-; CHECK-LV-NEXT:    vmovx.f16 s13, s13
-; CHECK-LV-NEXT:    vmov.f32 s21, s12
-; CHECK-LV-NEXT:    vmov.f32 s18, s14
-; CHECK-LV-NEXT:    vins.f16 s21, s13
-; CHECK-LV-NEXT:    vmov.f32 s19, s9
+; CHECK-LV-NEXT:    vins.f16 s15, s7
+; CHECK-LV-NEXT:    vmovx.f16 s9, s9
+; CHECK-LV-NEXT:    vmov.f32 s21, s8
+; CHECK-LV-NEXT:    vmov.f32 s18, s10
+; CHECK-LV-NEXT:    vins.f16 s21, s9
+; CHECK-LV-NEXT:    vmov.f32 s19, s5
 ; CHECK-LV-NEXT:    vstrw.32 q0, [r1, #16]
-; CHECK-LV-NEXT:    vadd.i16 q1, q4, q1
+; CHECK-LV-NEXT:    vadd.i16 q1, q4, q3
 ; CHECK-LV-NEXT:    vadd.i16 q1, q1, q5
 ; CHECK-LV-NEXT:    vstrw.32 q1, [r1]
 ; CHECK-LV-NEXT:    vpop {d8, d9, d10, d11}
@@ -449,55 +449,55 @@ define void @vld3_v16i16(ptr %src, ptr %dst) {
 ; CHECK-LIS-NEXT:    vmovx.f16 s2, s8
 ; CHECK-LIS-NEXT:    vins.f16 s3, s2
 ; CHECK-LIS-NEXT:    vmovx.f16 s2, s11
-; CHECK-LIS-NEXT:    vmovx.f16 s8, s14
 ; CHECK-LIS-NEXT:    vmovx.f16 s18, s10
-; CHECK-LIS-NEXT:    vmovx.f16 s19, s13
 ; CHECK-LIS-NEXT:    vins.f16 s10, s2
-; CHECK-LIS-NEXT:    vins.f16 s13, s8
+; CHECK-LIS-NEXT:    vmovx.f16 s2, s14
+; CHECK-LIS-NEXT:    vmovx.f16 s19, s13
+; CHECK-LIS-NEXT:    vins.f16 s13, s2
 ; CHECK-LIS-NEXT:    vmov.f32 s1, s3
 ; CHECK-LIS-NEXT:    vins.f16 s18, s12
 ; CHECK-LIS-NEXT:    vins.f16 s19, s15
 ; CHECK-LIS-NEXT:    vmov.f32 s3, s13
 ; CHECK-LIS-NEXT:    vins.f16 s17, s9
 ; CHECK-LIS-NEXT:    vmov.f32 s2, s10
-; CHECK-LIS-NEXT:    vldrw.u32 q2, [r0, #32]
+; CHECK-LIS-NEXT:    vldrw.u32 q3, [r0, #16]
 ; CHECK-LIS-NEXT:    vadd.i16 q0, q0, q4
-; CHECK-LIS-NEXT:    vldrw.u32 q4, [r0, #16]
+; CHECK-LIS-NEXT:    vldrw.u32 q4, [r0]
 ; CHECK-LIS-NEXT:    vadd.i16 q0, q0, q1
-; CHECK-LIS-NEXT:    vldrw.u32 q3, [r0]
-; CHECK-LIS-NEXT:    vmovx.f16 s6, s18
-; CHECK-LIS-NEXT:    vmov.f32 s22, s19
-; CHECK-LIS-NEXT:    vins.f16 s6, s8
-; CHECK-LIS-NEXT:    vmovx.f16 s8, s8
-; CHECK-LIS-NEXT:    vins.f16 s22, s8
-; CHECK-LIS-NEXT:    vmovx.f16 s8, s11
-; CHECK-LIS-NEXT:    vmov.f32 s23, s10
-; CHECK-LIS-NEXT:    vmovx.f16 s4, s12
-; CHECK-LIS-NEXT:    vins.f16 s23, s8
-; CHECK-LIS-NEXT:    vmovx.f16 s8, s13
-; CHECK-LIS-NEXT:    vins.f16 s12, s8
+; CHECK-LIS-NEXT:    vldrw.u32 q1, [r0, #32]
+; CHECK-LIS-NEXT:    vmovx.f16 s10, s14
+; CHECK-LIS-NEXT:    vmov.f32 s22, s15
+; CHECK-LIS-NEXT:    vins.f16 s10, s4
+; CHECK-LIS-NEXT:    vmovx.f16 s4, s4
+; CHECK-LIS-NEXT:    vins.f16 s22, s4
+; CHECK-LIS-NEXT:    vmovx.f16 s4, s7
+; CHECK-LIS-NEXT:    vmov.f32 s23, s6
 ; CHECK-LIS-NEXT:    vmovx.f16 s8, s16
-; CHECK-LIS-NEXT:    vmovx.f16 s5, s15
-; CHECK-LIS-NEXT:    vins.f16 s15, s8
-; CHECK-LIS-NEXT:    vmovx.f16 s8, s19
-; CHECK-LIS-NEXT:    vins.f16 s4, s14
-; CHECK-LIS-NEXT:    vmov.f32 s20, s13
-; CHECK-LIS-NEXT:    vmovx.f16 s14, s14
-; CHECK-LIS-NEXT:    vins.f16 s18, s8
-; CHECK-LIS-NEXT:    vmovx.f16 s8, s10
-; CHECK-LIS-NEXT:    vmovx.f16 s7, s9
-; CHECK-LIS-NEXT:    vins.f16 s20, s14
-; CHECK-LIS-NEXT:    vmovx.f16 s14, s17
-; CHECK-LIS-NEXT:    vmov.f32 s21, s16
-; CHECK-LIS-NEXT:    vins.f16 s9, s8
-; CHECK-LIS-NEXT:    vins.f16 s21, s14
-; CHECK-LIS-NEXT:    vmov.f32 s13, s15
-; CHECK-LIS-NEXT:    vins.f16 s7, s11
-; CHECK-LIS-NEXT:    vins.f16 s5, s17
-; CHECK-LIS-NEXT:    vmov.f32 s14, s18
-; CHECK-LIS-NEXT:    vmov.f32 s15, s9
+; CHECK-LIS-NEXT:    vins.f16 s23, s4
+; CHECK-LIS-NEXT:    vmovx.f16 s4, s17
+; CHECK-LIS-NEXT:    vins.f16 s16, s4
+; CHECK-LIS-NEXT:    vmovx.f16 s4, s12
+; CHECK-LIS-NEXT:    vmovx.f16 s9, s19
+; CHECK-LIS-NEXT:    vins.f16 s19, s4
+; CHECK-LIS-NEXT:    vmovx.f16 s4, s15
+; CHECK-LIS-NEXT:    vmovx.f16 s11, s5
+; CHECK-LIS-NEXT:    vins.f16 s14, s4
+; CHECK-LIS-NEXT:    vmovx.f16 s4, s6
+; CHECK-LIS-NEXT:    vins.f16 s8, s18
+; CHECK-LIS-NEXT:    vmov.f32 s20, s17
+; CHECK-LIS-NEXT:    vmovx.f16 s18, s18
+; CHECK-LIS-NEXT:    vins.f16 s5, s4
+; CHECK-LIS-NEXT:    vins.f16 s9, s13
+; CHECK-LIS-NEXT:    vins.f16 s20, s18
+; CHECK-LIS-NEXT:    vmov.f32 s17, s19
+; CHECK-LIS-NEXT:    vins.f16 s11, s7
+; CHECK-LIS-NEXT:    vmovx.f16 s13, s13
+; CHECK-LIS-NEXT:    vmov.f32 s21, s12
+; CHECK-LIS-NEXT:    vmov.f32 s18, s14
+; CHECK-LIS-NEXT:    vins.f16 s21, s13
+; CHECK-LIS-NEXT:    vmov.f32 s19, s5
 ; CHECK-LIS-NEXT:    vstrw.32 q0, [r1, #16]
-; CHECK-LIS-NEXT:    vadd.i16 q1, q3, q1
+; CHECK-LIS-NEXT:    vadd.i16 q1, q4, q2
 ; CHECK-LIS-NEXT:    vadd.i16 q1, q1, q5
 ; CHECK-LIS-NEXT:    vstrw.32 q1, [r1]
 ; CHECK-LIS-NEXT:    vpop {d8, d9, d10, d11}
@@ -1194,24 +1194,25 @@ define void @vld3_v4f16(ptr %src, ptr %dst) {
 ; CHECK:       @ %bb.0: @ %entry
 ; CHECK-NEXT:    ldrd r2, r3, [r0, #16]
 ; CHECK-NEXT:    vldrw.u32 q1, [r0]
-; CHECK-NEXT:    vmov.32 q2[0], r2
-; CHECK-NEXT:    vmovx.f16 s12, s4
-; CHECK-NEXT:    vmov.32 q2[1], r3
-; CHECK-NEXT:    vmovx.f16 s13, s7
-; CHECK-NEXT:    vmovx.f16 s0, s9
-; CHECK-NEXT:    vmov.f32 s1, s8
-; CHECK-NEXT:    vins.f16 s1, s0
-; CHECK-NEXT:    vmovx.f16 s0, s5
-; CHECK-NEXT:    vins.f16 s4, s0
-; CHECK-NEXT:    vmovx.f16 s0, s6
-; CHECK-NEXT:    vins.f16 s5, s0
-; CHECK-NEXT:    vmovx.f16 s0, s8
+; CHECK-NEXT:    vmov.32 q0[0], r2
+; CHECK-NEXT:    vmovx.f16 s8, s4
+; CHECK-NEXT:    vmov q3, q0
+; CHECK-NEXT:    vmov.f32 s1, s0
+; CHECK-NEXT:    vmov.32 q3[1], r3
+; CHECK-NEXT:    vmovx.f16 s0, s0
+; CHECK-NEXT:    vmovx.f16 s2, s13
+; CHECK-NEXT:    vmovx.f16 s9, s7
+; CHECK-NEXT:    vins.f16 s1, s2
+; CHECK-NEXT:    vmovx.f16 s2, s5
+; CHECK-NEXT:    vins.f16 s4, s2
+; CHECK-NEXT:    vmovx.f16 s2, s6
+; CHECK-NEXT:    vins.f16 s5, s2
 ; CHECK-NEXT:    vins.f16 s7, s0
 ; CHECK-NEXT:    vmov.f32 s0, s5
-; CHECK-NEXT:    vins.f16 s12, s6
-; CHECK-NEXT:    vins.f16 s13, s9
+; CHECK-NEXT:    vins.f16 s8, s6
+; CHECK-NEXT:    vins.f16 s9, s13
 ; CHECK-NEXT:    vmov.f32 s5, s7
-; CHECK-NEXT:    vadd.f16 q1, q1, q3
+; CHECK-NEXT:    vadd.f16 q1, q1, q2
 ; CHECK-NEXT:    vadd.f16 q0, q1, q0
 ; CHECK-NEXT:    vmov r0, r2, d0
 ; CHECK-NEXT:    strd r0, r2, [r1]
@@ -1228,93 +1229,49 @@ entry:
 }
 
 define void @vld3_v8f16(ptr %src, ptr %dst) {
-; CHECK-LV-LABEL: vld3_v8f16:
-; CHECK-LV:       @ %bb.0: @ %entry
-; CHECK-LV-NEXT:    .vsave {d8, d9}
-; CHECK-LV-NEXT:    vpush {d8, d9}
-; CHECK-LV-NEXT:    vldrw.u32 q2, [r0, #16]
-; CHECK-LV-NEXT:    vldrw.u32 q0, [r0]
-; CHECK-LV-NEXT:    vldrw.u32 q3, [r0, #32]
-; CHECK-LV-NEXT:    vmov.f32 s5, s8
-; CHECK-LV-NEXT:    vmovx.f16 s8, s8
-; CHECK-LV-NEXT:    vmovx.f16 s17, s3
-; CHECK-LV-NEXT:    vins.f16 s3, s8
-; CHECK-LV-NEXT:    vmovx.f16 s8, s11
-; CHECK-LV-NEXT:    vmovx.f16 s18, s10
-; CHECK-LV-NEXT:    vmovx.f16 s16, s0
-; CHECK-LV-NEXT:    vins.f16 s10, s8
-; CHECK-LV-NEXT:    vmovx.f16 s6, s2
-; CHECK-LV-NEXT:    vmov.f32 s4, s1
-; CHECK-LV-NEXT:    vmovx.f16 s8, s14
-; CHECK-LV-NEXT:    vmovx.f16 s19, s13
-; CHECK-LV-NEXT:    vins.f16 s4, s6
-; CHECK-LV-NEXT:    vmovx.f16 s6, s9
-; CHECK-LV-NEXT:    vins.f16 s16, s2
-; CHECK-LV-NEXT:    vmovx.f16 s2, s15
-; CHECK-LV-NEXT:    vmovx.f16 s7, s12
-; CHECK-LV-NEXT:    vins.f16 s18, s12
-; CHECK-LV-NEXT:    vmovx.f16 s12, s1
-; CHECK-LV-NEXT:    vins.f16 s13, s8
-; CHECK-LV-NEXT:    vins.f16 s5, s6
-; CHECK-LV-NEXT:    vmov.f32 s6, s11
-; CHECK-LV-NEXT:    vins.f16 s14, s2
-; CHECK-LV-NEXT:    vmov.f32 s1, s3
-; CHECK-LV-NEXT:    vins.f16 s19, s15
-; CHECK-LV-NEXT:    vins.f16 s17, s9
-; CHECK-LV-NEXT:    vins.f16 s0, s12
-; CHECK-LV-NEXT:    vmov.f32 s2, s10
-; CHECK-LV-NEXT:    vmov.f32 s3, s13
-; CHECK-LV-NEXT:    vins.f16 s6, s7
-; CHECK-LV-NEXT:    vmov.f32 s7, s14
-; CHECK-LV-NEXT:    vadd.f16 q0, q0, q4
-; CHECK-LV-NEXT:    vadd.f16 q0, q0, q1
-; CHECK-LV-NEXT:    vstrw.32 q0, [r1]
-; CHECK-LV-NEXT:    vpop {d8, d9}
-; CHECK-LV-NEXT:    bx lr
-;
-; CHECK-LIS-LABEL: vld3_v8f16:
-; CHECK-LIS:       @ %bb.0: @ %entry
-; CHECK-LIS-NEXT:    .vsave {d8, d9}
-; CHECK-LIS-NEXT:    vpush {d8, d9}
-; CHECK-LIS-NEXT:    vldrw.u32 q0, [r0]
-; CHECK-LIS-NEXT:    vldrw.u32 q2, [r0, #16]
-; CHECK-LIS-NEXT:    vldrw.u32 q4, [r0, #32]
-; CHECK-LIS-NEXT:    vmov.f32 s4, s1
-; CHECK-LIS-NEXT:    vmovx.f16 s6, s2
-; CHECK-LIS-NEXT:    vins.f16 s4, s6
-; CHECK-LIS-NEXT:    vmov.f32 s5, s8
-; CHECK-LIS-NEXT:    vmovx.f16 s6, s9
-; CHECK-LIS-NEXT:    vmovx.f16 s8, s8
-; CHECK-LIS-NEXT:    vmovx.f16 s13, s3
-; CHECK-LIS-NEXT:    vins.f16 s5, s6
-; CHECK-LIS-NEXT:    vins.f16 s3, s8
-; CHECK-LIS-NEXT:    vmov.f32 s6, s11
-; CHECK-LIS-NEXT:    vmovx.f16 s12, s16
-; CHECK-LIS-NEXT:    vmovx.f16 s8, s11
-; CHECK-LIS-NEXT:    vmovx.f16 s14, s10
-; CHECK-LIS-NEXT:    vins.f16 s6, s12
-; CHECK-LIS-NEXT:    vmovx.f16 s12, s0
-; CHECK-LIS-NEXT:    vins.f16 s10, s8
-; CHECK-LIS-NEXT:    vmovx.f16 s8, s18
-; CHECK-LIS-NEXT:    vmovx.f16 s15, s17
-; CHECK-LIS-NEXT:    vins.f16 s12, s2
-; CHECK-LIS-NEXT:    vmovx.f16 s2, s19
-; CHECK-LIS-NEXT:    vmovx.f16 s1, s1
-; CHECK-LIS-NEXT:    vins.f16 s17, s8
-; CHECK-LIS-NEXT:    vins.f16 s18, s2
-; CHECK-LIS-NEXT:    vins.f16 s0, s1
-; CHECK-LIS-NEXT:    vmov.f32 s1, s3
-; CHECK-LIS-NEXT:    vins.f16 s14, s16
-; CHECK-LIS-NEXT:    vins.f16 s15, s19
-; CHECK-LIS-NEXT:    vins.f16 s13, s9
-; CHECK-LIS-NEXT:    vmov.f32 s2, s10
-; CHECK-LIS-NEXT:    vmov.f32 s3, s17
-; CHECK-LIS-NEXT:    vmov.f32 s7, s18
-; CHECK-LIS-NEXT:    vadd.f16 q0, q0, q3
-; CHECK-LIS-NEXT:    vadd.f16 q0, q0, q1
-; CHECK-LIS-NEXT:    vstrw.32 q0, [r1]
-; CHECK-LIS-NEXT:    vpop {d8, d9}
-; CHECK-LIS-NEXT:    bx lr
+; CHECK-LABEL: vld3_v8f16:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    .vsave {d8, d9}
+; CHECK-NEXT:    vpush {d8, d9}
+; CHECK-NEXT:    vldrw.u32 q2, [r0, #16]
+; CHECK-NEXT:    vldrw.u32 q0, [r0]
+; CHECK-NEXT:    vldrw.u32 q3, [r0, #32]
+; CHECK-NEXT:    vmov.f32 s5, s8
+; CHECK-NEXT:    vmovx.f16 s8, s8
+; CHECK-NEXT:    vmovx.f16 s17, s3
+; CHECK-NEXT:    vins.f16 s3, s8
+; CHECK-NEXT:    vmovx.f16 s8, s11
+; CHECK-NEXT:    vmovx.f16 s18, s10
+; CHECK-NEXT:    vmovx.f16 s16, s0
+; CHECK-NEXT:    vins.f16 s10, s8
+; CHECK-NEXT:    vmovx.f16 s6, s2
+; CHECK-NEXT:    vmov.f32 s4, s1
+; CHECK-NEXT:    vmovx.f16 s8, s14
+; CHECK-NEXT:    vmovx.f16 s19, s13
+; CHECK-NEXT:    vins.f16 s4, s6
+; CHECK-NEXT:    vmovx.f16 s6, s9
+; CHECK-NEXT:    vins.f16 s16, s2
+; CHECK-NEXT:    vmovx.f16 s2, s15
+; CHECK-NEXT:    vmovx.f16 s7, s12
+; CHECK-NEXT:    vins.f16 s18, s12
+; CHECK-NEXT:    vmovx.f16 s12, s1
+; CHECK-NEXT:    vins.f16 s13, s8
+; CHECK-NEXT:    vins.f16 s5, s6
+; CHECK-NEXT:    vmov.f32 s6, s11
+; CHECK-NEXT:    vins.f16 s14, s2
+; CHECK-NEXT:    vmov.f32 s1, s3
+; CHECK-NEXT:    vins.f16 s19, s15
+; CHECK-NEXT:    vins.f16 s17, s9
+; CHECK-NEXT:    vins.f16 s0, s12
+; CHECK-NEXT:    vmov.f32 s2, s10
+; CHECK-NEXT:    vmov.f32 s3, s13
+; CHECK-NEXT:    vins.f16 s6, s7
+; CHECK-NEXT:    vmov.f32 s7, s14
+; CHECK-NEXT:    vadd.f16 q0, q0, q4
+; CHECK-NEXT:    vadd.f16 q0, q0, q1
+; CHECK-NEXT:    vstrw.32 q0, [r1]
+; CHECK-NEXT:    vpop {d8, d9}
+; CHECK-NEXT:    bx lr
 entry:
   %l1 = load <24 x half>, ptr %src, align 4
   %s1 = shufflevector <24 x half> %l1, <24 x half> undef, <8 x i32> <i32 0, i32 3, i32 6, i32 9, i32 12, i32 15, i32 18, i32 21>
@@ -1327,167 +1284,86 @@ entry:
 }
 
 define void @vld3_v16f16(ptr %src, ptr %dst) {
-; CHECK-LV-LABEL: vld3_v16f16:
-; CHECK-LV:       @ %bb.0: @ %entry
-; CHECK-LV-NEXT:    .vsave {d8, d9}
-; CHECK-LV-NEXT:    vpush {d8, d9}
-; CHECK-LV-NEXT:    vldrw.u32 q0, [r0, #48]
-; CHECK-LV-NEXT:    vldrw.u32 q2, [r0, #64]
-; CHECK-LV-NEXT:    vldrw.u32 q3, [r0, #80]
-; CHECK-LV-NEXT:    vmovx.f16 s6, s2
-; CHECK-LV-NEXT:    vmov.f32 s4, s1
-; CHECK-LV-NEXT:    vins.f16 s4, s6
-; CHECK-LV-NEXT:    vmovx.f16 s6, s9
-; CHECK-LV-NEXT:    vmov.f32 s5, s8
-; CHECK-LV-NEXT:    vmovx.f16 s7, s12
-; CHECK-LV-NEXT:    vins.f16 s5, s6
-; CHECK-LV-NEXT:    vmov.f32 s6, s11
-; CHECK-LV-NEXT:    vins.f16 s6, s7
-; CHECK-LV-NEXT:    vmovx.f16 s16, s15
-; CHECK-LV-NEXT:    vmov.f32 s7, s14
-; CHECK-LV-NEXT:    vmovx.f16 s17, s3
-; CHECK-LV-NEXT:    vins.f16 s7, s16
-; CHECK-LV-NEXT:    vmovx.f16 s16, s0
-; CHECK-LV-NEXT:    vins.f16 s16, s2
-; CHECK-LV-NEXT:    vmovx.f16 s2, s1
-; CHECK-LV-NEXT:    vins.f16 s0, s2
-; CHECK-LV-NEXT:    vmovx.f16 s2, s8
-; CHECK-LV-NEXT:    vins.f16 s3, s2
-; CHECK-LV-NEXT:    vmovx.f16 s2, s11
-; CHECK-LV-NEXT:    vmovx.f16 s18, s10
-; CHECK-LV-NEXT:    vins.f16 s10, s2
-; CHECK-LV-NEXT:    vmovx.f16 s2, s14
-; CHECK-LV-NEXT:    vmovx.f16 s19, s13
-; CHECK-LV-NEXT:    vins.f16 s13, s2
-; CHECK-LV-NEXT:    vmov.f32 s1, s3
-; CHECK-LV-NEXT:    vins.f16 s18, s12
-; CHECK-LV-NEXT:    vins.f16 s19, s15
-; CHECK-LV-NEXT:    vmov.f32 s3, s13
-; CHECK-LV-NEXT:    vins.f16 s17, s9
-; CHECK-LV-NEXT:    vmov.f32 s2, s10
-; CHECK-LV-NEXT:    vldrw.u32 q3, [r0, #16]
-; CHECK-LV-NEXT:    vadd.f16 q0, q0, q4
-; CHECK-LV-NEXT:    vadd.f16 q2, q0, q1
-; CHECK-LV-NEXT:    vldrw.u32 q0, [r0]
-; CHECK-LV-NEXT:    vldrw.u32 q1, [r0, #32]
-; CHECK-LV-NEXT:    vstrw.32 q2, [r1, #16]
-; CHECK-LV-NEXT:    vmovx.f16 s10, s2
-; CHECK-LV-NEXT:    vmov.f32 s8, s1
-; CHECK-LV-NEXT:    vins.f16 s8, s10
-; CHECK-LV-NEXT:    vmovx.f16 s10, s13
-; CHECK-LV-NEXT:    vmov.f32 s9, s12
-; CHECK-LV-NEXT:    vmovx.f16 s11, s4
-; CHECK-LV-NEXT:    vins.f16 s9, s10
-; CHECK-LV-NEXT:    vmov.f32 s10, s15
-; CHECK-LV-NEXT:    vins.f16 s10, s11
-; CHECK-LV-NEXT:    vmovx.f16 s16, s7
-; CHECK-LV-NEXT:    vmov.f32 s11, s6
-; CHECK-LV-NEXT:    vmovx.f16 s17, s3
-; CHECK-LV-NEXT:    vins.f16 s11, s16
-; CHECK-LV-NEXT:    vmovx.f16 s16, s0
-; CHECK-LV-NEXT:    vins.f16 s16, s2
-; CHECK-LV-NEXT:    vmovx.f16 s2, s1
-; CHECK-LV-NEXT:    vins.f16 s0, s2
-; CHECK-LV-NEXT:    vmovx.f16 s2, s12
-; CHECK-LV-NEXT:    vins.f16 s3, s2
-; CHECK-LV-NEXT:    vmovx.f16 s2, s15
-; CHECK-LV-NEXT:    vmovx.f16 s18, s14
-; CHECK-LV-NEXT:    vins.f16 s14, s2
-; CHECK-LV-NEXT:    vmovx.f16 s2, s6
-; CHECK-LV-NEXT:    vmovx.f16 s19, s5
-; CHECK-LV-NEXT:    vins.f16 s5, s2
-; CHECK-LV-NEXT:    vmov.f32 s1, s3
-; CHECK-LV-NEXT:    vins.f16 s18, s4
-; CHECK-LV-NEXT:    vins.f16 s19, s7
-; CHECK-LV-NEXT:    vins.f16 s17, s13
-; CHECK-LV-NEXT:    vmov.f32 s2, s14
-; CHECK-LV-NEXT:    vmov.f32 s3, s5
-; CHECK-LV-NEXT:    vadd.f16 q0, q0, q4
-; CHECK-LV-NEXT:    vadd.f16 q0, q0, q2
-; CHECK-LV-NEXT:    vstrw.32 q0, [r1]
-; CHECK-LV-NEXT:    vpop {d8, d9}
-; CHECK-LV-NEXT:    bx lr
-;
-; CHECK-LIS-LABEL: vld3_v16f16:
-; CHECK-LIS:       @ %bb.0: @ %entry
-; CHECK-LIS-NEXT:    .vsave {d8, d9}
-; CHECK-LIS-NEXT:    vpush {d8, d9}
-; CHECK-LIS-NEXT:    vldrw.u32 q0, [r0, #48]
-; CHECK-LIS-NEXT:    vldrw.u32 q2, [r0, #64]
-; CHECK-LIS-NEXT:    vldrw.u32 q3, [r0, #80]
-; CHECK-LIS-NEXT:    vmovx.f16 s6, s2
-; CHECK-LIS-NEXT:    vmov.f32 s4, s1
-; CHECK-LIS-NEXT:    vins.f16 s4, s6
-; CHECK-LIS-NEXT:    vmovx.f16 s6, s9
-; CHECK-LIS-NEXT:    vmov.f32 s5, s8
-; CHECK-LIS-NEXT:    vmovx.f16 s7, s12
-; CHECK-LIS-NEXT:    vins.f16 s5, s6
-; CHECK-LIS-NEXT:    vmov.f32 s6, s11
-; CHECK-LIS-NEXT:    vins.f16 s6, s7
-; CHECK-LIS-NEXT:    vmovx.f16 s16, s15
-; CHECK-LIS-NEXT:    vmov.f32 s7, s14
-; CHECK-LIS-NEXT:    vmovx.f16 s17, s3
-; CHECK-LIS-NEXT:    vins.f16 s7, s16
-; CHECK-LIS-NEXT:    vmovx.f16 s16, s0
-; CHECK-LIS-NEXT:    vins.f16 s16, s2
-; CHECK-LIS-NEXT:    vmovx.f16 s2, s1
-; CHECK-LIS-NEXT:    vins.f16 s0, s2
-; CHECK-LIS-NEXT:    vmovx.f16 s2, s8
-; CHECK-LIS-NEXT:    vins.f16 s3, s2
-; CHECK-LIS-NEXT:    vmovx.f16 s2, s11
-; CHECK-LIS-NEXT:    vmovx.f16 s18, s10
-; CHECK-LIS-NEXT:    vins.f16 s10, s2
-; CHECK-LIS-NEXT:    vmovx.f16 s2, s14
-; CHECK-LIS-NEXT:    vmovx.f16 s19, s13
-; CHECK-LIS-NEXT:    vins.f16 s13, s2
-; CHECK-LIS-NEXT:    vmov.f32 s1, s3
-; CHECK-LIS-NEXT:    vins.f16 s18, s12
-; CHECK-LIS-NEXT:    vins.f16 s19, s15
-; CHECK-LIS-NEXT:    vmov.f32 s3, s13
-; CHECK-LIS-NEXT:    vins.f16 s17, s9
-; CHECK-LIS-NEXT:    vmov.f32 s2, s10
-; CHECK-LIS-NEXT:    vldrw.u32 q3, [r0, #16]
-; CHECK-LIS-NEXT:    vadd.f16 q0, q0, q4
-; CHECK-LIS-NEXT:    vldrw.u32 q2, [r0, #32]
-; CHECK-LIS-NEXT:    vadd.f16 q1, q0, q1
-; CHECK-LIS-NEXT:    vldrw.u32 q0, [r0]
-; CHECK-LIS-NEXT:    vstrw.32 q1, [r1, #16]
-; CHECK-LIS-NEXT:    vmov.f32 s5, s12
-; CHECK-LIS-NEXT:    vmovx.f16 s6, s2
-; CHECK-LIS-NEXT:    vmov.f32 s4, s1
-; CHECK-LIS-NEXT:    vins.f16 s4, s6
-; CHECK-LIS-NEXT:    vmovx.f16 s6, s13
-; CHECK-LIS-NEXT:    vins.f16 s5, s6
-; CHECK-LIS-NEXT:    vmov.f32 s6, s15
-; CHECK-LIS-NEXT:    vmovx.f16 s7, s8
-; CHECK-LIS-NEXT:    vmovx.f16 s16, s11
-; CHECK-LIS-NEXT:    vins.f16 s6, s7
-; CHECK-LIS-NEXT:    vmov.f32 s7, s10
-; CHECK-LIS-NEXT:    vins.f16 s7, s16
-; CHECK-LIS-NEXT:    vmovx.f16 s16, s0
-; CHECK-LIS-NEXT:    vins.f16 s16, s2
-; CHECK-LIS-NEXT:    vmovx.f16 s2, s1
-; CHECK-LIS-NEXT:    vins.f16 s0, s2
-; CHECK-LIS-NEXT:    vmovx.f16 s2, s12
-; CHECK-LIS-NEXT:    vmovx.f16 s17, s3
-; CHECK-LIS-NEXT:    vins.f16 s3, s2
-; CHECK-LIS-NEXT:    vmovx.f16 s2, s15
-; CHECK-LIS-NEXT:    vmovx.f16 s18, s14
-; CHECK-LIS-NEXT:    vins.f16 s14, s2
-; CHECK-LIS-NEXT:    vmovx.f16 s2, s10
-; CHECK-LIS-NEXT:    vmovx.f16 s19, s9
-; CHECK-LIS-NEXT:    vins.f16 s9, s2
-; CHECK-LIS-NEXT:    vmov.f32 s1, s3
-; CHECK-LIS-NEXT:    vins.f16 s18, s8
-; CHECK-LIS-NEXT:    vins.f16 s19, s11
-; CHECK-LIS-NEXT:    vins.f16 s17, s13
-; CHECK-LIS-NEXT:    vmov.f32 s2, s14
-; CHECK-LIS-NEXT:    vmov.f32 s3, s9
-; CHECK-LIS-NEXT:    vadd.f16 q0, q0, q4
-; CHECK-LIS-NEXT:    vadd.f16 q0, q0, q1
-; CHECK-LIS-NEXT:    vstrw.32 q0, [r1]
-; CHECK-LIS-NEXT:    vpop {d8, d9}
-; CHECK-LIS-NEXT:    bx lr
+; CHECK-LABEL: vld3_v16f16:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    .vsave {d8, d9}
+; CHECK-NEXT:    vpush {d8, d9}
+; CHECK-NEXT:    vldrw.u32 q0, [r0, #48]
+; CHECK-NEXT:    vldrw.u32 q2, [r0, #64]
+; CHECK-NEXT:    vldrw.u32 q3, [r0, #80]
+; CHECK-NEXT:    vmovx.f16 s6, s2
+; CHECK-NEXT:    vmov.f32 s4, s1
+; CHECK-NEXT:    vins.f16 s4, s6
+; CHECK-NEXT:    vmovx.f16 s6, s9
+; CHECK-NEXT:    vmov.f32 s5, s8
+; CHECK-NEXT:    vmovx.f16 s7, s12
+; CHECK-NEXT:    vins.f16 s5, s6
+; CHECK-NEXT:    vmov.f32 s6, s11
+; CHECK-NEXT:    vins.f16 s6, s7
+; CHECK-NEXT:    vmovx.f16 s16, s15
+; CHECK-NEXT:    vmov.f32 s7, s14
+; CHECK-NEXT:    vmovx.f16 s17, s3
+; CHECK-NEXT:    vins.f16 s7, s16
+; CHECK-NEXT:    vmovx.f16 s16, s0
+; CHECK-NEXT:    vins.f16 s16, s2
+; CHECK-NEXT:    vmovx.f16 s2, s1
+; CHECK-NEXT:    vins.f16 s0, s2
+; CHECK-NEXT:    vmovx.f16 s2, s8
+; CHECK-NEXT:    vins.f16 s3, s2
+; CHECK-NEXT:    vmovx.f16 s2, s11
+; CHECK-NEXT:    vmovx.f16 s18, s10
+; CHECK-NEXT:    vins.f16 s10, s2
+; CHECK-NEXT:    vmovx.f16 s2, s14
+; CHECK-NEXT:    vmovx.f16 s19, s13
+; CHECK-NEXT:    vins.f16 s13, s2
+; CHECK-NEXT:    vmov.f32 s1, s3
+; CHECK-NEXT:    vins.f16 s18, s12
+; CHECK-NEXT:    vins.f16 s19, s15
+; CHECK-NEXT:    vins.f16 s17, s9
+; CHECK-NEXT:    vmov.f32 s2, s10
+; CHECK-NEXT:    vmov.f32 s3, s13
+; CHECK-NEXT:    vldrw.u32 q2, [r0, #16]
+; CHECK-NEXT:    vadd.f16 q0, q0, q4
+; CHECK-NEXT:    vadd.f16 q3, q0, q1
+; CHECK-NEXT:    vldrw.u32 q1, [r0]
+; CHECK-NEXT:    vldrw.u32 q0, [r0, #32]
+; CHECK-NEXT:    vstrw.32 q3, [r1, #16]
+; CHECK-NEXT:    vmovx.f16 s14, s6
+; CHECK-NEXT:    vmov.f32 s12, s5
+; CHECK-NEXT:    vins.f16 s12, s14
+; CHECK-NEXT:    vmovx.f16 s14, s9
+; CHECK-NEXT:    vmov.f32 s13, s8
+; CHECK-NEXT:    vmovx.f16 s18, s10
+; CHECK-NEXT:    vins.f16 s13, s14
+; CHECK-NEXT:    vmovx.f16 s15, s0
+; CHECK-NEXT:    vmov.f32 s14, s11
+; CHECK-NEXT:    vins.f16 s18, s0
+; CHECK-NEXT:    vins.f16 s14, s15
+; CHECK-NEXT:    vmovx.f16 s16, s3
+; CHECK-NEXT:    vmov.f32 s15, s2
+; CHECK-NEXT:    vmovx.f16 s0, s5
+; CHECK-NEXT:    vins.f16 s15, s16
+; CHECK-NEXT:    vmovx.f16 s16, s4
+; CHECK-NEXT:    vins.f16 s4, s0
+; CHECK-NEXT:    vmovx.f16 s0, s8
+; CHECK-NEXT:    vmovx.f16 s17, s7
+; CHECK-NEXT:    vins.f16 s7, s0
+; CHECK-NEXT:    vmovx.f16 s0, s11
+; CHECK-NEXT:    vmovx.f16 s19, s1
+; CHECK-NEXT:    vins.f16 s10, s0
+; CHECK-NEXT:    vmovx.f16 s0, s2
+; CHECK-NEXT:    vins.f16 s1, s0
+; CHECK-NEXT:    vins.f16 s16, s6
+; CHECK-NEXT:    vmov.f32 s5, s7
+; CHECK-NEXT:    vins.f16 s19, s3
+; CHECK-NEXT:    vins.f16 s17, s9
+; CHECK-NEXT:    vmov.f32 s6, s10
+; CHECK-NEXT:    vmov.f32 s7, s1
+; CHECK-NEXT:    vadd.f16 q0, q1, q4
+; CHECK-NEXT:    vadd.f16 q0, q0, q3
+; CHECK-NEXT:    vstrw.32 q0, [r1]
+; CHECK-NEXT:    vpop {d8, d9}
+; CHECK-NEXT:    bx lr
 entry:
   %l1 = load <48 x half>, ptr %src, align 4
   %s1 = shufflevector <48 x half> %l1, <48 x half> undef, <16 x i32> <i32 0, i32 3, i32 6, i32 9, i32 12, i32 15, i32 18, i32 21, i32 24, i32 27, i32 30, i32 33, i32 36, i32 39, i32 42, i32 45>
diff --git a/llvm/test/CodeGen/Thumb2/mve-vld4.ll b/llvm/test/CodeGen/Thumb2/mve-vld4.ll
index b49f19e55c895a..bf16c5b7d10ff4 100644
--- a/llvm/test/CodeGen/Thumb2/mve-vld4.ll
+++ b/llvm/test/CodeGen/Thumb2/mve-vld4.ll
@@ -360,50 +360,50 @@ define void @vld4_v8i16_align1(ptr %src, ptr %dst) {
 ; CHECK-NEXT:    .vsave {d8, d9, d10, d11}
 ; CHECK-NEXT:    vpush {d8, d9, d10, d11}
 ; CHECK-NEXT:    vldrb.u8 q1, [r0, #32]
-; CHECK-NEXT:    vldrb.u8 q2, [r0, #48]
 ; CHECK-NEXT:    vmovx.f16 s18, s5
 ; CHECK-NEXT:    vmovx.f16 s0, s7
 ; CHECK-NEXT:    vins.f16 s18, s0
-; CHECK-NEXT:    vmovx.f16 s19, s9
-; CHECK-NEXT:    vmovx.f16 s0, s11
+; CHECK-NEXT:    vldrb.u8 q0, [r0, #48]
 ; CHECK-NEXT:    vins.f16 s5, s7
-; CHECK-NEXT:    vins.f16 s19, s0
-; CHECK-NEXT:    vldrb.u8 q0, [r0]
-; CHECK-NEXT:    vins.f16 s9, s11
+; CHECK-NEXT:    vmovx.f16 s19, s1
+; CHECK-NEXT:    vmovx.f16 s8, s3
+; CHECK-NEXT:    vins.f16 s19, s8
+; CHECK-NEXT:    vldrb.u8 q2, [r0]
+; CHECK-NEXT:    vins.f16 s1, s3
 ; CHECK-NEXT:    vmov.f32 s22, s5
-; CHECK-NEXT:    vmovx.f16 s16, s1
-; CHECK-NEXT:    vmovx.f16 s12, s3
+; CHECK-NEXT:    vmovx.f16 s16, s9
+; CHECK-NEXT:    vmovx.f16 s12, s11
 ; CHECK-NEXT:    vins.f16 s16, s12
 ; CHECK-NEXT:    vldrb.u8 q3, [r0, #16]
-; CHECK-NEXT:    vins.f16 s1, s3
-; CHECK-NEXT:    vmov.f32 s23, s9
+; CHECK-NEXT:    vins.f16 s9, s11
+; CHECK-NEXT:    vmov.f32 s23, s1
 ; CHECK-NEXT:    vmovx.f16 s17, s13
 ; CHECK-NEXT:    vmovx.f16 s20, s15
 ; CHECK-NEXT:    vins.f16 s13, s15
 ; CHECK-NEXT:    vins.f16 s17, s20
-; CHECK-NEXT:    vmov.f32 s20, s1
+; CHECK-NEXT:    vmov.f32 s20, s9
 ; CHECK-NEXT:    vmovx.f16 s1, s6
 ; CHECK-NEXT:    vmov.f32 s21, s13
 ; CHECK-NEXT:    vadd.i16 q4, q5, q4
 ; CHECK-NEXT:    vmovx.f16 s22, s4
 ; CHECK-NEXT:    vins.f16 s22, s1
-; CHECK-NEXT:    vmovx.f16 s23, s8
-; CHECK-NEXT:    vmovx.f16 s1, s10
-; CHECK-NEXT:    vmovx.f16 s20, s0
-; CHECK-NEXT:    vins.f16 s23, s1
+; CHECK-NEXT:    vmovx.f16 s23, s0
 ; CHECK-NEXT:    vmovx.f16 s1, s2
+; CHECK-NEXT:    vmovx.f16 s20, s8
+; CHECK-NEXT:    vins.f16 s23, s1
+; CHECK-NEXT:    vmovx.f16 s1, s10
 ; CHECK-NEXT:    vins.f16 s20, s1
 ; CHECK-NEXT:    vmovx.f16 s21, s12
 ; CHECK-NEXT:    vmovx.f16 s1, s14
-; CHECK-NEXT:    vins.f16 s8, s10
-; CHECK-NEXT:    vins.f16 s4, s6
+; CHECK-NEXT:    vins.f16 s0, s2
 ; CHECK-NEXT:    vins.f16 s12, s14
+; CHECK-NEXT:    vins.f16 s4, s6
+; CHECK-NEXT:    vins.f16 s8, s10
 ; CHECK-NEXT:    vins.f16 s21, s1
-; CHECK-NEXT:    vins.f16 s0, s2
-; CHECK-NEXT:    vmov.f32 s3, s8
-; CHECK-NEXT:    vmov.f32 s1, s12
-; CHECK-NEXT:    vmov.f32 s2, s4
-; CHECK-NEXT:    vadd.i16 q0, q0, q5
+; CHECK-NEXT:    vmov.f32 s9, s12
+; CHECK-NEXT:    vmov.f32 s10, s4
+; CHECK-NEXT:    vmov.f32 s11, s0
+; CHECK-NEXT:    vadd.i16 q0, q2, q5
 ; CHECK-NEXT:    vadd.i16 q0, q0, q4
 ; CHECK-NEXT:    vstrw.32 q0, [r1]
 ; CHECK-NEXT:    vpop {d8, d9, d10, d11}
@@ -1081,51 +1081,51 @@ define void @vld4_v8f16_align1(ptr %src, ptr %dst) {
 ; CHECK-NEXT:    .vsave {d8, d9, d10, d11, d12, d13}
 ; CHECK-NEXT:    vpush {d8, d9, d10, d11, d12, d13}
 ; CHECK-NEXT:    vldrb.u8 q0, [r0, #32]
-; CHECK-NEXT:    vldrb.u8 q2, [r0, #48]
-; CHECK-NEXT:    vmovx.f16 s18, s1
+; CHECK-NEXT:    vldrb.u8 q4, [r0, #16]
+; CHECK-NEXT:    vmovx.f16 s14, s1
 ; CHECK-NEXT:    vmovx.f16 s4, s3
-; CHECK-NEXT:    vins.f16 s18, s4
-; CHECK-NEXT:    vmovx.f16 s19, s9
-; CHECK-NEXT:    vmovx.f16 s4, s11
+; CHECK-NEXT:    vins.f16 s14, s4
+; CHECK-NEXT:    vldrb.u8 q1, [r0, #48]
 ; CHECK-NEXT:    vins.f16 s1, s3
-; CHECK-NEXT:    vins.f16 s19, s4
-; CHECK-NEXT:    vldrb.u8 q1, [r0]
 ; CHECK-NEXT:    vmovx.f16 s22, s0
+; CHECK-NEXT:    vmovx.f16 s15, s5
+; CHECK-NEXT:    vmovx.f16 s8, s7
+; CHECK-NEXT:    vins.f16 s15, s8
+; CHECK-NEXT:    vldrb.u8 q2, [r0]
 ; CHECK-NEXT:    vmovx.f16 s3, s2
-; CHECK-NEXT:    vmovx.f16 s16, s5
-; CHECK-NEXT:    vmovx.f16 s12, s7
-; CHECK-NEXT:    vins.f16 s16, s12
-; CHECK-NEXT:    vldrb.u8 q3, [r0, #16]
+; CHECK-NEXT:    vmovx.f16 s23, s4
+; CHECK-NEXT:    vmovx.f16 s12, s9
+; CHECK-NEXT:    vmovx.f16 s13, s11
+; CHECK-NEXT:    vins.f16 s12, s13
+; CHECK-NEXT:    vmovx.f16 s13, s17
+; CHECK-NEXT:    vmovx.f16 s20, s19
 ; CHECK-NEXT:    vins.f16 s22, s3
-; CHECK-NEXT:    vmovx.f16 s23, s8
-; CHECK-NEXT:    vmovx.f16 s17, s13
-; CHECK-NEXT:    vmovx.f16 s20, s15
-; CHECK-NEXT:    vmovx.f16 s3, s10
-; CHECK-NEXT:    vins.f16 s17, s20
-; CHECK-NEXT:    vins.f16 s23, s3
-; CHECK-NEXT:    vmovx.f16 s20, s4
 ; CHECK-NEXT:    vmovx.f16 s3, s6
-; CHECK-NEXT:    vins.f16 s9, s11
+; CHECK-NEXT:    vins.f16 s13, s20
+; CHECK-NEXT:    vins.f16 s23, s3
+; CHECK-NEXT:    vmovx.f16 s20, s8
+; CHECK-NEXT:    vmovx.f16 s3, s10
 ; CHECK-NEXT:    vins.f16 s5, s7
-; CHECK-NEXT:    vins.f16 s13, s15
+; CHECK-NEXT:    vins.f16 s9, s11
+; CHECK-NEXT:    vins.f16 s17, s19
 ; CHECK-NEXT:    vins.f16 s20, s3
-; CHECK-NEXT:    vmovx.f16 s21, s12
-; CHECK-NEXT:    vmovx.f16 s3, s14
-; CHECK-NEXT:    vins.f16 s8, s10
-; CHECK-NEXT:    vins.f16 s0, s2
-; CHECK-NEXT:    vins.f16 s12, s14
+; CHECK-NEXT:    vmovx.f16 s21, s16
+; CHECK-NEXT:    vmovx.f16 s3, s18
 ; CHECK-NEXT:    vins.f16 s4, s6
-; CHECK-NEXT:    vmov.f32 s24, s5
+; CHECK-NEXT:    vins.f16 s0, s2
+; CHECK-NEXT:    vins.f16 s16, s18
+; CHECK-NEXT:    vins.f16 s8, s10
+; CHECK-NEXT:    vmov.f32 s24, s9
 ; CHECK-NEXT:    vins.f16 s21, s3
 ; CHECK-NEXT:    vmov.f32 s26, s1
-; CHECK-NEXT:    vmov.f32 s27, s9
-; CHECK-NEXT:    vmov.f32 s25, s13
-; CHECK-NEXT:    vmov.f32 s6, s0
-; CHECK-NEXT:    vadd.f16 q4, q6, q4
-; CHECK-NEXT:    vmov.f32 s7, s8
-; CHECK-NEXT:    vmov.f32 s5, s12
-; CHECK-NEXT:    vadd.f16 q0, q1, q5
-; CHECK-NEXT:    vadd.f16 q0, q0, q4
+; CHECK-NEXT:    vmov.f32 s25, s17
+; CHECK-NEXT:    vmov.f32 s27, s5
+; CHECK-NEXT:    vmov.f32 s10, s0
+; CHECK-NEXT:    vadd.f16 q3, q6, q3
+; CHECK-NEXT:    vmov.f32 s9, s16
+; CHECK-NEXT:    vmov.f32 s11, s4
+; CHECK-NEXT:    vadd.f16 q0, q2, q5
+; CHECK-NEXT:    vadd.f16 q0, q0, q3
 ; CHECK-NEXT:    vstrw.32 q0, [r1]
 ; CHECK-NEXT:    vpop {d8, d9, d10, d11, d12, d13}
 ; CHECK-NEXT:    bx lr
diff --git a/llvm/test/CodeGen/Thumb2/mve-vldst4.ll b/llvm/test/CodeGen/Thumb2/mve-vldst4.ll
index 2e51e9e059f65e..c82bc582fa87c1 100644
--- a/llvm/test/CodeGen/Thumb2/mve-vldst4.ll
+++ b/llvm/test/CodeGen/Thumb2/mve-vldst4.ll
@@ -6,118 +6,114 @@ define void @vldst4(ptr nocapture readonly %pIn, ptr nocapture %pOut, i32 %numRo
 ; CHECK:       @ %bb.0: @ %entry
 ; CHECK-NEXT:    .save {r7, lr}
 ; CHECK-NEXT:    push {r7, lr}
-; CHECK-NEXT:    .vsave {d8, d9, d10, d11, d12, d13}
-; CHECK-NEXT:    vpush {d8, d9, d10, d11, d12, d13}
+; CHECK-NEXT:    .vsave {d8, d9, d10, d11, d12, d13, d14, d15}
+; CHECK-NEXT:    vpush {d8, d9, d10, d11, d12, d13, d14, d15}
 ; CHECK-NEXT:    mul r12, r3, r2
 ; CHECK-NEXT:    lsrs.w r2, r12, #2
 ; CHECK-NEXT:    beq.w .LBB0_3
 ; CHECK-NEXT:  @ %bb.1: @ %vector.ph
 ; CHECK-NEXT:    mvn r3, #7
-; CHECK-NEXT:    ldr r2, [sp, #56]
+; CHECK-NEXT:    ldr r2, [sp, #72]
 ; CHECK-NEXT:    and.w r3, r3, r12, lsr #2
 ; CHECK-NEXT:    sub.w r12, r3, #8
 ; CHECK-NEXT:    movs r3, #1
 ; CHECK-NEXT:    add.w lr, r3, r12, lsr #3
 ; CHECK-NEXT:  .LBB0_2: @ %vector.body
 ; CHECK-NEXT:    @ =>This Inner Loop Header: Depth=1
-; CHECK-NEXT:    vldrh.u16 q1, [r0, #32]
-; CHECK-NEXT:    vldrh.u16 q4, [r0, #48]
-; CHECK-NEXT:    vldrh.u16 q3, [r0], #64
-; CHECK-NEXT:    vmovx.f16 s26, s4
-; CHECK-NEXT:    vins.f16 s4, s6
-; CHECK-NEXT:    vmovx.f16 s6, s6
+; CHECK-NEXT:    vldrh.u16 q3, [r0, #32]
+; CHECK-NEXT:    vldrh.u16 q0, [r0, #48]
+; CHECK-NEXT:    vldrh.u16 q4, [r0], #64
+; CHECK-NEXT:    vmovx.f16 s8, s15
+; CHECK-NEXT:    vmovx.f16 s10, s13
 ; CHECK-NEXT:    vldrh.u16 q5, [r0, #-48]
-; CHECK-NEXT:    vmovx.f16 s27, s16
-; CHECK-NEXT:    vins.f16 s26, s6
-; CHECK-NEXT:    vmovx.f16 s6, s18
-; CHECK-NEXT:    vmovx.f16 s8, s7
-; CHECK-NEXT:    vmovx.f16 s10, s5
-; CHECK-NEXT:    vmovx.f16 s24, s12
 ; CHECK-NEXT:    vins.f16 s10, s8
-; CHECK-NEXT:    vins.f16 s27, s6
-; CHECK-NEXT:    vmovx.f16 s6, s14
-; CHECK-NEXT:    vmovx.f16 s8, s19
-; CHECK-NEXT:    vmovx.f16 s11, s17
-; CHECK-NEXT:    vmov.f32 s0, s13
+; CHECK-NEXT:    vmovx.f16 s8, s3
+; CHECK-NEXT:    vmovx.f16 s11, s1
+; CHECK-NEXT:    vmov.f32 s7, s1
 ; CHECK-NEXT:    vins.f16 s11, s8
-; CHECK-NEXT:    vmovx.f16 s25, s20
-; CHECK-NEXT:    vins.f16 s24, s6
-; CHECK-NEXT:    vmovx.f16 s6, s22
-; CHECK-NEXT:    vmovx.f16 s1, s15
-; CHECK-NEXT:    vmovx.f16 s8, s13
-; CHECK-NEXT:    vins.f16 s20, s22
-; CHECK-NEXT:    vins.f16 s16, s18
-; CHECK-NEXT:    vmov.f32 s2, s5
-; CHECK-NEXT:    vins.f16 s25, s6
-; CHECK-NEXT:    vmov.f32 s3, s17
-; CHECK-NEXT:    vins.f16 s0, s15
+; CHECK-NEXT:    vmovx.f16 s1, s19
+; CHECK-NEXT:    vmovx.f16 s8, s17
+; CHECK-NEXT:    vmov.f32 s6, s13
 ; CHECK-NEXT:    vmovx.f16 s9, s21
+; CHECK-NEXT:    vmov.f32 s4, s17
+; CHECK-NEXT:    vins.f16 s21, s23
 ; CHECK-NEXT:    vins.f16 s8, s1
 ; CHECK-NEXT:    vmovx.f16 s1, s23
-; CHECK-NEXT:    vins.f16 s12, s14
-; CHECK-NEXT:    vins.f16 s21, s23
-; CHECK-NEXT:    vmov.f32 s14, s4
-; CHECK-NEXT:    vmov.f32 s15, s16
 ; CHECK-NEXT:    vins.f16 s9, s1
-; CHECK-NEXT:    vmov.f32 s13, s20
-; CHECK-NEXT:    vmul.f16 q6, q6, r2
-; CHECK-NEXT:    vmul.f16 q3, q3, r2
-; CHECK-NEXT:    vins.f16 s2, s7
-; CHECK-NEXT:    vins.f16 s3, s19
-; CHECK-NEXT:    vmov.f32 s1, s21
-; CHECK-NEXT:    vmul.f16 q0, q0, r2
-; CHECK-NEXT:    vmovx.f16 s4, s12
-; CHECK-NEXT:    vmovx.f16 s6, s24
+; CHECK-NEXT:    vins.f16 s6, s15
+; CHECK-NEXT:    vins.f16 s7, s3
+; CHECK-NEXT:    vins.f16 s4, s19
+; CHECK-NEXT:    vmov.f32 s5, s21
+; CHECK-NEXT:    vmovx.f16 s27, s0
+; CHECK-NEXT:    vmul.f16 q1, q1, r2
+; CHECK-NEXT:    vins.f16 s0, s2
+; CHECK-NEXT:    vmovx.f16 s2, s2
 ; CHECK-NEXT:    vmul.f16 q2, q2, r2
-; CHECK-NEXT:    vmovx.f16 s7, s0
-; CHECK-NEXT:    vins.f16 s0, s8
+; CHECK-NEXT:    vmovx.f16 s3, s4
+; CHECK-NEXT:    vins.f16 s4, s8
 ; CHECK-NEXT:    vmovx.f16 s8, s8
-; CHECK-NEXT:    vins.f16 s4, s6
-; CHECK-NEXT:    vmovx.f16 s5, s1
-; CHECK-NEXT:    vmovx.f16 s6, s9
-; CHECK-NEXT:    vins.f16 s7, s8
-; CHECK-NEXT:    vins.f16 s5, s6
-; CHECK-NEXT:    vmovx.f16 s6, s13
-; CHECK-NEXT:    vmovx.f16 s8, s25
-; CHECK-NEXT:    vins.f16 s6, s8
-; CHECK-NEXT:    vmovx.f16 s19, s2
-; CHECK-NEXT:    vmovx.f16 s8, s10
-; CHECK-NEXT:    vmovx.f16 s18, s14
-; CHECK-NEXT:    vins.f16 s19, s8
-; CHECK-NEXT:    vmovx.f16 s8, s26
-; CHECK-NEXT:    vins.f16 s18, s8
-; CHECK-NEXT:    vmovx.f16 s23, s3
-; CHECK-NEXT:    vmovx.f16 s8, s11
-; CHECK-NEXT:    vins.f16 s14, s26
-; CHECK-NEXT:    vins.f16 s23, s8
-; CHECK-NEXT:    vmovx.f16 s22, s15
-; CHECK-NEXT:    vins.f16 s15, s27
-; CHECK-NEXT:    vmovx.f16 s8, s27
-; CHECK-NEXT:    vins.f16 s12, s24
-; CHECK-NEXT:    vins.f16 s13, s25
-; CHECK-NEXT:    vins.f16 s2, s10
-; CHECK-NEXT:    vins.f16 s3, s11
-; CHECK-NEXT:    vins.f16 s1, s9
-; CHECK-NEXT:    vins.f16 s22, s8
-; CHECK-NEXT:    vmov q2, q3
-; CHECK-NEXT:    vmov q6, q0
-; CHECK-NEXT:    vmov.f32 s10, s4
-; CHECK-NEXT:    vmov.f32 s11, s7
-; CHECK-NEXT:    vmov.f32 s9, s0
-; CHECK-NEXT:    vmov.f32 s17, s2
-; CHECK-NEXT:    vmov.f32 s16, s14
-; CHECK-NEXT:    vmov.f32 s21, s3
-; CHECK-NEXT:    vstrh.16 q4, [r1, #32]
-; CHECK-NEXT:    vmov.f32 s20, s15
-; CHECK-NEXT:    vmov.f32 s7, s5
-; CHECK-NEXT:    vstrh.16 q5, [r1, #48]
-; CHECK-NEXT:    vstrh.16 q2, [r1], #64
-; CHECK-NEXT:    vmov.f32 s4, s13
-; CHECK-NEXT:    vmov.f32 s5, s25
-; CHECK-NEXT:    vstrh.16 q1, [r1, #-48]
+; CHECK-NEXT:    vmovx.f16 s24, s16
+; CHECK-NEXT:    vins.f16 s27, s2
+; CHECK-NEXT:    vmovx.f16 s2, s18
+; CHECK-NEXT:    vins.f16 s3, s8
+; CHECK-NEXT:    vmovx.f16 s26, s12
+; CHECK-NEXT:    vmovx.f16 s25, s20
+; CHECK-NEXT:    vins.f16 s20, s22
+; CHECK-NEXT:    vmovx.f16 s8, s14
+; CHECK-NEXT:    vins.f16 s24, s2
+; CHECK-NEXT:    vmovx.f16 s2, s22
+; CHECK-NEXT:    vins.f16 s12, s14
+; CHECK-NEXT:    vins.f16 s16, s18
+; CHECK-NEXT:    vins.f16 s26, s8
+; CHECK-NEXT:    vins.f16 s25, s2
+; CHECK-NEXT:    vmov.f32 s18, s12
+; CHECK-NEXT:    vmov.f32 s17, s20
+; CHECK-NEXT:    vmul.f16 q6, q6, r2
+; CHECK-NEXT:    vmov.f32 s19, s0
+; CHECK-NEXT:    vmovx.f16 s0, s24
+; CHECK-NEXT:    vmul.f16 q4, q4, r2
+; CHECK-NEXT:    vmovx.f16 s15, s5
+; CHECK-NEXT:    vmovx.f16 s2, s16
+; CHECK-NEXT:    vmovx.f16 s14, s17
+; CHECK-NEXT:    vins.f16 s2, s0
+; CHECK-NEXT:    vmovx.f16 s0, s9
+; CHECK-NEXT:    vins.f16 s15, s0
+; CHECK-NEXT:    vmovx.f16 s0, s25
+; CHECK-NEXT:    vins.f16 s14, s0
+; CHECK-NEXT:    vmovx.f16 s23, s6
+; CHECK-NEXT:    vmovx.f16 s0, s10
+; CHECK-NEXT:    vmovx.f16 s22, s18
+; CHECK-NEXT:    vins.f16 s23, s0
+; CHECK-NEXT:    vmovx.f16 s0, s26
+; CHECK-NEXT:    vins.f16 s22, s0
+; CHECK-NEXT:    vmovx.f16 s31, s7
+; CHECK-NEXT:    vmovx.f16 s0, s11
+; CHECK-NEXT:    vins.f16 s6, s10
+; CHECK-NEXT:    vins.f16 s18, s26
+; CHECK-NEXT:    vins.f16 s7, s11
+; CHECK-NEXT:    vins.f16 s31, s0
+; CHECK-NEXT:    vmovx.f16 s30, s19
+; CHECK-NEXT:    vins.f16 s19, s27
+; CHECK-NEXT:    vmovx.f16 s0, s27
+; CHECK-NEXT:    vins.f16 s16, s24
+; CHECK-NEXT:    vins.f16 s5, s9
+; CHECK-NEXT:    vins.f16 s17, s25
+; CHECK-NEXT:    vins.f16 s30, s0
+; CHECK-NEXT:    vmov.f32 s1, s4
+; CHECK-NEXT:    vmov.f32 s0, s16
+; CHECK-NEXT:    vmov.f32 s21, s6
+; CHECK-NEXT:    vmov.f32 s20, s18
+; CHECK-NEXT:    vmov.f32 s29, s7
+; CHECK-NEXT:    vstrh.16 q5, [r1, #32]
+; CHECK-NEXT:    vmov.f32 s28, s19
+; CHECK-NEXT:    vstrh.16 q7, [r1, #48]
+; CHECK-NEXT:    vstrh.16 q0, [r1], #64
+; CHECK-NEXT:    vmov.f32 s12, s17
+; CHECK-NEXT:    vmov.f32 s13, s5
+; CHECK-NEXT:    vstrh.16 q3, [r1, #-48]
 ; CHECK-NEXT:    le lr, .LBB0_2
 ; CHECK-NEXT:  .LBB0_3: @ %while.end
-; CHECK-NEXT:    vpop {d8, d9, d10, d11, d12, d13}
+; CHECK-NEXT:    vpop {d8, d9, d10, d11, d12, d13, d14, d15}
 ; CHECK-NEXT:    pop {r7, pc}
 entry:
   %tmp.0.extract.trunc = trunc i32 %scale.coerce to i16
diff --git a/llvm/test/CodeGen/Thumb2/mve-vst2.ll b/llvm/test/CodeGen/Thumb2/mve-vst2.ll
index 57d08a7f3c4b25..f5a129fa2dc941 100644
--- a/llvm/test/CodeGen/Thumb2/mve-vst2.ll
+++ b/llvm/test/CodeGen/Thumb2/mve-vst2.ll
@@ -503,25 +503,28 @@ entry:
 define void @vst2_v4f16(ptr %src, ptr %dst) {
 ; CHECK-LABEL: vst2_v4f16:
 ; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    .vsave {d8, d9}
+; CHECK-NEXT:    vpush {d8, d9}
 ; CHECK-NEXT:    ldrd r2, r12, [r0]
 ; CHECK-NEXT:    ldrd r3, r0, [r0, #8]
-; CHECK-NEXT:    vmov.32 q0[0], r2
-; CHECK-NEXT:    vmov.32 q1[0], r3
+; CHECK-NEXT:    vmov.32 q4[0], r2
+; CHECK-NEXT:    vmov q0, q4
+; CHECK-NEXT:    vmov.32 q3[0], r3
+; CHECK-NEXT:    vmov q2, q3
 ; CHECK-NEXT:    vmov.32 q0[1], r12
-; CHECK-NEXT:    vmov.32 q1[1], r0
-; CHECK-NEXT:    vmovx.f16 s2, s0
-; CHECK-NEXT:    vins.f16 s0, s4
-; CHECK-NEXT:    vmovx.f16 s4, s4
-; CHECK-NEXT:    vins.f16 s2, s4
-; CHECK-NEXT:    vmovx.f16 s4, s1
-; CHECK-NEXT:    vins.f16 s1, s5
-; CHECK-NEXT:    vmovx.f16 s6, s5
-; CHECK-NEXT:    vmov q2, q0
-; CHECK-NEXT:    vins.f16 s4, s6
-; CHECK-NEXT:    vmov.f32 s9, s2
-; CHECK-NEXT:    vmov.f32 s10, s1
-; CHECK-NEXT:    vmov.f32 s11, s4
-; CHECK-NEXT:    vstrh.16 q2, [r1]
+; CHECK-NEXT:    vmov.32 q2[1], r0
+; CHECK-NEXT:    vmovx.f16 s0, s12
+; CHECK-NEXT:    vmovx.f16 s5, s16
+; CHECK-NEXT:    vmov.f32 s4, s16
+; CHECK-NEXT:    vins.f16 s5, s0
+; CHECK-NEXT:    vmovx.f16 s7, s1
+; CHECK-NEXT:    vins.f16 s1, s9
+; CHECK-NEXT:    vmovx.f16 s0, s9
+; CHECK-NEXT:    vins.f16 s4, s12
+; CHECK-NEXT:    vins.f16 s7, s0
+; CHECK-NEXT:    vmov.f32 s6, s1
+; CHECK-NEXT:    vstrh.16 q1, [r1]
+; CHECK-NEXT:    vpop {d8, d9}
 ; CHECK-NEXT:    bx lr
 entry:
   %l1 = load <4 x half>, ptr %src, align 4
diff --git a/llvm/test/CodeGen/Thumb2/mve-vst3.ll b/llvm/test/CodeGen/Thumb2/mve-vst3.ll
index 85317e1fe4626a..903069b7e776d6 100644
--- a/llvm/test/CodeGen/Thumb2/mve-vst3.ll
+++ b/llvm/test/CodeGen/Thumb2/mve-vst3.ll
@@ -12,13 +12,11 @@ define void @vst3_v2i32(ptr %src, ptr %dst) {
 ; CHECK-NEXT:    ldrd r3, r2, [r0, #8]
 ; CHECK-NEXT:    ldrd r4, r0, [r0, #16]
 ; CHECK-NEXT:    vmov q1[2], q1[0], lr, r3
-; CHECK-NEXT:    str r2, [r1, #16]
-; CHECK-NEXT:    vmov.32 q0[0], r4
+; CHECK-NEXT:    strd r2, r0, [r1, #16]
 ; CHECK-NEXT:    vmov q1[3], q1[1], r12, r2
-; CHECK-NEXT:    vmov.32 q0[1], r0
+; CHECK-NEXT:    vmov.32 q0[0], r4
 ; CHECK-NEXT:    vmov.f32 s8, s4
 ; CHECK-NEXT:    vmov.f32 s9, s6
-; CHECK-NEXT:    str r0, [r1, #20]
 ; CHECK-NEXT:    vmov.f32 s10, s0
 ; CHECK-NEXT:    vmov.f32 s11, s5
 ; CHECK-NEXT:    vstrw.32 q2, [r1]
@@ -337,51 +335,49 @@ define void @vst3_v8i16(ptr %src, ptr %dst) {
 ; CHECK:       @ %bb.0: @ %entry
 ; CHECK-NEXT:    .vsave {d8, d9, d10, d11}
 ; CHECK-NEXT:    vpush {d8, d9, d10, d11}
-; CHECK-NEXT:    vldrw.u32 q1, [r0]
+; CHECK-NEXT:    vldrw.u32 q0, [r0]
 ; CHECK-NEXT:    vldrw.u32 q2, [r0, #16]
-; CHECK-NEXT:    vmov.f32 s12, s7
+; CHECK-NEXT:    vmov.f32 s12, s3
 ; CHECK-NEXT:    vmov.u16 r2, q2[5]
-; CHECK-NEXT:    vmov.16 q0[0], r2
+; CHECK-NEXT:    vmov.16 q1[0], r2
 ; CHECK-NEXT:    vins.f16 s12, s11
-; CHECK-NEXT:    vmov.f32 s1, s12
+; CHECK-NEXT:    vmov.f32 s5, s12
 ; CHECK-NEXT:    vmov.u16 r2, q2[7]
 ; CHECK-NEXT:    vldrw.u32 q3, [r0, #32]
-; CHECK-NEXT:    vmov.16 q0[6], r2
-; CHECK-NEXT:    vmov.f32 s2, s7
+; CHECK-NEXT:    vmov.16 q1[6], r2
+; CHECK-NEXT:    vmov.f32 s20, s0
 ; CHECK-NEXT:    vmov.u16 r0, q2[3]
-; CHECK-NEXT:    vmovx.f16 s7, s14
+; CHECK-NEXT:    vmovx.f16 s6, s14
 ; CHECK-NEXT:    vmov.16 q4[2], r0
-; CHECK-NEXT:    vins.f16 s0, s7
-; CHECK-NEXT:    vmovx.f16 s7, s15
-; CHECK-NEXT:    vins.f16 s3, s7
-; CHECK-NEXT:    vmov.f32 s7, s6
+; CHECK-NEXT:    vins.f16 s4, s6
+; CHECK-NEXT:    vmovx.f16 s6, s3
+; CHECK-NEXT:    vmovx.f16 s3, s15
+; CHECK-NEXT:    vins.f16 s15, s6
+; CHECK-NEXT:    vmov.f32 s6, s15
+; CHECK-NEXT:    vins.f16 s7, s3
+; CHECK-NEXT:    vmov.f32 s3, s2
+; CHECK-NEXT:    vmovx.f16 s15, s1
 ; CHECK-NEXT:    vmovx.f16 s2, s2
-; CHECK-NEXT:    vins.f16 s7, s10
-; CHECK-NEXT:    vmov.f32 s20, s4
-; CHECK-NEXT:    vins.f16 s15, s2
-; CHECK-NEXT:    vmov.f32 s18, s7
 ; CHECK-NEXT:    vins.f16 s20, s8
-; CHECK-NEXT:    vmov.f32 s7, s6
-; CHECK-NEXT:    vmovx.f16 s6, s5
-; CHECK-NEXT:    vmov.f32 s2, s15
-; CHECK-NEXT:    vmovx.f16 s15, s13
-; CHECK-NEXT:    vins.f16 s13, s6
-; CHECK-NEXT:    vmovx.f16 s6, s7
 ; CHECK-NEXT:    vmov.u16 r0, q2[1]
-; CHECK-NEXT:    vmovx.f16 s4, s4
-; CHECK-NEXT:    vins.f16 s14, s6
-; CHECK-NEXT:    vmovx.f16 s6, s12
+; CHECK-NEXT:    vmovx.f16 s0, s0
+; CHECK-NEXT:    vmovx.f16 s16, s13
+; CHECK-NEXT:    vins.f16 s14, s2
+; CHECK-NEXT:    vins.f16 s3, s10
+; CHECK-NEXT:    vmovx.f16 s2, s12
 ; CHECK-NEXT:    vmov.16 q5[4], r0
-; CHECK-NEXT:    vins.f16 s5, s9
-; CHECK-NEXT:    vins.f16 s12, s4
-; CHECK-NEXT:    vins.f16 s17, s15
+; CHECK-NEXT:    vins.f16 s1, s9
+; CHECK-NEXT:    vins.f16 s12, s0
+; CHECK-NEXT:    vins.f16 s13, s15
+; CHECK-NEXT:    vins.f16 s17, s16
 ; CHECK-NEXT:    vmov.f32 s16, s13
-; CHECK-NEXT:    vins.f16 s22, s6
+; CHECK-NEXT:    vmov.f32 s18, s3
+; CHECK-NEXT:    vins.f16 s22, s2
 ; CHECK-NEXT:    vmov.f32 s19, s14
-; CHECK-NEXT:    vstrw.32 q0, [r1, #32]
-; CHECK-NEXT:    vmov.f32 s23, s5
-; CHECK-NEXT:    vstrw.32 q4, [r1, #16]
+; CHECK-NEXT:    vstrw.32 q1, [r1, #32]
 ; CHECK-NEXT:    vmov.f32 s21, s12
+; CHECK-NEXT:    vstrw.32 q4, [r1, #16]
+; CHECK-NEXT:    vmov.f32 s23, s1
 ; CHECK-NEXT:    vstrw.32 q5, [r1]
 ; CHECK-NEXT:    vpop {d8, d9, d10, d11}
 ; CHECK-NEXT:    bx lr
@@ -405,110 +401,105 @@ define void @vst3_v16i16(ptr %src, ptr %dst) {
 ; CHECK-NEXT:    vpush {d8, d9, d10, d11, d12, d13, d14, d15}
 ; CHECK-NEXT:    .pad #48
 ; CHECK-NEXT:    sub sp, #48
-; CHECK-NEXT:    vldrw.u32 q2, [r0]
-; CHECK-NEXT:    vldrw.u32 q1, [r0, #32]
-; CHECK-NEXT:    vldrw.u32 q7, [r0, #80]
-; CHECK-NEXT:    vmov.f32 s0, s11
-; CHECK-NEXT:    vmov.u16 r2, q1[5]
-; CHECK-NEXT:    vmov.16 q3[0], r2
-; CHECK-NEXT:    vins.f16 s0, s7
-; CHECK-NEXT:    vmov.f32 s2, s11
-; CHECK-NEXT:    vmov.u16 r2, q1[7]
-; CHECK-NEXT:    vmov.f64 d12, d4
-; CHECK-NEXT:    vstrw.32 q1, [sp, #32] @ 16-byte Spill
-; CHECK-NEXT:    vldrw.u32 q1, [r0, #16]
-; CHECK-NEXT:    vmov.f32 s26, s10
+; CHECK-NEXT:    vldrw.u32 q2, [r0, #32]
+; CHECK-NEXT:    vldrw.u32 q7, [r0]
+; CHECK-NEXT:    vldrw.u32 q3, [r0, #16]
+; CHECK-NEXT:    vldrw.u32 q6, [r0, #80]
+; CHECK-NEXT:    vmov.f32 s0, s31
+; CHECK-NEXT:    vmov.u16 r2, q2[5]
+; CHECK-NEXT:    vmov.16 q1[0], r2
+; CHECK-NEXT:    vins.f16 s0, s11
+; CHECK-NEXT:    vstrw.32 q2, [sp, #32] @ 16-byte Spill
+; CHECK-NEXT:    vmov.u16 r2, q2[7]
 ; CHECK-NEXT:    vldrw.u32 q2, [r0, #64]
-; CHECK-NEXT:    vmov.f32 s13, s0
-; CHECK-NEXT:    vstrw.32 q6, [sp] @ 16-byte Spill
-; CHECK-NEXT:    vmov.16 q3[6], r2
+; CHECK-NEXT:    vmov.f32 s5, s0
+; CHECK-NEXT:    vmov.16 q1[6], r2
+; CHECK-NEXT:    vmov.f32 s20, s15
 ; CHECK-NEXT:    vmovx.f16 s0, s10
-; CHECK-NEXT:    vins.f16 s12, s0
-; CHECK-NEXT:    vmovx.f16 s0, s2
-; CHECK-NEXT:    vmov.f32 s14, s11
-; CHECK-NEXT:    vins.f16 s14, s0
-; CHECK-NEXT:    vmov.f32 s20, s7
-; CHECK-NEXT:    vmov q0, q3
-; CHECK-NEXT:    vldrw.u32 q3, [r0, #48]
-; CHECK-NEXT:    vmov.u16 r2, q3[5]
-; CHECK-NEXT:    vins.f16 s20, s15
+; CHECK-NEXT:    vmov.f32 s6, s11
+; CHECK-NEXT:    vins.f16 s4, s0
+; CHECK-NEXT:    vmovx.f16 s0, s31
+; CHECK-NEXT:    vins.f16 s6, s0
+; CHECK-NEXT:    vmovx.f16 s11, s11
+; CHECK-NEXT:    vmov q0, q1
+; CHECK-NEXT:    vldrw.u32 q1, [r0, #48]
+; CHECK-NEXT:    vins.f16 s3, s11
+; CHECK-NEXT:    vmovx.f16 s15, s15
+; CHECK-NEXT:    vmov.u16 r2, q1[5]
+; CHECK-NEXT:    vins.f16 s20, s7
 ; CHECK-NEXT:    vmov.16 q4[0], r2
-; CHECK-NEXT:    vmov.u16 r2, q3[7]
+; CHECK-NEXT:    vmov.u16 r2, q1[7]
 ; CHECK-NEXT:    vmov.f32 s17, s20
-; CHECK-NEXT:    vmovx.f16 s20, s31
+; CHECK-NEXT:    vmovx.f16 s20, s27
 ; CHECK-NEXT:    vmov.16 q4[6], r2
-; CHECK-NEXT:    vmov.f32 s18, s7
-; CHECK-NEXT:    vmovx.f16 s7, s30
-; CHECK-NEXT:    vins.f16 s16, s7
-; CHECK-NEXT:    vmovx.f16 s7, s18
-; CHECK-NEXT:    vins.f16 s31, s7
-; CHECK-NEXT:    vmovx.f16 s7, s11
-; CHECK-NEXT:    vins.f16 s3, s7
-; CHECK-NEXT:    vins.f16 s19, s20
 ; CHECK-NEXT:    vstrw.32 q0, [sp, #16] @ 16-byte Spill
+; CHECK-NEXT:    vins.f16 s19, s20
+; CHECK-NEXT:    vmov.f64 d10, d14
 ; CHECK-NEXT:    vldrw.u32 q0, [sp, #32] @ 16-byte Reload
-; CHECK-NEXT:    vmov.f32 s20, s24
-; CHECK-NEXT:    vmovx.f16 s11, s8
-; CHECK-NEXT:    vmov.f32 s7, s25
-; CHECK-NEXT:    vins.f16 s20, s0
+; CHECK-NEXT:    vins.f16 s27, s15
+; CHECK-NEXT:    vmovx.f16 s18, s26
+; CHECK-NEXT:    vins.f16 s28, s0
 ; CHECK-NEXT:    vmov.u16 r0, q0[1]
-; CHECK-NEXT:    vins.f16 s7, s1
+; CHECK-NEXT:    vmov.f32 s22, s30
+; CHECK-NEXT:    vmov.16 q7[4], r0
+; CHECK-NEXT:    vmovx.f16 s15, s20
+; CHECK-NEXT:    vstrw.32 q5, [sp] @ 16-byte Spill
+; CHECK-NEXT:    vmovx.f16 s20, s8
+; CHECK-NEXT:    vmov.u16 r0, q1[1]
+; CHECK-NEXT:    vins.f16 s30, s20
+; CHECK-NEXT:    vmov.f32 s20, s12
+; CHECK-NEXT:    vins.f16 s20, s4
+; CHECK-NEXT:    vmov.f32 s11, s21
 ; CHECK-NEXT:    vmov.16 q5[4], r0
-; CHECK-NEXT:    vmov.u16 r0, q3[1]
-; CHECK-NEXT:    vmov.f32 s23, s7
-; CHECK-NEXT:    vmovx.f16 s7, s24
-; CHECK-NEXT:    vmov.f32 s24, s4
-; CHECK-NEXT:    vins.f16 s8, s7
-; CHECK-NEXT:    vins.f16 s24, s12
-; CHECK-NEXT:    vmov.f32 s21, s8
-; CHECK-NEXT:    vmov.f32 s8, s5
-; CHECK-NEXT:    vmov.16 q6[4], r0
-; CHECK-NEXT:    vins.f16 s8, s13
-; CHECK-NEXT:    vmovx.f16 s4, s4
-; CHECK-NEXT:    vmov.f32 s27, s8
-; CHECK-NEXT:    vmovx.f16 s8, s28
-; CHECK-NEXT:    vins.f16 s28, s4
-; CHECK-NEXT:    vmov.f32 s4, s6
-; CHECK-NEXT:    vmov.u16 r0, q3[3]
-; CHECK-NEXT:    vins.f16 s4, s14
+; CHECK-NEXT:    vmov.u16 r0, q1[3]
+; CHECK-NEXT:    vins.f16 s11, s1
 ; CHECK-NEXT:    vmov.16 q0[2], r0
-; CHECK-NEXT:    vins.f16 s26, s8
-; CHECK-NEXT:    vmov.f32 s2, s4
-; CHECK-NEXT:    vmovx.f16 s4, s29
-; CHECK-NEXT:    vins.f16 s1, s4
-; CHECK-NEXT:    vmovx.f16 s4, s6
-; CHECK-NEXT:    vmovx.f16 s0, s5
-; CHECK-NEXT:    vins.f16 s30, s4
+; CHECK-NEXT:    vmovx.f16 s12, s12
+; CHECK-NEXT:    vins.f16 s8, s15
+; CHECK-NEXT:    vmov.f32 s31, s11
+; CHECK-NEXT:    vmovx.f16 s11, s24
+; CHECK-NEXT:    vins.f16 s24, s12
+; CHECK-NEXT:    vmov.f32 s12, s14
+; CHECK-NEXT:    vmovx.f16 s2, s25
+; CHECK-NEXT:    vmov.f32 s29, s8
+; CHECK-NEXT:    vmov.f32 s8, s13
+; CHECK-NEXT:    vins.f16 s1, s2
+; CHECK-NEXT:    vmovx.f16 s2, s14
+; CHECK-NEXT:    vins.f16 s12, s6
+; CHECK-NEXT:    vins.f16 s8, s5
+; CHECK-NEXT:    vins.f16 s26, s2
+; CHECK-NEXT:    vmovx.f16 s0, s13
+; CHECK-NEXT:    vmov.f32 s2, s12
+; CHECK-NEXT:    vldrw.u32 q3, [sp, #32] @ 16-byte Reload
 ; CHECK-NEXT:    vldrw.u32 q1, [sp] @ 16-byte Reload
-; CHECK-NEXT:    vins.f16 s29, s0
-; CHECK-NEXT:    vmov.f32 s0, s29
-; CHECK-NEXT:    vins.f16 s22, s11
-; CHECK-NEXT:    vmov.f32 s3, s30
-; CHECK-NEXT:    vstrw.32 q5, [r1]
-; CHECK-NEXT:    vmov.f32 s29, s5
-; CHECK-NEXT:    vstrw.32 q0, [r1, #64]
-; CHECK-NEXT:    vmov.f32 s30, s6
+; CHECK-NEXT:    vmov.f32 s23, s8
+; CHECK-NEXT:    vins.f16 s25, s0
 ; CHECK-NEXT:    vmov.f32 s8, s6
-; CHECK-NEXT:    vldrw.u32 q1, [sp, #32] @ 16-byte Reload
-; CHECK-NEXT:    vmov.f32 s18, s31
-; CHECK-NEXT:    vmov.u16 r0, q1[3]
-; CHECK-NEXT:    vins.f16 s8, s6
-; CHECK-NEXT:    vmov.16 q1[2], r0
-; CHECK-NEXT:    vmov.f32 s25, s28
-; CHECK-NEXT:    vmov.f32 s6, s8
-; CHECK-NEXT:    vmovx.f16 s8, s9
-; CHECK-NEXT:    vmovx.f16 s4, s29
-; CHECK-NEXT:    vins.f16 s5, s8
-; CHECK-NEXT:    vmovx.f16 s8, s30
-; CHECK-NEXT:    vins.f16 s9, s4
-; CHECK-NEXT:    vins.f16 s10, s8
-; CHECK-NEXT:    vmov.f32 s4, s9
-; CHECK-NEXT:    vmov.f32 s7, s10
-; CHECK-NEXT:    vstrw.32 q6, [r1, #48]
-; CHECK-NEXT:    vstrw.32 q1, [r1, #16]
+; CHECK-NEXT:    vmov.u16 r0, q3[3]
+; CHECK-NEXT:    vins.f16 s8, s14
+; CHECK-NEXT:    vmov.16 q3[2], r0
+; CHECK-NEXT:    vmovx.f16 s14, s9
+; CHECK-NEXT:    vmovx.f16 s12, s5
+; CHECK-NEXT:    vins.f16 s13, s14
+; CHECK-NEXT:    vmovx.f16 s14, s6
+; CHECK-NEXT:    vins.f16 s9, s12
+; CHECK-NEXT:    vins.f16 s10, s14
 ; CHECK-NEXT:    vldrw.u32 q1, [sp, #16] @ 16-byte Reload
-; CHECK-NEXT:    vstrw.32 q4, [r1, #80]
+; CHECK-NEXT:    vins.f16 s16, s18
+; CHECK-NEXT:    vmov.f32 s18, s27
+; CHECK-NEXT:    vins.f16 s22, s11
+; CHECK-NEXT:    vmov.f32 s21, s24
 ; CHECK-NEXT:    vstrw.32 q1, [r1, #32]
+; CHECK-NEXT:    vmov.f32 s0, s25
+; CHECK-NEXT:    vstrw.32 q5, [r1, #48]
+; CHECK-NEXT:    vmov.f32 s3, s26
+; CHECK-NEXT:    vstrw.32 q4, [r1, #80]
+; CHECK-NEXT:    vmov.f32 s12, s9
+; CHECK-NEXT:    vstrw.32 q0, [r1, #64]
+; CHECK-NEXT:    vmov.f32 s14, s8
+; CHECK-NEXT:    vstrw.32 q7, [r1]
+; CHECK-NEXT:    vmov.f32 s15, s10
+; CHECK-NEXT:    vstrw.32 q3, [r1, #16]
 ; CHECK-NEXT:    add sp, #48
 ; CHECK-NEXT:    vpop {d8, d9, d10, d11, d12, d13, d14, d15}
 ; CHECK-NEXT:    bx lr
@@ -628,13 +619,13 @@ define void @vst3_v8i8(ptr %src, ptr %dst) {
 ; CHECK-NEXT:    vmovx.f16 s2, s7
 ; CHECK-NEXT:    vmovx.f16 s3, s11
 ; CHECK-NEXT:    vins.f16 s3, s2
-; CHECK-NEXT:    vmovx.f16 s16, s1
+; CHECK-NEXT:    vmovx.f16 s2, s1
 ; CHECK-NEXT:    vmov.f32 s1, s15
-; CHECK-NEXT:    vmovx.f16 s18, s15
+; CHECK-NEXT:    vmovx.f16 s16, s15
+; CHECK-NEXT:    vins.f16 s1, s2
 ; CHECK-NEXT:    vmov.f32 s2, s7
 ; CHECK-NEXT:    vmov.u16 r0, q3[0]
-; CHECK-NEXT:    vins.f16 s1, s16
-; CHECK-NEXT:    vins.f16 s2, s18
+; CHECK-NEXT:    vins.f16 s2, s16
 ; CHECK-NEXT:    vmov.8 q4[0], r0
 ; CHECK-NEXT:    vmov.u16 r0, q2[0]
 ; CHECK-NEXT:    vmov.8 q4[1], r0
@@ -1183,16 +1174,18 @@ define void @vst3_v2f16(ptr %src, ptr %dst) {
 ; CHECK:       @ %bb.0: @ %entry
 ; CHECK-NEXT:    ldrd r2, r3, [r0]
 ; CHECK-NEXT:    ldr r0, [r0, #8]
-; CHECK-NEXT:    vmov.32 q0[0], r2
+; CHECK-NEXT:    vmov.32 q1[0], r2
+; CHECK-NEXT:    vmov q0, q1
+; CHECK-NEXT:    vmov.32 q2[0], r0
 ; CHECK-NEXT:    vmov.32 q0[1], r3
-; CHECK-NEXT:    vmov.32 q1[0], r0
-; CHECK-NEXT:    vmovx.f16 s2, s0
-; CHECK-NEXT:    vmovx.f16 s6, s4
-; CHECK-NEXT:    vins.f16 s4, s2
+; CHECK-NEXT:    vmovx.f16 s2, s4
+; CHECK-NEXT:    vmov.f32 s0, s4
+; CHECK-NEXT:    vmovx.f16 s4, s8
+; CHECK-NEXT:    vins.f16 s8, s2
 ; CHECK-NEXT:    vmovx.f16 s2, s1
 ; CHECK-NEXT:    vins.f16 s0, s1
-; CHECK-NEXT:    vmov.f32 s1, s4
-; CHECK-NEXT:    vins.f16 s2, s6
+; CHECK-NEXT:    vmov.f32 s1, s8
+; CHECK-NEXT:    vins.f16 s2, s4
 ; CHECK-NEXT:    vmov r3, s2
 ; CHECK-NEXT:    vmov r0, r2, d0
 ; CHECK-NEXT:    stm r1!, {r0, r2, r3}
@@ -1219,25 +1212,27 @@ define void @vst3_v4f16(ptr %src, ptr %dst) {
 ; CHECK-NEXT:    ldrd r3, r2, [r0, #8]
 ; CHECK-NEXT:    ldrd r4, r0, [r0, #16]
 ; CHECK-NEXT:    vmov q0[2], q0[0], lr, r3
-; CHECK-NEXT:    vmov.32 q1[0], r4
+; CHECK-NEXT:    vmov.32 q2[0], r4
 ; CHECK-NEXT:    vmov q0[3], q0[1], r12, r2
-; CHECK-NEXT:    vmov.32 q1[1], r0
+; CHECK-NEXT:    vmov q1, q2
 ; CHECK-NEXT:    vmovx.f16 s9, s3
-; CHECK-NEXT:    vmovx.f16 s6, s0
+; CHECK-NEXT:    vmov.32 q1[1], r0
+; CHECK-NEXT:    vmovx.f16 s4, s0
+; CHECK-NEXT:    vmov.f32 s6, s8
 ; CHECK-NEXT:    vins.f16 s0, s2
-; CHECK-NEXT:    vmovx.f16 s8, s4
+; CHECK-NEXT:    vins.f16 s6, s4
+; CHECK-NEXT:    vmovx.f16 s4, s8
 ; CHECK-NEXT:    vmovx.f16 s2, s2
-; CHECK-NEXT:    vins.f16 s4, s6
-; CHECK-NEXT:    vmovx.f16 s6, s1
-; CHECK-NEXT:    vins.f16 s2, s8
+; CHECK-NEXT:    vins.f16 s2, s4
+; CHECK-NEXT:    vmovx.f16 s4, s1
 ; CHECK-NEXT:    vmovx.f16 s8, s5
-; CHECK-NEXT:    vins.f16 s5, s6
+; CHECK-NEXT:    vins.f16 s5, s4
 ; CHECK-NEXT:    vins.f16 s9, s8
 ; CHECK-NEXT:    vmov.f32 s8, s5
 ; CHECK-NEXT:    vins.f16 s1, s3
 ; CHECK-NEXT:    vmov r0, r2, d4
 ; CHECK-NEXT:    vmov q2, q0
-; CHECK-NEXT:    vmov.f32 s9, s4
+; CHECK-NEXT:    vmov.f32 s9, s6
 ; CHECK-NEXT:    vmov.f32 s10, s2
 ; CHECK-NEXT:    vmov.f32 s11, s1
 ; CHECK-NEXT:    vstrw.32 q2, [r1]
@@ -1259,58 +1254,58 @@ entry:
 define void @vst3_v8f16(ptr %src, ptr %dst) {
 ; CHECK-LABEL: vst3_v8f16:
 ; CHECK:       @ %bb.0: @ %entry
-; CHECK-NEXT:    .vsave {d8, d9}
-; CHECK-NEXT:    vpush {d8, d9}
-; CHECK-NEXT:    vldrw.u32 q4, [r0, #16]
-; CHECK-NEXT:    vldrw.u32 q3, [r0]
-; CHECK-NEXT:    vmovx.f16 s0, s18
-; CHECK-NEXT:    vmov.f32 s4, s15
+; CHECK-NEXT:    .vsave {d8, d9, d10, d11}
+; CHECK-NEXT:    vpush {d8, d9, d10, d11}
+; CHECK-NEXT:    vldrw.u32 q5, [r0, #16]
+; CHECK-NEXT:    vldrw.u32 q1, [r0]
+; CHECK-NEXT:    vmovx.f16 s0, s22
+; CHECK-NEXT:    vmov.f32 s8, s7
 ; CHECK-NEXT:    vmov r2, s0
-; CHECK-NEXT:    vins.f16 s4, s19
+; CHECK-NEXT:    vins.f16 s8, s23
 ; CHECK-NEXT:    vmov.16 q0[0], r2
-; CHECK-NEXT:    vmovx.f16 s10, s16
-; CHECK-NEXT:    vmov.f32 s1, s4
-; CHECK-NEXT:    vmovx.f16 s4, s19
-; CHECK-NEXT:    vmov r2, s4
-; CHECK-NEXT:    vldrw.u32 q1, [r0, #32]
+; CHECK-NEXT:    vmov.f32 s16, s4
+; CHECK-NEXT:    vmov.f32 s1, s8
+; CHECK-NEXT:    vmovx.f16 s8, s23
+; CHECK-NEXT:    vmov r2, s8
+; CHECK-NEXT:    vldrw.u32 q2, [r0, #32]
 ; CHECK-NEXT:    vmov.16 q0[6], r2
-; CHECK-NEXT:    vmov r0, s10
-; CHECK-NEXT:    vmovx.f16 s8, s6
-; CHECK-NEXT:    vmovx.f16 s2, s15
-; CHECK-NEXT:    vins.f16 s0, s8
-; CHECK-NEXT:    vmovx.f16 s8, s7
-; CHECK-NEXT:    vins.f16 s3, s8
-; CHECK-NEXT:    vmov.f32 s8, s12
-; CHECK-NEXT:    vins.f16 s8, s16
-; CHECK-NEXT:    vins.f16 s7, s2
-; CHECK-NEXT:    vmov.f32 s2, s13
-; CHECK-NEXT:    vmov.16 q2[4], r0
-; CHECK-NEXT:    vins.f16 s2, s17
-; CHECK-NEXT:    vmov.f32 s11, s2
-; CHECK-NEXT:    vmovx.f16 s2, s12
-; CHECK-NEXT:    vmovx.f16 s12, s4
-; CHECK-NEXT:    vins.f16 s4, s2
-; CHECK-NEXT:    vins.f16 s10, s12
-; CHECK-NEXT:    vmovx.f16 s12, s17
-; CHECK-NEXT:    vmov.f32 s2, s14
-; CHECK-NEXT:    vmov r0, s12
-; CHECK-NEXT:    vins.f16 s2, s18
-; CHECK-NEXT:    vmov.16 q4[2], r0
-; CHECK-NEXT:    vmovx.f16 s12, s5
-; CHECK-NEXT:    vmov.f32 s18, s2
-; CHECK-NEXT:    vmovx.f16 s2, s13
-; CHECK-NEXT:    vins.f16 s5, s2
-; CHECK-NEXT:    vmovx.f16 s2, s14
-; CHECK-NEXT:    vins.f16 s6, s2
-; CHECK-NEXT:    vmov.f32 s2, s7
-; CHECK-NEXT:    vmov.f32 s9, s4
-; CHECK-NEXT:    vins.f16 s17, s12
-; CHECK-NEXT:    vmov.f32 s16, s5
+; CHECK-NEXT:    vins.f16 s16, s20
+; CHECK-NEXT:    vmovx.f16 s2, s10
+; CHECK-NEXT:    vmovx.f16 s12, s11
+; CHECK-NEXT:    vins.f16 s0, s2
+; CHECK-NEXT:    vmovx.f16 s2, s7
+; CHECK-NEXT:    vins.f16 s11, s2
+; CHECK-NEXT:    vmovx.f16 s2, s20
+; CHECK-NEXT:    vmov r0, s2
+; CHECK-NEXT:    vmovx.f16 s2, s4
+; CHECK-NEXT:    vmovx.f16 s4, s8
+; CHECK-NEXT:    vins.f16 s8, s2
+; CHECK-NEXT:    vmovx.f16 s2, s21
+; CHECK-NEXT:    vmov.16 q4[4], r0
+; CHECK-NEXT:    vmov r0, s2
+; CHECK-NEXT:    vmovx.f16 s2, s5
+; CHECK-NEXT:    vins.f16 s3, s12
+; CHECK-NEXT:    vmov.f32 s12, s5
+; CHECK-NEXT:    vins.f16 s18, s4
+; CHECK-NEXT:    vmov.f32 s4, s6
+; CHECK-NEXT:    vmovx.f16 s14, s9
+; CHECK-NEXT:    vins.f16 s9, s2
+; CHECK-NEXT:    vmovx.f16 s2, s6
+; CHECK-NEXT:    vins.f16 s12, s21
+; CHECK-NEXT:    vins.f16 s4, s22
+; CHECK-NEXT:    vmov.16 q5[2], r0
+; CHECK-NEXT:    vins.f16 s10, s2
+; CHECK-NEXT:    vmov.f32 s2, s11
+; CHECK-NEXT:    vins.f16 s21, s14
+; CHECK-NEXT:    vmov.f32 s20, s9
+; CHECK-NEXT:    vmov.f32 s22, s4
 ; CHECK-NEXT:    vstrw.32 q0, [r1, #32]
-; CHECK-NEXT:    vmov.f32 s19, s6
-; CHECK-NEXT:    vstrw.32 q2, [r1]
-; CHECK-NEXT:    vstrw.32 q4, [r1, #16]
-; CHECK-NEXT:    vpop {d8, d9}
+; CHECK-NEXT:    vmov.f32 s23, s10
+; CHECK-NEXT:    vmov.f32 s17, s8
+; CHECK-NEXT:    vstrw.32 q5, [r1, #16]
+; CHECK-NEXT:    vmov.f32 s19, s12
+; CHECK-NEXT:    vstrw.32 q4, [r1]
+; CHECK-NEXT:    vpop {d8, d9, d10, d11}
 ; CHECK-NEXT:    bx lr
 entry:
   %l1 = load <8 x half>, ptr %src, align 4
@@ -1330,128 +1325,118 @@ define void @vst3_v16f16(ptr %src, ptr %dst) {
 ; CHECK:       @ %bb.0: @ %entry
 ; CHECK-NEXT:    .vsave {d8, d9, d10, d11, d12, d13, d14, d15}
 ; CHECK-NEXT:    vpush {d8, d9, d10, d11, d12, d13, d14, d15}
-; CHECK-NEXT:    .pad #48
-; CHECK-NEXT:    sub sp, #48
-; CHECK-NEXT:    vldrw.u32 q3, [r0, #16]
+; CHECK-NEXT:    .pad #72
+; CHECK-NEXT:    sub sp, #72
+; CHECK-NEXT:    vldrw.u32 q5, [r0, #16]
 ; CHECK-NEXT:    vldrw.u32 q1, [r0, #48]
-; CHECK-NEXT:    vldrw.u32 q6, [r0, #32]
-; CHECK-NEXT:    vmov.f32 s8, s12
-; CHECK-NEXT:    vmovx.f16 s2, s4
-; CHECK-NEXT:    vmov.f32 s0, s13
-; CHECK-NEXT:    vins.f16 s8, s4
-; CHECK-NEXT:    vmov r2, s2
-; CHECK-NEXT:    vins.f16 s0, s5
-; CHECK-NEXT:    vmov.16 q2[4], r2
-; CHECK-NEXT:    vmov q4, q3
-; CHECK-NEXT:    vmov.f32 s11, s0
-; CHECK-NEXT:    vmovx.f16 s0, s16
-; CHECK-NEXT:    vmov.f32 s12, s8
-; CHECK-NEXT:    vmov.f64 d11, d9
-; CHECK-NEXT:    vmov.f32 s21, s17
-; CHECK-NEXT:    vmov.f64 d7, d5
 ; CHECK-NEXT:    vldrw.u32 q2, [r0, #80]
+; CHECK-NEXT:    vldrw.u32 q7, [r0, #32]
+; CHECK-NEXT:    vmov.f32 s0, s20
+; CHECK-NEXT:    vldrw.u32 q4, [r0, #64]
+; CHECK-NEXT:    vins.f16 s0, s4
 ; CHECK-NEXT:    vmovx.f16 s2, s8
+; CHECK-NEXT:    vmov.f32 s12, s0
+; CHECK-NEXT:    vmov.f32 s0, s21
+; CHECK-NEXT:    vins.f16 s0, s5
+; CHECK-NEXT:    vstr s0, [sp, #68] @ 4-byte Spill
+; CHECK-NEXT:    vmovx.f16 s0, s4
+; CHECK-NEXT:    vmov r2, s0
+; CHECK-NEXT:    vmovx.f16 s0, s20
+; CHECK-NEXT:    vmov.16 q3[4], r2
 ; CHECK-NEXT:    vins.f16 s8, s0
 ; CHECK-NEXT:    vins.f16 s14, s2
-; CHECK-NEXT:    vmovx.f16 s2, s24
-; CHECK-NEXT:    vstrw.32 q3, [sp, #16] @ 16-byte Spill
-; CHECK-NEXT:    vldrw.u32 q3, [r0]
-; CHECK-NEXT:    vmov r2, s2
-; CHECK-NEXT:    vmov.f32 s16, s12
-; CHECK-NEXT:    vmov.f32 s0, s13
-; CHECK-NEXT:    vins.f16 s16, s24
-; CHECK-NEXT:    vmov.16 q4[4], r2
-; CHECK-NEXT:    vins.f16 s0, s25
-; CHECK-NEXT:    vmov.f32 s19, s0
-; CHECK-NEXT:    vmovx.f16 s0, s12
-; CHECK-NEXT:    vmov.f64 d15, d13
-; CHECK-NEXT:    vmov.f32 s17, s13
-; CHECK-NEXT:    vmov.f32 s24, s16
-; CHECK-NEXT:    vmov.f64 d13, d9
-; CHECK-NEXT:    vmov.f64 d9, d7
-; CHECK-NEXT:    vldrw.u32 q3, [r0, #64]
-; CHECK-NEXT:    vmovx.f16 s2, s12
-; CHECK-NEXT:    vins.f16 s12, s0
-; CHECK-NEXT:    vins.f16 s26, s2
+; CHECK-NEXT:    vldrw.u32 q0, [r0]
+; CHECK-NEXT:    vstrw.32 q3, [sp, #48] @ 16-byte Spill
+; CHECK-NEXT:    vmov.f32 s12, s0
+; CHECK-NEXT:    vmov q6, q0
+; CHECK-NEXT:    vmovx.f16 s0, s28
+; CHECK-NEXT:    vins.f16 s12, s28
+; CHECK-NEXT:    vmov r2, s0
+; CHECK-NEXT:    vmovx.f16 s2, s16
+; CHECK-NEXT:    vmov.16 q3[4], r2
+; CHECK-NEXT:    vmovx.f16 s0, s24
+; CHECK-NEXT:    vins.f16 s14, s2
 ; CHECK-NEXT:    vmovx.f16 s2, s30
-; CHECK-NEXT:    vmov.f32 s0, s19
-; CHECK-NEXT:    vstrw.32 q6, [sp, #32] @ 16-byte Spill
+; CHECK-NEXT:    vins.f16 s16, s0
+; CHECK-NEXT:    vmov.f32 s0, s27
 ; CHECK-NEXT:    vmov r0, s2
+; CHECK-NEXT:    vstrw.32 q3, [sp, #32] @ 16-byte Spill
 ; CHECK-NEXT:    vins.f16 s0, s31
-; CHECK-NEXT:    vmov.f32 s29, s25
-; CHECK-NEXT:    vmov.16 q6[0], r0
-; CHECK-NEXT:    vmov.f32 s25, s0
+; CHECK-NEXT:    vmov.16 q3[0], r0
+; CHECK-NEXT:    vmov.f32 s13, s0
 ; CHECK-NEXT:    vmovx.f16 s0, s31
 ; CHECK-NEXT:    vmov r0, s0
-; CHECK-NEXT:    vmovx.f16 s0, s14
-; CHECK-NEXT:    vmov.16 q6[6], r0
-; CHECK-NEXT:    vmovx.f16 s2, s15
-; CHECK-NEXT:    vins.f16 s24, s0
-; CHECK-NEXT:    vmovx.f16 s0, s19
-; CHECK-NEXT:    vins.f16 s15, s0
+; CHECK-NEXT:    vmovx.f16 s0, s18
+; CHECK-NEXT:    vmov.16 q3[6], r0
+; CHECK-NEXT:    vmov.f32 s4, s1
+; CHECK-NEXT:    vins.f16 s12, s0
+; CHECK-NEXT:    vmovx.f16 s0, s27
+; CHECK-NEXT:    vmovx.f16 s2, s19
+; CHECK-NEXT:    vins.f16 s19, s0
+; CHECK-NEXT:    vins.f16 s4, s29
 ; CHECK-NEXT:    vmovx.f16 s0, s6
+; CHECK-NEXT:    vstr s4, [sp, #4] @ 4-byte Spill
 ; CHECK-NEXT:    vmov.f32 s4, s23
-; CHECK-NEXT:    vins.f16 s27, s2
 ; CHECK-NEXT:    vmov r0, s0
+; CHECK-NEXT:    vins.f16 s15, s2
 ; CHECK-NEXT:    vins.f16 s4, s7
 ; CHECK-NEXT:    vmov.16 q0[0], r0
-; CHECK-NEXT:    vstrw.32 q7, [sp] @ 16-byte Spill
 ; CHECK-NEXT:    vmov.f32 s1, s4
 ; CHECK-NEXT:    vmovx.f16 s4, s7
 ; CHECK-NEXT:    vmov r0, s4
-; CHECK-NEXT:    vmovx.f16 s4, s10
-; CHECK-NEXT:    vmov.16 q0[6], r0
-; CHECK-NEXT:    vldrw.u32 q7, [sp, #32] @ 16-byte Reload
-; CHECK-NEXT:    vins.f16 s0, s4
 ; CHECK-NEXT:    vmovx.f16 s4, s11
-; CHECK-NEXT:    vmovx.f16 s2, s23
+; CHECK-NEXT:    vmov.16 q0[6], r0
+; CHECK-NEXT:    vstrw.32 q6, [sp, #8] @ 16-byte Spill
+; CHECK-NEXT:    vmovx.f16 s2, s10
 ; CHECK-NEXT:    vins.f16 s3, s4
-; CHECK-NEXT:    vmovx.f16 s4, s5
+; CHECK-NEXT:    vins.f16 s0, s2
+; CHECK-NEXT:    vmovx.f16 s2, s23
 ; CHECK-NEXT:    vins.f16 s11, s2
 ; CHECK-NEXT:    vmov.f32 s2, s22
-; CHECK-NEXT:    vmov r0, s4
 ; CHECK-NEXT:    vins.f16 s2, s6
+; CHECK-NEXT:    vldrw.u32 q6, [sp, #32] @ 16-byte Reload
+; CHECK-NEXT:    vstr s2, [sp, #28] @ 4-byte Spill
+; CHECK-NEXT:    vmovx.f16 s2, s5
+; CHECK-NEXT:    vmov r0, s2
+; CHECK-NEXT:    vldr s27, [sp, #4] @ 4-byte Reload
 ; CHECK-NEXT:    vmov.16 q1[2], r0
-; CHECK-NEXT:    vmov.f32 s29, s12
+; CHECK-NEXT:    vmov.f32 s2, s11
 ; CHECK-NEXT:    vmovx.f16 s4, s21
-; CHECK-NEXT:    vmovx.f16 s12, s9
+; CHECK-NEXT:    vmovx.f16 s6, s9
 ; CHECK-NEXT:    vins.f16 s9, s4
 ; CHECK-NEXT:    vmovx.f16 s4, s22
+; CHECK-NEXT:    vldrw.u32 q5, [sp, #8] @ 16-byte Reload
 ; CHECK-NEXT:    vins.f16 s10, s4
-; CHECK-NEXT:    vmov.f32 s21, s17
-; CHECK-NEXT:    vmov.f32 s22, s18
-; CHECK-NEXT:    vins.f16 s5, s12
-; CHECK-NEXT:    vmov.f32 s4, s18
-; CHECK-NEXT:    vldrw.u32 q4, [sp] @ 16-byte Reload
-; CHECK-NEXT:    vstrw.32 q7, [sp, #32] @ 16-byte Spill
-; CHECK-NEXT:    vmov.f32 s6, s2
-; CHECK-NEXT:    vmovx.f16 s12, s17
-; CHECK-NEXT:    vins.f16 s4, s18
-; CHECK-NEXT:    vmov r0, s12
-; CHECK-NEXT:    vmovx.f16 s12, s13
-; CHECK-NEXT:    vmov.16 q7[2], r0
-; CHECK-NEXT:    vmov.f32 s2, s11
-; CHECK-NEXT:    vmov.f32 s30, s4
+; CHECK-NEXT:    vmovx.f16 s4, s29
+; CHECK-NEXT:    vins.f16 s5, s6
+; CHECK-NEXT:    vmov.f32 s11, s22
+; CHECK-NEXT:    vmovx.f16 s6, s17
+; CHECK-NEXT:    vmov r0, s4
 ; CHECK-NEXT:    vmovx.f16 s4, s21
-; CHECK-NEXT:    vins.f16 s13, s4
+; CHECK-NEXT:    vins.f16 s11, s30
+; CHECK-NEXT:    vmov.16 q7[2], r0
+; CHECK-NEXT:    vins.f16 s17, s4
 ; CHECK-NEXT:    vmovx.f16 s4, s22
-; CHECK-NEXT:    vins.f16 s14, s4
-; CHECK-NEXT:    vldrw.u32 q5, [sp, #16] @ 16-byte Reload
+; CHECK-NEXT:    vldrw.u32 q5, [sp, #48] @ 16-byte Reload
+; CHECK-NEXT:    vins.f16 s18, s4
+; CHECK-NEXT:    vins.f16 s29, s6
+; CHECK-NEXT:    vldr s23, [sp, #68] @ 4-byte Reload
+; CHECK-NEXT:    vldr s6, [sp, #28] @ 4-byte Reload
+; CHECK-NEXT:    vmov.f32 s25, s16
+; CHECK-NEXT:    vmov.f32 s14, s19
 ; CHECK-NEXT:    vstrw.32 q0, [r1, #80]
-; CHECK-NEXT:    vldrw.u32 q0, [sp, #32] @ 16-byte Reload
-; CHECK-NEXT:    vmov.f32 s26, s15
-; CHECK-NEXT:    vins.f16 s29, s12
 ; CHECK-NEXT:    vmov.f32 s21, s8
-; CHECK-NEXT:    vstrw.32 q6, [r1, #32]
+; CHECK-NEXT:    vstrw.32 q3, [r1, #32]
 ; CHECK-NEXT:    vmov.f32 s4, s9
 ; CHECK-NEXT:    vstrw.32 q5, [r1, #48]
 ; CHECK-NEXT:    vmov.f32 s7, s10
-; CHECK-NEXT:    vstrw.32 q0, [r1]
-; CHECK-NEXT:    vmov.f32 s28, s13
+; CHECK-NEXT:    vstrw.32 q6, [r1]
+; CHECK-NEXT:    vmov.f32 s28, s17
 ; CHECK-NEXT:    vstrw.32 q1, [r1, #64]
-; CHECK-NEXT:    vmov.f32 s31, s14
+; CHECK-NEXT:    vmov.f32 s30, s11
+; CHECK-NEXT:    vmov.f32 s31, s18
 ; CHECK-NEXT:    vstrw.32 q7, [r1, #16]
-; CHECK-NEXT:    add sp, #48
+; CHECK-NEXT:    add sp, #72
 ; CHECK-NEXT:    vpop {d8, d9, d10, d11, d12, d13, d14, d15}
 ; CHECK-NEXT:    bx lr
 entry:
diff --git a/llvm/test/CodeGen/Thumb2/mve-vst4.ll b/llvm/test/CodeGen/Thumb2/mve-vst4.ll
index b36904495e878d..9dba1cc2d6ed76 100644
--- a/llvm/test/CodeGen/Thumb2/mve-vst4.ll
+++ b/llvm/test/CodeGen/Thumb2/mve-vst4.ll
@@ -397,55 +397,51 @@ define void @vst4_v8i16_align1(ptr %src, ptr %dst) {
 ; CHECK-NEXT:    vpush {d8, d9, d10, d11, d12, d13}
 ; CHECK-NEXT:    vldrw.u32 q1, [r0, #32]
 ; CHECK-NEXT:    vldrw.u32 q2, [r0, #48]
-; CHECK-NEXT:    vldrw.u32 q4, [r0, #16]
-; CHECK-NEXT:    vmovx.f16 s12, s5
-; CHECK-NEXT:    vmovx.f16 s0, s9
-; CHECK-NEXT:    vins.f16 s5, s9
-; CHECK-NEXT:    vins.f16 s12, s0
-; CHECK-NEXT:    vmov q0, q1
-; CHECK-NEXT:    vmovx.f16 s27, s4
-; CHECK-NEXT:    vins.f16 s4, s8
-; CHECK-NEXT:    vmov.f32 s3, s12
 ; CHECK-NEXT:    vldrw.u32 q3, [r0]
-; CHECK-NEXT:    vmov.f32 s5, s4
-; CHECK-NEXT:    vmovx.f16 s8, s8
-; CHECK-NEXT:    vmovx.f16 s0, s17
-; CHECK-NEXT:    vmovx.f16 s2, s13
-; CHECK-NEXT:    vins.f16 s27, s8
-; CHECK-NEXT:    vmovx.f16 s4, s12
-; CHECK-NEXT:    vmovx.f16 s8, s16
-; CHECK-NEXT:    vins.f16 s13, s17
-; CHECK-NEXT:    vins.f16 s12, s16
-; CHECK-NEXT:    vmov q5, q3
+; CHECK-NEXT:    vldrw.u32 q4, [r0, #16]
+; CHECK-NEXT:    vmovx.f16 s23, s4
 ; CHECK-NEXT:    vins.f16 s4, s8
-; CHECK-NEXT:    vmov.f32 s22, s4
-; CHECK-NEXT:    vmovx.f16 s4, s11
-; CHECK-NEXT:    vmov.f32 s23, s27
+; CHECK-NEXT:    vmov.f32 s21, s4
+; CHECK-NEXT:    vmovx.f16 s22, s12
+; CHECK-NEXT:    vmovx.f16 s4, s16
 ; CHECK-NEXT:    vmovx.f16 s27, s7
+; CHECK-NEXT:    vins.f16 s22, s4
+; CHECK-NEXT:    vmovx.f16 s4, s11
 ; CHECK-NEXT:    vins.f16 s7, s11
 ; CHECK-NEXT:    vins.f16 s27, s4
 ; CHECK-NEXT:    vmovx.f16 s26, s15
 ; CHECK-NEXT:    vmovx.f16 s4, s19
+; CHECK-NEXT:    vmovx.f16 s3, s5
+; CHECK-NEXT:    vins.f16 s5, s9
+; CHECK-NEXT:    vmovx.f16 s0, s9
 ; CHECK-NEXT:    vmov.f32 s25, s7
 ; CHECK-NEXT:    vins.f16 s26, s4
 ; CHECK-NEXT:    vmovx.f16 s7, s6
 ; CHECK-NEXT:    vmovx.f16 s4, s10
 ; CHECK-NEXT:    vins.f16 s6, s10
-; CHECK-NEXT:    vmov.f32 s21, s5
+; CHECK-NEXT:    vins.f16 s3, s0
+; CHECK-NEXT:    vmovx.f16 s2, s13
+; CHECK-NEXT:    vmovx.f16 s0, s17
+; CHECK-NEXT:    vmov.f32 s1, s5
+; CHECK-NEXT:    vmovx.f16 s8, s8
+; CHECK-NEXT:    vins.f16 s12, s16
 ; CHECK-NEXT:    vins.f16 s15, s19
 ; CHECK-NEXT:    vins.f16 s7, s4
 ; CHECK-NEXT:    vmov.f32 s5, s6
 ; CHECK-NEXT:    vmovx.f16 s6, s14
 ; CHECK-NEXT:    vmovx.f16 s4, s18
+; CHECK-NEXT:    vins.f16 s13, s17
 ; CHECK-NEXT:    vins.f16 s14, s18
 ; CHECK-NEXT:    vins.f16 s2, s0
 ; CHECK-NEXT:    vmov.f32 s0, s13
-; CHECK-NEXT:    vmov.f32 s24, s15
 ; CHECK-NEXT:    vins.f16 s6, s4
-; CHECK-NEXT:    vmov.f32 s4, s14
+; CHECK-NEXT:    vmov.f32 s24, s15
+; CHECK-NEXT:    vins.f16 s23, s8
+; CHECK-NEXT:    vmov.f32 s20, s12
 ; CHECK-NEXT:    vstrb.8 q6, [r1, #48]
-; CHECK-NEXT:    vstrb.8 q1, [r1, #32]
+; CHECK-NEXT:    vmov.f32 s4, s14
 ; CHECK-NEXT:    vstrb.8 q0, [r1, #16]
+; CHECK-NEXT:    vstrb.8 q1, [r1, #32]
 ; CHECK-NEXT:    vstrb.8 q5, [r1]
 ; CHECK-NEXT:    vpop {d8, d9, d10, d11, d12, d13}
 ; CHECK-NEXT:    bx lr
@@ -1002,22 +998,28 @@ entry:
 define void @vst4_v2f16(ptr %src, ptr %dst) {
 ; CHECK-LABEL: vst4_v2f16:
 ; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    .vsave {d8, d9}
+; CHECK-NEXT:    vpush {d8, d9}
 ; CHECK-NEXT:    ldm.w r0, {r2, r3, r12}
-; CHECK-NEXT:    vmov.32 q1[0], r12
+; CHECK-NEXT:    vmov.32 q3[0], r12
+; CHECK-NEXT:    vmov q4, q3
+; CHECK-NEXT:    vmov.f32 s1, s12
 ; CHECK-NEXT:    ldr r0, [r0, #12]
-; CHECK-NEXT:    vmov.32 q0[0], r2
-; CHECK-NEXT:    vmov.32 q0[1], r3
-; CHECK-NEXT:    vmov.32 q1[1], r0
-; CHECK-NEXT:    vmovx.f16 s2, s0
-; CHECK-NEXT:    vmovx.f16 s6, s1
-; CHECK-NEXT:    vmovx.f16 s3, s4
+; CHECK-NEXT:    vmovx.f16 s3, s12
+; CHECK-NEXT:    vmov.32 q1[0], r2
+; CHECK-NEXT:    vmov q2, q1
+; CHECK-NEXT:    vmov.32 q4[1], r0
+; CHECK-NEXT:    vmov.32 q2[1], r3
+; CHECK-NEXT:    vmov.f32 s0, s4
+; CHECK-NEXT:    vmovx.f16 s2, s4
+; CHECK-NEXT:    vmovx.f16 s6, s9
+; CHECK-NEXT:    vmovx.f16 s4, s17
+; CHECK-NEXT:    vins.f16 s0, s9
+; CHECK-NEXT:    vins.f16 s1, s17
 ; CHECK-NEXT:    vins.f16 s2, s6
-; CHECK-NEXT:    vmovx.f16 s6, s5
-; CHECK-NEXT:    vins.f16 s4, s5
-; CHECK-NEXT:    vins.f16 s0, s1
-; CHECK-NEXT:    vins.f16 s3, s6
-; CHECK-NEXT:    vmov.f32 s1, s4
+; CHECK-NEXT:    vins.f16 s3, s4
 ; CHECK-NEXT:    vstrh.16 q0, [r1]
+; CHECK-NEXT:    vpop {d8, d9}
 ; CHECK-NEXT:    bx lr
 entry:
   %l1 = load <2 x half>, ptr %src, align 4
@@ -1155,61 +1157,57 @@ entry:
 define void @vst4_v8f16_align1(ptr %src, ptr %dst) {
 ; CHECK-LABEL: vst4_v8f16_align1:
 ; CHECK:       @ %bb.0: @ %entry
-; CHECK-NEXT:    .vsave {d9, d10, d11, d12, d13}
-; CHECK-NEXT:    vpush {d9, d10, d11, d12, d13}
+; CHECK-NEXT:    .vsave {d8, d9, d10, d11, d12, d13}
+; CHECK-NEXT:    vpush {d8, d9, d10, d11, d12, d13}
 ; CHECK-NEXT:    vldrw.u32 q1, [r0, #32]
 ; CHECK-NEXT:    vldrw.u32 q5, [r0, #48]
-; CHECK-NEXT:    vldrw.u32 q6, [r0, #16]
 ; CHECK-NEXT:    vldrw.u32 q2, [r0]
-; CHECK-NEXT:    vmovx.f16 s0, s5
-; CHECK-NEXT:    vmovx.f16 s2, s21
-; CHECK-NEXT:    vins.f16 s0, s2
+; CHECK-NEXT:    vldrw.u32 q6, [r0, #16]
+; CHECK-NEXT:    vmovx.f16 s3, s5
+; CHECK-NEXT:    vmovx.f16 s0, s21
+; CHECK-NEXT:    vins.f16 s3, s0
 ; CHECK-NEXT:    vmovx.f16 s2, s9
-; CHECK-NEXT:    vmovx.f16 s12, s25
-; CHECK-NEXT:    vmovx.f16 s19, s4
-; CHECK-NEXT:    vins.f16 s2, s12
-; CHECK-NEXT:    vmovx.f16 s12, s20
-; CHECK-NEXT:    vins.f16 s19, s12
-; CHECK-NEXT:    vmovx.f16 s12, s8
-; CHECK-NEXT:    vmovx.f16 s14, s24
-; CHECK-NEXT:    vmovx.f16 s15, s7
-; CHECK-NEXT:    vins.f16 s12, s14
-; CHECK-NEXT:    vmovx.f16 s14, s23
-; CHECK-NEXT:    vins.f16 s15, s14
-; CHECK-NEXT:    vmovx.f16 s14, s11
-; CHECK-NEXT:    vmovx.f16 s1, s27
+; CHECK-NEXT:    vmovx.f16 s0, s25
+; CHECK-NEXT:    vmovx.f16 s15, s4
+; CHECK-NEXT:    vins.f16 s2, s0
+; CHECK-NEXT:    vmovx.f16 s0, s20
+; CHECK-NEXT:    vins.f16 s15, s0
+; CHECK-NEXT:    vmovx.f16 s14, s8
+; CHECK-NEXT:    vmovx.f16 s0, s24
+; CHECK-NEXT:    vmovx.f16 s19, s7
+; CHECK-NEXT:    vins.f16 s14, s0
+; CHECK-NEXT:    vmovx.f16 s0, s23
+; CHECK-NEXT:    vins.f16 s19, s0
+; CHECK-NEXT:    vmovx.f16 s18, s11
+; CHECK-NEXT:    vmovx.f16 s0, s27
 ; CHECK-NEXT:    vins.f16 s7, s23
-; CHECK-NEXT:    vins.f16 s14, s1
+; CHECK-NEXT:    vins.f16 s18, s0
 ; CHECK-NEXT:    vmovx.f16 s23, s6
-; CHECK-NEXT:    vmovx.f16 s1, s22
-; CHECK-NEXT:    vins.f16 s6, s22
+; CHECK-NEXT:    vmovx.f16 s0, s22
 ; CHECK-NEXT:    vins.f16 s5, s21
+; CHECK-NEXT:    vins.f16 s6, s22
 ; CHECK-NEXT:    vins.f16 s4, s20
-; CHECK-NEXT:    vins.f16 s23, s1
+; CHECK-NEXT:    vins.f16 s8, s24
+; CHECK-NEXT:    vins.f16 s11, s27
+; CHECK-NEXT:    vins.f16 s23, s0
 ; CHECK-NEXT:    vmovx.f16 s22, s10
 ; CHECK-NEXT:    vins.f16 s10, s26
-; CHECK-NEXT:    vmovx.f16 s1, s26
+; CHECK-NEXT:    vmovx.f16 s0, s26
 ; CHECK-NEXT:    vins.f16 s9, s25
-; CHECK-NEXT:    vins.f16 s8, s24
-; CHECK-NEXT:    vins.f16 s11, s27
-; CHECK-NEXT:    vmov q6, q1
-; CHECK-NEXT:    vins.f16 s22, s1
-; CHECK-NEXT:    vmov.f32 s1, s25
-; CHECK-NEXT:    vmov q6, q2
-; CHECK-NEXT:    vmov.f32 s3, s0
+; CHECK-NEXT:    vins.f16 s22, s0
 ; CHECK-NEXT:    vmov.f32 s0, s9
-; CHECK-NEXT:    vmov.f32 s26, s12
+; CHECK-NEXT:    vmov.f32 s1, s5
+; CHECK-NEXT:    vmov.f32 s13, s4
 ; CHECK-NEXT:    vstrb.8 q0, [r1, #16]
-; CHECK-NEXT:    vmov.f32 s25, s4
-; CHECK-NEXT:    vmov.f32 s27, s19
-; CHECK-NEXT:    vmov.f32 s13, s7
-; CHECK-NEXT:    vstrb.8 q6, [r1]
-; CHECK-NEXT:    vmov.f32 s12, s11
+; CHECK-NEXT:    vmov.f32 s12, s8
+; CHECK-NEXT:    vmov.f32 s17, s7
+; CHECK-NEXT:    vstrb.8 q3, [r1]
+; CHECK-NEXT:    vmov.f32 s16, s11
 ; CHECK-NEXT:    vmov.f32 s21, s6
-; CHECK-NEXT:    vstrb.8 q3, [r1, #48]
+; CHECK-NEXT:    vstrb.8 q4, [r1, #48]
 ; CHECK-NEXT:    vmov.f32 s20, s10
 ; CHECK-NEXT:    vstrb.8 q5, [r1, #32]
-; CHECK-NEXT:    vpop {d9, d10, d11, d12, d13}
+; CHECK-NEXT:    vpop {d8, d9, d10, d11, d12, d13}
 ; CHECK-NEXT:    bx lr
 entry:
   %l1 = load <8 x half>, ptr %src, align 4
diff --git a/llvm/test/CodeGen/VE/Scalar/select.ll b/llvm/test/CodeGen/VE/Scalar/select.ll
index 184513a3f820bb..0a98b91e780204 100644
--- a/llvm/test/CodeGen/VE/Scalar/select.ll
+++ b/llvm/test/CodeGen/VE/Scalar/select.ll
@@ -352,9 +352,9 @@ define fp128 @select_quad_mimm(i1 zeroext %0, fp128 %1) {
 ; CHECK-NEXT:    and %s1, %s1, (32)0
 ; CHECK-NEXT:    lea.sl %s1, .LCPI{{[0-9]+}}_0@hi(, %s1)
 ; CHECK-NEXT:    ld %s4, 8(, %s1)
-; CHECK-NEXT:    ld %s5, (, %s1)
+; CHECK-NEXT:    ld %s1, (, %s1)
 ; CHECK-NEXT:    cmov.w.ne %s2, %s4, %s0
-; CHECK-NEXT:    cmov.w.ne %s3, %s5, %s0
+; CHECK-NEXT:    cmov.w.ne %s3, %s1, %s0
 ; CHECK-NEXT:    or %s0, 0, %s2
 ; CHECK-NEXT:    or %s1, 0, %s3
 ; CHECK-NEXT:    b.l.t (, %s10)
diff --git a/llvm/test/CodeGen/VE/Scalar/va_caller.ll b/llvm/test/CodeGen/VE/Scalar/va_caller.ll
index 3cffc5e6c9fcd1..421a419547d925 100644
--- a/llvm/test/CodeGen/VE/Scalar/va_caller.ll
+++ b/llvm/test/CodeGen/VE/Scalar/va_caller.ll
@@ -24,15 +24,15 @@ define i32 @caller() {
 ; CHECK-NEXT:    st %s1, 192(, %s11)
 ; CHECK-NEXT:    or %s1, 1, (0)1
 ; CHECK-NEXT:    st %s1, 184(, %s11)
-; CHECK-NEXT:    lea %s1, .LCPI{{[0-9]+}}_0@lo
-; CHECK-NEXT:    and %s1, %s1, (32)0
-; CHECK-NEXT:    lea.sl %s1, .LCPI{{[0-9]+}}_0@hi(, %s1)
-; CHECK-NEXT:    ld %s34, 8(, %s1)
-; CHECK-NEXT:    ld %s35, (, %s1)
 ; CHECK-NEXT:    st %s0, 176(, %s11)
-; CHECK-NEXT:    lea.sl %s0, 1086324736
-; CHECK-NEXT:    st %s0, 224(, %s11)
-; CHECK-NEXT:    st %s34, 280(, %s11)
+; CHECK-NEXT:    lea %s0, .LCPI{{[0-9]+}}_0@lo
+; CHECK-NEXT:    and %s0, %s0, (32)0
+; CHECK-NEXT:    lea.sl %s0,  .LCPI{{[0-9]+}}_0@hi(, %s0)
+; CHECK-NEXT:    ld %s2, 8(, %s0)
+; CHECK-NEXT:    lea.sl %s1, 1086324736
+; CHECK-NEXT:    st %s1, 224(, %s11)
+; CHECK-NEXT:    ld %s34, (, %s0)
+; CHECK-NEXT:    st %s2, 280(, %s11)
 ; CHECK-NEXT:    lea %s0, func@lo
 ; CHECK-NEXT:    and %s0, %s0, (32)0
 ; CHECK-NEXT:    lea.sl %s12, func@hi(, %s0)
@@ -44,10 +44,13 @@ define i32 @caller() {
 ; CHECK-NEXT:    or %s5, 5, (0)1
 ; CHECK-NEXT:    lea.sl %s6, 1086324736
 ; CHECK-NEXT:    or %s7, 0, (0)1
-; CHECK-NEXT:    st %s35, 272(, %s11)
+; CHECK-NEXT:    st %s34, 272(, %s11)
 ; CHECK-NEXT:    bsic %s10, (, %s12)
 ; CHECK-NEXT:    or %s0, 0, (0)1
 ; CHECK-NEXT:    or %s11, 0, %s9
+; CHECK-NEXT:    ld %s10, 8(, %s11)
+; CHECK-NEXT:    ld %s9, (, %s11)
+; CHECK-NEXT:    b.l.t (, %s10)
   call i32 (i32, ...) @func(i32 0, i16 1, i8 2, i32 3, i16 4, i8 5, float 6.0, ptr null, i64 8, double 9.0, i128 10, fp128 0xLA000000000000000)
   ret i32 0
 }
diff --git a/llvm/test/CodeGen/X86/AMX/amx-ldtilecfg-insert.ll b/llvm/test/CodeGen/X86/AMX/amx-ldtilecfg-insert.ll
index 8595024103dbd3..06cf968512db8b 100644
--- a/llvm/test/CodeGen/X86/AMX/amx-ldtilecfg-insert.ll
+++ b/llvm/test/CodeGen/X86/AMX/amx-ldtilecfg-insert.ll
@@ -230,7 +230,6 @@ exit:
 define dso_local void @test5(i16 signext %0, i16 signext %1) nounwind {
 ; CHECK-LABEL: test5:
 ; CHECK:       # %bb.0: # %entry
-; CHECK-NEXT:    # kill: def $esi killed $esi def $rsi
 ; CHECK-NEXT:    vxorps %xmm0, %xmm0, %xmm0
 ; CHECK-NEXT:    vmovups %zmm0, -{{[0-9]+}}(%rsp)
 ; CHECK-NEXT:    movb $1, -{{[0-9]+}}(%rsp)
@@ -239,7 +238,8 @@ define dso_local void @test5(i16 signext %0, i16 signext %1) nounwind {
 ; CHECK-NEXT:    xorl %eax, %eax
 ; CHECK-NEXT:    movl $buf, %ecx
 ; CHECK-NEXT:    movl $32, %edx
-; CHECK-NEXT:    leal -1(%rsi), %r8d
+; CHECK-NEXT:    movl %esi, %r8d
+; CHECK-NEXT:    decl %r8d
 ; CHECK-NEXT:    jmp .LBB4_1
 ; CHECK-NEXT:    .p2align 4
 ; CHECK-NEXT:  .LBB4_3: # %if.false
diff --git a/llvm/test/CodeGen/X86/avx512-calling-conv.ll b/llvm/test/CodeGen/X86/avx512-calling-conv.ll
index c27cced9d5ffa7..23b46ee59154fb 100644
--- a/llvm/test/CodeGen/X86/avx512-calling-conv.ll
+++ b/llvm/test/CodeGen/X86/avx512-calling-conv.ll
@@ -166,7 +166,7 @@ define <16 x i32> @test6(<16 x i32>%a, <16 x i32>%b) {
 ; KNL-NEXT:    pushq %rax
 ; KNL-NEXT:    .cfi_def_cfa_offset 16
 ; KNL-NEXT:    vpcmpgtd %zmm1, %zmm0, %k1
-; KNL-NEXT:    vpternlogd $255, %zmm0, %zmm0, %zmm0 {%k1} {z}
+; KNL-NEXT:    vpternlogd {{.*#+}} zmm0 {%k1} {z} = -1
 ; KNL-NEXT:    vpmovdb %zmm0, %xmm0
 ; KNL-NEXT:    callq _func16xi1
 ; KNL-NEXT:    vpmovzxbd {{.*#+}} zmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero,xmm0[8],zero,zero,zero,xmm0[9],zero,zero,zero,xmm0[10],zero,zero,zero,xmm0[11],zero,zero,zero,xmm0[12],zero,zero,zero,xmm0[13],zero,zero,zero,xmm0[14],zero,zero,zero,xmm0[15],zero,zero,zero
@@ -194,7 +194,7 @@ define <16 x i32> @test6(<16 x i32>%a, <16 x i32>%b) {
 ; KNL_X32-NEXT:    subl $12, %esp
 ; KNL_X32-NEXT:    .cfi_def_cfa_offset 16
 ; KNL_X32-NEXT:    vpcmpgtd %zmm1, %zmm0, %k1
-; KNL_X32-NEXT:    vpternlogd $255, %zmm0, %zmm0, %zmm0 {%k1} {z}
+; KNL_X32-NEXT:    vpternlogd {{.*#+}} zmm0 {%k1} {z} = -1
 ; KNL_X32-NEXT:    vpmovdb %zmm0, %xmm0
 ; KNL_X32-NEXT:    calll _func16xi1
 ; KNL_X32-NEXT:    vpmovzxbd {{.*#+}} zmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero,xmm0[8],zero,zero,zero,xmm0[9],zero,zero,zero,xmm0[10],zero,zero,zero,xmm0[11],zero,zero,zero,xmm0[12],zero,zero,zero,xmm0[13],zero,zero,zero,xmm0[14],zero,zero,zero,xmm0[15],zero,zero,zero
diff --git a/llvm/test/CodeGen/X86/avx512-ext.ll b/llvm/test/CodeGen/X86/avx512-ext.ll
index 79e59fdcf4a125..c60d9a3ff17d3a 100644
--- a/llvm/test/CodeGen/X86/avx512-ext.ll
+++ b/llvm/test/CodeGen/X86/avx512-ext.ll
@@ -1603,7 +1603,7 @@ define   <16 x i32> @zext_16i1_to_16xi32(i16 %b) {
 ; KNL-LABEL: zext_16i1_to_16xi32:
 ; KNL:       # %bb.0:
 ; KNL-NEXT:    kmovw %edi, %k1
-; KNL-NEXT:    vpternlogd $255, %zmm0, %zmm0, %zmm0 {%k1} {z}
+; KNL-NEXT:    vpternlogd {{.*#+}} zmm0 {%k1} {z} = -1
 ; KNL-NEXT:    vpsrld $31, %zmm0, %zmm0
 ; KNL-NEXT:    retq
 ;
@@ -1629,7 +1629,7 @@ define   <8 x i64> @zext_8i1_to_8xi64(i8 %b) {
 ; KNL-LABEL: zext_8i1_to_8xi64:
 ; KNL:       # %bb.0:
 ; KNL-NEXT:    kmovw %edi, %k1
-; KNL-NEXT:    vpternlogq $255, %zmm0, %zmm0, %zmm0 {%k1} {z}
+; KNL-NEXT:    vpternlogq {{.*#+}} zmm0 {%k1} {z} = -1
 ; KNL-NEXT:    vpsrlq $63, %zmm0, %zmm0
 ; KNL-NEXT:    retq
 ;
@@ -1747,14 +1747,14 @@ define <8 x i32> @sext_8i1_8i32(<8 x i32> %a1, <8 x i32> %a2) nounwind {
 ; KNL-LABEL: sext_8i1_8i32:
 ; KNL:       # %bb.0:
 ; KNL-NEXT:    vpcmpgtd %ymm0, %ymm1, %ymm0
-; KNL-NEXT:    vpternlogq $15, %zmm0, %zmm0, %zmm0
+; KNL-NEXT:    vpternlogq {{.*#+}} zmm0 = ~zmm0
 ; KNL-NEXT:    # kill: def $ymm0 killed $ymm0 killed $zmm0
 ; KNL-NEXT:    retq
 ;
 ; AVX512DQ-LABEL: sext_8i1_8i32:
 ; AVX512DQ:       # %bb.0:
 ; AVX512DQ-NEXT:    vpcmpgtd %ymm0, %ymm1, %ymm0
-; AVX512DQ-NEXT:    vpternlogq $15, %ymm0, %ymm0, %ymm0
+; AVX512DQ-NEXT:    vpternlogq {{.*#+}} ymm0 = ~ymm0
 ; AVX512DQ-NEXT:    retq
   %x = icmp slt <8 x i32> %a1, %a2
   %x1 = xor <8 x i1>%x, <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>
@@ -1840,7 +1840,7 @@ define <16 x i32> @sext_16i1_16i32(<16 x i32> %a1, <16 x i32> %a2) nounwind {
 ; KNL-LABEL: sext_16i1_16i32:
 ; KNL:       # %bb.0:
 ; KNL-NEXT:    vpcmpgtd %zmm0, %zmm1, %k1
-; KNL-NEXT:    vpternlogd $255, %zmm0, %zmm0, %zmm0 {%k1} {z}
+; KNL-NEXT:    vpternlogd {{.*#+}} zmm0 {%k1} {z} = -1
 ; KNL-NEXT:    retq
 ;
 ; AVX512DQ-LABEL: sext_16i1_16i32:
@@ -1897,7 +1897,7 @@ define <64 x i16> @test21(<64 x i16> %x , <64 x i1> %mask) nounwind readnone {
 ; KNL-NEXT:    movw $-5, %ax
 ; KNL-NEXT:    kmovw %eax, %k1
 ; KNL-NEXT:    kandw %k1, %k0, %k0
-; KNL-NEXT:    kmovw %k1, %k7
+; KNL-NEXT:    kmovw %k1, %k2
 ; KNL-NEXT:    kmovw %k1, {{[-0-9]+}}(%r{{[sb]}}p) # 2-byte Spill
 ; KNL-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; KNL-NEXT:    kmovw %eax, %k1
@@ -1906,8 +1906,8 @@ define <64 x i16> @test21(<64 x i16> %x , <64 x i1> %mask) nounwind readnone {
 ; KNL-NEXT:    korw %k1, %k0, %k0
 ; KNL-NEXT:    movw $-9, %ax
 ; KNL-NEXT:    kmovw %eax, %k1
-; KNL-NEXT:    kmovw %k1, {{[-0-9]+}}(%r{{[sb]}}p) # 2-byte Spill
 ; KNL-NEXT:    kandw %k1, %k0, %k0
+; KNL-NEXT:    kmovw %k1, %k7
 ; KNL-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; KNL-NEXT:    kmovw %eax, %k1
 ; KNL-NEXT:    kshiftlw $15, %k1, %k1
@@ -1924,9 +1924,8 @@ define <64 x i16> @test21(<64 x i16> %x , <64 x i1> %mask) nounwind readnone {
 ; KNL-NEXT:    korw %k1, %k0, %k0
 ; KNL-NEXT:    movw $-33, %ax
 ; KNL-NEXT:    kmovw %eax, %k1
-; KNL-NEXT:    kandw %k1, %k0, %k0
-; KNL-NEXT:    kmovw %k1, %k2
 ; KNL-NEXT:    kmovw %k1, {{[-0-9]+}}(%r{{[sb]}}p) # 2-byte Spill
+; KNL-NEXT:    kandw %k1, %k0, %k0
 ; KNL-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; KNL-NEXT:    kmovw %eax, %k1
 ; KNL-NEXT:    kshiftlw $15, %k1, %k1
@@ -1934,8 +1933,9 @@ define <64 x i16> @test21(<64 x i16> %x , <64 x i1> %mask) nounwind readnone {
 ; KNL-NEXT:    korw %k1, %k0, %k0
 ; KNL-NEXT:    movw $-65, %ax
 ; KNL-NEXT:    kmovw %eax, %k1
-; KNL-NEXT:    kmovw %k1, {{[-0-9]+}}(%r{{[sb]}}p) # 2-byte Spill
 ; KNL-NEXT:    kandw %k1, %k0, %k0
+; KNL-NEXT:    kmovw %k1, %k3
+; KNL-NEXT:    kmovw %k1, {{[-0-9]+}}(%r{{[sb]}}p) # 2-byte Spill
 ; KNL-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; KNL-NEXT:    kmovw %eax, %k1
 ; KNL-NEXT:    kshiftlw $15, %k1, %k1
@@ -1943,9 +1943,8 @@ define <64 x i16> @test21(<64 x i16> %x , <64 x i1> %mask) nounwind readnone {
 ; KNL-NEXT:    korw %k1, %k0, %k0
 ; KNL-NEXT:    movw $-129, %ax
 ; KNL-NEXT:    kmovw %eax, %k1
-; KNL-NEXT:    kandw %k1, %k0, %k0
-; KNL-NEXT:    kmovw %k1, %k3
 ; KNL-NEXT:    kmovw %k1, {{[-0-9]+}}(%r{{[sb]}}p) # 2-byte Spill
+; KNL-NEXT:    kandw %k1, %k0, %k0
 ; KNL-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; KNL-NEXT:    kmovw %eax, %k1
 ; KNL-NEXT:    kshiftlw $15, %k1, %k1
@@ -1953,18 +1952,17 @@ define <64 x i16> @test21(<64 x i16> %x , <64 x i1> %mask) nounwind readnone {
 ; KNL-NEXT:    korw %k1, %k0, %k0
 ; KNL-NEXT:    movw $-257, %ax # imm = 0xFEFF
 ; KNL-NEXT:    kmovw %eax, %k1
-; KNL-NEXT:    kmovw %k1, {{[-0-9]+}}(%r{{[sb]}}p) # 2-byte Spill
 ; KNL-NEXT:    kandw %k1, %k0, %k0
+; KNL-NEXT:    kmovw %k1, %k4
+; KNL-NEXT:    kmovw %k1, {{[-0-9]+}}(%r{{[sb]}}p) # 2-byte Spill
 ; KNL-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; KNL-NEXT:    kmovw %eax, %k1
 ; KNL-NEXT:    kshiftlw $15, %k1, %k1
 ; KNL-NEXT:    kshiftrw $7, %k1, %k1
 ; KNL-NEXT:    korw %k1, %k0, %k0
 ; KNL-NEXT:    movw $-513, %ax # imm = 0xFDFF
-; KNL-NEXT:    kmovw %eax, %k1
-; KNL-NEXT:    kandw %k1, %k0, %k0
-; KNL-NEXT:    kmovw %k1, %k4
-; KNL-NEXT:    kmovw %k1, {{[-0-9]+}}(%r{{[sb]}}p) # 2-byte Spill
+; KNL-NEXT:    kmovw %eax, %k5
+; KNL-NEXT:    kandw %k5, %k0, %k0
 ; KNL-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; KNL-NEXT:    kmovw %eax, %k1
 ; KNL-NEXT:    kshiftlw $15, %k1, %k1
@@ -1980,8 +1978,9 @@ define <64 x i16> @test21(<64 x i16> %x , <64 x i1> %mask) nounwind readnone {
 ; KNL-NEXT:    kshiftrw $5, %k1, %k1
 ; KNL-NEXT:    korw %k1, %k0, %k0
 ; KNL-NEXT:    movw $-2049, %ax # imm = 0xF7FF
-; KNL-NEXT:    kmovw %eax, %k5
-; KNL-NEXT:    kandw %k5, %k0, %k0
+; KNL-NEXT:    kmovw %eax, %k1
+; KNL-NEXT:    kmovw %k1, {{[-0-9]+}}(%r{{[sb]}}p) # 2-byte Spill
+; KNL-NEXT:    kandw %k1, %k0, %k0
 ; KNL-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; KNL-NEXT:    kmovw %eax, %k1
 ; KNL-NEXT:    kshiftlw $15, %k1, %k1
@@ -2026,49 +2025,50 @@ define <64 x i16> @test21(<64 x i16> %x , <64 x i1> %mask) nounwind readnone {
 ; KNL-NEXT:    kshiftrw $14, %k1, %k1
 ; KNL-NEXT:    kmovw %edi, %k6
 ; KNL-NEXT:    korw %k1, %k6, %k1
-; KNL-NEXT:    kandw %k7, %k1, %k1
+; KNL-NEXT:    kandw %k2, %k1, %k1
 ; KNL-NEXT:    kmovw %edx, %k6
 ; KNL-NEXT:    kshiftlw $15, %k6, %k6
 ; KNL-NEXT:    kshiftrw $13, %k6, %k6
 ; KNL-NEXT:    korw %k6, %k1, %k1
-; KNL-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k7 # 2-byte Reload
+; KNL-NEXT:    kmovw %k7, %k0
+; KNL-NEXT:    kmovw %k7, {{[-0-9]+}}(%r{{[sb]}}p) # 2-byte Spill
 ; KNL-NEXT:    kandw %k7, %k1, %k1
 ; KNL-NEXT:    kmovw %ecx, %k6
 ; KNL-NEXT:    kshiftlw $15, %k6, %k6
 ; KNL-NEXT:    kshiftrw $12, %k6, %k6
 ; KNL-NEXT:    korw %k6, %k1, %k1
-; KNL-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k0 # 2-byte Reload
-; KNL-NEXT:    kandw %k0, %k1, %k1
+; KNL-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k2 # 2-byte Reload
+; KNL-NEXT:    kandw %k2, %k1, %k1
 ; KNL-NEXT:    kmovw %r8d, %k6
 ; KNL-NEXT:    kshiftlw $15, %k6, %k6
 ; KNL-NEXT:    kshiftrw $11, %k6, %k6
 ; KNL-NEXT:    korw %k6, %k1, %k1
-; KNL-NEXT:    kandw %k2, %k1, %k1
+; KNL-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k7 # 2-byte Reload
+; KNL-NEXT:    kandw %k7, %k1, %k1
 ; KNL-NEXT:    kmovw %r9d, %k6
 ; KNL-NEXT:    kshiftlw $15, %k6, %k6
 ; KNL-NEXT:    kshiftrw $10, %k6, %k6
 ; KNL-NEXT:    korw %k6, %k1, %k1
-; KNL-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k2 # 2-byte Reload
-; KNL-NEXT:    kandw %k2, %k1, %k1
+; KNL-NEXT:    kandw %k3, %k1, %k1
 ; KNL-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; KNL-NEXT:    kmovw %eax, %k6
 ; KNL-NEXT:    kshiftlw $15, %k6, %k6
 ; KNL-NEXT:    kshiftrw $9, %k6, %k6
 ; KNL-NEXT:    korw %k6, %k1, %k1
+; KNL-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k3 # 2-byte Reload
 ; KNL-NEXT:    kandw %k3, %k1, %k1
 ; KNL-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; KNL-NEXT:    kmovw %eax, %k6
 ; KNL-NEXT:    kshiftlw $15, %k6, %k6
 ; KNL-NEXT:    kshiftrw $8, %k6, %k6
 ; KNL-NEXT:    korw %k6, %k1, %k1
-; KNL-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k3 # 2-byte Reload
-; KNL-NEXT:    kandw %k3, %k1, %k1
+; KNL-NEXT:    kandw %k4, %k1, %k1
 ; KNL-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; KNL-NEXT:    kmovw %eax, %k6
 ; KNL-NEXT:    kshiftlw $15, %k6, %k6
 ; KNL-NEXT:    kshiftrw $7, %k6, %k6
 ; KNL-NEXT:    korw %k6, %k1, %k1
-; KNL-NEXT:    kandw %k4, %k1, %k1
+; KNL-NEXT:    kandw %k5, %k1, %k1
 ; KNL-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; KNL-NEXT:    kmovw %eax, %k6
 ; KNL-NEXT:    kshiftlw $15, %k6, %k6
@@ -2081,8 +2081,8 @@ define <64 x i16> @test21(<64 x i16> %x , <64 x i1> %mask) nounwind readnone {
 ; KNL-NEXT:    kshiftlw $15, %k6, %k6
 ; KNL-NEXT:    kshiftrw $5, %k6, %k6
 ; KNL-NEXT:    korw %k6, %k1, %k1
-; KNL-NEXT:    kmovw %k5, {{[-0-9]+}}(%r{{[sb]}}p) # 2-byte Spill
-; KNL-NEXT:    kandw %k5, %k1, %k1
+; KNL-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k6 # 2-byte Reload
+; KNL-NEXT:    kandw %k6, %k1, %k1
 ; KNL-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; KNL-NEXT:    kmovw %eax, %k6
 ; KNL-NEXT:    kshiftlw $15, %k6, %k6
@@ -2130,46 +2130,45 @@ define <64 x i16> @test21(<64 x i16> %x , <64 x i1> %mask) nounwind readnone {
 ; KNL-NEXT:    kshiftlw $15, %k6, %k6
 ; KNL-NEXT:    kshiftrw $13, %k6, %k6
 ; KNL-NEXT:    korw %k6, %k1, %k1
-; KNL-NEXT:    kandw %k7, %k1, %k1
+; KNL-NEXT:    kandw %k0, %k1, %k1
 ; KNL-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; KNL-NEXT:    kmovw %eax, %k6
 ; KNL-NEXT:    kshiftlw $15, %k6, %k6
 ; KNL-NEXT:    kshiftrw $12, %k6, %k6
 ; KNL-NEXT:    korw %k6, %k1, %k1
-; KNL-NEXT:    kandw %k0, %k1, %k1
+; KNL-NEXT:    kandw %k2, %k1, %k1
 ; KNL-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; KNL-NEXT:    kmovw %eax, %k6
 ; KNL-NEXT:    kshiftlw $15, %k6, %k6
 ; KNL-NEXT:    kshiftrw $11, %k6, %k6
 ; KNL-NEXT:    korw %k6, %k1, %k1
-; KNL-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k0 # 2-byte Reload
-; KNL-NEXT:    kandw %k0, %k1, %k1
+; KNL-NEXT:    kandw %k7, %k1, %k1
 ; KNL-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; KNL-NEXT:    kmovw %eax, %k6
 ; KNL-NEXT:    kshiftlw $15, %k6, %k6
 ; KNL-NEXT:    kshiftrw $10, %k6, %k6
 ; KNL-NEXT:    korw %k6, %k1, %k1
-; KNL-NEXT:    kandw %k2, %k1, %k1
+; KNL-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k0 # 2-byte Reload
+; KNL-NEXT:    kandw %k0, %k1, %k1
 ; KNL-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; KNL-NEXT:    kmovw %eax, %k6
 ; KNL-NEXT:    kshiftlw $15, %k6, %k6
 ; KNL-NEXT:    kshiftrw $9, %k6, %k6
 ; KNL-NEXT:    korw %k6, %k1, %k1
-; KNL-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k0 # 2-byte Reload
-; KNL-NEXT:    kandw %k0, %k1, %k1
+; KNL-NEXT:    kandw %k3, %k1, %k1
 ; KNL-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; KNL-NEXT:    kmovw %eax, %k6
 ; KNL-NEXT:    kshiftlw $15, %k6, %k6
 ; KNL-NEXT:    kshiftrw $8, %k6, %k6
 ; KNL-NEXT:    korw %k6, %k1, %k1
+; KNL-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k3 # 2-byte Reload
 ; KNL-NEXT:    kandw %k3, %k1, %k1
 ; KNL-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; KNL-NEXT:    kmovw %eax, %k6
 ; KNL-NEXT:    kshiftlw $15, %k6, %k6
 ; KNL-NEXT:    kshiftrw $7, %k6, %k6
 ; KNL-NEXT:    korw %k6, %k1, %k1
-; KNL-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k3 # 2-byte Reload
-; KNL-NEXT:    kandw %k3, %k1, %k1
+; KNL-NEXT:    kandw %k5, %k1, %k1
 ; KNL-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; KNL-NEXT:    kmovw %eax, %k6
 ; KNL-NEXT:    kshiftlw $15, %k6, %k6
@@ -2181,28 +2180,29 @@ define <64 x i16> @test21(<64 x i16> %x , <64 x i1> %mask) nounwind readnone {
 ; KNL-NEXT:    kshiftlw $15, %k6, %k6
 ; KNL-NEXT:    kshiftrw $5, %k6, %k6
 ; KNL-NEXT:    korw %k6, %k1, %k1
-; KNL-NEXT:    kandw %k5, %k1, %k1
+; KNL-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k0 # 2-byte Reload
+; KNL-NEXT:    kandw %k0, %k1, %k1
 ; KNL-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; KNL-NEXT:    kmovw %eax, %k6
 ; KNL-NEXT:    kshiftlw $15, %k6, %k6
 ; KNL-NEXT:    kshiftrw $4, %k6, %k6
 ; KNL-NEXT:    korw %k6, %k1, %k1
-; KNL-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k0 # 2-byte Reload
-; KNL-NEXT:    kandw %k0, %k1, %k1
+; KNL-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k2 # 2-byte Reload
+; KNL-NEXT:    kandw %k2, %k1, %k1
 ; KNL-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; KNL-NEXT:    kmovw %eax, %k6
 ; KNL-NEXT:    kshiftlw $15, %k6, %k6
 ; KNL-NEXT:    kshiftrw $3, %k6, %k6
 ; KNL-NEXT:    korw %k6, %k1, %k1
-; KNL-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k2 # 2-byte Reload
-; KNL-NEXT:    kandw %k2, %k1, %k1
+; KNL-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k6 # 2-byte Reload
+; KNL-NEXT:    kandw %k6, %k1, %k1
 ; KNL-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; KNL-NEXT:    kmovw %eax, %k6
 ; KNL-NEXT:    kshiftlw $15, %k6, %k6
 ; KNL-NEXT:    kshiftrw $2, %k6, %k6
 ; KNL-NEXT:    korw %k6, %k1, %k1
-; KNL-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k5 # 2-byte Reload
-; KNL-NEXT:    kandw %k5, %k1, %k1
+; KNL-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k6 # 2-byte Reload
+; KNL-NEXT:    kandw %k6, %k1, %k1
 ; KNL-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; KNL-NEXT:    kmovw %eax, %k6
 ; KNL-NEXT:    kshiftlw $14, %k6, %k6
@@ -2221,56 +2221,55 @@ define <64 x i16> @test21(<64 x i16> %x , <64 x i1> %mask) nounwind readnone {
 ; KNL-NEXT:    kshiftrw $14, %k6, %k6
 ; KNL-NEXT:    kmovw %eax, %k7
 ; KNL-NEXT:    korw %k6, %k7, %k6
-; KNL-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k5 # 2-byte Reload
-; KNL-NEXT:    kandw %k5, %k6, %k6
+; KNL-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k7 # 2-byte Reload
+; KNL-NEXT:    kandw %k7, %k6, %k6
 ; KNL-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; KNL-NEXT:    kmovw %eax, %k7
 ; KNL-NEXT:    kshiftlw $15, %k7, %k7
 ; KNL-NEXT:    kshiftrw $13, %k7, %k7
 ; KNL-NEXT:    korw %k7, %k6, %k6
-; KNL-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k5 # 2-byte Reload
-; KNL-NEXT:    kandw %k5, %k6, %k6
+; KNL-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k7 # 2-byte Reload
+; KNL-NEXT:    kandw %k7, %k6, %k6
 ; KNL-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; KNL-NEXT:    kmovw %eax, %k7
 ; KNL-NEXT:    kshiftlw $15, %k7, %k7
 ; KNL-NEXT:    kshiftrw $12, %k7, %k7
 ; KNL-NEXT:    korw %k7, %k6, %k6
-; KNL-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k5 # 2-byte Reload
-; KNL-NEXT:    kandw %k5, %k6, %k6
+; KNL-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k7 # 2-byte Reload
+; KNL-NEXT:    kandw %k7, %k6, %k6
 ; KNL-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; KNL-NEXT:    kmovw %eax, %k7
 ; KNL-NEXT:    kshiftlw $15, %k7, %k7
 ; KNL-NEXT:    kshiftrw $11, %k7, %k7
 ; KNL-NEXT:    korw %k7, %k6, %k6
-; KNL-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k5 # 2-byte Reload
-; KNL-NEXT:    kandw %k5, %k6, %k6
+; KNL-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k7 # 2-byte Reload
+; KNL-NEXT:    kandw %k7, %k6, %k6
 ; KNL-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; KNL-NEXT:    kmovw %eax, %k7
 ; KNL-NEXT:    kshiftlw $15, %k7, %k7
 ; KNL-NEXT:    kshiftrw $10, %k7, %k7
 ; KNL-NEXT:    korw %k7, %k6, %k6
-; KNL-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k5 # 2-byte Reload
-; KNL-NEXT:    kandw %k5, %k6, %k6
+; KNL-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k7 # 2-byte Reload
+; KNL-NEXT:    kandw %k7, %k6, %k6
 ; KNL-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; KNL-NEXT:    kmovw %eax, %k7
 ; KNL-NEXT:    kshiftlw $15, %k7, %k7
 ; KNL-NEXT:    kshiftrw $9, %k7, %k7
 ; KNL-NEXT:    korw %k7, %k6, %k6
-; KNL-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k5 # 2-byte Reload
-; KNL-NEXT:    kandw %k5, %k6, %k6
+; KNL-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k7 # 2-byte Reload
+; KNL-NEXT:    kandw %k7, %k6, %k6
 ; KNL-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; KNL-NEXT:    kmovw %eax, %k7
 ; KNL-NEXT:    kshiftlw $15, %k7, %k7
 ; KNL-NEXT:    kshiftrw $8, %k7, %k7
 ; KNL-NEXT:    korw %k7, %k6, %k6
-; KNL-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k5 # 2-byte Reload
-; KNL-NEXT:    kandw %k5, %k6, %k6
+; KNL-NEXT:    kandw %k3, %k6, %k6
 ; KNL-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; KNL-NEXT:    kmovw %eax, %k7
 ; KNL-NEXT:    kshiftlw $15, %k7, %k7
 ; KNL-NEXT:    kshiftrw $7, %k7, %k7
 ; KNL-NEXT:    korw %k7, %k6, %k6
-; KNL-NEXT:    kandw %k3, %k6, %k6
+; KNL-NEXT:    kandw %k5, %k6, %k6
 ; KNL-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; KNL-NEXT:    kmovw %eax, %k7
 ; KNL-NEXT:    kshiftlw $15, %k7, %k7
@@ -2282,20 +2281,20 @@ define <64 x i16> @test21(<64 x i16> %x , <64 x i1> %mask) nounwind readnone {
 ; KNL-NEXT:    kshiftlw $15, %k6, %k6
 ; KNL-NEXT:    kshiftrw $5, %k6, %k6
 ; KNL-NEXT:    korw %k6, %k5, %k5
-; KNL-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k3 # 2-byte Reload
-; KNL-NEXT:    kandw %k3, %k5, %k4
+; KNL-NEXT:    kandw %k0, %k5, %k4
 ; KNL-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; KNL-NEXT:    kmovw %eax, %k5
 ; KNL-NEXT:    kshiftlw $15, %k5, %k5
 ; KNL-NEXT:    kshiftrw $4, %k5, %k5
 ; KNL-NEXT:    korw %k5, %k4, %k4
-; KNL-NEXT:    kandw %k0, %k4, %k3
+; KNL-NEXT:    kandw %k2, %k4, %k3
 ; KNL-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; KNL-NEXT:    kmovw %eax, %k4
 ; KNL-NEXT:    kshiftlw $15, %k4, %k4
 ; KNL-NEXT:    kshiftrw $3, %k4, %k4
 ; KNL-NEXT:    korw %k4, %k3, %k3
-; KNL-NEXT:    kandw %k2, %k3, %k2
+; KNL-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k0 # 2-byte Reload
+; KNL-NEXT:    kandw %k0, %k3, %k2
 ; KNL-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; KNL-NEXT:    kmovw %eax, %k3
 ; KNL-NEXT:    kshiftlw $15, %k3, %k3
@@ -2313,12 +2312,12 @@ define <64 x i16> @test21(<64 x i16> %x , <64 x i1> %mask) nounwind readnone {
 ; KNL-NEXT:    kmovw %eax, %k2
 ; KNL-NEXT:    kshiftlw $15, %k2, %k2
 ; KNL-NEXT:    korw %k2, %k0, %k2
-; KNL-NEXT:    vpternlogd $255, %zmm2, %zmm2, %zmm2 {%k2} {z}
-; KNL-NEXT:    vpternlogd $255, %zmm3, %zmm3, %zmm3 {%k1} {z}
+; KNL-NEXT:    vpternlogd {{.*#+}} zmm2 {%k2} {z} = -1
+; KNL-NEXT:    vpternlogd {{.*#+}} zmm3 {%k1} {z} = -1
 ; KNL-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k1 # 2-byte Reload
-; KNL-NEXT:    vpternlogd $255, %zmm4, %zmm4, %zmm4 {%k1} {z}
+; KNL-NEXT:    vpternlogd {{.*#+}} zmm4 {%k1} {z} = -1
 ; KNL-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k1 # 2-byte Reload
-; KNL-NEXT:    vpternlogd $255, %zmm5, %zmm5, %zmm5 {%k1} {z}
+; KNL-NEXT:    vpternlogd {{.*#+}} zmm5 {%k1} {z} = -1
 ; KNL-NEXT:    vpmovdw %zmm2, %ymm2
 ; KNL-NEXT:    vpmovdw %zmm3, %ymm3
 ; KNL-NEXT:    vinserti64x4 $1, %ymm3, %zmm2, %zmm2
@@ -2351,7 +2350,7 @@ define <64 x i16> @test21(<64 x i16> %x , <64 x i1> %mask) nounwind readnone {
 ; AVX512DQNOBW-NEXT:    movw $-5, %ax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k1
 ; AVX512DQNOBW-NEXT:    kandw %k1, %k0, %k0
-; AVX512DQNOBW-NEXT:    kmovw %k1, %k7
+; AVX512DQNOBW-NEXT:    kmovw %k1, %k2
 ; AVX512DQNOBW-NEXT:    kmovw %k1, {{[-0-9]+}}(%r{{[sb]}}p) # 2-byte Spill
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k1
@@ -2360,8 +2359,8 @@ define <64 x i16> @test21(<64 x i16> %x , <64 x i1> %mask) nounwind readnone {
 ; AVX512DQNOBW-NEXT:    korw %k1, %k0, %k0
 ; AVX512DQNOBW-NEXT:    movw $-9, %ax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k1
-; AVX512DQNOBW-NEXT:    kmovw %k1, {{[-0-9]+}}(%r{{[sb]}}p) # 2-byte Spill
 ; AVX512DQNOBW-NEXT:    kandw %k1, %k0, %k0
+; AVX512DQNOBW-NEXT:    kmovw %k1, %k7
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k1
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k1, %k1
@@ -2378,9 +2377,8 @@ define <64 x i16> @test21(<64 x i16> %x , <64 x i1> %mask) nounwind readnone {
 ; AVX512DQNOBW-NEXT:    korw %k1, %k0, %k0
 ; AVX512DQNOBW-NEXT:    movw $-33, %ax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k1
-; AVX512DQNOBW-NEXT:    kandw %k1, %k0, %k0
-; AVX512DQNOBW-NEXT:    kmovw %k1, %k2
 ; AVX512DQNOBW-NEXT:    kmovw %k1, {{[-0-9]+}}(%r{{[sb]}}p) # 2-byte Spill
+; AVX512DQNOBW-NEXT:    kandw %k1, %k0, %k0
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k1
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k1, %k1
@@ -2388,8 +2386,9 @@ define <64 x i16> @test21(<64 x i16> %x , <64 x i1> %mask) nounwind readnone {
 ; AVX512DQNOBW-NEXT:    korw %k1, %k0, %k0
 ; AVX512DQNOBW-NEXT:    movw $-65, %ax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k1
-; AVX512DQNOBW-NEXT:    kmovw %k1, {{[-0-9]+}}(%r{{[sb]}}p) # 2-byte Spill
 ; AVX512DQNOBW-NEXT:    kandw %k1, %k0, %k0
+; AVX512DQNOBW-NEXT:    kmovw %k1, %k3
+; AVX512DQNOBW-NEXT:    kmovw %k1, {{[-0-9]+}}(%r{{[sb]}}p) # 2-byte Spill
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k1
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k1, %k1
@@ -2397,9 +2396,8 @@ define <64 x i16> @test21(<64 x i16> %x , <64 x i1> %mask) nounwind readnone {
 ; AVX512DQNOBW-NEXT:    korw %k1, %k0, %k0
 ; AVX512DQNOBW-NEXT:    movw $-129, %ax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k1
-; AVX512DQNOBW-NEXT:    kandw %k1, %k0, %k0
-; AVX512DQNOBW-NEXT:    kmovw %k1, %k3
 ; AVX512DQNOBW-NEXT:    kmovw %k1, {{[-0-9]+}}(%r{{[sb]}}p) # 2-byte Spill
+; AVX512DQNOBW-NEXT:    kandw %k1, %k0, %k0
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k1
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k1, %k1
@@ -2407,18 +2405,17 @@ define <64 x i16> @test21(<64 x i16> %x , <64 x i1> %mask) nounwind readnone {
 ; AVX512DQNOBW-NEXT:    korw %k1, %k0, %k0
 ; AVX512DQNOBW-NEXT:    movw $-257, %ax # imm = 0xFEFF
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k1
-; AVX512DQNOBW-NEXT:    kmovw %k1, {{[-0-9]+}}(%r{{[sb]}}p) # 2-byte Spill
 ; AVX512DQNOBW-NEXT:    kandw %k1, %k0, %k0
+; AVX512DQNOBW-NEXT:    kmovw %k1, %k4
+; AVX512DQNOBW-NEXT:    kmovw %k1, {{[-0-9]+}}(%r{{[sb]}}p) # 2-byte Spill
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k1
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k1, %k1
 ; AVX512DQNOBW-NEXT:    kshiftrw $7, %k1, %k1
 ; AVX512DQNOBW-NEXT:    korw %k1, %k0, %k0
 ; AVX512DQNOBW-NEXT:    movw $-513, %ax # imm = 0xFDFF
-; AVX512DQNOBW-NEXT:    kmovw %eax, %k1
-; AVX512DQNOBW-NEXT:    kandw %k1, %k0, %k0
-; AVX512DQNOBW-NEXT:    kmovw %k1, %k4
-; AVX512DQNOBW-NEXT:    kmovw %k1, {{[-0-9]+}}(%r{{[sb]}}p) # 2-byte Spill
+; AVX512DQNOBW-NEXT:    kmovw %eax, %k5
+; AVX512DQNOBW-NEXT:    kandw %k5, %k0, %k0
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k1
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k1, %k1
@@ -2434,8 +2431,9 @@ define <64 x i16> @test21(<64 x i16> %x , <64 x i1> %mask) nounwind readnone {
 ; AVX512DQNOBW-NEXT:    kshiftrw $5, %k1, %k1
 ; AVX512DQNOBW-NEXT:    korw %k1, %k0, %k0
 ; AVX512DQNOBW-NEXT:    movw $-2049, %ax # imm = 0xF7FF
-; AVX512DQNOBW-NEXT:    kmovw %eax, %k5
-; AVX512DQNOBW-NEXT:    kandw %k5, %k0, %k0
+; AVX512DQNOBW-NEXT:    kmovw %eax, %k1
+; AVX512DQNOBW-NEXT:    kmovw %k1, {{[-0-9]+}}(%r{{[sb]}}p) # 2-byte Spill
+; AVX512DQNOBW-NEXT:    kandw %k1, %k0, %k0
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k1
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k1, %k1
@@ -2458,215 +2456,216 @@ define <64 x i16> @test21(<64 x i16> %x , <64 x i1> %mask) nounwind readnone {
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k1
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k1, %k1
 ; AVX512DQNOBW-NEXT:    kshiftrw $2, %k1, %k1
-; AVX512DQNOBW-NEXT:    korw %k1, %k0, %k0
+; AVX512DQNOBW-NEXT:    korw %k1, %k0, %k1
 ; AVX512DQNOBW-NEXT:    movw $-16385, %ax # imm = 0xBFFF
-; AVX512DQNOBW-NEXT:    kmovw %eax, %k1
-; AVX512DQNOBW-NEXT:    kmovw %k1, {{[-0-9]+}}(%r{{[sb]}}p) # 2-byte Spill
-; AVX512DQNOBW-NEXT:    kandw %k1, %k0, %k0
+; AVX512DQNOBW-NEXT:    kmovw %eax, %k0
+; AVX512DQNOBW-NEXT:    kmovw %k0, {{[-0-9]+}}(%r{{[sb]}}p) # 2-byte Spill
+; AVX512DQNOBW-NEXT:    kandw %k0, %k1, %k1
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k6
 ; AVX512DQNOBW-NEXT:    kshiftlw $14, %k6, %k6
-; AVX512DQNOBW-NEXT:    korw %k6, %k0, %k0
-; AVX512DQNOBW-NEXT:    kshiftlw $1, %k0, %k0
-; AVX512DQNOBW-NEXT:    kshiftrw $1, %k0, %k0
+; AVX512DQNOBW-NEXT:    korw %k6, %k1, %k1
+; AVX512DQNOBW-NEXT:    kshiftlw $1, %k1, %k1
+; AVX512DQNOBW-NEXT:    kshiftrw $1, %k1, %k1
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k6
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k6, %k6
-; AVX512DQNOBW-NEXT:    korw %k6, %k0, %k0
+; AVX512DQNOBW-NEXT:    korw %k6, %k1, %k0
 ; AVX512DQNOBW-NEXT:    kmovw %k0, {{[-0-9]+}}(%r{{[sb]}}p) # 2-byte Spill
 ; AVX512DQNOBW-NEXT:    andl $1, %edi
-; AVX512DQNOBW-NEXT:    kmovw %esi, %k0
-; AVX512DQNOBW-NEXT:    kshiftlw $15, %k0, %k0
-; AVX512DQNOBW-NEXT:    kshiftrw $14, %k0, %k0
+; AVX512DQNOBW-NEXT:    kmovw %esi, %k1
+; AVX512DQNOBW-NEXT:    kshiftlw $15, %k1, %k1
+; AVX512DQNOBW-NEXT:    kshiftrw $14, %k1, %k1
 ; AVX512DQNOBW-NEXT:    kmovw %edi, %k6
-; AVX512DQNOBW-NEXT:    korw %k0, %k6, %k0
-; AVX512DQNOBW-NEXT:    kandw %k7, %k0, %k0
+; AVX512DQNOBW-NEXT:    korw %k1, %k6, %k1
+; AVX512DQNOBW-NEXT:    kandw %k2, %k1, %k1
 ; AVX512DQNOBW-NEXT:    kmovw %edx, %k6
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k6, %k6
 ; AVX512DQNOBW-NEXT:    kshiftrw $13, %k6, %k6
-; AVX512DQNOBW-NEXT:    korw %k6, %k0, %k0
-; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k7 # 2-byte Reload
-; AVX512DQNOBW-NEXT:    kandw %k7, %k0, %k0
+; AVX512DQNOBW-NEXT:    korw %k6, %k1, %k1
+; AVX512DQNOBW-NEXT:    kmovw %k7, %k0
+; AVX512DQNOBW-NEXT:    kmovw %k7, {{[-0-9]+}}(%r{{[sb]}}p) # 2-byte Spill
+; AVX512DQNOBW-NEXT:    kandw %k7, %k1, %k1
 ; AVX512DQNOBW-NEXT:    kmovw %ecx, %k6
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k6, %k6
 ; AVX512DQNOBW-NEXT:    kshiftrw $12, %k6, %k6
-; AVX512DQNOBW-NEXT:    korw %k6, %k0, %k0
-; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k1 # 2-byte Reload
-; AVX512DQNOBW-NEXT:    kandw %k1, %k0, %k0
+; AVX512DQNOBW-NEXT:    korw %k6, %k1, %k1
+; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k2 # 2-byte Reload
+; AVX512DQNOBW-NEXT:    kandw %k2, %k1, %k1
 ; AVX512DQNOBW-NEXT:    kmovw %r8d, %k6
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k6, %k6
 ; AVX512DQNOBW-NEXT:    kshiftrw $11, %k6, %k6
-; AVX512DQNOBW-NEXT:    korw %k6, %k0, %k0
-; AVX512DQNOBW-NEXT:    kandw %k2, %k0, %k0
+; AVX512DQNOBW-NEXT:    korw %k6, %k1, %k1
+; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k7 # 2-byte Reload
+; AVX512DQNOBW-NEXT:    kandw %k7, %k1, %k1
 ; AVX512DQNOBW-NEXT:    kmovw %r9d, %k6
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k6, %k6
 ; AVX512DQNOBW-NEXT:    kshiftrw $10, %k6, %k6
-; AVX512DQNOBW-NEXT:    korw %k6, %k0, %k0
-; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k2 # 2-byte Reload
-; AVX512DQNOBW-NEXT:    kandw %k2, %k0, %k0
+; AVX512DQNOBW-NEXT:    korw %k6, %k1, %k1
+; AVX512DQNOBW-NEXT:    kandw %k3, %k1, %k1
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k6
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k6, %k6
 ; AVX512DQNOBW-NEXT:    kshiftrw $9, %k6, %k6
-; AVX512DQNOBW-NEXT:    korw %k6, %k0, %k0
-; AVX512DQNOBW-NEXT:    kandw %k3, %k0, %k0
+; AVX512DQNOBW-NEXT:    korw %k6, %k1, %k1
+; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k3 # 2-byte Reload
+; AVX512DQNOBW-NEXT:    kandw %k3, %k1, %k1
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k6
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k6, %k6
 ; AVX512DQNOBW-NEXT:    kshiftrw $8, %k6, %k6
-; AVX512DQNOBW-NEXT:    korw %k6, %k0, %k0
-; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k3 # 2-byte Reload
-; AVX512DQNOBW-NEXT:    kandw %k3, %k0, %k0
+; AVX512DQNOBW-NEXT:    korw %k6, %k1, %k1
+; AVX512DQNOBW-NEXT:    kandw %k4, %k1, %k1
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k6
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k6, %k6
 ; AVX512DQNOBW-NEXT:    kshiftrw $7, %k6, %k6
-; AVX512DQNOBW-NEXT:    korw %k6, %k0, %k0
-; AVX512DQNOBW-NEXT:    kandw %k4, %k0, %k0
+; AVX512DQNOBW-NEXT:    korw %k6, %k1, %k1
+; AVX512DQNOBW-NEXT:    kandw %k5, %k1, %k1
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k6
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k6, %k6
 ; AVX512DQNOBW-NEXT:    kshiftrw $6, %k6, %k6
-; AVX512DQNOBW-NEXT:    korw %k6, %k0, %k0
+; AVX512DQNOBW-NEXT:    korw %k6, %k1, %k1
 ; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k4 # 2-byte Reload
-; AVX512DQNOBW-NEXT:    kandw %k4, %k0, %k0
+; AVX512DQNOBW-NEXT:    kandw %k4, %k1, %k1
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k6
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k6, %k6
 ; AVX512DQNOBW-NEXT:    kshiftrw $5, %k6, %k6
-; AVX512DQNOBW-NEXT:    korw %k6, %k0, %k0
-; AVX512DQNOBW-NEXT:    kmovw %k5, {{[-0-9]+}}(%r{{[sb]}}p) # 2-byte Spill
-; AVX512DQNOBW-NEXT:    kandw %k5, %k0, %k0
+; AVX512DQNOBW-NEXT:    korw %k6, %k1, %k1
+; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k6 # 2-byte Reload
+; AVX512DQNOBW-NEXT:    kandw %k6, %k1, %k1
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k6
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k6, %k6
 ; AVX512DQNOBW-NEXT:    kshiftrw $4, %k6, %k6
-; AVX512DQNOBW-NEXT:    korw %k6, %k0, %k0
+; AVX512DQNOBW-NEXT:    korw %k6, %k1, %k1
 ; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k6 # 2-byte Reload
-; AVX512DQNOBW-NEXT:    kandw %k6, %k0, %k0
+; AVX512DQNOBW-NEXT:    kandw %k6, %k1, %k1
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k6
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k6, %k6
 ; AVX512DQNOBW-NEXT:    kshiftrw $3, %k6, %k6
-; AVX512DQNOBW-NEXT:    korw %k6, %k0, %k0
+; AVX512DQNOBW-NEXT:    korw %k6, %k1, %k1
 ; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k6 # 2-byte Reload
-; AVX512DQNOBW-NEXT:    kandw %k6, %k0, %k0
+; AVX512DQNOBW-NEXT:    kandw %k6, %k1, %k1
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k6
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k6, %k6
 ; AVX512DQNOBW-NEXT:    kshiftrw $2, %k6, %k6
-; AVX512DQNOBW-NEXT:    korw %k6, %k0, %k0
+; AVX512DQNOBW-NEXT:    korw %k6, %k1, %k1
 ; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k6 # 2-byte Reload
-; AVX512DQNOBW-NEXT:    kandw %k6, %k0, %k0
+; AVX512DQNOBW-NEXT:    kandw %k6, %k1, %k1
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k6
 ; AVX512DQNOBW-NEXT:    kshiftlw $14, %k6, %k6
-; AVX512DQNOBW-NEXT:    korw %k6, %k0, %k0
-; AVX512DQNOBW-NEXT:    kshiftlw $1, %k0, %k0
-; AVX512DQNOBW-NEXT:    kshiftrw $1, %k0, %k0
+; AVX512DQNOBW-NEXT:    korw %k6, %k1, %k1
+; AVX512DQNOBW-NEXT:    kshiftlw $1, %k1, %k1
+; AVX512DQNOBW-NEXT:    kshiftrw $1, %k1, %k1
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k6
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k6, %k6
-; AVX512DQNOBW-NEXT:    korw %k6, %k0, %k0
-; AVX512DQNOBW-NEXT:    kmovw %k0, {{[-0-9]+}}(%r{{[sb]}}p) # 2-byte Spill
+; AVX512DQNOBW-NEXT:    korw %k6, %k1, %k1
+; AVX512DQNOBW-NEXT:    kmovw %k1, {{[-0-9]+}}(%r{{[sb]}}p) # 2-byte Spill
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    andl $1, %eax
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %ecx
-; AVX512DQNOBW-NEXT:    kmovw %ecx, %k0
-; AVX512DQNOBW-NEXT:    kshiftlw $15, %k0, %k0
-; AVX512DQNOBW-NEXT:    kshiftrw $14, %k0, %k0
+; AVX512DQNOBW-NEXT:    kmovw %ecx, %k1
+; AVX512DQNOBW-NEXT:    kshiftlw $15, %k1, %k1
+; AVX512DQNOBW-NEXT:    kshiftrw $14, %k1, %k1
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k6
-; AVX512DQNOBW-NEXT:    korw %k0, %k6, %k0
+; AVX512DQNOBW-NEXT:    korw %k1, %k6, %k1
 ; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k6 # 2-byte Reload
-; AVX512DQNOBW-NEXT:    kandw %k6, %k0, %k0
+; AVX512DQNOBW-NEXT:    kandw %k6, %k1, %k1
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k6
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k6, %k6
 ; AVX512DQNOBW-NEXT:    kshiftrw $13, %k6, %k6
-; AVX512DQNOBW-NEXT:    korw %k6, %k0, %k0
-; AVX512DQNOBW-NEXT:    kandw %k7, %k0, %k0
+; AVX512DQNOBW-NEXT:    korw %k6, %k1, %k1
+; AVX512DQNOBW-NEXT:    kandw %k0, %k1, %k1
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k6
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k6, %k6
 ; AVX512DQNOBW-NEXT:    kshiftrw $12, %k6, %k6
-; AVX512DQNOBW-NEXT:    korw %k6, %k0, %k0
-; AVX512DQNOBW-NEXT:    kandw %k1, %k0, %k0
+; AVX512DQNOBW-NEXT:    korw %k6, %k1, %k1
+; AVX512DQNOBW-NEXT:    kandw %k2, %k1, %k1
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k6
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k6, %k6
 ; AVX512DQNOBW-NEXT:    kshiftrw $11, %k6, %k6
-; AVX512DQNOBW-NEXT:    korw %k6, %k0, %k0
-; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k1 # 2-byte Reload
-; AVX512DQNOBW-NEXT:    kandw %k1, %k0, %k0
+; AVX512DQNOBW-NEXT:    korw %k6, %k1, %k1
+; AVX512DQNOBW-NEXT:    kandw %k7, %k1, %k1
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k6
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k6, %k6
 ; AVX512DQNOBW-NEXT:    kshiftrw $10, %k6, %k6
-; AVX512DQNOBW-NEXT:    korw %k6, %k0, %k0
-; AVX512DQNOBW-NEXT:    kandw %k2, %k0, %k0
+; AVX512DQNOBW-NEXT:    korw %k6, %k1, %k1
+; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k0 # 2-byte Reload
+; AVX512DQNOBW-NEXT:    kandw %k0, %k1, %k1
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k6
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k6, %k6
 ; AVX512DQNOBW-NEXT:    kshiftrw $9, %k6, %k6
-; AVX512DQNOBW-NEXT:    korw %k6, %k0, %k0
-; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k1 # 2-byte Reload
-; AVX512DQNOBW-NEXT:    kandw %k1, %k0, %k0
+; AVX512DQNOBW-NEXT:    korw %k6, %k1, %k1
+; AVX512DQNOBW-NEXT:    kandw %k3, %k1, %k1
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k6
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k6, %k6
 ; AVX512DQNOBW-NEXT:    kshiftrw $8, %k6, %k6
-; AVX512DQNOBW-NEXT:    korw %k6, %k0, %k0
-; AVX512DQNOBW-NEXT:    kandw %k3, %k0, %k0
+; AVX512DQNOBW-NEXT:    korw %k6, %k1, %k1
+; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k3 # 2-byte Reload
+; AVX512DQNOBW-NEXT:    kandw %k3, %k1, %k1
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k6
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k6, %k6
 ; AVX512DQNOBW-NEXT:    kshiftrw $7, %k6, %k6
-; AVX512DQNOBW-NEXT:    korw %k6, %k0, %k0
-; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k3 # 2-byte Reload
-; AVX512DQNOBW-NEXT:    kandw %k3, %k0, %k0
+; AVX512DQNOBW-NEXT:    korw %k6, %k1, %k1
+; AVX512DQNOBW-NEXT:    kandw %k5, %k1, %k1
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k6
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k6, %k6
 ; AVX512DQNOBW-NEXT:    kshiftrw $6, %k6, %k6
-; AVX512DQNOBW-NEXT:    korw %k6, %k0, %k0
-; AVX512DQNOBW-NEXT:    kandw %k4, %k0, %k0
+; AVX512DQNOBW-NEXT:    korw %k6, %k1, %k1
+; AVX512DQNOBW-NEXT:    kandw %k4, %k1, %k1
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k6
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k6, %k6
 ; AVX512DQNOBW-NEXT:    kshiftrw $5, %k6, %k6
-; AVX512DQNOBW-NEXT:    korw %k6, %k0, %k0
-; AVX512DQNOBW-NEXT:    kandw %k5, %k0, %k0
+; AVX512DQNOBW-NEXT:    korw %k6, %k1, %k1
+; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k0 # 2-byte Reload
+; AVX512DQNOBW-NEXT:    kandw %k0, %k1, %k1
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k6
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k6, %k6
 ; AVX512DQNOBW-NEXT:    kshiftrw $4, %k6, %k6
-; AVX512DQNOBW-NEXT:    korw %k6, %k0, %k0
-; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k1 # 2-byte Reload
-; AVX512DQNOBW-NEXT:    kandw %k1, %k0, %k0
+; AVX512DQNOBW-NEXT:    korw %k6, %k1, %k1
+; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k2 # 2-byte Reload
+; AVX512DQNOBW-NEXT:    kandw %k2, %k1, %k1
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k6
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k6, %k6
 ; AVX512DQNOBW-NEXT:    kshiftrw $3, %k6, %k6
-; AVX512DQNOBW-NEXT:    korw %k6, %k0, %k0
-; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k2 # 2-byte Reload
-; AVX512DQNOBW-NEXT:    kandw %k2, %k0, %k0
+; AVX512DQNOBW-NEXT:    korw %k6, %k1, %k1
+; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k6 # 2-byte Reload
+; AVX512DQNOBW-NEXT:    kandw %k6, %k1, %k1
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k6
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k6, %k6
 ; AVX512DQNOBW-NEXT:    kshiftrw $2, %k6, %k6
-; AVX512DQNOBW-NEXT:    korw %k6, %k0, %k0
-; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k5 # 2-byte Reload
-; AVX512DQNOBW-NEXT:    kandw %k5, %k0, %k0
+; AVX512DQNOBW-NEXT:    korw %k6, %k1, %k1
+; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k6 # 2-byte Reload
+; AVX512DQNOBW-NEXT:    kandw %k6, %k1, %k1
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k6
 ; AVX512DQNOBW-NEXT:    kshiftlw $14, %k6, %k6
-; AVX512DQNOBW-NEXT:    korw %k6, %k0, %k0
-; AVX512DQNOBW-NEXT:    kshiftlw $1, %k0, %k0
-; AVX512DQNOBW-NEXT:    kshiftrw $1, %k0, %k0
+; AVX512DQNOBW-NEXT:    korw %k6, %k1, %k1
+; AVX512DQNOBW-NEXT:    kshiftlw $1, %k1, %k1
+; AVX512DQNOBW-NEXT:    kshiftrw $1, %k1, %k1
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k6
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k6, %k6
-; AVX512DQNOBW-NEXT:    korw %k6, %k0, %k0
+; AVX512DQNOBW-NEXT:    korw %k6, %k1, %k1
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    andl $1, %eax
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %ecx
@@ -2675,56 +2674,55 @@ define <64 x i16> @test21(<64 x i16> %x , <64 x i1> %mask) nounwind readnone {
 ; AVX512DQNOBW-NEXT:    kshiftrw $14, %k6, %k6
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k7
 ; AVX512DQNOBW-NEXT:    korw %k6, %k7, %k6
-; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k5 # 2-byte Reload
-; AVX512DQNOBW-NEXT:    kandw %k5, %k6, %k6
+; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k7 # 2-byte Reload
+; AVX512DQNOBW-NEXT:    kandw %k7, %k6, %k6
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k7
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k7, %k7
 ; AVX512DQNOBW-NEXT:    kshiftrw $13, %k7, %k7
 ; AVX512DQNOBW-NEXT:    korw %k7, %k6, %k6
-; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k5 # 2-byte Reload
-; AVX512DQNOBW-NEXT:    kandw %k5, %k6, %k6
+; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k7 # 2-byte Reload
+; AVX512DQNOBW-NEXT:    kandw %k7, %k6, %k6
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k7
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k7, %k7
 ; AVX512DQNOBW-NEXT:    kshiftrw $12, %k7, %k7
 ; AVX512DQNOBW-NEXT:    korw %k7, %k6, %k6
-; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k5 # 2-byte Reload
-; AVX512DQNOBW-NEXT:    kandw %k5, %k6, %k6
+; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k7 # 2-byte Reload
+; AVX512DQNOBW-NEXT:    kandw %k7, %k6, %k6
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k7
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k7, %k7
 ; AVX512DQNOBW-NEXT:    kshiftrw $11, %k7, %k7
 ; AVX512DQNOBW-NEXT:    korw %k7, %k6, %k6
-; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k5 # 2-byte Reload
-; AVX512DQNOBW-NEXT:    kandw %k5, %k6, %k6
+; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k7 # 2-byte Reload
+; AVX512DQNOBW-NEXT:    kandw %k7, %k6, %k6
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k7
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k7, %k7
 ; AVX512DQNOBW-NEXT:    kshiftrw $10, %k7, %k7
 ; AVX512DQNOBW-NEXT:    korw %k7, %k6, %k6
-; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k5 # 2-byte Reload
-; AVX512DQNOBW-NEXT:    kandw %k5, %k6, %k6
+; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k7 # 2-byte Reload
+; AVX512DQNOBW-NEXT:    kandw %k7, %k6, %k6
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k7
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k7, %k7
 ; AVX512DQNOBW-NEXT:    kshiftrw $9, %k7, %k7
 ; AVX512DQNOBW-NEXT:    korw %k7, %k6, %k6
-; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k5 # 2-byte Reload
-; AVX512DQNOBW-NEXT:    kandw %k5, %k6, %k6
+; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k7 # 2-byte Reload
+; AVX512DQNOBW-NEXT:    kandw %k7, %k6, %k6
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k7
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k7, %k7
 ; AVX512DQNOBW-NEXT:    kshiftrw $8, %k7, %k7
 ; AVX512DQNOBW-NEXT:    korw %k7, %k6, %k6
-; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k5 # 2-byte Reload
-; AVX512DQNOBW-NEXT:    kandw %k5, %k6, %k6
+; AVX512DQNOBW-NEXT:    kandw %k3, %k6, %k6
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k7
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k7, %k7
 ; AVX512DQNOBW-NEXT:    kshiftrw $7, %k7, %k7
 ; AVX512DQNOBW-NEXT:    korw %k7, %k6, %k6
-; AVX512DQNOBW-NEXT:    kandw %k3, %k6, %k6
+; AVX512DQNOBW-NEXT:    kandw %k5, %k6, %k6
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k7
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k7, %k7
@@ -2736,39 +2734,39 @@ define <64 x i16> @test21(<64 x i16> %x , <64 x i1> %mask) nounwind readnone {
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k6, %k6
 ; AVX512DQNOBW-NEXT:    kshiftrw $5, %k6, %k6
 ; AVX512DQNOBW-NEXT:    korw %k6, %k5, %k5
-; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k3 # 2-byte Reload
-; AVX512DQNOBW-NEXT:    kandw %k3, %k5, %k4
+; AVX512DQNOBW-NEXT:    kandw %k0, %k5, %k4
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k5
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k5, %k5
 ; AVX512DQNOBW-NEXT:    kshiftrw $4, %k5, %k5
 ; AVX512DQNOBW-NEXT:    korw %k5, %k4, %k4
-; AVX512DQNOBW-NEXT:    kandw %k1, %k4, %k3
+; AVX512DQNOBW-NEXT:    kandw %k2, %k4, %k3
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k4
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k4, %k4
 ; AVX512DQNOBW-NEXT:    kshiftrw $3, %k4, %k4
 ; AVX512DQNOBW-NEXT:    korw %k4, %k3, %k3
-; AVX512DQNOBW-NEXT:    kandw %k2, %k3, %k2
+; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k0 # 2-byte Reload
+; AVX512DQNOBW-NEXT:    kandw %k0, %k3, %k2
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k3
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k3, %k3
 ; AVX512DQNOBW-NEXT:    kshiftrw $2, %k3, %k3
 ; AVX512DQNOBW-NEXT:    korw %k3, %k2, %k2
-; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k1 # 2-byte Reload
-; AVX512DQNOBW-NEXT:    kandw %k1, %k2, %k1
+; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k0 # 2-byte Reload
+; AVX512DQNOBW-NEXT:    kandw %k0, %k2, %k0
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k2
 ; AVX512DQNOBW-NEXT:    kshiftlw $14, %k2, %k2
-; AVX512DQNOBW-NEXT:    korw %k2, %k1, %k1
-; AVX512DQNOBW-NEXT:    kshiftlw $1, %k1, %k1
-; AVX512DQNOBW-NEXT:    kshiftrw $1, %k1, %k1
+; AVX512DQNOBW-NEXT:    korw %k2, %k0, %k0
+; AVX512DQNOBW-NEXT:    kshiftlw $1, %k0, %k0
+; AVX512DQNOBW-NEXT:    kshiftrw $1, %k0, %k0
 ; AVX512DQNOBW-NEXT:    movzbl {{[0-9]+}}(%rsp), %eax
 ; AVX512DQNOBW-NEXT:    kmovw %eax, %k2
 ; AVX512DQNOBW-NEXT:    kshiftlw $15, %k2, %k2
-; AVX512DQNOBW-NEXT:    korw %k2, %k1, %k1
-; AVX512DQNOBW-NEXT:    vpmovm2d %k1, %zmm2
-; AVX512DQNOBW-NEXT:    vpmovm2d %k0, %zmm3
+; AVX512DQNOBW-NEXT:    korw %k2, %k0, %k0
+; AVX512DQNOBW-NEXT:    vpmovm2d %k0, %zmm2
+; AVX512DQNOBW-NEXT:    vpmovm2d %k1, %zmm3
 ; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k0 # 2-byte Reload
 ; AVX512DQNOBW-NEXT:    vpmovm2d %k0, %zmm4
 ; AVX512DQNOBW-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k0 # 2-byte Reload
diff --git a/llvm/test/CodeGen/X86/div-rem-pair-recomposition-signed.ll b/llvm/test/CodeGen/X86/div-rem-pair-recomposition-signed.ll
index f66d81c781fe0d..880062bbc4f9ed 100644
--- a/llvm/test/CodeGen/X86/div-rem-pair-recomposition-signed.ll
+++ b/llvm/test/CodeGen/X86/div-rem-pair-recomposition-signed.ll
@@ -200,10 +200,9 @@ define i128 @scalar_i128(i128 %x, i128 %y, ptr %divdst) nounwind {
 ; X86-NEXT:    movl %esi, {{[-0-9]+}}(%e{{[sb]}}p) # 4-byte Spill
 ; X86-NEXT:    sbbl %eax, %edi
 ; X86-NEXT:    movl %edi, {{[-0-9]+}}(%e{{[sb]}}p) # 4-byte Spill
-; X86-NEXT:    movl 40(%ebp), %ecx
-; X86-NEXT:    movl %ecx, %edx
+; X86-NEXT:    movl 40(%ebp), %esi
+; X86-NEXT:    movl %esi, %edx
 ; X86-NEXT:    sarl $31, %edx
-; X86-NEXT:    movl %ecx, %esi
 ; X86-NEXT:    xorl %edx, %esi
 ; X86-NEXT:    movl 36(%ebp), %ecx
 ; X86-NEXT:    xorl %edx, %ecx
@@ -408,9 +407,9 @@ define i128 @scalar_i128(i128 %x, i128 %y, ptr %divdst) nounwind {
 ; X86-NEXT:    movl {{[-0-9]+}}(%e{{[sb]}}p), %eax # 4-byte Reload
 ; X86-NEXT:    adcl $-1, %eax
 ; X86-NEXT:    movl %eax, {{[-0-9]+}}(%e{{[sb]}}p) # 4-byte Spill
-; X86-NEXT:    movl {{[-0-9]+}}(%e{{[sb]}}p), %eax # 4-byte Reload
-; X86-NEXT:    adcl $-1, %eax
-; X86-NEXT:    movl %eax, {{[-0-9]+}}(%e{{[sb]}}p) # 4-byte Spill
+; X86-NEXT:    movl {{[-0-9]+}}(%e{{[sb]}}p), %ecx # 4-byte Reload
+; X86-NEXT:    adcl $-1, %ecx
+; X86-NEXT:    movl %ecx, {{[-0-9]+}}(%e{{[sb]}}p) # 4-byte Spill
 ; X86-NEXT:    movl {{[-0-9]+}}(%e{{[sb]}}p), %ecx # 4-byte Reload
 ; X86-NEXT:    adcl $-1, %ecx
 ; X86-NEXT:    movl %ecx, {{[-0-9]+}}(%e{{[sb]}}p) # 4-byte Spill
diff --git a/llvm/test/CodeGen/X86/fminimum-fmaximum.ll b/llvm/test/CodeGen/X86/fminimum-fmaximum.ll
index 257524e0d4db5c..0530c843acfe67 100644
--- a/llvm/test/CodeGen/X86/fminimum-fmaximum.ll
+++ b/llvm/test/CodeGen/X86/fminimum-fmaximum.ll
@@ -2022,19 +2022,19 @@ define <4 x half> @test_fmaximum_v4f16(<4 x half> %x, <4 x half> %y) nounwind {
 ; X86-NEXT:    fstps {{[0-9]+}}(%esp)
 ; X86-NEXT:    fldt {{[-0-9]+}}(%e{{[sb]}}p) # 10-byte Folded Reload
 ; X86-NEXT:    fstps {{[0-9]+}}(%esp)
-; X86-NEXT:    vmovd {{.*#+}} xmm2 = mem[0],zero,zero,zero
-; X86-NEXT:    vmovd %xmm2, %eax
+; X86-NEXT:    vmovd {{.*#+}} xmm1 = mem[0],zero,zero,zero
+; X86-NEXT:    vmovd %xmm1, %eax
 ; X86-NEXT:    vmovd {{.*#+}} xmm0 = mem[0],zero,zero,zero
 ; X86-NEXT:    testl %eax, %eax
 ; X86-NEXT:    js .LBB33_1
 ; X86-NEXT:  # %bb.2:
-; X86-NEXT:    vmovdqa %xmm2, %xmm1
+; X86-NEXT:    vmovdqa %xmm1, %xmm2
 ; X86-NEXT:    jmp .LBB33_3
 ; X86-NEXT:  .LBB33_1:
-; X86-NEXT:    vmovdqa %xmm0, %xmm1
-; X86-NEXT:    vmovdqa %xmm2, %xmm0
+; X86-NEXT:    vmovdqa %xmm0, %xmm2
+; X86-NEXT:    vmovdqa %xmm1, %xmm0
 ; X86-NEXT:  .LBB33_3:
-; X86-NEXT:    vmaxss %xmm1, %xmm0, %xmm1
+; X86-NEXT:    vmaxss %xmm2, %xmm0, %xmm1
 ; X86-NEXT:    vcmpunordss %xmm0, %xmm0, %xmm2
 ; X86-NEXT:    vblendvps %xmm2, %xmm0, %xmm1, %xmm0
 ; X86-NEXT:    vmovups %xmm0, {{[-0-9]+}}(%e{{[sb]}}p) # 16-byte Spill
@@ -2154,43 +2154,43 @@ define <4 x bfloat> @test_fmaximum_v4bf16(<4 x bfloat> %x, <4 x bfloat> %y) {
 ; SSE2-NEXT:    .cfi_offset %r14, -32
 ; SSE2-NEXT:    .cfi_offset %r15, -24
 ; SSE2-NEXT:    .cfi_offset %rbp, -16
-; SSE2-NEXT:    movdqa %xmm1, %xmm4
-; SSE2-NEXT:    movdqa %xmm0, %xmm5
 ; SSE2-NEXT:    pextrw $0, %xmm1, %r14d
 ; SSE2-NEXT:    pextrw $0, %xmm0, %r15d
-; SSE2-NEXT:    movdqa %xmm1, %xmm0
-; SSE2-NEXT:    psrld $16, %xmm0
-; SSE2-NEXT:    pextrw $0, %xmm0, %eax
-; SSE2-NEXT:    movdqa %xmm5, %xmm0
-; SSE2-NEXT:    psrld $16, %xmm0
-; SSE2-NEXT:    pextrw $0, %xmm0, %ecx
+; SSE2-NEXT:    movdqa %xmm1, %xmm2
+; SSE2-NEXT:    psrld $16, %xmm2
+; SSE2-NEXT:    pextrw $0, %xmm2, %eax
+; SSE2-NEXT:    movdqa %xmm0, %xmm2
+; SSE2-NEXT:    psrld $16, %xmm2
+; SSE2-NEXT:    pextrw $0, %xmm2, %ecx
 ; SSE2-NEXT:    shll $16, %ecx
 ; SSE2-NEXT:    movd %ecx, %xmm3
 ; SSE2-NEXT:    shll $16, %eax
 ; SSE2-NEXT:    movd %eax, %xmm2
 ; SSE2-NEXT:    testl %ecx, %ecx
-; SSE2-NEXT:    movdqa %xmm3, %xmm1
+; SSE2-NEXT:    movdqa %xmm3, %xmm7
 ; SSE2-NEXT:    js .LBB34_2
 ; SSE2-NEXT:  # %bb.1:
-; SSE2-NEXT:    movdqa %xmm2, %xmm1
+; SSE2-NEXT:    movdqa %xmm2, %xmm7
 ; SSE2-NEXT:  .LBB34_2:
-; SSE2-NEXT:    movdqa %xmm4, {{[-0-9]+}}(%r{{[sb]}}p) # 16-byte Spill
-; SSE2-NEXT:    shufps {{.*#+}} xmm4 = xmm4[1,1,1,1]
-; SSE2-NEXT:    movdqa %xmm5, (%rsp) # 16-byte Spill
-; SSE2-NEXT:    shufps {{.*#+}} xmm5 = xmm5[1,1,1,1]
-; SSE2-NEXT:    movdqa %xmm1, %xmm0
-; SSE2-NEXT:    cmpunordss %xmm1, %xmm0
-; SSE2-NEXT:    movaps %xmm0, %xmm6
-; SSE2-NEXT:    andps %xmm1, %xmm6
+; SSE2-NEXT:    movdqa %xmm1, {{[-0-9]+}}(%r{{[sb]}}p) # 16-byte Spill
+; SSE2-NEXT:    movdqa %xmm1, %xmm5
+; SSE2-NEXT:    shufps {{.*#+}} xmm5 = xmm5[1,1],xmm1[1,1]
+; SSE2-NEXT:    movdqa %xmm0, (%rsp) # 16-byte Spill
+; SSE2-NEXT:    movdqa %xmm0, %xmm6
+; SSE2-NEXT:    shufps {{.*#+}} xmm6 = xmm6[1,1],xmm0[1,1]
+; SSE2-NEXT:    movdqa %xmm7, %xmm0
+; SSE2-NEXT:    cmpunordss %xmm7, %xmm0
+; SSE2-NEXT:    movaps %xmm0, %xmm4
+; SSE2-NEXT:    andps %xmm7, %xmm4
 ; SSE2-NEXT:    js .LBB34_4
 ; SSE2-NEXT:  # %bb.3:
 ; SSE2-NEXT:    movdqa %xmm3, %xmm2
 ; SSE2-NEXT:  .LBB34_4:
-; SSE2-NEXT:    pextrw $0, %xmm4, %ebp
-; SSE2-NEXT:    pextrw $0, %xmm5, %ebx
-; SSE2-NEXT:    maxss %xmm2, %xmm1
-; SSE2-NEXT:    andnps %xmm1, %xmm0
-; SSE2-NEXT:    orps %xmm6, %xmm0
+; SSE2-NEXT:    pextrw $0, %xmm5, %ebp
+; SSE2-NEXT:    pextrw $0, %xmm6, %ebx
+; SSE2-NEXT:    maxss %xmm2, %xmm7
+; SSE2-NEXT:    andnps %xmm7, %xmm0
+; SSE2-NEXT:    orps %xmm4, %xmm0
 ; SSE2-NEXT:    callq __truncsfbf2 at PLT
 ; SSE2-NEXT:    movaps %xmm0, {{[-0-9]+}}(%r{{[sb]}}p) # 16-byte Spill
 ; SSE2-NEXT:    shll $16, %r15d
@@ -2312,7 +2312,7 @@ define <4 x bfloat> @test_fmaximum_v4bf16(<4 x bfloat> %x, <4 x bfloat> %y) {
 ; AVX1-NEXT:    vmovshdup {{.*#+}} xmm4 = xmm0[1,1,3,3]
 ; AVX1-NEXT:    vpextrw $0, %xmm4, %ebx
 ; AVX1-NEXT:    vmovshdup {{.*#+}} xmm4 = xmm1[1,1,3,3]
-; AVX1-NEXT:    vpextrw $0, %xmm4, %ebp
+; AVX1-NEXT:    vpextrw $0, %xmm4, %r14d
 ; AVX1-NEXT:    vpextrw $0, %xmm0, %r12d
 ; AVX1-NEXT:    vpextrw $0, %xmm1, %r13d
 ; AVX1-NEXT:    vpsrld $16, %xmm0, %xmm0
@@ -2331,7 +2331,7 @@ define <4 x bfloat> @test_fmaximum_v4bf16(<4 x bfloat> %x, <4 x bfloat> %y) {
 ; AVX1-NEXT:    vmovdqa %xmm0, %xmm1
 ; AVX1-NEXT:    vmovdqa %xmm4, %xmm0
 ; AVX1-NEXT:  .LBB34_3:
-; AVX1-NEXT:    vpextrw $0, %xmm2, %r14d
+; AVX1-NEXT:    vpextrw $0, %xmm2, %ebp
 ; AVX1-NEXT:    vpextrw $0, %xmm3, %r15d
 ; AVX1-NEXT:    vmaxss %xmm1, %xmm0, %xmm1
 ; AVX1-NEXT:    vcmpunordss %xmm0, %xmm0, %xmm2
@@ -2355,8 +2355,8 @@ define <4 x bfloat> @test_fmaximum_v4bf16(<4 x bfloat> %x, <4 x bfloat> %y) {
 ; AVX1-NEXT:    vblendvps %xmm2, %xmm0, %xmm1, %xmm0
 ; AVX1-NEXT:    callq __truncsfbf2 at PLT
 ; AVX1-NEXT:    vmovaps %xmm0, {{[-0-9]+}}(%r{{[sb]}}p) # 16-byte Spill
-; AVX1-NEXT:    shll $16, %ebp
-; AVX1-NEXT:    vmovd %ebp, %xmm0
+; AVX1-NEXT:    shll $16, %r14d
+; AVX1-NEXT:    vmovd %r14d, %xmm0
 ; AVX1-NEXT:    shll $16, %ebx
 ; AVX1-NEXT:    vmovd %ebx, %xmm2
 ; AVX1-NEXT:    js .LBB34_7
@@ -2374,8 +2374,8 @@ define <4 x bfloat> @test_fmaximum_v4bf16(<4 x bfloat> %x, <4 x bfloat> %y) {
 ; AVX1-NEXT:    vmovaps %xmm0, (%rsp) # 16-byte Spill
 ; AVX1-NEXT:    shll $16, %r15d
 ; AVX1-NEXT:    vmovd %r15d, %xmm0
-; AVX1-NEXT:    shll $16, %r14d
-; AVX1-NEXT:    vmovd %r14d, %xmm2
+; AVX1-NEXT:    shll $16, %ebp
+; AVX1-NEXT:    vmovd %ebp, %xmm2
 ; AVX1-NEXT:    js .LBB34_10
 ; AVX1-NEXT:  # %bb.11:
 ; AVX1-NEXT:    vmovdqa %xmm2, %xmm1
diff --git a/llvm/test/CodeGen/X86/fminimumnum-fmaximumnum.ll b/llvm/test/CodeGen/X86/fminimumnum-fmaximumnum.ll
index bfff6ef41dbe00..c617b45707f8f3 100644
--- a/llvm/test/CodeGen/X86/fminimumnum-fmaximumnum.ll
+++ b/llvm/test/CodeGen/X86/fminimumnum-fmaximumnum.ll
@@ -1827,7 +1827,7 @@ define <4 x half> @test_fmaximumnum_v4f16(<4 x half> %x, <4 x half> %y) nounwind
 ; AVX512-NEXT:    kmovw %eax, %k2
 ; AVX512-NEXT:    vmovss %xmm0, %xmm2, %xmm2 {%k2}
 ; AVX512-NEXT:    vcvtps2ph $4, %xmm2, %xmm1
-; AVX512-NEXT:    vmovaps %xmm1, (%rsp) # 16-byte Spill
+; AVX512-NEXT:    vmovaps %xmm1, {{[-0-9]+}}(%r{{[sb]}}p) # 16-byte Spill
 ; AVX512-NEXT:    vcvtph2ps %xmm1, %xmm2
 ; AVX512-NEXT:    vmovss %xmm2, %xmm0, %xmm0 {%k1}
 ; AVX512-NEXT:    vcvtps2ph $4, %xmm0, %xmm0
@@ -1842,10 +1842,10 @@ define <4 x half> @test_fmaximumnum_v4f16(<4 x half> %x, <4 x half> %y) nounwind
 ; AVX512-NEXT:    vcvtph2ps %xmm0, %xmm0
 ; AVX512-NEXT:    movzwl {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %eax
 ; AVX512-NEXT:    vmovd %eax, %xmm2
-; AVX512-NEXT:    vcvtph2ps %xmm2, %xmm9
-; AVX512-NEXT:    vmulss %xmm0, %xmm9, %xmm0
-; AVX512-NEXT:    vxorps %xmm10, %xmm10, %xmm10
-; AVX512-NEXT:    vblendps {{.*#+}} xmm0 = xmm0[0],xmm10[1,2,3]
+; AVX512-NEXT:    vcvtph2ps %xmm2, %xmm7
+; AVX512-NEXT:    vmulss %xmm7, %xmm0, %xmm0
+; AVX512-NEXT:    vxorps %xmm9, %xmm9, %xmm9
+; AVX512-NEXT:    vblendps {{.*#+}} xmm0 = xmm0[0],xmm9[1,2,3]
 ; AVX512-NEXT:    vcvtps2ph $4, %xmm0, %xmm0
 ; AVX512-NEXT:    vmovdqa %xmm0, {{[-0-9]+}}(%r{{[sb]}}p) # 16-byte Spill
 ; AVX512-NEXT:    vmovd %xmm0, %eax
@@ -1866,7 +1866,7 @@ define <4 x half> @test_fmaximumnum_v4f16(<4 x half> %x, <4 x half> %y) nounwind
 ; AVX512-NEXT:    vcvtph2ps %xmm1, %xmm3
 ; AVX512-NEXT:    vmovss %xmm3, %xmm2, %xmm2 {%k1}
 ; AVX512-NEXT:    vcvtps2ph $4, %xmm2, %xmm1
-; AVX512-NEXT:    vmovaps %xmm1, {{[-0-9]+}}(%r{{[sb]}}p) # 16-byte Spill
+; AVX512-NEXT:    vmovaps %xmm1, (%rsp) # 16-byte Spill
 ; AVX512-NEXT:    vcvtph2ps %xmm1, %xmm2
 ; AVX512-NEXT:    vucomiss %xmm2, %xmm3
 ; AVX512-NEXT:    seta %al
@@ -1875,8 +1875,8 @@ define <4 x half> @test_fmaximumnum_v4f16(<4 x half> %x, <4 x half> %y) nounwind
 ; AVX512-NEXT:    vcvtps2ph $4, %xmm2, %xmm2
 ; AVX512-NEXT:    vpmovzxwq {{.*#+}} xmm2 = xmm2[0],zero,zero,zero,xmm2[1],zero,zero,zero
 ; AVX512-NEXT:    vcvtph2ps %xmm2, %xmm2
-; AVX512-NEXT:    vmulss %xmm2, %xmm9, %xmm2
-; AVX512-NEXT:    vblendps {{.*#+}} xmm2 = xmm2[0],xmm10[1,2,3]
+; AVX512-NEXT:    vmulss %xmm7, %xmm2, %xmm2
+; AVX512-NEXT:    vblendps {{.*#+}} xmm2 = xmm2[0],xmm9[1,2,3]
 ; AVX512-NEXT:    vcvtps2ph $4, %xmm2, %xmm1
 ; AVX512-NEXT:    vmovdqa %xmm1, {{[-0-9]+}}(%r{{[sb]}}p) # 16-byte Spill
 ; AVX512-NEXT:    vmovd %xmm1, %eax
@@ -1915,8 +1915,9 @@ define <4 x half> @test_fmaximumnum_v4f16(<4 x half> %x, <4 x half> %y) nounwind
 ; AVX512-NEXT:    setp %al
 ; AVX512-NEXT:    kmovw %eax, %k2
 ; AVX512-NEXT:    vmovss %xmm0, %xmm5, %xmm5 {%k2}
-; AVX512-NEXT:    vcvtps2ph $4, %xmm5, %xmm15
-; AVX512-NEXT:    vcvtph2ps %xmm15, %xmm5
+; AVX512-NEXT:    vcvtps2ph $4, %xmm5, %xmm1
+; AVX512-NEXT:    vmovaps %xmm1, {{[-0-9]+}}(%r{{[sb]}}p) # 16-byte Spill
+; AVX512-NEXT:    vcvtph2ps %xmm1, %xmm5
 ; AVX512-NEXT:    vmovss %xmm5, %xmm0, %xmm0 {%k1}
 ; AVX512-NEXT:    vcvtps2ph $4, %xmm0, %xmm0
 ; AVX512-NEXT:    vmovaps %xmm0, {{[-0-9]+}}(%r{{[sb]}}p) # 16-byte Spill
@@ -1928,16 +1929,16 @@ define <4 x half> @test_fmaximumnum_v4f16(<4 x half> %x, <4 x half> %y) nounwind
 ; AVX512-NEXT:    vcvtps2ph $4, %xmm3, %xmm3
 ; AVX512-NEXT:    vpmovzxwq {{.*#+}} xmm3 = xmm3[0],zero,zero,zero,xmm3[1],zero,zero,zero
 ; AVX512-NEXT:    vcvtph2ps %xmm3, %xmm3
-; AVX512-NEXT:    vmulss %xmm3, %xmm9, %xmm3
-; AVX512-NEXT:    vblendps {{.*#+}} xmm3 = xmm3[0],xmm10[1,2,3]
+; AVX512-NEXT:    vmulss %xmm7, %xmm3, %xmm3
+; AVX512-NEXT:    vblendps {{.*#+}} xmm3 = xmm3[0],xmm9[1,2,3]
 ; AVX512-NEXT:    vcvtps2ph $4, %xmm3, %xmm1
 ; AVX512-NEXT:    vmovdqa %xmm1, {{[-0-9]+}}(%r{{[sb]}}p) # 16-byte Spill
 ; AVX512-NEXT:    vmovd %xmm1, %eax
 ; AVX512-NEXT:    vcvtps2ph $4, %xmm0, %xmm0
 ; AVX512-NEXT:    vpmovzxwq {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero
 ; AVX512-NEXT:    vcvtph2ps %xmm0, %xmm0
-; AVX512-NEXT:    vmulss %xmm0, %xmm9, %xmm0
-; AVX512-NEXT:    vblendps {{.*#+}} xmm0 = xmm0[0],xmm10[1,2,3]
+; AVX512-NEXT:    vmulss %xmm7, %xmm0, %xmm0
+; AVX512-NEXT:    vblendps {{.*#+}} xmm0 = xmm0[0],xmm9[1,2,3]
 ; AVX512-NEXT:    vcvtps2ph $4, %xmm0, %xmm0
 ; AVX512-NEXT:    vmovdqa %xmm0, {{[-0-9]+}}(%r{{[sb]}}p) # 16-byte Spill
 ; AVX512-NEXT:    vmovd %xmm0, %ecx
@@ -1957,12 +1958,11 @@ define <4 x half> @test_fmaximumnum_v4f16(<4 x half> %x, <4 x half> %y) nounwind
 ; AVX512-NEXT:    setp %al
 ; AVX512-NEXT:    kmovw %eax, %k2
 ; AVX512-NEXT:    vmovss %xmm0, %xmm2, %xmm2 {%k2}
-; AVX512-NEXT:    vcvtps2ph $4, %xmm2, %xmm11
-; AVX512-NEXT:    vcvtph2ps %xmm11, %xmm3
+; AVX512-NEXT:    vcvtps2ph $4, %xmm2, %xmm12
+; AVX512-NEXT:    vcvtph2ps %xmm12, %xmm3
 ; AVX512-NEXT:    vmovss %xmm3, %xmm0, %xmm0 {%k1}
-; AVX512-NEXT:    vcvtps2ph $4, %xmm0, %xmm0
-; AVX512-NEXT:    vmovaps %xmm0, {{[-0-9]+}}(%r{{[sb]}}p) # 16-byte Spill
-; AVX512-NEXT:    vcvtph2ps %xmm0, %xmm2
+; AVX512-NEXT:    vcvtps2ph $4, %xmm0, %xmm15
+; AVX512-NEXT:    vcvtph2ps %xmm15, %xmm2
 ; AVX512-NEXT:    vucomiss %xmm2, %xmm3
 ; AVX512-NEXT:    seta %al
 ; AVX512-NEXT:    kmovw %eax, %k1
@@ -1978,11 +1978,11 @@ define <4 x half> @test_fmaximumnum_v4f16(<4 x half> %x, <4 x half> %y) nounwind
 ; AVX512-NEXT:    setp %al
 ; AVX512-NEXT:    kmovw %eax, %k2
 ; AVX512-NEXT:    vmovss %xmm0, %xmm3, %xmm3 {%k2}
-; AVX512-NEXT:    vcvtps2ph $4, %xmm3, %xmm7
-; AVX512-NEXT:    vcvtph2ps %xmm7, %xmm3
+; AVX512-NEXT:    vcvtps2ph $4, %xmm3, %xmm10
+; AVX512-NEXT:    vcvtph2ps %xmm10, %xmm3
 ; AVX512-NEXT:    vmovss %xmm3, %xmm0, %xmm0 {%k1}
-; AVX512-NEXT:    vcvtps2ph $4, %xmm0, %xmm12
-; AVX512-NEXT:    vcvtph2ps %xmm12, %xmm0
+; AVX512-NEXT:    vcvtps2ph $4, %xmm0, %xmm11
+; AVX512-NEXT:    vcvtph2ps %xmm11, %xmm0
 ; AVX512-NEXT:    vucomiss %xmm0, %xmm3
 ; AVX512-NEXT:    seta %al
 ; AVX512-NEXT:    kmovw %eax, %k1
@@ -1990,20 +1990,20 @@ define <4 x half> @test_fmaximumnum_v4f16(<4 x half> %x, <4 x half> %y) nounwind
 ; AVX512-NEXT:    vcvtps2ph $4, %xmm2, %xmm2
 ; AVX512-NEXT:    vpmovzxwq {{.*#+}} xmm2 = xmm2[0],zero,zero,zero,xmm2[1],zero,zero,zero
 ; AVX512-NEXT:    vcvtph2ps %xmm2, %xmm2
-; AVX512-NEXT:    vmulss %xmm2, %xmm9, %xmm2
-; AVX512-NEXT:    vblendps {{.*#+}} xmm2 = xmm2[0],xmm10[1,2,3]
+; AVX512-NEXT:    vmulss %xmm7, %xmm2, %xmm2
+; AVX512-NEXT:    vblendps {{.*#+}} xmm2 = xmm2[0],xmm9[1,2,3]
 ; AVX512-NEXT:    vcvtps2ph $4, %xmm2, %xmm14
 ; AVX512-NEXT:    vmovd %xmm14, %eax
 ; AVX512-NEXT:    vcvtps2ph $4, %xmm0, %xmm0
 ; AVX512-NEXT:    vpmovzxwq {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero
 ; AVX512-NEXT:    vcvtph2ps %xmm0, %xmm0
-; AVX512-NEXT:    vmulss %xmm0, %xmm9, %xmm0
-; AVX512-NEXT:    vblendps {{.*#+}} xmm0 = xmm0[0],xmm10[1,2,3]
+; AVX512-NEXT:    vmulss %xmm7, %xmm0, %xmm0
+; AVX512-NEXT:    vblendps {{.*#+}} xmm0 = xmm0[0],xmm9[1,2,3]
 ; AVX512-NEXT:    vcvtps2ph $4, %xmm0, %xmm13
 ; AVX512-NEXT:    vmovd %xmm13, %ecx
 ; AVX512-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0
 ; AVX512-NEXT:    vpinsrw $0, %ecx, %xmm0, %xmm2
-; AVX512-NEXT:    vpunpcklwd {{.*#+}} xmm5 = xmm2[0],xmm0[0],xmm2[1],xmm0[1],xmm2[2],xmm0[2],xmm2[3],xmm0[3]
+; AVX512-NEXT:    vpunpcklwd {{.*#+}} xmm8 = xmm2[0],xmm0[0],xmm2[1],xmm0[1],xmm2[2],xmm0[2],xmm2[3],xmm0[3]
 ; AVX512-NEXT:    vcvtph2ps %xmm4, %xmm0
 ; AVX512-NEXT:    vucomiss %xmm0, %xmm0
 ; AVX512-NEXT:    setp %al
@@ -2016,8 +2016,8 @@ define <4 x half> @test_fmaximumnum_v4f16(<4 x half> %x, <4 x half> %y) nounwind
 ; AVX512-NEXT:    vcvtps2ph $4, %xmm2, %xmm3
 ; AVX512-NEXT:    vcvtph2ps %xmm3, %xmm1
 ; AVX512-NEXT:    vmovss %xmm1, %xmm0, %xmm0 {%k1}
-; AVX512-NEXT:    vcvtps2ph $4, %xmm0, %xmm8
-; AVX512-NEXT:    vcvtph2ps %xmm8, %xmm2
+; AVX512-NEXT:    vcvtps2ph $4, %xmm0, %xmm5
+; AVX512-NEXT:    vcvtph2ps %xmm5, %xmm2
 ; AVX512-NEXT:    vucomiss %xmm2, %xmm1
 ; AVX512-NEXT:    seta %al
 ; AVX512-NEXT:    kmovw %eax, %k1
@@ -2045,125 +2045,125 @@ define <4 x half> @test_fmaximumnum_v4f16(<4 x half> %x, <4 x half> %y) nounwind
 ; AVX512-NEXT:    vcvtps2ph $4, %xmm2, %xmm2
 ; AVX512-NEXT:    vpmovzxwq {{.*#+}} xmm2 = xmm2[0],zero,zero,zero,xmm2[1],zero,zero,zero
 ; AVX512-NEXT:    vcvtph2ps %xmm2, %xmm2
-; AVX512-NEXT:    vmulss %xmm2, %xmm9, %xmm2
+; AVX512-NEXT:    vmulss %xmm7, %xmm2, %xmm2
 ; AVX512-NEXT:    vcvtps2ph $4, %xmm0, %xmm0
 ; AVX512-NEXT:    vpmovzxwq {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero
 ; AVX512-NEXT:    vcvtph2ps %xmm0, %xmm0
-; AVX512-NEXT:    vmulss %xmm0, %xmm9, %xmm0
-; AVX512-NEXT:    vblendps {{.*#+}} xmm2 = xmm2[0],xmm10[1,2,3]
-; AVX512-NEXT:    vblendps {{.*#+}} xmm0 = xmm0[0],xmm10[1,2,3]
-; AVX512-NEXT:    vcvtps2ph $4, %xmm2, %xmm9
-; AVX512-NEXT:    vmovd %xmm9, %eax
-; AVX512-NEXT:    vcvtps2ph $4, %xmm0, %xmm10
-; AVX512-NEXT:    vmovd %xmm10, %ecx
+; AVX512-NEXT:    vmulss %xmm7, %xmm0, %xmm0
+; AVX512-NEXT:    vblendps {{.*#+}} xmm2 = xmm2[0],xmm9[1,2,3]
+; AVX512-NEXT:    vblendps {{.*#+}} xmm0 = xmm0[0],xmm9[1,2,3]
+; AVX512-NEXT:    vcvtps2ph $4, %xmm2, %xmm7
+; AVX512-NEXT:    vmovd %xmm7, %eax
+; AVX512-NEXT:    vcvtps2ph $4, %xmm0, %xmm9
+; AVX512-NEXT:    vmovd %xmm9, %ecx
 ; AVX512-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0
 ; AVX512-NEXT:    vpinsrw $0, %ecx, %xmm0, %xmm2
 ; AVX512-NEXT:    vpunpcklwd {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3]
-; AVX512-NEXT:    vpunpckldq {{.*#+}} xmm0 = xmm0[0],xmm5[0],xmm0[1],xmm5[1]
-; AVX512-NEXT:    vpunpcklqdq {{[-0-9]+}}(%r{{[sb]}}p), %xmm0, %xmm6 # 16-byte Folded Reload
-; AVX512-NEXT:    # xmm6 = xmm0[0],mem[0]
-; AVX512-NEXT:    vmovdqa (%rsp), %xmm0 # 16-byte Reload
-; AVX512-NEXT:    vmovd %xmm0, %eax
-; AVX512-NEXT:    vmovdqa {{[-0-9]+}}(%r{{[sb]}}p), %xmm0 # 16-byte Reload
-; AVX512-NEXT:    vmovd %xmm0, %ecx
-; AVX512-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm0
-; AVX512-NEXT:    vpinsrw $0, %ecx, %xmm0, %xmm2
-; AVX512-NEXT:    vpunpcklwd {{.*#+}} xmm0 = xmm2[0],xmm0[0],xmm2[1],xmm0[1],xmm2[2],xmm0[2],xmm2[3],xmm0[3]
+; AVX512-NEXT:    vpunpckldq {{.*#+}} xmm0 = xmm0[0],xmm8[0],xmm0[1],xmm8[1]
+; AVX512-NEXT:    vpunpcklqdq {{[-0-9]+}}(%r{{[sb]}}p), %xmm0, %xmm0 # 16-byte Folded Reload
+; AVX512-NEXT:    # xmm0 = xmm0[0],mem[0]
 ; AVX512-NEXT:    vmovdqa {{[-0-9]+}}(%r{{[sb]}}p), %xmm2 # 16-byte Reload
 ; AVX512-NEXT:    vmovd %xmm2, %eax
-; AVX512-NEXT:    vmovd %xmm15, %ecx
-; AVX512-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm2
-; AVX512-NEXT:    vpinsrw $0, %ecx, %xmm0, %xmm5
-; AVX512-NEXT:    vpunpcklwd {{.*#+}} xmm2 = xmm5[0],xmm2[0],xmm5[1],xmm2[1],xmm5[2],xmm2[2],xmm5[3],xmm2[3]
-; AVX512-NEXT:    vpunpckldq {{.*#+}} xmm0 = xmm2[0],xmm0[0],xmm2[1],xmm0[1]
-; AVX512-NEXT:    vmovd %xmm11, %eax
-; AVX512-NEXT:    vmovd %xmm7, %ecx
+; AVX512-NEXT:    vmovdqa {{[-0-9]+}}(%r{{[sb]}}p), %xmm2 # 16-byte Reload
+; AVX512-NEXT:    vmovd %xmm2, %ecx
 ; AVX512-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm2
-; AVX512-NEXT:    vpinsrw $0, %ecx, %xmm0, %xmm5
-; AVX512-NEXT:    vpunpcklwd {{.*#+}} xmm2 = xmm5[0],xmm2[0],xmm5[1],xmm2[1],xmm5[2],xmm2[2],xmm5[3],xmm2[3]
+; AVX512-NEXT:    vpinsrw $0, %ecx, %xmm0, %xmm6
+; AVX512-NEXT:    vpunpcklwd {{.*#+}} xmm2 = xmm6[0],xmm2[0],xmm6[1],xmm2[1],xmm6[2],xmm2[2],xmm6[3],xmm2[3]
+; AVX512-NEXT:    vmovdqa {{[-0-9]+}}(%r{{[sb]}}p), %xmm6 # 16-byte Reload
+; AVX512-NEXT:    vmovd %xmm6, %eax
+; AVX512-NEXT:    vmovdqa {{[-0-9]+}}(%r{{[sb]}}p), %xmm6 # 16-byte Reload
+; AVX512-NEXT:    vmovd %xmm6, %ecx
+; AVX512-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm6
+; AVX512-NEXT:    vpinsrw $0, %ecx, %xmm0, %xmm8
+; AVX512-NEXT:    vpunpcklwd {{.*#+}} xmm6 = xmm8[0],xmm6[0],xmm8[1],xmm6[1],xmm8[2],xmm6[2],xmm8[3],xmm6[3]
+; AVX512-NEXT:    vpunpckldq {{.*#+}} xmm2 = xmm6[0],xmm2[0],xmm6[1],xmm2[1]
+; AVX512-NEXT:    vmovd %xmm12, %eax
+; AVX512-NEXT:    vmovd %xmm10, %ecx
+; AVX512-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm6
+; AVX512-NEXT:    vpinsrw $0, %ecx, %xmm0, %xmm8
+; AVX512-NEXT:    vpunpcklwd {{.*#+}} xmm6 = xmm8[0],xmm6[0],xmm8[1],xmm6[1],xmm8[2],xmm6[2],xmm8[3],xmm6[3]
 ; AVX512-NEXT:    vmovd %xmm3, %eax
 ; AVX512-NEXT:    vmovd %xmm4, %ecx
 ; AVX512-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm3
 ; AVX512-NEXT:    vpinsrw $0, %ecx, %xmm0, %xmm4
 ; AVX512-NEXT:    vpunpcklwd {{.*#+}} xmm3 = xmm3[0],xmm4[0],xmm3[1],xmm4[1],xmm3[2],xmm4[2],xmm3[3],xmm4[3]
-; AVX512-NEXT:    vpunpckldq {{.*#+}} xmm2 = xmm3[0],xmm2[0],xmm3[1],xmm2[1]
-; AVX512-NEXT:    vpunpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm0[0]
-; AVX512-NEXT:    vpxor %xmm0, %xmm0, %xmm0
-; AVX512-NEXT:    vpcmpeqw %xmm0, %xmm2, %xmm3
-; AVX512-NEXT:    vpblendvb %xmm3, %xmm2, %xmm6, %xmm2
+; AVX512-NEXT:    vpunpckldq {{.*#+}} xmm3 = xmm3[0],xmm6[0],xmm3[1],xmm6[1]
+; AVX512-NEXT:    vpunpcklqdq {{.*#+}} xmm3 = xmm3[0],xmm2[0]
+; AVX512-NEXT:    vpxor %xmm2, %xmm2, %xmm2
+; AVX512-NEXT:    vpcmpeqw %xmm2, %xmm3, %xmm4
+; AVX512-NEXT:    vpblendvb %xmm4, %xmm3, %xmm0, %xmm4
 ; AVX512-NEXT:    vmovdqa {{[-0-9]+}}(%r{{[sb]}}p), %xmm3 # 16-byte Reload
 ; AVX512-NEXT:    vmovd %xmm3, %eax
-; AVX512-NEXT:    vmovdqa {{[-0-9]+}}(%r{{[sb]}}p), %xmm3 # 16-byte Reload
+; AVX512-NEXT:    vmovdqa (%rsp), %xmm3 # 16-byte Reload
 ; AVX512-NEXT:    vmovd %xmm3, %ecx
 ; AVX512-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm3
-; AVX512-NEXT:    vpinsrw $0, %ecx, %xmm0, %xmm4
-; AVX512-NEXT:    vpunpcklwd {{.*#+}} xmm3 = xmm4[0],xmm3[0],xmm4[1],xmm3[1],xmm4[2],xmm3[2],xmm4[3],xmm3[3]
-; AVX512-NEXT:    vmovdqa {{[-0-9]+}}(%r{{[sb]}}p), %xmm4 # 16-byte Reload
-; AVX512-NEXT:    vmovd %xmm4, %eax
-; AVX512-NEXT:    vmovdqa {{[-0-9]+}}(%r{{[sb]}}p), %xmm4 # 16-byte Reload
-; AVX512-NEXT:    vmovd %xmm4, %ecx
-; AVX512-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm4
-; AVX512-NEXT:    vpinsrw $0, %ecx, %xmm0, %xmm5
-; AVX512-NEXT:    vpunpcklwd {{.*#+}} xmm4 = xmm5[0],xmm4[0],xmm5[1],xmm4[1],xmm5[2],xmm4[2],xmm5[3],xmm4[3]
-; AVX512-NEXT:    vpunpckldq {{.*#+}} xmm3 = xmm4[0],xmm3[0],xmm4[1],xmm3[1]
-; AVX512-NEXT:    vmovdqa {{[-0-9]+}}(%r{{[sb]}}p), %xmm4 # 16-byte Reload
-; AVX512-NEXT:    vmovd %xmm4, %eax
-; AVX512-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm4
-; AVX512-NEXT:    vmovd %xmm12, %eax
-; AVX512-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm5
-; AVX512-NEXT:    vpunpcklwd {{.*#+}} xmm4 = xmm5[0],xmm4[0],xmm5[1],xmm4[1],xmm5[2],xmm4[2],xmm5[3],xmm4[3]
-; AVX512-NEXT:    vmovd %xmm8, %eax
+; AVX512-NEXT:    vpinsrw $0, %ecx, %xmm0, %xmm6
+; AVX512-NEXT:    vpunpcklwd {{.*#+}} xmm3 = xmm6[0],xmm3[0],xmm6[1],xmm3[1],xmm6[2],xmm3[2],xmm6[3],xmm3[3]
+; AVX512-NEXT:    vmovdqa {{[-0-9]+}}(%r{{[sb]}}p), %xmm6 # 16-byte Reload
+; AVX512-NEXT:    vmovd %xmm6, %eax
+; AVX512-NEXT:    vmovdqa {{[-0-9]+}}(%r{{[sb]}}p), %xmm6 # 16-byte Reload
+; AVX512-NEXT:    vmovd %xmm6, %ecx
+; AVX512-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm6
+; AVX512-NEXT:    vpinsrw $0, %ecx, %xmm0, %xmm8
+; AVX512-NEXT:    vpunpcklwd {{.*#+}} xmm6 = xmm8[0],xmm6[0],xmm8[1],xmm6[1],xmm8[2],xmm6[2],xmm8[3],xmm6[3]
+; AVX512-NEXT:    vpunpckldq {{.*#+}} xmm3 = xmm6[0],xmm3[0],xmm6[1],xmm3[1]
+; AVX512-NEXT:    vmovd %xmm15, %eax
+; AVX512-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm6
+; AVX512-NEXT:    vmovd %xmm11, %eax
+; AVX512-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm8
+; AVX512-NEXT:    vpunpcklwd {{.*#+}} xmm6 = xmm8[0],xmm6[0],xmm8[1],xmm6[1],xmm8[2],xmm6[2],xmm8[3],xmm6[3]
+; AVX512-NEXT:    vmovd %xmm5, %eax
 ; AVX512-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm5
 ; AVX512-NEXT:    vmovd %xmm1, %eax
 ; AVX512-NEXT:    vpinsrw $0, %eax, %xmm0, %xmm1
 ; AVX512-NEXT:    vpunpcklwd {{.*#+}} xmm1 = xmm5[0],xmm1[0],xmm5[1],xmm1[1],xmm5[2],xmm1[2],xmm5[3],xmm1[3]
-; AVX512-NEXT:    vpunpckldq {{.*#+}} xmm1 = xmm1[0],xmm4[0],xmm1[1],xmm4[1]
+; AVX512-NEXT:    vpunpckldq {{.*#+}} xmm1 = xmm1[0],xmm6[0],xmm1[1],xmm6[1]
 ; AVX512-NEXT:    vpunpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm3[0]
-; AVX512-NEXT:    vpcmpeqw %xmm0, %xmm1, %xmm0
-; AVX512-NEXT:    vpblendvb %xmm0, %xmm1, %xmm2, %xmm0
-; AVX512-NEXT:    vcvtph2ps %xmm10, %xmm1
+; AVX512-NEXT:    vpcmpeqw %xmm2, %xmm1, %xmm2
+; AVX512-NEXT:    vpblendvb %xmm2, %xmm1, %xmm4, %xmm1
+; AVX512-NEXT:    vcvtph2ps %xmm9, %xmm2
 ; AVX512-NEXT:    xorl %eax, %eax
-; AVX512-NEXT:    vpxor %xmm2, %xmm2, %xmm2
-; AVX512-NEXT:    vucomiss %xmm2, %xmm1
+; AVX512-NEXT:    vpxor %xmm3, %xmm3, %xmm3
+; AVX512-NEXT:    vucomiss %xmm3, %xmm2
 ; AVX512-NEXT:    movl $65535, %ecx # imm = 0xFFFF
 ; AVX512-NEXT:    movl $0, %edx
 ; AVX512-NEXT:    cmovel %ecx, %edx
-; AVX512-NEXT:    vcvtph2ps %xmm9, %xmm1
-; AVX512-NEXT:    vucomiss %xmm2, %xmm1
+; AVX512-NEXT:    vcvtph2ps %xmm7, %xmm2
+; AVX512-NEXT:    vucomiss %xmm3, %xmm2
 ; AVX512-NEXT:    movl $0, %esi
 ; AVX512-NEXT:    cmovel %ecx, %esi
-; AVX512-NEXT:    vcvtph2ps %xmm13, %xmm1
-; AVX512-NEXT:    vucomiss %xmm2, %xmm1
+; AVX512-NEXT:    vcvtph2ps %xmm13, %xmm2
+; AVX512-NEXT:    vucomiss %xmm3, %xmm2
 ; AVX512-NEXT:    movl $0, %edi
 ; AVX512-NEXT:    cmovel %ecx, %edi
-; AVX512-NEXT:    vcvtph2ps %xmm14, %xmm1
-; AVX512-NEXT:    vucomiss %xmm2, %xmm1
+; AVX512-NEXT:    vcvtph2ps %xmm14, %xmm2
+; AVX512-NEXT:    vucomiss %xmm3, %xmm2
 ; AVX512-NEXT:    movl $0, %r8d
 ; AVX512-NEXT:    cmovel %ecx, %r8d
-; AVX512-NEXT:    vcvtph2ps {{[-0-9]+}}(%r{{[sb]}}p), %xmm1 # 16-byte Folded Reload
-; AVX512-NEXT:    vucomiss %xmm2, %xmm1
+; AVX512-NEXT:    vcvtph2ps {{[-0-9]+}}(%r{{[sb]}}p), %xmm2 # 16-byte Folded Reload
+; AVX512-NEXT:    vucomiss %xmm3, %xmm2
 ; AVX512-NEXT:    movl $0, %r9d
 ; AVX512-NEXT:    cmovel %ecx, %r9d
-; AVX512-NEXT:    vcvtph2ps {{[-0-9]+}}(%r{{[sb]}}p), %xmm1 # 16-byte Folded Reload
-; AVX512-NEXT:    vucomiss %xmm2, %xmm1
+; AVX512-NEXT:    vcvtph2ps {{[-0-9]+}}(%r{{[sb]}}p), %xmm2 # 16-byte Folded Reload
+; AVX512-NEXT:    vucomiss %xmm3, %xmm2
 ; AVX512-NEXT:    movl $0, %r10d
 ; AVX512-NEXT:    cmovel %ecx, %r10d
-; AVX512-NEXT:    vcvtph2ps {{[-0-9]+}}(%r{{[sb]}}p), %xmm1 # 16-byte Folded Reload
-; AVX512-NEXT:    vucomiss %xmm2, %xmm1
+; AVX512-NEXT:    vcvtph2ps {{[-0-9]+}}(%r{{[sb]}}p), %xmm2 # 16-byte Folded Reload
+; AVX512-NEXT:    vucomiss %xmm3, %xmm2
 ; AVX512-NEXT:    movl $0, %r11d
 ; AVX512-NEXT:    cmovel %ecx, %r11d
-; AVX512-NEXT:    vcvtph2ps {{[-0-9]+}}(%r{{[sb]}}p), %xmm1 # 16-byte Folded Reload
-; AVX512-NEXT:    vucomiss %xmm2, %xmm1
-; AVX512-NEXT:    vmovd %esi, %xmm1
-; AVX512-NEXT:    vpinsrw $1, %edx, %xmm1, %xmm1
-; AVX512-NEXT:    vpinsrw $2, %edi, %xmm1, %xmm1
-; AVX512-NEXT:    vpinsrw $3, %r8d, %xmm1, %xmm1
-; AVX512-NEXT:    vpinsrw $4, %r9d, %xmm1, %xmm1
-; AVX512-NEXT:    vpinsrw $5, %r10d, %xmm1, %xmm1
-; AVX512-NEXT:    vpinsrw $6, %r11d, %xmm1, %xmm1
+; AVX512-NEXT:    vcvtph2ps {{[-0-9]+}}(%r{{[sb]}}p), %xmm2 # 16-byte Folded Reload
+; AVX512-NEXT:    vucomiss %xmm3, %xmm2
+; AVX512-NEXT:    vmovd %esi, %xmm2
+; AVX512-NEXT:    vpinsrw $1, %edx, %xmm2, %xmm2
+; AVX512-NEXT:    vpinsrw $2, %edi, %xmm2, %xmm2
+; AVX512-NEXT:    vpinsrw $3, %r8d, %xmm2, %xmm2
+; AVX512-NEXT:    vpinsrw $4, %r9d, %xmm2, %xmm2
+; AVX512-NEXT:    vpinsrw $5, %r10d, %xmm2, %xmm2
+; AVX512-NEXT:    vpinsrw $6, %r11d, %xmm2, %xmm2
 ; AVX512-NEXT:    cmovel %ecx, %eax
-; AVX512-NEXT:    vpinsrw $7, %eax, %xmm1, %xmm1
-; AVX512-NEXT:    vpblendvb %xmm1, %xmm0, %xmm6, %xmm0
+; AVX512-NEXT:    vpinsrw $7, %eax, %xmm2, %xmm2
+; AVX512-NEXT:    vpblendvb %xmm2, %xmm1, %xmm0, %xmm0
 ; AVX512-NEXT:    addq $88, %rsp
 ; AVX512-NEXT:    retq
 ;
@@ -2204,19 +2204,19 @@ define <4 x half> @test_fmaximumnum_v4f16(<4 x half> %x, <4 x half> %y) nounwind
 ; X86-NEXT:    fstps {{[0-9]+}}(%esp)
 ; X86-NEXT:    fldt {{[-0-9]+}}(%e{{[sb]}}p) # 10-byte Folded Reload
 ; X86-NEXT:    fstps {{[0-9]+}}(%esp)
-; X86-NEXT:    vmovd {{.*#+}} xmm2 = mem[0],zero,zero,zero
-; X86-NEXT:    vmovd %xmm2, %eax
+; X86-NEXT:    vmovd {{.*#+}} xmm1 = mem[0],zero,zero,zero
+; X86-NEXT:    vmovd %xmm1, %eax
 ; X86-NEXT:    vmovd {{.*#+}} xmm0 = mem[0],zero,zero,zero
 ; X86-NEXT:    testl %eax, %eax
 ; X86-NEXT:    js .LBB33_1
 ; X86-NEXT:  # %bb.2:
-; X86-NEXT:    vmovdqa %xmm2, %xmm1
+; X86-NEXT:    vmovdqa %xmm1, %xmm2
 ; X86-NEXT:    jmp .LBB33_3
 ; X86-NEXT:  .LBB33_1:
-; X86-NEXT:    vmovdqa %xmm0, %xmm1
-; X86-NEXT:    vmovdqa %xmm2, %xmm0
+; X86-NEXT:    vmovdqa %xmm0, %xmm2
+; X86-NEXT:    vmovdqa %xmm1, %xmm0
 ; X86-NEXT:  .LBB33_3:
-; X86-NEXT:    vmaxss %xmm1, %xmm0, %xmm1
+; X86-NEXT:    vmaxss %xmm2, %xmm0, %xmm1
 ; X86-NEXT:    vcmpordss %xmm0, %xmm0, %xmm2
 ; X86-NEXT:    vblendvps %xmm2, %xmm0, %xmm1, %xmm0
 ; X86-NEXT:    vmovups %xmm0, {{[-0-9]+}}(%e{{[sb]}}p) # 16-byte Spill
@@ -2327,43 +2327,43 @@ define <4 x bfloat> @test_fmaximumnum_v4bf16(<4 x bfloat> %x, <4 x bfloat> %y) n
 ; SSE2-NEXT:    pushq %r14
 ; SSE2-NEXT:    pushq %rbx
 ; SSE2-NEXT:    subq $56, %rsp
-; SSE2-NEXT:    movdqa %xmm1, %xmm4
-; SSE2-NEXT:    movdqa %xmm0, %xmm5
 ; SSE2-NEXT:    pextrw $0, %xmm1, %r14d
 ; SSE2-NEXT:    pextrw $0, %xmm0, %r15d
-; SSE2-NEXT:    movdqa %xmm1, %xmm0
-; SSE2-NEXT:    psrld $16, %xmm0
-; SSE2-NEXT:    pextrw $0, %xmm0, %eax
-; SSE2-NEXT:    movdqa %xmm5, %xmm0
-; SSE2-NEXT:    psrld $16, %xmm0
-; SSE2-NEXT:    pextrw $0, %xmm0, %ecx
+; SSE2-NEXT:    movdqa %xmm1, %xmm2
+; SSE2-NEXT:    psrld $16, %xmm2
+; SSE2-NEXT:    pextrw $0, %xmm2, %eax
+; SSE2-NEXT:    movdqa %xmm0, %xmm2
+; SSE2-NEXT:    psrld $16, %xmm2
+; SSE2-NEXT:    pextrw $0, %xmm2, %ecx
 ; SSE2-NEXT:    shll $16, %ecx
 ; SSE2-NEXT:    movd %ecx, %xmm3
 ; SSE2-NEXT:    shll $16, %eax
 ; SSE2-NEXT:    movd %eax, %xmm2
 ; SSE2-NEXT:    testl %ecx, %ecx
-; SSE2-NEXT:    movdqa %xmm3, %xmm1
+; SSE2-NEXT:    movdqa %xmm3, %xmm7
 ; SSE2-NEXT:    js .LBB34_2
 ; SSE2-NEXT:  # %bb.1:
-; SSE2-NEXT:    movdqa %xmm2, %xmm1
+; SSE2-NEXT:    movdqa %xmm2, %xmm7
 ; SSE2-NEXT:  .LBB34_2:
-; SSE2-NEXT:    movdqa %xmm4, {{[-0-9]+}}(%r{{[sb]}}p) # 16-byte Spill
-; SSE2-NEXT:    shufps {{.*#+}} xmm4 = xmm4[1,1,1,1]
-; SSE2-NEXT:    movdqa %xmm5, (%rsp) # 16-byte Spill
-; SSE2-NEXT:    shufps {{.*#+}} xmm5 = xmm5[1,1,1,1]
-; SSE2-NEXT:    movdqa %xmm1, %xmm0
-; SSE2-NEXT:    cmpordss %xmm1, %xmm0
-; SSE2-NEXT:    movaps %xmm0, %xmm6
-; SSE2-NEXT:    andps %xmm1, %xmm6
+; SSE2-NEXT:    movdqa %xmm1, {{[-0-9]+}}(%r{{[sb]}}p) # 16-byte Spill
+; SSE2-NEXT:    movdqa %xmm1, %xmm5
+; SSE2-NEXT:    shufps {{.*#+}} xmm5 = xmm5[1,1],xmm1[1,1]
+; SSE2-NEXT:    movdqa %xmm0, (%rsp) # 16-byte Spill
+; SSE2-NEXT:    movdqa %xmm0, %xmm6
+; SSE2-NEXT:    shufps {{.*#+}} xmm6 = xmm6[1,1],xmm0[1,1]
+; SSE2-NEXT:    movdqa %xmm7, %xmm0
+; SSE2-NEXT:    cmpordss %xmm7, %xmm0
+; SSE2-NEXT:    movaps %xmm0, %xmm4
+; SSE2-NEXT:    andps %xmm7, %xmm4
 ; SSE2-NEXT:    js .LBB34_4
 ; SSE2-NEXT:  # %bb.3:
 ; SSE2-NEXT:    movdqa %xmm3, %xmm2
 ; SSE2-NEXT:  .LBB34_4:
-; SSE2-NEXT:    pextrw $0, %xmm4, %ebp
-; SSE2-NEXT:    pextrw $0, %xmm5, %ebx
-; SSE2-NEXT:    maxss %xmm2, %xmm1
-; SSE2-NEXT:    andnps %xmm1, %xmm0
-; SSE2-NEXT:    orps %xmm6, %xmm0
+; SSE2-NEXT:    pextrw $0, %xmm5, %ebp
+; SSE2-NEXT:    pextrw $0, %xmm6, %ebx
+; SSE2-NEXT:    maxss %xmm2, %xmm7
+; SSE2-NEXT:    andnps %xmm7, %xmm0
+; SSE2-NEXT:    orps %xmm4, %xmm0
 ; SSE2-NEXT:    callq __truncsfbf2 at PLT
 ; SSE2-NEXT:    movaps %xmm0, {{[-0-9]+}}(%r{{[sb]}}p) # 16-byte Spill
 ; SSE2-NEXT:    shll $16, %r15d
@@ -2467,7 +2467,7 @@ define <4 x bfloat> @test_fmaximumnum_v4bf16(<4 x bfloat> %x, <4 x bfloat> %y) n
 ; AVX1-NEXT:    vmovshdup {{.*#+}} xmm4 = xmm0[1,1,3,3]
 ; AVX1-NEXT:    vpextrw $0, %xmm4, %ebx
 ; AVX1-NEXT:    vmovshdup {{.*#+}} xmm4 = xmm1[1,1,3,3]
-; AVX1-NEXT:    vpextrw $0, %xmm4, %ebp
+; AVX1-NEXT:    vpextrw $0, %xmm4, %r14d
 ; AVX1-NEXT:    vpextrw $0, %xmm0, %r12d
 ; AVX1-NEXT:    vpextrw $0, %xmm1, %r13d
 ; AVX1-NEXT:    vpsrld $16, %xmm0, %xmm0
@@ -2486,7 +2486,7 @@ define <4 x bfloat> @test_fmaximumnum_v4bf16(<4 x bfloat> %x, <4 x bfloat> %y) n
 ; AVX1-NEXT:    vmovdqa %xmm0, %xmm1
 ; AVX1-NEXT:    vmovdqa %xmm4, %xmm0
 ; AVX1-NEXT:  .LBB34_3:
-; AVX1-NEXT:    vpextrw $0, %xmm2, %r14d
+; AVX1-NEXT:    vpextrw $0, %xmm2, %ebp
 ; AVX1-NEXT:    vpextrw $0, %xmm3, %r15d
 ; AVX1-NEXT:    vmaxss %xmm1, %xmm0, %xmm1
 ; AVX1-NEXT:    vcmpordss %xmm0, %xmm0, %xmm2
@@ -2510,8 +2510,8 @@ define <4 x bfloat> @test_fmaximumnum_v4bf16(<4 x bfloat> %x, <4 x bfloat> %y) n
 ; AVX1-NEXT:    vblendvps %xmm2, %xmm0, %xmm1, %xmm0
 ; AVX1-NEXT:    callq __truncsfbf2 at PLT
 ; AVX1-NEXT:    vmovaps %xmm0, {{[-0-9]+}}(%r{{[sb]}}p) # 16-byte Spill
-; AVX1-NEXT:    shll $16, %ebp
-; AVX1-NEXT:    vmovd %ebp, %xmm0
+; AVX1-NEXT:    shll $16, %r14d
+; AVX1-NEXT:    vmovd %r14d, %xmm0
 ; AVX1-NEXT:    shll $16, %ebx
 ; AVX1-NEXT:    vmovd %ebx, %xmm2
 ; AVX1-NEXT:    js .LBB34_7
@@ -2529,8 +2529,8 @@ define <4 x bfloat> @test_fmaximumnum_v4bf16(<4 x bfloat> %x, <4 x bfloat> %y) n
 ; AVX1-NEXT:    vmovaps %xmm0, (%rsp) # 16-byte Spill
 ; AVX1-NEXT:    shll $16, %r15d
 ; AVX1-NEXT:    vmovd %r15d, %xmm0
-; AVX1-NEXT:    shll $16, %r14d
-; AVX1-NEXT:    vmovd %r14d, %xmm2
+; AVX1-NEXT:    shll $16, %ebp
+; AVX1-NEXT:    vmovd %ebp, %xmm2
 ; AVX1-NEXT:    js .LBB34_10
 ; AVX1-NEXT:  # %bb.11:
 ; AVX1-NEXT:    vmovdqa %xmm2, %xmm1
diff --git a/llvm/test/CodeGen/X86/half.ll b/llvm/test/CodeGen/X86/half.ll
index 6e7f109a5da5c2..9ae4a64cfafabe 100644
--- a/llvm/test/CodeGen/X86/half.ll
+++ b/llvm/test/CodeGen/X86/half.ll
@@ -1603,15 +1603,15 @@ define <8 x half> @maxnum_v8f16(<8 x half> %0, <8 x half> %1) #0 {
 ; BWON-F16C-NEXT:    vcvtps2ph $4, %xmm4, %xmm2
 ; BWON-F16C-NEXT:    vmovd %xmm2, %ecx
 ; BWON-F16C-NEXT:    vpsrldq {{.*#+}} xmm2 = xmm1[10,11,12,13,14,15],zero,zero,zero,zero,zero,zero,zero,zero,zero,zero
-; BWON-F16C-NEXT:    vcvtph2ps %xmm2, %xmm3
-; BWON-F16C-NEXT:    vpsrldq {{.*#+}} xmm2 = xmm0[10,11,12,13,14,15],zero,zero,zero,zero,zero,zero,zero,zero,zero,zero
 ; BWON-F16C-NEXT:    vcvtph2ps %xmm2, %xmm2
-; BWON-F16C-NEXT:    vucomiss %xmm3, %xmm2
+; BWON-F16C-NEXT:    vpsrldq {{.*#+}} xmm3 = xmm0[10,11,12,13,14,15],zero,zero,zero,zero,zero,zero,zero,zero,zero,zero
+; BWON-F16C-NEXT:    vcvtph2ps %xmm3, %xmm3
+; BWON-F16C-NEXT:    vucomiss %xmm2, %xmm3
 ; BWON-F16C-NEXT:    ja .LBB26_6
 ; BWON-F16C-NEXT:  # %bb.5:
-; BWON-F16C-NEXT:    vmovaps %xmm3, %xmm2
+; BWON-F16C-NEXT:    vmovaps %xmm2, %xmm3
 ; BWON-F16C-NEXT:  .LBB26_6:
-; BWON-F16C-NEXT:    vcvtps2ph $4, %xmm2, %xmm2
+; BWON-F16C-NEXT:    vcvtps2ph $4, %xmm3, %xmm2
 ; BWON-F16C-NEXT:    vmovd %xmm2, %edx
 ; BWON-F16C-NEXT:    vshufpd {{.*#+}} xmm2 = xmm1[1,0]
 ; BWON-F16C-NEXT:    vcvtph2ps %xmm2, %xmm3
diff --git a/llvm/test/CodeGen/X86/smax.ll b/llvm/test/CodeGen/X86/smax.ll
index 2d59422953eb3d..86891e964d96d3 100644
--- a/llvm/test/CodeGen/X86/smax.ll
+++ b/llvm/test/CodeGen/X86/smax.ll
@@ -642,11 +642,11 @@ define i16 @test_signbits_i16(i16 %a, i16 %b) nounwind {
 ;
 ; X86-LABEL: test_signbits_i16:
 ; X86:       # %bb.0:
-; X86-NEXT:    movsbl {{[0-9]+}}(%esp), %ecx
-; X86-NEXT:    movswl {{[0-9]+}}(%esp), %eax
-; X86-NEXT:    shrl $15, %eax
-; X86-NEXT:    cmpw %cx, %ax
-; X86-NEXT:    cmovlel %ecx, %eax
+; X86-NEXT:    movsbl {{[0-9]+}}(%esp), %eax
+; X86-NEXT:    movswl {{[0-9]+}}(%esp), %ecx
+; X86-NEXT:    shrl $15, %ecx
+; X86-NEXT:    cmpw %ax, %cx
+; X86-NEXT:    cmovgl %ecx, %eax
 ; X86-NEXT:    # kill: def $ax killed $ax killed $eax
 ; X86-NEXT:    retl
   %ax = ashr i16 %a, 15
diff --git a/llvm/test/CodeGen/X86/smin.ll b/llvm/test/CodeGen/X86/smin.ll
index bde61d5738ed5c..8907f6c4cd5989 100644
--- a/llvm/test/CodeGen/X86/smin.ll
+++ b/llvm/test/CodeGen/X86/smin.ll
@@ -643,11 +643,11 @@ define i16 @test_signbits_i16(i16 %a, i16 %b) nounwind {
 ;
 ; X86-LABEL: test_signbits_i16:
 ; X86:       # %bb.0:
-; X86-NEXT:    movsbl {{[0-9]+}}(%esp), %ecx
-; X86-NEXT:    movswl {{[0-9]+}}(%esp), %eax
-; X86-NEXT:    shrl $15, %eax
-; X86-NEXT:    cmpw %cx, %ax
-; X86-NEXT:    cmovgel %ecx, %eax
+; X86-NEXT:    movsbl {{[0-9]+}}(%esp), %eax
+; X86-NEXT:    movswl {{[0-9]+}}(%esp), %ecx
+; X86-NEXT:    shrl $15, %ecx
+; X86-NEXT:    cmpw %ax, %cx
+; X86-NEXT:    cmovll %ecx, %eax
 ; X86-NEXT:    # kill: def $ax killed $ax killed $eax
 ; X86-NEXT:    retl
   %ax = ashr i16 %a, 15
diff --git a/llvm/test/CodeGen/X86/test-shrink.ll b/llvm/test/CodeGen/X86/test-shrink.ll
index 6a7a0ad24ed8e9..03bba9c67455cb 100644
--- a/llvm/test/CodeGen/X86/test-shrink.ll
+++ b/llvm/test/CodeGen/X86/test-shrink.ll
@@ -546,7 +546,6 @@ define void @testw(i16 inreg %x) nounwind minsize {
 ; CHECK-WIN32-64-LABEL: testw:
 ; CHECK-WIN32-64:       # %bb.0:
 ; CHECK-WIN32-64-NEXT:    subq $40, %rsp
-; CHECK-WIN32-64-NEXT:    # kill: def $cx killed $cx def $ecx
 ; CHECK-WIN32-64-NEXT:    testw $2049, %cx # imm = 0x801
 ; CHECK-WIN32-64-NEXT:    jne .LBB12_2
 ; CHECK-WIN32-64-NEXT:  # %bb.1: # %yes
diff --git a/llvm/test/CodeGen/X86/umax.ll b/llvm/test/CodeGen/X86/umax.ll
index f0479aea1b82c8..f589d4a7b04a90 100644
--- a/llvm/test/CodeGen/X86/umax.ll
+++ b/llvm/test/CodeGen/X86/umax.ll
@@ -1237,11 +1237,11 @@ define i16 @test_signbits_i16(i16 %a, i16 %b) nounwind {
 ;
 ; X86-LABEL: test_signbits_i16:
 ; X86:       # %bb.0:
-; X86-NEXT:    movsbl {{[0-9]+}}(%esp), %ecx
-; X86-NEXT:    movswl {{[0-9]+}}(%esp), %eax
-; X86-NEXT:    shrl $15, %eax
-; X86-NEXT:    cmpw %cx, %ax
-; X86-NEXT:    cmovbel %ecx, %eax
+; X86-NEXT:    movsbl {{[0-9]+}}(%esp), %eax
+; X86-NEXT:    movswl {{[0-9]+}}(%esp), %ecx
+; X86-NEXT:    shrl $15, %ecx
+; X86-NEXT:    cmpw %ax, %cx
+; X86-NEXT:    cmoval %ecx, %eax
 ; X86-NEXT:    # kill: def $ax killed $ax killed $eax
 ; X86-NEXT:    retl
   %ax = ashr i16 %a, 15
diff --git a/llvm/test/CodeGen/X86/umin.ll b/llvm/test/CodeGen/X86/umin.ll
index e4ce08966a8946..7a5cdbb9ce758e 100644
--- a/llvm/test/CodeGen/X86/umin.ll
+++ b/llvm/test/CodeGen/X86/umin.ll
@@ -652,11 +652,11 @@ define i16 @test_signbits_i16(i16 %a, i16 %b) nounwind {
 ;
 ; X86-LABEL: test_signbits_i16:
 ; X86:       # %bb.0:
-; X86-NEXT:    movsbl {{[0-9]+}}(%esp), %ecx
-; X86-NEXT:    movswl {{[0-9]+}}(%esp), %eax
-; X86-NEXT:    shrl $15, %eax
-; X86-NEXT:    cmpw %cx, %ax
-; X86-NEXT:    cmovael %ecx, %eax
+; X86-NEXT:    movsbl {{[0-9]+}}(%esp), %eax
+; X86-NEXT:    movswl {{[0-9]+}}(%esp), %ecx
+; X86-NEXT:    shrl $15, %ecx
+; X86-NEXT:    cmpw %ax, %cx
+; X86-NEXT:    cmovbl %ecx, %eax
 ; X86-NEXT:    # kill: def $ax killed $ax killed $eax
 ; X86-NEXT:    retl
   %ax = ashr i16 %a, 15
diff --git a/llvm/test/CodeGen/X86/vector-compress.ll b/llvm/test/CodeGen/X86/vector-compress.ll
index 87a948a4f1f7ee..e88387a8b7c691 100644
--- a/llvm/test/CodeGen/X86/vector-compress.ll
+++ b/llvm/test/CodeGen/X86/vector-compress.ll
@@ -2390,25 +2390,24 @@ define <64 x i8> @test_compress_v64i8(<64 x i8> %vec, <64 x i1> %mask, <64 x i8>
 ; AVX512F-NEXT:    kshiftrw $13, %k1, %k1
 ; AVX512F-NEXT:    korw %k1, %k0, %k0
 ; AVX512F-NEXT:    movw $-9, %ax
-; AVX512F-NEXT:    kmovw %eax, %k7
-; AVX512F-NEXT:    kandw %k7, %k0, %k0
-; AVX512F-NEXT:    kmovw %k7, {{[-0-9]+}}(%r{{[sb]}}p) # 2-byte Spill
+; AVX512F-NEXT:    kmovw %eax, %k5
+; AVX512F-NEXT:    kandw %k5, %k0, %k0
 ; AVX512F-NEXT:    movzbl 376(%rbp), %eax
 ; AVX512F-NEXT:    kmovw %eax, %k1
 ; AVX512F-NEXT:    kshiftlw $15, %k1, %k1
 ; AVX512F-NEXT:    kshiftrw $12, %k1, %k1
 ; AVX512F-NEXT:    korw %k1, %k0, %k0
 ; AVX512F-NEXT:    movw $-17, %ax
-; AVX512F-NEXT:    kmovw %eax, %k5
-; AVX512F-NEXT:    kandw %k5, %k0, %k0
+; AVX512F-NEXT:    kmovw %eax, %k6
+; AVX512F-NEXT:    kandw %k6, %k0, %k0
 ; AVX512F-NEXT:    movzbl 384(%rbp), %eax
 ; AVX512F-NEXT:    kmovw %eax, %k1
 ; AVX512F-NEXT:    kshiftlw $15, %k1, %k1
 ; AVX512F-NEXT:    kshiftrw $11, %k1, %k1
 ; AVX512F-NEXT:    korw %k1, %k0, %k0
 ; AVX512F-NEXT:    movw $-33, %ax
-; AVX512F-NEXT:    kmovw %eax, %k6
-; AVX512F-NEXT:    kandw %k6, %k0, %k0
+; AVX512F-NEXT:    kmovw %eax, %k7
+; AVX512F-NEXT:    kandw %k7, %k0, %k0
 ; AVX512F-NEXT:    movzbl 392(%rbp), %eax
 ; AVX512F-NEXT:    kmovw %eax, %k1
 ; AVX512F-NEXT:    kshiftlw $15, %k1, %k1
@@ -2516,26 +2515,26 @@ define <64 x i8> @test_compress_v64i8(<64 x i8> %vec, <64 x i1> %mask, <64 x i8>
 ; AVX512F-NEXT:    kshiftlw $15, %k2, %k2
 ; AVX512F-NEXT:    kshiftrw $13, %k2, %k2
 ; AVX512F-NEXT:    korw %k2, %k1, %k1
-; AVX512F-NEXT:    kandw %k7, %k1, %k1
+; AVX512F-NEXT:    kandw %k5, %k1, %k1
 ; AVX512F-NEXT:    movzbl 248(%rbp), %eax
 ; AVX512F-NEXT:    kmovw %eax, %k2
 ; AVX512F-NEXT:    kshiftlw $15, %k2, %k2
 ; AVX512F-NEXT:    kshiftrw $12, %k2, %k2
 ; AVX512F-NEXT:    korw %k2, %k1, %k1
-; AVX512F-NEXT:    kandw %k5, %k1, %k1
+; AVX512F-NEXT:    kandw %k6, %k1, %k1
 ; AVX512F-NEXT:    movzbl 256(%rbp), %eax
 ; AVX512F-NEXT:    kmovw %eax, %k2
 ; AVX512F-NEXT:    kshiftlw $15, %k2, %k2
 ; AVX512F-NEXT:    kshiftrw $11, %k2, %k2
 ; AVX512F-NEXT:    korw %k2, %k1, %k1
-; AVX512F-NEXT:    kandw %k6, %k1, %k1
+; AVX512F-NEXT:    kandw %k7, %k1, %k1
 ; AVX512F-NEXT:    movzbl 264(%rbp), %eax
 ; AVX512F-NEXT:    kmovw %eax, %k2
 ; AVX512F-NEXT:    kshiftlw $15, %k2, %k2
 ; AVX512F-NEXT:    kshiftrw $10, %k2, %k2
 ; AVX512F-NEXT:    korw %k2, %k1, %k1
-; AVX512F-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k7 # 2-byte Reload
-; AVX512F-NEXT:    kandw %k7, %k1, %k1
+; AVX512F-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k0 # 2-byte Reload
+; AVX512F-NEXT:    kandw %k0, %k1, %k1
 ; AVX512F-NEXT:    movzbl 272(%rbp), %eax
 ; AVX512F-NEXT:    kmovw %eax, %k2
 ; AVX512F-NEXT:    kshiftlw $15, %k2, %k2
@@ -2626,27 +2625,31 @@ define <64 x i8> @test_compress_v64i8(<64 x i8> %vec, <64 x i1> %mask, <64 x i8>
 ; AVX512F-NEXT:    kshiftlw $15, %k2, %k2
 ; AVX512F-NEXT:    kshiftrw $13, %k2, %k2
 ; AVX512F-NEXT:    korw %k2, %k1, %k1
-; AVX512F-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k4 # 2-byte Reload
-; AVX512F-NEXT:    kandw %k4, %k1, %k1
+; AVX512F-NEXT:    kmovw %k5, %k4
+; AVX512F-NEXT:    kmovw %k5, {{[-0-9]+}}(%r{{[sb]}}p) # 2-byte Spill
+; AVX512F-NEXT:    kandw %k5, %k1, %k1
 ; AVX512F-NEXT:    movzbl 120(%rbp), %eax
 ; AVX512F-NEXT:    kmovw %eax, %k2
 ; AVX512F-NEXT:    kshiftlw $15, %k2, %k2
 ; AVX512F-NEXT:    kshiftrw $12, %k2, %k2
 ; AVX512F-NEXT:    korw %k2, %k1, %k1
-; AVX512F-NEXT:    kmovw %k5, {{[-0-9]+}}(%r{{[sb]}}p) # 2-byte Spill
-; AVX512F-NEXT:    kandw %k5, %k1, %k1
+; AVX512F-NEXT:    kmovw %k6, %k5
+; AVX512F-NEXT:    kmovw %k6, {{[-0-9]+}}(%r{{[sb]}}p) # 2-byte Spill
+; AVX512F-NEXT:    kandw %k6, %k1, %k1
 ; AVX512F-NEXT:    movzbl 128(%rbp), %eax
 ; AVX512F-NEXT:    kmovw %eax, %k2
 ; AVX512F-NEXT:    kshiftlw $15, %k2, %k2
 ; AVX512F-NEXT:    kshiftrw $11, %k2, %k2
 ; AVX512F-NEXT:    korw %k2, %k1, %k1
-; AVX512F-NEXT:    kmovw %k6, {{[-0-9]+}}(%r{{[sb]}}p) # 2-byte Spill
-; AVX512F-NEXT:    kandw %k6, %k1, %k1
+; AVX512F-NEXT:    kmovw %k7, %k6
+; AVX512F-NEXT:    kmovw %k7, {{[-0-9]+}}(%r{{[sb]}}p) # 2-byte Spill
+; AVX512F-NEXT:    kandw %k7, %k1, %k1
 ; AVX512F-NEXT:    movzbl 136(%rbp), %eax
 ; AVX512F-NEXT:    kmovw %eax, %k2
 ; AVX512F-NEXT:    kshiftlw $15, %k2, %k2
 ; AVX512F-NEXT:    kshiftrw $10, %k2, %k2
 ; AVX512F-NEXT:    korw %k2, %k1, %k1
+; AVX512F-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k7 # 2-byte Reload
 ; AVX512F-NEXT:    kandw %k7, %k1, %k1
 ; AVX512F-NEXT:    movzbl 144(%rbp), %eax
 ; AVX512F-NEXT:    kmovw %eax, %k2
@@ -2667,8 +2670,8 @@ define <64 x i8> @test_compress_v64i8(<64 x i8> %vec, <64 x i1> %mask, <64 x i8>
 ; AVX512F-NEXT:    kshiftlw $15, %k2, %k2
 ; AVX512F-NEXT:    kshiftrw $7, %k2, %k2
 ; AVX512F-NEXT:    korw %k2, %k1, %k1
-; AVX512F-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k7 # 2-byte Reload
-; AVX512F-NEXT:    kandw %k7, %k1, %k1
+; AVX512F-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k2 # 2-byte Reload
+; AVX512F-NEXT:    kandw %k2, %k1, %k1
 ; AVX512F-NEXT:    movzbl 168(%rbp), %eax
 ; AVX512F-NEXT:    kmovw %eax, %k2
 ; AVX512F-NEXT:    kshiftlw $15, %k2, %k2
@@ -2741,8 +2744,7 @@ define <64 x i8> @test_compress_v64i8(<64 x i8> %vec, <64 x i1> %mask, <64 x i8>
 ; AVX512F-NEXT:    kshiftlw $15, %k2, %k2
 ; AVX512F-NEXT:    kshiftrw $10, %k2, %k2
 ; AVX512F-NEXT:    korw %k2, %k1, %k1
-; AVX512F-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k2 # 2-byte Reload
-; AVX512F-NEXT:    kandw %k2, %k1, %k1
+; AVX512F-NEXT:    kandw %k7, %k1, %k1
 ; AVX512F-NEXT:    movzbl 16(%rbp), %eax
 ; AVX512F-NEXT:    kmovw %eax, %k2
 ; AVX512F-NEXT:    kshiftlw $15, %k2, %k2
@@ -2764,7 +2766,8 @@ define <64 x i8> @test_compress_v64i8(<64 x i8> %vec, <64 x i1> %mask, <64 x i8>
 ; AVX512F-NEXT:    kshiftlw $15, %k2, %k2
 ; AVX512F-NEXT:    kshiftrw $7, %k2, %k2
 ; AVX512F-NEXT:    korw %k2, %k1, %k1
-; AVX512F-NEXT:    kandw %k7, %k1, %k1
+; AVX512F-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k2 # 2-byte Reload
+; AVX512F-NEXT:    kandw %k2, %k1, %k1
 ; AVX512F-NEXT:    movzbl 40(%rbp), %eax
 ; AVX512F-NEXT:    kmovw %eax, %k2
 ; AVX512F-NEXT:    kshiftlw $15, %k2, %k2
@@ -3763,25 +3766,24 @@ define <64 x i32> @test_compress_large(<64 x i1> %mask, <64 x i32> %vec, <64 x i
 ; AVX512F-NEXT:    kshiftrw $13, %k1, %k1
 ; AVX512F-NEXT:    korw %k1, %k0, %k0
 ; AVX512F-NEXT:    movw $-9, %ax
-; AVX512F-NEXT:    kmovw %eax, %k7
-; AVX512F-NEXT:    kandw %k7, %k0, %k0
-; AVX512F-NEXT:    kmovw %k7, {{[-0-9]+}}(%r{{[sb]}}p) # 2-byte Spill
+; AVX512F-NEXT:    kmovw %eax, %k5
+; AVX512F-NEXT:    kandw %k5, %k0, %k0
 ; AVX512F-NEXT:    movzbl 376(%rbp), %eax
 ; AVX512F-NEXT:    kmovw %eax, %k1
 ; AVX512F-NEXT:    kshiftlw $15, %k1, %k1
 ; AVX512F-NEXT:    kshiftrw $12, %k1, %k1
 ; AVX512F-NEXT:    korw %k1, %k0, %k0
 ; AVX512F-NEXT:    movw $-17, %ax
-; AVX512F-NEXT:    kmovw %eax, %k5
-; AVX512F-NEXT:    kandw %k5, %k0, %k0
+; AVX512F-NEXT:    kmovw %eax, %k6
+; AVX512F-NEXT:    kandw %k6, %k0, %k0
 ; AVX512F-NEXT:    movzbl 384(%rbp), %eax
 ; AVX512F-NEXT:    kmovw %eax, %k1
 ; AVX512F-NEXT:    kshiftlw $15, %k1, %k1
 ; AVX512F-NEXT:    kshiftrw $11, %k1, %k1
 ; AVX512F-NEXT:    korw %k1, %k0, %k0
 ; AVX512F-NEXT:    movw $-33, %ax
-; AVX512F-NEXT:    kmovw %eax, %k6
-; AVX512F-NEXT:    kandw %k6, %k0, %k0
+; AVX512F-NEXT:    kmovw %eax, %k7
+; AVX512F-NEXT:    kandw %k7, %k0, %k0
 ; AVX512F-NEXT:    movzbl 392(%rbp), %eax
 ; AVX512F-NEXT:    kmovw %eax, %k1
 ; AVX512F-NEXT:    kshiftlw $15, %k1, %k1
@@ -3889,26 +3891,26 @@ define <64 x i32> @test_compress_large(<64 x i1> %mask, <64 x i32> %vec, <64 x i
 ; AVX512F-NEXT:    kshiftlw $15, %k2, %k2
 ; AVX512F-NEXT:    kshiftrw $13, %k2, %k2
 ; AVX512F-NEXT:    korw %k2, %k1, %k1
-; AVX512F-NEXT:    kandw %k7, %k1, %k1
+; AVX512F-NEXT:    kandw %k5, %k1, %k1
 ; AVX512F-NEXT:    movzbl 248(%rbp), %eax
 ; AVX512F-NEXT:    kmovw %eax, %k2
 ; AVX512F-NEXT:    kshiftlw $15, %k2, %k2
 ; AVX512F-NEXT:    kshiftrw $12, %k2, %k2
 ; AVX512F-NEXT:    korw %k2, %k1, %k1
-; AVX512F-NEXT:    kandw %k5, %k1, %k1
+; AVX512F-NEXT:    kandw %k6, %k1, %k1
 ; AVX512F-NEXT:    movzbl 256(%rbp), %eax
 ; AVX512F-NEXT:    kmovw %eax, %k2
 ; AVX512F-NEXT:    kshiftlw $15, %k2, %k2
 ; AVX512F-NEXT:    kshiftrw $11, %k2, %k2
 ; AVX512F-NEXT:    korw %k2, %k1, %k1
-; AVX512F-NEXT:    kandw %k6, %k1, %k1
+; AVX512F-NEXT:    kandw %k7, %k1, %k1
 ; AVX512F-NEXT:    movzbl 264(%rbp), %eax
 ; AVX512F-NEXT:    kmovw %eax, %k2
 ; AVX512F-NEXT:    kshiftlw $15, %k2, %k2
 ; AVX512F-NEXT:    kshiftrw $10, %k2, %k2
 ; AVX512F-NEXT:    korw %k2, %k1, %k1
-; AVX512F-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k7 # 2-byte Reload
-; AVX512F-NEXT:    kandw %k7, %k1, %k1
+; AVX512F-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k0 # 2-byte Reload
+; AVX512F-NEXT:    kandw %k0, %k1, %k1
 ; AVX512F-NEXT:    movzbl 272(%rbp), %eax
 ; AVX512F-NEXT:    kmovw %eax, %k2
 ; AVX512F-NEXT:    kshiftlw $15, %k2, %k2
@@ -3999,27 +4001,31 @@ define <64 x i32> @test_compress_large(<64 x i1> %mask, <64 x i32> %vec, <64 x i
 ; AVX512F-NEXT:    kshiftlw $15, %k2, %k2
 ; AVX512F-NEXT:    kshiftrw $13, %k2, %k2
 ; AVX512F-NEXT:    korw %k2, %k1, %k1
-; AVX512F-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k4 # 2-byte Reload
-; AVX512F-NEXT:    kandw %k4, %k1, %k1
+; AVX512F-NEXT:    kmovw %k5, %k4
+; AVX512F-NEXT:    kmovw %k5, {{[-0-9]+}}(%r{{[sb]}}p) # 2-byte Spill
+; AVX512F-NEXT:    kandw %k5, %k1, %k1
 ; AVX512F-NEXT:    movzbl 120(%rbp), %eax
 ; AVX512F-NEXT:    kmovw %eax, %k2
 ; AVX512F-NEXT:    kshiftlw $15, %k2, %k2
 ; AVX512F-NEXT:    kshiftrw $12, %k2, %k2
 ; AVX512F-NEXT:    korw %k2, %k1, %k1
-; AVX512F-NEXT:    kmovw %k5, {{[-0-9]+}}(%r{{[sb]}}p) # 2-byte Spill
-; AVX512F-NEXT:    kandw %k5, %k1, %k1
+; AVX512F-NEXT:    kmovw %k6, %k5
+; AVX512F-NEXT:    kmovw %k6, {{[-0-9]+}}(%r{{[sb]}}p) # 2-byte Spill
+; AVX512F-NEXT:    kandw %k6, %k1, %k1
 ; AVX512F-NEXT:    movzbl 128(%rbp), %eax
 ; AVX512F-NEXT:    kmovw %eax, %k2
 ; AVX512F-NEXT:    kshiftlw $15, %k2, %k2
 ; AVX512F-NEXT:    kshiftrw $11, %k2, %k2
 ; AVX512F-NEXT:    korw %k2, %k1, %k1
-; AVX512F-NEXT:    kmovw %k6, {{[-0-9]+}}(%r{{[sb]}}p) # 2-byte Spill
-; AVX512F-NEXT:    kandw %k6, %k1, %k1
+; AVX512F-NEXT:    kmovw %k7, %k6
+; AVX512F-NEXT:    kmovw %k7, {{[-0-9]+}}(%r{{[sb]}}p) # 2-byte Spill
+; AVX512F-NEXT:    kandw %k7, %k1, %k1
 ; AVX512F-NEXT:    movzbl 136(%rbp), %eax
 ; AVX512F-NEXT:    kmovw %eax, %k2
 ; AVX512F-NEXT:    kshiftlw $15, %k2, %k2
 ; AVX512F-NEXT:    kshiftrw $10, %k2, %k2
 ; AVX512F-NEXT:    korw %k2, %k1, %k1
+; AVX512F-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k7 # 2-byte Reload
 ; AVX512F-NEXT:    kandw %k7, %k1, %k1
 ; AVX512F-NEXT:    movzbl 144(%rbp), %eax
 ; AVX512F-NEXT:    kmovw %eax, %k2
@@ -4040,8 +4046,8 @@ define <64 x i32> @test_compress_large(<64 x i1> %mask, <64 x i32> %vec, <64 x i
 ; AVX512F-NEXT:    kshiftlw $15, %k2, %k2
 ; AVX512F-NEXT:    kshiftrw $7, %k2, %k2
 ; AVX512F-NEXT:    korw %k2, %k1, %k1
-; AVX512F-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k7 # 2-byte Reload
-; AVX512F-NEXT:    kandw %k7, %k1, %k1
+; AVX512F-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k2 # 2-byte Reload
+; AVX512F-NEXT:    kandw %k2, %k1, %k1
 ; AVX512F-NEXT:    movzbl 168(%rbp), %eax
 ; AVX512F-NEXT:    kmovw %eax, %k2
 ; AVX512F-NEXT:    kshiftlw $15, %k2, %k2
@@ -4114,8 +4120,7 @@ define <64 x i32> @test_compress_large(<64 x i1> %mask, <64 x i32> %vec, <64 x i
 ; AVX512F-NEXT:    kshiftlw $15, %k2, %k2
 ; AVX512F-NEXT:    kshiftrw $10, %k2, %k2
 ; AVX512F-NEXT:    korw %k2, %k1, %k1
-; AVX512F-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k2 # 2-byte Reload
-; AVX512F-NEXT:    kandw %k2, %k1, %k1
+; AVX512F-NEXT:    kandw %k7, %k1, %k1
 ; AVX512F-NEXT:    movzbl 16(%rbp), %eax
 ; AVX512F-NEXT:    kmovw %eax, %k2
 ; AVX512F-NEXT:    kshiftlw $15, %k2, %k2
@@ -4137,7 +4142,8 @@ define <64 x i32> @test_compress_large(<64 x i1> %mask, <64 x i32> %vec, <64 x i
 ; AVX512F-NEXT:    kshiftlw $15, %k2, %k2
 ; AVX512F-NEXT:    kshiftrw $7, %k2, %k2
 ; AVX512F-NEXT:    korw %k2, %k1, %k1
-; AVX512F-NEXT:    kandw %k7, %k1, %k1
+; AVX512F-NEXT:    kmovw {{[-0-9]+}}(%r{{[sb]}}p), %k2 # 2-byte Reload
+; AVX512F-NEXT:    kandw %k2, %k1, %k1
 ; AVX512F-NEXT:    movzbl 40(%rbp), %eax
 ; AVX512F-NEXT:    kmovw %eax, %k2
 ; AVX512F-NEXT:    kshiftlw $15, %k2, %k2
diff --git a/llvm/test/CodeGen/X86/vector-fshl-256.ll b/llvm/test/CodeGen/X86/vector-fshl-256.ll
index 193e570c5f9a87..32ad72b2aa56a8 100644
--- a/llvm/test/CodeGen/X86/vector-fshl-256.ll
+++ b/llvm/test/CodeGen/X86/vector-fshl-256.ll
@@ -1312,7 +1312,6 @@ define void @fancierRotate2(ptr %arr, ptr %control, i32 %rot0, i32 %rot1) {
 ; AVX1-NEXT:    addq $8, %rax
 ; AVX1-NEXT:    jne .LBB8_1
 ; AVX1-NEXT:  # %bb.2: # %exit
-; AVX1-NEXT:    vzeroupper
 ; AVX1-NEXT:    retq
 ;
 ; AVX2-LABEL: fancierRotate2:
diff --git a/llvm/test/CodeGen/X86/wide-scalar-shift-legalization.ll b/llvm/test/CodeGen/X86/wide-scalar-shift-legalization.ll
index 43d2a997c81d21..338e104fbe8f03 100644
--- a/llvm/test/CodeGen/X86/wide-scalar-shift-legalization.ll
+++ b/llvm/test/CodeGen/X86/wide-scalar-shift-legalization.ll
@@ -3296,27 +3296,27 @@ define void @lshr_64bytes(ptr %src.ptr, ptr %bitOff.ptr, ptr %dst) nounwind {
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    andl $63, %ecx
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    shrl $3, %eax
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    andl $56, %eax
-; X64-HAVE-BMI2-NO-SHLD-NEXT:    movq -120(%rsp,%rax), %rdi
-; X64-HAVE-BMI2-NO-SHLD-NEXT:    movq -112(%rsp,%rax), %r8
-; X64-HAVE-BMI2-NO-SHLD-NEXT:    shrxq %rcx, %rdi, %r15
+; X64-HAVE-BMI2-NO-SHLD-NEXT:    movq -120(%rsp,%rax), %r8
+; X64-HAVE-BMI2-NO-SHLD-NEXT:    movq -112(%rsp,%rax), %rdi
+; X64-HAVE-BMI2-NO-SHLD-NEXT:    shrxq %rcx, %r8, %r15
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    shrxq %rcx, -128(%rsp,%rax), %rbx
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    movl %ecx, %esi
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    movq -104(%rsp,%rax), %r9
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    shrxq %rcx, %r9, %r13
-; X64-HAVE-BMI2-NO-SHLD-NEXT:    shrxq %rcx, %r8, %r10
+; X64-HAVE-BMI2-NO-SHLD-NEXT:    shrxq %rcx, %rdi, %r10
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    movq -88(%rsp,%rax), %r11
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    shrxq %rcx, %r11, %r14
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    movl %ecx, %r12d
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    notl %r12d
-; X64-HAVE-BMI2-NO-SHLD-NEXT:    addq %r8, %r8
-; X64-HAVE-BMI2-NO-SHLD-NEXT:    shlxq %r12, %r8, %r8
-; X64-HAVE-BMI2-NO-SHLD-NEXT:    orq %r15, %r8
+; X64-HAVE-BMI2-NO-SHLD-NEXT:    addq %rdi, %rdi
+; X64-HAVE-BMI2-NO-SHLD-NEXT:    shlxq %r12, %rdi, %rdi
+; X64-HAVE-BMI2-NO-SHLD-NEXT:    orq %r15, %rdi
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    movq -96(%rsp,%rax), %r15
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    shrxq %rcx, %r15, %rbp
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    xorb $63, %sil
-; X64-HAVE-BMI2-NO-SHLD-NEXT:    addq %rdi, %rdi
-; X64-HAVE-BMI2-NO-SHLD-NEXT:    shlxq %rsi, %rdi, %rdi
-; X64-HAVE-BMI2-NO-SHLD-NEXT:    orq %rbx, %rdi
+; X64-HAVE-BMI2-NO-SHLD-NEXT:    addq %r8, %r8
+; X64-HAVE-BMI2-NO-SHLD-NEXT:    shlxq %rsi, %r8, %r8
+; X64-HAVE-BMI2-NO-SHLD-NEXT:    orq %rbx, %r8
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    leaq (%r15,%r15), %rbx
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    shlxq %r12, %rbx, %rbx
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    orq %r13, %rbx
@@ -3342,8 +3342,8 @@ define void @lshr_64bytes(ptr %src.ptr, ptr %bitOff.ptr, ptr %dst) nounwind {
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    movq %r10, 40(%rdx)
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    movq %r9, 16(%rdx)
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    movq %rbx, 24(%rdx)
-; X64-HAVE-BMI2-NO-SHLD-NEXT:    movq %rdi, (%rdx)
-; X64-HAVE-BMI2-NO-SHLD-NEXT:    movq %r8, 8(%rdx)
+; X64-HAVE-BMI2-NO-SHLD-NEXT:    movq %r8, (%rdx)
+; X64-HAVE-BMI2-NO-SHLD-NEXT:    movq %rdi, 8(%rdx)
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    addq $8, %rsp
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    popq %rbx
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    popq %r12
@@ -5571,27 +5571,27 @@ define void @ashr_64bytes(ptr %src.ptr, ptr %bitOff.ptr, ptr %dst) nounwind {
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    andl $63, %ecx
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    shrl $3, %eax
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    andl $56, %eax
-; X64-HAVE-BMI2-NO-SHLD-NEXT:    movq -120(%rsp,%rax), %rdi
-; X64-HAVE-BMI2-NO-SHLD-NEXT:    movq -112(%rsp,%rax), %r8
-; X64-HAVE-BMI2-NO-SHLD-NEXT:    shrxq %rcx, %rdi, %r15
+; X64-HAVE-BMI2-NO-SHLD-NEXT:    movq -120(%rsp,%rax), %r8
+; X64-HAVE-BMI2-NO-SHLD-NEXT:    movq -112(%rsp,%rax), %rdi
+; X64-HAVE-BMI2-NO-SHLD-NEXT:    shrxq %rcx, %r8, %r15
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    shrxq %rcx, -128(%rsp,%rax), %rbx
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    movl %ecx, %esi
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    movq -104(%rsp,%rax), %r9
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    shrxq %rcx, %r9, %r13
-; X64-HAVE-BMI2-NO-SHLD-NEXT:    shrxq %rcx, %r8, %r10
+; X64-HAVE-BMI2-NO-SHLD-NEXT:    shrxq %rcx, %rdi, %r10
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    movq -88(%rsp,%rax), %r11
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    shrxq %rcx, %r11, %r14
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    movl %ecx, %r12d
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    notl %r12d
-; X64-HAVE-BMI2-NO-SHLD-NEXT:    addq %r8, %r8
-; X64-HAVE-BMI2-NO-SHLD-NEXT:    shlxq %r12, %r8, %r8
-; X64-HAVE-BMI2-NO-SHLD-NEXT:    orq %r15, %r8
+; X64-HAVE-BMI2-NO-SHLD-NEXT:    addq %rdi, %rdi
+; X64-HAVE-BMI2-NO-SHLD-NEXT:    shlxq %r12, %rdi, %rdi
+; X64-HAVE-BMI2-NO-SHLD-NEXT:    orq %r15, %rdi
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    movq -96(%rsp,%rax), %r15
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    shrxq %rcx, %r15, %rbp
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    xorb $63, %sil
-; X64-HAVE-BMI2-NO-SHLD-NEXT:    addq %rdi, %rdi
-; X64-HAVE-BMI2-NO-SHLD-NEXT:    shlxq %rsi, %rdi, %rdi
-; X64-HAVE-BMI2-NO-SHLD-NEXT:    orq %rbx, %rdi
+; X64-HAVE-BMI2-NO-SHLD-NEXT:    addq %r8, %r8
+; X64-HAVE-BMI2-NO-SHLD-NEXT:    shlxq %rsi, %r8, %r8
+; X64-HAVE-BMI2-NO-SHLD-NEXT:    orq %rbx, %r8
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    leaq (%r15,%r15), %rbx
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    shlxq %r12, %rbx, %rbx
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    orq %r13, %rbx
@@ -5617,8 +5617,8 @@ define void @ashr_64bytes(ptr %src.ptr, ptr %bitOff.ptr, ptr %dst) nounwind {
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    movq %r10, 40(%rdx)
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    movq %r9, 16(%rdx)
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    movq %rbx, 24(%rdx)
-; X64-HAVE-BMI2-NO-SHLD-NEXT:    movq %rdi, (%rdx)
-; X64-HAVE-BMI2-NO-SHLD-NEXT:    movq %r8, 8(%rdx)
+; X64-HAVE-BMI2-NO-SHLD-NEXT:    movq %r8, (%rdx)
+; X64-HAVE-BMI2-NO-SHLD-NEXT:    movq %rdi, 8(%rdx)
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    addq $8, %rsp
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    popq %rbx
 ; X64-HAVE-BMI2-NO-SHLD-NEXT:    popq %r12
diff --git a/llvm/test/CodeGen/X86/widen-load-of-small-alloca-with-zero-upper-half.ll b/llvm/test/CodeGen/X86/widen-load-of-small-alloca-with-zero-upper-half.ll
index fbbf2a6c127a53..81c4d5d71084ca 100644
--- a/llvm/test/CodeGen/X86/widen-load-of-small-alloca-with-zero-upper-half.ll
+++ b/llvm/test/CodeGen/X86/widen-load-of-small-alloca-with-zero-upper-half.ll
@@ -1672,9 +1672,10 @@ define void @load_16byte_chunk_of_32byte_alloca_with_zero_upper_half(ptr %src, i
 ;
 ; X64-HAVE-BMI2-HAVE-SHLD-LABEL: load_16byte_chunk_of_32byte_alloca_with_zero_upper_half:
 ; X64-HAVE-BMI2-HAVE-SHLD:       # %bb.0:
+; X64-HAVE-BMI2-HAVE-SHLD-NEXT:    movq %rsi, %rcx
 ; X64-HAVE-BMI2-HAVE-SHLD-NEXT:    movups (%rdi), %xmm0
 ; X64-HAVE-BMI2-HAVE-SHLD-NEXT:    xorps %xmm1, %xmm1
-; X64-HAVE-BMI2-HAVE-SHLD-NEXT:    leal (,%rsi,8), %ecx
+; X64-HAVE-BMI2-HAVE-SHLD-NEXT:    shll $3, %ecx
 ; X64-HAVE-BMI2-HAVE-SHLD-NEXT:    movaps %xmm1, -{{[0-9]+}}(%rsp)
 ; X64-HAVE-BMI2-HAVE-SHLD-NEXT:    movaps %xmm1, -{{[0-9]+}}(%rsp)
 ; X64-HAVE-BMI2-HAVE-SHLD-NEXT:    movaps %xmm1, -{{[0-9]+}}(%rsp)
diff --git a/llvm/test/CodeGen/X86/widen-load-of-small-alloca.ll b/llvm/test/CodeGen/X86/widen-load-of-small-alloca.ll
index ff13f4ba577f2e..8d36eef952a2be 100644
--- a/llvm/test/CodeGen/X86/widen-load-of-small-alloca.ll
+++ b/llvm/test/CodeGen/X86/widen-load-of-small-alloca.ll
@@ -1946,9 +1946,10 @@ define void @load_16byte_chunk_of_32byte_alloca(ptr %src, i64 %byteOff, ptr %dst
 ;
 ; X64-HAVE-BMI2-HAVE-SHLD-LABEL: load_16byte_chunk_of_32byte_alloca:
 ; X64-HAVE-BMI2-HAVE-SHLD:       # %bb.0:
+; X64-HAVE-BMI2-HAVE-SHLD-NEXT:    movq %rsi, %rcx
 ; X64-HAVE-BMI2-HAVE-SHLD-NEXT:    movups (%rdi), %xmm0
 ; X64-HAVE-BMI2-HAVE-SHLD-NEXT:    movups 16(%rdi), %xmm1
-; X64-HAVE-BMI2-HAVE-SHLD-NEXT:    leal (,%rsi,8), %ecx
+; X64-HAVE-BMI2-HAVE-SHLD-NEXT:    shll $3, %ecx
 ; X64-HAVE-BMI2-HAVE-SHLD-NEXT:    xorps %xmm2, %xmm2
 ; X64-HAVE-BMI2-HAVE-SHLD-NEXT:    movaps %xmm2, -{{[0-9]+}}(%rsp)
 ; X64-HAVE-BMI2-HAVE-SHLD-NEXT:    movaps %xmm2, -{{[0-9]+}}(%rsp)


